JP2004080097A - Image-processing apparatus and image-processing method

Info

Publication number: JP2004080097A
Application number: JP2002233842A
Authority: JP
Inventors: Osamu Itokawa; 糸川　修
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-08-09
Filing date: 2002-08-09
Publication date: 2004-03-11

Abstract

PROBLEM TO BE SOLVED: To provide an image-processing apparatus and an image-processing method that perform normal processing on a system that supports multi-objects and, by securing compatibility, allow a system that does not support multi-objects to perform audio and video processing, i.e., the basic functions excluding the additional functions.

SOLUTION: An input data stream is separated into audio data and video data by a separating section 201. Object data and scene description data are embedded in the video data as digital watermarks, and these data are separated by an embedded-information separating section 214. The respective encoded data are decoded by an audio decoding section 203, a video decoding section 206, an object decoding section 209, and a scene description decoding section 212, and the decoded data are composed by a composing section 215. The audio data are output from an audio output section 216, and the video data are displayed on a display section 217.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、複数のオブジェクトで構成されるシーンの画像を符号化する画像処理装置及び画像処理方法、並びにその符号化画像を復号化する画像処理装置及び画像処理方法に関する。
【０００２】
【従来の技術】
動画像の符号化方式として、これまでに様々なものが、国際標準規格として制定されている。例えば、ＩＳＯ（Ｉｎｔｅｒｎａｔｉｏｎａｌ　Ｏｒｇａｎｉｚａｔｉｏｎ　ｆｏｒ　Ｓｔａｎｄａｒｄｉｚａｔｉｏｎ：国際標準化機構）のＭＰＥＧ（Ｍｏｖｉｎｇ　Ｐｉｃｔｕｒｅ　Ｃｏｄｉｎｇ　Ｅｘｐａｒｔｓ　Ｇｒｏｕｐ）−１、ＭＰＥＧ−２、ＭＰＥＧ−４等がよく知られている。
【０００３】
ＭＰＥＧ−４は、従来のビデオ、オーディオ信号の圧縮符号化に加えて、静止画像、コンピュータ・グラフィックス（ＣＧ）画像、分析合成系の音声符号化、ＭＩＤＩ（Ｍｕｓｉｃａｌ　Ｉｎｓｔｒｕｍｅｎｔ　Ｄａｔａ　Ｉｎｔｅｒｆａｃｅ）等による合成オーディオや、テキスト等も含めた総合マルチメディアの符号化規格である。そのために、ＭＰＥＧ−４では、符号化対象の一つ一つをオブジェクトと捉え、これらを合成したシーンを２次元のディスプレイ上に表示するための符号化フォーマットも規定している。
【０００４】
図１０は、オーディオ信号とビデオ信号を符号化する従来のマルチオブジェクト非対応の符号化システムの構成を示すブロック図である。図１０において、オーディオ入力部１００１は、アナログ信号をサンプリングし、デジタル信号に変換したり、ファイルからデータを読み込む等、符号化処理を行うための前処理を行う部分である。オーディオ符号化部１００２では、デジタル入力信号を符号化し、圧縮したデータをビットストリームとして出力する。オーディオパケット化部１００３は、多重化処理を行うためにビットストリームをパケット単位で分割する処理である。
【０００５】
一方、ビデオ入力部１００４は、カメラからの入力信号をデジタル信号に変換したり、ファイルから読み込んだ信号のフォーマット変換を行う等、符号化処理を行うための前処理を行う部分である。ビデオ符号化部１００５では、デジタル入力信号を符号化し、圧縮したデータをビットストリームとして出力する。ビデオパケット化部１００６は、多重化処理を行うためにビットストリームをパケット単位で分割する処理である。多重化部１００７では、パケット化されたオーディオデータとビデオデータとを１つのデータストリームにまとめる処理が行われる。
【０００６】
図１２は、従来例におけるマルチオブジェクト非対応の復号システムの構成を示すブロック図である。図１２において、分離部１２０１では、オーディオ信号とビデオ信号とが符号化されて多重化されたデータストリームをパケット単位で分離し、データ毎にオーディオ復号化バッファ１２０２、ビデオ復号化バッファ１２０６にそれぞれ出力する。そして、オーディオ復号化バッファ１２０２では、パケット化されているデータを復号化処理できるようなビットストリームに戻し、オーディオ復号化部１２０３でオーディオデータに復号する。そして、オーディオ合成メモリ１２０４で出力データを生成し、オーディオ出力部１２０５においてオーディオ出力を行う。
【０００７】
一方、ビデオ復号化バッファ１２０６でも同様に、パケット化されているデータを復号化処理できるようなビットストリームに戻し、ビデオ復号化部１２０７でビデオデータに復号する。ビデオ合成メモリ１２０８で出力データを生成し、表示部１２０９においてビデオ出力を行う。図９Ａは、マルチオブジェクト非対応の復号化システムの表示部１２０９で表示されるビデオ出力例を示す概要図である。
【０００８】
上述した従来の符号化・復号化システムの構成は、ＭＰＥＧ−１、２、４に共通して利用可能な形態であるが、ＭＰＥＧ−４では更に、先に説明したようなシーン合成が可能なシステム構成を取る事もできる。すなわち、シーンを構成する複数のオブジェクトとその関連を示す「シーン記述」も併せて符号化する。
【０００９】
図１１は、従来例におけるマルチオブジェクト対応の符号化システムの構成を示すブロック図である。図１１において、オーディオ入力部１１０１からオーディオパケット化部１１０３までのオーディオ信号処理は、図１０におけるオーディオ入力部１００１からオーディオパケット化部１００３までの処理と同様である。また、ビデオ入力部１１０４からビデオパケット化部１１０６までのビデオ信号処理は、図１０におけるビデオ入力部１００４からビデオパケット化部１００６までの処理と同様である。
【００１０】
図１１において、オブジェクト入力部１１０７は、上記各部とは別のデータの入力部である。オーディオ入力部１１０１におけるオーディオデータや、ビデオ入力部１１０４におけるビデオデータもそれぞれ一つのオブジェクトであるので、データの中身はオーディオでもビデオでもよく、その他にテキストデータやＣＧデータ等も扱うことができる。ここではＣＧデータを用いて作成したボタンをオブジェクトの例として説明する。
【００１１】
オブジェクト入力部１１０７で入力されたボタンデータは、オブジェクト符号化部１１０８においてデータ圧縮が行われ、オブジェクトパケット化部１１０９において多重化のためのパケット分割処理が行われる。一方、シーン記述入力部１１１０は、シーン構成を記述したデータを入力する部分である。具体的には、ボタンをビデオ画面上のどの位置に配置するか、ボタンをクリックしたときに、どのような動作をさせるようにするのか、といった記述がなされる。
【００１２】
ここで、ＭＰＥＧ−４におけるシーン記述フォーマットは、ＢＩＦＳ（Ｂｉｎａｒｙ　Ｆｏｒｍａｔ　ｆｏｒ　Ｓｃｅｎｅ）と呼ばれており、ＶＲＭＬ（Ｖｉｒｔｕａｌ　Ｒｅａｌｉｔｙ　Ｍｏｄｅｌｉｎｇ　Ｌａｎｇｕａｇｅ）を基にしている。シーン記述符号化部１１１１では、記述したシーンのデータをバイナリデータに変換する。そしてこのバイナリデータもシーン記述パケット化部１１１２においてパケット分割される。パケット化されたシーンのデータは、オーディオ、ビデオ及びオブジェクトデータとともに多重化部１１１３において一つのデータストリームにまとめられる。
【００１３】
次に、図１３は、従来例におけるマルチオブジェクト対応の復号化システムの構成を示すブロック図である。分離部１３０１の機能は、図１２の分離部１２０１と同様であるが、扱うデータが、従来のオーディオやビデオの他に、ＣＧで作成されるボタン画像やシーン記述情報等が含まれている。また、オーディオ、ビデオに関する復号化処理は、図１２を用いて上述した場合の処理と同様であり、オーディオ復号化バッファ１３０２からビデオ合成メモリ１３０７が、それぞれ、オーディオ復号化バッファ１２０２からオーディオ合成メモリ１２０４及びビデオ復号化バッファ１２０６からビデオ合成メモリ１２０８に対応している。
【００１４】
一方、オブジェクト復号化バッファ１３０８は、オブジェクトを復号化するためのビットストリームを再構成し、オブジェクト復号化部１３０９で復号化を行う。復号化されたデータは、オブジェクト合成メモリ１３１０に一旦保存される。また、シーン記述復号化バッファ１３１１では、シーン記述情報を復号化するためのビットストリームを再構成し、シーン記述復号化部１３１２で復号化を行う。復号化されたシーン記述情報は、シーン記述合成メモリ１３１３に一旦保存される。
【００１５】
そして、合成部１３１４では、シーン記述情報に基づいて、シーンの合成が行われる。この例におけるシーンは、図８に示すように、オーディオ、ビデオ、ボタンにより構成されているものとする。図８は、シーングラフの説明図であり、シーンはオーディオとビデオとボタンで構成される。合成部１３１４においてシーンの合成が完成すると、オーディオ出力部１３１５からオーディオの出力、表示部１３１６から合成したビデオの出力を行う。
【００１６】
図９Ｂは、マルチオブジェクト対応の復号化システムの表示部１３１６で表示されるビデオ出力例を示す概要図である。図９Ｂにおいて、ビデオ画像の左下にＣＧ画像の「Ｌｉｎｋ」というボタンが合成されている。例えば、この「Ｌｉｎｋ」ボタンをクリックしたときの動作として、インターネットブラウザを起動し、指定したＵＲＬへジャンプするようシーン記述がされている場合、関連情報を同時に表示することも可能である。
【００１７】
尚、ＭＰＥＧに関するアルゴリズムの詳細は、ＩＥインスティテュート社発行、大久保榮、川島正久監修の「Ｈ．３２３／ＭＰＥＧ−４教科書」等に記載されているので、これ以上の説明は省略する。
【００１８】
【発明が解決しようとする課題】
しかしながら、上述した図１０と図１２に示されるようなマルチオブジェクト非対応の符号化・復号化システムと、図１１と図１３に示すようなマルチオブジェクト対応の符号化・復号化システムとでは、互いに互換性がないという問題がある。すなわち、オーディオやビデオに共通のＭＰＥＧ−４の符号化・復号化方式を用いた場合であっても、システム構成がマルチオブジェクト対応になっていないので、マルチオブジェクトのデータの中からオーディオやビデオの部分のみを取り出し、再生するというような使い方をすることができない。従って、不特定多数の人に動画配信を行う場合には、マルチオブジェクト対応と非対応の両方の符号化データを用意しておかなければならず、大幅な作業効率の低下を招くこととなる。
【００１９】
本発明は、このような事情を考慮してなされたものであり、マルチオブジェクトに対応したシステムにおいては通常の処理を行い、マルチオブジェクトに非対応のシステムに対しては互換性を確保して、付加機能を除いた基本機能であるオーディオやビデオの処理をすることができる画像処理装置及び画像処理方法を提供することを目的とする。
【００２０】
【課題を解決するための手段】
上記課題を解決するために、本発明は、基本オブジェクトデータを入力する第１の入力手段と、前記基本オブジェクトデータを符号化する第１の符号化手段と、複数の基本オブジェクトデータが入力された場合、符号化された該複数の基本オブジェクトデータを多重化する多重化手段とを備える画像処理装置であって、所定の付加データを入力する第２の入力手段と、前記付加データを符号化する第２の符号化手段と、符号化された所定の基本オブジェクトデータに対して、符号化された前記付加データを電子透かしによって埋め込む合成手段とをさらに備えることを特徴とする。
【００２１】
また、本発明に係る画像処理装置は、前記付加データが、前記複数の基本オブジェクトデータで構成されるシーンに関して記述したシーン記述データであることを特徴とする。
【００２２】
さらに、本発明に係る画像処理装置は、所定の付加オブジェクトデータを入力する第３の入力手段と、前記付加オブジェクトデータを符号化する第３の符号化手段とをさらに備え、前記合成手段が、符号化された所定の基本オブジェクトデータに対して、符号化された前記付加データ及び符号化された前記付加オブジェクトデータを電子透かしによって埋め込むことを特徴とする。
【００２３】
さらにまた、本発明に係る画像処理装置は、符号化された複数の基本オブジェクトデータが多重化されたデータストリームを入力する入力手段と、前記データストリームを複数の符号化された基本オブジェクトデータに分離する分離手段と、分離された前記基本オブジェクトデータに、前記複数の基本オブジェクトデータで構成されるシーンに関して記述したシーン記述データが電子透かしによって埋め込まれているか否かを判定する判定手段と、前記基本オブジェクトデータに前記シーン記述データが埋め込まれている場合、符号化された前記基本オブジェクトデータから前記シーン記述データを抽出する抽出手段と、抽出された前記シーン記述データを復号化する第１の復号化手段と、前記シーン記述データが抽出された後の前記基本オブジェクトデータを復号化する第２の復号化手段と、復号化された前記シーン記述データに基づいて、復号化された前記複数の基本オブジェクトデータを合成する合成手段と、合成された結果を出力する出力手段とを備えることを特徴とする。
【００２４】
さらにまた、本発明に係る画像処理装置は、前記判定手段が、前記基本オブジェクトデータに前記シーン記述データが埋め込まれているか否かを判定することができない場合、又は、前記基本オブジェクトデータに前記シーン記述データが埋め込まれていない場合、符号化された前記基本オブジェクトデータを復号化する第３の復号化手段をさらに備え、前記合成手段が、復号化された前記複数の基本オブジェクトデータを合成することを特徴とする。
【００２５】
さらにまた、本発明に係る画像処理装置は、前記判定手段は、分離された前記基本オブジェクトデータに対してさらに付加オブジェクトデータが電子透かしによって埋め込まれているか否かを判定し、前記付加オブジェクトデータが埋め込まれている場合、前記抽出手段が、さらに前記付加オブジェクトデータを抽出し、前記合成手段が、復号化された前記シーン記述データに基づいて、復号化された前記複数の基本オブジェクトデータと復号化された前記付加オブジェクトデータとを合成することを特徴とする。
【００２６】
さらにまた、本発明に係る画像処理装置は、前記基本オブジェクトデータが、矩形形状のビデオデータ又はオーディオデータであることを特徴とする。
【００２７】
さらにまた、本発明に係る画像処理装置は、前記付加オブジェクトデータが、任意形状のビデオデータ、ＣＧデータ、オーディオデータ又はテキストデータのいずれかであることを特徴とする。
【００２８】
さらにまた、本発明に係る画像処理装置は、前記シーン記述データのフォーマットが、ＭＰＥＧ−４システムのＢＩＦＳに準拠することを特徴とする。
【００２９】
【発明の実施の形態】
以下、図面を参照して、本発明の好適な実施形態について詳細に説明する。
【００３０】
＜第１の実施形態＞
本発明の第１の実施形態について、図１、３、４、６を用いて説明する。
【００３１】
図１は、本発明の第１の実施形態におけるマルチオブジェクトに対応した符号化システムの構成を示すブロック図である。図１において、オーディオ入力部１０１からオーディオパケット化部１０３のオーディオデータの処理は、図１０に示したオーディオ入力部１００１からオーディオパケット化部１００３、若しくは、図１１に示したオーディオ入力部１１０１からオーディオパケット化部１１０３の処理と同様である。
【００３２】
また、ビデオ入力部１０４とビデオ符号化部１０５のビデオデータの処理については、図１０に示したビデオ入力部１００４とビデオ符号化部１００５及び図１１に示したビデオ入力部１１０４とビデオ符号化部１１０５の処理と同様である。
【００３３】
図１において、埋め込み情報合成部１１０は、符号化されたビデオデータに対して、オブジェクトの符号化データとシーン記述の符号化データを埋め込む処理を行う。尚、オブジェクトの符号化データは、オブジェクト入力部１０６とオブジェクト符号化部１０７を介して生成され、シーン記述符号化データは、シーン記述入力部１０８とシーン記述符号化部１０９を介して生成される。
【００３４】
この際、埋め込まれるデータは、ビデオデータの一部を加工することにより生成されるので、データの形式としてはビデオデータそのものとして扱うことができる。従って、埋め込み情報合成部１１０の出力データに対しては、従来のビデオデータと同様に、ビデオパケット化部１１１においてパケット分割の処理が行われる。また、多重化部１１２では、オーディオパケットデータとビデオパケットデータを多重化して、一つのストリームを作成する。尚、このストリームの構造は、図１０を用いて前述したものと同じ構成になる。
【００３５】
すなわち、本発明は、基本オブジェクトデータを入力する第１の入力手段（オーディオ入力部１０１、ビデオ入力部１０４）と、基本オブジェクトデータを符号化する第１の符号化手段（オーディオ符号化部１０２、ビデオ符号化部１０５）と、複数の基本オブジェクトデータが入力された場合、符号化された複数の基本オブジェクトデータを多重化する多重化手段（多重化部１１２）とを備える画像処理装置である。そして、所定の付加データを入力する第２の入力手段（シーン記述入力部１０８）と、付加データを符号化する第２の符号化手段（シーン記述符号化部１０９）と、符号化された所定の基本オブジェクトデータに対して、符号化された付加データを電子透かしによって埋め込む合成手段（埋め込み情報合成部１１０）とをさらに備え、付加データが、複数の基本オブジェクトデータで構成されるシーンに関して記述したシーン記述データであることを特徴とする。
【００３６】
さらに、本発明は、所定の付加オブジェクトデータを入力する第３の入力手段（オブジェクト入力部１０６）と、付加オブジェクトデータを符号化する第３の符号化手段（オブジェクト符号化部１０７）とをさらに備える。そして、合成手段（埋め込み情報合成部１１０）が、符号化された所定の基本オブジェクトデータに対して、符号化された付加データ及び符号化された付加オブジェクトデータを電子透かしによって埋め込むことを特徴とする。
【００３７】
図１に示した符号化システムをパーソナルコンピュータを使って実現する際のハードウェア構成を、図３を用いて説明する。図３は、本発明の第１の実施形態における符号化システムのハードウェア構成を示すブロック図である。図３において、中央演算装置（ＣＰＵ）３０１は、バス３０２を介して、さまざまな装置とデータの入出力を行う。メモリ３０３は、本システムの制御に必要なオペレーティングシステム（ＯＳ）や、アプリケーションソフトウェアの一時記憶領域である。端末３０４は、制御に関する入力インタフェースであり、キーボードやマウスがこれに相当する。
【００３８】
記憶装置３０５は、プログラムやデータを記憶しておく領域である。また、３０６はデータの入力装置であり、オーディオやビデオのデータを入力するインタフェースを備えている。３０７はモニタで、映像情報の出力装置である。３０８はスピーカで、オーディオの出力装置である。３０９は通信インタフェースであり、コンピュータ外部とのデータ入出力を行う装置である。
【００３９】
図４は、本発明の第１の実施形態におけるメモリ３０３の使用、格納領域を示す図である。図４に示すように、メモリ３０３の領域は、ＯＳの他に、プログラムの格納領域とデータの格納領域に分かれている。プログラムには、符号化ソフトウェアやデータ埋め込みのソフトウェア等がある。また、データに関しては、オーディオデータやビデオデータ、ボタンデータ、シーン記述データ等、扱うオブジェクトの数だけデータ領域が必要となる。また処理過程に必要なワークエリアも確保される。
【００４０】
ここで、埋め込み情報合成部１１０におけるデータの埋め込みには、電子透かしの技術を応用することができる。これは、画像データや音声データの再生時にデータの変化がないか、または変化が知覚できないレベルで情報データを埋め込む技術である。画像に対して電子透かしを埋め込む技術としては、特開平１０−２４３３９８号の「動画像エンコードプログラムを記録した記録媒体及び動画像エンコード装置」や特開平１１−３４１４５０号の「電子透かし埋め込み装置および抽出装置」等が開示されている。
【００４１】
また、オーディオデータに関しても同様であり、特開２００１−２０２０８９の「音声データに透かし情報を埋め込む方法、透かし情報埋め込み装置、透かし情報検出装置、透かし情報が埋め込まれた記録媒体、及び透かし情報を埋め込む方法を記録した記録媒体」や特開平１１−３１６５９９号の「電子透かし埋め込み装置、オーディオ符号化装置および記録媒体」等が開示されている。ここでは、追加するオブジェクトの符号化データとシーン記述に関するデータをビデオデータに埋め込む処理を説明するが、特にビデオデータに限定する必要はなく、オーディオデータに埋め込んでも、オーディオ、ビデオ両方のデータに埋め込んでも構わない。
【００４２】
図６は、本発明の第１の実施形態における埋め込み情報合成部１１０における埋め込み処理手順を詳細に説明するためのフローチャートである。まず、ＣＧデータで作成されたボタンオブジェクトの符号化を行い（ステップＳ６０１）、次に、シーン記述データ（ＢＩＦＳデータ）の符号化を行う（ステップＳ６０２）。これらの符号化データが、ビデオ符号化データに埋め込まれるデータとなる。そして、埋め込むデータのセットを行う（ステップＳ６０３）。尚、データ長はＬとする。
【００４３】
次いで、ビデオの符号化が全てのマクロブロックで終了しているかどうかを判定する（ステップＳ６０４）。その結果、符号化していないマクロブロックがある場合（Ｎｏ）、その中にデータを埋め込むことができる。すなわち、マクロブロックの符号化を行う（ステップＳ６０５）。そして、データ長Ｌが０（埋め込むデータが存在しない）かどうかを判断する（ステップＳ６０６）。その結果、埋め込むデータがある場合（Ｎｏ）、そのマクロブロックに１ビットデータを埋め込む（ステップＳ６０７）。
【００４４】
そして、１ビットデータを埋め込んだ後、残りのデータ長を１つデクリメントする（ステップＳ６０８）。このように１つのマクロブロックにつき、１ビットずつデータを埋め込んでいく操作を繰り返す。ここで、ステップＳ６０６において、埋め込むデータがなくなった（Ｌ＝０）と判断した場合（Ｙｅｓ）、通常のマクロブロック符号化を全マクロブロック終了まで繰り返す。また、ステップＳ６０４の終了判定で全マクロブロックの処理が終わったと判定された場合（Ｙｅｓ）、埋め込み完了である。但し、本実施形態ではステップＳ６０９とＳ６１０でエラー判定処理を加えている。これは、データの埋め込みが完了する前に、全マクロブロックが符号化されてしまったときのためのエラー処理である。
【００４５】
ここでは、埋め込み方法の一例として、ビデオのマクロブロック毎に１ビットのデータを埋め込んでいく方法について説明する。各マクロブロックの最も高周波の係数を±１の範囲で増減させて、奇数・偶数を透かしデータに従って意図的に符号化データを変更する。すなわち、埋め込むデータの１ビットが０であれば、最後の係数を偶数に、１であれば奇数にする。埋め込むマクロブロックのＥＯＢの前の符号を読み出し、必要であればこれを変更する。
【００４６】
例えば、直前の符号が０ラン長が８で値が３であった時、埋め込む値が０であれば、ラン長が８で値が４の符号に置換する。実際の符号では、「１１１１１１１１１１１０１１０１１０」を「１１１１１１１１１１１０１１０１１１」に置換する。また、値が１であれば何もしない。尚、本発明の適用は上述の方法だけに限定されず、特開平１１−３４１４５２号の「動画像電子透かしシステム」に記載されている方法等を使用しても良い。
【００４７】
上述したように、本発明では、基本オブジェクトデータが、矩形形状のビデオデータ又はオーディオデータであることを特徴とする。また、付加オブジェクトデータが、任意形状のビデオデータ、ＣＧデータ、オーディオデータ又はテキストデータのいずれかであることを特徴とする。さらに、シーン記述データのフォーマットが、ＭＰＥＧ−４システムのＢＩＦＳに準拠することを特徴とする。
【００４８】
以上説明したように、本発明によれば、データの埋め込み処理により、マルチオブジェクト非対応のデータストリームを生成する符号化システムを得ることができる。
【００４９】
＜第２の実施形態＞
次に、本発明の第２の実施形態について、図２、３、５、７〜９を用いて説明する。
【００５０】
図２は、本発明の第２の実施形態におけるマルチオブジェクトに対応した復号化システムの構成を示すブロック図である。図２において、分離部２０１の処理は、図１２における分離部１２０１で説明した処理と同様であり、多重化されているビットストリームをパケット単位で分離し、データ毎にオーディオ復号化バッファ２０２とビデオ復号化バッファ２０５とにそれぞれ出力する。
【００５１】
尚、第１の実施形態で述べたように、オーディオデータには埋め込み処理がなされていないので、オーディオデータについての復号化は通常の復号化処理を行う。すなわち、オーディオ復号化バッファ２０２では、パケット化されているデータを復号化処理できるようなビットストリームに戻し、オーディオ復号化部２０３でオーディオデータに復号する。そして、オーディオ合成メモリ２０４で出力データを生成する。
【００５２】
一方、ビデオ復号化バッファ２０５では、パケット化されているデータを復号化処理できるようなビットストリームに戻す。このビットストリームには、埋め込まれているデータがあるので、埋め込み情報分離部２１４で、データの分離を行う。
【００５３】
付加データの分離後、ビデオデータは、そのままビデオ復号化部２０６に出力され、ビデオ合成メモリ２０７で画像が生成される。分離された付加データは、１ビット毎に出力されるので、オブジェクト復号化バッファ２０８およびシーン記述復号化バッファ２１１に一旦蓄えられ、復号化が可能になったところで復号化処理が行われる。
【００５４】
一方、オブジェクト復号化部２０９、オブジェクト合成メモリ２１０、シーン記述復号化部２１２、シーン記述合成メモリ２１３における処理は、図１３におけるオブジェクト復号化部１３０９、オブジェクト合成メモリ１３１０、シーン記述復号化部１３１２、シーン記述合成メモリ１３１３の処理と同様である。そして、合成部２１５では、シーン記述データに基づいてシーンの合成を行い、オーディオ出力部２１６よりオーディオの出力を、表示部２１７より合成した映像シーンを出力する。尚、これらの合成部２１５から表示部２１７の処理も、図１３の合成部１３１４から表示部１３１６と同様である。従って、このときのビデオ出力は、図１３のシステムの場合と同様に、図９Ｂに示される表示例が得られることになる。
【００５５】
また、この復号化システムをパーソナルコンピュータ等を使って実現する際のハードウェア構成は、第１の実施形態で説明した図３に示す構成と同様である。ここで、図５は、本発明の第２の実施形態におけるメモリ３０３の使用、格納領域を示す図である。図５に示すように、メモリ領域は、ＯＳの他に、プログラムの格納領域とデータの格納領域に分かれている。ここで、プログラムには、復号化ソフトウェアやデータ抽出のソフトウェア等がある。また、データに関しては、オーディオデータやビデオデータ、ボタンデータ、シーン記述データ等、扱うオブジェクトの数だけデータ領域が必要となる。また処理過程に必要なワークエリアも確保される。
【００５６】
図７は、本発明の第２の実施形態における埋め込み情報分離部２１４の処理手順を詳細に説明するためのフローチャートである。まず、ビデオ符号化データに付加データが埋め込まれているかどうかの判定を行う（ステップＳ７０１）。その結果、データの埋め込みがある場合（Ｙｅｓ）、１ビットのデータを抽出する（ステップＳ７０２）。そして、データを分離した後のビデオ符号化データの復号化処理を行う（ステップＳ７０３）。ここでは、処理単位をマクロブロックとしている。
【００５７】
さらに、埋め込まれていた全データが抽出されたかどうかの判定を行う（ステップＳ７０４）。その結果、全データが抽出された場合（Ｙｅｓ）、ボタンオブジェクト復号化（ステップＳ７０５）、ＢＩＦＳデータ復号化を行う（ステップＳ７０６）。一方、全てのデータが抽出されていない場合（Ｎｏ）、これらの復号化処理を行わず、ステップＳ７０７の判定処理に進む。ステップＳ７０７では、全てのマクロブロックを復号化したかどうかの判定を行う。尚、このフローチャートでは、マクロブロックの数が、埋め込まれるビット数よりも多いという前提の処理になっている。
【００５８】
次に、マルチオブジェクトに非対応の復号化システムで、第１の実施形態による符号化システムで生成したストリームを復号する場合について、図１２を用いて説明する。分離部１２０１に入力されるデータストリームは、図２におけるデータストリームと構造的には全く同じものである。特にオーディオデータは何ら変更を加えていないので、従来と全く同じデータが再生される。ビデオデータは、各マクロブロック毎に最も高周波の係数が±１ビットの範囲でずれていることになる。このデータをそのまま復号化すると、高周波成分にわずかな歪を生じることとなるが、視覚特性上この差を知覚することは難しい。従って、ビデオ復号化部１２０７とビデオ合成メモリ１２０８を経て、表示部１２０９で出力される画像は、図９Ａと同等の画像が得られることになる。
【００５９】
上述したように、本発明は、符号化された複数の基本オブジェクトデータが多重化されたデータストリームを複数の符号化された基本オブジェクトデータに分離する分離手段（分離部２０１）と、分離された基本オブジェクトデータに、複数の基本オブジェクトデータで構成されるシーンに関して記述したシーン記述データが電子透かしによって埋め込まれているか否かを判定して、基本オブジェクトデータにシーン記述データが埋め込まれている場合、符号化された基本オブジェクトデータからシーン記述データを抽出する抽出手段（埋め込み情報分離部２１４）と、抽出されたシーン記述データを復号化する第１の復号化手段（シーン記述復号化部２１２）と、シーン記述データが抽出された後の基本オブジェクトデータを復号化する第２の復号化手段（例えば、ビデオ復号化部２０６）と、復号化されたシーン記述データに基づいて、復号化された複数の基本オブジェクトデータを合成する合成手段（合成部２１５）と、合成された結果を出力する出力手段（例えば、表示部２１７やオーディオ出力部２１６）とを備えることを特徴とする。
【００６０】
また、本発明は、基本オブジェクトデータにシーン記述データが埋め込まれているか否かを判定することができない場合、又は、基本オブジェクトデータにシーン記述データが埋め込まれていない場合、符号化された基本オブジェクトデータを復号化する第３の復号化手段（例えば、オーディオ復号化部２０３）をさらに備え、合成手段（合成部２１５）が、復号化された複数の基本オブジェクトデータを合成することを特徴とする。
【００６１】
さらに、本発明は、分離された前記基本オブジェクトデータに対してさらに付加オブジェクトデータが電子透かしによって埋め込まれているか否かを判定し、付加オブジェクトデータが埋め込まれている場合、抽出手段（埋め込み情報分離部２１４）が、さらに付加オブジェクトデータを抽出し、合成手段（合成部２１５）が、復号化されたシーン記述データに基づいて、復号化された複数の基本オブジェクトデータと復号化された付加オブジェクトデータとを合成することを特徴とする。
【００６２】
以上説明したように、本発明によれば、入力されたデータストリームは、マルチオブジェクト非対応の復号化システムでも再生可能であり、また、このデータストリームから埋め込みデータを抽出する復号化システムにより、マルチオブジェクト対応の出力を得ることもできる。このように、マルチオブジェクトに対応しているか否かに関わらず、共通のデータストリームを扱える効率のよい画像処理を提供することができる。
【００６３】
＜他の実施形態＞
尚、本発明は、複数の機器（例えば、ホストコンピュータ、インタフェース機器、リーダ、プリンタ等）から構成されるシステムに適用しても、一つの機器からなる装置（例えば、複写機、ファクシミリ装置等）に適用してもよい。
【００６４】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記録媒体（または記憶媒体）を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたプログラムコードを読み出し実行することによっても、達成されることは言うまでもない。この場合、記録媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記録した記録媒体は本発明を構成することになる。また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているオペレーティングシステム（ＯＳ）などが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００６５】
さらに、記録媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張カードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張カードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００６６】
本発明を上記記録媒体に適用する場合、その記録媒体には、先に説明したフローチャートに対応するプログラムコードが格納されることになる。
【００６７】
【発明の効果】
以上説明したように、本発明によれば、マルチオブジェクトに対応したシステムにおいては通常の処理を行い、マルチオブジェクトに非対応のシステムに対しては互換性を確保して、付加機能を除いた基本機能であるオーディオやビデオの処理をすることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態におけるマルチオブジェクトに対応した符号化システムの構成を示すブロック図である。
【図２】本発明の第２の実施形態におけるマルチオブジェクトに対応した復号化システムの構成を示すブロック図である。
【図３】本発明の第１の実施形態における符号化システムのハードウェア構成を示すブロック図である。
【図４】本発明の第１の実施形態におけるメモリ３０３の使用、格納領域を示す図である。
【図５】本発明の第２の実施形態におけるメモリ３０３の使用、格納領域を示す図である。
【図６】本発明の第１の実施形態における埋め込み情報合成部１１０における埋め込み処理手順を詳細に説明するためのフローチャートである。
【図７】本発明の第２の実施形態における埋め込み情報分離部２１４の処理手順を詳細に説明するためのフローチャートである。
【図８】シーングラフの説明図である。
【図９Ａ】マルチオブジェクト非対応の復号化システムの表示部１２０９で表示されるビデオ出力例を示す概要図である。
【図９Ｂ】マルチオブジェクト対応の復号化システムの表示部１３１６で表示されるビデオ出力例を示す概要図である。
【図１０】オーディオ信号とビデオ信号を符号化する従来のマルチオブジェクト非対応の符号化システムの構成を示すブロック図である。
【図１１】従来例におけるマルチオブジェクト対応の符号化システムの構成を示すブロック図である。
【図１２】従来例におけるマルチオブジェクト非対応の復号システムの構成を示すブロック図である。
【図１３】従来例におけるマルチオブジェクト対応の復号化システムの構成を示すブロック図である。
【符号の説明】
１０１　オーディオ入力部
１０２　オーディオ符号化部
１０３　オーディオパケット化部
１０４　ビデオ入力部
１０５　ビデオ符号化部
１０６　オブジェクト入力部
１０７　オブジェクト符号化部
１０８　シーン記述入力部
１０９　シーン記述符号化部
１１０　埋め込み情報合成部
１１１　ビデオパケット化部
１１２　多重化部
２０１　分離部
２０２　オーディオ復号化バッファ
２０３　オーディオ復号化部
２０４　オーディオ合成メモリ
２０５　ビデオ復号化バッファ
２０６　ビデオ復号化部
２０７　ビデオ合成メモリ
２０８　オブジェクト復号化バッファ
２０９　オブジェクト復号化部
２１０　オブジェクト合成メモリ
２１１　シーン記述復号化バッファ
２１２　シーン記述復号化部
２１３　シーン記述合成メモリ
２１４　埋め込み情報分離部
２１５　合成部
２１６　オーディオ出力部
２１７　表示部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an image processing apparatus and an image processing method for encoding a scene image composed of a plurality of objects, and an image processing apparatus and an image processing method for decoding the encoded image.
[0002]
[Prior art]
A variety of moving picture coding methods have been established as international standards. For example, MPEG (Moving Picture Experts Group)-1, MPEG-2, and MPEG-4 of the ISO (International Organization for Standardization) are well known.
[0003]
In addition to compression coding of conventional video and audio signals, MPEG-4 is a comprehensive multimedia coding standard that also covers still images, computer graphics (CG) images, analysis-synthesis speech coding, synthetic audio such as MIDI (Musical Instrument Digital Interface), and text. To this end, MPEG-4 treats each item to be encoded as an object and also defines an encoding format for displaying a scene composed of these objects on a two-dimensional display.
[0004]
FIG. 10 is a block diagram showing the configuration of a conventional encoding system that encodes an audio signal and a video signal and does not support multi-objects. In FIG. 10, an audio input unit 1001 performs preprocessing for encoding, such as sampling an analog signal and converting it into a digital signal, or reading data from a file. The audio encoding unit 1002 encodes the digital input signal and outputs the compressed data as a bit stream. The audio packetizing unit 1003 divides the bit stream into packets for multiplexing.
[0005]
Meanwhile, the video input unit 1004 performs preprocessing for encoding, such as converting an input signal from a camera into a digital signal or converting the format of a signal read from a file. The video encoding unit 1005 encodes the digital input signal and outputs the compressed data as a bit stream. The video packetizing unit 1006 divides the bit stream into packets for multiplexing. The multiplexing unit 1007 combines the packetized audio data and video data into one data stream.
[0006]
FIG. 12 is a block diagram showing the configuration of a conventional decoding system that does not support multi-objects. In FIG. 12, a separating unit 1201 separates, in packet units, a data stream in which an encoded audio signal and an encoded video signal are multiplexed, and outputs each type of data to an audio decoding buffer 1202 or a video decoding buffer 1206, respectively. The audio decoding buffer 1202 then restores the packetized data to a bit stream that can be decoded, and the audio decoding unit 1203 decodes it into audio data. Output data is then generated in the audio synthesis memory 1204, and audio is output from the audio output unit 1205.
[0007]
On the other hand, in the video decoding buffer 1206, similarly, the packetized data is converted back to a bit stream that can be decoded, and the video decoding unit 1207 decodes the data into video data. Output data is generated in the video synthesis memory 1208, and video output is performed in the display unit 1209. FIG. 9A is a schematic diagram illustrating a video output example displayed on the display unit 1209 of the decoding system that does not support a multi-object.
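The packetizing and multiplexing path of FIG. 10 and the separating and reassembly path of FIG. 12 can be illustrated as a round trip. The following is only a minimal sketch under stated assumptions: the `Packet` type, the fixed `PACKET_SIZE`, the `"audio"`/`"video"` labels, and the simple alternating interleave are all hypothetical and are not taken from the patent or from any MPEG systems specification.

```python
from dataclasses import dataclass
from itertools import zip_longest

PACKET_SIZE = 4  # hypothetical payload size in bytes, chosen only for the demo


@dataclass
class Packet:
    stream_id: str  # "audio" or "video" (illustrative labels)
    payload: bytes


def packetize(stream_id, bitstream, size=PACKET_SIZE):
    """Split one encoded bit stream into packets (packetizing units 1003/1006)."""
    return [Packet(stream_id, bitstream[i:i + size])
            for i in range(0, len(bitstream), size)]


def multiplex(audio, video):
    """Interleave audio and video packets into one stream (multiplexing unit 1007)."""
    stream = []
    for a, v in zip_longest(audio, video):
        if a is not None:
            stream.append(a)
        if v is not None:
            stream.append(v)
    return stream


def demultiplex(stream):
    """Separate packets by stream and rebuild each decodable bit stream
    (separating unit 1201 plus decoding buffers 1202/1206)."""
    buffers = {}
    for packet in stream:
        buffers.setdefault(packet.stream_id, bytearray()).extend(packet.payload)
    return {sid: bytes(buf) for sid, buf in buffers.items()}


audio_bits = b"AUDIO-ES"          # stand-in for an encoded audio bit stream
video_bits = b"VIDEO-ELEMENTARY"  # stand-in for an encoded video bit stream
muxed = multiplex(packetize("audio", audio_bits), packetize("video", video_bits))
restored = demultiplex(muxed)
```

Because the separating side only looks at the per-packet stream label, anything carried inside the video payload passes through untouched, which is the property the invention later relies on.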
[0008]
The configuration of the conventional encoding/decoding system described above can be used in common for MPEG-1, MPEG-2, and MPEG-4. MPEG-4, however, can additionally adopt a system configuration capable of the scene composition described earlier. That is, a “scene description” indicating the plurality of objects constituting the scene and their relationships is encoded as well.
[0009]
FIG. 11 is a block diagram showing the configuration of a conventional encoding system that supports multi-objects. In FIG. 11, the audio signal processing from the audio input unit 1101 to the audio packetizing unit 1103 is the same as the processing from the audio input unit 1001 to the audio packetizing unit 1003 in FIG. 10. Likewise, the video signal processing from the video input unit 1104 to the video packetizing unit 1106 is the same as the processing from the video input unit 1004 to the video packetizing unit 1006 in FIG. 10.
[0010]
In FIG. 11, an object input unit 1107 is an input unit for data separate from the units described above. Since the audio data in the audio input unit 1101 and the video data in the video input unit 1104 are each one object as well, the contents of the data may be audio or video, and text data, CG data, and the like can also be handled. Here, a button created using CG data will be described as an example of an object.
[0011]
The button data input by the object input unit 1107 is subjected to data compression in the object encoding unit 1108, and packet division processing for multiplexing is performed in the object packetization unit 1109. On the other hand, the scene description input unit 1110 is a part for inputting data describing a scene configuration. Specifically, a description is given as to where the button is to be placed on the video screen, and what action should be taken when the button is clicked.
[0012]
Here, the scene description format in MPEG-4 is called BIFS (Binary Format for Scenes) and is based on VRML (Virtual Reality Modeling Language). The scene description encoding unit 1111 converts the data of the described scene into binary data. This binary data is also divided into packets by the scene description packetizing unit 1112. The multiplexing unit 1113 combines the packetized scene data together with the audio, video, and object data into one data stream.
[0013]
Next, FIG. 13 is a block diagram showing the configuration of a conventional decoding system that supports multi-objects. The function of the separating unit 1301 is the same as that of the separating unit 1201 in FIG. 12, but the data to be handled includes button images created by CG, scene description information, and the like in addition to conventional audio and video. The decoding process for audio and video is the same as the process described above with reference to FIG. 12: the audio decoding buffer 1302 through the video synthesis memory 1307 correspond, respectively, to the audio decoding buffer 1202 through the audio synthesis memory 1204 and to the video decoding buffer 1206 through the video synthesis memory 1208.
[0014]
On the other hand, the object decoding buffer 1308 reconstructs a bit stream for decoding the object, and the object decoding unit 1309 performs decoding. The decoded data is temporarily stored in the object synthesis memory 1310. The scene description decoding buffer 1311 reconstructs a bit stream for decoding the scene description information, and the scene description decoding unit 1312 performs decoding. The decoded scene description information is temporarily stored in the scene description synthesis memory 1313.
[0015]
Then, the combining unit 1314 composes the scene based on the scene description information. The scene in this example is assumed to consist of audio, video, and a button, as shown in the scene graph of FIG. 8. When composition of the scene is completed in the combining unit 1314, the audio output unit 1315 outputs the audio and the display unit 1316 outputs the composed video.
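The role of the scene description in composition can be pictured with a small data structure. This is a hedged sketch only: the `Node` and `Scene` classes, the pixel positions, and the `"open_browser"` action string are hypothetical stand-ins, and nothing here reflects the actual BIFS node set or its binary encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    kind: str                                  # "audio", "video", or "button"
    position: Optional[Tuple[int, int]] = None  # top-left corner, visual nodes only
    on_click: Optional[str] = None             # hypothetical action label


@dataclass
class Scene:
    children: List[Node] = field(default_factory=list)


def compose(scene):
    """Route each node to the audio output or the display, in scene order."""
    outputs = {"audio_out": [], "display": []}
    for node in scene.children:
        if node.kind == "audio":
            outputs["audio_out"].append(node.kind)
        else:
            x, y = node.position or (0, 0)
            outputs["display"].append(f"{node.kind}@({x},{y})")
    return outputs


# A scene shaped like FIG. 8: audio, video, and one CG button.
scene = Scene([
    Node("audio"),
    Node("video", position=(0, 0)),
    Node("button", position=(16, 200), on_click="open_browser"),
])
result = compose(scene)
```

A decoder that never receives the scene description simply has no button node to place, which matches the FIG. 9A output described earlier.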
[0016]
FIG. 9B is a schematic diagram showing an example of video output displayed on the display unit 1316 of the multi-object compatible decoding system. In FIG. 9B, a CG-image button labeled “Link” is composited at the lower left of the video image. For example, if the scene description specifies that clicking the “Link” button starts an Internet browser and jumps to a specified URL, related information can be displayed at the same time.
[0017]
The details of the algorithm relating to MPEG are described in "H.323 / MPEG-4 Textbook" published by IE Institute, edited by Sakae Okubo and Masahisa Kawashima, etc., and will not be further described.
[0018]
[Problems to be solved by the invention]
However, the encoding/decoding system that does not support multi-objects, as shown in FIGS. 10 and 12, and the encoding/decoding system that supports multi-objects, as shown in FIGS. 11 and 13, are not compatible with each other. That is, even when the common MPEG-4 encoding/decoding method is used for audio and video, a system whose configuration does not support multi-objects cannot extract and play back only the audio and video portions of multi-object data. Therefore, when distributing moving images to an unspecified number of people, encoded data must be prepared in both multi-object and non-multi-object forms, which significantly reduces work efficiency.
[0019]
The present invention has been made in view of such circumstances. It is an object of the present invention to provide an image processing apparatus and an image processing method that perform normal processing in a system supporting multi-objects and that secure compatibility with a system not supporting multi-objects, so that such a system can process audio and video, which are the basic functions excluding the additional functions.
[0020]
[Means for Solving the Problems]
In order to solve the above problems, the present invention provides an image processing apparatus comprising: first input means for inputting basic object data; first encoding means for encoding the basic object data; and multiplexing means for multiplexing, when a plurality of basic object data are input, the plurality of encoded basic object data, the apparatus further comprising: second input means for inputting predetermined additional data; second encoding means for encoding the additional data; and combining means for embedding the encoded additional data, by digital watermarking, in predetermined encoded basic object data.
[0021]
The image processing apparatus according to the present invention is characterized in that the additional data is scene description data describing a scene composed of the plurality of basic object data.
[0022]
Further, the image processing apparatus according to the present invention further comprises third input means for inputting predetermined additional object data and third encoding means for encoding the additional object data, wherein the combining means embeds the encoded additional data and the encoded additional object data, by digital watermarking, in the predetermined encoded basic object data.
[0023]
Still further, an image processing apparatus according to the present invention comprises: input means for inputting a data stream in which a plurality of encoded basic object data are multiplexed; separating means for separating the data stream into the plurality of encoded basic object data; determining means for determining whether scene description data describing a scene composed of the plurality of basic object data is embedded in the separated basic object data by digital watermarking; extracting means for extracting, when the scene description data is embedded in the basic object data, the scene description data from the encoded basic object data; first decoding means for decoding the extracted scene description data; second decoding means for decoding the basic object data from which the scene description data has been extracted; combining means for combining the plurality of decoded basic object data based on the decoded scene description data; and output means for outputting the combined result.
[0024]
Still further, the image processing apparatus according to the present invention further comprises third decoding means for decoding the encoded basic object data when the determining means cannot determine whether the scene description data is embedded in the basic object data, or when the scene description data is not embedded in the basic object data, wherein the combining means combines the plurality of decoded basic object data.
[0025]
Still further, in the image processing apparatus according to the present invention, the determining means further determines whether additional object data is embedded in the separated basic object data by digital watermarking; when the additional object data is embedded, the extracting means further extracts the additional object data, and the combining means combines the plurality of decoded basic object data and the decoded additional object data based on the decoded scene description data.
[0026]
Furthermore, the image processing apparatus according to the present invention is characterized in that the basic object data is rectangular video data or audio data.
[0027]
Furthermore, the image processing apparatus according to the present invention is characterized in that the additional object data is any one of arbitrarily shaped video data, CG data, audio data, or text data.
[0028]
Furthermore, the image processing apparatus according to the present invention is characterized in that the format of the scene description data conforms to the BIFS of the MPEG-4 system.
[0029]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
[0030]
<First embodiment>
A first embodiment of the present invention will be described with reference to FIGS. 1, 3, 4, and 6.
[0031]
FIG. 1 is a block diagram illustrating the configuration of a multi-object compatible encoding system according to the first embodiment of the present invention. In FIG. 1, the processing of audio data from the audio input unit 101 to the audio packetizing unit 103 is the same as the processing from the audio input unit 1001 to the audio packetizing unit 1003 shown in FIG. 10, or from the audio input unit 1101 to the audio packetizing unit 1103 shown in FIG. 11.
[0032]
Likewise, the processing of video data in the video input unit 104 and the video encoding unit 105 is the same as that in the video input unit 1004 and the video encoding unit 1005 shown in FIG. 10 and in the video input unit 1104 and the video encoding unit 1105 shown in FIG. 11.
[0033]
In FIG. 1, an embedded information synthesizing unit 110 embeds the encoded data of an object and the encoded data of a scene description in the encoded video data. The encoded object data is generated via the object input unit 106 and the object encoding unit 107, and the encoded scene description data is generated via the scene description input unit 108 and the scene description encoding unit 109.
[0034]
At this time, because the data is embedded by modifying part of the video data, the result can be handled, in terms of data format, as video data itself. Therefore, the output data of the embedded information synthesizing unit 110 is divided into packets by the video packetizing unit 111 in the same manner as conventional video data. The multiplexing unit 112 multiplexes the audio packet data and the video packet data to create one stream. The structure of this stream is the same as that described above with reference to FIG. 10.
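Paragraphs [0042] to [0046] of the description give one concrete embedding method: one payload bit per macroblock, carried in the parity of the last coefficient before the end-of-block (EOB) code, with an error raised if the macroblocks run out before the payload does. The sketch below illustrates that idea on plain integer lists. It is an assumption-laden model, not the patented implementation: macroblocks are represented as lists of quantized coefficients rather than variable-length codes, the helper names are hypothetical, a parity mismatch is always resolved by increasing the magnitude by 1 (consistent with the 3 to 4 example in [0046], and chosen so a coefficient never becomes zero), and the payload length is passed to the extractor explicitly.

```python
def _last_nonzero(coeffs):
    """Index of the last non-zero coefficient, i.e. the one just before EOB."""
    for i in range(len(coeffs) - 1, -1, -1):
        if coeffs[i] != 0:
            return i
    raise ValueError("macroblock has no non-zero coefficient to carry a bit")


def embed_bits(macroblocks, payload_bits):
    """Return a copy of the macroblocks with one payload bit hidden per block."""
    if len(payload_bits) > len(macroblocks):
        # Error case of the flowchart: blocks run out before the payload does.
        raise ValueError("payload longer than the number of macroblocks")
    out = [list(mb) for mb in macroblocks]
    remaining = len(payload_bits)  # data length L
    for index, coeffs in enumerate(out):
        if remaining == 0:
            break  # remaining macroblocks are left as ordinarily encoded
        bit = payload_bits[index]
        last = _last_nonzero(coeffs)
        if abs(coeffs[last]) % 2 != bit:
            # Bit 0 wants an even value, bit 1 an odd value; adjust by 1.
            coeffs[last] += 1 if coeffs[last] > 0 else -1
        remaining -= 1
    return out


def extract_bits(macroblocks, count):
    """Read back `count` bits from the parity of each block's last coefficient."""
    return [abs(mb[_last_nonzero(mb)]) % 2 for mb in macroblocks[:count]]


blocks = [[12, 0, 0, 3], [7, -5, 0, 0], [4, 2, 1, 0], [9, 0, 0, 0]]
payload = [0, 1, 0]
marked = embed_bits(blocks, payload)
recovered = extract_bits(marked, len(payload))
```

The marked blocks have the same shape as the input, so a downstream packetizer sees ordinary video data, while a decoder that knows the scheme can recover the payload from the parities.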
[0035]
That is, the present invention is an image processing apparatus comprising first input means (audio input unit 101, video input unit 104) for inputting basic object data, first encoding means (audio encoding unit 102, video encoding unit 105) for encoding the basic object data, and multiplexing means (multiplexing unit 112) for multiplexing a plurality of encoded basic object data when a plurality of basic object data are input. The apparatus further comprises second input means (scene description input unit 108) for inputting predetermined additional data, second encoding means (scene description encoding unit 109) for encoding the additional data, and synthesizing means (embedded information synthesizing unit 110) for embedding the encoded additional data, by digital watermarking, in predetermined encoded basic object data. The additional data is scene description data that describes a scene composed of a plurality of basic object data.
[0036]
The present invention further comprises third input means (object input unit 106) for inputting predetermined additional object data and third encoding means (object encoding unit 107) for encoding the additional object data. The synthesizing means (embedded information synthesizing unit 110) embeds the encoded additional data and the encoded additional object data, by digital watermarking, in the predetermined encoded basic object data.
[0037]
The hardware configuration used when the encoding system shown in FIG. 1 is implemented on a personal computer will be described with reference to FIG. 3. FIG. 3 is a block diagram illustrating a hardware configuration of the encoding system according to the first embodiment of the present invention. In FIG. 3, a central processing unit (CPU) 301 inputs and outputs data to and from various devices via a bus 302. The memory 303 is a temporary storage area for the operating system (OS) necessary for controlling the system and for application software. The terminal 304 is an input interface for control, and corresponds to a keyboard or a mouse.
[0038]
The storage device 305 is an area for storing programs and data. A data input device 306 has an interface for inputting audio and video data. A monitor 307 is a video information output device. A speaker 308 is an audio output device. Reference numeral 309 denotes a communication interface, which is a device that inputs and outputs data to and from the outside of the computer.
[0039]
FIG. 4 is a diagram illustrating the usage and storage areas of the memory 303 according to the first embodiment of this invention. As shown in FIG. 4, the memory 303 is divided into a program storage area and a data storage area in addition to the area used by the OS. The programs include encoding software and data embedding software. As for data, one data area is required for each object to be handled, such as audio data, video data, button data, and scene description data. A work area required during processing is also secured.
[0040]
Here, a digital watermarking technique can be applied to the embedding of data in the embedded information synthesizing unit 110. Digital watermarking embeds information in image data or audio data in such a way that the reproduced data is either unchanged or changed only to a degree that cannot be perceived. Techniques for embedding a digital watermark in an image include Japanese Patent Application Laid-Open No. H10-243398, entitled “Recording Medium and Video Encoding Apparatus Recording a Video Encoding Program”, and Japanese Patent Application Laid-Open No. H11-341450, entitled “Digital Watermark Embedding Apparatus and ...”.
[0041]
The same applies to audio data. Examples include the “method of embedding watermark information in audio data, watermark information embedding device, watermark information detecting device, recording medium in which watermark information is embedded, and recording medium on which a method of embedding watermark information is recorded” of Japanese Patent Application Laid-Open No. 2001-22089, and the “digital watermark embedding device, audio encoding device, and recording medium” of JP-A-11-316599. Here, the process of embedding the coded data of the object to be added and the data relating to the scene description in the video data will be described. However, the embedding is not limited to video data; the data may instead be embedded in the audio data, or in both the audio and video data.
[0042]
FIG. 6 is a flowchart for explaining in detail the embedding processing procedure in the embedding information synthesizing unit 110 according to the first embodiment of the present invention. First, the button object created with the CG data is encoded (step S601), and then the scene description data (BIFS data) is encoded (step S602). These encoded data become data embedded in the video encoded data. Then, data to be embedded is set (step S603). The data length is L.
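Steps S601 to S603 can be pictured as concatenating the two encoded byte strings and expanding them into a bit sequence of length L. The following is only a minimal sketch: the source does not specify how the button object data and the BIFS data are framed, so the 2-byte big-endian length prefix, and the names to_bits and build_payload_bits, are assumptions made for illustration.

```python
from typing import List


def to_bits(data: bytes) -> List[int]:
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def build_payload_bits(object_code: bytes, scene_code: bytes) -> List[int]:
    """Concatenate the two encoded parts into one bit sequence.

    Assumed framing (not from the source): each part is preceded by a
    2-byte big-endian length so that a decoder could split them again.
    """
    framed = b""
    for part in (object_code, scene_code):
        if len(part) > 0xFFFF:
            raise ValueError("part too long for the assumed 2-byte length prefix")
        framed += len(part).to_bytes(2, "big") + part
    return to_bits(framed)


# Placeholder byte strings standing in for the encoded object and BIFS data.
bits = build_payload_bits(b"\x01", b"\xff\x00")
L = len(bits)  # the data length L set in step S603
```

With these placeholder inputs the framed payload is 7 bytes long, so L is 56, which means at least 56 macroblocks are needed if one bit is embedded per macroblock.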
[0043]
Next, it is determined whether or not video encoding has been completed for all macroblocks (step S604). If an uncoded macroblock remains (No), data can still be embedded, so that macroblock is encoded (step S605). Then, it is determined whether or not the remaining data length L is 0, that is, whether any data remains to be embedded (step S606). If data remains to be embedded (No), 1 bit of data is embedded in the macroblock (step S607).
[0044]
After embedding the one-bit data, the remaining data length is decremented by one (step S608). As described above, the operation of embedding data one bit at a time for one macroblock is repeated. If it is determined in step S606 that there is no more data to be embedded (L = 0) (Yes), normal macroblock coding is repeated until the end of all macroblocks. If it is determined in step S604 that the processing of all macroblocks has been completed (Yes), the embedding is completed. However, in this embodiment, error determination processing is added in steps S609 and S610. This is an error process for when all the macroblocks have been encoded before data embedding is completed.
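The control flow of FIG. 6 (steps S604 to S610) can be sketched as a single loop over the macroblocks. This is an illustrative sketch rather than the actual implementation: encode_macroblock and embed_bit are hypothetical callbacks supplied by the caller, and raising an exception is only one possible form of the error processing of steps S609 and S610.

```python
from typing import Callable, List, Sequence


def embed_payload(
    macroblocks: Sequence[object],
    payload_bits: Sequence[int],
    encode_macroblock: Callable[[object], object],
    embed_bit: Callable[[object, int], object],
) -> List[object]:
    """Encode every macroblock, embedding one payload bit per macroblock."""
    remaining = len(payload_bits)          # data length L (step S603)
    position = 0
    encoded = []
    for mb in macroblocks:                 # the loop ends when S604 answers Yes
        code = encode_macroblock(mb)       # step S605
        if remaining != 0:                 # step S606: is L equal to 0?
            code = embed_bit(code, payload_bits[position])  # step S607
            position += 1
            remaining -= 1                 # step S608
        encoded.append(code)
    if remaining != 0:                     # steps S609/S610: macroblocks ran out first
        raise ValueError(f"{remaining} payload bits left after the last macroblock")
    return encoded


# Toy stand-ins: "encoding" doubles the value, "embedding" sets the lowest bit.
result = embed_payload([1, 2, 3], [1, 0], lambda m: m * 2, lambda c, b: c | b)
print(result)  # [3, 4, 6]
```

The third macroblock is encoded normally because the payload is already exhausted, which corresponds to the Yes branch of step S606.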
[0045]
Here, as an example of the embedding method, a method of embedding 1 bit of data in each video macroblock will be described. The highest-frequency coefficient of each macroblock is increased or decreased within a range of ±1, so that the encoded value is deliberately made odd or even according to the watermark data. That is, if the bit to be embedded is 0, the last coefficient is made even, and if the bit is 1, the last coefficient is made odd. To do this, the code immediately before the EOB (end-of-block code) of the target macroblock is read out and changed if necessary.
[0046]
For example, suppose the code immediately before the EOB has a zero-run length of 8 and a value of 3. If the value to be embedded is 0, the code is replaced with a code having a run length of 8 and a value of 4. In the actual code, “111111111110110110” is replaced with “111111111110110111”. If the value to be embedded is 1, nothing is changed, because 3 is already odd. It should be noted that the application of the present invention is not limited to the above-described method, and a method described in “Moving picture digital watermarking system” of JP-A-11-341452 may be used.
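The parity rule described above can be expressed in a few lines. This is a minimal sketch operating on the integer value of the last coefficient rather than on the variable-length code itself; the function names are hypothetical, and the choice to move away from zero when a change is needed (so that the coefficient never becomes 0) is an assumption that happens to agree with the 3 to 4 example in the text.

```python
def embed_bit_in_level(level: int, bit: int) -> int:
    """Return `level` adjusted so that its parity equals `bit` (0 = even, 1 = odd)."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    if level == 0:
        raise ValueError("expected a nonzero coefficient")
    if abs(level) % 2 == bit:
        return level                      # parity already matches: leave the code as it is
    # Assumption: step away from zero so the coefficient never becomes 0.
    return level + 1 if level > 0 else level - 1


def extract_bit_from_level(level: int) -> int:
    """Read the embedded bit back from the parity of the coefficient."""
    return abs(level) % 2


print(embed_bit_in_level(3, 0))  # 4: odd value made even to carry a 0
print(embed_bit_in_level(3, 1))  # 3: already odd, nothing to change
```

Because the value changes by at most 1, a decoder that ignores the watermark still reconstructs a coefficient within ±1 of the original.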
[0047]
As described above, the present invention is characterized in that the basic object data is rectangular video data or audio data. Further, the additional object data is any one of video data, CG data, audio data and text data of an arbitrary shape. Further, the format of the scene description data conforms to BIFS of the MPEG-4 system.
[0048]
As described above, according to the present invention, embedding the data makes it possible to obtain an encoding system that generates a data stream in the same format as that of a system that does not support multi-objects.
[0049]
<Second embodiment>
Next, a second embodiment of the present invention will be described with reference to FIGS.
[0050]
FIG. 2 is a block diagram illustrating a configuration of a decoding system corresponding to a multi-object according to the second embodiment of the present invention. In FIG. 2, the processing of the separation unit 201 is the same as that of the separation unit 1201 described with reference to FIG. 12: it separates the multiplexed bit stream in packet units and passes each kind of data to the audio decoding buffer 202 and the video decoding buffer 205, respectively.
[0051]
As described in the first embodiment, since the embedding process is not performed on the audio data, the decoding of the audio data is performed by a normal decoding process. That is, the audio decoding buffer 202 returns the packetized data to a bit stream that can be decoded, and the audio decoding unit 203 decodes the data into audio data. Then, output data is generated in the audio synthesis memory 204.
[0052]
On the other hand, the video decoding buffer 205 returns the packetized data to a bit stream that can be decoded. Since there is embedded data in this bit stream, the embedded information separating unit 214 separates the data.
[0053]
After the separation of the additional data, the video data is output to the video decoding unit 206 as it is, and an image is generated in the video synthesis memory 207. Since the separated additional data is output bit by bit, the additional data is temporarily stored in the object decoding buffer 208 and the scene description decoding buffer 211, and the decoding process is performed when decoding becomes possible.
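The buffering behaviour described here, in which bits arrive one at a time and decoding starts only once enough have accumulated, can be sketched as follows. The class name BitAccumulator and the idea that the expected size is known in advance are assumptions for illustration; the source does not say how the buffers 208 and 211 detect that decoding has become possible.

```python
from typing import List, Optional


class BitAccumulator:
    """Collects bits one at a time and releases the bytes once complete."""

    def __init__(self, expected_bytes: int) -> None:
        # Assumption: the number of bytes to wait for is known beforehand.
        self.expected_bits = expected_bytes * 8
        self.bits: List[int] = []

    def push(self, bit: int) -> Optional[bytes]:
        """Store one bit; return the full byte string when the last bit arrives."""
        if len(self.bits) >= self.expected_bits:
            raise ValueError("buffer already complete")
        self.bits.append(bit & 1)
        if len(self.bits) < self.expected_bits:
            return None                      # not decodable yet
        out = bytearray()
        for i in range(0, self.expected_bits, 8):
            byte = 0
            for b in self.bits[i:i + 8]:
                byte = (byte << 1) | b       # most significant bit first
            out.append(byte)
        return bytes(out)


acc = BitAccumulator(expected_bytes=1)
results = [acc.push(b) for b in [0, 1, 0, 0, 0, 0, 0, 1]]
print(results[-1])  # b'A'
```

Every call before the eighth returns None, which models the period in which the additional data is held in the buffer without being decoded.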
[0054]
On the other hand, the processing in the object decoding unit 209, the object synthesis memory 210, the scene description decoding unit 212, and the scene description synthesis memory 213 is the same as the processing of the object decoding unit 1309, the object synthesis memory 1310, the scene description decoding unit 1312, and the scene description synthesis memory 1313 in FIG. 13. Then, the synthesizing unit 215 synthesizes the scene based on the scene description data; the audio is output from the audio output unit 216, and the synthesized video scene is displayed on the display unit 217. Note that the processing of the synthesizing unit 215 through the display unit 217 is the same as that of the combining unit 1314 through the display unit 1316 in FIG. 13. Therefore, the video output at this time is the display example shown in FIG. 9B, as in the case of the system in FIG. 13.
[0055]
The hardware configuration when implementing this decoding system using a personal computer or the like is the same as the configuration shown in FIG. 3 described in the first embodiment. Here, FIG. 5 is a diagram showing the usage and storage areas of the memory 303 according to the second embodiment of the present invention. As shown in FIG. 5, the memory is divided into a program storage area and a data storage area in addition to the area used by the OS. Here, the programs include decoding software and data extraction software. As for data, one data area is required for each object to be handled, such as audio data, video data, button data, and scene description data. A work area required during processing is also secured.
[0056]
FIG. 7 is a flowchart for explaining in detail the processing procedure of the embedded information separating unit 214 according to the second embodiment of the present invention. First, it is determined whether additional data is embedded in the encoded video data (step S701). As a result, if data is embedded (Yes), 1-bit data is extracted (step S702). Then, decoding processing of the video encoded data after the data is separated is performed (step S703). Here, the processing unit is a macroblock.
[0057]
Further, it is determined whether or not all the embedded data has been extracted (step S704). If all the data has been extracted (Yes), the button object is decoded (step S705) and then the BIFS data is decoded (step S706). On the other hand, if not all the data has been extracted (No), the process proceeds to the determination of step S707 without performing these decoding processes. In step S707, it is determined whether or not all macroblocks have been decoded. Note that this flowchart assumes that the number of macroblocks is larger than the number of embedded bits.
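The procedure of FIG. 7 can be sketched as a loop that extracts one bit per macroblock while decoding the video as usual. This is a hedged illustration only: extract_bit and decode_macroblock are hypothetical callbacks, and passing the payload length in as a parameter is an assumption, since the source does not state how steps S701 and S704 recognize the presence and the end of the embedded data.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def separate_and_decode(
    coded_macroblocks: Sequence[object],
    payload_length: int,
    extract_bit: Callable[[object], int],
    decode_macroblock: Callable[[object], object],
) -> Tuple[List[object], Optional[List[int]]]:
    """Decode all macroblocks and collect the embedded bits along the way."""
    collected: List[int] = []
    payload: Optional[List[int]] = None
    decoded = []
    for code in coded_macroblocks:                 # loop until S707 answers Yes
        if len(collected) < payload_length:        # step S701: data embedded here?
            collected.append(extract_bit(code))    # step S702
        decoded.append(decode_macroblock(code))    # step S703
        if payload is None and len(collected) == payload_length:  # step S704
            payload = list(collected)              # ready for the S705/S706 decoding
    return decoded, payload


# Toy stand-ins: the bit is the lowest bit, "decoding" halves the value.
video, bits = separate_and_decode([3, 4, 6], 2, lambda c: c % 2, lambda c: c // 2)
print(video, bits)  # [1, 2, 3] [1, 0]
```

If the stream ends before payload_length bits have been collected, the second return value stays None, which corresponds to the case excluded by the assumption that there are more macroblocks than embedded bits.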
[0058]
Next, a case where a stream generated by the encoding system according to the first embodiment is decoded by a decoding system that does not support multi-objects will be described with reference to FIG. 12. The data stream input to the separation unit 1201 is exactly the same in structure as the data stream in FIG. In particular, since no change is made to the audio data, exactly the same data as in the related art is reproduced. In the video data, the highest-frequency coefficient of each macroblock is shifted within a range of ±1. If this data is decoded as it is, a slight distortion occurs in the high-frequency component, but this difference is difficult to perceive because of the characteristics of human vision. Therefore, the image output from the display unit 1209 via the video decoding unit 1207 and the video synthesis memory 1208 is equivalent to that shown in FIG. 9A.
[0059]
As described above, the present invention comprises: separating means (separation unit 201) for separating a data stream, in which a plurality of encoded basic object data are multiplexed, into the plurality of encoded basic object data; extracting means (embedded information separating unit 214) for determining whether scene description data describing a scene composed of the plurality of basic object data is embedded in the separated basic object data by a digital watermark and, when it is embedded, extracting the scene description data from the encoded basic object data; first decoding means (scene description decoding unit 212) for decoding the extracted scene description data; second decoding means (for example, the video decoding unit 206) for decoding the basic object data after the scene description data has been extracted; synthesizing means (synthesizing unit 215) for synthesizing the plurality of decoded basic object data based on the decoded scene description data; and output means (for example, the display unit 217 or the audio output unit 216) for outputting the synthesized result.
[0060]
The present invention further comprises third decoding means (for example, the audio decoding unit 203) for decoding the basic object data when it cannot be determined whether scene description data is embedded in the basic object data, or when scene description data is not embedded in the basic object data. The synthesizing means (synthesizing unit 215) synthesizes the plurality of decoded basic object data.
[0061]
Further, in the present invention, it is determined whether or not additional object data is also embedded in the separated basic object data by a digital watermark. If additional object data is embedded, the extracting means (embedded information separating unit 214) further extracts the additional object data, and the synthesizing means (synthesizing unit 215) synthesizes the plurality of decoded basic object data and the decoded additional object data based on the decoded scene description data.
[0062]
As described above, according to the present invention, an input data stream can be played back even by a decoding system that does not support multi-objects, while a decoding system that supports multi-objects can additionally obtain the output for the objects. In this way, it is possible to provide efficient image processing that can handle a common data stream regardless of whether or not the system supports multi-objects.
[0063]
<Other embodiments>
Note that the present invention may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, a reader, a printer, etc.) or to an apparatus consisting of a single device (for example, a copying machine, a facsimile machine, etc.).
[0064]
Needless to say, the object of the present invention can also be achieved by supplying a recording medium (or storage medium), on which the program code of software that realizes the functions of the above-described embodiments is recorded, to a system or an apparatus, and having a computer (or a CPU or MPU) of that system or apparatus read and execute the program code stored in the recording medium. In this case, the program code itself read from the recording medium realizes the functions of the above-described embodiments, and the recording medium on which the program code is recorded constitutes the present invention. Needless to say, the functions of the above-described embodiments are realized not only when the computer executes the read program code, but also when an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above-described embodiments.
[0065]
Furthermore, it goes without saying that the present invention also covers the case where the program code read from the recording medium is written into a memory provided in a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided in the function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above-described embodiments.
[0066]
When the present invention is applied to the recording medium, the recording medium stores program codes corresponding to the flowcharts described above.
[0067]
[Effect of the invention]
As described above, according to the present invention, a system that supports multi-objects performs its normal processing, while a system that does not support multi-objects retains compatibility and can process the audio and video that constitute the basic functions, excluding the additional functions.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of an encoding system corresponding to a multi-object according to a first embodiment of the present invention.
FIG. 2 is a block diagram illustrating a configuration of a decoding system corresponding to a multi-object according to a second embodiment of the present invention.
FIG. 3 is a block diagram illustrating a hardware configuration of an encoding system according to the first embodiment of the present invention.
FIG. 4 is a diagram showing use and storage areas of a memory 303 according to the first embodiment of the present invention.
FIG. 5 is a diagram showing use and storage areas of a memory 303 according to a second embodiment of the present invention.
FIG. 6 is a flowchart for explaining in detail an embedding processing procedure in an embedding information synthesizing unit 110 according to the first embodiment of the present invention.
FIG. 7 is a flowchart for describing in detail a processing procedure of an embedded information separating unit 214 according to the second embodiment of the present invention.
FIG. 8 is an explanatory diagram of a scene graph.
FIG. 9A is a schematic diagram showing a video output example displayed on the display unit 1209 of the decoding system not supporting multi-objects.
FIG. 9B is a schematic diagram showing an example of video output displayed on the display unit 1316 of the multi-object compatible decoding system.
FIG. 10 is a block diagram illustrating a configuration of a conventional multi-object non-compliant encoding system that encodes an audio signal and a video signal.
FIG. 11 is a block diagram showing the configuration of a multi-object compatible encoding system in a conventional example.
FIG. 12 is a block diagram showing a configuration of a multi-object non-compliant decoding system in a conventional example.
FIG. 13 is a block diagram showing the configuration of a multi-object compatible decoding system in a conventional example.
[Explanation of symbols]
101 Audio input unit
102 Audio encoding unit
103 Audio packetizing unit
104 Video input unit
105 Video encoding unit
106 Object input unit
107 Object encoding unit
108 Scene description input unit
109 Scene description encoding unit
110 Embedded information synthesizing unit
111 Video packetizing unit
112 Multiplexing unit
201 Separation unit
202 Audio decoding buffer
203 Audio decoding unit
204 Audio synthesis memory
205 Video decoding buffer
206 Video decoding unit
207 Video synthesis memory
208 Object decoding buffer
209 Object decoding unit
210 Object synthesis memory
211 Scene description decoding buffer
212 Scene description decoding unit
213 Scene description synthesis memory
214 Embedded information separating unit
215 Synthesizing unit
216 Audio output unit
217 Display unit

Claims

An image processing apparatus comprising:
first input means for inputting basic object data;
first encoding means for encoding the basic object data;
multiplexing means for multiplexing a plurality of encoded basic object data when a plurality of basic object data are input;
second input means for inputting predetermined additional data;
second encoding means for encoding the additional data; and
combining means for embedding the encoded additional data, by digital watermarking, in the encoded basic object data,
wherein the additional data is scene description data describing a scene composed of the plurality of basic object data.

2. The image processing apparatus according to claim 1, further comprising:
third input means for inputting predetermined additional object data; and
third encoding means for encoding the additional object data,
wherein the combining means embeds the encoded additional data and the encoded additional object data, by digital watermarking, in predetermined encoded basic object data.

Input means for inputting a data stream in which a plurality of encoded basic object data are multiplexed;
Separating means for separating the data stream into a plurality of encoded basic object data;
Determining means for determining whether or not scene description data describing a scene composed of the plurality of basic object data is embedded in the separated basic object data by a digital watermark;
Extracting means for extracting the scene description data from the encoded basic object data when the scene description data is embedded in the basic object data;
First decoding means for decoding the extracted scene description data;
Second decoding means for decoding the basic object data after the scene description data has been extracted;
Combining means for combining the plurality of decoded basic object data based on the decoded scene description data;
An image processing apparatus comprising: an output unit that outputs a synthesized result.

The image processing apparatus according to claim 3, further comprising third decoding means for decoding the basic object data when the determining means cannot determine whether the scene description data is embedded in the basic object data, or when the scene description data is not embedded in the basic object data,
wherein the combining means combines the plurality of decoded basic object data.

5. The image processing apparatus according to claim 3, wherein
the determining means further determines whether additional object data is embedded in the separated basic object data by a digital watermark,
the extracting means further extracts the additional object data when the additional object data is embedded, and
the combining means combines the plurality of decoded basic object data and the decoded additional object data based on the decoded scene description data.

The image processing apparatus according to claim 1, wherein the basic object data is rectangular video data or audio data.

The image processing device according to claim 1, wherein the additional object data is any one of video data, CG data, audio data, and text data having an arbitrary shape.

The image processing apparatus according to any one of claims 1 to 7, wherein a format of the scene description data conforms to BIFS of an MPEG-4 system.

A first encoding step of encoding the basic object data;
A second encoding step of encoding the predetermined additional data;
A combining step of embedding the encoded additional data by digital watermarking with respect to the encoded basic object data;
A multiplexing step of multiplexing the encoded plurality of basic object data when a plurality of basic object data is input,
An image processing method, wherein the additional data is scene description data describing a scene composed of the plurality of basic object data.

10. The image processing method according to claim 9, further comprising a third encoding step of encoding predetermined additional object data,
wherein the combining step embeds the encoded additional data and the encoded additional object data, by digital watermarking, in predetermined encoded basic object data.

A separation step of separating a data stream in which a plurality of encoded basic object data are multiplexed into a plurality of encoded basic object data;
A determining step of determining whether or not scene description data describing a scene composed of the plurality of basic object data is embedded in the separated basic object data by a digital watermark;
When the scene description data is embedded in the basic object data, an extraction step of extracting the scene description data from the encoded basic object data,
A first decoding step of decoding the extracted scene description data;
A second decoding step of decoding the basic object data after the scene description data is extracted;
Synthesizing the plurality of decoded basic object data based on the decoded scene description data.

12. The image processing method according to claim 11, further comprising a third decoding step of decoding the basic object data when the determining step cannot determine whether or not the scene description data is embedded in the basic object data, or when the scene description data is not embedded in the basic object data,
wherein the combining step combines the plurality of decoded basic object data.

13. The image processing method according to claim 11, wherein
the determining step further determines whether or not additional object data is embedded in the separated basic object data by a digital watermark,
the extracting step further extracts the additional object data when the additional object data is embedded, and
the synthesizing step synthesizes the plurality of decoded basic object data and the decoded additional object data based on the decoded scene description data.

14. The image processing method according to claim 9, wherein the basic object data is rectangular video data or audio data.

15. The image processing method according to claim 9, wherein the additional object data is any one of video data, CG data, audio data, and text data of an arbitrary shape.

16. The image processing method according to claim 9, wherein a format of the scene description data conforms to an MPEG-4 system BIFS.

A program for causing a computer to execute:
a first encoding procedure for encoding basic object data;
a second encoding procedure for encoding predetermined additional data;
a synthesis procedure for embedding the encoded additional data, by digital watermarking, in predetermined encoded basic object data; and
a multiplexing procedure for multiplexing a plurality of encoded basic object data when a plurality of basic object data are input,
wherein the additional data is scene description data describing a scene composed of the plurality of basic object data.

A program for causing a computer to execute:
a separation procedure for separating a data stream, in which a plurality of encoded basic object data are multiplexed, into the plurality of encoded basic object data;
a determining procedure for determining whether or not scene description data describing a scene composed of the plurality of basic object data is embedded in the separated basic object data by a digital watermark;
an extraction procedure for extracting the scene description data from the encoded basic object data when the scene description data is embedded in the basic object data;
a first decoding procedure for decoding the extracted scene description data;
a second decoding procedure for decoding the basic object data after the scene description data has been extracted; and
a synthesizing procedure for synthesizing the plurality of decoded basic object data based on the decoded scene description data.

A computer-readable recording medium storing the program according to claim 17.