JP2005504490A

JP2005504490A - Video coding based on visemes

Info

Publication number: JP2005504490A
Application number: JP2003531746A
Authority: JP
Inventors: エスシャラパリ，キラン
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2001-09-24
Filing date: 2002-09-06
Publication date: 2005-02-10
Also published as: EP1433332A1; KR20040037099A; WO2003028383A1; CN1279763C; CN1557100A; US20030058932A1

Abstract

A video processing system for processing a stream of frames of video data. The system has a packaging system that includes a viseme identification system that determines whether an input frame of video data corresponds to at least one predetermined viseme, a viseme library that stores frames corresponding to the at least one predetermined viseme, and an encoder that encodes each frame corresponding to the at least one predetermined viseme, wherein the encoder uses frames previously stored in the viseme library to encode a current frame. Also provided is a receiver system that includes a decoder that decodes encoded frames of video data and a frame reference library that stores decoded frames, wherein the decoder uses a previously decoded frame from the frame reference library to decode a current encoded frame, and the previously decoded frame belongs to the same viseme as the current encoded frame.

Description

[0001]
The present invention relates to video encoding and decoding, and more particularly to a viseme-based system and method for encoding video frames.
[0002]
As demand for remote video processing applications (e.g., video conferencing, video telephony) grows, it has become important to provide systems that can transmit video data efficiently over limited bandwidth. One solution for reducing bandwidth consumption is to use a video processing system that can encode and decode compressed video signals.
[0003]
There are currently two techniques for performing video compression: waveform-based compression and model-based compression. Waveform-based compression uses relatively mature techniques, such as those provided by the MPEG and ITU standards (e.g., MPEG-2, MPEG-4, H.263). Model-based compression, by contrast, is a relatively immature technology. A typical approach to model-based compression involves generating a three-dimensional model of a human face and then deriving the two-dimensional images that form the basis of each new frame of video data. When most of the transmitted video image data is repetitive, as in the case of head-and-shoulder images, model-based coding can achieve a much higher degree of compression.
[0004]
Thus, although current model-based compression techniques are useful for applications such as video conferencing and video telephony, the computational complexity of generating and processing three-dimensional images tends to make such systems difficult to implement and prohibitively expensive. There is therefore a need for a coding system that can achieve the compression levels of a model-based system without the computational overhead of processing three-dimensional images.
[0005]
The present invention addresses the above-mentioned problems, as well as others, by providing a novel model-based coding system. In particular, the input video frames are decimated so that only a subset of all the frames is actually encoded. The frames that are encoded are encoded using prediction from previously encoded frames and/or from frames in a dynamically generated viseme library.
[0006]
In a first aspect, the present invention provides a video processing system for processing a stream of frames of video data, the system having a packaging system that includes: a viseme identification system that determines whether an input frame of video data corresponds to at least one predetermined viseme; a viseme library that stores frames corresponding to the at least one predetermined viseme; and an encoder that encodes each frame corresponding to the at least one predetermined viseme, wherein the encoder uses frames previously stored in the viseme library to encode a current frame.
[0007]
In a second aspect, the present invention provides a method of processing a stream of frames of video data, comprising: determining whether each input frame of video data corresponds to at least one predetermined viseme; storing frames corresponding to the at least one predetermined viseme in a viseme library; and encoding each frame corresponding to the at least one predetermined viseme, wherein the encoding step uses frames previously stored in the viseme library to encode a current frame.
[0008]
In a third aspect, the present invention provides a program product stored on a recordable medium that, when executed, processes a stream of frames of video data, the program product comprising: a system for determining whether an input frame of video data corresponds to at least one predetermined viseme; a viseme library that stores frames corresponding to the at least one predetermined viseme; and a system for encoding each frame corresponding to the at least one predetermined viseme, wherein the encoding system uses frames previously stored in the viseme library to encode a current frame.
[0009]
In a fourth aspect, the present invention provides a decoder for decoding encoded frames of video data that were encoded using frames associated with at least one predetermined viseme, the decoder having: a frame reference library that stores decoded frames, wherein the decoder uses a previously stored frame in the frame reference library to decode a current encoded frame, and the previously stored frame belongs to the same viseme as the current encoded frame; and a morphing system that reconstructs frames of video data that were removed during the encoding process.
[0010]
Preferred exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which like reference numerals indicate like elements. FIGS. 1 and 2 show a video processing system for encoding video images. While the embodiments described herein focus primarily on applications involving the processing of facial images, it should be understood that the present invention is not limited to the coding of facial images. FIG. 1 shows a video packaging system 10 that includes an encoder 14 that generates encoded video data 50 from input frames of video data 32 and audio data 33. FIG. 2 shows a video receiver system 40 that includes a decoder 42 that decodes the video data 50 encoded by the video packaging system 10 of FIG. 1 and generates decoded video data 52.
[0011]
The video packaging system 10 of FIG. 1 processes input frames of video data 32 using a viseme identification system 12, an encoder 14, and a viseme library 16. In a typical application, the input frames of video data 32 may include a large number of images of a human face, such as those commonly processed by a video conferencing system. The input frames 32 are examined by the viseme identification system 12 to determine which frames correspond to one or more predetermined visemes. A viseme can be defined as a generic facial image that can be used to describe a particular sound (e.g., the mouth shape needed to utter “sh”). A viseme is the visual equivalent of a phoneme, the unit of sound in spoken language.
[0012]
The process of determining which images correspond to a viseme is performed by the audio segmentation unit 18, which identifies phonemes in the audio data 33. Each time a phoneme is identified, the corresponding video image can be tagged as belonging to the corresponding viseme. For example, each time the phoneme “sh” is detected in the audio data, the corresponding video frame can be identified as belonging to the “sh” viseme. The tagging of video frames is handled by the mapping system 20, which maps identified phonemes to visemes. Note that explicit identification of a given pose or facial expression is not necessary. Rather, video frames belonging to known visemes are implicitly identified and classified using the phonemes. It should be understood that any number or type of visemes can be generated, including a silence viseme made up of images that have no corresponding speech for a certain period of time (e.g., 1 second).
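The tagging step can be illustrated with a minimal Python sketch. It assumes that an audio segmenter has already produced one phoneme label (or none) per video frame; the table `PHONEME_TO_VISEME`, the viseme labels, and the function `tag_frames` are hypothetical names chosen for illustration and do not appear in the specification.

```python
from typing import List, Optional

# Hypothetical phoneme-to-viseme table; the specification does not define one.
PHONEME_TO_VISEME = {
    "sh": "V1", "ch": "V1",          # assumed to share one mouth shape
    "p": "V2", "b": "V2", "m": "V2",  # assumed to share one mouth shape
    "sil": "V3",                      # silence viseme
}

def tag_frames(phonemes_per_frame: List[Optional[str]]) -> List[Optional[str]]:
    """Return a viseme label per frame, or None when no known viseme applies."""
    return [
        PHONEME_TO_VISEME.get(phoneme) if phoneme is not None else None
        for phoneme in phonemes_per_frame
    ]

# One entry per video frame: the phoneme detected in the aligned audio, if any.
tags = tag_frames(["sh", None, "b", "zz", "sil"])
```

Frames whose phoneme is missing or unknown receive no tag, which is what later makes them candidates for decimation.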
[0013]
When a frame is identified as belonging to a viseme, the frame is stored in the viseme library 16. The viseme library 16 can be physically or logically organized by viseme, so that frames tagged as belonging to a common viseme are stored together in one of a plurality of model sets (e.g., V1, V2, V3, V4). Initially, each model set is empty. Each model set grows as more frames are processed. A threshold may be set on the size of a given model set to prevent it from becoming too large. A first-in, first-out system for discarding frames can be used to remove excess frames once the threshold is reached.
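A library organized this way might be sketched as follows. The class name `VisemeLibrary`, its methods, and the threshold value are assumptions made for illustration; the sketch relies on the standard-library `collections.deque`, whose `maxlen` argument discards the oldest item when a full deque is appended to.

```python
from collections import defaultdict, deque

class VisemeLibrary:
    """Hypothetical viseme library: one bounded FIFO model set per viseme."""

    def __init__(self, max_frames_per_set: int = 3):
        # Each model set starts empty and is created on first use.
        self._sets = defaultdict(lambda: deque(maxlen=max_frames_per_set))

    def store(self, viseme: str, frame) -> None:
        # Appending to a full deque silently drops the oldest frame (FIFO).
        self._sets[viseme].append(frame)

    def model_set(self, viseme: str) -> list:
        return list(self._sets[viseme])

library = VisemeLibrary(max_frames_per_set=3)
for i in range(5):
    library.store("V1", f"frame{i}")
```

After five frames are stored against a threshold of three, only the three most recent remain in the model set.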
[0014]
If an input frame does not correspond to a viseme, the frame decimation system 22 decimates or discards the frame, i.e., sends it to the trash 34. In this case, the frame is neither stored in the viseme library 16 nor encoded by the encoder 14. However, information about the position of any decimated frame may be explicitly or implicitly incorporated into the encoded video data 50. This information can be used by the receiver to determine where to reconstruct the decimated frames, as described below.
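As a rough illustration of decimation with explicit position information, the following Python sketch splits a tagged frame sequence into the indices to encode and the indices that were dropped. The function name `decimate` and the choice to record positions as a plain list of indices are assumptions; the specification leaves the representation open.

```python
from typing import List, Optional, Tuple

def decimate(tags: List[Optional[str]]) -> Tuple[List[int], List[int]]:
    """Split frame indices into (kept, dropped) based on viseme tags.

    Frames with a viseme tag are kept for encoding; untagged frames are
    dropped, and their positions are returned so that a receiver could
    later know where to reconstruct them.
    """
    kept = [i for i, tag in enumerate(tags) if tag is not None]
    dropped = [i for i, tag in enumerate(tags) if tag is None]
    return kept, dropped

kept, dropped = decimate(["V1", None, None, "V2"])
```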
[0015]
Assuming that the input frame corresponds to a viseme, the encoder 14 encodes the frame using, for example, a block-by-block prediction method, and outputs it as encoded video data 50. The encoder 14 includes an error prediction system 24, detailed motion information 25, and a frame prediction system 26. The error prediction system 24 encodes the prediction error in any known manner, for example as provided in the MPEG-2 standard. The detailed motion information 25 may be generated as side information for use by the morphing system 48 at the receiver 40 (FIG. 2). The frame prediction system 26 predicts the frame from two images: (1) a motion-compensated, previously encoded frame generated by the encoder 14, and (2) an image retrieved from the viseme library 16 by the retrieval system 28. In particular, the image retrieved from the viseme library 16 is taken from the model set for the same viseme as the frame being encoded. For example, if the frame contains an image of a face uttering the sound “sh”, a previous image from the same viseme is selected and retrieved. The retrieval system 28 retrieves the image that is closest in the mean-square sense. Thus, rather than relying on temporal proximity (i.e., neighboring frames), the present invention can select the closest match among any previous frames, regardless of temporal proximity. Because a very similar previous frame is found, the prediction error is small, and a very high degree of compression can readily be achieved.
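The retrieval step can be sketched in Python as a minimum mean-squared-error search over one model set, followed by computation of the prediction residual. This is a simplified illustration only: frames are flat lists of pixel values, the function names are hypothetical, and the sketch covers only prediction from the library image, not the motion-compensated prediction from the previously encoded frame or any block-by-block processing.

```python
from typing import List, Optional, Tuple

Frame = List[float]

def mean_squared_error(a: Frame, b: Frame) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def retrieve_closest(current: Frame, model_set: List[Frame]) -> Optional[Tuple[int, Frame]]:
    """Return (index, frame) of the stored frame with the lowest MSE, if any."""
    if not model_set:
        return None
    index = min(range(len(model_set)),
                key=lambda i: mean_squared_error(current, model_set[i]))
    return index, model_set[index]

def residual(current: Frame, reference: Frame) -> Frame:
    """Prediction error that would be passed on for encoding."""
    return [x - y for x, y in zip(current, reference)]

# Toy model set for one viseme; values stand in for pixel intensities.
model_set = [[10, 10, 10, 10], [50, 52, 49, 51], [200, 200, 200, 200]]
current = [50, 50, 50, 50]
index, reference = retrieve_closest(current, model_set)
error = residual(current, reference)
```

The closer the retrieved reference, the smaller the residual values, which is the property that makes the prediction error cheap to encode.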
[0016]
Referring now to FIG. 2, a video receiver system 40 is shown that includes a decoder 42, a reference frame library 44, a buffer 46, and a morphing system 48. The decoder 42 decodes incoming frames of the encoded video data 50 using techniques that parallel those of the video packaging system 10. In particular, an encoded frame is decoded using (1) the immediately preceding decoded frame and (2) an image from the reference frame library 44. The image from the reference frame library is the same one that was used to encode the frame, and it can be readily identified from reference data stored in the encoded frame. After a frame is decoded, it is stored in the reference frame library 44 (for use in decoding future frames) and transferred to the buffer 46.
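A minimal Python sketch of the library-based half of decoding follows. It assumes that each encoded frame carries a residual plus an index identifying its reference in the reference frame library; the function name `decode_frame` and this particular layout of the reference data are assumptions, and prediction from the immediately preceding decoded frame is omitted.

```python
from typing import List

Frame = List[float]

def decode_frame(residual: Frame, reference_index: int,
                 reference_library: List[Frame]) -> Frame:
    """Rebuild a frame from its reference and residual, then store it."""
    reference = reference_library[reference_index]
    decoded = [r + e for r, e in zip(reference, residual)]
    # Store the decoded frame so that later frames can be predicted from it.
    reference_library.append(decoded)
    return decoded

reference_library = [[50, 52, 49, 51]]
decoded = decode_frame([0, -2, 1, -1], 0, reference_library)
```

Because the decoder stores every decoded frame, its library stays in step with the encoder's as long as both apply the same storage rules.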
[0017]
If one or more frames were originally decimated (e.g., indicated as “??” in the buffer 46), the morphing system 48 can be used to reconstruct the decimated frames, for example by interpolating between the encoded frames 53 and 55. Such interpolation techniques are described, for example, in Ezzat and Poggio, “MikeTalk: A talking facial display based on morphing visemes,” Proc. Computer Animation Conference, pages 96-102, Philadelphia, PA, USA, 1998, which is incorporated herein by reference. The morphing system 48 may use the detailed motion information provided by the encoder 14 (FIG. 1). After the frames are reconstructed, they can be output together with the decoded frames as a complete set of decoded video data 52.
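Gap filling in the buffer can be illustrated with the Python sketch below. It substitutes a plain linear blend between the nearest decoded neighbours for true morphing, which in the cited work is considerably more elaborate; the function name `reconstruct` and the use of `None` to mark decimated frames are assumptions for illustration.

```python
from typing import List, Optional

Frame = List[float]

def reconstruct(buffer: List[Optional[Frame]]) -> List[Optional[Frame]]:
    """Fill None entries by linearly blending the nearest decoded neighbours."""
    out = list(buffer)
    known = [i for i, frame in enumerate(buffer) if frame is not None]
    for i, frame in enumerate(buffer):
        if frame is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            w = (i - lo) / (hi - lo)  # blend weight toward the later frame
            out[i] = [(1 - w) * a + w * b for a, b in zip(buffer[lo], buffer[hi])]
        elif before:   # trailing gap: repeat the last decoded frame
            out[i] = list(buffer[before[-1]])
        elif after:    # leading gap: repeat the first decoded frame
            out[i] = list(buffer[after[0]])
    return out

frames = reconstruct([[0.0, 0.0], None, [20.0, 40.0]])
```

A single missing frame midway between two decoded frames receives equal weight from each neighbour.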
[0018]
The systems, functions, methods, and modules described herein may be implemented in hardware, software, or a combination of hardware and software. They can be implemented by any kind of computer system or other apparatus adapted to carry out the methods described herein. A typical combination of hardware and software may be a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system so that it carries out the methods described herein. Alternatively, a special-purpose computer containing dedicated hardware for carrying out one or more of the functional tasks of the present invention may be used. The present invention can also be embedded in a computer program product that comprises all the features enabling the implementation of the methods and functions described herein and that, when loaded into a computer system, is able to carry out those methods and functions. In this specification, computer program, software program, program, program product, or software means any expression, in any language, code, or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either of (a) or (b).
[0019]
The foregoing description of the preferred embodiments of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teachings. Such modifications and variations that are apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.
[Brief description of the drawings]
[0020]
FIG. 1 illustrates a video packaging system having an encoder according to a preferred embodiment of the present invention.
FIG. 2 illustrates a video receiver system having a decoder according to a preferred embodiment of the present invention.

Claims

A video processing system for processing a stream of frames of video data, comprising a packaging system that includes:
a viseme identification system that determines whether an input frame of video data corresponds to at least one predetermined viseme;
a viseme library that stores frames corresponding to the at least one predetermined viseme; and
an encoder that encodes each frame corresponding to the at least one predetermined viseme,
wherein the encoder uses frames previously stored in the viseme library to encode a current frame.

The video processing system of claim 1, wherein the viseme identification system includes an audio segmentation unit that identifies phonemes associated with frames of the video data in an audio data stream.

The video processing system of claim 2, wherein the viseme identification system maps identified phonemes to the at least one predetermined viseme.

The video processing system of claim 2, wherein the viseme identification system tags frames with associated phonemes.

The video processing system of claim 1, further comprising a frame decimation system that removes frames that do not correspond to the at least one predetermined viseme.

The video processing system of claim 5, further comprising a receiver system that includes:
a decoder that decodes encoded frames of video data; and
a frame reference library that stores decoded frames,
wherein the decoder uses a previously decoded frame from the frame reference library to decode a current encoded frame, and the previously decoded frame belongs to the same viseme as the current encoded frame.

The video processing system of claim 6, wherein the receiver system further comprises a morphing system that reconstructs frames removed by the frame decimation system.

The video processing system of claim 7, wherein the encoder generates detailed motion information used by the morphing system to reconstruct frames.

A method of processing a stream of frames of video data, comprising:
determining whether each input frame of video data corresponds to at least one predetermined viseme;
storing frames corresponding to the at least one predetermined viseme in a viseme library; and
encoding each frame corresponding to the at least one predetermined viseme,
wherein the encoding step uses frames previously stored in the viseme library to encode a current frame.

The method of claim 9, further comprising:
decoding encoded frames of video data; and
providing a frame reference library that stores the decoded frames,
wherein the decoding step uses a previously decoded frame from the frame reference library to decode a current encoded frame, and the previously decoded frame belongs to the same viseme as the current encoded frame.

A program product stored on a recordable medium that, when executed, processes a stream of frames of video data, comprising:
a system for determining whether an input frame of video data corresponds to at least one predetermined viseme;
a viseme library that stores frames corresponding to the at least one predetermined viseme; and
a system for encoding each frame corresponding to the at least one predetermined viseme,
wherein the encoding system uses frames previously stored in the viseme library to encode a current frame.

The program product of claim 11, wherein the determining system includes an audio segmentation unit that identifies phonemes in an audio data stream associated with the frame of video data.

The program product of claim 11, wherein the determining system maps the identified phonemes to the at least one predetermined viseme.

A decoder for decoding encoded frames of video data that were encoded using frames associated with at least one predetermined viseme, comprising:
a frame reference library that stores decoded frames, wherein the decoder uses a previously stored frame in the frame reference library to decode a current encoded frame, and the previously stored frame belongs to the same viseme as the current encoded frame; and
a morphing system that reconstructs frames of video data removed during the encoding process.