JP4437514B2

JP4437514B2 - Image transmission system

Info

Publication number: JP4437514B2
Application number: JP2000192965A
Authority: JP
Inventors: 哲也成瀬
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-06-27
Filing date: 2000-06-27
Publication date: 2010-03-24
Anticipated expiration: 2020-06-27
Also published as: JP2002008051A

Description

【０００１】
【発明の属する技術分野】
本発明は画像伝送システムに関し、例えばユーザ間で音声と共に画像の送受信を行なうテレビジョン電話システムに適用して好適なものである。
【０００２】
【従来の技術】
従来、図６に示すようにテレビジョン電話システム１は、送信側端末２と受信側端末３とによって構成されている。この送信側端末２は、撮像手段（図示せず）によってユーザの顔を撮像した後にディジタル変換した画像データＤ１を１フレーム分ずつ画像圧縮符号化回路４及び動き画像成分抽出回路５に順次送出する。
【０００３】
画像圧縮符号化回路４は、画像データＤ１を所定の方式で圧縮符号化することにより画像符号化データＤ２を生成し、これを多重化回路６に送出する。
【０００４】
動き画像成分抽出回路５は、まず画像データＤ１における顔部分を複数のポイントに分割し、それら複数のポイントを結ぶことによりワイヤフレームと呼ばれる基準の顔画像モデルを生成する。因みにワイヤフレームは、顔の中で特に動きのある目元や口元にポイントが数多く配置されている
【０００５】
そして動き画像成分抽出回路５は、ワイヤフレームの各ポイントを分析パラメータとして用い、当該分析パラメータの時間的変化（すなわち前フレームと現フレームとの差分）をワイヤフレームの動き画像成分として抽出した後に圧縮符号化することにより例えば口元部分の動きを表す動き画像成分データＤ３を生成し、これを多重化回路６に送出する。
【０００６】
また送信側端末２は、マイクロフォン（図示せず）で集音した後にディジタル変換したユーザの音声データＤ４を音声圧縮符号化回路７に送出する。音声圧縮符号化回路７は、音声データＤ４を所定の圧縮符号化方法によって圧縮符号化した後、これを音声符号化データＤ５として多重化回路６に送出する。
【０００７】
多重化回路６は、画像符号化データＤ２、動き画像成分データＤ３及び音声符号化データＤ５を多重化処理し、その結果得られる多重化データＤ６を変調回路８に送出する。
【０００８】
変調回路８は、通信路９を介して送信するための所定の変調方式で多重化データＤ６を変調処理した後、これを送信データＤ７として通信路９を介して受信側端末３へ送信する。因みに通信路９としては、有線及び無線に特にこだわるものではなく、いずれであっても良い。
【０００９】
すなわち通信路９が無線通信路であるときには、送信側端末２及び受信側端末３として例えばカメラ付携帯電話機を用いたテレビジョン電話システム１であることを想定し、通信路９が有線通信路であるときには、送信側端末２及び受信側端末３として例えば家庭に設置されるカメラ付電話機を用いたテレビジョン電話システム１であることを想定している。
【００１０】
受信側端末３は、通信路９を介して送信されてきた送信データＤ７を受信データＤ８として受信して復調回路１０に送出する。復調回路１０は、受信データＤ８に対して復調処理を施すことにより復調データＤ９を得、これを分離回路１１に送出する。
【００１１】
なお実際上、復調回路１０は通信路９を介して受信した受信データＤ８を復調する際、通信路９上で生じるデータ誤りの検出及び訂正を行っているが、ここでは説明の便宜上省略する。
【００１２】
分離回路１１は、送信側端末２の多重化データＤ６に相当する復調データＤ９を多重化処理の逆の手順で分離処理することにより、元の画像符号化データＤ２、動き画像成分データＤ３及び音声符号化データＤ５にそれぞれ相当する画像符号化データＤ１２、動き画像成分データＤ１３及び音声符号化データＤ１５に分離し、音声符号化データＤ１５を音声復号化回路１２に送出し、動き画像成分データＤ１３を動き画像成分復号化回路１３に送出すると共に、画像符号化データＤ１２を画像復号化回路１４に送出する。
【００１３】
画像復号化回路１４は、分離回路１１から順次送られてくる画像符号化データＤ１２を復号することにより元の顔画像を表す基準の画像データＤ１６を１フレーム分ずつ復元し、これらを順次画像データ保持回路１５に送出する。
【００１４】
画像データ保持回路１５は、画像復号化回路１４から送られてきた画像データＤ１６を内部メモリ（図示せず）に順次保持した後、合成回路１６に送出するようになされている。
【００１５】
動き画像成分復号化回路１３は、分離回路１１から連続的に送られてくる動き画像成分データＤ１３を復号することにより元の動き画像成分データＤ３に相当する動き画像成分データを復元した後、当該動き画像成分データに基づいて動きのあるワイヤフレームを生成し、これをワイヤフレームデータＤ１８として合成回路１６に送出する。
【００１６】
合成回路１６は、動き画像成分復号化回路１３から供給されたワイヤフレームデータＤ１８と、画像データ保持回路１５から順次供給された基準の画像データＤ１６とを合成することにより、顔画像の口元が音声に合わせて動くような合成画像データＤ１９を生成し、これを表示画像として表示部（図示せず）を介して出力する。
【００１７】
音声復号化回路１２は、音声符号化データＤ１５を復号することにより元の音声データＤ４に相当する音声データＤ１７を復元し、これをアナログ変換した後に、合成回路１６から表示部を介して出力される表示画像にタイミングを合わせてスピーカ（図示せず）から音声として出力する。
【００１８】
【発明が解決しようとする課題】
ところでかかる構成のテレビジョン電話システム１においては、送信側端末２が１フレーム分ずつ画像データＤ１及びその動き画像成分について圧縮符号化して受信側端末３へ順次送信する必要があると共に、ユーザの音声データＤ４についても画像データＤ１及びその動き画像成分とは別個に圧縮符号化して受信側端末３へ送信する必要があり、非常に多くのデータ伝送量を要すると共に多大な伝送時間を要してリアルタイムな処理を実行し得ないという問題があった。
【００１９】
またテレビジョン電話システム１においては、画像データＤ１及びその動き画像成分の圧縮符号化処理、音声データＤ４の圧縮符号化処理を要すると共に、それに対応する復号処理を要することにより高速かつ大量のディジタル処理が必要であり、それに伴って送信側端末２及び受信側端末３の構成が複雑になると共に多大な消費電力を要するという問題があった。
【００２０】
本発明は以上の点を考慮してなされたもので、簡易な構成及び低消費電力でリアルタイムな処理を実行し得る画像伝送システムを提案しようとするものである。
【００２１】
【課題を解決するための手段】
かかる課題を解決するため本発明においては、送信装置及び受信装置によって構成される画像伝送システムにおいて、送信装置は、マイクロフォンによって集音したユーザの音声データを受信装置へ送信する音声データ送信手段と、ユーザの顔を撮影することにより得られた基準となる１フレーム分の顔画像データを複数のポイントに分割して結ぶことによりワイヤフレームでなる顔画像モデルを生成し、これを受信装置へ送信する顔画像モデル送信手段と、ユーザの音声が発せられたときに顔の表情が変化するという相関関係を考慮し、音声データを受信装置へ送信している間だけ、顔画像モデルの目元部分の各ポイントを分析パラメータとして用い、その時間的変化を上記ワイヤフレームの動き画像成分データとして抽出し、これを受信装置へ送信する動き画像成分データ送信手段とを具え、受信装置は、顔画像モデルを保持する顔画像モデル保持手段と、動き画像成分データに基づいて顔画像モデルの目元部分について動きのある目元部分動き画像データを生成する目元部分動き画像データ生成手段と、音声データに基づいて顔画像モデルの口元部分の動き状態を解読することにより口元部分動き画像データを生成する口元部分動き画像データ生成手段と、顔画像モデルに対して、目元部分動き画像データ及び口元部分動き画像データを合成することにより合成画像を生成する合成手段と、合成画像を表示する表示手段とを具えるようにする。
【００２２】
これにより、送信側ではユーザの音声データと、１フレーム分の顔画像モデルと、ユーザの音声が発せられたときに顔の表情が変化するという相関関係を考慮し、音声データを受信装置へ送信している間だけ顔画像モデルの目元部分の各ポイントを分析パラメータとして用い、その時間的変化を示す上記ワイヤフレームの動き画像成分データとを送信し、受信側では、動き画像成分データに基づいて目元部分動き画像データを生成し、音声データに基づいて口元部分動き画像データを生成した後、顔画像モデルと合成することにより、送信装置から受信装置へのデータ伝送量を低減しつつ、目元部分と口元部分との間に表情として相関関係を持たせた状態の表情豊かな顔画像を表示することができる。
【００２５】
【発明の実施の形態】
以下図面について、本発明の一実施の形態を詳述する。
【００２６】
（１）第１の実施の形態
図６との対応部分に同一符号を付して示す図１において、２０は全体として第１の実施の形態におけるテレビジョン電話システムを示し、送信側端末２１及び受信側端末２２によって構成されている。因みにテレビジョン電話システム２０では、送信側端末２１が受信側となり、受信側端末２２が送信側となってもよい。
【００２７】
送信側端末２１は、マイクロフォン（図示せず）で集音した後にディジタル変換したユーザの音声データＤ４を音声圧縮符号化回路７に送出する。音声圧縮符号化回路７は、音声データＤ４を所定の圧縮符号化方法によって圧縮符号化した後、これを言葉情報に相当する音声符号化データＤ５として多重化回路６に送出する。
【００２８】
また送信側端末２１は、画像データベース２３に予め格納しておいたユーザ自身の顔画像を表す基準となる画像データＤ２３を通信開始時の最初に１フレーム分だけ読み出し、これを画像圧縮符号化回路２４に送出する。画像圧縮符号化回路２４は、画像データＤ２３を所定の圧縮符号化方法によって圧縮符号化することにより画像符号化データＤ２４を生成し、これを多重化回路６に送出する。
【００２９】
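The multiplexing circuit 6 multiplexes the image encoded data D24 and the audio encoded data D5 and sends the resulting multiplexed data D25 to the modulation circuit 8. The multiplexing circuit 6 performs this multiplexing only at the start of communication, when one frame of image encoded data D24 is supplied from the image compression encoding circuit 24. Thereafter no image encoded data D24 is supplied, so it sends only the audio encoded data D5 to the modulation circuit 8.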
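The behaviour of the multiplexing circuit 6 in this embodiment (reference image in the first unit only, audio alone afterwards) can be sketched as a small generator. This is only an illustration under assumptions: the function name `outgoing_units` and the dictionary layout are hypothetical and do not come from the patent, which does not specify a data format.

```python
from typing import Dict, Iterable, Iterator


def outgoing_units(encoded_image: bytes,
                   audio_chunks: Iterable[bytes]) -> Iterator[Dict[str, bytes]]:
    """Yield one unit per audio chunk; only the first unit carries the image.

    Hypothetical stand-in for multiplexed data D25 followed by bare
    audio encoded data D5.
    """
    first = True
    for chunk in audio_chunks:
        if first:
            # Start of communication: image encoded data and audio together.
            yield {"image": encoded_image, "audio": chunk}
            first = False
        else:
            # Afterwards no image data is supplied, so audio goes out alone.
            yield {"audio": chunk}


units = list(outgoing_units(b"IMG", [b"a", b"b", b"c"]))
```

In this sketch the image cost is paid once, and every later unit is the same size as an ordinary voice-only transmission.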
【００３０】
変調回路８は、多重化データＤ２５及びそれ以降供給される音声符号化データＤ５を順次変調処理した後、これを送信データＤ２６として通信路９を介して受信側端末２２へ送信する。因みに通信路９としては、有線及び無線に特にこだわるものではなく、いずれであっても良い。
【００３１】
すなわち通信路９が無線通信路であるときには、送信側端末２１及び受信側端末２２として例えばカメラ付携帯電話機を用いたテレビジョン電話システム２０であることを想定し、通信路９が有線通信路であるときには、送信側端末２１及び受信側端末２２として例えば家庭に設置されるカメラ付電話機を用いたテレビジョン電話システム２０であることを想定している。
【００３２】
受信側端末２２は、通信路９を介して送信されてきた送信データＤ２６を受信データＤ２７として受信して復調回路１０に送出する。復調回路１０は、受信データＤ２７に対して復調処理を施すことにより復調データＤ２８を得、これを分離回路１１に送出する。
【００３３】
なお実際上、復調回路１０は通信路９を介して受信した受信データＤ２７を復調する際、通信路９上で生じるデータ誤りの検出及び訂正を行っているが、ここでは説明の便宜上省略する。
【００３４】
分離回路１１は、送信側端末２１の多重化データＤ２５に相当する復調データＤ２８を多重化処理とは逆の手順で分離処理することにより、元の画像符号化データＤ２４及び音声符号化データＤ５にそれぞれ対応する画像符号化データＤ２９及び音声符号化データＤ３０に分離し、当該音声符号化データＤ３０を音声復号化回路１２及び画像合成部２７の動き画像保持回路２５に送出すると共に、画像符号化データＤ２９を画像復号化回路１４に送出する。
【００３５】
因みに分離回路１１は、最初に復調データＤ２８を画像符号化データＤ２９及び音声符号化データＤ３０に分離した後には、それ以降の復調データＤ２８に画像符号化データＤ２９が多重化されていることはないので、音声符号化データＤ３０だけを音声復号化回路１２及び動き画像保持回路２５に送出するようになされている。
【００３６】
画像復号化回路１４は、分離回路１１から最初にのみ送られてくる画像符号化データＤ２９を復号することにより画像データＤ２３に相当する元の１フレーム分の基準となる画像データＤ３１を復元し、これを静止画記憶手段としての画像データ保持回路１５に送出する。
【００３７】
画像データ保持回路１５は、画像復号化回路１４から送られてきた基準の画像データＤ３１を内部メモリ（図示せず）に一旦保持した後、画像合成部２７の合成回路２６に送出するようになされている。
【００３８】
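The motion image holding circuit 25, which serves as the motion image generating means, holds in an internal memory (not shown) a plurality of items of motion image data D32, each representing, for example, a mouth movement corresponding to a code pattern of the audio encoded data D30 that is sequentially sent from the separation circuit 11 as word information. It reads out the motion image data D32 corresponding to the audio encoded data D30 and sends it to the synthesis circuit 26.
【００３９】
Here, the audio encoded data D30 is data that has been digitally compressed and encoded on the basis of a model of the shape of the mouth when a person utters speech. Because the motion image holding circuit 25 holds in its internal memory, in advance, a plurality of items of motion image data D32 representing, for example, mouth movements corresponding to the code patterns of the audio encoded data D30, it can immediately read out the motion image data D32 that corresponds to a given code pattern and send it to the synthesis circuit 26, which serves as the synthesizing means.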
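The lookup performed by the motion image holding circuit 25 amounts to a table indexed by audio code pattern. The sketch below illustrates that idea only. The integer code values, the shape names, and the fallback to a closed mouth for unknown codes are all assumptions made for the example; the patent does not describe the codec or the stored images in this detail.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical table: audio code pattern -> stored mouth motion image.
MOUTH_SHAPES: Dict[int, str] = {
    0: "closed",
    1: "open_wide",
    2: "rounded",
    3: "spread",
}


def mouth_frames(audio_codes: Iterable[int],
                 table: Dict[int, str] = MOUTH_SHAPES,
                 default: str = "closed") -> Iterator[str]:
    """Yield the stored mouth image for each incoming audio code.

    Unknown codes fall back to `default` (an assumption of this sketch).
    """
    for code in audio_codes:
        yield table.get(code, default)


frames = list(mouth_frames([1, 2, 9, 0]))
```

Because each step is a table read rather than a decode, the receiver does no per-frame image decoding, which is the saving the description attributes to this circuit.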
【００４０】
合成回路２６は、画像データ保持回路１５から供給された基準の画像データＤ３１に対して、動き画像保持回路２５から順次供給された動き画像データＤ３２を重ねて合成することにより、基準の顔画像に対して口元だけを動かしたような合成画像データＤ３３を生成し、これを表示画像として表示部（図示せず）を介して出力する。
【００４１】
音声復号化回路１２は、音声符号化データＤ３０を復号することにより元の音声データＤ３４を復元し、これをアナログ変換した後に、合成回路２６から表示部を介して出力される表示画像にタイミングを合わせてスピーカ（図示せず）から音声として出力する。
【００４２】
以上の構成において、第１の実施の形態におけるテレビジョン電話システム２０においては、送信側端末２１が通信開始時の最初だけ画像データベース２３から読み出した１フレーム分の基準となる顔画像の画像データＤ２３を圧縮符号化して受信側端末２２に送信し、音声データＤ４についても順次圧縮符号化して受信側端末２２に送信する。
【００４３】
これにより送信側端末２１は、ユーザ自身の顔画像を撮影しながら圧縮符号化して受信側端末２２に毎フレームずつ送信する必要はなく、予め画像データベース２３に格納してある自分の顔画像の画像データＤ２３を通信開始時の最初にだけ圧縮符号化して受信側端末２２に送信すればよく、データ伝送量を格段に低減することができる。
【００４４】
また送信側端末２１は、従来の送信側端末２（図６）と比較して動き画像成分抽出回路５が不要となる分だけ回路構成を簡素化できると共に、複雑なディジタル信号処理についてもその処理量を低減することができるので、消費電力を一段と低減することができる。
【００４５】
これに対して受信側端末２２は、受信データＤ２７を復調して分離した後、画像復号化回路１４によって通信開始時の最初だけ送られてくる基準となる顔画像の画像データＤ３１を復元すると共に、動き画像保持回路２５によって音声符号化データＤ３０に対応した口元の動きを表す動き画像データＤ３２を読み出し、基準となる顔画像の画像データＤ３１に口元の動きを表す動き画像データＤ３２を重ねて合成することにより合成画像データＤ３３を生成した後、表示画像として出力する。
【００４６】
このとき受信側端末２２は、音声復号化回路１２によって音声符号化データＤ３０を復号することにより元の音声データＤ３４を復元し、これをアナログ変換した後に、合成回路２６で合成された表示画像とタイミングを合わせて音声として出力する。
【００４７】
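In this way, the reception side terminal 22 displays a display image in which only the mouth portion of the character moves in time with the user's voice, producing the visual effect that the character is speaking with the voice of the user of the transmission side terminal 21.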
【００４８】
このときも受信側端末２２は、従来の受信側端末３（図６）と比較して動き画像成分復号化回路１３が不要となる分だけ回路構成を簡素化できると共に、ユーザの顔画像に関する復号処理を最初に１回だけ行えば済むので複雑なディジタル信号処理についてもその処理量を低減することができ、かくして消費電力を一段と低減することができる。
【００４９】
以上の構成によれば、第１の実施の形態におけるテレビジョン電話システム２０は、送信側端末２１が音声データＤ４を圧縮符号化して受信側端末２２に送信すると共に、予め画像データベース２３に格納してある自分の顔画像の画像データＤ２３を通信開始時の最初にだけ圧縮符号化して受信側端末２２に送信することにより、データ伝送量を従来の送信側端末２と比較して格段に低減すると共に消費電力を一段と低減することができる。
【００５０】
またテレビジョン電話システム２０は、受信側端末２２が送信側端末２１から最初にだけ送られてくる画像符号化データＤ２９を１度だけ復号することにより得られた顔画像の画像データＤ３１に対して、音声符号化データＤ３０に対応した口元部分の動きを表す動き画像データＤ３２を動き画像保持回路２５によって読み出して合成することにより、あたかも動画像のように送信側端末２１のユーザが喋っているような表示画像を表示することができる。
【００５１】
このとき受信側端末２２は、画像復号処理が１度だけで済むと共に、従来の受信側端末３のような動き画像成分復号化回路１３による動き画像成分復号化処理が不要になる分だけディジタル信号処理の処理量を格段に低減すると共に消費電力を一段と低減することができ、かくして音声のタイミングと一致して口元部分が動く表示画像をリアルタイムに表示することができる。
【００５２】
（２）第２の実施の形態
図１との対応部分に同一符号を付して示す図２において、４０は全体として第２の実施の形態におけるテレビジョン電話システムを示し、送信側端末４１及び受信側端末４２によって構成されている。因みにテレビジョン電話システム４０では、送信側端末４１が受信側となり、受信側端末４２が送信側となってもよい。
【００５３】
送信側端末４１は、マイクロフォン（図示せず）で集音した後にディジタル変換したユーザの音声データＤ４を音声圧縮符号化回路７に送出する。音声圧縮符号化回路７は、音声データＤ４を所定の圧縮符号化方法によって圧縮符号化した後、これを音声符号化データＤ５として多重化回路６に送出する。
【００５４】
また送信側端末４１は、画像データベース４２に予め格納しておいた例えばキャラクタの顔画像の画像データに対応した画像識別情報Ｄ４２を通信開始時の最初にだけ読み出し、これを多重化回路６に送出する。
【００５５】
多重化回路６は、画像識別情報Ｄ４２及び音声符号化データＤ５を多重化処理し、その結果得られる多重化データＤ４３を変調回路８に送出する。ここで多重化回路６は、通信開始時の最初に画像データベース４２から読み出された画像識別情報Ｄ４２が供給されたときのみ、当該画像識別情報Ｄ４２と音声符号化データＤ５とを多重化処理するが、それ以降は画像識別情報Ｄ４２が供給されることはないので音声符号化データＤ５だけを変調回路８に送出する。
【００５６】
変調回路８は、多重化データＤ４３及びそれ以降供給される音声符号化データＤ５を順次変調処理した後、これを送信データＤ４４として通信路９を介して受信側端末４２へ送信する。因みに通信路９としては、有線及び無線に特にこだわるものではなく、いずれであっても良い。
【００５７】
受信側端末４２は、通信路９を介して送信されてきた送信データＤ４４を受信データＤ４５として受信して復調回路１０に送出する。復調回路１０は、受信データＤ４５に対して復調処理を施すことにより復調データＤ４６を得、これを分離回路１１に送出する。
【００５８】
なお実際上、復調回路１０は通信路９を介して受信した受信データＤ４５を復調する際、通信路９上で生じるデータ誤りの検出及び訂正を行っているが、ここでは説明の便宜上省略する。
【００５９】
分離回路１１は、送信側端末４１の多重化データＤ４３に相当する復調データＤ４６を多重化処理とは逆の手順で分離処理することにより、元の画像識別情報Ｄ４２及び音声符号化データＤ５にそれぞれ対応する画像識別情報Ｄ４７及び音声符号化データＤ４８に分離し、当該音声符号化データＤ４８を音声復号化回路１２及び画像合成部４９のワイヤフレーム生成回路４４に送出すると共に、画像識別情報Ｄ４７を静止画記憶手段としての画像データベース４３に送出する。
【００６０】
因みに分離回路１１は、最初に復調データＤ４６を画像識別情報Ｄ４７及び音声符号化データＤ４８に分離した後には、それ以降の復調データＤ４６に画像識別情報Ｄ４７が多重化されていることはないので、音声符号化データＤ４８だけを音声復号化回路１２及びワイヤフレーム生成回路４４に送出するようになされている。
【００６１】
動き画像生成手段としてのワイヤフレーム生成回路４４は、分離回路１１から順次送られてくる言葉情報としての音声符号化データＤ４８に基づいてキャラクタの顔画像の例えば口元の動き状態を解読し、内部メモリ（図示せず）から解読結果に対応した動きのあるワイヤフレームを読み出し、これを動き画像に相当するワイヤフレームデータＤ５０として合成手段に相当する合成回路２６に送出する。
【００６２】
ここでワイヤフレームとは、例えばキャラクタの顔画像の顔部分を複数のポイントに分割し、それら複数のポイントを結ぶことにより生成される顔画像モデルのことであり、顔部分の中で特に動きのある目元や口元にポイントが数多く配置されている。
【００６３】
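The image database 43 of the reception side terminal 42 stores image data identical in content to that in the image database 42 of the transmission side terminal 41. It reads out the image data D49 representing the character's face image that corresponds to the image identification information D47 supplied from the separation circuit 11, and sends it to the synthesis circuit 26.
【００６４】
In other words, the image data of the character's face image that the image identification information D42 designates in the image database 42 of the transmission side terminal 41 is identical in content to the image data D49 read out from the image database 43 of the reception side terminal 42.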
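The exchange of image identification information in place of image data can be sketched with two stores that hold the same content. This is a minimal illustration, assuming integer identifiers and byte strings as stand-ins for wire frame images; the class `ImageDatabase` and its method names are hypothetical.

```python
from typing import Dict


class ImageDatabase:
    """Hypothetical store of reference face images keyed by a short identifier."""

    def __init__(self, images: Dict[int, bytes]) -> None:
        self._images = dict(images)

    def identifier_for(self, image: bytes) -> int:
        """Return the identifier under which `image` is stored."""
        for image_id, stored in self._images.items():
            if stored == image:
                return image_id
        raise KeyError("image is not in the database")

    def lookup(self, image_id: int) -> bytes:
        """Return the image stored under `image_id`."""
        return self._images[image_id]


# Both terminals are assumed to hold identical content.
CATALOG = {1: b"wire frame of character A", 2: b"wire frame of character B"}
sender_db = ImageDatabase(CATALOG)
receiver_db = ImageDatabase(CATALOG)

# Only the identifier crosses the communication path.
image_id = sender_db.identifier_for(b"wire frame of character B")
restored = receiver_db.lookup(image_id)
```

The scheme works only while both stores stay identical, which is the precondition the description states for the two image databases.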
【００６５】
ここで画像データＤ４９も、ワイヤフレームであり、キャラクタの顔部分における各ポイントの配置場所は、ワイヤフレーム生成回路４４によって生成されたワイヤフレームデータＤ５０の各ポイントと一致している。
【００６６】
合成回路２６は、画像データベース４３から供給されたキャラクタの顔画像を表すワイヤフレームの画像データＤ４９に対してワイヤフレームデータＤ５０を合成し、ワイヤフレームデータＤ５０に応じて動き部分の画像ひずみ分を補正することによりキャラクタの口元部分が音声に合わせて動くような合成画像データＤ５１を生成し、これを表示画像として表示部（図示せず）を介して出力する。
【００６７】
音声復号化回路１２は、音声符号化データＤ４８を復号することにより元の音声データＤ５２を復元し、これをアナログ変換した後に、合成回路２６から表示部を介して出力される表示画像にタイミングを合わせてスピーカ（図示せず）から音声として出力する。
【００６８】
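With the above configuration, in the television telephone system 40, the transmission side terminal 41 reads out from the image database 42, only at the start of communication, the image identification information D42 indicating the image data of the character's face image and transmits it to the reception side terminal 42, and also sequentially compresses and encodes the audio data D4 and transmits it to the reception side terminal 42.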
【００６９】
これにより送信側端末４１は、ユーザ自身の顔画像を撮影しながら圧縮符号化して送信する必要はなく、予め画像データベース４２に格納してあるキャラクタの顔画像データを示す画像識別情報Ｄ４２を通信開始時の最初だけ受信側端末４２に送信すればよいので、１枚だけ画像データＤ２３を送信する第１の実施の形態における送信側端末２１よりもさらにデータ伝送量を低減することができる。
【００７０】
また送信側端末４１は、従来の送信側端末２（図６）と比較して画像圧縮符号化回路４及び動き画像成分抽出回路５が不要となる分だけ回路構成を簡素化できると共に、複雑なディジタル信号処理についてもその処理量を低減することができるので、消費電力を一段と低減することができる。
【００７１】
これに対して受信側端末４２は、受信データＤ４５を復調して分離した後、画像データベース４３から通信開始時の最初だけ送られてきた画像識別情報Ｄ４７に対応するキャラクタの顔画像を表す画像データＤ４９を読み出すと共に、ワイヤフレーム生成回路４４によって音声符号化データＤ４８に対応するワイヤフレームデータＤ５０を生成する。
【００７２】
そして受信側端末４２は、ワイヤフレームデータＤ５０に基づいてキャラクタの顔画像における動き部分の画像ひずみ分を補正することによりキャラクタの口元部分が送信側端末４１のユーザの音声と同じように動く合成画像データＤ５１を生成し、これを表示画像として出力すると共に、音声復号化回路１２によって復号した音声を表示画像とタイミングを合わせてスピーカから出力する。
【００７３】
このように受信側端末４２は、キャラクタの口元部分をユーザの音声に合わせて動した表示画像を表示することにより、あたかも送信側端末４１のユーザの音声に合わせてキャラクタが喋っているような画像効果をもたらすことができる。
【００７４】
このときも受信側端末４２は、従来の受信側端末３（図６）と比較して動き画像成分復号化回路１３が不要となる上に、第１の実施の形態における受信側端末２２の画像復号化回路１４が不要となる分だけ回路構成をさらに簡素化できると共に、複雑なディジタル信号処理についてもその処理量をさらに低減することができ、かくして消費電力をより一段と低減することができる。
【００７５】
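With the above configuration, in the television telephone system 40 according to the second embodiment, the transmission side terminal 41 compresses and encodes the audio data D4 and transmits it to the reception side terminal 42, and transmits the image identification information D42 of the character image data stored in advance in the image database 42 to the reception side terminal 42 only at the start of communication. Compared with the transmission side terminal 21 of the first embodiment, this further reduces both the amount of data transmitted and the power consumption.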
【００７６】
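In the television telephone system 40, the reception side terminal 42 reads out the character image data D49 based on the image identification information D42 that is sent from the transmission side terminal 41 only once at the start, generates wire frame data D50 corresponding to the audio encoded data D48, and combines it with the character image data D49. It can thereby generate a display image in which the character appears to speak in time with the voice of the user of the transmission side terminal 41.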
【００７７】
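Because the reception side terminal 42 requires no image decoding at all, its digital signal processing load is lower than that of the reception side terminal 22 of the first embodiment and its power consumption is further reduced. It can thus display, in real time, a display image in which the mouth portion moves in step with the timing of the voice.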
【００７８】
（３）第３の実施の形態
図１との対応部分に同一符号を付して示す図３において、６０は全体として第３の実施の形態におけるテレビジョン電話システムを示し、送信側端末６１及び受信側端末６２によって構成されている。
【００７９】
送信側端末６１は、マイクロフォン（図示せず）で集音した後にディジタル変換したユーザの音声データＤ４を音声圧縮符号化回路７に送出する。音声圧縮符号化回路７は、音声データＤ４を所定の圧縮符号化方法によって圧縮符号化した後、これを言葉情報に相当する音声符号化データＤ５として変調回路８に送出する。
【００８０】
変調回路８は、音声圧縮符号化回路７から順次供給される音声符号化データＤ５を変調処理した後、これを送信データＤ６１として通信路９を介して受信側端末６２へ送信する。
【００８１】
この場合の送信側端末６１は、通常の携帯電話機と同様の回路構成であり、特にカメラ付携帯電話やテレビジョン電話システム６０特有の送信側端末である必要はなく、一般的な携帯電話機と同様の構成を有していれば良く、また通信路９に関しても、有線及び無線に特にこだわるものではなく、いずれであっても良い。
【００８２】
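The reception side terminal 62 receives the transmission data D61 sent via the communication path 9 as reception data D62 and sends it to the demodulation circuit 10. The demodulation circuit 10 demodulates the reception data D62 to obtain audio encoded data D63, corresponding to the audio encoded data D5 of the transmission side terminal 61, and sends it to the audio decoding circuit 12 and to the wire frame generation circuit 63 of the image synthesis unit 69.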
【００８３】
動き画像生成手段としてのワイヤフレーム生成回路６３は、復調回路１０から順次送られてくる言葉情報としての音声符号化データＤ６３に基づいてキャラクタの顔画像の例えば口元の動き状態を解読して動きのあるワイヤフレームを生成し、これを動き画像に相当するワイヤフレームデータＤ６４として合成手段に相当する合成回路６４に送出する。
【００８４】
ここでワイヤフレームとは、例えばキャラクタの顔画像の顔部分を複数のポイントに分割し、それら複数のポイントを結ぶことにより生成される顔画像モデルのことであり、顔部分の中で特に動きのある目元や口元にポイントが数多く配置されている。
【００８５】
一方、静止画記憶手段としての画像データベース６５は、予め決められた所定のキャラクタの顔画像を表す画像データＤ６５を読み出して合成回路６４に送出するようになされている。ここで画像データＤ６５も、ワイヤフレームであり、キャラクタの顔部分における各ポイントの配置場所は、ワイヤフレーム生成回路６３によって生成されたワイヤフレームデータＤ６４の各ポイントと一致している。
【００８６】
合成回路６４は、ワイヤフレーム生成回路６３からワイヤフレームデータＤ６４の供給を受けると同時に、画像データベース６５から予め決められた所定のキャラクタの顔画像を表す画像データＤ６５の供給を受け、当該画像データＤ６５に対してワイヤフレームデータＤ６４を合成し、ワイヤフレームデータＤ６４に応じて動き部分の画像ひずみ分を補正することにより、キャラクタの口元部分が音声に合わせて動くような合成画像データＤ６６を生成し、これを表示画像として表示部（図示せず）を介して出力する。
【００８７】
音声復号化回路１２は、音声符号化データＤ６３を復号することにより元の音声データＤ６７を復元し、これをアナログ変換した後に、画像合成回路６４から表示部を介して出力される表示画像にタイミングを合わせてスピーカ（図示せず）から音声として出力する。
【００８８】
以上の構成において、テレビジョン電話システム６０においては送信側端末６１が画像データを送信する必要はなく通常の音声データＤ４だけを順次圧縮符号化して受信側端末６２へ送信するだけで良いので、第１及び第２の実施の形態における送信側端末２１及び４１よりもデータ伝送量を低減し得ると同時に、通常の音声通話だけを行う一般的な携帯電話機と同等のデータ伝送量に抑えることができる。
【００８９】
これに対して受信側端末６２は、予め画像データベース６５に保持しているキャラクタの画像データＤ６５を読み出し、音声符号化データＤ６３に基づく口元の動き状態を表すワイヤフレームデータＤ６４を画像データＤ６５に合成することにより、キャラクタの口元部分が音声に合わせて動くような合成画像データＤ６６を生成し、これを表示画像として出力することができる。
【００９０】
このように受信側端末６２は、第２の実施の形態における受信側端末４２のように分離回路１１を必要としない分と、画像復号処理を必要としない分だけ回路構成をさらに簡素化し得ると共に、複雑なディジタル信号処理についてもその処理量をさらに低減することができるので、消費電力を一段と低減することができる。
【００９１】
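Furthermore, in the television telephone system 60, the transmission side terminal 61 and the reception side terminal 62 need not always be used as a pair. Even with an ordinary mobile phone equivalent to the transmission side terminal 61, as long as audio data can be received from that mobile phone, the reception side terminal 62 can display a display image in which the mouth portion of the character's face image moves in time with the voice, which further improves usability for the user.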
【００９２】
以上の構成によれば、第３の実施の形態におけるテレビジョン電話システム６０は、受信側端末６２が送信側端末６１から受信した音声符号化データＤ６３に基づいて口元部分の動き状態を表すワイヤフレームデータＤ６４を生成し、これを画像データベース６５に予め保持していたキャラクタの画像データＤ６５に合成して合成画像データＤ６６を生成し、これを表示画像として出力することにより、送信側端末６１から画像データを送信してもらうことなく音声データだけからキャラクタの口元部分が音声に合わせて動くような画像効果をもたらすことができる。
【００９３】
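In addition, the reception side terminal 62 has a simpler circuit configuration and a lower digital signal processing load than the reception side terminals 22 and 42 of the first and second embodiments, so its power consumption can be reduced still further. It can thus display, in real time, a display image in which the mouth portion moves in step with the timing of the voice.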
【００９４】
（４）第４の実施の形態
図６との対応部分に同一符号を付して示す図４において、８０は全体として第４の実施の形態におけるテレビジョン電話システムを示し、送信側端末８１及び受信側端末８２によって構成されている。因みにテレビジョン電話システム８０では、送信側端末８１が受信側となり、受信側端末８２が送信側となっても良い。
【００９５】
送信側端末８１は、マイクロフォン（図示せず）で集音した後にディジタル変換したユーザの音声データＤ４を音声圧縮符号化回路７に送出する。音声圧縮符号化回路７は、音声データＤ４を所定の圧縮符号化方法によって圧縮符号化した後、これを音声符号化データＤ５として多重化回路６及び動き画像成分抽出回路８３に送出する。
【００９６】
また送信側端末８１は、撮像手段（図示せず）によってユーザの顔を撮像した後にディジタル変換した画像データＤ１を画像圧縮符号化回路４及び動き画像成分抽出回路８３に順次送出する。
【００９７】
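Here, after sending one frame of reference image data D1 to the image compression encoding circuit 4 at the start of communication, the transmission side terminal 81 does not send the image data D1 of subsequent frames to the image compression encoding circuit 4, and sends the image data D1 only to the motion image component extraction circuit 83.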
【００９８】
画像圧縮符号化回路４は、画像データＤ１における顔部分を複数のポイントに分割し、それら複数のポイントを結ぶことによりワイヤフレームと呼ばれる基準の顔画像モデルを生成する。因みにワイヤフレームは、顔の中で特に動きのある目元部分や口元部分にポイントが数多く配置されている。
【００９９】
そして画像圧縮符号化回路４は、ワイヤフレーム化した画像データＤ１を所定の方式で圧縮符号化することにより画像符号化データＤ２を生成し、これを多重化回路６に送出する。
【０１００】
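The motion image component extraction circuit 83 also generates a reference face image model called a wire frame. Only while audio encoded data D5 is being supplied from the audio compression encoding circuit 7, it uses each point of a portion of the wire frame other than the mouth portion, for example the eye portion, as an analysis parameter, extracts the temporal change of the analysis parameters (that is, the difference between the previous frame and the current frame) as the motion image component of the wire frame, and compresses and encodes it to generate motion image component data D83, which it sends to the multiplexing circuit 6.
【０１０１】
That is, the motion image component extraction circuit 83 takes into account the correlation with the audio encoded data D5 supplied from the audio compression encoding circuit 7. On the assumption that facial expression changes when speech is uttered, it generates motion image component data D83 for the eye portion while speech is being uttered.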
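The gating described here, in which eye-portion motion is extracted only while audio encoded data is being supplied, can be sketched as follows. The sketch assumes that each frame is a list of (x, y) wire frame points, that a per-frame boolean flag says whether audio is being supplied, and that `eye_indices` selects the eye points. All of these names and representations are hypothetical; the compression encoding step is left out.

```python
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Point = Tuple[int, int]
Delta = Tuple[int, int, int]  # (point index, dx, dy)


def eye_motion_stream(frames: Iterable[Sequence[Point]],
                      voice_active: Iterable[bool],
                      eye_indices: Sequence[int]) -> Iterator[Optional[List[Delta]]]:
    """Yield eye-point deltas for frames with voice, and None otherwise.

    The delta is the difference between the previous and current frame.
    The first frame has no predecessor, so it also yields None.
    """
    prev: Optional[Sequence[Point]] = None
    for points, speaking in zip(frames, voice_active):
        if prev is not None and speaking:
            yield [(i, points[i][0] - prev[i][0], points[i][1] - prev[i][1])
                   for i in eye_indices]
        else:
            yield None  # nothing is extracted or sent for this frame
        prev = points


frames = [
    [(0, 0), (5, 5)],
    [(1, 0), (5, 6)],
    [(2, 0), (5, 6)],
]
result = list(eye_motion_stream(frames, [True, True, False], eye_indices=[0]))
```

Mouth points are deliberately not included in `eye_indices`, since the description has the receiver derive mouth motion from the audio data instead.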
【０１０２】
多重化回路６は、音声符号化データＤ５、画像符号化データＤ２及び動き画像成分データＤ８３を多重化処理し、その結果得られる多重化データＤ８４を変調回路８に送出する。
【０１０３】
ここで多重化回路６は、通信開始時の最初に画像圧縮符号化回路４から１フレーム分の画像符号化データＤ２が供給されたときのみ、当該画像符号化データＤ２、音声符号化データＤ５及び動き画像成分データＤ８３を多重化処理するが、それ以降は画像符号化データＤ２が供給されることはないので、音声符号化データＤ５及び動き画像成分データＤ８３だけを多重化処理することになる。
【０１０４】
変調回路８は、通信路９を介して送信するための所定の変調方式で多重化データＤ８４を変調処理した後、これを送信データＤ８５として通信路９を介して受信側端末８２へ送信する。因みに通信路９としては、有線及び無線に特にこだわるものではなく、いずれであっても良い。
【０１０５】
受信側端末８２は、通信路９を介して送信されてきた送信データＤ８５を受信データＤ８６として受信して復調回路１０に送出する。復調回路１０は、受信データＤ８６に対して復調処理を施すことにより復調データＤ８７を得、これを分離回路１１に送出する。
【０１０６】
なお実際上、復調回路１０は通信路９を介して受信した受信データＤ８６を復調する際、通信路９上で生じるデータ誤りの検出及び訂正を行っているが、ここでは説明の便宜上省略する。
【０１０７】
分離回路１１は、送信側端末８１の多重化データＤ８４に相当する復調データＤ８７を多重化処理の逆の手順で分離処理することにより、元の音声符号化データＤ５、画像符号化データＤ２及び動き画像成分データＤ８３にそれぞれ相当する音声符号化データＤ８８、画像符号化データＤ８９及び動き画像成分データＤ９０に分離し、音声符号化データＤ８８を音声復号化回路１２及び画像合成部８９のワイヤフレーム生成回路８５に送出し、動き画像成分データＤ９０を動き画像成分復号化回路８４に送出すると共に、画像符号化データＤ８９を画像復号化回路１４に送出する。
【０１０８】
因みに分離回路１１は、最初に復調データＤ８７を音声符号化データＤ８８、画像符号化データＤ８９及び動き画像成分データＤ９０に分離した後には、それ以降の復調データＤ８７に画像符号化データＤ８９が多重化されていることはないので、画像復号化回路１４に画像符号化データＤ８９を送出することはない。
【０１０９】
画像復号化回路１４は、分離回路１１から最初にのみ送られてくる画像符号化データＤ８９を復号することによりワイヤフレーム化された元の顔画像に相当する基準の画像データＤ１６を復元し、これを画像データ保持回路１５に送出する。
【０１１０】
画像データ保持回路１５は、画像復号化回路１４から送られてきた基準の画像データＤ１６を内部メモリ（図示せず）に一旦保持した後、画像合成部８９の合成回路８６に送出するようになされている。
【０１１１】
動き画像成分復号化回路８４は、分離回路１１から連続的に送られてくる動き画像成分データＤ９０を復号することにより元の動き画像成分データＤ８３に相当する動き画像成分データを復元した後、当該動き画像成分データに基づいて動きのあるワイヤフレームを生成し、これを動き成分画像に相当するワイヤフレームデータＤ９１として合成回路８６に送出する。この場合、ワイヤフレームデータＤ９１とは、ユーザの顔画像のうちで目元部分の動きを表したデータである。
【０１１２】
動き画像生成手段としてのワイヤフレーム生成回路８５は、分離回路１１から連続的に送られてくる言葉情報としての音声符号化データＤ８８に基づいて顔画像の例えば口元部分の動き状態を解読して動きのあるワイヤフレームを生成し、これを動き画像に相当するワイヤフレームデータＤ９２として合成手段に相当する合成回路８６に送出する。
【０１１３】
ここで、画像復号化回路１４によって復号されたワイヤフレームの画像データＤ１６と、動き画像成分復号化回路８４によって生成された目元部分の動きを表すワイヤフレームデータＤ９１と、ワイヤフレーム生成回路８５によって生成された口元部分の動きを表すワイヤフレームデータＤ９２とは、顔部分における各ポイントの配置場所が互いに一致している。
【０１１４】
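The synthesis circuit 86 combines the image data D16 of the face image from the transmission side terminal 81, supplied from the image data holding circuit 15, with the eye portion wire frame data D91 supplied from the motion image component decoding circuit 84 and the mouth portion wire frame data D92 supplied from the wire frame generation circuit 85. By correcting the image distortion of the moving portions in accordance with the wire frame data D91 and D92, it generates composite image data D93 in which the eye portion and the mouth portion move in time with the voice, and outputs this as a display image via a display unit (not shown).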
【０１１５】
音声復号化回路１２は、音声符号化データＤ８８を復号することにより元の音声データＤ１７を復元し、これをアナログ変換した後に、画像合成回路８６から表示部を介して出力される表示画像にタイミングを合わせてスピーカ（図示せず）から音声として出力する。
【０１１６】
以上の構成において、テレビジョン電話システム８０においては送信側端末８１が通信開始時の最初にユーザの顔画像を表す画像データＤ１を１フレーム分だけ圧縮符号化することにより画像符号化データＤ２を生成して多重化回路６に送出すると共に、ユーザの音声データＤ４を順次圧縮符号化することにより音声符号化データＤ５を生成して多重化回路６に送出する。
【０１１７】
このとき送信側端末８１は、音声が発せられたときに顔の表情が変化する場合が一般的に多いと考えられるので、音声が発せられている間の目元部分に関する動き画像成分データＤ８３を抽出して多重化回路６に送出する。
【０１１８】
そして送信側端末８１は、多重化回路６によって通信開始時の最初にのみ画像符号化データＤ２、音声符号化データＤ５及び動き画像成分データＤ８３を多重化し、変調処理して送信した後は、音声符号化データＤ５及び動き画像成分データＤ８３を多重化し、変調処理して送信する。
【０１１９】
このように送信側端末８１は、音声が発せられている間においては口元部分以外の目元部分に関する動き画像成分データＤ８３を抽出することにより、目元部分の動きを再現するための動き画像成分データＤ８３と、口元部分の動きを再現するための音声符号化データＤ５との間に相関関係を持たせている。
【０１２０】
これにより送信側端末８１は、基準となる１フレーム分の顔画像の画像データＤ１と、目元部分に関する動き画像成分データＤ８３と、口元部分に関する音声符号化データＤ５とが互いにデータとして無駄に重なりあうことなく、それぞれ必要最小限のデータ量として送信することができ、かくして従来の送信側端末２（図６）と比較してデータ伝送量を格段に低減すると共に消費電力を一段と低減することができる。
【０１２１】
これに対して受信側端末８２は、通信開始時の最初だけ送られてくる基準となる顔画像の画像データＤ１６を復元した後、音声符号化データＤ８８に基づいて生成された口元部分の動きを表すワイヤフレームデータＤ９２と、動き画像成分データＤ９０に基づいて生成されたワイヤフレームデータＤ９１とを顔画像の画像データＤ１６に重ねて合成し、ワイヤフレームデータＤ９１及びＤ９２に応じて動き部分の画像ひずみ分を補正することにより、音声に合わせて目元部分及び口元部分が動くような動き画像データＤ９３を生成して表示画像として出力する。
【０１２２】
これにより受信側端末８２は、送信側端末８１のユーザの音声に合わせて顔画像の口元を動かして喋っているような表示画像を表示するだけでなく、その時の目元部分の動きを表示することにより、表情豊かな顔画像を表示画像として表示することができる。
【０１２３】
このとき受信側端末８２は、基準となる１フレーム分の顔画像の画像データＤ１と、目元部分に関する動き画像成分データＤ８３と、口元部分に関する音声符号化データＤ５とがそれぞれ関連付けられており、それぞれ必要最小限のデータ処理量で表示画像を生成することができるので、従来の受信側端末３（図６）と比較してデータ処理量を格段に低減することができ、かくして消費電力を一段と低減することができる。
【０１２４】
以上の構成によれば、第４の実施の形態におけるテレビジョン電話システム８０は、送信側端末８１が基準となる１フレーム分の顔画像の画像データＤ１と、目元部分に関する動き画像成分データＤ８３と、口元部分に関する音声符号化データＤ５とが互いにデータとして無駄に重なりあうことなく関連付けて、それぞれ必要最小限のデータ量として送信することができ、かくして従来の送信側端末２と比較してデータ伝送量を格段に低減すると共に消費電力を一段と低減することができる。
【０１２５】
またテレビジョン電話システム８０は、受信側端末８２が基準となる１フレーム分の顔画像の画像データＤ１と、目元部分に関する動き画像成分データＤ８３と、口元部分に関する音声符号化データＤ５とをそれぞれ必要最小限のデータ処理量で表示画像を生成することができるので、従来の受信側端末３と比較してデータ処理量を格段に低減することができ、かくして消費電力を一段と低減することができる。
【０１２６】
このとき受信側端末８２は、基準の顔画像に対して口元部分及び目元部分を音声に合わせて動かして表示することができるので、表情豊かでより高次元な表示画像をリアルタイムに提供することができる。
【０１２７】
（５）他の実施の形態
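In the first embodiment described above, the reception side terminal 22 uses the motion image holding circuit 25 to read out the motion image data D32 representing the mouth movement corresponding to the audio encoded data D30, and the synthesis circuit 26 combines it with the reference image data D31. The present invention is not limited to this. A speech recognition circuit may be provided downstream of the audio decoding circuit 12 to recognize the speech. The recognition result may then be converted to text, or to phonetic symbols, and the corresponding motion image data D32 read out from the motion image holding circuit 25 and combined with the reference image data D31.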
【０１２８】
また上述の第１の実施の形態においては、送信側端末２１がユーザの顔画像を表す画像データＤ２３を圧縮符号化して受信側端末２２へ送信するようにした場合について述べたが、本発明はこれに限らず、キャラクタの顔を表す画像データＤ２３を圧縮符号化して送信するようにしても良い。
【０１２９】
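Further, in the first embodiment described above, the reception side terminal 22 uses the motion image holding circuit 25 to read out the motion image data D32 representing the mouth movement corresponding to the audio encoded data D30, and the synthesis circuit 26 combines it with the reference image data D31. The present invention is not limited to this. The transmission side terminal 21 may designate the motion image data representing the mouth movement corresponding to the audio encoded data D5 and transmit motion image identification information corresponding to the designated motion image data to the reception side terminal 22. The reception side terminal 22 then reads out the motion image data corresponding to the motion image identification information and combines it with the reference image data.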
【０１３０】
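In this case, in FIG. 5, in which parts corresponding to those in FIG. 1 are given the same reference numerals, the transmission side terminal 21 of the television telephone system 100 sends the audio encoded data D5, compressed and encoded by the audio compression encoding circuit 7, to the multiplexing circuit 6 and to the motion image database 101. The motion image database 101 identifies the motion image data representing the mouth movement that corresponds to the audio encoded data D5, and sends motion image identification information D101 corresponding to the identified motion image data to the multiplexing circuit 6.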
【０１３１】
多重化回路６は、画像符号化データＤ２４、音声符号化データＤ５及び動き画像識別情報Ｄ１０１を多重化処理し、その結果得られる多重化データＤ１０２を変調回路８に送出する。この場合も多重化回路６は、通信開始時の最初に画像圧縮符号化回路２４から１フレーム分の画像符号化データＤ２４が供給されたときのみ、当該画像符号化データＤ２４、音声符号化データＤ５及び動き画像識別情報Ｄ１０１を多重化処理するが、それ以降は画像符号化データＤ２４が供給されることはないので音声符号化データＤ５及び動き画像識別情報Ｄ１０１を多重化処理することによって得られる多重化データＤ１０２を変調回路８に送出する。変調回路８は、多重化データＤ１０２を順次変調処理した後、これを送信データＤ１０３として通信路９を介して受信側端末２２へ送信する。
【０１３２】
受信側端末２２は、通信路９を介して送信されてきた送信データＤ１０３を受信データＤ１０４として受信して復調回路１０に送出する。復調回路１０は、受信データＤ１０４に対して復調処理を施すことにより復調データＤ１０５を得、これを分離回路１１に送出する。分離回路１１は、復調データＤ１０５を分離処理することにより、元の画像符号化データＤ２４、音声符号化データＤ５及び動き画像識別情報Ｄ１０１にそれぞれ相当する画像符号化データＤ２９、音声符号化データＤ３０及び動き画像識別情報Ｄ１０６に分離し、音声符号化データＤ３０を音声復号化回路１２に送出し、画像符号化データＤ２９を画像復号化回路１４に送出すると共に、動き画像識別情報Ｄ１０６を画像合成部１０２の動き画像データベース１０３に送出する。
【０１３３】
動き画像データベース１０３は、送信側端末２１に設けられている動き画像データベース１０１と同一の動き画像データが格納されており、動き画像識別情報Ｄ１０６に対応する動き画像データＤ１０７を読み出して合成回路２６に送出する。合成回路２６は、画像データ保持回路１５から供給された基準の画像データＤ３１に対して、動き画像データベース１０３から順次供給された動き画像データＤ１０７を重ねて合成することにより、基準の顔画像に対して口元部分だけを動かしたような合成画像データＤ１０８を生成し、これを表示画像として表示部を介して出力するようになされている。
【０１３４】
さらに上述の第２の実施の形態においては、送信側端末４１がキャラクタの顔画像を表す画像データに対応した画像識別情報Ｄ４２を受信側端末４２へ送信するようにした場合について述べたが、本発明はこれに限らず、ユーザの顔画像を表す画像データに対応した画像識別情報Ｄ４２を受信側端末４２へ送信するようにしても良い。
【０１３５】
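Further, in the second to fourth embodiments described above, the reception side terminals 42, 62, and 82 use the wire frame generation circuits 44, 63, and 85 to generate the wire frame data D50, D64, and D92 representing the mouth movement corresponding to the voice, and combine these with the reference image data D49, D65, and D16. The present invention is not limited to this. Instead of the wire frame generation circuits 44, 63, and 85, a motion image holding circuit 25 like that of the reception side terminal 22 of the first embodiment may be used to read out motion image data representing the mouth movement corresponding to the audio encoded data D30 and combine it with the reference image data D49, D65, and D16. Alternatively, a speech recognition circuit may be provided downstream of the audio decoding circuit 12 to recognize the speech. The recognition result may then be converted to text, or to phonetic symbols, and the corresponding motion image data D32 read out from the motion image holding circuit 25 and combined with the reference image data D49, D65, and D16.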
【０１３６】
さらに上述の第１〜第４の実施の形態においては、送信側端末２１、４１、６１及び８１から受信した音声符号化データに対応する言葉情報に基づいて口元部分の動きを解読し、それに応じて口元部分が動く表示画像を表示するようにした場合について述べたが、本発明はこれに限らず、テキストデータや発音記号等に対応する言葉情報に基づいて口元部分が動く表示画像を表示するようにしても良い。
【０１３７】
さらに上述の第１〜第４の実施の形態においては、送信側端末２１、４１、６１及び８１から受信した言葉情報としての音声符号化データに基づいて口元部分の動きを解読し、それに応じた動きを表示画像として表示するようにした場合について述べたが、本発明はこれに限らず、例えば基地局と受信側端末２２、４２、６２及び８２との間で、当該受信側端末２２、４２、６２及び８２が基地局から受信した音声符号化データに対応する言葉情報に基づいて口元部分の動きを解読し、それに応じて口元部分が動く表示画像を表示するようにしても良い。
【０１３８】
【発明の効果】
上述のように本発明によれば、送信側ではユーザの音声データと、１フレーム分の顔画像モデルと、ユーザの音声が発せられたときに顔の表情が変化するという相関関係を考慮し、音声データを受信装置へ送信している間だけ顔画像モデルの目元部分の各ポイントを分析パラメータとして用い、その時間的変化を示す上記ワイヤフレームの動き画像成分データとを送信し、受信側では、動き画像成分データに基づいて目元部分動き画像データを生成し、音声データに基づいて口元部分動き画像データを生成した後、顔画像モデルと合成することにより、送信装置から受信装置へのデータ伝送量を低減しつつ、目元部分と口元部分との間に表情として相関関係を持たせた状態の表情豊かな顔画像を表示することができ、かくして簡易な構成及び低消費電力でリアルタイムな処理を実行し得る画像伝送システムを実現できる。
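[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of the television telephone system according to the first embodiment of the present invention.
[FIG. 2] A block diagram showing the configuration of the television telephone system according to the second embodiment of the present invention.
[FIG. 3] A block diagram showing the configuration of the television telephone system according to the third embodiment of the present invention.
[FIG. 4] A block diagram showing the configuration of the television telephone system according to the fourth embodiment of the present invention.
[FIG. 5] A block diagram showing the configuration of a television telephone system according to another embodiment.
[FIG. 6] A block diagram showing the configuration of a conventional television telephone system.
[Explanation of reference numerals]
1, 20, 40, 60, 80, 100: television telephone system; 2, 21, 41, 61, 81: transmission side terminal; 3, 22, 42, 62, 82: reception side terminal; 4, 24: image compression encoding circuit; 5, 83: motion image component extraction circuit; 6: multiplexing circuit; 7: audio compression encoding circuit; 8: modulation circuit; 9: communication path; 10: demodulation circuit; 11: separation circuit; 12: audio decoding circuit; 13, 84: motion image component decoding circuit; 14: image decoding circuit; 15: image data holding circuit; 16, 26, 64, 86: synthesis circuit; 25: motion image holding circuit; 23, 42, 43, 65: image database; 44, 63, 85: wire frame generation circuit; 101, 103: motion image database.
[0001]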
BACKGROUND OF THE INVENTION
The present invention relates to an image transmission system and is suitable for application to, for example, a television telephone system in which users transmit and receive images together with voice.
[0002]
[Prior art]
Conventionally, as shown in FIG. 6, the television telephone system 1 is composed of a transmission side terminal 2 and a reception side terminal 3. The transmission side terminal 2 images the user's face with an imaging means (not shown), converts the result to digital form, and sequentially sends the resulting image data D1, one frame at a time, to the image compression encoding circuit 4 and the motion image component extraction circuit 5.
[0003]
The image compression encoding circuit 4 generates image encoded data D2 by compressing and encoding the image data D1 by a predetermined method, and sends this to the multiplexing circuit 6.
[0004]
The motion image component extraction circuit 5 first divides the face portion of the image data D1 into a plurality of points and connects those points to generate a reference face image model called a wire frame. Incidentally, the wire frame has many points placed around the eyes and the mouth, the parts of the face that move the most.
[0005]
The motion image component extraction circuit 5 then uses each point of the wire frame as an analysis parameter and extracts the temporal change of the analysis parameters (that is, the difference between the previous frame and the current frame) as the motion image component of the wire frame. It compresses and encodes this component to generate motion image component data D3 representing, for example, the movement of the mouth portion, and sends it to the multiplexing circuit 6.
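The frame-to-frame difference described in this paragraph can be sketched in a few lines. The representation of a wire frame as a list of integer (x, y) points, and the function names `motion_components` and `apply_motion`, are assumptions made for illustration. The patent does not specify the point format, and the compression encoding step is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]
Delta = Tuple[int, int, int]  # (point index, dx, dy)


def motion_components(prev: Sequence[Point], curr: Sequence[Point]) -> List[Delta]:
    """Return (index, dx, dy) for every wire frame point that moved."""
    if len(prev) != len(curr):
        raise ValueError("both frames must use the same wire frame points")
    deltas: List[Delta] = []
    for i, ((px, py), (cx, cy)) in enumerate(zip(prev, curr)):
        dx, dy = cx - px, cy - py
        if dx or dy:  # unmoved points contribute nothing
            deltas.append((i, dx, dy))
    return deltas


def apply_motion(prev: Sequence[Point], deltas: Sequence[Delta]) -> List[Point]:
    """Rebuild the current frame from the previous frame and the deltas."""
    points = list(prev)
    for i, dx, dy in deltas:
        x, y = points[i]
        points[i] = (x + dx, y + dy)
    return points


previous = [(0, 0), (10, 10), (20, 5)]
current = [(0, 0), (11, 9), (20, 5)]
deltas = motion_components(previous, current)
```

Only points that actually moved appear in the result, which is why a mostly still face produces little motion data.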
[0006]
Further, the transmission side terminal 2 collects the user's voice with a microphone (not shown), converts it to digital form, and sends the resulting audio data D4 to the audio compression encoding circuit 7. The audio compression encoding circuit 7 compresses and encodes the audio data D4 by a predetermined compression encoding method and sends the result to the multiplexing circuit 6 as audio encoded data D5.
[0007]
The multiplexing circuit 6 multiplexes the image encoded data D2, the motion image component data D3, and the audio encoded data D5, and sends the multiplexed data D6 obtained as a result to the modulation circuit 8.
[0008]
The modulation circuit 8 modulates the multiplexed data D6 by a predetermined modulation method suited to transmission over the communication path 9, and transmits the result to the reception side terminal 3 via the communication path 9 as transmission data D7. Incidentally, the communication path 9 is not limited to either wired or wireless and may be either.
[0009]
That is, when the communication path 9 is a wireless communication path, the television telephone system 1 is assumed to use, for example, camera-equipped mobile phones as the transmission side terminal 2 and the reception side terminal 3. When the communication path 9 is a wired communication path, the system is assumed to use, for example, camera-equipped telephones installed in homes.
[0010]
The reception side terminal 3 receives the transmission data D7 sent via the communication path 9 as reception data D8 and sends it to the demodulation circuit 10. The demodulation circuit 10 demodulates the reception data D8 to obtain demodulated data D9 and sends it to the separation circuit 11.
[0011]
In practice, the demodulation circuit 10 also detects and corrects data errors arising on the communication path 9 when it demodulates the reception data D8, but the description of this is omitted here for convenience.
[0012]
The separation circuit 11 separates the demodulated data D9, which corresponds to the multiplexed data D6 of the transmission side terminal 2, by reversing the multiplexing procedure. This yields image encoded data D12, motion image component data D13, and audio encoded data D15, corresponding respectively to the original image encoded data D2, motion image component data D3, and audio encoded data D5. The separation circuit 11 sends the audio encoded data D15 to the audio decoding circuit 12, the motion image component data D13 to the motion image component decoding circuit 13, and the image encoded data D12 to the image decoding circuit 14.
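Multiplexing three streams and separating them again by the reverse procedure can be illustrated with a simple tagged, length-prefixed framing. This is a sketch under stated assumptions, not the format used in the patent, which leaves the multiplexing scheme unspecified. The stream tags, the 1-byte tag plus 2-byte length header, and the function names are all hypothetical.

```python
import struct
from typing import Dict, Iterable, List, Tuple

# Hypothetical stream tags for image, motion component and audio data.
IMAGE, MOTION, AUDIO = 1, 2, 3


def multiplex(chunks: Iterable[Tuple[int, bytes]]) -> bytes:
    """Concatenate chunks, each prefixed by a 1-byte tag and a 2-byte length.

    Payloads are assumed to be at most 65535 bytes long.
    """
    out = bytearray()
    for stream_id, payload in chunks:
        out += struct.pack(">BH", stream_id, len(payload))
        out += payload
    return bytes(out)


def demultiplex(data: bytes) -> Dict[int, List[bytes]]:
    """Reverse multiplex(): walk the headers and group payloads by tag."""
    streams: Dict[int, List[bytes]] = {}
    offset = 0
    while offset < len(data):
        stream_id, length = struct.unpack_from(">BH", data, offset)
        offset += 3  # size of the ">BH" header
        streams.setdefault(stream_id, []).append(data[offset:offset + length])
        offset += length
    return streams


muxed = multiplex([(IMAGE, b"img"), (MOTION, b"mv"), (AUDIO, b"snd")])
streams = demultiplex(muxed)
```

Each recovered stream would then be routed to its own decoder, as the paragraph describes for circuits 12, 13, and 14.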
[0013]
The image decoding circuit 14 decodes the image encoded data D12 sequentially sent from the separation circuit 11, thereby restoring, one frame at a time, the reference image data D16 representing the original face image, and sequentially sends the restored data to the image data holding circuit 15.
[0014]
The image data holding circuit 15 sequentially holds the image data D16 sent from the image decoding circuit 14 in an internal memory (not shown), and then sends it to the synthesis circuit 16.
[0015]
The motion image component decoding circuit 13 decodes the motion image component data D13 continuously sent from the separation circuit 11 to restore motion image component data corresponding to the original motion image component data D3. It then generates a moving wire frame based on that data and sends it to the synthesis circuit 16 as wire frame data D18.
[0016]
The synthesis circuit 16 combines the wire frame data D18 supplied from the motion image component decoding circuit 13 with the reference image data D16 sequentially supplied from the image data holding circuit 15, thereby generating composite image data D19 in which the mouth of the face image moves in time with the voice, and outputs this as a display image via a display unit (not shown).
[0017]
The audio decoding circuit 12 decodes the audio encoded data D15 to restore audio data D17 corresponding to the original audio data D4, converts it to analog form, and outputs it as sound from a speaker (not shown), timed to match the display image output from the synthesis circuit 16 via the display unit.
[0018]
[Problems to be solved by the invention]
In the television telephone system 1 configured in this way, however, the transmission side terminal 2 must compress and encode the image data D1 and its motion image component frame by frame and transmit them sequentially to the reception side terminal 3. It must also compress and encode the user's audio data D4 separately from the image data D1 and its motion image component and transmit it to the reception side terminal 3. This requires a very large amount of data transmission and a long transmission time, so that real-time processing cannot be carried out.
[0019]
Further, the television telephone system 1 requires compression encoding of the image data D1 and its motion image component, compression encoding of the audio data D4, and the corresponding decoding, so a large amount of high-speed digital processing is necessary. As a result, the configurations of the transmission side terminal 2 and the reception side terminal 3 become complicated and a large amount of power is consumed.
[0020]
The present invention has been made in consideration of the above points, and proposes an image transmission system capable of executing real-time processing with a simple configuration and low power consumption.
[0021]
[Means for Solving the Problems]
In order to solve this problem, the present invention provides an image transmission system composed of a transmitting device and a receiving device. The transmitting device comprises: audio data transmitting means for transmitting the user's audio data, collected by a microphone, to the receiving device; face image model transmitting means for generating a wire frame face image model by dividing one frame of reference face image data, obtained by photographing the user's face, into a plurality of points and connecting them, and for transmitting this model to the receiving device; and motion image component data transmitting means which, in view of the correlation that facial expression changes when the user speaks, uses each point of the eye portion of the face image model as an analysis parameter only while audio data is being transmitted to the receiving device, extracts the temporal change of those parameters as motion image component data of the wire frame, and transmits it to the receiving device. The receiving device comprises: face image model holding means for holding the face image model; eye portion motion image data generating means for generating moving eye portion motion image data for the eye portion of the face image model based on the motion image component data; mouth portion motion image data generating means for generating mouth portion motion image data by decoding the motion state of the mouth portion of the face image model based on the audio data; synthesizing means for generating a composite image by combining the eye portion motion image data and the mouth portion motion image data with the face image model; and display means for displaying the composite image.
[0022]
Thereby, the transmission side transmits the user's audio data, the face image model for one frame, and, considering the correlation that facial expression changes when the user's voice is emitted, the motion image component data of the wire frame, which shows the change over time obtained by using each point of the eye part of the face image model as an analysis parameter only while the audio data is being sent to the receiving device. The receiving side generates the eye part motion image data based on the motion image component data, generates the mouth part motion image data based on the audio data, and then combines them with the face image model. This reduces the amount of data transmitted from the transmitting device to the receiving device while making it possible to display an expression-rich face image in which the eye part and the mouth part are correlated as a facial expression.
[0025]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
[0026]
(1) First embodiment
In FIG. 1, in which parts corresponding to those in FIG. 6 are assigned the same reference numerals, 20 denotes the television telephone system according to the first embodiment as a whole, which is composed of a transmission side terminal 21 and a reception side terminal 22. Incidentally, in the television telephone system 20, the transmission side terminal 21 may be the reception side, and the reception side terminal 22 may be the transmission side.
[0027]
The transmitting side terminal 21 sends the voice data D4 of the user digitally converted after collecting the sound with a microphone (not shown) to the voice compression coding circuit 7. The audio compression encoding circuit 7 compresses and encodes the audio data D4 by a predetermined compression encoding method, and then sends this to the multiplexing circuit 6 as audio encoded data D5 corresponding to word information.
[0028]
Further, at the beginning of communication, the transmission side terminal 21 reads one frame of image data D23, which is stored in advance in the image database 23 and serves as a reference representing the user's own face image, and sends it to the image compression coding circuit 24. The image compression encoding circuit 24 generates image encoded data D24 by compressing and encoding the image data D23 by a predetermined compression encoding method, and sends this to the multiplexing circuit 6.
[0029]
The multiplexing circuit 6 multiplexes the encoded image data D24 and the encoded audio data D5 and sends the multiplexed data D25 obtained as a result to the modulation circuit 8. Here, the multiplexing circuit 6 multiplexes the encoded image data D24 and the encoded audio data D5 only when the encoded image data D24 for one frame is supplied from the image compression encoding circuit 24 at the beginning of communication; since the encoded image data D24 is not supplied thereafter, it then sends only the encoded audio data D5 to the modulation circuit 8.
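The behavior of the multiplexing circuit 6 described above can be illustrated with a minimal Python sketch. The names `MuxFrame` and `multiplex` and the byte strings are hypothetical and used only for illustration; the source does not specify any frame layout.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class MuxFrame:
    """One multiplexed unit (hypothetical layout, not from the source)."""
    audio: bytes                   # audio encoded data (D5)
    image: Optional[bytes] = None  # image encoded data (D24), first unit only


def multiplex(image_once: bytes, audio_frames: Iterable[bytes]) -> Iterator[MuxFrame]:
    """Attach the reference image to the first audio frame only."""
    pending: Optional[bytes] = image_once
    for audio in audio_frames:
        yield MuxFrame(audio=audio, image=pending)
        pending = None  # the image is not supplied again after the first unit


frames = list(multiplex(b"reference-face", [b"a0", b"a1", b"a2"]))
```

In this sketch only `frames[0]` carries image bytes; every later unit carries audio alone, which mirrors the reduction in transmitted data that the embodiment aims for.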
[0030]
The modulation circuit 8 sequentially modulates the multiplexed data D25 and the encoded audio data D5 supplied thereafter, and transmits this to the receiving terminal 22 via the communication path 9 as transmission data D26. Incidentally, the communication path 9 is not particularly limited to wired and wireless, and may be any one.
[0031]
That is, when the communication path 9 is a wireless communication path, it is assumed that the television telephone system 20 uses, for example, camera-equipped mobile phones as the transmission-side terminal 21 and the reception-side terminal 22, and when the communication path 9 is a wired communication path, it is assumed that the television telephone system 20 uses, for example, camera-equipped telephones installed at home as the transmission-side terminal 21 and the reception-side terminal 22.
[0032]
The reception side terminal 22 receives the transmission data D26 transmitted via the communication path 9 as reception data D27 and sends it to the demodulation circuit 10. The demodulation circuit 10 obtains demodulated data D28 by performing demodulation processing on the received data D27, and sends it to the separation circuit 11.
[0033]
In practice, the demodulation circuit 10 detects and corrects data errors occurring on the communication path 9 when demodulating the received data D27 received via the communication path 9, but is omitted here for convenience of explanation.
[0034]
The separation circuit 11 separates the demodulated data D28, which corresponds to the multiplexed data D25 of the transmission side terminal 21, in the reverse procedure of the multiplexing process into image encoded data D29 and audio encoded data D30 corresponding respectively to the original image encoded data D24 and audio encoded data D5. It sends the audio encoded data D30 to the audio decoding circuit 12 and to the motion image holding circuit 25 of the image synthesizing unit 27, and sends the image encoded data D29 to the image decoding circuit 14.
[0035]
Incidentally, the separation circuit 11 first separates the demodulated data D28 into the encoded image data D29 and the encoded audio data D30, and then the encoded image data D29 is not multiplexed with the demodulated data D28 thereafter. Therefore, only the audio encoded data D30 is sent to the audio decoding circuit 12 and the motion image holding circuit 25.
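As a rough illustration of separation being the reverse of multiplexing, the following Python sketch packs and unpacks a unit with a length-prefixed layout. The layout (two big-endian 16-bit lengths followed by the payloads) and the function names `pack` and `separate` are assumptions made for this example; the source does not define a wire format.

```python
import struct
from typing import Optional, Tuple

HEADER = ">HH"  # hypothetical header: image length, audio length
HEADER_SIZE = struct.calcsize(HEADER)


def pack(audio: bytes, image: bytes = b"") -> bytes:
    """Multiplex one unit; the image part is empty after the first unit."""
    return struct.pack(HEADER, len(image), len(audio)) + image + audio


def separate(unit: bytes) -> Tuple[Optional[bytes], bytes]:
    """Reverse of pack(): return (image or None, audio)."""
    image_len, audio_len = struct.unpack(HEADER, unit[:HEADER_SIZE])
    image = unit[HEADER_SIZE:HEADER_SIZE + image_len]
    audio = unit[HEADER_SIZE + image_len:HEADER_SIZE + image_len + audio_len]
    return (image or None), audio


first = separate(pack(b"aud0", b"img"))
later = separate(pack(b"aud1"))
```

Here `first` yields both parts, while `later` yields no image, so only the audio would be forwarded to the audio decoding circuit 12 and the motion image holding circuit 25.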
[0036]
The image decoding circuit 14 restores the original image data D31 for one frame corresponding to the image data D23 by decoding the image encoded data D29, which is sent from the separation circuit 11 only at the beginning, and sends this to the image data holding circuit 15 serving as still image storage means.
[0037]
The image data holding circuit 15 temporarily holds the reference image data D31 sent from the image decoding circuit 14 in an internal memory (not shown), and then sends it to the synthesis circuit 26 of the image synthesis unit 27.
[0038]
The motion image holding circuit 25, serving as the motion image generation means, stores in its internal memory a plurality of motion image data D32 representing, for example, the motion of the mouth corresponding to the code patterns of the audio encoded data D30. It reads out the motion image data D32 corresponding to the audio encoded data D30 sequentially sent from the separation circuit 11 as word information, and sends it to the synthesis circuit 26.
[0039]
Here, the audio encoded data D30 is data that has been subjected to digital compression encoding processing based on a model of the form of the mouth portion when a human utters speech. Therefore, by storing in advance in the internal memory a plurality of motion image data D32 representing, for example, the movement of the mouth corresponding to the code patterns of the audio encoded data D30, the motion image holding circuit 25 can immediately read out the motion image data D32 corresponding to the code pattern of the audio encoded data D30 and send it to the synthesis circuit 26 serving as synthesizing means.
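The table lookup performed by the motion image holding circuit 25 might be sketched as follows. The code values, the mouth-shape labels, and the fallback to a closed mouth are all hypothetical; the source does not state which code patterns exist or how unknown patterns are handled.

```python
from typing import Dict, Iterable, List

# Hypothetical table: code pattern of the audio encoded data -> stored mouth image.
MOUTH_TABLE: Dict[int, str] = {
    0: "closed",
    1: "half_open",
    2: "open",
    3: "rounded",
}


def mouth_frames(code_patterns: Iterable[int]) -> List[str]:
    """Read out one stored mouth image per code pattern (assumed fallback: closed)."""
    return [MOUTH_TABLE.get(code, "closed") for code in code_patterns]


sequence = mouth_frames([2, 1, 0, 9])
```

Because the images are prepared in advance, each lookup is a constant-time read rather than a decoding step, which is consistent with the immediate read-out described above.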
[0040]
The synthesis circuit 26 superimposes the motion image data D32 sequentially supplied from the motion image holding circuit 25 on the reference image data D31 supplied from the image data holding circuit 15, thereby generating composite image data D33 in which only the mouth moves relative to the reference face image, and outputs this as a display image via a display unit (not shown).
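A minimal sketch of this superimposition, assuming images are plain two-dimensional lists of pixel values and that the mouth position (`top`, `left`) is known in advance; neither assumption comes from the source.

```python
from typing import List

Image = List[List[int]]


def superimpose(reference: Image, mouth: Image, top: int, left: int) -> Image:
    """Copy the reference image and overwrite the mouth region with the patch."""
    composite = [row[:] for row in reference]  # leave the held reference intact
    for r, patch_row in enumerate(mouth):
        for c, value in enumerate(patch_row):
            composite[top + r][left + c] = value
    return composite


reference_face = [[0] * 4 for _ in range(4)]
mouth_patch = [[1, 1]]
composite = superimpose(reference_face, mouth_patch, top=3, left=1)
```

The reference image is copied rather than modified, so the same held frame can be reused for every subsequent mouth patch.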
[0041]
The audio decoding circuit 12 restores the original audio data D34 by decoding the audio encoded data D30, converts this to analog, and then outputs it as sound from a speaker (not shown) in timing with the display image output from the synthesis circuit 26 via the display unit.
[0042]
In the above configuration, in the videophone system 20 according to the first embodiment, the transmitting terminal 21 compresses and encodes, only at the beginning of communication, the image data D23 of the face image serving as a reference for one frame read from the image database 23 and transmits it to the receiving side terminal 22, and also sequentially compresses and encodes the audio data D4 and transmits it to the receiving side terminal 22.
[0043]
As a result, the transmission side terminal 21 does not need to compress and encode the user's own face image and transmit it to the reception side terminal 22 frame by frame; it only needs to compress and encode the image data D23 of the face image stored in the image database 23 in advance and transmit it to the receiving terminal 22 at the beginning of communication, so the amount of data transmission can be significantly reduced.
[0044]
Further, the transmission side terminal 21 can simplify the circuit configuration to the extent that the motion image component extraction circuit 5 is not required as compared with the conventional transmission side terminal 2 (FIG. 6), and can also reduce the processing amount of complicated digital signal processing, so that the power consumption can be further reduced.
[0045]
On the other hand, the receiving side terminal 22 demodulates and separates the received data D27, and then restores, by the image decoding circuit 14, the image data D31 of the reference face image sent only at the beginning of communication. Then, the motion image holding circuit 25 reads out the motion image data D32 representing the motion of the mouth corresponding to the encoded audio data D30, and this motion image data D32 is superimposed on and synthesized with the image data D31 of the reference face image. Thus, the composite image data D33 is generated and then output as a display image.
[0046]
At this time, the receiving-side terminal 22 restores the original audio data D34 by decoding the audio encoded data D30 by the audio decoding circuit 12, converts it to analog, and then outputs it as audio at the same timing as the display image synthesized by the synthesis circuit 26.
[0047]
In this way, the receiving side terminal 22 displays a display image in which only the mouth portion of the character is moved according to the user's voice, so that an image effect can be brought about in which the character seems to sing according to the voice of the user of the transmitting side terminal 21.
[0048]
Also at this time, the receiving side terminal 22 can simplify the circuit configuration to the extent that the motion image component decoding circuit 13 becomes unnecessary as compared with the conventional receiving side terminal 3 (FIG. 6), and the decoding processing of the user's face image only needs to be performed once at the beginning, so the processing amount of complicated digital signal processing can be reduced, and thus power consumption can be further reduced.
[0049]
According to the above configuration, in the videophone system 20 in the first embodiment, the transmission side terminal 21 compresses and encodes the audio data D4 and transmits it to the reception side terminal 22, and compresses and encodes the image data D23 of the user's own face image, stored in the image database 23 in advance, only at the beginning of communication and transmits it to the receiving terminal 22. As a result, the amount of data transmission is significantly reduced compared to the conventional transmitting terminal 2, and the power consumption can be further reduced.
[0050]
In addition, in the videophone system 20, the receiving terminal 22 decodes only once the encoded image data D29 sent from the transmitting terminal 21 only at the beginning and uses the resulting facial image data D31; the motion image holding circuit 25 reads out the motion image data D32 representing the movement of the mouth corresponding to the audio encoded data D30, and this is synthesized with the image data D31, so that a display image can be displayed in which the user of the transmission side terminal 21 appears to sing like a moving image.
[0051]
At this time, the receiving side terminal 22 only needs to perform the image decoding process once, and can significantly reduce the amount of digital signal processing, and thus further reduce the power consumption, to the extent that the motion image component decoding process by the motion image component decoding circuit 13 of the conventional receiving side terminal 3 becomes unnecessary. Thus, a display image in which the mouth portion moves in accordance with the timing of the voice can be displayed in real time.
[0052]
(2) Second embodiment
In FIG. 2, in which parts corresponding to those in FIG. 1 are denoted by the same reference numerals, reference numeral 40 denotes a television telephone system according to the second embodiment as a whole, which includes a transmission side terminal 41 and a reception side terminal 42. Incidentally, in the television telephone system 40, the transmission side terminal 41 may be the reception side, and the reception side terminal 42 may be the transmission side.
[0053]
The transmission side terminal 41 sends the voice data D4 of the user digitally converted after collecting the sound with a microphone (not shown) to the voice compression coding circuit 7. The audio compression encoding circuit 7 compresses and encodes the audio data D4 by a predetermined compression encoding method, and then sends this to the multiplexing circuit 6 as audio encoded data D5.
[0054]
Further, only at the beginning of communication, the transmission side terminal 41 reads the image identification information D42 corresponding to the image data of, for example, a character's face image stored in advance in the image database 42, and sends this to the multiplexing circuit 6.
[0055]
The multiplexing circuit 6 multiplexes the image identification information D42 and the encoded audio data D5 and sends the multiplexed data D43 obtained as a result to the modulation circuit 8. Here, the multiplexing circuit 6 multiplexes the image identification information D42 and the audio encoded data D5 only when the image identification information D42 read from the image database 42 is supplied at the beginning of communication. However, since the image identification information D42 is not supplied thereafter, only the audio encoded data D5 is sent to the modulation circuit 8.
[0056]
The modulation circuit 8 sequentially modulates the multiplexed data D43 and the voice encoded data D5 supplied thereafter, and transmits this to the receiving terminal 42 via the communication path 9 as transmission data D44. Incidentally, the communication path 9 is not particularly limited to wired and wireless, and may be any one.
[0057]
The receiving side terminal 42 receives the transmission data D44 transmitted via the communication path 9 as reception data D45 and sends it to the demodulation circuit 10. The demodulation circuit 10 obtains demodulated data D46 by performing demodulation processing on the received data D45, and sends it to the separation circuit 11.
[0058]
In practice, the demodulation circuit 10 detects and corrects data errors occurring on the communication path 9 when demodulating the reception data D45 received via the communication path 9, but is omitted here for convenience of explanation.
[0059]
The separation circuit 11 separates the demodulated data D46, which corresponds to the multiplexed data D43 of the transmission side terminal 41, in the reverse order of the multiplexing process into image identification information D47 and audio encoded data D48 corresponding respectively to the original image identification information D42 and audio encoded data D5. It sends the audio encoded data D48 to the audio decoding circuit 12 and to the wire frame generation circuit 44 of the image synthesis unit 49, and sends the image identification information D47 to an image database 43 serving as still image storage means.
[0060]
Incidentally, since the separation circuit 11 first separates the demodulated data D46 into the image identification information D47 and the audio encoded data D48, the image identification information D47 is never multiplexed with the demodulated data D46 thereafter. Only the audio encoded data D48 is sent to the audio decoding circuit 12 and the wire frame generation circuit 44.
[0061]
The wire frame generation circuit 44, serving as motion image generation means, decodes, for example, the motion state of the mouth of the character's face image based on the audio encoded data D48 sequentially sent from the separation circuit 11 as word information, reads a wire frame having a motion corresponding to the decoding result from an internal memory (not shown), and sends this as wire frame data D50 corresponding to the motion image to the synthesis circuit 26 corresponding to the combining means.
[0062]
Here, the wire frame is a face image model generated by, for example, dividing a face portion of a character's face image into a plurality of points and connecting the plurality of points, with many points placed around the eyes and mouth, which move the most.
[0063]
The image database 43 of the receiving terminal 42 stores image data having the same contents as the image database 42 of the transmitting terminal 41; image data D49 representing the character face image corresponding to the image identification information D47 supplied from the separation circuit 11 is read out from it and sent to the synthesis circuit 26.
[0064]
That is, the image data D42 representing the character's face image read from the image database 42 of the transmitting terminal 41 and the image data D49 representing the character's face image read from the image database 43 of the receiving terminal 42 are data having the same contents.
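The idea of sending only an identifier when both terminals hold the same image data can be sketched as below. The database contents, the two-byte identifier encoding, and the function names are hypothetical placeholders, not details taken from the source.

```python
from typing import Dict

# Hypothetical database held identically by both terminals: id -> image bytes.
SHARED_DB: Dict[int, bytes] = {
    1: b"\x00" * 4096,
    2: b"\xff" * 4096,
}


def send_identifier(image_id: int) -> bytes:
    """Transmitting side: send only the identifier (assumed 2 bytes)."""
    if image_id not in SHARED_DB:
        raise KeyError(f"unknown image id: {image_id}")
    return image_id.to_bytes(2, "big")


def receive_identifier(payload: bytes) -> bytes:
    """Receiving side: look up the same image in the local database."""
    return SHARED_DB[int.from_bytes(payload, "big")]


payload = send_identifier(2)
image = receive_identifier(payload)
```

With these placeholder sizes, 2 bytes cross the communication path instead of 4096, which illustrates why the second embodiment transmits less than the first.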
[0065]
Here, the image data D49 is also a wire frame, and the location of each point in the character's face portion matches each point of the wire frame data D50 generated by the wire frame generation circuit 44.
[0066]
The synthesizing circuit 26 synthesizes the wire frame data D50 with the wire frame image data D49 representing the character's face image supplied from the image database 43, and corrects the image distortion of the moving part according to the wire frame data D50. As a result, the composite image data D51 in which the mouth portion of the character moves according to the voice is generated, and this is output as a display image via a display unit (not shown).
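One possible reading of this synthesis step is sketched below: because the points of the two wire frames coincide, the moving wire frame can simply supply new positions for the points it covers. The point names, the coordinates, and the replacement rule are assumptions for illustration; the source does not describe how the distortion correction is computed.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]


def combine_wireframes(base: Dict[str, Point], moving: Dict[str, Point]) -> Dict[str, Point]:
    """Return the base wire frame with the moving points replaced."""
    unknown = set(moving) - set(base)
    if unknown:
        # The two wire frames are expected to share the same point locations.
        raise KeyError(f"points missing from base wire frame: {sorted(unknown)}")
    combined = dict(base)
    combined.update(moving)
    return combined


face = {"eye_l": (30, 40), "eye_r": (70, 40), "mouth_top": (50, 80), "mouth_bottom": (50, 90)}
open_mouth = {"mouth_top": (50, 78), "mouth_bottom": (50, 96)}
result = combine_wireframes(face, open_mouth)
```

Only the mouth points change in `result`; the eye points keep their reference positions, matching a display image in which only the mouth moves.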
[0067]
The audio decoding circuit 12 restores the original audio data D52 by decoding the audio encoded data D48, converts this to analog, and then outputs it as sound from a speaker (not shown) in timing with the display image output from the synthesis circuit 26 via the display unit.
[0068]
In the above configuration, in the videophone system 40, the transmission side terminal 41 reads the image identification information D42 indicating the image data of the character's face image from the image database 42 only at the beginning of communication and transmits it to the reception side terminal 42, and also sequentially compresses and encodes the audio data D4 and transmits it to the receiving terminal 42.
[0069]
As a result, the transmission side terminal 41 does not need to compress and transmit the user's own face image while photographing it, and only needs to transmit the image identification information D42, which indicates the character face image data stored in the image database 42 in advance, to the receiving side terminal 42 at the beginning of communication. The data transmission amount can therefore be further reduced compared to the transmitting side terminal 21 in the first embodiment, which transmits the image data D23 only once.
[0070]
In addition, the transmission side terminal 41 can simplify the circuit configuration to the extent that the image compression coding circuit 4 and the motion image component extraction circuit 5 are not required, as compared with the conventional transmission side terminal 2 (FIG. 6). Since the processing amount of digital signal processing can be reduced, power consumption can be further reduced.
[0071]
On the other hand, the receiving terminal 42 demodulates and separates the received data D45, then reads from the image database 43 the image data D49 representing the facial image of the character corresponding to the image identification information D47 sent only at the beginning of communication, and generates, by the wire frame generation circuit 44, the wire frame data D50 corresponding to the audio encoded data D48.
[0072]
Then, the receiving side terminal 42 corrects the image distortion of the moving part in the character's face image based on the wire frame data D50, thereby generating the composite image data D51 in which the mouth portion of the character moves in accordance with the voice of the user of the transmitting side terminal 41, outputs it as a display image, and outputs the audio decoded by the audio decoding circuit 12 from the speaker in synchronization with the display image.
[0073]
In this way, the receiving terminal 42 displays a display image in which the mouth portion of the character is moved in accordance with the user's voice, so that an image effect can be brought about as if the character is speaking in accordance with the voice of the user of the transmitting terminal 41.
[0074]
Also at this time, the receiving side terminal 42 can further simplify the circuit configuration to the extent that the motion image component decoding circuit 13 is not required as compared with the conventional receiving side terminal 3 (FIG. 6) and that the image decoding circuit 14 of the receiving side terminal 22 in the first embodiment is not required, and can further reduce the processing amount of complicated digital signal processing, so that power consumption can be further reduced.
[0075]
According to the above configuration, in the videophone system 40 according to the second embodiment, the transmission side terminal 41 compresses and encodes the audio data D4 and transmits it to the reception side terminal 42, and transmits the image identification information D42 of the image data of a certain character, stored in the image database 42 in advance, to the receiving terminal 42 only at the beginning of communication. As a result, the data transmission amount is further reduced compared with the transmitting terminal 21 in the first embodiment, and power consumption can be further reduced.
[0076]
In addition, in the videophone system 40, the receiving side terminal 42 reads the character image data D49 based on the image identification information D42 sent only once from the transmission side terminal 41, generates the wire frame data D50 corresponding to the audio encoded data D48, and synthesizes it with the character image data D49, so that it is possible to generate a display image as if the character is speaking according to the voice of the user of the transmission side terminal 41.
[0077]
At this time, the receiving side terminal 42 can further reduce the processing amount of the digital signal processing as compared with the receiving side terminal 22 in the first embodiment, and further reduce power consumption, to the extent that no image decoding processing is required. Thus, it is possible to display in real time a display image in which the mouth portion moves in accordance with the audio timing.
[0078]
(3) Third embodiment
In FIG. 3, in which parts corresponding to those in FIG. 1 are denoted by the same reference numerals, 60 indicates a television telephone system according to the third embodiment as a whole, which is composed of a transmission side terminal 61 and a reception side terminal 62.
[0079]
The transmission side terminal 61 sends the voice data D4 of the user digitally converted after collecting the sound with a microphone (not shown) to the voice compression coding circuit 7. The audio compression encoding circuit 7 compresses and encodes the audio data D4 by a predetermined compression encoding method, and then sends this to the modulation circuit 8 as audio encoding data D5 corresponding to word information.
[0080]
The modulation circuit 8 modulates the audio encoded data D5 sequentially supplied from the audio compression encoding circuit 7, and transmits this to the receiving terminal 62 via the communication path 9 as transmission data D61.
[0081]
The transmission side terminal 61 in this case has a circuit configuration similar to that of a general mobile phone, and does not have to be a camera-equipped mobile phone or a transmission side terminal specific to the television telephone system 60. The communication path 9 is not particularly limited to wired or wireless, and either may be used.
[0082]
The reception side terminal 62 receives the transmission data D61 transmitted via the communication path 9 as reception data D62 and sends it to the demodulation circuit 10. The demodulation circuit 10 performs demodulation processing on the reception data D62 to obtain audio encoded data D63 corresponding to the audio encoded data D5 of the transmission side terminal 61, and sends it to the audio decoding circuit 12 and to the wire frame generation circuit 63 of the image synthesis unit 69.
[0083]
The wire frame generation circuit 63, serving as motion image generation means, decodes, for example, the motion state of the mouth of the character's face image based on the audio encoded data D63 sequentially sent from the demodulation circuit 10 as word information, thereby generates a wire frame having motion, and sends this as wire frame data D64 corresponding to the motion image to the synthesis circuit 64 corresponding to the combining means.
[0084]
Here, the wire frame is a face image model generated by, for example, dividing a face portion of a character's face image into a plurality of points and connecting the plurality of points, with many points placed around the eyes and mouth, which move the most.
[0085]
On the other hand, the image database 65 as a still image storage means reads out image data D65 representing a face image of a predetermined character determined in advance and sends it to the synthesis circuit 64. Here, the image data D65 is also a wire frame, and the location of each point on the character's face coincides with each point of the wire frame data D64 generated by the wire frame generation circuit 63.
[0086]
The synthesizing circuit 64 receives the wire frame data D64 from the wire frame generation circuit 63 and, at the same time, receives the image data D65 representing a predetermined character face image from the image database 65. It combines the image data D65 with the wire frame data D64 and corrects the image distortion of the moving portion according to the wire frame data D64, thereby generating the composite image data D66 in which the mouth portion of the character moves in accordance with the voice, and outputs this as a display image via a display unit (not shown).
[0087]
The audio decoding circuit 12 restores the original audio data D67 by decoding the audio encoded data D63, converts it into an analog signal, and then outputs it as sound from a speaker (not shown) in timing with the display image output from the synthesis circuit 64 via the display unit.
[0088]
In the above configuration, in the videophone system 60, the transmission side terminal 61 does not need to transmit image data, and only sequentially compresses and encodes the normal audio data D4 and transmits it to the reception side terminal 62. The data transmission amount can therefore be reduced as compared with the transmission side terminals 21 and 41 in the first and second embodiments, and can be kept equivalent to that of a general mobile phone that performs only normal voice calls.
[0089]
On the other hand, the receiving side terminal 62 reads the character image data D65 stored in the image database 65 in advance, and synthesizes the wire frame data D64 representing the movement state of the mouth based on the voice encoded data D63 into the image data D65. By doing so, it is possible to generate the composite image data D66 in which the mouth portion of the character moves according to the voice, and output this as a display image.
[0090]
As described above, the receiving side terminal 62 can further simplify the circuit configuration by the amount that does not require the separation circuit 11 and the amount that does not require the image decoding process, unlike the receiving side terminal 42 in the second embodiment. Since the processing amount of complicated digital signal processing can be further reduced, the power consumption can be further reduced.
[0091]
Further, in the television telephone system 60, it is not always necessary to use the transmission side terminal 61 and the reception side terminal 62 as one set. Even if a general mobile phone equivalent to the transmission side terminal 61 is used, as long as audio data from that mobile phone can be received, the receiving terminal 62 can display a display image in which the mouth portion is moved in accordance with the voice based on the face image of the character, and the usability for the user can be further improved.
[0092]
According to the above configuration, in the videophone system 60 according to the third embodiment, the receiving terminal 62 generates wire frame data D64 representing the movement state of the mouth portion based on the audio encoded data D63 received from the transmitting terminal 61, combines this with the character image data D65 previously stored in the image database 65 to generate composite image data D66, and outputs it as a display image. It is thus possible to bring about an image effect in which the mouth portion of the character moves in accordance with the voice from the audio data alone, without image data being transmitted.
[0093]
Further, the receiving side terminal 62 can simplify the circuit configuration and reduce the processing amount of the digital signal processing as compared with the receiving side terminals 22 and 42 in the first and second embodiments. Thus, a display image in which the mouth portion moves in accordance with the audio timing can be displayed in real time.
[0094]
(4) Fourth embodiment
In FIG. 4, in which parts corresponding to those in FIG. 6 are assigned the same reference numerals, 80 denotes a television telephone system according to the fourth embodiment as a whole, which is composed of a transmitting terminal 81 and a receiving terminal 82. Incidentally, in the television telephone system 80, the transmission side terminal 81 may be the reception side, and the reception side terminal 82 may be the transmission side.
[0095]
The transmission side terminal 81 sends the voice data D4 of the user digitally converted after being collected by a microphone (not shown) to the voice compression coding circuit 7. The audio compression encoding circuit 7 compresses and encodes the audio data D4 by a predetermined compression encoding method, and then sends this to the multiplexing circuit 6 and the motion image component extraction circuit 83 as audio encoding data D5.
[0096]
Further, the transmission side terminal 81 sequentially sends the image data D1 digitally converted after the user's face is imaged by an imaging means (not shown) to the image compression coding circuit 4 and the motion image component extraction circuit 83.
[0097]
Here, after sending the image data D1 serving as a reference for one frame to the image compression encoding circuit 4 at the beginning of communication, the transmission side terminal 81 does not send the image data D1 of the next and subsequent frames to the image compression encoding circuit 4, but sends it only to the motion image component extraction circuit 83.
[0098]
The image compression encoding circuit 4 divides the face portion in the image data D1 into a plurality of points, and generates a reference face image model called a wire frame by connecting those points. Incidentally, the wire frame has many points placed around the eyes and the mouth, which are the parts of the face that move the most.
[0099]
Then, the image compression encoding circuit 4 generates image encoded data D2 by compressing and encoding the wire-framed image data D1 by a predetermined method, and sends this to the multiplexing circuit 6.
[0100]
The motion image component extraction circuit 83 likewise generates a reference face image model called a wire frame. Only while the audio encoded data D5 is being supplied from the audio compression encoding circuit 7, it uses each point of a part of the wire frame other than the mouth portion, for example the eye portion, as an analysis parameter, extracts the temporal change of the analysis parameter (that is, the difference between the previous frame and the current frame) as a motion image component of the wire frame, and compresses and encodes it to generate motion image component data D83, which it sends to the multiplexing circuit 6.
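As an illustration only, the voice-gated extraction described above might be sketched in Python as follows. The point names, the `voice_active` flag, and the function name are assumptions that do not appear in the original, and the compression encoding step is omitted.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Wireframe = Dict[str, Point]  # point name -> (x, y); the naming scheme is hypothetical


def extract_eye_motion(prev: Wireframe, curr: Wireframe,
                       voice_active: bool,
                       eye_prefix: str = "eye") -> Optional[Dict[str, Point]]:
    """Return previous-to-current deltas for the eye points only.

    Returns None while no audio is being supplied, mirroring the rule that
    motion components are extracted only while voice data is present.
    """
    if not voice_active:
        return None
    return {
        name: (curr[name][0] - prev[name][0], curr[name][1] - prev[name][1])
        for name in curr
        if name.startswith(eye_prefix) and name in prev
    }


# Two consecutive wire frames; the mouth point is deliberately ignored.
prev_frame = {"eye_left": (10.0, 20.0), "mouth_top": (15.0, 40.0)}
curr_frame = {"eye_left": (10.0, 21.5), "mouth_top": (15.0, 43.0)}

deltas = extract_eye_motion(prev_frame, curr_frame, voice_active=True)
silent = extract_eye_motion(prev_frame, curr_frame, voice_active=False)
```

In this sketch the mouth point changes between frames but is never reported, because the mouth movement is meant to be recovered from the audio on the receiving side.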
[0101]
That is, taking into account the correlation with the audio encoded data D5 supplied from the audio compression encoding circuit 7, and on the premise that facial expression changes when a voice is uttered, the motion image component extraction circuit 83 generates the motion image component data D83 relating to the eye portion only while the voice is being uttered.
[0102]
The multiplexing circuit 6 multiplexes the audio encoded data D5, the image encoded data D2, and the motion image component data D83, and sends the multiplexed data D84 obtained as a result to the modulation circuit 8.
[0103]
Here, the multiplexing circuit 6 multiplexes the image encoded data D2, the audio encoded data D5, and the motion image component data D83 only when image encoded data D2 for one frame is supplied from the image compression encoding circuit 4 at the beginning of communication. Because the image encoded data D2 is not supplied thereafter, it then multiplexes only the audio encoded data D5 and the motion image component data D83.
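The behaviour of the multiplexing circuit 6 can be pictured with the following minimal Python sketch. The tagged, length-prefixed packet layout and the function name are assumptions made for illustration; the original does not specify a multiplexing format.

```python
from typing import Optional


def multiplex(audio: bytes, motion: bytes, image: Optional[bytes] = None) -> bytes:
    """Pack the streams as tagged, length-prefixed fields (hypothetical layout).

    The image field is present only when a reference frame is supplied,
    which in the text happens once at the beginning of communication.
    """
    fields = [(b"A", audio), (b"M", motion)]
    if image is not None:
        fields.insert(0, (b"I", image))
    out = bytearray()
    for tag, payload in fields:
        out += tag + len(payload).to_bytes(4, "big") + payload
    return bytes(out)


# First packet carries the reference frame; later packets do not.
first = multiplex(b"aud0", b"mot0", image=b"ref-frame")
later = multiplex(b"aud1", b"mot1")
```

Every packet after the first is shorter by the size of the image field, which is where the reduction in transmitted data comes from.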
[0104]
The modulation circuit 8 modulates the multiplexed data D84 by a predetermined modulation method for transmission via the communication path 9, and transmits this to the receiving terminal 82 via the communication path 9 as transmission data D85. Incidentally, the communication path 9 may be either wired or wireless.
[0105]
The receiving side terminal 82 receives the transmission data D85 transmitted via the communication path 9 as the reception data D86 and sends it to the demodulation circuit 10. The demodulating circuit 10 performs demodulation processing on the received data D86 to obtain demodulated data D87 and sends it to the separating circuit 11.
[0106]
In practice, the demodulation circuit 10 detects and corrects data errors that occur on the communication path 9 when demodulating the received data D86, but this is omitted here for convenience of explanation.
[0107]
The separation circuit 11 separates the demodulated data D87, which corresponds to the multiplexed data D84 of the transmitting terminal 81, by the reverse of the multiplexing procedure into audio encoded data D88, image encoded data D89, and motion image component data D90, corresponding respectively to the original audio encoded data D5, image encoded data D2, and motion image component data D83. It sends the audio encoded data D88 to the audio decoding circuit 12 and to the wire frame generation circuit 85 of the image synthesis unit 89, sends the motion image component data D90 to the motion image component decoding circuit 84, and sends the image encoded data D89 to the image decoding circuit 14.
[0108]
Incidentally, the separation circuit 11 separates the demodulated data D87 into the audio encoded data D88, the image encoded data D89, and the motion image component data D90 only the first time. Because the image encoded data D89 is not multiplexed into the demodulated data D87 thereafter, no image encoded data D89 is sent to the image decoding circuit 14 after that.
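The separation and routing performed on the receiving side might look like the following Python sketch. It assumes a hypothetical tagged, length-prefixed packet layout (tag byte, 4-byte big-endian length, payload); the dictionary keys naming the destination circuits are labels chosen here for readability and are not an interface from the original.

```python
from typing import Dict


def demultiplex(packet: bytes) -> Dict[str, bytes]:
    """Split a packet into its tagged fields (hypothetical layout)."""
    fields: Dict[str, bytes] = {}
    pos = 0
    while pos < len(packet):
        tag = packet[pos:pos + 1].decode("ascii")
        length = int.from_bytes(packet[pos + 1:pos + 5], "big")
        fields[tag] = packet[pos + 5:pos + 5 + length]
        pos += 5 + length
    return fields


def route(fields: Dict[str, bytes]) -> Dict[str, bytes]:
    """Map each separated stream to the circuits that receive it."""
    routes: Dict[str, bytes] = {}
    if "A" in fields:  # audio goes to both the decoder and the mouth generator
        routes["audio_decoding_circuit_12"] = fields["A"]
        routes["wire_frame_generation_circuit_85"] = fields["A"]
    if "M" in fields:
        routes["motion_image_component_decoding_circuit_84"] = fields["M"]
    if "I" in fields:  # present only in the first packet
        routes["image_decoding_circuit_14"] = fields["I"]
    return routes


def field(tag: bytes, payload: bytes) -> bytes:
    """Helper that builds one tagged field for the example packets."""
    return tag + len(payload).to_bytes(4, "big") + payload


first_packet = field(b"I", b"ref") + field(b"A", b"aud0") + field(b"M", b"mot0")
later_packet = field(b"A", b"aud1") + field(b"M", b"mot1")

first_routes = route(demultiplex(first_packet))
later_routes = route(demultiplex(later_packet))
```

Only the first packet produces an entry for the image decoding circuit; later packets feed just the audio and motion paths.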
[0109]
The image decoding circuit 14 decodes the image encoded data D89, which is sent from the separation circuit 11 only the first time, to restore the reference image data D16 corresponding to the original wire-framed face image, and sends this to the image data holding circuit 15.
[0110]
The image data holding circuit 15 temporarily holds the reference image data D16 sent from the image decoding circuit 14 in an internal memory (not shown), and then sends it to the synthesis circuit 86 of the image synthesis unit 89.
[0111]
The motion image component decoding circuit 84 restores the motion image component data corresponding to the original motion image component data D83 by decoding the motion image component data D90 continuously sent from the separation circuit 11, then generates a wire frame with motion based on the restored motion image component data and sends this to the synthesis circuit 86 as wire frame data D91 corresponding to the motion component image. In this case, the wire frame data D91 represents the movement of the eye portion in the face image of the user.
[0112]
The wire frame generation circuit 85, serving as the motion image generation means, decodes the motion state of, for example, the mouth portion of the face image based on the audio encoded data D88, which is continuously sent from the separation circuit 11 as word information, generates a wire frame with motion, and sends it as wire frame data D92 corresponding to the motion image to the synthesis circuit 86, which serves as the synthesizing means.
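The original does not say how the wire frame generation circuit 85 derives the mouth shape from the audio. Purely as a stand-in, the Python sketch below opens the mouth in proportion to the RMS loudness of decoded samples; this loudness rule, the point names, and the `max_open` parameter are all assumptions and not the method of the original.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def mouth_wireframe(samples: Sequence[float],
                    rest: Dict[str, Point],
                    max_open: float = 6.0) -> Dict[str, Point]:
    """Lower the bottom-lip points in proportion to RMS loudness.

    `samples` are assumed to be normalized to the range [-1.0, 1.0].
    This is a placeholder rule, not the decoding used in the original.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
    opening = max_open * min(rms, 1.0)
    return {
        name: (x, y + opening) if name.startswith("mouth_bottom") else (x, y)
        for name, (x, y) in rest.items()
    }


rest_pose = {"mouth_top": (15.0, 40.0), "mouth_bottom": (15.0, 44.0)}
quiet = mouth_wireframe([0.0, 0.0], rest_pose)
loud = mouth_wireframe([1.0, -1.0], rest_pose)
```

A real implementation would more likely map recognized phonemes or words to mouth shapes, as the later references to text and phonetic symbols suggest.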
[0113]
Here, the wire-framed image data D16 decoded by the image decoding circuit 14, the wire frame data D91 representing the movement of the eye portion generated by the motion image component decoding circuit 84, and the wire frame data D92 representing the movement of the mouth portion generated by the wire frame generation circuit 85 all have their points at the same positions in the face portion.
[0114]
The synthesis circuit 86 combines, with the image data D16 of the face image of the transmitting terminal 81 supplied from the image data holding circuit 15, the wire frame data D91 of the eye portion supplied from the motion image component decoding circuit 84 and the wire frame data D92 of the mouth portion supplied from the wire frame generation circuit 85, and corrects the image distortion of the moving portions according to the wire frame data D91 and D92. It thereby generates composite image data D93 in which the eye portion and the mouth portion move in time with the voice, and outputs this as a display image via a display unit (not shown).
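A minimal Python sketch of the combining step is given below, under the assumption that each wire frame is a mapping from hypothetical point names to coordinates. It only merges point positions; the distortion correction of the underlying image is not modelled.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def synthesize(reference: Dict[str, Point],
               eye: Dict[str, Point],
               mouth: Dict[str, Point]) -> Dict[str, Point]:
    """Overlay moved eye and mouth points on a copy of the reference model.

    All three wire frames are expected to share the same point set, so an
    update naming a point that the reference lacks is rejected.
    """
    combined = dict(reference)
    for update in (eye, mouth):
        unknown = set(update) - set(reference)
        if unknown:
            raise KeyError(f"points not in reference model: {sorted(unknown)}")
        combined.update(update)
    return combined


reference_model = {"eye_left": (10.0, 20.0), "nose": (15.0, 30.0), "mouth_top": (15.0, 40.0)}
eye_update = {"eye_left": (10.0, 21.5)}
mouth_update = {"mouth_top": (15.0, 43.0)}

composite = synthesize(reference_model, eye_update, mouth_update)
```

The held reference model is left unmodified, so it can be reused for every subsequent frame.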
[0115]
The audio decoding circuit 12 restores the original audio data D17 by decoding the audio encoded data D88, converts this to analog, and outputs it as sound from a speaker (not shown) in synchronization with the display image output from the synthesis circuit 86 via the display unit.
[0116]
In the above configuration, in the videophone system 80, the transmission side terminal 81 generates image encoded data D2 by compressing and encoding the image data D1 representing the user's face image for one frame at the beginning of communication. Then, the audio data D4 of the user is sequentially compression-encoded to generate audio encoded data D5 and output to the multiplexing circuit 6.
[0117]
At this time, on the premise that facial expression generally changes when a voice is uttered, the transmitting terminal 81 extracts the motion image component data D83 relating to the eye portion while the voice is being uttered, and sends it to the multiplexing circuit 6.
[0118]
The transmitting terminal 81 then has the multiplexing circuit 6 multiplex the image encoded data D2, the audio encoded data D5, and the motion image component data D83 only at the beginning of communication, and modulates and transmits the result. Thereafter it multiplexes, modulates, and transmits only the audio encoded data D5 and the motion image component data D83.
[0119]
In this way, the transmitting terminal 81 extracts the motion image component data D83 relating to the eye portion, which is a part other than the mouth portion, while the voice is being uttered, thereby associating the motion image component data D83 for reproducing the movement of the eye portion with the audio encoded data D5 for reproducing the movement of the mouth portion.
[0120]
As a result, the transmitting terminal 81 can transmit the image data D1 of the reference face image for one frame, the motion image component data D83 relating to the eye portion, and the audio encoded data D5 relating to the mouth portion each as the minimum necessary amount of data, without wasteful overlap among them. It can thus markedly reduce the amount of transmitted data and further reduce power consumption compared with the conventional transmitting terminal 2 (FIG. 6).
[0121]
On the other hand, the receiving terminal 82 restores the image data D16 of the reference face image, which is sent only at the beginning of communication. It then superimposes on the image data D16 the wire frame data D92, which represents the movement of the mouth portion and is generated from the audio encoded data D88, and the wire frame data D91, which is generated from the motion image component data D90, and corrects the image distortion of the moving portions according to the wire frame data D91 and D92. It thereby generates composite image data D93 in which the eye portion and the mouth portion move in time with the voice, and outputs this as a display image.
[0122]
As a result, the receiving terminal 82 not only displays an image in which the mouth of the face image appears to move in time with the voice of the user of the transmitting terminal 81, but also displays the movement of the eye portion at that time, so that an expression-rich face image can be shown as the display image.
[0123]
At this time, the receiving terminal 82 can generate the display image with the minimum necessary amount of data processing from the image data D1 of the reference face image for one frame, the motion image component data D83 relating to the eye portion, and the audio encoded data D5 relating to the mouth portion, which are associated with one another. The amount of data processing can therefore be greatly reduced compared with the conventional receiving terminal 3 (FIG. 6), and power consumption can be further reduced.
[0124]
According to the above configuration, in the videophone system 80 of the fourth embodiment, the transmitting terminal 81 can associate the image data D1 of the reference face image for one frame, the motion image component data D83 relating to the eye portion, and the audio encoded data D5 relating to the mouth portion with one another without wasteful overlap, and transmit each as the minimum necessary amount of data. The amount of transmitted data can thus be markedly reduced and power consumption further reduced compared with the conventional transmitting terminal 2.
[0125]
In addition, in the videophone system 80, the receiving terminal 82 can generate a display image with the minimum necessary amount of data processing from the image data D1 of the reference face image for one frame, the motion image component data D83 relating to the eye portion, and the audio encoded data D5 relating to the mouth portion. The amount of data processing can thus be significantly reduced compared with the conventional receiving terminal 3, and power consumption can be further reduced.
[0126]
At this time, the receiving terminal 82 can display the reference face image with the mouth portion and the eye portion moving in time with the voice, and can therefore provide an expression-rich, higher-dimensional display image in real time.
[0127]
(5) Other embodiments
In the first embodiment described above, the case has been described in which the receiving terminal 22 reads out, from the motion image holding circuit 25, the motion image data D32 representing the movement of the mouth portion corresponding to the audio encoded data D30, and the synthesis circuit 26 combines it with the reference image data D31. However, the present invention is not limited to this. A speech recognition circuit may be provided downstream of the audio decoding circuit 12 to recognize the speech, and either the recognition result is converted into text and the corresponding motion image data D32 is read out from the motion image holding circuit 25, or the recognition result is converted into phonetic symbols and the corresponding motion image data D32 is read out from the motion image holding circuit 25, and the data read out is combined with the reference image data D31.
[0128]
Further, in the above-described first embodiment, the case where the transmission side terminal 21 compresses and encodes the image data D23 representing the user's face image and transmits the image data D23 to the reception side terminal 22 has been described. However, the present invention is not limited thereto, and the image data D23 representing the character's face may be compressed and transmitted.
[0129]
Furthermore, in the first embodiment described above, the case has been described in which the receiving terminal 22 reads out, from the motion image holding circuit 25, the motion image data D32 representing the movement of the mouth portion corresponding to the audio encoded data D30, and the synthesis circuit 26 combines it with the reference image data D31. However, the present invention is not limited to this. The transmitting terminal 21 may designate motion image data representing the movement of the mouth portion corresponding to the audio encoded data D5 and transmit motion image identification information corresponding to the designated motion image data to the receiving terminal 22, so that the receiving terminal 22 reads out the motion image data corresponding to the motion image identification information and combines it with the reference image data.
[0130]
In this case, in FIG. 5, in which parts corresponding to those in FIG. 1 are assigned the same reference numerals, the transmitting terminal 21 of the videophone system 100 sends the audio encoded data D5, compression-encoded by the audio compression encoding circuit 7, to the multiplexing circuit 6 and the motion image database 101. The motion image database 101 identifies the motion image data representing the movement of the mouth portion that corresponds to the audio encoded data D5, and sends motion image identification information D101 corresponding to the identified motion image data to the multiplexing circuit 6.
[0131]
The multiplexing circuit 6 multiplexes the image encoded data D24, the audio encoded data D5, and the motion image identification information D101, and sends the resulting multiplexed data D102 to the modulation circuit 8. In this case as well, the multiplexing circuit 6 multiplexes the image encoded data D24, the audio encoded data D5, and the motion image identification information D101 only when image encoded data D24 for one frame is supplied from the image compression encoding circuit 24 at the beginning of communication. Because the image encoded data D24 is not supplied thereafter, it then sends to the modulation circuit 8 multiplexed data D102 obtained by multiplexing only the audio encoded data D5 and the motion image identification information D101. The modulation circuit 8 sequentially modulates the multiplexed data D102, and transmits this to the receiving terminal 22 via the communication path 9 as transmission data D103.
[0132]
The receiving terminal 22 receives the transmission data D103 transmitted via the communication path 9 as reception data D104 and sends it to the demodulation circuit 10. The demodulation circuit 10 obtains demodulated data D105 by performing demodulation processing on the received data D104, and sends this to the separation circuit 11. The separation circuit 11 separates the demodulated data D105 into image encoded data D29, audio encoded data D30, and motion image identification information D106, corresponding respectively to the original image encoded data D24, audio encoded data D5, and motion image identification information D101. It sends the audio encoded data D30 to the audio decoding circuit 12, the image encoded data D29 to the image decoding circuit 14, and the motion image identification information D106 to the motion image database 103 of the image synthesis unit 102.
[0133]
The motion image database 103 stores the same motion image data as the motion image database 101 provided in the transmitting terminal 21, reads out the motion image data D107 corresponding to the motion image identification information D106, and sends it to the synthesis circuit 26. The synthesis circuit 26 superimposes the motion image data D107 sequentially supplied from the motion image database 103 on the reference image data D31 supplied from the image data holding circuit 15, thereby generating composite image data D108 in which only the mouth portion of the reference face image moves, and outputs this as a display image via a display unit.
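The identifier-based scheme of FIG. 5 can be pictured with the Python sketch below, in which both terminals hold the same table and only a small identifier crosses the communication path. The table contents, the audio class labels, and the function names are hypothetical; the original does not describe how the motion image database 101 classifies the audio.

```python
from typing import Dict

# Identical table held by both terminals (contents are illustrative only).
MOTION_IMAGES: Dict[int, str] = {
    0: "mouth_closed",
    1: "mouth_open_wide",
    2: "mouth_rounded",
}

# Hypothetical mapping from a coarse audio class to an identifier.
AUDIO_CLASS_TO_ID: Dict[str, int] = {"silence": 0, "a": 1, "o": 2}


def select_motion_id(audio_class: str) -> int:
    """Transmitting side: pick the identifier to send (role of D101)."""
    return AUDIO_CLASS_TO_ID.get(audio_class, 0)


def read_motion_image(motion_id: int) -> str:
    """Receiving side: look up the stored motion image (role of D107)."""
    return MOTION_IMAGES[motion_id]


sent_id = select_motion_id("a")
shown = read_motion_image(sent_id)
fallback = read_motion_image(select_motion_id("unknown"))
```

Because only the identifier is transmitted, the receiving side needs no audio analysis to choose the mouth image, at the cost of keeping the two tables in step.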
[0134]
Furthermore, in the second embodiment described above, the case has been described in which the transmitting terminal 41 transmits image identification information D42 corresponding to image data representing a character's face image to the receiving terminal 42. However, the present invention is not limited to this, and image identification information D42 corresponding to image data representing the user's face image may be transmitted to the receiving terminal 42.
[0135]
Further, in the second to fourth embodiments described above, the case has been described in which the receiving terminals 42, 62, and 82 use the wire frame generation circuits 44, 63, and 85 to generate wire frame data D50, D64, and D92 representing the movement of the mouth portion corresponding to the voice, and combine this with the reference image data D49, D65, and D16. However, the present invention is not limited to this. Instead of the wire frame generation circuits 44, 63, and 85, a motion image holding circuit 25 like that of the receiving terminal 22 in the first embodiment may be used to read out motion image data representing the movement of the mouth portion corresponding to the audio encoded data D30 and combine it with the reference image data D49, D65, and D16. Alternatively, a speech recognition circuit may be provided downstream of the audio decoding circuit 12 to recognize the speech, and either the recognition result is converted into text and the corresponding motion image data D32 is read out from the motion image holding circuit 25, or the recognition result is converted into phonetic symbols and the corresponding motion image data D32 is read out from the motion image holding circuit 25, and the data read out is combined with the reference image data D49, D65, and D16.
[0136]
Furthermore, in the first to fourth embodiments described above, the case has been described in which the movement of the mouth portion is decoded based on the word information corresponding to the audio encoded data received from the transmitting terminals 21, 41, 61, and 81, and a display image in which the mouth portion moves accordingly is displayed. However, the present invention is not limited to this, and a display image in which the mouth portion moves may be displayed based on word information corresponding to text data, phonetic symbols, or the like.
[0137]
Furthermore, in the first to fourth embodiments described above, the case has been described in which the movement of the mouth portion is decoded based on the audio encoded data received as word information from the transmitting terminals 21, 41, 61, and 81, and the corresponding movement is displayed as the display image. However, the present invention is not limited to this. For example, in communication between a base station and the receiving terminals 22, 42, 62, and 82, the receiving terminals 22, 42, 62, and 82 may decode the movement of the mouth portion based on word information corresponding to audio encoded data received from the base station, and display a display image in which the mouth portion moves accordingly.
[0138]
[Effects of the invention]
As described above, according to the present invention, the transmitting side transmits the user's voice data and a face image model for one frame to the receiving side and, considering the correlation that facial expression changes when the user's voice is emitted, transmits, only while the voice data is being transmitted to the receiving side, motion image component data of the wire frame that indicates the temporal change of each point of the eye portion of the face image model used as an analysis parameter. The receiving side generates eye portion motion image data based on the motion image component data, generates mouth portion motion image data based on the voice data, and synthesizes these with the face image model. This reduces the amount of data transmitted from the transmitting device to the receiving device while allowing an expression-rich face image to be displayed in which the eye portion and the mouth portion are correlated as a facial expression, thus realizing an image transmission system capable of real-time processing with a simple configuration and low power consumption.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a television telephone system according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration of a television telephone system according to a second embodiment of the present invention.
FIG. 3 is a block diagram showing a configuration of a television telephone system according to a third embodiment of the present invention.
FIG. 4 is a block diagram showing a configuration of a television telephone system according to a fourth embodiment of the present invention.
FIG. 5 is a block diagram showing a configuration of a videophone system according to another embodiment.
FIG. 6 is a block diagram showing a configuration of a conventional television telephone system.
[Explanation of symbols]
1, 20, 40, 60, 80, 100... TV telephone system, 2, 21, 41, 61, 81... Transmitting terminal, 3, 22, 42, 62, 82... Receiving terminal, 4, 24... Image compression encoding circuit, 5, 83... Motion image component extraction circuit, 6... Multiplexing circuit, 7... Audio compression encoding circuit, 8... Modulation circuit, 9... Communication path, 10... Demodulation circuit, 11... Separation circuit, 12... Audio decoding circuit, 13, 84... Motion image component decoding circuit, 14... Image decoding circuit, 15... Image data holding circuit, 16, 26, 64, 86... Synthesis circuit, 25... Motion image holding circuit, 23, 42, 43, 65... Image database, 44, 63, 85... Wire frame generation circuit.

Claims

An image transmission system including a transmission device and a reception device, wherein
the transmission device comprises:
voice data transmitting means for transmitting user voice data collected by a microphone to the reception device;
face image model transmitting means for generating a wire-frame face image model by dividing face image data for one frame, obtained by photographing the user's face, into a plurality of points and connecting the points, and for transmitting the face image model to the reception device; and
motion image component data transmitting means for, in consideration of the correlation that facial expression changes when the user's voice is emitted, using each point in the eye portion of the face image model as an analysis parameter only while the voice data is being transmitted to the reception device, extracting the temporal change of the analysis parameter as motion image component data of the wire frame, and transmitting the motion image component data to the reception device, and
the reception device comprises:
face image model holding means for holding the face image model;
eye portion motion image data generating means for generating eye portion motion image data with movement for the eye portion of the face image model based on the motion image component data;
mouth portion motion image data generating means for generating mouth portion motion image data by decoding the motion state of the mouth portion of the face image model based on the voice data;
synthesizing means for generating a composite image by synthesizing the eye portion motion image data and the mouth portion motion image data with the face image model; and
display means for displaying the composite image.