JP3299797B2

JP3299797B2 - Composite image display system

Info

Publication number: JP3299797B2
Application number: JP33552692A
Authority: JP
Inventors: 章中川; 映史森松; 喜一松田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-11-20
Filing date: 1992-11-20
Publication date: 2002-07-08
Anticipated expiration: 2017-07-08
Also published as: JPH06162167A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a composite image display system applicable to AV (audio/video) e-mail and similar services, in which merely sending text data conveys a message to the recipient through a synthesized moving image of the sender's talking face and synthesized speech, much like a videophone. More particularly, it relates to a composite image display system that can convey to the display side the impression, such as voice quality and facial expression, intended by the side that created the text.

[0002]

[Prior Art] The technique of freely generating and uttering synthesized speech corresponding to arbitrary text information is called rule-based speech synthesis, and rule-based speech synthesizers that implement it have already been built. Rule-based speech synthesis is applied in various fields to improve the interface between humans and machines. In recent years, in the same way as speech synthesis, a technique has been developed that analyzes arbitrary text information to generate a moving image of a person, including the mouth movements made when speaking that text. Combining this with the speech synthesis technique described above yields a more natural interface.

[0003] For example, when this technique for synthesizing speech and facial moving images is applied to e-mail, a data file such as the sender's face image is prepared in advance on the receiving side. Whereas conventional e-mail merely displays text on the recipient's screen, the recipient can then be given an expressive message in which a moving image of the sender's talking face appears and the text is read aloud in synthesized speech.

[0004] FIG. 4 shows a configuration example of a speech and moving image output device that synthesizes and outputs speech and a facial moving image based on such text. In FIG. 4, reference numeral 1 denotes a sentence decomposition unit to which text information is input. The sentence decomposition unit 1 analyzes the input text information, generates pronunciation control data for speech output, and outputs it to the rule-based speech synthesis unit 2 and the speech/mouth-shape conversion unit 3. For example, when the text "ただいま" (tadaima, "I'm home") is input as text information, it is decomposed into phoneme data consisting of vowels and consonants, "T, A, D, A, I, M, A", and output.
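The decomposition step in [0004] can be illustrated with a minimal Python sketch. The kana table below is a hypothetical toy table that covers only the kana in the example; the patent does not specify how the sentence decomposition unit 1 performs its analysis, so this shows only the input and output described in the text.

```python
# Hypothetical kana-to-romaji table covering only the kana in the example.
KANA_TO_ROMAJI = {"た": "ta", "だ": "da", "い": "i", "ま": "ma"}


def decompose(text: str) -> list[str]:
    """Split kana text into a flat list of upper-case phoneme symbols."""
    phonemes: list[str] = []
    for kana in text:
        romaji = KANA_TO_ROMAJI[kana]  # KeyError for kana outside the toy table
        phonemes.extend(romaji.upper())
    return phonemes


print(decompose("ただいま"))  # → ['T', 'A', 'D', 'A', 'I', 'M', 'A']
```

A real analyzer would also have to handle kanji, long vowels, and similar cases, which this sketch deliberately ignores.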

[0005] The rule-based speech synthesis unit 2 is a device that generates and outputs synthesized speech reading a text aloud, based on the phoneme data for any given text.

[0006] The speech/mouth-shape conversion unit 3 is a device that converts the phoneme data of any given text into a sequence of mouth-shape codes representing the series of mouth movements made when the text is pronounced. There are, for example, seven mouth-shape codes: A (vowel "a"), I (vowel "i"), U (vowel "u"), E (vowel "e"), O (vowel "o"), S (consonant), and C (closed mouth). For each mouth-shape code, an image of the mouth shape made when pronouncing it is prepared in advance. For example, when the text "ただいま" mentioned above is input as text information, the unit assigns, based on the phoneme data "TADAIMA" of that text, "T" → mouth-shape code S, "A" → mouth-shape code A, "D" → mouth-shape code S, "I" → mouth-shape code I, "M" → mouth-shape code C, and "A" → mouth-shape code A, and outputs them to the image display control unit 5 as a sequence of mouth-shape codes.
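The phoneme-to-mouth-shape assignment in [0006] amounts to a small lookup, sketched below in Python. The text states only that vowels map to their own code, that "T" and "D" map to S, and that "M" maps to C. Treating "B" and "P" as closed-mouth phonemes as well is an assumption made here for illustration.

```python
VOWELS = {"A", "I", "U", "E", "O"}
# Assumption: the text gives only M -> C; B and P are added as other bilabials.
CLOSED_MOUTH = {"M", "B", "P"}


def to_mouth_codes(phonemes):
    """Map each phoneme to one of the seven mouth-shape codes A, I, U, E, O, S, C."""
    codes = []
    for p in phonemes:
        if p in VOWELS:
            codes.append(p)      # a vowel uses its own mouth shape
        elif p in CLOSED_MOUTH:
            codes.append("C")    # closed mouth
        else:
            codes.append("S")    # any other consonant
    return codes


print(to_mouth_codes("TADAIMA"))  # → ['S', 'A', 'S', 'A', 'I', 'C', 'A']
```

The full sequence contains seven codes, one per phoneme, including the "A" that follows "D".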

[0007] Composite image data is filed in the image memory 6. This composite image data combines, in a single file, one frame of a head-and-shoulders image of the speaker and the data of seven mouth-region images, synthesized from that image, corresponding to the seven mouth-shape codes described above.

[0008] Based on the pronunciation control data from the sentence decomposition unit 1, the pronunciation time calculation unit 4 uses exactly the same algorithm as the rule-based speech synthesis unit 2 to calculate the time at which each syllable is pronounced when the speech is synthesized. That is, for the input text, it estimates the timing of each syllable boundary, measured from the start of the text, as it will occur when the rule-based speech synthesis unit 2 synthesizes and utters that text, and outputs the result to the image display control unit 5.
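The boundary estimate in [0008] can be sketched as a running sum over syllable durations. The patent says only that the unit applies the same algorithm as the speech synthesizer and does not give that algorithm, so the fixed 150 ms per syllable and the consonant-plus-vowel grouping rule below are assumptions chosen purely for illustration.

```python
from itertools import accumulate

VOWELS = set("AIUEO")


def syllables(phonemes):
    """Group phonemes into syllables, closing a syllable at each vowel."""
    out, current = [], []
    for p in phonemes:
        current.append(p)
        if p in VOWELS:
            out.append("".join(current))
            current = []
    if current:  # trailing consonant with no vowel after it
        out.append("".join(current))
    return out


def boundary_times_ms(phonemes, ms_per_syllable=150):
    """Return (syllable, end time in ms from the start of the text) pairs."""
    sylls = syllables(phonemes)
    ends = accumulate(ms_per_syllable for _ in sylls)
    return list(zip(sylls, ends))


print(boundary_times_ms("TADAIMA"))
# → [('TA', 150), ('DA', 300), ('I', 450), ('MA', 600)]
```

A real synthesizer would assign a different duration to each syllable; only the cumulative-sum structure carries over.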

[0009] Based on the timing signal from the pronunciation time calculation unit 4, the image display control unit 5 performs image display control so that, when the pronunciation timing of each syllable arrives, the mouth-shape image corresponding to the mouth-shape code of that syllable is selected from the image memory 6 and output. In other words, it performs synchronization control so that the movement of the speaker's mouth displayed on the screen matches the speech uttered by the rule-based speech synthesis unit 2, that is, so that the synthesized speech and the facial moving image are synchronized.
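One way to picture the synchronization in [0009] is as a lookup of the mouth-shape code that is active at a given playback time. The sketch below assumes a schedule of (start time in ms, mouth-shape code) pairs. The timings are invented for illustration, and showing the closed mouth before speech starts is also an assumption.

```python
from bisect import bisect_right


def code_at(schedule, t_ms):
    """Return the mouth-shape code active at time t_ms.

    schedule is a list of (start_ms, code) pairs sorted by start_ms.
    """
    starts = [start for start, _ in schedule]
    i = bisect_right(starts, t_ms) - 1
    # Assumption: show the closed mouth (C) before the first entry begins.
    return schedule[i][1] if i >= 0 else "C"


# Hypothetical timings for the codes of "TADAIMA".
schedule = [(0, "S"), (50, "A"), (150, "S"), (200, "A"),
            (300, "I"), (450, "C"), (500, "A")]

print(code_at(schedule, 160))  # → S
print(code_at(schedule, 320))  # → I
```

The display loop would call such a lookup at each frame and copy the matching mouth-region image to the screen.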

[0010] The parameter input unit 7 is where various parameters, such as the voice quality of the speech synthesized by the rule-based speech synthesis unit 2 and the on-screen display location and display magnification of the facial moving image, are entered using a keyboard or the like. Parameters relating to the synthesized speech are passed to the rule-based speech synthesis unit 2, and parameters relating to the facial moving image are passed to the image display control unit 5 and the image memory 6.

[0011] The operation of the device configured in this way is as follows. When text information is input, the sentence decomposition unit 1 analyzes it, and the phoneme data is passed as a batch to the rule-based speech synthesis unit 2, which utters it as synthesized speech. In parallel with this utterance, the speech/mouth-shape conversion unit 3 converts the phoneme data into a sequence of mouth-shape codes. The pronunciation time calculation unit 4 estimates the time of each syllable boundary from the phoneme data and passes this time data to the image display control unit 5. The image display control unit 5 aligns the timing of the mouth-shape codes with the pronunciation timing of each syllable and, from the mouth-shape images loaded in the image memory 6, transfers to VRAM the facial moving image data corresponding to the mouth-shape code obtained by the speech/mouth-shape conversion unit 3, so that the speaker's facial moving image is displayed on the screen of the display device via the VRAM. The text information is thus conveyed to the recipient as a message consisting of synthesized speech that actually pronounces it and a facial moving image of the speaker whose mouth movements are timed to that speech.

[0012] The device of FIG. 4 can be realized as a small, economical system by using an existing small speech synthesis unit for the rule-based speech synthesis unit 2 and a personal computer or the like for the remaining parts.

[0013]

[Problem to Be Solved by the Invention] When such a speech and facial moving image output device is implemented on a personal computer, it is common practice, in order to reduce the processing load, to create the composite images in advance as described above and to switch between them for display according to the input text. When these devices generate synthesized speech and a facial moving image, parameters such as voice quality, the display location of the image on the screen, and the display magnification are those preset as initial values in the displaying system (those entered in advance through the parameter input unit 7).

[0014] Thus, in the conventional device, the voice quality of the synthesized speech and the manner in which the facial moving image is generated are set in advance on the receiving side. If the person in the pre-registered face image and the voice quality do not suit the message, for example, the result strikes the viewer as unnatural.

[0015] Moreover, as typified by the use of this device for e-mail, when the person who created the text information and composite image differs from the person who actually views it as speech and a moving image, the receiving side will not necessarily utter the speech and display the image with the voice quality and image size that the creator wanted. As a result, the recipient may be given an impression entirely different from what the sender intended.

[0016] In short, in the conventional device, the impression that the person on the display side receives from the voice quality, facial appearance, and so on when a message is conveyed by speech and moving image is determined by parameters preset on the display side, so the impression intended by the person who created the information could not be conveyed accurately to the person on the display side.

[0017] The present invention has been made in view of these problems. Its object is to make it possible, when the display side displays synthesized speech or a synthesized face image based on text information, to display it exactly as intended by the person who created the text.

[0018]

[Means for Solving the Problem] FIG. 1 is a diagram explaining the principle of the present invention. In one aspect, the present invention provides a composite image display system that generates, from arbitrary text information, corresponding synthesized speech and a synthesized moving image of the face of a person whose mouth moves in time with the synthesized speech, the system comprising: synthesized moving image data creation means for creating synthesized moving image data of the face on the side that creates the text information; parameter addition means for adding, to the synthesized moving image data created by the synthesized moving image data creation means, various parameters that determine how the synthesized speech and the synthesized moving image are generated on the display side; and transmission means for transmitting to the display side, as a single item of transmission data, the text information, the synthesized moving image data created by the synthesized moving image data creation means, and the various parameters added to the synthesized moving image data by the parameter addition means.

[0019] The various parameters may include at least a parameter relating to the sound quality of the synthesized speech and parameters relating to the display magnification and display position of the synthesized moving image.

[0020] In another aspect, the present invention provides a composite image display system comprising: transmission data input means for receiving the transmission data transmitted from the transmission means described above and separating the synthesized moving image data, text information, and various parameters contained in the transmission data from one another; speech synthesis means for generating and outputting synthesized speech based on the text information; an image memory for filing the synthesized moving image data separated by the transmission data input means; conversion means for converting the text information into a sequence of mouth-shape codes representing the series of mouth-shape movements made when the text information is uttered; pronunciation time calculation means for calculating, based on the text information, the pronunciation time of each syllable of the synthesized speech output from the speech synthesis means and estimating the timing of each break in the speech; display control means for switching the displayed image, at the timing of each syllable boundary estimated by the pronunciation time calculation means, to the mouth-shape image corresponding to the mouth-shape code from the conversion means; and parameter input means for sending the various parameters separated by the transmission data input means to the corresponding internal circuits.

[0021]

[Operation] In the composite image display scheme of the present invention, when the face images needed for display are synthesized on the transmitting side, the display magnification and display position of the composite image, the voice quality of the synthesized speech, and other parameters that the person who synthesized them wants used on the display side are embedded in that same data. The display side uses the values embedded in the image data as the initial values of the system. The display system can thereby generate the synthesized speech and composite image exactly as intended by the person who synthesized the face images.

[0022] In the composite image display system according to the other aspect of the present invention, the transmission data input means separates the received transmission data into synthesized moving image data, text information, and various parameters; sentence decomposition means analyzes the text information and generates pronunciation control data; the speech synthesis means generates and outputs synthesized speech based on this pronunciation control data; the received synthesized moving image data is filed in the image memory; the conversion means converts the pronunciation control data into a sequence of mouth-shape codes; the pronunciation time calculation means calculates, based on the pronunciation control data, the pronunciation time of each syllable uttered by the speech synthesis means and estimates the timing of each syllable boundary; the image display control means reads the mouth-shape image for each syllable from the image memory in step with that syllable's timing signal; and the parameter input means sends the received parameters to the corresponding internal circuits, so that the synthesized speech and composite image are generated in the manner intended by the person who created the text.

[0023]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 2 shows a speech and facial moving image output device using a composite image display system as one embodiment of the present invention. In FIG. 2, the sentence decomposition unit 1, rule-based speech synthesis unit 2, speech/mouth-shape conversion unit 3, pronunciation time calculation unit 4, image display control unit 5, and image memory 6 are the same as those described for the conventional example above.

[0024] The difference from the conventional device is that the transmission data sent from the transmitting side contains, in addition to the text information itself, composite image data such as the face image and mouth-shape images that the sender wants synthesized and displayed on the receiving side, and that the display magnification, on-screen display position, voice quality, and other parameters the sender wants are embedded in that composite image data.

[0025] FIG. 3 shows the concept of the processing by which the transmitting side embeds these parameters in the transmission data. The original face image that the sender wants displayed on the receiving side is mapped onto a face model, composite image data is created using the parameters for each mouth shape, and the display magnification, display position, voice quality, and other parameters that affect the impression given to the recipient are embedded in it. The text information is created separately, and both are sent to the receiving side as transmission data. In this case, once the composite image data with its embedded parameters has been sent, only the text information needs to be sent for subsequent messages.
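The sender-side embedding in [0025] can be sketched as follows. The patent does not define a data format, so the byte layout (a 4-byte big-endian length, a JSON parameter header, then the image bytes), the function names, and the parameter keys are all hypothetical choices made for this illustration.

```python
import json
import struct


def embed_params(image_bytes: bytes, params: dict) -> bytes:
    """Prefix the composite image data with a length-tagged JSON parameter header."""
    header = json.dumps(params, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(header)) + header + image_bytes


def build_transmission(text: str, image_blob: bytes) -> bytes:
    """Join the text information and the parameter-carrying image data into one item."""
    body = text.encode("utf-8")
    return struct.pack(">I", len(body)) + body + image_blob


# Hypothetical parameter names and placeholder image bytes.
params = {"voice_quality": "soft", "magnification": 1.5, "position": [40, 20]}
blob = embed_params(b"<composite image bytes>", params)
message = build_transmission("ただいま", blob)
print(len(message))
```

Because the parameters travel inside the image data, a sender could transmit `blob` once and send only the text afterward, as the paragraph above notes.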

[0026] On the receiving side, this transmission data is input to the transmission data input unit 8, where it is separated into text information, composite image data, and various parameters. The text information is sent to the sentence decomposition unit 1, the composite image data to the image memory 6, and the parameters to the parameter input unit 7.
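The separation performed by the transmission data input unit 8 in [0026] can be sketched as a parser. The byte layout assumed here is hypothetical and is not taken from the patent: a 4-byte big-endian text length, the UTF-8 text, a 4-byte parameter header length, a JSON parameter header, then the image bytes.

```python
import json
import struct
from typing import NamedTuple


class Separated(NamedTuple):
    text: str      # goes to the sentence decomposition unit 1
    image: bytes   # goes to the image memory 6
    params: dict   # goes to the parameter input unit 7


def separate(data: bytes) -> Separated:
    """Split one item of transmission data into text, image data, and parameters."""
    (text_len,) = struct.unpack_from(">I", data, 0)
    text = data[4:4 + text_len].decode("utf-8")
    blob = data[4 + text_len:]
    (hdr_len,) = struct.unpack_from(">I", blob, 0)
    params = json.loads(blob[4:4 + hdr_len])
    image = blob[4 + hdr_len:]
    return Separated(text, image, params)


# Build a small sample in the assumed layout and split it again.
_hdr = json.dumps({"voice_quality": "soft"}).encode("utf-8")
_txt = "ただいま".encode("utf-8")
sample = (struct.pack(">I", len(_txt)) + _txt
          + struct.pack(">I", len(_hdr)) + _hdr + b"IMG")
print(separate(sample))
```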

[0027] On receiving the parameters, the parameter input unit 7 examines them and sends parameters relating to speech synthesis, such as voice quality, to the rule-based speech synthesis unit 2, and parameters relating to the image, such as display magnification and display position, to the image display control unit 5 and the image memory 6.
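The routing in [0027] can be sketched as a split of the received parameters by destination. The key names, the receiver-side defaults, and the dictionary-based return value are hypothetical. The text names only voice quality, display magnification, and display position as examples.

```python
# Hypothetical parameter names, grouped by the unit that consumes them.
SPEECH_KEYS = {"voice_quality"}
IMAGE_KEYS = {"magnification", "position"}

# Hypothetical receiver-side defaults, used only when the sender omits a value.
RECEIVER_DEFAULTS = {"voice_quality": "neutral", "magnification": 1.0, "position": (0, 0)}


def route_parameters(received: dict) -> dict:
    """Overlay the sender's values on the defaults and group them by destination."""
    effective = {**RECEIVER_DEFAULTS, **received}
    speech = {k: effective[k] for k in SPEECH_KEYS}
    image = {k: effective[k] for k in IMAGE_KEYS}
    return {
        "speech_synthesis_unit_2": speech,
        "image_display_control_unit_5": image,
        "image_memory_6": image,
    }


routed = route_parameters({"voice_quality": "soft", "magnification": 1.5})
print(routed["speech_synthesis_unit_2"])  # → {'voice_quality': 'soft'}
```

Letting the sender's values override the receiver's defaults reflects the stated aim that the embedded values serve as the system's initial values.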

[0028] With this configuration, the receiving side can use the face image that the sender wanted as the face image to be displayed, and the display magnification, display position, voice quality, and other parameters that the sender wanted as the parameters set as the system's initial values. The message can therefore be presented on the receiving side's display system with the speech and image that the sender intended.

[0029] Various modifications are possible in carrying out the present invention. For example, the embodiment above uses seven mouth-shape images, but the invention is of course not limited to this, and the number of mouth-shape images may be increased further to synthesize more natural mouth movements. The embodiment above also takes the movement of the mouth region as the moving part of the face image synthesized on the display side, but the invention is not limited to this either. For example, if eye movements and the like are also varied to suit the text in addition to the mouth movements, a more expressive AV message can be sent to the recipient.

[0030] The embodiment above describes the application of the present invention to AV e-mail, but the invention is not limited to this. It can also be applied to a standalone speech and facial moving image output device. Alternatively, if speech recognition technology makes it possible, for example, to recognize the phonemes of spoken speech in real time, it can be applied to services such as a pseudo-videophone in which simply placing an ordinary telephone call lets the speaker's facial expressions also be shown to the recipient as a moving image.

[0031]

[Effect of the Invention] As described above, according to the present invention, when the receiving side displays synthesized speech or a synthesized face image based on text information, it can display it exactly as intended by the person who sent the text.

[Brief description of the drawings]

[FIG. 1] A diagram explaining the principle of the present invention.

[FIG. 2] A diagram showing a speech and moving image output device using a composite image display system as one embodiment of the present invention.

[FIG. 3] A diagram explaining the concept of the processing on the sending side in the system of the embodiment.

[FIG. 4] A diagram showing a conventional speech and moving image output device.

[Explanation of symbols]

1: sentence decomposition unit
2: rule-based speech synthesis unit
3: speech/mouth-shape conversion unit
4: pronunciation time calculation unit
5: image display control unit
6: image memory
7: parameter input unit
8: transmission data input unit

Continuation of front page

(56) References: JP-A-2-234285 (JP, A); JP-A-2-196585 (JP, A)

(58) Fields searched (Int. Cl.⁷, DB name): G06F 3/00; G06T 13/00; G06T 15/70; G10L 13/00; G10L 21/06; H04N 7/14 - 7/15

Claims

(57) [Claims]

1. A composite image display system for generating, from arbitrary text information, corresponding synthesized speech and a synthesized moving image of the face of a person whose mouth moves in time with the synthesized speech, the system comprising: synthesized moving image data creation means for creating synthesized moving image data of the face on the side that creates the text information; parameter addition means for adding, to the synthesized moving image data created by the synthesized moving image data creation means, various parameters that determine how the synthesized speech and the synthesized moving image are generated on the display side; and transmission means for transmitting to the display side, as a single item of transmission data, the text information, the synthesized moving image data created by the synthesized moving image data creation means, and the various parameters added to the synthesized moving image data by the parameter addition means.

2. The composite image display system according to claim 1, wherein the various parameters include at least a parameter relating to the sound quality of the synthesized speech and parameters relating to the display magnification and display position of the synthesized moving image.

3. A composite image display system comprising: transmission data input means for receiving the transmission data transmitted from the transmission means in the composite image display system according to claim 1 or 2 and separating the synthesized moving image data, text information, and various parameters contained in the transmission data from one another; speech synthesis means for generating and outputting synthesized speech based on the text information; an image memory for filing the synthesized moving image data separated by the transmission data input means; conversion means for converting the text information into a sequence of mouth-shape codes representing the series of mouth-shape movements made when the text information is uttered; pronunciation time calculation means for calculating, based on the text information, the pronunciation time of each syllable of the synthesized speech output from the speech synthesis means and estimating the timing of each break in the speech; display control means for switching the displayed image, at the timing of each syllable boundary estimated by the pronunciation time calculation means, to the mouth-shape image corresponding to the mouth-shape code from the conversion means; and parameter input means for sending the various parameters separated by the transmission data input means to the corresponding internal circuits.