JP2003037826A

JP2003037826A - Substitute image display and tv phone apparatus

Info

Publication number: JP2003037826A
Application number: JP2001221581A
Authority: JP
Inventors: Kiyoshi Hashimoto; 喜孔橋本
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2001-07-23
Filing date: 2001-07-23
Publication date: 2003-02-07

Abstract

PROBLEM TO BE SOLVED: To provide a TV phone apparatus that can send a substitute image having a natural expression suited to the content of the voice. SOLUTION: When the transmission target is set to a substitute image, a voice recognition section 10 performs predetermined voice recognition processing on the voice of a speaker collected by a microphone 2a in a handset 2, and identifies the pronunciation content (six kinds of sounds: the five Japanese vowels and the Japanese syllabic nasal "n"). A substitute image generation section 14 selects, from a plurality of kinds of substitute images stored in a substitute image memory 12, the substitute image whose mouth shape corresponds to the pronunciation content identified by the voice recognition section 10. The selected substitute image is sent to the listener together with the voice of the speaker.

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention

The present invention relates to a substitute image display device that generates and displays a substitute image of a speaker in accordance with the content of the speaker's voice, and to a videophone device with which a call is made while the voice of each speaker and an image of the speaker (or a substitute image of the speaker) are transmitted and received.

[0002]

Description of the Related Art

Recently, so-called videophone devices, which allow the parties to a voice call to also exchange captured images of each other, have been put into practical use and are beginning to spread in some areas. With such a videophone device, for example, a grandfather or grandmother can talk while looking at the face of a grandchild who lives far away, or a couple can talk while looking at each other's faces, so communication can be made richer than with a conventional telephone that exchanges only voice.

[0003] A typical videophone device basically offers two choices: sending the captured image of the speaker to the other party more or less as it is, or sending no image. In addition, from the viewpoint of privacy protection or improved entertainment value, a technique has been proposed in which one of various character images is sent to the other party as a substitute image that stands in for the image of the speaker.

[0004]

Problems to Be Solved by the Invention

In a conventional videophone device, when a substitute image using a character image or the like is transmitted, the expression of the transmitted substitute image is either fixed or limited to opening and closing the mouth, and therefore lacks expressiveness. As a result, to the other party, who talks while looking at this substitute image, the expression of the substitute image (in particular, the movement of the mouth) does not match the content of the speaker's voice and feels unnatural.

[0005] A similar problem applies to various devices that display a predetermined substitute image in time with a guidance voice. For example, some in-vehicle navigation devices display a substitute image using a character together with a predetermined guidance voice when announcing the course during route guidance or when various other operations are performed, so that the character appears to be speaking the guidance voice. However, the displayed substitute image has a uniform expression and lacks expressiveness, so it does not match the content of the guidance voice and feels unnatural.

[0006] The present invention was made in view of these points. One object is to provide a substitute image display device capable of displaying a substitute image with a natural expression that matches the content of the voice. Another object of the present invention is to provide a videophone device capable of sending a substitute image with a natural expression that matches the content of the voice.

[0007]

Means for Solving the Problems

To solve the problems described above, the substitute image display device of the present invention uses substitute image generation means to generate a substitute image of the speaker whose mouth shape changes in accordance with the content of the speaker's voice output from voice output means, and displays the generated image with display means. Because the displayed substitute image has its mouth shape changed in accordance with the content of the speaker's voice, a substitute image with a natural expression that matches the content of the voice can be displayed.

[0008] It is desirable that the substitute image generation means described above extract the vowels of the voice output from the voice output means as sounds of interest and generate the substitute image corresponding to each sound of interest. In general, the shape of the mouth when a person speaks is determined mostly by the vowel; for example, it is almost the same for any of the Japanese syllables "a, ka, sa, ..., wa". Therefore, by extracting the vowels of the voice as sounds of interest, a substitute image with a mouth shape that matches the content of the voice can be generated accurately and with a small processing load.
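
As an illustration of paragraph [0008], the following minimal Python sketch assumes the recognized speech is already available as romanized morae such as "ka" or "sa". That input format and the function name `vowel_of` are assumptions of this sketch and do not come from the disclosure. It shows that every mora in the "a" row reduces to the same sound of interest.

```python
from typing import Optional

VOWELS = "aiueo"

def vowel_of(mora: str) -> Optional[str]:
    """Return the vowel that ends a romanized mora, or None if it has none."""
    mora = mora.lower()
    if mora and mora[-1] in VOWELS:
        return mora[-1]
    return None

# Every mora in the "a" row maps to the same vowel, hence the same mouth shape.
row_a = ["a", "ka", "sa", "ta", "na", "ha", "ma", "ya", "ra", "wa"]
print({vowel_of(m) for m in row_a})
```

Reducing each mora to one of five classes is what keeps the processing load small: the later image selection only has to distinguish a handful of mouth shapes.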

[0009] It is also desirable that the substitute image generation means extract the syllabic nasal as a sound of interest in addition to the vowels of the voice. In general, when the syllabic nasal (the sound written "ん" in Japanese) is uttered, the mouth is closed, unlike with a vowel. Extracting the syllabic nasal as a sound of interest therefore allows a substitute image with a mouth shape matching the content of the voice to be generated even more accurately.

[0010] It is also desirable that a plurality of types of still substitute images corresponding to the sounds of interest be prepared in advance, and that the substitute image generation means generate the substitute image by selecting, from among these still substitute images, the one corresponding to the voice being output. Because the substitute image can be generated by the simple process of selecting one of the still substitute images according to the extracted sound of interest, the processing load can be reduced.

[0011] Alternatively, a plurality of types of mouth shape images corresponding to the sounds of interest and a common image covering everything other than the mouth shape images may be prepared in advance, and the substitute image generation means may generate the substitute image by selecting, from among the mouth shape images, the one corresponding to the voice being output and combining it with the common image. In this case, the amount of image data to be prepared is smaller than when a plurality of types of still substitute images are prepared, so the storage capacity of the memory that holds the data for generating the substitute image can be reduced.
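
The combining step in paragraph [0011] can be sketched as follows. This is a toy model only: images are nested lists of pixel values, and the function name `composite`, the 4x4 image size, and the patch position are all assumptions chosen for illustration.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values

def composite(common: Image, mouth: Image, top: int, left: int) -> Image:
    """Return a copy of `common` with the `mouth` patch pasted at (top, left)."""
    out = [row[:] for row in common]          # copy so the common image stays reusable
    for r, mouth_row in enumerate(mouth):
        for c, pixel in enumerate(mouth_row):
            out[top + r][left + c] = pixel
    return out

common = [[0] * 4 for _ in range(4)]          # 4x4 face with no mouth drawn
mouth_a = [[1, 1]]                            # 1x2 mouth patch for "a" (toy data)
face_a = composite(common, mouth_a, top=3, left=1)

# Storage comparison for six sounds of interest, in pixels.
full_images = 6 * 4 * 4                       # six complete still images
patches = 4 * 4 + 6 * 1 * 2                   # one common image plus six mouth patches
print(face_a[3], full_images, patches)
```

Even in this toy case the patch-based layout stores far fewer pixels than six complete images, which is the memory saving the paragraph describes.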

[0012] It is also desirable that, when switching the selected image corresponding to the sound of interest, the substitute image generation means generate the substitute image by inserting, between the preceding and following selected images, an interpolated image made from those two images. This makes the change from the preceding selected image to the following one smooth, so a more natural expression can be rendered.
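
The disclosure does not say how the interpolated image is computed. One simple possibility, shown here purely as an assumption, is a linear cross-fade between the two selected images. The function name `interpolate` and the nested-list image representation are hypothetical.

```python
from typing import List

Image = List[List[float]]  # toy grayscale image: rows of pixel values

def interpolate(prev: Image, nxt: Image, steps: int) -> List[Image]:
    """Return `steps` frames blended linearly between prev and nxt (endpoints excluded)."""
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)                   # blend weight, strictly between 0 and 1
        frames.append([
            [(1 - t) * p + t * q for p, q in zip(prev_row, next_row)]
            for prev_row, next_row in zip(prev, nxt)
        ])
    return frames

closed = [[0.0, 0.0]]                         # toy "mouth closed" image
opened = [[10.0, 20.0]]                       # toy "mouth open" image
middle = interpolate(closed, opened, steps=1)
print(middle)
```

The frames returned here would be shown between the preceding and following selected images; a larger `steps` value gives a smoother but slower transition.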

[0013] It is also desirable that the substitute image generation means generate the substitute image corresponding to the syllabic nasal when the output of voice by the voice output means stops. As a result, when the speaker's voice breaks off, the substitute image corresponding to the syllabic nasal, that is, the substitute image with the mouth closed, is displayed. This prevents the display from becoming unnatural because a substitute image with an open mouth is shown even though the speaker is not speaking.
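
A minimal sketch of the rule in paragraph [0013], assuming the recognizer reports one sound of interest per frame and reports `None` when no voice is present. The per-frame representation and the name `mouth_sequence` are assumptions of this sketch.

```python
from typing import List, Optional

CLOSED_MOUTH = "n"  # the syllabic nasal, whose image has the mouth closed

def mouth_sequence(sounds: List[Optional[str]]) -> List[str]:
    """Replace frames with no detected voice (None) by the closed-mouth sound."""
    return [CLOSED_MOUTH if s is None else s for s in sounds]

print(mouth_sequence(["a", "i", None, None, "o"]))
```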

[0014] It is also desirable to further provide sound collecting means for collecting the voice of the speaker and voice recognition means for determining the content of the voice collected by the sound collecting means, with the voice output means outputting voice based on the voice collected by the sound collecting means and the substitute image generation means generating the substitute image based on the content of the voice determined by the voice recognition means. This makes it possible to respond to the speaker's voice in real time and generate a substitute image corresponding to its content.

[0015] In the videophone device of the present invention, the voice of the speaker is collected by sound collecting means and an image of the speaker is captured by imaging means; the content of the collected voice is determined by voice recognition means; and a substitute image of the speaker whose mouth shape changes in accordance with the determined content is generated by substitute image generation means. Together with the collected voice, either the image of the speaker captured by the imaging means or the substitute image generated by the substitute image generation means is transmitted to the other party by transmission means. The voice and image sent from the other party are received by receiving means; the received voice is output by voice output means and the received image is displayed by display means. Because a substitute image whose mouth shape is changed in accordance with the content of the speaker's voice is generated, a substitute image with a natural expression that matches the content of the voice can be sent to the other party.

[0016] It is desirable that the voice recognition means described above determine the vowels of the voice collected by the sound collecting means, and that the substitute image generation means set the vowels determined by the voice recognition means as sounds of interest and generate the substitute image corresponding to each sound of interest. As described above, the shape of the mouth when a person speaks is determined mostly by the vowel, so determining the vowels of the voice and setting them as sounds of interest allows a substitute image with a mouth shape that matches the content of the voice to be generated accurately and with a small processing load. It is also desirable that the voice recognition means determine the syllabic nasal in addition to the vowels of the voice, and that the substitute image generation means include the syllabic nasal among the sounds of interest. As described above, the mouth is closed when the syllabic nasal is uttered, so extracting the syllabic nasal as a sound of interest allows a substitute image with a mouth shape matching the content of the voice to be generated even more accurately.

[0017] It is also desirable that a plurality of types of still substitute images corresponding to the sounds of interest be prepared in advance, and that the substitute image generation means generate the substitute image by selecting, from among these still substitute images, the one corresponding to the content of the voice determined by the voice recognition means. Because the substitute image can be generated by the simple process of selecting one of the still substitute images, the processing load can be reduced.

[0018] Alternatively, a plurality of types of mouth shape images corresponding to the sounds of interest and a common image covering everything other than the mouth shape images may be prepared in advance, and the substitute image generation means may generate the substitute image by selecting, from among the mouth shape images, the one corresponding to the content of the voice determined by the voice recognition means and combining it with the common image. In this case, the amount of image data to be prepared is smaller than when a plurality of types of still substitute images are prepared, so the storage capacity of the memory that holds the data for generating the substitute image can be reduced.

[0019] It is also desirable that, when switching the selected image corresponding to the sound of interest, the substitute image generation means generate the substitute image by inserting, between the preceding and following selected images, an interpolated image made from those two images. This makes the change from the preceding selected image to the following one smooth, so a more natural expression can be rendered.

[0020] It is also desirable that the substitute image generation means generate the substitute image corresponding to the syllabic nasal when the collection of voice by the sound collecting means is interrupted. This prevents the display from becoming unnatural because, when the speaker's voice breaks off, a substitute image with an open mouth is shown even though the speaker is not speaking.

[0021] It is also desirable to further provide transmission image setting means that performs the following control. For an incoming call, the substitute image generated by the substitute image generation means is first transmitted from the transmission means to the other party, and the image of the speaker captured by the imaging means is transmitted only after the speaker gives a switching instruction. For an outgoing call, either the substitute image generated by the substitute image generation means or the image of the speaker captured by the imaging means is transmitted according to the speaker's selection instruction.
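
The control rule in paragraph [0021] can be modeled as a small state holder. The sketch below is only illustrative: the class and member names (`Source`, `TransmissionImageSetting`, `switch_to_camera`) are hypothetical, and the default of sending the substitute image on an outgoing call when the speaker makes no selection is an assumption that the disclosure does not state.

```python
from enum import Enum

class Source(Enum):
    SUBSTITUTE = "substitute"   # image from the substitute image generation means
    CAMERA = "camera"           # image of the speaker from the imaging means

class TransmissionImageSetting:
    def __init__(self, incoming: bool, selection: Source = Source.SUBSTITUTE):
        # Incoming calls always start with the substitute image;
        # outgoing calls follow the speaker's selection (default is an assumption).
        self.source = Source.SUBSTITUTE if incoming else selection

    def switch_to_camera(self) -> None:
        """Apply the speaker's switching instruction during a call."""
        self.source = Source.CAMERA

call = TransmissionImageSetting(incoming=True, selection=Source.CAMERA)
first = call.source             # substitute image, regardless of the selection
call.switch_to_camera()
print(first, call.source)
```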

[0022] A speaker who receives a call often does not want his or her own image to be sent to the other party immediately, for example when clothes or hair are not in order just after getting up. Therefore, by first transmitting the substitute image on an incoming call and transmitting the image of the speaker only when the speaker gives a switching instruction, the speaker's privacy can be protected. On an outgoing call, the speaker can freely choose the image to be transmitted.

[0023] It is also desirable to further provide emotion determination means for determining the emotion of the speaker, and for the substitute image generation means to generate a substitute image whose expression reflects the emotion determined by the emotion determination means. Reflecting the speaker's emotion in the expression of the substitute image makes that expression richer and helps communication go more smoothly.

[0024] It is also desirable that the emotion determination means determine the emotion of the speaker using at least one of the speaker's voice collected by the sound collecting means and the image of the speaker captured by the imaging means. With at least one of these, the speaker's emotion can be determined by various methods. Specifically, when the speaker's voice is used, the emotion can be determined from, for example, the intonation or speed of the voice or the content of the words. When the image of the speaker is used, the emotion can be determined from, for example, the position and size of facial features such as the eyes, eyebrows, and mouth, or the number of blinks.

[0025] It is also desirable to further provide delay means for delaying the transmission of the voice by the time required for the substitute image generation means to finish generating the substitute image. This prevents the output timing of the voice and the display timing of the substitute image from drifting apart and making the other party feel that something is off.

[0026] It is also desirable to further provide image input means for importing an arbitrary character image to be used as the substitute image, and for the substitute image generation means to generate the substitute image by changing the mouth shape of the character image imported by the image input means. Because an arbitrary character image can be imported and used to generate the substitute image, the speaker has more freedom in choosing a substitute image. A wide variety of character images can therefore be imported and used as the basis for substitute images, such as images the speaker created with a digital camera or scanner and images obtained over the Internet or from media such as CD-ROMs.

[0027] Speaker gender changing means may also be provided that switches the apparent gender of the speaker by changing the sound quality of the voice collected by the sound collecting means, with the voice being transmitted from the transmission means after the gender has been switched. In this way, for example, when a female user (speaker) receives a nuisance call, switching her voice so that it sounds male can deter the caller.

[0028]

Embodiments of the Invention

[First Embodiment] A videophone device according to an embodiment to which the present invention is applied is described below with reference to the drawings.

[0029] FIG. 1 is a diagram showing the configuration of the videophone device of the first embodiment. The videophone device 100 shown in FIG. 1 includes a camera 1, a handset 2, a display unit 3, an operation unit 4, a control unit 5, a communication processing unit 6, and a memory IF (interface) unit 7.

[0030] The camera 1 captures images of the user (speaker) who makes a call using the videophone device 100. The handset 2 includes a microphone 2a and a speaker 2b. It collects the voice uttered by the speaker with the microphone 2a, converts it into a voice signal, and outputs the signal to the control unit 5; it also outputs the voice of the other party from the speaker 2b based on the voice signal output from the control unit 5.

[0031] The display unit 3 displays operation details (for example, a telephone number) and the operating state of the videophone device 100. When an image is sent from the other party together with voice, the display unit 3 also displays that image. The operation unit 4 is used to enter the telephone number of the other party and to give operation instructions such as various settings, and has various operation keys.

[0032] The control unit 5 controls the overall operation of the videophone device 100 so that a call can be made by transmitting and receiving voice and images. Specifically, the control unit 5 performs control such as transmitting to the other party the speaker's voice together with either the image of the speaker captured by the camera 1 (hereinafter, the "self-image") or an image that takes the place of the self-image (hereinafter, the "substitute image"), and receiving the voice and images sent from the other party and outputting them to the handset 2 and the display unit 3. Typically, the functions of the control unit 5 are realized by executing a predetermined operating program on hardware such as a CPU, ROM, and RAM. The components of the control unit 5 are described in detail later.

[0033] The communication processing unit 6 performs the predetermined communication processing needed for a call with the other party. Specifically, the communication processing unit 6 transmits the image (substitute image or self-image) and voice output from the control unit 5 to the other party, and receives the image and voice sent from the other party and outputs them to the control unit 5.

[0034] The memory IF unit 7 is the interface through which the control unit 5 reads data stored on a memory card 50. The memory card 50 is a data storage medium built from semiconductor memory. In this embodiment, the content of the substitute image can be set freely by storing the desired image data on the memory card 50, inserting the card into the memory IF unit 7, and having the control unit 5 read the image data. The substitute image is described in detail later.

[0035] Next, the detailed configuration of the control unit 5 is described. The control unit 5 includes a voice recognition unit 10, a substitute image memory 12, a substitute image generation unit 14, a voice delay processing unit 16, and a transmission image setting unit 18. The voice recognition unit 10 performs predetermined voice recognition processing on the speaker's voice collected by the microphone 2a in the handset 2 and determines the pronunciation content. Specifically, the voice recognition unit 10 of this embodiment extracts from the speaker's voice, as sounds of interest, six kinds of pronunciation content: the five Japanese vowels "a, i, u, e, o" and the syllabic nasal "n". For example, whether the speaker utters "a", "ka", "sa", ..., or "wa", the voice recognition unit 10 identifies the pronunciation content as the vowel "a".

[0036] The substitute image memory 12 stores six kinds of substitute images (still substitute images), each with a mouth shape corresponding to one of the six kinds of pronunciation content described above. FIG. 2 is a diagram showing an example of the six kinds of substitute images stored in the substitute image memory 12. As shown in FIG. 2, the substitute image memory 12 stores six kinds of substitute images with mouth shapes corresponding to the six kinds of pronunciation content "a, i, u, e, o, n". Specifically, the substitute image 150a shown in FIG. 2(a) corresponds to "a", the substitute image 150b shown in FIG. 2(b) to "i", the substitute image 150c shown in FIG. 2(c) to "u", the substitute image 150d shown in FIG. 2(d) to "e", the substitute image 150e shown in FIG. 2(e) to "o", and the substitute image 150f shown in FIG. 2(f) to "n".

[0037] The example in FIG. 2 shows substitute images based on a predetermined character image, but other images may also be used as substitute images: images obtained when the speaker photographs his or her own face with the camera 1 for each of the six kinds of pronunciation content, images created with a digital camera or scanner, or arbitrary character images obtained through various media such as the Internet (for example, images of male characters, female characters, animal-like characters, or characters appearing in manga). It is also possible to allow selective use of several character images, or of images that each of a plurality of speakers (for example, family members) created by photographing their own faces.

【００３８】The proxy image generation unit 14 generates a proxy image by selecting, from the proxy images stored in the proxy image memory 12, the one corresponding to the pronunciation content identified by the voice recognition unit 10. For example, if the speaker utters "kondo wa (this time) ..." and the pronunciation content is identified as "o, n, o, a, ...", the proxy image generation unit 14 selects the proxy images stored in the proxy image memory 12 in the order of the proxy image 150e, the proxy image 150f, the proxy image 150e, the proxy image 150a, and so on, shown in FIG. 2. The proxy image selected by the proxy image generation unit 14 is transmitted by the communication processing unit 6 to the videophone device (not shown) of the other party.
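The selection described in this paragraph amounts to a table lookup. The following is a minimal sketch under stated assumptions: the romanized labels, the dictionary `PROXY_IMAGES`, and the function `select_proxy_images` are hypothetical names introduced for illustration, and the identifiers simply reuse the reference numerals of FIG. 2.

```python
# Hypothetical lookup table: recognized sound -> proxy image identifier.
# The identifiers reuse the reference numerals of FIG. 2 (an assumption
# made for illustration, not a data format given in the text).
PROXY_IMAGES = {
    "a": "150a", "i": "150b", "u": "150c",
    "e": "150d", "o": "150e", "n": "150f",
}


def select_proxy_images(sounds):
    """Return the proxy image identifier for each recognized sound, in order."""
    return [PROXY_IMAGES[sound] for sound in sounds]


# "kondo wa" is identified as the sounds o, n, o, a.
print(select_proxy_images(["o", "n", "o", "a"]))
# prints ['150e', '150f', '150e', '150a']
```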

【００３９】The audio delay processing unit 16 delays the speaker's voice collected by the microphone 2a by a predetermined time. Specifically, the audio delay processing unit 16 of this embodiment delays the voice by the time required for the voice recognition unit 10 to identify the pronunciation content of the speaker's voice and for the proxy image generation unit 14 to select a proxy image according to that content, that is, by the time required to complete generation of the proxy image. Because the videophone device 100 of this embodiment transmits the speaker's voice, delayed by this predetermined time, to the other party together with the proxy image, it prevents the content of the voice and the expression of the proxy image from falling out of step and making the other party feel that something is off.
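One way to picture the delay is as a fixed-length first-in, first-out buffer of audio frames. The sketch below is only an illustration under assumptions: the class `AudioDelay`, the notion of a "frame", and the use of `None` as a placeholder for silence are all hypothetical, and the text does not say how the delay is actually realized.

```python
from collections import deque


class AudioDelay:
    """Delay audio frames by a fixed number of frames (hypothetical helper)."""

    def __init__(self, delay_frames):
        # Pre-fill with None so that the first outputs represent silence.
        self._buffer = deque([None] * delay_frames)

    def push(self, frame):
        """Store the newest frame and return the one captured delay_frames earlier."""
        self._buffer.append(frame)
        return self._buffer.popleft()


delay = AudioDelay(delay_frames=2)
print([delay.push(f) for f in ["f0", "f1", "f2", "f3"]])
# prints [None, None, 'f0', 'f1']
```

Choosing `delay_frames` to match the time taken by recognition plus image selection keeps each frame of audio aligned with the proxy image derived from it.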

【００４０】When the speaker's self-image is transmitted, the audio delay processing unit 16 either performs no delay processing or uses a predetermined delay time shorter than the one used for the proxy image. The transmission image setting unit 18 sets which of the self-image and the proxy image is transmitted to the other party. Specifically, when a call is received, the transmission image setting unit 18 first sets the proxy image output from the proxy image generation unit 14 to be transmitted to the other party. Thereafter, when the speaker gives a predetermined switching instruction using the operation unit 4, it sets the speaker's self-image captured by the camera 1 to be transmitted.

【００４１】When a call is placed, the transmission image setting unit 18 sets either the proxy image output from the proxy image generation unit 14 or the self-image captured by the camera 1 as the transmission target, according to a selection instruction given by the user through the operation unit 4. If the user gives no selection instruction, the image set as the default in advance (for example, the proxy image) is automatically set as the transmission target.
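The rules for the initial transmission target can be summarized in a few lines. This is a sketch, not the device's actual logic: the function `initial_image_type`, the string labels, and the constant `DEFAULT_IMAGE_TYPE` are hypothetical, and the default of "proxy" follows the example given in the text.

```python
from typing import Optional

# Assumed default, following the example in the text ("for example, the proxy image").
DEFAULT_IMAGE_TYPE = "proxy"


def initial_image_type(direction: str, selection: Optional[str] = None) -> str:
    """Choose the image type ("proxy" or "self") at the start of a call."""
    if direction == "incoming":
        # Received calls always start with the proxy image.
        return "proxy"
    if direction != "outgoing":
        raise ValueError(f"unknown call direction: {direction!r}")
    # Placed calls honor the user's selection, falling back to the default.
    return selection if selection in ("proxy", "self") else DEFAULT_IMAGE_TYPE


print(initial_image_type("incoming", "self"))  # prints proxy
print(initial_image_type("outgoing", "self"))  # prints self
print(initial_image_type("outgoing"))          # prints proxy
```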

【００４２】As described above, a speaker receiving a call often does not want his or her image sent to the other party immediately, for example when clothing or hair is untidy just after getting up. Transmitting the proxy image first, and transmitting the speaker's image only when the speaker gives a switching instruction, therefore has the advantage of protecting the speaker's privacy. When a call is placed, the image to be transmitted is set according to the selection instruction, so the speaker can freely choose the image to be transmitted.

【００４３】The microphone 2a described above corresponds to the sound collecting means, the voice recognition unit 10 to the voice recognition means, the proxy image memory 12 and the proxy image generation unit 14 to the proxy image generation means, the camera 1 to the photographing means, the communication processing unit 6 to the transmission means and the reception means, the speaker 2b to the audio output means, the display unit 3 to the display means, the audio delay processing unit 16 to the delay means, the transmission image setting unit 18 to the transmission image setting means, and the memory IF unit 7 and the memory card 50 to the image input means.

【００４４】The videophone device 100 of this embodiment has the configuration described above. Its operation is described next. FIG. 3 is a flowchart showing the overall operation procedure of the videophone device 100 of this embodiment. The control unit 5 determines whether a calling operation (an operation for placing a call to the other party) has been performed using the operation unit 4 (step 100).

【００４５】When a predetermined calling operation is performed, an affirmative determination is made in step 100, and the transmission image setting unit 18 in the control unit 5 sets the type of image to be transmitted (self-image or proxy image) (step 101). Specifically, the operation unit 4 has an operation key for switching the type of image to be transmitted. The speaker uses this key before (or immediately after) the calling operation to select the type, and the transmission image setting unit 18 sets the image type according to that selection instruction. If no selection instruction is given, the default image type (for example, the proxy image) is selected automatically.

【００４６】Once the type of image to be transmitted is set, the control unit 5 performs predetermined call processing, such as transmitting the speaker's voice to the other party, and receiving the other party's voice and image and outputting them to the handset 2 and the display unit 3 (step 102). In parallel with the call processing of step 102, the control unit 5 performs processing that transmits either the self-image of the speaker or the proxy image (image transmission processing) (step 103). The details of the image transmission processing are described later.

【００４７】The control unit 5 then determines whether the call has ended because the other party or the device itself has hung up (step 104). While the call has not ended, a negative determination is made in step 104, and the control unit 5 returns to step 102 and continues the predetermined call processing and image transmission processing. When the call has ended, an affirmative determination is made in step 104 and the control unit 5 ends the series of processes.

【００４８】If no calling operation has been performed, a negative determination is made in step 100, and the control unit 5 determines whether a call from another party has been received (step 105). If no call has been received, a negative determination is made in step 105, the process returns to step 100, and the subsequent processing is repeated.

【００４９】If a call has been received, an affirmative determination is made in step 105, and the transmission image setting unit 18 in the control unit 5 automatically sets the type of image to be transmitted to the proxy image (step 106). The control unit 5 then proceeds to step 102 and performs the subsequent processing.
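The flow of FIG. 3 (steps 100 to 106) can be traced with a small simulation. The sketch below is an illustration only: the function `run_call_flow` and the event names "dial", "ring", "idle", "talk", and "hangup" are hypothetical, and steps 102 and 103, which the text describes as running in parallel, are recorded one after the other for simplicity.

```python
def run_call_flow(events, selection=None):
    """Trace the step numbers of FIG. 3 for a sequence of (hypothetical) events.

    Returns (trace, image_type); image_type is None if no call was set up.
    """
    trace = []
    image_type = None
    it = iter(events)
    for event in it:  # idle loop: steps 100 and 105
        trace.append(100)
        if event == "dial":
            image_type = selection or "proxy"  # step 101, default assumed "proxy"
            trace.append(101)
            break
        trace.append(105)
        if event == "ring":
            image_type = "proxy"  # step 106: received calls start with the proxy image
            trace.append(106)
            break
    else:
        return trace, None  # events ran out before any call was set up
    for event in it:  # call loop: steps 102 to 104
        trace.extend([102, 103, 104])
        if event == "hangup":
            break
    return trace, image_type


print(run_call_flow(["dial", "hangup"], selection="self"))
# prints ([100, 101, 102, 103, 104], 'self')
```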

【００５０】Next, the image transmission processing is described in detail. FIG. 4 is a flowchart showing the detailed procedure of the image transmission processing of step 103 in FIG. 3. The transmission image setting unit 18 in the control unit 5 determines whether the type of image to be transmitted is set to the proxy image (step 200).

【００５１】If the transmission target is set to the proxy image, an affirmative determination is made in step 200, and the voice recognition unit 10 in the control unit 5 performs predetermined voice recognition processing on the speaker's utterance, based on the audio signal output from the microphone 2a, and identifies its pronunciation content (step 201). As described above, the voice recognition unit 10 of this embodiment identifies six kinds of pronunciation content in the speaker's voice: the five vowels "a, i, u, e, o" and the moraic nasal "n".

【００５２】The proxy image generation unit 14 acquires the pronunciation content identified by the voice recognition unit 10 as it becomes available (for example, for each sound) and selects the corresponding proxy image from the proxy images stored in the proxy image memory 12 (step 202). When no identification result is output from the voice recognition unit 10, in other words, when sound collection by the microphone 2a is interrupted, the proxy image generation unit 14 of this embodiment automatically selects the proxy image 150f shown in FIG. 2(f), which corresponds to the moraic nasal "n". As a result, when the speaker's voice breaks off, the proxy image 150f, a face with the mouth closed, is transmitted to the other party. A proxy image with an open mouth is therefore never transmitted while the speaker is silent, which prevents the other party viewing the proxy image from feeling that something is off.
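The silence fallback can be expressed as a default in the lookup. In this sketch, `image_for`, `PROXY_IMAGES`, and `CLOSED_MOUTH` are hypothetical names, and representing "no identification result" as `None` is an assumption made for illustration.

```python
# Hypothetical table reusing the reference numerals of FIG. 2.
PROXY_IMAGES = {
    "a": "150a", "i": "150b", "u": "150c",
    "e": "150d", "o": "150e", "n": "150f",
}
CLOSED_MOUTH = PROXY_IMAGES["n"]  # proxy image 150f: face with the mouth closed


def image_for(sound):
    """Return the proxy image for one recognition result.

    sound is None when the recognizer produced no result (speaker silent).
    """
    if sound is None:
        return CLOSED_MOUTH
    return PROXY_IMAGES[sound]


print([image_for(s) for s in ["a", None, None, "o"]])
# prints ['150a', '150f', '150f', '150e']
```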

【００５３】When the proxy image generation unit 14 has selected the proxy image corresponding to the pronunciation content, the control unit 5 transmits the proxy image to the other party (step 203). In parallel with the processing of steps 201 to 203, the transmission image setting unit 18 determines whether an operation instruction to switch the type of image to be transmitted to the self-image has been given (step 204). While no such instruction is given, a negative determination is made in step 204, the process returns to step 201, and the subsequent processing is repeated. That is, a proxy image matching the pronunciation content of the speaker's voice is selected as needed and transmitted to the other party.

【００５４】When an operation instruction to switch the transmission target to the self-image is given and an affirmative determination is made in step 204, or when the transmission target was set to the self-image from the start and a negative determination is made in step 200, the transmission image setting unit 18 switches the transmission target to the self-image.

【００５５】Next, the control unit 5 enables the camera 1, captures an image of the speaker (self-image) with the camera 1 (step 205), and transmits the captured self-image to the other party (step 206). For example, several to several tens of self-images are captured and transmitted per second.

【００５６】In parallel with the processing of steps 205 and 206, the transmission image setting unit 18 determines whether an operation instruction to switch the type of image to be transmitted to the proxy image has been given (step 207). While no such instruction is given, a negative determination is made in step 207, the process returns to step 205, and the subsequent processing is repeated. That is, the speaker's self-image is captured at a predetermined interval and transmitted to the other party.

【００５７】When an operation instruction to switch the transmission target to the proxy image is given, an affirmative determination is made in step 207, and the transmission image setting unit 18 stops transmission of the self-image and switches the transmission target to the proxy image. The process then returns to step 201, and the subsequent processing is repeated.
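The image transmission loop of FIG. 4 behaves like a two-state machine that toggles between the proxy image and the self-image. The sketch below is an illustration under assumptions: `image_transmission`, the tuple layout of each tick, and the `MOUTH` table are hypothetical, and the switch check that the text describes as running in parallel is evaluated once per tick.

```python
# Hypothetical table reusing the reference numerals of FIG. 2.
MOUTH = {"a": "150a", "i": "150b", "u": "150c",
         "e": "150d", "o": "150e", "n": "150f"}


def image_transmission(initial_mode, ticks):
    """Return the images sent for a sequence of ticks.

    Each tick is (sound, camera_frame, switch_requested); sound is None
    when nothing was recognized. initial_mode is "proxy" or "self" (step 200).
    """
    mode = initial_mode
    sent = []
    for sound, frame, switch_requested in ticks:
        if mode == "proxy":
            # Steps 201 to 203: recognize, select, transmit the proxy image.
            sent.append(("proxy", MOUTH.get(sound, "150f")))
        else:
            # Steps 205 and 206: capture and transmit the self-image.
            sent.append(("self", frame))
        if switch_requested:
            # Step 204 or 207: toggle the transmission target.
            mode = "self" if mode == "proxy" else "proxy"
    return sent


print(image_transmission("proxy", [("o", "cam0", False),
                                   (None, "cam1", True),
                                   ("a", "cam2", False)]))
# prints [('proxy', '150e'), ('proxy', '150f'), ('self', 'cam2')]
```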

【００５８】As described above, the videophone device 100 of the first embodiment discriminates the five vowels "a" to "o" and the moraic nasal "n" as the content of the speaker's voice and generates a proxy image of the speaker whose mouth shape changes to match the discriminated content. It can therefore send the other party a proxy image with a natural expression suited to the content of the speaker's voice. In particular, the content of the voice is discriminated on the basis that the shape of the mouth during speech is determined almost entirely by the vowels and the moraic nasal, so a proxy image with a mouth shape matching the voice can be generated accurately and with a small processing load.

【００５９】In addition, six kinds of proxy images (still proxy images) corresponding to the sounds of interest (the vowels plus the moraic nasal) are prepared in advance and stored in the proxy image memory 12, and a proxy image is generated by selecting from these six the one corresponding to the voice content discriminated by the voice recognition unit 10. The processing is therefore simple, which has the further advantage of reducing the processing load of generating a proxy image.

【００６０】The videophone device to which the present invention is applied is not limited to the embodiment described above, and various modifications are possible within the scope of the gist of the present invention. For example, in the embodiment described above, six kinds of proxy images corresponding to six kinds of pronunciation content (the vowels "a, i, u, e, o" together with "n") are prepared, as shown in FIG. 2. Instead, six kinds of images containing only the mouth shape for each pronunciation content (mouth shape images) and a base face image (common image) may be prepared, and a proxy image may be generated by combining one of the mouth shape images with the common image.

【００６１】FIG. 5 shows the processing in a modification that generates a proxy image by combining a mouth shape image with the common image. As shown in FIG. 5, six mouth shape images 152a to 152f corresponding to the six kinds of pronunciation content and a common image 154 serving as the base are prepared. One of the six mouth shape images 152a to 152f is selected according to the pronunciation content and combined with the common image 154, which makes it possible to generate and transmit a proxy image whose expression matches the pronunciation content of the voice. Compared with preparing multiple kinds of complete proxy images as in FIG. 2, this reduces the amount of image data that must be prepared and therefore has the advantage of reducing the storage capacity required of the proxy image memory 12.
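Combining a mouth shape image with the common image can be sketched as pasting a small pixel block into a copy of the base image. The representation of images as nested lists, the function `composite`, and the paste position are all assumptions made for illustration; the text does not specify how the images are combined.

```python
def composite(base, mouth, top, left):
    """Return a copy of base with the mouth image pasted at (top, left).

    Images are lists of pixel rows (a hypothetical representation); the
    base image itself is left unmodified so it can be reused for every sound.
    """
    out = [row[:] for row in base]
    for r, mouth_row in enumerate(mouth):
        for c, pixel in enumerate(mouth_row):
            out[top + r][left + c] = pixel
    return out


common_image = [[0, 0, 0],
                [0, 0, 0],
                [0, 0, 0]]
mouth_a = [[1, 1]]  # stand-in for one of the mouth shape images 152a to 152f

print(composite(common_image, mouth_a, top=2, left=0))
# prints [[0, 0, 0], [0, 0, 0], [1, 1, 0]]
```

Only the six small mouth blocks and one base image need to be stored, which is the storage saving the paragraph describes.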

【００６２】In the embodiment described above, the pronunciation content is identified from the voice of the speaker during a call and the corresponding proxy image is transmitted. When a so-called answering machine function is used, which automatically answers a call in the speaker's absence and plays a predetermined response message, proxy images corresponding to the pronunciation content of the response message may be transmitted.

【００６３】In this case, the speaker first records a predetermined response message (for example, "I am away right now. Please state your name and your business.") using the microphone 2a, and the audio is stored in the internal memory (not shown) of the control unit 5 or on the memory card 50. When a call comes in, the device answers automatically in step 105 of FIG. 3 once it has rung a predetermined number of times (for example, five) or more, and performs the processing from step 106 onward. At that time, the prepared response message is transmitted to the other party, its pronunciation content is identified by the voice recognition unit 10, and the corresponding proxy images are selected in sequence and transmitted to the other party.

【００６４】The speaker's emotion may also be determined, and a proxy image whose expression reflects that emotion may be generated and transmitted. FIG. 6 shows the configuration of a videophone device 100a according to a modification that generates a proxy image whose expression reflects the speaker's emotion. The videophone device 100a shown in FIG. 6 has basically the same configuration as the videophone device 100 shown in FIG. 1 and differs in the configuration of the control unit 5a. The following description focuses mainly on the differences between the two.

【００６５】The control unit 5a includes the voice recognition unit 10, a proxy image memory 12a, a proxy image generation unit 14a, the audio delay unit 16, the transmission image setting unit 18, and an emotion determination unit 20. The emotion determination unit 20 determines the speaker's emotion (such as joy, anger, sorrow, or pleasure) by any of various known methods, based on the image of the speaker captured by the camera 1 and the speaker's voice collected by the microphone 2a. For example, the emotion determination unit 20 extracts features such as intonation and speaking rate from the speaker's voice, extracts features such as the position and size of facial elements (eyes, eyebrows, mouth, and so on) and the number of blinks from the speaker's image, and determines the speaker's emotion by combining these features. The emotion determination unit 20 corresponds to the emotion determination means.

【００６６】The emotion may be determined from only one of the speaker's voice and image. When emotion recognition is based on voice, part of the processing may overlap with the processing performed by the voice recognition unit 10. In that case, the processing may be shared.

【００６７】The proxy image memory 12a stores proxy images that have an expression corresponding to the type of emotion and a mouth shape corresponding to each of the six kinds of pronunciation content. For example, if the emotion determination unit 20 distinguishes four types of emotion (joy, anger, sorrow, and pleasure), at least 24 kinds of proxy images, one for each combination of emotion type and pronunciation content, are stored in the proxy image memory 12a.

【００６８】The proxy image generation unit 14a selects, from the proxy images stored in the proxy image memory 12a, the one proxy image corresponding to both the pronunciation content identified by the voice recognition unit 10 and the emotion type determined by the emotion determination unit 20. For example, if the emotion is "anger" and the pronunciation content is "a", the proxy image with an angry expression and an "a" mouth shape is selected.
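The 24 combinations follow from pairing each of the four emotion types with each of the six sounds (4 × 6 = 24). A minimal sketch, in which the English emotion labels, the table `TABLE`, the file-name-like values, and the function `select` are hypothetical:

```python
from itertools import product

EMOTIONS = ("joy", "anger", "sorrow", "pleasure")
SOUNDS = ("a", "i", "u", "e", "o", "n")

# Hypothetical table: one image name per (emotion, sound) combination.
TABLE = {(emotion, sound): f"{emotion}_{sound}"
         for emotion, sound in product(EMOTIONS, SOUNDS)}


def select(emotion, sound):
    """Return the proxy image for the given emotion type and sound."""
    return TABLE[(emotion, sound)]


print(len(TABLE))            # prints 24
print(select("anger", "a"))  # prints anger_a
```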

【００６９】Next, the operation of the videophone device 100a having the above configuration is described. Its operation procedure is basically the same as that shown in FIGS. 3 and 4, with part of the flowchart of FIG. 4 changed. Specifically, (1) the emotion determination processing by the emotion determination unit 20 is added before step 202 of FIG. 4 (it may instead come before step 201), and (2) the emotion determination result from the emotion determination unit 20 is reflected when the proxy image is selected in step 202.

【００７０】Reflecting the speaker's emotion in the expression of the proxy image in this way makes the expression richer and helps communication go more smoothly. As in the case of replacing the mouth shape image described above, a proxy image that reflects emotion may be generated by partially replacing the eye region of the proxy image according to the emotion.

【００７１】A means for switching the gender of the speaker by changing the sound quality of the speaker's voice collected by the microphone 2a may also be provided, and the voice after gender switching may be transmitted. FIG. 7 shows the configuration of a videophone device 100b according to a modification that transmits voice with the speaker's gender switched. The videophone device 100b shown in FIG. 7 has basically the same configuration as the videophone device 100 shown in FIG. 1 and differs in the configuration of the control unit 5b. The following description focuses mainly on the differences between the two.

【００７２】The control unit 5b includes the voice recognition unit 10, the proxy image memory 12, the proxy image generation unit 14, the audio delay unit 16, the transmission image setting unit 18, and a speaker gender changing unit 22. The speaker gender changing unit 22 switches the gender of the speaker by applying various kinds of processing, such as frequency conversion, to the speaker's voice collected by the microphone 2a to change its sound quality. The speaker gender changing unit 22 corresponds to the speaker gender changing means.

【００７３】Next, the operation of the videophone device 100b having the above configuration is described. Its operation procedure is basically the same as that shown in FIGS. 3 and 4. When the speaker gives a predetermined gender switching instruction using the operation unit 4, the processing by the speaker gender changing unit 22 is enabled and the voice after the gender change is transmitted to the other party.

【００７４】As a result, when a female speaker receives a nuisance call, for example, her voice can be converted so that it sounds like a man's before it is transmitted, which makes it possible to drive off the nuisance caller. When the processing by the speaker gender changing unit 22 is enabled, it is also preferable to change the proxy image in conjunction with it. For example, if the speaker is a woman and has been using a female character image, the image may be switched to a male character image.

【００７５】When the proxy image is switched according to the pronunciation content, the proxy image generation unit 14 may generate an interpolated image from the preceding and following proxy images (selected images) and insert it between them. FIG. 8 schematically shows the processing in a modification that generates and inserts an interpolated image based on the preceding and following proxy images. As a concrete example, it shows the processing when the preceding proxy image is the proxy image 150a corresponding to the vowel "a" and the following image is the proxy image 150f corresponding to the moraic nasal "n".

【００７６】As shown in FIG. 8, when the display switches from the proxy image 150a corresponding to the vowel "a" to the proxy image 150f corresponding to the moraic nasal "n", the proxy image generation unit 14 uses the proxy images 150a and 150f to generate an interpolated image 160 whose mouth shape is intermediate between the vowel "a" and the moraic nasal "n", and inserts it between the proxy image 150a and the proxy image 150f. The mouth shape of the proxy image then appears to change smoothly from "a" to "n", giving a more natural expression. In the example of FIG. 8 a single interpolated image is inserted between the preceding and following proxy images, but inserting more interpolated images makes the change in mouth shape even smoother.
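The text does not say how the interpolated image is computed. One simple possibility, shown here purely as an assumption, is a pixel-wise linear blend (cross-fade) of the two proxy images; a real implementation might instead morph the mouth geometry. The functions `interpolate` and `with_interpolation` and the nested-list image representation are hypothetical.

```python
def interpolate(prev_img, next_img, t=0.5):
    """Blend two equally sized grayscale images; t=0 gives prev_img, t=1 next_img."""
    return [[(1 - t) * p + t * n for p, n in zip(prev_row, next_row)]
            for prev_row, next_row in zip(prev_img, next_img)]


def with_interpolation(prev_img, next_img, count=1):
    """Return prev_img, count evenly spaced in-between images, then next_img."""
    steps = [k / (count + 1) for k in range(1, count + 1)]
    between = [interpolate(prev_img, next_img, t) for t in steps]
    return [prev_img] + between + [next_img]


image_a = [[0, 10]]   # stand-in for proxy image 150a
image_n = [[10, 20]]  # stand-in for proxy image 150f

print(interpolate(image_a, image_n))
# prints [[5.0, 15.0]]
```

Raising `count` corresponds to inserting more interpolated images between the two proxy images, as the paragraph suggests for a smoother transition.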

【００７７】In the embodiment described above, the proxy image is generated with the facial expression (mainly the mouth shape) changed according to the pronunciation content. Further changes may be added, such as hand gestures at appropriate times or opening and closing of the eyes. Adding such changes makes the proxy image more expressive.

[0078] In the embodiment described above (and likewise in the second embodiment described later), the pronunciation content is identified on the assumption that the speaker uses Japanese. The present invention can be applied in the same way to languages other than Japanese by extracting the vowels and similar sounds of each language as the sounds of interest.

[0079] [Second Embodiment] The first embodiment described above is an embodiment of a videophone device to which the present invention is applied. Besides videophone devices, the present invention can also be applied to various devices that display a predetermined character image (for example, a character image modeled on a person) in correspondence with voice output.

[0080] An embodiment in which the present invention is applied to an in-vehicle navigation device is described below. In the second embodiment, the navigation device itself, which has a voice output function, plays the role of the "speaker", and the character image displayed in correspondence with the voice output corresponds to the "proxy image".

[0081] FIG. 9 shows the configuration of the navigation device of the second embodiment. The navigation device 200 shown in FIG. 9 includes a navigation controller 60, a DVD drive 61, an operation unit 62, a vehicle position detection unit 63, a display device 64, and a speaker 65.

[0082] The navigation controller 60 controls the overall operation of the navigation device, such as displaying a map of the area around the vehicle position, searching for a travel route connecting a specified starting point and destination, and providing route guidance along that route. The navigation controller 60 is implemented by executing a predetermined operating program using a CPU, ROM, RAM, and the like. Its internal configuration is described in detail later.

[0083] The DVD drive 61 is loaded with one or more DVDs and, under the control of the navigation controller 60, reads from one of them the map data required for map display, route search, and so on. The loaded disc is not limited to a DVD and may be another disc-type storage medium.

[0084] The operation unit 62 has various operation keys, such as up, down, left, and right cursor keys and a numeric keypad, and outputs a signal corresponding to the operation to the navigation controller 60. The vehicle position detection unit 63 includes a GPS receiver, a direction sensor, a distance sensor, and the like. It detects the vehicle position (latitude and longitude) and outputs the result to the navigation controller 60.

[0085] The display device 64 is built around, for example, a liquid crystal display panel with a screen size of about 8 inches, and displays map images and the like based on the video signal output from the navigation controller 60. The speaker 65 outputs various guidance voices, such as guidance on the direction of travel at an intersection, based on the audio signal output from the navigation controller 60.

[0086] Next, the internal configuration of the navigation controller 60 is described in detail. The navigation controller 60 shown in FIG. 9 includes a guidance sentence creation unit 70, a speech synthesis unit 72, a pronunciation content identification unit 74, a character image memory 76, and a character image generation unit 78.

[0087] The guidance sentence creation unit 70 creates guidance sentences for the various voice outputs (for example, route guidance at an intersection). The speech synthesis unit 72 generates the audio signal for voicing the guidance sentence created by the guidance sentence creation unit 70 and outputs it to the speaker 65.

[0088] The pronunciation content identification unit 74 acquires the guidance sentence created by the guidance sentence creation unit 70 and identifies the pronunciation content of the corresponding guidance voice. Specifically, it acquires the character data of the guidance sentence and, based on this data, extracts from the sentence each of the five Japanese vowels "a, i, u, e, o" and the moraic nasal "n" as sounds of interest. For example, the pronunciation content of a guidance sentence such as "Tsugi no kousaten o ..." ("At the next intersection ...") is identified as "u, i, o, o, u, a, e, n, o, ...".
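The extraction of sounds of interest from character data can be sketched as a kana lookup. This is only an illustration under stated assumptions: the guidance sentence is assumed to be available as a hiragana reading (the specification does not say how kanji are converted to readings), and the table `ROWS` and function `extract_sounds` are hypothetical names. The sketch treats small kana such as "ゃ" as changing the vowel of the preceding mora, and skips characters that are not in the table (punctuation, the geminate marker "っ", the long-vowel mark).

```python
from typing import List

# Hiragana grouped by the vowel (or moraic nasal) that determines mouth shape.
ROWS = {
    "a": "あかさたなはまやらわがざだばぱゃぁ",
    "i": "いきしちにひみりぎじぢびぴぃ",
    "u": "うくすつぬふむゆるぐずづぶぷゅぅ",
    "e": "えけせてねへめれげぜでべぺぇ",
    "o": "おこそとのほもよろをごぞどぼぽょぉ",
    "n": "ん",
}
KANA_TO_SOUND = {kana: sound for sound, kanas in ROWS.items() for kana in kanas}
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def extract_sounds(reading: str) -> List[str]:
    """Map a hiragana reading to the sequence of sounds of interest."""
    sounds: List[str] = []
    for ch in reading:
        sound = KANA_TO_SOUND.get(ch)
        if sound is None:
            continue  # not in the table: punctuation, "っ", "ー", and so on
        if ch in SMALL_KANA and sounds:
            sounds[-1] = sound  # e.g. "きゃ": the mora's vowel is "a", not "i"
        else:
            sounds.append(sound)
    return sounds
```

Applied to the reading "つぎのこうさてんを" of the example sentence, this returns the same sequence the paragraph gives: u, i, o, o, u, a, e, n, o.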

[0089] Like the proxy image memory 12 of the first embodiment, the character image memory 76 stores six character images (still proxy images), each with a mouth shape corresponding to one of the six kinds of pronunciation content. Here, it is assumed that predetermined character images modeled on a person have been read from a DVD by the DVD drive 61 and stored in the character image memory 76.

[0090] The character image generation unit 78 generates the character image by selecting, from the character images stored in the character image memory 76, the one corresponding to the pronunciation content identified by the pronunciation content identification unit 74. The character image generated by the character image generation unit 78 is displayed on the screen of the display device 64 together with other images (such as an intersection guidance map).

[0091] The speaker 65, the guidance sentence creation unit 70, and the speech synthesis unit 72 described above correspond to the voice output means. The pronunciation content identification unit 74, the character image memory 76, and the character image generation unit 78 correspond to the proxy image generation means, and the display device 64 corresponds to the display means.

[0092] The navigation device 200 of the second embodiment is configured as described above. Its operation is described next. FIG. 10 is a flowchart showing the operating procedure of the navigation device when character images are switched and displayed in correspondence with various voice outputs. As an example, the description assumes that character images are switched and displayed according to the pronunciation content of the guidance voice for intersection guidance during route guidance.

[0093] The pronunciation content identification unit 74 determines whether the guidance sentence creation unit 70 has created a guidance sentence to be output as voice (step 300). If no guidance sentence has been created, the determination is negative and step 300 is repeated.

[0094] When a guidance sentence has been created, the determination in step 300 is affirmative, and the pronunciation content identification unit 74 acquires the guidance sentence from the guidance sentence creation unit 70 and identifies the pronunciation content based on it (step 301). Once the pronunciation content has been identified, the character image generation unit 78 selects, from the character images stored in the character image memory 76, those corresponding to the pronunciation content (step 302).

[0095] When selecting the character images for one guidance sentence, the character image generation unit 78 of this embodiment, after selecting the character image corresponding to the last character of the sentence (that is, at the point where the voice output stops), automatically selects the character image for the pronunciation content "n". This prevents a character image with its mouth open from remaining on screen after a guidance sentence ends, which the user would find unnatural.
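The selection in step 302, together with the automatic closing "n" image described here, amounts to a table lookup followed by one appended entry. The following is a minimal sketch. The file names in `CHARACTER_IMAGES` and the function `select_images` are hypothetical placeholders for the six images held in the character image memory 76.

```python
from typing import Dict, List

# Hypothetical identifiers for the six stored character images.
CHARACTER_IMAGES: Dict[str, str] = {
    "a": "char_a.png",
    "i": "char_i.png",
    "u": "char_u.png",
    "e": "char_e.png",
    "o": "char_o.png",
    "n": "char_n.png",  # closed mouth
}


def select_images(sounds: List[str]) -> List[str]:
    """Pick one stored image per sound of interest, then append the "n"
    image so the mouth is closed once the voice output stops."""
    frames = [CHARACTER_IMAGES[s] for s in sounds]
    frames.append(CHARACTER_IMAGES["n"])
    return frames
```

Because the images are chosen by lookup rather than rendered, the cost per guidance sentence is proportional to its length, which is consistent with the low processing load the specification claims for this approach.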

[0096] Once the character images corresponding to the guidance sentence have been selected, the navigation controller 60 switches the character image displayed on the display device 64 as the voice output of the guidance sentence progresses (step 303). Specifically, the navigation controller 60 obtains from the speech synthesis unit 72 the pronunciation timing of each character of the guidance voice and switches the character image in synchronization with this timing.
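One way to picture the synchronization in step 303 is as a lookup of the current playback time against per-character onset times. The sketch below assumes the speech synthesis unit can report onsets as a sorted list of seconds, which the specification does not detail. The function name `image_at` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional


def image_at(onsets: List[float], images: List[str], t: float) -> Optional[str]:
    """Return the image that should be on screen at playback time `t`.

    `onsets[i]` is the assumed start time (in seconds) of the sound that
    `images[i]` represents; `onsets` must be sorted in ascending order.
    Returns None before the first sound starts.
    """
    index = bisect_right(onsets, t) - 1
    return images[index] if index >= 0 else None
```

A display loop would call `image_at` with the elapsed playback time on each refresh and redraw only when the returned image changes.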

[0097] FIG. 11 shows a display example of a route guidance map at an intersection with a character image superimposed on it. As shown in FIG. 11, a map image of the area around the vehicle position is displayed on the left side of the screen, and an enlarged view of the intersection where the route changes is displayed on the right side. In the enlarged intersection view, the direction of travel of the vehicle is indicated by an arrow, and together with this display a guidance sentence such as "Turn left in about 300 m. The landmark is △△ Bank." is output as voice. A predetermined character image 210 is superimposed at the lower right of the display area of the enlarged intersection view, and the mouth shape of the character image 210 changes according to the pronunciation content of the guidance sentence.

[0098] In this way, the navigation device 200 of the second embodiment extracts the five vowels "a, i, u, e, o" and the moraic nasal "n" as sounds of interest from the guidance sentence to be output as voice, and displays a character image whose mouth shape changes according to the extracted sounds of interest. It can therefore display a proxy image with a natural expression that matches the content of the voice. In particular, because the voice content is extracted by focusing on the fact that the shape of the mouth during human speech is determined almost entirely by the vowels and the moraic nasal, a proxy image with a mouth shape matching the voice content can be displayed accurately with a small processing load.

[0099] A further advantage is that the processing is simple: six character images corresponding to the sounds of interest are prepared in advance and stored in the character image memory 76, and the character image is generated by selecting from these six the one corresponding to the voice content identified by the pronunciation content identification unit 74. This reduces the processing load of generating the character image.

[0100] The navigation device to which the present invention is applied is not limited to the embodiment described above, and various modifications are possible. For example, as in the modified example of the first embodiment, six mouth shape images corresponding to the pronunciation content and a common base image may be prepared, and the character image may be generated by compositing one of the mouth shape images with the common image. An interpolated image that interpolates between the preceding and following character images may also be inserted.
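The compositing variant can be sketched as pasting a small mouth shape image into a fixed region of the common image. The sketch assumes images are nested lists of pixel values and that the mouth position (`top`, `left`) is known in advance and fits inside the common image. The function name `composite` and these parameters are hypothetical, since the specification gives no compositing details.

```python
from typing import List

Image = List[List[int]]  # assumed representation: rows of pixel values


def composite(common: Image, mouth: Image, top: int, left: int) -> Image:
    """Return a copy of `common` with `mouth` pasted at row `top`, column `left`.

    The caller is assumed to pass a position where `mouth` fits entirely
    inside `common`. The input images are left unmodified.
    """
    out = [row[:] for row in common]
    for dy, mouth_row in enumerate(mouth):
        out[top + dy][left:left + len(mouth_row)] = mouth_row
    return out
```

Compared with storing six complete character images, this stores one common image plus six small mouth regions, trading a little compositing work for less memory.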

[0101] In the embodiment described above, the case where character images are switched and displayed according to the pronunciation content of the guidance voice for intersection guidance during route guidance was described as an example, but the same processing may be performed for voice output in other navigation operations.

[0102] The scope of the present invention is not limited to the videophone device and navigation device described above. It can be applied widely to various other devices. For example, when applied to a device such as a bank cash dispenser, the mouth shape of a character image modeled on a bank clerk and displayed on the operation screen may be changed to match the pronunciation content of the various guidance sentences (for example, "Please enter your PIN") that are output as voice in response to user operations.

[0103] In television programs and the like, a predetermined character is sometimes displayed in place of the actual speaker (for example, a voice actor or TV personality) and presented as if the character were the speaker, for purposes such as enhancing entertainment value. The present invention can also be applied to a character image display device that performs this kind of processing.

[0104] FIG. 12 shows the configuration of the character image display device in this modified example. The character image display device 300 shown in FIG. 12 includes a microphone 80, an operation unit 82, a control unit 84, a display device 86, and a speaker 88. The control unit 84 includes a voice recognition unit 90, an audio delay processing unit 92, a character image memory 94, and a character image generation unit 96.

[0105] In the character image display device 300, the microphone 80 corresponds to the sound collection means, the voice recognition unit 90 to the voice recognition means, the speaker 88 to the voice output means, the character image memory 94 and the character image generation unit 96 to the proxy image generation means, and the display device 86 to the display means. The operation of each of these components is the same as that of the corresponding component in the videophone device 100 or the navigation device 200 described above, so a detailed description is omitted.

[0106] Next, the operation of the character image display device 300 is described. When the speaker's voice is collected by the microphone 80, the voice recognition unit 90 performs predetermined voice recognition processing on it to determine the pronunciation content. The character image generation unit 96 generates the character image by successively selecting character images whose mouth shapes correspond to the pronunciation content determined by the voice recognition unit 90. The audio delay processing unit 92 delays the voice collected by the microphone 80 by a predetermined time. The control unit 84 then outputs the character image generated by the character image generation unit 96 to the display device 86 and outputs the voice delayed by the audio delay processing unit 92 to the speaker 88.
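The role of the audio delay processing unit 92 can be illustrated with a fixed-length delay line: audio frames are held back by the time recognition and image selection are assumed to take, so that sound and mouth shape reach the outputs together. This is a sketch under that assumption. The class name `DelayLine` and the frame-count parameter are hypothetical, and the specification only says the delay is a predetermined time.

```python
from collections import deque
from typing import Any, Deque


class DelayLine:
    """Delay a stream of audio frames by a fixed number of frames."""

    def __init__(self, delay_frames: int, silence: Any = 0) -> None:
        # Pre-fill with silence so the first outputs are silent frames.
        self._buffer: Deque[Any] = deque([silence] * delay_frames)

    def push(self, frame: Any) -> Any:
        """Store the newest frame and return the frame from `delay_frames` ago."""
        self._buffer.append(frame)
        return self._buffer.popleft()
```

Each captured frame would be passed through `push` before being sent to the speaker, while the undelayed frame goes to voice recognition.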

[0107] This makes it possible to realize a character image display device that generates and displays a character image whose mouth shape changes in real-time response to the speaker's voice, giving a lifelike character display. When the device is used for producing television programs or the like, the video signal of the character image output to the display device 86 and the audio signal output to the speaker 88 can be fed in parallel to separate recording equipment (not shown) to record the character image and the voice.

[0108]

[Effects of the Invention] As described above, according to the proxy image display device of the present invention, a proxy image whose mouth shape changes according to the content of the speaker's voice is displayed, so a proxy image with a natural expression that matches the content of the voice can be displayed.

[0109] According to the videophone device of the present invention, a proxy image whose mouth shape changes according to the content of the speaker's voice is generated, so a proxy image with a natural expression that matches the content of the voice can be sent to the other party.

[Brief description of drawings]

FIG. 1 is a diagram showing the configuration of the videophone device of the first embodiment.

FIG. 2 is a diagram showing an example of the six types of proxy images stored in the proxy image memory.

FIG. 3 is a flowchart showing the overall operating procedure of the videophone device.

FIG. 4 is a flowchart showing the detailed procedure of the image transmission processing shown in step 103.

FIG. 5 is a diagram showing the processing in a modified example in which a proxy image is generated by compositing a mouth shape image with a common image.

FIG. 6 is a diagram showing the configuration of a videophone device of a modified example that generates a proxy image whose facial expression reflects the speaker's emotion.

FIG. 7 is a diagram showing the configuration of a videophone device of a modified example that transmits voice in which the speaker's gender has been switched.

FIG. 8 is a diagram schematically showing the processing in a modified example in which an interpolated image is generated from the preceding and following proxy images and inserted between them.

FIG. 9 is a diagram showing the configuration of the navigation device of the second embodiment.

FIG. 10 is a flowchart showing the operating procedure of the navigation device when character images are switched and displayed in correspondence with various voice outputs.

FIG. 11 is a diagram showing a display example of a route guidance map at an intersection with a character image superimposed on it.

FIG. 12 is a diagram showing the configuration of the character image display device.

[Explanation of symbols]

1 camera
2 handset
2a microphone
2b speaker
3 display unit
4 operation unit
5, 5a, 5b control unit
6 communication processing unit
7 memory IF (interface) unit
10 voice recognition unit
12, 12a proxy image memory
14, 14a proxy image generation unit
16 audio delay processing unit
18 transmission image setting unit
20 emotion determination unit
22 speaker gender changing unit
50 memory card
100, 100a, 100b videophone device

Continuation of front page: (51) Int. Cl.⁷ / identification symbol / FI / theme code (reference): G10L 3/00 561C

Claims

[Claims]

1. A proxy image display device comprising: voice output means for outputting a speaker's voice; proxy image generation means for generating a proxy image of the speaker whose mouth shape changes according to the content of the voice output from the voice output means; and display means for displaying the image generated by the proxy image generation means.

2. The proxy image display device according to claim 1, wherein the proxy image generation means extracts vowels of the voice output from the voice output means as sounds of interest and generates the proxy image corresponding to the sounds of interest.

3. The proxy image display device according to claim 2, wherein the proxy image generation means extracts, as a sound of interest, the moraic nasal in addition to the vowels of the voice.

4. The proxy image display device according to claim 2, wherein a plurality of types of still proxy images corresponding to the sounds of interest are prepared in advance, and the proxy image generation means generates the proxy image by selecting, from the plurality of types of still proxy images, the one corresponding to the voice to be output.

5. The proxy image display device according to claim 2, wherein a plurality of types of mouth shape images corresponding to the sounds of interest and a common image covering the parts other than the mouth shape images are prepared in advance, and the proxy image generation means generates the proxy image by selecting, from the plurality of types of mouth shape images, the one corresponding to the voice to be output and compositing it with the common image.

6. The proxy image display device according to claim 2, wherein, when the selected image corresponding to the sound of interest is switched, the proxy image generation means generates the proxy image by inserting, between the preceding and following selected images, an interpolated image generated using those selected images.

7. The proxy image display device according to claim 1, wherein the proxy image generation means generates the proxy image corresponding to the moraic nasal when the voice output by the voice output means stops.

8. The proxy image display device according to claim 1, further comprising: sound collection means for collecting the speaker's voice; and voice recognition means for determining the content of the voice collected by the sound collection means, wherein the voice output means outputs voice based on the voice collected by the sound collection means, and the proxy image generation means generates the proxy image based on the content of the voice determined by the voice recognition means.

9. A videophone device comprising: sound collection means for collecting a speaker's voice; voice recognition means for determining the content of the voice collected by the sound collection means; proxy image generation means for generating a proxy image of the speaker whose mouth shape changes according to the content of the voice determined by the voice recognition means; photographing means for photographing an image of the speaker; transmission means for transmitting, to the other party, the voice collected by the sound collection means together with either the image of the speaker photographed by the photographing means or the proxy image generated by the proxy image generation means; reception means for receiving voice and images transmitted from the other party; voice output means for outputting the voice received by the reception means; and display means for displaying the image received by the reception means.

10. The videophone device according to claim 9, wherein the voice recognition means determines the vowels of the voice collected by the sound collection means, and the proxy image generation means sets the vowels determined by the voice recognition means as sounds of interest and generates the proxy image corresponding to the sounds of interest.

11. The videophone device according to claim 10, wherein the voice recognition means determines the moraic nasal in addition to the vowels of the voice, and the proxy image generation means sets the sounds of interest so as to include the moraic nasal.

12. The videophone device according to claim 10, wherein a plurality of types of still proxy images corresponding to the sounds of interest are prepared in advance, and the proxy image generation means generates the proxy image by selecting, from the plurality of types of still proxy images, the one corresponding to the content of the voice determined by the voice recognition means.

13. The videophone device according to claim 10, wherein a plurality of types of mouth shape images corresponding to the sounds of interest and a common image covering the parts other than the mouth shape images are prepared in advance, and the proxy image generation means generates the proxy image by selecting, from the plurality of types of mouth shape images, the one corresponding to the content of the voice determined by the voice recognition means and compositing it with the common image.

14. The videophone device according to claim 10, wherein, when the selected image corresponding to the sound of interest is switched, the proxy image generation means generates the proxy image by inserting, between the preceding and following selected images, an interpolated image generated using those selected images.

15. The videophone device according to claim 9, wherein the proxy image generation means generates the proxy image corresponding to the moraic nasal when sound collection by the sound collection means is interrupted.

16. The videophone device according to any one of claims 9 to 15, further comprising transmission image setting means for performing control such that, when a call is received, the proxy image generated by the proxy image generation means is first transmitted from the transmission means to the other party and the image of the speaker photographed by the photographing means is then transmitted in response to a switching instruction from the speaker, and such that, when a call is placed, either the proxy image generated by the proxy image generation means or the image of the speaker photographed by the photographing means is transmitted according to a selection instruction from the speaker.

17. The videophone device according to claim 9, further comprising emotion determination means for determining the emotion of the speaker, wherein the proxy image generation means generates the proxy image with the emotion determined by the emotion determination means reflected in its facial expression.

18. The videophone device according to claim 17, wherein the emotion determination means determines the emotion using at least one of the speaker's voice collected by the sound collection means and the image of the speaker photographed by the photographing means.

19. The videophone device according to claim 9, further comprising delay means for delaying the transmission timing of the voice by the time required for the proxy image generation means to finish generating the proxy image.

20. The videophone device according to claim 9, further comprising image input means for importing an arbitrary character image to be used as the proxy image, wherein the proxy image generation means generates the proxy image by changing the mouth shape of the character image imported by the image input means.

21. The videophone device according to claim 9, further comprising speaker gender changing means for switching the apparent gender of the speaker by changing the voice quality of the voice collected by the sound collection means, wherein the voice whose gender has been switched by the speaker gender changing means is transmitted from the transmission means.