JP2008085421A - Video telephone, calling method, program, voice quality conversion-image editing service providing system, and server

Info

Publication number: JP2008085421A
Application number: JP2006260054A
Authority: JP
Inventors: Akihiro Okamoto; 明浩岡本
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2006-09-26
Filing date: 2006-09-26
Publication date: 2008-04-10

Abstract

PROBLEM TO BE SOLVED: To transmit an image concerning a target speaker in conjunction with the voice of the target speaker when a user makes a call in another person's voice by utilizing voice quality transformation, i.e. performs impersonation.

SOLUTION: This video telephone set is provided with a telephone data storage section 148 for storing an image 192 about the target speaker and a voice quality transforming filter 194; a speaker selecting section 154 for selecting a target speaker; a data selecting section 156 for selecting a target speaker-related image and a voice quality transforming filter from the telephone data storage section; an imaging section 160; an image editing section 162 for editing the image from the imaging section on the basis of the target speaker-related image; a voice input section 164; a voice quality transforming section 166 for transforming the voice quality of a speaker into the voice quality of the target speaker using the voice quality transforming filter; and a telephone transmitting section 168 for transmitting the image edited by the image editing section and the voice transformed by the voice quality transforming section to the other party.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a videophone, a program, a calling method, a voice quality conversion and image editing service providing system, and a server that change the transmitted image in conjunction with voice quality conversion during a videophone call.

Portable mobile terminals such as mobile phones are now equipped with a camera that captures the user's face and a display that shows the other party's face, and so-called videophone calls, in which the parties converse while checking each other's current state, have come into use.

As an application of such videophones, techniques have been disclosed in which the user selects a character (an avatar) to act as a stand-in and transmits the avatar's image instead of the user's own image during a call, or in which the receiving side freely selects an avatar and displays it instead of the sender's image (for example, Patent Document 1).

Such a portable mobile terminal can also be used to replace the user's own voice with another voice before it is conveyed to the other party. However, when a call is made with voice only, without transmitting the user's own image, that is, when the other party cannot identify the user, improper conduct such as impersonating someone else with a converted voice can occur. Techniques have therefore been studied that restrict voice quality conversion in a voice-only mode, in which the speaker's image is not transmitted, so as not to invite such conduct (for example, Patent Document 2).

Japanese Patent Laid-Open No. 2003-248841
Japanese Patent Laid-Open No. 2002-314638

In the techniques described above, however, there is no relationship between the selected avatar and the voice after voice quality conversion. The avatar is used only as a pseudo display in a virtual communication space and is merely one selectable means of making the call more entertaining.

Furthermore, in voice quality conversion that converts the voice quality of the speaker (the user) in a videophone call into the voice quality of a target speaker (a third person who is not the other party), the speaker's voice is converted into the target speaker's voice and reaches the other party as another person's voice. Voice alone, however, lacks a sense of presence, and unless the speaker explicitly states who the target speaker is, it is difficult for the other party to tell whose voice it is.

The present invention has been made in view of the above problems of conventional portable mobile terminals. Its object is to provide a new and improved videophone, calling method, program, voice quality conversion and image editing service providing system, and server with which, when a user performs a so-called imitation by talking in another person's voice through voice quality conversion, an image related to the target speaker is transmitted in conjunction with the target speaker's voice, so that the other party can judge intuitively, by both hearing and sight, who the target speaker is.

To solve the above problem, the videophone according to claim 1 of the present invention comprises an imaging unit that inputs an image of a speaker and a voice input unit that inputs the speaker's voice, and converts the voice quality of the speaker's voice into the voice quality of a target speaker. The videophone is characterized by comprising: a telephone data storage unit that stores in advance images related to target speakers and voice quality conversion filters that convert the speaker's voice quality into the voice quality of each target speaker; a speaker selection unit that selects a target speaker; a data selection unit that selects, from the telephone data storage unit, a target speaker related image, which is an image related to the target speaker selected by the speaker selection unit, and the voice quality conversion filter corresponding to that target speaker; an image editing unit that edits the image input from the imaging unit on the basis of the target speaker related image; a voice quality conversion unit that converts the voice quality of the speaker's voice input from the voice input unit into the voice quality of the target speaker using the selected voice quality conversion filter; and a telephone transmission unit that transmits the image edited by the image editing unit and the voice converted by the voice quality conversion unit to the other party. Here, the target speaker may be a person or an animated character. The image related to the target speaker may be an image of the target speaker or an image that evokes the target speaker.

In a call between videophones, the speaker's image is transmitted together with the speaker's voice. Using such a videophone, the speaker performs a so-called "imitation": the voice uttered by the speaker is automatically converted into the voice of a target speaker such as an actor, and the target speaker's voice is sent to the other party in place of the speaker's own. With this configuration, not only the target speaker's voice but also an image edited by the image editing unit on the basis of an image related to the target speaker, for example a photograph of the target speaker's face, is transmitted to the other party's videophone, so the other party can grasp intuitively, by both hearing and sight, whom the speaker is imitating.
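
The data flow of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `TargetSpeakerData`, `STORAGE`, `select_data`, `edit_image`, and `build_outgoing` are hypothetical, images are represented by strings, and the "filter" is a placeholder function that simply halves each sample.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Samples = List[float]


@dataclass(frozen=True)
class TargetSpeakerData:
    related_image: str                        # stand-in for image data
    voice_filter: Callable[[Samples], Samples]  # stand-in for a conversion filter


# Hypothetical contents of the telephone data storage unit.
STORAGE: Dict[str, TargetSpeakerData] = {
    "actor_a": TargetSpeakerData(
        related_image="actor_a_face.png",
        voice_filter=lambda s: [x * 0.5 for x in s],  # placeholder, not real conversion
    ),
}


def select_data(target: str) -> TargetSpeakerData:
    """Data selection unit: look up the image and filter for the chosen target."""
    return STORAGE[target]


def edit_image(camera_frame: str, related_image: str) -> str:
    """Image editing unit: combine the camera frame with the related image."""
    return f"{camera_frame}+{related_image}"


def build_outgoing(target: str, camera_frame: str, samples: Samples) -> Tuple[str, Samples]:
    """Produce the (image, voice) pair handed to the telephone transmission unit."""
    data = select_data(target)
    return edit_image(camera_frame, data.related_image), data.voice_filter(samples)
```

The point of the sketch is that a single selection (the target speaker) drives both the image edit and the voice conversion, which is what links what the other party sees to what they hear.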

The invention according to claim 2 is the videophone according to claim 1, further comprising an imitation switch, wherein the image editing unit and the voice quality conversion unit function while the imitation switch is enabled.

With the imitation switch, the speaker can decide when to perform an imitation. The speaker can therefore apply the imitation only to the utterances he or she wants to imitate, and can convey those parts to the other party with emphasis.
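
The gating behavior of the imitation switch might look like the sketch below. The function name `outgoing` and its parameters are hypothetical; `edit` and `convert` stand in for the image editing unit and the voice quality conversion unit.

```python
from typing import Callable, Tuple, TypeVar

F = TypeVar("F")  # frame type
A = TypeVar("A")  # audio type


def outgoing(
    switch_on: bool,
    raw_frame: F,
    raw_audio: A,
    edit: Callable[[F], F],
    convert: Callable[[A], A],
) -> Tuple[F, A]:
    """Apply editing and conversion only while the imitation switch is enabled."""
    if not switch_on:
        # Switch off: the speaker's own image and voice pass through unchanged.
        return raw_frame, raw_audio
    return edit(raw_frame), convert(raw_audio)
```

For example, `outgoing(False, "frame", "hello", str.upper, str.upper)` returns the inputs untouched, while passing `True` returns both values processed.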

The invention according to claim 3 is the videophone according to claim 1 or 2, further comprising a speaker state information generation unit that generates speaker state information, which is information on the state of the speaker, wherein the image editing unit changes the target speaker related image used for editing in accordance with the speaker state information.

With the above configuration, in which the target speaker related image is edited uniformly by way of the speaker state information, the image editing unit can edit the image input from the imaging unit simply by interpreting the speaker state information.

The invention according to claim 4 is the videophone according to claim 3, further comprising a voice detection unit that detects the presence or absence of the speaker's voice input from the voice input unit, wherein the speaker state information generation unit generates the speaker state information so as to include voice presence information, which is information on the detected presence or absence of voice.

With this configuration, the image editing unit can change the target speaker related image used for editing according to whether the speaker's voice is present. When voice is present, that is, while the speaker is imitating, the change in the image related to the target speaker emphasizes that an imitation is in progress. When no voice is present, displaying another image related to the target speaker keeps the other party interested rather than bored.
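
One simple way to derive voice presence information is an energy threshold, sketched below. The source does not specify the detection method, so the mean-square energy test, the threshold value, and the names `has_voice` and `pick_image` are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def has_voice(samples: Sequence[float], threshold: float = 0.01) -> bool:
    """Return True when the mean squared amplitude exceeds the threshold."""
    if not samples:
        return False
    energy = sum(x * x for x in samples) / len(samples)
    return energy > threshold


def pick_image(images: Dict[str, str], voiced: bool) -> str:
    """Choose which target speaker related image to use for editing."""
    return images["speaking"] if voiced else images["idle"]
```

With `images = {"speaking": "talk.png", "idle": "smile.png"}`, a loud frame selects `talk.png` and a silent frame selects `smile.png`.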

The invention according to claim 5 is the videophone according to claim 3 or 4, further comprising a voice recognition unit that recognizes the utterance content of the speaker's voice input from the voice input unit, wherein the speaker state information generation unit generates the speaker state information so as to include utterance content information, which is information on the recognized utterance content.

The images related to the target speaker that are stored in the telephone data storage unit are associated in advance with arbitrary utterance content information. When the voice recognition unit recognizes such utterance content, the image editing unit superimposes the image associated with that content on the image to be transmitted to the other party. With this configuration, an image suited to what the speaker says can be transmitted to the other party, and the imitation can be expressed dynamically.

The invention according to claim 6 is the videophone according to any one of claims 3 to 5, further comprising a phoneme recognition unit that recognizes phonemes in the speaker's voice input from the voice input unit, wherein the speaker state information generation unit generates the speaker state information so as to include phoneme information, which is information on the type of the recognized phoneme, and the image editing unit adjusts the degree of mouth opening in the target speaker related image in accordance with the phoneme information.

The images related to the target speaker that are stored in the telephone data storage unit are associated in advance with phoneme information. When the phoneme recognition unit recognizes a phoneme in the voice, the image editing unit superimposes the image associated with that phoneme information on the image to be transmitted to the other party. Because the movement of the speaker's mouth, as expressed by the phonemes, is reflected in the target speaker's image, an image in which the target speaker appears to be actually speaking can be transmitted to the other party, and the imitation can be expressed more dynamically.
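
The association between phonemes and mouth images could be held in a lookup table, as in this sketch. The openness values for the five Japanese vowels and the moraic nasal "N" are illustrative guesses, not figures from the source, and `MOUTH_OPENNESS` and `mouth_frame` are hypothetical names.

```python
from typing import Dict, Sequence

# Assumed degree of mouth opening per phoneme (0.0 = closed, 1.0 = fully open).
MOUTH_OPENNESS: Dict[str, float] = {
    "a": 1.0,
    "e": 0.6,
    "o": 0.6,
    "i": 0.3,
    "u": 0.2,
    "N": 0.0,
}


def mouth_frame(phoneme: str, frames: Sequence[str]) -> str:
    """Pick a mouth image from `frames`, ordered from closed to fully open.

    Unknown phonemes and silence fall back to the closed mouth.
    """
    openness = MOUTH_OPENNESS.get(phoneme, 0.0)
    index = round(openness * (len(frames) - 1))
    return frames[index]
```

With three frames ordered closed, half, open, the phoneme "a" selects the open frame and "N" the closed one.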

The invention according to claim 7 is the videophone according to any one of claims 3 to 6, further comprising an image recognition unit that recognizes the state of the speaker's image input from the imaging unit, wherein the speaker state information generation unit generates the speaker state information so as to include the recognized state of the image.

With this configuration, the image editing unit can change the target speaker related image in step with the speaker's movements, in particular the actual movements of the speaker's face. The speaker can therefore perform the imitation of the target speaker visually and dynamically, and can convey to the other party with greater emphasis that the target speaker is being imitated.

The invention according to claim 8 is the videophone according to claim 7, wherein the image recognition unit, on recognizing that the speaker's face is present, detects the position of the face; the speaker state information generation unit generates the speaker state information so as to include face position information, which is information on the detected position of the face; and the image editing unit superimposes the target speaker related image at the face position corresponding to the face position information.

With this configuration, the speaker can change the position of the target speaker related image transmitted to the other party by moving his or her own face.

The invention according to claim 9 is the videophone according to claim 8, wherein the image recognition unit, on recognizing that the speaker's face is present, detects the tilt of the face; the speaker state information generation unit generates the speaker state information so as to include face tilt information, which is information on the detected tilt of the face; and the image editing unit rotates the target speaker related image in accordance with the face tilt information before superimposing it.

With this configuration, the speaker can rotate the target speaker related image transmitted to the other party by changing the tilt of his or her own face.

The invention according to claim 10 is the videophone according to claim 8 or 9, wherein the image recognition unit, on recognizing that the speaker's face is present, detects the size of the face; the speaker state information generation unit generates the speaker state information so as to include face size information, which is information on the detected size of the face; and the image editing unit enlarges or reduces the target speaker related image in accordance with the face size information.

With this configuration, the speaker can change the size of the target speaker related image transmitted to the other party by changing the apparent size of his or her own face, that is, the distance from the imaging unit.
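
The use of face position, tilt, and size information (claims 8 to 10) can be sketched as a single placement calculation. The representation below is an assumption: the face is described by its center, width, and tilt angle, and the overlay is scaled so that its width matches the detected face width. `FaceState`, `Placement`, and `place_overlay` are hypothetical names, and the actual drawing step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceState:
    cx: float        # face center x in the camera frame (pixels)
    cy: float        # face center y in the camera frame (pixels)
    width: float     # detected face width (pixels)
    tilt_deg: float  # detected face tilt (degrees)


@dataclass(frozen=True)
class Placement:
    x: float             # top-left x of the scaled overlay before rotation
    y: float             # top-left y of the scaled overlay before rotation
    scale: float         # enlargement or reduction factor
    rotation_deg: float  # rotation applied about the overlay center


def place_overlay(face: FaceState, overlay_w: float, overlay_h: float) -> Placement:
    """Compute where and how to draw the target speaker related image."""
    scale = face.width / overlay_w          # face size information
    w, h = overlay_w * scale, overlay_h * scale
    return Placement(
        x=face.cx - w / 2,                  # face position information
        y=face.cy - h / 2,
        scale=scale,
        rotation_deg=face.tilt_deg,         # face tilt information
    )
```

For a 100 by 120 pixel overlay and a face 50 pixels wide centered at (100, 80) and tilted 15 degrees, the overlay is halved in size, drawn with its top-left corner at (75, 50), and rotated 15 degrees.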

The invention according to claim 11 is the videophone according to any one of claims 8 to 10, wherein the image recognition unit, on recognizing that the speaker's face is present, detects the opening and closing of the speaker's eyes; the speaker state information generation unit generates the speaker state information so as to include eye open/close information, which is information on the detected opening and closing of the eyes; and the image editing unit opens and closes the eyes in the target speaker related image in accordance with the eye open/close information.

Because the opening and closing of the speaker's eyes is reflected in the target speaker's image, an image in which the target speaker appears to be actually blinking can be transmitted to the other party, and the imitation can be expressed more dynamically.

The invention according to claim 12 is the videophone according to any one of claims 8 to 11, wherein the image recognition unit, on recognizing that the speaker's face is present, detects the opening and closing of the speaker's mouth; the speaker state information generation unit generates the speaker state information so as to include mouth open/close information, which is information on the detected opening and closing of the mouth; and the image editing unit opens and closes the mouth in the target speaker related image in accordance with the mouth open/close information.

Because the opening and closing of the speaker's mouth is reflected in the target speaker's image, an image in which the target speaker appears to be actually speaking can be transmitted to the other party, and the imitation can be expressed more dynamically.

The invention according to claim 13 is the videophone according to any one of claims 1 to 12, wherein the voice quality conversion filter consists of a first voice quality conversion filter for converting the voice quality of an individual speaker into the voice quality of a common intermediate speaker, and a second voice quality conversion filter for converting the voice quality of the intermediate speaker into the voice quality of an individual target speaker; the data selection unit selects the first voice quality conversion filter and the second voice quality conversion filter as the voice quality conversion filter; and the voice quality conversion unit converts the voice quality of the speaker's voice input from the voice input unit into the voice quality of the intermediate speaker using the selected first voice quality conversion filter, and then converts the voice quality of the intermediate speaker into the voice quality of the target speaker using the selected second voice quality conversion filter.

With this two-stage filter configuration via an intermediate speaker, the speaker only has to prepare the first voice quality conversion filter once and no longer needs to generate a voice quality conversion filter each time the target speaker is changed. On the side of the service provider that supplies the voice quality conversion filters for target speakers, a second voice quality conversion filter, once generated, can be supplied in common to many speakers. An efficient, low-cost system can thus be built, and a large number of speaker and target speaker combinations can be covered with little effort.
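
The saving from the two-stage arrangement is easy to quantify: direct conversion needs one filter per speaker and target pair, whereas routing through an intermediate speaker needs one filter per speaker plus one per target. The sketch below illustrates the count and the chained application; the function names are hypothetical and the filters are modeled as plain functions over sample lists.

```python
from typing import Callable, List

Samples = List[float]
Filter = Callable[[Samples], Samples]


def filters_needed_direct(n_speakers: int, n_targets: int) -> int:
    """One dedicated filter for every (speaker, target) pair."""
    return n_speakers * n_targets


def filters_needed_two_stage(n_speakers: int, n_targets: int) -> int:
    """One first-stage filter per speaker plus one second-stage filter per target."""
    return n_speakers + n_targets


def convert_two_stage(samples: Samples, first: Filter, second: Filter) -> Samples:
    """Speaker -> intermediate speaker -> target speaker."""
    return second(first(samples))
```

For 100 speakers and 20 target speakers, for instance, direct conversion would require 2000 filters and the two-stage arrangement 120.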

The invention according to claim 14 is the videophone according to claim 13, further comprising a receiving unit that receives, from an external electronic device, the image related to the target speaker used by the image editing unit and the second voice quality conversion filter used by the voice quality conversion unit.

As described above, the service provider that supplies voice quality conversion filters for target speakers can supply a second voice quality conversion filter that is common to all users. The provider can therefore combine the images related to a target speaker, which are needed for this videophone function, with that second voice quality conversion filter and supply them together as package data. On the speaker's side, simply obtaining the package data makes it possible to imitate any desired target speaker immediately and easily.

The invention according to claim 15 is the videophone according to claim 13 or 14, further comprising an utterance type selection unit that lets the speaker select between clear speech and hushed speech, wherein the data selection unit selects, in accordance with the selected utterance type, either a first voice quality conversion filter that converts the speaker's clear voice quality into the voice quality of the intermediate speaker or a first voice quality conversion filter that converts the speaker's hushed voice quality into the voice quality of the intermediate speaker.

By providing first voice quality conversion filters for these two utterance types, the speaker can select the first voice quality conversion filter appropriate to his or her circumstances.

The invention according to claim 16 is the videophone according to claim 15, wherein, when the hushed utterance type is selected in the utterance type selection unit, the data selection unit selects a second voice quality conversion filter that converts the voice quality of the intermediate speaker into the speaker's own clear voice quality.

With this configuration, even when circumstances oblige the speaker to speak in a hushed voice, the voice passes through the first voice quality conversion filter, which converts the speaker's hushed voice quality into the voice quality of the intermediate speaker, and the second voice quality conversion filter, which converts the voice quality of the intermediate speaker into the speaker's own clear voice quality, so the other party can reliably make out what the speaker says.
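
The filter selection described in claims 15 and 16 can be sketched as a small lookup. This is one reading of the claims, under the assumption that filters are identified by name: in clear mode the second filter targets the selected target speaker, and in hushed mode it targets the speaker's own clear voice. The function `select_filters` and the naming scheme for filter identifiers are hypothetical.

```python
from typing import Tuple


def select_filters(utterance_type: str, speaker_id: str, target_id: str) -> Tuple[str, str]:
    """Return (first_filter, second_filter) identifiers for the chosen utterance type."""
    if utterance_type == "clear":
        first = f"{speaker_id}_clear_to_intermediate"
        second = f"intermediate_to_{target_id}"
    elif utterance_type == "hushed":
        first = f"{speaker_id}_hushed_to_intermediate"
        # Claim 16: restore the speaker's own clear voice quality.
        second = f"intermediate_to_{speaker_id}_clear"
    else:
        raise ValueError(f"unknown utterance type: {utterance_type!r}")
    return first, second
```

Because both modes pass through the same intermediate speaker, supporting hushed speech only requires one additional first-stage filter per speaker.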

The invention according to claim 17 is the videophone according to claim 15 or 16, wherein, when the hushed utterance type is selected in the utterance type selection unit, the image editing unit superimposes a display image indicating that the speaker is speaking in a hushed voice.

Even when the speaker is speaking in a hushed voice, passing the voice through the first voice quality conversion filter, which converts the speaker's hushed voice quality into the voice quality of the intermediate speaker, and the second voice quality conversion filter, which converts the voice quality of the intermediate speaker into the speaker's own clear voice quality, lets the other party hear the speaker's clear voice. From the other party's side, however, there is then no way of knowing the circumstances under which the call is being made. With the above configuration, in which an image indicating hushed speech is superimposed, the other party can grasp the speaker's situation.

The invention according to claim 18 is the videophone according to any one of claims 13 to 17, further comprising a voice quality conversion filter synthesis unit that combines the first voice quality conversion filter and the second voice quality conversion filter to generate a combined filter that converts the speaker's voice quality directly into the voice quality of the target speaker.

Once the two filters have been loaded at the call preparation stage, there is no need to keep the voice quality conversion filter in two stages. Combining the two filters as described above removes the need to pass through two stages for every utterance and reduces the processing load and power consumption of voice quality conversion.
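
Why combining saves work can be shown with a deliberately simplified model. The source does not state the mathematical form of the filters; the sketch below assumes, purely for illustration, that each filter is a linear map (a matrix) applied to a feature vector, in which case the combined filter is the matrix product and each frame needs one multiplication instead of two. The helper names `matmul` and `apply` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply(m: Matrix, v: Vector) -> Vector:
    """Apply a linear filter m to a feature vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


# Assumed example filters (not from the source).
first = [[2.0, 0.0], [0.0, 1.0]]    # speaker -> intermediate speaker
second = [[1.0, 1.0], [0.0, 3.0]]   # intermediate speaker -> target speaker

# Combined once at call preparation; the second stage is applied after the first.
combined = matmul(second, first)
```

Under this assumption, `apply(combined, v)` gives the same result as `apply(second, apply(first, v))` for any feature vector `v`. Filters that are not linear would need a different combination method.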

The invention according to claim 19 is the videophone according to any one of claims 1 to 18, further comprising a function permission unit that permits the image editing unit and the voice quality conversion unit to function only when an identifier capable of identifying the user's own videophone is being transmitted to the other party's videophone.

This configuration avoids problems arising from fraudulent impersonation of the target speaker or from the inability to identify the speaker. The transmitted identifier may be the telephone number of the user's own videophone.

Here, the components described above are provided in the videophone, but they may instead be provided in a server on the path to the other party's videophone, or in the other party's videophone, and operated in cooperation.

Although the videophone is expressed as a set of components, the components need not all belong to a single device. The components may function as electric circuits or as functional modules on a computer.

To solve the above problem, the calling method according to claim 20 is a calling method for making a call using a videophone that converts the voice quality of a speaker's voice into the voice quality of a target speaker, and is characterized by including: a telephone data storage step of storing, in a telephone data storage unit, images related to target speakers and voice quality conversion filters that convert the speaker's voice quality into the voice quality of each target speaker; a speaker selection step of selecting a target speaker; a data selection step of selecting, from the telephone data storage unit, a target speaker related image, which is an image related to the target speaker selected in the speaker selection step, and the voice quality conversion filter corresponding to that target speaker; an imaging step of inputting an image of the speaker; an image editing step of editing the image input in the imaging step on the basis of the target speaker related image; a voice input step of inputting the speaker's voice; a voice quality conversion step of converting the voice quality of the speaker's voice input in the voice input step into the voice quality of the target speaker using the selected voice quality conversion filter; and a telephone transmission step of transmitting the image edited in the image editing step and the voice converted in the voice quality conversion step to the other party.

上記課題を解決するために、請求項２１に記載のプログラムは、目標話者に関連する画像と発話者の声質を該目標話者の声質に変換する声質変換フィルタとを電話機データ記憶部に記憶させる電話機データ記憶ステップと、目標話者を選択する話者選択ステップと、話者選択ステップで選択された目標話者に関連する画像である目標話者関連画像と目標話者に対応する声質変換フィルタとを、電話機データ記憶部から選択するデータ選択ステップと、発話者の画像を入力する撮像ステップと、撮像ステップで入力された画像を、目標話者関連画像に基づいて編集する画像編集ステップと、発話者の音声を入力する音声入力ステップと、音声入力ステップで入力された発話者の音声の声質を、選択された声質変換フィルタを用いて目標話者の声質に変換する声質変換ステップと、画像編集ステップで編集された画像と声質変換ステップで変換された音声とを通話相手に送信する電話機送信ステップと、をコンピュータに実行させることを特徴とする。 In order to solve the above problem, the program according to claim 21 causes a computer to execute: a telephone data storage step of storing, in a telephone data storage unit, an image related to a target speaker and a voice quality conversion filter for converting the voice quality of a speaker into the voice quality of the target speaker; a speaker selection step of selecting a target speaker; a data selection step of selecting, from the telephone data storage unit, a target speaker related image, which is an image related to the target speaker selected in the speaker selection step, and the voice quality conversion filter corresponding to the target speaker; an imaging step of inputting an image of the speaker; an image editing step of editing the image input in the imaging step based on the target speaker related image; a voice input step of inputting the voice of the speaker; a voice quality conversion step of converting the voice quality of the speaker's voice input in the voice input step into the voice quality of the target speaker using the selected voice quality conversion filter; and a telephone transmission step of transmitting the image edited in the image editing step and the voice converted in the voice quality conversion step to the other party.

上記課題を解決するために、請求項２２に記載の声質変換・画像編集サービス提供システムは、本発明のさらに他の観点によれば、サーバと、該サーバと通信可能に接続されるテレビ電話機とから構成され、発話者の音声の声質を目標話者の声質に変換すると共に発話者の画像を編集する声質変換・画像編集サービスを提供する声質変換・画像編集サービス提供システムであって、サーバは、目標話者に関連する画像と、発話者の声質を該目標話者の声質に変換する声質変換フィルタとを記憶するサーバデータ記憶部と、サーバデータ記憶部に記憶された、目標話者に関連する画像と声質変換フィルタとをテレビ電話機に送信するサーバ送信部と、を備え、テレビ電話機は、目標話者に関連する画像と、声質変換フィルタとを受信する受信部と、受信部で受信した、目標話者に関連する画像と声質変換フィルタとを記憶する電話機データ記憶部と、目標話者を選択する話者選択部と、話者選択部で選択された目標話者に関連する画像である目標話者関連画像と、目標話者に対応する声質変換フィルタとを、電話機データ記憶部から選択するデータ選択部と、発話者の画像を入力する撮像部と、撮像部から入力された画像を、目標話者関連画像に基づいて編集する画像編集部と、発話者の音声を入力する音声入力部と、音声入力部から入力された発話者の音声の声質を、選択された声質変換フィルタを用いて目標話者の声質に変換する声質変換部と、画像編集部が編集した画像と、声質変換部が変換した音声とを通話相手に送信する電話機送信部と、を備えることを特徴とする。 In order to solve the above problem, the voice quality conversion / image editing service providing system according to claim 22 is, according to still another aspect of the present invention, a system that comprises a server and a video phone communicably connected to the server, and that provides a voice quality conversion / image editing service for converting the voice quality of a speaker's voice into the voice quality of a target speaker and editing an image of the speaker. The server comprises: a server data storage unit that stores an image related to the target speaker and a voice quality conversion filter for converting the voice quality of the speaker into the voice quality of the target speaker; and a server transmission unit that transmits the image related to the target speaker and the voice quality conversion filter stored in the server data storage unit to the video phone. The video phone comprises: a receiving unit that receives the image related to the target speaker and the voice quality conversion filter; a telephone data storage unit that stores the image related to the target speaker and the voice quality conversion filter received by the receiving unit; a speaker selection unit that selects a target speaker; a data selection unit that selects, from the telephone data storage unit, a target speaker related image, which is an image related to the target speaker selected by the speaker selection unit, and the voice quality conversion filter corresponding to the target speaker; an imaging unit that inputs an image of the speaker; an image editing unit that edits the image input from the imaging unit based on the target speaker related image; a voice input unit that inputs the voice of the speaker; a voice quality conversion unit that converts the voice quality of the speaker's voice input from the voice input unit into the voice quality of the target speaker using the selected voice quality conversion filter; and a telephone transmission unit that transmits the image edited by the image editing unit and the voice converted by the voice quality conversion unit to the other party.

このような構成であれば、サーバはサーバデータ記憶部によって、目標話者に関連する画像と、発話者の声質を該目標話者の声質に変換する声質変換フィルタとを記憶することが可能であり、サーバ送信部によって、目標話者に関連する画像と声質変換フィルタとをテレビ電話機に送信することが可能となる。 With such a configuration, the server can use the server data storage unit to store an image related to the target speaker and a voice quality conversion filter that converts the voice quality of the speaker into the voice quality of the target speaker, and can use the server transmission unit to transmit the image related to the target speaker and the voice quality conversion filter to the video phone.

また、テレビ電話機は、所謂「ものまね」において、目標話者の音声だけでなく、目標話者に関連する画像に基づいて画像編集部で編集された画像、例えば目標話者の顔写真も通話相手のテレビ電話機に送信されるので、通話相手は、発話者が誰のものまねをしているのかを聴覚および視覚で直感的に把握することが可能となる。 In addition, in so-called "imitation", the video phone transmits to the other party's video phone not only the voice of the target speaker but also an image edited by the image editing unit based on an image related to the target speaker, for example a face photograph of the target speaker. The other party can therefore intuitively grasp, by both hearing and sight, whom the speaker is imitating.

請求項２３に記載の発明は、請求項２２に記載の声質変換・画像編集サービス提供システムにおいて、前記声質変換フィルタは、個々の発話者の声質を共通の中間話者の声質に変換するための第１声質変換フィルタと、該中間話者の声質を個々の目標話者の声質に変換するための第２声質変換フィルタとからなり、前記サーバ送信部は、前記第１声質変換フィルタまたは第２声質変換フィルタのいずれか一方または両方を送信することができ、前記受信部は、前記第１声質変換フィルタまたは第２声質変換フィルタのいずれか一方または両方を受信することができ、前記電話機データ記憶部は、前記受信部で受信された第１声質変換フィルタまたは第２声質変換フィルタのいずれか一方または両方を含む、第１声質変換フィルタおよび第２声質変換フィルタを記憶し、前記データ選択部は、前期第１声質変換フィルタが予め指定されている場合には、第２声質変換フィルタを前記電話機データ記憶部から選択し、前期第１声質変換フィルタが予め指定されていない場合には、前記声質変換フィルタとして、前記第１声質変換フィルタと第２声質変換フィルタとを前記電話機データ記憶部から選択し、前記声質変換部は、前期第１声質変換フィルタが予め指定されている場合には、前記指定された第１声質変換フィルタを用いて中間話者の声質に変換し、前期第１声質変換フィルタが予め指定されていない場合には、前記音声入力部から入力された前記発話者の音声の声質を、前記選択された第１声質変換フィルタを用いて中間話者の声質に変換し、さらに該中間話者の声質を、前記選択された第２声質変換フィルタを用いて前記目標話者の声質に変換することを特徴とする。 The invention according to claim 23 is the voice quality conversion / image editing service providing system according to claim 22, wherein the voice quality conversion filter consists of a first voice quality conversion filter for converting the voice quality of each individual speaker into the voice quality of a common intermediate speaker, and a second voice quality conversion filter for converting the voice quality of the intermediate speaker into the voice quality of each individual target speaker. The server transmission unit can transmit either or both of the first and second voice quality conversion filters, and the receiving unit can receive either or both of them. The telephone data storage unit stores the first and second voice quality conversion filters, including whichever of them was received by the receiving unit. When the first voice quality conversion filter is designated in advance, the data selection unit selects the second voice quality conversion filter from the telephone data storage unit; when it is not designated in advance, the data selection unit selects both the first and second voice quality conversion filters from the telephone data storage unit as the voice quality conversion filter. When the first voice quality conversion filter is designated in advance, the voice quality conversion unit converts the voice into the voice quality of the intermediate speaker using the designated first voice quality conversion filter; when it is not designated in advance, the voice quality conversion unit converts the voice quality of the speaker's voice input from the voice input unit into the voice quality of the intermediate speaker using the selected first voice quality conversion filter. It then further converts the voice quality of the intermediate speaker into the voice quality of the target speaker using the selected second voice quality conversion filter.

このような構成であれば、かかる中間話者を介した２段階のフィルタ構成により、発話者は、一度、第１声質変換フィルタを準備するだけで、目標話者を変更する度に声質変換フィルタを生成する必要がなくなる。また、目標話者への声質変換フィルタを提供するサービス提供者側のサーバでは、一度、第２声質変換フィルタを生成すると、複数の発話者にその共通の第２声質変換フィルタを提供できるので、低コストで効率の良いシステムを築くことができ、少ない負荷で、発話者と目標話者の多数のパターンを生成することが可能となる。 With such a configuration, thanks to the two-stage filter configuration via the intermediate speaker, the speaker only has to prepare the first voice quality conversion filter once and no longer needs to generate a voice quality conversion filter every time the target speaker is changed. In addition, once the server on the service provider side, which provides voice quality conversion filters for target speakers, has generated a second voice quality conversion filter, it can provide that common second filter to a plurality of speakers. An efficient system can therefore be built at low cost, and many combinations of speaker and target speaker can be generated with a small load.
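The two-stage arrangement described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the "filters" are hypothetical toy gain functions on a list of feature values, and the names `make_gain_filter`, `two_stage_convert` and `filters_needed` are assumptions introduced here. It shows that conversion is the composition of a speaker-specific first filter and a target-specific second filter, and that N speakers and M targets need N + M filters instead of N x M.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "filter" maps one frame of features to another.
Filter = Callable[[List[float]], List[float]]


def make_gain_filter(gain: float) -> Filter:
    """Toy filter that scales every feature by a constant (illustration only)."""
    return lambda frame: [gain * value for value in frame]


def two_stage_convert(frame: List[float],
                      first_filter: Filter,
                      second_filter: Filter) -> List[float]:
    """Speaker -> intermediate speaker -> target speaker."""
    intermediate = first_filter(frame)   # first voice quality conversion filter
    return second_filter(intermediate)   # second voice quality conversion filter


def filters_needed(num_speakers: int, num_targets: int) -> Dict[str, int]:
    """Compare one filter per (speaker, target) pair with the two-stage scheme."""
    return {
        "direct": num_speakers * num_targets,
        "two_stage": num_speakers + num_targets,
    }


if __name__ == "__main__":
    first = make_gain_filter(0.5)    # prepared once per speaker
    second = make_gain_filter(4.0)   # prepared once per target, shared by all speakers
    print(two_stage_convert([1.0, 2.0], first, second))
    print(filters_needed(100, 20))
```

Under these assumptions, changing the target speaker only swaps `second`; the speaker's own `first` filter is reused unchanged.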

請求項２４に記載の発明は、請求項２２または２３に記載の声質変換・画像編集サービス提供システムに用いられるサーバであって、前記サーバデータ記憶部およびサーバ送信部を備えることを特徴とする。 According to a twenty-fourth aspect of the present invention, there is provided a server used in the voice quality conversion / image editing service providing system according to the twenty-second or twenty-third aspect, comprising the server data storage unit and the server transmission unit.

このような構成であれば、請求項２２または２３に記載の声質変換・画像編集サービス提供システムにおけるサーバと同等の作用及び効果が得られる。 With such a configuration, operations and effects equivalent to those of the server in the voice quality conversion / image editing service providing system according to claim 22 or 23 can be obtained.

請求項２５に記載の発明は、請求項２２または２３に記載の声質変換・画像編集サービス提供システムに用いられるテレビ電話機であって、前記受信部、電話機データ記憶部、話者選択部、データ選択部、撮像部、画像編集部、音声入力部、声質変換部、および電話機送信部を備えることを特徴とする。 The invention according to claim 25 is a video phone used in the voice quality conversion / image editing service providing system according to claim 22 or 23, comprising the receiving unit, the telephone data storage unit, the speaker selection unit, the data selection unit, the imaging unit, the image editing unit, the voice input unit, the voice quality conversion unit, and the telephone transmission unit.

このような構成であれば、請求項２２または２３に記載の声質変換・画像編集サービス提供システムにおけるテレビ電話機と同等の作用及び効果が得られる。 With such a configuration, the same operation and effect as the video phone in the voice quality conversion / image editing service providing system according to claim 22 or 23 can be obtained.

また、上述したテレビ電話機における従属項に対応する構成要素やその説明は、通話方法、プログラム、声質変換・画像編集サービス提供システム、および、サーバにも適用可能である。 Further, the constituent elements corresponding to the dependent claims in the above-described videophone and the description thereof can be applied to a calling method, a program, a voice quality conversion / image editing service providing system, and a server.

以上説明したように本発明のテレビ電話機によれば、所謂ものまねを行う際に、声質変換された目標話者の音声と、目標話者に関連する画像とを連動して通話相手に送信することができ、通話相手は、発話者が誰のものまねをしているのかを聴覚および視覚を通じて直感的に把握することが可能となる。 As described above, according to the video phone of the present invention, when performing so-called imitation, the voice of the target speaker whose voice quality has been converted and the image related to the target speaker are transmitted to the other party in conjunction with each other. The call partner can intuitively grasp who the speaker is imitating through hearing and vision.

以下に添付図面を参照しながら、本発明の好適な実施の形態について詳細に説明する。なお、本明細書及び図面において、実質的に同一の機能構成を有する構成要素については、同一の符号を付することにより重複説明を省略する。 Preferred embodiments of the present invention will be described below in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and redundant description of them is omitted.

携帯電話等の携帯型移動端末には、利用者の顔画像を撮像するカメラと、通話相手の顔画像を表示するディスプレイとが装備されている。かかる構成を利用して、通話相手の現在の通話状態を確認しながら会話する、所謂テレビ電話が可能となる。 A portable mobile terminal such as a mobile phone is equipped with a camera that captures the face image of the user and a display that displays the face image of the other party. Using such a configuration, a so-called videophone call is possible in which a conversation is made while confirming the current call state of the other party.

このようなテレビ電話では、画像や音声を加工してより通話を楽しむ工夫がなされ、娯楽性が高められている。例えば、自身の分身としてのキャラクタであるアバターを選択し、通話中に自身の画像の代わりにそのアバターの画像を送信したり、自身の音声を他の音声に変換して通話相手に伝えたりすることができる。 In such video phones, images and voices are processed to make calls more enjoyable, which increases their entertainment value. For example, a user can select an avatar, a character that acts as the user's alter ego, and transmit the avatar's image instead of the user's own image during a call, or convert the user's own voice into another voice and convey it to the other party.

しかし、アバターが自身の分身、もしくは通話相手の分身であることと、声質変換との関連性を見いだすことはできない。従って、発話者の音声を周知の目標話者の音声に変換する所謂「ものまね」を行ったとしても、臨場感に欠けてしまう。さらには、テレビ電話機を介すことによる音質の劣化が生じ、そのものまねしている目標話者が誰であるかを発話者が明示的に示さない限り、誰のものまね音声であるかを通話相手が把握するのは困難である。 However, no connection can be found between voice quality conversion and the fact that an avatar is an alter ego of the user or of the other party. Therefore, even if so-called "imitation" is performed, in which the speaker's voice is converted into the voice of a well-known target speaker, it lacks a sense of presence. Furthermore, sound quality deteriorates when the voice passes through the video phone, so unless the speaker explicitly indicates who the imitated target speaker is, it is difficult for the other party to grasp whose voice is being imitated.

本発明の実施形態においては、発話者から通話相手に画像および音声を送信する際に、発話者から目標話者への声質変換が行われた場合、それと連動して発話者の画像も目標話者に関連する画像に変化させることを特徴としている。この画像が変化するタイミングとしては、１．発話者が操作したとき、２．発話者の発話が変化したとき、３．発話者が視覚的に動いたとき、が挙げられる。以下、上記３つのタイミングに関して、３つの実施形態に分けて説明する。また、発話者と目標話者との間の声質の変換に関するさらなる特徴を第４および第５の実施形態として詳述する。 The embodiments of the present invention are characterized in that, when an image and voice are transmitted from the speaker to the other party and voice quality conversion from the speaker to the target speaker is performed, the image of the speaker is also changed, in conjunction with that conversion, into an image related to the target speaker. The timings at which the image changes include: 1. when the speaker performs an operation; 2. when the speaker's utterance changes; and 3. when the speaker visibly moves. These three timings are described below in three separate embodiments. Further features relating to the conversion of voice quality between the speaker and the target speaker are described in detail as the fourth and fifth embodiments.

（第１の実施形態） (First embodiment)
（声質変換・画像編集サービス提供システム） (Voice quality conversion / image editing service providing system)
図１は、テレビ電話機１００を利用した声質変換・画像編集サービス提供システムを説明するための説明図である。上記声質変換・画像編集サービス提供システムの一つの実施形態を説明すると、発信側の発話者１１０や対話先の通話相手１２０がテレビ電話機１００としての携帯電話を有し、互いが基地局１３０を介して通信網１３２により接続されている。ここで、テレビ電話機１００は、携帯電話機、ＰＨＳ（ＰｅｒｓｏｎａｌＨａｎｄｙｐｈｏｎｅＳｙｓｔｅｍ）端末、家庭用電話機、ＰＤＡ（ＰｅｒｓｏｎａｌＤｉｇｉｔａｌＡｓｓｉｓｔａｎｔ）、モバイルパーソナルコンピュータ、パーソナルコンピュータ等の情報通信端末から形成される。 FIG. 1 is an explanatory diagram for explaining a voice quality conversion / image editing service providing system using a video phone 100. In one embodiment of the voice quality conversion / image editing service providing system, the speaker 110 on the calling side and the call partner 120 each have a mobile phone serving as the video phone 100, and the two are connected to each other by a communication network 132 via base stations 130. Here, the video phone 100 is formed of an information communication terminal such as a mobile phone, a PHS (Personal Handyphone System) terminal, a home telephone, a PDA (Personal Digital Assistant), a mobile personal computer, or a personal computer.

かかるテレビ電話機１００は、テレビ電話を使う発話者１１０が発した画像と音声とをリアルタイムに通話相手１２０に送信し、同時に通話相手１２０の画像と音声とをリアルタイムに受信する。従って、発話者１１０は、通話相手１２０の現在の顔色を窺いながら、あたかも目の前で対話するが如く通話を楽しむことができる。ここで、音声とは、人間が発声器官を通じて発したり、電子機器で再生される言語音であり、１または２以上の基本周波数を合成した波形で表される発音の集合を言い、声質とは音声の音色や音程のことを言う。 The video phone 100 transmits the image and voice of the speaker 110 using the video phone to the call partner 120 in real time, and at the same time receives the image and voice of the call partner 120 in real time. The speaker 110 can therefore enjoy the call as if talking face to face, while watching the current expression of the call partner 120. Here, voice means speech sound that a human utters through the vocal organs or that an electronic device reproduces, that is, a set of sounds represented by a waveform synthesized from one or more fundamental frequencies, and voice quality means the timbre and pitch of the voice.

また、通信網１３２には、サーバ１４０も接続されている。かかるサーバ１４０は、当該テレビ電話機１００で利用される目標話者に関連する画像や目標話者に対応する声質変換フィルタを組み合わせたパッケージデータを保持し、テレビ電話機１００からの要求に応じてこのパッケージデータを提供（配信）することができる。以下、テレビ電話機１００やサーバ１４０に関して詳述する。 A server 140 is also connected to the communication network 132. The server 140 holds package data in which an image related to a target speaker used in the videophone 100 and a voice quality conversion filter corresponding to the target speaker are combined, and this package is received in response to a request from the videophone 100. Data can be provided (distributed). Hereinafter, the video phone 100 and the server 140 will be described in detail.

（テレビ電話機１００） (Video phone 100)
図２は、テレビ電話機１００の概略的な構成を示した機能ブロック図である。かかるテレビ電話機１００は、中央制御部１４６と、電話機データ記憶部１４８と、表示部１５０と、話者選択スイッチ１５２と、話者選択部１５４と、データ選択部１５６と、撮像部１６０と、画像編集部１６２と、音声入力部１６４と、声質変換部１６６と、電話機送信部１６８と、変調送信部１７０と、アンテナ部１７２と、ものまねスイッチ１７４と、受信復調部１７６と、受信部１８０と、画像表示部１８２と、音声出力部１８４と、スピーカ１８６と、機能許可部１８８とを含んで構成される。また、ここでは、テレビ電話機１００で発話するユーザ１１０を発話者、その発話を聞くユーザ１２０を通話相手と呼ぶ。 FIG. 2 is a functional block diagram showing a schematic configuration of the video phone 100. The video phone 100 includes a central control unit 146, a telephone data storage unit 148, a display unit 150, a speaker selection switch 152, a speaker selection unit 154, a data selection unit 156, an imaging unit 160, an image editing unit 162, a voice input unit 164, a voice quality conversion unit 166, a telephone transmission unit 168, a modulation transmission unit 170, an antenna unit 172, an imitation switch 174, a reception demodulation unit 176, a receiving unit 180, an image display unit 182, a voice output unit 184, a speaker 186, and a function permission unit 188. Here, the user 110 who speaks on the video phone 100 is called the speaker, and the user 120 who listens to the speech is called the call partner.

上記中央制御部１４６は、中央処理装置（ＣＰＵ）を含む半導体集積回路により、当該テレビ電話機１００全体を管理および制御する。また点線で囲まれた領域１９０内の各構成要素は、この中央制御部１４６の管理下にあり、電気回路またはプログラムモジュールとして機能する。従って、領域１９０内の各構成要素は、通常記憶媒体に保持され、中央処理装置に読み込まれて各機能を遂行する。 The central control unit 146 manages and controls the entire video phone 100 by a semiconductor integrated circuit including a central processing unit (CPU). Each component in the area 190 surrounded by a dotted line is under the control of the central control unit 146 and functions as an electric circuit or a program module. Accordingly, each component in the area 190 is normally held in a storage medium and read into the central processing unit to perform each function.

上記電話機データ記憶部１４８は、ＲＡＭ、Ｅ^２ＰＲＯＭ、不揮発性ＲＡＭ、フラッシュメモリ、カードメモリ、ＵＳＢメモリ、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）等の記憶媒体から構成され、少なくとも、ものまね対象である目標話者に関連する目標話者データとして、目標話者に関連する１または２以上の画像１９２と、発話者の声質から目標話者の声質への声質変換を可能にする声質変換フィルタ１９４とを記憶している。 The telephone data storage unit 148 is composed of a storage medium such as a RAM, an E^2PROM, a non-volatile RAM, a flash memory, a card memory, a USB memory, or an HDD (Hard Disk Drive). As target speaker data related to the target speaker to be imitated, it stores at least one or more images 192 related to the target speaker and a voice quality conversion filter 194 that enables voice quality conversion from the voice quality of the speaker to the voice quality of the target speaker.

また、目標話者データは、上記目標話者に関連する画像１９２と後述する第２声質変換フィルタとの組み合わせであってもよく、かかる組み合わせによるパッケージデータとしてサーバ１４０からテレビ電話機１００にダウンロードされるとしてもよい。 The target speaker data may also be a combination of the image 192 related to the target speaker and a second voice quality conversion filter described later, and this combination may be downloaded from the server 140 to the video phone 100 as package data.

上記表示部１５０は、液晶表示器等からなり、当該テレビ電話機１００で利用されるアプリケーション等の選択画面出力もしくは結果出力を行う。図２では、ものまね対象となる目標話者を選択するための目標話者リスト１９６が表示されている。この目標話者リスト１９６は、後述する話者選択部１５４によって生成される。また、表示部１５０は、テレビ電話の際、通話相手のテレビ電話機から受信した通話相手の画像も表示する。 The display unit 150 includes a liquid crystal display or the like, and performs selection screen output or result output of an application or the like used in the video phone 100. In FIG. 2, a target speaker list 196 for selecting a target speaker to be imitated is displayed. The target speaker list 196 is generated by a speaker selection unit 154 described later. The display unit 150 also displays an image of the other party received from the other party's video phone during a videophone call.

上記話者選択スイッチ１５２は、十字キー、ジョグダイヤル、キーボード等から形成され、その押圧により、表示部１５０に表示された目標話者リスト１９６から特定の目標話者を選択する。また、通話相手への発信や他のアプリケーションの操作等にも利用される。 The speaker selection switch 152 is formed of a cross key, a jog dial, a keyboard, or the like, and selects a specific target speaker from the target speaker list 196 displayed on the display unit 150 when pressed. It is also used for making calls to other parties and operating other applications.

上記話者選択部１５４は、まず、データ選択部１５６を介して電話機データ記憶部１４８に記憶されている目標話者に関連する画像１９２または声質変換フィルタ１９４を参照し、目標話者として選択することが可能な話者のリストである目標話者リスト１９６を生成し、表示部１５０に送信する。ここで、発話者が、表示部１５０に表示された目標話者リスト１９６中から、話者選択スイッチ１５２を通じて目標話者を選択した場合、話者選択部１５４は、その選択された目標話者をものまね対象の目標話者として認識し、データ選択部１５６に伝達する。 The speaker selection unit 154 first refers, via the data selection unit 156, to the images 192 related to target speakers or the voice quality conversion filters 194 stored in the telephone data storage unit 148, generates a target speaker list 196, which is a list of the speakers that can be selected as the target speaker, and sends it to the display unit 150. When the speaker selects a target speaker from the target speaker list 196 displayed on the display unit 150 through the speaker selection switch 152, the speaker selection unit 154 recognizes the selected target speaker as the target speaker to be imitated and notifies the data selection unit 156.

上記データ選択部１５６は、話者選択部１５４で選択された目標話者に関連する画像１９２である目標話者関連画像と、目標話者に対応する声質変換フィルタ１９４とを、電話機データ記憶部１４８から選択する。 The data selection unit 156 selects, from the telephone data storage unit 148, the target speaker related image, which is the image 192 related to the target speaker selected by the speaker selection unit 154, and the voice quality conversion filter 194 corresponding to that target speaker.
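The interplay of the telephone data storage unit 148, the speaker selection unit 154 and the data selection unit 156 can be sketched roughly as below. This is a hedged illustration only: the class and function names, the use of file-name strings for images and filters, and the choice of the first stored image are all assumptions made for the example and are not specified by the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TargetSpeakerData:
    images: List[str]   # identifiers of images 192 (hypothetical file names)
    filter_id: str      # identifier of voice quality conversion filter 194


class PhoneDataStorage:
    """Stand-in for the telephone data storage unit 148."""

    def __init__(self) -> None:
        self._entries: Dict[str, TargetSpeakerData] = {}

    def store(self, speaker: str, data: TargetSpeakerData) -> None:
        self._entries[speaker] = data

    def speakers(self) -> List[str]:
        return sorted(self._entries)

    def get(self, speaker: str) -> TargetSpeakerData:
        return self._entries[speaker]


def build_target_speaker_list(storage: PhoneDataStorage) -> List[str]:
    """Speaker selection unit 154: list the speakers that can be chosen."""
    return storage.speakers()


def select_data(storage: PhoneDataStorage, speaker: str) -> Tuple[str, str]:
    """Data selection unit 156: pick the related image and the filter.

    Assumption: the first stored image is used as the target speaker related image.
    """
    data = storage.get(speaker)
    return data.images[0], data.filter_id


if __name__ == "__main__":
    storage = PhoneDataStorage()
    storage.store("B", TargetSpeakerData(["b_face.png"], "filter_B"))
    storage.store("A", TargetSpeakerData(["a_face.png", "a_smile.png"], "filter_A"))
    print(build_target_speaker_list(storage))
    print(select_data(storage, "B"))
```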

上記撮像部１６０は、ＣＣＤ（ＣｈａｒｇｅＣｏｕｐｌｅｄＤｅｖｉｃｅ）やＣＭＯＳ（ＣｏｍｐｌｅｍｅｎｔａｒｙＭｅｔａｌＯｘｉｄｅＳｅｍｉｃｏｎｄｕｃｔｏｒ）等の撮像素子を含んで形成され、この撮像素子により入力（撮像）された画像を取り込んで、画像編集部１６２に送信する。 The imaging unit 160 includes an imaging element such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor) sensor, captures the image input (picked up) by the imaging element, and sends it to the image editing unit 162.

上記画像編集部１６２は、撮像部１６０から入力された発話者やその周辺を含む画像を、データ選択部１５６が選択した目標話者関連画像に基づいて編集する。この編集は、入力された画像の一部や全部に、目標話者に関連する画像を重ねることを含む。重ねる画像は、生写真等、目標話者そのものであってもよいし、目標話者に関連する、例えば、目標話者を連想可能な画像であってもよい。 The image editing unit 162 edits the image including the speaker and its periphery input from the imaging unit 160 based on the target speaker related image selected by the data selection unit 156. This editing includes superimposing an image related to the target speaker on part or all of the input image. The superimposed image may be the target speaker itself, such as a raw photo, or may be an image related to the target speaker, for example, an image associated with the target speaker.

ここで、目標話者に関連する画像１９２に基づいて編集するとは、後述するように入力された画像に連動して目標話者関連画像を変化させることでもあるが、入力された画像に拘わらず、目標話者関連画像をアバターのように送信画像として利用する等、単に目標話者関連画像に置き換えることも含んでいる。 Here, editing based on the image 192 related to the target speaker means changing the target speaker related image in conjunction with the input image, as described later, but it also includes simply replacing the input image with the target speaker related image regardless of the input image, for example by using the target speaker related image as the transmitted image in the manner of an avatar.

画像編集部１６２が撮像部１６０から入力された画像に目標話者を上書きする場合、目標話者関連画像に座標指定が関連付けられていれば、かかる目標話者関連画像毎の座標指定に従って、その座標位置に目標話者関連画像を表示してもよく、発話者の顔画像にお面をかぶせる要領で発話者の顔画像の位置に表示してもよい。 When the image editing unit 162 overwrites the target speaker on the image input from the image capturing unit 160, if the coordinate designation is associated with the target speaker related image, the image editing unit 162 follows the coordinate designation for each target speaker related image. The target speaker-related image may be displayed at the coordinate position, or may be displayed at the position of the speaker's face image in the manner of covering the face image of the speaker.
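The placement rule in the preceding paragraph can be sketched as follows. This is a minimal sketch under stated assumptions: images are plain nested lists of pixel values, the function names `overlay` and `edit_image` are hypothetical, and the position of the speaker's face is assumed to be supplied by some face detector that is outside the scope of the example.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of pixel values (toy representation)


def overlay(frame: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of frame with patch pasted at (top, left), clipped to the frame."""
    out = [row[:] for row in frame]
    height = len(out)
    width = len(out[0]) if out else 0
    for dy, patch_row in enumerate(patch):
        for dx, pixel in enumerate(patch_row):
            y, x = top + dy, left + dx
            if 0 <= y < height and 0 <= x < width:
                out[y][x] = pixel
    return out


def edit_image(frame: Image,
               target_image: Image,
               coords: Optional[Tuple[int, int]],
               face_position: Tuple[int, int]) -> Image:
    """Use the coordinates attached to the target speaker related image if any,
    otherwise cover the speaker's face like a mask."""
    top, left = coords if coords is not None else face_position
    return overlay(frame, target_image, top, left)


if __name__ == "__main__":
    frame = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    mask = [[9, 9], [9, 9]]
    print(edit_image(frame, mask, coords=None, face_position=(1, 1)))
    print(edit_image(frame, mask, coords=(0, 0), face_position=(1, 1)))
```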

上記音声入力部１６４は、マイクロフォン等音声入力可能な装置で構成され、発話者の発する音声を電気信号に変換して声質変換部１６６に送信する。 The voice input unit 164 is configured by a device capable of inputting voice, such as a microphone, and converts a voice uttered by a speaker into an electrical signal and transmits the electrical signal to the voice quality conversion unit 166.

上記声質変換部１６６は、データ選択部１５６が選択した声質変換フィルタ１９４を用いて、音声入力部１６４から入力された発話者の音声の声質を、話者選択部１５４で選択されている目標話者の声質に変換する。本実施形態においては、声質変換部１６６が音色のみ変換しているので、発話者は、目標話者の言い回しを真似る必要はあるものの、ピッチを自動調整したり地声と裏声のそれぞれのフィルタを用意するなどの工夫を施して、ものまねの質を上げることも可能である。 The voice quality conversion unit 166 uses the voice quality conversion filter 194 selected by the data selection unit 156 to convert the voice quality of the speaker's voice input from the voice input unit 164 into the voice quality of the target speaker selected by the speaker selection unit 154. In this embodiment the voice quality conversion unit 166 converts only the timbre, so the speaker still needs to imitate the target speaker's manner of speaking. However, the quality of the imitation can also be raised by measures such as automatically adjusting the pitch or preparing separate filters for the natural voice and the falsetto voice.

声質変換部１６６は、例えば、混合正規分布モデル（ＧＭＭ：ＧａｕｓｓｉａｎＭｉｘｔｕｒｅＭｏｄｅｌ）に基づいて、スペクトル系列等の特徴量を変換する特徴量変換法（例えば、A. Kain and M.W.Macon," Spectral voice conversion for text-to-speech synthesis," Proc.ICASSP,pp.285-288,Seattle,U.S.A.May,1998.参照）で実現でき、その他にもあらゆる公知の手法を用いることが可能である。 The voice quality conversion unit 166 can be realized, for example, by a feature conversion method that converts features such as spectral sequences based on a Gaussian mixture model (GMM) (see, for example, A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, pp. 285-288, Seattle, U.S.A., May 1998), and any other known method can also be used.
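For orientation only, GMM-based feature conversion of this kind is commonly written as a weighted sum of per-component linear regressions. The form below is the widely used joint-density formulation and is given as an assumption about the family of methods referred to, not as a formula stated in this specification. Here x is a source feature vector, F(x) the converted vector, M the number of mixture components, and the weights, means and covariances are parameters of a GMM trained on paired source and target features.

```latex
F(x) = \sum_{i=1}^{M} p(i \mid x)\,
       \Bigl[\, \mu_i^{y} + \Sigma_i^{yx} \bigl(\Sigma_i^{xx}\bigr)^{-1} \bigl(x - \mu_i^{x}\bigr) \Bigr],
\qquad
p(i \mid x) = \frac{\alpha_i \,\mathcal{N}\!\bigl(x;\, \mu_i^{x}, \Sigma_i^{xx}\bigr)}
                   {\sum_{j=1}^{M} \alpha_j \,\mathcal{N}\!\bigl(x;\, \mu_j^{x}, \Sigma_j^{xx}\bigr)}
```

In this reading, a "voice quality conversion filter" would correspond to one trained parameter set for a given pair of source and target voices.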

上記電話機送信部１６８は、通常通話において、発話者自身の画像と音声を変調送信部１７０に送信し、画像編集部１６２が編集した画像と、声質変換部１６６が変換した音声とを変調送信部１７０に送信する。 In a normal call, the telephone transmission unit 168 sends the speaker's own image and voice to the modulation transmission unit 170; it also sends the image edited by the image editing unit 162 and the voice converted by the voice quality conversion unit 166 to the modulation transmission unit 170.

上記変調送信部１７０は、電話機送信部１６８から受信した電気信号の周波数を送信周波数に変調して変調信号を生成し、この変調信号をアンテナ部１７２に送出する。 The modulation transmission unit 170 generates a modulation signal by modulating the frequency of the electrical signal received from the telephone transmission unit 168 to the transmission frequency, and transmits the modulation signal to the antenna unit 172.

上記アンテナ部１７２は、変調送信部１７０の出力を送信電波に変えて送信し、通話相手のテレビ電話機にその内容を伝達する。また、通話相手のテレビ電話機からの受信電波を受信し、後述する受信復調部１７６に出力する。 The antenna unit 172 changes the output of the modulation transmission unit 170 to a transmission radio wave and transmits it, and transmits the contents to the video phone of the other party. In addition, it receives radio waves received from the other party's video phone and outputs them to a reception demodulator 176 described later.

上記ものまねスイッチ１７４は、発話者により所望のタイミングで押圧され、このものまねスイッチ１７４が有効な間、画像編集部１６２および声質変換部１６６を含むものまね機能が実行される。このものまねスイッチ１７４は、初期状態を、ものまね、即ち、画像編集部１６２および声質変換部１６６を機能させる状態とし、押圧毎にものまね解除とものまね処理開始とが切り替わるとしてもよく、押圧し続けている期間のみものまね処理が実行されるとしてもよい。また、ものまねスイッチ１７４は、物理的なボタンでもよいし、その都度メニュー画面からものまねを行うかどうかを設定するソフト入力であってもよい。このようにして、発話者は、ものまねスイッチ１７４の操作により、ものまねを行うタイミングを、自己の意志に基づいて決めることができる。従って、発話者は、ものまねしたい発話に対してのみ、ものまねを実行することができ、その部分を通話相手に強調して伝えることができる。 The imitation switch 174 is pressed by the speaker at a desired timing, and while the imitation switch 174 is enabled, the imitation function including the image editing unit 162 and the voice quality conversion unit 166 is executed. The imitation switch 174 may have imitation as its initial state, that is, the state in which the image editing unit 162 and the voice quality conversion unit 166 function, and switch between cancelling imitation and starting imitation processing each time it is pressed. Alternatively, imitation processing may be executed only while the switch is held down. The imitation switch 174 may be a physical button, or it may be a software input that sets, each time, from a menu screen whether to perform imitation. In this way, by operating the imitation switch 174, the speaker can decide the timing of imitation at will. The speaker can therefore imitate only the utterances he or she wants to imitate and convey those parts to the other party with emphasis.
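The two switch behaviours described above (toggle on each press, or active only while held) can be sketched as a small state machine. The class name `ImitationSwitch`, the mode strings and the `outgoing` helper are hypothetical names chosen for this illustration; the source does not prescribe any particular implementation.

```python
class ImitationSwitch:
    """Toy model of the imitation switch 174.

    mode="toggle": starts in the imitation state and flips on every press.
    mode="hold":   imitation is active only while the switch is held down.
    """

    def __init__(self, mode: str = "toggle") -> None:
        if mode not in ("toggle", "hold"):
            raise ValueError("mode must be 'toggle' or 'hold'")
        self.mode = mode
        # In toggle mode the initial state is "imitation on", as in the text.
        self.active = (mode == "toggle")

    def press(self) -> None:
        if self.mode == "toggle":
            self.active = not self.active
        else:
            self.active = True

    def release(self) -> None:
        if self.mode == "hold":
            self.active = False


def outgoing(switch: ImitationSwitch, raw: str, imitated: str) -> str:
    """Choose what is handed to the transmission side."""
    return imitated if switch.active else raw


if __name__ == "__main__":
    hold = ImitationSwitch("hold")
    print(outgoing(hold, "own voice", "target voice"))
    hold.press()
    print(outgoing(hold, "own voice", "target voice"))
    hold.release()
    print(outgoing(hold, "own voice", "target voice"))
```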

上記受信復調部１７６は、受信電波の増幅と周波数同調検波を行い、さらに復調処理を経て得られた電気信号を受信部１８０に送信する。 The reception demodulator 176 performs amplification of received radio waves and frequency tuning detection, and further transmits an electrical signal obtained through demodulation processing to the receiver 180.

上記受信部１８０は、受信復調部１７６からの電気信号を受信して、通話相手の画像と音声とを画像表示部１８２および音声出力部１８４とに送信する。また、画像編集部１６２で利用される目標話者に関連する画像や、声質変換部１６６で利用される声質変換フィルタを外部の電子機器、例えば、サーバ１４０から受信することもできる。また、声質変換フィルタ１９４として、後述する第１声質変換フィルタもしくは第２声質変換フィルタのいずれか一方または両方を受信することもできる。 The receiving unit 180 receives the electrical signal from the reception demodulating unit 176 and transmits the image and sound of the other party to the image display unit 182 and the audio output unit 184. In addition, an image related to the target speaker used in the image editing unit 162 and a voice quality conversion filter used in the voice quality conversion unit 166 can be received from an external electronic device such as the server 140. Further, as the voice quality conversion filter 194, either or both of a first voice quality conversion filter and a second voice quality conversion filter described later can be received.

上記画像表示部１８２は、受信部１８０で受信された通話相手の画像を、表示部１５０にリアルタイムに表示する。 The image display unit 182 displays the call partner image received by the receiving unit 180 on the display unit 150 in real time.

上記音声出力部１８４は、受信部１８０で受信された通話相手の音声を、スピーカ１８６にリアルタイムに出力する。 The voice output unit 184 outputs the other party's voice received by the receiving unit 180 to the speaker 186 in real time.

上記スピーカ１８６は、音声出力部１８４からの音声信号を受けて、通話相手の音声を発話者に伝達する。 The speaker 186 receives the audio signal from the audio output unit 184 and transmits the voice of the other party to the speaker.

上記機能許可部１８８は、自己のテレビ電話機１００を特定することが可能な識別子、例えば、自己のテレビ電話機１００の電話番号（ナンバーディスプレイ）や、声質変換に用いる固有の声質変換フィルタの識別情報、後述する固有の第１声質変換フィルタの識別情報等が、通話相手のテレビ電話機に送信されている場合に限り、画像編集部１６２および声質変換部１６６が機能することを許可する。かかる構成により、通話相手は、電話をかけてきた発話者を特定することが可能となり、目標話者へのなりすましや、発話者を特定できないことに基づく障害を回避することができる。 The function permission unit 188 permits the image editing unit 162 and the voice quality conversion unit 166 to function only when an identifier capable of identifying the user's own video phone 100 is being transmitted to the other party's video phone. Examples of such an identifier are the telephone number of the video phone 100 (caller ID display), identification information of a unique voice quality conversion filter used for voice quality conversion, and identification information of a unique first voice quality conversion filter described later. With this configuration, the other party can identify the speaker who placed the call, which avoids impersonation of the target speaker and problems arising from being unable to identify the speaker.
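The gating rule of the function permission unit 188 amounts to a simple check, sketched below. The dictionary keys and the function name `imitation_permitted` are assumptions made for the example; the source only states that some identifier of the kinds listed must be transmitted to the other party.

```python
from typing import Mapping, Optional

# Hypothetical keys for the kinds of identifier named in the text.
IDENTIFIER_KEYS = ("phone_number", "filter_id", "first_filter_id")


def imitation_permitted(sent_to_partner: Mapping[str, Optional[str]]) -> bool:
    """Allow image editing and voice quality conversion only if at least one
    non-empty identifier is being sent to the other party's video phone."""
    return any(sent_to_partner.get(key) for key in IDENTIFIER_KEYS)


if __name__ == "__main__":
    print(imitation_permitted({"phone_number": "0312345678"}))
    print(imitation_permitted({"phone_number": ""}))  # caller ID withheld
```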

FIG. 3 is an external view showing an example layout of the user interface of the videophone 100 described above. The speaker receives the image and voice of the other party through the display unit 150 and the loudspeaker 186, and conveys his or her own image and voice to the other party through the imaging unit 160 and the voice input unit 164. The speaker selects the target speaker to be imitated with the speaker selection switch 152, and presses the imitation switch 174 when actually performing the imitation.

In this embodiment, the imitation switch 174 is provided below the display unit 150. The arrangement is not limited to this: the switch may be placed on the front, a side, or the back of the main body of the videophone 100, provided that it does not obstruct input to the imaging unit 160. The function of the imitation switch 174 may also be integrated into the existing speaker selection switch 152, so that the speaker selection switch 152 also serves as the imitation switch 174 during a call. Furthermore, the imitation switch 174 can be implemented with an existing pointing device.

Alternatively, several imitation switches 174 may be provided, a jog dial may offer multiple choices, or multiple choices may be made through the speaker selection switch 152 or a pointing device, so that the speaker can select at any time among several expressions of the same target speaker (joy, anger, sorrow, pleasure, and so on) or among several target speakers (Mr. A, Mr. B, and so on). When imitating, the speaker can therefore decide both the timing and which expression or which target speaker to imitate, and can convey the intended imitation to the other party with clearer emphasis.

In this embodiment, the videophone 100 is used to convert the voice quality of the speaker into that of the target speaker, thereby performing an "imitation". By transmitting not only the target speaker's voice but also an image related to the target speaker to the other party's videophone, the other party can tell intuitively, by both hearing and sight, who the target speaker is.

(Calling method)
A calling method is also provided in which the videophone 100 described above is used to convert the voice and image of the speaker into the voice and image of the target speaker and convey them to the other party.

FIG. 4 is a flowchart showing the processing flow of the calling method. In this method, the image 192 related to the target speaker and the voice quality conversion filter 194 are stored in advance in the telephone data storage unit 148.

The speaker selection unit 154 of the videophone 100 selects a target speaker based on the speaker's selection operation, and the data selection unit 156 selects, from the telephone data storage unit 148, the target speaker related image (the image 192 related to the target speaker selected by the speaker selection unit 154) and the voice quality conversion filter 194 corresponding to that target speaker (S200). The speaker then enters the other party's telephone number to open a line to the other party and start the call (S202).

Next, the central control unit 146 of the videophone 100 determines whether the imitation switch 174 has been pressed by the speaker and is active (S204). If the imitation switch 174 is determined to be active, the imitation process is run.

In the imitation process, the image editing unit 162 first edits the image input from the imaging unit 160 based on the image 192 related to the target speaker (S206).

FIGS. 5, 6, and 7 are explanatory diagrams showing how the display unit 150 changes as a result of the image editing step (S206). The image editing unit 162 of the videophone 100 reads the target speaker related image from the telephone data storage unit 148 and superimposes it at an arbitrary position on the input image.

For example, in FIG. 5, while the imitation switch 174 is active, the entire image is overwritten with the image of the target speaker, so that the target speaker fills the whole display unit 150. In FIG. 6, the portion 250 corresponding to the speaker's face is overwritten with the face image 252 of the target speaker. In FIG. 7, the speaker's image is left intact and the target speaker 260 is displayed in a region that does not interfere with recognizing the speaker, such as the background behind the speaker.

The image related to the target speaker used here may be a still image or a moving image. Its content may be a live-action image of a real person such as an actor or voice actor, or of a person who has died, a character from animation or the like, or computer graphics.

Next, the voice quality conversion unit 166 converts the voice quality of the speaker input from the voice input unit 164 into the voice quality of the target speaker using the voice quality conversion filter 194 (S208). If the imitation switch 174 is determined to be inactive, the image editing unit 162 and the voice quality conversion unit 166 do not operate, and the transmitted image and voice remain those of the speaker.

Finally, the telephone transmission unit 168 transmits the image and voice produced by the image editing step (S206) and the voice quality conversion step (S208) to the other party (S210). Whether the call continues is then determined (S212); if the call has not ended, the processing from the imitation switch 174 check (S204) is repeated until the end of the call is detected.
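
The loop of steps S204 to S212 can be sketched roughly as below. This is only an illustrative outline: the function `run_call` and its callable parameters are hypothetical stand-ins for the image editing unit 162, the voice quality conversion unit 166, and the telephone transmission unit 168, and the end of the call is modelled simply as the input sequence running out.

```python
from typing import Any, Callable, Iterable, Tuple


def run_call(
    cycles: Iterable[Tuple[Any, Any, bool]],
    edit_image: Callable[[Any], Any],
    convert_voice: Callable[[Any], Any],
    send: Callable[[Any, Any], None],
) -> None:
    """Process one (image, voice, switch_active) tuple per cycle until the call ends."""
    for image, voice, switch_active in cycles:  # S212: repeat while the call continues
        if switch_active:                       # S204: is the imitation switch active?
            image = edit_image(image)           # S206: edit with the target speaker image
            voice = convert_voice(voice)        # S208: apply the voice quality conversion filter
        send(image, voice)                      # S210: transmit to the other party
```

When the switch is inactive the captured image and voice pass through unchanged, which matches the behaviour described for an inactive imitation switch 174.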

With this calling method, the speaker can deliberately control when the imitation takes place. The imitation can therefore be applied only to the utterances the speaker wants to imitate, and those parts can be conveyed to the other party with emphasis.

A program that causes a computer to execute the method described above, and a computer-readable storage medium storing that program, are also provided.

(Second Embodiment)
The first embodiment described an operation that converts the speaker's image into the target speaker's image at a timing deliberately chosen by the speaker. The second embodiment describes in detail an operation, used in addition to or independently of that operation, that converts the image in response to changes in the speaker's utterance.

(Videophone 300)
FIG. 8 is a functional block diagram showing a schematic configuration of the videophone 300 in the second embodiment. The videophone 300 includes a central control unit 146, a telephone data storage unit 148, a display unit 150, a speaker selection switch 152, a speaker selection unit 154, a data selection unit 156, an imaging unit 160, an image editing unit 162, a voice input unit 164, a speaker state information generation unit 310, a voice detection unit 312, a voice recognition unit 314, a phoneme recognition unit 316, a voice quality conversion unit 166, a telephone transmission unit 168, a modulation transmission unit 170, an antenna unit 172, an imitation switch 174, a reception demodulation unit 176, a receiving unit 180, an image display unit 182, a voice output unit 184, a loudspeaker 186, and a function permission unit 188.

The central control unit 146, telephone data storage unit 148, display unit 150, speaker selection switch 152, speaker selection unit 154, data selection unit 156, imaging unit 160, voice input unit 164, voice quality conversion unit 166, telephone transmission unit 168, modulation transmission unit 170, antenna unit 172, imitation switch 174, reception demodulation unit 176, receiving unit 180, image display unit 182, voice output unit 184, loudspeaker 186, and function permission unit 188, already described as components of the first embodiment, have substantially the same functions, so their description is not repeated. The following mainly describes the components that differ: the speaker state information generation unit 310, the voice detection unit 312, the voice recognition unit 314, the phoneme recognition unit 316, and the image editing unit 162.

The speaker state information generation unit 310 generates speaker state information, which is information about the state of the speaker. The speaker state information may indicate the state of the speaker's voice input from the voice input unit 164, for example the voice presence information, utterance content information, and phoneme information described later. It may also indicate the state of the speaker's image input from the imaging unit 160, described in another embodiment, for example face position information, face tilt information, face size information, eye open/close information, and mouth open/close information. The image editing unit 162 can change the target speaker related image used for editing according to this speaker state information. Because editing of the target speaker related image is driven uniformly through the speaker state information, the image editing unit 162 can edit the image input from the imaging unit 160 simply by interpreting the speaker state information.

In this embodiment in particular, speaker state information about changes in the speaker's utterance is generated, and the image editing unit 162 edits the image according to those changes. The operation of the videophone 300 is described below using specific examples of speaker state information.

The voice detection unit 312 determines whether the speaker's voice is present and notifies the speaker state information generation unit 310 when voice at or above a predetermined level (threshold) is detected. Several such levels may be provided, so that the presence of voice is detected in steps, which also gives a rough measure of the volume. The speaker state information generation unit 310 generates speaker state information that includes voice presence information, that is, information about the presence or absence of the detected voice, such as a flag indicating whether voice is present or a value expressing the voice amplitude on a predetermined number of steps. The image editing unit 162 determines from the voice presence information whether the speaker's voice is present and changes the target speaker related image used for editing accordingly.
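
The stepwise detection with several predetermined levels could, for instance, be expressed as a simple quantisation of the amplitude. The sketch below is an assumption-laden illustration: the function name `voice_level` and the example threshold values are not taken from the specification.

```python
from bisect import bisect_right
from typing import Sequence


def voice_level(amplitude: float, thresholds: Sequence[float]) -> int:
    """Return 0 for "no voice", otherwise the number of thresholds the amplitude has reached.

    An amplitude equal to a threshold counts as having reached it
    ("at or above a predetermined level").
    """
    return bisect_right(sorted(thresholds), amplitude)
```

With a single threshold the result reduces to a presence flag (0 or 1); with several thresholds it gives the rough volume step that can be carried in the voice presence information.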

Changing the image according to the presence or absence of voice in this way makes it possible, when voice is present (that is, while the speaker is imitating), to emphasize through the change in the image related to the target speaker that an imitation is being performed. Even when no voice is present, displaying another image related to the target speaker keeps the other party from becoming bored and sustains his or her interest.

For example, according to the voice presence information, the image editing unit 162 may use the speaker's own image when no voice is present and superimpose the image related to the target speaker on the input image when voice is present. Alternatively, it may superimpose a still image of the target speaker when no voice is present and a moving image of the target speaker, for example one in which the mouth opens and closes (a lip-sync image), when voice is present.

FIG. 9 is an explanatory diagram showing how the display unit 150 changes as the mouth opens and closes. Here the whole of the speaker is overwritten with the target speaker, and the opening of the target speaker's mouth is adjusted. For example, the image editing unit 162 switches among a fully closed state 330, a half-open state 332, an open state 334, and so on, according to whether the speaker is speaking. These mouth images may be prepared as any number of still images of the target speaker with different degrees of mouth opening. Alternatively, the overall image of the target speaker may be kept fixed while any number of mouth portions are prepared and written over the target speaker.

FIG. 10 is a timing chart explaining how the voice detection unit 312 determines the presence or absence of voice. In FIG. 10, the unit detects whether the voice amplitude exceeds a predetermined level: "no voice" is determined in the region 350 where the amplitude is below the level, and "voice present" in the region 352 where it is at or above the level. The criterion need not be amplitude alone; the unit may also detect from the waveform whether the sound is a human voice and determine "voice present" only when it is.

If the comparison of the voice amplitude with the predetermined level is applied strictly, the target speaker shown on the display unit 150 changes frequently as the amplitude fluctuates. To avoid this, the voice detection unit 312 may enforce a minimum duration for each image change or apply hysteresis. For example, once the amplitude reaches the predetermined level, "voice present" is held for a predetermined time regardless of fluctuations in amplitude, and if the level is reached again during that time, the predetermined time is counted afresh from that moment. This resolves the problem of the image on the display unit 150 changing too frequently.
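
The hold behaviour described above (keep "voice present" for a predetermined time and restart the count whenever the level is reached again) might be sketched as follows. The class name `VoiceHold`, its parameters, and the use of caller-supplied timestamps are assumptions made for the illustration.

```python
from typing import Optional


class VoiceHold:
    """Hold "voice present" for hold_time seconds after the level was last reached."""

    def __init__(self, threshold: float, hold_time: float) -> None:
        self.threshold = threshold
        self.hold_time = hold_time
        self._expires_at: Optional[float] = None  # end of the current hold period

    def update(self, amplitude: float, now: float) -> bool:
        """Feed one amplitude sample taken at time `now`; return True for "voice present"."""
        if amplitude >= self.threshold:
            # Level reached: start the hold period, or restart it if already running.
            self._expires_at = now + self.hold_time
        return self._expires_at is not None and now < self._expires_at
```

Brief dips below the threshold inside the hold period therefore do not toggle the displayed image; only a silence longer than the hold time returns the state to "no voice".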

The voice presence determination by the voice detection unit 312 may operate only while the imitation switch 174 is active. That is, the speaker starts voice quality conversion by pressing the imitation switch 174, and the image is changed to the target speaker only while the speaker utters voice at or above the predetermined level. When the imitation switch 174 is inactive, neither the image nor the voice is converted. This saves the speaker the trouble of pressing the imitation switch 174 for every utterance.

The image editing unit 162 need not edit the image only when "voice present" is determined; it may also edit the image when "no voice" is determined. For example, in the "voice present" case the face image of the target speaker may be superimposed at the position of the speaker's face, while in the "no voice" case the target speaker may be made to wander around the speaker's face or wave in a friendly manner. In other words, even in the "no voice" case, an image linked to the voice being imitated can be presented.

Thus, even when no voice is present, superimposing another image keeps the other party from becoming bored and sustains his or her interest. It also gives the other party advance notice of whose imitation is about to be performed, and makes it possible to emphasize the image related to the target speaker when voice is present, that is, while the speaker is imitating.

The voice recognition unit 314 extracts information about the semantic content (linguistic information) contained in the speaker's voice (speech waveform) input from the voice input unit 164 and recognizes that content. For example, it compares the speaker's utterance with preset words or sentences such as keywords and, when they match, notifies the speaker state information generation unit 310. The speaker state information generation unit 310 generates speaker state information that includes utterance content information, that is, information about the recognized utterance, such as a keyword obtained by converting the uttered voice into a character string or an identifier associated with the keyword. The image editing unit 162 recognizes from the utterance content information that a keyword has matched and superimposes the image related to the target speaker that is linked to that keyword. For example, if the target speaker uses a keyword together with a signature pose, an image showing that pose is superimposed when the keyword is spoken.
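
Assuming the speech recognizer has already produced a text string, the keyword comparison and the link from keyword to image could look roughly like this. The function `match_keyword`, the substring comparison, and the image identifiers are all hypothetical; the specification does not prescribe how the comparison is made.

```python
from typing import Dict, Optional


def match_keyword(recognized_text: str, keyword_images: Dict[str, str]) -> Optional[str]:
    """Return the image identifier linked to the first preset keyword found in the text."""
    for keyword, image_id in keyword_images.items():
        if keyword in recognized_text:
            return image_id  # utterance content information: identifier tied to the keyword
    return None  # no preset keyword was spoken
```

The returned identifier plays the role of the utterance content information that the image editing unit 162 would use to pick the pose image.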

The voice recognition unit 314 may also recognize sounds associated with the speaker's actions, such as a sneeze, laughter, a yawn, a tongue click, teeth grinding, or a whistle, and pass that information to the speaker state information generation unit 310. In this case, the image editing unit 162 superimposes an image related to the target speaker that corresponds to the speaker's action, for example a moving image of the target speaker sneezing. The speaker's state can thereby be conveyed to the other party in real time while the imitation is maintained.

The phoneme recognition unit 316 performs speech recognition on the speaker's voice input from the voice input unit 164 to identify its phonemes. For example, it determines the type of each speech sound, that is, whether it is a consonant or a vowel, and notifies the speaker state information generation unit 310. The speaker state information generation unit 310 generates speaker state information that includes utterance content information about the recognized utterance, such as a flag indicating whether the sound is a consonant or the phoneme rendered as a character. From this utterance content information, the image editing unit 162 can adjust the degree of mouth opening in the image related to the target speaker according to whether the current sound is a consonant or a vowel and which consonant or vowel it is; it can also adjust the mouth opening according to the loudness of the voice. The mouth opening images have already been described with reference to FIG. 9 and are not described again here.
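
One possible way to turn a recognized phoneme into one of the mouth states of FIG. 9 is a lookup table, sketched below. The particular assignment of vowels and consonants to states is purely an assumption for illustration (the specification only says that the opening depends on the phoneme), and the loudness-based adjustment is omitted.

```python
# Reference numerals of the mouth states shown in FIG. 9.
CLOSED, HALF_OPEN, OPEN = 330, 332, 334

# Assumed mapping; not taken from the specification.
ASSUMED_VOWEL_STATES = {"a": OPEN, "o": OPEN, "e": HALF_OPEN, "i": HALF_OPEN, "u": HALF_OPEN}
ASSUMED_CLOSED_CONSONANTS = {"m", "b", "p"}  # sounds made with the lips together


def mouth_state(phoneme: str) -> int:
    """Choose the mouth image state for one recognized phoneme."""
    if phoneme in ASSUMED_VOWEL_STATES:
        return ASSUMED_VOWEL_STATES[phoneme]
    if phoneme in ASSUMED_CLOSED_CONSONANTS:
        return CLOSED
    return HALF_OPEN  # any other consonant
```

Applied to a phoneme sequence, the function yields the sequence of mouth images to superimpose on the target speaker.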

By reflecting the movement of the speaker's mouth in the target speaker's image in this way, an image in which the target speaker appears to be actually speaking can be transmitted to the other party, and the imitation can be expressed more dynamically.

As described above, the image editing unit 162 associates arbitrary sounds with images in advance and, when such a sound is detected by the voice detection unit 312, the voice recognition unit 314, or the phoneme recognition unit 316, edits the image using the associated image. With this configuration, an image suited to the speaker's utterance can be transmitted to the other party, and the imitation can be expressed dynamically.

(Third Embodiment)
The third embodiment describes in detail an operation, used in addition to or independently of the operations of the first and/or second embodiments, that converts the image when the speaker visibly moves.

(Videophone 400)
FIG. 11 is a functional block diagram showing a schematic configuration of the videophone 400 in the third embodiment. The videophone 400 includes a central control unit 146, a telephone data storage unit 148, a display unit 150, a speaker selection switch 152, a speaker selection unit 154, a data selection unit 156, an imaging unit 160, a speaker state information generation unit 310, an image recognition unit 410, an image editing unit 162, a voice input unit 164, a voice quality conversion unit 166, a telephone transmission unit 168, a modulation transmission unit 170, an antenna unit 172, an imitation switch 174, a reception demodulation unit 176, a receiving unit 180, an image display unit 182, a voice output unit 184, a loudspeaker 186, and a function permission unit 188.

The central control unit 146, telephone data storage unit 148, display unit 150, speaker selection switch 152, speaker selection unit 154, data selection unit 156, imaging unit 160, speaker state information generation unit 310, voice input unit 164, voice quality conversion unit 166, telephone transmission unit 168, modulation transmission unit 170, antenna unit 172, imitation switch 174, reception demodulation unit 176, receiving unit 180, image display unit 182, voice output unit 184, loudspeaker 186, and function permission unit 188, already described as components of the first embodiment, have substantially the same functions, so their description is not repeated. The following mainly describes the components that differ: the image recognition unit 410 and the image editing unit 162.

The image recognition unit 410 performs image recognition on the speaker's image input from the imaging unit 160 to determine its state, and passes the result to the speaker state information generation unit 310. The speaker state information generation unit 310 generates speaker state information that includes the recognized image state, and the image editing unit 162 changes the target speaker related image used for editing according to that information. Image recognition here means performing recognition processing on the image input from the imaging unit 160 to recognize its meaning and content. Using existing image recognition techniques, the unit recognizes the coordinates, within the captured image, of characteristic parts such as the speaker's face and the eyes and mouth within it, and can thereby track changes such as movement left, right, up, and down, blinking, and opening and closing of the mouth.

With this configuration, the image editing unit 162 can change the target speaker related image in step with the speaker's movements, in particular the actual movements of the face. The speaker can thus personally imitate the target speaker visually and dynamically, and can convey to the other party with greater emphasis that the target speaker is being imitated. The operation of the videophone 400 is described below using specific examples of speaker state information.

When the image recognition unit 410 recognizes that the speaker's face is present, it detects the position of the face. The speaker state information generation unit 310 generates speaker state information that includes face position information, that is, information about the detected face position, such as absolute or relative coordinates within the display unit 150 or absolute or relative coordinates with respect to the speaker's image. The image editing unit 162 can then superimpose the target speaker related image at the face position indicated by the face position information. With this configuration, the speaker can change the position of the target speaker related image transmitted to the other party by moving his or her face.

When the image recognition unit 410 recognizes that the speaker's face is present, it may also detect the tilt of the face. The speaker state information generation unit 310 generates speaker state information that includes face tilt information, that is, information about the detected tilt, such as an absolute angle with respect to the display unit 150 or a relative angle with respect to the speaker, and the image editing unit 162 may rotate the target speaker related image according to the face tilt information before superimposing it. The tilt of the face may be calculated from the positions of the speaker's two eyes. The gaze direction and blinking may also be changed in step with the speaker. With this configuration, the speaker can rotate the target speaker related image transmitted to the other party by changing the tilt of his or her face.
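
Calculating the tilt from the positions of the two eyes can be done with a single arctangent, as in the following sketch. The function name `face_tilt_degrees`, the pixel coordinate convention (x to the right, y downward), and the choice of degrees are assumptions for the illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in image pixels, y increasing downward


def face_tilt_degrees(left_eye: Point, right_eye: Point) -> float:
    """Angle of the line from the left-hand eye to the right-hand eye, in degrees.

    0 means the eyes are level; with y increasing downward, a positive value
    means the right-hand eye is lower in the image than the left-hand eye.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))
```

The resulting angle could serve as the face tilt information by which the target speaker related image is rotated before it is superimposed.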

画像認識部４１０は、発話者の顔があると認識した場合に、該顔の大きさを検出し、発話者状態情報生成部３１０は、検出した顔の大きさに係る情報、例えば、表示部１５０に対して発話者の顔が占める面積や表示部１５０全体に対する面積比といった顔サイズ情報を含めて発話者状態情報を生成し、画像編集部１６２は、顔サイズ情報に応じて、目標話者関連画像を拡大もしくは縮小するとしてもよい。かかる構成により、発話者は、自己の顔の大きさ、即ち、撮像部１６０との距離を変えることによって、通話相手に送信する目標話者関連画像の大きさを変化させることができる。 When the image recognition unit 410 recognizes that the speaker's face is present, it may detect the size of the face. The speaker state information generation unit 310 generates speaker state information that includes face size information, that is, information on the detected face size such as the area the speaker's face occupies on the display unit 150 or its ratio to the area of the entire display unit 150, and the image editing unit 162 may enlarge or reduce the target speaker-related image according to the face size information. With this configuration, the speaker can change the size of the target speaker-related image transmitted to the other party by changing the apparent size of his or her face, that is, the distance to the imaging unit 160.
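
As a hedged illustration of how face size information might drive enlargement or reduction, the sketch below derives a linear scale factor from a face area. The function name `overlay_scale` and the idea of comparing against a reference area are assumptions; the description only states that the image is enlarged or reduced according to the face size information.

```python
import math


def overlay_scale(face_area, reference_area):
    """Return a linear scale factor for the target speaker-related image.

    Hypothetical helper: areas are in square pixels. Because area grows
    with the square of the linear size, the square root of the area
    ratio is used as the scale for width and height.
    """
    if face_area <= 0 or reference_area <= 0:
        raise ValueError("areas must be positive")
    return math.sqrt(face_area / reference_area)
```

For example, a face occupying four times the reference area would, under this assumption, double the width and height of the overlaid image.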

画像認識部４１０は、発話者の顔があると認識した場合に、該発話者の目の開閉を検出し、発話者状態情報生成部３１０は、検出した目の開閉に係る情報、例えば、目の開閉をフラグで示したものや目の開閉度合いを所定段階の数値で示した目開閉情報を含めて発話者状態情報を生成し、画像編集部１６２は、目開閉情報に応じて、目標話者関連画像における目を開閉させるとしてもよい。このような発話者の目の開閉動作を目標話者の画像に反映する構成により、目標話者が実際に瞬きしているかのような画像を通話相手に送信することができ、より動的にものまねを表現することが可能となる。 When the image recognition unit 410 recognizes that the speaker's face is present, it may detect the opening and closing of the speaker's eyes. The speaker state information generation unit 310 generates speaker state information that includes eye opening/closing information, for example a flag indicating whether the eyes are open or closed, or a numerical value expressing the degree of opening in a predetermined number of levels, and the image editing unit 162 may open and close the eyes in the target speaker-related image according to the eye opening/closing information. By reflecting the speaker's eye movements in the target speaker's image in this way, an image that looks as if the target speaker were actually blinking can be transmitted to the other party, and the imitation can be expressed more dynamically.

画像認識部４１０は、発話者の顔があると認識した場合に、該発話者の口の開閉を検出し、発話者状態情報生成部３１０は、検出した口の開閉に係る情報、例えば、口の開閉をフラグで示したものや口の開閉度合いを所定段階の数値で示した口開閉情報を含めて発話者状態情報を生成し、画像編集部１６２は、口開閉情報に応じて、目標話者関連画像における口を開閉させるとしてもよい。このような発話者の口元の開閉動作を目標話者の画像に反映する構成により、目標話者が実際に話しているかのような画像を通話相手に送信することができ、より動的にものまねを表現することが可能となる。 When the image recognition unit 410 recognizes that the speaker's face is present, it may detect the opening and closing of the speaker's mouth. The speaker state information generation unit 310 generates speaker state information that includes mouth opening/closing information, for example a flag indicating whether the mouth is open or closed, or a numerical value expressing the degree of opening in a predetermined number of levels, and the image editing unit 162 may open and close the mouth in the target speaker-related image according to the mouth opening/closing information. By reflecting the speaker's mouth movements in the target speaker's image in this way, an image that looks as if the target speaker were actually speaking can be transmitted to the other party, and the imitation can be expressed more dynamically.
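
The "numerical value in a predetermined number of levels" used for the eye and mouth opening/closing information can be pictured with the following Python sketch. The names `opening_level` and `pick_mouth_frame`, the normalized openness range of 0.0 to 1.0, and the idea of holding one pre-drawn image per level are assumptions for illustration, not details given in the description.

```python
def opening_level(openness, levels=4):
    """Quantize an openness value in [0.0, 1.0] to a level 0..levels-1.

    Hypothetical helper: values outside the range are clamped first.
    """
    clamped = min(max(openness, 0.0), 1.0)
    # An openness of exactly 1.0 would index one past the end, so cap it.
    return min(int(clamped * levels), levels - 1)


def pick_mouth_frame(frames, openness):
    """Choose the target speaker image variant matching the speaker's mouth.

    `frames` is assumed to be ordered from fully closed to fully open.
    """
    return frames[opening_level(openness, len(frames))]
```

With two frames this reduces to the open/closed flag case mentioned above; more frames give a finer degree of opening.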

かかる構成により、発話者の顔の細かい動作まで目標話者に対応させて通話相手に送信することができ、より動的にものまねを表現することが可能となる。 With such a configuration, even fine movements of the speaker's face can be mapped onto the target speaker and transmitted to the other party, so that the imitation can be expressed more dynamically.

（第４の実施形態） (Fourth embodiment)
第４の実施形態では、発話者と目標話者との間の声質の変換に関するさらなる特徴を述べる。 In the fourth embodiment, further features relating to voice quality conversion between the speaker and the target speaker are described.

（声質変換部１６６） (Voice quality conversion unit 166)
発話者の声質を目標話者の声質に変換する場合、通常、特定の発話者から特定の目標話者に変換する変換関数として声質変換フィルタ１９４が利用される。ここで、声質変換フィルタ１９４は、発話者および目標話者の音声を収録、蓄積し、発話者および目標話者の音声の対応関係をこの蓄積された音声から学習する学習機能を伴って、更新されるとしてもよい。 When the voice quality of a speaker is converted into that of a target speaker, a voice quality conversion filter 194 is normally used as a conversion function from a specific speaker to a specific target speaker. Here, the voice quality conversion filter 194 may be updated by a learning function that records and accumulates the speech of the speaker and the target speaker and learns the correspondence between the two from the accumulated speech.

図１２は、声質の変換に利用される声質変換フィルタ１９４を説明するための説明図である。図１２においては、Ｍ人（Ｍは整数）の発話者１１０Ａ、１１０Ｂ、１１０Ｃの声質をＮ人（Ｎは整数）の目標話者４５０Ａ、４５０Ｂ、４５０Ｃの声質に変化させる声質変換フィルタ４５２が記されている。かかる図を参照して分かるように、全ての声質変換を網羅するためには、Ｍ×Ｎの声質変換フィルタｆ_ＡＡ、ｆ_ＡＢ、ｆ_ＡＣ、ｆ_ＢＡ、ｆ_ＢＢ、ｆ_ＢＣ、ｆ_ＣＡ、ｆ_ＣＢ、ｆ_ＣＣが必要となる。 FIG. 12 is an explanatory diagram for explaining the voice quality conversion filter 194 used for voice quality conversion. FIG. 12 shows voice quality conversion filters 452 that change the voice quality of M speakers 110A, 110B, 110C (M is an integer) into the voice quality of N target speakers 450A, 450B, 450C (N is an integer). As the figure shows, covering all voice quality conversions requires M × N voice quality conversion filters: f_AA, f_AB, f_AC, f_BA, f_BB, f_BC, f_CA, f_CB, and f_CC.

また、このような声質変換を実現しようとした場合、発話者の声質と目標話者の声質との組み合わせによる固有の変換関数を上述したＭ×Ｎ分生成しなくてはならない。従って、その準備には時間がかかり、任意の目標話者を気軽に選択することができない。 Moreover, realizing such voice quality conversion requires generating the M × N unique conversion functions described above, one for each combination of a speaker's voice quality and a target speaker's voice quality. The preparation therefore takes time, and an arbitrary target speaker cannot be selected casually.

本実施形態における声質変換部１６６は、上記のような発話者の声質から目標話者の声質への直接的な声質変換フィルタ１９４ではなく、その間に、共通に設けられた話者である中間話者の音声を中継した２段階の声質変換フィルタの構成をとることができる。即ち、声質変換部１６６は、個々の発話者の声質を共通の中間話者の声質に変換するための第１声質変換フィルタと、中間話者の声質を個々の目標話者の声質に変換するための第２声質変換フィルタとを用いて声質を変換する。 Instead of the direct voice quality conversion filter 194 from the speaker's voice quality to the target speaker's voice quality described above, the voice quality conversion unit 166 of the present embodiment can adopt a two-stage voice quality conversion filter configuration that relays the conversion through the voice of an intermediate speaker, a speaker provided in common between the two. That is, the voice quality conversion unit 166 converts voice quality using a first voice quality conversion filter, which converts the voice quality of each individual speaker into that of the common intermediate speaker, and a second voice quality conversion filter, which converts the voice quality of the intermediate speaker into that of each individual target speaker.

また、データ選択部１５６は、声質変換フィルタ１９４として、第１声質変換フィルタと第２声質変換フィルタとを選択し、声質変換部１６６は、音声入力部から入力された発話者の音声の声質を、選択された第１声質変換フィルタを用いて中間話者の声質に変換し、さらに中間話者の声質を、選択された第２声質変換フィルタを用いて目標話者の声質に変換することができる。 Further, the data selection unit 156 selects the first voice quality conversion filter and the second voice quality conversion filter as the voice quality conversion filter 194. The voice quality conversion unit 166 can then convert the voice quality of the speaker's voice input from the voice input unit into the voice quality of the intermediate speaker using the selected first voice quality conversion filter, and further convert the voice quality of the intermediate speaker into that of the target speaker using the selected second voice quality conversion filter.
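
The two-stage conversion described above amounts to applying two functions in sequence. The following Python sketch shows only that structure; the representation of a voice as a list of feature values, the type aliases `Features` and `Filter`, and the function `convert_two_stage` are assumptions for illustration, since the description does not specify the internal form of the filters.

```python
from typing import Callable, List

# Assumed representation: one frame of speech as a list of feature values.
Features = List[float]
Filter = Callable[[Features], Features]


def convert_two_stage(voice: Features, first: Filter, second: Filter) -> Features:
    """Convert speaker -> intermediate speaker -> target speaker."""
    intermediate = first(voice)   # first voice quality conversion filter
    return second(intermediate)   # second voice quality conversion filter


def toy_first(v: Features) -> Features:
    """Toy stand-in for a learned first filter: shift every feature by one."""
    return [x + 1.0 for x in v]


def toy_second(v: Features) -> Features:
    """Toy stand-in for a learned second filter: double every feature."""
    return [x * 2.0 for x in v]
```

Swapping the target speaker only changes the `second` argument; the speaker-specific `first` filter stays the same.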

ここで、中間話者は、人もしくはＴＴＳ（Ｔｅｘｔ−ｔｏ−Ｓｐｅｅｃｈ）とすることができ、第２声質変換フィルタを生成、提供するサービス提供者によって構築される。かかる技術の基本的な概念は、本件出願人による特願２００５―３４９７５４号の技術内容を参酌することができる。 Here, the intermediate speaker can be a person or TTS (Text-to-Speech), and is constructed by a service provider that generates and provides a second voice quality conversion filter. For the basic concept of such technology, the technical content of Japanese Patent Application No. 2005-349754 by the present applicant can be referred to.

図１３は、第１声質変換フィルタ４６２と第２声質変換フィルタ４６４による声質変換を説明するための説明図である。図１３においては、Ｍ人（Ｍは整数）の発話者１１０Ａ、１１０Ｂ、１１０ＣとＮ人（Ｎは整数）の目標話者４５０Ａ、４５０Ｂ、４５０Ｃとの間に、中間話者４６０を設け、発話者１１０Ａ、１１０Ｂ、１１０Ｃの声質は、一旦中間話者４６０の声質に変換された後、目標話者４５０Ａ、４５０Ｂ、４５０Ｃの声質に変換される。 FIG. 13 is an explanatory diagram for explaining voice quality conversion by the first voice quality conversion filter 462 and the second voice quality conversion filter 464. In FIG. 13, intermediate speakers 460 are provided between M (M is an integer) speakers 110A, 110B, 110C and N (N is an integer) target speakers 450A, 450B, 450C. The voice quality of the speakers 110A, 110B, and 110C is once converted to the voice quality of the intermediate speaker 460 and then converted to the voice quality of the target speakers 450A, 450B, and 450C.

従って、声質変換を実現するために、発話者１１０Ａ、１１０Ｂ、１１０Ｃの声質から中間話者の声質へのＭ個の第１声質変換フィルタｆ_ＡＭ、ｆ_ＢＭ、ｆ_ＣＭと、中間話者の声質から目標話者４５０Ａ、４５０Ｂ、４５０Ｃの声質へのＮ個の第２声質変換フィルタｆ_ＭＡ、ｆ_ＭＢ、ｆ_ＭＣとを準備するだけで済み、全ての声質変換を網羅するために、Ｍ＋Ｎ個の声質変換フィルタのみで足りる。従って、低制作コスト化や記憶容量の最小化を図ることが可能となる。また、当該第１、第２声質変換フィルタが学習機能を伴う場合、発話者の所持するテレビ電話機では、第１声質変換フィルタの学習機能を担保すればよく、学習負担も軽減される。 Therefore, to realize voice quality conversion, it is only necessary to prepare M first voice quality conversion filters f_AM, f_BM, f_CM, from the voice quality of the speakers 110A, 110B, 110C to that of the intermediate speaker, and N second voice quality conversion filters f_MA, f_MB, f_MC, from the voice quality of the intermediate speaker to that of the target speakers 450A, 450B, 450C. Only M + N voice quality conversion filters are thus needed to cover all voice quality conversions, which lowers production cost and minimizes the required storage capacity. In addition, when the first and second voice quality conversion filters have a learning function, the video phone owned by the speaker only needs to support the learning function of the first voice quality conversion filter, which also reduces the learning burden.
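
The difference between the M × N direct filters and the M + N two-stage filters can be checked with a short Python sketch. The function names and the string labels for filters are illustrative assumptions; the labels follow the f_AM / f_MA naming used above.

```python
def direct_filter_names(speakers, targets):
    """One direct filter per (speaker, target) pair: M x N in total."""
    return [f"f_{s}{t}" for s in speakers for t in targets]


def two_stage_filter_names(speakers, targets, intermediate="M"):
    """M first filters plus N second filters: M + N in total."""
    first = [f"f_{s}{intermediate}" for s in speakers]
    second = [f"f_{intermediate}{t}" for t in targets]
    return first + second
```

For the three speakers and three targets of FIG. 12 and FIG. 13 this gives 9 direct filters against 6 two-stage filters, and the gap widens quickly: 100 speakers and 50 targets would need 5000 direct filters but only 150 two-stage filters.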

かかる中間話者を介した２段階の声質変換フィルタ構成により、発話者は、一度、第１声質変換フィルタ４６２を準備すると、目標話者を変更する度に声質変換フィルタを生成する必要がなくなる。これは、共通な中間話者４６０を声質変換元とした多数の第２声質変換フィルタ４６４を利用することができるからである。従って、任意の目標話者の第２声質変換フィルタ４６４さえダウンロードすれば、第１声質変換フィルタ４６２と合わせて直ぐにものまねに適用することが可能となる。 With this two-stage voice quality conversion filter configuration via the intermediate speaker, once the speaker has prepared the first voice quality conversion filter 462, there is no need to generate a voice quality conversion filter every time the target speaker is changed. This is because a large number of second voice quality conversion filters 464 that take the common intermediate speaker 460 as their conversion source can be used. Therefore, simply by downloading the second voice quality conversion filter 464 of any target speaker, the speaker can combine it with the first voice quality conversion filter 462 and apply it to imitation immediately.

また、このような発話者に第２声質変換フィルタ４６４を提供するサービス提供者は、発話者毎に声質変換フィルタを準備する必要がなくなり、目標話者毎に少なくとも１つの第２声質変換フィルタ４６４を準備するだけで、その第２声質変換フィルタ４６４を複数の発話者のものまねに適用することができる。従って、低コストで効率の良いシステムを築くことができ、少ない負荷で、発話者と目標話者の多数のパターンを形成することが可能となる。 In addition, a service provider that supplies the second voice quality conversion filter 464 to such speakers no longer needs to prepare a voice quality conversion filter for each speaker. By preparing at least one second voice quality conversion filter 464 per target speaker, the provider can apply that filter to imitation by a plurality of speakers. An efficient, low-cost system can therefore be built, and a large number of speaker and target speaker combinations can be formed with a small load.

上記第１声質変換フィルタ４６２は、発話者１１０自身の音声を事前に登録し、素片単位で中間話者４６０の音声と対応付けて作成されるとしてもよい。このフィルタ作成機能は、テレビ電話機自体に設けられてもよいし、サーバ１４０等別体の装置に設けられてもよく、サーバ１４０で生成される場合、インターネット等の通信網、無線通信、赤外線通信、記録媒体を介してテレビ電話機にダウンロードされるとしてもよい。 The first voice quality conversion filter 462 may be created by registering the voice of the speaker 110 in advance and associating it with the voice of the intermediate speaker 460 in units of speech segments. This filter creation function may be provided in the video phone itself or in a separate device such as the server 140. When the filter is generated by the server 140, it may be downloaded to the video phone via a communication network such as the Internet, wireless communication, infrared communication, or a recording medium.

（サーバ１４０） (Server 140)
また、第１声質変換フィルタ４６２または第２声質変換フィルタ４６４は、目標話者に関連する画像１９２と共に、上述した声質変換・画像編集サービス提供システムに用いられるサーバ１４０から自由にダウンロードすることができる。 The first voice quality conversion filter 462 or the second voice quality conversion filter 464 can be freely downloaded, together with the image 192 related to the target speaker, from the server 140 used in the voice quality conversion / image editing service providing system described above.

図１４は、第４の実施形態におけるサーバ１４０の概略的な構成を示した機能ブロック図である。かかるサーバ１４０は、サーバデータ記憶部４８０と、サーバ送信部４８２とを含んで構成される。 FIG. 14 is a functional block diagram illustrating a schematic configuration of the server 140 according to the fourth embodiment. The server 140 includes a server data storage unit 480 and a server transmission unit 482.

上記サーバデータ記憶部４８０は、目標話者に関連する画像１９２と、発話者の声質を目標話者の声質に変換する声質変換フィルタ１９４とを記憶している。かかる目標話者に関連する画像１９２および声質変換フィルタ１９４はパッケージデータとして一体に記憶されてもよい。 The server data storage unit 480 stores an image 192 related to the target speaker and a voice quality conversion filter 194 that converts the voice quality of the speaker into the voice quality of the target speaker. The image 192 and the voice quality conversion filter 194 related to the target speaker may be integrally stored as package data.

また、声質変換フィルタ１９４が、上述したように、中間話者を含んだ２段階の声質変換フィルタとしての第１声質変換フィルタ４６２と、第２声質変換フィルタ４６４とからなる場合、サーバデータ記憶部４８０は、第１声質変換フィルタ４６２または第２声質変換フィルタ４６４のいずれか一方または両方を記憶するとしてもよい。 Further, when the voice quality conversion filter 194 includes the first voice quality conversion filter 462 and the second voice quality conversion filter 464 as the two-stage voice quality conversion filter including the intermediate speaker as described above, the server data storage unit 480 may store either one or both of the first voice quality conversion filter 462 and the second voice quality conversion filter 464.

上記サーバ送信部４８２は、発話者の要求に応じて、サーバデータ記憶部４８０に記憶された、発話者が所望する目標話者に関連する画像１９２と、声質変換フィルタ１９４とをテレビ電話機１００に送信する。ここで、サーバ送信部４８２は、上述したパッケージデータ単位でテレビ電話機１００に送信するとしてもよい。 In response to a request from the speaker, the server transmission unit 482 transmits to the video phone 100 the image 192 related to the target speaker desired by the speaker and the voice quality conversion filter 194, both stored in the server data storage unit 480. Here, the server transmission unit 482 may transmit them to the video phone 100 in units of the package data described above.
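
The relationship between the server data storage unit 480, the package data, and the server transmission unit 482 can be pictured with the following Python sketch. All class, method, and field names (`Package`, `ServerDataStore`, `store`, `fetch`) are hypothetical, and the image and filter are treated as opaque byte strings because the description does not define their formats.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Package:
    """Package data: target speaker image and filter stored as one unit."""
    target_speaker: str
    image: bytes          # image 192 related to the target speaker
    voice_filter: bytes   # voice quality conversion filter 194


class ServerDataStore:
    """Hypothetical stand-in for the server data storage unit 480."""

    def __init__(self) -> None:
        self._packages: Dict[str, Package] = {}

    def store(self, package: Package) -> None:
        self._packages[package.target_speaker] = package

    def fetch(self, target_speaker: str) -> Package:
        """What a transmission unit would send for a speaker's request.

        Raises KeyError if no package exists for the requested target.
        """
        return self._packages[target_speaker]
```

Keeping the image and filter in one record is one simple way to guarantee that the video phone always receives a matching pair.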

また、声質変換フィルタ１９４が、第１声質変換フィルタ４６２と、第２声質変換フィルタ４６４とからなる場合、サーバ送信部４８２は、第１声質変換フィルタ４６２または第２声質変換フィルタ４６４のいずれか一方または両方をサーバデータ記憶部４８０から読み出して、テレビ電話１００に送信する。 Further, when the voice quality conversion filter 194 consists of the first voice quality conversion filter 462 and the second voice quality conversion filter 464, the server transmission unit 482 reads either one or both of the first voice quality conversion filter 462 and the second voice quality conversion filter 464 from the server data storage unit 480 and transmits them to the videophone 100.

テレビ電話１００において、例えば、発話者が固定されており、電話機データ記憶部１４８に記憶された第１声質変換フィルタ４６２が予め指定されている場合は、この指定された第１声質変換フィルタ４６２を使用し、データ選択部１５６は、第２声質変換フィルタ４６４をデータ記憶部１４８から選択する。声質変換部１６６は、指定された第１声質変換フィルタ４６２と、第２声質変換フィルタ４６４を用いて、発話者の音声の声質を、話者選択部１５４で選択されている目標話者の声質に変換する。 In the videophone 100, for example, when the speaker is fixed and the first voice quality conversion filter 462 stored in the telephone data storage unit 148 has been designated in advance, the designated first voice quality conversion filter 462 is used, and the data selection unit 156 selects the second voice quality conversion filter 464 from the data storage unit 148. The voice quality conversion unit 166 uses the designated first voice quality conversion filter 462 and the selected second voice quality conversion filter 464 to convert the voice quality of the speaker's voice into the voice quality of the target speaker selected by the speaker selection unit 154.

このようなパッケージデータ等のダウンロードは、上記声質変換・画像編集サービス提供システムを利用した場合に限られず、インターネット等の通信網、無線通信、赤外線通信、記録媒体を介して実行することもでき、さらに、テレビ電話機内において、自由に追加、更新、削除が可能である。ダウンロードしたパッケージデータが複数ある場合は、話者選択部１５４によって選択された目標話者がものまねの対象となる。 Downloading such package data is not limited to the use of the voice quality conversion / image editing service providing system, but can be executed via a communication network such as the Internet, wireless communication, infrared communication, and a recording medium. Furthermore, addition, update, and deletion can be freely performed in the video phone. When there are a plurality of downloaded package data, the target speaker selected by the speaker selection unit 154 becomes the target of imitation.

また、同一の目標話者に対して、音声（声質）は共通だが、画像が異なる複数のパッケージデータが準備されるとしてもよい。例えば、発話者が「おはよう」と発声したとき、一方のパッケージの目標話者に関連する画像は、口を開閉（口パク）し、他方のパッケージでは、お辞儀するといった実施が考えられる。その他にも、目標話者は同じだが、服装、髪型等外観の違う画像が含まれる等、様々な変形例が考えられる。 A plurality of package data items that share the same voice (voice quality) but contain different images may also be prepared for the same target speaker. For example, when the speaker says "Good morning", the image related to the target speaker in one package may open and close its mouth (lip-sync), while the image in another package may bow. Various other modifications are conceivable, such as packages for the same target speaker that contain images differing in appearance, for example in clothing or hairstyle.

（第５の実施形態） (Fifth embodiment)
第５の実施形態では、第４の実施形態における声質変換部１６６に基づく、発話者と目標話者との間の声質の変換に関するさらなる特徴を述べる。 In the fifth embodiment, further features relating to voice quality conversion between the speaker and the target speaker, based on the voice quality conversion unit 166 of the fourth embodiment, are described.

（声質変換部１６６） (Voice quality conversion unit 166)
図１５は、第５の実施形態におけるテレビ電話機５００の概略的な構成を示した機能ブロック図である。かかるテレビ電話機５００は、中央制御部１４６と、電話機データ記憶部１４８と、表示部１５０と、話者選択スイッチ１５２と、話者選択部１５４とデータ選択部１５６と、撮像部１６０と、画像編集部１６２と、音声入力部１６４と、発話種類選択部５１０と、声質変換部１６６と、電話機送信部１６８と、変調送信部１７０と、アンテナ部１７２と、ものまねスイッチ１７４と、受信復調部１７６と、受信部１８０と、画像表示部１８２と、音声出力部１８４と、スピーカ１８６と、機能許可部１８８と、声質変換フィルタ合成部５１２とを含んで構成される。 FIG. 15 is a functional block diagram showing a schematic configuration of a video phone 500 according to the fifth embodiment. The video phone 500 includes a central control unit 146, a telephone data storage unit 148, a display unit 150, a speaker selection switch 152, a speaker selection unit 154, a data selection unit 156, an imaging unit 160, an image editing unit 162, a voice input unit 164, an utterance type selection unit 510, a voice quality conversion unit 166, a telephone transmission unit 168, a modulation transmission unit 170, an antenna unit 172, an imitation switch 174, a reception demodulation unit 176, a receiving unit 180, an image display unit 182, an audio output unit 184, a speaker 186, a function permission unit 188, and a voice quality conversion filter synthesis unit 512.

第１の実施形態における構成要素として既に述べた中央制御部１４６と、表示部１５０と、話者選択スイッチ１５２と、話者選択部１５４と、撮像部１６０と、画像編集部１６２と、音声入力部１６４と、電話機送信部１６８と、変調送信部１７０と、アンテナ部１７２と、ものまねスイッチ１７４と、受信復調部１７６と、受信部１８０と、画像表示部１８２と、音声出力部１８４と、スピーカ１８６と、機能許可部１８８とは、実質的に機能が同一なので重複説明を省略し、ここでは、構成が相違する電話機データ記憶部１４８と、発話種類選択部５１０と、声質変換部１６６と、画像編集部１６２と、声質変換フィルタ合成部５１２とを主に説明する。 The central control unit 146, display unit 150, speaker selection switch 152, speaker selection unit 154, imaging unit 160, image editing unit 162, voice input unit 164, telephone transmission unit 168, modulation transmission unit 170, antenna unit 172, imitation switch 174, reception demodulation unit 176, receiving unit 180, image display unit 182, audio output unit 184, speaker 186, and function permission unit 188, already described as components of the first embodiment, have substantially the same functions, so their description is not repeated. The following mainly describes the components whose configuration differs: the telephone data storage unit 148, the utterance type selection unit 510, the voice quality conversion unit 166, the image editing unit 162, and the voice quality conversion filter synthesis unit 512.

上記電話機データ記憶部１４８には、発話者の明瞭な声質と、密やかな声質（ささやき声：ＮＡＭ）とからそれぞれ中間話者の声質に変換する２つの第１声質変換フィルタ５１４が記憶されている。ここで、密やかな声質として定義されるＮＡＭ（Ｎｏｎ−Ａｕｄｉｂｌｅ Ｍｕｒｍｕｒ）は、周囲の人に内容が聴取不能な発話や発話器官のフィルタ特性により調音された声帯振動を伴わない軟部組織伝達の無声呼気音を言う。 The telephone data storage unit 148 stores two first voice quality conversion filters 514, which convert the speaker's clear voice quality and the speaker's quiet voice quality (whisper: NAM), respectively, into the voice quality of the intermediate speaker. Here, NAM (Non-Audible Murmur), which is what is meant by the quiet voice quality, refers to speech whose content cannot be heard by people nearby, or to an unvoiced breath sound conducted through soft tissue that is articulated by the filter characteristics of the speech organs and involves no vocal cord vibration.

上記発話種類選択部５１０は、まず、データ選択部１５６を介して、電話機データ記憶部１４８に記憶されている、発話種類が相異する第１声質変換フィルタ５１４を参照し、発話者の声質として選択することが可能な、第１声質変換フィルタ５１４の発話種類リスト５１６を作成し、表示部１５０に送信する。ここで、発話者が、表示部１５０に表示された発話種類リスト５１６中から、話者選択スイッチ１５２を通じて明瞭な声質か密やかな声質かを選択した場合、話者選択部１５４は、その選択された発話種類をデータ選択部１５６に伝達する。 The utterance type selection unit 510 first refers, via the data selection unit 156, to the first voice quality conversion filters 514 of different utterance types stored in the telephone data storage unit 148, creates an utterance type list 516 of the first voice quality conversion filters 514 that can be selected as the speaker's voice quality, and sends the list to the display unit 150. When the speaker uses the speaker selection switch 152 to select either the clear voice quality or the quiet voice quality from the utterance type list 516 displayed on the display unit 150, the speaker selection unit 154 passes the selected utterance type to the data selection unit 156.

上記データ選択部１５６は、発話種類選択部５１０によって選択された発話種類（発話者の発話状況）に応じて、データ選択部１５６が、発話者の明瞭な声質を中間話者の声質に変換する第１声質変換フィルタ５１４、または、発話者の密やかな声質を中間話者の声質に変換する第１声質変換フィルタ５１４のいずれかを選択し、声質変換部１６６は、その選択された第１声質変換フィルタ５１４を用いて、発話者の音声の声質を目標話者の声質に変換する。 According to the utterance type (the speaker's utterance situation) selected by the utterance type selection unit 510, the data selection unit 156 selects either the first voice quality conversion filter 514 that converts the speaker's clear voice quality into the voice quality of the intermediate speaker or the first voice quality conversion filter 514 that converts the speaker's quiet voice quality into the voice quality of the intermediate speaker. The voice quality conversion unit 166 then uses the selected first voice quality conversion filter 514 to convert the voice quality of the speaker's voice into that of the target speaker.

上述したように発話種類選択部５１０において密やかな発話種類が選択された場合、データ選択部１５６は、第２声質変換フィルタ４６４として、中間話者の声質から発話者本人の明瞭な声質に変換するフィルタが自動的に選択されるとしてもよい。この場合は、発話者も目標話者も自分自身ということになる。 When the quiet utterance type is selected in the utterance type selection unit 510 as described above, the data selection unit 156 may automatically select, as the second voice quality conversion filter 464, a filter that converts the voice quality of the intermediate speaker into the speaker's own clear voice quality. In this case, the speaker and the target speaker are the same person.
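
The selection logic of the last few paragraphs can be summarized in the following Python sketch. The function `select_filters`, the utterance type keys `"clear"` and `"whisper"`, and the dictionary-based filter stores are assumptions for illustration. The automatic fallback to the speaker's own clear voice reflects the optional behavior described above, not a mandatory one.

```python
def select_filters(utterance_type, first_filters, second_filters,
                   target_speaker, own_name):
    """Pick the (first, second) filter pair for one call.

    first_filters:  maps an utterance type to a first filter
                    (speaker's clear or quiet voice -> intermediate speaker).
    second_filters: maps a target name to a second filter
                    (intermediate speaker -> target speaker).
    """
    first = first_filters[utterance_type]
    if utterance_type == "whisper":
        # Optional behavior: a quiet utterance is restored to the
        # speaker's own clear voice, so the target is the speaker.
        target_speaker = own_name
    return first, second_filters[target_speaker]
```

In this sketch the filters themselves are opaque values, so the same function works whether they are stored as objects, file names, or identifiers.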

かかる構成により、通話相手は、発話者のおかれている状況や発話者の発話の大きさに拘わらず、発話者の意図する音声を確実に把握することができ、通話が制限された、例えば、電車の中における発話も通話相手に伝達することが可能となる。 With such a configuration, the other party can reliably understand the speech intended by the speaker regardless of the speaker's situation or the loudness of the speaker's utterance, and even speech uttered where calls are restricted, for example on a train, can be conveyed to the other party.

上記画像編集部１６２は、発話種類選択部５１０において密やかな発話種類が選択された場合、発話者が密やかに発話していることを示す表示画像を重ねることができる。 When the quiet utterance type is selected in the utterance type selection unit 510, the image editing unit 162 can overlay a display image indicating that the speaker is speaking quietly.

図１６は、上記画像編集部１６２の画像の上書きを示した外観図である。ここでは、発話者の画像を残したまま、表示部１５０中の発話者の認識に支障を来さない領域、例えば、発話者の背景画像に、発話者が密やかに発話していることを示す画像、例えば、口に人差し指をあてた内緒話を意味するシンボル５２０を表示している。 FIG. 16 is an external view showing an image overlay by the image editing unit 162. Here, the image of the speaker is retained, and an image indicating that the speaker is speaking quietly, for example a symbol 520 of an index finger held to the lips that signifies a confidential conversation, is displayed in an area of the display unit 150 that does not interfere with recognizing the speaker, for example the background behind the speaker.

発話者が密やかな発話を行っている場合であっても、発話者の密やかな声質を中間話者の声質に変換する第１声質変換フィルタ４６２と、中間話者の声質から発話者の明瞭な声質に変換する第２声質変換フィルタ４６４とを介すことにより、通話相手は、発話者の明瞭な音声を聞くことになる。従って、発話者が発話環境により密やかな声質で発話しているのに、通話相手は、それを把握することができないが、上述したようにシンボル５２０を表示することで、通話相手は、発話者の状況を把握することが可能となる。 Even when the speaker is speaking quietly, the other party hears the speaker's clear voice, because the speech passes through the first voice quality conversion filter 462, which converts the speaker's quiet voice quality into that of the intermediate speaker, and the second voice quality conversion filter 464, which converts the intermediate speaker's voice quality into the speaker's clear voice quality. Consequently, the other party cannot tell from the audio that the speaker is speaking quietly because of the surrounding environment. Displaying the symbol 520 as described above lets the other party grasp the speaker's situation.

上記声質変換フィルタ合成部５１２は、第１声質変換フィルタ４６２と、第２声質変換フィルタ４６４とを合成して、発話者の声質を目標話者の声質に直接変換する合成フィルタを生成する。そして、かかる合成完了後、声質変換部１６６は、その合成された合成フィルタを利用して声質変換を行う。 The voice quality conversion filter synthesis unit 512 synthesizes the first voice quality conversion filter 462 and the second voice quality conversion filter 464 to generate a synthesis filter that directly converts the voice quality of the speaker into the voice quality of the target speaker. Then, after the synthesis is completed, the voice quality conversion unit 166 performs voice quality conversion using the synthesized synthesis filter.

テレビ電話機５００では、第１声質変換フィルタ４６２と第２声質変換フィルタ４６４とを個別にダウンロードしている。しかし、その後は、声質変換フィルタを２段階のまま維持する必要はない。従って、発話の度に２段階の声質変換フィルタを介さず、合成した合成フィルタのみを介すことによって、声質変換にかかる負荷や消費電力を軽減し、声質変換を高速化することが可能となる。かかる声質変換フィルタ合成部５１２は、当然にして第４の実施形態の第１声質変換フィルタ４６２、第２声質変換フィルタ４６４にも適用できる。 In the video phone 500, the first voice quality conversion filter 462 and the second voice quality conversion filter 464 are individually downloaded. However, after that, it is not necessary to maintain the voice quality conversion filter in two stages. Therefore, it is possible to reduce the load and power consumption for voice quality conversion and speed up voice quality conversion by using only the synthesized filter instead of the two-stage voice quality conversion filter for each utterance. . Naturally, the voice quality conversion filter synthesis unit 512 can also be applied to the first voice quality conversion filter 462 and the second voice quality conversion filter 464 of the fourth embodiment.
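
The benefit of synthesizing the two filters can be shown under a simplifying assumption that the description does not make: if each filter were a linear map on a feature vector, represented by a matrix, the synthesized filter would be the single matrix product of the two. The Python sketch below uses plain nested lists; the helper names `matmul`, `apply_filter`, and `synthesize` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_filter(matrix, vector):
    """Apply an (assumed linear) filter to one feature vector."""
    return [sum(row[k] * vector[k] for k in range(len(vector)))
            for row in matrix]


def synthesize(first, second):
    """Return one matrix equivalent to applying `first`, then `second`."""
    # The filter applied first sits on the right of the product.
    return matmul(second, first)
```

After the one-time call to `synthesize`, each utterance needs a single `apply_filter` call instead of two, which is the load reduction described above. Real voice quality conversion filters may well be nonlinear, in which case synthesis would take a different form.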

以上、添付図面を参照しながら本発明の好適な実施形態について説明したが、本発明は係る例に限定されないことは言うまでもない。当業者であれば、特許請求の範囲に記載された範疇内において、各種の変更例または修正例に想到し得ることは明らかであり、それらについても当然に本発明の技術的範囲に属するものと了解される。 Preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but it goes without saying that the present invention is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes and modifications within the scope set out in the claims, and these are naturally understood to belong to the technical scope of the present invention.

例えば、上述した実施形態において、発話者のテレビ電話機に全ての構成要素を設けているが、かかる場合に限らず、通信網を介してアクセスすることが可能なサーバ内にこの構成要素の一部を配置し、通話中このサーバによる音声および画像の変換を通じて、通話相手に伝達されるとしてもよく、また、通話相手のテレビ電話機に上記構成要素を設け、発話者からの音声および画像を通話相手のテレビ電話機において変換することも可能である。 For example, in the embodiment described above, all the components are provided in the video phone of the speaker. However, the present invention is not limited to this, and some of the components are included in a server that can be accessed via the communication network. May be transmitted to the other party through voice and image conversion by this server during a call. Also, the above-mentioned components are provided on the other party's video phone, and the voice and image from the speaker are sent to the other party. It is also possible to convert in a video phone.

また、上述した実施形態においては、機能許可部を当該発話者側のテレビ電話機に設けているが、通話相手側のテレビ電話機に配して、発話者のものまねを通話相手側で制限することも可能である。 In the above-described embodiment, the function permission unit is provided in the video phone on the speaker side. However, the function permitting unit may be arranged on the video phone on the call partner side to limit the imitation of the speaker on the call partner side. Is possible.

また、上述した実施形態においては、発話者の画像および音声によって、目標話者に関連する画像を変更する例を示したが、かかる場合に限らず、例えば、目標話者の特徴的な話し方をキーワードとして検知したとき、その検知に応じて、目標話者に関連する画像を表示し、再度その特徴的な話し方を目標話者の音声で表す、ピンポイントものまねが実施されるとしてもよい。 In the above-described embodiment, an example in which an image related to the target speaker is changed based on the image and sound of the speaker has been described. However, the present invention is not limited to this example. When detected as a keyword, an image related to the target speaker may be displayed in response to the detection, and pinpoint imitation may be performed in which the characteristic way of speaking is represented by the voice of the target speaker again.

さらに、上述した実施形態においては、目標話者として俳優や声優等人間の音声を挙げて説明しているが、かかる場合に限られず、動物の鳴き声や、無生物から発せられる音等様々な音に適応することも可能である。また、テレビ電話機は無線に限らず、有線の回線を介して通信網に接続されていてもよい。 Furthermore, in the above-described embodiments, human voices such as actors and voice actors are described as target speakers, but the present invention is not limited to such cases, and various sounds such as animal calls and sounds generated from inanimate objects are used. It is also possible to adapt. In addition, the video phone is not limited to being wireless, and may be connected to a communication network via a wired line.

なお、本明細書の通話方法における各工程は、必ずしもフローチャートとして記載された順序に沿って時系列に処理する必要はなく、並列的あるいは個別に実行される処理（例えば、並列処理あるいはオブジェクトによる処理）も含むとしても良い。 Note that each step in the calling method of the present specification does not necessarily have to be processed in chronological order according to the order described in the flowchart, but is performed in parallel or individually (for example, parallel processing or object processing). ) May also be included.

本発明は、テレビ電話における声質変換と共に画像を変化させるテレビ電話機、通話方法、プログラム、声質変換・画像編集サービス提供システム、および、サーバに適用可能である。 The present invention can be applied to a video phone, a telephone call method, a program, a voice quality conversion / image editing service providing system, and a server that change an image together with voice quality conversion in a video phone.

FIG. 1: 第１の実施形態におけるテレビ電話機を使用した声質変換・画像編集サービス提供システムを説明するための説明図である。 An explanatory diagram of the voice quality conversion / image editing service providing system using the video phone of the first embodiment.
FIG. 2: テレビ電話機の概略的な構成を示した機能ブロック図である。 A functional block diagram showing a schematic configuration of the video phone.
FIG. 3: テレビ電話機における発話者とのインターフェース配置例を示した外観図である。 An external view showing an example arrangement of the interface with the speaker on the video phone.
FIG. 4: 通話方法の処理の流れを示したフローチャートである。 A flowchart showing the processing flow of the calling method.
FIG. 5: 画像編集ステップ）による表示部の変化を説明した説明図である。 An explanatory diagram of the change of the display unit caused by an image editing step.
FIG. 6: 画像編集ステップ）による表示部の変化を説明した説明図である。 An explanatory diagram of the change of the display unit caused by an image editing step.
FIG. 7: 画像編集ステップ）による表示部の変化を説明した説明図である。 An explanatory diagram of the change of the display unit caused by an image editing step.
FIG. 8: 第２の実施形態におけるテレビ電話機の概略的な構成を示した機能ブロック図である。 A functional block diagram showing a schematic configuration of the video phone of the second embodiment.
FIG. 9: 口パクによる表示部の変化を説明した説明図である。 An explanatory diagram of the change of the display unit caused by lip-sync.
FIG. 10: 音声認識部による音声の判断を説明するためのタイミングチャート図である。 A timing chart for explaining how the voice recognition unit judges speech.
FIG. 11: 第３の実施形態におけるテレビ電話機の概略的な構成を示した機能ブロック図である。 A functional block diagram showing a schematic configuration of the video phone of the third embodiment.
FIG. 12: 声質の変換に利用される声質変換フィルタを説明するための説明図である。 An explanatory diagram of the voice quality conversion filter used for voice quality conversion.
FIG. 13: 第４の実施形態における第１声質変換フィルタと第２声質変換フィルタによる声質変換を説明するための説明図である。 An explanatory diagram of voice quality conversion by the first and second voice quality conversion filters of the fourth embodiment.
FIG. 14: 第４の実施形態におけるサーバの概略的な構成を示した機能ブロック図である。 A functional block diagram showing a schematic configuration of the server of the fourth embodiment.
FIG. 15: 第５の実施形態におけるテレビ電話機の概略的な構成を示した機能ブロック図である。 A functional block diagram showing a schematic configuration of the video phone of the fifth embodiment.
FIG. 16: 上記画像編集部の画像の上書きを示した外観図である。 An external view showing the image overlay by the image editing unit.

Explanation of symbols

100, 300, 400, 500 Video phone
148 Telephone data storage unit
152 Speaker selection switch
154 Speaker selection unit
156 Data selection unit
160 Imaging unit
162 Image editing unit
164 Voice input unit
166 Voice quality conversion unit
168 Telephone transmission unit
170 Modulation transmission unit
174 Imitation switch
176 Reception demodulation unit
180 Receiving unit
188 Function permission unit
192 Image related to the target speaker
194, 452 Voice quality conversion filter
310 Speaker state information generation unit
312 Voice detection unit
314 Speech recognition unit
316 Phoneme recognition unit
410 Image recognition unit
460 Intermediate speaker
480 Server data storage unit
482 Server transmission unit
462, 514 First voice quality conversion filter
464 Second voice quality conversion filter
510 Utterance type selection unit
512 Voice quality conversion filter synthesis unit

Claims

A video phone that has an imaging unit for inputting an image of a speaker and a voice input unit for inputting the voice of the speaker, and that converts the voice quality of the speaker's voice into the voice quality of a target speaker, the video phone comprising:
A telephone data storage unit that prestores an image associated with the target speaker and a voice quality conversion filter that converts the voice quality of the speaker into the voice quality of the target speaker;
A speaker selection unit for selecting the target speaker;
A data selection unit for selecting, from the telephone data storage unit, a target speaker-related image, which is an image related to the target speaker selected by the speaker selection unit, and a voice quality conversion filter corresponding to the target speaker;
An image editing unit that edits an image input from the imaging unit based on the target speaker-related image;
A voice quality conversion unit that converts the voice quality of the voice of the speaker input from the voice input unit to the voice quality of the target speaker using the selected voice quality conversion filter; and
A telephone transmission unit that transmits the image edited by the image editing unit and the voice converted by the voice quality conversion unit to a call partner.
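
The data flow of the claim above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the names `TargetSpeakerData`, `edit_image`, `convert_voice`, and `build_outgoing_frame` are hypothetical, an image is modeled as a dictionary of named layers, and a voice quality conversion filter is modeled as a plain function on a list of samples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: an image is a dict of layers, audio is a list of samples.
Image = Dict[str, str]
Audio = List[float]


@dataclass
class TargetSpeakerData:
    """One entry of the telephone data storage unit (assumed layout)."""
    related_image: str                       # image associated with the target speaker
    voice_filter: Callable[[Audio], Audio]   # voice quality conversion filter


def edit_image(captured: Image, related_image: str) -> Image:
    """Image editing unit: edit the captured image based on the related image."""
    edited = dict(captured)            # leave the camera frame itself untouched
    edited["overlay"] = related_image  # assumed editing operation: add an overlay layer
    return edited


def convert_voice(voice: Audio, voice_filter: Callable[[Audio], Audio]) -> Audio:
    """Voice quality conversion unit: apply the selected filter."""
    return voice_filter(voice)


def build_outgoing_frame(
    storage: Dict[str, TargetSpeakerData],
    target: str,
    captured: Image,
    voice: Audio,
) -> Tuple[Image, Audio]:
    """Select data for the chosen target speaker, then edit and convert."""
    data = storage[target]  # data selection unit
    return edit_image(captured, data.related_image), convert_voice(voice, data.voice_filter)
```

The returned pair is what the telephone transmission unit would send to the call partner; transmission itself is outside the scope of this sketch.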

Further comprising an imitation switch,
The video phone according to claim 1, wherein the image editing unit and the voice quality conversion unit function while the imitation switch is enabled.

A speaker state information generating unit for generating speaker state information which is information relating to the state of the speaker;
The video phone according to claim 1, wherein the image editing unit changes the target speaker-related image used for editing in accordance with the speaker state information.

A voice detection unit for detecting the presence or absence of the voice of the speaker input from the voice input unit;
The video phone according to claim 3, wherein the speaker state information generation unit generates the speaker state information including voice presence / absence information, which is information related to the presence or absence of the detected voice.
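
As an illustration of how voice presence / absence information could drive the choice of target speaker-related image, the sketch below uses a simple short-time energy threshold. The threshold value, the function names, and the image labels `mouth_open` and `mouth_closed` are assumptions made for this example; the claim itself does not specify a detection method.

```python
def voice_present(frame, threshold=0.01):
    """Return True if the mean squared amplitude of the frame exceeds the threshold.

    The energy criterion and the default threshold are assumptions for illustration.
    """
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold


def speaker_state(frame):
    """Speaker state information containing only voice presence / absence."""
    return {"voice_present": voice_present(frame)}


def choose_mouth_image(state):
    """Pick which target speaker-related image to use for editing."""
    return "mouth_open" if state["voice_present"] else "mouth_closed"
```

Switching between the two images frame by frame gives a crude lip-sync effect that follows the speaker's voice.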

Further comprising a speech recognition unit for recognizing the speech content of the voice of the speaker input from the voice input unit,
The video phone according to claim 3 or 4, wherein the speaker state information generation unit generates the speaker state information including speech content information, which is information related to the recognized speech content.

Further comprising a phoneme recognition unit for recognizing a phoneme of a speaker's voice input from the voice input unit;
The speaker state information generation unit generates the speaker state information including phoneme information that is information related to the recognized phoneme type,
The video phone according to claim 3, wherein the image editing unit adjusts the degree of opening and closing of the mouth in the target speaker-related image according to the phoneme information.
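
One way to picture the phoneme-dependent mouth adjustment is a lookup table from phoneme to opening degree. The table below is a hypothetical example: the vowel set and the numeric values are assumptions, not figures taken from the patent.

```python
# Assumed opening degrees (0.0 = closed, 1.0 = fully open) for five vowels.
MOUTH_OPENING = {"a": 1.0, "e": 0.6, "o": 0.6, "i": 0.3, "u": 0.2}


def mouth_opening_for(phoneme: str) -> float:
    """Return the mouth opening degree for a recognized phoneme.

    Phonemes that are not in the table are treated as a closed mouth.
    """
    return MOUTH_OPENING.get(phoneme, 0.0)


def mouth_track(phonemes):
    """Map a sequence of recognized phonemes to a sequence of opening degrees."""
    return [mouth_opening_for(p) for p in phonemes]
```

For example, `mouth_track(["a", "m", "i"])` yields one opening degree per recognized phoneme, which the image editing unit could use to pick or deform the mouth region.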

An image recognition unit for recognizing the state of the image of the speaker input from the imaging unit;
The videophone according to any one of claims 3 to 6, wherein the speaker state information generation unit generates the speaker state information including a state of the recognized image.

When the image recognition unit recognizes that there is a speaker's face, it detects the position of the face,
The speaker state information generation unit generates the speaker state information including face position information which is information related to the detected face position,
The video phone according to claim 7, wherein the image editing unit superimposes the target speaker-related image on a face position corresponding to the face position information.

When the image recognition unit recognizes that there is a speaker's face, it detects the inclination of the face,
The speaker state information generation unit generates the speaker state information including face inclination information that is information relating to the detected face inclination,
The video phone according to claim 8, wherein the image editing unit rotates and superimposes the target speaker related images according to the face tilt information.

When the image recognition unit recognizes that there is a speaker's face, it detects the size of the face,
The speaker state information generation unit generates the speaker state information including face size information which is information related to the detected face size,
The video phone according to claim 8, wherein the image editing unit enlarges or reduces the target speaker-related image in accordance with the face size information.
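
The claims on face position, face inclination, and face size together describe placing the target speaker-related image with a translation, a rotation, and a scale. The sketch below computes where the four corners of the overlay land; the function name, the parameterization by a center point, an angle in degrees, and a scale factor, and the rotation direction (from the +x axis toward the +y axis) are assumptions for illustration.

```python
import math


def overlay_corners(width, height, face_center, face_tilt_deg, scale):
    """Return the corner coordinates of the overlay after scale, rotation, and shift.

    width, height : size of the target speaker-related image
    face_center   : (x, y) position of the detected face
    face_tilt_deg : detected face inclination in degrees
    scale         : detected face size relative to the overlay size
    """
    cx, cy = face_center
    theta = math.radians(face_tilt_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = scale * width / 2, scale * height / 2
    corners = []
    for dx, dy in [(-half_w, -half_h), (half_w, -half_h), (half_w, half_h), (-half_w, half_h)]:
        # Rotate the offset about the overlay center, then move it to the face position.
        x = cx + dx * cos_t - dy * sin_t
        y = cy + dx * sin_t + dy * cos_t
        corners.append((x, y))
    return corners
```

With no tilt and a scale of 1, the corners form an axis-aligned rectangle centered on the detected face position.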

When the image recognition unit recognizes that there is a speaker's face, the image recognition unit detects opening / closing of the speaker's eyes,
The speaker state information generation unit generates the speaker state information including eye opening / closing information that is information relating to the detected opening / closing of the eyes,
The video phone according to claim 8, wherein the image editing unit opens and closes the eyes in the target speaker-related image according to the eye opening / closing information.

When the image recognition unit recognizes that there is a speaker's face, it detects the opening / closing of the speaker's mouth,
The speaker state information generation unit generates the speaker state information including mouth opening / closing information which is information relating to the detected opening / closing of the mouth,
The video phone according to claim 8, wherein the image editing unit opens and closes a mouth in the target speaker-related image according to the mouth opening and closing information.

The voice quality conversion filter consists of a first voice quality conversion filter that converts the voice quality of an individual speaker into the voice quality of a common intermediate speaker, and a second voice quality conversion filter that converts the voice quality of the intermediate speaker into the voice quality of each target speaker,
The data selection unit selects the first voice quality conversion filter and the second voice quality conversion filter as the voice quality conversion filter,
The video phone according to any one of claims 1 to 12, wherein the voice quality conversion unit converts the voice quality of the voice of the speaker input from the voice input unit to the voice quality of the intermediate speaker using the selected first voice quality conversion filter, and further converts the voice quality of the intermediate speaker to the voice quality of the target speaker using the selected second voice quality conversion filter.
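
The two-stage conversion through an intermediate speaker can be sketched as function composition. In this hypothetical example, filters are modeled as plain functions on a list of samples, and `filters_needed` illustrates a general arithmetic point about routing through a common intermediate: with N speakers and M target speakers, direct conversion needs one filter per pair (N × M), while the two-stage scheme needs N + M. The function names are assumptions.

```python
def convert_via_intermediate(voice, first_filter, second_filter):
    """Apply the first filter (speaker to intermediate), then the second (intermediate to target)."""
    intermediate_voice = first_filter(voice)
    return second_filter(intermediate_voice)


def filters_needed(num_speakers, num_targets):
    """Compare filter counts for direct conversion and conversion via an intermediate speaker."""
    return {
        "direct": num_speakers * num_targets,
        "via_intermediate": num_speakers + num_targets,
    }
```

For instance, `filters_needed(100, 20)` shows 2000 pairwise filters against 120 filters when an intermediate speaker is used.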

The video phone according to claim 13, further comprising a receiving unit that receives, from an external electronic device, the image related to the target speaker used in the image editing unit and the second voice quality conversion filter used in the voice quality conversion unit.

Further comprising an utterance type selection unit that allows the speaker to select either clear utterance or secret utterance,
The video phone according to claim 13, wherein the data selection unit selects, according to the selected utterance type, either a first voice quality conversion filter that converts the voice quality of the speaker's clear utterance into the voice quality of the intermediate speaker, or a first voice quality conversion filter that converts the voice quality of the speaker's secret utterance into the voice quality of the intermediate speaker.

When the secret utterance type is selected in the utterance type selection unit,
The video phone according to claim 15, wherein the data selection unit selects a second voice quality conversion filter that converts the voice quality of the intermediate speaker into the clear voice quality of the speaker.

When the secret utterance type is selected in the utterance type selection unit,
The video phone according to claim 15, wherein the image editing unit superimposes a display image indicating that the speaker is speaking secretly.
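
The utterance type handling in the preceding claims can be pictured as a small selection step. This is a sketch under assumptions: the filter identifiers, the layer name `secret_utterance_indicator`, and the representation of the edited image as a list of layers are all hypothetical.

```python
# Assumed identifiers for the two first voice quality conversion filters.
FIRST_FILTERS = {
    "clear": "first_filter_clear_to_intermediate",
    "secret": "first_filter_secret_to_intermediate",
}


def select_first_filter(utterance_type):
    """Data selection unit: pick the first filter that matches the utterance type."""
    return FIRST_FILTERS[utterance_type]


def edit_for_utterance_type(layers, utterance_type):
    """Image editing unit: add an indicator layer while secret utterance is selected."""
    if utterance_type == "secret":
        return layers + ["secret_utterance_indicator"]
    return list(layers)
```

Both functions return new values and leave their inputs unchanged, so the same layer list can be reused when the speaker switches utterance type.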

The video phone according to claim 13, further comprising a voice quality conversion filter synthesis unit that synthesizes the first voice quality conversion filter and the second voice quality conversion filter to generate a synthesized filter that directly converts the voice quality of the speaker into the voice quality of the target speaker.
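
If both filters were linear maps on a feature vector, synthesizing them would amount to a matrix product. The claim does not state the form of the filters, so the linear model below is purely an assumption chosen to make the idea concrete; the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_filter(matrix, vector):
    """Apply a linear filter (matrix) to a feature vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def synthesize(first, second):
    """Build one filter equivalent to applying `first` and then `second`."""
    return matmul(second, first)
```

Under this assumption, applying the synthesized filter once gives the same result as applying the first and second filters in sequence, while skipping the intermediate step at call time.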

The video phone according to any one of claims 1 to 18, further comprising a function permission unit that permits the image editing unit and the voice quality conversion unit to function only when an identifier that can identify the own video phone is transmitted to the video phone of the call partner.
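
The permission rule can be sketched as a simple gate. The function name and the fallback behavior (sending the unedited image and voice when permission is not granted) are assumptions; the claim only states when the editing and conversion functions are permitted.

```python
def select_outgoing(raw, converted, identifier_sent, imitation_switch_on=True):
    """Return the converted data only when the caller's identifier is being sent.

    raw, converted      : unedited and edited (image, voice) payloads
    identifier_sent     : True if an identifier of the own video phone is transmitted
    imitation_switch_on : True while the imitation switch is enabled
    """
    if identifier_sent and imitation_switch_on:
        return converted
    return raw  # assumed fallback: transmit the unedited image and voice
```

The gate ties impersonation to a transmitted identifier, so the call partner can still tell which phone the call comes from.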

A call method for making a call using a video phone that converts the voice quality of a speaker's voice into the voice quality of a target speaker, the call method comprising:
A telephone data storage step of storing an image related to the target speaker and a voice quality conversion filter for converting the voice quality of the speaker into the voice quality of the target speaker in a telephone data storage unit;
A speaker selection step of selecting the target speaker;
A data selection step of selecting, from the telephone data storage unit, a target speaker-related image that is an image related to the target speaker selected in the speaker selection step and a voice quality conversion filter corresponding to the target speaker;
An imaging step of inputting an image of the speaker;
An image editing step of editing the image input in the imaging step based on the target speaker-related image;
A voice input step of inputting the voice of the speaker;
A voice quality conversion step of converting the voice quality of the voice of the speaker input in the voice input step into the voice quality of the target speaker using the selected voice quality conversion filter; and
A telephone transmission step of transmitting the image edited in the image editing step and the voice converted in the voice quality conversion step to a call partner.

A program that causes a computer to execute:
A telephone data storage step of storing an image related to the target speaker and a voice quality conversion filter for converting the voice quality of the speaker into the voice quality of the target speaker in a telephone data storage unit;
A speaker selection step of selecting the target speaker;
A data selection step of selecting, from the telephone data storage unit, a target speaker-related image that is an image related to the target speaker selected in the speaker selection step and a voice quality conversion filter corresponding to the target speaker;
An imaging step of inputting an image of the speaker;
An image editing step of editing the image input in the imaging step based on the target speaker-related image;
A voice input step of inputting the voice of the speaker;
A voice quality conversion step of converting the voice quality of the voice of the speaker input in the voice input step into the voice quality of the target speaker using the selected voice quality conversion filter; and
A telephone transmission step of transmitting the image edited in the image editing step and the voice converted in the voice quality conversion step to a call partner.

A voice quality conversion / image editing service providing system that consists of a server and a video phone communicably connected to the server, and that provides a voice quality conversion / image editing service for converting the voice quality of a speaker's voice into the voice quality of a target speaker and editing an image of the speaker, wherein
The server comprises:
A server data storage unit that stores an image related to the target speaker and a voice quality conversion filter that converts the voice quality of the speaker into the voice quality of the target speaker; and
A server transmission unit for transmitting the image related to the target speaker and the voice quality conversion filter stored in the server data storage unit to the video phone,
and the video phone comprises:
A receiving unit that receives the image related to the target speaker and the voice quality conversion filter;
A telephone data storage unit that stores the image related to the target speaker and the voice quality conversion filter received by the receiving unit;
A speaker selection unit for selecting the target speaker;
A data selection unit for selecting, from the telephone data storage unit, a target speaker-related image, which is an image related to the target speaker selected by the speaker selection unit, and a voice quality conversion filter corresponding to the target speaker;
An imaging unit for inputting an image of the speaker;
An image editing unit that edits an image input from the imaging unit based on the target speaker-related image;
A voice input unit for inputting the voice of the speaker;
A voice quality conversion unit that converts the voice quality of the voice of the speaker input from the voice input unit to the voice quality of the target speaker using the selected voice quality conversion filter; and
A telephone transmission unit that transmits the image edited by the image editing unit and the voice converted by the voice quality conversion unit to a call partner.

The voice quality conversion / image editing service providing system according to claim 22, wherein
The voice quality conversion filter consists of a first voice quality conversion filter that converts the voice quality of an individual speaker into the voice quality of a common intermediate speaker, and a second voice quality conversion filter that converts the voice quality of the intermediate speaker into the voice quality of each target speaker,
The server transmission unit can transmit one or both of the first voice quality conversion filter and the second voice quality conversion filter,
The receiving unit can receive one or both of the first voice quality conversion filter and the second voice quality conversion filter,
The telephone data storage unit stores the first voice quality conversion filter and the second voice quality conversion filter, including whichever one or both of them the receiving unit has received,
The data selection unit
selects the second voice quality conversion filter from the telephone data storage unit if the first voice quality conversion filter is designated in advance, and
selects the first voice quality conversion filter and the second voice quality conversion filter from the telephone data storage unit as the voice quality conversion filter if the first voice quality conversion filter is not designated in advance, and
The voice quality conversion unit
converts the voice quality of the speaker's voice input from the voice input unit to the voice quality of the intermediate speaker using the designated first voice quality conversion filter if the first voice quality conversion filter is designated in advance,
converts the voice quality of the speaker's voice input from the voice input unit to the voice quality of the intermediate speaker using the selected first voice quality conversion filter if the first voice quality conversion filter is not designated in advance, and
further converts the voice quality of the intermediate speaker into the voice quality of the target speaker using the selected second voice quality conversion filter.
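
The branching on whether a first voice quality conversion filter has been designated in advance can be sketched as below. The storage layout (a dictionary with `"first"` and `"second"` tables), the parameter names, and the use of string identifiers for filters are assumptions for illustration only.

```python
def select_filters(storage, speaker, target, designated_first=None):
    """Return the (first, second) filter pair to use for a call.

    storage          : {"first": {speaker: filter}, "second": {target: filter}}
    designated_first : a first filter fixed in advance, or None if not designated
    """
    second = storage["second"][target]
    if designated_first is not None:
        # A first filter is designated in advance: only the second filter is looked up.
        return designated_first, second
    # Otherwise both filters come from the telephone data storage unit.
    return storage["first"][speaker], second
```

In the designated case, the phone only needs to obtain the second filter for a new target speaker, which keeps the data received from the server small.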

A server used in the voice quality conversion / image editing service providing system according to claim 22 or 23,
A server comprising the server data storage unit and a server transmission unit.

A video phone used in the voice quality conversion / image editing service providing system according to claim 22 or 23,
A video phone comprising the receiving unit, a telephone data storage unit, a speaker selection unit, a data selection unit, an imaging unit, an image editing unit, a voice input unit, a voice quality conversion unit, and a telephone transmission unit.