JP2005057431A - Video phone terminal apparatus

Info

Publication number: JP2005057431A
Application number: JP2003285184A
Authority: JP
Inventors: Tatsuya Arayagaito; 達也新谷垣内
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 2003-08-01
Filing date: 2003-08-01
Publication date: 2005-03-03

Abstract

PROBLEM TO BE SOLVED: To provide a videophone terminal apparatus that enhances the communication effect while taking the speaker's privacy into consideration by using a portrait of the speaker as a substitute image.

SOLUTION: A voice signal input from a voice input part 1 is encoded by a voice encoding part 2 and transmitted to the opposite-side terminal through a communication I/F part 4. At the same time, the input voice signal is sent to an expression data generation part 3, which generates a signal for adding expression to a face image. Meanwhile, a basic face data generation part 6 generates basic face data indicating the sizes, positions, etc., of the parts of the face, such as the outline, eyes, and mouth, according to operations at a user operation input part 5. A portrait generation part 7 combines the basic face data and the expression data to generate a portrait image of the speaker as a moving picture.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a videophone terminal device that generates a portrait of the speaker in accordance with the content of the speaker's voice and changes in the speaker's emotion, and transmits it to the other party.

Many so-called videophone devices, which equip a telephone with a camera and a display and allow communication by image, have already been commercialized. However, there is persistent resistance, on privacy grounds, to sending video of the speaker to unspecified call destinations, and videophone devices that send a proxy image stored in advance instead of the actual video of the speaker have been proposed (see, for example, Patent Document 1).

Meanwhile, the use of avatars is being considered for communication devices intended as entertainment rather than, like an ordinary videophone, as a means of conveying intent between speakers. For example, a device has been proposed that interposes a fictitious character instead of the speaker's actual face image (see, for example, Patent Document 2).

Patent Document 1: JP 2003-37826 A
Patent Document 2: JP 2003-16475 A

However, although the above prior art can reflect the speaker's preferences, for example by letting the speaker choose a proxy image or fictitious character from several options, the chosen image has no relation at all to the actual video of the speaker. Privacy protection and entertainment value can therefore be improved, but it is difficult to obtain the same effect as communication that uses the actual image.

In addition, when character information or the like is transmitted, the information on the character selected from the stored data and the expression data are transmitted separately and decoded at the receiving terminal. Both the transmitting and receiving terminals must therefore support this transmission scheme, and connectivity with ordinary videophone devices is not ensured.

The present invention was devised to solve the problems described above. Its object is to provide a videophone terminal device that enhances the communication effect while respecting the speaker's privacy by using a portrait of the speaker as the proxy image.

In order to achieve the above object, the videophone terminal device of the present invention comprises: basic face data generating means for generating, based on the position and size of each part of the speaker's face, basic face data indicating the facial features on which the speaker's portrait is based; voice input means for inputting the speaker's voice; expression data generating means for determining a change in emotion based on a combination of a first average frequency, which indicates the average of the fundamental frequency of the voice signal input to the voice input means from the past to the present, and a second average frequency, which indicates the average of the fundamental frequency over a period shorter than that of the first average frequency, and for generating expression data for the speaker's face; portrait generating means for generating portrait image information by changing the angle of specific parts of the face in the basic face data in correspondence with the emotion information in the expression data; and communication interface means for sending voice information and video information to a communication line. The device is characterized in that, during a call, it sends the portrait image instead of actual video of the speaker.

According to the present invention, the videophone terminal can use a portrait created by the speaker as a proxy image instead of actual video of the speaker. Communication that makes full use of video is therefore possible while privacy is protected, and calls with unspecified parties can be made without worry.

Furthermore, since a portrait whose expression is reasonably close to the actual video can be sent as the image, the terminal can also be built without any means of sending actual video, such as a camera input.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a videophone terminal device to which the present invention is applied.

Although not shown, a microphone or the like is connected to the voice input unit 1, as in an ordinary videophone terminal, and voice is input through that microphone. The input voice signal is encoded by the voice encoding unit 2 and transmitted to the other party's terminal via the communication I/F (interface) unit 4. The input voice signal is simultaneously sent to the expression data generation unit 3, which generates a signal for adding an expression to the face image. This expression data produces mouth movements that reflect the magnitude of the voice signal. In addition, as described later, the movement of the fundamental frequency of the voice signal is analyzed to quantify the degree of each emotion (joy, anger, sorrow, crying, ease) and to specify the angles of the eyebrows and mouth.

Meanwhile, the basic face data generation unit 6 generates the data that serves as the base of the face image to be sent, according to operations at the user operation input unit 5. This data is numerical data indicating the size, position, and so on of each part of the face, such as the outline, eyes, and mouth. The user can create the desired portrait data in advance of a call by operating the operation input unit 5.

The portrait generation unit 7 combines the basic face data and the expression data to generate a portrait image of the speaker as a moving image. The generated moving image information is encoded by the image encoding unit 8 and transmitted to the other party's terminal via the communication I/F unit 4.

The basic face data generation unit 6 uses quantifiable parameters that represent facial features as the data for generating a portrait. FIG. 2 shows an example of the numerical structure of the basic face data. In FIG. 2, the parts that make up the face are the outline, hair, eyebrows, eyes, and mouth, and each part is quantified by its size (width, and height or thickness) and its position (X and Y coordinates from the center of the face). For example, the mouth is located below the center of the face by the amount my, its thickness is mh, and its width is mw. The positions and sizes of the eyes and eyebrows are expressed numerically in the same way. The cheek line is represented as a spline curve, and the parameter fr indicates where its intermediate point lies.
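As a rough illustration of how such a parameter set might be held in memory, the following Python sketch stores each part as a size and an offset from the face center. The class names, field names, sign convention, and sample values are all assumptions made for illustration; only the roles of my, mh, mw, and fr come from the description of FIG. 2.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """Size and position of one facial part, relative to the face center."""
    x: float       # horizontal offset from the face center
    y: float       # vertical offset; positive means downward (an assumed convention)
    width: float
    height: float  # height or thickness of the part


@dataclass
class BasicFace:
    """Hypothetical container for the basic face data of FIG. 2."""
    outline: Part
    eyebrow: Part
    eye: Part
    mouth: Part
    cheek_point: float     # plays the role of fr: where the cheek spline's intermediate point lies
    hairstyle_id: int = 0  # index into a preset list of hairstyles (assumed)


# Sample values are invented; the mouth uses my = 35, mw = 40, mh = 6.
face = BasicFace(
    outline=Part(0, 0, 100, 130),
    eyebrow=Part(22, -30, 24, 3),
    eye=Part(22, -18, 18, 8),
    mouth=Part(0, 35, 40, 6),
    cheek_point=0.6,
)
```

A structure of this kind holds only a few dozen numbers per face, which is consistent with the small data volume the description emphasizes.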

Although not shown in FIG. 2, hairstyles vary so widely in shape that simple quantification is difficult. The hairstyle is therefore determined by selecting from a number of predetermined hairstyle candidates, by combining a selection from several basic hairstyle candidates with numerical data for the length of each section, or by converting the outline of the hairstyle itself into vector data.

Color data used for drawing, such as the color of the face and the color of the hair, can also be used as additional numerical data.

As described above, to create this basic face data the speaker can adjust each of the above parameters by operating the operation input unit 5 while monitoring the generated face image, and thereby create a portrait of himself or herself. For example, each numerical parameter may be set with a scroll bar on a GUI while the output portrait is monitored in real time.

Representing the face as data in this way makes it possible to express a wide variety of faces with a very small amount of data, and makes combining it with the expression data very simple. For example, the amplitude of the voice signal is mapped to how far the mouth opens. As emotional expressions, both ends of the mouth are curved upward for "joy", and both ends of the eyebrows are raised for "anger". A face image with an expression can thus be generated by simple changes to coordinate information.

Next, the operation of the expression data generation unit 3 and the way the expression data is reflected in the face image will be described. The expression data covers two things: the voice signal itself is reflected in the shape of the mouth of the face image, and the emotion information judged from the voice signal is reflected in the shape of not only the mouth but also parts such as the eyes and eyebrows.

In the simplest method, the voice signal itself is reflected in the mouth shape by linking the amplitude of the signal to how far the mouth opens. For finer control, the vowels in the voice signal can be examined: the overtone structure (formants) is analyzed, and a mouth shape defined in advance for the current formants is applied, so that the utterance and the mouth movement are coordinated more naturally.
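The simplest method, amplitude-driven mouth opening, can be sketched as follows. The text says only that the amplitude is linked to the mouth opening; the use of RMS as the amplitude measure, the linear mapping, and the names `mouth_opening`, `max_open`, and `full_scale` are assumptions of this sketch.

```python
import math


def mouth_opening(samples, max_open=1.0, full_scale=1.0):
    """Map the RMS amplitude of one audio frame to a mouth opening in [0, max_open].

    samples: audio samples of one frame, assumed normalized to [-1, 1].
    full_scale: RMS level at which the mouth is considered fully open (assumed).
    """
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Clip so that loud frames never open the mouth beyond max_open.
    return max_open * min(rms / full_scale, 1.0)
```

In use, the returned value would scale the mouth height of the basic face data for each video frame; smoothing across frames is left out here for brevity.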

The face image based on emotion judgment is created by analyzing changes in the fundamental frequency component of the input voice signal. FIG. 3 shows an example of changes in the fundamental frequency of an input voice signal. First, the long-term average M of the fundamental frequency is calculated. The utterance is also divided into sentences, and the average fundamental frequency S of each sentence is calculated. It is then judged whether the average frequency S of each sentence is running higher or lower than the long-term average M. The long-term average M is accumulated not only from the current call but also from data on past calls.

Each sentence is further divided into a first half and a second half. The average fundamental frequency of the first half is denoted H1 and that of the second half H2, and H1 and H2 are compared. This yields two pieces of information: whether the sentence average S is higher or lower than the long-term average M, and whether the average frequency is rising or falling within the sentence. From this information, the sentence is assigned to one of the emotional states of joy, anger, sorrow, crying, and ease (the emotion information).
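The averages M, S, H1, and H2 described above can be sketched in Python as follows. This assumes that a pitch tracker (not shown) has already produced a list of fundamental frequency values in Hz for the voiced frames of each sentence; the class and function names are hypothetical, and how the running totals are persisted between calls is left open.

```python
from statistics import fmean


class LongTermPitch:
    """Running average M of the fundamental frequency, carried across calls."""

    def __init__(self, total=0.0, count=0):
        # total and count can be restored from storage to include past calls.
        self.total = total
        self.count = count

    def update(self, f0_values):
        """Fold the F0 values of newly observed frames into the average."""
        self.total += sum(f0_values)
        self.count += len(f0_values)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def sentence_stats(f0_values):
    """Return (S, H1, H2) for one sentence; needs at least two F0 values."""
    half = len(f0_values) // 2
    s = fmean(f0_values)          # sentence average S
    h1 = fmean(f0_values[:half])  # first-half average H1
    h2 = fmean(f0_values[half:])  # second-half average H2
    return s, h1, h2
```

Splitting at the midpoint of the frame list is one simple reading of "first half and second half"; a split by time or by syllable count would work equally well with the same interface.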

This method is illustrated in FIG. 4. Using the symbols for the fundamental frequencies introduced above, the values of X and Y are calculated as

X = S - M
Y = H2 - H1

and this (X, Y) is plotted on the coordinates of FIG. 4 to determine which emotion region (joy, anger, sorrow, crying, or ease) the sentence falls in. To express the degree of the emotion, for example whether the speaker is extremely happy or only slightly happy in the case of "joy", each emotion region in FIG. 4 has a density distribution (along the Z axis), and the density at the point given by the (X, Y) coordinates can be output as the degree of that emotion.
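The two formulas and the subsequent lookup can be sketched as below. The layout of the emotion regions in FIG. 4 and the shape of their density distributions are not given in the text, so the quadrant assignment and the distance-based degree in `classify` are placeholders invented for this sketch, not the map of FIG. 4. The "crying" region is omitted for the same reason.

```python
import math


def emotion_coordinates(m, s, h1, h2):
    """Return (X, Y) with X = S - M and Y = H2 - H1, all in Hz."""
    return s - m, h2 - h1


def classify(x, y, full_scale_hz=50.0):
    """Placeholder lookup standing in for the FIG. 4 map.

    Returns (emotion, degree) with degree in [0, 1].
    The region boundaries and full_scale_hz are assumptions.
    """
    if x >= 0 and y >= 0:
        emotion = "joy"      # higher than usual and rising (assumed region)
    elif x >= 0:
        emotion = "anger"    # higher than usual and falling (assumed region)
    elif y < 0:
        emotion = "sorrow"   # lower than usual and falling (assumed region)
    else:
        emotion = "ease"     # lower than usual and rising (assumed region)
    # Stand-in for the density distribution: farther from the origin is stronger.
    degree = min(math.hypot(x, y) / full_scale_hz, 1.0)
    return emotion, degree
```

A table-driven lookup (for example, a 2-D grid of emotion labels and densities sampled from the actual figure) would replace `classify` in a faithful implementation.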

This method is not necessarily accurate, however. To improve accuracy, the assignment may also take into account fine movements of the frequency, changes in power in each frequency band, and the like.

Next, the emotion information (joy, anger, sorrow, crying, or ease) obtained for each sentence from FIG. 4 is sent from the expression data generation unit 3 to the portrait generation unit 7 as expression data, together with the mouth shape data derived from the voice signal itself. Based on this emotion information, the portrait generation unit 7 modifies each part of the face in the basic face data supplied from the basic face data generation unit 6. It also determines how far the mouth of the basic face data opens, based on the mouth shape data derived from the voice signal itself.

FIG. 5 shows an example in which the angles of the eyebrows and mouth are changed according to the emotion information (joy, anger, sorrow, crying, or ease).

In the "ease" expression, both the eyebrows and the mouth are horizontal, with no angle applied. Taking this "ease" expression as the baseline, the portrait generation unit 7 changes the expression to "joy", "anger", "sorrow", or "crying" according to the emotion information in the expression data sent from the expression data generation unit 3. When the emotion information is "ease", the angles of the eyebrows and mouth are left unchanged. The changes are as follows, with reference to FIG. 5. For "joy", the center of each eyebrow is raised to form an inverted-V shape, and both ends of the mouth are raised. For "anger", the ends of the eyebrows are raised and both ends of the mouth are lowered. For "sorrow", the ends of the eyebrows are lowered and both ends of the mouth are also lowered. For "crying", tears are added to the "sorrow" expression.

The angles applied to the eyebrows and mouth are made larger or smaller according to the data indicating the degree of emotion described above (the density distribution data of FIG. 4).
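The expression rules of FIG. 5, scaled by the degree of emotion, might be encoded as a small lookup table, as in the Python sketch below. The direction of each change follows the description; the split into tilt, arch, and mouth-curve parameters, the sign convention, the linear scaling, and the 20-degree maximum are assumptions of this sketch.

```python
# Direction of each adjustment per emotion. Positive tilt means the ends of the
# eyebrows are raised; positive arch means the center of each eyebrow is raised;
# positive mouth means both ends of the mouth are raised (assumed conventions).
EXPRESSIONS = {
    "ease":   {"brow_tilt": 0,  "brow_arch": 0, "mouth": 0,  "tears": False},
    "joy":    {"brow_tilt": 0,  "brow_arch": 1, "mouth": 1,  "tears": False},
    "anger":  {"brow_tilt": 1,  "brow_arch": 0, "mouth": -1, "tears": False},
    "sorrow": {"brow_tilt": -1, "brow_arch": 0, "mouth": -1, "tears": False},
    "crying": {"brow_tilt": -1, "brow_arch": 0, "mouth": -1, "tears": True},
}


def apply_expression(emotion, degree, max_angle_deg=20.0):
    """Return angle adjustments for the eyebrows and mouth.

    degree: strength of the emotion in [0, 1]; values outside are clipped.
    max_angle_deg: angle used at full strength (an assumed value).
    """
    spec = EXPRESSIONS[emotion]
    angle = max_angle_deg * max(0.0, min(degree, 1.0))
    return {
        "brow_tilt_deg": spec["brow_tilt"] * angle,
        "brow_arch_deg": spec["brow_arch"] * angle,
        "mouth_curve_deg": spec["mouth"] * angle,
        "tears": spec["tears"],
    }
```

Because "ease" maps to zero in every field, it acts as the neutral baseline regardless of the degree value, matching the description of the unchanged expression.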

In this way, the portrait generation unit 7 combines the basic face data created in advance with the emotion information sent for each sentence and the mouth shape data derived from the voice signal itself, and can thereby continuously create portraits that correspond to the speaker's emotions and pronunciation. The videophone terminal can therefore be used with a moving image of the portrait.

FIG. 1 is a diagram showing the configuration of the videophone terminal device of the present invention.
FIG. 2 is a diagram showing the structure of the basic face data.
FIG. 3 is a diagram showing changes in the fundamental frequency of a voice signal.
FIG. 4 is a diagram showing the correspondence between changes in the fundamental frequency of a voice signal and emotion information.
FIG. 5 is a diagram showing the modifications made to each part of the face according to the expression data.

Explanation of symbols

1 Voice input unit
2 Voice encoding unit
3 Expression data generation unit
4 Communication I/F unit
5 User operation input unit
6 Basic face data generation unit
7 Portrait generation unit
8 Image encoding unit

Claims

A videophone terminal device comprising:
basic face data generating means for generating, based on the position and size of each part of a speaker's face, basic face data indicating the facial features on which a portrait of the speaker is based;
voice input means for inputting the voice of the speaker;
expression data generating means for determining a change in emotion based on a combination of a first average frequency, indicating the average of the fundamental frequency of the voice signal input to the voice input means from the past to the present, and a second average frequency, indicating the average of the fundamental frequency over a period shorter than that of the first average frequency, and for generating expression data for the speaker's face;
portrait generating means for generating portrait image information by changing the angle of a specific part of the face in the basic face data in correspondence with the emotion information in the expression data; and
communication interface means for sending voice information and video information to a communication line,
wherein the portrait image is sent instead of actual video of the speaker during a call.