JP2005208367A - Audio reproducing processing apparatus and telephone terminal

Info

Publication number: JP2005208367A
Application number: JP2004015218A
Authority: JP
Inventors: Noritaka Tateishi; 徳孝立石; Hideaki Matsuo; 英明松尾; Makoto Araida; 真新井田; Ryota Yoshida; 良太吉田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2004-01-23
Filing date: 2004-01-23
Publication date: 2005-08-04
Anticipated expiration: 2024-01-23
Also published as: JP4294502B2

Abstract

PROBLEM TO BE SOLVED: To provide an audio reproducing processing apparatus capable of changing the display style of an image according to the output of audio data without increasing the amount of data.

SOLUTION: The audio reproducing processing apparatus is equipped with a mouth-shape generating means 21 that, when sound-recording data are reproduced, reads face image data on the caller corresponding to the origination-source telephone number from a telephone directory database 23 and analyzes the sound-recording data to generate a mouth-shape image corresponding to the analysis result, and an expression changing image generating means 26 that combines the mouth-shape image with the face image to generate an expression changing image in which the expression of the face changes. The expression changing image generated each time an analysis is conducted is displayed by a display means 14.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an audio reproduction processing apparatus that displays an image in conjunction with the output of audio data, and to a telephone terminal equipped with the apparatus.

Various answering machine service systems have been proposed for conveying messages recorded in the recipient's absence. In one such system, a message is registered at a message center of a base station over a telephone line, and the recipient queries the message center to receive the message. This allows the recipient and the caller to communicate indirectly.

Because this kind of answering machine service system uses only voice as the medium for the message, the message can feel impersonal, and the caller's intended expression may not be conveyed accurately to the recipient. To address this, a message system has been proposed that stores the caller's message in a database, plays it back at the recipient's request, and during playback expresses the caller's emotion with an agent (proxy) image that is changed continuously so that it is displayed like a moving image (see, for example, Patent Document 1).

Patent Document 1: JP 2002-41279 A

However, to display a moving image during message playback, such a conventional message system must prepare a large number of frames of agent image data. Playing different moving images over a long period further requires storing and processing an enormous amount of image data. This calls for a large-capacity memory and a high-speed arithmetic processing circuit, which makes the system unsuitable for a portable terminal.

The present invention has been made in view of the above circumstances. Its object is to provide an audio reproduction processing apparatus that can change the display form of an image in accordance with the output of audio data without increasing the amount of data, and a telephone terminal equipped with the apparatus.

The audio reproduction processing apparatus of the present invention displays an image in conjunction with the output of audio data and comprises: analysis means for analyzing the audio data; image generation means for generating, each time the audio data is analyzed, a new image in which the display form of an original image is changed according to the analysis result; and display control means for displaying the new image each time it is generated.

With this configuration, the image generation means generates a new image in which the display form of the original image is changed according to the analysis result each time the audio data is analyzed. Images corresponding to the analysis results are thus generated sequentially from a single image and output, so there is no need to prepare multiple images in advance. As a result, the display form of the image can be changed in accordance with the output of the audio data without increasing the amount of data.

In the audio reproduction processing apparatus of the present invention, the analysis means analyzes the audio data at unit time intervals. With this configuration, adjusting the time interval changes the number of images generated: a shorter interval produces more images and improves the continuity of the display, while a longer interval produces fewer images and reduces the processing load.
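The trade-off controlled by the unit time interval can be illustrated with a minimal sketch. It assumes the audio is available as a list of PCM samples; the function name `split_into_units`, the parameters `sample_rate` and `unit_seconds`, and the 8 kHz rate are hypothetical and do not come from the specification.

```python
from typing import Iterator, List


def split_into_units(samples: List[int], sample_rate: int,
                     unit_seconds: float) -> Iterator[List[int]]:
    """Yield consecutive chunks of audio, one chunk per unit time interval.

    One image would be generated per chunk, so a shorter interval yields
    more images (smoother display) and a longer one yields fewer (less load).
    """
    if sample_rate <= 0 or unit_seconds <= 0:
        raise ValueError("sample_rate and unit_seconds must be positive")
    unit_len = max(1, round(sample_rate * unit_seconds))
    for start in range(0, len(samples), unit_len):
        yield samples[start:start + unit_len]


one_second = [0] * 8000  # one second of silence at an assumed 8 kHz rate
fine = list(split_into_units(one_second, 8000, 0.1))     # short interval
coarse = list(split_into_units(one_second, 8000, 0.25))  # long interval
```

In this sketch the same second of audio produces more chunks, and therefore more generated images, with the shorter interval than with the longer one.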

In the audio reproduction processing apparatus of the present invention, the image generation means generates a new face image in which the expression of a face image is changed. With this configuration, the expression of the face image changes in conjunction with the audio output, so the audio can be output as if it were being uttered by the face in the image.

The audio reproduction processing apparatus of the present invention further comprises mouth shape creation means for changing the expression of the face image by means of a mouth shape created based on the analysis result of the audio data. With this configuration, the display form of the image can be changed simply by replacing the mouth shape of the original image with the newly created mouth shape, so a new face image can be generated easily with a small amount of data.

The audio reproduction processing apparatus of the present invention further comprises emotion estimation means for changing the expression of the face image according to an emotion estimated from the analysis result of the audio data. The emotion estimation means estimates the speaker's emotion based on at least one of the loudness, pitch, and speaking speed of the voice in the audio data. With this configuration, the speaker's emotion can be expressed, so a face image that feels more familiar can be generated.

The audio reproduction processing apparatus of the present invention further comprises selection means for selecting any original image from a plurality of original images. With this configuration, selecting an original image that matches the audio data allows the image to be displayed in conjunction with the audio output without seeming unnatural.

The telephone terminal of the present invention is equipped with the audio reproduction processing apparatus of the present invention and displays an image in conjunction with the output of recorded audio data. It comprises storage means for storing the original images in association with telephone numbers and memory means for storing the recorded audio in association with the caller's telephone number. The selection means selects the image corresponding to the caller's telephone number from the plurality of stored original images, and the image generation means generates, each time the recorded audio data is analyzed, a new image in which the display form of the selected original image is changed according to the analysis result.

With this configuration, the storage means stores original images in association with telephone numbers. If the face image of the person who recorded a message is stored in association with the caller's telephone number, the message can be played back as if that person were speaking with a changing expression.

The telephone terminal of the present invention is equipped with the audio reproduction processing apparatus of the present invention and displays an image in conjunction with the output of a call ringtone. It comprises storage means for storing the original images in association with telephone numbers. The selection means selects the image corresponding to the caller's telephone number from the plurality of stored original images, and the image generation means generates, each time the ringtone is analyzed, a new image in which the display form of the selected original image is changed according to the analysis result.

The telephone terminal of the present invention is equipped with the audio reproduction processing apparatus of the present invention and displays an image in conjunction with the output of a mail ringtone. It comprises storage means for storing the original images in association with mail addresses. The selection means selects the image corresponding to the mail address from the plurality of stored original images, and the image generation means generates, each time the ringtone is analyzed, a new image in which the display form of the selected original image is changed according to the analysis result.

With this configuration, the storage means stores original images in association with telephone numbers, so that during an incoming call or mail the ringtone can be output as if it were being uttered by the face image associated with the caller's telephone number or mail address.

The telephone terminal of the present invention is equipped with the audio reproduction processing apparatus of the present invention and displays an image in conjunction with the output of call audio. It comprises storage means for storing the original images in association with telephone numbers. The selection means selects the image corresponding to the caller's telephone number from the plurality of stored original images, and the image generation means generates, each time the incoming audio is analyzed, a new image in which the display form of the selected original image is changed according to the analysis result.

With this configuration, the expression of the other party's face image can be changed in time with that party's voice, as if the party were speaking, so the terminal can be used as a videophone with low memory usage.

The audio reproduction processing method of the present invention displays an image in conjunction with the output of audio data. The method analyzes the audio data, generates, each time the audio data is analyzed, a new image in which the display form of an original image is changed according to the analysis result, and displays the new image each time it is generated.

Furthermore, the audio reproduction processing program of the present invention displays an image in conjunction with the output of audio data. It causes a computer to function as analysis means for analyzing the audio data, image generation means for generating, each time the audio data is analyzed, a new image in which the display form of an original image is changed according to the analysis result, and display control means for displaying the new image each time it is generated.

According to the present invention, the image generation means generates a new image in which the display form of the original image is changed according to the analysis result each time the audio data is analyzed. Images corresponding to the analysis results are thus generated sequentially from a single image and output, so there is no need to prepare multiple images in advance. As a result, the display form of the image can be changed in accordance with the output of the audio data without increasing the amount of data.

FIG. 1 is a configuration diagram of an answering machine playback system for explaining an embodiment of the present invention. The terminal 11 shown in the figure is a telephone terminal equipped with an audio reproduction processing apparatus that displays an image in conjunction with the output of audio data; it outputs a recorded message while displaying the face image of the person who recorded the message with a changing expression. The terminal 11 has, as external attachments, a microphone 12 for audio input, a speaker 13 for audio output, a monitor (LCD) 14 for image output, and a manual input operation unit 15 such as a keyboard or a pointing device such as a mouse. The input operation unit 15 is used to execute the various functions of the terminal 11 and to enter settings.

The terminal 11 includes a control unit 16. The control unit 16 includes a CPU and the like, and controls execution of the various means involved in recording and playing back message data and in processing the caller's image, described later, as well as access to the databases. Each time a new image (expression change image) in which the display form of the original image (face image) is changed is generated, the control unit 16 causes the display means 14 to display the new image.

The terminal 11 also includes: recording means 17 for recording a message (audio data); a recording database 18 that stores the recorded audio in association with the caller's telephone number; recording data management means 19 for managing the input and output of audio data to and from the recording database 18; reproduction means 20 for playing back the audio data; mouth shape creation means 21 for analyzing the audio data and changing the expression of the face image by means of a mouth shape created based on the analysis result; emotion estimation means 22 for analyzing the audio data and changing the expression of the face image according to an emotion estimated from the analysis result; a telephone directory database 23 that stores original images in association with telephone numbers; telephone directory data management means 24 for managing the input and output of data to and from the telephone directory database 23; user identification means 25 for identifying the caller based on the telephone number; and expression change image creation means 26 for generating, each time the audio data is analyzed, a new image in which the display form of the original image is changed according to the analysis result. The user identification means 25 and the telephone directory data management means 24 select the image corresponding to the caller's telephone number from the plurality of face images stored in the telephone directory database 23.

The telephone directory database 23 stores the telephone number of each caller's terminal in association with a still image of the caller corresponding to that telephone number. The still image of the caller is, for example, one captured in advance with a digital camera (not shown) of the recipient's terminal 11 itself, or one provided by the caller. Each still image is stored in association with the telephone number of the corresponding caller.

When a message is recorded in the recipient's absence, the recording database 18 stores the caller's recorded message data in association with the caller telephone number, together with the recording date and time (date and time of the incoming call).
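The association held by the recording database 18 can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class names `Recording` and `RecordingDatabase`, their methods, and the sample telephone number are hypothetical, and the specification does not describe how the database is implemented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Recording:
    caller_number: str     # caller telephone number
    recorded_at: datetime  # recording (incoming call) date and time
    audio: bytes           # recorded message data


class RecordingDatabase:
    """Hypothetical in-memory stand-in for the recording database 18."""

    def __init__(self) -> None:
        self._by_number: Dict[str, List[Recording]] = {}

    def store(self, recording: Recording) -> None:
        # Keep each message under the telephone number it came from.
        self._by_number.setdefault(recording.caller_number, []).append(recording)

    def messages_from(self, caller_number: str) -> List[Recording]:
        return list(self._by_number.get(caller_number, []))

    def has_messages(self) -> bool:
        return any(self._by_number.values())


db = RecordingDatabase()
db.store(Recording("0312345678", datetime(2004, 1, 23, 9, 30), b"\x00\x01"))
```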

The mouth shape creation means 21 creates, based on the recorded data, a plurality of mouth shapes for the face image of the caller corresponding to the caller telephone number when the recorded data is played back. Here, for example, five mouth shapes are created, corresponding to the five Japanese vowels "a", "i", "u", "e", and "o" of the caller's spoken language. The mouth shape creation means 21 analyzes the audio data at unit time intervals by speech recognition or the like and, each time it does so, creates a mouth shape image corresponding to the first vowel of the analyzed phrase. Because mouth shape images are created one after another during playback, there is no need to hold multiple images, which suppresses the increase in the amount of data handled. The changing mouth shape makes the caller appear on screen to be actually speaking. Alternatively, the five mouth shape images may be created in advance when playback of the audio data starts. In this case, storing only the differential data of each mouth shape relative to the original image keeps the increase in the amount of data handled to a minimum.
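The selection of a mouth shape from the first vowel of an analyzed phrase can be sketched as follows. The sketch assumes that the speech recognition step has already produced a romanized phrase, which the specification does not state; the names `MOUTH_SHAPES`, `first_vowel`, and `mouth_shape_for`, the shape labels, and the closed-mouth fallback are all hypothetical.

```python
from typing import Optional

VOWELS = "aiueo"

# Hypothetical labels for the five mouth shapes; the specification only
# says that one mouth shape is created per vowel.
MOUTH_SHAPES = {
    "a": "mouth_a",
    "i": "mouth_i",
    "u": "mouth_u",
    "e": "mouth_e",
    "o": "mouth_o",
}
CLOSED = "mouth_closed"  # assumed fallback when no vowel is recognized


def first_vowel(romanized_phrase: str) -> Optional[str]:
    """Return the first vowel in the phrase, or None if there is none."""
    for ch in romanized_phrase.lower():
        if ch in VOWELS:
            return ch
    return None


def mouth_shape_for(romanized_phrase: str) -> str:
    """Pick the mouth shape for the first vowel of the analyzed phrase."""
    vowel = first_vowel(romanized_phrase)
    return MOUTH_SHAPES[vowel] if vowel is not None else CLOSED
```

Under these assumptions, `mouth_shape_for("konnichiwa")` selects the shape for "o", because "o" is the first vowel in the phrase.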

The emotion estimation means 22 estimates the speaker's emotion during playback of the audio data based on at least one of the loudness of the voice, its pitch, and the speaking speed. This emotion information can be reflected in the facial expression produced by the expression change image creation means 26. In the emotion estimation means 22, emotion information covering the caller's joy, anger, sorrow, and pleasure is parameterized by loudness, pitch, and speaking speed, and is stored as emotion expression information for changing the display form of the face image. For example, the emotion estimation means 22 estimates that the speaker is angry when the voice in the audio data is loud, high-pitched, and fast, and that the speaker is in a normal state when the loudness and pitch are normal and the speaking speed is slow.
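The two example rules given above can be written as a small rule table. This is a minimal sketch, not the parameterization used by the emotion estimation means 22: it assumes the three features have already been reduced to coarse levels, and the `Level` enum, the function name `estimate_emotion`, and the "undetermined" result for combinations the text does not cover are hypothetical.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    NORMAL = 1
    HIGH = 2


def estimate_emotion(loudness: Level, pitch: Level, speed: Level) -> str:
    """Apply the two example rules from the description.

    Any other combination is reported as "undetermined" because the text
    gives no rule for it (an assumption of this sketch).
    """
    # Loud, high-pitched, fast speech: the speaker is taken to be angry.
    if loudness is Level.HIGH and pitch is Level.HIGH and speed is Level.HIGH:
        return "angry"
    # Normal loudness and pitch with slow speech: normal state.
    if loudness is Level.NORMAL and pitch is Level.NORMAL and speed is Level.LOW:
        return "normal"
    return "undetermined"
```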

Each time the audio data is analyzed, the expression change image creation means 26 combines the created mouth shape image with the original image to generate an expression change image in which the facial expression of the original image changes. Using the mouth shape image created by the mouth shape creation means 21 and adding the emotion information obtained by the emotion estimation means 22, the expression change image creation means 26 also generates an expression change image in which the mouth shape, the shape of the eyes, and so on of the original image are changed (for example, the corners of the eyes droop or the mouth opens wide). Each expression change image is displayed on the display means 14 as it is generated.
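Combining a mouth shape image with the original image amounts to overwriting one region of a copy of the original. The sketch below assumes a toy image representation (rows of integer pixel values) and a known mouth position; the name `composite` and the `top` and `left` parameters are hypothetical, and a real terminal would use its own image format.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; a stand-in for real image data


def composite(original: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of `original` with `patch` pasted at (top, left).

    The original is left untouched, so a single stored face image can be
    reused to generate every expression change image.
    """
    result = [row[:] for row in original]
    if not patch or not patch[0]:
        return result
    height, width = len(original), len(original[0])
    if (top < 0 or left < 0
            or top + len(patch) > height or left + len(patch[0]) > width):
        raise ValueError("patch does not fit inside the original image")
    for dy, patch_row in enumerate(patch):
        result[top + dy][left:left + len(patch_row)] = patch_row
    return result


face = [[0] * 4 for _ in range(3)]    # 3x4 placeholder face image
mouth = [[7, 7]]                      # 1x2 placeholder mouth shape image
new_face = composite(face, mouth, top=2, left=1)
```

Because only the small patch differs between generated images, this also illustrates why storing mouth shapes as differences from the original keeps the data volume low.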

FIG. 2 illustrates the process from reading the recorded data to changing the facial expression of the image, and FIG. 3 shows the corresponding flow. When the input operation unit 15 attached to the terminal 11 is operated, the control unit 16 checks whether a message has been recorded by referring to the recording database 18 through the recording data management means 19. If there is a recorded message, the recording data management means 19 reads from the recording database 18 the audio data recorded at the time of the incoming call, that is, the caller's message data, together with the telephone number of the caller's terminal (caller telephone number) corresponding to that message data (step S31).

Next, the user identification means 25 checks, through the telephone directory data management means 24, whether the caller telephone number is registered in the telephone directory database 23 (step S32). If the telephone number is registered (step S33), it then checks whether a still image of the caller corresponding to that telephone number is registered in the telephone directory database 23 (step S34).

If a still image is registered in the telephone directory database 23, the control unit 16 causes the user identification means 25 to read and hold the still image (step S35). When this series of processing for the recorded data is complete, the terminal waits until playback processing starts.
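Steps S32 to S35 reduce to a two-stage lookup: first the telephone number, then the still image registered for it. The following sketch assumes the telephone directory is a plain dictionary keyed by telephone number; `DirectoryEntry`, `find_face_image`, and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DirectoryEntry:
    name: str
    face_image: Optional[bytes] = None  # still image, if one is registered


def find_face_image(directory: Dict[str, DirectoryEntry],
                    caller_number: str) -> Optional[bytes]:
    """Return the caller's still image, or None if nothing usable is registered."""
    entry = directory.get(caller_number)  # S32/S33: is the number registered?
    if entry is None:
        return None
    return entry.face_image               # S34/S35: the image, or None if absent


directory = {
    "0312345678": DirectoryEntry("Caller A", face_image=b"image-bytes"),
    "0398765432": DirectoryEntry("Caller B"),  # number registered, no image
}
```

A `None` result corresponds to the case in which playback later proceeds with audio only.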

Playback of the recorded data starts when the user of the terminal 11, having confirmed via the display means 14 or the like that a recorded message exists, operates the input operation unit 15. The control unit 16 checks whether the caller's image data is held in the user identification means 25 (step S37). If face image data is held, the recorded data is analyzed one unit time at a time (step S38); each time, mouth shape data is created (step S39) and the speaker's emotion is estimated (step S40).

The expression change image creation means 26 creates, from the single original image, an expression change image corresponding to the analysis result of the recorded data, reflecting the emotion information estimated by the emotion estimation means 22 in addition to the change in facial expression caused by the change in mouth shape (step S41). The display means 14 then displays the expression change image (step S42), and the recorded data is played back for one unit time (step S43). The processing from step S37 onward is repeated until no unplayed recorded data remains (steps S44, S36). If it is determined in step S37 that no face image has been read, the process proceeds to step S43 and the recorded data is played back.
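The loop formed by steps S37 to S44 can be sketched with the individual means passed in as callbacks. This is an illustration of the control flow only; the function `play_message` and its parameter names are hypothetical, and the lambdas in the example are placeholders for the analysis, image creation, display, and reproduction means.

```python
from typing import Any, Callable, Iterable, Optional


def play_message(units: Iterable[Any],
                 face_image: Optional[Any],
                 analyze: Callable[[Any], Any],
                 make_image: Callable[[Any, Any], Any],
                 show: Callable[[Any], None],
                 play: Callable[[Any], None]) -> None:
    """Play a recorded message one unit time at a time."""
    for unit in units:                                # repeat until no data remains (S44)
        if face_image is not None:                    # S37: is a face image held?
            result = analyze(unit)                    # S38 to S40: analyze this unit
            changed = make_image(face_image, result)  # S41: create the changed image
            show(changed)                             # S42: display it
        play(unit)                                    # S43: play this unit of audio


events = []
play_message(
    units=["u1", "u2"],
    face_image="face",
    analyze=lambda unit: unit.upper(),
    make_image=lambda face, result: f"{face}+{result}",
    show=lambda image: events.append(("show", image)),
    play=lambda unit: events.append(("play", unit)),
)
```

When `face_image` is `None`, the sketch skips analysis and display and only plays the audio, matching the branch from step S37 directly to step S43.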

In this way, when a message recorded on the recipient's terminal is played back, the terminal does not simply output the audio but also displays the caller's own face image and changes its expression, so the message is played back as if the caller were speaking. This image display gives the impression that the caller is speaking right beside the recipient, which makes playback more enjoyable for the recipient.

The audio reproduction processing described above can also be used for other purposes. For example, together with the output of a ringtone, the shape, pattern, or color (display form) of a displayed image, such as an image that matches the ringtone or the caller's face image, can be changed. The face image of the other party can also be changed and displayed during a call. In this case, even a telephone terminal without a videophone function can provide a function similar to a videophone, and a telephone terminal with a videophone function can handle far less data than when transmitting and receiving both audio and images.

The audio reproduction processing apparatus and telephone terminal of the present invention can change the display form of an image in accordance with the output of audio data without increasing the amount of data. They are useful as an audio reproduction processing apparatus that displays an image in conjunction with the output of audio data, such as an answering machine system that plays back a recorded message together with the caller's own face image, and as a telephone terminal equipped with such an apparatus.

FIG. 1: Configuration diagram of an answering machine playback system for explaining an embodiment of the present invention
FIG. 2: Diagram explaining the process from reading the recorded data to changing the facial expression of the image
FIG. 3: Diagram showing the flow from reading the recorded data to changing the facial expression of the image

Explanation of symbols

11 Terminal
16 Control unit
17 Recording means
18 Recording database
19 Recording data management means
20 Reproduction means
21 Mouth shape creation means
22 Emotion estimation means
23 Telephone directory database
24 Telephone directory data management means
25 User identification means
26 Expression change image creation means

Claims

An audio reproduction processing apparatus that displays an image in conjunction with output of audio data, comprising:
analysis means for analyzing the audio data;
image generation means for generating, each time the audio data is analyzed, a new image in which the display form of an original image is changed according to the analysis result; and
display control means for displaying the new image each time the new image is generated.

The audio reproduction processing apparatus according to claim 1,
wherein the analysis means analyzes the audio data at unit time intervals.

The audio reproduction processing apparatus according to claim 1 or 2,
wherein the image generation means generates a new face image in which the expression of a face image is changed.

The audio reproduction processing apparatus according to claim 3,
further comprising mouth shape creating means for changing the expression of the face image according to a mouth shape created based on the analysis result of the audio data.

The audio reproduction processing apparatus according to claim 3 or 4,
further comprising emotion estimation means for changing the expression of the face image based on an emotion estimated from the analysis result of the audio data.

The audio reproduction processing apparatus according to claim 5,
wherein the emotion estimation means estimates the speaker's emotion based on at least one selected from the volume, pitch, and speaking speed of the audio data.

The audio reproduction processing apparatus according to any one of claims 1 to 6,
further comprising selection means for selecting an arbitrary original image from a plurality of original images.

A telephone terminal that is equipped with the audio reproduction processing apparatus according to claim 7 and displays an image in conjunction with output of recorded voice data, comprising:
storage means for storing the original images in association with telephone numbers; and
storage means for storing the recorded voice data in association with the telephone number of the caller,
wherein the selection means selects, from the plurality of stored original images, the image corresponding to the caller's telephone number, and
the image generating means generates, each time the recorded voice data is analyzed, a new image in which the display form of the original image selected for the caller's telephone number is changed according to the analysis result.

A telephone terminal that is equipped with the audio reproduction processing apparatus according to claim 7 and displays an image in conjunction with output of a call ringtone, comprising:
storage means for storing the original images in association with telephone numbers,
wherein the selection means selects, from the plurality of stored original images, the image corresponding to the caller's telephone number, and
the image generating means generates, each time the ringtone is analyzed, a new image in which the display form of the original image selected for the caller's telephone number is changed according to the analysis result.

A telephone terminal that is equipped with the audio reproduction processing apparatus according to claim 7 and displays an image in conjunction with output of a mail ringtone, comprising:
storage means for storing the original images in association with mail addresses,
wherein the selection means selects, from the plurality of stored original images, the image corresponding to the mail address, and
the image generating means generates, each time the mail ringtone is analyzed, a new image in which the display form of the original image selected for the mail address is changed according to the analysis result.

A telephone terminal that is equipped with the audio reproduction processing apparatus according to claim 7 and displays an image in conjunction with output of a call voice, comprising:
storage means for storing the original images in association with telephone numbers,
wherein the selection means selects, from the plurality of stored original images, the image corresponding to the caller's telephone number, and
the image generating means generates, each time the call voice is analyzed, a new image in which the display form of the original image selected for the caller's telephone number is changed according to the analysis result.

An audio reproduction processing method for displaying an image in conjunction with output of audio data, comprising:
analyzing the audio data;
generating, each time the audio data is analyzed, a new image in which the display form of an original image is changed according to the analysis result; and
displaying the new image each time the new image is generated.

An audio reproduction processing program that displays an image in conjunction with output of audio data, the program causing a computer to function as:
analyzing means for analyzing the audio data;
image generating means for generating, each time the audio data is analyzed, a new image in which the display form of an original image is changed according to the analysis result; and
display control means for displaying the new image each time the new image is generated.
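
The analyze, generate, and display steps recited in the claims, combined with emotion estimation from volume, pitch, and speaking speed, might be sketched as follows. This is only an illustrative sketch under stated assumptions. The claims do not specify how the three features map to an emotion, so the comparison against a per-speaker baseline, the 1.2 and 0.8 factors, the emotion labels, and the names `estimate_emotion` and `reproduce` are all hypothetical. Strings stand in for image data.

```python
FEATURES = ("volume", "pitch_hz", "speech_rate")


def estimate_emotion(volume, pitch_hz, speech_rate, baseline):
    """Return a coarse emotion label from three voice features.

    `baseline` holds the speaker's typical value for each feature.
    The rule below (two or more features well above or well below
    the baseline) is an assumption made for illustration only.
    """
    current = dict(zip(FEATURES, (volume, pitch_hz, speech_rate)))
    raised = sum(current[k] > 1.2 * baseline[k] for k in FEATURES)
    lowered = sum(current[k] < 0.8 * baseline[k] for k in FEATURES)
    if raised >= 2:
        return "excited"
    if lowered >= 2:
        return "subdued"
    return "neutral"


def reproduce(analysis_results, original_image, baseline, display):
    """For each analysis result, generate a new image and display it."""
    for features in analysis_results:
        emotion = estimate_emotion(
            features["volume"],
            features["pitch_hz"],
            features["speech_rate"],
            baseline,
        )
        # Stand-in for changing the display form of the original image.
        new_image = f"{original_image}+{emotion}"
        display(new_image)


# Usage: three unit-time analysis results for one speaker.
baseline = {"volume": 0.5, "pitch_hz": 200.0, "speech_rate": 5.0}
results = [
    {"volume": 0.7, "pitch_hz": 260.0, "speech_rate": 5.0},
    {"volume": 0.5, "pitch_hz": 200.0, "speech_rate": 5.0},
    {"volume": 0.3, "pitch_hz": 150.0, "speech_rate": 4.5},
]
shown = []
reproduce(results, "face", baseline, shown.append)
```

Passing the display step in as a callback keeps the sketch independent of any particular screen or image library. A new image is produced and shown once per analysis result, mirroring the "each time" wording of the claims.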