JP5164911B2

JP5164911B2 - Avatar generating apparatus, method and program

Info

Publication number: JP5164911B2
Application number: JP2009102242A
Authority: JP
Inventors: 秀和玉木; 睦裕中茂; 豪東野; 由里子鈴木; 稔小林
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-04-20
Filing date: 2009-04-20
Publication date: 2013-03-21
Anticipated expiration: 2029-04-20
Also published as: JP2010250761A

Abstract

PROBLEM TO BE SOLVED: To provide an avatar that has a natural expression and encourages speech.

SOLUTION: The avatar generating device comprises: a means 104 for predicting a break between exhalation paragraphs of speech; a means 105 for calculating a laughter estimation value with a first model that, according to the fundamental frequency of the speech in the exhalation paragraph immediately before the break, outputs a larger laughter estimation value as the fundamental frequency becomes higher; a means 105 for setting a laughter level according to whether or not the laughter estimation value is equal to or greater than a first threshold; a means 106 for calculating a smile estimation value that is the sum or the product of a first output value of a second model that takes the previous smile estimation value as input, the immediately preceding laughter estimation value, and a second output value of a third model that takes as input the sound pressure in a predetermined period at the beginning of the next exhalation paragraph; a means 105 for setting a smile level according to whether or not the smile estimation value is equal to or greater than a second threshold; and means 108, 109 for generating an avatar according to the laughter level and the smile level.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a voice recording interface, and to the smile and laughter estimation model it uses, for situations in which a person speaks to or records on a device, such as voice mail, video mail, or videophone. An avatar responds to the speaker's speech with smiles and laughter that prompt further speech and draw the speaker out, so that the speaker finds it easier to record speech and can record a message that carries emotion.

Conventional real-time and non-real-time mediated communication (communication between people with a machine in between) has a problem: because the speaker talks to a machine that does not react, it is hard to get the timing of pauses right, and the speaker becomes anxious or tense.

To address this, there are studies that try to lower these barriers by communicating through an avatar (see, for example, Patent Document 1 and Patent Document 2), but the user has to operate the avatar with keys, which is burdensome. In addition, in non-real-time communication the avatar cannot be operated interactively in response to what one party says.

Patent Document 1: JP 2001-160154 A
Patent Document 2: JP 2005-327096 A

As described above, in conventional communication support systems the user must operate the avatar's reactions, such as facial expressions and gestures. It is therefore difficult to return natural, interactive reactions, and such systems cannot be used for non-real-time communication.

Accordingly, an object of the present invention is to solve these conventional problems and to provide an avatar generation apparatus, method, and program that generate an avatar whose expression is natural and changes so as to encourage speech.

To solve the above problems, the avatar generation apparatus of the present invention comprises: prediction means for predicting a break between exhalation paragraphs from the time-series relationship of the speech; first estimation means for calculating a laughter estimate for the current break using a first statistical model that, according to the fundamental frequency of the speech in the exhalation paragraph immediately before the break, outputs a larger laughter estimate as the fundamental frequency becomes higher; first setting means for setting a laughter level according to whether the laughter estimate is equal to or greater than a first threshold; second estimation means for calculating, for the current break, a smile estimate that is the sum or product of a first output value of a second statistical model that takes as input the smile estimate calculated at the previous break, the laughter estimate calculated at the current break, and a second output value of a third statistical model that takes as input the sound pressure in a predetermined period at the beginning of the exhalation paragraph following the current break; second setting means for setting a smile level according to whether the smile estimate is equal to or greater than a second threshold; and generation means for generating an avatar according to the laughter level and the smile level.
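The two estimates in this summary can be written compactly as follows. This is only a notational sketch: the symbols are introduced here for illustration and do not appear in the source. Index $k$ counts breaks, $F_0^{(k)}$ is the fundamental frequency in the exhalation paragraph just before break $k$, $p^{(k+1)}$ is the sound pressure in the predetermined period at the start of the following exhalation paragraph, $f_1, f_2, f_3$ are the first to third statistical models, and $\theta_1, \theta_2$ are the first and second thresholds.

```latex
\begin{aligned}
\ell_k &= f_1\!\left(F_0^{(k)}\right), \qquad f_1 \text{ increasing in } F_0^{(k)},\\
s_k &= f_2\!\left(s_{k-1}\right) + \ell_k + f_3\!\left(p^{(k+1)}\right)
\quad\text{or}\quad
s_k = f_2\!\left(s_{k-1}\right)\cdot \ell_k \cdot f_3\!\left(p^{(k+1)}\right),\\
&\text{laughter level set according to whether } \ell_k \ge \theta_1,
\qquad
\text{smile level set according to whether } s_k \ge \theta_2 .
\end{aligned}
```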

According to the avatar generation apparatus, method, and program of the present invention, an avatar whose expression is natural and changes so as to encourage speech can be generated and displayed.

FIG. 1 is a block diagram of an avatar generation apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the operation of the avatar generation apparatus of FIG. 1.
FIG. 3 is a diagram showing examples of avatar facial expressions generated by the avatar generation unit of FIG. 1.

Hereinafter, an avatar generation apparatus, method, and program according to an embodiment of the present invention are described in detail with reference to the drawings. In the following embodiment, parts given the same reference number perform the same operation, and redundant description is omitted.
First, an outline of the embodiment is given.
Human laughter is broadly divided into voluntary and involuntary laughter. This embodiment models social laughter, which is one kind of voluntary laughter. Social laughter is the smile used to maintain a good relationship with the other party during communication. It is seen, for example, when a counselor tries to make a client feel comfortable talking, or when someone tries to build a good relationship with a person they are meeting for the first time.

Among the kinds of social laughter shown by a listener, this embodiment models two: laughter that shows the listener is listening with interest and thereby encourages speech (the smile of interest), and laughter that expresses a positive evaluation of the speech, namely that it is amusing or that the listener agrees (the laughter of evaluation). The smile of interest arises before the speaker starts speaking or shortly after; it is relatively mild and lasts a long time. The laughter of evaluation arises at, or slightly before, the moment when a chunk of speech uttered in one breath (an exhalation paragraph) ends; it is stronger than the smile of interest and continues until just before, or the moment when, the next utterance begins. In this embodiment, the listener's facial expression is estimated from the speaker's voice.

In the embodiment of the present invention, when a person speaks to or records on a device, the avatar (which corresponds to the listener) returns smiles and laughter in response to the speech, with the aim of making recording easier and letting the speaker record an expressive message. To make speaking easier for the speaker, the avatar returns a smile or laughter reaction when the estimated value of the smile or laughter exceeds a predetermined threshold. An avatar is one representation of a user as a person or alter ego; it is an object that represents the user's incarnation and is created with computer graphics.

Next, the avatar generation apparatus of the embodiment is described with reference to FIG. 1.
The avatar generation apparatus of this embodiment includes a microphone 101, a personal computer (PC) 102, and a display 110. The PC 102 includes a reaction estimation unit 103, an avatar expression determination unit 108, and an avatar generation unit 109. The reaction estimation unit 103 includes an exhalation paragraph prediction unit 104, a laughter estimation unit 105, and a smile estimation unit 106, and the smile estimation unit 106 includes a memory 107.
A device that excludes the microphone 101 and the display 110 and includes the reaction estimation unit 103, the avatar expression determination unit 108, and the avatar generation unit 109 may also be called an avatar generation apparatus.

The microphone 101 acquires the speaker's speech (voice packets).
The exhalation paragraph prediction unit 104 predicts breaks between exhalation paragraphs from the time-series relationship of the on and off states of the audio acquired by the microphone 101. Specifically, it uses a statistical model to predict, in real time, the timing at which an exhalation paragraph ends, based on the on/off rhythm of the speech. Here, an exhalation paragraph is the period from one breath intake to the next. The statistical model is, for example, an MA (moving average) model or an HMM (hidden Markov model). The term "statistical model" is used in this sense below, but it appears in several different calculations, and each may use a different model. Even where the same model is used, different parameters are set for different calculations.
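The on/off-based prediction described above could be prototyped along the following lines. This is a minimal sketch under stated assumptions, not the patented model: the source names MA and HMM models but gives no parameters, so the per-frame threshold `on_threshold`, the window size, and all function names are hypothetical, and a plain moving average over past speech-run lengths stands in for the statistical model.

```python
def to_on_off(frame_levels, on_threshold):
    """Binarize per-frame sound levels: True while speech is present."""
    return [level >= on_threshold for level in frame_levels]


def speech_run_lengths(on_off):
    """Return the lengths (in frames) of completed runs of consecutive 'on' frames."""
    runs, current = [], 0
    for is_on in on_off:
        if is_on:
            current += 1
        elif current:
            runs.append(current)  # an 'off' frame closes the current run
            current = 0
    return runs  # a run still in progress at the end is not included


def predict_run_length(past_runs, window=3):
    """Moving-average guess for the length of the next speech run.

    Assumption: the mean of the last `window` run lengths is used as the
    prediction. Returns None when there is no history yet.
    """
    recent = past_runs[-window:]
    if not recent:
        return None
    return sum(recent) / len(recent)


levels = [0, 5, 5, 0, 5, 5, 5, 5, 0, 5]
on_off = to_on_off(levels, on_threshold=1)
runs = speech_run_lengths(on_off)      # two completed runs: 2 and 4 frames
expected = predict_run_length(runs)    # mean of those two lengths
```

A break would then be anticipated once the speech run in progress approaches `expected` frames. An HMM over the same on/off sequence would be the other option the source mentions.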

The laughter estimation unit 105 models the "laughter of evaluation" described above and determines a laughter level. Specifically, when the timing predicted by the exhalation paragraph prediction unit 104 is a break between exhalation paragraphs, the laughter estimation unit 105 obtains a laughter estimate from a statistical model using the fundamental frequency of the speech in the immediately preceding exhalation paragraph, and compares this estimate with the laughter thresholds. Here, the fundamental frequency refers to the highest of the frequencies that remain after the sound pressure of the speech acquired by the microphone 101 is converted to frequency and noise is removed. The fundamental frequency is calculated from the speech acquired by the microphone 101 by, for example, the laughter estimation unit 105.
In this statistical model, the higher the fundamental frequency, the larger the laughter estimate that the model outputs. The laughter estimation unit 105 stores L laughter thresholds (L is an integer of 1 or more), determines whether the laughter estimate is equal to or greater than each threshold, and sets the laughter level corresponding to the estimate. For example, with L = 2 and laughter threshold 2 > laughter threshold 1, the laughter estimation unit 105 sets the laughter level to laughter level 2 when the laughter estimate is equal to or greater than laughter threshold 2, and to laughter level 1 when the estimate is smaller than laughter threshold 2 and equal to or greater than laughter threshold 1, and passes the set value to the avatar expression determination unit 108. When the laughter estimate is smaller than laughter threshold 1, no laughter level is set, and the laughter estimation unit 105 notifies the avatar expression determination unit 108 of that fact.
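The laughter estimate and its mapping to a level might look like the sketch below. The source only requires that the model output grow with the fundamental frequency, so the logistic curve, its `midpoint_hz` and `scale_hz` parameters, the threshold values, and the function names are all assumptions made for illustration.

```python
import math


def laughter_estimate(f0_hz, midpoint_hz=200.0, scale_hz=50.0):
    """Hypothetical first model: a logistic curve that increases with F0."""
    return 1.0 / (1.0 + math.exp(-(f0_hz - midpoint_hz) / scale_hz))


def level_from_thresholds(value, thresholds):
    """Return the highest level whose threshold is met, or None if none is met.

    `thresholds` is ascending: thresholds[0] is threshold 1, thresholds[1]
    is threshold 2, and so on.
    """
    level = None
    for i, threshold in enumerate(thresholds, start=1):
        if value >= threshold:
            level = i
    return level


# Example with L = 2 (threshold values are made up)
laugh_thresholds = [0.4, 0.7]
level = level_from_thresholds(laughter_estimate(200.0), laugh_thresholds)
```

The same `level_from_thresholds` helper would serve for any number L of thresholds, which matches the source's statement that L is any integer of 1 or more.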

The smile estimation unit 106 models the "smile of interest" described above and determines a smile level. Specifically, the smile estimation unit 106 obtains a smile estimate from three values: the output of a statistical model that takes the immediately preceding M smile estimates as input (M is an integer of 1 or more), the immediately preceding laughter estimate, and the output of a statistical model that takes as input the sound pressure in a predetermined period (for example, 500 ms) at the beginning of the next exhalation paragraph. For example, the smile estimation unit 106 uses the sum or product of these three values as the smile estimate. The statistical models output a larger value as the preceding M smile estimates become larger, and a larger value as the sound pressure in the predetermined period becomes larger. The preceding M smile estimates are the smile estimates for each of the M exhalation paragraphs before the current one. The immediately preceding laughter estimate is the laughter estimate for the exhalation paragraph one before the current one. The memory 107 stores the preceding M smile estimates and the immediately preceding laughter estimate.
The smile estimation unit 106 stores N smile thresholds (N is an integer of 1 or more), determines whether the smile estimate is equal to or greater than each threshold, and sets the smile level corresponding to the estimate. For example, with N = 3 and smile threshold 3 > smile threshold 2 > smile threshold 1, the smile estimation unit 106 sets the smile level to smile level 3 when the smile estimate is equal to or greater than smile threshold 3, to smile level 2 when the estimate is smaller than smile threshold 3 and equal to or greater than smile threshold 2, and to smile level 1 when the estimate is smaller than smile threshold 2 and equal to or greater than smile threshold 1, and passes the set value to the avatar expression determination unit 108. When the smile estimate is smaller than smile threshold 1, the smile estimation unit 106 treats this as neither laughter nor a smile and notifies the avatar expression determination unit 108 of that fact.
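The smile estimate combines three terms by sum or product, which the following sketch illustrates. The source does not specify the second and third models beyond their monotonic behaviour, so the stand-ins here (a mean over the stored smile estimates, and a clipped mean absolute pressure over the onset window) are assumptions, as are all names and the `full_scale` parameter.

```python
from collections import deque


def history_model(prev_smiles):
    """Hypothetical second model: mean of the stored smile estimates (0 if none)."""
    return sum(prev_smiles) / len(prev_smiles) if prev_smiles else 0.0


def onset_pressure_model(pressures, full_scale=1.0):
    """Hypothetical third model: mean absolute pressure over the onset window,
    scaled by `full_scale` and clipped to at most 1."""
    if not pressures:
        return 0.0
    mean_abs = sum(abs(p) for p in pressures) / len(pressures)
    return min(mean_abs / full_scale, 1.0)


def smile_estimate(prev_smiles, prev_laughter, onset_pressures, combine="sum"):
    """Combine the three terms by sum (default) or product."""
    a = history_model(prev_smiles)
    b = prev_laughter
    c = onset_pressure_model(onset_pressures)
    return a + b + c if combine == "sum" else a * b * c


# A bounded deque stands in for memory 107 with M = 3 (sample values are made up)
memory = deque(maxlen=3)
memory.extend([0.25, 0.75])
estimate = smile_estimate(memory, prev_laughter=0.5, onset_pressures=[0.5, -0.5])
memory.append(estimate)  # the oldest entry is dropped once M values are stored
```

With the product form, any term equal to zero suppresses the smile entirely, whereas the sum lets a single strong term raise the estimate; the source allows either.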

The avatar expression determination unit 108 receives the laughter level from the laughter estimation unit 105 and the smile level from the smile estimation unit 106, and selects an expression according to these levels. For example, with L = 2 and N = 3 as above, it selects one of six expressions (see FIG. 3).
The avatar generation unit 109 generates an avatar (a laughing avatar) that shows the listener's smile or laughter in response to the speaker's speech. The avatar's expression is based on the information received from the avatar expression determination unit 108. The display 110 displays the avatar image 151 generated by the avatar generation unit 109.
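One way to read the six expressions for L = 2 and N = 3 is two laughter levels, three smile levels, and one expression with neither. The lookup below sketches that reading. It is an assumption: the expression labels are invented placeholders, and giving laughter precedence over a smile follows the order of the flow described for FIG. 2 rather than an explicit rule in the source.

```python
# Hypothetical expression labels; the actual appearances are those of FIG. 3.
EXPRESSIONS = {
    ("laugh", 2): "big_laugh",
    ("laugh", 1): "laugh",
    ("smile", 3): "broad_smile",
    ("smile", 2): "smile",
    ("smile", 1): "faint_smile",
    ("none", 0): "neutral",
}


def select_expression(laugh_level=None, smile_level=None):
    """Pick one expression; a set laughter level takes precedence over a smile level."""
    if laugh_level:
        return EXPRESSIONS[("laugh", laugh_level)]
    if smile_level:
        return EXPRESSIONS[("smile", smile_level)]
    return EXPRESSIONS[("none", 0)]
```

For instance, `select_expression(None, 1)` returns the faintest smile, and calling it with no levels set returns the neutral expression.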

Next, an example of the operation of the avatar generation apparatus of FIG. 1 is described with reference to FIG. 2.
First, the microphone 101 acquires a voice packet (step S202). Next, the exhalation paragraph prediction unit 104 predicts a break between exhalation paragraphs by exhalation paragraph prediction processing based on a statistical model (step S203). If there is a break between exhalation paragraphs, the laughter estimation unit 105 proceeds to step S205; otherwise, processing returns to step S202. The laughter estimation unit 105 obtains a laughter estimate from a statistical model using the fundamental frequency of the speech in the immediately preceding exhalation paragraph (step S205). The laughter estimation unit 105 determines whether the laughter estimate is equal to or greater than laughter threshold 2 (step S206). If it is, the laughter level is set to 2 (step S212); if the estimate is below laughter threshold 2, processing proceeds to step S207. The laughter estimation unit 105 determines whether the laughter estimate is equal to or greater than laughter threshold 1 (step S207). If it is, the laughter level is set to 1 (step S213); if the estimate is below laughter threshold 1, processing proceeds to step S208. Here, laughter threshold 2 > laughter threshold 1. Once the laughter level has been set in step S212 or step S213, the avatar expression determination unit 108, the avatar generation unit 109, and the display 110 present laughter corresponding to the level (step S217).

The smile estimation unit 106 obtains the smile estimate as the sum or product of three values: the output of a statistical model that takes the preceding three smile estimates as input and outputs the degree to which the current expression is likely to be a smile (the larger the inputs, the larger the output and the stronger the smile); the immediately preceding laughter estimate; and the output of a statistical model that takes as input the sound pressure in the first 500 ms of the next exhalation paragraph (the statistical models judge from the audio whether the listener would be smiling) (step S208). The smile estimation unit 106 determines whether the smile estimate is equal to or greater than smile threshold 3 (step S209). If it is, the smile level is set to 3 (step S214); if the estimate is below smile threshold 3, processing proceeds to step S210. The smile estimation unit 106 determines whether the smile estimate is equal to or greater than smile threshold 2 (step S210). If it is, the smile level is set to 2 (step S215); if the estimate is below smile threshold 2, processing proceeds to step S211. The smile estimation unit 106 determines whether the smile estimate is equal to or greater than smile threshold 1 (step S211). If it is, the smile level is set to 1 (step S216); if the estimate is below smile threshold 1, processing proceeds to step S219. The avatar expression determination unit 108, the avatar generation unit 109, and the display 110 present an expression with neither laughter nor a smile (step S219); when the smile level has been set in step S214, S215, or S216, they present a smile corresponding to the level (step S218). Here, smile threshold 3 > smile threshold 2 > smile threshold 1.
The avatar expression presented for each level is determined by the avatar expression determination unit 108 as shown, for example, in Table 1. The avatar generation unit 109 generates, for example, the avatars shown in FIG. 3.
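The decision cascade of steps S206 to S219 can be sketched as a single function. This is an illustration under assumptions: the threshold values and names are hypothetical, and passing the smile estimate as a callable simply reflects that, in the flow as described, step S208 is reached only when no laughter level was set.

```python
def react_at_break(laugh_est, compute_smile_est,
                   laugh_thresholds=(0.5, 0.8),
                   smile_thresholds=(0.3, 0.6, 0.9)):
    """Return (kind, level) for one predicted break.

    Thresholds are ascending (threshold 1 first); all values are made up.
    """
    # S206, S207: test the laughter thresholds from the highest down
    for level in range(len(laugh_thresholds), 0, -1):
        if laugh_est >= laugh_thresholds[level - 1]:
            return ("laugh", level)      # S212 or S213, then presented in S217
    # S208: the smile estimate is computed only when no laughter level was set
    smile_est = compute_smile_est()
    # S209 to S211: test the smile thresholds from the highest down
    for level in range(len(smile_thresholds), 0, -1):
        if smile_est >= smile_thresholds[level - 1]:
            return ("smile", level)      # S214 to S216, then presented in S218
    return ("none", 0)                   # S219: neither laughter nor a smile
```

Calling `react_at_break(0.6, lambda: 0.0)` yields a level-1 laugh without evaluating the smile model, while `react_at_break(0.1, lambda: 0.65)` yields a level-2 smile.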

According to the embodiment described above, a microphone is provided for inputting the speaker's speech, the captured audio is sent to the laughter estimation unit and the smile estimation unit in the PC, the laughter estimate and the smile estimate are calculated from statistical models, laughter or a smile is determined from the estimates, and the avatar shown on the display expresses it. Because the avatar, which stands in for a listener, naturally returns smiles and laughter in time with the speaker's speech rhythm, entrainment occurs between the speaker and the avatar. Specifically, the speaker finds it easier to speak because the avatar seems to be listening approvingly; the talk becomes livelier because the avatar returns laughter that draws the speaker out in time with the temporal rhythm of the speech; and the speaker's message carries more emotion because the speaker feels that someone is there and listening.

In addition, a prediction model built from the temporal on/off rhythm of the speaker's speech gives the timing at which smiles and laughter arise, and an estimation model built from the fundamental frequency and sound pressure of the speech produces the laughter and smiles. This makes it possible to provide a voice recording interface with which the speaker can speak easily and record an emotionally rich message, together with its smile and laughter estimation model.

Furthermore, because the avatar responds to the speaker's speech with natural, encouraging smiles and laughter, it helps the speaker talk to or record on devices for voice mail, video mail, videophone, and the like.

Moreover, when a person speaks to or records on a device, the avatar returns smiles and laughter in response to the speech, which makes recording easier and allows an expressive message to be recorded. Specifically, the avatar shown on the display returns, in real time and in response to the speaker's speech, the laughter and smiles that people use to entrain one another in communication. The speaker thereby gains the reassurance of being listened to favorably and finds it easier to speak. In face-to-face communication between people, each person's smiles influence the other so that both come to smile (synchronization), and the communication becomes livelier. When the avatar returns a smile in response to the speech, the speaker synchronizes with it and smiles. The speaker feels that the avatar is also in sync, the speech becomes livelier, and the speaker can put emotion into the voice.

Furthermore, the avatar generation apparatus of this embodiment can be incorporated into various systems into which speech is recorded. For example, people generally find that when recording voice mail, answering machine messages, or video messages they become tense, cannot put emotion into their words, or have trouble timing their pauses. By using the avatar generation apparatus of this embodiment when recording such messages and presenting an avatar that returns smiles and laughter in time with the speech, the speaker can record a calm, well-paced, emotionally rich message.

The apparatus can also be applied to real-time voice communication media. On the telephone the other party's face cannot be seen, so cues such as facial expressions that could be observed face to face are missing. This leaves the speaker unsure whether the other party is really listening, understands the conversation, or agrees with it. By displaying the avatar of the present invention on the display of each terminal during an IP phone or mobile phone call, each user can talk to the avatar as if it were the other party, and can speak with the reassurance that the other party is listening, understands, and agrees. As a result, users can speak with more emotion and the conversation becomes livelier.

Application to video conferencing is also possible. In a video conference the remote party is shown on the display, so their expression and nodding can be observed. However, there is little sense of sharing the same space, and although the other party is visible, lines of sight do not meet, so eye contact cannot be made and it is hard to tell whether a message is getting across. As a result, the mutual entrainment that people naturally perform in communication cannot take place in a video conference. The system of the present invention is therefore displayed beside the display in front of each participant. By having the avatar react to the speech of all participants and sharing it among all of them, participants can speak confidently and expressively, and mutual entrainment among participants can take place with the avatar as an intermediary.

The avatar generation apparatus of this embodiment is also effective when recording speech that has no specific addressee, such as readings, announcements, and automated voice response prompts. Because such recordings are made with no other party present, the tempo of the speech depends on the speaker's rhythm alone. In a face-to-face setting a message is delivered on a shared rhythm that involves the receiver as well as the sender; when it depends only on the speaker's own rhythm, the message becomes mechanical. By displaying an avatar with the avatar generation apparatus of the present invention and recording the message while addressing it, a message with human warmth and expression can be recorded.

The instructions in the processing procedures described in the above embodiment can be executed by a software program. A general-purpose computer system that stores this program in advance and reads it can obtain the same effects as the avatar generating apparatus of the above embodiment. The instructions described in the above embodiment are recorded, as a program executable by a computer, on a magnetic disk (flexible disk, hard disk, etc.), an optical disc (CD-ROM, CD-R, CD-RW, DVD-ROM, DVD±R, DVD±RW, etc.), a semiconductor memory, or a similar recording medium. Any storage format may be used as long as the recording medium is readable by a computer or an embedded system. When the computer reads the program from the recording medium and has its CPU execute the instructions described in the program, it can perform the same operations as the avatar generating apparatus of the above embodiment. The computer may, of course, also acquire or read the program over a network.
In addition, an OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute part of each process for realizing this embodiment, based on the instructions of the program installed on the computer or embedded system from the recording medium.
Furthermore, the recording medium in the present invention is not limited to a medium independent of the computer or embedded system; it also includes a recording medium that stores, or temporarily stores, a program downloaded via a LAN, the Internet, or the like.
The number of recording media is not limited to one. The case where the processing of this embodiment is executed from multiple media is also covered by the recording medium of the present invention, and the media may have any configuration.

The computer or embedded system in the present invention executes each process of this embodiment based on a program stored on a recording medium. It may have any configuration, such as a single device (a personal computer, a microcomputer, or the like) or a system in which multiple devices are connected over a network.
The term "computer" in the embodiments of the present invention is not limited to a personal computer. It also covers arithmetic processing units and microcomputers included in information processing equipment, and is a generic term for equipment and devices capable of realizing the functions of the embodiments by means of a program.

The present invention is not limited to the above embodiment as described; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be omitted from the full set shown in the embodiment. Constituent elements from different embodiments may also be combined as appropriate.

DESCRIPTION OF SYMBOLS: 101 ... microphone, 102 ... PC, 103 ... reaction estimation unit, 104 ... exhalation paragraph prediction unit, 105 ... laughter estimation unit, 106 ... smile estimation unit, 107 ... memory, 108 ... avatar expression determination unit, 109 ... avatar generation unit, 110 ... display, 151 ... avatar image.

Claims

An avatar generating device comprising:
prediction means for predicting a break between exhalation paragraphs from the time-series relationship of an utterance voice;
first estimation means for calculating an estimated laughter value corresponding to the current break from the fundamental frequency of the utterance voice in the exhalation paragraph immediately before the break, using a first statistical model that outputs a larger estimated laughter value as the fundamental frequency increases;
first setting means for setting a laughter level based on whether the estimated laughter value is equal to or greater than a first threshold;
second estimation means for calculating, for the current break, an estimated smile value that is the sum or product of a first output value of a second statistical model that takes as input the estimated smile value calculated at the previous break, the estimated laughter value calculated at the current break, and a second output value of a third statistical model that takes as input the sound pressure in a predetermined period at the beginning of the exhalation paragraph following the current break;
second setting means for setting a smile level based on whether the estimated smile value is equal to or greater than a second threshold; and
generating means for generating an avatar according to the laughter level and the smile level.

The avatar generating device according to claim 1, wherein:
the prediction means predicts the break from the on/off rhythm of the utterance using a first MA model;
the first estimation means calculates the estimated laughter value using a second MA model;
the first setting means sets the laughter level to laughter level 1 when the estimated laughter value is equal to or greater than the first threshold;
the generating means generates an avatar corresponding to laughter level 1;
the second estimation means calculates the estimated smile value using a third MA model;
the second setting means sets the smile level to smile level 1 when the estimated smile value is equal to or greater than the second threshold;
the generating means generates an avatar corresponding to smile level 1;
the second setting means sets the smile level to a no-laughter, no-smile level when the estimated smile value is less than the second threshold; and
the generating means generates an avatar with neither laughter nor a smile, corresponding to the no-laughter, no-smile level.
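
The MA models named above can be illustrated with a minimal Python sketch. It assumes that an MA (moving-average) model computes its output as a weighted sum of the most recent inputs. The class name `MovingAverageModel`, the coefficient values, and the on/off sample sequence are hypothetical and are not taken from the embodiment.

```python
from collections import deque
from typing import Deque, Sequence


class MovingAverageModel:
    """Weighted sum of the most recent inputs (assumed form of an MA model)."""

    def __init__(self, coefficients: Sequence[float]) -> None:
        # coefficients[0] weights the newest input, coefficients[1] the one
        # before it, and so on. The values are placeholders, not fitted ones.
        self.coefficients = list(coefficients)
        self.history: Deque[float] = deque(maxlen=len(self.coefficients))

    def update(self, value: float) -> float:
        """Push one new input and return the current model output."""
        # appendleft keeps the newest input at index 0; once the deque is
        # full, the oldest input is dropped from the right automatically.
        self.history.appendleft(value)
        return sum(c * x for c, x in zip(self.coefficients, self.history))


# Hypothetical usage: 1.0 = speech on, 0.0 = speech off, one value per frame.
model = MovingAverageModel([0.5, 0.3, 0.2])
for on_off in (1.0, 1.0, 0.0):
    estimate = model.update(on_off)
```

In such a sketch, each estimate would be compared with the relevant threshold to decide whether a break is predicted or a level is set; how the coefficients are obtained is outside the scope of this example.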

An avatar generation method comprising:
a prediction step of predicting a break between exhalation paragraphs from the time-series relationship of an utterance voice;
a first estimation step of calculating an estimated laughter value corresponding to the current break from the fundamental frequency of the utterance voice in the exhalation paragraph immediately before the break, using a first statistical model that outputs a larger estimated laughter value as the fundamental frequency increases;
a first setting step of setting a laughter level based on whether the estimated laughter value is equal to or greater than a first threshold;
a second estimation step of calculating, for the current break, an estimated smile value that is the sum or product of a first output value of a second statistical model that takes as input the estimated smile value calculated at the previous break, the estimated laughter value calculated at the current break, and a second output value of a third statistical model that takes as input the sound pressure in a predetermined period at the beginning of the exhalation paragraph following the current break;
a second setting step of setting a smile level based on whether the estimated smile value is equal to or greater than a second threshold; and
a generation step of generating an avatar according to the laughter level and the smile level.
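
The estimation, setting, and generation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the three statistical models are passed in as plain functions, the names `process_break`, `BreakResult`, and `choose_avatar` are hypothetical, and checking the laughter level before the smile level when choosing the avatar is an assumption that the claims do not state.

```python
from dataclasses import dataclass
from math import prod
from typing import Callable

Model = Callable[[float], float]  # stand-in for a statistical model


@dataclass
class BreakResult:
    laughter_estimate: float
    laughter_level: int  # 1 = laughter level 1, 0 = no laughter
    smile_estimate: float
    smile_level: int     # 1 = smile level 1, 0 = no smile


def process_break(
    f0_hz: float,                     # fundamental frequency before the break
    onset_sound_pressure: float,      # sound pressure at start of next paragraph
    previous_smile_estimate: float,   # smile estimate from the previous break
    *,
    laughter_model: Model,            # first model (increasing in f0_hz)
    smile_memory_model: Model,        # second model
    onset_model: Model,               # third model
    laughter_threshold: float,
    smile_threshold: float,
    combine: str = "sum",             # "sum" or "product"
) -> BreakResult:
    """Run the estimation and setting steps for one predicted break."""
    laughter_estimate = laughter_model(f0_hz)
    laughter_level = 1 if laughter_estimate >= laughter_threshold else 0

    # The smile estimate combines three terms by sum or by product.
    terms = (
        smile_memory_model(previous_smile_estimate),
        laughter_estimate,
        onset_model(onset_sound_pressure),
    )
    if combine == "sum":
        smile_estimate = sum(terms)
    elif combine == "product":
        smile_estimate = prod(terms)
    else:
        raise ValueError("combine must be 'sum' or 'product'")

    smile_level = 1 if smile_estimate >= smile_threshold else 0
    return BreakResult(laughter_estimate, laughter_level,
                       smile_estimate, smile_level)


def choose_avatar(result: BreakResult) -> str:
    """Pick an avatar expression (assumed priority: laughter, then smile)."""
    if result.laughter_level == 1:
        return "laughing"
    if result.smile_level == 1:
        return "smiling"
    return "neutral"
```

The returned `smile_estimate` would be fed back as `previous_smile_estimate` at the next break, which is how the second model sees the value calculated at the previous break.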

The avatar generation method according to claim 3, wherein:
in the prediction step, the break is predicted from the on/off rhythm of the utterance using a first MA model;
in the first estimation step, the estimated laughter value is calculated using a second MA model;
in the first setting step, the laughter level is set to laughter level 1 when the estimated laughter value is equal to or greater than the first threshold;
in the generation step, an avatar corresponding to laughter level 1 is generated;
in the second estimation step, the estimated smile value is calculated using a third MA model;
in the second setting step, the smile level is set to smile level 1 when the estimated smile value is equal to or greater than the second threshold;
in the generation step, an avatar corresponding to smile level 1 is generated;
in the second setting step, the smile level is set to a no-laughter, no-smile level when the estimated smile value is less than the second threshold; and
in the generation step, an avatar with neither laughter nor a smile, corresponding to the no-laughter, no-smile level, is generated.

A program for causing a computer to execute each process of the avatar generation method according to claim 3 or claim 4.