JP2011167820A

JP2011167820A - Head part action control information generating device

Info

Publication number: JP2011167820A
Application number: JP2010035614A
Authority: JP
Inventors: Toshinori Ishii Carlos; イシイ・カルロス・トシノリ; Chaoran Liu; 超然劉; Hiroshi Ishiguro; 浩石黒; Norihiro Hagita; 紀博萩田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2010-02-22
Filing date: 2010-02-22
Publication date: 2011-09-01
Anticipated expiration: 2030-02-22
Also published as: JP5515173B2

Abstract

PROBLEM TO BE SOLVED: To provide a head motion control information generating device that controls the motion of a robot's head so as to establish smoother communication between the robot and a human.

SOLUTION: The head motion generating device 86 generates control information for controlling the motion of the head of a humanoid robot in synchronization with the voice produced by the robot. It includes a probability model group 100 that defines, for each discourse function tag attached to each phrase, the probabilities with which a plurality of head motions are executed, and a head motion command generating part 104 that selects a probability model from the probability model group 100 based on the annotation attached to the input phrase and outputs, to a control part 90 of the robot, a head motion command corresponding to the input voice data of the predetermined unit with a probability according to the selected probability model.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to improving communication between a robot and a human, and more particularly to a technique for moving the head of a humanoid robot naturally in accordance with the content of its utterances.

While speaking, humans naturally move their heads. These movements are sometimes made intentionally to convey a clear meaning to the other party. For example, nodding expresses agreement, and shaking the head expresses disagreement. In many cases, however, head movements are unconscious. Such movements can sometimes convey more information to the other party than the speaker intended. Therefore, if a humanoid robot can move its head appropriately according to the content of what it is saying, communication with the other party can be expected to become smoother.

One technique of this kind is described in Patent Document 1. Patent Document 1 discloses a technique in which, for a speaking robot or the like, a signal that determines the timing of nods is attached to the audio signal in advance, and when the timing signal for nodding is detected from the audio signal, the robot's neck is controlled so that the robot nods in accordance with that signal.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2009-069789 (JP2009-069789)

As Patent Document 1 also notes, moving the robot's head in time with its utterances is important for effective communication between a human and a robot. However, no suitable method has yet been proposed for how the robot's head should be moved so that the movement feels natural to humans and makes the meaning of the robot's utterances easier to understand. The technique described in Patent Document 1 only mentions programming the head movement in advance, and does not disclose how the head should be moved for the movement to be perceived as natural.

Therefore, an object of the present invention is to provide a head motion control information generating device that controls the movement of a robot's head so that communication between the robot and a human can proceed more smoothly.

Another object of the present invention is to provide a head motion control information generating device that varies the movement of the robot's head in a manner suited to the content of the utterance, so that the movement feels natural.

A head motion control information generating device according to one aspect of the present invention generates control information for controlling the movement of the head of a humanoid robot in synchronization with the voice produced by the robot. The voice is defined by voice data divided in advance into predetermined units. Each predetermined unit carries an annotation that includes a discourse function tag indicating the discourse function of the speech uttered from the voice data assigned to that unit. The device includes: probability model storage means for storing, for each discourse function tag, a probability model that defines the probabilities with which a plurality of head motions are executed; probability model selection means for receiving a predetermined unit of the voice data and selecting, from among the probability models stored in the probability model storage means, the probability model determined by the annotation attached to that unit; and head command generation means for generating, with a probability according to the probability model selected by the probability model selection means, a head motion command corresponding to the input predetermined unit of voice data, and outputting it to a control unit of the humanoid robot.

Even when the discourse function tags attached to predetermined units of the voice data are the same, the head command is selected by a probability model, so the head movement of the humanoid robot differs from case to case. Compared with always making the same movement in the same situation, this reduces movements that feel unnatural. By preparing the probability models appropriately, the robot's head can be controlled in an appropriate and natural way according to the discourse function tag. As a result, it is possible to provide a head motion command generation device that controls the movement of the robot's head so that communication between the robot and a human proceeds more smoothly, and a head motion command generation device that varies the movement of the robot's head in a manner suited to the content of the utterance so that it feels natural.
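The selection step described above can be illustrated with a minimal sketch. It assumes each probability model is a plain mapping from head motion tag to probability; the table `PROBABILITY_MODELS`, the function `select_head_motion`, and all numeric values are hypothetical and are not taken from the patent.

```python
import random

# Hypothetical probability models: one distribution over head motion tags
# per discourse function tag. The probabilities are illustrative only.
PROBABILITY_MODELS = {
    "bc": {"nd": 0.5, "mnd": 0.3, "no": 0.2},
    "k": {"nd": 0.4, "no": 0.6},
    "f": {"no": 0.9, "ti": 0.1},
}


def select_head_motion(discourse_tag, rng=random):
    """Draw one head motion tag according to the model for discourse_tag."""
    model = PROBABILITY_MODELS[discourse_tag]
    motions = list(model)
    weights = [model[m] for m in motions]
    # random.choices performs weighted sampling with replacement.
    return rng.choices(motions, weights=weights, k=1)[0]
```

Because the motion is drawn at random, two phrases carrying the same tag can yield different head motions, which is the variation the text describes.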

Preferably, the voice data carries discourse mode information that specifies a discourse mode, consisting of any combination of speaker specifying information that identifies the speaker, relationship specifying information that identifies the relationship between the speaker and the conversation partner, and non-linguistic information at the time of utterance. The probability model selection means stores a plurality of probability models prepared in advance for combinations of discourse mode information and discourse function tag. The probability model selection means includes means for selecting one of the plurality of probability models according to the discourse mode information attached to the voice data and the input predetermined unit.

Experiments showed that, in dialogue between humans, head movement differs depending on the speaker, the relationship between the speaker and the conversation partner, and circumstances related to non-linguistic information such as the speaker's attitude and emotion at the time of utterance. Therefore, by preparing probability models for these combinations in advance and selecting head motions according to the probability model that matches the discourse mode at the time of utterance, more information than the voice alone can be conveyed to the conversation partner.
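As a rough illustration of models keyed by discourse mode and discourse function tag, the following sketch uses a named tuple as the mode. The type `DiscourseMode`, the table `MODELS`, the function `select_model`, and the probabilities are assumptions made for this example; the speaker code and relationship labels only mirror those used in the experiment.

```python
from typing import NamedTuple, Optional


class DiscourseMode(NamedTuple):
    """Hypothetical discourse mode: any field may be left unspecified."""
    speaker: Optional[str] = None
    relation: Optional[str] = None
    nonverbal: Optional[str] = None


# Hypothetical models keyed by (discourse mode, discourse function tag).
# Probabilities are illustrative, not measured values.
MODELS = {
    (DiscourseMode(speaker="FMH", relation="close"), "bc"):
        {"nd": 0.6, "no": 0.4},
    (DiscourseMode(speaker="FMH", relation="far"), "bc"):
        {"mnd": 0.6, "nd": 0.3, "no": 0.1},
}


def select_model(mode, discourse_tag):
    """Return the probability model for this mode and discourse tag."""
    return MODELS[(mode, discourse_tag)]
```

A named tuple is hashable, so the combination of mode and tag can serve directly as a dictionary key.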

More preferably, the head command generation means includes: head motion determination means for determining, with a probability according to the probability model selected by the probability model selection means, whether to generate a head motion corresponding to the input predetermined unit of voice data; shape information storage means for storing, for each of the plurality of head motions, shape information that defines the temporal movement of the head corresponding to that head motion; and shape reading means for, in response to the head motion determination means having determined that some head motion is to be generated, reading the shape information corresponding to that head motion from the shape information storage means at a timing predetermined by the type of head motion, generating a control command for controlling the robot's head, and outputting it to the control unit.

The timing at which a head motion is generated can be determined by simple rules. Communication with the conversation partner can be made smoother by a simple process: shape information representing the temporal movement that realizes each head motion is stored in advance, and a control command is generated from that shape information at the determined timing.
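The process described above, reading a stored shape and emitting commands at a rule-determined timing, might be sketched as follows. The sample period, the shapes in `SHAPES`, the timing rules in `START_RULE`, and the function `make_commands` are all hypothetical; the patent text here does not specify concrete values or rules.

```python
# Hypothetical shape information: pitch angle in degrees (upward positive),
# sampled at a fixed period. Values are illustrative only.
SAMPLE_PERIOD_S = 0.1
SHAPES = {
    "nd": [0.0, -5.0, -10.0, -5.0, 0.0],  # single nod: down, then back up
    "fu": [0.0, 4.0, 8.0, 8.0],           # face up
}

# Hypothetical timing rule: where in the phrase each motion is placed.
START_RULE = {"nd": "phrase_end", "fu": "phrase_start"}


def make_commands(motion, phrase_start_s, phrase_end_s):
    """Return (time in seconds, pitch angle) commands for one head motion."""
    shape = SHAPES[motion]
    if START_RULE[motion] == "phrase_end":
        # Align the last sample of the shape with the end of the phrase.
        t0 = phrase_end_s - SAMPLE_PERIOD_S * (len(shape) - 1)
    else:
        t0 = phrase_start_s
    return [(round(t0 + i * SAMPLE_PERIOD_S, 3), angle)
            for i, angle in enumerate(shape)]
```

For example, `make_commands("nd", 1.0, 2.0)` places the five samples of the nod so that the head returns to neutral at the end of the phrase.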

More preferably still, the movement of the robot's head relates to the pitch angle among the rotations about the three axes of the head. The head command generation means further includes: speech-time head shape storage means for storing a shape of head motion, prepared in advance, that defines the temporal movement of the head while the robot is speaking; and means, connected between the shape reading means and the control unit, for superimposing the head motion shape read from the speech-time head shape storage means on the control command output by the shape reading means and supplying the result to the control unit, in response to the voice data indicating that the robot is in a speaking period.

The experiments showed that, for the pitch angle in particular, a speaker makes a specific movement while speaking. This movement may, however, differ from speaker to speaker. Accordingly, with a configuration that stores a specific head motion shape prepared in advance, the head movement characteristic of the assumed speaker during speech can be reproduced.
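The superimposition described above could look like the following sketch, which assumes that both the command angles and the speech-time shape are pitch angles sampled at the same rate. The function `superimpose` is hypothetical, and repeating the speech-time shape cyclically for the duration of the utterance is an assumption of this example, not something stated in the text.

```python
def superimpose(command_angles, speaking_flags, speech_shape):
    """Add the speech-time pitch shape to the commands while speaking.

    command_angles: pitch angles from the shape reading step
    speaking_flags: True for samples that fall in a speaking period
    speech_shape:   stored speech-time pitch shape (assumed to repeat)
    """
    out = []
    k = 0  # position within the speech-time shape
    for angle, speaking in zip(command_angles, speaking_flags):
        if speaking:
            out.append(angle + speech_shape[k % len(speech_shape)])
            k += 1
        else:
            out.append(angle)
            k = 0  # restart the shape at the next speaking period
    return out
```

Samples outside a speaking period pass through unchanged, so nods placed between utterances are not affected.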

The movement of the robot's head may relate to rotation about any combination of the three axes of the head. In that case, the probability model storage means, the probability model selection means, and the head command generation means each operate independently for every axis in the combination, so that head control commands are generated independently per axis.

According to the present invention, the configuration described above makes it possible to provide a head motion command generation device that controls the movement of the robot's head so that communication between the robot and a human proceeds more smoothly. It is also possible to provide a head motion command generation device that varies the movement of the robot's head in a manner suited to the content of the utterance so that it feels natural.

FIG. 1 is a diagram showing the setup used for motion capture of the human head, carried out in order to realize the embodiments of the present invention.
FIG. 2 is a graph showing the relationship among the subjects' head movements obtained in the experiment, the relationship between each subject and the conversation partner, and the content of the utterances.
FIG. 3 is a graph showing the relationship among the subjects' head movements obtained in the experiment, the relationship between each subject and the conversation partner, and the content of the utterances.
FIG. 4 is a diagram showing the arrangement of the actuators that realize the head movement of the robot controlled in the first embodiment.
FIG. 5 is a functional block diagram of the head motion generating device according to the first embodiment of the present invention.
FIG. 6 is a schematic diagram showing, as a block-diagram-style flowchart, the program that realizes the head motion position generating unit shown in FIG. 5.
FIG. 7 is a graph showing the change over time of the nod angle (upward positive) used in the first embodiment of the present invention to realize a nod of the robot's head.
FIG. 8 is a graph showing the change over time of the nod angle used in the first embodiment of the present invention to realize a sequence of multiple nods.
FIG. 9 is a graph showing the change over time of the robot's nod angle while the robot is speaking, as used in the first embodiment of the present invention.
FIG. 10 is a graph showing how the change over time of the nod angle is generated when head motion commands are generated from an utterance in the first embodiment of the present invention.
FIG. 11 is a diagram showing, as a block-diagram-style flowchart, the program that realizes the head motion position generating unit of the head motion generating device according to the second embodiment of the present invention.

In the following description and drawings, the same parts are denoted by the same reference numerals. Their names and functions are also the same. Detailed description of them is therefore not repeated.

[Experiment]
In realizing the embodiments of the present invention described later, experiments were conducted on the relationship between human speech and head movement, and on how a conversation partner interprets human head movement. From the results, a model with a relatively simple structure was obtained, as described later, and this made it possible to realize the robot head motion generating device according to the embodiments of the present invention.

<Experimental setup>
─Data─
Seven subjects (four male and three female speakers) took part in the experiment. Table 1 lists these subjects, the partners they talked with (conversation partners), and the relationship between each conversation partner and subject.

For several combinations of these speakers and conversation partners, a number of sessions of free conversation (10 to 15 minutes each) were recorded. A total of 19 conversation sessions were recorded. The number of sessions for each pairing was FMH × FKH (2), FKN × FKH (2), FMH × MHI (1), FKN × MHI (1), FMH × MSN (1), FKN × MSN (1), FMH × MIT (3), FKN × MIT (3), and FMH × MSR (5).

Audio and motion data were recorded simultaneously for both participants in each conversation. The distance between the participants was set as short as possible while still allowing motion capture of both at the same time. As a result, the distance between them was 1 m. Directional microphones were used for recording, with one microphone pointed at each participant.

Seven hemispherical passive reflective markers were attached to each speaker's head, and head motion data were acquired with a commercially available infrared motion capture system. FIG. 1 shows the positions of the reflective markers. Referring to FIG. 1, five markers 32, 34, 36, 38 and 40 were attached around the head of speaker 30. In addition, a marker 42 was attached to the tip of the nose of speaker 30 and a marker 44 to the chin. The head and nose markers 32 to 42 were used to obtain the head coordinates, and the position of the chin marker 44 (relative to the nose marker 42) was used to associate the motion data with the speech data.

Head movement was expressed using the three axes (X, Y and Z) shown on the right side of FIG. 1. "Nodding", "head shaking" and "head tilting" refer to rotating the head about the X axis, the Z axis and the Y axis, respectively, in the coordinate system of FIG. 1. These correspond to what are called pitch, yaw and roll, respectively, in aeronautical engineering. The rotation angles about these axes are called the pitch angle, the yaw angle and the roll angle, respectively.

The rotation angles of the head were calculated from the marker coordinates based on the singular value decomposition introduced in M.B. Stegmann, D.D. Gomez, "A brief introduction to statistical shape analysis," http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/403/pdf/imm403.pdf, 2002. That is, the combination of a unitary matrix U, a diagonal matrix D and a unitary matrix V is obtained by the singular value decomposition expressed by the following equation (1).

Here, reference and target denote the 3D marker positions at the neutral position and the current position, respectively, translated into a new coordinate system whose origin is their centroid. The neutral position was obtained with the subject facing straight ahead. The matrix representing the rotation is obtained by the following equation (2).

From the elements of this rotation matrix R, the rotation angles are calculated by the following equations (3) to (5).

Here, sqrt, ^ and atan2 are the square-root function, the power operator and the arctangent function of Matlab (registered trademark), respectively.
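Equations (1) to (5) are not reproduced in this text. As a hedged illustration only, the following sketch shows one common way to recover three angles from a rotation matrix using sqrt, squaring and atan2. It assumes the rotation is composed as Rz(yaw) · Ry(roll) · Rx(pitch), following the axis assignment given above (X for pitch, Z for yaw, Y for roll); the patent's actual convention and formulas may differ. The singular value decomposition step is omitted and would normally be delegated to a linear algebra library. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_angles(pitch, yaw, roll):
    """Build R = Rz(yaw) * Ry(roll) * Rx(pitch); angles in radians."""
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(roll), math.sin(roll)
    cz, sz = math.cos(yaw), math.sin(yaw)
    rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]
    return matmul(rz, matmul(ry, rx))


def angles_from_rotation(r):
    """Recover (pitch, yaw, roll) in radians under the assumed convention.

    Valid while the roll angle stays strictly within +/- 90 degrees.
    """
    pitch = math.atan2(r[2][1], r[2][2])
    roll = math.atan2(-r[2][0], math.sqrt(r[2][1] ** 2 + r[2][2] ** 2))
    yaw = math.atan2(r[1][0], r[0][0])
    return pitch, yaw, roll
```

Building a matrix from known angles and recovering them again is a quick way to check that the extraction matches the assumed composition order.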

The speech data obtained in this way were manually segmented into phrases and transcribed by native speakers of Japanese. The segmentation yielded 16,920 phrases in total.

─Head motion tags─
For the portions of the Japanese dialogues considered to carry some meaning, the head movement at that point was manually annotated with the following head motion tags.

・no: no movement
・nd: single nod (face down, then face up)
・mnd: multiple nods occurring with the phrase
・fd: face down
・ud: face up, then face down (once)
・fu: face up
・ti: head tilt within the phrase
・sh: head shake within the phrase (left-right movement)
A nod is not always expressed by an up-down movement of the head. For example, slightly tilting the face can also be perceived as a nod. After annotating in this way, the most pronounced movement was considered.

Based on the changes over time of the three angles and on the video images, one subject annotated the speech information with head movement labels from the head motion tag set above. Another subject checked and corrected these labels. Both subjects are assistants of the present inventors. Of the annotations made by the first subject, 5% were corrected by the second subject.

Table 2 shows the distribution of annotations for each head motion tag. Nodding (the sum of nd and mnd) is the most frequent head movement. The "other" column covers phrases with movements that the subjects could not classify with confidence.

─Discourse function tags─
Each phrase in the data set was annotated with a discourse function tag selected from the following set. In doing so, affirmative and negative reactions, emotional expressions of surprise or unexpectedness, turn-taking and similar functions were considered.

・k (keep): the speaker is holding the turn. A short pause or a clear pitch reset occurs at a strong phrase boundary.
・k2 (keep): a weak phrase boundary in the middle of an utterance (no pause between phrases).
・k3 (keep): the speaker lengthens the end of the phrase, for example while thinking, but still holds the turn (a pause may or may not follow).
・f (filler): the speaker is thinking about or preparing the next utterance. Examples: "uun", "eeto", "anoo" (roughly "hmm", "um", "er").
・f2 (conjunctions): can be treated in the same way as a filler that is not lengthened. Examples: "dakara" (so), "jaa" (well then), "de" (and).
・g (give: yielding the turn): the speaker finishes the utterance and hands the turn to the conversation partner.
・q (question): the speaker asks the conversation partner a question or seeks agreement.
・bc (backchannel): the speaker gives a backchannel response (a reaction of agreement) to the conversation partner, for example saying "un" (yes) while nodding.
・su (surprise / unexpectedness / admiration): the speaker shows an expressive reaction (surprise, unexpectedness, admiration) to the conversation partner. Examples: "hee", "uso!" (no way!), "aa".
・dn (denial, negation): "iie" or "uun" (no), accompanied by a head shake.

The classification above is not exhaustive, but it is considered sufficient from the standpoint of human-robot communication.
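For illustration, the two tag sets described in this section can be held in a small data structure that validates an annotated phrase. The class `AnnotatedPhrase` and the constant names are hypothetical; the "other" category used in the tables is deliberately left out of this sketch.

```python
from dataclasses import dataclass

# Tag sets as listed in the text ("other" is omitted in this sketch).
HEAD_MOTION_TAGS = {"no", "nd", "mnd", "fd", "ud", "fu", "ti", "sh"}
DISCOURSE_TAGS = {"k", "k2", "k3", "f", "f2", "g", "q", "bc", "su", "dn"}


@dataclass(frozen=True)
class AnnotatedPhrase:
    """One phrase with its discourse function tag and head motion tag."""
    text: str
    discourse_tag: str
    head_motion_tag: str

    def __post_init__(self):
        # Reject tags that are not part of the annotation scheme.
        if self.discourse_tag not in DISCOURSE_TAGS:
            raise ValueError(f"unknown discourse tag: {self.discourse_tag}")
        if self.head_motion_tag not in HEAD_MOTION_TAGS:
            raise ValueError(f"unknown head motion tag: {self.head_motion_tag}")
```

A backchannel "un" accompanied by a single nod would be written as `AnnotatedPhrase("un", "bc", "nd")`.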

After the first subject had annotated each phrase with a discourse function tag, the second subject checked and corrected the annotations. These are the same subjects who made the head motion tag annotations. The second subject corrected 5.9% of all annotations.

Table 3 shows the distribution of annotations for each discourse function tag. In Table 3, "other" includes phrases for which the subjects could not decide which discourse function tag to assign. It also includes interjections in a subcategory of g (yielding the turn), such as greetings, and interjections other than bc (simple backchannels) and su (surprise / unexpectedness / admiration).

─Head motion and discourse function─
FIG. 2 and FIG. 3 show the distribution of head motions for each discourse function. The graphs are divided into groups according to the relationship between each speaker and the conversation partner (shown on the x-axis at the bottom of each figure). The y-axis shows the probability of occurrence of each head motion tag.
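Distributions of this kind can be estimated from annotated phrases by simple counting. The sketch below assumes each annotation is a tuple of relationship group, discourse function tag and head motion tag; the function `estimate_models` and this data layout are hypothetical and are not taken from the patent.

```python
from collections import Counter, defaultdict


def estimate_models(annotations):
    """Estimate P(head motion tag | relationship group, discourse tag).

    annotations: iterable of (relation, discourse_tag, head_motion_tag).
    Returns a dict mapping (relation, discourse_tag) to a dict of
    relative frequencies over head motion tags.
    """
    counts = defaultdict(Counter)
    for relation, discourse_tag, head_tag in annotations:
        counts[(relation, discourse_tag)][head_tag] += 1
    models = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        models[key] = {tag: n / total for tag, n in counter.items()}
    return models
```

Each resulting dictionary sums to one for its group and could serve as a probability model for that combination of relationship and discourse function.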

Referring to FIG. 2 and FIG. 3, single nods (nd) and multiple nods (mnd) occur with high frequency in backchannels (bc). Nods were also frequently observed at strong phrase boundaries (k, g, q). Questions (q) are usually pronounced with a rise at the end of the phrase, yet nods were observed more often than face-up then face-down movements or face-up movements. This is thought to be the reason why the correlation between voice pitch (F0 contour) and head motion is low.

Nods are observed less frequently at weak phrase boundaries in the middle of an utterance (k2) and at phrase boundaries where the speaker is thinking or indicating that the utterance is not yet finished (k3, f, f2). In these discourse function categories, the state with no head movement (no) accounts for most cases.

A closer look at the sequences of multiple nods (mnd) showed that these movements tend to occur throughout the whole utterance when the speaker is expressing strong agreement with, understanding of, or interest in what the conversation partner is saying.

Phrases expressing surprise, unexpectedness, admiration and the like (su) occurred less frequently in the data. For these phrases, no head movement (no), face-up movement (fu) and head tilt (ti) accounted for most cases.

─Variation within and between speakers─
The experiment also revealed that the frequency of head motion differs from speaker to speaker.

For example, two of the four male speakers (MSN and MHI) showed far less head motion than the others. This may be because the social status of these speakers is higher than that of their conversation partners (research assistants). Another possible factor is the age difference between the speaker and the conversation partner.

Looking at variation within the same speaker, the frequency of head movement appears to be affected by the personal relationship with the conversation partner. In FIG. 2 and FIG. 3, the relationship between speaker and conversation partner is divided into three levels: "close" (family, boyfriend), "far" (differences in generation and social status), and "middle" (friend, friend of a friend, relative of a friend). To distinguish "middle" from "far", not only social status but also the actual relationship between the individuals was taken into account, for example whether the speaker and the conversation partner were meeting for the first time.

The above results further show that the frequency of head movement is considerably lower when the speaker is in a close relationship with the conversation partner. For example, when speaker FMH was talking with her mother or her boyfriend (FMH × "close"), no head movement (no) occurred with high frequency for g, q, bc and k, and nods (nd, mnd) occurred only infrequently. Similarly, for backchannels (bc), most movements in the FMH × "close" combination were single nods (nd), whereas multiple nods (mnd) became more frequent when FMH was talking with someone she was meeting for the first time. This can probably be explained by the fact that head movements, nodding in particular, are used to express attitude, for example to show interest in what the conversation partner is saying. With family members, behavior tends to be less attentive, and head movement decreases as a result. For turn-taking utterances (k), nods occurred frequently in the FMH × "middle" combination, but less frequently in the FMH × "far" combination. This can be explained by FMH restraining herself when talking with MSN and MHI ("far" partners), while speaking with more confidence when talking with MIT (a "middle" partner), who is a friend of a friend and close to her in age.

For speaker FKN, when talking with FKH, MHI, and MSN, multiple nods (mnd) were more frequent than single nods (nd) for backchannels (bc). This is probably because these partners were older than FKN and FKN was meeting them for the first time. In contrast, single nods were predominant when FKN talked with the friend MIT. For k and g, no particularly dominant movement was observed when talking with FKH, MHI, and MSN. As in the case of FMH, this can be explained by FKN not asserting herself toward partners met for the first time (FKH, MHI, and MSN) and being more confident when talking with MHI.

For speaker FKH, no head movement (no) was more frequent when talking with the daughter (FMH) than when talking with the daughter's friend (FKN). It is also notable that multiple nods (mnd) appeared more frequently when talking with FKN.

<Rule-Based Nod Generation>
The above results show that the nod is the most common head movement during conversation, and that the frequency of other head movements also shows certain tendencies depending on the conversation partner and the discourse function. The following embodiments therefore propose a rule-based head motion generation model and a device that uses it to generate head motion commands for a robot. The description below mainly concerns head motion, nodding in particular, but head motion commands for head shakes, head tilts, and the like can be generated in the same way.

First, consider a very simple probabilistic model that generates only the timing of the robot's nods from the utterance data given to the robot. In this model, a nod is generated at the center of the final syllable of utterances with strong phrase boundaries (k, g, q) and of backchannels (bc).
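The timing rule of this simple model can be sketched as follows. This is only an illustration: the function name `nod_time` and the representation of the final syllable as a start/end pair in seconds are assumptions, not part of the original description.

```python
# Discourse function tags for which the simple model generates a nod.
NOD_TAGS = {"k", "g", "q", "bc"}


def nod_time(tag, last_syllable_start, last_syllable_end):
    """Return the nod time (seconds) at the center of the final syllable.

    Returns None when the tag is not one of the nod targets.
    The argument layout is a hypothetical choice for illustration.
    """
    if tag not in NOD_TAGS:
        return None
    return (last_syllable_start + last_syllable_end) / 2.0
```

For a phrase tagged k whose final syllable spans 1.0 s to 1.5 s, the nod would be placed at 1.25 s.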

Such a model is obtained as a probabilistic model by statistically processing experimental results like those shown in FIGS. 2 and 3. The experiment above restricted the candidate factors in order to clarify the relationship between possible factors and actual head movement. By the same approach, similar statistics can be obtained for information thought to relate to non-linguistic information (information conveyed by means other than spoken language), such as the attitude or emotion that should accompany each phrase (or the whole utterance), by labeling each phrase of the conversation recordings in advance.

In this embodiment, a probabilistic model is prepared in advance for each combination of speaker, relationship with the conversation partner, and attitude/emotion. Each model gives the probability with which each nod-related head motion is obtained from a discourse function category. (For brevity, any such combination is called a "discourse mode" here, and the label attached to speech data to specify the discourse mode is called "discourse mode information".) These probabilistic models can readily be obtained by statistically processing experimental results like those described above. Each head motion is accompanied by information indicating the syllable position within the phrase at which the motion is performed.

In the following embodiments, which use such probabilistic models, the utterance data given to the robot includes speech data, phrase segmentation information for the speech, a sequence of phoneme labels delimited by phrase, and the discourse function tag assigned to each phrase. Each phoneme label also carries information specifying the time within the phrase at which the corresponding speech is uttered. The utterance data further includes information on the relationship between the speaker and the conversation partner ("close", "distant", or "intermediate") and discourse mode information indicating the attitude, emotion, and the like that should accompany each phrase.
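One possible in-memory layout for this utterance data is sketched below. All class and field names are hypothetical; the description lists which pieces of information are included but does not prescribe a data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phoneme:
    label: str    # phoneme label
    start: float  # utterance time within the phrase, in seconds (assumed unit)
    end: float


@dataclass
class Phrase:
    start: float            # phrase segmentation: start time in the audio
    end: float              # phrase segmentation: end time in the audio
    phonemes: List[Phoneme]
    function_tag: str       # discourse function tag, e.g. "k", "g", "q", "bc"
    mode_info: str          # attitude/emotion to accompany this phrase


@dataclass
class UtteranceData:
    audio_path: str         # location of the speech data (assumed to be a file)
    speaker: str            # speaker assumed for the robot's voice
    relationship: str       # "close", "distant", or "intermediate"
    phrases: List[Phrase] = field(default_factory=list)
```

Keeping the discourse function tag and the attitude/emotion label on each phrase mirrors the description, where both are assigned phrase by phrase.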

Those skilled in the art will readily understand that such probabilistic models can be obtained not only for nods but also for head shakes, head tilts, and the like. In that case, the timing at which the head motion starts must also be obtained from experimental results. For example, the nod described above is generated on the final syllable, whereas a head shake, like multiple nods, is known to continue over the whole utterance. Such information must also be acquired in advance by experiment.

Once it has been decided whether a given head motion (a nod) is to be generated and at what timing, the shape of that motion, prepared in advance, is superimposed at that timing on the control signal for head control.

Although not mentioned in the experiment above, experiments also showed that humans raise the face slightly (look upward) while speaking. The angle is about 3 degrees in terms of nod angle (upward positive). The following embodiments also incorporate this face-raising motion during speech.

[First Embodiment]
<Configuration>
Referring to FIG. 4, the drive system of head 60 of robot 50 to be controlled includes a motor 62 that rotates head 60 about the y axis, a motor 64 that rotates it about the x axis, and a motor 66 that rotates it about the z axis. Motor 64 is driven for a nod, motor 66 for a head shake, and motor 62 for a head tilt.
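The correspondence between motion types, motors, and rotation axes stated above can be written as a small lookup table. The dictionary names and the motion keys ("nod", "shake", "tilt") are illustrative assumptions; only the reference numerals and axes come from the description.

```python
# Motor reference numeral for each head motion type (from FIG. 4).
MOTOR_FOR_MOTION = {"nod": 64, "shake": 66, "tilt": 62}

# Rotation axis driven by each motor (from FIG. 4).
AXIS_FOR_MOTOR = {62: "y", 64: "x", 66: "z"}


def axis_for(motion):
    """Return the rotation axis used by the given motion type."""
    return AXIS_FOR_MOTOR[MOTOR_FOR_MOTION[motion]]
```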

Referring to FIG. 5, robot 80 according to this embodiment includes: a storage device 82 for storing a probability model group 100 consisting of probabilistic models obtained by statistically processing experimental results like those described above; an utterance data storage device 88 storing the utterance data that robot 80 is to utter; a head motion generation device 86 that, given utterance data, uses the discourse function tag attached to each phrase of the utterance data and the probability model group 100 stored in storage device 82 to generate head motion commands controlling the head motion of the robot; a speech reproduction unit 92 for reproducing speech, based on the utterance data stored in utterance data storage device 88 and the head motion commands from head motion generation device 86, so that the head motion is synchronized with the speech; and a robot control unit 90 for controlling motors 62, 64, and 66 shown in FIG. 4 in response to the head motion commands from head motion generation device 86. Parts of robot 80 unrelated to head motion are omitted from FIG. 5 for clarity.

Probability model group 100 includes probabilistic models 130, ..., 132, each describing, for a predetermined combination of speaker, relationship with the conversation partner, and attitude/emotion, which head motion is obtained with what probability for a given discourse function tag. For example, probabilistic model 130 describes which head motions are obtained with what probabilities for each discourse function tag, for the combination of speaker A, a "close" relationship with the partner, and a "normal" attitude. In this embodiment, taking nods as an example, nods were frequent for the discourse function tags k, g, q, and bc and infrequent otherwise. A nod is therefore generated only when the discourse function tag is one of k, g, q, and bc, and no head motion is generated for other tags. Accordingly, probabilistic model 130 in probability model group 100 contains probabilistic models only for these four discourse function tags.
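A probability model group of this kind could be represented as a nested mapping, as in the sketch below. The dictionary layout and the function `select_distribution` are assumptions, and every probability value is a made-up placeholder, not data from the experiments.

```python
# Key: (speaker, relationship, attitude). Value: per-tag distribution over
# head motions ("no" = no movement, "nd" = single nod, "mnd" = multiple nods).
# All numbers are illustrative placeholders only.
MODEL_GROUP = {
    ("A", "close", "normal"): {
        "k": {"no": 0.6, "nd": 0.3, "mnd": 0.1},
        "g": {"no": 0.7, "nd": 0.2, "mnd": 0.1},
        "q": {"no": 0.7, "nd": 0.2, "mnd": 0.1},
        "bc": {"no": 0.2, "nd": 0.6, "mnd": 0.2},
    },
}


def select_distribution(group, speaker, relationship, attitude, tag):
    """Return the motion distribution for a tag, or None if no model applies."""
    model = group.get((speaker, relationship, attitude))
    if model is None:
        return None
    # Tags other than k, g, q, bc have no entry, so no head motion is generated.
    return model.get(tag)
```

Returning None for tags without an entry reflects the rule that head motion is generated only for the four listed tags.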

Head motion generation device 86 includes: a head motion position generation unit 102 that reads utterance data from utterance data storage device 88, determines for each phrase, based on the discourse function tag attached to the phrase, whether to generate a rotation about each of the x, y, and z axes and, if so, at what timing, and outputs, for the three axes, information indicating the timing at which motion is to be superimposed on the robot's head motion commands; storage devices 108, 110, and 112 for storing, respectively, the nod shape model, the head shake shape model, and the head tilt shape model to be superimposed on the three-axis head motion commands; a head motion command generation unit 104 for generating and outputting, based on the timing information for the three axes output from head motion position generation unit 102, head motion commands on which the shapes specified by the nod, head shake, and head tilt shape models stored in storage devices 108, 110, and 112 are superimposed where required; a storage device 114 storing in advance a face-up shape model which, based on the experimental results described above, is to be superimposed during speech on the nod-related head motion command (the pitch angle control signal) among the three-axis head motion commands output from head motion command generation unit 104; and a discourse section head motion command generation unit 106, connected to receive the three-axis head motion commands output from head motion command generation unit 104 together with the utterance data, for superimposing the face-up shape model stored in storage device 114 during the discourse sections of the utterance data and supplying the result to robot control unit 90.

Referring to FIG. 6, head motion position generation unit 102 includes: a nod position generation unit 150 that receives utterance data from utterance data storage device 88, decides, using the appropriate probabilistic model in probability model group 100 as determined by the utterance data, whether to generate a nod and, if so, at what timing, and outputs the result as information indicating the nod timing; a head shake position generation unit 152 that, like nod position generation unit 150, decides from the utterance data, using the appropriate probabilistic model in probability model group 100, whether to perform a head shake and, if so, at what timing, and outputs the result as information indicating the head shake timing; and a head tilt position generation unit 154 that, like units 150 and 152, decides from the utterance data, using the appropriate probabilistic model in probability model group 100, whether to perform a head tilt and, if so, at what timing, and outputs the result as information indicating the head tilt timing. Nod position generation unit 150, head shake position generation unit 152, and head tilt position generation unit 154 are each implemented, in practice, by computer hardware and a computer program executed on that hardware.

The program routine implementing nod position generation unit 150 includes: step 170 of determining whether the discourse function tag attached to a phrase in the utterance data is a nod target (one of k, g, q, and bc) and branching the control flow according to the result; step 172 of selecting, from probability model group 100, the probabilistic model determined by the information attached to the utterance data specifying the speaker, the relationship between the speaker and the conversation partner, and the attitude/emotion, and by the value of the discourse function tag attached to the phrase; step 174 of deciding, from the probabilistic model selected in step 172 and a random number, which nod-type motion is to be performed; and step 176 of deciding, according to the motion type decided in step 174, at which syllable position of the phrase the nod is generated, outputting information indicating the nod position, and ending the process. If step 170 determines that the discourse function tag is not a nod target, information indicating that there is no nod timing is output and the process ends.
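Steps 170 to 176 might be sketched as a single function, as below. This is a minimal sketch under stated assumptions: the function name, the representation of syllables as (start, end) pairs, and the use of Python's `random.choices` for the random draw are illustrative choices. Step 172 (model selection) is assumed to have been done by the caller, who passes in the selected distribution.

```python
import random

NOD_TAGS = ("k", "g", "q", "bc")


def generate_nod_position(tag, distribution, syllable_times, rng=random):
    """Return (motion, start, end) for a phrase, or None when no nod is made.

    distribution: mapping from motion name ("no", "nd", "mnd") to probability,
                  i.e. the model entry already selected in step 172.
    syllable_times: list of (start, end) pairs for the phrase's syllables.
    """
    # Step 170: branch on whether the tag is a nod target.
    if tag not in NOD_TAGS:
        return None
    # Step 174: draw one motion according to the selected probabilities.
    motions = list(distribution)
    weights = [distribution[m] for m in motions]
    motion = rng.choices(motions, weights=weights, k=1)[0]
    if motion == "no":
        return None
    # Step 176: decide where in the phrase the motion is placed.
    if motion == "mnd":
        # Multiple nods extend over the whole phrase.
        start, end = syllable_times[0][0], syllable_times[-1][1]
    else:
        # A single nod is placed on the final syllable.
        start, end = syllable_times[-1]
    return motion, start, end
```

Passing `rng=random.Random(seed)` gives reproducible draws, which can be convenient when comparing generated motion against recorded data.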

The program implementing head shake position generation unit 152 includes steps 180, 182, and 184, which perform substantially the same processing as in nod position generation unit 150. The differences are that the discourse function tag tested in step 180 is a tag for the head shake decision, and that the probabilistic model used in step 182 is one for deciding the type of head shake. The discourse function tags used for the decision in step 180 are determined by experiments like those already described for nods. The same applies to the probabilistic models.

Similarly, the program implementing head tilt position generation unit 154 includes steps 190, 192, and 194, which perform substantially the same processing as steps 170, 172, and 174 of nod position generation unit 150, respectively. The differences are that the discourse function tag tested in step 190 is a tag for the head tilt decision, and that the probabilistic model used in step 192 is one for deciding the type of head tilt. The discourse function tags used for the decision in step 190 are also determined by experiments like those already described for nods. The same applies to the probabilistic models.

FIG. 7 shows the shape of a single nod stored in storage device 108. The shape in FIG. 7 was generated as a typical movement from the facial movements of humans when nodding. As FIG. 7 shows, in a natural human nod the face turns upward for a very short time just before turning downward, then moves far downward, and then returns to face forward. Having the robot move in this way makes it possible to control the robot so that its nod looks natural to a human observer. For this nod shape, it is preferable to stretch or compress the time axis as appropriate so that the start and end of the waveform coincide with the start and end positions of the sentence-final part at the end of the phrase. Such time-axis scaling is not mandatory, however. The positions in the phrase with which the start and end of the waveform are to be aligned are stored in advance in storage device 108 as information accompanying the shape, and the processing in steps 176, 186, and 196 sets the start and end positions of the waveform according to that information.
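The time-axis scaling of a stored shape can be illustrated with simple linear interpolation. The function `stretch_shape` and the sample values in `NOD_SHAPE` are assumptions made for this sketch; the actual waveform is the one shown in FIG. 7, and the description does not specify an interpolation method.

```python
# Illustrative pitch angles in degrees (upward positive): a brief upward
# movement, a large downward movement, then a return to the front.
# These values are placeholders, not the waveform of FIG. 7.
NOD_SHAPE = [0.0, 1.0, -10.0, -4.0, 0.0]


def stretch_shape(shape, n_out):
    """Resample a shape to n_out samples by linear interpolation."""
    if n_out < 2 or len(shape) < 2:
        raise ValueError("need at least two input and two output samples")
    scale = (len(shape) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale                      # fractional index into shape
        lo = int(pos)
        hi = min(lo + 1, len(shape) - 1)
        frac = pos - lo
        out.append(shape[lo] * (1.0 - frac) + shape[hi] * frac)
    return out
```

Here `n_out` would be the number of control samples spanning the target interval, for example the duration of the final syllable multiplied by the command rate.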

FIG. 8 shows, as another example of a nod shape, a typical shape for multiple nods (mnd). This shape is also stored in storage device 108 of FIG. 5 and is used to determine the nod position when multiple nods are selected in step 172 of FIG. 6. As already noted, with multiple nods the head motion extends over the whole phrase. In this case, therefore, the nod timing is set so that the start of the waveform coincides with the start of the phrase and the end of the waveform coincides with the end of the phrase.

Storage device 108 also stores a number of other shapes belonging to the nod category. Which shapes to use is a design choice. However, the shape types selected by the probabilistic models must be consistent with the shape types stored in storage device 108.

Similarly, storage device 110 stores shape models for head shakes, and storage device 112 stores shape models for head tilts.

Referring to FIG. 9, the face-up shape model stored in storage device 114 rises gently at the start of the phrase, holds a nod angle of about 3 degrees for the duration of the utterance, and falls gently at the end of the phrase. Observation of actual human speech showed that humans often behave in this way when speaking, and that it also looks more natural to the conversation partner. In this embodiment, therefore, while the robot is speaking, this face-up shape model is superimposed on the nod head motion command output from head motion command generation unit 104. The period during which the robot is speaking can be obtained from the utterance data.
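A face-up profile with this rise, hold, and fall pattern could be approximated by a trapezoid, as sketched below. The linear ramps and the parameter names are assumptions; the description only says that the rise and fall are gentle and that the held angle is about 3 degrees.

```python
def face_up_profile(n_samples, ramp, peak_deg=3.0):
    """Return a trapezoidal pitch offset (degrees, upward positive).

    n_samples: number of control samples covering the speech section.
    ramp: number of sample steps used for the rise and for the fall (> 0).
    """
    if ramp <= 0:
        raise ValueError("ramp must be positive")
    profile = []
    for i in range(n_samples):
        rise = i / ramp                      # grows from 0 at the start
        fall = (n_samples - 1 - i) / ramp    # shrinks to 0 at the end
        profile.append(peak_deg * min(1.0, rise, fall))
    return profile
```

A smoother curve (for example a raised cosine) could replace the linear ramps without changing the overall structure.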

<Operation>
Referring to FIGS. 5 to 9, head motion generation device 86 of robot 80 operates as follows.

It is assumed that probability model group 100 shown in FIG. 5 has been prepared in advance by running experiments like those already described and statistically processing the results. Head motion generation device 86 is given utterance data that includes the speech data to be uttered, its phrase segmentation information, the phoneme label sequence for each phrase, the discourse function tag assigned to each phrase, information identifying the speaker assumed for the robot's voice, information indicating the relationship between the robot and the conversation partner, information indicating the relationship between the speaker and the conversation partner ("close", "distant", or "intermediate"), and discourse mode information indicating the attitude, emotion, and the like that should accompany each phrase. Each phoneme label also carries information specifying the time within the phrase at which the corresponding speech is uttered.

Referring in particular to FIG. 6, the computer program constituting nod position generation unit 150, for example, determines whether the discourse function tag assigned to a given phrase is a nod target (step 170). If it is not, information indicating no nod is supplied to head motion command generation unit 104. If it is, a probabilistic model is selected from probability model group 100 according to the information on the speaker specified by the utterance data, the information indicating the relationship between that speaker and the conversation partner, the discourse mode information indicating attitude, emotion, and the like, and the specified discourse function tag (step 172). Based on the selected model, a random number generated by a known method is used to select which nod-related motion the robot is to perform (step 174). Once the motion is decided, the position in the phrase (specified in relation to syllables) at which the motion starts and the position at which it ends are decided accordingly (step 176). The timing information thus obtained is supplied to head motion command generation unit 104 (see FIG. 5).

In the same way, head shake position generation unit 152 generates timing information for head shakes and head tilt position generation unit 154 generates timing information for head tilts, and each supplies its output to head motion command generation unit 104.

Head motion command generation unit 104 creates head motion commands as follows, taking nods as an example. Referring to FIG. 10, assume for explanation that the speech of a given utterance includes phrases 210 and 212, with discourse function tags k and bc attached respectively. To keep the description and the figure simple, the information identifying the speaker, the information indicating the relationship between the speaker and the conversation partner, the discourse mode information indicating attitude and emotion, and so on are not shown. Phrase 210 is assumed to consist of a final syllable 222 and the preceding syllable sequence 220.

Assume that the nod decision process using a random number and the probabilistic model (step 174 in FIG. 6) selects a single nod (nd) for phrase 210 with discourse function tag k, and multiple nods (mnd) for phrase 212 with discourse function tag bc.

Head motion command generation unit 104 first generates a flat nod command 240 as the nod command for the robot. The final nod command 246 is obtained by superimposing onto this command, for each phrase, the shape of the nod motion assigned to that phrase, at the timing within the phrase that is predetermined for that nod motion.

For phrase 210, the selected motion is a single nod, so the shape shown in FIG. 7 is superimposed from the start position of final syllable 222 onward. For phrase 212, the selected motion is multiple nods, so the multiple-nod shape shown in FIG. 8 is stretched or compressed along the time axis and superimposed over the whole of phrase 212. This processing yields head motion command 242 shown in FIG. 10.
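The superimposition of motion shapes onto a flat command can be sketched as sample-wise addition. The function `superimpose` and the sampled-list representation of the command are assumptions for illustration; the description does not state how the command signal is represented.

```python
def superimpose(command, shape, start_index):
    """Return a copy of command with shape added from start_index onward.

    Samples of the shape that fall outside the command are ignored.
    """
    out = list(command)
    for offset, value in enumerate(shape):
        idx = start_index + offset
        if 0 <= idx < len(out):
            out[idx] += value
    return out
```

Starting from a flat command such as `[0.0] * n`, one call per phrase places each selected shape at its timing, and because the operation is additive, a further call with a face-up profile could be layered on top in the same way.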

In this embodiment, the face-up shape model shown in FIG. 9 is additionally superimposed on head motion command 242 while robot 80 is speaking. This yields the head motion command (for speech sections) 246 shown in FIG. 10. Supplying this head motion command to robot control unit 90 and speech reproduction unit 92 shown in FIG. 5 causes the head of robot 80 to nod, shake, and tilt with natural movements that reflect the content of the utterance, the relationship assumed between the robot and the conversation partner, and the attitude and emotion the robot should show when speaking.

As described above, in this embodiment, results obtained in advance by experiment are statistically processed to prepare probabilistic models that determine, according to the speaker, the relationship between the speaker and the conversation partner, and the attitude/emotion, which head motion is performed with what probability for each discourse function tag. When the robot speaks, for discourse function tags with particularly frequent head motion (for example k, g, q, and bc for nods), the probabilistic model corresponding to the speaker assumed for the robot, the relationship between the speaker and the conversation partner, the specified attitude/emotion, and the discourse function tag is selected; the head motion to perform is decided with a random number; and the shape of the decided head motion is superimposed on the head motion command at the timing preset for that motion. No such processing is performed for other discourse function tags. This processing is carried out independently for nods, head shakes, and head tilts, and the resulting head motion commands are sent to robot control unit 90, which controls the movement of the head about the three axes. As a result, the robot's head can be moved naturally in accordance with the speaker assumed for the robot, the relationship assumed with the conversation partner, the specified attitude and emotion, and the discourse function tag attached to each phrase. Communication between the robot and the conversation partner therefore becomes smoother, and the partner can naturally grasp information such as the robot's (assumed) emotion, its attitude, and the (assumed) relationship the robot has with the partner.

[Second Embodiment]
In the first embodiment, a predetermined shape is superimposed on the head motion command only for specific discourse function tags. However, head motion commands may instead be generated, where needed, for every discourse function tag that can occur. The second embodiment is configured in this way.

Referring to FIG. 11, the head motion position generation unit 280 of the robot according to this embodiment can be used in place of the head motion position generation unit 102 shown in FIGS. 5 and 6. The head motion position generation unit 280 can likewise be realized by a computer program executed on computer hardware, through cooperation between the hardware and the program.

The head motion position generation unit 280 receives utterance data from the utterance data storage device 88 and includes three units. The nod position generation unit 290 uses the appropriate probability model in the probability model group 100, as determined by the utterance data, to decide whether to generate a nod and, if so, at what timing, and outputs the result as information indicating the nod timing. The head-shake position generation unit 292, like the nod position generation unit 290, uses the appropriate probability model in the probability model group 100, as determined by the utterance data, to decide whether to perform a head shake and, if so, at what timing, and outputs the result as information indicating the head-shake timing. The head-tilt position generation unit 294, like units 290 and 292, uses the appropriate probability model in the probability model group 100, as determined by the utterance data, to decide whether to perform a head tilt and, if so, at what timing, and outputs the result as information indicating the head-tilt timing.

The nod position generation unit 290, the head-shake position generation unit 292, and the head-tilt position generation unit 294 have mutually similar configurations.

The program routine that realizes the nod position generation unit 290 includes the following steps. Step 310 selects, from the probability model group 100 shown in FIG. 5, the probability model for deciding the nodding motion. The model is determined by the information attached to the utterance data that specifies the speaker assumed for the robot, the relationship between the speaker and the conversation partner, the attitude and emotion, and so on, together with the discourse function tag attached to each phrase of the utterance data. Step 174 decides, from the probability model selected in step 310 and a random number, which motion in the nodding category the robot is to perform. Step 176 decides, according to the type of motion decided in step 174, at which syllable position in the phrase the nod is generated, outputs information indicating the nod position, and ends the processing.
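The timing decision of step 176 might be sketched as below. The description states only that the syllable position depends on the type of motion, so the concrete rule used here (a single nod on the final syllable, any other nod type over the whole phrase) and the motion names are assumptions made purely for illustration.

```python
def nod_timing(motion, syllable_starts, phrase_end):
    """Return (start, end) in seconds for the chosen nod motion, or None.

    motion:          motion name decided beforehand ("none" means no nod).
    syllable_starts: start time of each syllable in the phrase, in order.
    phrase_end:      end time of the phrase.
    """
    if motion == "none" or not syllable_starts:
        return None
    if motion == "single_nod":
        # Assumed rule: a single nod is placed on the final syllable.
        return (syllable_starts[-1], phrase_end)
    # Assumed rule: every other nod type spans the whole phrase.
    return (syllable_starts[0], phrase_end)
```

The same shape of function would apply to head shaking and head tilting, each with its own assumed rule table.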

The program routine that realizes the head-shake position generation unit 292 is similar. Step 320 selects, from the probability model group 100 shown in FIG. 5, the probability model for deciding the head-shake motion, as determined by the information attached to the utterance data that specifies the speaker assumed for the robot, the relationship between the speaker and the conversation partner, the attitude and emotion, and so on, together with the discourse function tag attached to each phrase of the utterance data. Step 184 decides, from the probability model selected in step 320 and a random number, which motion in the head-shake category the robot is to perform. Step 186 decides, according to the type of motion decided in step 184, at which syllable position in the phrase the head shake is generated, outputs information indicating the head-shake position, and ends the processing.

The program routine that realizes the head-tilt position generation unit 294 is also similar. Step 330 selects, from the probability model group 100 shown in FIG. 5, the probability model for deciding the head-tilt motion, as determined by the information attached to the utterance data that specifies the speaker assumed for the robot, the relationship between the speaker and the conversation partner, the attitude and emotion, and so on, together with the discourse function tag attached to each phrase of the utterance data. Step 194 decides, from the probability model selected in step 330 and a random number, which motion in the head-tilt category the robot is to perform. Step 196 decides, according to the type of motion decided in step 194, at which syllable position in the phrase the head tilt is generated, outputs information indicating the head-tilt position, and ends the processing.

In this embodiment, head motion control similar to that described in the first embodiment can be performed for all discourse function tags. The robot's head motion becomes more varied, and movement closer to actual human head motion can be achieved.

In both the first and second embodiments, the head motion commands for nodding, head shaking, and head tilting are generated independently. Constraints may nevertheless be imposed, for example "do not nod while shaking the head" or "do not perform multiple nods and multiple head shakes at the same time". Furthermore, combining steps 310, 320, and 330 of FIG. 11 into a single step, so that the discourse function tag always selects the three probability models as a set, is equivalent to the second embodiment. It is also possible to select only one or two of the nod position generation unit 290, the head-shake position generation unit 292, and the head-tilt position generation unit 294, for example by random number, and to operate only those, so that the unselected units do not operate at the same time.
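Two of the variations mentioned above might be sketched as follows: randomly activating only one or two of the three generation units, and suppressing a nod while a head shake is performed. The function names, the unit labels in `UNITS`, and the dictionary representation of the decided motions are hypothetical and serve only to make the idea concrete.

```python
import random

# Hypothetical labels for the three generation units (290, 292, 294).
UNITS = ("nod", "shake", "tilt")


def pick_active_units(rng):
    """Randomly pick one or two of the three generation units to run for a phrase."""
    count = rng.choice((1, 2))
    return set(rng.sample(UNITS, count))


def suppress_nod_during_shake(motions):
    """Apply the example constraint: no nodding while a head shake is performed.

    motions maps each unit label to the motion name decided for it ("none" if absent).
    A new dictionary is returned; the input is left unchanged.
    """
    result = dict(motions)
    if result.get("shake", "none") != "none":
        result["nod"] = "none"
    return result
```

Applying the constraint after the three independent draws keeps each unit's own logic unchanged, which matches the independent generation described for both embodiments.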

The above embodiments implement all three of nodding, head shaking, and head tilting, but only any one of them, or only a combination of any two, may be adopted.

In the above embodiments, a face-up shape is superimposed on the nod-related head motion command during utterance. Needless to say, superimposing such a face-up shape is not essential. Even if it is superimposed, the angle is not limited to 3 degrees and can be chosen freely within a range that looks natural. The face-up angle itself can also be changed while the robot is operating, and, as already described, the speaker assumed for the robot, the relationship with the conversation partner, the attitude and emotion, and so on can be used as criteria for doing so.
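The superposition of a face-up shape on the pitch command might be sketched as below. For simplicity the face-up shape is modeled as a constant offset applied while the robot is speaking, and positive pitch is taken to mean "up"; both are assumptions of this sketch, as are the function and parameter names. Only the 3-degree default comes from the description.

```python
def superimpose_face_up(pitch_deg, speaking, face_up_deg=3.0):
    """Add a face-up offset to the pitch samples that fall inside speech.

    pitch_deg:   pitch command samples in degrees (nod shapes already applied).
    speaking:    booleans of the same length, True while the robot is speaking.
    face_up_deg: offset in degrees; adjustable, since the angle is not fixed.
    """
    if len(pitch_deg) != len(speaking):
        raise ValueError("pitch_deg and speaking must have the same length")
    return [p + face_up_deg if s else p for p, s in zip(pitch_deg, speaking)]
```

Because `face_up_deg` is an ordinary parameter, it could be varied at run time, for example according to the assumed speaker, relationship, or attitude.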

In the above embodiments, discourse function tags are attached phrase by phrase, because in conversation head motion often changes in such units. Needless to say, however, the unit to which a discourse function tag is attached is not limited to the phrase. The above embodiments also assume that a syllable sequence accompanies the utterance data and that information indicating the start time of each syllable is available. This configuration makes it easy to determine, for example, the start timing of a head motion. A configuration that does not use information such as phoneme start times is also possible. For example, when a head motion is to be produced on the final syllable of an utterance, the utterance power is traced back from the end of the phrase. After the first peak has been passed (that is, the end portion of the final syllable has been found), the change in the MFCC (Mel-Frequency Cepstrum Coefficient) values is examined, and the point at which the ΔMFCC value exceeds a certain threshold is judged to be the start position of the final syllable. This realizes processing equivalent to that of the above embodiments. When a head motion is performed over an entire phrase, the problem does not arise, because the beginning and end of the phrase are clear. Alternatively, only the information on the start positions of those syllables in each phrase that are relevant to head motion start timing may be created in advance by batch processing and attached to the utterance data.
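The backward search for the start of the final syllable might be sketched as follows. The sketch assumes that per-frame power and a per-frame ΔMFCC magnitude have already been computed for the phrase, and it reads "first peak" as the first local power maximum met when walking back from the phrase end. That reading, the function name, and the threshold being a tuning value are all assumptions of this illustration.

```python
def last_syllable_onset(power, delta_mfcc, threshold):
    """Return the frame index estimated as the start of the final syllable, or None.

    power:      per-frame speech power for one phrase.
    delta_mfcc: per-frame magnitude of the MFCC change (same length as power).
    threshold:  delta-MFCC level taken to mark the syllable start.
    """
    if len(power) != len(delta_mfcc):
        raise ValueError("power and delta_mfcc must have the same length")
    # Walk back from the phrase end while power keeps rising, up to the first peak.
    i = len(power) - 1
    while i > 0 and power[i - 1] >= power[i]:
        i -= 1
    peak = i
    # Past the peak, keep walking back until the spectral change is large enough.
    for j in range(peak - 1, -1, -1):
        if delta_mfcc[j] > threshold:
            return j
    return None  # no frame before the peak exceeded the threshold
```

Multiplying the returned frame index by the frame shift gives a time that can stand in for the syllable start time used in the embodiments.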

Furthermore, it goes without saying that the idea disclosed above for head motion is applicable not only to head motion but also to movement of the limbs, the upper body, the eyes, and so on.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording recited therein.

30 Speaker
32, 34, 36, 38, 40, 42, 44 Markers
80 Robot
82, 108, 110, 112, 114 Storage devices
86 Head motion generation device
88 Utterance data storage device
90 Robot control unit
92 Voice reproduction unit
100 Probability model group
102 Head motion position generation unit
104 Head motion command generation unit
106 Head motion command generation unit for discourse sections
114 Storage device
130, 132 Probability models
150, 290 Nod position generation units
152, 292 Head-shake position generation units
154, 294 Head-tilt position generation units

Claims

A head motion control information generating device for generating control information for controlling the motion of the head of a humanoid robot in synchronization with speech produced by the robot, wherein
the speech is defined by voice data divided in advance into predetermined units, and each predetermined unit is annotated with a discourse function tag indicating the discourse function of the speech uttered from the voice data assigned to that unit,
the device comprising:
probability model storage means for storing probability models that define, for each discourse function tag, with what probability each of a plurality of head motions is executed;
probability model selection means for receiving the voice data in the predetermined units and selecting, from among the probability models stored in the probability model storage means, a probability model determined on the basis of the annotation attached to the unit; and
head command generation means for generating a head motion command corresponding to the input predetermined unit of voice data with a probability according to the probability model selected by the probability model selection means, and outputting the command to a control unit of the humanoid robot.

The head motion control information generating device according to claim 1, wherein
the voice data is annotated with discourse mode information specifying a discourse mode consisting of any combination of speaker identification information for identifying a speaker, relationship identification information for identifying the relationship between the speaker and a conversation partner, and non-linguistic information at the time of utterance,
the probability model storage means stores a plurality of probability models prepared in advance for combinations of the discourse mode information and the discourse function tag, and
the probability model selection means includes means for selecting one of the plurality of probability models according to the discourse mode information attached to the voice data and the input predetermined unit.

The head motion control information generating device according to claim 1 or 2, wherein
the head command generation means includes:
head motion determination means for determining, with a probability according to the probability model selected by the probability model selection means, whether to generate a head motion corresponding to the input predetermined unit of voice data;
shape information storage means for storing, for each of the plurality of head motions, shape information defining the temporal movement of the head corresponding to that head motion; and
shape reading means for, in response to the head motion determination means determining that one of the head motions is to be generated, reading the shape information corresponding to that head motion from the shape information storage means at a timing determined in advance by the type of the head motion, generating a control command for controlling the head of the robot, and outputting the control command to the control unit.

The head motion control information generating device according to claim 3, wherein
the motion of the head of the robot relates to the pitch angle among the rotations about the three axes of the head of the robot, and
the head command generation means further includes:
utterance-time head shape storage means for storing a head motion shape, prepared in advance, that defines the temporal movement of the head while the robot is speaking; and
means, connected between the shape reading means and the control unit, for superimposing, in response to the voice data indicating that the robot is speaking, the head motion shape read from the utterance-time head shape storage means on the control command output by the shape reading means, and supplying the result to the control unit.

The head motion control information generating device according to any one of claims 1 to 3, wherein
the motion of the head of the robot relates to rotation about any combination of the three axes of the head of the robot, and
the probability model storage means, the probability model selection means, and the head command generation means all generate head control commands independently for each axis constituting the arbitrary combination.