JP2013119155A

JP2013119155A - Device and method for creating scenario

Info

Publication number: JP2013119155A
Application number: JP2011269644A
Authority: JP
Inventors: Osamu Sugiyama; 杉山治; Kazuhiko Shinozawa; 篠沢一彦; Michita Imai; 今井倫太
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2011-12-09
Filing date: 2011-12-09
Publication date: 2013-06-17
Anticipated expiration: 2031-12-09
Also published as: JP5877418B2

Abstract

PROBLEM TO BE SOLVED: To provide a scenario creation device capable of automating the creation of a scenario that combines speech and gestures. SOLUTION: Based on the text segments of the input text data and the gesture information stored in a storage unit 42, an SGAE service unit 50 extracts a feature amount for each combination of a text segment and a gesture and uses it as an input value. Each likelihood evaluation module 54 registered in the SGAE service unit 50 calculates the likelihood of a combination of the robot's speech and a gesture, and a generation unit 56 creates the scenario.

Description

The present invention relates to a scenario generation device and a scenario generation method for generating a scenario that associates the utterances of a target controlled by a computer agent with gestures.

The role that gestures play in conversation between people, along with their types and the process by which they are generated, has been clarified through the observation of conversations. Gestures serve many roles, from expressing the content being conveyed to the meta-level regulation of communication and the building of emotional bonds. Previous studies have shown that human utterances and gestures co-occur from a minimal psychological unit called a "growth point" and build up a conversation while complementing each other's meaning.

The generation of utterances and gestures for robots and agents has been developed with the goal of reproducing the functions of these human utterances and gestures (for example, Non-Patent Document 1 and Non-Patent Document 2).

For example, Patent Document 1 discloses a mobile robot that, when one of its plural movable parts or its voice output part is not in use, can make effective use of that part to perform an action that prompts the conversation partner to speak.

Meanwhile, Cassell et al. extended research on natural language processing and developed a system that uses an ECA (Embodied Conversational Agent) to generate an agent's utterances and gestures for explaining objects in a virtual space (Non-Patent Document 3, Non-Patent Document 4). For humanoid robots, Victor et al. of HRI (Honda Research Institute) proposed a model that automatically synchronizes ASIMO's speech and gestures based on input text (Non-Patent Document 5). The system developed by Victor et al. observes the gestures that people actually produce and, using a probabilistic model based on those patterns, can synchronize speech with the principal gesture types identified in earlier research, such as emblems, representational gestures, and repetitive gestures. In this related work, the main focus has been on observing conversations between people and reproducing them in the synchronization of robot gestures and speech.

Patent Document 1: JP 2008-279529 A

Non-Patent Document 1: B. Hartmann, M. Mancini, and C. Pelachaud. "Implementing expressive gesture synthesis for embodied conversational agents." In Gesture in Human-Computer Interaction and Simulation, volume 3881, pages 188-199. Springer, 2006.
Non-Patent Document 2: S. Kopp and I. Wachsmuth. "Synthesizing multimodal utterances for conversational agents." Comp. Anim. Virtual Worlds, 15(1):39-52, 2004.
Non-Patent Document 3: J. Cassell, H. Högni Vilhjálmsson, and T. Bickmore. "BEAT: the behavior expression animation toolkit." In SIGGRAPH 2001: Proceedings of ACM SIGGRAPH, pages 477-486, New York, NY, USA, 2001.
Non-Patent Document 4: K. Striegnitz, P. Tepper, A. Lovett, and J. Cassell. "Knowledge Representation for Generating Locating Gestures in Route Directions." In Proceedings of Workshop on Spatial Language and Dialogue (5th Workshop on Language and Space), October 23-25, Delmenhorst, Germany, 2005.
Non-Patent Document 5: V. Ng-Thow-Hing, P. Luo, and S. Okita. "Synchronized Gesture and Speech Production for Humanoid Robots." The 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 18-22, Taipei, Taiwan, 2010.

However, the communication robots developed so far have been designed to imitate only part of the human body's functions. Put differently, except for a few robots such as androids, these robots cannot express every gesture that a human can produce. For example, a robot without fingers cannot produce an emblem gesture such as the OK sign.

Therefore, when creating gestures for a communication robot, a developer edits the gestures to be produced so that the intention is conveyed within the constraints of the individual robot. As a result, these gestures often differ in movement from the human gestures on which they were modeled.

In addition, because conversation-regulating behaviors such as swaying the body or blinking cannot be reproduced as they are, further changes are made, such as adding other movements to substitute for them. Inevitably, the combinations of speech and gesture used by a robot will differ from those used by a person. It is difficult to construct a psychological model that, in addition to reproducing human gestures, edits speech and gestures to suit each individual robot in this way.

There is also the question of whether these assignment methods are universal. As the number of people using robots grows and time passes, various ways of assigning and creating speech and gestures are likely to be devised. The system is required to evolve along with the growing proficiency and development of those users. Until now, however, no framework for such evolution has existed.

The present invention has been made to solve the problems described above, and its object is to provide a scenario generation device or a scenario providing method that can automate the creation of scenarios combining utterances and gestures, based on patterns extracted from how people assign utterances and gestures to a control target such as a communication robot.

Another object of the present invention is to provide a scenario generation device or a scenario providing method that allows a system for automatic scenario creation to be developed progressively.

According to one aspect of the present invention, a scenario generation device for creating a scenario in which gestures are assigned to the utterances of a control target includes: storage means for storing text data corresponding to the utterances and information for controlling the gestures; dividing means for dividing text data of a predetermined length, taken from the text data corresponding to the utterances of the control target, into a plurality of text segment candidates based on a predetermined termination pattern; selection means for selecting, from among combination candidates each pairing one of the text segment candidates with one of a plurality of predetermined gestures, the combination candidate with the highest likelihood as a combination in the scenario, based on a first likelihood that the text segment candidate is delimited by the predetermined termination pattern and a second likelihood based on at least one of the playback time of the text segment candidate and the ratio of the playback time to the gesture time; and scenario creation means for creating the scenario by repeating the selection of combinations by the dividing means and the selection means, up to the end of the text data, on the text data of the predetermined length that follows the text segment corresponding to the selected combination.

Preferably, the second likelihood is the product of a likelihood based on the playback time of the text segment candidate and a likelihood based on the ratio of the playback time to the gesture time.

Preferably, the selection means calculates the likelihood by multiplying the first and second likelihoods by a third likelihood based on keywords present in the text segment candidate.

Preferably, the scenario generation device is a server device and further includes means for registering, via a network, the information for controlling the gestures in the storage means.

Preferably, the first to third likelihoods are each calculated by a corresponding likelihood evaluation module, and the scenario generation device further includes means for registering the likelihood evaluation modules with the selection means.
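As a rough illustration of how likelihoods from registered evaluation modules could be multiplied together and used to pick the best combination, consider the following Python sketch. The `Gesture` type, the function names, the seconds-per-character estimate, and both toy evaluators are assumptions made for this example only; the text does not specify the form of the likelihood functions.

```python
from dataclasses import dataclass
from itertools import product
from math import prod
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Gesture:
    name: str
    duration: float  # gesture time in seconds (assumed unit)


# A likelihood evaluation module maps (text segment candidate, gesture) to a score.
Evaluator = Callable[[str, Gesture], float]


def combined_likelihood(segment: str, gesture: Gesture,
                        evaluators: Iterable[Evaluator]) -> float:
    """Multiply the likelihoods returned by all registered evaluators."""
    return prod(ev(segment, gesture) for ev in evaluators)


def select_best(segments: List[str], gestures: List[Gesture],
                evaluators: List[Evaluator]) -> Tuple[str, Gesture]:
    """Return the (segment, gesture) pair with the highest combined likelihood."""
    return max(product(segments, gestures),
               key=lambda pair: combined_likelihood(pair[0], pair[1], evaluators))


# Toy evaluators: hypothetical stand-ins, not the functions used in the patent.
def ends_with_terminator(segment: str, gesture: Gesture) -> float:
    """Higher score when the segment ends at a termination character."""
    return 0.9 if segment.endswith(("。", "、", ".", ",")) else 0.1


def duration_ratio(segment: str, gesture: Gesture,
                   seconds_per_char: float = 0.15) -> float:
    """Higher score when estimated speech time is close to the gesture time."""
    speech_time = len(segment) * seconds_per_char
    ratio = speech_time / gesture.duration
    return 1.0 / (1.0 + abs(ratio - 1.0))
```

Because the combined score is a plain product, adding a further evaluator (for example, one based on keywords) only means appending another function to the `evaluators` list.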

According to another aspect of the present invention, a scenario generation method for creating a scenario in which gestures are assigned to the utterances of a control target includes the steps of: an arithmetic device, based on information in a storage device that stores text data corresponding to the utterances and information for controlling the gestures, dividing text data of a predetermined length, taken from the text data corresponding to the utterances of the control target, into a plurality of text segment candidates based on a predetermined termination pattern; the arithmetic device selecting, from among combination candidates each pairing one of the text segment candidates with one of a plurality of predetermined gestures, the combination candidate with the highest likelihood as a combination in the scenario, based on a first likelihood that the text segment candidate is delimited by the predetermined termination pattern and a second likelihood based on at least one of the playback time of the text segment candidate and the ratio of the playback time to the gesture time; and the arithmetic device creating the scenario by repeating the dividing and the selecting, up to the end of the text data, on the text data of the predetermined length that follows the text segment corresponding to the selected combination.
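The repeated divide-and-select procedure of this method can be sketched as a greedy loop. The Python below is one possible reading, not the patented implementation: it assumes that the candidates are the prefixes of the current window that end at a termination character, that the window is measured in characters rather than sentences, and that `score` is some caller-supplied likelihood function. All names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

TERMINATORS = "。、.,"  # assumed termination characters


def prefix_candidates(window: str) -> List[str]:
    """Prefixes of the window that end at a termination character."""
    pattern = "[" + re.escape(TERMINATORS) + "]"
    ends = [m.end() for m in re.finditer(pattern, window)]
    if not ends or ends[-1] != len(window):
        ends.append(len(window))  # always allow the whole window as a candidate
    return [window[:e] for e in ends]


def build_scenario(text: str, gestures: List[str],
                   score: Callable[[str, str], float],
                   window_len: int = 40) -> List[Tuple[str, str]]:
    """Greedily pair successive text segments with gestures until the text ends.

    `gestures` must be non-empty.
    """
    scenario = []
    pos = 0
    while pos < len(text):
        window = text[pos:pos + window_len]
        pairs = [(seg, g) for seg in prefix_candidates(window) for g in gestures]
        seg, gesture = max(pairs, key=lambda p: score(*p))
        scenario.append((seg, gesture))
        pos += len(seg)  # continue with the text that follows the chosen segment
    return scenario
```

Every candidate is at least one character long, so the position advances on each iteration and the loop ends at the end of the text.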

According to the present invention, by relying on patterns extracted from how people assign utterances and gestures to a control target such as a communication robot, it is possible to automate the creation of scenarios that combine utterances and gestures in a way that appears natural to a person watching the gestures.

Alternatively, according to the present invention, because multiple people can take part in creating gestures and in creating the methods for determining the likelihood of utterance-gesture combinations, the system for automatic scenario creation can be developed progressively.

A diagram showing the flow of extracting utterance and gesture assignment patterns for a communication robot and of their progressive development.
A block diagram of the computer system of the server device 2000.
A flowchart for explaining the processing for assigning gestures to utterances.
A conceptual diagram showing an example of a created scenario.
A conceptual diagram for explaining the configuration of a prototype of the scenario generation device.
A diagram showing the appearance of robovie (registered trademark) mR2.
A diagram showing the user interface in the UI unit 10 of the scenario generation system.
A diagram showing the Johnson SU distribution for various parameter values.
A diagram showing the experimental environment.
A plot of Ngi, obtained by normalizing with equation (4) each evaluation value obtained with equation (3).
A diagram showing, for all trials obtained in the experiment, a histogram of the ratio of speech playback time to gesture playback time in each instruction.
A diagram showing approximated curves of the likelihood evaluation model using the normal distribution and the Johnson SU distribution.
A diagram showing approximated curves of the likelihood evaluation model using the normal distribution and the Johnson SU distribution.
A diagram showing the results of evaluating the calculated score of each scenario using ANOVA (one-factor within-subjects analysis).
A diagram showing the results of evaluating the calculated score of each scenario using ANOVA (one-factor within-subjects analysis).

The configuration of a scenario generation system according to an embodiment of the present invention is described below with reference to the drawings. In the following embodiment, components and processing steps given the same reference numerals are identical or equivalent, and their description is not repeated unless necessary.

As described below, the present embodiment does not reproduce a human psychological model. Instead, it collects data on the combinations of robot speech and gestures that have been created, extracts patterns from that data, and generates speech-gesture combinations probabilistically. By deliberately not introducing existing concepts of gesture and utterance from human conversation, and instead combining utterances and gestures based on past history data, a more practical scenario generation system suited to the target communication robot can be developed.

In the present embodiment, to make use of the history data, the robot speech and gesture assignment system is built using an SOA (Service-Oriented Architecture).

In recent years, a succession of systems has been announced that use a service-oriented architecture (SOA), in which each piece of software corresponding to one unit of business processing is treated as a service and the overall system is built by linking those services over a network. These systems have two major advantages.

The first advantage is that a user of the system needs few computer resources. Because most of the processing for the services these systems require is executed on servers on the network, the user can often use a service simply by opening a browser.

The second advantage is that the service provider can collect the users' usage histories. By analyzing how a service is used, the provider can go on to offer more advanced services.

These two advantages are exactly what systems using robots need. To realize a service that responds to a user's request, a variety of processing must be performed, such as speech recognition, speech synthesis, face recognition, and position acquisition.

Installing all of these in each user's local environment requires considerable computer resources and is a major barrier to the spread of such services. In addition, collecting and analyzing a vast amount of usage history from users and operators is considered essential for deciding what actions a robot should take and for continuously developing its interface.

It is therefore desirable to operate the method described below on an SOA-based system, collect the users' editing histories, and progressively develop the robot speech and gesture assignment system.

The following describes the system configuration and flow of these SOA-based robot services, with the aim of building, from the users' editing histories, a model that generates robot behavior (combinations of speech and gesture).

However, the present invention is not necessarily limited to such an SOA-based system; other system configurations may be used as long as they can realize functions equivalent to those of the services in the SOA-based system.

In the following description, a "computer agent" can be a software program that causes a robot, as a physically existing entity, to execute the corresponding utterances and gestures according to a scenario combining utterances and gestures. More generally, however, a "computer agent" may be a software program that causes a control target to execute the corresponding utterances and gestures according to a scenario. In this case, the "control target" is something that has fewer degrees of freedom for utterance or gesture than a human, and it is not limited to a visual object that is a physical entity; it may be, for example, a visual object such as a character image shown on a display.

(Extraction of utterance and gesture assignment patterns for communication robots)
FIG. 1 is a diagram showing the flow of extracting utterance and gesture assignment patterns for a communication robot and of their progressive development.

As shown in FIG. 1, the scenario generation system of the present embodiment consists of four components: a user interface unit 10 executed on the client device side and, executed on the server device 2000 side, a robot instruction generation (Robot Instruction Generation, hereinafter RIG) service unit 30, a database and data storage (Data Base and Data Storage, hereinafter DBDS) service unit 40, and a speech-gesture assignment evaluation (Speech-Gesture Assignment Evaluation, hereinafter SGAE) service unit 50.

Through the interface 10 in the browser of the client device, a user 2 can create robot instructions (combinations of speech and gesture) using the RIG service unit 30. The resources generated by the RIG service unit 30 and the information on their speech-gesture combinations are registered by the DBDS service unit 40 in a storage unit 42, which consists of a storage device such as a database or storage server.

Although not particularly limited, information can be passed from the user interface unit 10 to the RIG service unit 30 using a REST API (Representational State Transfer Application Programming Interface).

Based on the text segments of the input text data and the gesture information accumulated in the storage unit 42, the SGAE service unit 50 extracts a feature amount for each combination of a text segment and a gesture and uses it as an input value. The likelihood evaluation modules 54 registered in the SGAE service unit 50 then calculate the likelihood of each combination of robot speech and gesture, and the generation unit 56 generates the scenario according to the procedure described later. The calculated likelihoods and the scenario are transmitted from the generation unit 56 to the user interface unit 10.

Besides the users of the interface for generating robot instructions, this system involves a gesture creator 4 and an analyst (developer) 6 of the likelihood evaluation service.

The gesture creator 4 is, literally, a person who creates robot gestures with a dedicated tool and registers them in the DBDS service unit 40. A user of the interface may also act as the gesture creator 4. However, creating gestures requires specialized knowledge, such as an understanding of the robot's axis arrangement, so to keep the roles clear the gesture creator is defined here as a separate party. Although not particularly limited, the gesture creator 4 can also register the information for controlling gestures in the DBDS service unit 40 over the network through a dedicated interface.

The user 2 who creates robot instructions uses the gestures created by the gesture creator 4 and, through the scenario generation system, creates robot instructions by assigning those gestures to speech.

Although not particularly limited, so that the approach is not tied to a specific robot and can be applied to a wider range of robots, the present embodiment is described on the premise that robot gestures cannot be generated automatically and are instead registered.

The analyst 6, on the other hand, is a person who analyzes the accumulated history data and defines and registers each likelihood evaluation module in order to realize the SGAE service unit 50. Many factors are thought to be involved in assigning robot speech and gestures, so the system is not completed in a single implementation but is developed progressively. To support this progressive development, the SGAE service unit 50 is a framework that learns continuously and reflects the results at regular intervals. The analyst 6 implements a likelihood evaluation module 54 and its parameter calculation module (not shown) in the SGAE service unit 50 and incorporates them into the system. The parameter calculation module is not particularly limited, but can, for example, be based on the least squares method. The system periodically updates the parameters from the history data using the parameter calculation module and reflects the results in the likelihood evaluation module 54. Repeating these steps allows the overall service to remain automated while being developed progressively through the addition of likelihood evaluation modules 54 and the updating of parameters.
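The register-then-refresh cycle described above might be organized along the following lines. This Python sketch is illustrative only: the class names, the dictionary of per-module histories, and the Gaussian-shaped score are all assumptions. The example fitter uses the sample mean, which is the least-squares estimate of a single constant, as a simple stand-in for whatever least-squares fit an analyst would actually implement.

```python
from math import exp
from statistics import fmean, pstdev
from typing import Callable, Dict, List


class LikelihoodModule:
    """A likelihood evaluation module whose parameters are refit from history."""

    def __init__(self, name: str,
                 fit: Callable[[List[float]], Dict[str, float]],
                 evaluate: Callable[[float, Dict[str, float]], float]):
        self.name = name
        self.fit = fit            # plays the role of the parameter calculation module
        self.evaluate = evaluate  # scores a feature value with the current parameters
        self.params: Dict[str, float] = {}

    def update(self, history: List[float]) -> None:
        self.params = self.fit(history)

    def __call__(self, feature: float) -> float:
        return self.evaluate(feature, self.params)


class ModuleRegistry:
    """Holds registered modules and refreshes their parameters on demand."""

    def __init__(self) -> None:
        self.modules: Dict[str, LikelihoodModule] = {}

    def register(self, module: LikelihoodModule) -> None:
        self.modules[module.name] = module

    def refresh(self, histories: Dict[str, List[float]]) -> None:
        """Refit every module for which history data is available."""
        for name, module in self.modules.items():
            if name in histories:
                module.update(histories[name])


def fit_gaussian(history: List[float]) -> Dict[str, float]:
    """Hypothetical fitter: mean and population standard deviation of the history."""
    sigma = pstdev(history)
    return {"mu": fmean(history), "sigma": sigma if sigma > 0 else 1.0}


def gaussian_score(x: float, params: Dict[str, float]) -> float:
    """Unnormalized Gaussian-shaped score that peaks at 1.0 when x equals mu."""
    z = (x - params["mu"]) / params["sigma"]
    return exp(-0.5 * z * z)
```

A scheduler (not shown) would call `refresh` at whatever interval the operator chooses; new modules can be registered at any time without touching the existing ones.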

(Hardware configuration)
FIG. 2 is a block diagram of the computer system of the server device 2000.

In FIG. 2, the computer main body 2010 of the server device 2000 includes, in addition to a memory drive 2020 and a disk drive 2030: a CPU 2040; a bus 2050 connected to the disk drive 2030 and the memory drive 2020; a ROM 2060 for storing programs such as a boot-up program; a RAM 2070, connected to the CPU 2040, the bus 2050, and the ROM 2060, for temporarily storing the instructions of application programs and providing temporary storage space; a hard disk (HDD) 2080 for storing application programs, system programs, and data; and a communication interface 2090 for communicating over a network or the like with external equipment such as a storage server.

The functions of the RIG service unit 30, the DBDS service unit 40, and the SGAE service unit 50 described above are realized by arithmetic processing that the CPU 2040 executes according to a program.

A program that causes the server device 2000 to execute the functions of the information processing device and the like of the embodiment described above may be stored on a CD-ROM 2200 or a memory medium 2210, inserted into the disk drive 2030 or the memory drive 2020, and then transferred to the hard disk 2080. Alternatively, the program may be sent to the computer main body 2010 over a network (not shown) and stored on the hard disk 2080. The program is loaded into the RAM 2070 when it is executed.

The server device 2000 further includes a keyboard 2100 and a mouse 2110 as input devices and a display 2120 as an output device.

The program for functioning as a server as described above need not include an operating system (OS) that causes the computer main body 2010 to execute the functions of the information processing device and the like. The program need only contain the instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the server device 2000 operates is well known, and a detailed description is omitted.

The program may be executed by a single computer or by multiple computers. That is, the processing may be centralized or distributed.

The basic hardware configuration of the client device that executes the user interface unit 10 is the same as that shown in FIG. 2.

(Assigning gestures to utterances)
The processing for assigning gestures to utterances from text data is described below.

FIG. 3 is a flowchart for explaining the process of assigning a gesture to an utterance.

The following text is used as an example. That is, it is assumed that the following text data is input from the user interface unit 10.

「この研究は、人がコミュニケーションロボットの発話とジェスチャを割り当てるときのパターンを抽出し、自動化するシステムの構築を目指します。そして、そのシステムを発展的に開発していく手法を提案します。従来のコミュニケーションロボットにおける発話とジェスチャのアサイン手法は、人同士の会話における発話とジェスチャの役割を分析し、その役割をモデル化することでロボットに実装されてきました。しかし、…」
("This research aims to build a system that extracts and automates the patterns that appear when people assign utterances and gestures to a communication robot. It then proposes a method for developing that system progressively. Conventional methods for assigning utterances and gestures in communication robots have been implemented in robots by analyzing the roles of utterances and gestures in human conversation and modeling those roles. However, …")
Referring to FIG. 3, when the process of assigning gestures to utterances starts (S100), the RIG service unit 30 first extracts a predetermined length of data from the beginning of the text data described above. Here, for example, two sentences of data are extracted. The number of sentences extracted is not particularly limited.

Next, the RIG service unit 30 creates a plurality of text slices by dividing these two sentences of text data based on termination patterns (for example, the period 「。」 and comma 「、」 in Japanese, or the comma and period in English) (S102).

For the example text above, the slices are as follows.

Slice 1: 「この研究は、」 ("This research,")
Slice 2: 「この研究は、人がコミュニケーションロボットの発話とジェスチャを割り当てるときのパターンを抽出し、」 ("This research extracts the patterns that appear when people assign utterances and gestures to a communication robot,")
Slice 3: 「この研究は、人がコミュニケーションロボットの発話とジェスチャを割り当てるときのパターンを抽出し、自動化するシステムの構築を目指します。」 ("This research aims to build a system that extracts and automates the patterns that appear when people assign utterances and gestures to a communication robot.")
Slice 4: 「この研究は、人がコミュニケーションロボットの発話とジェスチャを割り当てるときのパターンを抽出し、自動化するシステムの構築を目指します。そして、」 (slice 3 followed by "And,")
Slice 5: 「この研究は、人がコミュニケーションロボットの発話とジェスチャを割り当てるときのパターンを抽出し、自動化するシステムの構築を目指します。そして、そのシステムを発展的に開発していく手法を提案します。」 (slice 3 followed by "It then proposes a method for developing that system progressively.")
As will be described later, a "termination pattern" means a specific pattern in text at which humans tend, with at least a certain probability, to break the text when composing utterances for a computer agent. It is not necessarily limited to the periods and commas mentioned above.
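The slicing in step S102 can be sketched as follows. This is a minimal illustration, not the implementation of the RIG service unit 30: the function names, the regular expression, and the restriction of termination patterns to commas and periods are all assumptions made for the example.

```python
import re

# Assumed termination patterns: Japanese and English commas and periods.
TERMINAL = re.compile(r"[、。,.]")
SENTENCE_END = ("。", ".")


def first_sentences(text: str, n: int = 2) -> str:
    """Return the prefix of `text` that covers its first n sentences."""
    count = 0
    for pos, ch in enumerate(text):
        if ch in SENTENCE_END:
            count += 1
            if count == n:
                return text[: pos + 1]
    return text  # fewer than n sentences: use everything


def make_slices(text: str, n_sentences: int = 2) -> list[str]:
    """Cumulative slices, each running from the start to a termination pattern."""
    window = first_sentences(text, n_sentences)
    slices = [window[: m.end()] for m in TERMINAL.finditer(window)]
    if window and (not slices or slices[-1] != window):
        slices.append(window)  # trailing text without a termination pattern
    return slices
```

Applied to the example text, `make_slices` would return five cumulative slices, matching slices 1 to 5 above. A period inside a number such as "2.0" would also be treated as a boundary by this simplified pattern.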

For the purposes of explanation, assume that i text slices are obtained.

Further, the RIG service unit 30 generates (i × j) candidate combinations of j pre-registered gestures and the i text slices, and the DBDS service unit 40 stores these candidate data in the data storage (S102).
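As an illustration of the (i × j) candidate generation, the following sketch pairs every text slice with every gesture ID. The `Candidate` type and the function name are hypothetical, and storage through the DBDS service unit is omitted.

```python
from itertools import product
from typing import NamedTuple


class Candidate(NamedTuple):
    """One (text slice, gesture) pairing; the type name is hypothetical."""
    text_slice: str
    gesture_id: str


def make_candidates(slices: list[str], gesture_ids: list[str]) -> list[Candidate]:
    """Return all i x j combinations of text slices and gesture IDs."""
    return [Candidate(s, g) for s, g in product(slices, gesture_ids)]
```

With five slices and, say, ten registered gestures, this yields fifty candidates for the likelihood evaluation in step S104.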

Subsequently, the SGAE service unit 50 calculates the likelihood of each candidate by multiplying together the likelihoods that likelihood evaluation modules 1 to m (m is an integer of 2 or more), described in detail later, output for that candidate (S104).

The following likelihood evaluation modules can be used.

1) Likelihood evaluation module 1, which calculates the likelihood of breaking the text at a "termination pattern" obtained experimentally or empirically
In addition to likelihood evaluation module 1, at least one of the following likelihood evaluation modules can be configured in advance, based on likelihoods obtained experimentally or empirically, and combined with likelihood evaluation module 1.

2) Likelihood evaluation module 2, which calculates the likelihood of assigning the corresponding gesture to a keyword in the text
3) Likelihood evaluation module 3, which calculates the likelihood of selecting a gesture given its playback time
4) Likelihood evaluation module 4, which calculates the likelihood of combining a voice and a gesture based on the ratio between the playback time of the voice of the text slice and the playback time of the gesture
If likelihoods based on other factors are to be considered for text slices and gestures, further likelihood evaluation modules can be added as necessary.
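The way registered modules combine in step S104 can be sketched as a product over callables. The `Module` signature, which takes a text slice and a gesture ID and returns a likelihood, is an assumption made for this example. The text does not specify the interface of a likelihood evaluation module.

```python
from math import prod
from typing import Callable, Sequence

# Hypothetical module interface: (text slice, gesture ID) -> likelihood.
Module = Callable[[str, str], float]


def combined_likelihood(text_slice: str, gesture_id: str,
                        modules: Sequence[Module]) -> float:
    """Multiply the likelihoods returned by all registered modules."""
    return prod(m(text_slice, gesture_id) for m in modules)
```

Because the result is a plain product, adding a further module only requires appending another callable to `modules`.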

Subsequently, the SGAE service unit 50 selects the candidate with the highest likelihood (S106).

The SGAE service unit 50 determines whether the text slice of the selected candidate has reached the end of the input text data (S108). If it has not, the process returns to step S102.

If the end has been reached, the generation unit 56, for example in the server apparatus 2000, generates a scenario by concatenating the highest-likelihood candidates in order.

For example, if slice 2 is selected first, steps S102 to S106 are then repeated for the two sentences of text that follow slice 2.
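The loop through steps S102 to S108 can be summarized in the following sketch. It is a simplified reading of the flowchart under stated assumptions: termination patterns are limited to commas and periods, `score` stands in for the combined likelihood of all modules, and `gesture_ids` is non-empty. All names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

TERMINAL = re.compile(r"[、。,.]")  # assumed termination patterns


def window_slices(text: str, n_sentences: int = 2) -> List[str]:
    """Cumulative slices over the first n sentences of `text` (S102)."""
    slices: List[str] = []
    sentences = 0
    for m in TERMINAL.finditer(text):
        slices.append(text[: m.end()])
        if m.group() in ("。", "."):
            sentences += 1
            if sentences == n_sentences:
                break
    if sentences < n_sentences and (not slices or slices[-1] != text):
        slices.append(text)  # trailing text with no termination pattern
    return slices


def build_scenario(text: str, gesture_ids: List[str],
                   score: Callable[[str, str], float]) -> List[Tuple[str, str]]:
    """Greedily pick the best (slice, gesture) pair until the text is consumed."""
    scenario: List[Tuple[str, str]] = []
    rest = text
    while rest:  # S108: stop once the end of the text is reached
        candidates = [(s, g) for s in window_slices(rest) for g in gesture_ids]
        best_slice, best_gesture = max(candidates, key=lambda c: score(*c))  # S104, S106
        scenario.append((best_slice, best_gesture))
        rest = rest[len(best_slice):]  # continue with the text after the chosen slice
    return scenario
```

Each pass consumes at least one character, so the loop terminates, and concatenating the chosen slices reproduces the input text.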

FIG. 4 is a conceptual diagram showing an example of a created scenario.

As shown in FIG. 4, a scenario is a sequence of combinations of a high-likelihood text slice and a gesture ID identifying a gesture, arranged in the playback order of the input text.

(Configuration of the scenario generation device prototype)
In FIG. 1, the scenario generation device was configured so that the user interface unit 10 runs on the client device side, while on the server device side the RIG service unit 30, the DBDS service unit 40, and the SGAE service unit 50 form an SOA-based system.

The following describes the configuration of a prototype for examining the feasibility of the functions of the scenario generation device shown in FIG. 1. Note that the scenario generation device can also be realized by giving the prototype configuration below, as part of its robot command generation function, the automatic scenario generation function of the SGAE service unit 50 described with reference to FIG. 1.

FIG. 5 is a conceptual diagram for explaining the configuration of the prototype of the scenario generation device.

The configuration shown in FIG. 5 is a simplified implementation of the configuration of the scenario generation device shown in FIG. 1. The following description mainly covers experimental results obtained with the prototype configuration shown in FIG. 5.

Referring to FIG. 5, the system consists of four modules: a user interface (UI) application unit (hereinafter, UI unit) 10, a robot command generation (RIG) server 30, a database/data storage (DBDS) unit 40, and a robot renderer unit 3000.

Although not particularly limited, the robot command generation (RIG) server 30, the DBDS unit 40, and the robot renderer unit 3000 can each run on a separate server device. Alternatively, they may run on the same server device.

Further, for example, the functions of the robot renderer unit 3000 may be executed on the client device side. That is, there is no particular limitation on how the functions of the units illustrated in FIG. 3 are distributed between the server device and the client device. Alternatively, all functions may be executed on a single computer device.

The system for scenario generation receives input to the UI unit 10 and performs the following two kinds of processing.

(1) Upon receiving input to the UI unit 10, a robot command is generated. This processing is performed using the UI unit 10 and the robot command generation server 30, and the result is stored in the DBDS unit 40. These processing steps are represented by solid arrows in FIG. 3.

(2) Upon receiving input to the UI unit 10, the operation of the robot is controlled. This processing is performed using the UI unit 10, the DBDS unit 40, and the robot renderer unit 3000. These processing steps are represented by dotted arrows in FIG. 3.

Robot commands are generated by the following procedure.

First, the UI unit 10 generates the robot's voice from the input text and registers the synthesized voice information in the database of the DBDS unit 40. At the same time, the synthesized voice is saved on the data storage server of the DBDS unit 40. Next, the gesture information registered in the database is acquired based on the gesture ID selected in the UI unit 10. Finally, the synthesized voice and the gesture motion are combined to generate a robot command (scenario), which is registered in the database of the DBDS unit 40. Here too, it is assumed that the information for controlling the motion of each gesture has been created in advance by a gesture creator and stored in the DBDS unit 40 in association with its gesture ID.

Robot behavior control, on the other hand, is performed by the following procedure.

First, the UI unit 10 uses a WebSocket client to instruct the WebSocket server on the robot renderer 3000 to play back the robot command identified by a given ID. On receiving the instruction, the WebSocket server 3002 on the robot renderer 3000 sends the ID of the robot command to the command analysis module 3004. Using the received robot command ID as a key, the command analysis module 3004 queries the command serialization module 38 of the robot command generation server 30 for the command content and downloads it. The command analysis module 3004 analyzes the command content, extracts the URI information of the required voice and gesture files, and instructs the resource manager 3008 to download those resources from the data storage server of the DBDS unit 40. Finally, the command analysis module 3004 instructs the actuator controller 3006 to execute the command. Using the resources secured by the resource manager 3008, the actuator controller 3006 sends motion commands to the robot 1000 and plays back the voice and the gesture. Voice and gesture resources that the resource manager 3008 has already acquired are not downloaded again.
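The behavior described for the resource manager 3008, in which each voice or gesture resource is downloaded at most once, can be sketched as a URI-keyed cache. The class below is only an illustration: the `fetch` callable stands in for the download from the data storage server, and the in-memory dictionary is an assumption. The text does not state how resources are actually held.

```python
from typing import Callable, Dict


class ResourceManager:
    """Caches downloaded resources by URI so each is fetched only once."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # hypothetical download function
        self._cache: Dict[str, bytes] = {}

    def get(self, uri: str) -> bytes:
        """Return the resource for `uri`, downloading it on first use only."""
        if uri not in self._cache:
            self._cache[uri] = self._fetch(uri)
        return self._cache[uri]
```

A command that reuses the same gesture file several times then triggers a single download, and later playbacks are served from the cache.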

By performing the above processing, the implemented prototype system can both generate and store robot commands and execute them.

(Robot)
In the present embodiment, robovie (registered trademark) mR2 is used as the robot controlled by the system.

FIG. 6 shows the external appearance of robovie (registered trademark) mR2. FIG. 6(a) is a front view, FIG. 6(b) is a side view, and FIG. 6(c) shows the degrees of freedom of the robot.

This robot is designed to be used on a desk. It is 30.0 cm tall, has a radius of 15.0 cm, and weighs 2.0 kg. It is modeled on the human upper body and has 3 degrees of freedom in the head, 2 in the eyes, 2 in the eyelids, 4 in the arms, and 1 in the waist. A distinguishing feature of the robot is a space in the abdomen for connecting an ipod touch/iphone (registered trademark). With such a device housed there, the robot can be controlled from the mobile terminal via a serial cable. Running the control software on the mobile terminal makes it possible to control the robot without a personal computer. These features make robovie (registered trademark) mR2 highly portable and easy to introduce into a home environment. In the experiment described in the present embodiment, with the aim of an environment in which users can easily create playback content (combinations of voice and motion) for the robot, playback content for robovie (registered trademark) mR2 was created using the system of FIG. 3.

(User interface design)
FIG. 7 shows the user interface of the UI unit 10 of the prototype scenario generation system.

In the system, text and gestures are assigned through command generation windows. Each window consists of a command number, a synthesis button, a text area for entering text, and check boxes for selecting a gesture. The command number indicates the order in which the combined voice and gesture are played back. The user enters text in the text area and selects one gesture from the check boxes.

The user then presses the synthesis button to assign the robot's voice and gesture. When synthesis is complete, the system displays a confirmation screen to notify the user and changes the label of the synthesis button from "Synthesize" to "Play". By pressing the play button, the user can check whether the assigned combination of voice and gesture is appropriate. The assignment can be redone any number of times, so the user can adjust the length of the synthesized voice and the length of the assigned gesture until satisfied. Command generation windows can be added with the "Add" button at the top of the UI unit 10. By creating command generation windows in order, the user builds up the combinations of voice and gesture for the entire text to be played back.

(Evaluation of voice and gesture assignment (SGAE))
The speech and gesture assignment evaluation (SGAE) service shown in FIG. 1 evaluates a combination of voice and gesture using the product of the likelihoods calculated by the registered likelihood evaluation modules, based on the history of robot voices and gestures registered in the DBDS service unit 40. This section describes how the likelihood evaluation modules are designed in the present embodiment.

Letting L_i be the likelihood of likelihood evaluation module i, the likelihood L finally calculated by the SGAE service unit 50 is given by the following equation.
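Written out from the surrounding description, in which the likelihoods of likelihood evaluation modules 1 to m are multiplied together, equation (1) presumably has the following form. The upper limit m is taken from the earlier description of step S104.

```latex
L = \prod_{i=1}^{m} L_i \tag{1}
```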

Various methods are conceivable for how a likelihood evaluation module calculates its likelihood. In the present embodiment, the likelihood is calculated or determined by one of the following two methods.

1) When discrete data is given: rule-based likelihood determination
2) When continuous data is given: likelihood calculation based on a probability density function
Rule-based likelihood determination is used to analyze discrete data. In the present embodiment, it is used mainly to analyze the text from which the assigned voice is generated. Existing research has likewise parsed the text or analyzed it morphologically and decided which gesture to assign from the keywords and parts of speech it contains. In the present embodiment as well, the text is analyzed morphologically, and a likelihood evaluation module is defined based on the gesture assignment patterns obtained.

Likelihood calculation based on a probability density function, on the other hand, is used to analyze continuous values. Quantities that can be analyzed numerically, such as the playback time of the voice, the playback time of the gesture, or their ratio, have their patterns approximated by a probability density function.

The probability distribution most commonly used for such analysis is the Gaussian (normal) distribution, but the data obtained from the analysis do not necessarily follow a symmetric distribution. Rather, the distribution is often asymmetric and skewed. To accommodate this, the present embodiment uses the Johnson SU distribution as the approximating function. By choosing the skewness (asymmetry of the distribution) and kurtosis (heaviness of the tails) appropriately, the Johnson SU distribution allows the shape of a normal distribution to be adjusted quite freely. Its probability density function is given by equation (2) below.
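For reference, the Johnson SU density in its standard parameterization is shown below, written with the four parameters named in the text. It is an assumption that ε denotes the location parameter (often written ξ), λ the scale, and γ and δ the shape parameters, and that equation (2) takes this standard form.

```latex
f(x) = \frac{\delta}{\lambda\sqrt{2\pi}}
       \cdot \frac{1}{\sqrt{1 + z^{2}}}
       \cdot \exp\!\left( -\frac{1}{2}\bigl( \gamma + \delta \sinh^{-1} z \bigr)^{2} \right),
\qquad z = \frac{x - \varepsilon}{\lambda} \tag{2}
```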

Through its four parameters γ, δ, λ, and ε, the probability density function (2) determines the center, the spread of the tails, the skewness, and the kurtosis of the distribution.

FIG. 8 shows the Johnson SU distribution for various parameter values.

As shown in FIG. 8, a wide variety of distributions can be defined with the Johnson SU distribution.
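As a hedged illustration of how such a density could serve as a likelihood evaluation module, the sketch below evaluates the standard Johnson SU density and applies it to the ratio of voice to gesture playback time. The parameter roles (γ and δ as shape, λ as scale, ε as location) are assumed, the function names are hypothetical, and no parameter values fitted in the experiment are implied.

```python
import math


def johnson_su_pdf(x: float, gamma: float, delta: float,
                   lam: float, eps: float) -> float:
    """Standard Johnson SU density; requires delta > 0 and lam > 0."""
    z = (x - eps) / lam
    return (delta / (lam * math.sqrt(2.0 * math.pi))
            / math.sqrt(1.0 + z * z)
            * math.exp(-0.5 * (gamma + delta * math.asinh(z)) ** 2))


def ratio_likelihood(speech_seconds: float, gesture_seconds: float,
                     params: tuple) -> float:
    """Density of the voice/gesture playback-time ratio under `params`."""
    return johnson_su_pdf(speech_seconds / gesture_seconds, *params)
```

With γ = 0 the density is symmetric about ε, and a non-zero γ skews it to one side, which is the property the text relies on for asymmetric data. A density value may exceed 1, so the result is a relative score rather than a probability.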

In the present embodiment, each likelihood evaluation module of the SGAE service unit 50 is designed using one of these two approaches: a rule base, or an approximation using a probability density function.

(Data collection for constructing the SGAE service unit 50)
The robot speech and gesture assignment evaluation (SGAE) service unit 50 proposed in the present embodiment evaluates an assigned combination on the basis of robot commands (combinations of voice and gesture) constructed in the past. Constructing the SGAE service unit 50 therefore requires a history of robot commands previously generated by users.

An experiment was conducted to collect the history data needed for the likelihood evaluation modules of the SGAE service unit 50. In the experiment, subjects used the prototype system described with reference to FIG. 5 to divide the text of a given document and to assign a gesture to each resulting piece of text. The divided text and the gestures assigned to it are stored in the database in the system. After the experiment, the likelihood evaluation modules were defined based on the information stored in the database.

(Experiment overview)
FIG. 9 shows the experimental environment.

In the experiment, the subject sits in front of a table and operates a laptop. The UI unit 10 of the prototype system described with reference to FIG. 5 is running on the laptop, allowing the subject to assign the utterances and gestures of robovie (registered trademark) mR2.

The display on the right side of FIG. 9 shows images needed for the explanation (whether an image is shown on the display depends on the experimental condition). Twenty-seven university students (13 male, 14 female) took part in the experiment. They carried out the task of assigning utterances and gestures to the robot and then answered a questionnaire.

(Experimental procedure)
The experimental procedure is as follows.

(1) First, the experimenter hands the subject the text that the robot is to explain.

(2) The subject first reads the text aloud and grasps its content.

(3) The experimenter explains how to use the user interface of the implemented prototype system to the subject while actually creating the first two sentences (content created by the experimenter is excluded from the analysis), and answers questions.

(4) The subject creates robot commands (combinations of voice and gesture) covering the whole text so that the robot can explain the given text.

(5) When all commands have been created and their playback checked, the subject fills in the questionnaire.

In the present embodiment, attention was paid not only to the type of gesture but also to the length of its playback time as part of the gesture pattern. Voices and gestures were therefore assigned under two conditions: condition 1, in which text is combined with gestures of different durations, and condition 2, in which text is combined with motions of different types. The details of each condition are summarized in Table 1.

Under each condition, the text on which the robot commands were based was written from passages taken from Wikipedia pages. It explains specialized content, which tends to produce complex sentence structures, so that text division patterns arising from sentence structure could be analyzed. When synthesized with the speech synthesis software Ximera, each text is about 60.0 seconds long, with a difference of at most 0.5 seconds.

Ximera is disclosed in the following reference.

Reference: H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda, "Ximera: A New TTS from ATR Based on Corpus-Based Technologies," ISCA Speech Synthesis Workshop, pp. 179-184, 2004.
For gestures, two sets were prepared. In condition 1, ten repetitive gestures were prepared, with playback times from 2.0 to 20.0 seconds in 2.0-second steps, in order to analyze how differences in playback time affect user selection. So that the assignment would not depend on the type of gesture, the gesture creator was asked to avoid representational gestures that express the speaker's mental image and to build each gesture as a repetitive gesture that moves parts of the body alternately.

In condition 2, on the other hand, subjects assigned emblems ("byebye"), representational gestures (deictic (pointing) and depictive gestures), and repetitive gestures. Each of these gestures was offered in two durations, 5.0 seconds and 10.0 seconds, so that it could be matched to the length of the voice. Because a deictic gesture needs something to point at, in condition 2 the display shows a figure explaining complications of diabetes (the content of the figure corresponds to the content of the text).

(Configuration of the likelihood evaluation modules of the SGAE service unit 50 using experimental data)
The following defines the likelihood evaluation modules of the SGAE service unit 50 based on the history of voice and gesture assignments collected in the experiment. In the present embodiment, four analyses were performed: two rule-based evaluations and two evaluations based on probability density. Their contents are as follows.

1) Analysis of text termination patterns based on sentence structure analysis
2) Analysis of assignment patterns between gestures and keywords
3) Analysis of assignment patterns based on gesture playback time
4) Analysis of assignment patterns based on the ratio between voice and gesture playback times
A combination of voice and gesture is evaluated from the product (equation (1)) of the likelihoods determined by the rules or approximating functions obtained from these analyses.

(Analysis of text termination patterns based on sentence structure analysis)
Analysis of the data obtained in the experiment showed that sentence structure, and punctuation in particular, had a marked effect on how the text underlying the voice was divided. This section derives rules for determining the likelihood from the termination pattern of the input text. The analysis covered all combinations of voice and gesture from experimental conditions 1 and 2.

被験者が分割したテキストの終端をみると、そのほとんどが、句読点で分割されていた。 Looking at the end of the text that the subject split, most of them were split with punctuation marks.

特に句点「。」では、ほぼ１００％の確率で分割されており、顕著なパターンといえる。次に読点「、」による分割が多い。逆に、句読点以外の場所で分割されているケースは、稀であった。読点による分割は、その前にくる文章構造から、その分割割合に偏りが見られた。読点の前の文章構造のパターンは様々であったが、本実施の形態では、その中でも分割の割合が高い、動詞と前置詞の組み合わせ、動詞句（あるいは動詞節）による読点分割と、それ以外の読点分割に分けてその割合を計算した。句点、読点（動詞句、動詞句以外）、その他による分割の割合を、表２に示す。表２に示されるように、隣接するパターンの割合を比較すると、２．０倍程度の開きがあることがわかる。 In particular, the punctuation mark “.” Is divided with a probability of almost 100%, which is a remarkable pattern. Next, there are many divisions by the reading “,”. On the contrary, the case where it was divided in places other than punctuation marks was rare. The division by punctuation was biased in the division ratio due to the sentence structure that preceded it. The pattern of the sentence structure before the punctuation was various, but in this embodiment, the division ratio is high, the combination of verb and preposition, punctuation division by verb phrase (or verb clause), and other The percentage was calculated by dividing it into reading divisions. Table 2 shows the percentages of division by phrases, punctuation (other than verb phrases and verb phrases), and others. As shown in Table 2, when the ratios of adjacent patterns are compared, it can be seen that there is an opening of about 2.0 times.

文章構造に基づく尤度評価モジュールは、表２に示されるパターンと割合を用いて、与えられたロボットの音声とジェスチャの組み合わせの尤度を決定する。 The likelihood evaluation module based on the sentence structure uses the patterns and ratios shown in Table 2 to determine the likelihood of a given robot voice and gesture combination.
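The lookup performed by this module can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: Table 2 is not reproduced in this text, so the ratios in `PLACEHOLDER_SPLIT_RATIO` are placeholders chosen only to mirror the description (periods split almost always, adjacent patterns differ by about 2.0 times). The function names and the `ends_with_verb_phrase` flag, which a morphological analyzer would supply, are hypothetical.

```python
# Illustrative split ratios per termination pattern. These are placeholders,
# NOT the values of Table 2: they only mirror the description that periods
# are split almost always and adjacent patterns differ by about 2.0 times.
PLACEHOLDER_SPLIT_RATIO = {
    "period": 1.0,
    "comma_after_verb_phrase": 0.5,
    "comma_other": 0.25,
    "other": 0.125,
}


def terminal_pattern(segment: str, ends_with_verb_phrase: bool = False) -> str:
    """Classify the termination pattern of a text segment candidate.

    `ends_with_verb_phrase` would come from a morphological analyzer;
    it is passed in here to keep the sketch self-contained.
    """
    stripped = segment.rstrip()
    if stripped.endswith("。"):
        return "period"
    if stripped.endswith("、"):
        return "comma_after_verb_phrase" if ends_with_verb_phrase else "comma_other"
    return "other"


def structure_likelihood(segment: str, ends_with_verb_phrase: bool = False) -> float:
    """Return the likelihood that a segment would be cut at this point."""
    return PLACEHOLDER_SPLIT_RATIO[terminal_pattern(segment, ends_with_verb_phrase)]
```

With these placeholder values, a candidate ending in a period is preferred over one ending in a comma, and a comma after a verb phrase is preferred over any other comma.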

（人が自身で説明する場合とロボットの行動を生成する場合におけるテキスト終端パターンの違いの検証）
本実施の形態が提案する手法の有効性を示すためには、提案手法を使って人−人と人−ロボットのデータでそれぞれモデル化した場合，尤度関数のパラメータが全く異なったものになることと共に、モデル化に使用したロボットを制御する場合には提案手法のほうが良い結果をもたらすことを示す必要がある。 (Verification of differences in text termination patterns between human explanations and robot behavior generation)
To show the effectiveness of the method proposed in this embodiment, it is necessary to show that the parameters of the likelihood functions are completely different when the method is used to build models from human-human data and from human-robot data, and also that the proposed method gives better results when controlling the robot used for modeling.

ジェスチャに関して、人とロボットのジェスチャを合わせて、対照実験を設定することが困難であるため、このモジュールを例に挙げて、人の自身の行動モデルと作成者の音声とジェスチャをアサインするモデルのパラメータが異なることを示し、その有効性を検証する。 Because it is difficult to set up a controlled experiment that matches human and robot gestures, this module is taken as an example to show that the parameters of a person's own behavior model differ from those of the model with which a creator assigns speech and gestures, thereby verifying the effectiveness of the method.

人の自身の行動モデルと、ロボットの音声とジェスチャをアサインする際のモデルの違いについて調べるために、実験の教示時に被験者がシナリオを朗読した音声の解析を行なった。解析では、２人の解析者を用意し、朗読に使用したシナリオに対して、「被験者の朗読を聞き、文章を区切っていると思う部分に斜線を入れる」ように教示した。そして、２人の被験者が共に、「文章を区切っている」と解釈した部分をテキストの終端とし、本文中の「文章構造解析に基づく、テキスト終端パターンによる分析」と同様に、句点、読点（動詞句区切り）、読点（動詞句区切り以外）、その他に分類して、その平均値と割合を求めた。表３に、朗読の解析結果とアサインの解析結果のデータの比較を示す。 To examine the difference between a person's own behavior model and the model used when assigning robot speech and gestures, we analyzed recordings of the subjects reading the scenario aloud during the instruction phase of the experiment. Two analysts were recruited and instructed, for the scenario used in the reading, to “listen to the subject's reading and put a slash wherever you think the sentence is being broken.” Portions that both of them judged to be breaks were taken as text terminations and, as in the section “Analysis by text termination pattern based on sentence structure analysis,” were classified into periods, commas (verb phrase break), commas (other than verb phrase break), and others, and the mean and ratio were computed. Table 3 compares the results of the reading analysis with those of the assignment analysis.

表３において、平均値は、被験者が一つのシナリオについて入れた該当する終端パターンの平均値を示す。割合は、該当する終端パターンの全終端パターンにおける割合を表す（アサインにおける割合は、シナリオ中に存在する該当の終端パターンの総数と、実際にその終端パターンで区切られた数の割合を示すため、母数が異なる）。表３に示されるように、それぞれの句点の平均値は、ほぼ同一になるもののその他の終端パターンについては、アサイン時は句点の平均値から減少していくのに対し、朗読時は増加し、必然的にその割合も変化する。 In Table 3, the mean is the average number of occurrences of the corresponding termination pattern that a subject produced in one scenario. The ratio is the share of the corresponding termination pattern among all termination patterns (for assignment, the ratio is the number of segments actually delimited by the pattern relative to the total number of occurrences of that pattern in the scenario, so the denominators differ). As Table 3 shows, the means for periods are almost identical, but for the other termination patterns the means decrease from the period mean in the assignment case, whereas they increase in the reading case, and the ratios inevitably change as well.

この違いは、人が自身で朗読するときは、イントネーションや緩急をつけて、文章の区切りをつけられるのに対し、ロボットにおいては音声合成機能（本実施の形態においては、音声合成ソフトXimera）の限界のため、そのような表現が難しいことに起因すると考えられる。つまり、表現の多様性が保証されている環境下においては、人は文章を短く切って説明しようとするのに対し、区切りの表現手段が限られている環境下においては、文章を長く区切るように方針を転換していると考えられる。 This difference is thought to arise because a person reading aloud can mark sentence breaks through intonation and changes of pace, whereas for the robot such expression is difficult owing to the limits of its speech synthesis function (in this embodiment, the speech synthesis software Ximera). In other words, in an environment where a variety of expression is available, people try to explain by cutting sentences short, whereas in an environment where the means of expressing breaks are limited, they are thought to change policy and divide the text into longer segments.

以上の結果より、人が自身で発話する場合とロボットに表現させる場合では、その文章の区切り方についてパターンが異なることが示された。このような文章の区切り間隔に関する方針の転換は、聞き手にとってどのような効果を及ぼすのかを以下さらに検討する。上記の仮説を検証するために、被験者が作成したシナリオの中で、音声とジェスチャの組み合わせの長さの平均が、ａ）最も長いシナリオと、ｂ）最も短いシナリオを、実験を行った９名(男性４名、女性５名) の被験者に比較してもらい、どちらが好ましいか評価してもらった。結果、全ての被験者がａ）のシナリオの方が良いと回答した。以上の結果より、ロボットの音声とジェスチャのアサインにおける文章区切りの方針転換は、聞き手に対しても好ましい印象を与える効果を持つといえる。
（ジェスチャとキーワードのアサインパターンの分析）
音声に対して割り当てられるジェスチャは、ランダムな確率になるのかについて以下検討する。これまでの関連研究では、あるキーワードが音声テキスト中に含まれると、エンブレムや表象などのジェスチャが適用されるアルゴリズムを採用しているシステムが多い。音声テキスト中のあるキーワードやその組み合わせによって、特定の種類のジェスチャがアサインされる傾向があることは自明のことであると考えられる。 These results show that sentences are segmented according to different patterns when a person speaks for themselves and when a person has a robot express the content. We next examine what effect this change of policy regarding segment length has on listeners. To verify the above hypothesis, the nine subjects who took part in the experiment (4 males, 5 females) compared, among the scenarios created by the subjects, a) the scenario with the longest and b) the scenario with the shortest average length of speech-gesture combinations, and judged which was preferable. All subjects answered that scenario a) was better. From these results, the change of segmentation policy in assigning robot speech and gestures can be said to give listeners a favorable impression as well.
(Analysis of gesture and keyword assignment patterns)
We next consider whether the gestures assigned to speech are chosen with random probability. In related research so far, many systems adopt algorithms that apply gestures such as emblems and representational gestures when a certain keyword is included in the speech text. It is considered self-evident that certain keywords in the speech text, or combinations of them, tend to lead to the assignment of specific types of gestures.

ここでも、音声テキスト中に含まれるキーワードとジェスチャのアサインパターンを分析し、ルールベースに尤度を決定する尤度評価モジュールを定義する。これまで既存研究で分類されたジェスチャの種類はあえて使わず、登録された個別のジェスチャとキーワードの関係のみによって尤度を決定する方針を採る。 Here too, we analyze the assignment patterns between keywords in the speech text and gestures, and define a likelihood evaluation module that determines the likelihood on a rule basis. The gesture categories established in existing research are deliberately not used; instead, the likelihood is determined solely from the relationship between individually registered gestures and keywords.

本実施の形態では、形態素解析ソフトChasenを利用して、アサインされたテキストを形態素解析し、名詞、動詞、形容詞、副詞、接続詞の６種類の品詞に関して、それぞれのワードにアサインされたジェスチャとその割合を分析し、マトリックスを作成した。 In this embodiment, the assigned text was morphologically analyzed with the morphological analysis software ChaSen, and for the six types of parts of speech (nouns, verbs, adjectives, adverbs, and conjunctions), the gestures assigned to each word and their ratios were analyzed to create a matrix.

なお、Chasenについては、以下に開示がある。 ChaSen is described in the following reference.

文献：松本裕治，北内啓，山下達雄，平野善隆，松田寛, 高岡一馬, 浅原正幸, ”日本語形態素解析システム『茶筌』version 2.2.1 使用説明書,”, Dec, 2000.
分析された結果の中で特に特徴的なキーワード、「バイバイ」と「図」に関して、アサインされたジェスチャのラベルと、その割合を表４、表５に示す。 Reference: Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka, Masayuki Asahara, “Japanese Morphological Analysis System ChaSen version 2.2.1 Manual,” Dec. 2000.
Tables 4 and 5 show the labels of the assigned gestures and their ratios with respect to the keywords “bye-bye” and “figure” that are particularly characteristic among the analyzed results.

アサインされているジェスチャの中で、ラベルに”byebye”が含まれるジェスチャは、ロボットが手を振るエンブレムジェスチャである。後半のshort/long でその長さが５秒であるか、１０秒であるかが示される。対して、”point”が含まれるジェスチャは、ロボットがディスプレイを見ながら、指をさす直示的ジェスチャであり、長さに関する定義は”byebye”のものと同様である。最後に”high”を含むジェスチャは両手を頭の高さまで広げ、下ろすという、大きさや程度が大きいことを示すための描画的ジェスチャである。これらの例から、本節で定義したキーワードとジェスチャのアサインが機能していることが確認できたと考える。 Among the assigned gestures, those whose label contains “byebye” are emblem gestures in which the robot waves its hand. The short/long suffix indicates whether the gesture lasts 5 seconds or 10 seconds. In contrast, gestures whose label contains “point” are deictic gestures in which the robot points with its finger while looking at the display; the length suffix is defined in the same way as for “byebye”. Finally, gestures whose label contains “high” are depictive gestures that indicate a large size or degree: both hands are spread up to head height and then lowered. These examples are taken to confirm that the keyword-gesture assignment defined in this section is working.

テキストを解析する際、複数のキーワード（名詞、動詞、形容詞、副詞、接続詞）が抽出される場合がある。この場合は、それぞれのキーワードのマトリックスを解析し、もっとも候補ジェスチャが少ないもの（候補ジェスチャが少ないということは、あるジェスチャが選択的にアサインされていることを意味する）を採用する。また、該当するマトリックスがひとつもない場合は、１．０を返すこととする。 When analyzing text, a plurality of keywords (nouns, verbs, adjectives, adverbs, conjunctions) may be extracted. In this case, the matrix of each keyword is analyzed, and the one with the fewest candidate gestures (the fact that there are few candidate gestures means that a certain gesture is selectively assigned) is adopted. If there is no corresponding matrix, 1.0 is returned.
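The selection rule described above (adopt the keyword matrix with the fewest candidate gestures, and return 1.0 when no matrix applies) can be sketched as follows. This is an illustrative sketch only: the gesture labels echo those mentioned in the text, but the ratios in `EXAMPLE_MATRICES` are invented placeholders rather than the values of Tables 4 and 5, and returning 0.0 for a gesture missing from the adopted matrix is an assumption not stated in the source.

```python
from typing import Dict, List

# Hypothetical relation matrices: keyword -> {gesture label: assignment ratio}.
# The ratios are made-up placeholders, NOT the values of Tables 4 and 5.
Matrix = Dict[str, float]

EXAMPLE_MATRICES: Dict[str, Matrix] = {
    "バイバイ": {"byebye_short": 0.6, "byebye_long": 0.4},
    "図": {"point_short": 0.5, "point_long": 0.3, "high": 0.2},
}


def keyword_likelihood(keywords: List[str], gesture: str,
                       matrices: Dict[str, Matrix]) -> float:
    """Likelihood of `gesture` given the keywords extracted from a segment."""
    known = [matrices[k] for k in keywords if k in matrices]
    if not known:
        # No matrix applies to any keyword: return the neutral value 1.0.
        return 1.0
    # Adopt the matrix with the fewest candidate gestures (most selective).
    selective = min(known, key=len)
    # Assumption: a gesture absent from the adopted matrix gets likelihood 0.0.
    return selective.get(gesture, 0.0)
```

Returning 1.0 when no matrix applies keeps this module neutral when its likelihood is multiplied with those of the other modules.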

テキストとジェスチャのアサインパターンに基づく尤度評価モジュールは、表4, 5 に示されるパターンと割合に代表される関係マトリックス群を用いて、与えられたロボットの音声とジェスチャの組み合わせの尤度を決定する。
（ジェスチャの再生時間に基づく、アサインパターンの分析）
本実施の形態では、ロボットのジェスチャは、自動生成するものではなく、ジェスチャのクリエータによって作成され、ＤＢＤＳサービス部４０に登録されるものである。 The likelihood evaluation module based on text and gesture assignment patterns determines the likelihood of a given combination of robot speech and gesture using the group of relation matrices represented by the patterns and ratios shown in Tables 4 and 5.
(Analysis of assignment pattern based on gesture playback time)
In this embodiment, the gesture of the robot is not automatically generated, but is created by the gesture creator and registered in the DBDS service unit 40.

ロボットのジェスチャを作ることを仮定したとき、どれくらいの長さのモーションがユーザに好まれるかということについてはまだ明らかにされていない。ここでは、ジェスチャの長さに対する被験者の選択傾向を評価する尤度評価モジュールを定義する。 It has not yet been clarified how long a motion is favored by the user, assuming a robot gesture is made. Here, a likelihood evaluation module that evaluates a subject's selection tendency with respect to the length of a gesture is defined.

まず、実験の条件１において、割り振られた全てのジェスチャに関して、選択された回数を求めた。課題の特性上、文章を短く区切り、短いジェスチャを割り当てるとより多くのロボット命令を作れる。そのため、短いジェスチャが選択される回数は長いジェスチャと比べて増加する傾向にある。この不均衡を解消するため、選択傾向の評価には、それぞれのジェスチャが選択された回数に、そのジェスチャが文章全体に占める割合を乗算した値Egi を用いた。Egi はi 番目のジェスチャgi の評価値を表し、以下の式によって与えられる。 First, in the condition 1 of the experiment, the selected number of times was obtained for all the allocated gestures. Due to the characteristics of the task, you can make more robot commands by dividing sentences and assigning short gestures. Therefore, the number of times a short gesture is selected tends to increase compared to a long gesture. In order to eliminate this imbalance, the selection tendency was evaluated by using the value Egi, which is obtained by multiplying the number of times each gesture was selected by the ratio of the gesture to the whole sentence. Egi represents the evaluation value of the i-th gesture gi and is given by the following equation.

ここで、cgi は全ての試行を通してそのジェスチャが採用された数、dgi は該当するジェスチャの長さ、式の分母は音声の長さdsj の総和、つまり、文章の長さ(実験条件により、６０．０秒程度となる) を表す(分割後の音声の平均時間は７．０秒となった)。 Here, cgi is the number of times the gesture was adopted across all trials, dgi is the length of the gesture, and the denominator of the expression is the sum of the speech lengths dsj, that is, the length of the whole text (about 60.0 seconds under the experimental conditions). The mean duration of the speech segments after division was 7.0 seconds.

図１０は、式（３）で得られた各評価値を式（４）で正規化したNgi をプロットしたものである。 FIG. 10 is a plot of Ngi obtained by normalizing each evaluation value obtained by Expression (3) by Expression (4).
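The evaluation value and its normalization can be sketched as follows. The expressions themselves are not reproduced in this text, so the sketch reconstructs Expression (3) from the verbal definition (selection count times gesture length, divided by the total speech length) and assumes that Expression (4) divides each value by the sum of all values, which is consistent with the statement that the plotted points sum to 1.0. Function and variable names are hypothetical.

```python
from typing import Dict, List


def evaluation_values(counts: Dict[str, int],
                      gesture_lengths: Dict[str, float],
                      speech_lengths: List[float]) -> Dict[str, float]:
    """Egi = cgi * dgi / sum_j dsj for every gesture gi."""
    total_speech = sum(speech_lengths)  # length of the whole text in seconds
    return {g: counts[g] * gesture_lengths[g] / total_speech for g in counts}


def normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Assumed form of Expression (4): divide by the sum so values total 1.0."""
    total = sum(values.values())
    return {g: v / total for g, v in values.items()}


# Made-up example: a 5 s gesture chosen 12 times, a 10 s gesture chosen 4 times,
# over a text whose speech segments total 60 s.
example = normalize(evaluation_values(
    {"short": 12, "long": 4}, {"short": 5.0, "long": 10.0}, [20.0, 20.0, 20.0]))
```

Weighting the counts by gesture length offsets the advantage that short gestures gain simply from fitting into more commands.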

これらの分布を近似するジョンソンＳＵ分布のパラメータγ，δ，λ、εを最小二乗法を用いて、以下の式（５）のように求めた。 The Johnson SU distribution parameters γ, δ, λ, and ε that approximate these distributions were obtained by the following equation (5) using the least square method.

このようにして得られたパラメータによるジョンソンＳＵ分布の推移を図１０に示す。プロットされている値に比べて、近似曲線が低い位置にあるのは、プロットされている点はその値の合計が１．０になるのに対し、近似式はその面積が１．０になる関数であるためである。 The transition of the Johnson SU distribution according to the parameters obtained in this way is shown in FIG. The approximate curve has a lower position than the plotted value because the sum of the plotted points is 1.0 while the approximate expression has an area of 1.0. This is because it is a function.

ジェスチャの再生時間に基づく尤度評価モジュールは、式（２）とパラメータ式（５）によって定義される確率密度関数に基づき、与えられたロボットの音声とジェスチャの組み合わせの尤度を決定する。 The likelihood evaluation module based on the reproduction time of the gesture determines the likelihood of a given combination of voice and gesture of the robot based on the probability density function defined by Equation (2) and Parameter Equation (5).
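A playback-time module of this kind can be sketched as follows, assuming that Expression (2) is the standard textbook form of the Johnson SU probability density with shape parameters γ and δ, scale λ, and location ε. The fitted values of Expression (5) are not reproduced in this text, so `PLACEHOLDER_PARAMS` holds invented values, chosen only so that the density peaks at 10.0 seconds; all function names are hypothetical.

```python
import math


def johnson_su_pdf(x: float, gamma: float, delta: float,
                   lam: float, epsilon: float) -> float:
    """Standard Johnson SU density; `epsilon` is location, `lam` is scale."""
    z = (x - epsilon) / lam
    w = gamma + delta * math.asinh(z)
    return (delta / (lam * math.sqrt(2.0 * math.pi))
            / math.sqrt(1.0 + z * z) * math.exp(-0.5 * w * w))


def duration_likelihood(duration_s: float, params: dict) -> float:
    """Likelihood of a combination with the given playback time in seconds."""
    return johnson_su_pdf(duration_s, **params)


# Placeholder parameters, NOT the fitted values of Expression (5).
PLACEHOLDER_PARAMS = {"gamma": 0.0, "delta": 1.5, "lam": 4.0, "epsilon": 10.0}
```

With γ = 0 the density is symmetric about ε; a nonzero γ skews it, which is what allows this family to follow asymmetric data.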

この尤度評価モジュールは、ジェスチャの再生時間に基づいているが、図１０に示される分布の意味は、音声とジェスチャの組み合わせの長さ（再生時間）が短すぎれば、より長い組み合わせを模索し、逆に長すぎれば、より短い組み合わせを模索するという人の選択傾向だと考えられる。 Although this likelihood evaluation module is based on the reproduction time of a gesture, the meaning of the distribution shown in FIG. 10 is to seek a longer combination if the length of the combination of voice and gesture (reproduction time) is too short. On the other hand, if it is too long, it is thought that it is a person's selection tendency to seek a shorter combination.

今回の実験において、その長短の判断が分かれる境目は１０．０秒付近だということが示された。従って、この尤度評価モジュールを導入することで、システムは発話音声が１０．０秒以下になった場合、より長い組み合わせを探し、逆に１０．０秒以上になった場合はより短い組み合わさを模索するように動作する。
（音声とジェスチャの再生時間の比率に基づく、アサインパターンの分析）
ロボットの音声とジェスチャをアサインするとき、ユーザの観点から重要であると考えられる要因の一つが、音声とジェスチャの再生時間の比率である。ユーザはなるべく、音声とジェスチャの長さが一致するように組み合わせを決定するものと考えられる。本節では、音声とジェスチャの再生時間に基づく、尤度評価モジュールを定義する。 This experiment showed that the boundary between judging a combination too short and judging it too long lies around 10.0 seconds. Introducing this likelihood evaluation module therefore makes the system look for a longer combination when the utterance is 10.0 seconds or shorter, and for a shorter combination when it is 10.0 seconds or longer.
(Assignment pattern analysis based on the ratio of voice and gesture playback time)
When assigning the voice and gesture of the robot, one of the factors considered to be important from the viewpoint of the user is the ratio of the reproduction time of the voice and the gesture. It is considered that the user decides the combination so that the lengths of the voice and the gesture match as much as possible. In this section, we define a likelihood evaluation module based on the playback time of speech and gestures.

図１１は、実験で得られた全ての試行（条件１,条件２を含む）の、各命令における音声とジェスチャの再生時間の比率のヒストグラムを示す図である。 FIG. 11 is a diagram showing a histogram of the ratio of the reproduction time of the voice and the gesture in each command for all trials (including condition 1 and condition 2) obtained in the experiment.

命令i における、比率ri は以下の式（６）で計算される。 The ratio ri in the instruction i is calculated by the following equation (6).

ここで、si は、命令i における音声の再生時間、giは、命令i におけるジェスチャの再生時間を表す。 Here, si represents the playback time of the voice in command i, and gi represents the playback time of the gesture in command i.

図１１を見ると、比率１．０までは正規分布に対応した増加を見せるが、１．０を越えたところで、急激に値が減少する傾向が見える。このヒストグラムの分布を近似するジョンソンＳＵ分布のパラメータγ，δ，λ、εを最小二乗法を用いて、以下の式（７）のように求めた。 Referring to FIG. 11, an increase corresponding to the normal distribution is shown up to the ratio 1.0, but when the value exceeds 1.0, the value tends to decrease rapidly. The parameters γ, δ, λ, and ε of the Johnson SU distribution that approximates the distribution of this histogram were obtained by the following equation (7) using the least square method.

上記、パラメータによって近似される確率密度の分布が図１１に示される。 The distribution of probability density approximated by the above parameters is shown in FIG.

ジェスチャと音声の再生時間の比率に基づく尤度評価モジュールは、式（２）とパラメータ式（７）によって定義される確率密度関数に基づき、与えられたロボットの音声とジェスチャの組み合わせの尤度を決定する。
（ジョンソンＳＵ分布による近似の有効性の検証）
本実施の形態では、プロトタイプシステムによって得られた音声とジェスチャの組み合わせの履歴データのパターンが、ガウス分布による近似と比較して、尖度が違ったり、歪んでいたり、左右非対称である場合においても、より正確にその分布を近似できるようにジョンソンＳＵ分布を用いて、近似を行なった。本節では、その有効性について検証する。 The likelihood evaluation module based on the ratio of gesture and speech playback times determines the likelihood of a given combination of robot speech and gesture based on the probability density function defined by Expression (2) and the parameters of Expression (7).
(Verification of effectiveness of approximation by Johnson SU distribution)
In this embodiment, the Johnson SU distribution was used so that the pattern of the history data of speech-gesture combinations obtained with the prototype system can be approximated more accurately even when, compared with a Gaussian approximation, it differs in kurtosis, is skewed, or is asymmetric. This section verifies the effectiveness of that choice.

図１２および図１３は、正規分布とジョンソンＳＵ分布による尤度評価モデル近似曲線を示す図である。 12 and 13 are diagrams showing likelihood evaluation model approximate curves based on the normal distribution and the Johnson SU distribution.

図中、点線によって示される曲線が、ガウス分布による近似曲線である。対して、実線によって示される曲線がジョンソンＳＵ分布による近似曲線である。 In the figure, a curve indicated by a dotted line is an approximate curve based on a Gaussian distribution. On the other hand, the curve indicated by the solid line is an approximate curve based on the Johnson SU distribution.

図１２を見るとわかるように、ジェスチャの再生時間の選択傾向の近似に関しては、２つの近似にあまり変化はなかった（ジョンソンＳＵ分布の方が残差がわずかに小さい）。 As FIG. 12 shows, for the selection tendency of gesture playback time, the two approximations differed little (the Johnson SU distribution has a slightly smaller residual).

対して、図１３から、ジェスチャと音声の再生時間の比率分布に関しては、ジョンソンＳＵ分布の方がその特徴を捉えていることがわかる。ジェスチャと音声の再生時間の比率の分布は、比率が１．０を越えた時点で発生確率が急速に低下する傾向がある（歪んでいる）。本実施の形態で提案したジョンソンＳＵ分布による近似はその特徴を捉えることができたと考える。
（ＳＧＡＥサービス部５０のシミュレーション）
上記、実験データを元にして構成したＳＧＡＥサービス５０の尤度評価モジュールが出力する尤度の総積が、実際に適切な音声とジェスチャの組み合わせを提示できるのか、厳密に検証することは非常に難しい問題である。 In contrast, FIG. 13 shows that the Johnson SU distribution better captures the characteristics of the distribution of the ratio of gesture and speech playback times. In this distribution, the probability of occurrence tends to drop rapidly once the ratio exceeds 1.0 (the distribution is skewed). The Johnson SU approximation proposed in this embodiment is considered to have captured this feature.
(SGAE service unit 50 simulation)
It is very difficult to verify rigorously whether the product of the likelihoods output by the likelihood evaluation modules of the SGAE service unit 50, which were configured from the above experimental data, actually yields appropriate combinations of speech and gesture.

音声とジェスチャの組み合わせ総数は、相当な数に達し、それら全てを検証することは困難である。 The total number of voice and gesture combinations reaches a considerable number, and it is difficult to verify all of them.

対して、モジュール一つ一つの評価をすることは、サービス全体の評価につながらない。例えば、キーワードのみでジェスチャを選択したとしても、音声とジェスチャの長さの比率が１．０から極端に遠くなれば、その組み合わせに対する評価は低くなると考えられる。一方、音声とジェスチャの比率を一定に保とうとすれば、文章構造を反映してテキストを分割することが困難になる。 On the other hand, evaluating each module individually does not lead to an evaluation of the entire service. For example, even if a gesture is selected only by a keyword, if the ratio of the length of the voice and the gesture is extremely far from 1.0, the evaluation for the combination is considered to be low. On the other hand, if the ratio of speech to gesture is kept constant, it becomes difficult to divide the text by reflecting the sentence structure.

ここでは、ＳＧＡＥサービス部５０の簡単な検証のため、シミュレーションを行い、特徴的な組み合わせを作成した３名の被験者の音声とジェスチャの組み合わせを、別の被験者が評価し、評価が高い組み合わせと、算出される尤度が高い組み合わせが一致することを確認する。 Here, as a simple verification of the SGAE service unit 50, a simulation was performed: the speech-gesture combinations of three subjects who had created characteristic combinations were evaluated by other subjects, to confirm that the highly rated combinations coincide with the combinations for which a high likelihood is calculated.

（シミュレーション手順）
シミュレーションでは、条件１のシナリオと条件２のシナリオの２種類のシナリオに関して、それぞれ被験者たちが作ったジェスチャと音声の組み合わせの尤度を求め、その平均を求めた。そして、それぞれのシナリオについて、以下の３種類をを選出した。 (Simulation procedure)
In the simulation, the likelihood of the combination of gestures and voices made by the subjects for each of two types of scenarios, the condition 1 scenario and the condition 2 scenario, was obtained, and the average was obtained. The following three types were selected for each scenario.

Min.上述したように定義した尤度評価モジュールによって算出した尤度の平均が最も低いシナリオ
Mid.上述したように定義した尤度評価モジュールによって算出した尤度の平均が中間値であるシナリオ
Max.上述したように定義した尤度評価モジュールによって算出した尤度の平均が最も高いシナリオ
そして、男女９名(うち、男性３名、女性６名) の被験者が３つのシナリオを評価し、聞き取りやすいと思った順に不等号・等号を用いて並び替えてもらった。表６に、用いた不等号と、そのスコアを示す。 Min. Scenario with the lowest average likelihood calculated by the likelihood evaluation module defined above
Mid. A scenario in which the mean of likelihoods calculated by the likelihood evaluation module defined above is an intermediate value
Max. The scenario with the highest average likelihood calculated by the likelihood evaluation modules defined above. Nine subjects (3 males, 6 females) then evaluated the three scenarios and ranked them, using inequality and equality signs, in order of how easy they were to listen to. Table 6 shows the signs used and their scores.

それぞれのシナリオの評価方法は、次のとおりである。 The evaluation method for each scenario is as follows.

（１）一番、評価が低いシナリオのスコアを１とする
（２）一番低いシナリオから、加算スコアに基づいて、スコアを順に加算していく
例えば、Max.> Min.>> Mid.という評価をした場合、シナリオMid.の評価１．０、Min.の評価は３．０、Max.の評価は４．０となる。
（シミュレーション結果）
図１４および図１５は、算出した各シナリオのスコアをANOVA(一要因被験者内分析) を用いて評価した。評価結果を示す図である。 (1) The score of the lowest-rated scenario is set to 1. (2) Starting from the lowest-rated scenario, scores are added in order according to the added score of each sign. For example, for the evaluation Max. > Min. >> Mid., scenario Mid. scores 1.0, Min. scores 3.0, and Max. scores 4.0.
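Scoring steps (1) and (2) can be sketched as follows. Table 6 is not reproduced in this text: the added scores for ">" and ">>" are inferred from the worked example (Mid. 1.0, Min. 3.0, Max. 4.0), and the value 0.0 for "=" is an assumption. The function name and the token-list input format are hypothetical.

```python
from typing import Dict, List

# Added score per sign. The values for ">" and ">>" are inferred from the
# worked example in the text; "=" -> 0.0 is an assumption (Table 6 not shown).
ADDED_SCORE = {"=": 0.0, ">": 1.0, ">>": 2.0}


def score_ranking(tokens: List[str]) -> Dict[str, float]:
    """Score a ranking given best-first, e.g. ["Max", ">", "Min", ">>", "Mid"]."""
    names = tokens[0::2]
    signs = tokens[1::2]
    scores = {names[-1]: 1.0}  # (1) the lowest-rated scenario scores 1
    running = 1.0
    # (2) walk upward from the lowest scenario, adding the score of each sign
    for name, sign in zip(reversed(names[:-1]), reversed(signs)):
        running += ADDED_SCORE[sign]
        scores[name] = running
    return scores


example_scores = score_ranking(["Max", ">", "Min", ">>", "Mid"])
```

For the ranking in the text, `example_scores` maps Mid to 1.0, Min to 3.0, and Max to 4.0.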
(simulation result)
FIGS. 14 and 15 show the results of evaluating the calculated scores of each scenario using ANOVA (one-factor within-subjects analysis).

図１４に示されるように、条件１のシナリオに対する評価では、評価のスコアに対して、有意な差が確認された(p < .01; F = 10.77)。また、多重比較検定(LSD) において、尤度の平均が最大のシナリオが、尤度の平均が最小のシナリオより評価が有意に高いこと(p < .05)、また、尤度の平均が中間値であるシナリオが、平均値が最小のシナリオより評価が有意に高いこと(p < .05) が確認された。以上の結果より、尤度の平均が最小値のシナリオより、中間値、最大値のシナリオの方が高く評価されることが示された。条件１のシナリオにおいては、尤度が中間値のシナリオと最大値のシナリオ間には、有意な差は確認できなかった。 As shown in FIG. 14, in the evaluation of the condition 1 scenarios, a significant difference in the evaluation scores was confirmed (p < .01; F = 10.77). In a multiple comparison test (LSD), the scenario with the highest average likelihood was rated significantly higher than the scenario with the lowest average likelihood (p < .05), and the scenario with the intermediate average likelihood was also rated significantly higher than the scenario with the lowest average likelihood (p < .05). These results show that the scenarios with the intermediate and the highest average likelihood are rated higher than the scenario with the lowest average likelihood. For the condition 1 scenarios, no significant difference was found between the scenario with the intermediate likelihood and the scenario with the highest likelihood.

一方、図１５に示すように、条件２のシナリオに対する評価でも、スコアに対して、有意な差が確認された(p < .05; F = 4.90)。 On the other hand, as shown in FIG. 15, even in the evaluation for the scenario of condition 2, a significant difference was confirmed with respect to the score (p <.05; F = 4.90).

また、多重比較検定において、尤度の平均が最大のシナリオが、尤度の平均が中間値のシナリオと比べて評価が有意に高いこと(p < .05) が示された。他の条件間には有意な差は確認できなかった。 In the multiple comparison test, the scenario with the highest average likelihood was rated significantly higher than the scenario with the intermediate average likelihood (p < .05). No significant difference was found between the other conditions.

以上により、少なくとも、尤度が最大のスコアとなるシナリオは、ユーザにとって、好ましいとの評価を受けていることがわかる。 From the above, it can be seen that at least the scenario with the highest likelihood score is evaluated as favorable for the user.

したがって、入力された文章を分割して、作成し得るロボットの命令（音声とジェスチャの組み合わせ）をＳＧＡＥサービス部５０を用いて評価し、もっとも高い尤度を持つ組み合わせを選択していくことで、ロボットの命令生成を自動化することが可能である。 Therefore, by dividing the input sentence and evaluating the robot command (combination of voice and gesture) that can be created using the SGAE service unit 50 and selecting the combination with the highest likelihood, It is possible to automate robot command generation.
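The selection loop described here can be sketched as a greedy search. This is a minimal sketch under several assumptions: segment candidates are the prefixes of a fixed-length window that end at a period or comma, each likelihood evaluation module is a plain function of (text segment, gesture), and the overall likelihood is the product of the module outputs. The names `candidate_segments`, `generate_scenario`, and the `window` parameter are hypothetical and do not come from the source.

```python
from math import prod
from typing import Callable, List, Sequence, Tuple

# A likelihood evaluation module maps (text segment, gesture) to a likelihood.
Module = Callable[[str, str], float]


def candidate_segments(text: str, start: int, window: int,
                       terminals: str = "。、") -> List[str]:
    """Prefixes of text[start:start+window] that end at a terminal character."""
    chunk = text[start:start + window]
    cands = [chunk[:i + 1] for i, ch in enumerate(chunk) if ch in terminals]
    return cands or [chunk]  # fall back to the whole window if no terminal


def generate_scenario(text: str, gestures: Sequence[str],
                      modules: Sequence[Module],
                      window: int = 40) -> List[Tuple[str, str]]:
    """Greedily pick the (segment, gesture) pair with the highest likelihood."""
    scenario, pos = [], 0
    while pos < len(text):
        best = max(
            ((seg, g) for seg in candidate_segments(text, pos, window)
             for g in gestures),
            key=lambda pair: prod(m(*pair) for m in modules),
        )
        scenario.append(best)
        pos += len(best[0])  # continue right after the chosen segment
    return scenario
```

Because the modules are passed in as a list, further evaluation criteria can be added without changing the loop, which matches the idea of registering likelihood evaluation modules separately.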

シナリオ生成装置によれば、たとえば、ブログのテキストを身振り手振りを交えながら、説明してくれるロボットサービスを作ることが可能になる。ひいては、膨大な量のｈｔｍｌコンテンツをロボットサービスに取り込むことが可能になる。 According to the scenario generation device, for example, it is possible to create a robot service that explains blog text while gesturing. Eventually, a huge amount of html content can be taken into the robot service.

しかも、ジェスチャクリエータ４や、分析者６が、ロボットコンテンツの作成者２とは、独立した存在であるので、シナリオ生成装置を利用して、システムを発展的に開発していくことが可能となる。 Moreover, since the gesture creator 4 and the analyst 6 are independent of the robot content creator 2, the system can be developed progressively by using the scenario generation device. .

今回開示された実施の形態は、本発明を具体的に実施するための構成の例示であって、本発明の技術的範囲を制限するものではない。本発明の技術的範囲は、実施の形態の説明ではなく、特許請求の範囲によって示されるものであり、特許請求の範囲の文言上の範囲および均等の意味の範囲内での変更が含まれることが意図される。 The embodiment disclosed herein is an example of a configuration for concretely implementing the present invention and does not limit the technical scope of the present invention. The technical scope of the present invention is indicated by the claims rather than by the description of the embodiment, and is intended to include modifications within the wording of the claims and within the meaning and scope of equivalents.

２ユーザ、４ジェスチャクリエータ、６分析者、１０ユーザインタフェース部、３０ＲＩＧサービス部、４０ＤＢＤＳサービス部、４２記憶部、５０ＳＧＡＥサービス部、５４尤度評価モジュール、５６生成部、２０００サーバ装置。 2 users, 4 gesture creators, 6 analysts, 10 user interface units, 30 RIG service units, 40 DBDS service units, 42 storage units, 50 SGAE service units, 54 likelihood evaluation modules, 56 generation units, 2000 server devices.

Claims

A scenario generation device for creating a scenario in which a gesture is assigned to an utterance of a controlled object, comprising:
storage means for storing text data corresponding to the utterance and information for controlling the gesture;
dividing means for dividing text data of a predetermined length, out of the text data corresponding to the utterance of the controlled object, into a plurality of text segment candidates based on a predetermined termination pattern;
selection means for selecting, as a combination in the scenario, the combination candidate having the highest likelihood among combination candidates of the plurality of text segment candidates and a plurality of predetermined gestures, based on, for each combination candidate, a first likelihood that the text segment candidate is delimited by the predetermined termination pattern and a second likelihood based on at least one of a reproduction time of the text segment candidate and a ratio between the reproduction time and a gesture time; and
scenario creation means for creating the scenario by repeating, up to the end of the text data, the division by the dividing means and the selection of the combination by the selection means for the text data of the predetermined length that follows the text segment corresponding to the selected combination.

The scenario generation device according to claim 1, wherein the second likelihood is a product of a likelihood based on a reproduction time of the text segment candidate and a likelihood based on a ratio between the reproduction time and the gesture time.

The scenario generation device as recited above, wherein the selection means calculates the likelihood by multiplying, in addition to the first and second likelihoods, a third likelihood based on a keyword present in the text segment candidate.

The scenario generation device described in 1, wherein the scenario generation device is a server device and further comprises means for registering the information for controlling the gesture in the storage means via a network.

The scenario generation device as recited above, wherein the first to third likelihoods are each calculated by a corresponding likelihood evaluation module, the device further comprising means for registering the likelihood evaluation modules with the selection means.

A scenario generation method for creating a scenario in which a gesture is assigned to an utterance of a controlled object, comprising:
a step in which an arithmetic unit, based on information in a storage device that stores text data corresponding to the utterance and information for controlling the gesture, divides text data of a predetermined length, out of the text data corresponding to the utterance of the controlled object, into a plurality of text segment candidates based on a predetermined termination pattern;
a step in which the arithmetic unit selects, as a combination in the scenario, the combination candidate having the highest likelihood among combination candidates of the plurality of text segment candidates and a plurality of predetermined gestures, based on, for each combination candidate, a first likelihood that the text segment candidate is delimited by the predetermined termination pattern and a second likelihood based on at least one of a reproduction time of the text segment candidate and a ratio between the reproduction time and a gesture time; and
a step in which the arithmetic unit creates the scenario by repeating, up to the end of the text data, the dividing step and the selecting step for the text data of the predetermined length that follows the text segment corresponding to the selected combination.