JP2017213612A

JP2017213612A - Robot and method for controlling robot

Info

Publication number: JP2017213612A
Application number: JP2016107532A
Authority: JP
Inventors: 池野　篤司; Tokuji Ikeno; 篤司池野; 宗明島田; Muneaki Shimada; 浩太畠中; Kota HATANAKA; 西島　敏文; Toshifumi Nishijima; 敏文西島; 史憲片岡; Fuminori Kataoka; 刀根川　浩巳; Hiromi Tonegawa; 浩巳刀根川; 倫秀梅山; Norihide Umeyama
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2016-05-30
Filing date: 2016-05-30
Publication date: 2017-12-07

Abstract

PROBLEM TO BE SOLVED: To provide a robot in which utterance and operation are linked to each other.

SOLUTION: A robot has: storage means for storing an operation definition set, which is data in which operation by a robot having a movable part is defined and which includes a start operation, an intermediate operation and an end operation; voice output means for outputting a voice; driving means for driving the movable part; command generation means for generating a drive command for the driving means by use of the operation definition set; and control means for issuing the drive command to the driving means in synchronization with output of the voice. The command generation means generates a drive command in which the intermediate operation is inserted between the start operation and the end operation one or more times so that an operation time of the movable part is substantially the same as an output time of the voice.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a robot that interacts with a user by voice.

In recent years, robots that provide various kinds of information by conversing with people have been developed. For example, Patent Document 1 discloses a communication robot that incorporates a camera, a microphone, and actuators and converses with a user by voice.

The robot described in Patent Document 1 has a function of reacting by operating its actuators while it is speaking. For example, when the robot asks the user a question, it moves its hands or neck to perform an action that matches the question. With this configuration, speech and motion can be linked in a natural way.

Patent documents cited: JP 2004-034273 A; JP 2001-179667 A

The robot described in Patent Document 1 stores a plurality of motion definitions whose durations are fixed in advance, and dynamically generates a combination of motion definitions that fits the duration of the utterance. This allows the end of the utterance and the end of the motion to coincide.
However, because this method gives priority to duration, combining various motions may produce a motion that is unnatural as a whole.

The present invention has been made in view of the above problem, and an object thereof is to provide a robot in which speech and motion are linked.

The robot according to the present invention includes: storage means for storing a motion definition set, which is data defining motions of a robot having a movable part and which includes a start motion, an intermediate motion, and an end motion; voice output means for outputting a voice; driving means for driving the movable part; command generation means for generating a drive command for the driving means by using the motion definition set; and control means for issuing the drive command to the driving means in synchronization with the output of the voice. The command generation means generates a drive command in which the intermediate motion is inserted one or more times between the start motion and the end motion so that the motion time of the movable part is substantially the same as the output time of the voice.

The motion definition set is data that defines the motions the robot performs by driving the movable part, and it includes three kinds of motion: a start motion, an intermediate motion, and an end motion. The command generation means generates a drive command in which the intermediate motion is inserted one or more times between the start motion and the end motion so that the time during which the voice output means outputs the voice and the time during which the robot moves are substantially the same.
With this configuration, the overall motion time can be lengthened or shortened by repeating the intermediate motion, so the robot's speech and motion can be linked even when the output time of the voice cannot be obtained in advance. In addition, because motions that are not designed to follow one another are never combined, the robot moves naturally as a whole.

When the motion time obtained by performing the start motion and the end motion in succession is shorter than the output time of the voice, the command generation means may repeatedly insert the intermediate motion between the start motion and the end motion until the motion time of the movable part reaches the output time of the voice.

In this way, when the output time of the voice can be obtained in advance, the intermediate motion may be inserted repeatedly until the motion time reaches or exceeds the output time of the voice.
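The repeated insertion described above can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function name `build_sequence`, the string labels, and the use of milliseconds are assumptions made for the example.

```python
def build_sequence(start_ms, mid_ms, end_ms, speech_ms):
    """Insert the intermediate motion until the motion time reaches speech_ms.

    Returns the ordered list of motion labels and the total motion time.
    All names and units here are illustrative assumptions.
    """
    if mid_ms <= 0:
        raise ValueError("intermediate motion must have a positive duration")

    sequence = ["start", "end"]
    total_ms = start_ms + end_ms
    # Keep inserting one intermediate motion just before the end motion
    # while the motion is still shorter than the utterance.
    while total_ms < speech_ms:
        sequence.insert(-1, "intermediate")
        total_ms += mid_ms
    return sequence, total_ms


if __name__ == "__main__":
    print(build_sequence(600, 300, 600, 3000))
```

Because the loop stops at the first total that is not shorter than the utterance, the motion never ends before the voice does, at the cost of overshooting by up to one intermediate motion.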

The intermediate motion may be a loopable motion lasting one second or less.

Using a short, loopable motion as the intermediate motion keeps the repetition from looking unnatural. For example, repeating a long motion of about 3 to 5 seconds many times may feel odd to the user, whereas repeating a short, loopable motion reduces that impression and lets the robot act more naturally.

The robot according to the present invention may further include type acquisition means for acquiring a type corresponding to the content of the voice output by the voice output means. In that case, the storage means stores a plurality of motion definition sets associated with the types, and the command generation means selects the motion definition set corresponding to the acquired type and generates the drive command by using that motion definition set.

The type corresponding to the content of the voice may represent, for example, a pseudo emotion of the robot (joy, anger, sorrow, or pleasure). When the voice provides some information to the user, the type may instead be set based on the content of that information. In this way, a plurality of motion definition sets may be prepared according to the content or type of the response sentence, and a suitable one may be selected. With this configuration, the robot can react differently depending on what it says.

When the voice output by the voice output means consists of two or more sentences, the command generation means may select a plurality of motion definition sets according to the type of each sentence and generate the drive command by using the plurality of motion definition sets.

When the voice to be output consists of two or more sentences, the type may change partway through the utterance. For example, the end motion may be performed once before a motion corresponding to another type is started, or the robot may switch, partway through an intermediate motion, to the intermediate motion corresponding to another type. This configuration enables richer expression.

The command generation means may generate a drive command to which an initialization motion that returns the movable part to its initial position is further added.

When the robot finishes a motion, the movable part needs to stop at a position that does not hinder the start of the next motion. A command that initializes the position of the movable part may therefore be appended to the end of the drive command. With this configuration, the position of the movable part at the end of driving no longer needs to be considered when designing motions, which increases the freedom of expression.
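Appending the initialization motion can be pictured as a small post-processing step on the drive command. The sketch below assumes a drive command is an ordered list of labels and that the initialization step is named `init_pose`; both are hypothetical choices for illustration only.

```python
def with_initialization(commands, init_command="init_pose"):
    """Return a copy of the drive command with an initialization step at the end.

    `commands` is assumed to be an ordered list of motion labels, and
    `init_command` is a hypothetical label for "return to initial position".
    """
    # Avoid appending a second initialization step if one is already last.
    if commands and commands[-1] == init_command:
        return list(commands)
    return [*commands, init_command]


if __name__ == "__main__":
    print(with_initialization(["start", "intermediate", "end"]))
```

With this step in place, an end motion does not itself have to finish at the initial position, which is the design freedom the paragraph above describes.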

The present invention can be specified as a robot that includes at least some of the above means. It can also be specified as a method for controlling the robot. The above processes and means can be freely combined and implemented as long as no technical contradiction arises.

According to the present invention, a robot in which speech and motion are linked can be provided.

FIG. 1 is a system configuration diagram of the voice interaction system according to the embodiment.
FIG. 2 is a diagram illustrating the robot 10.
FIG. 3 is a diagram illustrating a specific example of a motion definition set.
FIG. 4 is a processing flow diagram of the voice interaction system according to the embodiment.
FIG. 5 is a processing flow diagram of the voice interaction system according to the embodiment.
FIG. 6 is a processing flow diagram of the voice interaction system according to the embodiment.
FIG. 7 is a diagram illustrating combinations of the robot's speech time and motions.

Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings.
The voice interaction system according to the present embodiment is a system in which a robot communicates with a user by conversing by voice.

(First embodiment)
<System configuration>
FIG. 1 is a system configuration diagram of the voice interaction system according to the present embodiment. The voice interaction system according to the present embodiment includes a robot 10 and a control device 20.
The robot 10 has a speaker, a microphone, a camera, and the like, and serves as the interface with the user. As shown in FIG. 2, the robot 10 is humanoid (or character-shaped) and can perform various reactions by moving its joints with actuators.
The control device 20 is a device that issues control commands to the robot 10. In the present embodiment, the robot 10 functions only as a user interface, and the control device 20 performs the processing that controls the entire system, such as generating dialogue sentences, recognizing speech, and other processing.

First, the robot 10 will be described.
The robot 10 includes a voice input unit 11, a communication unit 12, a voice output unit 13, and a motion control unit 14.

The voice input unit 11 is means for acquiring the voice uttered by the user. Specifically, it converts the voice into an electrical signal (hereinafter, voice data) using a microphone (not shown). The acquired voice data is transmitted to the control device 20 via the communication unit 12 described later.

The communication unit 12 is means for performing short-range wireless communication with the control device 20. In the present embodiment, the communication unit 12 communicates over a Bluetooth (registered trademark) connection. The communication unit 12 stores information about the control device 20 with which it is paired, so a connection can be established with a simple operation.

The voice output unit 13 is means for outputting the voice provided to the user. Specifically, it converts the voice data transmitted from the control device 20 into sound using a speaker.

The motion control unit 14 is means for controlling the motion of the robot 10 by driving actuators built into the plurality of movable parts of the robot 10. Specifically, based on a command transmitted from the control device 20, it drives actuators arranged at joints such as the hands, shoulders, elbows, and legs so that the robot 10 performs a predetermined reaction.
The motion control unit 14 also stores, in advance, actuator motion definitions (which actuator is moved, and how, in response to which command), and drives the actuators based on the command transmitted from the control device 20.
The movable parts may be provided at each joint, as shown in FIG. 2 for example, or at locations other than joints, such as wheels. A specific example of the command transmitted to the robot will be described later.

Next, the control device 20 will be described. The control device 20 is a device that controls the robot 10, and is typically a personal computer, a mobile phone, a smartphone, or the like. The control device 20 can be configured as an information processing device having a CPU, a main storage device, and an auxiliary storage device. A program stored in the auxiliary storage device is loaded into the main storage device and executed by the CPU, whereby each means shown in FIG. 1 functions. All or some of the illustrated functions may instead be executed by dedicated circuits.

The control device 20 includes a communication unit 21, a voice recognition unit 22, a response generation unit 23, a voice synthesis unit 24, and a motion generation unit 25.

The functions of the communication unit 21 are the same as those of the communication unit 12 described above, so a detailed description is omitted.

The voice recognition unit 22 is means for performing speech recognition on the voice acquired by the voice input unit 11 of the robot and converting it into text. Speech recognition can be performed using known techniques. For example, the voice recognition unit 22 stores an acoustic model and a recognition dictionary, extracts features by comparing the acquired voice data with the acoustic model, and performs speech recognition by matching the extracted features against the recognition dictionary. The recognition result is transmitted to the response generation unit 23.

The response generation unit 23 is means for generating or acquiring a response sentence to be provided to the user, based on the text output by the voice recognition unit 22. The information to be provided may be, for example, information obtained by searching a database or information obtained by a web search. The information need not be an answer to a question. For example, when the robot 10 is a communication robot, it may be a reply selected from a dialogue scenario (dialogue dictionary). More generally, the input text and the output response sentence may be of any kind as long as information can be provided through natural language processing. The response sentence generated or acquired by the response generation unit 23 is converted into synthesized speech, transmitted to the robot 10, and then output to the user.

The voice synthesis unit 24 is means for converting the response sentence (text) generated by the response generation unit 23 into voice data by using a known speech synthesis technique. The generated voice data is transmitted to the robot 10 and provided to the user.

The motion generation unit 25 is means for determining the motion to be performed by the robot 10, based on the response sentence generated by the response generation unit 23. The motion generation unit 25 stores data that defines the movement of the movable parts (motion definition sets) and, based on that data and the response sentence generated by the response generation unit 23, generates a drive command for the robot (hereinafter, motion command).
The motion definition set is described here. A motion definition set is data that defines how the plurality of movable parts of the robot are moved, and it consists of a combination of three kinds of motion: a "start motion", an "intermediate motion", and an "end motion". FIG. 3 shows an example of a motion definition set, here the one used when the robot gives a greeting.

The robot 10 drives its movable parts in the order of the start motion, the intermediate motion, and the end motion in accordance with the motion definition set. In the example of FIG. 3, the three movable parts of the right arm are first moved from their initial positions to their fully driven positions (hereinafter, position X), which raises the right arm. Next, the two movable parts other than the shoulder are moved back and forth between position X and the initial position, which produces a waving motion. Finally, the three movable parts are returned to their initial positions, which lowers the right arm. Performed in succession, these motions make the robot wave its right hand at the user.
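One way to picture such a motion definition set is as a small record holding the three motions. The sketch below is only an illustration: FIG. 3 is not reproduced here, so the joint names `right_elbow` and `right_wrist`, the motion names, and the durations (borrowed from the timing example given later in the description) are assumptions, not values from the patent's figure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Motion:
    name: str
    duration_ms: int
    # For each joint, the ordered positions it visits during the motion.
    targets: Dict[str, Tuple[str, ...]]


@dataclass(frozen=True)
class MotionDefinitionSet:
    kind: str
    start: Motion
    intermediate: Motion
    end: Motion

    def expand(self, repeats: int) -> List[Motion]:
        """Start motion, `repeats` intermediate motions, then the end motion."""
        return [self.start, *[self.intermediate] * repeats, self.end]


# Hypothetical "greeting" set modelled on the description of FIG. 3.
GREETING = MotionDefinitionSet(
    kind="greeting",
    start=Motion("raise_right_arm", 600, {
        "right_shoulder": ("X",), "right_elbow": ("X",), "right_wrist": ("X",)}),
    intermediate=Motion("wave", 300, {
        "right_elbow": ("initial", "X"), "right_wrist": ("initial", "X")}),
    end=Motion("lower_right_arm", 600, {
        "right_shoulder": ("initial",), "right_elbow": ("initial",),
        "right_wrist": ("initial",)}),
)

if __name__ == "__main__":
    print([m.name for m in GREETING.expand(2)])
```

Keeping the intermediate motion free of the shoulder joint mirrors the text: only the two non-shoulder parts move while waving, so the loop can repeat without lowering the arm.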

Although the motion definition set corresponding to a "greeting" is shown here, a plurality of motion definition sets are defined according to the content (type) of the response sentence, and a suitable one is selected. For example, a suitable set may be selected from the plurality of motion definition sets according to the robot's (pseudo) emotion, such as joy, anger, sorrow, or pleasure.

For example, types such as "asking a question", "answering a question", "praising", "sympathizing", and "rejoicing" may be defined. There may also be a motion definition set corresponding to a specific word.
Even when the content and type of the response sentence are the same, a different motion definition set may be selected when the environment differs. For example, different motion definition sets may be selected depending on whether the robot is at home or inside a car.
Likewise, even when the content and type of the response sentence are the same, a different motion definition set may be selected when the robot's emotion differs. For example, different motion definition sets may be selected depending on whether the robot is feeling anger or joy.
When the response sentence provides information to the user, a motion definition set corresponding to the type or content of that information may be selected.
In this way, a suitable motion definition set can be selected according to emotion, the type of the response sentence, the environment, and the like.
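The selection by type, environment, and emotion could be organized as a keyed lookup with fallbacks. The following is a hedged sketch: the registry layout, the function name `select_motion_set`, the fallback order (environment before emotion), and the sample values are all assumptions, since the description does not specify how the selection is implemented.

```python
def select_motion_set(registry, kind, environment=None, emotion=None):
    """Pick the most specific motion definition set registered for a response.

    `registry` maps (kind, environment, emotion) tuples to motion definition
    sets; None in a key means "any". The fallback order is an assumption.
    """
    candidates = (
        (kind, environment, emotion),  # exact match
        (kind, environment, None),     # same environment, any emotion
        (kind, None, emotion),         # same emotion, any environment
        (kind, None, None),            # default for this response type
    )
    for key in candidates:
        if key in registry:
            return registry[key]
    raise KeyError(f"no motion definition set registered for type {kind!r}")


if __name__ == "__main__":
    registry = {
        ("greeting", None, None): "wave_hand",
        ("greeting", "car", None): "small_nod",
        ("greeting", None, "joy"): "big_wave",
    }
    print(select_motion_set(registry, "greeting", environment="car", emotion="joy"))
```

The strings stored as values stand in for full motion definition sets; any object could be registered in their place.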

Note that the intermediate motion may be omitted or repeated, and the end motion may be omitted. This is described later.

<Processing flow>
Next, the processing performed by each means shown in FIG. 1 and the flow of data will be described with reference to FIGS. 4 to 6, which are flow diagrams illustrating the processing and the data flow.
First, in step S11, the voice input unit 11 of the robot 10 acquires voice from the user through the microphone. The acquired voice is converted into voice data and transmitted via the communication unit to the voice recognition unit 22 of the control device 20.
The voice recognition unit 22 then performs speech recognition on the acquired voice data and converts it into text (step S12). The text obtained as a result of the speech recognition is transmitted to the response generation unit 23.

Next, in step S21, the response generation unit 23 generates a response sentence based on the acquired text. As described above, the response sentence may be generated using a dialogue dictionary (dialogue scenario) held by the device itself, or using an external information source (a database server or web server). The generated response sentence is transmitted to both the voice synthesis unit 24 and the motion generation unit 25.

In step S22, the voice synthesis unit 24 synthesizes speech based on the response sentence and obtains voice data. The obtained voice data is transmitted to both the motion generation unit 25 and the communication unit 21.
In step S23, the motion generation unit 25 acquires a motion definition set based on the response sentence. Specifically, based on the type of the response sentence (what kind of content the response has), it acquires the matching set from among the plurality of motion definition sets. For example, when the response sentence is a greeting, it acquires the motion definition set corresponding to "greeting".
Steps S22 and S23 may be executed in either order.

Next, in step S24, the motion generation unit 25 calculates the playback time from the acquired voice data, and in step S25 it generates a motion command corresponding to that playback time.
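The description does not state how the playback time is computed or which audio format is used. As one possible illustration, assuming the synthesized voice data is uncompressed WAV, the duration follows from the frame count and the sampling rate using Python's standard `wave` module. The helper `make_silence` exists only to produce sample input.

```python
import io
import wave


def playback_time_ms(wav_bytes: bytes) -> int:
    """Return the playback time of WAV-encoded voice data in milliseconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        # duration = number of frames / frames per second
        return round(wav.getnframes() * 1000 / wav.getframerate())


def make_silence(seconds: float, rate: int = 16000) -> bytes:
    """Build mono 16-bit silent WAV data, used here only as sample input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(playback_time_ms(make_silence(3.0)))
```

For compressed formats the frame-count approach does not apply directly, and the duration would have to come from the synthesizer or a decoder instead.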

The method of generating the motion command is described here.
As shown in FIG. 3, the motion definition set in the present embodiment consists of three kinds of motion: a start motion, an intermediate motion, and an end motion. On the other hand, because the response the robot provides to the user is generated dynamically, the time the robot needs to play back the voice cannot be known in advance. The control device 20 according to the present embodiment therefore calculates the playback time of the voice (hereinafter, speech time) and then combines the start motion, the intermediate motion, and the end motion to generate a motion that fits the speech time.

FIG. 7 illustrates the process by which the motion generation unit 25 generates a motion command. Here, the speech time is assumed to be 3 seconds, the start motion to take 600 milliseconds, the intermediate motion 300 milliseconds, and the end motion 600 milliseconds.
First, the time taken when the start motion and the end motion are performed in succession is compared with the speech time. If the speech time is longer, the intermediate motion is inserted between the start motion and the end motion. In this example, performing the start motion and the end motion in succession takes 1.2 seconds and the speech time is 3 seconds, so the intermediate motion is inserted (FIG. 7(A)).

The number of intermediate motions to insert is determined so that the difference between the robot's motion time and the speech time is minimized. In this example, inserting six intermediate motions gives a total time of 3.0 seconds (FIG. 7(B)).
In this case, a motion command that performs the start motion once, the intermediate motion six times, and the end motion once is generated in step S25 and transmitted to the robot 10 via the communication unit.
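The description does not define the wire format of the motion command. Purely as an illustration, a command of the form "start once, intermediate six times, end once" could be serialized as JSON on the control device and expanded into an ordered motion list on the robot. The field names `set` and `repeat` and both function names are hypothetical.

```python
import json

PHASES = ("start", "intermediate", "end")


def encode_motion_command(set_name, start=1, intermediate=0, end=1):
    """Serialize a motion command (hypothetical JSON layout) for transmission."""
    return json.dumps(
        {"set": set_name,
         "repeat": {"start": start, "intermediate": intermediate, "end": end}},
        sort_keys=True,
    )


def decode_motion_command(payload):
    """Expand a received command into the ordered list of phases to perform."""
    repeat = json.loads(payload)["repeat"]
    return [phase for phase in PHASES for _ in range(repeat[phase])]


if __name__ == "__main__":
    payload = encode_motion_command("greeting", intermediate=6)
    print(payload)
    print(decode_motion_command(payload))
```

Sending repeat counts rather than the full actuator sequence fits the division of labor in the text, where the robot side holds the actuator motion definitions and the control device only says which motions to run and how often.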

On the other hand, if the speech time is 3.2 seconds, for example, inserting six intermediate motions gives 3.0 seconds (a difference of 0.2 seconds) and inserting seven gives 3.3 seconds (a difference of 0.1 seconds), so seven intermediate motions are inserted (FIG. 7(C)).
However, the difference between the motion time and the speech time does not necessarily have to be minimized; it is sufficient for the motion time and the speech time to be substantially the same.
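The two worked examples above (3.0 seconds and 3.2 seconds) can be reproduced with a few lines of integer arithmetic. This is a sketch under stated assumptions: the function name is invented, times are in milliseconds, and a tie between two counts is resolved toward the shorter motion, which the description does not specify.

```python
def intermediate_count(start_ms, mid_ms, end_ms, speech_ms):
    """Number of intermediate motions that brings the motion time closest
    to the speech time. Ties go to the smaller count (an assumption)."""
    if mid_ms <= 0:
        raise ValueError("intermediate motion must have a positive duration")

    remaining_ms = speech_ms - (start_ms + end_ms)
    if remaining_ms <= 0:
        return 0  # start + end already covers the utterance

    lower = remaining_ms // mid_ms                     # largest count not overshooting
    undershoot = remaining_ms - lower * mid_ms         # gap left with `lower`
    overshoot = (lower + 1) * mid_ms - remaining_ms    # excess with `lower + 1`
    return lower if undershoot <= overshoot else lower + 1


if __name__ == "__main__":
    for speech_ms in (3000, 3200):
        n = intermediate_count(600, 300, 600, speech_ms)
        print(speech_ms, n, 600 + n * 300 + 600)
```

With the durations from the example, a 3000 ms utterance yields six insertions (3000 ms of motion) and a 3200 ms utterance yields seven (3300 ms of motion), matching FIG. 7(B) and FIG. 7(C) as described.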

An example in which the intermediate motion is inserted has been given here, but when the speech time is short, the intermediate motion or the end motion need not be performed. For example, when the speech time is shorter than the time taken to perform the start motion and the end motion in succession, the start motion and the end motion may be performed in succession. When the speech time is shorter than the time taken to perform the start motion alone, only the start motion may be performed.

The start motion does not necessarily have to be performed either. For example, when performing the start motion would make the gap between the speech time and the motion time larger than not performing it, the start motion may be omitted. Priorities may also be assigned to the start motion and the end motion, and the lower-priority motion may be omitted in an attempt to reduce the gap between the speech time and the motion time.
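One way to read the options above for short utterances is as a choice among a few candidate plans, keeping whichever leaves the smallest gap to the speech time. The sketch below is an interpretation, not the method claimed: the candidate list, the function name, and the rule that ties favor the fuller plan are assumptions.

```python
def plan_short_utterance(start_ms, end_ms, speech_ms):
    """Choose which motions to perform for a short utterance.

    Candidates are listed from fullest to emptiest; on a tie, `min` keeps
    the first one, so the fuller plan wins (an assumed priority rule).
    """
    candidates = {
        ("start", "end"): start_ms + end_ms,
        ("start",): start_ms,
        (): 0,  # perform no motion at all
    }
    return min(candidates, key=lambda plan: abs(candidates[plan] - speech_ms))


if __name__ == "__main__":
    for speech_ms in (1000, 500, 200):
        print(speech_ms, plan_short_utterance(600, 600, speech_ms))
```

Reordering the candidate dictionary is a simple way to express a different priority between the start motion and the end motion.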

The start motion and the end motion also do not have to be performed only once. For example, when performing the start motion or the end motion several times brings the motion time closer to the speech time, the start motion or the end motion may be repeated.

When the speech time and the motion time need to match exactly, the time may be adjusted by cutting the intermediate motion or the end motion short partway through.
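Cutting a motion short to match the speech time exactly can be sketched as clipping a schedule of (motion, duration) pairs. The schedule representation and the function name are assumptions for illustration; the description does not say how a partially executed motion is handled on the actuator side.

```python
def truncate_to(schedule, speech_ms):
    """Clip a list of (motion, duration_ms) pairs so the total equals speech_ms
    whenever the unclipped schedule is at least that long."""
    clipped, elapsed_ms = [], 0
    for name, duration_ms in schedule:
        if elapsed_ms >= speech_ms:
            break  # the utterance is already over; drop remaining motions
        run_ms = min(duration_ms, speech_ms - elapsed_ms)
        clipped.append((name, run_ms))
        elapsed_ms += run_ms
    return clipped


if __name__ == "__main__":
    schedule = [("start", 600)] + [("intermediate", 300)] * 7 + [("end", 600)]
    print(truncate_to(schedule, 3200))
```

In this example the 3300 ms schedule is shortened by ending the final motion 100 ms early, so the motion stops together with a 3200 ms utterance.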

ロボット１０側の通信部が音声データと動作コマンドを受信する（ステップＳ３１）と、音声データは音声出力部１３へ、動作コマンドは動作制御部１４へ送信される。
ステップＳ３２では、動作制御部１４が、動作コマンドに対応するアクチュエータの動作定義（どのアクチュエータをどのようなシーケンスで動かすか）を取得する。
一方、音声出力部１３は、音声データを取得すると、当該音声の再生を開始する（ステップＳ３３）。また、同時に再生開始トリガを動作制御部１４に送信し、動作制御部１４が、当該トリガに基づいてアクチュエータの駆動を開始する（ステップＳ３４）。
これにより、発話が始まるのと同時にロボットが動作を開始し、発話が終了すると略同時にロボットの動作も停止する。 When the communication unit on the robot 10 side receives the audio data and the operation command (step S31), the audio data is transmitted to the audio output unit 13 and the operation command is transmitted to the operation control unit 14.
In step S32, the operation control unit 14 acquires the actuator operation definition corresponding to the operation command (which actuators are to be moved, and in what sequence).
Meanwhile, on acquiring the audio data, the audio output unit 13 starts playing the audio (step S33). At the same time, it sends a playback start trigger to the operation control unit 14, which starts driving the actuators in response to that trigger (step S34).
As a result, the robot starts to operate simultaneously with the start of the utterance, and when the utterance ends, the operation of the robot also stops substantially simultaneously.
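The trigger-based start in steps S31 to S34 can be sketched as follows. This is only an illustration under stated assumptions: the classes `MotionController` and `AudioOutput`, the function `handle_message`, and the use of `threading.Event` as the playback start trigger are all hypothetical stand-ins, and actual audio playback and actuator driving are replaced by log entries.

```python
import threading


class MotionController:
    """Stand-in for the operation control unit 14 (hypothetical interface)."""

    def __init__(self, log):
        self.log = log
        self.trigger = threading.Event()  # playback start trigger
        self.sequence = []

    def load(self, command, definitions):
        # Step S32: look up the actuator sequence for the received command.
        self.sequence = definitions[command]

    def run(self):
        # Step S34: wait for the trigger, then drive the actuators in order.
        self.trigger.wait()
        for step in self.sequence:
            self.log.append(("motion", step))


class AudioOutput:
    """Stand-in for the audio output unit 13 (hypothetical interface)."""

    def __init__(self, log, controller):
        self.log = log
        self.controller = controller

    def play(self, audio):
        # Step S33: start playback and send the trigger at the same moment.
        self.log.append(("audio", "start"))
        self.controller.trigger.set()


def handle_message(audio, command, definitions):
    """Step S31: route the audio data and the operation command to their units."""
    log = []
    controller = MotionController(log)
    speaker = AudioOutput(log, controller)
    controller.load(command, definitions)
    worker = threading.Thread(target=controller.run)
    worker.start()
    speaker.play(audio)
    worker.join()
    return log
```

Because the controller blocks on the trigger, no actuator step can be logged before playback has started, which mirrors the ordering described for steps S33 and S34.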

以上説明したように、本実施形態に係る音声対話システムは、ロボットが行う発話の時間と、ロボットの動作時間とが一致するように中間動作を割り当て、全体の動作コマンドを生成する。これにより、ロボットによる発話と動きをリンクさせることができる。特に、中間動作をループ可能な短めの動作とし、発話時間を満たすまで動作を繰り返すことで、発話時間の長短にかかわらず自然な動きをさせることができる。 As described above, the spoken dialogue system according to the present embodiment allocates intermediate operations so that the robot's operation time matches its utterance time, and generates the overall operation command. This links the robot's speech to its movement. In particular, by making the intermediate operation a short, loopable motion and repeating it until the utterance time is filled, the robot can move naturally regardless of how long the utterance is.

既知の技術では、比較的大きめな動作を組み合わせることで全体としての動きを生成しているため、動作のつなぎ目や、同じ動作を何回も行っていることがユーザに認識されてしまうおそれがあった。これに対し、本発明では、短めの動作をループさせるため、例えば、手や首がゆらゆらと揺れるような自然な動作をさせることができ、ユーザに与える違和感を軽減することができる。 Known techniques generate the overall movement by combining relatively large motions, so the user may notice the seams between motions or notice that the same motion is being repeated many times. In contrast, the present invention loops short motions, which allows natural movement such as a gently swaying hand or neck and reduces the sense of unnaturalness given to the user.

（第二の実施形態）
ロボットが動作を終了する際は、次の動作開始に支障のない位置で可動部を停止させる必要がある。これを行わないと、可動部に位置ずれが発生したり、次の動作が行えなくなるおそれがあるためである。
第二の実施形態は、これに対応するため、ロボットによる動作が終了するごとに、ロボットが有する可動部の位置を初期化する実施形態である。 (Second embodiment)
When the robot finishes an operation, the movable part must be stopped at a position that does not hinder the start of the next operation. Otherwise, the movable part may become misaligned, or the next operation may become impossible to perform.
To address this, the second embodiment initializes the position of the robot's movable part each time the robot completes an operation.

第二の実施形態では、ステップＳ２５で、動作コマンドの末尾に可動部の位置を初期化する命令を付加する。これにより、確実に可動部を初期位置に戻すことができる。すなわち、動作を設計する際に、駆動終了時における可動部の位置を考慮する必要がなくなり、表現の自由度を増すことができる。 In the second embodiment, in step S25, a command for initializing the position of the movable part is added to the end of the operation command. Thereby, a movable part can be reliably returned to an initial position. That is, when designing the operation, there is no need to consider the position of the movable part at the end of driving, and the degree of freedom of expression can be increased.
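As a rough illustration of the second embodiment, the sketch below appends an initialization instruction to an operation command and shows its effect on a single joint. The instruction name `INITIALIZE`, the home angle `INITIAL_POSITION`, and the modelling of each motion as a relative angle change are assumptions for illustration only; the specification does not define the command format.

```python
INITIAL_POSITION = 0.0      # hypothetical home angle of one joint, in degrees
INITIALIZE = "initialize"   # hypothetical instruction name


def append_initialization(command):
    """Return a copy of the command with the initialization instruction at the end."""
    return list(command) + [(INITIALIZE, None)]


def final_position(command, position=INITIAL_POSITION):
    """Simulate the joint angle after running every (name, delta) instruction."""
    for name, delta in command:
        if name == INITIALIZE:
            # Return the movable part to its initial position.
            position = INITIAL_POSITION
        else:
            position += delta
    return position
```

For a command such as `[("start", 10.0), ("loop", 5.0), ("end", -5.0)]`, the joint ends at 10.0 degrees without the appended instruction and at the initial position with it, so individual motions do not have to be designed to end at the home position.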

（変形例）
上記の実施形態はあくまでも一例であって、本発明はその要旨を逸脱しない範囲内で適宜変更して実施しうる。 (Modification)
The above embodiment is merely an example, and the present invention can be implemented with appropriate modifications within a range not departing from the gist thereof.

例えば、実施形態の説明では、音声認識部２２が音声認識を行ったが、音声認識を外部装置で行うようにしてもよい。この場合、制御装置２０に、ネットワークと通信を行うための通信部を追加し、音声データおよび認識結果を送受信するようにしてもよい。
同様に、実施形態の説明では、応答生成部２３が応答を生成したが、応答文の生成を外部装置で行うようにしてもよい。この場合、制御装置２０に、ネットワークと通信を行うための通信部を追加し、音声認識結果および応答文を送受信するようにしてもよい。 For example, in the description of the embodiment, the voice recognition unit 22 performs voice recognition, but voice recognition may instead be performed by an external device. In this case, a communication unit for communicating with a network may be added to the control device 20 to transmit and receive the voice data and the recognition result.
Similarly, in the description of the embodiment, the response generation unit 23 generates the response, but the response sentence may instead be generated by an external device. In this case, a communication unit for communicating with a network may be added to the control device 20 to transmit and receive the voice recognition result and the response sentence.

また、実施形態の説明では、一回の発話につき一種類の動作定義セットを取得したが、発話が複数回ある場合、発話ごとに異なる動作定義セットを用いてもよい。また、発話の途中でロボットの擬似感情などが変化する場合、途中で応答文の種別を変更（すなわち動作定義セットを変更）してもよい。この場合、終了動作を経たうえで次の動作に移行してもよいし、動作定義セットＡの中間動作から動作定義セットＢの中間動作へ移行させてもよい。 In the description of the embodiment, one operation definition set is acquired per utterance. However, when there are multiple utterances, a different operation definition set may be used for each utterance. Further, when the robot's pseudo-emotion or the like changes during an utterance, the type of the response sentence may be changed partway through (that is, the operation definition set may be changed). In this case, the robot may move to the next operation after performing the end operation, or may move directly from the intermediate operation of operation definition set A to the intermediate operation of operation definition set B.
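The two ways of switching operation definition sets described above can be sketched as follows. The function `chain_sets`, the dictionary layout of a definition set, and the flag `via_end` are hypothetical. The sketch also assumes that, when the end operation is performed first, the next set begins with its own start operation; the specification says only that the robot moves to "the next operation". Timing is omitted for brevity.

```python
def chain_sets(types, definition_sets, via_end=True):
    """Build an ordered list of motion names for a sequence of response types.

    types           -- response types in utterance order, e.g. ["A", "B"]
    definition_sets -- maps each type to a dict with "start", "loop", "end" names
    via_end         -- True: finish each set with its end operation before switching;
                       False: move directly from one set's loop to the next set's loop
    """
    plan = []
    for index, kind in enumerate(types):
        current = definition_sets[kind]
        is_first = index == 0
        is_last = index == len(types) - 1
        if is_first or via_end:
            plan.append(current["start"])
        plan.append(current["loop"])
        if is_last or via_end:
            plan.append(current["end"])
    return plan
```

With sets A and B, `via_end=True` produces A's start, intermediate, and end operations followed by B's, whereas `via_end=False` produces A's start and intermediate operations followed directly by B's intermediate and end operations.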

１０・・・ロボット
１１・・・音声入力部
１２，２１・・・通信部
１３・・・音声出力部
２０・・・制御装置
２２・・・音声認識部
２３・・・応答生成部
２４・・・音声合成部
２４・・・動作生成部
DESCRIPTION OF SYMBOLS
10 ... Robot
11 ... Voice input unit
12, 21 ... Communication unit
13 ... Audio output unit
20 ... Control device
22 ... Voice recognition unit
23 ... Response generation unit
24 ... Speech synthesis unit
24 ... Operation generation unit

Claims

A robot comprising:
storage means for storing an operation definition set, which is data defining operations of a robot having a movable part and which includes a start operation, an intermediate operation, and an end operation;
audio output means for outputting audio;
drive means for driving the movable part;
command generation means for generating a drive command for the drive means using the operation definition set; and
control means for issuing the drive command to the drive means in synchronization with the output of the audio,
wherein the command generation means generates the drive command such that the intermediate operation is inserted one or more times between the start operation and the end operation so that the operation time of the movable part is substantially the same as the output time of the audio.

The robot according to claim 1, wherein, when the operation time required to perform the start operation and the end operation in succession is shorter than the output time of the audio, the command generation means repeatedly inserts the intermediate operation between the start operation and the end operation until the operation time of the movable part reaches the output time of the audio.

The robot according to claim 1 or 2, wherein the intermediate operation is an operation that can be looped within one second.

The robot according to any one of claims 1 to 3, further comprising type acquisition means for acquiring a type according to the content of the audio output by the audio output means,
wherein the storage means stores a plurality of operation definition sets, each associated with a type, and
the command generation means selects the operation definition set corresponding to the acquired type and generates the drive command using the selected operation definition set.

The robot according to claim 4, wherein, when the audio output means outputs two or more sentences, the command generation means selects a plurality of operation definition sets according to the type of each sentence and generates the drive command using the plurality of operation definition sets.

The robot according to any one of claims 1 to 5, wherein the command generation means generates a drive command to which an initialization operation for returning the movable part to an initial position is further added.

A control method for a robot having a movable part driven by drive means, and audio output means, the method comprising:
an operation definition acquisition step of acquiring an operation definition set, which is data defining operations of the movable part and which includes a start operation, an intermediate operation, and an end operation;
a command generation step of generating a drive command for the drive means using the operation definition set; and
a control step of issuing the drive command to the drive means in synchronization with the output of audio by the audio output means,
wherein, in the command generation step, the drive command is generated such that the intermediate operation is inserted one or more times between the start operation and the end operation so that the operation time of the movable part is substantially the same as the output time of the audio.

A program for causing a computer to execute the robot control method according to claim 7.