JP2018097201A

JP2018097201A - Voice dialog device and voice dialog method

Info

Publication number: JP2018097201A
Application number: JP2016242445A
Authority: JP
Inventors: Tokuji Ikeno; Muneaki Shimada; Kota Hatanaka; Toshifumi Nishijima; Fuminori Kataoka; Hiromi Tonegawa; Norihide Umeyama
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2016-12-14
Filing date: 2016-12-14
Publication date: 2018-06-21
Anticipated expiration: 2036-12-14
Also published as: JP6790791B2

Abstract

PROBLEM TO BE SOLVED: To provide richly varied conversation in a device that converses with a user.
SOLUTION: A voice dialog device that converses with a user by voice comprises: voice acquisition means that acquires an utterance made by the user; a vocabulary database that stores a plurality of vocabulary data items for generating a response sentence; and response generation means that generates a response sentence to the utterance on the basis of the stored vocabulary data. Each vocabulary data item associates a first word with a second word, the second word being a word that expresses the user's perception of the first word. The response generation means performs first response generation, in which it extracts from the vocabulary data the first word or the second word corresponding to an object mentioned in the utterance and generates the response sentence using the word associated with the extracted word.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a device that converses with a user by voice.

Much research has been devoted to providing natural conversation in interactive robot systems. For example, Patent Document 1 discloses a response generation device that generates a response to an input utterance. When there are multiple candidate response sentences for a given topic, that device selects an appropriate one according to the richness of the topic and the degree of emotional brightness. According to that invention, a more appropriate response sentence can be generated.

Patent Document 1: JP 2007-219149 A

In the invention of Patent Document 1, responses can be generated only within the range of data accumulated in advance. In other words, the data cannot be expanded as needed to enrich the content of the dialog.

On the other hand, conversation systems are known that actively increase their usable vocabulary by learning through conversation with the user. When an unknown word appears in a conversation, such a system can ask the user what the word means and learn it, but it cannot use what it has learned to broaden the variety of the conversation.

The present invention has been made in view of the above problems, and its object is to provide richly varied conversation in a device that converses with a user.

A voice dialog device according to the present invention is a voice dialog device that converses with a user by voice, comprising: voice acquisition means for acquiring an utterance made by the user; a vocabulary database that stores a plurality of vocabulary data items for generating response sentences; and response generation means for generating a response sentence to the utterance based on the stored vocabulary data. Each vocabulary data item associates a first word with a second word, the second word being a word that expresses the user's perception of the first word. The response generation means performs first response generation, in which it extracts from the vocabulary data the first word or the second word corresponding to an object mentioned in the utterance and generates a response sentence using the word associated with the extracted word.

The voice dialog device according to the present invention generates a response sentence to an utterance made by the user, using vocabulary data registered in the vocabulary database. Vocabulary data is data that associates a first word with a second word, the second word being a word that expresses the user's perception of the first word. Examples of a word expressing the user's perception include a word expressing the user's preference for the first word, a word with which the user described the first word, and a word with which the user explained the concept of the first word. The second word thus need only express the user's perception of the first word and need not be uniquely determined. For example, when the first word is "ramen", the second word may be a word expressing a superordinate concept, such as "noodles", or a word expressing a preference, such as "like". Likewise, when the first word is "exam", the second word may be a word that the user simply associates with it, such as "nervousness".
Note that the first and second words need not each be a single word. Each may be a group of words (a phrase), such as "get nervous".

The response generation means takes the word corresponding to what was mentioned in the acquired utterance, obtains the word paired with it from the vocabulary data, and generates a response sentence. For example, when the user makes an utterance that mentions ramen, the means may search the first words for "ramen" and generate a response sentence using the corresponding second word. Likewise, when the user says that he or she is nervous, the means may search the second words for "nervousness" and generate a response sentence using the corresponding first word.

The first word and the second word need not correspond one-to-one. A given first word may be associated with multiple second words, and a given second word may be associated with multiple first words. With this configuration, the variety of response sentences can be broadened by following the associations between words.
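As an illustration only, the many-to-many association described above could be held in a structure that can be searched from either side. The sketch below is not taken from the specification: the class name `VocabularyDB` and its methods are hypothetical, and English words stand in for the Japanese examples.

```python
from collections import defaultdict


class VocabularyDB:
    """Toy many-to-many store of (first word, second word) pairs (hypothetical)."""

    def __init__(self):
        self._first_to_second = defaultdict(list)
        self._second_to_first = defaultdict(list)

    def add(self, first, second):
        """Register one vocabulary data item, ignoring exact duplicates."""
        if second not in self._first_to_second[first]:
            self._first_to_second[first].append(second)
            self._second_to_first[second].append(first)

    def partners(self, word):
        """Return the words paired with `word`, searching both columns."""
        return (self._first_to_second.get(word, [])
                + self._second_to_first.get(word, []))


db = VocabularyDB()
db.add("ramen", "noodles")
db.add("ramen", "like")
db.add("exam", "nervousness")
```

Here `db.partners("ramen")` follows the first-word column and `db.partners("nervousness")` follows the second-word column, which corresponds to the two search directions described in the text.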

The voice dialog device according to the present invention may further comprise vocabulary collection means that asks the user a question about his or her perception of the object mentioned in the utterance and generates or updates the vocabulary data based on the answer obtained from the user.

To learn vocabulary data, the vocabulary collection means may ask the user a question about his or her perception of a given word. The question may ask about the user's preference, as in "Do you like XX?", or may have the user describe the word in other terms, as in "What is XX like?". It may also ask about the concept of the word itself, as in "What is XX?".

When an utterance made by the user contains a word expressing the user's perception of the object mentioned in that utterance, the vocabulary collection means may generate or update the vocabulary data based on that word.

In this way, when a first word and a second word (that is, the object the user is referring to and a word expressing the user's perception of it) can be extracted from an utterance made in ordinary dialog, the vocabulary data may be generated or updated automatically.

The vocabulary database may store the plurality of vocabulary data items in association with individual users, and the response generation means may use the vocabulary data associated with the user currently in dialog.

By holding vocabulary data for each user and using the vocabulary data corresponding to the user currently in dialog, the device can give personalized responses.

The second word may be any one of a word expressing a superordinate concept of the first word, a word with which the user described the first word, and a word expressing the user's preference for the first word.

Storing related words in association with each other in this way broadens the variety of responses.

The response generation means may perform second response generation, in which it extracts from the vocabulary data the first word corresponding to the object mentioned in the utterance, extracts a related word, which is another first word that shares an associated second word with the extracted first word, and generates a response sentence using the related word.

The same second word may be associated with different first words. In that case, another first word (a related word) may be extracted via the second word and used to generate the response sentence. For example, when "ramen" and "noodles" are stored in association with each other, "udon" and "noodles" are likewise stored in association with each other, and the user makes an utterance mentioning ramen, the device may bring up udon as a topic.

In the second response generation, when a plurality of second words are associated with the extracted first word, the response generation means may further extract, in addition to the related word, a second word that is not directly associated with the related word, and generate a response sentence using the related word and the extracted second word.

For example, when the pairs "ramen" and "noodles", "ramen" and "like", and "udon" and "noodles" are stored and the user makes an utterance mentioning ramen, the device may generate a response sentence such as "Do you like udon, too?".
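The second response generation can be sketched roughly as follows. This is only one possible reading of the passage: the function name `second_response_words`, the list-of-tuples layout, and the sentence template are assumptions made for illustration, and the template presumes that the extra second word is a preference verb.

```python
def second_response_words(pairs, target):
    """Return (related_word, extra_second_words) for `target`, or None.

    pairs  -- list of (first_word, second_word) tuples (hypothetical layout)
    target -- the first word mentioned in the utterance
    """
    own_seconds = [s for f, s in pairs if f == target]
    for first, second in pairs:
        # A related word is another first word sharing a second word with the target.
        if first != target and second in own_seconds:
            related = first
            related_seconds = {s for f, s in pairs if f == related}
            # Second words of the target not directly associated with the related word.
            extras = [s for s in own_seconds if s not in related_seconds]
            return related, extras
    return None


pairs = [("ramen", "noodles"), ("ramen", "like"), ("udon", "noodles")]
related, extras = second_response_words(pairs, "ramen")
# Illustrative template only.
sentence = f"Do you {extras[0]} {related}, too?" if extras else f"How about {related}?"
```

With the three pairs from the example, "udon" is reached from "ramen" through the shared word "noodles", and "like" is the second word that udon does not yet have.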

The present invention can be specified as a voice dialog device including at least some of the above means. It can also be specified as a dialog method performed by the voice dialog device. The above processes and means can be freely combined and implemented as long as no technical contradiction arises.

According to the present invention, richly varied conversation can be provided in a device that converses with a user.

FIG. 1 is a system configuration diagram of the dialog system according to the first embodiment.
FIG. 2 is a data flow diagram among the robot 10, the control device 20, and the server device 30.
FIG. 3 is a flowchart of the processing performed by the server device 30.
FIG. 4 is an example of the word table in the first embodiment.
FIG. 5 is an example of the word table in the second embodiment.

Preferred embodiments of the present invention are described below with reference to the drawings.
The voice dialog system according to the present embodiment acquires speech uttered by the user, performs speech recognition on it, and converses with the user by generating a response sentence based on the recognition result.

(First embodiment)
<System configuration>
FIG. 1 is a system configuration diagram of the voice dialog system according to the first embodiment. The voice dialog system according to the present embodiment comprises a robot 10, a control device 20, and a server device 30.

The robot 10 has a speaker, a microphone, a camera, and the like, and serves as the interface with the user. The robot 10 may be humanoid, character-shaped, or of some other shape.
The control device 20 is a device that issues commands to the robot 10. The server device 30 is a device that generates a response (response sentence) to be provided to the user, in response to a request transmitted from the control device 20.
In the present embodiment, the robot 10 functions only as a user interface, and the server device 30 performs the processing that controls the system as a whole, such as recognizing the content of utterances and generating response sentences. The control device 20 mediates between the robot 10 and the server device 30.

First, the robot 10 will be described.
The robot 10 comprises a voice input unit 11, a short-range communication unit 12, and a voice output unit 13.

The voice input unit 11 is means for acquiring speech uttered by the user. Specifically, it converts speech into an electrical signal (hereinafter, voice data) using a built-in microphone. The acquired voice data is transmitted to the control device 20 via the short-range communication unit 12, described later.

The short-range communication unit 12 is means for performing short-range wireless communication with the control device 20. In the present embodiment, the short-range communication unit 12 communicates using the Bluetooth (registered trademark) standard. The short-range communication unit 12 stores information on the control device 20 with which it is paired, so that a connection can be established with a simple procedure. The Bluetooth standard is also referred to as IEEE 802.15.1.

The voice output unit 13 is means for outputting the speech to be provided to the user. Specifically, it converts voice data transmitted from the control device 20 into sound using a built-in speaker.

Next, the control device 20 will be described. The control device 20 mediates between the robot 10 and the server device 30 and is typically a small computer such as a mobile computer, a mobile phone, or a smartphone. The control device 20 can be configured as an information processing device having a CPU, a main storage device, and an auxiliary storage device. Each of the means shown in FIG. 1 functions when a program stored in the auxiliary storage device is loaded into the main storage device and executed by the CPU. All or part of the illustrated functions may instead be implemented with dedicated circuits.

The control device 20 comprises a short-range communication unit 21, a control unit 22, and a communication unit 23.

The short-range communication unit 21 has the same functions as the short-range communication unit 12 described above, so a detailed description is omitted.

The control unit 22 is means for acquiring speech from the robot 10 and obtaining a response to it. Specifically, it transmits the speech acquired from the robot 10 to the server device 30 via the communication unit 23 (both described later) and receives the corresponding response sentence from the server device 30. It also converts the response sentence into voice data using a speech synthesis function and transmits the voice data to the robot 10. The voice data transmitted to the robot 10 is provided to the user via the voice output unit 13. The user can thus converse in natural language.

The communication unit 23 is means for communicating with the server device 30 by accessing a network via a communication line (for example, a wireless LAN or a mobile phone network).

The server device 30 is a device that recognizes the transmitted speech and then generates a response sentence to be provided to the user. It comprises a communication unit 31, a voice recognition unit 32, and a response generation unit 33.
The communication unit 31 has the same functions as the communication unit 23 described above, so a detailed description is omitted.

The voice recognition unit 32 is means for performing speech recognition on the speech acquired by the voice input unit 11 of the robot and converting it into text. Speech recognition can be performed using known techniques. For example, the voice recognition unit 32 stores an acoustic model and a recognition dictionary; it extracts features by comparing the acquired voice data with the acoustic model and performs speech recognition by matching the extracted features against the recognition dictionary. The recognition result is transmitted to the response generation unit 33.

The response generation unit 33 is means for generating a response sentence to be provided to the user, based on the text acquired from the voice recognition unit 32. The response sentence may be based, for example, on a dialog scenario (dialog dictionary) stored in advance, or on information obtained by searching a database or the web.
In the present embodiment, the response generation unit 33 has a table (word table) that serves as a dictionary for learning words, and it can generate different response sentences depending on what has been learned. The processing is described in detail later.
The information obtained by the response generation unit 33 is transmitted to the control device 20 in text form, converted there into synthesized speech, and output to the user via the robot 10.

The server device 30 can likewise be configured as an information processing device having a CPU, a main storage device, and an auxiliary storage device. Each of the means shown in FIG. 1 functions when a program stored in the auxiliary storage device is loaded into the main storage device and executed by the CPU. All or part of the illustrated functions may instead be implemented with dedicated circuits.

<Dialog flow>
Next, the processing performed by each of the means shown in FIG. 1 and the flow of data among them are described with reference to FIG. 2, a flow diagram illustrating the processing and the data flow.

First, in step S11, the voice input unit 11 of the robot 10 acquires the speech uttered by the user through the microphone. The acquired speech is converted into voice data and transmitted via the communication unit to the control unit 22 of the control device 20. The control unit 22 then transmits the acquired voice data to the voice recognition unit 32 of the server device 30.

Next, the voice recognition unit 32 performs speech recognition on the acquired voice data and converts it into text (step S12). The resulting text is transmitted to the response generation unit 33. The response generation unit 33 then generates a response based on the content of the user's utterance (step S13).
The generated response sentence is transmitted to the control device 20 and converted into voice data by the control unit 22 (step S14). The voice data is then transmitted to the robot 10 and output (played back) via the voice output unit 13 (step S15).
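The division of labor in steps S11 to S15 can be pictured with the following sketch. It is a simplification under stated assumptions: the class and method names are hypothetical, network and Bluetooth transport are replaced by direct method calls, and speech recognition and speech synthesis are stood in for by UTF-8 decoding and encoding.

```python
class Server:
    """Stands in for the server device 30."""

    def recognize(self, voice_data: bytes) -> str:
        # S12: placeholder for speech recognition.
        return voice_data.decode("utf-8")

    def generate_response(self, text: str) -> str:
        # S13: placeholder for response generation.
        return f"You said: {text}"

    def handle(self, voice_data: bytes) -> str:
        return self.generate_response(self.recognize(voice_data))


class Controller:
    """Stands in for the control device 20."""

    def __init__(self, server: Server):
        self.server = server

    def synthesize(self, text: str) -> bytes:
        # S14: placeholder for speech synthesis.
        return text.encode("utf-8")

    def relay(self, voice_data: bytes) -> bytes:
        return self.synthesize(self.server.handle(voice_data))


class Robot:
    """Stands in for the robot 10, which acts only as the user interface."""

    def __init__(self, controller: Controller):
        self.controller = controller

    def hear(self, voice_data: bytes) -> bytes:
        # S11: acquire voice data; S15: return the voice data to be played back.
        return self.controller.relay(voice_data)


reply = Robot(Controller(Server())).hear(b"hello")
```

The point of the sketch is only that the robot holds no dialog logic: recognition and response generation sit on the server side, and synthesis sits on the control device.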

<Response generation method>
Next, a specific method by which the response generation unit 33 generates a response is described. FIG. 3 is a flowchart showing in more detail the processing in which the response generation unit 33 generates a response in step S13.

First, in step S21, the object the user is referring to (the object being talked about; hereinafter, the word representing this object is called the target word) is determined. The target word can be determined, for example, by performing morphological analysis on the text output by the voice recognition unit 32 and analyzing the resulting words. For example, when the user says "I had ramen for lunch today", it can be determined that the user is referring to ramen (or to lunch).

Step S22 is described later.
In step S23, the stored word table is consulted to determine whether the target word has already been learned. FIG. 4 shows an example of the word table stored in the response generation unit 33. In the present embodiment, the word table stores first words and second words in association with each other. The first word is the learned word, and the second word is a word expressing the user's perception of the first word.
Examples of words expressing the user's perception include the following. The first word and the second word may each be a single word or a group of words.
(A) A word that describes the first word (a word associated with the first word)
For example, a description of the first word in other terms, such as "slippery" for "udon". It may also be a word simply associated with the first word, such as "get nervous" for "exam". In the present embodiment, both kinds are called descriptive words.
(B) A word expressing a superordinate concept of the first word
For example, "ramen" and "udon" both fall under the superordinate concept "noodles", so both can be represented by the word "noodles".
(C) A word expressing the user's preference for the first word
For example, "like", "dislike", "rather like", or "really dislike" (hereinafter, preference expressions).
In the example of FIG. 4(A), the word "test" and the phrase "heart pounding" are associated with each other for the user whose user ID is U001. The description below continues with the case in which no word corresponding to "ramen", the target word determined in step S21, exists in the table (that is, it has not been learned).
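A word table of this kind, keyed by user, might be represented as below. The record layout and the names `WordRecord` and `is_learned` are assumptions for illustration; the text states only that a user ID, a first word, and a second word are associated. Checking both columns in step S23 is also an assumption, made because step S27 searches for the target word on either side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRecord:
    """One row of the word table (hypothetical layout)."""
    user_id: str
    first_word: str
    second_word: str


def is_learned(table, user_id, target_word):
    """S23: has `target_word` been learned for this user, in either column?"""
    return any(
        r.user_id == user_id and target_word in (r.first_word, r.second_word)
        for r in table
    )


# The state described for FIG. 4(A), with English stand-ins for the words.
table = [WordRecord("U001", "test", "heart pounding")]
```

With this table, "test" counts as learned for user U001, while "ramen" does not, which is the situation the description goes on to discuss.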

When the word the user is referring to has not been learned, the process proceeds to step S24, and a question for the user is generated. In the present embodiment, the question generated in step S24 is one of the following three types. Any of the types may be used.
(1) A question that has the user describe the word
For example, "What is XX like?"
(2) A question that asks for the superordinate concept of the word
For example, "What is XX?"
(3) A question that asks about the user's preference for the word
For example, "Do you like XX?"
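The three question types listed above can be sketched as simple templates. The dictionary keys, the function name `make_question`, and the random choice among types are illustrative assumptions; the text says only that any of the three types may be used.

```python
import random

# English renderings of the three question types (1) to (3).
QUESTION_TEMPLATES = {
    "descriptive": "What is {word} like?",    # (1) have the user describe the word
    "superordinate": "What is {word}?",       # (2) ask for the superordinate concept
    "preference": "Do you like {word}?",      # (3) ask about the user's preference
}


def make_question(word, kind=None, rng=random):
    """S24: return (kind, question) for an unlearned target word."""
    if kind is None:
        kind = rng.choice(sorted(QUESTION_TEMPLATES))
    return kind, QUESTION_TEMPLATES[kind].format(word=word)


kind, question = make_question("ramen", "preference")
```

Returning the question type together with the question text lets the later answer-handling step know which kind of second word to extract from the reply.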

In step S25, the generated question is output, and the user's answer to it is obtained. Here, the processing of steps S13 to S15 is temporarily carried forward and the answer is then acquired from the user. That is, the flow shown in FIG. 2 is executed once more, and processing returns to step S13. Through this processing, a preference expression, a descriptive word, or a superordinate concept corresponding to the target word is obtained from the user.
When the question is of type (1) above, the reply contains a descriptive word, which is extracted. For example, the descriptive word can be obtained by extracting, from the user's utterance, a phrase whose word sequence matches a predetermined pattern.

For example, suppose that the user answers "It's slippery" (「つるつるしてる」) to the question "What is ramen like?" and that the answer is analyzed as "tsurutsuru (adverb) + suru (verb)". In this case, the phrase "slippery" (「つるつるする」) is registered as the second word.
FIG. 4(B) shows an example in which a record (vocabulary data) has been added based on the user's perception that ramen is slippery.

When the question is of type (2) above, the reply contains a word expressing the superordinate concept of the target word, which is extracted.
The user's answer can take various forms, for example "○○", "○○だよ", "○○のことだね", "○○です", or "それは○○" (roughly, "XX", "It's XX", "You mean XX", "It is XX", "That's XX"). The part that constitutes the direct answer can therefore be extracted using an expression in which the variable parts are normalized (a regular expression).
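A regular expression of the kind mentioned above might look like the following sketch. The pattern covers only the five answer forms given as examples in the Japanese original ("○○", "○○だよ", "○○のことだね", "○○です", "それは○○"); the names `ANSWER_RE` and `extract_answer` are hypothetical, and a practical system would need a much broader pattern set.

```python
import re

# Optional leading "それは", the answer itself, an optional sentence ending,
# and optional final punctuation.
ANSWER_RE = re.compile(
    r"^(?:それは)?(?P<answer>.+?)(?:のことだね|だよ|です)?[。！!]?$"
)


def extract_answer(utterance):
    """Return the part of the reply that directly answers the question."""
    match = ANSWER_RE.match(utterance.strip())
    return match.group("answer") if match else None


examples = ["麺類", "麺類だよ", "麺類のことだね", "麺類です", "それは麺類"]
answers = [extract_answer(u) for u in examples]
```

The answer group is matched lazily so that a trailing ending such as "だよ" is stripped rather than absorbed into the extracted word.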

When the question is of type (3) above, the reply contains a preference expression by the user, which is extracted.
For example, it may be determined whether the question was affirmed or denied, as with "yes" or "no", or an absolute expression such as "like" or "dislike" may be extracted. Negations and double negations ("don't like", "don't exactly dislike", and so on) may also be taken into account.
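One rough way to handle affirmation, denial, absolute expressions, and negation together is sketched below. The keyword lists and the function name `classify_preference` are assumptions, not part of the specification, and mapping a double negation such as 「嫌いなわけではない」 ("don't exactly dislike") to plain "like" is a deliberate coarsening.

```python
NEGATIONS = ("じゃない", "ではない")   # "is not"
AFFIRMATIONS = ("うん", "はい")        # "yes"
DENIALS = ("ううん", "いいえ", "違う")  # "no"


def classify_preference(answer):
    """Classify a reply to a preference question as 'like', 'dislike', or None."""
    text = answer.strip()
    if "好き" in text:        # "like"
        polarity = "like"
    elif "嫌い" in text:      # "dislike"
        polarity = "dislike"
    elif text.startswith(DENIALS):
        return "dislike"
    elif text.startswith(AFFIRMATIONS):
        return "like"
    else:
        return None
    # A negation flips the polarity of the absolute expression.
    if any(n in text for n in NEGATIONS):
        polarity = "dislike" if polarity == "like" else "like"
    return polarity
```

Substring matching of this kind is fragile; a real implementation would more likely work on the output of morphological analysis, as the description does elsewhere.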

Once the first word and the corresponding second word have been obtained, they are recorded as a set in the word table (step S26).

Learning is sometimes possible without asking the user a question.
In the present embodiment, step S22 determines whether learning is possible from the content of the acquired utterance alone. For example, when the utterance contains a preference expression, a descriptive word, or a superordinate concept, as in "I like ramen", learning is possible without an additional question, so the process proceeds to step S26. In this case, "ramen" and "like" can be learned in association with each other. When learning is not possible, the process proceeds to step S23.

Then, in step S27, a response sentence is generated using the learning result (that is, using the information recorded in the word table). The same applies when step S23 determines that the target word has already been learned.

In step S27, the word table is searched for the target word, and if the target word is found, the word paired with it (which may be either the first word or the second word) is extracted and used to generate the response sentence.
For example, suppose the information shown in FIG. 4(A) has been recorded and the user says “My heart is pounding right now”. The device extracts “test” (first word), which corresponds to “heart-pounding” (second word), and generates a response sentence such as “Do you have a test or something?”
Conversely, when the user mentions “test”, the device may extract “heart-pounding” (second word), which corresponds to “test” (first word), and generate a response sentence such as “That gets your heart pounding, doesn't it?”
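The recording of step S26 and the two-way lookup of step S27 might be sketched as follows. The class `WordTable`, the function `respond`, and the response templates are hypothetical; the specification gives only example sentences. The strings “test” and “heart-pounding” are English glosses of テスト and ドキドキする from the FIG. 4(A) example.

```python
from typing import List, Optional, Tuple


class WordTable:
    """Pairs of (first word, second word), searchable from either side."""

    def __init__(self) -> None:
        self.pairs: List[Tuple[str, str]] = []

    def record(self, first: str, second: str) -> None:
        """Step S26: store the pair unless it is already present."""
        if (first, second) not in self.pairs:
            self.pairs.append((first, second))

    def counterpart(self, target: str) -> Optional[Tuple[str, str]]:
        """Step S27 lookup: return (paired word, role of the paired word)."""
        for first, second in self.pairs:
            if target == first:
                return second, "second"
            if target == second:
                return first, "first"
        return None


def respond(table: WordTable, target: str) -> Optional[str]:
    """Build a response from the word paired with the target, if any."""
    hit = table.counterpart(target)
    if hit is None:
        return None
    word, role = hit
    # Hypothetical templates, chosen by which side of the pair was found.
    if role == "first":
        return f"Do you have a {word} or something?"
    return f"That is {word}, isn't it?"
```

With the pair (“test”, “heart-pounding”) recorded, a mention of “heart-pounding” leads to a response about a test, and a mention of “test” leads to a response about its being heart-pounding.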

Although FIG. 4(A) illustrates adjectives, the second word may be a preference expression. For example, when “ramen” and “like” are associated and the user mentions “ramen”, a response sentence such as “You like ramen, don't you?” may be generated. When the user mentions “things I like”, a response sentence such as “Which do you like better, that or ramen?” may be generated.
Similarly, the second word may be a superordinate concept. For example, when “ramen” and “noodles” are associated and the user mentions “ramen”, a response sentence such as “Noodles, huh?” may be generated. When the user mentions “noodles”, a response sentence such as “Like ramen, right?” may be generated.

Note that when the process moves from step S26 to step S27, generating a response from the information learned immediately beforehand would sound unnatural, so it is preferable not to use information that has only just been learned.
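The flow from step S21 to step S27, including the preference for not reusing a pair learned in the same turn, can be outlined roughly as below. This is a sketch under assumptions: the function `handle_utterance` and its callback parameters are hypothetical, the word table is reduced to a plain dictionary, and steps S24 and S25 are assumed to be "ask a question" and "extract the answer".

```python
from typing import Callable, Dict, Optional


def handle_utterance(
    target: str,                                         # target word from step S21
    learn_from_utterance: Callable[[], Optional[str]],   # step S22
    ask_user: Callable[[str], str],                      # steps S24 and S25 (assumed)
    table: Dict[str, str],                               # first word -> second word
) -> Optional[str]:
    """Return the learned word usable for the response, or None."""
    just_learned = False
    second = learn_from_utterance()                      # S22: learn without asking?
    if second is None and target not in table:           # S23: not learned yet
        second = ask_user(target)                        # S24-S25: question and answer
    if second is not None:
        table[target] = second                           # S26: record the pair
        just_learned = True
    # S27: do not echo back a pair that was learned in this very turn.
    return None if just_learned else table.get(target)
```

Returning `None` here simply means that the response has to be generated without the word table for this turn.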

As described above, in the first embodiment, a first word and a second word representing the user's recognition of the first word are stored in association with each other, and a response sentence is generated by referring from one to the other. Topics can thus be generated on the basis of the user's recognition, which enriches the variety of responses.

(Second embodiment)
In the first embodiment, a response sentence is generated by cross-referencing the first word and the second word. In the second embodiment, by contrast, when the same word is associated as the second word with a plurality of first words, another first word associated with that same word (a related word) is extracted and used in the response.

For example, in FIG. 4(C), the word “heart-pounding” is associated with both “test” and “interview”. In this case, when the user says something like “I have an interview coming up”, the device may extract the related word “test” via the second word and generate a response sentence such as “An interview gets your heart pounding just like a test, doesn't it?”

Also in FIG. 4(C), the word “smooth” is associated with both “ramen” and “tsukemen”. In this case, when the user says something like “Maybe I'll have ramen?”, the device may extract the word “tsukemen” via the second word and generate a response sentence such as “Tsukemen would be nice too!”

As described above, in the second embodiment, another related word that shares the same second word is extracted and used to generate the response sentence. Because a related word is one to which the user perceives a certain relevance, the device can present a topic that feels natural to that user.
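The related-word extraction of the second embodiment amounts to a lookup through the shared second word. A minimal sketch follows, assuming a dictionary that maps each first word to its single second word. The function name `related_words` is hypothetical, and the table holds English glosses of the pairs mentioned for FIG. 4(C).

```python
from typing import Dict, List


def related_words(table: Dict[str, str], target: str) -> List[str]:
    """Other first words whose second word matches that of the target."""
    second = table.get(target)
    if second is None:
        return []
    return [first for first, s in table.items() if s == second and first != target]


# Pairs mentioned for FIG. 4(C), as English glosses.
table = {
    "test": "heart-pounding",
    "interview": "heart-pounding",
    "ramen": "smooth",
    "tsukemen": "smooth",
}
```

Here a mention of “interview” yields the related word “test”, and a mention of “ramen” yields “tsukemen”.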

(Third embodiment)
In the first and second embodiments, only one second word is defined. In the third embodiment, by contrast, the second word is held in a plurality of fields, one for each type.

The configuration of the voice dialog system according to the third embodiment is the same as in the first embodiment, so its description is omitted; only the differences in the data used and in the processing are described.

FIG. 5 shows an example of the word table stored by the response generation unit 33 in the third embodiment. In this embodiment, the second word defined in the word table is represented by three fields: “preference expression”, “adjective”, and “superordinate concept”.

In the third embodiment, when a target word is acquired in the dialog, another word that shares a second word (preference expression, adjective, or superordinate concept) with it is extracted as a related word and used to generate the response sentence, as in the second embodiment.

The processing of the response generation unit 33 in the third embodiment is described with reference to FIG. 3. Here, assume that the user has said “A new udon shop has opened” and that, in step S21, the device has determined that the user is referring to “udon”. Also assume that the response generation unit 33 stores the word table shown in FIG. 5(A).
In this case, the determinations in steps S22 and S23 are both negative.

The third embodiment differs from the second embodiment in that, when vocabulary data is added, the device determines what was extracted from the user's utterance (a preference expression, an adjective, or a superordinate concept) and stores it in the appropriate field.
In the third embodiment, steps S24 and S25 may be repeated several times so that several types of question are asked. For example, two questions may be asked, one about the superordinate concept and one about preference.
In this example, the question “What is udon?” yields an answer such as “It's a kind of noodle”, and the question “Do you like udon?” yields an answer such as “I like it”. As a result, the word table becomes as shown in FIG. 5(B).
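A word-table entry with one field per type, and the storing of an extracted word in the matching field, could look like the following sketch. The names `Entry` and `store` and the field identifiers are assumptions chosen for illustration; the description specifies only that there are three fields.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """Second words held for one first word, one field per type."""
    preference: Optional[str] = None      # preference expression, e.g. "like"
    adjective: Optional[str] = None       # adjective, e.g. "smooth"
    superordinate: Optional[str] = None   # superordinate concept, e.g. "noodles"


def store(table: Dict[str, Entry], first: str, kind: str, second: str) -> None:
    """Write `second` into the field named by `kind` for `first`."""
    if kind not in ("preference", "adjective", "superordinate"):
        raise ValueError(f"unknown field: {kind}")
    entry = table.setdefault(first, Entry())
    setattr(entry, kind, second)
```

Calling `store` once for the answer about the superordinate concept and once for the answer about preference fills two fields of the “udon” entry, which corresponds to the update described for the two questions above.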

In the third embodiment, as in the second embodiment, step S27 extracts as a related word a first word that has the same word associated with it in one of the second-word fields (preference expression, adjective, or superordinate concept), and uses it in the response.
For example, when an utterance mentioning “udon” is made, the device can obtain the word “soba”, which has the same superordinate concept, and generate a response sentence whose topic is soba, such as “Udon, huh? If it's noodles, soba is good too.”

There are cases in which, for a plurality of first words, one of the preference expression, adjective, and superordinate concept is shared while the others conflict. This case is described below.

For example, in FIG. 5(B), the user likes udon but dislikes soba, which is also a kind of noodle; that is, some of the associated second words conflict with each other. In step S27, the device may therefore generate a response sentence that takes this conflict as its topic, such as “You like udon, but you dislike soba, don't you?”
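Finding a related word that shares one field with the target while conflicting in another can be sketched as below. The function `shared_and_conflicting` and the nested-dictionary layout are hypothetical, and the sketch treats any differing value in a common field as a conflict, which is a simplification (differing adjectives, for instance, need not contradict each other). The rows are English glosses of the udon and soba example of FIG. 5(B).

```python
from typing import Dict, List, Tuple

Table = Dict[str, Dict[str, str]]

# Rows for the udon/soba example of FIG. 5(B), as English glosses.
TABLE: Table = {
    "udon": {"preference": "like", "superordinate": "noodles"},
    "soba": {"preference": "dislike", "superordinate": "noodles"},
}


def shared_and_conflicting(table: Table, target: str) -> List[Tuple[str, str, str]]:
    """Return (related word, shared field, conflicting field) triples."""
    found: List[Tuple[str, str, str]] = []
    mine = table.get(target, {})
    for other, theirs in table.items():
        if other == target:
            continue
        # Only fields filled in for both words can be compared.
        common = set(mine) & set(theirs)
        shared = sorted(f for f in common if mine[f] == theirs[f])
        differing = sorted(f for f in common if mine[f] != theirs[f])
        for s in shared:
            for d in differing:
                found.append((other, s, d))
    return found
```

For “udon”, the result names “soba” as sharing the superordinate concept while differing in preference, which is the situation the example response comments on.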

In this way, the third embodiment stores the second word in subdivided form and uses it to generate response sentences. With this configuration, more words can be extracted from the vocabulary database. In addition, when some of the second words conflict, the device can point this out, further increasing the topics available for the dialog.

(Modification)
The above embodiments are merely examples, and the present invention may be implemented with appropriate modifications without departing from its gist.

For example, although the server device 30 performs voice recognition in the embodiments described above, the control device 20 may be provided with the means for performing voice recognition. Likewise, although the server device 30 generates the response sentence in the embodiments described above, the control device 20 may be provided with the means for generating the response sentence. The robot 10 may also perform all of the processing, without using the control device 20 or the server device 30.

DESCRIPTION OF SYMBOLS
10 ... Robot
11 ... Voice input unit
12, 21 ... Short-range communication unit
13 ... Voice output unit
14 ... Operation control unit
20 ... Control device
22 ... Control unit
23, 31 ... Communication unit
30 ... Server device
32 ... Voice recognition unit
33 ... Response generation unit

Claims

A voice dialog device that interacts with a user by voice, comprising:
voice acquisition means for acquiring an utterance made by the user;
a vocabulary database that stores a plurality of vocabulary data for generating a response sentence; and
response generation means for generating a response sentence to the utterance based on the stored vocabulary data, wherein
the vocabulary data is data in which a first word is associated with a second word that represents the user's recognition of the first word, and
the response generation means performs first response generation, in which it extracts from the vocabulary data the first word or the second word corresponding to an object mentioned in the utterance and generates a response sentence using the word associated with the extracted word.

The voice dialog device according to claim 1, further comprising vocabulary collection means for asking the user a question about the user's recognition of the object mentioned in the utterance, and for generating or updating the vocabulary data based on the answer obtained from the user.

The voice dialog device according to claim 2, wherein, when the utterance made by the user contains a word representing the user's recognition of the object mentioned in the utterance, the vocabulary collection means generates or updates the vocabulary data based on that word.

The voice dialog device according to claim 2 or 3, wherein
the vocabulary database stores the plurality of vocabulary data in association with each user, and
the response generation means uses the vocabulary data associated with the user with whom the device is interacting.

The voice dialog device according to claim 1, wherein the second word is any one of a word representing a superordinate concept of the first word, a word with which the user describes the first word, and a word representing the user's preference for the first word.

The voice dialog device according to claim 1, wherein the response generation means performs second response generation, in which it extracts from the vocabulary data the first word corresponding to the object mentioned in the utterance, extracts a related word that is another first word having an associated second word in common with the extracted first word, and generates a response sentence using the related word.

The voice dialog device according to claim 6, wherein, in the second response generation, when a plurality of second words are associated with the extracted first word, the response generation means extracts, in addition to the related word, a second word that is not directly associated with the related word, and generates a response sentence using the related word and the extracted second word.

A dialog method performed by a voice dialog device that interacts with a user by voice, comprising:
a voice acquisition step of acquiring an utterance made by the user; and
a response generation step of generating a response sentence to the utterance based on a plurality of vocabulary data for generating a response sentence, wherein
the vocabulary data is data in which a first word is associated with a second word that represents the user's recognition of the first word, and
in the response generation step, first response generation is performed, in which the first word or the second word corresponding to an object mentioned in the utterance is extracted from the vocabulary data and a response sentence is generated using the word associated with the extracted word.

A voice dialog device that interacts with a user by voice, comprising:
voice acquisition means for acquiring an utterance made by the user; and
response generation means for generating a response sentence to the utterance based on a plurality of vocabulary data for generating a response sentence, wherein
the vocabulary data is data in which a first word is associated with a second word that represents the user's recognition of the first word, and
the response generation means performs first response generation, in which it extracts from the vocabulary data the first word or the second word corresponding to an object mentioned in the utterance and generates a response sentence using the word associated with the extracted word.