JP2022054667A

JP2022054667A - Voice dialogue device, voice dialogue system, and voice dialogue method

Info

Publication number: JP2022054667A
Application number: JP2020161825A
Authority: JP
Inventors: 智子内山; Tomoko Uchiyama
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2022-04-07

Abstract

To provide a voice dialogue device, a voice dialogue system, and a voice dialogue method capable of improving convenience for a user of the voice dialogue device. SOLUTION: In a voice dialogue system 100, a control unit of a voice dialogue device includes: a detection unit that detects a user's utterance; a determination unit that, when the user's utterance is detected, determines the possibility that the dialogue with the user is continuing; and a response processing unit that responds to the user's utterance when the determination unit determines that the dialogue is continuing. To determine the possibility of dialogue continuation, the determination unit uses the domain of the dialogue with the user that took place immediately before the detected utterance. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice dialogue device, a voice dialogue system, and a voice dialogue method.

Conventionally, voice dialogue systems are known that conduct a voice dialogue with a user by outputting a response corresponding to words input by the user (see, for example, Patent Document 1).

In a configuration that always performs voice recognition and intent understanding on the user's utterances, the processing load on the system tends to be large. For this reason, many current voice dialogue systems start a dialogue when the user presses a button or when a predetermined word, called a wake word, is detected.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-109663

In a configuration in which a dialogue is started by pressing a button or uttering a wake word, the user must press the button or utter the wake word for every request, and may therefore find the system hard to use.

In view of the above, an object of the present invention is to provide a technique capable of improving convenience for a user of a voice dialogue device.

To achieve the above object, a voice dialogue device of the present invention includes: a detection unit that detects a user's utterance; a determination unit that, when the user's utterance is detected, determines the possibility that the dialogue with the user is continuing; and a response processing unit that responds to the user's utterance when the determination unit determines that the dialogue is continuing. To determine the possibility of dialogue continuation, the determination unit uses the domain of the dialogue with the user that took place immediately before the detected utterance (first configuration).

In the voice dialogue device of the first configuration, the determination unit may further use the response state of the device itself in the dialogue with the user to determine the possibility of dialogue continuation (second configuration).

In the voice dialogue device of the first or second configuration, the determination unit may further use voice features extracted from the user's voice to determine the possibility of dialogue continuation (third configuration).

In the voice dialogue device of any of the first to third configurations, the determination unit may further use information obtained from a captured image of the user to determine the possibility of dialogue continuation (fourth configuration).

In the voice dialogue device of any of the first to fourth configurations, the determination unit may determine the possibility of dialogue continuation when the user's utterance is detected within a certain period of time after the end of the immediately preceding dialogue with the user (fifth configuration).

In the voice dialogue device of any of the first to fifth configurations, the determination unit may determine whether the dialogue with the user is continuing by comparing a numerical value indicating the possibility of dialogue continuation with a threshold, and the threshold may be changed according to the time elapsed since the end of the immediately preceding dialogue with the user (sixth configuration).

In the voice dialogue device of any of the first to sixth configurations, the determination unit may determine whether the dialogue with the user is continuing by comparing a numerical value indicating the possibility of dialogue continuation with a threshold, and, when the user's utterance is detected at a specific timing, the determination unit may change the threshold so that the dialogue is more readily determined to be continuing (seventh configuration).
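The sixth and seventh configurations can be illustrated with a minimal Python sketch. The text states only that the threshold changes with elapsed time and is changed at a specific timing to make continuation easier to detect. The linear rise, the numeric constants, the `>=` comparison, and all function and parameter names below are assumptions made for illustration.

```python
def continuation_threshold(elapsed_s, base=0.5, slope=0.05, ceiling=0.95):
    """Hypothetical threshold that rises with the time since the last dialogue ended.

    The source says only that the threshold is changed according to elapsed
    time. A linear rise capped at `ceiling` is an assumption.
    """
    return min(ceiling, base + slope * max(0.0, elapsed_s))


def is_dialogue_continuing(score, elapsed_s, specific_timing=False, relief=0.2):
    """Compare a continuation score with the time-dependent threshold.

    When the utterance arrives at a "specific timing", the threshold is
    lowered by `relief` (an assumed amount) so that continuation is judged
    more readily, as in the seventh configuration.
    """
    threshold = continuation_threshold(elapsed_s)
    if specific_timing:
        threshold = max(0.0, threshold - relief)
    return score >= threshold
```

With these assumed values, a score of 0.6 passes right after the dialogue ends but not ten seconds later, when the threshold has reached its ceiling.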

In the voice dialogue device of any of the first to seventh configurations, the determination unit may determine the possibility of dialogue continuation when the user's utterance is detected and the user has not performed a predetermined dialogue start operation (eighth configuration).

To achieve the above object, a voice dialogue system of the present invention includes: the voice dialogue device of any of the first to eighth configurations; a microphone that converts the user's voice into an audio signal and outputs it to the voice dialogue device; and a speaker that converts an audio signal output from the voice dialogue device into sound and emits it toward the user (ninth configuration).

The voice dialogue system of the ninth configuration may further include a camera that photographs the user and outputs information on the captured image to the voice dialogue device (tenth configuration).

To achieve the above object, a voice dialogue method of the present invention is a voice dialogue method performed in a voice dialogue device, and includes: a detection step of detecting a user's utterance; a determination step of determining, when the user's utterance is detected, the possibility that the dialogue with the user is continuing; and a response processing step of responding to the user's utterance when the determination step determines that the dialogue is continuing. The domain of the dialogue with the user that took place immediately before the detected utterance is used to determine the possibility of dialogue continuation (eleventh configuration).

According to the present invention, convenience for a user of a voice dialogue device can be improved.

FIG. 1: Block diagram showing the configuration of the voice dialogue system.
FIG. 2: Block diagram showing the functions of the control unit of the voice dialogue device.
FIG. 3: Schematic diagram outlining the operation of the voice dialogue device.
FIG. 4: Flowchart showing an example of the operation of the voice dialogue device in the utterance provisional reception state.
FIG. 5: Flowchart showing a modified example of the operation of the voice dialogue device in the utterance provisional reception state.
FIG. 6: Flowchart showing a first detailed example of the threshold adjustment process shown in FIG. 5.
FIG. 7: Flowchart showing a second detailed example of the threshold adjustment process shown in FIG. 5.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the drawings.

<1. Voice Dialogue System>
FIG. 1 is a block diagram showing the configuration of a voice dialogue system 100 according to an embodiment of the present invention. The voice dialogue system 100 of the present embodiment is applied to a vehicle as an example. Vehicles here may broadly include wheeled conveyances such as automobiles and trains. The voice dialogue system 100 of the present invention may also be applied to moving bodies other than vehicles that carry people, such as ships and aircraft, and to things other than moving bodies, such as houses and facilities.

As shown in FIG. 1, the voice dialogue system 100 includes a voice dialogue device 1, a microphone 2, and a speaker 3. The voice dialogue system 100 further includes a camera 4 and a server device 5. Note that the voice dialogue system 100 may omit at least one of the camera 4 and the server device 5.

The voice dialogue device 1 is an in-vehicle dialogue device arranged at an appropriate position in the vehicle. The voice dialogue device 1 responds to the user's utterances as appropriate. In the present embodiment, the user is an occupant of the vehicle, such as the driver. Responses include spoken answers to the user's utterance. Responses may also include answers given by means other than voice, such as displaying the answer to the user's utterance on a screen. In a configuration that displays answers on a screen, the voice dialogue system 100 includes a monitor. The term "dialogue" as used in this specification covers not only spoken exchanges between the user and the device but also cases in which the device responds, by means other than voice, to a spoken question or instruction from the user.

In principle, when the voice dialogue device 1 of the present embodiment detects a wake word uttered by the user, it interprets the intent of the user's utterance and gives a response suited to that utterance. The voice dialogue device 1 is based on single question-and-answer exchanges, so in principle the user must utter the wake word for every request. However, when the voice dialogue device 1 has itself asked the user a question back, it responds without detecting the wake word. The voice dialogue device 1 also responds without detecting the wake word under specific conditions. The details of this point will be described later.

The wake word is a predetermined word that triggers the start of a dialogue (exchange) with the user. A user who wants to start a dialogue with the voice dialogue device 1 utters the wake word as a signal to start the dialogue. The voice dialogue device 1 then interprets the intent of the utterance that the user makes after the wake word and responds. The wake word may be, for example, "Hello, my computer" or "Hey, vehicle".

In the present embodiment, the dialogue between the user and the voice dialogue device 1 is started by detection of the wake word, but this is only an example. Instead of detecting the wake word, the dialogue may be started, for example, by detecting that a button has been pressed. Wake word detection and a button may also be used together.

The microphone 2 picks up the voice uttered by the user. The microphone 2 is arranged at an appropriate position in the vehicle. The microphone 2 is connected to the voice dialogue device 1 by wire or wirelessly. The microphone 2 converts the user's voice into an audio signal and outputs it to the voice dialogue device 1. The microphone 2 may be included in the voice dialogue device 1.

The speaker 3 is connected to the voice dialogue device 1 by wire or wirelessly. The speaker 3 converts the audio signal output from the voice dialogue device 1 into sound and emits it toward the user. Like the microphone 2, the speaker 3 is arranged at an appropriate position in the vehicle. The speaker 3 may be included in the voice dialogue device 1.

The camera 4 is connected to the voice dialogue device 1 by wire or wirelessly. The camera 4 photographs the user and outputs information on the captured image to the voice dialogue device 1. The camera 4 is arranged at an appropriate position in the vehicle so that it can capture, for example, the whole body of a user sitting in a vehicle seat, or the face of a user sitting in a vehicle seat. The camera 4 may be included in the voice dialogue device 1.

The server device 5 is a computer device connected to a network such as the Internet. The server device 5 of the present embodiment includes artificial intelligence (AI). The server device 5 can receive various kinds of information from any other computer device connected to the network. The voice dialogue device 1 can exchange information with the server device 5 via the network.

In the voice dialogue system 100 of the present embodiment, as described in detail later, the voice dialogue device 1 responds to the user's utterance under specific conditions even if the user does not utter the wake word, which improves convenience for the user. In addition, in the present embodiment the voice dialogue device 1 can also take into account the image information of the user acquired from the camera 4 when making determinations related to the dialogue with the user, so the user can interact with the voice dialogue device 1 in a way that feels closer to talking with a person.

<2. Voice Dialogue Device>
Next, the voice dialogue device 1 will be described in detail. As shown in FIG. 1, the voice dialogue device 1 includes a detection unit 11, an image processing unit 12, a control unit 13, a storage unit 14, and a communication unit 15.

The detection unit 11 receives the audio signal from the microphone 2. The detection unit 11 detects the user's utterance. In the present embodiment, the detection unit 11 not only detects the user's utterance but also performs voice recognition on it. Hereinafter, the detection unit 11 is referred to as the voice recognition unit 11.

The voice recognition unit 11 is implemented as a semiconductor integrated circuit, for example an AI chip. The voice recognition unit 11 detects the user's utterance from the input audio signal. The voice recognition unit 11 converts the detected speech into text data (character string data) and extracts features of the voice. The voice features may include, for example, volume, pitch, and intonation. The voice recognition unit 11 is connected to the control unit 13. The voice recognition unit 11 outputs voice-related information, including the text data obtained by the conversion and data indicating the voice features, to the control unit 13.

The image processing unit 12 receives the data of images captured by the camera 4. The image processing unit 12 is implemented as a semiconductor integrated circuit, for example an AI chip. The image processing unit 12 extracts features related to the user's behavior (movements) from the input image data. These features may include, for example, changes in the user's posture and changes in the orientation of the face (specifically, rotation of the face and vertical movement of the face). The image processing unit 12 is connected to the control unit 13. The image processing unit 12 outputs captured image information, including data indicating the features related to the user's behavior, to the control unit 13.

The control unit 13 is a controller that performs overall control of the voice dialogue device 1. The control unit 13 may be, for example, a computer including a CPU (Central Processing Unit). The various functions of the control unit 13 are realized by the computer executing arithmetic processing according to a program stored in the storage unit 14.

The storage unit 14 is composed of, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, a hard disk, or a storage device that uses a portable recording medium such as an optical disc. The storage unit 14 stores programs, such as firmware, and various data.

The communication unit 15 is connected to the control unit 13. The communication unit 15 is connected to the server device 5 via the network using wireless communication and performs bidirectional communication with the server device 5. That is, the control unit 13 can exchange information with the server device 5 through the communication unit 15.

FIG. 2 is a block diagram showing the functions of the control unit 13 of the voice dialogue device 1 according to the embodiment of the present invention. The control unit 13 includes a wake word detection unit 131, a response processing unit 132, and a determination unit 133 as functions realized by its computer performing arithmetic processing according to a program. In other words, the voice dialogue device 1 includes the wake word detection unit 131, the response processing unit 132, and the determination unit 133. When a button is used instead of the wake word, the wake word detection unit 131 is unnecessary.

At least one of the wake word detection unit 131, the response processing unit 132, and the determination unit 133 may be implemented in hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The wake word detection unit 131, the response processing unit 132, and the determination unit 133 are conceptual components. The functions executed by one component may be distributed among a plurality of components, and the functions of a plurality of components may be integrated into one component. In addition, at least some of the functions of the voice recognition unit 11 and at least some of the functions of the image processing unit 12 described above may be included in the functions of the control unit 13.

The wake word detection unit 131 detects that the user has uttered the wake word when the text data of the user's utterance contains the wake word. As described later, detection of the wake word activates the dialogue function of the voice dialogue device 1, and the voice dialogue device 1 then acts on the user's question or command.
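As a rough illustration of wake word detection on recognized text, the following Python sketch checks whether the utterance text contains one of the example wake words and returns the remainder of the utterance. The function name, the case-insensitive matching, and the trimming of punctuation are assumptions and are not taken from the specification.

```python
import re

# Example wake words taken from the description.
WAKE_WORDS = ("Hello, my computer", "Hey, vehicle")


def split_on_wake_word(utterance_text):
    """Return (detected, request_text) for a recognized utterance.

    `detected` is True when the text contains a wake word. `request_text`
    is whatever follows the wake word, or None when no wake word is found.
    """
    for wake_word in WAKE_WORDS:
        match = re.search(re.escape(wake_word), utterance_text, re.IGNORECASE)
        if match:
            # Drop separators left between the wake word and the request.
            return True, utterance_text[match.end():].lstrip(" ,.!?")
    return False, None
```

For example, `split_on_wake_word("Hey, vehicle, what's the weather today?")` separates the wake word from the request that the response processing would then handle.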

When it is detected that the user has uttered the wake word, the response processing unit 132 performs response processing for the utterance that follows the wake word. The response processing includes, for example, interpretation processing of the user's speech, resolution processing that fulfills the user's request as understood through the interpretation, and delivery processing that conveys the result of the resolution processing to the user. The interpretation processing includes, for example, analysis of the intent of the user's utterance. The resolution processing includes, for example, a search using the Internet. The delivery processing includes, for example, a voice response or a display response. As described above, in this specification a display response, as well as a voice response, is an element of a dialogue.

All of the processes included in the response processing may be performed by the response processing unit 132, but in the present embodiment some of them are performed by the server device 5. When the wake word is detected, the response processing unit 132 transmits the audio signal of the user's utterance to the server device 5 via the communication unit 15. The server device 5 performs detailed voice recognition processing, natural language processing, and the like on the received audio signal to generate result data that answers the user's request, and transmits the result data to the voice dialogue device 1. The response processing unit 132 responds to the user based on the received result data.

When the user's utterance is detected, the determination unit 133 determines the possibility that the dialogue with the user is continuing. To make this determination, the determination unit 133 uses the domain of the dialogue with the user that took place immediately before the utterance detected by the detection unit 11. Here, the dialogue with the user means the dialogue between the user and the device 1 itself. The domain of the dialogue is determined in consideration of the intent of the utterances of the user and of the device 1.

More specifically, the domain of a dialogue is its topic. For example, when the user asks the voice dialogue device 1 about "today's weather", the domain of the dialogue is "weather". Likewise, when the user asks about "today's most-accessed news story", the domain of the dialogue is "news". The likelihood that the dialogue with the user will continue tends to differ depending on the domain. For example, when the domain is "weather", the dialogue tends to be likely to continue, and when the domain is "news", it tends to be unlikely to continue. Using the domain of the dialogue therefore makes it possible to determine the possibility of dialogue continuation appropriately.
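One simple way to picture the use of the domain is a lookup table that maps the domain of the preceding dialogue to a continuation score. The sketch below is only an illustration. The numeric values, the default for unknown domains, and the names are assumptions, and only the ordering "weather" above "news" comes from the text.

```python
# Hypothetical per-domain likelihood that a follow-up utterance continues
# the dialogue. Only the relative order of "weather" and "news" is from
# the description. The numbers themselves are illustrative.
DOMAIN_CONTINUATION_SCORE = {
    "weather": 0.8,
    "news": 0.3,
}
DEFAULT_SCORE = 0.5  # assumed value for domains not listed above


def continuation_score(previous_domain):
    """Return a score in [0, 1] based on the domain of the preceding dialogue."""
    return DOMAIN_CONTINUATION_SCORE.get(previous_domain, DEFAULT_SCORE)
```

Such a score could then be combined with other cues mentioned in the configurations (response state, voice features, image information) before being compared with a threshold.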

More specifically, in the present embodiment the determination unit 133 determines the possibility of dialogue continuation when the user's utterance is detected and the user has not performed a predetermined dialogue start operation. This configuration avoids determining the possibility of dialogue continuation unnecessarily and reduces the processing load on the voice dialogue device 1. In the present embodiment, the predetermined dialogue start operation is the user's utterance of the wake word. Instead of utterance of the wake word, the predetermined dialogue start operation may be, for example, pressing a button.

Furthermore, in the present embodiment the determination unit 133 also does not determine the possibility of dialogue continuation when the device 1 itself has asked the user a question back. This likewise avoids unnecessary determinations and further reduces the processing load on the voice dialogue device 1.
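The conditions under which the determination is skipped can be summarized as a small routing function. This Python sketch is an interpretation of the two paragraphs above. The `Action` enum and the parameter names are hypothetical, and the time window of the utterance provisional reception state is deliberately left out.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"                  # nothing to process
    RESPOND = "respond"                # respond without a continuation check
    EVALUATE_CONTINUATION = "evaluate" # run the continuation determination


def route_utterance(utterance_detected, wake_word_detected, asked_back):
    """Decide whether the continuation determination needs to run.

    The determination is skipped when the user performed the dialogue start
    operation (wake word) or when the device has asked the user a question
    back. In both cases the device responds directly.
    """
    if not utterance_detected:
        return Action.IGNORE
    if wake_word_detected or asked_back:
        return Action.RESPOND
    return Action.EVALUATE_CONTINUATION
```

Only utterances that reach `Action.EVALUATE_CONTINUATION` incur the cost of the determination, which matches the stated aim of reducing processing load.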

In the present embodiment, the response processing unit 132 responds to the user's utterance when the determination unit 133 determines that the dialogue (the dialogue between the user and the device 1) is continuing. The user therefore does not have to utter the wake word before every utterance, which improves convenience. The user can also interact with the voice dialogue device 1 in a way that feels closer to talking with a person.

FIG. 3 is a schematic diagram outlining the operation of the voice dialogue device 1 according to the embodiment of the present invention. As shown in FIG. 3, when the voice dialogue device 1 detects the wake word, it enters an utterance acceptance state in which it accepts the utterance of the user 6. The utterance acceptance state is a state in which the device 1 has decided to respond to the utterance of the user 6. The voice dialogue device 1 performs response processing for the utterance of the user 6 that follows the wake word (utterance A in FIG. 3). The response processing includes interpretation of the user's utterance and resolution of the interpreted request (a search in the example of FIG. 3), both performed by the server device 5. The voice dialogue device 1 gives a voice response or a display response presenting the search result. When the response processing is completed, the voice dialogue device 1 recognizes that the dialogue with the user 6 has ended. Note that when the voice dialogue device 1 has asked the user a question back, it recognizes that the dialogue with the user 6 is continuing and enters the utterance acceptance state.

As a characteristic feature of the present embodiment, even after the dialogue with the user 6 is recognized as having ended, the device enters, for a fixed time only, a provisional utterance acceptance state in which utterances of the user 6 are provisionally accepted. In the provisional utterance acceptance state, the device does not necessarily respond to every utterance of the user; it responds only when a specific condition is satisfied. In this state, the determination unit 133 described above determines the possibility of continuation of the dialogue and, based on the result, decides whether to respond to the user's utterance.

In the example shown in FIG. 3, the user 6 makes utterance B during the provisional utterance acceptance state. Triggered by the detection of utterance B, the determination unit 133 determines the domain of the immediately preceding dialogue (the dialogue involving utterance A). The determination unit 133 uses the determined domain to determine the possibility of continuation of the dialogue with the user. When the determination unit 133 determines that the dialogue is continuing, the response processing unit 132 responds to the user 6. When the determination unit 133 determines that the dialogue is not continuing, the response processing unit 132 does not respond to the user 6; that is, the processing of the response processing unit 132 is not performed at all.
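The state changes described for FIG. 3 can be illustrated with a small transition table. This is only a sketch under stated assumptions: the names `State`, `Event`, `TRANSITIONS`, and `next_state` are hypothetical, and the idle state and the event names are introduced here for illustration rather than taken from the specification.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # waiting for a wake word (name assumed for this sketch)
    ACCEPTING = auto()    # utterance acceptance state
    PROVISIONAL = auto()  # provisional utterance acceptance state


class Event(Enum):
    WAKE_WORD = auto()       # wake word detected
    RESPONSE_DONE = auto()   # response processing completed
    ASKED_BACK = auto()      # device asked the user a question in return
    CONTINUATION = auto()    # determination unit judged the dialogue to be continuing
    WINDOW_EXPIRED = auto()  # the fixed time elapsed without a continuation


# Hypothetical transition table derived from the description of FIG. 3.
TRANSITIONS = {
    (State.IDLE, Event.WAKE_WORD): State.ACCEPTING,
    (State.ACCEPTING, Event.RESPONSE_DONE): State.PROVISIONAL,
    (State.ACCEPTING, Event.ASKED_BACK): State.ACCEPTING,
    (State.PROVISIONAL, Event.WAKE_WORD): State.ACCEPTING,
    (State.PROVISIONAL, Event.CONTINUATION): State.ACCEPTING,
    (State.PROVISIONAL, Event.WINDOW_EXPIRED): State.IDLE,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state; unlisted (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

A table keeps the allowed transitions in one place, so an utterance judged not to be a continuation simply produces no event and the device stays in the provisional state until the window expires.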

In the present embodiment, the determination unit 133 determines the possibility of continuation of the dialogue when an utterance of the user 6 is detected within a fixed time after the end of the immediately preceding dialogue with the user 6. With this configuration, the processing needed to respond to the user 6 does not have to run at all times, which reduces the processing load on the voice dialogue device 1. It also reduces the likelihood of a malfunction in which the device responds even though the user 6 has not requested a response.

In a preferred form of the present embodiment, the determination unit 133 additionally uses voice features extracted from the voice of the user 6 to determine the possibility of continuation of the dialogue. The voice features may include, for example, volume, pitch, and intonation. For example, an utterance of the user 6 that is directed at the device 1 tends to have less intonation. Accordingly, the determination unit 133 determines that the possibility of continuation is high when, for example, the user's utterance detected in the provisional utterance acceptance state (utterance B in the example of FIG. 3) has little intonation. Because this configuration judges the possibility of continuation using indicators in addition to the dialogue domain, it can determine more appropriately whether the utterance of the user 6 is directed at the device 1.
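One way to quantify "little intonation" is the relative spread of the pitch contour. The following is a minimal sketch, not the method of the specification: it assumes the pitch (fundamental frequency) has already been estimated per frame, that unvoiced frames are marked with 0, and that a coefficient of variation below a hypothetical limit `max_cv` counts as flat.

```python
from statistics import mean, pstdev


def intonation_score(f0_hz: list[float]) -> float:
    """Coefficient of variation of the voiced pitch samples (0.0 means perfectly flat)."""
    voiced = [f for f in f0_hz if f > 0.0]  # 0 marks unvoiced frames (assumed convention)
    if len(voiced) < 2:
        return 0.0
    return pstdev(voiced) / mean(voiced)


def is_flat_intonation(f0_hz: list[float], max_cv: float = 0.1) -> bool:
    """True if the pitch varies by no more than max_cv relative to its mean (limit is hypothetical)."""
    return intonation_score(f0_hz) <= max_cv
```

In the embodiment such a feature is fed to a trained model rather than compared with a fixed limit; the fixed limit is used here only to keep the example self-contained.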

Also in a preferred form of the present embodiment, the determination unit 133 additionally uses information obtained from a captured image of the user 6 to determine the possibility of continuation of the dialogue. Specifically, the determination unit 133 determines the possibility of continuation using features of the behavior (movement) of the user who has spoken after the device entered the provisional utterance acceptance state. The behavioral features may include, for example, a change in the user's posture or a change in the orientation of the user's face. For example, the user 6 tends to glance toward the microphone 2 (or the device 1) when speaking to the device 1. Therefore, when a change in the user's posture or face orientation indicates that the user has turned toward the device 1 or the microphone 2, the possibility of continuation can be judged to be high. Because this configuration judges the possibility of continuation using indicators in addition to the dialogue domain, it can determine more appropriately whether the utterance of the user 6 is directed at the device 1.
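The "turned toward the device" cue can be illustrated with plane geometry. This sketch assumes, beyond what the specification states, that image processing already yields a 2D face direction vector before and after the utterance and a vector from the user to the device; the function names and the 20 degree tolerance are hypothetical.

```python
import math


def facing_angle_deg(face_dir: tuple[float, float], to_device: tuple[float, float]) -> float:
    """Angle in degrees between the face direction and the direction to the device."""
    dot = face_dir[0] * to_device[0] + face_dir[1] * to_device[1]
    norm = math.hypot(*face_dir) * math.hypot(*to_device)
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def turned_toward_device(
    before: tuple[float, float],
    after: tuple[float, float],
    to_device: tuple[float, float],
    max_angle_deg: float = 20.0,  # hypothetical tolerance
) -> bool:
    """True if the face was outside the tolerance before and is within it after."""
    return (
        facing_angle_deg(before, to_device) > max_angle_deg
        and facing_angle_deg(after, to_device) <= max_angle_deg
    )
```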

In the provisional utterance acceptance state, the voice dialogue device 1 of the present embodiment determines the possibility of continuation of the dialogue using multiple kinds of information, namely the domain of the immediately preceding dialogue, information on the voice features of the user 6, and image information of the user 6, and responds or does not respond to the user 6 accordingly. A detailed example of the operation of the voice dialogue device 1 in the provisional utterance acceptance state is described below with reference to FIG. 4, which is a flowchart showing an example of that operation.

In step S1, the determination unit 133 determines whether the current time is within a fixed time after the end of the preceding dialogue with the user 6. The fixed time is, for example, 60 seconds. If the fixed time has elapsed (No in step S1), the determination unit 133 ends the processing. If the current time is within the fixed time (Yes in step S1), the determination unit 133 advances the processing to step S2.

In step S2, the determination unit 133 determines whether an utterance of the user 6 has been detected. The utterance of the user 6 is detected by the voice recognition unit 11. If an utterance of the user 6 has been detected (Yes in step S2), the determination unit 133 advances the processing to step S3. If no utterance of the user 6 has been detected (No in step S2), the determination unit 133 returns the processing to step S1.
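Steps S1 and S2 form a loop that waits for an utterance until the fixed time runs out. The following sketch illustrates that loop under stated assumptions: `wait_for_utterance`, `poll_utterance`, `clock`, and `poll_interval_s` are hypothetical names, and polling is used only for simplicity (an actual device may be event driven).

```python
import time
from typing import Callable, Optional


def wait_for_utterance(
    dialogue_end: float,
    poll_utterance: Callable[[], Optional[str]],
    window_s: float = 60.0,  # the text gives 60 seconds as an example
    clock: Callable[[], float] = time.monotonic,
    poll_interval_s: float = 0.05,  # hypothetical polling period
) -> Optional[str]:
    """Return the first utterance detected inside the window, or None if the window expires."""
    while clock() - dialogue_end <= window_s:  # step S1: still within the fixed time?
        utterance = poll_utterance()           # step S2: has an utterance been detected?
        if utterance is not None:
            return utterance                   # Yes in S2: processing continues with step S3
        time.sleep(poll_interval_s)            # No in S2: back to step S1
    return None                                # No in S1: processing ends
```

Passing the clock as a parameter is a design choice made here so that the loop can be exercised without waiting for real time to pass.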

In step S3, the determination unit 133 determines whether the detected utterance of the user 6 begins with the wake word. Wake word detection is performed by the wake word detection unit 131, which receives utterance text data from the voice recognition unit 11. If the wake word is included (Yes in step S3), the determination unit 133 advances the processing to step S7. If the wake word is not included (No in step S3), the determination unit 133 advances the processing to step S4.

In step S4, the determination unit 133 calculates a determination value indicating the possibility of continuation of the dialogue. The determination value is high when the possibility of continuation is high and low when the possibility is low. The determination value may be, for example, a number expressed as a percentage; in that case it approaches 100% when the possibility of continuation is high and 0% when the possibility is low. The determination unit 133 obtains the determination value using a trained model (neural network) produced by machine learning such as deep learning. The determination value is obtained by inputting predetermined feature quantities into the trained model. The inputs are at least one feature quantity (intonation or the like) representing the characteristics of the speech of the user 6 recognized by the voice recognition unit 11 during provisional utterance acceptance, and at least one feature quantity (face rotation speed or the like) representing the behavior of the user 6 before and after that utterance. The feature quantities representing the user's behavior are obtained from the image processing unit 12. After obtaining the determination value, the determination unit 133 advances the processing to step S5.
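The mapping from feature quantities to a percentage can be illustrated with a single logistic unit. This is a deliberately simplified stand-in for the trained neural network, not the model of the specification: the function name, the two chosen inputs, and all weights are hypothetical values picked only so that flatter intonation and a faster turn of the face raise the score.

```python
import math


def determination_value(intonation: float, face_rotation_speed: float) -> float:
    """Return a continuation score in percent (0 to 100) from two illustrative features."""
    # Hypothetical weights; a real system would learn these from labelled data.
    w_intonation, w_rotation, bias = -4.0, 0.05, 0.5
    z = w_intonation * intonation + w_rotation * face_rotation_speed + bias
    return 100.0 / (1.0 + math.exp(-z))  # logistic function scaled to a percentage
```

With these made-up weights, a flat utterance accompanied by a quick glance scores close to 100, while a strongly intoned utterance with no head movement scores low.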

In step S5, the determination unit 133 obtains a coefficient intended to make the previously obtained determination value more reliable. To obtain the coefficient, the determination unit 133 determines the domain of the dialogue between the device 1 and the user 6 that took place immediately before the utterance made during provisional utterance acceptance. The dialogue domain may be obtained by processing in the determination unit 133 itself, or it may be obtained by the server device 5, which interprets the intention of the utterance of the user 6. The server device 5 may be configured to classify the dialogue domain each time it analyzes an utterance of the user 6 and to transmit the classified domain to the voice dialogue device 1. The voice dialogue device 1 may be configured to store, in the storage unit 14, the domain of the dialogue with the user 6 sent from the server device 5.

The determination unit 133 obtains, from the determined dialogue domain, a coefficient related to the possibility of continuation of the dialogue. The coefficient may be, for example, a numerical value predetermined for each dialogue domain and stored in advance in the storage unit 14. The coefficient is, for example, greater than zero and no greater than 1.0. Based on experience, a large coefficient is assigned to a domain in which the dialogue is likely to continue, and a small coefficient is assigned to a domain in which the dialogue is unlikely to continue. For example, if utterance A in FIG. 3 is "What is the weather today?", the user 6 tends to continue the dialogue after the voice dialogue device 1 responds, with an utterance such as "Then what about tomorrow?". For this reason, the coefficient is set to a large value when the dialogue domain is weather; for example, the coefficient may be 1.0 in that case. Although the coefficient does not exceed 1.0 in this example, it may be a number greater than 1.0. After obtaining the coefficient, the determination unit 133 advances the processing to step S6.
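A per-domain coefficient stored in advance amounts to a lookup table. In the sketch below only the weather value of 1.0 comes from the text; the other domains, their values, the fallback value, and the names `DOMAIN_COEFFICIENTS` and `domain_coefficient` are hypothetical.

```python
# Coefficient per dialogue domain, as it might be held in the storage unit.
DOMAIN_COEFFICIENTS = {
    "weather": 1.0,  # from the text: weather dialogues tend to continue
    "music": 0.6,    # hypothetical value
    "timer": 0.3,    # hypothetical value
}
DEFAULT_COEFFICIENT = 0.5  # hypothetical fallback for domains not in the table


def domain_coefficient(domain: str) -> float:
    """Return the coefficient for the domain of the immediately preceding dialogue."""
    return DOMAIN_COEFFICIENTS.get(domain, DEFAULT_COEFFICIENT)
```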

In step S6, the determination unit 133 calculates a corrected determination value by multiplying the determination value obtained in step S4 by the coefficient obtained in step S5. The possibility of continuation of the dialogue can thus be evaluated using the speech information of the user 6, the image information of the user 6, and the domain of the immediately preceding dialogue. After calculating the corrected determination value, the determination unit 133 advances the processing to step S7.

In step S7, the determination unit 133 compares the corrected determination value obtained in step S6 with a threshold prepared in advance. By comparing the numerical value indicating the possibility of continuation with the threshold, the determination unit 133 determines whether the dialogue with the user 6 is continuing. If the corrected determination value is equal to or greater than the threshold (Yes in step S7), the determination unit 133 determines that the possibility of continuation is high and that the dialogue with the user 6 is continuing. In this case, the processing advances to step S8. If the corrected determination value is smaller than the threshold (No in step S7), the determination unit 133 determines that the possibility of continuation is low and that the utterance of the user 6 is not directed at the device 1. In this case, the processing advances to step S9.

In step S8, the response processing unit 132 performs response processing for the utterance of the user 6 detected during provisional utterance acceptance. In step S9, the response processing unit 132 decides to ignore the utterance of the user 6 detected during provisional utterance acceptance. Once the decision in step S9 is made, the response processing by the response processing unit 132 is not carried out. That is, the voice dialogue device 1 does not respond to the user's utterance.
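Steps S6 to S9 reduce to one multiplication and one comparison. The sketch below is a worked illustration with made-up numbers; the function name `should_respond` and the threshold of 50 used in the example are assumptions, since the specification does not give a threshold value.

```python
def should_respond(determination_value: float, coefficient: float, threshold: float) -> bool:
    """Steps S6 and S7: correct the determination value, then compare it with the threshold."""
    corrected = determination_value * coefficient  # step S6: corrected determination value
    return corrected >= threshold                  # step S7: True leads to S8, False to S9
```

For example, with a determination value of 80 and a hypothetical threshold of 50, a weather dialogue (coefficient 1.0) gives a corrected value of 80 and the device responds, whereas a domain with a hypothetical coefficient of 0.3 gives about 24 and the utterance is ignored.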

With the configuration of the present embodiment, the user 6 can continue the dialogue without saying the wake word as long as the utterance is made within a fixed time after the end of the dialogue with the voice dialogue device 1, which improves convenience for the user. In addition, because the voice dialogue device 1 uses multiple kinds of information, such as voice information, image information, and the content of the dialogue, to determine whether the user's utterance is directed at the device 1 before responding, it can realize a dialogue that feels close to a conversation with a person.

When obtaining the determination value or the coefficient, the determination unit 133 may take into account the response state of the device 1 in the immediately preceding dialogue with the user. That is, the determination unit 133 may additionally use the response state of the device 1 in the immediately preceding dialogue with the user 6 to determine the possibility of continuation of the dialogue. The voice dialogue device 1 may either succeed or fail in responding to an utterance of the user 6. Failure to respond to the user 6 includes cases where the intention of the utterance could not be understood and cases where the intention was understood but no search result was obtained; in such cases the device gives a response such as "I did not understand."

If the response failed in the immediately preceding dialogue, the user is more likely to speak to the voice dialogue device 1 again. For this reason, when the device 1 failed to respond in the immediately preceding dialogue, the processing may be configured so that the determination value or the coefficient becomes larger. This configuration adds a further indicator for judging the possibility of continuation, so that whether the utterance of the user 6 is directed at the device 1 can be determined more appropriately.
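One simple way to make the coefficient larger after a failed response is to scale it. This is a sketch of one possible realization only; the function name and the boost factor of 1.5 are hypothetical, and the specification does not say how the increase is computed.

```python
def adjusted_coefficient(base: float, previous_response_failed: bool, boost: float = 1.5) -> float:
    """Raise the domain coefficient when the device failed to respond in the preceding dialogue."""
    # The boosted result may exceed 1.0, which the text explicitly allows for the coefficient.
    return base * boost if previous_response_failed else base
```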

In addition, within the fixed time after the end of a dialogue (the provisional utterance acceptance state), the dialogue may be continued without the wake word being spoken. The voice dialogue device 1 or the voice dialogue system 100 may be configured to notify the user 6 that it is in this state. The notification may be given by, for example, an image display or light emission. This configuration spares the user 6 from saying the wake word unnecessarily.

<3. Modification example>
FIG. 5 is a flowchart showing a modification of the operation of the voice dialogue device 1 in the provisional utterance acceptance state. The flowchart of FIG. 5 differs from the flowchart of FIG. 4 in that the processing of step S10 is added between step S6 and step S7. In step S10, processing to adjust the threshold is performed. That is, in the modification, the threshold compared with the determination value (strictly, the corrected determination value in this example) is not a constant but is changed as appropriate. Two examples of configurations in which the threshold is changed (a first modification and a second modification) are described below.

(3-1. First modification)
In the first modification, the threshold is changed according to the time elapsed since the end of the immediately preceding dialogue with the user. The end of the dialogue with the user coincides with the start of the provisional utterance acceptance state described above.

FIG. 6 is a flowchart showing a first detailed example of the threshold adjustment processing shown in FIG. 5. In step S101, the determination unit 133 determines whether the current time is within a first time from the end of the dialogue with the user. The first time is shorter than the fixed time of step S1 shown in FIG. 5. If the current time is within the first time (Yes in step S101), the determination unit 133 advances the processing to step S102. If the first time has elapsed (No in step S101), the determination unit 133 advances the processing to step S103.

In step S102, the determination unit 133 keeps the threshold at its initial value. After completing step S102, the determination unit 133 performs the processing of step S7 shown in FIG. 5. That is, the threshold kept at its initial value is compared with the corrected determination value to determine whether the dialogue is continuing.

In step S103, the determination unit 133 determines whether the current time is within a second time from the end of the dialogue with the user. The second time is shorter than the fixed time of step S1 shown in FIG. 5 and longer than the first time. If the current time is within the second time (Yes in step S103), the determination unit 133 advances the processing to step S104. If the second time has elapsed (No in step S103), the determination unit 133 advances the processing to step S105.

In step S104, the determination unit 133 changes the threshold from the initial value to a first threshold. The first threshold is larger than the initial value. After completing step S104, the determination unit 133 performs the processing of step S7 shown in FIG. 5. That is, the threshold raised above the initial value (the first threshold) is compared with the corrected determination value to determine whether the dialogue is continuing. Because the first threshold is larger than the initial value, the change to the first threshold makes a determination that the dialogue is continuing less likely.

In step S105, the determination unit 133 changes the threshold from the initial value to a second threshold. The second threshold is larger than the first threshold. After completing step S105, the determination unit 133 performs the processing of step S7 shown in FIG. 5. That is, the threshold raised above both the initial value and the first threshold (the second threshold) is compared with the corrected determination value to determine whether the dialogue is continuing. The change to the second threshold makes a determination that the dialogue is continuing even less likely.

With the configuration of this modification, the stepwise change of the threshold lowers, in stages, the likelihood of determining that the dialogue is continuing as the time elapsed since the end of the dialogue with the user 6 increases. The longer the time since the dialogue with the user 6 ended, the less likely the dialogue usually is to continue. This configuration can therefore reduce the likelihood of a malfunction in which the device responds to the user 6 even though the user 6 has not requested a response.

Although three threshold levels are provided in this modification, this is merely an example. The number of selectable thresholds may be changed as appropriate.
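The three-level adjustment of FIG. 6 can be sketched as a function of the elapsed time. All numerical values below are hypothetical; the specification fixes only the ordering (first time shorter than second time, both shorter than the fixed time of step S1, and initial value smaller than first threshold smaller than second threshold).

```python
def threshold_for_elapsed(
    elapsed_s: float,
    initial: float = 50.0,         # hypothetical initial threshold
    first: float = 65.0,           # hypothetical first threshold (larger than initial)
    second: float = 80.0,          # hypothetical second threshold (larger than first)
    first_time_s: float = 20.0,    # hypothetical first time
    second_time_s: float = 40.0,   # hypothetical second time (longer than first time)
) -> float:
    """Return the threshold to use in step S7, given the time since the dialogue ended."""
    if elapsed_s <= first_time_s:   # Yes in S101: S102 keeps the initial value
        return initial
    if elapsed_s <= second_time_s:  # Yes in S103: S104 switches to the first threshold
        return first
    return second                   # No in S103: S105 switches to the second threshold
```

Because the returned threshold never decreases as the elapsed time grows, an utterance with the same corrected determination value becomes harder to accept the later it arrives.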

(3-2. Second modification)
In the second modification, when an utterance of the user 6 is detected at a specific timing, the determination unit 133 changes the threshold so that the dialogue is more readily determined to be continuing. The specific timing is a timing at which the user 6 is highly likely to continue the dialogue, and is determined based on experience or the like. Examples of the specific timing include the time after the device 1 has failed to respond to the user 6 and the time after a dialogue belonging to a domain in which the dialogue is likely to continue.

FIG. 7 is a flowchart showing a second detailed example of the threshold adjustment processing shown in FIG. 5. In step S106, the determination unit 133 determines whether the user's utterance was detected at the specific timing. If the detection falls at the specific timing (Yes in step S106), the determination unit 133 advances the processing to step S107. If it does not (No in step S106), the determination unit 133 advances the processing to step S108.

In step S107, the determination unit 133 changes the threshold from the initial value to a third threshold. The third threshold is smaller than the initial value. After completing step S107, the determination unit 133 performs the processing of step S7 shown in FIG. 5. That is, the threshold lowered below the initial value (the third threshold) is compared with the corrected determination value to determine whether the dialogue is continuing. Because the third threshold is smaller than the initial value, the change to the third threshold makes a determination that the dialogue is continuing more likely.

In step S108, the determination unit 133 keeps the threshold at its initial value. After completing step S108, the determination unit 133 performs the processing of step S7 shown in FIG. 5. That is, the threshold kept at its initial value is compared with the corrected determination value to determine whether the dialogue is continuing.

With the configuration of this modification, at a timing when the dialogue with the user is likely to continue, the voice dialogue device 1 more readily determines that the dialogue is continuing, and can therefore respond appropriately to the utterance of the user 6.
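The adjustment of FIG. 7 can be sketched in the same style. The two conditions for the specific timing follow the examples given in the text; the set of follow-up domains, the threshold values, and the names `FOLLOW_UP_DOMAINS` and `threshold_for_timing` are hypothetical.

```python
# Domains assumed, for this sketch only, to be ones in which dialogues tend to continue.
FOLLOW_UP_DOMAINS = {"weather"}


def threshold_for_timing(
    previous_response_failed: bool,
    previous_domain: str,
    initial: float = 50.0,  # hypothetical initial threshold
    third: float = 35.0,    # hypothetical third threshold (smaller than initial)
) -> float:
    """Return the threshold to use in step S7 for the second modification."""
    # Step S106: does the detection fall at a specific timing?
    specific_timing = previous_response_failed or previous_domain in FOLLOW_UP_DOMAINS
    # Step S107 lowers the threshold; step S108 keeps the initial value.
    return third if specific_timing else initial
```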

<4. Points to note>
In addition to the embodiments described above, the various technical features disclosed in this specification can be modified in various ways without departing from the gist of the technical creation. That is, the above embodiments should be considered illustrative in all respects and not restrictive. The technical scope of the present invention is indicated by the claims rather than by the description of the above embodiments, and should be understood to include all modifications that fall within the meaning and scope equivalent to the claims. The embodiments and modifications shown in this specification may also be combined as appropriate to the extent possible.

In the above description, the voice dialogue device 1 is configured to determine the possibility of continuation of the dialogue using multiple kinds of information, namely the domain of the immediately preceding dialogue, the speech information of the user 6, and the image information of the user 6, but this is merely an example. For example, the voice dialogue device may be configured to determine the possibility of continuation using only the domain of the immediately preceding dialogue; in that case, a numerical value indicating the possibility of continuation may be obtained from that domain. As another example, the voice dialogue device may be configured to determine the possibility of continuation using only the domain of the immediately preceding dialogue and the user's speech information. As yet another example, the voice dialogue device may be configured to determine the possibility of continuation using only the domain of the immediately preceding dialogue and the user's image information.

Further, instead of correcting the determination value according to the domain of the dialogue, the threshold value to be compared with the determination value may be changed according to the domain of the dialogue. This configuration is also included among the configurations that use the domain of the dialogue to determine the possibility of continuation of the dialogue.
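The threshold-based alternative can be sketched as follows, assuming the determination value has already been computed from other information and is passed in unchanged. Only the threshold depends on the domain of the immediately preceding dialogue. The domain names, the threshold values, and the function names are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical per-domain thresholds (illustrative values only).
# A lower threshold makes it easier to judge that the dialogue is continuing.
DOMAIN_THRESHOLD = {
    "navigation": 0.3,  # assumed: follow-ups are common, so be lenient
    "weather": 0.6,     # assumed: follow-ups are rare, so be strict
}
BASE_THRESHOLD = 0.5    # assumed threshold for domains not in the table


def threshold_for(previous_domain):
    """Return the threshold selected by the preceding dialogue's domain."""
    return DOMAIN_THRESHOLD.get(previous_domain, BASE_THRESHOLD)


def is_continuing_with_domain_threshold(determination_value, previous_domain):
    """Compare an uncorrected determination value with a domain-dependent threshold."""
    return determination_value >= threshold_for(previous_domain)
```

Under these assumptions, the same determination value of 0.4 is judged as a continuation after a "navigation" dialogue but not after a "weather" dialogue, which mirrors the effect of correcting the determination value itself.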

1 ... Voice dialogue device
2 ... Microphone
3 ... Speaker
4 ... Camera
6 ... User
11 ... Detection unit
100 ... Voice dialogue system
132 ... Response processing unit
133 ... Determination unit

Claims

A voice dialogue device comprising:
a detection unit that detects an utterance of a user;
a determination unit that, when the utterance of the user is detected, determines the possibility of continuation of a dialogue with the user; and
a response processing unit that responds to the utterance of the user when the determination unit determines that the dialogue is continuing,
wherein the determination unit uses the domain of the dialogue with the user performed immediately before the detected utterance of the user to determine the possibility of continuation of the dialogue.

The voice dialogue device according to claim 1, wherein the determination unit further uses the response state of the device itself in the dialogue with the user in determining the possibility of continuation of the dialogue.

The voice dialogue device according to claim 1 or 2, wherein the determination unit further utilizes the characteristics of the voice extracted from the voice of the user in determining the possibility of continuation of the dialogue.

The voice dialogue device according to any one of claims 1 to 3, wherein the determination unit further utilizes information obtained from the captured image of the user in determining the possibility of continuation of the dialogue.

3. The voice dialogue device according to any one of the following items.

The voice dialogue device according to any one of claims 1 to 5, wherein the determination unit compares a numerical value indicating the possibility of continuation of the dialogue with a threshold value to determine whether or not the dialogue with the user is continuing, and
the threshold value is changed according to the elapsed time from the end of the dialogue with the user performed immediately before.

The voice dialogue device according to any one of claims 1 to 6, wherein the determination unit compares a numerical value indicating the possibility of continuation of the dialogue with a threshold value to determine whether or not the dialogue with the user is continuing, and
when the utterance of the user is detected at a specific timing, the determination unit changes the threshold value so that the dialogue is more easily determined to be continuing.

The voice dialogue device according to any one of claims 1 to 7, wherein the determination unit determines the possibility of continuation of the dialogue when the utterance of the user is detected without the user having performed a predetermined dialogue start operation.

A voice dialogue system comprising:
the voice dialogue device according to any one of claims 1 to 8;
a microphone that converts the voice of the user into a voice signal and outputs the voice signal to the voice dialogue device; and
a speaker that converts a voice signal output from the voice dialogue device into voice and emits the voice toward the user.

The voice dialogue system according to claim 9, further comprising a camera that captures an image of the user and outputs information on the captured image to the voice dialogue device.

A voice dialogue method performed in a voice dialogue device, the method comprising:
a detection step of detecting an utterance of a user;
a determination step of determining, when the utterance of the user is detected, the possibility of continuation of a dialogue with the user; and
a response processing step of responding to the utterance of the user when it is determined in the determination step that the dialogue is continuing,
wherein the domain of the dialogue with the user performed immediately before the detected utterance of the user is used to determine the possibility of continuation of the dialogue.