JP2018017776A

JP2018017776A - Voice interactive device

Info

Publication number: JP2018017776A
Application number: JP2016145635A
Authority: JP
Inventors: 美奈結城; Mina Yuki; 真太郎吉澤; Shintaro Yoshizawa; 智哉高谷; Tomoya Takatani
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2016-07-25
Filing date: 2016-07-25
Publication date: 2018-02-01

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive device capable of speaking in real time in a manner that accounts for individual differences between people. SOLUTION: The voice interactive device has a speech reception part for recognizing a person's speech and a speech part for speaking. The speech part speaks when a silent time, which is the time during which the speech reception part receives no speech after the speech part speaks, exceeds a predetermined standby time. The voice interactive device also has: a storage part for storing a response time, measured from when the speech part speaks until the person responds and the speech reception part receives that speech; an estimation part for calculating the average and variance of the response times stored by the storage part, creating a gamma distribution relating response time to speech probability from the calculated average and variance, and using the created gamma distribution to estimate a silent time at which the person can be judged to have no intention to speak; and a speech control part for updating the standby time in the speech part by setting the silent time estimated by the estimation part as the new standby time. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice interaction device that converses with a person.

In recent years, voice interaction devices such as dialogue robots that converse with people have been put into practical use. However, when a voice interaction device converses with a person, both the person and the device may fall silent, and if such silences occur frequently, the conversation is interrupted and the atmosphere becomes awkward.
For this reason, a technique has recently been proposed in which the voice interaction device speaks when the silent time exceeds a predetermined waiting time (for example, Patent Document 1).

Patent Document 1: JP 2010-152119 A

However, in a technique such as that of Patent Document 1, in which the voice interaction device speaks when the silent time exceeds a predetermined waiting time, the waiting time is fixed in advance and cannot be changed to account for individual differences.
In addition, when a person's speaking tendency changes over time (for example, the response time becomes shorter as the person gets used to the voice interaction device), the waiting time cannot be changed to follow that change.

The present invention has been made in view of the above, and its object is to provide a voice interaction device capable of speaking in real time in a manner that accounts for individual differences between people.

A voice interaction device according to one aspect of the present invention is
a voice interaction device including an utterance reception unit that recognizes a person's utterance and an utterance unit that speaks, the utterance unit speaking when a silent time, which is the time during which the utterance reception unit has received no utterance since the utterance unit spoke, exceeds a predetermined waiting time, the voice interaction device further including:
a storage unit that stores a response time from when the utterance unit speaks until a person responds and the utterance reception unit receives the utterance;
an estimation unit that calculates the average and variance of the response times stored in the storage unit, creates a gamma distribution relating response time to utterance probability using the calculated average and variance, and uses the created gamma distribution to estimate a silent time at which the person can be judged to have no intention to speak; and
an utterance control unit that updates the waiting time in the utterance unit, using the silent time estimated by the estimation unit as the new waiting time.

According to the aspect of the present invention described above, it is possible to provide a voice interaction device capable of speaking in real time in a manner that accounts for individual differences between people.

FIG. 1 is a block diagram showing a configuration example of the voice interaction device according to the embodiment.
FIG. 2 is a diagram illustrating an operation example of the measurement unit according to the embodiment.
FIG. 3 is a diagram illustrating an operation example of the estimation unit according to the embodiment.
FIG. 4 is a flowchart illustrating an operation example after the utterance unit speaks in the voice interaction device according to the embodiment.

Embodiments of the present invention will be described below.
<Outline of the embodiment>
First, an outline of the voice interaction device according to the present embodiment will be described.
A voice interaction device such as a dialogue robot is often designed so that, when a person speaks, the device responds to that utterance. Therefore, when a person speaks, either no silent time occurs or it is very short, so the dialogue is unlikely to be interrupted.

On the other hand, when the voice interaction device speaks, it is then the person who responds, so the silent time (that is, the time during which the voice interaction device has received no utterance from the person since the device spoke) may become long. Therefore, when the silent time that has elapsed since the voice interaction device spoke exceeds the waiting time, the device judges that the person has no intention to speak and speaks itself, which prevents the dialogue from being interrupted.

However, the silent time at which a person can be judged to have no intention to speak differs between individuals, and is also thought to change in real time as the person's speaking tendency changes (for example, the response time becomes shorter as the person gets used to the voice interaction device).

Therefore, the voice interaction device according to the present embodiment, which is configured to speak when the silent time elapsed since it last spoke exceeds the waiting time, changes that waiting time in real time to account for individual differences between people.

<Configuration of the embodiment>
Next, the configuration of the voice interaction device 1 according to the present embodiment will be described with reference to FIG. 1. As shown in FIG. 1, the voice interaction device 1 is a device that converses with a user 2, who is a person, and is, for example, a dialogue robot. The voice interaction device 1 includes an utterance unit 11, an utterance reception unit 12, a measurement unit 13, a filter unit 14, an estimation unit 15, and an utterance control unit 16. The filter unit 14 is an example of a storage unit. FIG. 1 shows only the essential components of the present invention; other components are omitted.

The utterance unit 11 speaks to the user 2 by outputting voice through a speaker (not shown).
The utterance reception unit 12 recognizes the utterance of the user 2 by receiving the voice uttered by the user 2 through a microphone (not shown).

The measurement unit 13 measures the response time, which is the time from when the utterance unit 11 speaks until the user 2 responds and the utterance reception unit 12 receives the utterance. More precisely, the response time of the user 2 is the time from when the utterance unit 11 finishes speaking until the utterance reception unit 12 receives the voice uttered by the user 2 in response. In the example of FIG. 2, the utterance unit 11 finishes uttering "That's true" at time t1, and the utterance reception unit 12 receives the reply of the user 2, "I'm feeling well, too," at time t2. The response time of the user 2 is therefore the difference between time t1 and time t2, which is 1 second.
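The measurement described above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the class name `ResponseTimer`, its method names, and the injectable `clock` parameter are assumptions introduced here.

```python
import time


class ResponseTimer:
    """Measures the gap between the end of the device's utterance (t1)
    and the moment the user's reply is received (t2)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable clock (hypothetical hook for testing)
        self._t1 = None

    def device_finished_speaking(self):
        # Record t1: the utterance unit has just finished speaking.
        self._t1 = self._clock()

    def user_reply_received(self):
        # Record t2 and return the response time t2 - t1 in seconds.
        if self._t1 is None:
            raise RuntimeError("no device utterance to measure against")
        response_time = self._clock() - self._t1
        self._t1 = None
        return response_time
```

A monotonic clock is used so that wall-clock adjustments during a dialogue do not distort the measured interval.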

The filter unit 14 filters the response time data (the data on the response time of the user 2 measured by the measurement unit 13), removes outlier response time data, and stores the remaining response time data. For example, when the voice interaction device 1 has been left unattended, or when the user 2 thinks carefully before responding, the response time of the user 2 becomes very long. The filter unit 14 removes such response time data as outliers.

The filter unit 14 stores the N most recent pieces of response time data and, when the number exceeds N, discards the oldest piece. As described later, the estimation unit 15 uses the response time data to calculate the variance and average of the response time of the user 2 and to estimate the silent time (the time during which the utterance reception unit 12 has received no utterance since the utterance unit 11 spoke) at which the user 2 can be judged to have no intention to speak. In general, at least three pieces of data are needed to calculate a variance and an average. N is therefore an integer of 3 or more. A larger N raises the estimation accuracy but keeps old data around, whereas a smaller N tracks the current situation more readily but lowers the estimation accuracy. N should therefore be set to a suitable value of 3 or more, weighing responsiveness to the current situation against estimation accuracy.
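The storage behaviour described above (keep the N most recent values, drop the oldest) maps naturally onto a bounded queue. The following is a sketch under that assumption; the class name `ResponseTimeStore` and its methods are hypothetical.

```python
from collections import deque


class ResponseTimeStore:
    """Keeps only the N most recent response times."""

    def __init__(self, capacity):
        if capacity < 3:
            raise ValueError("capacity N must be an integer of 3 or more")
        # A deque with maxlen silently discards the oldest item when full.
        self._samples = deque(maxlen=capacity)

    def add(self, response_time):
        self._samples.append(response_time)

    def samples(self):
        return list(self._samples)

    def can_estimate(self):
        # The text requires at least three samples before estimating.
        return len(self._samples) >= 3
```

Choosing `capacity` is the trade-off named in the text: a larger value smooths the estimate, a smaller value follows recent behaviour more closely.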

Any method can be used for the filter unit 14 to remove outlier response time data. For example, the filter unit 14 may use a median filter and remove response time data that lies far from the median. Alternatively, the filter unit 14 may use a filter based on the IQR (interquartile range) and remove response time data that falls outside the IQR, in the whisker portions of a box plot. An IQR-based filter can be used even when the data is not normally distributed, and has the advantages of low computational cost and fast response. Alternatively, the measurement unit 13 may attach a label to the response time data when measuring the response time, and the filter unit 14 may remove response time data carrying a specific label. For example, the measurement unit 13 may attach the specific label to response time data for which an image captured by a camera (not shown) indicates that the user 2 thought carefully before responding. The measurement unit 13 may also recognize the voice of the user 2 and attach the specific label to response time data for speech that contains no independent words. The measurement unit 13 may also attach the specific label to response time data obtained while waiting for the voice interaction device 1 to finish an operation.
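As an illustration of the IQR-based option, the sketch below drops values outside box-plot fences. The function name `remove_outliers` and the fence multiplier `k` are assumptions: the text does not give a multiplier, so the conventional box-plot value of 1.5 is used as the default, while `k=0` keeps only values between the first and third quartiles, which is the more literal reading of the text.

```python
from statistics import quantiles


def remove_outliers(samples, k=1.5):
    """Keep samples inside [Q1 - k*IQR, Q3 + k*IQR]; order is preserved.

    k=1.5 is the usual box-plot convention (an assumption here);
    k=0 keeps only values between the first and third quartiles.
    """
    if len(samples) < 4:
        return list(samples)          # too few points for meaningful quartiles
    q1, _, q3 = quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [s for s in samples if low <= s <= high]
```

With response times clustered around one second, a single 30-second pause (for example, a device left unattended) falls far above the upper fence and is dropped.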

The estimation unit 15 first calculates the average and variance of the response time of the user 2 using the N pieces of response time data stored in the filter unit 14. Next, using the calculated average and variance, the estimation unit 15 creates a gamma distribution relating the response time of the user 2 to the utterance probability, that is, the probability that the user 2 speaks at that response time. The gamma distribution is an example of a parametric model. The estimation unit 15 then uses the created gamma distribution to estimate the silent time at which the user 2 can be judged to have no intention to speak.
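The text does not state how the gamma distribution is obtained from the average and variance. One common choice, assumed here purely for illustration, is the method of moments. Writing $\mu$ and $\sigma^{2}$ for the sample average and variance, $k$ and $\theta$ for the shape and scale of the gamma distribution, $p$ for the utterance probability threshold, and $T$ for the estimated silent time (all symbols introduced here, not taken from the source), the relations would be:

```latex
k = \frac{\mu^{2}}{\sigma^{2}}, \qquad
\theta = \frac{\sigma^{2}}{\mu}, \qquad
F(T; k, \theta) = \frac{1}{\Gamma(k)} \int_{0}^{T/\theta} x^{k-1} e^{-x}\, dx = p
```

These follow from the gamma distribution having mean $k\theta$ and variance $k\theta^{2}$; $T$ is then the point at which the cumulative utterance probability reaches $p$.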

For example, suppose the estimation unit 15 has created a gamma distribution like the one in FIG. 3. FIG. 3 is an example with 15 pieces of response time data and an utterance probability threshold of 95%. In the example of FIG. 3, the probability that the user 2 speaks within a response time of 2.4 seconds reaches the 95% threshold. In other words, if the silent time elapsed since the utterance unit 11 spoke is 2.4 seconds or more, it can be judged with 95% probability that the user 2 has no intention to speak. The estimation unit 15 therefore estimates the silent time at which the user 2 can be judged to have no intention to speak to be 2.4 seconds. The number of pieces of response time data and the utterance probability threshold are not limited to the values in FIG. 3 and may take other values.
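The estimate in this example can be sketched in code as the point where the cumulative probability of the fitted gamma distribution reaches the threshold. The sketch assumes a method-of-moments fit (the text does not specify the fitting procedure), and the function names `fit_gamma`, `gamma_cdf`, and `estimate_wait_time` are hypothetical. The cumulative distribution is evaluated with a plain power series that is adequate for moderate shape values; it is not a general-purpose implementation.

```python
import math
from statistics import fmean, variance


def fit_gamma(samples):
    """Method-of-moments fit (an assumption): returns (shape, scale)."""
    if len(samples) < 3:
        raise ValueError("need at least 3 response times")
    mean = fmean(samples)
    var = variance(samples)            # sample variance, n - 1 denominator
    if mean <= 0 or var <= 0:
        raise ValueError("mean and variance must be positive")
    return mean * mean / var, var / mean


def gamma_cdf(t, shape, scale):
    """Regularized lower incomplete gamma P(shape, t/scale) by power series."""
    if t <= 0:
        return 0.0
    x = t / scale
    term = 1.0 / shape                 # n = 0 term of the series
    total = term
    n = 1
    while term > total * 1e-15:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))


def estimate_wait_time(samples, threshold=0.95):
    """Silent time at which the cumulative utterance probability hits threshold."""
    shape, scale = fit_gamma(samples)
    lo, hi = 0.0, shape * scale        # start the search at the mean
    while gamma_cdf(hi, shape, scale) < threshold:
        hi *= 2.0                      # expand until the threshold is bracketed
    for _ in range(60):                # bisection on the monotone CDF
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, shape, scale) < threshold:
            lo = mid
        else:
            hi = mid
    return hi
```

The default `threshold=0.95` mirrors the 95% value in the FIG. 3 example; as the text notes, other values may be used.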

The utterance control unit 16 controls the utterance timing of the utterance unit 11. Specifically, when the user 2 speaks, the utterance control unit 16 controls the utterance unit 11 to speak in response. When the utterance unit 11 has spoken, the utterance control unit 16 controls the utterance unit 11 to speak again if the current silent time elapsed since that utterance exceeds the waiting time.

At the start of a dialogue with the user 2, no response time data is stored in the filter unit 14, so the estimation unit 15 cannot estimate the silent time. The utterance control unit 16 therefore uses a predetermined waiting time as the waiting time in the utterance unit 11 at the start of the dialogue. Once the dialogue has progressed and three or more pieces of response time data are stored in the filter unit 14, the estimation unit 15 can estimate the silent time. From then on, each time the estimation unit 15 estimates the silent time, the utterance control unit 16 updates the waiting time in the utterance unit 11, using the estimated silent time as the new waiting time.

The utterance control unit 16 also controls the content of what the utterance unit 11 says. For example, when making the utterance unit 11 speak because the current silent time has exceeded the waiting time, the utterance control unit 16 has it say something like "By the way, did you go anywhere over the last long weekend?" to introduce a new topic. How the utterance content is chosen is not an essential part of the present invention, and the content can be determined by known methods, so a detailed description is omitted.

Each component of the voice interaction device 1 can be realized, for example, by a processor (not shown) and a memory (not shown) of the voice interaction device 1, which is a computer. Specifically, each component can be realized by the processor reading software (a program) from the memory and executing it. The components are not limited to being realized by software; they may also be realized by any combination of hardware, firmware, and software.

The program described above can be stored on various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memory (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (random access memory)).

The program may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path, such as an electric wire or optical fiber, or via a wireless communication path.

<Operation of the embodiment>
Next, the operation of the voice interaction device 1 according to the present embodiment after the utterance unit 11 speaks will be described with reference to FIG. 4.
As shown in FIG. 4, the utterance control unit 16 first determines whether the utterance unit 11 has spoken (step S1). The utterances of the utterance unit 11 in step S1 include an utterance at the start of the dialogue, an utterance in response to an utterance of the user 2, and an utterance made because the silent time elapsed since the previous utterance of the utterance unit 11 exceeded the waiting time.

When the utterance unit 11 has spoken in step S1 (Yes in step S1), the utterance control unit 16 determines whether the user 2 has responded to the utterance of the utterance unit 11 (step S2). If the user 2 has not responded (No in step S2), the utterance control unit 16 then determines whether the current silent time elapsed since the utterance of the utterance unit 11 has exceeded the waiting time (step S3). At the start of the dialogue with the user 2, this waiting time is the predetermined waiting time; once the dialogue has progressed and the estimation unit 15 can estimate the silent time, it is the new waiting time updated with the silent time estimated by the estimation unit 15. If the current silent time has not exceeded the waiting time in step S3 (No in step S3), the process returns to step S2. That is, steps S2 and S3 are repeated until either the current silent time exceeds the waiting time or the user 2 responds within the waiting time.

If the current silent time has exceeded the waiting time in step S3 (Yes in step S3), the utterance control unit 16 judges that the user 2 has no intention to speak and controls the utterance unit 11 to speak (step S4). After step S4, the process returns to step S1.

If the user 2 has responded to the utterance of the utterance unit 11 in step S2 (Yes in step S2), the measurement unit 13 measures the response time from when the utterance unit 11 spoke until the user 2 responded and the utterance reception unit 12 received the utterance (step S5). Next, the filter unit 14 filters the response time data of the user 2 measured by the measurement unit 13, removes outlier response time data, and stores the remaining response time data (step S6). Next, the estimation unit 15 calculates the average and variance of the response time of the user 2 using the response time data stored in the filter unit 14, creates a gamma distribution relating the response time of the user 2 to the utterance probability using the calculated average and variance, and uses the created gamma distribution to estimate the silent time at which the user 2 can be judged to have no intention to speak. The utterance control unit 16 then updates the waiting time in the utterance unit 11, using the silent time estimated by the estimation unit 15 as the new waiting time (step S7). If fewer than three pieces of response time data are stored in the filter unit 14 in step S7, the estimation unit 15 cannot estimate the silent time. In that case, the utterance control unit 16 does not update the waiting time in the utterance unit 11 and keeps the current waiting time. After step S7, the process returns to step S1.
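The flow of steps S2 to S7 can be sketched as a polling loop plus a per-turn update. This is an illustrative outline only: the function names `wait_for_reply` and `dialogue_turn`, the `state` dictionary, the callbacks `reply_received`, `speak`, and `estimate`, and the 10 ms polling interval are all assumptions, and the outlier filtering of step S6 is left out for brevity.

```python
import time


def wait_for_reply(wait_time, reply_received, clock=time.monotonic, sleep=time.sleep):
    """Steps S2 and S3: poll until the user replies or the silent time
    exceeds wait_time. Returns the response time in seconds, or None."""
    start = clock()                      # the device has just finished speaking
    while True:
        elapsed = clock() - start        # current silent time
        if reply_received():             # S2: did the user respond?
            return elapsed               # S5: this is the measured response time
        if elapsed > wait_time:          # S3: waiting time exceeded?
            return None
        sleep(0.01)                      # polling interval (arbitrary choice)


def dialogue_turn(state, reply_received, speak, estimate, **timing):
    """One pass through S2 to S7. `state` holds 'wait_time' and 'samples'."""
    response_time = wait_for_reply(state["wait_time"], reply_received, **timing)
    if response_time is None:
        speak()                          # S4: the device speaks again
        return
    state["samples"].append(response_time)       # S6, outlier filtering omitted
    if len(state["samples"]) >= 3:                # S7: update only when estimable
        state["wait_time"] = estimate(state["samples"])
```

Passing `estimate` in as a callback keeps the loop independent of how the silent time is estimated; with fewer than three samples the current waiting time is simply kept, as the text describes.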

The above is the operation of the voice interaction device 1 after the utterance unit 11 speaks. When the user 2 speaks, the voice interaction device 1 speaks in response to that utterance. However, the operation after the user 2 speaks is not an essential part of the present invention and known operations can be applied, so a detailed description is omitted.

<Effects of the embodiment>
As described above, the voice interaction device 1 according to the present embodiment stores the response time from when the utterance unit 11 speaks until the user 2 responds and the utterance is received, creates a gamma distribution relating response time to utterance probability using the average and variance of the response time of the user 2, uses the created gamma distribution to estimate the silent time at which the person can be judged to have no intention to speak, and updates the waiting time in the utterance unit 11, using the estimated silent time as the new waiting time.

Because the waiting time in the utterance unit 11 is updated according to the response time of the user 2 in this way, the waiting time can be changed in real time to account for individual differences between people. This makes it possible to provide a voice interaction device 1 capable of speaking in real time in a manner that accounts for individual differences.

In addition, because the device can speak in real time in a manner that accounts for individual differences, interruptions in the dialogue with the user 2 can be avoided and the dialogue can be kept going.
Furthermore, because a gamma distribution, which is a parametric model, is used to estimate the silent time at which a person can be judged to have no intention to speak, the silent time can be estimated from a small amount of response time data (as few as three pieces). This shortens the time from the start of the dialogue until the silent time can be estimated.

In addition, because outlier response time data is filtered out, even if a response time happens to be long, for example because the user 2 thought carefully before responding, the silent time can be estimated with such data excluded. This prevents the estimation accuracy of the silent time from deteriorating.

<Modifications of the embodiment>
The present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit.
For example, in the above embodiment the waiting time in the utterance unit 11 at the start of a dialogue with the user 2 is a preset waiting time, but the present invention is not limited to this. For example, if a dialogue with the user 2 is started with the response time data from the previous dialogue with the user 2 already stored in the filter unit 14, the estimation unit 15 can estimate the silent time from the start of the dialogue. As a result, the silent time estimated by the estimation unit 15 can be used as the waiting time from the start of the dialogue.

In the above embodiment, the estimation unit 15 uses a gamma distribution, which is an example of a parametric model, to estimate the silent time at which the user 2 can be judged to have no intention to speak, but the present invention is not limited to this. The estimation unit 15 may use a model having skewness instead of the parametric model. Skewness is an index of the asymmetry of a distribution.

DESCRIPTION OF SYMBOLS
1 Voice interaction device
11 Utterance unit
12 Utterance reception unit
13 Measurement unit
14 Filter unit
15 Estimation unit
16 Utterance control unit
2 User

Claims

A voice interaction device including an utterance reception unit that recognizes a person's utterance and an utterance unit that speaks, the utterance unit speaking when a silent time, which is the time during which the utterance reception unit has received no utterance since the utterance unit spoke, exceeds a predetermined waiting time, the voice interaction device comprising:
a storage unit that stores a response time from when the utterance unit speaks until a person responds and the utterance reception unit receives the utterance;
an estimation unit that calculates the average and variance of the response times stored in the storage unit, creates a gamma distribution relating response time to utterance probability using the calculated average and variance, and uses the created gamma distribution to estimate a silent time at which the person can be judged to have no intention to speak; and
an utterance control unit that updates the waiting time in the utterance unit, using the silent time estimated by the estimation unit as the new waiting time.