JPWO2018078885A1 - Dialogue device, dialogue method and computer program for dialogue

Info

Publication number: JPWO2018078885A1
Application number: JP2018547103A
Authority: JP
Inventors: 金岡 利知; 上和田 徹; 吉井 章人
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-10-31
Filing date: 2016-10-31
Publication date: 2019-06-24
Anticipated expiration: 2036-10-31
Also published as: JP6699748B2; WO2018078885A1

Abstract

The dialogue device has a keyword detection unit that detects a keyword uttered by the user from an utterance section, in which the user is speaking, of a voice signal representing the user's voice; and a response unit that sets a predetermined section when the keyword is detected and causes a first response voice to be output via a voice output unit when the voice signal remains silent continuously for a first time length within that predetermined section. In addition, when the keyword has not been detected after the start of the utterance section and the voice signal remains silent continuously for a second time length longer than the first time length, the response unit causes a second response voice to be output via the voice output unit.

Description

The present invention relates to, for example, a dialogue device, a dialogue method, and a dialogue computer program that interact with a user by voice.

Techniques have been studied in which a dialogue device recognizes speech uttered by a user and responds by voice according to the recognition result. In such techniques, the dialogue device is required to respond appropriately so that the user knows the device is processing the uttered speech in some way and does not feel uneasy. For this reason, techniques for returning a backchannel response (aizuchi, a short spoken acknowledgment) according to the user's speech have been studied (see, for example, Patent Documents 1 and 2).

For example, in the technique described in Patent Document 1, a question for starting a conversation is output from a speaker. When speech answering the output question is input, the speech is recognized and the registered word corresponding to it is determined. A backchannel response matching the registered word is then selected from a group of backchannel responses, and synthesized speech of the selected response is output.

In the technique disclosed in Patent Document 2, speech uttered by the user is recognized, and a semantic processing timing and a response timing are determined. When, based on the speech recognition result, the current point is determined to be both a semantic processing timing and a response timing, a spoken response reflecting the result of the semantic processing is given. This technique also determines whether a silent section during an utterance is a response timing; when the silent section is determined not to be a response timing, a backchannel voice is output.

Patent Document 1: Japanese Laid-open Patent Publication No. 2002-169591
Patent Document 2: Japanese Laid-open Patent Publication No. 2005-196134

When the dialogue device returns backchannel responses at appropriate timings, the user can tell that the device is processing the user's utterance, and can therefore converse with the device without feeling uneasy. If the timing of a backchannel response is inappropriate, however, the dialogue can become unnatural. For example, when the dialogue device returns a backchannel response in a silent section, as described in Patent Document 2, and that silent section is merely a pause in the user's utterance, the backchannel response may overlap with the user's next utterance after the pause. Conversely, if the dialogue device returns no backchannel response even after a certain time has passed since the silent section began, the user may feel that the speech is not being processed and repeat the utterance.

In one aspect, an object of the present invention is to provide a dialogue device capable of responding to a user at an appropriate timing.

According to one embodiment, a dialogue device is provided. The dialogue device has a keyword detection unit that detects a keyword uttered by the user from an utterance section, in which the user is speaking, of a voice signal representing the user's voice; and a response unit that sets a predetermined section when the keyword is detected and causes a first response voice to be output via a voice output unit when the voice signal remains silent continuously for a first time length within the predetermined section.

In one aspect, the dialogue device disclosed in this specification can respond to the user at an appropriate timing.

FIG. 1 is a schematic configuration diagram of a dialogue device according to one embodiment.
FIG. 2 is a functional block diagram of the processing unit relating to dialogue processing.
FIG. 3 is a timing chart showing an example of the relationship among the detection timing of a keyword in the voice signal, the silence detection section, and the output timing of a backchannel response.
FIG. 4 is an operation flowchart of the dialogue processing.

The dialogue device is described below with reference to the drawings.
When the dialogue device recognizes, in an utterance section of the voice signal representing speech uttered by the user, that the user has uttered a preset keyword, it sets a predetermined section starting from the time the keyword was recognized. If the voice signal remains silent continuously for a first time length within that predetermined section, the dialogue device outputs the playback voice of a backchannel response. By setting the first time length relatively short, the dialogue device can output the backchannel response at an appropriate timing.
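The two silence timers described in the abstract and in the paragraph above can be illustrated with a minimal sketch. The function name, the return labels, and the millisecond values are assumptions made for illustration; the text only requires that the first time length be shorter than the second. What happens when a keyword was detected but the predetermined section has already ended is not covered by this passage, so the sketch simply keeps waiting in that case.

```python
from typing import Optional

# Hypothetical durations; the text only requires T1 < T2.
T1_MS = 300    # first time length (assumed value)
T2_MS = 1000   # second time length (assumed value)


def choose_response(silence_ms: int, keyword_detected: bool,
                    in_predetermined_section: bool) -> Optional[str]:
    """Return which response voice to play, or None to keep waiting."""
    if keyword_detected and in_predetermined_section:
        # Keyword heard recently: a short silence is enough to respond.
        return "first_response" if silence_ms >= T1_MS else None
    if not keyword_detected and silence_ms >= T2_MS:
        # No keyword since the utterance began: wait for the longer silence.
        return "second_response"
    return None
```

With these assumed values, 300 ms of silence triggers the first response only if a keyword was just detected; without a keyword the device waits for the full second time length.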

The dialogue device can be implemented in various devices that adopt a man-machine interface using speech recognition, such as navigation systems, mobile phones, computers, and robots.

FIG. 1 is a schematic configuration diagram of a dialogue device according to one embodiment. The dialogue device 1 has a microphone 11, an analog/digital converter 12, a digital/analog converter 13, a speaker 14, a storage unit 15, and a processing unit 16. The dialogue device 1 may further have a human presence sensor (not shown) for detecting the user, a user interface (not shown) such as a touch panel, a communication interface (not shown) for communicating with other equipment, and the like.

The microphone 11 is an example of a voice input unit. It collects sound around the dialogue device 1, including the user's voice, and generates an analog voice signal corresponding to the intensity of that sound. The microphone 11 outputs the analog voice signal to the analog/digital converter 12 (hereinafter, the A/D converter). The A/D converter 12 digitizes the analog voice signal by sampling it at a predetermined sampling rate. The sampling rate is set to, for example, 16 kHz to 32 kHz so that the frequency band needed to analyze the user's voice from the voice signal is at or below the Nyquist frequency. The A/D converter 12 outputs the digitized voice signal to the processing unit 16. In the following, the digitized voice signal is simply called the voice signal.

The digital/analog converter 13 (hereinafter, the D/A converter) converts the digital playback voice signal received from the processing unit 16 into an analog playback voice signal and outputs it to the speaker 14. The speaker 14 is an example of a voice output unit and outputs the analog playback voice signal as sound.

The storage unit 15 has, for example, a readable and writable nonvolatile semiconductor memory and a readable and writable volatile semiconductor memory. The storage unit 15 may further have a magnetic or optical recording medium and its access device. The storage unit 15 stores various data used in the dialogue processing executed on the processing unit 16 and various data generated during that processing. For example, it stores the keywords to be recognized and their phoneme sequences, a word dictionary, and a table representing the correspondence between questions and keywords. The storage unit 15 also stores playback voice signals representing questions, backchannel responses, and other response voices. It may further store a program for the processing performed on speech recognition results, together with various data used by that program.

The processing unit 16 has, for example, one or more processors and their peripheral circuits. By executing the dialogue processing, the processing unit 16 determines the timing at which to output the synthesized voice of a backchannel response to the user's utterance. The processing unit 16 also executes processing according to the result of speech recognition based on the voice signal.

The processing unit 16 is described in detail below.

FIG. 2 is a functional block diagram of the processing unit 16 relating to dialogue processing. The processing unit 16 has a keyword setting unit 21, an utterance section start detection unit 22, a keyword detection unit 23, a response unit 24, and a speech recognition unit 25. These units are, for example, functional modules implemented by a computer program running on a processor of the processing unit 16. Alternatively, they may be one or more integrated circuits that implement the functions of the respective units.

The processing unit 16 starts the dialogue processing when, for example, the human presence sensor detects that the user is within a predetermined range of the dialogue device 1. Alternatively, the processing unit 16 starts the dialogue processing when a predetermined operation is performed on the user interface. Once the dialogue processing has started, the processing unit 16 executes the processing of each of the units above.

When the dialogue processing starts, the keyword setting unit 21 selects a question to put to the user and sets keywords according to the content of the selected question. For example, the keyword setting unit 21 selects a question from among a plurality of questions prepared in advance, according to the operation performed on the user interface. Alternatively, it may select a question according to the application running on the dialogue device 1. In this case, for example, a table representing the correspondence between operation content, or application type, and questions is stored in the storage unit 15 in advance. The keyword setting unit 21 then refers to that table and selects the question corresponding to the operation performed or the type of application running.

The keyword setting unit 21 reads the playback voice signal corresponding to the selected question from the storage unit 15 and outputs it to the speaker 14 via the D/A converter 13. The speaker 14 then utters the selected question to the user.

The keyword setting unit 21 also sets a keyword corresponding to the selected question. A keyword can be, for example, a word or phrase contained in an expected answer to the selected question. For example, if the question is "Where are you going?", the keywords are set to place names expected as answers, such as "America" or "Kyoto". If the question is "What would you like to do?", the keywords are set to words contained in the names of functions that the device in which the dialogue device 1 is implemented can execute, since those are the expected answers. For example, if the function the device can execute is booking some kind of ticket, the keyword is set to "ticket" or "reservation".

To set keywords, for example, a table representing the correspondence between questions and keywords is stored in the storage unit 15 in advance. The keyword setting unit 21 refers to that table and sets the keyword corresponding to the selected question. One keyword or a plurality of keywords may be set.
The keyword setting unit 21 notifies the keyword detection unit 23 of the set keywords. It also notifies the utterance section start detection unit 22 that the question has been output.
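The question-to-keyword table described above could look like the following minimal sketch. The dictionary layout and the function name are assumptions; the entries reuse the examples given in the text.

```python
# Hypothetical question-to-keyword table; entries mirror the examples in the text.
QUESTION_KEYWORDS = {
    "Where are you going?": ["America", "Kyoto"],
    "What would you like to do?": ["ticket", "reservation"],
}


def keywords_for(question):
    """Return the keywords registered for a question (empty list if none)."""
    return list(QUESTION_KEYWORDS.get(question, []))
```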

After the dialogue processing has started, the processing unit 16 divides the input voice signal into frames of a predetermined length. The frame length is set to, for example, 10 msec to 20 msec. The frames are input in chronological order to the utterance section start detection unit 22, the keyword detection unit 23, and the speech recognition unit 25.
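As a rough illustration of the framing step, the sketch below cuts a list of samples into fixed-length frames. Non-overlapping frames and the dropping of a trailing partial frame are assumptions; the text states only the frame length. At 16 kHz, a 10 msec frame holds 160 samples.

```python
def split_frames(samples, sample_rate_hz=16000, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames.

    A trailing partial frame is dropped (an assumption; the text does not say).
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]
```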

After the question has been output, the utterance section start detection unit 22 detects, from the voice signal input to the processing unit 16 via the microphone 11 and the A/D converter 12, the start of an utterance section, that is, a section in which the user is speaking. Because the voice signal contains the user's voice in an utterance section, the signal-to-noise ratio of the voice signal there is expected to be higher than outside utterance sections. The utterance section start detection unit 22 therefore calculates the signal-to-noise ratio (hereinafter, SN ratio) of the voice signal for each frame and detects the start of the utterance section based on that SN ratio.

Each time a frame is input, the utterance section start detection unit 22 calculates the power of the voice signal for that frame. For example, it calculates the power for each frame according to equation (1).

Here, S_k(n) denotes the signal value at the n-th sampling point of the latest frame (hereinafter also called the current frame), and k is the frame number. N denotes the total number of sampling points contained in one frame, and Spow(k) denotes the power of the current frame.
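The power equation itself did not survive in this text. A form consistent with the symbols defined above is the sum of squared sample values over the frame, shown below as a plausible reconstruction. The index range and the absence of a 1/N normalization are assumptions.

```latex
Spow(k) = \sum_{n=0}^{N-1} \left\{ S_k(n) \right\}^{2}
```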

The utterance section start detection unit 22 may instead calculate the power for each of a plurality of frequencies in each frame. In this case, the keyword detection unit 23 converts the voice signal of each frame from the time domain into a spectrum signal in the frequency domain using a time-frequency transform. The utterance section start detection unit 22 can use, for example, the fast Fourier transform (FFT) as the time-frequency transform. For each frequency band, the utterance section start detection unit 22 can then calculate the sum of squares of the spectrum signal contained in that band as the power of that band.
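The per-band power can be sketched as follows. To stay within the Python standard library, this sketch uses a naive discrete Fourier transform in place of the FFT named in the text; the result is the same, only slower. Equal-width bands and the function name are assumptions.

```python
import cmath
import math


def band_powers(frame, num_bands):
    """Sum of squared spectrum magnitudes in each of num_bands equal-width bands."""
    n = len(frame)
    half = n // 2  # bins 0 .. n/2 - 1; for real input the upper bins mirror these
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(frame))
        for f in range(half)
    ]
    width = half // num_bands  # bins left over after the last band are ignored
    return [
        sum(abs(c) ** 2 for c in spectrum[b * width:(b + 1) * width])
        for b in range(num_bands)
    ]


# A cosine sitting exactly on bin 1 of an 8-point frame lands in the lower band.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
powers = band_powers(frame, 2)
```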

The utterance section start detection unit 22 also calculates, for each frame, the estimated noise component in the voice signal of that frame. In this embodiment, it calculates the estimated noise component of the current frame by updating the estimated noise component of the immediately preceding frame with the power of the current frame according to equation (2).

Here, Noise(k-1) denotes the estimated noise component of the immediately preceding frame, and Noise(k) denotes that of the current frame. β is a forgetting factor and is set to, for example, 0.9.
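Equation (2) is also missing from this text. Given a forgetting factor β applied to the previous estimate and an update from the current frame's power, the usual exponential-smoothing form would be the following. This is a reconstruction under that assumption, not the published equation.

```latex
Noise(k) = \beta \cdot Noise(k-1) + (1 - \beta) \cdot Spow(k)
```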

When the power is calculated per frequency band, the keyword detection unit may calculate the estimated noise component for each frequency band according to equation (2). In that case, Noise(k-1), Noise(k), and Spow(k) in equation (2) are, respectively, the estimated noise component of the immediately preceding frame, the estimated noise component of the current frame, and the power for the frequency band of interest.

The utterance section start detection unit 22 may update the estimated noise component according to equation (2) only when the power of the current frame is at or below a predetermined threshold. When the power of the current frame exceeds the threshold, it sets Noise(k) = Noise(k-1). The threshold can be, for example, Noise(k-1) plus a predetermined offset.
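The gated update can be sketched as below. The sketch assumes that equation (2), which is not reproduced in this text, is the usual exponential smoothing β·Noise(k-1) + (1-β)·Spow(k); the function name and the offset argument are illustrative.

```python
BETA = 0.9  # forgetting factor, the example value given in the text


def update_noise(prev_noise, power, offset):
    """Return Noise(k) from Noise(k-1) and Spow(k).

    The estimate is updated only while the frame power stays at or below
    Noise(k-1) + offset; louder frames (likely speech) leave it unchanged.
    The smoothing formula is an assumed form of equation (2).
    """
    if power <= prev_noise + offset:
        return BETA * prev_noise + (1.0 - BETA) * power
    return prev_noise
```

Freezing the estimate during loud frames keeps the user's own voice from being absorbed into the noise floor.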

The utterance section start detection unit 22 calculates the SN ratio for each frame, for example according to equation (3).

Here, SNR(k) denotes the SN ratio of the current frame. When the power and the estimated noise component are calculated per frequency band, the utterance section start detection unit 22 may calculate the SN ratio for each frequency band according to equation (3). In that case, Noise(k), Spow(k), and SNR(k) in equation (3) are, respectively, the estimated noise component, power, and SN ratio of the current frame for the frequency band of interest.
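Equation (3) is likewise absent from this text. From the symbols it uses, the SN ratio is some function of Spow(k) divided by Noise(k). The plain ratio below is one candidate; a logarithmic form such as 10 log10(Spow(k)/Noise(k)) is equally common in practice, and which one the publication uses cannot be recovered from this text.

```latex
SNR(k) = \frac{Spow(k)}{Noise(k)}
```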

In this embodiment, the utterance section start detection unit 22 compares the SN ratio of each frame with a voiced-sound determination threshold Thsnr. Thsnr is set to a value corresponding to the voice signal containing a signal component other than the estimated noise component, for example 2 to 3. If the SN ratio is at or above Thsnr, the utterance section start detection unit 22 determines that the frame belongs to an utterance section.

When the SN ratio is calculated per frequency band, the utterance section start detection unit 22 may determine that a frame belongs to an utterance section when the number of frequency bands whose SN ratio is at or above Thsnr is at least a predetermined number. The predetermined number can be, for example, half the total number of frequency bands for which the SN ratio is calculated.
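The per-band vote can be sketched in a few lines. The function name and the default threshold of 2.5 (a value inside the 2 to 3 range given in the text) are assumptions.

```python
def frame_is_voiced(band_snrs, th_snr=2.5, min_fraction=0.5):
    """True when enough frequency bands have an SN ratio at or above th_snr."""
    voiced_bands = sum(1 for snr in band_snrs if snr >= th_snr)
    return voiced_bands >= min_fraction * len(band_snrs)
```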

When the immediately preceding frame does not belong to an utterance section (that is, its SN ratio is below Thsnr) and the current frame does, the utterance section start detection unit 22 determines that the utterance section started at the current frame. Alternatively, it may determine, once a predetermined number of consecutive frames (for example, 2 to 5 frames) belong to an utterance section, that the utterance section started at the first of those consecutive frames.
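Both start rules above fit one small function: with `min_run=1` it is the first rule, and with a larger `min_run` it is the consecutive-frame variant. The function name and default values are assumptions chosen from the ranges in the text.

```python
def find_utterance_start(snrs, th_snr=2.5, min_run=3):
    """Return the index of the frame where the utterance section starts, or None.

    The start is the first frame of the first run of min_run consecutive
    frames whose SN ratio is at or above th_snr.
    """
    run = 0
    for k, snr in enumerate(snrs):
        if snr >= th_snr:
            run += 1
            if run == min_run:
                return k - min_run + 1
        else:
            run = 0
    return None
```

Requiring several consecutive frames avoids treating a single noisy frame as the start of speech, at the cost of a few frames of detection delay.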

On detecting the start of an utterance section, the utterance section start detection unit 22 notifies the keyword detection unit 23 and the speech recognition unit 25 that the utterance section has started.

Once the utterance section has started, the keyword detection unit 23 detects, within that utterance section, the keyword set by the keyword setting unit 21. It does so by applying any of various word spotting techniques to the utterance section. For example, the keyword detection unit 23 calculates, for each frame in the utterance section, a plurality of feature quantities representing characteristics of the user's voice, and generates for each frame a feature vector whose elements are those feature quantities.

For example, as feature quantities representing characteristics of the user's voice, the keyword detection unit 23 obtains Mel Frequency Cepstral Coefficients (MFCCs) together with their Δ cepstrum and ΔΔ cepstrum.

The keyword detection unit 23 sets, within the utterance section, a detection section whose length corresponds to the keyword to be detected. Based on the feature quantities extracted from each frame in the detection section, it then searches for the maximum likelihood phoneme sequence for that detection section. The maximum likelihood phoneme sequence is the phoneme sequence, estimated to be the most probable, in which the phonemes contained in the speech are arranged in the order they were uttered.

To that end, the keyword detection unit 23 uses, for example, a GMM-HMM: a Hidden Markov Model (HMM) serves as the acoustic model, and the output probability of each phoneme for a speech feature vector is calculated with a Gaussian Mixture Model (GMM).

Specifically, for each frame in the detection section, the keyword detection unit 23 inputs the feature vector of that frame to the GMM and thereby calculates, for that frame, the output probability of each state of the HMM corresponding to each phoneme. The keyword detection unit 23 may also apply to the feature vectors calculated from the frames a normalization called Cepstral Mean Normalization (CMN), which estimates the mean of each element of the feature vector and subtracts the estimated mean from the value of that element. It may then input the normalized feature vectors to the GMM.
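CMN as described above can be sketched as follows. Here the mean of each element is taken over all the frames passed in, which is one simple way to estimate it; the text does not specify the estimation window, so that choice is an assumption.

```python
def cepstral_mean_normalize(vectors):
    """Subtract the per-element mean (over all given frames) from every vector."""
    count = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / count for d in range(dim)]
    return [[v[d] - means[d] for d in range(dim)] for v in vectors]
```

After normalization every element has zero mean across the frames, which removes constant offsets such as those introduced by the microphone or channel.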

For each frame, the keyword detection unit 23 uses the obtained output probabilities as the output probabilities of the corresponding states of the phoneme HMMs, and thereby obtains, for the detection section of interest, the phoneme sequence with the largest cumulative log likelihood as the maximum likelihood phoneme sequence.

For example, the keyword detection unit 23 calculates the logarithm of the probability (state transition probability) of a transition from an HMM state of a phoneme candidate in the previous frame, the transition source, to an HMM state of a phoneme candidate in the current frame, the transition destination. It also calculates the logarithm of the output probability of that HMM state of the phoneme candidate in the current frame. By adding these logarithms to the cumulative log likelihood of the HMM state of the phoneme candidate up to the previous frame, it calculates the cumulative log likelihood of the HMM state of the phoneme candidate in the current frame. In doing so, the keyword detection unit 23 selects, from among the transition-source HMM states, the phoneme candidate that yields the largest cumulative log likelihood on transition to the HMM state of the phoneme candidate in the current frame. The keyword detection unit 23 carries this Viterbi computation, which makes the selection for the HMM states of all phoneme candidates in the current frame, forward to the last frame of the detection section. The keyword detection unit 23 may instead select state transitions for which the above sum is at or above a predetermined value. It then selects the state with the largest cumulative log likelihood in the last frame, obtains the history of state transitions leading to that state (the Viterbi path) by backtracking, and obtains the maximum likelihood phoneme sequence in that speech section from the Viterbi path.
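The Viterbi computation described above can be sketched on a generic HMM in the log domain. This is a minimal illustration rather than the device's decoder: the states are plain integers instead of phoneme HMM states, the emission scores are given directly instead of coming from a GMM, and the toy numbers are made up.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for log_emit[t][s], with its cumulative log likelihood.

    log_init[s]: log probability of starting in state s
    log_trans[p][s]: log probability of moving from state p to state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(num_states):
            # Pick the transition source giving the largest cumulative score.
            prev = max(range(num_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_emit[t][s])
        score = new_score
        back.append(pointers)
    last = max(range(num_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):  # backtrack along the Viterbi path
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy two-state example (all probabilities are illustrative).
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.7), math.log(0.3)], [math.log(0.3), math.log(0.7)]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.9), math.log(0.1)],
            [math.log(0.1), math.log(0.9)]]
path, log_likelihood = viterbi(log_init, log_trans, log_emit)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow over long detection sections.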

The keyword detection unit 23 determines whether the keyword was uttered in the detection section by comparing the maximum likelihood phoneme sequence with the phoneme sequence representing the pronunciation of the keyword to be detected (hereinafter simply called the keyword phoneme sequence). For example, the keyword detection unit 23 calculates the degree of matching between the maximum likelihood phoneme sequence and the keyword phoneme sequence and, if the degree of matching is equal to or greater than a match determination threshold, determines that the keyword was uttered in the detection section. As the degree of matching, the keyword detection unit 23 calculates, for example, the ratio of the number of phonemes that match between the keyword phoneme sequence and the maximum likelihood phoneme sequence to the total number of phonemes in the keyword phoneme sequence. Alternatively, the keyword detection unit 23 may perform dynamic programming matching between the keyword phoneme sequence and the maximum likelihood phoneme sequence to calculate the Levenshtein distance LD (also called the edit distance), and may then calculate 1/(1+LD) as the degree of matching.
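The edit-distance variant of the degree of matching, 1/(1+LD), can be sketched as follows. The function names, the phoneme spellings, and the threshold value 0.5 are assumptions made for illustration; the specification does not give a concrete threshold.

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def match_degree(keyword_phonemes, decoded_phonemes):
    """Degree of matching 1 / (1 + LD)."""
    return 1.0 / (1.0 + levenshtein(keyword_phonemes, decoded_phonemes))

def keyword_uttered(keyword_phonemes, decoded_phonemes, threshold=0.5):
    """True if the degree of matching reaches the (hypothetical) threshold."""
    return match_degree(keyword_phonemes, decoded_phonemes) >= threshold

keyword = ["a", "m", "e", "r", "i", "k", "a"]
exact = ["a", "m", "e", "r", "i", "k", "a"]      # LD = 0
one_off = ["a", "m", "e", "l", "i", "k", "a"]    # LD = 1 (one substitution)
other = ["i", "k", "i", "m", "a", "s", "u"]      # a different word
print(match_degree(keyword, exact), match_degree(keyword, one_off))
```

With this measure an exact match scores 1.0 and each additional edit lowers the score quickly, so the threshold directly controls how many phoneme errors are tolerated.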

On determining that the keyword was uttered in the detection section, the keyword detection unit 23 notifies the response unit 24 that the keyword has been detected.

On the other hand, if the degree of matching is less than the match determination threshold, the keyword detection unit 23 determines that the keyword to be detected was not uttered in the detection section. The keyword detection unit 23 then resets the detection section by delaying its start timing within the utterance section by a predetermined number of frames (for example, one or two frames), and repeats the above processing on the reset detection section to determine whether the keyword was uttered. If the length from the start timing of the detection section to the end of the utterance section is shorter than the length of the detection section, the keyword detection unit 23 need not perform keyword detection.
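The way the detection section slides through the utterance section can be sketched as follows. The generator name and the frame counts are hypothetical; sections are expressed as half-open frame index ranges.

```python
def detection_sections(utterance_start, utterance_end, section_len, shift):
    """Yield (start, end) frame ranges of successive detection sections.

    The start is delayed by `shift` frames each time, and generation stops
    once the remaining part of the utterance section is shorter than
    `section_len`, in which case no keyword detection is attempted.
    """
    start = utterance_start
    while start + section_len <= utterance_end:
        yield (start, start + section_len)
        start += shift

sections = list(detection_sections(0, 10, 6, 2))
print(sections)
```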

In addition, the keyword detection unit 23 determines that the utterance section has ended when the audio signal remains silent continuously for a second time length, that is, when frames belonging to a silent section, in which the user's voice is absent, continue for the number of frames corresponding to the second time length. The second time length is set to, for example, 500 milliseconds to 1 second so that a pause within the user's utterance is not mistaken for the end of the utterance. The keyword detection unit 23 may determine, for example, that a frame whose SN ratio is less than the voiced-sound determination threshold Thsnr belongs to a silent section. The keyword detection unit 23 then notifies the response unit 24 and the speech recognition unit 25 that the utterance section has ended.
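Counting consecutive silent frames against a time length can be sketched as follows. The 10 ms frame length, the threshold value 3.0 used for Thsnr, and the SN ratio values are assumptions for illustration only.

```python
import math

def frames_for(duration_ms, frame_ms):
    """Number of frames corresponding to a time length."""
    return math.ceil(duration_ms / frame_ms)

def silence_reached_at(snr_per_frame, thsnr, required_frames):
    """Index of the frame at which the run of silent frames first reaches
    `required_frames`, or None if it never does."""
    run = 0
    for idx, snr in enumerate(snr_per_frame):
        # A frame is silent if its SN ratio is below the threshold.
        run = run + 1 if snr < thsnr else 0
        if run >= required_frames:
            return idx
    return None

# 20 voiced frames, a 10-frame pause, 20 voiced frames, then 60 silent frames.
snr = [15.0] * 20 + [1.0] * 10 + [15.0] * 20 + [1.0] * 60
second_len = frames_for(500, 10)   # 50 frames
short_len = frames_for(100, 10)    # 10 frames
print(silence_reached_at(snr, 3.0, second_len),
      silence_reached_at(snr, 3.0, short_len))
```

In this toy signal the 500 ms requirement ignores the 100 ms pause and fires only in the final silence, whereas a 100 ms requirement applied everywhere would already fire during the pause.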

When notified that the keyword has been detected, the response unit 24 sets a silence detection section. Then, when the audio signal remains silent continuously for a first time length within the silence detection section, that is, when frames belonging to a silent section continue for the number of frames corresponding to the first time length, the response unit 24 outputs the playback audio signal of a backchannel response (aizuchi, a short acknowledgment such as "yes"). The backchannel response is an example of a response voice.

The length of the silence detection section is set in advance according to, for example, the expected time from when the user utters the keyword to be detected until the user finishes the utterance. For example, the silence detection section is set to between one second and several seconds. The length of the silence detection section may also be set for each keyword to be detected. In that case, for example, a table representing the correspondence between keywords and silence detection section lengths is stored in advance in the storage unit 15, and the response unit 24 refers to the table to set the length of the silence detection section corresponding to the keyword to be detected. The response unit 24 can thereby set the length of the silence detection section appropriately for each keyword, matching the time from the utterance of the keyword to the end of the utterance section.
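A per-keyword table of this kind could look like the sketch below. The keywords, the lengths in milliseconds, and the fallback value are all hypothetical and only illustrate the lookup.

```python
# Hypothetical table: keyword -> length of the silence detection section (ms).
SECTION_MS_BY_KEYWORD = {"America": 1500, "tomorrow": 3000}
DEFAULT_SECTION_MS = 2000  # assumed fallback for keywords without an entry

def silence_detection_length_ms(keyword):
    """Look up the silence detection section length for a keyword."""
    return SECTION_MS_BY_KEYWORD.get(keyword, DEFAULT_SECTION_MS)

print(silence_detection_length_ms("America"))
```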

Once the silence detection section starts, the response unit 24 detects frames belonging to a silent section according to the SN ratio of each frame within the silence detection section. For example, if the SN ratio of the current frame is less than the voiced-sound determination threshold Thsnr, the response unit 24 determines that the current frame belongs to a silent section.

The response unit 24 determines whether the silent section continues for the first time length. If it does, the response unit 24 determines that the utterance section has ended. The response unit 24 then reads the playback audio signal of the backchannel response from the storage unit 15 and outputs it to the speaker 14 via the D/A converter 13, and the speaker 14 plays the backchannel response. The response unit 24 also notifies the speech recognition unit 25 that the utterance section has ended.

Here, the first time length is set shorter than the second time length. For example, the first time length is set to between several tens of milliseconds and 100 milliseconds. The dialogue device 1 can thereby shorten the waiting time from when the user finishes speaking until the dialogue device 1 returns a backchannel response. Moreover, in this embodiment the silence detection section is set only after the keyword has been detected, so the user's utterance is expected to end within the silence detection section. Therefore, even with such a relatively short first time length, a pause within the user's utterance is unlikely to be mistaken for the end of the utterance section. This in turn keeps the dialogue device 1 from returning a backchannel response during such a pause, where it would overlap with or interrupt the user's utterance.

A plurality of backchannel responses may be prepared in advance. The response unit 24 may then select one of them at random. Alternatively, the response unit 24 may select a backchannel response from among them according to the keyword to be detected. The response unit 24 may then read the playback audio signal corresponding to the selected backchannel response from the storage unit 15 and output it to the speaker 14 via the D/A converter 13.

Furthermore, the response unit 24 may also output the playback audio signal of a backchannel response when it is notified by the keyword detection unit 23 of the end of the utterance section while no keyword has been detected, that is, while no silence detection section is set. In this way, even when the user does not utter an expected keyword, the dialogue device 1 returns a backchannel response when the user finishes speaking, which keeps the user from feeling uneasy.

FIG. 3 is a timing chart showing an example of the relationship among the timing at which a keyword is detected in the audio signal, the silence detection section, and the timing at which the backchannel response is output. In FIG. 3, the horizontal axis represents time. The audio signal 301 input via the microphone 11 is shown at the top. In this example, the user is assumed to have said "Um... I'm going to America" between times t0 and t4. The second chart 302 from the top represents the timing at which the keyword is detected. In this example, the keyword "America" is detected at time t3.

The third chart 303 from the top represents the timing at which the silence detection section is set. In this example, the silence detection section is set from time t3, at which the keyword "America" is detected, to time t6.

The fourth chart 304 from the top represents the detected silent sections. In this example, silent sections are detected between times t1 and t2, which correspond to a pause within the utterance, and from time t4 onward, after the utterance has ended. Times t1 to t2 fall outside the silence detection section, whereas time t4 falls within it. The second chart 305 from the bottom represents the timing at which the backchannel response is played when a silent section continues for the first time length within the silence detection section after keyword detection. In this example, the backchannel response "Yes" is played at time t5, when the first time length has elapsed from time t4.

The bottom chart 306 represents the timing at which the backchannel response is played when no keyword is detected in the utterance section. In this case, to avoid mistaking the pause between times t1 and t2 for the end of the utterance section, the backchannel response is played at time t7, which is later than time t5 and at which the second time length has elapsed from time t4.

Thus, when the keyword has been detected, the waiting time from the end of the user's utterance until the backchannel response is returned is shorter than when the keyword has not been detected.

If no silent section lasting a certain time is detected within the silence detection section, the response unit 24 resets the silence detection section.

When the utterance section ends, the speech recognition unit 25 executes speech recognition processing on the entire utterance section to recognize what the user said in that utterance section.

The speech recognition unit 25 obtains the maximum likelihood phoneme sequence over the entire utterance section. To that end, each time a frame is input after the start of the utterance section, the speech recognition unit 25, like the keyword detection unit 23, generates from that frame a feature vector containing a plurality of feature amounts that represent characteristics of the human voice. For frames for which the keyword detection unit 23 has already calculated a feature vector, the speech recognition unit 25 may reuse that feature vector.

When notified by the keyword detection unit 23 or the response unit 24 that the utterance section has ended, the speech recognition unit 25 obtains the maximum likelihood phoneme sequence for the entire utterance section. In doing so, the speech recognition unit 25, like the keyword detection unit 23, may use a GMM-HMM, taking the output probability obtained for each frame as the output probability of the corresponding state of the phoneme HMM, and obtain the phoneme sequence with the largest cumulative log likelihood as the maximum likelihood phoneme sequence.

The speech recognition unit 25 recognizes the utterance content in the utterance section by referring to a word dictionary that represents the phoneme sequence of each word and detecting, in accordance with a language model, a combination of words whose phoneme sequence matches the maximum likelihood phoneme sequence of the utterance section. Alternatively, the speech recognition unit 25 may use another technique for recognizing utterance content from a maximum likelihood phoneme sequence to recognize what the user said in the utterance section.

The processing unit 16 executes processing according to the recognized utterance content of the user and the application running on the processing unit 16. For example, the processing unit 16 may generate a response phrase according to the utterance content and generate a synthesized speech signal for that response phrase. The processing unit 16 may then output the generated synthesized speech signal to the speaker 14 via the D/A converter 13. In doing so, the processing unit 16 may generate the response phrase using, for example, a neural network trained in advance to generate a response phrase from an input character string representing the utterance content.

Alternatively, the processing unit 16 may execute a search on a network connected to the dialogue device 1, using a combination of words corresponding to the utterance content as a query. As a further alternative, the processing unit 16 may compare the character string representing the utterance content with the operation commands of the apparatus in which the dialogue device 1 is installed and, when the character string matches one of the operation commands, execute processing according to that operation command.

FIG. 4 is an operation flowchart of the dialogue processing according to the present embodiment.

The keyword setting unit 21 selects a question to ask the user and outputs the playback audio signal of the selected question to the speaker 14 via the D/A converter 13 (step S101). The keyword setting unit 21 also sets a keyword according to the selected question (step S102).

The utterance section start detection unit 22 determines whether an utterance section has started in the audio signal input via the microphone 11 and the A/D converter 12 (step S103). If no utterance section has started (step S103-No), the utterance section start detection unit 22 repeats the processing of step S103. On detecting the start of an utterance section (step S103-Yes), the utterance section start detection unit 22 notifies the keyword detection unit 23 and the speech recognition unit 25 that the utterance section has started. The keyword detection unit 23 then starts keyword detection, and the speech recognition unit 25 starts speech recognition (step S104).

The keyword detection unit 23 determines whether the keyword has been detected in the detection section set within the utterance section (step S105). If the keyword has been detected (step S105-Yes), the keyword detection unit 23 notifies the response unit 24 that the keyword has been detected. The response unit 24 then sets the silence detection section (step S106).

The response unit 24 determines whether the audio signal has remained silent continuously for the first time length within the silence detection section (step S107). If no silent section lasting the first time length has been detected within the silence detection section (step S107-No), the response unit 24 determines whether the silence detection section has elapsed (step S108). If it has not (step S108-No), the response unit 24 repeats the processing of step S107 for the next and subsequent frames. If it has (step S108-Yes), the response unit 24 resets the silence detection section (step S109). The processing unit 16 then repeats the processing from step S105 onward.

On the other hand, if a silent section lasting the first time length is detected within the silence detection section in step S107 (step S107-Yes), the response unit 24 outputs the playback audio signal of the backchannel response to the speaker 14 via the D/A converter 13 (step S110). The response unit 24 also notifies the speech recognition unit 25 that the utterance section has ended.

The speech recognition unit 25 executes speech recognition processing over the entire utterance section to recognize what the user said in the utterance section (step S111). The processing unit 16 then executes processing according to the recognized utterance content (step S112).

If no keyword is detected in step S105 (step S105-No), the keyword detection unit 23 determines whether a silent section has continued for the second time length (step S113). If no silent section lasting the second time length has been detected (step S113-No), the keyword detection unit 23 delays the keyword detection section by a predetermined number of frames and repeats the processing from step S105 onward.

On the other hand, if a silent section lasting the second time length is detected (step S113-Yes), the keyword detection unit 23 notifies the response unit 24 and the speech recognition unit 25 that the utterance section has ended. The processing unit 16 then executes the processing of steps S110 to S112. After step S112, the processing unit 16 ends the dialogue processing.
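The decision logic of steps S105 to S113 can be sketched as a per-frame loop. This is a simplified illustration under stated assumptions: keyword detection and silence classification are replaced by precomputed per-frame flags, all lengths are given in frames (10 ms frames are assumed), and the function name, the frame counts, and the toy utterance modeled on FIG. 3 are hypothetical.

```python
def run_dialogue(frames, first_len, second_len, section_len):
    """Return (frame_index, kind) of the backchannel response.

    frames: list of (is_silent, keyword_detected_here) flags per frame.
    kind is "first" (keyword detected, short wait) or "second" (no keyword,
    long wait); (None, None) means no response was triggered.
    """
    section_left = 0     # remaining frames of the silence detection section
    run_in_section = 0   # consecutive silent frames inside that section
    run_total = 0        # consecutive silent frames overall
    for idx, (silent, keyword_hit) in enumerate(frames):
        run_total = run_total + 1 if silent else 0
        if section_left == 0 and keyword_hit:    # S105-Yes -> S106
            section_left = section_len
            run_in_section = 0
        if section_left > 0:
            run_in_section = run_in_section + 1 if silent else 0
            if run_in_section >= first_len:      # S107-Yes -> S110
                return idx, "first"
            section_left -= 1                    # S108; reaching 0 acts as the reset of S109
        elif run_total >= second_len:            # S113-Yes -> S110
            return idx, "second"
    return None, None

def make_frames(keyword_at):
    # 30 voiced frames, a 20-frame pause, 60 voiced frames, then 100 silent frames.
    silent = [False] * 30 + [True] * 20 + [False] * 60 + [True] * 100
    return [(s, i == keyword_at) for i, s in enumerate(silent)]

with_keyword = run_dialogue(make_frames(90), first_len=5, second_len=50, section_len=200)
without_keyword = run_dialogue(make_frames(None), first_len=5, second_len=50, section_len=200)
print(with_keyword, without_keyword)
```

In this toy run the 20-frame pause is longer than the first time length but lies outside the silence detection section, so it triggers nothing; the response comes 5 frames after the utterance ends when the keyword was detected, and 50 frames after it otherwise.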

As described above, this dialogue device sets a silence detection section when it detects, in the input audio signal, a keyword corresponding to the question. When the dialogue device then detects that a silent section has continued for the relatively short first time length within the silence detection section, it outputs the playback audio signal of a backchannel response. The dialogue device can therefore shorten the waiting time from the end of the user's utterance until the backchannel response is played, and can respond to the user at an appropriate timing.

According to a modification, the utterance section start detection unit 22, the keyword detection unit 23, and the response unit 24 may detect whether a frame belongs to an utterance section on the basis of the power of each frame. In this case, these units may determine that the current frame belongs to an utterance section if its power is equal to or greater than a predetermined power threshold Thp, and that the current frame belongs to a silent section if its power is less than the power threshold Thp.
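A power-based frame classification of this kind can be sketched as follows. Expressing the power in decibels, the threshold of -40 dB used for Thp, and the test signals are assumptions for illustration; the specification does not state how the power is scaled.

```python
import math

def frame_power_db(samples):
    """Mean-square power of a frame in dB (a small offset avoids log of zero)."""
    mean_sq = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_sq + 1e-12)

def is_utterance_frame(samples, thp_db):
    """True if the frame power is at or above the power threshold Thp."""
    return frame_power_db(samples) >= thp_db

# 160-sample frames of a 200 Hz tone sampled at 16 kHz, loud and very quiet.
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(160)]
quiet = [0.001 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(160)]
print(is_utterance_frame(loud, -40.0), is_utterance_frame(quiet, -40.0))
```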

According to another modification, the utterance section start detection unit 22, the keyword detection unit 23, and the response unit 24 may detect whether a frame belongs to an utterance section on the basis of the pitch gain, which represents the strength of the periodicity of the audio signal in each frame.

In this case, the utterance section start detection unit 22 calculates, for each frame, the long-term autocorrelation C(d) of the audio signal for each delay amount d ∈ {d_low, ..., d_high}.

$$C(d) = \sum_{n=0}^{N-1} S_k(n)\, S_k(n-d), \qquad d \in \{d_{low}, \ldots, d_{high}\}$$

Here, S_k(n) is the n-th signal value of the current frame k, and N is the total number of sampling points in the frame. When (n-d) is negative, the corresponding signal value of the immediately preceding frame (that is, S_{k-1}(N-(n-d))) is used as S_k(n-d). The range {d_low, ..., d_high} of the delay amount d is set so as to include the delay amounts corresponding to the fundamental frequency of the human voice (100 to 300 Hz), because the pitch gain is highest at the fundamental frequency. For example, when the sampling rate is 16 kHz, d_low = 40 and d_high = 286.

Having calculated the long-term autocorrelation C(d) for each delay amount d in the range, the utterance section start detection unit 22 obtains the maximum value C(d_max) among them. Here, d_max is the delay amount corresponding to the maximum value C(d_max) of the long-term autocorrelation C(d), and corresponds to the pitch period. The utterance section start detection unit 22 then calculates the pitch gain g_pitch according to the following equation.

[Equation for g_pitch not reproduced in this text]

If the pitch gain g_pitch of the current frame is equal to or greater than a predetermined threshold, the utterance section start detection unit 22 determines that the current frame belongs to an utterance section. Likewise, the keyword detection unit 23 and the response unit 24 may calculate the pitch gain g_pitch for each frame and determine that a frame whose pitch gain g_pitch is less than the predetermined threshold belongs to a silent section. By using the pitch gain to detect utterance sections in this way, the dialogue device 1 can accurately distinguish utterance sections from silent sections even in a relatively noisy environment, and can therefore appropriately determine when to return a backchannel response even in such an environment.
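A pitch-gain computation of this kind can be sketched as follows. The long-term autocorrelation C(d) is the sum over n of S_k(n) times S_k(n-d), as described in the text. The normalization used for g_pitch here (dividing C(d_max) by the geometric mean of the energies of the current and delayed segments) is an assumption for illustration and is not taken from the specification. The sketch also assumes that the frame length N is at least d_high and reads the tail of the previous frame at index N+(n-d) when (n-d) is negative. The 320-sample frames and the test signals are hypothetical.

```python
import math
import random

def delayed_sample(prev_frame, frame, n, d):
    """S_k(n - d), reading the tail of the previous frame when n - d < 0."""
    m = n - d
    return frame[m] if m >= 0 else prev_frame[len(frame) + m]

def long_term_autocorr(prev_frame, frame, d):
    """C(d): sum over n of S_k(n) * S_k(n - d)."""
    return sum(frame[n] * delayed_sample(prev_frame, frame, n, d)
               for n in range(len(frame)))

def pitch_gain(prev_frame, frame, d_low=40, d_high=286):
    """Return (g_pitch, d_max) under the assumed normalization."""
    d_max = max(range(d_low, d_high + 1),
                key=lambda d: long_term_autocorr(prev_frame, frame, d))
    c_max = long_term_autocorr(prev_frame, frame, d_max)
    energy = sum(x * x for x in frame)
    energy_delayed = sum(delayed_sample(prev_frame, frame, n, d_max) ** 2
                         for n in range(len(frame)))
    denom = math.sqrt(energy * energy_delayed)
    return (c_max / denom if denom > 0 else 0.0), d_max

# Periodic signal: 200 Hz tone at 16 kHz (pitch period of 80 samples).
tone = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(640)]
g_tone, d_tone = pitch_gain(tone[:320], tone[320:])

# Aperiodic signal: uniform noise from a seeded generator.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(640)]
g_noise, d_noise = pitch_gain(noise[:320], noise[320:])
print(g_tone, d_tone, g_noise)
```

With this normalization a strongly periodic frame gives a value close to 1 and an aperiodic frame a much smaller value, so a single threshold between the two separates voiced frames from noise-like frames.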

According to still another modification, the response unit 24 may use different backchannel responses for the case where a silent section continues for the first time length within the silence detection section after keyword detection and for the case where the utterance section ends without a keyword being detected. When the keyword is detected, the user has most likely given a clear answer to the question. The response unit 24 therefore uses an affirmative backchannel response, such as "Yes", "I see", or "Understood", when a silent section continues for the first time length within the silence detection section after keyword detection. When the utterance section ends without a keyword being detected, on the other hand, the user is presumed either to have given an answer different from the expected one or to have put off answering and uttered something vague (for example, "Hmm" or "What should I do?"). In that case, the response unit 24 may use a backchannel response that prompts an answer, such as "Go ahead" or "Could you say that again?". The dialogue device 1 can thereby show the user that it is processing the user's utterance, whatever the state of the user's answer.
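Choosing between the two kinds of backchannel response, combined with random selection among several prepared responses, can be sketched as follows. The function name and the idea of passing a random generator are assumptions for illustration; the phrases are the examples given in the text.

```python
import random

AFFIRMATIVE = ["Yes", "I see", "Understood"]
PROMPTING = ["Go ahead", "Could you say that again?"]

def choose_backchannel(keyword_detected, rng=random):
    """Pick an affirmative response if the keyword was detected,
    otherwise a response that prompts the user to answer."""
    pool = AFFIRMATIVE if keyword_detected else PROMPTING
    return rng.choice(pool)

rng = random.Random(0)  # seeded only to make the example repeatable
after_keyword = choose_backchannel(True, rng)
without_keyword = choose_backchannel(False, rng)
print(after_keyword, "/", without_keyword)
```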

According to still another modification, the question and the keyword may be set in advance according to the environment in which the dialogue device 1 is used. In this case, the keyword setting unit 21 may be omitted. The processing unit 16 may then, for example, read the playback audio signal of the question from the storage unit 15 and output it to the speaker 14 via the D/A converter 13 when a predetermined operation is performed on the user interface of the dialogue device 1.

According to still another modification, if no keyword is detected during the utterance section, the response unit 24 need not output the playback audio signal of a backchannel response to the speaker 14 even when the end of the utterance section is detected. In this case, the result of the processing performed on the utterance content recognized by the speech recognition unit 25 for that utterance section may be reported to the user via the speaker 14 or the user interface.

A computer program that causes a computer to implement the functions of the processing unit of the dialogue device according to the above embodiment or modifications may be provided in a form recorded on a computer-readable medium such as a magnetic recording medium or an optical recording medium.

All examples and specific terms recited herein are intended for instructional purposes, to help the reader understand the present invention and the concepts contributed by the inventors to furthering the art, and are to be construed as not being limited to the configuration of any example in this specification, or to such specifically recited examples and conditions, that relates to showing the superiority or inferiority of the present invention. Although embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations can be made hereto without departing from the spirit and scope of the present invention.

Reference Signs List
1 dialogue device
11 microphone
12 analog/digital converter
13 digital/analog converter
14 speaker
15 storage unit
16 processing unit
21 keyword setting unit
22 utterance section start detection unit
23 keyword detection unit
24 response unit
25 speech recognition unit

Claims

A dialogue device comprising:
a keyword detection unit that detects a keyword uttered by a user from an utterance section, in which the user is speaking, of an audio signal representing the user's voice; and
a response unit that sets a predetermined section when the keyword is detected, and causes a first response voice to be output via a voice output unit when the audio signal remains continuously silent for a first time length within the predetermined section.

The dialogue device according to claim 1, wherein the response unit causes a second response voice to be output via the voice output unit when, after the start of the utterance section and while the keyword has not been detected, the audio signal remains continuously silent for a second time length longer than the first time length.

The dialogue device according to claim 2, wherein the second response voice is a voice different from the first response voice.

The dialogue device according to any one of claims 1 to 3, further comprising a keyword setting unit that causes a question voice directed to the user to be output via the voice output unit and sets the keyword according to the content of the question.

The dialogue device according to claim 4, wherein the response unit sets the length of the predetermined section in accordance with the set keyword.

A dialogue method comprising:
detecting, by a processor, a keyword uttered by a user from an utterance section, in which the user is speaking, of an audio signal representing the user's voice;
setting, by the processor, a predetermined section when the keyword is detected; and
causing, by the processor, a first response voice to be output via a voice output unit when the audio signal remains continuously silent for a first time length within the predetermined section.

A computer program for dialogue that causes a computer to execute:
detecting a keyword uttered by a user from an utterance section, in which the user is speaking, of an audio signal representing the user's voice;
setting a predetermined section when the keyword is detected; and
causing a first response voice to be output via a voice output unit when the audio signal remains continuously silent for a first time length within the predetermined section.
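
By way of non-limiting illustration only, the silence-timing logic recited in the claims above might be sketched as follows. The sketch assumes the audio signal has already been split into frames inside an utterance section, with each frame labelled silent or not. The class `ResponseTimer`, its field names, the frame-based units, and the choice to emit nothing once the predetermined section has expired after a keyword are all assumptions made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ResponseTimer:
    """Hypothetical frame-based model of the response unit's timing rules."""
    first_len: int     # silent frames needed for the first response (keyword case)
    second_len: int    # silent frames needed for the second response (no keyword)
    section_len: int   # length of the predetermined section, in frames
    silent_run: int = 0
    section_left: int = 0
    keyword_seen: bool = False

    def step(self, is_silent: bool, keyword_now: bool = False) -> Optional[str]:
        """Process one frame and return "first", "second", or None."""
        if keyword_now:
            # Keyword detected: open the predetermined section.
            self.keyword_seen = True
            self.section_left = self.section_len
        # Track how long the signal has been continuously silent.
        self.silent_run = self.silent_run + 1 if is_silent else 0

        response: Optional[str] = None
        if self.section_left > 0:
            # Inside the predetermined section: the shorter threshold applies.
            if self.silent_run >= self.first_len:
                response = "first"
        elif not self.keyword_seen and self.silent_run >= self.second_len:
            # No keyword since the utterance started: the longer threshold applies.
            response = "second"

        if response is not None:
            self.silent_run = 0  # assumption: restart counting after a response
        if self.section_left > 0:
            self.section_left -= 1
        return response


def run(timer: ResponseTimer, frames: List[Tuple[bool, bool]]) -> List[Optional[str]]:
    """Feed (is_silent, keyword_now) frames to the timer and collect responses."""
    return [timer.step(silent, keyword) for silent, keyword in frames]


if __name__ == "__main__":
    with_keyword = ResponseTimer(first_len=2, second_len=4, section_len=5)
    print(run(with_keyword, [(False, True), (True, False), (True, False)]))
    without_keyword = ResponseTimer(first_len=2, second_len=4, section_len=5)
    print(run(without_keyword, [(True, False)] * 4))
```

With these example values, a keyword followed by two silent frames yields the first response, whereas without a keyword four silent frames are needed before the second response is produced. A keyword-dependent `section_len` could be supplied to model the dependent claim in which the length of the predetermined section is set according to the keyword.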