JP2018091911A - Voice interactive system and voice interactive method

Info

Publication number: JP2018091911A
Application number: JP2016233103A
Authority: JP
Inventors: 康貴田中; Yasutaka Tanaka; 美智子小川; Michiko Ogawa; 西蔵羽山; Nishikura Hayama
Original assignee: Sohgo Security Services Co Ltd
Current assignee: Sohgo Security Services Co Ltd
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2018-06-14
Anticipated expiration: 2036-11-30
Also published as: JP6748565B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive system and a voice interactive method for performing smooth voice interaction with a user. SOLUTION: A voice interactive robot 10 operates by switching between a speech mode for outputting voice and a voice recognition mode for recognizing the voice of a user. In the speech mode, the voice interactive robot 10 outputs the voice from a speaker, terminates the speech mode when the output of the voice is completed, and shifts to the voice recognition mode. In the speech mode, the voice interactive robot 10 does not recognize the voice of the user, but collects sound with a microphone and calculates the similarity to previously registered feature data of the device's own voice. When the similarity falls to or below a threshold before the output of the voice is completed, the robot regards this as detection of the user's speech, stops the output of the voice partway, and terminates the speech mode. SELECTED DRAWING: Figure 1

Description

この発明は、音声の入力を受け付け、受け付けた音声に応じて音声の出力を行う音声対話システム及び音声対話方法に関する。 The present invention relates to a voice dialogue system and a voice dialogue method for receiving voice input and outputting voice according to the received voice.

従来、ユーザの音声を認識し、認識の結果に対応した内容の音声を出力することでユーザとの音声対話を行う音声対話システムが知られている。かかる音声対話システムは、通信回線を介した自動応答や、携帯端末上でのユーザ支援などに用いることができる他、ロボットへの搭載も可能である。音声対話システムを搭載したロボットは、会話をユーザとのインタフェースとして利用可能であり、警備、店舗スタッフの補助、個人の生活支援やエンターテインメントなど、多様なシチュエーションにおいて運用することができる。 Conventionally, there has been known a voice dialogue system that recognizes a user's voice and outputs a voice having contents corresponding to the recognition result to perform a voice dialogue with the user. Such a voice dialogue system can be used for automatic response via a communication line, user support on a portable terminal, and the like, and can also be mounted on a robot. A robot equipped with a voice dialogue system can use conversation as an interface with a user, and can be used in various situations such as security, assistance of store staff, personal life support, and entertainment.

ここで、ユーザとの音声対話を行う場合には、出力音声と入力音声の分離が重要となる。システム側からの出力音声が入力音声に含まれると、自システムの出力音声をユーザの音声と誤認識するという問題が生じるためである。そこで、システム側が音声を出力する発話モードとユーザ音声を認識する音声認識モードとを切り替えることで、自システムの出力音声による誤認識を防ぐことが行われている。 Here, when performing a voice dialogue with the user, it is important to separate the output voice from the input voice. This is because, if the output voice from the system side is included in the input voice, the system may erroneously recognize its own output voice as the user's voice. Therefore, erroneous recognition caused by the system's own output voice is conventionally prevented by switching between a speech mode in which the system outputs voice and a voice recognition mode in which the user's voice is recognized.

発話モードと音声認識モードとを切り替える構成では、システム側の発話モード中にユーザが発言をしてもその発言は認識されない。そのため、ユーザはシステム側からの音声の出力が完了するのを待って発言することになる。しかし、ユーザが音声対話システムに不慣れである場合等には、システム側からの音声の出力中に発言を行うことがある。 In the configuration in which the speech mode and the speech recognition mode are switched, even if the user speaks during the system-side speech mode, the speech is not recognized. Therefore, the user speaks after waiting for the completion of the voice output from the system side. However, when the user is unfamiliar with the voice interaction system, the user may speak during the output of the voice from the system side.

そこで、特許文献１は、発話中にもユーザの音声を認識する音声認識装置を備えたロボットを開示している。特許文献１が開示する音声認識装置は、音声の出力開始から所定時間後にユーザの音声認識を開始するとともに、マイクで集音した音声から自装置の出力音声相当分を相関演算により除去する出力音声除去部を設けることで、音声の出力を音声の認識を並行して行っている。 Therefore, Patent Document 1 discloses a robot provided with a voice recognition device that recognizes a user's voice even while the robot is speaking. The voice recognition device disclosed in Patent Document 1 starts recognizing the user's voice a predetermined time after the start of voice output, and is provided with an output voice removal unit that uses a correlation calculation to remove, from the sound collected by a microphone, the portion corresponding to the device's own output voice, so that voice output and voice recognition are performed in parallel.

特開２００７−１５５９８６号公報 JP 2007-155986 A

しかしながら、上記特許文献１に代表される従来の技術を用いたとしても、ユーザとの対話を円滑に行うことは困難であった。上記特許文献１のように、出力音声相当分を相関演算により除去するよう構成しても、音の反射環境、ノイズ状況、ひずみなどの要因によって出力音声の除去を完全に行うことはできず、誤認識を充分に防ぐことはできないのである。 However, even with conventional techniques such as that of Patent Document 1, it has been difficult to conduct a smooth dialogue with the user. Even if the portion corresponding to the output voice is removed by correlation calculation as in Patent Document 1, the output voice cannot be removed completely because of factors such as the sound reflection environment, noise conditions, and distortion, and erroneous recognition cannot be sufficiently prevented.

また、音声の出力と音声の認識を並行して行った場合には、ユーザは自身の発言がシステム側で認識されているかを把握できず、発言を続けるべきか、システム側からの音声の出力の完了を待つべきかを判断することができない。特に、対話が高度化し、システム側から出力される音声が長くなると、システム側からの音声の出力が完了するまでユーザに待機させることは、円滑な対話を大きく損なうこととなる。 In addition, when voice output and voice recognition are performed in parallel, the user cannot tell whether his or her speech is being recognized by the system, and therefore cannot judge whether to keep speaking or to wait for the system's voice output to finish. In particular, as dialogues become more sophisticated and the voice output from the system becomes longer, making the user wait until the system's voice output is completed greatly impairs smooth dialogue.

これらのことから、ユーザとの円滑な音声対話をいかにして実現するかが重要な課題となっていた。かかる課題は、マイクとスピーカを離して設置することが困難なロボットに音声対話システムを搭載するケースで顕著となるが、通信回線を介した自動応答や携帯端末上でのユーザ支援などに音声対話システムを用いる場合にも同様に生ずる。 For these reasons, how to realize a smooth voice dialogue with the user has become an important issue. This issue is most pronounced when a voice dialogue system is mounted on a robot, where it is difficult to place the microphone and the speaker apart from each other, but it arises in the same way when a voice dialogue system is used for automatic response via a communication line or for user support on a portable terminal.

本発明は、上記の従来技術の課題を解決するためになされたものであって、ユーザと円滑な音声対話を行う音声対話システム及び音声対話方法を提供することを目的とする。 The present invention has been made in order to solve the above-described problems of the prior art, and an object thereof is to provide a voice dialogue system and a voice dialogue method for performing a smooth voice dialogue with a user.

上述した課題を解決し、目的を達成するため、請求項１に記載の発明は、音声の入力を受け付ける入力受付部と、前記入力受付部により受け付けた入力音声に応じて音声の出力を行う出力処理部とを備えた音声対話システムであって、前記出力処理部により出力される出力音声を自己音声として登録する登録部と、前記出力処理部による音声の出力中に、前記入力音声と前記自己音声との類似度を算出する類似度算出部と、前記類似度算出部により算出された類似度に基づいて、前記出力処理部による音声の出力を停止するか否かを制御する動作制御部とを備えたことを特徴とする。 In order to solve the above-described problems and achieve the object, the invention according to claim 1 is a voice dialogue system comprising an input receiving unit that receives voice input and an output processing unit that outputs voice according to the input voice received by the input receiving unit, the system further comprising: a registration unit that registers the output voice output by the output processing unit as self-speech; a similarity calculation unit that calculates the similarity between the input voice and the self-speech while the output processing unit is outputting voice; and an operation control unit that controls whether or not to stop the voice output by the output processing unit based on the similarity calculated by the similarity calculation unit.

また、請求項２に記載の発明は、請求項１に記載の発明において、前記入力受付部により受け付けた入力音声に対して音声認識を行う音声認識部をさらに備え、前記出力処理部は、前記音声認識部による音声認識の結果に応じて出力する音声の内容を決定し、前記動作制御部は、前記音声認識部による音声認識を行う音声認識モードと、前記出力処理部による音声の出力を行う発話モードとを切り替える制御を行うことを特徴とする。 The invention according to claim 2 is the invention according to claim 1, further comprising a voice recognition unit that performs voice recognition on the input voice received by the input receiving unit, wherein the output processing unit determines the content of the voice to be output according to the result of the voice recognition by the voice recognition unit, and the operation control unit performs control to switch between a voice recognition mode in which the voice recognition unit performs voice recognition and a speech mode in which the output processing unit outputs voice.

また、請求項３に記載の発明は、請求項２に記載の発明において、前記動作制御部は、前記出力処理部による音声の出力が完了するか、前記類似度に基づいて前記音声の出力を停止した場合に前記発話モードから前記音声認識モードに切り替えることを特徴とする。 The invention according to claim 3 is the invention according to claim 2, wherein the operation control unit switches from the speech mode to the voice recognition mode when the voice output by the output processing unit is completed or when the voice output is stopped based on the similarity.

また、請求項４に記載の発明は、請求項１〜３のいずれか一つに記載の発明において、前記出力処理部は、前記類似度に基づいて前記音声の出力を停止する場合に、音声の出力の停止に対応する特定の音声を出力した上で音声の出力を停止することを特徴とする。 The invention according to claim 4 is the invention according to any one of claims 1 to 3, wherein, when stopping the voice output based on the similarity, the output processing unit outputs a specific voice corresponding to the stopping of the voice output and then stops the voice output.

また、請求項５に記載の発明は、請求項１〜４のいずれか一つに記載の発明において、前記登録部は、前記出力音声の周波数に係る特徴を分析して生成した特徴データを前記自己音声として登録し、前記類似度算出部は、前記入力音声の周波数に係る特徴を分析して生成した特徴データと前記自己音声として登録した特徴データとの類似度を算出することを特徴とする。 The invention according to claim 5 is the invention according to any one of claims 1 to 4, wherein the registration unit registers, as the self-speech, feature data generated by analyzing frequency-related features of the output voice, and the similarity calculation unit calculates the similarity between feature data generated by analyzing frequency-related features of the input voice and the feature data registered as the self-speech.

また、請求項６に記載の発明は、請求項１〜５のいずれか一つに記載の発明において、前記登録部は、前記自己音声以外の所定の音声を他者音声としてさらに登録し、前記動作制御部は、前記入力音声と前記自己音声との類似度が閾値以下となった場合に、前記入力音声と前記他者音声との類似度に応じて前記出力処理部による音声の出力を停止するか否かを決定することを特徴とする。 The invention according to claim 6 is the invention according to any one of claims 1 to 5, wherein the registration unit further registers a predetermined voice other than the self-speech as the other person's voice, and, when the similarity between the input voice and the self-speech falls to or below a threshold, the operation control unit determines whether or not to stop the voice output by the output processing unit according to the similarity between the input voice and the other person's voice.

また、請求項７に記載の発明は、請求項６に記載の発明において、前記入力受付部と同一の筐体に設けられ、物理的な動作を行うアクチュエータをさらに備え、前記登録部は、前記アクチュエータの動作によって生じる音を前記他者音声として登録することを特徴とする。 The invention according to claim 7 is the invention according to claim 6, further comprising an actuator that is provided in the same housing as the input receiving unit and performs a physical operation, wherein the registration unit registers sound generated by the operation of the actuator as the other person's voice.

また、請求項８に記載の発明は、請求項１〜６のいずれか一つに記載の発明において、前記入力受付部と同一の筐体に設けられ、物理的な動作を行うアクチュエータをさらに備え、前記登録部は、前記アクチュエータの動作によって生じる音と前記出力処理部により出力される出力音声とが合成された音声を自己音声として登録することを特徴とする。 The invention according to claim 8 is the invention according to any one of claims 1 to 6, further comprising an actuator that is provided in the same casing as the input receiving unit and performs a physical operation. The registration unit registers as a self-sound a synthesized voice of a sound generated by the operation of the actuator and an output voice output from the output processing unit.

また、請求項９に記載の発明は、音声の入力を受け付ける入力受付部と、前記入力受付部により受け付けた入力音声に応じて音声の出力を行う出力処理部とを備えた音声対話システムの音声対話方法であって、前記出力処理部により出力される出力音声を自己音声として登録する登録ステップと、前記出力処理部による音声の出力中に、前記入力音声と前記自己音声との類似度を算出する類似度算出ステップと、前記類似度算出ステップにより算出された類似度に基づいて、前記出力処理部による音声の出力を停止するか否かを制御する動作制御ステップとを含むことを特徴とする。 The invention according to claim 9 is a voice dialogue method for a voice dialogue system comprising an input receiving unit that receives voice input and an output processing unit that outputs voice according to the input voice received by the input receiving unit, the method including: a registration step of registering the output voice output by the output processing unit as self-speech; a similarity calculation step of calculating the similarity between the input voice and the self-speech while the output processing unit is outputting voice; and an operation control step of controlling whether or not to stop the voice output by the output processing unit based on the similarity calculated in the similarity calculation step.

本発明によれば、出力処理部により出力される出力音声を自己音声として登録し、出力処理部による音声の出力中に、入力音声と自己音声との類似度を算出し、類似度算出部により算出された類似度に基づいて出力処理部による音声の出力を停止するか否かを制御するよう構成したため、ユーザと円滑な音声対話を行うことができる。 According to the present invention, the output sound output by the output processing unit is registered as self-speech, the similarity between the input sound and the self-sound is calculated during the output of the sound by the output processing unit, and the similarity calculation unit Since it is configured to control whether or not to stop outputting the voice by the output processing unit based on the calculated similarity, a smooth voice conversation with the user can be performed.

図１は、本実施例１に係る音声対話システムの概念の説明図である。 FIG. 1 is an explanatory diagram of the concept of the voice interaction system according to the first embodiment.
図２は、図１に示した音声対話ロボットの構成を示す構成図である。 FIG. 2 is a block diagram showing the configuration of the voice interactive robot shown in FIG. 1.
図３は、ユーザの発話による類似度の低下についての説明図である。 FIG. 3 is an explanatory diagram of a decrease in similarity due to the user's utterance.
図４は、自己音声特徴データの登録処理の処理手順を示すフローチャートである。 FIG. 4 is a flowchart showing a processing procedure of self-voice feature data registration processing.
図５は、音声認識モードの処理手順を示すフローチャートである。 FIG. 5 is a flowchart showing a processing procedure in the voice recognition mode.
図６は、発話モードの処理手順を示すフローチャートである。 FIG. 6 is a flowchart showing the processing procedure of the speech mode.
図７は、本実施例２に係る音声対話ロボットの動作についての説明図である。 FIG. 7 is an explanatory diagram of the operation of the voice interactive robot according to the second embodiment.
図８は、図７に示した音声対話ロボットの構成を示す構成図である。 FIG. 8 is a block diagram showing the configuration of the voice interactive robot shown in FIG. 7.
図９は、本実施例２における発話モードの処理手順を示すフローチャートである。 FIG. 9 is a flowchart illustrating the processing procedure of the speech mode in the second embodiment.

以下に、添付図面を参照して、本発明に係る音声対話システム及び音声対話方法の好適な実施例を詳細に説明する。 Exemplary embodiments of a voice interaction system and a voice interaction method according to the present invention will be described below in detail with reference to the accompanying drawings.

まず、本実施例１に係る音声対話システムの概念について説明する。図１は、本実施例１に係る音声対話システムの概念の説明図である。本実施例１では、音声対話システムを搭載したロボットである音声対話ロボット１０が、ユーザの音声を認識し、認識の結果に対応した内容の音声を出力することでユーザとの音声対話を行う。 First, the concept of the voice interaction system according to the first embodiment will be described. FIG. 1 is an explanatory diagram of the concept of the voice interaction system according to the first embodiment. In the first embodiment, a voice dialogue robot 10 which is a robot equipped with a voice dialogue system recognizes a user's voice and outputs a voice having contents corresponding to the recognition result, thereby performing a voice dialogue with the user.

音声対話ロボット１０は、後述するようにスピーカ１１とマイク１２を備えており、スピーカ１１から音声の出力を行う発話モードと、ユーザの音声をマイク１２により集音して音声認識する音声認識モードとを切り替えて動作する。 The voice interactive robot 10 includes a speaker 11 and a microphone 12, as will be described later, and operates by switching between an utterance mode, in which voice is output from the speaker 11, and a voice recognition mode, in which the user's voice is collected by the microphone 12 and recognized.

発話モードにおいては、音声対話ロボット１０は、スピーカ１１から音声の出力を行い、音声の出力が完了した場合に発話モードを終了して音声認識モードに移行する。音声対話ロボット１０は、発話モードではユーザの音声認識は行わないが、マイク１２により集音を行い、事前に登録した自装置の音声の特徴データとの類似度を算出する。 In the utterance mode, the voice interaction robot 10 outputs a voice from the speaker 11, and ends the utterance mode and shifts to the voice recognition mode when the output of the voice is completed. The voice interactive robot 10 does not recognize the user's voice in the utterance mode, but collects the sound with the microphone 12 and calculates the similarity with the feature data of the voice of the own device registered in advance.

音声対話ロボット１０が音声を出力し、ユーザが発話していない状態では、マイク１２は音声対話ロボット１０の音声を集音することになり、事前に登録した自装置の音声の特徴データとの類似度は高い値となる。 When the voice dialogue robot 10 is outputting voice and the user is not speaking, the microphone 12 collects the voice of the voice dialogue robot 10, so the similarity to the feature data of the device's own voice registered in advance takes a high value.

一方、音声対話ロボット１０による音声の出力中にユーザが発話を行うと、マイク１２が集音する音声は、音声対話ロボット１０の音声とユーザの音声とが混じった合成音声となるので、事前に登録した自装置の音声の特徴データとの類似度が低下する。 On the other hand, if the user speaks during the output of the voice by the voice dialogue robot 10, the voice collected by the microphone 12 becomes a synthesized voice in which the voice of the voice dialogue robot 10 and the user's voice are mixed. The similarity with the registered voice feature data of the own device is lowered.

音声対話ロボット１０は、音声の出力の完了前に類似度が閾値以下となった場合には、ユーザの発話を検知したとして、音声の出力を途中で停止し、発話モードを終了する。すなわち、この場合には、発話モードは中断により終了して音声認識モードに移行することになる。 If the similarity is equal to or less than the threshold before the completion of the voice output, the voice interactive robot 10 detects the user's utterance, stops the voice output halfway, and ends the utterance mode. That is, in this case, the utterance mode is terminated due to the interruption and shifts to the voice recognition mode.

このように、音声対話ロボット１０は、スピーカ１１により出力される自装置の音声の特徴データを事前に登録し、発話モードにおける音声の出力中にマイク１２により集音した音声と自装置の音声の特徴データとの類似度を算出し、類似度が閾値以下となった場合には発話モードを中断して音声認識モードに移行する。このため、ユーザが発話した場合には、速やかに音声認識モードに移行してユーザの音声を認識することができ、円滑な音声対話を行うことができる。 As described above, the voice interactive robot 10 registers in advance the feature data of the sound of the own device output from the speaker 11, and the sound collected by the microphone 12 during the output of the sound in the speech mode and the sound of the own device. The similarity with the feature data is calculated, and when the similarity is less than or equal to the threshold, the speech mode is interrupted and the speech recognition mode is entered. For this reason, when a user speaks, it can change to voice recognition mode quickly and can recognize a user's voice, and can carry out smooth voice conversation.
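The speech-mode behavior summarized above can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the function name `run_speech_mode`, the `play` callback standing in for the speaker 11, and the `similarity_now` callback standing in for the microphone-side similarity measurement are all hypothetical, and the sketch reacts to a single low score, whereas the embodiment described later requires several consecutive low-similarity frames.

```python
from typing import Callable, Sequence, Tuple


def run_speech_mode(
    chunks: Sequence[bytes],
    play: Callable[[bytes], None],
    similarity_now: Callable[[], float],
    threshold: float,
) -> Tuple[int, bool]:
    """Play output chunks until all are played or the similarity drops.

    Returns (number_of_chunks_played, interrupted). In both cases the caller
    is expected to switch to the voice recognition mode afterwards.
    """
    for played, chunk in enumerate(chunks, start=1):
        play(chunk)  # hypothetical speaker call
        # Compare what the microphone hears with the registered own voice.
        if similarity_now() <= threshold:
            # Similarity fell to or below the threshold: assume the user spoke
            # and stop the output partway.
            return played, True
    # The whole output was played without a drop in similarity.
    return len(chunks), False
```

With scores of 0.9, 0.3, 0.9 and a threshold of 0.5, the sketch stops after the second chunk and reports an interruption.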

また、音声認識モードでは自装置の音声を集音することがないため、自装置の音声による誤認識を防止することができる。さらに、ユーザは自身の発言が音声対話ロボット１０により認識されていることを把握できるため、ストレス無く発言を行うことができる。音声対話ロボット１０からの音声の出力と、ユーザの発話とが同時に行われると、ユーザにとって自身の発言が音声対話ロボット１０に認識されているか否かがが不明確となるが、音声対話ロボット１０が音声の出力を中断すればユーザの発話を認識する状態に移行したとユーザが認識するからである。 Further, since the voice of the own device is not collected in the voice recognition mode, erroneous recognition due to the voice of the own device can be prevented. Furthermore, since the user can grasp that his / her speech is recognized by the voice interaction robot 10, he / she can speak without stress. When the voice output from the voice interaction robot 10 and the user's utterance are performed simultaneously, it is unclear to the user whether or not his / her speech is recognized by the voice dialogue robot 10. This is because if the output of the voice is interrupted, the user recognizes that the state has shifted to a state in which the user's speech is recognized.

次に、図１に示した音声対話ロボット１０の構成について説明する。図２は、図１に示した音声対話ロボット１０の構成を示す構成図である。図２に示すように、音声対話ロボット１０は、スピーカ１１、マイク１２、操作部１３、アクチュエータ１４、記憶部１５及び制御部１６を有する。 Next, the configuration of the voice interactive robot 10 shown in FIG. 1 will be described. FIG. 2 is a block diagram showing the configuration of the voice interactive robot 10 shown in FIG. As shown in FIG. 2, the voice interaction robot 10 includes a speaker 11, a microphone 12, an operation unit 13, an actuator 14, a storage unit 15, and a control unit 16.

スピーカ１１は、音声対話ロボット１０による音声の出力に用いられる。マイク１２は、周囲の音を集音することで、ユーザの音声の入力を受け付ける入力受付部として機能する。操作部１３は、ボタン等により操作入力の受付を行う。なお、ボタンの操作入力に限らず、タブレットなどからの遠隔操作や、ジェスチャーの認識による操作受付を可能としてもよい。 The speaker 11 is used for outputting voice by the voice interactive robot 10. The microphone 12 functions as an input reception unit that receives an input of a user's voice by collecting ambient sounds. The operation unit 13 receives an operation input using a button or the like. The operation input is not limited to button operation input, and remote operation from a tablet or the like, or operation reception by gesture recognition may be possible.

アクチュエータ１４は、音声対話ロボット１０に物理的な動作を行わせるために用いられる。具体的には、音声対話ロボット１０の腕や首に相当する部材の動作、表情を示す部材の動作がアクチュエータ１４の駆動により制御される。ここでは、人型や動物型のロボットを想定しているが、音声対話ロボット１０の形状は任意に設計可能であり、アクチュエータ１４は、音声対話ロボット１０の物理的な動作に広く用いることができる。 The actuator 14 is used for causing the voice interactive robot 10 to perform a physical operation. Specifically, the operation of the member corresponding to the arm and neck of the voice interactive robot 10 and the operation of the member showing a facial expression are controlled by driving the actuator 14. Here, a human-type or animal-type robot is assumed, but the shape of the voice dialogue robot 10 can be arbitrarily designed, and the actuator 14 can be widely used for physical operations of the voice dialogue robot 10. .

記憶部１５は、ハードディスク装置や不揮発性メモリ等からなる記憶デバイスである。記憶部１５は、スピーカ１１により出力される自装置の音声の特徴データを自己音声特徴データ１５ａとして記憶する。 The storage unit 15 is a storage device composed of a hard disk device, a nonvolatile memory, or the like. The storage unit 15 stores the voice feature data of the own device output from the speaker 11 as the own voice feature data 15a.

制御部１６は、音声対話ロボット１０の全体を制御する制御部であり、音声認識部１６ａ、発話処理部１６ｂ、音声登録部１６ｃ、類似度算出部１６ｄ、類似度判定部１６ｅ、状態遷移部１６ｆ及びアクチュエータ駆動処理部１６ｇを有する。 The control unit 16 is a control unit that controls the entire voice dialogue robot 10, and includes a voice recognition unit 16a, an utterance processing unit 16b, a voice registration unit 16c, a similarity calculation unit 16d, a similarity determination unit 16e, and a state transition unit 16f. And an actuator drive processing unit 16g.

音声認識部１６ａは、音声認識モードにおいてユーザの音声を認識する処理を行う処理部である。具体的には、マイク１２が集音した入力音声からユーザの音声を抽出して分析し、ユーザによる発話の内容を特定する。 The voice recognition unit 16a is a processing unit that performs processing for recognizing the user's voice in the voice recognition mode. Specifically, the user's voice is extracted from the input voice collected by the microphone 12 and analyzed, and the content of the utterance by the user is specified.

発話処理部１６ｂは、発話モードにおいて音声の出力を行う出力処理部である。具体的には、音声認識部１６ａによりユーザの発話の内容が特定された場合に、特定された発話の内容に対して適切な応答の内容を決定し、決定した内容の出力音声をスピーカ１１から出力する。また、ユーザによる発話が行われていない状態で、特定の内容の出力音声をスピーカ１１から出力することも可能である。 The utterance processing unit 16b is an output processing unit that outputs voice in the utterance mode. Specifically, when the content of the user's utterance has been specified by the voice recognition unit 16a, the utterance processing unit 16b determines the content of an appropriate response to the specified utterance and outputs the output voice with the determined content from the speaker 11. It is also possible to output an output voice having specific content from the speaker 11 while the user is not speaking.

音声登録部１６ｃは、スピーカ１１から出力される自装置の音声、すなわち出力音声の特徴データを自己音声特徴データ１５ａとして記憶部１５に格納する処理を行う。特徴データは、例えば出力音声を周波数分析してその特徴を示すデータを生成することで得られる。具体的には、ＬＰＣ（Linear Predictive Coding）ケプストラム係数や、ＭＦＣＣ（Mel-Frequency Cepstrum Coefficient）等の任意の手法を用いることができる。 The voice registration unit 16c performs a process of storing the voice of the own device output from the speaker 11, that is, the feature data of the output voice in the storage unit 15 as the own voice feature data 15a. The feature data can be obtained, for example, by generating data indicating the feature by frequency analysis of the output speech. Specifically, an arbitrary method such as an LPC (Linear Predictive Coding) cepstrum coefficient or an MFCC (Mel-Frequency Cepstrum Coefficient) can be used.
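As a rough illustration of turning one audio frame into frequency-based feature data, the sketch below computes a plain real cepstrum (inverse DFT of the log magnitude spectrum). This is a deliberately simpler stand-in, not the LPC cepstrum or MFCC named in the text, and the function names, the Hamming window, and the choice of 12 coefficients are assumptions made for the example. It uses a naive O(n^2) DFT so that it runs with the standard library alone.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (fine for short demo frames)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame: Sequence[float], n_coeffs: int = 12) -> List[float]:
    """Return the first n_coeffs real-cepstrum coefficients of one frame."""
    n = len(frame)
    if n < 2:
        raise ValueError("frame must contain at least two samples")
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    # Log magnitude spectrum; the small constant avoids log(0).
    log_mag = [math.log(abs(c) + 1e-10) for c in dft(windowed)]
    # Inverse DFT of the log spectrum; only the real part is meaningful here.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(min(n_coeffs, n))
    ]
```

A real system would more likely use an optimized FFT and one of the feature types the text mentions; the point here is only that each frame is reduced to a short vector describing its spectral shape.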

類似度算出部１６ｄは、発話モードにおいてマイク１２が集音した入力音声と自己音声特徴データ１５ａとの類似度を算出する処理部である。具体的には、音声登録部１６ｃが出力音声から自己音声特徴データ１５ａを生成する際と同様の処理を入力音声に対して行うことで入力音声の特徴データを生成し、入力音声の特徴データと自己音声特徴データ１５ａとの類似度を算出することになる。 The similarity calculation unit 16d is a processing unit that calculates the similarity between the input voice collected by the microphone 12 in the speech mode and the self-speech feature data 15a. Specifically, it generates feature data of the input voice by applying to the input voice the same processing that the voice registration unit 16c uses to generate the self-speech feature data 15a from the output voice, and then calculates the similarity between the feature data of the input voice and the self-speech feature data 15a.

ここで、類似度算出部１６ｄは、マイク１２が集音した入力音声に対して周波数フィルタを施すことで、音声以外の音の影響を低減し、音声部分を抽出した上で、入力音声の特徴データを生成する。また、入力音声の特徴データの生成時には、入力音声から所定時間の部分音声を音声フレームとして複数切り出し、音声フレームごとに特徴データを生成する。従って、自己音声特徴データ１５ａとの類似度についても、複数の音声フレームについてそれぞれ算出される。 Here, the similarity calculation unit 16d applies a frequency filter to the input voice collected by the microphone 12 to reduce the influence of sounds other than voice, extracts the voice portion, and then generates the feature data of the input voice. When generating the feature data of the input voice, it cuts out a plurality of partial voices of a predetermined duration from the input voice as voice frames and generates feature data for each voice frame. Accordingly, the similarity to the self-speech feature data 15a is also calculated for each of the plurality of voice frames.
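The per-frame similarity calculation can be sketched as below. The source does not state which similarity measure is used, so cosine similarity is an assumption here, as are the function names, the hop size parameter, and the `extract` callback that stands in for whatever feature extraction the registration side uses. The frequency filtering step is omitted.

```python
import math
from typing import Callable, List, Sequence


def split_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut fixed-length frames out of the input, advancing by `hop` samples."""
    return [
        list(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def frame_similarities(
    samples: Sequence[float],
    registered: Sequence[float],
    extract: Callable[[Sequence[float]], Sequence[float]],
    frame_len: int,
    hop: int,
) -> List[float]:
    """Similarity of each input frame's features to the registered features."""
    return [
        cosine_similarity(extract(frame), registered)
        for frame in split_frames(samples, frame_len, hop)
    ]
```

Using the same `extract` function for both registration and comparison mirrors the text's requirement that the input voice be processed in the same way as the registered output voice.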

類似度判定部１６ｅは、類似度算出部１６ｄにより算出された類似度が閾値以下であるか否かを判定する処理を行う。類似度判定部１６ｅは、類似度が閾値以下となる音声フレームが一定数連続した場合に、ユーザの発話を検知したものとする。１つの音声フレームの長さと、ユーザの発話を検知するための音声フレームの数とを調整することで、突発的なノイズを除去し、適切にユーザの発話を検知することが可能である。 The similarity determination unit 16e performs a process of determining whether or not the similarity calculated by the similarity calculation unit 16d is equal to or less than a threshold value. It is assumed that the similarity determination unit 16e detects the user's utterance when a predetermined number of audio frames whose similarity is equal to or less than the threshold value continue. By adjusting the length of one voice frame and the number of voice frames for detecting the user's speech, it is possible to remove sudden noise and appropriately detect the user's speech.
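The consecutive-frame rule described above can be sketched as a simple run counter. The function name and parameter names are hypothetical; the threshold and the required number of frames are tuning values that the source leaves open.

```python
from typing import Iterable


def detect_user_speech(
    similarities: Iterable[float],
    threshold: float,
    min_consecutive: int,
) -> bool:
    """Return True once `min_consecutive` frames in a row are at or below threshold."""
    run = 0
    for score in similarities:
        if score <= threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            # A frame above the threshold breaks the run, so isolated dips
            # (for example a short burst of noise) do not trigger detection.
            run = 0
    return False
```

With a threshold of 0.5 and two required frames, the sequence 0.9, 0.2, 0.9, 0.2 is not treated as user speech, while 0.9, 0.2, 0.3 is.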

状態遷移部１６ｆは、発話モードと音声認識モードの切り替えを制御する動作制御部である。具体的には、状態遷移部１６ｆは、発話モードにおいて、発話処理部１６ｂが決定した内容の出力音声の出力が完了するか、ユーザの発話が検知された場合に、発話モードを終了して音声認識モードに移行させる。 The state transition unit 16f is an operation control unit that controls switching between the speech mode and the voice recognition mode. Specifically, in the utterance mode, when the output of the output voice whose content was determined by the utterance processing unit 16b is completed or when the user's utterance is detected, the state transition unit 16f ends the utterance mode and shifts to the voice recognition mode.

ユーザの発話により発話モードを終了する場合には、発話処理部１６ｂが決定した内容の出力音声の出力を途中で停止させて発話モードを終了する。なお、発話処理部１６ｂが決定した内容の出力音声の出力を途中で停止した後、特定の音声を出力させた上で発話モードを終了しても良い。この特定の音声には、例えば「どうされましたか？」などのように、音声対話ロボット１０がユーザの音声を認識する状態に移行することをユーザに伝え、ユーザの発話を促す内容の音声を用いる。 When the utterance mode is ended because of the user's utterance, the output of the output voice whose content was determined by the utterance processing unit 16b is stopped partway and the utterance mode is ended. Alternatively, after that output is stopped partway, a specific voice may be output before the utterance mode is ended. For this specific voice, a voice is used whose content tells the user that the voice dialogue robot 10 is shifting to a state of recognizing the user's voice and prompts the user to speak, for example, "What is the matter?"

また、状態遷移部１６ｆは、音声認識モードにおいて、ユーザの発話の終了を検知した場合に、音声認識モードを終了して発話モードに移行させる。ユーザの発話の終了は、例えば「無音の状態が所定時間連続した」などの条件により検知すればよい。 Further, when the state transition unit 16f detects the end of the user's utterance in the voice recognition mode, the state transition unit 16f ends the voice recognition mode and shifts to the utterance mode. The end of the user's utterance may be detected based on a condition such as “silent state continues for a predetermined time”.
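The mode switching performed by the state transition unit 16f amounts to a small two-state machine, which might be sketched as follows. The enum and event names are illustrative assumptions; only the three transitions themselves come from the description.

```python
from enum import Enum, auto


class Mode(Enum):
    SPEECH = auto()        # utterance (speech) mode
    RECOGNITION = auto()   # voice recognition mode


class Event(Enum):
    OUTPUT_COMPLETED = auto()      # the robot finished its output voice
    USER_SPEECH_DETECTED = auto()  # similarity stayed low during output
    USER_SPEECH_ENDED = auto()     # e.g. silence lasted a predetermined time


# Transitions described in the text; any other (mode, event) pair keeps the mode.
TRANSITIONS = {
    (Mode.SPEECH, Event.OUTPUT_COMPLETED): Mode.RECOGNITION,
    (Mode.SPEECH, Event.USER_SPEECH_DETECTED): Mode.RECOGNITION,
    (Mode.RECOGNITION, Event.USER_SPEECH_ENDED): Mode.SPEECH,
}


def next_mode(mode: Mode, event: Event) -> Mode:
    """Return the mode that follows `mode` when `event` occurs."""
    return TRANSITIONS.get((mode, event), mode)
```

Keeping the transitions in a table makes it easy to see that both ways of leaving the speech mode, normal completion and interruption, lead to the same recognition mode.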

アクチュエータ駆動処理部１６ｇは、アクチュエータ１４の駆動を制御する処理部である。アクチュエータ１４は、例えば音声対話ロボット１０の発話の内容などに合わせて駆動される。かかるアクチュエータ１４の制御により、発話時の身振りや表情の変化を摸した動作を行わせることができる。この他、ユーザの発話に対する相槌や、音声対話ロボット１０の移動にもアクチュエータ１４の駆動制御を用いることができる。 The actuator drive processing unit 16g is a processing unit that controls driving of the actuator 14. The actuator 14 is driven in accordance with, for example, the content of the speech of the voice interactive robot 10. By controlling the actuator 14 in this way, the robot can be made to perform motions that imitate gestures and changes in facial expression during speech. In addition, the drive control of the actuator 14 can be used for back-channel responses, such as nodding, to the user's utterance and for moving the voice interactive robot 10.

図３は、ユーザの発話による類似度の低下についての説明図である。図３に示すように、音声対話ロボット１０が発話している区間では、音声対話ロボット１０の出力音声が入力音声に含まれ、ユーザが発話している区間では、ユーザの音声が入力音声に含まれる。このため、音声対話ロボット１０の発話区間とユーザの発話区間が重複する区間では、出力音声とユーザの音声の双方が入力音声に含まれることになる。 FIG. 3 is an explanatory diagram of a decrease in similarity due to the user's utterance. As shown in FIG. 3, in the section where the voice dialogue robot 10 is speaking, the output voice of the voice dialogue robot 10 is included in the input voice, and in the section where the user is speaking, the user's voice is included in the input voice. For this reason, in the section where the speech section of the voice interactive robot 10 and the user's speech section overlap, both the output voice and the user's voice are included in the input voice.

従って、入力音声の特徴データと自己音声特徴データ１５ａとの類似度を算出すると、音声対話ロボット１０のみが発話している区間では類似度は閾値を超えた値となるが、ユーザが発話している区間では、類似度が低下して閾値以下となる。 Therefore, when the similarity between the feature data of the input speech and the self-speech feature data 15a is calculated, the similarity exceeds a threshold value in the section where only the voice interactive robot 10 is speaking, but the user speaks In a certain section, the degree of similarity decreases and becomes below the threshold.

次に、音声対話ロボット１０の処理手順について説明する。図４は、自己音声特徴データ１５ａの登録処理の処理手順を示すフローチャートである。まず、音声登録部１６ｃは、操作部１３への操作入力などにより、登録モードを開始する（ステップＳ１０１）。 Next, a processing procedure of the voice interactive robot 10 will be described. FIG. 4 is a flowchart showing a processing procedure of registration processing of the self-speech feature data 15a. First, the voice registration unit 16c starts a registration mode by an operation input to the operation unit 13 or the like (step S101).

登録モードの開始後、音声登録部１６ｃは、登録対象の音声を取得する（ステップＳ１０２）。この登録対象の音声の取得は、例えばスピーカ１１から音声の出力を行い、マイク１２により集音することで行う。また、予め他の装置で取得された音声データを受け付けても良い。 After the registration mode is started, the voice registration unit 16c acquires the voice to be registered (step S102). The voice to be registered is acquired, for example, by outputting voice from the speaker 11 and collecting it with the microphone 12. Alternatively, voice data acquired in advance by another device may be accepted.

スピーカ１１から音声の出力を行ってスピーカ１１により集音する場合には、ノイズの少ない環境で行うことが望ましい。若しくは、音声対話ロボット１０を運用する実環境で登録対象の音声の取得を行ってもよい。さらに、アクチュエータ１４を動作させつつ登録対象の音声の取得を行えば、アクチュエータ１４の駆動音と出力音とが合成された音声を登録することができる。 When voice is output from the speaker 11 and collected by the microphone 12, it is desirable to do so in an environment with little noise. Alternatively, the voice to be registered may be acquired in the actual environment in which the voice interactive robot 10 is operated. Furthermore, if the voice to be registered is acquired while the actuator 14 is operating, a voice in which the drive sound of the actuator 14 and the output sound are combined can be registered.

The voice registration unit 16c calculates the feature data of the acquired voice (step S103), registers it in the storage unit 15 as the self-speech feature data 15a (step S104), and ends the registration mode (step S105).
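As an illustration of steps S103 and S104, the following sketch computes frequency-related feature data and stores it. It assumes that the feature data is a vector of normalized band energies obtained from a naive discrete Fourier transform, and that the storage unit 15 can be modeled as a dictionary; neither detail is stated in the description. The names `band_energies`, `register_voice`, and the key `"self_15a"` are hypothetical.

```python
import cmath
import math

def band_energies(samples, n_bands=4):
    """Normalized energy per frequency band, using a naive O(n^2) DFT.

    Only the first half of the spectrum is used; bins that do not fill a
    whole band at the top of the range are ignored.
    """
    n = len(samples)
    half = n // 2
    power = []
    for k in range(half):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power.append(abs(coeff) ** 2)
    width = half // n_bands
    bands = [sum(power[i * width:(i + 1) * width]) for i in range(n_bands)]
    total = sum(bands)
    if total == 0:
        return [0.0] * n_bands  # silent or too-short input
    return [b / total for b in bands]

def register_voice(storage, key, samples):
    """Steps S103-S104: compute feature data and store it under `key`."""
    storage[key] = band_energies(samples)
    return storage[key]

# Usage: register a synthetic low-frequency tone as the self-voice profile.
storage_unit = {}
tone = [math.sin(2 * math.pi * 2 * t / 32) for t in range(32)]
register_voice(storage_unit, "self_15a", tone)
```

The same function could register a recording made while the actuator is running, in which case the stored profile would reflect the combined drive sound and output voice mentioned above.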

FIG. 5 is a flowchart showing the processing procedure of the voice recognition mode. First, when the state transition unit 16f starts the voice recognition mode (step S201), the voice recognition unit 16a acquires the sound collected by the microphone 12 as input voice (step S202). The state transition unit 16f then determines whether or not the user's utterance has ended (step S203). The end of the user's utterance may be detected by a condition such as "silence has continued for a predetermined time".

If the user's utterance has not ended (step S203; No), the voice recognition unit 16a returns to step S202 and continues to acquire the input voice. If the user's utterance has ended (step S203; Yes), the voice recognition unit 16a performs voice recognition processing on the acquired input voice (step S204). This voice recognition processing identifies the content of the user's utterance. The utterance processing unit 16b determines the content of an appropriate response to the identified content of the user's utterance (step S205).

Thereafter, the voice recognition unit 16a ends the voice recognition mode (step S206), and the state transition unit 16f shifts from the voice recognition mode to the speech mode (step S207).
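The acquisition loop of steps S202 and S203 can be sketched as follows. The sketch assumes that the input arrives as a sequence of frames (lists of samples) and that "silence" means mean squared amplitude below a fixed value; the function name `collect_utterance` and the parameters `energy_threshold` and `silent_frames_to_end` are hypothetical and stand in for the "predetermined time" mentioned in the description.

```python
def collect_utterance(frames, energy_threshold=0.01, silent_frames_to_end=3):
    """Accumulate frames until silence persists (loop S202-S203).

    Returns (voiced_frames, ended). `ended` is True when the silence
    condition was met; the trailing silent frames are then dropped.
    """
    collected = []
    silent_run = 0
    for frame in frames:
        collected.append(frame)
        energy = sum(x * x for x in frame) / len(frame) if frame else 0.0
        # Count consecutive silent frames; any non-silent frame resets the run.
        silent_run = silent_run + 1 if energy < energy_threshold else 0
        if silent_run >= silent_frames_to_end:
            return collected[:-silent_run], True
    return collected, False
```

Once `ended` is True, the collected frames would be handed to recognition (step S204); until then the caller simply keeps feeding frames.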

FIG. 6 is a flowchart showing the processing procedure of the speech mode. First, when the state transition unit 16f starts the speech mode (step S301), the utterance processing unit 16b outputs voice from the speaker 11 (step S302). The content of the voice output from the speaker 11 is determined according to the content of the user's utterance. Alternatively, specific content prepared in advance for output when the user has not spoken may be used.

In addition, the similarity calculation unit 16d acquires the sound collected by the microphone 12 as input voice (step S303) and calculates the similarity between the feature data of the input voice and the self-speech feature data 15a (step S304).

The similarity determination unit 16e determines whether or not the similarity calculated by the similarity calculation unit 16d is equal to or less than the threshold (step S305). If the similarity is equal to or less than the threshold (step S306; Yes), or more precisely, if a certain number of consecutive voice frames have a similarity equal to or less than the threshold, the state transition unit 16f stops, partway through, the output of the voice whose content was determined by the utterance processing unit 16b (step S310). After the stop, a specific voice corresponding to the fact that the utterance was stopped partway may be output.

If the similarity is not equal to or less than the threshold (step S306; No), or more precisely, if a run of the certain number of consecutive voice frames with a similarity equal to or less than the threshold has not occurred, the utterance processing unit 16b determines whether or not the output of the voice has been completed (step S307). If the output of the voice has not been completed (step S307; No), the process returns to step S302 and the output of the voice continues.

When the output of the voice has been completed (step S307; Yes), or when the output of the voice has been stopped partway in step S310, the utterance processing unit 16b ends the speech mode (step S308), the state transition unit 16f shifts from the speech mode to the voice recognition mode (step S309), and the process ends.
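The speech-mode loop of FIG. 6 can be sketched as below. The sketch assumes that the output voice is played chunk by chunk and that one similarity value is available per chunk; in a real device the similarity would be computed from microphone input while each chunk plays. The function name `run_speech_mode` and the parameters `threshold` and `frames_to_trigger` are hypothetical.

```python
def run_speech_mode(output_chunks, similarities, threshold=0.8, frames_to_trigger=3):
    """Play chunks until done or until a run of low-similarity frames occurs.

    Returns (chunks_played, interrupted).
    """
    low_run = 0
    played = []
    for chunk, sim in zip(output_chunks, similarities):
        played.append(chunk)  # S302: output the next part of the voice
        # S305/S306: count consecutive frames at or below the threshold.
        low_run = low_run + 1 if sim <= threshold else 0
        if low_run >= frames_to_trigger:
            return played, True  # S310: stop the output partway
    return played, False  # S307 Yes: output completed normally
```

In either case the caller would then end the speech mode and shift to the voice recognition mode (steps S308 and S309). Requiring a run of consecutive frames keeps a single noisy frame from interrupting the output.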

As described above, the voice interactive robot 10 according to the first embodiment registers, in the storage unit 15, the self-speech feature data 15a indicating the features of the voice that the robot itself outputs from the speaker 11, calculates the similarity between the self-speech feature data 15a and the feature data of the input voice collected by the microphone 12 while voice is being output in the speech mode, and interrupts the speech mode and shifts to the voice recognition mode when the similarity falls to or below the threshold. Therefore, when the user speaks, the robot can promptly shift to the voice recognition mode and recognize the user's voice, which enables smooth voice interaction.

In addition, since the robot's own voice is not collected in the voice recognition mode, erroneous recognition caused by its own voice can be prevented. Furthermore, since the user can tell that his or her speech is being recognized by the voice interactive robot 10, the user can speak without stress.

The first embodiment described a configuration in which the self-speech feature data 15a, indicating the features of the voice that the robot itself outputs from the speaker 11, is registered in the storage unit 15 and used to control interruption of the utterance. However, voices other than the voice that the robot itself outputs from the speaker 11 may additionally be registered and used for this control.

For example, if an in-house announcement or background music (BGM) is picked up by the microphone 12 while the voice interactive robot 10 is interacting with the user, the announcement or music may lower the similarity, and the robot may mistakenly conclude that the user has spoken and interrupt the output of the voice.

Therefore, the second embodiment describes a configuration in which sounds that are expected to occur are registered in advance as exclusion targets, and the output of the voice continues when the decrease in similarity is caused by an exclusion target.

FIG. 7 is an explanatory diagram of the operation of the voice interactive robot 110 according to the second embodiment. In addition to the self-speech feature data 15a, the voice interactive robot 110 shown in FIG. 7 registers the features of sounds that should be excluded as exclusion target speech feature data.

While outputting voice in the speech mode, the voice interactive robot 110 calculates the similarity between the feature data of the input voice and the self-speech feature data 15a, and detects an utterance by another party (the user or an exclusion target) by comparing the similarity with the threshold.

When an utterance by another party is detected, the voice interactive robot 110 calculates the similarity between the feature data of the input voice and the exclusion target speech feature data, and determines whether or not the input corresponds to an exclusion target. If it corresponds to an exclusion target, the robot does not stop the output of the voice and continues the speech mode. If it does not correspond to an exclusion target, the robot stops the output of the voice, interrupts the speech mode, and shifts to the voice recognition mode.

Next, the configuration of the voice interactive robot 110 shown in FIG. 7 will be described with reference to FIG. 8. FIG. 8 is a configuration diagram of the voice interactive robot 110 shown in FIG. 7. As shown in FIG. 8, the voice interactive robot 110 additionally stores exclusion target speech feature data 15b in the storage unit 15. The operations of the voice registration unit 116c, the similarity calculation unit 116d, the similarity determination unit 116e, and the state transition unit 116f in the control unit 16 also differ from those of the voice interactive robot 10 shown in FIG. 2. The other configurations and operations are the same as those of the voice interactive robot 10 shown in FIG. 2, so the same components are denoted by the same reference numerals and their description is omitted.

The exclusion target speech feature data 15b is data indicating the features of sounds that should be excluded. For example, in-house announcements and background music can be registered as the exclusion target speech feature data 15b. The voice of a specific person can also be registered.

The voice registration unit 116c performs registration of the exclusion target speech feature data 15b in addition to registration of the self-speech feature data 15a. Specifically, at the start of the registration mode, for example, it may accept an operation that selects whether to register the self-speech feature data 15a or the exclusion target speech feature data 15b, and then perform the registration.

The similarity calculation unit 116d calculates the similarity between the input voice and the exclusion target speech feature data 15b in addition to the similarity between the input voice and the self-speech feature data 15a. The similarity calculation itself is the same as in the first embodiment, but when a plurality of exclusion target speech feature data 15b are registered, the similarity is calculated for each of them.

The similarity determination unit 116e compares the similarity between the feature data of the input voice and the self-speech feature data 15a with a threshold, and also compares the similarity between the feature data of the input voice and the exclusion target speech feature data 15b with a threshold. The former comparison is used to detect the voice of another party. The latter comparison is used to identify whether or not the detected voice of the other party is an exclusion target. These two thresholds are not a single shared value; each is set appropriately for its purpose.

In the speech mode, the state transition unit 116f interrupts the speech mode when the voice of another party is detected and that voice does not correspond to an exclusion target, but continues the speech mode when the detected voice is an exclusion target. The ending of the speech mode when the output of the voice is completed, and the ending of the voice recognition mode, are the same as in the first embodiment.

FIG. 9 is a flowchart showing the processing procedure of the speech mode in the second embodiment. First, when the state transition unit 116f starts the speech mode (step S401), the utterance processing unit 16b outputs voice from the speaker 11 (step S402). The content of the voice output from the speaker 11 is determined according to the content of the user's utterance. Alternatively, specific content prepared in advance for output when the user has not spoken may be used.

In addition, the similarity calculation unit 116d acquires the sound collected by the microphone 12 as input voice (step S403) and calculates the similarity between the feature data of the input voice and the self-speech feature data 15a (step S404).

The similarity determination unit 116e determines whether or not the similarity calculated by the similarity calculation unit 116d is equal to or less than the threshold (step S405). If the similarity is equal to or less than the threshold (step S406; Yes), or more precisely, if a certain number of consecutive voice frames have a similarity equal to or less than the threshold, the similarity calculation unit 116d calculates the similarity between the feature data of the input voice and the exclusion target speech feature data 15b (step S410).

If the similarity between the feature data of the input voice and the exclusion target speech feature data 15b is less than the threshold, the input is regarded as not being an exclusion target (step S411; No), and the state transition unit 116f stops, partway through, the output of the voice whose content was determined by the utterance processing unit 16b (step S412). After the stop, a specific voice corresponding to the fact that the utterance was stopped partway may be output.

If the similarity with the self-speech feature data is not equal to or less than the threshold (step S406; No), or if the similarity with the self-speech feature data is equal to or less than the threshold and the similarity with the exclusion target speech feature data 15b is equal to or greater than the threshold (step S411; Yes), the utterance processing unit 16b determines whether or not the output of the voice has been completed (step S407). If the output of the voice has not been completed (step S407; No), the process returns to step S402 and the output of the voice continues.

When the output of the voice has been completed (step S407; Yes), or when the output of the voice has been stopped partway in step S412, the utterance processing unit 16b ends the speech mode (step S408), the state transition unit 116f shifts from the speech mode to the voice recognition mode (step S409), and the process ends.
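The two-stage decision of FIG. 9 (steps S406 and S411) can be summarized in a short sketch. It assumes that similarities are plain numbers and that, when several exclusion target profiles are registered, a match with any one of them is enough to continue the output; the description states only that a similarity is calculated for each profile, so the "any" rule is an assumption. The function name `decide_output_action` and both threshold values are hypothetical.

```python
def decide_output_action(self_similarity, exclusion_similarities,
                         self_threshold=0.8, exclusion_threshold=0.7):
    """Return "continue" or "stop" for the current point in the speech mode."""
    # S406 No: the input still resembles the robot's own voice.
    if self_similarity > self_threshold:
        return "continue"
    # S411 Yes: the other sound matches a registered exclusion target
    # (assumption: matching any one registered profile is sufficient).
    if any(s >= exclusion_threshold for s in exclusion_similarities):
        return "continue"
    # S411 No: treated as a user utterance, so the output is stopped (S412).
    return "stop"
```

For instance, background music registered as an exclusion target would give a low self-similarity but a high exclusion similarity, so the output would continue, whereas an unregistered voice would give low values for both and stop the output.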

As described above, the voice interactive robot 110 according to the second embodiment registers the self-speech feature data 15a and additionally registers the exclusion target speech feature data 15b indicating the features of sounds that should be excluded, and continues the output of the voice when the decrease in similarity between the feature data of the input voice and the self-speech feature data 15a is caused by an exclusion target. This prevents a situation in which an in-house announcement or background music is mistaken for the user's utterance and the output of the voice is interrupted.

Although the second embodiment has been described using the example of registering the features of sounds to be excluded, voices that should be recognized preferentially, such as those of security guards or doctors, may instead be registered as priority targets. If the voices of other voice interactive robots are registered as priority targets, a plurality of voice interactive robots can be made to cooperate through voice recognition. The user's voice may also be registered as a priority target at the start of a dialogue with that user. Furthermore, the drive sound of the robot's own actuator may be registered as the exclusion target speech feature data 15b.

The first and second embodiments described the case where the voice dialogue system is mounted on a robot, but the present invention is not limited to this and can be applied to any voice dialogue system, such as automatic response over a communication line or user support on a portable terminal.

As described above, the voice dialogue system and the voice dialogue method according to the present invention are suitable for realizing smooth voice interaction with a user.

DESCRIPTION OF SYMBOLS
10, 110 Voice interactive robot
11 Speaker
12 Microphone
13 Operation unit
14 Actuator
15 Storage unit
15a Self-speech feature data
15b Exclusion target speech feature data
16 Control unit
16a Voice recognition unit
16b Utterance processing unit
16c, 116c Voice registration unit
16d, 116d Similarity calculation unit
16e, 116e Similarity determination unit
16f, 116f State transition unit
16g Actuator drive processing unit

Claims

A voice dialogue system comprising an input receiving unit that receives voice input and an output processing unit that outputs voice according to the input voice received by the input receiving unit, the voice dialogue system further comprising:
a registration unit that registers the output voice output by the output processing unit as a self-voice;
a similarity calculation unit that calculates the similarity between the input voice and the self-voice during the output of voice by the output processing unit; and
an operation control unit that controls whether or not to stop the output of voice by the output processing unit based on the similarity calculated by the similarity calculation unit.

The voice dialogue system according to claim 1, further comprising a voice recognition unit that performs voice recognition on the input voice received by the input receiving unit, wherein
the output processing unit determines the content of the voice to be output according to the result of the voice recognition by the voice recognition unit, and
the operation control unit performs control to switch between a voice recognition mode in which voice recognition is performed by the voice recognition unit and a speech mode in which voice is output by the output processing unit.

The voice dialogue system according to claim 2, wherein the operation control unit switches from the speech mode to the voice recognition mode when the output of the voice by the output processing unit is completed or when the output of the voice is stopped based on the similarity.

The voice dialogue system according to any one of claims 1 to 3, wherein, when stopping the output of the voice based on the similarity, the output processing unit outputs a specific voice corresponding to the stopping of the output and then stops the output of the voice.

The voice dialogue system according to any one of the preceding claims, wherein
the registration unit registers, as the self-voice, feature data generated by analyzing frequency-related features of the output voice, and
the similarity calculation unit calculates the similarity between feature data generated by analyzing frequency-related features of the input voice and the feature data registered as the self-voice.

The voice dialogue system according to any one of claims 1 to 5, wherein
the registration unit further registers a predetermined voice other than the self-voice as an other-party voice, and
when the similarity between the input voice and the self-voice is equal to or less than a threshold, the operation control unit determines whether or not to stop the output of voice by the output processing unit according to the similarity between the input voice and the other-party voice.

The voice dialogue system according to claim 6, further comprising an actuator that is provided in the same housing as the input receiving unit and performs a physical operation, wherein
the registration unit registers a sound generated by the operation of the actuator as the other-party voice.

The voice dialogue system according to claim 1, further comprising an actuator that is provided in the same housing as the input receiving unit and performs a physical operation, wherein
the registration unit registers, as the self-voice, a voice in which the sound generated by the operation of the actuator and the output voice output by the output processing unit are combined.

A voice dialogue method for a voice dialogue system comprising an input receiving unit that receives voice input and an output processing unit that outputs voice according to the input voice received by the input receiving unit, the voice dialogue method comprising:
a registration step of registering the output voice output by the output processing unit as a self-voice;
a similarity calculation step of calculating the similarity between the input voice and the self-voice during the output of voice by the output processing unit; and
an operation control step of controlling whether or not to stop the output of voice by the output processing unit based on the similarity calculated in the similarity calculation step.