JP2007155986A

JP2007155986A - Voice recognition device and robot equipped with the same

Info

Publication number: JP2007155986A
Application number: JP2005349118A
Authority: JP
Inventors: Ryota Hiura; 亮太日浦; Ken Onishi; 献大西; Keiichiro Osada; 啓一郎長田; Kyoko Oshima; 京子大嶋
Original assignee: Mitsubishi Heavy Industries Ltd
Current assignee: Mitsubishi Heavy Industries Ltd
Priority date: 2005-12-02
Filing date: 2005-12-02
Publication date: 2007-06-21

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device, and a robot equipped with the same, that enable well-paced conversation while preventing the beginning of the user's utterance from being missed in voice recognition.

SOLUTION: The robot comprises: a control section 53 that composes the lines to be spoken; a voice synthesizing section 55 that generates an output voice signal based on the lines; a speaker 18 that outputs an output voice based on the output voice signal; a microphone 14 that converts sound including at least the user's voice into an input signal; an output voice elimination section 57 that generates an input voice signal by eliminating the signal component related to the output voice from the input signal; and a voice recognition section 59 that recognizes the user's voice based on the input voice signal and outputs the recognition result to the control section 53. The control sections 53 and 61 control the timing at which the voice recognition section 59 starts recognizing the user's voice so that recognition starts after a predetermined period from the start of the output voice and before the output voice finishes.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a voice recognition device and a robot including the voice recognition device.

With the development of computer technology in recent years, robots that apply computers have been developed. Such robots include not only industrial robots used at manufacturing sites and the like, but also robots that interact closely with people, such as robots that look after children (see, for example, Patent Document 1).
There are various ways for a person to give instructions to a robot that interacts closely with people. One of them is voice dialogue.

In voice dialogue with a robot, confirmation is generally repeated in turn: a human instruction, a robot response, a human instruction, a robot response, and so on. For example, in a system such as a robot, which, unlike a telephone, does not use a handset or headset, voice recognition is generally started after the robot has finished speaking. This is because, if voice recognition is started while the robot is still speaking, the robot may recognize its own speech.
However, people tend to start speaking immediately after the robot finishes its utterance, or even before it finishes. This tendency is especially pronounced in people who have already experienced the same conversation scenario. When a person starts speaking before the robot starts voice recognition, the robot cannot recognize the first part of what the person says, and therefore cannot recognize the utterance correctly.

As a solution to the above problem, systems that inform the person that the robot has started voice recognition are generally known.
For example, a system is known that indicates the timing at which the robot starts voice recognition by lighting a lamp provided at the robot's ear. A system is also known that prompts the person to speak by sounding a short beep after the robot finishes speaking.
These systems are effective to a degree, in that the robot can recognize what the person says from the beginning. However, with these systems the person must speak at the timing specified by the robot. Being constrained in when to speak can cause the person to feel stress.

For this reason, various voice recognition devices have been proposed that accommodate the above tendency, so that what the person says is recognized accurately without causing the person stress (see, for example, Patent Documents 2 and 3).
JP 2005-305631 A (pages 8-11)
JP 2003-345390 A (page 3, FIG. 1)
JP 2004-333543 A (pages 8-9, FIG. 1)

Patent Document 2 discloses a voice processing device including a tilt sensor that detects tilt, input preparation detection means that detects from the output of the tilt sensor that the user is about to make an input, a push-to-talk button that is pressed at the start of voice input, a buffer memory that temporarily stores voice, and voice recognition processing means that performs recognition processing on the voice signal.
With this configuration, the tilt sensor detects that the user has picked up the voice processing device in order to press the push-to-talk button, and the input preparation detection means instructs preparation for voice input. The voice signal is then stored in the buffer memory even before the user presses the push-to-talk button. When the push-to-talk button is subsequently pressed, the voice recognition processing means starts recognizing the voice signal stored in the buffer memory.
The document states that, because the voice signal stored before the push-to-talk button was pressed is also processed by the voice recognition processing means, the person's utterance can be recognized from the beginning.
However, in the voice processing device of Patent Document 2, the push-to-talk button must be pressed for the voice recognition processing means to start recognition, which may cause the person stress.

Patent Document 3 discloses a voice dialogue system including a voice output unit that outputs system-side voice, a microphone that converts user voice into a voice signal, a voice recognition unit that recognizes the user voice, a proficiency determination unit that determines the user's proficiency in voice dialogue, a voice output change unit that changes the output of the system-side voice, and a voice response removal unit that uses correlation calculation to remove, from the voice signal input from the microphone, the component corresponding to the system-side voice output by the voice output unit.
The document states that, because the voice response removal unit is provided, voice from the user can be recognized even while the voice dialogue system is outputting a voice response.
However, by the nature of its processing, such a voice response removal unit (for example, an acoustic echo canceller, hereinafter AEC) may be unable to remove the system-side voice completely when sound reflections in the environment are complex, or because of factors such as other noise and distortion.
In addition, immediately after the voice output unit starts outputting the system-side voice, the removal processing in the voice response removal unit has not yet converged, and recognition performance for the user's voice may be degraded.
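The convergence issue described above can be illustrated with a small simulation. The following is a minimal sketch, not the method of Patent Document 3: it assumes a normalized least-mean-squares (NLMS) adaptive filter as the echo canceller, a hypothetical four-tap echo path, and white noise standing in for the system-side voice. All function names and parameter values are illustrative assumptions.

```python
import random


def nlms_echo_cancel(reference, mic, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`.

    Returns the residual signal (what would be passed on to recognition).
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        err = d - estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step size normalized by the reference signal power
        weights = [w + mu * err * h / norm for w, h in zip(weights, history)]
        residual.append(err)
    return residual


def simulate(n=2000, seed=0):
    """Simulate a speaker-to-microphone echo with no user speech present."""
    rng = random.Random(seed)
    echo_path = [0.5, 0.3, -0.2, 0.1]  # hypothetical room response
    reference = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic = [
        sum(echo_path[k] * reference[i - k]
            for k in range(len(echo_path)) if i - k >= 0)
        for i in range(n)
    ]
    return nlms_echo_cancel(reference, mic)


def energy(samples):
    return sum(s * s for s in samples)


residual = simulate()
early = energy(residual[:100])   # right after output starts
late = energy(residual[-100:])   # after the filter has adapted
```

In this sketch the residual energy is much larger in the first 100 samples than in the last 100, which mirrors why recognition performed right after the output starts is less reliable.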

For example, if the system-side voice cannot be removed completely and voice recognition is performed while the system-side voice is being output, the system-side voice may be misrecognized as the user's voice. The voice dialogue system may thus misrecognize its own system-side voice without waiting for the user to speak.
Even when the voice response removal unit can remove the system-side voice completely, if voice recognition is performed from the very start of the system-side voice output and external noise (sound other than the system-side voice and the user's voice) occurs at that time, the external noise may be misrecognized as the user's voice.
When such misrecognition occurs, the voice dialogue system proceeds with the conversation on the basis of the misrecognition, so an accurate, well-paced conversation cannot be carried out.

The present invention has been made to solve the above problems, and its object is to provide a voice recognition device, and a robot including the voice recognition device, that prevent the beginning of the user's speech from being cut off in voice recognition and enable well-paced conversation.

In order to achieve the above object, the present invention provides the following means.
The voice recognition device of the present invention comprises: a control unit that composes the lines of a conversation; a voice synthesis unit that generates an output voice signal based on the composed lines; a speaker that outputs an output voice based on the generated output voice signal; a microphone that converts sound including at least the user voice uttered by the user into an input signal; an output voice removal unit that, based on the output voice signal, removes the signal component related to the output voice from the input signal to generate an input voice signal; and a voice recognition unit that recognizes the user voice based on the input voice signal and outputs the recognition result to the control unit. The device is characterized in that the control unit, based on the lines, controls the timing at which the voice recognition unit starts recognizing the user voice so that it falls after a predetermined time from the start of output of the output voice and before the end of output of the output voice.

According to the present invention, the control unit controls, based on the lines, the timing at which recognition of the user voice starts so that it falls after the start and before the end of output of the output voice. This prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
Because the control unit sets the recognition start timing before the output of the output voice ends, recognition of the user voice can begin before the output voice finishes. Therefore, even if the user speaks immediately after the output voice ends, or while it is still being output, the voice recognition device can recognize the user's voice from the beginning.
Because the recognition start timing is controlled based on the lines, recognition of the user voice can always be started before the output of the output voice ends, even if the length of the lines changes.
Because the recognition start timing is a predetermined time after the start of output of the output voice, the user voice can be recognized while the processing of the output voice removal unit is stable. Immediately after output of the output voice starts, the processing of the output voice removal unit is unstable, and the user voice may be misrecognized in that state. Setting the recognition start timing a predetermined time after the start of output prevents such misrecognition and enables well-paced conversation.
Immediately after output of the output voice starts, whatever the user says is unlikely to be a valid answer to the lines being output. Setting the recognition start timing a predetermined time after the start of output therefore avoids recognizing such invalid answers and enables well-paced conversation.
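The effect of the recognition start timing on the beginning of the user's utterance can be shown with simple arithmetic. This is a minimal sketch under stated assumptions: the function name `clipped_head_s` and the example times are hypothetical and do not appear in the patent.

```python
def clipped_head_s(user_onset_s, recognition_start_s):
    """Seconds of the user's utterance lost because recognition was not yet running.

    Both times are measured from the start of the robot's output voice.
    """
    return max(0.0, recognition_start_s - user_onset_s)


# Assumed scenario: the robot speaks for 3.0 s and the user answers at 2.8 s,
# slightly before the robot has finished.
ROBOT_UTTERANCE_S = 3.0
USER_ONSET_S = 2.8

# Conventional approach: recognition starts only when the robot finishes speaking.
lost_conventional = clipped_head_s(USER_ONSET_S, ROBOT_UTTERANCE_S)

# Approach described above: recognition starts before the output ends (here at 2.0 s).
lost_early_start = clipped_head_s(USER_ONSET_S, 2.0)
```

With these assumed numbers, about 0.2 s of the user's speech is lost in the conventional case and none when recognition starts at 2.0 s.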

In the above invention, it is desirable that an utterance time calculation unit be provided that calculates, based on the output voice signal, the length of the utterance time of the output voice for the lines, and that the control unit control the timing at which the voice recognition unit starts recognizing the user voice based on the length of the utterance time.

According to the present invention, the control unit controls the recognition start timing based on the length of the utterance time calculated by the utterance time calculation unit. This reliably prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
Because the utterance time calculation unit calculates the length of the utterance time from the output voice signal that is input to the speaker, it obtains the length of the output voice actually output from the speaker. Because the control unit controls the recognition start timing based on this calculated length, cutting off the beginning of the user's speech is reliably prevented.
For example, even when part of the lines contains a personal name, a nickname, or the like, and that part changes from conversation to conversation, the utterance time calculation unit can calculate the length of the utterance time from the output voice signal for the changed lines. The voice recognition device can therefore reliably prevent the beginning of the user's speech from being cut off.
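One way to obtain the utterance length from the output voice signal is to divide the number of samples by the sample rate. The patent does not state how the utterance time calculation unit works, so the following is only a sketch under that assumption; the function name and the 16 kHz sample rate are hypothetical.

```python
def utterance_duration_s(samples, sample_rate_hz):
    """Length in seconds of a mono PCM signal given as a sequence of samples."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return len(samples) / sample_rate_hz


# Assumed example: 24,000 synthesized samples at 16 kHz.
synthesized = [0] * 24000
duration = utterance_duration_s(synthesized, 16000)  # 1.5 seconds
```

Because the length is taken from the synthesized signal itself, it follows any change in the lines, such as an inserted name.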

In the above invention, it is desirable that a storage unit be provided that stores in advance the length of the utterance time of the output voice for the lines, and that the control unit control the timing at which the voice recognition unit starts recognizing the user voice based on the length of the utterance time stored in the storage unit.

According to the present invention, the length of the utterance time of the output voice for the lines is stored in advance in the storage unit, and the control unit controls the recognition start timing based on the stored length. This reliably prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
For example, compared with calculating the length of the utterance time each time from the output voice signal input to the speaker, the utterance time does not need to be calculated, so the computational load at the time of utterance can be reduced. Furthermore, because the utterance time calculation unit is no longer needed, the configuration of the voice recognition device can be simplified.
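The stored-duration variant can be pictured as a lookup table. This is a minimal sketch; the table contents, the line identifiers, and the function name are hypothetical, since the patent does not describe the format of the storage unit.

```python
# Hypothetical pre-measured utterance lengths in seconds, keyed by line identifier.
STORED_UTTERANCE_S = {
    "greeting": 1.2,
    "ask_destination": 2.8,
}


def stored_utterance_s(line_id):
    """Return the pre-stored utterance length; no signal processing is needed."""
    return STORED_UTTERANCE_S[line_id]


length = stored_utterance_s("ask_destination")  # 2.8
```

The trade-off is that a table like this only covers lines whose length is known in advance.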

In the above invention, it is desirable that the control unit calculate a start time by subtracting a delay time of a predetermined length from the utterance time, and cause the voice recognition unit to start recognizing the user voice when the start time has elapsed from the start of output of the output voice.

According to the present invention, recognition of the user voice starts when the start time has elapsed from the start of output of the output voice. This prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
Because the start time is calculated by subtracting a delay time of a predetermined length from the utterance time, the control unit can set the start of user voice recognition to a predetermined timing.
For example, by setting this timing to just before the user speaks, misrecognition of the user voice by the voice recognition device can be prevented. That is, shortening the interval between the start of recognition and the start of the user's speech lowers the probability that an external sound occurs within that interval. The voice recognition device is thereby prevented from misrecognizing such an external sound as user voice.
The delay time is preferably longer than zero and shorter than the utterance time.
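The start-time calculation described above amounts to a single subtraction. The sketch below uses a hypothetical function name and example values, and it treats the preferred range of the delay time as a hard check purely for illustration.

```python
def recognition_start_time(utterance_s, delay_s):
    """Start time = utterance time - delay time, measured from the start of output.

    The delay must be greater than zero and shorter than the utterance, so the
    result always lies after the start and before the end of the output voice.
    """
    if not 0.0 < delay_s < utterance_s:
        raise ValueError("delay must be greater than zero and shorter than the utterance")
    return utterance_s - delay_s


# Assumed example: a 3.0 s utterance with a 0.5 s delay time.
start = recognition_start_time(3.0, 0.5)  # 2.5 s after output begins
```

A longer delay time moves the recognition start earlier, and a shorter one moves it closer to the end of the robot's utterance.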

In the above invention, it is desirable that the control unit control the timing at which the voice recognition unit starts recognizing the user voice by changing the length of the delay time.

According to the present invention, the control unit controls the recognition start timing by changing the length of the delay time. This prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
Changing the length of the delay time changes the start time, which is obtained by subtracting the delay time from the utterance time. The timing at which user voice recognition starts can therefore be changed.

In the above invention, it is desirable that the control unit control the timing at which the voice recognition unit starts recognizing the user voice based on the sentences that make up the lines.

According to the present invention, the control unit controls the recognition start timing based on the sentences that make up the lines. This prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
For example, when the lines consist of a plurality of sentences, the recognition start timing can be aligned with the timing at which the output voice for one of the second and subsequent sentences is output. Users tend to listen to the output voice for the first sentence without speaking, and to start speaking partway through the output voice for the second and subsequent sentences. Starting recognition at the timing at which one of the second and subsequent sentences is output therefore prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
Compared with controlling the recognition start timing by the time elapsed since the start of output, this allows the timing to be controlled more finely according to the structure of the lines, so cutting off the beginning of the user's speech is prevented more reliably.
Because the length of the utterance time is not used to control the recognition start timing, the timing can be controlled easily even when the utterance time is difficult to calculate or takes a long time to calculate.

The sentences that make up the lines include sentences during which the user is unlikely to speak and sentences during which the user is likely to speak. The control unit may therefore, for example, start recognition of the user voice at the point where output of the voice for a sentence during which the user is likely to speak begins. Alternatively, it may start recognition at the point where the lines change from a sentence during which the user is unlikely to speak to one during which the user is likely to speak. Doing so prevents the beginning of the user's speech from being cut off and enables well-paced conversation.
Sentences during which the user is unlikely to speak include calls addressed to the user and repetitions of the user's instructions. Sentences during which the user is likely to speak include sentences that ask the user for an instruction. A sentence here means words arranged in a certain form and manner.
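The sentence-based control described above can be sketched as picking the first sentence during which the user is likely to speak. The sentence labels ("call", "repeat", "request"), the example lines, and the function name are assumptions made for illustration; the patent does not define such a data format.

```python
# Assumed labels: only a request for an instruction is treated as likely to draw a reply.
LIKELY_KINDS = {"request"}


def recognition_start_sentence(sentences):
    """Return the index of the first sentence during which the user is likely to speak.

    `sentences` is a list of (text, kind) pairs. Returns None if no sentence qualifies.
    """
    for index, (_text, kind) in enumerate(sentences):
        if kind in LIKELY_KINDS:
            return index
    return None


lines = [
    ("Hello, Taro.", "call"),                    # call to the user
    ("You asked about the weather.", "repeat"),  # repetition of the user's instruction
    ("Which city would you like?", "request"),   # request for an instruction
]
start_index = recognition_start_sentence(lines)  # 2
```

Recognition would then begin when output of the sentence at `start_index` begins, with no need to know the utterance length.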

The robot of the present invention is a robot including a voice recognition device that recognizes the user's voice, characterized in that the voice recognition device is the voice recognition device according to any one of claims 1 to 7.

According to the present invention, by using the voice recognition device of the present invention described above, the robot prevents the beginning of the user's speech from being cut off and enables well-paced conversation.

According to the voice recognition device of the present invention and the robot including it, the control unit controls, based on the lines, the timing at which recognition of the user voice starts so that it falls after the start and before the end of output of the output voice. This has the effect of preventing the beginning of the user's speech from being cut off and enabling well-paced conversation.

Hereinafter, an embodiment of a robot according to the present invention will be described with reference to the drawings.
[First Embodiment]
FIG. 1 is a front view of the robot according to the first embodiment of the present invention, and FIG. 2 is a left side view of the life support robot shown in FIG. 1.
As shown in FIGS. 1 and 2, the main body 1 of the life support robot includes a head 2, a chest 3 that supports the head 2 from below, a right arm 4a provided on the right side of the chest 3, a left arm 4b provided on the left side of the chest 3, a waist 5 connected below the chest 3, a skirt 6 connected below the waist 5, and a leg 7 connected below the skirt 6.

One omnidirectional camera 11 is provided near the top of the head 2. A plurality of infrared LEDs 12 are arranged in a ring at predetermined intervals along the outer periphery of the omnidirectional camera 11.
Near the center of the front surface of the head 2, as shown in FIG. 1, one front camera 13 for imaging the area ahead is provided on the right side as seen from the front, and one microphone 14 is provided on the left side as seen from the front.

One monitor 15 is provided near the center of the front surface of the chest 3. One ultrasonic distance sensor 16 for detecting a person is provided above the monitor 15. One power switch 17 is provided below the monitor 15. Above the ultrasonic distance sensor 16, two speakers 18 are provided, one on the left and one on the right. In addition, as shown in FIG. 2, a backpack section 33 that can store luggage is provided on the back of the chest 3. The backpack section 33 has a door 33a that can pivot about a hinge provided at its top. As shown in FIG. 1, one touch sensor 19 is provided on each of the left and right shoulders of the chest 3.

The right arm 4a and the left arm 4b have a multi-joint structure. On each of the right arm 4a and the left arm 4b, near the connection with the chest 3, an armpit switch 20 is provided that detects a body part or object being pinched and stops the movement of the arm. As shown in FIG. 1, a handshake switch 21 that functions as a man-machine interface is built into the palm of the right arm 4a. Pressure sensors, for example, are used for the armpit switches 20 and the handshake switch 21.

Near the center of the front surface of the waist 5, two ultrasonic distance sensors 22 for detecting a person are provided, one on the left and one on the right. Below these ultrasonic distance sensors 22 is a sensor region 24 in which a plurality of infrared sensors 23 are arranged. These infrared sensors 23 detect obstacles and the like below and in front of the robot body 1. As shown in FIGS. 1 and 2, a total of four microphones 25 for detecting the direction of a sound source are provided at the lower part of the waist 5, one on the left and one on the right of both the front and the back. As shown in FIG. 2, one handle 26, used when lifting the main body, is provided on each of the left and right sides of the waist 5. Each handle 26 is a recess into which the operator's hand can be inserted.

スカート部６の前面下方には、段差を検出するための赤外線センサ２７が、中央および左右に計３つ設けられている。図２に示すように、スカート部６の背面には、充電コネクタ２８が設けられている。 Below the front surface of the skirt portion 6, a total of three infrared sensors 27 for detecting a step are provided in the center and on the left and right. As shown in FIG. 2, a charging connector 28 is provided on the back surface of the skirt portion 6.

図１に示すように、脚部７の前面には、側方の距離を検出するための赤外線センサ２９が左右に一つずつ設けられている。これら赤外線センサ２９は、主に段差検出に用いられるものである。
図２に示すように、脚部７の背面には、充電スタンドにロボット本体１を位置固定するためのフック３０が設けられている。脚部７は、走行用車輪３１および４つのボールキャスタ３２を備えた台車とされている。 As shown in FIG. 1, one infrared sensor 29 for detecting a lateral distance is provided on the front surface of the leg portion 7 on the left and right sides. These infrared sensors 29 are mainly used for level difference detection.
As shown in FIG. 2, a hook 30 for fixing the position of the robot body 1 to the charging stand is provided on the back surface of the leg portion 7. The leg portion 7 is a carriage provided with traveling wheels 31 and four ball casters 32.

上述したロボットにおいて、上記頭部２の顔表情は図示しない駆動機構により可変となっている。また、頭部２と胸部３との間の首関節や、胸部３と右腕部４ａ間、胸部３と左腕部４ｂ間の肩関節、右腕部４ａ、左腕部４ｂ内の肘関節、手首関節等が図示しない駆動機構により駆動可能であるとともに、脚部７に装備された走行用車輪３１が図示しない駆動機構により駆動されることにより、自動操舵および自動走行が可能な構成となっている。 In the robot described above, the facial expression of the head 2 can be changed by a drive mechanism (not shown). The neck joint between the head 2 and the chest 3, the shoulder joints between the chest 3 and the right arm 4a and between the chest 3 and the left arm 4b, and the elbow joints, wrist joints, and the like in the right arm 4a and the left arm 4b can be driven by drive mechanisms (not shown). The traveling wheels 31 mounted on the leg portion 7 are also driven by a drive mechanism (not shown), which enables automatic steering and automatic traveling.

また、本実施形態に係るロボットは、作業空間をロボット本体に内蔵されたバッテリからの電源供給により自立的に移動するように構成されており、一般家庭等の屋内を作業空間として人間と共存し、例えば、一般家庭内でロボットの所有者や操作者などのユーザの生活を補助・支援・介護するために用いられる。そのため、ロボット１は、ユーザとの会話を実現させる会話機能のほか、ユーザの行動を見守ったり、ユーザの行動を補助したり、ユーザと一緒に行動したりする機能を備えている。このような機能は、例えば、ロボット１の本体の内部に内蔵されたマイクロコンピュータ等からなる制御装置により実現されるものである。制御装置には、図１および図２に示した各種カメラや各種センサ等が接続されており、カメラからの画像情報やセンサからのセンサ検出情報を取得し、これらの情報に基づいて各種プログラムを実行することにより、上述した各種機能を実現させる。
なお、ロボット本体１の形状としては、図１および図２に示した形状に限られず、愛玩用に動物を模したものなど、種々のものを採用することが可能である。 In addition, the robot according to the present embodiment is configured to move autonomously through its working space on power supplied from a battery built into the robot body. It coexists with people, using the interior of an ordinary home or the like as its working space, and is used, for example, to assist, support, and care for the daily lives of users such as the robot's owner and operator in an ordinary household. Therefore, in addition to a conversation function for conversing with the user, the robot 1 has functions for watching over the user's actions, assisting the user's actions, and acting together with the user. Such functions are realized by, for example, a control device including a microcomputer or the like built into the main body of the robot 1. The various cameras and sensors shown in FIGS. 1 and 2 are connected to the control device, which acquires image information from the cameras and detection information from the sensors and executes various programs based on this information to realize the functions described above.
The shape of the robot body 1 is not limited to the shape shown in FIGS. 1 and 2, and various shapes such as a model imitating an animal for pets can be adopted.

次に、本発明の特徴部分である音声認識機能について説明する。本実施形態に係るロボットの音声認識機能は、上述した制御装置内に設けられた音声認識装置により実現されるものである。図３に本実施形態に係る音声認識装置の機能ブロック図を示す。
音声認識装置５１は、図３に示すように、ユーザとの会話に用いる台詞を生成する会話シナリオ実行部（制御部）５３と、台詞に基づいて出力音声信号を生成する音声合成部５５と、出力音声信号に基づいて出力音声を出力するスピーカ１８と、少なくともユーザの音声を含む音を入力信号に変換するマイクロフォン１４と、入力信号から入力音声信号を生成するアコースティックエコーキャンセラー（以下、ＡＥＣと表記する。）（出力音声除去部）５７と、入力音声信号に基づいてユーザ音声を認識する音声認識部５９と、音声認識部５９の認識開始を指示する遅延制御部（制御部）６１と、台詞に係る出力音声の発話時間長さを算出する発話時間算出部６３と、ユーザの会話の習熟度を判定する習熟度判定部６５とを備えている。 Next, the speech recognition function that is a characteristic part of the present invention will be described. The voice recognition function of the robot according to the present embodiment is realized by the voice recognition device provided in the control device described above. FIG. 3 shows a functional block diagram of the speech recognition apparatus according to the present embodiment.
As shown in FIG. 3, the speech recognition apparatus 51 includes: a conversation scenario execution unit (control unit) 53 that generates the dialogue used for conversation with the user; a speech synthesis unit 55 that generates an output speech signal based on the dialogue; a speaker 18 that outputs output sound based on the output speech signal; a microphone 14 that converts sound including at least the user's voice into an input signal; an acoustic echo canceller (hereinafter referred to as AEC) (output voice removal unit) 57 that generates an input voice signal from the input signal; a voice recognition unit 59 that recognizes the user's voice based on the input voice signal; a delay control unit (control unit) 61 that instructs the voice recognition unit 59 to start recognition; an utterance time calculation unit 63 that calculates the utterance time length of the output voice for the dialogue; and a proficiency level determination unit 65 that determines the user's proficiency level in conversation.

会話シナリオ実行部５３は、ユーザとの会話のシナリオを選択するとともに、選択したシナリオに基づいて台詞を生成するものである。会話シナリオ実行部５３は、ユーザの習熟度を判定する習熟度判定部６５と、判定した習熟度などを記憶する記憶部６７とを備えている。また、会話シナリオ実行部５３は、音声合成部５５および遅延制御部６１に電気信号を出力するように音声合成部５５および遅延制御部６１と接続されている。また、会話シナリオ実行部５３は、音声認識部５９から電気信号が入力されるように音声認識部５９と接続されている。
習熟度判定部６５は、ユーザの会話の習熟度を判定するものであり、会話シナリオ実行部５３内に設けられている。会話シナリオ実行部５３は、習熟度判定部６５の出力に基づいて、遅延時間の長さＤを変更している。習熟度の判定には、例えば、上述のロボットの電源が入れられてか経過した日数や、同じ内容のシナリオを繰り返した回数などが用いられている。
音声合成部５５は、会話シナリオ実行部５３が生成した台詞に基づいて出力音声信号を生成するものである。音声合成部５５は、会話シナリオ実行部５３から電気信号が入力されるように会話シナリオ実行部５３と接続されている。また、音声合成部５５は、スピーカ１８とＡＥＣ５７と発話時間算出部６３とに電気信号を出力するようにスピーカ１８、ＡＥＣ５７および発話時間算出部６３と接続されている。 The conversation scenario execution unit 53 selects a scenario for conversation with the user and generates a dialogue based on the selected scenario. The conversation scenario execution unit 53 includes a proficiency level determination unit 65 that determines a user's proficiency level, and a storage unit 67 that stores the determined proficiency level. The conversation scenario execution unit 53 is connected to the speech synthesis unit 55 and the delay control unit 61 so as to output an electrical signal to the speech synthesis unit 55 and the delay control unit 61. The conversation scenario execution unit 53 is connected to the voice recognition unit 59 so that an electric signal is input from the voice recognition unit 59.
The proficiency level determination unit 65 determines the user's proficiency level in conversation, and is provided in the conversation scenario execution unit 53. The conversation scenario execution unit 53 changes the delay time length D based on the output of the proficiency level determination unit 65. The proficiency level is determined from, for example, the number of days that have passed since the robot was powered on, or the number of times a scenario with the same content has been repeated.
The voice synthesis unit 55 generates an output voice signal based on the dialogue generated by the conversation scenario execution unit 53. The speech synthesizer 55 is connected to the conversation scenario execution unit 53 so that an electric signal is input from the conversation scenario execution unit 53. The voice synthesis unit 55 is connected to the speaker 18, the AEC 57, and the utterance time calculation unit 63 so as to output electric signals to the speaker 18, the AEC 57, and the utterance time calculation unit 63.

スピーカ１８は、入力される出力音声信号に基づいて、出力音声を、例えば、ユーザに対して出力するものである。スピーカ１８は音声合成部５５から電気信号が入力されるように音声合成部５５と接続されている。なお、スピーカ１８としては、公知のスピーカを用いることができ、特に限定するものではない。
マイクロフォン１４は、ユーザの発話を含めたマイクロフォン１４に入力した音を、電気信号である入力信号に変換するものである。マイクロフォン１４は、ＡＥＣ５７に電気信号が出力されるようにＡＥＣ５７と接続されている。なお、マイクロフォン１４としては、公知のマイクロフォンを用いることができ、特に限定するものではない。 The speaker 18 outputs output sound to, for example, a user based on the input output sound signal. The speaker 18 is connected to the speech synthesizer 55 so that an electrical signal is input from the speech synthesizer 55. Note that a known speaker can be used as the speaker 18 and is not particularly limited.
The microphone 14 converts sound input to the microphone 14 including the user's utterance into an input signal that is an electrical signal. The microphone 14 is connected to the AEC 57 so that an electrical signal is output to the AEC 57. Note that a known microphone can be used as the microphone 14 and is not particularly limited.

ＡＥＣ５７は、音声合成部５５から入力された出力音声信号と、マイクロフォン１４から入力された入力信号とを相関演算することにより、入力信号からスピーカ１８から出力された出力音声に相当する信号を除去して入力音声信号を算出するものである。ＡＥＣ５７は、音声合成部５５から電気信号が入力されるように音声合成部５５と接続されている。また、ＡＥＣ５７は、音声認識部５９に電気信号が出力されるように音声認識部５９と接続されている。
音声認識部５９はユーザの発話を認識するものである。具体的には、ＡＥＣ５７から入力される入力音声信号と、音声認識辞書とのマッチングを行うことで、ユーザの発話を認識するものである。音声認識部５９は、会話シナリオ実行部５３に電気信号を出力するように会話シナリオ実行部５３と接続されている。また、音声認識部５９は、ＡＥＣ５７および遅延制御部６１とから電気信号が入力されるようにＡＥＣ５７および遅延制御部６１と接続されている。 The AEC 57 removes a signal corresponding to the output sound output from the speaker 18 from the input signal by performing a correlation operation between the output sound signal input from the sound synthesizer 55 and the input signal input from the microphone 14. The input audio signal is calculated. The AEC 57 is connected to the speech synthesizer 55 so that an electrical signal is input from the speech synthesizer 55. The AEC 57 is connected to the voice recognition unit 59 so that an electrical signal is output to the voice recognition unit 59.
The voice recognition unit 59 recognizes a user's utterance. Specifically, the user's utterance is recognized by matching an input voice signal input from the AEC 57 with a voice recognition dictionary. The voice recognition unit 59 is connected to the conversation scenario execution unit 53 so as to output an electrical signal to the conversation scenario execution unit 53. The voice recognition unit 59 is connected to the AEC 57 and the delay control unit 61 so that electric signals are input from the AEC 57 and the delay control unit 61.

遅延制御部６１は、音声認識部５９における音声認識開始のタイミングを指示するものである。遅延制御部６１は、会話シナリオ実行部５３および発話時間算出部６３から電気信号が入力されるように会話シナリオ実行部５３および発話時間算出部６３と接続されている。また、遅延制御部６１は、音声認識部５９から電気信号が入力されるように音声認識部５９と接続されている。
具体的には、遅延制御部６１は、まず、音声合成部５５が算出した発話時間の長さＸと、会話シナリオ実行部５３が算出した遅延時間の長さＤとに基づいて、音声認識の開始時間（Ｘ−Ｄ）を算出している。その後、遅延制御部６１は、出力音声の出力開始から開始時間（Ｘ−Ｄ）経過した時点で、音声認識部５９に対して、音声認識開始の信号を出力する。 The delay control unit 61 instructs the voice recognition start timing in the voice recognition unit 59. The delay control unit 61 is connected to the conversation scenario execution unit 53 and the utterance time calculation unit 63 so that electric signals are input from the conversation scenario execution unit 53 and the utterance time calculation unit 63. The delay control unit 61 is connected to the voice recognition unit 59 so that an electric signal is input from the voice recognition unit 59.
Specifically, the delay control unit 61 first calculates the speech recognition start time (X-D) based on the utterance time length X calculated by the speech synthesis unit 55 and the delay time length D calculated by the conversation scenario execution unit 53. Thereafter, the delay control unit 61 outputs a speech recognition start signal to the speech recognition unit 59 when the start time (X-D) has elapsed from the start of output of the output speech.

次に、上述の構成からなる生活支援ロボットとユーザとの間の会話における、音声認識装置５１の働きを説明する。
まず、本実施形態における生活支援ロボットとユーザとの間の会話の流れを説明する。
図４は、図３の音声認識装置とユーザとの間の会話の流れを説明する模式図である。図４において、横軸は時間を表し、図中のＳＰが生活支援ロボットの発話期間を表し、ＲＣが、生活支援ロボットが音声を認識している期間を表している。
ユーザとの間で会話を行っていない場合には、図４に示すように、生活支援ロボットの音声認識部５９は音声を認識し続け（Ａ）、ユーザからの音声による指示の入力を待っている。
この状態において、ユーザから音声による指示が入力されると、音声認識装置５１は音声の認識を中断し（Ｂ）、入力された指示に対する台詞を発話する（ＳＰ）。音声認識装置５１は、発話の開始時から所定時間（Ｘ−Ｄ）が経過した時点（Ｃ）で、再び、音声の認識（ＲＣ）を開始して、ユーザの音声入力を認識し始める。ユーザの音声入力が終了した等の理由により、所定レベル以上の大きさの音声入力が一定期間ない状態が続くと、音声認識装置５１は音声認識を区切り、音声認識の結果に基づき次の処理を行う。
このようにして、生活支援ロボットとユーザとの間で会話が交互に繰り返されるキャッチボール型の会話がなされる。 Next, the operation of the voice recognition device 51 in the conversation between the life support robot having the above-described configuration and the user will be described.
First, the flow of conversation between the life support robot and the user in this embodiment will be described.
FIG. 4 is a schematic diagram for explaining the flow of conversation between the speech recognition apparatus of FIG. 3 and the user. In FIG. 4, the horizontal axis represents time, SP in the figure represents the utterance period of the life support robot, and RC represents the period during which the life support robot recognizes the voice.
When no conversation with the user is in progress, as shown in FIG. 4, the voice recognition unit 59 of the life support robot continues to recognize voice (A) and waits for a voice instruction from the user.
In this state, when a voice instruction is input from the user, the voice recognition device 51 interrupts voice recognition (B) and utters dialogue in response to the input instruction (SP). The voice recognition device 51 starts voice recognition (RC) again at the time (C) when a predetermined time (X-D) has elapsed from the start of the utterance, and begins to recognize the user's voice input. When no voice input at or above a predetermined level has been received for a certain period, for example because the user's voice input has ended, the voice recognition device 51 delimits the voice recognition and performs the next processing based on the recognition result.
In this way, a catch ball type conversation in which the conversation is alternately repeated between the life support robot and the user is performed.

次に、上述のキャッチボール型の会話が行われている際の、音声認識装置５１の働きについて説明する。
生活支援ロボットがユーザから音声による指示の入力が待っている状態から、ユーザから音声による入力指示が入力されると、図３に示すように、音声認識装置５１の会話シナリオ実行部５３は、音声認識部５９に対して音声認識を停止する停止信号を出力する。同時に、会話シナリオ実行部５３は、ユーザからの指示入力に応じた会話シナリオを組み立て、組み立てたシナリオに基づいた台詞を生成する。あるいは、会話シナリオ実行部５３に備えられた記憶部６７の認識辞書に記憶された台詞を呼び出す。このとき、生成または呼び出される台詞には、生活支援ロボットが出力する音声に係る台詞のほかに、生活支援ロボットの音声に対してユーザが回答すると予想されるユーザに係る台詞も含まれている。
例えば、ユーザからの入力指示が「伝言」の場合、会話シナリオ実行部５３が組み立てる会話シナリオのうち、ロボットに係る台詞は、「伝言を伝えます。誰に伝えますか。」となる。また、ユーザに係る台詞は、「Ａさん」、「Ｂさん」など、伝言する相手の名前となる。 Next, the operation of the voice recognition device 51 when the above-described catch ball type conversation is performed will be described.
When a voice instruction is input from the user while the life support robot is waiting for one, the conversation scenario execution unit 53 of the voice recognition device 51 outputs a stop signal for stopping voice recognition to the voice recognition unit 59, as shown in FIG. 3. At the same time, the conversation scenario execution unit 53 assembles a conversation scenario according to the instruction input by the user and generates dialogue based on the assembled scenario. Alternatively, it retrieves dialogue stored in the recognition dictionary of the storage unit 67 provided in the conversation scenario execution unit 53. The dialogue generated or retrieved at this time includes, in addition to the dialogue for the voice output by the life support robot, the dialogue for the user, that is, the replies the user is expected to give to the robot's voice.
For example, when the input instruction from the user is “message”, the dialogue for the robot in the conversation scenario assembled by the conversation scenario execution unit 53 is “I will deliver a message. Who should I deliver it to?” The dialogue for the user is the name of the person to receive the message, such as “Mr. A” or “Mr. B”.

さらに、会話シナリオ実行部５３は、習熟度判定部６５が判定したユーザの習熟度に基づいて、遅延時間の長さＤを算出する。ユーザの習熟度は、予め習熟度判定部６５が判定したものであって、記憶部６７に記憶されたものが用いられる。例えば、習熟度判定部６５は、ロボットの電源が入れられてから経過した日数が長くなると、ユーザの習熟度は高くなったと判定して、遅延時間の長さＤを長くする信号を生成する。あるいは、同じ内容のシナリオの繰り返し回数が増えると、ユーザの習熟度は高くなったと判定して、遅延時間の長さＤを長くする信号を生成する。
会話シナリオ実行部５３は、ロボットに係る台詞の信号を音声合成部５５に出力し、音声認識部５９に対する音声認識開始信号、ユーザに係る台詞の信号および遅延時間の長さＤの信号を遅延制御部６１に出力する。
ロボットに係る台詞の信号が入力された音声合成部５５は、スピーカ１８に対して出力する出力音声信号を生成する。出力音声信号は、ロボットに係る台詞に対応した波形を有する信号である。出力音声信号は、発話時間算出部６３とＡＥＣ５７とスピーカ１８とに向けて出力される。
スピーカ１８は、入力された出力音声信号に基づいて、ユーザに対してロボットに係る台詞を出力音声として出力する。 Further, the conversation scenario execution unit 53 calculates the delay time length D based on the user's proficiency level determined by the proficiency level determination unit 65. The proficiency level used is one determined in advance by the proficiency level determination unit 65 and stored in the storage unit 67. For example, when the number of days elapsed since the robot was powered on becomes large, the proficiency level determination unit 65 determines that the user's proficiency level has increased and generates a signal for lengthening the delay time length D. Alternatively, when the number of repetitions of a scenario with the same content increases, it determines that the user's proficiency level has increased and generates a signal for lengthening the delay time length D.
The conversation scenario execution unit 53 outputs the signal of the dialogue for the robot to the speech synthesis unit 55, and outputs the speech recognition start signal for the voice recognition unit 59, the signal of the dialogue for the user, and the signal of the delay time length D to the delay control unit 61.
The speech synthesizer 55 to which the speech signal related to the robot is input generates an output speech signal to be output to the speaker 18. The output audio signal is a signal having a waveform corresponding to the dialogue related to the robot. The output audio signal is output toward the utterance time calculation unit 63, the AEC 57, and the speaker 18.
The speaker 18 outputs a speech related to the robot as output speech to the user based on the input output speech signal.
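The way the delay time length D grows with proficiency can be sketched as follows. This is only an illustrative sketch: the description states that more elapsed days or more scenario repetitions mean higher proficiency and a longer D, but gives no formula, so the function names, the level thresholds (`days_per_level`, `repetitions_per_level`, `max_level`) and the linear mapping (`base_delay`, `step`) are all assumptions introduced here.

```python
def proficiency_level(days_since_power_on, scenario_repetitions,
                      days_per_level=7, repetitions_per_level=5, max_level=4):
    """Hypothetical proficiency score: rises with elapsed days or repetitions."""
    by_days = days_since_power_on // days_per_level
    by_repetitions = scenario_repetitions // repetitions_per_level
    # Either indicator alone is enough to raise the level (assumed policy).
    return min(max(by_days, by_repetitions), max_level)


def delay_length(level, base_delay=0.5, step=0.5):
    """Hypothetical mapping: a higher proficiency level gives a longer D (seconds)."""
    return base_delay + step * level


level = proficiency_level(days_since_power_on=14, scenario_repetitions=3)
print(level, delay_length(level))
```

A longer D moves the recognition start earlier within the robot's utterance, which suits an experienced user who tends to answer before the robot has finished speaking.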

一方、発話時間算出部６３は、入力された出力音声信号に基づいて、スピーカ１８から実際に出力される出力音声の発話時間の長さＸを算出する。例えば、出力音声信号のバイト数をカウントして、割り算することにより発話時間の長さＸを求めることができる。なお、発話時間の長さＸの算出方法は、公知の方法を用いることができ、特に限定されるものではない。発話時間の長さＸが算出されると、発話時間算出部６３から発話時間の長さＸに係る信号が遅延制御部６１に出力される。
遅延制御部６１は、入力された発話時間の長さＸの信号と、遅延時間の長さＤの信号とに基づいて、開始時間の長さを算出する。具体的には、発話時間の長さＸから遅延時間の長さＤを引いた値（Ｘ−Ｄ）が開始時間の長さとなる。開始時間の長さ（Ｘ−Ｄ）とは、スピーカ１８から出力音声が出力されたときから、音声認識部５９における音声認識が開始されるまでの時間をいう。開始時間の長さ（Ｘ−Ｄ）が算出されると、遅延制御部６１は時間の計測を開始し、開始時間の長さ（Ｘ−Ｄ）が経過した時点で、会話シナリオ実行部５３から入力されていた、音声認識部５９に対する音声認識開始の信号とユーザに係る台詞の信号とを音声認識部５９に出力する。 On the other hand, the utterance time calculation unit 63 calculates the utterance time length X of the output sound actually output from the speaker 18 based on the input output sound signal. For example, the length X of the utterance time can be obtained by counting and dividing the number of bytes of the output audio signal. In addition, the calculation method of the length X of speech time can use a well-known method, and is not specifically limited. When the utterance time length X is calculated, a signal related to the utterance time length X is output from the utterance time calculation unit 63 to the delay control unit 61.
The delay control unit 61 calculates the length of the start time based on the input signal of the utterance time length X and the signal of the delay time length D. Specifically, the value (X-D) obtained by subtracting the delay time length D from the utterance time length X is the length of the start time. The length of the start time (X-D) is the time from when the speaker 18 begins to output the output sound until the speech recognition unit 59 starts speech recognition. Once the length of the start time (X-D) has been calculated, the delay control unit 61 starts measuring time, and when the length of the start time (X-D) has elapsed, it outputs to the voice recognition unit 59 the speech recognition start signal and the signal of the dialogue for the user that it received from the conversation scenario execution unit 53.
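The byte-count calculation of X and the start time (X-D) can be illustrated with a short sketch. It assumes the output speech signal is uncompressed PCM, so that the divisor is the number of bytes per second; the sample rate, sample width, channel count and the function names are assumptions for illustration and are not specified in the description.

```python
def utterance_seconds(num_bytes, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Utterance time length X: byte count divided by bytes per second (PCM assumed)."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return num_bytes / bytes_per_second


def recognition_start_offset(x, d):
    """Length of the start time (X-D), measured from the start of the output speech."""
    return x - d


x = utterance_seconds(96000)            # 96000 bytes at 32000 bytes/s
start = recognition_start_offset(x, d=1.0)
print(x, start)
```

With these assumed values a 3-second utterance and a 1-second delay time length give a recognition start 2 seconds after the speaker begins output, that is, 1 second before the utterance ends.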

なお、延長時間の長さＤは、開始時間の長さ（Ｘ−Ｄ）が、ＡＥＣ５７における処理が安定するのに要する時間以上となるように設定されることが望ましい。
このように設定することにより、ＡＥＣ５７の処理が安定した状態から音声認識を開始することができ、ユーザ音声の誤認識を防止することができる。ＡＥＣ５７の処理が安定するとは、入力信号から出力音声信号に相当する信号成分を除去する処理が安定することをいう。この処理が安定することで、出力音声信号が取り除かれた入力音声信号を安定して算出することができる。 The delay time length D is desirably set so that the length of the start time (X-D) is equal to or longer than the time required for the processing in the AEC 57 to stabilize.
By setting in this way, voice recognition can be started from a state where the processing of AEC 57 is stable, and erroneous recognition of user voice can be prevented. “AEC 57 processing is stable” means that processing for removing a signal component corresponding to an output audio signal from an input signal is stable. By stabilizing this process, it is possible to stably calculate the input audio signal from which the output audio signal has been removed.
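The constraint that the start time (X-D) should be no shorter than the AEC settling time can be expressed as an upper bound on D. The sketch below assumes the settling time is known as a single number of seconds (`aec_settle_seconds`, a hypothetical parameter); how that time would be measured is not covered in the description.

```python
def cap_delay_for_aec(x, d, aec_settle_seconds):
    """Return the largest delay not exceeding d that keeps x - d >= aec_settle_seconds.

    If the utterance is shorter than the settling time, the result is 0.0,
    so recognition starts only when the utterance ends (the best available choice).
    """
    max_delay = max(0.0, x - aec_settle_seconds)
    return min(d, max_delay)


print(cap_delay_for_aec(x=3.0, d=2.5, aec_settle_seconds=1.0))
```

In this example the requested delay of 2.5 seconds is reduced to 2.0 seconds, so recognition starts 1.0 second after output begins rather than 0.5 seconds.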

また、遅延時間の長さＤの取りうる範囲は、零以上、発話時間の長さＸ以下とすることが望ましい。つまり、遅延時間の長さＤが零の場合は、開始時間の長さ（Ｘ−Ｄ）が発話時間の長さＸと等しくなり、発話時間の終了と同時に音声認識が開始される。また、遅延時間の長さＤが発話時間の長さＸと等しい場合には、開始時間の長さ（Ｘ−Ｄ）は零となり、発話時間の開始とともに音声認識が開始される。 In addition, it is desirable that the range of the delay time length D is greater than or equal to zero and less than or equal to the speech time length X. That is, when the delay time length D is zero, the start time length (X−D) is equal to the speech time length X, and speech recognition is started simultaneously with the end of the speech time. When the length D of the delay time is equal to the length X of the utterance time, the length of the start time (X−D) becomes zero, and voice recognition is started as the utterance time starts.

一方、マイクロフォン１４は、マイクロフォン１４に入力された音を電気信号である入力信号に変換し、ＡＥＣ５７を介して音声認識部５９に出力している。入力信号は、マイクロフォン１４に入力された音の波形に対応した波形をもつ電気信号であり、ユーザの音声や、スピーカ１８から出力された音声や、その他の外来雑音の成分も含まれる信号である。
ＡＥＣ５７は、入力された入力信号と出力音声信号とに基づいて、入力信号から出力音声信号に相当する信号成分を除去し、入力音声信号が生成される。具体的には、入力信号と出力音声信号との相関関数を求め、求めた相関関数に基づいて入力信号から出力音声信号に相当する信号成分が除去されている。なお、ＡＥＣ５７としては、公知のエコーキャンセラーを用いることができ、特に限定するものではない。
また、ＡＥＣ５７に出力音声信号が入力されていない場合には、入力信号が処理されずにそのまま音声認識部５９に出力される。
なお、上述のマイクロフォン１４およびＡＥＣ５７は会話シナリオ実行部５３から出力される音声認識開始の信号の有無に関わらず、常に処理が行われている。 On the other hand, the microphone 14 converts the sound input to the microphone 14 into an input signal, which is an electrical signal, and outputs it to the voice recognition unit 59 via the AEC 57. The input signal is an electrical signal whose waveform corresponds to the waveform of the sound input to the microphone 14, and it contains the user's voice, the voice output from the speaker 18, and other external noise components.
The AEC 57 removes a signal component corresponding to the output audio signal from the input signal based on the input signal and the output audio signal, and an input audio signal is generated. Specifically, a correlation function between the input signal and the output audio signal is obtained, and a signal component corresponding to the output audio signal is removed from the input signal based on the obtained correlation function. Note that a known echo canceller can be used as the AEC 57 and is not particularly limited.
When no output audio signal is input to the AEC 57, the input signal is output to the audio recognition unit 59 without being processed.
The microphone 14 and the AEC 57 described above operate at all times, regardless of whether the speech recognition start signal has been output from the conversation scenario execution unit 53.
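For readers unfamiliar with echo cancellers, the following sketch shows one common way such a unit can be built. It is not the method of the AEC 57, which the description leaves open as any known echo canceller based on a correlation between the two signals; the sketch substitutes a normalized least-mean-squares (NLMS) adaptive filter, and the function name, tap count and step size are assumptions.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the reference (speaker) echo from mic."""
    weights = [0.0] * taps
    history = [0.0] * taps            # recent reference samples, newest first
    cleaned = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_sample - echo_estimate        # residual after echo removal
        norm = sum(h * h for h in history) + eps
        # NLMS update: move the weights toward reducing the residual.
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


rng = random.Random(0)
speaker = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0] + [0.5 * s for s in speaker[:-1]]     # echo only: delayed, attenuated
residual = nlms_echo_cancel(speaker, mic)
print(sum(v * v for v in residual[:20]), sum(v * v for v in residual[-200:]))
```

The residual is large in the first samples and shrinks as the filter adapts, which mirrors the point made in the description that processing is unstable immediately after output starts and that recognition is better begun once it has settled.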

音声認識部５９は、遅延制御部６１から入力される認識開始の信号に基づいて、ＡＥＣ５７から入力される入力音声信号の認識を行う。具体的には、認識開始の信号とともに入力されるユーザに係る台詞の信号と、入力音声信号とのマッチングを行い、入力音声信号がユーザに係る台詞の信号と一致するか否かを判定している。
音声認識部５９は、ユーザからの音声入力が所定期間ない場合には、音声認識を区切り、判定結果を会話シナリオ実行部５３に出力する。会話シナリオ実行部５３は、判定結果に基づいて次の会話シナリオを組み立てる。以降、上述の処理が繰り返される。
例えば、ロボットに係る台詞が、「伝言を伝えます。誰に伝えますか。」の場合、ユーザが「Ａさん」と応えると、音声認識部５９においてマッチングが行われ、ユーザが「Ａさん」と応えたと認識、判定される。会話シナリオ実行部５３はこの判定に基づいて、次の会話シナリオを組み立てる。
なお、音声認識部５９は、ユーザからの音声が所定期間入力されない場合に、音声認識を区切り、判定結果を会話シナリオ実行部５３に出力してもよいし、音声認識を終了してもよい。ユーザからの音声が所定期間入力されない場合としては、所定レベル以上の音声が一定時間、入力されなかった場合を挙げることができる。 The voice recognition unit 59 recognizes the input voice signal input from the AEC 57 based on the recognition start signal input from the delay control unit 61. Specifically, it matches the input voice signal against the signal of the dialogue for the user, which is input together with the recognition start signal, and determines whether or not the input voice signal matches the dialogue for the user.
When there is no voice input from the user for a predetermined period, the voice recognition unit 59 delimits voice recognition and outputs a determination result to the conversation scenario execution unit 53. The conversation scenario execution unit 53 assembles the next conversation scenario based on the determination result. Thereafter, the above process is repeated.
For example, if the dialogue for the robot is “I will deliver a message. Who should I deliver it to?” and the user answers “Mr. A”, matching is performed in the voice recognition unit 59, and it is recognized and determined that the user answered “Mr. A”. The conversation scenario execution unit 53 assembles the next conversation scenario based on this determination.
Note that the voice recognition unit 59 may divide the voice recognition and output the determination result to the conversation scenario execution unit 53 when the voice from the user is not input for a predetermined period, or may end the voice recognition. As a case where the voice from the user is not input for a predetermined period, a case where a voice of a predetermined level or higher is not input for a predetermined time can be cited.
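The rule for delimiting recognition, namely no voice at or above a predetermined level for a certain time, can be sketched over a sequence of per-frame input levels. The frame-based representation, the function name and the parameter names are assumptions made for illustration.

```python
def find_endpoint(levels, threshold, silence_frames):
    """Index of the frame completing `silence_frames` consecutive frames below threshold.

    Returns None if no such quiet run occurs in `levels`.
    """
    quiet_run = 0
    for index, level in enumerate(levels):
        quiet_run = quiet_run + 1 if level < threshold else 0
        if quiet_run >= silence_frames:
            return index
    return None


# Two loud frames (speech) followed by three quiet frames: delimit at index 4.
print(find_endpoint([0.9, 0.8, 0.1, 0.1, 0.1, 0.7], threshold=0.5, silence_frames=3))
```

Because the rule does not require speech to have been heard first, a user who never answers also triggers the delimiter, which is the situation that leads to the guidance scenario.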

また、音声認識部５９における判定で、ユーザが応えた内容が有効でない判定された場合には、会話シナリオ実行部５３は、生活支援ロボットにおける機能のガイダンスする内容のシナリオを選択する。会話シナリオ実行部５３が、ガイダンスに係るシナリオを選択する場合としては、ユーザの音声入力が適切な入力でなかった場合や、生活支援ロボットからの問いかけに対して、ユーザが有効な回答をしなかった場合などを挙げることができる。
図５は、図３の音声認識装置がガイダンスを行う場合の会話の流れを説明する模式図である。
生活支援ロボットの音声認識装置５１がガイダンスを行う場合、図５に示すように、ガイダンスに係る台詞の発話（ＳＰ）と同時に、音声認識部５９は音声の認識（ＲＣ）を開始する。
この場合、会話シナリオ実行部５３は、図３に示すように、発話時間の長さＸと等しい延長時間の長さＤの信号を遅延制御部６１に出力し、ガイダンスに係る台詞の信号を音声合成部５５に出力する。音声合成部５５は、入力されたガイダンスに係る台詞に基づいて出力音声信号を生成し、スピーカ１８および発話時間算出部６３に出力する。一方、遅延制御部６１には発話時間の長さＸの信号が入力され、開始時間（Ｘ−Ｄ）が算出される。遅延制御部６１は、開始時間（Ｘ−Ｄ）の算出結果（零）に基づいて、すぐに音声認識部５９に音声認識開始の信号を出力する。 If the voice recognition unit 59 determines that the content of the user's answer is not valid, the conversation scenario execution unit 53 selects a scenario that gives guidance on the functions of the life support robot. Cases in which the conversation scenario execution unit 53 selects a guidance scenario include the case where the user's voice input was not an appropriate input and the case where the user did not give a valid answer to a question from the life support robot.
FIG. 5 is a schematic diagram for explaining the flow of conversation when the voice recognition apparatus of FIG. 3 provides guidance.
When the speech recognition apparatus 51 of the life support robot performs guidance, as shown in FIG. 5, the speech recognition unit 59 starts speech recognition (RC) simultaneously with speech (SP) of speech related to the guidance.
In this case, as shown in FIG. 3, the conversation scenario execution unit 53 outputs a signal of a delay time length D equal to the utterance time length X to the delay control unit 61, and outputs the signal of the guidance dialogue to the speech synthesis unit 55. The speech synthesis unit 55 generates an output voice signal based on the input guidance dialogue and outputs it to the speaker 18 and the utterance time calculation unit 63. Meanwhile, the signal of the utterance time length X is input to the delay control unit 61, and the start time (X-D) is calculated. Based on the calculated start time (X-D), which is zero, the delay control unit 61 immediately outputs a speech recognition start signal to the speech recognition unit 59.

このように制御することにより、生活支援ロボットの音声認識装置５１がガイダンスを開始したとき、または、ガイダンス中にユーザが話しかけても、音声認識装置５１は、ユーザの発話を認識することができる。特に、ガイダンスは、ユーザからの音声入力に所定の空白期間があいた後に開始されているため、ユーザの発話時期と、ガイダンスの開始時期とが接近する可能性が高い。このような場合であっても、ガイダンスの開始時期と、音声の認識開始時期とが同時であるため、ユーザ音声認識の頭切れを防止することができる。 By controlling in this way, the speech recognition device 51 can recognize the user's utterance when the speech recognition device 51 of the life support robot starts guidance or when the user speaks during the guidance. In particular, since the guidance is started after a predetermined blank period in the voice input from the user, there is a high possibility that the user's utterance time approaches the guidance start time. Even in such a case, since the guidance start time and the voice recognition start time are the same, it is possible to prevent the user voice recognition from being interrupted.

図６は、図３の音声認識装置とユーザとの会話の流れにおける他の例を説明する模式図である。
また、会話シナリオ実行部５３が組み立てるシナリオの内容によっては、生活支援ロボットがシナリオを発話し終わるまで、ユーザが発話しない場合もある。
かかる場合、生活支援ロボットの音声認識装置５１は、図６に示すように、ロボットに係る台詞の発話（ＳＰ）を行っている間は、音声認識部５９において音声の認識を行わない。ロボットに係る台詞の発話が終了すると、音声認識装置５１は、音声認識部５９において音声の認識（ＲＣ）をし始める。
この場合、会話シナリオ実行部５３は、図３に示すように、時間長さが零の延長時間の長さＤの信号を遅延制御部６１に出力し、ロボットに係る台詞の信号を音声合成部５５に出力する。音声合成部５５は、入力されたガイダンスに係る台詞に基づいて出力音声信号を生成し、スピーカ１８および発話時間算出部６３に出力する。一方、遅延制御部６１には発話時間の長さＸの信号が入力され、開始時間（Ｘ−Ｄ）が算出される。遅延制御部６１は、開始時間（Ｘ−Ｄ）の算出結果（Ｘ）に基づいて、ロボットに係る台詞の発話終了後に音声認識開始の信号を音声認識部５９に出力する。 FIG. 6 is a schematic diagram for explaining another example in the flow of conversation between the voice recognition apparatus of FIG. 3 and the user.
Further, depending on the contents of the scenario assembled by the conversation scenario execution unit 53, the user may not speak until the life support robot finishes speaking the scenario.
In this case, as shown in FIG. 6, the speech recognition device 51 of the life support robot does not perform speech recognition in the speech recognition unit 59 while performing speech (SP) related to the robot. When speech of the robot is finished, the speech recognition device 51 starts speech recognition (RC) in the speech recognition unit 59.
In this case, as shown in FIG. 3, the conversation scenario execution unit 53 outputs a signal of a delay time length D of zero to the delay control unit 61, and outputs the signal of the dialogue for the robot to the speech synthesis unit 55. The speech synthesis unit 55 generates an output voice signal based on the input dialogue and outputs it to the speaker 18 and the utterance time calculation unit 63. Meanwhile, the signal of the utterance time length X is input to the delay control unit 61, and the start time (X-D) is calculated. Based on the calculated start time (X-D), which equals X, the delay control unit 61 outputs a speech recognition start signal to the speech recognition unit 59 after the robot has finished uttering its dialogue.
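The three timing patterns of FIGS. 4 to 6 differ only in the value chosen for D, which the sketch below makes explicit. The pattern labels (`"normal"`, `"guidance"`, `"listen_after"`) and the function name are hypothetical names introduced for this illustration; the clamp to the range from zero to X follows the range that the description calls desirable.

```python
def delay_for_pattern(pattern, x, normal_delay=0.0):
    """Choose the delay time length D (seconds) for an utterance of length x."""
    if pattern == "guidance":        # FIG. 5: D = X, recognition starts with the utterance
        return x
    if pattern == "listen_after":    # FIG. 6: D = 0, recognition starts when the utterance ends
        return 0.0
    if pattern == "normal":          # FIG. 4: recognition starts partway through the utterance
        return min(max(normal_delay, 0.0), x)
    raise ValueError(f"unknown pattern: {pattern}")


x = 4.0
for pattern in ("normal", "guidance", "listen_after"):
    d = delay_for_pattern(pattern, x, normal_delay=1.5)
    print(pattern, "D =", d, "start =", x - d)
```

Keeping the pattern choice in one place leaves the delay control logic unchanged: it always waits for X-D and then issues the recognition start signal.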

このように制御することにより、音声認識部５９による音声の認識開始時期とユーザの発話開始時期との間隔を短くすることができ、音声認識装置５１におけるユーザ音声の誤認識を防止できる。つまり、音声の認識開始からユーザが発話するまでの間隔を短くすることで、その間隔の間に外来音が発生する確率を低くできる。そのため、音声認識装置５１が、上記外来音をユーザ音声と誤認識することを防止することができる。 By controlling in this way, the interval between the speech recognition start time by the speech recognition unit 59 and the user's speech start time can be shortened, and erroneous recognition of the user speech in the speech recognition device 51 can be prevented. That is, by shortening the interval from the start of speech recognition until the user speaks, the probability that an external sound is generated during that interval can be reduced. Therefore, the voice recognition device 51 can be prevented from erroneously recognizing the external sound as a user voice.

上記の構成によれば、遅延制御部６１が、音声認識部５９におけるユーザ音声の認識開始のタイミングを、開始時間（Ｘ−Ｄ）に基づいて、出力音声の出力開始後、かつ、出力終了前に制御するため、ユーザ音声認識の頭切れを防止するとともに、テンポのよい会話を実現することができる。
ユーザ音声の認識開始のタイミングが、遅延制御部６１によりユーザ音声出力終了前に制御されるため、出力音声の出力終了前からユーザ音声の認識を開始することができる。そのため、ユーザが出力音声の出力終了直後、または、出力音声の出力中に話しても、音声認識部５９はユーザの音声を最初から認識でき、音声認識の頭切れを防止することができるとともに、テンポのよい会話を実現することができる。
ユーザ音声の認識開始のタイミングが出力音声の出力開始から開始時間（Ｘ−Ｄ）後であるため、ＡＥＣ５７の処理が安定した状態においてユーザ音声の認識を行うことができる。出力音声の出力開始直後は、ＡＥＣ５７の処理が不安定であり、かかる状態ではユーザ音声の誤認識が発生する恐れがある。上述のように、ユーザ音声の認識開始のタイミングを音声出力の出力開始から開始時間（Ｘ−Ｄ）後にすることで、ユーザ音声の誤認識を防止してテンポのよい会話を実現することができる。
出力音声の出力開始直後においては、ユーザの発話内容が、出力音声に係る台詞に対して有効でない回答の可能性が高い。そのため、ユーザ音声の認識開始のタイミングを出力音声の出力開始から開始時間（Ｘ−Ｄ）後とすることで、上記有効でない回答の音声認識を防止して、テンポのよい会話を実現することができる。 According to the above configuration, the delay control unit 61 controls the timing at which the speech recognition unit 59 starts recognizing the user's voice, based on the start time (X-D), so that it falls after the output of the output voice starts and before that output ends. This prevents the beginning of the user's speech from being cut off in recognition and realizes a well-paced conversation.
Since the timing at which user voice recognition starts is controlled by the delay control unit 61 to fall before the output of the output voice ends, recognition of the user's voice can begin before the output voice has finished. Therefore, even if the user speaks immediately after the output voice ends, or while it is still being output, the voice recognition unit 59 can recognize the user's voice from its beginning, which prevents the beginning of the speech from being cut off and realizes a well-paced conversation.
Since user voice recognition starts after the start time (X-D) has elapsed from the start of output of the output voice, the user's voice can be recognized while the processing of the AEC 57 is stable. Immediately after output of the output voice starts, the processing of the AEC 57 is unstable, and in that state the user's voice may be misrecognized. As described above, starting user voice recognition after the start time (X-D) has elapsed from the start of the voice output prevents misrecognition of the user's voice and realizes a well-paced conversation.
Immediately after output of the output voice starts, what the user says is likely to be an answer that is not valid for the dialogue of the output voice. Therefore, starting user voice recognition after the start time (X-D) has elapsed from the start of output of the output voice prevents such invalid answers from being recognized and realizes a well-paced conversation.

Because the start time length (X−D) is calculated by subtracting the delay time length D from the utterance time length X, the delay control unit 61 can set the start of user voice recognition to a predetermined timing.
Because the parameters that determine the start time length (X−D) include the utterance time length X of the dialogue, recognition of the user voice can always be started before output of the output voice for that dialogue ends, even if the length of the dialogue changes.
It is desirable that recognition of the user voice start immediately before the user speaks, because this prevents misrecognition of the user voice by the speech recognition device 51. That is, shortening the interval between the start of recognition and the start of the user's utterance lowers the probability that an extraneous sound occurs during that interval. The speech recognition device 51 can therefore be prevented from mistaking such an extraneous sound for the user voice.
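The start-time calculation described above can be illustrated with a minimal Python sketch. The function name, the use of seconds as the unit, and the clamping to zero when D exceeds X are assumptions made for illustration only; the specification states only that the start time is X−D.

```python
def recognition_start_time(utterance_len_s: float, delay_len_s: float) -> float:
    """Return the start time (X - D), measured from the start of voice output.

    utterance_len_s is the utterance time length X and delay_len_s is the
    delay time length D, both in seconds (the unit is an assumption).
    """
    if utterance_len_s < 0 or delay_len_s < 0:
        raise ValueError("X and D must be non-negative")
    # Clamping at zero is an assumption: recognition cannot start
    # before the output voice itself starts.
    return max(0.0, utterance_len_s - delay_len_s)


# A 3.0 s line with D = 0.5 s: recognition starts 2.5 s after output
# begins, that is, 0.5 s before the robot stops speaking.
start = recognition_start_time(3.0, 0.5)
```

With D set to zero, as in the guidance case, the same function returns X, so recognition starts when the output voice ends.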

Because the utterance time calculation unit 63 calculates the utterance time length X based on the output voice signal input to the speaker, it obtains the utterance time length X of the output voice that is actually output from the speaker. Because the delay control unit 61 controls the timing at which recognition of the user voice starts based on the calculated utterance time length X, the beginning of the user's utterance can be reliably prevented from being cut off in recognition.
For example, even when part of the dialogue contains a personal name, a nickname, or the like, and that part changes from one conversation to another, the utterance time calculation unit 63 can calculate the utterance time length X based on the output voice signal for the changed dialogue. The speech recognition device 51 can therefore reliably prevent the beginning of the user's utterance from being cut off in recognition.
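As a rough illustration of deriving X from the signal itself, the following Python sketch assumes the output voice signal is available as a sequence of mono PCM samples at a known sample rate. The specification does not state how the utterance time calculation unit 63 performs the calculation, so the function name, the sample rate, and the stand-in signals are all hypothetical.

```python
def utterance_length_s(samples, sample_rate_hz: int) -> float:
    """Return the utterance time length X in seconds for a mono PCM signal."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return len(samples) / sample_rate_hz


RATE_HZ = 16_000  # assumed sample rate of the synthesized voice

# Stand-ins for synthesized signals: the same line before and after a
# longer name has been substituted into it.
line_with_short_name = [0] * (2 * RATE_HZ)
line_with_long_name = [0] * (3 * RATE_HZ)

x_short = utterance_length_s(line_with_short_name, RATE_HZ)
x_long = utterance_length_s(line_with_long_name, RATE_HZ)
```

Because X is measured on the signal that is actually sent to the speaker, a changed name automatically yields a different X without any per-line bookkeeping.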

As described above, the proficiency level determination unit 65 may determine the user's proficiency level based on the number of days elapsed since the robot was powered on or on the number of times a scenario with the same content has been repeated. Alternatively, it may measure the time from when the conversation scenario execution unit 53 outputs dialogue until a speech recognition result is input from the speech recognition unit 59, and determine the user's proficiency level based on that time. The method is not particularly limited.
Determining the proficiency level in this way allows the user's proficiency level to be determined more reliably and in finer detail, which prevents the beginning of the user's utterance from being cut off in recognition and realizes a conversation with a good tempo.
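The criteria listed above could be combined along the following lines. This is only a sketch: the two-level classification, the threshold values, and the rule that a measured response time takes precedence are all assumptions chosen for illustration, not values given in the specification.

```python
from typing import Optional

# Hypothetical thresholds; the specification gives no concrete values.
DAYS_THRESHOLD = 30
REPETITION_THRESHOLD = 10
RESPONSE_TIME_THRESHOLD_S = 1.0


def proficiency_level(days_since_power_on: int,
                      scenario_repetitions: int,
                      response_time_s: Optional[float] = None) -> str:
    """Classify the user as 'novice' or 'experienced'."""
    # If a response time was measured, use it alone (assumed precedence):
    # a quick reply suggests the user already knows the scenario.
    if response_time_s is not None:
        if response_time_s < RESPONSE_TIME_THRESHOLD_S:
            return "experienced"
        return "novice"
    # Otherwise fall back to elapsed days or repetition count.
    if (days_since_power_on >= DAYS_THRESHOLD
            or scenario_repetitions >= REPETITION_THRESHOLD):
        return "experienced"
    return "novice"
```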

As described above, the utterance time calculation unit 63 may calculate the utterance time length of the output voice each time it is output from the speaker 18. Alternatively, the utterance time length for a predetermined line of dialogue may be measured in advance, and the measured value may be stored in the storage unit 67 in association with that line.
In the latter case, the calculation of the utterance time length by the utterance time calculation unit 63 can be omitted, which reduces the computational load during utterance. In addition, because the utterance time calculation unit 63 is no longer needed, the configuration of the speech recognition device can be simplified.
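The stored-value alternative amounts to a table lookup. In the Python sketch below, a dictionary stands in for the storage unit 67; the example lines and their durations are made-up values, and raising an error for an unknown line is an assumed design choice.

```python
# Stand-in for the storage unit 67: each predetermined line is associated
# with an utterance time length measured in advance (values are made up).
STORED_LENGTHS_S = {
    "Good morning.": 1.2,
    "What shall I do?": 1.5,
}


def stored_utterance_length(line: str, table=STORED_LENGTHS_S) -> float:
    """Return the pre-measured utterance time length X for a known line."""
    try:
        return table[line]
    except KeyError:
        # Only predetermined lines can be looked up; a line that changes
        # at run time would still need to be measured some other way.
        raise KeyError(f"no stored utterance length for line: {line!r}") from None
```

The lookup removes the per-utterance calculation, at the cost of covering only lines whose wording is fixed in advance.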

[Second Embodiment]
Next, a second embodiment of the present invention will be described with reference to FIGS. 7 and 8.
The basic configuration of the life support robot of this embodiment is the same as that of the first embodiment, but the utterance timing in the speech recognition device differs from that of the first embodiment. Therefore, in this embodiment, only the utterance timing in the speech recognition device will be described with reference to FIGS. 7 and 8, and description of the main body of the life support robot and the like will be omitted.
FIG. 7 is a block diagram illustrating the functions of the speech recognition device according to this embodiment.
Components identical to those of the first embodiment are given the same reference numerals, and their description is omitted.
As shown in FIG. 7, the speech recognition device 151 includes a conversation scenario execution unit (control unit) 153 that generates dialogue used in conversation with the user, a speech synthesis unit 55 that generates an output voice signal, a speaker 18 that outputs the output voice, a microphone 14 that converts input sound into an input signal, an AEC 57 that generates an input voice signal from the input signal, and a speech recognition unit 59 that recognizes the user voice based on the input voice signal.

The conversation scenario execution unit 153 selects a scenario for conversation with the user and generates dialogue composed of a plurality of sentences based on the selected scenario. The conversation scenario execution unit 153 is connected to the speech synthesis unit 55 and the delay control unit 61 so as to output electrical signals to them. The conversation scenario execution unit 153 is also connected to the speech recognition unit 59 so as to receive an electrical signal from it.

Next, the operation of the speech recognition device 151 in a conversation between the user and the life support robot configured as described above will be described.
FIG. 8 is a schematic diagram illustrating the flow of conversation between the speech recognition device of FIG. 7 and the user. In FIG. 8, the horizontal axis represents time, SP1 and SP2 represent the periods during which the life support robot is speaking, and RC represents the period during which the life support robot is recognizing speech.
When a voice instruction is input from the user to the speech recognition device 151, the speech recognition device 151 utters dialogue in response to the input instruction (for example, dialogue consisting of two sentences, a first sentence and a second sentence), as shown in FIG. 8 (SP1, SP2). Here, SP1 denotes the utterance of the first sentence, and SP2 denotes the utterance of the following second sentence.
The speech recognition device 151 does not perform speech recognition while uttering the first sentence (SP1), and starts speech recognition (RC) at the same time as it starts uttering the second sentence (SP2).

Next, the operation of the speech recognition device 151 during the back-and-forth (catch-ball type) conversation described above will be described.
As shown in FIG. 7, the conversation scenario execution unit 153 of the speech recognition device 151 assembles a conversation scenario according to the instruction input by the user, and generates dialogue based on the assembled scenario (for example, dialogue consisting of two sentences, a first sentence and a second sentence).
For example, suppose that the instruction input by the user is "message" and that, in the conversation scenario assembled by the conversation scenario execution unit 153, the robot's dialogue is "I will deliver a message. Who should I deliver it to?" In that case, the first sentence is "I will deliver a message." and the second sentence is "Who should I deliver it to?"

The conversation scenario execution unit 153 outputs the first sentence signal and the second sentence signal to the speech synthesis unit 55 in that order. Meanwhile, the conversation scenario execution unit 153 outputs a speech recognition start signal to the speech recognition unit 59, timed to coincide with the second sentence signal.
On receiving the first sentence and second sentence signals, the speech synthesis unit 55 generates the output voice signal to be output to the speaker 18. The output voice signal is output to the AEC 57 and the speaker 18. Based on the input output voice signal, the speaker 18 outputs the dialogue of the first and second sentences to the user as output voice.

The speech recognition unit 59 recognizes the input voice signal input from the AEC 57 based on the speech recognition start signal input from the conversation scenario execution unit 153. Here, the conversation scenario execution unit 153 outputs the speech recognition start signal at the same time as it outputs the second sentence signal. Speech recognition in the speech recognition unit 59 therefore starts at substantially the same time as the speaker 18 starts outputting the dialogue of the second sentence.
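The ordering of signals in this embodiment can be sketched in Python as a simple event plan. The event names, the tuple representation, and the `start_index` parameter are hypothetical; the sketch only shows that the recognition start signal is issued together with the second sentence rather than at a time computed from the utterance length.

```python
def plan_events(sentences, start_index=1):
    """Return the ordered events for one turn of the robot's dialogue.

    start_index is the zero-based index of the sentence whose output
    coincides with the start of recognition (1 = second sentence).
    """
    events = []
    for i, sentence in enumerate(sentences):
        if i == start_index:
            # The start signal is issued together with this sentence's signal.
            events.append(("start_recognition", None))
        events.append(("speak", sentence))
    return events


turn = plan_events(["I will deliver a message.", "Who should I deliver it to?"])
```

No timer or utterance length appears anywhere in the plan; the trigger is purely the position of the sentence within the dialogue.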

According to the above configuration, the speech recognition unit 59 starts recognizing the user voice at the timing when the output voice of the second sentence is output from the speaker 18. This prevents the beginning of the user's utterance from being cut off in recognition and realizes a conversation with a good tempo.
That is, users tend to listen to the output voice of the first sentence without speaking, and to begin speaking partway through the output voice of the second sentence. Starting recognition of the user voice at the timing when the output voice of the second sentence is output therefore prevents the beginning of the user's utterance from being cut off and realizes a conversation with a good tempo.

According to the above configuration, compared with controlling the recognition start timing by the time elapsed from the start of output of the output voice, the timing at which recognition of the user voice starts can be controlled in finer detail according to the structure of the dialogue. The beginning of the user's utterance can therefore be more reliably prevented from being cut off in recognition, and a conversation with a good tempo can be realized.
In addition, because the utterance time length is not used to control the recognition start timing, the timing can be controlled easily even when the utterance time length is difficult to calculate or takes a long time to calculate.

As described above, the dialogue generated by the conversation scenario execution unit 153 may consist of two sentences, a first sentence and a second sentence, or of more sentences, and is not particularly limited. In the latter case, speech recognition in the speech recognition unit 59 may start at the output timing of any sentence from the second sentence onward, and is not particularly limited.

As described above, the timing at which the speech recognition unit 59 starts speech recognition may be determined based only on the order of the sentences constituting the dialogue, or it may be determined based on the content of each sentence; it is not particularly limited.
That is, the sentences constituting the dialogue include sentences to which the user is unlikely to respond and sentences to which the user is likely to respond. The conversation scenario execution unit 153 may therefore perform control so that recognition of the user voice starts when output of the output voice for a sentence to which the user is likely to respond begins. Alternatively, it may perform control so that recognition of the user voice starts when the dialogue changes from a sentence to which the user is unlikely to respond to a sentence to which the user is likely to respond. This prevents the beginning of the user's utterance from being cut off in recognition and realizes a conversation with a good tempo.
Here, sentences to which the user is unlikely to respond include sentences that call out to the user and sentences that repeat back an instruction from the user. Sentences to which the user is likely to respond include sentences that ask the user for an instruction.
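A content-based choice of the start point might look like the following Python sketch. It assumes each sentence of the dialogue has already been tagged with a category by the scenario; the category labels and the function name are hypothetical, and the specification does not describe how sentences are classified.

```python
# Hypothetical category labels. Per the description, calls to the user and
# repetitions of the user's instruction rarely draw a reply, while requests
# for an instruction often do.
LIKELY_TO_DRAW_REPLY = {"request"}


def recognition_start_index(tagged_sentences):
    """Return the index of the first sentence likely to draw a reply.

    tagged_sentences is a list of (text, category) pairs. Returns None if
    no sentence in the dialogue is likely to draw a reply.
    """
    for i, (_text, category) in enumerate(tagged_sentences):
        if category in LIKELY_TO_DRAW_REPLY:
            return i
    return None


dialogue = [
    ("Hello.", "call"),
    ("You said: message.", "repeat"),
    ("Who should I deliver it to?", "request"),
]
index = recognition_start_index(dialogue)
```

For this three-sentence dialogue, recognition would start with the third sentence, which an order-only rule fixed at the second sentence would not select.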

The technical scope of the present invention is not limited to the above embodiments, and various modifications can be made without departing from the spirit of the present invention.
For example, in the above embodiments, the speech recognition device of the present invention has been described as applied to a life support robot. However, the present invention is not limited to life support robots and can also be applied to other equipment that accepts instructions from a person by voice.

FIG. 1 is a front view illustrating the configuration of a robot according to the first embodiment of the present invention.
FIG. 2 is a left side view illustrating the configuration of the life support robot shown in FIG. 1.
FIG. 3 is a block diagram illustrating the functions of the speech recognition device according to the first embodiment.
FIG. 4 is a schematic diagram illustrating the flow of conversation between the speech recognition device of FIG. 3 and the user.
FIG. 5 is a schematic diagram illustrating the flow of conversation when the speech recognition device of FIG. 3 provides guidance.
FIG. 6 is a schematic diagram illustrating another example of the flow of conversation between the speech recognition device of FIG. 3 and the user.
FIG. 7 is a block diagram illustrating the functions of the speech recognition device according to the second embodiment of the present invention.
FIG. 8 is a schematic diagram illustrating the flow of conversation between the speech recognition device of FIG. 7 and the user.

Explanation of symbols

14 Microphone
18 Speaker
51, 151 Speech recognition device
53, 153 Conversation scenario execution unit (control unit)
55 Speech synthesis unit
57 AEC (output voice removal unit)
59 Speech recognition unit
61 Delay control unit (control unit)
63 Utterance time calculation unit
67 Storage unit

Claims

A speech recognition device comprising:
a control unit that assembles dialogue;
a speech synthesis unit that generates an output voice signal based on the assembled dialogue;
a speaker that outputs output voice based on the generated output voice signal;
a microphone that converts sound including at least user voice uttered by a user into an input signal;
an output voice removal unit that generates an input voice signal by removing, based on the output voice signal, a signal component related to the output voice from the input signal; and
a speech recognition unit that recognizes the user voice based on the input voice signal and outputs a recognition result to the control unit,
wherein the control unit controls, based on the dialogue, the timing at which the speech recognition unit starts recognizing the user voice so that recognition starts after a predetermined time has elapsed from the start of output of the output voice and before the end of output of the output voice.

The speech recognition device according to claim 1, further comprising an utterance time calculation unit that calculates the length of the utterance time of the output voice for the dialogue based on the output voice signal,
wherein the control unit controls the timing at which the speech recognition unit starts recognizing the user voice based on the length of the utterance time.

The speech recognition device according to claim 1, further comprising a storage unit that stores in advance the length of the utterance time of the output voice for the dialogue,
wherein the control unit controls the timing at which the speech recognition unit starts recognizing the user voice based on the length of the utterance time stored in the storage unit.

The speech recognition device according to claim 2 or 3, wherein the control unit calculates a start time by subtracting a predetermined delay time from the utterance time, and
the speech recognition unit starts recognizing the user voice when the start time has elapsed from the start of output of the output voice.

The speech recognition device according to claim 4, wherein the control unit controls the timing at which the speech recognition unit starts recognizing the user voice by changing the length of the delay time.

The speech recognition device according to claim 1, wherein the control unit controls the timing at which the speech recognition unit starts recognizing the user voice based on the sentences constituting the dialogue.

A robot comprising a speech recognition device that recognizes a user's voice,
wherein the speech recognition device is the speech recognition device according to any one of claims 1 to 7.