JP2006251545A

JP2006251545A - Speech interaction system and computer program

Info

Publication number: JP2006251545A
Application number: JP2005069850A
Authority: JP
Inventors: Kenji Abe (阿部 賢司); Takuya Noda (野田 拓也); Masaharu Harada (原田 将治); Tasuke Ito (伊藤 太介)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2005-03-11
Filing date: 2005-03-11
Publication date: 2006-09-21
Anticipated expiration: 2025-03-11
Also published as: JP4667085B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech interaction system and a computer program that can maintain smooth interaction without making the user wait unnecessarily long.

SOLUTION: The speech interaction system receives speech, recognizes it on the basis of a speech recognition grammar, advances the interaction on the basis of the recognition results and of interaction scenario information describing how the interaction proceeds, and outputs a response to the received speech. The system analyzes the speech recognition grammar and, on the basis of the analysis results, increases or decreases the wait time before speech is received and/or the speech reception time during which speech is continuously received.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice dialogue system and a computer program that can adjust the time allowed for a user's response so that a dialogue conducted automatically between the user and a computer, following dialogue scenario information, proceeds smoothly.

In recent years, voice dialogue systems (IVR: Interactive Voice Response) such as voice portals that use a speech recognition system (ASR: Auto Speech Recognition) have begun to spread. A voice dialogue system makes it possible to provide services such as ticket reservation and courier redelivery requests without staffing each service location, which brings substantial benefits such as 24-hour availability and lower personnel costs.

On the other hand, because the system responds automatically to the user's voice, keeping the dialogue moving smoothly is an important issue. For example, if the user speaks too quietly and the voice dialogue system judges that it has not received any voice from the user, the system keeps waiting for voice input, so the dialogue is interrupted and a smooth dialogue is difficult to maintain.

Also, when ambient noise is input overlapping the user's voice, the voice dialogue system may judge that the user is still speaking and continue accepting voice for a predetermined time. From the user's point of view, the system returns no response even though the user has spoken, so the dialogue is interrupted and a smooth dialogue is difficult to maintain.

To solve these problems, Patent Document 1, for example, discloses a voice dialogue system that sets a fixed waiting time for voice input for the case where no voice from the user is received, and sets a maximum reception time for continuously accepting the user's voice for the case where there is ambient noise that cannot be distinguished from the user's voice input. In this way it minimizes the time for which the dialogue is interrupted and maintains a smooth dialogue.

Patent Document 2 discloses a voice dialogue system that avoids interruption of the dialogue by advancing the dialogue according to a support scenario prepared in advance when no voice input is received for a certain period.
Patent Document 1: Japanese Patent Application Laid-Open No. 8-076964
Patent Document 2: Japanese Patent Application Laid-Open No. 2000-048038

However, with the voice dialogue system disclosed in Patent Document 1, it is difficult to set an appropriate voice input waiting time and maximum reception time according to the progress of the dialogue. Because the values are set assuming the worst case, they are often longer than necessary. As a result, the system is often unresponsive for longer than necessary, and it is difficult to maintain a dialogue that feels smooth to the user. A further problem is that, when the dialogue scenario is generated dynamically, the voice input waiting time and maximum reception time cannot be set in advance.

Furthermore, the voice dialogue system disclosed in Patent Document 2 judges the progress of the dialogue from the presence or absence of voice input from the user, so support dialogue scenarios must be created to cover every conceivable case, which makes creating dialogue scenarios at implementation time more difficult.

The present invention has been made in view of these circumstances, and its object is to provide a voice dialogue system and a computer program that can maintain a smooth dialogue without making the user wait longer than necessary.

To achieve the above object, the voice dialogue system according to the first invention comprises means for receiving voice, means for recognizing the received voice based on a speech recognition grammar, means for advancing a dialogue based on the recognition result and on dialogue scenario information describing the procedure by which the dialogue proceeds, and means for outputting a response to the received voice. It is characterized by further comprising analysis means for analyzing the speech recognition grammar, and waiting time adjustment means for increasing or decreasing the waiting time until voice is received based on the analysis result of the speech recognition grammar.

The voice dialogue system according to the second invention is characterized in that, in the first invention, the speech recognition grammar specifies the voice patterns to be accepted and the keywords that the voice should contain, and the analysis means determines, based on the speech recognition grammar, the number of voice patterns that can be accepted and the number of keywords in the accepted voice.

The voice dialogue system according to the third invention is characterized in that, in the first or second invention, it comprises means for increasing or decreasing, based on the analysis result of the speech recognition grammar, the voice reception time during which voice can be received.

The voice dialogue system according to the fourth invention is characterized in that, in the third invention, the speech recognition grammar specifies the voice patterns to be accepted, and the analysis means determines, based on the speech recognition grammar, the maximum number of characters or the maximum number of morae contained in the character strings obtained when the acceptable voice patterns are recognized.

The computer program according to the fifth invention is executable on a computer that receives voice, recognizes the received voice based on a speech recognition grammar, advances a dialogue based on the recognition result and on dialogue scenario information describing the procedure by which the dialogue proceeds, and outputs a response to the received voice. It is characterized by causing the computer to function as analysis means for analyzing the speech recognition grammar and as waiting time adjustment means for increasing or decreasing the waiting time until voice is received based on the analysis result of the speech recognition grammar.

In the first and fifth inventions, the speech recognition grammar used along the dialogue scenario information is analyzed, and the waiting time until voice is received is increased or decreased based on the analysis result. Based on the speech recognition grammar, a long waiting time can thus be set for a question that requires considerable thinking time before the user can respond, and a short waiting time for a question that requires little thinking time. A smooth dialogue can therefore be maintained without interruptions that feel unnatural to the user.

In the second invention, the number of voice patterns that can be accepted and the number of keywords in the accepted voice are determined based on the speech recognition grammar. The number of voice patterns expected as the user's response and the number of keywords contained in the response can thus be determined from the grammar. When the number of voice patterns and the number of keywords are large, the question is judged to require considerable thinking time before the user can respond, and a long waiting time is set. When they are small, the question is judged to require little thinking time, and a short waiting time is set. A smooth dialogue can therefore be maintained without interruptions that feel unnatural to the user.

In the third invention, the voice reception time during which voice can be received is increased or decreased based on the analysis result of the speech recognition grammar. Based on the grammar, a long voice reception time can thus be set for a question whose expected utterance patterns contain many characters, and a short voice reception time for a question whose patterns contain few characters. A smooth dialogue can therefore be maintained without interruptions that feel unnatural to the user.

In the fourth invention, the speech recognition grammar is analyzed to determine the maximum number of characters or the maximum number of morae contained in the character strings obtained when the acceptable voice patterns are recognized. The maximum number of characters or morae in the utterance patterns with which the user is expected to respond can thus be identified from the grammar. A long voice reception time can be set for a question with a large maximum, and a short voice reception time for a question with a small maximum, so that a smooth dialogue can be maintained without interruptions that feel unnatural to the user.
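As a rough illustration of the fourth invention, the following Python sketch finds the maximum mora count over the kana readings of the acceptable patterns and maps it to a voice reception time. The kana-based mora counter is a simplification, and the function names, the linear mapping, and the constants `per_mora` and `margin` are all assumptions made here for illustration. The text above only states that the reception time is lengthened or shortened according to the maximum count.

```python
from typing import Iterable

# Small kana that merge with the preceding kana into one mora (e.g. "きょ").
# The sokuon "っ", the long-vowel mark "ー" and "ん" each count as one mora,
# so they are deliberately NOT listed here.
SMALL_KANA = set("ゃゅょゎぁぃぅぇぉャュョヮァィゥェォ")


def count_morae(reading: str) -> int:
    """Count morae in a kana reading (simplified; assumes kana-only input)."""
    return sum(1 for ch in reading if ch not in SMALL_KANA and not ch.isspace())


def max_morae(readings: Iterable[str]) -> int:
    """Largest mora count among all acceptable patterns (0 if there are none)."""
    return max((count_morae(r) for r in readings), default=0)


def reception_time(readings: Iterable[str],
                   per_mora: float = 0.25,   # hypothetical seconds per mora
                   margin: float = 1.0) -> float:  # hypothetical fixed margin
    """Hypothetical linear mapping from the maximum mora count to seconds."""
    return per_mora * max_morae(readings) + margin


if __name__ == "__main__":
    print(max_morae(["はい", "いいえ"]))       # "はい" has 2 morae, "いいえ" has 3
    print(reception_time(["はい", "いいえ"]))  # 0.25 * 3 + 1.0
```

A grammar whose longest answer is "いいえ" gets a shorter reception time than one whose answers include station names, which is the qualitative behavior described for the fourth invention.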

According to the first and fifth inventions, based on the speech recognition grammar, a long waiting time can be set for a question that requires considerable thinking time before the user can respond, and a short waiting time for a question that requires little thinking time, so that a smooth dialogue can be maintained without interruptions that feel unnatural to the user.

According to the second invention, the number of voice patterns expected as the user's response and the number of keywords contained in the response can be determined from the speech recognition grammar. When the number of voice patterns and the number of keywords are large, the question is judged to require considerable thinking time before the user can respond, and a long waiting time is set. When they are small, the question is judged to require little thinking time, and a short waiting time is set. A smooth dialogue can therefore be maintained without interruptions that feel unnatural to the user.

According to the third invention, based on the speech recognition grammar, a long voice reception time can be set for a question whose expected utterance patterns contain many characters, and a short voice reception time for a question whose patterns contain few characters, so that a smooth dialogue can be maintained without interruptions that feel unnatural to the user.

According to the fourth invention, the maximum number of characters or morae in the utterance patterns with which the user is expected to respond can be identified from the speech recognition grammar. A long voice reception time can be set for a question with a large maximum, and a short voice reception time for a question with a small maximum, so that a smooth dialogue can be maintained without interruptions that feel unnatural to the user.

Hereinafter, the present invention is described in detail with reference to the drawings showing its embodiments.

(Embodiment 1)
The voice dialogue system according to Embodiment 1 of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the voice dialogue system according to Embodiment 1. As shown in FIG. 1, the voice dialogue system according to Embodiment 1 includes a dialogue control device 10 provided with a voice input/output unit 20 that receives the user's voice and outputs response voice to the user.

The dialogue control device 10 comprises at least a CPU (central processing unit) 11, recording means 12, a RAM 13, a communication interface 14 for connecting to external communication means, and auxiliary recording means 15 that uses a portable recording medium 16 such as a DVD or CD.

The CPU 11 is connected via an internal bus 17 to the hardware components of the dialogue control device 10 described above. It controls those components and executes various software functions in accordance with the processing programs recorded in the recording means 12, for example a program that receives the user's voice and recognizes it, a program that reads the dialogue scenario information and generates a response to the recognition result, and a program that plays back the generated response.

The recording means 12 consists of a built-in fixed recording device (hard disk), a ROM, and the like. It records the processing programs needed for the device to function as the dialogue control device 10, obtained from an external computer via the communication interface 14 or from a portable recording medium 16 such as a DVD or CD-ROM. In addition to the processing programs, the recording means 12 records dialogue scenario information 121 describing the dialogue scenario used for automatic responses.

The RAM 13 consists of DRAM or the like and records temporary data generated while software is running. The communication interface 14 is connected to the internal bus 17 and, when connected so that it can communicate with an external network, can send and receive the data needed for processing.

The voice input/output unit 20 has a function of receiving the user's voice through a voice input device such as a microphone, converting it into voice data, and sending it to the CPU 11, and a function of playing back, at the instruction of the CPU 11, synthesized voice corresponding to the generated response from a voice output device such as a speaker.

The auxiliary recording means 15 uses a portable recording medium 16 such as a CD or DVD to load programs, data, and the like to be processed by the CPU 11 into the recording means 12. It can also write data processed by the CPU 11 to the medium for backup.

Although Embodiment 1 describes the case where the dialogue control device 10 and the voice input/output unit 20 are integrated, the configuration is not limited to this. A single voice input/output unit 20 may be connected to a plurality of dialogue control devices 10 via a network or the like.

To prompt the user to speak, the dialogue control device 10 of the voice dialogue system according to Embodiment 1 outputs voice from the voice input/output unit 20 at the command of the CPU 11, following the dialogue scenario information 121 stored in the recording means 12. For example, it outputs a question such as "Is your request ○○?", which limits the voice expected from the user next to "yes" or "no". The user's voice is recognized based on a speech recognition grammar that was either generated when the voice was output or recorded in advance as part of the dialogue scenario information 121.

The dialogue scenario information 121 is written, for example, in the VoiceXML (hereinafter, VXML) scenario description language so that the user's voice can be received during the dialogue. That is, the dialogue scenario information 121 describes the content output by the computer, the dialogue transitions according to the voice uttered by the user, the processing to be performed next according to the content of the voice, and so on.

When the user's voice is received from the voice input/output unit 20 in response to the output voice, it is recorded in the recording means 12 and the RAM 13 as waveform data of the received voice, or as data representing feature quantities obtained by acoustic analysis of the received voice. At the command of the CPU 11, speech recognition is then performed on the voice recorded in the RAM 13. The speech recognition engine used is not particularly limited; any commonly used engine will do.

The recording means 12 is not limited to a built-in hard disk. Any recording medium capable of recording a large volume of data may be used, such as a hard disk built into another computer connected via the communication interface 14.

The CPU 11 analyzes the speech recognition grammar used for speech recognition processing and determines the number x of voice patterns that can be accepted and the number y of keywords that the accepted voice should contain. FIG. 2 illustrates the voice patterns that can be accepted, and the keywords in the voice, when a question such as "Is your request ○○?" is output.

As shown in FIG. 2, for a question that limits the voice expected from the user next to "yes" or "no", there are two acceptable voice patterns, "yes" and "no". The number of acceptable voice patterns is therefore determined to be x = 2.

Since each accepted voice contains exactly one keyword, the number of keywords that the accepted voice should contain is determined to be y = 1.

FIG. 3 illustrates the voice patterns that can be accepted, and the keywords in the voice, when a question such as "Please tell me your ticket preferences" is output. In this case, the voice expected from the user next is a combination of departure station, arrival station, and seat conditions, such as "from Tokyo", "to Shin-Osaka", "non-smoking", and so on.

As shown in FIG. 3, if the number of combinations of "from Tokyo", "to Shin-Osaka", "non-smoking", and so on that make up the acceptable voice patterns is n (n being a natural number), the number of acceptable voice patterns is determined to be x = n. The number y of keywords in the accepted voice is the maximum number of words contained in any voice pattern; in the example of FIG. 3, y = 3.
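The analysis described for FIG. 2 and FIG. 3 can be sketched in Python as follows. This is a minimal illustration that assumes the grammar has already been expanded into a list of acceptable patterns, each a list of keywords. The function name `analyze_grammar`, that representation, and the three-pattern ticket grammar (a small illustrative subset, not the full set of n combinations) are assumptions, not part of the source.

```python
from typing import Sequence, Tuple


def analyze_grammar(patterns: Sequence[Sequence[str]]) -> Tuple[int, int]:
    """Return (x, y) for an expanded grammar.

    x: number of acceptable voice patterns.
    y: maximum number of keywords contained in any single pattern.
    """
    x = len(patterns)
    y = max((len(p) for p in patterns), default=0)
    return x, y


# Yes/no question as in FIG. 2: two patterns, one keyword each.
yes_no = [["yes"], ["no"]]

# Illustrative subset of a ticket-condition grammar as in FIG. 3.
ticket = [
    ["from Tokyo"],
    ["from Tokyo", "to Shin-Osaka"],
    ["from Tokyo", "to Shin-Osaka", "non-smoking"],
]

print(analyze_grammar(yes_no))  # (2, 1)
print(analyze_grammar(ticket))  # (3, 3)
```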

The CPU 11 analyzes the speech recognition grammar for each question and obtains, as the analysis result for that question, the number x of voice patterns that can be accepted and the number y of keywords in the accepted voice. It then sets the waiting time T1 for voice input from the user, question by question, as a function of x and y. For example, T1 can be calculated as in (Equation 1), where a, b, and c are constants.

(Equation 1)
T1 = ax + by + c

That is, the larger the number x of acceptable voice patterns, the more items the user has to choose from, so the user's thinking time is expected to be longer. Likewise, the larger the number y of keywords in the accepted voice, the longer the user is expected to need for organizing and sorting the answer. Varying the waiting time T1 for voice input according to the speech recognition grammar therefore makes it possible to continue a dialogue that feels natural to the user.
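A worked instance of (Equation 1) may help. The source states only that a, b, and c are constants; the default values in the Python sketch below are hypothetical and chosen purely so that the arithmetic is easy to follow.

```python
def waiting_time(x: int, y: int,
                 a: float = 0.25,  # hypothetical seconds per acceptable pattern
                 b: float = 0.5,   # hypothetical seconds per keyword
                 c: float = 2.0) -> float:  # hypothetical base waiting time
    """Waiting time T1 = a*x + b*y + c from (Equation 1)."""
    return a * x + b * y + c


# Yes/no question: x = 2 patterns, y = 1 keyword.
print(waiting_time(2, 1))  # 3.0

# A question with x = 3 patterns and up to y = 3 keywords gets a longer wait.
print(waiting_time(3, 3))  # 4.25
```

With positive a and b, T1 grows with both x and y, which matches the reasoning that more choices and more keywords imply a longer thinking time.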

FIG. 4 is a flowchart showing the processing procedure of the CPU 11 of the dialogue control device 10 in the voice dialogue system according to Embodiment 1. The CPU 11 acquires the speech recognition grammar contained in the dialogue scenario information 121 (step S401) and calculates the number of acceptable voice patterns (step S402). The CPU 11 then calculates the number of keywords contained in each acceptable voice pattern (step S403) and determines whether the calculated number is the maximum so far (step S404). Specifically, the number of keywords calculated so far is recorded in the RAM 13 and compared with each newly calculated number.

When the CPU 11 determines that the calculated number of keywords is the maximum (step S404: YES), it records that number in the RAM 13 as the maximum number of keywords (step S405). The CPU 11 repeats the processing from step S403 onward for every voice pattern (step S406: NO). When it determines that all voice patterns have been processed (step S406: YES), it calculates the waiting time T1 for the user's voice input from the calculated number x of acceptable voice patterns and the calculated number y of keywords (step S407).

The CPU 11 outputs synthesized voice based on the dialogue scenario information (step S408) and determines whether the user's voice has been received (step S409). When the CPU 11 determines that the user's voice has been received (step S409: YES), it recognizes the received voice (step S410) and proceeds to the next part of the dialogue scenario. When the CPU 11 determines that no voice has been received (step S409: NO), it determines whether the calculated waiting time T1 has elapsed (step S411).

The CPU 11 waits for the user's voice input until the calculated waiting time T1 has elapsed (step S411: NO). When the CPU 11 determines that T1 has elapsed (step S411: YES), it executes predetermined error processing (step S412). This maintains a smooth dialogue without giving the user the impression that the dialogue was interrupted unnaturally.
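Steps S408 to S412 amount to a prompt followed by a polling loop with a deadline. The Python sketch below is one possible rendering under stated assumptions: `poll`, `recognize`, and `say` are hypothetical callbacks standing in for the voice input/output unit and the recognition engine, and the returned status strings are illustrative. The source does not specify how the error processing of step S412 behaves, so the sketch only reports a timeout.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_speech(poll: Callable[[], Optional[bytes]],
                    t1: float,
                    interval: float = 0.01) -> Optional[bytes]:
    """Poll for voice input until it arrives or the waiting time t1 elapses."""
    deadline = time.monotonic() + t1
    while True:
        audio = poll()                    # S409: has voice been received?
        if audio is not None:
            return audio
        if time.monotonic() >= deadline:  # S411: has T1 elapsed?
            return None
        time.sleep(interval)


def run_turn(prompt: str,
             poll: Callable[[], Optional[bytes]],
             recognize: Callable[[bytes], str],
             t1: float,
             say: Callable[[str], None] = print) -> Tuple[str, Optional[str]]:
    """One question/answer turn of the dialogue."""
    say(prompt)                           # S408: output the question
    audio = wait_for_speech(poll, t1)
    if audio is None:
        return ("timeout", None)          # S412: caller runs its error processing
    return ("ok", recognize(audio))       # S410: recognize, then move on


if __name__ == "__main__":
    # Stand-in callbacks: the "microphone" immediately returns some bytes.
    result = run_turn("Is your request ○○?", lambda: b"hai",
                      lambda audio: audio.decode(), t1=3.0)
    print(result)
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline unaffected by system clock adjustments.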

The waiting time for each question in the dialogue scenario may be set all at once in advance or set dynamically during the dialogue. If the speech recognition grammar is fixed, setting the waiting times in advance is preferable; if the grammar is changed or extended dynamically, the waiting times must also be set dynamically. Even when the grammar is changed or extended dynamically, the waiting times may be set in advance and recalculated whenever the grammar changes.
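The last option, precomputing the waiting times and recalculating them only when a grammar changes, can be sketched as a small cache. Everything in this Python example is an assumption made for illustration: the class name `WaitTimeTable`, the representation of a grammar as a list of keyword lists, and the constants inside `linear_wait`.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Patterns = Sequence[Sequence[str]]


def linear_wait(patterns: Patterns) -> float:
    """T1 = a*x + b*y + c with hypothetical constants a=0.25, b=0.5, c=2.0."""
    x = len(patterns)
    y = max((len(p) for p in patterns), default=0)
    return 0.25 * x + 0.5 * y + 2.0


class WaitTimeTable:
    """Per-question waiting times, recomputed only when the grammar changes."""

    def __init__(self, compute: Callable[[Patterns], float] = linear_wait) -> None:
        self._compute = compute
        self._cache: Dict[Hashable, Tuple[tuple, float]] = {}

    def precompute(self, grammars: Dict[Hashable, Patterns]) -> None:
        """Set the waiting times for all known questions in advance."""
        for question_id, patterns in grammars.items():
            self.get(question_id, patterns)

    def get(self, question_id: Hashable, patterns: Patterns) -> float:
        """Return the cached time, recalculating if the grammar differs."""
        key = tuple(tuple(p) for p in patterns)  # hashable grammar snapshot
        cached = self._cache.get(question_id)
        if cached is None or cached[0] != key:
            cached = (key, self._compute(patterns))
            self._cache[question_id] = cached
        return cached[1]


table = WaitTimeTable()
table.precompute({"confirm": [["yes"], ["no"]]})
print(table.get("confirm", [["yes"], ["no"]]))             # cached value
print(table.get("confirm", [["yes"], ["no"], ["maybe"]]))  # grammar changed
```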

なお、音声認識用文法の内容に応じて、算出した音声入力の待ち時間Ｔ１を修正する修正係数を事前に記録手段１２に記録しておいても良い。図５は、音声入力の待ち時間を修正する修正係数を記録手段１２に記録している場合の、修正係数照会の手順を模式的に示す図である。 Note that a correction coefficient for correcting the calculated speech input waiting time T1 may be recorded in the recording unit 12 in advance according to the content of the speech recognition grammar. FIG. 5 is a diagram schematically showing the procedure of the correction coefficient inquiry when the correction coefficient for correcting the voice input waiting time is recorded in the recording means 12.

図５では、音声認識用文法の例として、（ａ）、（ｂ）、（ｃ）の３つの場合を示している。音声認識用文法が図５（ａ）である場合、ＣＰＵ１１は、文法内容「行き先応答」をキー情報として記録手段１２を照会し、修正係数「１．８」を抽出する。同様に、音声認識用文法が図５（ｂ）、図５（ｃ）である場合、ＣＰＵ１１は、それぞれ文法内容「注文内容確認」、「枚数確認」をキー情報として記録手段１２を照会し、修正係数「１．２」、「１．０」を抽出する。 FIG. 5 shows three cases (a), (b), and (c) as examples of the speech recognition grammar. When the speech recognition grammar is FIG. 5A, the CPU 11 inquires the recording means 12 using the grammar content “destination response” as key information, and extracts the correction coefficient “1.8”. Similarly, when the speech recognition grammars are FIG. 5B and FIG. 5C, the CPU 11 inquires the recording means 12 using the grammar contents “order content confirmation” and “number confirmation” as key information, respectively. Correction coefficients “1.2” and “1.0” are extracted.

このように、文法内容に応じて、利用者による思考時間の長短を設定することができ、同じ音声パターンの場合（図５（ｂ）、図５（ｃ））であっても、想定される利用者による思考時間の長短を考慮した修正係数を求めることができ、求めた修正係数を音声入力の待ち時間Ｔ１に乗ずることにより、より実際の対話に即した待ち時間を設定することができ、より自然な対話を具現化することが可能となる。 In this way, the length of the user's thinking time can be set according to the grammar content. Even when the voice patterns are the same (FIG. 5(b) and FIG. 5(c)), a correction coefficient that reflects the expected length of the user's thinking time can be obtained. Multiplying the voice input waiting time T1 by this coefficient yields a waiting time that better matches actual conversation, so a more natural dialogue can be realized.
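The correction coefficient lookup can be illustrated with a minimal Python sketch. The coefficients 1.8, 1.2 and 1.0 are the values the text gives for FIG. 5. The English keys, the function name `corrected_waiting_time`, and the fallback coefficient of 1.0 for unknown grammar content are assumptions made for this example, and T1 is taken as an input because its calculation is described elsewhere in the document.

```python
# Correction coefficients keyed by grammar content (values as described for FIG. 5).
# The English keys are translations chosen for this sketch.
CORRECTION_COEFFICIENTS = {
    "destination response": 1.8,
    "order content confirmation": 1.2,
    "number confirmation": 1.0,
}


def corrected_waiting_time(
    t1_seconds: float,
    grammar_content: str,
    default_coefficient: float = 1.0,  # assumption: leave T1 unchanged for unknown keys
) -> float:
    """Multiply the calculated waiting time T1 by the looked-up correction coefficient."""
    coefficient = CORRECTION_COEFFICIENTS.get(grammar_content, default_coefficient)
    return t1_seconds * coefficient
```

With these values, a question whose grammar content is "destination response" receives a waiting time 1.8 times the calculated T1, while "number confirmation" leaves T1 unchanged.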

以上のように本実施の形態１によれば、音声認識用文法から利用者による応答として想定している音声のパターン数及び応答音声に含まれるキーワード数を決定することができ、例えば音声のパターン数及キーワード数が多い場合には、利用者が応答するのに相当の思考時間を要する設問であると判断し、待ち時間を長く設定する。また、音声のパターン数及キーワード数が少ない場合には、利用者が応答するのにそれほど思考時間を必要としない設問であると判断し、待ち時間を短く設定する。したがって、利用者にとって不自然な対話中断時間が生じることがなく、円滑な対話を維持することが可能となる。 As described above, according to Embodiment 1, the number of voice patterns expected as the user's response and the number of keywords contained in the response can be determined from the speech recognition grammar. For example, when the number of voice patterns and the number of keywords are large, the question is judged to require considerable thinking time before the user responds, and the waiting time is set longer. When the number of voice patterns and the number of keywords are small, the question is judged to require little thinking time, and the waiting time is set shorter. Therefore, no unnatural interruption of the dialogue occurs for the user, and a smooth dialogue can be maintained.

また、認識用文法の内容に応じて、設定する待ち時間を修正する修正係数を変動させることにより、実際の対話に即した待ち時間を設定することができ、より自然な対話を維持することが可能となる。また、修正係数は、設問に対する応答の速さに応じて変動させても良い。例えば設問に対する応答が速いほど、利用者が該システムの操作に慣れているものと推定することができ、修正係数を小さくすることにより待ち時間を短くすることが可能となる。 In addition, by varying the correction coefficient applied to the waiting time according to the content of the recognition grammar, a waiting time that matches actual conversation can be set and a more natural dialogue can be maintained. The correction coefficient may also be varied according to how quickly the user responds to questions. For example, the faster the responses, the more familiar the user can be presumed to be with the system, so the waiting time can be shortened by reducing the correction coefficient.

なお、本実施の形態１では、音声認識用文法に基づく利用者による音声入力の待ち時間を、音声対話システムから合成音声を出力するごとに算出する例について説明しているが、特にこれに限定されるものではなく、対話シナリオ情報に沿った対話の開始時、又は開始する前に、すべての音声認識用文法について解析し、解析結果に基づいて利用者による音声入力の待ち時間を算出して記録手段１２に記録しておくものであっても良い。 In the first embodiment, an example is described in which the waiting time for speech input by the user based on the speech recognition grammar is calculated every time the synthesized speech is output from the speech dialogue system. However, at the start or before the start of the dialogue according to the dialogue scenario information, all grammars for speech recognition are analyzed, and the waiting time for speech input by the user is calculated based on the analysis result. It may be recorded in the recording means 12.

（実施の形態２） (Embodiment 2)
以下、本発明の実施の形態２に係る音声対話システムについて図面に基づいて具体的に説明する。本発明の実施の形態２に係る音声対話システムの構成は、実施の形態１と同様であることから、同一の符号を付することで詳細な説明を省略する。本実施の形態２は、利用者による音声を継続して入力する時間の上限値を定める点において実施の形態１と相違する。 Hereinafter, the voice dialogue system according to Embodiment 2 of the present invention will be described in detail with reference to the drawings. Since the configuration of the voice dialogue system according to Embodiment 2 is the same as that of Embodiment 1, the same reference numerals are used and a detailed description is omitted. Embodiment 2 differs from Embodiment 1 in that it sets an upper limit on the time during which the user's voice is continuously input.

本発明の実施の形態２に係る音声対話システムの対話制御装置１０は、利用者による音声の入力を促すために、記録手段１２に記憶されている対話シナリオ情報１２１に沿って、ＣＰＵ１１の指令により音声入出力部２０から音声出力を行う。例えば、「ご用件は、○○ですか」等、次に利用者に期待する音声を「はい」、「いいえ」に限定することができる質問を音声出力する。利用者による音声を認識する場合、音声出力時に生成した、又は事前に対話シナリオ情報１２１の一部として記録してある音声認識用文法に基づいて音声認識する。 The dialogue control device 10 of the voice dialogue system according to Embodiment 2 of the present invention outputs voice from the voice input/output unit 20 at the command of the CPU 11, following the dialogue scenario information 121 stored in the recording means 12, in order to prompt the user to input a voice. For example, it outputs a question such as "Is your request XX?", to which the voice expected next from the user can be limited to "yes" or "no". When the user's voice is recognized, recognition is based on a speech recognition grammar generated at the time of voice output or recorded in advance as part of the dialogue scenario information 121.

なお、音声認識用文法を含む対話シナリオ情報１２１は、例えばＶｏｉｃｅＸＭＬ（以下、ＶＸＭＬ）シナリオ記述言語により、対話における利用者の音声を受け付けることができるよう記述される。すなわち、対話シナリオ情報１２１には、コンピュータ側からの出力の内容、利用者の発した音声に応じた対話の遷移、音声の内容に応じて次に行うべき処理等が記述される。 The dialogue scenario information 121 including the speech recognition grammar is described so as to be able to accept the voice of the user in the dialogue by, for example, VoiceXML (hereinafter, VXML) scenario description language. That is, the dialogue scenario information 121 describes the content of the output from the computer, the transition of the dialogue according to the voice uttered by the user, the processing to be performed next according to the content of the voice, and the like.

ＣＰＵ１１は、音声認識処理に用いる音声認識用文法を解析し、受付けることができる音声のパターンに含まれる最大文字数又は最大モーラ数ｚを決定する。図６は、例えば「ご用件は、○○ですか」等の質問を音声出力した場合に受付けることができる音声のパターンと音声パターンを音声認識した場合に含まれる認識文字数の例示図である。 The CPU 11 analyzes the speech recognition grammar used for the speech recognition processing and determines the maximum number of characters or the maximum number of morae z contained in the acceptable voice patterns. FIG. 6 is an exemplary diagram of the voice patterns that can be accepted when a question such as "Is your request XX?" is output by voice, and of the number of recognized characters obtained when each voice pattern is recognized.

図６に示すように、次に利用者に期待する音声を「はい」、「いいえ」に限定するような質問の場合、受付けることができる音声のパターンは「はい」、「いいえ」の２パターンである。したがって、受付けることができる音声のパターンを音声認識した場合の認識文字数は、それぞれ２、３であることから、最大文字数ｚ＝３と決定することができる。 As shown in FIG. 6, for a question that limits the voice expected next from the user to "yes" (はい) or "no" (いいえ), the acceptable voice patterns are the two patterns "はい" and "いいえ". Since the numbers of recognized characters for these patterns are 2 and 3, respectively, the maximum number of characters is determined to be z = 3.

図７は、例えば「どの新幹線をご利用ですか」等の質問を音声出力した場合に受付けることができる音声のパターンと、音声パターンを音声認識した場合に含まれる認識文字数の例示図である。この場合、次に利用者に期待する音声は、例えば「とうほくしんかんせん」、「とうかいどうしんかんせん」、「さんようしんかんせん」となる。したがって、受付けることができる音声のパターンを音声認識した場合の認識文字数は、それぞれ１０、１２、１０であることから、最大文字数ｚ＝１２と決定することができる。 FIG. 7 is an exemplary diagram of the voice patterns that can be accepted when a question such as "Which Shinkansen will you use?" is output by voice, and of the number of recognized characters obtained when each voice pattern is recognized. In this case, the voices expected next from the user are, for example, "とうほくしんかんせん" (Tohoku Shinkansen), "とうかいどうしんかんせん" (Tokaido Shinkansen), and "さんようしんかんせん" (Sanyo Shinkansen). Since the numbers of recognized characters for these patterns are 10, 12, and 10, respectively, the maximum number of characters is determined to be z = 12.
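The determination of z for the examples of FIG. 6 and FIG. 7 can be reproduced with a short Python sketch. The function names are hypothetical, and the mora counter is a deliberate simplification assumed here (each kana counts as one mora except small kana such as "ゃ", which merge with the preceding kana); the source does not specify how morae are counted.

```python
from typing import Callable, Iterable

# Small kana that combine with the preceding kana into a single mora.
# The small "っ"/"ッ" is intentionally absent: it is counted as its own mora.
_SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def char_count(pattern: str) -> int:
    """Number of characters in the recognized kana string."""
    return len(pattern)


def mora_count(pattern: str) -> int:
    """Simplified mora count for a kana string (assumption of this sketch)."""
    return sum(1 for ch in pattern if ch not in _SMALL_KANA)


def max_length(patterns: Iterable[str], measure: Callable[[str], int] = char_count) -> int:
    """z: the largest character (or mora) count among the acceptable patterns."""
    return max(measure(p) for p in patterns)


yes_no = ["はい", "いいえ"]
shinkansen = ["とうほくしんかんせん", "とうかいどうしんかんせん", "さんようしんかんせん"]
print(max_length(yes_no))      # 3
print(max_length(shinkansen))  # 12
```

Passing `measure=mora_count` gives the mora-based variant of z; for the patterns above both measures agree because none of them contains small kana.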

ＣＰＵ１１は、音声認識用文法の解析結果として、受付けることができる音声のパターンを音声認識した場合の最大文字数ｚを取得し、ｚの関数として利用者からの音声を継続して受付ける音声受付時間Ｔ２を随時設定する。例えば、音声受付時間Ｔ２は、ｄ、ｅを定数として、（数２）のように算出することができる。 The CPU 11 obtains, as the result of analyzing the speech recognition grammar, the maximum number of characters z obtained when the acceptable voice patterns are recognized, and sets, as needed, the voice reception time T2 during which voice from the user is continuously accepted, as a function of z. For example, the voice reception time T2 can be calculated as in (Equation 2), where d and e are constants.

（数２） (Equation 2)
Ｔ２＝ｄｚ＋ｅ T2 = dz + e

すなわち、受付けることができる音声のパターンを音声認識した場合の最大文字数ｚが大きいほど、利用者にとっては入力すべき音声が長くなる。したがって、利用者が入力する音声の受付時間を不要に長く設定することが無く、周囲の雑音が誤検出された場合、誤検出されたことを通常の対話と同程度の時間で利用者が知ることができ、より自然な対話を維持することが可能となる。なお、ｚは、受付けることができる音声を音声認識した文字列に含まれる最大モーラ数であっても良い。また１つの発声に複数のキーワードが含まれている場合、キーワード間の時間間隔を考慮してｚを算出しても良い。 That is, the larger the maximum number of characters z obtained when the acceptable voice patterns are recognized, the longer the voice the user must input. Consequently, the reception time for the user's voice is never set unnecessarily long, and if ambient noise is erroneously detected, the user learns of the erroneous detection within about the same time as in a normal conversation, so a more natural dialogue can be maintained. Note that z may instead be the maximum number of morae contained in the character string obtained by recognizing an acceptable voice. When a single utterance contains a plurality of keywords, z may be calculated taking the time interval between keywords into account.
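As a worked illustration of Equation 2, the following Python sketch evaluates T2 for the two values of z derived above (3 and 12). The source gives no values for the constants d and e, so d = 0.25 seconds per character and e = 1.0 second are purely hypothetical, as is the function name.

```python
def voice_reception_time(z: int, d: float = 0.25, e: float = 1.0) -> float:
    """Equation 2: T2 = d * z + e.

    d and e are constants; the defaults are illustrative values only
    and do not come from the source.
    """
    if z < 0:
        raise ValueError("z must be non-negative")
    return d * z + e


print(voice_reception_time(3))   # 1.75 (yes/no question, z = 3)
print(voice_reception_time(12))  # 4.0  (Shinkansen question, z = 12)
```

Under these assumed constants, the yes/no question is given a much shorter reception window than the Shinkansen question, which is the behavior the paragraph above describes.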

図８は、本発明の実施の形態２に係る音声対話システムの対話制御装置１０のＣＰＵ１１の処理手順を示すフローチャートである。対話制御装置１０のＣＰＵ１１は、対話シナリオ情報１２１に含まれている音声認識用文法を取得し（ステップＳ８０１）、受付けることが可能な音声パターンごとに音声認識した場合の認識文字数を算出し（ステップＳ８０２）、算出した認識文字数が最大値であるか否かを判断する（ステップＳ８０３）。具体的には、算出した認識文字数をＲＡＭ１３に記録しておき、随時算出した認識文字数と比較することにより判断する。 FIG. 8 is a flowchart showing the processing procedure of the CPU 11 of the dialogue control device 10 of the voice dialogue system according to Embodiment 2 of the present invention. The CPU 11 of the dialogue control device 10 acquires the speech recognition grammar included in the dialogue scenario information 121 (step S801), calculates the number of recognized characters obtained when each acceptable voice pattern is recognized (step S802), and determines whether or not the calculated number of recognized characters is the maximum value (step S803). Specifically, the calculated number of recognized characters is recorded in the RAM 13 and compared with each number of recognized characters calculated thereafter.

ＣＰＵ１１が、算出した認識文字数が最大値であると判断した場合（ステップＳ８０３：ＹＥＳ）、ＣＰＵ１１は、受付けることが可能な音声パターンを音声認識した場合の最大文字数又は最大モーラ数として算出した認識文字数をＲＡＭ１３に記録する（ステップＳ８０４）。ＣＰＵ１１は、全ての音声パターンについて、ステップＳ８０２以下の処理を繰り返し実行し（ステップＳ８０５：ＮＯ）、ＣＰＵ１１が全ての音声パターンについて処理が完了したと判断した場合（ステップＳ８０５：ＹＥＳ）、ＣＰＵ１１は、算出した受付けることが可能な音声パターンを音声認識した場合の最大文字数又は最大モーラ数ｚに基づいて、利用者からの音声を継続して受付ける音声受付時間Ｔ２を算出する（ステップＳ８０６）。 If the CPU 11 determines that the calculated number of recognized characters is the maximum value (step S803: YES), the CPU 11 calculates the maximum number of characters or the maximum number of recognized mora when the speech pattern that can be accepted is recognized. Is recorded in the RAM 13 (step S804). The CPU 11 repeatedly executes the processing from step S802 on for all voice patterns (step S805: NO), and when the CPU 11 determines that the processing has been completed for all voice patterns (step S805: YES), the CPU 11 Based on the maximum number of characters or the maximum number of mora z when the recognized speech pattern that can be received is recognized, a speech reception time T2 for continuously receiving the speech from the user is calculated (step S806).

ＣＰＵ１１は、対話シナリオ情報に基づく合成音声を出力し（ステップＳ８０７）、算出した音声受付時間Ｔ２が経過したか否かを判断する（ステップＳ８０８）。ＣＰＵ１１は、音声受付時間Ｔ２が経過したと判断するまで利用者の音声を継続して受付け（ステップＳ８０８：ＮＯ）、音声受付時間Ｔ２が経過したと判断した場合（ステップＳ８０８：ＹＥＳ）、ＣＰＵ１１は、入力された音声を音声認識し（ステップＳ８０９）、次の対話シナリオへと進行する。 The CPU 11 outputs a synthesized voice based on the dialogue scenario information (step S807), and determines whether or not the calculated voice reception time T2 has elapsed (step S808). The CPU 11 continues to accept the user's voice until it is determined that the voice reception time T2 has elapsed (step S808: NO). If the CPU 11 determines that the voice reception time T2 has elapsed (step S808: YES), the CPU 11 The input voice is recognized (step S809), and the process proceeds to the next dialogue scenario.
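Steps S808 and S809 can be sketched in Python as a capture loop bounded by T2. This is an illustration under stated assumptions rather than the actual implementation: `collect_voice`, the frame-reading callback `read_frame`, and the injectable `clock` are hypothetical names, and audio is modeled simply as byte strings.

```python
import time
from typing import Callable, List, Optional


def collect_voice(
    read_frame: Callable[[], Optional[bytes]],
    t2_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Keep accepting audio until the voice reception time T2 has elapsed."""
    deadline = clock() + t2_seconds
    frames: List[bytes] = []
    while clock() < deadline:   # step S808: NO, T2 has not yet elapsed
        frame = read_frame()    # hypothetical blocking read of one audio frame
        if frame:
            frames.append(frame)
    # Step S808: YES. The caller passes the result to speech recognition (S809).
    return b"".join(frames)
```

Because the loop ends as soon as T2 has elapsed, a burst of ambient noise cannot hold the reception window open longer than the longest acceptable answer is expected to take.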

以上のように本実施の形態２によれば、音声認識用文法から利用者による応答として想定している音声のパターンが音声認識された場合の最大文字数又は最大モーラ数を決定することができ、最大文字数又は最大モーラ数に応じて、利用者の音声を継続して受付けることが可能な音声受付時間を設定することにより、周囲の雑音を誤検出した場合であっても利用者にとって不自然な対話中断時間が生じることがなく、円滑な対話を維持することが可能となる。 As described above, according to the second embodiment, it is possible to determine the maximum number of characters or the maximum number of mora when the speech pattern assumed as a response by the user is recognized from the speech recognition grammar, By setting the voice reception time that can continuously accept the user's voice according to the maximum number of characters or the maximum number of mora, it is unnatural for the user even if the surrounding noise is falsely detected. It is possible to maintain a smooth dialogue without causing a dialogue interruption time.

なお、本実施の形態２では、音声認識用文法に基づく利用者による音声を継続して受付ける音声受付時間を、音声対話システムから合成音声を出力するごとに算出する例について説明しているが、特にこれに限定されるものではなく、対話シナリオ情報に沿った対話の開始時、又は開始する前に、すべての音声認識用文法について解析し、解析結果に基づいて利用者による音声を継続して受付ける音声受付時間を算出して記録手段１２に記録しておくものであっても良い。 In Embodiment 2, an example has been described in which the voice reception time, during which the user's voice is continuously accepted based on the speech recognition grammar, is calculated each time the voice dialogue system outputs synthesized speech. However, the invention is not limited to this. All speech recognition grammars may be analyzed at or before the start of the dialogue that follows the dialogue scenario information, and the voice reception time may be calculated from the analysis results and recorded in the recording means 12.

また、本実施の形態２単独ではなく、実施の形態１と併用することにより、対話の中断時間がより自然な対話に即した時間となることから、より円滑な対話を維持することが可能となる。 Furthermore, when Embodiment 2 is used in combination with Embodiment 1 rather than alone, the interruption time of the dialogue comes closer to that of a natural conversation, so an even smoother dialogue can be maintained.

以上の実施の形態１及び２に関し、さらに以下の付記を開示する。 Regarding the above first and second embodiments, the following additional notes are disclosed.

（付記１）
音声を受け付ける手段と、
音声認識用文法に基づいて受け付けた音声を認識する手段と、
認識した結果及び対話の進行手順を記述した対話シナリオ情報に基づいて対話を進行させる手段と、
前記受け付けた音声に対する応答を出力する手段とを備える音声対話システムにおいて、
前記音声認識用文法を解析する解析手段と、
前記音声認識用文法の解析結果に基づいて音声を受け付けるまでの待ち時間を増減する待ち時間調整手段と
を備えることを特徴とする音声対話システム。
(Appendix 1)
A voice dialogue system comprising:
means for receiving a voice;
means for recognizing the received voice based on a speech recognition grammar;
means for advancing a dialogue based on the recognition result and on dialogue scenario information describing the procedure of the dialogue; and
means for outputting a response to the received voice,
the voice dialogue system further comprising:
analysis means for analyzing the speech recognition grammar; and
waiting time adjustment means for increasing or decreasing a waiting time until a voice is received, based on the analysis result of the speech recognition grammar.

（付記２）
前記音声認識用文法は、受け付ける音声のパターン及び音声に含まれるべきキーワードを指定してあり、
前記解析手段は、前記音声認識用文法に基づいて、受け付けることが可能な音声のパターン数、及び受け付ける音声のキーワードの個数を決定するようにしてあることを特徴とする付記１記載の音声対話システム。
(Appendix 2)
The voice dialogue system according to appendix 1, wherein
the speech recognition grammar specifies the voice patterns to be accepted and the keywords to be contained in the voice, and
the analysis means determines, based on the speech recognition grammar, the number of acceptable voice patterns and the number of keywords in the accepted voice.

（付記３）
前記音声認識用文法の解析結果に基づいて音声を受け付けることが可能な音声受付時間を増減する手段を備えることを特徴とする付記１又は２記載の音声対話システム。
(Appendix 3)
The voice dialogue system according to appendix 1 or 2, further comprising means for increasing or decreasing a voice reception time, during which a voice can be accepted, based on the analysis result of the speech recognition grammar.

（付記４）
前記音声認識用文法は、受け付ける音声のパターンを指定してあり、
前記解析手段は、前記音声認識用文法に基づいて、受け付けることが可能な音声のパターンを音声認識した場合の文字列に含まれる最大文字数又は最大モーラ数を決定するようにしてあることを特徴とする付記３記載の音声対話システム。
(Appendix 4)
The voice dialogue system according to appendix 3, wherein
the speech recognition grammar specifies the voice patterns to be accepted, and
the analysis means determines, based on the speech recognition grammar, the maximum number of characters or the maximum number of morae contained in the character strings obtained when the acceptable voice patterns are recognized.

（付記５）
音声を受け付け、
音声認識用文法に基づいて受け付けた音声を認識し、
認識した結果及び対話の進行手順を記述した対話シナリオ情報に基づいて対話を進行させ、
前記受け付けた音声に対する応答を出力する音声対話方法において、
前記音声認識用文法を解析し、
前記音声認識用文法の解析結果に基づいて音声を受け付けるまでの待ち時間を増減することを特徴とする音声対話方法。
(Appendix 5)
A voice dialogue method comprising:
receiving a voice;
recognizing the received voice based on a speech recognition grammar;
advancing a dialogue based on the recognition result and on dialogue scenario information describing the procedure of the dialogue; and
outputting a response to the received voice,
the method further comprising:
analyzing the speech recognition grammar; and
increasing or decreasing a waiting time until a voice is received, based on the analysis result of the speech recognition grammar.

（付記６）
前記音声認識用文法は、受け付ける音声のパターン及び音声に含まれるべきキーワードを指定してあり、
前記音声認識用文法に基づいて、受け付けることが可能な音声のパターン数、及び受け付ける音声のキーワードの個数を決定することを特徴とする付記５記載の音声対話方法。
(Appendix 6)
The voice dialogue method according to appendix 5, wherein
the speech recognition grammar specifies the voice patterns to be accepted and the keywords to be contained in the voice, and
the number of acceptable voice patterns and the number of keywords in the accepted voice are determined based on the speech recognition grammar.

（付記７）
前記音声認識用文法の解析結果に基づいて音声を受け付けることが可能な音声受付時間を増減することを特徴とする付記５又は６記載の音声対話方法。
(Appendix 7)
The voice dialogue method according to appendix 5 or 6, wherein a voice reception time, during which a voice can be accepted, is increased or decreased based on the analysis result of the speech recognition grammar.

（付記８）
前記音声認識用文法は、受け付ける音声のパターンを指定してあり、
前記音声認識用文法に基づいて、受け付けることが可能な音声のパターンを音声認識した場合の文字列に含まれる最大文字数又は最大モーラ数を決定することを特徴とする付記７記載の音声対話方法。
(Appendix 8)
The voice dialogue method according to appendix 7, wherein
the speech recognition grammar specifies the voice patterns to be accepted, and
the maximum number of characters or the maximum number of morae contained in the character strings obtained when the acceptable voice patterns are recognized is determined based on the speech recognition grammar.

（付記９）
音声を受け付け、音声認識用文法に基づいて受け付けた音声を認識し、認識した結果及び対話の進行手順を記述した対話シナリオ情報に基づいて対話を進行させ、受け付けた音声に対する応答を出力するコンピュータで実行可能なコンピュータプログラムにおいて、
前記コンピュータを、
前記音声認識用文法を解析する解析手段、及び
前記音声認識用文法の解析結果に基づいて音声を受け付けるまでの待ち時間を増減する待ち時間調整手段
として機能させることを特徴とするコンピュータプログラム。
(Appendix 9)
A computer program executable by a computer that receives a voice, recognizes the received voice based on a speech recognition grammar, advances a dialogue based on the recognition result and on dialogue scenario information describing the procedure of the dialogue, and outputs a response to the received voice,
the computer program causing the computer to function as:
analysis means for analyzing the speech recognition grammar; and
waiting time adjustment means for increasing or decreasing a waiting time until a voice is received, based on the analysis result of the speech recognition grammar.

（付記１０）
前記音声認識用文法は、受け付ける音声のパターン及び音声に含まれるべきキーワードを指定してあり、
前記解析手段は、前記音声認識用文法に基づいて、受け付けることが可能な音声のパターン数、及び受け付ける音声のキーワードの個数を決定するようにしてあることを特徴とする付記９記載のコンピュータプログラム。
(Appendix 10)
The computer program according to appendix 9, wherein
the speech recognition grammar specifies the voice patterns to be accepted and the keywords to be contained in the voice, and
the analysis means determines, based on the speech recognition grammar, the number of acceptable voice patterns and the number of keywords in the accepted voice.

（付記１１）
前記コンピュータを、
前記音声認識用文法の解析結果に基づいて音声を受け付けることが可能な音声受付時間を増減する手段として機能させることを特徴とする付記９又は１０記載のコンピュータプログラム。
(Appendix 11)
The computer program according to appendix 9 or 10, further causing the computer to function as means for increasing or decreasing a voice reception time, during which a voice can be accepted, based on the analysis result of the speech recognition grammar.

（付記１２）
前記音声認識用文法は、受け付ける音声のパターンを指定してあり、
前記解析手段は、前記音声認識用文法に基づいて、受け付けることが可能な音声のパターンを音声認識した場合の文字列に含まれる最大文字数又は最大モーラ数を決定するようにしてあることを特徴とする付記１１記載のコンピュータプログラム。
(Appendix 12)
The computer program according to appendix 11, wherein
the speech recognition grammar specifies the voice patterns to be accepted, and
the analysis means determines, based on the speech recognition grammar, the maximum number of characters or the maximum number of morae contained in the character strings obtained when the acceptable voice patterns are recognized.

本発明の実施の形態１に係る音声対話システムの構成を示すブロック図である。 A block diagram showing the configuration of the voice dialogue system according to Embodiment 1 of the present invention.
受付けることができる音声のパターンと音声中のキーワードの例示図である。 An exemplary diagram of acceptable voice patterns and the keywords in the voice.
受付けることができる音声のパターンと音声中のキーワードの例示図である。 An exemplary diagram of acceptable voice patterns and the keywords in the voice.
本発明の実施の形態１に係る音声対話システムの対話制御装置のＣＰＵの処理手順を示すフローチャートである。 A flowchart showing the processing procedure of the CPU of the dialogue control device of the voice dialogue system according to Embodiment 1 of the present invention.
音声入力の待ち時間を修正する修正係数を記録手段に記録している場合の、修正係数照会の手順を模式的に示す図である。 A diagram schematically showing the procedure of the correction coefficient inquiry when a correction coefficient for correcting the voice input waiting time is recorded in the recording means.
受付けることができる音声のパターンと音声パターンを音声認識した場合に含まれる認識文字数の例示図である。 An exemplary diagram of acceptable voice patterns and the number of recognized characters obtained when each voice pattern is recognized.
受付けることができる音声のパターンと音声パターンを音声認識した場合に含まれる認識文字数の例示図である。 An exemplary diagram of acceptable voice patterns and the number of recognized characters obtained when each voice pattern is recognized.
本発明の実施の形態２に係る音声対話システムの対話制御装置のＣＰＵの処理手順を示すフローチャートである。 A flowchart showing the processing procedure of the CPU of the dialogue control device of the voice dialogue system according to Embodiment 2 of the present invention.

Explanation of symbols

１０対話制御装置
１１ＣＰＵ
１２記録手段
１３ＲＡＭ
１４通信インタフェース
１５補助記録手段
１６可搬型記録媒体
２０音声入出力部
１２１対話シナリオ情報
10 Dialogue control device
11 CPU
12 Recording means
13 RAM
14 Communication interface
15 Auxiliary recording means
16 Portable recording medium
20 Voice input/output unit
121 Dialogue scenario information

Claims

1. A voice dialogue system comprising:
means for receiving a voice;
means for recognizing the received voice based on a speech recognition grammar;
means for advancing a dialogue based on the recognition result and on dialogue scenario information describing the procedure of the dialogue; and
means for outputting a response to the received voice,
the voice dialogue system further comprising:
analysis means for analyzing the speech recognition grammar; and
waiting time adjustment means for increasing or decreasing a waiting time until a voice is received, based on the analysis result of the speech recognition grammar.

2. The voice dialogue system according to claim 1, wherein
the speech recognition grammar specifies the voice patterns to be accepted and the keywords to be contained in the voice, and
the analysis means determines, based on the speech recognition grammar, the number of acceptable voice patterns and the number of keywords in the accepted voice.

3. The voice dialogue system according to claim 1, further comprising means for increasing or decreasing a voice reception time, during which a voice can be accepted, based on the analysis result of the speech recognition grammar.

4. The voice dialogue system according to claim 3, wherein
the speech recognition grammar specifies the voice patterns to be accepted, and
the analysis means determines, based on the speech recognition grammar, the maximum number of characters or the maximum number of morae contained in the character strings obtained when the acceptable voice patterns are recognized.

5. A computer program executable by a computer that receives a voice, recognizes the received voice based on a speech recognition grammar, advances a dialogue based on the recognition result and on dialogue scenario information describing the procedure of the dialogue, and outputs a response to the received voice,
the computer program causing the computer to function as:
analysis means for analyzing the speech recognition grammar; and
waiting time adjustment means for increasing or decreasing a waiting time until a voice is received, based on the analysis result of the speech recognition grammar.