JP2005024869A

JP2005024869A - Voice responder

Info

Publication number: JP2005024869A
Application number: JP2003189988A
Authority: JP
Inventors: Keisuke Yoshizaki; 圭祐吉崎
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2003-07-02
Filing date: 2003-07-02
Publication date: 2005-01-27

Abstract

PROBLEM TO BE SOLVED: To provide a voice responder that is easy for a user to use.

SOLUTION: The voice responder recognizes the content of the user's utterance and provides preset voice guidance based on the recognition result. The user's speech rate is detected by speech rate detection means 12 and 13, and the utterance content of the voice guidance given in response is changed according to the detected speech rate by utterance content changing means 16 and 17. Even when the response time and the utterance content are the same, the user's proficiency is properly determined from the speech rate, and the utterance content of the voice guidance is selectively changed and output accordingly. The device is therefore easy for the user to use.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声応答装置に関する。
【０００２】
【従来の技術】
一般に、この種の音声応答装置は、例えば、電話を用いたチケットの予約やデータベースの参照・更新を行うことに利用されている。
【０００３】
ここに、この種の装置は、利用者が操作に不慣れな場合においては、音声ガイダンスにて流される機器の操作説明が足りなく、利用者が何をすれば良いのかが分かりにくいものとなっている。一方、利用者が利用方法を熟知している場合においては、音声ガイダンスの内容は冗長なものとなっており、利用者がやりとりを完了させるのに時間を要してしまう。
【０００４】
このような点を考慮した従来技術に、音声ガイダンスを不慣れな利用者と熟練した利用者とで切替えて送出する音声応答装置がある（例えば、特許文献１，２参照）。これらの特許文献１，２では音声応答装置がガイダンスを送出し始めてから利用者が音声により応答するまでの反応時間の長さによって利用者の熟練度を判定し、この熟練度に従ってガイダンスの変更を行うようにしている。
【０００５】
また、特許文献３では、利用者が操作に不要な言葉を発した際や言い淀みがあった際には、当該利用者の熟練度を下げることにより、装置の熟練度が低い利用者において、反応時間が短い場合でも、操作に不慣れな利用者用のガイダンスを送出することができるようにしている。
【０００６】
【特許文献１】
特開平４−３４４９３０号公報
【特許文献２】
特開平１０−２０８８４号公報
【特許文献３】
特開２００１−３３１１９６公報
【０００７】
【発明が解決しようとする課題】
しかしながら、上述したようなこれらの従来の音声応答装置では、ゆっくりとした発話を行っている、若しくは、速く発話を行っている、といったような発話速度は考慮していないため、反応時間が同じであり、また、発話内容が正しい場合においては、操作に不慣れな利用者と慣れた利用者とで熟練度が同じと判定されてしまう問題がある。
【０００８】
これは例えば、数列や「はい」「いいえ」といったような発話内容が単純な場合において、より起こりやすく、例えば数列「０４７１」を短く発話する人と長く発話する人との双方の場合において、発話内容が正しく、反応時間が同じであれば同じ熟練度と判定されてしまう。従って、音声応答装置は必ずしも利用者にとって使いやすいものとはなっていない。
【０００９】
また、熟練度が低い場合において選択的に出力される音声ガイダンスは、本来「それでよろしいですか？」であるのに対して、「それでよろしいですか？よろしければ“はい”と発話して下さい」といったように、操作の説明を通常時に比べてより詳細に説明するといったものであり、出力される声質、例えば、音量や発話速度は変更されない。そのため、操作に不慣れな利用者にとっては結局のところ分りにくいままとなりやすい。この点でも、音声応答装置は必ずしも利用者にとって使いやすいものとはなっていない。
【００１０】
また、最近では、利用者が携帯電話を介して音声応答装置を利用するケースも増えており、その際に周囲の雑音が多いといった、利用者が音声ガイダンスを聞き取りにくい状況にあっても、音声ガイダンスの出力に関しては何ら変更処理は行われない。従って、携帯電話等を利用する上で、音声応答装置は必ずしも利用者にとって使いやすいものとはなっていない現状にある。
【００１１】
本発明の目的は、利用者にとって使いやすい音声応答装置を提供することである。
【００１２】
【課題を解決するための手段】
本発明は、利用者の発声内容を認識しその認識結果に基づいて予め設定された音声ガイダンスを提供する音声応答装置において、利用者の発話速度を検出する発話速度検出手段と、検出された発話速度に従い応答する前記音声ガイダンスの発話内容を変更する発話内容変更手段と、を具備し、反応時間や発話内容が同じであっても、利用者の発話速度に応じて熟練度を適正に判別しその結果により選択的に音声ガイダンスの発話内容を変更して出力することにより、利用者にとって使いやすい音声応答装置となる。
【００１３】
【発明の実施の形態】
本発明の第１の実施の形態を図１ないし図５に基づいて説明する。
本実施の形態の音声応答装置は、例えば電話等を用いたチケットの予約用サーバ等への適用例を示し、装置自体は、基本的には、マイクロコンピュータを内蔵するコンピュータであり、インストールされ或いは記憶保存されているコンピュータプログラムに従い後述するような処理を含めて各種の処理を実行する。
【００１４】
まず、本実施の形態の音声応答装置となるサーバ１のハードウェア構成例を図１に示す概略ブロック図を参照して説明する。当該サーバ１は、その概略構成として、各種演算処理を実行して各部を集中的に制御するＣＰＵ２に、固定データを格納するＲＯＭ３、可変データを書き替え自在に格納するＲＡＭ４、表示デバイス、スピーカ等の出力デバイス５、各種入力デバイス６、記憶装置７、及び、通信インターフェース８等がシステムバス９を介して接続された構成を有している。
【００１５】
記憶装置７としては、例えばハードディスクドライブが用いられる。このような記憶装置７には、ＯＳの他、各種のプログラムがインストールされ、かつ、チケット予約等に関する各種ファイル、テーブル等が記憶保存されている。記憶装置７にインストールされたＯＳ、プログラム及び各種ファイルは、当該サーバ１のプログラムの起動時、その全部又は一部がＲＡＭ４にコピーされ、アクセス速度の高速度化が図られている。
【００１６】
また、サーバ１は、通信インターフェース８、電話等を介して利用者が音声によりアクセス可能とされている。
【００１７】
このようなコンピュータを利用した本実施の形態の音声応答装置（サーバ１）の機能ブロック図を図２に示す。本実施の形態の音声応答装置では、まず、利用者が音声を入力する音声入力部１１と、この音声入力部１１に入力された音声の発話速度を解析・検出する音声解析部１２と、発話速度を判別するための条件が設定されているＲＯＭ３を利用した音声解析記憶部１３とが設けられている。ここに、音声解析部１２と音声解析記憶部１３とにより発話速度検出手段の機能が実現されている。また、入力された音声を認識し音声認識結果を出力する音声認識部１４と、利用者がサービスを受けるために当該自動音声応答装置の操作に必要な語句が予め設定されているＲＯＭ３又は記憶装置７を利用した音声認識辞書部１５とが設けられている。また、当該音声応答装置（サーバ１）が行う対話処理の内容が予め設定されているＲＯＭ３又は記憶装置７を利用した対話処理記憶部１６と、音声認識結果及び対話処理記憶部１６の設定内容により音声ガイダンスの発話内容を管理する対話処理部１７と、音声ガイダンスの発話内容の音声データが設定されているＲＯＭ３又は記憶装置７を利用したガイダンス記憶部１８と、ガイダンス記憶部１８より音声ガイダンスの発話内容の音声データを取得して音声データを出力するガイダンス出力部１９と、音声データを出力デバイス５中のスピーカから出力する音声出力部２０とが設けられている。ここに、対話処理部１７と対話処理記憶部１６とにより発話内容変更手段の機能が実現されている。
【００１８】
このような構成において、通信インターフェース８等を介して利用者が音声を入力すると、音声入力部１１に入力された音声情報は音声解析部１２に送られる。ここで、プログラムに従いＣＰＵ２により実行される音声解析部１２の処理の流れを図３に示す概略フローチャートを参照して説明する。
【００１９】
まず、ステップＳ１において、入力音声をフレーム単位で分割し、分割したフレーム毎に音響パワーレベルを求める。全フレームの中で最大の音響パワーレベルと最小の音響パワーレベルの中間値を基準値とし、各フレームについて先に求めた基準値との比較処理を行うことにより、全フレームについて発話部分であるか発話部分ではないかを見分ける。そして、発話部分であるフレームの時間を足し合わせることにより入力音声の発話時間を求める。得られた結果を内部メモリ（ＲＡＭ４）に保存する。
【００２０】
ステップＳ２では、入力された音声を音声認識部１４へ送信し、認識結果を待つ。そして、音声認識部１４から得られた認識結果から、音声認識が正常に行われた場合は（Ｓ２のＹ）、ステップＳ３へ進む。また、認識できなかった場合は（Ｓ２のＮ）、発話速度の判定結果を「普通の発話速度」と設定して内部メモリ（ＲＡＭ４）に保存し、ステップＳ４に進む。
【００２１】
ステップＳ３では、音声認識部１４から得られた音声認識結果から語数をカウントする。例えば、音声認識結果が「よやく」であれば３語、「ちけっと」であれば４語といったように語数をそのままカウントする。そして、ステップＳ１にて求めた発話時間より１秒当りの発話語数を“発話速度”として計算する。そして、求めた発話速度と音声解析記憶部１３に予め設定されている発話速度の条件とを比較し、入力された音声の発話速度の判別を行い、判別結果を内部メモリ（ＲＡＭ４）に保存し、ステップＳ４に進む。
【００２２】
図４は音声解析記憶部１３に予め設定された発話速度の条件の設定例であり、ここでは発話速度の条件として、１秒当りの発話語数が３．８以下は「遅い発話」、３．８を超えかつ４．２以下であれば「普通の発話速度」、４．２を超えるならば「速い発話」と設定されている。
【００２３】
この時、求めた発話速度が例えば３．６であれば入力音声の発話速度は「遅い発話」と判断され、内部メモリ（ＲＡＭ４）に設定される。
【００２４】
ステップＳ４では、音声認識結果と内部メモリ（ＲＡＭ４）に設定された発話速度の判定結果とを対話処理部１７に送信して終了する。
【００２５】
対話処理部１７では音声認識結果より対話処理記憶部１６に予め設定された対話処理の内容に応じて出力する音声ガイダンスの分野を選択する。また、選択された音声ガイダンスの分野について、対話処理記憶部１６に予め設定された、音声ガイダンスの発話内容リストを参照し、利用者の発話速度に応じた発話内容の音声ガイダンスを選択する。そして、選択した音声ガイダンスをガイダンス出力部１９に送信する。
【００２６】
図５は対話処理記憶部１６（ＲＯＭ３等）に予め設定された、例えば、「チケット予約」の分野における利用者の発話速度による音声ガイダンスの発話内容リストのテーブル例を示す。図示例では、例えば、発話速度が「遅い発話」の場合は「チケット予約ですね？よろしければ“はい”、間違いであれば“いいえ”と発話して下さい」、「普通の発話速度」の場合は「チケット予約を承りました。何日のご予約でしょうか？」、「速い発話」の場合は「何日のご予約でしょうか？」という音声ガイダンスの発話内容の設定となっている。ここで、例えば音声認識の結果が「チケット予約」であり、利用者の発話速度の判定結果が「普通の発話速度」であれば、「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスを選択し、「遅い発話」であれば、「チケット予約ですね？よろしければ“はい”、間違いであれば“いいえ”と発話して下さい」という発話内容の音声ガイダンスが選択されることとなる。
【００２７】
ガイダンス出力部１９では選択された音声ガイダンスの発話内容から、ガイダンス記憶部１８より対応する音声ガイダンスの音声データを取得し、音声出力部２０に送信する。
【００２８】
音声出力部２０では、送信された音声データを出力音声として出力する。
【００２９】
以上の処理により、発話速度に応じて音声ガイダンスの発話内容が変更されて出力されることになる。従って、通常であれば「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスとなるが、不慣れで発話速度の遅い利用者の場合には、それに対応して例えば「チケット予約ですね？よろしければ“はい”、間違いであれば“いいえ”と発話して下さい」という丁寧な発話内容の音声ガイダンスとなり、当該利用者にとって使いやすいものとなる。逆に、慣れていて発話速度の速い利用者の場合には、「何日のご予約でしょうか？」という冗長でなく必要最低限の発話内容の音声ガイダンスとなり、当該利用者にとって使いやすいものとなる。特に、発話速度に注目して不慣れな利用者と慣れた利用者とを判別しているので、反応時間が同じで発話内容が正しい場合にもその熟練度を適正に判別することができる。
【００３０】
本発明の第２の実施の形態を図６ないし図８に基づいて説明する。第１の実施の形態で示した部分と同一部分は同一符号を用いて示し、説明も省略する（以降の実施の形態でも順次同様とする）。
【００３１】
本実施の形態では、図２に示した構成（機能ブロック）に対して、ガイダンス出力部１９の出力側に、音声変換部２１、音声変換記憶部２２及び話速変換部２３によるデータ変換処理手段が付加されている。ここに、音声変換部２１は、ガイダンス出力部１９から入力される音声ガイダンスの発話速度（話速）を変換条件に従い変更させる変更指示や音量の変換処理を行うものであり、ＲＯＭ３又は記憶装置７を利用した音声変換記憶部２２には音声データの変換処理を行う上での音声の変換条件が予め設定されている。また、話速変換部２３は音声ガイダンスの話速の変換処理を実行する。音声変換部２１、音声変換記憶部２２及び話速変換部２３によるデータ変換処理手段によって、音声ガイダンスの発話速度や発話音量、すなわち音声ガイダンスの発話声質が変えられる。
【００３２】
なお、本実施の形態においては、ガイダンス記憶部１８に設定される音声ガイダンスの発話内容としては、音声ガイダンス分野毎に１種類が用意され、例えばチケット予約の分野であれば、通常用の「チケット予約を承りました。何日のご予約でしょうか？」という発話内容が格納されている。
【００３３】
また、対話処理記憶部１６に予め設定された音声ガイダンスの発話内容リストは、発話速度に対して同じ音声ガイダンスの発話内容を設定する。
【００３４】
このような構成において、利用者が音声を入力すると、音声入力部１１に入力された音声情報は音声解析部１２に送られる。音声解析部１２では前述の実施の形態の場合と同様の処理を行い、音声認識結果と内部メモリに設定された発話速度の判定結果を対話処理部１７に送信する。対話処理部１７では音声認識結果より対話処理記憶部１６に予め設定された対話処理の内容に応じて出力する音声ガイダンスの分野を選択する。また、選択された音声ガイダンスの分野について、対話処理記憶部１６に予め設定された、音声ガイダンスの発話内容リストを参照し、音声ガイダンスを選択する。そして、選択した音声ガイダンスと発話速度の判定結果をガイダンス出力部１９に送信する。
【００３５】
ガイダンス出力部１９では選択された音声ガイダンスから、ガイダンス記憶部１８より音声ガイダンスの発話内容に応じた音声データを取得し、音声変換部２１へ送信する。また、受け取った発話速度の判定結果も音声変換部２１へ送信する。
【００３６】
音声変換部２１は受け取った音声データを発話速度の判定結果に応じて変換処理を行った後に変換処理後の音声データを音声出力部２０に送信する。
【００３７】
図７は音声変換部２１の処理の流れを示す概略フローチャートである。音声変換部２１では、まず、ステップＳ１１において、ガイダンス出力部１９から入力された音声データを内部メモリに保存する。次のステップＳ１２では、音声変換部２１に与えられた発話速度の判定結果と音声変換記憶部２２に予め設定されている音声の変換条件とを比較する。
【００３８】
図８はＲＯＭ３等を利用する音声変換記憶部２２にテーブルとして予め設定された音声の変換条件、すなわち、出力音声ガイダンスの音量及び発話速度の変換条件である。利用者の発話速度が「遅い発話」の場合において出力音声ガイダンスの音量は１．２倍、話速は０．９倍、「普通の発話速度」の場合においては音量・話速共に変更なし、「速い発話」の場合においての音量は変更無し、話速は１．２倍と設定されている。ここで、例えば発話速度の判定結果が「遅い発話」であるならば、音量変換条件より音量は１．２倍となる。
【００３９】
ステップＳ１２では、音声変換部２１に与えられた発話速度の判定結果と音声変換記憶部２２に設定されている出力音声ガイダンスの音量の変換条件とを比較し、変換条件に当てはまるなら、ステップＳ１３へ進み、条件に当てはまらない場合はステップＳ１４へと進む。ステップＳ１３では、内部メモリに保存された音声データの音量を変換し、内部メモリの音声を更新後、ステップＳ１４へと進む。
【００４０】
ステップＳ１４では、音声変換部２１に与えられた発話速度の判定結果と音声変換記憶部２２に設定されている出力音声ガイダンスの発話速度の変換条件とを比較し、条件に当てはまれば、ステップＳ１５へ進み、条件に当てはまらない場合は、ステップＳ１６へ進む。
【００４１】
ここで、例えば、発話速度の判定結果が「遅い発話」であるならば、図８の出力音声ガイダンスの発話速度の変換条件より出力音声ガイダンスの話速は０．９倍となる。そして話速が変更されるので、ステップＳ１５へ進む。
【００４２】
ステップＳ１５では、内部メモリ（ＲＡＭ４）の音声データを話速変換部２３に送信し、受け取った音声データで内部メモリ（ＲＡＭ４）の音声データを更新後、ステップＳ１６へと進む。例えば、発話速度を１．２倍に変換するのであれば、音声データと、音声データの発話速度を１．２倍にするという指示を話速変換部２３に与えることとなり、その結果として発話速度が１．２倍になった音声データを受け取ることになる。
【００４３】
ステップＳ１６では、内部メモリ（ＲＡＭ４）の音声データを音声出力部２０に送信して終了する。音声出力部２０では、送信された音声データを出力音声としてスピーカから出力する。
【００４４】
以上の処理により、入力音声の発話速度に応じて、出力される音声ガイダンスの音量や発話速度を変更させるための音声データの変換処理がなされることとなり、発話速度が遅い場合には音量が１．２倍で発話速度が０．９倍と遅くなった音声ガイダンスが出力され、また、発話速度が速い場合には音量はそのままで発話速度が１．２倍と速くなった音声ガイダンスが出力される。
【００４５】
以上の処理により、例えば音声ガイダンスの発話内容は「チケット予約を承りました。何日のご予約でしょうか？」で同一であっても、不慣れで発話速度の遅い利用者の場合には、音量が１．２倍で発話速度が０．９倍と遅くなった音声ガイダンスが出力されるので、聞き取りやすく、かつ、慌てることなく聞き取ることができるので、当該利用者にとって使いやすいものとなる。逆に、慣れていて発話速度の速い利用者の場合には、発話速度が速くなって音声ガイダンスの終了時間が早くなるので、時間的に冗長でなくなり、当該利用者にとって使いやすいものとなる。本実施の形態の場合も、発話速度に注目して不慣れな利用者と慣れた利用者とを判別しているので、反応時間が同じで発話内容が正しい場合にもその熟練度を適正に判別することができる。
【００４６】
なお、本実施の形態では、音声ガイダンスの発話声質を変更させる例として、音量及び話速の例で説明したが、何れか一方のみでもよく、或いは、発話ピッチ、発話音程等を発話声質として変更させるようにしてもよい（以降の実施の形態においても同様）。
【００４７】
本発明の第３の実施の形態を図９ないし図１１に基づいて説明する。本実施の形態では、図６に示した構成（機能ブロック）において、音声解析部１２及び音声解析記憶部１３に、発話速度検出手段の機能を実現させるとともに、利用者の発声に関する雑音状況を検出する雑音検出手段の機能を実現させるようにしたものである。即ち、音声解析部１２には音声の雑音状況を解析する機能が付加され、かつ、ＲＯＭ３又はＲＡＭ４を利用した音声解析記憶部１３には雑音状況を判別するための条件が予め設定されている。
【００４８】
このような構成において、利用者が音声を入力すると、音声入力部１１に入力された音声情報は音声解析部１２に送られる。
【００４９】
図９は、本実施の形態の音声解析部１２の処理の流れを示す概略フローチャートである。なお、図３に示した発話速度に関する処理制御は省略する。音声解析部１２は、ステップＳ２１にて入力音声をフレーム単位で分割し、分割したフレーム毎に音響パワーレベルを求める。全フレームの中で最大の音響パワーレベルと最小の音響パワーレベルの中間値を基準値とし、各フレームについて比較処理を行うことにより、全フレームについて発話部分であるか発話部分ではないかを見分ける。そして、発話部分であるフレームの平均音響パワーレベルと発話部分でないフレームの平均音響パワーレベルとの比率を求め、その結果と音声解析記憶部１３に予め設定された雑音状況の条件とを比較し、周囲の雑音状況を判定する。そして、判定した結果を内部メモリ（ＲＡＭ４）に保存し、ステップＳ２２に進む。
【００５０】
ここで、図１０はＲＯＭ３等を利用した音声解析記憶部１３にテーブルとして予め設定された雑音状況の条件の例であり、ここでは雑音状況の条件が、０．７以下は「雑音小」、０．７を超えるならば「雑音大」と２段階に設定されている。この時、求めた発話が行われている区間の平均音響パワーレベルが８０であり、発話が行われていない区間の平均音響パワーレベルが７０であるならば、比率は０．７５となり上記条件により雑音状況は「雑音大」と判断され、内部メモリ（ＲＡＭ４）に設定される。
【００５１】
ステップＳ２２では、入力された音声を音声認識部１４へ送信し、認識結果を待つ。そして、音声認識部１４から得られた認識結果と内部メモリに設定された雑音状況の判定結果を対話処理部１７に送信して終了する。
【００５２】
対話処理部１７では、音声認識結果より対話処理記憶部１６に予め設定された対話処理の内容に応じて出力する音声ガイダンスの分野を選択する。また、選択された音声ガイダンスの分野について、対話処理記憶部１６に予め設定された、音声ガイダンスの発話内容リストを参照し、利用者の発話速度に応じた発話内容の音声ガイダンスを選択する。そして、選択した音声ガイダンスと発話速度の判定結果と雑音状況の判定結果をガイダンス出力部１９に送信する。
【００５３】
音声ガイダンスの内容リストとしては、例えば、図１１に示すように設定されている。即ち、利用者の発話速度に応じてガイダンス項目（音声ガイダンスの発話内容）が分類されている。ガイダンス出力部１９では選択された音声ガイダンスから、ガイダンス記憶部１８より音声ガイダンスの発話内容に応じた音声データを取得し、音声変換部２１へ送信する。また、受け取った発話速度の判定結果と雑音状況の判定結果も音声変換部２１へ送信する。
【００５４】
音声変換部２１は受け取った音声データを発話速度の判定結果と雑音状況の判定結果に応じて変換処理を行った後に変換処理後の音声データを音声出力部２０に送信する。
【００５５】
音声変換部２１は、図７のフローチャートに従って音声処理を行う。音声変換部２１では、ステップＳ１１で、ガイダンス出力部１９から入力された音声データを内部メモリに保存する。次のステップＳ１２では、音声変換部２１に与えられた発話速度の判定結果及び雑音状況の判定結果と音声変換記憶部２２に予め設定されている出力音声ガイダンスの音量及び発話速度の変換条件とを比較する。
【００５６】
図１２はＲＯＭ３等を利用する音声変換記憶部２２にテーブルとして予め設定された出力音声ガイダンスの音量及び発話速度の変換条件である。利用者の発話速度が同じであっても、雑音状況の大小に応じて、音声ガイダンスの音量及び話速が２段階に切換えられるように設定されている。特に、雑音状況が「雑音大」の場合には、音量を大きくし、かつ、話速を遅くするように設定されている。
【００５７】
ステップＳ１２では、発話速度及び雑音状況の判定結果と、出力音声ガイダンスの音量の変換条件とを比較する。変換条件に当てはまるなら、ステップＳ１３へ進み、条件に当てはまらない場合はステップＳ１４へと進む。
【００５８】
ステップＳ１３では、内部メモリに保存された音声データの音量を変換し、内部メモリの音声を更新後、ステップＳ１４へと進む。
【００５９】
ステップＳ１４では、発話速度の判定結果と雑音状況の判定結果と出力音声ガイダンスの発話速度の変換条件とを比較し、条件に当てはまれば、ステップＳ１５へ進み、条件に当てはまらない場合は、ステップＳ１６へ進む。
【００６０】
従って、このような設定条件下で、例えば音声認識の結果が「チケット予約」であり、発話速度の判定結果が「普通の発話速度」であって、雑音状況の判定結果が「雑音小」であれば、「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスを音量、話速とも通常レベルで出力し、雑音状況の判定結果が「雑音大」であれば、当該「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスを音量１．２倍に大きくし、話速０．９倍に遅くして出力することとなる。発話速度が他の場合も同様である。
【００６１】
以上の処理により、発話速度及び雑音状況に応じて音声ガイダンスの発話内容が選択され、かつ、その発話声質として音量及び話速を変更させて出力させることとなる。従って、本実施の形態によれば、反応時間や発話内容が同じであっても、利用者の発話速度及び利用者の周囲の雑音状況に応じて選択的に音声ガイダンスの発話内容を変更して出力するとともに、音声ガイダンスを出力する際にその発話声質を変更させるために音声データの加工を行い、雑音が大きい状況では発話速度を遅くすることや音量を上げることによって音声ガイダンスが聞き取りやすくなり、慣れた人、不慣れな人を問わず、利用者にとって使いやすいものとなる。特に、携帯電話等を利用する環境下では効果的となる。
【００６２】
本発明の第４の実施の形態を図１３に基づいて説明する。本実施の形態では、図６に示した構成（機能ブロック）において、第３の実施の形態と同じく、音声解析部１２及び音声解析記憶部１３に、発話速度検出手段の機能を実現させるとともに、利用者の発声に関する雑音状況を検出する雑音検出手段の機能を実現させるようにしたものである。即ち、音声解析部１２には音声の雑音状況を解析する機能が付加され、かつ、ＲＯＭ３等を利用した音声解析記憶部１３には雑音状況を判別するための条件が予め設定されている。本実施の形態においては、音声ガイダンスの発話内容としては、ガイダンス分野毎に１種類のみが用意され、例えばチケット予約の分野であれば、通常用の「チケット予約を承りました。何日のご予約でしょうか？」という発話内容が格納されているものとする。
【００６３】
図１３はＲＯＭ３等を利用する音声変換記憶部２２にテーブルとして予め設定された出力音声ガイダンスの音量及び発話速度の変換条件である。利用者の発話速度及び雑音状況の大小に応じて、音声ガイダンスの音量及び話速が２段階に切換えられるように設定されている。特に、雑音状況が「雑音大」の場合には、音量を大きくし、かつ、話速を遅くするように設定されている。
【００６４】
本実施の形態の場合の処理制御は、話者速度に応じて音声ガイダンスの発話内容が選択切換えされない点を除くと、第３の実施の形態の場合に準ずる。
【００６５】
従って、このような設定条件下で、例えば音声認識の結果が「チケット予約」であり、発話速度の判定結果が「遅い発話」であって、雑音状況の判定結果が「雑音小」であれば、「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスを音量１．２倍、話速０．９倍で出力し、雑音状況の判定結果が「雑音大」であれば、当該「チケット予約を承りました。何日のご予約でしょうか？」という同一の発話内容の音声ガイダンスをさらに音量１．３倍に大きくし、話速０．８倍に遅くして出力することとなる。発話速度が他の場合も同様である。
【００６６】
従って、本実施の形態によれば、反応時間や発話内容が同じであっても、利用者の発話速度及び利用者の周囲の雑音状況に応じて、音声ガイダンスを出力する際にその発話声質を変更させるために音声データの加工を行い、雑音が大きい状況では発話速度をより遅くすることや音量をより上げることによって音声ガイダンスが聞き取りやすくなり、慣れた人、不慣れな人を問わず、利用者にとって使いやすいものとなる。特に、携帯電話等を利用する環境下では効果的となる。
【００６７】
本発明の第５の実施の形態を図１４及び図１５に基づいて説明する。本実施の形態では、図６に示した構成（機能ブロック）において、音声解析部１２及び音声解析記憶部１３に、利用者の発声に関する雑音状況を検出する雑音検出手段の機能を実現させるとともに、発話速度検出手段の機能を削除したものである。即ち、本実施の形態では、発話速度に基づく熟練度の判別機能が省略され、音声解析部１２には音声の雑音状況を解析する機能が付加され、かつ、音声解析記憶部１３には雑音状況を判別するための条件が予め設定されている。
【００６８】
このような構成において、利用者が音声を入力すると、音声入力部１１に入力された音声情報は音声解析部１２に送られる。本実施の形態の音声解析部１２の処理の流れは、図９のフローチャートで説明した場合と同様である。
【００６９】
ここで、図１５は音声解析記憶部１３に予めテーブルとして設定された雑音状況の条件の例であり、ここでは雑音状況の条件が、０．６以下は「雑音小」、０．６を超えかつ０．７以下であれば「雑音中」、０．７を超えるならば「雑音大」と３段階に設定されている。この時、求めた発話が行われている区間の平均音響パワーレベルが８０であり、発話が行われていない区間の平均音響パワーレベルが７０であるならば、比率は０．７５となり上記条件により雑音状況は「雑音大」と判断され、内部メモリに設定される。
【００７０】
本実施の形態においては、音声ガイダンスの発話内容としては、ガイダンス分野毎に１種類のみが用意され、例えばチケット予約の分野であれば、通常用の「チケット予約を承りました。何日のご予約でしょうか？」という発話内容が格納されている。
【００７１】
図１５はＲＯＭ３等を利用する音声変換記憶部２２にテーブルとして予め設定された出力音声ガイダンスの音量及び発話速度の変換条件である。
【００７２】
雑音状況の大中小に応じて、音声ガイダンスの音量及び話速が３段階に切換えられるように設定されている。特に、雑音状況が「雑音大」の場合には、音量をより大きくし、かつ、話速をより遅くするように設定されている。
【００７３】
従って、このような設定条件下で、例えば音声認識の結果が「チケット予約」であり、雑音状況の判定結果が「雑音小」であれば、「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスを音量は通常レベルで、話速は通常の１．２倍に速くして出力する一方、雑音状況の判定結果が「雑音大」であれば、当該「チケット予約を承りました。何日のご予約でしょうか？」という発話内容の音声ガイダンスを音量１．２倍に大きくし、話速０．９倍に遅くして出力することとなる。
【００７４】
以上の処理により、利用者が発話している環境の雑音状況に応じて音声ガイダンスの発話声質として音量及び話速を変更させて出力させることとなる。従って、本実施の形態によれば、反応時間や発話内容が同じであっても、利用者の周囲の雑音状況に応じて音声ガイダンスを出力する際にその発話声質を変更させるために音声データの加工を行い、雑音が大きい状況では発話速度を遅くすることや音量を上げることによって音声ガイダンスが聞き取りやすくなり、利用者にとって使いやすいものとなる。特に、携帯電話等を利用する環境下では効果的となる。
【００７５】
【発明の効果】
本発明によれば、利用者にとって使いやすい音声応答装置を提供することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態を示すブロック図である。
【図２】その機能的ブロック図である。
【図３】その音声解析部における処理内容を示す概略フローチャートである。
【図４】発話速度の判別条件例を示す説明図である。
【図５】発話速度に応じた発話内容例を示す説明図である。
【図６】本発明の第２の実施の形態を示す機能的ブロック図である。
【図７】その音声変換部における処理内容を示す概略フローチャートである。
【図８】発話速度に応じた音量及び話速の設定例を示す説明図である。
【図９】本発明の第３の実施の形態の音声解析部における処理内容を示す概略フローチャートである。
【図１０】雑音状況の判別条件例を示す説明図である。
【図１１】発話速度及び雑音状況に応じた発話内容例を示す説明図である。
【図１２】発話速度及び雑音状況に応じた音量及び話速の設定例を示す説明図である。
【図１３】本発明の第４の実施の形態の発話速度及び雑音状況に応じた音量及び話速の設定例を示す説明図である。
【図１４】本発明の第５の実施の形態の雑音状況の判別条件例を示す説明図である。
【図１５】雑音状況に応じた音量及び話速の設定例を示す説明図である。
【符号の説明】
１２，１３発話速度検出手段、雑音検出手段
１６，１７発話内容変更手段
２１〜２３データ変換処理手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a voice response device.
[0002]
[Prior art]
In general, this type of voice response device is used for, for example, reservation of a ticket using a telephone and reference / update of a database.
[0003]
With this type of device, when the user is unfamiliar with the operation, the explanation of the operation given by the voice guidance is insufficient, and it is difficult for the user to understand what to do. On the other hand, when the user is familiar with how to use the device, the content of the voice guidance is redundant, and it takes the user longer than necessary to complete the exchange.
[0004]
As conventional techniques that take this into account, there are voice response devices that switch the voice guidance sent out between unfamiliar users and skilled users (see, for example, Patent Documents 1 and 2). In Patent Documents 1 and 2, the user's skill level is determined from the length of the reaction time from when the voice response device starts sending the guidance until the user responds by voice, and the guidance is changed according to this skill level.
[0005]
Further, in Patent Document 3, when a user utters words unnecessary for the operation or hesitates while speaking, the user's skill level is lowered, so that guidance for users unfamiliar with the operation can be sent out to a low-skill user even when the reaction time is short.
[0006]
[Patent Document 1]
JP H04-344930 A
[Patent Document 2]
JP H10-20884 A
[Patent Document 3]
JP 2001-331196 A
[0007]
[Problems to be solved by the invention]
However, these conventional voice response devices do not take into account the speech rate, that is, whether the user is speaking slowly or quickly. As a result, when the reaction time is the same and the utterance content is correct, a user unfamiliar with the operation and a user accustomed to it are judged to have the same skill level.
[0008]
This is especially likely when the utterance content is simple, such as a string of digits or “yes” or “no”. For example, a person who says the digit string “0471” quickly and a person who says it slowly are judged to have the same skill level as long as the utterance content is correct and the reaction time is the same. The voice response device is therefore not necessarily easy for the user to use.
[0009]
In addition, the voice guidance selectively output when the skill level is low merely explains the operation in more detail than usual, for example “Is that all right? If so, please say ‘yes’” in place of the normal “Is that all right?”. The output voice quality, for example the volume and the speech rate, is not changed. The guidance therefore tends to remain hard to understand for users who are unfamiliar with the operation. In this respect as well, the voice response device is not necessarily easy for the user to use.
[0010]
Recently, more users access voice response devices via mobile phones, but even when the user has difficulty hearing the voice guidance, for example because of loud ambient noise, the output of the voice guidance is not changed in any way. For use from a mobile phone or the like, therefore, the voice response device is currently not necessarily easy for the user to use.
[0011]
An object of the present invention is to provide a voice response device that is easy for a user to use.
[0012]
[Means for Solving the Problems]
The present invention is a voice response device that recognizes the content of a user's utterance and provides preset voice guidance based on the recognition result, the device comprising speech rate detection means for detecting the user's speech rate, and utterance content changing means for changing the utterance content of the voice guidance given in response according to the detected speech rate. Even when the reaction time and the utterance content are the same, the skill level is properly determined from the user's speech rate and the utterance content of the voice guidance is selectively changed and output according to the result, which makes the voice response device easy for the user to use.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
A first embodiment of the present invention will be described with reference to FIGS. 1 to 5.
The voice response device of the present embodiment is an example of application to a ticket reservation server or the like accessed by telephone or similar means. The device itself is basically a computer with a built-in microcomputer, and executes various processes, including those described below, according to computer programs that are installed or stored in it.
[0014]
First, a hardware configuration example of the server 1 serving as the voice response device of the present embodiment will be described with reference to the schematic block diagram shown in FIG. 1. In outline, the server 1 has a CPU 2, which executes various arithmetic processes and centrally controls each unit, to which a ROM 3 that stores fixed data, a RAM 4 that stores variable data in a rewritable manner, output devices 5 such as a display device and a speaker, various input devices 6, a storage device 7, a communication interface 8, and the like are connected via a system bus 9.
[0015]
For example, a hard disk drive is used as the storage device 7. In the storage device 7, various programs are installed in addition to the OS, and various files, tables, and the like related to ticket reservation are stored and saved. The OS, programs, and various files installed in the storage device 7 are all or partially copied to the RAM 4 when the program of the server 1 is started, so that the access speed is increased.
[0016]
In addition, the server 1 can be accessed by voice by the user via the communication interface 8, a telephone, or the like.
[0017]
FIG. 2 shows a functional block diagram of the voice response device (server 1) of the present embodiment using such a computer. The voice response device of the present embodiment first includes a voice input unit 11 through which the user inputs voice, a voice analysis unit 12 that analyzes and detects the speech rate of the voice input to the voice input unit 11, and a voice analysis storage unit 13, using the ROM 3, in which conditions for determining the speech rate are set. The voice analysis unit 12 and the voice analysis storage unit 13 together realize the function of the speech rate detection means. The device also includes a voice recognition unit 14 that recognizes the input voice and outputs a voice recognition result, and a voice recognition dictionary unit 15, using the ROM 3 or the storage device 7, in which the words and phrases the user needs in order to operate the automatic voice response device and receive a service are set in advance. It further includes a dialogue processing storage unit 16, using the ROM 3 or the storage device 7, in which the content of the dialogue processing performed by the voice response device (server 1) is set in advance; a dialogue processing unit 17 that manages the utterance content of the voice guidance based on the voice recognition result and the settings in the dialogue processing storage unit 16; a guidance storage unit 18, using the ROM 3 or the storage device 7, in which the voice data for the utterance content of the voice guidance is set; a guidance output unit 19 that acquires the voice data for the utterance content of the voice guidance from the guidance storage unit 18 and outputs it; and a voice output unit 20 that outputs the voice data from the speaker among the output devices 5. The dialogue processing unit 17 and the dialogue processing storage unit 16 together realize the function of the utterance content changing means.
[0018]
In such a configuration, when a user inputs voice via the communication interface 8 or the like, the voice information input to the voice input unit 11 is sent to the voice analysis unit 12. The processing flow of the voice analysis unit 12, executed by the CPU 2 in accordance with the program, will now be described with reference to the schematic flowchart shown in FIG. 3.
[0019]
First, in step S1, the input voice is divided into frames, and the acoustic power level is obtained for each frame. The midpoint between the maximum and minimum acoustic power levels over all frames is taken as a reference value, and each frame is compared with this reference value to determine whether or not it is an utterance part. The utterance time of the input voice is then obtained by adding up the durations of the frames that are utterance parts. The result is stored in the internal memory (RAM 4).
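Step S1 can be sketched roughly as follows. This is only an illustrative reading of the paragraph above, not the implementation of the embodiment: the function names, the use of mean squared amplitude as the "acoustic power level", the fixed frame length, and the choice to treat frames strictly above the reference value as utterance parts are all assumptions made for the example.

```python
from typing import List, Sequence


def frame_power_levels(samples: Sequence[float], frame_len: int) -> List[float]:
    """Split samples into fixed-length frames and return each frame's mean power.

    Assumption: "acoustic power level" is taken to be the mean squared amplitude;
    a trailing partial frame is dropped.
    """
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(sum(x * x for x in frame) / frame_len)
    return levels


def utterance_time(levels: Sequence[float], frame_seconds: float) -> float:
    """Add up the duration of the frames judged to be utterance parts.

    The reference value is the midpoint between the maximum and minimum
    power levels; frames above it are assumed to be utterance parts.
    """
    if not levels:
        return 0.0
    reference = (max(levels) + min(levels)) / 2.0
    speech_frames = sum(1 for level in levels if level > reference)
    return speech_frames * frame_seconds
```

Because the reference value is derived from the input itself, no absolute threshold has to be tuned in this sketch, at the cost of always labelling at least one frame as an utterance part whenever the power levels are not all equal.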
[0020]
In step S2, the input voice is sent to the voice recognition unit 14, and the recognition result is awaited. If the recognition result from the voice recognition unit 14 shows that the voice was recognized normally (Y in S2), the process proceeds to step S3. If the voice could not be recognized (N in S2), the speech rate determination result is set to “normal speech rate”, stored in the internal memory (RAM 4), and the process proceeds to step S4.
[0021]
In step S3, the number of words is counted from the voice recognition result obtained from the voice recognition unit 14. For example, the recognition result “yoyaku” (よやく, “reservation”) counts as 3 words and “chiketto” (ちけっと, “ticket”) counts as 4 words; that is, the number of kana characters in the result is counted as is. The number of words uttered per second is then calculated as the “speech rate” from the utterance time obtained in step S1. The calculated speech rate is compared with the speech rate conditions preset in the voice analysis storage unit 13 to classify the speech rate of the input voice, the result is stored in the internal memory (RAM 4), and the process proceeds to step S4.
[0022]
FIG. 4 is an example of the speech rate conditions preset in the voice analysis storage unit 13. Here, the conditions are set so that a rate of 3.8 words per second or less is “slow utterance”, a rate above 3.8 and up to 4.2 is “normal speech rate”, and a rate above 4.2 is “fast utterance”.
[0023]
At this time, if the obtained utterance speed is 3.6, for example, the utterance speed of the input voice is determined as “slow utterance” and set in the internal memory (RAM 4).
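Steps S2 and S3 together with the FIG. 4 conditions can be sketched as below. The thresholds 3.8 and 4.2 and the fallback to “normal speech rate” when recognition fails come from the description; the function names, the use of `len()` on the recognized kana string as the word count, and the handling of a non-positive utterance time are assumptions for illustration.

```python
from typing import Optional

SLOW = "slow utterance"
NORMAL = "normal speech rate"
FAST = "fast utterance"


def speech_rate_label(words_per_second: float) -> str:
    """Classify a speech rate using the example thresholds of FIG. 4."""
    if words_per_second <= 3.8:
        return SLOW
    if words_per_second <= 4.2:
        return NORMAL
    return FAST


def classify_utterance(recognized: Optional[str], utterance_seconds: float) -> str:
    """Return the speech rate label for one recognized utterance.

    `recognized` is the recognition result as a kana string, or None when
    recognition failed, in which case the label falls back to NORMAL (step S2).
    Treating a non-positive utterance time the same way is an assumption.
    """
    if not recognized or utterance_seconds <= 0:
        return NORMAL
    # Assumption: one kana character counts as one "word", as in the
    # "yoyaku" = 3, "chiketto" = 4 example.
    return speech_rate_label(len(recognized) / utterance_seconds)
```

With these thresholds, a rate of 3.6 words per second is labelled “slow utterance”, matching the example in the description.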
[0024]
In step S4, the speech recognition result and the speech speed determination result set in the internal memory (RAM 4) are transmitted to the dialog processing unit 17 and the process is terminated.
[0025]
The dialogue processing unit 17 selects the field of voice guidance to be output according to the contents of the dialogue processing preset in the dialogue processing storage unit 16 from the voice recognition result. For the selected voice guidance field, the speech guidance utterance content list preset in the dialogue processing storage unit 16 is referred to, and the speech guidance having the utterance content corresponding to the utterance speed of the user is selected. Then, the selected voice guidance is transmitted to the guidance output unit 19.
[0026]
FIG. 5 shows an example of a table, preset in the dialogue processing storage unit 16 (ROM 3 or the like), listing the utterance content of the voice guidance by user speech rate for the “ticket reservation” field. In the illustrated example, the utterance content of the voice guidance is set to “Is this a ticket reservation? If so, please say ‘yes’; if not, please say ‘no’” for “slow utterance”, to “Your ticket reservation has been accepted. For what date would you like to reserve?” for “normal speech rate”, and to “For what date would you like to reserve?” for “fast utterance”. Thus, for example, if the voice recognition result is “ticket reservation” and the user's speech rate is judged to be “normal speech rate”, the voice guidance “Your ticket reservation has been accepted. For what date would you like to reserve?” is selected, and if it is judged to be “slow utterance”, the voice guidance “Is this a ticket reservation? If so, please say ‘yes’; if not, please say ‘no’” is selected.
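The FIG. 5 table amounts to a two-level lookup: first by guidance field, then by speech rate label. The following is a minimal sketch under that reading; the dictionary layout, the name `select_guidance`, and the English prompt strings (paraphrased renderings of the Japanese guidance texts) are assumptions, not part of the embodiment.

```python
from typing import Dict

# Guidance field -> speech rate label -> utterance content (after FIG. 5).
GUIDANCE_TABLE: Dict[str, Dict[str, str]] = {
    "ticket reservation": {
        "slow utterance": (
            "Is this a ticket reservation? "
            "If so, please say 'yes'; if not, please say 'no'."
        ),
        "normal speech rate": (
            "Your ticket reservation has been accepted. "
            "For what date would you like to reserve?"
        ),
        "fast utterance": "For what date would you like to reserve?",
    },
}


def select_guidance(field: str, rate_label: str) -> str:
    """Return the guidance text for a recognized field and a speech rate label."""
    return GUIDANCE_TABLE[field][rate_label]
```

Keeping the table as data rather than branching logic mirrors the description, in which the list is preset in a storage unit and only consulted by the dialogue processing unit.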
[0027]
The guidance output unit 19 acquires the voice data of the corresponding voice guidance from the guidance storage unit 18 from the utterance content of the selected voice guidance and transmits it to the voice output unit 20.
[0028]
The sound output unit 20 outputs the transmitted sound data as output sound.
[0029]
With the above processing, the utterance content of the voice guidance is changed according to the speech rate before it is output. The normal voice guidance is “Your ticket reservation has been accepted. For what date would you like to reserve?”, but for an unfamiliar user who speaks slowly it becomes the more careful guidance “Is this a ticket reservation? If so, please say ‘yes’; if not, please say ‘no’”, which is easier for that user. Conversely, for an accustomed user who speaks quickly, it becomes the minimal, non-redundant guidance “For what date would you like to reserve?”, which is easier for that user. In particular, because unfamiliar and accustomed users are distinguished by their speech rate, the skill level can be determined properly even when the reaction time is the same and the utterance content is correct.
[0030]
A second embodiment of the present invention will be described with reference to FIGS. 6 to 8. Parts identical to those shown in the first embodiment are denoted by the same reference numerals, and their description is omitted (the same applies to the subsequent embodiments).
[0031]
In the present embodiment, data conversion processing means consisting of a voice conversion unit 21, a voice conversion storage unit 22, and a speech speed conversion unit 23 is added on the output side of the guidance output unit 19 in the configuration (functional blocks) shown in FIG. 2. The voice conversion unit 21 issues instructions to change the speech rate (speaking speed) of the voice guidance input from the guidance output unit 19 according to the conversion conditions, and performs the volume conversion. The voice conversion storage unit 22, which uses the ROM 3 or the storage device 7, holds the preset conversion conditions applied when converting the voice data. The speech speed conversion unit 23 carries out the conversion of the speaking speed of the voice guidance. Together, the voice conversion unit 21, the voice conversion storage unit 22, and the speech speed conversion unit 23 change the speech rate and volume of the voice guidance, that is, its voice quality.
[0032]
In the present embodiment, only one utterance content is prepared for each voice guidance field in the guidance storage unit 18. For the ticket reservation field, for example, the normal utterance content “Your ticket reservation has been accepted. For what date would you like to reserve?” is stored.
[0033]
In addition, the utterance content list of the voice guidance preset in the dialogue processing storage unit 16 assigns the same utterance content to every speech rate.
[0034]
In such a configuration, when a user inputs a voice, the voice information input to the voice input unit 11 is sent to the voice analysis unit 12. The speech analysis unit 12 performs the same processing as in the above-described embodiment, and transmits the speech recognition result and the speech rate determination result set in the internal memory to the dialogue processing unit 17. The dialogue processing unit 17 selects the field of voice guidance to be output according to the contents of the dialogue processing preset in the dialogue processing storage unit 16 from the voice recognition result. In addition, the voice guidance is selected with reference to the utterance content list of the voice guidance preset in the dialogue processing storage unit 16 for the selected voice guidance field. Then, the selected voice guidance and utterance speed determination result are transmitted to the guidance output unit 19.
[0035]
The guidance output unit 19 acquires voice data corresponding to the utterance content of the voice guidance from the guidance storage unit 18 from the selected voice guidance and transmits it to the voice conversion unit 21. The received speech speed determination result is also transmitted to the voice conversion unit 21.
[0036]
The voice conversion unit 21 performs a conversion process on the received voice data according to the speech speed determination result, and then transmits the converted voice data to the voice output unit 20.
[0037]
FIG. 7 is a schematic flowchart showing the processing flow of the voice conversion unit 21. In step S11, the voice conversion unit 21 first stores the voice data input from the guidance output unit 19 in the internal memory. In the next step S12, the utterance speed determination result given to the voice conversion unit 21 is compared with the voice conversion conditions preset in the voice conversion storage unit 22.
[0038]
FIG. 8 shows the voice conversion conditions preset as a table in the voice conversion storage unit 22 using the ROM 3 or the like, that is, the conversion conditions for the volume and speech speed of the output voice guidance. When the user's utterance speed is “slow utterance”, the volume of the output voice guidance is set to 1.2 times and its speech speed to 0.9 times; when it is “normal utterance speed”, neither the volume nor the speech speed is changed; and when it is “fast utterance”, the volume is not changed and the speech speed is set to 1.2 times. Here, for example, if the utterance speed determination result is “slow utterance”, the volume conversion condition specifies a volume of 1.2 times, so the volume is to be changed.
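The conditions described for FIG. 8 can be pictured as a small lookup table. The following is a minimal sketch rather than the patent's implementation: the names `VOICE_CONVERSION_CONDITIONS` and `lookup_conversion` and the short string keys are assumptions made for illustration, while the multipliers are the ones stated in the text.

```python
# Hypothetical encoding of the FIG. 8 table:
# utterance-speed label -> (volume multiplier, speech-speed multiplier).
VOICE_CONVERSION_CONDITIONS = {
    "slow": (1.2, 0.9),    # "slow utterance": louder and slower guidance
    "normal": (1.0, 1.0),  # "normal utterance speed": no change
    "fast": (1.0, 1.2),    # "fast utterance": same volume, faster guidance
}


def lookup_conversion(rate_label):
    """Return (volume_factor, speed_factor) for an utterance speed determination result."""
    return VOICE_CONVERSION_CONDITIONS[rate_label]


if __name__ == "__main__":
    for label in ("slow", "normal", "fast"):
        print(label, lookup_conversion(label))
```

A factor of 1.0 in either position plays the role of "condition not met" in steps S12 and S14, so a single table can drive both checks.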
[0039]
In step S12, the determination result of the speech rate given to the voice conversion unit 21 is compared with the volume conversion condition of the output voice guidance set in the voice conversion storage unit 22. If the conversion condition is satisfied, the process goes to step S13. If the condition is not met, the process proceeds to step S14. In step S13, the volume of the audio data stored in the internal memory is converted, and after updating the audio in the internal memory, the process proceeds to step S14.
[0040]
In step S14, the utterance speed determination result given to the voice conversion unit 21 is compared with the speech speed conversion condition of the output voice guidance set in the voice conversion storage unit 22. If the condition applies, the process proceeds to step S15; if not, it proceeds to step S16.
[0041]
Here, for example, if the utterance speed determination result is “slow utterance”, the speech speed conversion condition of FIG. 8 sets the speech speed of the output voice guidance to 0.9 times. Since the speech speed is to be changed, the process proceeds to step S15.
[0042]
In step S15, the voice data in the internal memory (RAM 4) is transmitted to the speech speed conversion unit 23 together with the conversion instruction. The voice data in the internal memory (RAM 4) is then updated with the converted voice data received back, and the process proceeds to step S16. For example, if the speech speed is to be converted to 1.2 times, the voice data and an instruction to increase its speech speed to 1.2 times are given to the speech speed conversion unit 23, and voice data whose speech speed has been increased 1.2 times is received as the result.
[0043]
In step S16, the audio data in the internal memory (RAM 4) is transmitted to the audio output unit 20 and the process ends. The audio output unit 20 outputs the transmitted audio data from the speaker as output audio.
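Steps S11 to S16 can be sketched as follows. This is only an illustration under stated assumptions: voice data is modeled as a list of float samples, `CONDITIONS` restates the FIG. 8 values, and `convert_speed` is a hypothetical stand-in for the speech speed conversion unit 23 that uses naive resampling (which also shifts pitch, something a real speech speed converter would avoid).

```python
# Assumed conversion conditions (FIG. 8): label -> (volume factor, speed factor).
CONDITIONS = {"slow": (1.2, 0.9), "normal": (1.0, 1.0), "fast": (1.0, 1.2)}


def convert_speed(samples, factor):
    """Hypothetical stand-in for the speech speed conversion unit 23.

    Nearest-sample resampling: factor > 1 shortens the signal (faster speech),
    factor < 1 lengthens it (slower speech).
    """
    length = max(1, round(len(samples) / factor))
    last = len(samples) - 1
    return [samples[min(last, int(i * factor))] for i in range(length)]


def voice_conversion(samples, rate_label):
    """Mirror of the FIG. 7 flow for one piece of guidance voice data."""
    buffer = list(samples)                     # S11: store in internal memory
    volume, speed = CONDITIONS[rate_label]
    if volume != 1.0:                          # S12: does a volume condition apply?
        buffer = [s * volume for s in buffer]  # S13: convert volume, update memory
    if speed != 1.0:                           # S14: does a speed condition apply?
        buffer = convert_speed(buffer, speed)  # S15: delegate, update memory
    return buffer                              # S16: pass to the voice output unit 20


if __name__ == "__main__":
    print(len(voice_conversion([0.5] * 120, "fast")))
```

Keeping the speed conversion in its own function reflects the split in the text, where the voice conversion unit 21 only issues the instruction and the speech speed conversion unit 23 does the work.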
[0044]
Through the above processing, voice data conversion is performed that changes the volume and speech speed of the output voice guidance in accordance with the utterance speed of the input voice. When the utterance speed is slow, voice guidance is output at 1.2 times the volume and at a speech speed slowed to 0.9 times; when the utterance speed is fast, voice guidance is output with the speech speed increased to 1.2 times and the volume unchanged.
[0045]
With the above processing, even though the utterance content of the voice guidance is the same, for example “Ticket reservation accepted. How many days will it be?”, an unfamiliar user whose utterance speed is slow receives voice guidance at 1.2 times the volume and at a speech speed slowed to 0.9 times. The guidance is therefore easy to hear and can be followed without hurrying, which makes the device easy to use. Conversely, for an accustomed user whose utterance speed is fast, the speech speed is increased and the voice guidance finishes sooner, so no time is wasted and the device is again easy to use. In the present embodiment as well, unfamiliar and accustomed users are distinguished by their utterance speed, so the skill level can be determined properly even when the reaction time is the same and the utterance content is correct.
[0046]
In this embodiment, volume and speech speed have been described as examples of changing the voice quality of the voice guidance. However, only one of them may be changed, or the voice pitch or the like may be changed as the voice quality (the same applies to the subsequent embodiments).
[0047]
A third embodiment of the present invention will be described with reference to FIGS. 9 to 12. In the present embodiment, in the configuration (functional blocks) shown in FIG. 6, the voice analysis unit 12 and the voice analysis storage unit 13 realize not only the function of the utterance speed detection means but also the function of noise detection means for detecting the noise situation related to the user's utterance. In other words, the voice analysis unit 12 is given a function of analyzing the noise situation of the voice, and conditions for determining the noise situation are preset in the voice analysis storage unit 13 using the ROM 3 or the RAM 4.
[0048]
In such a configuration, when a user inputs a voice, the voice information input to the voice input unit 11 is sent to the voice analysis unit 12.
[0049]
FIG. 9 is a schematic flowchart showing the processing flow of the voice analysis unit 12 of the present embodiment. The processing related to the utterance speed shown in FIG. 3 is omitted from the figure. In step S21, the voice analysis unit 12 divides the input voice into frames and obtains the acoustic power level of each frame. The midpoint between the maximum and minimum acoustic power levels over all frames is used as a reference value, and each frame is compared with it to classify the frame as an utterance portion or a non-utterance portion. The ratio between the average acoustic power level of the utterance frames and that of the non-utterance frames is then obtained and compared with the noise situation conditions preset in the voice analysis storage unit 13 to determine the ambient noise situation. The determination result is stored in the internal memory (RAM 4), and the process proceeds to step S22.
[0050]
FIG. 10 is an example of the noise situation conditions preset as a table in the voice analysis storage unit 13 using the ROM 3 or the like. Here, the conditions are set in two stages: “low noise” when the ratio is 0.7 or less, and “high noise” when it exceeds 0.7. If, for example, the average acoustic power level of the sections in which utterance is performed is 80 and that of the sections in which no utterance is performed is 70, the ratio of the latter to the former is 70/80 = 0.875, which exceeds 0.7. The noise situation is therefore determined to be “high noise” and set in the internal memory (RAM 4).
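The noise determination of step S21 can be sketched as below. This is a hedged illustration, not the patent's code: the function names are hypothetical, the acoustic power level is assumed to be the mean squared amplitude of a frame, and the ratio is assumed to be the non-utterance average divided by the utterance average, which is the reading under which a larger value means more noise.

```python
def frame_power_levels(samples, frame_size):
    """Mean squared amplitude of each non-overlapping frame (assumed power measure)."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(s * s for s in frame) / len(frame) for frame in frames]


def noise_ratio(levels):
    """Split frames at the midpoint of the max and min level and compare the averages."""
    reference = (max(levels) + min(levels)) / 2
    utterance = [p for p in levels if p >= reference]
    background = [p for p in levels if p < reference]
    if not background:
        # Assumption: with no distinguishable pause, treat the input as all noise.
        return 1.0
    return (sum(background) / len(background)) / (sum(utterance) / len(utterance))


def classify_noise(ratio, threshold=0.7):
    """Two-stage condition of FIG. 10: at or below the threshold is 'low noise'."""
    return "low noise" if ratio <= threshold else "high noise"


if __name__ == "__main__":
    levels = frame_power_levels([0.1, 0.1, 0.9, 0.8, 0.1, 0.2], frame_size=2)
    print(classify_noise(noise_ratio(levels)))
```

Using the midpoint of the extreme levels as the reference keeps the split self-calibrating, so no absolute power threshold has to be stored alongside the ratio condition.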
[0051]
In step S22, the input voice is transmitted to the voice recognition unit 14, and a recognition result is awaited. Then, the recognition result obtained from the voice recognition unit 14 and the determination result of the noise situation set in the internal memory are transmitted to the dialogue processing unit 17 and the process is terminated.
[0052]
The dialogue processing unit 17 selects a field of voice guidance to be output according to the content of the dialogue processing set in advance in the dialogue processing storage unit 16 from the voice recognition result. For the selected voice guidance field, the speech guidance utterance content list preset in the dialogue processing storage unit 16 is referred to, and the speech guidance having the utterance content corresponding to the utterance speed of the user is selected. Then, the selected voice guidance, speech rate determination result, and noise status determination result are transmitted to the guidance output unit 19.
[0053]
The utterance content list of the voice guidance is set as shown in FIG. 11, for example. That is, the guidance items (utterance contents of the voice guidance) are classified according to the user's utterance speed. The guidance output unit 19 acquires, from the guidance storage unit 18, the voice data corresponding to the utterance content of the selected voice guidance and transmits it to the voice conversion unit 21. The received utterance speed determination result and noise situation determination result are also transmitted to the voice conversion unit 21.
[0054]
The voice conversion unit 21 converts the received voice data according to the speech speed determination result and the noise situation determination result, and then transmits the converted voice data to the voice output unit 20.
[0055]
The voice conversion unit 21 performs the voice processing according to the flowchart of FIG. 7. In step S11, the voice conversion unit 21 stores the voice data input from the guidance output unit 19 in the internal memory. In the next step S12, the utterance speed determination result and the noise situation determination result given to the voice conversion unit 21 are compared with the volume and speech speed conversion conditions of the output voice guidance preset in the voice conversion storage unit 22.
[0056]
FIG. 12 shows the conversion conditions for the volume and speech speed of the output voice guidance preset as a table in the voice conversion storage unit 22 using the ROM 3 or the like. Even when the user's utterance speed is the same, the volume and speech speed of the voice guidance are switched between two levels according to the noise situation. In particular, when the noise situation is “high noise”, the volume is set higher and the speech speed slower.
[0057]
In step S12, the determination result of the speech speed and the noise situation is compared with the sound volume conversion condition of the output voice guidance. If the conversion condition is satisfied, the process proceeds to step S13. If the conversion condition is not satisfied, the process proceeds to step S14.
[0058]
In step S13, the volume of the audio data stored in the internal memory is converted, and after updating the audio in the internal memory, the process proceeds to step S14.
[0059]
In step S14, the utterance speed determination result and the noise situation determination result are compared with the speech speed conversion condition of the output voice guidance. If the condition applies, the process proceeds to step S15; if not, it proceeds to step S16.
[0060]
Therefore, under such setting conditions, if, for example, the voice recognition result is “ticket reservation”, the utterance speed determination result is “normal utterance speed”, and the noise situation determination result is “low noise”, the voice guidance with the utterance content “Ticket reservation accepted. How many days will it be?” is output at the normal level for both volume and speech speed. If the noise situation determination result is “high noise”, the voice guidance “Ticket reservation accepted. How many days will it be?” is output with the volume increased to 1.2 times and the speech speed slowed to 0.9 times. The same applies to the other utterance speeds.
[0061]
Through the above processing, the utterance content of the voice guidance is selected according to the utterance speed and the noise situation, and the guidance is output with its volume and speech speed changed as the voice quality. Therefore, according to the present embodiment, even if the reaction time and the utterance content are the same, the utterance content of the voice guidance is selectively changed according to the user's utterance speed and the noise situation around the user, and in addition the voice data is processed to change the voice quality when the guidance is output. In noisy situations, slowing the speech speed and raising the volume make the voice guidance easier to hear, so the device is easy to use for accustomed and unfamiliar users alike. This is particularly effective in environments where a mobile phone or the like is used.
[0062]
A fourth embodiment of the present invention will be described with reference to FIG. 13. In the present embodiment, as in the third embodiment, in the configuration (functional blocks) shown in FIG. 6, the voice analysis unit 12 and the voice analysis storage unit 13 realize the function of the utterance speed detection means and also the function of the noise detection means for detecting the noise situation related to the user's utterance. That is, a function for analyzing the noise situation of the voice is added to the voice analysis unit 12, and conditions for determining the noise situation are preset in the voice analysis storage unit 13 using the ROM 3 or the like. In the present embodiment, only one type of voice guidance utterance content is prepared for each guidance field. For example, for the ticket reservation field, the normal utterance content “Ticket reservation accepted. How many days will it be?” is assumed to be stored.
[0063]
FIG. 13 shows the conversion conditions for the volume and speech speed of the output voice guidance preset as a table in the voice conversion storage unit 22 using the ROM 3 or the like. The volume and speech speed of the voice guidance are set so as to be switched between two levels according to the user's utterance speed and the noise situation. In particular, when the noise situation is “high noise”, the volume is set higher and the speech speed slower.
[0064]
The processing in the present embodiment is the same as in the third embodiment, except that the utterance content of the voice guidance is not selectively switched according to the user's utterance speed.
[0065]
Therefore, under such setting conditions, if, for example, the voice recognition result is “ticket reservation”, the utterance speed determination result is “slow utterance”, and the noise situation determination result is “low noise”, the voice guidance with the utterance content “Ticket reservation accepted. How many days will it be?” is output at 1.2 times the volume and 0.9 times the speech speed. If the noise situation determination result is “high noise”, the voice guidance with the same utterance content is output with the volume further increased to 1.3 times and the speech speed further slowed to 0.8 times. The same applies to the other utterance speeds.
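In the fourth embodiment the conversion conditions depend on two determination results at once, which maps naturally onto a table keyed by a pair. The sketch below is illustrative only: the names and key strings are assumptions, and it contains just the two “slow utterance” entries whose values appear in the text, returning `None` for combinations the text does not quantify.

```python
# Only the FIG. 13 entries quoted in the text.
# Key: (utterance speed, noise situation); value: (volume factor, speech-speed factor).
FIG13_KNOWN_ENTRIES = {
    ("slow", "low noise"): (1.2, 0.9),
    ("slow", "high noise"): (1.3, 0.8),
}


def conversion_for(rate_label, noise_label):
    """Return the factors for a combination, or None where the text gives no value."""
    return FIG13_KNOWN_ENTRIES.get((rate_label, noise_label))


if __name__ == "__main__":
    print(conversion_for("slow", "high noise"))
```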
[0066]
Therefore, according to the present embodiment, even when the reaction time and the utterance content are the same, the voice data is processed to change the voice quality when the voice guidance is output, according to the user's utterance speed and the noise situation around the user. In noisy situations, lowering the speech speed and raising the volume make the voice guidance easier to hear, so the device is easy to use for accustomed and unfamiliar users alike. This is particularly effective in environments where a mobile phone or the like is used.
[0067]
A fifth embodiment of the present invention will be described with reference to FIGS. 14 and 15. In the present embodiment, in the configuration (functional blocks) shown in FIG. 6, the voice analysis unit 12 and the voice analysis storage unit 13 realize the function of noise detection means for detecting the noise situation related to the user's utterance, and the function of the utterance speed detection means is removed. That is, in the present embodiment, the skill level discrimination based on the utterance speed is omitted, a function of analyzing the noise situation of the voice is added to the voice analysis unit 12, and conditions for determining the noise situation are preset in the voice analysis storage unit 13.
[0068]
In such a configuration, when a user inputs a voice, the voice information input to the voice input unit 11 is sent to the voice analysis unit 12. The processing flow of the voice analysis unit 12 of the present embodiment is the same as that described with the flowchart of FIG. 9.
[0069]
Here, FIG. 14 is an example of the noise situation conditions preset as a table in the voice analysis storage unit 13. The conditions are set in three stages: “low noise” when the ratio is 0.6 or less, “medium noise” when it exceeds 0.6 and is 0.7 or less, and “high noise” when it exceeds 0.7. If, for example, the average acoustic power level of the sections in which utterance is performed is 80 and that of the sections in which no utterance is performed is 70, the ratio of the latter to the former is 70/80 = 0.875, which exceeds 0.7. The noise situation is therefore determined to be “high noise” and set in the internal memory.
[0070]
In the present embodiment, only one type of voice guidance utterance content is prepared for each guidance field. For example, for the ticket reservation field, the normal utterance content “Ticket reservation accepted. How many days will it be?” is stored.
[0071]
FIG. 15 shows the conversion conditions for the volume and speech rate of the output voice guidance preset as a table in the voice conversion storage unit 22 using the ROM 3 or the like.
[0072]
The volume and speech speed of the voice guidance are set so as to be switched in three stages according to the noise situation. In particular, when the noise situation is “high noise”, the volume is set higher and the speech speed slower.
[0073]
Therefore, under such setting conditions, if, for example, the voice recognition result is “ticket reservation” and the noise situation determination result is “low noise”, the voice guidance with the utterance content “Ticket reservation accepted. How many days will it be?” is output at the normal volume and at a speech speed 1.2 times faster than normal. If the noise situation determination result is “high noise”, the voice guidance with the same utterance content is output with the volume increased to 1.2 times and the speech speed slowed to 0.9 times.
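The fifth embodiment chains a three-stage noise classification to a conversion table. A minimal sketch under stated assumptions follows: the function and table names are hypothetical, the labels are shorthand for the three stages, and the “medium noise” entry (no change) is an assumption because the text quantifies only the low-noise and high-noise cases.

```python
def classify_noise_three_stage(ratio):
    """Three-stage condition: <= 0.6 low, > 0.6 and <= 0.7 medium, > 0.7 high."""
    if ratio <= 0.6:
        return "low noise"
    if ratio <= 0.7:
        return "medium noise"
    return "high noise"


# Noise situation -> (volume factor, speech-speed factor).
NOISE_CONVERSION = {
    "low noise": (1.0, 1.2),     # from the text: normal volume, 1.2x speech speed
    "medium noise": (1.0, 1.0),  # assumption: the text gives no values for this stage
    "high noise": (1.2, 0.9),    # from the text: 1.2x volume, 0.9x speech speed
}


def guidance_factors(ratio):
    """Map a measured noise ratio directly to the guidance conversion factors."""
    return NOISE_CONVERSION[classify_noise_three_stage(ratio)]


if __name__ == "__main__":
    for ratio in (0.5, 0.65, 0.8):
        print(ratio, classify_noise_three_stage(ratio), guidance_factors(ratio))
```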
[0074]
Through the above processing, the voice guidance is output with its volume and speech speed changed as the voice quality according to the noise situation of the environment in which the user is speaking. Therefore, according to the present embodiment, even if the reaction time and the utterance content are the same, the voice data is processed to change the voice quality when the voice guidance is output, according to the noise situation around the user. In noisy situations, slowing the speech speed and raising the volume make the voice guidance easy to hear and the device easy to use. This is particularly effective in environments where a mobile phone or the like is used.
[0075]
[Effect of the invention]
According to the present invention, it is possible to provide a voice response device that is easy for a user to use.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of the present invention.
FIG. 2 is a functional block diagram thereof.
FIG. 3 is a schematic flowchart showing processing contents in the voice analysis unit.
FIG. 4 is an explanatory diagram showing an example of a condition for determining an utterance speed.
FIG. 5 is an explanatory diagram showing an example of utterance contents according to the utterance speed.
FIG. 6 is a functional block diagram showing a second embodiment of the present invention.
FIG. 7 is a schematic flowchart showing processing contents in the voice conversion unit.
FIG. 8 is an explanatory diagram showing a setting example of a volume and a speech speed according to the speech speed.
FIG. 9 is a schematic flowchart illustrating processing contents in a speech analysis unit according to a third embodiment of this invention.
FIG. 10 is an explanatory diagram illustrating an example of a noise situation determination condition.
FIG. 11 is an explanatory diagram showing an example of utterance contents according to the utterance speed and noise status.
FIG. 12 is an explanatory diagram illustrating a setting example of a volume and a speech speed according to a speech speed and a noise situation.
FIG. 13 is an explanatory diagram illustrating a setting example of a sound volume and a speech speed according to an utterance speed and a noise state according to the fourth embodiment of this invention.
FIG. 14 is an explanatory diagram illustrating an example of a noise situation determination condition according to the fifth embodiment of this invention.
FIG. 15 is an explanatory diagram illustrating a setting example of a volume and a speech speed according to a noise situation.
[Explanation of symbols]
12, 13 Speech rate detection means, noise detection means
16, 17 Utterance content change means
21-23 Data conversion processing means

Claims

A voice response device that recognizes a user's utterance content and provides preset voice guidance based on the recognition result, the device comprising:
utterance speed detecting means for detecting the utterance speed of the user; and
utterance content changing means for changing the utterance content of the responding voice guidance according to the detected utterance speed.

A voice response device that recognizes a user's utterance content and provides preset voice guidance based on the recognition result, the device comprising:
utterance speed detecting means for detecting the utterance speed of the user; and
data conversion processing means for performing voice data conversion processing that changes the voice quality of the responding voice guidance according to the detected utterance speed.

The voice response device according to claim 1, further comprising:
noise detecting means for detecting a noise situation related to the user's utterance; and
data conversion processing means for performing voice data conversion processing that changes the voice quality of the responding voice guidance according to the detected noise situation.

The voice response device according to claim 2, further comprising noise detecting means for detecting a noise situation related to the user's utterance,
wherein the data conversion processing means performs voice data conversion processing that changes the voice quality of the responding voice guidance according to the detected utterance speed and noise situation.

A voice response device that recognizes a user's utterance content and provides preset voice guidance based on the recognition result, the device comprising:
noise detecting means for detecting a noise situation related to the user's utterance; and
data conversion processing means for performing voice data conversion processing that changes the voice quality of the responding voice guidance according to the detected noise situation.