JP2005031150A

JP2005031150A - Apparatus and method for speech processing

Info

Publication number: JP2005031150A
Application number: JP2003193112A
Authority: JP
Inventors: Toshiaki Fukada; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-07-07
Filing date: 2003-07-07
Publication date: 2005-02-03

Abstract

PROBLEM TO BE SOLVED: To provide speech recognition and speech synthesis that, when speech recognition or speech synthesis does not support the user's native language, use a language the user is comfortable with and take into account that this language is not the user's native language.

SOLUTION: Information regarding the language skill of the user is acquired (step S201), a language to be the object of speech recognition is selected from a plurality of languages according to the acquired information (step S202), and operating conditions for speech recognition are set according to the information regarding the language skill and the language selected for recognition (step S203).

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、多言語の音声を認識しうる音声処理装置および方法、ならびに多言語の音声を出力しうる音声処理装置および方法に関するものである。
【０００２】
【従来の技術】
近年、複数の言語の音声を認識しうる音声認識装置、および複数の言語の音声を出力しうる音声合成装置が開発されつつある。ただし、現状の多言語に対応した音声認識装置や音声合成装置では、あらかじめ利用者が使用する言語を指定する必要がある。ここで、利用者の母国語がこれらの装置が処理可能な言語に含まれていない場合には、利用者にとってなるべく抵抗のない言語を選択し、利用者が操作しやすいようにこれらの装置を動作させることが望ましい。
【０００３】
また、多言語音声認識装置および多言語音声合成装置を用いた多言語音声対話システムを考えた場合、理想的には音声認識装置が取り扱う言語の種類と音声合成装置が取り扱う言語の種類は同一であることが望ましいが、現実にはそうであるとは限らない。例えば、ある多言語音声対話システムにおいて、音声認識は、英語、日本語、ドイツ語、フランス語、イタリア語の５か国語に対応しているが、音声合成は、英語、日本語、中国語の３か国語にだけ対応している、という場合もある。この場合、英語や日本語を母国語とする利用者にとっては、音声認識および音声合成ともこれらの言語に対応しているため問題はない。しかし、例えばドイツ語に対しては、音声認識は可能であるが音声合成はできないことになる。逆に、中国語に対しては、音声合成は可能であるが音声認識ができない。
【０００４】
そこでこのような場合の次善策として、ドイツ語を母国語としている利用者に対しては、ドイツ語以外の適切な言語が音声合成の言語として設定されることが望ましい。同様に、中国語を母国語としている利用者に対しては、中国語以外の適切な言語が音声認識の言語として設定されることが望ましい。また、例えば、音声認識および音声合成のいずれも対応していないオランダ語を母国語とする利用者がこの多言語音声対話システムを利用する場合は、使用可能な言語のうちの適切な言語が音声認識および音声合成の言語として設定されることが望ましい。
【０００５】
このような要請に対し、例えば特許文献１には、言語ごとに、音声認識、言語解析、言語生成、音声合成のオブジェクトによって言語依存オブジェクトを構成する音声翻訳システムにおいて、指定された言語依存オブジェクトが存在しない場合に、指定言語に最も近い（近さの定義の記述はなし）言語が選択されることが記載されている。
【０００６】
【特許文献１】
特開２００１−２２２５３０公報
【０００７】
【発明が解決しようとする課題】
しかしながら、特許文献１に開示された方法により単に近い言語を選択させてオブジェクトを動作させてしまうと、利用者にとっては非母国語のオブジェクトを母国語のオブジェクトとして利用することになり、例えば、音声出力の発声スピードが速すぎるため利用者が理解できなかったり、ネイティブな発音ができないために音声認識に失敗するという問題が生じる可能性がある。
【０００８】
そこで、本発明は、音声処理システムにおいて、音声認識または音声合成が利用者の母国語に対応していない場合に、利用者にとってなるべく抵抗のない言語を使用し、なおかつ、その言語が非母国語であることを考慮した音声認識や音声合成が提供されるようにすることを目的とする。
【０００９】
【課題を解決するための手段】
本発明の一側面によれば、複数の言語から選択された言語の音声認識を行う音声処理装置であって、利用者の言語能力に関する情報を取得する取得手段と、取得した言語能力に関する情報に基づいて、音声認識の対象とする言語を前記複数の言語から選択する選択手段と、前記言語能力に関する情報と認識対象の言語とに基づいて、音声認識の動作条件を設定する設定手段とを有することを特徴とする音声処理装置が提供される。
【００１０】
本発明の別の側面によれば、複数の言語から選択された言語の音声合成を行う音声処理装置であって、利用者の言語能力に関する情報を取得する取得手段と、取得した言語能力に関する情報に基づいて、音声合成を行う言語を前記複数の言語から選択する選択手段と、前記言語能力に関する情報と音声合成を行う言語とに基づいて、音声合成の動作条件を設定する設定手段とを有することを特徴とする音声処理装置が提供される。
【００１１】
【発明の実施の形態】
以下、図面を参照して本発明の好適な実施形態について詳細に説明する。
【００１２】
（第１の実施形態）
図１は、本発明の第１の実施形態に係る音声処理装置の構成を示すブロック図である。この音声処理装置は典型的にはＣＰＵを用いたコンピュータシステムで実現されうる。もちろん、ＣＰＵを使用せずに専用のハードウェアロジックで実現してもよい。
【００１３】
１０１はＣＰＵで、ＲＯＭ１０２や外部記憶装置１０４からＲＡＭ１０３にロードされたプログラムに従って、本音声処理装置全体の制御を司る。ＲＯＭ１０２はブートプログラムや各種パラメータなどを格納している。ＲＡＭ１０３は、ＣＰＵ１０１による各種制御の実行時に作業領域を提供する主記憶装置として機能する。
【００１４】
１０４は外部記憶装置としてのハードディスク装置で、図示するように、ここにＯＳの他、音声処理プログラムがインストールされている。この音声処理プログラムは多言語対応の音声認識プログラムを含んでいる。なお、音声処理プログラムは例えばＣＤ−ＲＯＭ１１０ａに格納されて提供され、ＣＤ−ＲＯＭドライブ１１０を介して外部記憶装置１０４にインストールされる。あるいは、図示しないネットワークを介して音声処理プログラムの提供を受けることも可能である。
【００１５】
１０５はマイクロフォンなどによる音声入力部である。１０６は液晶タッチパネルなどの操作表示部であり、処理内容の設定・入力、文字、画像による通知などの表示・出力を行う。１０７は補助入出力部で、例えば、ボタン、テンキー、キーボード、マウス、ペン、スイッチ、ＬＥＤなどの光情報、点字、アクチュエータなどで構成されうる。１０８はスピーカなどの音声出力部であり、利用者へのメッセージの通知などを行う。１０９は上記各部を接続するバスである。
【００１６】
図２は、本実施形態における音声処理プログラムの音声認識モジュールの構成を示す図である。
【００１７】
音声認識モジュールは、音響分析モジュール１と、探索モジュール２に大別される。音響分析モジュール１は、音声入力部１０５を介して入力された音声に対して、一定のフレーム間隔で音響特徴ベクトル（例えば、Δパワー、ＭＦＣＣなどで構成される）に変換する。探索モジュール２は、音響モデル３および単語辞書５を用いて、言語モデル（もしくは文法）４によって言語的な制約を加えつつ探索を行う。
【００１８】
実施形態における音声認識は多言語対応であり、たとえば日本語、英語、ドイツ語、フランス語、イタリア語の５カ国語に対応する。ここで、音響モデル３、言語モデル４、および単語辞書５はそれぞれ、図示のように、上記の各言語毎に、ネイティブ（ｎａｔｉｖｅ）話者用のものと、非ネイティブ（ｎｏｎ−ｎａｔｉｖｅ）話者用のものを個別に含む。そして、後述する方法によって選択された言語の音響モデル、言語モデル、単語辞書が選択されて探索処理が実行される。
【００１９】
図３は、本実施形態における音声認識の動作条件の設定処理を示すフローチャートである。
【００２０】
まず、ステップＳ２０１において、利用者の母国語や利用者が発話できる言語およびそのレベルなどに関する発話言語情報を言語能力に関する情報として獲得する。利用者の母国語に関しては、たとえば図４に示すような母国語選択画面を操作表示部１０６に表示し、利用者に選択させることによって獲得する。ここで例えば、利用者が「Ｄｅｕｔｃｈ」を選択した場合には、利用者の母国語はオランダ語となる。
【００２１】
なお、上記のような母国語選択画面を介して発話言語情報を獲得するのではなく、予めファイルなどに利用者ごとの発話言語情報を格納して、この情報をもとに獲得してもよい。この場合、利用者は利用者ＩＤの入力などの操作を行ったことに応じて、その利用者に関する情報を通知することが好ましい。
【００２２】
あるいは、発話言語情報は、利用者の発声内容に基づきその言語を識別する方法を用いることによって獲得するようにしてもよい。
【００２３】
利用者が発話できる言語およびそのレベルに関する情報に関しては、図５に示すような発話レベル選択画面を操作表示部１０６に表示し、利用者に選択させることによって獲得する。例えば、各言語の発話レベルが５段階（１が低く、５が高い）で与えられ、利用者がいずれか適当なものを選択することができる。同図の例は、英語の発話レベル４が選択された状態を示している。
【００２４】
なお、利用者が発話できる言語およびそのレベルに関する情報に関しては、上記のような発話レベル選択画面を介して獲得するのではなく、利用者に所定の音声を発声させ、その発声内容を音声認識にかけ、そのときの音声認識率、尤度、スコアなどの情報に基づいて獲得するようにしてもよい。他にも、音声認識によらない方法として、一般的な語学レベルを測定するテストを利用して発話レベルを獲得してもよい。
【００２５】
次に、ステップＳ２０２において、発話言語情報に基づいて音声認識の対象とする言語を選択する。利用者の母国語が英語、日本語、ドイツ語、フランス語、イタリア語のいずれかである場合には、その母国語による音声認識を実行すればよい。しかし、利用者の母国語がこれら以外である場合には、利用者に対してなるべく負担の少ない言語を選択する必要がある。すなわち、音声認識の対象となりうる複数の言語のうち利用者がどの言語を話すことができるかという情報を獲得する必要がある。図５に示した発話レベル選択画面を介してこの情報が獲得できる場合には、最も高い発話レベルとして設定された言語を音声認識の対象言語として設定する。このとき、最も高い発話レベルが複数の言語に対して存在する場合には、利用者にそれらの言語を提示し、選択させる、もしくは、音声認識モジュールの不特定話者に対する認識率がより高い言語を自動的に選択することになどよって決定することができる。
【００２６】
なお、先のステップＳ２０１では、図４に示したような母国語選択画面や図５に示したような発話レベル選択画面を介して、それぞれ、利用者の母国語の情報と、利用者が発話できる言語およびそのレベルに関する情報の両方を、発話言語情報として取得していたが、これは少なくともいずれか一方が取得できればよい。例えば、図４の母国語選択画面を介して母国語に関する情報が得られない場合には、直接対応する母国語の音響モデルや言語モデルを選択することはできなくなるが、少なくとも、図５の発話レベル選択画面を介して得られた情報に基づいて音声認識の動作条件や言語を設定することはできる。逆に、図５の発話レベル選択画面を介して利用者の発話できる言語およびそのレベルに関する情報が得られない場合には、少なくとも、図４の母国語選択画面を介して得られた母国語の情報に基づいて、その母国語を話す人の一般的な利用者が発話できる言語およびそのレベルに関する情報を例えば図６に示すような形で親密度（たとえば１０段階で表され、１０が最も親密度が高い）として予め求めておくことにより、音声認識の動作条件や言語を設定することができる。
【００２７】
次に、ステップＳ２０３において、音声認識の動作条件を設定する。この際、図５に示した発話レベル選択画面を介して得られる利用者の使用言語に対する発話レベルをその言語に対する「親密度」として捉え、親密度の値に応じて、音声認識の動作条件を変更する。例えば、図５における１から５までの発話レベルを２倍したもので親密度を定義することができる。そして、親密度に応じた音声認識の動作条件は、親密度と動作条件との対応関係を記述した動作条件テーブルを参照することで決定される。この動作条件テーブルは、たとえば図７に示すような構造で、音声処理プログラムに付随してハードディスク１０４に格納される。音声認識の動作条件としては、音声認識の探索条件など探索処理に関するもの、認識候補の数など結果出力に関するもの、音響モデルの種類など音響的なモデルに関するもの、音声認識の語彙や文法など言語的なモデルに関するものなどがある。
【００２８】
図７に示した動作条件テーブルには、探索条件としてビームサーチにおけるビーム幅の値、Ｎベスト出力（上位Ｎ位までの文仮説が生成される）のＮの数、音響モデルにおけるｎａｔｉｖｅ／ｎｏｎ−ｎａｔｉｖｅの選択、言語モデルにおけるｎａｔｉｖｅ／ｎｏｎ−ｎａｔｉｖｅの選択などが規定されている。図７を参照すると、たとえば親密度が６の場合には、ビーム幅を２００、Ｎベスト出力数を５、ｎｏｎ−ｎａｔｉｖｅ用の音響モデル、ｎｏｎ−ｎａｔｉｖｅ用の認識文法を用いる、といった音声認識の動作条件が設定される。したがって、このステップＳ２０３では、ステップＳ２０２で選択された言語の音響モデル、言語モデル、単語辞書のそれぞれについて、親密度に応じてｎａｔｉｖｅ用かｎｏｎ−ｎａｔｉｖｅ用かが選択される。
【００２９】
なお、図２では、音響モデル、言語モデル、単語辞書の全てについてｎｏｎ−ｎａｔｉｖｅのモデルが存在するが、これらのいくつかについてのみｎｏｎ−ｎａｔｉｖｅのモデルを含んだ構成としてもよいし、ｎｏｎ−ｎａｔｉｖｅのモデルを含まない構成としてもよい。また、ｎｏｎ−ｎａｔｉｖｅのモデルは１つでなく、ｎｏｎ−ｎａｔｉｖｅの度合いや母国語の種類やカテゴリに応じた複数のモデルを用いた構成としてもよい。ここで、ｎａｔｉｖｅのモデルは、通常用いられる方法によって、音響モデル、言語モデル、単語辞書を作成すればよい。次に、ｎｏｎ−ｎａｔｉｖｅの音響モデルは、ｎｏｎ−ｎａｔｉｖｅが発声した音声データベースを用いて音響モデルを作成することができる。また、ｎｏｎ−ｎａｔｉｖｅの言語モデルは、平易な単語や短文で構成された簡単な文法を作成する方法や、平易な単語や短文を用いたＮ−ｇｒａｍなどの統計的言語モデルを作成する方法を用いることができる。また，ｎｏｎ−ｎａｔｉｖｅの単語辞書は、ｎｏｎ−ｎａｔｉｖｅが発声しやすい発音辞書を作成することができる。
【００３０】
このようにして、利用者の発話言語および発話レベルに応じて音声認識の動作条件が設定される。このため、非母国語を音声認識させる場合にはｎｏｎ−ｎａｔｉｖｅ用の音響モデル、言語モデル、単語辞書を用いて音声認識が実行されるように設定されるので、従来のようにネイティブな発音ができないために音声認識に失敗するということが少なくなる。
【００３１】
（第２の実施形態）
第２の実施形態は、音声合成処理に関するものである。本実施形態に係る音声処理装置の構成は図１に示したものと同様であるが、本実施形態における音声処理プログラムは、第１の実施形態における音声認識モジュールのかわりに、図８に示すような構成の音声合成モジュール１０を有する。
【００３２】
本実施形態における音声合成は多言語対応であり、たとえば日本語、英語、中国語の３カ国語に対応する。これに伴い、音声合成モジュール１０は、図８に示すように、各国語用の合成モジュール１１，１２，１３を含み、後述する方法によって選択された言語に対応する合成モジュールが実行される。日本語用の合成モジュール１１は、たとえば図示のような構成を有する。テキスト解析モジュール１４は、入力された日本語テキストの構文解析（具体的には、形態素解析）を、言語辞書１４ａを用いて行う。言語辞書１４ａは、図示のように、ネイティブ用のものと、非ネイティブ用のものとを個別に備えていることが好ましい。言語処理モジュール１５は音韻処理モジュール１５ａと韻律処理モジュール１５ｂとを含み、音韻処理モジュール１５ａは、テキストの解析結果に基づき音素記号列を出力し、韻律処理モジュール１５ｂは、ポーズ、アクセント、イントネーション、継続時間長などの韻律情報を出力する。音響処理モジュール１６は、入力した音素記号列および韻律情報に基づいて、波形辞書１７を用いて合成音声を生成する。また、波形辞書はネイティブ話者用のものと、非ネイティブ話者用のものとを個別に備えている。なお、英語用の合成モジュール１２および中国語用の合成モジュール１３も、日本語用の合成モジュール１１と同様の構成であるので、図示およびその説明は省略する。
【００３３】
図９は、本実施形態における音声合成の動作条件の設定処理を示すフローチャートである。
【００３４】
まず、ステップＳ３０１において、利用者の母国語や利用者が音で聞いて理解（聴解）できる言語およびそのレベルなどに関する聴解言語情報を獲得する。利用者の母国語に関しては、第１の実施形態で説明した図４と同様の母国語選択画面を操作表示部１０６に表示し、利用者に選択させることによって獲得する。また、利用者が聴解できる言語およびそのレベルに関する情報に関しても、第１の実施形態で説明した図５と同様の発話レベル選択画面を操作表示部１０６に表示し、利用者に選択させることによって獲得する。
【００３５】
なお、聴解言語情報は、予めファイルなどに利用者ごとの聴解言語情報を格納して、この情報をもとに獲得してもよい。この場合、利用者は利用者ＩＤの入力などの操作を行うことによって、利用者に関する情報を通知することが好ましい。
【００３６】
また、利用者が聴解できる言語のレベルに関する情報に関しては、所定内容の音声合成に対する利用者の応答結果もしくは応答時間などに基づいて獲得するようにしてもよい。
【００３７】
次に、ステップＳ３０２において、聴解言語情報に基づいて音声出力の言語を選択する。利用者の母国語が英語、日本語、中国語のいずれかである場合には、その母国語による音声合成を実行すればよい。しかし、利用者の母国語がこれらの３言語以外である場合には、利用者に対してなるべく負担の少ない言語を選択する必要がある。すなわち、音声出力が可能な言語のうち利用者がどの言語が聴解しやすいかという情報を獲得する必要がある。発話レベル選択画面を介してこの情報が獲得できる場合には、その情報を用いて音声出力の言語を設定する。
【００３８】
なお、先のステップＳ３０１では、母国語選択画面や発話レベル選択画面を介して、それぞれ、利用者の母国語と利用者が聴解できる言語およびそのレベルに関するものの両方を、聴解言語情報として取得していたが、これは少なくともいずれか一方を取得できればよい。例えば、図４の母国語選択画面を介して母国語に関する情報が得られない場合には、直接対応する母国語の合成モジュールを特定することができなくなるが、少なくとも、図５の発話レベル選択画面を介して得られた情報に基づいて音声合成の動作条件や言語を設定することができる。逆に、図５の発話レベル選択画面を介して利用者が聴解できる言語およびそのレベルに関する情報が得られない場合には、図４の母国語選択画面を介して得られた母国語の情報に基づいて、その母国語を話す人の一般的な利用者が聴解できる言語およびそのレベルに関する情報を例えば図１０に示すような形で親密度として予め求めておくことにより、音声合成の動作条件や言語を設定することができる。
【００３９】
次に、ステップＳ３０３において、音声合成の動作条件を設定する。この際、図５と同様の発話レベル選択画面を介して得られる利用者が選択した言語に対する聴解レベルをその言語に対する「親密度」として捉え、親密度の値に応じて、音声合成の動作条件を変更する。なお、聴解レベルと親密度は、第１の実施形態で述べた方法と同様に規定することが可能である。
【００４０】
親密度に応じた音声合成の動作条件は、親密度と動作条件との対応関係を記述した動作条件テーブルを参照することで決定される。この動作条件テーブルは、たとえば図１１に示すような構造で、音声処理プログラムに付随してハードディスク１０４に格納される。音声合成の動作条件としては、音声合成の発声速度など韻律的な要因に関するもの、音量など出力の要因に関するもの、波形辞書の種類など音韻的な要因に関するもの、応答文の内容など言語的な要因に関するものなどがある。
【００４１】
図１１に示した動作条件テーブルには、韻律的な要因として韻律制御における発声速度、出力の要因に関するものとして音量、音韻的な要因としてｎａｔｉｖｅもしくはｎｏｎ−ｎａｔｉｖｅの波形辞書の選択、言語的な要因として応答文の内容および繰り返し回数などが規定されている。図１１を参照すると、たとえば親密度が６の場合には、発声速度はゆっくり、音量は７（０を最小音量、９を最大音量とする１０段階のうちの７）、ｎｏｎ−ｎａｔｉｖｅ音声を利用した波形辞書、丁寧な応答文を用いる、といった音声合成の動作条件が設定される。したがって、このステップＳ３０３では、ステップＳ３０２で選択された言語の音声合成モジュールが選択され、さらに、親密度に応じて、波形辞書のうちｎａｔｉｖｅ用かｎｏｎ−ｎａｔｉｖｅ用のものが選択される。
【００４２】
なお、図８では、波形辞書にｎｏｎ−ｎａｔｉｖｅの波形辞書が存在するが、ｎｏｎ−ｎａｔｉｖｅの波形辞書を含まない構成としてもよい。また、ｎｏｎ−ｎａｔｉｖｅの波形辞書は１つでなく、ｎｏｎ−ｎａｔｉｖｅの度合いや母国語の種類やカテゴリに応じた複数の波形辞書を用いた構成としてもよい。ここで、ｎａｔｉｖｅの波形辞書は、通常用いられる方法によって作成すればよい。次に、ｎｏｎ−ｎａｔｉｖｅの波形辞書は、ｎｏｎ−ｎａｔｉｖｅがｎａｔｉｖｅの言語を発声した音声データを用いて作成することができる。他にも、ｎｏｎ−ｎａｔｉｖｅがｎｏｎ−ｎａｔｉｖｅの言語を発声した音声データと、ｎｏｎ−ｎａｔｉｖｅの言語の音素体系（発音体系）とｎａｔｉｖｅの言語の音素体系（発音体系）を対応付けることによって作成することもできる。
【００４３】
このようにして、利用者の母国語や聴解可能な言語の聴解レベルに応じて音声合成の動作条件が設定される。このため、非母国語を音声合成させる場合には、ｎｏｎ−ｎａｔｉｖｅ用として調整された音韻、韻律情報に従い音声合成が実行されるように設定されるので、音声出力の内容の理解の向上に役立つ。
【００４４】
（第３の実施形態）
第１の実施形態では、音声認識機能を有する音声処理装置について説明し、第２の実施形態では、音声合成機能を有する音声処理装置について説明したが、第１の実施形態と第２の実施形態とを組み合わせて、音声認識機能と音声合成機能の両方を兼ね備えた音声処理装置を実現することも可能である。
【００４５】
第３の実施形態に係る音声処理装置の構成は図１に示したものと同様であるが、本実施形態における音声処理プログラムは、第１の実施形態おける音声認識モジュールと、第２の実施形態における音声合成モジュールとを含む。
【００４６】
図１２は、本実施形態における音声認識および音声合成の動作条件の設定処理を示すフローチャートである。
【００４７】
動作の手順は、ステップＳ４０１において、発話および聴解言語に関する情報を同時に獲得していること以外は、第１および第２の実施形態における処理と同様である。ここで、音声認識の言語および音声合成の言語で用いる言語についてはそれぞれステップＳ４０２およびステップＳ４０４で、第１および第２の実施形態で説明した方法に従い、独立に求めることも可能であるが、利用者によっては音声認識と音声合成の言語が異なると違和感があると考えられるため、音声認識と音声合成の親密度の和が最大となる言語を音声認識および音声合成に対して適用することも可能である。
【００４８】
（他の実施形態）
以上、本発明の実施形態を詳述したが、本発明は、例えばシステム、装置、方法、プログラムもしくは記憶媒体等としての実施態様をとることが可能である。また、本発明は、複数の機器から構成されるシステムに適用してもよいし、また、一つの機器からなる装置に適用してもよい。
【００４９】
なお、本発明は、前述した実施形態の機能を実現するソフトウェアのプログラムを、システムあるいは装置に直接あるいは遠隔から供給し、そのシステムあるいは装置のコンピュータがその供給されたプログラムコードを読み出して実行することによっても達成される場合を含む。その場合、プログラムの機能を有していれば、その形態はプログラムである必要はない。
【００５０】
従って、本発明の機能処理をコンピュータで実現するために、そのコンピュータにインストールされるプログラムコード自体も本発明を実現するものである。つまり、本発明の特許請求の範囲には、本発明の機能処理を実現するためのコンピュータプログラム自体も含まれる。
【００５１】
その場合、プログラムの機能を有していれば、オブジェクトコード、インタプリタにより実行されるプログラム、ＯＳに供給するスクリプトデータ等、プログラムの形態を問わない。
【００５２】
プログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＭＯ、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＤＶＤ（ＤＶＤ−ＲＯＭ、ＤＶＤ−Ｒ）などがある。
【００５３】
その他、プログラムの供給方法としては、クライアントコンピュータのブラウザを用いてインターネットのホームページに接続し、そのホームページから本発明のコンピュータプログラムそのもの、もしくは圧縮され自動インストール機能を含むファイルをハードディスク等の記録媒体にダウンロードすることによっても供給できる。また、本発明のプログラムを構成するプログラムコードを複数のファイルに分割し、それぞれのファイルを異なるホームページからダウンロードすることによっても実現可能である。つまり、本発明の機能処理をコンピュータで実現するためのプログラムファイルを複数のユーザに対してダウンロードさせるＷＷＷサーバも、本発明のクレームに含まれるものである。
【００５４】
また、本発明のプログラムを暗号化してＣＤ−ＲＯＭ等の記憶媒体に格納してユーザに配布し、所定の条件をクリアしたユーザに対し、インターネットを介してホームページから暗号化を解く鍵情報をダウンロードさせ、その鍵情報を使用することにより暗号化されたプログラムを実行してコンピュータにインストールさせて実現することも可能である。
【００５５】
また、コンピュータが、読み出したプログラムを実行することによって、前述した実施形態の機能が実現される他、そのプログラムの指示に基づき、コンピュータ上で稼動しているＯＳなどが、実際の処理の一部または全部を行い、その処理によっても前述した実施形態の機能が実現され得る。
【００５６】
さらに、記録媒体から読み出されたプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によっても前述した実施形態の機能が実現される。
【００５７】
【発明の効果】
本発明によれば、音声処理システムにおいて、音声認識または音声合成が利用者の母国語に対応していない場合に、利用者にとってなるべく抵抗のない言語を使用し、なおかつ、その言語が非母国語であることを考慮した音声認識や音声合成が提供される。
【図面の簡単な説明】
【図１】実施形態における音声処理装置の構成を示すブロック図である。
【図２】実施形態における音声認識モジュールの構成を示す図である。
【図３】実施形態における音声認識の動作条件の設定処理を示すフローチャートである。
【図４】実施形態における母国語選択画面の一例を示す図である。
【図５】実施形態における発話レベル選択画面の一例を示す図である。
【図６】母国語毎の他の言語に対する親密度の例を示す図である。
【図７】実施形態における音声認識の動作条件テーブルの一例を示す図である。
【図８】実施形態における音声合成モジュールの構成を示す図である。
【図９】実施形態における音声合成の動作条件の設定処理を示すフローチャートである。
【図１０】母国語毎の他の言語に対する親密度の例を示す図である。
【図１１】実施形態における音声合成の動作条件テーブルの一例を示す図である。
【図１２】実施形態における音声認識および音声合成の動作条件の設定処理を示すフローチャートである。
[0001]
[Technical Field of the Invention]
The present invention relates to a speech processing apparatus and method capable of recognizing multilingual speech, and a speech processing apparatus and method capable of outputting multilingual speech.
[0002]
[Prior art]
In recent years, speech recognition devices that can recognize speech in multiple languages and speech synthesizers that can output speech in multiple languages have been under development. However, current multilingual speech recognition devices and speech synthesizers require the language used by the user to be specified in advance. If the user's native language is not among the languages these devices can process, it is desirable to select a language that poses as little difficulty as possible for the user and to operate the devices in a way that is easy for the user.
[0003]
Consider also a multilingual spoken dialogue system built from a multilingual speech recognition device and a multilingual speech synthesizer. Ideally, the set of languages handled by the speech recognition device would be the same as the set handled by the speech synthesizer, but in practice this is not always the case. For example, a given multilingual spoken dialogue system may support five languages for speech recognition (English, Japanese, German, French, and Italian) but only three for speech synthesis (English, Japanese, and Chinese). In this case, there is no problem for a user whose native language is English or Japanese, because both speech recognition and speech synthesis support these languages. For German, however, speech recognition is possible but speech synthesis is not. Conversely, for Chinese, speech synthesis is possible but speech recognition is not.
[0004]
As a second-best measure in such cases, it is desirable that an appropriate language other than German be set as the speech synthesis language for a user whose native language is German. Similarly, it is desirable that an appropriate language other than Chinese be set as the speech recognition language for a user whose native language is Chinese. Likewise, when a user whose native language is Dutch, which is supported by neither speech recognition nor speech synthesis, uses this multilingual spoken dialogue system, it is desirable that an appropriate language among the available languages be set as the language for both speech recognition and speech synthesis.
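The coverage gap described in this example can be made concrete with a small sketch. The two language sets come from the example in the text; the function name `coverage` is a hypothetical helper introduced only for illustration.

```python
# Language sets taken from the example dialogue system in the text.
RECOGNITION_LANGUAGES = {"English", "Japanese", "German", "French", "Italian"}
SYNTHESIS_LANGUAGES = {"English", "Japanese", "Chinese"}


def coverage(native_language):
    """Report whether recognition and synthesis support the given language.

    Hypothetical helper: it only restates the example, it is not part of
    the described apparatus.
    """
    return {
        "recognition": native_language in RECOGNITION_LANGUAGES,
        "synthesis": native_language in SYNTHESIS_LANGUAGES,
    }


if __name__ == "__main__":
    for language in ("English", "German", "Chinese", "Dutch"):
        print(language, coverage(language))
```

German is covered by recognition only, Chinese by synthesis only, and Dutch by neither, which are the three cases for which a substitute language has to be chosen.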
[0005]
In response to such a need, Patent Document 1, for example, describes a speech translation system in which a language-dependent object is composed, for each language, of speech recognition, language analysis, language generation, and speech synthesis objects. When the designated language-dependent object does not exist, the language closest to the designated language is selected (the document gives no definition of closeness).
[0006]
[Patent Document 1]
JP 2001-222530 A
[0007]
[Problems to be solved by the invention]
However, if an object is operated simply by selecting a close language using the method disclosed in Patent Document 1, the user ends up using a non-native-language object as if it were a native-language object. As a result, for example, the user may be unable to understand the speech output because its speaking rate is too fast, or speech recognition may fail because the user cannot produce native-like pronunciation.
[0008]
Accordingly, an object of the present invention is to provide, in a speech processing system, speech recognition and speech synthesis that use a language posing as little difficulty as possible for the user when speech recognition or speech synthesis does not support the user's native language, and that take into account the fact that this language is not the user's native language.
[0009]
[Means for Solving the Problems]
According to one aspect of the present invention, there is provided a speech processing apparatus that performs speech recognition of a language selected from a plurality of languages, the apparatus comprising: acquisition means for acquiring information on a user's language ability; selection means for selecting, from the plurality of languages, a language to be the target of speech recognition based on the acquired information on language ability; and setting means for setting operating conditions for speech recognition based on the information on language ability and the language to be recognized.
[0010]
According to another aspect of the present invention, there is provided a speech processing apparatus that performs speech synthesis of a language selected from a plurality of languages, the apparatus comprising: acquisition means for acquiring information on a user's language ability; selection means for selecting, from the plurality of languages, a language in which to perform speech synthesis based on the acquired information on language ability; and setting means for setting operating conditions for speech synthesis based on the information on language ability and the language in which speech synthesis is performed.
[0011]
[Embodiments of the Invention]
Preferred embodiments of the present invention are described in detail below with reference to the drawings.
[0012]
(First embodiment)
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the first embodiment of the present invention. This speech processing apparatus can typically be realized by a computer system using a CPU. It may, of course, instead be realized with dedicated hardware logic without using a CPU.
[0013]
Reference numeral 101 denotes a CPU, which controls the entire speech processing apparatus according to a program loaded into the RAM 103 from the ROM 102 or the external storage device 104. The ROM 102 stores a boot program and various parameters. The RAM 103 functions as a main storage device that provides a work area when the CPU 101 executes various kinds of control.
[0014]
Reference numeral 104 denotes a hard disk device serving as an external storage device. As shown in the figure, a speech processing program is installed on it in addition to the OS. This speech processing program includes a multilingual speech recognition program. The speech processing program is provided, for example, stored on a CD-ROM 110a and is installed in the external storage device 104 via the CD-ROM drive 110. Alternatively, the speech processing program can be supplied via a network (not shown).
[0015]
Reference numeral 105 denotes a speech input unit such as a microphone. Reference numeral 106 denotes an operation display unit such as a liquid crystal touch panel, which is used for setting and entering processing content and for displaying and outputting notifications as text and images. Reference numeral 107 denotes an auxiliary input/output unit, which can be composed of, for example, buttons, a numeric keypad, a keyboard, a mouse, a pen, switches, optical indicators such as LEDs, Braille, and actuators. Reference numeral 108 denotes a speech output unit such as a speaker, which is used, for example, to deliver messages to the user. Reference numeral 109 denotes a bus connecting the above units.
[0016]
FIG. 2 is a diagram showing the configuration of the speech recognition module of the speech processing program in the present embodiment.
[0017]
The speech recognition module is broadly divided into an acoustic analysis module 1 and a search module 2. The acoustic analysis module 1 converts the speech input via the speech input unit 105 into acoustic feature vectors (composed of, for example, Δ power and MFCCs) at a fixed frame interval. The search module 2 uses the acoustic model 3 and the word dictionary 5 to perform a search while applying linguistic constraints from the language model (or grammar) 4.
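As a rough illustration of frame-based acoustic analysis, the sketch below cuts a signal into frames at a fixed shift, computes the log power of each frame, and derives a Δ power feature. It is a simplification under stated assumptions: the function names, the frame sizes, and the use of a plain first-order difference for the delta are all illustrative choices, and MFCC extraction is omitted entirely.

```python
import math


def frame_log_power(samples, frame_len, frame_shift):
    """Return the log power of each frame taken at a fixed frame shift."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        powers.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return powers


def delta(values):
    """First-order difference between neighbouring frames.

    Assumption: a plain difference is used here for brevity; practical
    front ends often use a regression over several frames instead.
    """
    return [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]


if __name__ == "__main__":
    # Toy signal whose amplitude doubles halfway through.
    signal = [1.0] * 8 + [2.0] * 8
    log_power = frame_log_power(signal, frame_len=4, frame_shift=4)
    print(delta(log_power))
```

The delta is non-zero only at the frame where the amplitude changes, which is the kind of dynamic information a Δ power feature is meant to capture.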
[0018]
Speech recognition in this embodiment is multilingual and supports, for example, five languages: Japanese, English, German, French, and Italian. As shown in the figure, the acoustic model 3, the language model 4, and the word dictionary 5 each include, for every one of these languages, a separate version for native speakers and a separate version for non-native speakers. The acoustic model, language model, and word dictionary of the language selected by the method described later are then selected, and the search processing is executed.
[0019]
FIG. 3 is a flowchart showing processing for setting the operation conditions for speech recognition in the present embodiment.
[0020]
First, in step S201, utterance language information on the user's native language, the languages the user can speak, their levels, and so on is acquired as information on language ability. The user's native language is acquired by, for example, displaying a native language selection screen such as that shown in FIG. 4 on the operation display unit 106 and having the user make a selection. For example, when the user selects “Dutch”, the user's native language is Dutch.
[0021]
Instead of acquiring the utterance language information via the native language selection screen described above, the utterance language information for each user may be stored in advance in a file or the like and acquired from that stored information. In this case, it is preferable that the information on the user be supplied in response to the user performing an operation such as entering a user ID.
[0022]
Alternatively, the utterance language information may be obtained by using a method for identifying the language based on the utterance content of the user.
[0023]
Information on the languages the user can speak and their levels is acquired by displaying an utterance level selection screen such as that shown in FIG. 5 on the operation display unit 106 and having the user make a selection. For example, the utterance level for each language is given on a five-point scale (1 is low and 5 is high), and the user selects whichever level is appropriate. The example in the figure shows a state in which utterance level 4 is selected for English.
[0024]
Note that, instead of acquiring the information on the languages the user can speak and their levels via the utterance level selection screen described above, the user may be asked to utter predetermined speech, that speech may be submitted to speech recognition, and the information may be acquired from the resulting speech recognition rate, likelihood, score, or similar information. As a method that does not rely on speech recognition, the utterance level may also be acquired using a test that measures general language proficiency.
[0025]
Next, in step S202, the language to be the target of speech recognition is selected based on the utterance language information. If the user's native language is English, Japanese, German, French, or Italian, speech recognition is simply performed in that native language. If the user's native language is any other language, however, a language that places as little burden as possible on the user must be selected. That is, it is necessary to acquire information on which of the languages available for speech recognition the user can speak. If this information can be acquired via the utterance level selection screen shown in FIG. 5, the language for which the highest utterance level was set is set as the target language for speech recognition. If several languages share the highest utterance level, the language can be determined by, for example, presenting those languages to the user and having the user choose, or automatically selecting the language for which the speech recognition module has the higher recognition rate for unspecified speakers.
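The selection logic of step S202 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the recognition-rate figures used for automatic tie-breaking are hypothetical, while the list of recognition languages and the 1-to-5 utterance levels follow the text.

```python
RECOGNITION_LANGUAGES = ["English", "Japanese", "German", "French", "Italian"]


def best_candidates(native_language, utterance_levels):
    """Return the candidate recognition languages for a user.

    utterance_levels maps language -> self-reported level (1 low .. 5 high).
    If the native language is supported it is the only candidate; otherwise
    every supported language sharing the highest level is returned.
    """
    if native_language in RECOGNITION_LANGUAGES:
        return [native_language]
    supported = {
        lang: level
        for lang, level in utterance_levels.items()
        if lang in RECOGNITION_LANGUAGES
    }
    if not supported:
        return []
    top = max(supported.values())
    return sorted(lang for lang, level in supported.items() if level == top)


def break_tie(candidates, recognition_rates):
    """Pick the candidate with the highest speaker-independent recognition rate.

    recognition_rates is a hypothetical mapping language -> rate; the source
    gives no actual figures.
    """
    return max(candidates, key=lambda lang: recognition_rates.get(lang, 0.0))


if __name__ == "__main__":
    levels = {"English": 4, "German": 4, "French": 2}
    tied = best_candidates("Dutch", levels)
    print(tied)
    print(break_tie(tied, {"English": 0.95, "German": 0.90}))
```

Returning the tied candidates rather than a single language keeps both options from the text open: the caller can either show the list to the user or resolve the tie automatically.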
[0026]
In step S201 above, both the user's native language and the information on the languages the user can speak and their levels were acquired as utterance language information, via the native language selection screen shown in FIG. 4 and the utterance level selection screen shown in FIG. 5, respectively; however, it is sufficient to acquire at least one of the two. For example, if information on the native language cannot be obtained via the native language selection screen of FIG. 4, the acoustic model and language model of the corresponding native language cannot be selected directly, but the operating conditions and language for speech recognition can still be set based on the information obtained via the utterance level selection screen of FIG. 5. Conversely, if information on the languages the user can speak and their levels cannot be obtained via the utterance level selection screen of FIG. 5, the operating conditions and language for speech recognition can still be set based on the native language obtained via the native language selection screen of FIG. 4. This is done by determining in advance, for a typical speaker of that native language, the languages he or she can speak and their levels, expressed as familiarity (for example, on a 10-point scale, with 10 being the highest familiarity) in a form such as that shown in FIG. 6.
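The fallback described here can be sketched as a lookup that prefers the user's own utterance levels and otherwise uses a precomputed table keyed by native language. The table values below are hypothetical placeholders, since the contents of FIG. 6 are not reproduced in the text; the doubling of the 1-to-5 level into a 10-point familiarity follows the definition given for step S203.

```python
# Hypothetical stand-in for the per-native-language table of FIG. 6.
# The source does not state these values; they are for illustration only.
TYPICAL_FAMILIARITY = {
    "Dutch": {"English": 8, "German": 7, "French": 4},
    "Chinese": {"English": 6, "Japanese": 4},
}


def familiarity_for(native_language, utterance_levels=None):
    """Return a mapping language -> familiarity (10-point scale).

    If the user supplied utterance levels (1..5), they are doubled to give
    the familiarity. Otherwise the typical values for speakers of the given
    native language are used; an unknown native language yields no entries.
    """
    if utterance_levels:
        return {lang: 2 * level for lang, level in utterance_levels.items()}
    return dict(TYPICAL_FAMILIARITY.get(native_language, {}))


if __name__ == "__main__":
    print(familiarity_for("Dutch"))                  # table fallback
    print(familiarity_for("Dutch", {"English": 4}))  # user-supplied levels
```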
[0027]
Next, in step S203, the operating conditions for speech recognition are set. Here, the utterance level for the language used by the user, obtained via the utterance level selection screen shown in FIG. 5, is treated as the "familiarity" with that language, and the operating conditions for speech recognition are changed according to the familiarity value. For example, familiarity can be defined as twice the utterance level (1 to 5) of FIG. 5. The operating conditions for speech recognition corresponding to a given familiarity are determined by referring to an operating condition table that describes the correspondence between familiarity and operating conditions. This operating condition table has, for example, the structure shown in FIG. 7 and is stored on the hard disk 104 together with the speech processing program. Operating conditions for speech recognition include those related to search processing, such as search conditions; those related to result output, such as the number of recognition candidates; those related to acoustic models, such as the type of acoustic model; and those related to linguistic models, such as the recognition vocabulary and grammar.
[0028]
The operating condition table shown in FIG. 7 specifies, as the search condition, the beam width used in beam search; the value of N for N-best output (sentence hypotheses up to rank N are generated); the choice between native and non-native for the acoustic model; the choice between native and non-native for the language model; and so on. Referring to FIG. 7, when the familiarity is 6, for example, the operating conditions for speech recognition are set to a beam width of 200, an N-best output count of 5, the non-native acoustic model, and the non-native recognition grammar. Thus, in step S203, native or non-native versions of the acoustic model, language model, and word dictionary of the language selected in step S202 are chosen according to the familiarity.
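The table lookup of step S203 can be sketched as below. Only the row for familiarity 6 (beam width 200, N-best 5, non-native acoustic model and grammar) is stated in the text; every other row, along with the class and function names, is a hypothetical placeholder added so that the example runs for all five utterance levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionConditions:
    beam_width: int
    n_best: int
    acoustic_model: str   # "native" or "non-native"
    language_model: str   # "native" or "non-native"


# Keyed by familiarity. Only the entry for 6 comes from the text.
OPERATING_CONDITIONS = {
    2: RecognitionConditions(400, 10, "non-native", "non-native"),  # hypothetical
    4: RecognitionConditions(300, 10, "non-native", "non-native"),  # hypothetical
    6: RecognitionConditions(200, 5, "non-native", "non-native"),   # from the text
    8: RecognitionConditions(150, 3, "native", "non-native"),       # hypothetical
    10: RecognitionConditions(100, 1, "native", "native"),          # hypothetical
}


def familiarity_from_level(level):
    """Familiarity defined as twice the utterance level (1..5)."""
    if not 1 <= level <= 5:
        raise ValueError("utterance level must be between 1 and 5")
    return 2 * level


def conditions_for_level(level):
    """Look up the recognition operating conditions for an utterance level."""
    return OPERATING_CONDITIONS[familiarity_from_level(level)]


if __name__ == "__main__":
    print(conditions_for_level(3))
```

Keeping the conditions in a table rather than in code means the trade-off between search effort and accuracy for less fluent speakers can be tuned without changing the program, which matches the text's description of a table stored alongside the speech processing program.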
[0029]
In FIG. 2, non-native models exist for all of the acoustic model, the language model, and the word dictionary; however, a configuration may include non-native models for only some of these, or may include no non-native models at all. Moreover, there need not be just one non-native model: a configuration may use multiple models corresponding to the degree of non-nativeness or to the type or category of the native language. The native acoustic model, language model, and word dictionary may be created by commonly used methods. A non-native acoustic model can be created using a database of speech uttered by non-native speakers. A non-native language model can be created either as a simple grammar composed of plain words and short sentences or as a statistical language model, such as an N-gram model, built from plain words and short sentences. A non-native word dictionary can be created as a pronunciation dictionary of pronunciations that non-native speakers find easy to produce.
[0030]
In this way, the operating conditions for speech recognition are set according to the user's spoken language and utterance level. Consequently, when a non-native language is to be recognized, speech recognition is set to run with the non-native acoustic model, language model, and word dictionary, so recognition failures caused by the user's inability to produce native-like pronunciation, as occurred conventionally, become less frequent.
[0031]
(Second Embodiment)
The second embodiment relates to speech synthesis processing. The configuration of the speech processing apparatus according to this embodiment is the same as that shown in FIG. 1, but the speech processing program in this embodiment has a speech synthesis module 10 configured as shown in FIG. 8 in place of the speech recognition module of the first embodiment.
[0032]
Speech synthesis in this embodiment is multilingual and supports, for example, three languages: Japanese, English, and Chinese. Accordingly, as shown in FIG. 8, the speech synthesis module 10 includes synthesis modules 11, 12, and 13 for the respective languages, and the synthesis module corresponding to the language selected by the method described later is executed. The Japanese synthesis module 11 has, for example, the configuration shown in the figure. The text analysis module 14 performs syntactic analysis (specifically, morphological analysis) of the input Japanese text using the language dictionary 14a. As shown in the figure, the language dictionary 14a preferably includes separate versions for native and non-native speakers. The language processing module 15 includes a phoneme processing module 15a and a prosody processing module 15b. The phoneme processing module 15a outputs a phoneme symbol string based on the result of the text analysis, and the prosody processing module 15b outputs prosodic information such as pauses, accent, intonation, and duration. The acoustic processing module 16 generates synthesized speech using the waveform dictionary 17, based on the input phoneme symbol string and prosodic information. The waveform dictionary also includes separate versions for native speakers and non-native speakers. The English synthesis module 12 and the Chinese synthesis module 13 have the same configuration as the Japanese synthesis module 11, so their illustration and description are omitted.
[0033]
FIG. 9 is a flowchart showing processing for setting the operating conditions for speech synthesis in the present embodiment.
[0034]
First, in step S301, listening language information is acquired on the user's native language, the languages the user can understand by ear (listening comprehension), their levels, and so on. The user's native language is acquired by displaying on the operation display unit 106 a native language selection screen similar to that of FIG. 4 described in the first embodiment and having the user make a selection. Information on the languages the user can understand by ear and their levels is likewise acquired by displaying on the operation display unit 106 an utterance level selection screen similar to that of FIG. 5 described in the first embodiment and having the user make a selection.
[0035]
Alternatively, listening language information for each user may be stored in advance in a file or the like, and the listening language information may be obtained from that stored information. In this case, the user preferably identifies himself or herself by an operation such as entering a user ID.
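The stored-information variant described above can be illustrated with a minimal sketch. The file layout, the field names `native` and `listening`, the user IDs, and the integer level scale are all assumptions made for illustration; the source only states that listening language information is stored per user and retrieved by an operation such as entering a user ID.

```python
import json

# Hypothetical per-user store. The structure and the level values are
# assumptions for illustration, not taken from the source.
PROFILE_JSON = """
{
  "user001": {"native": "German", "listening": {"English": 7, "Japanese": 2}},
  "user002": {"native": "Japanese", "listening": {"English": 4}}
}
"""


def load_listening_info(user_id, raw=PROFILE_JSON):
    """Return the stored listening language information for user_id.

    Returns None when no information is stored for that user, in which
    case the selection screens would have to be shown instead.
    """
    profiles = json.loads(raw)
    return profiles.get(user_id)


info = load_listening_info("user001")
```

In a real apparatus the JSON text would be read from a file on the storage device rather than from an in-memory string.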
[0036]
Information about the user's listening level in a language may also be acquired based on the user's response, or response time, to synthesized speech of predetermined content.
[0037]
Next, in step S302, the language of the audio output is selected based on the listening language information. If the user's native language is English, Japanese, or Chinese, speech synthesis using the native language may be performed. However, when the user's native language is other than these three languages, it is necessary to select a language with as little burden on the user as possible. That is, it is necessary to acquire information as to which language is easy for a user to understand among languages that can be output. If this information can be acquired via the utterance level selection screen, the language for voice output is set using the information.
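The selection in step S302 can be sketched roughly as follows. The function name, the dictionary-based representation of listening levels, and the rule of returning `None` when nothing usable is known are assumptions for illustration; the three supported languages are those named in this embodiment.

```python
# Languages supported by speech synthesis in this embodiment.
SYNTHESIS_LANGUAGES = ("Japanese", "English", "Chinese")


def select_output_language(native, listening_levels):
    """Pick the speech output language (hypothetical helper).

    native: the user's native language, or None if unknown.
    listening_levels: mapping of language name to listening level,
        where a larger number means the user understands it better
        (the numeric scale is an assumption).
    """
    # If the native language can be synthesized, use it.
    if native in SYNTHESIS_LANGUAGES:
        return native
    # Otherwise choose the supported language that burdens the user least,
    # i.e. the one with the highest listening level.
    candidates = {
        lang: level
        for lang, level in listening_levels.items()
        if lang in SYNTHESIS_LANGUAGES
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

For example, a German speaker reporting levels `{"English": 7, "Japanese": 2, "French": 9}` would be assigned English, because French is not among the languages that can be output.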
[0038]
In step S301, both the user's native language and the languages the user can understand by listening, together with their levels, are acquired as listening language information via the native language selection screen and the utterance level selection screen. However, it is sufficient that at least one of these can be acquired. For example, if information about the native language cannot be obtained via the native language selection screen of FIG. 4, the synthesis module for the native language cannot be specified, but the language and operating conditions for speech synthesis can still be set based on the information obtained via the utterance level selection screen of FIG. 5. Conversely, when the languages the user can understand and their levels cannot be obtained via the utterance level selection screen of FIG. 5, the language and operating conditions for speech synthesis can be set based on the native language information obtained via the native language selection screen of FIG. 4, by holding in advance, as familiarity values, the languages and levels that a typical speaker of each native language can understand.
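The fallback just described, in which only one of the two kinds of information is available, might look like the following sketch. The table `DEFAULT_FAMILIARITY` and every value in it are invented placeholders and do not reproduce the figures; the function name is likewise hypothetical.

```python
# Hypothetical default familiarity of a typical speaker of each native
# language with the synthesizable languages. All values are placeholders.
DEFAULT_FAMILIARITY = {
    "German": {"English": 6, "Japanese": 1, "Chinese": 1},
    "Dutch": {"English": 7, "Japanese": 1, "Chinese": 1},
}


def resolve_familiarity(native=None, listening_levels=None):
    """Return the familiarity values to use for language selection.

    Levels reported by the user take priority. If none were obtained,
    fall back to the defaults held for the user's native language.
    """
    if listening_levels:
        return dict(listening_levels)
    if native is not None:
        return dict(DEFAULT_FAMILIARITY.get(native, {}))
    raise ValueError("need the native language or the listening levels")
```

The result can then be used for language selection and for looking up the operating conditions.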
[0039]
Next, in step S303, the speech synthesis operating conditions are set. Here, the listening level for the language, which the user selected via a screen similar to the utterance level selection screen of FIG. 5, is regarded as the "familiarity" with that language, and the speech synthesis operating conditions are changed according to the familiarity value. The listening level and the familiarity can be defined in the same manner as described in the first embodiment.
[0040]
The speech synthesis operating conditions corresponding to the familiarity are determined by referring to an operation condition table that describes the correspondence between familiarity and operating conditions. This table has, for example, the structure shown in FIG. 11, and is stored on the hard disk 104 together with the speech processing program. The speech synthesis operating conditions relate to prosodic factors such as the speech rate, output factors such as the volume, phonological factors such as the type of waveform dictionary, and linguistic factors such as the content of response sentences.
[0041]
The operation condition table shown in FIG. 11 specifies the speech rate in prosodic control as a prosodic factor, the volume as an output factor, the selection of a native or non-native waveform dictionary as a phonological factor, and the content of the response sentence and the number of repetitions as linguistic factors. Referring to FIG. 11, when the familiarity is 6, for example, the speech synthesis operating conditions are set such that the speech rate is slow, the volume is 7 (on a 10-step scale where 0 is the minimum volume and 9 is the maximum), the non-native waveform dictionary is used, and a polite response sentence is used. Accordingly, in step S303, the speech synthesis module for the language selected in step S302 is selected, and either the native or the non-native waveform dictionary is selected according to the familiarity.
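A table lookup of this kind can be sketched as below. Only the familiarity-6 row uses values stated in the text (slow speech rate, volume 7, non-native waveform dictionary, polite response sentence); its repetition count, all other rows, and the rule of using the nearest row at or below the given familiarity are assumptions for illustration and do not reproduce FIG. 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisConditions:
    speech_rate: str          # prosodic factor
    volume: int               # output factor, 0 (minimum) to 9 (maximum)
    waveform_dictionary: str  # phonological factor: "native" or "non-native"
    response_style: str       # linguistic factor
    repetitions: int          # linguistic factor


# Keys are familiarity values. Rows 0 and 9, and every repetition count,
# are placeholders; row 6 follows the example given in the text.
OPERATION_CONDITION_TABLE = {
    0: SynthesisConditions("slow", 8, "non-native", "polite", 2),
    6: SynthesisConditions("slow", 7, "non-native", "polite", 1),
    9: SynthesisConditions("normal", 5, "native", "plain", 1),
}


def conditions_for(familiarity):
    """Return the row whose key is the largest one not exceeding familiarity.

    This nearest-lower-row rule is an assumption; the actual table may
    simply list one row per familiarity value.
    """
    eligible = [key for key in OPERATION_CONDITION_TABLE if key <= familiarity]
    if not eligible:
        raise ValueError("familiarity below the lowest table entry")
    return OPERATION_CONDITION_TABLE[max(eligible)]
```

Keeping the conditions in a table rather than in code means the behavior for each familiarity value can be tuned without changing the synthesis modules themselves.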
[0042]
In FIG. 8, the waveform dictionary includes a non-native waveform dictionary, but the non-native waveform dictionary may be omitted. Further, the number of non-native waveform dictionaries is not limited to one, and a plurality of waveform dictionaries may be used according to the degree of non-nativeness or the type or category of the native language. The native waveform dictionary may be created by a commonly used method. A non-native waveform dictionary can be created by using speech data in which non-native speakers utter their own native language. It can also be created by using speech data in which non-native speakers utter the non-native language, or by associating the phoneme system (phonetic system) of the non-native language with the phoneme system (phonetic system) of the native language.
[0043]
In this way, the operating conditions for speech synthesis are set according to the user's native language and the user's listening level in each language the user can understand. Consequently, when a language other than the user's native language is synthesized, speech synthesis is performed with phonological and prosodic information adjusted for non-native listeners, which helps the user understand the content of the speech output.
[0044]
(Third embodiment)
The first embodiment described a speech processing apparatus having a speech recognition function, and the second embodiment described a speech processing apparatus having a speech synthesis function. The first and second embodiments can also be combined to realize a speech processing apparatus having both a speech recognition function and a speech synthesis function.
[0045]
The configuration of the speech processing apparatus according to the third embodiment is the same as that shown in FIG. 1, but the speech processing program in this embodiment includes both the speech recognition module of the first embodiment and the speech synthesis module of the second embodiment.
[0046]
FIG. 12 is a flowchart showing a process for setting operation conditions for speech recognition and speech synthesis in the present embodiment.
[0047]
The procedure is the same as the processing in the first and second embodiments, except that in step S401 the information on the utterance language and the information on the listening language are acquired at the same time. The language for speech recognition and the language for speech synthesis can be determined independently in steps S402 and S404 according to the methods described in the first and second embodiments, respectively. However, some users may find it unnatural to use different languages for speech recognition and speech synthesis, so it is also possible to use, for both, the single language that maximizes the sum of the familiarity for speech recognition and the familiarity for speech synthesis.
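The option of using one language for both directions can be sketched as follows. The function name, the mapping-based inputs, and the alphabetical tie-break are assumptions for illustration; the source only states that the language maximizing the sum of the two familiarity values may be applied to both speech recognition and speech synthesis.

```python
def select_common_language(recognition_familiarity, synthesis_familiarity):
    """Pick one language for both recognition and synthesis.

    Each argument maps a language supported by that function to the
    user's familiarity with it (utterance level for recognition,
    listening level for synthesis). Only languages present in both
    mappings are candidates. Ties are broken alphabetically, which is
    an arbitrary choice made for this sketch.
    """
    common = set(recognition_familiarity) & set(synthesis_familiarity)
    if not common:
        return None
    return max(
        sorted(common),
        key=lambda lang: recognition_familiarity[lang] + synthesis_familiarity[lang],
    )


# Hypothetical familiarity values for one user.
recognition = {"English": 5, "Japanese": 8, "German": 9}
synthesis = {"English": 7, "Japanese": 3, "Chinese": 6}
chosen = select_common_language(recognition, synthesis)
```

With these hypothetical values, English (5 + 7 = 12) is preferred over Japanese (8 + 3 = 11), while German and Chinese are excluded because each is supported in only one direction.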
[0048]
(Other embodiments)
The embodiment of the present invention has been described in detail above. However, the present invention can take an embodiment as a system, apparatus, method, program, storage medium, or the like. In addition, the present invention may be applied to a system composed of a plurality of devices, or may be applied to an apparatus composed of a single device.
[0049]
The present invention also includes the case where it is achieved by supplying a software program that realizes the functions of the above-described embodiments, directly or remotely, to a system or apparatus, and having the computer of that system or apparatus read and execute the supplied program code. In that case, the form need not be a program as long as it has the functions of a program.
[0050]
Accordingly, since the functions of the present invention are implemented by computer, the program code installed in the computer also implements the present invention. That is, the scope of the claims of the present invention includes the computer program itself for realizing the functional processing of the present invention.
[0051]
In this case, the program may be in any form as long as it has a program function, such as an object code, a program executed by an interpreter, or script data supplied to the OS.
[0052]
Examples of recording media for supplying the program include a flexible disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory card, ROM, and DVD (DVD-ROM, DVD-R).
[0053]
As another method of supplying the program, a browser on a client computer may be used to connect to an Internet homepage, and the computer program of the present invention itself, or a compressed file including an automatic installation function, may be downloaded from the homepage to a recording medium such as a hard disk. The program can also be supplied by dividing the program code constituting the program of the present invention into a plurality of files and downloading each file from a different homepage. That is, a WWW server that allows a plurality of users to download the program files for realizing the functional processing of the present invention on a computer is also included in the claims of the present invention.
[0054]
In addition, the program of the present invention may be encrypted, stored in a storage medium such as a CD-ROM, and distributed to users. Users who satisfy predetermined conditions are then allowed to download key information for decryption from a homepage via the Internet, and by using that key information the encrypted program can be executed and installed on a computer.
[0055]
The functions of the above-described embodiments are realized not only when the computer executes the read program, but also when an OS or the like running on the computer performs part or all of the actual processing based on the instructions of the program, and the functions are realized by that processing.
[0056]
Furthermore, after the program read from the recording medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or in the function expansion unit may perform part or all of the actual processing, and the functions of the above-described embodiments are also realized by that processing.
[0057]
[Effect of the invention]
According to the present invention, when speech recognition or speech synthesis in a speech processing system does not support the user's native language, a language that the user is as comfortable with as possible is used, and speech recognition and speech synthesis are provided that take into account the fact that this language is not the user's native language.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a speech processing apparatus according to an embodiment.
FIG. 2 is a diagram illustrating a configuration of a speech recognition module in the embodiment.
FIG. 3 is a flowchart illustrating processing for setting operation conditions for speech recognition in the embodiment.
FIG. 4 is a diagram illustrating an example of a native language selection screen in the embodiment.
FIG. 5 is a diagram showing an example of an utterance level selection screen in the embodiment.
FIG. 6 is a diagram illustrating an example of familiarity with another language for each native language.
FIG. 7 is a diagram illustrating an example of an operation condition table for speech recognition in the embodiment.
FIG. 8 is a diagram illustrating a configuration of a speech synthesis module in the embodiment.
FIG. 9 is a flowchart illustrating processing for setting operation conditions for speech synthesis in the embodiment.
FIG. 10 is a diagram showing an example of familiarity with another language for each native language.
FIG. 11 is a diagram illustrating an example of an operation condition table for speech synthesis in the embodiment.
FIG. 12 is a flowchart illustrating processing for setting operation conditions for speech recognition and speech synthesis in the embodiment.

Claims

A speech processing apparatus that performs speech recognition of a language selected from a plurality of languages,
An acquisition means for acquiring information about the language ability of the user;
Selection means for selecting a language to be subjected to speech recognition from the plurality of languages based on the acquired information on the language ability;
Setting means for setting operation conditions for speech recognition based on the information on the language ability and the language to be recognized;
A speech processing apparatus comprising:

The speech processing apparatus according to claim 1, wherein the information regarding the language ability includes native language information and utterance level information of at least one of the plurality of languages.

The speech processing apparatus, wherein the selection means selects a language other than the native language based on the utterance level information when the user's native language is not included in the plurality of languages.

The speech processing apparatus according to claim 1, wherein the operation condition includes at least one of a search condition for a recognition candidate, an output condition for a recognition result, an acoustic model selection, a language model selection, and a word dictionary selection.

A method of performing speech recognition by selecting one of a plurality of languages,
Obtaining information about the user's language skills;
Selecting a language to be subjected to speech recognition from the plurality of languages based on the acquired information on the language ability;
Setting speech recognition operating conditions based on the information about the language ability and the language to be recognized;
A method characterized by comprising:

A program for performing speech recognition by selecting one of a plurality of languages, the program causing execution of:
Obtaining information about the user's language ability;
Selecting a language to be subjected to speech recognition from the plurality of languages based on the acquired information about the language ability;
Setting operating conditions for speech recognition based on the information about the language ability and the language to be recognized.

A speech processing apparatus that performs speech synthesis of a language selected from a plurality of languages,
An acquisition means for acquiring information about the language ability of the user;
Selection means for selecting a language for speech synthesis from the plurality of languages based on the acquired information on language ability;
Setting means for setting operation conditions for speech synthesis based on the information on the language ability and the language for speech synthesis;
A speech processing apparatus comprising:

8. The speech processing apparatus according to claim 7, wherein the information on the language ability includes native language information and listening level information of at least one of the plurality of languages.

9. The speech processing apparatus according to claim 8, wherein, when the user's native language is not included in the plurality of languages, the selection means selects a language other than the native language based on the listening level information.

The speech processing apparatus according to claim 7, wherein the operation condition includes at least one of an utterance speed, a volume, a waveform dictionary selection, and a response sentence content.

A method for performing speech synthesis by selecting one of a plurality of languages,
Obtaining information about the user's language skills;
Selecting a language for speech synthesis from the plurality of languages based on the acquired information about the language ability;
Setting operating conditions for speech synthesis based on the information about the language ability and the language for speech synthesis;
A method characterized by comprising:

A program for performing speech synthesis by selecting one of a plurality of languages, the program causing execution of:
Obtaining information about the user's language ability;
Selecting a language for speech synthesis from the plurality of languages based on the acquired information about the language ability;
Setting operating conditions for speech synthesis based on the information about the language ability and the language for speech synthesis.