JP3741536B2

JP3741536B2 - Educational equipment

Info

Publication number: JP3741536B2
Application number: JP12177198A
Authority: JP
Inventors: エフカレンジョン; 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-04-15
Filing date: 1998-04-15
Publication date: 2006-02-01
Anticipated expiration: 2018-04-15
Also published as: JPH11296060A

Description

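[0001]
[Technical Field of the Invention]
The present invention relates to an educational device that uses voice recognition.
[0002]
[Prior Art]
In recent years, several techniques have been proposed for language-learning devices that use speech. For example, Japanese Patent Laid-Open No. 63-303400 discloses a technique in which a model voice (a model utterance in the language to be learned, for example English) is stored in a card-shaped recording device, and the learner practices speaking by listening to it and repeating it. With this technique the learner can listen to the model voice effectively, but a pronunciation error cannot be corrected unless the learner notices it.
[0003]
Japanese Patent Laid-Open No. 59-220775, on the other hand, discloses a technique in which a model voice is recorded on magnetic tape, the user's voice is compared with it, the device judges mechanically whether the two are similar, and the result is reported to the speaker. Japanese Patent Laid-Open No. 60-162281 discloses a technique in which the voice of a model speaker and the input voice of a practitioner are stored and acoustically analyzed, the features of the practitioner's input voice are evaluated against those of the model voice, and the analysis and evaluation results are shown on a display device. The practitioner listens to his or her own pronunciation, looks at the analysis results for the model voice and for his or her own voice and at the evaluation result shown on the display device 4, confirms how the voice features differ, and corrects his or her pronunciation.
[0004]
[Problems to be Solved by the Invention]
The techniques of Japanese Patent Laid-Open Nos. 59-220775 and 60-162281 compare the model voice and the learner's voice in terms of amplitude, pitch, and formants. The sounds can therefore be compared, but an ordinary person finds it hard to understand what to correct in order to come closer to the model voice.
[0005]
In addition, the techniques of Japanese Patent Laid-Open Nos. 59-220775 and 60-162281 only output the model voice (a model utterance in the language to be learned, for example English). Even if the user manages to acquire the correct pronunciation, the user cannot immediately find out what the words mean. The basis of language learning is the acquisition of words, and to learn a word with its correct pronunciation it is very important not only that the teaching material present the correct pronunciation but also that the learner understand the meaning. In Japan, word cards are widely used for this purpose: Japanese is written on the front and the foreign language on the back, and the learner looks at the Japanese on the front and recalls the foreign (for example English) word on the back. The techniques of Japanese Patent Laid-Open Nos. 59-220775 and 60-162281 cannot be used in the manner of such word cards.
[0006]
An object of the present invention is to provide an educational device that presents a model voice to a practitioner (user), lets the practitioner judge the correctness of his or her own pronunciation, and lets the practitioner acquire both the meaning and the pronunciation of words.
[0007]
Another object of the present invention is to provide an educational device that shows model characters or pictures and has the practitioner answer in spoken words, thereby letting the practitioner acquire both the meaning and the pronunciation of words.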
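[0008]
[Means for Solving the Problems]
To achieve the above objects, the invention according to claim 1 comprises: record holding means in which a model voice and presentation information related to the model voice are recorded;
presentation means for presenting the presentation information recorded in the record holding means;
voice reproduction means for reproducing the model voice recorded in the record holding means;
voice input means for receiving an uttered voice;
voice recognition means for comparing the voice input through the voice input means with the model voice recorded in the voice recording means, and recognizing whether or not the voice input through the voice input means is similar to the model voice; and
registration means for, when the voice recognition means recognizes that the voice input through the voice input means is similar to the model voice, recording the voice input through the voice input means, or a feature pattern of that voice, in the record holding means as judgment information in association with the model voice,
wherein the voice input means further receives a voice uttered on the basis of the presentation information presented by the presentation means, and
the voice recognition means further judges whether or not the uttered voice, or a feature pattern of that voice, is similar to the judgment information stored in advance, and thereby judges whether the uttered voice is correct or incorrect.
These are the characterizing features of the invention.
[0009]
The invention according to claim 2 is the educational device according to claim 1, characterized in that, when the voice recognition means recognizes that the voice input through the voice input means is not similar to the model voice, the voice reproduction means reproduces the model voice again.
[0010]
The invention according to claim 3 is the educational device according to claim 1 or claim 2, characterized in that, when the voice recognition means that judges the correctness of the uttered voice judges it to be incorrect, the voice corresponding to the presentation information is read from the record holding means and reproduced by the voice reproduction means.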
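[0017]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a configuration example of an educational device according to the present invention. Referring to FIG. 1, the educational device has a control unit 1 that controls the whole device, a voice input unit 2 (for example, a microphone) that inputs voice, a voice record holding unit 3 (for example, a memory) that records and holds voice, a voice reproduction unit 4 (for example, a speaker) that reproduces the voice recorded and held in the voice record holding unit 3, and an instruction unit 5 (for example, a keyboard) that designates one of the voices recorded in the voice record holding unit 3. The control unit 1 is provided with voice recognition means 10, in which voice patterns for recognition (standard patterns) can be registered and which can recognize voice, and with comparison means 20. When a standard pattern for recognition is registered, the control unit 1 has the voice reproduction unit 4 reproduce a voice recorded and held in the voice record holding unit 3; when the user utters a voice close to the reproduced voice, the control unit 1 generates a standard pattern for recognition from the user's utterance (by extracting the feature pattern of that voice) and registers it in the voice recognition means 10. At recognition time, the control unit 1 has the user designate the voice to be uttered through the instruction unit 5, has the user utter that voice, has the voice recognition means 10 recognize the utterance, has the comparison means 20 judge whether or not the recognition result is associated with the voice designated through the instruction unit 5, and presents the result to the user.
[0018]
More specifically, the voice record holding unit 3 stores, for example, a control program for operating the educational device, pronunciation voice data for foreign-language words (for example, model English voice data), and the Japanese meanings of the foreign-language words.
[0019]
Any conventionally known method can be used as the voice recognition method of the voice recognition means 10. For example, a method such as those described in Furui, "Digital Speech Processing" (Tokai University Press, 1985) can be used.
[0020]
FIG. 2 shows a configuration example of the voice recognition means 10. In the example of FIG. 2, the voice recognition means 10 has a feature extraction unit 11, a speaker-dependent standard pattern registration unit 12, a comparison unit 13, and a result output unit 14.
[0021]
When the voice recognition means 10 is configured as in FIG. 2, a standard pattern is registered as follows: the feature extraction unit 11 converts the user's (the specific speaker's) input voice for a predetermined word into a feature quantity (feature pattern), and the extracted feature pattern is stored in the speaker-dependent standard pattern registration unit 12 as a standard pattern. At recognition time, the feature extraction unit 11 converts the user's input voice for an unknown word into a feature pattern, the comparison unit 13 calculates the similarity between that feature pattern and each of the standard patterns of the words registered in advance in the speaker-dependent standard pattern registration unit 12, and the result output unit 14 outputs, as the recognition result, the word whose standard pattern gave the highest similarity.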
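The registration and recognition steps of FIG. 2 can be sketched as a small template matcher. This is only an illustration under stated assumptions: the description leaves the recognition method open, so the class name `SpeakerDependentRecognizer`, the fixed-length feature vectors, and the choice of cosine similarity as the similarity measure are all hypothetical and are not taken from the patent or from the cited reference.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerDependentRecognizer:
    """Stands in for units 12 (registration), 13 (comparison) and 14 (output)."""

    def __init__(self):
        self.templates = {}  # word name -> standard pattern

    def register(self, word, feature_pattern):
        # Registration: store the user's feature pattern as the standard pattern.
        self.templates[word] = list(feature_pattern)

    def recognize(self, feature_pattern):
        # Recognition: return the word whose standard pattern is most similar.
        if not self.templates:
            raise ValueError("no standard patterns registered")
        return max(
            self.templates,
            key=lambda w: cosine_similarity(feature_pattern, self.templates[w]),
        )


# Toy feature patterns; a real feature extraction unit 11 is out of scope here.
recognizer = SpeakerDependentRecognizer()
recognizer.register("red", [0.9, 0.1, 0.0])
recognizer.register("blue", [0.1, 0.8, 0.3])
result = recognizer.recognize([0.8, 0.2, 0.1])
```

In this sketch the comparison means 20 would then simply check whether `result` equals the word the user was asked to say.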
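[0022]
Next, an example of the processing operation of an educational device configured in this way is described. First, the case of registering standard patterns for speaker-dependent voice recognition in the voice recognition means 10 is described. When the user designates (inputs) a word (word name), for example in Japanese, through the instruction unit 5, the control unit 1 presents that Japanese word to the user in the manner described above. The Japanese may be presented, for example, by further providing a display unit 6 as shown in FIG. 1 and displaying the Japanese characters on it, or by storing the spoken Japanese word in advance as waveform data in the voice record holding unit (memory) 3 and reproducing it from the voice reproduction unit (speaker) 4. When the Japanese is presented by voice reproduction, the display unit 6 need not be provided.
[0023]
After presenting the Japanese to the user in this way, the control unit 1 reads the pronunciation voice data of the foreign-language word corresponding to the Japanese word from the voice record holding unit 3 and reproduces it. The user can thus hear the pronunciation of the corresponding foreign-language word through the speaker 4. Suppose, for example, that five Japanese words and their corresponding foreign-language (English) words are registered in the voice record holding unit 3, and that the five words are 赤 (red), 青 (blue), 緑 (green), 白 (white), and 黒 (black). When the user types 赤 on the keyboard 5, the pronunciation "red" is output from the speaker 4. The user imitates it and says "red" into the microphone 2. The user may repeat this utterance several times if necessary.
[0024]
The voice input from the microphone 2 (for example, "red") is A/D converted and supplied to the control unit 1. The control unit 1 can pass one branch of the supplied voice to the voice recognition means 10 and store the other branch, as a voice waveform for playback, in the voice record holding unit (memory) 3. In this example the voice signal is branched and input to both the voice recognition means 10 and the memory 3, but this is not essential. For example, the voice signal from the microphone 2 may be stored in the memory 3 as a waveform and input to the voice recognition means 10 when needed. Alternatively, when the user's voice does not need to be played back, the signal may be supplied only to the voice recognition means 10, converted into a feature pattern for voice recognition, and stored in that form, which reduces the amount of memory used.
[0025]
When the voice uttered by the user is input to the voice recognition means 10 in this way, the voice recognition means 10 can use the input voice (for example, "red") to register a standard pattern for speaker-dependent voice recognition. That is, the feature pattern of the input voice can be registered in the speaker-dependent standard pattern registration unit 12 as the recognition standard pattern of this user (the specific speaker).
[0026]
These operations are repeated in turn so that the user practices the pronunciation of all five words; standard patterns for this speaker are created from the English utterances of the five words and registered in the voice recognition means 10. When pronunciation practice for all five words is finished, the user can select the test mode through the instruction unit (keyboard) 5 and designate a word name at that time. When the test mode is selected and a word name is designated, the control unit 1 presents the designated Japanese word stored in the voice record holding unit (memory) 3 to the user, for example by reproducing it as voice. When this reproduction ends, the voice recognition means 10 enters a state of waiting to recognize an unknown input voice.
[0027]
Next, recognition processing for an unknown input voice, that is, actual voice recognition, is described. First, when the user designates (inputs) a word (word name), for example in Japanese, through the instruction unit 5, the control unit 1 presents that Japanese word to the user in the manner described above. The Japanese may be presented, for example, by further providing a display unit 6 as shown in FIG. 1 and displaying the Japanese characters on it, or by storing the spoken Japanese word in advance as waveform data in the voice record holding unit (memory) 3 and reproducing it from the voice reproduction unit (speaker) 4. When the Japanese is presented by voice reproduction, the display unit 6 need not be provided.
[0028]
After the Japanese has been presented in this way, the user can try to say in English the word presented in Japanese from the voice reproduction unit 4. When the user says the word in English, the utterance is taken into the control unit 1, which passes it to the voice recognition means 10 for recognition. The voice recognition means 10 obtains the feature pattern of the input English voice, matches it against the standard patterns of the five (English) words registered in advance, takes the word with the most similar standard pattern as the recognition result, and outputs this result from the result output unit 14. At this stage, the control unit 1 has the comparison means 20 compare the previously designated word name with the recognition result; if they are the same the answer is judged correct, and if they differ it is judged incorrect. The judgment is reported to the user, for example by showing it on the display unit 6. The words may be presented in registration order, in reverse registration order, or in random order.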
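[0029]
FIGS. 3 and 4 are flowcharts illustrating a specific example of the processing operation of the educational device of FIG. 1. In this example, the user first enters the name of a file in which personal information is recorded (step S1). The file name may be typed by the user on the keyboard 5, or the device itself may automatically read a name stored in advance in its memory and generate it. Once the file name has been entered, the device judges whether or not it names a new file (step S2). A new file means that the device is being used for the first time, so the user must practice pronunciation and standard patterns for voice recognition must be created. To do so, the word counter WCNT is first initialized to "1" (step S3), and a Japanese word name is entered (step S4). As with the file name, the Japanese word name may be typed by the user on the keyboard 5, or the device itself may automatically read a word name stored in advance in its memory and generate it.
[0030]
Next, the English word name corresponding to the Japanese word name is entered (step S5). For example, if 赤 was entered as the Japanese word name in step S4, "red" is entered in step S5. The English may likewise be entered by hand or automatically by the device.
[0031]
When the Japanese and the corresponding English have been entered, they are shown, for example, on the display unit 6 (step S6), for example as 赤, "red". The educational device then outputs the model pronunciation and has the user utter (repeat) it (step S7). That is, the device outputs "red" as the model voice, and the user says "red", imitating the model voice as closely as possible.
[0032]
The user's utterance is taken into the voice recognition means 10, which registers the feature pattern of the uttered word as the standard pattern of that word, for example in the file (step S8). The word counter WCNT is then incremented by "1" (step S9), and the device judges whether the counter value WCNT has exceeded the total number of words n (step S10). While n has not been exceeded, steps S4 to S9 are repeated; when the loop ends, the file is saved.
[0033]
Next, the user is asked to choose between pronunciation practice only and pronunciation practice followed by word memorization training (step S11). If the user only wants to listen to the model voice and practice pronunciation, processing ends here; if the user wants both pronunciation practice and word memorization training, the device enters the test routine (that is, it proceeds to step S15).
[0034]
If, in step S2, the entered file name already exists and it is judged that the user has already practiced, the mutually associated English vocabulary and Japanese vocabulary are loaded (steps S12 and S13), and the standard patterns (templates) for voice recognition created in step S8 are loaded (step S14). Here, a vocabulary is a set of words.
[0035]
The device then enters the test routine starting at step S15. In the test routine, the word pointer is first set to the first word position (step S15), and the word counter WCNT is initialized to "1" (step S16). The Japanese word at that word position is then presented, for example by showing it on the display unit 6 (step S17). The Japanese word need not be displayed as characters; it may be output as recorded voice.
[0036]
When the Japanese word has been presented, the user can utter the corresponding English word (step S18). When the user utters the English word and the voice is input, the voice recognition means 10 recognizes it (step S19). That is, it extracts the feature pattern of the uttered English word and matches it against the standard pattern of each word registered in step S8. From the recognition result, the device judges whether or not the feature pattern of the user's utterance has the voice features (standard pattern) of the correct English word (step S20). If it does not (that is, if the utterance was rejected or misrecognized), the device returns to step S17, displays the Japanese word again, has the user utter the corresponding English word again, and repeats the recognition processing of steps S17 to S20.
[0037]
If the recognition result in step S20 is correct, the word counter WCNT is incremented by "1" (step S21), and the device judges whether or not WCNT has exceeded the predetermined value n (step S22). If n has not yet been exceeded, the device returns to step S17, displays the next Japanese word, and repeats the test routine. When WCNT exceeds n in step S22, all processing ends.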
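The test routine of steps S15 to S22 can be sketched as the loop below. The function name `run_test_routine` and the two callbacks are hypothetical: `get_utterance` stands in for presenting the Japanese word and capturing the user's voice (S17, S18), and `recognize` stands in for the voice recognition means 10 (S19). Strings are used in place of real audio so that the control flow is easy to follow.

```python
def run_test_routine(vocabulary, get_utterance, recognize):
    """Quiz each (japanese, english) pair in order, repeating a word until
    the recognition result matches the expected English word."""
    tries_per_word = []
    n = len(vocabulary)
    wcnt = 1                                      # S16: initialise WCNT to 1
    while wcnt <= n:                              # S22: finish once WCNT exceeds n
        japanese, english = vocabulary[wcnt - 1]  # word at the current position
        tries = 0
        while True:
            tries += 1
            utterance = get_utterance(japanese)   # S17 present, S18 user speaks
            if recognize(utterance) == english:   # S19 recognise, S20 judge
                break                             # correct: leave the retry loop
        tries_per_word.append((japanese, tries))
        wcnt += 1                                 # S21: next word
    return tries_per_word


# Scripted session: the user first answers "blue" for 赤, then corrects it.
vocabulary = [("赤", "red"), ("青", "blue")]
scripted = iter(["blue", "red", "blue"])
log = run_test_routine(vocabulary, lambda prompt: next(scripted), lambda u: u)
```

As in the flowchart, a wrong or rejected answer does not advance the counter; the same Japanese word is presented again until it is answered correctly.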
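[0038]
In the test routine described above (steps S15 to S22), the word pointer is set to the first word position and the words are tested in order starting from the first. Instead, the word to be tested may be chosen at random, for example by generating a random number.
[0039]
By taking such a test, the user can learn the correct pronunciation of the English word corresponding to a Japanese word name and, at the same time, grasp the meaning of the English word (that is, the Japanese word). In the example above a word is presented to the user, but a Japanese sentence may be presented and the user asked to utter the corresponding English sentence. In the example above a keyboard is used as the instruction unit 5, but a recording medium such as a floppy disk or CD-ROM may be used instead. Furthermore, in place of the instruction unit 5, a voice recording unit (a recording medium such as a floppy disk or CD-ROM) may be provided in which voices of content uttered in two or more languages are recorded in association with one another.
[0040]
FIG. 5 shows another configuration example of the educational device according to the present invention. The device of FIG. 5 is the device of FIG. 1 with the instruction unit 5 replaced by a voice recording unit 7 (a recording medium such as a floppy disk or CD-ROM) in which voices of content uttered in two or more languages are recorded in association with one another.
[0041]
In the educational device of FIG. 5, the voice record holding unit 3 also functions as a temporary storage unit that temporarily stores the content recorded in the voice recording unit 7. When a standard pattern for recognition is registered for a given word, the control unit 1 reproduces from the voice reproduction unit 4, as a first voice, the voice in one or more of the two or more languages (for example, the English voice) transferred from the voice recording unit 7 to the voice record holding unit (temporary storage unit) 3, has the user utter the English voice following this first voice, generates a standard pattern for recognition from the user's utterance, and registers it in the voice recognition means 10. At recognition time, for a given word, the control unit 1 reproduces as a second voice, from among the voices in two or more languages transferred from the voice recording unit 7 to the voice record holding unit (temporary storage unit) 3, a voice in a language different from the one the user uttered (for example, the Japanese voice), has the user utter the first voice (the English voice) corresponding to this second voice (the Japanese voice), has the voice recognition means 10 recognize the utterance, judges whether or not the recognition result is associated with the voice that is recorded in the voice record holding unit (temporary storage unit) 3 and was reproduced as the first voice, and presents the result to the user.
[0042]
In the educational device of FIG. 5, the content of the voice recording unit (recording medium) 7 can be set in various ways. Because this content is temporarily stored in the voice record holding unit 3, the kind of words the user is prompted to utter or the language can be changed and, by changing the program, the device can be used not only for foreign languages but also for teaching answers to questions or as a training device for visually impaired people. The functions of the educational device can therefore be changed easily just by exchanging the voice recording unit 7, that is, the recording medium.
[0043]
Thus, with the educational device of FIG. 5, exchanging the recording medium allows one system to be used by many people and for learning at various levels.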
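[0044]
The educational devices described above (more specifically, devices in which the voice recognition means 10 of the control unit 1 has the configuration of FIG. 2) cannot point out the user's errors from the system side. Consider an error the user is unaware of, for example a person who always pronounces "red" as "retto" (レット). If this person says "retto" in response to the guidance 赤 - "red", that utterance is registered as this speaker's standard pattern, and in the test, answering "retto" to the question "How do you say 赤?" yields a correct recognition result. With the devices described above, the pronunciation error therefore cannot be corrected until someone points out that the user's pronunciation differs from the model.
[0045]
FIG. 6 shows a configuration example of voice recognition means 10' intended to make it possible to correct errors that stem from the user's own mistaken assumptions about pronunciation. The configuration of FIG. 6 is intended both to create standard patterns from pronunciation that is as correct as possible and to help the user learn the correct pronunciation. In addition to the feature extraction unit 11, the speaker-dependent standard pattern registration unit 12, the comparison unit 13, and the result output unit 14, the voice recognition means 10' has a speaker-independent standard pattern registration unit 15.
[0046]
The configuration of FIG. 6 exploits two facts: speaker-independent voice recognizers have now become available, and the speaker-dependent method gives higher recognition accuracy. In the configuration of FIG. 6, the speaker-independent standard patterns are first used to check whether the user's pronunciation is correct, and a speaker-dependent standard pattern is registered only for utterances judged correct.
[0047]
FIG. 7 shows another configuration example of the educational device, in which the voice recognition means of the control unit 1 has the configuration of the voice recognition means 10' of FIG. 6. As described above, the device of FIG. 7 is intended to make it possible to correct errors that stem from the user's own mistaken assumptions, that is, to create standard patterns from pronunciation that is as correct as possible and to help the user learn the correct pronunciation.
[0048]
The educational device of FIG. 7 is the device of FIG. 1 in which the voice recognition means of the control unit 1 has the configuration of the voice recognition means 10' of FIG. 6 and in which, alongside the instruction unit (for example, a keyboard) 5, a voice recording unit (recording medium) 7 like that of FIG. 5 is additionally provided. Voices of content uttered in two or more languages are recorded in association with one another in the voice recording unit (recording medium) 7, and the recorded voices are those of unspecified speakers (for example, a standard voice obtained by averaging the voices of a plurality of speakers).
[0049]
In the educational device configured as in FIGS. 6 and 7, before the standard pattern of the user's voice is registered in the speaker-dependent standard pattern registration unit 12, the similarity between the feature pattern of the user's utterance and the speaker-independent standard patterns registered in the speaker-independent standard pattern registration unit 15 is obtained to check whether a correct recognition result is obtained. If it is, the feature pattern of that utterance is registered as it is in the speaker-dependent standard pattern registration unit 12. If it is not, a message such as "Let's practice the pronunciation once more" or "Is your pronunciation correct?" is given to the user, and the same operation is repeated. By operating in this way, the feature pattern of the user's utterance that obtained the highest similarity to the speaker-independent standard pattern registered in the speaker-independent standard pattern registration unit 15 can be registered in the speaker-dependent standard pattern registration unit 12 as the speaker-dependent standard pattern.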
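The check before registration described for FIGS. 6 and 7 can be sketched as follows. This is a minimal illustration under assumptions: the function name `enroll_with_check`, the two-dimensional toy feature vectors, and the negative squared distance used as the similarity measure are hypothetical and do not come from the patent. An utterance is accepted only if the speaker-independent patterns recognize it as the prompted word, and among accepted utterances the one most similar to the speaker-independent pattern is kept.

```python
def similarity(a, b):
    """Assumed similarity: negative squared Euclidean distance (higher is closer)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def enroll_with_check(prompted_word, utterances, independent_templates):
    """Return the user's feature pattern to register as the speaker-dependent
    standard pattern, or None if no utterance passed the check."""
    best_pattern, best_score = None, float("-inf")
    for pattern in utterances:
        recognized = max(
            independent_templates,
            key=lambda w: similarity(pattern, independent_templates[w]),
        )
        if recognized != prompted_word:
            continue  # the device would ask the user to practise again here
        score = similarity(pattern, independent_templates[prompted_word])
        if score > best_score:
            best_pattern, best_score = pattern, score
    return best_pattern


# Toy speaker-independent patterns for "red" and a confusable "retto".
independent = {"red": [1.0, 0.0], "retto": [0.0, 1.0]}
attempts = [[0.2, 0.9], [0.7, 0.3], [0.9, 0.1]]
chosen = enroll_with_check("red", attempts, independent)
```

Here the first attempt is recognized as "retto" and is rejected, so it can never become the user's own standard pattern, which is the failure the configuration of FIG. 6 is meant to prevent.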
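[0050]
Specifically, when the user utters a word to check whether his or her pronunciation is correct, the voice recognition means 10' extracts the feature pattern of the input word and first compares it with the first speaker-independent standard pattern. The similarity obtained and the word name of the standard pattern that gave it are temporarily stored, for example in a memory (not shown). The feature pattern of the input voice is then compared with the next speaker-independent standard pattern. If the similarity to this standard pattern is greater than the similarity to the previous one, the previously stored entry is erased and the current similarity and the word name of the standard pattern that gave it are stored in the memory. If the current similarity is smaller, the previously stored entry is kept in the memory. In this way the feature pattern of the input voice is compared with each speaker-independent standard pattern in turn, and once all similarities have been obtained, the standard pattern that gave the highest similarity, that is, the word name remaining in the memory, is the recognition result.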
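The sequential comparison above keeps only one entry in memory at a time: the best similarity so far and the word name that produced it. A minimal sketch, assuming a dot product as the similarity measure and the hypothetical function name `scan_templates`:

```python
def dot(a, b):
    """Assumed similarity measure for the example: plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def scan_templates(feature, templates):
    """Compare the feature pattern with each standard pattern in turn,
    keeping only the best (word, similarity) entry seen so far."""
    best_word, best_score = None, None
    for word, template in templates.items():
        score = dot(feature, template)
        if best_score is None or score > best_score:
            # Erase the stored entry and keep the current one instead.
            best_word, best_score = word, score
        # Otherwise the stored entry is left untouched.
    return best_word, best_score


templates = {"red": [1.0, 0.0], "blue": [0.0, 1.0], "green": [0.5, 0.5]}
word, score = scan_templates([0.1, 0.9], templates)
```

The word left in the single memory slot after the last comparison is the recognition result, so the memory needed does not grow with the number of registered words.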
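[0051]
When a recognition result has been obtained from the voice recognition means in this way, the control unit 1 compares it with the word name the user was prompted to utter. If the word names match, the pronunciation is judged correct; if they differ, it is judged incorrect. This makes it possible to point out pronunciation errors the user is unaware of. When the recognition result matches the word name the user was prompted to utter, the speaker-independent standard pattern that gave the recognition result can be registered as the speaker-dependent standard pattern.
[0052]
Thus, in the educational device configured as in FIGS. 6 and 7, before the standard pattern of the user's voice is registered in the speaker-dependent standard pattern registration unit 12, the similarity between the feature pattern of the user's utterance and the speaker-independent standard patterns registered in the speaker-independent standard pattern registration unit 15 is obtained to check whether a correct recognition result is obtained. The user can therefore tell whether his or her pronunciation is correct and can use the educational device with a correct voice, while at the same time acquiring the correct pronunciation. Errors that stem from the user's own mistaken assumptions can be corrected, standard patterns can be created from pronunciation that is as correct as possible, and the correct pronunciation can be learned.
[0053]
In each of the educational devices described above, a recognition error can occur during learning, for example while uttering words, in two ways: the user has forgotten the pronunciation presented by the device and utters a completely different word, or the user's pronunciation resembles the presented pronunciation but differs from the correct pronunciation used at registration. In either case, the user needs to hear the correct pronunciation once more.
[0054]
FIG. 8 shows another configuration example of the educational device according to the present invention, intended to solve the above problem.
[0055]
Referring to FIG. 8, the voice recognition means of the control unit 1 in this educational device has, for example, the configuration of the voice recognition means 10'' shown in FIG. 9. In the example of FIG. 8, a voice recording unit (recording medium) 7 is provided in addition to the instruction unit 5.
[0056]
In the educational device of FIGS. 8 and 9, when the comparison means 20 judges that the recognition result of the voice recognition means 10'' is wrong, the control unit 1 reads one or both of the voices paired with the designated voice from the voice record holding unit (temporary storage unit) 3, supplies them to the voice reproduction unit 4, and has them reproduced.
[0057]
Specifically, in the educational device of FIG. 8 as well, a control program for operating the educational device, pronunciation voice data for foreign-language words (for example, model English voice data), the Japanese meanings of the foreign-language words, and the like are loaded into and stored in the voice record holding unit 3, for example from the voice recording unit (recording medium) 7.
[0058]
In this educational device too, when a standard pattern is registered, the voice of a word held in the voice record holding unit 3 is reproduced from the voice reproduction unit 4, the user is asked to utter repeatedly a voice close to the reproduced one, and its feature pattern is registered in the speaker-dependent standard pattern registration unit 12 as the speaker-dependent standard pattern. The Japanese word whose English equivalent is to be uttered is then displayed, and the English pronunciation uttered in response is recognized in the same manner as described above to obtain a recognition result. At this point it is also effective to save the uttered English voice temporarily, for example in the voice record holding unit 3.
[0059]
If this recognition results in a misrecognition, the control unit 1 retrieves the English voice of the word concerned from the voice record holding unit 3 and reproduces it from the voice reproduction unit 4 for the user to hear. The control unit 1 then reproduces from the voice reproduction unit 4 the user's own utterance that was temporarily stored in the voice record holding unit 3. The user can thus clearly grasp the difference between the correct English pronunciation and his or her own. As noted above, a recognition error during learning may arise because the user has forgotten the presented pronunciation and utters a completely different word, or because the user's pronunciation resembles the presented one but differs from the correct pronunciation used at registration. With the educational device of FIG. 8, the user can hear the correct pronunciation once more in either case.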
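[0060]
FIG. 10 shows a modification of the educational device of FIG. 8, in which the voice recognition means 10 of FIG. 2 is used as the voice recognition means of the control unit 1. In the device of FIG. 10, when the recognition result of the voice recognition means 10 is wrong, the control unit 1 reproduces the designated voice again and asks the user to utter it; when the user utters it again, the standard pattern for voice recognition registered earlier is rewritten with the feature pattern of the new utterance.
[0061]
When using the educational device, the user sometimes has to utter words he or she does not know. As a result the utterance may be unstable or mistaken. Instability can be reduced by using the device repeatedly, which stabilizes the utterance, but for a misstatement the original standard pattern needs to be rewritten.
[0062]
Besides the causes above, recognition errors also arise from change over time. That is, some time after a voice has been registered, it may no longer be recognized correctly even though the pronunciation is correct.
[0063]
Because the educational device of FIG. 10 replaces the standard pattern of a misrecognized voice with a new one, it can cope with such cases.
[0064]
Specifically, in the educational device of FIG. 10 as well, a control program for operating the educational device, pronunciation voice data for foreign-language words (for example, model English voice data), the Japanese meanings of the foreign-language words, and the like are loaded into and stored in the voice record holding unit 3, for example from the voice recording unit (recording medium) 7.
[0065]
In this educational device too, when a standard pattern is registered, the voice of a word held in the voice record holding unit 3 is reproduced from the voice reproduction unit 4, the user is asked to utter repeatedly a voice close to the reproduced one, and it is registered in the speaker-dependent standard pattern registration unit 12 as the speaker-dependent standard pattern. The Japanese word whose English equivalent is to be uttered is then displayed, and the English pronunciation uttered in response is recognized in the same manner as described above to obtain a recognition result. At this point it is also effective to save the uttered English voice temporarily, for example in the voice record holding unit 3.
[0066]
If this recognition results in a misrecognition, the control unit 1 retrieves the English voice of the word concerned from the voice record holding unit 3 and reproduces it from the voice reproduction unit 4 for the user to hear.
[0067]
At the same time, the voice recognition means 10 is put into registration mode. When the user then utters the voice, features are extracted from it and the resulting feature pattern is registered in the speaker-dependent standard pattern registration unit 12 as the standard pattern. Registering the feature pattern in this way erases the existing standard pattern registered earlier, that is, the one that has just been misrecognized. The existing standard pattern does not necessarily have to be erased and overwritten, however; the average of the new standard pattern and the existing standard pattern may be registered as the standard pattern instead. This prevents the standard pattern from becoming outdated.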
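The two update policies described for FIG. 10, overwriting the old standard pattern or averaging it with the new one, can be sketched as below. The function name `update_template` and its `mode` parameter are hypothetical, and the sketch assumes that both patterns are feature vectors of the same length; real speech feature patterns usually vary in length and would need time alignment first, which is not covered here.

```python
def update_template(old, new, mode="replace"):
    """Return the standard pattern to store after a misrecognised word has
    been uttered again.

    mode="replace": the new feature pattern overwrites the old one.
    mode="average": the element-wise mean of old and new is stored.
    """
    if mode == "replace":
        return list(new)
    if mode == "average":
        if len(old) != len(new):
            raise ValueError("patterns must have the same length to be averaged")
        return [(a + b) / 2 for a, b in zip(old, new)]
    raise ValueError(f"unknown mode: {mode!r}")


old_pattern = [1.0, 3.0]
new_pattern = [3.0, 5.0]
replaced = update_template(old_pattern, new_pattern)
averaged = update_template(old_pattern, new_pattern, mode="average")
```

Averaging moves the stored pattern only halfway toward the new utterance, so a single unusual utterance changes it less than a full overwrite would.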
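[0068]
In the educational devices of the configuration examples described above, when the voice recognition means misrecognizes an utterance, it cannot be determined whether the user uttered the wrong word or uttered the right word incorrectly.
[0069]
FIG. 11 shows another configuration example of the educational device according to the present invention, intended to distinguish, when the voice recognition means misrecognizes an utterance, whether the user uttered the wrong word or uttered the right word incorrectly.
[0070]
The educational device of FIG. 11 is, for example, the configuration of FIG. 8 or FIG. 10 with the control unit 1 configured as shown in FIG. 12. Referring to FIG. 12, in the control unit 1 of the device of FIG. 11, the voice recognition means 10''' has, in addition to the feature extraction unit 11, the speaker-dependent standard pattern registration unit 12, the comparison unit 13, and the result output unit 14, a designated-word similarity holding unit 17 that holds the similarity obtained by the comparison unit 13 (when a recognition result is obtained, the similarity between the feature pattern of the voice of the word that gave this result and the standard pattern). The comparison means 20' of the control unit 1 also compares the similarity held in the designated-word similarity holding unit 17 with a threshold TH and judges whether the similarity is larger or smaller than TH.
[0071]
In an educational device configured in this way, at recognition time the control unit 1 has the user designate the voice to be uttered, for example through the instruction unit 5, has the user utter it, has the voice recognition means 10''' recognize the utterance, and judges whether or not the recognition result is associated with the voice designated through the instruction unit 5. The voice to be uttered is then designated to the voice recognition means 10''', the similarity between the feature pattern of the user's utterance and each standard pattern is calculated, and a recognition result is obtained in the same manner as described above. If the result is correct, the device operates as described earlier. If the utterance was not recognized correctly, the device judges whether the calculated similarity is smaller or larger than a predetermined threshold TH and, when it is smaller than TH, reproduces the misrecognized voice.
[0072]
Specifically, as noted above, the voice recognition means may misrecognize an utterance either because the user has forgotten the pronunciation presented by the educational device and utters a completely different word, or because the user's pronunciation is similar but differs from the correct pronunciation used at registration. The similarity is lower in the former case than in the latter, so the two can be distinguished by the difference in similarity, that is, by judging whether the calculated similarity is smaller or larger than the predetermined threshold TH. Once the cases are distinguished, in the former case a message such as "Have you got the wrong word?" is shown to the user; in the latter case the message "Let's distinguish it from this word" is shown, the voice of the word it was mistaken for is reproduced, and then the correct voice is reproduced with the announcement "This is the correct answer."
[0073]
The operation of the educational device of FIG. 11 is now described more specifically. In this device as well, a control program for operating the educational device, pronunciation voice data for foreign-language words (for example, model English voice data), the Japanese meanings of the foreign-language words, and the like are loaded into and stored in the voice record holding unit 3, for example from the voice recording unit (recording medium) 7.
[0074]
In this educational device too, when a standard pattern is registered, the voice of a word held in the voice record holding unit 3 is reproduced from the voice reproduction unit 4, the user is asked to utter repeatedly a voice close to the reproduced one, and it is registered in the speaker-dependent standard pattern registration unit 12 as the speaker-dependent standard pattern. The Japanese word whose English equivalent is to be uttered is then displayed, and the English pronunciation uttered in response is recognized in the same manner as described above to obtain a recognition result. At this point it is also effective to save the uttered English voice temporarily, for example in the voice record holding unit 3.
[0075]
In this educational device, when the user utters a word to check whether his or her pronunciation is correct, the voice recognition means 10''' extracts the feature pattern of the input word and first compares it with the first speaker-dependent standard pattern. The similarity obtained and the word name of the standard pattern that gave it are temporarily stored in the designated-word similarity holding unit 17. The feature pattern of the input voice is then compared with the next speaker-dependent standard pattern. If the similarity to this standard pattern is greater than the similarity to the previous one, the previously stored entry is erased and the current similarity and the word name of the standard pattern that gave it are stored. If the current similarity is smaller, the current one is simply discarded and the next standard pattern is fetched. However, when the word name of the standard pattern being matched is the same as the word that was read from the storage unit and output through the voice reproduction unit 4, its similarity is kept in the same storage unit regardless of its value.
[0076]
In this way the feature pattern of the input voice is compared with each speaker-dependent standard pattern in turn, and once all similarities have been obtained, the standard pattern that gave the highest similarity, that is, the word name remaining in the storage unit, is the recognition result. If this recognition result is wrong, the similarity stored together with the correct word name is compared with the threshold TH. If the similarity is lower than TH, the message "Have you got the wrong word?" is displayed. If it is higher than TH, the voice of the misrecognized word is retrieved from the voice record holding unit 3 and output from the voice reproduction unit 4 together with a message asking whether the user has confused the word with this one.
[0077]
This lets the user (speaker) notice the mistake. Alternatively, by learning which words his or her pronunciation is easily mistaken for, the user comes to pronounce the word in a way that avoids the confusion.
[0078]
An appropriate value for the threshold TH here is about 1/2 to 2/3 of the similarity that arises between a speaker-dependent standard pattern and a correct input feature pattern.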
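The threshold test of FIGS. 11 and 12 can be sketched as follows. The names `diagnose` and `choose_threshold`, the cosine similarity measure, and the toy feature vectors are assumptions made for illustration only. The sketch keeps the similarity for the prompted word alongside the best match, and on a misrecognition compares the prompted word's similarity with TH, where TH is taken as a fraction between 1/2 and 2/3 of a typical similarity for a correct utterance.

```python
import math


def cosine(a, b):
    """Assumed similarity measure: cosine of the angle between feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def choose_threshold(typical_correct_similarity, fraction=0.6):
    """TH as roughly 1/2 to 2/3 of the similarity seen for correct input."""
    if not 0.5 <= fraction <= 2 / 3:
        raise ValueError("fraction should lie between 1/2 and 2/3")
    return typical_correct_similarity * fraction


def diagnose(prompted_word, feature, templates, threshold):
    """Classify an utterance as correct, a wrong word, or a confusable
    pronunciation, and return the label with the recognised word."""
    scores = {w: cosine(feature, t) for w, t in templates.items()}
    recognized = max(scores, key=scores.get)
    if recognized == prompted_word:
        return "correct", recognized
    if scores[prompted_word] < threshold:
        return "wrong_word", recognized   # "Have you got the wrong word?"
    return "confusable", recognized       # replay the word it was mistaken for


templates = {"red": [1.0, 0.0], "lead": [0.8, 0.6], "blue": [0.0, 1.0]}
th = choose_threshold(0.95)
case_correct = diagnose("red", [1.0, 0.1], templates, th)
case_confusable = diagnose("red", [0.75, 0.66], templates, th)
case_wrong_word = diagnose("red", [0.05, 1.0], templates, th)
```

In the second case the utterance is still fairly close to the "red" pattern but closer to "lead", so the sketch reports a confusable pronunciation; in the third case it is far from "red" altogether, which suggests that a different word was spoken.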
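[0079]
In this way, when the voice recognition result is a misrecognition, the device can distinguish whether the user uttered the wrong word or uttered the right word incorrectly, and can inform the user accordingly.
[0080]
FIG. 13 shows another configuration example of the educational device according to the present invention. The object of the device of FIG. 13 is to show model characters or pictures and have the practitioner answer in spoken words, thereby letting the practitioner acquire both the meaning and the pronunciation of words.
[0081]
In the educational device of FIG. 13, a voice/image record holding unit (memory) 23 is provided in place of the voice record holding unit 3 of, for example, the devices of FIGS. 1 and 5. The voice/image record holding unit 23 stores the program for operating the device and the foreign-language pronunciation voice data of the words, together with the corresponding images (characters or pictures). The example of FIG. 13 corresponds to FIG. 1 (that is, the instruction unit 5 is provided).
[0082]
In the configuration of FIG. 13, when a picture, for example, is designated through the instruction unit (for example, a keyboard) 5, the picture is shown on the display unit (display) 6, the voice data corresponding to the picture (the foreign-language pronunciation) is read from the voice/image record holding unit 23, and the foreign-language pronunciation is output from the voice reproduction unit (speaker) 4. The user can thus hear the foreign-language pronunciation corresponding to the picture through the speaker 4 while looking at the picture on the display unit 6. As an example, suppose the foreign language is English and there are five words: 犬 (dog), 猫 (cat), 鳥 (bird), 馬 (horse), and 牛 (cow). When the program starts, the first picture, the dog, is shown on the display unit 6 and the pronunciation "dog" is output from the voice reproduction unit (speaker) 4. The user imitates it and says "dog" into the voice input unit (microphone) 2. This may be repeated several times if necessary.
[0083]
The voice input from the voice input unit (microphone) 2 is A/D converted; one branch may be input to the voice recognition means 10 and the other stored as a waveform for playback in the voice/image record holding unit (memory) 23. The voice signal does not necessarily have to be branched to both. It may be stored as a waveform in the voice/image record holding unit (memory) 23 and input to the voice recognition means 10 when needed, or, when the user's voice does not need to be played back, it may be converted into feature quantities for voice recognition and stored in that form, which uses less memory. The voice recognition means 10 uses the input voice for speaker-dependent registration and creates the speaker's standard patterns for recognition.
[0084]
These operations are repeated until pronunciation practice for all five words is finished. When the test mode is selected through the instruction unit (keyboard) 5, a picture of an animal is shown on the display unit (display) 6 and the device waits for recognition.
[0085]
The user tries to say in English what is shown in the picture on the display unit 6. The English utterance is input to the voice recognition means 10 and recognized among the five registered words, and the recognition result is output to the comparison means 20. There the word name sent earlier is compared with the recognition result; if they are the same the answer is correct, and if they differ it is incorrect. The result is reported to the user on the display unit 6. In this way, pictures are shown one after another on the display unit 6, the user utters the corresponding foreign-language words in turn, and the user is told whether each pronunciation was correct. The pictures may be shown in registration order, in reverse registration order, or in random order.
[0086]
Thus the educational device of FIG. 13 shows model characters or pictures and has the practitioner answer in spoken words, which lets the practitioner acquire both the meaning and the pronunciation of words.
[0087]
In the configuration of FIG. 13, the voice recognition means 10 has, for example, the same configuration as shown in FIG. 2. The input voice is converted into feature quantities by the feature extraction unit 11, and at registration the converted feature quantities are stored directly in the speaker-dependent standard pattern registration unit 12. At recognition, the unknown input voice is converted into feature quantities by the feature extraction unit 11, the comparison unit 13 calculates the similarity to each of the standard patterns registered in advance, and the result output unit 14 outputs the one with the highest similarity as the recognition result.
[0088]
In the example above the foreign language is English and the number of words is five, but the foreign language may be other than English and the number of words may be arbitrary. The example associates pictures with foreign-language words, but the invention is not limited to this; for example, national flags may be associated with country names, company emblems with company names, faces with personal names, kanji with their readings, or positions on a map with place names. If moving pictures are used instead of still pictures, the device is also effective for teaching associations between visual and auditory information, such as in learning sign language.
[0089]
The example above uses words, but sentences may of course be used instead. Also, in the example of FIG. 13 the instruction unit 5 is a keyboard from which commands are entered and selections are made, but the instruction unit 5 need not be a keyboard; for example, a floppy disk may be used and the device controlled by a program stored on the floppy disk.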
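[0090]
[Effects of the Invention]
As described above, the inventions according to claims 1 to 3 make it possible to provide an educational device that presents a model voice, judges the correctness of the practitioner's pronunciation, and lets the practitioner acquire both the meaning and the pronunciation of words at the same time.
[Brief Description of the Drawings]
[FIG. 1] A diagram showing a configuration example of an educational device according to the present invention.
[FIG. 2] A diagram showing a configuration example of the voice recognition means.
[FIG. 3] A flowchart illustrating the processing operation of the educational device of FIG. 1.
[FIG. 4] A flowchart illustrating the processing operation of the educational device of FIG. 1.
[FIG. 5] A diagram showing another configuration example of the educational device according to the present invention.
[FIG. 6] A diagram showing another configuration example of the voice recognition means.
[FIG. 7] A diagram showing another configuration example of the educational device according to the present invention.
[FIG. 8] A diagram showing another configuration example of the educational device according to the present invention.
[FIG. 9] A diagram showing another configuration example of the voice recognition means.
[FIG. 10] A diagram showing a modification of the educational device of FIG. 8.
[FIG. 11] A diagram showing another configuration example of the educational device according to the present invention.
[FIG. 12] A diagram showing a configuration example of the control unit of the educational device of FIG. 11.
[FIG. 13] A diagram showing another configuration example of the educational device according to the present invention.
[Description of Reference Numerals]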
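1 Control unit
2 Voice input unit
3 Voice record holding unit
4 Voice reproduction unit
5 Instruction unit
6 Display unit
7 Voice recording unit
10, 10', 10'', 10''' Voice recognition means
11 Feature extraction unit
12 Speaker-dependent standard pattern registration unit
13 Comparison unit
14 Result output unit
15 Speaker-independent standard pattern registration unit
20, 20' Comparison means
23 Voice/image record holding unit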
BACKGROUND OF THE INVENTION
The present invention relates to an educational device using voice recognition.
[0002]
[Prior art]
In recent years, several technologies have been proposed as language education equipment using speech. For example, in Japanese Patent Laid-Open No. 63-303400, a model voice (a model voice of a language (for example, English) that you want to learn) is put in a card-like recording device, and it is repeated while listening to it to practice speaking. Techniques for performing are shown. According to this technology, the model voice can be effectively heard, but there is a drawback that even if there is a mistake in pronunciation, it cannot be corrected unless the person himself / herself notices.
[0003]
On the other hand, in Japanese Patent Laid-Open No. 59-220775, an exemplary voice is recorded on a magnetic tape, and the voice of a user is compared with that to determine mechanically whether they are similar or not and talk about the results. The technology to inform the person is shown. Japanese Patent Application Laid-Open No. 60-162281 stores the voice of the model speaker and the input voice of the practitioner, and analyzes them to compare the characteristics of the input voice of the practitioner with the characteristics of the model voice. Evaluation, the analysis and the evaluation result are displayed on the display device, the practitioner hears his / her pronunciation, the analysis result of the model voice and his / her voice displayed on the display device 4, and the evaluation result of his / her voice Shows a technique for checking the difference in voice characteristics between the model voice and one's own voice, and correcting one's pronunciation.
[0004]
[Problems to be solved by the invention]
According to the techniques of Japanese Patent Laid-Open Nos. 59-220775 and 60-162281, the model voice and the learner's voice are compared with each other in amplitude, pitch, and formant. For humans, there is a drawback that it is difficult to understand what should be corrected to get close to the model voice.
[0005]
Further, in the techniques disclosed in Japanese Patent Laid-Open Nos. 59-220775 and 60-162281, only a model voice (a model voice of a language to be learned (for example, English)) is output. However, even if you can acquire the correct pronunciation of this language, you have the disadvantage that you cannot know it immediately when you want to know what the language means. In other words, language learning is based on the acquisition of words. In order to acquire words with the correct pronunciation, it is very important not only to show the correct pronunciation as a teaching material but also to understand the meaning. In Japan, a technique is widely used in which Japanese is written on the front of the word card and foreign language is written on the back, and the words in the foreign language (for example, English) are reminded by looking at the Japanese on the front. However, the techniques disclosed in Japanese Patent Laid-Open Nos. 59-220775 and 60-162281 have a drawback that they cannot be used like the above word cards.
[0006]
The present invention presents a model voice to a practitioner (user) and allows the practitioner to determine the correctness of his / her pronunciation, and also allows the practitioner to acquire both the meaning and pronunciation of the word The purpose is to provide educational equipment that can do this.
[0007]
In addition, the present invention provides an educational device that allows a practitioner to acquire both the meaning and pronunciation of words by showing model characters and pictures and allowing the practitioner to answer in words. It is an object.
[0008]
[Means for Solving the Problems]
  In order to achieve the above object, the invention according to claim 1Is a record holding means for recording the model voice and the presentation information related to the model voice;
Presenting means for presenting information for presentation recorded in the record holding means;
Audio reproduction means for reproducing the exemplary audio recorded in the record holding means;
Voice input means for receiving the spoken voice;
Voice recognition for comparing whether the voice input by the voice input unit is similar to the model voice by comparing the voice input by the voice input unit with the model voice recorded in the voice recording unit Means,
When the voice recognition means recognizes that the voice input by the voice input means is similar to the model voice, the voice input by the voice input means or a feature pattern of the voice is used as the determination information as the determination information. Registration means for recording in the record holding means in association with the voice,
The voice input means further accepts a voice uttered based on the presentation information presented by the presentation means,
The speech recognition means further determines whether the uttered speech or the feature pattern of the speech is similar to the determination information stored in advance and determines whether the uttered speech is correct or incorrect.
  It is characterized by that.
[0009]
  The invention of claim 2In the educational device according to claim 1, when the voice recognition means recognizes that the voice inputted by the voice input means is not similar to the model voice, the voice reproduction means again outputs the model voice. PlayIt is characterized by that.
[0010]
  The invention of claim 3In the educational device according to claim 1 or 2, when the voice recognition means for judging the correctness of the uttered voice is judged to be an error, the voice corresponding to the presentation information is sent from the record holding means. Read and play back by the sound playback meansIt is characterized by that.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing a configuration example of an educational device according to the present invention. Referring to FIG. 1, this educational device includes a control unit 1 that controls the whole, a voice input unit (for example, a microphone) 2 that inputs voice, a voice recording and holding unit (for example, memory) 3 that records and holds voice, A sound reproducing unit (for example, a speaker) 4 that reproduces the sound recorded and held in the sound recording and holding unit 3 and an instruction unit (for example, a keyboard) 5 that indicates one of the sounds recorded in the sound recording and holding unit 3 The control unit 1 is provided with a voice recognition means 10 capable of registering a voice pattern for recognition (standard pattern) and capable of recognizing voice, and a comparison means 20, and is provided with a standard for recognition. At the time of pattern registration, the control unit 1 causes the audio reproducing unit 4 to reproduce the audio recorded and held in the audio recording holding unit 2, and when the user utters audio close to the reproduced audio, the user speaks. Based on the voice ( A standard pattern for recognition is generated and registered in the voice recognition means 10, and at the time of voice recognition, the control unit 1 determines the voice to be uttered as an instruction unit. 5, after the user is instructed by 5, the user utters the voice, the user's voice is recognized by the voice recognition means 10, and the recognition result is associated with the voice instructed by the instruction unit 5. The comparison means 20 determines whether or not and presents the result to the user.
[0018]
More specifically, the voice record holding unit 3 includes, for example, a control program for operating the educational device, pronunciation data of foreign words (for example, English voice data as an example), foreign language words. The meaning of Japanese is remembered.
[0019]
As the voice recognition method of the voice recognition means 10, any conventionally known method can be used. For example, the system described in the document “Furui Digital Audio Processing (Tokai Univ. Publishing, 1985)” can be used.
[0020]
FIG. 2 is a diagram showing a configuration example of the voice recognition means 10. In the example of FIG. 2, the speech recognition unit 10 includes a feature extraction unit 11, a specific speaker standard pattern registration unit 12, a comparison unit 13, and a result output unit 14.
[0021]
In the case where the speech recognition means 10 is configured as shown in FIG. 2, when a standard pattern is registered, an input speech of a predetermined word of a user (specific speaker) is inputted to a feature amount (feature pattern) by a feature extraction unit 11. ), The feature pattern extracted by the feature extraction unit 11 is stored in the standard pattern registration unit 12 for the specific speaker as a standard pattern. On the other hand, at the time of speech recognition, after the input speech of an unknown word of the user (specific speaker) is converted into a feature amount (feature pattern) by the feature extraction unit 11, the comparison unit 13 features the input speech of the unknown word. Similarities are calculated between the patterns and the standard patterns of various words registered in advance in the standard pattern registration unit 12 for specific speakers, and the result output unit 14 selects the standard pattern that gives the highest similarity. The words they have are output as recognition results.
[0022]
Next, an example of processing operation of the educational equipment having such a configuration will be described. First, a case where a standard pattern for specific speaker voice recognition is registered in the voice recognition means 10 will be described. Now, when the user designates (inputs) a word (word name) in Japanese, for example, from the instruction unit 5, the control unit 1 presents the Japanese to the user in the manner described above. As a way of presenting Japanese, for example, as shown in FIG. 1, a display unit 6 may be further provided, and Japanese characters (characters) may be displayed on the display unit 6. May be stored as waveform data in the audio recording / holding unit (memory) 3 and reproduced from the audio reproducing unit (speaker) 4. Note that the display unit 6 is not necessarily provided when Japanese is presented by voice reproduction.
[0023]
Thus, after presenting the Japanese language to the user, the control unit 1 reads out the pronunciation voice data of the foreign language word corresponding to the Japanese word from the voice record holding unit 3 and reproduces it. Thereby, the user can hear the pronunciation of the foreign language word corresponding to the Japanese word through the speaker 4. For example, it is assumed that a Japanese word and a corresponding foreign language (English) word are registered in the voice recording holding unit 3 with the number of words “5”. Here, the five words are “red”, “blue”, “green”, “white”, and “black”. First, when the user inputs “red” from the keyboard 5, the speaker 4 produces a pronunciation “red”. The user imitates this and pronounces “red” toward the microphone 2. The user may repeat this utterance a plurality of times as necessary.
[0024]
The sound (for example, “red”) input from the microphone 2 is A / D converted and provided to the control unit 1. In the control unit 1, a part of the given voice can be given to the voice recognition means 10, and the other part can be stored as a voice for reproduction in the voice record holding unit (memory) 3 as a voice waveform. In this example, the voice signal is branched and input to both the voice recognition means 10 and the memory 3, but it is not always necessary to branch the voice signal and input it to both. For example, a voice signal input from the microphone 2 may be stored in the memory 3 as a voice waveform and input to the voice recognition unit 10 as necessary. Further, the voice signal may be stored in the memory 3 as the voice waveform. However, when it is not necessary to reproduce the user's voice, the voice signal is given only to the voice recognition means 10 and converted into a feature pattern for voice recognition. It may be memorized. In this case, the amount of memory used can be reduced.
[0025]
When the voice uttered by the user is input to the voice recognition means 10 as described above, the voice recognition means 10 can register the input voice (for example, “red”) as a standard pattern for specific-speaker voice recognition. That is, the feature pattern of the input voice (for example, “red”) can be registered in the specific speaker standard pattern registration unit 12 as a standard pattern for recognizing this user (the specific speaker).
[0026]
This operation is repeated in sequence so that the user practices the pronunciation of all five words; a standard pattern for this speaker is created from the English utterance of each of the five words and registered in the voice recognition means 10. When the pronunciation practice for all five words is completed, the user can select a test mode from the instruction unit (keyboard) 5 and designate a word name. When the test mode is selected and the word name is designated, the control unit 1 reproduces the designated Japanese stored in the voice record holding unit (memory) 3, for example by voice, and presents it to the user. When this reproduction is completed, the voice recognition means 10 enters a state of waiting to recognize an unknown input voice.
[0027]
Next, recognition processing for an unknown input voice, that is, actual voice recognition, will be described. First, when the user designates (inputs) a word (word name) in, for example, Japanese from the instruction unit 5, the control unit 1 presents the Japanese to the user in the manner described above. As one way of presenting the Japanese, a display unit 6 may be further provided, as shown in FIG. 1, and the Japanese characters may be displayed on the display unit 6. Alternatively, the Japanese may be stored as waveform data in the voice record holding unit (memory) 3 and reproduced from the voice reproduction unit (speaker) 4. Note that the display unit 6 is not necessarily required when the Japanese is presented by voice reproduction.
[0028]
After the Japanese has been presented to the user in this way, for example from the voice reproduction unit 4, the user tries to utter the presented word in English. When the user utters the word in English, the uttered voice is taken into the control unit 1, which gives it to the voice recognition means 10 for recognition. The voice recognition means 10 obtains the feature pattern of the input English voice, compares it with the standard patterns of the five (English) words registered in advance, and outputs the word having the most similar standard pattern from the result output unit 14 as the recognition result. At this stage, the control unit 1 compares the previously designated word name with the recognition result by means of the comparison means 20, and if the two are the same, determines that the answer is correct. The determination result is then reported to the user, for example by displaying it on the display unit 6. The words may be presented in the order of registration, in the reverse order, or at random.
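The recognition and correctness judgment described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the description does not specify how a feature pattern is represented or how similarity is computed, so the list-of-numbers representation, the cosine similarity measure, the sample values, and the names similarity, recognize and judge are all hypothetical.

```python
import math

def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recognize(feature, templates):
    """Return the word whose registered standard pattern is most similar to the input."""
    return max(templates, key=lambda word: similarity(feature, templates[word]))

def judge(designated_word, feature, templates):
    """Role of the comparison means 20: correct if the recognized word is the designated one."""
    return recognize(feature, templates) == designated_word

# Hypothetical standard patterns registered for one user.
templates = {
    "red": [1.0, 0.1, 0.0],
    "blue": [0.0, 1.0, 0.2],
    "green": [0.1, 0.0, 1.0],
}
```

With these sample values, an utterance whose feature pattern is [0.9, 0.2, 0.1] is recognized as "red", so judge("red", ...) is true and judge("blue", ...) is false.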
[0029]
FIGS. 3 and 4 are flowcharts for explaining a specific example of the processing operation of the educational device. In the example of FIGS. 3 and 4, a file name for recording personal information is first input (step S1). The file name may be input by the user from the keyboard 5, or the device itself may automatically read a name stored in advance in the device memory or generate one automatically. When the file name has been input, it is determined whether or not the file is a new file (step S2). If it is a new file, the device is being used for the first time, so the user must practice pronunciation and a standard pattern for voice recognition must be created. For this purpose, the word counter WCNT is first initialized to “1” (step S3), and a Japanese word name is input (step S4). As with the file name, the Japanese word name may be input by the user from the keyboard 5, or the device itself may automatically read a word name stored in the device memory or generate one automatically.
[0030]
Next, the English word name corresponding to the Japanese word name is input (step S5). For example, when the Japanese word for “red” is input as the Japanese word name in step S4, the English word “red” is input in step S5. The English may be input manually, or the device may input it automatically.
[0031]
When the Japanese and the corresponding English have been input in this way, they are displayed, for example, on the display unit 6 (step S6). That is, for example, the Japanese word for “red” and the English word “red” are displayed. The educational device then outputs a model pronunciation and has the user repeat it (step S7). That is, the device outputs “red” as the model voice, and the user utters “red” as closely as possible to the model voice.
[0032]
This voice uttered by the user is taken into the voice recognition means 10, and the voice recognition means 10 registers the feature pattern of the voice of the word uttered by the user as a standard pattern of this word, for example, in a file (step S8). Next, the word counter WCNT is incremented by “1” (step S9), and it is determined whether or not the counter value WCNT exceeds the total number of words n (step S10). If it does not exceed n, the processing from step S4 to step S9 is repeated until n is exceeded, and when the processing is completed, the file is saved.
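The registration loop of steps S3 to S10 can be sketched as follows. This is a minimal sketch, not the actual implementation: the callback capture_feature (which stands in for playing the model voice and extracting the feature pattern of the user's repetition), the display callback, and the use of a dictionary in place of the saved file are all assumptions introduced for illustration.

```python
def register_vocabulary(word_pairs, capture_feature, display=print):
    """Build one standard pattern per word.

    word_pairs: list of (japanese, english) word names.
    capture_feature: hypothetical callback that plays the model voice for the
        English word and returns the feature pattern of the user's repetition.
    display: hypothetical callback standing in for the display unit 6.
    """
    templates = {}
    n = len(word_pairs)
    wcnt = 1                                           # step S3: WCNT = 1
    while wcnt <= n:                                   # step S10: stop once WCNT exceeds n
        japanese, english = word_pairs[wcnt - 1]       # steps S4 and S5: word names
        display(f"{japanese} / {english}")             # step S6: show both words
        templates[english] = capture_feature(english)  # steps S7 and S8: utter and register
        wcnt += 1                                      # step S9: WCNT = WCNT + 1
    return templates                                   # stands in for the saved file
```

In a real device capture_feature would involve the microphone 2 and the feature extraction unit 11; here it can be replaced by any function that returns a list of numbers.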
[0033]
Next, the user selects whether to practice pronunciation only or to perform memory training in addition to pronunciation practice (step S11). When the user only listens to the model voice and practices pronunciation, the processing ends here. When memory training of the words is performed in addition to pronunciation practice, the test routine is entered from here (that is, the process proceeds to step S15).
[0034]
If it is determined in step S2 that the input file name already exists, that is, the user has already practiced speaking, the English vocabulary and the Japanese vocabulary associated with each other are loaded (steps S12 and S13), and the standard patterns (templates) for voice recognition created in step S8 are loaded (step S14). Here, a vocabulary is a set of words.
[0035]
Next, the test routine starting at step S15 is entered. In the test routine, the word pointer is first set to the first word position (step S15), and the word counter WCNT is initialized to “1” (step S16). Next, the Japanese word at the word position is presented, for example by displaying it on the display unit 6 (step S17). The Japanese word does not necessarily have to be displayed as characters; a recorded voice of the word may be output instead.
[0036]
When the Japanese word has been presented in this way, the user utters the corresponding English word (step S18). When the user utters the English word and the voice is input, the voice recognition means 10 recognizes the voice of the English word (step S19). That is, the feature pattern of the voice is extracted and collated with the standard pattern of each word registered in step S8. As a result of this recognition, it is determined whether or not the feature pattern of the voice uttered by the user matches the correct voice features (standard pattern) of the English word (step S20). If it does not (that is, if the utterance has been rejected or misrecognized), the process returns to step S17, the Japanese word is displayed again, the user utters the corresponding English word again, and the processing of steps S17 to S20 is repeated.
[0037]
On the other hand, if the recognition result is correct in step S20, the word counter WCNT is incremented by “1” (step S21), and it is determined whether or not the word counter WCNT exceeds a predetermined value n (step S22). If the predetermined value n has not been exceeded, the process returns to step S17, the next Japanese word is displayed, and the test routine is repeated. When the word counter WCNT exceeds the predetermined value n in step S22, all processing is completed.
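The test routine of steps S15 to S22 can be sketched as follows. This is an illustrative sketch only: recognize_utterance is a hypothetical callback covering steps S17 to S19 (present the Japanese word, take in the user's voice, return the recognized English word), and max_attempts is a safety bound added for the sketch that does not appear in the flowchart.

```python
def run_test(word_pairs, recognize_utterance, max_attempts=10):
    """Test each word in order and return the number of attempts per English word.

    word_pairs: list of (japanese, english) word names.
    recognize_utterance: hypothetical callback for steps S17 to S19.
    max_attempts: added safety bound (assumption), so the sketch always terminates.
    """
    attempts = {}
    n = len(word_pairs)
    pointer = 0                                       # step S15: first word position
    wcnt = 1                                          # step S16: WCNT = 1
    while wcnt <= n:                                  # step S22: finish once WCNT exceeds n
        japanese, english = word_pairs[pointer]
        tries = 0
        while True:
            tries += 1
            result = recognize_utterance(japanese)    # steps S17 to S19
            if result == english:                     # step S20: correct recognition
                break
            if tries >= max_attempts:                 # added bound, not in the flowchart
                break
        attempts[english] = tries
        pointer += 1                                  # advance to the next word position
        wcnt += 1                                     # step S21: WCNT = WCNT + 1
    return attempts
```

For a randomized test order, the list passed as word_pairs could simply be shuffled beforehand.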
[0038]
In the above test routine (steps S15 to S22), the word pointer is set to the first word position and the words are tested sequentially from that position. Alternatively, a random number may be generated, for example, to determine the word to be tested at random.
[0039]
By performing such a test, the user can learn the correct pronunciation of the English word corresponding to the Japanese word name and, at the same time, grasp the meaning of the English word (that is, the Japanese word). In the above example a word is presented to the user, but a Japanese sentence may be presented instead, and an English sentence corresponding to the Japanese sentence may be presented to the user. In the above example a keyboard is used as the instruction unit 5, but a recording medium such as a floppy disk or a CD-ROM can be used instead of the keyboard. Furthermore, instead of the instruction unit 5, a voice recording unit (a recording medium such as a floppy disk or a CD-ROM) in which voices of contents uttered in two or more languages are recorded in association with each other can be provided.
[0040]
FIG. 5 is a diagram showing another configuration example of the educational device according to the present invention. The educational device of FIG. 5 is provided, in place of the instruction unit 5 of the educational device described above, with a voice recording unit 7 (a recording medium such as a floppy disk or a CD-ROM) in which voices of contents uttered in two or more languages are recorded in association with each other.
[0041]
In the educational device of FIG. 5, the voice record holding unit 3 also functions as a temporary storage unit that temporarily stores the contents recorded in the voice recording unit 7. When a standard pattern for recognition is registered, the control unit 1 reproduces from the voice reproduction unit 4, as a first voice, the voice in one or more of the two or more languages (for example, the English voice) that has been transferred for a given word from the voice recording unit 7 to the voice record holding unit (temporary storage unit) 3, has the user utter the English voice in accordance with this first voice, and registers a standard pattern for recognition in the voice recognition means 10 on the basis of the voice uttered by the user. At the time of voice recognition, the control unit 1 reproduces, as a second voice, the voice in a language different from the language to be uttered by the user (for example, the Japanese voice) from among the voices in two or more languages recorded for a given word in the voice record holding unit (temporary storage unit) 3 from the voice recording unit 7, and has the user utter the first voice (the English voice) corresponding to this second voice (the Japanese voice). The user's voice is recognized by the voice recognition means 10, and it is determined whether or not the recognition result corresponds to the voice that is recorded in the voice record holding unit (temporary storage unit) 3 and was reproduced as the first voice; the result is presented to the user.
[0042]
In the educational device of FIG. 5, various contents can be set as the contents of the voice recording unit (recording medium) 7, and these contents are temporarily stored in the voice record holding unit 3. Therefore, by changing the type of words whose utterance is prompted, the type of language, or the program, the device can be used not only to teach foreign languages but also to teach answers to questions, or as a training machine for visually impaired people. The function of the educational device can thus be changed easily by replacing only the voice recording unit 7, that is, the recording medium.
[0043]
In this way, in the educational device of FIG. 5, by replacing the recording medium, one person can use one system for various levels of learning.
[0044]
In each of the educational devices described above (more specifically, a device in which the voice recognition means 10 of the control unit 1 has the configuration shown in FIG. 2), the system cannot point out the user's errors. Specifically, consider an error that the user does not notice: a person who always pronounces “red” as “let” will pronounce “let” when guided by the Japanese word for “red” and the model “red”. If this is registered as the standard pattern of this specific speaker, and the user again pronounces “let” when asked in the test how to say “red”, the voice recognition result will be judged correct. For this reason, the above educational devices cannot correct the pronunciation error until someone points out that the user's pronunciation differs from the model.
[0045]
FIG. 6 is a diagram showing a configuration example of a voice recognition means 10′ intended to make it possible to correct errors that the user pronounces out of a mistaken belief. That is, the configuration example of FIG. 6 is intended to create standard patterns with pronunciation that is as correct as possible and to let the user learn the correct pronunciation. In addition to the feature extraction unit 11, the specific speaker standard pattern registration unit 12, the comparison unit 13, and the result output unit 14, the voice recognition means 10′ further includes an unspecified speaker standard pattern registration unit 15.
[0046]
The configuration example of FIG. 6 takes advantage of two facts: voice recognition for unspecified speakers is now available, and the recognition accuracy of the specific-speaker method is higher. In the configuration example of FIG. 6, the standard patterns for unspecified speakers are first used to check whether the user has pronounced the word correctly, and a standard pattern for the specific speaker is registered when the pronunciation is determined to be correct.
[0047]
FIG. 7 is a diagram showing another configuration example of the educational device, in which the voice recognition means of the control unit 1 has the configuration of the voice recognition means 10′ shown in FIG. 6. The educational device of FIG. 7 is likewise intended to make it possible to correct errors that the user pronounces out of a mistaken belief; that is, it is intended to create standard patterns with pronunciation that is as correct as possible and to let the user learn the correct pronunciation.
[0048]
In the educational device of the example of FIG. 7, the voice recognition means of the control unit 1 has the configuration of the voice recognition means 10′ shown in FIG. 6, and, together with the instruction unit (for example, a keyboard) 5, a voice recording unit (recording medium) 7 as shown in the configuration example of FIG. 5 is further provided. In the voice recording unit (recording medium) 7, voices of contents uttered in two or more languages are recorded in association with each other. The recorded voices are those of an unspecified speaker (for example, a standard voice obtained by averaging the voices of a plurality of speakers).
[0049]
In the educational device having the configuration shown in FIGS. 6 and 7, before a standard pattern of the user's voice is registered in the specific speaker standard pattern registration unit 12, the similarity between the feature pattern of the voice uttered by the user and the standard patterns for unspecified speakers registered in the unspecified speaker standard pattern registration unit 15 is obtained, and it is checked whether or not a correct recognition result is obtained. If a correct recognition result is obtained, the feature pattern is registered as is in the specific speaker standard pattern registration unit 12. If a correct recognition result cannot be obtained, a message such as “Let's practice speaking again” or “Is the pronunciation correct?” is given to the user, and the same operation is repeated. In this way, the feature pattern of the user's voice that is most similar to the standard pattern for unspecified speakers registered in the unspecified speaker standard pattern registration unit 15 can be registered in the specific speaker standard pattern registration unit 12 as the standard pattern for the specific speaker.
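The check-before-registration behavior described above can be sketched as follows. This is a sketch under assumptions, not the method itself: feature patterns are taken to be lists of numbers, similarity is taken to be the negative squared Euclidean distance, and the function name register_if_correct and the sample patterns are hypothetical.

```python
def similarity(a, b):
    """Assumed measure: negative squared Euclidean distance (higher means more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def register_if_correct(prompted_word, feature, independent, personal):
    """Register the user's feature pattern only if the unspecified-speaker check passes.

    independent: standard patterns for unspecified speakers (role of unit 15).
    personal: standard patterns for the specific speaker (role of unit 12), updated in place.
    Returns None on success, or a message for the user on failure.
    """
    recognized = max(independent, key=lambda w: similarity(feature, independent[w]))
    if recognized == prompted_word:
        personal[prompted_word] = feature      # register the pattern as is
        return None
    return "Let's practice speaking again"     # message quoted from the description

# Hypothetical unspecified-speaker patterns for two confusable words.
independent = {"red": [1.0, 0.0], "let": [0.0, 1.0]}
personal = {}
```

With these sample values, an utterance with feature pattern [0.1, 0.9] prompted as "red" is recognized as "let", so nothing is registered and the message is returned; an utterance with [0.9, 0.1] passes the check and is registered.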
[0050]
Specifically, in order to check whether or not the user has pronounced correctly, when the user utters the voice of a certain word, the voice recognition means 10′ extracts the feature pattern of the input voice and first compares it with the first standard pattern for unspecified speakers. The similarity between the two and the word name of the standard pattern that gave this similarity are temporarily stored, for example, in a memory (not shown), and the feature pattern of the input voice is then compared with the next standard pattern for unspecified speakers. When the similarity to this standard pattern is greater than the stored similarity, the previously stored entry is deleted, and the current similarity and the word name of the standard pattern that gave it are stored in the memory. When the current similarity is smaller, the previously stored entry is kept in the memory as is. After the feature pattern of the input voice has been compared in turn with all the standard patterns for unspecified speakers in this way, the word (word name) of the standard pattern that gave the highest similarity, that is, the one remaining in the memory, is taken as the recognition result.
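The sequential comparison described above, in which only the best entry so far is kept in memory, can be sketched as follows. The similarity function is passed in as a parameter because the description does not specify one; the dot product used in the example and the names best_match and dot are assumptions for illustration.

```python
def dot(a, b):
    """Assumed similarity measure for the example: dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_match(feature, templates, similarity):
    """Compare the input with each standard pattern in turn, keeping one stored entry.

    Returns (word_name, similarity) of the entry left in memory after the last comparison.
    """
    stored_word, stored_score = None, None
    for word, pattern in templates.items():
        score = similarity(feature, pattern)
        # Replace the stored entry only when the current similarity is greater;
        # otherwise the previously stored entry is kept as is. The handling of
        # equal similarities is not specified; here the earlier entry is kept.
        if stored_score is None or score > stored_score:
            stored_word, stored_score = word, score
    return stored_word, stored_score
```

For example, best_match([0.9, 0.2], {"red": [1.0, 0.0], "blue": [0.0, 1.0]}, dot) keeps the entry for "red".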
[0051]
When the recognition result is obtained from the voice recognition means in this way, the control unit 1 compares the recognition result with the word name that the user was prompted to utter. If the word names match, the pronunciation is judged to be correct; if they differ, it is judged to be wrong. This makes it possible to point out pronunciation errors that the user does not notice. When the recognition result matches the word name that the user was prompted to utter, the feature pattern of the user's voice that gave the recognition result can be registered as the standard pattern for the specific speaker.
[0052]
As described above, in the educational device having the configuration of FIGS. 6 and 7, before a standard pattern of the user's voice is registered in the specific speaker standard pattern registration unit 12, the similarity to the standard patterns for unspecified speakers registered in the unspecified speaker standard pattern registration unit 15 is obtained, and it is checked whether or not a correct recognition result can be obtained. The user can therefore use the educational device with a voice that he or she has pronounced correctly and, at the same time, acquire the correct pronunciation. That is, it is possible to correct errors that the user pronounces out of a mistaken belief, to create standard patterns with pronunciation that is as correct as possible, and to learn the correct pronunciation.
[0053]
Also, in each of the above educational devices, when the voice recognition result is incorrect during learning such as word utterance, there are two possible cases: the user has forgotten the pronunciation presented by the device and has uttered a completely different word, or the pronunciation resembles the presented pronunciation but differs from the correct pronunciation made at the time of registration. In either case, the user needs to hear the correct pronunciation again.
[0054]
FIG. 8 is a diagram illustrating another configuration example of the educational device according to the present invention, and the educational device of FIG. 8 is intended to solve the above-described problem.
[0055]
That is, referring to FIG. 8, in this educational device the voice recognition means of the control unit 1 has, for example, the configuration of the voice recognition means 10″ shown in FIG. 9. In the example of FIG. 8, a voice recording unit (recording medium) 7 is also provided in addition to the instruction unit 5.
[0056]
In the educational device of FIGS. 8 and 9, when the comparison means 20 determines that the recognition result of the voice recognition means 10″ differs from the instructed voice, the control unit 1 reads one or both of the voices forming a pair with the instructed voice from the voice record holding unit (temporary storage unit) 3 and gives them to the voice reproduction unit 4 for reproduction.
[0057]
Specifically, in the educational device of FIG. 8 as well, a control program for operating the educational device, pronunciation voice data of foreign-language words (for example, model English voice data), the Japanese meanings of the foreign-language words, and the like are loaded from the voice recording unit (recording medium) 7 into the voice record holding unit 3 and stored there.
[0058]
In this educational device as well, when a standard pattern is registered, the voice of the word held in the voice record holding unit 3 is reproduced from the voice reproduction unit 4, the user is made to repeatedly utter a voice close to the reproduced voice, and its feature pattern is registered in the specific speaker standard pattern registration unit 12 as the standard pattern for the specific speaker. After that, the Japanese meaning of the English to be uttered is displayed, and the English pronunciation uttered by the user is recognized in the same manner as described above to obtain a recognition result. At this time, it is also effective to temporarily store the uttered English voice, for example, in the voice record holding unit 3.
[0059]
In this educational device, when the recognition results in a misrecognition, the control unit 1 takes the English voice of the word in question out of the voice record holding unit 3 and reproduces it from the voice reproduction unit 4 for the user to hear. The control unit 1 then reproduces from the voice reproduction unit 4 the user's own pronunciation temporarily stored in the voice record holding unit 3, so that the user hears it as well. As a result, the user can clearly grasp the difference between the correct English pronunciation and his or her own pronunciation. In other words, when the voice recognition result is incorrect during learning such as word utterance, the user may have forgotten the pronunciation presented by the device and uttered a completely different word, or the pronunciation may resemble the presented pronunciation but differ from the correct pronunciation made at the time of registration; in either case, with the educational device of FIG. 8 the user can hear the correct pronunciation again.
[0060]
FIG. 10 is a diagram showing a modification of the educational device of FIG. 8. The educational device of FIG. 10 uses the voice recognition means 10 of FIG. 2 as the voice recognition means of the control unit 1 in the educational device of FIG. 8. In the educational device of FIG. 10, when the recognition result of the voice recognition means 10 differs from the instructed voice, the control unit 1 reproduces the instructed voice again to prompt the user to utter it, and when the user utters the voice again, the previously registered standard pattern for voice recognition is rewritten with the feature pattern of that voice.
[0061]
That is, when using the educational device, the user may have to utter a language that he or she does not know, so the utterance may be unstable or mistaken. Instability of utterance can be reduced by using the educational device repeatedly, which stabilizes the utterance; a mistaken utterance, however, requires the original standard pattern to be rewritten.
[0062]
In addition to the above causes, voice recognition errors can also arise from change over time. That is, when time has passed since the voice was registered, correct recognition may fail even though the pronunciation is correct.
[0063]
In the educational device in FIG. 10, the misrecognized voice standard pattern is replaced with a new one, so the above case can be dealt with.
[0064]
Specifically, in the educational device of FIG. 10 as well, a control program for operating the educational device, pronunciation voice data of foreign-language words (for example, model English voice data), the Japanese meanings of the foreign-language words, and the like are loaded from the voice recording unit (recording medium) 7 into the voice record holding unit 3 and stored there.
[0065]
In this educational device as well, when a standard pattern is registered, the voice of the word held in the voice record holding unit 3 is reproduced from the voice reproduction unit 4, the user is made to repeatedly utter a voice close to the reproduced voice, and its feature pattern is registered in the specific speaker standard pattern registration unit 12 as the standard pattern for the specific speaker. After that, the Japanese meaning of the English to be uttered is displayed, and the English pronunciation uttered by the user is recognized in the same manner as described above to obtain a recognition result. At this time, it is also effective to temporarily store the uttered English voice, for example, in the voice record holding unit 3.
[0066]
In this educational device, when the recognition results in a misrecognition, the control unit 1 takes the English voice of the word in question out of the voice record holding unit 3 and reproduces it from the voice reproduction unit 4 for the user to hear.
[0067]
At the same time, the voice recognition means 10 is set to the registration mode. Therefore, when the user utters a voice, its feature pattern is extracted and registered in the specific speaker standard pattern registration unit 12 as a standard pattern. By registering the feature pattern as a standard pattern in this way, the previously registered standard pattern, that is, the standard pattern that caused the present misrecognition, is erased. However, the existing standard pattern does not necessarily have to be erased and rewritten; a new standard pattern obtained by averaging with the existing standard pattern may be registered instead. This can prevent the standard pattern from becoming outdated.
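The two update options described above, overwriting the standard pattern or averaging it with the new utterance, can be sketched as follows. This is a minimal sketch assuming feature patterns are equal-length lists of numbers; the function name update_template and the weight parameter are hypothetical, with weight=0.5 corresponding to a plain average and weight=1.0 to a full rewrite.

```python
def update_template(existing, new_feature, weight=0.5):
    """Blend the existing standard pattern with the newly uttered feature pattern.

    weight is an assumed parameter: 0.5 gives the element-wise average of the two
    patterns, and 1.0 discards the existing pattern and keeps only the new one.
    """
    if len(existing) != len(new_feature):
        raise ValueError("patterns must have the same length")
    return [(1.0 - weight) * old + weight * new
            for old, new in zip(existing, new_feature)]
```

For example, averaging [1.0, 3.0] with [3.0, 1.0] gives [2.0, 2.0], while weight=1.0 simply returns the new pattern.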
[0068]
Further, in the above-described educational device of each configuration example, when the voice recognition means misrecognizes, it cannot be distinguished whether the user uttered the wrong word or the correct word.
[0069]
FIG. 11 is a diagram showing another configuration example of the educational device according to the present invention. The educational device of FIG. 11 is intended to distinguish, when the voice recognition means misrecognizes, whether the user uttered the wrong word or uttered the correct word.
[0070]
That is, in the educational device of FIG. 11, the control unit 1 is configured, for example, as shown in FIG. 12. Referring to FIG. 12, the voice recognition means 10‴ of the control unit 1 includes, in addition to the feature extraction unit 11, the specific speaker standard pattern registration unit 12, the comparison unit 13, and the result output unit 14, a designated word similarity holding unit 17 that holds the similarity obtained by the comparison unit 13 (the similarity, to the standard pattern, of the feature pattern of the voice of the word that gave the recognition result when the recognition result was obtained). Further, the comparison means 20′ of the control unit 1 compares the similarity held in the designated word similarity holding unit 17 of the voice recognition means 10‴ with a threshold TH and judges whether the similarity is larger or smaller than the threshold TH.
[0071]
In the educational device having this configuration, at the time of voice recognition, the control unit 1 instructs the user, for example through the instruction unit 5, as to the voice to be uttered, and then has the user utter it. The voice recognition means 10‴ recognizes the uttered voice, and it is determined whether or not the recognition result corresponds to the voice instructed by the instruction unit 5. In doing so, the voice recognition means 10‴ calculates the similarity between the feature pattern of the voice uttered by the user and the standard pattern of the voice designated to be uttered, and the recognition result is obtained in the same manner as described above. When a correct recognition result is obtained, the same operation as described above is performed. When the recognition is not correct, it is determined whether the calculated similarity is smaller or larger than a predetermined threshold TH. If it is smaller than the threshold TH, the misrecognized voice is reproduced.
[0072]
Specifically, as described above, the voice recognition means may misrecognize either because the user has forgotten the pronunciation presented by the educational device and uttered a completely different word, or because the pronunciation is similar but differs from the correct pronunciation made at the time of registration. In the former case the similarity is lower than in the latter, so the two can be distinguished by the difference in similarity. That is, when the recognition is not correct, the two cases can be distinguished by determining whether the calculated similarity is smaller or larger than the predetermined threshold TH. Once the distinction has been made, in the former case a message such as “Is the word wrong?” is shown to the user, and in the latter case a message such as “Isn't it different from this word?” is shown to the user, the voice of the misrecognized word is played, and then the correct voice is played with the announcement “The correct answer is this”.
[0073]
The operation of the educational device of FIG. 11 will now be described more specifically. In the educational device of FIG. 11 as well, a control program for operating the educational device, pronunciation voice data of foreign-language words (for example, model English voice data), the Japanese meanings of the foreign-language words, and the like are loaded from the voice recording unit (recording medium) 7 into the voice record holding unit 3 and stored there.
[0074]
In this educational device as well, when a standard pattern is registered, the voice of the word held in the voice record holding unit 3 is reproduced from the voice reproduction unit 4, the user is made to repeatedly utter a voice close to the reproduced voice, and its feature pattern is registered in the specific speaker standard pattern registration unit 12 as the standard pattern for the specific speaker. After that, the Japanese meaning of the English to be uttered is displayed, and the English pronunciation uttered by the user is recognized in the same manner as described above to obtain a recognition result. At this time, it is also effective to temporarily store the uttered English voice, for example, in the voice record holding unit 3.
[0075]
In this educational device, when the user utters a word in order to check whether his or her pronunciation is correct, the speech recognition means 10 ′ ″ extracts the feature pattern of the input word's voice, and this feature pattern is first compared with the first standard pattern for unspecified speakers. The similarity between the two and the word name of the standard pattern giving that similarity are temporarily stored in the designated word similarity holding unit 17, and the feature pattern of the input speech is then compared with the next standard pattern for specific speakers. When the similarity to this standard pattern is greater than the similarity to the previous standard pattern, the previously stored entry is deleted, and the current similarity and the word name of the standard pattern giving it are stored. On the other hand, when the current similarity is smaller, the current entry is simply discarded and the next standard pattern is taken out. However, when the word name of the standard pattern being collated is the same as the word output from the storage unit through the voice reproduction unit 4, its similarity is stored in the same storage unit regardless of its value.
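The scan described above can be sketched roughly as follows. This is only an illustration: the function names, the fixed-length feature vectors, and the similarity measure `1 / (1 + distance)` are assumptions made for the example and are not specified in the description.

```python
import math
from typing import Dict, List, Optional, Tuple


def similarity(a: List[float], b: List[float]) -> float:
    """Hypothetical similarity: larger means closer (assumed, not from the source)."""
    return 1.0 / (1.0 + math.dist(a, b))


def scan_patterns(
    input_features: List[float],
    patterns: Dict[str, List[float]],
    target_word: str,
) -> Tuple[Optional[Tuple[str, float]], Optional[float]]:
    """Compare the input with each standard pattern in turn.

    Keeps only the best (word, similarity) seen so far, and additionally
    keeps the similarity for the designated word regardless of its value.
    """
    best: Optional[Tuple[str, float]] = None
    target_similarity: Optional[float] = None
    for word, template in patterns.items():
        s = similarity(input_features, template)
        if word == target_word:
            target_similarity = s  # stored regardless of how small it is
        if best is None or s > best[1]:
            best = (word, s)       # replace the previously stored entry
        # otherwise the current entry is discarded and the scan continues
    return best, target_similarity
```

Keeping the designated word's similarity alongside the running best is what later allows a misrecognition to be examined against the threshold TH.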
[0076]
In this way, the feature pattern of the input speech is compared with each standard pattern for specific speakers in order and the similarities are obtained, and the standard pattern giving the highest similarity, that is, the word (word name) with the highest similarity remaining in the storage unit, is taken as the recognition result. If this recognition result is incorrect, the similarity stored together with the correct word name is compared with the threshold TH. If the similarity is lower than the threshold TH, the message “Is the word wrong?” is displayed. If it is higher than the threshold TH, the message “Is it wrong?” is displayed, and the voice of the misrecognized word is taken out from the voice record holding unit 3 and output from the voice reproduction unit 4.
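A minimal sketch of this decision is shown below. The function name, the returned labels, and the handling of a similarity exactly equal to TH are assumptions for illustration; the description only states the "lower" and "higher" cases.

```python
from typing import Optional, Tuple


def feedback(
    recognized_word: str,
    target_word: str,
    target_similarity: float,
    th: float,
) -> Tuple[str, Optional[str]]:
    """Return (kind, word_to_replay) for one recognition attempt.

    kind is one of:
      "correct"       - the recognition result matches the designated word
      "wrong_word"    - similarity to the correct word is below TH
      "mispronounced" - similarity to the correct word is at or above TH
                        (treating equality this way is an assumption)
    """
    if recognized_word == target_word:
        return "correct", None
    if target_similarity < th:
        # The user probably uttered a completely different word.
        return "wrong_word", None
    # The user probably meant the right word but pronounced it differently;
    # the misrecognized word's voice is replayed for comparison.
    return "mispronounced", recognized_word
```

For example, `feedback("dog", "cat", 0.7, 0.5)` yields `("mispronounced", "dog")`, so the device would replay the voice for "dog" as the word the utterance was mistaken for.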
[0077]
As a result, the user (speaker) notices his or her mistake. Alternatively, by learning which words are easily mispronounced, the user can avoid mispronouncing them.
[0078]
As for how to determine the threshold TH, a value of about 1/2 to 2/3 of the similarity obtained between a standard pattern of the specific-speaker method and the feature pattern of a correctly uttered input voice is appropriate.
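As a small worked illustration of this guideline, the range for TH can be computed from a reference similarity. The function name and the example value 0.9 are hypothetical; how the reference similarity is measured is not specified here.

```python
from typing import Tuple


def threshold_range(reference_similarity: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds for TH as 1/2 and 2/3 of the reference.

    reference_similarity is assumed to be the similarity obtained between a
    specific-speaker standard pattern and a correctly uttered input.
    """
    lower = reference_similarity / 2.0
    upper = reference_similarity * 2.0 / 3.0
    return lower, upper


# Example: with a reference similarity of 0.9, TH would be chosen
# somewhere between about 0.45 and about 0.6.
low, high = threshold_range(0.9)
```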
[0079]
In this way, when the speech recognition result is a misrecognition, it becomes possible to distinguish whether the user uttered a wrong word or the correct word, and to inform the user accordingly.
[0080]
FIG. 13 is a diagram showing another configuration example of the educational device according to the present invention. The educational device of FIG. 13 is intended to show model characters or pictures and have the practitioner answer in words, thereby allowing the practitioner to learn both the meaning and the pronunciation of words.
[0081]
That is, in the educational device of FIG. 13, an audio / image record holding unit (memory) 23 is provided in place of the voice record holding unit 3 of, for example, the educational devices of FIGS. 1 and 5. The audio / image record holding unit 23 stores a program for operating the apparatus, foreign-language pronunciation voice data of words, and the corresponding images (characters and pictures). Note that the example of FIG. 13 corresponds to FIG. 1 (the instruction unit 5 is provided).
[0082]
In the configuration example of FIG. 13, when, for example, a picture is specified from the instruction unit (for example, a keyboard) 5, the picture is displayed on the display unit (display) 6, the voice data corresponding to this picture (the foreign-language pronunciation) is read from the audio / image record holding unit 23, and the foreign-language pronunciation is output from the voice reproduction unit (speaker) 4. Thus, the user can listen to the foreign-language pronunciation corresponding to the picture through the speaker 4 while viewing the picture on the display unit 6. As an example, assume that the foreign language is English and the number of words is five, and that the five words are “dog”, “cat”, “bird”, “horse”, and “cow”. When the program is started, a picture of a dog, the first word, is displayed on the display unit 6, and the pronunciation “dog” is output from the voice reproduction unit (speaker) 4. The user imitates this and pronounces “dog” toward the voice input unit (microphone) 2. If necessary, this may be repeated a plurality of times.
[0083]
The voice input from the voice input unit (microphone) 2 is A / D converted. One part is input to the voice recognition means 10, and the other part may be stored in the audio / image record holding unit (memory) 23 as a voice waveform for playback. It is not always necessary to branch the audio signal in this way. The voice waveform may instead be stored as it is in the audio / image record holding unit (memory) 23 and input to the voice recognition means 10 when necessary. The amount of memory used can be reduced if the voice is converted into feature amounts before being stored. The voice recognition means 10 uses the input voice for voice registration for specific-speaker voice recognition and creates a standard pattern.
[0084]
These operations are repeated until all five words have been pronounced. When a test mode is then selected from the instruction unit (keyboard) 5, an animal picture is displayed on the display unit (display) 6, and the device enters a state of waiting for recognition.
[0085]
The user tries to say in English what is shown in the picture displayed on the display unit 6. The voice uttered in English is input to the voice recognition means 10, which recognizes it as one of the five registered words and outputs the recognition result to the comparison means 20. There, the word name sent earlier is compared with the recognition result, and the display unit 6 informs the user of the result. In this way, pictures can be displayed sequentially on the display unit 6, the user can utter the corresponding foreign-language words one after another, and the user can be informed each time whether the pronunciation is correct. The pictures may be displayed in the order of registration, in the reverse order of registration, or in random order.
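The test-mode flow and the three display orders can be sketched as follows. All names are hypothetical, and the `recognize` callable merely stands in for showing a picture, capturing the utterance, and running the voice recognition means 10; it is an assumption of this example.

```python
import random
from typing import Callable, List, Optional, Tuple


def presentation_order(
    words: List[str], mode: str, rng: Optional[random.Random] = None
) -> List[str]:
    """Order in which pictures are shown: registration, reverse, or random."""
    if mode == "registration":
        return list(words)
    if mode == "reverse":
        return list(reversed(words))
    if mode == "random":
        shuffled = list(words)
        (rng or random.Random()).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown mode: {mode}")


def run_test(
    words: List[str], mode: str, recognize: Callable[[str], str]
) -> List[Tuple[str, str, bool]]:
    """For each displayed word, compare its name with the recognition result."""
    results = []
    for shown in presentation_order(words, mode):
        recognized = recognize(shown)  # stand-in for display + utterance + recognition
        results.append((shown, recognized, recognized == shown))
    return results
```

The comparison `recognized == shown` plays the role of the comparison means 20, which checks the word name sent earlier against the recognition result.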
[0086]
In this way, the educational device of FIG. 13 can show model characters and pictures and have the practitioner answer in words, so that the practitioner can learn both the meaning and the pronunciation of the words.
[0087]
In the configuration example of FIG. 13, the voice recognition means 10 has, for example, the same configuration as that shown in FIG. 2. The input voice is converted into feature amounts by the feature extraction unit 11, and at the time of voice registration the converted feature amounts are stored directly in the standard pattern registration unit 12 for specific speakers. At the time of recognition, on the other hand, the unknown input speech is converted into feature amounts by the feature extraction unit 11, the comparison unit 13 calculates the similarity with each of the standard patterns registered in advance, and the pattern with the highest similarity is output from the result output unit 14 as the recognition result.
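The registration and recognition paths described above might be sketched like this. The class and method names are hypothetical, feature extraction is assumed to have already produced fixed-length vectors, and the similarity measure is an assumption; real utterances vary in length and would typically need some form of time alignment before comparison.

```python
import math
from typing import Dict, List, Optional


class SpecificSpeakerRecognizer:
    """Toy stand-in for the registration unit 12, comparison unit 13,
    and result output unit 14 (illustrative only)."""

    def __init__(self) -> None:
        self.patterns: Dict[str, List[float]] = {}

    def register(self, word: str, features: List[float]) -> None:
        # At registration, the feature amounts are stored directly
        # as the standard pattern for that word.
        self.patterns[word] = list(features)

    def recognize(self, features: List[float]) -> Optional[str]:
        # At recognition, compute the similarity to every registered
        # pattern and output the word with the highest similarity.
        best_word: Optional[str] = None
        best_similarity = -1.0
        for word, template in self.patterns.items():
            s = 1.0 / (1.0 + math.dist(features, template))  # assumed measure
            if s > best_similarity:
                best_word, best_similarity = word, s
        return best_word
```

With patterns registered for "dog" as `[1.0, 0.0]` and "cat" as `[0.0, 1.0]`, an input of `[0.8, 0.2]` is closer to the "dog" pattern and is recognized as "dog".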
[0088]
In the above example, the foreign language is English and the number of words is five. However, the foreign language may be other than English, and the number of words may be any number. The above example describes the correspondence between a picture and a foreign-language word, but the present invention is not limited to this. For example, it is also possible to associate a national flag with a country name, a company emblem with a company name, a face with a person's name, a kanji with its reading, or a position on a map with a place name. In addition, if the picture is replaced with a moving picture, the device is effective for teaching the correspondence between visual and auditory information, as in learning sign language.
[0089]
Although words are used in the above example, it goes without saying that sentences may be used instead. Also, in the above example of FIG. 13, the instruction unit 5 is a keyboard, and commands are input and selected from the keyboard. However, the instruction unit 5 is not necessarily a keyboard. For example, a floppy disk or the like may be used, with control performed by a program stored on the floppy disk.
[0090]
[Effect of the invention]
As explained above, according to the inventions described in claims 1 to 3, it is possible to provide an educational device that presents a model voice, judges the correctness of the practitioner's pronunciation, and allows the practitioner to acquire both the meaning and the pronunciation of words at once.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of an educational device according to the present invention.
FIG. 2 is a diagram showing a configuration example of voice recognition means.
FIG. 3 is a flowchart for explaining a processing operation of the educational device in FIG. 1.
FIG. 4 is a flowchart for explaining a processing operation of the educational device in FIG. 1.
FIG. 5 is a diagram showing another configuration example of the educational device according to the present invention.
FIG. 6 is a diagram showing another configuration example of voice recognition means.
FIG. 7 is a diagram showing another configuration example of the educational device according to the present invention.
FIG. 8 is a diagram showing another configuration example of the educational device according to the present invention.
FIG. 9 is a diagram showing another configuration example of voice recognition means.
FIG. 10 is a diagram showing a modification of the educational device.
FIG. 11 is a diagram showing another configuration example of the educational device according to the present invention.
FIG. 12 is a diagram illustrating a configuration example of a control unit of the educational device in FIG. 11.
FIG. 13 is a diagram showing another configuration example of the educational device according to the present invention.
[Explanation of symbols]
1 Control unit
2 Voice input part
3 Voice record holding part
4 Audio playback part
5 Instruction unit
6 Display section
7 Voice recording part
10, 10 ′, 10 ″, 10 ′ ″ speech recognition means
11 Feature extraction unit
12 Standard pattern registration section for specific speakers
13 Comparison part
14 Result output section
15 Standard pattern registration section for unspecified speakers
20, 20 ′ Comparison means
23 Audio / Image Record Holding Unit

Claims

1. An educational device comprising:
  record holding means for recording a model voice and presentation information related to the model voice;
  presenting means for presenting the presentation information recorded in the record holding means;
  voice reproduction means for reproducing the model voice recorded in the record holding means;
  voice input means for receiving an uttered voice;
  voice recognition means for comparing the voice input by the voice input means with the model voice recorded in the voice recording means, and recognizing whether or not the voice input by the voice input means is similar to the model voice; and
  registration means for recording, when the voice recognition means recognizes that the voice input by the voice input means is similar to the model voice, the voice input by the voice input means or a feature pattern of that voice in the record holding means as determination information in association with the model voice,
  wherein the voice input means further receives a voice uttered on the basis of the presentation information presented by the presenting means, and
  the voice recognition means further determines whether the uttered voice or a feature pattern of that voice is similar to the determination information stored in advance, thereby judging whether the uttered voice is correct or incorrect.

2. The educational device according to claim 1, wherein, when the voice recognition means recognizes that the voice input by the voice input means is not similar to the model voice, the voice reproduction means reproduces the model voice again.

3. The educational device according to claim 1 or 2, wherein, when the voice recognition means judging the correctness of the uttered voice judges the uttered voice to be incorrect, the voice corresponding to the presentation information is read from the record holding means and reproduced by the voice reproduction means.