JP2004126143A

JP2004126143A - Voice recognition device and voice recognition program

Info

Publication number: JP2004126143A
Application number: JP2002288933A
Authority: JP
Inventors: Yoshiharu Abe; 阿部　芳春; Hiroyasu Goi; 伍井　啓泰
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-10-01
Filing date: 2002-10-01
Publication date: 2004-04-22

Abstract

PROBLEM TO BE SOLVED: To reduce recognition errors for words or word strings that appear infrequently in the training sentences of a language model used in voice recognition.

SOLUTION: A voice recognition device converts an input voice 1001 uttered by a user into feature vectors with a voice analysis means 2001, converts them into a maximum-likelihood syllable string with a syllable string recognition means 3001, and converts the syllable string into a word string with a word string search means 4001 that refers to the word-chain statistics stored in a language model 4003. The device is additionally provided with a short word language model 4004, in which statistics of isolated words are stored, and an utterance length estimating means 4005, which estimates the number of syllables in the syllable string output from the syllable string recognition means 3001. For a short utterance input, the language model switching means 4006 switches the language model that the word string search means 4001 refers to from the language model 4003 to the short word language model 4004. This reduces recognition errors for short words and for word strings re-uttered during correction work and the like.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識技術に関し、特に音声認識による文章入力技術等に関するものである。
【０００２】
【従来の技術】
利便性や特別な訓練が不要であることなどから、音声入力による文書作成への期待は極めて高く、音声認識による日本語の文章入力ソフトウェアが各社から市販され注目を浴びている。
【０００３】
音声を文字に変換する口述筆記のための音声認識技術として、非特許文献１には、「認識誤り傾向の確率モデルを用いた２段階探索法による大語彙連続音声認識」の技術が開示されている。
【０００４】
非特許文献２では、辞書に登録されていない単語（未知語）を細かい音節単位にて認識する技術が開示されている。
【０００５】
特許文献１では、単語のモデルに無い複合語を入力する際ユーザを支援する技術が開示されている。
【０００６】
これらのソフトウェアをコンピュータで動作させることで実現される従来の音声認識文章入力装置は、入力音声を分析して特徴ベクトル時系列を出力する音声分析手段と、例えばトライフォンＨＭＭからなる音響モデルと、音響モデルを参照して音声分析手段の特徴ベクトル時系列から入力音声に対応する音節列を認識する音節列認識手段と、音節列認識手段の認識誤り傾向を記憶する差分モデルと、例えば単語のＮ−ｇｒａｍ統計量からなる言語モデルと、この言語モデルを参照して音節列認識手段の出力音節列から入力音声を最もよく近似する単語の列を探索する単語列探索手段と、単語列探索手段の出力単語列の文字を一時記憶するテキストバッファと、テキストバッファに記憶された文字を表示する表示手段と、ユーザ操作によってテキストバッファ中の文字を修正する修正手段と、テキストバッファから抽出されるユーザが作成したユーザテキストが格納される記憶手段とで構成される。
言語モデルとしては、単語連鎖の統計量に基づくＮ−ｇｒａｍモデル（例えばＮ＝３）が用いられ、その統計量は、新聞やＷｅｂ（インターネット上の情報資源）などの大量のテキストから学習される。
【０００７】
この構成の従来の音声認識文章入力装置において、ユーザの音声は単語列探索手段によって言語モデルに記憶された言語モデルに従って単語列に変換されその文字がテキストバッファに一時記憶されると同時に、表示手段によってユーザに表示される。ユーザは表示手段の表示に基づいて、ユーザ操作によりテキストバッファに一時記憶されたテキスト中の認識誤りを修正して最終的に所望のユーザテキストを得ることが可能である。ユーザ操作としては、キーボードによる入力のほかに音声による入力が可能である。
【０００８】
しかし、このような従来の音声認識文章入力装置では、単語列探索手段が出力する単語列は、言語モデルに基づき計算される確率の大きな単語列であるため、言語モデルの学習用例文に現れない単語列を認識する可能性は小さい。このため、「音声認識をするなど」という発声に対しては、「音声を認識する。」と誤って認識する可能性が高い（この例では、「など」を「。（マル）」と誤認識している。また、助詞「を」の位置が異なる）。認識誤りに対して、ユーザはユーザ操作により修正手段を介してテキストバッファ内の文字に対して修正を行う。修正結果は表示手段に表示される。修正方法として、修正部分を再発声して、認識結果を上書きすることによる修正が可能である。この場合、再発声された音声は言語モデルの統計量に基づいて、単語列探索手段により最も確率の高い単語列が選択されて認識結果となるため、言語モデルの学習テキストに出現しない単語列は、正しく認識される可能性は低い。したがって、上記の例で、「。」を修正するため、「など」と再発声しても、単語「など」が学習用文例の中に孤立した単語として出現することはまれであるため、正しく認識される可能性は低い。
【０００９】
また、上記のような誤認識部分の再発声による修正に限らず、通常の音声入力の場合でも、言語モデルの学習用文例に孤立して出現することがまれな単語を発声して、入力する場合も、同様に、正しく、認識される可能性が低下する。
【００１０】
【特許文献１】
特表平０７−５０７８８０号公報
【非特許文献１】
電子情報通信学会論文誌ＶＯＬ．Ｊ８３−Ｄ−ＩＩ，Ｎｏ．１２，ＰＰ．２５４５−２５５３、２０００年１２月発行
【非特許文献２】
伊藤克亘他著「被覆率を重視した大規模連続音声認識用統計言語モデル」，日本音響学会講演論文集，ｐｐ．６５−６６、１９９９年３月発行
【００１１】
【発明が解決しようとする課題】
このような従来の構成の音声認識装置では、誤認識部分の再発声や通常の音声入力のとき、言語モデルの学習文例において孤立して出現する頻度の小さい単語や単語列を入力しても、正しく認識する可能性が低く、音声入力の能率が悪くなるという技術的課題があった。
【００１２】
また、このような従来の構成の音声認識装置では、単語の言語モデルを用いているため、言語モデルに存在しない単語を音声入力することが困難であるという技術的課題があった。これに対して、言語モデルに存在しない単語を音節単位に認識する従来手法があるが、音節の認識結果が正解でない場合に所望の単語を音声入力することが困難であった。
【００１３】
この発明は上記のような技術的課題を解決するためになされたもので、音声認識において、通常の音声入力や誤認識部分の再発声による修正に際して、言語モデルの学習用文例における出現頻度の小さい単語や単語列の認識誤りを減少させることが可能な音声認識技術を提供することを目的とする。
【００１４】
また、この発明は、言語モデルに存在しない単語を容易に音声入力することで認識結果の編集や訂正等の作業効率の向上を可能とした音声認識技術を提供することを目的とする。
【００１５】
【課題を解決するための手段】
この発明に係る音声認識装置は、単語連鎖の統計量を記憶した第１言語モデル記憶手段と、短い単語の統計量または文章の部分から切り出された単語の統計量またはサブワードの統計量を記憶した第２言語モデル記憶手段と、入力音声の発声長を推定する発声長推定手段と、前記発声長推定手段にて判別された前記発声長の長短に応じて前記第１言語モデルまたは第２言語モデルに記憶された前記統計量を選択的に参照して前記入力音声を文字列に変換する探索手段と、を含むものである。
【００１６】
この発明に係る音声認識プログラムは、コンピュータを、単語連鎖の統計量を記憶した第１言語モデル記憶手段と、短い単語の統計量または文章の部分から切り出された単語の統計量またはサブワードの統計量を記憶した第２言語モデル記憶手段と、入力音声の発声長を推定する発声長推定手段と、前記発声長推定手段にて判別された前記発声長の長短に応じて前記第１言語モデルまたは第２言語モデルに記憶された前記統計量を選択的に参照して前記入力音声を文字列に変換する探索手段として機能させるものである。
【００１７】
【発明の実施の形態】
以下、この発明の実施の一形態を説明する。
実施の形態１．
図１はこの発明の実施の形態１を示す音声認識装置の構成の一例を示すブロック図である。なお、以下に例示される音声認識装置は、たとえば中央処理装置、主記憶、外部記憶装置、ディスプレイ、キーボード、マウス等のハードウェアおよびこれらを動作させるプログラムで構成されるコンピュータシステムで構成され、後述の各手段や操作は上述のハードウェアおよびプログラム等にて実現することができる。
【００１８】
図１において、２００１は、入力音声１００１を分析して特徴ベクトル時系列を出力する音声分析手段、３００２は例えばトライフォンＨＭＭからなる音響モデル、３００１は音響モデル３００２を参照して音声分析手段２００１からの特徴ベクトル時系列から入力音声１００１に対応する尤度最大の音節列を認識する音節列認識手段、４００２は音節列認識手段の認識誤り傾向を記憶する差分モデル、４００３は単語のＮ−ｇｒａｍからなる言語モデル、４００４は短単語言語モデル、４００５は音節列認識手段３００１からの出力に基づいて、入力音声１００１の発声長を推定する発声長推定手段、４００６は発声長推定手段４００５の判定により言語モデル４００３（第１言語モデル記憶手段）または短単語言語モデル４００４（第２言語モデル記憶手段）を切替える言語モデル切替手段、４００１は言語モデル切替手段４００６で切替えられた言語モデル４００３または短単語言語モデル４００４と差分モデル４００２とを参照して音節列認識手段３００１が出力した音節列を最もよく近似する単語の列を探索する単語列探索手段、５００１は単語列探索手段４００１の出力単語列の文字を一時記憶するテキストバッファ、６００１はテキストバッファ５００１に記憶された文字を表示する表示手段、６００３はユーザ操作６００２によってテキストバッファ５００１中の文字を修正する修正手段、７００１はテキストバッファから抽出されるユーザが作成したユーザテキストである。
【００１９】
言語モデル４００３としては、単語連鎖の統計量に基づくＮ−ｇｒａｍ（例えばＮ＝３）が用いられる。言語モデル４００３の統計量は、新聞やＷｅｂなどの大量のテキストから学習される。一方、短単語言語モデル４００４は、孤立した単語の統計量（Ｎ−ｇｒａｍ）が記憶されている。単語としては、通常の名詞に限らず、孤立して発声される可能性のある、格助詞「が」、副助詞「など」なども含まれる。
【００２０】
次に動作について説明する。図１６は本実施の形態の音声認識装置の作用の一例を示すフローチャートである。
【００２１】
この構成において、ユーザの発声した入力音声１００１は（ステップＳＴ１０１）、音声分析手段２００１で特徴ベクトルに変換され（ステップＳＴ１０２）、さらに音節列認識手段３００１で尤度最大の音節列に変換される（ステップＳＴ１０３）。いま、ユーザが、「音声を認識するなど」と発声したとする。このとき、音節列認識手段３００１の出力は、図２の発声Ａ“おんせえおにんしきするなど”に対して８割程度の認識誤りを含むため図２の音節列Ａのような“おんせんにんちするまど”となる。発声長推定手段４００５は、音節列認識手段３００１の出力の音節列に含まれる音節数（この場合は音節列Ａに含まれる音節数１１）が所定の値（この場合は３）を超えている（ステップＳＴ１０４）。このため、発声長推定手段４００５は、言語モデル切替手段４００６の切替器４００６ａを言語モデル４００３に切替える（ステップＳＴ１０５）。この結果、単語列探索手段４００１は、言語モデル４００３に記憶された単語連鎖の統計量を参照して、音節列認識手段３００１出力の音節列（この場合音節列Ａ）を単語列に変換し（ステップＳＴ１０６）、その文字がテキストバッファ５００１に一時記憶されると同時に、図３に例示される表示手段６００１によってユーザに表示される（ステップＳＴ１０７）。ユーザは表示手段６００１の表示に基づいてテキストバッファ５００１に一時記憶されたテキスト中の認識誤りを修正して（ステップＳＴ１０８、ステップＳＴ１０９）、最終的に所望のユーザテキスト７００１を得ることが可能である。
【００２２】
図３に例示される表示手段６００１は、たとえばディスプレイ等からなり、テキスト表示枠６００１ａ、編集ボタン枠６００１ｂ、カーソル６００１ｃ、マウスポインタ６００１ｄ、等で構成されている。編集ボタン枠６００１ｂには、保存ボタン６００１ｂ−１、貼付ボタン６００１ｂ−２、コピーボタン６００１ｂ−３等が表示されており、マウスポインタ６００１ｄにて各ボタンを指示することで、認識結果の表示テキストの各種編集操作が可能になっている。
【００２３】
図３は、上記発声Ａに対して、言語モデル４００３に基づいて得られる単語列Ａの表示例を示す。この場合、「音声を認識するなど」を「音声を認識する。」と誤って認識した様子を示す（この例では、「など」を「。（マル）」と誤認識している）。
【００２４】
この場合、認識誤りに対して、ユーザはユーザ操作６００２により修正手段６００３を介してテキストバッファ５００１内の文字に対して修正を行う。修正結果は表示手段６００１に表示される。修正方法として、修正部分を再発声して（ステップＳＴ１１０）、認識結果を上書きすることによる修正が可能である。このため、「。」を１字削除して図４の状態にする。その後、修正のため、「など」と再発声する（ステップＳＴ１０１）。この場合、再発声された入力音声は、音声分析手段２００１により、特徴ベクトルに変換され（ステップＳＴ１０２）、音節列認識手段３００１により、“まど”と音節列（図２の音節列Ｂ）に変換される（ステップＳＴ１０３）。この場合の音節数は２であり所定の値３未満であるため（ステップＳＴ１０４）、発声長推定手段４００５は言語モデル切替手段４００６に対して切替器４００６ａを短単語言語モデル４００４に切替えるよう指示する（ステップＳＴ１１１）。このため、単語列探索手段４００１は短単語言語モデル４００４を参照して音節列Ｂに対応する単語列を探索してその文字をテキストバッファ５００１に書き込む（ステップＳＴ１０６）。この結果は、表示手段６００１に“など”という単語がカーソル６００１ｃの直後に追加表示され、図５の表示が得られる（ステップＳＴ１０７）。
【００２５】
このように、「など」という発声については、発声長推定手段４００５が音節列認識手段３００１の出力の音節列の長さを所定の値と比較して短いと判断するため、短単語言語モデル４００４の統計量に基づいて、単語列探索手段４００１により最も確率の高い単語列が選択されて認識結果となるため、言語モデルの学習テキストには現れないかもしくは低頻度である単語列であっても、正しい“など”が認識される可能性が高まる。したがって、上記の例で、「。」を修正するため、「など」と再発声すると、「など」が通常の言語モデル４００３の学習テキストに無くても、短単語言語モデル４００４が参照されるため、正しく認識される可能性が高い。一方、従来のように発声長によらず常に言語モデル４００３を適用する場合は、“など”と再発声しても、依然として、単語“。”が認識されてしまい、ユーザの入力効率を著しく損なうことになる。
【００２６】
以上のように、本実施の形態では、発声長推定手段４００５にて発声長の長短を判別し、発声長が所定の値より短いとき、通常の言語モデル４００３の代わりに短単語言語モデル４００４を参照して単語列を探索するようにしているので、再発声による短い単語の認識性能が高くなり、たとえば認識結果の部分的な訂正作業の効率が向上し、より能率的に音声入力を実行することができる。
【００２７】
実施の形態２．
上述の実施の形態１では、入力音声の発声長が短いときに言語モデルを短単語言語モデルとするようにしたものであるが、次にさらに同音語記憶手段を追加する実施の形態を示す。
【００２８】
図６は、このような場合の構成図である。また、図１７は、本実施の形態の音声認識装置の作用の一例を示すフローチャートである。なお、前述の実施の形態１と同一の構成要素は同一の符号を付している。
【００２９】
図において、６２０１は音節列から引けるように同音語の文字を記憶した同音語記憶手段、６２０２は単語列探索手段の内部出力に含まれる単語列の音節データに基づいて、同音語記憶手段６２０１から同音語の候補を選択する同音語生成手段、６２０３は同音語生成手段６２０２が生成した同音語の候補を表示する同音語表示手段である。図７は表示手段６００１内における同音語表示手段６２０３の表示例を示す説明図である。
【００３０】
このような構成において、ユーザは、発声Ａに対して、“音声を認識する。”と誤認識したため、“。”を削除し（ステップＳＴ１０９、ステップＳＴ１１０）、短い発声Ｂをすると、音節列Ｂに対して、単語列探索手段４００１は、内部データとして、＜文字＝“など”、音節列＝“など”＞という単語列の情報を出力する（ステップＳＴ１０１〜ステップＳＴ１０４、ステップＳＴ１１１、ステップＳＴ２０１）。同音語生成手段６２０２は内部データに含まれる＜音節列＝など＞というデータ項目を得て同音語記憶手段６２０１に記憶された同音語データの中から＜音節列＝など＞を含む同音語の候補を得て同音語表示手段６２０３に表示する（ステップＳＴ２０２）。すなわち、本例では、図７に示すように同音語表示手段６２０３には“等”、“など”、“‥”、“ナド”の４候補が表示される。ユーザは候補番号をキーボードの数字キーで選ぶか、マウスポインタ６００１ｄで所望の候補をポイントすることにより所望の文字を選択することが可能であり（ステップＳＴ２０３）、選択された候補がカーソル位置に出力される（ステップＳＴ２０４）。
【００３１】
このように本実施の形態によれば、短い発声に対して、同音語表示手段６２０３にて同音語表示を行うので、短単語言語モデル４００４に同一の読みと表記が一致する単語がなくても、所望の候補を入力できる可能性が高まる。
【００３２】
実施の形態３．
上述の実施の形態１および実施の形態２では、カーソル６００１ｃの表示位置で示される文字挿入位置に関わらず発声長により、言語モデルを切替えるものであるが、本実施の形態３では、カーソル６００１ｃの表示位置で示される文字挿入位置に依存して、言語モデルを切替える例について説明する。
【００３３】
図８は、本実施の形態の構成図を示す。また、図１８は、本実施の形態の音声認識装置の作用の一例を示すフローチャートである。なお、前述の各実施の形態と同一の構成要素は同一の符号を付している。図８において、４００４ｂは文の部分を単語として切出しこれら部分単語の統計量を記憶した部分発声言語モデル、４００５ｂはカーソル６００１ｃの表示位置で示される文字挿入位置が文外にあるか、文中にあるかを判定する文字挿入位置判定手段である。
【００３４】
図９は、ユーザが前掲の図２の発声Ａ“おんせえおにんしきするなど”を行ったときに表示された認識テキストの例（“音声認識する。”と表示されている）である。この場合、助詞“を”が脱落しているため、ユーザは、助詞“を”を挿入するため、カーソル６００１ｃを認識直後のカーソル位置Ｐ０からカーソル位置Ｐ１まで移動する。その後、ユーザが図２の発声Ｃ“お”を行うと、文字挿入位置判定手段４００５ｂは、カーソル位置Ｐ１で示される文字挿入位置が、表示テキストの文内にあると判定し（ステップＳＴ３０１）、言語モデル切替手段４００６は切替器４００６ａを部分発声言語モデル４００４ｂ（第２言語モデル記憶手段）に切替える（ステップＳＴ１１１）。このため、単語列探索手段４００１は、部分発声言語モデル４００４ｂを参照して認識結果を作成するので（ステップＳＴ３０２）、単語列“を”が得られる可能性が高くなる。単語列探索手段４００１の出力は、その文字がテキストバッファ５００１に書き込まれ、表示手段６００１の表示は、カーソル位置Ｐ１（文字挿入位置）の直前に助詞“を”が挿入された表示となる（ステップＳＴ３０３）。
【００３５】
このように、本実施の形態によれば、編集操作による移動後のカーソル位置Ｐ１（文字挿入位置）が表示テキストの文内にあることを判定して、単語列探索で参照する言語モデルを部分発声言語モデル４００４ｂに切替えるので、一般文の言語モデル４００３では認識し難い部分発声を高い精度で入力できる。
【００３６】
実施の形態４．
この実施の形態４では、文字削除など特定のユーザ操作の直後だけ、上述の実施の形態１または実施の形態２の短単語処理を有効とする。
【００３７】
図１０は、本実施の形態の音声認識装置の構成図を示す。また、図１９は、本実施の形態の音声認識装置の作用の一例を示すフローチャートである。なお、前述の各実施の形態と同一の構成要素は同一の符号を付している。
【００３８】
図１０において、６００４はユーザ操作６００２によって生じるイベントを記憶するユーザ操作記憶手段である。また、４００５ｃは、ユーザ操作記憶手段６００４に記憶されたイベントを参照して、直前のユーザ操作がテキストバッファ５００１中の連続するｎ文字（ｎは自然数）の削除操作であった場合に、言語モデル切替手段４００６の切替器４００６ａを部分発声言語モデル４００４ｂ側に切り替える、ユーザ操作判定手段である。
【００３９】
このような構成において、ユーザが図２の発声Ａ“おんせえおにんしきするなど”を行ったときに、表示手段６００１に表示された認識テキストの例（“音声の認識する。”と表示されている）を図１１に示す。この場合、助詞“を”が助詞“の”に誤って認識されている。このため、ユーザは、助詞“の”を削除するため、カーソル６００１ｃを認識直後のカーソル位置Ｐ０から、カーソル位置Ｐ２まで移動する。さらに、ユーザは、削除キーを用いて、助詞“の”を削除する。図１２はこのときの表示を示す。カーソル位置Ｐ３は、“の”の削除後の位置にある。
【００４０】
ここまでのユーザによるカーソル移動および削除キーの操作はソフトウェアのイベントを発生することにより処理されており、これらのイベントはすべてユーザ操作記憶手段６００４に記憶されている。この状態で、ユーザが図２の発声Ｃ“お”を行うと、ユーザ操作判定手段４００５ｃは、直前のユーザ操作が１文字の削除であったと判定し（ステップＳＴ４０１）、さらに音節数が既定値よりも短いため（ステップＳＴ１０４）言語モデル切替手段４００６は切替器４００６ａを部分発声言語モデル４００４ｂに切替える（ステップＳＴ１１１）。このため、単語列探索手段４００１は、部分発声言語モデル４００４ｂを参照して認識結果を作成するので（ステップＳＴ４０２）、目的の単語列“を”が得られる可能性が高くなる。単語列探索手段４００１の出力は、その文字がテキストバッファ５００１に書き込まれ、表示手段６００１の表示は、図１３に示すようにカーソル位置Ｐ４の直前に助詞“を”が挿入された表示となる（ステップＳＴ４０３）。図１３では助詞“を”が正しく認識された例を示している。
【００４１】
このように、本実施の形態によれば、ユーザ操作により文字が削除された直後であることを判定して、単語列探索で参照する言語モデルを部分発声言語モデルに切替えるので、一般文言語モデルでは認識し難い部分発声を高い精度で入力できる。
【００４２】
なお、１文字の削除について説明したが、一般にｎを自然数としてｎ文字の削除であってもよい。また、削除と同じ効果をショートカットキーとして登録してあった場合も、ユーザ操作判定手段で直前のイベントを解析する部分の変更によって、同様の効果を奏する。
【００４３】
実施の形態５．
本実施の形態は、上記実施の形態４において、直前に削除された文字数が所定の値以下であるとき、実施の形態１または実施の形態２の短単語処理を有効とする。
【００４４】
すなわち、本実施の形態５では、上記実施の形態４の説明において、文字削除後、この状態で、ユーザが発声を行うと、ユーザ操作判定手段４００５ｃは、ユーザによって直前に削除された文字数が所定の値（ここでは、２文字）以下であるときだけ（ステップＳＴ５０１）、言語モデル切替手段４００６は切替器４００６ａを部分発声言語モデル４００４ｂに切替える。
【００４５】
このように、本実施の形態によれば、ユーザ操作により文字がたとえば行単位で削除された直後は、削除された行文字数が所定の値より大きいとすれば、通常の言語モデルを用いて発声を認識するので文章の音声認識ができ、また、所定の値以下の文字数の削除である場合は、単語列探索で参照する言語モデルを部分発声言語モデル４００４ｂに切替えるので、一般文の言語モデル４００３では認識し難い部分発声を高い精度で入力できる。
【００４６】
実施の形態６．
上述の実施の形態２では、同音語記憶手段を追加し、同音語の候補を選択できるように構成したものであるが、本実施の形態では、サブワード記憶手段を追加した例を示す。
【００４７】
図１４は、このような場合の音声認識装置の構成図である。また、図２０は本実施の形態の音声認識装置の作用の一例を示すフローチャートである。なお、前述の各実施の形態と同一の構成要素は同一の符号を付している。
【００４８】
ここで、サブワードとは、表記文字列と音節列の対応の最小単位である。例えば、「三菱電機は流石だ」（みつびしでんきはさすがだ）という句は、「三」（みつ），「菱」（びし），「電」（でん），「機」（き），「は」（は），「流石」（さすが），「だ」（だ）という７つのサブワードから構成される。このとき「」内は表記文字列を示し、（）内は対応する音節列である。単漢字では「流石」は「流」と「石」に分割できるが、単漢字に分割してしまうと（さすが）という読みに対応できなくなるので最小単位ではない。この点は単漢字とは異なる。
【００４９】
図１４において、６２０１ａは音節列から引けるようにサブワードの表記を記憶したサブワード記憶手段、６２０２ａは単語列探索手段の内部出力に含まれる音節に基づいて、サブワード記憶手段６２０１ａからサブワードの候補を選択するサブワード生成手段、６２０３ａはサブワード生成手段６２０２ａが生成した同音語の候補を表示するサブワード表示手段である。また、４００４ａは、サブワードの統計量を記憶したサブワード言語モデル（第２言語モデル記憶手段）である。図１５はサブワード表示手段６２０３ａの表示例を示す図である。サブワード表示手段６２０３ａは、表示手段６００１の表示領域内に表示窓して表示される。
【００５０】
このような構成において、ユーザは、前掲の発声Ａに対して、“音声を認識する。”と誤認識したため、“。”を削除し、短い発声Ｂをすると（ステップＳＴ１０４、ステップＳＴ１１１）、音節列Ｂに対して、単語列探索手段４００１は、サブワード言語モデル４００４ａを参照して、内部データとして、＜＜文字＝“窓”、音節列＝“まど”＞＜文字＝“等”、音節列＝“など”＞＜文字＝“灘”、音節列＝“なだ”＞＜文字＝“ま”、音節列＝“ま”，文字＝“ま”、音節列＝“ど”＞＜文字＝“真”、音節列＝“ま”，文字＝“土”、音節列＝“ど”＞…（途中省略）…＜文字＝“茄”、音節列＝“な”，文字＝“舵”、音節列＝“だ”＞＜文字＝“ナ”、音節列＝“な”，文字＝“ダ”、音節列＝“だ”＞＞というサブワード列の情報を出力する（ステップＳＴ６０１）。
【００５１】
サブワード生成手段６２０２ａは内部データに含まれる＜音節列＝など＞というデータ項目を得てサブワード記憶手段６２０１ａに記憶されたサブワードデータの中から＜音節列＝など＞＜音節列＝ま＞＜音節列＝ど＞＜音節列＝な＞＜音節列＝だ＞を含むサブワードの候補を得てサブワード表示手段６２０３ａに表示する。すなわち、本例では、図１５に示すようにサブワード表示手段６２０３ａには“窓”、“等”、“灘”、“ま”、“真”他の４０候補が表示される（ステップＳＴ６０２）。マウスポインタ６００１ｄで所望の候補をポイントすることにより所望の文字を選択入力することが可能である（ステップＳＴ６０３、ステップＳＴ６０４）。
【００５２】
このように本実施の形態によれば、サブワード表示手段６２０３ａにて、短い発声に対して、サブワード分割した候補表示を行うので、サブワード言語モデル４００４ａに同一の読みと表記の単語が存在しなくとも所望の候補を入力できる。例えば、サブワード言語モデル４００４ａに「など」という発音の「名土」という固有名詞が登録されていなくとも、サブワード候補から「名」と「土」を選択入力することにより目的を達成できる。
【００５３】
上記した特許請求の範囲に記載されたこの発明を見方を変えて表現すれば以下のとおりである。
【００５４】
（１）．単語の統計量を記憶した言語モデル記憶手段と、該言語モデル記憶手段に記憶された単語の統計量を参照して、入力音声を文字列に変換する探索手段と、該探索手段で変換された文字列を表示する表示手段とを備える音声認識装置において、前記入力音声の発声長を推定する発声長推定手段と、短い単語の統計量を記憶した短単語言語モデル記憶手段と、前記探索手段は前記発声長推定手段が推定した発声長が所定の長さより短いとき、前記短単語言語モデル記憶手段に記憶された短い単語の統計量を参照して入力音声を文字列に変換する手段であることを特徴とする音声認識装置。
【００５５】
（２）．単語の統計量を記憶した言語モデル記憶手段と、該言語モデル記憶手段に記憶された単語の統計量を参照して、入力音声を文字列に変換する探索手段と、該探索手段で変換された文字列を表示する表示手段とを備える音声認識装置において、前記入力音声の発声長を推定する発声長推定手段と、短い単語の統計量を記憶した短単語言語モデル記憶手段と、前記探索手段の文字列から同音語を生成する同音語生成手段と、同音語生成手段の生成した同音語リストを表示する同音語表示手段を設け、前記探索手段は、前記発声長推定手段が推定した発声長が所定の長さより短いとき、前記短単語言語モデル記憶手段に記憶された短い単語の統計量を参照して入力音声を文字列に変換することを特徴とする音声認識装置。
【００５６】
（３）．単語の統計量を記憶した言語モデル記憶手段と、該言語モデル記憶手段に記憶された単語の統計量を参照して、入力音声を文字列に変換する探索手段と、該探索手段で変換された文字列を表示する表示手段とを備える音声認識装置において、前記表示手段に表示されるカーソルの位置が文章の内部にあることを判定する文字挿入位置判定手段と、文章の部分を単語として切出しこれら単語の統計量を記憶した部分発声言語モデル記憶手段と、前記探索手段は前記文字挿入位置判定手段が前記表示手段に表示されるカーソルの位置が文章の内部にあると判定したとき、前記部分発声言語モデル記憶手段に記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換する手段であることを特徴とする音声認識装置。
【００５７】
（４）．単語の統計量を記憶した言語モデル記憶手段と、該言語モデル記憶手段に記憶された単語の統計量を参照して、入力音声を文字列に変換する探索手段と、該探索手段で変換された文字列を表示する表示手段と、該表示手段の表示内容を修正するユーザ操作手段を備える音声認識装置において、前記ユーザ操作を記憶するユーザ操作記憶手段と、前記ユーザ操作記憶手段に記憶されたユーザ操作を判定するユーザ操作判定手段と、文章の部分を単語として切出しこれら単語の統計量を記憶した部分発声言語モデル記憶手段と、前記探索手段は前記ユーザ操作判定手段が直前のユーザ操作が文字の削除であると判定したとき前記部分発声言語モデル記憶手段に記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換する手段であることを特徴とする音声認識装置。
【００５８】
（５）．単語の統計量を記憶した言語モデル記憶手段と、該言語モデル記憶手段に記憶された単語の統計量を参照して、入力音声を文字列に変換する探索手段と、該探索手段で変換された文字列を表示する表示手段と、該表示手段の表示内容を修正するユーザ操作手段を備える音声認識装置において、前記ユーザ操作を記憶するユーザ操作記憶手段と、前記ユーザ操作記憶手段に記憶されたユーザ操作を判定するユーザ操作判定手段と、文章の部分を単語として切出しこれら単語の統計量を記憶した部分発声言語モデル記憶手段と、前記探索手段は前記ユーザ操作判定手段が直前のユーザ操作が所定の数以下の文字の削除であると判定したとき前記部分発声言語モデル記憶手段に記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換する手段であることを特徴とする音声認識装置。
【００５９】
（６）．単語の統計量を記憶した言語モデル記憶手段と、該言語モデル記憶手段に記憶された単語の統計量を参照して、入力音声を文字列に変換する探索手段と、該探索手段で変換された文字列を表示する表示手段とを備える音声認識装置において、サブワードの統計量を記憶したサブワード言語モデル記憶手段と、前記探索手段は前記サブワード言語モデル記憶手段に記憶されたサブワードの統計量と認識誤り傾向の確率モデルを参照して入力音声を文字列に変換する手段であることを特徴とする音声認識装置。
【００６０】
（７）．単語の統計量を記憶し、該記憶された単語の統計量を参照して、入力音声を文字列に変換し、該変換された文字列を表示するとともに、前記単語の統計量とは別に短い単語の統計量を記憶し、入力音声を文字列に変換する際、入力音声の発声長を推定し、前記推定した発声長が所定の長さより短いとき、前記記憶された短い単語の統計量を参照して入力音声を文字列に変換することを特徴とする音声認識方法。
【００６１】
（８）．単語の統計量を記憶し、該記憶された単語の統計量を参照して、入力音声を文字列に変換し、該変換された文字列を表示する音声認識方法において、前記単語の統計量とは別に短い単語の統計量を記憶し、前記入力音声の発声長を推定し、前記推定された入力音声の発声長が所定の長さより短いとき、前記記憶した短い単語の統計量を参照して入力音声を文字列に変換するとともに、前記変換された文字列から同音語を生成し、該生成した同音語のリストを表示することを特徴とする音声認識方法。
【００６２】
（９）．単語の統計量を記憶し、該記憶された単語の統計量を参照して、入力音声を文字列に変換し、該変換された文字列と文字挿入位置を示すカーソルとを表示する音声認識方法において、前記単語の統計量とは別に文章の部分を単語として切出して得られる単語の統計量を記憶し、入力音声を文字列に変換する際に、前記表示されるカーソルの位置が文章の内部にあることを判定し、前記表示されるカーソルの位置が文章の内部にあると判定したとき、前記記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換することを特徴とする音声認識方法。
【００６３】
（１０）．単語の統計量を記憶し、該記憶された単語の統計量を参照して、入力音声を文字列に変換し、該変換された文字列を表示し、該表示された表示内容を修正するユーザの操作を記憶し、前記記憶されたユーザ操作を判定し、前記単語の統計量とは別に文章の部分を単語として切出して得られる単語の統計量を記憶し、入力音声を文字列に変換するに際し、直前のユーザ操作が文字の削除であると判定したとき前記記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換することを特徴とする音声認識方法。
【００６４】
（１１）．単語の統計量を記憶し、該記憶された単語の統計量を参照して、入力音声を文字列に変換し、該変換された文字列を表示し、該表示された表示内容を修正するユーザ操作を記憶し、前記記憶されたユーザ操作を判定し、前記単語の統計量とは別に文章の部分を単語として切出して得られる単語の統計量を記憶し、入力音声を文字列に変換するに際し、前記直前のユーザ操作が所定の数以下の文字の削除であると判定したとき前記記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換することを特徴とする音声認識方法。
【００６５】
（１２）．単語の統計量を記憶し、記憶された単語の統計量を参照して、入力音声を文字列に変換するとともに、変換された文字列を表示し、サブワードの統計量を記憶するとともに、サブワードの統計量と認識誤り傾向の確率モデルを参照することにより、入力音声を文字列に変換することを特徴とする音声認識方法。
【００６６】
（１３）．単語の統計量を記憶した言語モデル記憶する第１の手順と、該記憶された単語の統計量を参照して、入力音声を文字列に変換する第２の手順と、該探索手段で変換された文字列を表示する第３の手順と、前記入力音声の発声長を推定する第４の手順と、短い単語の統計量を記憶する第５の手順と、上記第２の手順は前記第４の手順の実行の結果推定した発声長が所定の長さより短いとき、前記第５の手順の実行の結果記憶された短い単語の統計量を参照して入力音声を文字列に変換する手順を実行させるための音声認識プログラム。
【００６７】
（１４）．単語の統計量を記憶する第１の手順と、該記憶された単語の統計量を参照して、入力音声を文字列に変換する第２の手順と、該変換された文字列を表示する第３の手順と、前記入力音声の発声長を推定する第４の手順と、短い単語の統計量を記憶する第５の手順と、前記第２の手順により変換された文字列から同音語を生成する第６の手順と、前記生成された同音語のリストを表示する第７の手順と、上記第２の手順は前記推定された発声長が所定の長さより短いとき、前記記憶された短い単語の統計量を参照して入力音声を文字列に変換する手順を実行させるための音声認識プログラム。
【００６８】
（１５）．単語の統計量を記憶する第１の手順と、該記憶された単語の統計量を参照して、入力音声を文字列に変換する第２の手順と、該変換された文字列を表示する第３の手順と、前記表示されるカーソルの位置が文章の内部にあることを判定する第４の手順と、文章の部分を単語として切出して得られる単語の統計量を記憶する第５の手順と、上記第２の手順は前記表示されるカーソルの位置が文章の内部にあると判定したとき、前記記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換する手順を実行させるための音声認識プログラム。
【００６９】
（１６）．単語の統計量を記憶する第１の手順と、該記憶された単語の統計量を参照して、入力音声を文字列に変換する第２の手順と、該変換された文字列を表示する第３の手順と、該表示された表示内容を修正するユーザ操作を入力する第４の手順と、前記ユーザ操作を記憶する第５の手順と、前記記憶されたユーザ操作を判定する第６の手順と、文章の部分を単語として切出して得られる単語の統計量を記憶する第７の手順と、上記第２の手順は前記入力音声を文字列に変換する際に、前記直前のユーザ操作が文字の削除であると判定したとき前記記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換する手順とを実行させるための音声認識プログラム。
【００７０】
（１７）．単語の統計量を記憶する第１の手順と、該記憶された単語の統計量を参照して、入力音声を文字列に変換する第２の手順と、該変換された文字列を表示する第３の手順と、該表示された表示内容を修正するユーザ操作を入力する第４の手順と、前記記憶されたユーザ操作を判定する第５の手順と、文章の部分を単語として切出して得られる単語の統計量を記憶する第６の手順と、上記第２の手順は直前のユーザ操作が所定の数以下の文字の削除であると判定したとき前記記憶された文章の部分を単語として切出して得られる単語の統計量を参照して入力音声を文字列に変換する手順を実行させるための音声認識プログラム。
【００７１】
（１８）．単語の統計量を記憶する第１の手順と、上記第１の手順に記憶された単語の統計量を参照して、入力音声を文字列に変換する第２の手順と、第２の手順で変換された文字列を表示する第３の手順と、サブワードの統計量を記憶した第４の手順と、上記第２の手順は第４の手順に記憶されたサブワードの統計量と認識誤り傾向の確率モデルを参照して入力音声を文字列に変換する手順を実行させるための音声認識プログラム。
【００７２】
【発明の効果】
この発明によれば、音声認識において、通常の音声入力や誤認識部分の再発声による修正に際して、言語モデルの学習用文例における出現頻度の小さい単語や単語列の認識誤りを減少させることができる。
【００７３】
また、この発明によれば、言語モデルに存在しない単語を容易に音声入力することが可能となり、音声認識結果の編集や訂正等の作業効率の向上を実現できる。
【図面の簡単な説明】
【図１】この発明の実施の形態１である音声認識装置の構成例を示すブロック図である。
【図２】この発明の実施の形態１である音声認識装置における認識結果の一例を示す説明図である。
【図３】この発明の実施の形態１である音声認識装置における単語列の表示例を示す説明図である。
【図４】この発明の実施の形態１である音声認識装置における１字削除結果の表示例を示す説明図である。
【図５】この発明の実施の形態１である音声認識装置における訂正結果の表示例を示す説明図である。
【図６】この発明の実施の形態２である音声認識装置の構成例を示すブロック図である。
【図７】この発明の実施の形態２である音声認識装置における同音語表示窓の表示例を示す説明図である。
【図８】この発明の実施の形態３である音声認識装置の構成例を示すブロック図である。
【図９】この発明の実施の形態３である音声認識装置における認識結果の表示例を示す説明図である。
【図１０】この発明の実施の形態４である音声認識装置の構成例を示すブロック図である。
【図１１】この発明の実施の形態４である音声認識装置における認識結果の表示例を示す説明図である。
【図１２】この発明の実施の形態４である音声認識装置における認識結果の削除例を示す説明図である。
【図１３】この発明の実施の形態４である音声認識装置における認識結果の挿入例を示す説明図である。
【図１４】この発明の実施の形態６である音声認識装置の構成例を示すブロック図である。
【図１５】この発明の実施の形態６である音声認識装置における認識結果の表示例を示す説明図である。
【図１６】この発明の実施の形態１である音声認識装置の作用の一例を示すフローチャートである。
【図１７】この発明の実施の形態２である音声認識装置の作用の一例を示すフローチャートである。
【図１８】この発明の実施の形態３である音声認識装置の作用の一例を示すフローチャートである。
【図１９】この発明の実施の形態４および実施の形態５である音声認識装置の作用の一例を示すフローチャートである。
【図２０】この発明の実施の形態６である音声認識装置の作用の一例を示すフローチャートである。
【符号の説明】
１００１　入力音声、２００１　音声分析手段、３００１　音節列認識手段、３００２　音響モデル、４００１　単語列探索手段、４００２　差分モデル、４００３　言語モデル、４００４　短単語言語モデル、４００４ａ　サブワード言語モデル、４００４ｂ　部分発声言語モデル、４００５　発声長推定手段、４００５ｂ　文字挿入位置判定手段、４００５ｃ　ユーザ操作判定手段、４００６　言語モデル切替手段、４００６ａ　切替器、５００１　テキストバッファ、６００１　表示手段、６００１ａ　テキスト表示枠、６００１ｂ　編集ボタン枠、６００１ｂ−１　保存ボタン、６００１ｂ−２　貼付ボタン、６００１ｂ−３　コピーボタン、６００１ｃ　カーソル、６００１ｄ　マウスポインタ、６００２　ユーザ操作、６００３　修正手段、６００４　ユーザ操作記憶手段、６２０１　同音語記憶手段、６２０１ａ　サブワード記憶手段、６２０２　同音語生成手段、６２０２ａ　サブワード生成手段、６２０３　同音語表示手段、６２０３ａ　サブワード表示手段、７００１　ユーザテキスト、Ｐ０〜Ｐ４　カーソル位置。
[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition technique, and more particularly to a text input technique using speech recognition.
[0002]
[Prior art]
Expectations for document creation by voice input are extremely high because it is convenient and requires no special training, and Japanese text input software based on speech recognition is sold by several companies and is attracting attention.
[0003]
As a speech recognition technique for dictation, which converts speech into text, Non-Patent Document 1 discloses "large vocabulary continuous speech recognition by a two-stage search method using a probability model of recognition error tendency."
[0004]
Non-Patent Document 2 discloses a technique for recognizing words (unknown words) not registered in a dictionary in units of fine syllables.
[0005]
Patent Document 1 discloses a technique for assisting the user when inputting a compound word that is not included in the word model.
[0006]
A conventional speech recognition text input device, realized by running such software on a computer, comprises: a speech analysis means that analyzes the input speech and outputs a feature vector time series; an acoustic model consisting of, for example, triphone HMMs; a syllable string recognition means that refers to the acoustic model and recognizes, from the feature vector time series of the speech analysis means, the syllable string corresponding to the input speech; a difference model that stores the recognition error tendency of the syllable string recognition means; a language model consisting of, for example, word N-gram statistics; a word string search means that refers to this language model and searches, from the output syllable string of the syllable string recognition means, for the word string that best approximates the input speech; a text buffer that temporarily stores the characters of the output word string of the word string search means; a display means that displays the characters stored in the text buffer; a correction means that corrects characters in the text buffer in response to user operations; and a storage means that stores the user text created by the user and extracted from the text buffer.
As the language model, an N-gram model (for example, N = 3) based on word-chain statistics is used, and the statistics are learned from a large amount of text such as newspapers and the Web (information resources on the Internet).
[0007]
In a conventional speech recognition text input device with this configuration, the user's speech is converted into a word string by the word string search means in accordance with the language model; its characters are temporarily stored in the text buffer and, at the same time, displayed to the user by the display means. Based on this display, the user can correct recognition errors in the text temporarily stored in the text buffer through user operations and finally obtain the desired user text. User operations can be entered by voice as well as by keyboard.
[0008]
However, in such a conventional speech recognition text input device, the word string output by the word string search means is one with a high probability as computed from the language model, so a word string that does not appear in the training sentences of the language model is unlikely to be recognized. For example, the utterance 「音声認識をするなど」 ("doing speech recognition, etc.") is likely to be misrecognized as 「音声を認識する。」 ("recognizes speech."). In this example, 「など」 (nado, "etc.") is misrecognized as 「。」 (maru, the full stop), and the position of the particle 「を」 (o) also differs. The user corrects such recognition errors by acting on the characters in the text buffer through the correction means, and the corrected result is displayed on the display means. One correction method is to re-utter the portion to be corrected and overwrite the recognition result. In this case, the word string search means again selects the most probable word string for the re-uttered speech based on the language model statistics, so a word string that does not appear in the training text of the language model is unlikely to be recognized correctly. Thus, in the above example, even if the user re-utters 「など」 to correct 「。」, the word 「など」 rarely appears as an isolated word in the training sentences and is therefore unlikely to be recognized correctly.
[0009]
This is not limited to correcting a misrecognized portion by re-utterance as described above. In normal voice input as well, the likelihood of correct recognition drops in the same way when the user utters a word that rarely appears in isolation in the training sentences of the language model.
[0010]
[Patent Document 1]
JP-T-07-507880
[Non-patent document 1]
IEICE Transactions, Vol. J83-D-II, No. 12, pp. 2545-2553, issued December 2000
[Non-patent document 2]
Katsunobu Ito et al., "Statistical Language Model for Large Scale Continuous Speech Recognition with Emphasis on Coverage," Proceedings of the Acoustical Society of Japan, pp. 65-66, issued March 1999
[0011]
[Problems to be solved by the invention]
A speech recognition device with such a conventional configuration has the technical problem that, whether the user is re-uttering a misrecognized portion or performing normal voice input, a word or word string that rarely appears in isolation in the training sentences of the language model is unlikely to be recognized correctly, which reduces the efficiency of voice input.
[0012]
Furthermore, because a speech recognition device with such a conventional configuration uses a word-based language model, it has the technical problem that a word that does not exist in the language model is difficult to input by voice. There is a conventional method that recognizes such words syllable by syllable, but when the syllable recognition result is incorrect it is difficult to input the desired word by voice.
[0013]
The present invention has been made to solve the above technical problems. Its object is to provide a speech recognition technique that can reduce recognition errors for words and word strings that appear infrequently in the training sentences of the language model, both in normal voice input and when correcting a misrecognized portion by re-utterance.
[0014]
It is another object of the present invention to provide a speech recognition technique that makes it easy to input by voice a word that does not exist in the language model, thereby improving the efficiency of work such as editing and correcting recognition results.
[0015]
[Means for Solving the Problems]
A speech recognition device according to the present invention includes: a first language model storage means that stores word-chain statistics; a second language model storage means that stores statistics of short words, statistics of words cut out from parts of sentences, or statistics of subwords; an utterance length estimation means that estimates the utterance length of the input speech; and a search means that converts the input speech into a character string by selectively referring to the statistics stored in the first language model or the second language model according to whether the utterance length determined by the utterance length estimation means is long or short.
[0016]
A speech recognition program according to the present invention causes a computer to function as: a first language model storage means that stores word-chain statistics; a second language model storage means that stores statistics of short words, statistics of words cut out from parts of sentences, or statistics of subwords; an utterance length estimation means that estimates the utterance length of the input speech; and a search means that converts the input speech into a character string by selectively referring to the statistics stored in the first language model or the second language model according to whether the utterance length determined by the utterance length estimation means is long or short.
[0017]
[Embodiments of the Invention]
Hereinafter, an embodiment of the present invention will be described.
Embodiment 1.
FIG. 1 is a block diagram showing an example of the configuration of a speech recognition device according to Embodiment 1 of the present invention. The speech recognition device exemplified below is a computer system consisting of hardware such as a central processing unit, main memory, an external storage device, a display, a keyboard, and a mouse, together with the programs that operate them; each means and operation described below can be realized by this hardware and these programs.
[0018]
In FIG. 1, 2001 is a speech analysis means that analyzes the input speech 1001 and outputs a feature vector time series; 3002 is an acoustic model consisting of, for example, triphone HMMs; 3001 is a syllable string recognition means that refers to the acoustic model 3002 and recognizes, from the feature vector time series supplied by the speech analysis means 2001, the maximum-likelihood syllable string corresponding to the input speech 1001; 4002 is a difference model that stores the recognition error tendency of the syllable string recognition means; 4003 is a language model consisting of word N-grams; 4004 is a short word language model; 4005 is an utterance length estimation means that estimates the utterance length of the input speech 1001 based on the output of the syllable string recognition means 3001; 4006 is a language model switching means that switches between the language model 4003 (first language model storage means) and the short word language model 4004 (second language model storage means) according to the determination of the utterance length estimation means 4005; 4001 is a word string search means that refers to the difference model 4002 and to whichever of the language model 4003 and the short word language model 4004 has been selected by the language model switching means 4006, and searches for the word string that best approximates the syllable string output by the syllable string recognition means 3001; 5001 is a text buffer that temporarily stores the characters of the word string output by the word string search means 4001; 6001 is a display means that displays the characters stored in the text buffer 5001; 6003 is a correction means that corrects characters in the text buffer 5001 in response to a user operation 6002; and 7001 is the user text created by the user and extracted from the text buffer.
[0019]
As the language model 4003, an N-gram (for example, N = 3) based on word-chain statistics is used. The statistics of the language model 4003 are learned from a large amount of text such as newspapers and the Web. The short word language model 4004, on the other hand, stores statistics (N-grams) of isolated words. These words are not limited to ordinary nouns; they also include words that may be uttered in isolation, such as the case particle 「が」 (ga) and the adverbial particle 「など」 (nado, "etc.").
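The contrast between the two kinds of statistics can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy corpus, and the use of plain unigram counts for the isolated-word statistics are all assumptions made for the example.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count word trigrams, padding each sentence with start/end markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts


def train_isolated_word_counts(isolated_words):
    """Count words that were observed as stand-alone utterances."""
    return Counter(isolated_words)


# Hypothetical toy data standing in for newspaper/Web text.
corpus = [
    ["音声", "を", "認識", "する", "。"],
    ["文章", "など", "を", "入力", "する", "。"],
]
# Hypothetical list of isolated utterances (particles included).
isolated = ["など", "が", "を", "など"]

trigrams = train_trigram_counts(corpus)
short_words = train_isolated_word_counts(isolated)

# "など" never opens a sentence in the toy corpus, so a trigram model has
# no evidence for it as a stand-alone utterance, while the isolated-word
# statistics do.
nado_as_sentence_start = trigrams[("<s>", "<s>", "など")]
nado_isolated = short_words["など"]
```

In this toy setting the sentence-level counts give no support to 「など」 uttered on its own, which is the gap the short word language model 4004 is meant to fill.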
[0020]
Next, the operation will be described. FIG. 16 is a flowchart illustrating an example of the operation of the voice recognition device according to the present embodiment.
[0021]
In this configuration, the input speech 1001 uttered by the user (step ST101) is converted into feature vectors by the speech analysis means 2001 (step ST102) and then into the maximum-likelihood syllable string by the syllable string recognition means 3001 (step ST103). Suppose the user utters 「音声を認識するなど」 ("recognizing speech, etc."). For utterance A in FIG. 2, "おんせえおにんしきするなど" (o-n-se-e-o-ni-n-shi-ki-su-ru-na-do), the output of the syllable string recognition means 3001 contains recognition errors on the order of 80% and becomes syllable string A in FIG. 2, "おんせんにんちするまど" (o-n-se-n-ni-n-chi-su-ru-ma-do). The utterance length estimation means 4005 determines that the number of syllables in the syllable string output by the syllable string recognition means 3001 (here, the 11 syllables of syllable string A) exceeds a predetermined value (here, 3) (step ST104), and therefore sets the switch 4006a of the language model switching means 4006 to the language model 4003 (step ST105). As a result, the word string search means 4001 refers to the word-chain statistics stored in the language model 4003 and converts the syllable string output by the syllable string recognition means 3001 (here, syllable string A) into a word string (step ST106); its characters are temporarily stored in the text buffer 5001 and, at the same time, displayed to the user by the display means 6001 illustrated in FIG. 3 (step ST107). Based on this display, the user corrects recognition errors in the text temporarily stored in the text buffer 5001 (steps ST108 and ST109) and can finally obtain the desired user text 7001.
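The switching decision of steps ST104, ST105, and ST111 can be sketched as below. This is only an illustration under stated assumptions: the function names and return labels are hypothetical, each kana character is counted as one syllable (which matches the counts of 11 and 2 given in the text but would miscount contracted sounds written with small kana), and a count exactly equal to the threshold is sent to the short word model, a case the text does not specify.

```python
import unicodedata

SYLLABLE_THRESHOLD = 3  # the "predetermined value" given in the text


def count_syllables(syllable_string):
    """Count syllables, assuming one kana character per syllable."""
    # Normalize so that voiced kana such as "ど" are single code points.
    return len(unicodedata.normalize("NFC", syllable_string))


def select_language_model(syllable_string, threshold=SYLLABLE_THRESHOLD):
    """Return a label for the language model the search should refer to."""
    if count_syllables(syllable_string) > threshold:
        return "language_model_4003"
    # Assumption: a count equal to the threshold is treated as short.
    return "short_word_language_model_4004"


long_choice = select_language_model("おんせんにんちするまど")  # syllable string A
short_choice = select_language_model("まど")  # syllable string B
```

With the two syllable strings from FIG. 2, the long utterance is routed to the word-chain language model and the two-syllable re-utterance to the short word language model.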
[0022]
The display means 6001 illustrated in FIG. 3 is, for example, a display, and includes a text display frame 6001a, an edit button frame 6001b, a cursor 6001c, a mouse pointer 6001d, and the like. The edit button frame 6001b contains a save button 6001b-1, a paste button 6001b-2, a copy button 6001b-3, and the like. By pointing at these buttons with the mouse pointer 6001d, the user can perform various editing operations on the displayed text of the recognition result.
[0023]
FIG. 3 shows a display example of the word string A obtained for utterance A based on the language model 4003. It shows that "recognize voice, etc." has been erroneously recognized as "recognize voice." (in this example, "etc." has been erroneously recognized as the full stop "." (maru)).
[0024]
In this case, the user corrects the recognition error in the text buffer 5001 through user operation 6002 via the correction means 6003, and the correction result is displayed on the display means 6001. One correction method is to re-utter the part to be corrected (step ST110) and overwrite the recognition result. Here, the single character "." is deleted, giving the state shown in FIG. 4, and the word "etc." is then re-uttered (step ST101). The re-uttered input voice is converted into feature vectors by the voice analysis means 2001 (step ST102) and into the syllable string "Mado" (syllable string B in FIG. 2) by the syllable string recognition means 3001 (step ST103). Since the number of syllables is 2, which is less than the predetermined value 3 (step ST104), the utterance length estimation means 4005 instructs the language model switching means 4006 to switch switch 4006a to the short word language model 4004 (step ST111). The word string search means 4001 therefore searches for the word string corresponding to syllable string B with reference to the short word language model 4004 and writes the resulting characters to the text buffer 5001 (step ST106). As a result, the word "etc." is additionally displayed on the display means 6001 immediately after the cursor 6001c, giving the display of FIG. 5 (step ST107).
[0025]
As described above, for the utterance "etc.", the utterance length estimation means 4005 compares the length of the syllable string output from the syllable string recognition means 3001 with the predetermined value and determines that the utterance is short, so the word string search means 4001 selects the most probable word string based on the statistics of the short word language model 4004. Consequently, even if the word string does not appear in the training text of the language model, or appears only rarely, "etc." is more likely to be recognized correctly. In the above example, when "etc." is re-uttered to correct ".", the short word language model 4004 is referred to, so "etc." is likely to be recognized correctly even if it is not included in the training text of the normal language model 4003. By contrast, if the language model 4003 were always applied irrespective of utterance length, as in the related art, the re-uttered "etc." would again be recognized as ".", significantly impairing the user's input efficiency.
[0026]
As described above, in the present embodiment, the utterance length estimation means 4005 judges the utterance length, and when it is shorter than the predetermined value, the word string is searched for with reference to the short word language model 4004 instead of the normal language model 4003. This improves the recognition of short words entered by re-utterance, which in turn improves, for example, the efficiency of partially correcting recognition results, so that speech input can be performed more efficiently.
[0027]
Embodiment 2.
In the first embodiment, the language model is switched to the short word language model when the utterance length of the input speech is short. Next, an embodiment in which homophone storage means is further added will be described.
[0028]
FIG. 6 is a configuration diagram in such a case. FIG. 17 is a flowchart illustrating an example of the operation of the voice recognition device according to the present embodiment. The same components as those in the first embodiment are denoted by the same reference numerals.
[0029]
In the figure, reference numeral 6201 denotes homophone storage means that stores homophone notations so that they can be looked up by syllable string. Reference numeral 6202 denotes homophone generation means that selects homophone candidates from the homophone storage means 6201 based on the syllable data of a word string included in the internal output of the word string search means 4001. Reference numeral 6203 denotes homophone display means that displays the homophone candidates generated by the homophone generation means 6202. FIG. 7 is an explanatory diagram showing a display example of the homophone display means 6203 on the display means 6001.
[0030]
In such a configuration, suppose that utterance A is misrecognized as "recognize voice.", and that the user deletes "." (steps ST109 and ST110) and then makes the short utterance B, producing syllable string B. In response, the word string search means 4001 outputs the word string information <character = "etc.", syllable string = "etc."> as internal data (steps ST101 to ST104, ST111, and ST201). The homophone generation means 6202 takes the data item <syllable string = "etc."> from the internal data, selects the homophone candidates having that syllable string from the homophone data stored in the homophone storage means 6201, and displays them on the homophone display means 6203 (step ST202). In this example, as shown in FIG. 7, the four candidates "equal", "equal", "$", and "nad" are displayed on the homophone display means 6203. The user selects the desired notation either by entering the candidate number with the numeric keys of the keyboard or by pointing at the candidate with the mouse pointer 6001d (step ST203), and the selected candidate is output at the cursor position (step ST204).
[0031]
As described above, according to the present embodiment, the homophone display means 6203 displays homophones for a short utterance, so the desired candidate is more likely to be input even if the short word language model 4004 does not include a word having the same reading and notation.
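The lookup and selection described for the second embodiment can be sketched as follows. This is only an illustration under stated assumptions: `HomophoneStore`, `add`, `candidates`, and `choose` are hypothetical names, and the registered notations are illustrative placeholders, not the contents of FIG. 7.

```python
from collections import defaultdict


class HomophoneStore:
    """Hypothetical stand-in for homophone storage means 6201:
    notations are indexed by their reading (syllable string)."""

    def __init__(self):
        self._by_reading = defaultdict(list)

    def add(self, reading, notation):
        # Keep registration order and avoid duplicate notations.
        if notation not in self._by_reading[reading]:
            self._by_reading[reading].append(notation)

    def candidates(self, reading):
        """Return all notations sharing the reading (empty list if none)."""
        return list(self._by_reading.get(reading, []))


def choose(candidates, number):
    """Pick a candidate by its 1-based number, as with numeric keys."""
    if not 1 <= number <= len(candidates):
        raise ValueError("candidate number out of range")
    return candidates[number - 1]


# Illustrative data only (assumed, not taken from the figures).
demo_store = HomophoneStore()
for notation in ["等", "など", "ナド"]:
    demo_store.add("nado", notation)

demo_candidates = demo_store.candidates("nado")
selected = choose(demo_candidates, 2)
```

Returning a copy of the candidate list keeps the stored data from being modified by the display side.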
[0032]
Embodiment 3.
In the first and second embodiments described above, the language model is switched according to the utterance length regardless of the character insertion position indicated by the display position of the cursor 6001c. The third embodiment describes an example in which the language model is switched depending on the character insertion position indicated by the display position of the cursor 6001c.
[0033]
FIG. 8 shows a configuration diagram of the present embodiment. FIG. 18 is a flowchart illustrating an example of the operation of the voice recognition device according to the present embodiment. The same components as those in the above-described embodiments are denoted by the same reference numerals. In FIG. 8, reference numeral 4004b denotes a partial utterance language model, which stores the statistics of partial words obtained by cutting out portions of sentences as words, and reference numeral 4005b denotes character insertion position determining means for determining whether the character insertion position indicated by the display position of the cursor 6001c is outside or inside a sentence.
[0034]
FIG. 9 shows an example of the recognition text displayed when the user makes utterance A in FIG. 2 described above (displayed as "voice recognition"). In this case the particle "o" is missing, so the user moves the cursor 6001c from cursor position P0, its position immediately after recognition, to cursor position P1 in order to insert the particle. When the user then makes utterance C "O" in FIG. 2, the character insertion position determining means 4005b determines that the character insertion position indicated by cursor position P1 is within a sentence of the displayed text (step ST301), and the language model switching means 4006 switches switch 4006a to the partial utterance language model 4004b (second language model storage means) (step ST111). Because the word string search means 4001 then produces its recognition result with reference to the partial utterance language model 4004b (step ST302), the word string "o" is likely to be obtained. The output of the word string search means 4001 is written to the text buffer 5001, and the display means 6001 shows the particle "o" inserted immediately before cursor position P1 (the character insertion position) (step ST303).
[0035]
As described above, according to the present embodiment, it is determined that the cursor position P1 (character insertion position) reached through an editing operation is within a sentence of the displayed text, and the language model referred to in the word string search is switched to the partial utterance language model 4004b. As a result, partial utterances that are difficult to recognize with the language model 4003 for general sentences can be input with high accuracy.
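A minimal sketch of the cursor-based decision of step ST301 follows. The specification does not state how "inside a sentence" is tested, so the rule used here (any insertion position before the end of the buffered text counts as inside) is an assumption, and the function names are hypothetical.

```python
GENERAL_LM = "language model 4003"
PARTIAL_LM = "partial utterance language model 4004b"


def is_inside_text(text, cursor):
    """Return True when the insertion position lies before the end of the text.

    Assumption: the cursor is a character index with 0 <= cursor <= len(text),
    and only the position at the very end counts as "outside" the sentence.
    """
    if not 0 <= cursor <= len(text):
        raise ValueError("cursor outside the text buffer")
    return cursor < len(text)


def select_model_by_cursor(text, cursor):
    """Mirror the role of character insertion position determining means 4005b."""
    return PARTIAL_LM if is_inside_text(text, cursor) else GENERAL_LM


buffer_text = "abcdef"                                # placeholder buffer contents
at_end = select_model_by_cursor(buffer_text, 6)       # like position P0
in_middle = select_model_by_cursor(buffer_text, 2)    # like position P1
```

A real implementation would likely also need to consider sentence boundaries within the buffer, which this sketch ignores.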
[0036]
Embodiment 4.
In the fourth embodiment, the short word processing of the first or second embodiment is enabled only immediately after a specific user operation such as character deletion.
[0037]
FIG. 10 shows a configuration diagram of the speech recognition device of the present embodiment. FIG. 19 is a flowchart illustrating an example of the operation of the voice recognition device according to the present embodiment. The same components as those in the above-described embodiments are denoted by the same reference numerals.
[0038]
In FIG. 10, reference numeral 6004 denotes user operation storage means that stores the events generated by user operation 6002. Reference numeral 4005c denotes user operation determining means that refers to the events stored in the user operation storage means 6004 and, if the immediately preceding user operation is the deletion of n consecutive characters (n is a natural number) in the text buffer 5001, switches switch 4006a of the language model switching means 4006 to the partial utterance language model 4004b.
[0039]
In such a configuration, an example of the recognition text displayed on the display means 6001 when the user makes utterance A in FIG. 2 (displayed as "Recognize voice") is shown in the drawings. In this case, the particle "o" has been misrecognized as the particle "no". The user therefore moves the cursor 6001c from cursor position P0, its position immediately after recognition, to cursor position P2, and deletes the particle "no" with the delete key. FIG. 12 shows the display at this time. Cursor position P3 is the position after "no" has been deleted.
[0040]
Up to this point, the cursor movements and delete key operations by the user have been processed as software events, all of which are stored in the user operation storage means 6004. In this state, when the user makes utterance C "O" in FIG. 2, the user operation determining means 4005c determines that the immediately preceding user operation was the deletion of one character (step ST401), the number of syllables is judged against the predetermined value (step ST104), and the language model switching means 4006 switches switch 4006a to the partial utterance language model 4004b (step ST111). Because the word string search means 4001 then produces its recognition result with reference to the partial utterance language model 4004b (step ST402), the target word string "o" is likely to be obtained. The output of the word string search means 4001 is written to the text buffer 5001, and the display means 6001 shows the particle "o" inserted immediately before cursor position P4, as shown in FIG. 13 (step ST403). FIG. 13 shows an example in which the particle "o" is correctly recognized.
[0041]
As described above, according to the present embodiment, it is determined that the immediately preceding user operation was a character deletion, and the language model referred to in the word string search is switched to the partial utterance language model. As a result, partial utterances that are difficult to recognize can be input with high accuracy.
[0042]
Although the deletion of one character has been described, the deletion may in general be of n characters, where n is a natural number. Likewise, when an operation with the same effect as deletion is registered as a shortcut key, the same effect can be obtained by changing the part of the user operation determining means that analyzes the immediately preceding event.
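The event storage and the "immediately preceding operation" check can be sketched as below. The event names (`"move_cursor"`, `"delete_char"`) and the class `UserOperationLog` are hypothetical; the specification only says that user operations generate software events that are stored and later examined.

```python
from dataclasses import dataclass, field


@dataclass
class UserOperationLog:
    """Hypothetical stand-in for user operation storage means 6004."""
    events: list = field(default_factory=list)

    def record(self, kind):
        """Store one user operation event, e.g. "move_cursor" or "delete_char"."""
        self.events.append(kind)

    def trailing_deletions(self):
        """Count the consecutive character deletions at the end of the log (n)."""
        count = 0
        for kind in reversed(self.events):
            if kind != "delete_char":
                break
            count += 1
        return count


def last_operation_was_deletion(log):
    """Mirror the check made by user operation determining means 4005c."""
    return log.trailing_deletions() >= 1


# Cursor moved, then one character deleted, as in the walkthrough above.
demo_log = UserOperationLog()
demo_log.record("move_cursor")
demo_log.record("delete_char")
switch_to_partial_model = last_operation_was_deletion(demo_log)
```

Supporting a shortcut key with the same effect as deletion would only require mapping that key to the same event kind before it is recorded.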
[0043]
Embodiment 5.
In the present embodiment, the short word processing of the first or second embodiment is enabled, within the configuration of the fourth embodiment, when the number of characters deleted immediately before is equal to or less than a predetermined value.
[0044]
That is, in the fifth embodiment, when the user speaks after deleting characters as described in the fourth embodiment, the user operation determining means 4005c determines whether the number of characters deleted immediately before by the user is equal to or less than a predetermined number (step ST501), and if so, the language model switching means 4006 switches switch 4006a to the partial utterance language model 4004b.
[0045]
As described above, according to the present embodiment, immediately after characters are deleted in units of lines by a user operation, the utterance is recognized using the normal language model if the number of deleted characters is larger than the predetermined value, whereas if it is equal to or less than the predetermined value, the language model referred to in the word string search is switched to the partial utterance language model 4004b. Partial utterances that are difficult to recognize can therefore be input with high accuracy.
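The decision of step ST501 reduces to a comparison against a limit on the number of deleted characters. The sketch below is illustrative only: the function name and the default limit of 2 are assumptions (the specification gives no concrete value), and treating "no deletion" as a case for the normal model follows from the fourth embodiment rather than from an explicit statement.

```python
GENERAL_LM = "language model 4003"
PARTIAL_LM = "partial utterance language model 4004b"


def select_model_after_deletion(deleted_chars, max_deleted=2):
    """Choose a model from the number of characters deleted just before the utterance.

    max_deleted is a hypothetical value for the "predetermined number".
    A small deletion suggests a local correction, so the partial utterance
    model is used; a large deletion suggests the text is being re-dictated.
    """
    if deleted_chars < 0:
        raise ValueError("deleted_chars must not be negative")
    if 1 <= deleted_chars <= max_deleted:
        return PARTIAL_LM
    return GENERAL_LM


small_fix = select_model_after_deletion(1)    # one particle deleted
rewrite = select_model_after_deletion(30)     # a whole line deleted
```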
[0046]
Embodiment 6.
In the above-described second embodiment, a homophone storage unit is added to enable selection of a homophone candidate. In the present embodiment, an example in which a subword storage unit is added will be described.
[0047]
FIG. 14 is a configuration diagram of the speech recognition device in such a case. FIG. 20 is a flowchart illustrating an example of the operation of the voice recognition device according to the present embodiment. The same components as those in the above-described embodiments are denoted by the same reference numerals.
[0048]
Here, a subword is the minimum unit of correspondence between a written character string and a syllable string. For example, the phrase "Mitsubishi Electric is truly amazing" (三菱電機は流石だ) consists of seven subwords: "三" (mitsu), "菱" (bishi), "電" (den), "機" (ki), "は" (ha), "流石" (sasuga), and "だ" (da). Here, the quoted string is the notation and the parenthesized string is the corresponding syllable string. As plain kanji, "流石" could be divided into "流" and "石", but once it is divided into single kanji the reading (sasuga) can no longer be assigned, so the single characters are not the minimum unit. In this respect a subword differs from a single kanji.
[0049]
In FIG. 14, reference numeral 6201a denotes subword storage means that stores subword notations so that they can be looked up by syllable string. Reference numeral 6202a denotes subword generating means, and reference numeral 6203a denotes subword display means that displays the candidates generated by the subword generating means 6202a. Reference numeral 4004a denotes a subword language model (second language model storage means) that stores the statistics of subwords. FIG. 15 is a diagram showing a display example of the subword display means 6203a. The subword display means 6203a is displayed as a display window in the display area of the display means 6001.
[0050]
In such a configuration, suppose that the above utterance A is misrecognized as "recognize voice.", and that the user deletes "." and then makes the short utterance B (steps ST104 and ST111). For syllable string B, the word string search means 4001 refers to the subword language model 4004a and outputs, as its internal data, information of the following form (step ST601): <<character = "window", syllable string = "mad"> <character = "etc.", syllable string = "etc."> <character = "Nada", syllable string = "Nada"> <character = "ma", syllable string = "ma"; character = "ma", syllable string = "do"> <character = "True", syllable string = "ma"; character = "earth", syllable string = "do"> ... (omitted) ... <character = "aubergine", syllable string = "na"; character = "rudder", syllable string = "da"> <character = "na", syllable string = "na"; character = "da", syllable string = "da">>.
[0051]
The subword generating means 6202a takes the syllable string data items included in the internal data, such as <syllable string = "etc.">, <syllable string = "ma">, <syllable string = "na">, and <syllable string = "da">, obtains the subword candidates having those syllable strings from the subword data stored in the subword storage means 6201a, and displays them on the subword display means 6203a. In this example, as shown in FIG. 15, "window", "etc.", "Nada", "ma", "true", and forty other candidates are displayed on the subword display means 6203a (step ST602). The user can select and input the desired characters by pointing at the desired candidates with the mouse pointer 6001d (steps ST603 and ST604).
[0052]
As described above, according to the present embodiment, the subword display means 6203a displays candidates divided into subwords for short utterances, so the desired candidate can be input even if a word with the same pronunciation and notation does not exist in the subword language model 4004a. For example, even if the proper noun "Meiji", pronounced the same as "etc.", is not registered in the subword language model 4004a, it can be input by selecting "Name" and "Soil" from the subword candidates.
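How a short syllable string can be covered by subword candidates may be easier to see in a small sketch. The function below enumerates every way of splitting a syllable sequence into subwords found in a lookup table. It is an illustration only: the function name, the table layout, and the sample entries are assumptions, and the sketch ignores the subword statistics and the recognition error model that the word string search would also use.

```python
def segmentations(syllables, store):
    """Enumerate all ways to cover `syllables` with subwords from `store`.

    `store` maps a reading (tuple of syllables) to a list of notations.
    Each result is a list of (notation, reading) pairs in reading order.
    """
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        reading = tuple(syllables[:end])
        for notation in store.get(reading, []):
            for rest in segmentations(syllables[end:], store):
                results.append([(notation, reading)] + rest)
    return results


# Hypothetical subword table (illustrative entries, not the figure contents).
demo_store = {
    ("na", "do"): ["等"],
    ("na",): ["名"],
    ("do",): ["土"],
}

demo_results = segmentations(["na", "do"], demo_store)
# Join the notations of each segmentation to get displayable candidates.
demo_notations = ["".join(n for n, _ in seg) for seg in demo_results]
```

With this table, the two-syllable input yields both the single-subword candidate and the candidate built from two one-syllable subwords, which is how a word absent from the language model can still be assembled.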
[0053]
The present invention described in the above-described claims is expressed in another way as follows.
[0054]
(1). A speech recognition device comprising language model storage means that stores word statistics, search means that converts input speech into a character string by referring to the word statistics stored in the language model storage means, and display means that displays the converted character string, the device further comprising utterance length estimation means that estimates the utterance length of the input speech and short word language model storage means that stores short word statistics, wherein the search means converts the input speech into a character string by referring to the short word statistics stored in the short word language model storage means when the utterance length estimated by the utterance length estimation means is shorter than a predetermined length.
[0055]
(2). A speech recognition device comprising language model storage means that stores word statistics, search means that converts input speech into a character string by referring to the word statistics stored in the language model storage means, and display means that displays the converted character string, the device further comprising utterance length estimation means that estimates the utterance length of the input speech, short word language model storage means that stores short word statistics, homophone generation means that generates homophones from the converted character string, and homophone display means that displays a list of the homophones generated by the homophone generation means, wherein the search means converts the input speech into a character string by referring to the short word statistics stored in the short word language model storage means when the utterance length estimated by the utterance length estimation means is shorter than a predetermined length.
[0056]
(3). A speech recognition device comprising language model storage means that stores word statistics, search means that converts input speech into a character string by referring to the word statistics stored in the language model storage means, and display means that displays the converted character string, the device further comprising character insertion position determination means that determines whether the position of a cursor displayed on the display means is inside a sentence, and partial utterance language model storage means that stores the statistics of words obtained by cutting out portions of sentences as words, wherein the search means converts the input speech into a character string by referring to the statistics stored in the partial utterance language model storage means when the character insertion position determination means determines that the position of the cursor displayed on the display means is inside a sentence.
[0057]
(4). A speech recognition device comprising language model storage means that stores word statistics, search means that converts input speech into a character string by referring to the word statistics stored in the language model storage means, display means that displays the converted character string, and user operation means for correcting the content displayed on the display means, the device further comprising user operation storage means that stores the user operations, user operation determining means that determines the user operations stored in the user operation storage means, and partial utterance language model storage means that stores the statistics of words obtained by cutting out portions of sentences as words, wherein the search means converts the input speech into a character string by referring to the statistics stored in the partial utterance language model storage means when the user operation determining means determines that the immediately preceding user operation is the deletion of a character.
[0058]
(5). A speech recognition device comprising language model storage means that stores word statistics, search means that converts input speech into a character string by referring to the word statistics stored in the language model storage means, display means that displays the converted character string, and user operation means for correcting the content displayed on the display means, the device further comprising user operation storage means that stores the user operations, user operation determining means that determines the user operations stored in the user operation storage means, and partial utterance language model storage means that stores the statistics of words obtained by cutting out portions of sentences as words, wherein the search means converts the input speech into a character string by referring to the statistics stored in the partial utterance language model storage means when the user operation determining means determines that the immediately preceding user operation is the deletion of a predetermined number of characters or fewer.
[0059]
(6). A speech recognition device comprising language model storage means that stores word statistics, search means that converts input speech into a character string by referring to the word statistics stored in the language model storage means, and display means that displays the converted character string, the device further comprising subword language model storage means that stores subword statistics, wherein the search means converts the input speech into a character string by referring to the subword statistics stored in the subword language model storage means and a probability model of recognition error tendency.
[0060]
(7). A speech recognition method that stores word statistics, converts input speech into a character string by referring to the stored word statistics, and displays the converted character string, wherein short word statistics are stored separately from the word statistics, the utterance length of the input speech is estimated when the input speech is converted into a character string, and when the estimated utterance length is shorter than a predetermined length, the input speech is converted into a character string by referring to the stored short word statistics.
[0061]
(8). A speech recognition method that stores word statistics, converts input speech into a character string by referring to the stored word statistics, and displays the converted character string, wherein short word statistics are stored separately from the word statistics, the utterance length of the input speech is estimated, and when the estimated utterance length is shorter than a predetermined length, the input speech is converted into a character string by referring to the stored short word statistics, homophones are generated from the converted character string, and a list of the generated homophones is displayed.
[0062]
(9). A speech recognition method that stores word statistics, converts input speech into a character string by referring to the stored word statistics, and displays the converted character string together with a cursor indicating the character insertion position, wherein the statistics of words obtained by cutting out portions of sentences as words are stored separately from the word statistics, it is determined, when the input speech is converted into a character string, whether the position of the displayed cursor is inside a sentence, and when the position is determined to be inside a sentence, the input speech is converted into a character string by referring to the stored statistics of the words obtained by cutting out portions of sentences.
[0063]
(10). A speech recognition method that stores word statistics, converts input speech into a character string by referring to the stored word statistics, displays the converted character string, stores user operations that correct the displayed content, and determines the stored user operations, wherein the statistics of words obtained by cutting out portions of sentences as words are stored separately from the word statistics, and when, at the time the input speech is converted into a character string, the immediately preceding user operation is determined to be the deletion of a character, the input speech is converted into a character string by referring to the stored statistics of the words obtained by cutting out portions of sentences.
[0064]
(11). A speech recognition method that stores word statistics, converts input speech into a character string by referring to the stored word statistics, displays the converted character string, stores user operations that correct the displayed content, and determines the stored user operations, wherein the statistics of words obtained by cutting out portions of sentences as words are stored separately from the word statistics, and when, at the time the input speech is converted into a character string, the immediately preceding user operation is determined to be the deletion of a predetermined number of characters or fewer, the input speech is converted into a character string by referring to the stored statistics of the words obtained by cutting out portions of sentences.
[0065]
(12). A speech recognition method that converts input speech into a character string by referring to stored word statistics and displays the converted character string, wherein subword statistics are stored, and the input speech is converted into a character string by referring to the stored subword statistics and a probability model of recognition error tendency.
[0066]
(13). A speech recognition program for executing a first procedure for storing a language model in which word statistics are stored, a second procedure for converting input speech into a character string by referring to the stored word statistics, a third procedure for displaying the converted character string, a fourth procedure for estimating the utterance length of the input speech, and a fifth procedure for storing short word statistics, wherein the second procedure converts the input speech into a character string by referring to the short word statistics stored by the fifth procedure when the utterance length estimated by the fourth procedure is shorter than a predetermined length.
[0067]
(14). A speech recognition program for executing a first procedure for storing word statistics, a second procedure for converting input speech into a character string by referring to the stored word statistics, a third procedure for displaying the converted character string, a fourth procedure for estimating the utterance length of the input speech, a fifth procedure for storing short word statistics, a sixth procedure for generating homophones from the character string converted by the second procedure, and a seventh procedure for displaying a list of the generated homophones, wherein the second procedure converts the input speech into a character string by referring to the stored short word statistics when the estimated utterance length is shorter than a predetermined length.
[0068]
(15). A speech recognition program for executing a first procedure for storing word statistics, a second procedure for converting input speech into a character string by referring to the stored word statistics, a third procedure for displaying the converted character string, a fourth procedure for determining whether the position of the displayed cursor is inside a sentence, and a fifth procedure for storing the statistics of words obtained by cutting out portions of sentences as words, wherein the second procedure converts the input speech into a character string by referring to the stored statistics of the words obtained by cutting out portions of sentences when the position of the displayed cursor is determined to be inside a sentence.
[0069]
(16). A speech recognition program for executing a first procedure for storing word statistics, a second procedure for converting input speech into a character string by referring to the stored word statistics, a third procedure for displaying the converted character string, a fourth procedure for inputting a user operation that corrects the displayed content, a fifth procedure for storing the user operation, a sixth procedure for determining the stored user operation, and a seventh procedure for storing the statistics of words obtained by cutting out portions of sentences as words, wherein the second procedure, when converting the input speech into a character string, refers to the stored statistics of the words obtained by cutting out portions of sentences when the immediately preceding user operation is determined to be the deletion of a character.
[0070]
(17). A speech recognition program for executing: a first procedure for storing word statistics; a second procedure for converting input speech into a character string with reference to the stored word statistics; a third procedure for displaying the converted character string; a fourth procedure for inputting a user operation for correcting the displayed content; a fifth procedure for determining the stored user operation; and a sixth procedure for storing statistics of words obtained by cutting out sentence portions as words, wherein, when the last user operation is determined to be a deletion of a predetermined number of characters or fewer, the second procedure converts the input speech into a character string with reference to the stored statistics of words obtained by cutting out sentence portions as words.
[0071]
(18). A speech recognition program for executing: a first procedure for storing word statistics; a second procedure for converting input speech into a character string with reference to the word statistics stored in the first procedure; a third procedure for displaying the character string converted by the second procedure; and a fourth procedure for storing statistics of subwords, wherein the second procedure converts the input speech into a character string with reference to the subword statistics stored in the fourth procedure and a probability model of recognition error tendencies.
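The conditions in items (14), (15) and (17) that decide which stored statistics the second procedure refers to can be sketched roughly as follows. This is only an illustrative sketch: the names `Stats`, `EditorState` and `select_statistics`, the two threshold values, and the priority order in which the conditions are checked are all assumptions made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Stats(Enum):
    WORD = auto()        # ordinary word statistics (first procedure)
    SHORT_WORD = auto()  # statistics of short words, item (14)
    PARTIAL = auto()     # statistics of words cut out of sentence portions, items (15)-(17)


@dataclass
class EditorState:
    syllable_count: int      # estimated utterance length, in syllables
    cursor_index: int        # cursor position in the displayed text
    text_length: int         # number of characters in the displayed text
    last_deleted_chars: int  # characters removed by the last user operation (0 if none)


# Hypothetical thresholds; the specification only says "a predetermined length"
# and "a predetermined number or less of characters".
SHORT_UTTERANCE_LIMIT = 4
SMALL_DELETION_LIMIT = 3


def select_statistics(state: EditorState) -> Stats:
    """Choose which stored statistics to refer to for the next utterance."""
    # Item (17): the last user operation deleted only a few characters.
    if 0 < state.last_deleted_chars <= SMALL_DELETION_LIMIT:
        return Stats.PARTIAL
    # Item (15): the cursor is inside the text (assumed: strictly between its ends).
    if 0 < state.cursor_index < state.text_length:
        return Stats.PARTIAL
    # Item (14): the utterance is shorter than the predetermined length.
    if state.syllable_count < SHORT_UTTERANCE_LIMIT:
        return Stats.SHORT_WORD
    return Stats.WORD
```

In the items above each condition belongs to a separate program; combining them in one function is a simplification chosen here to show the three tests side by side.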
[0072]
[Effect of the Invention]
According to the present invention, recognition errors for words or word strings that appear infrequently in the training sentences of the language model can be reduced, both in normal speech input and when an erroneously recognized portion is corrected by re-utterance.
[0073]
Further, according to the present invention, a word that does not exist in the language model can easily be input by voice, which improves the efficiency of work such as editing and correcting speech recognition results.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of a speech recognition device according to a first embodiment of the present invention;
FIG. 2 is an explanatory diagram illustrating an example of a recognition result in the voice recognition device according to the first embodiment of the present invention;
FIG. 3 is an explanatory diagram showing a display example of a word string in the voice recognition device according to the first embodiment of the present invention;
FIG. 4 is an explanatory diagram showing a display example of a one-character deletion result in the voice recognition device according to the first embodiment of the present invention;
FIG. 5 is an explanatory diagram showing a display example of a correction result in the speech recognition device according to the first embodiment of the present invention;
FIG. 6 is a block diagram illustrating a configuration example of a voice recognition device according to a second embodiment of the present invention;
FIG. 7 is an explanatory diagram showing a display example of a homophone display window in the voice recognition device according to the second embodiment of the present invention;
FIG. 8 is a block diagram illustrating a configuration example of a voice recognition device according to a third embodiment of the present invention;
FIG. 9 is an explanatory diagram showing a display example of a recognition result in the voice recognition device according to the third embodiment of the present invention;
FIG. 10 is a block diagram illustrating a configuration example of a voice recognition device according to a fourth embodiment of the present invention;
FIG. 11 is an explanatory diagram showing a display example of a recognition result in the voice recognition device according to the fourth embodiment of the present invention;
FIG. 12 is an explanatory diagram showing an example of deleting a recognition result in the voice recognition device according to the fourth embodiment of the present invention;
FIG. 13 is an explanatory diagram illustrating an example of insertion of a recognition result in the speech recognition device according to the fourth embodiment of the present invention;
FIG. 14 is a block diagram illustrating a configuration example of a voice recognition device according to a sixth embodiment of the present invention;
FIG. 15 is an explanatory diagram showing a display example of a recognition result in the voice recognition device according to the sixth embodiment of the present invention;
FIG. 16 is a flowchart illustrating an example of an operation of the voice recognition device according to the first embodiment of the present invention;
FIG. 17 is a flowchart illustrating an example of an operation of the voice recognition device according to the second embodiment of the present invention;
FIG. 18 is a flowchart illustrating an example of an operation of the voice recognition device according to the third embodiment of the present invention;
FIG. 19 is a flowchart showing an example of the operation of the speech recognition device according to the fourth and fifth embodiments of the present invention;
FIG. 20 is a flowchart showing an example of the operation of the speech recognition device according to the sixth embodiment of the present invention.
[Explanation of symbols]
1001 input speech, 2001 speech analysis means, 3001 syllable string recognition means, 3002 acoustic model, 4001 word string search means, 4002 difference model, 4003 language model, 4004 short word language model, 4004a subword language model, 4004b partial utterance language model, 4005 utterance length estimation means, 4005b character insertion position determination means, 4005c user operation determination means, 4006 language model switching means, 4006a switcher, 5001 text buffer, 6001 display means, 6001a text display frame, 6001b edit button frame, 6001b-1 save button, 6001b-2 paste button, 6001b-3 copy button, 6001c cursor, 6001d mouse pointer, 6002 user operation, 6003 correction means, 6004 user operation storage means, 6201 homophone storage means, 6201A subword storage means, 6202 homophone generating means, 6202A word generating means, 6203 homophone display means, 6203A word display means, 7001 user text, P0-P4 cursor positions.

Claims

1. A speech recognition device comprising: first language model storage means for storing word chain statistics; second language model storage means for storing statistics of isolated words, statistics of words cut out from sentence portions, or statistics of subwords; utterance length estimating means for estimating the utterance length of input speech; and search means for converting the input speech into a character string by selectively referring to the statistics stored in the first language model or the second language model according to the utterance length determined by the utterance length estimating means.

2. The speech recognition device according to claim 1, further comprising: homophone generation means for generating homophones from the character string output from the search means; and homophone display means for displaying a selection candidate list of the homophones generated by the homophone generation means.

3. The speech recognition device according to claim 1, further comprising: subword generating means for generating subwords from the character string output from the search means; and subword display means for displaying a selection candidate list of the subwords generated by the subword generating means.

4. The speech recognition device according to claim 1, further comprising: display means for displaying the character string output from the search means; and character insertion position determination means for determining the position within the character string at which a cursor displayed on the display means is located, wherein, when the cursor is determined to be inside the character string, the search means converts the input speech into a character string by selectively referring to the statistics stored in the second language model according to the utterance length.

5. The speech recognition device according to claim 1, further comprising: display means for displaying the character string output from the search means; user operation storage means for storing a user operation for correcting the character string displayed on the display means; and user operation determination means for determining the type of the user operation stored in the user operation storage means, wherein, when the user operation is determined to be an operation of deleting the character string, the search means converts the input speech into a character string by selectively referring to the statistics stored in the second language model according to the utterance length.

6. The speech recognition device according to claim 5, wherein, when the user operation determination means determines that the immediately preceding user operation is a deletion of a predetermined number of characters or fewer from the character string, the search means converts the input speech into a character string by selectively referring to the statistics stored in the second language model according to the utterance length.

7. A speech recognition program for causing a computer to function as: first language model storage means for storing word chain statistics; second language model storage means for storing statistics of isolated words, statistics of words cut out from sentence portions, or statistics of subwords; utterance length estimating means for estimating the utterance length of input speech; and search means for converting the input speech into a character string by selectively referring to the statistics stored in the first language model or the second language model according to the utterance length determined by the utterance length estimating means.
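As a rough illustration of the selective reference recited in the claims, the following toy sketch converts a syllable string into a word string, scoring candidates with word chain statistics for longer utterances and with isolated-word statistics for short ones. Everything concrete here is an assumption made for the example: the function names, the add-one smoothing, the three-syllable threshold, the dynamic-programming search and the tiny lexicon and counts. The claims do not specify any of these details.

```python
import math


def chain_score(counts, vocab_size):
    """Log-probability of a word given the previous word (add-one smoothed bigram counts)."""
    def score(prev, word):
        pair = counts.get((prev, word), 0)
        total = sum(c for (p, _), c in counts.items() if p == prev)
        return math.log((pair + 1) / (total + vocab_size))
    return score


def isolated_score(counts):
    """Log-probability of a word uttered in isolation (add-one smoothed unigram counts)."""
    total = sum(counts.values())
    def score(prev, word):
        return math.log((counts.get(word, 0) + 1) / (total + len(counts)))
    return score


def segment(syllables, lexicon, score):
    """Return the best-scoring word string covering all syllables, or None."""
    n = len(syllables)
    # best[i] maps the last word ending at position i to (total score, word list).
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            reading = "".join(syllables[start:end])
            for word in lexicon.get(reading, []):
                for prev, (total, words) in best[start].items():
                    cand = total + score(prev, word)
                    if word not in best[end] or cand > best[end][word][0]:
                        best[end][word] = (cand, words + [word])
    if not best[n]:
        return None
    return max(best[n].values(), key=lambda t: t[0])[1]


def recognize(syllables, lexicon, chain_counts, isolated_counts, short_limit=3):
    """Pick the statistics by utterance length (number of syllables), then search."""
    if len(syllables) < short_limit:
        score = isolated_score(isolated_counts)   # second language model
    else:
        score = chain_score(chain_counts, vocab_size=len(isolated_counts))  # first
    return segment(syllables, lexicon, score)


# Hypothetical data: two homophones sharing the reading "きしゃ".
LEXICON = {"きしゃ": ["記者", "汽車"]}
CHAIN_COUNTS = {("<s>", "記者"): 5, ("<s>", "汽車"): 1}
ISOLATED_COUNTS = {"汽車": 6, "記者": 2}

short_result = recognize(["き", "しゃ"], LEXICON, CHAIN_COUNTS, ISOLATED_COUNTS)
chain_result = recognize(["き", "しゃ"], LEXICON, CHAIN_COUNTS, ISOLATED_COUNTS, short_limit=1)
```

With this made-up data the two-syllable utterance is treated as short and scored with the isolated-word counts; lowering `short_limit` to 1 forces the word chain counts to be used instead, which changes the homophone that is selected.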