JP3975825B2

JP3975825B2 - Character recognition error correction method, apparatus and program

Info

Publication number: JP3975825B2
Application number: JP2002140463A
Authority: JP
Inventors: 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-05-15
Filing date: 2002-05-15
Publication date: 2007-09-12
Anticipated expiration: 2022-05-15
Also published as: JP2003331214A

Description

【０００１】
【発明の属する技術分野】
本発明は、手書き文字認識において生じる誤りを訂正する技術に係り、特に、文字認識結果としての文字パターンを読み込み、文字の前後の繋がりを表現する言語モデルを用いて入力された文字パターンの誤り部分を検知し正解候補を提示する日本語の文字認識装置の誤りを訂正する方法及び装置に関する。
【０００２】
【従来の技術】
一般に、申込書のような手書き帳票では、漢字表記の住所や氏名を記入する際に、仮名表記の振り仮名を付与する習慣がある。これには、漢字表記と仮名表記が同じ内容を表わすという制約、或いは、同じ内容が漢字表記と仮名表記で表現されているという冗長性を利用することにより、この帳票データを人間が処理する際に誤りを防ぐ、という効果がある。
【０００３】
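For example, when a handwritten form is processed by hand and the kanji notation cannot be determined to be 「水田」 or 「永田」, the kanji notation is known to be 「永田」 if the kana notation can be read as 「ながた」. Similarly, when the kana notation cannot be determined to be 「なかた」 or 「ながた」, the kana notation is known to be 「ながた」 if the kanji notation can be read as 「永田」. Furthermore, when the kanji notation may be 「水田」 or 「永田」 and the kana notation may be 「なかた」 or 「ながた」, 「水田」 cannot be read as 「ながた」, so the only valid interpretation is the pair of the kanji notation 「永田」 and the kana notation 「ながた」.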
【０００４】
ところが、文字認識装置を用いて帳票を自動的に処理しようとする場合、従来の文字認識誤り訂正法は、漢字表記と仮名表記を同時に扱うことにより、両者の間に存在する制約（又は冗長性）を有効に利用しながら誤り訂正を実現することができなかった。その理由は、以下の（１）及び（２）の二つである。
【０００５】
（１）出現する可能性がある漢字表記と仮名表記の組を辞書中に全て登録することは事実上不可能なので、辞書には無くても正しい漢字表記と仮名表記の組が入力に出現する可能性を考慮しなければならない。しかし、このような事象に確率を与える方法は考案されていなかった。
【０００６】
（２）漢字表記と仮名表記の文字認識候補がそれぞれ複数ある場合、漢字表記と仮名表記の対応の可能性は非常に膨大な数になる。しかし、これを系統的に調べるアルゴリズムが考案されていなかった。
【０００７】
従来の文字認識の誤り訂正法は、一つの入力文字列（漢字表記又は仮名表記）に対して、文字ｎｇｒａｍ、すなわち、ｎが２以上の整数を表わすときに、ｎ個の文字からなるｎ連鎖を表現する文字ｎｇｒａｍモデルや、単語ｎｇｒａｍ、すなわち、ｎが２以上の整数を表わすときに、ｎ個の単語からなるｎ連鎖を表現する単語ｎｇｒａｍモデルなどの統計的言語モデルを利用して、誤り訂正を行なう方法が主流である。
【０００８】
文字ｎｇｒａｍモデルを使用する例として、
杉村・斉藤：「文字連接情報を用いた読み取り不能文字の判定処理−文字認識への応用−」電子情報通信学会論文誌 Vol.J68−D No.1, pp.64−71, 1985
が挙げられる。単語ｎｇｒａｍモデルを利用する例としては、
高尾・西野：「日本語文書リーダ後処理の実現と評価」情報処理学会論文誌 Vol.33 No.5, pp.664−670, 1992
伊東・丸山：「ＯＣＲ入力された日本語文の誤り検出と自動訂正」情報処理学会論文誌 Vol.33, No.5, pp.664−670, 1992
永田：「文字類似度と統計的言語モデルを用いた日本語文字認識誤り訂正法」電子情報通信学会論文誌（Ｄ−ＩＩ） Vol.J81−D−II, No.11, pp.2624−2634, 1998
が挙げられる。
【０００９】
これらの方法に対して、近年、漢字表記と仮名表記を同時に利用して誤り訂正を行なう方法として、単漢字とその読みの組を基本単位とする統計的言語モデルに基づいて、漢字表記と仮名表記を対応付ける方法が、
Nagata, M.: Synchronous Morphological Analysis of Grapheme and Phoneme for Japanese OCR, Proceedings of the 38^th Annual Meeting of the Association for Computational Linguistics, pp.384−391, 2000
に提案されている。この文献で提案された方法では、たとえば、「福沢諭吉」と「フクザワユキチ」という漢字表記と仮名表記の文字認識結果の組を入力とする場合、「福／フク」というような単漢字とその読みの組の出現確率に基づいて、漢字表記及び仮名表記のそれぞれに対する複数の文字認識候補の中から「福／フク」、「沢／サワ」、「諭／ユ」、「吉／キチ」という対応関係を求めることにより誤り訂正を行なう。
【００１０】
単漢字とその読みを言語モデルの基本単位とする方法は、氏名のように短い文字列（漢字表記が３−５文字程度）で、かつ、一つの漢字に対する読み方の異なり数が多い場合には有効な方法である。
【００１１】
【発明が解決しようとする課題】
ところが、住所のように長い文字列（漢字表記が１０−１５文字程度）を対象とする場合、単漢字とその読みを言語モデルの基本単位とすると、探索すべき組合せの数が膨大になり計算量が大きくなってしまう、という問題がある。特に、仮名表記から検索される漢字表記の候補の数が問題になる。
【００１２】
たとえば、「神奈川県横須賀市光の丘」という漢字表記と「カナガワケンヨコスカシヒカリノオカ」という仮名表記の組を入力とする場合、「カ」と読む可能性がある漢字は以下に示すように少なくとも２１４個ある。
【００１３】
【表１】

次に、「カナ」と読む可能性がある漢字は以下に示すように少なくとも１７個ある。
【００１４】
【表２】

さらに、「ナ」と読む可能性がある漢字は少なくとも６６個あり、「ナガ」と読む可能性のある漢字は少なくとも１５個ある。
【００１５】
同様に、「神」という漢字の読みは、
シンジンカミカンコウカカグカナカモ
クマココハダマミ
のように少なくとも１４個ある。
【００１６】
文字認識の誤り訂正の場合、認識結果の第１候補だけではなく、下位候補についても考慮しなければならないので、探索空間はさらに大きくなり、住所のような長い文字列では無視できない計算量となる。
【００１７】
単語を統計的言語モデルの基本単位とした場合、文字マトリクス（すなわち、入力文の各文字位置において文字認識スコアの高い順番に文字候補を並べたリスト）に含まれる文字の組合せの大部分は単語を構成しないので、辞書と照合する単語の数は少なくなる。しかし、反対に、未知語候補、すなわち、辞書と照合しない単語候補の数が多くなるので、状況は改善しない。
【００１８】
もし、単語を構成する可能性が高い漢字文字列と仮名文字列の組合せに対して高い確率を与えるような未知語モデルを考案することができれば、未知語候補を絞り込むことによって計算量を削減できるので、住所の漢字表記と仮名表記のような比較的長い文字列の組に対しても、両者を対応付けながら誤り訂正を行なえる。しかし、従来、このような未知語モデルは提案されていない。
【００１９】
上記の従来技術の問題点に鑑みて、本発明は、住所の漢字表記と仮名表記のような比較的長い第１の表記の文字列と第２の表記の文字列の組において、互いに対応する第１の表記と第２の表記の文字認識結果の組を同時に取り扱うことにより、第１の表記と第２の表記の間に存在する制約又は冗長性を利用した誤り訂正を実現する文字認識誤り訂正方法の提供を目的とする。
【００２０】
また、本発明は、このような文字認識誤り訂正方法を実施する装置の提供を目的とする。
【００２１】
さらに、本発明は、このような文字認識誤り訂正方法をコンピュータに実現させるプログラムの提供を目的とする。
【００２２】
【課題を解決するための手段】
本発明は、文字認識装置の出力として漢字表記と仮名表記のような第１の表記及び第２の表記が同時に与えられる場合に、言語モデル、文字認識装置モデル、及び、第１の表記と第２の表記を対応付けながら最適な単語列を探索するアルゴリズムを用いて、同じ内容が第１の表記と第２の表記で表現されているという冗長性を利用することにより、第１の表記又は第２の表記の何れか一方だけでは訂正できない誤りを訂正するものである。
【００２３】
本発明は、任意の第１の表記の文字列と、第２の表記の文字列について、第１の表記の文字列と第２の表記の文字列の同時確率、すなわち、第１の表記の文字列が第２の表記の文字列を第１の表記で表わした文字列であり、かつ、第２の表記の文字列が第１の表記の文字列を第２の表記で表わした文字列である確率を与える言語モデルと、任意の二つの文字について、一方の文字が他方の文字に誤認識される確率を与える文字認識装置モデルと、言語モデル及び文字認識装置モデルに基づいて、最も確率が大きい単語列、すなわち、最も確率が大きい第１の表記の単語列と第２の表記の単語列の組を求める最適単語列検索手段と、を用いて、第１の表記の文字列と第２の表記の文字列を対応付けることにより、第１の表記の文字列又は第２の表記の文字列の一方の文字列だけでは訂正できない誤りを訂正する文字認識誤り訂正方法を提供する。
【００２４】
本発明（請求項１）は、漢字列と、その漢字列と同じ内容を表す仮名列の組を文字認識した結果が入力として与えられ、この入力中の文字認識誤りを訂正する文字認識誤り訂正装置において、
単語の漢字表記と仮名表記の組を記憶する単語辞書と、
単語の漢字表記と仮名表記の組のｎｇｒａｍの出現確率を与える単語ｎｇｒａｍモデルと、
単語辞書に登録されていない任意の漢字列と仮名列の組が単語を構成する確率である未知語確率を与える未知語モデルと、
文字認識装置において、入力文字ｃｉが文字ｃｊと認識される確率である文字混同確率を与える文字認識装置モデルと、
漢字表記と仮名表記のそれぞれに対し、各文字位置において文字認識スコアの高い順番に文字候補を並べたリストを文字数分並べたリストである文字マトリクスを入力する入力手段と、
漢字表記の文字マトリクスに含まれる漢字列と、仮名表記の文字マトリクスに含まれる仮名列の組と完全一致する単語辞書中の単語を検索し、一致単語を同定する単語照合手段と、
漢字表記の文字マトリクスに含まれる漢字列と仮名表記の文字マトリクスに含まれる仮名列の組の中で、単語辞書に登録されていない単語である未知語の候補を、未知語モデルを用いて未知語確率が大きい所定個数の未知語を生成する未知語候補生成手段と、
漢字表記の文字マトリクスに含まれる漢字列と仮名表記の文字マトリクスに含まれる仮名列の組と類似照合する単語辞書中の単語を検索し、類似単語を同定する類似単語照合手段と、
単語ｎｇｒａｍモデル、未知語モデルと文字認識モデルに基づいて、一致単語、未知語、類似単語の中で、入力された漢字表記と仮名表記の組に対して、単語ｎｇｒａｍ出現確率と文字混同確率の積を最大化する漢字表記と仮名表記の組である単語列を求める最適単語列探索手段と、を備え、
未知語モデルは、
単語タイプの出現頻度から単語タイプの出現確率である単語タイプ確率を求める単語タイプ確率計算手段と、
単語の漢字表記を構成する文字の種類に基づいて定義された単語タイプのいずれかに未知語を分類する単語タイプ判定手段と、
各単語タイプの漢字表記の長さの分布と仮名表記の長さの分布と、未知語の単語タイプに基づいて、当該未知語の漢字表記の長さの確率と仮名表記の長さの確率の積である単語長確率を求める単語長確率計算手段と、
漢字表記の文字ｎｇｒａｍ頻度から漢字列の出現確率を求め、仮名表記の文字のｎｇｒａｍ頻度から仮名列の出現確率を求め、未知語の同時確率である単語表記確率を求める単語表記確率計算手段と、
未知語の単語タイプ確率、単語長確率、及び単語表記確率の積により未知語の未知語確率を計算する未知語確率計算手段と、を有する。
【００２５】
漢字列と、その漢字列と同じ内容を表す仮名列の組を文字認識した結果が入力として与えられ、この入力中の文字認識誤りを訂正する文字認識誤り訂正方法において、
単語の漢字表記と仮名表記の組を記憶する単語辞書と、
前記単語の漢字表記と前記仮名表記の組のｎｇｒａｍの出現確率を与える単語ｎｇｒａｍモデルと、
前記単語辞書に登録されていない任意の漢字列と仮名列の組が単語を構成する確率である未知語確率を与える未知語モデルと、
文字認識装置において、入力文字ｃｉが文字ｃｊと認識される確率である文字混同確率を与える文字認識装置モデルと、を有する装置において、
入力手段が、前記漢字表記と前記仮名表記のそれぞれに対し、各文字位置において文字認識スコアの高い順番に文字候補を並べたリストを文字数分並べたリストである文字マトリクスを入力する入力ステップと、
単語照合手段が、前記漢字表記の文字マトリクスに含まれる漢字列と、前記仮名表記の文字マトリクスに含まれる仮名列の組と完全一致する前記単語辞書中の単語を検索し、一致単語を同定する単語照合ステップと、
未知語候補生成手段が、前記漢字表記の文字マトリクスに含まれる漢字列と前記仮名表記の文字マトリクスに含まれる仮名列の組の中で、前記単語辞書に登録されていない単語である未知語の候補を、前記未知語モデルを用いて未知語確率が大きい所定個数の未知語を生成する未知語候補生成ステップと、
類似単語照合手段が、前記漢字表記の文字マトリクスに含まれる漢字列と前記仮名表記の文字マトリクスに含まれる仮名列の組と類似照合する前記単語辞書中の単語を検索し、類似単語を同定する類似単語照合ステップと、
最適単語列探索手段が、前記単語ｎｇｒａｍモデル、前記未知語モデルと前記文字認識装置モデルに基づいて、前記一致単語、前記未知語、前記類似単語の中で、入力された漢字表記と仮名表記の組に対して、単語ｎｇｒａｍ出現確率と文字混同確率の積を最大化する漢字表記と仮名表記の組である単語列を求める最適単語列探索ステップと、
を行い、
前記未知語モデルにおいて、
単語タイプ確率計算手段が、単語タイプの出現頻度から単語タイプの出現確率である単語タイプ確率を求めるステップと、
単語タイプ判定手段が、単語の漢字表記を構成する文字の種類に基づいて定義された単語タイプのいずれかに未知語を分類するステップと、
単語長確率計算手段が、各単語タイプの漢字表記の長さの分布と仮名表記の長さの分布と、未知語の単語タイプに基づいて、当該未知語の漢字表記の長さの確率と仮名表記の長さの確率の積である単語長確率を求めるステップと、
単語表記確率計算手段が、漢字表記の文字ｎｇｒａｍ頻度から漢字列の出現確率を求め、仮名表記の文字のｎｇｒａｍ頻度から仮名列の出現確率を求め、未知語の同時確率である単語表記確率を求めるステップと、
未知語確率計算手段が、未知語の前記単語タイプ確率、前記単語長確率、及び前記単語表記確率の積により未知語の未知語確率を計算する未知語確率計算ステップと、を行う。
【００２６】
本発明（請求項３）は、請求項２記載の文字認識誤り訂正方法をコンピュータで実現するためのプログラムである。
【００３３】
【発明の実施の形態】
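FIG. 1 is a block diagram of a character recognition error correction system according to a first embodiment of the present invention. The system of this embodiment includes a character recognition device 1, which recognizes Japanese kanji notation and Japanese kana notation, and a character recognition error correction device 100, which corrects character recognition errors contained in the result of recognizing a pair of kanji notation and kana notation with the character recognition device 1. Because the first embodiment assumes, for example, recognition of the address field of a handwritten form such as an application form, the kanji notation may contain katakana, hiragana, Roman letters, and alphanumeric characters in addition to kanji, and the kana notation in the furigana field may contain Roman letters and alphanumeric characters in addition to katakana or hiragana. In the following description, for simplicity, the kanji notation is assumed to consist only of kanji and the kana notation only of katakana.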
【００３４】
文字認識誤り訂正装置１００は、文字認識装置１からの文字認識結果として、漢字表記の文字マトリクスと、仮名表記の文字マトリクスを入力する。文字マトリクスとは、入力の各文字位置において文字認識スコアの高い順番に文字候補を並べたリストを、文字数分だけ並べたリストである。また、以下では、文字マトリクスの各文字位置において、その文字位置の文字候補のリストから一文字ずつ選ぶことにより構成される文字列を、「文字マトリクスに含まれる文字列」と呼ぶ。
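The character matrix described above, and the notion of a "string contained in the character matrix", can be illustrated with a minimal Python sketch. The function name `strings_in_matrix` and the second candidates 「丼」 and 「布」 are hypothetical illustrations; only the pair 「福」/「禍」 appears in the source.

```python
from itertools import product


def strings_in_matrix(matrix, start, length):
    """Enumerate every string of `length` characters that starts at
    position `start` and picks one candidate per character position."""
    columns = matrix[start:start + length]
    if len(columns) < length:
        return []  # the requested span runs past the end of the matrix
    return ["".join(chars) for chars in product(*columns)]


# One list of candidates per character position, best score first.
kanji_matrix = [["福", "禍"], ["井", "丼"], ["市", "布"]]
print(strings_in_matrix(kanji_matrix, 0, 2))
```

With k candidates per position, a span of length n contains k^n strings, which is why the later steps restrict attention to dictionary matches and a bounded number of unknown word candidates.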
【００３５】
文字認識誤り訂正装置１００は、
文字認識装置１から漢字表記の文字マトリクス及び仮名表記の文字マトリクスを受け取る最適単語列検索部２と、
単語の漢字表記と仮名表記の組を記憶する単語辞書７と、
漢字表記の文字マトリクスに含まれる漢字列及び仮名表記の文字マトリクスに含まれる仮名列の組と完全一致する単語を単語辞書７から検索する単語照合部３と、
単語辞書７に登録されていない任意の漢字列と仮名列の組が単語を構成する確率を与える未知語モデル８と、
漢字表記の文字マトリクスに含まれる漢字列及び仮名表記の文字マトリクスに含まれる仮名列の組から、未知語モデル８を用いて、未知語候補を生成する未知語候補生成部４と、
任意の二つの文字について、一方の文字が他方の文字に誤認識される確率、すなわち、文字混同確率を与える文字認識装置モデル１０と、
漢字表記の文字マトリクスに含まれる漢字列及び仮名表記の文字マトリクスに含まれる仮名列の組と類似照合する単語を単語辞書７から検索する類似単語照合部５と、
単語の漢字表記と仮名表記の組のｎ個の連鎖、すなわち、ｎｇｒａｍの出現確率を与える単語ｎｇｒａｍモデル６と、
単語ｎｇｒａｍモデル６、単語辞書７及び未知語モデル８を含み、任意の漢字列と仮名列について、漢字列が仮名列の漢字表記であり、かつ、仮名列が漢字列の仮名表記である確率、すなわち、漢字表記と仮名表記の同時確率を与える言語モデル９と、
を具備する。
【００３６】
また、最適単語列探索部２は、単語照合部３から単語辞書７と完全一致した単語を受け、未知語候補生成部４から未知語候補を受け、及び、類似単語照合部５から単語辞書７と類似照合した類似単語を受け、言語モデル９及び文字認識装置モデル１０に基づいて、最も確率が大きい単語列、すなわち、漢字表記と仮名表記の組を求める。
【００３７】
図２は、本発明の第１実施例による文字認識誤り訂正システムにおける文字認識誤り訂正方法を説明するフローチャートである。
【００３８】
文字認識誤り訂正方法は、文字認識装置１で漢字表記と仮名表記の組を文字認識することにより得られた漢字表記の文字マトリクスと仮名表記の文字マトリクスの組を入力として用いる。
【００３９】
ステップ１において、単語照合部３は、漢字表記の文字マトリクスに含まれる漢字文字列と仮名表記の文字マトリクスに含まれる仮名文字列の組と完全一致する単語辞書７中の単語、すなわち、一致単語を同定する。
【００４０】
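In step 2, the unknown word candidate generation unit 4 generates unknown words, i.e., candidates for words that are not registered in the word dictionary 7, from among the pairs of kanji strings contained in the kanji character matrix and kana strings contained in the kana character matrix.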
【００４１】
ステップ３において、類似単語照合部５は、漢字表記の文字マトリクスに含まれる漢字文字列及び仮名表記の文字マトリクスに含まれる仮名文字列の組と類似照合する単語辞書７中の単語、すなわち、類似単語を同定する。
【００４２】
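Finally, in step 4, the optimal word sequence search unit 2 finds, from among the matched words, unknown words, and similar words, the most probable word sequence, i.e., the most probable pair of kanji notation and kana notation, based on the language model 9 and the character recognition device model 10.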
【００４３】
上記説明では、ステップ１、ステップ２、ステップ３の順に処理が行なわれているが、ステップ１、ステップ２、ステップ３を実行する順序は、このような順に制限されることはなく、どのような順序で行なっても構わない。
【００４４】
このような本発明の第１実施例の文字認識誤り訂正システムの構成によれば、漢字表記と仮名表記を対応付けることが可能になり、漢字表記又は仮名表記のいずれか一方のみからでは訂正できない誤りを訂正し、入力が辞書に登録されていない単語を含む場合でも未知語モデル８に基づいて、漢字列と仮名列の組の出現確率を推定し、正解文字が文字マトリクスに含まれていない場合でも類似単語照合によって誤り訂正候補を提示できるようになる。
【００４５】
図３は、本発明の第２実施例による文字認識誤り訂正システムの概略ブロック図である。次に、図３を参照して、本発明の第２実施例の構成を説明する。図３において、図１の構成要素と同じ参照番号を付された構成要素は、図１における対応した構成要素と同一若しくは類似した構成要素である。
【００４６】
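The character recognition error correction system according to the second embodiment of the present invention includes a character recognition device 1, an optimal word sequence search unit 2, a word matching unit 3, an unknown word candidate generation unit 4, a similar word matching unit 5, a language model 9, and a character recognition device model 10. The language model 9 includes a word bigram model 60, a word dictionary 7, and an unknown word model 8. The word bigram model 60 includes a word bigram frequency table 61 and a word bigram probability calculation unit 62. The unknown word model 8 includes an unknown word probability calculation unit 81, a word type determination unit 82, a word type definition storage unit 83, a word length probability calculation unit 84, an average word length table 85, a word notation probability calculation unit 86, and a character bigram frequency table 87. The character recognition device model 10 includes a character confusion probability calculation unit 11 and a character recognition device accuracy table 12.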
【００４７】
ここで、単語ｂｉｇｒａｍは、２個の単語からなる２連鎖を表わし、文字ｂｉｇｒａｍは２個の文字からなる２連鎖を表わす。
【００４８】
最適単語列探索部２は、入力された漢字表記と仮名表記の組に対して文字認識装置１が出力した漢字表記マトリクスと仮名表記マトリクスの組を入力とし、二つの文字マトリクスのそれぞれについて、文頭から文末へ一文字ずつ進む動的計画法（dynamic programming）を用いて、単語列の同時確率、すなわち、単語ｂｉｇｒａｍ確率の積と、文字混同確率との積を最大化するような単語列を求める。
【００４９】
そのため、最適単語列探索部２には、単語照合部３からの完全一致単語と、未知語候補生成部４からの未知語候補と、類似単語照合部５からの類似単語候補とが、単語候補として与えられる。単語候補には、文字混同確率計算部１１によって、単語の漢字表記及び仮名表記を構成する文字混同確率が与えられる。また、単語ｂｉｇｒａｍ確率は、単語ｂｉｇｒａｍ確率計算部６２によって与えられる。
【００５０】
単語照合部３は、漢字表記の文字マトリクスに含まれる文字列と仮名表記の文字マトリクスに含まれる文字列の全ての組合せを単語辞書７と照合し、照合したものを完全一致単語として、最適単語列検索部２へ与える。ここで、文字マトリクスの各文字位置において、その文字位置の文字候補のリストから一文字ずつ選ぶことにより構成される文字列を、「文字マトリクスに含まれる文字列」と呼ぶ。
【００５１】
未知語候補生成部４は、漢字表記の文字マトリクスに含まれる文字列と仮名表記の文字マトリクスに含まれる文字列の組合せの中で、単語辞書７と照合しない組合せを未知語であるとみなし、未知語モデル確率計算部８１により求めた未知語確率が大きい順に予め定めた個数の未知語を、未知語候補として最適単語列検索部２へ与える。
【００５２】
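The similar word matching unit 5 performs similarity matching of all combinations of strings contained in the kanji character matrix and strings contained in the kana character matrix against the word dictionary 7, and passes the matched entries to the optimal word sequence search unit 2 as similar word candidates. As the distance measure for similarity matching, the edit distance can be used, i.e., the number of insertions, deletions, and substitutions needed to convert one string into the other (or, normalized by the string length, the proportion of characters that do not match).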
【００５３】
単語ｂｉｇｒａｍ確率計算部６２は、単語ｂｉｇｒａｍ頻度テーブル６１に付与された単語ｂｉｇｒａｍ頻度から、単語ｂｉｇｒａｍの出現確率を計算する。
【００５４】
単語タイプ判定部８２は、単語タイプ定義８３に基づいて、未知語候補の漢字表記を構成する文字の種類から未知語の単語タイプを決定する。
【００５５】
単語長確率計算部８４は、平均単語長テーブル８５に記憶された各単語タイプの漢字表記と仮名表記の平均単語長から、未知語候補の漢字表記の長さと仮名表記の長さの同時確率を求める。
【００５６】
単語表記確率計算部８６は、文字ｂｉｇｒａｍ頻度テーブル８７に記憶された漢字表記の文字ｂｉｇｒａｍ頻度及び仮名表記文字の文字ｂｉｇｒａｍ頻度から、未知語候補の漢字表記の文字列と仮名表記の文字列の同時確率を求める。
【００５７】
文字混同確率計算部１１は、文字認識装置正解率テーブル１２に格納されている第１候補正解率と累積正解率から、文字混同確率を計算する。
【００５８】
かくして、最適単語列探索部２は、単語ｂｉｇｒａｍ確率の積と、文字混同確率との積を最大化するような単語列（漢字表記と仮名表記の組）を求めることができる。
【００５９】
以下では、まず、本発明の理論的基礎である「文字認識誤り訂正の情報理論的解釈」について説明し、続いて、言語モデル、未知語モデル、文字認識装置モデル、最適単語列探索手段、類似単語照合手段の順に説明する。
【００６０】
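[1] Information-theoretic interpretation of character recognition error correction
In the third embodiment of the present invention, the relationship between the input and the output of the character recognition device is formulated as a noisy channel model. Let G and P denote the input kanji notation (graphemes), which is the first notation, and the input kana notation (phonemes), which is the second notation, and let G' and P' denote the corresponding outputs of the character recognition device. Error correction in the present invention reduces to the problem of finding the kanji notation
【００６１】
【数１】
\( \hat{G} \)
and the kana notation
【００６２】
【数２】
\( \hat{P} \)
that maximize the posterior probability P(G, P | G', P'). Furthermore, by Bayes' theorem, it suffices to find the pair of kanji notation and kana notation
【００６３】
【数３】
\( (\hat{G}, \hat{P}) \)
that maximizes
P(G, P) P(G', P' | G, P).
【００６４】
【数４】
\[ (\hat{G}, \hat{P}) = \arg\max_{G,P} P(G, P \mid G', P') = \arg\max_{G,P} P(G, P)\, P(G', P' \mid G, P) \tag{1} \]
Here, P(G, P) is called the language model and P(G', P' | G, P) is called the character recognition device model.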
【００６５】
〔２〕言語モデル
漢字表記及び仮名表記が、それぞれ、
長さｌの文字列Ｇ＝α_１α_２．．．α_ｌ
及び
長さｍの文字列Ｐ＝β_１β_２．．．β_ｍ
から構成されるとする。さらに、単語の漢字表記と仮名表記を対応付けることにより、漢字表記と仮名表記の組が長さｎの単語列
（Ｇ，Ρ）＝（（ｇ_１，ｐ_１），（ｇ_２，ｐ_２），．．．，（ｇ_ｎ，ｐ_ｎ））
に分割されるとする。
【００６６】
たとえば、漢字表記「神奈川県横須賀市光の丘」と仮名表記「カナガワケンヨコスカシヒカリノオカ」の組は、単語列（（神奈川県，カナガワケン），（横須賀市，ヨコスカシ），（光の丘，ヒカリノオカ））に対応付けられ分割される。
【００６７】
本発明では、言語モデルＰ（Ｇ，Ｐ）を、漢字表記Ｇと仮名表記Ｐを対応付ける最も尤もらしい単語列Ｐ（Ｇ，Ρ）の同時確率で近似する。さらに、この単語列の同時確率Ｐ（Ｇ，Ρ）を後述の単語ｂｉｇｒａｍモデルで近似する。
【００６８】
〔３〕単語ｂｉｇｒａｍモデル
一般に、単語Ｎ−ｇｒａｍは、Ｎが２以上の整数を表わすときに、Ｎ個の単語からなるＮ連鎖を表わす。一例として、単語ｂｉｇｒａｍは、２個の単語からなる単語連鎖である。単語ｂｉｇｒａｍモデルは、次式のように単語ｂｉｇｒａｍ確率Ｐ（ｇ_ｉ，ｐ_ｉ｜ｇ_ｉ−１，ｐ_ｉ−１）の積で単語列の同時確率を近似する。
【００６９】
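【数５】
\[ P(G, P) \approx \prod_{i=1}^{n+1} P(g_i, p_i \mid g_{i-1}, p_{i-1}), \qquad (g_0, p_0) = \langle bos \rangle,\ (g_{n+1}, p_{n+1}) = \langle eos \rangle \tag{2} \]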
ここで、＜ｂｏｓ＞及び＜ｅｏｓ＞は、文の先頭及び末尾を表わす特別な記号である。
【００７０】
単語ｂｉｇｒａｍ確率を求める方法（すなわち、単語ｂｉｇｒａｍ確率計算手段）は以下の通りである。
【００７１】
先ず、漢字仮名混じり表記の日本語テキストを単語に分割し、仮名表記の読みを付与したデータを作成する。以降、このデータを学習データと呼ぶ。
【００７２】
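From this training data, the occurrence frequency C(g_i, p_i) of each pair of kanji notation and kana notation of a word, and the occurrence frequency C(g_i-1, p_i-1, g_i, p_i) of each bigram of such pairs, are obtained and stored in the word bigram frequency table.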
【００７３】
図４は単語出現頻度の例の説明図であり、図５は単語ｂｉｇｒａｍ出現頻度の例の説明図である。図示された例では、’／’で区切られた漢字列と仮名列の組が一つの単語を表わす。
【００７４】
次に、これらの出現頻度から単語の相対頻度ｆ（ｇ_ｉ，ｐ_ｉ）及び単語ｂｉｇｒａｍの相対頻度ｆ（ｇ_ｉ，ｐ_ｉ｜ｇ_ｉ−１，ｐ_ｉ−１）を求める。
【００７５】
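【数６】
\[ f(g_i, p_i) = \frac{C(g_i, p_i)}{\sum_{g,p} C(g, p)} \tag{3} \]
\[ f(g_i, p_i \mid g_{i-1}, p_{i-1}) = \frac{C(g_{i-1}, p_{i-1}, g_i, p_i)}{C(g_{i-1}, p_{i-1})} \tag{4} \]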
次に、これらの相対頻度を線形補間して単語ｂｉｇｒａｍ組確率を求める。ここで、線形補間係数αは訓練データの確率が最大になるように決定する。
Ｐ（ｇ_ｉ，ｐ_ｉ｜ｇ_ｉ−１，ｐ_ｉ−１）＝
（１−α）ｆ（ｇ_ｉ，ｐ_ｉ）＋αｆ（ｇ_ｉ，ｐ_ｉ｜ｇ_ｉ−１，ｐ_ｉ−１）（５）
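The interpolated word bigram probability of equation (5) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the toy training data (including the entry 「敦賀市／ツルガシ」), and the fixed value `alpha=0.9` are assumptions. The source instead chooses α so as to maximize the probability of the training data.

```python
from collections import Counter


def train_counts(sentences):
    """Count words and word bigrams; each word is a (kanji, kana) pair."""
    unigram, bigram = Counter(), Counter()
    for words in sentences:
        seq = ["<bos>"] + list(words) + ["<eos>"]
        unigram.update(seq)
        bigram.update(zip(seq, seq[1:]))
    return unigram, bigram


def bigram_prob(prev, word, unigram, bigram, alpha=0.9):
    """Equation (5): (1 - alpha) * f(word) + alpha * f(word | prev)."""
    total = sum(unigram.values())
    f_uni = unigram[word] / total if total else 0.0
    f_bi = bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    return (1.0 - alpha) * f_uni + alpha * f_bi


# Hypothetical toy training data (segmented text with readings).
sentences = [
    [("福井県", "フクイケン"), ("福井市", "フクイシ")],
    [("福井県", "フクイケン"), ("敦賀市", "ツルガシ")],
]
unigram, bigram = train_counts(sentences)
p = bigram_prob(("福井県", "フクイケン"), ("福井市", "フクイシ"), unigram, bigram)
print(p)
```

Because the unigram term is never zero for a word seen in training, a bigram that was not observed still receives a small non-zero probability.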
〔４〕未知語モデル
未知語モデルは、単語辞書に登録されていない漢字表記と仮名表記の組の出現確率を求めるための計算モデルである。これは、未知語（ｇ，ｐ）を構成する
長さｌ_ｇの漢字表記文字列ｇ＝ｃｇ_１．．．ｃｇ_ｌｇ
と、
長さｌ_ｐの仮名表記文字列ｐ＝ｃｐ_１．．．ｃｐ_ｌｐ
の同時確率分布Ｐ（ｇ，ｐ）として定義される。
【００７６】
本発明の一実施例では、単語の漢字表記を構成する文字の種類に基づいて複数の単語タイプを定義し、単語タイプ別に未知語の確率を推定する。
【００７７】
図６は、日本語の未知語を７種類の単語タイプに分類した場合の例の説明図である。単語タイプの定義は、バッカス記法（Backus Naur Form, BNF）で記述されており、ここで、［・・・］は、文字集合中の任意の１文字と照合することを表わす。二つの文字の間に、−を書くことで文字範囲を表わす。文字コードには、ＪＩＳ−Ｘ−０２０８を仮定している。＊は０回以上の繰り返しを表わし、＋は１回以上の繰り返しを表わす。
【００７８】
＜ｓｙｍ＞、＜ｎｕｍ＞、＜ａｌｐｈａ＞、＜ｈｉｒａ＞、＜ｋａｔａ＞、及び、＜ｋａｎ＞は、それぞれ記号列、数字列、アルファベット列、ひらがな列、カタカナ列、及び、漢字列を表わす。これら以外の複数の字種から構成される文字列は、すべて＜ｍｉｓｃ＞とする。
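The word type classification can be sketched in Python with regular expressions. The definitions in FIG. 6 are written against JIS X 0208 and are not reproduced in this text, so the Unicode ranges below are assumptions chosen for illustration, as is the function name `word_type`.

```python
import re

# Assumed Unicode approximations of the character classes in FIG. 6.
WORD_TYPES = [
    ("<num>", re.compile(r"[0-9０-９]+")),
    ("<alpha>", re.compile(r"[A-Za-zＡ-Ｚａ-ｚ]+")),
    ("<hira>", re.compile(r"[\u3041-\u3096]+")),
    ("<kata>", re.compile(r"[\u30a1-\u30fa\u30fc]+")),
    ("<kan>", re.compile(r"[\u4e00-\u9fff\u3005]+")),
    ("<sym>", re.compile(r"[^\w\s]+")),
]


def word_type(notation):
    """Return the word type of a string, or <misc> for mixed scripts."""
    for name, pattern in WORD_TYPES:
        if pattern.fullmatch(notation):
            return name
    return "<misc>"


for s in ["糸崎", "フクイ", "光の丘", "123"]:
    print(s, word_type(s))
```

A string is assigned a single-script type only when every character belongs to that script, so a mixed string such as 「光の丘」 falls through to `<misc>`.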
【００７９】
本発明の第３実施例では、単語タイプ確率、すなわち、未知語における各単語タイプ＜ＷＴ＞の出現確率Ｐ（＜ＷＴ＞）、及び、単語タイプ別の未知語の漢字表記と仮名表記の同時確率Ｐ（ｇ，ｐ｜＜ＷＴ＞）の積から、未知語の出現確率（未知語の漢字表記と仮名表記の同時確率）Ｐ（ｇ，ｐ）を求める。
Ｐ（ｇ，ｐ）＝Ｐ（＜ＷＴ＞）Ｐ（ｇ，ｐ｜＜ＷＴ＞）（６）
単語タイプ確率は、学習データにおける低頻度語（出現頻度が１の単語）を単語タイプに分類し、それぞれ単語タイプの相対頻度から求める。
【００８０】
単語タイプ別の未知語の出現確率Ｐ（ｇ，ｐ｜＜ＷＴ＞）は、単語タイプ別の漢字表記の長さと仮名表記の長さの同時確率Ｐ（ｌ_ｇ，ｌ_ｐ｜＜ＷＴ＞）、漢字表記の文字列の出現確率Ｐ（ｇ）、及び、仮名表記の文字列の出現確率Ｐ（ｐ）の積で近似する。
【００８１】
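【数７】
\[ P(g, p \mid \langle WT \rangle) \approx P(l_g, l_p \mid \langle WT \rangle)\, P(g)\, P(p) \tag{7} \]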
以下では、単語タイプ別の漢字表記の長さと仮名表記の長さの同時確率Ｐ（ｌ_ｇ，ｌ_ｐ｜＜ＷＴ＞）を単語長確率と呼ぶ。
【００８２】
単語長確率は、単語タイプ別の漢字表記の長さの分布と単語タイプ別の仮名表記の長さの分布の積で近似し、漢字表記及び仮名表記の長さの分布は、それぞれ、単語タイプ別の漢字表記の平均文字長λ_{ｇ，＜ＷＴ＞}及び仮名表記の平均文字長λ_{ｐ，＜ＷＴ＞}をパラメータとするポワソン分布で近似する。
【００８３】
【数８】

単語タイプ別の漢字表記の平均文字長λ_{ｇ，＜ＷＴ＞}及び仮名表記の平均文字長λ_{ｐ，＜ＷＴ＞}は、学習データにおける低頻度語から求める。図７は、単語タイプ別の漢字表記と仮名表記の平均文字長の例の説明図である。
【００８４】
漢字表記の文字列の出現確率Ｐ（ｇ）は、漢字表記に使用される文字の文字ｂｉｇｒａｍで近似する。
【００８５】
【数９】

仮名表記の文字列の出現確率Ｐ（ｐ）は、仮名表記に使用される文字の文字ｂｉｇｒａｍモデルで近似する。
【００８６】
【数１０】

漢字表記及び仮名表記の文字ｂｉｇｒａｍ確率は、単語ｂｉｇｒａｍ確率と同様に、学習データから求めた文字出現頻度と文字ｂｉｇｒａｍ出現頻度からそれぞれの相対確率を求め、この相対確率を線形補間することにより得られる。図８は、漢字表記の文字ｂｉｇｒａｍの出現頻度の例を示し、図９は、仮名表記の文字ｂｉｇｒａｍの出現頻度の例を示す図である。
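Putting equations (6) and (7) together, the unknown word probability is the product of the word type probability, the word length probability, and the two string probabilities. The following Python sketch shows that product under several assumptions that are not confirmed by the source: a plain Poisson distribution with mean λ is used for the lengths (the exact form of the source's length equation is not recoverable here), the word boundary symbols `<bow>` and `<eow>` are hypothetical, and the character bigram models are passed in as functions.

```python
import math


def poisson(k, lam):
    """Poisson probability of length k with mean lam (assumed form)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def string_prob(chars, bigram_prob):
    """Character bigram approximation of a string probability."""
    seq = ["<bow>"] + list(chars) + ["<eow>"]  # hypothetical boundary marks
    prob = 1.0
    for prev, cur in zip(seq, seq[1:]):
        prob *= bigram_prob(prev, cur)
    return prob


def unknown_word_prob(g, p, wt, type_prob, mean_len, kanji_bigram, kana_bigram):
    """P(g, p) = P(<WT>) * P(l_g, l_p | <WT>) * P(g) * P(p)."""
    lam_g, lam_p = mean_len[wt]
    length_prob = poisson(len(g), lam_g) * poisson(len(p), lam_p)
    return (type_prob[wt] * length_prob
            * string_prob(g, kanji_bigram) * string_prob(p, kana_bigram))


# Hypothetical parameters; real values would come from training data.
type_prob = {"<kan>": 0.5}
mean_len = {"<kan>": (2.0, 4.0)}
uniform = lambda prev, cur: 0.01
prob = unknown_word_prob("糸崎", "イトザキ", "<kan>", type_prob, mean_len,
                         uniform, uniform)
print(prob)
```

Ranking unknown word candidates by this value and keeping only a fixed number of the best ones is what keeps the search space manageable for long strings.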
【００８７】
〔５〕文字認識装置モデル
本発明の第３実施例では、文字認識装置モデルＰ（Ｇ’，Ｐ’｜Ｇ，Ｐ）に関して、漢字表記Ｇと仮名表記Ｐが独立に認識され、さらに漢字表記の各文字ｃｇ_ｉと仮名表記の各文字ｃｐ_ｊが独立に認識されると仮定する。
【００８８】
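【数１１】
\[ P(G', P' \mid G, P) \approx \prod_{i=1}^{l} P(cg'_i \mid cg_i) \prod_{j=1}^{m} P(cp'_j \mid cp_j) \tag{11} \]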
一般に、文字認識装置において、入力された文字ｃ_ｉが文字ｃ_ｊと認識される確率Ｐ（ｃ_ｊ｜ｃ_ｉ）は、文字混同確率（character confusion probability)と呼ばれる。文字混同確率は、基本的には、文字認識装置の入力と出力の組の頻度データである文字混同行列（character confusion matrix)から求めることができる。しかし、文字混同行列は、文字認識法が入力画像の品質に大きく依存するので汎用性が低い。また、日本語は文字の種類が３０００字以上あるので、すべての文字について十分に多くの文字認識結果を集めることはできない。
【００８９】
そこで、本発明の第３実施例では、文字認識装置が出力する第１候補の正解率ｐ_１、及び、第ｎ候補までの累積正解率ｐ_ｎ（ｎは文字認識装置が出力する文字候補の数）をパラメータとして、文字混同確率を以下のように近似する。
【００９０】
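【数１２】
\[ P(c_j \mid c_i) \approx \begin{cases} p_1 & \text{if the character is the first candidate} \\ \dfrac{p_n - p_1}{n - 1} & \text{if the character is among the second to } n\text{-th candidates} \\ \dfrac{1 - p_n}{|C| - n} & \text{otherwise} \end{cases} \tag{12} \]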
ここで、｜Ｃ｜は認識対象となる文字集合の大きさである。たとえば、漢字の場合、｜Ｃ_ｇ｜＝６８７９、仮名の場合、｜Ｃ_ｐ｜＝８７に設定すればよい。
【００９１】
式（１２）は、文字混同確率として、もしその文字が第１候補であるならば、文字に関係なく一定の値（第１候補の平均正解率）ｐ_１を割り当てる。もし、その文字が第２候補以降の候補文字の中に入っていれば、累積正解率から第１候補正解率を差し引いた残りを均等に割り当てる。もし、その文字が文字候補になければ、１から累積正解率を引いたものを、候補文字以外の文字集合に対して均等に割り当てる。
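The three cases of equation (12) translate directly into code. The Python sketch below is illustrative: the function name and the accuracy values `p1=0.9` and `pn=0.95` are assumptions, while the character set size 6879 for kanji is taken from the text.

```python
def confusion_prob(char, candidates, p1, pn, charset_size):
    """Approximate character confusion probability of equation (12)."""
    n = len(candidates)
    if candidates and char == candidates[0]:
        return p1  # first candidate: average first-candidate accuracy
    if char in candidates[1:]:
        return (pn - p1) / (n - 1)  # shared evenly among lower candidates
    return (1.0 - pn) / (charset_size - n)  # shared among all other characters


candidates = ["福", "禍"]  # recognizer output for one position, best first
for ch in ["福", "禍", "副"]:
    print(ch, confusion_prob(ch, candidates, p1=0.9, pn=0.95, charset_size=6879))
```

The design needs only two measured numbers per recognizer, which avoids collecting a full confusion matrix over several thousand character classes.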
【００９２】
〔６〕最適単語列探索
式（１）に示す事後確率を最大化する漢字表記と仮名表記の組を求める手順（最適単語列探索手段）を以下に示す。
【００９３】
入力された漢字表記及び仮名表記を、それぞれ、
長さｌの文字列Ｇ＝α_１α_２．．．α_ｌ
及び
長さｍの文字列Ｐ＝β_１β_２．．．β_ｍ
とする。
【００９４】
漢字表記中の文字位置及び仮名表記中の文字位置を、それぞれ、ｘ（０≦ｘ≦ｌ）及びｙ（０≦ｙ≦ｍ）で表わすことにすると、漢字表記と仮名表記を対応付けることにより得られる長さｎの単語列
（Ｇ，Ρ）＝（（ｇ_１，ｐ_１），（ｇ_２，ｐ_２），．．．，（ｇ_ｎ，ｐ_ｎ））
は、単語の境界の座標の列（長さｎ＋１）
（ｘ_０，ｙ_０），（ｘ_１，ｙ_１），（ｘ_２，ｙ_２），．．．，（ｘ_ｎ，ｙ_ｎ）
で表現することができる。ここで、文字位置（ｘ_ｉ，ｙ_ｉ）は、それぞれ単語（ｇ_ｉ，ｐ_ｉ）の終了位置であり、（ｘ_０，ｙ_０）＝（０，０）及び（ｘ_ｎ，ｙ_ｎ）＝（ｌ，ｍ）である。
【００９５】
先頭からｉ番目の単語までの単語列の同時確率Ｐ（ｇ_１，ｐ_１，．．．，ｇ_ｉ，ｐ_ｉ）と、各単語を構成する漢字表記と仮名表記の各文字の文字混同確率との積の最大値をφ（ｇ_ｉ，ｐ_ｉ）と定義すると、式（２）より、以下の関係が成立する。
【００９６】
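【数１３】
\[ \phi(g_i, p_i) = \max_{(g_{i-1}, p_{i-1})} \bigl[ \phi(g_{i-1}, p_{i-1})\, P(g_i, p_i \mid g_{i-1}, p_{i-1}) \bigr] \prod_{j=q+1}^{r} P(cg'_j \mid cg_j) \prod_{k=s+1}^{t} P(cp'_k \mid cp_k) \tag{13} \]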
ここで、ｑとｒは、漢字表記ｇ_ｉの開始位置と終了位置を表わし、ｓとｔは仮名表記ｐ_ｉの開始位置と終了位置を表わす。すなわち、
ｇ_ｉ＝ｃｇ_ｑ＋１．．．ｃｇ_ｒ
ｐ_ｉ＝ｃｐ_ｓ＋１．．．ｃｐ_ｔ
である。また、ｃｇ’_ｊ及びｃｐ’_ｋは、ｃｇ_ｊ及びｃｐ_ｋに対応する文字認識結果である。
【００９７】
式（１３）は、以下の関係を表わす。先頭からｉ番目の単語までの同時確率と各単語の漢字表記と仮名表記を構成する各文字の文字混同確率の積の最大値φ（ｇ_ｉ，ｐ_ｉ）は、先頭からｉ−１番目の単語までの同時確率と各単語の漢字表記と仮名表記を構成する各文字の文字混同確率の積の最大値φ（ｇ_ｉ−１，ｐ_ｉ−１）と、ｉ−１番目の単語とｉ番目の単語の単語ｂｉｇｒａｍ確率Ｐ（ｇ_ｉ，ｐ_ｉ｜ｇ_ｉ−１，ｐ_ｉ−１）の積の最大値に、ｉ番目の単語の漢字表記と仮名表記を構成する各文字の文字混同確率の積を掛けたものである。この関係を利用して、先頭から順にφ（ｇ_ｉ，ｐ_ｉ）を求めれば、先頭から末尾までの確率の最大値φ（ｇ_ｎ，ｐ_ｎ）を求めることができる。
【００９８】
図１０は、最適単語列探索の動作を説明するためのフローチャートである。最適単語列探索は、二次元の動的計画法を用いて式（１３）の計算を実現する。ここでは、φ（ｇ_ｉ，ｐ_ｉ）を部分解析の確率と呼び、φ（ｇ_ｉ，ｐ_ｉ）を格納するテーブルを部分解析テーブルと呼ぶ。
【００９９】
以下では、図１０に従って、最適単語列探索の動作を説明する。
【０１００】
最適単語列探索は、漢字表記と仮名表記の先頭から始まり、それぞれの現在の解析位置が末尾方向へ一文字ずつ進む。ステップＳ１１では、探索の開始位置を漢字表記と仮名表記の先頭（０，０）に設定する。
【０１０１】
ステップＳ１２では、探索が漢字表記の末尾に達したかどうかを判断する。もし、末尾に達していれば、最適単語列探索を終了する。そうでなければ、以下の処理を漢字表記の各文字位置で行なう。
【０１０２】
ステップＳ１３では、探索が仮名表記の末尾に達したかどうかを判断する。もし、末尾に達していれば、ステップＳ３０へ進む。そうでなければ、以下の処理を仮名表記の各文字位置で行なう。
【０１０３】
ステップＳ１４では、現在の漢字表記の文字位置と仮名表記の文字位置の組に到達するまで全ての単語列を部分解析テーブルから検索し、その中の一つを現在の部分解析（単語列）として選ぶ。
【０１０４】
ステップＳ１５では、全ての単語列を調べたかを判定する。もしそうならば、ステップＳ２９において、探索を仮名表記の次の文字位置へ進める。そうでなければ、以下の処理を各単語列について行なう。
【０１０５】
ステップＳ１６では、現在の漢字表記の文字位置から始まる、漢字表記の文字マトリクスに含まれる全ての漢字文字列のリストを作成する。
【０１０６】
ステップＳ１７では、現在の仮名表記の文字位置から始まる、仮名表記の文字マトリクスに含まれる全ての仮名文字列のリストを作成する。
【０１０７】
ステップＳ１８では、ステップＳ１６で作成した漢字文字列のリストと、ステップＳ１７で作成した仮名文字列のリストの全ての組合せから構成される単語リストを作成する。このリストの中で、単語辞書に照合しないものは未知語とみなす。
【０１０８】
ステップＳ１９では、ステップＳ１６で作成した漢字文字列のリスト、及び、ステップＳ１７で作成した仮名文字列のリストと類似照合する単語辞書中の単語を、単語リストに追加する。
【０１０９】
ステップＳ２０では、単語リストから一つの単語を選ぶ。
【０１１０】
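In step S21, it is determined whether all words have been examined. If so, the process proceeds to step S28. Otherwise, the following processing is performed for each word.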
【０１１１】
ステップＳ２２では、現在の単語（現在の単語を最後の単語とする先頭からの単語列）が部分解析テーブルに登録されているかどうかを調べる。もしそうならば、ステップＳ２４へ進む。もしそうでなければ、ステップＳ２３において、この単語を部分解析テーブルに登録し、部分解析（単語列）の確率を０に初期化した後に、ステップＳ２４へ進む。
【０１１２】
ステップＳ２４では、現在の単語列と現在の単語の組合せによる新しい単語列の確率を求める。新しい単語列の確率は、次式で表わされる。
【０１１３】
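【数１４】
\[ \phi(g_{i-1}, p_{i-1})\, P(g_i, p_i \mid g_{i-1}, p_{i-1}) \prod_{j=q+1}^{r} P(cg'_j \mid cg_j) \prod_{k=s+1}^{t} P(cp'_k \mid cp_k) \tag{14} \]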
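In step S25, it is checked whether the probability of the new word sequence is greater than the probability of the previous word sequence that ends with the same word. If so, the probability of the new word sequence is stored in the partial analysis table in step S26, and the process proceeds to step S27.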
【０１１４】
ステップＳ２７では、次の単語を選び、ステップＳ２１へ戻る。
【０１１５】
ステップＳ２８では、次の単語列を選び、ステップＳ１５へ戻る。
【０１１６】
ステップＳ２９では、探索を仮名表記の次の文字位置へ進め、ステップＳ１３へ戻る。
【０１１７】
ステップＳ３０では、探索を漢字表記の次の文字位置へ進め、ステップＳ１２へ戻る。
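The search in steps S11 to S30 can be sketched in Python as a two-dimensional dynamic program over the positions (x, y). This is a simplified illustration under stated assumptions, not the patented procedure itself: every dictionary word is tried at every position and scored by its confusion probabilities (which covers exact and similar matches uniformly), unknown word generation is omitted, and the `bigram` and `confusion` functions and all numbers are hypothetical.

```python
import math
from collections import defaultdict


def best_word_sequence(kanji_matrix, kana_matrix, lexicon, bigram, confusion):
    """Return (words, probability) maximizing bigram x confusion products."""
    l, m = len(kanji_matrix), len(kana_matrix)
    # table[(x, y)][last_word] = (best probability, back-pointer)
    table = defaultdict(dict)
    table[(0, 0)]["<bos>"] = (1.0, None)
    for x in range(l + 1):
        for y in range(m + 1):
            for prev, (prob, _) in list(table[(x, y)].items()):
                for g, p in lexicon:
                    x2, y2 = x + len(g), y + len(p)
                    if x2 > l or y2 > m:
                        continue  # word does not fit in the remaining input
                    conf = math.prod(confusion(c, kanji_matrix[x + i])
                                     for i, c in enumerate(g))
                    conf *= math.prod(confusion(c, kana_matrix[y + j])
                                      for j, c in enumerate(p))
                    new = prob * bigram(prev, (g, p)) * conf
                    if new > table[(x2, y2)].get((g, p), (0.0, None))[0]:
                        table[(x2, y2)][(g, p)] = (new, ((x, y), prev))
    finals = {w: pr * bigram(w, "<eos>") for w, (pr, _) in table[(l, m)].items()}
    if not finals:
        return [], 0.0
    word = max(finals, key=finals.get)
    best, pos, words = finals[word], (l, m), []
    while word != "<bos>":  # follow back-pointers to recover the sequence
        words.append(word)
        pos, word = table[pos][word][1]
    return words[::-1], best


def confusion(char, candidates):
    """Toy confusion model: 0.9 first candidate, 0.05 lower, tiny otherwise."""
    if char == candidates[0]:
        return 0.9
    return 0.05 if char in candidates else 1e-5


# The first kanji candidate is wrong (禍); the correct 福 is second.
kanji_matrix = [["禍", "福"], ["井", "丼"], ["市", "布"]]
kana_matrix = [["フ", "ク"], ["ク", "タ"], ["イ", "ィ"], ["シ", "ツ"]]
lexicon = [("福井市", "フクイシ"), ("福井", "フクイ"), ("市", "シ")]
words, prob = best_word_sequence(kanji_matrix, kana_matrix, lexicon,
                                 lambda prev, word: 0.1, confusion)
print(words, prob)
```

As in step S25, only the best probability for each combination of position and last word is kept, so the table stays small even though the number of word sequences grows exponentially.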
【０１１８】
〔７〕単語の類似照合
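The following describes the method of word similarity matching in step S19 of the optimal word sequence search. Let x be the current character position in the kanji string and y the current character position in the kana string.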
【０１１９】
先ず始めに、漢字文字列をキーにして単語を類似検索する。
【０１２０】
（１）現在の漢字表記の文字位置ｘから始まり、漢字表記の文字マトリクスに含まれる全ての漢字文字列のリストの要素と、漢字表記が一致する単語辞書中の単語を全て検索して、類似単語候補リストを作成する。
【０１２１】
（２）類似単語候補リストの各要素について、その仮名表記と、現在の仮名表記の文字位置ｙから始まり、仮名表記の文字マトリクスに含まれる全ての仮名文字列との編集距離（一致しない文字数）の最小値を求める。
【０１２２】
（３）類似単語候補リストから、相対編集距離（文字列の長さに対する一致しない文字の割合）が０．５以下の単語を取り出し、これを出現頻度順に並べ、最大５個を類似単語候補として生成する。
【０１２３】
次に、仮名文字列をキーにして単語を類似検索する。
【０１２４】
（４）現在の仮名表記の文字位置ｙから始まり、仮名表記の文字マトリクスに含まれる全ての仮名文字列のリストの要素と、仮名表記が一致する単語辞書中の単語を全て検索して、類似単語候補リストを作成する。
【０１２５】
（５）類似単語候補リストの各要素について、その漢字表記と、現在の漢字表記の文字位置ｘから始まり、漢字表記の文字マトリクスに含まれる全ての漢字文字列との編集距離（一致しない文字数）の最小値を求める。
【０１２６】
（６）類似単語候補リストから、相対編集距離（文字列の長さに対する一致しない文字の割合）が０．５以下の単語を取り出し、これを出現頻度順に並べ、最大５個を類似単語候補として生成する。
【０１２７】
（７）最後に、漢字文字列をキーにして生成した単語候補と仮名文字列をキーにして生成した単語候補の集合和（重複した単語候補を一つにまとめたもの）を、最終的な類似単語候補とする。
【０１２８】
なお、ここで説明した編集距離の閾値０．５や最大候補数５は、パラメータの設定値の一例であり、最適な値は、たとえば、実験的に決定される。また、上記の例では、最初に漢字文字列をキーにして単語を類似検索し、次に仮名文字列をキーにして単語を類似検索しているが、逆に、最初に仮名文字列をキーにして単語を類似検索し、次に漢字文字列をキーにして単語を類似検索してもよい。
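Steps (1) to (7) can be sketched in Python as follows. This is a simplified illustration: the distance is the number of positions whose character is absent from the matrix column (the "number of non-matching characters" for same-length strings), so insertions and deletions are not handled, and all function names and the toy dictionary frequencies are hypothetical. The threshold 0.5 and the limit of 5 candidates come from the text.

```python
def mismatches(word, matrix, start):
    """Count positions where `word` is not covered by the matrix columns,
    or return None if the word does not fit at `start`."""
    if start + len(word) > len(matrix):
        return None
    return sum(ch not in matrix[start + i] for i, ch in enumerate(word))


def one_side(lexicon, key_matrix, key_pos, other_matrix, other_pos,
             key_index, max_rel, top_k):
    """Search with one notation as the exact key, the other as the fuzzy side."""
    hits = []
    for word, freq in lexicon.items():
        key, other = word[key_index], word[1 - key_index]
        if mismatches(key, key_matrix, key_pos) != 0:
            continue  # the key notation must be contained in its matrix
        d = mismatches(other, other_matrix, other_pos)
        if d is not None and d / len(other) <= max_rel:
            hits.append((freq, word))
    hits.sort(key=lambda h: -h[0])  # most frequent first
    return [w for _, w in hits[:top_k]]


def similar_words(lexicon, kanji_matrix, kana_matrix, x, y, max_rel=0.5, top_k=5):
    """Union of the kanji-keyed and kana-keyed searches, without duplicates."""
    by_kanji = one_side(lexicon, kanji_matrix, x, kana_matrix, y, 0, max_rel, top_k)
    by_kana = one_side(lexicon, kana_matrix, y, kanji_matrix, x, 1, max_rel, top_k)
    return list(dict.fromkeys(by_kanji + by_kana))


# 糸崎 is in the kanji matrix, but ザ is missing from the kana matrix.
kanji_matrix = [["糸", "系"], ["崎", "埼"]]
kana_matrix = [["イ", "ィ"], ["ト", "ド"], ["サ", "ナ"], ["キ", "ギ"]]
lexicon = {("糸崎", "イトザキ"): 3, ("糸魚川", "イトイガワ"): 7}
print(similar_words(lexicon, kanji_matrix, kana_matrix, 0, 0))
```

In this toy run the kana side has one non-matching character out of four, a relative distance of 0.25, so the dictionary entry is proposed even though its correct kana string is not contained in the matrix.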
【０１２９】
【実施例】
最後に、本発明の一実施例による処理例を示す。図１１は、漢字表記「福井県福井市糸崎町」と仮名表記「フクイケンフクイシイトザキチョウ」に対して文字認識装置が出力した文字マトリクスの組に対する最適単語列探索の例を説明する図である。
【０１３０】
この処理例では、文字マトリクスは第２候補までを使用している。たとえば、「福」という漢字に対する第１候補及び第２候補は、それぞれ、「福」及び「禍」であり、「フ」という仮名に対する第１候補及び第２候補は、それぞれ、「フ」及び「ク」である。
【０１３１】
最適単語列探索では、各文字位置において、そこへ到達する単語列とそこから出発する単語の全ての組合せを調べ、出発する単語の終了位置における単語列の同時確率を更新する。
【０１３２】
図１１の左側では、漢字表記の文字位置をｘ（横軸）、仮名表記の文字位置をｙ（縦軸）で表わし、ある文字位置（６，９）に到達する単語列と出発する単語の位置関係、すなわち、文字位置（６，９）に到達する単語列の最後の単語の開始位置、及び、文字位置（６，９）から始まる単語の終了位置を示している。
【０１３３】
図１１の右側では、文字位置（６，９）に到達する単語列の最後の単語、及び、文字位置（６，９）から始まる単語の組合せの全てを調べる様子を示している。各単語は「漢字文字列／仮名文字列」で表現し、単語に対応する箱の上部には、単語の開始位置と終了位置の座標が示されている。また、単語が未知語の場合は、箱の下部に単語タイプを示している。
【０１３４】
単語候補には、単語辞書と完全一致したもの、単語辞書と類似照合したもの、及び、単語辞書と照合しなかったもの（未知語）の３種類ある。たとえば、文字位置（３，５）から文字位置（６，９）にある単語「福井市／フクイシ」は完全一致したものであり、漢字文字列「福井市」及び仮名文字列「フクイシ」が文字マトリクスにあり、かつ、「福井市／フクイシ」が単語辞書にある。
【０１３５】
文字位置（６，９）から文字位置（８，１３）にある単語「糸崎／イトザキ」は類似照合したもので、漢字文字列「糸崎」は文字マトリクスにあるが、仮名文字列「イトザキ」は文字マトリクスになく、漢字文字列をキーにして単語辞書から検索したものである。文字位置（５，７）から文字位置（６，９）にある単語候補「市／イシ」は未知語候補対であり、漢字文字列「市」も仮名文字列「イシ」も文字マトリクスにあるが、この単語「市／イシ」は辞書にない。
【０１３６】
図１１には、ある文字位置（６，９）における処理の様子が示されているが、このような処理を原点（０，０）から始めて（９，１６）まで、平面上の全ての格子点で行なうことにより、漢字表記と仮名表記の同時確率と文字混同確率の積が最大となる単語列を求めることができる。
【０１３７】
上記の本発明の実施例による文字認識誤り訂正方法は、ソフトウェア（プログラム）で構築することが可能であり、コンピュータのＣＰＵによってこのプログラムを実行することにより本発明の実施例による文字認識誤り訂正装置を実現することができる。構築されたプログラムは、ディスク装置等に記録しておき必要に応じてコンピュータにインストールされ、フレキシブルディスク、メモリカード、ＣＤ−ＲＯＭ等の可搬記録媒体に格納して必要に応じてコンピュータにインストールされ、或いは、通信回線等を介してコンピュータにインストールされ、コンピュータのＣＰＵによって実行される。
【０１３８】
以上、本発明の代表的な実施例を説明したが、本発明は、上記の実施例に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【０１３９】
【発明の効果】
以上のように、本発明によれば、漢字文字列と仮名文字列の同時確率を与える言語モデルと、文字混同確率を与える文字認識装置モデルと、言語モデル及び文字認識装置モデルに基づいて、入力された漢字文字列と仮名文字列の文字認識結果の組に対して最も確率が大きい単語列を求める最適単語列探索手段と、を用いて漢字表記と仮名表記を対応付けることにより、同じ内容が漢字表記と仮名表記で表現されているという冗長性を利用して、漢字表記又は仮名表記のいずれか一方のみからでは訂正できない文字認識誤りを訂正することができる、文字認識誤り訂正方法を実現できる。
【図面の簡単な説明】
【図１】本発明の第１実施例による文字認識誤り訂正システムの構成図である。
【図２】本発明の第１実施例による文字認識誤り訂正方法のフローチャートである。
【図３】本発明の第２実施例による文字認識誤り訂正システムの構成図である。
【図４】単語辞書の例の説明図である。
【図５】単語ｂｉｇｒａｍの出現頻度の例の説明図である。
【図６】単語タイプの定義の例の説明図である。
【図７】単語タイプ別の漢字表記の平均長と仮名表記の平均長の例の説明図である。
【図８】漢字表記の文字ｂｉｇｒａｍの出現頻度の例の説明図である。
【図９】仮名表記の文字ｂｉｇｒａｍの出現頻度の例の説明図である。
【図１０】本発明の第３実施例による最適単語列探索処理のフローチャートである。
【図１１】最適単語列探索の例を示す図である。
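[Explanation of reference numerals]
1 Character recognition device
2 Optimal word sequence search unit
3 Word matching unit
4 Unknown word candidate generation unit
5 Similar word matching unit
6 Word ngram model
7 Word dictionary
8 Unknown word model
9 Language model
10 Character recognition device model
100 Character recognition error correction device
[0001]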
BACKGROUND OF THE INVENTION
The present invention relates to a technique for correcting errors that occur in handwritten character recognition. In particular, it relates to a method and an apparatus for correcting errors of a Japanese character recognition device, which read character patterns obtained as character recognition results, detect erroneous portions of the input character patterns by using a language model that represents the connections between adjacent characters, and present correct candidates.
[0002]
[Prior art]
In general, on handwritten forms such as application forms, it is customary to add furigana (a reading written in kana) when an address or a name is written in kanji. When a person processes the form data, this helps prevent errors by exploiting the constraint that the kanji notation and the kana notation represent the same content, or equivalently the redundancy that the same content is expressed in both kanji notation and kana notation.
[0003]
For example, when a handwritten form is processed by hand and the kanji notation cannot be determined to be 「水田」 or 「永田」, the kanji notation is known to be 「永田」 if the kana notation can be read as 「ながた」 (Nagata). Similarly, when the kana notation cannot be determined to be 「なかた」 (Nakata) or 「ながた」 (Nagata), the kana notation is known to be 「ながた」 if the kanji notation can be read as 「永田」. Furthermore, when the kanji notation may be 「水田」 or 「永田」 and the kana notation may be 「なかた」 or 「ながた」, 「水田」 cannot be read as 「ながた」, so the only valid interpretation is the pair of the kanji notation 「永田」 and the kana notation 「ながた」.
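The cross-checking described in this paragraph can be illustrated with a minimal Python sketch. The reading dictionary `READINGS` and the function name are hypothetical; a real system would not have a complete dictionary, which is the difficulty the invention addresses.

```python
from itertools import product

# Hypothetical reading dictionary: kanji notation -> possible kana readings.
READINGS = {"水田": {"みずた", "すいでん"}, "永田": {"ながた"}}


def consistent_pairs(kanji_candidates, kana_candidates, readings):
    """Keep only (kanji, kana) pairs in which the kana is a known reading."""
    return [(g, p) for g, p in product(kanji_candidates, kana_candidates)
            if p in readings.get(g, ())]


pairs = consistent_pairs(["水田", "永田"], ["なかた", "ながた"], READINGS)
print(pairs)
```

Neither candidate list is unambiguous on its own, yet only one of the four combinations survives the consistency check.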
[0004]
However, when forms are processed automatically with a character recognition device, conventional character recognition error correction methods could not handle the kanji notation and the kana notation simultaneously, and therefore could not correct errors while making effective use of the constraint (or redundancy) that exists between the two. There are two reasons for this, (1) and (2) below.
[0005]
(1) Since it is practically impossible to register in a dictionary every pair of kanji notation and kana notation that may occur, the possibility that a correct pair that is not in the dictionary appears in the input must be taken into account. However, no method of assigning a probability to such an event had been devised.
[0006]
(2) When there are multiple character recognition candidates for each of the kanji notation and the kana notation, the number of possible correspondences between the two becomes enormous. However, no algorithm for examining them systematically had been devised.
[0007]
Conventional error correction methods for character recognition mainly apply a statistical language model to a single input string (kanji notation or kana notation). Typical models are the character ngram model, which represents chains of n characters, and the word ngram model, which represents chains of n words, where n is an integer of 2 or more.
[0008]
An example that uses a character ngram model is:
Sugimura and Saito: "Judgment of unreadable characters using character concatenation information-Application to character recognition-" IEICE Transactions Vol.J68-D No.1, pp.64-71, 1985
Examples that use a word ngram model are:
Takao and Nishino: "Implementation and Evaluation of Post-Processing for Japanese Document Readers" IPSJ Transactions Vol.33 No.5, pp.664-670, 1992
Ito and Maruyama: "Error Detection and Automatic Correction of Japanese Sentences with OCR Input," IPSJ Transactions Vol.33, No.5, pp.664-670, 1992
Nagata: "Japanese Character Recognition Error Correction Method Using Character Similarity and Statistical Language Model" IEICE Transactions (D-II) Vol.J81-D-II, No.11, pp.2624-2634, 1998
[0009]
In contrast to these methods, a method that uses the kanji notation and the kana notation simultaneously for error correction has recently been proposed. It aligns the kanji notation with the kana notation on the basis of a statistical language model whose basic unit is a pair of a single kanji and its reading:
Nagata, M.: Synchronous Morphological Analysis of Grapheme and Phoneme for Japanese OCR, Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp.384-391, 2000
In the method proposed in this paper, for example, when the input is the pair of character recognition results for the kanji notation 「福沢諭吉」 and the kana notation 「フクザワユキチ」, error correction is performed by finding the correspondences 「福／フク」, 「沢／サワ」, 「諭／ユ」, and 「吉／キチ」 among the multiple character recognition candidates for the kanji notation and the kana notation, based on the occurrence probabilities of pairs of a single kanji and its reading such as 「福／フク」.
[0010]
The method that uses a single kanji and its reading as the basic unit of the language model is effective for short strings such as personal names (about 3 to 5 kanji characters) in which a single kanji has many different readings.
[0011]
[Problems to be solved by the invention]
However, for long strings such as addresses (about 10 to 15 kanji characters), using a single kanji and its reading as the basic unit of the language model makes the number of combinations to be searched enormous, and the computational cost becomes large. In particular, the number of kanji candidates retrieved from the kana notation becomes a problem.
[0012]
For example, when the input is the pair of the kanji notation 「神奈川県横須賀市光の丘」 and the kana notation 「カナガワケンヨコスカシヒカリノオカ」, there are at least 214 kanji that can be read as 「カ」, as shown below.
[0013]
[Table 1]

Next, there are at least 17 kanji that can be read as 「カナ」, as shown below.
[0014]
[Table 2]

Furthermore, there are at least 66 kanji that can be read as 「ナ」, and at least 15 kanji that can be read as 「ナガ」.
[0015]
Similarly, the kanji 「神」 has at least 14 readings, as follows:
シンジンカミカンコウカカグカナカモ
クマココハダマミ
[0016]
In error correction for character recognition, not only the first candidate of the recognition result but also the lower-ranked candidates must be considered, so the search space becomes even larger, and for long strings such as addresses the computational cost is not negligible.
[0017]
When words are used as the basic unit of the statistical language model, most combinations of characters contained in the character matrix (i.e., the list of character candidates arranged in descending order of character recognition score at each character position of the input) do not form words, so the number of words that match the dictionary is small. Conversely, however, the number of unknown word candidates, i.e., word candidates that do not match the dictionary, becomes large, so the situation does not improve.
[0018]
If an unknown word model could be devised that gives a high probability to combinations of kanji strings and kana strings that are likely to form words, the computational cost could be reduced by narrowing down the unknown word candidates, and errors could be corrected while aligning even relatively long pairs of strings such as the kanji notation and kana notation of an address. However, no such unknown word model has been proposed so far.
[0019]
In view of the above problems of the prior art, an object of the present invention is to provide a character recognition error correction method that, for a pair of relatively long strings in a first notation and a second notation such as the kanji notation and kana notation of an address, handles the mutually corresponding character recognition results of the two notations simultaneously and thereby realizes error correction that exploits the constraint or redundancy existing between the first notation and the second notation.
[0020]
It is another object of the present invention to provide an apparatus for performing such a character recognition error correction method.
[0021]
Another object of the present invention is to provide a program that causes a computer to realize such a character recognition error correction method.
[0022]
[Means for Solving the Problems]
When a first notation and a second notation, such as kanji notation and kana notation, are given simultaneously as the output of a character recognition device, the present invention corrects errors that cannot be corrected from either notation alone. It does so by exploiting the redundancy that the same content is expressed in both notations, using a language model, a character recognition device model, and an algorithm that searches for the optimal word sequence while aligning the first notation with the second notation.
[0023]
The present invention provides a character recognition error correction method that corrects errors that cannot be corrected from only one of a string in the first notation and a string in the second notation, by aligning the two strings using: a language model that gives, for an arbitrary string in the first notation and an arbitrary string in the second notation, their joint probability, i.e., the probability that the first string is the first-notation rendering of the second string and the second string is the second-notation rendering of the first string; a character recognition device model that gives, for any two characters, the probability that one is misrecognized as the other; and optimal word sequence search means that finds, based on the language model and the character recognition device model, the most probable word sequence, i.e., the most probable pair of a word sequence in the first notation and a word sequence in the second notation.
[0024]
  The present invention (Claim 1) is a character recognition error correction apparatus that receives, as input, the character recognition results for a pair consisting of a kanji string and a kana string representing the same content as the kanji string, and corrects character recognition errors in the input, the apparatus comprising:
  A word dictionary that stores pairs of kanji and kana notation of words;
  A word ngram model that gives the appearance probability of ngrams of pairs of the kanji notation and kana notation of words;
  An unknown word model that gives an unknown word probability that is a probability that a combination of an arbitrary kanji string and a kana string not registered in the word dictionary constitutes a word;
  A character recognition device model that gives a character confusion probability, which is the probability that the character recognition device recognizes an input character ci as a character cj;
  Input means for inputting, for each of the kanji notation and the kana notation, a character matrix, which is a list of character candidates arranged at each character position in descending order of character recognition score;
  Word matching means for searching the word dictionary for words that exactly match a pair consisting of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, and identifying them as matching words;
  Unknown word candidate generating means for generating, using the unknown word model, a predetermined number of unknown words with high unknown word probability from pairs of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, an unknown word being a word not registered in the word dictionary;
  Similar word matching means for searching the word dictionary for words that similarly match a pair of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, and identifying them as similar words; and
  Optimum word string search means for obtaining, based on the word ngram model, the unknown word model, and the character recognition device model, the word string, that is, the sequence of kanji/kana pairs drawn from the matching words, unknown words, and similar words, that maximizes the product of the word ngram appearance probability and the character confusion probability for the input kanji and kana notation,
  wherein the unknown word model comprises:
  Word type probability calculating means for obtaining a word type probability, which is the appearance probability of a word type, from the appearance frequency of that word type;
  Word type determination means for classifying an unknown word into one of the word types defined by the types of characters constituting the kanji notation of the word;
  Word length probability calculating means for obtaining, from the distributions of kanji notation length and kana notation length for each word type and the word type of the unknown word, a word length probability, which is the product of the probability of the unknown word's kanji notation length and the probability of its kana notation length;
  Word notation probability calculating means for obtaining the appearance probability of the kanji string from kanji character ngram frequencies and the appearance probability of the kana string from kana character ngram frequencies, and obtaining from them a word notation probability, which is the joint probability of the notations of the unknown word; and
  Unknown word probability calculating means for calculating the unknown word probability of the unknown word as the product of its word type probability, word length probability, and word notation probability.
[0025]
  The present invention (Claim 2) is a character recognition error correction method that receives, as input, the character recognition results for a pair consisting of a kanji string and a kana string representing the same content as the kanji string, and corrects character recognition errors in the input, the method being performed in an apparatus having:
  A word dictionary that stores pairs of kanji and kana notation of words;
  A word ngram model that gives the appearance probability of ngrams of pairs of the kanji notation and kana notation of words;
  An unknown word model that gives an unknown word probability, which is the probability that a pair of an arbitrary kanji string and kana string not registered in the word dictionary constitutes a word; and
  A character recognition device model that gives a character confusion probability, which is the probability that the character recognition device recognizes an input character ci as a character cj,
  the method comprising:
  An input step in which input means inputs, for each of the kanji notation and the kana notation, a character matrix, which is a list of character candidates arranged at each character position in descending order of character recognition score;
  A word matching step in which word matching means searches the word dictionary for words that exactly match a pair of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, and identifies them as matching words;
  An unknown word candidate generating step in which unknown word candidate generating means generates, using the unknown word model, a predetermined number of unknown words with high unknown word probability from pairs of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, an unknown word being a word not registered in the word dictionary;
  A similar word matching step in which similar word matching means searches the word dictionary for words that similarly match a pair of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, and identifies them as similar words; and
  An optimum word string search step in which optimum word string search means obtains, based on the word ngram model, the unknown word model, and the character recognition device model, the word string, that is, the sequence of kanji/kana pairs drawn from the matching words, unknown words, and similar words, that maximizes the product of the word ngram appearance probability and the character confusion probability for the input kanji and kana notation,
  wherein, in the unknown word model:
  Word type probability calculating means performs a step of obtaining a word type probability, which is the appearance probability of a word type, from the appearance frequency of that word type;
  Word type determination means performs a step of classifying the unknown word into one of the word types defined by the types of characters constituting the kanji notation of the word;
  Word length probability calculating means performs a step of obtaining, from the kanji notation length distribution and kana notation length distribution of each word type and the word type of the unknown word, a word length probability, which is the product of the probability of the unknown word's kanji notation length and the probability of its kana notation length;
  Word notation probability calculating means performs a step of obtaining the appearance probability of the kanji string from kanji character ngram frequencies and the appearance probability of the kana string from kana character ngram frequencies, and obtaining from them a word notation probability, which is the joint probability of the notations of the unknown word; and
  Unknown word probability calculating means performs an unknown word probability calculation step of calculating the unknown word probability of the unknown word as the product of its word type probability, word length probability, and word notation probability.
[0026]
  The present invention (Claim 3) is a program for realizing the character recognition error correction method according to Claim 2 by a computer.
[0033]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a block diagram of a character recognition error correction system according to a first embodiment of the present invention. The character recognition error correction system of the present embodiment comprises a character recognition device 1 that recognizes a pair of Japanese kanji notation and kana notation, and a character recognition error correction apparatus 100 that corrects character recognition errors in the recognition result of the character recognition device 1. Because the first embodiment assumes character recognition of the address field of a handwritten form such as an application form, the kanji notation may include kanji, katakana, hiragana, Roman letters, and alphanumeric characters. Likewise, the kana notation in the furigana field may include Roman letters and alphanumeric characters in addition to katakana or hiragana. In the following description of the embodiments, for simplicity, the kanji notation is assumed to consist only of kanji and the kana notation only of katakana.
[0034]
The character recognition error correction apparatus 100 receives a kanji character matrix and a kana character matrix as the character recognition results from the character recognition device 1. A character matrix is a list, with one entry per input character position, of character candidates arranged in descending order of character recognition score. Hereinafter, a character string formed by selecting one character from the candidate list at each character position of a character matrix is referred to as a "character string included in the character matrix".
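As an illustration of the definition above, the following minimal Python sketch enumerates every "character string included in the character matrix". The matrix contents and the function name `strings_in_matrix` are hypothetical examples, not taken from the patent.

```python
from itertools import product

# Hypothetical kanji character matrix: one candidate list per character
# position, each ordered from the highest recognition score downward.
kanji_matrix = [["永", "水"], ["田", "由"]]


def strings_in_matrix(matrix):
    """Yield every string formed by picking one candidate per position."""
    for chars in product(*matrix):
        yield "".join(chars)


candidates = list(strings_in_matrix(kanji_matrix))
print(candidates)
```

The number of such strings is the product of the candidate-list lengths, which is why the later sections rely on dictionary matching and dynamic programming rather than exhaustive enumeration.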
[0035]
The character recognition error correction apparatus 100 includes:
An optimum word string search unit 2 for receiving a character matrix in kanji and a character matrix in kana from the character recognition device 1;
A word dictionary 7 for storing pairs of kanji and kana notation of words;
A word matching unit 3 that searches the word dictionary 7 for a word that completely matches a set of kanji strings included in the character matrix of kanji notation and a kana string of characters included in the character matrix of kana notation;
An unknown word model 8 that gives a probability that a combination of an arbitrary kanji string and a kana string not registered in the word dictionary 7 constitutes a word;
An unknown word candidate generation unit 4 that generates an unknown word candidate from a set of kanji strings included in the character matrix of kanji notation and kana strings included in the character matrix of kana notation, using the unknown word model 8;
A character recognition device model 10 that gives, for any two characters, the probability that one character is erroneously recognized as the other, that is, the character confusion probability;
A similar word matching unit 5 that searches the word dictionary 7 for words that similarly match a pair of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix;
A word ngram model 6 that gives the appearance probability of N-chains of kanji/kana pairs of words, that is, of word ngrams; and
A language model 9 that includes the word ngram model 6, the word dictionary 7, and the unknown word model 8, and that gives, for any kanji string and kana string, the probability that the kanji string is the kanji notation of the kana string and the kana string is the kana notation of the kanji string, that is, the joint probability of the kanji notation and kana notation.
[0036]
The optimum word string search unit 2 receives words that exactly match entries in the word dictionary 7 from the word matching unit 3, unknown word candidates from the unknown word candidate generation unit 4, and words in the word dictionary 7 that similarly match from the similar word matching unit 5. Based on the language model 9 and the character recognition device model 10, it then obtains the word string with the highest probability, that is, the most probable sequence of kanji/kana pairs.
[0037]
FIG. 2 is a flowchart for explaining a character recognition error correction method in the character recognition error correction system according to the first embodiment of the present invention.
[0038]
The character recognition error correction method uses, as an input, a combination of a character matrix of kanji notation and a character matrix of kana notation obtained by character recognition of a set of kanji notation and kana notation by the character recognition device 1.
[0039]
In step 1, the word matching unit 3 identifies the words in the word dictionary 7 that exactly match a pair of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, that is, the matching words.
[0040]
In step 2, the unknown word candidate generation unit 4 generates, from pairs of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, candidates for words that are not registered in the word dictionary 7, that is, unknown words.
[0041]
In step 3, the similar word matching unit 5 identifies the words in the word dictionary 7 that similarly match a pair of a kanji string included in the kanji character matrix and a kana string included in the kana character matrix, that is, the similar words.
[0042]
Finally, in step 4, the optimum word string search unit 2 obtains, based on the language model 9 and the character recognition device model 10, the word string with the highest probability among the matching words, unknown words, and similar words, that is, the most probable sequence of kanji/kana pairs.
[0043]
In the above description, processing is performed in the order of step 1, step 2, and step 3. However, the order in which steps 1 to 3 are executed is not limited to this, and they may be executed in any order.
[0044]
According to the configuration of the character recognition error correction system of the first embodiment of the present invention, kanji notation and kana notation can be associated with each other, so errors that cannot be corrected from the kanji notation or the kana notation alone can be corrected. Even if the input includes a word that is not registered in the dictionary, the appearance probability of the pair of kanji string and kana string can be estimated from the unknown word model 8, and even if the correct character is not included in the character matrix, error correction candidates can be presented by similar word matching.
[0045]
FIG. 3 is a schematic block diagram of a character recognition error correction system according to the second embodiment of the present invention. The configuration of the second embodiment is described next with reference to FIG. 3. Components in FIG. 3 that are given the same reference numerals as in FIG. 1 are the same as or similar to the corresponding components in FIG. 1.
[0046]
The character recognition error correction system according to the second embodiment of the present invention includes a character recognition device 1, an optimum word string search unit 2, a word matching unit 3, an unknown word candidate generation unit 4, a similar word matching unit 5, a language model 9, and a character recognition device model 10. The language model 9 includes a word bigram model 60, a word dictionary 7, and an unknown word model 8. The word bigram model 60 includes a word bigram frequency table 61 and a word bigram probability calculation unit 62. The unknown word model 8 includes an unknown word probability calculation unit 81, a word type determination unit 82, a word type definition storage unit 83, a word length probability calculation unit 84, an average word length table 85, a word notation probability calculation unit 86, and a character bigram frequency table 87. The character recognition device model 10 includes a character confusion probability calculation unit 11 and a character recognition device accuracy rate table 12.
[0047]
Here, the word bigram represents a two-chain consisting of two words, and the character bigram represents a two-chain consisting of two characters.
[0048]
The optimum word string search unit 2 receives the pair of the kanji character matrix and the kana character matrix that the character recognition device 1 outputs for the input pair of kanji notation and kana notation. Using dynamic programming that advances one character at a time from the beginning to the end of the sentence in each of the two character matrices, it obtains the word string that maximizes the product of the joint probability of the word string, that is, the product of the word bigram probabilities, and the character confusion probabilities.
[0049]
To this end, the optimum word string search unit 2 is given, as word candidates, exact-match words from the word matching unit 3, unknown word candidates from the unknown word candidate generation unit 4, and similar word candidates from the similar word matching unit 5. The character confusion probability of each word candidate is given by the character confusion probability calculation unit 11, and the word bigram probability is given by the word bigram probability calculation unit 62.
[0050]
The word matching unit 3 matches every combination of a character string included in the kanji character matrix and a character string included in the kana character matrix against the word dictionary 7, and gives those that match to the optimum word string search unit 2 as exact-match words. Here, a character string formed by selecting one character from the candidate list at each character position of a character matrix is referred to as a "character string included in the character matrix".
[0051]
The unknown word candidate generation unit 4 regards as unknown words those combinations of a character string included in the kanji character matrix and a character string included in the kana character matrix that do not match any entry in the word dictionary 7, and gives a predetermined number of them to the optimum word string search unit 2 as unknown word candidates, in descending order of the unknown word probability obtained by the unknown word probability calculation unit 81.
[0052]
The similar word matching unit 5 performs similar matching against the word dictionary 7 for every combination of a character string included in the kanji character matrix and a character string included in the kana character matrix, and gives those that match to the optimum word string search unit 2 as similar word candidates. As the distance measure for similar word matching, the edit distance, which represents the number of insertions, deletions, and substitutions required to convert one string into the other, or the proportion of matching characters can be used, for example.
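The edit distance mentioned above can be computed with the standard dynamic programming recurrence. The following is a minimal Python sketch of that textbook algorithm (Levenshtein distance with unit costs); the patent does not specify cost weights, so the unit costs and the function name `edit_distance` are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("ナカタ", "ナガタ"))
```

A dictionary word could then be accepted as a similar word when its distance to a string included in the character matrix falls below some threshold; the threshold itself is not given in this section.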
[0053]
The word bigram probability calculation unit 62 calculates the appearance probability of a word bigram from the word bigram frequencies stored in the word bigram frequency table 61.
[0054]
Based on the word type definitions stored in the word type definition storage unit 83, the word type determination unit 82 determines the word type of an unknown word candidate from the types of characters that constitute its kanji notation.
[0055]
The word length probability calculation unit 84 obtains the joint probability of the kanji notation length and the kana notation length of an unknown word candidate from the average word lengths of the kanji notation and kana notation stored in the average word length table 85.
[0056]
The word notation probability calculation unit 86 obtains the joint probability of the kanji string and the kana string of an unknown word candidate from the kanji character bigram frequencies and the kana character bigram frequencies stored in the character bigram frequency table 87.
[0057]
The character confusion probability calculation unit 11 calculates a character confusion probability from the first candidate correct answer rate and the accumulated correct answer rate stored in the character recognition device correct answer rate table 12.
[0058]
Thus, the optimum word string search unit 2 can obtain a word string (a combination of kanji and kana notation) that maximizes the product of the word bigram probability and the character confusion probability.
[0059]
In the following, the information-theoretic interpretation of character recognition error correction, which is the theoretical basis of the present invention, is described first, followed in order by the language model, the unknown word model, the character recognition device model, the optimum word string search means, and the similar word matching means.
[0060]
[1] Information theoretical interpretation of character recognition error correction
In the third embodiment of the present invention, the relationship between the input and the output of the character recognition device is formulated as a noisy channel model. Let G and P be the input kanji notation (graphemes), as the first notation, and the input kana notation (phonemes), as the second notation, and let G' and P' be the corresponding outputs of the character recognition device. The character recognition error correction of the present invention then reduces to the problem of finding the kanji notation Ĝ and the kana notation P̂ that maximize the posterior probability P(G, P | G', P'). Furthermore, by Bayes' theorem, it suffices to find the pair of kanji notation and kana notation (Ĝ, P̂) that maximizes P(G, P) P(G', P' | G, P).
[0064]
[Expression 4]
$$(\hat{G}, \hat{P}) = \arg\max_{G, P} P(G, P \mid G', P') = \arg\max_{G, P} P(G, P)\, P(G', P' \mid G, P) \qquad (1)$$
Here, P(G, P) is referred to as the language model, and P(G', P' | G, P) is referred to as the character recognition device model.
[0065]
[2] Language model
Let the kanji notation and the kana notation be
a string of length l, G = α₁α₂...α_l,
and
a string of length m, P = β₁β₂...β_m,
respectively. Further, suppose that, by associating the kanji notation and kana notation of each word, the pair of kanji notation and kana notation is divided into a word string of length n,
(G, P) = ((g₁, p₁), (g₂, p₂), ..., (g_n, p_n)).
[0066]
For example, the pair of the kanji notation "Kanagawa Prefecture Yokosuka City Hikari no Oka" and the kana notation "Kanagawaken Yokosukashi Hikarinooka" is aligned and divided into the word string ((Kanagawa Prefecture, Kanagawaken), (Yokosuka City, Yokosukashi), (Hikari no Oka, Hikarinooka)).
[0067]
In the present invention, the language model P(G, P) is approximated by the joint probability of the most likely word string that associates the kanji notation G with the kana notation P. This joint probability of the word string is in turn approximated by the word bigram model described below.
[0068]
[3] Word bigram model
In general, a word N-gram is an N-chain consisting of N words, where N is an integer of 2 or more. For example, a word bigram is a chain of two words. The word bigram model approximates the joint probability of a word string by the product of the word bigram probabilities P(g_i, p_i | g_{i-1}, p_{i-1}).
[0069]
[Equation 5]

Here, <bos> and <eos> are special symbols representing the beginning and end of a sentence.
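The equation image for Equation (2) is not reproduced in this text. From the surrounding description, it can plausibly be written as follows; the indexing convention, with the 0-th word set to <bos> and the (n+1)-th word set to <eos>, is an assumption made here for illustration.

```latex
P(G, P) \approx \prod_{i=1}^{n+1} P(g_i, p_i \mid g_{i-1}, p_{i-1}),
\qquad (g_0, p_0) = \langle\mathrm{bos}\rangle,\quad
(g_{n+1}, p_{n+1}) = \langle\mathrm{eos}\rangle
```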
[0070]
The method for obtaining the word bigram probability (that is, the word bigram probability calculating means) is as follows.
[0071]
First, Japanese text in mixed kanji and kana notation is divided into words, and data in which each word is annotated with its reading in kana notation is created. Hereinafter, this data is referred to as learning data.
[0072]
The appearance frequency C(g_i, p_i) of each pair of kanji notation and kana notation of a word in this learning data, and the appearance frequency C(g_{i-1}, p_{i-1}, g_i, p_i) of each bigram of such pairs, are counted and stored in the word bigram frequency table.
[0073]
FIG. 4 is an explanatory diagram of an example of the word appearance frequency, and FIG. 5 is an explanatory diagram of an example of the word bigram appearance frequency. In the illustrated example, a combination of a kanji character string and a kana character string separated by “/” represents one word.
[0074]
Next, the relative frequency f(g_i, p_i) of each word and the relative frequency f(g_i, p_i | g_{i-1}, p_{i-1}) of each word bigram are obtained.
[0075]
[Formula 6]

Next, linear interpolation is performed on these relative frequencies to obtain the word bigram probabilities. Here, the linear interpolation coefficient α is determined so that the probability of the learning data is maximized.
$$P(g_i, p_i \mid g_{i-1}, p_{i-1}) = (1 - \alpha)\, f(g_i, p_i) + \alpha\, f(g_i, p_i \mid g_{i-1}, p_{i-1}) \qquad (5)$$
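The interpolation in Equation (5) can be sketched in Python as follows. Because the equation image defining the relative frequencies is not reproduced here, the sketch assumes the usual definitions: f(w) is the count of w divided by the total word count, and f(w | prev) is the bigram count divided by the count of prev. The counts, the value of α, and the function name `word_bigram_prob` are hypothetical.

```python
from collections import Counter


def word_bigram_prob(word, prev, unigram, bigram, alpha):
    """Equation (5): (1 - alpha) * f(word) + alpha * f(word | prev)."""
    total = sum(unigram.values())
    # Assumed relative frequencies (the original equation image is missing).
    f_uni = unigram[word] / total if total else 0.0
    f_bi = bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    return (1 - alpha) * f_uni + alpha * f_bi


# Each word is a (kanji notation, kana notation) pair; toy counts only.
kanagawa = ("神奈川県", "カナガワケン")
yokosuka = ("横須賀市", "ヨコスカシ")
yokohama = ("横浜市", "ヨコハマシ")
unigram = Counter({kanagawa: 2, yokosuka: 1, yokohama: 1})
bigram = Counter({(kanagawa, yokosuka): 1, (kanagawa, yokohama): 1})

prob = word_bigram_prob(yokosuka, kanagawa, unigram, bigram, alpha=0.8)
print(prob)
```

Interpolating with the unigram frequency keeps the probability of an unseen bigram above zero as long as the word itself was observed, which matters because the search multiplies many such probabilities together.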
[4] Unknown word model
The unknown word model is a computational model for obtaining the appearance probability of a pair of kanji notation and kana notation that is not registered in the word dictionary. It is defined as the joint probability distribution P(g, p) of
the kanji string of length l_g, g = cg₁...cg_{l_g},
and
the kana string of length l_p, p = cp₁...cp_{l_p},
that constitute the unknown word (g, p).
[0076]
In one embodiment of the present invention, a plurality of word types are defined based on the types of characters constituting the kanji notation of a word, and the probability of unknown words is estimated for each word type.
[0077]
FIG. 6 is an explanatory diagram of an example in which Japanese unknown words are classified into seven word types. The word types are defined in Backus-Naur Form (BNF), where [...] matches any one character in the enclosed character set. A character range is expressed by writing "-" between two characters. The character code is assumed to be JIS-X-0208. "*" represents zero or more repetitions, and "+" represents one or more repetitions.
[0078]
<sym>, <num>, <alpha>, <hira>, <kata>, and <kan> represent a symbol string, a numeric string, an alphabet string, a hiragana string, a katakana string, and a kanji string, respectively. Every character string composed of more than one character type is classified as <misc>.
[0079]
In the third embodiment of the present invention, the appearance probability of an unknown word (the joint probability of its kanji and kana notation) P(g, p) is obtained as the product of the word type probability, that is, the appearance probability P(<WT>) of each word type <WT> among unknown words, and the word-type-specific joint probability P(g, p | <WT>) of the kanji and kana notation of the unknown word.
P(g, p) = P(<WT>) P(g, p | <WT>) (6)
The word type probability is obtained by classifying the low-frequency words (words with an appearance frequency of 1) in the learning data into word types and taking the relative frequency of each word type.
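The word type classification and the word type probability estimate can be sketched as below. The character classes in FIG. 6 are not reproduced in this text, so the regular expressions here are hypothetical Unicode approximations rather than the JIS-X-0208 definitions of the patent, and the function names are illustrative.

```python
import re
from collections import Counter

# Hypothetical approximations of the seven word types (FIG. 6 not shown).
WORD_TYPES = [
    ("<sym>", re.compile(r"[!-/:-@\[-`{-~、。・]+")),
    ("<num>", re.compile(r"[0-9０-９]+")),
    ("<alpha>", re.compile(r"[A-Za-zＡ-Ｚａ-ｚ]+")),
    ("<hira>", re.compile(r"[ぁ-ん]+")),
    ("<kata>", re.compile(r"[ァ-ヶー]+")),
    ("<kan>", re.compile(r"[一-龥々]+")),
]


def word_type(kanji_notation):
    """Classify a word by the character types of its kanji notation."""
    for name, pattern in WORD_TYPES:
        if pattern.fullmatch(kanji_notation):
            return name
    return "<misc>"  # mixture of several character types


def word_type_probs(low_freq_words):
    """Relative frequency of each word type among low-frequency words."""
    counts = Counter(word_type(w) for w in low_freq_words)
    total = sum(counts.values())
    return {wt: c / total for wt, c in counts.items()}


print(word_type_probs(["光", "丘", "ヨコスカ", "光の丘"]))
```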
[0080]
The word-type-specific appearance probability P(g, p | <WT>) of an unknown word is approximated by the product of the joint probability P(l_g, l_p | <WT>) of the kanji notation length and the kana notation length for that word type, the appearance probability P(g) of the character string in kanji notation, and the appearance probability P(p) of the character string in kana notation.
[0081]
[Expression 7]
$$P(g, p \mid \langle WT \rangle) \approx P(l_g, l_p \mid \langle WT \rangle)\, P(g)\, P(p) \qquad (7)$$
In the following, the joint probability P(l_g, l_p | <WT>) of the kanji notation length and the kana notation length for each word type is called the word length probability.
[0082]
The word length probability is approximated by the product of the word-type-specific length distribution of the kanji notation and the word-type-specific length distribution of the kana notation. These length distributions are in turn approximated by Poisson distributions whose means are the word-type-specific average character length λ_{g,<WT>} of the kanji notation and the word-type-specific average character length λ_{p,<WT>} of the kana notation, respectively.
[0083]
[Equation 8]

The word-type-specific average character lengths λ_{g,<WT>} of the kanji notation and λ_{p,<WT>} of the kana notation are obtained from low-frequency words in the learning data. FIG. 7 is an explanatory diagram of an example of the average character length of kanji notation and kana notation for each word type.
[0084]
The appearance probability P(g) of a character string in kanji notation is approximated by a character bigram model of the characters used in kanji notation.
[0085]
[Equation 9]

The appearance probability P (p) of a character string in kana notation is approximated by a character bigram model of characters used in kana notation.
[0086]
[Expression 10]

As with the word bigram probability, the character bigram probabilities for kanji notation and kana notation are obtained by calculating relative frequencies from the character appearance frequencies and character bigram appearance frequencies in the learning data, and linearly interpolating these relative frequencies. FIG. 8 shows an example of the appearance frequency of the character bigram notation, and FIG. 9 shows an example of the appearance frequency of the kanji character bigram.
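Putting the pieces of the unknown word model together, the following Python sketch multiplies the word type probability, the word length probability, and the word notation probability. Several details are assumptions because the equation images are not reproduced: the standard Poisson form is used for the length distributions, the word-boundary markers `<bow>` and `<eow>` are hypothetical, and the character bigram probabilities are passed in as caller-supplied functions.

```python
import math


def poisson(k, lam):
    """Standard Poisson probability of length k with mean lam (assumed form)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def string_prob(s, bigram_prob):
    """Character bigram probability of s with hypothetical boundary markers."""
    chars = ["<bow>", *s, "<eow>"]
    prob = 1.0
    for prev, cur in zip(chars, chars[1:]):
        prob *= bigram_prob(prev, cur)
    return prob


def unknown_word_prob(g, p, wt_prob, lam_g, lam_p, kanji_bigram, kana_bigram):
    """P(g, p) = P(<WT>) * word length probability * word notation probability."""
    length_prob = poisson(len(g), lam_g) * poisson(len(p), lam_p)
    notation_prob = string_prob(g, kanji_bigram) * string_prob(p, kana_bigram)
    return wt_prob * length_prob * notation_prob


def flat(prev, cur):
    """Toy character bigram model that returns a constant probability."""
    return 0.1


value = unknown_word_prob("光丘", "ヒカリ", 0.5, 2.0, 3.0, flat, flat)
print(value)
```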
[0087]
[5] Character recognition device model
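In the third embodiment of the present invention, the character recognition device model P(G', P' | G, P) is approximated by assuming that the kanji notation G and the kana notation P are recognized independently of each other, and further that each character cg_i of the kanji notation and each character cp_j of the kana notation is recognized independently.
[0088]
[Expression 11]
$$P(G', P' \mid G, P) \approx P(G' \mid G)\, P(P' \mid P) \approx \prod_{i=1}^{l} P(cg'_i \mid cg_i) \prod_{j=1}^{m} P(cp'_j \mid cp_j) \qquad (11)$$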

In general, the probability P(c_j | c_i) that a character recognition device recognizes an input character c_i as the character c_j is called the character confusion probability. In principle, the character confusion probability can be obtained from a character confusion matrix, which is frequency data on pairs of input and output of the character recognition device. However, a character confusion matrix has low versatility because it depends heavily on the character recognition method and the quality of the input image. In addition, since Japanese has more than 3000 types of characters, it is not possible to collect a sufficient number of character recognition results for all characters.
[0089]
Therefore, in the third embodiment of the present invention, the character confusion probability is approximated as follows, using as parameters the accuracy rate p₁ of the first candidate output by the character recognition device and the cumulative accuracy rate p_n up to the n-th candidate (where n is the number of character candidates output by the character recognition device).
[0090]
[Expression 12]

Here, |C| is the size of the character set to be recognized. For example, it is sufficient to set |C_g| = 6879 for kanji and |C_p| = 87 for kana.
[0091]
Equation (12) assigns the character confusion probability as follows. If the character is the first candidate, it is assigned a constant value p₁ (the average accuracy rate of the first candidate) regardless of the character. If the character is among the second and subsequent candidates, the remainder obtained by subtracting the first-candidate accuracy rate from the cumulative accuracy rate is divided equally among those candidates. If the character is not among the character candidates, the value obtained by subtracting the cumulative accuracy rate from 1 is divided equally among the characters of the character set other than the candidates.
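The three cases described for Equation (12) can be sketched in Python as follows. The equation image itself is not reproduced, so the denominators n - 1 and |C| - n are read off the verbal description ("divided equally"); the function name and the parameter values in the example are hypothetical.

```python
def char_confusion_prob(true_char, candidates, p1, pn, charset_size):
    """Approximate probability that true_char yields this candidate list.

    candidates: recognizer output at one position, best candidate first.
    p1: accuracy rate of the first candidate.
    pn: cumulative accuracy rate up to the n-th candidate.
    """
    n = len(candidates)
    if true_char == candidates[0]:
        return p1                          # correct as the first candidate
    if true_char in candidates[1:]:
        return (pn - p1) / (n - 1)         # shared among candidates 2..n
    return (1.0 - pn) / (charset_size - n)  # shared among all other characters


cands = ["水", "永", "氷"]
for ch in ["水", "永", "田"]:
    print(ch, char_confusion_prob(ch, cands, p1=0.9, pn=0.99, charset_size=6879))
```

Under this approximation only two parameters per recognizer have to be measured, which sidesteps the need for a full confusion matrix over several thousand characters.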
[0092]
[6] Optimal word string search
A procedure (optimum word string search means) for obtaining a combination of kanji notation and kana notation that maximizes the posterior probability shown in equation (1) is shown below.
[0093]
Let the input kanji notation and kana notation be
a string of length l, G = α₁α₂...α_l,
and
a string of length m, P = β₁β₂...β_m,
respectively.
[0094]
If the character position in the kanji notation and the character position in the kana notation are denoted by x (0 ≦ x ≦ l) and y (0 ≦ y ≦ m), respectively, a word string of length n obtained by associating the kanji notation with the kana notation,
(G, P) = ((g₁, p₁), (g₂, p₂), ..., (g_n, p_n)),
can be expressed as a sequence of word boundary coordinates of length n + 1,
(x₀, y₀), (x₁, y₁), (x₂, y₂), ..., (x_n, y_n).
Here, the character position (x_i, y_i) is the end position of the word (g_i, p_i), with (x₀, y₀) = (0, 0) and (x_n, y_n) = (l, m).
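As an illustration of this representation, the following Python sketch converts a word string into its sequence of word boundary coordinates. The function name `boundary_coordinates` is hypothetical, and the Japanese strings are an assumed back-translation of the address example discussed with FIG. 11.

```python
def boundary_coordinates(words):
    """Return the n + 1 word boundary coordinates for a list of (kanji, kana) words."""
    coords = [(0, 0)]          # (x_0, y_0) is always the origin
    x = y = 0
    for kanji, kana in words:
        x += len(kanji)        # end position of the word in the kanji notation
        y += len(kana)         # end position of the word in the kana notation
        coords.append((x, y))
    return coords

# Assumed segmentation of the example address (not given verbatim in the text).
words = [("福井県", "ふくいけん"), ("福井市", "ふくいし"),
         ("糸崎", "いとざき"), ("町", "ちょう")]
coords = boundary_coordinates(words)
```

For this assumed segmentation, `coords` is (0, 0), (3, 5), (6, 9), (8, 13), (9, 16), which agrees with the character positions cited in the description of FIG. 11.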
[0095]
Let φ(g_i, p_i) denote the maximum value of the product of the joint probability P(g₁, p₁, ..., g_i, p_i) of the word string from the first to the i-th word and the character confusion probabilities of the characters constituting the kanji notation and kana notation of each word. Then the following relationship holds from Equation (2).
[0096]
[Formula 13]

Here, q and r represent the start position and end position of the kanji notation g_i, and s and t represent the start position and end position of the kana notation p_i. That is,
g_i = cg_{q+1}...cg_r
p_i = cp_{s+1}...cp_t.
In addition, cg'_j and cp'_k are the character recognition results corresponding to cg_j and cp_k.
[0097]
Formula (13) expresses the following relationship. The maximum value φ(g_i, p_i) of the product of the joint probability of the word string from the first to the i-th word and the character confusion probabilities of the characters constituting the kanji notation and kana notation of each word is obtained by taking the maximum value φ(g_{i-1}, p_{i-1}) of the corresponding product for the word string from the first to the (i−1)-th word and multiplying it by the word bigram probability P(g_i, p_i | g_{i-1}, p_{i-1}) of the (i−1)-th and i-th words and by the product of the character confusion probabilities of the characters constituting the kanji notation and kana notation of the i-th word. Using this relationship, φ(g_i, p_i) is computed successively to obtain the maximum probability φ(g_n, p_n).
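The image for Formula (13) is not reproduced in this text. A recurrence consistent with the verbal description above, written with the symbols q, r, s, t, cg_j, cp_k, cg'_j and cp'_k defined in paragraph [0096], would be the following. It is a reconstruction under those definitions, not the original formula.

```latex
\phi(g_i, p_i) = \max_{(g_{i-1},\, p_{i-1})}
  \phi(g_{i-1}, p_{i-1})\,
  P(g_i, p_i \mid g_{i-1}, p_{i-1})
  \prod_{j=q+1}^{r} P(cg'_j \mid cg_j)
  \prod_{k=s+1}^{t} P(cp'_k \mid cp_k)
```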
[0098]
FIG. 10 is a flowchart for explaining the operation of the optimum word string search. The optimum word string search carries out the calculation of Formula (13) using two-dimensional dynamic programming. Here, φ(g_i, p_i) is called the probability of a partial analysis, and the table that stores φ(g_i, p_i) is called the partial analysis table.
[0099]
The operation of the optimum word string search is described below with reference to FIG. 10.
[0100]
The optimum word string search starts from the beginning of kanji and kana notation, and the current analysis position advances one character at a time toward the end. In step S11, the search start position is set to the beginning (0, 0) of the kanji notation and kana notation.
[0101]
In step S12, it is determined whether the search has reached the end of the kanji notation. If the end has been reached, the optimum word string search is terminated. Otherwise, the following processing is performed at each character position in Kanji notation.
[0102]
In step S13, it is determined whether the search has reached the end of the kana notation. If the end has been reached, the process proceeds to step S30. Otherwise, the following processing is performed at each character position in kana notation.
[0103]
In step S14, all word strings that reach the current pair of kanji character position and kana character position are retrieved from the partial analysis table, and one of them is selected as the current partial analysis (word string).
[0104]
In step S15, it is determined whether all word strings have been examined. If so, in step S29, the search is advanced to the next character position in the kana notation. Otherwise, the following processing is performed for each word string.
[0105]
In step S16, a list of all Kanji character strings included in the Kanji character matrix starting from the current Kanji character position is created.
[0106]
In step S17, a list of all kana character strings included in the kana character matrix starting from the current kana character position is created.
[0107]
In step S18, a word list composed of all combinations of the kanji character string list created in step S16 and the kana character string list created in step S17 is created. Anything in this list that does not match the word dictionary is considered an unknown word.
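Steps S16 to S18 can be illustrated with the following Python sketch, which enumerates every string readable from a character matrix starting at a given position and pairs the kanji strings with the kana strings. The names `strings_from` and `word_candidates` and the length bound `max_len` are assumptions introduced for illustration; the text does not state a length limit.

```python
from itertools import product

def strings_from(matrix, start, max_len):
    """List (string, end_position) for every string readable from `matrix`
    beginning at `start`; matrix[i] is the candidate list at position i."""
    results = []
    last = min(len(matrix), start + max_len)
    for end in range(start + 1, last + 1):
        # One character is picked from the candidate list of each position.
        for chars in product(*matrix[start:end]):
            results.append(("".join(chars), end))
    return results

def word_candidates(kanji_matrix, x, kana_matrix, y, dictionary, max_len=4):
    """Combine all kanji strings from x with all kana strings from y and
    label each pair as a dictionary word or an unknown word."""
    candidates = []
    for g, x_end in strings_from(kanji_matrix, x, max_len):
        for p, y_end in strings_from(kana_matrix, y, max_len):
            kind = "known" if (g, p) in dictionary else "unknown"
            candidates.append((g, p, (x_end, y_end), kind))
    return candidates

# Toy character matrices with two candidates per position (hypothetical data).
kanji_matrix = [["水", "永"], ["田", "由"]]
kana_matrix = [["な", "た"], ["が", "か"], ["た", "だ"]]
dictionary = {("永田", "ながた")}
cands = word_candidates(kanji_matrix, 0, kana_matrix, 0, dictionary)
```

Because the number of combinations grows quickly with string length, a practical system would presumably prune the unknown-word pairs, for example by keeping only those with a high unknown word probability as the claims describe.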
[0108]
In step S19, the words in the word dictionary that are similar to the list of kanji character strings created in step S16 and the kana character string list created in step S17 are added to the word list.
[0109]
In step S20, one word is selected from the word list.
[0110]
In step S21, it is determined whether all words have been examined. If so, the process proceeds to step S28. Otherwise, the following processing is performed for each word.
[0111]
In step S22, it is checked whether or not the current word (word string from the beginning with the current word as the last word) is registered in the partial analysis table. If so, the process proceeds to step S24. If not, in step S23, the word is registered in the partial analysis table, the probability of partial analysis (word string) is initialized to 0, and then the process proceeds to step S24.
[0112]
In step S24, the probability of a new word string based on the combination of the current word string and the current word is obtained. The probability of a new word string is expressed by the following equation.
[0113]
[Expression 14]

In step S25, it is checked whether the probability of the new word string is greater than the probability previously stored for a word string with the same last word. If so, the probability of the new word string is stored in the partial analysis table in step S26, and the process proceeds to step S27. Otherwise, the process proceeds directly to step S27.
[0114]
In step S27, the next word is selected and the process returns to step S21.
[0115]
In step S28, the next word string is selected and the process returns to step S15.
[0116]
In step S29, the search is advanced to the next character position in the kana notation, and the process returns to step S13.
[0117]
In step S30, the search is advanced to the next character position in the kanji notation, and the process returns to step S12.
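The loop structure of steps S11 to S30 can be sketched in Python as a two-dimensional dynamic program over character positions. This is a simplified illustration, not the patented implementation: the names `optimum_word_string`, `candidates_at`, `bigram`, `confusion` and the start marker `BOS` are hypothetical, candidate generation (steps S16 to S19) is delegated to the caller, and back pointers are added so that the best word string can be recovered.

```python
BOS = ("<s>", "<s>")  # hypothetical start-of-input marker

def optimum_word_string(kanji_matrix, kana_matrix, candidates_at, bigram, confusion):
    """Return (best word list, probability) for a pair of character matrices.

    candidates_at(x, y)        -> iterable of (kanji, kana) words starting at (x, y)
    bigram(prev_word, word)    -> word bigram probability
    confusion(char, cand_list) -> character confusion probability
    """
    l, m = len(kanji_matrix), len(kana_matrix)
    # Partial analysis table: table[(x, y)][last_word] = (probability, back pointer)
    table = {(0, 0): {BOS: (1.0, None)}}
    for x in range(l + 1):              # kanji character positions
        for y in range(m + 1):          # kana character positions
            for last, (prob, _) in list(table.get((x, y), {}).items()):
                for g, p in candidates_at(x, y):
                    xe, ye = x + len(g), y + len(p)
                    if not g or not p or xe > l or ye > m:
                        continue
                    # Probability of the extended word string.
                    new = prob * bigram(last, (g, p))
                    for ch, cands in zip(g, kanji_matrix[x:xe]):
                        new *= confusion(ch, cands)
                    for ch, cands in zip(p, kana_matrix[y:ye]):
                        new *= confusion(ch, cands)
                    cell = table.setdefault((xe, ye), {})
                    # Keep the best probability per end position and last word.
                    if new > cell.get((g, p), (0.0, None))[0]:
                        cell[(g, p)] = (new, (x, y, last))
    final = table.get((l, m), {})
    if not final:
        return [], 0.0
    word, (prob, back) = max(final.items(), key=lambda kv: kv[1][0])
    words = [word]
    while back is not None and back[2] != BOS:
        x, y, last = back
        words.append(last)
        back = table[(x, y)][last][1]
    return words[::-1], prob

# Toy models with made-up probabilities, for illustration only.
def toy_candidates(x, y):
    return [("永田", "ながた"), ("水田", "ながた")] if (x, y) == (0, 0) else []

def toy_bigram(prev, word):
    return 0.01 if word == ("永田", "ながた") else 1e-6

def toy_confusion(ch, cands):
    if ch == cands[0]:
        return 0.9
    return 0.09 if ch in cands else 1e-6

best, best_prob = optimum_word_string(
    [["水", "永"], ["田", "由"]],
    [["な", "た"], ["が", "か"], ["た", "だ"]],
    toy_candidates, toy_bigram, toy_confusion)
```

In this toy run the pair 永田/ながた is selected even though 水 is the first kanji candidate, because the assumed bigram probability of the dictionary word outweighs the lower confusion probability of the second candidate. Storing one entry per end position and last word is what lets a bigram model be searched exactly without enumerating whole word strings.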
[0118]
[7] Similarity matching of words
In the following, a method of word similarity matching in step S9 of optimum word search will be described. Let x be the character position of the current kanji character string and y be the character position of the kana character string.
[0119]
First, a similarity search for words is performed using kanji character strings as keys.
[0120]
(1) Retrieve from the word dictionary all words whose kanji notation matches an element of the list of all kanji character strings contained in the kanji character matrix starting from the current kanji character position x, and create a similar word candidate list.
[0121]
(2) For each element of the similar word candidate list, find the minimum edit distance (number of mismatched characters) between its kana notation and all the kana character strings contained in the kana character matrix starting from the current kana character position y.
[0122]
(3) Extract from the similar word candidate list the words whose relative edit distance (ratio of mismatched characters to the length of the character string) is 0.5 or less, arrange them in order of appearance frequency, and generate up to five of them as similar word candidates.
[0123]
Next, a similarity search for words is performed using kana character strings as keys.
[0124]
(4) Retrieve from the word dictionary all words whose kana notation matches an element of the list of all kana character strings contained in the kana character matrix starting from the current kana character position y, and create a similar word candidate list.
[0125]
(5) For each element of the similar word candidate list, find the minimum edit distance (number of mismatched characters) between its kanji notation and all the kanji character strings contained in the kanji character matrix starting from the current kanji character position x.
[0126]
(6) Extract from the similar word candidate list the words whose relative edit distance (ratio of mismatched characters to the length of the character string) is 0.5 or less, arrange them in order of appearance frequency, and generate up to five of them as similar word candidates.
[0127]
(7) Finally, the union of the word candidates generated using the kanji character string as a key and the word candidates generated using the kana character string as a key (with duplicate word candidates merged into one) is taken as the final set of similar word candidates.
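Steps (1) to (7) can be sketched in Python as follows. All names are hypothetical, the dictionary is assumed to be a list of (kanji, kana, frequency) entries, the relative edit distance is assumed to be measured against the length of the dictionary notation, and `max_len` is an assumed bound on the strings read from a character matrix.

```python
from itertools import product

def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def strings_from(matrix, start, max_len):
    """Set of all strings readable from the character matrix beginning at `start`."""
    out = set()
    for end in range(start + 1, min(len(matrix), start + max_len) + 1):
        out.update("".join(chars) for chars in product(*matrix[start:end]))
    return out

def similar_by_key(dictionary, key_strings, other_strings, key_index,
                   threshold=0.5, max_candidates=5):
    """Steps (1)-(3) with key_index=0 (kanji key) or (4)-(6) with key_index=1."""
    kept = []
    for entry in dictionary:
        if entry[key_index] not in key_strings:
            continue                      # key notation must match exactly
        other = entry[1 - key_index]
        dist = min((edit_distance(other, s) for s in other_strings),
                   default=len(other))
        if dist / len(other) <= threshold:
            kept.append(entry)
    kept.sort(key=lambda e: -e[2])        # descending appearance frequency
    return kept[:max_candidates]

def similar_words(dictionary, kanji_matrix, x, kana_matrix, y, max_len=6):
    """Step (7): union of the candidates found with each notation as the key."""
    kanji_strings = strings_from(kanji_matrix, x, max_len)
    kana_strings = strings_from(kana_matrix, y, max_len)
    by_kanji = similar_by_key(dictionary, kanji_strings, kana_strings, 0)
    by_kana = similar_by_key(dictionary, kana_strings, kanji_strings, 1)
    return {(e[0], e[1]) for e in by_kanji} | {(e[0], e[1]) for e in by_kana}

# Toy data: the kana matrix does not contain the correct third character.
dictionary = [("糸崎", "いとざき", 10), ("糸魚川", "いといがわ", 5)]
kanji_matrix = [["糸", "系"], ["崎", "埼"]]
kana_matrix = [["い", "り"], ["と", "こ"], ["さ", "き"], ["き", "さ"]]
found = similar_words(dictionary, kanji_matrix, 0, kana_matrix, 0)
```

In this toy case the dictionary word 糸崎/いとざき is recovered through its kanji notation, since its kana notation differs from the closest kana string in the matrix by one character out of four.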
[0128]
The relative edit distance threshold of 0.5 and the maximum number of candidates of 5 described here are examples of parameter settings; the optimum values are determined, for example, experimentally. In the above example, the similarity search for words is performed first using the kanji character string as a key and then using the kana character string as a key. It is also possible to perform the similarity search first using the kana character string as a key and then using the kanji character string as a key.
[0129]
【Example】
Finally, an example of processing according to an embodiment of the present invention will be described. FIG. 11 is a diagram for explaining an example of the optimum word string search for the pair of character matrices output by the character recognition device for the kanji notation “Itozakicho, Fukui City, Fukui Prefecture” and the kana notation “Fukuiken Fukuishi Itozakicho”.
[0130]
In this processing example, the character matrix uses up to the second candidate. For example, the first and second candidates for the kanji “Fuku” are “Fuku” and “及び”, respectively, and the first and second candidates for the kana “Fu” are “Fu” and “Ku”, respectively.
[0131]
In the optimum word string search, at each character position, all combinations of a word string reaching that position and a word starting from it are examined, and the joint probability of the word string at the end position of the starting word is updated.
[0132]
The left side of FIG. 11 represents the character position in kanji notation by x (horizontal axis) and the character position in kana notation by y (vertical axis), and shows the positional relationship between the word strings reaching a certain character position (6, 9) and the words starting there, that is, the start position of the last word in each word string that reaches the character position (6, 9) and the end position of each word starting from the character position (6, 9).
[0133]
The right side of FIG. 11 shows a state in which all the combinations of the last word in the word string reaching the character position (6, 9) and the word starting from the character position (6, 9) are examined. Each word is expressed as “kanji character string / kana character string”, and the coordinates of the start position and end position of the word are shown at the top of the box corresponding to the word. If the word is an unknown word, the word type is shown at the bottom of the box.
[0134]
There are three types of word candidates: those that completely match the word dictionary, those that similarity-match the word dictionary, and those that do not match the word dictionary (unknown words). For example, the word “Fukui City / Fukuishi” from the character position (3, 5) to the character position (6, 9) is a complete match: the kanji character string “Fukui City” and the kana character string “Fukuishi” are both in the character matrices, and “Fukui City / Fukuishi” is in the word dictionary.
[0135]
The word “Itozaki / Itozaki” from the character position (6, 9) to the character position (8, 13) is a similarity match. The kanji character string “Itozaki” is in the character matrix, but the kana character string “Itozaki” is not; the word was retrieved from the word dictionary using the kanji character string as a key. The word candidate “city / Ishi” from the character position (5, 7) to the character position (6, 9) is an unknown word candidate: both the kanji character string “city” and the kana character string “Ishi” are in the character matrices, but the word “city / Ishi” is not in the dictionary.
[0136]
FIG. 11 shows the state of processing at a certain character position (6, 9). By performing this processing at every lattice point on the plane from the origin (0, 0) to (9, 16), it is possible to obtain the word string that maximizes the product of the joint probability of the kanji notation and kana notation and the character confusion probability.
[0137]
The character recognition error correction method according to the embodiments of the present invention can be implemented as software (a program), and the character recognition error correction apparatus according to the embodiments can be realized by having the CPU of a computer execute that program. The program is recorded on a disk device or the like and installed on a computer as needed, stored on a portable recording medium such as a flexible disk, a memory card, or a CD-ROM and installed on a computer as needed, or installed on a computer via a communication line or the like, and is then executed by the CPU of the computer.
[0138]
Although representative examples of the present invention have been described above, the present invention is not limited to these examples, and various modifications and applications are possible within the scope of the claims.
[0139]
【Effect of the Invention】
As described above, according to the present invention, the kanji notation is matched with the kana notation using a language model that gives the joint probability of a kanji character string and a kana character string, a character recognition device model that gives the character confusion probability, and optimum word string search means that, based on the language model and the character recognition device model, finds the word string with the highest probability for the pair of character recognition results of the kanji character string and the kana character string. By exploiting the redundancy of expressing the same content in both kanji notation and kana notation, this realizes a character recognition error correction method that can correct character recognition errors that cannot be corrected from the kanji notation or the kana notation alone.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a character recognition error correction system according to a first embodiment of the present invention.
FIG. 2 is a flowchart of a character recognition error correction method according to a first embodiment of the present invention.
FIG. 3 is a block diagram of a character recognition error correction system according to a second embodiment of the present invention.
FIG. 4 is an explanatory diagram of an example of a word dictionary.
FIG. 5 is an explanatory diagram of an example of the appearance frequency of a word bigram.
FIG. 6 is an explanatory diagram of an example of definition of a word type.
FIG. 7 is an explanatory diagram of an example of an average length of kanji notation and an average length of kana notation for each word type.
FIG. 8 is an explanatory diagram of an example of the appearance frequency of characters bigram notation.
FIG. 9 is an explanatory diagram of an example of the appearance frequency of a character bigram notation.
FIG. 10 is a flowchart of optimum word string search processing according to a third embodiment of the present invention.
FIG. 11 is a diagram illustrating an example of optimum word string search.
[Explanation of symbols]
1 Character recognition device
2 Optimal word string search unit
3 Word matching part
4 Unknown word candidate generator
5 Similar word matching part
6 Word ngram model
7 Word dictionary
8 Unknown word model
9 Word model
10 Character recognition device model
100 Character recognition error correction device

Claims

A character recognition error correction apparatus that receives as input the character recognition results of a pair consisting of a kanji string and a kana string representing the same content as the kanji string, and corrects character recognition errors in the input, the apparatus comprising:
a word dictionary that stores pairs of the kanji notation and kana notation of words;
a word ngram model that gives the appearance probability of an ngram of pairs of the kanji notation and kana notation of words;
an unknown word model that gives an unknown word probability, which is the probability that a pair of an arbitrary kanji string and kana string not registered in the word dictionary constitutes a word;
a character recognition device model that gives a character confusion probability, which is the probability that a character recognition device recognizes an input character ci as a character cj;
input means for inputting, for each of the kanji notation and the kana notation, a character matrix, which is a list in which the character candidates at each character position are arranged in descending order of character recognition score;
word matching means for searching the word dictionary for words that completely match a pair of a kanji string included in the character matrix of the kanji notation and a kana string included in the character matrix of the kana notation, and identifying matching words;
unknown word candidate generating means for generating, using the unknown word model, a predetermined number of unknown words having a high unknown word probability from among unknown word candidates, which are pairs of a kanji string included in the character matrix of the kanji notation and a kana string included in the character matrix of the kana notation that are not registered in the word dictionary;
similar word matching means for searching the word dictionary for words that similarity-match a pair of a kanji string included in the character matrix of the kanji notation and a kana string included in the character matrix of the kana notation, and identifying similar words; and
optimal word string search means for obtaining, based on the word ngram model, the unknown word model, and the character recognition device model, from among the matching words, the unknown words, and the similar words, a word string of pairs of kanji notation and kana notation that maximizes, for the input pair of kanji notation and kana notation, the product of the word ngram appearance probability and the character confusion probability,
wherein the unknown word model comprises:
word type probability calculating means for obtaining a word type probability, which is the appearance probability of a word type, from the appearance frequency of the word type;
word type determination means for classifying an unknown word into one of the word types defined based on the types of characters constituting the kanji notation of the word;
word length probability calculating means for obtaining, based on the distributions of the kanji notation length and the kana notation length for each word type and on the word type of the unknown word, a word length probability that is the product of the probability of the kanji notation length and the probability of the kana notation length of the unknown word;
word notation probability calculating means for obtaining the appearance probability of the kanji string from the character ngram frequencies of kanji notation, obtaining the appearance probability of the kana string from the character ngram frequencies of kana notation, and obtaining a word notation probability, which is the joint probability of the unknown word; and
unknown word probability calculating means for calculating the unknown word probability of the unknown word as the product of the word type probability, the word length probability, and the word notation probability of the unknown word.

A character recognition error correction method that receives as input the character recognition results of a pair consisting of a kanji string and a kana string representing the same content as the kanji string, and corrects character recognition errors in the input, the method being performed in an apparatus having:
a word dictionary that stores pairs of the kanji notation and kana notation of words;
a word ngram model that gives the appearance probability of an ngram of pairs of the kanji notation and kana notation of words;
an unknown word model that gives an unknown word probability, which is the probability that a pair of an arbitrary kanji string and kana string not registered in the word dictionary constitutes a word; and
a character recognition device model that gives a character confusion probability, which is the probability that a character recognition device recognizes an input character ci as a character cj,
the method comprising:
an input step of inputting, for each of the kanji notation and the kana notation, a character matrix, which is a list in which the character candidates at each character position are arranged in descending order of character recognition score;
a word matching step in which word matching means searches the word dictionary for words that completely match a pair of a kanji string included in the character matrix of the kanji notation and a kana string included in the character matrix of the kana notation, and identifies matching words;
an unknown word candidate generation step in which unknown word candidate generating means generates, using the unknown word model, a predetermined number of unknown words having a high unknown word probability from among unknown word candidates, which are pairs of a kanji string included in the character matrix of the kanji notation and a kana string included in the character matrix of the kana notation that are not registered in the word dictionary;
a similar word matching step in which similar word matching means searches the word dictionary for words that similarity-match a pair of a kanji string included in the character matrix of the kanji notation and a kana string included in the character matrix of the kana notation, and identifies similar words; and
an optimal word string search step in which optimal word string search means obtains, based on the word ngram model, the unknown word model, and the character recognition device model, from among the matching words, the unknown words, and the similar words, a word string of pairs of kanji notation and kana notation that maximizes, for the input pair of kanji notation and kana notation, the product of the word ngram appearance probability and the character confusion probability,
wherein, in the unknown word model, the following steps are performed:
a step in which word type probability calculating means obtains a word type probability, which is the appearance probability of a word type, from the appearance frequency of the word type;
a step of classifying the unknown word into one of the word types defined based on the types of characters constituting the kanji notation of the word;
a step in which word length probability calculating means obtains, based on the distributions of the kanji notation length and the kana notation length for each word type and on the word type of the unknown word, a word length probability that is the product of the probability of the kanji notation length and the probability of the kana notation length of the unknown word;
a step in which word notation probability calculating means obtains the appearance probability of the kanji string from the character ngram frequencies of kanji notation, obtains the appearance probability of the kana string from the character ngram frequencies of kana notation, and obtains a word notation probability, which is the joint probability of the unknown word; and
an unknown word probability calculating step in which unknown word probability calculating means calculates the unknown word probability of the unknown word as the product of the word type probability, the word length probability, and the word notation probability of the unknown word.

A program for realizing the character recognition error correction method according to claim 2 by a computer.