JP3669626B2

JP3669626B2 - Search device, recording medium, and program

Info

Publication number: JP3669626B2
Application number: JP2001168888A
Authority: JP
Inventors: 太郎今川; 堅司近藤; 善彦松川; 強司目片
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2000-06-06
Filing date: 2001-06-04
Publication date: 2005-07-13
Anticipated expiration: 2021-06-04
Also published as: JP2002063197A

Description

【０００１】
【発明の属する技術分野】
インデックステーブルを用いて、オリジナル文書中の文字を認識することによって得られる文字認識結果からキーワードを検索する検索装置、記録媒体およびプログラムに関する。
【０００２】
【従来の技術】
近年、インターネットの普及に伴い、ネットワーク上に存在する大量の情報から必要な情報を取り出す検索技術が重要視されている。特に、テキストデータから特定のキーワードを検索するシステムは、既に数多く提供されている。このような検索においては、大量のテキスト文書から正確で、高速な検索を行うことが求められている。
【０００３】
高速な検索を行うために、インデックステーブルを用いてテキストデータから特定のキーワードを検索する技術が知られている。インデックステーブルは、所定の数の文字（例えば、２文字）を含むインデックス文字列と、その文字列に一致するテキストデータ中の部分の位置とを定義する。
【０００４】
オリジナル文書（紙の形態の文書）中の文字を文字認識することによって得られる文字コードの集合（文字認識結果）からキーワードを検索する場合には、文字認識の誤り（誤認識）を考慮しなければならない。文字認識において誤りがある場合、文字コードが表す文字はオリジナル文書に書かれている文字と異なり得るからである。誤認識とは、オリジナル文書に書かれた文字が、正しく文字コードに変換されないことをいう。このような誤認識は、例えば、紙面に印字された文字のかすれや傾き、汚れ等に起因して発生する。
【０００５】
例えば、オリジナル文書のある位置に、「イヌ」という文字列が存在し、この文字列中の文字「ヌ」が「ス」と誤認識された場合、文字列「イヌ」に対応する文字認識結果中の部分の位置には、文字列「イス」が存在する。その結果、この文字認識結果から作成されたインデックステーブルには、インデックス文字列「イス」とその位置とが登録される。従って、このインデックステーブルを用いてキーワード「イヌ」を検索しても、文字認識結果中のその位置にキーワードを検出することができない。このように、オリジナル文書中のある位置にキーワードが存在するにもかかわらず、その位置においてキーワードが検出できないという、「検索漏れ」の問題が発生する。
【０００６】
検索漏れの問題に対処する従来技術として、オリジナル文書中の１つの文字に対する文字認識結果として複数の候補文字を用意し、その複数の候補文字に基づいて、オリジナル文書中に存在する可能性のある複数の文字列をインデックス文字列としてインデックステーブルに登録する技術が知られている。キーワードの検索は、このインデックステーブルを用いて行なわれる。このような技術は、例えば、特開平９−１６６１９号公報「情報処理方法および装置」に開示されている。
【０００７】
図１１は、従来技術によって、オリジナル文書中に存在する可能性のある複数の文字列をインデックス文字列として登録したインデックステーブル１９０１の一例を示す。図１１に示される例では、インデックステーブル１９０１は、「インデックスを用いた・・・」という文字列を含むオリジナル文書を文字認識することによって得られる。インデックステーブル１９０１には、インデックス文字列「イシ」とインデックス文字列「イン」とがいずれも文字認識結果中の同じ位置である文字位置「１」に存在するものとして登録されている（行１９１１および行１９１２）。
【０００８】
図１１に示されるインデックステーブル１９０１を使用することにより、「インデックス」というキーワードを検出することができる。以下、従来技術に従って、図１１に示されるインデックステーブル１９０１を使用してキーワード「インデックス」を検索する処理を説明する。
【０００９】
まず、キーワード中に含まれる、互いに隣り合う２文字からなる文字列が生成される。キーワード「インデックス」から、「イン」、「ンデ」、「デッ」、「ック」および「クス」という５個の文字列が生成される。
【００１０】
次に、これらの文字列がインデックステーブル１９０１から検索される。文字列「イン」、「ンデ」、「デッ」、「ック」および「クス」は、それぞれ、文字認識結果中の文字位置「１」、「２」、「３」、「４」および「５」に存在することが示されている（行１９１２、行１９１９、行１９１５、行１９１４および行１９１３）。
【００１１】
これらの文字位置の位置関係から、キーワード「インデックス」が文字認識結果中に存在していることが判断される。
【００１２】
このようにして、オリジナル文書中に存在する可能性のある複数の文字列をインデックス文字列として登録したインデックステーブルを用いた従来技術によれば、検索漏れの問題が回避され得る。
【００１３】
【発明が解決しようとする課題】
このような従来技術によれば、検索ノイズが増加してしまうという問題点がある。検索ノイズとは、オリジナル文書中にキーワードが存在しないにもかかわらず、キーワードが検出されることをいう。例えば、図１１に示されるインデックステーブル１９０１を使用して、「デンワ」および「フック」というキーワードを検索した場合に、それらのキーワードが文字位置「３」において検出される。検索結果が正当であるかどうかを判断するためには、ユーザがオリジナル文書と検索結果を比較しなければならない。
【００１４】
検索漏れの問題を防ぐために、１つの文字に対する文字認識結果として得られる候補文字の数を多くするほど、このような検索ノイズが多くなり、ユーザが検索結果が正当であるかどうかを判断する負担が増加する。
【００１５】
本発明は、このような問題点に鑑みてなされたものであって、高速な検索を行い、かつ、検索結果の正当性を容易に判定することが可能な検索装置、記録媒体およびプログラムを提供することを目的とする。
【００１６】
【課題を解決するための手段】
本発明の検索装置は、オリジナル文書中の文字のそれぞれを認識することによって前記各文字に対応する少なくとも１つの候補文字を有する前記オリジナル文書の文字認識結果から、インデックステーブルを用いて、複数の文字よりなるキーワードを検索する検索装置であって、前記インデックステーブルは、前記オリジナル文書中に存在する文字列を構成する複数の文字のそれぞれに対応する前記候補文字を組み合わせた文字列によって構成されるインデックス文字列と、前記オリジナル文章中における前記インデックス文字列の位置と、前記インデックス文字列に含まれる各候補文字毎に、前記オリジナル文書中に前記各候補文字がそれぞれ存在する確率として予め定義された確信度とを含み、前記インデックス文字列と同じ文字数の前記キーワードの文字列が前記インデックステーブルに存在するかを判定し、存在する場合には、そのキーワードの文字列と一致するインデックス文字列の前記位置に基づいて、前記オリジナル文章中における前記キーワードの文字列の位置を特定する位置特定部と、前記キーワードの文字列に一致するインデックス文字列の確信度に基づいて、前記特定された位置に前記キーワードが存在する確率として定義されるキーワード確信度を算出する算出部と、前記キーワード確信度に基づいて検索結果の正当性を判定する判定部とを備えており、これにより、上記目的が達成される。
【００１８】
前記判定部は、前記キーワード確信度が所定の値以上である場合に、前記検索結果を正当であると判定してもよい。
【００１９】
前記所定の値は、前記キーワードに含まれる文字の数および前記キーワードに含まれる文字の種類の少なくとも一方に応じて設定されてもよい。
【００２０】
前記インデックステーブルを作成するインデックステーブル作成部をさらに含み、前記インデックステーブル作成部は、前記文字認識結果による前記候補文字が複数生成される場合に、前記オリジナル文書の文字列におけるそれぞれの文字に対応して生成された前記候補文字のそれぞれ同士を組み合わせることにより、前記インデックス文字列を生成してもよい。
【００２１】
前記インデックステーブルを作成するインデックステーブル作成部をさらに含み、前記インデックステーブル作成部は、前記文字認識結果により前記各文字に対して１つの前記候補文字をそれぞれ生成するとともに、生成された前記各候補文字に類似する類似文字を前記確信度に基づいて生成し、前記オリジナル文書の文字列におけるそれぞれの文字に対応する前記候補文字と前記類似文字との組み合わせ、または前記類似文字同士の組み合わせによっても前記インデックス文字列を生成してもよい。
【００２２】
本発明の記録媒体は、オリジナル文書中の文字のそれぞれを認識することによって前記各文字に対応する少なくとも１つの候補文字を有する前記オリジナル文書の文字認識結果から、インデックステーブルを用いて、複数の文字よりなるキーワードを検索する検索処理をコンピュータにより実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体であって、前記インデックステーブルは、前記オリジナル文書中に存在する文字列を構成する複数の文字のそれぞれに対応する前記候補文字を組み合わせた文字列によって構成されるインデックス文字列と、前記オリジナル文章中における前記インデックス文字列の位置と、前記インデックス文字列に含まれる各候補文字毎に、前記オリジナル文書中に前記各候補文字がそれぞれ存在する確率として予め定義された確信度とを含み、前記検索処理は、前記インデックス文字列と同じ文字数の前記キーワードの文字列が前記インデックステーブルに存在するかを判定し、存在する場合には、そのキーワードの文字列と一致するインデックス文字列の前記位置に基づいて、前記オリジナル文書中における前記キーワードの文字列の位置を特定するステップと、前記キーワードの文字列に一致するインデックス文字列の確信度に基づいて、前記特定された位置に前記キーワードが存在する確率として定義されるキーワード確信度を算出するステップと、前記キーワード確信度に基づいて検索結果の正当性を判定するステップとを包含するプログラムを記録しており、これにより、上記目的が達成される。
【００２３】
本発明のプログラムは、オリジナル文書中の文字のそれぞれを認識することによって前記各文字に対応する少なくとも１つの候補文字を有する前記オリジナル文書の文字認識結果から、インデックステーブルを用いて、複数の文字よりなるキーワードを検索する検索処理をコンピュータにより実行させるためのプログラムであって、前記インデックステーブルは、前記オリジナル文書中に存在する文字列を構成する複数の文字のそれぞれに対応する前記候補文字を組み合わせた文字列によって構成されるインデックス文字列と、前記オリジナル文章中における前記インデックス文字列の位置と、前記インデックス文字列に含まれる各候補文字毎に、前記オリジナル文書中に前記各候補文字がそれぞれ存在する確率として予め定義された確信度とを含み、前記検索処理は、前記インデックス文字列と同じ文字数の前記キーワードの文字列が前記インデックステーブルに存在するかを判定し、存在する場合には、そのキーワードの文字列と一致するインデックス文字列の前記位置に基づいて、前記オリジナル文章中における前記キーワードの文字列の位置を特定するステップと、前記キーワードの文字列に一致するインデックス文字列の確信度に基づいて、前記特定された位置に前記キーワードが存在する確率として定義されるキーワード確信度を算出するステップと、前記キーワード確信度に基づいて検索結果の正当性を判定するステップとを包含し、これにより、上記目的が達成される。
【００２４】
【発明の実施の形態】
本明細書中で、文字とは、特定の言語体系において使用される文字に限定されず、数字、記号（例えば、「）」や「◎」）を含むあらゆるシンボルをいう。このようなシンボルには、そのシンボルを電子的に表現するためのコード（文字コード）が割り当てられている。
【００２５】
図１は、本発明の検索装置１の構成を示す。検索装置１は、オリジナル文書中の文字を認識することによって得られる文字認識結果からキーワードを検索する。
【００２６】
検索装置１は、その構成要素として、端末１００と、文書登録処理および文書検索処理を実行するＣＰＵ１１０と、文書を画像データとして入力する画像入力機器１２０と、ワークメモリ１８０と、ハードディスク（ＨＤＤ）１７０とを備える。これらの構成要素は、内部バス１１０１を介して互いに接続されている。あるいは、これらの構成要素は、任意のタイプのネットワークを介して互いに接続されていてもよい。
【００２７】
端末１００は、例えば、キーボードとＣＲＴとを備えた入出力デバイスである。端末１００は、例えば、検索装置１が実行する処理をユーザが指定したり、検索装置１が実行した検索処理結果をユーザに表示するために用いられる。
【００２８】
ＨＤＤ１７０には、文書登録プログラム１１０３と、文書検索プログラム１１０４と、文字認識パターン辞書１６０と、確信度テーブル１５０と、文書データ１１０２とが格納されている。ＨＤＤ１７０として、任意のタイプのメモリが使用されてもよい。
【００２９】
文書登録プログラム１１０３および文書検索プログラム１１０４の全体または一部は、任意のタイプの通信回線（図示せず）または放送を介して検索装置１に提供されてもよいし、任意のタイプのコンピュータ読み取り可能な記録媒体に記録された形態で検索装置１に提供されてもよい。そのような記録媒体は、例えば、ＤＶＤ−ＲＯＭ、ＣＤ−ＲＯＭ、フレキシブルディスク等である。そのような記録媒体に記録された文書登録プログラム１１０３および文書検索プログラム１１０４は、ディスクドライブ等の読み取りデバイスによって検索装置１にインストールされ得る。
【００３０】
図２は、オリジナル文書中の文字を認識することによって得られる文字認識結果からキーワードを検索するために、検索装置１によって実行される処理の流れを示す。
【００３１】
ユーザが端末１００（図１）から文書登録処理の開始を指示すると、ＨＤＤ１７０に格納された文書登録プログラム１１０３がワークメモリ１８０にロードされる。ＣＰＵ１１０は、ワークメモリ１８０に高速にアクセスすることができる。ＣＰＵ１１０が文書登録プログラム１１０３を実行することにより、文書登録処理が行なわれる。
【００３２】
文書登録処理は、文字認識処理と、インデックステーブル作成処理とを含む。文字認識処理と、インデックステーブル作成処理とはそれぞれ、文書登録プログラムの一部である文字認識プログラム（図示せず）と、インデックステーブル作成プログラム（図示せず）とをＣＰＵ１１０が実行することによって行なわれる。
【００３３】
文字認識処理では、画像入力機器１２０によってオリジナル文書が読み取られ、オリジナル文書の画像データ（文書画像データ）１３０が生成される。文書画像データ１３０は、ＨＤＤ１７０（図１）に格納される。文書画像データ１３０中の部分領域によって表される形状と文字認識パターン辞書１６０（図１）に登録されている文字の形状の類似性に基づいて、文字認識処理が行なわれる。文字認識処理の結果は、文字認識結果１４０としてＨＤＤ１７０に格納される。
【００３４】
次に、インデックステーブル作成処理では、文字認識結果１４０からインデックステーブル１９０が作成される。インデックステーブル１９０は、ＨＤＤ１７０に格納される。インデックステーブル作成処理において、確信度テーブル１５０（図１）が参照され得る。
【００３５】
ＨＤＤ１７０に格納された文書画像データ１３０と、文字認識結果１４０と、インデックステーブル１９０とは、文書データ１１０２（図１）の少なくとも一部を構成する。
【００３６】
ユーザが端末１００からキーワードを入力し、文書検索処理の開始を指示すると、ＨＤＤ１７０に格納された文書検索プログラム１１０４がワークメモリ１８０にロードされる。ＣＰＵ１１０が文書検索プログラム１１０４を実行することにより、文書検索処理が行なわれる。文書検索処理では、インデックステーブル１９０を用いて、文字認識結果からキーワードが検索される。
【００３７】
文書登録処理によってインデックステーブル１９０がいったん生成されると、キーワードの検索は、インデックステーブル１９０を参照して行なわれる。検索すべきキーワードが変わっても、新たなインデックステーブル１９０を作成する必要はない。
【００３８】
なお、図２に示される全ての処理が検索装置１によって行われることは必須ではない。例えば、文書登録処理が検索装置１とは別の機器によって行なわれ、生成されたインデックステーブル１９０を用いた文書検索処理のみが検索装置１によって行なわれてもよい。
【００３９】
図３は、オリジナル文書１３１０の一例を示す。オリジナル文書１３１０は、「インデックスを用いた検索方法。文書データからの」という文字列を含む。オリジナル文書は、例えば、文字列が印刷された紙の形態の文書である。オリジナル文書は、あるいは、標識、看板、掲示板等に書かれた形態の文書であってもよい。
【００４０】
図４は、オリジナル文書１３１０に対して文字認識処理を行うことにより得られる文字認識結果１４０の一例を示す。文字認識結果１４０は、文字位置１０４２と、候補文字１０４３とを含む。図４において、各候補文字に添えられたカッコ内の数字は、各候補文字についての信頼度Ｒｒを示す。文字認識結果１４０は、オリジナル文書１３１０（図３）に含まれる「イ」、「ン」、「デ」、「ッ」、「ク」、「ス」という各文字の文字認識結果として、最大の信頼度Ｒｒが得られた候補文字が、それぞれ、「イ」、「シ」、「テ」、「ソ」、「タ」、「ス」であることを示す。
【００４１】
文字認識処理は、任意のアルゴリズムに従って実行され得る。文字認識処理は、例えば、１文字単位に文書画像データ１３０を切り出し、その切り出された１文字単位の画像データ（部分領域）を文字コードに変換していくというアルゴリズムに従って実行され得る。
【００４２】
部分領域から文字コードへの変換の際には、部分領域によって表される形状と、文字認識パターン辞書１６０（図１）に登録されている文字の形状とが比較される。所定の判定基準に基づいて形状が類似していると判定された文字が、候補文字として得られる。このようにして、部分領域が、候補文字の文字コードへと変換されていく。１つの部分領域に対応する候補文字が複数得られてもよい。
【００４３】
候補文字は、その形状と部分領域によって表される形状とが類似しているために、オリジナル文書のその部分領域に対応する部分に書かれている文字と一致する可能性が高いとみなし得る文字を意味する。
【００４４】
文字認識結果１４０における各欄（例えば、欄１０４４）は、文書画像データ１３０中の部分領域に対応している。すなわち、オリジナル文書１３１０の部分（例えば、図３に示される部分１３１１）に対応している。また、文書画像データ１３０中の部分領域は、文書画像データから１文字単位に切り出されるので、部分領域は、オリジナル文書１３１０の１文字（例えば、図３に示される部分１３１１に書かれている文字「ク」）に対応している。
【００４５】
欄１０４４に示される候補文字「タ」、「ウ」、「ワ」および「ク」は、対応するオリジナル文書１３１０の部分（図３に示される部分１３１１）に書かれている文字と一致する可能性が高いとみなし得る文字である。
【００４６】
文字位置１０４２は、文字認識結果１４０における、その候補文字の位置を示す。例えば、欄１０４４に示される文字位置「５」は、文字認識結果１４０における欄１０４４（文字認識結果中の部分）の位置が、「５番目」の位置であることを示す。
【００４７】
文字位置１０４２の表現方法としては、候補文字に対応するオリジナル文書１３１０中の部分が特定できさえすれば、どのような表現方法を使用してもよい。上述したように、文字認識結果中の各欄は文書画像データ１３０の部分領域に対応する。従って、文字位置１０４２は、候補文字が含まれる欄の文字認識結果中の位置によって表されてもよいし、その欄が対応する文書画像データ１３０の部分領域の文書画像データ１３０中の位置によって表されてもよい。
【００４８】
例えば、文字位置１０４２は、オリジナル文書の文書名と、ページ番号と、行番号と、その行における先頭からの位置（何文字目であるか）によって表されてもよいし、文書画像データにおける座標やアドレスによって表されてもよい。
【００４９】
信頼度Ｒｒは、文字認識の確からしさ、すなわち、正解確率を示す。信頼度Ｒｒは、０以上１以下の値をとり、値が大きいほど確からしさが大きいものとする。文字認識には、例えば、ニューラルネットワークやベクトル量子化やテンプレートマッチングの手法を採用することができる。
【００５０】
文字認識にニューラルネットワークの手法を採用する場合には、文字認識パターン辞書１６０に登録されている文字のうち、出力値がある基準以上である少なくとも１つのニューロンに対応する文字が候補文字として得られる。ニューロンの出力値と正解確率との対応関係を予め求めておき、その対応関係に基づいて、各候補文字に対応するニューロンの出力値から、信頼度Ｒｒを求めることができる。
【００５１】
ベクトル量子化やテンプレートマッチングの手法は、いずれも、文書画像データ１３０の部分領域によって表される形状と、文字認識パターン辞書１６０に登録されている文字の形状との特徴量空間における距離を求めることにより、文字認識を行う手法である。１つの形状は、特徴量空間における１つの代表点として表される。これらの手法が採用される場合には、文字認識パターン辞書１６０に登録されている文字のうち、特徴量空間における距離がある基準以下である少なくとも１つの文字が候補文字として得られる。特徴量空間における距離と正解確率との対応関係を予め求めておき、その対応関係に基づいて、各候補文字に対応する特徴量空間における距離から、信頼度Ｒｒを求めることができる。
【００５２】
文字認識にいずれの手法を用いた場合でも、信頼度Ｒｒは、文書画像データ１３０の部分領域によって表される形状と、文字認識パターン辞書１６０に登録されている文字の形状との類似性を反映する。
【００５３】
信頼度Ｒｒとしては、形状の類似性以外の情報が考慮されてもよい。例えば、文書画像データ１３０中の文字認識の対象となる部分領域の大きさの偏差ＳＲや、行におけるその部分領域の相対的位置の偏差ＬＲなどが考慮されてもよい。
【００５４】
部分領域の大きさの偏差ＳＲは、例えば、文書画像データ１３０におけるすべての部分領域（それぞれが１つの文字に対応する）の大きさの平均値からの、その部分領域の大きさの偏差として定義され得る。予め、偏差ＳＲと文字認識の正解確率との対応関係を求めておくことにより、偏差ＳＲが大きい場合に信頼度Ｒｒが小さくなるように、信頼度Ｒｒを修正することができる。
【００５５】
部分領域の相対位置の偏差ＬＲは、例えば、文書画像データにおける同一の行のすべての部分領域（それぞれが１つの文字に対応する）について、行に垂直な方向の位置の平均値を求め、その部分領域の行に垂直な方向の位置のこの平均値からの偏差として定義され得る。予め、偏差ＬＲと文字認識の正解確率との対応関係を求めておくことにより、偏差ＬＲが大きい場合に信頼度Ｒｒが小さくなるように、信頼度Ｒｒを修正することができる。
【００５６】
このように、信頼度Ｒｒを偏差ＳＲおよび／または偏差ＬＲに応じて修正することにより、信頼度Ｒｒをより適切に設定することができる。
【００５７】
図４に示される文字認識結果１４０から、インデックステーブル１９０が作成される（インデックステーブル作成処理）。
【００５８】
図５Ａは、インデックステーブル作成処理の手順を示す。以下、インデックステーブル作成処理の手順を詳しく説明する。
【００５９】
ステップＳ４０１：文字認識結果１４０中の注目している候補文字の信頼度Ｒｒが基準値以上であるか否かが判定される。基準値は、例えば、「０．０５」であり得る。ステップＳ４０１における判定結果が「Ｙｅｓ」である場合には、処理はステップＳ４０２に進む。ステップＳ４０１における判定結果が「Ｎｏ」である場合には、処理はステップＳ４０４に進む。
【００６０】
なお、文字認識処理によって文字認識結果１４０（図４）を得る際に、信頼度Ｒｒが基準値以上である候補文字のみを文字認識結果１４０に含むようにしてもよい。その場合には、ステップＳ４０１における処理は省略され得る。
【００６１】
ステップＳ４０２：候補文字の確信度Ｃｒが計算される。確信度Ｃｒは、例えば、各候補文字についての信頼度Ｒｒに基づいて、（数１）により計算される。
【００６２】
【数１】
確信度Ｃｒ＝候補文字についての信頼度Ｒｒ×文字別係数Ｋｒ
文字別係数Ｋｒは、予め、１つの文字（例えば、「イ」）ごとに定義されている。文字別係数Ｋｒは、通常の文書中におけるその文字の出現確率に依存する。文字は、その種類ごとに通常の文書中における出現確率が異なる。例えば、一般の日本語の文書では、文字「ゐ」は、文字「る」よりも出現確率が低い。このように、出現確率が低い文字については、文字別係数Ｋｒが低く設定される。逆に、出現確率が高い文字については、文字別係数Ｋｒが高く設定される。文字ごとの出現確率は、予め、大量の一般的な文書を対象として統計的に求めることができる。
【００６３】
各候補文字についての確信度Ｃｒは、その候補文字についての信頼度Ｒｒに候補文字と同一の文字（文字コードが一致する文字）についての文字別係数Ｋｒを掛けることによって求められる。このようにして計算された確信度Ｃｒは、候補文字と同一の文字がオリジナル文書中の特定の部分に存在する確率を示す。そのような特定の部分とは、文字認識結果１４０（図４）において、その候補文字が含まれる欄（文字認識結果中の部分）が対応するオリジナル文書中の部分である。
【００６４】
ただし、確信度Ｃｒが必ずしも統計学的な確率そのものである必要はない。確信度Ｃｒは、統計学的な確率を所定の基準に従って正規化した値であり得る。このような所定の基準は、候補文字の確信度Ｃｒが、候補文字と同一の文字がオリジナル文書中の特定の部分に存在する確率を示すという性質を保持する限り、任意の基準であり得る。確信度Ｃｒは、実数表現でなく整数表現によって表されてもよい。あるいは、確信度Ｃｒは、確信度Ｃｒのレベルを段階的に示す記号によって表されてもよい（例えば、○：高、△：中、×：低）。
【００６５】
なお、文字ごとの出現確率が不明である場合には、文字別係数Ｋｒをすべての文字について一定としてもよい。また、字種（漢字、カタカナ、ひらがな）ごとに文字別係数Ｋｒを設定してもよい。
【００６６】
ステップＳ４０３：候補文字と、ステップＳ４０２で求められた確信度Ｃｒとが候補文字−確信度テーブルに登録される。
【００６７】
図５Ｂは、候補文字−確信度テーブル１５０１の一例を示す。候補文字と確信度Ｃｒとは、文字位置１０４２（図４）ごとに、候補文字−確信度テーブル１５０１に登録される。
【００６８】
図５Ａを再び参照して、インデックステーブル作成処理の説明を続ける。
【００６９】
ステップＳ４０４：すべての文字位置のすべての候補文字について、ステップＳ４０１〜ステップＳ４０３の処理が行なわれたか否かが判定される。ステップＳ４０４における判定結果が「Ｙｅｓ」である場合には、処理はステップＳ４０５に進む。ステップＳ４０４における判定結果が「Ｎｏ」である場合には、他の候補文字について、ステップＳ４０１からの処理が行なわれる。
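Steps S401 to S404 above (discard candidates whose reliability Rr is below the reference value, compute the certainty factor Cr = Rr × Kr, and register the result per character position) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout, and the sample Rr and Kr values are hypothetical; only the reference value 0.05 and the formula come from the description.

```python
RR_THRESHOLD = 0.05  # example reference value for step S401 given in the text


def build_candidate_certainty_table(recognition, kr, default_kr=1.0):
    """Return {position: {candidate: Cr}} from {position: [(candidate, Rr), ...]}.

    `kr` maps a character to its per-character coefficient Kr. Characters
    missing from it use `default_kr` (the constant-Kr case of the text).
    """
    table = {}
    for position, candidates in recognition.items():
        for char, rr in candidates:
            if rr < RR_THRESHOLD:               # S401: drop unreliable candidates
                continue
            cr = rr * kr.get(char, default_kr)  # S402: Cr = Rr x Kr
            table.setdefault(position, {})[char] = cr  # S403: register
    return table                                # S404: every candidate visited


# Hypothetical recognition result and coefficients, for illustration only.
recognition = {1: [("イ", 0.9), ("ィ", 0.01)], 2: [("シ", 0.8), ("ゐ", 0.5)]}
kr = {"ゐ": 0.1}
print(build_candidate_certainty_table(recognition, kr))
```

The rarely used character 「ゐ」 keeps its place in the table but with a much lower certainty factor than its reliability alone would suggest, which is the effect the per-character coefficient is meant to have.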
【００７０】
ステップＳ４０５：候補文字−確信度テーブル１５０１（図５Ｂ）の隣接した文字位置に登録された候補文字からインデックステーブルが作成される。インデックステーブルは、インデックス文字列と、文字位置と、確信度Ｃｒとを定義する。
【００７１】
インデックス文字列は、候補文字−確信度テーブル１５０１（図５Ｂ）の隣接した文字位置に登録された候補文字を組み合わせることによって生成される。例えば、候補文字−確信度テーブル１５０１の文字位置「１」に登録された候補文字「イ」と、隣接した文字位置「２」に登録された候補文字「シ」とを組み合わせることによって、インデックス文字列「イシ」が生成される。
【００７２】
図６は、インデックステーブル作成処理によって作成されたインデックステーブルの一例を示す。インデックステーブル１９０の欄１６１０は、インデックス文字列を示す。欄１６１１は、インデックス文字列に含まれる先頭の候補文字の文字位置を示す。欄１６１２はインデックス文字列に含まれる候補文字のそれぞれについて定義される確信度Ｃｒの組を示す。
【００７３】
インデックステーブル１９０に含まれる行１６０２は、インデックス文字列「イシ」に一致する文字認識結果１４０中の部分の位置が「１」であり、インデックス文字列「イシ」の文字「イ」について定義された確信度Ｃｒが０．９であり、インデックス文字列「イシ」の文字「シ」について定義された確信度Ｃｒが０．８であることを示す。
【００７４】
インデックス文字列「イシ」（行１６０２）に含まれる候補文字のそれぞれについて定義される確信度Ｃｒの組は、各候補文字についてステップＳ４０２（図５Ａ）で算出された確信度Ｃｒの組として得られる。なお、確信度Ｃｒの組として、各候補文字についてステップＳ４０２（図５Ａ）で算出された確信度Ｃｒにインデックス文字列ごとの係数を掛けた値の組が用いられてもよい。インデックス文字列ごとの係数は、例えば、一般の文書中に出現する確率が小さいインデックス文字列については、低く設定され得る。例えば、文字列「ヲヲ」や文字列「ヰヰ」は、一般の日本語の文書中に出現する確率は小さい。このようなインデックス文字列に対しては、インデックス文字列ごとの係数は低く設定され得る。
【００７５】
候補文字−確信度テーブル１５０１（図５Ｂ）の隣接した文字位置に登録された候補文字を組み合わせることによってインデックス文字列を生成することは、文字認識結果１４０（図４）に示される複数の欄のうち、隣接した（連続した）複数の欄（例えば、欄１０４５と欄１０４６）のそれぞれに含まれる候補文字を組み合わせることと等価である。
【００７６】
このように、図５Ａに示されるステップＳ４０１〜ステップＳ４０５において、ＣＰＵ１１０（図１）は、インデックステーブル１９０を作成するインデックステーブル作成部として機能する。
【００７７】
インデックステーブル１９０は、図５Ｂに示される候補文字−確信度テーブル１５０１の隣接した文字位置に登録された候補文字のすべての組み合わせをインデックス文字列として登録することによって作成される。
【００７８】
ただし、候補文字−確信度テーブル１５０１の隣接した文字位置に登録された候補文字のすべての組み合わせに重複する組み合わせがある場合には、インデックステーブル１９０には、１つのインデックス文字列について複数の文字位置と確信度Ｃｒの組とが登録される。例えば、候補文字−確信度テーブル１５０１の文字位置「２」および「３」に登録された候補文字「ン」および「ワ」からインデックス文字列「ンワ」が生成され、文字位置「４」および「５」に登録された候補文字「ン」および「ワ」からもインデックス文字列「ンワ」が生成される。この場合、１つのインデックス文字列「ンワ」について、文字位置２、確信度Ｃｒ（０．７，０．２）と文字位置４、確信度Ｃｒ（０．１，０．２）とがインデックステーブル１９０に登録される（行１６０４）。
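The index table construction of step S405, including the case where one index character string is registered with several positions, might look like the sketch below. The candidate–certainty table is a partial reconstruction that uses only the certainty values quoted in the text (the figures are not reproduced here, and candidates whose Cr is not quoted are omitted); the function name and data layout are assumptions.

```python
from itertools import product

# {character position: {candidate character: certainty factor Cr}}
# Partial reconstruction from values quoted in the description.
TABLE = {
    1: {"イ": 0.9},
    2: {"シ": 0.8, "ン": 0.7},
    3: {"デ": 0.8, "ワ": 0.2, "フ": 0.2},
    4: {"ッ": 0.3, "ン": 0.1},
    5: {"ワ": 0.2, "ク": 0.1},
    6: {"ス": 0.9},
}


def build_index_table(table):
    """Combine candidates at adjacent positions into 2-character index strings.

    Returns {index string: [(position of first character, (Cr, Cr)), ...]}.
    """
    index = {}
    for pos in sorted(table):
        following = table.get(pos + 1, {})
        for (a, cr_a), (b, cr_b) in product(table[pos].items(), following.items()):
            index.setdefault(a + b, []).append((pos, (cr_a, cr_b)))
    return index


index = build_index_table(TABLE)
print(index["イシ"])
print(index["ンワ"])  # one index string, two positions
```

Storing a list per index string keeps every (position, certainty pair) entry, so duplicated combinations such as 「ンワ」 are not lost.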
【００７９】
インデックス文字列に含まれる文字数は、予め定められている。図６に示される例では、インデックス文字列に含まれる文字数は、「２」である。インデックス文字列に含まれる文字数は、任意の自然数であり得る。しかし、一般に、インデックス文字列に含まれる文字数は２以上であることが好ましい。インデックス文字列に含まれる文字数が１であると、１つのインデックス文字列について登録される文字位置と確信度Ｃｒとの数が多くなり、検索を高速に行なうことができなくなるからである。
【００８０】
インデックステーブル１９０中のインデックス文字列は、検索を容易にするために所定の順序に従って順序付けられていることが好ましい。
【００８１】
インデックステーブル１９０は、図４に示される文字認識結果１４０中の１つの文字位置に対する複数の候補文字を用いて作成されている。その結果、インデックステーブル１９０は、同一の文字位置に対応する複数のインデックス文字列を含む。従って、複数のインデックス文字列が、文字認識結果の１つの部分に一致し得る。例えば、インデックステーブル１９０の行１６０２に示されるインデックス文字列「イシ」と、インデックステーブル１９０の行１６０３に示されるインデックス文字列「イン」とは、いずれも、文字位置「１」によって示される文字認識結果１４０中の部分（欄１０４５と欄１０４６とを包含する部分）に一致する。これによって、検索漏れを減らすことが可能になる。
【００８２】
このように、インデックス文字列と、文字認識結果の部分とが一致するとは、インデックス文字列に含まれる各文字が、文字認識結果の連続した部分（図４に示される隣接した欄）の１つに含まれる少なくとも１つの候補文字の１つと同一である（文字コードが等しい）という概念を含む。
【００８３】
インデックステーブル１９０のような、同一の文字位置に対応する複数のインデックス文字列を含むインデックステーブルは、１つの文字位置に対する候補文字が１つであるような文字認識結果からも作成することができる。
【００８４】
図７は、１つの文字位置に対する候補文字が１つである文字認識結果１４０ａの一例を示す。文字認識結果１４０ａは、図４に示される文字認識結果１４０と比較して、１つの文字位置に対する候補文字が１つであるという点が異なる。文字認識結果１４０ａは、オリジナル文書１３１０（図３）に含まれる「イ」、「ン」、「デ」、「ッ」、「ク」、「ス」という各文字が、それぞれ、「イ」、「シ」、「テ」、「ソ」、「タ」、「ス」と認識されたことを示す。オリジナル文書１３１０に含まれる文字「ン」、「デ」、「ッ」、「ク」は、誤って認識されている。
【００８５】
図８Ａは、図７に示される文字認識結果１４０ａからインデックステーブルを作成する処理（インデックステーブル作成処理）の手順を示す。
【００８６】
ステップＳ５０１：確信度テーブルを参照して、文字認識結果１４０ａの１つの文字位置に対する候補文字と信頼度Ｒｒとの１つの組から、類似文字と確信度Ｃｒとの組が求められる。類似文字と確信度Ｃｒとの組は、複数得られてもよい。ステップＳ５０１の処理は、各文字位置について行なわれる。確信度テーブルは、図９を参照して後述される。
【００８７】
ステップＳ５０２：類似文字の確信度Ｃｒが、所定の基準値以上であるか否かが判定される。所定の基準値とは、例えば、０．０５である。ステップＳ５０２における判定結果が「Ｙｅｓ」である場合には、処理はステップＳ５０３に進む。ステップＳ５０２における判定結果が「Ｎｏ」である場合には、処理はステップＳ５０４に進む。
【００８８】
ステップＳ５０３：類似文字と、ステップＳ５０１で求められた確信度Ｃｒとが類似文字−確信度テーブルに登録される。
【００８９】
図８Ｂは、類似文字−確信度テーブル１８０１の例を示す。類似文字と確信度Ｃｒとは、文字位置１０４２（図７）ごとに、類似文字−確信度テーブル１８０１に登録される。
【００９０】
図８Ａを再び参照して、インデックステーブル作成処理の説明を続ける。
【００９１】
ステップＳ５０４：すべての文字位置の候補文字について、ステップＳ５０２〜ステップＳ５０３の処理が行なわれたか否かが判定される。ステップＳ５０４における判定結果が「Ｙｅｓ」である場合には、処理はステップＳ５０５に進む。ステップＳ５０４における判定結果が「Ｎｏ」である場合には、他の類似文字について、ステップＳ５０２からの処理が行なわれる。
【００９２】
ステップＳ５０５：類似文字−確信度テーブル１８０１（図８Ｂ）の隣接した文字位置に登録された類似文字からインデックステーブルが作成される。インデックス文字列は、類似文字−確信度テーブル１８０１（図８Ｂ）の隣接した文字位置に登録された類似文字を組み合わせることによって生成される。この処理は、図５Ａに示されるステップＳ４０５において、候補文字−確信度テーブル１５０１（図５Ｂ）からインデックステーブル１９０（図６）を作成した処理と同様である。
【００９３】
生成されるインデックステーブルは、図６に示されるインデックステーブル１９０と同様である。例えば、インデックステーブル１９０の行１６０２において、欄１６１１は、インデックス文字列「イシ」に含まれる先頭の類似文字「イ」の文字位置を示す。欄１６１２はインデックス文字列「イシ」に含まれる類似文字のそれぞれについて定義される確信度Ｃｒの組（０．９，０．８）を示す。
【００９４】
図９は、確信度テーブル１５０の一例を示す。図９には、確信度テーブル１５０のうち、候補文字「シ」に関する部分のみを示す。
【００９５】
確信度テーブル１５０は、例えば、文字認識結果として１つの候補文字「シ」および信頼度Ｒｒ「０．９」が得られた場合に、類似文字「ン」および確信度Ｃｒ「０．２」と、類似文字「シ」および確信度Ｃｒ「０．８」とが得られることを示す。類似文字「ン」および類似文字「シ」は、候補文字「シ」と文字の形状が類似しているか、同一である文字である。
【００９６】
候補文字「シ」についての類似文字が「ン」および「シ」であることは、文字認識結果として１つの候補文字「シ」が得られた場合に、オリジナルの文書中には類似文字「ン」または類似文字「シ」が書かれている可能性が高いことを示す。
【００９７】
確信度テーブル１５０は、予め、多種多数の文字が書かれたオリジナル文書に対して文字認識を行い、それによって得られる文字認識結果および信頼度Ｒｒと、オリジナル文書に実際に存在する文字とを比較することによって作成され得る。例えば、確信度テーブル１５０の部分１８１１に示される確信度Ｃｒの「０．２」は、様々なフォントや様々な印字品質で書かれた文字「ン」に対して文字認識を行った場合に、候補文字「シ」および信頼度Ｒｒ０．９が得られる確率から求められ得る。
【００９８】
確信度テーブル１５０は、全ての文字の組み合わせに対して用意される。ただし、確信度Ｃｒが所定の基準よりも小さくなるような類似文字については、確信度テーブル１５０に登録する必要はない。従って、１つの候補文字について得られる類似文字の個数を限定することができる。
【００９９】
文字認識によって得られる信頼度Ｒｒが図９に示される確信度テーブル１５０に定義される信頼度Ｒｒと一致しない場合（例えば、文字認識によって得られる信頼度Ｒｒが０．８である場合）には、適切な方法により類似文字の確信度Ｃｒが計算される。例えば、文字認識によって得られる信頼度Ｒｒが０．５よりも小さい場合には、確信度テーブル１５０中の信頼度Ｒｒ「０．５」の行が参照される。また、文字認識によって得られる信頼度Ｒｒが０．９よりも大きい場合には、確信度テーブル１５０中の信頼度Ｒｒ「０．９」の行が参照される。文字認識によって得られる信頼度Ｒｒが確信度テーブル１５０に定義される２つの信頼度Ｒｒの間の値である場合には、確信度テーブル１５０に定義される２つの信頼度Ｒｒのうち、文字認識によって得られる信頼度Ｒｒに近い値の行が参照される。
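The row selection described above (use the lowest row when Rr is below it, the highest row when Rr is above it, and otherwise the row whose Rr is closest) reduces to a nearest-value lookup. The sketch below assumes a hypothetical table layout; only the Rr 0.9 row for candidate 「シ」 is quoted in the text, and the Rr 0.5 row values are invented for the example. The description does not say how an exact tie is resolved, so ties are left to whatever `min` returns.

```python
# {candidate: {defined Rr: {similar character: certainty factor Cr}}}
CERTAINTY_TABLE = {
    "シ": {
        0.9: {"シ": 0.8, "ン": 0.2},  # values quoted in the description
        0.5: {"シ": 0.4, "ン": 0.3},  # hypothetical values
    }
}


def similar_characters(candidate, rr, table=CERTAINTY_TABLE):
    """Return the {similar character: Cr} row whose defined Rr is nearest to `rr`.

    Picking the nearest row also covers values outside the defined range:
    below the smallest Rr the smallest row wins, above the largest Rr the
    largest row wins.
    """
    rows = table[candidate]
    nearest_rr = min(rows, key=lambda defined_rr: abs(defined_rr - rr))
    return rows[nearest_rr]


print(similar_characters("シ", 0.8))   # between the rows, closer to 0.9
print(similar_characters("シ", 0.3))   # below the lowest row
print(similar_characters("シ", 0.95))  # above the highest row
```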
【０１００】
なお、確信度テーブル１５０の構造は、図９に示される構造に限定されない。確信度テーブル１５０は、候補文字と信頼度Ｒｒとの組から、類似文字と確信度Ｃｒの組とが少なくとも１つ得られる限り、任意の構造を有し得る。例えば、確信度Ｃｒの信頼度Ｒｒに対する分布を一様分布であると仮定して、その分布範囲を確信度の上限および下限、信頼度Ｒｒの上限および下限により表し、これらの上限値および下限値が確信度テーブル１５０に定義されてもよい。あるいは、確信度Ｃｒの信頼度Ｒｒに対する分布をガウス分布であると仮定して、その分布の平均値と分散値とが確信度テーブル１５０に定義されてもよい。
【０１０１】
このような確信度テーブル１５０を用いて図８Ａに示されるインデックステーブル作成処理を実行することにより、１つの文字位置に対する候補文字が１つである文字認識結果１４０ａ（図７）からでも、同一の文字位置に対応する複数のインデックス文字列を含むインデックステーブル１９０（図６）を作成することができる。
【０１０２】
確信度テーブル１５０は、検索装置１がインデックステーブル作成処理を図５Ａに示される手順に従って実行する場合には、省略され得る。
【０１０３】
このように、インデックステーブル１９０はまた、図７に示される文字認識結果１４０ａ中の１つの文字位置に対する１つの候補文字に予め対応付けられた複数の類似文字を用いて作成され得る。その結果、インデックステーブル１９０は、同一の文字位置に対応する複数のインデックス文字列を含む。従って、複数のインデックス文字列が、文字認識結果の１つの部分に一致し得る。これによって、文字認識処理において誤認識が生じた場合にも検索漏れを減らすことが可能になる。
【０１０４】
但し、インデックス文字列に含まれる各文字が、文字認識結果の連続した部分（図７に示される隣接した欄）の１つに含まれる１つの候補文字と同一であるとは限らない。例えば、インデックステーブル１９０の行１６０３に示されるインデックス文字列「イン」に含まれる文字「ン」は、文字認識結果１４０ａ（図７）の欄１０４６ａに含まれる１つの候補文字「シ」と同一ではない。しかし、インデックス文字列「イン」に含まれる文字「ン」は、その候補文字「シ」に予め確信度テーブル１５０（図９）により対応付けられた類似文字「ン」と同一である。
【０１０５】
このように、インデックス文字列と、文字認識結果の部分とが一致するとは、インデックス文字列に含まれる各文字が、文字認識結果の連続した部分（図４に示される隣接した欄）の１つに含まれる１つの候補文字に予め対応付けられた少なくとも１つの文字の１つと同一である（文字コードが等しい）という概念を含む。
【０１０６】
次に、インデックステーブル１９０（図６）を用いて文字認識結果からキーワードを検索する処理（文書検索処理）を説明する。
【０１０７】
図１０は、文書検索処理の手順を示す。以下、文書検索処理の各ステップを詳しく説明する。
【０１０８】
ステップＳ３０１：キーワードが入力される。以下、キーワードが「インデックス」という文字列である場合を例として説明する。
【０１０９】
ステップＳ３０２：キーワードから、連続する２文字の組（長さが２の文字列）が抽出される。この例では、２文字の組「イン」、「ンデ」、「デッ」、「ック」、「クス」が抽出される。なお、抽出される文字列の長さは、インデックステーブルに定義されるインデックス文字列の長さと等しくなるように設定される。従って、インデックス文字列の長さがｎ（ｎは自然数）である場合には、キーワードからｎ文字の組（長さがｎの文字列）が抽出される。以下の説明では、ｎ＝２であるものとする。
【０１１０】
抽出された複数の２文字の組は、互いのその一部がオーバーラップしている。しかし、オーバーラップしないようにキーワードから２文字の組を抽出してもよい。例えば、キーワード「インデックス」から２文字の組「イン」、「デッ」、「クス」が抽出されてもよい。ただし、キーワードに含まれるそれぞれの文字は、抽出された２文字の組の少なくとも１つに含まれるように、キーワードから２文字の組が抽出される。
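The two ways of extracting n-character groups from a keyword in step S302 (overlapping groups, or fewer groups that still cover every character) can be illustrated as follows. Both function names are hypothetical. For keywords whose length is not a multiple of n, the description does not say how coverage is completed; shifting the last group back so that it ends on the final character is an assumption of this sketch.

```python
def overlapping_ngrams(keyword, n=2):
    """All n-character groups starting at consecutive positions."""
    return [keyword[i:i + n] for i in range(len(keyword) - n + 1)]


def covering_ngrams(keyword, n=2):
    """Non-overlapping groups where possible, still covering every character."""
    if len(keyword) < n:
        return []
    starts = list(range(0, len(keyword) - n + 1, n))
    if starts[-1] + n < len(keyword):
        # Assumption: let the last group overlap so the tail is covered.
        starts.append(len(keyword) - n)
    return [keyword[s:s + n] for s in starts]


print(overlapping_ngrams("インデックス"))
print(covering_ngrams("インデックス"))
print(covering_ngrams("デンワ"))
```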
【０１１１】
ステップＳ３０３：インデックステーブル１９０（図６）を参照し、２文字の組に対応する文字位置と確信度Ｃｒとが抽出される。この例では、
２文字の組「イン」に対応する文字位置「１」、確信度Ｃｒの組（０．９，０．７）（行１６０３）、
２文字の組「ンデ」に対応する文字位置「２」、確信度Ｃｒの組（０．７，０．８）（行１６０５）、
２文字の組「デッ」に対応する文字位置「３」、確信度Ｃｒの組（０．８，０．３）（行１６０６）、
２文字の組「ック」に対応する文字位置「４」、確信度Ｃｒの組（０．３，０．１）（行１６０７）、および、
２文字の組「クス」に対応する文字位置「５」、確信度Ｃｒの組（０．１，０．９）（行１６０８）が得られる。
【０１１２】
なお、図６に示されるインデックステーブル１９０から各文字列に対応する文字位置および確信度Ｃｒを効率的に取り出すために、インデックス文字列に含まれる文字の文字コードと、インデックス文字列を含む行が格納されているアドレス（例えば、ＨＤＤ１７０上のアドレス）との対応表を用いてもよい。また、このようなアドレスは、２分木探索法を用いて求められてもよい。
【０１１３】
ステップＳ３０４：すべての２文字の組について、ステップＳ３０３の処理が行なわれたか否かが判定される。ステップＳ３０４における判定結果が「Ｎｏ」である場合には、他の２文字の組についてステップＳ３０３の処理が行なわれる。ステップＳ３０４における判定結果が「Ｙｅｓ」である場合には、処理はステップＳ３０５に進む。
【０１１４】
ステップＳ３０５：すべての２文字の組が所定の順序で並んでいるか否かが判定される。この判定は、ステップＳ３０３でそれぞれの２文字の組について得られた文字位置に基づいて行なわれる。具体的には、キーワードのｋ文字目（ｋは自然数）を先頭とする２文字の組について得られた文字位置ｍ（ｍは自然数）が、すべての２文字の組について、「ｍ−ｋ＝一定」という関係を満たすならば、すべての２文字の組が所定の順序で並んでいると判定される。
【０１１５】
すべての２文字の組が所定の順序で並んでいることは、キーワードが文字認識結果中の特定の部分に一致することを示す。その特定の部分とは、キーワードに含まれる各文字が一致する文字認識結果中の部分を包含する部分である。
【０１１６】
この例では、キーワード「インデックス」が、文字認識結果１４０（図４）の部分１０４７または文字認識結果１４０ａ（図７）の部分１０４７ａに一致する。
【０１１７】
このような部分１０４７または部分１０４７ａの位置は、その部分の先頭の欄の文字位置「１」として特定される。
【０１１８】
この例では、キーワード「インデックス」から抽出されたすべての２文字の組は、上述した関係を満たすために、「所定の順序で並んでいる」と判定される。
【０１１９】
ステップＳ３０５における判定が「Ｙｅｓ」である場合には、処理はステップＳ３０６に進む。ステップＳ３０５における判定が「Ｎｏ」である場合には、処理はステップＳ３０８に進む。
【０１２０】
このように、ステップＳ３０２〜ステップＳ３０５において、ＣＰＵ１１０（図１）は、インデックステーブル１９０（図６）に含まれるインデックス文字列とインデックス文字列に一致する文字認識結果中の部分の位置とに基づいて、キーワードがその文字認識結果中の部分に一致するか否かを判定し、もし一致する場合には、キーワードに一致するその文字認識結果中の部分の位置を特定する位置特定部として機能する。
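The ordering test of step S305 ("m − k is constant" for every 2-character group, where k is the group's 1-based position in the keyword and m is its character position in the recognition result) can be expressed as a set intersection over the offsets m − k. The sketch below is one possible reading, not the patent's implementation; the index literal lists only rows quoted in the text, and the function name is hypothetical.

```python
# {index string: [(character position, (Cr, Cr)), ...]}, rows quoted in the text
INDEX = {
    "イシ": [(1, (0.9, 0.8))],
    "イン": [(1, (0.9, 0.7))],
    "ンデ": [(2, (0.7, 0.8))],
    "デッ": [(3, (0.8, 0.3))],
    "ック": [(4, (0.3, 0.1))],
    "クス": [(5, (0.1, 0.9))],
    "ンワ": [(2, (0.7, 0.2)), (4, (0.1, 0.2))],
}


def locate(keyword, index, n=2):
    """Return the start positions at which all n-grams of `keyword` line up."""
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        return []
    offsets = None
    for k, gram in enumerate(grams, start=1):
        found = {m - k for m, _ in index.get(gram, [])}  # candidate values of m - k
        offsets = found if offsets is None else offsets & found
    # The first gram has k = 1, so the keyword starts at offset + 1.
    return sorted(offset + 1 for offset in offsets)


print(locate("インデックス", INDEX))
print(locate("ンワ", INDEX))
```

Intersecting the offsets handles index strings registered at several positions without enumerating every combination of positions.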
【０１２１】
ステップＳ３０６：キーワード確信度Ｋｃが算出される。キーワード確信度Ｋｃは、例えば、キーワード「インデックス」から抽出された２文字の組「イン」、「ンデ」、「デッ」、「ック」、「クス」のそれぞれに対応する確信度Ｃｒの組の左側の値と、キーワード「インデックス」を構成する最後の２文字の組「クス」に対応する確信度Ｃｒの組の右側の値との相加平均として求められる。これは、キーワードに含まれる各文字について定義された確信度の相加平均を求めることと等価である。この例では、キーワード確信度Ｋｃ＝（０．９＋０．７＋０．８＋０．３＋０．１＋０．９）／６＝０．６１となる。
【０１２２】
なお、キーワード確信度Ｋｃは、相乗平均、メディアン値、または最頻値によって算出されてもよい。キーワード確信度Ｋｃは、２文字の組のそれぞれに対応する確信度Ｃｒの組のうち、小さくない方の値だけを用いて算出されてもよい。確信度Ｃｒが予め定められた基準値未満の場合には、その確信度Ｃｒをキーワード確信度Ｋｃの算出に用いないようにしてもよい。
【０１２３】
このように、キーワード確信度Ｋｃは、キーワードに含まれる各文字について定義された確信度Ｃｒに基づいて算出される。
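As a worked example of step S306, the arithmetic-mean form of the keyword certainty factor Kc can be computed as below. The function name and input layout are assumptions; the certainty pairs are the ones quoted for the keyword 「インデックス」.

```python
from statistics import mean


def keyword_certainty(pairs):
    """Arithmetic mean of the per-character certainty factors.

    `pairs` holds the (left, right) Cr pair of each overlapping 2-character
    group of the keyword, in keyword order. Each character is counted once:
    the left value of every pair plus the right value of the last pair.
    """
    if not pairs:
        raise ValueError("at least one 2-character group is required")
    per_character = [left for left, _ in pairs] + [pairs[-1][1]]
    return mean(per_character)


index_pairs = [(0.9, 0.7), (0.7, 0.8), (0.8, 0.3), (0.3, 0.1), (0.1, 0.9)]
print(keyword_certainty(index_pairs))
```

For this keyword the sum is 3.7 over six characters, about 0.617, which the description quotes as 0.61. Replacing `mean` with `statistics.geometric_mean` or `statistics.median` would give the alternative definitions mentioned in the text.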
【０１２４】
算出されたキーワード確信度Ｋｃは、文字認識結果中の文字位置に対応するオリジナル文書中の位置に、そのキーワードが存在する確率を示す。
【０１２５】
例えば、ステップＳ３０５で、キーワード「インデックス」が、文字認識結果１４０（図４）の部分１０４７または文字認識結果１４０ａ（図７）の部分１０４７ａに一致すると判定され、このような部分１０４７または部分１０４７ａの位置が文字位置「１」と特定された場合、文字位置「１」に対応するオリジナル文書中の位置（すなわち、オリジナル文書の先頭）にキーワード「インデックス」が存在する確率は、０．６１である。
【０１２６】
このように、ステップＳ３０６において、ＣＰＵ１１０は、インデックステーブル１９０（図６）に含まれる確信度Ｃｒに基づいて、キーワードに一致する文字認識結果中の部分の位置に対応するオリジナル文書中の位置にキーワードが存在する確率を示すキーワード確信度Ｋｃを算出する算出部として機能する。
【０１２７】
ステップＳ３０７：キーワード確信度Ｋｃが基準値（所定の値）以上であるか否かが判定される。基準値は、例えば、０．５であり得る。この基準値は、固定値であってもよいし、キーワードに応じて設定されてもよい。例えば、キーワードの文字数に応じて基準値を変更してもよい。
【０１２８】
この基準値を高くすると、検索ノイズを減らすことができるが、高くしすぎると検索漏れが起こりやすくなる。一般に、キーワードの文字数が多い場合には、基準値を低く設定することにより、誤認識が多い場合にも検索漏れを減らすことが好ましい。キーワードの文字数（キーワードに含まれる文字の数）が多い場合には、基準値を低く設定することによっても検索ノイズはあまり増加しないからである。
【０１２９】
キーワードに含まれる文字の種類（字種）に応じて基準値を変更してもよい。例えば、キーワードの各文字がカタカナである場合、漢字である場合、漢字とカタカナとの混合である場合、ひらがなである場合のそれぞれについて、基準値を最適に設定することにより、より効果的な検索を行うことができる。
【０１３０】
あるいは、この基準値は、ユーザによって指定されてもよい。ユーザは、検索漏れを防ぐか、検索ノイズを減少させるかという目的に応じて、適切な基準値を設定し得る。
【０１３１】
ステップＳ３０７における判定が「Ｙｅｓ」である場合には、処理はステップＳ３０９に進む。ステップＳ３０７における判定が「Ｎｏ」である場合には、処理はステップＳ３０８に進む。
【０１３２】
このように、ステップＳ３０５においてキーワードが文字認識結果１４０または文字認識結果１４０ａの部分に一致するという検索結果が得られた後に、ステップＳ３０７が実行され、実際のオリジナル文書にキーワードがあるか否か（検索結果が正当であるか否か）がキーワード確信度Ｋｃに基づいて判定される。ステップＳ３０７において、ＣＰＵ１１０は、キーワード確信度Ｋｃに基づいて検索結果の正当性を判定する判定部として機能する。
【０１３３】
ステップＳ３０８：キーワードがオリジナル文書中に存在しないと判断される。端末１００（図１）のディスプレイには、例えば、「キーワードが見つかりませんでした」というメッセージが表示される。
【０１３４】
ステップＳ３０９：キーワードが検出されたと判断される。検索結果として、検出箇所を示す文字位置と、キーワード確信度Ｋｃとが得られる。検出箇所が複数である場合には、文字位置とキーワード確信度Ｋｃとの組が複数得られる。
【０１３５】
上述した、キーワードが「インデックス」である例では、検索結果として、文字位置「１」と、キーワード確信度Ｋｃ「０．６１」とが得られる。
【０１３６】
検索結果は、例えば、端末１００に表示される。検索装置１は、例えば、ＨＤＤ１７０に格納された文書画像データ１３０（および／または、文字認識結果１４０、１４０ａ）を端末１００のディスプレイに表示し、そのディスプレイに表示された文書画像データ１３０（および／または、文字認識結果１４０、１４０ａ）の領域のうちキーワードに対応する領域を強調表示する。強調表示は、例えば、表示される文字の属性（例えば、文字の色や濃度、文字背景の色や濃度、文字の大きさ、文字の太さ、フォント等）を変更することによってなされる。このような属性は、キーワード確信度Ｋｃに応じて変化させてもよい。例えば、キーワード確信度Ｋｃが０．５〜１．０の間を０．１の刻み幅で区分し、各区分に異なる属性を設定して強調表示を行ってもよい。この場合には、ユーザがキーワード確信度Ｋｃの大小を視覚的に把握することができるので、ユーザが検索結果の正当性のさらなる判定を視覚的に、容易に行うことができるという利点が得られる。
【０１３７】
あるいは、キーワード確信度Ｋｃが高い検出箇所から順に、キーワードに対応する領域を表示してもよい。
【０１３８】
このようにして、ユーザが検索結果の正当性のさらなる判定を行う場合には、ステップＳ３０７において用いられる基準値を低く設定してもよい。
【０１３９】
あるいは、ステップＳ３０７が省略されてもよい。この場合、検索結果の正当性の判定はすべてユーザにより行なわれる。ユーザは、キーワード確信度Ｋｃに基づいて、検索結果の正当性の判定を容易に行うことが可能である。
【０１４０】
以下、図１０に示される文書検索処理により、検索ノイズが抑制される例を説明する。
【０１４１】
キーワード「ワックス」を指定して、図６に示されるインデックステーブル１９０を用いて図１０に示される文書検索処理を行った場合、ステップＳ３０５における判定は「Ｙｅｓ」となり、文字位置「３」が特定される。
【０１４２】
ステップＳ３０６において、キーワード確信度Ｋｃ＝（０．２＋０．３＋０．１＋０．９）／４＝０．３８と算出される。
【０１４３】
キーワード確信度Ｋｃが基準値０．５よりも小さいので、キーワードが存在しないと判断される。
【０１４４】
キーワード「デンワ」を指定した場合、ステップＳ３０５における判定は「Ｙｅｓ」となり、文字位置「３」が特定される。
【０１４５】
ステップＳ３０６において、キーワード確信度Ｋｃ＝（０．８＋０．１＋０．２）／３＝０．３７と算出される。
【０１４６】
キーワード確信度Ｋｃが基準値０．５よりも小さいので、キーワードが存在しないと判断される。
【０１４７】
同様に、キーワード「フック」を指定した場合、ステップＳ３０５における判定は「Ｙｅｓ」となり、文字位置「３」が特定される。
【０１４８】
ステップＳ３０６において、キーワード確信度Ｋｃ＝（０．２＋０．３＋０．１）／３＝０．２と算出される。
【０１４９】
キーワード確信度Ｋｃが基準値０．５よりも小さいので、キーワードが存在しないと判断される。
【０１５０】
このように、本発明の検索装置１によれば、オリジナル文書中にキーワードが存在しないにもかかわらず、キーワードが検出されることを抑制する、すなわち、検索ノイズを抑制することが可能になる。
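The noise-suppression examples above can be reproduced end to end with a short sketch that combines steps S302 to S307. It is an illustration under assumptions, not the patent's implementation: the candidate–certainty table is a partial reconstruction from the certainty values quoted in the text, the index here is keyed by position for brevity, and all function names are hypothetical.

```python
from itertools import product

# {character position: {candidate character: certainty factor Cr}}
TABLE = {
    1: {"イ": 0.9},
    2: {"シ": 0.8, "ン": 0.7},
    3: {"デ": 0.8, "ワ": 0.2, "フ": 0.2},
    4: {"ッ": 0.3, "ン": 0.1},
    5: {"ワ": 0.2, "ク": 0.1},
    6: {"ス": 0.9},
}


def build_index_table(table):
    """Return {index string: {position: (Cr, Cr)}} for adjacent candidates."""
    index = {}
    for pos in sorted(table):
        following = table.get(pos + 1, {})
        for (a, cr_a), (b, cr_b) in product(table[pos].items(), following.items()):
            index.setdefault(a + b, {})[pos] = (cr_a, cr_b)
    return index


def search(keyword, index, threshold=0.5):
    """Return [(start position, Kc)] for matches whose Kc reaches `threshold`."""
    grams = [keyword[i:i + 2] for i in range(len(keyword) - 1)]  # S302
    if not grams:
        return []
    hits = []
    for start in sorted(index.get(grams[0], {})):
        # S303/S305: every group must sit at start + k
        pairs = [index.get(gram, {}).get(start + k) for k, gram in enumerate(grams)]
        if None in pairs:
            continue
        # S306: arithmetic mean over the characters of the keyword
        kc = (sum(left for left, _ in pairs) + pairs[-1][1]) / len(keyword)
        if kc >= threshold:  # S307
            hits.append((start, kc))
    return hits


index = build_index_table(TABLE)
for word in ("インデックス", "ワックス", "デンワ", "フック"):
    print(word, search(word, index))
```

With the reference value 0.5, only 「インデックス」 is reported; lowering `threshold` to 0 shows the three rejected matches at character position 3 together with their low keyword certainty factors.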
【０１５１】
本発明の文書検索処理は、コンピュータ上のソフトウェアによって実現されることに限定されない。本発明の文書検索処理をハードウェアによって実現してもよいし、ソフトウェアとハードウェアの組み合わせによって実現してもよい。
【０１５２】
なお、上述した実施の形態では、日本語の文書を例に取り説明した。しかし、本発明の適用は、日本語の文書に限定されない。他の任意の文書（例えば、中国語の文書、英語の文書、韓国語の文書）に本発明を適用することも可能である。
【０１５３】
【発明の効果】
本発明によれば、インデックステーブルに含まれる確信度に基づいて、キーワードに一致する文字認識結果中の部分の位置に対応するオリジナル文書中の位置にキーワードが存在する確率を示すキーワード確信度が算出される。従って、キーワード確信度に基づいて、検索結果の正当性を判定することが容易になる。
【０１５４】
本発明の検索装置は、インデックステーブルを用いるので、高速な検索を行うことが可能である。
【図面の簡単な説明】
【図１】本発明の検索装置１の構成を示すブロック図
【図２】オリジナル文書中の文字を認識することによって得られる文字認識結果からキーワードを検索するために、検索装置１によって実行される処理の流れを示す図
【図３】オリジナル文書１３１０の一例を示す図
【図４】オリジナル文書１３１０に対して文字認識処理を行うことにより得られる文字認識結果１４０の一例を示す図
【図５Ａ】インデックステーブル作成処理の手順を示すフローチャート
【図５Ｂ】候補文字−確信度テーブル１５０１の例を示す図
【図６】インデックステーブル作成処理によって作成されたインデックステーブル１９０の一例を示す図
【図７】１つの文字位置に対する候補文字が１つである文字認識結果１４０ａの一例を示す図
【図８Ａ】図７に示される文字認識結果１４０ａからインデックステーブルを作成する処理の手順を示すフローチャート
【図８Ｂ】類似文字−確信度テーブル１８０１の例を示す図
【図９】確信度テーブル１５０の一例を示す図
【図１０】文書検索処理の手順を示すフローチャート
【図１１】従来技術によって、オリジナル文書中に存在する可能性のある複数の文字列をインデックス文字列として登録したインデックステーブル１９０１の一例を示す図
【符号の説明】
１検索装置
１００端末
１１０ＣＰＵ
１２０画像入力機器
１３０文書画像データ
１４０文字認識結果
１７０ＨＤＤ
１８０ワークメモリ
１９０インデックステーブル

[0001]
[Technical field of the invention]
The present invention relates to a search device, a recording medium, and a program that use an index table to search for a keyword in a character recognition result obtained by recognizing the characters in an original document.
[0002]
[Prior art]
In recent years, with the spread of the Internet, search technology that extracts necessary information from a large amount of information existing on a network is regarded as important. In particular, many systems for searching for specific keywords from text data have already been provided. In such a search, it is required to perform an accurate and high-speed search from a large amount of text documents.
[0003]
In order to perform a high-speed search, a technique for searching for a specific keyword from text data using an index table is known. The index table defines an index character string including a predetermined number of characters (for example, two characters) and a position of a portion in text data that matches the character string.
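A conventional index table of this kind, built over plain text with two-character index strings, could be sketched in Python as follows. The function name and the use of 1-based positions are assumptions made for the example.

```python
from collections import defaultdict


def build_index(text, n=2):
    """Map each n-character substring of `text` to its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i + 1)
    return dict(index)


index = build_index("インデックスを用いた検索方法")
print(index["イン"], index["クス"])
```

Looking up a two-character string is then a single dictionary access instead of a scan over the whole text.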
[0004]
When a keyword is searched for in a set of character codes (a character recognition result) obtained by recognizing the characters in an original document (a document in paper form), character recognition errors (misrecognition) must be taken into account, because when recognition fails the character represented by a character code can differ from the character written in the original document. Misrecognition means that a character written in the original document is not converted into the correct character code. It is caused by, for example, faint, skewed, or smudged print on the page.
[0005]
For example, suppose the character string “イヌ” (inu, “dog”) exists at some position in the original document and the character “ヌ” in it is misrecognized as “ス”. The character string “イス” (isu, “chair”) then appears at the corresponding position in the character recognition result, so the index table created from that result registers the index character string “イス” together with its position. A search for the keyword “イヌ” using this index table therefore fails to detect the keyword at that position. This is the problem of “search omission”: a keyword exists at a certain position in the original document but cannot be detected there.
[0006]
As a conventional way of dealing with search omission, a technique is known in which a plurality of candidate characters are prepared as the character recognition result for one character in the original document and, based on those candidate characters, a plurality of character strings that may exist in the original document are registered as index character strings in an index table. Keywords are then searched for using this index table. Such a technique is disclosed in, for example, Japanese Laid-Open Patent Publication No. 9-16619, “Information Processing Method and Apparatus”.
[0007]
FIG. 11 shows an example of an index table 1901 in which, according to the prior art, a plurality of character strings that may exist in an original document are registered as index character strings. In the example shown in FIG. 11, the index table 1901 is obtained by character recognition of an original document containing the character string “インデックスを用いた...” (“using an index ...”). In the index table 1901, the index character strings “イシ” and “イン” are both registered as existing at character position “1”, the same position in the character recognition result (rows 1911 and 1912).
[0008]
By using the index table 1901 shown in FIG. 11, the keyword “インデックス” (“index”) can be detected. The process of searching for this keyword with the index table 1901 according to the prior art is described below.
[0009]
First, character strings each consisting of two adjacent characters of the keyword are generated. From the keyword “インデックス”, the five character strings “イン”, “ンデ”, “デッ”, “ック”, and “クス” are generated.
[0010]
Next, these character strings are looked up in the index table 1901. The table shows that the character strings “イン”, “ンデ”, “デッ”, “ック”, and “クス” exist at character positions “1”, “2”, “3”, “4”, and “5” of the character recognition result, respectively (rows 1912, 1919, 1915, 1914, and 1913).
[0011]
From the positional relationship of these character positions, it is determined that the keyword “インデックス” exists in the character recognition result.
[0012]
In this way, according to the conventional technique using the index table in which a plurality of character strings that may exist in the original document are registered as index character strings, the problem of search omission can be avoided.
[0013]
[Problems to be solved by the invention]
The prior art described above has the problem that search noise increases. Search noise refers to a keyword being detected even though it does not exist in the original document. For example, when the keywords “デンワ” (denwa, “telephone”) and “フック” (fukku, “hook”) are searched for using the index table 1901 shown in FIG. 11, both are detected at character position “3”. To judge whether a search result is valid, the user must compare the search result with the original document.
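The search-noise effect can be demonstrated with a small sketch of the prior-art scheme, in which every combination of adjacent candidate characters is indexed without any certainty information. FIG. 11 is not reproduced in this text, so the candidate sets below are an assumption chosen to be consistent with the examples in the description; the function names are hypothetical.

```python
from itertools import product

# Hypothetical candidate characters per 1-based character position.
CANDIDATES = {1: "イ", 2: "シン", 3: "デワフ", 4: "ッン", 5: "ワク", 6: "ス"}


def build_candidate_index(candidates):
    """Register every combination of adjacent candidates as a 2-character index string."""
    index = {}
    for pos in sorted(candidates):
        for a, b in product(candidates[pos], candidates.get(pos + 1, "")):
            index.setdefault(a + b, set()).add(pos)
    return index


def find(keyword, index):
    """Return start positions at which all bigrams of `keyword` occur consecutively."""
    bigrams = [keyword[i:i + 2] for i in range(len(keyword) - 1)]
    if not bigrams:
        return []
    return sorted(
        start
        for start in index.get(bigrams[0], set())
        if all(start + k in index.get(bg, set()) for k, bg in enumerate(bigrams))
    )


index = build_candidate_index(CANDIDATES)
print(find("インデックス", index))  # genuine occurrence
print(find("デンワ", index))        # search noise
print(find("フック", index))        # search noise
```

Because the index records only that a combination is possible, the two noise keywords are reported just as firmly as the keyword that is really in the document.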
[0014]
The more candidate characters are obtained as the character recognition result for one character in order to prevent search omission, the more such search noise occurs, and the greater the burden on the user of judging whether each search result is valid.
[0015]
The present invention has been made in view of these problems, and its object is to provide a search device, a recording medium, and a program that perform a high-speed search and make it easy to determine the validity of a search result.
[0016]
[Means for Solving the Problems]
The search device of the present invention searches for a keyword consisting of a plurality of characters, using an index table, in a character recognition result of an original document, the character recognition result being obtained by recognizing each character in the original document and having at least one candidate character corresponding to each character. The index table includes: index character strings, each formed by combining the candidate characters corresponding to the respective characters of a character string existing in the original document; the position of each index character string in the original document; and, for each candidate character included in an index character string, a certainty factor defined in advance as the probability that the candidate character exists in the original document. The search device comprises: a position specifying unit that determines whether a character string of the keyword having the same number of characters as the index character strings exists in the index table and, if it does, specifies the position of that keyword character string in the original document on the basis of the position of the matching index character string; a calculation unit that calculates, on the basis of the certainty factors of the index character strings matching the keyword character strings, a keyword certainty factor defined as the probability that the keyword exists at the specified position; and a determination unit that determines the validity of the search result on the basis of the keyword certainty factor. The above object is thereby achieved.
[0018]
The determination unit may determine that the search result is valid when the keyword certainty factor is equal to or greater than a predetermined value.
[0019]
The predetermined value may be set according to at least one of the number of characters included in the keyword and the type of character included in the keyword.
[0020]
The search device may further include an index table creation unit that creates the index table. When a plurality of candidate characters are generated based on the character recognition result, the index table creation unit may generate the index character string by combining the candidate characters generated for each character in a character string of the original document.
[0021]
The search device may further include an index table creation unit that creates the index table. The index table creation unit may generate one candidate character for each character based on the character recognition result, generate similar characters that resemble each generated candidate character based on the certainty factor, and generate the index character string by combining, for each character in a character string of the original document, the corresponding candidate character with the similar characters, or the similar characters with one another.
[0022]
The recording medium of the present invention is a computer-readable recording medium on which is recorded a program for causing a computer to execute a search process that searches, using an index table, for a keyword consisting of a plurality of characters in a character recognition result of an original document, the character recognition result being obtained by recognizing each character in the original document and having at least one candidate character corresponding to each character. The index table defines: an index character string formed by combining the candidate characters corresponding to each of a plurality of characters constituting a character string existing in the original document; the position of the index character string in the original document; and, for each candidate character included in the index character string, a certainty factor defined in advance as the probability that the candidate character exists in the original document. The search process includes the steps of: determining whether a character string of the keyword having the same number of characters as the index character string exists in the index table and, if it exists, identifying the position of the keyword character string in the original document based on the position of the index character string that matches the keyword character string; calculating, based on the certainty factors of the index character string that matches the keyword character string, a keyword certainty factor defined as the probability that the keyword exists at the identified position; and determining the validity of a search result based on the keyword certainty factor. The above object is thereby achieved.
[0023]
The program of the present invention causes a computer to execute a search process that searches, using an index table, for a keyword consisting of a plurality of characters in a character recognition result of an original document, the character recognition result being obtained by recognizing each character in the original document and having at least one candidate character corresponding to each character. The index table defines: an index character string formed by combining the candidate characters corresponding to each of a plurality of characters constituting a character string existing in the original document; the position of the index character string in the original document; and, for each candidate character included in the index character string, a certainty factor defined in advance as the probability that the candidate character exists in the original document. The search process includes the steps of: determining whether a character string of the keyword having the same number of characters as the index character string exists in the index table and, if it exists, identifying the position of the keyword character string in the original document based on the position of the index character string that matches the keyword character string; calculating, based on the certainty factors of the index character string that matches the keyword character string, a keyword certainty factor defined as the probability that the keyword exists at the identified position; and determining the validity of a search result based on the keyword certainty factor. The above object is thereby achieved.
[0024]
DETAILED DESCRIPTION OF THE INVENTION
In this specification, a character is not limited to a character used in a specific language system, and refers to any symbol including numbers and symbols (for example, “)” and “◎”). Such a symbol is assigned a code (character code) for electronically expressing the symbol.
[0025]
FIG. 1 shows a configuration of a search device 1 of the present invention. The search device 1 searches for a keyword from a character recognition result obtained by recognizing a character in the original document.
[0026]
The search device 1 includes, as its constituent elements, a terminal 100, a CPU 110 that executes document registration processing and document search processing, an image input device 120 that inputs a document as image data, a work memory 180, and a hard disk drive (HDD) 170. These components are connected to each other via an internal bus 1101. Alternatively, these components may be connected to each other via any type of network.
[0027]
The terminal 100 is an input / output device including a keyboard and a CRT, for example. The terminal 100 is used, for example, for a user to specify a process executed by the search device 1 or to display a search process result executed by the search device 1 to the user.
[0028]
The HDD 170 stores a document registration program 1103, a document search program 1104, a character recognition pattern dictionary 160, a certainty factor table 150, and document data 1102. Any type of memory may be used as the HDD 170.
[0029]
All or part of the document registration program 1103 and the document search program 1104 may be provided to the search device 1 via any type of communication line (not shown) or broadcast, or may be provided to the search device 1 in a form recorded on any type of computer-readable recording medium. Such a recording medium is, for example, a DVD-ROM, a CD-ROM, a flexible disk, or the like. The document registration program 1103 and the document search program 1104 recorded on such a recording medium can be installed in the search device 1 by a reading device such as a disk drive.
[0030]
FIG. 2 shows a flow of processing executed by the search device 1 in order to search for a keyword from a character recognition result obtained by recognizing a character in the original document.
[0031]
When the user instructs the start of document registration processing from the terminal 100 (FIG. 1), the document registration program 1103 stored in the HDD 170 is loaded into the work memory 180. The CPU 110 can access the work memory 180 at high speed. When the CPU 110 executes the document registration program 1103, document registration processing is performed.
[0032]
The document registration process includes a character recognition process and an index table creation process. These processes are performed by the CPU 110 executing, respectively, a character recognition program (not shown) and an index table creation program (not shown), both of which are part of the document registration program 1103.
[0033]
In the character recognition process, the original document is read by the image input device 120, and image data (document image data) 130 of the original document is generated. The document image data 130 is stored in the HDD 170 (FIG. 1). Character recognition processing is performed based on the similarity between the shape represented by the partial area in the document image data 130 and the character shape registered in the character recognition pattern dictionary 160 (FIG. 1). A result of the character recognition process is stored in the HDD 170 as a character recognition result 140.
[0034]
Next, in the index table creation process, an index table 190 is created from the character recognition result 140. The index table 190 is stored in the HDD 170. In the index table creation process, the certainty factor table 150 (FIG. 1) can be referred to.
[0035]
Document image data 130, character recognition result 140, and index table 190 stored in HDD 170 constitute at least a part of document data 1102 (FIG. 1).
[0036]
When the user inputs a keyword from the terminal 100 and instructs the start of the document search process, the document search program 1104 stored in the HDD 170 is loaded into the work memory 180. When the CPU 110 executes the document search program 1104, document search processing is performed. In the document search process, a keyword is searched from the character recognition result using the index table 190.
[0037]
Once the index table 190 is generated by the document registration process, the keyword search is performed with reference to the index table 190. Even if the keyword to be searched changes, it is not necessary to create a new index table 190.
[0038]
Note that it is not essential that all of the processing shown in FIG. 2 be performed by the search device 1. For example, the document registration process may be performed by a device other than the search device 1, and only the document search process using the generated index table 190 may be performed by the search device 1.
[0039]
FIG. 3 shows an example of the original document 1310. The original document 1310 includes the character string “search method using an index. From document data”. The original document is, for example, a document in the form of paper on which a character string is printed. The original document may also be text written on a sign, a signboard, a bulletin board, or the like.
[0040]
FIG. 4 shows an example of a character recognition result 140 obtained by performing character recognition processing on the original document 1310. The character recognition result 140 includes a character position 1042 and candidate characters 1043. In FIG. 4, the number in parentheses attached to each candidate character indicates the reliability Rr for that candidate character. The character recognition result 140 shows that, for the characters “I”, “N”, “De”, “T”, “K”, and “S” included in the original document 1310 (FIG. 3), the candidate characters for which the highest reliability Rr was obtained are “i”, “shi”, “te”, “so”, “ta”, and “su”, respectively.
[0041]
The character recognition process can be performed according to an arbitrary algorithm. The character recognition process can be executed, for example, according to an algorithm in which the document image data 130 is cut out in character units, and the cut out image data (partial area) in character units is converted into character codes.
[0042]
When converting from a partial area to a character code, the shape represented by the partial area is compared with the shape of the character registered in the character recognition pattern dictionary 160 (FIG. 1). Characters determined to be similar in shape based on a predetermined determination criterion are obtained as candidate characters. In this way, the partial area is converted into the character code of the candidate character. A plurality of candidate characters corresponding to one partial region may be obtained.
[0043]
A candidate character is a character regarded as highly likely to match the character written in the portion of the original document corresponding to the partial area, because its shape is similar to the shape represented by the partial area.
[0044]
Each column (for example, the column 1044) in the character recognition result 140 corresponds to a partial area in the document image data 130, that is, to a portion of the original document 1310 (for example, the portion 1311 shown in FIG. 3). Further, since each partial area is cut out from the document image data 130 in units of one character, the partial area corresponds to one character of the original document 1310 (for example, the character “ku” written in the portion 1311 shown in FIG. 3).
[0045]
The candidate characters “T”, “U”, “W”, and “K” shown in the column 1044 are characters regarded as highly likely to match the character written in the corresponding portion of the original document 1310 (the portion 1311 shown in FIG. 3).
[0046]
A character position 1042 indicates the position of the candidate character in the character recognition result 140. For example, the character position “5” shown in the column 1044 indicates that the position of the column 1044 (part in the character recognition result) in the character recognition result 140 is the “fifth” position.
[0047]
Any method may be used to express the character position 1042 as long as the portion of the original document 1310 corresponding to the candidate character can be specified. As described above, each column in the character recognition result corresponds to a partial area of the document image data 130. Therefore, the character position 1042 may be represented by the position, within the character recognition result, of the column including the candidate character, or by the position, within the document image data 130, of the partial area corresponding to that column.
[0048]
For example, the character position 1042 may be represented by the document name of the original document, the page number, the line number, and the position (number of characters) from the beginning of the line, or it may be represented by coordinates or an address in the document image data.
[0049]
The reliability Rr indicates the likelihood of the character recognition result, that is, the probability that the recognition is correct. The reliability Rr takes a value from 0 to 1, and a greater value indicates a higher probability. For character recognition, a method such as a neural network, vector quantization, or template matching can be employed.
[0050]
When a neural network method is adopted for character recognition, the characters that correspond to at least one neuron whose output value is equal to or greater than a certain reference value, among the characters registered in the character recognition pattern dictionary 160, are obtained as candidate characters. The correspondence between the neuron output value and the correct-answer probability is obtained in advance, and the reliability Rr can be obtained from the neuron output value corresponding to each candidate character based on this correspondence.
[0051]
Vector quantization and template matching are both techniques that perform character recognition by obtaining the distance, in a feature space, between the shape represented by a partial area of the document image data 130 and a character shape registered in the character recognition pattern dictionary 160. One shape is represented as one representative point in the feature space. When these methods are employed, at least one character whose distance in the feature space is equal to or less than a reference value, among the characters registered in the character recognition pattern dictionary 160, is obtained as a candidate character. The correspondence between the distance in the feature space and the correct-answer probability is obtained in advance, and the reliability Rr can be obtained from the distance in the feature space corresponding to each candidate character based on this correspondence.
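The distance-based selection of candidate characters described above can be sketched as follows. This is a minimal illustration only: the two-dimensional feature space, the dictionary contents, the calibration pairs, and the names `PATTERN_DICTIONARY`, `CALIBRATION`, `reliability_from_distance`, and `recognize` are all assumptions made for the example and do not appear in the specification, which leaves the form of the correspondence open.

```python
import math

# Hypothetical pattern dictionary: character -> representative point in a
# 2-D feature space (a real dictionary would use many more dimensions).
PATTERN_DICTIONARY = {
    "A": (0.0, 0.0),
    "B": (3.0, 4.0),
    "C": (10.0, 0.0),
}

# Hypothetical correspondence, assumed to have been measured in advance:
# (distance, correct-answer probability) pairs in order of increasing distance.
CALIBRATION = [(0.0, 1.0), (5.0, 0.5), (10.0, 0.0)]


def reliability_from_distance(distance):
    """Map a feature-space distance to Rr by piecewise-linear interpolation."""
    if distance <= CALIBRATION[0][0]:
        return CALIBRATION[0][1]
    for (d0, p0), (d1, p1) in zip(CALIBRATION, CALIBRATION[1:]):
        if distance <= d1:
            return p0 + (p1 - p0) * (distance - d0) / (d1 - d0)
    return CALIBRATION[-1][1]


def recognize(feature, max_distance):
    """Return (character, Rr) for every dictionary entry within max_distance."""
    candidates = []
    for char, point in PATTERN_DICTIONARY.items():
        distance = math.dist(feature, point)
        if distance <= max_distance:
            candidates.append((char, reliability_from_distance(distance)))
    # Highest reliability first, as in a ranked list of candidate characters.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


candidates = recognize((0.0, 0.0), max_distance=6.0)
```

With these assumed values, the entry at distance 10 falls outside the reference distance and is not returned as a candidate character.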
[0052]
Regardless of which method is used for character recognition, the reliability Rr reflects the similarity between the shape represented by the partial area of the document image data 130 and the character shape registered in the character recognition pattern dictionary 160.
[0053]
Information other than shape similarity may be considered as the reliability Rr. For example, the deviation SR of the size of the partial area that is the target of character recognition in the document image data 130, the deviation LR of the relative position of the partial area in the line, and the like may be considered.
[0054]
The deviation SR of the size of the partial area can be defined, for example, as the deviation of the size of that partial area from the average size of all partial areas (each corresponding to one character) in the document image data 130. By obtaining in advance the correspondence between the deviation SR and the correct-answer probability of character recognition, the reliability Rr can be corrected so that it becomes small when the deviation SR is large.
[0055]
The deviation LR of the relative position of the partial area can be defined, for example, as follows: the average of the positions, in the direction perpendicular to the line, of all partial areas (each corresponding to one character) in the same line of the document image data is obtained, and the deviation LR is the deviation of the position of the partial area in that direction from this average. By obtaining in advance the correspondence between the deviation LR and the correct-answer probability of character recognition, the reliability Rr can be corrected so that it becomes small when the deviation LR is large.
[0056]
Thus, the reliability Rr can be set more appropriately by correcting the reliability Rr according to the deviation SR and / or the deviation LR.
[0057]
An index table 190 is created from the character recognition result 140 shown in FIG. 4 (index table creation processing).
[0058]
FIG. 5A shows a procedure of index table creation processing. Hereinafter, the procedure of the index table creation process will be described in detail.
[0059]
Step S401: It is determined whether or not the reliability Rr of the candidate character of interest in the character recognition result 140 is greater than or equal to a reference value. The reference value may be “0.05”, for example. If the determination result in step S401 is “Yes”, the process proceeds to step S402. If the determination result in step S401 is “No”, the process proceeds to step S404.
[0060]
When the character recognition result 140 (FIG. 4) is obtained by the character recognition process, only the candidate characters whose reliability Rr is equal to or higher than the reference value may be included in the character recognition result 140. In that case, the process in step S401 may be omitted.
[0061]
Step S402: The certainty factor Cr of the candidate character is calculated. The certainty factor Cr is calculated by, for example, (Equation 1) based on the reliability Rr for each candidate character.
[0062]
[Expression 1]
Certainty factor Cr = reliability Rr for the candidate character × character-specific coefficient Kr
The character-specific coefficient Kr is defined in advance for each character (for example, “I”). The character-specific coefficient Kr depends on the appearance probability of the character in ordinary documents, which differs from character to character. For example, in a general Japanese document, the character “ゐ” has a lower appearance probability than the character “ru”. Accordingly, the character-specific coefficient Kr is set low for characters having a low appearance probability and high for characters having a high appearance probability. The appearance probability of each character can be obtained statistically in advance from a large number of general documents.
[0063]
The certainty factor Cr for each candidate character is obtained by multiplying the reliability Rr for the candidate character by the character-specific coefficient Kr for the same character as the candidate character (character with the matching character code). The certainty factor Cr calculated in this way indicates the probability that the same character as the candidate character exists in a specific part in the original document. Such a specific portion is a portion in the original document corresponding to the column (the portion in the character recognition result) including the candidate character in the character recognition result 140 (FIG. 4).
[0064]
However, the certainty factor Cr is not necessarily a statistical probability itself. The certainty factor Cr may be a value obtained by normalizing a statistical probability according to a predetermined criterion. Any criterion may be used as long as the certainty factor Cr of a candidate character retains the property of indicating the probability that the same character as the candidate character exists in the specific portion of the original document. The certainty factor Cr may be expressed as an integer instead of a real number. Alternatively, the certainty factor Cr may be represented by a symbol that indicates its level stepwise (for example, ◯: high, Δ: medium, x: low).
[0065]
When the appearance probability for each character is unknown, the character-specific coefficient Kr may be constant for all characters. Further, a character-specific coefficient Kr may be set for each character type (kanji, katakana, hiragana).
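As an illustration of (Expression 1) and of the fallbacks just described, the following sketch computes Cr from Rr. The coefficient values, the romanized character names, and the lookup order (per-character coefficient first, then per-character-type coefficient, then a constant) are assumptions made for the example; the specification does not fix any of them.

```python
# Hypothetical coefficients; real values would come from appearance statistics.
KR_BY_CHARACTER = {"ru": 1.0, "wi": 0.3}
KR_BY_TYPE = {"kanji": 0.7, "katakana": 0.8, "hiragana": 0.9}
KR_CONSTANT = 1.0  # used when no appearance probability is known


def character_coefficient(char, char_type=None):
    """Return Kr for a character (assumed order: character, type, constant)."""
    if char in KR_BY_CHARACTER:
        return KR_BY_CHARACTER[char]
    if char_type in KR_BY_TYPE:
        return KR_BY_TYPE[char_type]
    return KR_CONSTANT


def certainty(char, rr, char_type=None):
    """Expression 1: Cr = Rr x Kr."""
    return rr * character_coefficient(char, char_type)


cr_rare = certainty("wi", 0.5)             # per-character coefficient applies
cr_typed = certainty("x", 0.5, "kanji")    # falls back to the type coefficient
cr_plain = certainty("x", 0.5)             # falls back to the constant
```

A rarely appearing character thus receives a lower Cr than a common character recognized with the same reliability Rr.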
[0066]
Step S403: The candidate character and the certainty factor Cr obtained in step S402 are registered in the candidate character-confidence table.
[0067]
FIG. 5B shows an example of the candidate character-confidence table 1501. The candidate characters and their certainty factors Cr are registered in the candidate character-confidence table 1501 for each character position 1042 (FIG. 4).
[0068]
Referring to FIG. 5A again, the description of the index table creation process will be continued.
[0069]
Step S404: It is determined whether or not the processing of steps S401 to S403 has been performed for all candidate characters at all character positions. If the determination result in step S404 is “Yes”, the process proceeds to step S405. If the determination result in step S404 is “No”, the processing from step S401 is performed on other candidate characters.
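Steps S401 to S404 amount to a filtering loop over every candidate character at every character position. A minimal sketch is given below, assuming the character recognition result is held as a dictionary from character position to a list of (candidate, Rr) pairs. That layout, the function name `build_candidate_table`, the `kr_for` callback, and the sample values are hypothetical.

```python
REFERENCE_RR = 0.05  # example reference value given for step S401


def build_candidate_table(recognition_result, kr_for=lambda char: 1.0):
    """Build {position: [(candidate, Cr), ...]} from {position: [(candidate, Rr), ...]}."""
    table = {}
    for position, candidates in recognition_result.items():
        for candidate, rr in candidates:
            if rr < REFERENCE_RR:
                continue                    # S401 "No": go on to the next candidate
            cr = rr * kr_for(candidate)     # S402: Cr = Rr x Kr
            table.setdefault(position, []).append((candidate, cr))  # S403
    return table                            # S404: all candidates processed


# Hypothetical recognition result with romanized stand-ins for the characters.
result = {
    1: [("i", 0.9)],
    2: [("shi", 0.8), ("n", 0.2), ("so", 0.01)],
}
candidate_table = build_candidate_table(result)
```

Here the candidate with Rr 0.01 is below the reference value and is therefore never registered in the table.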
[0070]
Step S405: An index table is created from candidate characters registered at adjacent character positions in the candidate character-confidence table 1501 (FIG. 5B). The index table defines an index character string, a character position, and a certainty factor Cr.
[0071]
The index character string is generated by combining candidate characters registered at adjacent character positions in the candidate character-confidence table 1501 (FIG. 5B). For example, by combining the candidate character “i” registered at the character position “1” of the candidate character-confidence table 1501 with the candidate character “shi” registered at the adjacent character position “2”, the index character string “Ishi” is generated.
[0072]
FIG. 6 shows an example of the index table created by the index table creation process. A column 1610 of the index table 190 shows an index character string. A column 1611 indicates the character position of the first candidate character included in the index character string. A column 1612 shows a set of certainty factors Cr defined for each of the candidate characters included in the index character string.
[0073]
The row 1602 of the index table 190 indicates that the index character string “Ishi” matches the portion at position “1” in the character recognition result 140, that the certainty factor Cr defined for the character “i” of the index character string “Ishi” is 0.9, and that the certainty factor Cr defined for the character “shi” of the index character string “Ishi” is 0.8.
[0074]
The set of certainty factors Cr defined for the candidate characters included in the index character string “Ishi” (row 1602) is obtained as the set of certainty factors Cr calculated in step S402 (FIG. 5A) for each candidate character. Alternatively, the set of certainty factors Cr may be a set of values obtained by multiplying the certainty factor Cr calculated in step S402 (FIG. 5A) for each candidate character by a coefficient defined for each index character string. This coefficient can be set low, for example, for an index character string that has a low probability of appearing in a general document, such as the character string “Wo” or the character string “ヰヰ” in a general Japanese document.
[0075]
Generating an index character string by combining candidate characters registered at adjacent character positions in the candidate character-confidence table 1501 (FIG. 5B) is equivalent to combining the candidate characters included in each of a plurality of adjacent (consecutive) columns (for example, the column 1045 and the column 1046) among the columns shown in the character recognition result 140 (FIG. 4).
[0076]
Thus, in step S401 to step S405 shown in FIG. 5A, the CPU 110 (FIG. 1) functions as an index table creation unit that creates the index table 190.
[0077]
The index table 190 is created by registering all combinations of candidate characters registered at adjacent character positions in the candidate character-confidence table 1501 shown in FIG. 5B as index character strings.
[0078]
However, when the same combination occurs more than once among all combinations of candidate characters registered at adjacent character positions in the candidate character-confidence table 1501, a plurality of character positions and sets of certainty factors Cr are registered in the index table 190 for one index character string. For example, the index character string “NW” is generated from the candidate characters “N” and “W” registered at the character positions “2” and “3” of the candidate character-confidence table 1501, and the index character string “NW” is also generated from the candidate characters “N” and “W” registered at the character positions “4” and “5”. In this case, for the one index character string “NW”, both the character position 2 with certainty factors Cr (0.7, 0.2) and the character position 4 with certainty factors Cr (0.1, 0.2) are registered in the index table 190 (row 1604).
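The creation of the index table in step S405, including the case in which one index character string is registered with several character positions, can be sketched as follows. The table layout, the function name `build_index_table`, and the reduced sample table are assumptions for illustration. Index character strings are keyed by tuples only because the sample uses romanized stand-ins for single characters; with real character codes a plain string key would serve.

```python
from itertools import product


def build_index_table(candidate_table, n=2):
    """Map each n-character index string to a list of (position, (Cr, ...)) entries."""
    index = {}
    for start in sorted(candidate_table):
        positions = range(start, start + n)
        if not all(p in candidate_table for p in positions):
            continue  # fewer than n adjacent positions remain
        # Every combination of candidates at the adjacent positions is registered.
        for combo in product(*(candidate_table[p] for p in positions)):
            key = tuple(char for char, _ in combo)
            crs = tuple(cr for _, cr in combo)
            index.setdefault(key, []).append((start, crs))
    return index


# Reduced, hypothetical candidate character-confidence table.
table = {
    1: [("i", 0.9)],
    2: [("shi", 0.8), ("n", 0.7)],
    3: [("w", 0.2)],
    4: [("n", 0.1)],
    5: [("w", 0.2)],
}
index = build_index_table(table)
```

In this sample, the index character string ("n", "w") is registered twice, at character positions 2 and 4, each with its own set of certainty factors Cr.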
[0079]
The number of characters included in the index character string is determined in advance. In the example shown in FIG. 6, the number of characters included in the index character string is “2”. The number of characters included in the index character string can be any natural number. However, in general, the number of characters included in the index character string is preferably 2 or more. This is because if the number of characters included in the index character string is 1, the number of character positions and certainty factors Cr registered for one index character string increases, and the search cannot be performed at high speed.
[0080]
The index character strings in the index table 190 are preferably ordered according to a predetermined order to facilitate searching.
[0081]
The index table 190 is created using a plurality of candidate characters for one character position in the character recognition result 140 shown in FIG. 4. As a result, the index table 190 includes a plurality of index character strings corresponding to the same character position, so that a plurality of index character strings can match one portion of the character recognition result. For example, the index character string “Ishi” shown in the row 1602 of the index table 190 and the index character string “In” shown in the row 1603 of the index table 190 both match the portion of the character recognition result 140 indicated by the character position “1” (the portion including the column 1045 and the column 1046). This can reduce search omissions.
[0082]
Thus, a match between an index character string and a portion of the character recognition result includes the case in which each character of the index character string is identical (has the same character code) to one of the at least one candidate character included in the corresponding one of the consecutive portions of the character recognition result (the adjacent columns shown in FIG. 4).
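This notion of a match can be expressed as a simple membership test per character position. The sketch below assumes the character recognition result is held as a dictionary from character position to a list of candidate characters; the function name `matches` and the sample data (romanized stand-ins) are hypothetical.

```python
def matches(index_string, recognition_result, start):
    """True if every character of index_string is among the candidates
    at the corresponding consecutive character position."""
    for offset, char in enumerate(index_string):
        if char not in recognition_result.get(start + offset, ()):
            return False
    return True


# Hypothetical result: position 2 has two candidate characters.
recognition_result = {1: ["i"], 2: ["shi", "n"]}

both_match = (
    matches(("i", "shi"), recognition_result, 1),
    matches(("i", "n"), recognition_result, 1),
)
no_match = matches(("i", "so"), recognition_result, 1)
```

Both ("i", "shi") and ("i", "n") match the same portion starting at character position 1, which is how several index character strings come to correspond to one character position.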
[0083]
An index table including a plurality of index character strings corresponding to the same character position, such as the index table 190, can also be created from a character recognition result in which there is one candidate character for one character position.
[0084]
FIG. 7 shows an example of a character recognition result 140a in which there is one candidate character for one character position. The character recognition result 140a differs from the character recognition result 140 shown in FIG. 4 in that there is only one candidate character for each character position. The character recognition result 140a indicates that the characters “I”, “N”, “De”, “T”, “K”, and “S” included in the original document 1310 (FIG. 3) were recognized as “I”, “SHI”, “TE”, “SO”, “TA”, and “SU”, respectively. That is, the characters “N”, “De”, “T”, and “K” included in the original document 1310 were erroneously recognized.
[0085]
FIG. 8A shows a procedure of processing for creating an index table (index table creation processing) from the character recognition result 140a shown in FIG. 7.
[0086]
Step S501: With reference to the certainty factor table, a pair of a similar character and a certainty factor Cr is obtained from the one pair of candidate character and reliability Rr for one character position of the character recognition result 140a. A plurality of pairs of similar characters and certainty factors Cr may be obtained. The process of step S501 is performed for each character position. The certainty factor table will be described later with reference to FIG. 9.
[0087]
Step S502: It is determined whether or not the certainty factor Cr of the similar character is greater than or equal to a predetermined reference value. The predetermined reference value is, for example, 0.05. If the determination result in step S502 is “Yes”, the process proceeds to step S503. If the determination result in step S502 is “No”, the process proceeds to step S504.
[0088]
Step S503: Similar characters and the certainty factor Cr obtained in step S501 are registered in the similar character-confidence table.
[0089]
FIG. 8B shows an example of the similar character-confidence table 1801. The similar characters and their certainty factors Cr are registered in the similar character-confidence table 1801 for each character position 1042 (FIG. 7).
[0090]
With reference to FIG. 8A again, the description of the index table creation processing will be continued.
[0091]
Step S504: It is determined whether or not the processes in steps S502 to S503 have been performed for all similar characters at all character positions. If the determination result in step S504 is “Yes”, the process proceeds to step S505. If the determination result in step S504 is “No”, the processing from step S502 is performed on other similar characters.
[0092]
Step S505: An index table is created from similar characters registered at adjacent character positions in the similar character-confidence table 1801 (FIG. 8B). The index character string is generated by combining similar characters registered at adjacent character positions in the similar character-confidence table 1801 (FIG. 8B). This process is the same as the process of creating the index table 190 (FIG. 6) from the candidate character-confidence table 1501 (FIG. 5B) in step S405 shown in FIG. 5A.
[0093]
The generated index table is the same as the index table 190 shown in FIG. 6. For example, in the row 1602 of the index table 190, the column 1611 indicates the character position of the first similar character “i” included in the index character string “Ishi”. The column 1612 shows the set (0.9, 0.8) of certainty factors Cr defined for each of the similar characters included in the index character string “Ishi”.
[0094]
FIG. 9 shows an example of the certainty factor table 150. FIG. 9 shows only the portion of the certainty factor table 150 related to the candidate character “shi”.
[0095]
For example, when one candidate character “shi” (シ) and the reliability Rr “0.9” are obtained as a character recognition result, the similar character “n” (ン) with the certainty factor Cr “0.2” and the similar character “shi” with the certainty factor Cr “0.8” are obtained from the certainty factor table 150. The similar character “n” and the similar character “shi” are characters that are the same as, or similar in shape to, the candidate character “shi”.
[0096]
The fact that the similar characters for the candidate character “shi” are “n” and “shi” indicates that, when one candidate character “shi” is obtained as a character recognition result, either the similar character “n” or the similar character “shi” is likely to be written in the original document.
[0097]
The certainty factor table 150 can be created by performing character recognition in advance on original documents in which a large number of characters are written, and comparing the character recognition results and reliabilities Rr thus obtained with the characters actually present in the original documents. For example, the certainty factor Cr “0.2” shown in the portion 1811 of the certainty factor table 150 can be obtained from the probability that the candidate character “shi” and the reliability Rr “0.9” are obtained when character recognition is performed on the character “n” written in various fonts and with various print qualities.
[0098]
The certainty factor table 150 is prepared for all combinations of characters. However, similar characters whose certainty factor Cr is smaller than a predetermined reference need not be registered in the certainty factor table 150. Therefore, the number of similar characters obtained for one candidate character can be limited.
[0099]
When the reliability Rr obtained by character recognition does not match any reliability Rr defined in the certainty factor table 150 shown in FIG. 9 (for example, when the reliability Rr obtained by character recognition is 0.8), the certainty factor Cr of the similar characters is calculated by an appropriate method. For example, when the reliability Rr obtained by character recognition is smaller than 0.5, the row of the reliability Rr “0.5” in the certainty factor table 150 is referred to. When the reliability Rr obtained by character recognition is greater than 0.9, the row of the reliability Rr “0.9” in the certainty factor table 150 is referred to. When the reliability Rr obtained by character recognition lies between two reliabilities Rr defined in the certainty factor table 150, the row whose reliability Rr is closer to the reliability Rr obtained by character recognition is referred to.
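
The row selection described above can be expressed as a nearest-value lookup, which also covers the cases below 0.5 and above 0.9. The following Python sketch makes that assumption explicit. Only the Rr = 0.9 row is taken from the description. The Rr = 0.5 row values, the name `SHI_ROWS`, and the function `similar_chars_for` are hypothetical, and the description does not say how an exact tie between two rows should be resolved.

```python
# Rows of the certainty factor table for the candidate character "シ" (shi),
# keyed by the reliability Rr defined in the table.
SHI_ROWS = {
    0.5: [("ン", 0.4), ("シ", 0.6)],  # hypothetical values for illustration
    0.9: [("ン", 0.2), ("シ", 0.8)],  # values given in the description
}


def similar_chars_for(rows, rr):
    """Return the similar characters from the row whose Rr is nearest to rr.

    Values below the smallest defined Rr map to the first row and values above
    the largest defined Rr map to the last row, matching the rule in the text.
    """
    nearest_rr = min(rows, key=lambda defined_rr: abs(defined_rr - rr))
    return rows[nearest_rr]


result = similar_chars_for(SHI_ROWS, 0.8)
```

With these rows, an observed reliability Rr of 0.8 selects the Rr “0.9” row, because 0.9 is closer to 0.8 than 0.5 is.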
[0100]
Note that the structure of the certainty factor table 150 is not limited to the structure shown in FIG. 9. The certainty factor table 150 may have an arbitrary structure as long as at least one set of a similar character and a certainty factor Cr is obtained from a set of a candidate character and a reliability Rr. For example, assuming that the distribution of the certainty factor Cr with respect to the reliability Rr is a uniform distribution, the range of the distribution, expressed by the upper and lower limits of the certainty factor Cr and the upper and lower limits of the reliability Rr, may be defined in the certainty factor table 150. Alternatively, assuming that the distribution of the certainty factor Cr with respect to the reliability Rr is a Gaussian distribution, the mean and the variance of the distribution may be defined in the certainty factor table 150.
[0101]
By executing the index table creation processing shown in FIG. 8A using such a certainty factor table 150, an index table 190 (FIG. 6) including a plurality of index character strings corresponding to the same character position can be created from the character recognition result 140a (FIG. 7), which has one candidate character for each character position.
[0102]
The certainty factor table 150 may be omitted when the search device 1 executes the index table creation process according to the procedure shown in FIG. 5A.
[0103]
In this manner, the index table 190 can also be created using a plurality of similar characters that are associated in advance with the one candidate character at each character position in the character recognition result 140a shown in FIG. 7. As a result, the index table 190 includes a plurality of index character strings corresponding to the same character position. Therefore, a plurality of index character strings can match one part of the character recognition result. This makes it possible to reduce search omissions even when erroneous recognition occurs in the character recognition process.
[0104]
However, each character included in the index character string is not necessarily the same as the one candidate character included in one of the consecutive parts of the character recognition result (the adjacent columns shown in FIG. 7). For example, the character “N” included in the index character string “IN” shown in the row 1603 of the index table 190 is not the same as the one candidate character “shi” included in the column 1046a of the character recognition result 140a (FIG. 7). However, the character “N” included in the index character string “IN” is the same as the similar character “n” previously associated with the candidate character “shi” by the certainty factor table 150 (FIG. 9).
[0105]
Thus, the statement that an index character string matches a part of the character recognition result includes the concept that each character included in the index character string is the same as (has the same character code as) one of the at least one character previously associated with the one candidate character included in one of the consecutive parts of the character recognition result (the adjacent columns shown in FIG. 4).
[0106]
Next, a process (document search process) for searching for a keyword from a character recognition result using the index table 190 (FIG. 6) will be described.
[0107]
FIG. 10 shows the procedure of the document search process. Hereinafter, each step of the document search process will be described in detail.
[0108]
Step S301: A keyword is input. Hereinafter, a case where the keyword is the character string “index” (インデックス) will be described as an example.
[0109]
Step S302: Sets of two consecutive characters (character strings having a length of 2) are extracted from the keyword. In this example, the two-character sets “IN” (イン), “NDE” (ンデ), “DET” (デッ), “KKU” (ック), and “KUS” (クス) are extracted. Note that the length of the extracted character strings is set to be equal to the length of the index character strings defined in the index table. Therefore, when the length of the index character strings is n (n is a natural number), sets of n characters (character strings having a length of n) are extracted from the keyword. In the following description, it is assumed that n = 2.
[0110]
The extracted sets of two characters partly overlap each other. However, the sets of two characters may instead be extracted from the keyword so as not to overlap. For example, the two-character sets “IN”, “DET”, and “KUS” may be extracted from the keyword “index”. In either case, the sets of two characters are extracted so that each character included in the keyword is included in at least one of the extracted sets.
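
Both extraction strategies of step S302 can be sketched in a few lines of Python. The function names `overlapping_ngrams` and `covering_ngrams` are hypothetical. Each function returns pairs of (k, character string), where k is the 1-based position of the first character within the keyword. In the non-overlapping variant, a final set anchored at the end of the keyword is added when needed so that every character is still covered, which is one possible way to satisfy the covering condition.

```python
def overlapping_ngrams(keyword, n=2):
    """All sets of n consecutive characters, as (k, string) with 1-based k."""
    return [(k, keyword[k - 1:k - 1 + n]) for k in range(1, len(keyword) - n + 2)]


def covering_ngrams(keyword, n=2):
    """Non-overlapping sets of n characters that still cover every character."""
    last = len(keyword) - n + 1
    starts = list(range(1, last + 1, n))
    if starts and starts[-1] != last:
        starts.append(last)  # final set overlaps slightly to cover the tail
    return [(k, keyword[k - 1:k - 1 + n]) for k in starts]


keyword = "インデックス"  # the keyword "index"
all_sets = [s for _, s in overlapping_ngrams(keyword)]
sparse_sets = [s for _, s in covering_ngrams(keyword)]
```

For the keyword "インデックス", the first function yields the five overlapping sets and the second yields the three sets “IN”, “DET”, and “KUS”.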
[0111]
Step S303: Referring to the index table 190 (FIG. 6), the character position and the certainty factors Cr corresponding to each set of two characters are extracted. In this example, the following are obtained:
the character position “1” and the set (0.9, 0.7) of certainty factors Cr corresponding to the two-character set “IN” (イン) (row 1603),
the character position “2” and the set (0.7, 0.8) of certainty factors Cr corresponding to the two-character set “NDE” (ンデ) (row 1605),
the character position “3” and the set (0.8, 0.3) of certainty factors Cr corresponding to the two-character set “DET” (デッ) (row 1606),
the character position “4” and the set (0.3, 0.1) of certainty factors Cr corresponding to the two-character set “KKU” (ック) (row 1607), and
the character position “5” and the set (0.1, 0.9) of certainty factors Cr corresponding to the two-character set “KUS” (クス) (row 1608).
[0112]
In order to efficiently extract the character position and the certainty factors Cr corresponding to each character string from the index table 190 shown in FIG. 6, a correspondence table may be used that associates the character codes of the characters included in each index character string with the address at which the row including that index character string is stored (for example, an address on the HDD 170). Such an address may be obtained using a binary tree search method.
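
As an illustration of the lookup in step S303, the following Python sketch keeps the rows of the index table in a sorted in-memory list and locates an index character string by binary search with the standard `bisect` module. This is a substitute for the binary tree and the stored addresses mentioned above, not the structure used in the embodiment. The names `ROWS`, `KEYS`, and `lookup` are hypothetical, and only the rows quoted in the description are included.

```python
from bisect import bisect_left, bisect_right

# (index character string, character position, (Cr of 1st char, Cr of 2nd char))
ROWS = sorted([
    ("イシ", 1, (0.9, 0.8)),  # row 1602
    ("イン", 1, (0.9, 0.7)),  # row 1603
    ("ンデ", 2, (0.7, 0.8)),  # row 1605
    ("デッ", 3, (0.8, 0.3)),  # row 1606
    ("ック", 4, (0.3, 0.1)),  # row 1607
    ("クス", 5, (0.1, 0.9)),  # row 1608
])
KEYS = [row[0] for row in ROWS]


def lookup(index_string):
    """Return every (character position, Cr pair) registered for index_string."""
    lo = bisect_left(KEYS, index_string)
    hi = bisect_right(KEYS, index_string)
    return [(position, cr) for _, position, cr in ROWS[lo:hi]]


hits = lookup("イン")
```

A character string that is not registered, such as "ワッ" in this excerpt, simply returns an empty list.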
[0113]
Step S304: It is determined whether or not the process of step S303 has been performed for all sets of two characters. If the determination result in step S304 is “No”, the process of step S303 is performed for the other two character sets. If the determination result in step S304 is “Yes”, the process proceeds to step S305.
[0114]
Step S305: It is determined whether or not all sets of two characters are arranged in a predetermined order. This determination is made based on the character position obtained for each set of two characters in step S303. Specifically, when the character position m (m is a natural number) obtained for the set of two characters starting from the k-th character (k is a natural number) of the keyword satisfies the relationship “m − k = constant” for every set, it is determined that all sets of two characters are arranged in the predetermined order.
[0115]
The fact that all sets of two characters are arranged in the predetermined order indicates that the keyword matches a specific part of the character recognition result. The specific part is the part of the character recognition result that the characters included in the keyword match.
[0116]
In this example, the keyword “index” matches the portion 1047 of the character recognition result 140 (FIG. 4) or the portion 1047a of the character recognition result 140a (FIG. 7).
[0117]
The position of the part 1047 or the part 1047a is specified as the character position “1” in the head column of the part.
[0118]
In this example, all sets of two characters extracted from the keyword “index” satisfy the relationship described above, so it is determined that they are arranged in the predetermined order.
[0119]
If the determination in step S305 is “Yes”, the process proceeds to step S306. If the determination in step S305 is “No”, the process proceeds to step S308.
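
The test “m − k = constant” in step S305 can be sketched as follows in Python. The function name `match_start` is hypothetical, and the sketch assumes for simplicity that exactly one character position m was obtained for each set of two characters. When several positions exist for a set, the candidates would first have to be grouped by the value of m − k.

```python
def match_start(positions):
    """positions: list of (k, m) pairs, both 1-based.

    Returns the character position of the first character of the match when
    m - k is the same for every pair, otherwise None.
    """
    offsets = {m - k for k, m in positions}
    if len(offsets) != 1:
        return None  # sets are not in the predetermined order (or no sets)
    return offsets.pop() + 1  # position m of the set starting at k = 1


# (k, m) pairs for the five sets extracted from the keyword "index".
index_positions = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
start = match_start(index_positions)
```

For the keyword “index” the offset m − k is 0 for every set, so the match is located at the character position “1”.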
[0120]
As described above, in steps S302 to S305, the CPU 110 (FIG. 1) functions as a position specifying unit. Based on the index character strings included in the index table 190 (FIG. 6) and the positions of the parts of the character recognition result that match those index character strings, the position specifying unit determines whether or not the keyword matches a part of the character recognition result and, if it does, specifies the position of the part of the character recognition result that matches the keyword.
[0121]
Step S306: The keyword certainty factor Kc is calculated. The keyword certainty factor Kc is obtained, for example, as the arithmetic average of the left-hand value of the set of certainty factors Cr corresponding to each of the two-character sets “IN” (イン), “NDE” (ンデ), “DET” (デッ), “KKU” (ック), and “KUS” (クス) extracted from the keyword “index”, together with the right-hand value of the set of certainty factors Cr corresponding to the last two-character set “KUS”. This is equivalent to obtaining the arithmetic average of the certainty factors Cr defined for each character included in the keyword. In this example, the keyword certainty factor Kc = (0.9 + 0.7 + 0.8 + 0.3 + 0.1 + 0.9) / 6 = 0.61 (3.7 / 6 = 0.6166..., truncated to two decimal places).
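
The arithmetic average of step S306 can be checked with a short Python sketch. The function name `keyword_certainty` is hypothetical, and the input is assumed to be the list of certainty factor sets obtained in step S303, in keyword order, for overlapping sets of two characters.

```python
def keyword_certainty(cr_pairs):
    """Arithmetic mean of the per-character certainty factors Cr.

    Takes the left-hand value of every pair plus the right-hand value of the
    last pair, so that each character of the keyword is counted exactly once.
    """
    values = [left for left, _ in cr_pairs] + [cr_pairs[-1][1]]
    return sum(values) / len(values)


# Certainty factor sets for "IN", "NDE", "DET", "KKU", and "KUS" (rows 1603 to 1608).
pairs = [(0.9, 0.7), (0.7, 0.8), (0.8, 0.3), (0.3, 0.1), (0.1, 0.9)]
kc = keyword_certainty(pairs)
```

The result is 3.7 / 6, which exceeds the example reference value of 0.5 used in step S307.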
[0122]
The keyword certainty factor Kc may instead be calculated as a geometric mean, a median value, or a mode value. The keyword certainty factor Kc may also be calculated using only the smaller value in each set of certainty factors Cr corresponding to a set of two characters. A certainty factor Cr that is less than a predetermined reference value may be excluded from the calculation of the keyword certainty factor Kc.
[0123]
Thus, the keyword certainty factor Kc is calculated based on the certainty factor Cr defined for each character included in the keyword.
[0124]
The calculated keyword certainty factor Kc indicates the probability that the keyword exists at a position in the original document corresponding to the character position in the character recognition result.
[0125]
For example, when it is determined in step S305 that the keyword “index” matches the portion 1047 of the character recognition result 140 (FIG. 4) or the portion 1047a of the character recognition result 140a (FIG. 7), and the position is specified as the character position “1”, the probability that the keyword “index” exists at the position in the original document corresponding to the character position “1” (that is, at the beginning of the original document) is 0.61.
[0126]
As described above, in step S306, the CPU 110 functions as a calculation unit that calculates, based on the certainty factors Cr included in the index table 190 (FIG. 6), the keyword certainty factor Kc indicating the probability that the keyword exists at the position in the original document corresponding to the position of the part of the character recognition result that matches the keyword.
[0127]
Step S307: It is determined whether or not the keyword certainty factor Kc is greater than or equal to a reference value (predetermined value). The reference value may be 0.5, for example. This reference value may be a fixed value or set according to a keyword. For example, the reference value may be changed according to the number of characters of the keyword.
[0128]
If this reference value is increased, search noise can be reduced. However, if it is too high, search omission is likely to occur. In general, when the number of characters in a keyword is large, it is preferable to reduce the omission of search even when misrecognition is frequent by setting the reference value low. This is because when the number of characters in the keyword (the number of characters included in the keyword) is large, the search noise does not increase so much even if the reference value is set low.
[0129]
The reference value may be changed according to the type of characters (character type) included in the keyword. For example, a more effective search can be performed by setting the reference value optimally for each case in which the characters of the keyword are katakana, kanji, a mixture of kanji and katakana, or hiragana.
[0130]
Alternatively, this reference value may be specified by the user. The user can set an appropriate reference value according to the purpose of preventing search omission or reducing search noise.
[0131]
If the determination in step S307 is “Yes”, the process proceeds to step S309. If the determination in step S307 is “No”, the process proceeds to step S308.
[0132]
As described above, after a search result indicating that the keyword matches the character recognition result 140 or the character recognition result 140a is obtained in step S305, step S307 is executed to determine, based on the keyword certainty factor Kc, whether or not the keyword actually exists in the original document (whether or not the search result is valid). In step S307, the CPU 110 functions as a determination unit that determines the validity of the search result based on the keyword certainty factor Kc.
[0133]
Step S308: It is determined that the keyword does not exist in the original document. For example, a message “Keywords could not be found” is displayed on the display of the terminal 100 (FIG. 1).
[0134]
Step S309: It is determined that a keyword has been detected. As a search result, the character position indicating the detected location and the keyword certainty factor Kc are obtained. When there are a plurality of detection locations, a plurality of sets of character positions and keyword certainty factors Kc are obtained.
[0135]
In the example described above in which the keyword is “index”, the character position “1” and the keyword certainty factor Kc “0.61” are obtained as search results.
[0136]
The search result is displayed on the terminal 100, for example. For example, the search device 1 displays the document image data 130 (and/or the character recognition results 140 and 140a) stored in the HDD 170 on the display of the terminal 100, and highlights the area corresponding to the keyword within the displayed document image data 130 (and/or the displayed character recognition results 140 and 140a). The highlighting is performed, for example, by changing the attributes of the displayed characters (for example, the character color and density, the character background color and density, the character size, the character thickness, or the font). Such attributes may be changed according to the keyword certainty factor Kc. For example, the range of the keyword certainty factor Kc from 0.5 to 1.0 may be divided in steps of 0.1, and a different attribute may be set for each division to perform the highlighting. In this case, since the user can visually grasp the magnitude of the keyword certainty factor Kc, there is an advantage that the user can easily make a further visual judgment of the validity of the search result.
[0137]
Alternatively, the areas corresponding to the keyword may be displayed in descending order of the keyword certainty factor Kc of the detected locations.
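
The two display options above, division of the range 0.5 to 1.0 in steps of 0.1 and ordering by the keyword certainty factor Kc, can be sketched in Python as follows. The function name `highlight_level`, the level numbering 0 to 4, and the sample hits other than (1, 0.61) are hypothetical. Kc is first rounded to two decimal places so that the division boundaries are not affected by floating-point error.

```python
def highlight_level(kc):
    """Map Kc in [0.5, 1.0] to a level 0..4 in steps of 0.1; None below 0.5."""
    hundredths = round(kc * 100)       # work in integers to avoid float error
    level = hundredths // 10 - 5       # 0.5x -> 0, 0.6x -> 1, ..., 0.9x -> 4
    if level < 0:
        return None                    # below the displayed range
    return min(level, 4)               # Kc = 1.0 falls in the top division


# (character position, Kc) for each detected location; only the first is from the text.
hits = [(1, 0.61), (40, 0.83), (97, 0.55)]
ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
levels = [highlight_level(kc) for _, kc in ordered]
```

Each level could then be mapped to a display attribute such as a background color, with the choice of attributes left to the implementation.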
[0138]
In this manner, when the user further determines the validity of the search result, the reference value used in step S307 may be set low.
[0139]
Alternatively, step S307 may be omitted. In this case, all of the determination of the validity of the search result is performed by the user. The user can easily determine the validity of the search result based on the keyword certainty factor Kc.
[0140]
Hereinafter, an example in which search noise is suppressed by the document search process shown in FIG. 10 will be described.
[0141]
When the keyword “wax” is specified and the document search process shown in FIG. 10 is performed using the index table 190 shown in FIG. 6, the determination in step S305 is “Yes”, and the character position “3” is specified.
[0142]
In step S306, the keyword certainty factor Kc = (0.2 + 0.3 + 0.1 + 0.9) /4=0.38 is calculated.
[0143]
Since the keyword certainty factor Kc is smaller than the reference value 0.5, it is determined that no keyword exists.
[0144]
When the keyword “denwa” is designated, the determination in step S305 is “Yes”, and the character position “3” is specified.
[0145]
In step S306, the keyword certainty factor Kc = (0.8 + 0.1 + 0.2) /3=0.37 is calculated.
[0146]
Since the keyword certainty factor Kc is smaller than the reference value 0.5, it is determined that no keyword exists.
[0147]
Similarly, when the keyword “hook” is designated, the determination in step S305 is “Yes”, and the character position “3” is specified.
[0148]
In step S306, the keyword certainty factor Kc = (0.2 + 0.3 + 0.1) /3=0.2 is calculated.
[0149]
Since the keyword certainty factor Kc is smaller than the reference value 0.5, it is determined that no keyword exists.
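
The three examples above can be reproduced with a few lines of Python. The per-character certainty factors are those given in the calculations above, the reference value 0.5 is the example value of step S307, and the variable names are hypothetical.

```python
REFERENCE = 0.5  # example reference value used in step S307

# Per-character certainty factors Cr used in the three examples.
examples = {
    "wax": [0.2, 0.3, 0.1, 0.9],
    "denwa": [0.8, 0.1, 0.2],
    "hook": [0.2, 0.3, 0.1],
}

# Keyword certainty factor Kc as the arithmetic mean for each keyword.
results = {kw: sum(crs) / len(crs) for kw, crs in examples.items()}

# Keywords judged not to exist in the original document.
rejected = [kw for kw, kc in results.items() if kc < REFERENCE]
```

All three keywords match a part of the character recognition result in step S305, yet each falls below the reference value and is rejected in step S307.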
[0150]
As described above, the search device 1 of the present invention can suppress the detection of a keyword that does not exist in the original document, that is, it can suppress search noise.
[0151]
The document search process of the present invention is not limited to being realized by software on a computer. The document search processing of the present invention may be realized by hardware, or may be realized by a combination of software and hardware.
[0152]
In the embodiment described above, a Japanese document has been described as an example. However, the application of the present invention is not limited to Japanese documents. The present invention can also be applied to other arbitrary documents (for example, Chinese documents, English documents, Korean documents).
[0153]
[Effects of the invention]
According to the present invention, the keyword certainty factor, which indicates the probability that the keyword exists at the position in the original document corresponding to the position of the part of the character recognition result that matches the keyword, is calculated based on the certainty factors included in the index table. Therefore, it becomes easy to determine the validity of the search result based on the keyword certainty factor.
[0154]
Since the search device of the present invention uses an index table, it is possible to perform a high-speed search.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a search device 1 according to the present invention.
FIG. 2 is a diagram illustrating a flow of processing executed by the search device 1 to search for a keyword from a character recognition result obtained by recognizing a character in an original document.
FIG. 3 is a diagram showing an example of an original document 1310.
FIG. 4 is a diagram showing an example of a character recognition result 140 obtained by performing character recognition processing on the original document 1310.
FIG. 5A is a flowchart showing the procedure of index table creation processing.
FIG. 5B is a diagram showing an example of a candidate character-confidence table 1501.
FIG. 6 is a diagram showing an example of an index table 190 created by the index table creation processing.
FIG. 7 is a diagram showing an example of a character recognition result 140a in which there is one candidate character for one character position.
FIG. 8A is a flowchart showing a processing procedure for creating an index table from the character recognition result 140a shown in FIG. 7.
FIG. 8B is a diagram showing an example of a similar character-confidence table 1801.
FIG. 9 is a diagram illustrating an example of a certainty factor table 150;
FIG. 10 is a flowchart showing a procedure of document search processing.
FIG. 11 is a diagram showing an example of an index table 1901 in which a plurality of character strings that may exist in an original document are registered as index character strings according to a conventional technique.
[Explanation of symbols]
1 Search device
100 Terminal
110 CPU
120 Image input device
130 Document image data
140 Character recognition result
170 HDD
180 Work memory
190 Index table

Claims

A search device that uses an index table to search for a keyword consisting of a plurality of characters in a character recognition result of an original document, the character recognition result being obtained by recognizing each character in the original document and having at least one candidate character corresponding to each character, wherein
the index table includes an index character string composed of a combination of the candidate characters corresponding to each of a plurality of characters constituting a character string existing in the original document, the position of the index character string in the original document, and, for each candidate character included in the index character string, a certainty factor defined in advance as a probability that the candidate character exists in the original document, the search device comprising:
a position specifying unit that determines whether a character string of the keyword having the same number of characters as the index character string exists in the index table and, if it exists, specifies the position of the character string of the keyword in the original document based on the position of the index character string that matches the character string of the keyword;
a calculation unit that calculates a keyword certainty factor, defined as a probability that the keyword exists at the specified position, based on the certainty factor of the index character string that matches the character string of the keyword; and
a determination unit that determines the validity of a search result based on the keyword certainty factor.

The search device according to claim 1, wherein the determination unit determines that the search result is valid when the keyword certainty factor is equal to or greater than a predetermined value.

The search device according to claim 2, wherein the predetermined value is set according to at least one of the number of characters included in the keyword and the type of character included in the keyword.

The search device according to claim 1, further comprising an index table creation unit that creates the index table,
wherein, when a plurality of candidate characters are generated based on the character recognition result, the index table creation unit generates the index character string by combining the candidate characters generated for each character in the character string of the original document.

The search device according to claim 1, further comprising an index table creation unit that creates the index table,
wherein the index table creation unit generates one candidate character for each character based on the character recognition result, generates similar characters that are similar to each generated candidate character based on the certainty factor, and also generates the index character string from a combination of the candidate character corresponding to each character in the character string of the original document and the similar character, or from a combination of the similar characters.

A computer-readable recording medium recording a program for causing a computer to execute search processing that uses an index table to search for a keyword consisting of a plurality of characters in a character recognition result of an original document, the character recognition result being obtained by recognizing each character in the original document and having at least one candidate character corresponding to each character, wherein
the index table includes an index character string composed of a combination of the candidate characters corresponding to each of a plurality of characters constituting a character string existing in the original document, the position of the index character string in the original document, and, for each candidate character included in the index character string, a certainty factor defined in advance as a probability that the candidate character exists in the original document, and
the search processing includes:
determining whether a character string of the keyword having the same number of characters as the index character string exists in the index table and, if it exists, specifying the position of the character string of the keyword in the original document based on the position of the index character string that matches the character string of the keyword;
calculating a keyword certainty factor, defined as a probability that the keyword exists at the specified position, based on the certainty factor of the index character string that matches the character string of the keyword; and
determining the validity of a search result based on the keyword certainty factor.

A program for causing a computer to execute search processing that uses an index table to search for a keyword consisting of a plurality of characters in a character recognition result of an original document, the character recognition result being obtained by recognizing each character in the original document and having at least one candidate character corresponding to each character, wherein
the index table includes an index character string composed of a combination of the candidate characters corresponding to each of a plurality of characters constituting a character string existing in the original document, the position of the index character string in the original document, and, for each candidate character included in the index character string, a certainty factor defined in advance as a probability that the candidate character exists in the original document, and
the search processing includes:
determining whether a character string of the keyword having the same number of characters as the index character string exists in the index table and, if it exists, specifying the position of the character string of the keyword in the original document based on the position of the index character string that matches the character string of the keyword;
calculating a keyword certainty factor, defined as a probability that the keyword exists at the specified position, based on the certainty factor of the index character string that matches the character string of the keyword; and
determining the validity of a search result based on the keyword certainty factor.