JP4936635B2

JP4936635B2 - Character string search device, character string search method, and program for causing computer to execute the method

Info

Publication number: JP4936635B2
Application number: JP2003084877A
Authority: JP
Inventors: 佳洋入江
Original assignee: Glory Ltd
Current assignee: Glory Ltd
Priority date: 2003-03-26
Filing date: 2003-03-26
Publication date: 2012-05-23
Anticipated expiration: 2023-03-26
Also published as: JP2004295329A

Description

【０００１】
【発明の属する技術分野】
この発明は、所定の文字列をキーワードとして受け付けた際に、検索対象となる文書を形成する各文字とあらかじめ登録した標準文字との間の距離値に基づいて前記文書を形成する各文字を文字認識し、該文字認識した認識結果を踏まえて前記文書から前記キーワードをなす文字列を検索する文字列検索装置、文字列検索方法およびその方法をコンピュータに実行させるためのプログラムに関し、特に、文字認識できない文字があった場合でも、検索漏れや検索ノイズを減少させ、所定のキーワードに対応する文字列を効率的に検索することができる文字検索装置、文字検索方法およびその方法をコンピュータに実行させるプログラムに関する。
【０００２】
【従来の技術】
従来、スキャナなどから取り込んだ手書きの文書を画像データとして蓄積して管理する電子ファイリングが広く知られている。電子ファイリングでは、多量の画像データが蓄積されるため、効率的な文書の検索技術を開発することが重要な課題となっている。
【０００３】
このため、画像データとして蓄積された手書き文書の文字認識をおこなって、指定されたキーワードを含む文書を全文検索により抽出する技術が非特許文献１に開示されている。
【０００４】
この非特許文献１の技術では、画像データから切り出された文字に対する複数の文字認識の候補をラティス構造により表現し、それを用いて検索することにより文字列の認識精度を向上させている。ところが、この非特許文献１の技術では、検索する文字列の中に文字認識できない文字があった場合に、その文字列が検索漏れとなるという問題があった。
【０００５】
このため、特許文献１には、画像データ内から文字列を検索する際に、文字認識できない文字と予め作成した文字との間の文字画像の形状特徴を照合して類似度を計算し、類似度がしきい値よりも大きい場合に、文字認識できない文字を含む文字列を検索結果として出力する文書ファイリング装置が開示されている。
【０００６】
また、特許文献２には、検索するキーワードの文字列内に文字認識できない文字が一部含まれる場合でも、文字列全体の文字数に対する一致した文字数の割合がしきい値以上であれば、その文字列を検索結果として出力するファイリング装置が開示されている。
【０００７】
【非特許文献１】
仙田修司、他２名，「全文検索可能な文書画像データベースシステムの構築」，ディジタル図書館，図書館情報大学，１９９６年１０月２３日，Ｎｏ．８
【特許文献１】
特開２０００−５７３１５号公報
【特許文献２】
特開平８−２７２８１３号公報
【０００８】
しかしながら、特許文献１の従来技術を用いた場合には、文字認識できない文字とあらかじめ作成した文字の形状特徴とを照合して類似度を計算したが、文字認識できない文字はそもそも文字画像の質が低品質で判別しにくい状態にあるために、類似度は小さくなり、文字列の検出漏れがそれほど改善されないという問題があった。
【０００９】
さらに、この従来技術では、文字認識される手書きの文字の文字画像が低品質である場合に、無理に文字の形状特徴の照合をおこなった結果、正しい文字とは違う文字に認識され、かえって検索効率を悪化させてしまうといったことも起こり得る。
【００１０】
また、特許文献２の従来技術では、文字認識できなかった文字がどのようなものであっても、他の部分の一致度が高ければその文字列が検索されてしまうので、検索するキーワードと関連の無い文字列までが検索ノイズとして検索され、ユーザを混乱させるという問題があった。
【００１１】
これらのことから、文字認識できない文字があった場合でも、検索漏れや検索ノイズを減少させつつ、所定のキーワードに対応する文字列を効率的に検索することができる技術をいかに実現するかが重要な課題となっている。
【００１２】
この発明は、上述した問題（課題）を解消するためになされたものであり、文字認識できない文字があった場合でも、検索漏れや検索ノイズを減少させ、所定のキーワードに対応する文字列を効率的に検索することができる文字検索装置、文字検索方法およびその方法をコンピュータに実行させるプログラムを提供することを目的とする。
【００１３】
【課題を解決するための手段】
本発明は、上記目的を達成するためになされたものであり、本発明は、所定の文字列をキーワードとして受け付けた際に、検索対象となる文書を形成する各文字とあらかじめ登録した標準文字との間の距離値に基づいて前記文書を形成する各文字を文字認識し、該文字認識した認識結果を踏まえて前記文書から前記キーワードをなす文字列を検索する文字列検索装置であって、前記文書を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できない場合に、該所定の文字に対して所定の距離値を付与する距離値付与手段と、前記距離値付与手段により付与された距離値と文字認識された前記文書を形成する各文字の距離値とに基づいて前記文書に存在する前記キーワードをなす文字列を検索するキーワード検索手段とを備えたことを特徴とする。
【００１４】
また、本発明は、前記文書を形成する各文字を文字認識して得られた単数または複数の文字認識の候補を文字ラティスとして記憶する文字ラティス記憶手段をさらに備え、前記キーワード検索手段は、前記文字ラティス記憶手段により文字ラティスとして記憶された認識結果を読み出し、前記距離値付与手段により付与された距離値と文字認識された前記文書を形成する各文字の距離値とに基づいて前記文字ラティスとして記憶された認識結果から前記文書に存在する前記キーワードをなす文字列を検索することを特徴とする。
【００１５】
また、本発明は、前記所定の文字列をキーワードとして受け付けた際に、前記文字認識した結果を踏まえて該キーワードをなす文字列を形成する各文字を該文字列の先頭文字から順に探索し、該文字列を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できない場合に、該所定の文字をスキップして探索を続行し、該文字列内の文字数に対する該文字認識できない文字数の割合が所定の値以下である場合に該文字列を抽出する文字列抽出手段をさらに備え、前記キーワード検索手段は、前記文字列抽出手段により抽出された文字列を形成する前記文字認識できない所定の文字に対して前記距離値付与手段により付与された所定の距離値と、前記抽出された文字列を形成する文字認識された各文字の距離値とに基づいて該抽出された文字列が該キーワードをなす文字列であるかどうかを判定し、該判定に基づいて該キーワードをなす文字列を検索することを特徴とする。
【００１６】
また、本発明は、前記キーワードをなす文字列の先頭文字が文字認識できない場合に、前記文字列抽出手段は、該キーワードをなす文字列を形成する各文字を該文字列の末尾文字から逆順に探索し、該文字列を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できない場合に、該所定の文字をスキップして探索を続行し、該文字列内の文字数に対する該文字認識できない文字数の割合が所定の値以下である場合に該文字列を抽出することを特徴とする。
【００１７】
また、本発明は、前記文字列抽出手段により抽出された文字列に対して動的計画法を適用し、前記キーワードをなす文字列を形成する各文字に前記抽出された文字列を形成する各文字を対応付ける文字対応付け手段をさらに備え、前記距離値付与手段は、前記文字対応付け手段により動的計画法が適用された文字列を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できなかった場合に、前記文字対応付け手段により該所定の文字が対応付けられた前記キーワードをなす文字列を形成する文字に係る値を前記所定の距離値として該所定の文字に対して付与し、前記キーワード検索手段は、前記距離値付与手段により付与された距離値と前記動的計画法が適用された文字列を形成する前記文字認識された各文字の距離値とに基づいて前記抽出された文字列が該キーワードをなす文字列であるかどうかを判定し、該判定に基づいて該キーワードをなす文字列を検索することを特徴とする。
【００１８】
また、本発明は、前記距離値付与手段は、前記文字対応付け手段により動的計画法が適用された文字列を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できなかった場合に、前記文字対応付け手段により該所定の文字が対応付けられた前記キーワードをなす文字列を形成する文字の距離値の分散または標準偏差に基づいた前記所定の距離値を該所定の文字に対して付与することを特徴とする。
【００１９】
また、本発明は、前記文字ラティス記憶手段により文字ラティスとして記憶された前記単数または複数の文字認識の候補に対し、該文字認識の候補の距離値の小さい順に番号を認識順位として付与する認識順位付与手段をさらに備え、前記キーワード検索手段は、前記認識順位付与手段により付与された認識順位、前記距離値付与手段により付与された距離値および文字認識された前記文書を形成する各文字の距離値に基づいて前記文書に存在する前記キーワードをなす文字列を検索することを特徴とする。
【００２０】
また、本発明は、所定の文字列をキーワードとして受け付けた際に、検索対象となる文書を形成する各文字とあらかじめ登録した標準文字との間の距離値に基づいて前記文書を形成する各文字を文字認識し、該文字認識した認識結果を踏まえて前記文書から前記キーワードをなす文字列を検索する文字列検索方法であって、前記文書を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できない場合に、該所定の文字に対して所定の距離値を付与する距離値付与工程と、前記距離値付与工程により付与された距離値と文字認識された前記文書を形成する各文字の距離値とに基づいて前記文書に存在する前記キーワードをなす文字列を検索するキーワード検索工程とを含んだことを特徴とする。
【００２１】
また、本発明は、所定の文字列をキーワードとして受け付けた際に、検索対象となる文書を形成する各文字とあらかじめ登録した標準文字との間の距離値に基づいて前記文書を形成する各文字を文字認識し、該文字認識した認識結果を踏まえて前記文書から前記キーワードをなす文字列を検索する文字列検索方法をコンピュータに実行させるプログラムであって、前記文書を形成する所定の文字を前記標準文字との間の距離値を用いても文字認識できない場合に、該所定の文字に対して所定の距離値を付与する距離値付与工程と、前記距離値付与工程により付与された距離値と文字認識された前記文書を形成する各文字の距離値とに基づいて前記文書に存在する前記キーワードをなす文字列を検索するキーワード検索工程とをコンピュータに実行させることを特徴とする。
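[0022]
[Embodiments of the Invention]
Preferred embodiments of the character string search device, the character string search method, and the program for causing a computer to execute the method according to the present invention are described in detail below with reference to the accompanying drawings. This embodiment describes the case of searching a horizontally written handwritten document image for a predetermined search string.
[0023]
First, the configuration of the character string search device according to this embodiment is described. FIG. 1 is a functional block diagram showing that configuration. As shown in the figure, the character string search device 100 is used to find documents containing a specific keyword among a plurality of handwritten document images 102, such as forms: a keyword is specified, and the device searches for whether that keyword appears in a document image 102.
[0024]
Even when some characters cannot be recognized, the character string search device 100 can retrieve a character string in the document image 102 as long as the string as a whole matches the keyword to a high degree.
[0025]
The character string search device 100 has a document image storage unit 101, document images 102, a search string reception unit 103, a line/character segmentation unit 104, a distance value calculation unit 105, a character recognition unit 106, a recognition rank assignment unit 107, a character lattice storage unit 108, a character lattice 109, a character string extraction unit 110, a character association unit 111, a distance value assignment unit 112, a keyword search unit 113, and a control unit 114.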
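[0026]
The document image storage unit 101 stores and reads out a plurality of handwritten document images 102. The search string reception unit 103 accepts the character string forming the keyword to be searched for in the document images 102. It accepts, for example, character strings entered from a keyboard or sent from a terminal device over a network.
[0027]
The line/character segmentation unit 104 identifies the document region of the document image 102 on which character recognition is to be performed, cuts out lines from the document in that region, and then cuts out from each line the pieces considered to constitute individual characters.
[0028]
Specifically, the document region to be searched is identified according to the layout rules of each document. The region may also be identified manually, or by any other known method.
[0029]
The distance value calculation unit 105 calculates the distance value between the feature quantity of a standard character, created by averaging characters written by many different people, and the feature quantity of each character cut out of the document image 102. The standard characters are stored in a storage unit (not shown). The method of creating standard characters is not limited to the one described here; various methods can be used.
[0030]
The character recognition unit 106 performs character recognition on the cut-out characters. Specifically, it refers to the distance value calculated by the distance value calculation unit 105 and, when the distance value is smaller than a predetermined threshold, recognizes the character as corresponding to the standard character it was compared with.
[0031]
The recognition rank assignment unit 107 assigns recognition ranks to the multiple character candidates obtained by recognizing a given character, on the basis of the distance value of each candidate. Specifically, it compares the distance values of the candidates and numbers them in ascending order of distance value to set the recognition rank.
[0032]
The character lattice storage unit 108 stores, as the character lattice 109, the one or more recognition candidates obtained by recognizing each character forming the document image 102, and reads them out. It also stores the distance value and recognition rank of each candidate. The character lattice 109 is described in detail later.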
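[0033]
The character string extraction unit 110 uses the character lattice 109 to extract, from the document image 102, character strings forming the keyword to be searched. Specifically, it searches the character lattice for each character of the keyword in order, starting from the first character.
[0034]
If a character cannot be recognized during this search, that character is skipped and the search moves on to the next character. When the ratio of the number of unrecognized characters to the number of characters in the keyword is at or below a predetermined value, the string is extracted as a candidate for the keyword.
[0035]
If the first character of the keyword cannot be recognized, the search for the keyword string cannot be started, as explained later, and the string would be missed even if the rest of it matches the keyword. In that case the candidate string is searched for in reverse order from the last character of the keyword, which prevents the omission.
[0036]
The character association unit 111 uses dynamic programming to associate each character of the extracted string with a character of the keyword.
[0037]
When the string associated by the character association unit 111 contains a character that cannot be recognized, the distance value assignment unit 112 assigns to that character, as its distance value, a value based on the variance of the distance values of the corresponding keyword character. It also sets the recognition rank of the unrecognized character to a predetermined rank.
[0038]
The keyword search unit 113 calculates the average distance value of the string extracted by the character string extraction unit 110. When the calculated average is smaller than a predetermined threshold, it judges the extracted string to correspond to the keyword and outputs it as a search result. The average distance value is calculated from the distance value and recognition rank assigned to each unrecognized character and the distance value and recognition rank of each recognized character.
[0039]
The control unit 114 controls the character string search device 100 as a whole and handles the exchange of data among the functional units.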
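[0040]
FIG. 2 shows an example of a horizontally written handwritten document to be searched by the character string search device 100. As shown in the figure, the device can search for character strings in free-pitch documents, in which the spacing between adjacent characters is unconstrained.
[0041]
Next, the line and character segmentation performed by the line/character segmentation unit 104 is described. FIG. 3 illustrates line segmentation. As shown in the figure, the unit builds a histogram of black-pixel counts in the horizontal direction (a line histogram) and cuts out each line at the valleys between the histogram peaks.
[0042]
Characters can be cut out in the same way. Specifically, for each extracted line, a histogram of black-pixel counts in the vertical direction (a character histogram) is built, and each character is cut out at the valleys between the histogram peaks.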
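The histogram-based segmentation of paragraphs [0041] and [0042] can be sketched as follows. This is a minimal illustration, not the patented implementation: the binary-image representation (1 = black pixel), the function names, and the `valley_max` parameter that defines what counts as a valley are all assumptions made for the example.

```python
from typing import List, Tuple


def projection_histogram(image: List[List[int]], axis: str) -> List[int]:
    """Count black pixels (value 1) per row ("row") or per column ("col")."""
    if axis == "row":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def split_at_valleys(hist: List[int], valley_max: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, of runs whose counts exceed valley_max."""
    runs, start = [], None
    for i, count in enumerate(hist):
        if count > valley_max and start is None:
            start = i                      # a new line or character begins
        elif count <= valley_max and start is not None:
            runs.append((start, i))        # a valley closes the current run
            start = None
    if start is not None:
        runs.append((start, len(hist)))
    return runs


# Toy 4x4 binary image: two text lines separated by an empty row.
image = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
lines = split_at_valleys(projection_histogram(image, "row"))
first_line = image[lines[0][0]:lines[0][1]]
chars = split_at_valleys(projection_histogram(first_line, "col"))
```

The same `split_at_valleys` helper serves both passes: rows first to obtain lines, then columns within each line to obtain character candidates.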
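[0043]
Next, the structure of the character lattice 109 shown in FIG. 1 is described. FIG. 4 shows an example of the basic segments cut out by the line/character segmentation unit 104. As shown in the figure, the unit sets candidate split positions for the characters in each line from the character histogram and cuts the characters at those positions to create basic segments.
[0044]
A basic segment is the smallest unit, delimited by the character histogram, that can be a character candidate. When a single character has separate left and right parts (for example, "社" or "は"), it may be mistakenly cut into two basic segments.
[0045]
FIG. 5 shows an example of candidate segments created by combining basic segments. As shown in the figure, candidate segments are created by combining basic segments in order to handle cases such as a single character being mistakenly cut into two or more pieces, and the various possible character candidates are enumerated.
[0046]
FIG. 6 shows an example of the combinations of basic segments and candidate segments arranged in their order of appearance in the document. As shown in the figure, combining basic and candidate segments and considering the various character candidates makes more accurate character recognition possible.
[0047]
FIG. 7 shows an example of the structure of the character lattice 109, built from some of the basic and candidate segments shown in FIG. 6. As shown in FIG. 7, the character lattice 109 represents, as a lattice structure, the various character candidates obtained by performing character recognition on the basic and candidate segments of FIG. 6.
[0048]
A single character image may correspond to several recognized character candidates. In that case, for example, only the candidates that are within the top ten in recognition rank and whose distance value is at or below a predetermined threshold are selected to build the character lattice 109.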
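As a rough illustration of the candidate selection in paragraph [0048], the sketch below ranks the candidates for one segment by ascending distance and keeps those within the top ten whose distance is at or below a threshold. The function name, the tuple layout, and the threshold value of 250 are assumptions for the example; the source does not state a concrete threshold.

```python
def lattice_candidates(scored, max_rank=10, max_distance=250.0):
    """Rank (char, distance) pairs by ascending distance and filter them.

    Returns (char, distance, rank) tuples; rank 1 is the closest candidate.
    max_distance is a hypothetical threshold chosen for illustration.
    """
    ranked = sorted(scored, key=lambda pair: pair[1])
    return [
        (char, distance, rank)
        for rank, (char, distance) in enumerate(ranked, start=1)
        if rank <= max_rank and distance <= max_distance
    ]


# Illustrative distances for one segment image.
scored = [("才", 204.0), ("オ", 150.0), ("木", 300.0)]
kept = lattice_candidates(scored)
```

Here "木" is dropped because its distance exceeds the assumed threshold, while the two remaining candidates receive recognition ranks 1 and 2.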
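[0049]
Each character candidate in the character lattice 109 is assigned a segment number in the order in which it appears in the document. In the example of FIG. 7, segment numbers are assigned in the order "ノ", "バ", "バ", "ぐ", "イ", "付", "才", "火", "燃", "そ", "然", "然", "さ", "ー", "米", "料", "斗".
[0050]
Next, the calculation of the distance value between character feature quantities by the character recognition unit 106 shown in FIG. 1 is described. FIG. 8 illustrates this calculation. The process uses standard characters created by averaging characters written by a large number of people.
[0051]
First, as shown in FIG. 8(a), the contour of the standard character image is extracted. Then, as shown in FIG. 8(b), the contour image is divided into mesh regions. In this example, each mesh region consists of 2 x 2 cells, and the division is made so that adjacent mesh regions overlap by half, which yields 25 (= 5 x 5) mesh regions.
[0052]
To extract the character features, the number of contour elements in each of the directions shown in FIG. 8(c) is counted for each mesh region and used as a feature quantity, and a feature vector is generated from these quantities. In this example eight contour directions are extracted, so the feature vector has 200 dimensions (= 5 x 5 x 8).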
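The dimension count in paragraphs [0051] and [0052] can be checked with a small sketch. It assumes, as an interpretation of the text, that the character image is divided into a 6 x 6 grid of cells, so that 2 x 2-cell mesh regions moved one cell at a time (half overlap) give 5 x 5 regions. The input format, per-cell counts of contour elements in 8 directions, and the function name are hypothetical; contour extraction itself is not shown.

```python
def mesh_features(cell_counts, window=2, stride=1):
    """Sum per-direction contour counts over overlapping square mesh regions.

    cell_counts[r][c][d] is the number of contour elements of direction d
    in cell (r, c). With a 6x6 grid, window=2 and stride=1, there are
    5x5 mesh regions, each contributing one value per direction.
    """
    n = len(cell_counts)
    directions = len(cell_counts[0][0])
    features = []
    for top in range(0, n - window + 1, stride):
        for left in range(0, n - window + 1, stride):
            for d in range(directions):
                features.append(sum(
                    cell_counts[r][c][d]
                    for r in range(top, top + window)
                    for c in range(left, left + window)
                ))
    return features


# Dummy input: one contour element of every direction in every cell.
cells = [[[1] * 8 for _ in range(6)] for _ in range(6)]
vector = mesh_features(cells)
```

With the dummy input every component equals 4 (four cells per mesh region), and the vector length is 5 x 5 x 8 = 200, matching the dimension stated in the text.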
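[0053]
Such a feature vector is created for each character cut out of the document image 102, and the distance value between the standard character and each character is calculated. Various distance measures can be used for the distance value, such as the Mahalanobis distance, the Euclidean distance, or the city-block distance.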
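The three distance measures named in paragraph [0053] can be written as below. This is a sketch for orientation only: the Mahalanobis variant assumes a diagonal covariance matrix (per-dimension variances), which is a simplification introduced here and not something the source specifies.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    """City-block (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mahalanobis_diagonal(x, mean, variances):
    """Mahalanobis distance to a standard character, assuming a diagonal
    covariance matrix given as per-dimension variances (an assumption)."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variances)))


d_euclid = euclidean([0, 0], [3, 4])
d_city = city_block([0, 0], [3, 4])
d_mahal = mahalanobis_diagonal([3, 4], [0, 0], [9, 16])
```

Whichever measure is chosen, a smaller value means the cut-out character is closer to the standard character.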
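[0054]
Next, the segment number table created by the character lattice storage unit 108 shown in FIG. 1 is described. FIG. 9 shows an example of this table. As shown in the figure, the character lattice storage unit 108 rearranges the generated character lattice 109 by character, attaches the segment number, distance value, and recognition rank, and stores the result as the segment number table.
[0055]
The segment number, distance value, and recognition rank of each character are stored as a tuple of the form (segment number, distance value, recognition rank). In the example of FIG. 9, the character recognized as "イ" appears three times in the document image, with the tuples (64, 179, 1), (68, 214, 9), and (76, 125, 1).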
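One way to picture the segment number table of paragraphs [0054] and [0055] is as a mapping from each recognized character to its list of (segment number, distance value, recognition rank) tuples. The sketch below builds such a mapping; the function name and the flat input format are assumptions, the three "イ" tuples are the ones given for FIG. 9, and the distance and rank shown for "バ" are illustrative placeholders.

```python
from collections import defaultdict


def build_segment_table(lattice_entries):
    """Group lattice entries by character.

    lattice_entries: iterable of (segment_no, char, distance, rank).
    Returns {char: [(segment_no, distance, rank), ...]} sorted by segment number.
    """
    table = defaultdict(list)
    for segment_no, char, distance, rank in lattice_entries:
        table[char].append((segment_no, distance, rank))
    for entries in table.values():
        entries.sort()                 # ascending segment number
    return dict(table)


entries = [
    (76, "イ", 125, 1),
    (64, "イ", 179, 1),
    (69, "バ", 113, 1),   # distance and rank here are illustrative
    (68, "イ", 214, 9),
]
table = build_segment_table(entries)
```

Keeping each list sorted by segment number makes the later window search (looking for the next keyword character within a given number width) a simple forward scan.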
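[0056]
Next, the extraction of character strings by the character string extraction unit 110 using the segment number table is described. FIG. 10 illustrates this extraction process.
[0057]
As shown in the figure, when the keyword to be searched is "バイオ燃料", the character string extraction unit 110 searches the segment number table for "バ", the first character of the keyword. In FIG. 10(a), "バ" is found at segment number 69.
[0058]
Next, the following character "イ" is searched for, within a predetermined number width. The number width is set so that only an "イ" that directly follows "バ" in the document image is selected and an "イ" that does not follow "バ" is excluded.
[0059]
For example, with a number width of 12, the "イ" at segment number 76 is selected as the character following "バ", because its segment number exceeds that of "バ" by 7 (= 76 - 69). The "イ" characters at segment numbers 64 and 68 have smaller segment numbers than "バ", so they are judged to appear before "バ" and are not selected.
[0060]
Similarly, the "オ" at segment number 78 is selected as the character following "イ", because its segment number exceeds that of "イ" by 2 (= 78 - 76). The remaining characters up to "料" are selected in the same way, and a candidate string corresponding to the keyword "バイオ燃料" is extracted, as shown in FIG. 10(b). When extraction is complete, the segment numbers of the first and last characters of the extracted string are stored.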
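The window search of paragraphs [0057] to [0060] can be sketched as follows. The segment numbers (69, 64/68/76, 78, 82, 94) and the number width of 12 come from the example in the text; the function names and the simplified table that maps each character only to its sorted segment numbers are assumptions for illustration.

```python
def find_next(positions, previous, width):
    """Return the first segment number in (previous, previous + width], or None."""
    for segment_no in positions:       # positions are sorted ascending
        if previous < segment_no <= previous + width:
            return segment_no
    return None


def extract_candidates(table, keyword, width=12):
    """Return (first_segment, last_segment) for every full match of keyword."""
    results = []
    for start in table.get(keyword[0], []):
        path = [start]
        for char in keyword[1:]:
            segment_no = find_next(table.get(char, []), path[-1], width)
            if segment_no is None:
                break                  # chain broken; try the next start
            path.append(segment_no)
        else:
            results.append((path[0], path[-1]))
    return results


table = {"バ": [69], "イ": [64, 68, 76], "オ": [78], "燃": [82], "料": [94]}
found = extract_candidates(table, "バイオ燃料")
```

Only the first and last segment numbers are returned because, per the text, those are what is stored for the later association step. This version gives up as soon as one character is missing; handling unrecognized characters is a separate extension.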
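[0061]
Although a number width of 12 is used here, the width can be set to any value that lets consecutive characters be selected appropriately. For example, the number width of the characters contained in a length of three times the height of one line of the document image may be calculated, and characters searched for within that width. This setting rests on the expectation that consecutive characters in a string appear within a length of three times the line height.
[0062]
Next, the extraction process of the character string extraction unit 110 when some character cannot be recognized is described. FIG. 11 illustrates this process. Here, the case in which the "オ" at segment number 78 could not be recognized is described.
[0063]
As shown in the figure, the character string extraction unit 110 searches up to "イ" in the same way as in FIG. 10 and then searches for "オ" within a number width of 12. However, because the "オ" at segment number 78 could not be recognized, it determines that there is no "オ".
[0064]
In that case, the character string extraction unit 110 widens the number width and searches for "燃", the character that follows "オ". The widened number width can be set freely. Here it is set to twice the initial width, that is, 24 (= 12 x 2).
[0065]
With this setting, the "燃" at segment number 82 can be selected. After that, the search for the subsequent characters continues with the normal number width.
[0066]
If the first character of the keyword cannot be recognized, the segment number from which to start is undetermined, so the search for the characters of the keyword cannot begin. In this case, the keyword is searched for in reverse order.
[0067]
That is, when the "バ" of "バイオ燃料" cannot be recognized, "料" is searched for first, and then "燃" is searched for in reverse order. In the example of FIG. 11, when the "料" at segment number 94 has been selected, "燃" is searched for in reverse order with a number width of 12. The search continues in the same way up to "イ".
[0068]
When some characters cannot be recognized, the string made up of the characters found by the character string extraction unit 110 is extracted as a string corresponding to the keyword if the ratio of the number of unrecognized characters to the number of characters in the keyword is at or below a predetermined threshold.
[0069]
In the example of FIG. 11, the keyword "バイオ燃料" has 5 characters and 1 character could not be recognized, so the unrecognized ratio is 20%. If the threshold is set to 30%, this string is extracted as a string corresponding to the keyword, and the segment numbers of its first and last characters are stored. The threshold can be set freely.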
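A minimal sketch of the skip-and-widen search in paragraphs [0062] to [0069] follows, covering the forward direction only (the reverse search for an unrecognized first character is not shown). The width of 12, the doubling to 24 after one miss, and the 30% threshold are taken from the example; the function name and the rule of widening by one extra width per consecutive miss are assumptions, since the text only says the widened width can be set freely.

```python
def search_with_skips(table, keyword, width=12, max_unrecognized_ratio=0.3):
    """Forward search that skips unrecognized characters.

    table maps each recognized character to its sorted segment numbers.
    Returns (first_segment, last_matched_segment, unrecognized_count) tuples.
    """
    results = []
    for start in table.get(keyword[0], []):
        last, misses, pending = start, 0, 0
        for char in keyword[1:]:
            limit = width * (pending + 1)      # 12 normally, 24 after one miss
            segment_no = next(
                (s for s in table.get(char, []) if last < s <= last + limit),
                None,
            )
            if segment_no is None:
                misses += 1                    # treat as unrecognized and skip
                pending += 1
            else:
                last, pending = segment_no, 0
        if misses / len(keyword) <= max_unrecognized_ratio:
            results.append((start, last, misses))
    return results


# "オ" (segment 78) was not recognized, so it is absent from the table.
table = {"バ": [69], "イ": [64, 68, 76], "燃": [82], "料": [94]}
found = search_with_skips(table, "バイオ燃料")
```

In this run "オ" is counted as one unrecognized character out of five (20%), which is under the 30% threshold, so the string spanning segments 69 to 94 is kept as a candidate.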
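[0070]
Next, the search for the character string forming the keyword by the character association unit 111, the distance value assignment unit 112, and the keyword search unit 113 is described. FIG. 12 illustrates this search process.
[0071]
First, the character association unit 111 uses dynamic programming to associate each character of the string extracted by the character string extraction unit 110 with a character of the keyword.
[0072]
Based on the segment numbers of the first and last characters of the string extracted by the character string extraction unit 110, shown in FIG. 12(a), the character association unit 111 looks up the basic segments that form the string and assigns a number to each basic segment, as shown in FIG. 12(b).
[0073]
Then, as shown in FIG. 12(c), the numbered basic segments are arranged along the horizontal axis and the characters of the keyword along the vertical axis, and each basic segment is associated with a keyword character by dynamic programming.
[0074]
In FIG. 12(c), a line that moves upward, as with arrow (1), means that one basic segment corresponds to multiple characters. This corresponds to a segmentation error in which one basic segment contains two different characters, or to a case in which two different recognition results were obtained for one basic segment.
[0075]
A line that moves diagonally, as with arrow (2), indicates that one basic segment corresponds to one character. A line that moves horizontally, as with arrow (3), indicates that multiple basic segments correspond to one character. This corresponds, for example, to a kanji made up of a left-hand radical and a right-hand component that has been split into two basic segments.
[0076]
In the example of FIG. 12(c), dynamic programming detects basic segments 1 to 3 as corresponding to "バ", basic segments 6 to 9 as corresponding to "燃", and basic segments 10 and 11 as corresponding to "料".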
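The association step of paragraphs [0071] to [0076] can be approximated by a simple dynamic-programming sketch in which every keyword character is matched to one or more consecutive basic segments. This is an illustration under stated assumptions, not the patent's exact recurrence: the cost function, the `max_span` limit, and the penalty of 1000 are hypothetical, and the upward move of arrow (1) (one basic segment shared by several characters) is not modeled. The assignment of segments 4 and 5 to "イ" and "オ" and the use of the FIG. 12(d) distances as span costs are inferred from the text.

```python
def align(keyword, n_segments, cost, max_span=4):
    """Match each keyword character to a run of consecutive basic segments.

    cost(char, first, last) returns the distance for recognizing segments
    first..last (1-based, inclusive) as char. Returns (total_cost, spans).
    """
    inf = float("inf")
    m = len(keyword)
    best = [[inf] * (n_segments + 1) for _ in range(m + 1)]
    back = [[0] * (n_segments + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n_segments + 1):
            for k in range(max(0, j - max_span), j):
                if best[i - 1][k] == inf:
                    continue
                total = best[i - 1][k] + cost(keyword[i - 1], k + 1, j)
                if total < best[i][j]:
                    best[i][j], back[i][j] = total, k
    if best[m][n_segments] == inf:
        return inf, []
    spans, j = [], n_segments
    for i in range(m, 0, -1):          # trace the best path backwards
        k = back[i][j]
        spans.append((keyword[i - 1], k + 1, j))
        j = k
    return best[m][n_segments], spans[::-1]


# Illustrative costs: known spans get their distance, anything else a penalty.
known = {("バ", 1, 3): 113, ("イ", 4, 4): 125, ("オ", 5, 5): 204,
         ("燃", 6, 9): 164, ("料", 10, 11): 196}
total, spans = align("バイオ燃料", 11, lambda c, a, b: known.get((c, a, b), 1000))
```

In practice the cost would come from re-running recognition on each combined span; here a lookup table stands in for that step.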
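[0077]
FIG. 12(d) shows the string after the characters have been associated in this way. For "燃" and "料", whose combinations of basic segments were revised by the association, the distance value assignment unit 112 newly assigns the distance value and recognition rank of the character as it stands after the association process.
[0078]
The keyword search unit 113 then calculates the average distance value between the extracted string and the keyword. Specifically, the distance value of each character forming the extracted string is weighted according to its recognition rank, and the mean of the weighted distance values is taken. The weight is given by 2 x N / (1 + N), where N is the recognition rank. When the recognition rank is 1 (N = 1) the weight is 1, and when the recognition rank is 10 (N = 10) the weight is approximately 1.82.
[0079]
In the example of FIG. 12(d), the recognition ranks of "バ", "イ", "オ", "燃", and "料" are 1, 1, 2, 1, and 1, so their weights are 1, 1, 1.33, 1, and 1. The average distance value of the string "バイオ燃料" is therefore calculated as (1 x 113 + 1 x 125 + 1.33 x 204 + 1 x 164 + 1 x 196) / 5 = 173.9. The formula for the average distance value is not limited to this one; other formulas may be used.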
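The weighted average of paragraphs [0078] and [0079] is easy to reproduce. The sketch below uses the weight 2N / (1 + N) and the distance values and ranks of the FIG. 12(d) example; the function names are assumptions. Note that the text rounds the weight for rank 2 to 1.33 and reports 173.9, whereas the unrounded weight 4/3 gives 174.0.

```python
def rank_weight(rank):
    """Weight 2N / (1 + N) for recognition rank N (rank 1 gives weight 1)."""
    return 2 * rank / (1 + rank)


def average_distance(chars):
    """Mean of rank-weighted distances; chars is a list of (distance, rank)."""
    return sum(rank_weight(rank) * distance for distance, rank in chars) / len(chars)


# (distance, rank) for "バ", "イ", "オ", "燃", "料" from the FIG. 12(d) example.
matched = [(113, 1), (125, 1), (204, 2), (164, 1), (196, 1)]
average = average_distance(matched)
```

Because the weight grows from 1 toward 2 as the rank number increases, a character that was only a low-ranked candidate contributes up to twice its distance to the average.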
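[0080]
Once the average distance value has been calculated, the keyword search unit 113 compares it with a predetermined threshold for the average distance value. If the calculated average is smaller than the threshold, it judges that the string matches the keyword and outputs the result. The threshold can be, for example, the mean of the 3σ values of the characters forming the keyword.
[0081]
Next, the calculation of the average distance value when some character cannot be recognized is described. FIG. 13 illustrates this calculation.
[0082]
As shown in FIG. 13(a), the character association unit 111 associates basic segments 1 to 4 with the keyword characters "バ" and "イ", as in FIG. 12(c). Basic segment 5 could not be recognized, and which character it actually is remains unknown. In this case, the character association unit 111 treats that segment as corresponding to "オ" and moves on to associating the basic segments for the next character, "燃". The subsequent processing is the same as described for FIG. 12(c).
[0083]
FIG. 13(b) shows the string after the basic segments have been associated in this way. Here too, "燃" and "料", whose combinations of basic segments were revised, are newly assigned the distance value and recognition rank of the revised characters.
[0084]
The distance value of the unrecognized character "オ", however, is set to three times the standard deviation σ of the distance values of "オ" (the variance being σ squared), that is, 3σ. In this example 3σ is 261. The recognition rank is set low; in this example it is set to 10th.
[0085]
In this example, the recognition ranks of "バ", "イ", "オ", "燃", and "料" are 1, 1, 10, 1, and 1, so their weights are 1, 1, 1.82, 1, and 1. The average distance value of the string "バイオ燃料" is therefore calculated as (1 x 113 + 1 x 125 + 1.82 x 261 + 1 x 164 + 1 x 196) / 5 = 214.6.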
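The substitution described in paragraphs [0084] and [0085] can be sketched as below: an unrecognized character receives the 3σ value of the keyword character it was associated with, plus a fixed low rank, before the weighted average is taken. The value 261 and the rank of 10 come from the example; the function names, the `None` marker for an unrecognized character, and the lookup dictionary are assumptions. With the unrounded weight 20/11 the result is about 214.5, while the text, using the rounded weight 1.82, reports 214.6.

```python
def rank_weight(rank):
    """Weight 2N / (1 + N) for recognition rank N."""
    return 2 * rank / (1 + rank)


def average_with_unrecognized(chars, three_sigma, fallback_rank=10):
    """Rank-weighted mean distance with substitution for unrecognized characters.

    chars: list of (keyword_char, distance, rank); distance is None when the
    associated segment could not be recognized.
    three_sigma: {keyword_char: 3 * standard deviation of its distance values}.
    """
    total = 0.0
    for char, distance, rank in chars:
        if distance is None:
            # Substitute the character-specific 3-sigma value and a low rank.
            distance, rank = three_sigma[char], fallback_rank
        total += rank_weight(rank) * distance
    return total / len(chars)


aligned = [("バ", 113, 1), ("イ", 125, 1), ("オ", None, None),
           ("燃", 164, 1), ("料", 196, 1)]
average = average_with_unrecognized(aligned, {"オ": 261})
```

The resulting average would then be compared with the threshold in the same way as for a fully recognized string.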
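[0086]
Once the average distance value has been calculated, the keyword search unit 113 compares it with the predetermined threshold for the average distance value. If the calculated average is smaller than the threshold, it judges that the string matches the keyword and outputs the result.
[0087]
The distance value of an unrecognized character can be set to any value. However, if it is set too small (for example, to 0), a string will be retrieved whatever the unrecognized character actually is, provided that the other characters have small distance values and low recognition rank numbers. Strings unrelated to the keyword are then retrieved, and search noise increases.
[0088]
Conversely, if the distance value of an unrecognized character is set too large, the average distance value of the string becomes large and search omissions become likely however small the distance values of the other characters are. The same applies to the setting of the recognition rank.
[0089]
The present invention therefore solves these problems by setting the distance value and recognition rank of an unrecognized character to values of appropriate magnitude. The 3σ value is used as the distance value because 3σ carries information about the character's features.
[0090]
That is, in the example of FIG. 13(a), a large 3σ (or variance) for "オ" means that the features of "オ" vary widely. In that case the difference in features from other characters (for example, "才") becomes less distinct, and it becomes uncertain whether the character really is "オ" and, further, whether the string containing it matches the keyword.
[0091]
Accordingly, when the character in the string that corresponds to the keyword character "オ" cannot be recognized, setting its distance value to 3σ makes the string less likely to be selected as a search result, which reduces search noise. The value set as the distance value is not limited to 3σ; it may, for example, be a value based on the mean of the distance values of "オ".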
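[0092]
Next, the procedure of the character string search process according to this embodiment is described. FIG. 14 is a flowchart showing this procedure. As shown in the figure, the line/character segmentation unit 104 first performs layout analysis and then cuts out lines and characters (step S1401).
[0093]
The distance value calculation unit 105 calculates the distance value between the feature quantities of each cut-out character and the standard characters (step S1402). The character recognition unit 106 then recognizes each character on the basis of the calculated distance values (step S1403). When multiple recognition results are obtained for a character, the recognition rank assignment unit 107 sets the recognition ranks in ascending order of distance value (step S1404).
[0094]
The character lattice storage unit 108 then generates the character lattice and stores the recognition results, together with the distance value and recognition rank of each character, as the segment number table (step S1405).
[0095]
The character string extraction unit 110 uses the segment number table to extract a string corresponding to the specified keyword (step S1406), and the character association unit 111 associates each character of the extracted string with a character in the keyword (step S1407).
[0096]
Next, the distance value assignment unit 112 checks whether the associated string contains a character that cannot be recognized (step S1408). If it does (step S1408, Yes), the unit sets, as the distance value of the unrecognized character, a value based on the variance of the distance values of the corresponding keyword character, and sets the recognition rank of the unrecognized character to a predetermined rank (step S1409). If there is no unrecognized character (step S1408, No), the process moves directly to step S1410.
[0097]
The keyword search unit 113 then calculates the average distance value of the extracted string and, if the calculated average is smaller than a predetermined threshold, judges the extracted string to correspond to the keyword and outputs it as a search result (step S1410).
[0098]
As described above, in this embodiment, when a string being searched contains a character that cannot be recognized, a predetermined value (for example, one based on the standard deviation) is set as that character's distance value and a low rank (for example, 10th) is set as its recognition rank. The average distance value between the extracted string and the keyword is then calculated, and the extracted string is output as matching the keyword when the average is smaller than a predetermined threshold. Because the judgment is made on the average distance value of the whole string even when some character cannot be recognized, keywords can be searched for with few search omissions and little search noise.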
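[0099]
An embodiment of the present invention has been described above, but the present invention may also be carried out in various other embodiments within the scope of the technical idea set forth in the claims.
[0100]
For example, although this embodiment deals with handwritten document images, the present invention is not limited to this and can equally be applied to document images of printed characters.
[0101]
Although this embodiment applies the present invention to horizontally written documents, the present invention is not limited to this and can readily be applied to vertically written documents.
[0102]
Although this embodiment uses a character lattice to search for the string forming the keyword, the present invention is not limited to this, and the search may be performed with a known string search method that does not use a character lattice.
[0103]
Furthermore, although this embodiment searches a document image such as a form for a string corresponding to a single keyword, the present invention is not limited to this. It can also be applied to building an index of document images by searching the document images for strings that match words in a word dictionary and selecting several of the words that were found most often.
[0104]
Although this embodiment searches free-pitch documents, in which the spacing between adjacent characters is unconstrained, the present invention is not limited to this and can equally be applied to documents with fixed character spacing, such as documents with entry boxes.
[0105]
Although this embodiment assigns the predetermined distance value and recognition rank to an unrecognized character after dynamic programming has been applied, the present invention is not limited to this; they may be assigned at the point when character recognition is performed and the character is found to be unrecognizable.
[0106]
Furthermore, although this embodiment searches for the keyword string in order from its first character and, when the first character cannot be recognized, searches in reverse order from the last character, the present invention is not limited to this. When the first character cannot be recognized, the search may instead proceed in order from the next character onward.
[0107]
Of the processes described in this embodiment, all or part of those described as being performed automatically may be performed manually, and all or part of those described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in this document and the drawings may be changed freely unless otherwise noted.
[0108]
The components of the illustrated character string search device are functional concepts and need not be physically configured as illustrated. That is, the specific form of the device is not limited to the one illustrated, and all or part of it may be functionally or physically distributed or integrated in arbitrary units according to load, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by the device may be realized by a CPU and a program interpreted and executed by that CPU, or as hardware using wired logic.
[0109]
The character string search method described in this embodiment can be realized by executing a prepared program on a computer such as a personal computer or a workstation. The program can be distributed over a network such as the Internet. It can also be recorded on a computer-readable recording medium such as a hard disk, flexible disk (FD), CD-ROM, MO, or DVD, and executed by being read from the recording medium by the computer.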
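[0110]
[Effects of the Invention]
As described above, according to the present invention, when a given character forming a document cannot be recognized even using the distance value to the standard characters, a predetermined distance value is assigned to that character, and the string forming the keyword is searched for in the document on the basis of the assigned distance value and the distance values of the recognized characters forming the document. Therefore, even when some characters cannot be recognized, a string corresponding to a predetermined keyword can be searched for efficiently while reducing search omissions and search noise.
[0111]
According to the present invention, the one or more recognition candidates obtained by recognizing each character forming the document are stored as a character lattice, the stored recognition results are read out, and the string forming the keyword is searched for in the stored recognition results on the basis of the distance value assigned to each unrecognized character and the distance values of the recognized characters. Therefore, the string can be searched for among the various recognized character candidates, and a string corresponding to a predetermined keyword can be searched for efficiently.
[0112]
According to the present invention, when a predetermined character string is accepted as a keyword, each character of the keyword string is searched for in order from the first character in light of the recognition result; when a given character of the string cannot be recognized even using the distance value to the standard characters, that character is skipped and the search continues; the string is extracted when the ratio of the number of unrecognized characters to the number of characters in the string is at or below a predetermined value; whether the extracted string is the string forming the keyword is judged on the basis of the predetermined distance value assigned to each unrecognized character and the distance values of the recognized characters in the extracted string; and the keyword string is searched for on the basis of that judgment. Therefore, the string forming the keyword can be searched for efficiently even when some characters cannot be recognized.
[0113]
According to the present invention, when the first character of the keyword string cannot be recognized, each character of the keyword string is searched for in reverse order from the last character; when a given character cannot be recognized even using the distance value to the standard characters, that character is skipped and the search continues; and the string is extracted when the ratio of the number of unrecognized characters to the number of characters in the string is at or below a predetermined value. Therefore, the string forming the keyword can be searched for efficiently even when its first character cannot be recognized.
[0114]
According to the present invention, dynamic programming is applied to the extracted string to associate each of its characters with a character of the keyword string; when a given character of the string to which dynamic programming was applied could not be recognized even using the distance value to the standard characters, a value relating to the keyword character with which it was associated is assigned to it as the predetermined distance value; whether the extracted string is the string forming the keyword is judged on the basis of the assigned distance value and the distance values of the recognized characters; and the keyword string is searched for on the basis of that judgment. Therefore, even when some characters cannot be recognized, an appropriate distance value can be assigned to them through dynamic programming, and search omissions and search noise can be reduced.
[0115]
According to the present invention, when a given character of the string to which dynamic programming was applied could not be recognized even using the distance value to the standard characters, a predetermined distance value based on the variance or standard deviation of the distance values of the keyword character with which it was associated is assigned to it. Therefore, even when some characters cannot be recognized, an appropriate distance value carrying information about character features can be assigned to them, and search omissions and search noise can be reduced.
[0116]
According to the present invention, the one or more recognition candidates stored as the character lattice are numbered in ascending order of distance value to give their recognition ranks, and the string forming the keyword is searched for in the document on the basis of the assigned recognition ranks, the distance value assigned to each unrecognized character, and the distance values of the recognized characters. Taking the recognition rank into account therefore further reduces search omissions and search noise.
[0117]
According to the present invention, when a given character forming a document cannot be recognized even using the distance value to the standard characters, a predetermined distance value is assigned to that character, and the string forming the keyword is searched for in the document on the basis of the assigned distance value and the distance values of the recognized characters forming the document. Therefore, even when some characters cannot be recognized, a string corresponding to a predetermined keyword can be searched for efficiently while reducing search omissions and search noise.
[0118]
According to the present invention, when a given character forming a document cannot be recognized even using the distance value to the standard characters, a predetermined distance value is assigned to that character, and the string forming the keyword is searched for in the document on the basis of the assigned distance value and the distance values of the recognized characters forming the document. Therefore, even when some characters cannot be recognized, a string corresponding to a predetermined keyword can be searched for efficiently while reducing search omissions and search noise.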
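[Brief Description of the Drawings]
[FIG. 1] A functional block diagram showing the configuration of the character string search device according to the embodiment.
[FIG. 2] A diagram showing an example of a horizontally written handwritten document to be searched by the character string search device.
[FIG. 3] A diagram illustrating line segmentation by the line/character segmentation unit.
[FIG. 4] A diagram showing an example of basic segments cut out by the line/character segmentation unit.
[FIG. 5] A diagram showing an example of candidate segments created by combining basic segments.
[FIG. 6] A diagram showing an example of combinations of basic segments and candidate segments arranged in order of appearance in the document image.
[FIG. 7] A diagram showing an example of the structure of the character lattice.
[FIG. 8] An explanatory diagram of the calculation of the distance value between character feature quantities by the character recognition unit shown in FIG. 1.
[FIG. 9] A diagram showing an example of the segment number table created by the character lattice storage unit shown in FIG. 1.
[FIG. 10] An explanatory diagram of string extraction by the character string extraction unit using the segment number table.
[FIG. 11] An explanatory diagram of string extraction by the character string extraction unit when some character cannot be recognized.
[FIG. 12] An explanatory diagram of the search for the string forming the keyword by the character association unit, the distance value assignment unit, and the keyword search unit.
[FIG. 13] An explanatory diagram of the calculation of the average distance value when some character cannot be recognized.
[FIG. 14] A flowchart showing the procedure of the character string search process according to the embodiment.
[Explanation of Reference Signs]
100 Character string search device
101 Document image storage unit
102 Document image
103 Search string reception unit
104 Line/character segmentation unit
105 Distance value calculation unit
106 Character recognition unit
107 Recognition rank assignment unit
108 Character lattice storage unit
109 Character lattice
110 Character string extraction unit
111 Character association unit
112 Distance value assignment unit
113 Keyword search unit
114 Control unit
[0001]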
[Technical Field of the Invention]
The present invention relates to a character string search device, a character string search method, and a program for causing a computer to execute the method which, on accepting a predetermined character string as a keyword, recognize each character forming a document to be searched on the basis of the distance value between that character and standard characters registered in advance, and search the document for the character string forming the keyword in light of the recognition result. More particularly, it relates to a character search device, a character search method, and a program for causing a computer to execute the method that can efficiently search for a character string corresponding to a predetermined keyword while reducing search omissions and search noise even when some characters cannot be recognized.
[0002]
[Prior art]
Electronic filing, in which handwritten documents captured by a scanner or the like are stored and managed as image data, is widely known. Because electronic filing accumulates a large amount of image data, developing efficient document search techniques is an important issue.
[0003]
For this reason, Non-Patent Document 1 discloses a technique for performing character recognition of a handwritten document stored as image data and extracting a document including a designated keyword by full-text search.
[0004]
In the technique of Non-Patent Document 1, multiple recognition candidates for each character cut out of the image data are represented as a lattice structure, and searching with this lattice improves the recognition accuracy for character strings. However, this technique has the problem that when the character string being searched for contains a character that cannot be recognized, the character string is missed by the search.
[0005]
To address this, Patent Document 1 discloses a document filing device that, when searching image data for a character string, calculates a similarity by comparing the shape features of the character image of an unrecognizable character with those of characters created in advance, and outputs the character string containing the unrecognizable character as a search result when the similarity is greater than a threshold.
[0006]
Patent Document 2 discloses a filing device that, even when the character string for the keyword being searched contains some characters that cannot be recognized, outputs that character string as a search result if the ratio of the number of matched characters to the number of characters in the whole string is at or above a threshold.
[0007]
[Non-Patent Document 1]
Shuji Senda and two others, "Construction of a document image database system capable of full-text search", Digital Libraries (ディジタル図書館), University of Library and Information Science, October 23, 1996, No. 8
[Patent Document 1]
JP 2000-57315 A
[Patent Document 2]
JP H08-272813 A
[0008]
However, the conventional technique of Patent Document 1 calculates the similarity by comparing an unrecognizable character with the shape features of characters created in advance. Because an unrecognizable character is one whose image is of low quality and hard to discriminate in the first place, the similarity comes out small, and omissions of character strings are not improved much.
[0009]
Furthermore, with this conventional technique, when the image of a handwritten character to be recognized is of low quality, forcing a comparison of shape features can cause the character to be recognized as a different character from the correct one, which may actually worsen search efficiency.
[0010]
In the conventional technique of Patent Document 2, a character string is retrieved whenever the rest of it matches well, whatever the unrecognized character may be. Character strings unrelated to the keyword are therefore retrieved as search noise, which confuses the user.
[0011]
For these reasons, an important issue is how to realize a technique that can efficiently search for a character string corresponding to a predetermined keyword while reducing search omissions and search noise even when some characters cannot be recognized.
[0012]
The present invention has been made to solve the problems described above, and its object is to provide a character search device, a character search method, and a program for causing a computer to execute the method that can efficiently search for a character string corresponding to a predetermined keyword while reducing search omissions and search noise even when some characters cannot be recognized.
[0013]
[Means for Solving the Problems]
To achieve the above object, the present invention provides a character string search device that, on accepting a predetermined character string as a keyword, recognizes each character forming a document to be searched on the basis of the distance value between that character and standard characters registered in advance, and searches the document for the character string forming the keyword in light of the recognition result. The device comprises distance value assigning means for assigning a predetermined distance value to a given character forming the document when that character cannot be recognized even using the distance value to the standard characters, and keyword search means for searching for the character string forming the keyword in the document on the basis of the distance value assigned by the distance value assigning means and the distance values of the recognized characters forming the document.
[0014]
The present invention further comprises character lattice storage means for storing, as a character lattice, the one or more recognition candidates obtained by recognizing each character forming the document. The keyword search means reads out the recognition results stored as the character lattice by the character lattice storage means and searches those results for the character string forming the keyword in the document, on the basis of the distance value assigned by the distance value assigning means and the distance values of the recognized characters forming the document.
[0015]
The present invention further comprises character string extracting means that, when the predetermined character string is accepted as a keyword, searches for each character of the keyword string in order from the first character in light of the recognition result; skips a given character and continues the search when that character cannot be recognized even using the distance value to the standard characters; and extracts the character string when the ratio of the number of unrecognized characters to the number of characters in the string is at or below a predetermined value. The keyword search means judges whether the extracted character string is the string forming the keyword on the basis of the predetermined distance value assigned by the distance value assigning means to each unrecognized character in the extracted string and the distance values of the recognized characters in the extracted string, and searches for the keyword string on the basis of that judgment.
[0016]
In the present invention, when the first character of the keyword string cannot be recognized, the character string extracting means searches for each character of the keyword string in reverse order from the last character; skips a given character and continues the search when that character cannot be recognized even using the distance value to the standard characters; and extracts the character string when the ratio of the number of unrecognized characters to the number of characters in the string is at or below a predetermined value.
[0017]
The present invention further comprises character associating means that applies dynamic programming to the character string extracted by the character string extracting means and associates each character of the extracted string with a character of the keyword string. When a given character of the string to which dynamic programming was applied could not be recognized even using the distance value to the standard characters, the distance value assigning means assigns to that character, as the predetermined distance value, a value relating to the keyword character with which the character associating means associated it. The keyword search means judges whether the extracted string is the string forming the keyword on the basis of the distance value assigned by the distance value assigning means and the distance values of the recognized characters in the string to which dynamic programming was applied, and searches for the keyword string on the basis of that judgment.
[0018]
In the present invention, when a given character of the string to which dynamic programming was applied by the character associating means could not be recognized even using the distance value to the standard characters, the distance value assigning means assigns to that character the predetermined distance value based on the variance or standard deviation of the distance values of the keyword character with which the character associating means associated it.
[0019]
Further, the present invention further comprises recognition rank assigning means for assigning numbers, as recognition ranks, to the character recognition candidates stored as a character lattice in the character lattice storage means, in ascending order of the distance values of those candidates. The keyword search means searches for the character string forming the keyword in the document based on the recognition rank given by the recognition rank assigning means, the distance value given by the distance value assigning means, and the distance value of each recognized character forming the document.
[0020]
The present invention also provides a character string search method in which, when a predetermined character string is accepted as a keyword, each character forming a document to be searched is recognized based on a distance value between that character and a standard character registered in advance, and a character string forming the keyword is searched for in the document based on the recognition result. The method includes a distance value giving step of giving a predetermined distance value to a given character forming the document when that character cannot be recognized even by using the distance value from the standard character, and a keyword search step of searching for the character string forming the keyword in the document based on the distance value given in the distance value giving step and the distance value of each recognized character forming the document.
[0021]
The present invention also provides a program that causes a computer to execute a character string search method in which, when a predetermined character string is accepted as a keyword, each character forming a document to be searched is recognized based on a distance value between that character and a standard character registered in advance, and a character string forming the keyword is searched for in the document based on the recognition result. The program causes the computer to execute a distance value giving step of giving a predetermined distance value to a given character forming the document when that character cannot be recognized even by using the distance value from the standard character, and a keyword search step of searching for the character string forming the keyword in the document based on the distance value given in the distance value giving step and the distance value of each recognized character forming the document.
[0022]
DETAILED DESCRIPTION OF THE INVENTION
Exemplary embodiments of a character string search apparatus, a character string search method, and a program that causes a computer to execute the method will be described below in detail with reference to the accompanying drawings. In the present embodiment, a case where a predetermined search character string is searched from a horizontally written handwritten document image will be described.
[0023]
First, the configuration of the character string search device according to the present embodiment will be described. FIG. 1 is a functional block diagram showing the configuration of the character string search device according to the present embodiment. As shown in the figure, the character string search device 100 is a device that, when a document containing a specific keyword is to be found among a plurality of handwritten document images 102 such as forms, accepts a designated keyword and searches for whether that keyword is contained in each document image 102.
[0024]
In this character string search device 100, even if some characters cannot be recognized, the character string can still be retrieved if the degree of matching between the keyword and the character string in the document image 102 is high as a whole.
[0025]
The character string search device 100 includes a document image storage unit 101, a document image 102, a search character string reception unit 103, a line / character cutout processing unit 104, a distance value calculation unit 105, a character recognition processing unit 106, a recognition order assignment unit 107, A character lattice storage unit 108, a character lattice 109, a character string extraction unit 110, a character association unit 111, a distance value assignment unit 112, a keyword search unit 113, and a control unit 114 are included.
[0026]
The document image storage unit 101 is a storage unit that stores and reads out a plurality of handwritten document images 102. The search character string reception unit 103 is a reception unit that receives a character string serving as the keyword to be searched for in the document image 102. It receives, for example, a character string input from a keyboard or a character string transmitted from a terminal device via a network.
[0027]
The line / character cutout processing unit 104 is a processing unit that specifies the document area in the document image 102 to be subjected to character recognition, cuts out lines from the document in that area, and further cuts out, from each line, each portion considered to constitute one character.
[0028]
The specification of the document area for retrieving the character string is specifically performed according to the layout rule for each document. This specification may be performed manually. Further, the document area may be specified by another known method.
[0029]
The distance value calculation unit 105 calculates a distance value between a feature value of a standard character created by averaging characters written by various people and a feature value of each character in the cut document image 102. It is a calculation part. This standard character is stored in a storage unit (not shown). The standard character creation method is not limited to the method described here, and various methods can be used.
[0030]
The character recognition processing unit 106 is a processing unit that performs character recognition of the extracted character. Specifically, the distance value calculated by the distance value calculation unit 105 is referred to, and when the distance value is smaller than a predetermined threshold, it is recognized that the character corresponds to the compared standard character.
[0031]
The recognition rank assigning unit 107 is an assigning unit that assigns a recognition rank to each of the plurality of character candidates obtained by performing character recognition on a given character, based on the distance value obtained for each candidate. Specifically, the distance values of the character candidates are compared, and numbers are assigned in ascending order of distance value as the recognition ranks.
[0032]
The character lattice storage unit 108 is a storage unit that stores and reads out, as a character lattice 109, the one or more character recognition candidates obtained by character recognition of each character forming the document image 102. The distance value and recognition rank of each character candidate are stored together with it. The character lattice 109 will be described in detail later.
[0033]
The character string extraction unit 110 is an extraction unit that extracts a character string forming the keyword to be searched for from the document image 102 by using the character lattice 109. Specifically, each character forming the keyword is searched for in the character lattice in order from the first character.
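As an illustration only, the thresholding performed by the character recognition processing unit 106 and the ranking performed by the recognition rank assigning unit 107 might be sketched in Python as follows. The function name `rank_candidates`, the threshold of 300, and the sample distance values are assumptions made for this sketch and do not appear in the specification.

```python
def rank_candidates(distances, threshold):
    """distances: dict mapping a standard character to its distance value.

    Keeps only candidates whose distance is below the threshold and
    assigns recognition ranks 1, 2, ... in ascending order of distance.
    Returns a list of (character, distance, rank) tuples.
    """
    accepted = [(ch, d) for ch, d in distances.items() if d < threshold]
    accepted.sort(key=lambda item: item[1])
    return [(ch, d, rank) for rank, (ch, d) in enumerate(accepted, start=1)]


# Hypothetical distance values for one cut-out character image.
candidates = rank_candidates({"O": 204, "age": 150, "B": 900}, threshold=300)
print(candidates)
```

An empty result corresponds to the case the description calls a character that cannot be recognized.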
[0034]
At this time, if there is a character that cannot be recognized, the character is skipped and the next character is searched. When the ratio of the number of unrecognizable characters to the number of characters in the character string forming the keyword is equal to or less than a predetermined value, the character string is extracted as a keyword candidate to be searched.
[0035]
Also, as will be explained later, if the first character of the keyword cannot be recognized, the search for the character string forming the keyword cannot be started even if the remaining part matches the keyword. In that case, the search for character string candidates is performed in reverse order from the last character of the keyword to prevent search omissions.
[0036]
The character association unit 111 is a processing unit that uses dynamic programming to associate each character forming the extracted character string with each character of the keyword to be searched for.
[0037]
When there is an unrecognizable character in the character string associated by the character association unit 111, the distance value assigning unit 112 assigns, as the distance value of that character, a value based on the variance of the distance values of the keyword character corresponding to it. It also sets a predetermined rank as the recognition rank of the unrecognizable character.
[0038]
The keyword search unit 113 is a search unit that calculates an average distance value of the character string extracted by the character string extraction unit 110 and, if the calculated average distance value is smaller than a predetermined threshold, determines that the character string corresponds to the keyword and outputs it as a search result. The average distance value is calculated from the distance values and recognition ranks assigned to unrecognizable characters and the distance values and recognition ranks of the characters that were recognized.
[0039]
The control unit 114 is a control unit that performs overall control of the character string search device 100, and is a control unit that controls transmission and reception of various data between the functional units.
[0040]
FIG. 2 is a diagram illustrating an example of a horizontally written handwritten document to be searched by the character string search device 100. As shown in the figure, the character string search apparatus 100 can search a character string from a free-pitch document in which the character spacing between adjacent characters is free.
[0041]
Next, line and character cutout processing by the line / character cutout processing unit 104 will be described. FIG. 3 is a diagram for explaining line cutout processing by the line / character cutout processing unit 104. As shown in the figure, the line / character cutout processing unit 104 creates a histogram (line histogram) representing the frequency of black pixels in the horizontal direction, and cuts out each line by using the valleys between the peaks of the line histogram as the line cutout positions.
[0042]
The character cutout processing can be performed in the same manner. Specifically, a histogram (character histogram) representing the frequency of black pixels in the vertical direction is created for each cut-out line, and each character is cut out by using the valleys between the peaks of the character histogram as the character cutout positions.
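The histogram-based cutout described above can be sketched as follows. This is a simplified illustration, not the specification's implementation: it assumes a binary image given as a list of rows of 0/1 values, and it treats only completely empty histogram bins as valleys, whereas real handwriting would need a more tolerant valley criterion. The name `projection_runs` is hypothetical.

```python
def projection_runs(image, axis):
    """image: list of rows of 0/1 pixels (1 = black).

    axis="rows" builds the line histogram (black pixels per row);
    axis="cols" builds the character histogram (black pixels per column).
    Returns (start, end) index pairs, end exclusive, of consecutive
    non-empty bins; the empty bins between them act as cut positions.
    """
    if axis == "rows":
        histogram = [sum(row) for row in image]
    else:
        histogram = [sum(col) for col in zip(*image)]
    runs, start = [], None
    for i, count in enumerate(histogram):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(histogram)))
    return runs


image = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
]
lines = projection_runs(image, "rows")           # two text lines
top, bottom = lines[0]
chars = projection_runs(image[top:bottom], "cols")  # characters in line 1
print(lines, chars)
```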
[0043]
Next, the structure of the character lattice 109 shown in FIG. 1 will be described. FIG. 4 is a diagram illustrating an example of the basic segment cut out by the line / character cut-out processing unit 104. As shown in the figure, the line / character cutout processing unit 104 sets character division position candidates for each line by using a character histogram, and cuts out characters according to the division positions to create basic segments.
[0044]
The basic segment is a minimum unit that can be a character candidate divided by a character histogram. In the basic segment, there is a case where there is a part separated into the left and right in one character (for example, “Company”, “Ha”, etc.), and the character may be erroneously cut out as two characters.
[0045]
FIG. 5 is a diagram illustrating an example of candidate segments created by combining basic segments. As shown in the figure, in order to cope with a case where one character is erroneously cut out to two or more characters, candidate segments combining basic segments are created, and various character candidates are listed.
[0046]
FIG. 6 is a diagram illustrating an example of combinations of basic segments and candidate segments arranged in the order of appearance in a document. As shown in the figure, character recognition can be performed more accurately by combining basic segments and candidate segments and considering various character candidates.
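One way to enumerate candidate segments from basic segments is to list every run of adjacent basic segments up to some maximum length. The sketch below assumes such a limit (`max_merge`), which the specification does not state, and uses the hypothetical name `candidate_segments`.

```python
def candidate_segments(num_basic, max_merge=3):
    """Enumerate spans of adjacent basic segments.

    Each span is (first, last), 1-based and inclusive. A span with
    first == last is a basic segment itself; longer spans are candidate
    segments formed by combining neighbours.
    """
    spans = []
    for first in range(1, num_basic + 1):
        last_limit = min(first + max_merge - 1, num_basic)
        for last in range(first, last_limit + 1):
            spans.append((first, last))
    return spans


print(candidate_segments(3, max_merge=2))
```

Each span would then be passed to character recognition, and the recognized candidates arranged in order of appearance form the character lattice.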
[0047]
FIG. 7 is a diagram illustrating an example of the structure of the character lattice 109. This figure shows a character lattice 109 created by using a part of the basic segment and candidate segments shown in FIG. As shown in FIG. 7, the character lattice 109 represents various character candidates obtained by performing character recognition on the basic segment and candidate segment shown in FIG. 6 in a lattice structure.
[0048]
Here, it is also conceivable that a plurality of character candidates recognized as characters correspond to the same character image. In this case, for example, the character lattice 109 is generated by selecting only those whose recognition ranks are from the top to the tenth and the distance value of the candidate character is a predetermined threshold value or less.
[0049]
In addition, a segment number is assigned to each candidate character in the character lattice 109 in the order of appearance in the document. In the example of FIG. 7, “no”, “ba”, “ba”, “gu”, “b”, “t”, “age”, “fire”, “burn”, “so”, “no”, Segment numbers are assigned in the order of "Zen", "Sa", "-", "Rice", "Rate", and "Dou".
[0050]
Next, a process for calculating the distance value of the character feature amount by the character recognition processing unit 106 shown in FIG. 1 will be described. FIG. 8 is an explanatory diagram for explaining the calculation process of the distance value of the character feature amount by the character recognition processing unit 106 shown in FIG. In this process, standard characters created by averaging characters written by many people are used.
[0051]
First, as shown in FIG. 8A, the contour of the standard character image is extracted. Subsequently, as shown in FIG. 8B, the contour-extracted character image is divided into mesh regions. In this example, mesh regions each spanning two squares vertically and horizontally are laid out so that adjacent regions overlap by half, which yields 25 (= 5 × 5) mesh regions.
[0052]
When extracting character feature values, the number of character contour elements in each direction shown in FIG. 8 (c) is counted for each divided mesh region, and a feature vector is generated from these feature values. In this example, since contours are extracted in eight directions, the feature vector has 200 dimensions (= 5 × 5 × 8).
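The dimension count above can be checked with a small sketch. It assumes, as an interpretation not stated explicitly in the specification, that the 5 × 5 overlapping regions arise from sliding a 2 × 2 window one cell at a time over a 6 × 6 grid of cells, and that per-cell direction counts are already available. The name `mesh_feature_vector` is hypothetical.

```python
def mesh_feature_vector(cell_counts, window=2, stride=1):
    """cell_counts[r][c] is a list of contour counts per direction for one cell.

    Sums the counts inside each overlapping window x window region and
    concatenates the per-region sums into one feature vector.
    """
    n = len(cell_counts)
    directions = len(cell_counts[0][0])
    vector = []
    for top in range(0, n - window + 1, stride):
        for left in range(0, n - window + 1, stride):
            sums = [0] * directions
            for r in range(top, top + window):
                for c in range(left, left + window):
                    for k in range(directions):
                        sums[k] += cell_counts[r][c][k]
            vector.extend(sums)
    return vector


# Dummy input: a 6 x 6 grid where every cell has one contour element per direction.
cells = [[[1] * 8 for _ in range(6)] for _ in range(6)]
vector = mesh_feature_vector(cells)
print(len(vector))
```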
[0053]
Such a feature vector is created for each character cut out from the document image 102, and a distance value between the standard character and each character is calculated. Here, various distance measures such as Mahalanobis distance, Euclidean distance, or city block distance can be used as the distance value.
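Two of the distance measures named above can be written directly; the sketch below shows the Euclidean and city block distances between two feature vectors. The Mahalanobis distance is omitted because it additionally requires a covariance estimate for each standard character. The function names are chosen for this illustration.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_distance(a, b):
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


standard = [0, 0]
sample = [3, 4]
print(euclidean_distance(standard, sample), city_block_distance(standard, sample))
```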
[0054]
Next, the segment number table created by the character lattice storage unit 108 shown in FIG. 1 will be described. FIG. 9 is a diagram showing an example of the segment number table created by the character lattice storage unit 108 shown in FIG. As shown in the figure, the character lattice storage unit 108 rearranges the generated character lattice 109 for each character and stores it as a segment number table in a form in which segment number, distance value, and recognition rank information are added.
[0055]
The segment number, distance value, and recognition rank of each character are stored as a set of the form (segment number, distance value, recognition rank). In the example of FIG. 9, the character recognized as “I” appears three times in the document image, and its entries are (64, 179, 1), (68, 214, 9) and (76, 125, 1).
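A segment number table of this kind can be represented as a mapping from each recognized character to its list of (segment number, distance value, recognition rank) entries. The sketch below is one possible layout; the input format and the name `build_segment_table` are assumptions, and the entry for “B” is a placeholder, while the three entries for “I” are the ones quoted in the text.

```python
from collections import defaultdict


def build_segment_table(lattice_entries):
    """lattice_entries: iterable of (segment_no, char, distance, rank).

    Groups the entries by character and sorts each group by segment number.
    """
    table = defaultdict(list)
    for segment_no, char, distance, rank in lattice_entries:
        table[char].append((segment_no, distance, rank))
    for entries in table.values():
        entries.sort()
    return dict(table)


entries = [
    (76, "I", 125, 1),
    (64, "I", 179, 1),
    (69, "B", 113, 1),   # placeholder distance and rank
    (68, "I", 214, 9),
]
table = build_segment_table(entries)
print(table["I"])
```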
[0056]
Next, character string extraction processing by the character string extraction unit 110 using the segment number table will be described. FIG. 10 is an explanatory diagram for explaining the character string extraction processing by the character string extraction unit 110 using the segment number table.
[0057]
As shown in the figure, when the keyword to be searched for is “biofuel”, the character string extraction unit 110 searches the segment number table for the character “B”, which is the first character of the keyword. In FIG. 10A, it can be seen that the character “B” is at segment number 69.
[0058]
Subsequently, the next character “I” is searched for. At this time, a predetermined number width is set, and the search is performed within that number width. The number width is set to an appropriate value so that only an “I” that follows the character “B” in the document image is selected and an “I” that does not follow “B” is excluded.
[0059]
For example, when the number width is 12, the character “I” whose segment number is 76 is selected as the character following “B”, because its segment number increment from “B” is 7 (= 76 − 69), which is within the number width. The characters “I” whose segment numbers are 64 and 68 have segment numbers smaller than that of “B”, so they can be determined to have appeared before “B” and are not selected.
[0060]
Similarly, the character “O” with segment number 78 is selected as the character following “I”, because its segment number increment from “I” is 2 (= 78 − 76). In the same manner, the characters up to “Fee” are selected, and a character string candidate corresponding to the keyword “biofuel” as shown in FIG. 10B is extracted. When the extraction is completed, the segment numbers of the first character and the last character of the extracted character string are stored.
[0061]
Here, the number width to be searched is set to 12, but the number width can be arbitrarily set as long as consecutive characters are appropriately selected. For example, the character number width of a character string included in the length of three times the height of one line of the document image may be calculated, and the character search may be performed within the number width. This is a setting method based on the prediction that consecutive characters in a character string appear within a length three times the height of one line.
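The number-width test for a single step can be sketched as below. The function name `find_next` is hypothetical, and treating the width as an inclusive upper bound on the segment number increment is an assumption; the specification only says that the search is performed within the number width.

```python
def find_next(table, char, prev_segment, width):
    """Return the nearest entry for `char` that follows prev_segment.

    An entry (segment_no, distance, rank) qualifies when
    0 < segment_no - prev_segment <= width. Returns None if none qualifies.
    """
    hits = [e for e in table.get(char, []) if 0 < e[0] - prev_segment <= width]
    return min(hits, default=None)


table = {"I": [(64, 179, 1), (68, 214, 9), (76, 125, 1)]}
# "B" was found at segment 69; look for the "I" that follows it.
print(find_next(table, "I", prev_segment=69, width=12))
```

The entries at segments 64 and 68 are rejected because their increment from segment 69 is negative.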
[0062]
Next, character string extraction processing by the character string extraction unit 110 when there is a character that cannot be recognized will be described. FIG. 11 is an explanatory diagram illustrating character string extraction processing by the character string extraction unit 110 when there is a character that cannot be recognized. Here, a case where the character “O” with segment number 78 cannot be recognized will be described.
[0063]
As shown in FIG. 11, the character string extraction unit 110 searches up to the character “I” in the same manner as in FIG. 10. However, since the character “O” with segment number 78 could not be recognized, it is determined that there is no character “O”.
[0064]
In that case, the character string extraction unit 110 searches for the character “Fuel”, which follows “O”, after expanding the number width used for the search. The expanded number width can be set arbitrarily. Here, it is set to 24 (= 12 × 2), twice the initial number width.
[0065]
With this setting, the character “Fuel” with segment number 82 can be selected. Thereafter, the subsequent characters are searched for with the normal number width again.
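The skip-and-widen behaviour can be sketched as a forward walk over the keyword. This is an illustration under assumptions: the name `extract_forward`, the inclusive width test, and the choice to keep the widened width until the next hit are not taken from the specification, and the distance values and ranks for characters other than “I” are placeholders.

```python
def extract_forward(table, keyword, start_segment, width=12, skip_factor=2):
    """Follow the keyword characters after the first one.

    Returns one segment number per keyword character, with None for a
    character that could not be found (treated as unrecognizable). After a
    skip, the next search uses width * skip_factor; after a hit, the
    normal width is restored.
    """
    found = [start_segment]
    prev, current_width = start_segment, width
    for char in keyword[1:]:
        hits = [seg for seg, _dist, _rank in table.get(char, [])
                if 0 < seg - prev <= current_width]
        if hits:
            prev = min(hits)
            found.append(prev)
            current_width = width
        else:
            found.append(None)
            current_width = width * skip_factor
    return found


table = {
    "B": [(69, 113, 1)],
    "I": [(64, 179, 1), (68, 214, 9), (76, 125, 1)],
    "Fuel": [(82, 164, 1)],
    "Fee": [(94, 196, 1)],
}   # no entry for "O": it could not be recognized
keyword = ["B", "I", "O", "Fuel", "Fee"]
print(extract_forward(table, keyword, start_segment=69))
```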
[0066]
Here, when the first character of the keyword to be searched cannot be recognized, the segment number for starting the search is not determined, so that the search for each character forming the keyword cannot be started. Therefore, in this case, a process for searching for keywords in reverse order is performed.
[0067]
That is, when the character “B” of “biofuel” cannot be recognized, the character “Fee” is searched for first, and then the character “Fuel” is searched for in reverse order. In the example of FIG. 11, when the character “Fee” with segment number 94 is selected, the character “Fuel” is searched for in reverse order within a number width of 12. In the same manner, the search is performed up to the character “I”.
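The reverse-order search can be sketched in the same style. The name `extract_reverse` is hypothetical, the sample table uses placeholder distance values and ranks, and for brevity this sketch does not widen the number width after a skipped character.

```python
def extract_reverse(table, keyword, last_segment, width=12):
    """Walk the keyword backwards from its last character.

    Returns one segment number per keyword character in keyword order,
    with None for characters that could not be found.
    """
    found = [last_segment]
    prev = last_segment
    for char in reversed(keyword[:-1]):
        hits = [seg for seg, _dist, _rank in table.get(char, [])
                if 0 < prev - seg <= width]
        if hits:
            prev = max(hits)      # nearest preceding occurrence
            found.append(prev)
        else:
            found.append(None)
    return list(reversed(found))


table = {
    "I": [(64, 179, 1), (68, 214, 9), (76, 125, 1)],
    "O": [(78, 204, 2)],
    "Fuel": [(82, 164, 1)],
    "Fee": [(94, 196, 1)],
}   # no entry for "B": the first character could not be recognized
keyword = ["B", "I", "O", "Fuel", "Fee"]
print(extract_reverse(table, keyword, last_segment=94))
```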
[0068]
When there is a character that cannot be recognized, if the ratio of the number of characters that cannot be recognized to the number of characters of the keyword to be searched is equal to or less than a predetermined threshold value, a character string composed of each character searched by the character string extraction unit 110 is selected. Extracted as a character string corresponding to a keyword.
[0069]
In the example of FIG. 11, since the number of characters of the keyword “biofuel” is 5 and the number of characters that cannot be recognized is 1, the percentage of characters that cannot be recognized is 20%, and the threshold is set to 30%. If so, this character string is extracted as a character string corresponding to the keyword. Then, the segment numbers of the first character and the last character of the extracted character string are stored. This threshold value can be set arbitrarily.
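The ratio test in this example amounts to the following check. The function name `passes_unrecognized_ratio` and the representation of skipped characters as None are assumptions made for this sketch.

```python
def passes_unrecognized_ratio(found, max_ratio=0.30):
    """found: one entry per keyword character, None where unrecognizable.

    Returns True when the share of unrecognizable characters is at or
    below max_ratio.
    """
    missing = sum(1 for seg in found if seg is None)
    return missing / len(found) <= max_ratio


# One of five characters unrecognizable: 20 %, within the 30 % threshold.
print(passes_unrecognized_ratio([69, 76, None, 82, 94]))
```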
[0070]
Next, a search process for searching for a character string that forms a keyword by the character association unit 111, the distance value assigning unit 112, and the keyword search unit 113 will be described. FIG. 12 is an explanatory diagram illustrating a search process for searching for a character string that forms a keyword by the character association unit 111, the distance value assigning unit 112, and the keyword search unit 113.
[0071]
First, the character associating unit 111 performs processing for associating each character forming the character string extracted by the character string extracting unit 110 with each character forming the character string forming the keyword by dynamic programming.
[0072]
The character association unit 111 searches for the basic segments forming the character string based on the segment numbers of the first and last characters of the character string extracted by the character string extraction unit 110 (FIG. 12), and assigns a number to each basic segment as shown in FIG. 12B.
[0073]
Then, as shown in FIG. 12 (c), the numbered basic segments are arranged on the horizontal axis and the keyword to be searched for is arranged on the vertical axis, and each character forming the keyword is associated with basic segments by dynamic programming.
[0074]
In FIG. 12C, when the line advances in the upward direction as indicated by the arrow (1), it means that one basic segment corresponds to a plurality of characters. This corresponds to a case where there is a character cutout error such as two types of characters included in one basic segment, or a case where two types of character recognition results are obtained for one basic segment.
[0075]
When the line advances in an oblique direction as indicated by the arrow (2), it indicates that one basic segment corresponds to one character. Further, when the line advances in the horizontal direction as indicated by the arrow (3), it indicates that one character corresponds to a plurality of basic segments. This corresponds to a case where a kanji composed of a left-hand radical (偏) and a right-hand component (旁) is divided into two basic segments.
[0076]
In the example of FIG. 12 (c), dynamic programming detects that basic segments 1 to 3 correspond to the character “B”, that basic segments 6 to 9 correspond to the character “Fuel”, and that basic segments 10 and 11 correspond to the character “Fee”.
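The specification does not give the cost function or recurrence used by the dynamic programming step, so the following is only a generic sketch of a monotone alignment with the three moves described above (upward, oblique, horizontal). The cost matrix, the function name `associate`, and the tie-breaking are all assumptions.

```python
def associate(cost):
    """cost[i][j]: hypothetical mismatch cost between keyword character i
    and basic segment j (lower is better).

    Finds the cheapest monotone path from (0, 0) to the last cell using
    horizontal, oblique and upward moves, and returns it as a list of
    (character index, segment index) pairs.
    """
    m, n = len(cost), len(cost[0])
    total = [[float("inf")] * n for _ in range(m)]
    back = [[None] * n for _ in range(m)]
    total[0][0] = cost[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            steps = []
            if j > 0:                      # horizontal: same character, next segment
                steps.append((total[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:            # oblique: next character, next segment
                steps.append((total[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:                      # upward: next character, same segment
                steps.append((total[i - 1][j], (i - 1, j)))
            best, prev = min(steps)
            total[i][j] = best + cost[i][j]
            back[i][j] = prev
    path, node = [], (m - 1, n - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    return path[::-1]


# Two keyword characters, three basic segments: the first character
# matches segments 0 and 1, the second matches segment 2.
cost = [[0, 0, 5],
        [5, 5, 0]]
print(associate(cost))
```

In the resulting path, a character index that repeats across several segment indices corresponds to one character spanning several basic segments.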
[0077]
FIG. 12D shows a character string in which characters are associated in this way. The distance value assigning unit 112 then newly assigns the distance value and recognition rank obtained after the association process to the characters “Fuel” and “Fee”, whose combinations of basic segments were revised by the association.
[0078]
Thereafter, the keyword search unit 113 calculates an average distance value between the extracted character string and the keyword. Specifically, the distance value of each character forming the extracted character string is weighted by the recognition rank, and the average value of the weighted distance values is calculated. Here, the weight is given by 2 × N / (1 + N), where N is the value of the recognition order. When the recognition order is 1 (N = 1), this weight is 1, and when the recognition order is 10 (N = 10), this weight is about 1.82.
[0079]
In the example of FIG. 12D, the recognition ranks of “B”, “I”, “O”, “Fuel”, and “Fee” are 1, 1, 2, 1, and 1, respectively. The weights are 1, 1, 1.33, 1 and 1. Therefore, the average distance value of the character string “biofuel” is calculated as (1 × 113 + 1 × 125 + 1.33 × 204 + 1 × 164 + 1 × 196) /5=173.9. The calculation formula for calculating the average distance value is not limited to this, and may be calculated using another calculation formula.
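The weighted average above can be reproduced with a few lines of Python. The function names are chosen for this sketch. Note that the code uses the unrounded weight 4/3 for rank 2, so it yields 174.0, whereas the figure of 173.9 in the text results from rounding the weight to 1.33 first.

```python
def rank_weight(rank):
    """Weight 2N / (1 + N) for recognition rank N."""
    return 2 * rank / (1 + rank)


def average_distance(chars):
    """chars: list of (distance value, recognition rank) per character."""
    return sum(rank_weight(rank) * dist for dist, rank in chars) / len(chars)


# "B", "I", "O", "Fuel", "Fee" from the example of FIG. 12D.
biofuel = [(113, 1), (125, 1), (204, 2), (164, 1), (196, 1)]
print(round(average_distance(biofuel), 1))
```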
[0080]
When the average distance value is calculated, the keyword search unit 113 compares the predetermined average distance value threshold value with the calculated average distance value, and the calculated average distance value is smaller than the threshold value. In this case, it is determined that the character string matches the keyword, and the result is output. For this threshold value, for example, the average value of 3σ of each character forming the keyword can be used.
[0081]
Next, an average distance value calculation process when there are characters that cannot be recognized will be described. FIG. 13 is an explanatory diagram for explaining the calculation process of the average distance value when there is a character that cannot be recognized.
[0082]
As shown in FIG. 13A, the character association unit 111 associates basic segments 1 to 4 with the characters “B” and “I” in the keyword, as in FIG. 12C. Basic segment 5 could not be recognized by character recognition, and it is unclear what character it actually is. In this case, the character association unit 111 assumes that this part corresponds to “O” and moves on to associating the basic segments corresponding to the next character, “Fuel”. The subsequent processing is the same as that described with reference to FIG. 12.
[0083]
FIG. 13B shows a character string in which the basic segments are associated in this way. Again, the distance value and recognition order for the revised character are newly assigned to the characters “Fuel” and “Fee” whose combination of basic segments has been revised.
[0084]
However, the distance value of the unrecognizable character “O” is set to 3σ, that is, three times the standard deviation σ of the distance values of “O” (the variance being the square of σ). In this example, 3σ is 261. The recognition rank is set low; in this example, it is set to 10th.
[0085]
In this example, the recognition ranks of “B”, “I”, “O”, “Fuel” and “Fee” are 1, 1, 10, 1 and 1, respectively, so the weights are 1, 1, 1.82, 1 and 1. Therefore, the average distance value of the character string “biofuel” is calculated as (1 × 113 + 1 × 125 + 1.82 × 261 + 1 × 164 + 1 × 196) / 5 ≈ 214.6.
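Filling in the distance value and rank of an unrecognizable character before averaging might look like the sketch below. The names `fill_unrecognized` and `weighted_average`, the per-character σ table, and the value σ = 87 (chosen so that 3σ = 261 as in the text) are assumptions. With the unrounded weight 20/11 for rank 10 the result is about 214.5; rounding the weight to 1.82 first gives about 214.6.

```python
def fill_unrecognized(chars, sigma_by_char, fallback_rank=10):
    """chars: list of (keyword_char, distance or None, rank or None).

    An entry with distance None is an unrecognizable character; it receives
    3 * sigma of the associated keyword character and the fallback rank.
    """
    filled = []
    for ch, dist, rank in chars:
        if dist is None:
            dist, rank = 3 * sigma_by_char[ch], fallback_rank
        filled.append((ch, dist, rank))
    return filled


def weighted_average(filled):
    """Average of distance values weighted by 2N / (1 + N)."""
    return sum(2 * rank / (1 + rank) * dist for _ch, dist, rank in filled) / len(filled)


chars = [("B", 113, 1), ("I", 125, 1), ("O", None, None),
         ("Fuel", 164, 1), ("Fee", 196, 1)]
filled = fill_unrecognized(chars, sigma_by_char={"O": 87})
print(filled[2], round(weighted_average(filled), 1))
```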
[0086]
When the average distance value is calculated, the keyword search unit 113 compares the predetermined average distance value threshold value with the calculated average distance value, and the calculated average distance value is smaller than the threshold value. In this case, it is determined that the character string matches the keyword, and the result is output.
[0087]
Here, the distance value of an unrecognizable character can be set to an arbitrary value. However, if it is set too small (for example, to 0), then whenever the distance values of the other characters are small and their recognition rank numbers are low, the character string is retrieved regardless of what the unrecognizable character actually is. Character strings unrelated to the keyword are then retrieved as well, which increases search noise.
[0088]
On the other hand, if the distance value of the character that cannot be recognized is set too large, the average distance value of the character string becomes large, and no matter how small the distance value of other parts, search omission is likely to occur. The same applies to the setting of the recognition order.
[0089]
Therefore, in the present invention, the above-mentioned problem is solved by setting an appropriate size value for the distance value and recognition order of unrecognizable characters. The reason why the value of 3σ is used as the distance value is that character feature information is included in 3σ.
[0090]
That is, in the example of FIG. 13A, a large value of 3σ (or variance) for the character “O” means that the character features of “O” vary widely. In that case, the difference in features from other characters (for example, “age”) becomes less distinct, so it is uncertain whether the character is actually “O” and whether the character string containing it matches the keyword.
[0091]
Therefore, when the character in the character string corresponding to the keyword character “O” cannot be recognized, setting its distance value to 3σ makes the character string less likely to be selected as a search result, which reduces search noise. The value set as the distance value is not limited to 3σ and may instead be a value based on the average of the distance values of “O”.
[0092]
Next, the procedure of the character string search process according to this embodiment will be described. FIG. 14 is a flowchart showing the procedure of the character string search process according to the present embodiment. As shown in the figure, first, the line / character cutout processing unit 104 performs line and character cutout processing after layout analysis (step S1401).
[0093]
Then, the distance value calculation unit 105 calculates the distance value of the feature amount between each extracted character and the standard character (step S1402). Subsequently, the character recognition processing unit 106 performs character recognition of each character based on the calculated distance value (step S1403). Then, when a plurality of character recognition results are obtained for a certain character, the recognition rank assigning unit 107 sets the recognition rank in ascending order of the distance value (step S1404).
[0094]
After that, the character lattice storage unit 108 generates a character lattice, and stores the character recognition result as a segment number table together with the distance value and recognition order of each character (step S1405).
[0095]
Then, the character string extracting unit 110 extracts a character string corresponding to the designated keyword using the segment number table (step S1406), and the character associating unit 111 forms each extracted character string. A process of associating the character with each character in the keyword is performed (step S1407).
[0096]
Subsequently, the distance value assigning unit 112 checks whether there is a character that cannot be recognized in the associated character string (step S1408). If there is a character that cannot be recognized (step S1408, Yes), A value based on the dispersion of character distance values in a keyword corresponding to characters that cannot be recognized is set as a distance value of characters that cannot be recognized, and a predetermined rank is set as the recognition rank of characters that cannot be recognized ( Step S1409). If there is no character that cannot be recognized (No in step S1408), the process proceeds to step S1410.
[0097]
Thereafter, the keyword search unit 113 calculates the average distance value of the extracted character string and, when the calculated average distance value is smaller than a predetermined threshold, judges that the extracted character string corresponds to the keyword and outputs it as a search result (step S1410).
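The judgment in step S1410 reduces to an average and a comparison, as in the sketch below. The function names and the sample values are hypothetical; the list of distances is assumed to already contain the value assigned to any unrecognizable character.

```python
def average_distance(distances):
    """Average of the per-character distance values of one extracted string."""
    return sum(distances) / len(distances)


def is_keyword_match(distances, threshold):
    """True when the average distance value is smaller than the threshold."""
    return average_distance(distances) < threshold


# Three characters; the last value stands for one assigned to an unrecognized character.
distances = [1.0, 2.0, 3.0]
accepted = is_keyword_match(distances, threshold=2.5)
rejected = is_keyword_match(distances, threshold=2.0)
```

Because the comparison is strict, a string whose average equals the threshold is not output in this sketch.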
[0098]
As described above, in the present embodiment, when the character string being searched contains a character that cannot be recognized, a predetermined value (for example, a standard deviation) is set as the distance value of that character and a low rank (for example, 10th) is set as its recognition rank. The average distance value between the extracted character string and the keyword is then calculated, and when the average distance value is smaller than a predetermined threshold, the extracted character string is output as matching the keyword. Because the judgment is made on the average distance value of the entire character string even when an unrecognizable character is present, keywords can be searched for with few search omissions and little search noise.
[0099]
The embodiments of the present invention have been described so far, but the present invention is not limited to the above-described embodiments and may be implemented in various different embodiments within the scope of the technical idea described in the claims.
[0100]
For example, in the present embodiment, the case where the document image to be searched is a handwritten document image is shown, but the present invention is not limited to this and can be applied in the same way to a document image of printed characters.
[0101]
In the present embodiment, the case where the present invention is applied to a horizontally written document has been described. However, the present invention is not limited to this and can be easily applied to a vertically written document.
[0102]
In this embodiment, a character string that forms a keyword is searched for using a character lattice. However, the present invention is not limited to this, and the character string may be searched for by a known character string search method that does not use a character lattice.
[0103]
Furthermore, in the present embodiment, the case where a character string corresponding to one keyword is searched for in a document image such as a form is shown, but the present invention is not limited to this. It can also be applied to a case where character strings matching words in a word dictionary are searched for in a document image and an index of the document image is created by selecting some of the retrieved character strings.
[0104]
Further, in the present embodiment, the case where a character string is searched for in a free-pitch document, in which the spacing between adjacent characters is free, has been shown, but the present invention is not limited to this and can be applied in the same way to a document with entry frames in which the character spacing is constant.
[0105]
In the present embodiment, the predetermined distance value and recognition rank are given to characters that cannot be recognized after dynamic programming is applied. However, the present invention is not limited to this, and they may instead be given at the point when it is determined that character recognition cannot be performed.
[0106]
Furthermore, in the present embodiment, the characters forming the keyword are searched for in order from the first character of the character string, and when the first character cannot be recognized, the search is performed in reverse order from the last character. However, the present invention is not limited to this; when the first character cannot be recognized, the search may instead proceed in order from the next character onward.
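The two alternatives for handling an unrecognizable first character can be sketched as a choice of the order in which keyword positions are visited. The function name `search_order` and the `fallback` parameter are hypothetical and serve only to illustrate the two orderings described above.

```python
def search_order(first_char_recognized, keyword_length, fallback="reverse"):
    """Return the keyword character positions in the order they are searched.

    fallback="reverse"    : search from the last character back to the first
                            (the behavior of the embodiment)
    fallback="skip_first" : search from the second character onward
                            (the variation mentioned in the text)
    """
    indices = list(range(keyword_length))
    if first_char_recognized:
        return indices  # normal order, starting at the first character
    if fallback == "reverse":
        return indices[::-1]
    if fallback == "skip_first":
        return indices[1:]
    raise ValueError(f"unknown fallback: {fallback}")


normal = search_order(True, 4)
reverse = search_order(False, 4)
skipped = search_order(False, 4, fallback="skip_first")
```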
[0107]
In addition, among the processes described in this embodiment, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by a known method. Furthermore, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.
[0108]
Each component of the illustrated character string search device is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific form of the character string search device is not limited to the one shown in the figure, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads or usage conditions. Furthermore, all or any part of each processing function performed in the character string search device can be realized by a CPU and a program that is analyzed and executed by the CPU, or can be realized as hardware by wired logic.
[0109]
The character string search method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. This program can be distributed via a network such as the Internet. The program can also be executed by being recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO, and a DVD and being read from the recording medium by the computer.
[0110]
[Effect of the invention]
As explained above, according to the present invention, when a given character forming a document cannot be recognized using the distance value between that character and a standard character, a predetermined distance value is assigned to that character, and the character string forming the keyword is searched for in the document based on the assigned distance value and the distance values of the characters that were recognized. Therefore, even when there is a character that cannot be recognized, a character string corresponding to a predetermined keyword can be searched for efficiently while reducing search omissions and search noise.
[0111]
Also, according to the present invention, one or more character recognition candidates obtained by character recognition of each character forming the document are stored as a character lattice, the recognition results stored as the character lattice are read out, and the character string forming the keyword is searched for among those results based on the distance value assigned to the character that could not be recognized and the distance values of the characters that were recognized. Therefore, a character string can be searched for among various character candidates, and a character string corresponding to a predetermined keyword can be searched for efficiently.
[0112]
Also, according to the present invention, when a predetermined character string is accepted as a keyword, the characters forming the keyword are searched for in order from the first character of the character string based on the character recognition results. If a given character of the character string cannot be recognized even using the distance value between that character and the standard character, the search continues by skipping that character, and the character string is extracted if the ratio of the number of unrecognizable characters to the number of characters in the character string is less than or equal to a predetermined value. Whether the extracted character string forms the keyword is then determined based on the predetermined distance value assigned to the unrecognizable character and the distance values of the recognized characters in the extracted character string, and the character string forming the keyword is searched for based on that determination. Therefore, even when there is a character that cannot be recognized, the character string forming the keyword can be searched for efficiently.
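The extraction condition on the proportion of unrecognizable characters can be illustrated with a short sketch. The representation of a character string as a list of booleans and the name `passes_unrecognized_ratio` are assumptions made for the example; the text does not state a concrete value for the limit.

```python
def passes_unrecognized_ratio(recognized_flags, max_ratio):
    """Decide whether a candidate string may be extracted.

    recognized_flags : one boolean per keyword character, False where the
                       corresponding document character could not be recognized
    max_ratio        : largest allowed share of unrecognizable characters
    """
    failures = recognized_flags.count(False)
    return failures / len(recognized_flags) <= max_ratio


# One of four characters skipped: ratio 0.25, which equals the limit, so it is kept.
kept = passes_unrecognized_ratio([True, False, True, True], max_ratio=0.25)
# Two of three characters skipped: the ratio exceeds 0.5, so the string is dropped.
dropped = passes_unrecognized_ratio([False, False, True], max_ratio=0.5)
```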
[0113]
Also, according to the present invention, when the first character of the character string forming the keyword cannot be recognized, the characters forming the keyword are searched for in reverse order from the last character of the character string. If a given character of the character string cannot be recognized using the distance value between that character and the standard character, the search continues by skipping that character, and the character string is extracted if the ratio of the number of unrecognizable characters to the number of characters in the character string is less than the predetermined value. Therefore, the character string forming the keyword can be searched for efficiently even when its first character cannot be recognized.
[0114]
Also, according to the present invention, dynamic programming is applied to the extracted character string to associate each character forming the extracted character string with each character forming the keyword. When a given character of the character string to which dynamic programming has been applied cannot be recognized using the distance value between that character and the standard character, a value related to the keyword character associated with that character is assigned to it as the predetermined distance value. Whether the extracted character string forms the keyword is then determined based on the assigned distance value and the distance values of the recognized characters in that character string, and the character string forming the keyword is searched for based on that determination. Therefore, even when there is a character that cannot be recognized, an appropriate distance value can be given to that character by using dynamic programming, and search omissions and search noise can be reduced.
[0115]
Also, according to the present invention, when a given character of the character string to which dynamic programming has been applied cannot be recognized using the distance value between that character and the standard character, a predetermined distance value based on the dispersion or standard deviation of the distance values of the keyword character associated with that character is assigned to it. Therefore, even when there is a character that cannot be recognized, an appropriate distance value that includes character feature information can be given to that character, and search omissions and search noise can be reduced.
[0116]
Also, according to the present invention, the one or more character recognition candidates stored as a character lattice are numbered with recognition ranks in ascending order of their distance values, and the character string forming the keyword is searched for in the document based on the assigned recognition ranks, the distance value assigned to the character that could not be recognized, and the distance values of the characters that were recognized. Therefore, search noise can be further reduced.
[0117]
Also, according to the present invention, when a given character forming a document cannot be recognized using the distance value between that character and a standard character, a predetermined distance value is assigned to that character, and the character string forming the keyword is searched for in the document based on the assigned distance value and the distance values of the characters that were recognized. Therefore, even when there is a character that cannot be recognized, a character string corresponding to a predetermined keyword can be searched for efficiently while reducing search omissions and search noise.
[0118]
Also, according to the present invention, when a given character forming a document cannot be recognized using the distance value between that character and a standard character, a predetermined distance value is assigned to that character, and the character string forming the keyword is searched for in the document based on the assigned distance value and the distance values of the characters that were recognized. Therefore, even when there is a character that cannot be recognized, a character string corresponding to a predetermined keyword can be searched for efficiently while reducing search omissions and search noise.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing a configuration of a character string search device according to an embodiment.
FIG. 2 is a diagram illustrating an example of a horizontally written handwritten document to be searched by a character string search device.
FIG. 3 is a diagram for explaining line cut-out processing by a line / character cut-out processing unit.
FIG. 4 is a diagram illustrating an example of a basic segment cut out by a line / character cut-out processing unit.
FIG. 5 is a diagram showing an example of candidate segments created by combining basic segments.
FIG. 6 is a diagram illustrating an example of combinations of basic segments and candidate segments arranged in the order of appearance in a document image.
FIG. 7 is a diagram illustrating an example of a structure of a character lattice.
FIG. 8 is an explanatory diagram for explaining a calculation process of a distance value of a character feature amount by a character recognition processing unit shown in FIG. 1.
FIG. 9 is a diagram showing an example of a segment number table created by the character lattice storage unit shown in FIG. 1.
FIG. 10 is an explanatory diagram illustrating a character string extraction process by a character string extraction unit using a segment number table.
FIG. 11 is an explanatory diagram illustrating a character string extraction process performed by a character string extraction unit when there is a character that cannot be recognized.
FIG. 12 is an explanatory diagram illustrating search processing for searching for a character string that forms a keyword by a character association unit, a distance value assigning unit, and a keyword search unit.
FIG. 13 is an explanatory diagram illustrating an average distance value calculation process when there are characters that cannot be recognized.
FIG. 14 is a flowchart showing a processing procedure of a character string search process according to the present embodiment.
[Explanation of symbols]
100 Character string search device
101 Document image storage unit
102 Document image
103 Search string acceptance unit
104 Line / character cutout processing unit
105 Distance value calculation unit
106 Character recognition processing unit
107 Recognition rank assigning unit
108 Character lattice storage unit
109 Character lattice
110 Character string extracting unit
111 Character associating unit
112 Distance value assigning unit
113 Keyword search unit
114 Control unit

Claims

A character string search device that, when a predetermined character string is accepted as a keyword, performs character recognition of a document image to be searched based on a distance value between each character candidate forming the document image and a standard character created by averaging characters written by many people, and searches the document image for the character string forming the keyword based on the result of the character recognition, the device comprising:
character recognition means for treating character recognition as successful, after associating the distance value with the character candidate, when the distance value between the standard character corresponding to each character forming the character string and the character candidate forming the document image is less than a predetermined threshold;
distance value giving means for, when the character recognition means fails in character recognition for at least one of the characters forming the character string, giving, as the distance value, to the character candidate existing at the position corresponding to that character, a value calculated in advance based on the dispersion, standard deviation, or average of the distance values between the standard character corresponding to that character and each of the characters written by the many people that were used when creating the standard character; and
keyword search means for searching for the keyword based on the distance value given by the distance value giving means to the character candidate for which the character recognition means has failed in character recognition and the distance value associated with the character candidate for which the character recognition means has succeeded in character recognition.

The character string search device according to claim 1, further comprising character lattice storage means for storing, as a character lattice, one or a plurality of the character candidates obtained when the character recognition means has succeeded in character recognition,
wherein the keyword search means searches for the keyword among the character candidates included in the character lattice stored in the character lattice storage means.

The character string search device according to claim 1 or 2, wherein the keyword search means
treats, as a candidate for the keyword, a set of the character candidates for which the value obtained by dividing the number of characters for which the character recognition means has failed in character recognition by the total number of characters of the character string is equal to or less than a predetermined threshold.

The character string search device according to claim 1, 2, or 3, wherein the character recognition means,
when the character search fails for the first character among the characters forming the character string with the search direction of the character candidates forming the document image set to the normal order, sets the search direction to the reverse order and performs character recognition from the last character among the characters forming the character string.

The character string search device described in 1., wherein the character recognition means
performs the character recognition on the character candidate after generating the character candidate by combining one or more basic segments by dynamic programming.

The character string search device according to claim 1, wherein the keyword search means
treats, as a candidate for the keyword, a set of the character candidates whose average distance value, calculated using the distance value given by the distance value giving means to the character candidate for which the character recognition means has failed in character recognition and the distance value associated with the character candidate for which the character recognition means has succeeded in character recognition, is equal to or less than a predetermined threshold.

The character string search device according to claim 1, wherein
the character recognition means associates recognition ranks, in ascending order of distance value from the standard character, with the character candidates that have been successfully recognized,
the distance value giving means gives a recognition rank that is at least not first place to the character candidate to which it has given the distance value, and
the keyword search means searches for the keyword based on the distance value and the recognition rank given by the distance value giving means to the character candidate for which the character recognition means has failed in character recognition, and the distance value and the recognition rank associated with the character candidate for which the character recognition means has succeeded in character recognition.

A character string search method for searching a document image for a keyword consisting of a predetermined character string, the method comprising:
a character recognition step of calculating a distance value between each character candidate forming the document image and a standard character created by averaging characters written by a large number of people and, when the distance value between the standard character corresponding to each character forming the character string and the character candidate forming the document image is less than a predetermined threshold, treating character recognition as successful after associating the distance value with the character candidate;
a distance value giving step of, when the character recognition step fails in character recognition for at least one character among the characters forming the character string, giving, as the distance value, to the character candidate existing at the position corresponding to that character, a value calculated in advance based on the dispersion, standard deviation, or average of the distance values between the standard character corresponding to that character and each of the characters written by the large number of people that were used when creating the standard character; and
a keyword search step of searching for the keyword based on the distance value given by the distance value giving step to the character candidate for which the character recognition step has failed in character recognition and the distance value associated with the character candidate for which the character recognition step has succeeded in character recognition.

A character string search program that, when a predetermined character string is accepted as a keyword, performs character recognition of a document image to be searched based on a distance value between each character candidate forming the document image and a standard character created by averaging characters written by many people, and searches the document image for the character string forming the keyword based on the result of the character recognition, the program causing a computer to execute:
a character recognition procedure of treating character recognition as successful, after associating the distance value with the character candidate, when the distance value between the standard character corresponding to each character forming the character string and the character candidate forming the document image is less than a predetermined threshold;
a distance value giving procedure of, when the character recognition procedure fails in character recognition for at least one character among the characters forming the character string, giving, as the distance value, to the character candidate existing at the position corresponding to that character, a value calculated in advance based on the dispersion, standard deviation, or average of the distance values between the standard character corresponding to that character and each of the characters written by the many people that were used when creating the standard character; and
a keyword search procedure of searching for the keyword based on the distance value given by the distance value giving procedure to the character candidate for which the character recognition procedure has failed in character recognition and the distance value associated with the character candidate for which the character recognition procedure has succeeded in character recognition.