JP2004341631A

JP2004341631A - Microfilm OCR system

Info

Publication number: JP2004341631A
Application number: JP2003134835A
Authority: JP
Inventors: Michihiko Takahashi; 通彦高橋; Tatsumi Inahashi; 辰美稲橋
Original assignee: JIM KK
Current assignee: JIM KK
Priority date: 2003-05-13
Filing date: 2003-05-13
Publication date: 2004-12-02

Abstract

PROBLEM TO BE SOLVED: To provide a microfilm OCR system with improved character recognition accuracy when reading a character string recorded on a microfilm as character data.

SOLUTION: The microfilm OCR system reads a microfilm on which character strings are recorded as image data 1 with a scanner 10 and recognizes a character area in the image data 1 as character data in a computer 30. It holds the font of the character strings recorded on the microfilm as a standard pattern 28 and causes the computer 30 to compare the standard pattern 28 with the character area of the image data 1 and recognize the character area as the character data.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、コンピュータにより出力した文字列が記録されたマイクロフィルムをスキャナで読み取り、ＯＣＲソフトによりコンピュータに文字データとして認識させるマイクロフィルムＯＣＲシステムに関する。
【０００２】
【従来の技術】
文字列の印刷された被記録媒体を感光素子でイメージデータとして読み取り、読み取ったイメージデータ内の文字領域を文字データとして認識するＯＣＲ（ＯｐｔｉｃａｌＣｈａｒａｃｔｅｒＲｅａｄｅｒ）装置や、感光素子を有するスキャナとコンピュータとを用いたＯＣＲソフトが普及している（例えば、非特許文献１参照）。
【０００３】
このＯＣＲ装置又はＯＣＲソフトが被記録媒体の文字領域を読み取る手順としては、まず、読み取ったイメージデータの文字領域を見つけて読む順序を決定するレイアウト解析を行い、次に、連続した文字領域から１行毎に分解する行の切り出しを行い、次に、切り出した１行を１文字毎に切り出す文字の切り出しを行った後、最後に１文字毎に文字データの認識を行う文字認識を行っている。
【０００４】
文字認識では、個々の文字について、文字の大きさ、文字の字体（明朝体、ゴシック体、教科書体）、文字の潰れ及びかすれ等の変動に対して正規化、マッチング、知識処理という順で処理を行っている。
【０００５】
正規化では、認識したい１文字を一定の大きさに変換することで、文字の縦長及び横長などの変形を吸収する。
【０００６】
また、マッチングでは、従来では、正規化された文字を予め登録されていた標準パターンと単純に重ね合わせることで比較して文字の識別を行っていた。
【０００７】
しかしながら、この方法では、文字の傾き、字体、潰れ及びかすれなどの変動により高い認識精度を出すのが困難であったため、文字の形をそのまま比較する方法ではなく、文字の特徴から識別する方法が採用されている。
【０００８】
この文字の特徴から識別する方法としては、例えば、正規化された文字を上下、左右、斜め方向の４つの成分に分解し、この成分を文字の特徴として抽出した後、抽出した文字の特徴成分と予め登録してある文字の特徴（標準パターン）とを算出して比較している。
【０００９】
さらに、知識処理では、認識された１文字毎の文字データを連続した文字列データとし、この文字列データから漢字列やカタカナ列などを抜き出し、その部分を予め登録している単語辞書と照合して誤読した部分を自動的に訂正している。
【００１０】
このように各工程を行うことで、ＯＣＲ装置又はＯＣＲソフトは、被記録媒体に記録された文字列を文字データとして認識することができる。
【００１１】
一方、従来より、コンピュータで処理されたデータをマイクロフィルム上に人間が読み取れる文字列や図形などで出力し、フィルムベースで利用するＣＯＭ（ＣｏｍｐｕｔｅｒＯｕｔｐｕｔＭｉｃｒｏｆｉｌｍ）システムが利用されている（例えば、非特許文献２参照）。
【００１２】
このＣＯＭシステムでは、ペーパレス化、保管の省スペース化及び保管による劣化が少なく永年保存に優れるという利点があると共に、デジタルデータの保存でのシステムダウンや、コンピュータウィルスの侵入などの予測不可能なリスクに対応することができる。また、マイクロフィルムに記録された内容は、改ざんすることができず、文書管理の構築においても高い優位性を持っている。
【００１３】
このＣＯＭシステムでは、コンピュータからのデータを文字列や図形などでマイクロフィルムに記録する方法として、ＣＲＴの管面に文字列及び図形などを投影し、これを光学的にマイクロフィルムに投影するＣＲＴ方式が用いられていた。
【００１４】
しかしながら、近年では、処理速度の高速化、高解像度及びメンテナンスの容易化などから、Ｈｅ−ＮｅレーザやＡｒレーザを用いたレーザ光を直接マイクロフィルムに投射して記録するレーザ方式が採用されている。
【００１５】
このような方式で記録される文字は、漢字英数字カナ文字出力を行う漢字ＣＯＭや、グラフ、図面出力などを行うグラフィックＣＯＭなどの形式で出力されている。そして、マイクロフィルムに記録される文字は、１文字を１６×１６ドットから４０×４０ドットで表示したもの、すなわちビットマップ方式フォント（ドットフォント）が広く用いられている。
【００１６】
【非特許文献１】，
メディアドライブ株式会社，［ｏｎｌｉｎｅ］，［平成１５年２月２１日検索］，インターネット
＜ＵＲＬ：ｈｔｔｐ：／／ｗｗｗ．ｍｅｄｉａｄｒｉｖｅ．ｃｏ．ｊｐ／ｔｅｃｈｎｏｌｏｇｙ／ｗｈａｔｓｏｃｒ／ｏｖｅｒｖｉｅｗ．ｈｔｍｌ＞
【非特許文献２】
板東政夫著，外９名，「ＣＯＭシステムガイド−コンピュータアウトプットマイクロフィルム」，社団法人日本画像情報マネジメント協会，平成１０年１１月１５日
【００１７】
【発明が解決しようとする課題】
しかしながら、上述したＣＯＭシステムなどで記録されたマイクロフィルムをイメージデータとして読み込み、既存のＯＣＲ装置又はＯＣＲソフトを用いて電子データからなる文字データとして認識させても、マイクロフィルムに記録された文字列は微少なため、既存のＯＣＲ装置又はＯＣＲソフトでは文字認識精度が低いという問題がある。
【００１８】
本発明はこのような事情に鑑み、マイクロフィルムに記録された文字列を文字データとして読み込む際の文字認識精度を向上したマイクロフィルムＯＣＲシステムを提供することを課題とする。
【００１９】
【課題を解決するための手段】
上記課題を解決する本発明の第１の態様は、文字列が記録されたマイクロフィルムをスキャナによりイメージデータとして読み取ると共に、該イメージデータ内の文字領域をコンピュータに文字データとして認識させるマイクロフィルムＯＣＲシステムであって、前記マイクロフィルムに記録された前記文字列のフォントを標準パターンとして保持すると共に、該標準パターンと前記イメージデータの文字領域とを比較して当該文字領域を前記文字データとしてコンピュータに認識させることを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２０】
本発明の第２の態様は、第１の態様において、前記標準パターンが前記マイクロフィルムに記録された前記文字列の前記フォントを変換したアウトラインフォントからなることを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２１】
本発明の第３の態様は、第１の態様において、前記文字データの認識が、前記マイクロフィルムに記録された前記文字列のフォントを直接用いて行われることを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２２】
本発明の第４の態様は、第１〜３の何れかの態様において、前記マイクロフィルムがＣＯＭシステムにより記録されたＣＯＭフィルムであることを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２３】
本発明の第５の態様は、第１〜４の何れかの態様において、前記文字データの認識が、前記イメージデータの前記文字領域を見つけて読む順序を決定するレイアウト解析と、連続した文字領域から１行毎に分解する行の切り出しと、切り出した１行を１文字毎に切り出す文字の切り出しと、１文字毎に前記文字データの認識を行う文字認識とを含むことを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２４】
本発明の第６の態様は、第５の態様において、前記１文字毎の前記文字データの認識が、正規化、マッチング、知識処理により構成されていることを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２５】
本発明の第７の態様は、第５又は６の態様において、前記マイクロフィルムに記録された文字列が帳票データであると共に、前記知識処理が、認識した文字データの計算を行い正誤の処理を行うことを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２６】
本発明の第８の態様は、第７の態様において、前記知識処理が、前記文字列を前記マイクロフィルムに記録した際に用いたプログラムに基づいて検証ルールを確立し、該検証ルールに基づいて前記文字データの計算を行うことで正誤処理を行うことを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２７】
本発明の第９の態様は、第７の態様において、前記知識処理は、認識した前記文字データに基づいて検証ルールを確立し、該検証ルールに基づいて前記文字データの計算を行うことで正誤処理を行うことを特徴とするマイクロフィルムＯＣＲシステムにある。
【００２８】
かかる本発明では、マイクロフィルムに記録された文字列のフォントを標準パターンとして保持し、この標準パターンと比較することで文字列を文字データとして認識するため、マイクロフィルムに記録された微少な文字列の文字認識精度を向上することができる。
【００２９】
【発明の実施の形態】
以下に本発明を実施形態に基づいて詳細に説明する。
【００３０】
（実施形態１）
図１は、マイクロフィルムＯＣＲシステムの概略を示すブロック図である。
【００３１】
図１に示すように、マイクロフィルムＯＣＲシステムは、文字列の記録されたマイクロフィルムをイメージデータとして読み取るスキャナ１０と、スキャナ１０から読み取ったイメージデータの文字領域から文字データを認識するためのＯＣＲソフト２０を有するコンピュータ３０とを具備する。
【００３２】
マイクロフィルムは、例えば、１６ｍｍ又は３５ｍｍフィルムを３０．５ｍ（１００ｆｔ）又は６５．５ｍ（２１５ｆｔ）の長さでリールに巻き付けたロールフィルムや、１０５ｍｍ×１４８ｍｍのマイクロフィッシュ（ＪＩＳＺ６００１）などに、ネガ状又はポジ状に文字列を含む情報が記録されたものである。このようなマイクロフィルムに記録された情報としては、例えば、帳票データ等を挙げることができる。なお、マイクロフィッシュは、ロール状、シート状の何れの形式でもよく、その形状は特に限定されない。
【００３３】
また、スキャナ１０は、マイクロフィルムに記録された文字列、グラフィック等をイメージデータ１として読み取るものであり、例えば、リニアＣＣＤを用いたフラットヘッドスキャナやフィルムスキャナなどを挙げることができる。なお、マイクロフィルムとして連続するロール状のものが用いられた場合には、スキャナにマイクロフィルムを搬送する搬送手段を設け、連続してマイクロフィルムをイメージデータとして読み込めるようにしてもよい。
【００３４】
また、コンピュータ３０は、図示しないコンピュータ本体、モニタ、操作キーボードなどからなり、このコンピュータ３０には、スキャナ１０が読み込んだイメージデータ１の文字領域を文字データとして認識するＯＣＲソフト２０が内蔵されている。
【００３５】
コンピュータ３０に内蔵されたＯＣＲソフト２０は、スキャナ１０から受け取ったイメージデータ１の文字領域を解析するレイアウト解析手段２１と、解析した文字領域から行を切り出す行切り出し手段２２と、行切り出し手段２２が切り出した行から１文字を切り出す文字切り出し手段２３と、切り出した文字を認識する文字認識手段２４とを具備する。
【００３６】
レイアウト解析手段２１は、図２（ａ）に示すように、スキャナ１０が読み取ったイメージデータ１の文字領域２を解析する。このようなレイアウト解析手段２１は、文字領域２を自動的に解析するようにしてもよく、コンピュータ３０のモニタにイメージデータ１を映し出してユーザに選択させるようにしてもよい。
【００３７】
また、行切り出し手段２２は、図２（ｂ）に示すように、レイアウト解析手段２１が解析した文字領域２から１行の文字列３を切り出す。この１行の文字列３は、横書き又は縦書きに自動的に対応して切り出すようになっている。
【００３８】
さらに、文字切り出し手段２３は、図２（ｃ）に示すように、行切り出し手段２２が切り出した１行の文字列３から、１文字４を切り出す。
【００３９】
このように、イメージデータ１からレイアウト解析手段２１、行切り出し手段２２及び文字切り出し手段２３により切り出された１文字４は、文字認識手段２４によって文字データとして認識される。
【００４０】
文字認識手段２４は、正規化手段２５、マッチング手段２６、知識処理手段２７及び標準パターン２８を具備する。
【００４１】
正規化手段２５は、図３（ａ）に示すような文字切り出し手段２３が切り出した１文字４の大きさを、図３（ｂ）に示すように所定の大きさに変更した正規化文字５として、縦長又は横長などの変形を吸収する。
【００４２】
この正規化手段２５が変形する正規化文字５の大きさは、文字認識手段２４が保持した標準パターン２８と同等となるように変形する。このように１文字４を標準パターン２８と同等の大きさとなるように正規化した正規化文字５とすることで、１文字４を標準パターン２８と比較する際に、認識精度を向上することができる。
【００４３】
また、マッチング手段２６は、変形された正規化文字５と標準パターン２８とを比較し、最も近い文字候補を挙げる。
【００４４】
ここで、文字認識手段２４が保持した標準パターン２８は、マイクロフィルムに記録された文字列のフォントと同等のものである。なお、マイクロフィルムとしてＣＯＭフィルム（ＣＯＭシステムにより記録されたマイクロフィルム）が用いられた場合、ＣＯＭフィルムに通常記録される文字列のフォントは、ビットマップ方式フォント（ドットフォント）であり、１文字が１６×１６〜４０×４０ドットで表現されている。
【００４５】
そして、正規化文字５と標準パターン２８との比較では、例えば、図４に示すように、正規化文字５を上下、左右、斜め方向の４つの成分に分解し、４つの成分を個々の文字の特徴として抽出して４つの成分を７×７の１９６個の特徴値とする。この正規化文字５の４つの成分毎の特徴値と、標準パターン２８の４つの成分毎の特徴値とを、例えば、ユークリッド幾何学により比較し、正規化された文字と最も近い標準パターン２８の文字を候補として挙げる。
【００４６】
なお、正規化文字５と標準パターン２８との比較は、特にこれに限定されず、例えば、ビットマップ方式フォントをアウトラインフォントに変換し、アウトラインフォントのエッジ特徴を抽出したものを標準パターン２８として、正規化文字５のエッジ特徴と比較する拡張セル特徴方式で行うようにしてもよい。
【００４７】
また、ビットマップ方式フォントをアウトラインフォントに変換し、アウトラインフォントの輪郭特徴を抽出したものを標準パターン２８として、正規化文字５の輪郭特徴と比較する加重方向ヒストグラムで行うようにしてもよい。
【００４８】
さらに、ビットマップ方式フォントをアウトラインフォントに変換し、アウトラインフォントの文字内の所定の点から８方向に触手を伸ばして求まる所定点の連結長を抽出したものを標準パターン２８として、正規化文字５の所定点の連結長と比較する外郭方向寄与度特徴で行うようにしてもよい。
【００４９】
また、ビットマップ方式フォントのドット位置を標準パターン２８として、正規化文字５のドット位置との比較を行うようにしてもよい。
【００５０】
このように正規化文字５をマイクロフィルムに記録された文字列と同一のフォントを基にした標準パターン２８を用いて比較することで、マッチング手段２６の認識精度を向上することができる。
【００５１】
また、知識処理手段２７は、マッチング手段２６がマッチングさせた標準パターン２８の候補から、文字列を作成し、日本語、英語等の単語情報などの言語情報を使用して、より正確な知識処理を行う。
【００５２】
また、本実施形態では、マイクロフィルムに記録された文字列が帳票データであるため、知識処理手段２７は、帳票データの数値を計算し、小計、合計などの計算結果からも、知識処理を行う。
【００５３】
ここで、帳票データの知識処理を行う知識処理手段２７は、対象となるマイクロフィルムに文字列を記録した際に用いられたプログラム、例えば、電子帳票ソフト、電子会計ソフトなどがある場合には、そのプログラムに基づいて検証ルールを確定し、検証ルールに基づいて認識した文字列の計算を行うことで、正確な知識処理を行うことができる。
【００５４】
また、対象となるマイクロフィルムに文字列を記録した際に用いられたプログラムがない場合には、知識処理手段２７は、検証ルールを形成して認識した文字列の知識処理を行う。
【００５５】
具体的には、例えば、マイクロフィルムの文字列が図５に示すように、帳票データの細かな数値が書かれた明細行４０と、この明細行４０の内容が計算されたトータル行４１とで分かれていた場合、明細行４０をトランザクション行として、トータル行４１と区別し、トランザクション行（明細行４０）とトータル行４１とを比較検討する検証ルールを形成する。この検証ルールは、例えば、ＯＣＲソフトが動作しているコンピュータ３０を操作するユーザに検証ルールを確認させることで確定する。そして確定した検証ルールに基づいて、トランザクション行内の計算を行い、計算結果とトータル行４１との値が異なる場合には、認識した文字データの正確な知識処理を行う。
【００５６】
なお、このような一連の知識処理は、例えば、認識した文字列を表計算ソフトに取り込み、検証ルールを表計算ソフトのマクロとして動作させることで行うことができる。
【００５７】
このように、マイクロフィルムに記録された文字列の種類に応じて知識処理手段２７が知識処理を行うことで、さらなる認識精度を向上することができる。
【００５８】
このようにＯＣＲソフトが認識した標準パターン２８は、コンピュータ３０が文字データとして取得し、例えば、モニタなどへの表示、プリンタへの出力、ＣＤ−Ｒ（登録商標）、ＤＶＤ−Ｒなどの他の記録媒体に出力することができる。
【００５９】
また、例えば、スキャナ１０で読み込んだイメージデータ１は、ＯＣＲソフト２０により文字データとして認識した後に、所望のフォント、色、配置などに容易に変更することができる。これにより、プリンタ等で紙などに印刷する際に、読みやすく整理し易い状態での印刷が可能となる。勿論、イメージデータ１をＯＣＲソフト２０を介さずに直接プリンタから出力するようにしてもよい。
【００６０】
さらに、マイクロフィルムに記録された文字列を高精度に認識して文字データとすることで、マイクロフィルムの全文検索などを行うことができる。これにより、マイクロフィルムに文字列を記録する際に、他の記録媒体に文字データなどの電子データを記録する必要がない。
【００６１】
（他の実施形態）
以上、本発明の実施形態１を説明したが、本発明は上述したものに限定されるものではない。
【００６２】
例えば、上述した実施形態１では、マイクロフィルムに記録された文字列を文字データとして認識するＯＣＲソフト２０をコンピュータ３０に搭載したが、特にこれに限定されず、例えば、スキャナ自体にＯＣＲソフトを搭載し、スキャナでイメージデータの取得と、イメージデータから文字データの認識とを行い、コンピュータに文字データを渡すようにしてもよい。
【００６３】
また、上述した実施形態１では、マイクロフィルムにＣＯＭシステムにより記録された文字列を文字データとして認識させるようにしたが、特にこれに限定されず、例えば、他のフォントを第２の標準パターンとしてさらに保持させるようにしてもよい。すなわち、ＣＯＭシステムにより記録されたマイクロフィルムを読み込む際は、ＣＯＭシステムと同様の標準パターンを用いて認識を行い、他の方法により記録されたものを読み込む際は、第２の標準パターンを用いて認識を行うようにすれば、他の方式での記録でも文字認識精度を低下させることがない。
【００６４】
さらに、上述した実施形態１では、マイクロフィルムに記録された文字列のフォントをビットマップ方式フォントとしたが、マイクロフィルムに記録された文字列のフォントはビットマップ方式フォントに限定されるものではなく、文字認識手段２４がマイクロフィルムに記録された文字列のフォントを標準パターンとして保持して、標準パターンを用いた文字の認識を行うことで、高精度な文字認識を行うことができる。
【００６５】
【発明の効果】
以上説明したように、本発明のマイクロフィルムＯＣＲシステムでは、マイクロフィルムの文字列の比較をする標準パターンとしてマイクロフィルムに記録された文字列と同等のフォントを用いることで、文字認識精度を向上することができる。
【図面の簡単な説明】
【図１】本発明の実施形態１に係るマイクロフィルムＯＣＲシステムの概略を示すブロック図である。
【図２】本発明の実施形態１に係るイメージデータの概略図である。
【図３】本発明の実施形態１に係るイメージデータの概略図である。
【図４】本発明の実施形態１に係る文字認識方法を示す概略図である。
【図５】本発明の実施形態１に係るマイクロフィルムの文字列を示す図である。
【符号の説明】
１イメージデータ
２文字領域
３１行の文字列
４１文字
１０スキャナ
２０ＯＣＲソフト
２１レイアウト解析手段
２２行切り出し手段
２３文字切り出し手段
２４文字認識手段
２５正規化手段
２６マッチング手段
２７知識処理手段
２８標準パターン
３０コンピュータ
４０明細行
４１トータル行

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a microfilm OCR system in which a microfilm on which a character string output by a computer is recorded is read by a scanner, and the computer recognizes the data as character data by OCR software.
[0002]
[Prior art]
OCR (Optical Character Reader) devices, which read a recording medium on which character strings are printed as image data with a photosensitive element and recognize character areas in the read image data as character data, and OCR software that uses a computer together with a scanner having a photosensitive element, have become widespread (for example, see Non-Patent Document 1).
[0003]
The OCR device or OCR software reads the character area of the recording medium as follows. First, layout analysis finds the character areas in the read image data and determines the reading order. Next, line extraction separates the continuous character area into individual lines. Next, character extraction cuts each extracted line into individual characters. Finally, character recognition recognizes the character data of each character.
[0004]
In character recognition, normalization, matching, and knowledge processing are performed on each character, in that order, to cope with variations in character size, typeface (Mincho, Gothic, textbook style), crushing, and blurring.
[0005]
In normalization, the single character to be recognized is converted to a fixed size, which absorbs deformations such as vertical or horizontal elongation.
[0006]
Conventionally, in matching, characters are identified by simply superimposing normalized characters on a standard pattern registered in advance and comparing them.
[0007]
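However, with this method it was difficult to obtain high recognition accuracy because of variations in character inclination, typeface, crushing, and blurring. Therefore, instead of comparing character shapes directly, a method of identifying characters from their features has been adopted.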
[0008]
As a method of identifying characters from their features, for example, the normalized character is decomposed into four directional components (vertical, horizontal, and oblique), these components are extracted as the features of the character, and the extracted feature components are then computed and compared with character features registered in advance (standard patterns).
[0009]
Further, in knowledge processing, the recognized character data of the individual characters is joined into continuous character string data, kanji strings, katakana strings, and the like are extracted from this character string data, and those parts are collated with a word dictionary registered in advance so that misread parts are corrected automatically.
[0010]
By performing each process in this manner, the OCR device or the OCR software can recognize the character string recorded on the recording medium as character data.
[0011]
Meanwhile, COM (Computer Output Microfilm) systems, which output data processed by a computer onto microfilm as human-readable character strings, figures, and the like for use on a film basis, have conventionally been used (for example, see Non-Patent Document 2).
[0012]
This COM system has the advantages of going paperless, saving storage space, and suffering little deterioration in storage, which makes it well suited to long-term preservation. It can also cope with unpredictable risks of digital data storage, such as system failures and computer virus intrusion. Further, the contents recorded on the microfilm cannot be falsified, which gives the system a strong advantage in building document management.
[0013]
In this COM system, data from a computer was formerly recorded on microfilm as character strings, figures, and the like by the CRT method, in which the character strings and figures are displayed on the face of a CRT and optically projected onto the microfilm.
[0014]
However, in recent years, a laser method that records by projecting laser light from a He-Ne laser or an Ar laser directly onto the microfilm has been adopted because of its higher processing speed, higher resolution, and easier maintenance.
[0015]
Characters recorded in this manner are output in formats such as kanji COM, which outputs kanji, alphanumeric, and kana characters, and graphic COM, which outputs graphs and drawings. The characters recorded on the microfilm are widely rendered in bitmap fonts (dot fonts), in which each character is represented by 16 × 16 to 40 × 40 dots.
[0016]
[Non-patent document 1]
Media Drive Co., Ltd., [online], [searched on February 21, 2003], Internet <URL: http://www.mediadrive.co.jp/technology/whatsocr/overview.html>
[Non-patent document 2]
Masao Bando, 9 others, "COM System Guide-Computer Output Microfilm", Japan Image Information Management Association, November 15, 1998.
[0017]
[Problems to be solved by the invention]
However, even if a microfilm recorded by the above-mentioned COM system or the like is read as image data and an existing OCR device or OCR software is used to recognize it as character data in electronic form, the character strings recorded on the microfilm are minute, so the character recognition accuracy of existing OCR devices and OCR software is low.
[0018]
In view of such circumstances, an object of the present invention is to provide a microfilm OCR system with improved character recognition accuracy when reading a character string recorded on a microfilm as character data.
[0019]
[Means for Solving the Problems]
A first aspect of the present invention that solves the above problem is a microfilm OCR system in which a microfilm on which a character string is recorded is read as image data by a scanner and a computer is made to recognize a character area in the image data as character data, characterized in that the font of the character string recorded on the microfilm is held as a standard pattern, and the standard pattern is compared with the character area of the image data so that the computer recognizes the character area as the character data.
[0020]
A second aspect of the present invention is the microfilm OCR system according to the first aspect, wherein the standard pattern comprises an outline font obtained by converting the font of the character string recorded on the microfilm.
[0021]
A third aspect of the present invention is the microfilm OCR system according to the first aspect, wherein the recognition of the character data is performed by directly using the font of the character string recorded on the microfilm.
[0022]
A fourth aspect of the present invention is the microfilm OCR system according to any one of the first to third aspects, wherein the microfilm is a COM film recorded by a COM system.
[0023]
A fifth aspect of the present invention is the microfilm OCR system according to any one of the first to fourth aspects, wherein the recognition of the character data includes layout analysis that finds the character area of the image data and determines the reading order, line extraction that separates the continuous character area into individual lines, character extraction that cuts each extracted line into individual characters, and character recognition that recognizes the character data of each character.
[0024]
A sixth aspect of the present invention is the microfilm OCR system according to the fifth aspect, wherein the recognition of the character data for each character consists of normalization, matching, and knowledge processing.
[0025]
A seventh aspect of the present invention is the microfilm OCR system according to the fifth or sixth aspect, wherein the character string recorded on the microfilm is form data, and the knowledge processing performs calculations on the recognized character data and performs correctness processing.
[0026]
An eighth aspect of the present invention is the microfilm OCR system according to the seventh aspect, wherein the knowledge processing establishes a verification rule based on the program that was used when the character string was recorded on the microfilm, and performs correctness processing by calculating the character data based on the verification rule.
[0027]
A ninth aspect of the present invention is the microfilm OCR system according to the seventh aspect, wherein the knowledge processing establishes a verification rule based on the recognized character data, and performs correctness processing by calculating the character data based on the verification rule.
[0028]
In the present invention, the font of the character string recorded on the microfilm is held as a standard pattern, and the character string is recognized as character data by comparison with this standard pattern, so the character recognition accuracy for the minute character strings recorded on the microfilm can be improved.
[0029]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the present invention will be described in detail based on embodiments.
[0030]
(Embodiment 1)
FIG. 1 is a block diagram schematically showing a microfilm OCR system.
[0031]
As shown in FIG. 1, the microfilm OCR system includes a scanner 10 that reads a microfilm on which a character string is recorded as image data, and a computer 30 having OCR software 20 for recognizing character data from a character area of the image data read by the scanner 10.
[0032]
The microfilm is, for example, a roll film in which 16 mm or 35 mm film is wound on a reel in a length of 30.5 m (100 ft) or 65.5 m (215 ft), or a 105 mm × 148 mm microfiche (JIS Z6001), on which information including character strings is recorded in negative or positive form. Examples of the information recorded on such a microfilm include form data. The microfiche may be in either roll or sheet form, and its shape is not particularly limited.
[0033]
The scanner 10 reads a character string, a graphic, and the like recorded on a microfilm as image data 1, and includes, for example, a flat head scanner or a film scanner using a linear CCD. When a continuous roll of microfilm is used, the scanner may be provided with a conveying means for conveying the microfilm so that the microfilm can be read continuously as image data.
[0034]
The computer 30 includes a computer main body (not shown), a monitor, an operation keyboard, and the like. The computer 30 contains the OCR software 20, which recognizes a character area of the image data 1 read by the scanner 10 as character data.
[0035]
The OCR software 20 built into the computer 30 includes a layout analysis means 21 for analyzing a character area of the image data 1 received from the scanner 10, a line extraction means 22 for extracting a line from the analyzed character area, a character extraction means 23 for extracting one character from the line extracted by the line extraction means 22, and a character recognition means 24 for recognizing the extracted character.
[0036]
The layout analysis unit 21 analyzes the character area 2 of the image data 1 read by the scanner 10, as shown in FIG. 2(a). The layout analysis unit 21 may analyze the character area 2 automatically, or the image data 1 may be displayed on a monitor of the computer 30 so that the user selects the area.
[0037]
The line cutout unit 22 cuts out one line of the character string 3 from the character area 2 analyzed by the layout analysis unit 21, as shown in FIG. 2B. This one-line character string 3 is automatically cut out in correspondence with horizontal writing or vertical writing.
[0038]
Further, as shown in FIG. 2 (c), the character extracting means 23 extracts one character 4 from the character string 3 of one line extracted by the line extracting means 22.
[0039]
As described above, one character 4 cut out from the image data 1 by the layout analysis unit 21, the line cutout unit 22, and the character cutout unit 23 is recognized as character data by the character recognition unit 24.
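As a non-limiting illustration, the line extraction and character extraction described above could be sketched as follows. The description does not state how the lines and characters are located, so the projection-profile approach, the 0/1 image representation, and the names `runs`, `cut_lines`, and `cut_characters` are all assumptions made for this sketch, which handles horizontal writing only.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def cut_lines(region: Image) -> List[Image]:
    """Split a character area into text lines using the ink count per row."""
    row_profile = [sum(row) for row in region]
    return [region[top:bottom] for top, bottom in runs(row_profile)]


def cut_characters(line: Image) -> List[Image]:
    """Split one text line into single characters using the ink count per column."""
    col_profile = [sum(col) for col in zip(*line)]
    return [[row[left:right] for row in line] for left, right in runs(col_profile)]
```

Blank rows separate lines and blank columns separate characters in this sketch; real film scans would likely need noise handling that is outside its scope.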
[0040]
The character recognition unit 24 includes a normalization unit 25, a matching unit 26, a knowledge processing unit 27, and a standard pattern 28.
[0041]
The normalizing means 25 changes the size of the one character 4 extracted by the character extracting means 23, shown in FIG. 3(a), to a predetermined size to form the normalized character 5 shown in FIG. 3(b), thereby absorbing deformations such as vertical or horizontal elongation.
[0042]
The normalizing means 25 sets the size of the normalized character 5 equal to that of the standard pattern 28 held by the character recognizing means 24. Normalizing the one character 4 into the normalized character 5 of the same size as the standard pattern 28 in this way improves recognition accuracy when the one character 4 is compared with the standard pattern 28.
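By way of example only, the normalization to the size of the standard pattern could be sketched as below. The nearest-neighbour resampling and the function name `normalize` are assumptions of this sketch; the description only requires that the character be brought to the same size as the standard pattern 28.

```python
def normalize(glyph, size):
    """Resize a binary glyph (list of 0/1 rows) to size x size by nearest neighbour."""
    src_h, src_w = len(glyph), len(glyph[0])
    return [
        # Map each target pixel back to the source pixel that covers it.
        [glyph[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]
```

For a COM film whose font is, say, 16 × 16 dots, `size` would be 16, so that every extracted character is compared with the standard pattern at the same scale regardless of how tall or wide it was scanned.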
[0043]
Further, the matching means 26 compares the transformed normalized character 5 with the standard pattern 28 and gives the closest character candidate.
[0044]
Here, the standard pattern 28 held by the character recognizing means 24 is equivalent to the font of the character string recorded on the microfilm. When a COM film (a microfilm recorded by a COM system) is used as the microfilm, the font of the character strings normally recorded on the COM film is a bitmap font (dot font), in which one character is represented by 16 × 16 to 40 × 40 dots.
[0045]
In the comparison between the normalized character 5 and the standard pattern 28, for example, as shown in FIG. 4, the normalized character 5 is decomposed into four directional components (vertical, horizontal, and oblique), the four components are extracted as the features of the individual character, and the four components are expressed as 7 × 7 values each, giving 196 feature values in total. The feature values of each of the four components of the normalized character 5 are compared with the feature values of each of the four components of the standard pattern 28 by, for example, Euclidean geometry, and the character of the standard pattern 28 closest to the normalized character is given as a candidate.
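The feature comparison above could, purely as an illustration, be sketched as follows. The description does not define how each directional component is measured, so counting adjacent ink-pixel pairs per direction in a 7 × 7 grid, reading "Euclidean geometry" as a Euclidean distance, and the names `directional_features` and `closest_pattern` are all assumptions of this sketch.

```python
import math

# Assumed neighbour offsets (dy, dx) for the four stroke directions:
# horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]
GRID = 7


def directional_features(glyph):
    """Return 4 * 7 * 7 = 196 counts of adjacent ink pairs per direction and cell."""
    h, w = len(glyph), len(glyph[0])
    features = [0.0] * (len(DIRECTIONS) * GRID * GRID)
    for y in range(h):
        for x in range(w):
            if not glyph[y][x]:
                continue
            cell = (y * GRID // h) * GRID + (x * GRID // w)
            for d, (dy, dx) in enumerate(DIRECTIONS):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and glyph[ny][nx]:
                    features[d * GRID * GRID + cell] += 1.0
    return features


def closest_pattern(glyph, standard_patterns):
    """Return the label of the standard pattern nearest in Euclidean distance."""
    target = directional_features(glyph)
    return min(
        standard_patterns,
        key=lambda label: math.dist(
            target, directional_features(standard_patterns[label])
        ),
    )
```

Here `standard_patterns` is a hypothetical mapping from a character label to its bitmap in the font recorded on the microfilm; in practice the feature vectors of the standard patterns would presumably be computed once and stored rather than recomputed for every comparison.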
[0046]
The comparison between the normalized character 5 and the standard pattern 28 is not particularly limited to this. For example, the comparison may be performed by the extended cell feature method, in which the bitmap font is converted to an outline font, the edge features of the outline font are extracted and used as the standard pattern 28, and these are compared with the edge features of the normalized character 5.
[0047]
Alternatively, a bitmap font may be converted to an outline font, and the outline features of the outline font may be extracted and used as a standard pattern 28 using a weighted direction histogram for comparison with the outline features of the normalized character 5.
[0048]
Further, the comparison may be performed using the outer-contour direction contribution feature, in which the bitmap font is converted to an outline font, the connection lengths at predetermined points, obtained by extending feelers in eight directions from predetermined points within each character of the outline font, are extracted and used as the standard pattern 28, and these are compared with the connection lengths at the predetermined points of the normalized character 5.
[0049]
Further, the dot position of the bitmap font may be used as the standard pattern 28 to compare with the dot position of the normalized character 5.
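As one possible illustration of the dot-position comparison, the sketch below counts the dot positions at which the normalized character and each bitmap font glyph disagree and selects the glyph with the fewest disagreements. The mismatch count as the measure of closeness and the names `dot_mismatch` and `match_by_dots` are assumptions; the description does not specify how the dot positions are compared.

```python
def dot_mismatch(glyph, pattern):
    """Count dot positions where two equally sized bitmaps disagree."""
    return sum(
        a != b
        for glyph_row, pattern_row in zip(glyph, pattern)
        for a, b in zip(glyph_row, pattern_row)
    )


def match_by_dots(glyph, standard_patterns):
    """Return (label, mismatches) for the font glyph with the fewest differing dots."""
    label = min(
        standard_patterns,
        key=lambda k: dot_mismatch(glyph, standard_patterns[k]),
    )
    return label, dot_mismatch(glyph, standard_patterns[label])
```

Because the normalized character has already been brought to the size of the standard pattern, the two bitmaps can be compared position by position without further alignment in this sketch.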
[0050]
By comparing the normalized character 5 using the standard pattern 28 based on the same font as the character string recorded on the microfilm in this way, the recognition accuracy of the matching means 26 can be improved.
[0051]
Further, the knowledge processing means 27 creates a character string from the candidates of the standard pattern 28 matched by the matching means 26, and performs more accurate knowledge processing by using language information such as word information for Japanese, English, and other languages.
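The use of word information could be sketched, as an assumption-laden example, in the following way. The per-character candidate lists ordered from closest to least close, the rank-sum cost, and the name `correct_with_dictionary` are hypothetical choices made for this sketch and are not taken from the description.

```python
def correct_with_dictionary(candidates, dictionary):
    """Pick the dictionary word best supported by the per-character candidates.

    candidates: one list per character position, ordered from closest match.
    dictionary: iterable of known words.
    """
    best_word, best_cost = None, None
    for word in dictionary:
        if len(word) != len(candidates):
            continue
        if any(ch not in options for ch, options in zip(word, candidates)):
            continue
        # Lower cost means the word uses higher-ranked candidates.
        cost = sum(options.index(ch) for ch, options in zip(word, candidates))
        if best_cost is None or cost < best_cost:
            best_word, best_cost = word, cost
    if best_word is None:
        # No dictionary word fits: fall back to the top candidate at each position.
        return "".join(options[0] for options in candidates)
    return best_word
```

For instance, if matching returned "0" ahead of "O" and "1" ahead of "L", the word "TOTAL" in the dictionary would still be recovered because every one of its letters appears among the candidates.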
[0052]
Further, in this embodiment, since the character string recorded on the microfilm is form data, the knowledge processing means 27 calculates the numerical values of the form data and also performs knowledge processing based on calculation results such as subtotals and totals.
[0053]
Here, when the program that was used to record the character string on the target microfilm, for example electronic form software or electronic accounting software, is available, the knowledge processing means 27 that performs knowledge processing of the form data determines a verification rule based on that program and calculates the recognized character string based on the verification rule, so that accurate knowledge processing can be performed.
[0054]
If there is no program used when the character string is recorded on the target microfilm, the knowledge processing unit 27 forms a verification rule and performs knowledge processing on the recognized character string.
[0055]
Specifically, for example, when the character string of the microfilm is divided, as shown in FIG. 5, into detail lines 40 in which the detailed numerical values of the form data are written and a total line 41 in which the contents of the detail lines 40 are totaled, the detail lines 40 are treated as transaction lines and distinguished from the total line 41, and a verification rule for comparing the transaction lines (detail lines 40) with the total line 41 is formed. This verification rule is determined, for example, by having the user who operates the computer 30 running the OCR software confirm the verification rule. Then, the calculation within the transaction lines is performed based on the determined verification rule, and when the calculation result differs from the value in the total line 41, accurate knowledge processing of the recognized character data is performed.
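A verification rule of this kind could be illustrated as follows. The rule "the detail lines must add up to the total line", the integer amounts, and the names `check_total` and `flag_forms` are assumptions of this sketch; the actual rule would be whatever the user confirms for the form in question.

```python
def check_total(detail_values, total_value):
    """Return True when the recognized detail lines add up to the recognized total."""
    return sum(detail_values) == total_value


def flag_forms(forms):
    """Return the indexes of forms whose recognized figures are inconsistent.

    forms: list of (detail_values, total_value) pairs of recognized integers.
    """
    return [
        index
        for index, (details, total) in enumerate(forms)
        if not check_total(details, total)
    ]
```

A flagged form indicates that at least one digit in its detail lines or total line was probably misread, so only those forms need further knowledge processing or review.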
[0056]
Note that such a series of knowledge processing can be performed, for example, by importing a recognized character string into spreadsheet software and operating the verification rule as a macro of the spreadsheet software.
[0057]
As described above, the knowledge processing unit 27 performs the knowledge processing according to the type of the character string recorded on the microfilm, so that the recognition accuracy can be further improved.
[0058]
The standard pattern 28 recognized by the OCR software in this way is acquired by the computer 30 as character data and can, for example, be displayed on a monitor or the like, output to a printer, or output to another recording medium such as a CD-R (registered trademark) or a DVD-R.
[0059]
Further, for example, after the image data 1 read by the scanner 10 is recognized as character data by the OCR software 20, it can be easily changed to a desired font, color, arrangement, and the like. As a result, when printing on paper or the like with a printer or the like, printing can be performed in a state that is easy to read and organize. Of course, the image data 1 may be directly output from the printer without using the OCR software 20.
[0060]
Furthermore, by recognizing a character string recorded on the microfilm with high accuracy and forming it as character data, a full-text search of the microfilm can be performed. This eliminates the need to record electronic data such as character data on another recording medium when recording a character string on the microfilm.
[0061]
(Other embodiments)
As described above, the first embodiment of the present invention has been described, but the present invention is not limited to the above.
[0062]
For example, in the first embodiment described above, the OCR software 20 for recognizing a character string recorded on a microfilm as character data is installed in the computer 30, but the present invention is not limited to this. For example, the OCR software is installed in the scanner itself. Then, the scanner may acquire the image data, recognize the character data from the image data, and pass the character data to the computer.
[0063]
In the first embodiment described above, the character string recorded on the microfilm by the COM system is recognized as character data. However, the present invention is not particularly limited to this. For example, another font may additionally be held as a second standard pattern. That is, when reading a microfilm recorded by the COM system, recognition is performed using the same standard pattern as that of the COM system, and when reading a microfilm recorded by another method, recognition is performed using the second standard pattern, so that character recognition accuracy does not decrease even for film recorded by another method.
[0064]
Furthermore, in the first embodiment described above, the font of the character string recorded on the microfilm is a bitmap font, but the font of the character string recorded on the microfilm is not limited to the bitmap font. The character recognition means 24 holds the font of the character string recorded on the microfilm as a standard pattern and performs character recognition using the standard pattern, so that highly accurate character recognition can be performed.
[0065]
[Effect of the invention]
As described above, the microfilm OCR system of the present invention can improve character recognition accuracy by using, as the standard pattern against which the character strings of the microfilm are compared, a font equivalent to that of the character strings recorded on the microfilm.
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing a microfilm OCR system according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram of image data according to the first embodiment of the present invention.
FIG. 3 is a schematic diagram of image data according to the first embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating a character recognition method according to the first embodiment of the present invention.
FIG. 5 is a diagram showing a character string of the microfilm according to the first embodiment of the present invention.
[Explanation of symbols]
1 Image data
2 Character area
3 One line character string
4 One character
10 Scanner
20 OCR software
21 Layout analysis means
22 Line extraction means
23 Character extraction means
24 Character recognition means
25 Normalization means
26 Matching means
27 Knowledge processing means
28 Standard pattern
30 Computer
40 Detail line
41 Total line

Claims

1. A microfilm OCR system in which a microfilm on which a character string is recorded is read as image data by a scanner and a character area in the image data is recognized by a computer as character data, wherein a font of the character string recorded on the microfilm is held as a standard pattern, and the standard pattern is compared with the character area of the image data to cause the computer to recognize the character area as the character data.

2. The microfilm OCR system according to claim 1, wherein the standard pattern comprises an outline font obtained by converting a font of the character string recorded on the microfilm.

3. The microfilm OCR system according to claim 1, wherein the recognition of the character data is performed directly using a font of the character string recorded on the microfilm.

4. The microfilm OCR system according to any one of claims 1 to 3, wherein the microfilm is a COM film recorded by a COM system.

5. The microfilm OCR system according to any one of claims 1 to 4, wherein the recognition of the character data includes layout analysis that finds the character area of the image data and determines the reading order, line extraction that separates the continuous character area into individual lines, character extraction that cuts each extracted line into individual characters, and character recognition that recognizes the character data of each character.

6. The microfilm OCR system according to claim 5, wherein the recognition of the character data for each character includes normalization, matching, and knowledge processing.

7. The microfilm OCR system according to claim 5 or 6, wherein the character string recorded on the microfilm is form data, and the knowledge processing performs calculations on the recognized character data and performs correctness processing.

8. The microfilm OCR system according to claim 7, wherein the knowledge processing establishes a verification rule based on the program used when the character string was recorded on the microfilm, and performs correctness processing by calculating the character data based on the verification rule.

9. The microfilm OCR system according to claim 7, wherein the knowledge processing establishes a verification rule based on the recognized character data, and performs correctness processing by calculating the character data based on the verification rule.