JP3159087B2

JP3159087B2 - Document collation device and method

Info

Publication number: JP3159087B2
Application number: JP28793596A
Authority: JP
Inventors: 慎治佐瀬
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-10-30
Filing date: 1996-10-30
Publication date: 2001-04-23
Anticipated expiration: 2016-10-30
Also published as: JPH10134141A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical Field of the Invention] The present invention relates to a method and apparatus for collating documents, and more particularly to a document collation method and apparatus that support the task of confirming that a document in electronic data format, created with a word processor or the like, is identical to a document printed on a paper sheet or the like by a printer or the like.

[0001]

[Prior Art] There are many situations in which two documents must be confirmed to be identical. A typical example is confirming the identity of highly important documents such as contracts. Hereinafter, the task of confirming that a plurality of documents are identical is referred to as document collation. Because such documents are usually written on paper, document collation has until now been performed by placing the two sheets side by side and checking them visually, item by item. Since this work relies almost entirely on manual effort, the burden on the operator is heavy.

[0002] Meanwhile, the recent spread of personal computers and communication networks has promoted the conversion of documents into electronic data. Collation between documents in electronic data format is usually built into OS commands and can be performed very easily. Typical examples are the FC command in MS-DOS (a registered trademark of Microsoft Corporation) and the cmp command in UNIX (a registered trademark of X/Open Company Ltd.). Details such as the functions of these commands are described in the manual of each OS (Operating System), in Koji Murase, "Introductory MS-DOS (ASCII Learning System Introductory Course)", p. 123 (ASCII Publishing), or in Stephen Coffin, "UNIX SVR4", p. 150 (SoftBank Corp.).

[0003]

[Problems to Be Solved by the Invention] However, the above known documents propose nothing regarding collation between a document in electronic data format and a document written on paper. As documents in electronic data format become more widespread, situations in which paper and electronic documents coexist will arise and the need for such confirmation is expected to grow, but conventional document collation methods cannot handle it.

[0004] Collation between an electronic document and a paper document places an even greater burden on a person than the comparison between paper documents described above. The main reasons are considered to be the following 1) to 3): 1) it is difficult to place the documents side by side and compare them within a single field of view; 2) the layout of the document generally differs between the electronic data format and the paper; 3) the character shapes used generally differ. An object of the present invention is to automate such document collation work and reduce the burden on the operator.

[0005] In addition, when an important document is prepared and submitted, its contents need to be checked once before submission as a precaution. Another object of the present invention is to support this kind of work as well.

[0006]

[Means for Solving the Problems] To solve the above problems, the document collation apparatus of the present invention comprises: a recognition unit that optically reads characters written on a sheet; a collating unit that collates, one character at a time, character codes stored in advance as electronic data against the recognition result from the recognition unit; and a confirmation unit that displays the characters corresponding to the character codes of the electronic data while switching the display method according to the result of the collation by the collating unit, so as to indicate the positions to which the operator should pay particular attention. The recognition result from the recognition unit includes a character code, obtained from a character image produced by optically scanning a character written on the sheet, and a reliability for the recognition result of that character code. The confirmation unit displays the characters of the electronic data while switching the display method, so as to indicate the positions to which the operator should pay particular attention, according to both the result of collating the character code of the electronic data against the character code of the recognition result and the reliability for the character code of the recognition result. The confirmation unit displays the characters of the electronic data using a different display method in each of the following cases: a first case in which the character code of the electronic data and the character code of the recognition result match and the reliability for the character code is higher than a preset first value; a second case in which they match and the reliability is lower than the first value; and a third case in which they do not match.

[0007] In the document collation method of the present invention, characters corresponding to the character codes of electronic data stored in advance are printed on a sheet; the characters printed on the sheet are optically read to obtain a recognition result; the character codes of the electronic data are collated against the recognition result one character at a time to obtain a collation result; the characters corresponding to the character codes of the electronic data are displayed while the display method is switched according to the collation result, so as to indicate the positions to which the operator should pay particular attention; and the collation result is confirmed visually.

[0008]

[Embodiments of the Invention] Next, a first embodiment of the document collation method and apparatus of the present invention will be described with reference to the drawings.

[0009] In the document collation method of this embodiment, character recognition processing is used as the means for automatically collating character data in electronic form against character data on paper. At the current level of character recognition technology, however, not every character can be read with complete accuracy. This is true of handwriting, of course, but accuracy also drops for printed type when there are many character classes, as with kanji, or when fonts become diverse. The gist of this embodiment is therefore to reduce the human burden by letting character recognition handle the portions it can collate and leaving the remaining portions to a person. To this end, it is also important to make the manual collation itself easier.

[0010] Nevertheless, raising the accuracy of character recognition as far as possible is particularly important because it directly reduces the human burden. The general reasons why character recognition accuracy falls short can be organized into the following (1) to (3). (1) The (automatic) analysis of the reading area is wrong, so a different area is read. (2) Character segmentation is wrong. This error tends to occur in text that mixes full-width and half-width characters and where the character size changes; in terms of character type, it tends to occur at positions containing particular characters such as separated characters (characters such as ハ, は and 仙, which consist of disconnected parts) and punctuation marks. (3) Individual character recognition is wrong. In general, accuracy falls as the number of character classes grows and as character shapes become more diverse.

[0011] In document collation, the electronic data can be regarded as the correct-answer data for the character recognition processing. This embodiment exploits that property to mitigate problems (1) to (3) above and improve character recognition accuracy.

[0012] FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention. The sheet 11 is the paper on which one of the two sets of data to be collated is written. The electronic data 12 is the other set of data to be compared and is usually stored as character codes on a storage medium such as a hard disk or memory. FIG. 2 shows an example of the sheet 11. Depending on the OS and other factors, the character codes of the electronic data 12 are JIS codes, Shift JIS codes, EUC codes, or the like.

[0013] The sheet 11 is optically scanned by the character recognition unit 21, and its contents are converted into character codes and output. The character recognition unit 21 can be equivalent to the general character recognition devices (hereinafter, OCR (Optical Character Reader)) that have already been commercialized in various forms; it is a known component that requires no special technology.

[0014] The character recognition unit 21 is now described in a little more detail. FIG. 3 shows the structure of a typical character recognition unit. First, the optical scanning unit 211 optically scans the sheet 11 and obtains a digital image through A/D conversion. The digital image is acquired as a binary image, a grayscale image, a color image, or the like, depending on the recognition method used downstream, the required processing speed, and the hardware scale. A dropout image, in which a specific color has been optically dropped out, may also be used. The sheet may be placed by hand, as with a flatbed scanner, or fed automatically. With automatic feeding, it is also possible to provide two stackers, as in a form OCR, and to eject sheets into an accept stacker or a reject stacker depending on whether the document collation resulted in a match or a mismatch.

[0015] The area extraction unit 212 extracts reading areas (character areas) from the acquired digital image. The reading areas may be extracted automatically from information such as projections and the arrangement of connected black regions, as in a document OCR, or by using a format definition given in advance, as is usual in a form OCR.

[0016] The character segmentation unit 213 cuts individual characters out of the reading area. Here too, the characters may be cut out automatically based on the features used for area extraction in a document OCR, or by using the format definition.

[0017] The character recognition unit 214 extracts features from each segmented character and recognizes it using a character dictionary. A wide variety of character recognition methods have been developed, and any of them can be used. For example, one approach finds the minimum distance or maximum similarity between the features of the input character and the features of each character category registered in the dictionary in advance; another uses a tree structure and takes the leaf that is reached as the decision. Either is acceptable.

[0018] As described above, the character recognition unit 21 may use any method as long as it can ultimately convert the characters to be read on the sheet 11 into character codes. Its one distinctive feature is that it does not need to correct the character recognition result using linguistic information, a step usually called knowledge processing or character recognition post-processing. Correcting the recognition result with linguistic information risks altering the result to fit the knowledge that is held, which makes it unsuitable for document collation as in the present invention. Moreover, since the electronic data 12 is itself a very rich source of knowledge, it is better not to use any other knowledge. How the electronic data 12 is used is described later.

[0019] Next, the collating unit 22 collates the character codes output by the character recognition unit 21 against the electronic data 12 and checks whether they match. The operation of the collating unit 22 is described with reference to FIG. 4. In FIG. 4, 31 is an example of the character recognition result for the sheet of FIG. 3, and 32 is the electronic data 12 as read out. The character recognition result 31 and the electronic data 32 in FIG. 4 are character codes, but they are shown here as characters to make the explanation easier to follow.

[0020] In the character recognition result 31, the katakana "テ" in the ninth position has been read as the symbol "〒", and the katakana "ト" in the eleventh position has been read as the kanji "卜" (boku). To determine whether corresponding character codes match, the collating unit 22 compares the character codes at corresponding positions in the character recognition result 31 and the electronic data 32 one character at a time, starting from the front, and stores the result for each character in the document collation result 33. In the document collation result 33, 0 indicates a match and 1 indicates a mismatch.
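The position-by-position comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `collate`, the requirement that both inputs have the same number of characters, and the sample strings are hypothetical and are not taken from the patent or its figures.

```python
def collate(recognized: str, electronic: str) -> list[int]:
    """Compare two strings position by position: 0 = match, 1 = mismatch."""
    if len(recognized) != len(electronic):
        # Assumption of this sketch: recognition yields one result per character.
        raise ValueError("inputs must have the same number of characters")
    return [0 if r == e else 1 for r, e in zip(recognized, electronic)]


# Hypothetical sample text (not the text of FIG. 2 or FIG. 4).
electronic = "これは文書照合のテストです。"
recognized = "これは文書照合の〒ス卜です。"  # "テ" read as "〒", "ト" read as "卜"
result = collate(recognized, electronic)
print(result)
```

With this sample, the ninth and eleventh entries of `result` are 1 and all others are 0, mirroring the kind of misreading described for the character recognition result 31.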

[0021] FIG. 4 is only one example of a collation result. In the present invention it is sufficient to know whether each character code matches or not; the order of collation and the way the collation result is represented are not essential.

[0022] Finally, the visual confirmation unit 23 displays the collation result and the operator checks it visually. The electronic data 32 is shown on the display, and, to indicate the positions to which the operator should pay particular attention, mismatched portions are highlighted based on the match/mismatch information in the collation result 33. This procedure is described with reference to FIG. 5. The electronic data 32 is read one character at a time (step 232), in order from the beginning (step 231). The collation result at the corresponding position in the collation result 33 is consulted (step 233); if it is a match (0), the character is displayed normally (step 234), and if it is a mismatch (1), it is highlighted (step 235). This procedure is repeated up to the last character of the electronic data (step 236).
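The display loop of steps 231 to 236 can be sketched as follows. Because a plain-text example cannot show reverse video or size changes, this sketch assumes that emphasis is rendered by wrapping a character in square brackets; the function name `render` and that bracket convention are illustrative choices, not part of the patent.

```python
def render(electronic: str, collation: list[int]) -> str:
    """Return the electronic text with mismatched characters wrapped in brackets."""
    shown = []
    # Read the electronic data one character at a time, from the beginning.
    for ch, flag in zip(electronic, collation):
        if flag == 0:
            shown.append(ch)         # match: normal display
        else:
            shown.append(f"[{ch}]")  # mismatch: emphasized display
    # The loop ends after the last character of the electronic data.
    return "".join(shown)


highlighted = render("abcde", [0, 1, 0, 0, 1])
print(highlighted)
```

Only the electronic data is rendered; the recognition result influences the output solely through the per-character flags.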

[0023] Highlighting methods include reverse video, blinking, changing the color, changing the character style (for example, italics), and changing the size. FIG. 6 shows an example in which the size is changed.

[0024] During recognition, the character recognition unit 21 can also obtain layout information for the characters written on the paper. If the visual confirmation unit 23 uses this layout analysis result when displaying the collation result, the layout of the electronic data can be brought close to the layout on paper, which makes the confirmation work easier. Character recognition generally proceeds line by line, so the simplest layout information is obtained by appending a line feed code (usually 0x0a) to the character recognition result each time recognition of a line finishes. Such layout adjustment becomes possible if the collating unit 22 ignores the line feed codes in the character recognition result when collating and the visual confirmation unit 23 refers to the positions of those line feed codes when displaying the electronic data.
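A minimal sketch of this line feed handling follows. It assumes that the electronic text itself contains no line feeds and that line ends are recorded as character counts; the function names `strip_and_record_breaks` and `lay_out` are hypothetical.

```python
def strip_and_record_breaks(recognized: str) -> tuple[str, list[int]]:
    """Remove line feeds and record how many characters precede each one."""
    chars = []
    breaks = []
    for ch in recognized:
        if ch == "\n":                 # line feed code 0x0a appended per line
            breaks.append(len(chars))
        else:
            chars.append(ch)
    return "".join(chars), breaks


def lay_out(electronic: str, breaks: list[int]) -> str:
    """Break the electronic text at the positions seen on paper."""
    lines = []
    start = 0
    for pos in breaks:
        lines.append(electronic[start:pos])
        start = pos
    if start < len(electronic):        # text after the last recorded break
        lines.append(electronic[start:])
    return "\n".join(lines)


flat, breaks = strip_and_record_breaks("abc\nde\n")
laid_out = lay_out("abcde", breaks)
print(flat, breaks)
print(laid_out)
```

The flattened string is what would be passed to collation, while the recorded positions are used only when the electronic data is displayed.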

[0025] Besides the display described in this embodiment, printing means such as a printer can also serve as the display means, with the same effect.

[0026] Based on the display data showing the collation result from the collating unit 22, the operator visually checks the sheet 11 against the display data, concentrating on the highlighted portions, and confirms whether the documents match. Document collation ends here, but follow-up processing is set up as appropriate for the operating environment: entering the result with a keyboard, mouse, or the like; specifying the sheet's ejection destination when there is more than one; or reflecting the result of the document collation in the electronic data.

[0027] Such an invention can easily be realized by, for example, combining a personal computer with a scanner or a personal computer with an OCR.

[0028] Next, a second embodiment of the present invention will be described. This embodiment differs from the first embodiment in only two places, the output format of the recognition result from the character recognition unit 21 and the display method of the visual confirmation unit 23, so the description focuses on these. The character recognition unit 214 shown in FIG. 3 recognizes each character image and outputs a character code as the result; in this embodiment, the character image used for recognition is passed to the collating unit 22 together with the recognition result (character code). The character images are stored in a form that shows which character position, counted from the beginning, each image corresponds to, and that allows random access. One way is to store each character image in a file named CIXXX.YYY. Here XXX indicates the position of the character from the beginning; for the second character, for example, XXX = 002. The extension YYY indicates a commonly used image storage format. The leading CI is a name that denotes a character image and can be chosen freely.
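The CIXXX.YYY naming scheme can be illustrated with a small helper. The function name `char_image_filename` and the default extension `bmp` are assumptions made for this sketch; the text leaves the image format open.

```python
def char_image_filename(position: int, extension: str = "bmp", prefix: str = "CI") -> str:
    """Build a CIXXX.YYY file name for the character at a 1-based position."""
    if not 1 <= position <= 999:
        # Three digits are available for the position in this naming scheme.
        raise ValueError("position must be between 1 and 999")
    return f"{prefix}{position:03d}.{extension}"


second = char_image_filename(2)
print(second)
```

Because the position is encoded in the name, the image for any character can be opened directly without scanning the others, which is the random access property the text asks for.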

[0029] In the collating unit 22, the character codes of the recognition result are processed as before, while the character images are passed on to the visual confirmation unit 23 without being used for any processing. When the visual confirmation unit 23 displays the character codes of the electronic data to show the collation result, it displays the corresponding character images alongside them, above or below. The display is easier to read if the character images are enlarged or reduced to suit the screen size and the display size of the electronic data's characters. Because two rows of similar text can be distracting, it is also possible to display only the character image corresponding to the character at the cursor position in the display data, or only the images for mismatched portions. Image display and enlargement/reduction are well-known techniques, so a detailed description is omitted.

[0030] In this embodiment it is sufficient that the visual confirmation unit 23 can identify the character images to display, so the interface from the character recognition unit is free within that constraint. The embodiment has been described using a file-based image interface, but the interface may instead consist of an access table and contiguously stored images, as shown in FIG. 7. Alternatively, as shown in FIG. 8, the visual confirmation unit 23 may receive only the segmentation positions and then request just the character images it needs from the character recognition unit 21.

[0031] Next, a third embodiment of the present invention will be described.

[0032] This embodiment also differs from the first and second embodiments from the output format of the recognition result of the character recognition unit 21 onward, so that part is described.

[0033] First, the character recognition unit 21 outputs a recognition reliability for each character in addition to the character code that is its recognition result. In character recognition that uses similarity or distance as the decision criterion, the recognition reliability can be the similarity, the reciprocal or sign-inverted value of the distance, or the size of the difference between the first and second candidates for either measure. When character recognition uses a tree structure, a reliability can be defined in advance for each node and leaf, and the accumulated value over the nodes and leaf traversed during the decision can be used as the reliability.
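The reliability measures listed for similarity-based and distance-based recognition can be sketched as below. The function names, the `mode` strings, and the choice to return infinity for a zero distance are assumptions of this sketch; the tree-based variant is not shown.

```python
def reliability_from_similarities(similarities: list[float], use_margin: bool = False) -> float:
    """Reliability as the top similarity, or as the gap between the top two."""
    ranked = sorted(similarities, reverse=True)
    if use_margin:
        return ranked[0] - ranked[1]   # first candidate minus second candidate
    return ranked[0]


def reliability_from_distances(distances: list[float], mode: str = "reciprocal") -> float:
    """Reliability derived from distances, where a smaller distance is better."""
    ranked = sorted(distances)
    if mode == "reciprocal":
        # Assumption: a perfect match (distance 0) gets unbounded reliability.
        return float("inf") if ranked[0] == 0 else 1.0 / ranked[0]
    if mode == "negate":
        return -ranked[0]              # sign inversion
    if mode == "margin":
        return ranked[1] - ranked[0]   # gap between the two best candidates
    raise ValueError(f"unknown mode: {mode}")


print(reliability_from_similarities([0.75, 0.5, 0.25], use_margin=True))
print(reliability_from_distances([4.0, 2.0]))
```

In every variant a larger value means a more trustworthy result, so the same threshold comparison can be applied regardless of which measure the recognizer provides.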

[0034] Next, the processing of the collating unit 22 is described with reference to FIG. 9. As in FIG. 4, 31 is the character recognition result and 32 is the electronic data. 34 is the reliability, where Sx denotes the reliability of the x-th character. The character recognition result is regarded as highly reliable when the reliability is S0 or more. The threshold S0 may be determined experimentally. The collation result 35 is defined as follows.

[0035]
Recognition result and electronic data match, reliability S0 or more … 0
Recognition result and electronic data match, reliability less than S0 … 2
Recognition result and electronic data do not match … 1
FIG. 9 shows the collation result 35 for the case where, of the reliabilities S1 to S15 of the recognition results of the individual characters, only S6 is less than S0 and the rest are S0 or more.
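The three-way rule for the collation result 35 can be sketched as follows. The function name `collate_with_reliability` and the sample values are hypothetical; the codes 0, 1 and 2 follow the definition in the text.

```python
def collate_with_reliability(recognized: str, reliabilities: list[float],
                             electronic: str, threshold: float) -> list[int]:
    """Return 0 (match, reliable), 2 (match, unreliable) or 1 (mismatch) per character."""
    results = []
    for r, s, e in zip(recognized, reliabilities, electronic):
        if r != e:
            results.append(1)      # mismatch, regardless of reliability
        elif s >= threshold:
            results.append(0)      # match with reliability S0 or more
        else:
            results.append(2)      # match with reliability below S0
    return results


codes = collate_with_reliability("abcd", [0.9, 0.9, 0.3, 0.9], "abcx", threshold=0.5)
print(codes)
```

A mismatch is checked first, so a character that differs is always reported as 1 even when its reliability is low.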

[0036] The visual confirmation unit 23 differs in that it switches the display method according to the collation result 35. For example, as shown in FIG. 10, relative to portions whose collation result 35 is 0, portions whose result is 1 are displayed at quadruple size (double width and double height) and portions whose result is 2 are displayed at double height.

[0037] Next, a fourth embodiment of the present invention will be described.

[0038] The collating unit 22 is given in advance a list of the character types that the character recognition unit 21 can read, in the form of a sequence of character codes. This makes it possible to check the electronic data for character types for which character recognition can never produce a correct result, that is, character types outside the reading target.

[0039] The processing of the collating unit 22 in this case is described with reference to FIG. 11. In FIG. 11, 41 is the character recognition result, 42 is its reliability, 43 is the electronic data, and 44 is the collation result. The eighth character of the electronic data 43 is the Roman numeral I, a character type outside the reading target of character recognition, and the eighth character of the character recognition result 41 is the letter I of the alphabet. For the reliability 42, S6 < S0 and the remaining values are greater than the threshold S0. The collating unit 22 obtains the collation result by the following procedure.

[0040] For each character code in the electronic data 43, the list of readable character types is searched. If the character is outside the reading target, the collation result is set to 3. If the character is a reading target, the collation result is set according to the procedure shown in the third embodiment. Following this procedure, the collation result 44 is obtained from the character recognition result 41, the reliability 42, and the electronic data 43. Searching a list of character codes for a specific character code is a known technique, so a detailed description is omitted.
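The procedure with the additional code 3 can be sketched as follows. The function name `collate_with_targets`, the use of a Python set for the list of readable character types, and the sample data are assumptions of this sketch.

```python
def collate_with_targets(recognized: str, reliabilities: list[float], electronic: str,
                         threshold: float, readable: set[str]) -> list[int]:
    """Per character: 3 = not readable, 1 = mismatch, 0 / 2 = match (reliable / unreliable)."""
    results = []
    for r, s, e in zip(recognized, reliabilities, electronic):
        if e not in readable:
            results.append(3)      # recognition can never be correct for this character
        elif r != e:
            results.append(1)
        elif s >= threshold:
            results.append(0)
        else:
            results.append(2)
    return results


readable = set("abcI")                 # hypothetical list of readable character types
electronic = "ab\u2160c"               # third character is the Roman numeral one (U+2160)
recognized = "abIc"                    # recognizer outputs the Latin letter I instead
codes = collate_with_targets(recognized, [0.9, 0.3, 0.9, 0.9], electronic, 0.5, readable)
print(codes)
```

The readability check looks only at the electronic data, so the code 3 is assigned before the recognition result for that position is considered at all.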

[0041] The visual confirmation unit 23 differs in that it switches the display according to the collation result 44. For example, as shown in FIG. 12, relative to portions whose collation result 44 is 0, portions whose result is 1 can be displayed at quadruple size, portions whose result is 2 at double height, and portions whose result is 3 at double width.

[0042] Next, a fifth embodiment of the present invention will be described. Compared with the other embodiments, this embodiment is characterized in particular by its character segmentation unit.

【００４３】図１３に示すごとく文字切出部２１３から
電子データ１２が参照可能となっている。もちろん、直
接電子データ１２を参照できなくてもデータ転送等で照
合部２２を経由してその内容が参照できてもよい。As shown in FIG. 13, the electronic data 12 can be referred to from the character extracting section 213. Of course, even if the electronic data 12 cannot be referred to directly, the contents may be referred to via the collation unit 22 by data transfer or the like.

【００４４】以下図１４を用いて文字切出部２１３の動
作について説明する。図１４において、５１は領域抽出
されたイメージの例で、５２はそのイメージの単一黒連
結領域の外接矩形を表わしたブロック図形である。イメ
ージ５１において平仮名は全角でカタカナは半角で印字
されている。通常の文字切出しは、“はテスト”の部分
の切出しをよく誤る。これは、この部分で文字ピッチが
乱れるためである。The operation of the character extraction unit 213 will be described below with reference to FIG. 14. In FIG. 14, reference numeral 51 denotes an example of an image after region extraction, and 52 is a block diagram showing the circumscribed rectangles of the single black connected regions of that image. In the image 51, hiragana is printed full-width and katakana half-width. Ordinary character extraction often fails on the 「はテスト」 portion, because the character pitch is irregular there.

【００４５】本実施形態では、この文字切出を安定に行
うために電子データを利用する。通常全角文字は２ｂｙ
ｔｅコードで、半角文字は１ｂｙｔｅコードで表現され
るためこの性質を利用して文字切出を行う。例えば、図
１４に示される例では、電子データを２ｂｙｔｅのコー
ドと１ｂｙｔｅのコードに分離することにより、最初の
３文字が全角、次の３文字が半角、その後２文字が全角
であることがわかる。In this embodiment, the electronic data is used to make this character extraction stable. Full-width characters are normally represented by 2-byte codes and half-width characters by 1-byte codes, and character extraction exploits this property. For example, in the example shown in FIG. 14, separating the electronic data into 2-byte codes and 1-byte codes shows that the first three characters are full-width, the next three are half-width, and the following two are full-width.
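
As a hedged illustration of this byte-length property, the sketch below classifies each character as full-width or half-width by encoding it with Shift_JIS. The choice of Shift_JIS, the sample string, and the function names are assumptions; the patent only states that full-width characters use 2-byte codes and half-width characters 1-byte codes.

```python
from itertools import groupby


def width_classes(text, encoding="shift_jis"):
    """Return 'F' (2-byte, full-width) or 'H' (1-byte, half-width) per character."""
    return "".join(
        "F" if len(ch.encode(encoding)) == 2 else "H" for ch in text
    )


def width_runs(text, encoding="shift_jis"):
    """Collapse the per-character classes into (class, run length) pairs."""
    classes = width_classes(text, encoding)
    return [(cls, len(list(run))) for cls, run in groupby(classes)]
```

For a string made of three hiragana, three half-width katakana and two hiragana, `width_runs` yields three full-width, three half-width and two full-width characters, the same pattern as the FIG. 14 example.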

【００４６】全角文字の幅は、行の高さから類推できる
ので、ブロック図形５２に対して電子データ１２のコー
ドを参照して切出しを行う。１文字目は全角であるので
５２１と５２２が統合され１文字目となる。同様に２文
字目は５２３、３文字目は５２４では小さすぎるので５
２４と５２５が統合され３文字目となる。次の３文字は
半角であるので５２６〜５２８はそれぞれ単独文字とし
て切出され、５２９・５３０・５３１は全角文字として
統合される。Since the width of a full-width character can be inferred from the line height, extraction is performed on the block figure 52 with reference to the codes of the electronic data 12. Since the first character is full-width, 521 and 522 are merged to form the first character. Similarly, the second character is 523; for the third character, 524 alone is too small, so 524 and 525 are merged to form the third character. Since the next three characters are half-width, 526 to 528 are each extracted as single characters, and 529, 530 and 531 are merged as full-width characters.
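
The merging described above can be sketched as a greedy pass over the connected-component boxes. This is a simplified illustration under stated assumptions, not the patent's block-integration method (which it treats as well known): boxes are reduced to horizontal extents, a full-width character is assumed to be as wide as the line height and a half-width character half of that, and the tolerance `tol` and all names are hypothetical.

```python
def segment(boxes, classes, line_height, tol=0.1):
    """Group connected-component boxes into characters.

    boxes       -- list of (left, right) x-extents, sorted by left edge
    classes     -- string of 'F' (full-width) / 'H' (half-width), one per
                   expected character, derived from the electronic data
    line_height -- height of the text line, used as the full-width width
    Returns a list of lists of box indices, one list per character.
    """
    chars, i = [], 0
    for cls in classes:
        if i >= len(boxes):
            break
        width = line_height if cls == "F" else line_height / 2
        left = boxes[i][0]
        group = [i]
        i += 1
        # Keep merging while the next box still fits inside the expected cell.
        while i < len(boxes) and boxes[i][1] <= left + width * (1 + tol):
            group.append(i)
            i += 1
        chars.append(group)
    return chars
```

A real implementation would also have to handle skew, touching characters and boxes that straddle two cells; the sketch only shows how the expected widths from the electronic data steer the merge decision.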

【００４７】このようにして、安定した精度の高い文字
切出しが可能となる。なお、単一黒連結領域の抽出や全
角半角混じりのボトムアップ的なブロック統合は周知の
技術であるので、ここでは電子データの使用方法につい
てのみ説明するにとどめる。In this manner, stable and accurate character extraction can be performed. Note that extraction of a single black connected region and bottom-up block integration of a mixture of full-width and half-width characters are well-known technologies, and therefore, only a method of using electronic data will be described here.

【００４８】また、この説明では単一黒連結領域を文字
切出の基本特徴に用いたが射影による文字切出でも同様
の処理が可能であることは言うまでもない。In this description, the single black connected area is used as a basic feature of character extraction, but it goes without saying that the same processing can be performed by character extraction by projection.

【００４９】次に本発明の第６の実施形態について説明
する。本実施形態も前述の第５の実施形態と同様に他の
実施形態と比較して、文字切出部２１３に特徴がある。
文字認識部２１においては、各文字コードから参照でき
る形で各文字の形状情報を保有している。形状情報は、
文字の外接枠形状として与えられ、正方形枠（以下の説
明ではコード１で表す）・縦長長方形枠（同２）・横長
長方形枠（同３）・小型枠（同４）・不定／空白（同
０）等で定義される。図１５に、テーブルを利用して文
字コードより形状コードを参照する例を示す。図１５で
は、ＪＩＳコードを例に説明しており、テーブルの先頭
から文字コード（ＹＹＹＹ）に該当する１６進数のコー
ドｂｙｔｅ分進んだ位置にその文字コードが示す文字に
対応した形状情報が１ｂｙｔｅの０〜４で記述されてい
る。したがって、文字コード（ＹＹＹＹ）がわかれば、
それから対応する形状情報のコード（０，１，２，３あ
るいは４）を参照することができる。なお、本テーブル
は、文字辞書作成時に用いる学習用文字データベースを
用いて容易に自動作成可能である。Next, a sixth embodiment of the present invention will be described. Like the fifth embodiment described above, this embodiment is characterized by the character extraction unit 213 in comparison with the other embodiments. The character recognition unit 21 holds shape information for each character in a form that can be looked up from the character code. The shape information is given as the shape of the character's circumscribed frame and is defined as a square frame (code 1 in the following description), a vertically long rectangular frame (code 2), a horizontally long rectangular frame (code 3), a small frame (code 4), indefinite/blank (code 0), and so on. FIG. 15 shows an example of looking up a shape code from a character code using a table. FIG. 15 uses JIS codes as an example: the shape information for the character indicated by a character code is stored as one byte with a value of 0 to 4 at the position offset from the top of the table by the number of bytes equal to the hexadecimal value of the character code (YYYY). Therefore, once the character code (YYYY) is known, the corresponding shape information code (0, 1, 2, 3 or 4) can be looked up. This table can easily be created automatically from the learning character database used when creating the character dictionary.
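
A minimal sketch of such a table lookup is shown below, assuming one byte per character code indexed directly by the 16-bit JIS code. Deriving the JIS X 0208 code from the EUC-JP encoding is a convenience chosen for this example, and the function names and the sample shape assignments are hypothetical.

```python
# Shape codes as defined in the text.
SHAPE_BLANK, SHAPE_SQUARE, SHAPE_TALL, SHAPE_WIDE, SHAPE_SMALL = range(5)


def jis_code(ch):
    """Return the 16-bit JIS X 0208 code of a full-width character."""
    b = ch.encode("euc_jp")
    if len(b) != 2:
        raise ValueError("not a 2-byte JIS X 0208 character: %r" % ch)
    # EUC-JP stores each JIS byte with the high bit set.
    return ((b[0] & 0x7F) << 8) | (b[1] & 0x7F)


def build_shape_table(shapes, size=0x10000):
    """Build a byte table; entry N holds the shape code of character code N."""
    table = bytearray(size)  # unlisted codes default to 0 (indefinite/blank)
    for code, shape in shapes.items():
        table[code] = shape
    return table


def shape_of(table, code):
    """Look up the shape code stored at the offset given by the character code."""
    return table[code]
```

Looking up every character code of the electronic data in turn produces a shape sequence that the character extraction unit can compare against the circumscribed frames.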

【００５０】このテーブルによる形状情報を利用した処
理手順について、図１６を用いて説明する。図１６の６
１は領域抽出２１２により得られた文字列イメージの例
であり、最初の４文字（「’」まで）が全角で、次の３
文字（テスト）が半角で、残りの４文字が全角で印字さ
れている。図１６の６２は同図の６１のイメージより黒
点の単一連結領域を抽出しおのおのをその外接枠で表現
した図である。従来の方式によれば、６２５〜６３３の
間で切出し誤りが生じやすく、特に、６２６，６３０，
６３２，６３３の統合と分離で誤りが生じやすい。A processing procedure that uses the shape information from this table will be described with reference to FIG. 16. Reference numeral 61 in FIG. 16 denotes an example of a character string image obtained by the region extraction 212. The first four characters (up to 「’」) are printed full-width, the next three characters (テスト) half-width, and the remaining four characters full-width. Reference numeral 62 in FIG. 16 is a diagram in which the single connected regions of black pixels are extracted from the image 61 and each is represented by its circumscribed frame. With the conventional method, extraction errors tend to occur between 625 and 633, and errors are especially likely in merging and separating 626, 630, 632 and 633.

【００５１】本実施形態では、まず電子データを参照す
る。この電子データを用いて図１５の方法で特に図示は
しない形状のテーブルを参照すると、１１１４２２２４
１１４（正方形、・・・、小型）という形状情報が得ら
れる。この情報に基づいて、６２から文字切り出しを行
うと、６２１と６２２で１文字目が構成され、以降６２
３が１文字、６２４の位置は形状情報が正方形であるの
で、６２４と６２５を統合して１文字、６２６〜６３０
が各１文字、６３１の位置とその次の文字は正方形とい
う形状情報があるので、６３１〜６３３を統合して１文
字、６３４，６３５でそれぞれ１文字という切出結果６
３を得ることができる。ブロックの統合方法は、公知の
技術であるのでその詳細な説明は省略する。また、ここ
では、黒画素の連結領域をベースにした処理で説明した
が、射影による文字切り出しでも同様の処理は容易に実
現可能である。In this embodiment, the electronic data is referred to first. Looking up the shape table (not specifically illustrated) with this electronic data by the method of FIG. 15 yields the shape information 11142224114 (square, ..., small). When characters are extracted from 62 based on this information, 621 and 622 form the first character; 623 is then one character; since the shape information at the position of 624 is square, 624 and 625 are merged into one character; 626 to 630 are one character each; since the shape information indicates that the character at the position of 631 and the next character are square, 631 to 633 are merged into one character; and 634 and 635 are one character each, giving the extraction result 63. The method of merging blocks is a known technique, so a detailed description is omitted. The processing has been described here on the basis of connected regions of black pixels, but the same processing can easily be realized with character extraction by projection.

【００５２】このようにして、文字切り出しの誤りを削
減して安定した文字切出しが実現される。In this manner, stable character extraction is realized by reducing character extraction errors.

【００５３】次に、本発明の第７の実施形態について説
明する。本実施形態は、シート１１上の一部分と電子デ
ータ１２を照合する場合に適用されるものである。Next, a seventh embodiment of the present invention will be described. This embodiment is applied to a case where the electronic data 12 is compared with a part on the sheet 11.

【００５４】文字切出を行うべき範囲とは別に、フォー
マット定義でシート１１上の文書照合をしたい部分を別
に指定することで、シート１１上の一部分と電子データ
１２を照合することができる。この方法は、帳票ＯＣＲ
では基本的な方法であるので容易に類推可能であり、詳
細な説明は省略する。Separately from the range in which character extraction is to be performed, the portion of the sheet 11 to be collated is specified in the format definition, so that a part of the sheet 11 can be collated with the electronic data 12. This is a basic technique in form OCR and can easily be inferred, so a detailed description is omitted.

【００５５】他の方法として、領域抽出部２２における
処理実行中で、操作者に領域解析結果を提示し、文書照
合したい領域を操作者に指定させる方法がある。この方
法も、文書ＯＣＲで実現されている方法と類似しており
当該技術者であれば容易に実現可能であるので詳細な説
明は省略する。As another method, there is a method in which an area analysis result is presented to the operator during the execution of the processing in the area extracting unit 22 and the operator specifies an area to be compared with the document. This method is also similar to the method realized by the document OCR, and can be easily realized by a person skilled in the art, so that the detailed description is omitted.

【００５６】このような機能の必要性は、例えばシート
に記載された本文の３行目から比較を行いたい場合に、
先頭から文字切出しを行ったあとでないと３行目の位置
が判明しないが、必要なのはその一部であるため、文字
切出しとは別に文書照合位置を指定することが有用とな
る。本手法は、行の途中から指定することも可能であ
る。Such a function is needed, for example, when comparison should start from the third line of the body text on the sheet. The position of the third line is not known until character extraction has been performed from the top, yet only part of the text is needed, so it is useful to specify the document collation position separately from character extraction. With this technique, the position can also be specified from the middle of a line.

【００５７】次に、本発明の第８の実施形態について説
明する。Next, an eighth embodiment of the present invention will be described.

【００５８】本実施形態は、文字認識部２１４で用いる
文字辞書の分割方法と読取時の制御方法に特徴がある。This embodiment is characterized by a method of dividing a character dictionary used by the character recognition unit 214 and a control method at the time of reading.

【００５９】辞書を分割する目的は、文字認識精度の向
上にある。文字認識においては識別すべきカテゴリ数
（読取対象字種数）が増加するにつれ読取精度が減少す
るのは一般的な事実である。特に、電子データの入力時
（作成時）に、ＯＣＲで読取可能な文字種を意識してい
ない作成者に対して入力対象文字種を制限することは困
難であるため、文書照合で取扱う文字種は少なくともＪ
ＩＳ第２水準までの範囲となり、６５００字種以上にな
る。そこで、ある読取文字に対してその文字を含む分割
辞書を用い、読取対象文字種数を擬似的に減らすことに
より読取精度の向上をはかる。また、読取文字種が減少
することにより、処理速度が向上するという効果も得ら
れる。The purpose of dividing the dictionary is to improve character recognition accuracy. It is generally true in character recognition that reading accuracy decreases as the number of categories to be distinguished (the number of character types to be read) increases. In particular, it is difficult to restrict the input character types for authors who, when entering (creating) electronic data, are not conscious of which character types OCR can read, so the character types handled in document collation extend at least to JIS level 2, which is more than 6,500 character types. Therefore, for a given character to be read, a divided dictionary containing that character is used, which effectively reduces the number of candidate character types and improves reading accuracy. Reducing the number of character types to be read also improves processing speed.

【００６０】ここで、具体的な文字辞書の分割方法につ
いて説明する。まず、記号・平仮名・片仮名・アルファ
ベット大文字・同小文字・ＪＩＳ第一水準・ＪＩＳ第２
水準に対応するように辞書を分割する方法がある。この
分割方法はＪＩＳコードに準じているので非常に容易に
実現できる。第２の分割方法としては、漢字を部首毎に
分割する方法がある。ＪＩＳコードに対する部首の分類
は、市販されているので作成は容易である。例えば、上
柿力編“パソコンワープロ漢字辞典”（ナツメ社刊）な
どにこのようなデータがある。もう一つの分割方法とし
ては、辞書作成時の文字認識テストにいて、互いに誤読
となる割合が高い２つあるいはそれ以上の文字をそれぞ
れ異なる辞書に分割する方法がある。この方法では、例
えば「夕（漢字）」と「タ（片仮名）」などのペアが異
なる辞書に分割される。このような分割方法は、文字認
識の実験結果さえあれば、その結果に基づいて誤読２０
％以上となる文字の組み合わせを取り出すことも、これ
らの文字が互いに重なりの無いように複数の辞書に分割
することもプログラミングで可能であり比較的容易であ
る。後者の分割技術としては、このような誤読の割合が
高い文字の組み合わせをペナルティの評価関数として、
パターン認識でよく用いられる自動クラスタリング手法
を利用できる。Here, specific methods of dividing the character dictionary will be described. First, the dictionary can be divided to correspond to symbols, hiragana, katakana, uppercase alphabetic letters, lowercase alphabetic letters, JIS level 1, and JIS level 2. This division follows the JIS code and is therefore very easy to implement. A second method divides kanji by radical. Classifications of radicals against JIS codes are commercially available, so such a division is easy to create; such data can be found, for example, in "PC Word Processor Kanji Dictionary" (パソコンワープロ漢字辞典), edited by Riki Kamigaki (上柿力) and published by Natsume-sha. Another method assigns two or more characters that were frequently misread as each other in the character recognition tests performed when creating the dictionary to different dictionaries. With this method, pairs such as 「夕」 (kanji) and 「タ」 (katakana) are placed in different dictionaries. As long as character recognition experiment results are available, both extracting the character combinations whose misreading rate is 20% or more and dividing those characters among multiple dictionaries so that they do not overlap can be done by programming, and relatively easily. For the latter division, automatic clustering methods commonly used in pattern recognition can be applied, with such frequently misread character combinations serving as the penalty in the evaluation function.
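
The third division method can be sketched as follows. Note that this example substitutes a simple greedy assignment (greedy graph colouring) for the automatic clustering the text mentions, purely for illustration. The 20% threshold comes from the text, while the data layout, the sample rates used when calling it, and the function name are assumptions.

```python
def split_confusable(chars, confusion, threshold=0.20):
    """Assign each character a dictionary number so that characters misread
    as each other at a rate >= threshold never share a dictionary.

    chars     -- characters to distribute, in processing order
    confusion -- dict mapping (char_a, char_b) to the mutual misreading rate
    Returns a dict mapping each character to its dictionary number.
    """
    conflicts = {c: set() for c in chars}
    for (a, b), rate in confusion.items():
        if rate >= threshold:
            conflicts[a].add(b)
            conflicts[b].add(a)

    assignment = {}
    for c in chars:
        # Dictionary numbers already taken by characters that conflict with c.
        used = {assignment[o] for o in conflicts[c] if o in assignment}
        number = 0
        while number in used:
            number += 1
        assignment[c] = number
    return assignment
```

Greedy assignment does not balance dictionary sizes or minimise the number of dictionaries; a clustering method with a penalty term, as the text suggests, could optimise those as well.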

【００６１】文字辞書は、図１７に示すように分割辞書
の各要素（個別文字の辞書）がひとまとまりで連続する
ように配置してある。辞書分割テーブルはこの分割辞書
の先頭の格納位置が、分割辞書の格納順に記述してあ
る。従って、Ｚ番目の分割辞書にアクセスするのであれ
ば、辞書分割テーブルのＺ番目に辞書先頭からの開始位
置が、同テーブルのＺ＋１番目に分割辞書の終了位置
（分割辞書Ｚ＋１の開始位置）が格納されていることに
なる。As shown in FIG. 17, the character dictionary is arranged so that each element (dictionary of individual characters) of the divided dictionary is continuous as a unit. In the dictionary division table, the head storage positions of the division dictionaries are described in the storage order of the division dictionaries. Therefore, if the Z-th divided dictionary is accessed, the start position from the beginning of the dictionary is stored at the Z-th position in the dictionary division table, and the end position (start position of the divided dictionary Z + 1) is stored at the Z + 1-th position in the table. It will be.

【００６２】次に、このような分割辞書のアクセス方法
の例を図１７と図１８を用いて示す。まず、電子データ
１２から該当する文字位置の文字コードを読み出す（図
１８ステップ８１）。このコードをもとにこのコードの
正解が含まれている分割辞書の番号（図１７ではＺ）を
辞書番号テーブルを利用して取得する（図１８ステップ
８２）。文字コードから分割辞書の番号を取得する手順
は、図１５で説明した形状コードの取得と全く同じであ
るので詳細な説明は省略する。分割辞書の番号は前述の
ように辞書の格納順につけてある。Next, an example of a method of accessing such divided dictionaries will be described with reference to FIGS. 17 and 18. First, the character code at the relevant character position is read from the electronic data 12 (step 81 in FIG. 18). From this code, the number of the divided dictionary that contains the correct answer for the code (Z in FIG. 17) is obtained using the dictionary number table (step 82 in FIG. 18). The procedure for obtaining the divided dictionary number from the character code is exactly the same as that for obtaining the shape code described with reference to FIG. 15, so a detailed description is omitted. As noted above, divided dictionaries are numbered in their storage order.

【００６３】次に、分割辞書の番号から、辞書分割テー
ブルをアクセスする（ステップ８３）。このアクセス方
法は前述の通りである。Next, the dictionary division table is accessed from the number of the division dictionary (step 83). This access method is as described above.

【００６４】このようにして分割辞書の格納位置を取得
すると順次辞書を読み出し（ステップ８４）、照合を行
い（ステップ８５）、次の文字辞書の読み出し位置をセ
ットする（ステップ８６）。この際、次の読み出し位置
が分割辞書の終了位置に達していれば終了し、そうでな
ければステップ８４〜８６を実行する（ステップ８
７）。Once the storage position of the divided dictionary has been obtained in this way, the dictionary entries are read out in sequence (step 84), matching is performed (step 85), and the read position of the next character dictionary entry is set (step 86). If the next read position has reached the end position of the divided dictionary, processing ends; otherwise, steps 84 to 86 are executed again (step 87).
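
Steps 81 to 87 can be sketched as a loop over one slice of a flat dictionary. This is an illustrative sketch only: representing the dictionary as a list of (character code, template) entries, the dictionary number table as a mapping, and recognition as a minimum-distance search are all assumptions, as are the names.

```python
def recognize_with_division(feature, code, dict_number, division_table,
                            dictionary, distance):
    """Match a character image against only the divided dictionary that
    contains the expected character code.

    feature        -- feature extracted from the character image
    code           -- character code read from the electronic data (step 81)
    dict_number    -- maps a character code to its divided dictionary number
    division_table -- start position of each divided dictionary, plus one
                      final entry marking the end of the last dictionary
    dictionary     -- flat list of (character code, template) entries
    distance       -- function(feature, template) -> smaller is better
    """
    z = dict_number[code]                      # step 82
    pos = division_table[z]                    # step 83: start of dictionary Z
    end = division_table[z + 1]                # start of dictionary Z + 1
    best = None
    while pos < end:                           # step 87: end position reached?
        entry_code, template = dictionary[pos]  # step 84
        d = distance(feature, template)        # step 85
        if best is None or d < best[0]:
            best = (d, entry_code)
        pos += 1                               # step 86
    return None if best is None else best[1]
```

Because only entries between the two table positions are visited, a candidate outside the selected divided dictionary can never be returned, which is what narrows the effective number of character types.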

【００６５】ここでは、分割辞書のアクセス方法につい
て説明したが、辞書全部を照合した後、辞書の分割情報
に基づき文字認識結果の候補の中から分割辞書に含まれ
る文字コードでもっとも性能の高いものを手順も可能で
ある。Although the method of accessing divided dictionaries has been described here, a procedure is also possible in which the entire dictionary is matched first and then, based on the dictionary division information, the best-performing character code contained in the relevant divided dictionary is selected from among the character recognition result candidates.

【００６６】次に、本発明の第９の実施形態について説
明する。Next, a ninth embodiment of the present invention will be described.

【００６７】本実施形態は、他の形態と比較して照合部
２２の手順が異なる。This embodiment is different from the other embodiments in the procedure of the collating unit 22.

【００６８】照合部２２には区別のつかない同形異字の
ペアを一覧として記述しておく。同形異字とは、同じ字
形であるのに文字コードが異なるもので、通常使用する
文字の中には「井（漢字）と＃（記号）」、日（ひ）と
曰（いわく）」、（ト（片仮名）と卜（漢字）」、「Ｃ
（大文字）とｃ（小文字）」などを代表にこのような同
形異字のペアがかなりの数がある。また「日（ひ）と臼
（うす）」等は、厳密には同形異字とは異なるが、文字
認識上区別することが非常に難しいとして同形異字とし
て扱ってもよい。このような文字コードのペアが、連続
的に並んでいるものが同形異字のペアの一覧である。The collation unit 22 holds a list of pairs of indistinguishable homomorphic characters. Homomorphic characters are characters that have the same shape but different character codes. Among commonly used characters there are a considerable number of such pairs, typified by 「井」 (kanji) and 「＃」 (symbol), 「日」 (hi) and 「曰」 (iwaku), 「ト」 (katakana) and 「卜」 (kanji), and 「Ｃ」 (uppercase) and 「ｃ」 (lowercase). Pairs such as 「日」 (hi) and 「臼」 (usu) are not homomorphic characters in the strict sense, but they may be treated as such because they are very difficult to distinguish in character recognition. The list of homomorphic pairs consists of such character code pairs arranged consecutively.

【００６９】図１９で、文字認識結果３１と電子データ
３２およびその照合結果３３は、前述の図４と同じであ
り、その導出手順もすでに説明した方法と同じである。
また、照合結果３４は、照合結果３３から同形異字のペ
アの一覧を参照して導びかれるものである。In FIG. 19, the character recognition result 31, the electronic data 32, and the collation result 33 are the same as those in FIG. 4 described above, and the derivation procedure is the same as the method already described.
The matching result 34 is derived from the matching result 33 with reference to a list of pairs of homomorphic characters.

【００７０】具体的には、照合結果３３が不一致のもの
即ち図１９では１であるものについてその位置の電子デ
ータ３２の文字コードを用いて同形異字の一覧を検索す
る。この結果、一覧の中にその文字コードが見つかれ
ば、一覧から同形異字の相手方の文字コードを読み出
し、そのコードを対応する位置の文字認識結果３１の文
字コードと照合する。この結果、一致すれば、その照合
結果３３を不一致から一致に切り替え照合結果３４に格
納する。照合結果３３において、不一致以外のもの（図
１９では０）と、同形異字のペアの一覧に該当する文字
コードがない場合は、照合結果３３および３４は同じ値
とする。Specifically, for each entry of collation result 33 that is a mismatch (1 in FIG. 19), the list of homomorphic characters is searched using the character code of the electronic data 32 at that position. If the character code is found in the list, the character code of its homomorphic partner is read from the list and compared with the character code of the character recognition result 31 at the corresponding position. If they match, the collation result 33 is switched from mismatch to match and stored in collation result 34. For entries of collation result 33 that are not mismatches (0 in FIG. 19), and for mismatches whose character code is not in the list of homomorphic pairs, collation results 33 and 34 take the same value.
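
The re-check described in this paragraph can be sketched as follows. The result codes 0 (match) and 1 (mismatch) follow the text; the function name and the representation of the pair list as Python tuples are assumptions made for illustration.

```python
def relax_homomorphic(recognized, electronic, results, pairs):
    """Derive collation result 34 from collation result 33.

    recognized -- characters of the character recognition result (31)
    electronic -- characters of the electronic data (32)
    results    -- collation result 33 (0 = match, 1 = mismatch)
    pairs      -- list of (char_a, char_b) homomorphic character pairs
    """
    # Index the pair list in both directions for lookup by character.
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    relaxed = list(results)
    for i, (rec, ref, res) in enumerate(zip(recognized, electronic, results)):
        # Only mismatches are re-examined; everything else is copied as is.
        if res == 1 and rec in partners.get(ref, ()):
            relaxed[i] = 0
    return relaxed
```

With only the katakana/kanji pair 「ト」 and 「卜」 in the list, a mismatch between those two characters becomes a match, while a mismatch between 「テ」 and 「〒」 stays a mismatch.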

【００７１】なお、図１９においては、同形異字の一覧
に「ト」（片仮名）と「卜」（漢字）のペアはあるが、
「〒」と「テ」はないため、「ト」（片仮名）と「卜」
（漢字）に該当する照合結果のみが１から０に切り替え
られている。In FIG. 19, the list of homomorphic characters contains the pair 「ト」 (katakana) and 「卜」 (kanji) but not 「〒」 and 「テ」, so only the collation result corresponding to 「ト」 (katakana) and 「卜」 (kanji) is switched from 1 to 0.

【００７２】本発明の第１０の実施形態は、一度作成し
た文書を確認するときに適用される。即ち、重要度の高
い文書は、通常文字を記載して書類を作成した後に、そ
の内容を再確認してから提出あるいは発送され、本実施
形態は、このような作業の支援を行うものである。The tenth embodiment of the present invention is applied when checking a document that has already been created. An important document is normally submitted or dispatched only after its text has been written and its contents rechecked, and this embodiment supports that work.

【００７３】図２０を参照すると、操作端末９１には、
ワープロで作成したり、あるいは電子メールなどを利用
して送付された電子データが格納されている。本電子デ
ータは図１に示される電子データ１２に相当する。操作
端末９１としては、通常パソコンやサーバ等を使用す
る。Referring to FIG. 20, the operation terminal 91 stores electronic data that was created with a word processor or received by e-mail or similar means. This electronic data corresponds to the electronic data 12 shown in FIG. 1. A personal computer, a server, or the like is normally used as the operation terminal 91.

【００７４】次に、この電子データを操作端末９１に接
続されているプリンタ９２から印字し、用紙を作成す
る。次に、作成した用紙を文字認識装置９３にかけその
内容を認識する。文字認識装置９３が図４に示される文
字認識部２１に相当する。この認識結果を操作端末９１
に転送し、操作端末に格納されている電子データと文字
認識結果を照合し、不一致の部分はディスプレイを用い
て確認する。操作端末９１が図４に示される照合部２２
および目視確認部２３に相当する。この結果、一致が確
認されれば、当該用紙の文書照合は終了する。Next, this electronic data is printed by the printer 92 connected to the operation terminal 91 to produce a sheet. The sheet is then fed to the character recognition device 93, which recognizes its contents. The character recognition device 93 corresponds to the character recognition unit 21 shown in FIG. 4. The recognition result is transferred to the operation terminal 91, where the electronic data stored in the operation terminal is collated with the character recognition result and any mismatched portions are checked on the display. The operation terminal 91 corresponds to the collation unit 22 and the visual confirmation unit 23 shown in FIG. 4. If a match is confirmed as a result, document collation for the sheet is complete.

【００７５】本実施形態では、操作端末とプリンタと文
字認識装置の構成で説明したが、文字認識装置をスキャ
ナーに置き換え、文字認識自体は操作端末側で実行する
構成も可能である。また、現在ではプリンタとスキャナ
ーが一体になった製品もあるので、プリンタとスキャナ
ーを一体化させることも容易に実現可能である。In this embodiment, the configuration of the operation terminal, the printer, and the character recognition device has been described. However, a configuration in which the character recognition device is replaced with a scanner and the character recognition itself is executed on the operation terminal side is also possible. At present, there is a product in which a printer and a scanner are integrated, so that it is possible to easily integrate the printer and the scanner.

【００７６】このとき、印字部の印字フォント・印字サ
イズ・印字レイアウトを固定化し、領域解析をそのレイ
アウト専用にし、かつ、文字辞書をその印字専用にすれ
ば通常の文字認識装置と比べても相当に高い精度の文字
認識精度を得ることが可能となる。In this case, if the print font, print size and print layout of the printing unit are fixed, the region analysis is dedicated to that layout, and the character dictionary is dedicated to that printing, character recognition accuracy considerably higher than that of an ordinary character recognition device can be obtained.

【００７７】次に、本発明の第１１の実施形態について
説明する。本実施形態では、前述の第１０の実施形態で
示された構成（図２０）において、印字におけるフォン
トを比較的自由に選択でき、かつ、文字認識精度を高く
キープすることを実現するものである。Next, an eleventh embodiment of the present invention will be described. In the present embodiment, in the configuration shown in the above-described tenth embodiment (FIG. 20), it is possible to relatively freely select a font for printing and to keep character recognition accuracy high. .

【００７８】図２０に示される構成において、用紙に印
刷する文字フォント、サイズ、レイアウトは、操作端末
９１側で制御可能である。従って印字するフォント・サ
イズ・レイアウト情報を文字認識装置９３に前もって通
知することも可能である。これらの情報はプリンタ９２
に印字データを送る直前に、文字認識装置９３（文字認
識部）に、図２１に示される形式で通知される。In the configuration shown in FIG. 20, the font, size and layout of the characters printed on the sheet can be controlled from the operation terminal 91. The font, size and layout information to be printed can therefore be reported to the character recognition device 93 in advance. This information is sent to the character recognition device 93 (character recognition unit) in the format shown in FIG. 21 immediately before the print data is sent to the printer 92.

【００７９】図２１は、電子データの１文字毎の形式を
表しており、この形式のサイズの文字数倍のサイズ情報
が文字認識装置に転送される。文字コード１０１は電子
データの文字コード、印字位置（Ｘ，Ｙ）１０２はその
文字の外接矩形の左上の位置を表し、座標系は帳票の左
上を原点とするＸＹ座標系である。フォント種１０３は
印字フォントをあらかじめ定められた番号で表したもの
である。フォントサイズ（Ｘ，Ｙ）１０４は、印字フォ
ントの外接矩形の横（Ｘ）方向サイズと縦（Ｙ）方向サ
イズで表している。なお、座標およびサイズの単位とし
ては、１／１０ｍｍとか光学系の分解能サイズを用い
る。FIG. 21 shows the format for one character of the electronic data; information whose size is the size of this format multiplied by the number of characters is transferred to the character recognition device. The character code 101 is the character code in the electronic data. The print position (X, Y) 102 is the upper-left corner of the character's circumscribed rectangle, in an XY coordinate system whose origin is the upper-left corner of the form. The font type 103 identifies the print font by a predetermined number. The font size (X, Y) 104 gives the horizontal (X) and vertical (Y) sizes of the circumscribed rectangle of the printed font. Coordinates and sizes are expressed in units such as 1/10 mm or the resolution of the optical system.
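
One way to picture the per-character record of FIG. 21 is as a fixed-size binary structure. The field order follows the text, but the field widths, the little-endian byte order and all names below are assumptions, since the patent does not specify a binary layout.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: code, x, y as 16-bit; font as 8-bit; width, height as 16-bit.
RECORD = struct.Struct("<HHHBHH")


@dataclass
class PrintedChar:
    code: int    # character code 101
    x: int       # print position X 102 (upper-left of circumscribed rectangle)
    y: int       # print position Y 102
    font: int    # font type 103 (predetermined number)
    width: int   # font size X 104
    height: int  # font size Y 104

    def pack(self):
        """Serialize this record into RECORD.size bytes."""
        return RECORD.pack(self.code, self.x, self.y,
                           self.font, self.width, self.height)


def pack_all(chars):
    """Concatenate one record per character, in text order."""
    return b"".join(c.pack() for c in chars)


def unpack_all(data):
    """Rebuild the records on the character recognition side."""
    return [PrintedChar(*RECORD.unpack_from(data, offset))
            for offset in range(0, len(data), RECORD.size)]
```

The total message length is the record size multiplied by the number of characters, matching the description of the transferred information.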

【００８０】また、文字認識装置９３では、プリンタ９
２で使用する各フォント・サイズ毎に個別の文字辞書を
保有しており、分割アクセスが可能である。分割の格納
順序は、図２１で用いたフォント種を示す番号順として
ある。分割辞書及びそのアクセスの方法については前述
の第８の実施形態で既に述べているため、説明は省略す
る。The character recognition device 93 holds a separate character dictionary for each font and size used by the printer 92, and these dictionaries can be accessed individually. The divided dictionaries are stored in the order of the font type numbers used in FIG. 21. The divided dictionaries and the method of accessing them have already been described in the eighth embodiment, so the description is omitted here.

【００８１】ここで、文字認識装置９２の動作を図２２
を用いて説明する。１１１は領域抽出後の読みとるべき
文字イメージの例であり、プロポーショナルフォントを
用いて印字されているため印字ピッチが一定していな
い。文字イメージ１１１から単一黒連結領域を抽出し、
その外接枠で表現したものが同図の１１２である。The operation of the character recognition device 93 will now be described with reference to FIG. 22. Reference numeral 111 denotes an example of a character image to be read after region extraction; because it is printed in a proportional font, the print pitch is not constant. Reference numeral 112 in the same figure shows the single black connected regions extracted from the character image 111, each represented by its circumscribed frame.

【００８２】領域抽出に関しては、図２１に示された印
字位置１０２を情報として使用して行う方法がある。し
かしながら、パソコン用のプリンタの印字精度は位置で
上下左右に１〜２ｍｍずれる場合が散見されるので、精
度の高い情報としては使用しにくい面がある。従って、
領域抽出はまず前述の第１の実施形態で述べた方法で行
い、印字位置１０２については補助情報として用いる。
例えば、小さな点が２つあり、どちらを選択してよいか
わからない場合などには本位置情報１０２を用いて最終
的に判断する。For region extraction, one approach is to use the print position 102 shown in FIG. 21 as information. However, the print position of personal computer printers is sometimes off by 1 to 2 mm vertically or horizontally, so it is hard to rely on as high-precision information. Region extraction is therefore first performed by the method described in the first embodiment, and the print position 102 is used as auxiliary information. For example, when there are two small dots and it is unclear which to select, the final decision is made using this position information 102.

【００８３】文字切出しは、フォントサイズを利用して
外接枠と比較しながら必要に応じて統合を繰り返して切
り出す。例えば、図２２の場合、１１２１と１１２２を
統合すると、操作端末９１より受け取った「こ」のフォ
ントサイズ１０４とおよそ一致するので、１１２１と１
１２２を統合して１文字目とする。１１２３のサイズ
は、「れ」のフォントサイズ１０４とおよそ一致し、１
１２４と１１２５は統合しておよそ「は」のフォントサ
イズ１０４と一致するので、これを１文字とする。同様
に、１１２６から１１３０はそれぞれ「ｔｅｓｔ用」と
サイズが対応し、それぞれ１文字として切り出される。
１１３１〜１１３３は統合されて「で」のサイズに相当
し、１１３４と１１３５も「す。」としてそれぞれ切り
出される。Character extraction uses the font size: the circumscribed frames are compared against it and merged repeatedly as needed. For example, in FIG. 22, merging 1121 and 1122 gives a size that roughly matches the font size 104 of 「こ」 received from the operation terminal 91, so 1121 and 1122 are merged as the first character. The size of 1123 roughly matches the font size 104 of 「れ」, and 1124 and 1125 merged roughly match the font size 104 of 「は」, so they form one character. Similarly, 1126 to 1130 correspond in size to the characters of 「ｔｅｓｔ用」 and are each extracted as one character. 1131 to 1133 are merged to match the size of 「で」, and 1134 and 1135 are extracted as 「す」 and 「。」 respectively.

【００８４】文字認識で使用する辞書も１文字毎にフォ
ント種１０３の情報に基づいて照合する辞書を切り替え
て必要な分割部分を用いる。例えば、フォントＡの文字
に対しては文字辞書はフォントＡの部分のみを用いる。
前述のように、図２１のフォント種１０３と文字辞書の
分割番号を一致するようにしているので、その使用方法
は前述の第８の実施形態で述べたものと同様の処理で可
能となる。The dictionary used for character recognition also switches the dictionary to be collated based on the information of the font type 103 for each character and uses a necessary divided portion. For example, for a character in font A, the character dictionary uses only the font A portion.
As described above, since the font type 103 in FIG. 21 matches the division number of the character dictionary, the method of using the same can be performed by the same processing as that described in the eighth embodiment.

【００８５】なお、本実施形態については、図２１に示
された情報をすべて使用する必要はなく、その一部の情
報だけでも適用が可能である。In this embodiment, it is not necessary to use all of the information shown in FIG. 21, and it is possible to apply only some of the information.

【００８６】このように、従来、安定した文字切出が難
しいとされていた印字ピッチが一定でないプロポーショ
ナルフォント等の印字に対しても安定して文字切出しが
できる。また、文字毎にフォントに併せて専用辞書に切
り替えることにより高い精度で文字認識が可能となる。As described above, it is possible to stably extract characters even when printing a proportional font or the like in which the print pitch is not constant, which has conventionally been considered difficult to extract characters stably. Also, by switching to a dedicated dictionary for each character in accordance with the font, character recognition can be performed with high accuracy.

【００８７】[0087]

【発明の効果】以上説明したように、本発明によれば、
電子データ形式の文書と紙面上に記載された文書の文書
照合の大部分を自動化でき、従来手作業であったための
作業負担を軽減できる効果がある。さらに、人手による
照合量が削減し、多くを自動化するため、処理速度が向
上し、業務および顧客サービスの効率化にも繋がる。As described above, according to the present invention,
Most of the document collation between the document in the electronic data format and the document described on the paper can be automated, which has the effect of reducing the work load due to the conventional manual operation. Furthermore, since the amount of manual collation is reduced and much is automated, the processing speed is improved, leading to more efficient business and customer service.

【００８８】さらに、目視確認作業をすべて端末上のデ
ィスプレイ上あるいは、１枚の用紙上で行うことが可能
なため、さらに人手の作業の軽減ならびに業務およびサ
ービスの効率化が可能となる。Further, since all the visual check operations can be performed on the display on the terminal or on a single sheet of paper, it is possible to further reduce the number of manual operations and increase the efficiency of operations and services.

【００８９】さらに、文字認識の信頼度の低い部分を目
視確認対象とすることにより、文字認識の誤りによる文
書照合信頼度低下を防ぐことができ、精度の高い文書照
合を実現できる。Further, by making a portion having low reliability of character recognition a target of visual confirmation, it is possible to prevent a decrease in document collation reliability due to an error in character recognition, thereby realizing highly accurate document collation.

【００９０】さらに、必ず照合誤りが生じる箇所を指摘
することにより、人間による目視確認部分の確認ミスを
削減することが可能となる。このため、より精度の高い
文書照合を実現できる。Further, by pointing out a place where a collation error occurs, it is possible to reduce a human error in confirming the visual confirmation part. Therefore, more accurate document matching can be realized.

【００９１】さらに、全角文字と半角文字が混在した文
字列の文字切出し精度を高めることにより、文字認識精
度を高めることができ、結果的に目視確認作業を低減で
きる。Further, by improving the character extraction accuracy of a character string in which full-width characters and half-width characters are mixed, the accuracy of character recognition can be improved, and as a result, visual confirmation work can be reduced.

【００９２】さらに、全角文字と半角文字が混在した文
字列に加え、従来誤りやすかった記号などの文字切出精
度を高めることにより、文字認識精度を高めることがで
き、結果的に目視確認作業を低減できる。Further, in addition to a character string in which full-width characters and half-width characters are mixed, the character recognition accuracy can be improved by improving the character extraction accuracy of a symbol or the like which has been apt to be erroneous in the past, and as a result, the visual confirmation work can be performed. Can be reduced.

[0093] Furthermore, part of a document written on a paper sheet can be collated with the electronic data, which widens the applicable range of document collation. In particular, collation can start from the middle of a document.

[0094] Furthermore, character recognition accuracy can be raised and visual confirmation work reduced. For example, if katakana and kanji are placed in separate divided dictionaries, misreadings such as confusing the katakana "タ" (ta) with the kanji "夕" (evening) no longer occur, and character recognition accuracy improves.
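The divided-dictionary idea above can be illustrated with the following sketch, in which the dictionary is chosen from the character code of the electronic data and recognition candidates are limited to that dictionary. The division by Unicode block, the candidate-score dictionary `scores`, and the function names are assumptions made for illustration; the specification does not prescribe them.

```python
from typing import Dict


def dictionary_for(ch: str) -> str:
    """Pick a divided dictionary from the character code (assumed Unicode blocks)."""
    cp = ord(ch)
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def recognize(scores: Dict[str, float], expected: str) -> str:
    """Return the best candidate, restricted to the dictionary of `expected`.

    `scores` maps candidate characters to similarity scores from a
    hypothetical recognizer. Falls back to all candidates if the selected
    dictionary contains none of them.
    """
    name = dictionary_for(expected)
    candidates = {c: s for c, s in scores.items() if dictionary_for(c) == name}
    pool = candidates if candidates else scores
    return max(pool, key=pool.get)
```

With scores such as `{"\u5915": 0.51, "\u30bf": 0.49}` (kanji "evening" slightly ahead of katakana "ta"), calling `recognize` with the katakana as the expected character selects the katakana, because the kanji is not in the katakana dictionary.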

[0095] Furthermore, it becomes possible to handle characters that have the same shape but are different characters, which character recognition alone has difficulty handling completely, so visual confirmation work can be reduced further.
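A minimal sketch of this handling, following the re-collation described in claim 6: when the codes do not match, a table of similar-shaped characters is consulted and the collation is repeated with the look-alike codes. The table contents, the name `SIMILAR_SHAPES`, and the function `matches` are illustrative assumptions only.

```python
from typing import FrozenSet, List

# Hypothetical table of look-alike characters, stored as groups of codes.
SIMILAR_SHAPES: List[FrozenSet[str]] = [
    frozenset({"\u30ab", "\u529b"}),  # katakana KA / kanji "power"
    frozenset({"\u30ed", "\u53e3"}),  # katakana RO / kanji "mouth"
]


def matches(expected: str, recognized: str) -> bool:
    """Collate one character, retrying with similar-shaped codes on mismatch."""
    if expected == recognized:
        return True
    # Re-collation: accept the result if both codes belong to one group.
    return any(expected in group and recognized in group
               for group in SIMILAR_SHAPES)
```

A character accepted only through the table could still be shown with a cautionary display; that choice is left open here.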

[0096] Furthermore, the invention supports document collation after an important document has been created, and makes that work more efficient.

[0097] Furthermore, in the invention of claim 10, character recognition accuracy in particular can be raised, making operations still more efficient. In addition, the invention can handle many kinds of documents for which an ordinary character recognition device has difficulty achieving high reading accuracy, such as documents printed with proportional pitch or documents in which the print font is switched character by character, which also leads to improved service.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 is a diagram showing an example of a sheet used for document collation.

FIG. 3 is a diagram showing a configuration example of the character recognition unit in FIG. 1.

FIG. 4 is a diagram explaining a processing example of the collation unit in FIG. 1.

FIG. 5 is a diagram showing the processing flow of the visual confirmation unit in FIG. 1.

FIG. 6 is a diagram showing a display example of the visual confirmation unit in FIG. 1.

FIG. 7 is a diagram showing an example of the image interface between the character recognition unit and the collation unit when images are passed by file.

FIG. 8 is a diagram showing an example of the image interface between the character recognition unit and the collation unit when cutout positions are passed.

FIG. 9 is a diagram explaining a processing example of the collation unit in the third embodiment of the present invention.

FIG. 10 is a diagram showing a display example of the visual confirmation unit in the third embodiment.

FIG. 11 is a diagram explaining a processing example of the collation unit in the fourth embodiment of the present invention.

FIG. 12 is a diagram showing a display example of the visual confirmation unit in the fourth embodiment.

FIG. 13 is a block diagram showing the configuration of the fifth embodiment of the present invention.

FIG. 14 is a diagram for explaining the character cutout operation in the fifth embodiment.

FIG. 15 is a diagram explaining the method of acquiring shape information in the sixth embodiment of the present invention.

FIG. 16 is a diagram for explaining the character cutout operation in the sixth embodiment.

FIG. 17 is a diagram explaining the method of referring to the character dictionary in the eighth embodiment of the present invention.

FIG. 18 is a diagram showing the character recognition processing in the eighth embodiment.

FIG. 19 is a diagram explaining a processing example of the collation unit in the ninth embodiment of the present invention.

FIG. 20 is a diagram showing the configuration of the tenth embodiment of the present invention.

FIG. 21 is a diagram showing an example of print character information in the eleventh embodiment of the present invention.

FIG. 22 is a diagram for explaining the character cutout operation in the eleventh embodiment.

[Explanation of symbols]

11: Sheet
12: Electronic data
21, 129: Character recognition unit
22: Collation unit
23: Visual confirmation unit
91, 130: Operation terminal
92: Printer
93: Character recognition device

Claims

(57) [Claims]

1. A document collation device comprising: a recognition unit that optically reads characters written on a sheet; a collation unit that collates, one character at a time, character codes stored in advance as electronic data with the recognition result from the recognition unit; and a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data; wherein the recognition result includes a character code obtained from a character image produced by optically scanning a character written on the sheet, and a reliability of that character code as a recognition result; the confirmation unit switches the display method so as to indicate positions that the operator should view with particular attention, in accordance with the result of collating the character code of the electronic data with the character code of the recognition result and with the reliability of the character code of the recognition result, and displays the characters of the electronic data; and the confirmation unit displays the characters of the electronic data by mutually different display methods in a first case in which the collation result between the character code of the electronic data and the character code of the recognition result is a match and the reliability of the character code is higher than a first value set in advance, a second case in which the collation result is a match and the reliability is lower than the first value, and a third case in which the collation result is a mismatch.

2. A document collation device comprising: a recognition unit that optically reads characters written on a sheet; a collation unit that collates, one character at a time, character codes stored in advance as electronic data with the recognition result from the recognition unit; a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data; and a first storage unit that stores in advance the character types that can be read by the computer; wherein the collation unit switches the display method according to whether a character of the electronic data is of a character type stored in the first storage unit, and the characters of the electronic data are displayed accordingly.

3. A document collation device comprising: a recognition unit that optically reads characters written on a sheet; a collation unit that collates, one character at a time, character codes stored in advance as electronic data with the recognition result from the recognition unit; and a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data; wherein the recognition unit includes a scanning unit that optically scans the characters written on the sheet, an area extraction unit that extracts a character area from the image data obtained by the scanning, a character cutout unit that cuts out the character images in the extracted area one character at a time, and a character recognition unit that recognizes each cut-out character image by using a character dictionary; and the character cutout unit controls the cutout position of the character image corresponding to a character code of the electronic data according to the data length of that character code.

4. A document collation device comprising: a recognition unit that optically reads characters written on a sheet; a collation unit that collates, one character at a time, character codes stored in advance as electronic data with the recognition result from the recognition unit; and a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data; wherein the recognition unit includes a scanning unit that optically scans the characters written on the sheet, an area extraction unit that extracts a character area from the image data obtained by the scanning, a character cutout unit that cuts out the character images in the extracted area one character at a time, a character recognition unit that recognizes each cut-out character image by using a character dictionary, and a second storage unit that stores shape information indicating the circumscribed frame of the character image corresponding to each character code of the electronic data; and the character cutout unit controls the cutout position of the character image based on the shape information obtained by referring to the second storage unit.

5. A document collation device comprising: a recognition unit that optically reads characters written on a sheet; a collation unit that collates, one character at a time, character codes stored in advance as electronic data with the recognition result from the recognition unit; and a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data; wherein the recognition unit includes a plurality of character dictionaries that are divided based on a predetermined condition and can be specified by a character code of the electronic data, and, when recognizing a character written on the sheet, acquires the character code of the electronic data at the position corresponding to that character, determines the character dictionary to be used according to the acquired character code, and recognizes the character using that character dictionary.

6. A document collation device comprising: a recognition unit that optically reads characters written on a sheet; a collation unit that collates, one character at a time, character codes stored in advance as electronic data with the recognition result from the recognition unit; a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data; and a third storage unit that stores in advance, in the form of character codes, combinations of a plurality of characters having similar shapes; wherein, when the collation result indicates that the character code of the electronic data does not match the character code of the recognition result and the character code of the electronic data is stored in the third storage unit, the collation unit acquires the character code of the other character having a similar shape and collates the acquired character code with the character code of the recognition result again.

7. A document collation device comprising: a printing unit that prints, on a sheet, the characters corresponding to the character codes of electronic data stored in advance; a recognition unit that optically reads the characters printed on the sheet; a collation unit that collates, one character at a time, the character codes of the electronic data with the recognition result from the recognition unit; and a confirmation unit that switches the display method, in accordance with the result of the collation by the collation unit, so as to indicate positions that the operator should view with particular attention, and displays the characters corresponding to the character codes of the electronic data.

8. The document collation device according to claim 7, wherein the electronic data holds, together with the character codes, at least one of character font, character size, and layout information, either per unit of electronic data or per character code; the printing unit performs printing based on the information held by the electronic data; and the recognition unit, based on that information, controls the cutout position of the character images obtained from the characters printed on the sheet and specifies the character dictionary to be used.

9. The document collation device according to any one of claims 1 to 8, wherein, when the collation result from the collation unit indicates that a character of the recognition result does not match the corresponding character of the electronic data, the confirmation unit highlights the character of the electronic data that was determined to be a mismatch.

10. The document collation device according to any one of claims 1 to 8, wherein the recognition result from the recognition unit includes a character image obtained by optically scanning a character written on the sheet and a character code obtained by performing recognition processing on the character image, and the character image is displayed side by side with the character of the electronic data that corresponds to it.

11. A document collation method comprising: printing, on a sheet, the characters corresponding to the character codes of electronic data stored in advance; optically reading the characters printed on the sheet to obtain a recognition result; collating the character codes of the electronic data with the recognition result one character at a time to obtain a collation result; and switching the display method, in accordance with the collation result, so as to indicate positions that the operator should view with particular attention, and displaying the characters corresponding to the character codes of the electronic data so that the collation result is confirmed visually.