JPH07141462A - Document system

Info

Publication number: JPH07141462A
Application number: JP5290253A
Authority: JP
Inventors: Hidekazu Hatano; 英一羽田野; Kazuyuki Kodama; 和行児玉; Yoshihiro Shima; 好博嶋; Masashi Koga; 昌史古賀; Kiyomichi Kurino; 清道栗野; Takeyuki Sugimoto; 建行杉本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-11-19
Filing date: 1993-11-19
Publication date: 1995-06-02
Anticipated expiration: 2017-11-25
Also published as: JP3351062B2

Abstract

PURPOSE: To obtain a document system that can handle semi-standard format documents by extracting vertical and horizontal ruled lines from a document image, identifying the sheet from the position and length information of the ruled lines and the per-style identification information in an identification dictionary, and reading the document based on the relative read-item area information stored for each style in the identification dictionary. CONSTITUTION: An identification dictionary 6 stores, for each document style, data consisting of the start coordinates and lengths of the ruled lines and the coordinates of the read-item areas. Image input 2 of a document 1 is performed, and the horizontal and vertical ruled lines are extracted (3). Sheet identification 4 then matches the extracted ruled lines against the identification dictionary 6 and determines the read-item area coordinates. The characters within that coordinate range are extracted from the document image and character recognition 8 is performed, followed either by read-item registration 9 alone or by read-item registration 9 together with image registration 10. The system thus handles semi-standard format documents, and displaying the image together with the recognition result windows of the read items makes correction easy for the user.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document system that automatically reads documents of differing styles and formats, registers and stores them, corrects read items such as titles, and retrieves the stored documents.

[0002]

[Prior Art] The style of a document specifies the entry items and their order, but not their absolute position coordinates or dimensions. The format of a document specifies not only the entry items and their order but also their absolute position coordinates and dimensions. A document for which only the style is defined is called a semi-standard format document, and a document for which the format is defined is called a standard format document.

[0003] A conventional optical character recognition device can read standard format documents when the document format is set in advance as format data, but it cannot read semi-standard format documents, for which only the style is specified, such as legal documents.

[0004] As a document system related to this kind of reading, Japanese Unexamined Patent Publication No. 56-11573 proposes a method in which a mark entry area, used to designate the read items to be input, is provided on the document in advance, and a predetermined area of the document is read by reading the marks entered in that area.

[0005] In addition, Japanese Unexamined Patent Publication No. 4-15775 describes a method of detecting a table structure in which search areas of a predetermined length, derived from a histogram, are numbered.

[0006] As for correcting reading results, the paper "A Study of a Misrecognition Correction Interface for Handwritten Character Recognition Systems", presented at the 44th National Convention of the Information Processing Society of Japan, proposes displaying the images that the machine segmented on its own for character recognition together with the recognition results.

[0007]

[Problems to Be Solved by the Invention] In the above prior art, adding a mark entry area increases the area that the entry area occupies on the page, so the number of read items cannot be increased. Moreover, because a mark entry area must be newly prepared, existing documents cannot be read as they are. Furthermore, for documents whose style is prescribed by law or regulation, a mark entry area is not permitted because the style is legally fixed, so such documents, which have no mark entry area, cannot be read.

[0008] When a table structure is detected by numbering search areas of a predetermined length derived from a histogram, the method is easily affected by noise and blurring of the ruled lines and by expansion or contraction of the image, so the numbers change and misrecognition occurs.

[0009] Because the images that the machine segmented on its own for character recognition are displayed together with the recognition results, the layout of the document is lost and the display is not easy for the user to work with.

[0010] A first object of the present invention is to provide a document system that can handle semi-standard format documents by identifying the sheet from the position information of the ruled lines or from a coding of the ruled lines.

[0011] A second object of the present invention is to provide a document system that offers a user-friendly display by showing the input document image, which preserves the layout, together with the recognition results in a single integrated view.

[0012]

[Means for Solving the Problems] To achieve the first object, the document system according to the present invention extracts vertical and horizontal ruled lines from a document image, identifies the sheet from the position and length information of those ruled lines and the identification information stored for each style in an identification dictionary, and reads the document based on the relative read-item area information stored for each style in the identification dictionary. Alternatively, the system extracts vertical and horizontal ruled lines from the document image, assigns codes to the ruled lines according to their positional relationships, identifies the sheet from this coded information and the per-style identification information in the identification dictionary, and reads the characters inside each read-item area based on the read-item area information, expressed in ruled-line codes, stored for each style in the identification dictionary.

[0013] To achieve the second object, the document system according to the present invention determines which input documents had recognition errors during reading, retains correction information such as the location of each recognized read item, its font size, and the recognized characters, and temporarily registers the input document images judged to contain recognition errors. When a document judged to contain a recognition error is corrected, a correction display layout is generated from the input document image and the correction information, and the correction is made on that layout.

[0014]

[Operation] The principle and operation of the present invention are described below.

[0015] FIG. 1 shows how a document image is input, the sheet is identified, and either the read items alone or the read items together with the image data are registered. First, image input 2 of document 1 is performed, and the horizontal and vertical ruled lines are extracted (3). Next, sheet identification 4 performs matching 5 between the extracted ruled lines and the identification dictionary 6, and read-item area extraction 7 obtains the read-item area coordinates. The characters within the read-item area coordinate range are extracted from the document image, character recognition 8 is performed, and read-item registration 9 is carried out. Alternatively, both read-item registration 9 and image registration 10 are carried out.

[0016] FIG. 2 shows how the sheet is identified from the start coordinates and lengths of the ruled lines. Document 1 undergoes image input 2, ruled lines are extracted from the input document, and the extracted ruled lines are separated into vertical and horizontal ruled lines. Ruled-line start-point/length data 23, consisting of the start point 21 and length 22 of each ruled line, is then obtained. Matching 27 is performed against the identification format data 26, consisting of ruled-line start points 24 and lengths 25, held in the identification dictionary 28, in which sheet identification data for various styles is registered in advance. The identification result 29, consisting of a style number and read-item area data, is then extracted from the identification dictionary 28.

[0017] FIG. 3 illustrates the sheet identification matching 27 of FIG. 2 and the read-item area extraction 7 of FIG. 1.

[0018] First, the input ruled-line start-point/length data 23 is compared with the identification format data 26 of the identification dictionary 28 in the identification-dictionary registration-count error extraction 31. This step consists of the following: length error extraction 32, which sums the lengths in the ruled-line start-point/length data 23 and the lengths in the identification format data 26 of the identification dictionary 28 and compares the two totals; start-point coordinate error 1 extraction 33, which compares the X coordinates of the start points in the two data sets; start-point coordinate error 2 extraction 34, which compares the Y coordinates of the start points in the two data sets; start-point coordinate error 3 extraction 35, which computes, within each data set, the X-coordinate differences between the start points of adjacent ruled lines and compares the differences from the ruled-line start-point/length data 23 with those from the identification format data 26; and start-point coordinate error 4 extraction 36, which does the same for the Y-coordinate differences between the start points of adjacent ruled lines.

[0019] As a result, the identification format data with the minimum error relative to the ruled-line start-point/length data 23 is found, and the identification dictionary 28 is consulted to extract the style number and the reading position data (37).
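The error extraction steps 32 to 36 and the minimum-error selection can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation disclosed in this publication. The names Line, form_error and identify_sheet are hypothetical. The ruled lines of the input and of each dictionary entry are assumed to be listed in the same order, each error is taken as a sum of absolute differences, the five errors are simply added with equal weight, and the fixed penalty for a differing number of ruled lines is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    x: float       # start-point X coordinate
    y: float       # start-point Y coordinate
    length: float  # ruled-line length


def adjacent_diffs(values):
    """Differences between consecutive values (adjacent ruled lines)."""
    return [b - a for a, b in zip(values, values[1:])]


def list_error(a, b):
    """Sum of absolute differences over paired elements (assumed metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def form_error(lines, ref, count_penalty=1000.0):
    """Total error between the extracted lines and one dictionary entry."""
    xs, ys = [l.x for l in lines], [l.y for l in lines]
    rxs, rys = [l.x for l in ref], [l.y for l in ref]
    # Step 32: compare the totals of the ruled-line lengths.
    err = abs(sum(l.length for l in lines) - sum(l.length for l in ref))
    err += list_error(xs, rxs)                                  # step 33
    err += list_error(ys, rys)                                  # step 34
    err += list_error(adjacent_diffs(xs), adjacent_diffs(rxs))  # step 35
    err += list_error(adjacent_diffs(ys), adjacent_diffs(rys))  # step 36
    # Assumption: penalize entries with a different number of ruled lines.
    err += count_penalty * abs(len(lines) - len(ref))
    return err


def identify_sheet(lines, dictionary):
    """Return (style_number, error) for the entry with the minimum error."""
    scored = ((num, form_error(lines, ref)) for num, ref in dictionary.items())
    return min(scored, key=lambda pair: pair[1])
```

With a dictionary such as {1: [...], 2: [...]} mapping style numbers to lists of Line, identify_sheet returns the style number whose registered ruled lines are closest to the scanned ones. A real system would likely weight the five errors differently and reject matches whose minimum error is still large.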

[0020] FIG. 4 shows how ruled lines are coded from their start and end point data, and how ruled-line relation data is derived, in order to identify the sheet by ruled-line coding. Hierarchical coding 43 is applied to the ruled-line data 42, which consists of the start and end coordinates of the input ruled lines 41, to obtain ruled-line code data 45 in which a code is assigned to each ruled line, as shown for the coded ruled lines 44. The resulting codes are as follows. For the vertical ruled lines, ID 1 is y1, ID 2 is y1-1, ID 3 is y1-1+1, and ID 4 is y1-2. For the horizontal ruled lines, ID 1 is x1, ID 2 is x2, and ID 3 is x2+1. Then, from the ruled-line data 42 and the ruled-line code data 45, the connection relations of the start and end points of the horizontal ruled lines on each vertical ruled line, and of the start and end points of the vertical ruled lines on each horizontal ruled line, are obtained. For the vertical ruled lines, the start-point relations are x1, x2, and x2+1 for y1, and the end-point relations are x1 for y1-1, and x2 and x2+1 for y1-2. For the horizontal ruled lines, the start-point relations are y1, y1-1, and y1-1+1 for x1, and y1-2 for x2; the end-point relations are y1-1 and y1-1+1 for x2, and y1 and y1-2 for x2+1. These relational expressions make up the ruled-line relation data 46.

[0021] FIG. 5 illustrates the coding 43 of FIG. 4. Based on the start and end coordinates of the ruled lines, numbering 51 is first applied to each non-overlapping ruled line of the first layer. Next, for each ruled line in the second and lower layers, the ruled line in the layer above is looked up: if the current line is longer than that line, the search moves on to the ruled line in the next layer up; if it is the same length, it is numbered with a + sign; and if it is shorter, it is numbered with a - sign (numbering 52). The overlap check and layer assignment are explained by separating the ruled lines 41 into vertical and horizontal ruled lines. Vertical ruled lines are checked for overlap and assigned to layers as viewed in the direction of arrow 53 (the X direction). Horizontal ruled lines are checked and assigned as viewed in the direction of arrow 54 (the Y direction). As a result, the first layer consists of the vertical ruled lines 55 and horizontal ruled lines 56, and the second and lower layers consist of the vertical ruled lines 57 and horizontal ruled lines 58.

[0022] FIG. 6 shows how the ruled-line relation data 46 obtained in FIG. 4 is matched against the identification dictionary. The ruled-line relation data 46 is matched, using the error calculation method 65, against dictionary A 63 and dictionary B 64 of the identification dictionary 62, whose entries follow the dictionary structure 61 consisting of a frame structure number, Don't Care data, ruled-line relation data, and read-frame codes. Specifically, the error calculation method 65 assigns an error of 10 to a difference in the number of relations or in a relational expression, and an error of 1 to a difference in the equality part of a relational expression. As a result, an error of 0 is obtained for dictionary A 63 and an error of 51 for dictionary B 64. Structure number 1 of dictionary A 63, which has the smallest error, and its read-item code data 66 are therefore output, and the read-item code data 66 is used for target ruled-line extraction 67 and for read-item area extraction 68 by line equations.

[0023] FIG. 7 shows matching that uses the Don't Care data. When one ruled line in the ruled-line start/end point data 41 of FIG. 4 is blurred (case 71), the ruled-line relation data becomes 72; when one noise line is added (case 73), it becomes 74. Because dictionary A 63 specifies D = y1-1*, the line y1-1 and the lines subordinate to it are treated as Don't Care during matching, so the underlined portions of the ruled-line relation data 72 and 74 are ignored and the error is 0 in both cases.
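The Don't Care handling can be sketched as a filtering step applied to both sides before comparison. This is a hypothetical illustration, not the disclosed implementation. It assumes that relation data is held as a mapping from a ruled-line code to the set of codes connected to it, and that a pattern ending in "*" covers the base code and any code that extends it with "+" or "-" (so "y1-1*" covers "y1-1" and "y1-1+1"). The function names and the sample relation data are invented for the example, since the contents of FIG. 7 are not reproduced in the text.

```python
def is_dont_care(code, patterns):
    """True if a ruled-line code is covered by a Don't Care pattern."""
    for pattern in patterns:
        if pattern.endswith("*"):
            base = pattern[:-1]
            if code == base:
                return True
            # A longer code counts only if it extends the base with + or -,
            # so "y1-1*" does not accidentally cover "y1-12".
            if code.startswith(base) and code[len(base)] in "+-":
                return True
        elif code == pattern:
            return True
    return False


def strip_dont_care(relations, patterns):
    """Remove Don't Care codes from {line_code: set_of_connected_codes}."""
    stripped = {}
    for key, members in relations.items():
        if is_dont_care(key, patterns):
            continue
        stripped[key] = {m for m in members if not is_dont_care(m, patterns)}
    return stripped


def matches_ignoring_dont_care(observed, registered, patterns):
    """Compare two relation sets after dropping the Don't Care codes."""
    return strip_dont_care(observed, patterns) == strip_dont_care(registered, patterns)
```

Under these assumptions, an input in which a subordinate line such as y1-1+1 is missing because of blurring still matches the dictionary entry, because that code is removed from both sides before the comparison.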

[0024] FIG. 8 shows how the read items of an input document are automatically recognized and how corrections are made when a recognition error occurs. Document 1 is input and the characters of the read items are recognized by automatic recognition 81. If recognition error judgment 82 finds an error, the recognition error image data and the recognition error information, which consists of the recognized characters, font size, and area coordinates of each read item, are held by the temporary storage 83 for recognition error image data and recognition error information. When a document with a reading error is corrected, a correction screen is created by read-item correction display layout creation 84, the correction is made through the integrated image/read-item display correction 85, and image data/read-item information registration 86 is performed. If recognition error judgment 82 finds no error, image data/read-item information registration 86 is performed in the storage device 86 directly.
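The routing between temporary storage and registration could be modeled roughly as follows. This is a minimal sketch under assumptions: the text does not say how recognition error judgment 82 decides that an error occurred, so a per-item rejected flag supplied by the recognizer is assumed, and the names ReadItem, Stores, route and correct are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReadItem:
    text: str               # recognized characters
    font_size: int          # font size kept for the correction layout
    box: tuple              # read-item area (left, top, right, bottom)
    rejected: bool = False  # assumed error flag from the recognizer


@dataclass
class Stores:
    pending: list = field(default_factory=list)     # temporary storage (83)
    registered: list = field(default_factory=list)  # registration (86)


def route(image, items, stores):
    """Hold documents with errors for correction; register the rest."""
    if any(item.rejected for item in items):
        stores.pending.append((image, items))
        return "pending"
    stores.registered.append((image, items))
    return "registered"


def correct(stores, index, fixes):
    """Apply user corrections {item_position: text}, then register."""
    image, items = stores.pending.pop(index)
    for position, text in fixes.items():
        items[position].text = text
        items[position].rejected = False
    stores.registered.append((image, items))
```

Keeping the font size and area coordinates with each item is what would later allow a correction screen to place each recognition result next to the matching part of the image.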

[0025] FIG. 9 shows the display method used during the correction of FIG. 8. Recognition result windows 92 are displayed on the document image 91, each near its reading position, as an integrated display. Specifically, in integrated display a 93, the recognition result window 97 for read item 1 is displayed below the document image portion 96 of read item 1, and the recognition result window 99 for read item 2 is displayed below the document image portion 98 of read item 2. This display changes as follows in response to instructions from the user or the like. When read item 2 has a recognition error, it changes to integrated display b 94, in which only the recognition result window 99 for read item 2 is displayed, below the document image portion 98 of read item 2. Alternatively, to make the document image portion 98 of read item 2 easier to see, it changes to integrated display c 95, in which the recognition result window 97 for read item 1 is displayed above the document image portion 96 of read item 1 and the recognition result window 99 for read item 2 is displayed below the document image portion 98 of read item 2.

[0026]

[Embodiments] Embodiments of the present invention are described in detail below. FIG. 10 shows the configuration of a document system according to the present invention. The system comprises an image input device 100 that inputs a document 79 as image data, an automatic recognition device 600 that recognizes the read items of the input document, a storage device 200 that registers the image data and read items of the document, a display device 300 that displays the image data and character windows, an external instruction input device 400 that receives instructions from the user, a retrieval device 700 that searches the registered documents, and a control device 500 that controls each of these devices.

[0027] FIG. 11 is a block diagram of an automatic recognition system that uses the start points and lengths of ruled lines. It comprises: a run data extraction device 601, serving as ruled-line information extraction means, which extracts runs of consecutive black dots (run data) from the document image; a ruled-line data extraction device 602, which extracts ruled-line data, that is, the start and end coordinates of ruled lines, from relatively long run data; a ruled-line data normalization device 603, which corrects the extracted ruled-line data as preprocessing for collation with the identification dictionary; a ruled-line start-point/length data extraction device 604, which extracts ruled-line start-point/length data from the normalized ruled-line data; a sheet identification device 605, which identifies the sheet based on the ruled-line start-point/length data of the input document and the identification dictionary held in the identification dictionary device 606; a read-item area extraction device 607, which extracts the read-item area coordinates from the identification result based on the read-item area data in the identification dictionary device 606; and a character recognition device 608, which recognizes the characters in each area based on the read-item area coordinates.

[0028] Next, the automatic recognition processing flow that uses the start points and lengths of ruled lines is described with reference to FIG. 12 and FIGS. 1, 2, 3, 10, and 11. First, in image input A1 of FIG. 12, the document 79 is input from the image input device 100 shown in FIG. 10, and the control device 500 transfers the input document image to the automatic recognition device 600. In run data extraction A2, run data is created by finding the start and end coordinates of connected black dots in the document image, and the run data whose run length exceeds a preset value is extracted.
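Run data extraction A2 can be illustrated with a short Python sketch. It assumes a binary image given as a list of rows in which 1 marks a black dot, and it keeps only runs strictly longer than a threshold. The function names and the image representation are assumptions made for the example and are not taken from this publication.

```python
def extract_runs(row, min_length):
    """Return (start, end) index pairs, end inclusive, of black-dot runs
    in one row whose length is greater than min_length."""
    runs = []
    start = None
    for i, pixel in enumerate(row):
        if pixel and start is None:
            start = i                      # a run begins
        elif not pixel and start is not None:
            if i - start > min_length:     # keep only long runs
                runs.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the row.
    if start is not None and len(row) - start > min_length:
        runs.append((start, len(row) - 1))
    return runs


def extract_horizontal_runs(image, min_length):
    """Return (y, x_start, x_end) for every long horizontal run."""
    return [
        (y, start, end)
        for y, row in enumerate(image)
        for start, end in extract_runs(row, min_length)
    ]
```

Vertical runs could be obtained in the same way by applying extract_runs to the columns, for example by iterating over zip(*image).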

[0029] This is executed by the run data extraction device 601 of FIG. 11. Ruled-line data extraction A3 searches the run data for connected runs and treats each group of connected runs as one ruled line (outlined as ruled-line extraction 3 in FIG. 1). It obtains the start and end coordinates of each ruled line and, from these coordinates, the slope of the ruled line, and extracts ruled-line data consisting of the start and end coordinates and the slope (shown in detail as ruled-line data 23 in FIG. 2). This is processed by the ruled-line data extraction device 602 of FIG. 11. Ruled-line data normalization A4 is preprocessing for identifying the document style from the ruled-line start-point/length data 23 of FIG. 2; that is, to make the data consistent with the identification dictionary device 606 of FIG. 11, it corrects skew, expansion/contraction, and translation based on the normalization data in the identification dictionary 28 of FIG. 2. This is performed by the ruled-line data normalization device 603 of FIG. 11. Specifically, the skew of the ruled-line data is corrected by adjusting the start and end coordinates of each ruled line using the slope obtained in ruled-line data extraction A3.
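One plausible way to carry out the skew correction is to rotate every start and end point by the negative of the measured skew angle. The sketch below assumes that the skew angle is taken from a nominally horizontal ruled line and that the rotation is about the image origin; neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def skew_angle(x0, y0, x1, y1):
    """Angle in radians of a nominally horizontal ruled line."""
    return math.atan2(y1 - y0, x1 - x0)


def deskew_point(x, y, angle):
    """Rotate a point by -angle about the origin."""
    c, s = math.cos(-angle), math.sin(-angle)
    return (x * c - y * s, x * s + y * c)


def deskew_lines(lines, angle):
    """Correct the start and end coordinates of each (x0, y0, x1, y1) line."""
    return [
        (*deskew_point(x0, y0, angle), *deskew_point(x1, y1, angle))
        for x0, y0, x1, y1 in lines
    ]
```

After this step a line that was measured with the skew angle becomes horizontal, so the later comparison of start coordinates and lengths is not distorted by page rotation.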

[0030] Expansion/contraction and translation are corrected by extracting from the ruled-line data the reference line that corresponds to the ruled line registered in the normalization data, comparing the reference line in the ruled-line data with the reference line in the normalization data, and calculating and applying the scaling ratio and the amount of translation. In ruled-line start-point/length data extraction A5 of FIG. 12, the start coordinates and length of each ruled line are obtained from the start and end coordinates of the normalized ruled-line data. This is performed by the ruled-line start-point/length data extraction device 604 of FIG. 11.
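The scaling and translation correction and the conversion to start-point/length data can be sketched as follows. The sketch assumes a single horizontal reference line, the same scaling ratio in X and Y, and lines stored as (x0, y0, x1, y1) tuples. These choices and the function names are assumptions for illustration, since the text does not specify how the ratio is computed.

```python
import math


def scale_and_shift(observed_ref, registered_ref):
    """Derive (scale, dx, dy) that maps the observed reference line onto
    the reference line registered in the normalization data."""
    ox0, oy0, ox1, _ = observed_ref
    rx0, ry0, rx1, _ = registered_ref
    scale = (rx1 - rx0) / (ox1 - ox0)   # ratio of the reference lengths
    return scale, rx0 - ox0 * scale, ry0 - oy0 * scale


def normalize(line, scale, dx, dy):
    """Apply the scaling ratio and translation to one ruled line."""
    x0, y0, x1, y1 = line
    return (x0 * scale + dx, y0 * scale + dy, x1 * scale + dx, y1 * scale + dy)


def start_and_length(line):
    """Convert start/end coordinates to (start_x, start_y, length)."""
    x0, y0, x1, y1 = line
    return (x0, y0, math.hypot(x1 - x0, y1 - y0))
```

For example, if the scanned reference line is twice as long as the registered one, every coordinate is halved and then shifted so that the two reference lines coincide.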

[0031] In sheet identification A6, sheet identification matching 27 (FIG. 2) is performed between the normalized ruled-line data and the identification dictionary 28 (FIG. 2) held in the identification dictionary device 606 (FIG. 11). This processing is executed by the sheet identification device 605 shown in FIG. 11.

[0032] The details of this matching method are described with reference to FIG. 3. First, ruled-line data whose start coordinates are close is selected from the identification format data 26 registered in the identification dictionary 28. In length error extraction 32, the lengths in the ruled-line start-point/length data 23 and the lengths in the identification format data 26 are each summed, and the two totals are compared.

[0033] Next, start-point coordinate error 1 extraction 33 compares the horizontal coordinates (X coordinates) of the start points in the ruled-line start-point/length data 23 with those in the identification format data 26 of the identification dictionary 28. Start-point coordinate error 2 extraction 34 compares the vertical coordinates (Y coordinates) of the start points in the same two data sets. Start-point coordinate error 3 extraction 35 computes, for each of the two data sets, the differences in X coordinate between the start points of adjacent ruled lines, and compares the differences from the ruled-line start-point/length data 23 with those from the identification format data 26.

[0034] Next, in start point coordinate error 4 extraction 36, the difference between the Y coordinates of the start points of adjacent ruled lines is obtained both for the ruled line start point/length data 23 and for the identification format data 26 in the identification dictionary 28, and the difference in the data 23 is compared with the difference in the data 26.
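
The five comparisons described above (length error extraction 32 and start point coordinate error extractions 33 to 36) can be illustrated with a minimal Python sketch. It is not taken from the specification: the tuple layout, the one-to-one pairing of ruled lines by list position, the use of absolute differences, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one ruled line = (start_x, start_y, length)
Line = Tuple[int, int, int]


def adjacent_diffs(values: List[int]) -> List[int]:
    """Differences between each value and the next one in the list."""
    return [b - a for a, b in zip(values, values[1:])]


def matching_errors(observed: List[Line], template: List[Line]) -> Dict[str, int]:
    """Compute the five error terms between observed lines and a dictionary entry.

    Assumes the lines have already been paired one-to-one by position.
    """
    if len(observed) != len(template):
        raise ValueError("ruled lines must be paired one-to-one")

    ox, oy, ol = zip(*observed)
    tx, ty, tl = zip(*template)

    def summed_abs(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    return {
        # 32: compare the totals of the ruled-line lengths
        "length": abs(sum(ol) - sum(tl)),
        # 33 and 34: compare start point X and Y coordinates
        "start_x": summed_abs(ox, tx),
        "start_y": summed_abs(oy, ty),
        # 35 and 36: compare X and Y spacing between adjacent ruled lines
        "adjacent_x": summed_abs(adjacent_diffs(list(ox)), adjacent_diffs(list(tx))),
        "adjacent_y": summed_abs(adjacent_diffs(list(oy)), adjacent_diffs(list(ty))),
    }


def total_error(errors: Dict[str, int]) -> int:
    """Equal-weight sum of all terms (the weighting is an assumption)."""
    return sum(errors.values())
```

Under these assumptions, a uniform shift of the whole sheet changes the `start_x` and `start_y` terms but leaves the two adjacent-spacing terms at zero, which is one reason to compare spacing as well as absolute position.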

[0035] As a result, read position extraction A7 of FIG. 12 can obtain the identification format data that has the minimum error with respect to the ruled line start point/length data 23. By referring to the identification dictionary 28, the form number F1, the start point X coordinate RX1 of the read item area, the start point Y coordinate RX2, the length RL from the start point coordinates, and the width RW from the start point coordinates are obtained from the form number/read item area data 29, and the read item area coordinates on the document image are extracted. This read position extraction processing (7 in FIG. 1) is executed by the read item area extraction device 607 of FIG. 11.
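
As a rough illustration of this step, the sketch below picks the dictionary entry with the smallest error and converts its stored read item areas into rectangles. The dictionary layout, the simple error measure, the function names, and the reading of RL as the extent along X and RW as the extent along Y are all assumptions for illustration, not details given in the specification.

```python
from typing import Dict, List, Tuple

Line = Tuple[int, int, int]        # hypothetical: (start_x, start_y, length)
Area = Tuple[int, int, int, int]   # hypothetical: (RX, RY, RL, RW)


def line_error(observed: List[Line], template: List[Line]) -> float:
    """Simplified error: summed absolute differences of paired ruled lines."""
    if len(observed) != len(template):
        return float("inf")  # a different line count is treated as no match
    return sum(abs(a - b) for o, t in zip(observed, template) for a, b in zip(o, t))


def identify_form(observed: List[Line], dictionary: Dict[str, dict]) -> str:
    """Return the form number whose stored ruled lines give the minimum error."""
    return min(dictionary, key=lambda form: line_error(observed, dictionary[form]["lines"]))


def read_item_rectangles(entry: dict) -> List[Tuple[int, int, int, int]]:
    """Convert (RX, RY, RL, RW) into (left, top, right, bottom) rectangles."""
    return [(rx, ry, rx + rl, ry + rw) for rx, ry, rl, rw in entry["areas"]]


# Hypothetical two-form dictionary
DICTIONARY = {
    "F1": {"lines": [(10, 10, 100), (10, 50, 100)], "areas": [(20, 15, 60, 30)]},
    "F2": {"lines": [(10, 10, 300), (10, 200, 300)], "areas": [(5, 5, 10, 10)]},
}

form = identify_form([(11, 10, 99), (10, 51, 100)], DICTIONARY)
rectangles = read_item_rectangles(DICTIONARY[form])
```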

[0036] Next, in read item character recognition A8 of FIG. 12, the characters at the read item area coordinates obtained in A7 are cut out from the document image and recognized. This is performed by the character recognition device 608 of FIG. 11. In registration A9 of FIG. 12, the character information of the read items, or the image data together with the read items, is registered in the storage device 200 (FIG. 10).

[0037] Next, another embodiment of the present invention will be described with reference to FIG. 13.

[0038] FIG. 13 is a block diagram of an automatic recognition system that uses ruled line encoding. The system comprises: a run data extraction device 611 that extracts run data, consisting of the start and end point coordinates of runs of consecutive black dots, from the document image; a ruled line data extraction device 612 that extracts ruled line data, consisting of start and end point coordinates, from the long run data; a frame block extraction device 613 that extracts, from the run data, frame block data consisting of the area between the minimum and maximum coordinates of a frame; a per-frame-block ruled line data extraction device 614 that selects ruled line data for each frame block for ruled line encoding; a ruled line data skew correction device 615 that corrects the skew of the extracted ruled line data; a ruled line encoding device 616 that, as preprocessing for matching the extracted ruled line data against the identification dictionary, extracts code data for the ruled lines based on their positional relationships; a ruled line relation extraction device 617 that extracts ruled line relation data based on the ruled line data and the ruled line code data; a sheet identification device 618 that identifies the sheet based on the ruled line relation data and ruled line code data of the input document and on the identification dictionary held in the identification dictionary device 619; a read item area extraction device 620 that, from the identification result, extracts read item area coordinates based on the read item code data in the identification dictionary device 619; and a character recognition device 621 that recognizes the characters in that area based on the read item area coordinates.

[0039] The automatic recognition processing flow that uses ruled line encoding will be described based on FIG. 14, with reference to FIGS. 1, 4, 5, 6, 7, 10, and 13.

[0040] First, in image input B1 of FIG. 14, the document 79 of FIG. 10 is input from the image input device 100, and the control device 500 transfers the input document image to the automatic recognition device 600.

[0041] Next, in run data extraction B2 of FIG. 14, the run data extraction device 611 of FIG. 13 examines the start and end point coordinates of connected black dots in the document image to create run data, checks the length of each run, and extracts the run data that is longer than a preset length.
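
Run data extraction of this kind can be sketched as follows. The binary image representation (rows of 0 and 1, with 1 meaning a black dot), the tuple layouts, and the function names are assumptions for illustration; the sketch covers horizontal runs only.

```python
from typing import List, Sequence, Tuple


def extract_runs(row: Sequence[int], min_length: int) -> List[Tuple[int, int]]:
    """Return (start_x, end_x) for each run of black dots at least min_length long."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x                      # a new run begins
        elif not pixel and start is not None:
            if x - start >= min_length:    # keep only sufficiently long runs
                runs.append((start, x - 1))
            start = None
    # a run may extend to the right edge of the row
    if start is not None and len(row) - start >= min_length:
        runs.append((start, len(row) - 1))
    return runs


def extract_long_runs(image: Sequence[Sequence[int]], min_length: int) -> List[Tuple[int, int, int]]:
    """Return (y, start_x, end_x) for every long horizontal run in the image."""
    return [(y, xs, xe) for y, row in enumerate(image) for xs, xe in extract_runs(row, min_length)]
```

Vertical runs could be obtained the same way by applying `extract_runs` to the columns of the image.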

[0042] Ruled line data extraction B3 of FIG. 14 corresponds to ruled line extraction 3 of FIG. 1. Based on the long run data, the ruled line data extraction device 612 of FIG. 13 searches for connections between runs and treats connected runs as one ruled line. It obtains the start and end point coordinates of the ruled line, calculates the slope of the ruled line from the coordinate values, and extracts ruled line data consisting of the start and end point coordinates and the slope value.
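
One way to picture this step is the sketch below, which links horizontal runs that touch across the same or adjacent rows and derives the start point, end point, and slope of each linked group. The linking rule (only the most recently added run of a group is checked), the run layout `(y, start_x, end_x)`, and the function names are simplifying assumptions, not the method of the specification.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # hypothetical: (y, start_x, end_x)


def link_runs(runs: List[Run]) -> List[List[Run]]:
    """Group runs that touch or overlap in X on the same or the next row."""
    groups: List[List[Run]] = []
    for y, xs, xe in sorted(runs):
        for group in groups:
            last_y, last_xs, last_xe = group[-1]
            if 0 <= y - last_y <= 1 and xs <= last_xe + 1 and xe >= last_xs - 1:
                group.append((y, xs, xe))
                break
        else:
            groups.append([(y, xs, xe)])   # no connected group found
    return groups


def ruled_line(group: List[Run]):
    """Return (start_point, end_point, slope) for one group of linked runs."""
    first = min(group, key=lambda r: r[1])   # run with the leftmost start
    last = max(group, key=lambda r: r[2])    # run with the rightmost end
    x0, y0 = first[1], first[0]
    x1, y1 = last[2], last[0]
    slope = (y1 - y0) / (x1 - x0) if x1 != x0 else 0.0
    return (x0, y0), (x1, y1), slope


# A slightly skewed line split over two rows, plus a level line further down
lines = [ruled_line(g) for g in link_runs([(10, 0, 49), (11, 50, 99), (40, 0, 99)])]
```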

[0043] In frame block extraction B4 of FIG. 14, the frame block extraction device 613 of FIG. 13 extracts contour data of the black dots from the input run data according to the connection relationships between runs, selects the contour data of frames from this contour data, and extracts frame block data consisting of block area data given by the minimum and maximum coordinates of each frame.

[0044] Next, in per-frame-block ruled line data extraction B5 of FIG. 14, the per-frame-block ruled line data extraction device 614 of FIG. 13 extracts, based on the ruled line data and the frame block data, the ruled line data of each frame block for ruled line encoding.
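
The frame block and the per-block selection of ruled lines can be pictured with the short sketch below. It assumes that a frame contour is available as a list of points and that a ruled line belongs to a block when both of its end points fall inside the block; these rules, the optional margin, and the function names are illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Block = Tuple[int, int, int, int]          # (min_x, min_y, max_x, max_y)
RuledLine = Tuple[Point, Point]            # (start_point, end_point)


def frame_block(contour: List[Point]) -> Block:
    """Block area given by the minimum and maximum coordinates of a frame contour."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    return (min(xs), min(ys), max(xs), max(ys))


def lines_in_block(lines: List[RuledLine], block: Block, margin: int = 0) -> List[RuledLine]:
    """Select the ruled lines whose two end points both lie inside the block."""
    min_x, min_y, max_x, max_y = block

    def inside(point: Point) -> bool:
        x, y = point
        return min_x - margin <= x <= max_x + margin and min_y - margin <= y <= max_y + margin

    return [line for line in lines if inside(line[0]) and inside(line[1])]
```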

[0045] Next, in ruled line data skew correction B6 of FIG. 14, the ruled line data skew correction device 615 corrects the skew of the start and end point coordinates of the ruled line data obtained by the per-frame-block ruled line data extraction 614 of FIG. 13, using the slope value obtained by the ruled line data extraction device 612, and thereby obtains the ruled line data 42 of FIG. 4.
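
The specification does not state how the slope value is applied. One plausible reading, shown here purely as an assumption, is a rotation of each end point about the image origin by the angle whose tangent is the measured slope. The function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def deskew_point(point: Point, slope: float) -> Point:
    """Rotate a point about the origin by -atan(slope)."""
    theta = math.atan(slope)
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return (x * c + y * s, -x * s + y * c)


def deskew_line(start: Point, end: Point, slope: float) -> Tuple[Point, Point]:
    """Apply the same correction to both end points of a ruled line."""
    return deskew_point(start, slope), deskew_point(end, slope)


# A line from (0, 0) to (99, 11) has slope 11/99; after correction it is level.
new_start, new_end = deskew_line((0, 0), (99, 11), 11 / 99)
```

After the correction the two end points share the same Y coordinate (to within floating-point error) and the length of the line is preserved.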

[0046] In ruled line encoding B7 of FIG. 14, as preprocessing for sheet identification, the ruled line encoding device 616 of FIG. 13 applies the hierarchical encoding 43 of FIG. 4 to the ruled line data 42 by the method of FIG. 5, as follows. Based on the start and end point coordinates of the ruled lines, numbering 51 is first performed for each non-overlapping ruled line of the first layer. Next, for each ruled line of the second and lower layers, the ruled line of the layer above it is found. If the ruled line is longer than that upper-layer ruled line, the ruled line of the next layer up is searched for; if it is the same length, it is numbered with a + sign; and if it is shorter, it is numbered with a - sign (numbering 52). This overlap check and layering are described separately for the vertical and horizontal ruled lines of the ruled lines 41. Vertical ruled lines are checked for overlap and layered as viewed in the direction of arrow 53 (X direction). Horizontal ruled lines are checked for overlap and layered as viewed in the direction of arrow 54 (Y direction). As a result, the first layer consists of vertical ruled lines 55 and horizontal ruled lines 56, and the second and lower layers consist of vertical ruled lines 57 and horizontal ruled lines 58. The resulting codes are as follows: for the vertical ruled lines, ID number 1 is y1, ID number 2 is y1-1, ID number 3 is y1-1+1, and ID number 4 is y1-2; for the horizontal ruled lines, ID number 1 is x1, ID number 2 is x2, and ID number 3 is x2+1. The ruled line code data 45 is thus obtained.

[0047] In ruled line relation data creation B8 of FIG. 14, in preparation for matching against the identification dictionary device 619, the ruled line relation extraction device 617 of FIG. 13 uses the ruled line data 42 and the ruled line code data 45 to determine, for each vertical ruled line, which horizontal ruled lines have their start point or end point coordinates connected to it, and, for each horizontal ruled line, which vertical ruled lines have their start point or end point coordinates connected to it. As a result, for the vertical ruled lines, the start point relation is x1, x2, x2+1 for y1, and the end point relations are x1 for y1-1 and x2, x2+1 for y1-2. For the horizontal ruled lines, the start point relations are y1, y1-1, y1-1+1 for x1 and y1-2 for x2, and the end point relations are y1-1, y1-1+1 for x2 and y1, y1-2 for x2+1. These relational expressions are extracted as the ruled line relation data 46.
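
The idea of recording which ruled lines start or end on which other ruled lines can be sketched as below. The data layout, the pixel tolerance, the convention that a horizontal line starts at its left end and a vertical line at its top end, and the result keys are all assumptions for illustration. The example uses a plain rectangle with hypothetical codes and does not reproduce the figure of the specification.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vertical = Tuple[int, int, int]     # hypothetical: (x, y_top, y_bottom)
Horizontal = Tuple[int, int, int]   # hypothetical: (y, x_left, x_right)


def relation_data(verticals: Dict[str, Vertical],
                  horizontals: Dict[str, Horizontal],
                  tol: int = 2) -> Dict[str, Dict[str, List[str]]]:
    """Build start/end connection relations between vertical and horizontal lines."""
    rel = {name: defaultdict(list) for name in ("v_start", "v_end", "h_start", "h_end")}
    for v_code, (vx, v_top, v_bottom) in verticals.items():
        for h_code, (hy, h_left, h_right) in horizontals.items():
            # horizontal line end points lying on this vertical line
            if v_top - tol <= hy <= v_bottom + tol:
                if abs(h_left - vx) <= tol:
                    rel["v_start"][v_code].append(h_code)
                if abs(h_right - vx) <= tol:
                    rel["v_end"][v_code].append(h_code)
            # vertical line end points lying on this horizontal line
            if h_left - tol <= vx <= h_right + tol:
                if abs(v_top - hy) <= tol:
                    rel["h_start"][h_code].append(v_code)
                if abs(v_bottom - hy) <= tol:
                    rel["h_end"][h_code].append(v_code)
    return {name: dict(mapping) for name, mapping in rel.items()}


# A simple rectangle: two vertical and two horizontal ruled lines
relations = relation_data(
    {"y1": (0, 0, 100), "y2": (200, 0, 100)},
    {"x1": (0, 0, 200), "x2": (100, 0, 200)},
)
```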

[0048] Next, in sheet identification B9 of FIG. 14, the sheet identification device 618 performs the matching of the sheet identification of FIG. 1 by the method of FIGS. 6 and 7, based on the extracted ruled line code data and ruled line relation data and on the identification dictionary device 619, as follows. The ruled line relation data 46 is matched, by the error calculation method 65, against dictionary A 63 and dictionary B 64 of the identification dictionary 62, each of which has the dictionary structure 61 composed of a frame structure number, Don't Care data, ruled line relation data, and read frame codes. Specifically, the error calculation method 65 gives an error of 10 for errors in the number of relations and in the relational expressions, and an error of 1 for the equality part of a relational expression. As a result, an error value of 0 is obtained for dictionary A 63 and an error value of 51 for dictionary B 64. Don't Care matching works as follows. When one ruled line in the ruled line start/end point data 41 of FIG. 4 is faint (case 71), the ruled line relation data becomes 72; when there is one noise line (case 73), the ruled line relation data becomes 74. Because D = y1-1* in dictionary A 63 marks y1-1 and everything below y1-1 as Don't Care, the underlined portions of the ruled line relation data 72 and 74 are ignored during matching. The error is therefore 0 in both cases, and the data matches dictionary A 63 with the minimum error.
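
The effect of Don't Care matching can be illustrated with the sketch below. It does not implement the weighting of the error calculation method 65, whose exact application is not spelled out in the text; instead it simply counts differing relation entries. The relation layout, the rule that a code is Don't Care when it equals the base code or extends it with `+` or `-`, the function names, and the sample data are assumptions for illustration.

```python
from typing import Dict, List, Optional, Set


def is_dont_care(code: str, base: Optional[str]) -> bool:
    """True if the code is the Don't Care base code or a code derived from it."""
    if base is None:
        return False
    return code == base or code.startswith(base + "+") or code.startswith(base + "-")


def mask(relations: Dict[str, List[str]], base: Optional[str]) -> Dict[str, Set[str]]:
    """Drop Don't Care codes from both the keys and the related codes."""
    return {
        key: {t for t in targets if not is_dont_care(t, base)}
        for key, targets in relations.items()
        if not is_dont_care(key, base)
    }


def mismatch(observed: Dict[str, List[str]],
             reference: Dict[str, List[str]],
             base: Optional[str] = None) -> int:
    """Count relation entries present on one side only (a simplified error)."""
    a, b = mask(observed, base), mask(reference, base)
    return sum(len(a.get(k, set()) ^ b.get(k, set())) for k in a.keys() | b.keys())


# Hypothetical data: one faint ruled line (y1-1) is missing from the input
reference = {"x1": ["y1", "y1-1", "y1-1+1"], "x2": ["y1-2"]}
observed = {"x1": ["y1", "y1-1+1"], "x2": ["y1-2"]}
```

With this sample data, the missing line produces a nonzero mismatch in a plain comparison, while masking `y1-1` and the codes derived from it brings the mismatch to zero.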

[0049] As a result, in read item area extraction B10 of FIG. 14, the read item area extraction device 620 of FIG. 13 performs the read position extraction 7 of FIG. 1. Using the structure number of the minimum-error dictionary A 63 in the identification dictionary 62 of FIG. 6 and the read item code data 66, target ruled line extraction 67 extracts the ruled line data corresponding to the read items, and the read item area coordinates on the document image are extracted by means of linear equations.
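
Extracting an area "by means of linear equations" can be read as intersecting the ruled lines that bound the read item. The sketch below uses the standard two-line intersection formula under that assumption; the line representation (two points each), the function names, and the choice of four bounding lines are illustrative and not taken from the specification.

```python
from typing import Tuple

Point = Tuple[float, float]
Line = Tuple[Point, Point]


def intersect(a: Line, b: Line) -> Point:
    """Intersection of two infinite lines, each given by two points."""
    (x1, y1), (x2, y2) = a
    (x3, y3), (x4, y4) = b
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if d == 0:
        raise ValueError("lines are parallel")
    det_a = x1 * y2 - y1 * x2
    det_b = x3 * y4 - y3 * x4
    px = (det_a * (x3 - x4) - (x1 - x2) * det_b) / d
    py = (det_a * (y3 - y4) - (y1 - y2) * det_b) / d
    return (px, py)


def read_item_corners(top: Line, bottom: Line, left: Line, right: Line):
    """Corners of the area bounded by four ruled lines, clockwise from top-left."""
    return (intersect(top, left), intersect(top, right),
            intersect(bottom, right), intersect(bottom, left))


corners = read_item_corners(
    top=((0, 10), (100, 10)), bottom=((0, 50), (100, 50)),
    left=((20, 0), (20, 60)), right=((80, 0), (80, 60)),
)
```

Because the lines are intersected analytically, the corners remain well defined even when the ruled lines are slightly skewed or do not quite touch in the scanned image.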

[0050] In read item character recognition B11 of FIG. 14, the characters at the output read item area coordinates are cut out from the document image and recognized by the character recognition device 621 of FIG. 13. In registration B12 of FIG. 14, the character information of the read items, or the image data together with the character information, is registered in the storage device 200.

[0051] FIG. 15 is a block diagram of a system that performs automatic recognition and corrects documents whose read items contain recognition errors. The system comprises: an image input device 800 that inputs the image data of the document 79; an automatic recognition device 807 that recognizes the read items of the input document; a determination device 808 that determines whether the automatic recognition resulted in a recognition error; a temporary storage device 802 that temporarily stores the image data of documents in which an error occurred in read item recognition; a recognition information storage device 809 that temporarily stores recognition error information, consisting of the recognized characters, font size, and read item area coordinates of the read items of such documents; a correction display layout creation device 803 that creates the layout of the correction display when a document with a recognition error is corrected; a display device 804 that displays image data and character windows; an external instruction input device 805 through which the user inputs instructions; a search device 810 that searches registered documents; and a control device 806 that controls each of these devices.

[0052] Next, with reference to FIGS. 8, 9, and 15, a method will be described for automatically recognizing the read items of an input document and correcting them when a recognition error occurs.

[0053] First, the document 79 of FIG. 15 is input from the image input device 800, and the automatic recognition 81 of FIG. 8, which automatically extracts and recognizes the characters of the read items of the document, is performed by the automatic recognition device 807 of FIG. 15. The recognition error determination of FIG. 8 is applied to this recognition result by the determination device 808 of FIG. 15. In the temporary storage 83 of recognition error document image data and recognition error information of FIG. 8, the image data and the recognition error information are stored in the temporary storage device 802 and the recognition information storage device 809 of FIG. 15, respectively.

Next, read item correction display layout creation 85 of FIG. 8 proceeds as follows. The user inputs a correction instruction from the external instruction input device 805 of FIG. 15, and the control device 806 instructs the correction display layout creation device 803. The correction display layout creation device 803 reads the image data of the document to be corrected from the temporary storage device 802 and the recognition error information from the recognition information storage device 809, and creates the layout of the correction screen. In the integrated image/read item display correction 86 of FIG. 8, the created correction screen is shown by the display method of FIG. 9, in which a recognition result window 92 is displayed on the document image 91, integrated with it, in the vicinity of the read position.

Integrated display a 93, in which the recognition result window 95 of read item 1 is displayed below the document image portion 94 of read item 1 and the recognition result window 97 of read item 2 is displayed below the document image portion 96 of read item 2, changes as follows in response to instructions from the user or the like. When read item 2 has a recognition error, the display changes either to integrated display b 94, in which only the recognition result window 97 of read item 2 is displayed below the document image portion 96 of read item 2, or, to make the document image portion 96 of read item 2 easier to see, to integrated display c 95, in which the recognition result window 95 of read item 1 is displayed above the document image portion 94 of read item 1 and the recognition result window 97 of read item 2 is displayed below the document image portion 96 of read item 2. The correction is then made.

For the corrected document, the control device 806 performs the image data/read item information registration 86 of FIG. 8 into the storage device 801 of FIG. 15. When the recognition error determination 82 of FIG. 8 finds no error, the control device 806 likewise performs the image data/read item information registration 86 of FIG. 8 into the storage device 801 of FIG. 15.
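
The window placement behind integrated displays a and c can be pictured with the sketch below. The rectangle layout, the rule of moving a window above its item when it would otherwise cover the image portion of the item being corrected, and the function names are assumptions for illustration and are not stated in the specification.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen coordinates


def window_rect(item: Rect, height: int, side: str) -> Rect:
    """Rectangle of a recognition result window placed below or above an item."""
    left, top, right, bottom = item
    if side == "below":
        return (left, bottom, right, bottom + height)
    return (left, top - height, right, top)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two rectangles share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def layout(items: List[Rect], error_index: int, height: int) -> List[Rect]:
    """Place each window below its item unless that would hide the error item."""
    placed = []
    for i, item in enumerate(items):
        window = window_rect(item, height, "below")
        if i != error_index and overlaps(window, items[error_index]):
            window = window_rect(item, height, "above")  # keep the error item visible
        placed.append(window)
    return placed


# Read item 1 sits just above read item 2, and read item 2 has the error
windows = layout([(100, 100, 300, 130), (100, 135, 300, 165)], error_index=1, height=20)
```

In this example the window of read item 1 moves above its image portion, while the window of read item 2 stays below, which mirrors the arrangement described for integrated display c.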

[0054]

[Effects of the Invention] As described above, according to the present invention, sheet identification for locating the read items in automatic recognition is performed using the start point coordinates and lengths of ruled lines, or by assigning codes to the ruled lines, so that documents in a semi-standard format can be handled. In addition, when read items are corrected, the image and the recognition result windows of the read items are displayed together, so that corrections are easy for the user to make.

[Brief description of drawings]

FIG. 1 is a diagram showing how, according to the present invention, a document image is input, sheet identification is performed, and read items, or read items together with image data, are registered.

FIG. 2 is a diagram showing how sheet identification is performed using the start point coordinates and lengths of ruled lines according to the present invention.

FIG. 3 is a diagram showing the method of matching for the sheet identification of FIG. 2 and of extracting read item areas according to the present invention.

FIG. 4 is a diagram showing how, in the method of the present invention that identifies sheets by encoding ruled lines, ruled lines are encoded from the ruled line start/end point data and ruled line relation data is obtained.

FIG. 5 is a diagram showing the encoding method of FIG. 4 according to the present invention.

FIG. 6 is a diagram showing how the ruled line relation data obtained in FIG. 4 is matched against the identification dictionary according to the present invention.

FIG. 7 is a diagram showing how matching is performed using Don't Care data according to the present invention.

FIG. 8 is a diagram showing how, according to the present invention, the read items of an input document are automatically recognized and corrected when a recognition error occurs.

FIG. 9 is a diagram showing the display method used during the correction of FIG. 8 according to the present invention.

FIG. 10 is a diagram showing the configuration of a document system for explaining automatic recognition according to an embodiment of the present invention.

FIG. 11 is a diagram showing the configuration of an automatic recognition system that uses the start points and lengths of ruled lines according to an embodiment of the present invention.

FIG. 12 is a diagram showing the flow of automatic recognition that uses the start points and lengths of ruled lines according to an embodiment of the present invention.

FIG. 13 is a diagram showing the configuration of an automatic recognition system that uses ruled line encoding according to an embodiment of the present invention.

FIG. 14 is a diagram showing the flow of automatic recognition that uses ruled line encoding according to an embodiment of the present invention.

FIG. 15 is a diagram showing the configuration of a system that performs automatic recognition and corrects documents having recognition errors according to an embodiment of the present invention.

[Explanation of symbols]

1: document, 2: image input, 3: ruled line extraction, 4: sheet identification, 5: matching, 6: identification dictionary, 7: read item area extraction, 8: character recognition, 9: read item registration, 10: image registration,
21: start point coordinates, 22: length, 23: ruled line start point/length data, 24: start point coordinates, 25: length, 26: identification format data, 27: matching, 28: identification dictionary, 29: form number/read item area data,
31: identification dictionary registration count error extraction, 32: length error extraction, 33: start point coordinate error 1 extraction, 34: start point coordinate error 2 extraction, 35: start point coordinate error 3 extraction, 36: start point coordinate error 4 extraction, 37: form number/read item area data extraction,
41: ruled lines, 42: ruled line data, 43: hierarchical encoding, 44: ruled lines, 45: ruled line code data, 46: ruled line relation data,
51: first-layer numbering, 52: numbering of the second and lower layers, 53: arrow, 54: arrow, 55: first-layer vertical ruled lines, 56: first-layer horizontal ruled lines, 57: vertical ruled lines of the second and lower layers, 58: horizontal ruled lines of the second and lower layers,
61: dictionary structure, 62: identification dictionary, 63: dictionary A, 64: dictionary B, 65: error calculation method, 66: read item code data, 67: target ruled line extraction, 68: read item position extraction by linear equations,
71: case of one faint line, 72: ruled line relation data, 73: case of one noise line, 74: ruled line relation data,
81: automatic recognition, 82: recognition error determination, 83: temporary storage of recognition error image data and recognition error information, 84: read item correction display layout creation, 85: integrated image/read item display correction, 86: image data/read item information registration,
91: document image, 92: recognition result window, 93: integrated display a, 94: integrated display b, 95: integrated display c, 96: document image of read item 1, 97: recognition result window of read item 1, 98: document image of read item 2, 99: recognition result window of read item 2,
79: document, 100: image input device, 200: storage device, 300: display device, 400: external instruction input device, 500: control device, 600: automatic recognition device, 700: search device,
601: run data extraction device, 602: ruled line data extraction device, 603: ruled line data normalization device, 604: ruled line start point/length data extraction device, 605: sheet identification device, 606: identification dictionary device, 607: read item area extraction device, 608: character recognition device,
A1: image input, A2: run data extraction, A3: ruled line data extraction, A4: ruled line data normalization, A5: ruled line start point/length data extraction, A6: sheet identification, A7: read item area extraction, A8: read item character recognition, A9: registration,
611: run data extraction device, 612: ruled line data extraction device, 613: frame block extraction device, 614: per-frame-block ruled line data extraction device, 615: ruled line data skew correction device, 616: ruled line encoding device, 617: ruled line relation data extraction device, 618: sheet identification device, 619: identification dictionary, 620: read item area extraction device, 621: character recognition device,
B1: image input, B2: run data extraction, B3: ruled line data extraction, B4: frame block extraction, B5: per-frame-block ruled line data extraction, B6: ruled line data skew correction, B7: ruled line encoding, B8: ruled line relation data creation, B9: sheet identification, B10: read item area extraction, B11: read item character recognition, B12: registration,
800: image input device, 801: storage device, 802: temporary storage device, 803: correction display layout creation device, 804: display device, 805: external instruction input device, 806: control device, 807: automatic recognition device, 808: determination device, 809: recognition information storage device, 810: search device.

Continuation of front page

(72) Inventor: Masashi Koga, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Kiyomichi Kurino, c/o Storage Systems Division, Hitachi, Ltd., 2880 Kozu, Odawara-shi, Kanagawa
(72) Inventor: Takeyuki Sugimoto, c/o Storage Systems Division, Hitachi, Ltd., 2880 Kozu, Odawara-shi, Kanagawa

Claims

[Claims]

1. A document system having an image input device for inputting a document image, and a recognition device that identifies the form of the document image input by the image input device, extracts a predetermined read item area, and recognizes item information as character codes from the image of that area, characterized in that the recognition device comprises: an identification dictionary that stores, for each document form of the input document images, data composed of the start point coordinates of the ruled lines, the lengths of the ruled lines, and the coordinates of the read item areas; ruled line information extraction means for obtaining the start point coordinates and lengths of the ruled lines of the document image input to the recognition device; identification means for determining the document form stored in the identification dictionary by collating the data of the identification dictionary with the obtained ruled line information; read item area extraction means for determining the read item area corresponding to the document form obtained by the identification means; and character recognition means for recognizing the characters written inside the image of the read item area.

2. A document system having an image input device for inputting a document image, and a recognition device that identifies the form of the document image input by the image input device, extracts a predetermined read item area, and recognizes item information as character codes from the image of that area, characterized in that the recognition device has an identification dictionary that stores, for each document form of the input document images, code data indicating the positional relationships between ruled lines, including their start points and end points; extracts the ruled lines, consisting of vertical ruled lines and horizontal ruled lines, of the document image input to the recognition device to obtain ruled line code data; collates the ruled line code data with the identification dictionary; determines a read item area from the intersections of the ruled lines; and recognizes characters from the image of that area.

3. The document system according to claim 2, wherein the recognition device assigns ruled line code data to vertical ruled lines in a horizontal coordinate order and horizontal ruled lines in a vertical coordinate order.

4. The document system according to claim 2, wherein the recognition device assigns the ruled line code data to vertical ruled lines according to the order of their horizontal coordinates and their contact relationships with horizontal ruled lines, and to horizontal ruled lines according to the order of their vertical coordinates and their contact relationships with vertical ruled lines.

5. The document system according to claim 2, wherein the recognition device assigns the ruled line code data by extracting vertical ruled lines that do not overlap one another in the vertical coordinate and assigning codes to them in vertical coordinate order, and by extracting horizontal ruled lines that do not overlap one another in the horizontal coordinate and assigning codes to them in horizontal coordinate order.

6. The document system according to claim 2, wherein the recognition device assigns the ruled line code data as follows: for vertical ruled lines, it extracts ruled lines that do not overlap in the vertical coordinate and, referring to the horizontal ruled lines, assigns a code that includes a hierarchical relationship if the vertical coordinate of a ruled line that has already been assigned a code lies in the same range; and for horizontal ruled lines, it extracts ruled lines that do not overlap in the horizontal coordinate and, referring to the vertical ruled lines, assigns a code that includes a hierarchical relationship if the horizontal coordinate of a ruled line that has already been assigned a code lies in the same range.

7. The document system according to claim 2, wherein the recognition device assigns ruled line code data in which the vertical ruled lines are ordered in the horizontal direction and the horizontal ruled lines are ordered in the vertical direction, and which includes the positions and lengths of the ruled lines.

8. The document system according to claim 1, wherein the ruled line information of the input document image is collated with the data of the identification dictionary by using the identification dictionary data of the format identified immediately before the collation.

9. The document system according to claim 1, wherein, as a result of collating the input document image with the identification dictionary, the form number of the document stored in the dictionary that gives the smallest error is obtained.
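
The smallest-error selection of claim 9 could look roughly like the sketch below. The error measure (a sum of absolute differences in start point and length, taken against the nearest extracted line) is an assumption chosen for illustration, as are the names `line_cost`, `match_error`, and `identify_form`; the claim only requires that the entry with the smallest error be chosen.

```python
def line_cost(a, b):
    """Absolute difference between two ruled lines given as (x, y, length)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def match_error(extracted, template):
    """Sum, over the template's ruled lines, of the cost to the nearest extracted line."""
    if not extracted:
        return float("inf")
    return sum(min(line_cost(t, e) for e in extracted) for t in template)


def identify_form(extracted, dictionary):
    """Return (form_number, error) for the dictionary entry with the smallest error.

    `dictionary` is assumed to map a form number to a list of (x, y, length) tuples.
    """
    best = min(dictionary, key=lambda form: match_error(extracted, dictionary[form]))
    return best, match_error(extracted, dictionary[best])
```

For example, with entries `{1: [(0, 0, 100), (0, 50, 100)], 2: [(0, 0, 200), (0, 80, 200)]}` and extracted lines `[(1, 2, 198), (0, 79, 201)]`, the second entry is selected because its lines lie within a few pixels of the extracted ones.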

10. The document system according to claim 1, wherein, as a result of collating the input document image with the identification dictionary, a frame structure is identified for each independent frame having a different outer contour, a structure number is extracted for each such frame, and the format is determined from the set of the structure numbers.

11. The document system according to claim 2, wherein the input document image is collated with the identification dictionary by converting the ruled line code data into relational data of the start points and end points of the vertical and horizontal ruled lines.

12. The document system according to claim 1 or 2, further comprising a run data extracting device that obtains, from the input document image data, the coordinates of the start point and the end point of each run of connected black pixels on each scanning line, and extracts the ruled lines of the document image by using the coordinate data as run data.

13. The document system according to claim 1, further comprising a run data extracting device that obtains, from the input image data, the coordinates of the start point and the end point of each run of connected black pixels on each scanning line as run data, removes short runs whose length is equal to or less than a predetermined threshold, and then connects the remaining runs to extract them as ruled lines.
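
The run-based extraction of claims 12 and 13 can be sketched for horizontal ruled lines as follows, assuming a binary image stored as rows of 0 (white) and 1 (black). All function names are hypothetical, and connecting runs across small gaps on the same scanning line (`max_gap`) is only one possible reading of "connects the remaining runs".

```python
def extract_runs(row):
    """Return (start, end) column pairs, end inclusive, for each black run in a scan line."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def drop_short_runs(runs, threshold):
    """Remove runs whose length is equal to or less than the threshold."""
    return [(s, e) for s, e in runs if e - s + 1 > threshold]


def connect_runs(runs, max_gap):
    """Join runs on one scan line that are separated by at most max_gap white pixels."""
    merged = []
    for s, e in sorted(runs):
        if merged and s - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def horizontal_line_segments(image, threshold, max_gap):
    """Return (y, start_x, end_x) candidate ruled line segments for every scan line."""
    segments = []
    for y, row in enumerate(image):
        kept = drop_short_runs(extract_runs(row), threshold)
        for s, e in connect_runs(kept, max_gap):
            segments.append((y, s, e))
    return segments
```

Removing short runs first discards most character strokes, so that mainly the long runs belonging to ruled lines remain to be connected. Vertical ruled lines could be handled the same way by scanning columns instead of rows.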

14. A document system comprising: an image input device for inputting a document image; a recognition device that extracts the position of the area in which a read item of the input document exists and recognizes its characters to extract read item information; a determination device that determines recognition errors, such as character recognition errors, in the recognition device; a recognition information storage device that stores the position and font size of the recognized read item information; a primary storage device that temporarily stores the input document image, and its recognition information, of a document determined by the determination device to have a recognition error; a display device for correcting the read item information of a document determined to have a recognition error; a correction display layout creation device that creates a display layout for the correction display based on the information stored in the recognition information storage device; and a storage device that stores the images and read item information of documents determined to have no recognition error, together with corrected document images and their read item information, the document system being characterized in that, when the characters of a read item determined by the determination device to be a recognition error are corrected, a correction display layout is created based on the recognition information, and the document image and a separate window for each recognition result are displayed integrally.

15. The document system according to claim 14, further comprising an input device for changing the display position and the display size of the recognition result separate window when the document image and the recognition result separate window are displayed integrally on the display device for correction.

16. The document system according to claim 14, wherein, when there are a plurality of recognition items, the correction display layout creation device displays, together with the document image, only the recognition result separate windows of the parts determined by the determination device to be recognition errors.

17. The document system according to claim 14, wherein, when there are a plurality of recognition items, the correction display layout creation device displays the recognition result separate windows of the parts determined by the determination device to be recognition errors and the recognition result separate windows of the parts determined to have no recognition error with different background colors, integrally with the document image.

18. The document system according to claim 14, wherein, when there are a plurality of recognition items, the correction display layout creation device highlights the recognition result separate windows of the parts determined by the determination device to be recognition errors and displays them integrally with the document image.
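
The three display variants of claims 16 to 18 might be sketched as a single layout function, shown below for illustration only. The types `ReadItem` and `ResultWindow`, the mode names, the colors, and the rule that a window is placed directly below its read item on the document image are all assumptions; the claims state only that the layout is derived from the stored position and font size.

```python
from dataclasses import dataclass


@dataclass
class ReadItem:
    name: str
    text: str        # recognition result
    x: int           # stored position of the item on the document image
    y: int
    font_size: int   # stored font size
    is_error: bool   # result of the recognition error determination


@dataclass
class ResultWindow:
    name: str
    text: str
    x: int
    y: int
    height: int
    background: str
    highlighted: bool


def build_layout(items, mode):
    """Create recognition result windows to overlay on the document image.

    mode: 'errors_only' (claim 16), 'background' (claim 17), 'highlight' (claim 18).
    """
    if mode not in ("errors_only", "background", "highlight"):
        raise ValueError(f"unknown mode: {mode}")
    windows = []
    for item in items:
        if mode == "errors_only" and not item.is_error:
            continue
        background = "yellow" if mode == "background" and item.is_error else "white"
        windows.append(ResultWindow(
            name=item.name,
            text=item.text,
            x=item.x,
            y=item.y + item.font_size,  # assumed placement: just below the read item
            height=item.font_size,
            background=background,
            highlighted=(mode == "highlight" and item.is_error),
        ))
    return windows
```

Because each window sits next to the image region it was read from, the operator can compare the recognized text with the original characters without looking away from the document.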