JP2001143018A

JP2001143018A - Character reader and method therefor

Info

Publication number: JP2001143018A
Application number: JP32187299A
Authority: JP
Inventors: Katsuhiko Aoki; 勝彦青木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1999-11-12
Filing date: 1999-11-12
Publication date: 2001-05-25

Abstract

PROBLEM TO BE SOLVED: To accurately detect the cells of an item column in a table format document that includes a header and a footer, and to accurately read the characters in the cells of the data column corresponding to the detected cells. SOLUTION: A cell coordinate detection part 14 extracts ruled lines from the image data of a table format document stored in an image memory 12 and finds the coordinate positions of the respective cells. A table-aligning part 18 aligns the cells in each row and each column and generates table structure data representing the structure of the table formed by the cells. Based on the table structure data, a row sorting part 20 detects the section in which rows having the same number of cells continue longest in the column direction, compares the leading row of that section with the row above it using the difference between the row widths, the interval between the rows, etc., and detects the row to be used as the item row. A character recognition part 22 then reads the characters in the cells of the item row, determines the attributes of the characters in the cells of the corresponding data rows on the basis of the characters read, and converts the characters in those cells into suitable character codes.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a character reading apparatus that reads characters from a tabular document, and in particular to a character reading apparatus and a character reading method suitable for electronically reading desired characters written on a document such as a form.

[0002]

2. Description of the Related Art

A character reading apparatus that extracts ruled lines from a tabular document and reads the characters in cells enclosed by the ruled lines has been proposed, for example, in Japanese Patent Publication No. 2740335.

[0003] This character reading apparatus has an input unit that inputs a tabular document as image data, a storage unit that stores the image data, a cell extraction unit that extracts ruled lines from the stored image data and recognizes the cells enclosed by them, an item column character recognition unit that recognizes the characters in those cells that correspond to the item column, a cell attribute criterion storage unit that stores the criteria for determining cell attributes, a cell attribute determination unit that, based on the criteria, determines the attributes of cells other than the item column from the characters in the item column, and a character recognition unit that recognizes the characters in each cell according to the determined cell attribute.

[0004] As described in the above patent publication, this apparatus targets tabular documents such as numerical tables and rosters that have an item column in the first row of a single table, and reads the characters in the data column following the item column on the basis of the respective item characters in the item column.

[0005] That is, the character reading apparatus scans the tabular document at the input unit and reads in image data representing the resulting image. The image data is stored sequentially in the storage unit. Once the image data has been stored, the cell extraction unit extracts the horizontal and vertical ruled lines and obtains the coordinate positions of the cells enclosed by them. When the cells have been extracted, the item column character recognition unit uses the coordinate position of each cell in the first row to read the image data of the characters in that cell from the storage unit, and recognizes each character.

[0006] Once the characters in each cell of the item column have been recognized, the cell attribute determination unit determines the cell attribute of the item column characters based on the criteria held in the cell attribute criterion storage unit, and thereby decides the cell attribute of the data column that follows. The result is set in the character recognition unit, which reads the characters in the corresponding cells of the data column following the item column and converts each character into a character code.

[0007]

PROBLEMS TO BE SOLVED BY THE INVENTION

In the conventional technique described above, however, the item column character recognition unit reads the characters written in the cells of the first row of the table extracted by the cell extraction unit as the item column characters. Consequently, when the tabular document contains a header, footer, or the like enclosed by ruled lines in addition to the target table, the item column cannot be recognized correctly, and character recognition in the data column may be impaired.

[0008] Likewise, when the item column is made up of several rows of cells laid out differently from the data column, for example when it is partitioned across two rows of cells, the correct item column for reading the data column may not be recognized, and character recognition in the data column may again be impaired.

[0009] Therefore, when a tabular document has a header or footer, or when the item column spans several rows, an operator or the like must specify the reading area, which is laborious.

[0010] An object of the present invention is to solve the above problems and to provide a character reading apparatus that can accurately detect the item column without human intervention and accurately read the characters in the required data column.

[0011]

MEANS FOR SOLVING THE PROBLEMS

To solve the above problems, a character reading apparatus according to the present invention reads characters written in predetermined cells enclosed by ruled lines in a tabular document, and is characterized by comprising: image input means for inputting image data representing an image of the tabular document; image data storage means for storing the image data from the image input means; cell coordinate detection means for extracting the image data of the ruled lines from the image data in the image data storage means and obtaining ruled line area data representing the coordinate information of each cell enclosed by the ruled lines; table alignment means for aligning the cells in the row direction and the column direction based on the ruled line area data obtained by the cell coordinate detection means and generating table structure data representing the table structure formed by the cells of each row and column; row classification means for classifying each row, based on the table structure data generated by the table alignment means, as cells of an item row, cells of a data row, or other cells; and character recognition means for reading from the image data storage means the characters written in each cell of at least the data rows classified by the row classification means and converting them into character codes, the characters in the data column cells corresponding to each item being recognized on the basis of the character attributes of the cells in the item row.

[0012] In this case, the row classification means preferably includes cell count detection means for obtaining the number of cells in each row from the table structure data generated by the table alignment means, continuous section detection means for detecting the section in which rows having the same number of cells continue longest in the column direction, and item row detection means for taking the first row of that section as a provisional item row and detecting the true item row from the result of comparing the provisional item row with the rows above it.

[0013] Advantageously, the item row detection means compares the provisional item row with the row above it. If the difference in row width, the row spacing, or both are within their respective predetermined values, the row above becomes the new provisional item row and is in turn compared with the row above it; if the predetermined values are exceeded, the provisional item row is taken as the true item row.

[0014] Alternatively, the item row detection means may compare the provisional item row with the row above it using the difference in the area of the cells contained in the rows, or the difference in the ratio of cell area to row area, together with the row spacing. If these are within their respective predetermined values, the row above becomes the new provisional item row and is in turn compared with the row above it; if the predetermined values are exceeded, the provisional item row is taken as the true item row.

[0015] In these cases, the table alignment means preferably determines which row each cell belongs to in the row direction from the coordinate position and the height of the cell, and aligns the cells of each row accordingly.

[0016]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the character reading apparatus according to the present invention will now be described in detail with reference to the accompanying drawings. FIG. 1 shows an embodiment of the character reading apparatus according to the present invention. The character reading apparatus of this embodiment extracts ruled lines from a tabular document such as the form K shown in FIG. 3 and reads the characters in predetermined cells enclosed by the ruled lines. It is a data conversion apparatus that, from a tabular document containing a header B and a footer C enclosed by ruled lines in addition to the table body A to be read, reads each character written in the cells of the data column of the table body A and converts it into a predetermined character code.

[0017] The main features of this embodiment are the table alignment unit 18 of FIG. 1, which aligns the cells enclosed by ruled lines in the row and column directions and generates table structure data representing the structure of the table formed by the cells, and the row classification unit 20, which uses that result to classify the item row and the data rows of the table body A, so that the characters in the corresponding data cells are effectively converted into character codes on the basis of the characters written in the cells of the classified item row.

[0018] Specifically, as shown in FIG. 1, the character reading apparatus of this embodiment includes an image input unit 10, an image memory 12, a cell coordinate detection unit 14, a table cell information memory 16, a table alignment unit 18, a row classification unit 20, a character recognition unit 22, a recognition result memory 24, a data processing unit 26, a display unit 28, an operation input unit 30, and a control unit 32.

[0019] The image input unit 10 is an input device that inputs image data representing the image on a tabular document. In this embodiment, an image scanner or the like that scans the target tabular document and reads the image signal as binary image data is advantageously used. The input image data is written sequentially to the image memory 12. The image memory 12 is a storage circuit that sequentially stores the image data from the image input unit 10, for example a storage device such as a frame memory that expands the input image data into a bitmap and stores it. The stored image data is supplied to the cell coordinate detection unit 14.

[0020] The cell coordinate detection unit 14 is a cell extraction circuit that detects the coordinate position of each cell enclosed by ruled lines from the image data stored in the image memory 12. For example, it is a coordinate detection circuit that detects ruled lines drawn as solid lines, in which image data "1" continues for at least a predetermined length in the row or column direction, or as broken lines, in which it continues intermittently, and obtains the coordinates of each cell from the positions of their intersections. The information representing the coordinate position of each cell is written sequentially to the table cell information memory 16 as ruled line area data and supplied to the table alignment unit 18.
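The solid-line case described above, a run of "1" pixels at least a predetermined length long, can be sketched as follows. This is a minimal illustration and not the circuit of the embodiment: the function names `find_runs` and `horizontal_rules`, the list-of-lists image format, and the `min_len` parameter are assumptions made for the example, and broken lines and intersection finding are not handled.

```python
def find_runs(row, min_len):
    """Return (start, end) pairs, end exclusive, for runs of 1s at least min_len long."""
    runs = []
    start = None
    for i, px in enumerate(row):
        if px == 1 and start is None:
            start = i                      # a run of black pixels begins
        elif px != 1 and start is not None:
            if i - start >= min_len:       # keep only runs long enough to be a ruled line
                runs.append((start, i))
            start = None
    if start is not None and len(row) - start >= min_len:
        runs.append((start, len(row)))     # run that reaches the right edge
    return runs


def horizontal_rules(image, min_len):
    """Return (y, x_start, x_end) for every horizontal ruled-line candidate."""
    return [(y, s, e) for y, row in enumerate(image) for s, e in find_runs(row, min_len)]


image = [
    [0, 1, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
rules = horizontal_rules(image, min_len=4)   # [(0, 1, 5), (2, 0, 6)]
```

Applying the same scan to the transposed image would give vertical candidates; the cell corners would then come from the crossings of the two sets.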

[0021] The table alignment unit 18 is a table structure detection circuit that obtains the table structure formed by the cells from the ruled line area data obtained by the cell coordinate detection unit 14. In this embodiment, it determines which cell belongs to which row and column, aligns the cells in the row and column directions, and generates table structure data representing the table structure. In particular, when aligning cells in the row direction, it decides which row each cell belongs to from the coordinate position and the height of the cell. The generated table structure data is written sequentially to the table cell information memory 16 and supplied to the row classification unit 20.

[0022] The row classification unit 20 is a cell classification circuit that, based on the table structure data generated by the table alignment unit 18, classifies the cells of each row as item row cells, data row cells, or other cells. In this embodiment, it detects the section in which rows having the same number of cells continue longest in the column direction, compares the first row of that section with the rows above it to find the item row, and classifies the rows into the item row, the data rows of the same-cell-count section that follows it, and the remaining header and footer rows.

[0023] More specifically, the row classification unit 20 of this embodiment includes a cell count detection unit that obtains the number of cells in each row from the table structure data, a continuous section detection unit that uses those counts to detect the section in which rows having the same number of cells continue longest in the column direction, and an item row detection unit that takes the first row of that section as a provisional item row and detects the true item row from the result of comparing the provisional item row with the rows above it. When comparing the provisional item row with the row above it, the item row detection unit of this embodiment determines whether the difference between the widths of the two rows is within a predetermined value and whether the spacing between the rows is within a predetermined value. If both are within their predetermined values, it takes the row above as the provisional item row and compares it with the row above that; if the predetermined values are exceeded, it takes the provisional item row as the true item row. Once the item row is detected, each row of the same-cell-count section below it is classified as a data row, the rows above the item row as header rows, and the rows below the data row section as footer rows, and the classification results are written to the table cell information memory 16. The row classification results written to the table cell information memory 16 are supplied to the character recognition unit 22 together with the ruled line area data and the table structure data.

[0024] The character recognition unit 22 is a character code conversion circuit that reads from the image memory 12 the image data of the characters written in each cell of the data rows classified by the row classification unit 20 and converts them into predetermined character codes. In this embodiment, it recognizes the characters in the cells of each data row on the basis of the character attributes of the corresponding cells of the item row classified by the row classification unit 20. More specifically, it includes an item row character recognition unit that reads the image data of the characters in each cell of the item row from the image memory 12 and converts them into character codes, a cell attribute criterion storage unit that stores the criteria for determining cell attributes, a cell attribute determination unit that determines the cell attributes of the data rows from the item row characters based on the criteria, and a data row character recognition unit that, according to the determined cell attributes, reads the image data representing the characters in the cells of each data row from the image memory 12 and converts them into character codes. The converted character codes are written to the recognition result memory 24.

[0025] The recognition result memory 24 is a storage circuit that stores, for each data row, the characters of each cell converted into character codes. The stored codes are converted into the corresponding characters and displayed on the display unit 28. Advantageously, they are displayed after the data processing unit 26 has applied processing such as sorting or numerical calculation. The display unit 28 is a display device such as a CRT (cathode ray tube) that displays the characters represented by the character codes from the recognition result memory 24; in this embodiment, it displays the character recognition results and the results of the data processing. Operations such as switching the display are carried out according to instructions from the operation input unit 30.

[0026] The operation input unit 30 is an operation circuit that is connected to a keyboard, a pointing device, or the like and is used for switching the display, instructing data processing, entering desired data, and so on. Its operation instructions are supplied to the control unit 32. The control unit 32 is a central processing unit that controls the character reading sequence, data processing, and so on. In this embodiment in particular, it is a sequence control circuit that, in response to image data input at the image input unit 10, controls the cell coordinate detection unit 14, the table alignment unit 18, the row classification unit 20, and the character recognition unit 22 in turn.

[0027] The operation of the character reading apparatus of this embodiment will now be described with reference to FIGS. 2 to 10. First, as shown in the flowchart of FIG. 2, in step S10 the image input unit 10 scans a tabular document, reads its image as binary image data, and stores the data sequentially in the image memory 12. In this embodiment, for example, the document image read is that of a form K which, as shown in FIG. 3, contains a header B and a footer C enclosed by ruled lines in addition to the table body A in which the data to be read is written.

[0028] When the image data has been stored in the image memory 12, the process proceeds to step S12, where the cell coordinate detection unit 14 reads the image data from the image memory 12, extracts the image data of each ruled line, and obtains the coordinate position of each cell enclosed by the ruled lines. The cell coordinate information obtained is written sequentially to the table cell information memory 16 as ruled line area data.

[0029] When the cell coordinate positions have been obtained, the process proceeds to step S14, where the table alignment unit 18 reads the ruled line area data from the table cell information memory 16 and uses the coordinate information to align the cells of each row in the row direction, determining which cell belongs to which row. If the rows were aligned in order from the first row using only the position coordinates of the cells in the row direction, then, as shown in FIG. 4 for example, the six cells 100 to 110 would be aligned as the third row, and cell 112, the lower cell in the second column, would be detected as a fourth row. In this embodiment, therefore, the cells are aligned in the row direction using their height as well as their position coordinates. As a result, as shown in FIG. 5, cell 112 is obtained as the seventh cell of the third row.
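One way to picture row alignment by position and height is the sketch below. It is an assumption-laden illustration, not the method of the embodiment: the `Cell` type, the `group_rows` function, and the pixel tolerance `tol` are hypothetical. A row's vertical extent is taken from the cells that start at its top edge, and any cell lying entirely inside that extent joins the row, so a vertically split column does not create an extra row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def group_rows(cells, tol=2):
    """Group cells into rows using both top position and height."""
    remaining = sorted(cells, key=lambda c: (c.y, c.x))
    rows = []
    while remaining:
        top = remaining[0].y
        # Cells that start at the row's top edge define how far down the row extends.
        bottom = max(c.y + c.h for c in remaining if abs(c.y - top) <= tol)
        # Every cell lying within [top, bottom] belongs to this row.
        row = [c for c in remaining if c.y >= top - tol and c.y + c.h <= bottom + tol]
        rows.append(sorted(row, key=lambda c: (c.x, c.y)))
        remaining = [c for c in remaining if c not in row]
    return rows


cells = [
    Cell(0, 0, 10, 10), Cell(10, 0, 10, 10), Cell(20, 0, 10, 10),   # first row
    Cell(0, 10, 10, 20), Cell(20, 10, 10, 20),                      # tall cells
    Cell(10, 10, 10, 10), Cell(10, 20, 10, 10),                     # split column
    Cell(0, 30, 10, 10), Cell(10, 30, 10, 10), Cell(20, 30, 10, 10),
]
rows = group_rows(cells)
row_sizes = [len(r) for r in rows]   # [3, 4, 3]
```

Grouping by top position alone would split the second row in two, because the lower half of the split column starts 10 pixels further down.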

[0030] The cells of every row are aligned in the row direction in the same way, based on their position coordinates and height. When alignment in the row direction is complete, the spacing between rows and the like are obtained in step S16 and the cells are aligned in the column direction. The table structure data representing the table structure formed by the cells of each row and column obtained in this way is written to the table cell information memory 16.

[0031] The process then proceeds to step S18, where the row classification unit 20 reads the table structure data from the table cell information memory 16 and obtains the number of cells in each row, as shown in FIG. 6. The process then proceeds to step S20, where the result of step S18 is used to detect the section in which rows having the same number of cells continue longest in the column direction, for example section D in FIG. 6, which runs from the 4th row to the 11th row with a cell count of 10.
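Steps S18 and S20 amount to finding the longest run of equal values in the list of per-row cell counts. The sketch below shows one way to do that; the function name `longest_equal_run` and the sample counts outside the run of 10s are invented for illustration (the text only states that rows 4 to 11 each contain 10 cells).

```python
def longest_equal_run(counts):
    """Return (start, end), end exclusive, of the longest run of equal values.

    Indices are 0-based; the earliest run wins on ties.
    """
    best = (0, 0)
    start = 0
    for i in range(1, len(counts) + 1):
        # A run ends at the end of the list or where the value changes.
        if i == len(counts) or counts[i] != counts[start]:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i
    return best


# Hypothetical cell counts for a 13-row form; rows 4 to 11 (1-based) have 10 cells.
counts = [1, 3, 7, 10, 10, 10, 10, 10, 10, 10, 10, 2, 2]
section = longest_equal_run(counts)   # (3, 11), i.e. rows 4 to 11 when counted from 1
```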

[0032] The process then proceeds to step S22, where the first row of the section D detected in step S20 is taken as the provisional item row E, as shown in FIG. 7, and on to step S24. In step S24, as shown in FIG. 8, the difference between the width of the provisional item row E and that of the row F above it is obtained and checked against a predetermined value, and the spacing between the rows is likewise checked against a predetermined value. If both are within their predetermined values, the process moves to step S26, where the row F above becomes the provisional item row, as shown in FIG. 9, and returns to step S24, where row F is compared with the row G above it. If the predetermined values are then exceeded, the process proceeds to step S28, where the provisional item row F is taken as the true item row, and item row F is detected as shown in FIG. 10.
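The loop of steps S22 to S28 can be sketched as below. This is an illustrative reading, not the embodiment's implementation: the `Row` type, the `find_item_row` function, and the two threshold parameters are hypothetical, and the sketch assumes the search moves up only while both the width difference and the spacing are within their thresholds, stopping otherwise.

```python
from dataclasses import dataclass


@dataclass
class Row:
    top: int
    bottom: int
    left: int
    right: int

    @property
    def width(self):
        return self.right - self.left


def find_item_row(rows, section_start, max_width_diff, max_gap):
    """Walk upward from the first row of the section to the true item row."""
    candidate = section_start              # S22: provisional item row
    while candidate > 0:
        current = rows[candidate]
        above = rows[candidate - 1]
        width_diff = abs(current.width - above.width)
        gap = current.top - above.bottom   # vertical spacing between the two rows
        if width_diff <= max_width_diff and gap <= max_gap:
            candidate -= 1                 # S26: the row above becomes provisional
        else:
            break                          # S28: current provisional row is the item row
    return candidate


rows = [
    Row(0, 20, 0, 100),      # title box
    Row(30, 50, 0, 120),     # row G: header box, narrower and far away
    Row(80, 100, 0, 300),    # row F: sits directly on top of the section
    Row(100, 120, 0, 300),   # row E: first row of the longest section
    Row(120, 140, 0, 300),
]
item_row = find_item_row(rows, section_start=3, max_width_diff=10, max_gap=5)   # 2
```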

[0033] Once item row F has been detected, in step S30 each row of the section D of consecutive rows with the same cell count below the item row is classified as a data row, each row above the item row as a header row, and each row below the data row section D as a footer row. The classification results are written to the table cell information memory 16 and supplied to the character recognition unit 22 together with the ruled line area data and the table structure data.
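The labelling in step S30 can be written as a small function over row indices. The function name `classify_rows` and the label strings are assumptions for this sketch. The text does not say how to treat rows that might fall between the detected item row and the start of the section, so they are labelled "other" here purely as a placeholder.

```python
def classify_rows(n_rows, item_row, section_start, section_end):
    """Label rows 0..n_rows-1; indices are 0-based and section_end is exclusive."""
    labels = []
    for i in range(n_rows):
        if i < item_row:
            labels.append("header")        # rows above the item row
        elif i == item_row:
            labels.append("item")
        elif section_start <= i < section_end:
            labels.append("data")          # rows of the section below the item row
        elif i >= section_end:
            labels.append("footer")        # rows below the data section
        else:
            labels.append("other")         # case not covered by the text (assumption)
    return labels


# 13 rows, item row at index 2, section of equal cell counts at indices 3..10.
labels = classify_rows(13, item_row=2, section_start=3, section_end=11)
```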

[0034] When row classification is complete, the process proceeds to step S32, where the character recognition unit 22 uses the coordinate position of each cell in the row classified as the item row to read the image data corresponding to the characters in those cells from the image memory 12, and recognizes each character. Next, in step S34, the cell attributes of the data rows are determined from the item row characters based on criteria stored in advance. Then, in step S36, the image data representing the characters in the cells of each data row corresponding to the item row cells is read from the image memory 12, and each character is converted into a character code according to the cell attribute determined in step S34. The converted character codes are written to the recognition result memory 24.
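To show why a cell attribute derived from the item row can help recognition in steps S34 and S36, the sketch below restricts the recognizer's candidate characters to those allowed by the attribute. Everything specific here is hypothetical: the text does not state what the stored criteria are, so the `CRITERIA` table, the attribute names, and the candidate-with-score format are invented for illustration.

```python
import string

# Hypothetical criteria: item text (lower-cased) -> attribute of the data cells below it.
CRITERIA = {"amount": "numeric", "quantity": "numeric", "name": "text"}

# Characters permitted for each attribute; None means no restriction.
ALLOWED = {"numeric": set(string.digits + ".,-"), "text": None}


def cell_attribute(item_text):
    """Look up the attribute for a recognized item-row string (default: text)."""
    return CRITERIA.get(item_text.strip().lower(), "text")


def pick_character(candidates, attribute):
    """Choose the best-scoring candidate that the attribute allows.

    candidates is a list of (character, score) pairs from a recognizer.
    """
    allowed = ALLOWED[attribute]
    usable = [(ch, s) for ch, s in candidates if allowed is None or ch in allowed]
    if not usable:
        usable = candidates            # fall back rather than return nothing
    return max(usable, key=lambda cs: cs[1])[0]


candidates = [("O", 0.60), ("0", 0.55)]    # letter O narrowly beats digit zero
in_amount_column = pick_character(candidates, cell_attribute("Amount"))   # "0"
in_name_column = pick_character(candidates, cell_attribute("Name"))       # "O"
```

The same ambiguous glyph is resolved differently depending on the column heading, which is the effect the attribute determination is meant to achieve.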

[0035] Next, in step S38, the characters of each data row written to the recognition result memory 24 are displayed on the display unit 28 so that it can be confirmed whether each character has been read correctly. If the characters are correct, the process moves to step S40, where the data processing unit 26 is started in response to operation of the operation input unit 30 and applies the desired data processing, such as sorting or numerical calculation, and the process ends.

[0036] Thereafter, in the same way, image data representing the document image of each tabular document is read, ruled lines are extracted from it, and the coordinate position of each cell enclosed by ruled lines is obtained. The cells are then aligned in the row and column directions on the basis of the coordinate position and height of each cell, and from the result the rows are classified into item-row cells, data-row cells, and other cells. In particular, when detecting the item row, the longest section in which rows with the same number of cells continue in the column direction is found, and the item row is detected from the result of comparing the first row of that section with the row above it, and then with the rows further above. Once the item row is detected, the characters in its cells are recognized, and on the basis of that result the characters in the data rows are recognized and converted into character codes. As a result, the characters written in the cells of each required data row in table body A of the tabular document are read accurately, and the desired data processing is executed.
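The search for the longest section of rows with the same number of cells can be sketched in Python as follows. The function name `longest_equal_run` and the representation of the table as a list of per-row cell counts are assumptions made for illustration; the specification does not prescribe a data layout.

```python
def longest_equal_run(cell_counts):
    """Return (start, end), inclusive, of the longest run of consecutive rows
    that all contain the same number of cells.

    cell_counts[i] is the number of cells in row i, counted from the top.
    The list is assumed to be non-empty; ties keep the earliest run.
    """
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(cell_counts) + 1):
        # A run ends at the end of the table or where the cell count changes.
        if i == len(cell_counts) or cell_counts[i] != cell_counts[start]:
            if (i - 1) - start > best_end - best_start:
                best_start, best_end = start, i - 1
            start = i
    return best_start, best_end


# One header row, four rows of three cells, then a footer row of two cells.
section = longest_equal_run([1, 3, 3, 3, 3, 2])
```

The first row of the returned section would serve as the provisional item row that is then compared with the rows above it.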

[0037] As described above, according to the character reading device of this embodiment, the table aligning unit 18 aligns the cells in the row and column directions and generates table structure data representing the table structure formed by the cells. On the basis of that table structure data, the row classification unit 20 detects the section in which rows with the same number of cells continue in the column direction and compares the first row of that section with the row above it, and then with rows further above, to detect the item row. The item row can therefore be detected accurately even when the tabular document contains a header or footer enclosed by ruled lines, and as a result the characters written in the data rows can be accurately converted into character codes and read on the basis of the item-row characters.

[0038] In addition, when the table aligning unit 18 aligns cells in the row direction, it determines which row each cell belongs to on the basis of the cell's position coordinate in the row direction and its height, and aligns the rows accordingly. Therefore, even when the item row is divided into cells spanning multiple tiers, each cell of the item row can be detected accurately, and as a result the characters written in the data rows can be accurately converted into character codes and read on the basis of the item-row characters.
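The row assignment described here can be illustrated with a Python sketch. The rule used below is an assumption chosen for illustration, not the procedure disclosed in the specification: a cell joins the current row when its top edge lies inside the vertical band already covered by that row. The `(x, y, width, height)` tuple layout and the name `group_cells_into_rows` are also hypothetical.

```python
def group_cells_into_rows(cells, tolerance=0):
    """Group cells into rows using each cell's top coordinate and height.

    cells is a list of (x, y, width, height) tuples with y increasing
    downward. A cell whose top lies above the bottom of the current row
    band (minus tolerance) is placed in that row, so a tall cell next to
    stacked short cells keeps all of them in one row.
    """
    rows = []
    row_bottom = None
    for cell in sorted(cells, key=lambda c: (c[1], c[0])):
        x, y, width, height = cell
        if rows and y < row_bottom - tolerance:
            rows[-1].append(cell)
            row_bottom = max(row_bottom, y + height)
        else:
            rows.append([cell])
            row_bottom = y + height
    for row in rows:
        row.sort(key=lambda c: (c[0], c[1]))  # left to right, then top down
    return rows


# A two-tier item row: one tall cell beside two stacked cells, then a data row.
cells = [
    (0, 0, 50, 40), (50, 0, 50, 20), (50, 20, 50, 20),
    (0, 40, 50, 20), (50, 40, 50, 20),
]
rows = group_cells_into_rows(cells)
```

With this input the three cells of the two-tier item row fall into one row and the two data cells into the next, which is the behavior the paragraph describes for item rows divided into multiple tiers.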

[0039] Therefore, even when a tabular document has a header or footer, or when the item row spans multiple tiers, the characters in the required data fields can be read accurately without an operator or the like designating a reading area.

[0040] In the above embodiment, when comparing the provisional item row with the row above it, the row classification unit 20 detects the item row on the basis of the difference in row width and the row spacing. In the present invention, however, when the provisional item row is compared with the row above it, the item row may instead be detected by comparing the area of the cells contained in those rows, or the ratio of cell area to row area, together with the row spacing, and determining whether each difference is within a predetermined value.
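The two comparison criteria mentioned in this paragraph can be sketched in Python as interchangeable tests used by one upward search. The row dictionaries with `top`, `bottom`, `width`, and `cell_area` keys, the function names, and the threshold values are all assumptions for illustration; the specification gives no concrete values.

```python
def similar_by_width(current, upper, max_width_diff, max_gap):
    """Embodiment criterion: row-width difference and row spacing."""
    width_diff = abs(current["width"] - upper["width"])
    gap = current["top"] - upper["bottom"]
    return width_diff <= max_width_diff and gap <= max_gap


def similar_by_area_ratio(current, upper, max_ratio_diff, max_gap):
    """Alternative criterion: cell-area to row-area ratio and row spacing."""
    def ratio(row):
        # Row area is assumed positive (width and height greater than zero).
        return row["cell_area"] / (row["width"] * (row["bottom"] - row["top"]))

    gap = current["top"] - upper["bottom"]
    return abs(ratio(current) - ratio(upper)) <= max_ratio_diff and gap <= max_gap


def find_true_item_row(rows, provisional, is_similar):
    """Move the provisional item row upward while the row above is similar."""
    while provisional > 0 and is_similar(rows[provisional], rows[provisional - 1]):
        provisional -= 1
    return provisional


rows = [
    {"top": 0, "bottom": 20, "width": 60, "cell_area": 600},     # header
    {"top": 40, "bottom": 60, "width": 200, "cell_area": 4000},  # upper item tier
    {"top": 60, "bottom": 80, "width": 200, "cell_area": 4000},  # provisional
    {"top": 80, "bottom": 100, "width": 200, "cell_area": 4000}, # data
]
by_width = find_true_item_row(
    rows, 2, lambda cur, up: similar_by_width(cur, up, 5, 2))
by_ratio = find_true_item_row(
    rows, 2, lambda cur, up: similar_by_area_ratio(cur, up, 0.1, 2))
```

Passing the criterion as a function keeps the upward search identical for both variants; only the similarity test changes.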

[0041] In the above embodiment, an image scanner that scans a tabular document and forms binary image data was described as an example of the image input unit 10. The present invention is not limited to this, however, and the image input unit may be, for example, a circuit that receives and inputs an image signal from a transmission device such as a facsimile machine. Furthermore, the present invention is not limited to binary image data; as long as the ruled lines and characters can be read accurately from the image data, the image input unit may be, for example, a circuit that inputs image data containing gradation or color image data.

[0042] In the above embodiment, for simplicity of explanation, the case where tabular documents are read one sheet at a time and character reading is executed for each was described as an example. In the present invention, however, a plurality of sheets may be read consecutively and each processed.

[0043]

[Effects of the Invention] As described above, the present invention has table aligning means that aligns the cells in the row and column directions on the basis of ruled-line area data representing the coordinate information of cells enclosed by ruled lines and generates table structure data representing the table structure formed by the cells of each row and column, and row classifying means that classifies each row into item-row cells, data-row cells, and other cells on the basis of the table structure data generated by the table aligning means. The item row can therefore be detected accurately even when the tabular document contains, in addition to the target table, a header, footer, or the like enclosed by ruled lines, and the characters in the data rows can be accurately converted into character codes on the basis of the item-row characters. This eliminates the work of having an operator or the like designate the reading area, so processing can be executed efficiently in a short time.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of a character reading device according to the present invention.

FIG. 2 is a flowchart for explaining the operation of the character reading device according to the embodiment of FIG. 1.

FIG. 3 is a diagram showing an example of a tabular document read by the character reading device according to the embodiment of FIG. 1.

FIG. 4 is a diagram showing the processing steps of the table aligning unit 18 of the character reading device according to the embodiment of FIG. 1.

FIG. 5 is a diagram showing the processing result of the table aligning unit 18 of the character reading device according to the embodiment of FIG. 1.

FIG. 6 is a diagram showing the detection of a continuous section by the row classification unit 20 of the character reading device according to the embodiment of FIG. 1.

FIG. 7 is a diagram showing the determination of a provisional item row by the row classification unit 20 of the character reading device according to the embodiment of FIG. 1.

FIG. 8 is a diagram showing an example of the provisional item row comparison process performed by the row classification unit 20 of the character reading device according to the embodiment of FIG. 1.

FIG. 9 is a diagram showing an example of the provisional item row comparison process performed by the row classification unit 20 of the character reading device according to the embodiment of FIG. 1.

FIG. 10 is a diagram showing the true item row determined by the row classification unit 20 of the character reading device according to the embodiment of FIG. 1.

[Explanation of symbols]

10 Image input unit
12 Image memory
14 Cell coordinate detection unit
16 Table cell information memory
18 Table aligning unit
20 Row classification unit
22 Character recognition unit

Claims

[Claims]

1. A character reading device for reading characters written in predetermined cells enclosed by ruled lines from a tabular document, the device comprising: image input means for inputting image data representing an image of the tabular document; image data storage means for storing the image data from the image input means; cell coordinate detecting means for extracting image data of ruled lines from the image data stored in the image data storage means and obtaining ruled-line area data representing coordinate information of each cell enclosed by the ruled lines; table aligning means for aligning the cells in the row direction and the column direction on the basis of the ruled-line area data obtained by the cell coordinate detecting means and generating table structure data representing the table structure formed by the cells of each row and column; row classifying means for classifying each row into item-row cells, data-row cells, and other cells on the basis of the table structure data generated by the table aligning means; and character recognition means for reading, from the image data storage means, the characters written in at least the data-row cells among the cells classified by the row classifying means and converting them into character codes, wherein the character recognition means recognizes the characters in the data-row cells corresponding to each item on the basis of the attribute of the characters in the cells written in the item row.

2. The character reading device according to claim 1, wherein the row classifying means comprises: continuous section detecting means for obtaining the number of cells included in each row on the basis of the table structure data generated by the table aligning means and detecting the longest section in which rows having the same number of cells continue in the column direction; and item row detecting means for taking the first row of the section as a provisional item row and detecting the true item row from the result of comparing the provisional item row with the rows above it.

3. The character reading device according to claim 2, wherein the item row detecting means compares the provisional item row with the row above it; when the difference in row width, the row spacing, or both are within respective predetermined values, takes the upper row as a new provisional item row and compares it with the row above that; and when they exceed the respective predetermined values, takes the provisional item row as the true item row.

4. The character reading device according to claim 2, wherein the item row detecting means compares the provisional item row with the row above it; when the difference in the area of the cells included in the rows, or the difference in the ratio of cell area to row area, and the row spacing are each within a predetermined value, takes the upper row as a new provisional item row and compares it with the row above that; and when they exceed the respective predetermined values, takes the provisional item row as the true item row.

5. The character reading device according to claim 1, wherein the table aligning means determines which row each cell belongs to on the basis of the coordinate position of each cell in the row direction and its height, and thereby aligns the cells of each row.

6. A character reading method for reading characters written in predetermined cells enclosed by ruled lines from a tabular document, the method comprising: a step of inputting image data representing an image of the tabular document by image input means; a step of storing the image data from the image input means in image data storage means; an extracting step of extracting image data of ruled lines from the image data stored in the image data storage means and obtaining ruled-line area data representing coordinate information of each cell enclosed by the ruled lines; an aligning step of aligning the cells in the row direction and the column direction on the basis of the ruled-line area data and generating table structure data representing the table structure formed by the cells of each row and column; a classifying step of classifying each row into item-row cells, data-row cells, and other cells on the basis of the table structure data; and a step of reading, from the image data storage means, the characters written in at least the data-row cells among the classified cells and converting them into character codes, in which the characters in the data-row cells corresponding to each item are recognized on the basis of the attribute of the characters in the cells written in the item row.

7. The character reading method according to claim 6, wherein the classifying step further comprises: a data row detecting step of obtaining the number of cells included in each row on the basis of the table structure data and detecting the longest section in which rows having the same number of cells continue in the column direction; and an item row detecting step of taking the first row of the section as a provisional item row and detecting the true item row from the result of comparing the provisional item row with the rows above it.

8. The character reading method according to claim 7, wherein the item row detecting step further includes a comparison step of comparing the provisional item row with the row above it; when the difference in row width, the row spacing, or both are within respective predetermined values, taking the upper row as a new provisional item row and comparing it with the row above that; and when they exceed the respective predetermined values, taking the provisional item row as the true item row.

9. The character reading method according to claim 7, wherein the item row detecting step further includes a comparison step of comparing the provisional item row with the row above it; when the difference in the area of the cells included in the rows, or the difference in the ratio of cell area to row area, and the row spacing are each within a predetermined value, taking the upper row as a new provisional item row and comparing it with the row above that; and when they exceed the respective predetermined values, taking the provisional item row as the true item row.

10. The character reading method according to claim 6, wherein the aligning step of generating the table structure data further includes determining which row each cell belongs to on the basis of the coordinate position of each cell in the row direction and its height, and thereby aligning the cells of each row.