JPH02116985A

JPH02116985A - Automatic input device for document

Info

Publication number: JPH02116985A
Application number: JP63270007A
Authority: JP
Inventors: Takashi Ishikawa; 孝石川; Akihiro Oka; 昭宏岡
Original assignee: Pentel Co Ltd
Current assignee: Pentel Co Ltd
Priority date: 1988-10-26
Filing date: 1988-10-26
Publication date: 1990-05-01

Abstract

PURPOSE: To improve the productivity of data input for documents with a list-form structure, such as a list of names, by dividing the character strings extracted from a document image into their respective meaning items and outputting the character string corresponding to each meaning item. CONSTITUTION: A character candidate extracting process 3 takes in document image data 2 and outputs to a character recognizing process 4, as a character candidate area, each area designated diagonally by the pair of coordinate values (X1,Y1) and (X2,Y2) of the circumscribed rectangle of a connected area of black pixels. The process 4 converts the image data in each character candidate area extracted by the process 3 into a character code by a character recognition technique. A character string extracting process 5 extracts character strings according to the positional relations of the character candidate areas. For example, in a horizontally written document, when X1(K)<=X1(K+1)<=X2(K) or X1(K)<=X2(K+1)<=X2(K) holds for the coordinates of two adjacent character candidate areas (K is the index of the character candidate area), the two areas are judged to belong to the same character string.

Description

DETAILED DESCRIPTION OF THE INVENTION

(Field of Industrial Application) The present invention relates to an automatic document input device that encodes and outputs a document input as a document image.

(Prior Art and Its Problems) Conventionally, there have been document input devices that read characters from documents consisting of continuous character lines, such as prose, or from pre-specified areas, as on slips. However, there has been no document input device capable of inputting a tabular document, such as a list of names, in which the meaning item is fixed by the position within the character line.

(Means for Solving the Problems) The present invention was made in view of the above problems.

It proposes an automatic document input device that encodes and outputs a document input as a character image, in which each character line is divided into meaning items by lengths measured from the beginning of the character line of the input document.

(Operation) In the present invention, for a document having a tabular structure such as a list of names, the character lines extracted from the document image are divided into individual meaning items and output.

(Embodiment) An embodiment of the present invention will be described with reference to the block diagram of Fig. 1. An automatic document input device 1 to which the present invention is applied comprises document image data 2 input from an input unit such as an image scanner or a copying machine, a character candidate extraction process 3, a character recognition process 4, a character string extraction process 5, a structure analysis process 6, and, as data, document code data 7 and division data 8.

The character candidate extraction process 3 takes in the document image data 2 and outputs to the character recognition process 4, as a character candidate area, each area designated diagonally by the pair of coordinate values (X1, Y1) and (X2, Y2) of the circumscribed rectangle of a connected area of black pixels.
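The extraction of circumscribed rectangles described above can be illustrated with a small sketch. The patent does not say how connected areas are found, so the breadth-first search, the use of 8-connectivity, the 0/1 image encoding, and the function name `extract_candidate_boxes` are all assumptions made for illustration.

```python
from collections import deque


def extract_candidate_boxes(image):
    """Return circumscribed rectangles (x1, y1, x2, y2) of connected black areas.

    `image` is a list of rows; 1 is a black pixel and 0 is white (assumed encoding).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search over one 8-connected area (assumed connectivity).
            x1 = x2 = x
            y1 = y2 = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                x1, x2 = min(x1, cx), max(x2, cx)
                y1, y2 = min(y1, cy), max(y2, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            # (x1, y1) and (x2, y2) are the diagonal corners of the rectangle.
            boxes.append((x1, y1, x2, y2))
    return boxes
```

Each returned tuple corresponds to one character candidate area, in the order in which its first pixel is met when scanning the image row by row.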

The character recognition process 4 converts the image data of each character candidate area extracted by the character candidate extraction process 3 into a character code using a known character recognition technique (see, for example, "Pattern Recognition and Its Applications," edited by Kazuo Nakata, Corona Publishing).

The character string extraction process 5 extracts character strings according to the positional relations of the character candidate areas. For example, in a horizontally written document, when the coordinates of two adjacent character candidate areas satisfy

$$X_1(K) \le X_1(K+1) \le X_2(K) \qquad (1)$$

or

$$X_1(K) \le X_2(K+1) \le X_2(K) \qquad (2)$$

the two areas are judged to belong to the same character string. Here, K is the index of the character candidate area.
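Conditions (1) and (2) can be written out directly as a predicate on two rectangles. The sketch below is a literal transcription of the two inequalities. Treating consecutive list entries as the "adjacent" areas, and the function names `same_string` and `group_into_strings`, are assumptions, not details given in the patent.

```python
def same_string(box_k, box_next):
    """Test conditions (1) and (2) on two boxes given as (x1, y1, x2, y2)."""
    x1_k, _, x2_k, _ = box_k
    x1_n, _, x2_n, _ = box_next
    cond1 = x1_k <= x1_n <= x2_k  # condition (1)
    cond2 = x1_k <= x2_n <= x2_k  # condition (2)
    return cond1 or cond2


def group_into_strings(boxes):
    """Chain boxes, taken in list order, into character strings."""
    strings = []
    for box in boxes:
        # Compare with the last box of the current string (index K vs. K+1).
        if strings and same_string(strings[-1][-1], box):
            strings[-1].append(box)
        else:
            strings.append([box])
    return strings
```

A new character string is started whenever neither inequality holds for a box and its predecessor.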

Fig. 2 shows the state of a document with the table structure omitted.

The structure analysis process 6 treats one character string as one line and divides the character string into meaning items according to division data specified in advance as lengths measured from the beginning of the line. In the example of Fig. 2, the character string is divided so that the first range of lengths is the name, the following range is the address, and the next range is the telephone number. The divided character strings are output as document code data 7, in the form of a character string in which the meaning items can be told apart, for example by separating them with a delimiter character.
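The division by lengths measured from the beginning of the line can be sketched as follows. The representation of the division data as `(item_name, end_offset)` pairs, the numeric offsets used in the usage note, the comma delimiter, and the function names are hypothetical; the patent only states that the division data gives lengths relative to the line start.

```python
def split_line(chars, line_start_x, division):
    """Split one recognized line into meaning items.

    chars: list of (character, x1) pairs, x1 being the left edge of its box.
    line_start_x: x coordinate of the beginning of the line.
    division: list of (item_name, end_offset) in increasing order; an item
        covers offsets from the previous end_offset up to, but excluding, its own.
    """
    fields = {name: "" for name, _ in division}
    for ch, x1 in chars:
        offset = x1 - line_start_x  # length measured from the line start
        for name, end_offset in division:
            if offset < end_offset:
                fields[name] += ch
                break
        # Characters beyond the last end_offset are ignored in this sketch.
    return fields


def to_delimited(fields, division, delimiter=","):
    """Join the items in division order so they stay distinguishable."""
    return delimiter.join(fields[name] for name, _ in division)
```

For instance, with the made-up division data `[("name", 100), ("address", 300), ("phone", 400)]`, a character whose left edge lies 120 units from the line start is assigned to the address item.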

In the character string extraction, separated characters can also be combined into a single character by using dynamic programming. In addition, to eliminate the effects of indentation at the beginning of a line and of noise, it is desirable to use, as the line-start coordinate value, a statistical estimate derived from the line-start coordinate values of the individual character strings. This also makes it possible to cope with a certain degree of skew in the document. Furthermore, although the description above assumed tabular data without ruled lines, data containing ruled lines poses no problem if the ruled lines are removed beforehand by image processing. Alternatively, the ruled lines can be put to use by detecting their positions and treating them as the boundaries between meaning items.
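The patent does not name the statistical estimate to be used for the line-start coordinate. As one possible choice, the sketch below takes the median of the leftmost x coordinate of each character string, which is insensitive to a few indented lines or noise blobs. Both the median and the function name `estimate_line_start` are assumptions.

```python
from statistics import median


def estimate_line_start(strings):
    """Estimate a common line-start x from strings of boxes (x1, y1, x2, y2)."""
    # Leftmost edge of each non-empty character string.
    starts = [min(box[0] for box in line) for line in strings if line]
    return median(starts)
```

A single median does not by itself model skew. Handling a tilted page would need an estimate that varies with the vertical position of the line, which is left out of this sketch.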

(Effects of the Invention) According to the present invention, for a document having a tabular structure such as a list of names, the character strings corresponding to the meaning items can be extracted from the document image, so the productivity of data input can be significantly improved.

[Brief explanation of the drawing]

The drawings show one embodiment of the present invention. Fig. 1 is a block diagram of the present invention, and Fig. 2 shows the state of the document data.

Claims

[Claims]

An automatic document input device that encodes and outputs a document input as a character image, characterized in that each character line is divided into meaning items by lengths measured from the beginning of the character line of the input document.