JPH02116985A - Automatic input device for document - Google Patents

Automatic input device for document

Info

Publication number
JPH02116985A
Authority
JP
Japan
Prior art keywords
character
document
area
character string
character candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP63270007A
Other languages
Japanese (ja)
Inventor
Takashi Ishikawa
孝 石川
Akihiro Oka
昭宏 岡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pentel Co Ltd
Original Assignee
Pentel Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pentel Co Ltd
Priority to JP63270007A
Publication of JPH02116985A
Legal status: Pending

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

PURPOSE: To improve the productivity of data input by extracting character strings that correspond to semantic items: for a document with a list-form structure, such as a list of names, the character strings extracted from the document image are divided into the respective semantic items and output. CONSTITUTION: A character candidate extracting process 3, which takes in document image data 2, outputs to a character recognizing process 4, as character candidate areas, the areas designated diagonally by the pair of coordinate values (X1,Y1) and (X2,Y2) of the circumscribed rectangle of each connected region of black pixels. The process 4 converts the image data in each character candidate area extracted by the process 3 into a character code by a character recognition technique. A character string extracting process 5 extracts character strings according to the positional relations of the character candidate areas. For example, for a horizontally written document, when X1(K) <= X1(K+1) <= X2(K) or X1(K) <= X2(K+1) <= X2(K) holds for the coordinates of two adjacent character candidate areas (K is the index of a character candidate area), the two areas are judged to belong to the same character string.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to an automatic document input device that encodes and outputs a document input as a document image.

(Prior Art and Its Problems) Conventionally, there have been document input devices that read characters from documents consisting of continuous character lines, such as running text, or from pre-specified areas of forms such as slips. However, there has been no document input device that can input tabular documents, such as name lists, in which the semantic item is determined by the position within the character line.

(Means for Solving the Problems) The present invention was made in view of the above problems. For an automatic document input device that encodes and outputs a document input as a character image, it proposes a device that divides each character line into semantic items by lengths measured from the beginning of the character line of the input document.

(Operation) For a document with a tabular structure, such as a name list, the present invention divides the character lines extracted from the document image into semantic items and outputs them.

(Embodiment) An embodiment of the present invention will be described with reference to the block diagram of FIG. 1. An automatic document input device 1 to which the present invention is applied consists of document image data 2 input from an input unit such as an image scanner or copying machine, a character candidate extraction process 3, a character recognition process 4, a character string extraction process 5, a structure analysis process 6, and, as data, document code data 7 and division data 8.

The character candidate extraction process 3, which takes in the document image data 2, outputs to the character recognition process 4, as character candidate areas, the areas specified diagonally by the pair of coordinate values (X1, Y1) and (X2, Y2) of the circumscribed rectangle of each connected region of black pixels.
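The extraction of circumscribed rectangles can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `character_candidate_areas`, the encoding of the image as rows of 0/1 values, and the use of 8-connectivity are all assumptions, since the text does not specify them.

```python
from collections import deque


def character_candidate_areas(image):
    """Return circumscribed rectangles (x1, y1, x2, y2) of connected black-pixel regions.

    `image` is a list of rows; 1 means a black pixel, 0 a white one (assumed encoding).
    8-connectivity is an assumption; the source does not say which connectivity is used.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    areas = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected region, tracking its extreme coordinates.
            x1 = x2 = x
            y1 = y2 = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                x1, x2 = min(x1, cx), max(x2, cx)
                y1, y2 = min(y1, cy), max(y2, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            # (x1, y1) and (x2, y2) are the diagonal corners of the rectangle.
            areas.append((x1, y1, x2, y2))
    return areas
```

Each returned tuple corresponds to one character candidate area, given by the two diagonal corners of its rectangle.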

The character recognition process 4 converts the image data of each character candidate area extracted by the character candidate extraction process 3 into a character code by a known character recognition technique (see, for example, Kazuo Nakata, ed., "Pattern Recognition and Its Applications," Corona Publishing).

The character string extraction process 5 extracts character strings according to the positional relationship of the character candidate areas. For example, in the case of a horizontally written document, two adjacent character candidate areas are judged to belong to the same character string when their coordinates satisfy

$$X_1(K) \le X_1(K+1) \le X_2(K) \quad (1)$$

or

$$X_1(K) \le X_2(K+1) \le X_2(K) \quad (2)$$

Here, K is the index of a character candidate area.
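The grouping rule can be sketched in a few lines. This is an illustrative reading of the two conditions, with hypothetical names (`same_string`, `group_into_strings`); it assumes each area is reduced to its (X1, X2) extent, and that "adjacent" means consecutive in the input order. The text does not state which image axis X denotes or how the areas are ordered.

```python
def same_string(area_k, area_next):
    """Apply conditions (1) and (2) to the (X1, X2) extents of areas K and K+1."""
    k1, k2 = area_k
    n1, n2 = area_next
    condition_1 = k1 <= n1 <= k2  # X1(K) <= X1(K+1) <= X2(K)
    condition_2 = k1 <= n2 <= k2  # X1(K) <= X2(K+1) <= X2(K)
    return condition_1 or condition_2


def group_into_strings(extents):
    """Chain consecutive areas into one string while either condition holds."""
    strings = []
    for extent in extents:
        if strings and same_string(strings[-1][-1], extent):
            strings[-1].append(extent)
        else:
            strings.append([extent])
    return strings
```

For example, `group_into_strings([(10, 30), (12, 28), (15, 32), (50, 70), (52, 68)])` yields two strings, because the extent (50, 70) satisfies neither condition against (15, 32). Note that, as written, the two conditions do not cover an area K+1 whose extent strictly contains that of area K.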

FIG. 2 shows the document layout with the table structure omitted.

In the structure analysis process 6, each character string is treated as one line, and the character string is divided into semantic items on the basis of the division data 8, which is specified in advance as lengths measured from the beginning of the line. In the example of FIG. 2, the character string is divided so that l1 is the name, l2 - l1 is the address, and l3 - l2 is the telephone number. The divided character strings are output as document code data 7, in the form of a character string in which the semantic items can be distinguished, for example by separating them with a delimiter.
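The division step can be illustrated with a short sketch. Everything named here is hypothetical: `split_into_items`, the representation of a line as (start coordinate, recognized text) pairs, and the rule that a character belongs to the item in whose range its start coordinate falls. The division data is modeled as ascending lengths measured from the beginning of the line.

```python
from bisect import bisect_right


def split_into_items(chars, line_head, boundaries):
    """Divide one recognized line into semantic items.

    chars:      (start coordinate, recognized text) pairs in reading order.
    line_head:  coordinate of the beginning of the line.
    boundaries: ascending lengths from the line head at which each item ends
                (a stand-in for the division data).
    Characters beyond the last boundary are ignored (an assumption).
    """
    items = [""] * len(boundaries)
    for start, text in chars:
        # Index of the first boundary strictly greater than the offset.
        index = bisect_right(boundaries, start - line_head)
        if index < len(items):
            items[index] += text
    return items
```

With `boundaries=[50, 150, 200]` and a line head at 100, characters starting at offsets 0 to 49 go to the first item, 50 to 149 to the second, and 150 to 199 to the third. Joining the result with a delimiter, for instance `",".join(items)`, gives one string per line in which the items remain distinguishable; the comma is only an example delimiter.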

In character string extraction, characters that have been split into separate parts can also be joined into a single character by using dynamic programming. In addition, to eliminate the effects of indentation and noise at the beginning of a line, it is desirable to use, as the line-head coordinate value, a statistical estimate computed from the line-head coordinate values of the individual character strings. This also makes it possible to cope with a certain degree of document skew. Furthermore, although the description above assumed tabular data without ruled lines, data with ruled lines poses no problem if the ruled lines are removed beforehand by image processing. Alternatively, the ruled lines can be put to use: their positions can be detected and treated as the boundaries between semantic items.
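One possible statistical estimate is the median of the per-line head coordinates, sketched below. The choice of the median and the name `estimate_line_head` are assumptions; the text only asks for a statistical estimate and does not name one.

```python
from statistics import median


def estimate_line_head(line_heads):
    """Estimate a common line-head coordinate from the heads of all lines.

    The median is used here because a few indented lines or noise blobs
    before the text shift it far less than they would shift the mean.
    """
    return median(line_heads)
```

For the heads `[100, 101, 99, 130, 100, 62]`, where 130 is an indented line and 62 a noise blob, the estimate is 100.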

(Effects of the Invention) According to the present invention, character strings corresponding to semantic items can be extracted from the document image of a document with a tabular structure, such as a name list, so the productivity of data input can be significantly improved.

[Brief Description of the Drawings]

The drawings show an embodiment of the present invention: FIG. 1 is a block diagram of the invention, and FIG. 2 shows the layout of the document data.

Claims (1)

[Claims] An automatic document input device that encodes and outputs a document input as a character image, characterized in that each character line is divided into semantic items by lengths measured from the beginning of the character line of the input document.
JP63270007A 1988-10-26 1988-10-26 Automatic input device for document Pending JPH02116985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63270007A JPH02116985A (en) 1988-10-26 1988-10-26 Automatic input device for document

Publications (1)

Publication Number Publication Date
JPH02116985A (en) 1990-05-01

Family

ID=17480254

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63270007A Pending JPH02116985A (en) 1988-10-26 1988-10-26 Automatic input device for document

Country Status (1)

Country Link
JP (1) JPH02116985A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5491024A (en) * 1977-12-28 1979-07-19 Nec Corp Blocking unit for character string
JPS5631171A (en) * 1979-08-24 1981-03-28 Toshiba Corp Character reader
