JPH02116985A - Automatic input device for document - Google Patents
Automatic input device for document
Info
- Publication number
- JPH02116985A
- Authority
- JP
- Japan
- Prior art keywords
- character
- document
- area
- character string
- character candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Character Input (AREA)
- Character Discrimination (AREA)
Abstract
Description
DETAILED DESCRIPTION OF THE INVENTION
(Field of Industrial Application)
The present invention relates to an automatic document input device that encodes and outputs a document input as a document image.
(Prior Art and Its Problems)
Conventionally, there have been document input devices that read characters from documents consisting of continuous character lines, such as prose, or from pre-specified areas of forms such as slips. However, there has been no document input device capable of inputting a tabular document, such as a name list, in which the semantic item is fixed by the position within the character line.
(Means for Solving the Problems)
The present invention was made in view of the above problems. It proposes an automatic document input device that encodes and outputs a document input as a character image, in which each character line is divided into semantic items by lengths measured from the beginning of the character line of the input document.
(Operation)
In the present invention, for a document having a tabular structure, such as a name list, the character lines extracted from the document image are divided into semantic items and output.
(Embodiment)
An embodiment of the present invention is described with reference to the block diagram of FIG. 1. The automatic document input device 1 to which the present invention is applied consists of document image data 2 input from an input unit such as an image scanner or a copying machine, a character candidate extraction step 3, a character recognition step 4, a character string extraction step 5, a structure analysis step 6, and, as data, document code data 7 and division data 8.
The character candidate extraction step 3 takes in the document image data 2. For each connected region of black pixels, it takes the region specified diagonally by the coordinate pair (X1, Y1) and (X2, Y2) of the bounding rectangle and outputs it to the character recognition step 4 as a character candidate region.
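The extraction of bounding rectangles from connected black-pixel regions can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `candidate_boxes`, the list-of-rows image representation (1 = black, 0 = white), and the choice of 4-connectivity are all assumptions made for the example.

```python
from collections import deque


def candidate_boxes(image):
    """Return bounding boxes (x1, y1, x2, y2) of connected black-pixel regions.

    image is a list of rows of 0/1 values (1 = black). Regions are found
    with a breadth-first flood fill using 4-connectivity (an assumption;
    the source does not state the connectivity rule).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Start a new region at this unvisited black pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x1 = x2 = x
            y1 = y2 = y
            while queue:
                cx, cy = queue.popleft()
                # Grow the bounding rectangle to include this pixel.
                x1, x2 = min(x1, cx), max(x2, cx)
                y1, y2 = min(y1, cy), max(y2, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x1, y1, x2, y2))
    return boxes
```

Each returned tuple corresponds to the diagonal corner pair (X1, Y1), (X2, Y2) described in the text; a real system would typically also filter out regions too small or too large to be characters.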
The character recognition step 4 converts the image data of each character candidate region extracted in the character candidate extraction step 3 into a character code using a known character recognition method (see, for example, Kazuo Nakata (ed.), "Pattern Recognition and Its Applications", Corona Publishing).
The character string extraction step 5 extracts character strings based on the positional relationship of the character candidate regions. For example, in the case of a horizontally written document, two adjacent character candidate regions are judged to belong to the same character string when their coordinates satisfy

$$X_1(K) \le X_1(K+1) \le X_2(K) \quad (1)$$

or

$$X_1(K) \le X_2(K+1) \le X_2(K) \quad (2)$$

Here, K denotes the index of a character candidate region.
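The membership test in conditions (1) and (2) amounts to checking whether either end of region K+1 falls inside the interval spanned by region K. A minimal sketch follows, assuming each region is reduced to its (X1, X2) interval along the axis the inequalities refer to and that regions arrive already ordered by index K; the names `same_string` and `group_strings` are hypothetical.

```python
def same_string(interval_k, interval_next):
    """Return True if condition (1) or (2) holds for regions K and K+1.

    Each interval is a pair (x1, x2) with x1 <= x2.
    """
    x1_k, x2_k = interval_k
    x1_n, x2_n = interval_next
    cond1 = x1_k <= x1_n <= x2_k  # condition (1): start of K+1 lies within K
    cond2 = x1_k <= x2_n <= x2_k  # condition (2): end of K+1 lies within K
    return cond1 or cond2


def group_strings(intervals):
    """Group consecutive regions into character strings.

    A region joins the current string when it satisfies the test against
    the immediately preceding region; otherwise it starts a new string.
    """
    strings = []
    for interval in intervals:
        if strings and same_string(strings[-1][-1], interval):
            strings[-1].append(interval)
        else:
            strings.append([interval])
    return strings
```

For instance, `group_strings([(10, 30), (12, 28), (25, 40), (100, 120), (105, 118)])` yields two groups, because the fourth interval overlaps neither end of the third.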
FIG. 2 shows the state of a document with the table structure omitted.
The structure analysis step 6 treats each character string as one line and divides it into semantic items according to the division data, which is specified in advance as lengths measured from the beginning of the line. In the example of FIG. 2, the line is divided at such lengths so that the first segment is the name, the second the address, and the third the telephone number. The divided character string is output as document code data 7 in a form that keeps the semantic items distinguishable, for example by inserting a delimiter between them.
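Splitting a recognized line by lengths measured from the line head can be sketched as below. This is an illustration under stated assumptions: the function name `split_line`, the representation of division data as `(item_name, end_offset)` pairs, the use of each character's leading-edge coordinate, the comma delimiter, and the rule that characters beyond the last boundary go to the last item are all choices made for the example, not details given in the source.

```python
def split_line(chars, line_head, division, delimiter=","):
    """Divide one recognized line into semantic items.

    chars     : list of (position, character) in reading order, where
                position is the leading-edge coordinate of the character.
    line_head : coordinate of the beginning of the line.
    division  : list of (item_name, end_offset) with ascending end offsets
                measured from the line head (hypothetical format).
    Returns the items joined by the delimiter.
    """
    fields = [""] * len(division)
    for position, ch in chars:
        offset = position - line_head
        for i, (_, end_offset) in enumerate(division):
            if offset < end_offset:
                fields[i] += ch
                break
        else:
            # Past the last boundary: attach to the last item (assumption).
            fields[-1] += ch
    return delimiter.join(fields)
```

With `division = [("name", 100), ("address", 300), ("phone", 450)]` and a line head at 20, characters whose offsets fall below 100 form the name, those below 300 the address, and the rest the telephone number.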
In character string extraction, dynamic programming can also be used to combine the separated parts of a character into a single character.
To eliminate the effects of line-head indentation and noise, it is desirable to use, as the line-head coordinate, a statistical estimate computed from the line-head coordinates of the individual character strings. This also makes it possible to cope with a certain degree of document skew. Furthermore, although the description above assumed tabular data without ruled lines, data with ruled lines poses no problem if the ruled lines are removed beforehand by image processing. Alternatively, the ruled lines can be exploited by detecting their positions and using them as the boundaries between semantic items.
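One simple way to obtain a statistical estimate of the line-head coordinate is to take the median of the per-line head coordinates, which is insensitive to a few indented or noisy lines. The source does not say which estimator is used, so the median and the function name `estimate_line_head` are assumptions for illustration only.

```python
from statistics import median


def estimate_line_head(line_starts):
    """Estimate a common line-head coordinate from per-line start coordinates.

    The median is used so that a minority of indented lines or noise
    blobs does not shift the estimate (an assumed choice of estimator).
    """
    if not line_starts:
        raise ValueError("at least one line start is required")
    return median(line_starts)
```

For example, with start coordinates `[20, 21, 19, 20, 48, 20]`, where the fifth line is indented, the estimate stays at the common margin instead of being pulled toward 48 as a mean would be. Handling larger skew would need a position-dependent estimate, such as a line fitted through the start coordinates, which this sketch does not attempt.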
(Effects of the Invention)
According to the present invention, character strings corresponding to semantic items can be extracted from a document image of a document having a tabular structure, such as a name list, so the productivity of data input can be significantly improved.
The drawings show an embodiment of the present invention. FIG. 1 is a block diagram of the present invention, and FIG. 2 shows the state of the document data.
Claims (1)
An automatic document input device that encodes and outputs a document input as a character image, characterized in that each character line is divided into semantic items by lengths measured from the beginning of the character line of the input document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP63270007A JPH02116985A (en) | 1988-10-26 | 1988-10-26 | Automatic input device for document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP63270007A JPH02116985A (en) | 1988-10-26 | 1988-10-26 | Automatic input device for document |
Publications (1)
Publication Number | Publication Date |
---|---|
JPH02116985A (en) | 1990-05-01 |
Family
ID=17480254
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP63270007A (Pending; published as JPH02116985A) | 1988-10-26 | 1988-10-26 | Automatic input device for document |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPH02116985A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5491024A (en) * | 1977-12-28 | 1979-07-19 | Nec Corp | Blocking unit for character string |
JPS5631171A (en) * | 1979-08-24 | 1981-03-28 | Toshiba Corp | Character reader |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP0843277A3 (en) | Page analysis system | |
WO2020125481A1 (en) | Method for generating identification pattern, and terminal device | |
KR920019198A (en) | Pattern Feature Extraction Method | |
JPH02116985A (en) | Automatic input device for document | |
JPH08180068A (en) | Electronic filing device | |
JP3171626B2 (en) | Character recognition processing area / processing condition specification method | |
Aparna et al. | A complete OCR system development of Tamil magazine documents | |
JPH0462688A (en) | Document recognizing system | |
JPS61296481A (en) | Document reader | |
JP2508975B2 (en) | Electronic blackboard | |
KR940004476A (en) | Image Control | |
JP2004280514A (en) | Pdf file and system for forming pdf file | |
JPH06274551A (en) | Image filing device | |
JPH0564396B2 (en) | ||
JPH04324577A (en) | Broken-line graph recognizing device | |
JP2917396B2 (en) | Character recognition method | |
JPH05159062A (en) | Document recognition device | |
JPH0433079A (en) | Table processing system | |
JP2794042B2 (en) | Recognition device for tabular documents | |
JP2890307B2 (en) | Table space separation device | |
JP3074210B2 (en) | Paper document image processing device | |
JPS62134765A (en) | Electronic retrieving method for dictionary of chinese character explained in japanese | |
JPS6180477A (en) | Fair-copying device of document | |
JPH03269689A (en) | Document reading device | |
KR950024109A (en) | Character Feature Extraction Method using RunLance Scan Method in Character Recognition |