JPH01147786A

JPH01147786A - Reading device for document containing table

Info

Publication number: JPH01147786A
Application number: JP62305680A
Authority: JP
Inventors: Yasuo Hongo; 本郷　保夫
Original assignee: Fuji Electric Co Ltd
Current assignee: Fuji Electric Co Ltd
Priority date: 1987-12-04
Filing date: 1987-12-04
Publication date: 1989-06-09
Anticipated expiration: 2011-12-25
Also published as: JP2567001B2

Abstract

PURPOSE: To read a document containing a table of arbitrary structure by vectorizing the document image containing the table, extracting all ruled lines, successively dividing the table area, and converting each column area into table data that includes the table structure. CONSTITUTION: Image data PD from an image scanner 2 is written directly into an image memory 11. A vectorization calculation unit 12 performs contour tracing on the document image PD', vectorizes it by, for example, attaching a prescribed code for each tracing direction, and supplies vector data VD to a table structure analysis unit 13. The analysis unit 13 extracts the ruled lines from the data VD, successively divides the table area, and computes table structure description data Ds. A table data reading unit 14 segments and reads the characters of each individual column area using the data Ds and the document image PD'' supplied from the memory 11, and sends the result to a main processor 15 as table data Dd that includes the table structure. The processor 15 displays the data Dd on a CRT display 3 through an input/output interface.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a reading device capable of reading documents that include tables.

[Conventional technology]

A known conventional way of reading a document that includes a table is to designate only the data portion of the table and read it with an OCR (optical character reader).

[Problem that the invention seeks to solve]

However, such a reading device not only requires troublesome operations such as designating the data portion by hand, but also works only for tables of a certain fixed format. It cannot read the data of an arbitrary table (table data).

The object of the present invention is therefore to make it possible to read documents containing tables of arbitrary structure.

[Means for solving problems]

The device is provided with: input means for inputting a document that includes a table; vectorization means for vectorizing the document image; table structure analysis means that extracts all ruled lines from the vector data, takes the column area formed by the longest horizontal and vertical ruled lines, successively divides it with shorter ruled lines, and labels each resulting area to extract the individual column areas; and table data reading means that segments characters from the image of each column area and reads the individual characters.

[Operation]

A document containing a table is input through an image scanner or the like, and its binary image is vectorized by a method such as contour tracing. Ruled lines are extracted from the resulting vector data. The column area formed by the longest horizontal and vertical ruled lines is then successively divided by shorter ruled lines, and each resulting area is labeled to extract the individual column areas, so that tables of various structures can be read. In addition, characters are segmented and read within each column area, so that the contents of each column of the table can be analyzed.

[Example]

FIG. 1 is a block diagram showing an embodiment of the invention. The reading processor 1 used here comprises an image memory 11, a vectorization calculation unit 12, a table structure analysis unit 13, a table data reading unit 14, a main processor 15, an input/output interface 16, and so on. Together with the image scanner 2 and the CRT display 3, it constitutes a document information processing system, for example as shown in FIG. 2.

Image data PD from the image scanner 2 is written directly into the image memory 11. The vectorization calculation unit 12 applies a known vectorization method to the document image PD', for example by tracing contours and attaching a predetermined code for each tracing direction, and supplies the vector data VD to the table structure analysis unit 13. The table structure analysis unit 13 extracts ruled lines from the vector data VD, successively divides the table area, and computes the table structure description data Ds. The table data reading unit 14 uses the table structure description data Ds and the document image PD'' supplied from the image memory 11 to segment and read the characters in each individual column area, and sends the result to the main processor 15 as table data Dd that includes the table structure. The main processor 15 displays the table data Dd on the CRT display 3 of FIG. 2 through the input/output interface 16, and allows it to be edited by entering the necessary data from a keyboard. The table data can also be saved to a floppy disk or transmitted to another OA (Office Automation) processor.
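The contour-tracing vectorization mentioned above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the 8-direction code table `DIRECTIONS` and the function name `contour_to_vectors` are assumptions made for illustration, since the text only says that a predetermined code is attached for each tracing direction. The sketch collapses consecutive contour steps that share a direction code into one straight vector.

```python
# Hypothetical 8-direction codes (Freeman-style). The patent only states that
# "a predetermined code" is attached for each tracing direction.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def contour_to_vectors(points):
    """Collapse a traced contour into straight vectors.

    points: list of (x, y) pixel positions, each one step away from the
    previous one. Returns a list of (start, end, direction_code) tuples.
    """
    vectors = []
    run_start = points[0]
    run_code = None
    for prev, cur in zip(points, points[1:]):
        code = DIRECTIONS[(cur[0] - prev[0], cur[1] - prev[1])]
        if run_code is None:
            run_code = code
        elif code != run_code:
            # Direction changed: close the current run as one vector.
            vectors.append((run_start, prev, run_code))
            run_start, run_code = prev, code
    if run_code is not None:
        vectors.append((run_start, points[-1], run_code))
    return vectors
```

Tracing the outline of a small rectangle this way yields one vector per side, which is the kind of input the later ruled line extraction works on.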

FIG. 3 shows an example of a document that includes a table. Here, document 4 contains a table 4A with ruled lines. The vectors of character patterns and the like are generally shorter than the vectors of the binarized image of the table, and this difference can be used to separate the character patterns.

FIG. 4 shows an example image of a table. The thick lines in the figure are the table structure vector data. Here, for a given vector V_i, the vectors on the opposite side of the thick line are defined as its conjugate vectors, and the vectors whose start or end point lies near the start or end point of V_i are defined as its adjacent vectors. Vectors are marked with a dot in the figures, but the dot is omitted except where necessary.

FIG. 5 shows an example of the descriptive information for a vector V_i.

As is clear from the figure, the vector V_i is described by its start point, its end point, its length l_i, its slope θ_i, its adjacent vectors, and so on.

FIG. 6A shows the contents of the vector information memory.

The vector information memory (not shown) stores, in order of vector number i (i = 1, ..., N_v): the start point x_Si (a position vector), the end point x_Ei (a position vector), the length l_i, the slope θ_i, the vector value V_i, the preceding adjacent vector number, the following adjacent vector number, the number of ruled lines F_i, the ruled line numbers NA_ij (j = 1, ..., F_i), and so on. Immediately after vectorization, however, the number of ruled lines F_i and the ruled line numbers NA_ij are zero (the initial value). The count and the numbers of the ruled lines of which the vector is a constituent are determined only after ruled line extraction. The vector V_i is defined as V_i = x_Ei - x_Si.
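One entry of the vector information memory can be pictured as a small record. The following Python sketch is illustrative only: the class name `VectorRecord` and its field names are assumptions, and deriving the length, slope, and vector value from the two endpoints (rather than storing them) is a simplification of what the text describes.

```python
import math
from dataclasses import dataclass, field


@dataclass
class VectorRecord:
    """One entry of the vector information memory (field names are hypothetical)."""
    number: int                 # vector number i
    start: tuple                # start point x_Si as (x, y)
    end: tuple                  # end point x_Ei as (x, y)
    prev_adjacent: int = 0      # preceding adjacent vector number (0 = unset)
    next_adjacent: int = 0      # following adjacent vector number (0 = unset)
    # Ruled line numbers NA_ij; empty right after vectorization.
    ruled_line_numbers: list = field(default_factory=list)

    @property
    def value(self):
        """Vector value V_i = x_Ei - x_Si."""
        return (self.end[0] - self.start[0], self.end[1] - self.start[1])

    @property
    def length(self):
        """Length l_i of the vector."""
        return math.hypot(*self.value)

    @property
    def slope(self):
        """Slope theta_i in radians, measured from the x axis."""
        return math.atan2(self.value[1], self.value[0])

    @property
    def ruled_line_count(self):
        """Number of ruled lines F_i this vector belongs to."""
        return len(self.ruled_line_numbers)
```

For example, `VectorRecord(1, (0, 0), (3, 4))` has vector value `(3, 4)` and a ruled line count of zero until ruled line extraction fills in `ruled_line_numbers`.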

FIG. 6B shows the contents of the ruled line information memory after ruled line extraction.

The ruled line information is created from data such as that of FIG. 6A by extracting every combination of vectors that can be joined, including vectors that overlap one another or that are connected along an extension of each other. In the example of FIG. 4, the vector V_i and its two conjugate vectors form one ruled line. The same vector may be included in more than one ruled line. In this way, for each ruled line, in order of ruled line number j, the number of constituent vectors, the constituent vector numbers, the start point y_Sj, the end point y_Ej, the length L_j, the slope θ_j, and the ruled line vector K_j are determined and stored as shown in FIG. 6B. The ruled line vector K_j is defined as K_j = y_Ej - y_Sj.
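As a rough illustration of one ruled line entry, the Python sketch below builds a record from a set of constituent vectors. The function name `build_ruled_line`, the dictionary keys, and the rule for choosing the endpoints (the two constituent endpoints that lie farthest apart) are assumptions. The text does not say how y_Sj and y_Ej are chosen.

```python
import math


def build_ruled_line(number, vectors):
    """Build one ruled line record from its constituent vectors.

    vectors: dict mapping vector number -> ((x_s, y_s), (x_e, y_e)).
    Assumption: the ruled line runs between the two constituent endpoints
    that are farthest apart.
    """
    points = [p for seg in vectors.values() for p in seg]
    p, q = max(((a, b) for a in points for b in points),
               key=lambda pair: math.dist(pair[0], pair[1]))
    start, end = sorted((p, q))  # fix the direction deterministically
    k = (end[0] - start[0], end[1] - start[1])  # K_j = y_Ej - y_Sj
    return {
        "number": number,                   # ruled line number j
        "vector_count": len(vectors),       # number of constituent vectors
        "vector_numbers": sorted(vectors),  # constituent vector numbers
        "start": start,                     # y_Sj
        "end": end,                         # y_Ej
        "length": math.hypot(*k),           # L_j
        "slope": math.atan2(k[1], k[0]),    # theta_j
        "K": k,                             # ruled line vector K_j
    }
```

A thick line traced as an upper edge plus two lower-edge pieces would be passed in as three vectors and come back as a single record spanning the full extent.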

FIG. 7 shows the relationship between ruled lines and column areas. Each symbol T in the figure is a column area, with the superscript giving the row number and the subscript giving the column number. K_i in the figure is a ruled line with start point y_Si, end point y_Ei, and length K_i. This is an example in which the table structure can be described in a single operation using the four longest horizontal ruled lines and three vertical ruled lines. The contents of the table can then be obtained by segmenting and reading the characters in each of these column areas.

FIG. 8 is a flowchart showing the processing performed by the table structure analysis unit. Its operation is explained below using the table of FIG. 9 as an example.

First, the vector data of FIG. 6A is sorted in descending order of length l_i (see FIG. 8). Next, starting from the longest vector, every ruled line that can be assembled from its conjugate vectors or from adjacent vectors of the same direction is extracted and stored in part of the vector information memory and in the ruled line information memory (see FIG. 8). The ruled line information is then also sorted in descending order of length L_j (see FIG. 8).
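The sort-then-assemble order described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patent's procedure: it handles only near-horizontal and near-vertical vectors, treats vectors as joinable when they lie within a hypothetical tolerance `tol` of the same line and overlap or nearly touch (within `gap`), and assigns each vector to a single ruled line, whereas the text allows one vector to belong to several. The name `extract_ruled_lines` is also an assumption.

```python
def extract_ruled_lines(vectors, tol=3, gap=2):
    """Greedily merge axis-aligned vectors into ruled lines, longest first.

    vectors: list of ((x1, y1), (x2, y2)).
    Returns (orientation, position, low, high) tuples sorted by length,
    longest first. tol and gap are hypothetical pixel tolerances.
    """
    def describe(v):
        (x1, y1), (x2, y2) = v
        if abs(y2 - y1) <= abs(x2 - x1):  # treat as horizontal
            return "h", (y1 + y2) / 2, min(x1, x2), max(x1, x2)
        return "v", (x1 + x2) / 2, min(y1, y2), max(y1, y2)

    # Step 1: sort the vectors in descending order of length.
    items = sorted((describe(v) for v in vectors),
                   key=lambda d: d[3] - d[2], reverse=True)
    used = [False] * len(items)
    lines = []
    # Step 2: starting from the longest vector, absorb joinable vectors.
    for i, (orient, pos, lo, hi) in enumerate(items):
        if used[i]:
            continue
        used[i] = True
        grown = True
        while grown:
            grown = False
            for j, (o2, p2, lo2, hi2) in enumerate(items):
                if used[j] or o2 != orient or abs(p2 - pos) > tol:
                    continue
                if lo2 <= hi + gap and hi2 >= lo - gap:
                    lo, hi = min(lo, lo2), max(hi, hi2)
                    used[j] = True
                    grown = True
        lines.append((orient, pos, lo, hi))
    # Step 3: sort the ruled lines in descending order of length.
    lines.sort(key=lambda d: d[3] - d[2], reverse=True)
    return lines
```

In this sketch the upper and lower contour vectors of one thick line merge because they differ in position by less than `tol`, while a short stroke elsewhere on the page stays a separate, short entry at the end of the sorted list.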

At this point, all of the vertical and horizontal ruled lines of table 4B shown in FIG. 9 have been extracted. Next, the longest vertical and horizontal ruled lines are selected, the column areas they form are extracted, and the areas are labeled (see FIG. 8). As a result, only the long ruled lines marked in FIG. 9A are taken from the table of FIG. 9, and each area enclosed by these ruled lines is given a column label. Some tables have no vertical ruled lines on their two sides. In that case the vertical ruled lines on both sides are estimated and supplemented, which makes the subsequent division into column areas easier.

Thereafter, each column area is searched for the longest remaining ruled lines inside it, and the column area is divided more finely (see FIG. 8).

This processing is repeated until the column areas can no longer be divided (see FIG. 8). In the second pass, as shown in FIG. 9B, the ruled lines from L_x3 onward are extracted, the column areas are divided, and each resulting area is given a label as shown in the figure.

During this processing, the ruled lines extracted are those whose two ends both lie close to the ruled lines enclosing a column area. In the second pass, for example, L_x5 and L_x6 are therefore extracted first from column area T2. In the third pass, the next set of ruled lines (subscripts 7 through 9) is extracted and the column areas are divided as shown in FIG. 9C. In the fourth pass, the remaining ruled lines (L_x9 and L_x10 in the figure) are extracted and the column areas are divided as shown in FIG. 9D.

Finally, the table structure of FIG. 9 can be described in a tree format as shown in FIG. 10.
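The repeated division of column areas and the resulting tree description can be sketched together in Python. The following is a minimal illustration under stated assumptions, not the patent's algorithm: areas are axis-aligned rectangles `(x0, y0, x1, y1)`, ruled lines use the hypothetical `(orientation, position, low, high)` form, a line divides an area only if both of its ends reach the area's borders within a tolerance `tol`, and the dotted labels such as `T.2.1` are invented for the example.

```python
def split_area(area, lines, label="T", tol=2):
    """Recursively divide a column area by ruled lines and return a tree.

    area:  (x0, y0, x1, y1)
    lines: list of ("h", y, x_low, x_high) or ("v", x, y_low, y_high)
    """
    x0, y0, x1, y1 = area
    spanning = []
    for orient, pos, lo, hi in lines:
        # Keep only lines strictly inside the area whose two ends both
        # reach the enclosing borders.
        if (orient == "h" and y0 + tol < pos < y1 - tol
                and lo <= x0 + tol and hi >= x1 - tol):
            spanning.append((x1 - x0, orient, pos))
        elif (orient == "v" and x0 + tol < pos < x1 - tol
                and lo <= y0 + tol and hi >= y1 - tol):
            spanning.append((y1 - y0, orient, pos))
    node = {"label": label, "area": area, "children": []}
    if not spanning:
        return node  # cannot be divided further: a leaf column area
    # Divide along the orientation of the longest spanning line first.
    _, orient, _ = max(spanning)
    cuts = sorted(pos for _, o, pos in spanning if o == orient)
    edges = [y0 if orient == "h" else x0] + cuts + [y1 if orient == "h" else x1]
    for k, (a, b) in enumerate(zip(edges, edges[1:]), start=1):
        child = (x0, a, x1, b) if orient == "h" else (a, y0, b, y1)
        node["children"].append(split_area(child, lines, f"{label}.{k}", tol))
    return node


def leaves(node):
    """Labels of the undivided column areas, in tree order."""
    if not node["children"]:
        return [node["label"]]
    return [lab for c in node["children"] for lab in leaves(c)]
```

For a table with one full-width horizontal line and a vertical line that exists only below it, the first pass splits the table into an upper and a lower area, and the second pass splits only the lower one, giving a two-level tree.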

FIG. 11 is a flowchart showing the processing performed by the table data reading unit.

The table data reading unit 14 shown in FIG. 1 first extracts the table data for each column area, as shown in FIG. 12, from the table structure description data obtained as above and the image data obtained through the image memory. From the extracted data, text lines are segmented by computing the horizontal projection, individual characters are then segmented by computing the vertical projection, and the characters are recognized by a known method such as pattern matching (see FIG. 11 for the corresponding steps). This processing is performed on the data of every column area to read the data in the table.
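The projection-based segmentation described above can be illustrated with a short Python sketch. It assumes the column area is given as a list of rows of 0/1 pixels, and the names `runs` and `segment_characters` are hypothetical. Character recognition itself (for example pattern matching) is not shown.

```python
def runs(profile):
    """Return (start, end) index pairs of consecutive non-zero entries.

    The end index is exclusive.
    """
    out, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def segment_characters(cell):
    """Segment a binary column-area image into text lines and characters.

    cell: list of rows, each a list of 0/1 pixels.
    Returns one list per text line, each holding (top, bottom, left, right)
    boxes with exclusive bottom/right indices.
    """
    row_profile = [sum(row) for row in cell]  # horizontal projection
    result = []
    for top, bottom in runs(row_profile):
        band = cell[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]  # vertical projection
        result.append([(top, bottom, left, right)
                       for left, right in runs(col_profile)])
    return result
```

Each box returned by `segment_characters` would then be cropped from the column-area image and passed to whatever recognizer is used.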

As described above, the table structure analysis unit analyzes the table and describes its structure, and the table data reading unit reads the characters in each column area to extract the table data. Tables of various structures can thus be read together with the data they contain, and a database can be created from them.

The invention can be applied not only when the ruled lines are slanted, but also when reading figures that, like tables, are described by horizontal and vertical line segments.

[Effect of the invention]

According to the invention, a document image containing a table is vectorized, all ruled lines are extracted, the table area is divided using the ruled lines in order from the longest, and the column areas are extracted so that the table is described hierarchically. This makes it possible to read the whole table efficiently. In other words, tables of free structure, not limited to a fixed format, can be input, and because the table structure can be described unambiguously, the table data is obtained in a form in which the reading results can be reused as a database.

As a result, the following effects are obtained:

(1) Free-format tables can be read.

(2) Tables of the same format can be described by the same tree structure (an unambiguous representation).

(3) Table data input and statistical processing can be handled in a unified way.

[Brief explanation of the drawings]

FIG. 1 is a block diagram showing an embodiment of the invention. FIG. 2 is a schematic diagram showing a document information processing system that includes the reading device according to the invention. FIG. 3 is an explanatory diagram of an example of a document that includes a table. FIG. 4 is an explanatory diagram of an example table image. FIG. 5 is an explanatory diagram of the descriptive information of a vector. FIG. 6A is an explanatory diagram of the contents of the vector information memory. FIG. 6B is an explanatory diagram of the contents of the ruled line information memory. FIG. 7 is an explanatory diagram of the relationship between ruled lines and column areas. FIG. 8 is a flowchart showing the processing performed by the table structure analysis unit. FIG. 9 is an explanatory diagram of an example of a somewhat complicated table. FIGS. 9A through 9D are explanatory diagrams of the steps of dividing and labeling the column areas. FIG. 10 is a tree diagram showing the table structure of FIG. 9. FIG. 11 is a flowchart showing the processing performed by the table data reading unit. FIG. 12 is a tree diagram showing the table structure and the table data.

Explanation of reference numerals: 1 ... reading processor, 2 ... image scanner, 3 ... CRT display, 4 ... document including a table, 4A, 4B ... table, 11 ... image memory, 12 ... vectorization calculation unit, 13 ... table structure analysis unit, 14 ... table data reading unit, 15 ... main processor, 16 ... input/output interface.

Agent: Patent Attorney Kiyoshi Matsuzaki

Claims

A reading device for a document including a table, comprising: input means for inputting a document including a table; vectorization means for vectorizing the document image; table structure analysis means for extracting all ruled lines from the vector data, successively dividing the column area formed by the longest horizontal and vertical ruled lines with shorter ruled lines, and labeling each resulting area to extract the individual column areas; and table data reading means for segmenting characters from the image of each column area and reading the individual characters.