JPH1011459A - Document registration system

Info

Publication number: JPH1011459A
Application number: JP8164855A
Authority: JP
Inventors: Osamu Dousaka; 修道坂; Taketomo Haga; 丈友芳賀; Jun Yoshino; 順吉野; Hideaki Tsukamoto; 英昭塚本
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1996-06-25
Filing date: 1996-06-25
Publication date: 1998-01-16

Abstract

PROBLEM TO BE SOLVED: To provide a document registration system which can automatically register an inputted document for retrieval together with a retrieval key. SOLUTION: A document recognition processing part 12 extracts an area having an attribute of a table, a figure, etc., and character information from an inputted document image. A document classification processing part 13 specifies an attribute of the format, etc., of the document image from the extracted area, and includes the specified document attribute, character information, etc., in the retrieval key and registers them in a document data base 14. At retrieval time, a retrieval key is inputted to an API from a WWW client terminal 18 through a LAN 17. According to the retrieval key, the API extracts the desired document image from the document data base 14.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

FIELD OF THE INVENTION: The present invention relates to a document registration system that automatically registers a document image, captured for example from a scanner, in a database together with a search key.

[0002]

DESCRIPTION OF THE RELATED ART: In recent years, demand has grown for filing systems that digitize paper documents and store or manage them in a computer. In a conventional filing system of this type, a document image (raster image) of a paper document is acquired through, for example, a facsimile machine or a scanner, and is registered in a storage device in the computer as needed. To support later retrieval, a search key corresponding to the document image is entered from a keyboard or the like at the time the image is registered.

[0003] FIG. 16 shows an example of a system screen displayed so that the user can enter the necessary search keys when registering a document image. The screen shows fields such as registration date, author, title, genre, and keywords; once the necessary information has been entered in each field, it is registered in the database together with the document image.

[0004]

PROBLEM TO BE SOLVED BY THE INVENTION: In the conventional document registration system, however, a search key must be registered manually from a keyboard or the like, while viewing a system screen such as that of FIG. 16, for every document image. The registration workload therefore grows as the number of registered document images increases.

[0005] In addition, when keywords or the like are registered as search keys, retrieval efficiency drops markedly if the keywords chosen vary with each user's subjective judgment.

[0006] An object of the present invention is therefore to provide an improved document registration system that, when an input document image is registered, can automatically generate an optimal search key from that document image and register it.

[0007]

MEANS FOR SOLVING THE PROBLEM: To solve the above problem, the present invention provides a document registration system that comprises document registration means for registering an input document image in a database together with a search key corresponding to that document image, and that retrieves registered document images based on the search key. The document registration means has a document recognition processing unit that identifies attribute-specific areas on the document image, such as character areas, table areas, and figure areas, together with their position information, and automatically generates the search key so that it includes the attributes and position information of the identified areas.

[0008] Specifically, the document recognition processing unit comprises: an image normalization unit that corrects the skew of the document image when it is tilted; a basic rectangular area extraction unit that extracts basic rectangular areas from the document image or the skew-corrected document image; an area integration unit that merges the extracted basic rectangular areas as necessary; and an area attribute determination unit that determines which attribute (character, figure, table, and so on) each basic rectangular area or merged area has. The image normalization unit is configured, for example, to detect the marginal pixel distributions (projection profiles) of the document image in the row and column directions and to rotate the document image so that the white portion of the distribution in each direction is maximized.

[0009] The area attribute determination unit is configured, for example, to obtain the variances of the width and height of the basic rectangular areas contained in a merged area, together with the number and lengths of the line segments in each area, and to determine from the variance values, the number of line segments, and the segment lengths whether the attribute of the area is character, table, or figure.

[0010] The area attribute determination unit is also configured, for example, to calculate the positional offset of a target area judged to be a character area from a neighboring area and, when this offset is no greater than a specified value, to judge the target area to be the caption area of that neighboring area.

[0011] In another document registration system of the present invention, the document registration means has a document recognition processing unit that identifies attribute-specific areas on the document image, such as character areas, table areas, and figure areas, together with their position information, and a document classification processing unit that classifies the document by document format based on the identified areas and their position information; the classification result is included in the search key.

[0012] In such a document registration system, the document classification processing unit is configured, for example, to classify the document image into general document types, such as a character-dominated "textual document", a table-dominated "tabular document", or a figure-dominated "graphical document", based on the number of attributes of the attribute-specific areas recognized by the document recognition processing unit.

[0013] Alternatively, it is configured to determine the column layout of the document image from a function of the occurrence frequency of the basic rectangular areas or merged areas along a given direction and the area of each region, and to classify the document image by the column layout so determined.

[0014] Alternatively, it comprises: feature vector generation means that generates a feature vector representing the document features of the document image, based on the attributes and position information of the attribute-specific areas recognized by the document recognition processing unit; a dictionary unit that stores a plurality of previously generated feature vectors grouped by document format; and group identification means that compares the distances between the feature vector generated by the feature vector generation means and each feature vector stored in the dictionary unit, and identifies the one group containing the feature vector at the smaller distance. The document image is classified according to the group so identified.

[0015]

DESCRIPTION OF THE EMBODIMENTS: An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of an embodiment in which the document registration system of the present invention is used while connected to the Internet. The system comprises a server 1 that registers document images and manages them so that they can be searched, and a plurality of WWW (World Wide Web) client terminals 18 connected to the server 1 via a LAN (local area network) 17.

[0016] The server 1 is divided into a registration subsystem and a retrieval subsystem. The registration subsystem comprises a scanner 11 that acquires a document image of a document 10 to be registered, a document recognition processing unit 12 that recognizes the acquired document image, a document classification processing unit 13 that classifies the recognized document image and automatically generates the corresponding search key, and a document database 14 that stores the classified document images and their search keys. The retrieval subsystem comprises an httpd (hypertext transfer protocol daemon) 16 connected to the LAN 17 and a database API (application program interface) 15 that performs the search. For convenience of explanation, the document image is assumed to be a monochrome image.

[0017] The operation of the registration subsystem is described first, beginning with the basic functional blocks of the document recognition processing unit 12, shown in FIG. 2. The document recognition processing unit 12 comprises an image normalization unit 121 that normalizes the document image captured from the scanner 11 by correcting skew introduced during scanning, a basic rectangle extraction unit 122 that extracts basic rectangular areas from the normalized document image, an area integration unit 123 that merges the extracted basic rectangles, an area attribute determination unit 124 that determines the attribute of each merged area, and a character recognition unit 125 that performs automatic character recognition on areas having the character attribute. A commercially available OCR product, for example, can be used for the character recognition unit 125.

[0018] The functional blocks 121 to 125 are now described in more detail. FIGS. 3 to 5 illustrate the function of the image normalization unit 121. For image normalization, the document image acquired by the scanner 11 is scanned in the x-axis direction (left to right) and in the y-axis direction (top to bottom), and the number of black pixels is counted. FIG. 3 shows a document image 31, the marginal distribution 32 of the black-pixel count obtained by scanning the document image 31 in the y-axis direction, and the marginal distribution 33 of the black-pixel count obtained by scanning it in the x-axis direction.

[0019] When the document image 31 is correctly oriented (that is, not skewed), as shown in FIG. 4(a), the white runs in the marginal distributions (the portions where the black-pixel count is 0), which correspond to the line spacing and to the right, left, top, and bottom margins, are clearly defined. For a skewed document image 41, by contrast, the white portions of the y-direction marginal distribution 42 and the x-direction marginal distribution 43 are relatively reduced, as shown in FIG. 4(b). The presence or absence of skew can thus be determined from the marginal distributions in the x and y directions: when the document image is not skewed at all, the white portion of those distributions is at its maximum.

[0020] When the document image 41 is skewed as in FIG. 4(b), the image normalization unit 121 rotates the document image 41 so as to maximize the white portion of the marginal distributions 42 and 43 in the x and y directions. FIG. 5 illustrates how the rotation angle that maximizes the white portion is found. As FIG. 5 shows, in this embodiment the rotation angle t at which the white length is greatest is detected from the relationship between the rotation angle and the total length of the white portions of the x- and y-direction marginal distributions (the white run length). The document image is then rotated by this angle t.
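As an illustration only, the search for the rotation angle t can be sketched in Python. The sketch assumes a binary NumPy array in which 1 marks a black pixel, tries a caller-supplied list of candidate angles, and rotates the black-pixel coordinates instead of resampling the image. The function names `white_length` and `estimate_skew`, the brute-force candidate search, and the use of the count of empty rows and columns as the white length are all assumptions made for this sketch, not details taken from the specification.

```python
import numpy as np

def white_length(points, angle_deg, shape):
    """Count empty rows plus empty columns after rotating the black pixels."""
    theta = np.deg2rad(angle_deg)
    h, w = shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    y = points[:, 0] - cy
    x = points[:, 1] - cx
    # Rotate the black-pixel coordinates about the image centre.
    xr = np.cos(theta) * x - np.sin(theta) * y + cx
    yr = np.sin(theta) * x + np.cos(theta) * y + cy
    rows = np.round(yr).astype(int)
    cols = np.round(xr).astype(int)
    keep = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    row_profile = np.bincount(rows[keep], minlength=h)  # black pixels per row
    col_profile = np.bincount(cols[keep], minlength=w)  # black pixels per column
    return int(np.sum(row_profile == 0) + np.sum(col_profile == 0))

def estimate_skew(image, angles):
    """Return the candidate angle (degrees) that maximises the white length."""
    points = np.argwhere(image > 0)  # (row, col) of every black pixel
    scores = [white_length(points, a, image.shape) for a in angles]
    return angles[int(np.argmax(scores))]

# Four horizontal "text lines" on an otherwise white page, with no skew.
page = np.zeros((100, 100), dtype=np.uint8)
for top in (20, 40, 60, 80):
    page[top:top + 5, 10:90] = 1
print(estimate_skew(page, list(range(-5, 6))))
```

For an unskewed page such as the one built here, the white length peaks at an angle of 0; for a skewed page, the returned angle is the correction to apply under the rotation convention used in `white_length`.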

[0021] FIG. 6 illustrates the basic rectangle extraction process in the basic rectangle extraction unit 122; FIG. 6(a) shows a 3×3 pixel square region of the document image. In the illustrated example, a black pixel P(i, j) at the center is surrounded by eight pixels, listed clockwise from the upper left of the square region: P(i−1, j−1), P(i, j−1), P(i+1, j−1), P(i+1, j), P(i+1, j+1), P(i, j+1), P(i−1, j+1), and P(i−1, j). The basic rectangle extraction unit 122 determines whether each of the eight pixels surrounding the black pixel P(i, j) is black or white. For every pixel found to be black, the eight pixels surrounding that newly found black pixel are examined in turn. This operation is repeated until no new black pixel is found. FIG. 6(b) shows an example of the rectangular area obtained when one such operation has finished.
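The repeated 8-neighbor examination described above amounts to tracing 8-connected groups of black pixels and taking the bounding box of each group. A minimal Python sketch follows, assuming the image is a list of rows in which 1 denotes a black pixel; the function name `extract_basic_rectangles`, the breadth-first traversal order, and the `(x_min, y_min, x_max, y_max)` tuple layout are illustrative choices, not part of the specification.

```python
from collections import deque

def extract_basic_rectangles(image):
    """Return bounding boxes (x_min, y_min, x_max, y_max) of 8-connected black regions."""
    height = len(image)
    width = len(image[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    rectangles = []
    for j in range(height):
        for i in range(width):
            if image[j][i] != 1 or visited[j][i]:
                continue
            # Start a new region at the unvisited black pixel P(i, j).
            queue = deque([(i, j)])
            visited[j][i] = True
            x_min = x_max = i
            y_min = y_max = j
            while queue:
                ci, cj = queue.popleft()
                x_min, x_max = min(x_min, ci), max(x_max, ci)
                y_min, y_max = min(y_min, cj), max(y_max, cj)
                # Examine the eight pixels surrounding the current black pixel.
                for dj in (-1, 0, 1):
                    for di in (-1, 0, 1):
                        ni, nj = ci + di, cj + dj
                        if (0 <= ni < width and 0 <= nj < height
                                and image[nj][ni] == 1 and not visited[nj][ni]):
                            visited[nj][ni] = True
                            queue.append((ni, nj))
            rectangles.append((x_min, y_min, x_max, y_max))
    return rectangles

sample = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(extract_basic_rectangles(sample))  # [(0, 0, 1, 1), (3, 1, 4, 2)]
```

The two diagonal pixels on the right form a single rectangle because diagonal neighbors are among the eight pixels examined.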

[0022] The extracted basic rectangles are merged by the area integration unit 123. Either the logic shown in FIG. 7(a) or the logic shown in FIG. 7(b) can be applied. In the logic of FIG. 7(a), attention is paid to a rectangular area Ci, and a proximity space is set extending outward from its pair of long sides and pair of short sides by fixed distances xdist and ydist, respectively. The distances xdist and ydist are chosen according to the resolution of the document image. Any other rectangular area Cj lying within this proximity space is detected, and a merged area Ck1 enclosing both is defined. In the logic of FIG. 7(b), attention is paid to a rectangular area Ci, another rectangular area Cj that at least partially overlaps or contains it is found, and a merged area Ck2 enclosing both is defined. The area integration unit 123 merges areas by applying one of these logics to all of the basic rectangular areas extracted by the preceding basic rectangle extraction unit 122.
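Both merging logics can be sketched with one hypothetical Python helper: growing Ci by xdist and ydist gives the proximity test of FIG. 7(a), and setting both distances to 0 reduces it to the overlap or containment test of FIG. 7(b). The sketch assumes rectangles are `(x_min, y_min, x_max, y_max)` tuples, treats a rectangle that merely touches the proximity space as lying within it, and repeats until nothing more can be merged; these choices, and the names `near`, `union`, and `merge_rectangles`, are assumptions.

```python
def near(a, b, xdist, ydist):
    """True if rectangle b intersects rectangle a grown by xdist/ydist on each side."""
    return (a[0] - xdist <= b[2] and b[0] <= a[2] + xdist and
            a[1] - ydist <= b[3] and b[1] <= a[3] + ydist)

def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_rectangles(rects, xdist=0, ydist=0):
    """Merge rectangles pairwise until no two remaining rectangles are near each other."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if near(rects[i], rects[j], xdist, ydist):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects

boxes = [(0, 0, 4, 4), (6, 0, 10, 4), (30, 30, 35, 35)]
print(merge_rectangles(boxes, xdist=2, ydist=2))  # [(0, 0, 10, 4), (30, 30, 35, 35)]
```

With `xdist=0` and `ydist=0` the same three boxes stay separate, since none of them overlap.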

[0023] The attribute of each merged area is determined by the area attribute determination unit 124. The processing procedure is shown in FIG. 8. First, the line segments within each merged area are extracted (S11). Next, the variances σ_w² and σ_h² of the width w and height h of the basic rectangles in the area are calculated by equations (1-1) and (1-2) below, and it is determined whether σ_w² and σ_h² are smaller than the thresholds w_var and h_var, respectively (S12). In these equations, N is the total number of rectangular areas, w_a is the mean width of the rectangular areas, and h_a is the mean height of the rectangular areas.

[0024]

(Equation 1)
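The image for Equation 1 is not reproduced in this text. Given the definitions of N, w_a, and h_a in the preceding paragraph, equations (1-1) and (1-2) are presumably the ordinary variances of the basic-rectangle widths and heights. The reconstruction below is an assumption: the per-rectangle symbols w_i and h_i do not appear in the surviving text, and the normalization by N (rather than N − 1) is a guess.

```latex
\begin{align}
\sigma_w^2 &= \frac{1}{N}\sum_{i=1}^{N}\left(w_i - w_a\right)^2 \tag{1-1}\\
\sigma_h^2 &= \frac{1}{N}\sum_{i=1}^{N}\left(h_i - h_a\right)^2 \tag{1-2}
\end{align}
```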

[0025] If both variances σ_w² and σ_h² are smaller than their thresholds (S12: True), it is determined whether any extracted line segment has a length of at least 30% of the width or height of the area (S13). If none exists (S13: False), the area is judged to be a character area (S14) and figure/table caption identification is performed (S15). If, on the other hand, at least one of the variances is equal to or greater than its threshold in step S12 (S12: False), or both variances are below their thresholds (S12: True) but an extracted line segment with a length of at least 30% of the width or height of the area exists (S13: True), the area is judged to be a table area or a figure area, and processing proceeds to S16.

[0026] In S16, table areas are distinguished from figure areas. It is determined whether, among the extracted line segments, there are two or more with a length of at least 80% of the width of the area and also two or more with a length of at least 80% of the height of the area. If so (S16: True), the area is judged to be a table area (S17); if not (S16: False), it is judged to be a figure area.
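The decision flow of FIG. 8 (S12, S13, S14, S16, S17) can be summarized as a small Python sketch. It assumes that line segments have already been extracted (S11) and are supplied as `(orientation, length)` pairs, that horizontal segments are compared with the area width and vertical segments with the area height, and that the variance is normalized by N. The function name, argument names, and sample values are all hypothetical; the 30% and 80% ratios are the ones stated in the text.

```python
from statistics import pvariance

def classify_region(rect_sizes, segments, region_w, region_h, w_var, h_var):
    """Classify a merged area as 'character', 'table', or 'figure'.

    rect_sizes: (width, height) of each basic rectangle in the area.
    segments:   ('h' or 'v', length) for each extracted line segment.
    """
    widths = [w for w, _ in rect_sizes]
    heights = [h for _, h in rect_sizes]
    # S12: are both size variances below their thresholds?
    uniform = pvariance(widths) < w_var and pvariance(heights) < h_var
    horizontal = [length for o, length in segments if o == "h"]
    vertical = [length for o, length in segments if o == "v"]
    # S13: is there a segment at least 30% of the area width or height?
    has_long = (any(l >= 0.3 * region_w for l in horizontal) or
                any(l >= 0.3 * region_h for l in vertical))
    if uniform and not has_long:
        return "character"  # S14
    # S16: at least two long horizontal and two long vertical segments?
    long_h = sum(l >= 0.8 * region_w for l in horizontal)
    long_v = sum(l >= 0.8 * region_h for l in vertical)
    return "table" if long_h >= 2 and long_v >= 2 else "figure"  # S17 or figure

text_like = classify_region([(10, 12), (11, 12), (10, 13)], [], 100, 20, 4, 4)
table_like = classify_region(
    [(10, 12), (80, 2), (2, 60)],
    [("h", 95), ("h", 96), ("v", 58), ("v", 59)],
    100, 60, 4, 4,
)
print(text_like, table_like)  # character table
```

An area classified as a character area would then go on to caption identification (S15), which is not covered by this sketch.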

[0027] The caption identification process (S15) is now described with reference to FIG. 9. As shown in FIG. 9, attention is paid to an area Ci (upper-left corner at (xi, yi)) that the area attribute determination unit 124 has judged to be a character area, and, for a neighboring rectangular area Cj (upper-left corner at (xj, yj)), the distance ydist in the y direction and the offset cdist between the centers are calculated. Caption identification is then performed using the expressions below, in which wi, hi and wj, hj denote the width and height of the rectangular areas Ci and Cj, respectively.

[0028]

(Equation 2)

[0029] When the above condition is satisfied, the rectangular area Ci is judged to be the character area of a figure caption if the rectangular area Cj is a figure area, and the character area of a table caption if Cj is a table area.

[0030] As a result of the above processing, a document image such as that shown in FIG. 10 is obtained. FIG. 10 shows the image divided into character areas consisting mainly of text, figure areas consisting mainly of figures, and table areas consisting mainly of tables. In addition, character areas corresponding to figure or table captions lie adjacent to the figure and table areas.

[0031] Next, the operation of the document classification processing unit 13 is described. The document classification processing unit 13 classifies documents by document format (layout) based on the results from the document recognition processing unit 12. More specifically, it classifies a document image as a "textual document" consisting mainly of text, a "tabular document" consisting mainly of tables, a "graphical document" consisting mainly of figures, and so on; this is called general document type classification. It also classifies documents as two-column, three-column, and so on, as seen in magazines and papers; this is called column classification. Finally, it classifies documents into formats that the user has defined for a particular purpose; this is called user-defined classification.

[0032] General document type classification is described first. Letting Sti be the area of a table area Cti, Sfi the area of a figure area Cfi, and S the area of all areas, the document classification processing unit 13 uses the decision expressions below to determine whether the document is a tabular document, a graphical document, or a textual document.

[0033]

(Equation 3)

[0034] Column classification is described next. First, the frequency distribution of the rectangular areas in the x direction is examined, that is, how often the origin of a rectangular area (for example, its upper-left corner) occurs at each x position, and the relationship ni*Si = g(Ri) is obtained between the class value Ri in the x direction (the representative value of a fixed range) and the product ni*Si of the frequency ni and the individual area Si. This is called the g function. Next, the g function is multiplied by a three-column filter f3 and a two-column filter f2 to compute the composite functions h3(Ri) = f3(g(Ri)) and h2(Ri) = f2(g(Ri)), respectively. These are called h functions.
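One possible reading of the g function is an area-weighted histogram of region origins along x: each class (bin) accumulates the areas of the regions whose origin falls in it. The Python sketch below follows that reading, which is an assumption, as are the function name `g_function`, the `(x_origin, area)` input format, and the fixed bin width. The filters f2 and f3 are not sketched, because their characteristics are given only in FIG. 11.

```python
def g_function(regions, page_width, bin_width):
    """Area-weighted histogram of region origins along the x axis.

    regions: (x_origin, area) for each rectangular area.
    Returns one value per class of width bin_width across the page.
    """
    n_bins = -(-page_width // bin_width)  # ceiling division
    g = [0.0] * n_bins
    for x_origin, area in regions:
        k = min(int(x_origin // bin_width), n_bins - 1)
        g[k] += area  # weighting by area discounts tiny specks
    return g

# Two large regions starting near x = 10 and x = 410, plus one tiny speck.
regions = [(10, 500), (12, 300), (410, 800), (405, 2)]
print(g_function(regions, page_width=800, bin_width=100))
```

In this example the weight concentrates in the first and fifth classes, which is the kind of two-peak profile a two-column page would produce, while the speck with area 2 contributes almost nothing.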

[0035] FIG. 11(a) shows the characteristic of the two-column filter, FIG. 11(b) the characteristic of the three-column filter, FIG. 11(c) the g function in this embodiment, and FIG. 11(d) the h function. In FIGS. 11(c) and 11(d), the vertical axis is the product ni*Si of the frequency ni and the individual area Si, and the horizontal axis is the x coordinate. Each frequency ni is multiplied by the area Si in order to weight it, so that stray printed specks and similar noise in the document image are discounted. The h function shown in FIG. 11(d) is thus the product of the x-direction frequency distribution and the area after filtering. The column layout is therefore determined from the ratio of the h function to the total area S of all (merged) areas, according to whether the conditions below are satisfied. The conditions are tested in the order (4-1), then (4-2).

[0036]

(Equation 4)

[0037] User-defined classification is described next. FIG. 12 illustrates the procedure for registering a format defined by the user. Rectangular areas having a single attribute are identified in the document image input from the scanner 11, in the same way as in the processing of the document recognition processing unit 12 described above (step S21). A feature vector such as that shown in FIG. 13 is then extracted from the identified rectangular areas (step S22), and a format determination dictionary is created from this feature vector (step S23). The format determination dictionary may be included in the document database 14, for example, or may exist as an independent unit. The feature vector V shown in FIG. 13 is written as (V1, V2, …, V53), where 53 is the number of components, and each component represents a feature of the whole set of rectangular areas identified during document recognition.

[0038] Next, the procedure for determining the category of a user-defined format using the format determination dictionary is described with reference to FIG. 14. This determination is performed by the document classification processing unit 13. First, document recognition is performed on the document image of the document input to the scanner 11, and rectangular areas having a single attribute are identified (step S31). Next, the feature vector V = (V1, V2, …, V53), where 53 is the number of components, shown in FIG. 13 is extracted from the identified rectangular areas (step S32). Then a transformation matrix M is obtained that forms a feature space in which the feature vectors of documents in different format categories are more widely scattered and those of documents in the same category are more closely clustered, and a new feature vector V' = MV is defined by applying it (step S33).

[0039] FIG. 15 shows an example of the feature space; each point in the figure represents the feature vector of one document. The clusters of mutually close points correspond to categories A, B, and C, to which the formats defined in advance by the user belong. The feature vector V' generated in step S33 is mapped into this feature space, and its Euclidean distance to the feature vectors contained in categories A, B, and C is calculated (step S34). The distance between the feature vector V' and a category can be defined, for example, as the mean or the minimum of the distances between V' and the feature vectors belonging to that category.

【００４０】最後に、各カテゴリとの距離において、距
離が最も近く、かつ、その値がしきい値ｄよりも小さな
カテゴリが見つかれば、入力された文書のフォーマット
はそのカテゴリに属するものと判定される（ステップＳ
３５）。入力された文書のフォーマットと既に存在する
カテゴリとの距離がいずれもしきい値ｄを越える場合
は、入力された文書のカテゴリを新しいカテゴリとして
登録する。Finally, if there is a category whose distance is the smallest among all categories and is also smaller than a threshold d, the format of the input document is determined to belong to that category (step S35). If the distances between the format of the input document and all of the existing categories exceed the threshold d, the category of the input document is registered as a new category.
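The decision of step S35, including the registration of a new category, can be sketched as below. The sketch assumes the minimum-distance definition, an in-memory dictionary standing in for the format determination dictionary, and automatically generated names of the form `category_N`; all of these are assumptions for illustration. The text does not say what happens when the distance equals d exactly, so the sketch treats that case as "not smaller" and registers a new category.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify_or_register(v_new, dictionary, d):
    """Return (category_name, is_new) for the transformed vector V!.

    dictionary maps a category name to the list of its feature vectors.
    A category is accepted only if it is the nearest one AND its distance
    is smaller than the threshold d; otherwise a new category is added.
    """
    best_name, best_dist = None, math.inf
    for name, members in dictionary.items():
        dist = min(euclidean(v_new, m) for m in members)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_name is not None and best_dist < d:
        return best_name, False
    # No sufficiently close category: register the input as a new one.
    n = len(dictionary) + 1
    while f"category_{n}" in dictionary:
        n += 1
    new_name = f"category_{n}"
    dictionary[new_name] = [list(v_new)]
    return new_name, True
```

Whether a matched vector should also be appended to its category is not stated in the description, so the sketch leaves existing categories unchanged.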

【００４１】このように、本実施形態による文書登録シ
ステムでは、スキャナ１１から入力されたユーザ定義に
よる文書のフォーマットについて自動的にフォーマット
のカテゴリの判定が行われる。そして、入力された文書
の文書画像が判定されたカテゴリについての識別情報を
含む検索キーと共に文書用データベース１４に登録され
る。この検索キーに基づいて文書データベース１４の検
索が実行される。As described above, in the document registration system according to the present embodiment, the format category of a document in a user-defined format input from the scanner 11 is determined automatically. The document image of the input document is then registered in the document database 14 together with a search key that includes identification information for the determined category. The document database 14 is searched based on this search key.

【００４２】次に、文書登録システムの検索系の動作を
説明する。文書検索は、例えばＬＡＮ等に繋がったＷＷ
Ｗクライアント端末１８より検索キーを入力することに
より行われる。検索用アプリケーションは、市販のＷＷ
Ｗブラウザを使用することができる。また、ｈｔｔｐｄ
より検索用インターフェースが提供される。そして、例
えば「ｃｇｉ−ｂｉｎ」を用い、データベースＡＰＩを
通して文書データベース１４から所望の文書画像を索出
する。このようにして、ＷＷＷクライアント端末１８か
ら対話的に検索を実行してユーザの所望する文書を得る
ことができる。この場合の検索結果は、一般的に複数の
文書である。さらに、検索結果の文書数を考慮して検索
結果の文書数が少なくなったときに、ブラウジング（パ
ラパラめくり）により一気に文書画像を出力して参照す
ることができる。なお、検索された文書は、ＷＷＷクラ
イアント端末１８のディスプレイに表示するか、あるい
はプリントアウトなどによって出力する。Next, the operation of the search side of the document registration system will be described. A document search is performed by entering a search key from the WWW client terminal 18, which is connected to, for example, a LAN. A commercially available WWW browser can be used as the search application. The search interface is provided by httpd. Using, for example, "cgi-bin", the desired document image is retrieved from the document database 14 through the database API. In this way, a search can be executed interactively from the WWW client terminal 18 to obtain the document the user wants. The search result in this case is generally a plurality of documents. Furthermore, once the number of documents in the search result has become small, the document images can be output all at once and viewed by browsing (flipping rapidly through the pages). The retrieved documents are displayed on the display of the WWW client terminal 18 or output by, for example, printing.
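How a search key might travel from the WWW browser to a cgi-bin program can be sketched with Python's standard `urllib.parse` module. The URL, the field names (`format`, `keyword`, `caption`), and both function names are hypothetical; the description only states that a search key is entered at the client and passed through httpd and the database API.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical field names for the search key; not taken from the original.
SEARCH_FIELDS = ("format", "keyword", "caption")


def build_search_url(base_url, **search_key):
    """Client side: encode the search key as a query string."""
    return base_url + "?" + urlencode(search_key)


def parse_search_query(query_string):
    """Server side: decode the query string into a search-key dict.

    Unknown fields are ignored; only the first value of each field is kept.
    """
    parsed = parse_qs(query_string)
    return {f: parsed[f][0] for f in SEARCH_FIELDS if f in parsed}
```

A CGI program would typically read the query string from its environment and hand the resulting dictionary to the database API.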

【００４３】この文書登録システムでは、フォーマット
検索、キーワード検索、あるいは図／表名検索による三
種類の検索が可能である。フォーマット検索では、上述
した「文字的文書」、「表的文書」、「図的文書」、
「段無し」、「２段組」、「３段組」や、ユーザが定義
したフォーマットなどを検索キーとして、文書データベ
ース１４に登録されている文書のフォーマット情報を参
照して文書の検索を行う。キーワード検索では、キーワ
ードを検索キーとして、文書データベース１４に登録さ
れている文書の文字情報を参照して検索を行う。このと
き、市販の全文検索エンジンを使用してもよい。また、
図／表名検索では、図表に付与されている図題／表題を
検索キーとして検索を行うが、文書データベース１４に
登録されている文書の図／表名情報を参照して、指定さ
れた題目を含む図／表が記載されている文書を検索す
る。In this document registration system, three types of search are possible: format search, keyword search, and figure/table name search. In the format search, documents are retrieved by referring to the format information of the documents registered in the document database 14, using as the search key one of the above-described "character document", "table document", "graphic document", "no columns", "two-column", or "three-column" formats, or a user-defined format. In the keyword search, a keyword is used as the search key and the character information of the documents registered in the document database 14 is referred to. A commercially available full-text search engine may be used for this purpose. In the figure/table name search, the caption or title given to a figure or table is used as the search key: the figure/table name information of the documents registered in the document database 14 is referred to, and documents containing a figure or table whose title includes the specified text are retrieved.
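The three search types can be sketched over a small in-memory record list. The `DocumentRecord` structure, its field names, and the substring matching are assumptions for illustration; the actual layout of the document database 14 is not given in the description, and a real keyword search could be delegated to a full-text search engine as the text notes.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """Hypothetical registered document with the information used as search keys."""
    doc_id: str
    formats: set                     # e.g. {"table document", "two-column"}
    text: str                        # character information from recognition
    captions: list = field(default_factory=list)  # figure/table titles


def format_search(records, fmt):
    """Format search: match against the registered format information."""
    return [r.doc_id for r in records if fmt in r.formats]


def keyword_search(records, keyword):
    """Keyword search: match against the recognized character information."""
    return [r.doc_id for r in records if keyword in r.text]


def caption_search(records, title):
    """Figure/table name search: match titles that include the given text."""
    return [r.doc_id for r in records if any(title in c for c in r.captions)]
```

Because each function returns a list of document identifiers, results from different search types can be intersected to narrow a search interactively.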

【００４４】本実施形態による文書登録システムは、イ
ンターネットなどの広域ネットワークに接続された形態
をとっているが、必ずしもこうした形態を必要としな
い。例えば、文書登録と文書検索を実現するようにプロ
グラムされたパーソナルコンピュータに文書を取り込む
ためのスキャナと、検索結果を出力するためのディスプ
レイやプリンタなどが接続された実施形態も可能であ
る。以上、本発明を実施の形態を示して説明したが、本
発明は、上記実施の態様に限定されるものでないことは
勿論である。Although the document registration system according to the present embodiment has a form connected to a wide area network such as the Internet, such a form is not necessarily required. For example, an embodiment in which a scanner for taking in a document into a personal computer programmed to realize document registration and document search and a display or printer for outputting a search result are connected is also possible. As described above, the present invention has been described with reference to the embodiments. However, it is needless to say that the present invention is not limited to the above embodiments.

【００４５】[0045]

【発明の効果】上述の説明から明らかなように、本発明
によれば、検索対象となる文書画像の入力、分類作業の
負担が軽減され、しかも高精度な検索が可能になる。し
たがって、実用的なファイリングシステムの構築、運用
が可能になり、事務処理のペーパーレス化が促進されて
書類保管スペースの確保が不要になる。As is apparent from the above description, according to the present invention, the burden of inputting and classifying document images to be searched can be reduced, and high-precision searching can be performed. Therefore, a practical filing system can be constructed and operated, and paperless office work is promoted, so that it is not necessary to secure a document storage space.

[Brief description of the drawings]

【図１】本発明の一実施形態による文書登録システムの
構成図。FIG. 1 is a configuration diagram of a document registration system according to an embodiment of the present invention.

【図２】本実施形態による文書認識処理部の基本的な機
能ブロックの構成図。FIG. 2 is a configuration diagram of basic functional blocks of a document recognition processing unit according to the embodiment.

【図３】本実施形態による文書画像上に存在する黒画素
の周辺分布図。FIG. 3 is a peripheral distribution diagram of the black pixels on a document image according to the embodiment.

【図４】（ａ）は傾きのない文書画像上に存在する黒画
素の周辺分布図、（ｂ）傾いた文書画像上に存在する黒
画素の周辺分布図。FIG. 4A is a peripheral distribution diagram of the black pixels on a document image with no inclination, and FIG. 4B is a peripheral distribution diagram of the black pixels on an inclined document image.

【図５】本実施形態による文書画像の傾き補正角度算出
の説明図。FIG. 5 is an explanatory diagram of calculating a tilt correction angle of a document image according to the embodiment.

【図６】（ａ）は本実施形態による基本矩形領域の抽出
処理の説明図、（ｂ）は本実施形態による基本矩形領域
の一例を示した図。FIG. 6A is a diagram illustrating a process of extracting a basic rectangular area according to the embodiment, and FIG. 6B is a diagram illustrating an example of the basic rectangular area according to the embodiment.

【図７】（ａ）本実施形態による基本矩形領域の統合ロ
ジックの説明図、（ｂ）本実施形態による基本矩形領域
の他の統合ロジックの説明図。FIG. 7A is an explanatory diagram of integration logic for basic rectangular areas according to the embodiment, and FIG. 7B is an explanatory diagram of another integration logic for basic rectangular areas according to the embodiment.

【図８】本実施形態による領域属性判別処理における処
理手順の説明図。FIG. 8 is an explanatory diagram of the processing procedure in the area attribute determination process according to the embodiment.

【図９】本実施形態による図表題識別処理における処理
手順の説明図。FIG. 9 is an explanatory diagram of the processing procedure in the figure/table title identification process according to the embodiment.

【図１０】本実施形態による文書認識処理後の文書画像
の一例を示した図。FIG. 10 is a view showing an example of a document image after the document recognition processing according to the embodiment.

【図１１】（ａ）本実施形態による２段組のフィルタの
特性図、（ｂ）は３段組のフィルタの特性図、（ｃ）は
ｇ関数の一例を示す図、（ｄ）は（ｃ）の場合のｈ関数
（合成関数）の一例を示した図。FIG. 11A is a characteristic diagram of the two-column filter according to the embodiment, FIG. 11B is a characteristic diagram of the three-column filter, FIG. 11C shows an example of the g function, and FIG. 11D shows an example of the h function (composite function) for the case of FIG. 11C.

【図１２】本実施形態によるユーザ定義によるフォーマ
ット登録処理における処理手順の説明図。FIG. 12 is an explanatory diagram of the processing procedure in the user-defined format registration process according to the embodiment.

【図１３】本実施形態による特徴ベクトルの定義の一例
を示した図。FIG. 13 is a view showing an example of the definition of a feature vector according to the embodiment.

【図１４】本実施形態によるユーザ定義によるフォーマ
ット判定処理における処理手順の説明図。FIG. 14 is an explanatory diagram of the processing procedure in the user-defined format determination process according to the embodiment.

【図１５】本実施形態による特徴空間の一例を示した
図。FIG. 15 is a diagram showing an example of the feature space according to the embodiment.

【図１６】従来技術による文書登録システムにおける登
録画面の一例を示した図。FIG. 16 is a diagram showing an example of a registration screen in a conventional document registration system.

[Explanation of symbols]

１サーバ１０入力文書１１スキャナ１２文書認識処理部１３文書分類処理部１４文書データベース１５ＡＰＩ１６ｈｔｔｐｄ１７ＬＡＮ１８ＷＷＷクライアント端末１２１画像正規化部１２２基本矩形抽出部１２３領域統合部１２４領域属性判別部１２５文字認識部 1: server; 10: input document; 11: scanner; 12: document recognition processing unit; 13: document classification processing unit; 14: document database; 15: API; 16: httpd; 17: LAN; 18: WWW client terminal; 121: image normalization unit; 122: basic rectangle extraction unit; 123: area integration unit; 124: area attribute determination unit; 125: character recognition unit

フロントページの続き (72)発明者塚本英昭東京都江東区豊洲三丁目３番３号エヌ・ティ・ティ・データ通信株式会社内 Continued from the front page: (72) Inventor Hideaki Tsukamoto, c/o NTT Data Communications Corporation, 3-3-3 Toyosu, Koto-ku, Tokyo

Claims

[Claims]

1. A document registration system comprising document registration means for registering an input document image in a database together with a search key corresponding to the document image, the system searching for a registered document image based on the search key, wherein the document registration means has a document recognition processing unit that specifies attribute-based areas, such as character areas, table areas, and figure areas, on the document image together with their position information, and automatically generates the search key from the attributes and position information of the specified attribute-based areas.

2. The document registration system according to claim 1, wherein the document recognition processing unit comprises: an image normalization unit that corrects the inclination of the document image when the document image is inclined; a basic rectangular area extraction unit that extracts basic rectangular areas from the document image or from the inclination-corrected document image; an area integration unit that integrates the extracted basic rectangular areas as necessary; and an area attribute determination unit that determines whether each basic rectangular area or integrated area corresponds to a character, a figure, a table, or any other attribute.

3. The document registration system according to claim 2, wherein the image normalization unit detects the peripheral distribution of the pixels in the row direction and the column direction of the document image, and rotates the document image so that the white distribution in each direction is maximized.

4. The document registration system according to claim 2, wherein the area attribute determination unit obtains the variances of the widths and heights of the basic rectangular areas included in an integrated area, and the number and lengths of the line segments in each area, and determines the attribute of the area to be one of character, table, and figure according to the values of the variances, the number of line segments, and the lengths of the line segments.

5. The document registration system according to claim 4, wherein the area attribute determination unit calculates the amount of positional shift of a target area determined to be characters relative to an area near the target area, and thereby determines that the target area is a title display area.

6. A document registration system comprising document registration means for registering an input document image in a database together with a search key corresponding to the document image, the system searching for a registered document image based on the search key, wherein the document registration means includes: a document recognition processing unit that specifies attribute-based areas, such as character areas, table areas, and figure areas, on the document image together with their position information; and a document classification processing unit that classifies the document by format based on the specified attribute-based areas and their position information, and wherein the classification result is included in the search key.

7. The document registration system according to claim 6, wherein the document classification processing unit classifies the document image, based on the number of attributes in the attribute-based areas recognized by the document recognition processing unit, into general document types such as a "character document" composed mainly of characters, a "table document" composed mainly of tables, and a "graphic document" composed mainly of figures.

8. The document registration system according to claim 6, wherein the document classification processing unit determines the column layout of the document image based on a function of the frequency of existence of the basic rectangular areas or integrated areas in a predetermined direction and the area of each region, and classifies the document image by column layout.

9. The document registration system according to claim 6, further comprising: a feature vector generation unit that generates a feature vector representing the document features of a document image based on the attributes and position information of the attribute-based areas recognized by the document recognition processing unit; a dictionary unit in which a plurality of feature vectors generated in advance are stored grouped by document format; and group specifying means that compares the distances between the feature vector generated by the feature vector generation unit and the feature vectors stored in the dictionary unit and specifies the one group to which the feature vector with the smaller distance belongs, wherein the document image is classified according to the specified group.