JPH1166230A

JPH1166230A - Device, method, and medium for document recognition

Info

Publication number: JPH1166230A
Application number: JP9216873A
Authority: JP
Inventors: Yoshihiko Matsukawa; 善彦松川; Kenji Kondo; 堅司近藤; Tsuyoshi Megata; 強司目片
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1997-08-11
Filing date: 1997-08-11
Publication date: 1999-03-09

Abstract

PROBLEM TO BE SOLVED: To provide a device, method, and medium for document recognition that can recognize a document more efficiently. SOLUTION: A document area initializing device generates a document area object, a black pixel circumscribed rectangle extracting device 108 extracts the circumscribed rectangles of connected black pixel components, and a blank band extracting device extracts a band of white pixels in an area object as a blank band. A document area dividing device 109 identifies and divides a document area, a paragraph area dividing device 110 divides a paragraph, which is a set of character strings, and a character string area dividing device 111 divides a character string, which is a set of characters. A character area dividing device initializes the attributes of a character area object, a character recognizing device 105 recognizes the characters in the character area, and a closed area dividing device 112 identifies and divides a closed area that cannot be divided by blank bands. The areas divided by the respective dividing devices are generated as area objects and given adjacency or inclusion relations as attributes to generate an area division tree, and the dividing process is repeated until no area object can be divided further.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a document recognition device, a document recognition method, and a medium used for analyzing the structure of a document image and digitizing the data in the document.

【０００２】[0002]

[Description of the Related Art] When the structure of a document image is analyzed, black pixel areas are extracted, and the image is divided into areas based on the separators (blank areas or ruled lines) that lie between the extracted black pixel areas. It is necessary to reliably extract the separators at which the content of the document changes significantly.

[0003] When a conventional apparatus analyzes the structure of a document image, it extracts character strings, vertical and horizontal ruled lines, and other black pixel areas from the image. Subsequent processing is based on the extracted rectangle data.

[0004] First, all long, wide white areas and long ruled lines that can serve as separators are extracted from the coordinate positions of the rectangles. Next, after graphic areas are removed, the character area is roughly divided using the extracted separators. Then, within each character area, breaks between components (sub-separators) are found from changes in line pitch or character size, and the area is further subdivided according to these sub-separators. By analyzing the image in this way, area structure data is obtained as a tree structure. Here, black pixel areas are extracted by a bottom-up method that merges neighboring black pixels by applying a low-pass filter to white runs. When an area is divided, division is performed alternately in the vertical and horizontal directions.

【０００５】[0005]

[Problems to be Solved by the Invention] However, in the conventional apparatus, when a document is enclosed in a frame or contains an area such as a table, the ruled lines are not isolated (vertical and horizontal ruled lines touch each other), so area division cannot proceed into the frame or table. In addition, because the rough division relies on white areas larger than a fixed threshold, the threshold is difficult to set for a document in which characters of various sizes are mixed. That is, a threshold set with reference to an area of large characters cannot be used for an area of small characters. As a result, the tree structure that is finally obtained as area structure data deviates greatly from the area structure of the document and must be corrected afterwards.

[0006] In contrast, the present inventors select separators appropriately for each area to be divided, so that the resulting tree structure represents the area structure well without correction. The blank areas and ruled lines within the area to be divided are ultimately all extracted; once extracted, they are treated as the same kind of separator without distinction, and processing is instead simplified by assigning priorities to the separators at division time. Furthermore, when no blank area is found while an area is being divided, a ruled line is searched for and set as the separator, so that area division works well even for a document enclosed in a frame or a document that contains a table.

[0007] In addition, the conventional method does not examine the contents of the areas, so when documents with very similar formats are input, the candidates cannot be narrowed down further and document recognition cannot be performed efficiently. Using character recognition results in addition to the structure of the format makes it easier to narrow down the format. Moreover, rather than handling all areas one-dimensionally, giving the format a tree structure allows the identified format to be narrowed down further.

[0008] In view of these problems of the conventional apparatus, an object of the present invention is to provide a document recognition device, a document recognition method, and a medium that can perform document recognition more efficiently.

【０００９】[0009]

[Means for Solving the Problems] The invention according to claim 1 is a document recognition device comprising: document area initializing means for generating a document area object; black pixel circumscribed rectangle extracting means for extracting the circumscribed rectangles of connected black pixel components; blank band extracting means for extracting a band of white pixels in an area object as a blank band; document area dividing means for identifying and dividing a document area; paragraph area dividing means for dividing a paragraph, which is a set of character strings; character string area dividing means for dividing a character string, which is a set of characters; character area dividing means for initializing the attributes of a character area object; and closed area dividing means for identifying and dividing a closed area that cannot be divided by the blank bands, wherein the areas divided by the respective dividing means are generated as area objects and given adjacency or inclusion relations as attributes to generate an area division tree, and when all area objects have been divided until they can be divided no further, the area division tree represents the format information of the document area object.

[0010] The invention according to claim 9 is a document recognition method for recognizing areas of document data from an input document image, the method comprising: extracting, from the input document image, ruled lines and/or blank bands in which no document data exists; dividing the image data into predetermined areas using the extracted ruled lines and/or blank bands; extracting ruled lines and/or blank bands from each of the divided predetermined areas; further dividing the predetermined areas using the extracted ruled lines and/or blank bands; repeating the division; generating the inclusion relations of the divisions as an area division tree; and outputting the tree as format information of the document data.

[0011] Thus, for example, the document area initializing device generates and initializes a document area object; the black pixel circumscribed rectangle extracting device extracts the circumscribed rectangles of connected black pixel components; the blank band extracting device extracts a band of white pixels in an area object as a blank band; the document area dividing device identifies and divides a document area; the paragraph area dividing device divides a paragraph, which is a set of character strings; the character string area dividing device divides a character string, which is a set of characters; the character area dividing device initializes the attributes of a character area object; the character recognizing device recognizes the characters in a character area; and the closed area dividing device identifies and divides a closed area that cannot be divided by blank bands. The areas divided by the respective dividing devices are generated as area objects and given adjacency or inclusion relations as attributes to generate an area division tree, division is repeated until no area object can be divided further, and the area division tree is obtained as format information.

【００１２】[0012]

[Embodiments of the Invention] An embodiment of the document recognition device of the present invention is described below.

[0013] Object orientation is introduced in implementing the present invention. The advantage of object orientation is that it helps organize the problem and encapsulate the details of the program. Before the embodiment is described, the classes used here are explained. In this embodiment, the objects are area objects. An area object has the following as attribute values: position information, class information indicating the type of the area, format information of the area, information on the separators dividing the area, adjacency and inclusion information constituting the area division tree, information on the black pixel circumscribed rectangles contained in the area, structure information indicating the geometric structure of the area, an estimate of the character width in the area, and so on. The basic class of area objects is called the SegCell class, and area objects of the following classes, derived from SegCell as the base class, are generated.

[0014]
・DocCell class (document area: the whole document, or an embedded document)
・ParaCell class (paragraph area: a set of character strings)
・TextCell class (character string area: a set of characters)
・CharCell class (character area)
・OtherCell class (closed area: a generic term for areas other than character areas, such as figures, tables, ruled lines, and photographs)
・TableCell class (table area)
・LineCell class (ruled line area)
・FigCell class (figure area)
・PictCell class (photograph area)

First, the configuration and operation of an embodiment of the document recognition device of the present invention are described with reference to FIG. 1, together with an embodiment of the document recognition method of the present invention.
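The class hierarchy listed above can be pictured with a minimal Python sketch. Only the class names come from the text; the attribute names (`rect`, `separators`, `black_rects`, `char_width`, `parent`, `children`) and the `add_child` helper are hypothetical stand-ins for the attributes that paragraph [0013] describes in words, and every derived class is assumed to inherit directly from SegCell.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom); assumed convention


@dataclass(eq=False)
class SegCell:
    """Base area object. Field names are illustrative, not from the patent."""
    rect: Rect                                          # position information
    separators: List[Tuple[int, int]] = field(default_factory=list)
    black_rects: List[Rect] = field(default_factory=list)
    char_width: Optional[float] = None                  # estimated character width
    parent: Optional["SegCell"] = field(default=None, repr=False)
    children: List["SegCell"] = field(default_factory=list)

    def add_child(self, child: "SegCell") -> "SegCell":
        """Record an inclusion relation in the area division tree."""
        child.parent = self
        self.children.append(child)
        return child


# Derived classes named in the text; bodies are intentionally empty here.
class DocCell(SegCell): ...
class ParaCell(SegCell): ...
class TextCell(SegCell): ...
class CharCell(SegCell): ...
class OtherCell(SegCell): ...
class TableCell(SegCell): ...
class LineCell(SegCell): ...
class FigCell(SegCell): ...
class PictCell(SegCell): ...
```

In such a sketch the area division tree is simply the `parent`/`children` links that accumulate as each dividing step calls `add_child`.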

[0015] A document image is input from the image input device. The document area initializing device regards the entire input document image as one area and generates and initializes a document area object (DocCell class). The document area dividing device identifies and divides the document area object, and generates and initializes area objects according to the identification results of the divided areas. If a paragraph area is generated, the paragraph area dividing device divides it into character strings and generates and initializes character string area objects. If a character string area is generated, the character string area dividing device divides it into characters and generates and initializes character area objects. If a closed area is generated, the closed area dividing device identifies and divides it, generates and initializes area objects according to the classes of the divided areas, and performs processing according to the class of each generated area. For example, a table area is further divided on the basis of ruled lines, and the image within a figure or photograph area is compressed. In this way, areas are divided hierarchically until they can be divided no further. The attribute values of the character area objects generated by the character string area dividing device are set by the character area dividing device. FIG. 2 shows, as a hierarchical structure, how the document area class (DocCell class) is divided. The distinctive classes are the document area class (DocCell class) and the closed area class (OtherCell class): for both, the corresponding area dividing device identifies the area, and the division may in turn generate further objects of the document area class (DocCell class). For example, in a two-column document, the document area object representing the whole area is divided to generate two new area objects, left and right.

[0016] Next, the blank band extracting device, which extracts blank bands (one kind of separator), is described with reference to FIG. 3. First, the vertical and horizontal separators (Equation 1) 302 and (Equation 2) 301 are obtained. Here, i is the index of S_V and j is the index of S_H.

【００１７】[0017]

【数１】 (Equation 1)

【００１８】[0018]

【数２】 (Equation 2)

[0019] Each separator is represented by an interval of coordinates orthogonal to its direction. That is,

【００２０】[0020]

【数３】 (Equation 3)

[0021] Here, n_V and n_H are the numbers of blank bands in the vertical and horizontal directions, respectively, and both are 2 or more.

[0022] Separators are obtained as follows. When the separator is a blank band, the projection profile of the target area is computed in the vertical or horizontal direction, and each interval in which the profile is below a threshold (for example, 1) is taken as a separator. When the separator is a ruled line, the interval containing the ruled line extracted by a ruled line extraction algorithm is taken as the separator. When the projection profile is computed, using the black pixel circumscribed rectangles stored in the area object instead of counting the black pixels of the actual image allows fast processing, and even areas of complex shape can be handled easily. Using these circumscribed rectangles also allows fast skew correction of the image: applying an affine transformation to the whole document image takes a great deal of processing time, whereas rotating only the circumscribed rectangles does not. For example, processing can be sped up by rotating the centroid of each circumscribed rectangle and using its width and height, corrected according to the rotation angle, as the width and height of the rotated rectangle. Alternatively, the circle inscribed in the original circumscribed rectangle (its center and radius) may be rotated. In this way, circumscribed rectangles are used where the image itself need not be handled, and where the original image is required, as in character recognition, the image enclosed by the original circumscribed rectangle is used.
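The blank band extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names `projection` and `blank_bands`, the half-open interval convention, and the choice to count overlapping rectangles (rather than pixels) as the projection value are all assumptions. Only the idea of projecting circumscribed rectangles and thresholding the profile (threshold 1 as the example value) comes from the text.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def projection(rects: List[Rect], start: int, end: int, axis: int) -> List[int]:
    """Count the circumscribed rectangles covering each coordinate in [start, end).

    axis=0 projects onto the x axis (used to find vertical blank bands),
    axis=1 projects onto the y axis (used to find horizontal blank bands).
    """
    profile = [0] * (end - start)
    for r in rects:
        lo, hi = (r[0], r[2]) if axis == 0 else (r[1], r[3])
        for p in range(max(lo, start), min(hi, end)):
            profile[p - start] += 1
    return profile


def blank_bands(profile: List[int], start: int, threshold: int = 1) -> List[Tuple[int, int]]:
    """Return maximal intervals [a, b) where the profile is below the threshold."""
    bands = []
    band_start = None
    for i, value in enumerate(profile):
        if value < threshold:
            if band_start is None:
                band_start = i
        elif band_start is not None:
            bands.append((start + band_start, start + i))
            band_start = None
    if band_start is not None:
        bands.append((start + band_start, start + len(profile)))
    return bands
```

Because the loop runs over rectangles rather than over image pixels, the cost depends on the number of connected components, which is the speed advantage the paragraph points to.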

[0023] Next, the symbols used in the following description are defined.

[0024] ・Let G_Vi and G_Hj denote the intervals between separators S_Vi and S_Vi+1 and between S_Hj and S_Hj+1, respectively. G_Vi and G_Hj can then be expressed as in Equations 4 and 5. In FIG. 3, G_Vi is denoted by reference numeral 304 and G_Hj by reference numeral 303. Here, i, i+1, j, and j+1 are indices of S_V, S_H, G_V, and G_H.

【００２５】[0025]

【数４】 (Equation 4)

【００２６】[0026]

【数５】 (Equation 5)

[0027] ・Let w(x) be the function that gives the size (width) of an interval such as a separator S_Vi or the gap between separators; the size of separator S_Vi is then written w(S_Vi).

[0028] ・In addition, the mean of a variable is written μ(x), its standard deviation σ(x), its maximum max(x), and its mode f(x). The mode f(w) of a variable w is obtained by the following equation (Equation 6).

【００２９】[0029]

【数６】 (Equation 6)

[0030] Here, D(w) is the frequency function, and (Equation 7) is the smoothed frequency function.

【００３１】[0031]

【数７】 (Equation 7)

[0032] For example, n = 5.
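Equations 6 and 7 are not reproduced in this text, so the following is only a guess at their general shape: it assumes that the smoothed frequency function is a sum of D(w) over n neighboring bins and that the mode f(w) is the argument maximizing that smoothed function. The function name `mode_smoothed` and the integer binning are hypothetical; only the frequency function D(w), the idea of smoothing, and the example value n = 5 come from the source.

```python
from collections import Counter
from typing import Iterable


def mode_smoothed(values: Iterable[int], n: int = 5) -> int:
    """Return the integer w maximizing an n-bin moving sum of the frequency D(w).

    Assumed form only: the patent's Equations 6 and 7 are not available here.
    """
    values = list(values)
    freq = Counter(values)           # D(w): raw frequency of each integer value
    half = n // 2
    best_w, best_score = min(values), -1
    for w in range(min(values), max(values) + 1):
        # Smoothed frequency: sum of D over the n bins centered on w.
        score = sum(freq.get(w + k, 0) for k in range(-half, half + 1))
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```

Smoothing of this kind favors a dense cluster of similar widths over a single isolated spike, which is presumably why a smoothed frequency function is used rather than the raw one.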

[0033] Areas are divided on the basis of the separators obtained in this way. Among the dividing devices, the document area dividing device and the closed area dividing device behave specially, in that area identification (classification) is performed when objects of the document area class and the closed area class are divided. First, the operation of the document area dividing device is described with reference to FIG. 4. If the isolated ruled line extracting device 402 extracts an isolated ruled line, the area is divided using that isolated ruled line as a separator. If no isolated ruled line exists, the character string determining device 403 and the paragraph determining device 404 check whether the target area belongs to the character string area class (TextCell class) or the paragraph area class (ParaCell class). If it is neither, the document area structure identifying device 405 selects the separators (blank bands) to be used for division, and the area is divided on the basis of the selected separators. In every case, once separators are selected, the re-division area generating device 406 generates and initializes the divided area objects, and the area dividing device corresponding to the class of each area is invoked. Furthermore, if no separator exists with which to divide the area, an area object of the closed area class (OtherCell class) is generated and initialized, and the closed area dividing device 112 is invoked.

[0034] Next, the closed area dividing device 112 is described with reference to FIG. 5. Ruled lines play an important role in identifying a closed area. In the closed area dividing device 112, the ruled line extracting device 501 extracts ruled lines. If there are multiple ruled lines, the area is identified as a table in the broad sense (or a form; TableCell class), and the re-division area generating device 502 generates and initializes new document area objects for the areas enclosed by the ruled lines. Otherwise, if there is exactly one ruled line, the area is identified as a ruled line area object (LineCell class); if the size of the area is close to the character size, as the character area class (CharCell class); and otherwise, according to the density of black pixels, as the photograph area class (PictCell class) or the figure area class (FigCell class). The re-division area generating device 502 then generates and initializes the area object corresponding to each area class and invokes the processing device corresponding to the class of the area object. For example, for a document area object or a character area object, the document area dividing device 109 or the character area dividing device 113 described above is invoked. For other area objects, processing such as compression is performed for photograph and figure areas, and processing such as vectorization for ruled line area objects.
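The order of the tests in the closed area identification can be sketched as a small decision function. The text gives only the order of the checks; the function name, the parameters, the tolerance `size_tol`, the `density_threshold`, and the assumption that the denser area is the photograph are all hypothetical choices made for illustration.

```python
from typing import Optional


def classify_closed_region(num_ruled_lines: int,
                           width: int,
                           height: int,
                           char_width: Optional[float],
                           black_density: float,
                           size_tol: float = 0.5,
                           density_threshold: float = 0.3) -> str:
    """Return the class name for a closed area, following the order in the text.

    size_tol and density_threshold are made-up values; the patent text does
    not state them, nor which density range maps to PictCell versus FigCell.
    """
    if num_ruled_lines >= 2:
        return "TableCell"   # multiple ruled lines: table (or form) in the broad sense
    if num_ruled_lines == 1:
        return "LineCell"    # a single ruled line
    if char_width and abs(max(width, height) - char_width) <= size_tol * char_width:
        return "CharCell"    # area size close to the character size
    # Assumption: higher black pixel density is treated as a photograph.
    return "PictCell" if black_density >= density_threshold else "FigCell"
```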

[0035] The operation of the re-division area generating devices 406 and 502 is now described. The overlap W_{i,j} of the two selected kinds of separators, vertical and horizontal, is obtained by (Equation 8).

【００３６】[0036]

【数８】 (Equation 8)

[0037] Here, R(l, t, r, b) denotes the area enclosed by the two points (l, t) and (r, b).

[0038] The area delimited by the two overlaps W_{i,j} and W_{i+1,j+1} is a divided area, and this area is re-divided by the same method.
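Equation 8 is not reproduced here, so the sketch below assumes one plausible reading: each vertical separator is an x interval, each horizontal separator is a y interval, W_{i,j} is the rectangle where the two cross, and the divided area spans from the bottom-right corner of W_{i,j} to the top-left corner of W_{i+1,j+1}. The function names and tuple layouts are hypothetical; R(l, t, r, b) follows the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]        # (start, end) along one axis
Rect = Tuple[int, int, int, int]  # R(l, t, r, b)


def overlap(s_v: Interval, s_h: Interval) -> Rect:
    """W_{i,j}: where a vertical separator (x interval) crosses a horizontal one (y interval)."""
    return (s_v[0], s_h[0], s_v[1], s_h[1])


def subregions(s_v: List[Interval], s_h: List[Interval]) -> List[Rect]:
    """Areas delimited by W_{i,j} and W_{i+1,j+1} for all adjacent separator pairs."""
    cells = []
    for i in range(len(s_v) - 1):
        for j in range(len(s_h) - 1):
            w_a = overlap(s_v[i], s_h[j])
            w_b = overlap(s_v[i + 1], s_h[j + 1])
            # From the bottom-right of W_{i,j} to the top-left of W_{i+1,j+1}.
            cells.append((w_a[2], w_a[3], w_b[0], w_b[1]))
    return cells
```

Under this reading, the requirement that n_V and n_H are both at least 2 means the outer margins count as separators, so even an undivided area yields one cell.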

[0039] The information used for identifying areas is the separators described above and the estimated character width. The character width estimating device is described here. For areas containing characters (DocCell, ParaCell, and TextCell classes), the character width is estimated by whichever method is appropriate in each case. For the DocCell and ParaCell classes, the character width is estimated from the circumscribed rectangles of the connected black components: circumscribed rectangles that are close to square are selected, and the mode of their widths is taken as the estimated character width of the area. For the TextCell class, the character width is taken to be the height of the area (or, for vertical writing, the width of the area). The character width estimated for each class is written w_C.
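The character width estimate for DocCell and ParaCell areas can be sketched as below. The function name, the `squareness` tolerance of 0.8, and the use of a plain (unsmoothed) mode are assumptions; the text says only that near-square circumscribed rectangles are selected and that the mode of their widths is used.

```python
from collections import Counter
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


def estimate_char_width(rects: List[Rect], squareness: float = 0.8) -> Optional[int]:
    """Estimate w_C as the most frequent width among near-square rectangles.

    A rectangle counts as near-square when min(w, h) / max(w, h) >= squareness;
    the 0.8 value is a hypothetical tolerance, not taken from the patent.
    """
    widths = []
    for left, top, right, bottom in rects:
        w, h = right - left, bottom - top
        if w > 0 and h > 0 and min(w, h) / max(w, h) >= squareness:
            widths.append(w)
    if not widths:
        return None  # no near-square component: no estimate available
    return Counter(widths).most_common(1)[0][0]
```

Restricting the vote to near-square rectangles is what keeps punctuation, ruled line fragments, and merged components from skewing the estimate.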

[0040] Next, the operation of the character string determining device is described (see FIG. 6). If the following conditions are satisfied, the area is determined to be a horizontally written character string. This relies on the fact that the width of a character and the height of a character string take similar values. Analogous conditions apply to vertical writing.

【００４１】[0041]

【数９】 (Equation 9)

【００４２】[0042]

【数１０】 (Equation 10)

[0043] The paragraph determining device operates as follows (see FIG. 7). The regularity of character strings is used to determine a paragraph. The area is divided in the direction in which the maximum separator width is larger. That is, vertical writing is assumed if (Equation 11) holds, and horizontal writing otherwise.

【００４４】[0044]

【数１１】 [Equation 11]

[0045] This relies on the fact that the spacing between character strings is usually larger than the spacing between characters. When horizontal writing is assumed, if the following expressions (Equation 12) and (Equation 13) are satisfied,

【００４６】[0046]

【数１２】 (Equation 12)

【００４７】[0047]

【数１３】 (Equation 13)

[0048] the target area can be determined to be a set of character strings, that is, a horizontally written paragraph.

[0049] Areas that have not been determined to be a closed area, a character string area, or a paragraph area are identified by the document area structure identifying device 405 (see FIG. 8). Examples of such areas include multi-column layouts, chapter structures, and title pages such as the front page of a paper. In a document, a highly distinctive separator is normally used to set off an area that forms a single function or meaning from the other areas, so that readers can see the division easily. For example, where a chapter changes, the new chapter is set off by a blank band wider than the character spacing. The fact that the character sizes of the title and the body differ greatly, as on a title page, can also be exploited. The document area structure identifying device 405 is built on these ideas. First, the blank band width change point extracting device 801 and the projection width change point extracting device 802 find the blank bands at which the contents of the area change significantly, together with the mean blank band width and the mean projection width. Using this information, the effective blank band selecting device 803 decides how to divide the area.

[0050] The blank band width change point extracting device 801 takes the widths of two adjacent blank bands and, if the ratio of the smaller width to the larger is at or below a fixed value (for example, 0.5), makes the wider blank band a separator candidate. The projection width change point extracting device 802 takes two adjacent projection widths and, if the ratio of the smaller projection width to the larger is at or below a fixed value (for example, 0.8), makes the blank band between the two projections a separator candidate.
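The two change point tests can be sketched as ratio checks over adjacent widths. The function names and the index conventions of the return values are assumptions; the example thresholds 0.5 and 0.8 are the ones given in the text. The sketch also assumes that the test in device 801 compares blank band widths with each other.

```python
from typing import List


def blank_width_candidates(blank_widths: List[float], ratio: float = 0.5) -> List[int]:
    """Indices of blank bands that are much wider than an adjacent blank band."""
    candidates = set()
    for k in range(len(blank_widths) - 1):
        a, b = blank_widths[k], blank_widths[k + 1]
        small, large = min(a, b), max(a, b)
        if large > 0 and small / large <= ratio:
            candidates.add(k if a >= b else k + 1)  # keep the wider band
    return sorted(candidates)


def projection_width_candidates(proj_widths: List[float], ratio: float = 0.8) -> List[int]:
    """Indices k such that the blank band between projections k and k+1 is a candidate."""
    candidates = []
    for k in range(len(proj_widths) - 1):
        a, b = proj_widths[k], proj_widths[k + 1]
        small, large = min(a, b), max(a, b)
        if large > 0 and small / large <= ratio:
            candidates.append(k)
    return candidates
```

For instance, a large title line followed by body text lines gives a sharp drop in projection width, so the gap below the title becomes a candidate even when it is not unusually wide.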

[0051] Next, the operation of the character string area dividing device is described. In dividing a character string area, it is important not merely to split the area into smaller areas but to cut out characters, which are the smallest components of an area containing text. For example, the character "松" (pine) should not be split into the two areas "木" (tree) and "公" (public) but cut out as the single character "松". Like the area division described so far, the character string area dividing device also involves identification; for character areas, however, the character recognition result serves as the identification. Since humans can segment characters correctly precisely because they can read them, using character recognition results for character segmentation is considered reasonable.

[0052] The character string area dividing device uses a value defined below (hereinafter called the cutout score). Character recognition normally normalizes the character candidate area to a fixed size before processing, so information such as the character size and aspect ratio is lost. Consequently, in the 「松」 example above, the character may be divided into 「木」 and 「公」, each of which may receive a high recognition score.

[0053] The cutout score is defined so as to compensate for this drawback. Let w_C be the estimated character width described above, and let the width of the character candidate area R_C in the character string direction and the score of the i-th character recognition candidate be written as in (Equation 14). Then

[0054]

[Equation 14]

[0055] the cutout score d_RC (Equation 15) is

[0056]

[Equation 15]

[0057] defined as shown. The difference between the scores of the first and third candidates is used because, when the first and second candidates are very similar characters, such as 「と」 and 「ど」, their scores hardly differ. The difference between the first and second candidates may be used instead. Two character segmentation methods that use this cutout score are described below. The first resembles the chart method of natural language processing (see FIG. 9). First, the character string is scanned from its beginning and divided at every position where it can be split perpendicular to the string direction, yielding n small areas 901 (Equation 16), and indices 902 (Equation 17) are assigned before and after each small area r_i.

[0058]

[Equation 16]

[0059]

[Equation 17]

[0060] Consecutive small areas are then merged, and the cutout score of each merged area is computed. This is done for every merged area whose width is less than 1.2 times the estimated character width. Each merged area can be regarded as a character area candidate.

[0061] A character lattice is obtained for each of the m merged areas found in this way. Here, a character lattice is a tuple of four elements: the indices v_s and v_e of the start and end points of the corresponding merged area, the cutout score d, and the set of small areas that make up the merged area (Equation 18). It is denoted by the symbol l_j.

[0062]

[Equation 18]

[0063] A set of character lattices l_j (Equation 19)

[0064]

[Equation 19]

[0065] can also be formed. Here, n_i denotes the number of merged areas (character area candidates) that make up character lattice l_i.

[0066] From the set of lattices obtained in this way (the initial lattice set), two lattices l_i and l_j that are connectable under the following connection rule are connected to generate a new lattice l', which is added to the set (Equation 19). The lattice connection rule is given by (Equation 20).

[0067]

[Equation 20]

[0068] For example, in FIG. 9, when character lattice 903 and character lattice 904 are connected to obtain character lattice 905, the score of character lattice 905 is (1615 + 1710) / 2 ≈ 1662 by the lattice connection rule.

[0069] In this way, lattices are connected by a method similar to the chart method of natural language processing, and among the lattices in the set (Equation 19) that satisfy (Equation 21), the one with the highest score is taken as the character segmentation result for the target area.

[0070]

[Equation 21]

[0071] This method has two advantages. First, because any character-like lattice in the character string can be selected and adjacent lattices connected to it in turn, the method copes relatively well with variable-pitch characters (alphanumerics) and the like, which cannot be cut out by the method that segments sequentially from the front (described later). Second, because character lattices are held in the smallest units, feedback is easy to apply in post-processing.
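The lattice-based method can be sketched as follows. Because Equations 14 to 21 are not reproduced in this text, the sketch rests on stated assumptions: the cutout score is supplied by a hypothetical callback `score_fn(v_s, v_e)`, the connection rule is assumed to be the mean of the two scores weighted by the number of merged areas n_i (consistent with the (1615 + 1710) / 2 example), and the condition of (Equation 21) is assumed to mean that the lattice spans the whole character string. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # (v_s, v_e): indices before and after a run of small areas


@dataclass(frozen=True)
class Lattice:
    start: int               # v_s
    end: int                 # v_e
    score: float             # cutout score d
    cuts: Tuple[Span, ...]   # merged areas (character candidates), in order


def initial_lattices(widths: List[float], char_width: float,
                     score_fn: Callable[[int, int], float],
                     limit: float = 1.2) -> List[Lattice]:
    """One lattice per run of small areas narrower than limit * char_width."""
    out = []
    for s in range(len(widths)):
        total = 0.0
        for e in range(s + 1, len(widths) + 1):
            total += widths[e - 1]
            if total >= limit * char_width:
                break
            out.append(Lattice(s, e, score_fn(s, e), ((s, e),)))
    return out


def connect(a: Lattice, b: Lattice) -> Lattice:
    """Assumed rule: mean score weighted by the number of merged areas."""
    na, nb = len(a.cuts), len(b.cuts)
    score = (na * a.score + nb * b.score) / (na + nb)
    return Lattice(a.start, b.end, score, a.cuts + b.cuts)


def chart_segment(widths: List[float], char_width: float,
                  score_fn: Callable[[int, int], float]) -> Optional[Lattice]:
    """Connect adjacent lattices until nothing new is produced, then return
    the best lattice covering the whole string (None if there is none)."""
    chart = {}  # (start, end, number of cuts) -> best lattice seen so far
    agenda = initial_lattices(widths, char_width, score_fn)
    while agenda:
        lat = agenda.pop()
        key = (lat.start, lat.end, len(lat.cuts))
        if key in chart and chart[key].score >= lat.score:
            continue
        chart[key] = lat
        for other in list(chart.values()):
            if other.start == lat.end:
                agenda.append(connect(lat, other))
            if other.end == lat.start:
                agenda.append(connect(other, lat))
    full = [l for l in chart.values()
            if l.start == 0 and l.end == len(widths)]
    return max(full, key=lambda l: l.score) if full else None


# Illustrative scores only; 176, 1710 and 1615 echo the figures, 300 is made up.
scores = {(0, 1): 176, (0, 2): 1710, (1, 2): 300, (2, 3): 1615}
best = chart_segment([5, 5, 10], 10, lambda s, e: scores[(s, e)])
print(best.cuts, best.score)
```

Keeping only the best lattice per (start, end, number of cuts) bounds the size of the chart; that pruning is a design choice of this sketch and is not described in the source.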

[0072] The second character segmentation method merges small areas sequentially from the beginning of the character string and cuts out characters (see FIG. 10). Suppose characters have been cut out up to a certain small area. Starting from the next small area, adjacent small areas are merged one after another and the cutout score of each merged area is computed. The merged area at which the cutout score reaches its maximum is then cut out as a character.

[0073] For example, in FIG. 10, suppose division is complete up to r₁. First, when r₂ alone is treated as a character, the character 「朴」 is obtained as the recognition result with a cutout score of 176. Next, when r₂ and r₃ merged are treated as a character, the character 「枯」 is obtained with a cutout score of 1710. As with the character lattice method, the search width is limited to 1.2 times the estimated character width, so no further small areas are merged, and 「枯」, which has the maximum cutout score, is selected as the character starting at r₂. The same processing then continues from small area r₄ to the last small area of the line.
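The sequential method can be sketched in a few lines. As assumptions of this sketch, the cutout score comes from a hypothetical callback `score_fn(start, end)`, small areas are indexed from 0, the search stops once the merged width reaches 1.2 times the estimated character width, and the maximum is taken over the whole search width rather than at the first local peak.

```python
from typing import Callable, List, Tuple


def sequential_segment(widths: List[float], char_width: float,
                       score_fn: Callable[[int, int], float],
                       limit: float = 1.2) -> List[Tuple[int, int, float]]:
    """Cut characters from the front of the string.

    Returns (start, end, score) triples, where a character covers the
    small areas start .. end - 1.
    """
    cuts = []
    start = 0
    n = len(widths)
    while start < n:
        # A single small area is always tried, even if it is wide.
        best_end, best_score = start + 1, score_fn(start, start + 1)
        total = widths[start]
        end = start + 1
        while end < n:
            total += widths[end]
            if total >= limit * char_width:
                break  # search width exceeded; stop merging
            end += 1
            score = score_fn(start, end)
            if score > best_score:
                best_end, best_score = end, score
        cuts.append((start, best_end, best_score))
        start = best_end  # continue from the next unprocessed small area
    return cuts


# Illustrative scores; 176 and 1710 echo FIG. 10, 1615 is borrowed from FIG. 9.
scores = {(0, 1): 176, (0, 2): 1710, (2, 3): 1615}
print(sequential_segment([5, 5, 10], 10, lambda s, e: scores[(s, e)]))
```

Each decision is final once made, which is why this variant is cheap but cannot revise an earlier cut.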

[0074] This method is simpler and faster than the first, but has the drawback that a segmentation mistake is hard to correct. The character segmentation device may use either method.

[0075] Next, the format identification device, which provides the function of identifying a format, is described with reference to FIG. 11. Paired with the format database, the format identification device operates like a parser (syntactic analyzer) in natural language processing. What corresponds to the grammar in natural language processing is, in area division, the format information in the format database, that is, the area division tree described above; what corresponds to the parser is the format identification device. FIG. 11 illustrates format information in the database. FIG. 11(a) shows a format in which a table area 1102 is placed below a character string area 1101, and FIG. 11(b) shows a format in which a paragraph area 1104 is placed below a photograph area 1103. As described above, each area carries information on the direction of the separator by which it was divided, so its positional relationship to adjacent areas can be determined. A concrete method of area division using the format information is described next. Suppose an image is input to the document area dividing device and an area division tree, that is, format information, is finally obtained. The format identification device then compares it with the format information in the format database. The format information in the format database is not manipulated directly; instead, the format information used for comparison is first loaded into memory. In the initial state, all format information in the format database is assumed to be held in memory. The format information obtained from the input document is then traversed from its root, and the area object corresponding to each node is compared with the area object at the same position in each piece of format information in memory. For example, if the topmost area in the format information of the input document image is a character string area, it is compared with the area objects at the same position in the format information of FIGS. 11(a) and 11(b) in memory (a character string area and a photograph area, respectively), and the non-matching format information of FIG. 11(b) is deleted. By traversing the format information (area division tree) in this way and deleting non-matching format information from memory, whatever remains at the end is taken as the candidate format information matching the format of the input document. Because traversing down to fine-grained nodes such as character areas may cause misidentification, the node positions to be searched can be set in advance for each piece of format information. If many pieces of format information match, the candidates can be narrowed further by the closeness of the centroids of corresponding areas. It is also possible to compare the characters in corresponding character areas and select the format whose characters match. If the result matches none of the formats registered in the format database, the input document is judged to have an unknown format and the user is prompted to register it. At that point, the user can correct, add to, or delete from the new format using the keyboard 114 and the pointing device 115 while viewing it on the display device.
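The candidate-pruning comparison can be sketched as a tree match. The `Node` class, the class-name strings, and the `max_depth` cut-off are hypothetical simplifications introduced here; the separator direction, centroid distance, and character comparison mentioned in the text are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    cls: str                                  # area class, e.g. "string", "table"
    children: List["Node"] = field(default_factory=list)


def matches(doc: Node, fmt: Node, max_depth: int, depth: int = 0) -> bool:
    """Compare area classes node by node, down to max_depth levels."""
    if doc.cls != fmt.cls:
        return False
    if depth >= max_depth:
        return True  # deeper (fine-grained) nodes are deliberately ignored
    if len(doc.children) != len(fmt.children):
        return False
    return all(matches(d, f, max_depth, depth + 1)
               for d, f in zip(doc.children, fmt.children))


def identify_format(doc: Node, database: Dict[str, Node],
                    max_depth: int = 2) -> List[str]:
    """Return the names of registered formats that survive the comparison.

    A working copy is filtered so the database itself is never modified.
    An empty result means the document has an unknown format.
    """
    candidates = dict(database)
    for name in list(candidates):
        if not matches(doc, candidates[name], max_depth):
            del candidates[name]
    return list(candidates)


database = {
    "fig11a": Node("document", [Node("string"), Node("table")]),
    "fig11b": Node("document", [Node("photo"), Node("paragraph")]),
}
page = Node("document", [Node("string"), Node("table")])
print(identify_format(page, database))
```

Limiting the depth mirrors the remark that descending to character-level nodes invites misidentification; a per-format depth setting would be a natural extension.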

[0076] According to the inventions of claims 1 and 2, an area division tree is generated at the same time as the areas are divided hierarchically, and the area division tree finally obtained can be used as the format of the document.

[0077] According to the invention of claim 3, comparing not only the shapes of the areas but also their contents enables accurate format identification.

[0078] According to the invention of claim 4, processing time can be reduced because the image itself is not processed, and blank bands can be extracted easily from an area of any shape because only the positions of the connected components of black pixels (black pixel circumscribed rectangles) are handled.
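Working on rectangles instead of pixels can be illustrated with a short sketch: the rectangles are projected onto one axis and the gaps between the merged intervals are reported as blank bands. The `(x0, y0, x1, y1)` tuple layout, half-open coordinates, and the function name are assumptions of this sketch.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open coordinates


def blank_bands(rects: List[Rect], lo: int, hi: int,
                axis: int = 1) -> List[Tuple[int, int]]:
    """Blank bands of the area [lo, hi) along one axis.

    axis=1 projects onto the y axis (horizontal blank bands);
    axis=0 projects onto the x axis (vertical blank bands).
    """
    # Project every rectangle to an interval on the chosen axis.
    intervals = sorted((r[axis], r[axis + 2]) for r in rects)
    bands = []
    cursor = lo  # end of the region covered so far
    for a, b in intervals:
        if a > cursor:
            bands.append((cursor, a))  # uncovered gap is a blank band
        cursor = max(cursor, b)
    if cursor < hi:
        bands.append((cursor, hi))
    return bands


rects = [(0, 0, 10, 5), (12, 1, 20, 6), (0, 10, 20, 15)]
print(blank_bands(rects, 0, 20, axis=1))
```

The cost depends on the number of rectangles rather than the number of pixels, which is the saving the paragraph above refers to.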

[0079] According to the invention of claim 5, processing time can be reduced because the image itself is not rotated.

[0080] According to the invention of claim 6, identifying the document area makes it possible to narrow down the processing method applied to the area.

[0081] According to the invention of claim 7, an area can be divided on the basis of the strength of separation of the blank bands, and the area division tree can be generated so that it represents the format.

[0082] According to the invention of claim 8, identifying the closed area makes it possible to narrow down the processing method applied to the area and, in addition, to divide table areas of complex structure.

[0083] The same effects as described above are also obtained by using a medium on which is recorded a program for causing a computer to execute the functions of all or some of the means (devices) described in the above embodiment.

[0084] The processing operations of the means of the above embodiment may be realized in software, by a program running on a computer, or in hardware, by a dedicated circuit configuration without using a computer.

[0085]

[Effects of the Invention] As is apparent from the above description, the present invention has the advantage that document recognition can be performed more efficiently.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the overall configuration of the document recognition system according to the embodiment.

FIG. 2 is a diagram illustrating the area classes and hierarchical division according to the embodiment.

FIG. 3 is a diagram illustrating the blank band extracting device according to the embodiment.

FIG. 4 is a diagram illustrating the document area dividing device according to the embodiment.

FIG. 5 is a diagram illustrating the closed area dividing device according to the embodiment.

FIG. 6 is a diagram illustrating the character string judging device according to the embodiment.

FIG. 7 is a diagram illustrating the paragraph judging device according to the embodiment.

FIG. 8 is a diagram illustrating the document area structure identification device according to the embodiment.

FIG. 9 is a diagram illustrating the method of cutting out characters using character lattices in the character string area dividing device according to the embodiment.

FIG. 10 is a diagram illustrating the method of cutting out characters sequentially in the character string area dividing device according to the embodiment.

FIGS. 11(a) and 11(b) are diagrams illustrating the operation of the format identification device according to the embodiment.

[Explanation of symbols]

103 Image input device
104 Inclination detecting device
105 Character recognition device
106 Format identification device
107 Format database
108 Black pixel circumscribed rectangle extracting device
109 Document area dividing device
110 Paragraph area dividing device
111 Character string area dividing device
112 Closed area dividing device
113 Character area dividing device
401 Character width estimating device
402 Isolated ruled line extracting device
403 Character string judging device
404 Paragraph judging device
405 Document area structure identification device
406 Re-divided area generating device
501 Ruled line extracting device
502 Re-divided area generating device
801 Blank band width change point extracting device
802 Projection width change point extracting device
803 Effective blank band selecting device

Claims

[Claims]

1. A document recognition apparatus comprising: document area initializing means for generating a document area object; black pixel circumscribed rectangle extracting means for extracting circumscribed rectangles of connected black pixel components; blank band extracting means for extracting a band of white pixels in an area object as a blank band; document area dividing means for identifying and dividing a document area; paragraph area dividing means for dividing a paragraph, which is a set of character strings; character string area dividing means for dividing a character string, which is a set of characters; character area dividing means for initializing the attributes of a character area object; and closed area dividing means for identifying and dividing a closed area that cannot be divided by the blank bands, wherein each area divided by the dividing means is generated as an area object and given adjacency or inclusion relations as attributes to generate an area division tree, and wherein, when division has proceeded until no area object can be divided further, the area division tree represents the format information of the document area object.

2. The document recognition apparatus according to claim 1, wherein the area object holds position information of the area, class information indicating the type of the area, format information of the area, separator information on the separators that divide the inside of the area, adjacency and inclusion information constituting the area division tree, information on the black pixel circumscribed rectangles contained in the area, structural information indicating the geometric structure of the area, and an estimated value of the character width in the area, and wherein the character area object holds, in addition to the information held by the area object, a character recognition result obtained by character recognition means.

3. The document recognition apparatus according to claim 1, further comprising a format database for storing the format information, display means, a keyboard, and a pointing device, wherein corresponding area objects of the format information obtained from an input document image and of the format information present in the format database are compared, and, when it is determined that the format information does not exist in the format database, the new format information is registered in the format database using the keyboard and/or the pointing device on the basis of the image displayed on the display means and the corresponding format information.

4. The document recognition apparatus according to claim 1, wherein the blank band extracting means extracts blank bands by projecting the black pixel circumscribed rectangles in the area object in the horizontal and/or vertical direction.

5. A document recognition apparatus comprising inclination detecting means for detecting the inclination of the black pixel circumscribed rectangles that form part of an input document image, wherein the black pixel circumscribed rectangles are rotated to correct the inclination, and area division is performed using the rotated black pixel circumscribed rectangles.

6. The document recognition apparatus according to claim 1, wherein the document area dividing means includes character width estimating means, isolated ruled line extracting means, character string judging means, paragraph judging means, document area structure identifying means, and re-divided area generating means; the isolated ruled line extracting means searches for an isolated ruled line and, if one exists, it is selected as a separator; if no isolated ruled line exists, the character string judging means and the paragraph judging means determine whether the target area is a character string or a paragraph, and if it is, the re-divided area generating means generates a character string area object or a paragraph area object, respectively; if the target area is neither a character string nor a paragraph, the blank band to be used for division is selected as a separator; the re-divided area generating means generates at least one document area object on the basis of the selected isolated ruled line or blank band; if the target area is neither a character string nor a paragraph and no separator exists, the re-divided area generating means generates a closed area object; and the dividing means corresponding to the class of each generated object, namely the document area class, paragraph area class, character string area class, or closed area class, is called, that is, the document area dividing means, paragraph area dividing means, character string area dividing means, or closed area dividing means.

7. The document recognition apparatus according to claim 6, further comprising: blank band width change point extracting means for extracting, in each of the vertical and horizontal directions, blank band width change points at which the blank band width changes greatly; projection width change point extracting means for extracting projection width change points at which the width of the projection of the portion sandwiched between blank bands changes greatly; and effective blank band selecting means for determining the blank bands to be used for area division on the basis of the information on the blank band width change points and the projection width change points, wherein the area is divided on the basis of the strength of separation of the blank bands.

8. The document recognition apparatus according to claim 1, wherein the closed area dividing means includes ruled line extracting means and re-divided area generating means; the ruled line extracting means extracts ruled lines in the horizontal and vertical directions; if exactly one ruled line is extracted, a ruled line area object is generated; otherwise, if ruled lines exist, the re-divided area generating means generates at least one document area object using the ruled lines as separators; if no ruled line exists, character judging means determines whether the size of the area is close to the size of a character and, if so, a character area object is generated; if the area is not judged to be a character, a figure area object or a photograph area object is generated for an area larger than a character on the basis of the density of black pixels in the area, and a noise area object is generated otherwise; and the means corresponding to the class of each generated object, namely the document area class, figure area class, photograph area class, character area class, ruled line area class, or noise area class, is called, that is, the document area dividing means, figure area processing means, photograph area processing means, character area dividing means, ruled line area processing means, or noise area processing means.

9. A document recognition method for recognizing areas of document data from an input document image, comprising: extracting ruled lines from the input document image and/or extracting blank bands where no document data exists; dividing the image data into predetermined areas using the extracted ruled lines and/or blank bands; extracting ruled lines and/or blank bands from each divided area and further dividing that area using the extracted ruled lines and/or blank bands; repeating the division; generating the inclusion relations of the division as an area division tree; and outputting the area division tree as format information of the document data.

10. A medium storing a program for causing a computer to execute the functions of all or a part of each of the means described in any one of claims 1 to 8.

11. A medium on which a program for causing a computer to execute all or a part of each step according to claim 9 is recorded.