JPH11232439A - Document picture structure analysis method

Info

Publication number: JPH11232439A
Application number: JP10050130A
Authority: JP
Inventors: Toshinari Hayashi; 俊成林
Original assignee: Individual
Current assignee: Individual
Priority date: 1998-02-16
Filing date: 1998-02-16
Publication date: 1999-08-27
Also published as: WO1999041681A1

Abstract

PROBLEM TO BE SOLVED: To analyze the structure of a document image accurately and efficiently by using table of contents information when the document image is converted into an electronic document. SOLUTION: To learn the document structure of the whole document, the document images of the table of contents pages are captured first; each line is extracted as a basic rectangle, character recognition is performed, and the result is analyzed. In this step chapter/section numbers are analyzed, and the headings and the page number of each heading are extracted. The document images of the body pages are then captured: several tens of consecutive pages are input, and basic rectangles are extracted and analyzed for each page. Layout elements (headers, footers, page numbers, chapter/section headings, body text, and figures/tables) are identified from the layout features of the extracted basic rectangles. Character recognition is performed on all rectangles except those identified as figures or tables. In the matching process, for each page on which the table of contents analysis places a heading, that heading is matched against the heading candidates extracted in the body analysis, which yields more accurate heading information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to document image structure analysis, and more particularly to a document image structure analysis technique that recognizes the logical structure of a document so that the document can be digitized efficiently.

[0002]

[Description of the Related Art] With the spread of personal computers, word processors, and the like, electronic documents have become commonplace, and electronic publishing has grown because such documents are easy and efficient to edit. When an electronic document is newly created, a standard markup language such as SGML or ODA can be applied as appropriate during authoring.

[0003] On the other hand, there is also strong demand to digitize and reuse existing printed documents. A library with a very large collection, in particular, can search and consult its holdings efficiently once the books are digitized. Document image structure analysis is the technology that makes this possible. It goes beyond character recognition and region identification: from a document image read by a scanner it acquires information such as attributes (cover, table of contents, figures, and so on), chapter and section organization, and page numbers, recognizes the logical structure of the document, and stores the document efficiently.

[0004] Document image structure analysis has been studied actively, mainly in media conversion such as character recognition and graphics recognition and in layout structure analysis of document images. Reported work on layout structure analysis includes methods for document element extraction, namely the expansion/contraction method [Character region extraction algorithm for mixed-mode communication, IEICE Transactions (D), J67-D, 11, pp. 1277-1285 (1984-11)], the concatenation method [Multimedia document structure processing system, Journal of the Institute of Image Electronics Engineers of Japan, 19, 5, pp. 286-295 (1990-10)], and region segmentation based on marginal (projection) distributions and the like [Document image structure analysis by split detection, IEICE Transactions (D-II), J74-D-II, 4, pp. 491-499 (1991-04)]; a method that uses a format definition language [A document understanding method for automatic filing, IEICE Transactions (D), J71-D, 10, pp. 2050-2058 (1988-10)]; and a model-based approach [Model-based layout understanding of document images, IEICE Transactions (D-II), J75-D-II, 10, pp. 1673-1681]. These can be said to achieve the separation and extraction of document layout elements (columns, figures, tables, text lines, characters, and so on).

[0005]

[Problems to Be Solved by the Invention] As shown in FIG. 1, the "document structure" of a document such as a book or magazine consists of a "logical structure" made up of chapters and sections, a "linear structure" given by the pages, and a "reference structure" from the index to the body text. The table of contents expresses the logical structure of the document most faithfully and concisely, so analyzing it yields that logical structure. Moreover, because most tables of contents contain no figures or tables, region segmentation is easy, and analyzing the body-page images on the basis of this logical structure is expected to be more efficient. In addition, index entries serve as keywords for the book, so analyzing the index makes its entries available as keywords.

[0006] Accordingly, an object of the present invention is to provide a more accurate and efficient document image structure analysis method that uses table of contents information and index information when a document image is converted into an electronic document.

[0007]

[Means for Solving the Problems] To achieve the above object, the document image structure analysis method of the present invention first captures the document images of the table of contents pages and analyzes the resulting text in order to learn the document structure of the whole document (table of contents analysis). Next, it captures the document images of the body pages and examines their layout elements (layout analysis). It then matches the layout elements against the information obtained in the table of contents analysis, which yields an electronic document that has a logical structure, a linear structure, a reference structure, and so on (see FIG. 2).

[0008]

[Embodiments of the Invention] An outline of the processing performed in the present invention follows.
Document images are captured for the entire document (table of contents pages and body pages).
Table of contents analysis: each line is extracted as a basic rectangle, character recognition is performed, and the result is analyzed. This stage performs chapter/section number analysis, heading extraction, and extraction of the page number of each heading.
Body analysis: several tens of consecutive pages are input, and basic rectangles are extracted and analyzed for each page. Layout elements such as headers and footers, page numbers, chapter/section headings, body text, and figures/tables are identified from the layout features of the extracted basic rectangles. Character recognition is performed on every rectangle except those identified as figures or tables.
Matching: for each page on which the table of contents analysis places a heading, that heading is matched against the heading candidates extracted in the body analysis to obtain more accurate heading information.
Index analysis: basic rectangles are extracted from the index pages, the column layout is determined, and then the index entries that serve as keywords and their page numbers are extracted.
In the table of contents and body analyses, each page is analyzed on the basis of the layout features (position, width, height, indent, and so on) of the basic rectangles extracted from it.

[0009] Preferred, concrete details of the above processing are described below.

1. Basic Rectangle Extraction

As processing common to all pages of the document, the pages are read with an image scanner, one page at a time for the table of contents and one two-page spread at a time for the body, and converted into binary images. After noise removal and skew correction, the basic rectangles that form the basic units of layout elements are extracted by the following algorithm.
(1) Adjacent black-pixel regions are found by 8-connected labeling, and the minimum rectangle enclosing each connected region is obtained. Rectangles whose areas overlap are then merged into their circumscribed rectangle.
(2) For the rectangles obtained in (1), the horizontal projection distribution is computed, and neighboring rectangles separated by a blank no wider than a threshold are merged, which yields rectangles corresponding to character strings and lines.
Through this processing, the text lines, figures, tables, and so on of each page are extracted as independent basic rectangles.
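The two merging steps can be illustrated with a minimal Python sketch. It assumes the connected components have already been labeled and reduced to bounding boxes, and it replaces the projection-profile computation of step (2) with a direct test of the horizontal gap between boxes that share rows. The names `Rect`, `near`, and `merge_rects` are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Rect = Tuple[int, int, int, int]


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def near(a: Rect, b: Rect, gap: int) -> bool:
    """True if a and b share some rows and are at most `gap` pixels apart horizontally."""
    share_rows = a[1] < b[3] and b[1] < a[3]
    close_cols = a[0] - b[2] <= gap and b[0] - a[2] <= gap
    return share_rows and close_cols


def merge_rects(rects: List[Rect], gap: int = 0) -> List[Rect]:
    """Repeatedly merge rectangles that are `near` each other until nothing changes."""
    rects = list(rects)
    changed = True
    while changed:
        changed = False
        merged: List[Rect] = []
        for r in rects:
            for i, m in enumerate(merged):
                if near(r, m, gap):
                    merged[i] = union(r, m)
                    changed = True
                    break
            else:
                merged.append(r)
        rects = merged
    return rects


# Step (1): merge overlapping (or touching) component boxes.
components = [(0, 0, 10, 10), (8, 2, 12, 8), (15, 0, 25, 10), (0, 30, 10, 40)]
glyphs = merge_rects(components, gap=0)
# Step (2): merge boxes on the same line whose gap is within the threshold.
lines = merge_rects(glyphs, gap=5)
```

The threshold of 5 pixels is arbitrary here; in practice it would depend on the scan resolution and character size.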

[0010] 2. Table of Contents Analysis

There are several kinds of table of contents layout. Here we assume the one most commonly used in documents, shown on the left side of Table 1, in which each line gives one chapter or section. In this case, each line takes one of the formats shown on the right side of Table 1.

[0011] 2.1 Structural analysis of the table of contents

In general, the table of contents of a document contains no figures or tables. After basic rectangle extraction, the image is therefore cut out line by line, character recognition is performed on each line, and the result is converted into character codes. Chapter/section numbers, page numbers, and headings are then extracted by the following processing.

[0012] 2.2 Extracting chapter/section numbers

On the premise that no heading consists entirely of digits (a digits-only heading is not expected), the first few characters of each line are taken, only the digits among them are extracted, and the non-digit characters are removed (for example, the "第" and "章" of "第?章" ("Chapter ?"), or the separator characters "-" and "." in "1-1" and "1.1"). This processing excludes headings that have no chapter/section number, and the result follows the patterns shown in Table 2.
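As a rough illustration of this step, the following Python sketch keeps only the digits found in the first few characters of an OCR'd table of contents line. The prefix length of 6 and the NFKC normalization of full-width digits are assumptions made for the example; the patent only says "the first few characters". The function name `extract_section_digits` is hypothetical.

```python
import unicodedata

ASCII_DIGITS = "0123456789"


def extract_section_digits(line: str, prefix_len: int = 6) -> str:
    """Return the digits found in the first `prefix_len` characters of a line.

    Full-width digits are folded to ASCII first (NFKC), and everything that is
    not a digit, such as "第", "章", "-" or ".", is dropped.
    """
    head = unicodedata.normalize("NFKC", line.strip())[:prefix_len]
    return "".join(ch for ch in head if ch in ASCII_DIGITS)


for toc_line in ["第3章 Layout analysis", "1.1 Overview, 12", "Preface ..... 3"]:
    print(repr(toc_line), "->", repr(extract_section_digits(toc_line)))
```

A heading that itself begins with a digit can leak into the result when it falls inside the prefix window, which is one of the disturbances the later correction step is meant to absorb.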

[0013] 2.3 Identifying page numbers

As in the chapter/section number analysis, a fixed number of characters is taken from the end of each line, only the digits are extracted, and they are taken as the page number. Hereinafter, the chapter/section number, heading, and page number of each table of contents line are called a headline set. For a line without a page number, the page number of the line below is used as the page number of that heading. The character codes of separators (",", ".", spaces, and so on) used between the heading part and the page number part of each headline set are removed. Through this processing, the table of contents takes the format shown on the right side of Table 1.
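The page number step and the rule for lines without a page number might look like the sketch below. The tail length of 5 characters and the separator set are assumptions, and `split_page_number` and `fill_missing_pages` are hypothetical names used only for illustration.

```python
from typing import List, Optional, Tuple

ASCII_DIGITS = "0123456789"
SEPARATORS = " .,\u3000"  # space, period, comma, ideographic space (assumed set)


def split_page_number(line: str, tail_len: int = 5) -> Tuple[str, Optional[int]]:
    """Split a table of contents line into (remaining text, page number or None)."""
    text = line.strip()
    digits = "".join(ch for ch in text[-tail_len:] if ch in ASCII_DIGITS)
    if not digits:
        return text, None
    # Drop the trailing digits and the separators that preceded them.
    rest = text.rstrip(ASCII_DIGITS).rstrip(SEPARATORS)
    return rest, int(digits)


def fill_missing_pages(pages: List[Optional[int]]) -> List[Optional[int]]:
    """Give each line without a page number the page number of the next line that has one."""
    filled = list(pages)
    next_page: Optional[int] = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_page
        else:
            next_page = filled[i]
    return filled


toc = ["1 Introduction", "1.1 Background ..... 3", "1.2 Scope, 5"]
parsed = [split_page_number(line) for line in toc]
pages = fill_missing_pages([page for _, page in parsed])
```

Walking the list from the bottom up keeps the back-fill to a single pass; a trailing line with no page number and no successor simply stays `None`.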

[0014]

[Table 1]

[0015]

[Table 2]

[0016] 2.4 Correcting section numbers

For section number correction, Yamada [Method for converting document images into ODA logically structured documents, IEICE Transactions (D-II), J76-D-II, 11, pp. 2274-2284 (1993-11)] showed that a section number at a given level m of the hierarchy can be corrected by a matching technique that uses primary and secondary derived numbers. However, that matching can fail when separators such as "-" or "." are dropped or misrecognized during character recognition. To avoid this problem, the separator characters are removed when section numbers are extracted. For a section number snr at level m of the hierarchy, the section numbers that can be expected next reach level m+1 at most, as shown in FIG. 5. These are called the primary derived numbers of snr. The set of all section numbers that can follow a primary derived number is called the secondary derived numbers of snr, and n-th derived numbers are defined in the same way. Because the extracted chapter/section numbers may be corrupted by character misrecognition or by digits that belong to the heading, such disturbances are assumed to occur only sporadically, and the following correction is performed.

[0017]
(1) The first number in the section number list is taken as the starting point of the analysis (snr_0); this number is always 0 or 1.
(2) Let the current starting point be snr_i, let snr_{i+1} be the section number of the next table of contents line that has one, and let snr_{i+2} be the section number of the line with a section number after that.
(3) If snr_{i+1} matches one of the primary derived numbers of snr_i, snr_{i+1} becomes the new starting point and step (2) is repeated.
(4) If snr_{i+1} matches none of the primary derived numbers of snr_i, the following is done.
(Case 1) snr_{i+1} contains a character recognition error but still has a digit part: snr_{i+2} is compared with the secondary derived numbers of snr_i. If the match succeeds, snr_{i+2} becomes the new starting point and processing returns to (2). snr_{i+1} is corrected at this time.
(Case 2) A character recognition error left no digits in the section number part of a line: if the table of contents list positions of snr_{i+1} and snr_i differ by 2 and snr_{i+1} matches a primary derived number of snr_i, snr_{i+1} becomes the new starting point and the section number is added to the preceding headline set.
(5) If both cases in (4) fail, the whole correction process is regarded as having failed. Even then, the only consequence is that section numbers cannot be extracted, and the remaining processing continues.
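The following Python sketch shows one way steps (1) to (5) could be realized for Case 1. Because FIG. 5 is not reproduced here, the definition of primary derived numbers is an assumption: section numbers are digit strings with one digit per level (separators already removed), and the numbers expected after snr are its first child (snr + "1") and the next number at the same or any higher level. Case 2 (lines with no digits at all) is left out. The names `primary` and `correct_section_numbers` are hypothetical.

```python
from typing import List, Optional, Set


def primary(snr: str) -> Set[str]:
    """Assumed primary derived numbers of a separator-free section number.

    "12" (i.e. 1.2) may be followed by "121" (1.2.1), "13" (1.3) or "2".
    Each level is assumed to be a single digit.
    """
    derived = {snr + "1"}  # first subsection, one level deeper
    for k in range(len(snr)):
        derived.add(snr[:k] + str(int(snr[k]) + 1))  # next number at level k + 1
    return derived


def correct_section_numbers(numbers: List[str]) -> Optional[List[str]]:
    """Return a corrected copy of `numbers`, or None if correction fails."""
    if not numbers or numbers[0] not in ("0", "1"):
        return None  # step (1): the list must start at 0 or 1
    fixed = list(numbers)
    i = 0
    while i + 1 < len(fixed):
        if fixed[i + 1] in primary(fixed[i]):
            i += 1  # step (3): the next number is plausible
            continue
        if i + 2 >= len(fixed):
            return None  # no later number to confirm against
        # Case 1: snr_{i+2} must be a secondary derived number of snr_i,
        # i.e. reachable through some primary derived number p.
        bridges = sorted(p for p in primary(fixed[i]) if fixed[i + 2] in primary(p))
        if not bridges:
            return None  # step (5): correction failed
        if len(bridges) == 1:
            fixed[i + 1] = bridges[0]  # unambiguous repair of snr_{i+1}
        i += 2  # snr_{i+2} becomes the new starting point
    return fixed


print(correct_section_numbers(["1", "11", "17", "13", "2"]))
```

In this sketch a misread number is rewritten only when exactly one intermediate number is consistent with both neighbors; when several are possible the OCR reading is left as it is, which is a design choice of the example rather than something the patent specifies.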

[0018] 3. Body Layout Analysis

3.1 Processing overview

The body pages of a document contain layout elements such as headings, body text, figures and tables (including photographs), headers and footers, and page numbers. A body page is therefore broadly composed of heading regions, body text regions, figure/table/photograph regions, header/footer regions, and page number regions. These regions are delimited by ruled lines (field separators) or by blank areas. For region segmentation, the image is first input from a scanner and preprocessed (binarization, noise removal, and so on), after which rectangles are extracted. Each rectangle is then identified using the features computed from it. The processing is described in detail below, following the processing flow.

[0019] 3.2 Image input and feature extraction

From a document with multiple consecutive pages, each two-page spread is input as one image. After binarization and noise removal, the bounding rectangle enclosing all black pixels on the sheet is found for the left and right pages separately, and its upper-left corner is taken as the coordinate origin (0, 0). This is necessary because one of the basic rectangle features described later uses relative coordinates, so the printed areas must be aligned. After basic rectangles are extracted for each of the left and right pages, six features are taken from each rectangle as shown in FIG. 3: the upper-left coordinates (x, y), width, height, indent (alignment), and the spacing to the lines above and below. The indent is the displacement from the left-edge reference position of the image to the left edge of the rectangle. The line spacing is obtained by sorting the extracted rectangles and measuring the distance to the rectangles above and below. Character recognition is also performed on each rectangle and the result is kept as character codes. A list of basic rectangles is created for each page; this is hereinafter called the body list.

[0020] 3.3 Extracting page numbers and headers/footers

Page numbers and headers/footers are always on the top or bottom line of a page, so the basic rectangles on the top and bottom lines of each page are taken as candidates. Headers/footers and page numbers are extracted by the methods described below.

[0021] 3.3.1 Extracting headers and footers

Headers and footers appear on every page. Their content is not necessarily identical from page to page, but they are always placed on the top or bottom line of the page. In addition, their character size, that is, the height of their basic rectangles, is the same, and their length is greater than that of the basic rectangle of a page number but less than the width of the bounding rectangle of the page. Based on these conditions, headers and footers are extracted as follows.

[0022] Using the upper-right coordinates among the features, the basic rectangles on the top and bottom lines of each of the consecutive pages are taken out, height statistics are compiled separately for the top and the bottom, and the rectangles with the most frequent height become candidates.
From all candidates, those whose length is at or below a certain value (page number rectangles) and those as wide as the bounding rectangle of the page (body text rectangles) are removed.
The remaining top-line and bottom-line candidates are counted separately, and the larger group is taken as the header/footer candidates. The basic rectangles of the header/footer candidates obtained in this way are removed from the body list.
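The selection rule can be sketched in Python as follows. The example assumes that each page is already a list of basic rectangles with coordinates relative to the page bounding rectangle, and it compares heights for exact equality, whereas real scans would need a tolerance. `Box`, `header_footer_candidates`, and the `min_width` parameter (standing in for the "certain value" that separates page numbers from headers) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    x: int
    y: int
    width: int
    height: int


def header_footer_candidates(pages: List[List[Box]], page_width: int, min_width: int) -> List[Box]:
    """Pick header/footer candidates from the top and bottom rectangles of each page."""
    tops = [min(page, key=lambda b: b.y) for page in pages if page]
    bottoms = [max(page, key=lambda b: b.y) for page in pages if page]

    def keep(boxes: List[Box]) -> List[Box]:
        if not boxes:
            return []
        # The most frequent height among the candidates wins.
        common_height = Counter(b.height for b in boxes).most_common(1)[0][0]
        # Drop short boxes (page numbers) and full-width boxes (body text).
        return [
            b for b in boxes
            if b.height == common_height and min_width < b.width < page_width
        ]

    top_kept, bottom_kept = keep(tops), keep(bottoms)
    # Whichever side has more surviving candidates is taken as the header/footer side.
    return top_kept if len(top_kept) >= len(bottom_kept) else bottom_kept


sample_pages = [
    [Box(0, 0, 200, 12), Box(0, 40, 600, 16), Box(0, 60, 600, 16), Box(290, 900, 20, 10)]
    for _ in range(3)
]
headers = header_footer_candidates(sample_pages, page_width=600, min_width=50)
```

In the sample, the bottom rectangles are narrow page numbers and are filtered out, so the three top rectangles are returned as header candidates.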

[0023] 3.3.2 Extracting page numbers

As in header/footer extraction, the basic rectangles on the top and bottom lines are examined, and those that have equal height and numeric content are taken as page number candidates. In some documents the page number sits next to the header or footer and may end up in the same basic rectangle when rectangles are cut out. If page numbers cannot be extracted by the above method, the outermost number in each header/footer candidate (the leading string on a left page, the trailing string on a right page) is taken as the page number candidate. Page numbers may be wrong because of character recognition errors, but they run in consecutive order. Taking the page number N of some page as the reference, the previous page is expected to be N-1 and the next page N+1, and matching is carried out through the last page. If the matching rate is 70% or higher, the sequence is accepted as the correct page numbering, and page numbers that do not agree with it are corrected. Finally, for pages from which no page number was extracted, the basic rectangle at the same position as the page numbers on the other pages is taken as the page number candidate, and all page number candidates are removed from the body list.
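The consecutive-numbering check can be illustrated with the Python sketch below. The patent describes taking one page as the reference; this example tries every recognized page as the reference and keeps the best-fitting sequence, which is an assumption, as is the function name `correct_page_numbers`. Pages with no recognized number are represented by `None`.

```python
from typing import List, Optional


def correct_page_numbers(read: List[Optional[int]], min_rate: float = 0.7) -> Optional[List[int]]:
    """Fit a consecutive numbering to OCR'd page numbers.

    Returns the corrected sequence if at least `min_rate` of the pages agree
    with it, otherwise None.
    """
    best_offset: Optional[int] = None
    best_hits = 0
    for i, value in enumerate(read):
        if value is None:
            continue
        offset = value - i  # page i is assumed to carry number i + offset
        hits = sum(1 for j, v in enumerate(read) if v == j + offset)
        if hits > best_hits:
            best_offset, best_hits = offset, hits
    if best_offset is None or best_hits / len(read) < min_rate:
        return None
    return [j + best_offset for j in range(len(read))]


ocr_numbers = [10, 11, 72, 13, None, 15, 16, 17, 18, 19]
print(correct_page_numbers(ocr_numbers))
```

In the sample, eight of the ten pages agree with a numbering that starts at 10, so the misread "72" and the missing value are replaced by 12 and 14.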

[0024] 3.4 Identifying body text

Body text lines are the most numerous rectangles in a document. Statistics of height and line spacing are therefore compiled from the features of all basic rectangles in the body list, and the most frequent kind of basic rectangle is taken as the body text candidate. These basic rectangles come in three patterns, as shown in FIG. 4, so a rectangle that satisfies one of the following conditions is taken as a body text candidate.
Pattern A: indented, and extending to the right edge of the body text area.
Pattern B: the same width as the bounding rectangle of the page.
Pattern C: not indented.
A rectangle matching Pattern B or Pattern C is accepted only if the preceding basic rectangle belongs to Pattern A or Pattern B. For the first basic rectangle on a page, the preceding rectangle is the last basic rectangle of the previous page. After this processing, the rectangles recognized as body text are removed from the body list.

[0025] 3.5 Identifying headings

All basic rectangles remaining in the body list are treated as heading candidates. Some of these are body text that failed body text identification, but they are ultimately sorted out by matching against the headings.

[0026] 4. Matching

4.1 Matching table of contents headings with body headings

In the matching process, among the heading candidates obtained from body layout analysis, the basic rectangle whose text agrees best with a heading text obtained from table of contents analysis is taken as that heading. Specifically, one headline set (chapter/section number, heading, and page number) is taken from the headline set list created from the table of contents and compared with the heading candidates on the page with that page number. The candidate with the highest match rate is taken as the heading. For the comparison, the character strings obtained by character recognition are used for the chapter/section number and heading of the headline set. The match rate is calculated as follows.
(number of matching characters between the two compared strings) / (number of characters in the longer string)
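A small Python sketch of the match rate follows. The patent does not say how "matching characters" are counted, so the example assumes they are the characters covered by the matching blocks that `difflib.SequenceMatcher` finds, which tolerates a few substituted characters from OCR. `match_rate` and `best_heading` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import List


def match_rate(a: str, b: str) -> float:
    """(number of matching characters) / (length of the longer string)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / longer


def best_heading(toc_heading: str, candidates: List[str]) -> str:
    """Return the candidate on the page that agrees best with the table of contents heading."""
    return max(candidates, key=lambda c: match_rate(toc_heading, c))


candidates_on_page = ["1l Backgr0und", "Figure 3 shows the result", "the following steps"]
print(best_heading("11 Background", candidates_on_page))
```

Dividing by the length of the longer string penalizes candidates that merely contain the heading inside a long body line, which is useful because leftover body text is among the heading candidates.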

[0027] 5. Index Analysis

5.1 Column determination

After basic rectangles are extracted from the index pages, the column layout is determined for the following two types of index page.
1. With a separator: when a separator divides the columns, the columns can be separated by examining the coordinates of the separator.
2. Without a separator: a histogram of the whole page is first taken along the Y-coordinate direction, and the page is divided into lines according to density. Character recognition is then performed line by line, and the result is checked to see whether each part is text or digits. Non-numeric string regions are taken as the text part of the index information and numeric parts as page numbers. The columns of the index page are separated in this way.

5.2 Index analysis

Because this method analyzes the index line by line, the regions in the same column are grouped together, and the grouped columns are sorted in order from the highest. Each column region is then split, line by line, into the text that forms the index entry and the digits that form the page numbers, and an index information list is created.
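Splitting one recognized index line into its entry text and page numbers could look like the Python sketch below. It assumes that page numbers sit at the end of the line, separated by commas or spaces, and that leader dots precede them. `parse_index_line` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Trailing run of digits, whitespace and commas at the end of a line.
_TAIL = re.compile(r"[\d\s,]+$")


def parse_index_line(line: str) -> Tuple[str, List[int]]:
    """Split an index line into (entry text, list of page numbers)."""
    text = line.strip()
    m = _TAIL.search(text)
    if m is None or not any(ch.isdigit() for ch in m.group()):
        return text, []  # no page numbers recognized on this line
    pages = [int(p) for p in re.findall(r"\d+", m.group())]
    term = text[: m.start()].rstrip(" .,\u3000")  # drop leader dots and separators
    return term, pages


index_lines = ["layout analysis ..... 12, 45", "SGML 7", "OCR"]
index_info = [parse_index_line(line) for line in index_lines]
```

An entry whose own text ends in a number (a standard number, for example) would be misread as page numbers by this rule, so a real implementation would need layout cues such as the gap before the page number column.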

[0028]

[Effects of the Invention] As described above, the present invention makes it possible to analyze automatically the logical structure of the documents to be stored. Entering secondary information from a keyboard becomes unnecessary or is greatly reduced, so input is greatly simplified. In addition, because the logical structure of the document has been analyzed, it becomes easy to build a large hypertext system.

[Brief description of the drawings]

FIG. 1 shows the document structure of a document.

FIG. 2 is a processing block diagram of the present invention.

FIG. 3 shows the features of a basic rectangle.

FIG. 4 shows the basic rectangle patterns A, B, and C of body text.

FIG. 5 is an explanatory diagram of chapter/section number correction.

FIG. 6A is an explanatory diagram of chapter/section number correction.

FIG. 6B is an explanatory diagram of chapter/section number correction.

Claims

[Claims]

1. A document image structure analysis method wherein, when an electronic document is created from a document having at least table of contents pages and body pages, the document images of the table of contents pages and the body pages are captured, the document structure of the whole document shown in the table of contents pages is analyzed, layout elements are separated and extracted by layout analysis of the body pages, and the layout elements of the body pages are matched using the document structure information of the whole document obtained from the table of contents analysis.

2. The document image structure analysis method according to claim 1, wherein, when the document structure of the whole document shown in the table of contents is analyzed, chapter/section numbers are corrected to compensate for character recognition errors in the chapter/section numbers.

3. A document image analysis system which, when digitizing a document that includes an index, analyzes the index information, separates the index entries from their page numbers, and uses the index entries as keywords when searching for a book.