JPH1166196A

JPH1166196A - Document image recognition device and computer-readable recording medium where program allowing computer to function as same device is recorded

Info

Publication number: JPH1166196A
Application number: JP9220426A
Authority: JP
Inventors: Takashi Saito; 高志齋藤; Tei Abe; 悌阿部; Tsukasa Kouchi; 司幸地
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-08-15
Filing date: 1997-08-15
Publication date: 1999-03-09

Abstract

PROBLEM TO BE SOLVED: To generate documents in various formats suited to the intended use, such as a document that prioritizes reproduction of the original and a document that emphasizes content. SOLUTION: The device comprises a document image input part 200 which inputs a document image generated by optically reading a paper document; a preprocessing part 201 which performs noise removal and skew correction on the input document image; an information extraction part 204 which recognizes and extracts a character area including character strings and/or an image area including images such as figures, tables, and photographs, performs character recognition on the character strings in the extracted character area, and analyzes the layout of the document image to extract layout information; a document generation part 210 which generates a PostScript document and an HTML document according to the character recognition result and the layout information extraction result; and a database part 213 which stores the generated PostScript document and HTML document respectively.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a document image recognition device that goes beyond simple character recognition, that is, beyond extracting character codes from a document image obtained by optically reading a paper document. By extracting and using the various kinds of information a paper document carries, the device can generate computer-usable documents in forms suited to the intended purpose, such as a document that prioritizes reproduction of the paper document or a document that emphasizes its content. The invention also relates to a computer-readable recording medium on which a program for causing a computer to function as the device is recorded.

[0002]

2. Description of the Related Art: The rapid spread of personal computers and networks has promoted the digitization of documents in offices. However, because paper offers conveniences such as good readability, the production of paper documents has not ceased, and a large volume of documents is still stored and distributed on paper. This gap is one cause of reduced productivity in office work.

[0003] In other words, throughout the series of steps of creating, distributing, browsing, managing, and reusing documents, documents exist both in the form recorded on paper and in the form of data on a computer. They continue to exist in both forms because the cost of converting between the two is high.

[0004] One means of solving this problem is OCR (Optical Character Reader). With OCR, a paper document is read by a scanner or the like to generate a document image, and the character strings in the document image can then be converted into character codes.

[0005]

Problems to be Solved by the Invention: Although OCR can convert the character strings in a document image into character code information, it cannot extract the layout and similar properties of the original paper document. A document generated by OCR is therefore a mere sequence of text, and the various information contained in the original document image cannot be fully used. In other words, documents are laid out in various ways according to the author's intent, but OCR alone cannot extract that layout. It has therefore not been possible to generate a new document that reproduces the original layout, or to generate a document with a new layout that makes use of the original one.

[0006] The present invention has been made in view of the above. Its object is to go beyond simple character recognition, that is, extracting character codes from a document image obtained by optically reading a paper document, and, by extracting and using the various kinds of information a paper document carries, to make it possible to generate computer-usable documents in forms suited to the intended purpose, such as a document that prioritizes reproduction of the paper document or a document that emphasizes its content.

[0007]

Means for Solving the Problems: To achieve the above object, the document image recognition device of claim 1 comprises: input means for inputting a document image generated by optically reading a paper document; region extracting means for recognizing and extracting, from the document image input through the input means, a character region including character strings and/or an image region including images such as figures, tables, and photographs; character recognition means for performing character recognition on the character strings of the character region extracted by the region extracting means; layout information extracting means for analyzing the layout of the document image based on the extraction result of the region extracting means and extracting layout information; first document generating means for generating a first document using a page description language, based on the character recognition result from the character recognition means and the layout information extraction result from the layout information extracting means; second document generating means for generating a second document using a structured description language, based on the same character recognition result and layout information extraction result; and storage means for storing the first and second documents generated by the first and second document generating means, respectively.

[0008] The document image recognition device of claim 2 is the device of claim 1, wherein the first document is a document expressed in PostScript format or PDF format.

[0009] The document image recognition device of claim 3 is the device of claim 1, wherein the second document is a document expressed in SGML, HTML, or XML.

[0010] Further, the computer-readable recording medium of claim 4 records a program for causing a computer to function as each means of the document image recognition device according to any one of claims 1 to 3.

[0011]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS: Embodiments of the document image recognition device of the present invention, and of a computer-readable recording medium recording a program for causing a computer to function as the device, are described in detail below with reference to the accompanying drawings.

[0012] [First Embodiment] The document image recognition device of the first embodiment receives a document image generated by optically reading a paper document and, based on that image, generates documents that can be used on a computer. In office work and similar settings, documents pass through a series of steps: they are produced, distributed, browsed, managed, and used. The device of the first embodiment can therefore generate both a document designed for distribution and browsing and a document designed for use.

[0013] Here, a document designed for distribution and browsing is one intended to reproduce the original paper document faithfully, down to its layout information, so that the author's intent is conveyed as accurately as possible (a document that preserves WYSIWYG, with the paper document as its metaphor). To reproduce a paper document faithfully as a document on a computer, the device of the first embodiment generates the document in a page description language, which can describe text and images in a uniform way. In the first embodiment, PostScript is used as the page description language, and a document generated with PostScript is hereinafter called a PostScript document. PDF (Portable Document Format, a file format published by Adobe Systems) may be used in place of PostScript.

[0014] A document designed for use, by contrast, is one that is not bound by the layout of the original paper document and takes a computer-native form that prioritizes content. To generate such a document, the device of the first embodiment expresses it in a structured description language such as SGML (Standard Generalized Markup Language), HTML (Hypertext Markup Language), or XML (Extensible Markup Language), thereby converting a paper document containing a mix of text and images into hypertext. In the first embodiment, HTML is used as the structured description language, and a document generated with HTML is hereinafter called an HTML document.

[0015] FIG. 1 is a block diagram showing the system configuration of the document image recognition device of the first embodiment. The device shown in FIG. 1 comprises: a color scanner 100, a monochrome scanner 101, and a network scanner 102, each of which optically reads a paper document to generate a document image; a fax modem 104 that receives document images transmitted from a facsimile machine 103; a document image processing server 105a that receives document images from the color scanner 100, monochrome scanner 101, network scanner 102, and fax modem 104 and generates a PostScript document and an HTML document from each input image; a digital multifunction peripheral 106 that optically reads a paper document to generate a document image; a document image processing server 105b that receives document images from the digital multifunction peripheral 106 and generates a PostScript document and an HTML document from each input image; and a search server 108 that receives the documents generated by the document image processing servers 105a and 105b (hereinafter collectively "document image processing server 105"), stores them in a database, and, in response to a search request from a client 107, retrieves and outputs the matching documents. Reference numeral 109 denotes a network such as a LAN.

[0016] FIG. 2 is a conceptual block diagram of the document image recognition device shown in FIG. 1. The device of the first embodiment broadly consists of a document image input unit 200, a preprocessing unit 201, an information extraction unit 204, a document generation unit 210, and a database unit 213. The document image input unit 200, preprocessing unit 201, information extraction unit 204, and document generation unit 210 correspond to the document image processing server 105 shown in FIG. 1, and the database unit 213 corresponds to the search server 108 shown in FIG. 1. The configuration of each unit is described below.

[0017] (1) Document Image Input Unit. The document image input unit 200 receives document images generated by the color scanner 100, monochrome scanner 101, network scanner 102, and digital multifunction peripheral 106, as well as document images received by the fax modem 104. It can also receive document files created by application programs such as word processors.

[0018] (2) Preprocessing Unit. The preprocessing unit 201 has a noise removal unit 202, which receives the document image via the document image input unit 200 and removes isolated-point noise from it, and a skew correction unit 203, which corrects the skew when the input document image is tilted. By removing noise and correcting skew, the preprocessing unit 201 eliminates factors that would adversely affect the region division and character recognition performed later. When the input document image is full color, the image may additionally be binarized so that region division and subsequent processing can be performed easily and quickly.
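The isolated-point noise removal and optional binarization described for the preprocessing unit can be illustrated with a minimal Python sketch. The 8-neighbour rule, the fixed threshold of 128, the grayscale input, and the function names `remove_isolated_points` and `binarize` are assumptions made for illustration; the source does not specify the actual algorithms.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = black pixel, 0 = white pixel


def binarize(gray: List[List[int]], threshold: int = 128) -> Image:
    """Map grayscale values (0..255) to 1 (black) below the threshold, else 0.

    The threshold is a hypothetical fixed value, not taken from the source.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def remove_isolated_points(img: Image) -> Image:
    """Return a copy of img in which black pixels with no black 8-neighbour are cleared."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1:
                continue
            has_black_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 0  # a lone black pixel is treated as noise
    return out
```

Skew correction is omitted here because the source gives no detail on how the skew angle is estimated.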

[0019] (3) Information Extraction Unit. The information extraction unit 204 has: a region division/information extraction unit 205 that receives the document image from the preprocessing unit 201, recognizes and divides it into character regions containing character strings and/or image regions containing images such as figures, tables, and photographs, and performs layout information extraction such as identifying which column type the document image uses (for example single-column, multi-column, or free-column) and detecting centered regions; a table processing unit 206 that, when an image region divided by the region division/information extraction unit 205 contains a table, extracts the structure of the table's frame and ruled lines and extracts the character regions inside the frame; an OCR unit 207 that performs character recognition on the character strings of the character regions divided by the region division/information extraction unit 205 and the table processing unit 206; a font identification unit 208 that identifies whether the font of the character strings in the character regions divided by the region division/information extraction unit 205 is an emphasized type (such as medium Gothic) or a non-emphasized type (such as light Mincho); and a layout/logic analysis unit 209 that, based on the results from the above units, analyzes the layout of the document image and the logical structure of the document.

[0020] (4) Document Generation Unit. The document generation unit 210 has a PostScript document generation unit 211 and an HTML document generation unit 212, which generate a PostScript document and an HTML document, respectively, based on the character recognition result from the OCR unit 207 and the analysis result from the layout/logic analysis unit 209.

[0021] (5) Database (DB). The database 213 has a PostScript document DB 214, which stores the PostScript documents generated by the PostScript document generation unit 211, and an HTML document DB 215, which stores the HTML documents generated by the HTML document generation unit 212.

[0022] Next, the operation of the document image recognition device configured as above is described in detail. FIG. 3 is a flowchart showing the operating procedure of the document image recognition device.

[0023] The document image input unit 200 receives document images generated by the color scanner 100, monochrome scanner 101, network scanner 102, and digital multifunction peripheral 106, as well as document images received by the fax modem 104 (S301). Document images, and document files created by application programs such as word processors, can also be input from the client 107.

[0024] FIG. 4 is an explanatory diagram showing an example of a document image, input via the document image input unit 200, displayed on screen. In FIG. 4, reference numeral 400 denotes a control screen for document image recognition processing, and 401 denotes a display screen showing the document image input via the document image input unit 200.

[0025] On the control screen 400 shown in FIG. 4, the user can specify which of the processes described below to execute and configure their detailed settings, and can specify generation of a PostScript document, an HTML document, or both. Although FIG. 4 shows a control screen 400 on which a document image is input and the execution of each process is specified, the device can also be configured to run automatically through document generation and storage according to preset conditions.

[0026] Next, the noise removal unit 202 receives the document image via the document image input unit 200 and removes isolated-point noise from it (S302). If the input document image is tilted, the skew correction unit 203 corrects the skew (S303).

[0027] The region division/information extraction unit 205 receives the document image from the preprocessing unit 201, which consists of the noise removal unit 202 and the skew correction unit 203, and recognizes and divides it into character regions containing character strings and/or image regions containing images such as figures, tables, and photographs (S304). Each divided region is given attribute information: the region's type (character region or image region and, for an image region, whether it is a figure, table, or photograph) and the region's position.
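The attribute information attached to each divided region (its type and position) could be represented as in the following Python sketch. The class names `RegionKind` and `Region`, the bounding-box fields, and the helper `contains_table` are hypothetical; the source states only that a type and a position are recorded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RegionKind(Enum):
    TEXT = "text"      # character region
    FIGURE = "figure"  # image region: figure
    TABLE = "table"    # image region: table
    PHOTO = "photo"    # image region: photograph


@dataclass(frozen=True)
class Region:
    kind: RegionKind
    left: int
    top: int
    right: int
    bottom: int

    @property
    def is_image(self) -> bool:
        """Figures, tables, and photographs all count as image regions."""
        return self.kind is not RegionKind.TEXT


def contains_table(regions: List[Region]) -> bool:
    """Check the attribute information for any region marked as a table."""
    return any(region.kind is RegionKind.TABLE for region in regions)
```

A check such as `contains_table` corresponds to the kind of attribute-based decision the table processing unit makes at step S306.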

[0028] After region division, the region division/information extraction unit 205 extracts layout information, for example by identifying which column type the document image uses (single-column, multi-column, or free-column) and by detecting centered regions (S305).

[0029] Specifically, the distances (blank spaces) between character regions and the ruled lines are detected, and the column type is determined from the detected distances and the number of ruled lines, together with the position attributes of the character regions obtained by region division.
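As a rough illustration of the column-type decision, the following Python sketch classifies a page from the horizontal gaps between character regions alone. It is a simplified heuristic under stated assumptions, not the method of the source: ruled lines are ignored, regions spanning the full page width (such as a title) are assumed to have been excluded beforehand, and the names `column_bands` and `classify_columns` and the thresholds `min_gap` and `width_tolerance` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def column_bands(boxes: List[Box], min_gap: int) -> List[Tuple[int, int]]:
    """Merge the horizontal extents of boxes; extents closer than min_gap share a band."""
    bands: List[List[int]] = []
    for left, _top, right, _bottom in sorted(boxes):
        if bands and left - bands[-1][1] < min_gap:
            bands[-1][1] = max(bands[-1][1], right)
        else:
            bands.append([left, right])
    return [(band[0], band[1]) for band in bands]


def classify_columns(boxes: List[Box], min_gap: int = 20,
                     width_tolerance: float = 0.2) -> str:
    """Return 'single', 'multi', or 'free' from the widths of the column bands."""
    bands = column_bands(boxes, min_gap)
    if len(bands) <= 1:
        return "single"
    widths = [right - left for left, right in bands]
    mean_width = sum(widths) / len(widths)
    # Columns of roughly equal width are treated as a regular multi-column layout.
    if all(abs(w - mean_width) <= width_tolerance * mean_width for w in widths):
        return "multi"
    return "free"
```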

[0030] The table processing unit 206 then determines, from the attribute information of each region divided by the region division/information extraction unit 205, whether a region containing a table exists (S306). This determination may be made automatically by the device as described, or the user may specify it. If no table is included, processing proceeds to step S308.

[0031] If a table is included, the table processing unit 206 extracts the structure of the table's frame and ruled lines and extracts the character regions inside the frame (S307). Extracting the character regions from the table in this way allows the OCR unit 207 to perform character recognition on the table's contents in the next step.

[0032] The OCR unit 207 then performs character recognition on the character strings of the character regions divided by the region division/information extraction unit 205 and the table processing unit 206 (S308). That is, for each character region the OCR unit 207 segments lines and then characters, and performs character recognition on each segmented character pattern. In addition, the OCR unit 207 applies error correction based on language processing to the recognition result.
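The line and character segmentation step can be sketched with projection profiles, a common technique that is swapped in here as an assumption; the source does not say how the OCR unit 207 performs segmentation. The function names `nonzero_runs`, `cut_lines`, and `cut_characters` are hypothetical, and horizontal writing is assumed.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = black pixel, 0 = white pixel


def nonzero_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def cut_lines(img: Image) -> List[Tuple[int, int]]:
    """Segment text lines as runs of rows that contain at least one black pixel."""
    return nonzero_runs([sum(row) for row in img])


def cut_characters(img: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Segment characters in one line as runs of columns containing black pixels."""
    top, bottom = line
    width = len(img[0]) if img else 0
    profile = [sum(img[y][x] for y in range(top, bottom)) for x in range(width)]
    return nonzero_runs(profile)
```

Touching or broken characters would defeat this simple column-gap rule, so a practical system would need additional handling that is outside the scope of this sketch.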

[0033] The font identification unit 208 identifies, line by line, whether the font of the character strings in the character regions divided by the region division/information extraction unit 205 is an emphasized type (such as medium Gothic) or a non-emphasized type (such as light Mincho) (S309). Specifically, font characteristics are identified based on, for example, black pixel density and the distribution of run lengths.
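The two features named for font identification, black pixel density and run-length distribution, can be computed as in the following Python sketch. The decision rule in `is_emphasized` and its thresholds are hypothetical placeholders; the source names the features but not how they are combined.

```python
from typing import List

Image = List[List[int]]  # binary image of one text line: 1 = black, 0 = white


def black_density(img: Image) -> float:
    """Fraction of pixels in the line image that are black."""
    total = sum(len(row) for row in img)
    return sum(sum(row) for row in img) / total if total else 0.0


def horizontal_run_lengths(img: Image) -> List[int]:
    """Lengths of consecutive black-pixel runs, scanned row by row."""
    runs: List[int] = []
    for row in img:
        run = 0
        for pixel in row:
            if pixel:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return runs


def is_emphasized(img: Image, density_threshold: float = 0.3,
                  run_threshold: float = 2.5) -> bool:
    """Hypothetical rule: heavy fonts have both high density and long mean runs."""
    runs = horizontal_run_lengths(img)
    mean_run = sum(runs) / len(runs) if runs else 0.0
    return black_density(img) >= density_threshold and mean_run >= run_threshold
```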

[0034] Next, the layout/logic analysis unit 209 analyzes the layout of the document image based on the results from the above units (S310). The layout analysis performed here includes, for example, detection of titles, subheadings, captions, and headers and footers.

[0035] A title generally differs from body text in character size and line pitch, and is also positioned slightly apart from the body. The title can therefore be detected using the position attributes assigned by the region division/information extraction unit 205 and the identification result from the font identification unit 208.

[0036] A subheading often has nearly the same character size as the body text and, because it sits close to the body, often falls within the same region as the body. Therefore, when the character size or font of the first line of a character region differs from that of the other characters in the same region, the first line is judged to be a subheading line.
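The subheading rule (the first line of a character region differs in character size or font from the rest of the region) can be written as a short Python sketch. The `TextLine` fields and the `size_tolerance` parameter are assumptions for illustration; the source does not define how size or font differences are measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    text: str
    char_height: int   # estimated character size in pixels (assumed measure)
    emphasized: bool   # result of font identification for this line


def first_line_is_subheading(lines: List[TextLine], size_tolerance: int = 1) -> bool:
    """True when the first line differs from every other line in size or in font."""
    if len(lines) < 2:
        return False  # nothing to compare against
    first, rest = lines[0], lines[1:]
    size_differs = all(
        abs(first.char_height - line.char_height) > size_tolerance for line in rest
    )
    font_differs = all(first.emphasized != line.emphasized for line in rest)
    return size_differs or font_differs
```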

[0037] A caption is attached to an image such as a figure, table, or photograph. It is generally located near the image region and apart from the body text, and it contains a word that refers to a figure, table, or the like, such as "FIG. 8". A line satisfying these conditions is judged to be a caption, and a link is generated between the caption and the corresponding figure.
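The caption conditions (a line close to an image region that contains a figure-referring word) can be sketched in Python as follows. The label pattern `LABEL_RE`, the `max_gap` distance threshold, the horizontal-overlap requirement, and the function names are all assumptions; the source does not give concrete patterns or distances.

```python
import re
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

# Hypothetical label pattern: a figure/table/photo word followed by a number.
LABEL_RE = re.compile(r"(?:図|表|写真|FIG\.|Fig\.|Figure|Table)\s*[0-9０-９]+")


def vertical_gap(a: Box, b: Box) -> int:
    """Vertical whitespace between two boxes; 0 when they overlap vertically."""
    return max(0, max(a[1], b[1]) - min(a[3], b[3]))


def horizontally_overlaps(a: Box, b: Box) -> bool:
    return a[0] < b[2] and b[0] < a[2]


def find_caption(line_text: str, line_box: Box, image_boxes: List[Box],
                 max_gap: int = 30) -> Optional[Tuple[str, int]]:
    """Return (label, index of nearest image) if the line looks like a caption."""
    match = LABEL_RE.search(line_text)
    if match is None:
        return None
    nearby = [
        (vertical_gap(line_box, box), index)
        for index, box in enumerate(image_boxes)
        if horizontally_overlaps(line_box, box)
        and vertical_gap(line_box, box) <= max_gap
    ]
    if not nearby:
        return None
    return match.group(0), min(nearby)[1]  # link the caption to the closest image
```

The returned image index stands in for the link between a caption and its figure.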

[0038] Headers and footers lie at the top and bottom of the document image, so the lines at those positions are detected as headers and footers. Specifically, for example, when the region division/information extraction unit 205 has identified the page as single-column, the lines above the centered line can be judged to be the header. The level of each header line is also identified at this time. When the page has been identified as multi-column, the lines at the top and bottom that do not belong to any column are the header and footer.
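For the multi-column case, the header/footer rule (lines at the top and bottom that belong to no column) might be sketched as below. This assumes that columns and lines are available as bounding boxes and that "belonging to no column" can be approximated by lying entirely above or below all column regions; both are simplifications, and the function name is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def split_header_footer(line_boxes: List[Box],
                        column_boxes: List[Box]) -> Tuple[List[int], List[int]]:
    """Return indices of lines above all columns (header) and below all columns (footer)."""
    columns_top = min(box[1] for box in column_boxes)
    columns_bottom = max(box[3] for box in column_boxes)
    header = [i for i, box in enumerate(line_boxes) if box[3] <= columns_top]
    footer = [i for i, box in enumerate(line_boxes) if box[1] >= columns_bottom]
    return header, footer
```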

[0039] The layout/logic analysis unit 209 then analyzes the logical structure of the document (S311). Specifically, it searches the body text for words identical to those, such as "FIG. 8", that appear in the captions attached to figures, tables, and photographs, and generates links between the matching strings. Because this process uses the character recognition result from the OCR unit 207, OCR errors can be tolerated by using all of the candidate characters for each character pattern.
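The idea of matching caption labels against all OCR candidate characters, rather than only the top-ranked one, can be illustrated with a small Python sketch. The representation of the OCR result as one candidate list per character position, and the function name `find_label`, are assumptions for illustration.

```python
from typing import List


def find_label(candidates: List[List[str]], label: str) -> List[int]:
    """Return start positions where each label character is among that position's candidates."""
    n, m = len(candidates), len(label)
    return [
        start
        for start in range(n - m + 1)
        if all(label[k] in candidates[start + k] for k in range(m))
    ]


# Each inner list holds the candidates for one character pattern, best first.
# The top-ranked reading here is "Fi98", yet the label "Fig8" is still found.
ocr_candidates = [["F"], ["i", "l"], ["9", "g"], ["8", "B"]]
top_only = [[chars[0]] for chars in ocr_candidates]
```

With `ocr_candidates`, `find_label(ocr_candidates, "Fig8")` finds the label at position 0, whereas searching `top_only` finds nothing, which is the OCR-error tolerance the text describes.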

[0040] The document generation unit 210 then generates documents based on the results of the processing performed by each part of the information extraction unit 204 (S312). That is, the PostScript document generation unit 211 generates a document in which each page is expressed in PostScript, based on the results of character recognition, layout analysis, and so on from the information extraction unit 204. To cover OCR errors, the original image can be used in place of the recognition result for characters whose recognition confidence is low.
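A minimal Python sketch of PostScript generation with a low-confidence fallback is given here. It places each recognized word with the standard `findfont`, `scalefont`, `setfont`, `moveto`, and `show` operators. The `Word` fields, the `min_confidence` threshold, and the use of Helvetica with ASCII text are assumptions; embedding the original image clip is only marked with a PostScript comment, since the source does not describe how the image is inserted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: int             # position in PostScript points, origin at bottom-left
    y: int
    size: int          # font size in points
    confidence: float  # recognition confidence in [0, 1] (assumed scale)


def ps_escape(text: str) -> str:
    """Escape backslashes and parentheses for a PostScript string literal."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def render_page(words: List[Word], min_confidence: float = 0.8,
                font: str = "Helvetica") -> str:
    """Emit one PostScript page; low-confidence words are left to the source image."""
    lines = ["%!PS"]
    for word in words:
        if word.confidence >= min_confidence:
            lines.append(f"/{font} findfont {word.size} scalefont setfont")
            lines.append(f"{word.x} {word.y} moveto ({ps_escape(word.text)}) show")
        else:
            # Placeholder only: a real generator would draw the original image clip.
            lines.append(f"% low-confidence word at {word.x} {word.y}: use source image")
    lines.append("showpage")
    return "\n".join(lines)
```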

[0041] The HTML document generation unit 212 likewise uses the results of character recognition, layout analysis, and so on from the information extraction unit 204 to attach tags indicating titles, paragraphs, and the like to each character string, and generates an HTML document.
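The tagging step can be illustrated with a short Python sketch that maps logical roles to HTML elements. The role names and the tag mapping in `TAGS` are hypothetical; the source says only that tags indicating titles, paragraphs, and the like are attached to each character string.

```python
from html import escape
from typing import List, Tuple

# Hypothetical mapping from logical role to HTML element.
TAGS = {"title": "h1", "subheading": "h2", "paragraph": "p", "caption": "p"}


def to_html(blocks: List[Tuple[str, str]]) -> str:
    """Build an HTML document from (role, text) pairs produced by layout analysis."""
    title = next((text for role, text in blocks if role == "title"), "")
    body_lines = []
    for role, text in blocks:
        tag = TAGS.get(role, "p")  # unknown roles fall back to a paragraph
        body_lines.append(f"<{tag}>{escape(text)}</{tag}>")
    body = "\n".join(body_lines)
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )
```

Text is passed through `html.escape` so that characters such as `&` and `<` in the recognized text do not break the markup.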

[0042] The PostScript document is then stored in the PostScript document DB 214 and the HTML document in the HTML document DB 215 (S313). When both a PostScript document and an HTML document generated from the same document image are registered, they may be associated with each other so that either one can be called up from the other.
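One simple way to realize the mutual association is for each stored record to carry the identifier of its counterpart. The sketch below uses in-memory dictionaries as stand-ins for the two databases; the class and method names are hypothetical.

```python
class DocumentDatabase:
    """In-memory stand-in for the PostScript document DB and the HTML document DB."""

    def __init__(self):
        self.postscript_db = {}  # doc_id -> {"body": str, "html_id": int}
        self.html_db = {}        # doc_id -> {"body": str, "postscript_id": int}
        self._next_id = 1

    def _new_id(self):
        doc_id = self._next_id
        self._next_id += 1
        return doc_id

    def register_pair(self, postscript_body, html_body):
        """Store both documents and record each one's counterpart."""
        ps_id, html_id = self._new_id(), self._new_id()
        self.postscript_db[ps_id] = {"body": postscript_body, "html_id": html_id}
        self.html_db[html_id] = {"body": html_body, "postscript_id": ps_id}
        return ps_id, html_id

    def html_for_postscript(self, ps_id):
        return self.html_db[self.postscript_db[ps_id]["html_id"]]["body"]

    def postscript_for_html(self, html_id):
        return self.postscript_db[self.html_db[html_id]["postscript_id"]]["body"]


db = DocumentDatabase()
ps_id, html_id = db.register_pair("%!PS page", "<html>page</html>")
```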

[0043] The PostScript documents and HTML documents stored in the PostScript document DB 214 and the HTML document DB 215 can be displayed on the screen of each client 107 by performing a search. Specifically, in response to a search request from a client 107, the search server 108 retrieves the matching documents from the PostScript document DB 214 and the HTML document DB 215 and outputs them; the client 107 receives the search results from the search server 108 and displays the documents that match the search request on its screen.

[0044] FIG. 5 is an explanatory diagram showing an example of a PostScript document displayed on a screen. As FIG. 5 shows, generating a document expressed in PostScript from the document image shown in FIG. 4 makes it easy to produce a document with the same layout as the original paper document. In other words, generating a document that reproduces the paper document yields a document suited to distribution and browsing, one that conveys the document creator's intent as accurately as possible. Although the PostScript document looks almost identical to the original paper document when displayed, it retains internal information, so it can be searched and reused.

[0045] FIG. 6 is an explanatory diagram showing an example of an HTML document displayed on a screen. In FIG. 6, the figures and figure numbers have been converted into hypertext; for example, clicking "FIG. 9" in the body text with a mouse or the like displays the figure corresponding to "FIG. 9" on the screen. Converting the document into hypertext in this way, without being bound by the layout of the paper document, makes it possible to generate a document in a form unique to computers that gives priority to the content of the paper document.

[0046] As described above, the document image recognition device of the first embodiment does more than simple character recognition, that is, extracting character codes from a document image. Because it extracts and uses the various kinds of information contained in the document image, it can generate computer-usable documents in various forms suited to the purpose of use, such as a document that gives priority to reproducing the paper document and a document that emphasizes the content of the paper document.

[0047] In FIG. 1, the configuration of the document image recognition device of the first embodiment was described as a system connected via the network 109. However, the document image recognition device can also be configured in stand-alone form by providing a single computer with the functions shown in FIG. 2.

[0048] The document image recognition device of the first embodiment generates the PostScript and HTML documents described above, but it is not limited to these formats; documents in other formats can also be generated as needed.

[0049] [Second Embodiment] The document image recognition device of the second embodiment makes it possible to search efficiently for the PostScript documents and HTML documents generated as described in the first embodiment. Specifically, predetermined character strings are extracted from the character strings recognized by the OCR unit 207 and associated with the corresponding PostScript document or HTML document; searching for a matching character string then allows the associated PostScript document or HTML document to be output as the search result. In the following, such a character string is referred to as key text.

[0050] FIG. 7 is a conceptual configuration diagram of the document image recognition device of the second embodiment. In FIG. 7, components identical to those in FIG. 2 described in the first embodiment are given the same reference numerals, and their detailed description is omitted.

[0051] As shown in FIG. 7, the document image recognition device of the second embodiment includes: a key text extraction unit 700 that extracts the key text described above from the character strings recognized by the OCR unit 207; a key text registration unit 701 that receives the key text extracted by the key text extraction unit 700 and registers it in the key text DB 702; and a search processing unit 703 that receives a search request, searches the key text registered in the key text DB 702, and outputs the PostScript document or HTML document associated with the matching key text as the search result.

[0052] The PostScript document DB 214, the HTML document DB 215, and the key text DB 702 are provided in the search processing unit 703. The search processing unit 703 corresponds to the search server 108 in FIG. 1 and performs the search processing described above in response to search requests from the clients 107. When the document image recognition device of the second embodiment is configured in stand-alone form, search requests are input directly and the search processing is performed.

[0053] Candidates for the key text extracted by the key text extraction unit 700 are character strings that concisely represent the document, for example the titles of the whole document, chapters, and sections; bibliographic items such as headers and footers; and the summary of the document. Taking the figures and the like in the document as a reference, the character strings that make up a figure caption, the sentence containing a figure number, the paragraph containing that sentence, and page-level character strings may also be extracted as key text. Because extracting such key text requires analyzing the layout of the document image, the key text extraction unit 700 may perform the extraction using the analysis results from the layout/logic analysis unit 209.
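A minimal sketch of key text extraction follows, assuming the layout and logical analysis has already labeled each block with a role. The role names and the figure-number pattern are illustrative assumptions. Blocks with a title-like or bibliographic role become key text, as do paragraphs that mention a figure number.

```python
import re

# Assumed role labels produced by the layout/logic analysis (hypothetical names).
KEY_ROLES = {"title", "heading", "header", "footer", "summary", "caption"}

# Assumed English-style figure reference such as "FIG. 8".
FIGURE_REF = re.compile(r"FIG\.\s*\d+")


def extract_key_texts(blocks):
    """Return the character strings to register as key text, in document order."""
    keys = []
    for role, text in blocks:
        if role in KEY_ROLES:
            keys.append(text)
        elif role == "paragraph" and FIGURE_REF.search(text):
            keys.append(text)  # paragraph that contains a figure number
    return keys


key_texts = extract_key_texts([
    ("title", "Document Image Recognition"),
    ("paragraph", "Scanning is performed first."),
    ("paragraph", "The flow is shown in FIG. 3."),
    ("caption", "FIG. 3 Operation procedure"),
])
```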

[0054] The key text registration unit 701 receives the key text extracted by the key text extraction unit 700, associates it with the PostScript document or HTML document generated by the document generation unit 210, and stores it in the key text DB 702.

[0055] The search processing unit 703 has an input/output unit 704 that receives search requests and outputs search results, and a search engine 705 that receives a search request from the input/output unit 704 and searches the key text DB 702 for matching key text. Specifically, the input/output unit 704 receives a search request and passes it to the search engine 705. The search engine 705 receives the search request from the input/output unit 704, searches the key text DB 702 for matching key text, and returns the matching key text to the input/output unit 704. The input/output unit 704 then outputs the PostScript document or HTML document associated with the received key text as the search result.
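The division of work between the input/output unit and the search engine can be sketched as below. The class names mirror the units in the text, but the case-insensitive substring matching rule and the in-memory storage are assumptions made for illustration only.

```python
class KeyTextDB:
    """Stores (key text, document id) associations."""

    def __init__(self):
        self.entries = []

    def register(self, key_text, doc_id):
        self.entries.append((key_text, doc_id))


class SearchEngine:
    """Finds the key texts that match a search request."""

    def __init__(self, key_text_db):
        self.key_text_db = key_text_db

    def search(self, query):
        q = query.lower()
        return [(kt, d) for kt, d in self.key_text_db.entries if q in kt.lower()]


class InputOutputUnit:
    """Accepts a request, delegates to the engine, and returns the linked documents."""

    def __init__(self, engine, documents):
        self.engine = engine
        self.documents = documents  # doc_id -> stored document

    def handle(self, query):
        results, seen = [], set()
        for _key_text, doc_id in self.engine.search(query):
            if doc_id not in seen:  # report each document once
                seen.add(doc_id)
                results.append(self.documents[doc_id])
        return results


key_db = KeyTextDB()
key_db.register("Skew correction method", 1)
key_db.register("FIG. 2 Skew angle histogram", 1)
key_db.register("Noise removal method", 2)
io_unit = InputOutputUnit(SearchEngine(key_db), {1: "doc-1.ps", 2: "doc-2.html"})
```

Because only the short key texts are scanned, the search touches far less data than the documents themselves, which is consistent with the speed benefit the text claims.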

[0056] As described above, the document image recognition device of the second embodiment searches for documents using character strings in the document image as key text, so it can return the most appropriate search results for a search request and achieve high-speed search processing.

[0057] In the second embodiment described above, documents are searched using key text. Instead of key text, however, a full-text search may be performed using the entire result of character recognition by the OCR unit 207. FIG. 8 is a conceptual configuration diagram showing a modification of the document image recognition device of the second embodiment.

[0058] As shown in FIG. 8, a text registration unit 800 and a full-text search text DB 801 are provided in place of the key text extraction unit 700, the key text registration unit 701, and the key text DB 702 described above. The text registration unit 800 receives the character recognition result, that is, the full text in the document image, from the OCR unit 207, associates the text with the PostScript document or HTML document generated by the document generation unit 210, and registers it in the full-text search text DB 801. PostScript documents and HTML documents can then be searched using the text registered in the full-text search text DB 801. Because the text is registered in the full-text search text DB 801, there is no need to open and close each file on every search, which speeds up search processing.
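The specification does not say how the full-text search text DB is organized. As one possible illustration, the sketch below registers the recognized text in an in-memory inverted index (an assumption, not the patented structure), so that a query is answered from the index without opening any document file.

```python
import re
from collections import defaultdict


class FullTextDB:
    """Inverted index from word to the ids of the documents that contain it."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, doc_id, text):
        """Index every word of the recognized text under the document id."""
        for word in re.findall(r"\w+", text.lower()):
            self._index[word].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every word of the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        result = set(self._index.get(words[0], set()))
        for word in words[1:]:
            result &= self._index.get(word, set())
        return result


text_db = FullTextDB()
text_db.register(1, "Skew correction of scanned pages")
text_db.register(2, "Noise removal and skew handling")
```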

[0059] [Third Embodiment] Next, the document image recognition device of the third embodiment is described. Like that of the first embodiment, the document image recognition device of the third embodiment can generate both a document intended for distribution and browsing and a document intended for use.

[0060] In the document image recognition device of the third embodiment, the document intended for distribution and browsing is the document image generated by reading the original paper document, referred to here as image data. The document intended for use is the result of character recognition performed on the character strings in the document image, referred to here as text data.

[0061] FIG. 9 is a conceptual configuration diagram of the document image recognition device of the third embodiment. In FIG. 9, components identical to those described in the first embodiment are given the same reference numerals, and their detailed description is omitted.

[0062] The document image recognition device of the third embodiment is broadly composed of the document image input unit 200, the preprocessing unit 201, the information extraction unit 204, a registration unit 900, and a database unit 903. The document image input unit 200, the preprocessing unit 201, the information extraction unit 204, and the registration unit 900 correspond to the document image processing server 105 shown in FIG. 1, and the database unit 903 corresponds to the search server 108 shown in FIG. 1.

[0063] The document image input unit 200 and the preprocessing unit 201 are as described in the first embodiment. The information extraction unit 204 is also as described in the first embodiment, except that the font identification unit 208 and the layout/logic analysis unit 209 are omitted because no PostScript document or HTML document is generated.

[0064] The differences from the document image recognition device of the first embodiment are the registration unit 900, which includes an image DB registration unit 901 and a text DB registration unit 902, and the database unit 903, which includes an image DB 904 and a text DB 905.

[0065] The image DB registration unit 901 receives from the preprocessing unit 201 the image data that has undergone noise removal and skew correction, and registers it in the image DB 904 of the database unit 903. The text DB registration unit 902 registers the text data resulting from the character recognition performed by the OCR unit 207 in the text DB 905 of the database unit 903. When both the image data and the text data of the same document are registered, they may be associated with each other so that either one can be called up from the other.
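The registration of image data and text data with a mutual association might be sketched with two tables that share a document id. SQLite, the table names, and the column names are assumptions chosen for illustration; the specification does not name a database product or schema.

```python
import sqlite3


def open_database():
    """Create in-memory stand-ins for the image DB and the text DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image_db (doc_id INTEGER PRIMARY KEY, data BLOB NOT NULL)")
    conn.execute(
        "CREATE TABLE text_db ("
        "doc_id INTEGER PRIMARY KEY REFERENCES image_db(doc_id), "
        "content TEXT NOT NULL)"
    )
    return conn


def register(conn, image_bytes, ocr_text):
    """Store the preprocessed image and its OCR text under one shared id."""
    cur = conn.execute("INSERT INTO image_db (data) VALUES (?)", (image_bytes,))
    doc_id = cur.lastrowid
    conn.execute("INSERT INTO text_db (doc_id, content) VALUES (?, ?)", (doc_id, ocr_text))
    conn.commit()
    return doc_id


def images_for_text_hit(conn, query):
    """Search the text data and return the associated image data."""
    return conn.execute(
        "SELECT i.doc_id, i.data FROM text_db t "
        "JOIN image_db i ON i.doc_id = t.doc_id "
        "WHERE t.content LIKE ? ORDER BY i.doc_id",
        (f"%{query}%",),
    ).fetchall()


conn = open_database()
first_id = register(conn, b"img-a", "skew correction")
second_id = register(conn, b"img-b", "noise removal")
```

Sharing one id across both tables lets a hit in the text data lead directly to the image of the same page, and the reverse lookup works the same way.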

[0066] The image data and text data stored in the image DB 904 and the text DB 905 can be displayed on the screen of each client 107 by performing a search. Specifically, in response to a search request from a client 107, the search server 108 retrieves the matching documents from the image DB 904 and the text DB 905 and outputs them; the client 107 receives the search results from the search server 108 and displays the documents that match the search request on its screen.

[0067] As described above, the document image recognition device of the third embodiment registers the image data in a database, so the paper document itself can be displayed on a computer screen; it also registers the text data in a database, so the information in the paper document can be used easily.

[0068] [Fourth Embodiment] The document image recognition device of the fourth embodiment makes it possible to search efficiently for the image data and text data generated as described in the third embodiment, using the key text described in the second embodiment.

[0069] FIG. 10 is a conceptual configuration diagram of the document image recognition device of the fourth embodiment. In FIG. 10, components identical to those in FIG. 2 described in the first embodiment and FIG. 7 described in the second embodiment are given the same reference numerals, and their detailed description is omitted.

[0070] As shown in FIG. 10, the document image recognition device of the fourth embodiment includes: a key text extraction unit 700 that analyzes the layout of the document image and extracts the key text described above from the character strings recognized by the OCR unit 207; a key text registration unit 701 that receives the key text extracted by the key text extraction unit 700 and registers it in the key text DB 702; and a search processing unit 703 that receives a search request directly, searches the key text registered in the key text DB 702, and outputs the image data or text data associated with the matching key text as the search result.

[0071] Candidates for the key text extracted by the key text extraction unit 700 are character strings that concisely represent the document, for example the titles of the whole document, chapters, and sections; bibliographic items such as headers and footers; and the summary of the document. Taking the figures and the like in the document as a reference, the character strings that make up a figure caption, the sentence containing a figure number, the paragraph containing that sentence, and page-level character strings may also be extracted as key text.

[0072] The key text registration unit 701 receives the key text extracted by the key text extraction unit 700, associates it with the corresponding image data or text data, and stores it in the key text DB 702.

[0073] The search processing unit 703 has an input/output unit 704 that receives search requests and outputs search results, and a search engine 705 that receives a search request from the input/output unit 704 and searches the key text DB 702 for matching key text. Specifically, the input/output unit 704 receives a search request and passes it to the search engine 705. The search engine 705 receives the search request from the input/output unit 704, searches the key text DB 702 for matching key text, and returns the matching key text to the input/output unit 704. The input/output unit 704 then outputs the image data or text data associated with the received key text as the search result.

[0074] The search processing unit 703 corresponds to the search server 108 in FIG. 1 and performs search processing in response to search requests from the clients 107. When the document image recognition device of the fourth embodiment is configured in stand-alone form, search requests are input directly and the search processing is performed.

[0075] As described above, the document image recognition device of the fourth embodiment searches for documents using character strings in the document image as key text, so it can return the most appropriate search results for a search request and achieve high-speed search processing.

[0076] In the fourth embodiment described above, documents are searched using key text. Instead of key text, however, a full-text search may be performed using the entire result of character recognition by the OCR unit 207. FIG. 11 is a conceptual configuration diagram showing a modification of the document image recognition device of the fourth embodiment.

[0077] As shown in FIG. 11, a text registration unit 800 and a full-text search text DB 801 are provided. The text registration unit 800 receives the character recognition result, that is, the full text in the document image, from the OCR unit 207, associates the text with the corresponding image data or text data, and registers it in the full-text search text DB 801. Image data and text data can then be searched using the text registered in the full-text search text DB 801. Because the text is registered in the full-text search text DB 801, there is no need to open and close each file on every search, which speeds up search processing.

[0078] The document image recognition devices of the first to fourth embodiments described above can be combined in any way to form a single document image recognition device.

[0079] Furthermore, programs that cause a computer to function as the document image recognition devices of the first to fourth embodiments can be created, recorded on a computer-readable recording medium such as a hard disk, floppy disk, CD-ROM, MO, or DVD, and distributed via the recording medium. The document image recognition devices described above can then be realized by having a computer read and execute the program recorded on the recording medium.

[0080]

[Effects of the Invention] As described above, the document image recognition device of the present invention (claim 1) includes: input means for inputting a document image generated by optically reading a paper document; area extraction means for recognizing and extracting, from the document image input via the input means, a character area containing character strings and/or an image area containing images such as figures, tables, and photographs; character recognition means for performing character recognition on the character strings in the character area extracted by the area extraction means; layout information extraction means for analyzing the layout of the document image based on the extraction result of the area extraction means and extracting layout information; first document generation means for generating a first document using a page description language, based on the character recognition result from the character recognition means and the layout information extraction result from the layout information extraction means; second document generation means for generating a second document using a structured description language, based on the character recognition result from the character recognition means and the layout information extraction result from the layout information extraction means; and storage means for storing the first and second documents generated by the first and second document generation means, respectively. The device therefore does more than simple character recognition, that is, extracting character codes from a document image. By extracting and using the various kinds of information contained in the document image, it can generate computer-usable documents in various forms suited to the purpose of use, such as a document that gives priority to reproducing the paper document and a document that emphasizes the content of the paper document.

[0081] According to the document image recognition device of the present invention (claim 2), in the document image recognition device of claim 1, the first document is a document expressed in PostScript format or PDF format, so a document suited to distribution and browsing that conveys the document creator's intent as accurately as possible can be obtained.

[0082] According to the document image recognition device of the present invention (claim 3), in the document image recognition device of claim 1, the second document is a document expressed in SGML, HTML, or XML, so a document in a form unique to computers that gives priority to the content of the paper document can be generated without being bound by the layout of the paper document.

[0083] Furthermore, the computer-readable recording medium of the present invention (claim 4) records a program for causing a computer to function as each means of the document image recognition device according to any one of claims 1 to 3. Executing the recorded program on a computer therefore extracts and uses the various kinds of information contained in a document image, making it possible to generate computer-usable documents in various forms suited to the purpose of use, such as a document that gives priority to reproducing the paper document and a document that emphasizes the content of the paper document.

[Brief description of the drawings]

FIG. 1 is a configuration diagram showing the system configuration of the document image recognition device of the first embodiment.

FIG. 2 is a conceptual configuration diagram of the document image recognition device shown in FIG. 1.

FIG. 3 is a flowchart showing the operation procedure of the document image recognition device of the first embodiment.

FIG. 4 is an explanatory diagram showing an example of a document image, input via the document image input unit, displayed on a screen in the document image recognition device of the first embodiment.

FIG. 5 is an explanatory diagram showing an example of a generated PostScript document displayed on a screen in the document image recognition device of the first embodiment.

FIG. 6 is an explanatory diagram showing an example of a generated HTML document displayed on a screen in the document image recognition device of the first embodiment.

FIG. 7 is a conceptual configuration diagram of the document image recognition device of the second embodiment.

FIG. 8 is a conceptual configuration diagram showing a modification of the document image recognition device of the second embodiment.

FIG. 9 is a conceptual configuration diagram of the document image recognition device of the third embodiment.

FIG. 10 is a conceptual configuration diagram of the document image recognition device of the fourth embodiment.

FIG. 11 is a conceptual configuration diagram showing a modification of the document image recognition device of the fourth embodiment.

[Explanation of symbols]

100 color scanner
101 monochrome scanner
102 network scanner
103 facsimile machine
104 fax modem
105 (105a, 105b) document image processing server
106 digital multifunction peripheral
107 client
108 search server
109 network
200 document image input unit
201 preprocessing unit
202 noise removal unit
203 skew correction unit
204 information extraction unit
205 area division / information extraction unit
206 table processing unit
207 OCR unit
208 font identification unit
209 layout/logic analysis unit
210 document generation unit
211 PostScript document generation unit
212 HTML document generation unit
213 database unit
214 PostScript document DB
215 HTML document DB
700 key text extraction unit
701 key text registration unit
702 key text DB
703 search processing unit
704 input/output unit
705 search engine
800 text registration unit
801 full-text search text DB

Continued from the front page
(51) Int. Cl.⁶: H04N 1/40; FI: H04N 1/40 F

Claims

[Claims]

1. A document image recognition device comprising: input means for inputting a document image generated by optically reading a paper document; area extracting means for recognizing and extracting, from the document image input via the input means, a character area including a character string and/or an image area including an image such as a figure, table, or photograph; character recognizing means for performing character recognition processing on the character string of the character area extracted by the area extracting means; layout information extracting means for analyzing a layout of the document image based on the extraction result of the area extracting means and extracting layout information; first document generating means for generating a first document using a page description language, based on the character recognition result of the character recognizing means and the layout information extraction result of the layout information extracting means; second document generating means for generating a second document using a structured description language, based on the character recognition result of the character recognizing means and the layout information extraction result of the layout information extracting means; and storage means for storing the first and second documents generated by the first and second document generating means, respectively.

2. The document image recognition device according to claim 1, wherein the first document is a document represented in PostScript format or PDF format.

3. The document image recognition device according to claim 1, wherein the second document is a document represented in SGML, HTML, or XML.

4. A computer-readable recording medium on which is recorded a program for causing a computer to function as each means of the document image recognition device according to claim 1.
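
As an illustration only, the relationship between the two document generating means and the storage means recited in claim 1 might be sketched as follows. This is a minimal sketch and not the implementation disclosed in this publication: the `TextBlock` record, the function names, the fixed Helvetica font, and the in-memory dictionary standing in for the database are all assumptions introduced for the example. It assumes character recognition and layout analysis have already produced text blocks with positions and logical roles.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextBlock:
    """Hypothetical recognition result: text plus layout information."""
    text: str
    x: int      # left edge in points
    y: int      # baseline in points (PostScript origin is bottom-left)
    size: int   # font size in points
    role: str   # logical role from layout analysis: "title" or "body"


def ps_escape(s: str) -> str:
    # Backslashes and parentheses must be escaped inside a PostScript string.
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def generate_postscript(blocks):
    """First document: a page description that reproduces block positions."""
    lines = ["%!PS-Adobe-3.0"]
    for b in blocks:
        # Font choice is an assumption; a real system would use the identified font.
        lines.append(f"/Helvetica findfont {b.size} scalefont setfont")
        lines.append(f"{b.x} {b.y} moveto ({ps_escape(b.text)}) show")
    lines.append("showpage")
    return "\n".join(lines)


def generate_html(blocks):
    """Second document: structured markup that keeps logical roles only."""
    tags = {"title": "h1", "body": "p"}
    # Reading order: top of the page first (a larger y is higher on the page).
    ordered = sorted(blocks, key=lambda b: -b.y)
    body = "\n".join(
        f"<{tags[b.role]}>{escape(b.text)}</{tags[b.role]}>" for b in ordered
    )
    return f"<html><body>\n{body}\n</body></html>"


def store_documents(db, doc_id, blocks):
    """Storage: keep both generated documents under one identifier."""
    db[doc_id] = {
        "postscript": generate_postscript(blocks),
        "html": generate_html(blocks),
    }
    return db[doc_id]
```

In this sketch the PostScript output keeps coordinates and font sizes, which favors reproduction of the paper document, while the HTML output discards positions and keeps only reading order and logical role, which favors the content.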