JP2009048618A

JP2009048618A - Document extracting method, document extracting apparatus, computer program, and recording medium

Info

Publication number: JP2009048618A
Application number: JP2008162324A
Authority: JP
Inventors: Hitoshi Hirohata; 仁志廣畑
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2007-07-24
Filing date: 2008-06-20
Publication date: 2009-03-05
Anticipated expiration: 2028-06-20
Also published as: CN101354717A; CN101354717B; JP4340714B2

Abstract

PROBLEM TO BE SOLVED: To provide a document extracting method and a document extracting apparatus for extracting, from a database, document data concerning a document composed of a plurality of pages, and to provide a computer program and a recording medium.

SOLUTION: Document data corresponding to each page included in the document is stored, and feature data indicating a feature of the document data and a document index indicating the document are associated with the document data. A document extracting apparatus obtains input document data (S32), calculates feature data from the input document data (S34), judges similarity between the input document data and the stored document data based on the feature data (S36), obtains the document index associated with document data similar to the input document data (S39), and extracts a plurality of pieces of document data associated with the document index (S43). Thus, document data concerning the document that includes a page corresponding to the document data similar to the input document data is extracted for a plurality of pages.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a technique for retrieving a specific document from a document database. More specifically, it relates to a document extracting method, a document extracting apparatus, a computer program, and a recording medium that retrieve, from a database, the document data corresponding to a read document on the basis of document data such as an image obtained by reading the document with a scanner.

Conventionally, data obtained by reading a document such as text or a photograph with a scanner, or document data created electronically on a personal computer (PC) or the like, is accumulated in a database; a document is then newly read, and the document data corresponding to the read document is extracted from the database. Proposed methods for extracting document data include, for example, a method that extracts keywords from the read document using an OCR (Optical Character Reader) and determines document similarity on the basis of the keywords, and a method that restricts documents to form documents with ruled lines and determines document similarity by extracting features of the ruled lines.

Patent Document 1 discloses a technique in which descriptors that characterize documents are associated with lists of the documents characterized by each descriptor, descriptors are generated from a read document (input document), and documents are matched using the generated descriptors. The descriptors are defined so as to be invariant to distortion and the like caused by reading the document. A plurality of descriptors are generated for one document, a vote is cast for each document associated with each descriptor generated from the read document, and the document that obtains the highest number of votes, or a document whose number of votes exceeds a predetermined threshold, is selected.

Patent Document 2 discloses a technique in which image data of documents is stored in advance and a document is searched for by pattern matching, bit by bit, between the bitmap data of a read document and the bitmap data of the stored documents. Patent Document 2 also states that, for a document consisting of a plurality of pages, only the cover page may be read for the search, and the document may be found by comparing the image data of the read page with the image data of the first page of each stored document.

Patent Document 3 discloses a technique in which document images are stored in advance, the feature amount of the image of a read document is compared with the feature amounts of all pages of the stored document images to obtain similarities, and document images whose similarity is higher than a threshold are extracted. In this technique, when a plurality of document images become candidates, the document images are displayed and a selection by the user is accepted; in addition, when the average similarity of the pages included in a document image falls below a threshold, that document image is removed from the candidates to narrow them down.
Japanese Patent Laid-Open No. 7-282088; Japanese Patent Laid-Open No. 5-37748; Japanese Patent Laid-Open No. 2006-31181

A document such as a text document is usually composed of a plurality of pages. Conventional techniques, including the one disclosed in Patent Document 1, can extract desired document data from a database by matching against a document read by a scanner, but for a document composed of a plurality of pages, matching must be performed page by page to extract the document data. Consequently, when pages of the source document used for matching are missing because of loss, soiling, or the like, the document data for the document cannot be extracted across all of its pages. Patent Document 1 discloses no means of solving this problem.

In a technique that compares bitmap data of a document composed of a plurality of pages, as described in Patent Document 2, the comparison is performed page by page, so the comparison takes longer as the number of pages per document and the number of documents increase. Moreover, comparing bitmap data requires accurate alignment of the two pieces of image data being compared. In practice, accurate alignment is difficult, and as a result documents cannot be searched for accurately.

In the technique described in Patent Document 3, character codes are extracted using OCR as the feature amount of the character regions of a document image, so the accuracy of the similarity determination drops depending on the character codes extracted. Extracting many character codes could compensate for this loss of accuracy, but that increases the memory capacity needed to store the character codes and, because the search uses a large amount of data, lengthens the processing time. Furthermore, the techniques of Patent Documents 2 and 3 give no consideration to the retrieval of documents containing confidential information, so such documents may be output easily.

The present invention has been made in view of these circumstances. Its object is to provide a document extracting method, a document extracting apparatus, a computer program, and a recording medium that make it possible to extract document data for a document composed of a plurality of pages easily from a database, by allowing the data of other parts of the document to be extracted on the basis of one part of the document.

Another object of the present invention is to provide a document extracting apparatus that can avoid the risk of erroneously extracting document data other than the intended data when extracting document data.

A further object of the present invention is to provide a document extracting apparatus that can protect confidential information by defining conditions for outputting a document.

The document extracting method according to the present invention is a method for extracting specific document data from document data stored in storage means, characterized by: storing in the storage means a document index, which indicates a document composed of a plurality of pages, in association with the document data corresponding to each page included in the document; storing in the storage means feature data, which is calculated on the basis of feature points extracted from document data and indicates features of that document data, in association with the document data; acquiring input document data, which is new document data; extracting feature points from the acquired input document data; generating, on the basis of the extracted feature points, feature data indicating features of the input document data; determining the similarity between the input document data and the document data associated with the stored feature data by comparing the generated feature data with the feature data stored in the storage means; acquiring the document index associated with the document data determined to be highly similar to the input document data; and extracting a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the acquired document index.

The document extracting apparatus according to the present invention comprises document storage means for storing document data and extracts specific document data from the document data stored in the document storage means, and is characterized by comprising: means for storing a document index, which indicates a document composed of a plurality of pages, in association with the document data corresponding to each page included in the document; feature data storage means for storing feature data, which is calculated on the basis of feature points extracted from document data and indicates features of that document data, in association with the document data; acquisition means for acquiring input document data, which is new document data; means for extracting feature points from the input document data acquired by the acquisition means; generation means for generating, on the basis of the extracted feature points, feature data indicating features of the input document data; determination means for determining the similarity between the input document data and the document data associated with the feature data stored in the feature data storage means, by comparing the feature data generated by the generation means with the feature data stored in the feature data storage means; means for acquiring the document index associated with the document data that the determination means has determined to be highly similar to the input document data; and extraction means for extracting a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the acquired document index.

In the document extracting apparatus according to the present invention, the feature data storage means is configured to store, in association with one piece of document data, a plurality of pieces of feature data indicating features of that document data; the generation means is configured to generate a plurality of pieces of feature data indicating features of the input document data; and the determination means includes means for casting, for each of the plurality of pieces of feature data generated by the generation means, a vote for the document data associated with feature data that matches it, and means for determining that, among the document data stored in the document storage means, the document data with the largest number of votes, or document data with a number of votes equal to or greater than a predetermined amount, is highly similar to the input document data.

In the document extracting apparatus according to the present invention, the acquisition means includes means for acquiring a plurality of pieces of input document data; the determination means includes means for determining, for each of the plurality of pieces of input document data, the similarity between the document data stored in the document storage means and that input document data; and the extraction means includes means for extracting, when the document indexes associated with the document data highly similar to each of the plurality of pieces of input document data match one another, a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by that document index.

The document extracting apparatus according to the present invention further comprises means for requesting further input document data when a plurality of document indexes associated with document data highly similar to the input document data are acquired, or when, among the document indexes associated with document data highly similar to each of a plurality of pieces of input document data, a plurality of document indexes common to those pieces of input document data are acquired.

In the document extracting apparatus according to the present invention, the acquisition means is configured to acquire input document data by optically reading a document.

The document extracting apparatus according to the present invention further comprises: means for storing, in association with a document index, a predetermined output condition required for outputting the document data corresponding to each page included in the document indicated by that document index; means for determining whether the output condition associated with the document index that is associated with the document data extracted by the extraction means is satisfied; means for outputting the plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the document index when the output condition is determined to be satisfied; and means for prohibiting output of the plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the document index when the output condition is determined not to be satisfied.

The document extracting apparatus according to the present invention further comprises means for forming a plurality of images based on the plurality of pieces of document data extracted by the extraction means.

The computer program according to the present invention causes a computer to extract specific document data from document data stored inside or outside the computer, and is characterized by including: a procedure for causing the computer to extract feature points from input document data that has been input; a procedure for causing the computer to generate, on the basis of the extracted feature points, feature data indicating features of the input document data; a procedure for causing the computer to determine the similarity between the stored document data and the input document data by comparing the generated feature data with feature data indicating features of the stored document data; a procedure for causing the computer to acquire the document index associated with the document data determined to be highly similar to the input document data; and a procedure for causing the computer to extract a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the acquired document index.

The computer-readable recording medium according to the present invention records a computer program that causes a computer to extract specific document data from document data stored inside or outside the computer, and is characterized in that the recorded computer program includes: a procedure for causing the computer to extract feature points from input document data that has been input; a procedure for causing the computer to generate, on the basis of the extracted feature points, feature data indicating features of the input document data; a procedure for causing the computer to determine the similarity between the stored document data and the input document data by comparing the generated feature data with feature data indicating features of the stored document data; a procedure for causing the computer to acquire the document index associated with the document data determined to be highly similar to the input document data; and a procedure for causing the computer to extract a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the acquired document index.

In the present invention, document data corresponding to each page included in a document is stored, and feature data, which is calculated on the basis of feature points extracted from the document data and indicates features of that document data, and a document index indicating the document are stored in association with the document data. When the document extracting apparatus acquires input document data, it generates feature data from the input document data, determines the similarity to the stored document data on the basis of the feature data, acquires the document index associated with the document data highly similar to the input document data, and extracts the plurality of pieces of document data associated with the acquired document index. In this way, the document that includes the page corresponding to the document data judged similar to the input document data is identified, and the document data corresponding to all pages included in the identified document is extracted.
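The flow in the preceding paragraph can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory `PAGES` table, the integer feature values, and the function names are all assumptions made for illustration, and the similarity measure here is a simple count of shared feature values.

```python
# Hypothetical in-memory database (an assumption for illustration):
# page id -> (document index, set of feature data values for that page).
PAGES = {
    "p1": ("docA", {101, 102, 103}),
    "p2": ("docA", {201, 202, 203}),
    "p3": ("docB", {301, 302, 303}),
}


def most_similar_page(input_features):
    """Return the stored page sharing the most feature data with the input, or None."""
    best_page, best_score = None, 0
    for page_id, (_, features) in PAGES.items():
        score = len(features & input_features)
        if score > best_score:
            best_page, best_score = page_id, score
    return best_page


def extract_document(input_features):
    """Return every page of the document that contains the most similar page."""
    page_id = most_similar_page(input_features)
    if page_id is None:
        return []  # nothing similar was found
    doc_index = PAGES[page_id][0]
    return sorted(p for p, (idx, _) in PAGES.items() if idx == doc_index)
```

Calling `extract_document` with features taken from a single scanned page returns the identifiers of all pages that share that page's document index, which is the point of associating the index with every page.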

In the present invention, to determine the similarity of document data, the document extracting apparatus stores a plurality of pieces of feature data for one piece of document data, casts a vote, for each piece of feature data generated from the input document data, for the document data associated with identical feature data, and treats the document data that obtains the largest number of votes, or a number of votes equal to or greater than a predetermined amount, as the document data highly similar to the input document data. Because document data for which many of the plurality of pieces of feature data match is judged highly similar, a more reliable similarity determination can be made.
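The voting step can be sketched as follows. This is a hedged illustration only: `HASH_TABLE`, the integer feature values, and the `min_votes` parameter are hypothetical names, and the source does not specify how the feature data is stored or what the "predetermined amount" is.

```python
from collections import Counter

# Hypothetical table (an assumption): feature data value -> stored pages registered under it.
HASH_TABLE = {
    11: ["p1"],
    12: ["p1", "p3"],
    13: ["p1"],
    21: ["p2"],
    31: ["p3"],
}


def vote(input_features, min_votes=None):
    """Vote for stored pages whose feature data matches the input feature data.

    With min_votes=None, return the page(s) with the largest number of votes;
    otherwise return every page with at least min_votes votes.
    """
    votes = Counter()
    for feature in input_features:
        for page in HASH_TABLE.get(feature, []):
            votes[page] += 1
    if not votes:
        return []
    if min_votes is None:
        top = max(votes.values())
        return sorted(p for p, n in votes.items() if n == top)
    return sorted(p for p, n in votes.items() if n >= min_votes)
```

The two return paths correspond to the two criteria in the text: the largest vote count, or a vote count at or above a threshold.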

In the present invention, the document extracting apparatus acquires a plurality of pieces of input document data and, when the document indexes associated with the document data highly similar to each piece of input document data match, extracts the plurality of pieces of document data associated with the matching document index. This makes it possible to extract one document on the basis of a plurality of pages.
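Matching document indexes across several scanned pages amounts to a set intersection. A minimal sketch, assuming each scanned page has already produced a set of candidate document indexes (the function name and input shape are hypothetical):

```python
def common_document_indexes(candidates_per_page):
    """Return the document indexes shared by every scanned page.

    candidates_per_page: list of sets, one per input page, each holding the
    document indexes of stored pages judged highly similar to that input page.
    """
    if not candidates_per_page:
        return set()
    common = set(candidates_per_page[0])
    for candidates in candidates_per_page[1:]:
        common &= set(candidates)
    return common
```

For example, if the first page is similar to pages of `docA` and `docB` but the second page is similar only to a page of `docA`, the intersection leaves `docA` as the document to extract.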

In the present invention, when there are a plurality of document indexes associated with document data highly similar to the input document data, the document extracting apparatus requests input document data corresponding to another page of the document. Input document data corresponding to another page is thereby acquired, and the document indexes are narrowed down using the other page as well.
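The narrowing-down behavior can be sketched as a loop that keeps consuming pages while more than one candidate index remains. This is an illustrative assumption about control flow, not the apparatus's actual procedure; the function name and the iterable of per-page candidate sets are hypothetical stand-ins for "request another page and match it".

```python
def narrow_down(candidate_sets):
    """Consume per-page candidate sets until at most one document index remains.

    candidate_sets: iterable yielding, for each additional scanned page, the set
    of document indexes judged highly similar to that page.
    Returns (document index or None, number of pages used).
    """
    remaining = None
    used = 0
    for candidates in candidate_sets:
        used += 1
        remaining = set(candidates) if remaining is None else remaining & set(candidates)
        if len(remaining) <= 1:
            break  # unique match, or no match at all: no further page is requested
    if remaining and len(remaining) == 1:
        return next(iter(remaining)), used
    return None, used
```

A result of `None` after all pages are consumed means the candidates were either exhausted or still ambiguous, in which case the caller would have to ask the user for yet another page.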

In the present invention, the document extracting apparatus includes, as the acquisition means for acquiring input document data, a scanner that optically reads a document, so that document data is extracted by reading a part of a document with the scanner.

In the present invention, the document extracting apparatus defines an output condition in advance for each document index, outputs the document data when the output condition is satisfied, and prohibits output of the document data when it is not, so that only documents whose document index satisfies its output condition are output.
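The output gating can be sketched as below. The passage does not say what an output condition consists of, so the form used here, a set of permitted user names per document index, is purely an assumption for illustration, as are all names in the code.

```python
# Hypothetical output conditions keyed by document index (an assumption):
# None means no restriction; otherwise the set of users allowed to output.
OUTPUT_CONDITIONS = {"docA": None, "docB": {"alice"}}


def may_output(doc_index, user):
    """Return True if the output condition for doc_index is satisfied for user."""
    allowed = OUTPUT_CONDITIONS.get(doc_index)  # unknown index: treated as unrestricted
    return allowed is None or user in allowed


def output_pages(doc_index, pages, user):
    """Return the pages to output, or an empty list when output is prohibited."""
    if not may_output(doc_index, user):
        return []
    return list(pages)
```

The check is made once per document index rather than per page, mirroring the text's association of the condition with the index.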

Furthermore, in the present invention, the document extracting apparatus includes means for forming an image based on document data, so that it can form images based on the extracted document data.

According to the present invention, document data corresponding to all pages of a document can be extracted on the basis of input document data corresponding to a part of a document composed of a plurality of pages. Therefore, even when a document composed of a plurality of pages has missing pages because of loss, soiling, or the like, the document data for all of its pages can be extracted easily from a database in which the document data has been stored in advance.

According to the present invention, a more reliable similarity determination based on a plurality of pieces of feature data can be made when determining the similarity of document data, which suppresses erroneous determinations that document data dissimilar to the input document data is highly similar to it.

According to the present invention, one document can be extracted on the basis of a plurality of pages, which further reduces the possibility of erroneously extracting document data other than the intended data. For example, the intended document data can be extracted even when documents similar to one another exist.

According to the present invention, using a plurality of pages allows a more reliable similarity determination, so that the desired document data can be extracted with high accuracy.

According to the present invention, reading a part of a document with a scanner makes it possible to extract, for example, document data stored in a server apparatus connected via a communication network, so that the data of an entire document consisting of photographs, text, or the like can be obtained easily from a part of the document.

According to the present invention, a document can be output only when its output condition is satisfied. Defining output conditions for highly important documents therefore prevents such documents from being output easily and protects the confidential information they contain.

Furthermore, according to the present invention, an image forming apparatus such as a digital copying machine or a multifunction machine equipped with a scanner can form images based on document data extracted from the document data stored in the image forming apparatus or in a server apparatus connected to the image forming apparatus via a communication network. A document consisting of photographs, text, or the like can thus be obtained easily through image formation. The present invention provides these and other excellent effects.

Hereinafter, the present invention will be described specifically with reference to the drawings showing its embodiments.
(Embodiment 1)
Embodiment 1 describes a form in which the document extracting apparatus of the present invention is an image forming apparatus that forms color images. FIG. 1 is a block diagram showing the internal functional configuration of a document extracting apparatus 100 of the present invention according to Embodiment 1. The document extracting apparatus 100 includes a control unit 11 that controls the operation of each unit constituting the document extracting apparatus 100, a storage unit (storage means) 12 composed of a semiconductor memory, a hard disk, or the like, and a color image input unit 13 that optically reads a color image. A color image processing unit 2, which generates image data corresponding to the read color image, is connected to the color image input unit 13. The color image input unit 13 reads a document consisting of photographs, text, or the like as a color image, and the storage unit 12 stores document data, that is, the image data generated by the color image processing unit 2 from the document read by the color image input unit 13. The storage unit 12 functions as the document storage means of the present invention, and the color image input unit 13 functions as the acquisition means of the present invention. A color image forming unit 14, which forms a color image based on the image data generated by the color image processing unit 2, is connected to the color image processing unit 2. An operation panel 15 that accepts operations from a user is connected to the color image input unit 13, the color image processing unit 2, and the color image forming unit 14.

The color image input unit 13 is a scanner equipped with a CCD (Charge Coupled Device). It separates the reflected light image from a document, which is a color image formed on a recording medium such as paper, into R (red), G (green), and B (blue), reads it with the CCD, converts it into RGB analog signals, and outputs them to the color image processing unit 2. The color image processing unit 2 performs the image processing described later on the RGB analog signals input from the color image input unit 13 to generate digital image data, then generates image data consisting of digital C (cyan), M (magenta), Y (yellow), and K (black) signals and outputs it to the color image forming unit 14. The color image forming unit 14 forms a color image from the image data input from the color image processing unit 2 by a method such as thermal transfer, electrophotography, or inkjet. The operation panel 15 includes a display unit, such as a liquid crystal display, that shows information needed to operate the document extracting apparatus 100, and a reception unit, such as a touch panel or numeric keypad, that accepts instructions for controlling the operation of the document extracting apparatus 100 through user operations.

The color image processing unit 2 converts the analog signal input from the color image input unit 13 into a digital signal in an A/D conversion unit 20, passes it in order through a shading correction unit 21, an input tone correction unit 22, a region separation processing unit 23, a document extraction processing unit 24, a color correction unit 25, a black generation and under color removal unit 26, a spatial filter processing unit 27, an output tone correction unit 28, and a tone reproduction processing unit 29, and outputs image data consisting of digital CMYK signals to the color image forming unit 14.

The A/D conversion unit 20 receives the analog RGB signals input from the color image input unit 13 to the color image processing unit 2, converts them into digital RGB signals, and outputs the RGB signals to the shading correction unit 21.

The shading correction unit 21 removes from the RGB signals input from the A/D conversion unit 20 the various distortions introduced by the illumination system, the image-forming (focusing) system, and the image-sensing system of the color image input unit 13. It then outputs the corrected RGB signals to the input tone correction unit 22.

The input tone correction unit 22 adjusts the color balance of the RGB signals input from the shading correction unit 21. The RGB signals it receives from the shading correction unit 21 are RGB reflectance signals, and the input tone correction unit 22 also converts them into signals, such as density signals, that are easier for the color image processing unit 2 to handle. It then outputs the processed RGB signals to the region separation processing unit 23.

The region separation processing unit 23 classifies each pixel of the image represented by the RGB signals input from the input tone correction unit 22 as belonging to a text region, a halftone region, or a photograph region. Based on the result, it outputs a region identification signal, which indicates the region to which each pixel belongs, to the black generation and under color removal unit 26, the spatial filter processing unit 27, and the tone reproduction processing unit 29. The region separation processing unit 23 also passes the RGB signals input from the input tone correction unit 22 on to the document extraction processing unit 24.

The document extraction processing unit 24 is connected to the storage unit 12. It transfers document data, which is image data consisting of RGB signals, to and from the storage unit 12, and executes the processing of the document extraction method of the present invention described later. The document extraction processing unit 24 also outputs to the color correction unit 25 either the image data consisting of RGB signals input from the region separation processing unit 23 or the image data that is document data input from the storage unit 12. Note that the document extraction processing unit 24 may be provided in parallel with the input tone correction unit 22 rather than downstream of the region separation processing unit 23.

The color correction unit 25 converts the RGB signals input from the document extraction processing unit 24 into CMY signals and, to achieve faithful color reproduction, removes from the CMY signals the color impurity that arises from the spectral characteristics of CMY color materials, which contain unwanted absorption components. It then outputs the color-corrected CMY signals to the black generation and under color removal unit 26.

The black generation and under color removal unit 26 performs black generation, which generates a K signal from the three CMY color signals input from the color correction unit 25, and then subtracts the K signal obtained by black generation from the original CMY signals, thereby converting the three-color CMY signal into a four-color CMYK signal. One example of black generation is the skeleton black method. In this method, if the input/output characteristic of the skeleton curve is y = f(x), the data before conversion are C, M, and Y, and the UCR (Under Color Removal) rate is α (0 < α < 1), the converted data C', M', Y', and K' are given by the following equations.
K' = f(min(C, M, Y))
C' = C − αK'
M' = M − αK'
Y' = Y − αK'

Here, the UCR rate α (0 < α < 1) indicates how much of CMY is removed when the portion where C, M, and Y overlap is replaced with K. The first equation shows that the K signal is generated according to the smallest of the C, M, and Y signal intensities. The black generation and under color removal unit 26 then outputs the CMYK signals converted from the CMY signals to the spatial filter processing unit 27.
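The four equations above can be sketched for a single pixel as follows. This is a minimal illustration, not the apparatus's implementation: the function name `black_generation_ucr`, the use of values normalized to the range 0 to 1, and the identity skeleton curve used as the default `f` are all assumptions made for the example.

```python
def black_generation_ucr(c, m, y, alpha=0.5, f=lambda x: x):
    """Convert one CMY pixel to CMYK by skeleton black generation with UCR.

    c, m, y : input densities (assumed here to be normalized to 0..1)
    alpha   : UCR rate, 0 < alpha < 1
    f       : skeleton curve y = f(x); the identity default is an assumption
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    k = f(min(c, m, y))           # K' = f(min(C, M, Y))
    return (c - alpha * k,        # C' = C - alpha * K'
            m - alpha * k,        # M' = M - alpha * K'
            y - alpha * k,        # Y' = Y - alpha * K'
            k)


if __name__ == "__main__":
    print(black_generation_ucr(0.8, 0.6, 0.4, alpha=0.5))
```

With a larger `alpha`, more of the common CMY component is replaced by black; the smallest of the three inputs alone determines K'.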

The spatial filter processing unit 27 applies spatial filtering with a digital filter to the image represented by the CMYK signals input from the black generation and under color removal unit 26, based on the region identification signal input from the region separation processing unit 23, in order to reduce blur and graininess in the image. For example, for a region classified as text by the region separation processing unit 23, the spatial filter processing unit 27 uses a filter that strongly enhances high-frequency components to improve the reproducibility of the characters. For a region classified as halftone, it applies low-pass filtering to remove the input halftone component. The spatial filter processing unit 27 then outputs the processed CMYK signals to the output tone correction unit 28.

The output tone correction unit 28 performs output tone correction on the CMYK signals input from the spatial filter processing unit 27, converting them into halftone dot area ratios, which are characteristic values of the color image forming unit 14, and outputs the corrected CMYK signals to the tone reproduction processing unit 29.

The tone reproduction processing unit 29 processes the CMYK signals input from the output tone correction unit 28, based on the region identification signal input from the region separation processing unit 23, so that the tones appropriate to each region can be expressed while the number of tone levels per pixel is reduced. For example, for a region classified as text by the region separation processing unit 23, the tone reproduction processing unit 29 performs binarization or tone reduction with a high-resolution screen suited to reproducing high-frequency components. For a region classified as halftone, it performs tone reproduction processing that finally separates the image into pixels so that each tone can be reproduced. The tone reproduction processing unit 29 then outputs the processed image data to the color image forming unit 14.

The color image forming unit 14 forms a CMYK color image on a record carrier such as paper, based on the image data consisting of CMYK signals input from the color image processing unit 2. By forming an image from image data that is document data, the color image forming unit 14 outputs a document consisting of photographs, text, or the like.

Next, the configuration of the document extraction processing unit 24 and the processing it performs are described. FIG. 2 is a block diagram showing the configuration of the document extraction processing unit 24. The document extraction processing unit 24 includes a feature point extraction unit 241 that extracts feature points corresponding to the characters, figures, and the like on the document represented by the input document data; a feature data calculation unit 242 that calculates, from the feature points, feature data indicating the features of the document data; a voting processing unit 243 that votes for the document data stored in the storage unit 12 based on the feature data; a similarity determination processing unit 244 that determines the similarity of document data based on the voting result; and a document extraction unit 245 that extracts specific document data from the storage unit 12.

FIG. 3 is a block diagram showing the configuration of the feature point extraction unit 241. The feature point extraction unit 241 includes an achromatization processing unit 2410 that removes color from the document data, a resolution conversion unit 2411 that converts the document data to a predetermined resolution, a filter processing unit 2412 that corrects the spatial frequency characteristics of the document data, a binarization processing unit 2413 that binarizes the document data, and a centroid extraction unit 2414 that extracts the centroids of characters and the like.

When the input document data is color image data, the achromatization processing unit 2410 removes the color from the image, converts it into a luminance signal or a lightness signal, and outputs the converted document data to the resolution conversion unit 2411. For example, if the intensities of the R, G, and B color components of each pixel are Rj, Gj, and Bj and the luminance of each pixel is Yj, the luminance signal Y can be expressed as Yj = 0.30 × Rj + 0.59 × Gj + 0.11 × Bj. Alternatively, the color image may be achromatized by converting the RGB signals into CIE (Commission Internationale de l'Eclairage) 1976 L*a*b* signals.
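The luminance conversion Yj = 0.30 × Rj + 0.59 × Gj + 0.11 × Bj can be sketched as below. The function name `to_luminance` and the representation of an image as a list of (R, G, B) tuples are assumptions made for illustration only.

```python
def to_luminance(rgb_pixels):
    """Return the luminance Yj of each pixel from its (Rj, Gj, Bj) intensities.

    rgb_pixels: iterable of (R, G, B) tuples; the value range (for example
    0..255) is carried through unchanged because the weights sum to 1.
    """
    return [0.30 * r + 0.59 * g + 0.11 * b for (r, g, b) in rgb_pixels]


if __name__ == "__main__":
    # A pure-green pixel is brighter than a pure-blue one under these weights.
    print(to_luminance([(0, 255, 0), (0, 0, 255)]))
```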

The resolution conversion unit 2411 scales the input document data so that it has a predetermined resolution and outputs the converted document data to the filter processing unit 2412. As a result, even when the color image input unit 13 has optically enlarged or reduced the document and the resolution of the document data has changed, feature points can be extracted without being affected by the change. The resolution conversion unit 2411 also converts the data to a resolution lower than that at which the color image input unit 13 reads at unity magnification. For example, document data read by the color image input unit 13 at 600 dpi (dots per inch) is converted to 300 dpi. This reduces the amount of processing in the subsequent stages.

The filter processing unit 2412 corrects the spatial frequency characteristics of the input document data by image enhancement, smoothing, and similar processing, and outputs the corrected image to the binarization processing unit 2413. This processing compensates for the fact that the spatial frequency characteristics of the color image input unit 13 differ from model to model. The image signal output by the CCD of the color image input unit 13 is blurred by factors such as optical components (lenses, mirrors, and the like), the aperture of the CCD's light-receiving surface, transfer efficiency, afterimages, the integration effect of physical scanning, and scanning unevenness. The filter processing unit 2412 repairs this degradation of the document data by enhancing boundaries, edges, and the like. It also performs smoothing to suppress high-frequency components that are not needed by the feature point extraction performed in the subsequent stage.

FIG. 4 is an explanatory diagram showing an example of the spatial filter used by the filter processing unit 2412. As shown in the figure, the spatial filter is, for example, a 7 × 7 mixed filter that performs both enhancement and smoothing. The pixels of the input document data are scanned, and the spatial filter operation is applied to every pixel. The size of the spatial filter is not limited to 7 × 7 and may be 3 × 3, 5 × 5, or another size. The filter coefficient values are also only an example; they can be set as appropriate for the model or characteristics of the color image input unit 13.

The binarization processing unit 2413 binarizes the document data by comparing the luminance or lightness value of each pixel in the input document data with a predetermined threshold, and outputs the binarized document data to the centroid extraction unit 2414.

The centroid extraction unit 2414 labels each pixel of the document data input from the binarization processing unit 2413 according to its binarized pixel value. There are two kinds of labels: when pixel values are 0 or 1, pixels with value 0 receive one label and pixels with value 1 receive the other. The centroid extraction unit 2414 then identifies connected regions, in which pixels with the same label are connected, extracts the centroid of each identified connected region as a feature point, and outputs the extracted feature points to the feature data calculation unit 242. A feature point can be expressed as coordinate values on the binary image represented by the document data.

FIG. 5 is an explanatory diagram showing an example of the feature point of a connected region. In FIG. 5, the identified connected region is the character "A", identified as a set of pixels with the same label. The centroid of the character "A" lies at the position marked by the black dot in FIG. 5, and this centroid is the feature point. FIG. 6 is an explanatory diagram showing an example of the feature points extracted from a character string. For a character string consisting of several characters, feature points are extracted at positions that differ according to the kind of character. Feature points can be extracted in the same way not only from characters but also from figures and photographs. The extraction method described here is only an example, and other methods may be used. For example, a character string may be split into words and the centroid of each word extracted as a feature point.
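The binarization, labeling, and centroid steps described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the method of the apparatus: the names `binarize` and `centroids`, the threshold of 128, the choice to treat dark pixels as value 1, the use of 8-connectivity, and the restriction to regions of value 1 are all assumptions.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Compare each pixel with a threshold; dark pixels become 1 (assumption)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def centroids(binary):
    """Return the centroid (x, y) of every 8-connected region of 1-pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    points = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected region.
            queue = deque([(x, y)])
            seen[y][x] = True
            sum_x = sum_y = count = 0
            while queue:
                cx, cy = queue.popleft()
                sum_x += cx
                sum_y += cy
                count += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            points.append((sum_x / count, sum_y / count))
    return points
```

Each returned coordinate pair plays the role of one feature point, expressed in the coordinate system of the binary image.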

The feature data calculation unit 242 calculates feature data indicating the features of the input document data, based on the feature points input from the feature point extraction unit 241. An example of the calculation follows. The feature data calculation unit 242 takes each feature point input from the feature point extraction unit 241 in turn as the feature point of interest and extracts four other feature points close to it.

FIG. 7 is an explanatory diagram showing a feature point of interest and the extracted feature points. As shown in FIG. 7, the feature data calculation unit 242 takes one feature point as the feature point of interest and extracts, as peripheral feature points, a predetermined number (four here) of the feature points around it, in order of increasing distance from the feature point of interest. In the example of FIG. 7, when feature point a is the feature point of interest P1, the four feature points b, c, d, and e enclosed by the closed curve C1 are extracted as peripheral feature points; when feature point b is the feature point of interest P2, the four feature points a, c, e, and f enclosed by the closed curve C2 are extracted as peripheral feature points.
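Selecting the peripheral feature points amounts to a nearest-neighbor query. A minimal sketch, assuming feature points are (x, y) tuples and using a hypothetical helper name `nearest_neighbors`, sorts the other points by Euclidean distance; a real implementation might use a spatial index instead of sorting.

```python
import math


def nearest_neighbors(points, index, count=4):
    """Return the `count` feature points closest to points[index].

    The result is ordered nearest first and never contains the feature
    point of interest itself.
    """
    target = points[index]
    others = [p for j, p in enumerate(points) if j != index]
    others.sort(key=lambda p: math.dist(target, p))
    return others[:count]
```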

The feature data calculation unit 242 also extracts combinations of three points from the four extracted peripheral feature points. FIG. 8 is an explanatory diagram showing an example in which three peripheral feature points are extracted for the feature point of interest P1 and feature data is calculated. As shown in FIGS. 8(a) to 8(d), when feature point a of FIG. 7 is the feature point of interest P1, every combination of three points selected from the peripheral feature points b, c, d, and e is extracted, namely (b, c, d), (b, c, e), (b, d, e), and (c, d, e).

Next, the feature data calculation unit 242 calculates, for each extracted combination, an invariant Hij (one kind of feature quantity) with respect to geometric transformation. Here, i is a number identifying the feature point of interest (an integer of 1 or more), and j is a number identifying the combination of three peripheral feature points (an integer of 1 or more). In this embodiment, the invariant Hij is the ratio of two of the lengths of the line segments connecting the peripheral feature points. The length of each line segment can be calculated from the coordinate values of the peripheral feature points. For example, in FIG. 8(a), let A11 be the length of the segment connecting feature points b and c and B11 the length of the segment connecting b and d; the invariant H11 is then H11 = A11/B11. In FIG. 8(b), let A12 be the length of the segment connecting b and c and B12 the length of the segment connecting b and e; then H12 = A12/B12. In FIG. 8(c), let A13 be the length of the segment connecting b and d and B13 the length of the segment connecting b and e; then H13 = A13/B13. In FIG. 8(d), let A14 be the length of the segment connecting c and d and B14 the length of the segment connecting c and e; then H14 = A14/B14. In this way, the invariants H11, H12, H13, and H14 are calculated for the examples of FIGS. 8(a) to 8(d). In these examples, the combination of the peripheral feature points that are first, second, and third closest to the feature point of interest is j = 1; the first, second, and fourth closest, j = 2; the first, third, and fourth closest, j = 3; and the second, third, and fourth closest, j = 4. In addition, among the three peripheral feature points of a combination, Aij is the segment connecting the point closest to the feature point of interest and the second closest, and Bij is the segment connecting the closest and the third closest. The ordering of the combinations and the choice of segments used to calculate Hij are not limited to the method of these examples; any method may be used, such as one based on the lengths of the segments connecting the peripheral feature points.
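The invariants Hi1 to Hi4 for one feature point of interest can be sketched as follows. The sketch assumes the four peripheral feature points are passed in nearest-first order, so that `itertools.combinations` yields the triples in the order j = 1 to 4 described above, and that all points are distinct so that no segment has zero length. The function name `invariants` is hypothetical.

```python
import math
from itertools import combinations


def invariants(neighbors):
    """Return [Hi1, Hi2, Hi3, Hi4] for four peripheral points, nearest first.

    For each triple (p, q, r), ordered by closeness to the feature point of
    interest, Aij is the length of segment p-q and Bij the length of p-r,
    and the invariant is Hij = Aij / Bij.
    """
    result = []
    for p, q, r in combinations(neighbors, 3):
        a = math.dist(p, q)   # Aij: closest to second closest
        b = math.dist(p, r)   # Bij: closest to third closest
        result.append(a / b)
    return result
```

Because each invariant is a ratio of two lengths, uniformly enlarging or reducing the page leaves the values unchanged, which is why they suit documents scanned at different magnifications.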

Next, the feature data calculation unit 242 calculates the remainder of the following expression as the hash value (feature data) Hi and stores it in the storage unit 12. D in the expression is a constant set in advance according to the range of values the remainder is to take.
(Hi1 × 10³ + Hi2 × 10² + Hi3 × 10¹ + Hi4 × 10⁰) / D
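The hash calculation can be sketched as below. The text does not say how the real-valued ratios Hij are turned into integers before the remainder is taken, so the `quantize` step, its `levels` and `h_max` parameters, and the default D = 127 are assumptions introduced only to make the example runnable.

```python
def quantize(h, levels=10, h_max=5.0):
    """Hypothetical quantization: map a ratio to an integer in 0..levels-1."""
    return min(levels - 1, int(h / h_max * levels))


def hash_value(invariants, d=127):
    """Remainder of (Hi1*10^3 + Hi2*10^2 + Hi3*10^1 + Hi4*10^0) divided by D."""
    h1, h2, h3, h4 = (quantize(h) for h in invariants)
    return (h1 * 10**3 + h2 * 10**2 + h3 * 10**1 + h4 * 10**0) % d
```

Whatever quantization is chosen, the result always lies between 0 and D − 1, so D fixes the number of distinct feature data values.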

After extracting the peripheral feature points and calculating the hash value Hi for one feature point of interest, the feature data calculation unit 242 takes another feature point as the next feature point of interest and again extracts peripheral feature points and calculates a hash value. In this way it calculates a hash value with each feature point in turn as the feature point of interest.

In the example of FIG. 7, after extracting the peripheral feature points and calculating the hash value H1 with feature point a as the feature point of interest P1, the feature data calculation unit 242 extracts the peripheral feature points and calculates the hash value H2 with feature point b as the feature point of interest P2. As shown in FIG. 7, when feature point b is the feature point of interest P2, the four feature points a, c, e, and f are extracted as peripheral feature points. FIG. 9 is an explanatory diagram showing an example in which three peripheral feature points are extracted for the feature point of interest P2 and feature data is calculated. As shown in FIGS. 9(a) to 9(d), the feature data calculation unit 242 extracts the combinations of three points from the peripheral feature points a, c, e, and f, namely (a, e, f), (a, c, e), (a, f, c), and (e, f, c), and calculates the invariant Hij for each combination. As with the feature point of interest P1 in FIG. 8, for the feature point of interest P2 the invariant H21 is calculated as H21 = A21/B21 as shown in FIG. 9(a), H22 as H22 = A22/B22 as shown in FIG. 9(b), H23 as H23 = A23/B23 as shown in FIG. 9(c), and H24 as H24 = A24/B24 as shown in FIG. 9(d). The feature data calculation unit 242 then calculates the hash value H2 from the invariants H21, H22, H23, and H24 and stores it in the storage unit 12. It repeats the same processing with each feature point as the feature point of interest, obtaining the hash value Hi for each and storing it in the storage unit 12.

In this way, the feature data calculation unit 242 calculates feature data, namely the hash value Hi, for each feature point, and treats the resulting set of feature data as the feature data of the document data. The feature data calculation unit 242 functions as the generation means of the present invention.

The method of calculating feature data described here is only an example, and other methods may be used. For example, the feature data may be calculated with another predetermined hash function. When extracting feature points close to the feature point of interest, a number other than four, such as five or six, may be extracted. It is also possible to calculate several pieces of feature data for a single feature point of interest, for example by selecting three feature points from five extracted feature points, calculating feature data from the distances between the three points, and doing so for every combination of three points that can be selected from the five.

The feature data calculated by the feature data calculation unit 242 is stored in the storage unit 12 in association with the document data. For each document, each of which consists of a plurality of pages, the storage unit 12 stores the document data corresponding to each page. It also stores a document table that associates document data with documents and a feature table that associates document data with feature data. The storage unit 12 functions as the feature data storage means of the present invention.

FIG. 10 is a conceptual diagram showing the document data stored in the storage unit 12. A plurality of document data corresponding to the pages of each document are stored, and each document data is given a page index ID1, ID2, ... that individually identifies it. FIG. 11 is a conceptual diagram showing an example of the contents of the document table, stored in the storage unit 12, that associates document data with documents. Document indexes Doc1, Doc2, ... that individually identify documents are recorded, and the page indexes identifying the document data corresponding to the pages of each document are recorded in association with the document index. The table also records the number of pages of each document, and as many page indexes as there are pages are associated with the document index. Because the page indexes are associated with the document indexes, the storage unit 12 stores the document indexes and the document data in association with each other, as shown in FIG. 10.
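A document table of this kind can be pictured as a mapping from document index to page indexes. The following is a minimal sketch; the table contents and the helper names `page_count` and `document_of` are hypothetical and are not taken from FIG. 11.

```python
# Document table in the spirit of FIG. 11: document index -> page indexes.
# The entries below are made-up sample values.
document_table = {
    "Doc1": ["ID1", "ID2", "ID3"],
    "Doc2": ["ID4", "ID5"],
}


def page_count(doc_index):
    """Number of pages recorded for a document."""
    return len(document_table[doc_index])


def document_of(page_index):
    """Return the document index associated with a page index, or None."""
    for doc_index, page_indexes in document_table.items():
        if page_index in page_indexes:
            return doc_index
    return None


print(page_count("Doc1"))   # 3
print(document_of("ID4"))   # Doc2
```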

FIG. 12 is a conceptual diagram showing an example of the contents of the feature table, stored in the storage unit 12, that associates document data with feature data. The figure shows an example in which the feature data, which is a hash value, is calculated with E = 127. Each feature data value from 0 to 126 is recorded, and the page index of each document data is recorded in association with the feature data calculated for that document data. Since the same feature data may be calculated for more than one document data, a plurality of page indexes may be associated with each feature data. Conversely, since a plurality of feature data are calculated for one document data, the page index of one document data is associated with a plurality of feature data. Because the page indexes are associated with the feature data, the storage unit 12 stores the feature data and the document data in association with each other.
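The feature table behaves like an inverted index from hash value to page indexes. A minimal sketch follows; the name `add_page` and the choice to record each page at most once per hash value are assumptions made for this example.

```python
from collections import defaultdict

# Feature table in the spirit of FIG. 12: hash value -> page indexes.
feature_table = defaultdict(list)


def add_page(page_index, feature_data):
    """Associate a page index with every hash value calculated for that page."""
    # Assumption: a page is listed once per hash value, even if several of
    # its feature points produce the same value.
    for value in set(feature_data):
        feature_table[value].append(page_index)


add_page("ID1", [3, 5, 5])
add_page("ID2", [5, 9])
print(feature_table[5])  # ['ID1', 'ID2']
```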

The voting processing unit 243 searches the feature table stored in the storage unit 12 based on the feature data calculated by the feature data calculation unit 242, and votes for the document data indicated by each page index associated with feature data that matches the calculated feature data. When a plurality of page indexes are associated with one feature data, a vote is cast for every document data associated with that feature data. Since the feature data calculation unit 242 calculates a plurality of feature data for the input document data, voting is performed for each feature data, and document data similar to the input document data receives votes multiple times. The voting processing unit 243 outputs the result of voting for the plurality of feature data calculated by the feature data calculation unit 242 to the similarity determination processing unit 244.
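The voting step can be sketched as a lookup in the feature table followed by a tally. The function name `vote` and the sample table are hypothetical; only the one-vote-per-matching-page-index rule comes from the description.

```python
from collections import Counter


def vote(feature_table, input_feature_data):
    """Cast one vote per page index for each matching input hash value."""
    votes = Counter()
    for value in input_feature_data:
        # Every page index associated with this hash value receives a vote.
        for page_index in feature_table.get(value, []):
            votes[page_index] += 1
    return votes


feature_table = {3: ["ID1"], 5: ["ID1", "ID2"], 9: ["ID2"]}
votes = vote(feature_table, [3, 5, 7])
print(votes["ID1"], votes["ID2"])  # 2 1
```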

Based on the voting result input from the voting processing unit 243, the similarity determination processing unit 244 determines which of the document data stored in the storage unit 12 the input document data is similar to, and outputs the determination result to the document extraction unit 245. Specifically, the similarity determination processing unit 244 examines the number of votes obtained by each document data stored in the storage unit 12 and determines that the document data with the largest number of votes is similar to the input document data. Alternatively, the similarity determination processing unit 244 may normalize the number of votes by dividing the number of votes of each document data by the maximum possible number of votes, which is the number of feature data calculated by the feature data calculation unit 242, and determine that document data whose normalized number of votes is equal to or greater than a predetermined threshold is similar to the input document data. When there is document data similar to the input document data, the determination result output by the similarity determination processing unit 244 includes the page index of the similar document data. The voting processing unit 243 and the similarity determination processing unit 244 function as the determination means in the present invention.
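Both determination rules described above can be sketched briefly. The function names and the threshold value 0.5 are assumptions for illustration; the description only states that the threshold is predetermined.

```python
def most_voted(votes):
    """Rule 1: the page index with the largest number of votes, or None."""
    if not votes:
        return None
    return max(votes, key=votes.get)


def similar_pages(votes, num_input_features, threshold):
    """Rule 2: page indexes whose normalized vote count reaches the threshold."""
    if num_input_features == 0:
        return []
    # Normalize by the maximum possible number of votes, i.e. the number of
    # feature data calculated for the input document data.
    return [page for page, n in votes.items()
            if n / num_input_features >= threshold]


votes = {"ID1": 8, "ID2": 3}
print(most_voted(votes))              # ID1
print(similar_pages(votes, 10, 0.5))  # ['ID1']
```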

The document extraction unit 245 searches the document table stored in the storage unit 12 based on the page index included in the determination result input from the similarity determination processing unit 244, and acquires the document index associated with that page index. This identifies the document that contains the page corresponding to the document data determined to be similar to the input document data. The document extraction unit 245 then extracts the plurality of document data indicated by the plurality of page indexes associated with the acquired document index, and outputs the extracted document data to the color correction unit 25. In this way, document data corresponding to all pages of the identified document is extracted. The document extraction unit 245 functions as the extraction means in the present invention.
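The two lookups performed by the document extraction unit 245 can be sketched as follows. The function name `extract_document` and the sample tables are hypothetical, and the stored document data is represented by placeholder strings.

```python
def extract_document(page_index, document_table, stored_pages):
    """Return the document data of every page of the document containing page_index."""
    for doc_index, page_indexes in document_table.items():
        if page_index in page_indexes:
            # All page indexes of the identified document, in recorded order.
            return [stored_pages[p] for p in page_indexes]
    return []  # the page index is not registered in any document


document_table = {"Doc1": ["ID1", "ID2"], "Doc2": ["ID3"]}
stored_pages = {"ID1": "page-1 data", "ID2": "page-2 data", "ID3": "page-3 data"}
print(extract_document("ID2", document_table, stored_pages))
# ['page-1 data', 'page-2 data']
```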

Next, the processing executed by the document extraction apparatus 100 of the present invention configured as described above will be described. The document extraction apparatus 100 executes a process of reading a document composed of a plurality of pages and registering its document data, and a process of reading part of a document and extracting the document data corresponding to all pages of that document. The process of extracting document data corresponding to all pages of a document from part of the document is the process according to the document extraction method of the present invention. FIG. 13 is a flowchart showing the procedure for registering document data.

The control unit 11 of the document extraction apparatus 100 waits at all times to receive a document data registration instruction entered by the user through the operation panel 15 (S11). If no registration instruction has been received (S11: NO), the control unit 11 continues to wait. When a registration instruction is received (S11: YES), the user sets a document consisting of a plurality of pages on the document extraction apparatus 100, and the color image input unit 13 optically reads each page to acquire a plurality of document data, each of which is image data composed of RGB signals (S12). The color image input unit 13 outputs the document data to the color image processing unit 2, where the document data is processed in order by the A/D conversion unit 20, the shading correction unit 21, the input tone correction unit 22, and the region separation processing unit 23, and the control unit 11 stores the document data in the storage unit 12 (S13).

In the document extraction processing unit 24, the feature point extraction unit 241 extracts a plurality of feature points from one document data by the processing described above (S14), and the feature data calculation unit 242 calculates feature data for each feature point by the processing described above, thereby obtaining a plurality of feature data representing the features of that document data (S15). The control unit 11 then generates a page index identifying the document data and sets it by attaching it to the document data stored in the storage unit 12 (S16). At this time, the control unit 11 generates a unique page index based on, for example, the order in which the document data was input or the date and time. The control unit 11 then updates the feature table shown in FIG. 12 by associating the feature data calculated by the feature data calculation unit 242 with the page index of the document data (S17).

Next, the control unit 11 determines whether the process of associating feature data has been completed for all input document data (S18). If there is document data for which the association has not yet been performed (S18: NO), the control unit 11 returns the process to step S14, and the feature point extraction unit 241 extracts feature points from document data that has not yet been processed. If the process has been completed for all document data (S18: YES), the control unit 11 sets a document index by generating a document index that identifies the document composed of the plurality of pages corresponding to the acquired document data (S19). Here, the control unit 11 generates the document index from the date and time or the like. The control unit 11 may instead accept a document index chosen by the user through the operation panel 15.

Next, the control unit 11 updates the document table stored in the storage unit 12 by associating the generated document index with the page indexes of the document data (S20), and ends the process. Through the above processing, the document data of a document consisting of a plurality of pages is stored in the storage unit 12.
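The bookkeeping part of the registration procedure (S13 and S16 to S20) can be sketched as follows. This is a simplified illustration: the class `Storage`, its method `register`, and the `ID<n>` numbering scheme are assumptions for this example, and feature point extraction and hashing (S14, S15) are taken as already done.

```python
from collections import defaultdict
from itertools import count


class Storage:
    """Hypothetical stand-in for the storage unit 12."""

    def __init__(self):
        self.pages = {}                         # page index -> document data
        self.feature_table = defaultdict(list)  # hash value -> page indexes
        self.document_table = {}                # document index -> page indexes
        self._next_page = count(1)

    def register(self, doc_index, pages):
        """pages: list of (document_data, feature_data) in reading order."""
        page_indexes = []
        for data, feature_data in pages:
            page_index = f"ID{next(self._next_page)}"  # S16: unique page index
            self.pages[page_index] = data              # S13: store the page
            for value in set(feature_data):            # S17: update feature table
                self.feature_table[value].append(page_index)
            page_indexes.append(page_index)
        self.document_table[doc_index] = page_indexes  # S19, S20
        return page_indexes


storage = Storage()
print(storage.register("Doc1", [("page A", [1, 2]), ("page B", [2, 3])]))
# ['ID1', 'ID2']
```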

FIG. 14 is a flowchart showing the procedure for extracting document data. The control unit 11 of the document extraction apparatus 100 waits at all times to receive a document data extraction instruction entered by the user through the operation panel 15 (S31). If no extraction instruction has been received (S31: NO), the control unit 11 continues to wait. When an extraction instruction is received (S31: YES), the user sets some of the pages of a document consisting of a plurality of pages on the document extraction apparatus 100, and the color image input unit 13 optically reads the set pages to acquire input document data, which is image data composed of RGB signals (S32).

The color image input unit 13 outputs the input document data to the color image processing unit 2, where the input document data is processed in order by the A/D conversion unit 20, the shading correction unit 21, the input tone correction unit 22, and the region separation processing unit 23. In the document extraction processing unit 24, the feature point extraction unit 241 extracts a plurality of feature points from the input document data (S33). The feature data calculation unit 242 calculates feature data for each feature point extracted by the feature point extraction unit 241, thereby obtaining a plurality of feature data representing the features of the input document data (S34).

Next, for each feature data calculated by the feature data calculation unit 242, the voting processing unit 243 searches the feature table stored in the storage unit 12 and votes for the document data indicated by the page indexes associated with that feature data (S35). Based on the voting result from the voting processing unit 243, the similarity determination processing unit 244 determines which of the document data stored in the storage unit 12 the input document data is similar to (S36). At this time, the similarity determination processing unit 244 determines that the document data having a high similarity to the input document data is either the document data with the largest number of votes among those that obtained at least a minimum number of votes, or the document data whose normalized number of votes is equal to or greater than a predetermined threshold.

Next, the control unit 11 determines whether the determination result from the similarity determination processing unit 244 indicates that document data with a high similarity exists (S37). If the determination result indicates that no such document data exists (S37: NO), the control unit 11 outputs information indicating that there is no document similar to the document the user had the color image input unit 13 read (S38). Specifically, the control unit 11 either displays text information indicating that there is no similar document on the display unit of the operation panel 15, or causes the color image forming unit 14 to form an image stating in text that there is no similar document. After step S38, the document extraction apparatus 100 ends the process of extracting document data.

If the determination result in step S37 indicates that document data with a high similarity exists (S37: YES), the document extraction unit 245 searches the document table stored in the storage unit 12 and acquires the document index associated with the page index of the document data that the similarity determination processing unit 244 determined to have a high similarity to the input document data (S39). Next, the control unit 11 determines whether a plurality of input document data corresponding to a plurality of pages has been acquired (S40). If the acquired input document data corresponds to a single page (S40: NO), the document extraction unit 245 extracts the plurality of document data indicated by the plurality of page indexes associated with the acquired document index in the document table (S43). In this way, all document data of the document that contains the page corresponding to the document data with a high similarity to the input document data is extracted.

The document extraction unit 245 outputs the extracted document data to the color correction unit 25. The document data is processed in order by the color correction unit 25, the black generation and under color removal unit 26, the spatial filter processing unit 27, the output tone correction unit 28, and the tone reproduction processing unit 29, and the color image processing unit 2 outputs the document data to the color image forming unit 14. The color image forming unit 14 forms images based on the plurality of document data, which is image data, and thereby performs a document output process that outputs a document composed of the plurality of pages corresponding to the plurality of document data (S44). After step S44, the document extraction apparatus 100 ends the process of extracting document data.

If a plurality of input document data corresponding to a plurality of pages has been acquired in step S40 (S40: YES), the control unit 11 determines whether the document indexes acquired for the respective input document data match (S41). If the document indexes do not match (S41: NO), the control unit 11 advances the process to step S38 and outputs information indicating that there is no similar document.

If the document indexes match in step S41 (S41: YES), the control unit 11 determines whether the similarity determination has been completed for all input document data (S42). If there is input document data for which the similarity determination has not yet been performed (S42: NO), the control unit 11 returns the process to step S33, and the feature point extraction unit 241 extracts feature points from input document data that has not yet been processed. If the process has been completed for all input document data (S42: YES), the document extraction apparatus 100 advances the process to step S43, extracts the document data of the document that contains the pages corresponding to the document data with a high similarity to the input document data, and outputs the document.
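The checks in steps S37 and S39 to S43 for several input pages can be sketched as follows. The flowchart processes input pages one at a time, whereas this simplified sketch takes the per-page results as a list; the function name `extract_for_inputs` and the sample table are hypothetical.

```python
def extract_for_inputs(matched_pages, document_table):
    """matched_pages: for each input page, the page index judged similar, or None.

    Returns the page indexes of the identified document, or None when the
    apparatus would report that there is no similar document (S38).
    """
    doc_indexes = set()
    for page_index in matched_pages:
        if page_index is None:
            return None  # S37: NO, no similar document data for this page
        # S39: look up the document index associated with the page index.
        doc_index = next((d for d, pages in document_table.items()
                          if page_index in pages), None)
        if doc_index is None:
            return None
        doc_indexes.add(doc_index)
    if len(doc_indexes) != 1:
        return None  # S41: NO, the document indexes do not match
    return document_table[doc_indexes.pop()]  # S43: all pages of the document


document_table = {"Doc1": ["ID1", "ID2", "ID3"], "Doc2": ["ID4", "ID5"]}
print(extract_for_inputs(["ID1", "ID3"], document_table))  # ['ID1', 'ID2', 'ID3']
print(extract_for_inputs(["ID1", "ID4"], document_table))  # None
```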

In the processing described above, it is assumed that only one document data has a high similarity to the input document data. However, when a plurality of document data have a normalized number of votes equal to or greater than the predetermined threshold, the document extraction apparatus 100 may determine that all of them have a high similarity to the input document data. In this case, the apparatus may output the documents associated with each of these document data together, or may display on the display unit of the operation panel 15 an image of the page corresponding to each document data determined to have a high similarity and have the user select the correct document data.

As described in detail above, in the present invention the document extraction apparatus 100 stores, in the storage unit 12, document data corresponding to each page of a document, and further stores feature data representing the features of the document data and a document index identifying the document, both in association with the document data. When the document extraction apparatus 100 acquires input document data, it generates feature data from the input document data, determines the similarity to stored document data based on the feature data, acquires the document index associated with the document data having a high similarity to the input document data, and extracts the plurality of document data associated with the acquired document index. In this way, the document containing the page corresponding to the document data determined to be similar to the input document data is identified, and the document data corresponding to all pages of the identified document is extracted. That is, the document data corresponding to all pages of a document can be extracted based on input document data corresponding to only part of a document composed of a plurality of pages. Therefore, even when pages of a multi-page document are missing because of loss, soiling, or the like, document data covering all pages of the document can easily be extracted from a database in which the document data has been stored in advance.

To determine the similarity of document data, the document extraction apparatus 100 of the present invention stores a plurality of feature data for each document data, votes, for each feature data generated from the input document data, for the document data associated with identical feature data, and treats the document data that obtained the largest number of votes, or a number of votes equal to or greater than a predetermined amount, as document data having a high similarity to the input document data. Since document data for which many of the plurality of feature data match is determined to have a high similarity, the similarity determination is more reliable. This makes it possible to avoid, as far as possible, extracting unintended document data as a result of wrongly determining that document data dissimilar to the input document data has a high similarity.

The document extraction apparatus of the present invention also acquires a plurality of input document data and, when the document indexes associated with the document data having a high similarity to the respective input document data match, extracts the plurality of document data associated with the matching document index. This makes it possible to extract a document based on a plurality of pages and further reduces the possibility of mistakenly extracting unintended document data. For example, even when documents that resemble one another exist, the intended document data can be extracted reliably.

Further, in the present invention, feature points corresponding to the centroids of characters, figures, photographs, and the like on the document represented by the document data are extracted from the document data, and feature data expressed as numerical values is calculated based on the relative positional relationship of the extracted feature points. Since document data is searched by comparing the feature data calculated in this way, the amount of data needed for the search is greatly reduced compared with a conventional search that compares bitmap data or a search that compares feature quantities consisting of a large number of character codes extracted from the document. Accordingly, the present invention reduces the time required to search for document data compared with the prior art. In addition, since document data is searched by comparing feature data obtained from the relative positional relationship of a plurality of feature points, there is no need to align images between document data. Accordingly, the present invention can search for document data with higher accuracy than the prior art.

Although this embodiment handles document data that is color image data, the present invention is not limited to this, and the document extraction apparatus 100 of the present invention may handle monochrome document data.

Although this embodiment uses the color image input unit 13, which is a scanner, as the acquisition means in the present invention, the present invention is not limited to this. The document extraction apparatus 100 of the present invention may instead include, as the acquisition means, an interface that receives document data from an external scanner or PC. The document data according to the present invention is also not limited to image data obtained by optically reading a document, and may be application data such as text data created on a PC with an application program. In this case, the document extraction apparatus 100 accepts the document data, which is application data, through the interface serving as the acquisition means and executes the processing according to the present invention.

Although this embodiment registers acquired document data and extracts the required document data from the registered document data, the present invention is not limited to this. The document extraction apparatus 100 of the present invention may extract document data without performing the registration process, for example by having a storage unit 12 that already stores document data attached to it. Likewise, although this embodiment extracts the required document data from document data stored in the storage unit 12 built into the document extraction apparatus 100, the present invention is not limited to this. The document extraction apparatus 100 of the present invention may extract the required document data from document data stored in external storage means, such as a storage device or server device connected via a communication network.

(Embodiment 2)
Embodiment 2 describes a configuration in which, when there are a plurality of document data having a high similarity to the input image data, further input image data is acquired to narrow down the image data. The internal configuration of the document extraction apparatus according to this embodiment is the same as in Embodiment 1, described with reference to FIGS. 1 to 3. The contents stored in the storage unit 12 according to this embodiment are the same as in Embodiment 1, described with reference to FIGS. 11 and 12. The process by which the document extraction apparatus according to this embodiment registers document data is the same as in Embodiment 1, described with reference to the flowchart of FIG. 13.

FIGS. 15 and 16 are flowcharts showing the procedure by which the document extraction apparatus according to Embodiment 2 extracts document data. The control unit 11 of the document extraction apparatus 100 waits at all times to receive a document data extraction instruction entered by the user through the operation panel 15 (S501). If no extraction instruction has been received (S501: NO), the control unit 11 continues to wait. When an extraction instruction is received (S501: YES), the user sets some of the pages of a document consisting of a plurality of pages on the document extraction apparatus 100, and the color image input unit 13 optically reads one of the set pages to acquire input document data, which is image data composed of RGB signals (S502).

The color image input unit 13 outputs the input document data to the color image processing unit 2, where it is processed in order by the A/D conversion unit 20, the shading correction unit 21, the input tone correction unit 22, and the region separation processing unit 23. In the document extraction processing unit 24, the feature point extraction unit 241 extracts a plurality of feature points from the input document data (S503). The feature data calculation unit 242 calculates feature data for each extracted feature point, thereby obtaining a plurality of pieces of feature data representing the features of the input document data (S504).

Next, for each piece of feature data calculated by the feature data calculation unit 242, the voting processing unit 243 searches the feature table stored in the storage unit 12 and votes for the document data indicated by the page index associated with that feature data (S505). Based on the voting result, the similarity determination processing unit 244 determines which of the document data stored in the storage unit 12 the input document data is similar to (S506). In step S506, the similarity determination processing unit 244 determines that document data whose normalized number of votes is equal to or greater than a predetermined threshold has a high similarity to the input document data.
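The voting and threshold test of steps S505 and S506 can be sketched roughly as follows. This is only an illustration under stated assumptions: the feature table is modeled as a dictionary from a feature value to a list of page indexes, the normalization divides each vote count by the number of input features, and the names `vote`, `similar_pages`, and the threshold value 0.8 are hypothetical and do not appear in the specification.

```python
from collections import Counter


def vote(input_features, feature_table):
    """Cast one vote per page index associated with each input feature (cf. S505)."""
    votes = Counter()
    for feature in input_features:
        for page_index in feature_table.get(feature, ()):
            votes[page_index] += 1
    return votes


def similar_pages(input_features, feature_table, threshold=0.8):
    """Return page indexes whose normalized vote count reaches the threshold (cf. S506)."""
    if not input_features:
        return []
    votes = vote(input_features, feature_table)
    # Assumption: normalize by the number of input features; the actual
    # normalization used by the apparatus may differ.
    total = len(input_features)
    return sorted(page for page, count in votes.items() if count / total >= threshold)


# Hypothetical feature table: feature value -> page indexes registered with it.
feature_table = {
    "H1": ["ID1", "ID2"],
    "H2": ["ID1"],
    "H3": ["ID1", "ID3"],
    "H4": ["ID2"],
}
result = similar_pages(["H1", "H2", "H3"], feature_table, threshold=0.8)
```

In this toy table only ID1 receives a vote from every input feature, so it is the only page that passes the threshold.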

Next, the control unit 11 determines whether the determination result of the similarity determination processing unit 244 indicates that there is document data having a high similarity to the input document data (S507). If the result indicates that there is no such document data (S507: NO), the control unit 11 outputs information indicating that there is no document similar to the document that the user had the color image input unit 13 read (S508). After step S508, the document extraction apparatus 100 ends the process of extracting document data.

If the determination result in step S507 indicates that there is document data having a high similarity to the input document data (S507: YES), the document extraction unit 245 searches the document table stored in the storage unit 12 and acquires the document index associated with the page index of each piece of document data that the similarity determination processing unit 244 judged to be highly similar to the input document data (S509). If a plurality of pieces of document data are highly similar to the input document data, a plurality of document indexes are acquired in step S509. Next, the control unit 11 determines whether the input document data currently being processed was obtained by reading the second or a subsequent page of the multi-page document (S510). If it was obtained by reading the first page (S510: NO), the control unit 11 determines whether a plurality of document indexes were acquired in step S509 (S515). If only one document index was acquired (S515: NO), the document extraction unit 245 extracts the plurality of pieces of document data indicated by the plurality of page indexes associated with that document index in the document table (S516).

The document extraction unit 245 outputs the extracted document data to the color correction unit 25. The document data is processed in order by the color correction unit 25, the black generation and under color removal unit 26, the spatial filter processing unit 27, the output tone correction unit 28, and the tone reproduction processing unit 29, and the color image processing unit 2 outputs the document data to the color image forming unit 14. The color image forming unit 14 performs document output processing (S517): it forms images based on the plurality of pieces of document data, which are image data, and thereby outputs a document composed of the plurality of pages corresponding to those pieces of document data. After step S517, the document extraction apparatus 100 ends the process of extracting document data.

If, in step S510, the input document data currently being processed was obtained by reading the second or a subsequent page of the document (S510: YES), the control unit 11 determines whether, among the document indexes acquired for the input document data of the pages read so far, there is a document index common to all of those pages (S511). If there is no document index common to all pages (S511: NO), the control unit 11 advances the process to step S508 and outputs information indicating that there is no similar document.

If there is a document index common to all the pages read so far (S511: YES), the control unit 11 determines whether there are a plurality of such common document indexes (S512). If there is only one (S512: NO), the control unit 11 advances the process to step S516: the document extraction unit 245 extracts the plurality of pieces of document data indicated by the page indexes associated with that document index (S516), the color image forming unit 14 performs document output processing to output a document composed of the corresponding pages (S517), and the document extraction apparatus 100 ends the process.

If step S515 finds that a plurality of document indexes were acquired (S515: YES), or if step S512 finds a plurality of document indexes common to all the pages read so far (S512: YES), the control unit 11 outputs information requesting another page of the document (S513). Specifically, the control unit 11 causes the display unit of the operation panel 15 to display text requesting that a new page of the document be read.

Next, the control unit 11 determines whether the user has set another page of the document on the document extraction apparatus 100 (S514). If another page has been set (S514: YES), the control unit 11 returns the process to step S502, and the color image input unit 13 acquires input document data corresponding to that page.

If no other page of the document has been set on the document extraction apparatus 100 (S514: NO), the control unit 11 advances the process to step S516. In step S514, the control unit 11 may determine that no other page has been set when no page is set within a predetermined time after step S513 ends, or when an instruction to end document reading is received through the user's operation of the operation panel 15. When the process advances to step S516, the document extraction unit 245 extracts the document data indicated by each page index associated with each of the plurality of document indexes common to all the pages read so far (S516), and the color image forming unit 14 performs document output processing to output the documents corresponding to the extracted document data (S517). The document extraction apparatus 100 thus outputs a plurality of documents corresponding to the plurality of document indexes. After step S517, the document extraction apparatus 100 ends the process.

As described in detail above, when a plurality of document indexes are associated with document data having a high similarity to the input document data of a page read from the document, the document extraction apparatus according to this embodiment requests input document data for another page of the document and acquires the input document data obtained by reading that page. It then acquires the document index associated with document data that is highly similar to the input document data for every page read, and extracts the plurality of pieces of document data associated with that document index. Thus, when the document data judged similar to the input document data belongs to more than one document index, further pages of the document are used to narrow down the document indexes, and the narrowing is repeated until the document index is determined. Using a plurality of pages therefore makes the similarity determination more reliable and allows the desired document data to be extracted with high accuracy.
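The narrowing performed in steps S509 to S516 amounts to intersecting, page by page, the sets of document indexes to which the similar pages belong. The sketch below is one possible reading, not the apparatus's actual implementation: the function name `narrow_documents`, the status strings, and the sample page-to-document mapping are all hypothetical.

```python
def narrow_documents(pages_similar, page_to_doc):
    """Narrow candidate documents using the pages scanned so far.

    pages_similar: one list per scanned page, holding the page indexes judged
                   similar to that scan (the result of S506).
    page_to_doc:   mapping from page index to document index (document table).
    Returns (status, document_indexes), where status is one of
    "no_match" (S508), "unique" (single index), or "ambiguous" (several remain).
    """
    common = None
    for similar in pages_similar:
        docs = {page_to_doc[p] for p in similar if p in page_to_doc}  # S509
        common = docs if common is None else common & docs           # S511
        if not common:
            return "no_match", set()
        if len(common) == 1:                                         # S515 / S512: NO
            return "unique", common
    if common is None:
        return "no_match", set()
    # The user supplied no further pages (S514: NO): all remaining
    # candidates are extracted and output.
    return "ambiguous", common


# Hypothetical document table contents.
page_to_doc = {"ID1": "Doc1", "ID2": "Doc1", "ID11": "Doc2", "ID12": "Doc2"}
status, docs = narrow_documents([["ID1", "ID11"], ["ID2"]], page_to_doc)
```

Here the first scan matches pages of both Doc1 and Doc2, and the second scan matches only a page of Doc1, so the candidates narrow to Doc1.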

(Embodiment 3)
Embodiments 1 and 2 allow any document to be output based on input document data corresponding to a single page. Embodiment 3 describes a form in which the conditions for outputting specific documents are made stricter. The internal configuration of the document extraction apparatus according to this embodiment is the same as that of Embodiment 1 described with reference to FIGS. 1 to 3.

FIG. 17 is a conceptual diagram showing an example of the contents of the document table, stored in the storage unit 12 according to Embodiment 3, that associates document data with documents. A page index and the number of pages are recorded in association with each document index Doc1, Doc2, ..., which identifies an individual document, and an output condition that must be met in order to output the document is also recorded in association with the document index. In the example of FIG. 17, no output condition is associated with the document indexes Doc1 to Doc4, while output conditions are associated with Doc21 and Doc51. For Doc21, the output condition is that, among the page indexes ID21 to ID28 associated with it, the document data corresponding to both ID21 and ID25 is judged similar to input document data. For Doc51, the output condition is that document data corresponding to three or more of the page indexes ID51 to ID55 associated with it is judged similar to input document data. The contents of the feature table stored in the storage unit 12 according to this embodiment, which associates document data with feature data, are the same as those of Embodiment 1 described with reference to FIG. 12.

The process by which the document extraction apparatus according to this embodiment registers document data is the same as that of Embodiment 1 described with reference to the flowchart of FIG. 13. The process of extracting document data is almost the same as that of Embodiment 1 described with reference to the flowchart of FIG. 14, or that of Embodiment 2 described with reference to FIGS. 15 and 16, except that the content of the document output processing in step S44 or step S517 differs from Embodiments 1 and 2.

FIG. 18 is a flowchart showing the procedure of the document output processing performed by the document extraction apparatus according to Embodiment 3. In the process of extracting document data, the document extraction apparatus 100 according to this embodiment executes steps S31 to S43 shown in FIG. 14, or steps S501 to S516 shown in FIGS. 15 and 16. In the document output processing of step S44 or step S517, the control unit 11 first selects the document index associated with one piece of the document data extracted by the document extraction unit 245 in step S43 or step S516 (S61). Next, the control unit 11 searches the document table stored in the storage unit 12 and determines whether an output condition is associated with the selected document index (S62). If an output condition is associated with the selected document index (S62: YES), the control unit 11 determines whether that output condition is satisfied (S63).

For example, when the document index Doc21 shown in FIG. 17 is selected, the output condition is judged to be satisfied if the document data corresponding to both ID21 and ID25 was determined in step S37 or step S507 to be similar to input document data. If the document data corresponding to either ID21 or ID25 was not determined to be similar to input document data, the output condition is judged not to be satisfied. When the document index Doc51 is selected, the output condition is judged to be satisfied if document data corresponding to three or more of the page indexes ID51 to ID55 was determined in step S37 or step S507 to be similar to input document data. If document data corresponding to fewer than three page indexes was so determined, the output condition is judged not to be satisfied.
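One way to picture the document table of FIG. 17 and the check of step S63 is shown below. The representation is an assumption made for illustration: the condition encoding (`"all_of"`, `"at_least"`), the names `DOCUMENT_TABLE` and `condition_satisfied`, and the page list given for Doc1 are hypothetical; only the Doc21 and Doc51 conditions and page ranges come from the description.

```python
# Hypothetical in-memory form of the document table.
DOCUMENT_TABLE = {
    # Page list for Doc1 is invented for the example; no output condition.
    "Doc1": {"pages": ["ID1", "ID2", "ID3"], "condition": None},
    # Doc21: both ID21 and ID25 must have been judged similar.
    "Doc21": {
        "pages": [f"ID{i}" for i in range(21, 29)],
        "condition": ("all_of", {"ID21", "ID25"}),
    },
    # Doc51: at least three of ID51 to ID55 must have been judged similar.
    "Doc51": {
        "pages": [f"ID{i}" for i in range(51, 56)],
        "condition": ("at_least", 3),
    },
}


def condition_satisfied(doc_index, matched_pages, table=DOCUMENT_TABLE):
    """Return True if the document may be output given the matched page indexes."""
    entry = table[doc_index]
    condition = entry["condition"]
    if condition is None:          # S62: NO, output unconditionally
        return True
    kind, value = condition
    # Only pages that belong to this document count toward its condition.
    matched = set(matched_pages) & set(entry["pages"])
    if kind == "all_of":
        return value <= matched
    if kind == "at_least":
        return len(matched) >= value
    raise ValueError(f"unknown condition kind: {kind}")


ok_doc21 = condition_satisfied("Doc21", {"ID21", "ID25"})
ok_doc51 = condition_satisfied("Doc51", {"ID51", "ID52"})
```

With these inputs Doc21 passes because both required pages matched, while Doc51 fails because only two of its pages matched.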

If no output condition is associated with the document index in step S62 (S62: NO), or if the output condition associated with the document index is satisfied in step S63 (S63: YES), the color image forming unit 14 forms images based on the document data indicated by each page index associated with the selected document index, thereby outputting the document corresponding to the selected document index (S64). For example, the documents corresponding to the document indexes Doc1 to Doc4 shown in FIG. 17 have no output condition and are therefore output unconditionally, while the documents corresponding to Doc21 and Doc51 are output only when their output conditions are satisfied. After step S64, the control unit 11 advances the process to step S65. If the output condition associated with the document index is not satisfied in step S63 (S63: NO), the control unit 11 advances the process to step S65 without outputting the document corresponding to the selected document index. In this way, the control unit 11 prohibits the output of document data whose output condition is not satisfied.

Next, the control unit 11 determines whether processing has finished for all the document data extracted in step S43 or step S516 (S65). If unprocessed document data remains (S65: NO), the control unit 11 returns the process to step S61 and selects a document index that has not yet been selected from among those associated with the document data extracted in step S43 or step S516. If processing has finished for all the extracted document data (S65: YES), the control unit 11 ends the document output processing and returns to the process of extracting document data. After the document output processing ends, the document extraction apparatus 100 ends the process of extracting document data.
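The loop of steps S61 to S65 might be sketched as follows. This is a simplified illustration, not the apparatus's control logic: output conditions are modeled as optional predicate functions over the set of matched page indexes, and `output_documents`, `emit`, and the sample data are hypothetical names.

```python
def output_documents(extracted, conditions, matched_pages, emit):
    """Emit each extracted document whose output condition, if any, holds.

    extracted:     document indexes extracted in S43 or S516.
    conditions:    mapping from document index to a predicate; documents
                   without an entry have no output condition.
    matched_pages: set of page indexes judged similar to the input.
    emit:          callable that outputs one document (stands in for S64).
    Returns the document indexes whose output was withheld.
    """
    withheld = []
    for doc_index in extracted:                             # S61, repeated until S65: YES
        condition = conditions.get(doc_index)               # S62
        if condition is None or condition(matched_pages):   # S63
            emit(doc_index)                                 # S64
        else:
            withheld.append(doc_index)                      # output prohibited
    return withheld


doc51_pages = {"ID51", "ID52", "ID53", "ID54", "ID55"}
conditions = {
    "Doc21": lambda pages: {"ID21", "ID25"} <= pages,
    "Doc51": lambda pages: len(pages & doc51_pages) >= 3,
}
printed = []
withheld = output_documents(
    ["Doc1", "Doc21", "Doc51"], conditions, {"ID21", "ID25", "ID51"}, printed.append
)
```

In this run Doc1 is output unconditionally, Doc21 is output because its condition holds, and Doc51 is withheld.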

As described in detail above, the document extraction apparatus according to this embodiment defines an output condition for each document index in advance and, when performing document output processing, outputs only the documents whose document indexes satisfy their output conditions. In Embodiments 1 and 2, a document can be output based on input document data corresponding to a single page, so even a highly important document containing confidential information could easily be output in full from just one of its pages. In this embodiment, a document for which an output condition is defined is output only when that condition is satisfied, so defining output conditions for highly important documents prevents them from being output too easily.

For example, if the output condition requires that input document data be judged similar to document data for a plurality of pages, all pages of a highly important document cannot be output based on just one page of the document. If the output condition requires that input document data be judged similar to specific document data, a user who does not possess the specific page of the document cannot extract the document from the document extraction apparatus. As the specific document data, document data representing verification content unrelated to the main content of the multi-page document may be registered. The verification content preferably has a format entirely different from the main content of the document, for example English text when the main content is Japanese text.

In this way, the document extraction apparatus according to this embodiment allows a specific user who possesses the specific document data for verification to extract a document for which an output condition is defined, and prevents other users who do not possess that document data from outputting the highly important document. Accordingly, in this embodiment, defining output conditions for highly important documents containing confidential information makes it possible to protect that confidential information.

(Embodiment 4)
Embodiments 1 to 3 describe forms in which the document extraction apparatus of the present invention is an image forming apparatus. Embodiment 4 describes a form in which it is a scanner apparatus. FIG. 19 is a block diagram showing the internal functional configuration of the document extraction apparatus 300 according to Embodiment 4. The document extraction apparatus 300 includes a control unit 31 that controls the operation of each unit of the apparatus, a storage unit 32 composed of a semiconductor memory, a hard disk, or the like, and a color image input unit 33 that optically reads color images. An A/D conversion unit 34 is connected to the color image input unit 33, a shading correction unit 35 is connected to the A/D conversion unit 34, and a document extraction processing unit 36 is connected to the shading correction unit 35. A transmission unit 37 that transmits document data to the outside is connected to the document extraction processing unit 36. The storage unit 32, the color image input unit 33, the A/D conversion unit 34, the shading correction unit 35, the document extraction processing unit 36, and the transmission unit 37 are connected to the control unit 31, and an operation unit 38 that receives operations from the user is also connected to the control unit 31.

Like the storage unit 12 of the document extraction apparatus 100 described in Embodiments 1 to 3, the storage unit 32 stores, for each document composed of a plurality of pages, the document data corresponding to each page, and also stores a document table that associates document data with documents and a feature table that associates document data with feature data. An external PC, image forming apparatus, or the like is connected to the transmission unit 37.

The color image input unit 33 is a scanner equipped with a CCD. It separates the light image reflected from the document into RGB components, reads them with the CCD, converts them into RGB analog signals, and outputs these to the A/D conversion unit 34. The A/D conversion unit 34 converts the RGB analog signals into digital RGB signals and outputs them to the shading correction unit 35.

The shading correction unit 35 removes from the RGB signals input from the A/D conversion unit 34 the various distortions produced in the illumination system, the image-focusing system, and the image-sensing system of the color image input unit 33. The shading correction unit 35 also adjusts the color balance of the RGB signals and converts the RGB reflectance signals into density signals. It then outputs document data, which is image data composed of the processed RGB signals, to the document extraction processing unit 36.

The document extraction processing unit 36 is configured in the same way as the document extraction processing unit 24 of the document extraction apparatus 100 described in Embodiments 1 to 3 and executes the same processing. That is, taking the document data input from the shading correction unit 35 as input document data, the document extraction processing unit 36 performs the same processing as that shown in the flowchart of FIG. 14 or of FIGS. 15 and 16, and extracts from the storage unit 32 a plurality of pieces of document data belonging to the document that includes the page corresponding to document data having a high similarity to the input document data.

The control unit 31 outputs the extracted document data by causing the transmission unit 37 to transmit to the outside the plurality of pieces of document data extracted by the document extraction processing unit 36. The transmission unit 37 transmits the plurality of pieces of document data to an external apparatus such as a PC or an image forming apparatus, and the external apparatus executes processing such as forming images based on the received document data.

As described in detail above, this embodiment, like Embodiments 1 to 3, makes it possible to extract the document data corresponding to all pages of a document based on input document data corresponding to part of a multi-page document. Accordingly, even when pages of a multi-page document are missing because of loss, soiling, or the like, the document data covering all pages of the document can easily be extracted from a database in which the document data has been stored in advance.

(Embodiment 5)
Embodiment 5 describes a form in which the document extraction apparatus of the present invention is realized using a general-purpose computer. FIG. 20 is a block diagram showing the internal configuration of the document extraction apparatus 400 according to Embodiment 5. The document extraction apparatus 400 is configured using a general-purpose computer such as a PC, and includes a CPU 41 that performs computation, a RAM 42 that stores temporary information generated during computation, a drive unit 43 such as a CD-ROM drive that reads information from the recording medium 5 of the present invention such as an optical disc, and a storage unit 44 such as a hard disk. The CPU 41 causes the drive unit 43 to read the computer program 51 of the present invention from the recording medium 5 and stores the read computer program 51 in the storage unit 44. The computer program 51 is loaded from the storage unit 44 into the RAM 42 as needed, and the CPU 41 executes the processing required of the document extraction apparatus 400 based on the loaded computer program 51.

The document extraction apparatus 400 also includes an input unit 45, such as a keyboard or a pointing device, through which information such as processing instructions is entered by user operation, and a display unit 46, such as a liquid crystal display, that displays various kinds of information. The document extraction apparatus 400 further includes a transmission unit 47 connected to an external output device 61 that outputs documents, such as an image forming apparatus, and a reception unit 48 connected to an external input device 62 that inputs document data, such as a scanner. The transmission unit 47 transmits document data to the output device 61, and the output device 61 outputs a document based on that document data. The input device 62 optically reads a document to generate document data and transmits the generated document data to the document extraction apparatus 400, where the reception unit 48 receives it. The reception unit 48 functions as the acquisition means of the present invention.

Like the storage unit 12 of the document extraction apparatus 100 described in the first to third embodiments, the storage unit 44 stores, for each document composed of a plurality of pages, the document data corresponding to each page. It also stores a document table that associates document data with documents and a feature table that associates document data with feature data.
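
The two tables held by the storage unit 44 can be pictured with the following minimal sketch. It assumes in-memory Python dictionaries; the names `document_table`, `feature_table`, and `register_document`, the page IDs, the integer feature values, and the direction of the feature lookup (feature value to page IDs) are all hypothetical and are not taken from the specification.

```python
from collections import defaultdict

# Hypothetical stand-ins for the tables in storage unit 44.
document_table = {}               # page ID (one piece of document data) -> document index
feature_table = defaultdict(set)  # feature value -> page IDs that have this feature


def register_document(document_index, pages):
    """Register every page of one document.

    `pages` maps a page ID to the feature values calculated for that page.
    """
    for page_id, features in pages.items():
        document_table[page_id] = document_index
        for feature in features:
            feature_table[feature].add(page_id)
```

Keeping the feature lookup keyed by feature value is a design choice made here so that a matching feature leads directly to the pages that contain it.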

The CPU 41 loads the computer program 51 of the present invention into the RAM 42 and executes the processing of the document extraction method of the present invention in accordance with the loaded computer program 51. Specifically, when the reception unit 48 receives document data from the input device 62, the CPU 41 treats the received document data as input document data and performs the same processing as shown in the flowchart of FIG. 14, or in the flowcharts of FIGS. 15 and 16. In this way it extracts from the storage unit 44 a plurality of pieces of document data for the document that includes the page corresponding to the document data having a high similarity to the input document data. The CPU 41 transmits the extracted pieces of document data from the transmission unit 47 to the output device 61, and the output device 61 outputs a multi-page document based on that document data. Note that the CPU 41 may also handle, as document data, application data such as text data created with an application program.
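
The flow performed by the CPU 41 (match one input page, look up its document index, then return every page of that document) can be sketched as follows. This is only an illustration under stated assumptions: feature data is modeled as a set of integers, the similarity judgment is passed in as a function, and `extract_document` and `most_shared_features` are hypothetical names. The placeholder judgment simply picks the page sharing the most features and is not the procedure of the flowcharts.

```python
from typing import Callable, Optional


def extract_document(
    input_features: set[int],
    page_features: dict[str, set[int]],  # page ID -> stored feature data
    document_table: dict[str, str],      # page ID -> document index
    find_similar_page: Callable[[set[int], dict[str, set[int]]], Optional[str]],
) -> list[str]:
    """Return the page IDs of the whole document containing the matched page."""
    matched_page = find_similar_page(input_features, page_features)
    if matched_page is None:
        return []  # no stored page is similar to the input
    document_index = document_table[matched_page]
    return sorted(p for p, d in document_table.items() if d == document_index)


def most_shared_features(features, page_features):
    """Placeholder judgment (assumption): the page sharing the most features."""
    best = max(page_features, key=lambda p: len(features & page_features[p]), default=None)
    if best is None or not features & page_features[best]:
        return None
    return best
```

For example, with pages `ID1` and `ID2` registered under one document index, an input that matches only `ID2` still yields both pages.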

As described in detail above, in the present embodiment as well, as in the first to fourth embodiments, document data corresponding to all pages of a document can be extracted based on input document data corresponding to only part of a document composed of a plurality of pages. Therefore, even when pages of a multi-page document are missing because of loss, soiling, or the like, document data covering all pages of the document can easily be extracted from the database in which the document data has been stored in advance.

In the present embodiment, the required document data is extracted from the document data stored in the storage unit 44 built into the document extraction apparatus 400, but the invention is not limited to this. The document extraction apparatus 400 of the present invention may instead extract the required document data from document data stored in external storage means (not shown), such as a storage device or a server device connected via a communication network.

The recording medium 5 of the present invention, on which the computer program 51 of the present invention is recorded, may take any form: a magnetic tape, a magnetic disk, a portable hard disk, an optical disc such as a CD-ROM, MO, MD, or DVD, or a card-type recording medium such as an IC card (including a memory card) or an optical card. The recording medium 5 may also be a semiconductor memory that is mounted in the document extraction apparatus 400 and whose recorded contents can be read by the CPU 41, namely a mask ROM, an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), a flash ROM, or the like.

The computer program 51 of the present invention may also be downloaded to the document extraction apparatus 400 from an external server device (not shown) connected to the document extraction apparatus 400 via a communication network such as the Internet or a LAN, and then stored in the storage unit 44. In that case, the program needed to download the computer program 51 either is stored in the storage unit 44 in advance, or is read from a predetermined recording medium by the drive unit 43 and stored in the storage unit 44, and is loaded into the RAM 42 as needed.

FIG. 1 is a block diagram showing the internal functional configuration of the document extraction apparatus of the present invention according to the first embodiment.
FIG. 2 is a block diagram showing the configuration of the document extraction processing unit.
FIG. 3 is a block diagram showing the configuration of the feature point extraction unit.
FIG. 4 is an explanatory diagram showing an example of the spatial filter used by the filter processing unit.
FIG. 5 is an explanatory diagram showing an example of a feature point of a connected region.
FIG. 6 is an explanatory diagram showing an example of the result of extracting feature points from a character string.
FIG. 7 is an explanatory diagram showing a feature point of interest and the extracted feature points.
FIG. 8 is an explanatory diagram showing an example in which three surrounding feature points are extracted for the feature point of interest P1 and feature data is calculated.
FIG. 9 is an explanatory diagram showing an example in which three surrounding feature points are extracted for the feature point of interest P2 and feature data is calculated.
FIG. 10 is a conceptual diagram showing the document data stored in the storage unit.
FIG. 11 is a conceptual diagram showing an example of the contents of the document table, stored in the storage unit, that associates document data with documents.
FIG. 12 is a conceptual diagram showing an example of the contents of the feature table, stored in the storage unit, that associates document data with feature data.
FIG. 13 is a flowchart showing the procedure for registering document data.
FIG. 14 is a flowchart showing the procedure for extracting document data.
FIG. 15 is a flowchart showing the procedure for extracting document data performed by the document extraction apparatus according to the second embodiment.
FIG. 16 is a flowchart showing the procedure for extracting document data performed by the document extraction apparatus according to the second embodiment.
FIG. 17 is a conceptual diagram showing an example of the contents of the document table, stored in the storage unit according to the third embodiment, that associates document data with documents.
FIG. 18 is a flowchart showing the procedure of the document output processing performed by the document extraction apparatus according to the third embodiment.
FIG. 19 is a block diagram showing the internal functional configuration of the document extraction apparatus of the present invention according to the fourth embodiment.
FIG. 20 is a block diagram showing the internal configuration of the document extraction apparatus of the present invention according to the fifth embodiment.

Explanation of symbols

100, 300, 400 Document extraction apparatus
11, 31 Control unit
12, 32, 44 Storage unit (storage means)
13, 33 Color image input unit
24, 36 Document extraction processing unit
242 Feature data calculation unit
243 Voting processing unit
244 Similarity determination processing unit
245 Document extraction unit
41 CPU
5 Recording medium
51 Computer program

Claims

1. A document extraction method for extracting specific document data from document data stored in storage means, the method comprising:
storing, in the storage means, a document index indicating a document composed of a plurality of pages in association with the document data corresponding to each page included in the document;
storing, in the storage means, feature data that is calculated based on feature points extracted from the document data and that indicates features of the document data, in association with the document data;
acquiring input document data, which is new document data;
extracting feature points from the acquired input document data;
generating, based on the extracted feature points, feature data indicating features of the input document data;
determining a similarity between the input document data and the document data associated with the feature data stored in the storage means, by comparing the generated feature data with the feature data stored in the storage means;
acquiring the document index associated with document data determined to have a high similarity to the input document data; and
extracting a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the acquired document index.

2. A document extraction apparatus that includes document storage means for storing document data and that extracts specific document data from the document data stored in the document storage means, the apparatus comprising:
means for storing a document index indicating a document composed of a plurality of pages in association with the document data corresponding to each page included in the document;
feature data storage means for storing feature data that is calculated based on feature points extracted from the document data and that indicates features of the document data, in association with the document data;
acquisition means for acquiring input document data, which is new document data;
means for extracting feature points from the input document data acquired by the acquisition means;
generation means for generating feature data indicating features of the input document data based on the feature points extracted by that means;
determination means for determining a similarity between the input document data and the document data associated with the feature data stored in the feature data storage means, by comparing the feature data generated by the generation means with the feature data stored in the feature data storage means;
means for acquiring the document index associated with the document data that the determination means has determined to have a high similarity to the input document data; and
extraction means for extracting a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the document index acquired by that means.

3. The document extraction apparatus according to claim 2, wherein
the feature data storage means stores a plurality of pieces of feature data indicating features of the document data in association with one piece of document data,
the generation means is configured to generate a plurality of pieces of feature data indicating features of the input document data, and
the determination means includes:
means for voting, for each of the plurality of pieces of feature data generated by the generation means, for the document data associated with stored feature data that matches that feature data; and
means for determining that, among the document data stored in the document storage means, the document data having the largest number of votes, or document data having at least a predetermined number of votes, is document data having a high similarity to the input document data.
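
The voting-based judgment recited above can be illustrated with a minimal sketch. It assumes that feature data is a hashable value and that the feature table maps each feature value to the page IDs that have it; `vote`, `similar_pages`, and `min_votes` are hypothetical names, and the concrete form of the feature data is not specified here.

```python
from collections import Counter


def vote(input_features, feature_table):
    """Cast one vote for a stored page each time an input feature matches it."""
    votes = Counter()
    for feature in input_features:
        for page_id in feature_table.get(feature, ()):
            votes[page_id] += 1
    return votes


def similar_pages(votes, min_votes=None):
    """Return the pages with the maximum vote count, or, when `min_votes`
    is given, every page whose vote count is at least `min_votes`."""
    if not votes:
        return []
    threshold = max(votes.values()) if min_votes is None else min_votes
    return sorted(p for p, n in votes.items() if n >= threshold)
```

Using the maximum picks the single best match (or several pages on a tie), while a fixed threshold can return more than one candidate page.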

4. The document extraction apparatus according to claim 2 or 3, wherein
the acquisition means includes means for acquiring a plurality of pieces of input document data,
the determination means includes means for determining, for each of the plurality of pieces of input document data, the similarity between the document data stored in the document storage means and that piece of input document data, and
the extraction means includes means for extracting, when the document indexes associated with the document data having a high similarity to the respective pieces of input document data match each other, a plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by that document index.

5. The document extraction apparatus according to claim 4, further comprising means for requesting further input document data when a plurality of document indexes associated with document data having a high similarity to the input document data are acquired, or when a plurality of document indexes common to the plurality of pieces of input document data are acquired from among the document indexes associated with the document data having a high similarity to each of the plurality of pieces of input document data.
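
How several input pages narrow down the candidate documents can be sketched as a set intersection. This is an illustration only: `resolve_document` is a hypothetical name, each input page is assumed to have already produced a set of candidate document indexes, and returning "no match" when the inputs share no document index is an assumption made for this sketch.

```python
def resolve_document(candidates_per_input):
    """Combine the candidate document indexes found for each input page.

    Returns (document_index, needs_more_input).
    """
    if not candidates_per_input:
        return None, True  # nothing scanned yet, so ask for input
    common = set.intersection(*map(set, candidates_per_input))
    if len(common) == 1:
        return next(iter(common)), False  # exactly one document fits every input
    if len(common) > 1:
        return None, True  # still ambiguous: request further input document data
    return None, False  # assumption: no common document means no match
```

Each additional scanned page can only shrink the common set, which is why requesting further input is a reasonable way to resolve ambiguity.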

6. The document extraction apparatus according to claim 2, wherein the acquisition means acquires the input document data by optically reading a document.

7. The document extraction apparatus according to any one of claims 2 to 6, further comprising:
means for storing, in association with the document index, a predetermined output condition required for outputting the document data corresponding to each page included in the document indicated by that document index;
means for determining whether the output condition associated with the document index associated with the document data extracted by the extraction means is satisfied;
means for outputting the plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the document index when the output condition is determined to be satisfied; and
means for prohibiting output of the plurality of pieces of document data corresponding to the plurality of pages included in the document indicated by the document index when the output condition is determined not to be satisfied.
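
The output-condition check can be sketched as a lookup followed by a gate. The form of the condition is not given in this section, so the sketch assumes it is a predicate over a request context; `output_document`, `output_conditions`, and the `user` field are hypothetical, and treating a document with no stored condition as unrestricted is likewise an assumption.

```python
def output_document(document_index, pages, output_conditions, context):
    """Return the pages to output, or an empty list when output is prohibited."""
    condition = output_conditions.get(document_index)
    if condition is not None and not condition(context):
        return []  # condition not satisfied: output of every page is prohibited
    return list(pages)  # condition satisfied, or no condition stored (assumption)
```

Because the condition is attached to the document index rather than to individual pages, one check governs the output of the whole document.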

8. The document extraction apparatus according to claim 2, further comprising means for forming a plurality of images based on the plurality of pieces of document data extracted by the extraction means.

9. A computer program for causing a computer to extract specific document data from document data stored inside or outside the computer, the computer program comprising:
a procedure for causing the computer to extract feature points from input document data;
a procedure for causing the computer to generate, based on the extracted feature points, feature data indicating features of the input document data;
a procedure for causing the computer to determine a similarity between the stored document data and the input document data by comparing the generated feature data with feature data indicating features of the stored document data;
a procedure for causing the computer to acquire a document index associated with document data determined to have a high similarity to the input document data; and
a procedure for causing the computer to extract a plurality of pieces of document data corresponding to a plurality of pages included in the document indicated by the acquired document index.

10. A computer-readable recording medium on which is recorded a computer program for causing a computer to extract specific document data from document data stored inside or outside the computer, the computer program comprising:
a procedure for causing the computer to extract feature points from input document data;
a procedure for causing the computer to generate, based on the extracted feature points, feature data indicating features of the input document data;
a procedure for causing the computer to determine a similarity between the stored document data and the input document data by comparing the generated feature data with feature data indicating features of the stored document data;
a procedure for causing the computer to acquire a document index associated with document data determined to have a high similarity to the input document data; and
a procedure for causing the computer to extract a plurality of pieces of document data corresponding to a plurality of pages included in the document indicated by the acquired document index.