JP2007004393A - Document search device and document search method

Info

Publication number: JP2007004393A
Application number: JP2005182495A
Authority: JP
Inventors: Ichiro Yamashita; 一郎山下; Akinori Murakami; 哲範村上; Yoshihide Kadani; 嘉英甲谷
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-06-22
Filing date: 2005-06-22
Publication date: 2007-01-11
Anticipated expiration: 2025-06-22
Also published as: JP4788205B2

Abstract

PROBLEM TO BE SOLVED: To provide a document search device and method that find a read document more reliably.

SOLUTION: The document search device includes extraction parts 4, 6, and 8, which respectively extract a UUID, a feature amount, and a search word from a read image, and search parts 10, 12, and 14, which search the corresponding databases 20, 22, and 24 using the extracted UUID, feature amount, and search word as keys. When a read image is formed by scanning a document, an extraction processing control part 2 carries out search processing that combines the extracted UUID, feature amount, and search word. A search result evaluation part 16 displays the document information specified by the search result on the screen of a display part 18.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an apparatus and a method for searching a database in which information about documents is stored for a document that matches input image data.

In recent years, companies have placed growing importance on how information is handled, including stronger information security, compliance, and the protection of personal information. For example, a company may need to disclose, in response to an audit or the like, what information its business was carried out on the basis of. To do so, access to information must be logged so that it is possible to identify who processed what information, when, from where, and how.

Meanwhile, to strengthen security management of confidential information handled on paper, there is an existing technique that assigns a UUID (Universally Unique Identifier) to a medium such as printing paper and prints the UUID on the medium as a barcode or embeds it in the medium as an IC tag or the like. With this UUID, the management information about the medium can be searched to identify, for example, when and by whom the confidential document at hand was created. Therefore, even if the confidential document at hand was copied illegally or without permission, its origin can be traced.

Thus, using a UUID makes it easy to identify where a document was created. Apart from UUID-based techniques, there are also techniques that calculate a feature amount from the image data of a document and identify the original image data based on that feature amount (for example, Patent Documents 1 to 4).

Patent Document 1: JP 2004-139210 A
Patent Document 2: JP H09-270902 A
Patent Document 3: JP 2003-281176 A
Patent Document 4: JP H10-49659 A

However, conventionally, the UUID sometimes cannot be identified because it has been erased or the paper is dirty, and in such cases the origin of the document cannot be identified. Entering a UUID is also troublesome. Moreover, a technique that relies on feature amounts alone has difficulty identifying the origin of a document accurately.

The present invention has been made to solve the above problems, and its object is to provide a document search apparatus and method that can find a read document more reliably.

To achieve the above object, a document search apparatus according to the present invention searches a database in which information about documents is stored for a document that matches input image data, and is characterized by having: image data input means for inputting image data of a document; document feature extraction processing means for extracting, from the input image data of the document, a plurality of types of document feature information capable of identifying the document; acquisition means for acquiring selection candidates for the document by searching the database based on each piece of document feature information extracted by the document feature extraction processing means; evaluation means for identifying the document by evaluating the selection candidates acquired by the acquisition means; and output means for outputting the evaluation result of the evaluation means.

Further, the document feature extraction processing means has an identification information extraction unit that extracts from the image data, as document feature information, identification information uniquely assigned to the document, and the acquisition means has an identification information search unit that searches the database in which the identification information uniquely assigned to each document is registered.

Further, the document feature extraction processing means has a feature amount extraction unit that calculates from the image data, as document identification information, a feature amount of the document, and the acquisition means has a feature amount search unit that searches the database in which the feature amount of each document is registered.

Further, the document feature extraction processing means has a search word extraction unit that extracts from the image data of the document, as document identification information, search words to be used for database search, and the acquisition means has a document search unit that searches the database in which each document is registered.

Further, the document feature extraction processing means has an extraction process control unit that controls the order of the processes for extracting the plurality of types of document feature information.

A document search method according to the present invention is carried out in a document search apparatus that searches a database in which information about documents is stored for a document that matches input image data, and is characterized by including: an image data input step of inputting image data of a document; a document feature extraction step of extracting, from the input image data of the document, a plurality of types of document feature information capable of identifying the document; an acquisition step of acquiring selection candidates for the document by searching the database based on each piece of document feature information extracted in the document feature extraction step; an evaluation step of identifying the document by evaluating the selection candidates acquired by the acquisition means; and an output step of outputting the evaluation result of the evaluation step.

According to the present invention, combining means that extract a plurality of types of document feature information and search with each of them makes it possible to find a document more reliably.

Preferred embodiments of the present invention are described below with reference to the drawings.

Embodiment 1.
FIG. 1 is a functional block diagram showing an embodiment of the document search apparatus according to the present invention. The document search apparatus of this embodiment has an image reading unit 1, an extraction process control unit 2, a UUID extraction unit 4, a feature amount extraction unit 6, a search word extraction unit 8, a UUID search unit 10, a similar image search unit 12, a document search unit 14, a search result evaluation unit 16, and a display unit 18. The image reading unit 1 reads an original on which a document is written. The UUID extraction unit 4 extracts a UUID from the read image using OCR (an optical character reader). The feature amount extraction unit 6 calculates a feature amount of the image from the read image. The extraction process control unit 2 performs process control, such as deciding the order in which the extraction processes of the extraction units 4, 6, and 8 are executed. The search word extraction unit 8 extracts text from the read image using OCR and extracts search words from that text. In this embodiment, the UUID extraction unit 4, the feature amount extraction unit 6, and the search word extraction unit 8 are provided as document feature extraction processing means for extracting, from the read image, document feature information capable of identifying the document, and the extraction units 4, 6, and 8 respectively extract different types of document feature information: a UUID, a feature amount, and search words. The UUID search unit 10, the similar image search unit 12, and the document search unit 14 are provided, corresponding to the extraction units 4, 6, and 8 respectively, as acquisition means for acquiring selection candidates for the document by database search.

Specifically, because the UUID assigned to each document is registered in the UUID database 20, the UUID search unit 10 identifies the corresponding document by searching the UUID database 20 using the UUID extracted from the read image as a key. Because feature amounts extracted from read images are registered in the similar image database 22, the similar image search unit 12 identifies the corresponding document by searching the similar image database 22 based on the feature amount extracted from the read image. Because the documents themselves are registered in the document database 24, the document search unit 14 identifies the corresponding document by searching the document database 24 using the search words extracted from the read image as keys. Note that a search by the search units 10 to 14 does not always select exactly one document; several documents may be identified, and in that case it is more accurate to say that selection candidates are acquired. The search result evaluation unit 16 evaluates the search results obtained by the search units 10, 12, and 14 and identifies the document. The display unit 18 retrieves from the document information database 26 the information about the document identified according to the evaluation result of the search result evaluation unit 16 and displays it on the screen.

Although FIG. 1 shows a single document search apparatus, the apparatus is in practice assumed to be realized by the network system shown in FIG. 2. FIG. 2 shows a configuration in which an image forming apparatus 30 and a database server 32 are connected by a LAN (local area network) 34; the databases 20 to 24 shown in FIG. 1 are assumed to be installed on the database server 32, and the remaining components on the image forming apparatus 30. The configuration is not, however, limited to one in which a single database server manages all of the databases 20 to 24. Also, FIG. 2 shows only one image forming apparatus 30 for convenience, but in practice a plurality of image forming apparatuses 30, each equipped with the functions described with reference to FIG. 1, are connected. The database server 32 performs a database search in response to a request from any of the image forming apparatuses 30 and returns the search result to the requesting image forming apparatus 30.

Except for the databases 20 to 24, the document search apparatus of this embodiment is built into an image forming apparatus 30 such as a multifunction peripheral, and is in practice realized by a computer mounted in the image forming apparatus 30. The image reading unit 1 operates in cooperation with a scanner, the UUID extraction unit 4 and the search word extraction unit 8 with OCR, and the display unit 18 with an operation panel. The processing functions of the components 2 to 16 are realized by cooperation between the computer, the scanner, and other devices mounted in the image forming apparatus 30 and a software program executed by the computer.

Next, the document search process of this embodiment is described with reference to the flowchart shown in FIG. 3.

Suppose, for example, that a document leaked for some reason has been obtained, and the user wants to trace information about it, such as when and by whom it was created. For convenience, the document is assumed here to consist of a single sheet, with a UUID printed as a character string at a predetermined position on the sheet (for example, 20 to 30 mm from the bottom edge). The user scans the document using the scan function of the image forming apparatus. The image reading unit 1 forms a read image by reading the document in this scan (step 101). This step also covers the case where the image reading unit 1 does not form the read image itself but acquires, over the network, a read image scanned and generated by another apparatus.

Next, the UUID extraction unit 4 extracts the UUID from the read image (step 102). Because the UUID is placed in a predetermined area of the image (20 to 30 mm from the bottom edge), it can be extracted by reading the code in that area. This embodiment assumes a UUID written as a character string, so OCR is used for character recognition; if the UUID is written as a barcode, for example, it is read with a barcode reader. In either case, a UUID extraction means suited to the way the UUID was added should be used.

If the UUID is extracted normally from the read image, the UUID search unit 10 searches the UUID database 20 using the extracted UUID as a key (steps 103, 104). Because a UUID is uniquely assigned to each document, this search normally extracts and identifies exactly one document. In this case the search result evaluation unit 16 has no particular need to evaluate the search result, and the display unit 18 searches the document information database 26 using the UUID identified by the UUID search unit 10 as a key and displays the retrieved information about the document on the screen (step 112).

In this embodiment, the document information database 26 holds, for each UUID, the image data, creation date and time, creator, and device information of the document. The image data is assumed to be the image itself as read by image reading means such as the image reading unit 1, but because the data volume would be enormous, a thumbnail may be stored instead. The creation date and time is time information identifying when the document was created. The creator is information identifying the person who created the document, namely a user ID or user name, and is obtained from the user information specified at login to the apparatus that creates the document. The device is information identifying the device on which the document was created; this embodiment uses an IP address. When print data is sent to the image forming apparatus over the network for printing, as with the print function, the IP address of the sender of the print data is also recorded. For the FAX transmission function, the IP address of the destination is also recorded.

The display unit 18 therefore displays the above information for the document corresponding to the UUID identified by the UUID search unit 10. This allows the user to determine when, by whom, and on which device the leaked document was printed.

When the UUID is extracted normally from the read image, processing basically proceeds as described above. Exceptionally, however, even when the UUID is judged to have been extracted normally, an OCR misreading or the like may mean that no UUID matching the extracted one exists in the UUID database 20. In that case, a plurality of selection candidates may be extracted by searching with some digits treated as wildcards, or by searching while swapping digits that are easily misread for each other, such as 8 and 9 or 0 and 6. Alternatively, means may be provided for displaying the automatically extracted UUID on the screen so that the user can correct it. When a plurality of selection candidates are extracted, the search result evaluation unit 16 searches the document information database 26 using the extracted UUIDs as keys and displays the retrieved information about each document on the screen (step 112). The user refers to the displayed information to find the document that matches the leaked document. In this case the document search apparatus cannot narrow the result to a single document, but because it can extract selection candidates, it can still help identify the origin of the leaked document.
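The digit-swapping fallback described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function names, the representation of the UUID database as a set of strings, and the restriction to a single swapped digit per candidate are all assumptions. Only the confusable pairs (8 and 9, 0 and 6) come from the text.

```python
from __future__ import annotations

# Digit pairs the description names as easy to misread in OCR output.
CONFUSABLE = {"8": "9", "9": "8", "0": "6", "6": "0"}


def candidate_uuids(ocr_uuid: str) -> list[str]:
    """Return the OCR result followed by every variant with one confusable digit swapped."""
    candidates = [ocr_uuid]
    for i, ch in enumerate(ocr_uuid):
        if ch in CONFUSABLE:
            candidates.append(ocr_uuid[:i] + CONFUSABLE[ch] + ocr_uuid[i + 1:])
    return candidates


def lookup_with_fallback(ocr_uuid: str, registered: set[str]) -> list[str]:
    """Try an exact match first; otherwise return registered single-swap variants.

    `registered` stands in for the UUID database 20 (hypothetical representation).
    """
    if ocr_uuid in registered:
        return [ocr_uuid]
    return [c for c in candidate_uuids(ocr_uuid)[1:] if c in registered]
```

An empty result from `lookup_with_fallback` would correspond to the case where the user is asked to correct the extracted UUID by hand.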

On the other hand, if the UUID cannot be extracted normally from the read image, whether because of deliberate tampering or because of dirt on the paper, the search word extraction unit 8 extracts text from the read image using OCR (steps 103, 105). It then performs morphological analysis to segment the text into words and extracts effective words (step 106). Here, "effective word" is a collective term for words with at least a predetermined number of characters, words with at least a predetermined frequency of occurrence, and words whose existence probability exceeds a predetermined value. The existence probability of a word is defined as follows. The recognition rate of an N-character word is the character recognition rate raised to the Nth power. If the word appears M times in the document, the probability that it is in the document is 1 - (1 - (character recognition rate)^N)^M.
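The existence probability defined above can be written out directly. The sketch below assumes the character recognition rate is given as a float between 0 and 1; the function and parameter names are illustrative and do not appear in the source.

```python
def word_recognition_rate(char_rate: float, n_chars: int) -> float:
    """Recognition rate of an N-character word: the character rate to the Nth power."""
    return char_rate ** n_chars


def word_existence_probability(char_rate: float, n_chars: int, occurrences: int) -> float:
    """Probability that a word seen M times is really in the document.

    Implements 1 - (1 - p**N)**M, where p is the character recognition rate,
    N the word length in characters, and M the number of occurrences.
    """
    miss_once = 1.0 - word_recognition_rate(char_rate, n_chars)
    return 1.0 - miss_once ** occurrences
```

For a fixed character recognition rate, the value falls as the word gets longer and rises as the word is seen more often, which is why both word length and frequency enter the definition of an effective word.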

If more than a predetermined number n of effective words are extracted, the document search unit 14 uses the effective words extracted from the read image as search words (keywords), or generates a search expression from a combination of effective words (steps 107, 108). The search expression is generated according to a rule such as ANDing the top i effective words when they are ordered by frequency of occurrence, or ANDing the top j effective words when they are ordered by number of characters, longest first. Documents are then extracted by searching the document database 24, in which all documents are registered, with the search words or search expression generated in this way (step 109). If the search returns no results, the search expression may be rebuilt with fewer search words, reduced either automatically or by user selection.
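The two rules for building a search expression can be sketched as below. The textual query syntax (words joined by " AND ") and the function names are assumptions made for illustration; the source does not specify how the document database 24 expresses queries. Ties are broken by order of first appearance.

```python
from __future__ import annotations

from collections import Counter


def build_query_by_frequency(effective_words: list[str], i: int) -> str:
    """AND together the i most frequent effective words."""
    counts = Counter(effective_words)
    top = [word for word, _ in counts.most_common(i)]
    return " AND ".join(top)


def build_query_by_length(effective_words: list[str], j: int) -> str:
    """AND together the j longest distinct effective words, longest first."""
    unique = list(dict.fromkeys(effective_words))  # de-duplicate, keep first-seen order
    top = sorted(unique, key=len, reverse=True)[:j]
    return " AND ".join(top)
```

Rebuilding the expression with a smaller `i` or `j` corresponds to the retry with fewer search words when a search returns no results.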

The search result evaluation unit 16 refers to the search results, judges that a document that matches the search words more closely is more likely to be the leaked document, and extracts the top k documents by hit rate as selection candidates. The display unit 18 searches the document information database 26 using the UUIDs attached to the extracted documents as keys and displays the retrieved information about those documents on the screen (step 112). By referring to the displayed information and comparing it with the leaked document, the user can find the matching document among the candidates.

On the other hand, if more than n effective words cannot be extracted (step 107), processing moves on to document search based on feature amounts. This is because, when the required number of search words cannot be extracted, a document search by search words can be judged unlikely to give accurate results. The feature amount extraction unit 6 analyzes features of the read image such as color information, texture information, and shape information and extracts feature amounts (step 110). The similar image search unit 12 then searches the similar image database 22 based on the extracted feature amounts to identify the documents inferred from them (step 111). Here too, the top k documents by similarity, with k predetermined, are extracted as selection candidates. The display unit 18 searches the document information database 26 using the UUIDs attached to the extracted similar images as keys and displays the retrieved information about those documents on the screen (step 112). By referring to the displayed information and comparing it with the leaked document, the user can find the matching document among the candidates.
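Selecting the top k documents by similarity might look like the following sketch. The source does not say how feature amounts are represented or compared, so everything here is an assumption: feature amounts are modeled as plain numeric vectors, cosine similarity is used as the measure, and the similar image database 22 is modeled as a dictionary keyed by UUID.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query: list[float], registered: dict[str, list[float]], k: int) -> list[str]:
    """Return the UUIDs of the k registered feature vectors most similar to the query."""
    ranked = sorted(
        registered.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [uuid for uuid, _ in ranked[:k]]
```

The returned UUIDs would then serve as keys into the document information database 26 for display.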

According to this embodiment, combining three different search methods as described above makes it possible to find information about a leaked document more reliably. As the procedure in the flowchart shows, if a UUID can be extracted, a UUID search is executed; if not, the number of words usable as search words is checked, and if the required number exist, search words are extracted; otherwise feature amounts are extracted. In other words, the search methods are applied in the priority order UUID, search word, feature amount.
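The priority order of Embodiment 1 reduces to a short decision function. The sketch below is illustrative only: the function name, the string labels, and the use of `None` to mean "UUID not extracted" are assumptions, while the threshold n and the step numbers in the comments follow the description of FIG. 3.

```python
from __future__ import annotations

from typing import Optional


def choose_search_method_embodiment1(
    uuid: Optional[str], effective_words: list[str], n: int
) -> str:
    """Pick the search method in the order UUID, search word, feature amount."""
    if uuid is not None:            # step 103: UUID extracted normally
        return "uuid"
    if len(effective_words) > n:    # step 107: more than n effective words
        return "search_word"
    return "feature_amount"         # fall back to similar image search
```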

Embodiment 2.
The document search process of this embodiment is described with reference to the flowchart shown in FIG. 4. In FIG. 4, processes identical to those described for Embodiment 1 with reference to FIG. 3 are given the same step numbers, and their description is omitted where appropriate. The apparatus configuration of this embodiment may be the same as that of Embodiment 1, so its description is omitted.

When a read image is formed by scanning a document (step 101), a UUID is extracted from the read image (step 102). If the UUID is extracted normally, the UUID search unit 10 searches the UUID database 20 using that UUID as a key (steps 103, 104), and the search result is displayed on the screen (step 112).

If the UUID cannot be extracted normally from the read image, the procedure differs from Embodiment 1. The extraction process control unit 2 checks the scan mode in which the document was read. If the scan mode is photo mode, feature amounts are extracted and similar images are searched for and displayed (steps 201, 110 to 112). If the scan mode is not photo mode, search words are extracted and documents are searched for and displayed (steps 201, 105 to 109, 112).

The description so far assumes a "document" written in characters, but on the sheet to be read, printed photographs may in fact occupy all or most of the area. In that case the user is likely to select photo mode for the scan in consideration of reading accuracy, and searching by feature amounts is considered more appropriate than extracting search words. This embodiment therefore checks whether photo mode was selected as the scan mode and chooses the search method accordingly. In other words, in this embodiment the search methods are applied in the priority order UUID, feature amount, search word.
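The selection logic of Embodiment 2 can be sketched as a small function. The name, the string labels, and the boolean `photo_mode` flag are assumptions for illustration; the step numbers in the comments follow the description of FIG. 4.

```python
from typing import Optional


def choose_search_method_embodiment2(uuid: Optional[str], photo_mode: bool) -> str:
    """Pick the search method in the order UUID, feature amount, search word."""
    if uuid is not None:   # step 103: UUID extracted normally
        return "uuid"
    if photo_mode:         # step 201: document was scanned in photo mode
        return "feature_amount"
    return "search_word"
```

Unlike Embodiment 1, the number of effective words plays no part in the choice here; the scan mode alone decides between the two fallbacks.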

Embodiment 3.
The document search process of this embodiment is described with reference to the flowchart shown in FIG. 5. In FIG. 5, processes identical to those described for Embodiment 2 with reference to FIG. 4 are given the same step numbers, and their description is omitted where appropriate. The apparatus configuration of this embodiment may be the same as that of Embodiment 1, so its description is omitted.

When a read image is formed by scanning a document (step 101), the extraction processing control unit 2 checks the scan mode in which the document was read. If the scan mode is the photo mode, the feature amount is extracted, and similar images are searched for and displayed (steps 201, 110 to 112). If the scan mode is not the photo mode, the UUID is extracted from the read image (step 102). When the UUID is extracted normally from the read image, the UUID search unit 10 searches the UUID database 20 using the UUID as a key and displays the search result on the screen (steps 103, 104, and 112).

That is, Embodiment 2 was described as being suited to photographic originals because it gives the feature amount priority over search words. This embodiment strengthens that tendency further: for a photographic original, on which a UUID cannot be printed, the feature amount is extracted without even checking for a UUID. In other words, in this embodiment the search methods are applied in the priority order of feature amount, UUID, and search word.
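A minimal sketch of this ordering is shown below, under stated assumptions. The function name `select_search_embodiment3`, the returned labels, and the `extract_uuid` callback are hypothetical. The callback is used only to make visible that UUID extraction is skipped entirely in the photo mode. The fallback to search words when UUID extraction fails is inferred from the stated priority order.

```python
from typing import Callable, Optional


def select_search_embodiment3(
    photo_mode: bool,
    extract_uuid: Callable[[], Optional[str]],
) -> str:
    """Pick a search path in the priority order feature amount, UUID, search word.

    `extract_uuid` is a hypothetical callback that returns None when no UUID
    can be extracted from the read image.
    """
    if photo_mode:
        # Step 201 comes first: in the photo mode the feature amount is used
        # and UUID extraction is never attempted (steps 110, 111).
        return "feature"
    if extract_uuid() is not None:
        # UUID extracted normally (step 102): search the UUID database
        # (steps 103, 104).
        return "uuid"
    # Assumed fallback, following the stated priority order: search words.
    return "search_word"


if __name__ == "__main__":
    print(select_search_embodiment3(True, lambda: "uuid-001"))
    print(select_search_embodiment3(False, lambda: "uuid-001"))
    print(select_search_embodiment3(False, lambda: None))
```

Compared with a flow that tries the UUID first, the only structural change is that the scan-mode check is moved ahead of UUID extraction.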

According to each of the above embodiments, the search processes are executed selectively and the search result is displayed, so document information matching a leaked document can be found reliably. The operation is also simple, because the user only needs to scan the document.

Embodiment 4.
The document search process in this embodiment will be described with reference to the flowchart shown in FIG. 6. In FIG. 6, processes identical to those described for Embodiment 1 with reference to FIG. 3 are given the same step numbers, and their description is omitted as appropriate. The apparatus configuration in this embodiment may be the same as in Embodiment 1, and its description is therefore omitted.

When a read image is formed by scanning a document (step 101), the extraction processing control unit 2 causes the extraction processes for all the document feature information extracted in this embodiment, namely the UUID, the search words, and the feature amount, to be executed concurrently (steps 102 to 104, 105 to 109, 110 to 111). The search result evaluation unit 16 then merges the search results from the search units 10, 12, and 14 (step 401). For the merge, a predetermined evaluation criterion, for example the number of results to extract, is set for each search result, and the search results are arranged in a predetermined order, for example UUID, search word, feature amount. The display unit 18 displays the merged result on the screen (step 112).
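The concurrent execution and merge can be sketched as follows. This is an illustration under stated assumptions, not the method of the specification: the names `run_and_merge`, `MERGE_ORDER`, `searchers`, and `limits` are hypothetical, a thread pool stands in for whatever concurrency mechanism the apparatus uses, and removing duplicate document IDs during the merge is an added assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Image = Dict[str, object]                 # hypothetical read-image record
Searcher = Callable[[Image], List[str]]   # returns candidate document IDs

# Predetermined merge order given as an example in the text.
MERGE_ORDER = ["uuid", "search_word", "feature"]


def run_and_merge(
    image: Image,
    searchers: Dict[str, Searcher],
    limits: Dict[str, int],
) -> List[str]:
    """Run every search concurrently, then merge the results (step 401).

    `limits` gives the number of results to take from each search, which is
    the example evaluation criterion mentioned in the text.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        futures = {name: pool.submit(fn, image) for name, fn in searchers.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    merged: List[str] = []
    for name in MERGE_ORDER:
        for doc_id in results.get(name, [])[: limits.get(name, 0)]:
            if doc_id not in merged:  # assumption: show each document once
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    demo_searchers: Dict[str, Searcher] = {
        "uuid": lambda img: ["doc-1"],
        "search_word": lambda img: ["doc-2", "doc-1", "doc-3"],
        "feature": lambda img: ["doc-4", "doc-5"],
    }
    demo_limits = {"uuid": 1, "search_word": 2, "feature": 1}
    print(run_and_merge({}, demo_searchers, demo_limits))
```

Because all three searches always run, a failure in one of them (for example, no UUID on the sheet) simply yields an empty list for that source, and the remaining results still appear in the merged output.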

According to this embodiment, the search processes are executed in combination and their results are merged for display, which raises the likelihood of finding document information that matches a leaked document. The operation is also simple, because the user only needs to scan the document.

FIG. 1 is a functional block diagram showing an embodiment of the document search apparatus according to the present invention.
FIG. 2 is an overall configuration diagram of a network system to which the document search apparatus of the embodiment is applied.
FIG. 3 is a flowchart showing the document search process in Embodiment 1.
FIG. 4 is a flowchart showing the document search process in Embodiment 2.
FIG. 5 is a flowchart showing the document search process in Embodiment 3.
FIG. 6 is a flowchart showing the document search process in Embodiment 4.

Explanation of symbols

1 image reading unit, 2 extraction processing control unit, 4 UUID extraction unit, 6 feature amount extraction unit, 8 search word extraction unit, 10 UUID search unit, 12 similar image search unit, 14 document search unit, 16 search result evaluation unit, 18 display unit, 20 UUID database, 22 similar image database, 24 document database, 26 document information database, 30 image forming apparatus, 32 database server, 34 LAN.

Claims

In a document search apparatus for searching for a document that matches input image data from a database in which information about the document is stored,
Image data input means for inputting image data of a document;
Document feature extraction processing means for extracting a plurality of types of document feature information capable of specifying the document from image data of the input document;
Acquisition means for acquiring selection candidates for the document by searching the database based on each piece of document feature information extracted by the document feature extraction processing means;
Evaluation means for identifying the document by evaluating the selection candidates acquired by the acquisition means;
Output means for outputting an evaluation result by the evaluation means;
A document search apparatus characterized by comprising:

The document search apparatus according to claim 1, wherein
The document feature extraction processing means includes an identification information extraction unit that extracts, from image data, identification information uniquely assigned to the document as document feature information, and
The acquisition means includes an identification information search unit that searches the database in which the identification information uniquely assigned to each document is registered.

The document search apparatus according to claim 1 or 2, wherein
The document feature extraction processing means includes a feature amount extraction unit that calculates a feature amount of the document from image data as document identification information, and
The acquisition means includes a feature amount search unit that searches the database in which the feature amount of each document is registered.

The document search apparatus according to claim 1 or 2, wherein
The document feature extraction processing means includes a search word extraction unit that extracts, from image data of the document, a search word used for database search as document identification information, and
The acquisition means includes a document search unit that searches the database in which each document is registered.

The document search apparatus according to claim 1, wherein
The document feature extraction processing means includes an extraction process control unit that controls the order of each process for extracting a plurality of types of document feature information.

Implemented in a document search apparatus for searching for a document that matches input image data from a database in which information about the document is stored,
An image data input step for inputting image data of the document;
A document feature extraction step of extracting a plurality of types of document feature information capable of specifying the document from the image data of the input document;
An acquisition step of acquiring selection candidates for the document by searching the database based on each piece of document feature information extracted in the document feature extraction step;
An evaluation step of identifying the document by evaluating the selection candidates acquired in the acquisition step;
An output step of outputting an evaluation result by the evaluation step;
A document search method comprising: