JP2018028714A

JP2018028714A - Information processing apparatus and program

Info

Publication number: JP2018028714A
Application number: JP2016159118A
Authority: JP
Inventors: 碧谷口; Midori Taniguchi
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2016-08-15
Filing date: 2016-08-15
Publication date: 2018-02-22

Abstract

PROBLEM TO BE SOLVED: To reduce the storage area required for a document by using the content included in the document as the unit of storage. SOLUTION: An information processing apparatus comprises: a content extraction part 13 that extracts content from the image of a document to be stored, in accordance with a content extraction rule that defines which of the content included in a document enables the document to be traced; a content registration part 153 that registers the extracted content in a content storage part 25; and a document information registration part 155 that registers, in a document information storage part 26, the content IDs of the content extracted from the document in association with the name of the document, thereby registering the document as a combination of content. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus and a program.

In an image log system that stores and manages document data as image files on a hard disk device, when a file stored on the hard disk device is output, the user ID of the user who output the file may be recorded as attribute information of the file for file management purposes. With such a system, when an information leak is discovered, the person who output the file containing the leaked information can be identified easily, so the leak can be traced.

However, if every managed image file is saved on the hard disk device in its entirety, the storage capacity of the hard disk device may come under pressure as time passes. Techniques have therefore been proposed that apply OCR to the character string portions of document data in order to reduce the amount of image data for those portions (see, for example, Patent Document 1).

Patent Document 1: JP 2007-086956 A
Patent Document 2: JP 2009-110319 A

In the conventional techniques, however, even if the amount of data in the character string portions of a file can be reduced, the entire file is still stored; image data for portions that are not needed to trace an information leak is not excluded from storage.

An object of the present invention is to reduce the storage area required for a document by using the content included in the document as the unit of storage.

An information processing apparatus according to the present invention comprises: document receiving means for receiving a document to be stored; extraction means for extracting content from the document received by the document receiving means, based on content extraction definition information that defines which of the content included in a document is to be saved; content registration means for registering the content extracted by the extraction means in content storage means; and document information registration means for generating document information by associating document identification information, which identifies the document received by the document receiving means, with content identification information of the content extracted from that document and stored in the content storage means, and for registering the document information in document information storage means.

Further, when content identical to the content extracted by the extraction means is already registered in the content storage means, the content registration means does not register the extracted content in the content storage means, and the document information registration means generates the document information by associating the content identification information of the identical content with the document identification information.

Further, when content similar to the content extracted by the extraction means is registered in the content storage means, the content registration means registers, as content in the content storage means, difference information between the extracted content and the similar content, and the document information registration means generates the document information by associating the content identification information of the difference information with the document identification information.

Further, the apparatus comprises restoring means for restoring content by referring to content management information that associates the similar content with the difference information extracted by the extraction means.

Further, the apparatus comprises: content output means for outputting content registered in the content storage means; content accepting means for accepting content selected from the content output by the content output means; and document information output means for outputting information about documents that include the content accepted by the content accepting means.

Further, the content extraction definition information defines content that enables a document to be traced, and the content accepting means accepts content included in a leaked document.

Further, the apparatus comprises attribute information receiving means for receiving attribute information of the document to be stored, the attribute information including the person who output the document; the document information registration means associates the attribute information of the document with the document identification information; and the document information output means outputs the attribute information of the document as the information about the document.

Further, where content is classified into categories when registered in the content storage means, the apparatus comprises category determination means for determining the category into which the content extracted by the extraction means is to be classified; the content registration means classifies the extracted content into the category determined by the category determination means and registers it in the content storage means; and, when a category is specified at the time information about a document is to be output, the content output means extracts from the content storage means, and displays, the content classified into the specified category and into any category that includes the specified category in a lower layer.

A program according to the present invention causes a computer to function as: document receiving means for receiving a document to be stored; extraction means for extracting content from the document received by the document receiving means, based on content extraction definition information that defines which of the content included in a document is to be saved; content registration means for registering the content extracted by the extraction means in content storage means; and document information registration means for generating document information by associating document identification information, which identifies the document received by the document receiving means, with content identification information of the content extracted from that document and stored in the content storage means, and for registering the document information in document information storage means.

According to the invention of claim 1, the storage area required for a document can be reduced by using the content included in the document as the unit of storage.

According to the invention of claim 2, when content identical to the content to be stored is already registered, the identical content need not be stored a second time.

According to the invention of claim 3, when content similar to the content to be stored is registered, only the difference from the similar content can be extracted and stored as the content.

According to the invention of claim 4, content for which only the difference was stored can be restored.

According to the invention of claim 5, information about documents that include stored content can be output.

According to the invention of claim 6, information about a document to be traced can be output.

According to the invention of claim 7, the person who output a document to be traced can be identified.

According to the invention of claim 8, it becomes easier to narrow down the content included in a document to be traced.

According to the invention of claim 9, the storage area required for a document can be reduced by using the content included in the document as the unit of storage.

FIG. 1 is a block diagram of a document management apparatus that is an embodiment of the information processing apparatus according to the present invention, showing the configuration related to document registration.
FIG. 2 is a block diagram of the same document management apparatus, showing the configuration related to document search.
FIG. 3 is a hardware configuration diagram of the document management apparatus of this embodiment.
FIG. 4A is a flowchart of the document registration process of this embodiment.
FIG. 4B is a flowchart continuing from FIG. 4A.
FIG. 5 shows an example of the data structure of the attribute information that is registered in association with a stored document in this embodiment.
FIG. 6 shows an example of the data structure of the content category management information stored in the content category management information storage unit of this embodiment.
FIG. 7 schematically shows the relationship between categories and content, based on the example settings of the content category management information shown in FIG. 6.
FIG. 8 shows an example of the data structure of the document information registered in the document information storage unit of this embodiment.
FIG. 9 is a flowchart of the document search process of this embodiment.

Preferred embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram of a document management apparatus 10 that is an embodiment of the information processing apparatus according to the present invention, and shows in particular the configuration related to document registration. FIG. 2 is likewise a block diagram of the document management apparatus 10, but shows in particular the configuration related to document search. Components not used in the description of this embodiment are omitted from FIGS. 1 and 2. The document management apparatus 10 of this embodiment stores and manages document data as images.

FIG. 3 is a hardware configuration diagram of the document management apparatus 10 of this embodiment. The document management apparatus 10 can be realized with an existing general-purpose hardware configuration. That is, as shown in FIG. 3, the document management apparatus 10 is configured by connecting to an internal bus 40 a CPU 31, a ROM 32, a RAM 33, a hard disk drive (HDD) 34, an input/output controller 38 to which a mouse 35 and a keyboard 36 provided as input means and a display 37 provided as a display device are each connected, and a network controller 39 provided as communication means.

Returning to FIG. 1, the document management apparatus 10 of this embodiment has a document data receiving unit 11, a document attribute information receiving unit 12, a content extraction unit 13, a category determination unit 14, a content management unit 15, a content extraction rule setting unit 16, a category classification rule setting unit 17, a content extraction rule storage unit 22, a category classification rule storage unit 23, a content category management information storage unit 24, a content storage unit 25, and a document information storage unit 26. The document data receiving unit 11 functions as document receiving means that receives an image of the document data to be stored. Because the document management apparatus 10 is implemented on a computer, a document received by the document data receiving unit 11 is electronic data (an image); for convenience of explanation, however, the image of a document is also referred to simply as a "document". The document attribute information receiving unit 12 functions as attribute information receiving means that receives attribute information of the document to be stored. The content extraction unit 13 functions as extraction means that extracts content from the document received by the document data receiving unit 11, based on the content extraction rules registered in the content extraction rule storage unit 22 as content extraction definition information. Here, "content" means information included in a document, and comes in types such as text portions, tables, and images. The category determination unit 14 functions as category determination means that determines, based on the category classification rules registered in the category classification rule storage unit 23, the category into which each content item extracted by the content extraction unit 13 is to be classified. The content extraction rule setting unit 16 registers content extraction rules in the content extraction rule storage unit 22 based on input from an administrator, developer, or the like (hereinafter simply "administrator"). The category classification rule setting unit 17 registers category classification rules in the category classification rule storage unit 23 based on input from the administrator.

One feature of this embodiment is that the document to be stored is not stored whole as it is; instead, the content included in the document is used as the unit of storage. The content storage unit 25 therefore stores the content included in documents, and the content management unit 15 manages that content. The content management unit 15 has a difference calculation unit 151, a difference content generation unit 152, a content registration unit 153, a content category management information registration unit 154, and a document information registration unit 155. The difference calculation unit 151 calculates a difference value between the content to be registered in the content storage unit 25 (hereinafter, "registration target content") and each content item that is already registered in the content storage unit 25 and is classified into the category to which the registration target content is assigned or into a category that includes that category in a lower layer. When the difference content generation unit 152 determines, based on the difference value calculated by the difference calculation unit 151, that content similar to the registration target content (hereinafter, "similar content") is registered in the content storage unit 25, it extracts difference information between the registration target content and the similar content. The content registration unit 153 is provided as content registration means and registers the content extracted by the content extraction unit 13 (the registration target content) in the content storage unit 25. The processing differs when similar content exists, however; for example, the registration target content is not registered as it is. The details of this registration process are described later. When content is registered, the content category management information registration unit 154 registers, in the content category management information storage unit 24, content category management information that includes content management information about the content and category management information about the category. The document information registration unit 155 is provided as document information registration means; it generates document information by associating document identification information, which identifies the document received by the document data receiving unit 11, with the content identification information of the content extracted from that document and stored in the content storage unit 25, and registers the document information in the document information storage unit 26. The document information registration unit 155 also associates the attribute information of the document, received by the document attribute information receiving unit 12, with the document identification information when registering it in the document information storage unit 26.

In FIG. 2, the document management apparatus 10 has a category receiving unit 18, a content display unit 19, and an information display unit 20. The content category management information storage unit 24, the content storage unit 25, and the document information storage unit 26 are the same as those shown in FIG. 1. The category receiving unit 18 receives the category, entered by the administrator, to which content included in the document to be searched for (a document found to have leaked) belongs. The content display unit 19 is provided as content output means; of the content stored in the content storage unit 25, it displays on the display 37 the content classified into the category received by the category receiving unit 18 and into any category that includes that category in a lower layer. The content display unit 19 also functions as restoring means: when content to be displayed is held as difference information because similar content existed, it restores that content by referring to the content category management information. From the content displayed by the content display unit 19, the administrator selects the content that is included in the document to be searched for (the document found to have leaked). The information display unit 20 functions as content accepting means that accepts the selected content and as document information output means that outputs information about documents that include that content. In this embodiment, document attribute information is displayed on the display 37 as the information about a document. The data structures of the storage units 22 to 26 are described later.
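The lookup performed by the information display unit 20 can be pictured with a minimal Python sketch. It is an illustration only, not the disclosed implementation: the `DocumentInfo` class, the `find_documents_by_content` function, the field names, and the sample IDs (apart from the user ID "user001" used in the description) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentInfo:
    """Hypothetical document information record: a document ID, the IDs of the
    content items extracted from the document, and its attribute information."""
    document_id: str
    content_ids: List[str]
    attributes: Dict[str, str] = field(default_factory=dict)


def find_documents_by_content(documents: List[DocumentInfo],
                              content_id: str) -> List[Dict[str, str]]:
    """Return the attribute information of every document that references content_id."""
    return [doc.attributes for doc in documents if content_id in doc.content_ids]


# Illustrative data only; "c1", "doc-1", and "user002" are made-up values.
docs = [
    DocumentInfo("doc-1", ["c1", "c2"], {"user_id": "user001"}),
    DocumentInfo("doc-2", ["c2", "c3"], {"user_id": "user002"}),
]
print(find_documents_by_content(docs, "c2"))
```

Because a document is stored only as a list of content IDs, one content item selected by the administrator can point back to every document that contained it, and from there to the person who output each document.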

The components 11 to 20 of the document management apparatus 10 are realized by cooperation between the computer that forms the document management apparatus 10 and a program that runs on the CPU 31 mounted in that computer. The storage units 22 to 26 are realized by the HDD 34 mounted in the document management apparatus 10. Alternatively, all or some of the storage units may be implemented in the RAM 33, or on external storage means accessed via a network.

The program used in this embodiment can of course be provided through communication means, and can also be provided stored on a computer-readable recording medium such as a CD-ROM or USB memory. A program provided through communication means or from a recording medium is installed on the computer, and the various processes are realized as the CPU of the computer executes the program.

The operation of this embodiment is described next. Before the document management apparatus 10 can operate, the content extraction rules and the category classification rules must be set in advance. These rules may of course be added, changed, or deleted as appropriate after operation has started.

The content extraction rules define how to identify, among the content included in a document, the content that is to be saved. In this embodiment, so that a leaked document can be traced, the rules are defined to identify content that makes a document traceable. In other words, content that would not help in tracing a document is excluded from storage in the content storage unit 25. A rule that sorts content into needed and unneeded is thus set as a content extraction rule. As described above, this embodiment stores a document divided into units of content, and by not storing content that does not contribute to the purpose, it reduces the storage area used in the content storage unit 25. In this embodiment, the content extraction rules are written as a program and set up so that the content extraction unit 13 can extract content (character strings from OCR results, images, and so on) and content attribute information that characterizes the content (for example, the drawing number, a character string, in drawing data).

The category classification rules define how to determine uniquely the category under which content extracted by the content extraction unit 13 is registered. For example: if the attribute information of the content includes drawing number ○○ and the attribute information of the document includes user △△, the content is classified into category □□.
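A rule of the kind given in this example could be expressed as data, as in the following Python sketch. This is only one possible encoding under stated assumptions: the `CategoryRule` type, the `decide_category` function, the attribute keys `drawing_number` and `user_id`, and the sample values are hypothetical and do not come from the specification.

```python
from typing import Dict, List, NamedTuple, Optional


class CategoryRule(NamedTuple):
    """Hypothetical rule: both attribute conditions must hold to select the category."""
    drawing_number: str  # value expected in the content attribute information
    user_id: str         # value expected in the document attribute information
    category: str        # category assigned when both conditions hold


def decide_category(rules: List[CategoryRule],
                    content_attrs: Dict[str, str],
                    doc_attrs: Dict[str, str]) -> Optional[str]:
    """Return the category of the first matching rule, or None if no rule matches."""
    for rule in rules:
        if (content_attrs.get("drawing_number") == rule.drawing_number
                and doc_attrs.get("user_id") == rule.user_id):
            return rule.category
    return None


# Illustrative rule only: drawing number "DWG-001" output by "user001" goes to "part A".
rules = [CategoryRule("DWG-001", "user001", "part A")]
print(decide_category(rules, {"drawing_number": "DWG-001"}, {"user_id": "user001"}))
```

Returning the first match keeps the result unique even if an administrator registers overlapping rules; how the embodiment resolves such overlaps is not stated, so this ordering is an assumption.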

Next, the document registration process of this embodiment is described with reference to the flowcharts shown in FIGS. 4A and 4B.

When the document registration application is started in response to a predetermined user operation, the document data receiving unit 11 first acquires the image of the document to be registered, as specified by the user (step 101), and the document attribute information receiving unit 12 acquires the attribute information of that document (step 102). FIG. 5 shows an example of the data structure of the attribute information. From this attribute information it can be seen that the document underlying the document image received by the document data receiving unit 11 was produced by being printed out by a job (a print job), and that the person who output it is the user whose user ID is "user001".

Next, the content extraction unit 13 executes the content extraction rule program registered in the content extraction rule storage unit 22 and extracts the content included in the document together with the attribute information of each content item (step 103). As described above, the content extraction unit 13 follows the definitions in the content extraction rules and extracts from the document to be registered only the content that makes the document traceable.

The content extraction unit 13 extracts one or more content items through the above processing, and the document management apparatus 10 repeats the processing described below for each extracted content item.

First, the category determination unit 14 takes one content item that has not yet been processed from the extracted content and, referring to the category classification rules on the basis of the content attribute information acquired by the content extraction unit 13, classifies it into one of the categories (step 105). In this way, this embodiment registers content classified into categories in order to make document search easier; FIG. 7 schematically shows the relationship between the categories and the content.

Once the category into which the content is classified has been determined, the content management unit 15 registers the content in the content storage unit 25. To do so, the difference calculation unit 151 calculates a difference value between that content (the registration target content) and each content item (hereinafter, "comparison target content") that is registered in the content storage unit 25 and belongs to the category to which the registration target content is assigned or to one of its ancestor categories (step 106). Here, an "ancestor category" is a category that includes, in a lower layer, the category to which the registration target content is assigned. Based on the category relationships shown in FIG. 7, when the registration target content is assigned to the category "part A", the categories "root", "product A", and "drawing" are its ancestor categories.
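The set of categories searched for comparison target content can be sketched in Python as a walk up a parent map. The sketch assumes that the categories named in the example form a single chain (root, then product A, then drawing, then part A); the text lists the three ancestors but does not spell out their order, so the `PARENT` map and both function names are illustrative assumptions.

```python
from typing import Dict, List

# Assumed hierarchy for illustration: child category -> parent category.
PARENT: Dict[str, str] = {
    "product A": "root",
    "drawing": "product A",
    "part A": "drawing",
}


def ancestor_categories(category: str, parent: Dict[str, str]) -> List[str]:
    """Return the ancestors of a category, nearest first, by following parent links."""
    ancestors = []
    while category in parent:
        category = parent[category]
        ancestors.append(category)
    return ancestors


def comparison_categories(category: str, parent: Dict[str, str]) -> List[str]:
    """Categories whose content is compared: the assigned category plus its ancestors."""
    return [category] + ancestor_categories(category, parent)


print(comparison_categories("part A", PARENT))
```

Limiting the comparison to these categories keeps the number of difference calculations small compared with scanning every content item in the content storage unit 25.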

A difference value of 0 between the registration target content and a comparison target content means that there is no difference, that is, the two contents are identical. The smaller the difference value, the more similar the registration target content and the comparison target content are. In this embodiment, a threshold serving as the criterion for similarity is set in advance, and when the difference value is equal to or smaller than the threshold, the registration target content and the comparison target content are judged to be similar.
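The identical/similar/different judgment can be written as a small function. This is a sketch only: the text does not say how the difference value is computed, so the function takes it as given, and the names `Match` and `classify_difference` are hypothetical.

```python
from enum import Enum


class Match(Enum):
    IDENTICAL = "identical"
    SIMILAR = "similar"
    DIFFERENT = "different"


def classify_difference(diff_value: float, threshold: float) -> Match:
    """Classify a difference value: 0 is identical, <= threshold is similar."""
    if diff_value < 0:
        raise ValueError("difference value must be non-negative")
    if diff_value == 0:
        return Match.IDENTICAL
    if diff_value <= threshold:
        return Match.SIMILAR
    return Match.DIFFERENT
```

Note that the boundary case (difference value exactly equal to the threshold) counts as similar, following the "equal to or smaller than" wording.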

If no comparison target content identical to the registration target content is registered in the content storage unit 25 (N in step 107), and no comparison target content similar to the registration target content is registered there either (N in step 108), the content management unit 15 newly issues identification information (a content ID) that identifies the registration target content (step 110). If a comparison target content similar to the registration target content is registered in the content storage unit 25 (Y in step 108), the difference content generation unit 152 extracts the difference between the registration target content and the similar content and generates difference information (step 109).

The difference information indicates the difference between the registration target content and the similar content. It includes information on the portions to be added to the similar content and information on the portions to be removed from the similar content. In this embodiment, the difference information is itself managed as content, and the content management unit 15 issues a content ID for the content that consists of the difference information (step 110).
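As a rough illustration of difference information made of "portions to add" and "portions to remove", the sketch below models a content as a set of named parts. That model is an assumption chosen for brevity; the patent handles image content and does not specify a representation. `DifferenceInfo` and `make_difference` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class DifferenceInfo:
    base_content_id: str      # content ID of the similar content
    added: FrozenSet[str]     # parts to add to the similar content
    removed: FrozenSet[str]   # parts to remove from the similar content


def make_difference(target: Set[str], similar: Set[str], similar_id: str) -> DifferenceInfo:
    """Build difference information that turns `similar` into `target`."""
    return DifferenceInfo(
        base_content_id=similar_id,
        added=frozenset(target - similar),
        removed=frozenset(similar - target),
    )
```

For example, if the similar content "0404" consists of an outline, a label, and "hole-1", and the registration target replaces "hole-1" with "hole-2", only those two parts appear in the difference information.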

On the other hand, if a comparison target content identical to the registration target content is already registered in the content storage unit 25 (Y in step 107), the content management unit 15 assigns to the registration target content the content ID that was issued for the identical comparison target content (step 111).

When the above processing has been performed on all the contents extracted from the document to be stored (N in step 104), the content registration unit 153 registers each content in the content storage unit 25 (step 112). If neither identical nor similar content exists, the content is registered in the content storage unit 25 as it is. If similar content exists, the content consisting of the difference information is registered. If identical content exists, it has already been registered and need not be registered again.
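The branching in steps 107 to 112 can be condensed into one decision function. This is a sketch under stated assumptions: the difference values against each comparison target content are supplied by the caller, the closest similar content is chosen as the base when several qualify (the text does not say how ties or multiple candidates are handled), and `Decision`, `assign_content_id`, and `make_id_issuer` are hypothetical names.

```python
import itertools
from typing import Callable, Dict, NamedTuple, Optional


class Decision(NamedTuple):
    content_id: str          # ID assigned to the registration target content
    action: str              # "reuse", "store_difference", or "store_full"
    base_id: Optional[str]   # similar content the difference refers to, if any


def make_id_issuer(start: int = 1) -> Callable[[], str]:
    """Return a function that issues zero-padded sequential content IDs."""
    counter = itertools.count(start)
    return lambda: f"{next(counter):04d}"


def assign_content_id(
    diff_values: Dict[str, float],
    threshold: float,
    issue_id: Callable[[], str],
) -> Decision:
    """diff_values maps each comparison target content ID to its difference value."""
    for content_id, diff in diff_values.items():
        if diff == 0:
            # Step 107 Y -> step 111: reuse the existing ID, nothing new is stored.
            return Decision(content_id, "reuse", None)
    similar = {cid: d for cid, d in diff_values.items() if d <= threshold}
    if similar:
        # Step 108 Y -> steps 109, 110: store only the difference under a new ID.
        base_id = min(similar, key=lambda cid: similar[cid])  # assumption: closest match
        return Decision(issue_id(), "store_difference", base_id)
    # Step 108 N -> step 110: store the content as it is under a new ID.
    return Decision(issue_id(), "store_full", None)
```

Separating the decision from the storage step mirrors the flow in the text, where IDs are settled per content first and registration in the content storage unit 25 happens afterwards.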

In this embodiment, the storage area is reduced by registering only the content that needs to be kept. The reduction is further enhanced by storing only the difference from similar content when similar content exists, and by avoiding duplicate registration when identical content exists.

The content category management information registration unit 154 then registers, in the content category management information storage unit 24, content category management information indicating the relationships between contents, between categories, and between contents and categories (step 113).

FIG. 6 shows an example of the data structure of the content category management information stored in the content category management information storage unit 24 in this embodiment. FIG. 7 schematically shows the relationship between categories and contents based on the setting example of FIG. 6.

As is clear from FIG. 7, the categories in this embodiment form a hierarchy. Looking at the records of type "category" in FIG. 6, category management information is set for each category, associating a category ID as the identification information (ID) of the category, a category label (category name), and the category immediately above it to which it belongs (parent category). In this embodiment, no parent category is set for "root", which shows that it is the top-level category.

Looking at the records of type "content", content management information is set for each content, associating a content ID as the identification information (ID) of the content, a content storage destination pointer indicating where the content is stored in the content storage unit 25, the parent category into which the content is classified, and, if similar content exists, the content ID of that similar content. When identical content appears in more than one document, the classification-destination category determined for each document is set as a parent category of that content. In the setting example shown in FIGS. 6 and 7, the content with content ID "0403" belongs to both the "assembly drawing" and "part A" categories, so the category ID of each of these categories is set as its parent category. The example also shows that the content with content ID "0405" is similar to the content with content ID "0404".
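The two record types can be sketched as plain data classes. The field names follow the description above, but the concrete category IDs ("C01" and so on), the storage pointers, the placement of "assembly drawing" under "drawing", and the categories assigned to "0404" and "0405" are all hypothetical values chosen for illustration; FIG. 6 itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CategoryRecord:
    category_id: str
    label: str
    parent_category_id: Optional[str]   # None marks the top-level category


@dataclass
class ContentRecord:
    content_id: str
    storage_pointer: str                # location in the content storage unit
    parent_category_ids: List[str]      # one entry per classification destination
    similar_content_id: Optional[str] = None


# Hypothetical category IDs and hierarchy.
CATEGORIES: Dict[str, CategoryRecord] = {
    "C01": CategoryRecord("C01", "root", None),
    "C02": CategoryRecord("C02", "product A", "C01"),
    "C03": CategoryRecord("C03", "drawing", "C02"),
    "C04": CategoryRecord("C04", "assembly drawing", "C03"),
    "C05": CategoryRecord("C05", "part A", "C03"),
}

# Hypothetical storage pointers; IDs 0403 to 0405 come from the text.
CONTENTS: Dict[str, ContentRecord] = {
    "0403": ContentRecord("0403", "/contents/0403", ["C04", "C05"]),
    "0404": ContentRecord("0404", "/contents/0404", ["C05"]),
    "0405": ContentRecord("0405", "/contents/0405", ["C05"], similar_content_id="0404"),
}


def top_level_categories(categories: Dict[str, CategoryRecord]) -> List[str]:
    """Labels of categories that have no parent category set."""
    return [c.label for c in categories.values() if c.parent_category_id is None]
```

Holding a list of parent categories on the content record is what lets a single stored content, such as "0403", be reachable from more than one category.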

Next, the document information registration unit 155 registers document information in the document information storage unit 26. FIG. 8 shows an example of the data structure of the registered document information. In the document information, content IDs and a document attribute information storage destination pointer are registered in association with the document name, which serves as the identification information of the document. The content ID field holds the content IDs of the contents that the content extraction unit 13 extracted from the document as storage targets. In this way, a document is registered and recognized as a combination of contents. The document attribute information storage destination pointer holds pointer information indicating where the attribute information is stored.

As described above, the document information registration unit 155 generates document information by associating the document name with the content IDs of the contents included in the document, and thereby registers the document as a combination of contents (step 114). The document information registration unit 155 further registers the attribute information of the document in association with the document name (step 115). The attribute information may be saved at this point, or it may be saved when it is received in step 102, before the contents are registered.
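Steps 114 and 115 amount to writing one record per document. The following is a minimal sketch: `DocumentInfo` and `register_document` are hypothetical names, the in-memory dictionary stands in for the document information storage unit 26, and dropping repeated content IDs while preserving order is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class DocumentInfo:
    document_name: str         # identification information of the document
    content_ids: List[str]     # the document as a combination of contents
    attribute_pointer: str     # where the document attribute information is stored


def register_document(
    store: Dict[str, DocumentInfo],
    document_name: str,
    content_ids: Iterable[str],
    attribute_pointer: str,
) -> DocumentInfo:
    """Associate a document name with its content IDs and attribute pointer."""
    # Assumption: a content ID that occurs twice in a document is recorded once.
    unique_ids = list(dict.fromkeys(content_ids))
    info = DocumentInfo(document_name, unique_ids, attribute_pointer)
    store[document_name] = info
    return info
```

A document whose contents were all registered earlier therefore costs only this small record, since its content IDs simply point at content already held in storage.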

Documents are registered in this embodiment as described above. Next, the document search processing of this embodiment is described with reference to the flowchart shown in FIG. 9. The document search processing is executed, for example, when leakage of a document has been confirmed and the document is to be traced.

When the document search application is started in response to a predetermined user operation, the category receiving unit 18 first reads the content category management information from the content category management information storage unit 24 and displays the categories set in it on the display 37 (step 121). The categories may simply be shown as a list, or the content category management information may be displayed so that the hierarchical relationship of the categories is visible, as illustrated in FIG. 7. The administrator knows the document to be traced, that is, the content included in the leaked document and the category to which that content belongs, and therefore selects that category from those displayed. Even if the administrator does not remember the category, the leaked document itself is known, so the category can be inferred from the content it contains, and displaying the content category management information schematically as in FIG. 7 makes the category easier to select. A document may contain multiple contents, in which case it suffices to select the category to which any one of them belongs.

When the category receiving unit 18 receives the category selected by the administrator (step 122), the content display unit 19 refers to the content category management information and extracts the contents that belong to the selected category and its ancestor categories (step 123). It then displays those contents on the display 37 (step 124).
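Step 123 can be sketched as collecting the selected category plus its ancestors and filtering contents by membership. The tables below are hypothetical: the hierarchy, the "manual" category, and the content IDs other than "0403" are invented for the example, and `contents_for_category` is not a name from the patent.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: category label -> parent label.
PARENT_OF: Dict[str, Optional[str]] = {
    "root": None,
    "product A": "root",
    "drawing": "product A",
    "manual": "product A",
    "assembly drawing": "drawing",
    "part A": "drawing",
}

# Hypothetical content table: content ID -> categories it is classified into.
CONTENT_CATEGORIES: Dict[str, List[str]] = {
    "0403": ["assembly drawing", "part A"],
    "0404": ["drawing"],
    "0410": ["manual"],
}


def contents_for_category(
    selected: str,
    parent_of: Dict[str, Optional[str]],
    content_categories: Dict[str, List[str]],
) -> List[str]:
    """Content IDs belonging to the selected category or any of its ancestors."""
    scope = {selected}
    current = parent_of.get(selected)
    while current is not None:
        scope.add(current)
        current = parent_of.get(current)
    return sorted(
        content_id
        for content_id, categories in content_categories.items()
        if scope.intersection(categories)
    )
```

With these tables, selecting "part A" returns "0403" and "0404" but not "0410", because "manual" is a sibling branch rather than an ancestor of "part A".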

As noted, in this embodiment, when similar content exists, only the difference from the similar content is extracted and stored. Accordingly, when a content to be displayed has a similar content, the content display unit 19 reads both the content to be displayed (the difference information) and its similar content from the content storage unit 25, restores the content to be displayed to a displayable image, and then displays it.
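Restoration from difference information can be sketched as applying the removed and added portions to the similar content. As before, modeling content as a set of named parts is an assumption made for brevity, and `StoredContent` and `restore` are hypothetical names. The recursion also handles the case where the similar content is itself stored as a difference, which the text does not rule out.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional


class StoredContent(NamedTuple):
    added: FrozenSet[str]       # all parts when base_id is None, else parts to add
    removed: FrozenSet[str]     # parts to remove from the base content
    base_id: Optional[str]      # content ID of the similar content, if any


def restore(content_id: str, store: Dict[str, StoredContent]) -> FrozenSet[str]:
    """Rebuild the full content, following references to similar content."""
    record = store[content_id]
    if record.base_id is None:
        return record.added
    base = restore(record.base_id, store)
    return (base - record.removed) | record.added


# Hypothetical stored data: "0405" is kept only as a difference against "0404".
STORE: Dict[str, StoredContent] = {
    "0404": StoredContent(frozenset({"outline", "label", "hole-1"}), frozenset(), None),
    "0405": StoredContent(frozenset({"hole-2"}), frozenset({"hole-1"}), "0404"),
}
```

Here `restore("0405", STORE)` reads both records and returns the outline, the label, and "hole-2", which is the full content that was never stored on its own.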

When the administrator selects, from the contents displayed on the display 37, a content included in the document to be traced, the information display unit 20 receives that content (step 125) and searches the document information storage unit 26 on the basis of it to identify the documents that include the received content (step 126). If multiple documents include the same content, all of them are identified. The information display unit 20 then reads the attribute information associated with the identified documents and displays it on the display 37 (step 127).
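Steps 126 and 127 are a reverse lookup from a content ID to the documents that contain it, followed by an attribute read. The sketch below uses two in-memory dictionaries in place of the document information storage unit 26 and the attribute store; the document names, the user IDs, and the `"user_id"` key are hypothetical.

```python
from typing import Dict, List

# Hypothetical document information: document name -> content IDs.
DOCUMENT_CONTENTS: Dict[str, List[str]] = {
    "assembly_guide": ["0403", "0404"],
    "parts_list": ["0403", "0405"],
    "price_sheet": ["0420"],
}

# Hypothetical attribute information keyed by document name.
ATTRIBUTES: Dict[str, Dict[str, str]] = {
    "assembly_guide": {"user_id": "u1001"},
    "parts_list": {"user_id": "u2002"},
    "price_sheet": {"user_id": "u3003"},
}


def documents_containing(content_id: str, document_contents: Dict[str, List[str]]) -> List[str]:
    """Names of all documents whose content ID list includes `content_id`."""
    return sorted(name for name, ids in document_contents.items() if content_id in ids)


def output_users(
    content_id: str,
    document_contents: Dict[str, List[str]],
    attributes: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Map each matching document to the user ID recorded in its attributes."""
    return {
        name: attributes[name]["user_id"]
        for name in documents_containing(content_id, document_contents)
    }
```

Because identical content shares one content ID across documents, a single selected content such as "0403" surfaces every document it appears in, along with who output each one.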

By referring to the user ID in the displayed attribute information, the administrator can learn who output the leaked document, and can then trace the leaked document by, for example, making inquiries to that person.

Because the purpose here is to trace a leaked document, it is sufficient to display at least the user ID included in the attribute information. For other purposes, other information relating to the document may be displayed, such as other items of the attribute information or all the contents included in the document.

In this embodiment, as is clear from the document information shown in FIG. 8, a document is stored as a combination of contents, and no information is recorded about which page of the document each content belongs to. Information such as the page on which a content appears and its placement on the page may of course be recorded as well. However, a leaked document is not necessarily used with its original layout, and it is not uncommon for only part of its content to be cut out and used. For this reason, this embodiment uses content as the unit of storage and manages documents in units of content.

In this embodiment, the display 37 has been described as an example of the output destination of the content output means and the document information output means. The output destination is not limited to this, and other destinations may be selected, such as a print medium, an information terminal used by the administrator, storage means, or transmission to another computer via a network.

This embodiment has been described using the example of an information processing apparatus implemented on a single computer, but it may be implemented on multiple computers, for example by separating the document registration function from the search function. In addition, since an image forming apparatus has a built-in computer (information processing apparatus), the embodiment may be implemented on an image forming apparatus that stores images of documents read by its scanner in an internal HDD.

DESCRIPTION OF SYMBOLS: 10 document management apparatus, 11 document data reception unit, 12 document attribute information reception unit, 13 content extraction unit, 14 category determination unit, 15 content management unit, 16 content extraction rule setting unit, 17 category classification rule setting unit, 18 category receiving unit, 19 content display unit, 20 information display unit, 22 content extraction rule storage unit, 23 category classification rule storage unit, 24 content category management information storage unit, 25 content storage unit, 26 document information storage unit, 31 CPU, 32 ROM, 33 RAM, 34 hard disk drive (HDD), 35 mouse, 36 keyboard, 37 display, 38 input/output controller, 39 network controller, 40 internal bus, 151 difference calculation unit, 152 difference content generation unit, 153 content registration unit, 154 content category management information registration unit, 155 document information registration unit.

Claims

An information processing apparatus comprising:
document receiving means for receiving a document to be stored;
extraction means for extracting content from the document received by the document receiving means, based on content extraction definition information that defines which of the contents included in a document are to be stored;
content registration means for registering the content extracted by the extraction means in content storage means; and
document information registration means for generating document information by associating document identification information that identifies the document received by the document receiving means with content identification information of the content extracted from the document and stored in the content storage means, and for registering the document information in document information storage means.

The information processing apparatus according to claim 1, wherein,
when content identical to the content extracted by the extraction means is already registered in the content storage means,
the content registration means does not register the content extracted by the extraction means in the content storage means, and
the document information registration means generates the document information by associating the content identification information of the identical content with the document identification information.

The information processing apparatus according to claim 1, wherein,
when content similar to the content extracted by the extraction means is registered in the content storage means, the content registration means registers, as content in the content storage means, difference information between the content extracted by the extraction means and the similar content, and
the document information registration means generates the document information by associating content identification information of the difference information with the document identification information.

The information processing apparatus according to claim 3, further comprising restoring means for restoring the content with reference to content management information that associates the similar content with the difference information extracted by the extraction means.

The information processing apparatus according to claim 1, further comprising:
content output means for outputting content registered in the content storage means;
content receiving means for receiving content selected from the content output by the content output means; and
document information output means for outputting information relating to a document that includes the content received by the content receiving means.

The information processing apparatus according to claim 5, wherein
the content extraction definition information defines content that enables document tracking, and
the content receiving means receives content included in a leaked document.

The information processing apparatus according to claim 6, further comprising attribute information receiving means for receiving attribute information of the document to be stored, the attribute information including the person who output the document, wherein
the document information registration means associates the attribute information of the document with the document identification information, and
the document information output means outputs the attribute information of the document as the information relating to the document.

The information processing apparatus according to claim 5, further comprising category determination means for determining, when content is classified into categories and registered in the content storage means, the category into which the content extracted by the extraction means is to be classified, wherein
the content registration means classifies the content extracted by the extraction means into the category accepted by the category determination means and registers it in the content storage means, and
when a category is specified at the time of outputting information relating to a document, the content output means extracts from the content storage means, and displays, the content classified into the specified category and into any category that includes the specified category in a lower layer.

A program for causing a computer to function as:
document receiving means for receiving a document to be stored;
extraction means for extracting content from the document received by the document receiving means, based on content extraction definition information that defines which of the contents included in a document are to be stored;
content registration means for registering the content extracted by the extraction means in content storage means; and
document information registration means for generating document information by associating document identification information that identifies the document received by the document receiving means with content identification information of the content extracted from the document and stored in the content storage means, and for registering the document information in document information storage means.