JP2009230427A

JP2009230427A - Method, device and program for detection and estimation of electronic document attribute, and recording medium

Info

Publication number: JP2009230427A
Application number: JP2008074456A
Authority: JP
Inventors: Kazuyo Hashimoto; 加寿代橋本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2008-03-21
Filing date: 2008-03-21
Publication date: 2009-10-08

Abstract

PROBLEM TO BE SOLVED: To provide a method, device and program for detection and estimation of electronic document attribute, capable of appropriately detecting and estimating an attribute corresponding to a document without attribute association from a preliminarily accumulated electronic document database, and a recording medium therefor. SOLUTION: The method for detection and estimation of electronic document attribute for detecting and estimating an attribute corresponding to a document without attribute association from a preliminarily accumulated electronic document database includes a search procedure for searching an electronic document with high similarity to the document from electronic documents accumulated in the database; a comparison procedure for comparing, for each attribute of the electronic document searched by the search procedure, the similarity of the document for the attribute concerned with a threshold corresponding to an importance adjusted for this attribute; and an estimation procedure for estimating, when the similarity of each attribute of the searched electronic document exceeds the threshold, the attributes of the electronic document as the attributes of the document. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to an electronic document attribute detection and estimation method, an electronic document attribute detection and estimation apparatus, an electronic document attribute detection and estimation program, and a storage medium, each of which detects and estimates, from an electronic document database stored in advance, an attribute corresponding to a document that has no attribute associated with it.

In general, a company's information assets are created, stored, and used in the form of electronic documents (hereinafter simply referred to as documents). It is very important to take the confidentiality of these in-house documents into account and to control how corporate documents are handled according to that confidentiality. Against this background, various techniques exist for restricting the handling of corporate documents.

For example, the technique described in Patent Document 1 is a system that uses similar-document search technology to analyze the contents of unmanaged documents, such as in-process documents, and to estimate their attributes (for example, attributes used for management and organization, such as classification, category, confidentiality level, and type). For the similarity determination, a confidence value indicating the degree of similarity, returned by a similar-document search engine, is compared with a preset threshold; the documents are judged similar if the value exceeds the threshold and dissimilar otherwise. Based on this determination, the attribute of the in-process document under estimation is estimated to be the attribute value of the document judged similar.

As a result, access control and the like can be applied to a previously unmanaged document, based on the estimated attribute value and in accordance with a policy, that is, a set of access restriction rules.
Patent Document 1: JP 2006-185153 A

However, the technique described in Patent Document 1 and similar techniques set the threshold used for similarity determination to a constant value regardless of the document's attribute, which can be inconvenient in operation. For example, suppose the product A theme, which is an attribute of a document, is under top-secret development and has not been disclosed. The company then wants nobody other than those involved in the development to learn anything about documents concerning product A. For a document estimated to relate to the product A theme, leakage to the outside through copying, scanning, printing, and the like by persons without access authority must be checked strictly.

On the other hand, when a product has already been released and its functions and performance are public, one may want to check for outside leakage only of the top-secret documents relating to that product (product B).

For the product A theme, the similarity threshold needs to be set low so that even documents with low similarity are checked broadly, whereas for the product B theme it is preferable to set the threshold high so that only information with high similarity is checked. Thus, instead of fixing the threshold used for similarity determination, the threshold is changed and adjusted for each document attribute according to the importance of that attribute. When, for example, a warning message is sent to an administrator, the check function can then be tuned to actual operation so that it acts only when needed, and an appropriate warning message can be sent.
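The idea of tying the threshold to the importance of each attribute can be illustrated with a minimal Python sketch. The importance levels, the numeric thresholds, and the function names `threshold_for` and `needs_check` are assumptions made for illustration only; the publication does not give concrete values.

```python
# Hypothetical mapping from importance level to similarity threshold.
# A more important attribute gets a LOWER threshold so that it is checked broadly.
THRESHOLD_BY_IMPORTANCE = {"high": 0.3, "medium": 0.5, "low": 0.8}

# Hypothetical importance assigned to each attribute (folder label).
IMPORTANCE_BY_ATTRIBUTE = {
    "product A theme": "high",  # under secret development: check broadly
    "product B theme": "low",   # already released: check only close matches
}


def threshold_for(attribute: str) -> float:
    """Return the similarity threshold for an attribute based on its importance."""
    importance = IMPORTANCE_BY_ATTRIBUTE.get(attribute, "medium")
    return THRESHOLD_BY_IMPORTANCE[importance]


def needs_check(attribute: str, similarity: float) -> bool:
    """True when the similarity exceeds the threshold set for this attribute."""
    return similarity > threshold_for(attribute)


if __name__ == "__main__":
    # The same similarity score is evaluated against each attribute's own threshold.
    for attr in ("product A theme", "product B theme"):
        print(attr, needs_check(attr, 0.4))
```

With a single fixed threshold, a score of 0.4 would be treated the same way for both themes; with per-attribute thresholds, only the product A theme triggers a check in this example.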

The present invention has been made in view of the above points. Its object is to provide an electronic document attribute detection and estimation method, an electronic document attribute detection and estimation apparatus, an electronic document attribute detection and estimation program, and a storage medium that appropriately detect and estimate, from an electronic document database stored in advance, an attribute corresponding to a document that has no attribute associated with it.

To solve the above problems, the present invention is characterized by adopting the means described below.

The present invention is an electronic document attribute detection and estimation method for detecting and estimating, from an electronic document database stored in advance, an attribute corresponding to a document that has no attribute associated with it. The method comprises: a search procedure for searching the electronic documents stored in the database for an electronic document highly similar to the document; a comparison procedure for comparing, for each attribute of the electronic document found by the search procedure, the similarity of the document with respect to that attribute against a threshold corresponding to an importance adjusted for that attribute; and an estimation procedure for estimating the attribute of the electronic document as an attribute of the document when the per-attribute similarity of the retrieved electronic document exceeds the threshold.
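The three procedures (search, comparison, estimation) can be sketched in Python as follows. This is only an illustration under stated assumptions: the publication does not specify the similarity measure, so cosine similarity over word counts is swapped in here, the similarity to a retrieved document is used as a stand-in for the per-attribute similarity, and the names `cosine`, `search`, and `estimate_attributes` as well as the record layout are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(target: str, database: list[dict], top_k: int = 5) -> list[tuple[dict, float]]:
    """Search procedure: return the stored documents most similar to the target."""
    target_vec = Counter(target.lower().split())
    scored = [
        (doc, cosine(target_vec, Counter(doc["text"].lower().split())))
        for doc in database
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def estimate_attributes(target: str, database: list[dict],
                        thresholds: dict[str, float],
                        default_threshold: float = 0.5) -> dict[str, float]:
    """Comparison and estimation procedures.

    Each retrieved document's similarity is compared with the threshold set
    for that document's attribute; attributes whose similarity exceeds the
    threshold are adopted as estimated attributes of the target.
    """
    estimated: dict[str, float] = {}
    for doc, similarity in search(target, database):
        attribute = doc["attribute"]
        if similarity > thresholds.get(attribute, default_threshold):
            estimated[attribute] = max(similarity, estimated.get(attribute, 0.0))
    return estimated


if __name__ == "__main__":
    # Hypothetical database records: text plus the attribute (folder label).
    database = [
        {"text": "product a secret engine design", "attribute": "product A theme"},
        {"text": "product b release notes manual", "attribute": "product B theme"},
    ]
    thresholds = {"product A theme": 0.3, "product B theme": 0.8}
    print(estimate_attributes("engine design notes", database, thresholds))
```

In this example the target shares two words with the product A document and one with the product B document, so only the product A theme clears its (lower) threshold.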

As another means for solving the above problems, the present invention is an electronic document attribute detection and estimation apparatus for detecting and estimating, from an electronic document database stored in advance, an attribute corresponding to a document that has no attribute associated with it. The apparatus comprises: search means for searching the electronic documents stored in the database for an electronic document highly similar to the document; comparison means for comparing, for each attribute of the electronic document found by the search means, the similarity of the document with respect to that attribute against a threshold corresponding to an importance adjusted for that attribute; and estimation means for estimating the attribute of the electronic document as an attribute of the document when the per-attribute similarity of the retrieved electronic document exceeds the threshold.

Furthermore, as means for solving the above problems, the present invention can also be embodied as a program that causes a computer to execute the procedures of the above electronic document attribute detection and estimation method, and as a computer-readable storage medium.

According to the present invention, it is possible to provide an electronic document attribute detection and estimation method, an electronic document attribute detection and estimation apparatus, an electronic document attribute detection and estimation program, and a storage medium that appropriately detect and estimate, from an electronic document database stored in advance, an attribute corresponding to a document that has no attribute associated with it.

Next, the best mode for carrying out the present invention is described with reference to the drawings.

FIG. 1 is a diagram showing a schematic configuration example of the first system 101. The system 101 shown in FIG. 1 is configured by connecting an electronic document attribute detection and estimation apparatus 10, an administrator terminal 20, a multifunction machine 30-1, and a file server 40 serving as a document management DB (database), a document management apparatus, or the like, through a network such as a LAN (wired or wireless). The multifunction machine 30-1 includes at least a scanner and a plotter and supports multiple image processing functions such as printing, copying, and scanning. The electronic document attribute detection and estimation apparatus 10, the administrator terminal 20, and the other components are assumed to be located within a space in which the confidentiality of information should be maintained, such as within the same company or office.

The electronic document attribute detection and estimation apparatus 10 installs the document attribute learning program 11 from a document learning cooperation program 41, which consists of a group of document learning programs acquired from outside via a communication network. The apparatus 10 also acquires various document data from the document management DB or the like of the file server 40 connected via the communication network, analyzes the data using the document attribute learning program 11, and outputs the resulting data to the attribute feature base DB 12-1. The document attribute learning program 11 and the attribute feature base DB 12-1 together constitute a content analysis engine 16-1.

Using the document attribute estimation program 13-1, the electronic document attribute detection and estimation apparatus 10 also refers to the feature information of the various attributes obtained from the attribute feature base DB 12-1 and to a preset policy 14-1. Based on this information, the apparatus 10 estimates the attribute that fits the analysis target document 31-1, which is obtained via the communication network from the externally provided multifunction machine 30-1, and outputs the analysis result 15 obtained by the estimation.

Based on the analysis result 15 and in accordance with the policy information set in the policy 14-1, the electronic document attribute detection and estimation apparatus 10 transmits warning information or the like, indicating that the analysis target document 31-1 has been output, to the administrator terminal 20 or the like, which is a computer connected externally via the communication network.
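The policy-driven warning step can be pictured with the following hedged Python sketch. The policy layout (a mapping from attribute to the operations that should raise a warning), the message format, and the name `build_warning` are assumptions for illustration; the publication does not define the structure of the policy 14-1 or of the warning information.

```python
from typing import Optional

# Hypothetical policy: for each attribute, the operations that trigger a warning.
POLICY = {
    "product A theme": {"copy", "scan", "print", "fax"},
    "product B theme": {"fax"},
}


def build_warning(user: str, operation: str,
                  estimated: dict[str, float],
                  policy: dict[str, set[str]]) -> Optional[str]:
    """Return a warning message for the administrator, or None if no rule applies.

    `estimated` maps each estimated attribute of the analysis target document
    to its similarity score (the analysis result).
    """
    hits = sorted(attr for attr in estimated if operation in policy.get(attr, set()))
    if not hits:
        return None
    return (f"WARNING: {user} performed '{operation}' on a document "
            f"estimated as: {', '.join(hits)}")


if __name__ == "__main__":
    print(build_warning("alice", "copy", {"product A theme": 0.52}, POLICY))
    print(build_warning("bob", "copy", {"product B theme": 0.91}, POLICY))
```

In a deployment like the first system, the returned message would be sent over the network to the administrator terminal; how it is transported is left open here.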

The administrator terminal 20 can receive the warning information and the like transmitted from the electronic document attribute detection and estimation apparatus 10 and, for example, display it.

FIG. 2 is a diagram showing a hardware configuration example of the electronic document attribute detection and estimation apparatus 10. In FIG. 2, the apparatus 10 is a computer comprising an input device 51, an output device 52, a drive device 53, an auxiliary storage device 54, a memory device 55, an arithmetic processing device 56, and an interface device 57, which are interconnected by a bus B.

The input device 51 consists of a keyboard, a mouse, and the like, and is used to input various signals. The output device 52 consists of a display device or the like, and is used to display various windows, data, and so on. The interface device 57 consists of a modem, a LAN card, or the like, and is used to connect to the network 3.

Through the interface device 57, the electronic document attribute detection and estimation apparatus 10 acquires various document data via the network 3 from the externally provided file server 40 or the like. From the acquired document data, the apparatus 10 generates the attribute feature base DB 12-1 by means of the document attribute learning program 11, which forms part of the content analysis engine 16-1.

Through the interface device 57, the electronic document attribute detection and estimation apparatus 10 also acquires the document to be analyzed via the network 3 from the externally provided multifunction machine 30-1. In addition, the apparatus 10 transmits information indicating that the document to be analyzed has been output, and the like, via the network 3 to the externally provided administrator terminal 20.

The document attribute learning program 11 and the document attribute estimation program 13-1 according to the present embodiment are at least part of the various programs that control the electronic document attribute detection and estimation apparatus 10. These programs are provided, for example, by distributing a storage medium 58 or by downloading from the network 3. Various types of storage media can be used as the storage medium 58 storing the programs: media that record information optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disk, and semiconductor memories that record information electrically, such as a ROM (Read Only Memory) or a flash memory.

When the storage medium 58 storing the document attribute learning program 11 and the document attribute estimation program 13-1 is set in the drive device 53, the programs are installed from the storage medium 58 into the auxiliary storage device 54 via the drive device 53.

Alternatively, the document attribute learning program 11 and the document attribute estimation program 13-1 may be downloaded from the network 3 and installed in the auxiliary storage device 54 via the interface device 57.

The electronic document attribute detection and estimation apparatus 10 stores the installed document attribute learning program 11 and document attribute estimation program 13-1 together with the necessary files, data, and the like. At startup, the memory device 55 reads the document attribute learning program 11 and the document attribute estimation program 13-1 from the auxiliary storage device 54 and holds them. The arithmetic processing device 56 then generates the attribute feature base DB 12-1, acting as the content analysis engine 16-1, in accordance with the document attribute learning program 11 held in the memory device 55, and carries out the various processes described later in accordance with the document attribute estimation program 13-1.

Next, a schematic configuration example of the second system is described. The second system 102 differs from the first system 101 in that the multifunction machine is provided inside the electronic document attribute detection and estimation apparatus 10.

FIG. 3 is a diagram showing a schematic configuration example of the second system 102. In the schematic configuration example shown in FIG. 3, the document attribute learning/estimation multifunction device 60 comprises the document attribute learning program 11, an attribute feature base DB 12-2, a document attribute estimation program 13-2, a multifunction machine 30-2, and a copy/scan/facsimile application 61. The multifunction machine 30-2 includes at least a scanner and a plotter and supports multiple image processing functions such as printing, copying, and scanning.

The document attribute learning/estimation multifunction device 60 connects via a communication network to the externally provided file server 40 and acquires various document data from the document management DB or the like of the file server 40. Using the document attribute learning program 11, the device 60 requests the document data, analyzes the acquired data, and outputs the resulting data to the attribute feature base DB 12-2. The document attribute learning program 11 and the attribute feature base DB 12-2 together constitute a content analysis engine 16-2.

Using the document attribute estimation program 13-2, the document attribute learning/estimation multifunction device 60 also refers to the feature information of the various attributes obtained from the attribute feature base DB 12-2 and to the policy information set in advance in the policy 14-1. The device 60 then estimates the attribute that fits the analysis target document 31-2, which is obtained from the multifunction machine 30-2 via the copy/scan/facsimile application 61, and outputs the analysis result 15 obtained by the estimation.

Based on the analysis result 15 and in accordance with the policy information set in the policy 14-1, the document attribute learning/estimation multifunction device 60 issues warning information or the like, such as information that the analysis target document 31-2 has been output, via the copy/scan/facsimile application 61 on the screen of the multifunction machine 30-2 to the user who output the analysis target document 31-2.

FIG. 4 is a diagram showing a hardware configuration example of the document attribute learning/estimation multifunction device 60. In FIG. 4, the device 60 is a computer comprising an input device 51, an output device 52, a drive device 53, an auxiliary storage device 54, a memory device 55, an arithmetic processing device 56, and an interface device 57, which are interconnected by a bus B.

The input device 51 consists of a keyboard, a mouse, and the like, and is used to input various signals. The output device 52 consists of a display device or the like, and is used to display various windows, data, and so on. The interface device 57 consists of a modem, a LAN card, or the like, and is used to connect to a network.

Through the interface device 57, the document attribute learning/estimation multifunction device 60 acquires various document data via the network 4 from the externally provided file server 40 or the like. From the acquired document data, the device 60 generates the attribute feature base DB 12-2 by means of the document attribute learning program 11, which forms part of the content analysis engine 16-2.

The document attribute learning program 11, the document attribute estimation program 13-2, and the copy/scan/facsimile application 61 according to the present embodiment are at least part of the various programs that control the document attribute learning/estimation multifunction device 60. These programs are provided, for example, by distributing a storage medium 58 or by downloading from the network 4. Various types of storage media can be used as the storage medium 58 storing the programs: media that record information optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disk, and semiconductor memories that record information electrically, such as a ROM or a flash memory.

When the storage medium 58 storing the document attribute learning program 11, the document attribute estimation program 13-2, and the copy/scan/facsimile application 61 is set in the drive device 53, these programs are installed from the storage medium 58 into the auxiliary storage device 54 via the drive device 53. When the programs are instead downloaded from the network 4, they are installed in the auxiliary storage device 54 via the interface device 57.

The document attribute learning/estimation multifunction device 60 stores the installed document attribute learning program 11, document attribute estimation program 13-2, and copy/scan/facsimile application 61 together with the necessary files, data, and the like. At startup, the memory device 55 reads these programs from the auxiliary storage device 54 and holds them. The arithmetic processing device 56 then generates the attribute feature base DB 12-2, acting as the content analysis engine 16-2, in accordance with the document attribute learning program 11 held in the memory device 55, and carries out the various processes described later in accordance with the document attribute estimation program 13-2.

The copy/scan/facsimile application 61 is an application that executes the copy, scan, and facsimile functions of the multifunction machine 30-2. In response to a request from the document attribute estimation program 13-2, the copy/scan/facsimile application 61 outputs the document to be analyzed (analysis target document 31-2) to the document attribute estimation program 13-2.

Next, the function that generates learning data in the first functional configuration of the document attribute detection and estimation process is described. FIG. 5 is a diagram for explaining the function that generates learning data in the first functional configuration.

As shown in FIG. 5, various documents, organized hierarchically into folders and the like, are stored as the source of the learning data in, for example, a storage device such as a hard disk of the document management apparatus 40. The electronic document attribute detection and estimation apparatus 10 uses, for example, the content analysis engine 16-1 described above to acquire the stored document information, extract the features of the documents in each folder, associate the extracted per-folder features with the folder name, and store them in, for example, the attribute feature base DB 12-1 of the content analysis engine 16-1.

The content analysis engine 16-1 is now described. From the document management apparatus 40, in which various documents are stored and managed by attribute, the content analysis engine 16-1 acquires the document information needed to estimate the attributes of an analysis target document obtained by, for example, a multifunction machine.

An attribute here means, for example, a classification name, category name, folder name, or the like that is used when various information is managed and organized as documents. In general, when various information is saved as documents, related documents are gathered into folders and are classified, organized, and managed there. The label name used to classify, organize, and manage the documents gathered in a folder can be regarded as an attribute of those related documents. An attribute can therefore be thought of as the label (folder) name under which related documents are classified, organized, and managed by, for example, classification, category, confidentiality level, or type.
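As a small illustration of treating a folder name as the attribute of the documents it contains, the following Python sketch derives the label from a file path. The path layout and the helper name `attribute_of` are hypothetical and are not taken from the publication.

```python
from pathlib import PurePosixPath


def attribute_of(path: str) -> str:
    """Treat the name of the containing folder as the document's attribute label."""
    return PurePosixPath(path).parent.name


if __name__ == "__main__":
    # Hypothetical path inside a learning cabinet on the file server.
    print(attribute_of("learning_cabinet/product_A_theme/product_A_plan.doc"))
```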

The content analysis engine 16-1 acquires from the document management apparatus 40 the data of documents that have been grouped into folders in advance, extracts from that data the features of the documents in each folder, and generates learning data by associating the extracted per-folder features with the folder name, that is, the label (or attribute).

More specifically, as shown in FIG. 5, the document management apparatus 40 holds a plurality of product themes (for example, product A theme, product B theme, ...) in a learning folder cabinet.

In the product A theme folder, documents related to product A, such as its plan and design documents, are classified and managed (product A plan.doc, product A design.doc, ...). Similarly, in the product B theme folder, documents related to product B, such as its plan and design documents, are classified and managed (product B design.doc, product A design.doc, ...).

The content analysis engine 16-1 acquires the document information of each folder from the document management apparatus 40, and extracts and analyzes the document features folder by folder. It then learns the association between the features of each folder obtained by the analysis and the attribute (folder name, label, etc.). For example, when documents of the product A theme and the product B theme are classified and managed in separate folders, the content analysis engine 16-1 extracts the document features of each theme, associates the folder name "product A theme" with the features of the product A theme and the folder name "product B theme" with the features of the product B theme, and learns them in advance by storing them as data in, for example, the attribute feature base DB 12-1. Known methods can be used for the analysis, extraction, and learning.
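The association between folder names and document features can be illustrated with a minimal sketch. The description leaves the feature extraction to known methods, so the word-count features below are only an assumed stand-in, and the names `extract_features`, `build_learning_data`, and the sample folder contents are hypothetical.

```python
from collections import Counter


def extract_features(text):
    """Toy feature extractor (assumption): lowercase word counts."""
    return Counter(text.lower().split())


def build_learning_data(folders):
    """Map each folder name (the attribute/label) to the merged features of its documents."""
    learning_data = {}
    for folder_name, documents in folders.items():
        features = Counter()
        for text in documents:
            features.update(extract_features(text))
        learning_data[folder_name] = features
    return learning_data


# Hypothetical stand-in for the folders held by the document management apparatus 40.
folders = {
    "product A theme": ["product A plan draft", "product A design notes"],
    "product B theme": ["product B design notes"],
}

# Plays the role of the attribute feature base DB 12-1 in this sketch.
attribute_feature_db = build_learning_data(folders)
```

Keying the stored features by folder name is what lets a later similarity search return an attribute rather than just a matching document.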

Next, the module configuration of the document attribute estimation program 13-1 in the first functional configuration is described with reference to FIG. 6. As described below, the document attribute estimation program 13-1 accepts an attribute estimation request for an analysis target document, performs the attribute estimation, matches the estimation result against the policy, and executes the actions that the policy prescribes.

The document attribute estimation program 13-1 consists of an attribute estimation request reception unit 70, an attribute estimation processing unit 71, a policy determination unit 72, and a policy setting unit 73.

The attribute estimation request reception unit 70 receives, from a client system such as the multifunction peripheral 30-1, the document data of the analysis target document 31-1 and the context information (user information, operation) used for policy determination. It passes the document data to the attribute estimation processing unit 71 described below and receives the analysis result 15 from it. It then outputs the analysis result 15 and the context information to the policy determination unit 72.

The attribute estimation processing unit 71 is a control module that performs the document attribute estimation processing. Using the submodules below, it divides the document data into pieces, performs attribute estimation for each piece, applies a threshold determination to each result, and outputs the final result. The specific procedure of the document attribute estimation processing is described later.

The attribute estimation processing unit 71 consists of the piece division unit 71-1, the document analysis unit 71-2, and the similarity collection unit 71-3 described below.

First, the piece division unit 71-1 divides the document into smaller parts, which become the units of similarity determination, so that it can be determined for each part which part of which document it resembles. This step can be included in the analysis process because even partial reuse should be detected. The unit of division is called a piece. Possible units include fixed-length division (for example, every 500 characters), division by page, division by paragraph (extracted by identifying blank lines and the like on the page image), division by line, and division by sentence.
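Three of the division units named above can be sketched on plain text as follows. This is only an illustration: the function names are hypothetical, and paragraph boundaries are assumed to be blank lines in the text rather than blank rows detected on a page image.

```python
import re


def split_fixed_length(text, size=500):
    """Fixed-length division: consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_paragraphs(text):
    """Paragraph division: split on blank lines (assumed paragraph boundary)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_lines(text):
    """Line division: one piece per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]
```

Smaller pieces make partial reuse easier to detect but increase the number of similarity searches, so the choice of unit is a trade-off left to the system.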

Next, the document analysis unit 71-2 analyzes each divided document (text) by similar document search and extracts, as the analysis result 15, the attributes and similarities of up to the top N ranks. The specific procedure of the similar document search processing is described later.

Next, the similarity collection unit 71-3 takes the results for the individual pieces, merges duplicate attributes, keeps the highest similarity for each attribute, and performs the threshold determination. The specific procedure of the similarity collection processing is described later.

The threshold management unit 71-4 manages the threshold list 80.

The policy determination unit 72 refers to the preset policy 14-1 and performs policy determination based on the attributes in the analysis result 15 output by the attribute estimation processing unit 71 and on the context information, such as user information and operation, received from the client system. It also executes the actions that accompany the determination result, such as sending a warning mail or recording a log. A specific example of the policy information is described later.

The policy setting unit 73 is a module that provides a user interface for setting policies. In this embodiment, for example, the fine adjustment setting unit 73-1, a submodule, provides the function of setting how strict the threshold is on the policy setting screen. A specific example of the setting screen is described later.

The fine adjustment setting unit 73-1 sets the strictness of the threshold for each attribute when the policy is set, and writes the fine adjustment level into the policy 14. The threshold management unit 71-4 refers to this fine adjustment level during attribute estimation.

With these modules, the document attribute estimation program 13-1 performs attribute estimation, matches the estimation result against the policy, and executes the actions that the policy prescribes.

Next, the document attribute estimation processing performed by the attribute estimation processing unit 71 of the document attribute estimation program 13-1 is described. FIG. 7 is a flowchart of the document attribute estimation processing in the first functional configuration.

The attribute estimation processing unit 71 first acquires the threshold list 80 from the threshold management unit 71-4 (S1) and accepts the analysis target document 31-1 (S2).

Next, the piece division unit 71-1 divides the analysis target document 31-1 accepted in S2 into pieces (S3), and the document analysis unit 71-2 analyzes each piece by the similar document search processing (S4). In S4, the document analysis unit 71-2 cuts off results below the lowest value in the threshold list 80 and outputs the analysis result X as the analysis result 15-1. The specific procedure of the similar document search processing of S4 and a specific example of the analysis result X are described later.

Next, the similarity collection unit 71-3 acquires the fine adjustment level 90 from the preset policy 14-1 (S5), collects the similarity results through the similarity collection processing, and outputs the final result Y as the analysis result 15-3 (S6). In S6, the threshold can be readjusted for each attribute according to the fine adjustment level 90. The specific procedure of the similarity collection processing of S6 and a specific example of the final result Y are described later.

An example of how the similarity threshold used in the present invention is determined is described here. FIG. 8 shows the distributions of correct and incorrect similarities relative to the similarity threshold.

The values and distribution of similarity differ depending on, for example, the classifier used for the similar document search, so an experiment with a sample data set is performed beforehand when setting the similarity threshold. As shown in FIG. 8(A), sampling the similar document search yields a distribution of correct similarities, which hit the expected attribute, and a distribution of incorrect similarities, which hit an unexpected attribute. The base threshold is therefore set, for example at position (3) in the graph of FIG. 8(A), so that many correct hits and few incorrect hits are retained.
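One way to pick such a base threshold from sample data is sketched below. The description does not state a selection criterion, so the score used here (correct hits kept minus incorrect hits kept) is an assumption, as are the function name `choose_base_threshold` and the made-up sample similarities.

```python
def choose_base_threshold(correct, incorrect, candidates):
    """Return the candidate threshold that keeps many correct and few incorrect hits.

    Assumed criterion: maximize (correct hits kept) - (incorrect hits kept),
    where a hit is kept when its similarity is >= the threshold.
    """
    def score(threshold):
        kept_correct = sum(s >= threshold for s in correct)
        kept_incorrect = sum(s >= threshold for s in incorrect)
        return kept_correct - kept_incorrect

    return max(candidates, key=score)


# Made-up similarities from a hypothetical sample data set.
correct_hits = [88, 82, 79, 75, 72, 71, 64]
incorrect_hits = [74, 68, 64, 61, 58, 55, 50]

base_threshold = choose_base_threshold(correct_hits, incorrect_hits, range(50, 95, 5))
# base_threshold is 70 for this made-up sample
```

Other criteria, such as fixing a maximum tolerated rate of incorrect hits, would fit the same description equally well.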

Next, because actual operation has to suit various situations as described above, the threshold is adjusted for each attribute. For this adjustment, a threshold adjustment list 78 such as that shown in FIG. 8(B) can be created and defined in the preset policy 14.

As shown in FIG. 8(B), the threshold adjustment list 78 lets the policy 14 specify only a step value that indicates the attribute importance, such as "higher (80)", "slightly higher (75)", "normal (70)", "slightly lower (65)", or "lower (60)". If, for example, "lower" is specified for the product A theme described above, more of the outgoing information that resembles product A theme information becomes subject to monitoring. This also avoids complicating the operation rules defined in the policy 14.

An example of the preset threshold list 80, which relates to the attribute importance in the threshold adjustment list 78, is described next. FIG. 9 shows an example of the threshold list 80. As shown in FIG. 9, the threshold list 80 is a list of graded thresholds, with the levels set as "+2 (80)", "+1 (75)", "0 (70)", "-1 (65)", and "-2 (60)". Because the threshold list 80 has a minimum threshold ("60" in FIG. 9), that value can also be used as a cutoff in, for example, the similar document search processing described later.

An example of the fine adjustment level 90 is described next. FIG. 10 shows an example of the fine adjustment level. The fine adjustment level 90 in FIG. 10 is the level by which the policy 14-1 defines the threshold for each attribute, and can be set as, for example, "attribute A → level -1", "attribute B → level +2", and "attribute C → level +1". The fine adjustment level 90 corresponds to the threshold adjustment list 78 and the threshold list 80. For example, when the fine adjustment level 90 specifies "attribute A → level -1", the threshold of attribute A becomes "65" according to the threshold list 80 in FIG. 9. The fine adjustment level 90 thus makes it easy in operation to change the threshold of each attribute to suit its importance. The correspondence among the threshold adjustment list 78, the threshold list 80, and the fine adjustment level 90 is described later.

Although five threshold levels are set in this embodiment, the present invention is not limited to this; it suffices that a plurality of levels is set.
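The relationship between the threshold list 80 and the fine adjustment level 90 amounts to a two-step lookup, sketched below with the values given for FIG. 9 and FIG. 10. The variable and function names are hypothetical, and falling back to level 0 for an attribute with no configured level is an assumption not stated in the description.

```python
# Threshold list 80 (FIG. 9): level -> similarity threshold.
THRESHOLD_LIST = {+2: 80, +1: 75, 0: 70, -1: 65, -2: 60}

# Fine adjustment level 90 (FIG. 10): attribute -> level.
FINE_ADJUSTMENT_LEVELS = {"A": -1, "B": +2, "C": +1}

# The lowest threshold in the list doubles as the cutoff for the search step.
MIN_THRESHOLD = min(THRESHOLD_LIST.values())


def threshold_for(attribute, levels=FINE_ADJUSTMENT_LEVELS, thresholds=THRESHOLD_LIST):
    """Resolve an attribute to its threshold via its fine adjustment level."""
    level = levels.get(attribute, 0)  # assumption: unlisted attributes use level 0
    return thresholds[level]
```

For instance, `threshold_for("A")` resolves level -1 to the threshold 65, matching the example in the text.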

Next, the similar document search processing performed by the document analysis unit 71-2 of the document attribute estimation program 13-1 is described. FIG. 11 is a flowchart of the similar document search processing in the first functional configuration.

In the similar document search processing of FIG. 11, the document analysis unit 71-2 first acquires the document pieces divided by the piece division unit 71-1, refers to the document features of each attribute in the attribute feature base DB 12-1, and performs the similar document search and document analysis for each acquired piece (S10). The similar document search may use any known technique; for example, the method described in JP-A-2006-185153 can be used.

Next, the document analysis unit 71-2 determines whether document analysis has been completed for all document pieces (S11). If so, it generates the analysis result X (S12) and ends the processing. If not, it acquires one document piece (S13).

Next, the document analysis unit 71-2 performs the similar document search on the piece acquired in S13 and acquires the similarities to the similar documents ranked up to Nth (S14). The rank N can be determined in advance as a system parameter.

Next, the document analysis unit 71-2 determines whether all N ranks acquired in S14 have been examined (S15). If not, it acquires a rank record (S16). It then acquires the threshold list 80 from the threshold management unit 71-4, compares the similarity of the rank record with the minimum threshold, and determines whether the similarity is equal to or higher than the minimum threshold (S17).

If the document analysis unit 71-2 determines that the similarity of the rank record is equal to or higher than the minimum threshold, it acquires the attribute corresponding to the similar document (S18). It then writes the analysis result X, which gives the piece ID, rank, attribute, and similarity of the rank record, to the analysis result X list and saves it (S19).

Further, if the document analysis unit 71-2 determines in S17 that the similarity of the rank record is lower than the minimum threshold, it cuts off that record (S17). This reduces the data to be processed and improves efficiency.

If the document analysis unit 71-2 determines in S15 that all N ranks acquired in S14 have been examined, the processing returns to S11.
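The loop of S11 to S19 can be sketched as follows. The search itself is left to known techniques, so `search` is a hypothetical callable that returns (attribute, similarity) pairs in descending order of similarity; returning the attribute directly, rather than looking it up from the matched document as in S18, is a simplification. The sample index is invented for illustration.

```python
def analyze_pieces(pieces, search, min_threshold=60, top_n=3):
    """Build analysis result X as (piece_id, rank, attribute, similarity) records."""
    results = []
    for piece_id, piece in enumerate(pieces, start=1):           # S11, S13
        ranked = search(piece)[:top_n]                            # S14: top N ranks
        for rank, (attribute, similarity) in enumerate(ranked, start=1):  # S15, S16
            if similarity >= min_threshold:                       # S17: cutoff check
                results.append((piece_id, rank, attribute, similarity))   # S18, S19
    return results


# Hypothetical search results keyed by piece text.
FAKE_INDEX = {
    "piece one": [("A", 70), ("C", 50)],
    "piece two": [("C", 70), ("A", 65), ("B", 60), ("D", 58)],
}

analysis_x = analyze_pieces(["piece one", "piece two"], lambda piece: FAKE_INDEX[piece])
# analysis_x == [(1, 1, "A", 70), (2, 1, "C", 70), (2, 2, "A", 65), (2, 3, "B", 60)]
```

Records below the minimum threshold are dropped here, before any per-attribute threshold is applied, which keeps the later collection step small.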

Next, the similarity collection processing performed by the similarity collection unit 71-3 of the document attribute estimation program 13-1 is described. FIG. 12 is a flowchart of the similarity collection processing in the first functional configuration.

In the similarity collection processing of FIG. 12, the similarity collection unit 71-3 acquires the analysis result X output by the document analysis unit 71-2, merges the attributes in it, and keeps the highest similarity for each attribute (S20). In S20, the similarity collection unit 71-3 thus outputs the intermediate result Y' as the analysis result 15-2.

Next, the similarity collection unit 71-3 performs the threshold fine adjustment determination on each record of the intermediate result Y' (S21). It determines whether the threshold fine adjustment determination has been completed for every record of the intermediate result Y' (S22). If so, it generates the final result Y as the analysis result 15-3 (S23) and ends the processing.

If the similarity collection unit 71-3 determines that the threshold fine adjustment determination has not been completed, it acquires one record of the intermediate result Y' (S24). It acquires from the policy 14 the fine adjustment level 90 for the attribute of the record acquired in S24 (S25). It then refers to the threshold list 80 of the threshold management unit 71-4 and acquires the threshold of the level corresponding to the fine adjustment level 90 acquired in S25 (S26). Finally, it compares the threshold acquired in S26 with the similarity of the record acquired in S24 and determines whether the similarity is equal to or higher than the threshold (S27).

If the similarity collection unit 71-3 determines that the similarity of the record acquired in S24 is equal to or higher than the threshold, it saves the attribute and similarity of the record in the list for the final result Y (S28).

Further, if the similarity collection unit 71-3 determines in S27 that the similarity of the record acquired in S24 is lower than the threshold, it cuts off that record (S27).

Next, the analysis result X, the intermediate result Y', and the final result Y, which are generated as the analysis result 15 by the processing described above, are explained. FIG. 13 shows an example of each.

FIG. 13(A) is an example of the analysis result X generated by the similar document search processing. The analysis result X lists the piece ID and rank of each divided piece, and each rank record gives the similar attribute and the similarity. For example, piece ID "1" has "A, 70" at rank 1, which shows that piece "1" is 70% similar to attribute A and that nothing qualifies at rank 2 or below. Piece ID "2" has "C, 70" at rank 1, "A, 65" at rank 2, and "B, 60" at rank 3, which shows that piece "2" is 70% similar to attribute C, 65% similar to attribute A, and 60% similar to attribute B.

In the analysis result X, the similar document search of S14 is set to acquire similarities only for similar documents up to the preset rank N ("3" in FIG. 13(A)), so similarities ranked fourth or lower are cut off and do not appear in the list. In addition, S17 of the similar document search processing refers to the threshold list 80 and cuts off similarities below the minimum threshold ("60" in FIG. 13(A)), so no similarity below 60 appears in the list.

FIG. 13(B) is an example of the intermediate result Y' generated by the similarity collection processing. The intermediate result Y' is generated by merging the attributes of the analysis result X and keeping the highest similarity for each attribute, so it lists the merged attributes and the highest similarity of each. The intermediate result Y' therefore contains the records "B, 90", "A, 70", and "C, 70".

FIG. 13(C) is an example of the final result Y generated by the similarity collection processing. The final result Y contains "B, 90" and "A, 70". Although "C, 70" appears in the intermediate result Y' of FIG. 13(B), the fine adjustment level 90 acquired for each attribute in S25 of the similarity collection processing sets the threshold of attribute C to "75". "C, 70" is therefore eliminated by the attribute-specific determination of S27 and does not remain in the final result Y.
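The steps S20 to S28 and the values of FIG. 13 can be traced with the sketch below. The records for pieces 1 and 2 follow FIG. 13(A) as described; the record for piece 3 is a hypothetical addition, needed only because the text does not say which piece yields "B, 90". Function and variable names are likewise assumptions.

```python
THRESHOLD_LIST = {+2: 80, +1: 75, 0: 70, -1: 65, -2: 60}   # threshold list 80
FINE_ADJUSTMENT_LEVELS = {"A": -1, "B": +2, "C": +1}        # fine adjustment level 90


def collect_similarities(analysis_x, levels, threshold_list):
    """Return (intermediate result Y', final result Y) from analysis result X."""
    best = {}
    for _piece_id, _rank, attribute, similarity in analysis_x:   # S20: merge attributes
        if similarity > best.get(attribute, -1):
            best[attribute] = similarity                          # keep the highest value
    intermediate = sorted(best.items(), key=lambda record: -record[1])

    final = []
    for attribute, similarity in intermediate:                    # S22, S24
        level = levels.get(attribute, 0)                          # S25 (default 0 assumed)
        threshold = threshold_list[level]                         # S26
        if similarity >= threshold:                               # S27
            final.append((attribute, similarity))                 # S28
    return intermediate, final


analysis_x = [
    (1, 1, "A", 70),
    (2, 1, "C", 70), (2, 2, "A", 65), (2, 3, "B", 60),
    (3, 1, "B", 90),  # hypothetical record supplying the "B, 90" of FIG. 13(B)
]
intermediate_y, final_y = collect_similarities(analysis_x, FINE_ADJUSTMENT_LEVELS, THRESHOLD_LIST)
# intermediate_y == [("B", 90), ("A", 70), ("C", 70)]
# final_y == [("B", 90), ("A", 70)]
```

Attribute C is dropped because its level +1 raises its threshold to 75, while attribute A survives with the same similarity of 70 because its level -1 lowers the threshold to 65.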

An example of the policy 14-1 in the first functional configuration is described here. FIG. 14 shows an example of the policy 14-1.

As shown in FIG. 14(A), the policy 14-1 describes, for example, the label that serves as the attribute, the threshold criterion that indicates the fine adjustment level of each attribute, the persons permitted to operate, the operations, and the warning or other action taken after an operation is performed.

Specifically, as shown in FIGS. 14(A) and 15(A), the rule that the threshold criterion is "lower" for the label "product A" is expressed by description 14a in FIG. 15(B), and the rule that it is a violation if anyone other than a related party copies, scans, prints, or faxes is expressed by description 14b in FIG. 15(B). Under this policy 14-1, as shown in FIG. 14(B), when the determination result is "copied by Mr. X (not a related party)", a warning mail is sent to the administrator of product A.
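The determination described for FIG. 14 can be sketched as a simple rule check. The `Policy` structure, the `judge` function, the action tuples, and the sample users are all hypothetical; the description does not specify how the policy determination unit 72 represents a policy internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    label: str                      # attribute (label) the rule protects
    related_users: frozenset        # users allowed to operate on the label
    restricted_operations: frozenset
    alert_address: str              # recipient of the warning mail


def judge(policy, estimated_attributes, user, operation):
    """Return the actions to execute; an empty list means no violation."""
    if policy.label not in estimated_attributes:
        return []
    if user in policy.related_users or operation not in policy.restricted_operations:
        return []
    return [
        ("ALARM_MAIL", policy.alert_address),
        ("LOG", f"{user} performed {operation} on {policy.label}"),
    ]


policy = Policy(
    label="product A",
    related_users=frozenset({"Y"}),  # hypothetical related party
    restricted_operations=frozenset({"copy", "scan", "print", "fax"}),
    alert_address="aaa@xxx.xxx.xx",
)
actions = judge(policy, {"product A"}, user="X", operation="copy")
```

Here the estimated attributes would come from the final result Y, and the user and operation from the context information received with the request.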

A description example of the policy 14-1 is given next. FIG. 15 shows an example of the above policy expressed in XML. As shown in FIG. 15(B), in the XML 14-2 of the policy 14-1, description 14a is written as <Resource type="LABEL"><value>PROJECT-A</value><ThresholdLevel>LOWER</ThresholdLevel>.

Description 14b is written as <User>other than related parties</User><Operation id="ALL"/>.

Likewise, the rule of the policy 14-1 that a warning mail is sent to the administrator of product A when a violation is detected is written as description 14c in FIG. 15(B): <Obligation id="ALARM_MAIL"><address>aaa@xxx.xxx.xx</address>.
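A policy written this way could be read with a standard XML parser, as in the sketch below. Only the fragments 14a to 14c are quoted in the text, so the enclosing `Policy` and `Rule` elements, the nesting, and the closing tags are assumptions made to obtain a well-formed document; `parse_policy` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Fragments 14a-14c wrapped in assumed <Policy>/<Rule> elements.
POLICY_XML = """
<Policy>
  <Rule>
    <Resource type="LABEL">
      <value>PROJECT-A</value>
      <ThresholdLevel>LOWER</ThresholdLevel>
    </Resource>
    <User>other than related parties</User>
    <Operation id="ALL"/>
    <Obligation id="ALARM_MAIL">
      <address>aaa@xxx.xxx.xx</address>
    </Obligation>
  </Rule>
</Policy>
"""


def parse_policy(xml_text):
    """Extract the label, threshold level, user, operation, and obligation of one rule."""
    rule = ET.fromstring(xml_text).find("Rule")
    resource = rule.find("Resource")
    obligation = rule.find("Obligation")
    return {
        "label": resource.findtext("value"),
        "threshold_level": resource.findtext("ThresholdLevel"),
        "user": rule.findtext("User"),
        "operation": rule.find("Operation").get("id"),
        "obligation": obligation.get("id"),
        "address": obligation.findtext("address"),
    }


policy = parse_policy(POLICY_XML)
```

The `threshold_level` value read here is what the threshold management unit would translate into a level and then into a numeric threshold.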

Next, an example of the policy setting screen is described. FIG. 16 shows an example of the policy setting screen and related interfaces. FIG. 16(A) shows the layout of the policy setting screen 100, and FIGS. 16(B) and 16(C) show example interfaces for adjusting the fine adjustment level.

The policy setting screen 100 of FIG. 16(A) has, as an example, setting areas for "classification label", which sets the attribute; "operator", which sets the persons permitted to operate; "operation type", which sets the operations; "warning method", which sets the action after an operation; and "attribute importance", which adjusts the fine adjustment level. After the settings are made, selecting the register button, for example, registers them and defines them in the policy 14-1.

As shown in FIG. 16(A), the attribute importance 100-2 is selected with radio buttons from "higher", "slightly higher", "normal", "slightly lower", and "lower". The present invention is not limited to this; the setting can also be made with, for example, the slider bar 101 of FIG. 16(B) or the dial 102 of FIG. 16(C). Each time the person who manages the document attributes changes the attribute importance 100-2, the fine adjustment level 90 of the policy 14-1 is updated.

The choices "higher", "slightly higher", "normal", "slightly lower", and "lower" of the attribute importance 100-2 are set to correspond to the entries "higher (80)", "slightly higher (75)", "normal (70)", "slightly lower (65)", and "lower (60)" of the threshold adjustment list 78 of the policy 14-1 described with FIG. 8(B). Furthermore, the threshold adjustment list 78 and the threshold list 80 are associated through the importance, and the threshold list 80 and the fine adjustment level 90 are associated through the level.

Thus, for example, when the attribute importance 100-2 "normal" is selected for the classification label "product A theme" on the policy setting screen 100 of FIG. 16, the threshold adjustment list 78 of the policy 14-1 records "normal (70)". Because the threshold list 80 maps threshold 70 to level 0, the fine adjustment level 90 of the policy 14-1 records attribute "product A theme" → level "0". Consequently, when the threshold of the level corresponding to the fine adjustment level 90 is acquired in the threshold fine adjustment determination of the first functional configuration described above or the second functional configuration described later, the threshold of the attribute "product A theme" is obtained as "70".

Providing the attribute importance 100-2 on the policy setting screen 100 in this way makes it easy to set a threshold that matches the importance of an attribute. A person who manages the attributes of the stored documents can therefore readily change the threshold to follow an attribute's importance as it varies over time.

Because the threshold can be adjusted easily for each attribute, it can be matched to the attribute's importance when a similarity search is performed to estimate the attribute of a document with no associated attribute. Adjusting the threshold changes the range of similar documents that are extracted. For an attribute whose importance has increased, the person who manages document attributes can set a low threshold to widen the extraction range, so that even documents with low similarity are strictly monitored for leakage to the outside through copying and the like. For an attribute whose importance has decreased, the manager can set a high threshold to narrow the extraction range, so that only documents with high similarity are monitored. This enables a similarity search that tracks attribute importance as it changes from day to day.

Next, an outline of the processing in the first functional configuration of the document attribute detection and estimation process is described. FIG. 17 is a diagram for explaining the flow of that processing.

According to FIG. 17, when the document attribute estimation program 13-1 receives an analysis target document 31-1 of unknown attribute from the multifunction peripheral 30-1 or the like, it acquires the document data of the analysis target document 31-1. Next, the document attribute estimation program 13-1 performs a similarity search between that document data and the per-attribute document features learned in advance by the content analysis engine 16-1, namely "features of product A theme documents" and "features of product B theme documents", and analyzes the result. This analysis produces an intermediate result Y'. Next, the policy 14-1 is referenced and the fine adjustment level 90 for each attribute is acquired. The per-attribute threshold obtained from the fine adjustment level 90 is compared with the similarity in the intermediate result Y', and a final result Y is generated.

For example, in FIG. 17 the policy 14-1 specifies "product A: low (similarity 60 or higher)" and "product B: slightly high (similarity 75 or higher)". The intermediate result Y' is "product A theme (similarity 65%)" and "product B theme (similarity 70%)". In the per-attribute similarity determination, "product B theme (similarity 70%)" is cut off because it falls below its threshold of 75, and the final result Y is "product A (similarity 65%)". The analysis target document 31-1 is therefore estimated to be "product A (similarity 65%)".
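The per-attribute cutoff in this example can be reproduced with a few lines of Python. This is only a sketch of the comparison step: the function name and the dictionary representation are assumptions, and the attribute keys are shortened to "product A" and "product B" for readability. The numeric values are the ones given for FIG. 17.

```python
def apply_thresholds(intermediate_result, thresholds):
    """Keep only attributes whose similarity is at or above that attribute's threshold."""
    return {
        attribute: similarity
        for attribute, similarity in intermediate_result.items()
        if similarity >= thresholds[attribute]
    }


# Policy 14-1 in FIG. 17: product A "low" (60 or higher), product B "slightly high" (75 or higher).
policy_thresholds = {"product A": 60, "product B": 75}

# Intermediate result Y' in FIG. 17.
intermediate_y = {"product A": 65, "product B": 70}

final_y = apply_thresholds(intermediate_y, policy_thresholds)
```

Here `final_y` contains only product A with similarity 65, because product B's similarity of 70 does not reach its threshold of 75.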

As described above, according to the first functional configuration, when a document with no associated attribute is printed or otherwise processed on a copier or the like, a similarity analysis is performed between the document data (information) of that document and the document features of the attributes stored in advance. As a result, the attribute of a stored document that is highly similar to the unattributed document is estimated to be that document's attribute, and the policy defined for that attribute can be applied. When the attribute is estimated, changing the threshold according to the importance of each stored attribute makes it possible to adjust the similarity extraction range attribute by attribute. Document attributes can thus be estimated in line with attribute importance that varies with circumstances. For example, the similarity threshold of an attribute whose importance has decreased is set high to narrow the extraction range, and the similarity threshold of an attribute of high importance is set low to widen it. The attribute of a document with no associated attribute can then be estimated appropriately, and an action suited to the operation can be taken based on the policy of the attribute applied to that document.

Next, the function for generating learning data in the second functional configuration of the document attribute detection and estimation process is described. FIG. 18 is a diagram for explaining the function of generating learning data in the second functional configuration.

As shown in FIG. 18, various documents organized hierarchically into folders and the like are stored as the source of the learning data in, for example, a storage device such as a hard disk of the document management device 40. The document attribute learning/estimating multifunction peripheral 60 acquires the stored document information using, for example, the content analysis engine 16-2, and extracts the document features of each folder. The document features of each folder are associated with the folder name, that is, the attribute, and stored in, for example, the attribute feature base DB 12-2. At this time, the fine adjustment level 90 (low, slightly high, and so on) used in the similarity determination for each attribute is stored together with them. Known methods are used for the analysis and learning.

Next, the module configuration of the document attribute estimation program 13-2 in the second functional configuration is described with reference to FIG. 19. The document attribute estimation program 13-2 differs from the document attribute estimation program 13-1 described for the first functional configuration in that the policy setting unit 73 has no fine adjustment setting unit 73-1 and the threshold management unit 71-4 does not refer to the policy 14-2. Functions with the same reference numerals are otherwise the same as those described for the first functional configuration, and their description is omitted.

Next, the document attribute estimation process performed by the attribute estimation processing unit 71 of the document attribute estimation program 13-2 described above is explained. FIG. 20 is a flowchart for explaining the document attribute estimation process in the second functional configuration.

In the document attribute estimation process shown in FIG. 20, the attribute estimation processing unit 71 first acquires the threshold list 80 from the threshold management unit 71-4 (S30) and accepts the analysis target document 31-2 (S31).

Next, the piece dividing unit 71-1 divides the analysis target document acquired in S31 into pieces (S32), and the document analysis unit 71-2 acquires the fine adjustment level 90 from the attribute feature base DB 12-2 (S33).

Next, the document analysis unit 71-2 analyzes each divided document piece by the similar document search process (S34). In S34, the document analysis unit 71-2 refers to the threshold for the fine adjustment level 90 and outputs the analysis result Y'' as the analysis result 15-4. The specific procedure of the similar document search process in S34 and a specific example of the analysis result Y'' are described later.

Next, the similarity collection unit 71-3 collects the results of the similar document search process of S34 and outputs the final result Y as the analysis result 15-3 (S35). The specific procedure of the similar result collection process in S35 and a specific example of the final result Y are described later.

Next, the similar document search process performed by the document analysis unit 71-2 of the document attribute estimation program 13-2 described above is explained. FIG. 21 is a flowchart for explaining the similar document search process in the second functional configuration.

In the similar document search process shown in FIG. 21, the document analysis unit 71-2 first acquires the document pieces divided by the piece dividing unit 71-1, refers to the per-attribute document features in the attribute feature base DB 12-2, and performs a similar document search and document analysis for each acquired piece (S40). The similar document search may use a known technique; for example, the method described in JP-A-2006-185153 can be used.

Next, the document analysis unit 71-2 determines whether document analysis has finished for all document pieces (S41). If it has, the unit generates the analysis result Y'' (S42) and ends the process. If it has not, the unit acquires one document piece (S43).

Next, the document analysis unit 71-2 performs a similar document search on the piece acquired in S43 and acquires the similarities to the similar documents ranked up to N-th (S44). The rank N can be set in advance as a system parameter.

Next, the document analysis unit 71-2 determines whether all N ranks acquired in S44 have been examined (S45). If not, it acquires a rank record (S46). It then acquires the threshold list 80 from the threshold management unit 71-4, compares the similarity of the rank record with the minimum threshold, and determines whether the rank record has a similarity equal to or higher than the minimum threshold (S47).

If the document analysis unit 71-2 determines that the similarity of the rank record is equal to or higher than the minimum threshold, it acquires the attribute corresponding to that similar document (S48). Next, it acquires the fine adjustment level 90 corresponding to the attribute of the rank record from the attribute feature base DB 12-2 (S49). It then refers to the threshold list 80 and acquires the threshold for the level given by the fine adjustment level 90 of the rank record's attribute (S50).

Next, the document analysis unit 71-2 compares the threshold acquired in S50 with the per-attribute similarity of the rank record acquired in S46, and determines whether the rank record has a similarity equal to or higher than the threshold (S51).

If the document analysis unit 71-2 determines that the similarity of the rank record is equal to or higher than the threshold, it outputs the analysis result Y'', which indicates the piece ID, rank, attribute, and similarity of the rank record, to the list for the analysis result Y'' and saves it (S52).

If the document analysis unit 71-2 determines in S47 that the similarity of the rank record is lower than the minimum threshold, it cuts off the rank record (S47). This reduces the data to be processed and improves efficiency.

Likewise, if the document analysis unit 71-2 determines in S51 that the similarity of the rank record is lower than the threshold, it cuts off the rank record (S51).

If it is determined in S45 that all N ranks acquired in S44 have been examined, the process returns to S41.
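The loop of S41 to S52 can be sketched in Python as below. This is a hedged illustration, not the implementation of the embodiment: the `search` callable stands in for the similar document search (which the text leaves to known techniques), and the function name, argument names, level numbering, and sample hit data are all assumptions made for the example.

```python
def similar_document_search(pieces, search, level_by_attribute, threshold_by_level, top_n=3):
    """Return analysis result Y'' as (piece_id, rank, attribute, similarity) tuples.

    search(piece) is a hypothetical callable returning (attribute, similarity)
    pairs sorted by descending similarity.
    """
    minimum_threshold = min(threshold_by_level.values())
    analysis_result = []
    for piece_id, piece in pieces:                         # S41, S43: take the next piece
        ranked = search(piece)[:top_n]                     # S44: keep the top N ranks
        for rank, (attribute, similarity) in enumerate(ranked, start=1):  # S45, S46
            if similarity < minimum_threshold:             # S47: cut off below the minimum
                continue
            level = level_by_attribute[attribute]          # S48, S49: attribute -> level
            threshold = threshold_by_level[level]          # S50: level -> threshold
            if similarity >= threshold:                    # S51: per-attribute check
                analysis_result.append((piece_id, rank, attribute, similarity))  # S52
    return analysis_result                                 # S42


# Made-up search hits for three pieces, used only to exercise the function.
HITS = {
    "piece-1": [("B", 72), ("A", 70), ("A", 55)],
    "piece-2": [("B", 70), ("A", 50)],
    "piece-3": [("B", 90), ("A", 65)],
}
levels = {"A": -2, "B": 1}                                 # assumed: A "low", B "slightly high"
thresholds = {-2: 60, -1: 65, 0: 70, 1: 75, 2: 80}         # assumed level numbering

result = similar_document_search(
    [(1, "piece-1"), (2, "piece-2"), (3, "piece-3")],
    lambda piece: HITS[piece],
    levels,
    thresholds,
)
```

In this run, piece 1 keeps only its "A, 70" record, piece 2 keeps nothing, and piece 3 keeps both "B, 90" and "A, 65". Checking the minimum threshold first lets records that cannot pass any attribute's threshold be discarded before the per-attribute lookup.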

Next, the analysis result Y'' generated by the above process is described. FIG. 22 is a diagram showing an example of the analysis result Y'' generated by the above process.

As shown in FIG. 22, the analysis result Y'' lists the ID and the ranks of each divided piece, and each rank record holds the attribute value of the similar document and the similarity. For example, the piece with ID "1" has "A, 70" in first place, which shows that piece "1" is 70% similar to attribute A and that nothing qualifies for second place or below. Nothing qualifies for the piece with ID "2".

In the document analysis result Y'', S44 of the similar document search process is set to acquire similarities to similar documents only up to the preset rank N (third place in FIG. 22), so similarities ranked fourth or lower are cut off and do not appear in the list. In S47 of the similar document search process, the threshold list 80 is consulted and records whose similarity is below the minimum threshold ("60" in FIG. 22) are cut off, so records with a similarity below 60 do not appear in the list. Finally, S49 acquires the fine adjustment level 90 corresponding to the attribute value, S50 acquires the threshold for that level, and in S51 only similar documents whose similarity is equal to or higher than this threshold remain in the list.

Next, the similar result collection process performed by the similarity collection unit 71-3 of the document attribute estimation program 13-2 described above is explained. FIG. 23 is a flowchart for explaining the similar result collection process in the second functional configuration.

In the similar result collection process shown in FIG. 23, the similarity collection unit 71-3 acquires the analysis result Y'' generated by the document analysis unit 71-2, merges the records of the analysis result Y'' by attribute, and keeps the highest similarity for each attribute (S60). It then generates the final result Y list from the result of S60 and saves it (S61).

Next, the final result Y generated by the above process is described. FIG. 24 is a diagram showing an example of the final result Y generated by the above process.

The final result Y shown in FIG. 24 is generated by merging the attributes of the analysis result Y'' and keeping the highest similarity for each attribute. It therefore shows the records "B, 90" and "A, 70", that is, the merged attributes of the analysis result Y'' and the highest similarity for each.
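The merge of S60 amounts to a group-by-attribute that keeps the maximum similarity. The sketch below assumes the analysis result Y'' is a list of (piece ID, rank, attribute, similarity) tuples; that representation, the function name, the input records, and the descending sort order of the output are assumptions chosen so that the output matches the two records quoted for FIG. 24.

```python
def collect_similarity_results(analysis_result):
    """Merge Y'' records by attribute, keeping the highest similarity per attribute (S60)."""
    best = {}
    for _piece_id, _rank, attribute, similarity in analysis_result:
        if attribute not in best or similarity > best[attribute]:
            best[attribute] = similarity
    # Assumed presentation: list the attributes from highest to lowest similarity (S61).
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical Y'' records; only the resulting maxima correspond to FIG. 24.
analysis_result_y2 = [
    (1, 1, "A", 70),
    (3, 1, "B", 90),
    (3, 2, "A", 65),
]

final_result_y = collect_similarity_results(analysis_result_y2)
```

For these records the final result Y is `[("B", 90), ("A", 70)]`: attribute A appears twice in Y'' and only its higher similarity, 70, is kept.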

Next, an outline of the processing in the second functional configuration of the document attribute detection and estimation process is described. In this configuration example, when the content analysis engine 16-2 learns the attributes of the electronic documents stored in the file server document management DB 40, it also learns the importance of each attribute. FIG. 25 is a diagram for explaining the flow of the processing in the second functional configuration.

According to FIG. 25, when the document attribute estimation program 13-2 receives an analysis target document 31-2 of unknown attribute (label) from the multifunction peripheral 30-2 or the like, it acquires the document data of the analysis target document 31-2. Next, the document attribute estimation program 13-2 acquires the per-attribute document features learned by the content analysis engine 16-2, namely "features of product A theme documents" and "features of product B theme documents", together with the fine adjustment level 90 for each attribute. It then performs a similarity search between the document data of the analysis target document 31-2 and the per-attribute document features, and analyzes the result. Because the similarity of the document data is judged with reference to the fine adjustment level 90 of each attribute, the final result Y that is generated already reflects the per-attribute threshold. From this final result Y, the document attribute estimation program 13-2 can determine the attribute (including the similarity) of the analysis target document 31-2.

For example, in FIG. 25 the content analysis engine 16-2 holds "features of product A theme documents, threshold: low" and "features of product B theme documents, threshold: slightly high" as data. The fine adjustment level 90 specifies, for each attribute, "low (similarity 60 or higher)", "slightly high (similarity 75 or higher)", and so on. The document attribute estimation program 13-2 performs the similarity search and analysis using the document data of the analysis target document 31-2, the content analysis engine 16-2, and the threshold list 80, and generates the final result Y (attribute B (similarity 90%)). The document attribute estimation program 13-2 thereby estimates the analysis target document 31-2 to be "attribute B (similarity 90%)".

Here, an example of the policy 14-2 in the second functional configuration is described with reference to the drawings. FIG. 26 is a diagram showing an example of the policy 14-2.

As shown in FIG. 26(A), the policy 14-2 describes, for example, the label serving as the attribute, the persons permitted to operate on the document, the operations, and the warning or other action taken after an operation is executed. Specifically, as shown in FIG. 26, it states that "for a document labeled 'product A', copying, scanning, printing, or faxing by anyone other than a related party is a 'violation'" and that "when a violation is detected, a warning e-mail is sent to the administrator of product A". Based on this policy 14-2, when a determination result indicating "copied by Mr. X (not a related party)" is obtained, as shown in FIG. 26(B), a warning e-mail is sent to the administrator of product A.
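The policy check of FIG. 26 can be pictured as a simple rule evaluation. The following sketch is illustrative only: the `Policy` class, its field names, the user and administrator identifiers, and the message format are hypothetical, and no e-mail is actually sent; the function merely returns the warning text that would go to the administrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    label: str                        # attribute (label) the policy applies to
    related_parties: frozenset        # users allowed to perform the operations
    restricted_operations: frozenset  # operations that count as a violation for others
    administrator: str                # recipient of the warning (hypothetical identifier)


def check_operation(policy, label, user, operation):
    """Return a warning message if the operation violates the policy, otherwise None."""
    if label != policy.label:
        return None
    if operation not in policy.restricted_operations:
        return None
    if user in policy.related_parties:
        return None
    return (
        f"To {policy.administrator}: {user} (not a related party) "
        f"performed '{operation}' on a document labeled '{label}'"
    )


product_a_policy = Policy(
    label="product A",
    related_parties=frozenset({"Y"}),
    restricted_operations=frozenset({"copy", "scan", "print", "fax"}),
    administrator="product A administrator",
)

warning = check_operation(product_a_policy, "product A", "X", "copy")
no_warning = check_operation(product_a_policy, "product A", "Y", "copy")
```

In this example `warning` holds a message for the product A administrator because X is not a related party, while `no_warning` is `None` because Y is.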

As described above, according to the second functional configuration, when a document with no associated attribute is printed or otherwise processed on a copier or the like, a similarity analysis is performed between the document data (information) of that document and the document features of the attributes stored in advance. As a result, the attribute of a stored document that is highly similar to the unattributed document is estimated to be that document's attribute, and the policy defined for that attribute can be applied. When the attribute is estimated, setting the threshold according to the importance of each stored attribute makes it possible to adjust the similarity extraction range attribute by attribute. Document attributes can thus be estimated in line with attribute importance that varies with circumstances. For example, the similarity threshold of an attribute whose importance has decreased is set high to narrow the extraction range, and the similarity threshold of an attribute of high importance is set low to widen it. The attribute of a document with no associated attribute can then be estimated appropriately, and an action suited to the operation can be taken based on the policy of the attribute applied to that document.

Preferred embodiments of the present invention have been described in detail above. The present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention set forth in the claims.

FIG. 1 is a diagram showing a schematic configuration example of the first system.
FIG. 2 is a diagram showing a hardware configuration example of the electronic document attribute detection estimation apparatus.
FIG. 3 is a diagram showing a schematic configuration example of the second system.
FIG. 4 is a diagram showing a hardware configuration example of the document attribute learning/estimating multifunction peripheral.
FIG. 5 is a diagram for explaining the function of generating learning data in the first functional configuration.
FIG. 6 is a diagram for explaining the module configuration of the document attribute program in the first functional configuration.
FIG. 7 is a flowchart for explaining the document attribute estimation process in the first functional configuration.
FIG. 8 is a diagram showing the distribution of correct and incorrect results against the similarity threshold.
FIG. 9 is a diagram showing an example of the threshold list.
FIG. 10 is a diagram showing an example of the fine adjustment level.
FIG. 11 is a flowchart for explaining the similar document search process in the first functional configuration.
FIG. 12 is a flowchart for explaining the similarity collection process in the first functional configuration.
FIG. 13 is a diagram showing an example of the analysis result X, the intermediate result Y', and the final result Y.
FIG. 14 is a diagram showing an example of a policy.
FIG. 15 is a diagram showing an example of a policy written in XML.
FIG. 16 is a diagram showing an example of the policy setting screen and the like.
FIG. 17 is a diagram for explaining the flow of the processing outline in the first functional configuration.
FIG. 18 is a diagram for explaining the function of generating learning data in the second functional configuration.
FIG. 19 is a diagram for explaining the module configuration of the document attribute estimation program in the second functional configuration.
FIG. 20 is a flowchart for explaining the document attribute estimation process in the second functional configuration.
FIG. 21 is a flowchart for explaining the similar document search process in the second functional configuration.
FIG. 22 is a diagram showing an example of the analysis result Y''.
FIG. 23 is a flowchart for explaining the similar result collection process in the second functional configuration.
FIG. 24 is a diagram showing an example of the final result Y.
FIG. 25 is a diagram for explaining the flow of the processing outline in the second functional configuration.
FIG. 26 is a diagram showing an example of a policy.

Explanation of symbols

10 Electronic document attribute detection estimation apparatus
11 Document attribute learning program
12 Attribute feature DB
13 Document attribute estimation program
14 Policy
16 Content analysis engine
20 Administrator terminal
30 Multifunction peripheral
40 File server
51 Input device
52 Output device
53 Drive device
54 Auxiliary storage device
55 Memory device
56 Arithmetic processing device
57 Interface device
58 Storage device
60 Document attribute learning/estimating multifunction peripheral
61 Copy/scan/facsimile application
62 File server
70 Attribute determination request accepting unit
71 Attribute estimation processing unit
72 Policy determination unit
73 Policy setting unit
78 Threshold adjustment list
80 Threshold list
90 Fine adjustment list
100 Policy setting screen
B Bus

Claims

An electronic document attribute detection estimation method for detecting and estimating, from an electronic document database stored in advance, an attribute corresponding to a document with which no attribute is associated, the method comprising:
a search procedure for searching the electronic documents stored in the database for an electronic document having a high similarity to the document;
a comparison procedure for comparing, for each attribute of the electronic document found by the search procedure, the similarity of the document to that attribute with a threshold corresponding to the importance adjusted for that attribute; and
an estimation procedure for estimating the attribute of the electronic document as an attribute of the document when the similarity for each attribute of the found electronic document exceeds the threshold.

The electronic document attribute detection estimation method according to claim 1, wherein the comparison procedure acquires, for each attribute of the electronic document, the importance from an adjustment list in which the attribute is associated with the adjusted importance, acquires the threshold corresponding to that importance from a threshold list in which the importance is associated with the threshold, and compares the threshold with the similarity.

3. The electronic document attribute detection estimation method according to claim 2, wherein the adjustment list is updated in accordance with an adjustment of importance of the attribute of the electronic document by a user.

4. The electronic document attribute detection estimation method according to any one of claims 1 to 3, wherein, when a document whose similarity exceeds the threshold is output, the estimation procedure outputs, to an administrator, information indicating that the document has been output, based on a policy set in advance for each attribute.
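
As an illustration only, and not part of the claims, the policy-based notification of claim 4 might look like the following Python sketch. The `POLICY` table, the function `notifications_for_output`, the attribute names, and the message format are hypothetical; the claim states only that information is output to an administrator based on a policy set in advance for each attribute.

```python
# Hypothetical per-attribute policy: whether outputting a document estimated to
# carry this attribute is reported to the administrator (assumed values).
POLICY = {"confidential": True, "internal": False}


def notifications_for_output(document_name, estimated_attributes):
    """Build administrator messages for a document that has just been output."""
    messages = []
    for attribute in estimated_attributes:
        # Attributes without a policy entry are treated as "do not notify".
        if POLICY.get(attribute, False):
            messages.append(
                f"Document '{document_name}' with estimated attribute "
                f"'{attribute}' was output."
            )
    return messages


for message in notifications_for_output("scan_001.pdf", ["confidential", "internal"]):
    print(message)
```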

5. The electronic document attribute detection estimation method according to claim 2, wherein the adjustment list is a list in which the importance acquired when learning the attributes of the electronic documents stored in the database is associated with each attribute.

6. An electronic document attribute detection estimation apparatus for detecting and estimating, from an electronic document database accumulated in advance, an attribute corresponding to a document with which no attribute is associated, the apparatus comprising:
search means for searching the electronic documents accumulated in the database for an electronic document having a high similarity to the document;
comparison means for comparing, for each attribute of the electronic document retrieved by the search means, the similarity of the document for that attribute with a threshold corresponding to an importance adjusted for that attribute; and
estimation means for estimating, when the similarity for each attribute of the retrieved electronic document exceeds the threshold, the attribute of the electronic document as an attribute of the document.

7. The electronic document attribute detection estimation apparatus according to claim 6, wherein the comparison means, for each attribute of the electronic document, acquires the importance from an adjustment list in which the attribute is associated with the adjusted importance, acquires the threshold from a threshold list in which the importance is associated with the threshold, and compares the acquired threshold with the similarity.

8. The electronic document attribute detection estimation apparatus according to claim 7, wherein the adjustment list is updated in accordance with an adjustment of importance of the attribute of the electronic document by a user.

9. The electronic document attribute detection estimation apparatus according to any one of claims 6 to 8, wherein, when a document whose similarity exceeds the threshold is output, the estimation means outputs, to an administrator, information indicating that the document has been output, based on a policy set in advance for each attribute.

10. The electronic document attribute detection estimation apparatus according to claim 7, wherein the adjustment list is a list in which the importance acquired when learning the attributes of the electronic documents stored in the database is associated with each attribute.

11. An electronic document attribute detection estimation program for detecting and estimating, from an electronic document database accumulated in advance, an attribute corresponding to a document with which no attribute is associated, the program causing a computer to function as:
search means for searching the electronic documents accumulated in the database for an electronic document having a high similarity to the document;
comparison means for comparing, for each attribute of the electronic document retrieved by the search means, the similarity of the document for that attribute with a threshold corresponding to an importance adjusted for that attribute; and
estimation means for estimating, when the similarity for each attribute of the retrieved electronic document exceeds the threshold, the attribute of the electronic document as an attribute of the document.

12. The electronic document attribute detection estimation program according to claim 11, wherein the comparison means, for each attribute of the electronic document, acquires the importance from an adjustment list in which the attribute is associated with the adjusted importance, acquires the threshold from a threshold list in which the importance is associated with the threshold, and compares the acquired threshold with the similarity.

13. A computer-readable storage medium storing an electronic document attribute detection estimation program for detecting and estimating, from an electronic document database accumulated in advance, an attribute corresponding to a document with which no attribute is associated, the program causing a computer to function as:
search means for searching the electronic documents accumulated in the database for an electronic document having a high similarity to the document;
comparison means for comparing, for each attribute of the electronic document retrieved by the search means, the similarity of the document for that attribute with a threshold corresponding to an importance adjusted for that attribute; and
estimation means for estimating, when the similarity for each attribute of the retrieved electronic document exceeds the threshold, the attribute of the electronic document as an attribute of the document.

14. The computer-readable storage medium according to claim 13, wherein the comparison means, for each attribute of the electronic document, acquires the importance from an adjustment list in which the attribute is associated with the adjusted importance, acquires the threshold from a threshold list in which the importance is associated with the threshold, and compares the acquired threshold with the similarity.