JP5084341B2

JP5084341B2 - Document analysis processing apparatus, image processing apparatus, document analysis processing program, document analysis processing method

Info

Publication number: JP5084341B2
Application number: JP2007117193A
Authority: JP
Inventors: 加寿代橋本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2007-04-26
Filing date: 2007-04-26
Publication date: 2012-11-28
Anticipated expiration: 2027-04-26
Also published as: JP2008276386A

Description

The present invention relates to a document analysis processing apparatus, an image processing apparatus, a document analysis processing program, and a document analysis processing method, and in particular to ones that analyze the attributes of a document and perform processing according to those attributes.

With recent advances in natural language processing technology and improvements in computer processing power, conventional document analysis processing apparatuses have become able to extract documents with similar content from a large volume of stored documents and to classify them based on their degree of similarity.

The following method is known for determining whether documents are similar. First, the conventional determination method decomposes a target document into elements whose units are character strings, words, or phrases, and calculates a feature quantity based on the combination of those elements. It then obtains the similarity of the feature quantities for every combination of documents and regards two documents as similar if the similarity is at or above a certain level.

Various schemes have been devised for calculating the feature quantity. For example, in a conventional calculation method, the target document is decomposed into elements whose units are character strings, words, or phrases, and the weight of each element is then obtained from the frequency of occurrence of that element in the document set and its frequency of occurrence in the target document. The feature quantity is expressed as a vector composed of the elements and their weights.

The similarity is calculated by, for example, taking the inner product of the vectors. In a conventional classification method based on similarity, the mean of the feature quantities (vectors) of a group of documents defined as belonging to the same class is calculated, and if the similarity between the feature quantity (vector) of the target document and that mean vector is at or above a certain level, the target document is judged to belong to that class. Patent Document 1 describes an example of such a technique for retrieving similar documents.
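The conventional weighting and mean-vector classification described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a weighting formula, so a smoothed TF-IDF weight is assumed here, cosine similarity stands in for the inner product, and all function names and sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not from the patent): tf * (ln((1+N)/(1+df)) + 1).
    """
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors

def cosine(u, v):
    """Inner product of u and v divided by the product of their norms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Average the feature vectors of documents defined as one class."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def classify(target, class_means, threshold):
    """Return every class whose mean vector is similar enough to the target."""
    return [label for label, mean in class_means.items()
            if cosine(target, mean) >= threshold]

# Hypothetical sample data: three training documents and one target.
docs = [["camera", "lens", "zoom"], ["camera", "lens", "flash"],
        ["printer", "toner", "paper"], ["camera", "zoom"]]
vecs = tfidf_vectors(docs)
class_means = {"A": mean_vector(vecs[:2]), "B": mean_vector(vecs[2:3])}
labels = classify(vecs[3], class_means, threshold=0.3)
```

The threshold of 0.3 is arbitrary; in practice it would be tuned for the search engine and data in use.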

Organizations such as companies are also required to prevent leakage of the trade secrets and personal information they handle. Patent Document 2 describes an example of a technique in which each organization sets out a security policy (hereinafter, policy) and either controls access so that only authorized persons can access confidential information, or encrypts confidential information so that only authorized persons can view it.

Thus, when a document is accessed, a conventional document analysis processing apparatus could use a similar-document retrieval technique such as the above to estimate the attributes of the document from its content and monitor whether the access to the document violates the policy.
JP 2000-148770 A
JP 2006-185153 A

However, a conventional document analysis processing apparatus that monitors access to documents according to a policy could inconvenience users in operation, as described below. A policy is typically operated on a principle basis, for example: "... is prohibited in principle; where it unavoidably must be done, it is done after obtaining the permission of the responsible manager." A conventional apparatus that can apply nothing but the principle has therefore sometimes earned a reputation for being inconvenient or inflexible, for example in the following cases.

The first case is one in which the estimate produced by similarity search and classification is logically correct, but for operational reasons the estimated attribute should not be applied to the document. A conventional document analysis processing apparatus that can apply only the principle has no flexibility for this case.

For example, suppose there is a scheme that classifies and manages documents by development-product category: "Product A, Product B, Product C". For Product A, as a measure against leaks to competitors, it is stipulated that only those involved in its development may view the documents and that the documents are handled as "top secret" with respect to everyone else.

When a brochure for Product A is classified, a conventional apparatus that can apply only the principle judges it to be "Product A". However, a brochure is meant to be shown to many people, so it should not be handled as "top secret". Likewise, a draft functional specification for Product A is "top secret" to everyone other than those involved in its development, but if it contains descriptions that the Product B team would also use as a reference, the Product B team should be allowed to view it as well.

In such cases, when a similar document is classified from then on, a result different from the previous determination is required. As one way of resolving this, in conventional access-controlled document analysis processing apparatuses that use levels such as "top secret", "secret", and "internal use only", an ID is assigned to each document regardless of its content, and permits and the like have been devised as a function for handling exceptions. However, a conventional apparatus that uses permits relies on the ID assigned to the document and cannot be applied to documents to which no ID has been assigned.

The second case is one in which a feedback and relearning function of the kind found in e-mail spam filters is used to change the result of similarity search and classification. A conventional document analysis processing apparatus using such a function retrains the learning database itself, which changes the feature quantities in the learning database. Teaching the database that a correct answer is not correct leads to a loss of accuracy in the learning database.

The present invention has been made in view of the above points, and its object is to provide a document analysis processing apparatus, an image processing apparatus, a document analysis processing program, and a document analysis processing method that are flexible and can also prevent loss of accuracy.

To solve the above problem, the present invention is a document analysis processing apparatus that analyzes the attributes of a document and performs processing according to those attributes, comprising: exception attribute analysis means for analyzing an exception attribute of the document based on exception attribute storage means in which features of documents to be treated as exceptions are associated with exception attributes to be treated as exceptions; estimated attribute analysis means for analyzing an estimated attribute of the document based on estimated attribute storage means in which features of documents not treated as exceptions are associated with estimated attributes not treated as exceptions; document attribute analysis means for analyzing the attributes of the document based on the result of removing the exception attribute of the document analyzed by the exception attribute analysis means from the estimated attribute of the document analyzed by the estimated attribute analysis means; receiving means for receiving the document and action information for the document; and policy determination means for performing a policy determination based on policy storage means in which an attribute of a document and action information for the document are associated with processing corresponding to that attribute and action information, and for performing, based on the result of the policy determination, the processing corresponding to the attributes of the document and the action information for the document, wherein, in the exception attribute storage means, an ancillary condition for exception treatment is associated with the features of the document to be treated as an exception and with the exception attribute to be treated as an exception.

Applications of the constituent elements or expressions of the present invention, or any combination of the constituent elements, to a method, an apparatus, a system, a computer program, a recording medium, a data structure, and the like are also valid as aspects of the present invention.

According to the present invention, it is possible to provide a document analysis processing apparatus, an image processing apparatus, a document analysis processing program, and a document analysis processing method that are flexible and can also prevent loss of accuracy.

Next, the best mode for carrying out the present invention is described with reference to the drawings, based on the following embodiment.

FIG. 1 is a configuration diagram of one embodiment of a system according to the present invention. The system of FIG. 1 comprises a document attribute learning/analysis server 1, a multifunction peripheral 2, a document learning cooperation program 3, and an exception learning cooperation program 4. The document attribute learning/analysis server 1 is an example of a document analysis processing apparatus. The multifunction peripheral 2 is an example of an image processing apparatus.

The document attribute learning/analysis server 1 of FIG. 1 comprises a document attribute analysis program 11, a document attribute learning program 12, an exception learning program 13, an analysis result DB 14, a policy DB 15, an attribute feature base DB 16, and an exception information DB 17.

Document attribute learning is performed by the document learning cooperation program 3, which is the learning client, and the document attribute learning program 12, which is the learning server. The document attribute learning program 12 registers the result of document attribute learning in the attribute feature base DB 16. Exception learning is performed by the exception learning cooperation program 4, which is the learning client, and the exception learning program 13, which is the learning server. The exception learning program 13 registers the result of exception learning in the exception information DB 17.

The document attribute analysis program 11 receives an analysis target document 5 from the multifunction peripheral 2 and analyzes the attributes of that document as described later. It then performs a policy determination using the policy DB 15, also as described later. On detecting a policy violation, the document attribute analysis program 11 issues a warning, for example to an administrator. Finally, it registers the result in the analysis result DB 14.

The document attribute learning/analysis server 1 is realized by, for example, a hardware configuration such as that shown in FIG. 2. FIG. 2 is a hardware configuration diagram of one embodiment of the document attribute learning/analysis server.

The document attribute learning/analysis server 1 comprises an input device 21, an output device 22, a drive device 23, an auxiliary storage device 24, a main storage device 25, an arithmetic processing device 26, and an interface device 27, which are interconnected by a bus B.

The input device 21 consists of a keyboard, a mouse, and the like, and is used to input various signals. The output device 22 consists of a display device and the like, and is used to display various windows, data, and so on. The interface device 27 consists of a modem, a LAN card, or the like, and is used to connect to a network such as the Internet or a LAN.

The document analysis processing program according to the present invention is at least part of the various programs that control the document attribute learning/analysis server 1. It is provided, for example, by distribution of a recording medium 28 or by download from a network. Various types of recording medium can be used as the recording medium 28 on which the document analysis processing program is recorded: media that record information optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disk, and semiconductor memories that record information electrically, such as a ROM or a flash memory.

When the recording medium 28 on which the document analysis processing program is recorded is set in the drive device 23, the program is installed from the recording medium 28 into the auxiliary storage device 24 via the drive device 23. A document analysis processing program downloaded from a network is installed into the auxiliary storage device 24 via the interface device 27.

The auxiliary storage device 24 stores the installed document analysis processing program together with the necessary files, data, and so on. At startup, the main storage device 25 reads the document analysis processing program from the auxiliary storage device 24 and holds it. The arithmetic processing device 26 then carries out the various processes described later in accordance with the document analysis processing program held in the main storage device 25.

The document analysis processing program according to the present invention comprises the document attribute analysis program 11, the document attribute learning program 12, and the exception learning program 13. The document attribute learning process is realized by the document learning cooperation program 3 and the document attribute learning program 12.

The document attribute learning program 12 accepts a document attribute learning request from the document learning cooperation program 3 and, based on the accepted document, performs registration processing that registers the result of document attribute learning in the attribute feature base DB 16. The document learning cooperation program 3 and the document attribute learning program 12 can be realized with existing technology, for example a technique that monitors a predetermined folder and, when a confidential document is saved in the folder, performs document attribute learning on that document and registers the result.

The exception learning process is realized by the exception learning cooperation program 4 and the exception learning program 13. FIG. 3 is a flowchart showing the procedure of the exception learning process. In step S1, the exception learning cooperation program 4 receives from a user, such as an administrator, a document that is to be specially treated as an exception and the attribute of that document that is to be treated as an exception (the exception attribute).

FIG. 4 is an image of an example exception learning screen on which the document to be specially treated as an exception and its attribute are entered. On the exception learning screen 40 of FIG. 4, the document to be treated as an exception is entered as the "target file" and its attribute is entered as the "attribute". The "target file" can also be entered through a file management screen displayed by pressing the "Browse" button 41.

Proceeding to step S2, the exception learning cooperation program 4 sends an exception information learning request to the exception learning program 13, based on the document and the attribute entered by the user. The exception information learning request contains the document to be treated as an exception and its attribute, as the learning document and the exception attribute respectively.

In addition, if the user has entered a context (ancillary condition) required for applying the exception attribute, the exception information learning request also contains that context. Examples of a context are one that specifies the person to whom the exception attribute applies (for example, copying is permitted as an exception only when Mr. XXX makes the copy) and one that specifies the place where the exception attribute applies (for example, copying is permitted inside room xxx).

Proceeding to step S3, the exception learning program 13 accepts the exception information learning request from the exception learning cooperation program 4. Proceeding to step S4, the exception learning program 13 performs exception information registration processing based on the accepted request.

In the exception information registration processing, the text information on which full-text search is based is extracted from the learning document contained in the accepted exception information learning request. If the learning document is an image such as a scanned document, the text information is extracted by OCR processing.

Proceeding to step S5, the exception learning program 13 calculates a feature quantity for full-text search from the text information, links it to the specified exception attribute contained in the accepted exception information learning request, and registers both in the exception information DB 17.

The exception learning process shown in the flowchart of FIG. 3 can be realized using existing technology. For example, calculating a feature quantity for full-text search from text information can be realized by applying conventional techniques. The feature quantity for full-text search can be expressed as an n-dimensional vector of the occurrence frequencies or weights of the elements obtained by decomposing the text information into elements that are combinations of character strings, words, or phrases.

When the feature quantity for full-text search is expressed as an n-dimensional vector, the similarity between documents can be calculated as follows. A method that calculates it from the inner product or the cosine between n-dimensional vectors, as described in JP 2000-148770 A for example, can be used. If the similarity exceeds a threshold, the two documents are judged to be similar.

FIG. 5 is an image of a record registered in the exception information DB. The record of FIG. 5 contains the feature data of a document treated as an exception (exception document), the attribute of the exception document (exception attribute), context 1, which specifies the person (user) to whom the exception attribute applies, and context 2, which specifies the place where the exception attribute applies. The feature data of the exception document is the feature quantity for full-text search calculated from the text information. The exception attribute and contexts 1 and 2 are those contained in the accepted exception information learning request.
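A record of this shape might be modeled as below. This is a minimal sketch, not the layout used in the patent: the class name, field names, and the rule that a missing context means "no restriction" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ExceptionRecord:
    """One row of the exception information DB (hypothetical field names)."""
    feature: Dict[str, float]               # feature data of the exception document
    exception_attribute: str                # attribute to be treated as an exception
    context_user: Optional[str] = None      # context 1: who the exception applies to
    context_location: Optional[str] = None  # context 2: where the exception applies

    def applies_to(self, user: str, location: str) -> bool:
        """Check the ancillary conditions.

        Assumption: a context left as None places no restriction.
        """
        user_ok = self.context_user is None or self.context_user == user
        place_ok = (self.context_location is None
                    or self.context_location == location)
        return user_ok and place_ok

# Hypothetical record: only user "XXX" in room "xxx" gets the exception.
record = ExceptionRecord(
    feature={"brochure": 1.0, "zoom": 0.5},
    exception_attribute="Category A",
    context_user="XXX",
    context_location="xxx",
)
allowed = record.applies_to(user="XXX", location="xxx")
denied = record.applies_to(user="YYY", location="xxx")
```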

The document attribute analysis process is realized by the document attribute analysis program 11, the analysis result DB 14, the policy DB 15, the attribute feature base DB 16, and the exception information DB 17. FIG. 6 is a configuration diagram of one embodiment of the document attribute analysis program. The document attribute analysis program 11 of FIG. 6 comprises a document analysis request receiving unit 61, a document analysis determination processing unit 62, an attribute feature base determination unit 63, an exception information determination unit 64, and a policy determination unit 65.

FIG. 7 is a flowchart showing the procedure of the document attribute analysis process. In step S11, the document analysis request receiving unit 61 receives the analysis target document 5 and action information (for example, who did what) over the network from an external source such as the multifunction peripheral 2, and sends the analysis target document 5 and the action information to the document analysis determination processing unit 62.

Proceeding to step S12, the document analysis determination processing unit 62 requests an exception determination from the exception information determination unit 64. On receiving the request, the exception information determination unit 64 performs a similar-document search that retrieves, from the feature data of the exception documents registered in the exception information DB 17, exception documents whose feature data is nearly identical to that of the analysis target document 5. It then extracts from the exception information DB 17 the exception attributes linked to the retrieved exception documents (result 1).

Because the exception information determination unit 64 searches for exception documents whose feature data is nearly identical to that of the analysis target document 5, the similarity threshold is set higher than usual, as shown in FIG. 8. FIG. 8 is a graph showing an example of the similarity threshold used to search for nearly identical exception documents.

The similarity values and their distribution in such a threshold graph differ depending on the document search engine used. Recommended threshold values for similar-document search are therefore determined in advance by experiment. It is desirable to determine the threshold at the evaluation stage, using sample data (learning documents) close to the intended purpose.

The exception determination aims to extract similar documents that are nearly identical. It therefore uses a similarity normalized so that querying with the same document as a learned one (the query document) yields 100%.

However, when the learning document or the query document is a scanned image obtained from the multifunction peripheral 2, it is difficult to obtain exactly the same image and OCR result every time, so querying with the same paper document does not yield 100%. For similar-document search in the exception determination, the similarity threshold is therefore set to a high value (2) that allows for some variation in the results. Value (1) represents the normal threshold, chosen to balance the error of treating a correct answer as wrong against the error of treating a wrong answer as correct.
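The near-identical lookup with a raised threshold can be sketched as follows. Everything specific here is an assumption for illustration: character bigram counts stand in for the feature quantity, cosine similarity (which is 1.0 for identical texts) stands in for the normalized similarity, and the two threshold constants are made-up values, not the values (1) and (2) of FIG. 8.

```python
import math
from collections import Counter

NORMAL_THRESHOLD = 0.60     # stands in for value (1); hypothetical number
EXCEPTION_THRESHOLD = 0.90  # stands in for value (2); hypothetical number

def bigram_vector(text):
    """Count character bigrams; a simple stand-in for the feature quantity."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def similarity(u, v):
    """Cosine similarity: 1.0 for identical vectors, 0.0 for disjoint ones."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def find_exception_attributes(target_text, exception_db,
                              threshold=EXCEPTION_THRESHOLD):
    """Return (exception attribute, score) for near-identical exception docs."""
    target = bigram_vector(target_text)
    hits = []
    for feature, attribute in exception_db:
        score = similarity(target, feature)
        if score >= threshold:
            hits.append((attribute, score))
    return hits

# Hypothetical exception DB with one learned brochure.
learned = "Product A brochure: compact camera with 10x zoom"
exception_db = [(bigram_vector(learned), "Category A")]

# A rescan whose OCR misread "m" as "rn" still clears the raised threshold.
rescanned = "Product A brochure: compact carnera with 10x zoom"
result_1 = find_exception_attributes(rescanned, exception_db)
unrelated = find_exception_attributes("Minutes of the weekly sales meeting",
                                      exception_db)
```

Setting the threshold slightly below 1.0 is what lets a rescan with minor OCR differences match, while documents that are merely on the same topic are left to the normal attribute estimation.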

Proceeding to step S13, the document analysis determination processing unit 62 requests an attribute estimation from the attribute feature base determination unit 63. On receiving the request, the attribute feature base determination unit 63 extracts the estimated attributes of the analysis target document 5 by ordinary content analysis (similar-document search and document classification), based on the document attribute learning results registered in the attribute feature base DB 16 (result 2).

Proceeding to step S14, the document analysis determination processing unit 62 makes an overall determination for the analysis target document 5 based on the exception attributes of result 1 and the estimated attributes of result 2. FIG. 9 is a structural image showing the exception attributes of result 1, the estimated attributes of result 2, and the overall result. As shown in FIG. 9, result 1 consists of a plurality of exception attributes, each with a reliability, and result 2 consists of a plurality of estimated attributes, each with a reliability. The overall result for the analysis target document 5 is the attribute list of result 2 (estimated attributes) with the attributes in the attribute list of result 1 (exception attributes) removed.

FIG. 10 is a processing diagram showing the exception attributes of result 1, the estimated attributes of result 2, and the overall result. In the example of FIG. 10, the exception attribute "Category A" of result 1 is removed from the estimated attributes "Category A, Category B" of result 2, giving the overall result "Category B".
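The overall determination in step S14 amounts to removing the exception attributes from the estimated attributes. The following is a minimal sketch that assumes each result is held as a mapping from attribute name to reliability; the function name `overall_result` and the reliability values are hypothetical.

```python
from typing import Dict


def overall_result(
    estimated: Dict[str, float], exceptions: Dict[str, float]
) -> Dict[str, float]:
    """Keep every estimated attribute (result 2) that is not listed
    as an exception attribute (result 1)."""
    return {
        attribute: reliability
        for attribute, reliability in estimated.items()
        if attribute not in exceptions
    }


# The FIG. 10 example, with made-up reliabilities.
result_1 = {"Category A": 0.95}                     # exception attributes
result_2 = {"Category A": 0.80, "Category B": 0.70}  # estimated attributes
total = overall_result(result_2, result_1)
```

With these inputs only "Category B" remains, as in the FIG. 10 example.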

Once the overall result has been obtained, the policy determination unit 65 proceeds to step S15 and performs policy determination using the policy DB 15, based on the overall result obtained in step S14 and the action information received in step S11. A policy such as the one shown in FIG. 11 is set in the policy DB 15.

FIG. 11 is a configuration diagram of an example of a policy set in the policy DB. The policy in FIG. 11 states: "When a CATEGORY_A document is scanned, send a warning mail to the administrator." In the policy DB 15, the document attribute "Category A" and the action information "scan" for the document are associated with the processing "send a warning mail to the administrator".

If a policy violation is detected as a result of the policy determination in step S15, the policy determination unit 65 proceeds from step S16 to step S17, performs the obligation processing (for example, sending a warning mail or writing a warning log) according to the policy set in the policy DB 15, and then proceeds to step S18.

If no policy violation is detected as a result of the policy determination in step S15, the policy determination unit 65 proceeds from step S16 directly to step S18. In step S18, the document analysis determination processing unit 62 registers the overall result in the analysis result DB 14.
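Steps S15 to S18 can be sketched as a lookup followed by optional obligation processing and registration. This is only an illustration: the `Policy` record, the in-memory lists standing in for the policy DB 15 and the analysis result DB 14, and the `notify` callback are assumptions, not structures described in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Policy:
    attribute: str   # document attribute, e.g. "CATEGORY_A"
    action: str      # action information, e.g. "scan"
    obligation: str  # processing to perform on a match


# Stand-in for the policy DB 15, holding the FIG. 11 example.
POLICY_DB = [Policy("CATEGORY_A", "scan", "send a warning mail to the administrator")]


def determine_policy(attributes: Set[str], action: str) -> List[str]:
    """S15: return the obligations of every policy matching the
    document's attributes and the requested action."""
    return [
        p.obligation
        for p in POLICY_DB
        if p.action == action and p.attribute in attributes
    ]


def judge_and_register(
    attributes: Set[str],
    action: str,
    analysis_result_db: List[List[str]],
    notify: Callable[[str], None],
) -> List[str]:
    obligations = determine_policy(attributes, action)  # S15
    for obligation in obligations:                      # S16 -> S17 on violation
        notify(obligation)
    analysis_result_db.append(sorted(attributes))       # S18, always executed
    return obligations


sent: List[str] = []
results: List[List[str]] = []
first = judge_and_register({"CATEGORY_A", "CATEGORY_B"}, "scan", results, sent.append)
second = judge_and_register({"CATEGORY_B"}, "scan", results, sent.append)
```

The first call triggers the warning and the second does not, but both register their overall result, matching the two branches out of step S16.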

The system according to the present invention is not limited to the configuration of FIG. 1 and may instead have the configuration shown in FIG. 12 or FIG. 13. FIG. 12 is a configuration diagram of another embodiment of the system according to the present invention. The system of FIG. 12 is not a server-client cooperative system; instead, the various functions are built into the multifunction peripheral 2.

In the system of FIG. 12, learning likewise does not rely on cooperation with the document learning cooperation program 3 and the exception learning cooperation program 4; instead, the multifunction peripheral 2 monitors the learning-source file server 121 and imports documents from it.

The multifunction peripheral 2 includes the document attribute analysis program 11, the document attribute learning program 12, the exception learning program 13, the analysis result DB 14, the policy DB 15, the attribute feature base DB 16, the exception information DB 17, and applications 120 such as copy, scanner, and facsimile.

The document attribute analysis program 11 acquires the analysis target document 5 from an application 120 of its own device and analyzes the attributes of the document as described above. The document attribute analysis program 11 then performs policy determination using the policy DB 15 as described above. When a policy violation is detected, the document attribute analysis program 11 warns the administrator, for example via the operation panel. Finally, the document attribute analysis program 11 registers the result in the analysis result DB 14.

FIG. 13 is a configuration diagram of another embodiment of the system according to the present invention. Like the system of FIG. 12, the system of FIG. 13 builds the various functions into the multifunction peripheral 2, but it performs learning in cooperation with the document learning cooperation program 3.

By providing the exception information DB 17 separately from and independently of the attribute feature base DB 16, and by learning exception information independently, the system according to the present invention can increase flexibility without lowering the accuracy of the attribute feature base DB 16.

In addition, when exception documents are identified by similar-document search, the similarity threshold is set higher than usual to reduce ambiguity, so that the search returns only nearly identical exception documents. Because nearly identical exception documents can be retrieved in this way, the content of a document itself serves as the element that identifies the document in the system according to the present invention.

The present invention is not limited to the specifically disclosed embodiments, and various modifications and changes can be made without departing from the scope of the claims.

FIG. 1 is a configuration diagram of an embodiment of the system according to the present invention.
FIG. 2 is a hardware configuration diagram of an embodiment of the document attribute learning/analysis server.
FIG. 3 is a flowchart showing the procedure of the exception learning processing.
FIG. 4 is a diagram of an example of the exception learning screen, on which a document to be specially handled as an exception and the attributes of that document to be handled as exceptions are entered.
FIG. 5 is a diagram of a record registered in the exception information DB.
FIG. 6 is a configuration diagram of an embodiment of the document attribute analysis program.
FIG. 7 is a flowchart showing the procedure of the document attribute analysis processing.
FIG. 8 is a graph showing an example of the similarity threshold used to search for nearly identical exception documents.
FIG. 9 is a structural diagram showing the exception attributes of result 1, the estimated attributes of result 2, and the overall result.
FIG. 10 is a processing diagram showing the exception attributes of result 1, the estimated attributes of result 2, and the overall result.
FIG. 11 is a configuration diagram of an example of a policy set in the policy DB.
FIG. 12 is a configuration diagram of another embodiment of the system according to the present invention.
FIG. 13 is a configuration diagram of another embodiment of the system according to the present invention.

Explanation of symbols

1 Document attribute learning/analysis server
2 Multifunction peripheral
3 Document learning cooperation program
4 Exception learning cooperation program
5 Analysis target document
11 Document attribute analysis program
12 Document attribute learning program
13 Exception learning program
14 Analysis result DB
15 Policy DB
16 Attribute feature base DB
17 Exception information DB
21 Input device
22 Output device
23 Drive device
24 Auxiliary storage device
25 Main storage device
26 Arithmetic processing device
27 Interface device
61 Document analysis request reception unit
62 Document analysis determination processing unit
63 Attribute feature base determination unit
64 Exception information determination unit
65 Policy determination unit
120 Application
121 File server

Claims

A document analysis processing device that analyzes an attribute of a document and performs processing according to the attribute, the device comprising:
exception attribute analyzing means for analyzing an exception attribute of the document based on exception attribute storage means in which a feature of a document to be handled as an exception is associated with an exception attribute to be handled as an exception;
estimated attribute analyzing means for analyzing an estimated attribute of the document based on estimated attribute storage means in which a feature of a document not handled as an exception is associated with an estimated attribute not handled as an exception;
document attribute analyzing means for analyzing the attribute of the document based on a result obtained by removing the exception attribute of the document analyzed by the exception attribute analyzing means from the estimated attribute of the document analyzed by the estimated attribute analyzing means;
receiving means for receiving the document and action information for the document; and
policy determining means for performing policy determination based on policy storage means in which the attribute of the document and the action information for the document are associated with processing corresponding to the attribute of the document and the action information for the document, and for performing, based on the result of the policy determination, the processing corresponding to the attribute of the document and the action information for the document,
wherein the exception attribute storage means associates an incidental condition for handling as an exception with the feature of the document to be handled as an exception and the exception attribute to be handled as an exception.

The document analysis processing device according to claim 1, wherein the document is a document that has been processed by an image processing apparatus.

An image processing apparatus that has at least one of a plotter unit and a scanner unit, analyzes an attribute of a document, and performs processing according to the attribute, the apparatus comprising:
exception attribute analyzing means for analyzing an exception attribute of the document based on exception attribute storage means in which a feature of a document to be handled as an exception is associated with an exception attribute to be handled as an exception;
estimated attribute analyzing means for analyzing an estimated attribute of the document based on estimated attribute storage means in which a feature of a document not handled as an exception is associated with an estimated attribute not handled as an exception;
document attribute analyzing means for analyzing the attribute of the document based on a result obtained by removing the exception attribute of the document analyzed by the exception attribute analyzing means from the estimated attribute of the document analyzed by the estimated attribute analyzing means;
receiving means for receiving the document and action information for the document; and
policy determining means for performing policy determination based on policy storage means in which the attribute of the document and the action information for the document are associated with processing corresponding to the attribute of the document and the action information for the document, and for performing, based on the result of the policy determination, the processing corresponding to the attribute of the document and the action information for the document,
wherein the exception attribute storage means associates an incidental condition for handling as an exception with the feature of the document to be handled as an exception and the exception attribute to be handled as an exception.

A document analysis processing program that causes a document analysis processing device, which analyzes an attribute of a document and performs processing according to the attribute, to function as:
exception attribute analyzing means for analyzing an exception attribute of the document based on exception attribute storage means in which a feature of a document to be handled as an exception is associated with an exception attribute to be handled as an exception;
estimated attribute analyzing means for analyzing an estimated attribute of the document based on estimated attribute storage means in which a feature of a document not handled as an exception is associated with an estimated attribute not handled as an exception;
document attribute analyzing means for analyzing the attribute of the document based on a result obtained by removing the exception attribute of the document analyzed by the exception attribute analyzing means from the estimated attribute of the document analyzed by the estimated attribute analyzing means;
receiving means for receiving the document and action information for the document; and
policy determining means for performing policy determination based on policy storage means in which the attribute of the document and the action information for the document are associated with processing corresponding to the attribute of the document and the action information for the document, and for performing, based on the result of the policy determination, the processing corresponding to the attribute of the document and the action information for the document,
wherein, in the exception attribute storage means, an incidental condition for handling as an exception is associated with the feature of the document to be handled as an exception and the exception attribute to be handled as an exception.

A document analysis processing method in a document analysis processing device that analyzes an attribute of a document and performs processing according to the attribute, the method comprising:
an exception attribute analyzing step of analyzing an exception attribute of the document based on exception attribute storage means in which a feature of a document to be handled as an exception is associated with an exception attribute to be handled as an exception;
an estimated attribute analyzing step of analyzing an estimated attribute of the document based on estimated attribute storage means in which a feature of a document not handled as an exception is associated with an estimated attribute not handled as an exception;
a document attribute analyzing step of analyzing the attribute of the document based on a result obtained by removing the exception attribute of the document analyzed in the exception attribute analyzing step from the estimated attribute of the document analyzed in the estimated attribute analyzing step;
a receiving step of receiving the document and action information for the document; and
a policy determining step of performing policy determination based on policy storage means in which the attribute of the document and the action information for the document are associated with processing corresponding to the attribute of the document and the action information for the document, and of performing, based on the result of the policy determination, the processing corresponding to the attribute of the document and the action information for the document,
wherein the exception attribute storage means associates an incidental condition for handling as an exception with the feature of the document to be handled as an exception and the exception attribute to be handled as an exception.