JP2013137740A

JP2013137740A - Secret information identification method, information processor, and program

Info

Publication number: JP2013137740A
Application number: JP2012221514A
Authority: JP
Inventors: Sachiko Yoshihama; 佐知子吉▲濱▼
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2011-11-28
Filing date: 2012-10-03
Publication date: 2013-07-11
Also published as: DE102012220716A1; GB201220817D0; GB2497397A; CN103136189B; CN103136189A

Abstract

PROBLEM TO BE SOLVED: To identify confidential information included in a log accumulated by a server or the like. SOLUTION: An information processing apparatus comprises: a cluster unit 214 that reads messages from a log and clusters the messages according to their similarity; a variable part specifying unit 216 that specifies the variable parts that differ between messages; an attribute determination unit 218 that attempts to determine the confidentiality attribute of each variable part using preset rules; and an attribute estimation unit 220 that, when a part remains whose confidentiality attribute cannot be determined by the rules, either determines the attribute from the correspondence of appearance positions within the messages or estimates it from the co-occurrence relation between parts whose attribute has been determined and the part whose attribute could not be determined.

Description

The present invention relates to a technique for identifying confidential information, and more particularly to a technique for identifying confidential information included in a log accumulated by an information processing apparatus.

In recent years, many kinds of information have come to be shared over networks such as the Internet, intranets, and LANs, and information has become easier to use and access. To manage information on the Internet and provide it to users, a server that manages the content to be provided is used. The server accepts access from client devices connected via the network and performs processing such as providing requested content, registering users, and registering or changing personal information.

Servers connected to a network include, without particular limitation, mail servers that send and receive electronic mail using SMTP; web servers that implement CGI or the like to provide web services over HTTP; FTP servers; and database servers that manage various data and provide it in response to access requests. Each time such a server executes processing, it accumulates information about the accessing user, the authentication result, the data sent for processing, the execution result, and so on. The accumulated information differs by server type, but includes the IP address and domain name of the access source, the date and time of access, the name of the accessed file, the URL of the referring page, the visitor's web browser and OS names, the processing time, the numbers of bytes received and sent, and the service status code. Hereinafter, a file, database, or the like that is accumulated through the operation of an information processing apparatus such as a server and stores messages about that operation is referred to simply as a log.

As described above, a log generated by a server densely contains highly useful information. Through log analysis, it can be used to obtain the history of malicious attacks on the server, such as distributed DoS attacks, and the history of unauthorized access, and it can be applied to market analysis based on, for example, statistical analysis of access contents.

Logs can also be used, in connection with the unauthorized access to servers that has become frequent in recent years, to grasp more accurately how an attacker's activity on the network changes over time and across targets, by analyzing logs acquired by multiple organizations together. However, because a log contains basic network information and personal information as described above, disclosing it creates a risk of leakage, even to a trusted domain, when log analysis is outsourced to an external analyst or when the logs span multiple domains.

FIG. 10 shows an exemplary access log 1000 of a web server implemented using Apache 2.0 and a transaction log 1100 of an FTP server. In FIG. 10, network information, private information, and port information are replaced with asterisks ("*") for concealment. As FIG. 10 shows, a log contains core server information such as the server's fixed IP address, the port numbers in use, and the directory hierarchy, as well as highly confidential information such as private information (for example, user IDs) and passwords. However, in a log in which a wide variety of information may be recorded, where the highly confidential information appears differs depending on the content of the log.

For example, releasing the log shown in FIG. 10 as it is means taking the network information, server information, and personal information of a company or organization outside, which in itself creates a business risk. If the log leaks to a malicious attacker, the high-value information accumulated by the company may be destroyed or stolen through hacking, and the company may also become the target of a DoS attack or the like.

For companies and organizations that operate servers, therefore, providing logs as they are for external analysis yields useful information at the cost of high risks such as leakage of confidential information, leakage of privacy information, and information leakage through unauthorized access to the server. For this reason, even when the purpose is to analyze the server's access history and reflect the result in the server's functions, disclosing logs to a third party faces a barrier too high to be covered by a non-disclosure agreement, and this has hindered flexible log analysis. Moreover, even if the highly confidential information in a log can be identified, replacing all of it uniformly with asterisks or the like may lose the identity of the accessing party or of the accessed data. When concealing log information, it is therefore preferable that the attributes and identity of the original data remain recognizable.

Techniques for judging the confidentiality of logs are already known. For example, JP 2009-116680 A (Patent Document 1) aims to provide a technique that detects, simply and with high accuracy, the data type of data input to and output from a computer, such as whether it is confidential, and thereby contributes to proper data management. It describes input/output data reading means, data content acquisition means for acquiring character strings included in the input/output data, and feature extraction means for extracting the character strings, or predetermined character groups included in them, as features. It further provides data type determination means that determines the data type of the features by referring to a data type learning result, stored in an external storage device, obtained by machine learning on teacher data whose data types are known in advance, so that the data type is determined with high accuracy by machine learning.

The technique described in Patent Document 1 can also judge the confidentiality of information in a log. However, because it relies on teacher data, it cannot judge the confidentiality of information not included in the teacher data, leaving a risk that confidential information will leak. Techniques that detect confidential words using regular expressions or word lists are also limited and cannot be called sufficient: registering the regular expressions and word lists requires enormous effort, and words are inevitably missed. One could also define a complete schema for a log in advance and anonymize confidential information according to it, but given the diversity of logs, creating complete schemas for all of them is not realistic. Furthermore, however far word lists and schemas are extended, unusual names are endless, and logs may also record information entered by mistake, such as a mistyped user ID or password or a value entered in the wrong field; handling such logs is also required.

Patent Document 1: JP 2009-116680 A

The present invention has been made in view of the above problems of the prior art. Its object is to provide a confidential information identification method, an information processing apparatus, and a program that, by identifying the confidential information contained in a log, make it possible to extend the usability of the log without impairing its usefulness.

To solve the above problem, the present invention identifies whether each piece of individual information in a log is confidential. To judge the confidentiality of individual information, the messages of the log are clustered by message similarity, and the messages in each cluster are compared with one another so that the fixed parts and variable parts of the messages are identified from their differences. For each variable part, confidentiality is determined by referring to the words, strings, or code information defined as regular expressions registered in the determination rules. The position at which a character string determined to be confidential by those rules appears is then treated as a part to be kept confidential in the messages of that cluster. This judgment is propagated to the other messages in the cluster: in those messages, the character string at the position to be kept confidential is judged confidential even if the determination rules did not determine it to be so. Furthermore, by registering in the determination rules the character strings found at the confidential position in the other messages of the cluster, the same character string can also be determined to be confidential when it appears in other messages.

A region judged to be confidential is replaced with another representation in a form suited to the individual information. If the information were masked completely, the amount of information would decrease as shown in FIG. 10 and the usefulness of the log would drop markedly. The replacement can therefore use a semantically equivalent representation that has, as far as possible, the same type or meaning as the original information. Replacing with a representation of the same type or meaning keeps the type of the information determinable and keeps the replaced values distinguishable from one another. For example, a person's name is mapped to another name, such as "Alice" → "Cathy" and "Bob" → "David".

In the case of an IP address, for example, only a specific part of the network structure of the address can be left, and the remainder replaced with code information that follows a fixed rule given by a regular expression, such as one that forms a private IP address.
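The partial replacement of IP addresses described above can be illustrated with a minimal Python sketch. It is only an assumption about one possible realization: the function name `mask_ipv4`, the choice to keep the two leading octets, and the sequential renumbering of the remaining octets are all hypothetical and are not taken from the specification.

```python
import re

# Loose IPv4 pattern; it does not validate that each octet is <= 255.
IPV4_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def mask_ipv4(text, keep_octets=2, table=None):
    """Keep the leading octets of each IPv4 address and renumber the rest.

    `table` remembers earlier replacements so that the same original
    address always receives the same surrogate address.
    """
    table = {} if table is None else table
    n_tail = 4 - keep_octets

    def _sub(match):
        original = match.group(0)
        if original not in table:
            octets = original.split(".")
            n = len(table) + 1  # sequential surrogate number (assumption)
            tail = [(n >> (8 * i)) & 0xFF for i in reversed(range(n_tail))]
            table[original] = ".".join(octets[:keep_octets] + [str(t) for t in tail])
        return table[original]

    return IPV4_RE.sub(_sub, text)
```

Because the network prefix survives, an analyst can still see that two hosts share a subnet, while the host part no longer identifies the real machine.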

Furthermore, for information not registered in the determination rules, the present invention estimates the confidentiality attribute of regions whose attribute cannot be determined from the rules, using the appearance position in the message and the co-occurrence relation with confidential words. This makes it possible to improve the usability of a log while preventing its confidential regions from leaking outside.

FIG. 1 is a diagram showing an embodiment of the information processing system 100 of this embodiment.
FIG. 2 is a functional block diagram of the secure log generation unit 200 used in this embodiment.
FIG. 3 is a diagram showing a log 300 to be analyzed in this embodiment.
FIG. 4 is a diagram showing a list of variable parts, such as words, strings, and regular expressions, registered in the determination rules 224 of this embodiment.
FIG. 5 is a diagram showing a flowchart of the log analysis method of this embodiment and the data forms used in the log analysis.
FIG. 6 is a flowchart of the processing that follows FIG. 5.
FIG. 7 is a flowchart of the confidentiality estimation processing described in FIG. 6.
FIG. 8 is a diagram showing a confidentiality determination mode 800 used in this embodiment in association with a target log 810.
FIG. 9 is a diagram showing an embodiment of the replacement processing executed by the display replacement unit 222 of this embodiment.
FIG. 10 is a diagram showing an exemplary access log 1000 of a web server implemented using Apache 2.0 and a transaction log 1100 of an FTP server.

The present invention is described below by way of embodiments, but it is not limited to the embodiments described below. FIG. 1 shows an embodiment of an information processing system 100 to which the confidential information identification method of this embodiment is applied. A server function unit 120 is connected to a network 110 and, in response to requests from a client device 112 connected to the network 110, provides web services, storage services, search services, and the like to the client device 112.

The server function unit 120 includes a server device 122 and a database 124 whose data is managed by a database application or the like implemented on the server device 122. In addition to managing the content to be provided, the database 124 can hold security information such as user registrations, changes to user information, and access control information.

The server device 122 shown in FIG. 1 can be an information processing apparatus such as a blade server, a rack-mount server, or a general-purpose computer, and can be controlled by an operating system such as WINDOWS (registered trademark) 200X, UNIX (registered trademark), or LINUX (registered trademark). As long as it can process a search request from the client device 112 and return the result to the client device 112, the server device 122 can be implemented as a proxy server or gateway server for distributed computing, or as a web server.

The client device 112 can be implemented as a personal computer or workstation that includes a microprocessor, such as a single-core or dual-core processor, RAM, a hard disk drive, and the like. The client device 112 can also be implemented as a PDA or a smartphone. It can be controlled by any operating system, such as WINDOWS (registered trademark), UNIX (registered trademark), LINUX (registered trademark), MAC OS (registered trademark), or ANDROID (registered trademark).

The client device 112 and the server function unit 120 can be connected via the network 110 using a transaction protocol such as TCP/IP. Data transactions between the client device 112 and the server device 122 can be built on a distributed computing environment such as RMI (Remote Method Invocation), RPC (Remote Procedure Call), EJB (Enterprise Java (registered trademark) Beans), or CORBA (Common Object Request Broker Architecture).

In another embodiment, HTTP is used between the server device 122 and the client device 112, with a web browser on the client device 112 and server programs such as CGI (Common Gateway Interface), servlets, and database applications on the server device 122. In yet another embodiment, an FTP server application is implemented on the server device 122 and the client device 112 acts as an FTP client to perform data transactions.

The server device 122 holds a log 126 in a suitable storage area of the server device 122 or the database 124. In this specification, the log 126 refers to a file that stores messages about the operation of an information processing apparatus such as a server, accumulated through that operation. In a particular embodiment, for example, the log 126 is generated by sequentially recording, from the transactions performed with the client device 112, information representing the operation of the server device 122.

Although the information in the log 126 is of high value, the log is in most cases recorded as text and can therefore be accessed from outside in various ways. From a security standpoint, it is undesirable for anyone other than a tightly restricted set of personnel of the company or organization to access the raw log. In this embodiment, therefore, the server device 122 is equipped with functional means that, instead of allowing direct access to the log on the server function unit 120, generates a secure log in which the important basic information and personal information contained in the log are shielded, and allows access to that secure log. In this specification, a secure log is a data file in which the confidential information contained in the log 126 has been identified according to the present invention and shielded or replaced so that it is not displayed.

FIG. 2 is a functional block diagram of the secure log generation unit 200 used in this embodiment to identify highly confidential regions in a log. The secure log generation unit 200 shown in FIG. 2 can be written as a program executable by the server device 122, using, for example, C++, Java (registered trademark), Perl, Ruby, or PHP. It can be implemented on the server device 122 as, for example, a filter module that controls access to the log by a method other than encryption or the like.

The secure log generation unit 200 shown in FIG. 2 reads the log 126 generated by the server device 122 from the storage area in which it is recorded, using a suitable input interface, identifies the highly sensitive information, and applies various kinds of processing to shield it as confidential information. The data file in which the confidential information has been shielded becomes a secure log 126a, which can be output via an output interface or the like. If the log 126 is encrypted, a password or decryption key prepared for invoking the secure log generation unit 200 can be entered when the log is read. The form in which the secure log 126a is output is not particularly limited; it can include display on a desktop screen, creation of a structured document such as HTML or XML, creation of a text document, storage of the created file on an external storage medium such as a hard disk drive, and transmission over a network. For convenience of explanation, the input and output interfaces are omitted from FIG. 2.

The secure log generation unit 200 of this embodiment is described further with reference to FIG. 2. The secure log generation unit 200 can include a confidential information identification unit 210 and a display replacement unit 230. The confidential information identification unit 210 identifies the confidential information present in the log 126, and the display replacement unit 230 replaces the representation of the portions of the log 126 identified by the confidential information identification unit 210 with other characters or the like.

The confidential information identification unit 210 includes a message analysis unit 212, a cluster unit 214, and a variable part specifying unit 216. The message analysis unit 212 includes a parser that parses the log; it quantifies the textual similarity of the messages in the log 126, for example by comparison with a template, and sorts the messages in order of similarity. The cluster unit 214 clusters the sorted messages by similarity. The variable part specifying unit 216 compares the messages in a given cluster with one another to specify the fixed parts, which are regions that do not change, and the variable parts, which are regions that change from message to message, and thereby identifies the positions of the variable parts that should be treated as variables in the messages belonging to the cluster. Hereinafter, a region of a message that changes from message to message is referred to as a variable part, and a region that does not change from message to message is referred to as a fixed part.
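The clustering and variable part identification described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the specification: messages are split on whitespace, similarity is measured with Python's `difflib.SequenceMatcher` instead of a template comparison, clustering is a greedy pass against the first message of each cluster, and the names `cluster_messages`, `variable_positions`, and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def cluster_messages(messages, threshold=0.6):
    """Greedily group messages whose token sequences are similar.

    Each cluster is a list of token lists; a message joins the first
    cluster whose representative has the same token count and a
    similarity ratio at or above `threshold`.
    """
    clusters = []
    for message in messages:
        tokens = message.split()
        for cluster in clusters:
            representative = cluster[0]
            same_length = len(representative) == len(tokens)
            if same_length and SequenceMatcher(None, representative, tokens).ratio() >= threshold:
                cluster.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters


def variable_positions(cluster):
    """Return token positions whose value differs between messages."""
    return [i for i, column in enumerate(zip(*cluster)) if len(set(column)) > 1]


if __name__ == "__main__":
    log = [
        "User alice logged in from 10.0.0.1",
        "User bob logged in from 10.0.0.2",
        "Disk quota exceeded on /dev/sda1",
        "User carol logged in from 10.0.0.3",
    ]
    for cluster in cluster_messages(log):
        print(len(cluster), "message(s), variable positions:", variable_positions(cluster))
```

Positions that never change within a cluster correspond to the fixed part, and the remaining positions are the candidates for the variable part.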

The confidential information identification unit 210 further includes an attribute determination unit 218 and an attribute estimation unit 220. The attribute determination unit 218 refers to determination rules 224 to judge the confidentiality attribute of each word, string, or piece of code information following a fixed rule given by a regular expression that was identified as a variable part in a message. For example, it checks whether the word, string, or regular expression present in the region identified as a variable part is registered in the determination rules 224; if the variable part under examination is registered there as confidential, it is marked as a variable to be shielded or replaced as confidential information.
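A rule lookup of the kind described above could look like the following sketch. The layout of the rule table, the attribute labels, and the function name `determine_attribute` are illustrative assumptions; the specification only states that words, strings, and regular expressions are registered in the determination rules 224.

```python
import re

# Hypothetical rule table: a word list plus regular-expression rules.
RULES = {
    "words": {"alice": "person_name", "bob": "person_name"},
    "patterns": [
        (re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"), "ip_address"),
        (re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"), "mail_address"),
    ],
}


def determine_attribute(token, rules=RULES):
    """Return the confidentiality attribute of a variable part, or None.

    None means the rules cannot decide, i.e. the token is an unknown
    part that is left for the estimation step.
    """
    lowered = token.lower()
    if lowered in rules["words"]:
        return rules["words"][lowered]
    for pattern, attribute in rules["patterns"]:
        if pattern.match(token):
            return attribute
    return None
```

A token for which the function returns None is exactly the case that the attribute estimation unit 220 is meant to handle.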

The attribute estimation unit 220 estimates the confidentiality of variables that are not registered in the determination rules 224. In a first embodiment of the estimation, a variable located at the same position in a message as a variable part judged confidential under the determination rules 224 is presumed to have the same confidentiality level as that variable part. In a second embodiment, the co-occurrence relation between variable parts judged confidential and a variable part whose attribute is unknown is used, and the confidentiality level of the unknown variable part is estimated from the form of that co-occurrence.
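The two estimation approaches can be illustrated with a short sketch. Everything here is an assumption made for illustration: the tiny `KNOWN` table stands in for the determination rules 224, position-based propagation takes the most frequent attribute seen at each token position, and the co-occurrence measure is a simple ratio, since this passage does not specify how the co-occurrence relation is evaluated.

```python
from collections import Counter

# Hypothetical stand-in for the determination rules 224.
KNOWN = {"alice": "person_name", "bob": "person_name"}


def determine(token):
    """Rule lookup; returns None for unknown parts."""
    return KNOWN.get(token.lower())


def propagate_by_position(cluster):
    """First approach: map each position to the attribute found there.

    `cluster` is a list of equal-length token lists. If any message has
    a rule-determined token at a position, every token at that position
    in the cluster is presumed to share the attribute.
    """
    votes = {}
    for tokens in cluster:
        for position, token in enumerate(tokens):
            attribute = determine(token)
            if attribute is not None:
                votes.setdefault(position, Counter())[attribute] += 1
    return {pos: counter.most_common(1)[0][0] for pos, counter in votes.items()}


def cooccurrence_score(unknown, messages):
    """Second approach (simplified): fraction of the messages containing
    `unknown` that also contain a token already known to be confidential."""
    containing = [m for m in messages if unknown in m]
    if not containing:
        return 0.0
    hits = sum(1 for m in containing if any(determine(t) for t in m if t != unknown))
    return hits / len(containing)
```

With the cluster `[["User", "alice", "logged", "in"], ["User", "xq9", "logged", "in"]]`, position 1 is mapped to `person_name`, so the unknown token `xq9` would be treated as a person name as well.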

Because the attribute estimation unit 220 of this embodiment uses the result of parsing the messages, and not the determination rules 224 alone, to estimate confidentiality levels, it can process not only the words, strings, and regular expressions registered in the determination rules 224 but also words, strings, and regular expressions whose confidentiality level cannot be determined from the determination rules 224 (hereinafter referred to as unknown parts).

The display replacement unit 230 replaces each variable part determined or estimated to be confidential with another representation, such as a different word, string, or value conforming to a regular expression, while preserving the semantics of the variable part. "Preserving the semantics" means selecting a replacement that is the same as or similar to the variable part in meaning or concept. For example, a person's name is replaced as "Alice" → "Cathy" or "Bob" → "David". A value matching a regular expression, such as an IP address, is replaced with code information such as "192.168.1.1" → "192.1.1.2" or "10.1.5.6" → "167.5.7.8". Place names, landmark names, port numbers, and other variable parts are likewise replaced with the same or similar kinds of replacement.
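A semantics-preserving replacement of this kind can be sketched as a small mapping table. The class name `Pseudonymizer`, the per-attribute name pools, and the fallback label format are hypothetical; the sketch also does not check whether a surrogate happens to collide with a real value in the log.

```python
class Pseudonymizer:
    """Map confidential values to surrogates of the same kind.

    `pools` maps an attribute such as "person_name" to a list of
    surrogate values. The same original value always yields the same
    surrogate, so the identity of repeated values stays visible.
    """

    def __init__(self, pools):
        self._pools = {attribute: list(values) for attribute, values in pools.items()}
        self._mapping = {}

    def replace(self, value, attribute):
        key = (attribute, value)
        if key not in self._mapping:
            pool = self._pools.get(attribute, [])
            used = sum(1 for attr, _ in self._mapping if attr == attribute)
            if used < len(pool):
                self._mapping[key] = pool[used]
            else:
                # Pool exhausted: fall back to a numbered label.
                self._mapping[key] = f"{attribute}_{used + 1}"
        return self._mapping[key]


if __name__ == "__main__":
    p = Pseudonymizer({"person_name": ["Cathy", "David"]})
    print(p.replace("Alice", "person_name"))
    print(p.replace("Bob", "person_name"))
    print(p.replace("Alice", "person_name"))
```

Keeping the mapping table private to the party that produces the secure log is what prevents the surrogate names from being reversed.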

For port numbers and e-mail addresses, even if a fictitious name or a different number is substituted, it is quite possible that a third party actually uses the resulting e-mail address, or that the server actually uses the resulting port. For this reason, in the present embodiment, information such as e-mail addresses and port numbers can be replaced so that only enough trace remains to show that the value was an e-mail address or a port number, with the rest of the original information replaced by non-numeric characters, asterisks, # signs, or other suitable symbols.
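
The trace-preserving masking described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment; the function names `mask_email` and `mask_port` and the choice of which characters to keep are assumptions made for the example.

```python
import re


def mask_email(address):
    """Replace letters and digits with '*' (hypothetical helper).

    The '@' and '.' characters are kept as a trace so that the result is
    still recognizable as an e-mail address but cannot be a real one.
    """
    return re.sub(r"[A-Za-z0-9]", "*", address)


def mask_port(port):
    """Replace every digit of a port number with '#' (hypothetical helper)."""
    return re.sub(r"\d", "#", str(port))


masked_address = mask_email("alice@foo.com")
masked_port = mask_port(8080)
```

Because the masked values contain no letters or digits, they cannot coincide with an address or port that is actually in use, which is the concern raised in the paragraph above.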

Any other known anonymization or concealment method, such as encryption or other substitution schemes, can also be used. When variable parts are converted, identical words or values should preferably be assigned the same substitute word or value so that the substitutes remain consistent with the occurrence history of the originals.
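
One way to keep substitutes consistent with the occurrence history is to remember the substitute chosen for each original value. The sketch below assumes a small pool of substitute values per attribute; the pool contents, the class name `ConsistentReplacer`, and the numbered fallback are all hypothetical.

```python
# Hypothetical pools of substitute values per attribute (not from the source).
POOLS = {
    "name": ["Cathy", "David", "Erin"],
    "city": ["Springfield", "Riverton"],
}


class ConsistentReplacer:
    """Assigns each distinct original value one fixed substitute."""

    def __init__(self, pools):
        self._pools = pools
        self._mapping = {}  # (attribute, original) -> substitute
        self._next_index = {attribute: 0 for attribute in pools}

    def replace(self, attribute, original):
        key = (attribute, original)
        if key not in self._mapping:
            pool = self._pools[attribute]
            index = self._next_index[attribute]
            # Fall back to a numbered placeholder once the pool is used up.
            if index < len(pool):
                substitute = pool[index]
            else:
                substitute = f"{attribute}_{index}"
            self._next_index[attribute] = index + 1
            self._mapping[key] = substitute
        return self._mapping[key]


replacer = ConsistentReplacer(POOLS)
first = replacer.replace("name", "Alice")
second = replacer.replace("name", "Bob")
again = replacer.replace("name", "Alice")  # same original, same substitute
```

With this arrangement, every occurrence of “Alice” in the log maps to the same substitute, so an analyst can still follow one user across messages in the secure log.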

After the display replacement unit 230 has replaced the variable parts determined to be confidential, the result can be output as a data file, shown as the secure log 126a. The secure log 126a created by the display replacement unit 230 can be transmitted as a file or the like through an appropriate output interface, or output by storing it on a recording medium such as a hard disk drive, a USB memory, or a flexible disk.

A secure log generated in this way minimizes corporate risk even when an outside contractor accesses it for log analysis or when the file is provided to an outside contractor, which makes the log more usable and helps improve the network system. Access to the raw log can be handled by a separate, highly secure application that ensures both access to the log and the confidentiality of the analysis; because such an application is not the subject of the present application, a detailed description is omitted.

FIG. 3 shows a log 300 to be analyzed in the present embodiment. The log 300 contains a personal name 310, a city name 320, and an e-mail address 330. In addition to login messages, the messages illustrated in the log 300 contain locale information such as Tokyo and Osaka associated with particular personal names, as well as information on e-mail address updates. The log also contains “Sachiko” 340, a string that appears to be a Japanese name. Registering in the determination rule 224 all such personal information, information that may be personal information, and information that should be kept confidential because it relates to personal information is not realistic, given the variety of log types and the programming effort needed to create the determination rule 224.

No matter how many words are registered, a determination rule 224 that contains only Indo-European names such as “Alice” 310 and “Bob” will classify a string such as “Sachiko”, which appears to be a Japanese name, as an unknown part whose confidentiality level is unknown, so sufficient confidentiality cannot be guaranteed. The present embodiment therefore improves the detection of confidential information in the log 300 by estimating the confidentiality level of unknown parts through analysis of the message structure.

FIG. 4 shows a list of variable parts, such as words, strings, and regular expressions, registered in the determination rule 224 of the present embodiment. In the determination rule 224, each record for a confidential part associates, as fields, an attribute with a representation such as a word, string, or regular expression. The attribute is a category corresponding to the semantics of the confidential part, and a substitute can be selected on the basis of the category of the variable part. IP addresses are given as a regular expression; when an IP address is replaced, it can be replaced with, for example, a representation chosen from among private addresses that retains part of the original IP address.

FIG. 4 also registers e-mail addresses as an attribute. For an e-mail address, randomly replacing the string to the left of “@” cannot rule out the possibility that the result is a real e-mail address. The string can therefore be anonymized with characters such as “*” (asterisk) or “!” (exclamation mark), to the extent that the value can still be recognized as an e-mail address.

Non-confidential messages can also be registered in the determination rule 224. They are not data that must be entered in the determination rule 224, but they can be registered in applications that require more efficient parsing by the parser.

The confidential information identification process and the secure log generation process of the present embodiment are described with reference to the flowchart and log-analysis data of FIG. 5. The process of FIG. 5 starts at step S500. In step S501, the message analysis unit 212 reads the log data, divides the log into individual messages, and calculates the edit distance for each message. In step S502, the messages are sorted by similarity using the edit distance. The message structure 510 obtained in step S502 is a structure sorted by similarity based on the edit distance between messages; in the embodiment shown in FIG. 5, user profile update messages and login messages are recognized as dissimilar message types. The message structure 510 shows variable parts 512 and 514 by way of example. Character strings such as “User Profile for” and “is updated” are fixed parts.

In more detail, the word “Alice” between “User Profile for” and “is updated” is a personal name, and “Tokyo” and “alice@foo.com” are a city name and an e-mail address, respectively. Each is identified as a variable part together with the variable name that holds its value. As the message structure 510 shows, the variable parts of highly similar messages appear in the same order within the sentence structure.

Returning to the flowchart, in step S503 the cluster unit 214 clusters the sorted messages according to the similarity determined from the edit distance. Depending on how well the sorting ranks the messages by similarity, clustering is not strictly required, but recognizing variable and fixed parts cluster by cluster improves how reliably and accurately variable parts are recognized. FIG. 5 shows the result of clustering the message structure 510 in step S503 as the cluster structure 520. In the embodiment described, a cluster containing user profile update messages and a cluster containing login messages are identified.
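
Steps S501 to S503 can be illustrated with a word-level edit distance and a simple greedy grouping. The source does not specify the clustering algorithm, the tokenization, or the threshold, so the greedy strategy, whitespace tokenization, and `max_distance=2` below are assumptions made only for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cluster_messages(messages, max_distance=2):
    """Greedy clustering: a message joins the first cluster whose first
    member (used as the representative) is within max_distance."""
    clusters = []
    for message in messages:
        tokens = message.split()
        for cluster in clusters:
            if edit_distance(cluster[0], tokens) <= max_distance:
                cluster.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters


log = [
    "User Profile for Alice is updated",
    "Login by Bob",
    "User Profile for Bob is updated",
    "Login by Alice",
]
clusters = cluster_messages(log)
```

On this toy log the profile-update messages and the login messages fall into two separate clusters, mirroring the two message types identified in the cluster structure 520.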

Also in step S503, the structure of the fixed and variable parts of the messages forming a cluster is registered as a template structure 530. A template that links the locations of the variable parts in each cluster, that is, the variables in each message of the same cluster, to the messages is generated and registered in a suitable working storage space. The message clusters can be indexed as, for example, [cluster identifier, edit distance range, template identifier], and the cluster index can be registered in a suitable storage area reserved in the determination rule 224.

The template structure can be generated each time a log is processed, but the same server function unit 120 in most cases uses the same messages. Once the cluster index has been generated, the templates can therefore be registered in the determination rule 224 as message templates linked to cluster identifiers. An implementation can then read a message to be processed, identify from the edit distance the cluster to which it belongs, and immediately evaluate the sensitivity of the variable parts in that message.

In the template structure 530 shown in FIG. 5, variable parts are shown as “<?>”. This notation is illustrative and does not mean that variable parts are identified with structured-document tags. The variable part identification unit 216 identifies the variable parts in a template; the identification method, for example the number of words from the beginning, the number of spaces, or the text between the double quotation marks that delimit a variable, can be chosen as appropriate when programming for a particular purpose. In step S504, the identified variable parts are set as search keys for matching against the determination rule 224, and processing continues from point A.
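
A template of fixed and variable parts can be derived from one cluster by comparing the messages token by token. The sketch below assumes that all messages in a cluster have the same number of whitespace-separated tokens and identifies variable parts by their word offset from the beginning, which is only one of the options the text mentions; the function names are hypothetical.

```python
VARIABLE = "<?>"  # illustrative marker, as in template structure 530


def build_template(cluster):
    """Return a token list in which positions that differ between the
    messages of the cluster are replaced by the VARIABLE marker."""
    length = len(cluster[0])
    if any(len(tokens) != length for tokens in cluster):
        raise ValueError("this sketch assumes equal-length messages")
    return [
        column[0] if len(set(column)) == 1 else VARIABLE
        for column in zip(*cluster)
    ]


def variable_positions(template):
    """Word offsets (from the beginning) of the variable parts."""
    return [i for i, token in enumerate(template) if token == VARIABLE]


cluster = [
    "User Profile for Alice is updated".split(),
    "User Profile for Bob is updated".split(),
]
template = build_template(cluster)
positions = variable_positions(template)
```

The values found at the returned offsets are the candidates that would then be used as search keys against the determination rule 224 in step S504.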

FIG. 6 is a flowchart of the processing that follows FIG. 5. In step S601, the attribute determination unit 218 searches the determination rule 224 to determine the sensitivity of each variable part. In step S602, the sensitivity obtained from the search is linked to the template as the sensitivity of the variable part at the position currently being evaluated. The link can be made by parsing the template into a hierarchy of words, strings, and regular expressions and recording it in a structured document such as XML, or more simply by registering a table such as [template identifier, word offset from the beginning, confidential, word offset from the beginning, non-confidential, word offset from the beginning, confidential].

FIG. 6 shows the result of the attribute determination unit 218 using the template to determine the sensitivity of the variable parts. In the message structure 610, “Alice” and “Bob” following “User Profile for” are registered in the determination rule 224 and are therefore determined to be confidential directly. “Sachiko”, on the other hand, is not registered in the determination rule 224 in the embodiment described, and the attribute determination unit 218 returns the value “false” as the search result.
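
The rule lookup of step S601 can be pictured as a search over (attribute, pattern) records. The table below is a hypothetical stand-in modeled loosely on FIG. 4; the actual records and regular expressions of the determination rule 224 are not given in the text, and `None` is used here in place of the value “false”.

```python
import re

# Hypothetical rule records: (attribute, compiled regular expression).
RULES = [
    ("name", re.compile(r"Alice|Bob")),
    ("ip_address", re.compile(r"\d{1,3}(\.\d{1,3}){3}")),
    ("email", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
]


def lookup_attribute(value):
    """Return the attribute of the first rule matching the whole value,
    or None when no rule matches (the 'false' case in the text)."""
    for attribute, pattern in RULES:
        if pattern.fullmatch(value):
            return attribute
    return None


results = {v: lookup_attribute(v) for v in ["Alice", "Sachiko", "10.1.5.6"]}
```

A value such as “Sachiko” that matches no record is exactly the kind of unknown part that is handed on to the attribute estimation unit 220.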

The same applies to login messages. When the attribute determination unit 218 returns the value false in step S602, the secure log generation unit 200 calls the attribute estimation unit 220. In step S603, the attribute estimation unit 220 determines the position on the template of the variable part whose confidentiality attribute is unknown, sets the sensitivity assigned to that position in the template as the sensitivity that the attribute determination unit 218 should assign, and makes it available to the display replacement unit 230 described below. The template structure 620 illustrates this process.

In the template structure 620, the position of the variable part shown as <Red> in the user profile update template has already been registered as confidential, so when an unknown part appears at the <Red> position, its confidentiality attribute can be set to confidential. Likewise, in the login template, an unknown part at the position shown by way of example as <Red> is set to confidential.
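
The position-based estimation of step S603 can be sketched as two small steps: first find the template positions at which some message already holds a registered confidential value, then treat every value at those positions as confidential. The function names and the set-based registry are assumptions for illustration.

```python
def confidential_positions(cluster, variable_positions, is_registered):
    """Template positions at which at least one message holds a value
    that the rule lookup already classifies as confidential."""
    return {
        pos
        for pos in variable_positions
        if any(is_registered(tokens[pos]) for tokens in cluster)
    }


def estimate_by_position(tokens, positions):
    """Label every value found at a confidential position."""
    return {tokens[pos]: "confidential" for pos in sorted(positions)}


cluster = [
    "User Profile for Alice is updated".split(),
    "User Profile for Bob is updated".split(),
    "User Profile for Sachiko is updated".split(),
]
registered = {"Alice", "Bob"}  # hypothetical content of the rule
positions = confidential_positions(cluster, [3], lambda v: v in registered)
estimate = estimate_by_position(cluster[2], positions)
```

Here “Sachiko” is unknown to the rule but inherits the confidential level because it occupies the same template position as “Alice” and “Bob”.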

In step S604, the attribute estimation unit 220 also performs sensitivity inference for variable parts at arbitrary positions in a message, using information other than the position of appearance. The sensitivity inference, described in more detail below, estimates the confidentiality attribute of an unknown part at any position in a message from the presence or absence of a confidential part in the message or from its co-occurrence relationship with a confidential part. After step S604, the confidentiality level of the unknown parts in the message is updated in step S605 and processing passes to the display replacement unit 230, which in step S606 replaces the variable parts with different representations by referring to the determination rule and generates the secure log. In step S607, the secure log 126a is output through an appropriate output interface so that other devices can use it, and the process ends.

FIG. 7 is a flowchart of the sensitivity estimation process described with reference to FIG. 6. In this process, the secure log generation unit 200 shown in FIG. 2 estimates the confidentiality level of variable parts at arbitrary positions that a message may contain. In the confidential information identification method of the present embodiment, sensitivity is inferred in two ways. The first inference method uses the presence of a confidential part in the message (steps S604 → S700 → S605). The second inference method uses the co-occurrence relationship between confidential parts and unknown parts to infer confidentiality attributes in the message dynamically (steps S604 → S710 → S711 → S712 → S605). In the present embodiment, “co-occurrence relationship” means that two or more variable-part values appear together in the same message, and “co-occurrence frequency” means the frequency with which particular variable parts appear together in a message.

More specifically, consider a case in which an individual's name and a particular date appear together in different variable parts of the same message. The name is a highly sensitive confidential part, and the date that appears immediately after it is likely to be a day with special meaning for that individual, such as a birthday. Presuming that such a co-occurrence identifies the individual's birthday is a reasonable inference, because the probability that the same pair of values co-occurs for a different individual is only about {probability of an identical full name × probability of an identical birthday}, which is extremely low. A variable part that appears in the same message as a confidential part can therefore reasonably be presumed “confidential” even when its own confidentiality attribute is unknown.

Accordingly, when the present embodiment infers sensitivity from co-occurrence relationships, it uses the co-occurrence frequency relative to a confidential part and infers the sensitivity of an unknown part by setting conditions on that frequency. A specific logical condition on the co-occurrence frequency can be set for this purpose.

The sensitivity inference process of the present embodiment is further described with reference to FIG. 7. The process starts when control is passed from step S603. In step S700, which is the first embodiment, it is determined whether the message contains a confidential part; if it does, all variable parts co-occurring in that message are set to confidential together, and control passes to step S605.
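
The first embodiment amounts to a single check per message. A minimal sketch, assuming the variable parts of a message are available as a list and the registered confidential values as a set (both hypothetical representations):

```python
def mark_message(variables, registered_confidential):
    """If any variable part of the message is a registered confidential
    value, label all variable parts of that message confidential;
    otherwise leave them labeled unknown."""
    has_confidential = any(v in registered_confidential for v in variables)
    label = "confidential" if has_confidential else "unknown"
    return {v: label for v in variables}


labels = mark_message(["Alice", "Tokyo", "alice@foo.com"], {"Alice"})
```

This is cheap to evaluate but may label more values confidential than strictly necessary, which is the trade-off the co-occurrence thresholds of the second embodiment are meant to address.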

The second embodiment is described next. In step S710, the variable parts contained in a message are listed. In step S711, the variable parts that appear in the log together with variable parts classified under the same attribute are listed, and their co-occurrence frequencies are calculated and associated with the variable parts.

In step S712, the unknown part currently being evaluated is presumed confidential when the co-occurrence frequency of the confidential-part string (A) and the unknown variable-part string (B) is at or above a threshold TH1 and, at the same time, the frequency with which the variable-part string (B) appears together with strings other than that confidential-part string (denoted Ā, A with an overbar) is at or below a threshold TH2. This logical condition is adopted because, for example, when the value of a variable part is a name that is confidential information, strings that co-occur frequently with that name (for example, a birthday, an e-mail address, or that person's password) should be regarded as confidential.
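
The condition of step S712 can be sketched as follows. Frequencies are taken here as raw message counts, and Ā is interpreted as "confidential values other than A"; both choices, the threshold values, and the toy log are assumptions made for the example rather than details given in the text.

```python
from collections import Counter


def cooccurrence_counts(messages, confidential):
    """Count, for each (confidential value A, unknown value B) pair, the
    number of messages whose variable parts contain both."""
    counts = Counter()
    for values in messages:
        secrets = [v for v in values if v in confidential]
        unknowns = [v for v in values if v not in confidential]
        for a in secrets:
            for b in unknowns:
                counts[(a, b)] += 1
    return counts


def is_confidential(a, b, counts, th1, th2):
    """B is presumed confidential if it co-occurs with A at least th1
    times and with confidential values other than A at most th2 times."""
    with_a = counts[(a, b)]
    with_others = sum(n for (x, y), n in counts.items() if y == b and x != a)
    return with_a >= th1 and with_others <= th2


# Each message is represented by the set of its variable-part values.
messages = [
    {"Alice", "1990-01-01", "Japan"},
    {"Alice", "1990-01-01", "Japan"},
    {"Bob", "Japan"},
    {"Carol", "Japan"},
]
counts = cooccurrence_counts(messages, {"Alice", "Bob", "Carol"})
date_is_secret = is_confidential("Alice", "1990-01-01", counts, th1=2, th2=0)
country_is_secret = is_confidential("Alice", "Japan", counts, th1=2, th2=0)
```

In this toy log the date co-occurs only with “Alice” and is presumed confidential, while the country name also co-occurs with other names and is not.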

FIG. 7 shows exemplary conditions for sensitivity inference based on co-occurrence frequency. The condition 730 is used in the first embodiment, which treats variable parts co-occurring with a confidential part as confidential. Under the condition 730, the string “Tokyo” and the e-mail address “alice@foo.com”, both of which co-occur with the personal name “Alice”, are presumed to be highly confidential variable parts. Character replacement and the like is applied as described above according to the estimation result to protect the confidential information. The condition 740 is the condition used in step S712 of the second embodiment.

The condition 740 requires several co-occurrence evaluations, but it allows confidentiality to be estimated more precisely from the relationship between the unknown part and the confidential parts. Each of these conditions can be implemented in the information processing apparatus according to the type and purpose of the log.

A determination such as the condition 740 is needed in the following situation. Some strings that appear together with a confidential variable part are very common and also appear in other messages. For example, the name of the country where a person lives may appear frequently together with that person's name, but many other people live in the same country, so the country name may also appear together with many other personal names. In that case the sensitivity of the country name itself is low and it does not need to be replaced (assuming the population of the country is sufficiently large, it is not easy to identify an individual from the country name alone, so the country name leaks almost no personal privacy). Such a country name appears not only with a particular personal name A but also frequently with personal names other than A (Ā), so the condition 740 allows it to be judged non-confidential. In other embodiments, the desired level of security can be provided by setting the thresholds described above so that sensitivity is assigned appropriately for the particular application.

Another embodiment of the present invention is described below in the context of FIG. 7. In this embodiment, the sensitivity estimation process starts when control is passed from step S604. In step S700, the attribute estimation unit 220 selects one of the following two modes. The first mode is the simplest approach: the attribute estimation unit 220 treats all variable parts in a message as confidential when any one or more variable parts in the message have been determined to be confidential, and advances the process to step S605. This simplified method may over-classify some variable parts as confidential even though they are not. When the second mode is selected, the attribute estimation unit 220 lists the variable parts contained in the message in step S710. In step S711, the attribute estimation unit 220 lists the set of variable parts appearing in each message and then calculates the co-occurrence frequency for each variable part.

In step S712, when the co-occurrence frequency of an unknown part and a particular confidential part is at or above a predetermined threshold TH1, and the co-occurrence frequency of that unknown variable part with confidential parts other than the particular confidential part is below a threshold TH2, the attribute estimation unit 220 determines the variable part to be confidential. The reason is as follows: when a variable part is, for example, a personal name that is confidential information, strings that appear with the personal name at a high co-occurrence frequency (for example, a birthday, an e-mail address, or a personal password) can also be regarded as confidential.

In this embodiment, the first mode corresponds to setting to 0 the occurrence-frequency threshold used, in the co-occurrence relationship between a confidential part and a variable part, to judge an unknown part “confidential”. In other words, every variable part that appears at least once together with a confidential part is judged confidential; like the first embodiment, this can be regarded as an alternative process that treats an unknown part as confidential whenever a confidential part and the unknown part are present in the same message. As in the first embodiment, the first mode is a simplified method in that it may over-classify some variable parts as confidential even though they are not. It does not, however, require co-occurrence frequencies to be checked, so the attribute estimation unit 220 can select it when the overhead on the information processing apparatus is to be reduced. After step S712, the attribute estimation unit 220 proceeds to step S605 and ends the sensitivity estimation process of FIG. 7. In addition, different co-occurrence frequency thresholds can be used depending on the attribute of the variable part being evaluated.

For some messages, the confidentiality attributes of the variable parts cannot be determined at all from the determination rule 224. In that case, the confidential information identification unit 210 estimates the confidentiality attribute from the position at which the unknown part appears, and then further estimates and determines it from co-occurrence relationships, which prevents confidential information from appearing unaltered in the secure log. In still another embodiment, once a confidentiality attribute has been estimated for a word, string, character sequence, digit sequence, code information, or the like that appeared at a position identified as an unknown part, that data can be additionally registered in the determination rule 224, so that the determination rule 224 learns and the determination of confidential information becomes more efficient.

FIG. 8 shows the sensitivity determination modes 800 used in the present embodiment in association with a target log 810. Outlined rectangular frames are fixed parts, variable parts in cloud-shaped frames are confidential regions, hatched rectangular frames are regions estimated to be confidential, and underlined variable parts are regions whose confidentiality attribute was determined from their position of appearance in the template.

As shown in FIG. 8, the set 820 of fixed messages (non-confidential) and variable parts in cloud-shaped frames consists of parts whose confidentiality attribute was determined directly using the determination rule 224. The set 830, by contrast, consists of variable parts classified as unknown parts under the determination rule 224. In the present embodiment, the confidentiality attribute of variable parts classified as unknown is determined from their co-occurrence relationships and their positions of appearance in the message.

The variable parts whose confidential attribute was inferred or estimated using the co-occurrence relationship are the date and the city name that co-occur with a personal name. The variable part determined using its appearance position is “passw0rd”. This is a case in which the user, intending to enter a user ID, mistakenly entered a password and additionally mistyped it. Because it was entered where a user ID was expected and also contains a typo, it constitutes an unknown part. This example is of course given for explanation only; a variable part corresponding to a password is never registered in the determination rule. The present embodiment uses the appearance positions of variable parts in messages within the same cluster: for example, because a confidential area appears immediately after the variable part “UserID” in the first line of the log 810, the unknown part “passw0rd” is determined to be confidential.
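The position-based determination can be sketched as follows, assuming that each message in a cluster has already been split into parts and that each part carries a label. The label strings and the function names `build_template` and `resolve_by_position` are hypothetical; the sketch only illustrates the idea that an unknown part inherits the attribute found at the same position in other messages of the same cluster.

```python
from typing import Dict, List

CONFIDENTIAL = "confidential"
UNKNOWN = "unknown"


def build_template(cluster: List[List[str]]) -> Dict[int, str]:
    """Record every position at which at least one message in the cluster
    holds a part already determined to be confidential."""
    template: Dict[int, str] = {}
    for labels in cluster:
        for position, label in enumerate(labels):
            if label == CONFIDENTIAL:
                template[position] = CONFIDENTIAL
    return template


def resolve_by_position(labels: List[str], template: Dict[int, str]) -> List[str]:
    """Give each unknown part the attribute the template holds for its position;
    parts that are not unknown, or have no template entry, are left unchanged."""
    return [
        template.get(position, label) if label == UNKNOWN else label
        for position, label in enumerate(labels)
    ]
```

In the “passw0rd” example, the other messages of the cluster hold a confidential part at the position following “UserID”, so the unknown part at that position is resolved to confidential.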

As described above, in the present embodiment a confidentiality level can be set even for variable parts that are not registered in the determination rule 224. This reduces the risk to companies and organizations and thereby improves the usability of logs.

FIG. 9 shows an embodiment of the replacement processing executed by the display replacement unit 230 of the present embodiment. The original log 900 contains multiple confidential areas such as personal names, city names, and e-mail addresses. The display replacement unit 230 replaces the variable parts of messages registered as confidential according to a configured protocol. Specifically, a personal name or city name is replaced with another value of the same attribute selected from the determination rule 224. When the original variable parts are identical, the same substitute value is assigned to each of them. An e-mail address is replaced with an alternative representation in which letters are changed to other characters or digits, while remaining recognizable as an e-mail address.

Specifically, the personal names “Alice”, “Bob”, and “Sachiko” in the log 900 are replaced in the secure log 910 with “Mary”, “Nic”, and “John”, respectively. The city names “Tokyo”, “Osaka”, and “Naha” are replaced with “New York”, “Washington”, and “Toront”, respectively. E-mail addresses are replaced with characters in the form ****@***.***, which keeps the address recognizable as conforming to the SMTP protocol. The domain-name portion, which does not identify an individual, may also be left unreplaced to preserve information content.
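The replacement behavior described for FIG. 9 can be sketched as follows. The class `Pseudonymizer`, the function `mask_email`, and the idea of drawing substitutes in order from a supplied pool are assumptions made for illustration; the embodiment only states that another value of the same attribute is selected and that identical originals receive the same substitute. Unlike the fixed form ****@***.*** shown above, this sketch preserves the length of each segment, which is a simplification of its own.

```python
import re
from typing import Dict, Iterable, List


class Pseudonymizer:
    """Replace each distinct original value with a substitute of the same
    attribute, reusing the same substitute whenever the original repeats."""

    def __init__(self, pool: Iterable[str]) -> None:
        self._pool: List[str] = list(pool)  # candidate substitutes (assumed input)
        self._mapping: Dict[str, str] = {}

    def replace(self, value: str) -> str:
        if value not in self._mapping:
            if len(self._mapping) >= len(self._pool):
                raise ValueError("substitute pool exhausted")
            # Take the next unused substitute from the pool.
            self._mapping[value] = self._pool[len(self._mapping)]
        return self._mapping[value]


def mask_email(address: str) -> str:
    """Mask every character except '@' and '.', so the result is still
    recognizable as an e-mail address."""
    return re.sub(r"[^@.]", "*", address)
```

Keeping the mapping inside one object is what guarantees that a name appearing on several log lines is replaced consistently, so that the relationships between lines remain readable in the secure log.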

Although not shown in FIG. 9, confidential information such as an IP address is replaced by, for example, substituting a global IP address with a suitable private IP address that reuses part of the original digits. The replacement rules may be stored as a table or list in a suitable storage area managed by the secure log generation unit 200 and used, at the request of a high-level administrator such as a server administrator, to reverse the conversion and reproduce the original log.
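A reversible IP address replacement of this kind might look like the following sketch. The choice of the 192.168.0.0/16 range, the reuse of the last two octets, and the class name `IpReplacer` are assumptions for illustration only; the sketch also assumes the caller has already classified the address as a global IPv4 address.

```python
import ipaddress
from typing import Dict


class IpReplacer:
    """Map an IPv4 address to a private address that reuses its last two
    octets, and keep a table so the conversion can be reversed."""

    def __init__(self) -> None:
        self._forward: Dict[str, str] = {}  # original -> substitute
        self._reverse: Dict[str, str] = {}  # substitute -> original

    def replace(self, address: str) -> str:
        if address in self._forward:
            return self._forward[address]
        # IPv4Address validates the input and raises ValueError if malformed.
        octets = str(ipaddress.IPv4Address(address)).split(".")
        private = "192.168." + ".".join(octets[2:])
        if private in self._reverse:
            # Two originals sharing the last two octets would make the
            # reverse table ambiguous; this sketch simply refuses.
            raise ValueError("substitute already assigned to another address")
        self._forward[address] = private
        self._reverse[private] = address
        return private

    def restore(self, private: str) -> str:
        """Reverse the conversion using the stored table."""
        return self._reverse[private]
```

The reverse table plays the role of the stored replacement rules: only a party holding it can reproduce the original address from the secure log.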

Although each functional means and its processing have been described in terms of specific functional means to facilitate understanding of the invention, the present invention is not limited to having the specific functional means described above execute the specific processing. The functions for executing the processing described above may be assigned to any functional means in consideration of processing efficiency, programming efficiency in implementation, and the like.

The above functions of the present invention can be realized by a device-executable program written in an object-oriented programming language such as C++, Java (registered trademark), Java (registered trademark) Beans, Java (registered trademark) Applet, Java (registered trademark) Script, Perl, Ruby, or Python, or in a query language such as SQL. The program can be distributed by being stored on a device-readable recording medium or by transmission.

Although the present invention has been described with reference to specific embodiments, it is not limited to those embodiments. It may be modified within the scope that a person skilled in the art can conceive, including other embodiments, additions, changes, and deletions, and any such aspect is included in the scope of the present invention as long as it exhibits the operation and effects of the present invention.

DESCRIPTION OF SYMBOLS
100 Information processing system
102 Client
110 Network
112 Client apparatus
120 Server function unit
122 Server apparatus
124 Database
126 Log
126a Secure log
200 Secure log generation unit
210 Confidential information identification unit
212 Message analysis unit
214 Cluster unit
216 Variable part specifying unit
218 Attribute determination unit
220 Attribute estimation unit
224 Determination rule
230 Display replacement unit

Claims

A method for identifying confidential information in a log accumulated by an information processing apparatus, the method comprising:
reading messages about the operation of the information processing apparatus from the log, and clustering the messages according to the similarity of the messages;
identifying a variable part that differs between the messages included in a cluster;
attempting to determine a confidential attribute of the variable part using a preset rule; and
when there is a part whose confidential attribute cannot be determined by the rule, estimating and determining the confidential attribute of that part from a part whose confidential attribute has been determined.

The method of claim 1, further comprising generating a secure log by replacing the display of a variable part in the message with another display according to the determined confidential attribute.

The method of claim 1, wherein the step of estimating and determining the confidential attribute comprises estimating using a correspondence between the appearance position, in the message, of the part whose confidential attribute cannot be determined and the appearance position of a part whose confidential attribute has been determined.

The method according to claim 1, further comprising estimating the confidential attribute of the part whose confidential attribute cannot be determined from the co-occurrence frequency of a part whose confidential attribute has been determined and the part whose confidential attribute cannot be determined.

The method according to claim 1, further comprising quantifying the similarity of the messages using an edit distance over the letters, characters, and spaces constituting the messages.

The method according to claim 1, wherein the variable part is a word, a string, or code information described according to a rule given by a regular expression, constituting the message.

The method according to claim 1, wherein the rule registers words, strings, or code information described according to a rule given by a regular expression, classified by the semantics of the part.

The method of claim 3, wherein the step of estimating using the correspondence between the appearance position of the part whose confidential attribute cannot be determined in the message and the appearance position of the part whose confidential attribute has been determined comprises:
collating a template that associates appearance positions with confidential attributes for the variable parts of the messages included in the cluster; and
determining the confidential attribute of the part to be the confidential attribute that the template assigns to the same appearance position.

The method of claim 1, wherein the step of estimating and determining the confidential attribute comprises setting the confidential attribute of the part whose confidential attribute cannot be determined to confidential, based on a condition on the co-occurrence frequency of a part that should be confidential and the part whose confidential attribute cannot be determined.

The method according to claim 1, further comprising additionally registering, in the rule, the data of the part whose confidential attribute has been estimated and determined, thereby training the rule.

The method of claim 1, wherein the step of estimating and determining the confidential attribute estimates the confidential attribute of a variable part of the template to be confidential if, among the messages of the same cluster, that variable part includes at least one confidential part.

The method according to claim 1, wherein the step of generating a secure log by replacing the display of the variable part in the message with another display comprises selecting and substituting another display that retains the semantics of the variable part.

The method of claim 1, comprising selecting the same substitute display when the displays of the original parts in the messages are the same.

The method according to claim 2, further comprising an output step of transmitting only the secure log to the outside of the information processing apparatus.

An information processing apparatus for identifying confidential information in a log, the information processing apparatus comprising:
a cluster unit that reads messages about the operation of the information processing apparatus from the log and clusters the messages according to the similarity of the messages;
a variable part specifying unit that specifies a variable part that differs between the messages included in a cluster;
an attribute determination unit that attempts to determine a confidential attribute of the variable part using a preset rule; and
an attribute estimation unit that, when there is a part whose confidential attribute cannot be determined by the rule, either estimates the confidential attribute using a correspondence between the appearance position, in the message, of the part whose confidential attribute cannot be determined and the appearance position of a part whose confidential attribute has been determined, or estimates and determines the confidential attribute of the part whose confidential attribute cannot be determined from the co-occurrence frequency of a part whose confidential attribute has been determined and the part whose confidential attribute cannot be determined.

The information processing apparatus according to claim 15, further comprising:
a message analysis unit that reads the messages from the log and sorts the messages in order of similarity; and
a display replacement unit that generates a secure log by replacing the display of the variable part in the message with another display according to the determined confidential attribute,
wherein the message analysis unit quantifies the similarity of the messages using an edit distance over the letters, characters, and spaces constituting the messages.

The information processing apparatus according to claim 15, wherein the variable part is a word, a string, or code information described according to a rule given by a regular expression, constituting the message.

An apparatus-executable program for causing an information processing apparatus to execute the method according to any one of claims 1 to 14.