JP2005209115A - Log summarization device, log summarization program, and recording medium

Info

Publication number: JP2005209115A
Application number: JP2004017589A
Authority: JP
Inventors: Akira Yamada; 山田　　明; Masaru Miyake; 優三宅; Keisuke Takemori; 敬祐竹森; Toshiaki Tanaka; 俊昭田中; Akihito Yamamoto; 明仁山本
Original assignee: National Institute of Information and Communications Technology; KDDI R&D Laboratories Inc
Current assignee: National Institute of Information and Communications Technology; KDDI Research Inc
Priority date: 2004-01-26
Filing date: 2004-01-26
Publication date: 2005-08-04
Anticipated expiration: 2024-01-26
Also published as: JP4491577B2

Abstract

【課題】膨大なログを見易く整形し、有益な情報をユーザに提示することができるログ要約装置、ログ要約プログラムおよび記録媒体を提供する。
【解決手段】要約処理部１４１は、不純度関数を用いて、ログ中の各属性データについての不純度ｆを求め、最もｆの大きな属性データを要約対象の属性データとして選択する。続いて、要約処理部１４１は、選択された属性データに出現する項目データの出現頻度に基づいて、属性データの不純度が下がるように、複数の項目データを同一の数値または文字列に置き換える。要約処理部１４１は置き換えによって生じた重複行を削除する。要約処理部１４１は、ログの行数が所定行未満となるまで要約処理を繰り返す。
【選択図】図１
PROBLEM TO BE SOLVED: To provide a log summarization device, a log summarization program, and a recording medium capable of formatting an enormous log for readability and presenting useful information to a user.
SOLUTION: A summary processing unit 141 uses an impurity function to obtain an impurity f for each attribute data in a log, and selects the attribute data with the largest f as the attribute data to be summarized. Subsequently, based on the appearance frequency of the item data appearing in the selected attribute data, the summary processing unit 141 replaces a plurality of item data with the same numerical value or character string so that the impurity of the attribute data decreases. The summary processing unit 141 deletes the duplicate lines produced by the replacement. The summary processing unit 141 repeats the summary processing until the number of log lines falls below a predetermined number.
SELECTED DRAWING: Figure 1

Description

本発明は、ルータ、ファイアウォール、ＩＤＳ（ＩｎｔｒｕｓｉｏｎＤｅｔｅｃｔｉｏｎＳｙｓｔｅｍ：侵入検知システム）等のネットワーク機器（通信機器、セキュリティ機器）のログを自動要約するログ要約装置、ログ要約プログラムおよび記録媒体に関する。 The present invention relates to a log summarization device, a log summarization program, and a recording medium that automatically summarize logs of network devices (communication devices, security devices) such as routers, firewalls, IDS (Intrusion Detection System).

ネットワークにおいて、ルータ、ファイアウォール、パケットフィルター、ＩＤＳ等の機器を設置する場合、これらの装置では膨大な記録（ログ）が発生する。この膨大なログを全て確認することは困難であり、ログを要約あるいは解析する手法が提案されている。 When devices such as routers, firewalls, packet filters, and IDSs are installed in the network, a huge amount of records (logs) are generated in these devices. It is difficult to confirm all of these enormous logs, and methods for summarizing or analyzing the logs have been proposed.

特許文献１には、プログラムの実行トレースなどのログから制御構造を抽出し、その情報を用いてログを整形する制御構造抽出方法が記載されている。また、特許文献２には、アクセスログの中から実質的に意味を有するアクセスログを抽出するアクセスログの縮約方法が記載されている。また、特許文献３には、単語あるいは二単語連結句の出現頻度を考慮してログ情報を抽出するログ情報解析装置が記載されている。
Patent Document 1 describes a control structure extraction method that extracts a control structure from a log, such as a program execution trace, and formats the log using that information. Patent Document 2 describes an access log reduction method that extracts the substantially meaningful entries from an access log. Patent Document 3 describes a log information analysis device that extracts log information in consideration of the appearance frequency of words or two-word phrases.
特開２００３−２４１９９８号公報、特開２００３−７６８１４号公報、特開２００１−３５６９３９号公報 JP 2003-241998 A; JP 2003-76814 A; JP 2001-356939 A

ネットワーク機器において出力されるログは非常に膨大であり、その全てをユーザが確認することは困難であった。また、有益な情報が他の重要でない情報に埋もれてしまっており、有益な情報のみを確認することが困難であった。 The log output in the network device is very large, and it is difficult for the user to confirm all of the logs. In addition, useful information is buried in other unimportant information, and it is difficult to confirm only useful information.

本発明は、上述した問題点に鑑みてなされたものであって、膨大なログを見易く整形し、有益な情報をユーザに提示することができるログ要約装置、ログ要約プログラムおよび記録媒体を提供することを目的とする。 The present invention has been made in view of the above problems, and its object is to provide a log summarization device, a log summarization program, and a recording medium that can format a huge log for readability and present useful information to the user.

本発明は上記の課題を解決するためになされたもので、請求項１に記載の発明は、ネットワーク機器において出力されるログを要約するログ要約装置において、前記ログに記録されている属性データに出現する項目データごとの出現頻度に基づいて、前記属性データの出現項目の偏在度に関する値を算出する算出手段と、前記偏在度に関する値の算出結果に基づいて、集約対象の属性データを選択する選択手段と、選択された前記属性データ中に出現する項目データの出現頻度に基づいて、複数の項目データを同一の数値または文字列に置き換える置換手段と、前記置換手段による項目データの置き換えによって生じた重複行を同一の行に集約する集約手段とを具備することを特徴とするログ要約装置である。 The present invention has been made to solve the above problems. The invention according to claim 1 is a log summarization device that summarizes a log output by a network device, comprising: calculation means for calculating a value related to the degree of uneven distribution of the items appearing in attribute data recorded in the log, based on the appearance frequency of each item data appearing in that attribute data; selection means for selecting attribute data to be aggregated, based on the calculated value related to the degree of uneven distribution; replacement means for replacing a plurality of item data with the same numerical value or character string, based on the appearance frequency of the item data appearing in the selected attribute data; and aggregation means for aggregating the duplicate lines produced by the replacement of item data by the replacement means into a single line.

属性データとは、ログに記録されているファイルサイズやファイル名、ＩＰアドレス等のデータ種別を表す。項目データとは、属性データの具体的内容（数値や文字列）を表す。出現頻度とは、同一の項目データがログ中に出現する頻度を表し、出現回数や出現確率等である。偏在度に関する値の算出においては、情報エントロピーやｇｉｎｉ係数等を用いる。項目データが特定のデータに集中するほど偏在度は高くなり、逆に特定のデータに集中せずにばらつくほど偏在度は低くなる。 The attribute data represents a data type such as a file size, a file name, and an IP address recorded in the log. Item data represents specific contents (numerical values or character strings) of attribute data. The appearance frequency represents the frequency at which the same item data appears in the log, such as the number of appearances and the appearance probability. Information entropy, a gini coefficient, etc. are used in the calculation of the value regarding the degree of uneven distribution. The uneven distribution degree increases as the item data concentrates on specific data, and conversely, the uneven distribution degree decreases as the item data varies without concentrating on specific data.

請求項２に記載の発明は、請求項１に記載のログ要約装置において、前記属性データは、連続属性データ、離散属性データ、および階層構造属性データのいずれかであり、前記置換手段は、選択された前記属性データに出現する項目データの出現頻度に基づいて、前記属性データの偏在度が上がるように、複数の項目データを同一の数値または文字列に置き換えることを特徴とする。 The invention according to claim 2 is the log summarization device according to claim 1, wherein the attribute data is one of continuous attribute data, discrete attribute data, and hierarchical structure attribute data, and the replacement means replaces a plurality of item data with the same numerical value or character string, based on the appearance frequency of the item data appearing in the selected attribute data, so that the degree of uneven distribution of the attribute data increases.

連続属性データは、項目データ間に大小関係があり、その大小関係に意味があるものである。離散属性データは、項目データ間に大小関係がないか、あったとしても大小関係に特に意味がないものである。階層構造属性データは、項目データが階層構造で表されるものである。ＩＰアドレス、ディレクトリ構造、ディクレトリパス、ＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）、ＵＲＩ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＩｄｅｎｔｉｆｉｅｒ）、メールアドレス、Ｘｐａｔｈ等を階層構造属性データに含めることができる。 The continuous attribute data has a magnitude relationship between item data, and the magnitude relationship is meaningful. In the discrete attribute data, there is no size relationship between item data, or even if there is a size relationship, there is no particular meaning in the size relationship. In the hierarchical structure attribute data, item data is represented by a hierarchical structure. An IP address, directory structure, directory path, URL (Uniform Resource Locator), URI (Uniform Resource Identifier), mail address, Xpath, and the like can be included in the hierarchical structure attribute data.

請求項３に記載の発明は、請求項２に記載のログ要約装置において、前記属性データが前記連続属性データである場合、前記置換手段は、前記属性データに出現する複数の項目データ間の差と該項目データの出現頻度とに基づいて、前記属性データの偏在度が上がるように、複数の項目データを同一の数値または文字列に置き換えることを特徴とする。 The invention according to claim 3 is the log summarization device according to claim 2, wherein, when the attribute data is the continuous attribute data, the replacement means replaces a plurality of item data with the same numerical value or character string, based on the difference between the item data appearing in the attribute data and on the appearance frequency of those item data, so that the degree of uneven distribution of the attribute data increases.

請求項４に記載の発明は、請求項２に記載のログ要約装置において、前記属性データが前記離散属性データである場合、前記置換手段は、前記属性データに出現する複数の項目データ間のハミング距離と該項目データの出現頻度とに基づいて、前記属性データの偏在度が上がるように、複数の項目データを同一の数値または文字列に置き換えることを特徴とする。 The invention according to claim 4 is the log summarization device according to claim 2, wherein, when the attribute data is the discrete attribute data, the replacement means replaces a plurality of item data with the same numerical value or character string, based on the Hamming distance between the item data appearing in the attribute data and on the appearance frequency of those item data, so that the degree of uneven distribution of the attribute data increases.

ハミング距離とは、比較対象の複数の文字列の同一位置における文字がいくつ異なるかを示す値である。比較対象の複数の文字列の文字列長が異なる場合は、文字列長の差をハミング距離に加算するなどとすればよい。 The Hamming distance is a value indicating how many characters at the same position of a plurality of character strings to be compared are different. If the character string lengths of the plurality of character strings to be compared are different, the difference between the character string lengths may be added to the Hamming distance.
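The distance described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `hamming_distance` is hypothetical, and adding the length difference is only the option the text suggests for strings of unequal length.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count differing characters at the same positions of a and b.

    Assumption: when the lengths differ, the length difference is added,
    which is the handling suggested in the description.
    """
    # zip() stops at the shorter string, so only shared positions are compared.
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))
```

For example, "login" and "logout" differ at two shared positions and by one character in length, giving a distance of 3.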

請求項５に記載の発明は、請求項２に記載のログ要約装置において、前記属性データが前記階層構造属性データである場合、前記置換手段は、前記属性データに出現する項目データを木構造の節点に割り当て、前記属性データの偏在度が上がるように、出現頻度の小さな項目データを同一の数値または文字列に置き換えることを特徴とする。 According to a fifth aspect of the present invention, in the log summarizing apparatus according to the second aspect, when the attribute data is the hierarchical structure attribute data, the replacement unit converts the item data appearing in the attribute data into a tree structure. Item data having a low appearance frequency is replaced with the same numerical value or character string so that the attribute data is allocated to nodes and the degree of uneven distribution of the attribute data is increased.
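One way to picture the replacement for hierarchical structure attribute data is sketched below. This passage does not give the detailed procedure, so the following rests on assumptions: paths are "/"-separated, an item counts as infrequent when it appears fewer than `min_count` times, and an infrequent item is replaced by its parent node followed by a wildcard. The name `collapse_rare_paths` and both parameters are hypothetical.

```python
from collections import Counter

def collapse_rare_paths(paths, min_count=2, wildcard="*"):
    """Replace infrequent paths with '<parent>/<wildcard>' (illustrative only)."""
    counts = Counter(paths)
    result = []
    for path in paths:
        if counts[path] >= min_count:
            result.append(path)  # frequent item: keep as-is
        else:
            # Move one level up the tree: keep the parent, hide the leaf.
            parent, _, _ = path.rpartition("/")
            result.append(f"{parent}/{wildcard}")
    return result
```

After such a replacement, rows that differed only in a rarely seen file name share the same value and can be aggregated.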

請求項６に記載の発明は、請求項１〜請求項５のいずれかの項に記載のログ要約装置において、前記ログを複数のログに分割する分割手段と、分割された個々のログの要約後のログを連結する連結手段とをさらに具備することを特徴とする。 The invention according to claim 6 is the log summarization device according to any one of claims 1 to 5, further comprising: dividing means for dividing the log into a plurality of logs; and concatenating means for concatenating the summarized versions of the individual divided logs.

分割手段によって分割されたログは、算出手段、選択手段、置換手段、および集約手段の処理によって要約される。連結手段は、分割後の個々のログが要約されたものを連結する。この場合、連結手段は個々のログを所定数ごとに連結し、複数のログを生成してもよい。複数のログが生成された場合、それぞれのログは算出手段、選択手段、置換手段、および集約手段の処理によって再び要約される。連結手段は要約後のログを連結する。上記の処理を繰り返し行うようにしてもよい。 The logs divided by the dividing unit are summarized by the processing of the calculating unit, selecting unit, replacing unit, and aggregating unit. The concatenation means concatenates the summarized individual logs after the division. In this case, the concatenation unit may concatenate individual logs every predetermined number to generate a plurality of logs. When a plurality of logs are generated, each log is summarized again by the processing of the calculation means, selection means, replacement means, and aggregation means. The connecting means connects the logs after summarization. The above processing may be repeated.
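The divide, summarize, and concatenate flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: `summarize_in_chunks` and `dedupe` are hypothetical names, and `dedupe` is only a stand-in for the real summarization (calculation, selection, replacement, aggregation).

```python
def summarize_in_chunks(rows, chunk_size, summarize):
    """Split rows into chunks, summarize each chunk, and concatenate the results."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    merged = []
    for chunk in chunks:
        merged.extend(summarize(chunk))
    return merged

def dedupe(rows):
    """Stand-in summarizer: drop repeated rows, keeping first occurrences."""
    return list(dict.fromkeys(rows))
```

Because the concatenated result may again contain rows that can be aggregated, the same summarizer can be applied to it once more, which corresponds to the repeated processing mentioned in the text.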

請求項７に記載の発明は、ネットワーク機器において出力されるログの要約処理をコンピュータに実行させるログ要約プログラムにおいて、前記ログに記録されている属性データ中に出現する項目データごとの出現頻度に基づいて、前記属性データの出現項目の偏在度に関する値を算出するステップと、前記偏在度に関する値の算出結果に基づいて、集約対象の属性データを選択するステップと、選択された前記属性データに出現する項目データの出現頻度に基づいて、複数の項目データを同一の数値または文字列に置き換えるステップと、前記置換手段による項目データの置き換えによって生じた重複行を同一の行に集約するステップとをコンピュータに実行させるためのログ要約プログラムである。 The invention according to claim 7 is a log summarization program that causes a computer to execute summarization of a log output by a network device, the program causing the computer to execute: a step of calculating a value related to the degree of uneven distribution of the items appearing in attribute data recorded in the log, based on the appearance frequency of each item data appearing in that attribute data; a step of selecting attribute data to be aggregated, based on the calculated value related to the degree of uneven distribution; a step of replacing a plurality of item data with the same numerical value or character string, based on the appearance frequency of the item data appearing in the selected attribute data; and a step of aggregating the duplicate lines produced by the replacement of item data into a single line.

請求項８に記載の発明は、請求項７に記載のログ要約プログラムにおいて、前記属性データは、連続属性データ、離散属性データ、および階層構造属性データのいずれかであり、前記選択された前記属性データ中に出現する項目データの出現頻度に基づいて、複数の前記項目データを同一の数値または文字列に置き換えるステップにおいては、選択された前記属性データに出現する項目データの出現頻度に基づいて、前記属性データの偏在度が上がるように、複数の項目データを同一の数値または文字列に置き換えることを特徴とする。 The invention according to claim 8 is the log summarization program according to claim 7, wherein the attribute data is any one of continuous attribute data, discrete attribute data, and hierarchical structure attribute data, and the selected attribute is selected. In the step of replacing the plurality of item data with the same numerical value or character string based on the appearance frequency of item data appearing in the data, based on the appearance frequency of the item data appearing in the selected attribute data, A plurality of item data is replaced with the same numerical value or character string so that the degree of uneven distribution of the attribute data is increased.

請求項９に記載の発明は、請求項８に記載のログ要約プログラムにおいて、前記属性データが前記連続属性データである場合、前記属性データに出現する複数の項目データ間の差と該項目データの出現頻度とに基づいて、前記属性データの偏在度が上がるように、複数の項目データを同一の数値または文字列に置き換えることを特徴とする。 The invention according to claim 9 is the log summarization program according to claim 8, wherein when the attribute data is the continuous attribute data, a difference between a plurality of item data appearing in the attribute data and the item data A plurality of item data is replaced with the same numerical value or character string so that the degree of uneven distribution of the attribute data is increased based on the appearance frequency.

請求項１０に記載の発明は、請求項８に記載のログ要約プログラムにおいて、前記属性データが前記離散属性データである場合、前記属性データに出現する複数の項目データ間のハミング距離と該項目データの出現頻度とに基づいて、前記属性データの偏在度が上がるように、複数の項目データを同一の数値または文字列に置き換えることを特徴とする。 In the log summarizing program according to claim 8, when the attribute data is the discrete attribute data, the Hamming distance between a plurality of item data appearing in the attribute data and the item data The plurality of item data is replaced with the same numerical value or character string so that the degree of uneven distribution of the attribute data is increased based on the appearance frequency of.

請求項１１に記載の発明は、請求項８に記載のログ要約プログラムにおいて、前記属性データが前記階層構造属性データである場合、前記属性データに出現する項目データを木構造の節点に割り当て、前記属性データの偏在度が上がるように、出現頻度の小さな項目データを同一の数値または文字列に置き換えることを特徴とする。 The invention according to claim 11 is the log summarization program according to claim 8, wherein when the attribute data is the hierarchical structure attribute data, item data appearing in the attribute data is assigned to a node of a tree structure, The item data with a low appearance frequency is replaced with the same numerical value or character string so that the degree of uneven distribution of the attribute data is increased.

請求項１２に記載の発明は、請求項７〜請求項１１のいずれかの項に記載のログ要約プログラムにおいて、前記偏在度を算出するステップの前に、前記ログを複数のログに分割するステップと、前記重複行を集約するステップの後に、分割された個々のログの要約後のログを連結するステップとをさらに具備することを特徴とする。 The invention according to claim 12 is the log summarization program according to any one of claims 7 to 11, wherein the log is divided into a plurality of logs before the step of calculating the uneven distribution degree. And a step of concatenating logs after summarizing the divided individual logs after the step of aggregating the duplicate lines.

請求項１３に記載の発明は、請求項７〜請求項１２のいずれかの項に記載のログ要約プログラムを記録したコンピュータ読み取り可能な記録媒体である。 A thirteenth aspect of the present invention is a computer-readable recording medium on which the log summarizing program according to any one of the seventh to twelfth aspects is recorded.

本発明によれば、ログ中の属性ごとの項目の出現頻度を考慮してログの要約を行うようにしたので、膨大なログを見易く整形し、有益な情報をユーザに提示することができるという効果が得られる。 According to the present invention, since log summarization is performed in consideration of the appearance frequency of items for each attribute in the log, it is possible to easily format a huge log and present useful information to the user. An effect is obtained.

以下、図面を参照し、本発明を実施するための最良の形態について説明する。図１は、本発明の一実施形態によるログ要約装置の構成を示すブロック図である。本実施形態によるログ要約装置は、ネットワークに接続されており、各種ネットワーク機器から出力されるログの要約を行う。ネットワーク機器としては、ルータ、ファイアウォール、ＩＤＳ等が想定される。 The best mode for carrying out the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing a configuration of a log summarizing apparatus according to an embodiment of the present invention. The log summarization apparatus according to the present embodiment is connected to a network and summarizes logs output from various network devices. As the network device, a router, a firewall, an IDS, or the like is assumed.

図１において、１０はログ収集部であり、ネットワーク機器から、ネットワークを介してログを収集する。１１は操作部であり、ユーザによって操作されるキーボード等を備えている。ユーザによって操作部１１が操作されると、操作部１１はユーザによる操作を示す信号を制御部１５へ出力する。制御部１５はこの信号に基づいてユーザによる操作の内容を判断する。 In FIG. 1, reference numeral 10 denotes a log collection unit, which collects logs from a network device via a network. Reference numeral 11 denotes an operation unit, which includes a keyboard and the like operated by the user. When the operation unit 11 is operated by the user, the operation unit 11 outputs a signal indicating an operation by the user to the control unit 15. The control unit 15 determines the content of the user operation based on this signal.

１２は記憶部であり、ログ収集部１０によって収集されたログや、制御部１５の動作用のプログラム、ログ形式等の設定に関する設定情報等を記憶する。１３は表示部であり、例えば液晶ディスプレイ等を備えている。表示部１３は、制御部１５によって出力される表示データに基づいて、ユーザによる操作の結果やログの要約の結果等を表示する。１４はログ要約部であり、後述する処理に従って、ログの要約を行う。ログ要約部１４は要約処理を行う要約処理部１４１と、処理用データを一時記憶するメモリ等の処理用領域１４２とを備える。制御部１５は各部を制御する。 Reference numeral 12 denotes a storage unit that stores logs collected by the log collection unit 10, operation programs for the control unit 15, setting information related to settings such as a log format, and the like. Reference numeral 13 denotes a display unit, which includes, for example, a liquid crystal display. The display unit 13 displays the result of the user operation, the log summary result, and the like based on the display data output by the control unit 15. Reference numeral 14 denotes a log summarization unit, which summarizes logs according to processing described later. The log summarization unit 14 includes a summarization processing unit 141 that performs summarization processing, and a processing area 142 such as a memory that temporarily stores processing data. The control unit 15 controls each unit.

上記の構成を備えたログ要約装置は、例えば汎用のパーソナルコンピュータによって実現される。ログ要約装置は外部の通信機器から出力されるログを、ネットワークを介して収集し、収集したログを要約するものであってもよいし、各種ネットワーク機器の機能がソフトウェアとして実現されている場合、そのソフトウェアがインストールされているパーソナルコンピュータがログ要約装置の機能を備えていてもよい。 The log summarizing apparatus having the above configuration is realized by, for example, a general-purpose personal computer. The log summarization device may collect logs output from an external communication device via a network and summarize the collected logs. If the functions of various network devices are realized as software, A personal computer in which the software is installed may have the function of a log summarizing device.

次に、本実施形態におけるログについて説明する。一般的に、ログは予め形式が定義されており、記録が時系列順に発生する構造を有している。ログの形式は、ネットワーク機器ごとに定義されているが、機器が異なるログ間においても、汎用的な形式は定義されている。ログは複数行からなり、一行当たりのログデータは、時間データおよび複数の属性データからなる。本実施形態における属性とは、値が連続的に変化する属性（連続属性）、値が離散的に変化する属性（離散属性）、および値が階層構造の関係で表される属性（階層構造属性）のいずれかの属性である。 Next, the log in this embodiment will be described. In general, a log has a format defined in advance, and has a structure in which recording occurs in time series. The log format is defined for each network device, but a general-purpose format is also defined between logs with different devices. The log is composed of a plurality of lines, and the log data per line is composed of time data and a plurality of attribute data. The attributes in the present embodiment are attributes whose values change continuously (continuous attributes), attributes whose values change discretely (discrete attributes), and attributes whose values are represented by a hierarchical structure (hierarchical structure attributes) ) Is one of the attributes.

図２は本実施形態におけるログの一般形式を示す参考図である。本実施形態におけるログデータは、先頭の時間データが示す時刻順に並んでおり、各ログデータは連続属性・離散属性・階層構造属性のいずれかに属する属性データを有している。連続属性に属する属性データは数値で表され、項目ごとの大小関係に意味があるものである。離散属性に属する属性データは文字列または数値で表されるものであり、項目ごとの大小関係がないか、もしあったとしても大小関係に特に意味がないものである。階層構造属性に属する属性データは、項目が階層構造で表されるものである。なお、本実施形態における属性データとは、ファイルサイズやファイル名、ＩＰアドレス等を示すものとする。また、属性データは複数の項目からなり、項目の具体的内容（数値や文字列）を項目値（項目データ）と定義することにする。 FIG. 2 is a reference diagram showing the general format of the log in this embodiment. The log data in this embodiment is arranged in the order of time indicated by the first time data, and each log data has attribute data belonging to any of continuous attributes, discrete attributes, and hierarchical structure attributes. The attribute data belonging to the continuous attribute is represented by a numerical value, and the magnitude relationship for each item is meaningful. The attribute data belonging to the discrete attribute is represented by a character string or a numerical value, and does not have a magnitude relationship for each item, or if there is any, the magnitude relationship has no particular meaning. In the attribute data belonging to the hierarchical structure attribute, items are represented in a hierarchical structure. Note that attribute data in the present embodiment indicates a file size, a file name, an IP address, and the like. The attribute data is composed of a plurality of items, and the specific contents (numerical values and character strings) of the items are defined as item values (item data).

図３はｆｔｐ（ｆｉｌｅｔｒａｎｓｆｅｒｐｒｏｔｏｃｏｌ）サーバにおいて生成されるログの例を示す参考図である。図において、各行によってイベントが構成されている。図４はこのログの形式を示す参考図である。以下、図３のログの1行目のデータを用いてログの形式を説明する。図４には、ログが「ｃｕｒｒｅｎｔ−ｔｉｍｅ」、「ｔｒａｎｓｆｅｒ−ｔｉｍｅ」、「ｒｅｍｏｔｅ−ｈｏｓｔ」、「ｆｉｌｅ−ｓｉｚｅ」、および「ｆｉｌｅｎａｍｅ」の５つの属性データからなることが示されている。図３における「ＭｏｎＳｅｐ１０４：４８：０２２００３」は、図４における「ｃｕｒｒｅｎｔ−ｔｉｍｅ」に相当する。「ｃｕｒｒｅｎｔ−ｔｉｍｅ」の一般形式は「ＤＤＤＭＭＭｄｄｈｈ：ｍｍ：ｓｓＹＹＹＹ」のように表され、各変数の意味は図４に記載されているとおりである。 FIG. 3 is a reference diagram showing an example of a log generated by an ftp (file transfer protocol) server. In the figure, each line constitutes one event. FIG. 4 is a reference diagram showing the format of this log. The log format is described below using the data in the first line of the log in FIG. 3. FIG. 4 shows that the log consists of five attribute data: “current-time”, “transfer-time”, “remote-host”, “file-size”, and “filename”. “Mon Sep 1 04:48:02 2003” in FIG. 3 corresponds to “current-time” in FIG. 4. The general format of “current-time” is “DDD MMM dd hh:mm:ss YYYY”, and the meaning of each variable is as described in FIG. 4.

図３における「９」は、図４における「ｔｒａｎｓｆｅｒ−ｔｉｍｅ」に相当し、秒単位での転送の総合時間を表す。図３における「１９２．１６８．８０．１９」は図４における「ｒｅｍｏｔｅ−ｈｏｓｔ」に相当し、ＩＰアドレスで表されたリモートホスト名である。図３における「６１６８７」は図４における「ｆｉｌｅ−ｓｉｚｅ」に相当し、転送されたファイルのサイズをバイト単位で表したものである。図３における「／ｔｍ／ｐａｃｋａｇｅｓ／ｍｉｋｔｅｘ−ｍｅｔａｆｏｎｔ−ｍｉｓｃ．ｃａｂ」は図４における「ｆｉｌｅｎａｍｅ」に相当し、転送されたファイル名を表す。この「ｆｉｌｅｎａｍｅ」には、転送されたファイルの階層構造が示されている。 “9” in FIG. 3 corresponds to “transfer-time” in FIG. 4 and represents the total transfer time in seconds. “192.168.80.19” in FIG. 3 corresponds to “remote-host” in FIG. 4 and is the remote host name expressed as an IP address. “61687” in FIG. 3 corresponds to “file-size” in FIG. 4 and is the size of the transferred file in bytes. “/tm/packages/miktex-metafont-misc.cab” in FIG. 3 corresponds to “filename” in FIG. 4 and is the name of the transferred file. This “filename” also shows the hierarchical structure (path) of the transferred file.
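The five fields described above can be pulled out of one log line with a short Python sketch. FIG. 3 itself is not reproduced here, so the layout is an assumption: the fields are taken to be whitespace-separated and in the order listed for FIG. 4. The function name `parse_ftp_log_line` is hypothetical, and `%a`/`%b` assume an English locale.

```python
from datetime import datetime

def parse_ftp_log_line(line: str) -> dict:
    """Split one log line into the five attributes named in FIG. 4 (assumed layout)."""
    parts = line.split()
    # The first five tokens form current-time: DDD MMM dd hh:mm:ss YYYY.
    current_time = datetime.strptime(" ".join(parts[:5]), "%a %b %d %H:%M:%S %Y")
    return {
        "current-time": current_time,
        "transfer-time": int(parts[5]),  # seconds
        "remote-host": parts[6],         # IP address
        "file-size": int(parts[7]),      # bytes
        "filename": parts[8],            # path of the transferred file
    }

record = parse_ftp_log_line(
    "Mon Sep 1 04:48:02 2003 9 192.168.80.19 61687 "
    "/tm/packages/miktex-metafont-misc.cab"
)
```

A file name containing spaces would break this simple split, so a real parser would need the exact format definition from the setting information.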

時系列データの順序の基準となるｃｕｒｒｅｎｔ−ｔｉｍｅを除く以上の項目のうち、「ｔｒａｎｓｆｅｒ−ｔｉｍｅ」および「ｆｉｌｅ−ｓｉｚｅ」は連続属性に属し、「ｆｉｌｅｎａｍｅ」および「ｒｅｍｏｔｅ−ｈｏｓｔ」は階層構造属性に属する。「ｆｉｌｅｎａｍｅ」および「ｒｅｍｏｔｅ−ｈｏｓｔ」は、離散属性として扱うこともできるが、本実施形態においては、後述するように階層構造属性として扱うことにする。 Of the above items, excluding current-time, which determines the order of the time-series data, “transfer-time” and “file-size” belong to the continuous attribute, and “filename” and “remote-host” belong to the hierarchical structure attribute. “filename” and “remote-host” could also be handled as discrete attributes, but in this embodiment they are handled as hierarchical structure attributes, as described later.

次に、本実施形態によるログ要約装置の動作を説明する。図５は本実施形態におけるログの要約の様子を示す概略参考図である。本実施形態において、ログ要約装置は膨大なログの中から任意の長さＮ行のログを切り出し、予め定義された長さＭ行未満のログに要約する。なお、ＭおよびＮは正の整数であり、Ｍ≦Ｎである。 Next, the operation of the log summarizing apparatus according to the present embodiment will be described. FIG. 5 is a schematic reference diagram showing a state of log summarization in the present embodiment. In the present embodiment, the log summarization apparatus cuts out a log having an arbitrary length of N lines from a vast number of logs, and summarizes it into a log having a length less than a predetermined length of M lines. M and N are positive integers, and M ≦ N.

ログ収集部１０は、ネットワーク機器から出力されたログを収集し、制御部１５へ出力する。制御部１５はログ収集部１０によって出力されたログを記憶部１２へ格納する。ユーザによって操作部１１が操作され、ログの要約の開始が指示されると、操作部１１はユーザによる開始指示を示す信号を制御部１５へ出力する。制御部１５はこの信号に基づいて、記憶部１２からログおよび設定情報を読み出し、ログ要約部１４へ出力する。ログ要約部１４はログの要約を行い、要約後のログを制御部１５へ出力する。制御部１５はこのログを表示するための表示データを表示部１３へ出力する。表示部１３は表示データに基づいた表示を行う。ネットワーク管理者等のユーザは表示部１３の表示によって、ログの要約結果を確認することができる。 The log collection unit 10 collects logs output from the network device and outputs them to the control unit 15. The control unit 15 stores the log output by the log collection unit 10 in the storage unit 12. When the operation unit 11 is operated by the user and the start of log summarization is instructed, the operation unit 11 outputs a signal indicating the start instruction by the user to the control unit 15. Based on this signal, the control unit 15 reads the log and setting information from the storage unit 12 and outputs the log and setting information to the log summarization unit 14. The log summarization unit 14 summarizes the log and outputs the summarized log to the control unit 15. The control unit 15 outputs display data for displaying this log to the display unit 13. The display unit 13 performs display based on the display data. A user such as a network administrator can confirm the log summary result by the display on the display unit 13.

以下、ログ要約部１４によるログの要約手法の詳細について説明する。設定情報には、前述したログの形式を示す情報や、ログの切り出しの長さを示す行数Ｎ、要約処理終了の基準となるログの長さを示す行数Ｍ等が記録されている。これらの情報はユーザによる任意の変更が可能である。ログおよび設定情報は要約処理部１４１によって処理用領域１４２へ格納され、要約処理部１４１によって適宜読み出される。要約処理部１４１は設定情報に基づいて、ログを集約可能な形式に変換する。すなわち、ログに記録された情報のうち、設定情報によって規定された形式に従わない属性データを取り除くと共に、ログ中の各イベントにカウンタという属性データを付加する。このカウンタが示す値は、イベント中の全属性データの値が同一の値であるイベントの数を示すものであり、カウンタの初期値は１である。 Details of the log summarization method performed by the log summarization unit 14 will be described below. In the setting information, information indicating the log format described above, the number of lines N indicating the length of log cutting, the number of lines M indicating the length of the log as a reference for the end of the summary process, and the like are recorded. These information can be arbitrarily changed by the user. The log and the setting information are stored in the processing area 142 by the summary processing unit 141 and are appropriately read out by the summary processing unit 141. Based on the setting information, the summary processing unit 141 converts the logs into a format that can be aggregated. That is, the attribute data that does not follow the format defined by the setting information is removed from the information recorded in the log, and attribute data called a counter is added to each event in the log. The value indicated by this counter indicates the number of events in which all the attribute data values in the event are the same value, and the initial value of the counter is 1.
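The counter attribute can be illustrated with the following Python sketch. It assumes that each event is a tuple of attribute values and that, when identical events are merged, their counters are summed; the summing rule is inferred from the description of the counter rather than stated at this point. The names `attach_counter` and `merge_duplicates` are hypothetical.

```python
def attach_counter(events):
    """Give every event a counter with the initial value 1."""
    return [(tuple(event), 1) for event in events]

def merge_duplicates(counted_events):
    """Merge events whose attribute values are all equal, summing their counters."""
    totals = {}
    for attributes, count in counted_events:
        totals[attributes] = totals.get(attributes, 0) + count
    # dict preserves insertion order, so first occurrences keep their position.
    return list(totals.items())
```

With this representation, the counter of a merged line tells the reader how many original events it stands for.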

続いて、要約処理部１４１は設定情報中の、ログの切り出しの長さを示す行数Ｎを参照し、膨大なログの中からＮ行のデータ（処理用データ）を切り出す。処理用データの切り出しにおいては、ログに記録されている時間データの参照により、任意の時間範囲内でＮ行のデータを切り出すことができる。 Subsequently, the summary processing unit 141 refers to the number N of rows indicating the log cut length in the setting information, and cuts out N rows of data (processing data) from a huge log. In the extraction of the processing data, N rows of data can be extracted within an arbitrary time range by referring to the time data recorded in the log.

続いて、要約処理部１４１は集約を行う属性データの選択を行う。この選択は以下のように行われる。要約処理部１４１は各属性データについて特定の項目値ごとの出現率（出現確率）を求め、求めた出現率と不純度関数とを用いて不純度を求める。そして、不純度が最も高い属性データを選択して集約を行う。ある一つの属性（ファイルサイズ、ファイル名等）に注目した場合の、その属性データの総イベント数（処理用データの行数と同じ）を｜Ｓ｜、その属性データに出現する項目値（数値または文字列）をＣ_ｉ（ｉ＝０，１，２，・・・）、その属性データにおけるＣ_ｉの出現回数を｜Ｃ_ｉ｜とする。Ｃ_ｉの出現率ｐ_ｉは以下の［数１］で表される。 Subsequently, the summary processing unit 141 selects the attribute data to be aggregated. This selection is performed as follows. For each attribute data, the summary processing unit 141 obtains the appearance rate (appearance probability) of each item value, and obtains the impurity from these appearance rates and an impurity function. It then selects the attribute data with the highest impurity and aggregates it. Focusing on one attribute (file size, file name, etc.), let |S| be the total number of events for that attribute data (equal to the number of lines of processing data), C_i (i = 0, 1, 2, ...) the item values (numerical values or character strings) appearing in the attribute data, and |C_i| the number of occurrences of C_i in the attribute data. The appearance rate p_i of C_i is given by [Equation 1], that is, the occurrence count |C_i| divided by the total |S|.

不純度関数としては、情報量（情報エントロピー）またはｇｉｎｉ係数を用いる。不純度関数を情報量とする場合、不純度ｆは［数２］で表され、不純度関数をｇｉｎｉ係数とする場合、不純度ｆは［数３］で表される。 As the impurity function, an information amount (information entropy) or a gini coefficient is used. When the impurity function is an information amount, the impurity f is expressed by [Equation 2], and when the impurity function is a gini coefficient, the impurity f is expressed by [Equation 3].
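The equation images for [Equation 1] to [Equation 3] are not reproduced in this text. Under the definitions given above, they would take the following standard forms; the base of the logarithm is an assumption, since the text does not state it.

```latex
% [Equation 1]: appearance rate of item value C_i
p_i = \frac{|C_i|}{|S|}

% [Equation 2]: impurity as information entropy (base 2 assumed)
f = -\sum_i p_i \log_2 p_i

% [Equation 3]: impurity as the gini coefficient
f = 1 - \sum_i p_i^2
```

Both functions are zero when a single item value accounts for every event and grow as the item values become more evenly spread.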

不純度ｆは、特定の属性データについて、項目値が特定の値にどの程度集中しているのかという偏在度を示す指標となる。例えば、値が大きくばらつくほどｆの値は大きく（偏在度は小さく）、値が特定の値に集中するほどｆの値は小さく（偏在度は大きく）なる。ｆの値が小さな属性はログの分析を行う上で、重要なデータを含んでいる可能性が高い。例えば、ネットワーク上に攻撃が急激に蔓延した場合、項目値が特定の値に集中しやすくなるのでｆの値は小さくなり、そのような属性を分析することは、攻撃の分析を行う上で重要である。したがって、ｆの値が大きな属性はあまり重要でないデータを含んでいることが多く、要約処理部１４１はそのようなデータを要約することによって、ユーザにとって見易いログを生成する。 The impurity f is an index of the degree of uneven distribution, that is, how strongly the item values of a given attribute data are concentrated on specific values. The more the values are scattered, the larger f becomes (the lower the degree of uneven distribution); the more the values concentrate on specific values, the smaller f becomes (the higher the degree of uneven distribution). An attribute with a small f is likely to contain data that is important for log analysis. For example, when an attack spreads rapidly over the network, item values tend to concentrate on specific values and f becomes small, so analyzing such attributes is important for analyzing the attack. Accordingly, an attribute with a large f often contains data of little importance, and the summary processing unit 141 summarizes such data to generate a log that is easy for the user to read.

要約処理部１４１は各属性データについて不純度ｆを求め、最もｆの大きな属性データを要約対象の属性データとして選択する。選択された属性データの種類によって要約処理部１４１の動作は異なり、以下、各属性データごとに要約処理部１４１の動作を説明する。 The summarization processing unit 141 obtains the impurity f for each attribute data, and selects the attribute data having the largest f as the attribute data to be summarized. The operation of the summary processing unit 141 varies depending on the type of the selected attribute data. Hereinafter, the operation of the summary processing unit 141 will be described for each attribute data.
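The selection step can be sketched in Python as follows. This is a minimal illustration assuming that each log line is a dict mapping attribute names to item values and that entropy uses base 2; the names `entropy`, `gini`, and `select_attribute` are hypothetical and not taken from the patent.

```python
import math
from collections import Counter

def entropy(values):
    """Impurity as information entropy of the item values (base 2 assumed)."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gini(values):
    """Impurity as the gini coefficient of the item values."""
    total = len(values)
    return 1.0 - sum((n / total) ** 2 for n in Counter(values).values())

def select_attribute(rows, impurity=entropy):
    """Return the name of the attribute whose item values have the largest impurity."""
    names = rows[0].keys()
    return max(names, key=lambda name: impurity([row[name] for row in rows]))
```

Given four lines in which "host" takes the value "a" three times and "size" is different on every line, `select_attribute` picks "size", because its values are the most scattered.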

When the selected attribute data is a continuous attribute, the summary processing unit 141 stores the item values of the attribute data and their appearance frequencies in the processing area 142. It then obtains the value d_1 given by [Equation 4]. In [Equation 4], D_j is the j-th item value (j = 0, 1, 2, ...) in the attribute data, and |D_j| is its appearance rate (the appearance frequency of D_j divided by the total appearance frequency, that is, the number of log lines). |D_j| + |D_{j+1}| is the sum of the appearance frequencies of adjacent item values, and |D_{j+1} − D_j| is the distance (difference) between adjacent item values.

d_1 in [Equation 4] corresponds to a density. The larger d_1 is, the more strongly the values concentrate on particular values, and such data is important for log analysis. The summary processing unit 141 therefore aggregates intervals in ascending order of d_1. Specifically, it stores in the processing area 142, as replacement information, that the adjacent item values D_j and D_{j+1} with the smallest d_1 are to be replaced with the same value. The summary processing unit 141 repeats this processing a predetermined number of times, or alternatively until the number of distinct item values reaches a predetermined number. The predetermined number of repetitions or the predetermined number of values is recorded in advance in the setting information. By aggregating unimportant data in this way, the summary processing unit 141 lowers the impurity f of the attribute data (raises the degree of uneven distribution).
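A sketch of this interval merging is given below. [Equation 4] is not reproduced in this text, so the density is assumed here to be d_1 = (|D_j| + |D_{j+1}|) / |D_{j+1} − D_j|, which is one reading consistent with the terms described above. Representing a merged value as a (low, high, count) range is also an assumption, as is the function name.

```python
from collections import Counter

def merge_sparse_intervals(values, target_bins):
    """Merge adjacent values until at most `target_bins` (>= 1) bins remain.

    Assumed density: d1 = (rate_j + rate_j1) / gap; the pair with the
    smallest d1 is merged first. Returns a list of (low, high, count) bins.
    """
    total = len(values)
    bins = [(v, v, c) for v, c in sorted(Counter(values).items())]
    while len(bins) > target_bins:
        def density(j):
            (_, hi, c0), (lo, _, c1) = bins[j], bins[j + 1]
            return ((c0 + c1) / total) / (lo - hi)
        j = min(range(len(bins) - 1), key=density)   # sparsest adjacent pair
        low, _, c0 = bins[j]
        _, high, c1 = bins[j + 1]
        bins[j:j + 2] = [(low, high, c0 + c1)]       # replace both with one bin
    return bins

# Hypothetical port numbers observed in a log.
ports = [21, 21, 21, 22, 22, 80, 8080, 9000]
print(merge_sparse_intervals(ports, 3))
# [(21, 21, 3), (22, 22, 2), (80, 9000, 3)]
```

The dense values 21 and 22 survive unchanged, while the sparse values 80, 8080, and 9000 collapse into a single range.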

When the attribute data selected by the summary processing unit 141 is a discrete attribute, the summary processing unit 141 stores the item values of the attribute data and their appearance frequencies in the processing area 142. It then obtains the value d_2 given by [Equation 5]. In [Equation 5], E_k is the k-th item value (k = 0, 1, 2, ...) in the attribute data, and |E_k| is its appearance rate (the appearance frequency of E_k divided by the total appearance frequency). d_hum(k, k+1) is the Hamming distance between E_k and E_{k+1}, that is, the number of positions at which the characters of E_k and E_{k+1} differ. When E_k and E_{k+1} differ in length, the difference in length is, for example, added to the Hamming distance.
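The distance described here can be sketched as a small helper. The function name and the sample strings are illustrative; adding the length difference follows the handling the text suggests for strings of unequal length.

```python
def hamming_with_length(a: str, b: str) -> int:
    """Count differing characters at equal positions, plus the length difference."""
    mismatches = sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
    return mismatches + abs(len(a) - len(b))

print(hamming_with_length("GET", "PUT"))      # 2
print(hamming_with_length("LIST", "LISTEN"))  # 2
```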

Like d_1, d_2 in [Equation 5] corresponds to a density. The summary processing unit 141 aggregates intervals in ascending order of d_2. Specifically, it stores in the processing area 142, as replacement information, that the adjacent item values E_k and E_{k+1} with the smallest d_2 are to be replaced with the same value. The summary processing unit 141 repeats this processing a predetermined number of times, or until the number of distinct item values reaches a predetermined number. By aggregating unimportant data in this way, the summary processing unit 141 lowers the impurity f of the attribute data (raises the degree of uneven distribution).

When the attribute data selected by the summary processing unit 141 is a hierarchical structure attribute, the summary processing unit 141 stores the item values of the attribute data and their appearance frequencies in the processing area 142. It then assigns every item value of the attribute data to a node of a tree structure. FIG. 6 is a schematic reference diagram showing how item values of attribute data are assigned to nodes of a tree structure, taking a directory structure as an example. In the figure, 601 to 604 are nodes. The summary processing unit 141 reads each item value of the attribute data and recognizes the nodes of the tree structure from the occurrences of the character "/".

For example, when the first value read is "/usr/", the summary processing unit 141 creates node 601, associated with the leading "/". It then creates node 602, associated with "usr/", in the level below node 601 and assigns the value to node 602. When the next value read is "/home/yamada/", the summary processing unit 141 recognizes node 601 from the leading "/" and creates node 603, associated with the following "home/", in the level below node 601. It then creates node 604, associated with "yamada/", in the level below node 603 and assigns the value read to node 604. In the same way, the value "/home/miyake/" causes node 605 to be created, and that value is assigned to node 605.
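The node assignment walked through above can be sketched with a nested dictionary. The data layout (a `children` map and a `values` list per node) and the function name are assumptions made for illustration; the patent does not specify a data structure.

```python
def build_tree(paths):
    """Build a tree keyed by path components; each node records the values assigned to it."""
    root = {"children": {}, "values": []}        # corresponds to the leading "/"
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            if not part:                         # the path was just "/"
                continue
            node = node["children"].setdefault(
                part + "/", {"children": {}, "values": []})
        node["values"].append(path)              # assign the value to the deepest node
    return root

tree = build_tree(["/usr/", "/home/yamada/", "/home/miyake/"])
print(sorted(tree["children"]))                        # ['home/', 'usr/']
print(sorted(tree["children"]["home/"]["children"]))   # ['miyake/', 'yamada/']
```

The second and third paths share the `home/` node, just as nodes 604 and 605 both sit under node 603 in the description.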

FIGS. 7 and 8 are schematic reference diagrams showing how items of a hierarchical structure attribute are aggregated. FIG. 7(a) shows the tree structure before aggregation. The circles attached to nodes of the tree indicate the item values assigned to those nodes, and the size of each circle schematically indicates the number of appearances of that item value. Nodes 701 to 715 are assumed to have been assigned to the levels of the hierarchy in advance. The number of appearances of the item value at each node is as shown on the left side of FIG. 9. For example, the item value assigned to node 708 ("/usr/src/a/") appears once.

The summary processing unit 141 assigns each item value to a node of the tree structure as described above and calculates Th, given by [Equation 6] below. In [Equation 6], S is the sum of the numbers of appearances of the item values belonging to the nodes (the total number of items) and equals the number of log lines. In the example of FIGS. 7 to 9, S is 14, the sum of the numbers of appearances shown in FIG. 9. O is the number of item values to output, which is 4 in this example. S and O are recorded in advance in the setting information in the storage unit 12. In this example, the item values of the 14 appearing items are aggregated into 4 item values (the 14 items are assigned to 4 nodes).

The summary processing unit 141 calculates the above Th. Since S = 14 and O = 4, Th = 3.5. Th represents the average number of items per node after aggregation. The summary processing unit 141 performs the following processing on the item values in ascending order of their number of appearances.
(1) If, when item values are aggregated to the node one level up, the number of appearances of item values at that node would be Th or less, aggregation is judged to be possible.
(2) If (1) shows that aggregation to the node one level up is possible, the item values belonging to the levels below that node are aggregated to that node.
(3) If (1) shows that aggregation to the node one level up is not possible, (1) is considered again for the node one further level up.
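[Equation 6] is not reproduced in this text. The stated values (S = 14, O = 4, Th = 3.5) and the description of Th as the average number of items per node after aggregation are consistent with the following form, which is given here as an assumption rather than as the patent's exact equation.

```latex
Th = \frac{S}{O} = \frac{14}{4} = 3.5
```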

The summary processing unit 141 repeats (1) to (3) above, aggregating item values until the number of item values to output becomes O. If no node that can accept the assignment is found and the search reaches the top node, the item value is assigned to the top node. A concrete example follows. In the state of FIG. 7(a), the summary processing unit 141 selects the item value belonging to node 708, which is the item value with the smallest number of appearances. Following (1) above, it attempts the assignment to node 704 (see FIG. 7(b)). If the item value belonging to node 708 were assigned to node 704, the number of items belonging to node 704 would be 1, which is Th or less. If the item value belonging to node 709, which lies in the level below node 704, were also assigned to node 704, the number of items belonging to node 704 would total 3, which is still Th or less. Therefore, by (2) above, the item values belonging to nodes 708 and 709 are assigned to node 704 (see FIG. 7(c)).

Next, the summary processing unit 141 selects the item value belonging to node 711 and, following (1) above, attempts the assignment to node 705 (see FIG. 7(d)). If the item value belonging to node 711 were assigned to node 711 and the item value belonging to node 710 were also assigned to node 711, the number of items belonging to node 711 would total 4, which exceeds Th, so the assignment proceeds according to (3) above. The summary processing unit 141 attempts to assign the item value belonging to node 711 to node 702, one level further up (see FIG. 7(e)). If, following (2) above, the item values belonging to nodes 704 and 711 in the levels below node 702 were also assigned to node 702, the total number of items would exceed Th, so the summary processing unit 141 assigns the item value to the top node 701 (FIG. 7(f)).

Next, the summary processing unit 141 selects the item value belonging to node 714 and, following (1) above, attempts the assignment to node 707 (see FIG. 8(a)). If all item values belonging to the levels at and below node 707 were assigned to node 707, the number of items would be Th or less, so the item values belonging to nodes 714 and 715 are assigned to node 707 (see FIG. 8(b)). Next, the summary processing unit 141 selects the item value belonging to node 713 and attempts the assignment to node 705 (see FIG. 8(c)). If, following (2) above, the item value belonging to node 710 in the level below node 705 were also assigned to node 705, the total number of items would exceed Th, so the summary processing unit 141 attempts the assignment to node 703 (see FIG. 8(d)). In this case too, by (3) above, the summary processing unit 141 assigns the item value to node 701 at the top level (see FIG. 8(e)).

Next, the summary processing unit 141 selects the item value belonging to node 707 (2 appearances) as the item value with the fewest appearances and attempts the assignment to node 703 (see FIG. 8(f)). By (3) above, the summary processing unit 141 assigns this item value to node 701 at the top level (see FIG. 8(g)). At this point the number of item values has reached the predetermined number, 4, so the summary processing unit 141 ends the aggregation of items. The item values after aggregation and their numbers of items are as shown on the right side of FIG. 9. Based on this aggregation, the summary processing unit 141 stores in the processing area 142, as replacement information, which item value each original item value is to be replaced with. By aggregating unimportant item values (item values with few appearances) to the same node in this way, the summary processing unit 141 lowers the impurity f of the attribute data (raises the degree of uneven distribution).
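The roll-up procedure can be approximated by the sketch below. It is a simplified reading of rules (1) to (3), not the patent's implementation: item values are kept as "/"-terminated path strings rather than as an explicit tree, Th is assumed to be S / O, and when the search reaches the top node only the selected item value is moved there. The counts in the example are made up and are not those of FIG. 9.

```python
def parent(path):
    """Parent of a '/'-terminated path, e.g. '/usr/src/' -> '/usr/'."""
    return path.rstrip("/").rsplit("/", 1)[0] + "/"

def aggregate(counts, out_values):
    """Roll low-count paths up the hierarchy until at most `out_values` remain."""
    counts = dict(counts)
    th = sum(counts.values()) / out_values            # assumed: Th = S / O
    while len(counts) > out_values:
        # Select the non-root item value with the fewest appearances.
        item = min((p for p in counts if p != "/"), key=lambda p: counts[p])
        target = parent(item)
        while True:
            subtree = [p for p in counts if p.startswith(target)]
            if sum(counts[p] for p in subtree) <= th:  # rule (1)
                moved = subtree                        # rule (2): roll up the subtree
                break
            if target == "/":
                moved = [item]                         # reached the top node
                break
            target = parent(target)                    # rule (3): one level higher
        total = sum(counts.pop(p) for p in moved)
        counts[target] = counts.get(target, 0) + total
    return counts

# Hypothetical appearance counts, S = 14.
counts = {"/usr/src/a/": 1, "/usr/src/b/": 2, "/home/x/": 5, "/home/y/": 4, "/var/": 2}
print(aggregate(counts, 4))
# {'/home/x/': 5, '/home/y/': 4, '/var/': 2, '/usr/src/': 3}
```

With O = 4 the threshold is 3.5, so "/usr/src/a/" and "/usr/src/b/" (3 appearances in total) are merged into "/usr/src/", mirroring the step in which nodes 708 and 709 are assigned to node 704.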

Following the item aggregation for each type of attribute data described above, the summary processing unit 141 replaces the item values in the log. That is, the summary processing unit 141 reads the replacement information from the processing area 142 and, based on it, replaces the item values of items that became the same item value through aggregation with the same value. Taking the hierarchical structure attribute above as an example, the item values "/usr/src/a/" (the item value of node 708) and "/usr/src/b" (the item value of node 709) are replaced with "/usr/src/" (the item value of node 704).

Next, the summary processing unit 141 counts the duplicate lines (lines in which the item values of all attribute data match) produced by the replacement, stores the count in a counter, and deletes the duplicate lines, leaving only one. Here, "current-time" plays no part in counting duplicate lines. That is, the summary processing unit 141 judges whether lines are duplicates based on the attribute data other than "current-time". Alternatively, the summary processing unit 141 replaces "current-time" with the same value for all events before counting the duplicate lines.
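Duplicate-line removal with a counter can be sketched as follows, assuming each log line is a dictionary and that the timestamp attribute is named "current-time" as in the text. The function name and sample rows are hypothetical.

```python
from collections import Counter

def collapse_duplicates(rows, ignore=("current-time",)):
    """Return (row, count) pairs; rows are equal when all non-ignored attributes match."""
    counter = Counter()
    first_seen = {}
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        counter[key] += 1
        first_seen.setdefault(key, row)      # keep only the first line of each group
    return [(first_seen[k], counter[k]) for k in first_seen]

rows = [
    {"current-time": "10:00:01", "dir": "/usr/src/", "cmd": "RETR"},
    {"current-time": "10:00:05", "dir": "/usr/src/", "cmd": "RETR"},
    {"current-time": "10:00:09", "dir": "/home/", "cmd": "STOR"},
]
result = collapse_duplicates(rows)
print([count for _, count in result])  # [2, 1]
```

The first two lines differ only in "current-time", so they collapse into one line with a count of 2.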

The summary processing unit 141 reads the setting information from the processing area 142 and refers to it. If, after the duplicate lines have been deleted, the number of remaining log lines is less than M, the summary processing ends. If it is M or more, the summary processing unit 141 again selects attribute data based on the impurity f and performs the item aggregation described above on that attribute data.

Item aggregation for hierarchical structure attributes has been described using a directory structure as an example, but the same applies to IP addresses. FIG. 10 is a schematic reference diagram showing how an IP address is assigned to a tree structure. As shown in the figure, the summary processing unit 141 converts the period-separated numeric values of the IP address into a 32-bit binary number. It then inserts a delimiter between the bits and assigns the delimited 1s and 0s to the hierarchy, starting from the left as the upper levels. For example, for the binary number "1100", the value of each bit is assigned to nodes 1102 to 1105, which lie below the top node 1101, as in FIG. 10. The summary processing unit 141 then aggregates items on the tree structure built in this way.
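The conversion of an IPv4 address into a per-bit path can be sketched as below. The function names are hypothetical; the second helper is added only to show that addresses sharing a long bit prefix share the upper levels of the tree.

```python
def ip_to_bit_path(address):
    """Convert a dotted IPv4 address into a list of 32 bits, most significant first."""
    octets = [int(part) for part in address.split(".")]
    bits = "".join(f"{octet:08b}" for octet in octets)   # 32-bit binary string
    return list(bits)                                     # one tree level per bit

def common_prefix_length(a, b):
    """Number of leading tree levels shared by two addresses."""
    shared = 0
    for x, y in zip(ip_to_bit_path(a), ip_to_bit_path(b)):
        if x != y:
            break
        shared += 1
    return shared

path = ip_to_bit_path("192.168.1.10")
print("".join(path[:8]))                                      # 11000000
print(common_prefix_length("192.168.1.10", "192.168.1.200"))  # 24
```

Aggregating to a node at depth 24 in this tree would correspond to replacing both addresses with their common 24-bit prefix.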

FIG. 11 is a flowchart showing the operation of the log summarization device described above. When the user operates the operation unit 11 and instructs the start of log summarization, the operation unit 11 outputs a signal indicating the user's start instruction to the control unit 15. Based on this signal, the control unit 15 reads the log and the setting information from the storage unit 12 and outputs them to the log summarization unit 14. The log and the setting information are thus input to the log summarization unit 14 (step S1101). Next, the summary processing unit 141 in the log summarization unit 14 converts the log, based on the setting information, into a format that can be aggregated (step S1102).

Next, the summary processing unit 141 refers to the number of lines N in the setting information, which indicates the length of log to cut out, and cuts N lines of processing data out of the log (step S1103). The summary processing unit 141 uses the impurity function to obtain the impurity f for each attribute data and selects the attribute data with the largest f as the attribute data to be summarized (step S1104). The summary processing unit 141 determines whether the selected attribute data is a continuous attribute, a discrete attribute, or a hierarchical structure attribute (step S1105).

When the selected attribute data is a continuous attribute, the summary processing unit 141 aggregates intervals in ascending order of d_1 given by [Equation 4] (step S1106). When the selected attribute data is a discrete attribute, the summary processing unit 141 aggregates intervals in ascending order of d_2 given by [Equation 5] (step S1107). When the selected attribute data is a hierarchical structure attribute, the summary processing unit 141 assigns each item to a node of the tree structure and aggregates item values in ascending order of their number of appearances (step S1108).

Next, the summary processing unit 141 replaces the item values of items that became identical through aggregation with the same value (step S1109). The summary processing unit 141 counts the duplicate lines produced by the replacement, stores the count in a counter, and deletes the duplicate lines, leaving only one (step S1110). The summary processing unit 141 then reads the setting information from the processing area 142, refers to it, and determines whether the number of remaining log lines is less than M (step S1111). If the number of remaining log lines is less than M, the processing ends. If it is M or more, the processing returns to step S1104. The above processing is repeated until the number of log lines falls below M.
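The control flow of steps S1104 to S1111 can be sketched as a loop. This is only a skeleton under several assumptions: entropy is used as the impurity function, the type-specific aggregation of steps S1105 to S1108 is delegated to a caller-supplied `generalize` callback that returns a replacement mapping, the duplicate counter is omitted for brevity, and a `max_rounds` guard is added in case the callback makes no progress. The callback `rare_to_star` is a stand-in technique (replace values seen once with "*"), not one of the patent's aggregation methods.

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in Counter(values).values())

def summarize(rows, max_lines, generalize, max_rounds=100):
    """Repeat select -> aggregate -> replace -> dedupe until fewer than max_lines rows remain."""
    rows = [dict(r) for r in rows]
    for _ in range(max_rounds):                    # guard against a non-converging callback
        if len(rows) < max_lines:                  # S1111: stop below M lines
            break
        attr = max(rows[0], key=lambda a: entropy([r[a] for r in rows]))  # S1104
        mapping = generalize(attr, [r[attr] for r in rows])               # S1105-S1108
        for r in rows:                                                    # S1109
            r[attr] = mapping.get(r[attr], r[attr])
        unique = {}                                                       # S1110
        for r in rows:
            unique.setdefault(tuple(sorted(r.items())), r)
        rows = list(unique.values())
    return rows

def rare_to_star(attr, values):
    """Hypothetical stand-in: map every value that appears once to '*'."""
    return {v: "*" for v, c in Counter(values).items() if c == 1}

log = [
    {"src": "10.0.0.1", "cmd": "RETR"},
    {"src": "10.0.0.2", "cmd": "RETR"},
    {"src": "10.0.0.3", "cmd": "RETR"},
    {"src": "10.0.0.3", "cmd": "STOR"},
]
print(summarize(log, 4, rare_to_star))
# [{'src': '*', 'cmd': 'RETR'}, {'src': '10.0.0.3', 'cmd': 'RETR'}, {'src': '10.0.0.3', 'cmd': 'STOR'}]
```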

When a large log is summarized, the log may first be divided into several parts and summarization may be applied repeatedly to the divided logs. For example, suppose that in FIG. 12 the log 1000 is to be summarized into the log 20. The summary processing unit 141 divides the log 1000 into a plurality of logs 500, 510, 520, 530, ... of a predetermined length. It then summarizes each log by the summarization method described above, generating logs 400, 410, 420, 430, .... The summary processing unit 141 concatenates the summarized logs in groups of a predetermined number, generating logs 300 to 330. It then summarizes each concatenated log, generating logs 200 to 230. The summary processing unit 141 concatenates these again to generate the log 100, and summarizes this log to generate the log 20, which has fewer than the predetermined number of lines.
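The divide, summarize, and concatenate cycle can be sketched as follows. The function name, the `chunk_len` and `group_size` parameters, and the order-preserving deduplication used as the `summarize` stand-in are all assumptions for illustration; in practice `summarize` would be the impurity-based procedure described above.

```python
def staged_summarize(rows, chunk_len, group_size, summarize):
    """Split the log, summarize each piece, then concatenate and re-summarize in groups.

    `group_size` must be at least 2 so that the number of pieces shrinks each round.
    """
    pieces = [rows[i:i + chunk_len] for i in range(0, len(rows), chunk_len)]
    pieces = [summarize(piece) for piece in pieces]
    while len(pieces) > 1:
        merged = []
        for i in range(0, len(pieces), group_size):
            joined = [row for piece in pieces[i:i + group_size] for row in piece]
            merged.append(summarize(joined))
        pieces = merged
    return pieces[0] if pieces else []

def dedupe(rows):
    """Stand-in summarizer: drop repeated rows, keeping first occurrences."""
    return list(dict.fromkeys(rows))

events = ["a", "b", "a", "c", "b", "a", "c", "c"]
print(staged_summarize(events, 2, 2, dedupe))  # ['a', 'b', 'c']
```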

As described above, according to the present embodiment, the enormous logs output by an IDS, firewall, router, Web server, ftp server, various syslogs, peer-to-peer systems, and the like can be summarized into fewer than (or no more than) a predetermined number of lines. This makes it possible to format an enormous log so that it is easy to read and to present the useful information in the log to a network administrator or other user. In addition, according to the present embodiment, summarization can be performed with the size (number of lines) of the output log specified. This allows flexible summarization: when an overview of the network is wanted, the output log can be made small, and when more detail about the network is wanted, the output log can be made somewhat larger.

Among the attributes output in logs, useful summaries can be obtained in particular by assigning IP addresses, directory structures, directory paths, URLs (Uniform Resource Locators), URIs (Uniform Resource Identifiers), mail addresses, XPath expressions, and the like to a tree structure. In a hierarchical structure, each level carries meaningful information, so the level to which items are aggregated also carries meaningful information. For example, for the directory structure "/home/a/", the aggregation candidates "/", "/home/", and "/home/a" are all meaningful information. With the summarization method for hierarchical attributes according to the present embodiment, a hierarchical structure can be aggregated into meaningful information at a finer granularity. Many communication logs contain hierarchical structure attributes such as IP addresses, URIs, and XPath expressions, and the present embodiment allows these to be aggregated at a finer granularity.

In addition, by repeating log summarization in stages, processing can be performed faster than when a log of fewer than the predetermined number of lines is generated in a single summarization. Furthermore, if the results of the summarization are saved, only the new log needs to be summarized when a new log is output.

Embodiments of the present invention have been described in detail above with reference to the drawings, but the specific configuration is not limited to these embodiments and includes design changes and the like within a scope that does not depart from the gist of the invention. For example, the log summarization device of the embodiment described above may be realized by recording a program that implements its operation and functions on a computer-readable recording medium and having a computer read and execute the program recorded on the recording medium.

Here, the "computer" also includes a homepage providing environment (or display environment) if a WWW system is used. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer. The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The log summarization program described above may also be transmitted from a computer that stores the program in a storage device or the like to another computer via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The log summarization program described above may also implement only part of the functions described above. It may further be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded on the computer.

FIG. 1 is a block diagram showing the configuration of a log summarization device according to an embodiment of the present invention.
FIG. 2 is a reference diagram for explaining the log format in the embodiment.
FIG. 3 is a reference diagram showing an example of a log generated by an ftp server.
FIG. 4 is a reference diagram showing the format of a log generated by an ftp server.
FIG. 5 is a schematic reference diagram showing how the log summarization device summarizes a log.
FIG. 6 is a schematic reference diagram showing how item values of a hierarchical structure attribute are assigned to a tree structure.
FIG. 7 is a schematic reference diagram showing item aggregation for a hierarchical structure attribute.
FIG. 8 is a schematic reference diagram showing item aggregation for a hierarchical structure attribute.
FIG. 9 is a schematic reference diagram for explaining item aggregation for a hierarchical structure attribute.
FIG. 10 is a schematic reference diagram showing how an IP address is assigned to a tree structure.
FIG. 11 is a flowchart showing the operation of the log summarization device.
FIG. 12 is a schematic reference diagram for explaining another method, in which a large log is divided.

Explanation of symbols

10: log collection unit, 11: operation unit, 12: storage unit, 13: display unit, 14: log summarization unit, 15: control unit, 141: summary processing unit, 142: processing area.

Claims

A log summarizing apparatus that summarizes logs output by network devices, comprising:
calculation means for calculating a value related to the degree of uneven distribution of the items appearing in attribute data recorded in the log, based on the appearance frequency of each item data appearing in the attribute data;
selection means for selecting attribute data to be aggregated based on the calculated value related to the degree of uneven distribution;
replacement means for replacing a plurality of item data with the same numerical value or character string based on the appearance frequency of the item data appearing in the selected attribute data; and
aggregating means for aggregating duplicate lines generated by the replacement of item data by the replacement means into a single line.
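
The four means recited above (calculation, selection, replacement, aggregation) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the Gini impurity used as the measure, the rule of merging the two rarest items, the `*` placeholder, and the names `gini_impurity` and `summarize` are all assumptions made for illustration. The stopping condition (repeat until the number of lines falls below a predetermined number) follows the abstract. For simplicity, frequencies are counted over de-duplicated lines.

```python
from collections import Counter


def gini_impurity(values):
    """Assumed impurity measure: 0.0 when a column holds a single item."""
    counts = Counter(values)
    total = len(values)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def summarize(rows, max_rows, placeholder="*"):
    """Repeat select / replace / aggregate until fewer than max_rows lines remain."""
    rows = list(dict.fromkeys(tuple(r) for r in rows))
    while rows and len(rows) >= max_rows:
        columns = list(zip(*rows))
        # Calculation and selection: pick the attribute with the highest impurity.
        target = max(range(len(columns)), key=lambda i: gini_impurity(columns[i]))
        counts = Counter(columns[target])
        if len(counts) < 2:
            break  # every attribute is already uniform, nothing left to merge
        # Replacement: map the two rarest items to one placeholder (assumed rule).
        rare = {item for item, _ in counts.most_common()[-2:]}
        rows = [
            r[:target] + (placeholder,) + r[target + 1:] if r[target] in rare else r
            for r in rows
        ]
        # Aggregation: collapse duplicate lines, keeping first-seen order.
        rows = list(dict.fromkeys(rows))
    return rows
```

Each pass removes one distinct item from the selected attribute, so the loop always terminates. Lowering impurity in this sketch corresponds to raising the degree of uneven distribution referred to in the claims.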

The log summarizing apparatus according to claim 1, wherein
the attribute data is any one of continuous attribute data, discrete attribute data, and hierarchical structure attribute data, and
the replacement means replaces a plurality of item data with the same numerical value or character string, based on the appearance frequency of the item data appearing in the selected attribute data, so that the degree of uneven distribution of the attribute data increases.

The log summarizing apparatus according to claim 2, wherein, when the attribute data is the continuous attribute data, the replacement means replaces a plurality of item data with the same numerical value or character string, based on the differences between the plurality of item data appearing in the attribute data and on the appearance frequency of the item data, so that the degree of uneven distribution of the attribute data increases.
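
One way to read the continuous-attribute case is sketched below in Python. The claim only says that the replacement depends on the differences between item data and on their appearance frequency; the specific cost used here (numeric gap multiplied by the number of affected lines), the range label such as `1024-1025`, and the name `merge_closest_pair` are assumptions for illustration.

```python
from collections import Counter


def merge_closest_pair(values):
    """Replace the two closest numeric items with one shared range label."""
    counts = Counter(values)
    distinct = sorted(counts)
    if len(distinct) < 2:
        return list(values)

    def cost(pair):
        # Assumed cost: gap between neighbours, weighted by how many lines change.
        lo, hi = pair
        return (hi - lo) * (counts[lo] + counts[hi])

    lo, hi = min(zip(distinct, distinct[1:]), key=cost)
    label = f"{lo}-{hi}"
    return [label if v in (lo, hi) else v for v in values]
```

The sketch performs a single merge. Because merged items become string labels, repeated merging would need a representation that keeps range bounds numeric.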

The log summarizing apparatus according to claim 2, wherein, when the attribute data is the discrete attribute data, the replacement means replaces a plurality of item data with the same numerical value or character string, based on the Hamming distance between the plurality of item data appearing in the attribute data and on the appearance frequency of the item data, so that the degree of uneven distribution of the attribute data increases.
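
For the discrete-attribute case, the following Python sketch merges the pair of items with the smallest Hamming distance. It assumes equal-length strings, uses the combined appearance frequency only as a tie-break, and masks differing positions with `*`; these choices and the function names are illustrative assumptions, not details taken from the claims.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def merge_nearest_items(values):
    """Replace the two nearest items with one masked string."""
    counts = Counter(values)
    if len(counts) < 2:
        return list(values)
    # Assumed rule: smallest distance first, then the least frequent pair.
    a, b = min(
        combinations(counts, 2),
        key=lambda p: (hamming(p[0], p[1]), counts[p[0]] + counts[p[1]]),
    )
    merged = "".join(x if x == y else "*" for x, y in zip(a, b))
    return [merged if v in (a, b) else v for v in values]
```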

The log summarizing apparatus according to claim 2, wherein, when the attribute data is the hierarchical structure attribute data, the replacement means assigns the item data appearing in the attribute data to nodes of a tree structure and replaces item data with a low appearance frequency with the same numerical value or character string so that the degree of uneven distribution of the attribute data increases.
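
For the hierarchical case, a minimal Python sketch using IPv4 addresses is given below. It assumes an octet-level tree in which the parent of `192.168.1.5` is written `192.168.1.*`, and a fixed frequency threshold `min_count`; both are illustrative assumptions, as is the name `aggregate_rare_addresses`.

```python
from collections import Counter


def aggregate_rare_addresses(addresses, min_count):
    """Replace addresses seen fewer than min_count times with their parent node."""
    counts = Counter(addresses)

    def parent(addr):
        # Parent node in the assumed octet-level tree: drop the last octet.
        return addr.rsplit(".", 1)[0] + ".*"

    return [a if counts[a] >= min_count else parent(a) for a in addresses]
```

Rare addresses that share a parent collapse to the same string, so the lines that contained them can then be aggregated as duplicates.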

The log summarizing apparatus according to claim 1, further comprising:
dividing means for dividing the log into a plurality of logs; and
concatenation means for concatenating the logs after each of the divided logs has been summarized.
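
The divide-and-concatenate arrangement can be sketched in Python as follows. Fixed-size chunks and the parameter names `chunk_size` and `summarize` are assumptions; `summarize` stands for any per-log summarization routine supplied by the caller.

```python
def summarize_in_chunks(rows, chunk_size, summarize):
    """Split a log into chunks, summarize each chunk, then concatenate the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Dividing: consecutive fixed-size slices of the log (assumed split rule).
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    merged = []
    for chunk in chunks:
        # Concatenation: append each summarized chunk in the original order.
        merged.extend(summarize(chunk))
    return merged
```

In this sketch, lines that would have been merged across a chunk boundary can survive as separate lines, since each chunk is summarized independently.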

A log summarization program for causing a computer to execute processing for summarizing logs output by network devices, the program causing the computer to execute:
a step of calculating a value related to the degree of uneven distribution of the items appearing in attribute data recorded in the log, based on the appearance frequency of each item data appearing in the attribute data;
a step of selecting attribute data to be aggregated based on the calculated value related to the degree of uneven distribution;
a step of replacing a plurality of item data with the same numerical value or character string based on the appearance frequency of the item data appearing in the selected attribute data; and
a step of aggregating duplicate lines generated by the replacement of item data into a single line.

The log summarization program according to claim 7, wherein
the attribute data is any one of continuous attribute data, discrete attribute data, and hierarchical structure attribute data, and
in the step of replacing the plurality of item data with the same numerical value or character string, the plurality of item data are replaced, based on the appearance frequency of the item data appearing in the selected attribute data, so that the degree of uneven distribution of the attribute data increases.

The log summarization program according to claim 8, wherein, when the attribute data is the continuous attribute data, a plurality of item data are replaced with the same numerical value or character string, based on the differences between the plurality of item data appearing in the attribute data and on the appearance frequency of the item data, so that the degree of uneven distribution of the attribute data increases.

The log summarization program according to claim 8, wherein, when the attribute data is the discrete attribute data, a plurality of item data are replaced with the same numerical value or character string, based on the Hamming distance between the plurality of item data appearing in the attribute data and on the appearance frequency of the item data, so that the degree of uneven distribution of the attribute data increases.

The log summarization program according to claim 8, wherein, when the attribute data is the hierarchical structure attribute data, the item data appearing in the attribute data are assigned to nodes of a tree structure, and item data with a low appearance frequency are replaced with the same numerical value or character string so that the degree of uneven distribution of the attribute data increases.

The log summarization program according to any one of claims 7 to 11, further causing the computer to execute:
a step of dividing the log into a plurality of logs before the step of calculating the value related to the degree of uneven distribution; and
a step of concatenating the logs obtained by summarizing the individual divided logs, after the step of aggregating the duplicate lines.

A computer-readable recording medium on which the log summarization program according to any one of claims 7 to 12 is recorded.