JP6249505B1 - Feature extraction apparatus and program

Info

Publication number: JP6249505B1
Application number: JP2016251309A
Authority: JP
Inventors: 勝人伊佐野
Original assignee: Mitsubishi Electric Information Systems Corp
Current assignee: Mitsubishi Electric Information Systems Corp
Priority date: 2016-12-26
Filing date: 2016-12-26
Publication date: 2017-12-20
Anticipated expiration: 2036-12-26
Also published as: JP2018106383A

Abstract

Problem: To provide a feature extraction apparatus and program that appropriately and automatically extract features from data including a plurality of variables. Solution: A feature extraction apparatus 10 extracts features of records each including N columns (where N is an integer of 2 or more). The feature extraction apparatus 10 has a function of acquiring a plurality of records; a function of generating a column set for every combination of M columns out of the N columns (where M is an integer satisfying 2 ≤ M ≤ N) (column set generation function); a function of determining, for each record, a feature number for each column set (where the feature number represents the frequency, within that column set, of the value obtained by combining the record's data values of all the columns included in the column set); and a function of outputting the feature numbers in association with each record. Selected drawing: FIG. 3

Description

The present invention relates to a feature extraction apparatus that extracts features of data.

Techniques for classifying data are known. For example, large amounts of data are classified into normal data and abnormal data. Examples of such methods are described in Patent Document 1 and Patent Document 2.

Patent Document 1: JP 2012-138044 A
Patent Document 2: Japanese Patent No. 5973636

However, the conventional techniques have the problem that, when data including a plurality of variables is classified, the features that serve as the classification criteria must be determined manually.

For example, when a large amount of data is handled, examining all of it is impractical in terms of labor, so a shortcut is needed in which part of the data is extracted and the criteria are determined from that part, and the appropriateness of this selection becomes an issue. Furthermore, when one set of data contains the values of many variables, the number of variable combinations whose correlations should be examined as potential features becomes enormous, and it is difficult to consider all combinations exhaustively.

The present invention has been made to solve these problems, and its object is to provide a feature extraction apparatus and program that appropriately and automatically extract features from data including a plurality of variables.

To solve the problems described above, a feature extraction apparatus according to the present invention extracts features of records each including N columns (where N is an integer of 2 or more), and comprises:
a function of acquiring a plurality of the records;
a column set generation function of generating a column set for every combination of M columns out of the N columns (where M is an integer satisfying 2 ≤ M ≤ N);
a function of determining, for each of the records, a feature number for each of the column sets, the feature number representing the frequency, within that column set, of the value obtained by combining the record's data values of all the columns included in the column set; and
a function of outputting the feature numbers in association with each of the records.
According to a specific aspect, the column set generation function is executed, for a predetermined upper-limit column count L (where L is an integer satisfying 2 ≤ L ≤ N), at least for every M satisfying 2 ≤ M ≤ L.
According to a specific aspect, the apparatus further comprises:
a function of converting the data value of each column of each record into a numerical value;
a function of normalizing, for each column, the numerical values of the records; and
a function of normalizing, for each column set, the feature numbers of the records.
According to a specific aspect, the feature extraction apparatus further comprises an abnormality determination function of determining, for each column set, whether each record is an abnormal record, based on the feature number of each record and using a predetermined abnormality determination rule.
According to a specific aspect, the feature extraction apparatus further comprises an abnormality determination function of determining, for a determination target pair consisting of a total of two columns or column sets selected from among all the columns and column sets, whether each record is an abnormal record, based on the numerical value or feature number of each record and using a predetermined abnormality determination rule.
According to a specific aspect, the abnormality determination function is not executed:
when the determination target pair consists of two column sets;
when the determination target pair consists of a column and a column set, and the column set does not include the column;
when the determination target pair includes a column whose data value is the same for all the records; or
when the determination target pair includes a column set whose feature number is the same for all the records.
According to a specific aspect, the abnormality determination function is executed using the same abnormality determination rule regardless of which columns and column sets constitute the determination target pair.
According to a specific aspect, the feature extraction apparatus further comprises an image generation function of generating an image that shows the position of each record in an orthogonal coordinate system whose axes are a total of two columns or column sets selected from among all the columns and column sets, and
the feature extraction apparatus executes the image generation function for every combination of two items taken from the columns and column sets.
A program according to the present invention causes a computer to function as the feature extraction apparatus described above.

With the feature extraction apparatus and program according to the present invention, feature numbers are determined exhaustively for the combinations of columns of the records, so the features of the records can be extracted appropriately and automatically.

FIG. 1 is a diagram showing an example of the positioning of the feature extraction apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the configuration of the feature extraction apparatus of FIG. 1.
FIG. 3 is a diagram showing an example of the structure of data handled by the feature extraction apparatus of FIG. 1.
FIG. 4 is a diagram showing another example of the structure of data handled by the feature extraction apparatus of FIG. 1.
FIG. 5 is a flowchart showing an example of the flow of feature extraction processing of the feature extraction apparatus of FIG. 1.
FIG. 6 is a flowchart showing an example of the flow of abnormality determination processing of the feature extraction apparatus according to Embodiment 2.

Embodiments of the present invention are described below with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 shows an example of the positioning of the feature extraction apparatus 10 according to Embodiment 1 of the present invention. The feature extraction apparatus 10 extracts features from data and outputs the extracted features in association with the original data. The output data and features can be used by another apparatus (for example, the abnormality detection apparatus 20).

FIG. 2 shows an example of the configuration of the feature extraction apparatus 10. The feature extraction apparatus 10 is configured as a known computer and includes calculation means 11 that performs calculations and storage means 12 that stores information. Although not specifically illustrated, the feature extraction apparatus 10 also includes input means that accepts user operations, output means that outputs information, and communication means that inputs and outputs information to and from an external communication network.

The storage means 12 stores the input data and the features that the feature extraction apparatus 10 extracts from the data. The storage means 12 also stores a program (not shown). When the calculation means 11 of the computer executes this program, the computer functions as the feature extraction apparatus 10. In other words, this program causes a computer to function as the feature extraction apparatus 10 described in this specification.

FIG. 3 shows an example of the structure of data handled by the feature extraction apparatus of FIG. 1. This example represents the communication records of a particular computer, but the present invention is applicable to data in general. The examples of FIG. 3 and FIG. 4 show the data before the normalization processing described later is applied.

A feature number is associated with each of the records R1 to R5 that make up the data. Each record includes a plurality of columns and has a value (data value) of the variable corresponding to each column. FIG. 3 shows, as examples of columns, a column representing "time" and a column representing "IP address". In this specification, the number of columns included in one record may be denoted by N (where N is an integer of 2 or more). In the example of FIG. 3, N = 2. The feature extraction apparatus 10 extracts the features of each of the records R1 to R5.

A feature number is determined for each combination of a record and a column set. A column set is a set consisting of one or more columns; that is, a column set is a set whose elements are columns. In the example of FIG. 3, the column sets 30 that are generated are the column set {time} consisting only of "time", the column set {IP address} consisting only of "IP address", and the column set {time, IP address} consisting of "time" and "IP address". As this example shows, this specification uses the term "column set" even for a set consisting of a single column, and marks column sets with braces {} to distinguish them from columns. For each column set, n{} denotes the feature number described below.

In this specification, the number of columns constituting a column set may be denoted by M (where M is an integer and M ≤ N). When there are N columns in total, at most $\binom{N}{M}$ column sets consisting of M columns can be defined. In the example of FIG. 3, the number of columns is N = 2, so at most two column sets consisting of a single column (M = 1) and at most one column set consisting of two columns (M = 2) can be defined. In other words, in the example of FIG. 3, every column set that can be formed from the given columns is defined.
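
The enumeration of column sets can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `generate_column_sets` and the `max_size` parameter (playing the role of an upper-limit column count) are assumptions introduced here, and single-column sets (M = 1) are included as in the FIG. 3 example.

```python
from itertools import combinations
from math import comb


def generate_column_sets(columns, max_size=None):
    """Return every column set of size 1..max_size as a tuple of column names.

    max_size is a hypothetical parameter; when omitted, all sizes up to
    len(columns) are generated.
    """
    limit = len(columns) if max_size is None else max_size
    column_sets = []
    for m in range(1, limit + 1):
        # combinations() yields each M-column subset exactly once.
        column_sets.extend(combinations(columns, m))
    return column_sets


two_columns = generate_column_sets(["time", "ip_address"])
print(two_columns)

# The count for each M equals the binomial coefficient C(N, M).
three_columns = generate_column_sets(["A", "B", "C"])
print(len(three_columns), sum(comb(3, m) for m in range(1, 4)))
```

With two columns this yields two single-column sets and one two-column set, matching the counts stated for FIG. 3.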

The feature number represents the frequency, within a column set, of the value obtained by combining the record's data values of all the columns included in that column set. For example, for the column set {time}, the data value "2016-03-01 18:20:53" appears in two records (records R1 and R2), so the feature number n{time} of the records having the data value "2016-03-01 18:20:53" is 2. Likewise, for the column set {time, IP address} consisting of "time" and "IP address", the combination of "2016-03-01 18:20:56" and "91.205.189.15" appears in two records (R4 and R5), so the feature number n{time, IP address} of those records is 2. Because a frequency can be defined for every column set in this way, a feature number can also be determined for every column set.
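
The frequency counting described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `feature_numbers` and the dictionary-based record layout are hypothetical, and the sample records are invented except for the repeated values quoted in the text (the remaining contents of FIG. 3 are not reproduced here).

```python
from collections import Counter
from itertools import combinations


def feature_numbers(records, columns):
    """For each record, map each column set to the frequency of its value combination."""
    column_sets = [cs for m in range(1, len(columns) + 1)
                   for cs in combinations(columns, m)]
    # Count how often each combined value occurs within each column set.
    counts = {cs: Counter(tuple(r[c] for c in cs) for r in records)
              for cs in column_sets}
    # Look up, for every record, the frequency of its own combined value.
    return [{cs: counts[cs][tuple(r[c] for c in cs)] for cs in column_sets}
            for r in records]


# Hypothetical records; only the repeated values mirror the text.
records = [
    {"time": "2016-03-01 18:20:53", "ip": "192.0.2.1"},
    {"time": "2016-03-01 18:20:53", "ip": "192.0.2.2"},
    {"time": "2016-03-01 18:20:55", "ip": "192.0.2.3"},
    {"time": "2016-03-01 18:20:56", "ip": "91.205.189.15"},
    {"time": "2016-03-01 18:20:56", "ip": "91.205.189.15"},
]
features = feature_numbers(records, ["time", "ip"])
print(features[0][("time",)])       # n{time} of the first record
print(features[3][("time", "ip")])  # n{time, IP address} of the fourth record
```

Counting each column set once and then looking up each record keeps the work proportional to the number of records times the number of column sets.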

FIG. 4 shows another example of the data structure. In this example, the number of columns is N = 3. Column sets are defined for every integer M satisfying 1 ≤ M ≤ N = 3, giving seven column sets in total. The feature number may be an integer equal to the frequency, as shown in FIG. 3 and FIG. 4, but it may be any numerical value that represents the frequency (for example, a normalized value as described later).

The operation of the feature extraction apparatus 10 configured as described above is explained below.
FIG. 5 is a flowchart showing an example of the flow of feature extraction processing of the feature extraction apparatus 10. First, the feature extraction apparatus 10 acquires a plurality of records representing the data (step S1). The records may be acquired, for example, by input processing over a network, or by reading data stored in advance in the storage means 12.

Next, the feature extraction apparatus 10 converts the records into numerical values (step S2). Any conversion method may be used. For example, the frequency of each data value is calculated for each column, and numerical values (for example, non-negative integers or positive integers) are assigned in ascending order of frequency. Alternatively, numerical values may be assigned in alphabetical (dictionary) order of the data values. Alternatively, a time may be converted into a number of seconds, and an IP address may be treated as a 4-byte numerical value.
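
The conversion options mentioned above can be sketched as follows. All three helper names are hypothetical. Two points are assumptions, since the text does not fix them: times are converted to seconds since midnight, and ties in frequency are broken by the value itself.

```python
import ipaddress
from collections import Counter
from datetime import datetime


def ip_to_int(address):
    """Interpret a dotted IPv4 address as a 4-byte unsigned integer."""
    return int(ipaddress.IPv4Address(address))


def time_to_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp to seconds since midnight (an assumed reference point)."""
    t = datetime.strptime(text, fmt)
    return t.hour * 3600 + t.minute * 60 + t.second


def frequency_rank(values):
    """Assign 0, 1, 2, ... to distinct values in ascending order of frequency."""
    counts = Counter(values)
    # Assumption: ties are broken by the value so the mapping is deterministic.
    ordered = sorted(counts, key=lambda v: (counts[v], v))
    codes = {v: i for i, v in enumerate(ordered)}
    return [codes[v] for v in values]


print(ip_to_int("91.205.189.15"))
print(time_to_seconds("2016-03-01 18:20:53"))
print(frequency_rank(["a", "b", "b", "c", "c", "c"]))
```

Which conversion suits a column depends on how the extracted features are used downstream, so these helpers are alternatives rather than a fixed pipeline.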

This processing can be omitted for data that is numerical to begin with (for example, columns A and C in FIG. 4). Depending on how the extracted features are used (for example, the specifications of the abnormality detection apparatus 20), step S2 may be omitted entirely. Even when the data values are not numerical, the feature numbers can still be determined by the method described above.

Next, the feature extraction apparatus 10 generates the column sets (step S3, column set generation function). Examples of the generated column sets are shown as the column sets 30 in FIG. 3 and FIG. 4. Step S3 may be executed before step S1, or in parallel with steps S1 and S2.

Next, the feature extraction apparatus 10 determines, for each record, the feature number for each column set (step S4). That is, for each record, it calculates the frequency, within the column set, of the value obtained by combining the record's data values of all the columns included in that column set. FIG. 3 and FIG. 4 can be regarded as examples of the state at the end of step S4.
The processing of step S2 may instead be executed after the processing of step S4.

Next, the feature extraction apparatus 10 normalizes, for each column and column set, the numerical values (data values) and feature numbers of the records (step S5). Any normalization method may be used. For example, a linear mapping that converts the minimum data value in the column to 0 and the maximum data value in the column to 1 can be used. In that case, for a given column or column set, the data value or feature number $x_i$ (where $i$ is an index identifying the record) is normalized to $(x_i - x_{\min}) / (x_{\max} - x_{\min})$, where $x_{\max}$ and $x_{\min}$ are the maximum and minimum of the data values or feature numbers in that column or column set. For column "C" in FIG. 4, the data values of records R11, R12, R13, and R14 are 1, 2, 3, and 4, respectively, and this linear mapping normalizes them to 0, 1/3, 2/3, and 1. Depending on how the extracted features are used (for example, the specifications of the abnormality detection apparatus 20), step S5 may be omitted. Normalization also need not be performed for all columns and column sets; it may be performed for only some of them.
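
The linear mapping can be sketched as follows, using exact fractions so that the column "C" example can be checked directly. The function name `min_max_normalize` is hypothetical, integer inputs are assumed, and mapping a constant column to 0 is an assumption because the text does not address the case where the maximum equals the minimum.

```python
from fractions import Fraction


def min_max_normalize(values):
    """Map the minimum to 0 and the maximum to 1 with a linear function."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to 0; the source does not cover this case.
        return [Fraction(0) for _ in values]
    return [Fraction(v - lo, hi - lo) for v in values]


# Data values of column "C" for records R11 to R14, as given in the text.
print(min_max_normalize([1, 2, 3, 4]))
```

The same function applies unchanged to the feature numbers of a column set, since both cases use the same formula.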

Next, the feature extraction apparatus 10 outputs the feature numbers in association with each of the records (step S6). In this way, the information shown in FIG. 3 or FIG. 4 (further normalized, as the case may be) is output.

The output records and feature numbers can be used as appropriate, for example in abnormality detection processing by the abnormality detection apparatus 20. For example, the abnormality detection apparatus 20 receives a plurality of records and the feature numbers associated with each record as input and, based on these, determines whether each record is abnormal data.

If, for example, the communication records of a particular computer (such as those shown in FIG. 3) are used as the records, abnormal communication records can be detected, so damage caused by malware and the like can be suppressed.

As described above, the feature extraction apparatus 10 according to Embodiment 1 of the present invention can appropriately and automatically extract features from data including columns that represent a plurality of variables, so the features that serve as criteria no longer need to be determined manually.

For example, if only the records are input to the abnormality detection processing of the abnormality detection apparatus 20, the correlation between columns may not be properly taken into account, which limits the accuracy of abnormality determination. In contrast, the feature extraction apparatus 10 outputs, for each record, a feature number for each column set, so abnormality determination can take into account the correlation among the columns that each column set represents, which improves the accuracy of abnormality determination.

In addition, even when a large amount of data is handled, feature numbers can easily be output for all of the data, so there is no need to select only part of the data. Furthermore, even when a record includes many columns, feature numbers can be output exhaustively for all the column combinations whose correlations should be examined as potential features.

Note that the records must be defined using a plurality of columns. For example, character string data in which each record consists of only a single column (such as a proxy server log) is not suitable.

In particular, when it is desired to detect an ATP attack, for example, the real-time nature of the abnormality detection processing is important. In that case, rather than processing at once a large number of records accumulated over a long period, it is preferable to divide the data in units of seconds or minutes and limit processing to a number of records that can be handled appropriately.
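
Dividing the records into short time windows can be sketched as follows. The function name `split_by_window`, the `window_seconds` parameter, and the sample timestamps are assumptions for illustration; the text only states that the data is divided in units of seconds or minutes.

```python
from collections import defaultdict
from datetime import datetime


def split_by_window(records, window_seconds=60, key="time", fmt="%Y-%m-%d %H:%M:%S"):
    """Group records into consecutive fixed-length time windows."""
    buckets = defaultdict(list)
    for record in records:
        ts = datetime.strptime(record[key], fmt)
        # Seconds from a fixed reference, so window boundaries are stable across calls.
        offset = int((ts - datetime(1970, 1, 1)).total_seconds())
        buckets[offset // window_seconds].append(record)
    # Return the batches in chronological order.
    return [buckets[k] for k in sorted(buckets)]


# Hypothetical timestamps for illustration.
batches = split_by_window([
    {"time": "2016-03-01 18:20:53"},
    {"time": "2016-03-01 18:20:56"},
    {"time": "2016-03-01 18:21:02"},
])
print([len(batch) for batch in batches])
```

Each batch can then be passed through feature extraction on its own, which bounds the number of records handled per run.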

An upper limit may also be placed on the number of columns according to the processing capability of the computer that constitutes the feature extraction apparatus 10.

Embodiment 2.
In Embodiment 2, the feature extraction apparatus also serves as the abnormality detection apparatus. That is, in addition to the function of extracting the features of records, as in the feature extraction apparatus 10 of Embodiment 1, the feature extraction apparatus according to Embodiment 2 has the function of determining whether each record is an abnormal record based on the records and the extracted features, as in the abnormality detection apparatus 20 of Embodiment 1.

In particular, the feature extraction apparatus according to Embodiment 2 selects two axes and performs abnormality determination based on two-dimensional data. That is, it selects a total of two columns or column sets from among the columns and column sets that make up the acquired data (hereinafter, the pair consisting of the two columns or column sets specified here is called a "determination target pair") and determines whether each record is an abnormal record based on that record's data values (numerical values) for the determination target pair. In other words, the feature extraction apparatus detects abnormal vectors from among a plurality of two-dimensional vectors. All records input in Embodiment 2 are assumed to have been converted into numerical values.

FIG. 6 is a flowchart showing an example of the flow of abnormality determination processing of the feature extraction apparatus according to Embodiment 2. The processing of FIG. 6 is started, for example, after the processing of FIG. 5 has finished.

First, the feature extraction apparatus acquires a plurality of records and the feature numbers associated with each record (step S11). The records and feature numbers are assumed to have been produced by the processing of FIG. 5. When the processing of FIG. 6 is performed immediately after the processing of FIG. 5, steps S6 and S11 can be omitted.

Next, the feature extraction apparatus generates determination target pairs (step S12). The determination target pairs are, for example, all pairs consisting of a total of two columns or column sets. In the example of FIG. 4, the number of columns is N = 3 and the number of column sets is 7, so there are 10 columns and column sets in total, and $\binom{10}{2} = 45$ determination target pairs are generated.
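
The pair count for the FIG. 4 example can be checked with a short sketch. The tagging of each axis as `"column"` or `"column_set"` is an assumed representation, introduced so that column A and the single-column set {A} remain distinct items.

```python
from itertools import combinations

columns = ["A", "B", "C"]
column_sets = [frozenset(cs) for m in range(1, 4) for cs in combinations(columns, m)]

# Tag each axis so that column "A" and column set {A} are different items.
axes = [("column", c) for c in columns] + [("column_set", cs) for cs in column_sets]

# Every unordered pair of two distinct axes is a candidate determination target pair.
pairs = list(combinations(axes, 2))
print(len(axes), len(pairs))  # prints "10 45"
```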

Next, the feature extraction apparatus performs abnormality determination for each determination target pair (step S13, abnormality determination function). Any abnormality determination method may be used, as long as it determines whether each record is an abnormal record based on the numerical value of each column or the feature number of each column set of each record, using a predetermined abnormality determination rule. For example, the method described in Patent Document 2 may be used.

In Embodiment 2, the feature extraction apparatus uses the same abnormality determination rule in step S13 regardless of which columns and column sets constitute the determination target pair. For example, the determination rule described in Patent Document 2 is used both for the determination target pair consisting of column A and column set {A, B} and for the determination target pair consisting of column C and column set {A, C}. Because abnormality determination rules then do not need to be set individually, the labor of devising rules can be reduced.

Next, the feature extraction apparatus outputs the determination results (step S14). For example, a record determined to be an abnormal record for any one of the determination target pairs is output in association with information indicating that it is an abnormal record. In addition, a record determined to be a normal record for every determination target pair may be output in association with information indicating that it is a normal record (or is not an abnormal record).
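
The combination of per-pair results in step S14 can be sketched as follows. The function name `aggregate_results` and the input layout (one list of boolean flags per determination target pair) are assumptions; the per-pair determination rule itself is outside this sketch.

```python
def aggregate_results(flags_per_pair, record_count):
    """Mark a record abnormal if any determination target pair flagged it.

    flags_per_pair: dict mapping a pair label to a list of booleans, one per record.
    """
    return [any(flags[i] for flags in flags_per_pair.values())
            for i in range(record_count)]


# Hypothetical per-pair results for three records.
flags_per_pair = {
    "A vs {A, B}": [False, True, False],
    "C vs {A, C}": [False, False, False],
}
print(aggregate_results(flags_per_pair, 3))
```

A record is reported as normal only when no pair flags it, which matches the "for every determination target pair" condition in the text.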

As described above, the feature extraction apparatus according to Embodiment 2 can make effective use of the large number of feature numbers determined according to Embodiment 1 and perform more appropriate abnormality determination.

The following modifications can be made to Embodiment 2.
In Embodiment 2, the abnormality determination function is executed for all determination target pairs, but it may be omitted for some of them. In particular, omitting abnormality determination for determination target pairs that are unlikely to detect abnormalities properly can be expected to improve processing speed and is therefore efficient.

For example, the apparatus may be configured not to execute the abnormality determination function when the determination target pair consists of two column sets. That is, the abnormality determination function is executed only when at least one member of the determination target pair is a column. In the example of FIG. 4, abnormality determination is omitted for the determination target pair consisting of column set {A} and column set {B}.

As another example, the apparatus may be configured not to execute the abnormality determination function when the determination target pair consists of a column and a column set and the column set does not include that column. In the example of FIG. 4, consider the determination target pair consisting of column A and column set {B, C}: column A is not included in column set {B, C}, so abnormality determination is omitted for this pair.

As another example, the apparatus may be configured not to execute the abnormality determination function when the determination target pair includes a column whose data value is the same for all records. Similarly, it may be configured not to execute the abnormality determination function when the determination target pair includes a column set whose number of features is the same for all records. In the example of FIG. 4, column set {A}, column set {B, C}, and column set {A, B, C} each have the same number of features for all records, so abnormality determination is omitted for any determination target pair that includes one of these column sets.

The following two methods are conceivable for omitting these abnormality determinations. In the first method, no determination target pair is generated for a combination for which abnormality determination is not to be executed. That is, such combinations are excluded from step S12.
In the second method, the determination target pairs are generated in step S12, and the combinations in question are excluded from the determination process in the abnormality determination of step S13.
With either method, excluding these combinations from abnormality determination can be expected to improve processing efficiency.
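The skip conditions above can be sketched as follows. This is only an illustrative sketch, not the apparatus's actual implementation: the representation of an axis as a `(kind, columns)` tuple, the function names `should_skip` and `generate_target_pairs`, and the sample values are all assumptions introduced here, and the sample values are not the data of FIG. 4. The sketch follows the first method, in which skipped pairs are never generated.

```python
from itertools import combinations


def is_constant(values):
    """Return True when every record has the same value on this axis."""
    return len(set(values)) <= 1


def should_skip(a, b, values):
    """Apply the three optional skip conditions to one candidate pair.

    Each axis is a hypothetical tuple (kind, columns): kind is "column"
    or "set", and columns is a frozenset of column names. `values` maps
    an axis to its per-record data values (for a column) or numbers of
    features (for a column set).
    """
    kinds = {a[0], b[0]}
    if kinds == {"set"}:
        return True  # the pair consists of two column sets
    if kinds == {"column", "set"}:
        col, cset = (a, b) if a[0] == "column" else (b, a)
        if not col[1] <= cset[1]:
            return True  # the column set does not include the column
    # an axis that is identical for all records carries no information
    return is_constant(values[a]) or is_constant(values[b])


def generate_target_pairs(values):
    """First method: skipped combinations are never generated (step S12)."""
    return [(a, b) for a, b in combinations(values, 2)
            if not should_skip(a, b, values)]


# Hypothetical sample data for four records (not taken from FIG. 4).
COL_A = ("column", frozenset({"A"}))
COL_B = ("column", frozenset({"B"}))
COL_C = ("column", frozenset({"C"}))
SET_AB = ("set", frozenset({"A", "B"}))
SET_BC = ("set", frozenset({"B", "C"}))
values = {
    COL_A: [1, 1, 1, 2],
    COL_B: [5, 5, 5, 5],   # constant, so every pair using it is skipped
    COL_C: [0, 1, 0, 1],
    SET_AB: [3, 3, 3, 1],
    SET_BC: [2, 2, 1, 1],
}
pairs = generate_target_pairs(values)
```

The second method would instead generate every pair and call `should_skip` at the start of the determination loop of step S13; the set of pairs actually evaluated is the same either way.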

In the second embodiment, the feature extraction apparatus may include a function of generating an image showing the relationship among the records (image generation function). For example, for each determination target pair, it generates an image showing the position of each record in an orthogonal coordinate system whose axes are the columns or column sets constituting the pair. That is, one two-dimensional graph image is generated for each determination target pair, and in each image one symbol per record (a figure, a point, or the like) is plotted at the position corresponding to that record. The feature extraction apparatus may also output the generated images.

Furthermore, a mark indicating an abnormal record may be displayed for each abnormal record in the image. For example, a red circle may be drawn centered on the point representing an abnormal record. This allows a user of the feature extraction apparatus to grasp the status of abnormal records easily.

Note that the image generation function may be executed for all determination target pairs, or, as with the abnormality determination function, may be omitted for some determination target pairs.

In the second embodiment, the same abnormality determination rule is used regardless of the columns and column sets constituting the determination target pair (different abnormality determination rules may be used depending on conditions other than the determination target pair). For this reason, it is preferable to execute step S5 and normalize the data values and numbers of features beforehand. As a modification, different abnormality determination rules may be used depending on the column or column set. In that case, step S5 may be omitted.

In the second embodiment the determination target pair is two-dimensional, but when an abnormality determination algorithm for vectors of three or more dimensions is used, a determination target set consisting of a total of three or more columns or column sets may be generated and used.

The abnormality determination function can also be executed without generating determination target pairs. For example, for each column set, the feature extraction apparatus may determine whether each record is an abnormal record by applying a predetermined abnormality determination rule to the number of features of each record. As the abnormality determination rule in this case, a rule may be used under which records with an exceptionally large number of features and records with an exceptionally small number of features are determined to be abnormal records. In the example of column set {A} in FIG. 4, the numbers of features are all 5, so no exceptionally large or small number of features appears. However, if only one of the numbers of features of column set {A} were a large value of 14.01 and the others were small values in the range 0 to 0.03, the record having the value 14.01 may be determined to be an abnormal record. Conversely, if only one of the numbers of features of column set {A} were a small value of 0.02 and the others were large values in the range 14.00 to 23.00, the record having the value 0.02 may be determined to be an abnormal record. A person skilled in the art can design an algorithm that realizes such a determination method as appropriate. For example, the standard score (a T-score with mean 50 and standard deviation 10) of each number of features in the column set may be calculated, and when the absolute value of the standard score minus 50 exceeds a predetermined threshold, the record having that number of features may be determined to be an abnormal record.
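The standard-score rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the use of the population standard deviation, the threshold of 20, and the sample values are all hypothetical choices made here; the text only requires that the threshold be predetermined.

```python
from statistics import mean, pstdev


def standard_scores(features):
    """T-scores (mean 50, standard deviation 10) of one column set's values."""
    mu = mean(features)
    sigma = pstdev(features)  # population standard deviation (an assumption)
    if sigma == 0:
        # every record has the same number of features: nothing stands out
        return [50.0] * len(features)
    return [50.0 + 10.0 * (x - mu) / sigma for x in features]


def abnormal_indices(features, threshold=20.0):
    """Indices of records whose |T-score - 50| exceeds the threshold."""
    return [i for i, t in enumerate(standard_scores(features))
            if abs(t - 50.0) > threshold]


# Hypothetical values shaped like the two cases in the text.
one_large = [0.01, 0.02, 0.0, 0.03, 0.01, 0.02, 0.0, 0.03, 0.01, 14.01]
one_small = [14.0, 23.0, 18.0, 20.0, 16.0, 22.0, 15.0, 19.0, 21.0, 0.02]
flagged_large = abnormal_indices(one_large)
flagged_small = abnormal_indices(one_small)
```

With these sample values, only the last record is flagged in each list. Because the rule is symmetric around 50, the same function covers both the exceptionally large and the exceptionally small case.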

Such a determination method enables abnormality determination that focuses on the dependency among the variables of the columns belonging to a column set. When the variables are generally independent of one another (there is no dependency among them) but a specific pattern alone appears in a dependent manner, a record containing that pattern is likely to be an abnormal record; with this determination method, the number of features of records containing such a pattern becomes large, so they are appropriately determined to be abnormal records. Conversely, when the variables are generally dependent on one another but appear independently in only a small number of records, a record containing that pattern is likely to be an abnormal record; with this determination method, the number of features of records containing such a pattern becomes small, so they are appropriately determined to be abnormal records.

Furthermore, the following modifications can be made in the first and second embodiments.
If necessary, the data may be preprocessed. In the case of time-series data, the records are sorted in advance, before preprocessing, by a fine-grained column (for example, a column representing date and time). For example, fractions may be rounded, and the rounded values may be input to the feature extraction apparatus as the records to be processed. Alternatively, differences between original records (the original data) may be obtained and input to the feature extraction apparatus as the records to be processed. Alternatively, a moving average may be calculated for each column of the original records and input to the feature extraction apparatus as the records to be processed. Differences of the per-column moving averages of the original records may also be used. A change-point emphasis function, such as change-point scoring based on two-stage learning, may be calculated from the original records (see, for example, Kenji Yamanishi, "Anomaly Detection by Data Mining" (データマイニングによる異常検知), Kyoritsu Shuppan, 2009), and its values may be input to the feature extraction apparatus as the records to be processed. Here, "input to the feature extraction apparatus as the records to be processed" covers both adding a column to the original records and replacing a column of the original records. Combinations of these may also be used.
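Some of the preprocessing options above (sorting, rounding, differences, moving averages) can be sketched as follows. The function names, the record layout as dictionaries, the `time` key, and the window length are assumptions made for illustration; the source does not specify them, and the change-point scoring option is not covered here.

```python
def sort_by_time(records, key="time"):
    """Sort time-series records by a fine-grained column before preprocessing."""
    return sorted(records, key=lambda r: r[key])


def rounded(values, ndigits=0):
    """Round fractions; the number of digits kept is a free choice."""
    return [round(v, ndigits) for v in values]


def differences(values):
    """Difference between each record and the previous one (one item shorter)."""
    return [b - a for a, b in zip(values, values[1:])]


def moving_average(values, window):
    """Simple moving average over `window` consecutive records."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical records with one numeric column "x".
records = [{"time": 3, "x": 9}, {"time": 1, "x": 1},
           {"time": 4, "x": 16}, {"time": 2, "x": 4}]
ordered = sort_by_time(records)
xs = [r["x"] for r in ordered]                  # [1, 4, 9, 16]
dx = differences(xs)                            # [3, 5, 7]
ma = moving_average(xs, window=2)               # [2.5, 6.5, 12.5]
dma = differences(ma)                           # [4.0, 6.0]
```

Differences and moving averages yield fewer values than there are records, so how the results are aligned with the original records when a column is added or replaced is a further design choice that this sketch leaves open.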

Column sets whose number of features is the same for all records may be deleted. For example, in the example of FIG. 4, column set {A}, column set {B, C}, and column set {A, B, C} may be deleted. The deletion may be performed, for example, immediately after step S5.

In the examples of FIGS. 3 and 4, for the number of columns N, column sets consisting of M columns are generated for every integer M satisfying 1 ≦ M ≦ N. However, it is not always necessary to generate column sets for every integer M. It suffices that, for at least one integer M satisfying 2 ≦ M ≦ N, a column set is generated for every combination of M columns. For example, in the example of FIG. 4, column sets may be generated only for M = 2 (in that case, only three column sets are generated: {A, B}, {A, C}, and {B, C}).

Alternatively, for a predetermined upper limit number of columns L (where L is an integer satisfying 2 ≦ L ≦ N), column sets may be generated for every M satisfying 2 ≦ M ≦ L (in that case, column sets may additionally be generated for M = 1). This makes it possible to generate column sets and extract features efficiently while limiting the amount of computation. Note that L may be stored in advance in the feature extraction apparatus, or the apparatus may accept input of an arbitrary value.
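Column set generation with an upper limit L, together with the frequency-based number of features, can be sketched as follows. The function names, the parameters `max_size` and `include_single`, and the sample records are hypothetical; the number of features is taken here as the raw count of records sharing the same combined value, before any normalization, which is one reading of the definition.

```python
from collections import Counter
from itertools import combinations


def column_sets(columns, max_size=None, include_single=False):
    """Generate every combination of M columns for 2 <= M <= L.

    `max_size` plays the role of the upper limit L (default: N, the
    number of columns); `include_single` additionally generates M = 1.
    """
    n = len(columns)
    limit = n if max_size is None else max_size
    if not 2 <= limit <= n:
        raise ValueError("L must satisfy 2 <= L <= N")
    start = 1 if include_single else 2
    return [combo for m in range(start, limit + 1)
            for combo in combinations(columns, m)]


def numbers_of_features(records, combo):
    """For each record, count the records sharing its combined value."""
    keys = [tuple(r[c] for c in combo) for r in records]
    freq = Counter(keys)
    return [freq[k] for k in keys]


# Hypothetical records (not the data of FIG. 4).
records = [{"A": 1, "B": "X"}, {"A": 1, "B": "X"},
           {"A": 1, "B": "Y"}, {"A": 2, "B": "Y"}]
only_pairs = column_sets(["A", "B", "C"], max_size=2)
everything = column_sets(["A", "B", "C"], include_single=True)
ab_features = numbers_of_features(records, ("A", "B"))
```

For three columns, `max_size=2` yields the three column sets {A, B}, {A, C}, and {B, C}, while `include_single=True` with no limit yields all seven. Without a limit the number of column sets grows as 2^N − 1, which is why capping M at L limits the amount of computation.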

In addition, after the numbers of features have been determined or calculated and before abnormality determination is performed, the numbers of features may be updated according to specific conditions. For example, when the variables corresponding to a plurality of columns are known to be mutually independent, the number of features of a column set including those columns may be replaced with the product of the probabilities of the individual columns instead of the frequency. For example, in the example of FIG. 4, when column A and column B are known to be mutually independent, the number of features of column set {A, B} may be set to the product of the occurrence probability of the data value in column A and the occurrence probability of the data value in column B.

Taking record R11 of FIG. 4 as an example, the occurrence probability of the data value "1" in column A is 0.5 (that is, the value appears in only 5 of the 10 records in total), and the occurrence probability of the data value "X" in column B is 0.4 (that is, the value appears in only 4 of the 10 records in total). Therefore, when column A and column B are known to be mutually independent, the number of features of column set {A, B} for record R11 may be set to 0.5 × 0.4 = 0.2. In this way, even when variables that are inherently independent of one another happen to be skewed by chance, the chance skew can be ignored and appropriate abnormality determination can be performed.
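The replacement of the frequency by a product of per-column probabilities can be sketched as follows. The function name and the ten sample records are hypothetical; the records are constructed only so that "1" appears in 5 of 10 records in column A and "X" in 4 of 10 in column B, matching the probabilities quoted above, and they are not the actual contents of FIG. 4.

```python
from collections import Counter


def independent_features(records, combo):
    """Product of per-column occurrence probabilities for each record.

    Intended for column sets whose columns are known to be mutually
    independent, as a replacement for the frequency-based value.
    """
    n = len(records)
    counts = {c: Counter(r[c] for r in records) for c in combo}
    result = []
    for r in records:
        p = 1.0
        for c in combo:
            p *= counts[c][r[c]] / n  # occurrence probability in column c
        result.append(p)
    return result


# Hypothetical 10-record table: A == 1 in 5 records, B == "X" in 4 records.
a_values = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
b_values = ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Y"]
records = [{"A": a, "B": b} for a, b in zip(a_values, b_values)]
features = independent_features(records, ("A", "B"))
```

For the first record (A = 1, B = "X"), the result is 0.5 × 0.4 = 0.2, regardless of how often the combination (1, "X") actually occurs together in the table.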

In the first and second embodiments, the range of the normalized data values and numbers of features is [0, 1], but they may be normalized to other ranges. For example, the range may be [-1, 1], [-0.5, 0.5], or [0, 100].
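Normalization to an arbitrary target range can be sketched as follows, assuming min-max scaling. The exact normalization used in step S5 is not restated in this passage, so the scaling method, the function name `rescale`, and the handling of constant inputs are all assumptions.

```python
def rescale(values, low=0.0, high=1.0):
    """Min-max scale `values` linearly onto the range [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Constant input has no spread; mapping it to `low` is an
        # arbitrary choice made for this sketch.
        return [low] * len(values)
    span = vmax - vmin
    return [low + (high - low) * (v - vmin) / span for v in values]


unit = rescale([2, 4, 6])                  # range [0, 1]
signed = rescale([2, 4, 6], -1.0, 1.0)     # range [-1, 1]
half = rescale([2, 4, 6], -0.5, 0.5)       # range [-0.5, 0.5]
percent = rescale([2, 4, 6], 0.0, 100.0)   # range [0, 100]
```

Because the mapping is linear, changing the target range does not change the relative positions of the records, so a rule that is expressed relative to the range behaves the same under any of these choices.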

10: feature extraction apparatus, 30: column set, S3: column set generation function, S13: abnormality determination function.

Claims

A feature extraction apparatus that extracts features of records each including N columns (where N is an integer of 2 or more), comprising:
a function of acquiring a plurality of the records;
a column set generation function of generating a column set for every combination of M columns (where M is an integer satisfying 2 ≦ M ≦ N) out of the N columns;
a function of determining, for each of the records, a number of features for each of the column sets, wherein the number of features represents the frequency, within that column set, of the value obtained by combining the data values of all the columns of the record that are included in the column set; and
a function of outputting each of the records in association with its numbers of features.

The apparatus according to claim 1, wherein the column set generation function is executed at least for every M satisfying 2 ≦ M ≦ L, for a predetermined upper limit number of columns L (where L is an integer satisfying 2 ≦ L ≦ N).

The apparatus according to claim 1, further comprising:
a function of converting the data value of each column of each record into a numerical value;
a function of normalizing, for each column, the numerical value of each record; and
a function of normalizing, for each column set, the number of features of each record.

The apparatus according to any one of claims 1 to 3, further comprising an abnormality determination function of determining, for each column set, whether each record is an abnormal record, using a predetermined abnormality determination rule based on the number of features of each record.

The apparatus according to any one of claims 1 to 4, further comprising an abnormality determination function of determining, for a determination target pair consisting of a total of two columns or column sets selected from all columns and column sets, whether each record is an abnormal record, using a predetermined abnormality determination rule based on the numerical value or number of features of each record.

The apparatus according to claim 5, wherein the abnormality determination function:
is not executed when the determination target pair consists of two column sets;
is not executed when the determination target pair consists of a column and a column set and the column set does not include the column;
is not executed when the determination target pair includes a column and the data value of the column is the same for all records; and
is not executed when the determination target pair includes a column set and the number of features of the column set is the same for all records.

The apparatus according to claim 5 or 6, wherein the abnormality determination function is executed using the same abnormality determination rule regardless of a column and a column set constituting the determination target pair.

The apparatus according to claim 1, wherein
the feature extraction apparatus further comprises an image generation function of generating an image indicating the position of each record in an orthogonal coordinate system whose axes are a total of two columns or column sets selected from all columns and column sets, and
the feature extraction apparatus executes the image generation function for all combinations of a total of two columns or column sets.

A program that causes a computer to function as the feature extraction device according to any one of claims 1 to 8.