CN104376078A - Abnormal data detection method based on knowledge entropy

Info

Publication number: CN104376078A
Application number: CN201410650726.XA
Authority: CN
Inventors: 刘峰; 刘钦; 杨瑞; 吕传耀
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2014-11-14
Filing date: 2014-11-14
Publication date: 2015-02-25

Abstract

An abnormal data detection method based on knowledge entropy is characterized by comprising the following steps of (1) the attribute analysis stage of a sample set and (2) the data sample detection stage of the sample set; at the attribute analysis stage of the sample set, the data sample set U generated by an application program and an attribute set A corresponding to the data sample set U are collected; normalization preprocessing is performed on attribute values in the data sample set U; clustering processing is performed on the data sample set U based on the attribute set A, and the knowledge entropy of A is calculated; the importance of all attributes is calculated, and the sequence of the attribute set is constructed according to the importance of all the attributes; the stage is over; at the data sample detection stage of the sample set, the abnormal factors of all data samples are calculated; an abnormal data set is output according to the abnormal factors; the stage is over. According to the detection method, the nondeterminacy of clustering is avoided while the clustering effect is utilized, and therefore the detection accuracy of abnormal data can be effectively guaranteed.

Description

An abnormal data detection method based on knowledge entropy

Technical field

The present invention relates to abnormal data detection, in particular to methods for discovering abnormal information in the massive data sets generated by computer information systems, and more specifically to an abnormal data detection method based on clustering and knowledge entropy.

Background art

Abnormal data detection, also called outlier detection or anomaly mining, deals with anomalies whose common causes are: data drawn from a different class (such as fraud or intrusion), natural variation in the data (such as gene mutation or a customer's new purchasing pattern), and errors in data measurement or collection. Because outliers can reveal distinctive new information, abnormal data detection is widely used in intrusion detection, fraud detection, public health, analysis of customer purchasing behavior on e-commerce platforms, and other fields.

The main approaches to abnormal data detection are as follows. (1) Statistics-based techniques: a data model is built first, and anomalies are the objects that do not fit the model well; if the model is a set of clusters, an anomaly is an object that does not clearly belong to any cluster; when a regression model is used, an anomaly is an object relatively far from its predicted value. (2) Proximity-based techniques: a proximity measure is defined between objects, and anomalous objects are those far from the other objects. (3) Density-based techniques: a point is classified as an anomaly only if its local density is significantly lower than that of most of its neighbors. (4) Clustering-based techniques: small clusters far from the other clusters are treated as anomalies.

The main difficulties of abnormal data detection lie in handling samples with non-numeric attributes, evaluating the dimension information of high-dimensional data, and detecting anomalies that span more than one dimension. Statistics-based techniques have difficulty with high-dimensional data; proximity-based techniques cannot handle data sets with regions of different density; density-based techniques are hard to tune; clustering-based techniques cannot easily guarantee the quality of the clusters produced, which strongly affects the quality of the detected outliers.

To improve abnormal data detection by exploiting the effect of clustering while avoiding its uncertainty, the present invention proposes an abnormal data detection method based on knowledge entropy that can effectively ensure the accuracy of abnormal data detection.

Summary of the invention

Object of the invention: the invention provides a method for detecting abnormal data in the massive data sample sets collected by applications. The method first computes the importance of each attribute in the data sample set based on knowledge entropy, then computes the outlier factor of each data sample, and finally outputs the abnormal data set.

Technical solution: the abnormal data detection method based on knowledge entropy comprises the following steps:

1) Attribute analysis stage of the data sample set:

a) collect the data sample set U generated by the application program and its corresponding attribute set A;

b) perform normalization preprocessing on the attribute values in the data sample set U;

c) cluster the data sample set U based on the full attribute set A, and compute the knowledge entropy of A;

d) compute the significance of each attribute, and construct the sequence of attribute sets accordingly;

e) end.

2) Data sample detection stage of the data sample set:

a) compute the outlier factor of each data sample;

b) output the abnormal data set according to the outlier factors;

c) end.

The detailed process of step 1-b is as follows:

1) Traverse the full attribute set A of the data sample set U;

2) For each attribute a_i whose values are numeric, normalize using the minimum and maximum of that attribute over all data samples: V'_{i,j} = (V_{i,j} - V_{i,min}) / (V_{i,max} - V_{i,min}), so that the normalized attribute values lie between 0 and 1.0; here V_{i,j} is the attribute value before normalization, V_{i,min} is the minimum value of attribute a_i over all data samples before normalization, and V_{i,max} is the maximum value of attribute a_i over all data samples before normalization;

3) For each attribute a_k whose values are not numeric, assign a value between 0 and 1.0 according to the frequency with which the non-numeric value occurs: V'_{k,j} = (number of samples whose value of attribute a_k is V_{k,j}) / (total number of samples).
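The two normalization rules of step 1-b can be illustrated with a minimal Python sketch. The function names `normalize_numeric` and `normalize_categorical` are hypothetical, and the handling of a constant numeric column (where the maximum equals the minimum) is an assumption, since the text does not address that case.

```python
from collections import Counter


def normalize_numeric(values):
    """Min-max scale one numeric attribute column into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: the source does not cover a constant column,
        # so every value is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_categorical(values):
    """Replace each non-numeric value by its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


if __name__ == "__main__":
    print(normalize_numeric([2, 4, 6]))
    print(normalize_categorical(["a", "a", "b", "c"]))
```

Both functions operate on a single attribute column, so a caller would apply one or the other to each attribute in A depending on its type.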

The detailed process of step 1-c is as follows:

1) Consider the data set U corresponding to the full attribute set A of the data sample set;

2) Compute the diameter L of the set U, and set the threshold δ = L/10;

3) Perform complete-link clustering on U based on the threshold δ, obtaining the clustering result (E_1, E_2, E_3, ..., E_k), where each E_l is a cluster of data samples satisfying

\forall x_{i}, x_{j} \in E_{l}, \quad \sum_{h = 1}^{|A|} | x_{i,h} - x_{j,h} | \leq \delta;

4) Compute the knowledge entropy of the full attribute set A:

E(A) = - \sum_{i = 1}^{k} \frac{|E_{i}|}{|U|} \log_{2} \frac{|E_{i}|}{|U|}.
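As an illustration of the entropy formula, the following minimal Python sketch computes E(A) from the cluster sizes |E_1|, ..., |E_k| alone. The function name `knowledge_entropy` is hypothetical, and the sketch assumes the clusters partition U, so that the sizes sum to |U|.

```python
import math


def knowledge_entropy(cluster_sizes):
    """Return E(A) = -sum(|E_i|/|U| * log2(|E_i|/|U|)) over all clusters.

    Assumption: the clusters partition U, so |U| is the sum of the sizes.
    """
    total = sum(cluster_sizes)
    entropy = 0.0
    for size in cluster_sizes:
        if size > 0:  # an empty cluster contributes nothing
            p = size / total
            entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    # Two equally sized clusters carry one bit of entropy.
    print(knowledge_entropy([2, 2]))
```

A single cluster containing all of U gives an entropy of 0, and the value grows as the samples spread over more, evenly sized clusters.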

The detailed process of step 1-d is as follows:

1) For each attribute a_i in the full attribute set A, compute its attribute significance: sig(a_i) = E(A) - E(A - {a_i});

2) Sort the attributes of A by attribute significance, obtaining the attribute sequence S = <a'_1, a'_2, ..., a'_{|A|}>, where sig(a'_i) ≤ sig(a'_{i+1});

3) Construct the attribute set sequence AS = <A_1, A_2, ..., A_m>, where for 1 ≤ i ≤ m, A_i ⊆ A, with A_1 = A and A_m = {a'_n}, and A_{i+1} = A_i - {a'_i}.
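The ordering and sequence construction of step 1-d can be sketched in Python as follows. This is only an illustration: `entropy_of` is a hypothetical callable standing in for "cluster U on the given attribute subset as in step 1-c and return its knowledge entropy", and the function names are likewise assumptions.

```python
def attribute_significance(attributes, entropy_of):
    """Compute sig(a) = E(A) - E(A - {a}) for every attribute a in A.

    entropy_of is a hypothetical callable mapping a frozenset of
    attributes to the knowledge entropy obtained by clustering on it.
    """
    full = frozenset(attributes)
    e_full = entropy_of(full)
    return {a: e_full - entropy_of(full - {a}) for a in attributes}


def build_attribute_set_sequence(significance):
    """Return (S, AS): attributes sorted by ascending significance, and
    the sets A_1 = A, A_{i+1} = A_i - {a'_i}, ending with a single attribute."""
    ordered = sorted(significance, key=significance.get)
    sequence = []
    remaining = list(ordered)
    while remaining:
        sequence.append(set(remaining))
        remaining = remaining[1:]  # drop the least significant attribute left
    return ordered, sequence


if __name__ == "__main__":
    # Toy stand-in for the entropy: a sum of per-attribute integer weights.
    weights = {"a": 1, "b": 5, "c": 3}
    sig = attribute_significance(weights, lambda s: sum(weights[x] for x in s))
    print(build_attribute_set_sequence(sig))
```

With this construction the sequence has as many entries as there are attributes, and the last set contains only the most significant attribute.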

The detailed process of step 2-a is as follows:

1) For each attribute a'_i in S, perform the clustering of step 1-c to obtain the corresponding clustering result;

2) For each attribute set A_i in AS, likewise perform the clustering of step 1-c to obtain the corresponding clustering result;

3) For each data sample x in U, compute its weight w(x), where [x]_{a_i} denotes the cluster to which x belongs in the clustering result for a_i;

4) Compute the outlier factor d(x) of x:

d(x) = 1 - w(x) \cdot \sqrt{\frac{\sum_{j = 2}^{m - 1} \frac{|[x]_{A_{j}}| - |[x]_{A_{j - 1}}|}{|[x]_{A_{j}}|}}{m - 1}},

where [x]_{A_j} denotes the cluster to which x belongs in the clustering result for A_j.
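The outlier factor formula can be illustrated with a short Python sketch. It treats the weight w(x) as a given input, because the expression for w(x) is not reproduced in this text, and it reads the summation index as j running from 2 to m - 1. The function name `outlier_factor` and the list layout are assumptions made for the example.

```python
import math


def outlier_factor(weight, cluster_sizes):
    """Compute d(x) = 1 - w(x) * sqrt(sum_j((s_j - s_{j-1}) / s_j) / (m - 1)).

    weight        -- w(x), supplied by the caller (its formula is not given here)
    cluster_sizes -- cluster_sizes[j - 1] = |[x]_{A_j}| for j = 1..m
    """
    m = len(cluster_sizes)
    if m < 2:
        raise ValueError("need at least two attribute sets (m >= 2)")
    total = 0.0
    for j in range(2, m):  # j = 2 .. m - 1, as in the formula
        s_j = cluster_sizes[j - 1]
        s_prev = cluster_sizes[j - 2]
        total += (s_j - s_prev) / s_j
    # math.sqrt raises ValueError if the sum is negative, i.e. if the
    # cluster of x shrinks overall as attributes are removed.
    return 1.0 - weight * math.sqrt(total / (m - 1))


if __name__ == "__main__":
    print(outlier_factor(0.5, [1, 2, 4, 4]))
```

A sample whose cluster size does not change across the attribute set sequence gets a zero sum and therefore d(x) = 1, regardless of its weight.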

The detailed process of step 2-b is as follows:

1) Initialize the abnormal data set D = ∅;

2) For each data sample x in U, if d(x) > 0.85, then D = D ∪ {x};

3) Output D.
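Step 2-b amounts to a threshold filter, sketched below in Python. The function name `select_outliers` and the use of a dictionary from sample identifier to d(x) are assumptions made for the example; the 0.85 cut-off is the value stated in the text.

```python
def select_outliers(outlier_factors, threshold=0.85):
    """Return the set D of samples whose outlier factor exceeds the threshold.

    outlier_factors -- dict mapping a sample identifier to its d(x)
    """
    return {x for x, d in outlier_factors.items() if d > threshold}


if __name__ == "__main__":
    print(select_outliers({"s1": 0.90, "s2": 0.85, "s3": 0.20}))
```

The comparison is strict, so a sample with d(x) exactly equal to 0.85 is not reported.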

Beneficial effects of the invention: the invention improves abnormal data detection. The method first computes the importance of each attribute in the data sample set based on knowledge entropy, then computes the outlier factor of each data sample, and finally outputs the abnormal data set. The invention exploits the effect of clustering while avoiding its uncertainty, and can effectively ensure the accuracy of abnormal data detection.

Brief description of the drawings

Fig. 1 is a flow chart of the abnormal data detection method based on knowledge entropy.

Fig. 2 is a flow chart of the preprocessing of the attribute values of the data samples.

Fig. 3 is a flow chart of the complete-link clustering of the data sample set U based on the attribute set A.

Fig. 4 is a flow chart of computing the attribute significance and constructing the attribute set sequence.

Fig. 5 is a flow chart of computing the outlier factor of each sample and outputting the abnormal data.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings.

Fig. 1 is the flow chart of the abnormal data detection method based on knowledge entropy. The method clusters the object set, uses knowledge entropy to compute the attribute significance and obtain the attribute set sequence, and computes the outlier factor of every object by traversing the attribute sets. Finally, the results are output as required.

Fig. 2 describes the preprocessing of the attribute values of the data samples in detail.

Step 2-0: start;

Step 2-1: randomly select an attribute a_i from the attribute set A;

Step 2-2: determine whether the attribute's values are numeric;

Step 2-3: if they are numeric, normalize the a_i values of all samples in the sample set;

Step 2-4: if they are not numeric, set the a_i values of all samples in the sample set to their frequency values;

Step 2-5: remove a_i from A;

Step 2-6: determine whether A is empty; if not, return to step 2-1; if so, end.

Fig. 3 is the flow chart of the complete-link clustering of the data sample set U based on the attribute set A.

Step 3-0: start;

Step 3-1: find the two points in U that are farthest apart, take their distance as the diameter L of U, and set the threshold δ = L/10;

Step 3-2: for each point b_i in U, construct the set E_i = {b_i}, and initialize the cluster set as C = {E_1, E_2, ..., E_{|U|}};

Step 3-3: determine whether the cluster set C contains clusters that can be merged, that is, whether there exist E_i, E_j in C satisfying d(E_i, E_j) < 2δ, where

d(E_{i}, E_{j}) = \max_{x_{1} \in E_{i}, x_{2} \in E_{j}} | x_{1} - x_{2} |, \quad | x_{1} - x_{2} | = \sum_{h = 1}^{|A|} | x_{1,h} - x_{2,h} |;

Step 3-4: merge the mergeable clusters E_i and E_j in C, add the merged cluster to C, remove E_i and E_j from C, and jump to step 3-3;

Step 3-5: output the resulting cluster set C;

Step 3-6: end.
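The merging loop of Fig. 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the merge bound is passed in as the parameter `threshold` so that the caller chooses the bound to apply, the closest mergeable pair is merged first (the text only requires that some mergeable pair be merged), and all function names are hypothetical.

```python
from itertools import combinations


def manhattan(x, y):
    """Sum of absolute coordinate differences between two samples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def diameter(samples):
    """Largest pairwise distance in the sample set (0.0 for fewer than two samples)."""
    return max((manhattan(x, y) for x, y in combinations(samples, 2)), default=0.0)


def complete_link(c1, c2):
    """Complete-link distance: the largest distance between members of two clusters."""
    return max(manhattan(x, y) for x in c1 for y in c2)


def complete_link_clustering(samples, threshold):
    """Start from singletons and merge while a pair of clusters is within threshold."""
    clusters = [[s] for s in samples]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = complete_link(clusters[i], clusters[j])
            # Assumption: pick the closest mergeable pair on each pass.
            if d <= threshold and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters  # no mergeable pair remains
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
    delta = diameter(data) / 10  # threshold derived from the diameter, as in step 3-1
    print(complete_link_clustering(data, delta))
```

The search over all cluster pairs on every pass is quadratic per merge, which is acceptable for a small illustration but would need a more efficient structure for large sample sets.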

Fig. 4 describes the computation of the attribute significance and the construction of the attribute set sequence in detail.

Step 4-0: start;

Step 4-1: cluster the data sample set U based on the full attribute set A, obtaining the cluster set C = {E_1, E_2, ..., E_k};

Step 4-2: compute the knowledge entropy of A

E(A) = - \sum_{i = 1}^{k} \frac{|E_{i}|}{|U|} \log_{2} \frac{|E_{i}|}{|U|};

Step 4-3: select an attribute a_i from the attribute set A;

Step 4-4: compute the attribute significance of a_i: sig(a_i) = E(A) - E(A - {a_i});

Step 4-5: determine whether any attribute significance remains to be computed; if so, jump to step 4-3; if not, jump to step 4-6;

Step 4-6: sort by sig(a_i) to obtain S = <a'_1, a'_2, ..., a'_{|A|}>, satisfying sig(a'_i) ≤ sig(a'_{i+1});

Step 4-7: construct the attribute set sequence AS = <A_1, A_2, ..., A_m>, satisfying A_{i+1} = A_i - {a'_i},

A_{i} \subseteq A, \quad A_{1} = A, \quad A_{m} = \{a'_{n}\};

Step 4-8: end.

Fig. 5 describes the steps of computing the outlier factor of each sample and outputting the abnormal data in detail.

Step 5-0: start;

Step 5-1: for each attribute a'_i in S, perform clustering to obtain the corresponding clustering result;

Step 5-2: for each attribute set A_i in AS, perform clustering to obtain the corresponding clustering result;

Step 5-3: for each data sample x in U, first compute its weight w(x), then compute its outlier factor

d(x) = 1 - w(x) \cdot \sqrt{\frac{\sum_{j = 2}^{m - 1} \frac{|[x]_{A_{j}}| - |[x]_{A_{j - 1}}|}{|[x]_{A_{j}}|}}{m - 1}};

Step 5-4: output every x in U with d(x) > 0.85;

Step 5-5: end.

Claims

1. An abnormal data detection method based on knowledge entropy, characterized by comprising the following steps:

1) Attribute analysis stage of the data sample set:

e) end.

2) Data sample detection stage of the data sample set:

a) compute the outlier factor of each data sample;

b) output the abnormal data set according to the outlier factors;

c) end.

wherein the normalization preprocessing of step 1-b proceeds as follows:

1) traverse the full attribute set A of the data sample set U;

2) for each attribute a_i whose values are numeric, normalize using the minimum and maximum of that attribute over all data samples: V'_{i,j} = (V_{i,j} - V_{i,min}) / (V_{i,max} - V_{i,min}), so that the normalized attribute values lie between 0 and 1.0; here V_{i,j} is the attribute value before normalization, V_{i,min} is the minimum value of attribute a_i over all data samples before normalization, and V_{i,max} is the maximum value of attribute a_i over all data samples before normalization;

3) for each attribute a_k whose values are not numeric, assign a value between 0 and 1.0 according to the frequency with which the non-numeric value occurs: V'_{k,j} = (number of samples whose value of attribute a_k is V_{k,j}) / (total number of samples);

4) end.

2. The abnormal data detection method based on knowledge entropy according to claim 1, characterized by the clustering-based knowledge entropy calculation of step 1-c:

2) compute the diameter L of the set U, and set the threshold δ = L/10;

3) perform complete-link clustering on U based on the parameter δ, obtaining the clustering result (E_1, E_2, E_3, ..., E_k), where each E_l is a cluster of data samples satisfying

\forall x_{i}, x_{j} \in E_{l}, \quad \sum_{h = 1}^{|A|} | x_{i,h} - x_{j,h} | \leq \delta;

4) compute the knowledge entropy of the full attribute set A

E(A) = - \sum_{i = 1}^{k} \frac{|E_{i}|}{|U|} \log_{2} \frac{|E_{i}|}{|U|};

5) end.

The detailed process of step 1-d is as follows:

3) construct the attribute set sequence AS = <A_1, A_2, ..., A_m>, where for 1 ≤ i ≤ m, A_1 = A, A_m = {a'_n}, and A_{i+1} = A_i - {a'_i}.

3. The abnormal data detection method based on knowledge entropy according to claim 1, characterized by the data sample outlier factor calculation of step 2-a:

1) for each attribute a'_i in S, perform the clustering of step 1-c to obtain the corresponding clustering result;

2) for each attribute set A_i in AS, likewise perform the clustering of step 1-c to obtain the corresponding clustering result;

3) for each data sample x in U, compute its weight w(x), where [x]_{a_i} denotes the cluster to which x belongs in the clustering result for a_i;

4) compute the outlier factor d(x) of x:

d(x) = 1 - w(x) \cdot \sqrt{\frac{\sum_{j = 2}^{m - 1} \frac{|[x]_{A_{j}}| - |[x]_{A_{j - 1}}|}{|[x]_{A_{j}}|}}{m - 1}},

where [x]_{A_j} denotes the cluster to which x belongs in the clustering result for A_j;

5) end;

The detailed process of step 2-b is as follows:

1) initialize the abnormal data set D = ∅;

2) for each data sample x in U, if d(x) > 0.85, then D = D ∪ {x};

3) output D;

4) end.