CN104376078A - Abnormal data detection method based on knowledge entropy - Google Patents

Info

Publication number
CN104376078A
Authority
CN
China
Prior art keywords
attribute
data
value
cluster
data sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410650726.XA
Other languages
Chinese (zh)
Inventor
刘峰
刘钦
杨瑞
吕传耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201410650726.XA
Publication of CN104376078A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An abnormal data detection method based on knowledge entropy, comprising two stages: (1) an attribute analysis stage for the sample set and (2) a data sample detection stage for the sample set. In the attribute analysis stage, the data sample set U generated by an application program and its corresponding attribute set A are collected; the attribute values in U are normalized; U is clustered on the attribute set A and the knowledge entropy of A is calculated; the importance of each attribute is calculated, and a sequence of attribute sets is constructed according to these importances. In the data sample detection stage, the outlier factor of each data sample is calculated, and the abnormal data set is output according to the outlier factors. The method exploits the clustering effect while avoiding the uncertainty of clustering, so that the detection accuracy for abnormal data can be effectively guaranteed.

Description

An abnormal data detection method based on knowledge entropy
Technical field
The present invention relates to abnormal data detection, in particular to methods for discovering abnormal information in the massive data sets generated by computer information systems, and more specifically to an abnormal data detection method based on clustering and knowledge entropy.
Background art
Abnormal data detection, also called outlier detection or anomaly mining, deals with anomalies that commonly arise because data come from a different class (e.g., fraud or intrusion), because of natural variation in the data (e.g., gene mutation or a customer's new purchasing pattern), or because of measurement or collection errors. Because outliers can reveal distinctive new information, outlier detection is widely used in fields such as intrusion detection, fraud detection, public health, and the analysis of customer purchasing behavior on e-commerce platforms.
Methods for abnormal data detection mainly fall into the following categories. (1) Statistics-based techniques: a data model is built first, and anomalies are the objects that the model does not fit well; if the model is a set of clusters, an anomaly is an object that does not clearly belong to any cluster; if a regression model is used, an anomaly is an object relatively far from the predicted value. (2) Proximity-based techniques: a proximity measure is defined between objects, and anomalous objects are those far away from the other objects. (3) Density-based techniques: a point is classified as an outlier only if its local density is significantly lower than that of most of its neighbors. (4) Clustering-based techniques: small clusters far from the other clusters are treated as outliers.
The main difficulties in abnormal data detection are handling samples with non-numeric attributes, evaluating the dimensional information of high-dimensional data, and detecting anomalies that do not lie in a single dimension. Statistics-based techniques have difficulty with high-dimensional data; proximity-based techniques cannot handle data sets with regions of different density; density-based techniques are hard to tune; and clustering-based techniques cannot easily guarantee the quality of the resulting clusters, which strongly affects the quality of the detected outliers.
To improve abnormal data detection and to avoid the uncertainty of clustering while still exploiting the clustering effect, the present invention proposes an abnormal data detection method based on knowledge entropy that can effectively guarantee the detection accuracy for abnormal data.
Summary of the invention
Object of the invention: the invention provides a method for detecting abnormal data in the massive data sample sets collected by applications. The method first uses knowledge entropy to calculate the importance of each attribute in the data sample set, then calculates the outlier factor of each data sample, and finally outputs the abnormal data set.
Technical solution of the invention: the abnormal data detection method based on knowledge entropy comprises the following steps:
1) attribute analysis stage of the data sample set:
a) collect the data sample set U generated by the application program and its corresponding attribute set A;
b) normalize the attribute values in the data sample set U;
c) cluster the data sample set U on the full attribute set A, and calculate the knowledge entropy of A;
d) calculate the importance of each attribute, and construct the sequence of attribute sets accordingly;
e) end.
2) data sample detection stage of the data sample set:
a) calculate the outlier factor of each data sample;
b) output the abnormal data set according to the outlier factors;
c) end.
The detailed procedure of step 1-b is as follows:
1) traverse the full attribute set A of the data sample set U;
2) for an attribute $a_i$ whose values are numeric, normalize using the minimum and maximum values of all data samples on that attribute: the normalized attribute value is $V'_{i,j} = (V_{i,j} - V_{i,\min}) / (V_{i,\max} - V_{i,\min})$, so that the normalized values lie between 0 and 1.0; here $V_{i,j}$ is the attribute value before normalization, and $V_{i,\min}$ and $V_{i,\max}$ are the minimum and maximum values of all data samples on attribute $a_i$ before normalization;
3) for an attribute $a_k$ whose values are not numeric, assign a value between 0 and 1.0 according to the frequency with which each non-numeric value occurs: $V'_{k,j}$ = (number of samples whose value on attribute $a_k$ is $V_{k,j}$) / (total number of samples).
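The normalization of step 1-b can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `normalize_column` is hypothetical, and mapping a constant numeric column to 0.0 is an assumption, since the text does not cover the case where the minimum equals the maximum.

```python
from collections import Counter


def normalize_column(values):
    """Map one attribute column to values in [0, 1] as described in step 1-b."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        lo, hi = min(values), max(values)
        if hi == lo:
            # Assumption: a constant column carries no spread, so map it to 0.0.
            return [0.0 for _ in values]
        # Min-max normalization: V' = (V - V_min) / (V_max - V_min).
        return [(v - lo) / (hi - lo) for v in values]
    # Non-numeric attribute: replace each value by its relative frequency.
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]
```

For instance, a numeric column `[2, 4, 6]` is rescaled linearly, while a categorical column such as `["a", "b", "a", "a"]` is replaced by the share of samples holding each value.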
The detailed procedure of step 1-c is as follows:
1) consider the data set U corresponding to the full attribute set A of the data sample set;
2) calculate the diameter L of the set U, and set the threshold $\delta = L/10$;
3) perform complete-link clustering on U based on the threshold $\delta$, obtaining the clustering result $(E_1, E_2, E_3, \ldots, E_k)$, where each $E_l$ is one cluster of data samples satisfying $\forall x_i, x_j \in E_l:\ \sum_{h=1}^{|A|} |x_{i,h} - x_{j,h}| \le \delta$;
4) calculate the knowledge entropy of the full attribute set A: $E(A) = -\sum_{i=1}^{k} \frac{|E_i|}{|U|} \log_2 \frac{|E_i|}{|U|}$.
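As a small illustration of the knowledge entropy formula in step 1-c, the following Python sketch computes E(A) from the sizes of the clusters. The function name `knowledge_entropy` is hypothetical, and the sketch assumes the clusters partition U, so that the sizes sum to |U|.

```python
import math


def knowledge_entropy(cluster_sizes):
    """E(A) = -sum(|E_i|/|U| * log2(|E_i|/|U|)) over all clusters E_i."""
    total = sum(cluster_sizes)  # |U|, assuming the clusters partition U
    return -sum(
        (size / total) * math.log2(size / total) for size in cluster_sizes
    )
```

A single cluster containing every sample gives entropy 0, and the entropy grows as the samples are spread over more, similarly sized clusters.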
The detailed procedure of step 1-d is as follows:
1) for each attribute $a_i$ in the full attribute set A, calculate its attribute significance: $sig(a_i) = E(A) - E(A - \{a_i\})$;
2) sort the full attribute set A by attribute significance to obtain the attribute sequence $S = \langle a'_1, a'_2, \ldots, a'_{|A|} \rangle$, where $sig(a'_i) \le sig(a'_{i+1})$;
3) construct the sequence of attribute sets $AS = \langle A_1, A_2, \ldots, A_m \rangle$, where for $1 \le i \le m$, $A_i \subseteq A$, $A_1 = A$, $A_m = \{a'_n\}$, and $A_{i+1} = A_i - \{a'_i\}$.
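The construction in step 1-d can be sketched as below. This is an illustrative sketch under stated assumptions: the names `attribute_significance` and `build_attribute_sequences` are hypothetical, and `entropy_of` stands for any caller-supplied function that clusters U on a given attribute subset and returns its knowledge entropy.

```python
def attribute_significance(attributes, entropy_of):
    """sig(a) = E(A) - E(A - {a}) for every attribute a in A."""
    full_set = frozenset(attributes)
    full_entropy = entropy_of(full_set)
    return {a: full_entropy - entropy_of(full_set - {a}) for a in attributes}


def build_attribute_sequences(significance):
    """Return (S, AS): attributes by ascending significance, and nested sets."""
    # S = <a'_1, ..., a'_|A|> with sig(a'_i) <= sig(a'_{i+1}).
    ordered = sorted(significance, key=significance.get)
    # A_1 = A, and A_{i+1} = A_i - {a'_i}; the last set is {a'_n}.
    attribute_sets = [set(ordered[i:]) for i in range(len(ordered))]
    return ordered, attribute_sets
```

With this construction the least significant attribute is dropped first, so each later set in AS keeps only the more significant attributes.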
The detailed procedure of step 2-a is as follows:
1) for each attribute $a'_i$ in S, perform the clustering of step 1-c to obtain the corresponding clustering result;
2) for each attribute set $A_i$ in AS, likewise perform the clustering of step 1-c to obtain the corresponding clustering result;
3) for each data sample x in U, calculate its weight $w(x)$, where $[x]_{a_i}$ denotes the cluster to which x belongs in the clustering result for attribute $a_i$;
4) calculate the outlier factor $d(x)$ of x: $d(x) = 1 - w(x) \cdot \frac{\sum_{j=2}^{m-1} \frac{|[x]_{A_j}| - |[x]_{A_{j-1}}|}{|[x]_{A_j}|}}{m-1}$, where $[x]_{A_j}$ denotes the cluster to which x belongs in the clustering result for $A_j$.
The detailed procedure of step 2-b is as follows:
1) initialize the abnormal data set $D = \emptyset$;
2) for each data sample x in U, if $d(x) > 0.85$, then $D = D \cup \{x\}$;
3) output D.
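Steps 2-a and 2-b can be illustrated with the following Python sketch. It rests on several assumptions: the function names are hypothetical; the weight w(x) is passed in by the caller because its formula is not reproduced in the text; the summation index is taken to run over j from 2 to m-1; and `sizes[j - 1]` holds the size of the cluster containing x under attribute set A_j.

```python
def outlier_factor(weight, sizes):
    """d(x) = 1 - w(x) * sum_{j=2}^{m-1} (|[x]_Aj| - |[x]_Aj-1|) / |[x]_Aj| / (m-1)."""
    m = len(sizes)
    if m < 2:
        raise ValueError("at least two attribute sets are required")
    # j runs over 2..m-1 (1-based), so sizes[j - 1] is |[x]_{A_j}|.
    total = sum(
        (sizes[j - 1] - sizes[j - 2]) / sizes[j - 1] for j in range(2, m)
    )
    return 1 - weight * total / (m - 1)


def detect_abnormal(factors, threshold=0.85):
    """Step 2-b: collect every sample whose outlier factor exceeds the threshold."""
    abnormal = set()  # D starts empty
    for sample, factor in factors.items():
        if factor > threshold:
            abnormal.add(sample)
    return abnormal
```

The threshold defaults to the value 0.85 given in step 2-b; it is exposed as a parameter only for convenience when experimenting.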
Beneficial effects of the invention: the invention improves abnormal data detection. The method first uses knowledge entropy to calculate the importance of each attribute in the data sample set, then calculates the outlier factor of each data sample, and finally outputs the abnormal data set. The invention exploits the clustering effect while avoiding the uncertainty of clustering, and can effectively guarantee the detection accuracy for abnormal data.
Brief description of the drawings
Fig. 1 is a flowchart of the abnormal data detection method based on knowledge entropy.
Fig. 2 is a flowchart of the preprocessing of data sample attribute values.
Fig. 3 is a flowchart of the complete-link clustering of the data sample set U based on the attribute set A.
Fig. 4 is a flowchart of calculating attribute significance and constructing the sequence of attribute sets.
Fig. 5 is a flowchart of calculating the outlier factor of each sample and outputting the abnormal data.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the abnormal data detection method based on knowledge entropy. The method clusters the object set, uses knowledge entropy to calculate attribute significance and obtain the sequence of attribute sets, and calculates the outlier factor of every object by traversing the attribute sets. Finally, the result is output as required.
Fig. 2 details the preprocessing of data sample attribute values.
Step 2-0: start;
Step 2-1: randomly select an attribute $a_i$ from the attribute set A;
Step 2-2: determine whether the attribute values are numeric;
Step 2-3: if they are numeric, normalize all $a_i$ attribute values in the sample set;
Step 2-4: if they are not numeric, set all $a_i$ attribute values in the sample set to their frequency values;
Step 2-5: remove $a_i$ from A;
Step 2-6: determine whether A is empty; if not, return to step 2-1; if so, end.
Fig. 3 is a flowchart of the complete-link clustering of the data sample set U based on the attribute set A.
Step 3-0: start;
Step 3-1: find the two points in U that are farthest apart, take their distance as the diameter L of U, and set the threshold $\delta = L/10$;
Step 3-2: for each point $b_i$ in U, construct the set $E_i = \{b_i\}$, and initialize the cluster set as $C = \{E_1, E_2, \ldots, E_{|U|}\}$;
Step 3-3: determine whether the cluster set C contains clusters that can be merged, i.e., whether there exist $E_i, E_j$ in C satisfying $d(E_i, E_j) < 2\delta$, where $d(E_i, E_j) = \max_{x_1 \in E_i,\, x_2 \in E_j} |x_1 - x_2|$ and $|x_1 - x_2| = \sum_{h=1}^{|A|} |x_{h,1} - x_{h,2}|$;
Step 3-4: merge the mergeable clusters $E_i, E_j$ in C, add the merged cluster to C, remove $E_i, E_j$ from C, and jump to step 3-3;
Step 3-5: output the resulting cluster set C;
Step 3-6: end.
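The clustering loop of Fig. 3 can be sketched in Python as follows. This is a minimal sketch rather than the patented procedure: the function names are hypothetical, samples are assumed to be tuples of already normalized numbers, and merging the closest eligible pair first is an assumption, since the flowchart only requires that some mergeable pair be merged. The merge limit is left as a parameter and set to 2δ as in step 3-3.

```python
from itertools import combinations


def manhattan(x, y):
    """|x - y| = sum over attributes of |x_h - y_h|."""
    return sum(abs(a - b) for a, b in zip(x, y))


def complete_link_clusters(points, merge_limit):
    """Merge clusters while some pair has complete-link distance < merge_limit."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def link(c1, c2):
        # Complete-link distance: the largest distance between members.
        return max(manhattan(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        # Assumption: pick the closest pair among all candidate pairs.
        dist, a, b = min(
            (link(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if dist >= merge_limit:
            break  # no pair can be merged any more
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


def cluster_by_diameter(points):
    """Steps 3-1 to 3-5: delta = L / 10, merge while d(E_i, E_j) < 2 * delta."""
    diameter = max(
        (manhattan(p, q) for p, q in combinations(points, 2)), default=0.0
    )
    delta = diameter / 10
    return complete_link_clusters(points, merge_limit=2 * delta)
```

The clusters are returned as lists of sample indices, so their sizes can be fed directly into an entropy calculation.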
Fig. 4 details the calculation of attribute significance and the construction of the sequence of attribute sets.
Step 4-0: start;
Step 4-1: cluster the data sample set U on the full attribute set A to obtain the cluster set $C = \{E_1, E_2, \ldots, E_k\}$;
Step 4-2: calculate the knowledge entropy of A: $E(A) = -\sum_{i=1}^{k} \frac{|E_i|}{|U|} \log_2 \frac{|E_i|}{|U|}$;
Step 4-3: select an attribute $a_i$ from the attribute set A;
Step 4-4: calculate the attribute significance of $a_i$: $sig(a_i) = E(A) - E(A - \{a_i\})$;
Step 4-5: determine whether any attribute significance remains to be calculated; if so, jump to step 4-3; if not, jump to step 4-6;
Step 4-6: sort by $sig(a_i)$ to obtain $S = \langle a'_1, a'_2, \ldots, a'_{|A|} \rangle$, satisfying $sig(a'_i) \le sig(a'_{i+1})$;
Step 4-7: construct the sequence of attribute sets $AS = \langle A_1, A_2, \ldots, A_m \rangle$, satisfying $A_{i+1} = A_i - \{a'_i\}$, $A_i \subseteq A$, $A_1 = A$, $A_m = \{a'_n\}$;
Step 4-8: end.
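As a worked illustration of steps 4-2 and 4-4, consider a purely hypothetical sample set with $|U| = 8$. Suppose clustering on the full attribute set A yields clusters of sizes 4, 2 and 2, while clustering on $A - \{a_1\}$ yields two clusters of size 4. These numbers are assumptions chosen for the arithmetic, not data from the patent.

```latex
\begin{align*}
E(A) &= -\left(\tfrac{4}{8}\log_2\tfrac{4}{8}
        + \tfrac{2}{8}\log_2\tfrac{2}{8}
        + \tfrac{2}{8}\log_2\tfrac{2}{8}\right)
      = 0.5 + 0.5 + 0.5 = 1.5 \\
E(A - \{a_1\}) &= -\left(\tfrac{4}{8}\log_2\tfrac{4}{8}
        + \tfrac{4}{8}\log_2\tfrac{4}{8}\right)
      = 0.5 + 0.5 = 1 \\
sig(a_1) &= E(A) - E(A - \{a_1\}) = 1.5 - 1 = 0.5
\end{align*}
```

In this example, removing $a_1$ coarsens the clustering and lowers the entropy, so $a_1$ receives a positive significance.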
Fig. 5 details the calculation of the outlier factor of each sample and the output of the abnormal data.
Step 5-0: start;
Step 5-1: for each attribute $a'_i$ in S, perform clustering to obtain the corresponding clustering result;
Step 5-2: for each attribute set $A_i$ in AS, perform clustering to obtain the corresponding clustering result;
Step 5-3: for each data sample x in U, first calculate its weight $w(x)$, then calculate its outlier factor
$d(x) = 1 - w(x) \cdot \frac{\sum_{j=2}^{m-1} \frac{|[x]_{A_j}| - |[x]_{A_{j-1}}|}{|[x]_{A_j}|}}{m-1}$;
Step 5-4: output every x in U with $d(x) > 0.85$;
Step 5-5: end.

Claims (3)

1. An abnormal data detection method based on knowledge entropy, characterized by comprising the following steps:
1) attribute analysis stage of the data sample set:
a) collect the data sample set U generated by the application program and its corresponding attribute set A;
b) normalize the attribute values in the data sample set U;
c) cluster the data sample set U on the full attribute set A, and calculate the knowledge entropy of A;
d) calculate the importance of each attribute, and construct the sequence of attribute sets accordingly;
e) end.
2) data sample detection stage of the data sample set:
a) calculate the outlier factor of each data sample;
b) output the abnormal data set according to the outlier factors;
c) end.
wherein the normalization preprocessing of step 1-b proceeds as follows:
1) traverse the full attribute set A of the data sample set U;
2) for an attribute $a_i$ whose values are numeric, normalize using the minimum and maximum values of all data samples on that attribute: the normalized attribute value is $V'_{i,j} = (V_{i,j} - V_{i,\min}) / (V_{i,\max} - V_{i,\min})$, so that the normalized values lie between 0 and 1.0; here $V_{i,j}$ is the attribute value before normalization, and $V_{i,\min}$ and $V_{i,\max}$ are the minimum and maximum values of all data samples on attribute $a_i$ before normalization;
3) for an attribute $a_k$ whose values are not numeric, assign a value between 0 and 1.0 according to the frequency with which each non-numeric value occurs: $V'_{k,j}$ = (number of samples whose value on attribute $a_k$ is $V_{k,j}$) / (total number of samples);
4) end.
2. The abnormal data detection method based on knowledge entropy according to claim 1, characterized by the clustering-based knowledge entropy calculation method of step 1-c:
1) consider the data set U corresponding to the full attribute set A of the data sample set;
2) calculate the diameter L of the set U, and set the threshold $\delta = L/10$;
3) perform complete-link clustering on U based on the parameter $\delta$, obtaining the clustering result $(E_1, E_2, E_3, \ldots, E_k)$, where each $E_l$ is one cluster of data samples satisfying $\forall x_i, x_j \in E_l:\ \sum_{h=1}^{|A|} |x_{i,h} - x_{j,h}| \le \delta$;
4) calculate the knowledge entropy of the full attribute set A: $E(A) = -\sum_{i=1}^{k} \frac{|E_i|}{|U|} \log_2 \frac{|E_i|}{|U|}$;
5) end.
The detailed procedure of step 1-d is as follows:
1) for each attribute $a_i$ in the full attribute set A, calculate its attribute significance: $sig(a_i) = E(A) - E(A - \{a_i\})$;
2) sort the full attribute set A by attribute significance to obtain the attribute sequence $S = \langle a'_1, a'_2, \ldots, a'_{|A|} \rangle$, where $sig(a'_i) \le sig(a'_{i+1})$;
3) construct the sequence of attribute sets $AS = \langle A_1, A_2, \ldots, A_m \rangle$, where for $1 \le i \le m$, $A_1 = A$, $A_m = \{a'_n\}$, and $A_{i+1} = A_i - \{a'_i\}$.
3. The abnormal data detection method based on knowledge entropy according to claim 1, characterized by the data sample outlier factor calculation algorithm of step 2-a:
1) for each attribute $a'_i$ in S, perform the clustering of step 1-c to obtain the corresponding clustering result;
2) for each attribute set $A_i$ in AS, likewise perform the clustering of step 1-c to obtain the corresponding clustering result;
3) for each data sample x in U, calculate its weight $w(x)$, where $[x]_{a_i}$ denotes the cluster to which x belongs in the clustering result for attribute $a_i$;
4) calculate the outlier factor $d(x)$ of x: $d(x) = 1 - w(x) \cdot \frac{\sum_{j=2}^{m-1} \frac{|[x]_{A_j}| - |[x]_{A_{j-1}}|}{|[x]_{A_j}|}}{m-1}$, where $[x]_{A_j}$ denotes the cluster to which x belongs in the clustering result for $A_j$;
5) end;
The detailed procedure of step 2-b is as follows:
1) initialize the abnormal data set $D = \emptyset$;
2) for each data sample x in U, if $d(x) > 0.85$, then $D = D \cup \{x\}$;
3) output D;
4) end.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410650726.XA CN104376078A (en) 2014-11-14 2014-11-14 Abnormal data detection method based on knowledge entropy

Publications (1)

Publication Number Publication Date
CN104376078A true CN104376078A (en) 2015-02-25

Family

ID=52554985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410650726.XA Pending CN104376078A (en) 2014-11-14 2014-11-14 Abnormal data detection method based on knowledge entropy

Country Status (1)

Country Link
CN (1) CN104376078A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246645A (en) * 2008-04-01 2008-08-20 东南大学 Method for recognizing outlier traffic data
US20080255772A1 (en) * 2007-02-06 2008-10-16 Abb Research Ltd. Method and a control system for monitoring the condition of an industrial robot
CN101509839A (en) * 2009-03-12 2009-08-19 上海交通大学 Cluster industrial robot failure diagnosis method based on outlier excavation
CN103902798A (en) * 2012-12-27 2014-07-02 纽海信息技术(上海)有限公司 Data preprocessing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
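Zhang Jing et al.: "Outlier mining of high-dimensional massive data based on information theory", Computer Science (《计算机科学》) *
Jiang Feng et al.: "Sequence outlier detection based on rough set theory", Acta Electronica Sinica (《电子学报》) *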

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160181A (en) * 2015-09-02 2015-12-16 华中科技大学 Detection method of abnormal data of numerical control system instruction field sequence
CN105160181B (en) * 2015-09-02 2018-02-23 华中科技大学 A kind of digital control system domain of instruction sequence variation data detection method
CN108205570A (en) * 2016-12-19 2018-06-26 华为技术有限公司 A kind of data detection method and device
CN108205570B (en) * 2016-12-19 2021-06-29 华为技术有限公司 Data detection method and device
CN108268467A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of abnormal deviation data examination method and device based on attribute
CN108268467B (en) * 2016-12-30 2021-08-06 广东精点数据科技股份有限公司 Attribute-based abnormal data detection method and device
CN112219212A (en) * 2017-12-22 2021-01-12 阿韦瓦软件有限责任公司 Automated detection of anomalous industrial processing operations
CN109190598A (en) * 2018-09-29 2019-01-11 西安交通大学 A kind of rotating machinery monitoring data noise detection method based on SES-LOF
CN109190598B (en) * 2018-09-29 2020-05-15 西安交通大学 Rotating machinery monitoring data noise point detection method based on SES-LOF
CN109992578A (en) * 2019-01-07 2019-07-09 平安科技(深圳)有限公司 Anti- fraud method, apparatus, computer equipment and storage medium based on unsupervised learning
CN109992578B (en) * 2019-01-07 2023-08-08 平安科技(深圳)有限公司 Anti-fraud method and device based on unsupervised learning, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150225
