CN104376078A - Abnormal data detection method based on knowledge entropy - Google Patents
- Publication number
- CN104376078A (application number CN201410650726.XA)
- Authority
- CN
- China
- Prior art keywords
- attribute
- data
- value
- cluster
- data sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An abnormal data detection method based on knowledge entropy, characterized by comprising two stages: (1) an attribute analysis stage for the sample set and (2) a data sample detection stage for the sample set. In the attribute analysis stage, the data sample set U generated by an application program and its corresponding attribute set A are collected; the attribute values in U are normalized; U is clustered based on the attribute set A and the knowledge entropy of A is calculated; the importance of each attribute is calculated, and a sequence of attribute sets is constructed according to these importances. In the data sample detection stage, the outlier factor of each data sample is calculated, and an abnormal data set is output according to the outlier factors. The method exploits the clustering effect while avoiding the uncertainty of clustering, and can therefore effectively ensure the accuracy of abnormal data detection.
Description
Technical field
The present invention relates to abnormal data detection methods, in particular to methods for discovering abnormal information in the large data sets generated by computer information systems, and more specifically to an abnormal data detection method based on clustering and knowledge entropy.
Background
Abnormal data detection is also called outlier detection or anomaly mining. Common causes of anomalies are data originating from a different class (e.g. fraud, intrusion), natural variation of the data (e.g. gene mutation, a customer's new purchasing pattern), and data measurement or collection errors. Because outliers can reveal distinctive new information, outlier detection is widely used in fields such as intrusion detection, fraud detection, public health, and customer purchasing behavior analysis on e-commerce platforms.
The main methods of abnormal data detection are the following. (1) Statistics-based techniques: a data model is built first, and anomalies are the objects that the model does not fit well. If the model is a set of clusters, anomalies are objects that do not clearly belong to any cluster; with a regression model, anomalies are objects relatively far from the predicted value. (2) Proximity-based techniques: a proximity measure is defined between objects, and anomalous objects are those far from the other objects. (3) Density-based techniques: a point is classified as an outlier only if its local density is significantly lower than that of most of its neighbors. (4) Clustering-based techniques: small clusters far from the other clusters are treated as outliers.
The main difficulties of abnormal data detection are handling non-numeric samples, evaluating dimension information in high-dimensional data, and detecting anomalies that span more than one dimension. Statistics-based techniques have difficulty with high-dimensional data; proximity-based techniques cannot handle data sets with regions of different density; density-based techniques are hard to tune; clustering-based techniques cannot easily guarantee the quality of the clusters produced, which strongly affects the quality of the detected outliers.
To improve the effectiveness of abnormal data detection, the present invention proposes an abnormal data detection method based on knowledge entropy that exploits the clustering effect while avoiding the uncertainty of clustering, and that can effectively ensure the accuracy of abnormal data detection.
Summary of the invention
Object of the invention: the invention provides a method for detecting abnormal data in the large data sample sets collected by applications. The method first uses knowledge entropy to calculate the importance of each attribute in the data sample set, then calculates the outlier factor of each data sample, and finally outputs the abnormal data set.
Technical solution of the invention: the abnormal data detection method based on knowledge entropy comprises the following steps:
1) Attribute analysis stage of the data sample set:
a) collect the data sample set U generated by the application program and its corresponding attribute set A;
b) normalize the attribute values in the data sample set U;
c) cluster the data sample set U based on the full attribute set A, and calculate the knowledge entropy of A;
d) calculate the importance of each attribute, and construct the sequence of attribute sets accordingly;
e) end.
2) Data sample detection stage of the data sample set:
a) calculate the outlier factor of each data sample;
b) output the abnormal data set according to the outlier factors;
c) end.
The detailed procedure of step 1-b is as follows:
1) traverse the full attribute set A of the data sample set U;
2) for an attribute $a_i$ whose values are numeric, normalize using the minimum and maximum values of that attribute over all data samples: the normalized attribute value is $V'_{i,j} = (V_{i,j} - V_{i,\min}) / (V_{i,\max} - V_{i,\min})$, so that the normalized values lie between 0 and 1.0; here $V_{i,j}$ is the attribute value before normalization, and $V_{i,\min}$ and $V_{i,\max}$ are the minimum and maximum values of attribute $a_i$ over all data samples before normalization;
3) for an attribute $a_k$ whose values are not numeric, assign a value between 0 and 1.0 according to the frequency with which each non-numeric value occurs: $V'_{k,j}$ = (number of samples whose value of attribute $a_k$ is $V_{k,j}$) / (total number of samples).
The detailed procedure of step 1-c is as follows:
1) consider the data set U corresponding to the full attribute set A of the data sample set;
2) calculate the diameter L of the set U, i.e. the distance between the two points of U that are farthest apart, and set the threshold δ = L/10;
3) based on the threshold δ, perform complete-link clustering on U to obtain the cluster result $(E_1, E_2, E_3, \ldots, E_k)$, where each $E_l$ is one of the sets produced by clustering the data sample set and satisfies [formula missing in source];
4) calculate the knowledge entropy E(A) of the full attribute set A [formula missing in source].
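The diameter and threshold of step 1-c, together with an entropy computed from a cluster result, can be sketched as below. Two points are assumptions rather than content of the source: Euclidean distance is used as the distance between samples, and, because the knowledge entropy formula is missing from this text, the sketch substitutes a complement-style entropy $\sum_l (|E_l|/|U|)(1 - |E_l|/|U|)$ of the kind found in the rough set literature. The function names `diameter`, `threshold`, and `knowledge_entropy` are hypothetical.

```python
import math
from itertools import combinations


def diameter(points):
    """Largest pairwise distance in the set (Euclidean distance is an assumption)."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def threshold(points, ratio=10):
    """Threshold delta = L / 10, as stated in step 1-c."""
    return diameter(points) / ratio


def knowledge_entropy(clusters, n_total):
    """ASSUMED formula: sum over clusters of p * (1 - p), with p = |E_l| / |U|.

    The patent's own formula is not available in this text; this stand-in
    is 0 for a single cluster and grows as the partition gets finer.
    """
    return sum((len(c) / n_total) * (1 - len(c) / n_total) for c in clusters)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
    print(diameter(pts), threshold(pts))
    print(knowledge_entropy([[0, 1], [2, 3]], 4))
```

Any other entropy defined on a partition of U could be dropped in for `knowledge_entropy` without changing the surrounding steps.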
The detailed procedure of step 1-d is as follows:
1) for each attribute $a_i$ in the full attribute set A, calculate its attribute significance: $\mathrm{sig}(a_i) = E(A) - E(A - \{a_i\})$;
2) sort the full attribute set A by attribute significance to obtain the attribute sequence $S = \langle a'_1, a'_2, \ldots, a'_{|A|} \rangle$, which satisfies $\mathrm{sig}(a'_i) \le \mathrm{sig}(a'_{i+1})$;
3) construct the sequence of attribute sets $AS = \langle A_1, A_2, \ldots, A_m \rangle$, where for $1 \le i \le m$ [formula missing in source], and $A_{i+1} = A_i - \{a'_i\}$.
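Step 1-d can be sketched as follows, assuming a callable `entropy_of` that returns the knowledge entropy E(B) of any attribute subset B (in the method, this would cluster U on B and evaluate the entropy of the result). The names `attribute_significance`, `attribute_sequences`, and `entropy_of` are hypothetical, and starting the chain at $A_1 = A$ follows the wording of claim 2.

```python
def attribute_significance(attrs, entropy_of):
    """sig(a) = E(A) - E(A - {a}) for every attribute a in A."""
    full = frozenset(attrs)
    e_full = entropy_of(full)
    return {a: e_full - entropy_of(full - {a}) for a in attrs}


def attribute_sequences(attrs, entropy_of):
    """Return (S, AS): attributes sorted by ascending significance, and the
    chain of attribute sets with A_1 = A and A_{i+1} = A_i - {a'_i}."""
    sig = attribute_significance(attrs, entropy_of)
    ordered = sorted(attrs, key=lambda a: sig[a])  # sig(a'_i) <= sig(a'_{i+1})
    chain = []
    remaining = list(ordered)
    while remaining:
        chain.append(frozenset(remaining))
        remaining = remaining[1:]  # drop the least significant remaining attribute
    return ordered, chain


if __name__ == "__main__":
    # Toy entropy for illustration only: each attribute contributes a fixed amount.
    weights = {"a": 3, "b": 1, "c": 2}
    toy_entropy = lambda subset: sum(weights[x] for x in subset)
    s, chain = attribute_sequences(["a", "b", "c"], toy_entropy)
    print(s)
    print([sorted(c) for c in chain])
```

Under these assumptions the chain has as many sets as there are attributes and ends with the single most significant attribute.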
The detailed procedure of step 2-a is as follows:
1) for each attribute $a'_i$ in S, perform the clustering of step 1-c to obtain a cluster result [formula missing in source];
2) for each attribute set $A_i$ in AS, likewise perform the clustering of step 1-c to obtain a cluster result [formula missing in source];
3) for each data sample x in U, calculate its weight w(x) [formula missing in source], in which one term denotes the cluster to which x belongs in the cluster result for $a_i$;
4) calculate the outlier factor d(x) of x [formula missing in source], in which one term denotes the cluster to which x belongs in the cluster result for $A_j$.
The detailed procedure of step 2-b is as follows:
1) initialize the abnormal data set $D = \emptyset$;
2) for each data sample x in U, if d(x) > 0.85, then $D = D \cup \{x\}$;
3) output D.
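The selection in step 2-b amounts to a strict threshold on the outlier factor. A minimal sketch, assuming the outlier factors d(x) have already been computed and are held in a dictionary keyed by sample identifier (the name `abnormal_set` and the dictionary layout are hypothetical):

```python
def abnormal_set(outlier_factor, cutoff=0.85):
    """Collect every sample whose outlier factor strictly exceeds the cutoff."""
    abnormal = set()  # D starts empty
    for sample, d in outlier_factor.items():
        if d > cutoff:
            abnormal.add(sample)  # D = D ∪ {x}
    return abnormal


if __name__ == "__main__":
    factors = {"x1": 0.90, "x2": 0.85, "x3": 0.20}
    print(abnormal_set(factors))
```

Because the comparison is strict, a sample whose outlier factor equals 0.85 exactly is not reported.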
Beneficial effects of the invention: the invention improves the effectiveness of abnormal data detection. The method first uses knowledge entropy to calculate the importance of each attribute in the data sample set, then calculates the outlier factor of each data sample, and finally outputs the abnormal data set. The invention exploits the clustering effect while avoiding the uncertainty of clustering, and can effectively ensure the accuracy of abnormal data detection.
Brief description of the drawings
Fig. 1 is a flowchart of the abnormal data detection method based on knowledge entropy.
Fig. 2 is a flowchart of preprocessing the attribute values of the data samples.
Fig. 3 is a flowchart of complete-link clustering of the data sample set U based on the attribute set A.
Fig. 4 is a flowchart of calculating attribute significance and constructing the sequence of attribute sets.
Fig. 5 is a flowchart of calculating the outlier factor of each sample and outputting the abnormal data.
Detailed description of embodiments
The invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the abnormal data detection method based on knowledge entropy. The method uses clustering to partition the object set, uses knowledge entropy to calculate attribute significance and obtain the sequence of attribute sets, and calculates the outlier factor of every object by traversing the attribute sets. Finally, the results are output as required.
Fig. 2 describes in detail the preprocessing of the attribute values of the data samples.
Step 2-0: start;
Step 2-1: randomly select an attribute $a_i$ from the attribute set A;
Step 2-2: determine whether the values of the attribute are numeric;
Step 2-3: if they are numeric, normalize the $a_i$ values of all samples in the sample set;
Step 2-4: if they are not numeric, set the $a_i$ values of all samples in the sample set to their frequency values;
Step 2-5: remove $a_i$ from A;
Step 2-6: determine whether A is empty; if not, return to step 2-1; if so, end.
Fig. 3 is a flowchart of complete-link clustering of the data sample set U based on the attribute set A.
Step 3-0: start;
Step 3-1: find the two points of U that are farthest apart, take their distance as the diameter L of U, and set the threshold δ = L/10;
Step 3-2: for each point $b_i$ in U, construct the set $E_i = \{b_i\}$, and initialize the cluster set as $C = \{E_1, E_2, \ldots, E_{|U|}\}$;
Step 3-3: determine whether the cluster set C contains clusters that can be merged, i.e. whether C contains $E_i$, $E_j$ satisfying $d(E_i, E_j) < 2\delta$, where $d(E_i, E_j)$ is the inter-cluster distance [formula missing in source]; if so, go to step 3-4, otherwise go to step 3-5;
Step 3-4: merge the mergeable clusters $E_i$, $E_j$ in C, add the merged cluster to C, then remove $E_i$ and $E_j$ from C and jump to step 3-3;
Step 3-5: output the resulting cluster set C;
Step 3-6: end.
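Steps 3-1 to 3-5 can be sketched as the loop below. Several details are assumptions, because the distance formula is missing from this text: samples are compared with Euclidean distance, the inter-cluster distance is taken to be the usual complete-link distance (the largest distance between any member of one cluster and any member of the other), and when several pairs satisfy the merge condition the closest pair is merged first. The names `complete_link_distance` and `threshold_clustering` are hypothetical.

```python
import math
from itertools import combinations


def complete_link_distance(c1, c2, points):
    """Assumed complete-link distance: the farthest pair across two clusters."""
    return max(math.dist(points[i], points[j]) for i in c1 for j in c2)


def threshold_clustering(points, ratio=10):
    """Merge clusters while some pair satisfies d(E_i, E_j) < 2 * delta."""
    # Step 3-1: diameter L of U and threshold delta = L / 10.
    diam = max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    delta = diam / ratio
    # Step 3-2: every point starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    while True:
        # Step 3-3: look for a mergeable pair (closest first, an assumption).
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = complete_link_distance(clusters[a], clusters[b], points)
            if d < 2 * delta and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            return clusters  # Step 3-5: no pair can be merged any more.
        # Step 3-4: replace the two clusters by their union.
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
    print(sorted(sorted(c) for c in threshold_clustering(pts)))
```

The loop terminates because every merge reduces the number of clusters by one. The sketch recomputes all pairwise distances on each pass, which is acceptable for illustration but costly on large sample sets.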
Fig. 4 describes in detail the calculation of attribute significance and the construction of the sequence of attribute sets.
Step 4-0: start;
Step 4-1: cluster the data sample set U based on the full attribute set A to obtain the cluster set $C = \{E_1, E_2, \ldots, E_k\}$;
Step 4-2: calculate the knowledge entropy E(A) of A [formula missing in source];
Step 4-3: select an attribute $a_i$ from the attribute set A;
Step 4-4: calculate the attribute significance of $a_i$: $\mathrm{sig}(a_i) = E(A) - E(A - \{a_i\})$;
Step 4-5: determine whether any attribute significance remains to be calculated; if so, jump to step 4-3; if not, jump to step 4-6;
Step 4-6: sort by $\mathrm{sig}(a_i)$ to obtain $S = \langle a'_1, a'_2, \ldots, a'_{|A|} \rangle$, satisfying $\mathrm{sig}(a'_i) \le \mathrm{sig}(a'_{i+1})$;
Step 4-7: construct the sequence of attribute sets $AS = \langle A_1, A_2, \ldots, A_m \rangle$, satisfying $A_{i+1} = A_i - \{a'_i\}$;
Step 4-8: end.
Fig. 5 gives the detailed steps for calculating the outlier factor of each sample and outputting the abnormal data.
Step 5-0: start;
Step 5-1: cluster on each attribute $a'_i$ in S to obtain a cluster result [formula missing in source];
Step 5-2: cluster on each attribute set $A_i$ in AS to obtain a cluster result [formula missing in source];
Step 5-3: for each data sample x in U, first calculate its weight w(x), then calculate its outlier factor d(x) [formulas missing in source];
Step 5-4: output every x in U with d(x) > 0.85;
Step 5-5: end.
Claims (3)
1. An abnormal data detection method based on knowledge entropy, characterized by comprising the following steps:
1) attribute analysis stage of the data sample set:
a) collect the data sample set U generated by the application program and its corresponding attribute set A;
b) normalize the attribute values in the data sample set U;
c) cluster the data sample set U based on the full attribute set A, and calculate the knowledge entropy of A;
d) calculate the importance of each attribute, and construct the sequence of attribute sets accordingly;
e) end.
2) data sample detection stage of the data sample set:
a) calculate the outlier factor of each data sample;
b) output the abnormal data set according to the outlier factors;
c) end.
wherein the normalization preprocessing of step 1-b proceeds as follows:
1) traverse the full attribute set A of the data sample set U;
2) for an attribute $a_i$ whose values are numeric, normalize using the minimum and maximum values of that attribute over all data samples: the normalized attribute value is $V'_{i,j} = (V_{i,j} - V_{i,\min}) / (V_{i,\max} - V_{i,\min})$, so that the normalized values lie between 0 and 1.0; here $V_{i,j}$ is the attribute value before normalization, and $V_{i,\min}$ and $V_{i,\max}$ are the minimum and maximum values of attribute $a_i$ over all data samples before normalization;
3) for an attribute $a_k$ whose values are not numeric, assign a value between 0 and 1.0 according to the frequency with which each non-numeric value occurs: $V'_{k,j}$ = (number of samples whose value of attribute $a_k$ is $V_{k,j}$) / (total number of samples);
4) end.
2. The abnormal data detection method based on knowledge entropy according to claim 1, characterized by the cluster-based knowledge entropy calculation method of step 1-c:
1) consider the data set U corresponding to the full attribute set A of the data sample set;
2) calculate the diameter L of the set U [formula missing in source], and set the threshold δ = L/10;
3) based on the parameter δ, perform complete-link clustering on U to obtain the cluster result $(E_1, E_2, E_3, \ldots, E_k)$, where each $E_l$ is one of the sets produced by clustering the data sample set and satisfies [formula missing in source];
4) calculate the knowledge entropy E(A) of the full attribute set A [formula missing in source];
5) end.
The detailed procedure of step 1-d is as follows:
1) for each attribute $a_i$ in the full attribute set A, calculate its attribute significance: $\mathrm{sig}(a_i) = E(A) - E(A - \{a_i\})$;
2) sort the full attribute set A by attribute significance to obtain the attribute sequence $S = \langle a'_1, a'_2, \ldots, a'_{|A|} \rangle$, which satisfies $\mathrm{sig}(a'_i) \le \mathrm{sig}(a'_{i+1})$;
3) construct the sequence of attribute sets $AS = \langle A_1, A_2, \ldots, A_m \rangle$, where for $1 \le i \le m$ [formula missing in source], $A_1 = A$, $A_m = \{a'_n\}$, and $A_{i+1} = A_i - \{a'_i\}$.
3. The abnormal data detection method based on knowledge entropy according to claim 1, characterized by the data sample outlier factor calculation algorithm of step 2-a:
1) for each attribute $a'_i$ in S, perform the clustering of step 1-c to obtain a cluster result [formula missing in source];
2) for each attribute set $A_i$ in AS, likewise perform the clustering of step 1-c to obtain a cluster result [formula missing in source];
3) for each data sample x in U, calculate its weight w(x) [formula missing in source], in which one term denotes the cluster to which x belongs in the cluster result for $a_i$;
4) calculate the outlier factor d(x) of x [formula missing in source], in which one term denotes the cluster to which x belongs in the cluster result for $A_j$;
5) end;
The detailed procedure of step 2-b is as follows:
1) initialize the abnormal data set $D = \emptyset$;
2) for each data sample x in U, if d(x) > 0.85, then $D = D \cup \{x\}$;
3) output D;
4) end.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410650726.XA CN104376078A (en) | 2014-11-14 | 2014-11-14 | Abnormal data detection method based on knowledge entropy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410650726.XA CN104376078A (en) | 2014-11-14 | 2014-11-14 | Abnormal data detection method based on knowledge entropy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104376078A true CN104376078A (en) | 2015-02-25 |
Family
ID=52554985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410650726.XA Pending CN104376078A (en) | 2014-11-14 | 2014-11-14 | Abnormal data detection method based on knowledge entropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104376078A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101246645A (en) * | 2008-04-01 | 2008-08-20 | 东南大学 | Method for recognizing outlier traffic data |
US20080255772A1 (en) * | 2007-02-06 | 2008-10-16 | Abb Research Ltd. | Method and a control system for monitoring the condition of an industrial robot |
CN101509839A (en) * | 2009-03-12 | 2009-08-19 | 上海交通大学 | Cluster industrial robot failure diagnosis method based on outlier excavation |
CN103902798A (en) * | 2012-12-27 | 2014-07-02 | 纽海信息技术(上海)有限公司 | Data preprocessing method |
Non-Patent Citations (2)
Title |
---|
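Zhang Jing et al.: "Outlier mining of high-dimensional massive data based on information theory", Computer Science * |
Jiang Feng et al.: "Sequence outlier detection based on rough set theory", Acta Electronica Sinica * |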
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105160181A (en) * | 2015-09-02 | 2015-12-16 | 华中科技大学 | Detection method of abnormal data of numerical control system instruction field sequence |
CN105160181B (en) * | 2015-09-02 | 2018-02-23 | 华中科技大学 | A kind of digital control system domain of instruction sequence variation data detection method |
CN108205570A (en) * | 2016-12-19 | 2018-06-26 | 华为技术有限公司 | A kind of data detection method and device |
CN108205570B (en) * | 2016-12-19 | 2021-06-29 | 华为技术有限公司 | Data detection method and device |
CN108268467A (en) * | 2016-12-30 | 2018-07-10 | 广东精点数据科技股份有限公司 | A kind of abnormal deviation data examination method and device based on attribute |
CN108268467B (en) * | 2016-12-30 | 2021-08-06 | 广东精点数据科技股份有限公司 | Attribute-based abnormal data detection method and device |
CN112219212A (en) * | 2017-12-22 | 2021-01-12 | 阿韦瓦软件有限责任公司 | Automated detection of anomalous industrial processing operations |
CN109190598A (en) * | 2018-09-29 | 2019-01-11 | 西安交通大学 | A kind of rotating machinery monitoring data noise detection method based on SES-LOF |
CN109190598B (en) * | 2018-09-29 | 2020-05-15 | 西安交通大学 | Rotating machinery monitoring data noise point detection method based on SES-LOF |
CN109992578A (en) * | 2019-01-07 | 2019-07-09 | 平安科技(深圳)有限公司 | Anti- fraud method, apparatus, computer equipment and storage medium based on unsupervised learning |
CN109992578B (en) * | 2019-01-07 | 2023-08-08 | 平安科技(深圳)有限公司 | Anti-fraud method and device based on unsupervised learning, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20150225 |