CN110288014A — Local outlier detection method based on information entropy weighting (Google Patents)

- Publication number: CN110288014A
- Application number: CN201910540443.2A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G06F16/2465 — Query processing support for facilitating data mining operations in structured databases (G: Physics; G06F: Electric digital data processing; G06F16/24: Querying)
- G06F18/24137 — Classification techniques based on distances to cluster centroïds (G06F18/24: Classification techniques; G06F18/2413: based on distances to training or reference patterns)
- G06F18/2433 — Single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
Abstract
The invention discloses a local outlier detection method based on information entropy weighting. First, the sample-data attributes that are outlier attributes are identified in advance, and outlier-attribute weights are set; next, the entropy-weighted distance is computed; then, the k nearest data points of each sample are found under the entropy-weighted distance; the k-nearest-neighbor distance d_r and the average distance D_r of each sample are calculated; finally, the local outlier factor eldof(r) of each sample is computed. The invention adds entropy-weight information on the basis of the LDOF algorithm, i.e. the entropy-weighted distance is used when computing distances, which raises the accuracy of outlier detection at the cost of some additional running time. Detection requires no prior knowledge of the data distribution and does not depend on a training set of a given size and quality, so detection precision is effectively improved.
Description
Technical field
The invention belongs to the field of data mining, and in particular relates to a local outlier detection method based on information entropy weighting.
Background technique
With the rapid development of modern science and technology and information technology, the magnanimity number of under cover valuable information is had accumulated
According to.In order to make full use of these data, data mining technology becomes an important field of research.The major function of data mining
It is to find out the unforeseen but valuable knowledge of people from the insufficient data set of a large amount of, integrality.With certainly
So development of science and social science, clustering method have been widely used in data mining, the side such as pattern-recognition and machine learning
Face.Clustering method can be divided into roughly split plot design and top and bottom process, and difference is to be gone to utilize data with different building methods.
One given data set is split as disjoint part by split plot design, and top and bottom process is that a data set is resolved into one point
Layer structure sequence.The purpose of cluster is one group of data record of searching, so that recording in a given cluster is similar, and it is different
Record in cluster is dissimilar.Obviously, similar concept is the key that clustering problem.In fact, the definition logarithm of similarity
It is relatively easy for attribute, it is continuous because the attribute value of these attributes has a naturally sequence.Euclidean distance mentions
The similarity measurement method that can be widely accepted for a logarithm value attribute is supplied, however many data sets also include that classification belongs to
Property.On these attributes, Euclidean distance function can not be defined, because their attribute value is discrete, unordered.Cause
This, can not go to measure with Euclidean distance.And then produce research field one important one classification data of subdomains of cluster
Cluster.
Outlier detection is a very important research branch of data mining, and major function is from huge and answer
The data for having greatly difference to belong to only a few again simultaneously with mainstream data are extracted in miscellaneous data set.Researcher has been at present
Propose a large amount of outlier detection algorithm.Such as, the outlier detection algorithm based on cluster, the outlier detection based on statistics are calculated
Method etc..
However existing most of outlier detection algorithms all have the shortcomings that time complexity is high.Then, researcher
Again develop it is many improve algorithm performances technologies.But knows data distribution in advance in the presence of needs, requires a certain number of instructions
Practice the not high problem of collection, Outlier Data detection accuracy.
Summary of the invention
Goal of the invention: the purpose of the present invention is to provide a kind of local Outliers Detection method based on comentropy weighting, inspections
It is not required to understand data distribution in advance when survey, does not depend on the training set for reaching certain amount and requirement, effectively increase algorithm detection
Precision.
Technical solution: a kind of local Outliers Detection method based on comentropy weighting of the present invention, including following step
It is rapid:
(1) obtaining attribute in advance is the sample data of attribute of peeling off, and the attribute weight that peels off is arranged;
(2) entropy weight distance is found out;
(3) each sample data is found out apart from nearest k number evidence according to entropy weight distance;
(4) the k nearest neighbor distance d of each sample data is calculatedrWith average distance Dr;
(5) part for calculating each sample data peels off factor eldof (r).
Step (1) comprises the following steps:

(11) let X be a random variable, S(X) the set of possible values of X, and p(x) the probability of a possible value x; for a multidimensional attribute set A = {A_1, A_2, …, A_d}, compute the entropy

H(X) = -Σ_{x∈S(X)} p(x) log p(x)

(12) let A = {A_1, A_2, …, A_d} be the attribute set of a d-dimensional data set D; then, for a data object o ∈ D and A_i ∈ A, the entropy formula identifies which sample-data attributes are outlier attributes; the weight of an outlier attribute is set to 2, and the weight of a non-outlier attribute is set to 1.
Step (2) is realized by the following formula:

EWdist(o, p) = sqrt( Σ_{i=1}^{d} w_i (o.A_i − p.A_i)² )

where EWdist(o, p) is the entropy-weighted distance, o.A_i is the value of data object o on its attribute A_i, w_i is the weight of attribute A_i with respect to object o (i = 1, 2, …, d), and p.A_i is the value of data object p on its attribute A_i.
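A minimal sketch of the entropy-weighted distance, reconstructed as a weighted Euclidean distance (the formula image is not reproduced in the source, so this reading is an assumption):

```python
import math

def ewdist(o, p, w):
    """Entropy-weighted distance between objects o and p.

    Reconstructed as a weighted Euclidean distance
    EWdist(o, p) = sqrt(sum_i w_i * (o_i - p_i)^2),
    where w_i is the entropy-derived weight of attribute A_i
    (2 for an outlier attribute, 1 otherwise).
    """
    return math.sqrt(sum(wi * (oi - pi) ** 2 for oi, pi, wi in zip(o, p, w)))

# With unit weights this reduces to the ordinary Euclidean distance:
print(ewdist((0.0, 0.0), (3.0, 4.0), (1, 1)))  # 5.0
# Doubling an attribute's weight stretches its contribution to the distance:
print(ewdist((0.0, 0.0), (3.0, 4.0), (2, 1)))  # sqrt(34)
```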
Step (4) is realized by the following formulas:

d_r = (1/k) Σ_{p∈N_r} EWdist(r, p),   D_r = (1/(k(k−1))) Σ_{p,q∈N_r, p≠q} EWdist(p, q)

where N_r is the set of the k nearest neighbors of sample r.

Step (5) is realized by the following formula:

eldof(r) = d_r / D_r
Beneficial effects: compared with the prior art, the present invention has the following advantages: 1. entropy-weight information is added on the basis of the LDOF algorithm, i.e. the entropy-weighted distance is used when computing distances, so the accuracy of outlier-data detection improves; 2. no prior knowledge of the data distribution is required when detecting, and no training set of a given size and quality is needed, which effectively improves detection precision.
Detailed description of the invention
Fig. 1 is the flow chart of the invention;
Fig. 2 shows the 20 randomly generated data points;
Fig. 3 compares the outlier-factor values of LDOF and ELDOF;
Fig. 4 is the distribution map of the 2 data clusters containing outlier data;
Fig. 5 compares the detection accuracy of the LDOF and ELDOF algorithms for different neighbor numbers;
Fig. 6 compares the outlier-factor values of the LDOF and ELDOF algorithms with 29 neighbors on the 2 data clusters;
Fig. 7 shows the outliers detected by the LDOF algorithm;
Fig. 8 shows the outliers detected by the method of the present invention.
Specific embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is the flow chart of the invention; the specific steps are as follows:

Step 1: let X be a random variable, S(X) the set of possible values of X, and p(x) the probability of a possible value x. For a multidimensional attribute set A = {A_1, A_2, …, A_d}, compute the entropy

H(X) = -Σ_{x∈S(X)} p(x) log p(x)

If the sample attributes occur independently, this can be expressed as

H(A_1, A_2, …, A_d) = Σ_{i=1}^{d} H(A_i)

Step 2: let A = {A_1, A_2, …, A_d} be the attribute set of a d-dimensional data set D; then, for a data object o ∈ D and A_i ∈ A, the entropy formula identifies which sample-data attributes are outlier attributes; set the weight of an outlier attribute to 2 and the weight of a non-outlier attribute to 1.

Step 3: with the weights determined in step 2, denote the value of a data object o on its attribute A_i by o.A_i, and let w_i be the weight of attribute A_i with respect to object o (i = 1, 2, …, d); compute the entropy-weighted distance

EWdist(o, p) = sqrt( Σ_{i=1}^{d} w_i (o.A_i − p.A_i)² )

Step 4: with the entropy-weighted distance found in step 3, find the k nearest data points of each sample.

Step 5: with the entropy-weighted distance found in step 3, calculate the k-nearest-neighbor distance of each sample:

d_r = (1/k) Σ_{p∈N_r} EWdist(r, p)

where N_r is the set of the k nearest neighbors of sample r.

Step 6: with the entropy-weighted distance found in step 3, calculate the k-neighbor average distance of each sample:

D_r = (1/(k(k−1))) Σ_{p,q∈N_r, p≠q} EWdist(p, q)

Step 7: calculate the eldof(r) of each sample according to

eldof(r) = d_r / D_r
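The steps above can be sketched end-to-end in Python. The d_r and D_r formulas follow the LDOF definitions from the cited Zhang et al. paper, since the patent's formula images are not reproduced in the source; the function names and the toy data are illustrative:

```python
import math
from itertools import combinations

def ewdist(o, p, w):
    # Entropy-weighted (weighted Euclidean) distance, as in step 3.
    return math.sqrt(sum(wi * (oi - pi) ** 2 for oi, pi, wi in zip(o, p, w)))

def eldof_scores(data, k, w):
    """Steps 4-7: for each sample r, find its k nearest neighbors N_r under
    the entropy-weighted distance, then compute
      d_r = mean EWdist(r, p) over the k neighbors p,
      D_r = mean pairwise EWdist(p, q) over neighbor pairs p != q,
      eldof(r) = d_r / D_r.
    """
    scores = []
    for i, r in enumerate(data):
        others = [p for j, p in enumerate(data) if j != i]
        neighbors = sorted(others, key=lambda p: ewdist(r, p, w))[:k]
        d_r = sum(ewdist(r, p, w) for p in neighbors) / k
        D_r = (sum(ewdist(p, q, w) for p, q in combinations(neighbors, 2))
               * 2 / (k * (k - 1)))
        scores.append(d_r / D_r)
    return scores

# A tight cluster plus one far-away point: the far point gets the top score.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = eldof_scores(pts, k=3, w=(1, 1))
print(max(range(len(pts)), key=lambda i: scores[i]))  # 4
```

This brute-force neighbor search is O(n² d) per run; the patent notes that the entropy weighting adds some running time on top of plain LDOF.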
In the information-entropy-weighted algorithm, the probability of each attribute of each sample occurring independently must also be obtained. In this example the probability values are computed from 20 randomly generated 2-dimensional samples: first take the central point of the 20 points; then, for each point, a ratio of the distance of its x-axis value and y-axis value from the central point represents the probability value of each attribute of that data point. The ratio is computed as follows: first compute the boundary distance of the 20 data points on each attribute, denoted D_i (i = 1, 2); then compute the distance d_i (i = 1, 2) between each data point and the center on each attribute; the ratio for each attribute of each of the 20 data points is then d_i / (D_i / 2) (i = 1, 2). These per-attribute values are the required probability values.

The 20 randomly generated data points form the matrix y = [0.5 2.7; 1 3.6; 1.5 4.2; 2 2; 2.6 4; 2.8 4.3; 3.2 7.2; 3.6 4.3; 3.8 5.5; 4.1 0.6; 4.3 3.8; 4.4 4.2; 4.8 2.5; 5.1 3.3; 5.3 5.7; 5.6 4.7; 6 4.4; 6.4 1.7; 6.8 3.7; 7.8 4.2]. These 20 random data points are used for a preliminary verification of the proposed improved method: if the information-entropy weighting produces a good effect on the 20 randomly generated data points, the proposed information-entropy weighting method is then applied to a real data set for deeper verification. The 20 randomly generated data points are shown in Fig. 2.
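A hypothetical rendering of this probability-value scheme on the 20 points: the patent does not state precisely how the "central point" and the "boundary distance" are computed, so the per-attribute mean and the per-attribute range (max − min) are assumed here:

```python
# The 20 randomly generated 2-D data points from the matrix y above.
y = [(0.5, 2.7), (1, 3.6), (1.5, 4.2), (2, 2), (2.6, 4), (2.8, 4.3),
     (3.2, 7.2), (3.6, 4.3), (3.8, 5.5), (4.1, 0.6), (4.3, 3.8), (4.4, 4.2),
     (4.8, 2.5), (5.1, 3.3), (5.3, 5.7), (5.6, 4.7), (6, 4.4), (6.4, 1.7),
     (6.8, 3.7), (7.8, 4.2)]

d = 2
n = len(y)
# Central point: assumed here to be the per-attribute mean.
center = [sum(p[i] for p in y) / n for i in range(d)]
# Boundary distance D_i: assumed here to be the per-attribute range.
D = [max(p[i] for p in y) - min(p[i] for p in y) for i in range(d)]

# Per-attribute "probability value" d_i / (D_i / 2) for each point.
prob = [[abs(p[i] - center[i]) / (D[i] / 2) for i in range(d)] for p in y]
print(prob[0])
```

Note that, as defined, the ratio can exceed 1 for points near the boundary, so these values act as relative weights rather than true probabilities.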
As shown in Fig. 3, the broken line of the method of the present invention (the ELDOF algorithm) fluctuates more strongly, while the LDOF algorithm without information-entropy weighting is comparatively smooth. It can therefore be concluded that the eldof values calculated by the entropy-weighted ELDOF algorithm judge outlier data more reasonably than the LDOF algorithm without entropy weighting.
The iris flower data set of 150 samples is obtained. The data set has 4 attributes, namely sepal length, sepal width, petal length, and petal width, in cm. The samples consist of 3 classes of 50 samples each. The mountain iris (Iris setosa) class is completely separated, while the variegated iris (Iris versicolor) and Virginia iris (Iris virginica) classes overlap, so the data set can be analyzed as either 2 or 3 classes. Here it is divided into 2 classes, and only the two attributes sepal length and sepal width are used in the experiment. 21 samples are taken from the 1st class and 19 samples from the 2nd class, and 20 outlier data points are then added.

There are thus 40 normal data points and 20 added outliers: the first cluster contains 21 normal data points and 9 outliers, and the second cluster contains 19 normal data points and 11 outliers. As shown in Fig. 4, the samples marked with five-pointed stars and the samples marked with asterisks are the 2 data clusters, and the samples marked with diamonds are the outlier data.
As shown in Fig. 5, when the number of nearest neighbors is 5, 10, or 15, the detection accuracy of the ELDOF algorithm is 5% higher than that of the LDOF algorithm; when the number of neighbors is 20, 25, or 29, the detection accuracy of the ELDOF algorithm is 10% higher. The detection accuracy of the LDOF algorithm with 29 nearest neighbors is steady at 75%, while that of our ELDOF algorithm with 29 nearest neighbors reaches 85%. We therefore conclude that, regardless of the number of nearest neighbors, the detection accuracy of the ELDOF algorithm is always higher than that of the LDOF algorithm.

As shown in Fig. 6, taking 29 nearest neighbors, the ELDOF algorithm exhibits the outliers more clearly and more reasonably than the LDOF algorithm. To compare the two algorithms more intuitively, Fig. 7 and Fig. 8 show the outlier detection results.
From Fig. 7 and Fig. 8 we find that the ELDOF algorithm detects outliers at a higher rate than the LDOF algorithm. Comparing each algorithm's result with the originally added outlier data yields the detection accuracies shown in Table 1.

Table 1: comparison of the detection accuracy of the LDOF and ELDOF algorithms

According to Table 1, in the outlier detection of the two clusters the detection accuracy of the ELDOF algorithm is 10% higher than that of the LDOF algorithm, from which it can also be concluded that, under the same conditions, the experimental effect of the ELDOF algorithm is better than that of the LDOF algorithm.

In addition, Table 2 compares the running times of the LDOF and ELDOF algorithms with 29 nearest neighbors. Within an acceptable range, the running time of the algorithm with entropy weighting is indeed longer than that of the algorithm without entropy weighting.

Table 2: comparison of the running times of the two algorithms
Claims (5)

1. A local outlier detection method based on information entropy weighting, characterized by comprising the following steps:
(1) identifying in advance the sample-data attributes that are outlier attributes, and setting the outlier-attribute weights;
(2) computing the entropy-weighted distance;
(3) finding, for each sample, the k nearest data points under the entropy-weighted distance;
(4) calculating the k-nearest-neighbor distance d_r and the average distance D_r of each sample;
(5) calculating the local outlier factor eldof(r) of each sample.

2. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (1) comprises the following steps:
(11) letting X be a random variable, S(X) the set of possible values of X, and p(x) the probability of a possible value x, and computing, for a multidimensional attribute set A = {A_1, A_2, …, A_d}, the entropy
H(X) = -Σ_{x∈S(X)} p(x) log p(x)
(12) letting A = {A_1, A_2, …, A_d} be the attribute set of a d-dimensional data set D, then, for a data object o ∈ D and A_i ∈ A, identifying by the entropy formula which sample-data attributes are outlier attributes, and setting the weight of an outlier attribute to 2 and the weight of a non-outlier attribute to 1.

3. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (2) is realized by the following formula:
EWdist(o, p) = sqrt( Σ_{i=1}^{d} w_i (o.A_i − p.A_i)² )
where EWdist(o, p) is the entropy-weighted distance, o.A_i is the value of data object o on its attribute A_i, w_i is the weight of attribute A_i with respect to object o (i = 1, 2, …, d), and p.A_i is the value of data object p on its attribute A_i.

4. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (4) is realized by the following formulas:
d_r = (1/k) Σ_{p∈N_r} EWdist(r, p),   D_r = (1/(k(k−1))) Σ_{p,q∈N_r, p≠q} EWdist(p, q)
where N_r is the set of the k nearest neighbors of sample r.

5. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (5) is realized by the following formula:
eldof(r) = d_r / D_r
Priority Applications (1)

- CN201910540443.2A — filed 2019-06-21 (priority date 2019-06-21); published as CN110288014A on 2019-09-27; family ID 68005079; status: pending
Patent Citations (5)

- CN103812700A — Message classification method based on rule information entropy (priority 2014-02-18, published 2014-05-21)
- CN107123989A — Topology identification method based on an improved local outlier factor algorithm (priority 2017-05-25, published 2017-09-01)
- CN108280561A — Quality tracing method for discretely manufactured mechanical products based on information entropy and weighted distance (priority 2017-01-06, published 2018-07-13)
- CN108549669A — Outlier detection method for big data (priority 2018-03-21, published 2018-09-18)
- CN109492667A — Feature selection and discrimination method for non-intrusive electrical load monitoring (priority 2018-10-08, published 2019-03-19)
Non-Patent Citations (2)

- Ke Zhang et al., "A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data", PAKDD 2009: Advances in Knowledge Discovery and Data Mining
- Bao Xiaobing et al., "Local outlier detection algorithm based on information entropy weighting", Computer Technology and Development
Cited By (2)

- CN113158871A — Wireless signal strength anomaly detection method based on density cores (Chongqing University; priority 2021-04-15, published 2021-07-23)
- CN113158871B — Wireless signal strength anomaly detection method based on density cores (Chongqing University; granted 2022-08-02)
Legal Events

- PB01 — Publication
- SE01 — Entry into force of request for substantive examination
- RJ01 — Rejection of invention patent application after publication (application publication date: 2019-09-27)