CN110288014A - A local outlier detection method based on information entropy weighting - Google Patents

A local outlier detection method based on information entropy weighting

Info

Publication number
CN110288014A
Authority
CN
China
Prior art keywords
attribute
data
distance
sample data
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910540443.2A
Other languages
Chinese (zh)
Inventor
王丽娜
冯超
邢梓萌
沈朝瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201910540443.2A
Publication of CN110288014A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Abstract

The invention discloses a local outlier detection method based on information entropy weighting. First, the attributes of the sample data that are outlying attributes are identified in advance, and the outlying-attribute weights are set. Next, the entropy-weighted distance is computed. The k data points nearest to each sample are then found according to the entropy-weighted distance. After that, the k-nearest-neighbor distance d_r and the average distance D_r of each sample are calculated. Finally, the local outlier factor eldof(r) of each sample is calculated. The invention adds entropy-weight information to the LDOF algorithm, that is, it uses an entropy-weighted distance when computing distances, which improves the accuracy of outlier detection at the cost of some additional running time. Detection does not require prior knowledge of the data distribution and does not depend on a training set of a given size or quality, so the detection accuracy of the algorithm is effectively improved.

Description

A local outlier detection method based on information entropy weighting
Technical field
The invention belongs to the field of data mining, and in particular relates to a local outlier detection method based on information entropy weighting.
Background art
With the rapid development of modern science and information technology, massive amounts of data containing valuable information have accumulated. To make full use of these data, data mining has become an important field of research. The main function of data mining is to find unforeseen but valuable knowledge in large, incomplete data sets. With the development of the natural and social sciences, clustering methods have been widely used in data mining, pattern recognition, machine learning, and related areas. Clustering methods can be roughly divided into partitioning methods and hierarchical methods, which differ in how they structure the data. A partitioning method splits a given data set into disjoint parts, whereas a hierarchical method decomposes a data set into a sequence of nested levels. The purpose of clustering is to find groups of data records such that the records within a given cluster are similar and the records in different clusters are dissimilar. The notion of similarity is therefore the key to the clustering problem. Similarity is relatively easy to define for numerical attributes, because their values have a natural order and are continuous. Euclidean distance provides a widely accepted similarity measure for numerical attributes. Many data sets, however, also contain categorical attributes. On these attributes the Euclidean distance function cannot be defined, because their values are discrete and unordered, so they cannot be measured with Euclidean distance. This gave rise to an important subfield of clustering research: categorical data clustering.
Outlier detection is a very important research branch of data mining. Its main function is to extract, from large and complex data sets, the small minority of data points that differ greatly from the mainstream data. Researchers have proposed a large number of outlier detection algorithms, such as clustering-based and statistics-based outlier detection algorithms.
However, most existing outlier detection algorithms suffer from high time complexity. Researchers have therefore developed many techniques to improve algorithm performance. These techniques still have problems: they need to know the data distribution in advance, they require a training set of a certain size, and their outlier detection accuracy is not high.
Summary of the invention
Object of the invention: the object of the present invention is to provide a local outlier detection method based on information entropy weighting that does not require prior knowledge of the data distribution during detection, does not depend on a training set of a given size or quality, and effectively improves the detection accuracy of the algorithm.
Technical solution: the local outlier detection method based on information entropy weighting according to the present invention comprises the following steps:
(1) identify in advance which attributes of the sample data are outlying attributes, and set the outlying-attribute weights;
(2) compute the entropy-weighted distance;
(3) find the k data points nearest to each sample according to the entropy-weighted distance;
(4) calculate the k-nearest-neighbor distance d_r and the average distance D_r of each sample;
(5) calculate the local outlier factor eldof(r) of each sample.
Step (1) comprises the following steps:
(11) Let X be a random variable, S(X) the set of all possible values of X, and p(x) the probability of the value x. For a multidimensional attribute set A = {A_1, A_2, …, A_d}, the entropy is calculated by the formula:
(12) Let A = {A_1, A_2, …, A_d} be the attribute set of a d-dimensional data set D, so that for a data object o ∈ D, A_i ∈ A. According to the formula, determine which attributes of the sample data are outlying attributes and set the weight of each outlying attribute to 2; if an attribute of the sample data is a non-outlying attribute, its weight is 1.
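The entropy computation in step (11) can be illustrated with a minimal sketch. It assumes discrete attribute values, uses empirical frequencies as p(x), and takes logarithms to base 2; the function names and the toy columns are hypothetical and do not come from the patent.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy, sum of -p(x) * log2 p(x), of one discrete attribute."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def total_entropy_independent(columns):
    """Entropy of an attribute set, assuming the attributes are independent."""
    return sum(attribute_entropy(col) for col in columns)

a1 = ["x", "x", "y", "y"]  # two equally likely values: 1 bit
a2 = ["u", "u", "u", "u"]  # a constant attribute: 0 bits
print(attribute_entropy(a1), attribute_entropy(a2))
print(total_entropy_independent([a1, a2]))
```

An attribute whose values are spread more evenly has higher entropy; how the method turns entropies into the weights 2 and 1 depends on the formula in step (12), which this sketch does not attempt to reproduce.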
Step (2) is carried out by the following formula:
where EWdist(o, p) is the entropy-weighted distance; the formula uses the value of data object o on its attribute A_i, the weight w_i of attribute A_i with respect to object o (i = 1, 2, …, d), and the value of data object p on its attribute A_i.
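The formula itself is not reproduced in this text. One plausible reading, offered as an assumption rather than as the patent's own expression, is a weighted Euclidean distance in which each squared attribute difference is scaled by w_i. The symbols f_{A_i}(o) and f_{A_i}(p) for the attribute values are hypothetical placeholders:

```latex
\mathrm{EWdist}(o, p) = \sqrt{\sum_{i=1}^{d} w_i \,\bigl(f_{A_i}(o) - f_{A_i}(p)\bigr)^2}
```

Under this reading, setting every weight to 1 gives the ordinary Euclidean distance, and a weight of 2 on an outlying attribute stretches differences along that attribute.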
Step (4) is carried out by the following formula:
Step (5) is carried out by the following formula:
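The formulas for steps (4) and (5) are not reproduced in this text. Assuming the method keeps the LDOF definitions of the cited paper by Ke Zhang et al. and only swaps in the entropy-weighted distance, they would take the following form, where N_r (a hypothetical symbol) is the set of the k nearest neighbors of sample r:

```latex
d_r = \frac{1}{k} \sum_{q \in N_r} \mathrm{EWdist}(q, r), \qquad
D_r = \frac{1}{k(k-1)} \sum_{\substack{q, q' \in N_r \\ q \neq q'}} \mathrm{EWdist}(q, q'), \qquad
\mathrm{eldof}(r) = \frac{d_r}{D_r}
```

Here d_r is the mean distance from r to its neighbors and D_r is the mean pairwise distance among those neighbors, so a sample that lies far outside its own neighborhood receives a factor well above 1.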
Beneficial effects: compared with the prior art, the present invention has the following beneficial effects: 1. entropy-weight information is added to the LDOF algorithm, that is, an entropy-weighted distance is used when computing distances, which improves the accuracy of outlier detection; 2. detection does not require prior knowledge of the data distribution and does not depend on a training set of a given size or quality, which effectively improves the detection accuracy of the algorithm.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 shows the 20 randomly generated data points;
Fig. 3 compares the outlier factors of LDOF and ELDOF;
Fig. 4 shows the distribution of the 2-cluster data containing outliers;
Fig. 5 compares the detection accuracy of the LDOF and ELDOF algorithms for different numbers of neighbors;
Fig. 6 compares the outlier factors of the LDOF and ELDOF algorithms with 29 neighbors on the 2-cluster data;
Fig. 7 shows the outliers detected by the LDOF algorithm;
Fig. 8 shows the outliers detected by the method of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of the present invention. The specific steps are as follows:
Step 1: Let X be a random variable, S(X) the set of all possible values of X, and p(x) the probability of the value x. For a multidimensional attribute set A = {A_1, A_2, …, A_d}, the entropy is calculated by the formula:
If the sample attributes occur independently, this can be expressed as:
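The two formulas of Step 1 are not reproduced in this text. Assuming they are the standard Shannon entropy definitions (the patent's own notation may differ), the entropy of a single variable, the joint entropy of the attribute set, and its decomposition under independence read:

```latex
E(X) = -\sum_{x \in S(X)} p(x) \log p(x)

E(A) = -\sum_{x_1 \in S(A_1)} \cdots \sum_{x_d \in S(A_d)} p(x_1, \ldots, x_d) \log p(x_1, \ldots, x_d)

E(A) = E(A_1) + E(A_2) + \cdots + E(A_d) \quad \text{(independent attributes)}
```

The last line follows because the joint probability factorizes into a product of the marginals, so the logarithm splits into a sum.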
Step 2: Let A = {A_1, A_2, …, A_d} be the attribute set of a d-dimensional data set D, so that for a data object o ∈ D, A_i ∈ A. According to the formula, determine which attributes of the sample data are outlying attributes and set the weight of each outlying attribute to 2; if an attribute is a non-outlying attribute, its weight is 1;
Step 3: Using the weights determined in Step 2, take the value of a data object o on its attribute A_i and let w_i be the weight of attribute A_i with respect to object o (i = 1, 2, …, d); compute the entropy-weighted distance EWdist(o, p) according to the formula;
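Step 3 can be sketched in code. Because the distance formula is not reproduced in this text, the sketch assumes a weighted Euclidean form; the function name `ew_dist` and the sample points are hypothetical.

```python
import math

def ew_dist(o, p, weights):
    """Weighted Euclidean distance between objects o and p (assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(o, p, weights)))

o = (2.0, 2.0)
p = (5.0, 6.0)
print(ew_dist(o, p, (1, 1)))  # all weights 1: plain Euclidean distance
print(ew_dist(o, p, (2, 1)))  # first attribute treated as outlying (weight 2)
```

Raising a weight from 1 to 2 makes differences on that attribute count more, so objects that deviate on an outlying attribute move farther from their neighbors.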
Step 4: Using the entropy-weighted distance obtained in Step 3, find the k data points nearest to each sample;
Step 5: Using the entropy-weighted distance obtained in Step 3, calculate the k-nearest-neighbor distance d_r of each sample according to the formula;
Step 6: Using the entropy-weighted distance obtained in Step 3, calculate the k-nearest-neighbor average distance D_r of each sample according to the formula;
Step 7: Calculate eldof(r) of each sample according to the formula.
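Steps 4 to 7 can be sketched end to end as follows. This is an illustration under stated assumptions, not the patent's implementation: the distance is assumed to be weighted Euclidean, d_r and D_r are assumed to follow the LDOF definitions, one weight vector is shared by all objects, and the toy data are hypothetical.

```python
import math

def ew_dist(o, p, weights):
    """Weighted Euclidean distance (assumed form of the entropy-weighted distance)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(o, p, weights)))

def eldof_scores(data, weights, k):
    """Return eldof(r) = d_r / D_r for every sample r in data (requires k >= 2)."""
    scores = []
    for i, r in enumerate(data):
        # Step 4: the k samples nearest to r under the weighted distance.
        others = [q for j, q in enumerate(data) if j != i]
        neighbours = sorted(others, key=lambda q: ew_dist(r, q, weights))[:k]
        # Step 5: d_r, the mean distance from r to its k neighbours.
        d_r = sum(ew_dist(r, q, weights) for q in neighbours) / k
        # Step 6: D_r, the mean pairwise distance among the neighbours.
        pair_sum = sum(
            ew_dist(a, b, weights)
            for x, a in enumerate(neighbours)
            for y, b in enumerate(neighbours)
            if x != y
        )
        D_r = pair_sum / (k * (k - 1))
        # Step 7: the local outlier factor.
        scores.append(d_r / D_r)
    return scores

# Hypothetical toy data: a tight square plus one far-away point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = eldof_scores(data, weights=(2, 1), k=3)
top = max(range(len(data)), key=lambda i: scores[i])
print(top, scores[top])  # index and factor of the most outlying sample
```

The sketch divides by D_r without a guard, so it would fail if all k neighbors of a sample coincide; a practical version would need to handle that case.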
The information-entropy-weighted algorithm also needs the probability with which each attribute of each sample independently occurs. We compute these probabilities as follows. For 20 randomly generated two-dimensional samples, first take the center point of the 20 points; the ratio of each point's x-axis and y-axis distance from the center point then represents the probability value of that data point on each attribute. The ratio is obtained as follows: first compute the boundary distance of the 20 data points, denoted D_i (i = 1, 2); then compute the distance d_i (i = 1, 2) between each data point and the center on each attribute; the ratio for each attribute of the 20 data points is then d_i/(D_i/2) (i = 1, 2). This ratio is the required probability value on each attribute of the 20 data points.
The 20 randomly generated data points form the matrix y = [0.5 2.7; 1 3.6; 1.5 4.2; 2 2; 2.6 4; 2.8 4.3; 3.2 7.2; 3.6 4.3; 3.8 5.5; 4.1 0.6; 4.3 3.8; 4.4 4.2; 4.8 2.5; 5.1 3.3; 5.3 5.7; 5.6 4.7; 6 4.4; 6.4 1.7; 6.8 3.7; 7.8 4.2]. These 20 random data points are used for a preliminary verification of the proposed improvement. If information entropy weighting produces good results on the 20 randomly generated points, we then apply the proposed weighting method to a real data set for deeper verification. The 20 randomly generated data points are shown in Fig. 2.
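The ratio d_i/(D_i/2) can be computed for the 20 listed points with a short sketch. Two readings are assumed here and are not confirmed by the text: the "boundary distance" D_i is taken as the extent (maximum minus minimum) of the data on attribute i, and the center is taken as the midpoint of that extent. The function name is hypothetical.

```python
points = [
    (0.5, 2.7), (1.0, 3.6), (1.5, 4.2), (2.0, 2.0), (2.6, 4.0),
    (2.8, 4.3), (3.2, 7.2), (3.6, 4.3), (3.8, 5.5), (4.1, 0.6),
    (4.3, 3.8), (4.4, 4.2), (4.8, 2.5), (5.1, 3.3), (5.3, 5.7),
    (5.6, 4.7), (6.0, 4.4), (6.4, 1.7), (6.8, 3.7), (7.8, 4.2),
]

def attribute_ratios(points):
    """Return d_i / (D_i / 2) for every point and every attribute i."""
    dims = range(len(points[0]))
    lows = [min(p[i] for p in points) for i in dims]
    highs = [max(p[i] for p in points) for i in dims]
    spans = [highs[i] - lows[i] for i in dims]         # D_i (assumed: extent)
    centre = [(highs[i] + lows[i]) / 2 for i in dims]  # assumed: extent midpoint
    return [
        tuple(abs(p[i] - centre[i]) / (spans[i] / 2) for i in dims)
        for p in points
    ]

ratios = attribute_ratios(points)
print(ratios[0])  # ratios of the first point (0.5, 2.7)
```

With these assumptions every ratio lies between 0 and 1, and a point on the edge of the data range gets a ratio of about 1 on that attribute. If the center were instead the mean of the points, ratios above 1 could occur.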
As shown in Fig. 3, the curve of the method of the present invention (the ELDOF algorithm) fluctuates more, whereas that of the LDOF algorithm, which has no information entropy weighting, is comparatively flat. We therefore conclude that the eldof values computed by the ELDOF algorithm, with information entropy weighting, judge outliers more reasonably than the LDOF algorithm without it.
The Iris data set, which has 150 samples, is obtained. Each sample has 4 attributes: sepal length, sepal width, petal length, and petal width, all in cm. The data set consists of 3 classes of 50 samples each. Iris setosa is completely separated from the other two classes, Iris versicolor and Iris virginica, which overlap each other, so the data set can be analyzed as 2 classes or as 3 classes. Here it is divided into 2 classes, and only the two attributes sepal length and sepal width are used in the experiment. 21 samples are taken from the 1st class and 19 from the 2nd class, and 20 outliers are then added.
There are 40 normal data points and 20 added outliers. The first cluster contains 21 normal points and 9 outliers; the second cluster contains 19 normal points and 11 outliers. As shown in Fig. 4, the samples marked with five-pointed stars and the samples marked with asterisks are the two clusters, and the samples marked with diamonds are the outliers.
As shown in Fig. 5, when the number of nearest neighbors is 5, 10, or 15, the detection accuracy of the ELDOF algorithm is in each case 5% higher than that of the LDOF algorithm; when the number of neighbors is 20, 25, or 29, it is in each case 10% higher. With 29 nearest neighbors, the detection accuracy of the LDOF algorithm settles at 75%, while that of our ELDOF algorithm reaches 85%. We therefore conclude that, for every number of nearest neighbors tested, the ELDOF algorithm has a higher detection accuracy than the LDOF algorithm.
As shown in Fig. 6, with 29 nearest neighbors the ELDOF algorithm brings out the outliers more clearly and more reasonably than the LDOF algorithm. To compare the two algorithms more intuitively, Fig. 7 and Fig. 8 show the outlier detection results.
From Fig. 7 and Fig. 8, the ELDOF algorithm achieves a higher outlier detection rate than the LDOF algorithm. Comparing the detections of each algorithm with the originally added outliers gives the detection accuracies shown in Table 1.
Table 1. Comparison of the detection accuracy of the LDOF and ELDOF algorithms
According to Table 1, in outlier detection on the two clusters the detection accuracy of the ELDOF algorithm is 10% higher than that of the LDOF algorithm, so we can likewise conclude that the ELDOF algorithm performs better experimentally than the LDOF algorithm under the same conditions.
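The text does not define how detection accuracy is computed. One common choice, used here purely as an assumption, is to flag as many top-scoring samples as there are injected outliers and report the fraction of injected outliers recovered. The scores and indices below are hypothetical and are not the experimental data.

```python
def detection_accuracy(scores, true_outliers):
    """Fraction of known outliers found among the top-scoring samples."""
    n = len(true_outliers)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    detected = set(ranked[:n])  # flag as many samples as there are known outliers
    return len(detected & set(true_outliers)) / n

scores = [1.0, 0.9, 3.2, 1.1, 2.5, 1.4]  # hypothetical outlier factors
true_outliers = {2, 3}                     # hypothetical injected outliers
print(detection_accuracy(scores, true_outliers))
```

Running the same function on the LDOF and ELDOF scores of one data set would give directly comparable numbers, provided both use the same k and the same set of injected outliers.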
In addition, the running times of the LDOF and ELDOF algorithms with 29 nearest neighbors are compared in Table 2. Within an acceptable time range, the table shows that the algorithm with entropy weighting does take longer to run than the algorithm without it.
Table 2. Comparison of the running times of the two algorithms

Claims (5)

1. A local outlier detection method based on information entropy weighting, characterized by comprising the following steps:
(1) identify in advance which attributes of the sample data are outlying attributes, and set the outlying-attribute weights;
(2) compute the entropy-weighted distance;
(3) find the k data points nearest to each sample according to the entropy-weighted distance;
(4) calculate the k-nearest-neighbor distance d_r and the average distance D_r of each sample;
(5) calculate the local outlier factor eldof(r) of each sample.
2. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (1) comprises the following steps:
(11) Let X be a random variable, S(X) the set of all possible values of X, and p(x) the probability of the value x. For a multidimensional attribute set A = {A_1, A_2, …, A_d}, the entropy is calculated by the formula:
(12) Let A = {A_1, A_2, …, A_d} be the attribute set of a d-dimensional data set D, so that for a data object o ∈ D, A_i ∈ A. According to the formula, determine which attributes of the sample data are outlying attributes and set the weight of each outlying attribute to 2; if an attribute of the sample data is a non-outlying attribute, its weight is 1.
3. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (2) is carried out by the following formula:
where EWdist(o, p) is the entropy-weighted distance; the formula uses the value of data object o on its attribute A_i, the weight w_i of attribute A_i with respect to object o (i = 1, 2, …, d), and the value of data object p on its attribute A_i.
4. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (4) is carried out by the following formula:
5. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (5) is carried out by the following formula:
CN201910540443.2A 2019-06-21 2019-06-21 A local outlier detection method based on information entropy weighting Pending CN110288014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910540443.2A CN110288014A (en) 2019-06-21 2019-06-21 A local outlier detection method based on information entropy weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910540443.2A CN110288014A (en) 2019-06-21 2019-06-21 A local outlier detection method based on information entropy weighting

Publications (1)

Publication Number Publication Date
CN110288014A true CN110288014A (en) 2019-09-27

Family

ID=68005079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910540443.2A Pending CN110288014A (en) 2019-06-21 2019-06-21 A local outlier detection method based on information entropy weighting

Country Status (1)

Country Link
CN (1) CN110288014A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158871A (en) * 2021-04-15 2021-07-23 重庆大学 Wireless signal intensity abnormity detection method based on density core

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103812700A (en) * 2014-02-18 2014-05-21 西南大学 Message classifying method based on rule information entropy
CN107123989A (en) * 2017-05-25 2017-09-01 国网上海市电力公司 A topology identification method based on an improved local outlier factor algorithm
CN108280561A (en) * 2017-01-06 2018-07-13 重庆邮电大学 A quality tracing method for discrete-manufacturing mechanical products based on information entropy and weighted distance
CN108549669A (en) * 2018-03-21 2018-09-18 南京邮电大学 An outlier detection method for big data
CN109492667A (en) * 2018-10-08 2019-03-19 国网天津市电力公司电力科学研究院 A feature selection and identification method for non-intrusive electrical load monitoring

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
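KE ZHANG et al.: "A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data", PAKDD 2009: ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING *
Bao Xiaobing (包小兵) et al.: "Local outlier detection algorithm based on information entropy weighting" (基于信息熵加权的局部离群点检测算法), Computer Technology and Development (计算机技术与发展) *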

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158871A (en) * 2021-04-15 2021-07-23 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113158871B (en) * 2021-04-15 2022-08-02 重庆大学 Wireless signal intensity abnormity detection method based on density core

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190927
