CN110288014A

CN110288014A - A local outlier detection method based on information entropy weighting

Info

Publication number: CN110288014A
Application number: CN201910540443.2A
Authority: CN
Inventors: 王丽娜; 冯超; 邢梓萌; 沈朝瑶
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2019-09-27

Abstract

The invention discloses a local outlier detection method based on information entropy weighting. First, it is determined in advance which attributes of the sample data are outlying attributes, and the outlying attribute weight is set. Next, the entropy-weighted distance is computed. The k nearest data points of each sample are then found according to the entropy-weighted distance. Then the k-nearest-neighbor distance d_r and the average distance D_r of each sample are calculated. Finally, the local outlier factor eldof(r) of each sample is calculated. The invention adds entropy-weight information to the LDOF algorithm, that is, it uses an entropy-weighted distance when computing distances, which improves the accuracy of outlier detection at the cost of some additional running time. Detection does not require prior knowledge of the data distribution and does not depend on a training set of a certain size and quality, so detection accuracy is effectively improved.

Description

A local outlier detection method based on information entropy weighting

Technical field

The invention belongs to the field of data mining, and in particular relates to a local outlier detection method based on information entropy weighting.

Background art

With the rapid development of modern science and information technology, massive amounts of data containing valuable information have accumulated. To make full use of these data, data mining has become an important field of research. The main function of data mining is to discover unexpected but valuable knowledge from large, incomplete data sets. With the development of the natural and social sciences, clustering methods have been widely used in data mining, pattern recognition and machine learning. Clustering methods can roughly be divided into partitioning methods and hierarchical methods, which differ in how they build a structure over the data. A partitioning method splits a given data set into disjoint parts, whereas a hierarchical method decomposes a data set into a sequence of layered structures. The purpose of clustering is to find groups of data records such that the records within a given cluster are similar and the records in different clusters are dissimilar.

Clearly, the notion of similarity is the key to the clustering problem. Similarity is relatively easy to define for numerical attributes, because their values have a natural order and are continuous. Euclidean distance provides a widely accepted similarity measure for numerical attributes. However, many data sets also contain categorical attributes. A Euclidean distance function cannot be defined on these attributes, because their values are discrete and unordered, so their similarity cannot be measured with Euclidean distance. This gave rise to an important subfield of clustering research: the clustering of categorical data.

Outlier detection is a very important research branch of data mining. Its main function is to extract, from large and complex data sets, the small minority of data that differ greatly from the mainstream data. Researchers have proposed a large number of outlier detection algorithms, for example clustering-based and statistics-based outlier detection algorithms.

However, most existing outlier detection algorithms suffer from high time complexity. Researchers have therefore developed many techniques to improve algorithm performance. Problems nevertheless remain: the data distribution must be known in advance, a training set of a certain size is required, and the accuracy of outlier detection is not high.

Summary of the invention

Object of the invention: the object of the invention is to provide a local outlier detection method based on information entropy weighting that does not require prior knowledge of the data distribution, does not depend on a training set of a certain size and quality, and effectively improves detection accuracy.

Technical solution: the local outlier detection method based on information entropy weighting according to the invention comprises the following steps:

(1) determining in advance which attributes of the sample data are outlying attributes, and setting the outlying attribute weight;

(2) computing the entropy-weighted distance;

(3) finding, for each sample, its k nearest data points according to the entropy-weighted distance;

(4) calculating the k-nearest-neighbor distance d_r and the average distance D_r of each sample;

(5) calculating the local outlier factor eldof(r) of each sample.

Step (1) comprises the following sub-steps:

(11) Let X be a random variable, S(X) the set of values that X can take, and p(x) the probability that X takes the value x. For the multidimensional attribute set A = {A₁, A₂, …, A_d}, compute the entropy according to the entropy formula:

(12) Let A = {A₁, A₂, …, A_d} be the attribute set of a d-dimensional data set D, so that for a data object o ∈ D, A_i ∈ A. According to the outlying-attribute criterion formula, determine which attributes of the sample data are outlying attributes, and set the weight of an outlying attribute to 2; if an attribute of the sample data is a non-outlying attribute, its weight is 1.
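The entropy formula referred to in sub-step (11) is not reproduced in this text. With X, S(X) and p(x) defined as in that sub-step, it is presumably the standard Shannon entropy. The following is a reconstruction under that assumption only: the symbol E, the unspecified logarithm base, and the joint-probability notation p(x_1, …, x_d) are assumptions and are not taken from the original.

```latex
% Entropy of a single random variable X (assumed: Shannon entropy)
E(X) = -\sum_{x \in S(X)} p(x)\,\log p(x)

% Joint entropy of the attribute set A = \{A_1, \dots, A_d\}
E(A) = -\sum_{x_1 \in S(A_1)} \cdots \sum_{x_d \in S(A_d)}
       p(x_1, \dots, x_d)\,\log p(x_1, \dots, x_d)

% If the attributes occur independently, the joint entropy is additive
E(A) = E(A_1) + E(A_2) + \cdots + E(A_d)
```

The third line is a general property of Shannon entropy for independent variables, which is why independence of the attributes allows the entropy to be evaluated one attribute at a time.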

Step (2) is implemented by the following formula:

where EWdist(o, p) is the entropy-weighted distance between data objects o and p, computed from the value of o on its attribute A_i, the weight w_i of attribute A_i with respect to object o (i = 1, 2, …, d), and the value of p on its attribute A_i.

Step (4) is implemented by the following formula:

Step (5) is implemented by the following formula:
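The formulas for steps (2), (4) and (5) are not reproduced in this text. Since the method is described as the LDOF algorithm with the distance replaced by an entropy-weighted distance, a plausible reconstruction follows the published LDOF definitions. Everything below is an assumption: the weighted Euclidean form of EWdist, the symbols x_{o,i} and x_{p,i} for attribute values, and the neighbor-set notation N_k(r) do not appear in the original.

```latex
% Step (2), assumed form: weighted Euclidean distance
\mathrm{EWdist}(o, p) = \sqrt{\sum_{i=1}^{d} w_i \,\bigl(x_{o,i} - x_{p,i}\bigr)^2}

% Step (4): mean distance from r to its k nearest neighbors N_k(r)
d_r = \frac{1}{k} \sum_{q \in N_k(r)} \mathrm{EWdist}(r, q)

% Step (4): mean pairwise distance among the k nearest neighbors of r
D_r = \frac{1}{k(k-1)} \sum_{\substack{q, q' \in N_k(r) \\ q \neq q'}} \mathrm{EWdist}(q, q')

% Step (5): local outlier factor as the ratio of the two
\mathrm{eldof}(r) = \frac{d_r}{D_r}
```

Under this reading, a value of eldof(r) well above 1 means that r lies farther from its neighbors than the neighbors lie from each other.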

Beneficial effects: compared with the prior art, the invention has the following advantages: 1. entropy-weight information is added to the LDOF algorithm, that is, an entropy-weighted distance is used when computing distances, which improves the accuracy of outlier detection; 2. detection does not require prior knowledge of the data distribution and does not depend on a training set of a certain size and quality, which effectively improves detection accuracy.

Brief description of the drawings

Fig. 1 is the flow chart of the invention;

Fig. 2 shows the 20 randomly generated data points;

Fig. 3 compares the outlier factor values of LDOF and ELDOF;

Fig. 4 shows the distribution of the two-cluster data containing outliers;

Fig. 5 compares the detection accuracy of the LDOF and ELDOF algorithms for different numbers of neighbors;

Fig. 6 compares the outlier factor values of the LDOF and ELDOF algorithms on the two-cluster data with 29 neighbors;

Fig. 7 shows the outliers detected by the LDOF algorithm;

Fig. 8 shows the outliers detected by the method of the invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is the flow chart of the invention. The specific steps are as follows:

Step 1: Let X be a random variable, S(X) the set of values that X can take, and p(x) the probability that X takes the value x. For the multidimensional attribute set A = {A₁, A₂, …, A_d}, compute the entropy according to the entropy formula:

If the sample attributes occur independently, the entropy can be expressed as:

Step 2: Let A = {A₁, A₂, …, A_d} be the attribute set of a d-dimensional data set D, so that for a data object o ∈ D, A_i ∈ A. According to the outlying-attribute criterion formula, determine which attributes of the sample data are outlying attributes, and set the weight of an outlying attribute to 2; if an attribute is a non-outlying attribute, its weight is 1;

Step 3: Using the weights determined in Step 2, take the value of a data object o on its attribute A_i and let w_i be the weight of attribute A_i with respect to object o (i = 1, 2, …, d); compute the entropy-weighted distance EWdist(o, p) according to the distance formula;

Step 4: Using the entropy-weighted distance obtained in Step 3, find the k nearest data points of each sample;

Step 5: Using the entropy-weighted distance obtained in Step 3, calculate the k-nearest-neighbor distance d_r of each sample according to its formula;

Step 6: Using the entropy-weighted distance obtained in Step 3, calculate the average distance D_r among the k nearest neighbors of each sample according to its formula;

Step 7: Calculate eldof(r) for each sample according to its formula.
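Steps 2 to 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention. The weights 2 and 1 come from the text. The weighted Euclidean form of the distance, the LDOF-style definitions of d_r and D_r, the use of sample r's own weights for every distance in its neighborhood, and all function names are assumptions. The outlying-attribute criterion of Step 2 is not reproduced in the text, so the sketch takes the outlying flags as input.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def attribute_weights(is_outlying: Sequence[bool]) -> List[float]:
    """Step 2: weight 2 for an outlying attribute, 1 otherwise (values from the text)."""
    return [2.0 if flag else 1.0 for flag in is_outlying]


def ew_dist(o: Point, p: Point, w: Sequence[float]) -> float:
    """Step 3, ASSUMED form: Euclidean distance with each squared difference weighted."""
    return math.sqrt(sum(wi * (oi - pi) ** 2 for oi, pi, wi in zip(o, p, w)))


def eldof_scores(data: Sequence[Point],
                 weights: Sequence[Sequence[float]],
                 k: int) -> List[float]:
    """Steps 4 to 7: entropy-weighted, LDOF-style outlier factor for every sample."""
    n = len(data)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < number of samples")
    scores = []
    for r in range(n):
        w = weights[r]  # ASSUMPTION: r's weights are used throughout its neighborhood
        # Step 4: rank all other samples by entropy-weighted distance to r.
        ranked = sorted((ew_dist(data[r], data[j], w), j) for j in range(n) if j != r)
        nearest = ranked[:k]
        # Step 5: mean distance from r to its k nearest neighbors.
        d_r = sum(dist for dist, _ in nearest) / k
        # Step 6: mean pairwise distance among those k neighbors (ordered pairs).
        idx = [j for _, j in nearest]
        pair_sum = sum(ew_dist(data[a], data[b], w) for a in idx for b in idx if a != b)
        big_d_r = pair_sum / (k * (k - 1))
        # Step 7: ratio of the two; identical neighbors give an infinite score.
        scores.append(d_r / big_d_r if big_d_r > 0 else math.inf)
    return scores


# Illustrative data only: four corners of a unit square plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
weights = [attribute_weights([False, False])] * len(points)
scores = eldof_scores(points, weights, k=3)
```

In this toy example the isolated point (5, 5) receives the largest score, while the four square corners score close to 1.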

The entropy-weighted algorithm also requires, for each sample, the probability value with which each of its attributes occurs independently. These probability values are computed as follows. Twenty two-dimensional samples are generated at random and the center point of the 20 points is taken; the ratio of each point's distance from the center along the x axis and along the y axis then represents the probability value of that data point on each attribute. The ratio is obtained as follows: first compute the boundary distance of the 20 data points on each attribute, denoted D_i (i = 1, 2); then compute the distance d_i (i = 1, 2) between each data point and the center on each attribute; the ratio for each attribute of each data point is then d_i/(D_i/2) (i = 1, 2). These ratios are the required probability values of the 20 data points on each attribute.

The 20 randomly generated data points form the matrix y = [0.5 2.7; 1 3.6; 1.5 4.2; 2 2; 2.6 4; 2.8 4.3; 3.2 7.2; 3.6 4.3; 3.8 5.5; 4.1 0.6; 4.3 3.8; 4.4 4.2; 4.8 2.5; 5.1 3.3; 5.3 5.7; 5.6 4.7; 6 4.4; 6.4 1.7; 6.8 3.7; 7.8 4.2]. These 20 random data points are used for a preliminary verification of the proposed improvement: if information entropy weighting works well on them, the method is then applied to a real data set for more thorough verification. The 20 randomly generated data points are shown in Fig. 2.
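The ratio d_i/(D_i/2) described above can be sketched as follows, using the 20 points of matrix y. Two readings are assumptions here: the boundary distance D_i is taken as the range (maximum minus minimum) of attribute i, and the center is taken as the midpoint of that range, which keeps every ratio between 0 and 1. The text does not state whether the center is the midpoint or the mean, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

# The 20 data points of matrix y, as listed in the text.
Y: List[Tuple[float, float]] = [
    (0.5, 2.7), (1, 3.6), (1.5, 4.2), (2, 2), (2.6, 4),
    (2.8, 4.3), (3.2, 7.2), (3.6, 4.3), (3.8, 5.5), (4.1, 0.6),
    (4.3, 3.8), (4.4, 4.2), (4.8, 2.5), (5.1, 3.3), (5.3, 5.7),
    (5.6, 4.7), (6, 4.4), (6.4, 1.7), (6.8, 3.7), (7.8, 4.2),
]


def attribute_ratios(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return d_i / (D_i / 2) for every point and every attribute i."""
    dims = len(points[0])
    ratios = [[0.0] * dims for _ in points]
    for i in range(dims):
        column = [p[i] for p in points]
        lo, hi = min(column), max(column)
        half_extent = (hi - lo) / 2   # D_i / 2, with D_i ASSUMED to be the range
        center = (lo + hi) / 2        # ASSUMED center: midpoint of the range
        for row, value in zip(ratios, column):
            # d_i is the distance of this point from the center on attribute i.
            row[i] = abs(value - center) / half_extent if half_extent > 0 else 0.0
    return ratios


ratios = attribute_ratios(Y)
```

With the midpoint reading, points on the boundary of the data (for example the first point on the x axis) get a ratio of 1 and points near the center get a ratio near 0.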

As shown in Fig. 3, the curve of the method of the invention (the ELDOF algorithm) fluctuates more strongly, whereas that of the LDOF algorithm, which has no information entropy weighting, is comparatively flat. We therefore conclude that the eldof values computed by the ELDOF algorithm, which adds information entropy weighting, judge outliers more reasonably than the LDOF algorithm, which does not.

The Iris data set, which contains 150 samples, is then used. Each sample has 4 attributes: sepal length, sepal width, petal length and petal width, all in cm. The data set consists of 3 classes of 50 samples each. The Iris setosa class is completely separated from the other two classes, Iris versicolor and Iris virginica, while these latter two classes overlap, so the data set can be analyzed as 2 classes or as 3 classes. Here it is divided into 2 classes, and only the two attributes sepal length and sepal width are used in the experiment. 21 samples are taken from the first class and 19 from the second, and 20 outliers are then added.

There are 40 normal data points and 20 added outliers. The first cluster contains 21 normal points and 9 outliers; the second cluster contains 19 normal points and 11 outliers. As shown in Fig. 4, the samples marked with five-pointed stars and the samples marked with asterisks are the two clusters, and the samples marked with diamonds are the outliers.

As shown in Fig. 5, when the number of nearest neighbors is 5, 10 or 15, the detection accuracy of the ELDOF algorithm is 5 percentage points higher than that of the LDOF algorithm; when the number of neighbors is 20, 25 or 29, it is 10 percentage points higher. With 29 nearest neighbors the detection accuracy of the LDOF algorithm settles at 75%, while that of the ELDOF algorithm reaches 85%. We therefore conclude that, for every number of nearest neighbors tested, the ELDOF algorithm has a higher detection accuracy than the LDOF algorithm.

As shown in Fig. 6, with 29 nearest neighbors the ELDOF algorithm reveals the outliers more clearly and more reasonably than the LDOF algorithm. To compare the two algorithms more intuitively, Figs. 7 and 8 show the outlier detection results.

From Figs. 7 and 8, the ELDOF algorithm detects outliers at a higher rate than the LDOF algorithm. The result of each algorithm is compared with the originally added outliers, and the resulting detection accuracy is shown in Table 1.

Table 1. Comparison of the detection accuracy of the LDOF and ELDOF algorithms

According to Table 1, in the outlier detection on the two clusters the detection accuracy of the ELDOF algorithm is 10% higher than that of the LDOF algorithm. We can therefore also conclude that, under the same conditions, the ELDOF algorithm gives better experimental results than the LDOF algorithm.
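The text states that each algorithm's result is compared with the originally added outliers, but it does not define how detection accuracy is computed. One plausible reading, sketched below purely as an assumption, is to flag the samples with the highest outlier factors and report the fraction of added outliers that were flagged. The function name and the choice of flagging exactly n_flag samples are hypothetical.

```python
from typing import Iterable, Sequence


def detection_accuracy(scores: Sequence[float],
                       true_outliers: Iterable[int],
                       n_flag: int) -> float:
    """Fraction of the known outliers found among the n_flag highest scores."""
    truth = set(true_outliers)
    if not truth:
        raise ValueError("at least one known outlier is required")
    # Rank sample indices from highest to lowest outlier factor.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    return len(flagged & truth) / len(truth)


# Illustrative values only: samples 1 and 4 are the known outliers.
example = detection_accuracy([0.1, 5.0, 0.2, 3.0, 0.3], true_outliers=[1, 4], n_flag=2)
```

Here the two highest scores belong to samples 1 and 3, so only one of the two known outliers is flagged and the example evaluates to 0.5.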

In addition, Table 2 compares the running times of the LDOF and ELDOF algorithms with 29 nearest neighbors. The table shows that, while remaining within an acceptable range, the running time of the algorithm with entropy weighting is indeed longer than that of the algorithm without it.

Table 2. Comparison of the running times of the two algorithms

Claims

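1. A local outlier detection method based on information entropy weighting, characterized by comprising the following steps:

(1) determining in advance which attributes of the sample data are outlying attributes, and setting the outlying attribute weight;

(2) computing the entropy-weighted distance;

(3) finding, for each sample, its k nearest data points according to the entropy-weighted distance;

(4) calculating the k-nearest-neighbor distance d_r and the average distance D_r of each sample;

(5) calculating the local outlier factor eldof(r) of each sample.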

2. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (1) comprises the following sub-steps:

(11) letting X be a random variable, S(X) the set of values that X can take, and p(x) the probability that X takes the value x, and, for the multidimensional attribute set A = {A₁, A₂, …, A_d}, computing the entropy according to the entropy formula:

(12) letting A = {A₁, A₂, …, A_d} be the attribute set of a d-dimensional data set D, so that for a data object o ∈ D, A_i ∈ A; determining, according to the outlying-attribute criterion formula, which attributes of the sample data are outlying attributes, and setting the weight of an outlying attribute to 2; if an attribute of the sample data is a non-outlying attribute, its weight is 1.

3. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (2) is implemented by the following formula:

where EWdist(o, p) is the entropy-weighted distance between data objects o and p, computed from the value of o on its attribute A_i, the weight w_i of attribute A_i with respect to object o (i = 1, 2, …, d), and the value of p on its attribute A_i.

4. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (4) is implemented by the following formula:

5. The local outlier detection method based on information entropy weighting according to claim 1, characterized in that step (5) is implemented by the following formula: