CN107562778A

CN107562778A - An outlier mining method based on deviation features

Info

Publication number: CN107562778A
Application number: CN201710599251.XA
Authority: CN
Inventors: 王红滨; 冯梦园; 何鸣; 王念滨; 尹新亮; 顾正浩; 苏畅; 童鹏鹏; 曾庆宇; 张海彬
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2017-07-21
Filing date: 2017-07-21
Publication date: 2018-01-09
Anticipated expiration: 2037-07-21
Also published as: CN107562778B

Abstract

The invention discloses an outlier mining method based on deviation features, comprising the following steps: (1) dividing each dimension of the data set into h equal-width intervals, so that the whole data set is divided into h^d grids; (2) associating each data point with a grid index, and disregarding any grid that contains no data point; (3) for each grid in the space formed by the division, obtaining the centroid of the grid and calculating the local outlier factor of the centroid; (4) calculating the local outlier factor of each data object, where the local outlier factor of an object in the data set equals the local outlier factor of the centroid of the grid to which it belongs. When detecting outliers in a data set with the F_LOF detection algorithm, the invention divides the data space into grids and calculates the local outlier factors of the data points from the grid centroids, which reduces computation time and improves detection efficiency.

Description

An outlier mining method based on deviation features

Technical field

The present invention relates to the field of outlier mining, and in particular to an outlier mining method based on deviation features.

Background art

Outlier mining is an important research branch of data mining and has been applied effectively in practical fields such as log analysis, intrusion detection, and quality control. It comprises outlier detection and outlier analysis. Outlier detection applies an appropriate algorithm to detect abnormal behavior that differs from the majority of behavior, yielding outlying information that is genuine but rare. Outlier analysis then examines the detected information in depth to derive knowledge or patterns; its task is to identify observations whose data characteristics differ significantly from those of other data objects. Outlier detection is very important in data mining: if an anomaly is caused by variation inherent in the data, analyzing it can reveal deeper, latent, and valuable information. Outlier detection is therefore a meaningful research direction. Data mining experts define an outlier as follows: "An outlier is a distinctive data object in a data set whose behavior differs so much from that of the other data objects that it arouses suspicion that it is not a random deviation but was generated by an entirely different mechanism." This definition captures the essence of an outlier to some extent and has been widely cited, but it is a descriptive definition rather than a rigorous one. In fact, academia has long lacked a unified formal definition of an outlier, and researchers usually give a formal definition suited to the application at hand. Over the years, researchers have proposed different mathematical methods for different types of data sets to detect the outliers that arise in different situations. Although outliers do not follow the general distribution of the data and may be produced by some abnormal mechanism, mining them has high practical value, and the knowledge they carry is sometimes more important than that of normal data. Ignoring or discarding these points arbitrarily may lose valuable knowledge and strongly affect the results.

In general, outlier detection techniques fall into four main categories: statistics-based, distance-based, density-based, and clustering-based. To solve the problem that distance-based techniques cannot detect local outliers, density-based outlier detection techniques were proposed: the LOF algorithm and its variants. These techniques address how to measure and decide the degree of local outlierness, can detect local outliers, and handle data objects in regions of different density well. Their difficulty lies in parameter selection. The density-based LOF detection algorithm is now widely used in outlier mining, but its high time complexity on massive data limits its application.

The present invention addresses the high time complexity of the LOF outlier detection algorithm and, from the perspective of deviation features, proposes a fast LOF detection algorithm, denoted F_LOF. The F_LOF detection algorithm no longer computes the local outlier factor (degree of outlierness) of each data point over the whole data set; instead, it divides the data space into grids and computes the local outlier factors of the data points from the centroid of each grid. The algorithm can also be used efficiently for real-time outlier detection: each time a new data point is added to the data set, the grid structure of the existing data points can be reused, so that only the grid in which the new point lies has to be identified and no further local outlier factor has to be computed for it. The F_LOF detection algorithm achieves detection accuracy close to that of the traditional detection algorithm while markedly reducing computation time and improving efficiency, and thus gives a better overall detection result.

Summary of the invention

The object of the invention is to solve the problem of the high time complexity of the LOF detection algorithm by proposing, from the perspective of deviation features, an improved fast LOF detection algorithm: an outlier mining method based on deviation features.

To achieve the above object, the invention adopts the following technical solution: an outlier mining method based on deviation features, comprising the following steps:

(1) Divide each dimension of the data set D into h equal-width intervals, so that the whole data set is divided into h^d grids;

(2) Associate each data point x_i ∈ D with a grid index j = 1, ..., h^d; if a grid contains no data point, the grid is not considered;

(3) For each grid j in the space formed by the division, obtain the centroid C_j of the grid and calculate the local outlier factor Lof_k(C_j) of the centroid C_j;

(4) Calculate the local outlier factor of each data object; the local outlier factor of an object in the data set equals the local outlier factor of the centroid of the grid to which it belongs.
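Steps (1) to (3) can be illustrated with a minimal Python sketch of the grid division and centroid computation. It rests on several assumptions that are not stated in the patent: the grid bounds are the per-dimension minimum and maximum of the data, the grid index j is represented as a tuple of per-dimension cell indices rather than a single integer, and the helper names `cell_of`, `build_cells`, and `cell_centroids` are hypothetical.

```python
from collections import defaultdict


def cell_of(point, mins, maxs, h):
    """Map a d-dimensional point to a tuple of cell indices, each in [0, h-1]."""
    idx = []
    for x, lo, hi in zip(point, mins, maxs):
        # Equal-width intervals; a degenerate dimension collapses to one cell.
        width = (hi - lo) / h if hi > lo else 1.0
        cell = int((x - lo) / width)
        # Clamp so that x == hi falls into the last cell (assumption).
        idx.append(min(max(cell, 0), h - 1))
    return tuple(idx)


def build_cells(data, h):
    """Steps (1)-(2): associate every point with its cell; empty cells are never stored."""
    d = len(data[0])
    mins = [min(p[i] for p in data) for i in range(d)]
    maxs = [max(p[i] for p in data) for i in range(d)]
    cells = defaultdict(list)
    for p in data:
        cells[cell_of(p, mins, maxs, h)].append(p)
    return cells, mins, maxs


def cell_centroids(cells):
    """First half of step (3): the centroid of a cell is the mean of its points."""
    return {j: tuple(sum(c) / len(pts) for c in zip(*pts))
            for j, pts in cells.items()}
```

Storing only the cells that actually receive a point is one simple way to honor the rule in step (2) that empty grids are not considered.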

The centroid C_j of a grid and its local outlier factor Lof_k(C_j) in step (3) are calculated as follows:

(3.1) Calculate the k-th distance k_dist(C_j) of the centroid C_j. For two objects C_j and o in the data space, with Euclidean distance as the metric and a given positive integer k, the k-th distance of C_j is defined as the distance between C_j and o, denoted k_dist(C_j), where the object o satisfies the following conditions: there are at least k objects o' ∈ D \ {C_j} such that d(C_j, o') ≤ d(C_j, o); and there are at most k-1 objects o' ∈ D \ {C_j} such that d(C_j, o') < d(C_j, o);

(3.2) Calculate the k-th distance neighborhood N_k(C_j) of the centroid C_j. The set of centroid objects in the data space whose distance from C_j is less than or equal to k_dist(C_j) is defined as N_k(C_j), expressed as:

N_k(C_j) = { o | d(C_j, o) ≤ k_dist(C_j) };

(3.3) Calculate the reachability distance between the centroid C_j and the objects in its N_k(C_j). The reachability distance of C_j relative to another centroid o is the larger of the k-th distance of o and the distance between C_j and o, expressed as:

reach_dist_k(C_j, o) = max{ k_dist(o), d(C_j, o) },

where o ∈ N_k(C_j);

(3.4) Calculate the local reachability density lrd_k(C_j) of the centroid C_j. lrd_k(C_j) is the reciprocal of the average reachability distance between C_j and all objects in its k-th distance neighborhood N_k(C_j):

lrd_k(C_j) = \frac{1}{\sum_{o \in N_k(C_j)} reach_dist_k(C_j, o) / |N_k(C_j)|};

(3.5) From the above results, obtain the local outlier factor Lof_k(C_j) of the centroid C_j:

Lof_k(C_j) = \frac{\sum_{o \in N_k(C_j)} lrd_k(o) / lrd_k(C_j)}{|N_k(C_j)|}.
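The quantities in steps (3.1) to (3.5) can be sketched in plain Python as one function per step. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, the objects are addressed by their index in a list `pts`, all objects are assumed to be distinct (otherwise a reachability sum could be zero), and no attempt is made to cache distances.

```python
import math


def k_dist(i, pts, k):
    """(3.1) k-th smallest Euclidean distance from pts[i] to the other objects."""
    return sorted(math.dist(pts[i], pts[j])
                  for j in range(len(pts)) if j != i)[k - 1]


def neighborhood(i, pts, k):
    """(3.2) Indices of all objects within k_dist of pts[i]; ties can exceed k."""
    kd = k_dist(i, pts, k)
    return [j for j in range(len(pts))
            if j != i and math.dist(pts[i], pts[j]) <= kd]


def reach_dist(i, j, pts, k):
    """(3.3) max{k_dist(o), d(C_j, o)} with C_j = pts[i] and o = pts[j]."""
    return max(k_dist(j, pts, k), math.dist(pts[i], pts[j]))


def lrd(i, pts, k):
    """(3.4) Reciprocal of the mean reachability distance to the neighborhood."""
    nb = neighborhood(i, pts, k)
    return len(nb) / sum(reach_dist(i, j, pts, k) for j in nb)


def lof(i, pts, k):
    """(3.5) Mean ratio of the neighbors' lrd to the lrd of pts[i]."""
    nb = neighborhood(i, pts, k)
    return sum(lrd(j, pts, k) for j in nb) / (lrd(i, pts, k) * len(nb))
```

For the five collinear points (0, 0), (1, 0), (2, 0), (3, 0), (10, 0) with k = 2, the four evenly spaced points score 1 and the isolated point (10, 0) scores 5, which matches the intuition that a value well above 1 marks an outlier.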

In calculating the local outlier factor of each data object in step (4), if an object x_i in the data set D belongs to grid j, the local outlier factor of the object is expressed as:

Lof_{G_k}(x_i) = Lof_k(C_j).
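Putting steps (1) to (4) together, the following Python sketch shows one possible end-to-end reading of F_LOF. It is a sketch under stated assumptions, not the patent's reference implementation: the names `cell_of`, `lof_all`, and `f_lof` are hypothetical, grid bounds are taken from the data, grid indices are tuples, centroids are assumed distinct, and the number of non-empty grids must exceed k.

```python
import math
from collections import defaultdict


def cell_of(point, mins, maxs, h):
    """Tuple of per-dimension cell indices for a point (clamped to [0, h-1])."""
    idx = []
    for x, lo, hi in zip(point, mins, maxs):
        width = (hi - lo) / h if hi > lo else 1.0
        idx.append(min(max(int((x - lo) / width), 0), h - 1))
    return tuple(idx)


def lof_all(pts, k):
    """LOF of every object in pts, following steps (3.1)-(3.5)."""
    n = len(pts)
    dist = [[math.dist(p, q) for q in pts] for p in pts]
    kd = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]
    nb = [[j for j in range(n) if j != i and dist[i][j] <= kd[i]] for i in range(n)]
    lrd = [len(nb[i]) / sum(max(kd[j], dist[i][j]) for j in nb[i]) for i in range(n)]
    return [sum(lrd[j] for j in nb[i]) / (lrd[i] * len(nb[i])) for i in range(n)]


def f_lof(data, h, k):
    """Approximate LOF: every point inherits the LOF of its grid centroid."""
    d = len(data[0])
    mins = [min(p[i] for p in data) for i in range(d)]
    maxs = [max(p[i] for p in data) for i in range(d)]
    cells = defaultdict(list)
    for p in data:                                   # steps (1)-(2)
        cells[cell_of(p, mins, maxs, h)].append(p)
    keys = list(cells)
    cents = [tuple(sum(c) / len(cells[j]) for c in zip(*cells[j])) for j in keys]
    cell_score = dict(zip(keys, lof_all(cents, k)))  # step (3)
    return [cell_score[cell_of(p, mins, maxs, h)] for p in data]  # step (4)
```

In this sketch the expensive neighborhood computation runs over the centroids only, and the final list comprehension is the direct counterpart of Lof_{G_k}(x_i) = Lof_k(C_j).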

The beneficial effects of the invention are as follows:

With the outlier mining method based on deviation features of the invention, when detecting outliers in a data set, the F_LOF detection algorithm reduces computation time and improves detection efficiency by no longer computing the local outlier factor of each data point over the whole data set; instead, it divides the data space into grids and computes the local outlier factors of the data points from the grid centroids. Because the number of grids is smaller than the number of data points, the time complexity is markedly reduced at an acceptable error. The algorithm can also be used efficiently for real-time outlier detection: each time a new data point is added to the data set, the grid structure of the existing data points can be reused, so that only the grid position of the new point has to be identified and no further Lof value has to be computed for it.
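The real-time use described above amounts to a lookup, which the following Python sketch illustrates. The names `cell_of` and `score_new_point` and the stored mapping `cell_score` are hypothetical. Two behaviors are assumptions that the patent does not specify: a point outside the original bounds is clamped to the nearest boundary grid, and a point that falls into a previously empty grid yields `None` because no centroid Lof value exists for it.

```python
def cell_of(point, mins, maxs, h):
    """Tuple of per-dimension cell indices for a point (clamped to [0, h-1])."""
    idx = []
    for x, lo, hi in zip(point, mins, maxs):
        width = (hi - lo) / h if hi > lo else 1.0
        idx.append(min(max(int((x - lo) / width), 0), h - 1))
    return tuple(idx)


def score_new_point(point, mins, maxs, h, cell_score):
    """Return the stored centroid Lof of the grid containing point.

    No distances are recomputed; None signals a grid with no stored centroid.
    """
    return cell_score.get(cell_of(point, mins, maxs, h))
```

A caller would keep `mins`, `maxs`, `h`, and `cell_score` from the batch run and call `score_new_point` for each arriving point.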

Brief description of the drawings

Fig. 1 is a diagram of the mining process of the invention;

Fig. 2 is a flow block diagram of the invention;

Fig. 3 compares the computation time of the F_LOF detection algorithm of the invention on four data sets from the UCI database;

Fig. 4 compares the local outlier factors obtained on the data sets by the F_LOF detection algorithm of the invention and by the LOF algorithm;

Fig. 5 compares the detection accuracy of the F_LOF detection algorithm of the invention on four data sets from the UCI database.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

At present, outlier mining methods are being actively studied both in China and abroad. Scholars have proposed a variety of models and corresponding algorithms, each targeting different outlier types and specific practical problems and each with its own characteristics. Building on previous research, the invention addresses the high time complexity of the LOF outlier detection algorithm and, from the perspective of deviation features, combines the advantages of the classic algorithm with new ideas to propose an improved fast LOF detection algorithm.

Fig. 1 shows the mining process of the invention: starting from the definition of the outliers to be obtained, it proceeds through data preprocessing, execution of the outlier detection method, and evaluation and interpretation of the detection results, and finally arrives at knowledge.

As shown in Fig. 2, the flow block diagram of the invention, the F_LOF detection algorithm divides the whole data space into a certain number of grids and calculates the Lof value (local outlier factor) of each grid centroid rather than that of every data object in the whole data set. For each data object in the data set, its Lof value equals the Lof value of the centroid of the grid to which it belongs. For a data set D ⊂ R^d composed of n d-dimensional data objects, with the number of divisions in each dimension set to h, the detailed calculation steps of the outlier mining method based on deviation features are as follows:

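k-th distance neighborhood of the centroid C_j:

N_k(C_j) = { o | d(C_j, o) ≤ k_dist(C_j) };

reachability distance of the centroid C_j relative to another centroid o:

reach_dist_k(C_j, o) = max{ k_dist(o), d(C_j, o) },

where o ∈ N_k(C_j);

local outlier factor of an object x_i belonging to grid j:

Lof_{G_k}(x_i) = Lof_k(C_j).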

As shown in Fig. 3, the experimental data give the computation time required by the F_LOF detection algorithm and by the LOF detection algorithm to calculate Lof values on four data sets. For each data set, the x-axis shows the number of divisions per dimension in each experimental scheme, and the y-axis shows the time (in seconds) consumed in calculating the Lof values in that scheme. The experimental results show that on all four data sets the F_LOF detection algorithm takes less computation time than the LOF algorithm, and that its computation time increases as the division number h grows.
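Why the computation time grows with h can be illustrated by a rough count, sketched below in Python. The sketch assumes a naive LOF that evaluates every pairwise distance; the function names and the example values of n, h, and d are hypothetical and are not taken from the experiments in Fig. 3.

```python
def max_nonempty_cells(n, h, d):
    """Upper bound on occupied grids: at most h**d grids exist and at most n are occupied."""
    return min(n, h ** d)


def pairwise_distance_count(m):
    """Number of distinct pairs among m objects, the cost driver of a naive LOF."""
    return m * (m - 1) // 2


# Hypothetical example: 10,000 points in 3 dimensions, h = 5 divisions per dimension.
n, h, d = 10_000, 5, 3
cells = max_nonempty_cells(n, h, d)
print(cells, pairwise_distance_count(cells), pairwise_distance_count(n))
```

Under these assumptions the centroid-based computation touches at most 125 objects instead of 10,000, and the bound on occupied grids rises toward n as h increases, which is consistent with the trend reported for Fig. 3.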

As shown in Fig. 4, experiments on each data set show the gap between the exact Lof values obtained with the LOF detection algorithm and the approximate Lof values obtained with the F_LOF detection algorithm. In this experimental scheme, the x-axis shows the data points in descending order of exact Lof value, the y-axis shows the Lof value of each data point, and the two curves show the exact and the approximate Lof values respectively. The figure shows that, for all four data sets, the approximate Lof values obtained by this algorithm differ somewhat from the exact Lof values in absolute terms, but the differences are small and within an acceptable error range, while the time consumed is markedly reduced and detection efficiency is improved.

Fig. 5 shows, for each data set, how the accuracy changes with different division values. As illustrated, the accuracy of the algorithm improves as the number of grids increases. Considering the trade-off between accuracy and time efficiency, giving up a small amount of accuracy in exchange for a larger gain in time efficiency is an effective strategy in outlier detection.

In summary, the comparative analyses above show that the F_LOF detection algorithm proposed here, an outlier mining method based on deviation features, demonstrates its advantages both in running-time efficiency and in its application to real-time outlier detection.

It should be noted that the embodiments described above illustrate rather than limit the technical solution of the invention. Equivalent substitutions made by those of ordinary skill in the art, or other modifications made according to the prior art, shall fall within the scope of the claims of the invention as long as they do not depart from the idea and scope of the technical solution of the invention.

Claims

1. An outlier mining method based on deviation features, characterized by comprising the following steps:

(1) dividing each dimension of the data set D into h equal-width intervals, so that the whole data set is divided into h^d grids;

(2) associating each data point x_i ∈ D with a grid index j = 1, ..., h^d, wherein a grid that contains no data point is not considered;

(3) for each grid j in the space formed by the division, obtaining the centroid C_j of the grid and calculating the local outlier factor Lof_k(C_j) of the centroid C_j;

(4) calculating the local outlier factor of each data object, wherein the local outlier factor of an object in the data set equals the local outlier factor of the centroid of the grid to which it belongs.

2. The outlier mining method based on deviation features according to claim 1, characterized in that the centroid C_j of the grid and the local outlier factor Lof_k(C_j) of the centroid C_j in step (3) are calculated as follows:

(3.1) calculating the k-th distance k_dist(C_j) of the centroid C_j: for two objects C_j and o in the data space, with Euclidean distance as the metric and a given positive integer k, the k-th distance of C_j is defined as the distance between C_j and o, denoted k_dist(C_j), wherein the object o satisfies the following conditions: there are at least k objects o' ∈ D \ {C_j} such that d(C_j, o') ≤ d(C_j, o); and there are at most k-1 objects o' ∈ D \ {C_j} such that d(C_j, o') < d(C_j, o);

(3.2) calculating the k-th distance neighborhood N_k(C_j) of the centroid C_j: the set of centroid objects in the data space whose distance from C_j is less than or equal to k_dist(C_j) is defined as N_k(C_j), expressed as:

N_k(C_j) = { o | d(C_j, o) ≤ k_dist(C_j) };

(3.3) calculating the reachability distance between the centroid C_j and the objects in its N_k(C_j): the reachability distance of C_j relative to another centroid o is the larger of the k-th distance of o and the distance between C_j and o, expressed as:

reach_dist_k(C_j, o) = max{ k_dist(o), d(C_j, o) },

where o ∈ N_k(C_j);

(3.4) calculating the local reachability density lrd_k(C_j) of the centroid C_j: lrd_k(C_j) is the reciprocal of the average reachability distance between C_j and all objects in its k-th distance neighborhood N_k(C_j), calculated as:

lrd_k(C_j) = \frac{1}{\sum_{o \in N_k(C_j)} reach_dist_k(C_j, o) / |N_k(C_j)|};

(3.5) from the above results, obtaining the local outlier factor Lof_k(C_j) of the centroid C_j, calculated as:

Lof_k(C_j) = \frac{\sum_{o \in N_k(C_j)} lrd_k(o) / lrd_k(C_j)}{|N_k(C_j)|}.

3. The outlier mining method based on deviation features according to claim 1, characterized in that, in calculating the local outlier factor of each data object in step (4), if an object x_i in the data set D belongs to grid j, the local outlier factor of the object is expressed as:

Lof_{G_k}(x_i) = Lof_k(C_j).