CN105117485A - High-accuracy global outlier detection algorithm based on k-nearest neighbor

Info

Publication number: CN105117485A
Application number: CN201510593056.7A
Authority: CN
Inventors: 许红龙; 毛睿; 陆敏华; 李荣华; 王毅; 刘刚; 陆克中
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2015-09-17
Filing date: 2015-09-17
Publication date: 2015-12-02
Anticipated expiration: 2035-09-17
Also published as: CN105117485B

Abstract

The invention belongs to the field of data mining and provides a global outlier detection algorithm. The algorithm comprises the following steps: S1, a data set D is processed block by block; S2, the distance between each object of the data set D and each object of the first data block is calculated, the (m+k) nearest neighbors of each object of the first data block are updated, the outlier degree of each object is calculated in real time, and objects whose outlier degree is smaller than a threshold c are excluded from the data block; S3, after the first data block is processed, the objects not excluded from the data block are sorted by outlier degree in descending order, the first n objects are added to the TOP-n outliers, and the threshold c is updated; S4, the i-th data block is processed as in step S2, and after it is processed the TOP-n outliers and the threshold c are updated; after all data blocks are processed, the TOP-n outliers are output. The algorithm widens the range of applicable data sets and improves detection accuracy.

Description

A high-accuracy global outlier detection algorithm based on k-nearest neighbors

Technical field

The invention belongs to the technical field of data mining, and in particular relates to a high-accuracy global outlier detection algorithm based on k-nearest neighbors.

Background art

An outlier is also called an anomaly or an abnormal object. The most influential definition in academia today is the one proposed by Hawkins: "an outlier is a distinctive data point in a data set that behaves so differently from the other points that it arouses suspicion that it is not a random deviation but was generated by an entirely different mechanism." In addition, each class of outlier detection algorithm provides its own definition of an outlier. Outlier detection, also called anomaly detection, deviation detection, or outlier mining, uses some algorithm to find the outliers in a data set, for example the TOP-n outliers or all outliers that satisfy a given requirement. In other words, outlier detection mines the very small number of points in a large volume of data that differ markedly from the mainstream data.

Distance-based outlier detection algorithms are general-purpose. They do not require the user to have domain knowledge, nor do they assume that the data set follows any particular probability distribution. After Knorr and Ng first proposed a definition of distance-based outliers in 1998, researchers proposed a variety of outlier definitions and corresponding detection algorithms. Three definitions are in common use:

The first derives from the definition DB(p, D) proposed by Knorr and Ng: an object O in a data set T is an outlier if at least a fraction p of the objects in T lie at a distance greater than D from O. This is equivalent to saying that no more than k objects lie within distance R of O, a formulation that is clearly more intuitive. DB(p, D) is a binary definition: an object is either an outlier or a normal point.

The second is the definition proposed by Ramaswamy et al. in 2000. It uses the distance between an object O and its k-th nearest neighbor as the outlier degree, so objects can be ranked to obtain the TOP-n outliers, which to some extent avoids the poor precision of a binary definition.

The third is the definition proposed by Angiulli et al. in 2002. It is similar to the second: it uses the mean of the distances between an object O and its first k nearest neighbors as the outlier degree, which further improves accuracy over the second definition. It has therefore become the most widely used definition in research on outlier detection algorithms.

Of these three most commonly used distance-based outlier definitions, the third is generally considered to have the highest detection accuracy, but it is still not ideal, and the data sets to which it applies are limited.
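For illustration, the three prior-art definitions can be sketched in Python on one-dimensional data. This is only a sketch: the function names, the sample values, the use of the absolute difference as the distance, and the choice to count the fraction p among the other objects are all assumptions made here, not details taken from the cited works.

```python
def neighbor_distances(data, i):
    """Sorted distances from data[i] to every other object (self excluded)."""
    return sorted(abs(data[i] - x) for j, x in enumerate(data) if j != i)

def is_db_outlier(data, i, p, d):
    """DB(p, D): at least a fraction p of the other objects lie farther than d.

    Counting the fraction among the *other* objects is an assumption.
    """
    dists = neighbor_distances(data, i)
    far = sum(1 for dist in dists if dist > d)
    return far >= p * len(dists)

def kth_nn_distance(data, i, k):
    """Second definition: distance to the k-th nearest neighbor."""
    return neighbor_distances(data, i)[k - 1]

def mean_knn_distance(data, i, k):
    """Third definition: mean distance to the first k nearest neighbors."""
    return sum(neighbor_distances(data, i)[:k]) / k

sample = [1, 2, 3, 4, 20]                    # hypothetical data; 20 is far from the rest
print(is_db_outlier(sample, 4, 0.75, 10))    # binary answer for object 20
print(kth_nn_distance(sample, 4, 2))         # ranking score, second definition
print(mean_knn_distance(sample, 4, 2))       # ranking score, third definition
```

The first function returns only a yes/no answer, while the other two return a score that can be ranked, which is the distinction the text draws between the binary definition and the TOP-n definitions.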

Summary of the invention

The technical problem to be solved by the invention is to provide a global outlier detection algorithm that both widens the range of applicable data sets and improves detection accuracy.

The invention provides a high-accuracy global outlier detection algorithm based on k-nearest neighbors, comprising the following steps:

Step S1: the data set D is processed block by block; each portion processed at a time is called a data block, and the (m+k) nearest-neighbor distances of each object in a data block are initialized to the maximum value;

Step S2: the distance between each object of the data set D and each object of the first data block is calculated, and the (m+k) nearest neighbors of each object in the first data block are updated; the outlier degree of each object is calculated in real time, being set to infinity while the number of neighbors found is less than m+k, and any object whose outlier degree is less than the initial threshold c is excluded from the data block; the outlier degree of an object is the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors;

Step S3: after the first data block is processed, the objects in it that were not excluded are sorted by outlier degree in descending order, the first n objects are added to the TOP-n outliers, and the threshold c is updated;

Step S4: the distance between each object of the data set D and each object of the second data block is calculated, and the (m+k) nearest neighbors of each object in the second data block are updated; the outlier degree of each object is calculated in real time, being set to infinity while the number of neighbors found is less than m+k, and any object whose outlier degree is less than the threshold c is excluded from the data block;

Step S5: after the second data block is processed, if the outlier degree of an object in it that was not excluded is greater than an outlier degree in the TOP-n outliers, the TOP-n outliers are updated and the threshold c is updated;

Step S6: for the i-th data block, i = 3, 4, 5, ..., steps S4 and S5 are repeated until all data blocks have been processed, and the TOP-n outliers are then output.

Further, in step S2, the initial threshold c is set to 0.

Further, in steps S3 and S5, when the threshold c is updated, the outlier degree of the n-th outlier in the TOP-n outliers is used as the value of the threshold c.
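The threshold rule can be sketched as a small merge step. The function name `update_top_n` and the `(degree, object)` tuple layout are assumptions made for this sketch, as is keeping c at 0 while fewer than n outliers are known.

```python
import heapq

def update_top_n(top_n, survivors, n):
    """Merge surviving (degree, object) pairs into the current TOP-n list.

    Returns the new TOP-n list (largest degree first) and the new threshold c,
    taken as the outlier degree of the n-th outlier. Leaving c at 0 while
    fewer than n outliers are known is an assumption of this sketch.
    """
    merged = heapq.nlargest(n, top_n + survivors)
    c = merged[-1][0] if len(merged) == n else 0
    return merged, c

# Hypothetical survivors of two consecutive data blocks, as (degree, object).
top, c = update_top_n([], [(17, 15), (9, 9), (7, 1)], n=2)
print(top, c)
top, c = update_top_n(top, [(53, 35)], n=2)
print(top, c)
```

Because c only ever takes the value of the current n-th largest degree, it never decreases from one block to the next, so each later block is pruned at least as aggressively as the one before it.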

Compared with the prior art, the invention has the following beneficial effects. When calculating the outlier degree of an object, the algorithm first discards the object's m nearest neighbors and then sums the distances between the object and its next k nearest neighbors. This makes densely clustered outliers easier to detect while still detecting sparse outliers, which widens the range of applicable data sets. The algorithm also improves detection accuracy. In addition, setting a threshold excludes non-outliers early, which saves memory.

Brief description of the drawings

Fig. 1 is a flow chart of the high-accuracy global outlier detection algorithm based on k-nearest neighbors provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

In the invention, the outlier degree of an object is the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors.
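A minimal sketch of this outlier degree for one-dimensional data follows. The function name `outlier_degree` and the use of the absolute difference as the distance are assumptions of the sketch; the sample values are those of the one-dimensional data set used in the worked example of this description.

```python
def outlier_degree(data, i, m, k):
    """Sum of the distances from data[i] to its (m+1)-th .. (m+k)-th nearest neighbors.

    An object is not its own neighbor. If fewer than m+k neighbors exist,
    the degree is infinity.
    """
    dists = sorted(abs(data[i] - x) for j, x in enumerate(data) if j != i)
    if len(dists) < m + k:
        return float("inf")
    return sum(dists[m:m + k])  # skip the m nearest, sum the next k

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
print(outlier_degree(data, data.index(15), m=2, k=2))  # 8 + 9 = 17
print(outlier_degree(data, data.index(9), m=2, k=2))   # 4 + 5 = 9
```

With m = 0 the function reduces to the plain sum of the k nearest-neighbor distances; the parameter m is what lets an object ignore a few very close companions.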

A straightforward implementation of the high-accuracy global outlier detection algorithm based on k-nearest neighbors is introduced first:

Directly compute the m+k nearest neighbors of every object in the data set, then compute the outlier degree of each object p (that is, the sum of the distances between p and its (m+1)-th to (m+k)-th nearest neighbors), and sort by outlier degree to obtain the TOP-n outliers.

The advantage of this implementation is its simplicity; the disadvantage is that it consumes a great deal of memory.
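The direct approach can be sketched as follows, assuming one-dimensional data and the absolute difference as the distance. The name `brute_force_top_n` is hypothetical. The sketch scores every object before ranking, which is what makes it simple but memory-hungry on large data sets.

```python
import heapq

def brute_force_top_n(data, m, k, n):
    """Score every object, then return the n largest (degree, object) pairs."""
    scored = []
    for i, p in enumerate(data):
        # Distances from p to every other object; p is not its own neighbor.
        dists = sorted(abs(p - q) for j, q in enumerate(data) if j != i)
        degree = sum(dists[m:m + k]) if len(dists) >= m + k else float("inf")
        scored.append((degree, p))
    return heapq.nlargest(n, scored)

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
print(brute_force_top_n(data, m=2, k=2, n=2))  # [(97, 97), (89, 93)]
```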

In fact, as the search for an object's nearest neighbors proceeds, its "temporary" outlier degree (the outlier degree calculated in real time) can only decrease or stay the same, because ever-nearer neighbors keep being found. Therefore, once the temporary outlier degree of an object falls below the threshold c, the object can no longer be an outlier: continuing the search can only make its outlier degree smaller still. The algorithm uses the threshold c in this way to exclude non-outliers early.
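The non-increasing behavior of the temporary outlier degree can be checked with a short sketch. The generator name `temporary_degrees` and the heap-based bookkeeping are assumptions of this sketch; the data are the values of the worked example in this description.

```python
import heapq

def temporary_degrees(data, i, m, k):
    """Yield the temporary outlier degree of data[i] after each scanned object."""
    nearest = []  # max-heap via negated values: the m+k smallest distances so far
    for j, q in enumerate(data):
        if j != i:  # an object is not its own neighbor
            d = abs(data[i] - q)
            if len(nearest) < m + k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)  # drop the farthest kept neighbor
        if len(nearest) < m + k:
            yield float("inf")
        else:
            # The k largest of the m+k kept distances are neighbors m+1 .. m+k.
            yield sum(sorted(-x for x in nearest)[m:])

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
trace = list(temporary_degrees(data, data.index(6), m=2, k=2))
print(trace)
```

For object 6 the trace starts at infinity, drops to 14 once four neighbors are known, and falls to 8 after the fifth object is scanned, at which point a threshold of 9 would already rule it out.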

The high-accuracy global outlier detection algorithm based on k-nearest neighbors provided by the invention is introduced below. The algorithm comprises the following steps:

Step S2: the distance between each object of the data set D and each object of the first data block is calculated, and the (m+k) nearest neighbors of each object in the first data block are updated; the outlier degree of each object is calculated in real time, being set to infinity while the number of neighbors found is less than m+k, and any object whose outlier degree is less than the initial threshold c (the initial threshold c is set to 0) is excluded from the data block; the outlier degree of an object is the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors;

In steps S3 and S5, when the threshold c is updated, the outlier degree of the n-th outlier in the TOP-n outliers is used as the value of the threshold c.
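Steps S1 to S6 of the summary can be put together in one short sketch. It is an illustration under stated assumptions, not the patented implementation: the name `top_n_outliers` and the parameter `block_size` are hypothetical, the data are one-dimensional with the absolute difference as the distance, and a small sorted list stands in for whatever neighbor structure a real implementation would use.

```python
import heapq

def top_n_outliers(data, block_size, m, k, n):
    """Blocked TOP-n outlier detection with early exclusion by threshold c."""
    top_n, c = [], 0  # TOP-n as (degree, object) pairs; initial threshold is 0
    for start in range(0, len(data), block_size):
        # S1: candidates of this block, each starting with no known neighbors.
        block = {i: [] for i in range(start, min(start + block_size, len(data)))}
        for j, q in enumerate(data):  # S2/S4: scan the whole data set
            for i in list(block):
                if i == j:
                    continue  # an object is not its own neighbor
                dists = block[i]
                dists.append(abs(data[i] - q))
                dists.sort()
                del dists[m + k:]  # keep only the m+k nearest distances
                degree = sum(dists[m:]) if len(dists) == m + k else float("inf")
                if degree < c:
                    del block[i]  # excluded early; its degree can only shrink
        # S3/S5: merge the survivors into the TOP-n and update the threshold.
        survivors = [(sum(d[m:]) if len(d) == m + k else float("inf"), data[i])
                     for i, d in block.items()]
        top_n = heapq.nlargest(n, top_n + survivors)
        if len(top_n) == n:
            c = top_n[-1][0]
    return top_n

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
print(top_n_outliers(data, block_size=5, m=2, k=2, n=2))  # [(97, 97), (89, 93)]
```

Only the objects of the current block keep neighbor lists, so the working memory is bounded by the block size times m+k rather than by the size of the data set.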

A specific embodiment that uses the above algorithm to search for the TOP-n outliers is described below.

A one-dimensional data set contains the following objects:

3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62

It is divided into three data blocks of 5 objects each:

First data block: 3, 1, 4, 15, 9

Second data block: 2, 6, 5, 35, 7

Third data block: 97, 93, 23, 84, 62
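The partition into data blocks amounts to slicing the data set in order. In this sketch the helper name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(data, block_size):
    """Split the data set into consecutive blocks of at most block_size objects."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
for block in split_into_blocks(data, 5):
    print(block)
```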

Suppose m = 2, k = 2, and n = 2; that is, the outlier degree of an object is the sum of its distances to its 3rd and 4th nearest neighbors, and the 2 objects with the largest outlier degree are required.

Note that the neighbors of an object do not include the object itself, and that while the number of neighbors found so far is less than m+k, the outlier degree is set to infinity.

I. Processing the first data block (3, 1, 4, 15, 9), with threshold c = 0 (every outlier degree is greater than or equal to 0, so in practice no object is excluded early while the first data block is processed):

The distance between each object of the whole data set and each object of this data block is calculated:

(1) For the 1st object of the data set, 3, calculate its distance to each object of the data block and store the current nearest neighbors of each object of the data block. Calculate the outlier degree in real time; any object whose outlier degree is less than the threshold c is excluded from the data block.

(2) For the 2nd object of the data set, 1, calculate its distance to each object of the data block; if a distance is smaller than that of a stored neighbor, update the stored neighbors. Calculate the outlier degree; any object whose outlier degree is less than the threshold c is excluded from the data block.

(3)……

(4) For the last object of the data set, 62, calculate its distance to each object of the data block, try to update the neighbors, calculate the outlier degree, and check whether any object can be excluded early.

(5) After the distance between every object of the data set and the objects of this data block has been calculated, the neighbors are as follows:

After the first data block is processed, the outlier degree of every object that was not excluded early is calculated, and the two objects with the largest outlier degree are added to the TOP-n:

The first is 15, with outlier degree 17;

The second is 9, with outlier degree 9;

The n-th (here, the 2nd) outlier in the TOP-n has outlier degree 9, so the threshold c = 9.

II. Processing the second data block (2, 6, 5, 35, 7), with threshold c = 9:

(1) For the 1st object of the data set, 3, calculate its distance to each object of the data block and store the current nearest neighbors of each object of the data block. Calculate the outlier degree in real time (it is set to infinity while the number of neighbors is less than m+k) and check whether it is less than the threshold c, in which case the object is excluded from the data block early.

(2) For the 2nd object of the data set, 1, calculate its distance to each object of the data block; if a distance is smaller than that of a stored neighbor, update the stored neighbors. Calculate the outlier degree in real time (it is set to infinity while the number of neighbors is less than m+k); any object whose outlier degree is less than the threshold c is excluded from the data block.

(3) For the 3rd object of the data set, 4, calculate its distance to each object of the data block; if a distance is smaller than that of a stored neighbor, update the stored neighbors. Calculate the outlier degree in real time (it is set to infinity while the number of neighbors is less than m+k); any object whose outlier degree is less than the threshold c is excluded from the data block.

(4) For the 4th object of the data set, 15, calculate its distance to each object of the data block and calculate the outlier degree in real time; any object whose outlier degree is less than the threshold c is excluded from the data block.

At this point, every outlier degree is still greater than the threshold (c = 9).

(5) For the 5th object of the data set, 9, calculate its distance to each object of the data block and calculate the outlier degree in real time.

The current outlier degrees of objects 6 and 5 are below the threshold 9, so they can be deleted from the data block early (as the neighbor search continues, their outlier degrees can only decrease or stay the same, never increase).

(6) For the 6th object of the data set, 2, calculate its distance to each object of the data block and calculate the outlier degree in real time. (At this point only the three objects 2, 35, and 7 remain in the data block; the neighbors of 2 do not include 2 itself.)

(7) For the 7th object of the data set, 6, calculate its distance to each object of the data block and calculate the outlier degree in real time.

The current outlier degrees of objects 2 and 7 are below the threshold 9, so they can be deleted from the data block early. Only object 35 now remains in the data block.

(8)……

(9) After all objects of the data set have been compared with this data block, the outlier degrees are as follows:

Update the TOP-n:

The first is 35, with outlier degree 53;

The second is 15, with outlier degree 17;

Update the threshold: c = 17.
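The pruning of the second data block can be replayed with a short trace. The function name `trace_block` and its return values are assumptions of this sketch, which records the 1-based position of the data-set object whose scan caused each exclusion.

```python
def trace_block(data, block_indices, m, k, c):
    """Scan the data set against one block and record early exclusions."""
    neighbors = {i: [] for i in block_indices}
    excluded_at = {}  # object value -> 1-based position of the scan that excluded it
    for pos, q in enumerate(data, start=1):
        for i in list(neighbors):
            if i == pos - 1:
                continue  # an object is not its own neighbor
            d = neighbors[i]
            d.append(abs(data[i] - q))
            d.sort()
            del d[m + k:]  # keep only the m+k nearest distances
            if len(d) == m + k and sum(d[m:]) < c:
                excluded_at[data[i]] = pos
                del neighbors[i]
    survivors = {data[i]: (sum(d[m:]) if len(d) == m + k else float("inf"))
                 for i, d in neighbors.items()}
    return excluded_at, survivors

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
excluded_at, survivors = trace_block(data, range(5, 10), m=2, k=2, c=9)
print(excluded_at)  # objects 6 and 5 leave at position 5; objects 2 and 7 at position 7
print(survivors)    # only object 35 remains, with outlier degree 53
```

Object 2 reaches a temporary degree of exactly 9 after the 5th object is scanned and is kept, because exclusion requires a degree strictly less than c.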

III. The remaining data blocks are processed in the same way.

The TOP-n outliers finally obtained are:

The first is 97, with outlier degree 97;

The second is 93, with outlier degree 89.

The details are as follows:

In the invention, the outlier degree of an object is calculated by first discarding the object's m nearest neighbors and then summing the distances between the object and its next k nearest neighbors. Discarding the m nearest neighbors makes densely clustered outliers easier to detect while still detecting sparse outliers, which widens the range of applicable data sets and also improves detection accuracy. In addition, using the threshold c to exclude non-outliers early saves memory.

The algorithm provided by the invention is applicable to scientific research, where it can serve as a comparison algorithm; it is applied by implementing it in a program and running it on the data set to be examined. It is also applicable to industry, for example in network intrusion detection, credit card fraud detection, medical and public health monitoring, and industrial damage inspection, where it is applied in the same way.

The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A high-accuracy global outlier detection algorithm based on k-nearest neighbors, characterized in that it comprises the following steps:

2. The global outlier detection algorithm as claimed in claim 1, characterized in that, in step S2, the initial threshold c is set to 0.

3. The global outlier detection algorithm as claimed in claim 1, characterized in that, in steps S3 and S5, when the threshold c is updated, the outlier degree of the n-th outlier in the TOP-n outliers is used as the value of the threshold c.