CN105975519A

CN105975519A - Multi-supporting point index-based outlier detection method and system

Info

Publication number: CN105975519A
Application number: CN201610278832.9A
Authority: CN
Inventors: 许红龙; 毛睿; 陆敏华; 廖好; 李荣华; 王毅; 刘刚
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2016-04-28
Filing date: 2016-04-28
Publication date: 2016-09-28

Abstract

The invention provides a multi-supporting point index-based outlier detection method. The method comprises: a supporting point selection step of reading a data set and selecting a plurality of supporting points from the data set to form a supporting point set; an index creation step of calculating the distance between each object in the data set and each selected supporting point, forming a multidimensional data space with these distances as coordinates, and creating an index from the multidimensional data space; and an outlier detection step of dividing the index into data blocks and performing outlier detection on the data blocks block by block. The invention furthermore provides a multi-supporting point index-based outlier detection system. In the technical scheme provided, the index is created from the distances between the selected supporting points and the whole data set, so that the data space distortion caused by a single supporting point is avoided; and all sparse regions in the data set are detected preferentially, so that the outlier degree threshold rises more quickly and outlier detection is faster.

Description

Multi-supporting point index-based outlier detection method and system

Technical field

The present invention relates to the field of computers, and in particular to a multi-supporting point index-based outlier detection method and system.

Background art

An outlier is a distinctive data point in a data set whose behavior differs so much from that of the other points that it is suspected of being not a random deviation but the product of an entirely different mechanism. Outliers are also called anomalies or anomalous objects. Outlier detection, also called anomaly detection, deviation detection or outlier mining, uses some algorithm to find the outliers in a data set, for example the TOP n outliers or all outliers that satisfy a given condition. In other words, outlier detection mines massive data for the very few points that differ markedly from the mainstream data.

At present, the main outlier detection algorithms include the ORCA algorithm and the iORCA algorithm.

The ORCA algorithm randomly shuffles the order of the data set so as to obtain near-linear time complexity on average. In the worst case, however, its time complexity is still O(n²). Even in the average case, the outlier degree threshold rises slowly, so pruning efficiency is not ideal. When the data set is large, the required detection time is usually too long.

The iORCA algorithm has the following shortcomings. First, it uses only a single supporting point; this saves indexing time but distorts the data space and lowers index quality, so pruning cannot work efficiently. Second, to raise the outlier degree threshold as early as possible, iORCA gives priority to the regions far from the supporting point but ignores other sparse regions, which limits how fast the threshold can rise. Third, iORCA provides no supporting point selection algorithm, even though the quality of the supporting point is closely tied to the performance of the algorithm; in other words, iORCA simply picks its supporting point at random, and the results are unstable. Finally, iORCA uses only one termination rule to decide whether to stop detecting outliers, and fails to make full use of the triangle inequality of the metric space to further reduce the number of distance calculations.

Summary of the invention

In view of this, the object of the present invention is to provide a multi-supporting point index-based outlier detection method and system, so as to solve the problems in the prior art that using a single supporting point distorts the data space and that outlier detection is slow.

The present invention proposes a multi-supporting point index-based outlier detection method, the method comprising:

a supporting point selection step: reading a data set, and selecting a plurality of supporting points from the data set to form a supporting point set;

an index creation step: calculating the distance between each object in the data set and each of the selected supporting points, taking the distances as coordinates to form a multidimensional data space, and creating an index from the multidimensional data space;

an outlier detection step: dividing the index into data blocks, and detecting outliers in the data blocks block by block.

Preferably, the supporting point selection step specifically comprises:

after reading the data set, randomly selecting an initial reference point, and taking the point farthest from the initial reference point as the datum point;

calculating the distance between each object in the data set and the datum point;

sorting the objects by distance in ascending order;

dividing the data set into a plurality of equidistant segments;

sorting the segments by the number of objects they contain;

determining whether the segments contain equal numbers of objects;

if the numbers of objects contained in the segments are unequal, adding the quantity midpoints of the segments to the supporting point set in order;

if the numbers of objects contained in the segments are equal, preferentially adding the quantity midpoint of the segment closer to the initial reference point to the supporting point set.

Preferably, the index creation step specifically comprises:

selecting, according to the dimension of the multidimensional data to be produced, the corresponding number of supporting points from the supporting point set;

mapping each object in the data set to its distance values to the supporting points, so as to form the multidimensional data space;

mapping the multidimensional data space to integer coordinate values;

directly calculating the Hilbert code value of each pair of integer coordinate values with a Hilbert index mapping algorithm;

sorting the resulting Hilbert code values to create a Hilbert index.

Preferably, the outlier detection step specifically comprises:

dividing the Hilbert index into data blocks, and sorting the blocks by code value from sparse to dense to obtain the outlier detection order;

initializing the outlier degree threshold to 0, and reading the data set one data block at a time in the detection order;

if none of the objects in the current data block can be an outlier, proceeding directly to the next data block;

if some object in the current data block may be an outlier, searching for nearest neighbors in spiral order starting from the object at the middle position of the current data block, removing from the current data block every object determined not to be an outlier, and, once all objects in the current data block have been processed, updating the TOP n outliers and the outlier degree threshold and proceeding to the next data block;

when all data blocks have been processed, outputting the TOP n outliers.

In another aspect, the present invention also provides a multi-supporting point index-based outlier detection system, the system comprising:

a supporting point selection module, configured to read a data set and select a plurality of supporting points from the data set to form a supporting point set;

an index creation module, configured to calculate the distance between each object in the data set and each of the selected supporting points, take the distances as coordinates to form a multidimensional data space, and create an index from the multidimensional data space;

an outlier detection module, configured to divide the index into data blocks and detect outliers in the data blocks block by block.

Preferably, the supporting point selection module is specifically configured to:

sort the objects by distance in ascending order;

divide the data set into a plurality of equidistant segments;

if the numbers of objects contained in the segments are equal, add the quantity midpoint of the segment closer to the initial reference point to the supporting point set.

Preferably, the index creation module is specifically configured to:

map the multidimensional data space to integer coordinate values;

Preferably, the outlier detection module is specifically configured to:

when all data blocks have been processed, output the TOP n outliers.

In the technical scheme provided by the present invention, to reduce data space distortion, a plurality of supporting points are selected from the data set to create the index, while the time overhead of creating the index is kept very small relative to the total outlier detection time. To raise the outlier degree threshold faster, all sparse regions in the data set are detected preferentially, including the distant regions and the other sparse regions. To improve the stability of the algorithm's performance, an approximate dense-region supporting point selection algorithm is proposed, which selects supporting points of relatively good quality in a very short time. To further reduce the number of distance calculations and speed up outlier detection, multiple pruning rules are used to exclude as many non-outliers and non-k-nearest-neighbor objects as possible. By creating the index from the distances between a plurality of selected supporting points and the whole data set, the technical scheme avoids the data space distortion caused by a single supporting point; by detecting all sparse regions of the data set first, it raises the outlier degree threshold more quickly and speeds up outlier detection.

Brief description of the drawings

Fig. 1 is a flowchart of the multi-supporting point index-based outlier detection method in an embodiment of the present invention;

Fig. 2 is a detailed flowchart of step S11 shown in Fig. 1 in an embodiment of the present invention;

Fig. 3 is a detailed flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention;

Fig. 4 is a detailed flowchart of step S13 shown in Fig. 1 in an embodiment of the present invention;

Fig. 5 is a schematic diagram of the internal structure of the multi-supporting point index-based outlier detection system 10 in an embodiment of the present invention.

Detailed description

In order to make the objects, technical schemes and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.

The terms used in the technical scheme of the present invention are explained as follows:

Outlier degree: the outlier degree of an object expresses how strongly it is outlying. It is commonly taken to be the mean distance between the object and its k nearest neighbors, or the distance between the object and its k-th nearest neighbor;

Data block: the unit of outlier detection, made up of a number of objects in the data set; 1000 objects are commonly used as one data block;

Outlier degree threshold: the outlier degree of the n-th outlier among the TOP n outliers;

Spiral order: given an index sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 5 as the starting point, the spiral order is 5, 4, 6, 3, 7, 2, 8, ..., or 5, 6, 4, 7, 3, 8, 2, ..., that is, alternately one position before and one position after the starting point, and so on;

Quantity midpoint: the midpoint by count, that is, the object for which the number of objects greater than it and the number of objects smaller than it are equal or differ by no more than 1.
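The terms defined above can be illustrated with a minimal Python sketch. The function names, the use of Euclidean distance and the choice of the lower middle element for an even count are assumptions made for illustration only; the text does not prescribe them.

```python
import math
from typing import Iterator, Sequence, Tuple

Point = Tuple[float, ...]


def outlier_degree(x: Point, data: Sequence[Point], k: int,
                   use_mean: bool = False) -> float:
    """Outlier degree of x: distance to its k-th nearest neighbor,
    or the mean distance to its k nearest neighbors if use_mean is True.
    Euclidean distance is an assumption; any metric would do."""
    # x itself (the same object in data) is not its own neighbor.
    dists = sorted(math.dist(x, y) for y in data if y is not x)
    knn = dists[:k]
    return sum(knn) / len(knn) if use_mean else knn[-1]


def spiral_order(length: int, start: int) -> Iterator[int]:
    """Yield positions start, start-1, start+1, start-2, ... inside
    [0, length), alternating one step back and one step forward."""
    yield start
    step = 1
    while start - step >= 0 or start + step < length:
        if start - step >= 0:
            yield start - step
        if start + step < length:
            yield start + step
        step += 1


def quantity_midpoint(sorted_items: Sequence):
    """Middle element by count of an already sorted sequence
    (lower middle when the count is even, an assumed convention)."""
    return sorted_items[(len(sorted_items) - 1) // 2]
```

For the index sequence 1 to 10 with 5 as the starting point (position 4 when counting from zero), `spiral_order(10, 4)` visits the values 5, 4, 6, 3, 7, 2, 8, and so on, which matches the first ordering given in the definition.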

A specific embodiment of the present invention provides a multi-supporting point index-based outlier detection method, which mainly comprises the following steps:

S11, supporting point selection step: reading a data set, and selecting a plurality of supporting points from the data set to form a supporting point set;

S12, index creation step: calculating the distance between each object in the data set and each of the selected supporting points, taking the distances as coordinates to form a multidimensional data space, and creating an index from the multidimensional data space;

S13, outlier detection step: dividing the index into data blocks, and detecting outliers in the data blocks block by block.

The multi-supporting point index-based outlier detection method provided by the present invention creates the index from the distances between a plurality of selected supporting points and the whole data set, which avoids the data space distortion caused by a single supporting point, and detects all sparse regions of the data set first, which raises the outlier degree threshold more quickly and speeds up outlier detection.

The multi-supporting point index-based outlier detection method provided by the present invention is described in detail below.

Refer to Fig. 1, which is a flowchart of the multi-supporting point index-based outlier detection method in an embodiment of the present invention.

In step S11, the supporting point selection step: a data set is read, and a plurality of supporting points are selected from the data set to form a supporting point set.

In the present embodiment, the supporting point selection step S11 specifically comprises sub-steps S111 to S118, as shown in Fig. 2.

Refer to Fig. 2, which is a detailed flowchart of step S11 shown in Fig. 1 in an embodiment of the present invention.

In step S111, after the data set is read, an initial reference point is selected at random, and the point farthest from the initial reference point is taken as the datum point.

In step S112, the distance between each object in the data set and the datum point is calculated.

In step S113, the objects are sorted by distance in ascending order.

In step S114, the data set is divided into a plurality of equidistant segments.

In step S115, the segments are sorted by the number of objects they contain.

In step S116, it is determined whether the segments contain equal numbers of objects.

In step S117, if the numbers of objects contained in the segments are unequal, the quantity midpoints of the segments are added to the supporting point set in order.

In step S118, if the numbers of objects contained in the segments are equal, the quantity midpoint of the segment closer to the initial reference point is added to the supporting point set.

In the present embodiment, equidistant partition divides the data set by equal distance increments, based on the distance from the datum point to the object farthest from it. Suppose the maximum distance is $d_f$ and the data set is to be divided into $n$ segments; the boundaries are then placed at distances $d_f/n, 2d_f/n, \ldots, (n-1)d_f/n$ from the datum point, which divides the data set into $n$ segments of equal width whose object counts are not necessarily equal. Dense regions are identified by first counting the objects in each segment and then sorting the segments by this count; the segments with the most objects are the candidate regions for supporting point selection.

In the present embodiment, after the data set is read, a temporary reference point is selected at random as the initial reference point, and the data set is searched for the object farthest from it. With this farthest object as the datum point, the distance between each object in the data set and the datum point is calculated and the objects are sorted in ascending order of distance. Using the "equidistant partition + quantity midpoint" method, the quantity midpoint of each segment is then added to the supporting point candidate set. The number of objects in each segment is counted, and the segments are sorted by object count in descending order. Among segments with equal object counts, the one closest to the reference point is found by comparison, and its quantity midpoint is taken as the first supporting point. Whenever segments contain equal numbers of objects, the quantity midpoint of the segment closer to the reference point is preferentially chosen as a supporting point.

In the present embodiment, it should be noted that, for the supporting point candidate set to yield a sufficient number of supporting points, its size (that is, the number of segments) should exceed the number of supporting points to be selected. To ensure selection quality, the number of segments should generally be at least twice the number of supporting points. In addition, if a subset of the data set is used to select the supporting points, the subset must not be too small, again to ensure supporting point quality; one data block is typically used, and more data blocks should be used when the number of supporting points is large.
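The "equidistant partition + quantity midpoint" selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `select_supporting_points`, the Euclidean metric, the `seed` parameter, and the way ties are broken (by the distance from a segment's quantity midpoint to the initial reference point) are all hypothetical choices.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def select_supporting_points(data: Sequence[Point], num_points: int,
                             num_segments: int, seed: int = 0) -> List[Point]:
    """Pick up to num_points supporting points from the densest segments."""
    rng = random.Random(seed)
    initial = rng.choice(list(data))                    # random initial reference point
    datum = max(data, key=lambda p: math.dist(initial, p))  # farthest object = datum point

    # Distances to the datum point, sorted in ascending order.
    ranked = sorted((math.dist(datum, p), p) for p in data)
    d_f = ranked[-1][0]                                 # maximum distance

    # Equidistant partition: segment i covers [i*d_f/n, (i+1)*d_f/n).
    segments: List[List[Point]] = [[] for _ in range(num_segments)]
    for d, p in ranked:
        idx = 0 if d_f == 0 else min(int(d / d_f * num_segments), num_segments - 1)
        segments[idx].append(p)

    # One candidate per non-empty segment: its quantity midpoint.
    candidates = []
    for seg in segments:
        if seg:
            mid = seg[(len(seg) - 1) // 2]
            # Sort key: more objects first, then closer to the initial
            # reference point (assumed interpretation of the tie rule).
            candidates.append((-len(seg), math.dist(mid, initial), mid))
    candidates.sort()
    return [mid for _, _, mid in candidates[:num_points]]
```

Following the note above, `num_segments` would normally be chosen as at least twice `num_points`; if fewer non-empty segments exist than requested points, this sketch simply returns fewer points.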

Referring again to Fig. 1, in step S12, the index creation step: the selected supporting points are used to form a multidimensional data space, and an index is created from the multidimensional data space.

In the present embodiment, the index creation step S12 specifically comprises sub-steps S121 to S125, as shown in Fig. 3.

Refer to Fig. 3, which is a detailed flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention.

In step S121, the corresponding number of supporting points is selected from the supporting point set according to the dimension of the multidimensional data to be produced.

In step S122, each object in the data set is mapped to its distance values to the supporting points, so as to form the multidimensional data space.

In step S123, the multidimensional data space is mapped to integer coordinate values.

In step S124, a Hilbert index mapping algorithm directly calculates the Hilbert code value of each pair of integer coordinate values.

In step S125, the resulting Hilbert code values are sorted to create a Hilbert index.

In the present embodiment, after the data set is read, the supporting point selection algorithm is used to select a number of supporting points corresponding to the dimension of the multidimensional data to be produced, and each object in the data set is mapped to its distance values to the supporting points, forming the multidimensional data space (that is, real-valued coordinates). The real-valued coordinates are then mapped to integer coordinate values, and a Hilbert index mapping algorithm directly calculates the Hilbert code value of each pair of integer coordinate values, which completes the encoding of the metric space objects. Sorting these code values yields the Hilbert index.
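As an illustration of steps S121 to S125, the following Python sketch maps objects to distance coordinates, quantizes them and sorts them by Hilbert code. It assumes exactly two supporting points (one pair of integer coordinates per object), Euclidean distance, and a single global scale factor for quantization; these choices, the function names and the `order` parameter are assumptions, and `hilbert_code` uses the widely known iterative two-dimensional coordinate-to-index conversion rather than any algorithm specified in the text.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def to_pivot_space(data: Sequence[Point],
                   pivots: Sequence[Point]) -> List[Tuple[float, ...]]:
    """Map each object to the tuple of its distances to the supporting points."""
    return [tuple(math.dist(obj, p) for p in pivots) for obj in data]


def quantize(coords: Sequence[Tuple[float, ...]], order: int) -> List[Tuple[int, ...]]:
    """Scale real coordinates onto the integer grid [0, 2**order - 1]
    using one global scale factor (an assumed quantization scheme)."""
    side = (1 << order) - 1
    largest = max(max(c) for c in coords)
    if largest == 0:
        return [tuple(0 for _ in c) for c in coords]
    return [tuple(int(v / largest * side) for v in c) for c in coords]


def hilbert_code(x: int, y: int, order: int) -> int:
    """Position of grid cell (x, y) along a 2-D Hilbert curve of the given order."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def build_hilbert_index(data: Sequence[Point], pivots: Sequence[Point],
                        order: int = 8) -> List[Tuple[int, int]]:
    """Return (Hilbert code, object position) pairs sorted by code.
    Requires exactly two supporting points in this 2-D sketch."""
    grid = quantize(to_pivot_space(data, pivots), order)
    return sorted((hilbert_code(gx, gy, order), i) for i, (gx, gy) in enumerate(grid))
```

Because neighboring cells on the Hilbert curve are also neighbors in the grid, objects with similar distances to the supporting points tend to end up close together in the sorted index, which is what the later block-wise detection relies on.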

Referring again to Fig. 1, in step S13, the outlier detection step: the index is divided into data blocks, and outliers are detected in the data blocks block by block.

In the present embodiment, the outlier detection step S13 specifically comprises sub-steps S131 to S135, as shown in Fig. 4.

Refer to Fig. 4, which is a detailed flowchart of step S13 shown in Fig. 1 in an embodiment of the present invention.

In step S131, the Hilbert index is divided into data blocks, and the blocks are sorted by code value from sparse to dense to obtain the outlier detection order.

In step S132, the outlier degree threshold is initialized to 0, and the data set is read one data block at a time in the detection order.

In step S133, if none of the objects in the current data block can be an outlier, detection proceeds directly to the next data block.

In step S134, if some object in the current data block may be an outlier, nearest neighbors are searched in spiral order starting from the object at the middle position of the current data block, and every object determined not to be an outlier is removed from the current data block; once all objects in the current data block have been processed, the TOP n outliers and the outlier degree threshold are updated, and detection proceeds to the next data block.

In step S135, when all data blocks have been processed, the TOP n outliers are output.

In the present embodiment, the algorithm is described in pseudocode form by way of example. Input: the number of nearest neighbors k, the number of outliers to detect n, and the data set D. Output: the TOP n outliers. Step S13 then proceeds as follows:

After the index has been created, the index data is divided into data blocks (for example, 1000 objects per block), the increment of the Hilbert code value is calculated for each data block, and the blocks are sorted by this increment in descending order. Outliers are then detected block by block in the sorted order. For each data block, pruning rule three is applied first to determine whether the block can contain an outlier; if not, detection proceeds directly to the next data block. If it can, nearest neighbors are searched in spiral order, starting from the object at the middle position of the data block. For each object in the data block B under test, pruning rule one is first used to determine whether it can be an outlier; if it cannot, it is removed from data block B and detection moves on to the next object. If it may be an outlier, the search for its k nearest neighbors continues. Before each distance calculation, pruning rule two is used to determine whether the candidate can be one of its k nearest neighbors; if it cannot, the distance between the two is not calculated and detection moves directly to the next object. If it can, the distance between the two is calculated and an update of the object's k nearest neighbors is attempted; at the same time, the object's current outlier degree is compared with the threshold c, and if it is below c the object can no longer become an outlier and is removed from data block B.
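The block division and ordering at the start of this procedure can be sketched as follows. The sketch assumes that the "increment of the Hilbert code value" of a block means the difference between the last and first code in that block, so that a block spanning a wide code range is treated as sparse; this interpretation and the function name `order_blocks` are assumptions.

```python
from typing import List, Sequence, Tuple


def order_blocks(sorted_codes: Sequence[int], block_size: int) -> List[Tuple[int, int]]:
    """Split a sorted list of Hilbert codes into blocks of block_size objects
    and return the (start, end) slices ordered from sparsest to densest."""
    blocks = [(start, min(start + block_size, len(sorted_codes)))
              for start in range(0, len(sorted_codes), block_size)]

    def code_increment(block: Tuple[int, int]) -> int:
        start, end = block
        # Code range spanned by the block (assumed meaning of "increment").
        return sorted_codes[end - 1] - sorted_codes[start]

    # Descending increment: the blocks covering the widest code range come first.
    return sorted(blocks, key=code_increment, reverse=True)
```

With codes `[0, 1, 2, 3, 10, 50, 90, 130, 131, 132]` and `block_size=4`, the middle block spans the widest code range and would therefore be examined first.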

In the present embodiment, the three pruning rules are as follows:

(1) Pruning rule one: exclude non-outlier objects.

If $\mathrm{dist}(x, p_i) + \mathrm{dist}(p_i, nn_k(p_i, D)) < c$, where $p_i \in P$,

then $x$ cannot be an outlier.

In other words, the supporting point $p_i$ and its k nearest neighbors all lie within distance c of the object x, so x has at least k objects within radius c and its outlier degree is necessarily less than c.

(2) Pruning rule two: exclude objects that are not k nearest neighbors.

If $|\mathrm{dist}(x_t, p_i) - \mathrm{dist}(x_j, p_i)| > \mathrm{dist}(x_t, nn_k(x_t, D))$, where $p_i \in P$,

then $x_j$ cannot be a k nearest neighbor of $x_t$.

(3) Pruning rule three:

If $\mathrm{dist}(B, p_i) + \mathrm{dist}(p_i, nn_k(p_i, D)) < c$, where $p_i \in P$,

then none of the objects in data block B can be an outlier.

That is, every object in data block B has at least k nearest neighbors within distance c.
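The three rules can be written as small predicates over values that are already available from the index, so that applying them requires no new distance calculation. In this sketch, `pivot_knn_radius[i]` stands for dist(p_i, nn_k(p_i, D)) computed in advance, each object is represented by its vector of distances to the supporting points, and dist(B, p_i) is taken to be the largest distance from any object in block B to p_i. That last interpretation, like all function and parameter names here, is an assumption.

```python
from typing import Sequence


def rule_one(x_to_pivots: Sequence[float],
             pivot_knn_radius: Sequence[float], c: float) -> bool:
    """True if x cannot be an outlier: for some supporting point p_i,
    dist(x, p_i) + dist(p_i, nn_k(p_i, D)) < c."""
    return any(d + r < c for d, r in zip(x_to_pivots, pivot_knn_radius))


def rule_two(xt_to_pivots: Sequence[float], xj_to_pivots: Sequence[float],
             current_kth_dist: float) -> bool:
    """True if x_j cannot be a k nearest neighbor of x_t: by the triangle
    inequality, |dist(x_t, p_i) - dist(x_j, p_i)| is a lower bound on
    dist(x_t, x_j), so exceeding the current k-th neighbor distance rules x_j out."""
    return any(abs(a - b) > current_kth_dist
               for a, b in zip(xt_to_pivots, xj_to_pivots))


def rule_three(block_to_pivots: Sequence[Sequence[float]],
               pivot_knn_radius: Sequence[float], c: float) -> bool:
    """True if no object in block B can be an outlier, taking dist(B, p_i)
    as the largest object-to-p_i distance in the block (assumed)."""
    return any(max(vec[i] for vec in block_to_pivots) + r < c
               for i, r in enumerate(pivot_knn_radius))
```

Rule three is checked once per block before any object is examined, while rules one and two are checked per object and per candidate neighbor, which is where most distance calculations are saved.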

In the present embodiment, after a data block has been detected, a large number of its objects may in fact have been removed. Each remaining object is tentatively added to the TOP n outliers, and the outlier degree threshold c is updated. After all data blocks have been detected, the TOP n outliers are output.
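One common way to maintain the TOP n outliers and the threshold c is a min-heap of size n, sketched below. The class name `TopNOutliers` and its methods are hypothetical; the text does not specify a data structure. The threshold stays at 0 until n outliers have been collected, in line with the initialization described in step S132.

```python
import heapq
from typing import Hashable, List, Tuple


class TopNOutliers:
    """Keeps the n objects with the largest outlier degree seen so far."""

    def __init__(self, n: int) -> None:
        self.n = n
        self._heap: List[Tuple[float, Hashable]] = []   # min-heap of (degree, object id)

    @property
    def threshold(self) -> float:
        """Outlier degree of the n-th outlier; 0 until n outliers are known."""
        return self._heap[0][0] if len(self._heap) == self.n else 0.0

    def offer(self, obj_id: Hashable, degree: float) -> None:
        """Try to add an object that survived pruning."""
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (degree, obj_id))
        elif degree > self._heap[0][0]:
            # Replace the weakest of the current TOP n outliers.
            heapq.heapreplace(self._heap, (degree, obj_id))

    def result(self) -> List[Tuple[float, Hashable]]:
        """TOP n outliers, strongest first."""
        return sorted(self._heap, reverse=True)
```

The heap keeps the weakest of the current TOP n outliers at the root, so reading the threshold takes constant time and each update takes time logarithmic in n.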

The technical scheme provided by the present invention retains the generality of distance-based detection while providing a higher detection speed, and is compatible with multiple definitions of outlier. The technical scheme uses three pruning rules to exclude a large number of non-outliers and non-k-nearest neighbors, which reduces the number of distance calculations and speeds up outlier detection.

A specific embodiment of the present invention also provides a multi-supporting point index-based outlier detection system 10, which mainly comprises:

a supporting point selection module 11, configured to read a data set and select a plurality of supporting points from the data set to form a supporting point set;

an index creation module 12, configured to calculate the distance between each object in the data set and each of the selected supporting points, take the distances as coordinates to form a multidimensional data space, and create an index from the multidimensional data space;

an outlier detection module 13, configured to divide the index into data blocks and detect outliers in the data blocks block by block.

The multi-supporting point index-based outlier detection system 10 provided by the present invention creates the index from the distances between a plurality of selected supporting points and the whole data set, which avoids the data space distortion caused by a single supporting point, and detects all sparse regions of the data set first, which raises the outlier degree threshold more quickly and speeds up outlier detection.

Refer to Fig. 5, which is a schematic structural diagram of the multi-supporting point index-based outlier detection system 10 in an embodiment of the present invention. In the present embodiment, the multi-supporting point index-based outlier detection system 10 mainly comprises a supporting point selection module 11, an index creation module 12 and an outlier detection module 13.

The supporting point selection module 11 is configured to read a data set and select a plurality of supporting points from the data set to form a supporting point set.

In the present embodiment, the supporting point selection module 11 is specifically configured to: after reading the data set, randomly select an initial reference point and take the point farthest from the initial reference point as the datum point; calculate the distance between each object in the data set and the datum point; sort the objects by distance in ascending order; divide the data set into a plurality of equidistant segments; sort the segments by the number of objects they contain; determine whether the segments contain equal numbers of objects; if the numbers of objects contained in the segments are unequal, add the quantity midpoints of the segments to the supporting point set in order;

The index creation module 12 is configured to form a multidimensional data space from the selected supporting points and to create an index from the multidimensional data space.

In the present embodiment, the index creation module 12 is specifically configured to:

map the multidimensional data space to integer coordinate values;

In the present embodiment, the outlier detection module 13 is specifically configured to:

if some object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the object at the middle position of the current data block, remove from the current data block every object determined not to be an outlier, and, once all objects in the current data block have been processed, update the TOP n outliers and the outlier degree threshold and proceed to the next data block; when all data blocks have been processed, output the TOP n outliers.

In the multi-supporting point index-based outlier detection system 10 provided by the present invention, to reduce data space distortion, a plurality of supporting points are selected from the data set to create the index, while the time overhead of creating the index is kept very small relative to the total outlier detection time. To raise the outlier degree threshold faster, all sparse regions in the data set are detected preferentially, including the distant regions and the other sparse regions. To improve the stability of the algorithm's performance, an approximate dense-region supporting point selection algorithm is proposed, which selects supporting points of relatively good quality in a very short time. To further reduce the number of distance calculations and speed up outlier detection, multiple pruning rules are used to exclude as many non-outliers and non-k-nearest-neighbor objects as possible. By creating the index from the distances between a plurality of selected supporting points and the whole data set, the system 10 avoids the data space distortion caused by a single supporting point; by detecting all sparse regions of the data set first, it raises the outlier degree threshold more quickly and speeds up outlier detection.

The multi-supporting point index-based outlier detection system 10 provided by the present invention retains the generality of distance-based detection while providing a higher detection speed, and is compatible with multiple definitions of outlier. The system 10 uses three pruning rules to exclude a large number of non-outliers and non-k-nearest neighbors, which reduces the number of distance calculations and speeds up outlier detection.

It should be noted that in the above embodiments the units are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the present invention.

In addition, those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A multi-supporting point index-based outlier detection method, characterized in that the method comprises:

2. The multi-supporting point index-based outlier detection method according to claim 1, characterized in that the supporting point selection step specifically comprises:

sorting the objects by distance in ascending order;

dividing the data set into a plurality of equidistant segments;

3. The multi-supporting point index-based outlier detection method according to claim 2, characterized in that the index creation step specifically comprises:

mapping the multidimensional data space to integer coordinate values;

4. The multi-supporting point index-based outlier detection method according to claim 3, characterized in that the outlier detection step specifically comprises:

when all data blocks have been processed, outputting the TOP n outliers.

5. A multi-supporting point index-based outlier detection system, characterized in that the system comprises:

6. The multi-supporting point index-based outlier detection system according to claim 5, characterized in that the supporting point selection module is specifically configured to:

sort the objects by distance in ascending order;

divide the data set into a plurality of equidistant segments;

7. The multi-supporting point index-based outlier detection system according to claim 6, characterized in that the index creation module is specifically configured to:

map the multidimensional data space to integer coordinate values;

8. The multi-supporting point index-based outlier detection system according to claim 7, characterized in that the outlier detection module is specifically configured to:

when all data blocks have been processed, output the TOP n outliers.