CN105975519A - Multi-supporting point index-based outlier detection method and system - Google Patents


Info

Publication number
CN105975519A
Authority
CN
China
Prior art keywords
point
index
data
data block
outlier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610278832.9A
Other languages
Chinese (zh)
Inventor
许红龙
毛睿
陆敏华
廖好
李荣华
王毅
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201610278832.9A
Publication of CN105975519A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-supporting point index-based outlier detection method. The method comprises: a supporting point selection step of reading a data set and selecting a plurality of supporting points in the data set to form a supporting point set; an index creation step of calculating the distance between each object in the data set and each selected supporting point, forming a multidimensional data space with these distances as coordinates, and creating an index from the multidimensional data space; and an outlier detection step of dividing the index into data blocks and detecting outliers block by block. The invention further provides a multi-supporting point index-based outlier detection system. In this technical scheme, the index is created from distances between the selected supporting points and the whole data set, which avoids the data space distortion caused by a single supporting point; all sparse regions in the data set are detected first, so the outlier degree threshold rises more quickly and outlier detection is faster.

Description

Multi-supporting point index-based outlier detection method and system
Technical field
The present invention relates to the computer field, and in particular to a multi-supporting point index-based outlier detection method and system.
Background
An outlier is a distinctive data point in a data set whose behavior differs so much from that of the other points that it raises the suspicion that it is not a random deviation but was produced by an entirely different mechanism. Outliers are also called anomalies or anomalous objects. Outlier detection, also called anomaly detection, deviation detection, or outlier mining, uses an algorithm to find the outliers in a data set, for example the TOP-n outliers or all outliers that satisfy a given condition. In other words, outlier detection mines massive data for the very few points that differ markedly from the mainstream data.
At present, the main outlier detection algorithms include the ORCA algorithm and the iORCA algorithm.
The ORCA algorithm randomly shuffles the order of the data set in order to obtain near-linear time complexity on average. In the worst case, however, its time complexity is still $O(n^2)$. Even in the average case, the outlier degree threshold rises slowly, so pruning is not efficient enough. For large data sets, the required detection time is too long.
The iORCA algorithm has the following shortcomings. First, it uses only one supporting point; this saves indexing time but distorts the data space, lowers index quality, and prevents pruning from working well. Second, to raise the outlier degree threshold as early as possible, iORCA first detects the regions far from the supporting point but ignores other sparse regions, so the speed at which the threshold rises is limited. Third, iORCA provides no supporting point selection algorithm even though supporting point quality is closely tied to algorithm performance; in other words, iORCA selects its supporting point at random, and the results are unstable. Finally, iORCA decides whether to stop detecting outliers with a single termination rule and does not make full use of the triangle inequality of the metric space to further reduce the number of distance calculations.
Summary of the invention
In view of this, an object of the present invention is to provide a multi-supporting point index-based outlier detection method and system, intended to solve the problems in the prior art that using a single supporting point distorts the data space and that outlier detection is slow.
The present invention proposes a multi-supporting point index-based outlier detection method, the method comprising:
A supporting point selection step: read in a data set, and select a plurality of supporting points in the data set to form a supporting point set;
An index creation step: calculate the distance between each object in the data set and each selected supporting point, use the distances as coordinates to form a multidimensional data space, and create an index from the multidimensional data space;
An outlier detection step: divide the index into data blocks, and detect outliers in the data blocks block by block.
Preferably, the supporting point selection step specifically comprises:
After the data set is read in, randomly select an initial reference point, and take the point farthest from the initial reference point as the base point;
Calculate the distance between each object in the data set and the base point;
Sort the objects by distance in ascending order;
Divide the data set into multiple equidistant segments;
Sort the segments by the number of objects they contain;
Determine whether the segments contain equal numbers of objects;
If the segments contain unequal numbers of objects, add the quantity midpoint of each segment to the supporting point set in sorted order;
If segments contain equal numbers of objects, preferentially add to the supporting point set the quantity midpoint of the segment closer to the initial reference point.
Preferably, the index creation step specifically comprises:
Select from the supporting point set the number of supporting points that corresponds to the dimension of the multidimensional data to be produced;
Map each object in the data set to its distance values to the supporting points, to form the multidimensional data space;
Map the multidimensional data space to integer coordinate values;
Use a Hilbert index mapping algorithm to directly compute the Hilbert code value of each pair of integer coordinate values;
Sort the resulting Hilbert code values to create the Hilbert index.
Preferably, the outlier detection step specifically comprises:
Divide the Hilbert index into data blocks, and sort the data blocks by code value from sparse to dense to obtain the outlier detection order;
Initialize the outlier degree threshold to 0, and read the data set one data block at a time in detection order;
If no object in the current data block can be an outlier, proceed directly to the next data block;
If an object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the object's position in the current data block, and remove from the data block under detection any object determined not to be an outlier; once all objects in the current data block have been processed, update the TOP n outliers and the outlier degree threshold, and proceed to the next data block;
When all data blocks have been processed, output the TOP n outliers.
In another aspect, the present invention also provides a multi-supporting point index-based outlier detection system, the system comprising:
A supporting point selection module, configured to read in a data set and select a plurality of supporting points in the data set to form a supporting point set;
An index creation module, configured to calculate the distance between each object in the data set and each selected supporting point, use the distances as coordinates to form a multidimensional data space, and create an index from the multidimensional data space;
An outlier detection module, configured to divide the index into data blocks and detect outliers in the data blocks block by block.
Preferably, the supporting point selection module is specifically configured to:
After the data set is read in, randomly select an initial reference point, and take the point farthest from the initial reference point as the base point;
Calculate the distance between each object in the data set and the base point;
Sort the objects by distance in ascending order;
Divide the data set into multiple equidistant segments;
Sort the segments by the number of objects they contain;
Determine whether the segments contain equal numbers of objects;
If the segments contain unequal numbers of objects, add the quantity midpoint of each segment to the supporting point set in sorted order;
If segments contain equal numbers of objects, add to the supporting point set the quantity midpoint of the segment closer to the initial reference point.
Preferably, the index creation module is specifically configured to:
Select from the supporting point set the number of supporting points that corresponds to the dimension of the multidimensional data to be produced;
Map each object in the data set to its distance values to the supporting points, to form the multidimensional data space;
Map the multidimensional data space to integer coordinate values;
Use a Hilbert index mapping algorithm to directly compute the Hilbert code value of each pair of integer coordinate values;
Sort the resulting Hilbert code values to create the Hilbert index.
Preferably, the outlier detection module is specifically configured to:
Divide the Hilbert index into data blocks, and sort the data blocks by code value from sparse to dense to obtain the outlier detection order;
Initialize the outlier degree threshold to 0, and read the data set one data block at a time in detection order;
If no object in the current data block can be an outlier, proceed directly to the next data block;
If an object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the object's position in the current data block, and remove from the data block under detection any object determined not to be an outlier; once all objects in the current data block have been processed, update the TOP n outliers and the outlier degree threshold, and proceed to the next data block;
When all data blocks have been processed, output the TOP n outliers.
To reduce data space distortion, the technical scheme provided by the present invention selects multiple supporting points in the data set and creates an index from them, while keeping the index creation time very small relative to the total outlier detection time. To raise the outlier degree threshold faster, it detects all sparse regions in the data set first, including the far regions and the other sparse regions. To improve the stability of algorithm performance, it proposes an approximate dense-region supporting point selection algorithm that selects supporting points of relatively good quality in a very short time. To further reduce the number of distance calculations and speed up outlier detection, it uses multiple pruning rules that exclude more non-outliers and more objects that are not k nearest neighbors. Because the index is created from distances between the selected supporting points and the whole data set, the data space distortion caused by a single supporting point is avoided; because all sparse regions in the data set are detected first, the outlier degree threshold rises more quickly and outlier detection is faster.
Brief description of the drawings
Fig. 1 is a flowchart of the multi-supporting point index-based outlier detection method in an embodiment of the present invention;
Fig. 2 is a detailed flowchart of step S11 shown in Fig. 1 in an embodiment of the present invention;
Fig. 3 is a detailed flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention;
Fig. 4 is a detailed flowchart of step S13 shown in Fig. 1 in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the internal structure of the multi-supporting point index-based outlier detection system 10 in an embodiment of the present invention.
Detailed description
To make the objects, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
The terms used in the technical scheme of the present invention are explained as follows:
Outlier degree: the outlier degree of an object expresses how strongly it deviates; commonly either the mean of its distances to its k nearest neighbors or its distance to its k-th nearest neighbor is used as the outlier degree;
Data block: a unit of outlier detection, made up of several objects in the data set; 1000 objects are commonly used as one data block;
Outlier degree threshold: the outlier degree of the n-th outlier among the TOP n outliers;
Spiral order: given, for example, the index sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 5 as the starting point, the spiral order is 5, 4, 6, 3, 7, 2, 8, ..., or 5, 6, 4, 7, 3, 8, 2, ...; that is, positions are visited alternately one before and one after the starting point, moving outward;
Quantity midpoint: the midpoint by count; the number of objects greater than this object and the number of objects less than it are equal or differ by at most 1.
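The definitions of spiral order and quantity midpoint above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `spiral_order` and `quantity_midpoint` are hypothetical and do not appear in the patent, positions are 0-based, and for an even number of objects the lower of the two middle objects is taken as the quantity midpoint.

```python
from typing import Iterator, Sequence


def spiral_order(length: int, start: int, backward_first: bool = True) -> Iterator[int]:
    """Yield every position in 0..length-1, starting at `start` and then
    alternating one step before and one step after it, moving outward."""
    if not 0 <= start < length:
        raise ValueError("start must be a valid position")
    yield start
    for step in range(1, length):
        before, after = start - step, start + step
        pair = (before, after) if backward_first else (after, before)
        for pos in pair:
            if 0 <= pos < length:  # skip positions that fall off either end
                yield pos


def quantity_midpoint(sorted_items: Sequence):
    """Return the object with (almost) as many objects before it as after it.
    Assumption: for an even count the lower middle object is chosen."""
    return sorted_items[(len(sorted_items) - 1) // 2]


index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
start = index.index(5)
print([index[p] for p in spiral_order(len(index), start)])
```

With the example sequence from the definition and 5 as the starting point, the positions are visited in the order 5, 4, 6, 3, 7, 2, 8, 1, 9, 10.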
A specific embodiment of the present invention provides a multi-supporting point index-based outlier detection method, which mainly comprises the following steps:
S11, supporting point selection step: read in a data set, and select a plurality of supporting points in the data set to form a supporting point set;
S12, index creation step: calculate the distance between each object in the data set and each selected supporting point, use the distances as coordinates to form a multidimensional data space, and create an index from the multidimensional data space;
S13, outlier detection step: divide the index into data blocks, and detect outliers in the data blocks block by block.
The multi-supporting point index-based outlier detection method provided by the present invention creates the index from distances between the selected supporting points and the whole data set, which avoids the data space distortion caused by a single supporting point; it detects all sparse regions in the data set first, so the outlier degree threshold rises more quickly and outlier detection is faster.
The multi-supporting point index-based outlier detection method provided by the present invention is described in detail below.
Refer to Fig. 1, which is a flowchart of the multi-supporting point index-based outlier detection method in an embodiment of the present invention.
In step S11, the supporting point selection step: read in a data set, and select a plurality of supporting points in the data set to form a supporting point set.
In this embodiment, the supporting point selection step S11 specifically comprises sub-steps S111-S118, as shown in Fig. 2.
Refer to Fig. 2, which is a detailed flowchart of step S11 shown in Fig. 1 in an embodiment of the present invention.
In step S111, after the data set is read in, randomly select an initial reference point, and take the point farthest from the initial reference point as the base point.
In step S112, calculate the distance between each object in the data set and the base point.
In step S113, sort the objects by distance in ascending order.
In step S114, divide the data set into multiple equidistant segments.
In step S115, sort the segments by the number of objects they contain.
In step S116, determine whether the segments contain equal numbers of objects.
In step S117, if the segments contain unequal numbers of objects, add the quantity midpoint of each segment to the supporting point set in sorted order.
In step S118, if segments contain equal numbers of objects, add to the supporting point set the quantity midpoint of the segment closer to the initial reference point.
In this embodiment, equidistant partitioning divides the data set, by equal distance increments, over the range from the base point to the object farthest from it. Suppose the maximum distance is $d_f$ and the data set is to be divided into n segments; the data set can then be divided at distances $d_f/n$, $2d_f/n$, ..., $(n-1)d_f/n$ from the base point, which yields n segments of equal width whose object counts are not necessarily equal. Dense regions are identified by first counting the objects in each segment and then sorting by count; the segments with larger counts are the candidate regions for supporting point selection.
In this embodiment, after the data set is read in, a temporary reference point is randomly selected as the initial reference point, and the data set is searched for the object farthest from it. With that object as the base point, the distance between each object in the data set and the base point is calculated, and the objects are sorted by distance in ascending order. Using the "equidistant partition + quantity midpoint" method, the quantity midpoint of each segment is added to the supporting point candidate set. The number of objects in each segment is then counted, and the segments are sorted by object count in descending order. Among segments with equal object counts, the segment closest to the reference point is found by comparison, and its quantity midpoint is taken first as a supporting point. In short, whenever segments contain equal numbers of objects, the quantity midpoint of the segment closer to the reference point is preferred.
In this embodiment, note that for the supporting point candidate set to supply enough supporting points, its size (that is, the number of segments) should be greater than the number of supporting points to be selected. To ensure selection quality, the number of segments should generally be more than twice the number of supporting points. In addition, if a subset of the data set is used to select supporting points, the subset must not be too small, again to ensure supporting point quality; one data block is generally used, and more data blocks should be used when the number of supporting points is large.
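The "equidistant partition + quantity midpoint" procedure can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `select_supporting_points`, the use of Euclidean distance via `math.dist`, and the `seed` parameter are assumptions. The tie-break between segments with equal counts is read here as "the segment whose quantity midpoint is nearer the initial reference point", which is one possible interpretation of the text.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def select_supporting_points(
    data: Sequence[Point],
    num_points: int,
    num_segments: int,
    dist: Callable[[Point, Point], float] = math.dist,
    seed: int = 0,
) -> List[Point]:
    rng = random.Random(seed)
    initial = rng.choice(data)                          # S111: random initial reference point
    base = max(data, key=lambda x: dist(x, initial))    # S111: farthest object is the base point
    ranked = sorted(data, key=lambda x: dist(x, base))  # S112-S113: ascending distance to base
    d_far = dist(ranked[-1], base)
    width = d_far / num_segments or 1.0                 # guard against all-identical objects

    # S114: equidistant segments [0, d_f/n), [d_f/n, 2d_f/n), ...
    segments: List[List[Point]] = [[] for _ in range(num_segments)]
    for obj in ranked:
        idx = min(int(dist(obj, base) / width), num_segments - 1)
        segments[idx].append(obj)

    # Quantity midpoint of every non-empty segment becomes a candidate.
    candidates = []
    for seg in segments:
        if seg:
            mid = seg[(len(seg) - 1) // 2]
            candidates.append((len(seg), dist(mid, initial), mid))

    # S115-S118: densest segments first; equal counts are ordered by
    # nearness to the initial reference point (assumed reading).
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [mid for _, _, mid in candidates[:num_points]]
```

For example, `select_supporting_points(data, num_points=2, num_segments=4)` returns at most two objects of `data`, the first of which comes from the segment holding the most objects.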
Continuing to refer to Fig. 1, in step S12, the index creation step: form a multidimensional data space from the selected supporting points, and create an index from the multidimensional data space.
In this embodiment, the index creation step S12 specifically comprises sub-steps S121-S125, as shown in Fig. 3.
Refer to Fig. 3, which is a detailed flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention.
In step S121, select from the supporting point set the number of supporting points that corresponds to the dimension of the multidimensional data to be produced.
In step S122, map each object in the data set to its distance values to the supporting points, to form the multidimensional data space.
In step S123, map the multidimensional data space to integer coordinate values.
In step S124, use a Hilbert index mapping algorithm to directly compute the Hilbert code value of each pair of integer coordinate values.
In step S125, sort the resulting Hilbert code values to create the Hilbert index.
In this embodiment, after the data set is read, the supporting point selection algorithm is used to select as many supporting points as the dimension of the multidimensional data to be produced. Each object in the data set is mapped to its distance values to the supporting points, which forms the multidimensional data space (that is, real-valued coordinates). The real-valued coordinates are then mapped to integer coordinate values, and the Hilbert index mapping algorithm is used to compute directly the Hilbert code value of each pair of integer coordinate values; this completes the encoding of the metric space objects. Sorting these code values creates the Hilbert index.
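The index creation just described can be sketched for the two-supporting-point case as follows. The patent does not spell out its Hilbert mapping or how real coordinates become integers, so this sketch makes assumptions: `hilbert_code` is the widely used iterative coordinate-to-curve-position conversion for a 2-D Hilbert curve, `build_hilbert_index` is a hypothetical name, and real coordinates are quantized by linear scaling onto a `2^order × 2^order` grid.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def hilbert_code(order: int, x: int, y: int) -> int:
    """Position of integer cell (x, y) along a Hilbert curve that fills a
    2^order x 2^order grid."""
    side = 1 << order
    code = 0
    s = side >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        code += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = side - 1 - x, side - 1 - y
            x, y = y, x
        s >>= 1
    return code


def build_hilbert_index(
    data: Sequence[Point],
    pivots: Sequence[Point],
    order: int = 8,
    dist: Callable[[Point, Point], float] = math.dist,
) -> List[Tuple[int, int]]:
    """Return (hilbert_code, object_id) pairs sorted by code, using the
    first two supporting points as the two coordinate axes."""
    coords = [(dist(obj, pivots[0]), dist(obj, pivots[1])) for obj in data]
    largest = max(max(c) for c in coords) or 1.0
    cells = (1 << order) - 1
    entries = []
    for obj_id, (a, b) in enumerate(coords):
        x = int(a / largest * cells)     # real coordinate -> integer cell (assumed scaling)
        y = int(b / largest * cells)
        entries.append((hilbert_code(order, x, y), obj_id))
    entries.sort()
    return entries
```

Objects that are close in the supporting-point space tend to receive nearby code values, which is what makes a spiral search around an index position a reasonable way to find near neighbors early.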
Continuing to refer to Fig. 1, in step S13, the outlier detection step: divide the index into data blocks, and detect outliers in the data blocks block by block.
In this embodiment, the outlier detection step S13 specifically comprises sub-steps S131-S135, as shown in Fig. 4.
Refer to Fig. 4, which is a detailed flowchart of step S13 shown in Fig. 1 in an embodiment of the present invention.
In step S131, divide the Hilbert index into data blocks, and sort the data blocks by code value from sparse to dense to obtain the outlier detection order.
In step S132, initialize the outlier degree threshold to 0, and read the data set one data block at a time in detection order.
In step S133, if no object in the current data block can be an outlier, proceed directly to the next data block.
In step S134, if an object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the object's position in the current data block, and remove from the data block under detection any object determined not to be an outlier; once all objects in the current data block have been processed, update the TOP n outliers and the outlier degree threshold, and proceed to the next data block.
In step S135, when all data blocks have been processed, output the TOP n outliers.
In this embodiment, the algorithm is described by way of pseudocode. Input: the number of nearest neighbors k, the number of outliers to detect n, and the data set D. Output: the TOP n outliers. Step S13 then proceeds as follows:
After the index is created, the index data are divided into data blocks (for example, 1000 objects per data block); the increment of the Hilbert code value is computed for each data block, and the data blocks are sorted by this increment in descending order. Outliers are then detected block by block in the sorted order. For each data block, pruning rule three is applied first to determine whether the block can contain an outlier; if not, detection proceeds directly to the next data block; if so, nearest neighbors are searched in spiral order starting from the object's position in the data block. For each object in the data block B under detection, pruning rule one is first used to determine whether the object can be an outlier; if it cannot, it is removed from data block B and detection moves on to the next object; if it can, the search for its k nearest neighbors continues. Before a distance is calculated, pruning rule two is used to determine whether the other object can be one of the k nearest neighbors; if it cannot, the distance between the two objects is not calculated and the search moves on to the next object; if it can, the distance between the two objects is calculated and an update of the k nearest neighbors is attempted. At the same time, the object's current outlier degree is compared with the threshold c; if it is less than c, the object can no longer become an outlier and is removed from data block B.
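The block-by-block loop can be sketched as follows. This is a simplified illustration, not the patent's pseudocode: it uses the distance to the k-th nearest neighbor as the outlier degree, applies only the threshold test on the current outlier degree, and leaves out the three supporting-point pruning rules. The function names and the `(hilbert_code, object)` layout of `index` are assumptions.

```python
import heapq
import math
from typing import Iterator, List, Sequence, Tuple


def spiral(length: int, start: int) -> Iterator[int]:
    """Positions around `start`, alternating one step back and one forward."""
    yield start
    for step in range(1, length):
        for pos in (start - step, start + step):
            if 0 <= pos < length:
                yield pos


def detect_top_n(index: Sequence[Tuple[int, tuple]], k: int, n: int,
                 block_size: int, dist=math.dist) -> List[Tuple[float, tuple]]:
    """`index` holds (hilbert_code, object) pairs sorted by code.
    Returns up to n (outlier_degree, object) pairs, largest degree first."""
    blocks = [range(s, min(s + block_size, len(index)))
              for s in range(0, len(index), block_size)]
    # A larger code increment within a block is read as a sparser region,
    # so those blocks are detected first.
    blocks.sort(key=lambda b: index[b[-1]][0] - index[b[0]][0], reverse=True)

    top: List[Tuple[float, int]] = []   # min-heap of (degree, position)
    threshold = 0.0                     # outlier degree threshold c
    for block in blocks:
        survivors = []
        for pos in block:
            knn: List[float] = []       # negated k smallest distances (max-heap)
            pruned = False
            for other in spiral(len(index), pos):
                if other == pos:
                    continue
                d = dist(index[pos][1], index[other][1])
                if len(knn) < k:
                    heapq.heappush(knn, -d)
                elif d < -knn[0]:
                    heapq.heapreplace(knn, -d)
                # The current k-th distance only shrinks, so once it drops
                # below c the object can no longer be a TOP n outlier.
                if len(knn) == k and -knn[0] < threshold:
                    pruned = True
                    break
            if not pruned and len(knn) == k:
                survivors.append((-knn[0], pos))
        # After the whole block is processed, update TOP n and the threshold.
        for item in survivors:
            if len(top) < n:
                heapq.heappush(top, item)
            elif item[0] > top[0][0]:
                heapq.heapreplace(top, item)
        if len(top) == n:
            threshold = top[0][0]
    return [(deg, index[pos][1]) for deg, pos in sorted(top, reverse=True)]
```

Detecting the sparse blocks first fills the TOP n list with large outlier degrees early, so the threshold rises quickly and objects in dense blocks are pruned after only a few distance calculations.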
In this embodiment, the three pruning rules are as follows:
(1) Pruning rule one: exclude objects that are not outliers.
If $\operatorname{dist}(x, p_i) + \operatorname{dist}(p_i, nn_k(p_i, D)) < c$, where $p_i \in P$,
then x cannot be an outlier.
In other words, the distances from object x to the supporting point $p_i$ and to the k nearest neighbors of $p_i$ are all less than c, so object x has at least k objects within radius c, and its outlier degree must be less than c.
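Pruning rule one reduces to a comparison over precomputed values, as in the hedged sketch below. The function name `cannot_be_outlier` and its argument layout are hypothetical; the sketch assumes that the distance from x to each supporting point and the distance from each supporting point to its own k-th nearest neighbor are already stored.

```python
from typing import Sequence


def cannot_be_outlier(dist_x_to_pivots: Sequence[float],
                      pivot_knn_radius: Sequence[float],
                      threshold: float) -> bool:
    """Pruning rule one: True if some supporting point p_i satisfies
    dist(x, p_i) + dist(p_i, nn_k(p_i, D)) < c.

    dist_x_to_pivots[i] is dist(x, p_i); pivot_knn_radius[i] is the distance
    from p_i to its k-th nearest neighbor; threshold is c."""
    return any(d + r < threshold
               for d, r in zip(dist_x_to_pivots, pivot_knn_radius))
```

The test needs no new distance calculation at detection time, because both quantities are available from index creation and from the supporting points' own neighbor searches.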
(2) Pruning rule two: exclude objects that are not k nearest neighbors.
If $|\operatorname{dist}(x_t, p_i) - \operatorname{dist}(x_j, p_i)| > \operatorname{dist}(x_t, nn_k(x_t, D))$, where $p_i \in P$,
then $x_j$ cannot be one of the k nearest neighbors of $x_t$.
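The reasoning behind pruning rule two follows from the triangle inequality of the metric space and can be written out as below. Here $nn_k(x_t, D)$ is read as the k-th nearest neighbor of $x_t$ found so far, which is an interpretation of the rule rather than wording from the patent.

```latex
% Triangle inequality applied twice around the supporting point p_i:
\operatorname{dist}(x_t, p_i) \le \operatorname{dist}(x_t, x_j) + \operatorname{dist}(x_j, p_i),
\qquad
\operatorname{dist}(x_j, p_i) \le \operatorname{dist}(x_j, x_t) + \operatorname{dist}(x_t, p_i)

% Together these give a lower bound that needs no new distance calculation:
\operatorname{dist}(x_t, x_j) \ge \left| \operatorname{dist}(x_t, p_i) - \operatorname{dist}(x_j, p_i) \right|

% If the bound already exceeds the current k-th nearest neighbor distance,
\left| \operatorname{dist}(x_t, p_i) - \operatorname{dist}(x_j, p_i) \right|
  > \operatorname{dist}\bigl(x_t, nn_k(x_t, D)\bigr)
\;\Longrightarrow\;
\operatorname{dist}(x_t, x_j) > \operatorname{dist}\bigl(x_t, nn_k(x_t, D)\bigr)
% so x_j cannot displace any of the current k nearest neighbors of x_t.
```

Both distances to $p_i$ are already stored as index coordinates, so the rule lets the search skip the computation of $\operatorname{dist}(x_t, x_j)$ entirely.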
(3) Pruning rule three: exclude data blocks that contain no outliers.
If $\operatorname{dist}(B, p_i) + \operatorname{dist}(p_i, nn_k(p_i, D)) < c$, where $p_i \in P$,
then no object in data block B can be an outlier.
That is, every object in data block B has at least k nearest neighbors within distance c.
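Pruning rule three can be sketched in the same style as a check on a whole block. The patent does not define $\operatorname{dist}(B, p_i)$; the sketch below assumes it is the largest distance from any object of B to $p_i$, since that is the reading under which the conclusion holds for every object in the block. The function name `block_has_no_outlier` is hypothetical.

```python
from typing import Sequence


def block_has_no_outlier(block_pivot_dists: Sequence[Sequence[float]],
                         pivot_knn_radius: Sequence[float],
                         threshold: float) -> bool:
    """Pruning rule three: True if for some supporting point p_i
    dist(B, p_i) + dist(p_i, nn_k(p_i, D)) < c.

    block_pivot_dists[j][i] is the distance from the j-th object of block B
    to p_i. Assumption: dist(B, p_i) is the maximum of these over the block."""
    for i, radius in enumerate(pivot_knn_radius):
        farthest = max(row[i] for row in block_pivot_dists)
        if farthest + radius < threshold:
            return True
    return False
```

When the function returns True, the block can be skipped without any nearest neighbor search.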
In this embodiment, after a data block has been processed, many of its objects may in fact have been removed. The remaining objects are tried one by one for inclusion in the TOP n outliers, and the outlier degree threshold c is updated. After all data blocks have been processed, the TOP n outliers are output.
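The update of the TOP n outliers and of the threshold c can be sketched with a min-heap, as below. The function name `update_top_n` and the `(outlier_degree, object_id)` tuple layout are assumptions made for illustration; the threshold is taken to stay at 0 until n outliers have been collected, consistent with the definition of the outlier degree threshold as the degree of the n-th outlier.

```python
import heapq
from typing import List, Sequence, Tuple


def update_top_n(top: List[Tuple[float, int]],
                 survivors: Sequence[Tuple[float, int]],
                 n: int) -> float:
    """Try to add each surviving (outlier_degree, object_id) pair to the
    min-heap `top` (modified in place), and return the new threshold c."""
    for degree, obj_id in survivors:
        if len(top) < n:
            heapq.heappush(top, (degree, obj_id))
        elif degree > top[0][0]:
            heapq.heapreplace(top, (degree, obj_id))  # evict the weakest outlier
    return top[0][0] if len(top) == n else 0.0
```

The heap root is always the n-th outlier, so reading the threshold after each data block costs constant time.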
To reduce data space distortion, the technical scheme provided by the present invention selects multiple supporting points in the data set and creates an index from them, while keeping the index creation time very small relative to the total outlier detection time. To raise the outlier degree threshold faster, it detects all sparse regions in the data set first, including the far regions and the other sparse regions. To improve the stability of algorithm performance, it proposes an approximate dense-region supporting point selection algorithm that selects supporting points of relatively good quality in a very short time. To further reduce the number of distance calculations and speed up outlier detection, it uses multiple pruning rules that exclude more non-outliers and more objects that are not k nearest neighbors. Because the index is created from distances between the selected supporting points and the whole data set, the data space distortion caused by a single supporting point is avoided; because all sparse regions in the data set are detected first, the outlier degree threshold rises more quickly and outlier detection is faster.
While retaining the generality of distance-based detection, the technical scheme provided by the present invention offers higher detection speed and is compatible with several definitions of an outlier. It uses three pruning rules to exclude a large number of non-outliers and of objects that are not k nearest neighbors, which reduces the number of distance calculations and speeds up outlier detection.
A specific embodiment of the present invention also provides a multi-supporting point index-based outlier detection system 10, which mainly comprises:
A supporting point selection module 11, configured to read in a data set and select a plurality of supporting points in the data set to form a supporting point set;
An index creation module 12, configured to calculate the distance between each object in the data set and each selected supporting point, use the distances as coordinates to form a multidimensional data space, and create an index from the multidimensional data space;
An outlier detection module 13, configured to divide the index into data blocks and detect outliers in the data blocks block by block.
The multi-supporting point index-based outlier detection system 10 provided by the present invention creates the index from distances between the selected supporting points and the whole data set, which avoids the data space distortion caused by a single supporting point; it detects all sparse regions in the data set first, so the outlier degree threshold rises more quickly and outlier detection is faster.
Refer to Fig. 5, which is a schematic structural diagram of the multi-supporting point index-based outlier detection system 10 in an embodiment of the present invention. In this embodiment, the multi-supporting point index-based outlier detection system 10 mainly comprises a supporting point selection module 11, an index creation module 12, and an outlier detection module 13.
The supporting point selection module 11 is configured to read in a data set and select a plurality of supporting points in the data set to form a supporting point set.
In this embodiment, the supporting point selection module 11 is specifically configured to: after the data set is read in, randomly select an initial reference point, and take the point farthest from the initial reference point as the base point; calculate the distance between each object in the data set and the base point; sort the objects by distance in ascending order; divide the data set into multiple equidistant segments; sort the segments by the number of objects they contain; determine whether the segments contain equal numbers of objects; if the segments contain unequal numbers of objects, add the quantity midpoint of each segment to the supporting point set in sorted order; and if segments contain equal numbers of objects, add to the supporting point set the quantity midpoint of the segment closer to the initial reference point.
Index building module 12, configured to form a multidimensional data space from the chosen support points and to build an index using the multidimensional data space.
In this embodiment, the index building module 12 is specifically configured to:
select, from the support point set, the number of support points corresponding to the dimensionality of the multidimensional data to be produced;
map each object in the data set to its distance values to the selected support points, to form the multidimensional data space;
map the multidimensional data space to integer coordinate values;
use a Hilbert index mapping algorithm to directly compute the Hilbert code value of each pair of integer coordinate values;
sort the resulting Hilbert code values to build the Hilbert index.
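The index building steps above can be sketched for the two-support-point case. The sketch assumes Euclidean distance, a uniform quantisation onto a 2^order by 2^order grid, and the widely used iterative xy-to-d Hilbert conversion; the names `hilbert_code` and `build_index` are hypothetical, and the patent does not state which Hilbert mapping variant it uses.

```python
import math

def hilbert_code(order, x, y):
    """Hilbert curve position of integer cell (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate or flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def build_index(data, support_points, order=8):
    """Map objects to support-point distances, quantise, and sort by Hilbert code."""
    assert len(support_points) == 2, "this sketch handles exactly two support points"
    coords = [[math.dist(obj, p) for p in support_points] for obj in data]
    top = max(max(c) for c in coords) or 1.0   # largest distance, used for scaling
    cells = (1 << order) - 1
    index = []
    for obj_id, c in enumerate(coords):
        ix, iy = (int(v / top * cells) for v in c)   # integer coordinate values
        index.append((hilbert_code(order, ix, iy), obj_id))
    index.sort()                                     # sorted codes form the index
    return index
```

Each entry of the result is a `(code, object id)` pair, so objects that are close in the mapped space tend to sit near each other in the sorted list.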
Outlier detection module 13, configured to divide the index into data blocks and to detect outliers in the data blocks block by block.
In this embodiment, the outlier detection module 13 is specifically configured to:
divide the Hilbert index into data blocks, and sort the blocks from sparse to dense according to their code values to obtain the outlier detection order;
initialize the outlier degree threshold to 0, and read the data set one data block at a time in the detection order;
if none of the objects in the current data block can be an outlier, proceed directly to the next data block;
if some object in the current data block may be an outlier, search for its nearest neighbors in spiral order starting from the position of that object in the current data block, and remove from the current data block any object determined not to be an outlier; once all objects in the current data block have been processed, update the top-n outliers and the outlier degree threshold, and proceed to the next data block;
when all data blocks have been processed, output the top-n outliers.
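The block-by-block loop above can be sketched as follows. Several points are assumptions rather than statements from the patent: the outlier degree is taken to be the distance to the k-th nearest neighbor, "spiral order" is read as alternating left and right of the object's position in the sorted index, blocks are visited in index order rather than re-sorted by sparsity, and the names `spiral_positions` and `top_n_outliers` are hypothetical.

```python
import heapq
import math

def spiral_positions(start, size):
    """Yield start-1, start+1, start-2, start+2, ... restricted to [0, size)."""
    for step in range(1, size):
        left, right = start - step, start + step
        if left >= 0:
            yield left
        if right < size:
            yield right

def top_n_outliers(points, order, n, k, block_size):
    """points: coordinate tuples; order: point ids sorted by index code."""
    threshold = 0.0   # outlier degree threshold, initialised to 0
    top = []          # min-heap of (degree, point id) holding at most n entries
    for start in range(0, len(order), block_size):
        found = []
        for pos in range(start, min(start + block_size, len(order))):
            pid = order[pos]
            knn = []  # negated distances: a max-heap of the k smallest so far
            pruned = False
            for other in spiral_positions(pos, len(order)):
                d = math.dist(points[pid], points[order[other]])
                if len(knn) < k:
                    heapq.heappush(knn, -d)
                elif d < -knn[0]:
                    heapq.heapreplace(knn, -d)
                # k neighbors inside the threshold: pid cannot be a top-n outlier.
                if len(knn) == k and -knn[0] < threshold:
                    pruned = True
                    break
            if not pruned and len(knn) == k:
                found.append((-knn[0], pid))
        for item in found:            # update the top-n list after the block
            if len(top) < n:
                heapq.heappush(top, item)
            elif item > top[0]:
                heapq.heapreplace(top, item)
        if len(top) == n:
            threshold = top[0][0]     # threshold rises as better outliers appear
    return sorted(top, reverse=True)
```

Because the threshold only rises, an object pruned in an early block could never re-enter the top-n list later, which is why the early exit is safe in this sketch.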
To reduce data-space distortion, the outlier detection system 10 based on a multi-support-point index provided by the present invention chooses multiple support points in the data set to build the index, while ensuring that the time spent building the index is minimal relative to the total outlier detection time. To raise the outlier degree threshold faster, it examines all sparse regions of the data set first, including the more distant regions and other sparse regions. To improve the stability of the algorithm's performance, it proposes an approximate dense-region support point selection algorithm that chooses support points of relatively good quality in a very short time. To further reduce the number of distance computations and accelerate outlier detection, it uses multiple pruning rules to exclude as many non-outliers and non-k-nearest-neighbor objects as possible. The system builds the index by choosing multiple support points and computing their distances to the entire data set, which avoids the data-space distortion caused by a single support point; all sparse regions of the data set are examined first, so the outlier degree threshold rises quickly and outlier detection is accelerated.
While retaining the generality of distance-based methods, the outlier detection system 10 based on a multi-support-point index provided by the present invention offers higher detection speed and is compatible with multiple definitions of an outlier. Its three pruning rules exclude large numbers of non-outliers and non-k-nearest neighbors, reducing the number of distance computations and improving outlier detection speed.
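The three pruning rules are not spelled out in this passage. As background only, and as an assumption about the kind of bound a support-point index makes available, the triangle inequality gives a lower bound on the true distance from the mapped coordinates alone. For any metric $d$, objects $x$ and $y$, and support points $p_1, \dots, p_m$:

```latex
d(x, y) \;\ge\; \max_{1 \le i \le m} \bigl|\, d(x, p_i) - d(y, p_i) \,\bigr|
```

If this lower bound already exceeds the distance from $x$ to its current k-th nearest neighbor, then $y$ cannot be among the k nearest neighbors of $x$, and $d(x, y)$ need not be computed.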
It should be noted that the units included in the above embodiments are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are used only to distinguish them from one another and do not limit the scope of protection of the present invention.
In addition, a person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. An outlier detection method based on a multi-support-point index, characterized in that the method comprises:
a support point selection step: reading in a data set and choosing multiple support points from the data set to form a support point set;
an index building step: computing the distance between each object in the data set and each of the chosen support points, using the distances as coordinates to form a multidimensional data space, and building an index using the multidimensional data space;
an outlier detection step: dividing the index into data blocks and detecting outliers in the data blocks block by block.
2. The outlier detection method based on a multi-support-point index as claimed in claim 1, characterized in that the support point selection step specifically comprises:
after reading in the data set, randomly selecting an initial reference point, and choosing the point farthest from the initial reference point as the base point;
calculating the distance between each object in the data set and the base point;
sorting the objects by distance in ascending order;
dividing the data set into multiple segments of equal distance width;
sorting the segments by the number of objects they contain;
determining whether the segments contain equal numbers of objects;
if the segments contain unequal numbers of objects, adding the count midpoint of each segment to the support point set in that order;
if the segments contain equal numbers of objects, preferentially adding the count midpoint of the segment closer to the initial reference point to the support point set.
3. The outlier detection method based on a multi-support-point index as claimed in claim 2, characterized in that the index building step specifically comprises:
selecting, from the support point set, the number of support points corresponding to the dimensionality of the multidimensional data to be produced;
mapping each object in the data set to its distance values to the selected support points, to form the multidimensional data space;
mapping the multidimensional data space to integer coordinate values;
using a Hilbert index mapping algorithm to directly compute the Hilbert code value of each pair of integer coordinate values;
sorting the resulting Hilbert code values to build the Hilbert index.
4. The outlier detection method based on a multi-support-point index as claimed in claim 3, characterized in that the outlier detection step specifically comprises:
dividing the Hilbert index into data blocks, and sorting the blocks from sparse to dense according to their code values to obtain the outlier detection order;
initializing the outlier degree threshold to 0, and reading the data set one data block at a time in the detection order;
if none of the objects in the current data block can be an outlier, proceeding directly to the next data block;
if some object in the current data block may be an outlier, searching for its nearest neighbors in spiral order starting from the position of that object in the current data block, and removing from the current data block any object determined not to be an outlier; once all objects in the current data block have been processed, updating the top-n outliers and the outlier degree threshold, and proceeding to the next data block;
when all data blocks have been processed, outputting the top-n outliers.
5. An outlier detection system based on a multi-support-point index, characterized in that the system comprises:
a support point selection module, configured to read in a data set and choose multiple support points from the data set to form a support point set;
an index building module, configured to compute the distance between each object in the data set and each of the chosen support points, use the distances as coordinates to form a multidimensional data space, and build an index using the multidimensional data space;
an outlier detection module, configured to divide the index into data blocks and to detect outliers in the data blocks block by block.
6. The outlier detection system based on a multi-support-point index as claimed in claim 5, characterized in that the support point selection module is specifically configured to:
after reading in the data set, randomly select an initial reference point, and choose the point farthest from the initial reference point as the base point;
calculate the distance between each object in the data set and the base point;
sort the objects by distance in ascending order;
divide the data set into multiple segments of equal distance width;
sort the segments by the number of objects they contain;
determine whether the segments contain equal numbers of objects;
if the segments contain unequal numbers of objects, add the count midpoint of each segment to the support point set in that order;
if the segments contain equal numbers of objects, add the count midpoint of the segment closer to the initial reference point to the support point set.
7. The outlier detection system based on a multi-support-point index as claimed in claim 6, characterized in that the index building module is specifically configured to:
select, from the support point set, the number of support points corresponding to the dimensionality of the multidimensional data to be produced;
map each object in the data set to its distance values to the selected support points, to form the multidimensional data space;
map the multidimensional data space to integer coordinate values;
use a Hilbert index mapping algorithm to directly compute the Hilbert code value of each pair of integer coordinate values;
sort the resulting Hilbert code values to build the Hilbert index.
8. The outlier detection system based on a multi-support-point index as claimed in claim 7, characterized in that the outlier detection module is specifically configured to:
divide the Hilbert index into data blocks, and sort the blocks from sparse to dense according to their code values to obtain the outlier detection order;
initialize the outlier degree threshold to 0, and read the data set one data block at a time in the detection order;
if none of the objects in the current data block can be an outlier, proceed directly to the next data block;
if some object in the current data block may be an outlier, search for its nearest neighbors in spiral order starting from the position of that object in the current data block, and remove from the current data block any object determined not to be an outlier; once all objects in the current data block have been processed, update the top-n outliers and the outlier degree threshold, and proceed to the next data block;
when all data blocks have been processed, output the top-n outliers.
CN201610278832.9A 2016-04-28 2016-04-28 Multi-supporting point index-based outlier detection method and system Pending CN105975519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610278832.9A CN105975519A (en) 2016-04-28 2016-04-28 Multi-supporting point index-based outlier detection method and system

Publications (1)

Publication Number Publication Date
CN105975519A true CN105975519A (en) 2016-09-28

Family

ID=56994235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610278832.9A Pending CN105975519A (en) 2016-04-28 2016-04-28 Multi-supporting point index-based outlier detection method and system

Country Status (1)

Country Link
CN (1) CN105975519A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503245A (en) * 2016-11-08 2017-03-15 深圳大学 A kind of system of selection for supporting point set and device
CN106951353A (en) * 2017-03-20 2017-07-14 北京搜狐新媒体信息技术有限公司 Work data method for detecting abnormality and device
WO2017185296A1 (en) * 2016-04-28 2017-11-02 深圳大学 Method and system for detecting outlier based on multiple support points index
CN107480258A (en) * 2017-08-15 2017-12-15 佛山科学技术学院 A kind of metric space Outliers Detection method based on a variety of strong points
CN107798338A (en) * 2017-09-28 2018-03-13 佛山科学技术学院 A kind of intensive strong point fast selecting method of big data
CN112559571A (en) * 2020-12-21 2021-03-26 国家电网公司东北分部 Approximate outlier calculation method and system for numerical type stream data
CN112733904A (en) * 2020-12-30 2021-04-30 佛山科学技术学院 Water quality abnormity detection method and electronic equipment
CN110287238B (en) * 2019-06-26 2022-11-29 广东奥博信息产业股份有限公司 Method and system for detecting abnormal water quality based on priori knowledge

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744886A (en) * 2013-12-23 2014-04-23 西南科技大学 Directly extracted k nearest neighbor searching algorithm
CN105260742A (en) * 2015-09-29 2016-01-20 深圳大学 Unified classification method for multiple types of data and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
顾新财: ""面向多维数据的孤立点挖掘方法研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185296A1 (en) * 2016-04-28 2017-11-02 深圳大学 Method and system for detecting outlier based on multiple support points index
CN106503245B (en) * 2016-11-08 2019-07-26 深圳大学 A kind of selection method and device supporting point set
CN106503245A (en) * 2016-11-08 2017-03-15 深圳大学 A kind of system of selection for supporting point set and device
CN106951353A (en) * 2017-03-20 2017-07-14 北京搜狐新媒体信息技术有限公司 Work data method for detecting abnormality and device
CN106951353B (en) * 2017-03-20 2020-05-22 北京搜狐新媒体信息技术有限公司 Method and device for detecting abnormality of operation data
CN107480258A (en) * 2017-08-15 2017-12-15 佛山科学技术学院 A kind of metric space Outliers Detection method based on a variety of strong points
CN107798338A (en) * 2017-09-28 2018-03-13 佛山科学技术学院 A kind of intensive strong point fast selecting method of big data
CN107798338B (en) * 2017-09-28 2021-03-26 佛山科学技术学院 Method for quickly selecting big data dense support points
CN110287238B (en) * 2019-06-26 2022-11-29 广东奥博信息产业股份有限公司 Method and system for detecting abnormal water quality based on priori knowledge
CN112559571A (en) * 2020-12-21 2021-03-26 国家电网公司东北分部 Approximate outlier calculation method and system for numerical type stream data
CN112559571B (en) * 2020-12-21 2024-05-24 国家电网公司东北分部 Approximate outlier calculation method and system for numerical value type stream data
CN112733904A (en) * 2020-12-30 2021-04-30 佛山科学技术学院 Water quality abnormity detection method and electronic equipment
CN112733904B (en) * 2020-12-30 2022-03-25 佛山科学技术学院 Water quality abnormity detection method and electronic equipment
WO2022141746A1 (en) * 2020-12-30 2022-07-07 佛山科学技术学院 Method for detecting anomaly in water quality and electronic device

Similar Documents

Publication Publication Date Title
CN105975519A (en) Multi-supporting point index-based outlier detection method and system
CN111831660B (en) Method and device for evaluating metric space division mode, computer equipment and storage medium
CN102693266B (en) Search for method, the navigation equipment and method of generation index structure of database
CN109740628A (en) Point cloud clustering method, image processing equipment and the device with store function
WO2002025574A2 (en) Data clustering methods and applications
CN106126918B (en) A kind of geographical space abnormal aggregation domain scanning statistical method based on interaction force
US20180143945A1 (en) Method and system for detecting outlier based on multiple pivots index
CN105787126B (en) K-d tree generation method and k-d tree generation device
CN103020321B (en) Neighbor search method and system
CN108416381B (en) Multi-density clustering method for three-dimensional point set
CN104036261A (en) Face recognition method and system
CN105488176A (en) Data processing method and device
CN105005584A (en) Multi-subspace Skyline query computation method
CN109508349A (en) A kind of metric space Outliers Detection method and device
CN110580252B (en) Space object indexing and query method under multi-objective optimization
CN112203324B (en) MR positioning method and device based on position fingerprint database
CN110097581B (en) Method for constructing K-D tree based on point cloud registration ICP algorithm
CN112183001B (en) Hypergraph-based multistage clustering method for integrated circuits
CN105824853B (en) Clustering device and method
CN109840558A (en) Based on density peaks-core integration adaptive clustering scheme
CN110070100A (en) A kind of agricultural weather Outliers Detection method and device that multiple-factor is integrated
CN108509532B (en) Point gathering method and device applied to map
CN112734934B (en) STL model 3D printing slicing method based on intersecting edge mapping
CN110287238A (en) A kind of exception water quality detection method and system based on priori knowledge
CN102682279A (en) High-speed fingerprint feature comparison system and method implemented by classified triangles

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160928
