CN105975519A - Multi-supporting point index-based outlier detection method and system - Google Patents
- Publication number
- CN105975519A (application number CN201610278832.9A)
- Authority
- CN
- China
- Prior art keywords
- point
- index
- data
- data block
- outlier
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Image Analysis (AREA)
Abstract
The invention provides an outlier detection method based on a multi-supporting-point index. The method comprises: a supporting point selection step of reading a data set and selecting multiple supporting points from it to form a supporting point set; an index creation step of calculating the distance between each object in the data set and each selected supporting point, forming a multidimensional data space with those distances as coordinates, and creating an index from that space; and an outlier detection step of dividing the index into data blocks and detecting outliers block by block. The invention further provides an outlier detection system based on a multi-supporting-point index. In the technical solution provided, the index is built from the distances between multiple selected supporting points and the whole data set, which avoids the data space distortion caused by a single supporting point; all sparse regions of the data set are examined first, so the outlier degree threshold rises more quickly and outlier detection is faster.
Description
Technical field
The present invention relates to the field of computing, and in particular to an outlier detection method and system based on a multi-supporting-point index.
Background
An outlier is a distinctive data point in a data set whose behavior differs so much from that of the other points that it raises the suspicion that it is not a random deviation but was produced by an entirely different mechanism. Outliers are also called anomalies or abnormal objects. Outlier detection, also called anomaly detection, deviation detection, or outlier mining, uses an algorithm to find the outliers in a data set, for example the TOP-n outliers or all outliers that satisfy a given requirement. In other words, outlier detection mines massive data for the very few points that differ markedly from the mainstream data.
At present, the main outlier detection algorithms include the ORCA algorithm and the iORCA algorithm.
The ORCA algorithm randomly shuffles the order of the data set in order to obtain near-linear time complexity on average. In the worst case, however, its time complexity is still $O(n^2)$. Even in the average case, the outlier degree threshold rises slowly, so pruning is not efficient enough. For large data sets the required detection time is too long.
The shortcomings of the iORCA algorithm are as follows. First, it uses only a single supporting point; this saves indexing time but distorts the data space and lowers index quality, so pruning cannot work efficiently. Second, to raise the outlier degree threshold as quickly as possible, iORCA examines the region far from the supporting point first but ignores other sparse regions, which limits how fast the threshold can rise. Third, iORCA provides no supporting point selection algorithm, although the quality of the supporting point is closely tied to algorithm performance; in other words, iORCA simply selects its supporting point at random, so its results are unstable. Finally, iORCA decides whether to stop detecting outliers with a single termination rule and does not fully exploit the triangle inequality of the metric space to further reduce the number of distance computations.
Summary of the invention
In view of this, an object of the present invention is to provide an outlier detection method and system based on a multi-supporting-point index, intended to solve the prior-art problems that using a single supporting point distorts the data space and that outlier detection is slow.
The present invention proposes an outlier detection method based on a multi-supporting-point index, the method comprising:
a supporting point selection step: reading in a data set and selecting multiple supporting points from the data set to form a supporting point set;
an index creation step: calculating the distance between each object in the data set and each selected supporting point, forming a multidimensional data space with the distances as coordinates, and building an index from the multidimensional data space;
an outlier detection step: dividing the index into data blocks and detecting outliers in the data blocks block by block.
Preferably, the supporting point selection step specifically comprises:
after the data set is read in, randomly selecting an initial reference point, and selecting the point farthest from the initial reference point as the datum point;
calculating the distance between each object in the data set and the datum point;
sorting the objects in ascending order of distance;
dividing the data set into multiple equidistant segments;
sorting the segments by the number of objects they contain;
determining whether the segments contain equal numbers of objects;
if the segments contain unequal numbers of objects, adding the quantity midpoint of each segment to the supporting point set in order;
if segments contain equal numbers of objects, preferentially adding the quantity midpoint of the segment closer to the initial reference point to the supporting point set.
Preferably, the index creation step specifically comprises:
selecting the corresponding number of supporting points from the supporting point set according to the dimension of the multidimensional data to be produced;
mapping each object in the data set to its distance values to the supporting points, to form a multidimensional data space;
mapping the multidimensional data space to integer coordinate values;
using a Hilbert index mapping algorithm to calculate directly the Hilbert code of each pair of integer coordinate values;
sorting the resulting Hilbert codes to build a Hilbert index.
Preferably, the outlier detection step specifically comprises:
dividing the Hilbert index into data blocks and sorting the blocks by code value from sparse to dense to obtain the outlier detection order;
initializing the outlier degree threshold to 0 and reading the data set one data block at a time in the detection order;
if no object in the current data block can be an outlier, moving directly to the next data block;
if an object in the current data block may be an outlier, searching for nearest neighbors in spiral order starting from the object at the middle position of the current data block, removing from the current data block every object determined not to be an outlier, and, once all objects in the current data block have been processed, updating the TOP-n outliers and the outlier degree threshold and moving to the next data block;
when all data blocks have been processed, outputting the TOP-n outliers.
In another aspect, the present invention further provides an outlier detection system based on a multi-supporting-point index, the system comprising:
a supporting point selection module, configured to read in a data set and select multiple supporting points from the data set to form a supporting point set;
an index creation module, configured to calculate the distance between each object in the data set and each selected supporting point, form a multidimensional data space with the distances as coordinates, and build an index from the multidimensional data space;
an outlier detection module, configured to divide the index into data blocks and detect outliers in the data blocks block by block.
Preferably, the supporting point selection module is specifically configured to:
after the data set is read in, randomly select an initial reference point, and select the point farthest from the initial reference point as the datum point;
calculate the distance between each object in the data set and the datum point;
sort the objects in ascending order of distance;
divide the data set into multiple equidistant segments;
sort the segments by the number of objects they contain;
determine whether the segments contain equal numbers of objects;
if the segments contain unequal numbers of objects, add the quantity midpoint of each segment to the supporting point set in order;
if segments contain equal numbers of objects, add the quantity midpoint of the segment closer to the initial reference point to the supporting point set.
Preferably, the index creation module is specifically configured to:
select the corresponding number of supporting points from the supporting point set according to the dimension of the multidimensional data to be produced;
map each object in the data set to its distance values to the supporting points, to form a multidimensional data space;
map the multidimensional data space to integer coordinate values;
use a Hilbert index mapping algorithm to calculate directly the Hilbert code of each pair of integer coordinate values;
sort the resulting Hilbert codes to build a Hilbert index.
Preferably, the outlier detection module is specifically configured to:
divide the Hilbert index into data blocks and sort the blocks by code value from sparse to dense to obtain the outlier detection order;
initialize the outlier degree threshold to 0 and read the data set one data block at a time in the detection order;
if no object in the current data block can be an outlier, move directly to the next data block;
if an object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the object at the middle position of the current data block, remove from the current data block every object determined not to be an outlier, and, once all objects in the current data block have been processed, update the TOP-n outliers and the outlier degree threshold and move to the next data block;
when all data blocks have been processed, output the TOP-n outliers.
To reduce data space distortion, the technical solution provided by the present invention selects multiple supporting points in the data set to build the index, while keeping the index creation time very small relative to the total outlier detection time. To raise the outlier degree threshold faster, all sparse regions of the data set are examined first, including far regions and other sparse regions. To make algorithm performance more stable, an approximate dense-region supporting point selection algorithm is proposed, which selects supporting points of relatively good quality in a very short time. To further reduce the number of distance computations and speed up outlier detection, multiple pruning rules are used to exclude more non-outliers and non-k-nearest-neighbor objects. By selecting multiple supporting points and calculating their distances to the whole data set to build the index, the solution avoids the data space distortion caused by a single supporting point; by examining all sparse regions of the data set first, it raises the outlier degree threshold more quickly and increases outlier detection speed.
Brief description of the drawings
Fig. 1 is a flowchart of the outlier detection method based on a multi-supporting-point index in an embodiment of the present invention;
Fig. 2 is a detailed flowchart of step S11 shown in Fig. 1 in an embodiment of the present invention;
Fig. 3 is a detailed flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention;
Fig. 4 is a detailed flowchart of step S13 shown in Fig. 1 in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the internal structure of the outlier detection system 10 based on a multi-supporting-point index in an embodiment of the present invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The terms used in the technical solution of the present invention are explained as follows:
Outlier degree: the outlier degree of an object expresses how strongly it deviates; commonly the mean of its distances to its k nearest neighbors, or its distance to its k-th nearest neighbor, is used as the outlier degree;
Data block: a unit of outlier detection, made up of several objects of the data set; typically 1000 objects are used as one data block;
Outlier degree threshold: the outlier degree of the n-th outlier among the TOP-n outliers;
Spiral order: given, for example, the index sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 5 as the starting point, the spiral order is 5, 4, 6, 3, 7, 2, 8, ..., or 5, 6, 4, 7, 3, 8, 2, ...; that is, positions are visited alternately before and after the starting point, moving outward;
Quantity midpoint: the midpoint by count, that is, the object for which the number of objects greater than it and the number of objects smaller than it are equal or differ by at most 1.
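The three computable terms above (outlier degree, spiral order, quantity midpoint) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distance is assumed as the metric, positions are 0-based, and the function names `outlier_degree`, `spiral_order`, and `quantity_midpoint` are hypothetical names that do not appear in the patent.

```python
import math
from typing import Iterator, List, Sequence

Point = Sequence[float]


def outlier_degree(x: Point, data: List[Point], k: int, average: bool = False) -> float:
    """Distance from x to its k-th nearest neighbor, or the mean distance
    to its k nearest neighbors when average=True (Euclidean metric assumed)."""
    dists = sorted(math.dist(x, y) for y in data if y is not x)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if average else nearest[-1]


def spiral_order(start: int, length: int) -> Iterator[int]:
    """Yield positions start, start-1, start+1, start-2, ... inside [0, length)."""
    yield start
    for step in range(1, length):
        lo, hi = start - step, start + step
        if lo < 0 and hi >= length:
            return  # both directions exhausted
        if lo >= 0:
            yield lo
        if hi < length:
            yield hi


def quantity_midpoint(sorted_items: list):
    """Item of a sorted list with equally many items on each side (or one fewer before it)."""
    return sorted_items[(len(sorted_items) - 1) // 2]
```

With 0-based positions, `spiral_order(4, 10)` visits 4, 3, 5, 2, 6, ..., which corresponds to the sequence 5, 4, 6, 3, 7, ... in the definition above.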
A specific embodiment of the present invention provides an outlier detection method based on a multi-supporting-point index, which mainly comprises the following steps:
S11, supporting point selection step: reading in a data set and selecting multiple supporting points from the data set to form a supporting point set;
S12, index creation step: calculating the distance between each object in the data set and each selected supporting point, forming a multidimensional data space with the distances as coordinates, and building an index from the multidimensional data space;
S13, outlier detection step: dividing the index into data blocks and detecting outliers in the data blocks block by block.
By selecting multiple supporting points and calculating their distances to the whole data set to build the index, the outlier detection method provided by the present invention avoids the data space distortion caused by a single supporting point; by examining all sparse regions of the data set first, it raises the outlier degree threshold more quickly and increases outlier detection speed.
The outlier detection method based on a multi-supporting-point index provided by the present invention is described in detail below.
Refer to Fig. 1, a flowchart of the outlier detection method based on a multi-supporting-point index in an embodiment of the present invention.
In step S11, the supporting point selection step, a data set is read in and multiple supporting points are selected from it to form a supporting point set.
In this embodiment, the supporting point selection step S11 comprises sub-steps S111-S118, as shown in Fig. 2.
Refer to Fig. 2, a detailed flowchart of step S11 shown in Fig. 1 in an embodiment of the present invention.
In step S111, after the data set is read in, an initial reference point is selected at random, and the point farthest from the initial reference point is selected as the datum point.
In step S112, the distance between each object in the data set and the datum point is calculated.
In step S113, the objects are sorted in ascending order of distance.
In step S114, the data set is divided into multiple equidistant segments.
In step S115, the segments are sorted by the number of objects they contain.
In step S116, it is determined whether the segments contain equal numbers of objects.
In step S117, if the segments contain unequal numbers of objects, the quantity midpoint of each segment is added to the supporting point set in order.
In step S118, if segments contain equal numbers of objects, the quantity midpoint of the segment closer to the initial reference point is added to the supporting point set.
In this embodiment, equidistant partitioning divides the data set by equal distance increments, taking the span from the datum point to the object farthest from it as the basis. Suppose the maximum distance is $d_f$ and the data set is to be divided into $n$ segments; the boundaries then lie at distances $d_f/n, 2d_f/n, \ldots, (n-1)d_f/n$ from the datum point, so the data set is split into $n$ segments of equal width that do not necessarily contain equal numbers of objects. Dense regions are identified by first counting the objects in each segment and then sorting the segments by that count; segments with larger counts are the candidate regions for supporting point selection.
In this embodiment, after the data set is read in, a temporary reference point is selected at random as the initial reference point, and the data set is searched for the object farthest from it. With that object as the datum point, the distance between each object in the data set and the datum point is calculated and the objects are sorted in ascending order of distance. Using the "equidistant partition + quantity midpoint" method, the quantity midpoint of each resulting segment is added to the supporting point candidate set. The number of objects in each segment is counted, and the segments are sorted by object count in descending order. Among segments with equal object counts, the segment closest to the reference point is found by comparison and its quantity midpoint is taken as the first supporting point. Whenever segments contain equal numbers of objects, the midpoint of the segment closer to the supporting point is chosen preferentially as a supporting point.
In this embodiment, note that for the supporting point candidate set to supply enough supporting points, its size (that is, the number of segments) should exceed the number of supporting points to be selected. To ensure selection quality, the number of segments should generally be more than twice the number of supporting points. In addition, if a subset of the data set is used to select the supporting points, the subset must not be too small, again to ensure supporting point quality; one data block is typically used, and more data blocks should be used when many supporting points are needed.
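The selection procedure of steps S111-S118 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's reference implementation: Euclidean distance is assumed, the function name `select_supporting_points` and its parameters are hypothetical, and ties between segments of equal size are broken by the distance from each segment's quantity midpoint to the initial reference point, which is one possible reading of step S118.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def select_supporting_points(data: List[Point], num_points: int,
                             num_segments: int, seed: int = 0) -> List[Point]:
    rng = random.Random(seed)
    initial = rng.choice(data)                                 # S111: random initial reference point
    datum = max(data, key=lambda p: math.dist(p, initial))     # S111: farthest point is the datum point
    ranked = sorted(data, key=lambda p: math.dist(p, datum))   # S112-S113: sort by distance to datum
    d_far = math.dist(ranked[-1], datum)

    # S114: split into equal-width distance bands of width d_far / num_segments.
    segments: List[List[Point]] = [[] for _ in range(num_segments)]
    for p in ranked:
        band = int(math.dist(p, datum) / d_far * num_segments) if d_far > 0 else 0
        segments[min(band, num_segments - 1)].append(p)

    # S115-S118: rank segments by object count (densest first); on equal counts,
    # prefer the segment whose quantity midpoint is nearer the initial reference point.
    candidates = []
    for seg in segments:
        if seg:
            mid = seg[(len(seg) - 1) // 2]  # quantity midpoint of the segment
            candidates.append((-len(seg), math.dist(mid, initial), mid))
    candidates.sort(key=lambda c: (c[0], c[1]))
    return [mid for _, _, mid in candidates[:num_points]]
```

In line with the guidance above, a caller would normally pass `num_segments` greater than twice `num_points`, and could pass a single data block rather than the whole data set as `data`.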
Referring again to Fig. 1, in step S12, the index creation step, a multidimensional data space is formed from the selected supporting points and an index is built from that space.
In this embodiment, the index creation step S12 comprises sub-steps S121-S125, as shown in Fig. 3.
Refer to Fig. 3, a detailed flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention.
In step S121, the corresponding number of supporting points is selected from the supporting point set according to the dimension of the multidimensional data to be produced.
In step S122, each object in the data set is mapped to its distance values to the supporting points, to form a multidimensional data space.
In step S123, the multidimensional data space is mapped to integer coordinate values.
In step S124, a Hilbert index mapping algorithm is used to calculate directly the Hilbert code of each pair of integer coordinate values.
In step S125, the resulting Hilbert codes are sorted to build a Hilbert index.
In this embodiment, after the data set is read, the supporting point selection algorithm is used to select the number of supporting points that matches the dimension of the multidimensional data to be produced. Each object of the data set is mapped to its distance values to the supporting points, forming a multidimensional data space (that is, real-valued coordinates). The real-valued coordinates are then mapped to integer coordinate values, and a Hilbert index mapping algorithm calculates the Hilbert code of each pair of integer coordinate values directly. This completes the encoding of the objects of the metric space; sorting the codes yields the Hilbert index.
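Steps S121-S125 can be illustrated with a short sketch for the two-dimensional case, which is an assumption suggested by the phrase "each pair of integer coordinate values". The Hilbert mapping below is the widely published iterative coordinate-to-distance conversion, not necessarily the mapping algorithm the patent has in mind; the names `hilbert_code`, `build_hilbert_index`, and the `order` parameter (grid resolution in bits per axis) are hypothetical, and Euclidean distance is assumed.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def hilbert_code(x: int, y: int, order: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve covering a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant so the curve stays continuous
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def build_hilbert_index(data: List[Point], supports: List[Point],
                        order: int = 8) -> List[Tuple[int, int]]:
    """Return (hilbert_code, object_id) pairs sorted by code, using two supporting points."""
    # S122: real-valued coordinates are the distances to the two supporting points.
    coords = [(math.dist(p, supports[0]), math.dist(p, supports[1])) for p in data]
    max_d = max(max(c) for c in coords) or 1.0
    cells = (1 << order) - 1
    index = []
    for obj_id, (d0, d1) in enumerate(coords):
        # S123: scale to integer grid coordinates in [0, 2^order - 1].
        gx = min(int(d0 / max_d * cells), cells)
        gy = min(int(d1 / max_d * cells), cells)
        index.append((hilbert_code(gx, gy, order), obj_id))   # S124
    index.sort()                                              # S125
    return index
```

Because nearby cells on the grid tend to receive nearby Hilbert codes, objects with similar distance profiles end up close together in the sorted index, which is what the later spiral search relies on.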
Referring again to Fig. 1, in step S13, the outlier detection step, the index is divided into data blocks and outliers are detected in the data blocks block by block.
In this embodiment, the outlier detection step S13 comprises sub-steps S131-S135, as shown in Fig. 4.
Refer to Fig. 4, a detailed flowchart of step S13 shown in Fig. 1 in an embodiment of the present invention.
In step S131, the Hilbert index is divided into data blocks, and the blocks are sorted by code value from sparse to dense to obtain the outlier detection order.
In step S132, the outlier degree threshold is initialized to 0 and the data set is read one data block at a time in the detection order.
In step S133, if no object in the current data block can be an outlier, processing moves directly to the next data block.
In step S134, if an object in the current data block may be an outlier, nearest neighbors are searched in spiral order starting from the object at the middle position of the current data block, and every object determined not to be an outlier is removed from the current data block; once all objects in the current data block have been processed, the TOP-n outliers and the outlier degree threshold are updated and processing moves to the next data block.
In step S135, when all data blocks have been processed, the TOP-n outliers are output.
In this embodiment, the algorithm is illustrated with a pseudocode-style description. Input: the number of nearest neighbors k, the number of outliers to detect n, and the data set D. Output: the TOP-n outliers. Step S13 then proceeds as follows.
After the index is built, the index data is divided into data blocks (for example, 1000 objects per block); the Hilbert code increment of each data block is calculated and the blocks are sorted by it in descending order. Outliers are then detected block by block in the sorted order. For each data block, pruning rule three is applied first to determine whether the block can contain outliers. If it cannot, processing moves directly to the next data block; if it can, nearest neighbors are searched in spiral order, starting from the object at the middle position of the data block. For each object in the data block B under test, pruning rule one is first used to determine whether it can be an outlier. If it cannot, it is removed from data block B and detection moves on to the next object; if it can, the search for its k nearest neighbors continues. Before each distance calculation, pruning rule two is used to determine whether the candidate can be one of the object's k nearest neighbors. If it cannot, the distance between the two is not calculated and detection moves directly to the next object; if it can, the distance is calculated and the object's k nearest neighbors are updated where appropriate. At the same time, the object's current outlier degree is compared with the threshold c; if it is smaller, the object can no longer become an outlier and is removed from data block B.
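The block-by-block loop described above can be sketched as follows. This is a simplified illustration under several assumptions: the outlier degree is taken to be the distance to the k-th nearest neighbor, the Hilbert code increment of a block is assumed to be the difference between its last and first code, Euclidean distance is used, the data set holds more than k objects, and the three supporting-point pruning rules are omitted so that only the final threshold test against c remains. The names `spiral` and `detect_top_n` are hypothetical.

```python
import heapq
import math
from typing import Dict, Iterator, List, Sequence, Tuple

Point = Sequence[float]


def spiral(start: int, length: int) -> Iterator[int]:
    """Yield start, start-1, start+1, ... while staying inside [0, length)."""
    yield start
    for step in range(1, length):
        lo, hi = start - step, start + step
        if lo < 0 and hi >= length:
            return
        if lo >= 0:
            yield lo
        if hi < length:
            yield hi


def detect_top_n(data: List[Point], index: List[Tuple[int, int]], k: int, n: int,
                 block_size: int = 1000) -> List[Tuple[float, int]]:
    """index holds (hilbert_code, object_id) pairs sorted by code.
    Returns the TOP-n (outlier_degree, object_id) pairs, largest degree first."""
    ids = [obj for _, obj in index]

    def code_span(start: int) -> int:  # assumed meaning of "Hilbert code increment"
        end = min(start + block_size, len(index)) - 1
        return index[end][0] - index[start][0]

    top: List[Tuple[float, int]] = []  # min-heap of the current TOP-n
    c = 0.0                            # outlier degree threshold
    # Sparse blocks (large code span) are examined first.
    for start in sorted(range(0, len(index), block_size), key=code_span, reverse=True):
        block = ids[start:start + block_size]
        knn: Dict[int, List[float]] = {x: [] for x in block}  # negated distances (max-heap)
        for pos in spiral(start + len(block) // 2, len(ids)):
            y = ids[pos]
            for x in list(knn):
                if x == y:
                    continue
                d = math.dist(data[x], data[y])
                heap = knn[x]
                if len(heap) < k:
                    heapq.heappush(heap, -d)
                elif d < -heap[0]:
                    heapq.heapreplace(heap, -d)
                if len(heap) == k and -heap[0] < c:
                    del knn[x]         # degree already below c: cannot be a TOP-n outlier
            if not knn:
                break                  # every object of the block has been pruned
        for x, heap in knn.items():    # survivors try to enter the TOP-n
            entry = (-heap[0], x)
            if len(top) < n:
                heapq.heappush(top, entry)
            elif entry > top[0]:
                heapq.heapreplace(top, entry)
        if len(top) == n:
            c = top[0][0]              # threshold = degree of the n-th outlier
    return sorted(top, reverse=True)
```

Processing a sparse block first tends to fill the TOP-n with large outlier degrees early, so c rises quickly and objects in dense blocks are discarded after only a few distance computations.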
In this embodiment, the three pruning rules are as follows:
(1) Pruning rule one: exclude non-outlier objects.
If $\mathrm{dist}(x, p_i) + \mathrm{dist}(p_i, \mathrm{nn}_k(p_i, D)) < c$, where $p_i \in P$,
then $x$ cannot be an outlier.
In other words, the supporting point $p_i$ and its k nearest neighbors all lie within distance c of object x, so x has at least k objects within radius c and its outlier degree must be less than c.
(2) Pruning rule two: exclude objects that are not k nearest neighbors.
If $\lvert \mathrm{dist}(x_t, p_i) - \mathrm{dist}(x_j, p_i) \rvert > \mathrm{dist}(x_t, \mathrm{nn}_k(x_t, D))$, where $p_i \in P$,
then $x_j$ cannot be a k nearest neighbor of $x_t$.
(3) Pruning rule three:
If $\mathrm{dist}(B, p_i) + \mathrm{dist}(p_i, \mathrm{nn}_k(p_i, D)) < c$, where $p_i \in P$,
then no object in data block B can be an outlier.
That is, every object in data block B has at least k nearest neighbors within distance c.
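The three rules can be written as small predicates over distances that are already known, so that applying them requires no new distance computation. The sketch below makes some assumptions: each object carries a vector of its distances to the supporting points in P, `support_knn_dist[i]` holds the distance from supporting point p_i to its k-th nearest neighbor, and dist(B, p_i) is read as the largest distance from any object in block B to p_i, which the source text does not define. All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def rule_one(x_to_supports: Sequence[float], support_knn_dist: Sequence[float],
             c: float) -> bool:
    """True if x cannot be an outlier: some supporting point and its k nearest
    neighbors all lie within distance c of x (triangle inequality)."""
    return any(dx + r < c for dx, r in zip(x_to_supports, support_knn_dist))


def rule_two(xt_to_supports: Sequence[float], xj_to_supports: Sequence[float],
             xt_kth_dist: float) -> bool:
    """True if x_j cannot be one of the k nearest neighbors of x_t: the lower bound
    |dist(x_t, p_i) - dist(x_j, p_i)| already exceeds x_t's current k-th neighbor distance."""
    return any(abs(a - b) > xt_kth_dist
               for a, b in zip(xt_to_supports, xj_to_supports))


def rule_three(block_to_supports: List[Sequence[float]],
               support_knn_dist: Sequence[float], c: float) -> bool:
    """True if no object of the block can be an outlier. dist(B, p_i) is assumed
    to be the largest distance from any object in the block to p_i."""
    block_far = [max(column) for column in zip(*block_to_supports)]
    return rule_one(block_far, support_knn_dist, c)
```

For example, with two supporting points, `rule_one([1.0, 5.0], [0.5, 0.5], 2.0)` holds because 1.0 + 0.5 is below the threshold 2.0 for the first supporting point.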
In this embodiment, once a data block has been processed, a large share of its objects may in practice already have been removed. Each remaining object is tentatively added to the TOP-n outliers, and the outlier degree threshold c is updated. After all data blocks have been processed, the TOP-n outliers are output.
To reduce data space distortion, the technical solution provided by the present invention selects multiple supporting points in the data set to build the index, while keeping the index creation time very small relative to the total outlier detection time. To raise the outlier degree threshold faster, all sparse regions of the data set are examined first, including far regions and other sparse regions. To make algorithm performance more stable, an approximate dense-region supporting point selection algorithm is proposed, which selects supporting points of relatively good quality in a very short time. To further reduce the number of distance computations and speed up outlier detection, multiple pruning rules are used to exclude more non-outliers and non-k-nearest-neighbor objects. By selecting multiple supporting points and calculating their distances to the whole data set to build the index, the solution avoids the data space distortion caused by a single supporting point; by examining all sparse regions of the data set first, it raises the outlier degree threshold more quickly and increases outlier detection speed.
While retaining the generality of distance-based methods, the technical solution provided by the present invention offers higher detection speed and is compatible with multiple definitions of an outlier. It uses three pruning rules to exclude large numbers of non-outliers and non-k-nearest neighbors, reducing the number of distance computations and increasing outlier detection speed.
A specific embodiment of the present invention further provides an outlier detection system 10 based on a multi-supporting-point index, which mainly comprises:
a supporting point selection module 11, configured to read in a data set and select multiple supporting points from the data set to form a supporting point set;
an index creation module 12, configured to calculate the distance between each object in the data set and each selected supporting point, form a multidimensional data space with the distances as coordinates, and build an index from the multidimensional data space;
an outlier detection module 13, configured to divide the index into data blocks and detect outliers in the data blocks block by block.
By selecting multiple supporting points and calculating their distances to the whole data set to build the index, the outlier detection system 10 provided by the present invention avoids the data space distortion caused by a single supporting point; by examining all sparse regions of the data set first, it raises the outlier degree threshold more quickly and increases outlier detection speed.
Refer to Fig. 5, a schematic structural diagram of the outlier detection system 10 based on a multi-supporting-point index in an embodiment of the present invention. In this embodiment, the outlier detection system 10 mainly comprises a supporting point selection module 11, an index creation module 12, and an outlier detection module 13.
The supporting point selection module 11 is configured to read in a data set and select multiple supporting points from the data set to form a supporting point set.
In this embodiment, the supporting point selection module 11 is specifically configured to: after the data set is read in, randomly select an initial reference point, and select the point farthest from the initial reference point as the datum point; calculate the distance between each object in the data set and the datum point; sort the objects in ascending order of distance; divide the data set into multiple equidistant segments; sort the segments by the number of objects they contain; determine whether the segments contain equal numbers of objects; if the segments contain unequal numbers of objects, add the quantity midpoint of each segment to the supporting point set in order;
if segments contain equal numbers of objects, add the quantity midpoint of the segment closer to the initial reference point to the supporting point set.
The index creation module 12 is configured to form a multidimensional data space from the selected supporting points and to build an index from the multidimensional data space.
In this embodiment, the index creation module 12 is specifically configured to:
select the corresponding number of supporting points from the supporting point set according to the dimension of the multidimensional data to be produced;
map each object in the data set to its distance values to the supporting points, to form a multidimensional data space;
map the multidimensional data space to integer coordinate values;
use a Hilbert index mapping algorithm to calculate directly the Hilbert code of each pair of integer coordinate values;
sort the resulting Hilbert codes to build a Hilbert index.
The outlier detection module 13 is configured to divide the index into data blocks and to detect outliers in the data blocks block by block.
In this embodiment, the outlier detection module 13 is specifically configured to:
divide the Hilbert index into data blocks, and sort the data blocks from sparse to dense according to their code values, the resulting order serving as the outlier detection order;
initialize the outlier degree threshold to 0, and read the data set one data block at a time in the detection order;
if none of the objects in the current data block can possibly be an outlier, proceed directly to the next data block;
if an object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the position of that object in the current data block, and remove from the current data block under detection any object determined not to be an outlier; after all objects in the current data block have been processed, update the TOP n outliers and the outlier degree threshold, and proceed to the next data block; when all data blocks have been processed, output the TOP n outliers.
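The block-by-block detection loop can be sketched as below. Several points are assumptions rather than details taken from the embodiment: the outlier degree is taken to be the distance to the k-th nearest neighbor, block sparsity is estimated from the Hilbert code span per object, the "spiral" search is interpreted as alternating outward from the object's position in the sorted index, and the block-level skip test is omitted (objects are pruned individually instead). The names `spiral` and `top_n_outliers` are hypothetical.

```python
import heapq
import math


def spiral(pos, size):
    """Yield index positions pos-1, pos+1, pos-2, pos+2, ... inside [0, size)."""
    for step in range(1, size):
        lo, hi = pos - step, pos + step
        if lo < 0 and hi >= size:
            break
        if lo >= 0:
            yield lo
        if hi < size:
            yield hi


def top_n_outliers(data, index, k, n, block_size):
    """index: list of (code, object_id) sorted by code. Returns top-n (degree, id)."""
    starts = range(0, len(index), block_size)

    def sparsity(start):
        block = index[start:start + block_size]
        return (block[-1][0] - block[0][0]) / len(block)  # code span per object

    threshold = 0.0   # outlier degree threshold, initialized to 0
    top = []          # min-heap holding the current top-n (degree, object_id)
    for start in sorted(starts, key=sparsity, reverse=True):  # sparse blocks first
        candidates = []
        for pos in range(start, min(start + block_size, len(index))):
            obj = data[index[pos][1]]
            knn = []  # negated distances: a max-heap of the k smallest so far
            pruned = False
            for other in spiral(pos, len(index)):
                d = math.dist(obj, data[index[other][1]])
                if len(knn) < k:
                    heapq.heappush(knn, -d)
                elif d < -knn[0]:
                    heapq.heapreplace(knn, -d)
                # The k-th neighbor distance can only shrink, so once it drops
                # below the threshold the object cannot reach the top n.
                if len(knn) == k and -knn[0] < threshold:
                    pruned = True
                    break
            if not pruned and len(knn) == k:
                candidates.append((-knn[0], index[pos][1]))
        # After the whole block is processed, update the top n and the threshold.
        for degree, obj_id in candidates:
            if len(top) < n:
                heapq.heappush(top, (degree, obj_id))
            elif degree > top[0][0]:
                heapq.heapreplace(top, (degree, obj_id))
        if len(top) == n:
            threshold = top[0][0]
    return sorted(top, reverse=True)
```

Visiting sparse blocks first tends to place genuine outliers in the top-n list early, which raises the threshold and lets objects in dense blocks be discarded after only a few distance calculations.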
To reduce data space distortion, the multi-supporting point index-based outlier detection system 10 provided by the present invention selects a plurality of supporting points in the data set to create the index, while ensuring that the time overhead of creating the index is minimal (relative to the total outlier detection time). To raise the outlier degree threshold more quickly, all sparse regions in the data set, including the more distant regions and other sparse regions, are detected preferentially. To improve the stability of algorithm performance, an approximate dense-region supporting point selection algorithm is proposed, which selects supporting points of relatively good quality within a very short time. To further reduce the number of distance calculations and accelerate outlier detection, multiple pruning rules are used to exclude as many non-outliers and non-k-nearest-neighbor objects as possible. The multi-supporting point index-based outlier detection system 10 provided by the present invention creates the index by calculating distances between the selected supporting points and the global data set, which avoids the data space distortion caused by a single supporting point; all sparse regions in the data set are detected preferentially, so that the outlier degree threshold can be raised more quickly and the outlier detection speed is improved.
While retaining the generality of distance-based approaches, the multi-supporting point index-based outlier detection system 10 provided by the present invention offers a higher detection speed and is compatible with multiple definitions of outliers. The system 10 uses three major pruning rules that exclude a large number of non-outliers and non-k-nearest neighbors, reducing the number of distance calculations and improving the outlier detection speed.
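The three pruning rules are not restated in this passage. As an illustration of the general principle that pivot-based pruning relies on, and not as a statement of any of the three rules, the sketch below shows the triangle-inequality lower bound: for any supporting point p, |d(x, p) - d(y, p)| <= d(x, y), so the stored pivot distances bound the true distance without computing it. The function names and the Euclidean distance are assumptions.

```python
import math


def pivot_lower_bound(x_coords, y_coords):
    """Lower bound on d(x, y) from the pivot-distance coordinates of x and y."""
    return max(abs(a - b) for a, b in zip(x_coords, y_coords))


def can_skip(x_coords, y_coords, kth_distance):
    """True if y cannot be closer to x than the current k-th nearest neighbor."""
    return pivot_lower_bound(x_coords, y_coords) >= kth_distance


# Usage: coordinates are the distances of each object to two supporting points.
pivots = [(0.0, 0.0), (10.0, 0.0)]
x, y = (1.0, 1.0), (7.0, 5.0)
x_coords = [math.dist(x, p) for p in pivots]
y_coords = [math.dist(y, p) for p in pivots]
bound = pivot_lower_bound(x_coords, y_coords)
```

When `can_skip` returns true, the exact distance between the two objects never needs to be evaluated, which is where the saving in distance calculations comes from.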
It should be noted that the units included in the above embodiments are divided merely according to functional logic; the division is not limited to the above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are used only to distinguish them from one another and are not intended to limit the protection scope of the present invention.
In addition, those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (8)
1. A multi-supporting point index-based outlier detection method, characterized in that the method comprises:
a supporting point selection step: reading a data set, and selecting a plurality of supporting points in the data set to form a supporting point set;
an index creation step: calculating the distance between each object in the data set and the selected supporting points, forming a multidimensional data space using the distances as coordinates, and creating an index using the multidimensional data space;
an outlier detection step: dividing the index into data blocks, and detecting outliers in the data blocks block by block.
2. The multi-supporting point index-based outlier detection method according to claim 1, characterized in that the supporting point selection step specifically comprises:
after reading the data set, randomly selecting an initial reference point, and taking the point farthest from the initial reference point as a datum point;
calculating the distance between each object in the data set and the datum point;
sorting the objects in ascending order of distance;
dividing the data set into a plurality of equidistant segments;
sorting the segments according to the number of objects they contain;
determining whether the segments contain equal numbers of objects;
if the numbers of objects contained in the segments are unequal, adding the count midpoint of each segment to the supporting point set in sequence;
if the numbers of objects contained in the segments are equal, preferentially adding the count midpoint of the segment closer to the initial reference point to the supporting point set.
3. The multi-supporting point index-based outlier detection method according to claim 2, characterized in that the index creation step specifically comprises:
selecting, according to the dimensionality of the multidimensional data to be produced by the transformation, a corresponding number of supporting points from the supporting point set;
mapping each object in the data set to its distance values to the respective supporting points, so as to form the multidimensional data space;
mapping the multidimensional data space to integer coordinate values;
directly calculating the Hilbert code value of each pair of integer coordinate values using a Hilbert index mapping algorithm;
sorting the resulting Hilbert code values to create a Hilbert index.
4. The multi-supporting point index-based outlier detection method according to claim 3, characterized in that the outlier detection step specifically comprises:
dividing the Hilbert index into data blocks, and sorting the data blocks from sparse to dense according to their code values, the resulting order serving as the outlier detection order;
initializing the outlier degree threshold to 0, and reading the data set one data block at a time in the detection order;
if none of the objects in the current data block can possibly be an outlier, proceeding directly to the next data block;
if an object in the current data block may be an outlier, searching for nearest neighbors in spiral order starting from the position of that object in the current data block, and removing from the current data block under detection any object determined not to be an outlier; after all objects in the current data block have been processed, updating the TOP n outliers and the outlier degree threshold, and proceeding to the next data block;
when all data blocks have been processed, outputting the TOP n outliers.
5. A multi-supporting point index-based outlier detection system, characterized in that the system comprises:
a supporting point selection module, configured to read a data set and select a plurality of supporting points in the data set to form a supporting point set;
an index creation module, configured to calculate the distance between each object in the data set and the selected supporting points, form a multidimensional data space using the distances as coordinates, and create an index using the multidimensional data space;
an outlier detection module, configured to divide the index into data blocks and detect outliers in the data blocks block by block.
6. The multi-supporting point index-based outlier detection system according to claim 5, characterized in that the supporting point selection module is specifically configured to:
after reading the data set, randomly select an initial reference point, and take the point farthest from the initial reference point as a datum point;
calculate the distance between each object in the data set and the datum point;
sort the objects in ascending order of distance;
divide the data set into a plurality of equidistant segments;
sort the segments according to the number of objects they contain;
determine whether the segments contain equal numbers of objects;
if the numbers of objects contained in the segments are unequal, add the count midpoint of each segment to the supporting point set in sequence;
if the numbers of objects contained in the segments are equal, add the count midpoint of the segment closer to the initial reference point to the supporting point set.
7. The multi-supporting point index-based outlier detection system according to claim 6, characterized in that the index creation module is specifically configured to:
select, according to the dimensionality of the multidimensional data to be produced by the transformation, a corresponding number of supporting points from the supporting point set;
map each object in the data set to its distance values to the respective supporting points, so as to form the multidimensional data space;
map the multidimensional data space to integer coordinate values;
directly calculate the Hilbert code value of each pair of integer coordinate values using a Hilbert index mapping algorithm;
sort the resulting Hilbert code values to create a Hilbert index.
8. The multi-supporting point index-based outlier detection system according to claim 7, characterized in that the outlier detection module is specifically configured to:
divide the Hilbert index into data blocks, and sort the data blocks from sparse to dense according to their code values, the resulting order serving as the outlier detection order;
initialize the outlier degree threshold to 0, and read the data set one data block at a time in the detection order;
if none of the objects in the current data block can possibly be an outlier, proceed directly to the next data block;
if an object in the current data block may be an outlier, search for nearest neighbors in spiral order starting from the position of that object in the current data block, and remove from the current data block under detection any object determined not to be an outlier; after all objects in the current data block have been processed, update the TOP n outliers and the outlier degree threshold, and proceed to the next data block;
when all data blocks have been processed, output the TOP n outliers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610278832.9A CN105975519A (en) | 2016-04-28 | 2016-04-28 | Multi-supporting point index-based outlier detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610278832.9A CN105975519A (en) | 2016-04-28 | 2016-04-28 | Multi-supporting point index-based outlier detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105975519A true CN105975519A (en) | 2016-09-28 |
Family
ID=56994235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610278832.9A Pending CN105975519A (en) | Multi-supporting point index-based outlier detection method and system | 2016-04-28 | 2016-04-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975519A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744886A (en) * | 2013-12-23 | 2014-04-23 | 西南科技大学 | Directly extracted k nearest neighbor searching algorithm |
CN105260742A (en) * | 2015-09-29 | 2016-01-20 | 深圳大学 | Unified classification method for multiple types of data and system |
Non-Patent Citations (1)
Title |
---|
顾新财: ""面向多维数据的孤立点挖掘方法研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017185296A1 (en) * | 2016-04-28 | 2017-11-02 | 深圳大学 | Method and system for detecting outlier based on multiple support points index |
CN106503245B (en) * | 2016-11-08 | 2019-07-26 | 深圳大学 | A kind of selection method and device supporting point set |
CN106503245A (en) * | 2016-11-08 | 2017-03-15 | 深圳大学 | A kind of system of selection for supporting point set and device |
CN106951353A (en) * | 2017-03-20 | 2017-07-14 | 北京搜狐新媒体信息技术有限公司 | Work data method for detecting abnormality and device |
CN106951353B (en) * | 2017-03-20 | 2020-05-22 | 北京搜狐新媒体信息技术有限公司 | Method and device for detecting abnormality of operation data |
CN107480258A (en) * | 2017-08-15 | 2017-12-15 | 佛山科学技术学院 | A kind of metric space Outliers Detection method based on a variety of strong points |
CN107798338A (en) * | 2017-09-28 | 2018-03-13 | 佛山科学技术学院 | A kind of intensive strong point fast selecting method of big data |
CN107798338B (en) * | 2017-09-28 | 2021-03-26 | 佛山科学技术学院 | Method for quickly selecting big data dense support points |
CN110287238B (en) * | 2019-06-26 | 2022-11-29 | 广东奥博信息产业股份有限公司 | Method and system for detecting abnormal water quality based on priori knowledge |
CN112559571A (en) * | 2020-12-21 | 2021-03-26 | 国家电网公司东北分部 | Approximate outlier calculation method and system for numerical type stream data |
CN112559571B (en) * | 2020-12-21 | 2024-05-24 | 国家电网公司东北分部 | Approximate outlier calculation method and system for numerical value type stream data |
CN112733904A (en) * | 2020-12-30 | 2021-04-30 | 佛山科学技术学院 | Water quality abnormity detection method and electronic equipment |
CN112733904B (en) * | 2020-12-30 | 2022-03-25 | 佛山科学技术学院 | Water quality abnormity detection method and electronic equipment |
WO2022141746A1 (en) * | 2020-12-30 | 2022-07-07 | 佛山科学技术学院 | Method for detecting anomaly in water quality and electronic device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105975519A (en) | Multi-supporting point index-based outlier detection method and system | |
CN111831660B (en) | Method and device for evaluating metric space division mode, computer equipment and storage medium | |
CN102693266B (en) | Search for method, the navigation equipment and method of generation index structure of database | |
CN109740628A (en) | Point cloud clustering method, image processing equipment and the device with store function | |
WO2002025574A2 (en) | Data clustering methods and applications | |
CN106126918B (en) | A kind of geographical space abnormal aggregation domain scanning statistical method based on interaction force | |
US20180143945A1 (en) | Method and system for detecting outlier based on multiple pivots index | |
CN105787126B (en) | K-d tree generation method and k-d tree generation device | |
CN103020321B (en) | Neighbor search method and system | |
CN108416381B (en) | Multi-density clustering method for three-dimensional point set | |
CN104036261A (en) | Face recognition method and system | |
CN105488176A (en) | Data processing method and device | |
CN105005584A (en) | Multi-subspace Skyline query computation method | |
CN109508349A (en) | A kind of metric space Outliers Detection method and device | |
CN110580252B (en) | Space object indexing and query method under multi-objective optimization | |
CN112203324B (en) | MR positioning method and device based on position fingerprint database | |
CN110097581B (en) | Method for constructing K-D tree based on point cloud registration ICP algorithm | |
CN112183001B (en) | Hypergraph-based multistage clustering method for integrated circuits | |
CN105824853B (en) | Clustering device and method | |
CN109840558A (en) | Based on density peaks-core integration adaptive clustering scheme | |
CN110070100A (en) | A kind of agricultural weather Outliers Detection method and device that multiple-factor is integrated | |
CN108509532B (en) | Point gathering method and device applied to map | |
CN112734934B (en) | STL model 3D printing slicing method based on intersecting edge mapping | |
CN110287238A (en) | A kind of exception water quality detection method and system based on priori knowledge | |
CN102682279A (en) | High-speed fingerprint feature comparison system and method implemented by classified triangles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160928 |