CN105117485A - High-accuracy global outlier detection algorithm based on k-nearest neighbor - Google Patents

Publication number
CN105117485A
Authority
CN
China
Prior art keywords
data block
degree
peeling
outlier
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510593056.7A
Other languages
Chinese (zh)
Other versions
CN105117485B (en)
Inventor
许红龙
毛睿
陆敏华
李荣华
王毅
刘刚
陆克中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201510593056.7A priority Critical patent/CN105117485B/en
Publication of CN105117485A publication Critical patent/CN105117485A/en
Application granted granted Critical
Publication of CN105117485B publication Critical patent/CN105117485B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the field of data mining and provides a global outlier detection algorithm. The algorithm comprises the following steps: S1, a data set D is examined block by block; S2, the distance between each object of data set D and each object of the first data block is calculated, the (m+k) nearest neighbors of each object of the first data block are updated, the outlier degree of each object is calculated in real time, and objects whose outlier degree is smaller than a threshold c are excluded from the data block; S3, after the first data block has been processed, the objects not excluded from it are sorted by outlier degree in descending order, the first n objects are added to the TOP-n outliers, and the threshold c is updated; S4, the i-th data block is processed as in step S2, and after it has been processed the TOP-n outliers and the threshold c are updated; after all data blocks have been processed, the TOP-n outliers are output. The algorithm widens the range of applicable data sets and improves detection accuracy.

Description

A high-accuracy global outlier detection algorithm based on k-nearest neighbors
Technical field
The invention belongs to the field of data mining technology, and in particular relates to a high-accuracy global outlier detection algorithm based on k-nearest neighbors.
Background
An outlier is also called an anomaly or an abnormal object. The most influential definition in academia today is the one proposed by Hawkins: "An outlier is a distinctive data point in a data set that differs so much from the other points that it arouses suspicion that it is not a random deviation but was generated by an entirely different mechanism." In addition, each class of outlier detection algorithm gives its own definition of an outlier. Outlier detection, also called anomaly detection, deviation detection, or outlier mining, uses some algorithm to find the outliers in a data set, for example the TOP-n outliers or all outliers that satisfy a given requirement. In other words, outlier detection mines the very small number of points in a large data set that differ markedly from the mainstream data.
Distance-based outlier detection algorithms are general-purpose: they do not require the user to have domain knowledge, nor do they assume that the data set follows any particular probability distribution. After Knorr and Ng first proposed a distance-based definition of outliers in 1998, researchers proposed a variety of outlier definitions and corresponding detection algorithms. Three definitions are in common use.
The first derives from the definition DB(p, D) proposed by Knorr and Ng: an object O in a data set T is an outlier if at least a fraction p of the objects in T lie at a distance greater than D from O. This is equivalent to requiring that no more than k objects lie within distance R of O, a formulation that is clearly more intuitive. DB(p, D) is a binary definition: an object is either an outlier or a normal point.
The second was proposed by Ramaswamy et al. in 2000. It uses the distance between an object O and its k-th nearest neighbor as the outlier degree, so objects can be ranked to obtain the TOP-n outliers, which to some extent avoids the poor precision of a binary definition.
The third was proposed by Angiulli et al. in 2002. It is similar to the second definition, but uses the mean of the distances between an object O and its first k nearest neighbors as the outlier degree, which further improves accuracy. It has therefore become the most widely used definition in research on outlier detection algorithms.
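The three definitions above can be compared side by side with a minimal sketch. It assumes one-dimensional numeric objects with the absolute difference as the distance, and it takes the fraction p over the objects other than O; the function names, the parameter values, and the sample data are illustrative assumptions, not part of the cited definitions.

```python
# Illustrative sketch only: 1-D values, absolute difference as the distance.
# All function names and parameter choices here are hypothetical.

def sorted_distances(data, index):
    """Ascending distances from data[index] to every other object."""
    o = data[index]
    return sorted(abs(x - o) for i, x in enumerate(data) if i != index)

def is_db_outlier(data, index, p, d):
    """First definition, DB(p, D): at least a fraction p of the other
    objects lie farther than d from data[index]. Yields only True/False."""
    dists = sorted_distances(data, index)
    far = sum(1 for dist in dists if dist > d)
    return far >= p * len(dists)

def kth_nn_distance(data, index, k):
    """Second definition: distance to the k-th nearest neighbor."""
    return sorted_distances(data, index)[k - 1]

def mean_knn_distance(data, index, k):
    """Third definition: mean distance to the first k nearest neighbors."""
    return sum(sorted_distances(data, index)[:k]) / k

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
i = data.index(62)
print(is_db_outlier(data, i, p=0.9, d=10))   # binary answer
print(kth_nn_distance(data, i, k=2))         # rankable score
print(mean_knn_distance(data, i, k=2))       # rankable score
```

The first function can only label an object, whereas the other two return a score that can be sorted to obtain a TOP-n list.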
Of the three most commonly used distance-based outlier definitions, the third is generally considered to have the highest detection accuracy, but it is still not ideal, and the data sets to which it applies are limited.
Summary of the invention
The technical problem to be solved by the invention is to provide a global outlier detection algorithm that both widens the range of applicable data sets and improves detection accuracy.
The invention provides a high-accuracy global outlier detection algorithm based on k-nearest neighbors, comprising the following steps:
Step S1: the data set D is examined block by block; the portion examined at one time is called a data block, and the distances between each object of the data block and its (m+k) nearest neighbors are initialized to the maximum value;
Step S2: the distance between each object of data set D and each object of the first data block is calculated, and the (m+k) nearest neighbors of each object in the first data block are updated; the outlier degree of each object is calculated in real time, and while fewer than m+k neighbors have been found the outlier degree is set to infinity; any object whose outlier degree is less than the initial threshold c is excluded from the data block; the outlier degree of an object is the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors;
Step S3: after the first data block has been processed, the objects not excluded from it are sorted by outlier degree in descending order, the first n objects are added to the TOP-n outliers, and the threshold c is updated;
Step S4: the distance between each object of data set D and each object of the second data block is calculated, and the (m+k) nearest neighbors of each object in the second data block are updated; the outlier degree of each object is calculated in real time, and while fewer than m+k neighbors have been found the outlier degree is set to infinity; any object whose outlier degree is less than the threshold c is excluded from the data block;
Step S5: after the second data block has been processed, if the outlier degree of an object not excluded from the second data block is greater than an outlier degree among the TOP-n outliers, the TOP-n outliers are updated, and the threshold c is updated;
Step S6: for the i-th data block, i = 3, 4, 5, ..., steps S4 and S5 are repeated until all data blocks have been processed, and the TOP-n outliers are output.
Further, in step S2, the initial threshold c is set to 0.
Further, in steps S3 and S5, when the threshold c is updated, the outlier degree of the n-th outlier among the TOP-n outliers is taken as the value of the threshold c.
Compared with the prior art, the invention has the following beneficial effects. When calculating the outlier degree of an object, the high-accuracy global outlier detection algorithm based on k-nearest neighbors first discards the object's m nearest neighbors and then sums the distances between the object and its next k nearest neighbors. This makes dense outliers easier to detect while still detecting sparse outliers, which widens the range of applicable data sets. The algorithm also improves detection accuracy. In addition, setting a threshold to exclude non-outliers early saves memory.
Brief description of the drawings
Fig. 1 is a flowchart of the high-accuracy global outlier detection algorithm based on k-nearest neighbors provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The invention calculates the outlier degree of an object as the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors.
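Written as a formula, the outlier degree might be expressed as follows. The symbols od, d, and nn_i are notation assumed here for illustration (the outlier degree, the distance function, and the i-th nearest neighbor of p, excluding p itself); they do not appear in the original text.

```latex
\[
  \mathrm{od}(p) \;=\; \sum_{i=m+1}^{m+k} d\bigl(p,\ \mathrm{nn}_i(p)\bigr)
\]
```

For example, with m=2 and k=2, an object whose four nearest neighbors lie at distances 6, 8, 8, and 9 has an outlier degree of 8 + 9 = 17, because the two smallest distances are discarded.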
A straightforward implementation of a high-accuracy global outlier detection algorithm based on k-nearest neighbors is introduced first:
Directly compute the m+k nearest neighbors of every object in the data set, then compute the outlier degree of each object p (that is, the sum of the distances between p and its (m+1)-th to (m+k)-th nearest neighbors), and sort by outlier degree to obtain the TOP-n outliers.
This implementation has the advantage of simplicity; its drawback is that it consumes a great deal of memory.
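The straightforward implementation can be sketched as below. This is a simplified illustration that assumes one-dimensional numeric objects and the absolute difference as the distance; the function names are hypothetical. For brevity it recomputes the distances of each object in turn instead of holding every neighbor list in memory at once.

```python
import heapq

def outlier_degree(data, index, m, k):
    """Sum of distances to the (m+1)-th .. (m+k)-th nearest neighbors.
    Assumption: 1-D values, absolute difference as the distance."""
    o = data[index]
    dists = sorted(abs(x - o) for i, x in enumerate(data) if i != index)
    if len(dists) < m + k:
        return float("inf")  # too few neighbors: degree is infinite
    return sum(dists[m:m + k])

def brute_force_top_n(data, m, k, n):
    """Score every object, then keep the n objects with the largest degree."""
    scored = [(outlier_degree(data, i, m, k), x) for i, x in enumerate(data)]
    return [(x, deg) for deg, x in heapq.nlargest(n, scored)]

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
print(brute_force_top_n(data, m=2, k=2, n=2))
```

Such a baseline is useful mainly as a reference result against which a block-wise implementation can be checked.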
In fact, while the nearest neighbors of an object are being searched, the further the search proceeds, the lower the "temporary" outlier degree (the outlier degree calculated in real time) becomes, because ever-closer neighbors keep being found. Therefore, once the temporary outlier degree of an object falls below the threshold c, the object can no longer be an outlier: continuing the search can only make its outlier degree smaller, so it is even less likely to be an outlier. The algorithm relies on this property and uses the threshold c to exclude non-outliers early.
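The non-increasing behavior of the temporary outlier degree can be observed with a small sketch. It assumes one-dimensional numeric objects and the absolute difference as the distance; the function name and the sample values are illustrative assumptions.

```python
import bisect

def temporary_degrees(obj, stream, m, k):
    """Temporary outlier degree of obj after each object in stream is seen.
    Keeps only the m + k smallest distances found so far."""
    nearest = []   # ascending distances, at most m + k entries
    history = []
    for other in stream:
        bisect.insort(nearest, abs(other - obj))
        del nearest[m + k:]                      # drop anything beyond the (m+k)-th
        if len(nearest) < m + k:
            history.append(float("inf"))         # not enough neighbors yet
        else:
            history.append(sum(nearest[m:]))     # (m+1)-th .. (m+k)-th distances
    return history

# Object 6 compared against 3, 1, 4, 15, 9 in that order, with m = k = 2.
history = temporary_degrees(6, [3, 1, 4, 15, 9], m=2, k=2)
print(history)
```

Each new object either replaces a farther neighbor or changes nothing, so the sequence never rises; once it drops below c the object can be discarded safely.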
The high-accuracy global outlier detection algorithm based on k-nearest neighbors provided by the invention is introduced below. The algorithm comprises the following steps:
Step S1: the data set D is examined block by block; the portion examined at one time is called a data block, and the distances between each object of the data block and its (m+k) nearest neighbors are initialized to the maximum value;
Step S2: the distance between each object of data set D and each object of the first data block is calculated, and the (m+k) nearest neighbors of each object in the first data block are updated; the outlier degree of each object is calculated in real time, and while fewer than m+k neighbors have been found the outlier degree is set to infinity; any object whose outlier degree is less than the initial threshold c is excluded from the data block (the initial threshold c is set to 0); the outlier degree of an object is the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors;
Step S3: after the first data block has been processed, the objects not excluded from it are sorted by outlier degree in descending order, the first n objects are added to the TOP-n outliers, and the threshold c is updated;
Step S4: the distance between each object of data set D and each object of the second data block is calculated, and the (m+k) nearest neighbors of each object in the second data block are updated; the outlier degree of each object is calculated in real time, and while fewer than m+k neighbors have been found the outlier degree is set to infinity; any object whose outlier degree is less than the threshold c is excluded from the data block;
Step S5: after the second data block has been processed, if the outlier degree of an object not excluded from the second data block is greater than an outlier degree among the TOP-n outliers, the TOP-n outliers are updated, and the threshold c is updated;
Step S6: for the i-th data block, i = 3, 4, 5, ..., steps S4 and S5 are repeated until all data blocks have been processed, and the TOP-n outliers are output.
In steps S3 and S5, when the threshold c is updated, the outlier degree of the n-th outlier among the TOP-n outliers is taken as the value of the threshold c.
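Steps S1 to S6 might be sketched as follows. This is one possible reading under stated assumptions, not the patented implementation: objects are one-dimensional numbers, the distance is the absolute difference, blocks are consecutive slices of fixed size, and the names top_n_outliers and block_size are hypothetical.

```python
import bisect
import math

def top_n_outliers(data, block_size, m, k, n):
    top = []   # (degree, value) pairs, largest degree first
    c = 0      # initial threshold
    for start in range(0, len(data), block_size):
        # S1: each block object starts with an empty list of neighbor distances
        stop = min(start + block_size, len(data))
        block = {i: [] for i in range(start, stop)}
        # S2 / S4: scan the whole data set against the current block
        for j, other in enumerate(data):
            for i in list(block):
                if i == j:
                    continue                       # an object is not its own neighbor
                nearest = block[i]
                bisect.insort(nearest, abs(data[i] - other))
                del nearest[m + k:]                # keep the m + k closest only
                full = len(nearest) == m + k
                degree = sum(nearest[m:]) if full else math.inf
                if degree < c:
                    del block[i]                   # early exclusion frees its list
        # S3 / S5: merge the survivors into the TOP-n and update the threshold
        survivors = [
            (sum(d[m:]) if len(d) == m + k else math.inf, data[i])
            for i, d in block.items()
        ]
        top = sorted(top + survivors, reverse=True)[:n]
        if len(top) == n:
            c = top[-1][0]                         # degree of the n-th outlier
    # S6: all blocks processed
    return [(value, degree) for degree, value in top]

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
print(top_n_outliers(data, block_size=5, m=2, k=2, n=2))
```

Only the neighbor lists of the current block are held at any time, and lists of excluded objects are dropped immediately, which is where the memory saving described in the text would come from.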
A specific embodiment that uses the above algorithm to search for the TOP-n outliers is described below:
A one-dimensional data set contains the following objects:
3,1,4,15,9,2,6,5,35,7,97,93,23,84,62
It is divided into three data blocks of 5 objects each:
First data block: 3, 1, 4, 15, 9
Second data block: 2, 6, 5, 35, 7
Third data block: 97, 93, 23, 84, 62
Suppose m=2, k=2, n=2; that is, the outlier degree of an object is the sum of its distances to its 3rd and 4th nearest neighbors, and the 2 objects with the largest outlier degree are required.
Note that the neighbors of an object do not include the object itself, and that while fewer than m+k neighbors have been found, the outlier degree is set to infinity.
I. Processing the first data block (3, 1, 4, 15, 9) with threshold c=0 (every outlier degree is greater than or equal to 0, so in practice no object is excluded early while the first data block is processed):
The distance between each object of the whole data set and each object of this data block is calculated:
(1) The 1st object of the data set, 3: calculate its distance to each object of the data block and save the current neighbors of each object of the data block. Calculate the outlier degree in real time and exclude from the data block any object whose outlier degree is less than the threshold c.
(2) The 2nd object of the data set, 1: calculate its distance to each object of the data block; if a distance is smaller than that of a saved neighbor, update the neighbors. Calculate the outlier degree and exclude from the data block any object whose outlier degree is less than the threshold c.
(3)……
(4) The last object of the data set, 62: calculate its distance to each object of the data block, try to update the neighbors, calculate the outlier degree, and check whether any object can be excluded early.
(5) After every object of the data set has been compared with each object of this data block, the final neighbors of each object of the data block are known.
After the first data block has been processed, the outlier degree is calculated for each object that was not excluded early, and the two objects with the largest outlier degree are added to the TOP-n:
First: 15, with outlier degree 17;
Second: 9, with outlier degree 9;
The n-th entry of the TOP-n, that is, the 2nd, has outlier degree 9, so the threshold becomes c=9.
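The figures for the first data block can be reproduced with a short sketch that lists, for each object of the block, its four nearest neighbors and the resulting outlier degree. It assumes the absolute difference as the distance and relies on all values in the example being distinct; the helper name neighbors is hypothetical.

```python
data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
m, k, n = 2, 2, 2

def neighbors(value):
    """The m + k nearest neighbors of value as (distance, neighbor) pairs.
    Assumes all values are distinct, as in the example data set."""
    pairs = sorted((abs(value - x), x) for x in data if x != value)
    return pairs[:m + k]

block1 = data[:5]                      # 3, 1, 4, 15, 9
degrees = {}
for v in block1:
    nn = neighbors(v)
    # The degree uses only the 3rd and 4th neighbors (positions m .. m+k-1).
    degrees[v] = sum(dist for dist, _ in nn[m:])
    print(v, nn, degrees[v])

top = sorted(degrees.items(), key=lambda item: item[1], reverse=True)[:n]
c = top[-1][1]                         # degree of the n-th entry becomes the threshold
print(top, c)
```

Because c starts at 0 and no degree is negative, all five objects reach this final ranking step.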
II. Processing the second data block (2, 6, 5, 35, 7) with threshold c=9:
The distance between each object of the whole data set and each object of this data block is calculated:
(1) The 1st object of the data set, 3: calculate its distance to each object of the data block and save the current neighbors of each object of the data block. Calculate the outlier degree in real time (while fewer than m+k neighbors have been found, the outlier degree is set to infinity) and check whether any object falls below the threshold c and can be excluded from the data block early.
(2) The 2nd object of the data set, 1: calculate its distance to each object of the data block; if a distance is smaller than that of a saved neighbor, update the neighbors. Calculate the outlier degree in real time (while fewer than m+k neighbors have been found, the outlier degree is set to infinity) and exclude from the data block any object whose outlier degree is less than the threshold c.
(3) The 3rd object of the data set, 4: calculate its distance to each object of the data block; if a distance is smaller than that of a saved neighbor, update the neighbors. Calculate the outlier degree in real time (while fewer than m+k neighbors have been found, the outlier degree is set to infinity) and exclude from the data block any object whose outlier degree is less than the threshold c.
(4) The 4th object of the data set, 15: calculate its distance to each object of the data block, calculate the outlier degree in real time, and exclude from the data block any object whose outlier degree is less than the threshold c.
Up to this point, every outlier degree is greater than the threshold (c=9).
(5) The 5th object of the data set, 9: calculate its distance to each object of the data block and calculate the outlier degree in real time.
The current outlier degrees of objects 6 and 5 are now below the threshold 9, so both can be deleted from the data block early (as the neighbor search continues, their outlier degrees can only decrease or stay the same; they cannot increase).
(6) The 6th object of the data set, 2: calculate its distance to each object of the data block and calculate the outlier degree in real time. (At this point only three objects, 2, 35, and 7, remain in the data block; the neighbors of 2 do not include 2 itself.)
(7) The 7th object of the data set, 6: calculate its distance to each object of the data block and calculate the outlier degree in real time.
The current outlier degrees of objects 2 and 7 are now below the threshold 9, so both can be deleted from the data block early. Only object 35 remains in the data block.
(8)……
(9) After all objects of the data set have been compared with this data block, the final outlier degrees are known.
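The early exclusions described in this walkthrough can be traced with a short sketch. It assumes the absolute difference as the distance, relies on the values in the example being distinct, and records which block objects are removed right after each scanned object; the variable name removed_after is hypothetical.

```python
import bisect

data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
m, k, c = 2, 2, 9

block = {v: [] for v in data[5:10]}    # 2, 6, 5, 35, 7 with empty neighbor lists
removed_after = {}                     # scanned object -> block objects excluded after it

for other in data:
    for v in list(block):
        if v == other:
            continue                   # an object is not its own neighbor
        bisect.insort(block[v], abs(v - other))
        del block[v][m + k:]           # keep only the m + k smallest distances
        # Exclude as soon as a complete temporary degree drops below c.
        if len(block[v]) == m + k and sum(block[v][m:]) < c:
            del block[v]
            removed_after.setdefault(other, []).append(v)

print(removed_after)
print({v: sum(d[m:]) for v, d in block.items()})
```

The exclusion test is strict (less than c), so an object whose temporary degree merely equals 9 stays in the block until a later comparison lowers it further.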
Update the TOP-n:
First: 35, with outlier degree 53;
Second: 15, with outlier degree 17;
Update the threshold to c=17.
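The update of the TOP-n list and of the threshold might be sketched as a small helper. The function name update_top_n is hypothetical, and keeping c at 0 while fewer than n outliers are known is an assumption that the text does not address.

```python
def update_top_n(top, survivors, n):
    """Merge surviving (value, degree) pairs into the current TOP-n list.
    Returns the new list and the new threshold c."""
    merged = sorted(top + survivors, key=lambda pair: pair[1], reverse=True)[:n]
    # Assumption: until n outliers are known, the threshold stays at 0.
    c = merged[-1][1] if len(merged) == n else 0
    return merged, c

# State after the first block, plus the single survivor of the second block.
top, c = update_top_n([(15, 17), (9, 9)], [(35, 53)], n=2)
print(top, c)
```

Because c is always the smallest degree in the list, it can only grow from block to block, which makes the early exclusion progressively stricter.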
III. Processing continues with the remaining data block
The TOP-n outliers finally obtained are:
First: 97, with outlier degree 97;
Second: 93, with outlier degree 89;
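The final result can be checked for the third data block with a brief sketch. It starts from the TOP-n list and threshold stated after the second block and uses the fact that the temporary outlier degree never rises, so an object is excluded early exactly when its final degree is below c. The absolute difference as the distance and the helper name degree are assumptions.

```python
data = [3, 1, 4, 15, 9, 2, 6, 5, 35, 7, 97, 93, 23, 84, 62]
m, k, n = 2, 2, 2
top, c = [(35, 53), (15, 17)], 17      # state after the second block, from the text

def degree(value):
    """Final outlier degree; assumes all values in data are distinct."""
    dists = sorted(abs(value - x) for x in data if x != value)
    return sum(dists[m:m + k])

block3 = data[10:]                     # 97, 93, 23, 84, 62
final = {v: degree(v) for v in block3}
# Temporary degrees only decrease toward the final degree, so the survivors
# are exactly the objects whose final degree is not below c.
survivors = [(v, d) for v, d in final.items() if d >= c]
top = sorted(top + survivors, key=lambda pair: pair[1], reverse=True)[:n]
print(final)
print(top)
```

In this block no object falls below 17, so all five take part in the final ranking.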
In the invention, the outlier degree of an object is calculated by first discarding the object's m nearest neighbors and then summing the distances between the object and its next k nearest neighbors. Discarding the m nearest neighbors makes dense outliers easier to detect while still allowing sparse outliers to be detected, which widens the range of applicable data sets and also improves detection accuracy. In addition, using the threshold c to exclude non-outliers early saves memory.
The algorithm provided by the invention is suitable for scientific research, where it can serve as a comparison algorithm; it is applied by implementing it in a program and running it on the data set to be examined. It is also suitable for industry, where it can be applied to network intrusion detection, credit card fraud detection, medical and public health monitoring, industrial damage detection, and similar fields, applied in the same way.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (3)

1. A high-accuracy global outlier detection algorithm based on k-nearest neighbors, characterized in that it comprises the following steps:
Step S1: the data set D is examined block by block; the portion examined at one time is called a data block, and the distances between each object of the data block and its (m+k) nearest neighbors are initialized to the maximum value;
Step S2: the distance between each object of data set D and each object of the first data block is calculated, and the (m+k) nearest neighbors of each object in the first data block are updated; the outlier degree of each object is calculated in real time, and while fewer than m+k neighbors have been found the outlier degree is set to infinity; any object whose outlier degree is less than the initial threshold c is excluded from the data block; the outlier degree of an object is the sum of the distances between the object and its (m+1)-th to (m+k)-th nearest neighbors;
Step S3: after the first data block has been processed, the objects not excluded from it are sorted by outlier degree in descending order, the first n objects are added to the TOP-n outliers, and the threshold c is updated;
Step S4: the distance between each object of data set D and each object of the second data block is calculated, and the (m+k) nearest neighbors of each object in the second data block are updated; the outlier degree of each object is calculated in real time, and while fewer than m+k neighbors have been found the outlier degree is set to infinity; any object whose outlier degree is less than the threshold c is excluded from the data block;
Step S5: after the second data block has been processed, if the outlier degree of an object not excluded from the second data block is greater than an outlier degree among the TOP-n outliers, the TOP-n outliers are updated, and the threshold c is updated;
Step S6: for the i-th data block, i = 3, 4, 5, ..., steps S4 and S5 are repeated until all data blocks have been processed, and the TOP-n outliers are output.
2. The global outlier detection algorithm according to claim 1, characterized in that, in step S2, the initial threshold c is set to 0.
3. The global outlier detection algorithm according to claim 1, characterized in that, in steps S3 and S5, when the threshold c is updated, the outlier degree of the n-th outlier among the TOP-n outliers is taken as the value of the threshold c.
CN201510593056.7A 2015-09-17 2015-09-17 High-accuracy global outlier detection algorithm based on k-nearest neighbor Expired - Fee Related CN105117485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510593056.7A CN105117485B (en) 2015-09-17 2015-09-17 High-accuracy global outlier detection algorithm based on k-nearest neighbor

Publications (2)

Publication Number Publication Date
CN105117485A true CN105117485A (en) 2015-12-02
CN105117485B CN105117485B (en) 2018-07-20

Family

ID=54665473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510593056.7A Expired - Fee Related CN105117485B (en) 2015-09-17 2015-09-17 High-accuracy global outlier detection algorithm based on k-nearest neighbor

Country Status (1)

Country Link
CN (1) CN105117485B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787520A (en) * 2016-03-25 2016-07-20 中国农业大学 Cluster and outlier discovery algorithm based on natural shared nearest neighbor search
CN105844102A (en) * 2016-03-25 2016-08-10 中国农业大学 Self-adaptive parameter-free spatial outlier detection algorithm
CN107798338A (en) * 2017-09-28 2018-03-13 佛山科学技术学院 A kind of intensive strong point fast selecting method of big data
CN108776675A (en) * 2018-05-24 2018-11-09 西安电子科技大学 LOF outlier detection methods based on k-d tree
CN110287238A (en) * 2019-06-26 2019-09-27 广东奥博信息产业股份有限公司 A kind of exception water quality detection method and system based on priori knowledge
CN111275480A (en) * 2020-01-07 2020-06-12 成都信息工程大学 Multi-dimensional sparse sales data warehouse oriented fraud behavior mining method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104714964B (en) * 2013-12-13 2018-03-23 中国移动通信集团公司 A kind of physiological data Outliers Detection method and device
CN104462379A (en) * 2014-12-10 2015-03-25 深圳大学 Distance-based high-accuracy global outlier detection algorithm

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787520A (en) * 2016-03-25 2016-07-20 中国农业大学 Cluster and outlier discovery algorithm based on natural shared nearest neighbor search
CN105844102A (en) * 2016-03-25 2016-08-10 中国农业大学 Self-adaptive parameter-free spatial outlier detection algorithm
CN105844102B (en) * 2016-03-25 2018-05-08 中国农业大学 One kind is adaptively without ginseng Spatial Outlier Detection method
CN107798338A (en) * 2017-09-28 2018-03-13 佛山科学技术学院 A kind of intensive strong point fast selecting method of big data
CN107798338B (en) * 2017-09-28 2021-03-26 佛山科学技术学院 Method for quickly selecting big data dense support points
CN108776675A (en) * 2018-05-24 2018-11-09 西安电子科技大学 LOF outlier detection methods based on k-d tree
CN110287238A (en) * 2019-06-26 2019-09-27 广东奥博信息产业股份有限公司 A kind of exception water quality detection method and system based on priori knowledge
CN111275480A (en) * 2020-01-07 2020-06-12 成都信息工程大学 Multi-dimensional sparse sales data warehouse oriented fraud behavior mining method

Also Published As

Publication number Publication date
CN105117485B (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN105117485A (en) High-accuracy global outlier detection algorithm based on k-nearest neighbor
US20130318011A1 (en) Method for Detecting Anomalies in Multivariate Time Series Data
Tsintotas et al. Probabilistic appearance-based place recognition through bag of tracked words
CN115311329B (en) Video multi-target tracking method based on double-link constraint
TWI708209B (en) Object detection method using cnn model and object detection apparatus using the same
CN106803263A (en) A kind of method for tracking target and device
Zhang et al. Clustering-based missing value imputation for data preprocessing
CN102378992A (en) Articulated region detection device and method for same
CN111798487B (en) Target tracking method, apparatus and computer readable storage medium
CN105681339A (en) Incremental intrusion detection method fusing rough set theory and DS evidence theory
CN101414358B (en) Method for detecting and extracting chromosome contour based on directional searching
CN109829936B (en) Target tracking method and device
CN103942536A (en) Multi-target tracking method of iteration updating track model
CN105844102A (en) Self-adaptive parameter-free spatial outlier detection algorithm
CN101908214B (en) Moving object detection method with background reconstruction based on neighborhood correlation
Fan et al. Siamese residual network for efficient visual tracking
Gangurde Feature selection using clustering approach for big data
CN110795599B (en) Video emergency monitoring method and system based on multi-scale graph
CN108241150B (en) Method for detecting and tracking moving object in three-dimensional sonar point cloud environment
Xie et al. Hierarchical forest based fast online loop closure for low-latency consistent visual-inertial SLAM
CN102722732A (en) Image set matching method based on data second order static modeling
Emami et al. Online failure detection and correction for CAMShift tracking algorithm
CN113012193A (en) Multi-pedestrian tracking method based on deep learning
CN109828996A (en) A kind of Incomplete data set rapid attribute reduction
CN103699690A (en) Accurate method for seeking minimal change region in process model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180720

Termination date: 20210917
