CN105373806A - Outlier detection method based on uncertain data set - Google Patents

Info

Publication number
CN105373806A
Authority
CN
China
Prior art keywords
dist
distance
data point
data
uncertain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510676188.6A
Other languages
Chinese (zh)
Inventor
刘文婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201510676188.6A
Publication of CN105373806A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06K: RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00: Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/62: Methods or arrangements for recognition using electronic means
    • G06K9/6201: Matching; Proximity measures
    • G06K9/6215: Proximity measures, i.e. similarity or distance measures

Abstract

The invention discloses an outlier detection method based on an uncertain data set, belonging to the technical field of outlier data mining. The method comprises: step 1, computing the k-distance and the k-distance neighborhood of each data point in the uncertain data set; step 2, computing the probability that a data point q in the k-distance neighborhood is a neighbor of a data point o; step 3, computing the reachability distance of each data point; step 4, computing the reachability density of each data point; and step 5, computing the outlier factor of each data point in order to determine the outliers. The method can effectively find the outliers hidden in the uncertain data set.

Description

An outlier detection method based on an uncertain data set
Technical field
The present invention relates to the technical field of outlier data mining, and in particular to an outlier detection method based on an uncertain data set.
Background art
Outlier data mining is one of the research hotspots of current data mining. Existing outlier mining is mainly deterministic and is based on distance or nearest-neighbor concepts. With the widespread adoption of the Internet and the mobile Internet, large amounts of uncertain data are used in fields such as financial and economic analysis, electronic communication, and modern logistics. Because the data themselves are uncertain, it is difficult to judge accurately whether a data item is abnormal, and therefore difficult to identify definite outliers. In an uncertain data set, even if a data object does not itself look like an outlier, it may still be suspected of being abnormal if its degree of uncertainty is very high. Outlier detection on an uncertain data set therefore needs to determine the degree of uncertainty, that is, the outlier degree, of each data item.
Summary of the invention
To overcome the above defects and deficiencies of the prior art, the present invention provides an outlier detection method based on an uncertain data set. The method can effectively find the outliers hidden in an uncertain data set and determine the outlier degree of each data item, and can be widely applied in fields such as financial and economic analysis, electronic communication, and modern logistics.
To solve the above technical problem, the invention provides an outlier detection method based on an uncertain data set, comprising the following steps:
Step 1: compute the k-distance and the k-distance neighborhood of each data point o in the uncertain data set D;
Step 2: compute the probability that a data point q in the k-distance neighborhood is a neighbor of data point o;
Step 3: compute the reachability distance from each data point q in the k-distance neighborhood to data point o, and the corresponding probability density function;
Step 4: compute the reachability density of each data point o;
Step 5: compute the outlier factor of each data point o and determine the outliers.
Step 1 comprises the following sub-steps:
1-1) Formalize the data set;
The uncertain data set D is expressed as $D = \{o_1, o_2, \ldots, o_i, \ldots, o_n\}$, where n is the size of D and $o_i$ is a data point in the data set. Each data point has d dimensions, that is, d attribute values, and each attribute is associated with a probability density function $f_i^j(\cdot)$ and a cumulative distribution function $F_i^j(\cdot)$. The data point $o_i$ is then expressed as:
1-2) Determine the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ of data point o;
The k-distance is, for each data point o in the uncertain data set D, the smallest distance value such that the probability of at least k nearest neighbors lying within that distance is not less than $\varepsilon$. It is denoted $\mathrm{k\_dist}_{\varepsilon}(o)$, where k is a positive integer and $\varepsilon \in (0, 1]$;
1-3) Define the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$ of data point o;
The k-distance neighborhood is the set of points in the uncertain data set D whose minimum distance to data point o is less than $\mathrm{k\_dist}_{\varepsilon}(o)$. It is denoted $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$:
$N_{\mathrm{k\_dist}_{\varepsilon}}(o) = \{\, q \mid \mathrm{min\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) \,\}$,
where $\mathrm{min\_dist}(q, o)$ is the minimum separation between the distribution range of a data point q in the k-distance neighborhood and the distribution range of data point o;
1-4) Compute the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ with an iterative algorithm:
Let $p_o(k\_d)$ denote the probability that data point o has at least k neighbors within distance $k\_d \in (0, R_{max}]$; when $k\_d = \mathrm{k\_dist}_{\varepsilon}$, $p_o(k\_d) = \varepsilon$. $R_{max}$ is the maximum distance between two data points in the uncertain data set;
A) Initialization:
The minimum distance between two data points in the uncertain data set is 0, denoted low, and the maximum distance between two data points is $R_{max}$, denoted up. The distance $k\_d$ between two data points therefore ranges over (low, up]. Take the midpoint as the initial value of $k\_d$, that is, $k\_d = (low + up)/2$, and initialize the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$ to the empty set $\Phi$:
$N_{\mathrm{k\_dist}_{\varepsilon}}(o) = \Phi;\quad low = 0;\quad up = R_{max};\quad k\_d = (low + up)/2$
B) Compute the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$:
While $|p_o(k\_d) - \varepsilon| \ge \delta$: if $p_o(k\_d) < \varepsilon$, set $low = k\_d$; otherwise set $up = k\_d$. Then set $k\_d = (low + up)/2$. Repeat until $|p_o(k\_d) - \varepsilon| < \delta$ holds, which gives $\mathrm{k\_dist}_{\varepsilon}(o) = k\_d$;
C) Compute the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$:
Let:
For any data point p in the uncertain data set other than data point o, if $\mathrm{max\_dist}(p, o) < \mathrm{k\_dist}_{\varepsilon}(o)$, then $N_{\mathrm{k\_dist}_{\varepsilon}}(o) = N_{\mathrm{k\_dist}_{\varepsilon}}(o) \cup \{p\}$, that is, p is inserted into the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$, where $\mathrm{max\_dist}(p, o)$ is the maximum separation between the distribution range of data point p and the distribution range of data point o.
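The bisection in sub-steps A) and B) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the text does not say how $p_o(k\_d)$ is evaluated, so it is passed in as a callable that is assumed to be non-decreasing in the distance. The function name `k_distance`, the default tolerance, and the `max_iter` safety cap are assumptions added for the example.

```python
from typing import Callable


def k_distance(p_o: Callable[[float], float], r_max: float, eps: float,
               delta: float = 1e-6, max_iter: int = 200) -> float:
    """Bisection search for k_d with |p_o(k_d) - eps| < delta on (0, r_max].

    p_o is assumed to be non-decreasing in the distance; max_iter is a
    safety cap that is not part of the described procedure.
    """
    low, up = 0.0, r_max          # A) initialization
    k_d = (low + up) / 2
    for _ in range(max_iter):     # B) iterate until the tolerance is met
        p = p_o(k_d)
        if abs(p - eps) < delta:
            break
        if p < eps:
            low = k_d             # too few neighbors: search larger radii
        else:
            up = k_d              # enough neighbors: search smaller radii
        k_d = (low + up) / 2
    return k_d


# Toy probability curve, purely illustrative: rises linearly and saturates at 1.
def toy_p(r: float) -> float:
    return min(1.0, r / 10.0)


print(k_distance(toy_p, r_max=20.0, eps=0.5))
```

Passing the probability as a callable keeps the search independent of how the neighbor-count probability is estimated, which the text leaves open.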
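In step 2, the probability $P_o(q)$ that a data point q in the k-distance neighborhood is a neighbor of data point o is computed as follows:
If the minimum distance $\mathrm{min\_dist}(q, o)$ between q and o is greater than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = 0$;
If the maximum distance $\mathrm{max\_dist}(q, o)$ between q and o is less than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = 1$;
If the maximum distance $\mathrm{max\_dist}(q, o)$ between q and o is greater than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ and the minimum distance $\mathrm{min\_dist}(q, o)$ between q and o is less than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = F_{o,q}(\mathrm{k\_dist}_{\varepsilon})$, where $F_{o,q}$ is the cumulative distribution function of o and q;
Specifically:
$$P_o(q) = \begin{cases} 0, & \mathrm{min\_dist}(q, o) > \mathrm{k\_dist}_{\varepsilon}(o) \\ 1, & \mathrm{max\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) \\ F_{o,q}(\mathrm{k\_dist}_{\varepsilon}), & \mathrm{min\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) < \mathrm{max\_dist}(q, o) \end{cases}$$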
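The three cases of step 2 map directly onto a small function. The sketch below assumes the cumulative distribution function is supplied as a callable; the name `neighbor_probability` is hypothetical, and boundary cases of exact equality, which the text does not specify, are sent to the CDF branch here.

```python
from typing import Callable


def neighbor_probability(min_dist: float, max_dist: float, k_dist: float,
                         cdf: Callable[[float], float]) -> float:
    """Probability that q is a neighbor of o, following the three cases.

    cdf stands in for F_{o,q}; equality cases fall through to the CDF
    branch, which is an assumption of this sketch.
    """
    if min_dist > k_dist:
        return 0.0            # q lies entirely outside the k-distance
    if max_dist < k_dist:
        return 1.0            # q lies entirely inside the k-distance
    return cdf(k_dist)        # ranges overlap the boundary: use F_{o,q}


# Illustrative CDF only: distance assumed uniform on [2, 6].
def uniform_cdf(r: float) -> float:
    return (r - 2.0) / (6.0 - 2.0)


print(neighbor_probability(2.0, 6.0, 3.0, uniform_cdf))
```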
Step 3 comprises the following sub-steps:
3-1) Compute the probability density function $fd_{oq}$;
For any two different data points o, q ∈ D there are m*m different distance values between them. The m*m distance values are sorted in ascending order and divided into equal-width intervals, and the number of distance values falling in each interval is counted. A polynomial polF(·) fitted by least squares represents the distribution function $FD(r_{o,q})$ of the distance values, specifically:
Differentiating $FD(r_{o,q})$ with respect to the distance r gives the probability density function $fd_{oq}$ of the distance;
3-2) Compute the reachability distance $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ of each data point, as follows:
where r is the distance between data points o and q.
In step 4, the reachability density $lrd_k(o)$ of the data point is computed by the following formula:
where $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ is the reachability distance of the data point and $P_o(q)$ is the probability that data point q in the k-distance neighborhood is a neighbor of data point o.
In step 5, the outlier factor of each data point o is computed;
The outlier degree of each data point o in the uncertain data set D is expressed as a probability, computed by the following formula:
where $LOF_k(o)$ is the outlier degree of each data point o. Given a default outlier degree σ chosen by the user, if $LOF_k(o) > \sigma$, then data point o is an outlier.
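For orientation, the notation $lrd_k$ and $LOF_k$ mirrors the classical local outlier factor for deterministic data, whose standard textbook definitions are given below. This is background only, stated as an assumption about the starting point of the method: it omits the neighbor probability $P_o(q)$ and the uncertain reachability distance $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ described in this text, and it is not the formula of this application. The symbols $d(o, q)$ and $N_k(o)$ denote the ordinary distance and k-nearest-neighbor set.

```latex
\begin{align*}
\mathrm{reach\text{-}dist}_k(o, q) &= \max\{\, k\text{-}\mathrm{dist}(q),\; d(o, q) \,\} \\
lrd_k(o) &= \frac{|N_k(o)|}{\sum_{q \in N_k(o)} \mathrm{reach\text{-}dist}_k(o, q)} \\
LOF_k(o) &= \frac{1}{|N_k(o)|} \sum_{q \in N_k(o)} \frac{lrd_k(q)}{lrd_k(o)}
\end{align*}
```

In the classical setting a value of $LOF_k(o)$ close to 1 indicates density comparable to the neighbors, while values well above 1 indicate an outlier, which is consistent with thresholding against a user-chosen σ.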
Advantageous effects of the invention: the outlier detection method based on an uncertain data set provided by the invention can effectively find the outliers hidden in an uncertain data set and determine the outlier degree of each data item, and can be widely applied in fields such as financial and economic analysis, electronic communication, and modern logistics.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the present invention.
Detailed description of the embodiments
For a better understanding of the technical features, technical content, and technical effects of the present invention, the accompanying drawing is described in detail below in conjunction with an embodiment.
The invention is further described below with reference to the drawing and the embodiment.
As shown in Fig. 1, the invention provides an outlier detection method based on an uncertain data set, comprising the following steps:
Step 1: compute the k-distance and the k-distance neighborhood of each data point o in the uncertain data set D, as follows:
1-1) Formalize the data set;
The uncertain data set D is expressed as $D = \{o_1, o_2, \ldots, o_i, \ldots, o_n\}$, where n is the size of D and $o_i$ is a data point in the data set. Each data point has d dimensions, that is, d attribute values, and each attribute is associated with a probability density function $f_i^j(\cdot)$ and a cumulative distribution function $F_i^j(\cdot)$. The data point $o_i$ is then expressed as:
1-2) Determine the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ of data point o;
The k-distance is, for each data point o in the uncertain data set D, the smallest distance value such that the probability of at least k nearest neighbors lying within that distance is not less than $\varepsilon$. It is denoted $\mathrm{k\_dist}_{\varepsilon}(o)$, where k is a positive integer and $\varepsilon \in (0, 1]$;
1-3) Define the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$ of data point o;
The k-distance neighborhood is the set of points in the uncertain data set D whose minimum distance to data point o is less than $\mathrm{k\_dist}_{\varepsilon}(o)$. It is denoted $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$:
$N_{\mathrm{k\_dist}_{\varepsilon}}(o) = \{\, q \mid \mathrm{min\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) \,\}$,
where $\mathrm{min\_dist}(q, o)$ is the minimum separation between the distribution range of a data point q in the k-distance neighborhood and the distribution range of data point o;
1-4) Compute the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ with an iterative algorithm:
Let $p_o(k\_d)$ denote the probability that data point o has at least k neighbors within distance $k\_d \in (0, R_{max}]$; when $k\_d = \mathrm{k\_dist}_{\varepsilon}$, $p_o(k\_d) = \varepsilon$. $R_{max}$ is the maximum distance between two data points in the uncertain data set;
A) Initialization:
The minimum distance between two data points in the uncertain data set is 0, denoted low, and the maximum distance between two data points is $R_{max}$, denoted up. The distance $k\_d$ between two data points therefore ranges over (low, up]. Take the midpoint as the initial value of $k\_d$, that is, $k\_d = (low + up)/2$, and initialize the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$ to the empty set $\Phi$:
$N_{\mathrm{k\_dist}_{\varepsilon}}(o) = \Phi;\quad low = 0;\quad up = R_{max};\quad k\_d = (low + up)/2$
B) Compute the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$:
While $|p_o(k\_d) - \varepsilon| \ge \delta$: if $p_o(k\_d) < \varepsilon$, set $low = k\_d$; otherwise set $up = k\_d$. Then set $k\_d = (low + up)/2$. Repeat until $|p_o(k\_d) - \varepsilon| < \delta$ holds, which gives $\mathrm{k\_dist}_{\varepsilon}(o) = k\_d$;
C) Compute the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$:
Let:
For any data point p in the uncertain data set other than data point o, if $\mathrm{max\_dist}(p, o) < \mathrm{k\_dist}_{\varepsilon}(o)$, then $N_{\mathrm{k\_dist}_{\varepsilon}}(o) = N_{\mathrm{k\_dist}_{\varepsilon}}(o) \cup \{p\}$, that is, p is inserted into the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$, where $\mathrm{max\_dist}(p, o)$ is the maximum separation between the distribution range of data point p and the distribution range of data point o.
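The text does not state how the distribution range of a data point is represented. As a minimal sketch, suppose each range is an axis-aligned box given by one (low, high) interval per dimension and distances are Euclidean; both are assumptions of this example, as are the type alias `Box` and the function name `k_distance_neighborhood`. Sub-step C) then becomes a filter on `max_dist`.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed representation: one (low, high) interval per dimension.
Box = Sequence[Tuple[float, float]]


def min_dist(a: Box, b: Box) -> float:
    """Smallest Euclidean distance between any point of a and any point of b."""
    total = 0.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        gap = max(blo - ahi, alo - bhi, 0.0)   # 0 when the intervals overlap
        total += gap * gap
    return math.sqrt(total)


def max_dist(a: Box, b: Box) -> float:
    """Largest Euclidean distance between any point of a and any point of b."""
    total = 0.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        span = max(ahi - blo, bhi - alo)       # farthest pair in this dimension
        total += span * span
    return math.sqrt(total)


def k_distance_neighborhood(o_id: str, boxes: Dict[str, Box],
                            k_dist: float) -> List[str]:
    """Sub-step C): collect every p != o with max_dist(p, o) < k_dist."""
    o_box = boxes[o_id]
    return [pid for pid, box in boxes.items()
            if pid != o_id and max_dist(box, o_box) < k_dist]


boxes = {
    "a": [(0.0, 1.0), (0.0, 1.0)],
    "b": [(2.0, 3.0), (0.0, 1.0)],
    "c": [(10.0, 11.0), (0.0, 1.0)],
}
print(k_distance_neighborhood("a", boxes, k_dist=4.0))
```

Note that the set definition in sub-step 1-3) is stated with `min_dist`, while the insertion rule in sub-step C) uses `max_dist`; the sketch follows sub-step C) as written.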
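Step 2: compute the probability that a data point q in the k-distance neighborhood is a neighbor of data point o, as follows:
If the minimum distance $\mathrm{min\_dist}(q, o)$ between q and o is greater than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = 0$;
If the maximum distance $\mathrm{max\_dist}(q, o)$ between q and o is less than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = 1$;
If the maximum distance $\mathrm{max\_dist}(q, o)$ between q and o is greater than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ and the minimum distance $\mathrm{min\_dist}(q, o)$ between q and o is less than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = F_{o,q}(\mathrm{k\_dist}_{\varepsilon})$, where $F_{o,q}$ is the cumulative distribution function of o and q;
Specifically:
$$P_o(q) = \begin{cases} 0, & \mathrm{min\_dist}(q, o) > \mathrm{k\_dist}_{\varepsilon}(o) \\ 1, & \mathrm{max\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) \\ F_{o,q}(\mathrm{k\_dist}_{\varepsilon}), & \mathrm{min\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) < \mathrm{max\_dist}(q, o) \end{cases}$$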
Step 3: compute the reachability distance from each data point q in the k-distance neighborhood to data point o, and the corresponding probability density function, as follows:
3-1) Compute the probability density function $fd_{oq}$;
For any two different data points o, q ∈ D there are m*m different distance values between them. The m*m distance values are sorted in ascending order and divided into equal-width intervals, and the number of distance values falling in each interval is counted. A polynomial polF(·) fitted by least squares represents the distribution function $FD(r_{o,q})$ of the distance values, specifically:
Differentiating $FD(r_{o,q})$ with respect to the distance r gives the probability density function $fd_{oq}$ of the distance;
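Sub-step 3-1) can be illustrated with a small standard-library sketch. Several points are assumptions, because the text leaves them open: m is taken to be the number of sampled instances per data point, the fitted values are cumulative fractions at the upper edge of each equal-width interval, and the polynomial degree is chosen by the caller. All function names are hypothetical, and the distances are assumed not to be all identical.

```python
import math
from itertools import product


def pairwise_distances(samples_o, samples_q):
    """All m*m Euclidean distances between sampled instances, ascending."""
    return sorted(math.dist(a, b) for a, b in product(samples_o, samples_q))


def empirical_cdf(dists, bins):
    """Equal-width binning of sorted distances; cumulative fraction per bin edge."""
    lo, hi = dists[0], dists[-1]
    width = (hi - lo) / bins                  # assumes hi > lo
    edges = [lo + width * (i + 1) for i in range(bins)]
    counts = [0] * bins
    for d in dists:
        counts[min(int((d - lo) / width), bins - 1)] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / len(dists))
    return edges, cdf


def polyfit(xs, ys, degree):
    """Least-squares polynomial via normal equations; coefficients ascending."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for row in range(col + 1, n):
            f = ata[row][col] / ata[col][col]
            for j in range(col, n):
                ata[row][j] -= f * ata[col][j]
            aty[row] -= f * aty[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        s = sum(ata[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aty[i] - s) / ata[i][i]
    return coeffs


def polyder(coeffs):
    """Derivative of a polynomial given by ascending coefficients."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def polyval(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Illustrative samples only (m = 3 instances per data point).
samples_o = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
samples_q = [(1.0, 1.0), (1.3, 0.8), (0.9, 1.4)]
dists = pairwise_distances(samples_o, samples_q)
edges, cdf = empirical_cdf(dists, bins=4)
pol_f = polyfit(edges, cdf, degree=2)         # stands in for polF, i.e. FD(r)
fd_oq = polyder(pol_f)                        # density as the derivative of FD
print(polyval(fd_oq, edges[1]))
```

A fitted polynomial is not guaranteed to be monotone or to stay within [0, 1], so in practice the degree and the number of intervals would need to be chosen with care.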
3-2) Compute the reachability distance $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ of each data point, as follows:
where r is the distance between data points o and q.
Step 4: compute the reachability density of each data point o by the following formula:
where $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ is the reachability distance of the data point and $P_o(q)$ is the probability that data point q in the k-distance neighborhood is a neighbor of data point o.
Step 5: compute the outlier factor of each data point o and determine the outliers:
The outlier degree of each data point o in the uncertain data set D is expressed as a probability, computed by the following formula:
where $LOF_k(o)$ is the outlier degree of each data point o. Given a default outlier degree σ chosen by the user, if $LOF_k(o) > \sigma$, then data point o is an outlier.
The invention has been disclosed above by way of a preferred embodiment, which is not intended to limit the invention. All technical solutions obtained by equivalent replacement or equivalent transformation fall within the protection scope of the invention.

Claims (6)

1. An outlier detection method based on an uncertain data set, characterized by comprising the following steps:
Step 1: compute the k-distance and the k-distance neighborhood of each data point o in the uncertain data set D;
Step 2: compute the probability that a data point q in the k-distance neighborhood is a neighbor of data point o;
Step 3: compute the reachability distance from each data point q in the k-distance neighborhood to data point o, and the corresponding probability density function;
Step 4: compute the reachability density of each data point o;
Step 5: compute the outlier factor of each data point o and determine the outliers.
2. The outlier detection method based on an uncertain data set according to claim 1, characterized in that step 1 comprises the following sub-steps:
1-1) Formalize the data set;
The uncertain data set D is expressed as $D = \{o_1, o_2, \ldots, o_i, \ldots, o_n\}$, where n is the size of D and $o_i$ is a data point in the data set. Each data point has d dimensions, that is, d attribute values, and each attribute is associated with a probability density function and a cumulative distribution function. The data point $o_i$ is then expressed as:
1-2) Determine the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ of data point o;
The k-distance is, for each data point o in the uncertain data set D, the smallest distance value such that the probability of at least k nearest neighbors lying within that distance is not less than $\varepsilon$. It is denoted $\mathrm{k\_dist}_{\varepsilon}(o)$, where k is a positive integer and $\varepsilon \in (0, 1]$;
1-3) Define the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$ of data point o;
The k-distance neighborhood is the set of points in the uncertain data set D whose minimum distance to data point o is less than $\mathrm{k\_dist}_{\varepsilon}(o)$. It is denoted $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$:
$N_{\mathrm{k\_dist}_{\varepsilon}}(o) = \{\, q \mid \mathrm{min\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) \,\}$,
where $\mathrm{min\_dist}(q, o)$ is the minimum separation between the distribution range of a data point q in the k-distance neighborhood and the distribution range of data point o;
1-4) Compute the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ with an iterative algorithm:
Let $p_o(k\_d)$ denote the probability that data point o has at least k neighbors within distance $k\_d \in (0, R_{max}]$; when $k\_d = \mathrm{k\_dist}_{\varepsilon}$, $p_o(k\_d) = \varepsilon$. $R_{max}$ is the maximum distance between two data points in the uncertain data set;
A) Initialization:
The minimum distance between two data points in the uncertain data set is 0, denoted low, and the maximum distance between two data points is $R_{max}$, denoted up. The distance $k\_d$ between two data points therefore ranges over (low, up]. Take the midpoint as the initial value of $k\_d$, that is, $k\_d = (low + up)/2$, and initialize the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$ to the empty set $\Phi$:
$N_{\mathrm{k\_dist}_{\varepsilon}}(o) = \Phi;\quad low = 0;\quad up = R_{max};\quad k\_d = (low + up)/2$
B) Compute the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$:
While $|p_o(k\_d) - \varepsilon| \ge \delta$: if $p_o(k\_d) < \varepsilon$, set $low = k\_d$; otherwise set $up = k\_d$. Then set $k\_d = (low + up)/2$. Repeat until $|p_o(k\_d) - \varepsilon| < \delta$ holds, which gives $\mathrm{k\_dist}_{\varepsilon}(o) = k\_d$;
C) Compute the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$:
Let:
For any data point p in the uncertain data set other than data point o, if $\mathrm{max\_dist}(p, o) < \mathrm{k\_dist}_{\varepsilon}(o)$, then $N_{\mathrm{k\_dist}_{\varepsilon}}(o) = N_{\mathrm{k\_dist}_{\varepsilon}}(o) \cup \{p\}$, that is, p is inserted into the k-distance neighborhood $N_{\mathrm{k\_dist}_{\varepsilon}}(o)$, where $\mathrm{max\_dist}(p, o)$ is the maximum separation between the distribution range of data point p and the distribution range of data point o.
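3. The outlier detection method based on an uncertain data set according to claim 1, characterized in that, in step 2, the probability $P_o(q)$ that a data point q in the k-distance neighborhood is a neighbor of data point o is computed as follows:
If the minimum distance $\mathrm{min\_dist}(q, o)$ between q and o is greater than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = 0$;
If the maximum distance $\mathrm{max\_dist}(q, o)$ between q and o is less than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = 1$;
If the maximum distance $\mathrm{max\_dist}(q, o)$ between q and o is greater than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$ and the minimum distance $\mathrm{min\_dist}(q, o)$ between q and o is less than the k-distance $\mathrm{k\_dist}_{\varepsilon}(o)$, then $P_o(q) = F_{o,q}(\mathrm{k\_dist}_{\varepsilon})$, where $F_{o,q}$ is the cumulative distribution function of o and q;
Specifically:
$$P_o(q) = \begin{cases} 0, & \mathrm{min\_dist}(q, o) > \mathrm{k\_dist}_{\varepsilon}(o) \\ 1, & \mathrm{max\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) \\ F_{o,q}(\mathrm{k\_dist}_{\varepsilon}), & \mathrm{min\_dist}(q, o) < \mathrm{k\_dist}_{\varepsilon}(o) < \mathrm{max\_dist}(q, o) \end{cases}$$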
4. The outlier detection method based on an uncertain data set according to claim 1, characterized in that step 3 comprises the following sub-steps:
3-1) Compute the probability density function $fd_{oq}$;
For any two different data points o, q ∈ D there are m*m different distance values between them. The m*m distance values are sorted in ascending order and divided into equal-width intervals, and the number of distance values falling in each interval is counted. A polynomial polF(·) fitted by least squares represents the distribution function $FD(r_{o,q})$ of the distance values, specifically:
Differentiating $FD(r_{o,q})$ with respect to the distance r gives the probability density function $fd_{oq}$ of the distance;
3-2) Compute the reachability distance $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ of each data point, as follows:
where r is the distance between data points o and q.
5. The outlier detection method based on an uncertain data set according to claim 1, characterized in that, in step 4, the reachability density $lrd_k(o)$ of the data point is computed by the following formula:
where $RD_{\mathrm{k\_dist}_{\varepsilon}}(o, q)$ is the reachability distance of the data point and $P_o(q)$ is the probability that data point q in the k-distance neighborhood is a neighbor of data point o.
6. The outlier detection method based on an uncertain data set according to claim 1, characterized in that, in step 5, the outlier factor of each data point o is computed;
The outlier degree of each data point o in the uncertain data set D is expressed as a probability, computed by the following formula:
where $LOF_k(o)$ is the outlier degree of each data point o. Given a default outlier degree σ chosen by the user, if $LOF_k(o) > \sigma$, then data point o is an outlier.
CN201510676188.6A 2015-10-19 2015-10-19 Outlier detection method based on uncertain data set Pending CN105373806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510676188.6A CN105373806A (en) 2015-10-19 2015-10-19 Outlier detection method based on uncertain data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510676188.6A CN105373806A (en) 2015-10-19 2015-10-19 Outlier detection method based on uncertain data set

Publications (1)

Publication Number Publication Date
CN105373806A true CN105373806A (en) 2016-03-02

Family

ID=55375987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510676188.6A Pending CN105373806A (en) 2015-10-19 2015-10-19 Outlier detection method based on uncertain data set

Country Status (1)

Country Link
CN (1) CN105373806A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975992A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on adaptive upsampling
CN108710796A (en) * 2018-05-15 2018-10-26 广东工业大学 Invasion operation detection method, device, equipment and computer readable storage medium
CN109086291A (en) * 2018-06-09 2018-12-25 西安电子科技大学 A kind of parallel method for detecting abnormality and system based on MapReduce
CN110207827A (en) * 2019-05-23 2019-09-06 浙江大学 A kind of electrical equipment temperature real time early warning method extracted based on Outlier factor

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975992A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on adaptive upsampling
CN108710796A (en) * 2018-05-15 2018-10-26 广东工业大学 Invasion operation detection method, device, equipment and computer readable storage medium
CN108710796B (en) * 2018-05-15 2021-07-06 广东工业大学 Intrusion operation detection method, device, equipment and computer readable storage medium
CN109086291A (en) * 2018-06-09 2018-12-25 西安电子科技大学 A kind of parallel method for detecting abnormality and system based on MapReduce
CN110207827A (en) * 2019-05-23 2019-09-06 浙江大学 A kind of electrical equipment temperature real time early warning method extracted based on Outlier factor
CN110207827B (en) * 2019-05-23 2020-05-08 浙江大学 Electrical equipment temperature real-time early warning method based on abnormal factor extraction

Similar Documents

Publication Publication Date Title
CN105373806A (en) Outlier detection method based on uncertain data set
CN104081435A (en) Image matching method based on cascading binary encoding
CN105844102B (en) One kind is adaptively without ginseng Spatial Outlier Detection method
AU2013372618B2 (en) Character recognition method
US20170069077A1 (en) System and method for determining whether a product image includes a logo pattern
CN104463865A (en) Human image segmenting method
CN103916860B (en) Outlier data detection method based on space time correlation in wireless senser cluster l network
US9075476B2 (en) Touch sensing methods comprising comparison of a detected profile form of a sensing curve with predetermined profile forms and devices thereof
CN103839228A (en) Data rarefying and smooth processing method based on vector map
Li et al. High-speed Sigma-gating SMC-PHD filter
CN104135732B (en) Wireless sensor network covers the computational methods of cyst areas
CN103778436A (en) Pedestrian gesture inspecting method based on image processing
CN102945551A (en) Graph theory based three-dimensional point cloud data plane extracting method
Zou et al. WinIPS: WiFi-based non-intrusive IPS for online radio map construction
CN103646242B (en) Extended target tracking based on maximum stable extremal region feature
CN105574864B (en) The self-adaptive angular-point detection method to be added up based on angle
Jiang et al. CONSEL: Connectivity-based segmentation in large-scale 2D/3D sensor networks
CN106558054B (en) A kind of ridge line extracting method based on watershed
Soundarya et al. An efficient algorithm for coverage hole detection and healing in wireless sensor networks
CN104680523A (en) Multi-modal region-consistent significance object detection method based on foreground and background priori
Zhang et al. AOA based trust evaluation scheme for Sybil attack detection in WSN
CN104021564A (en) Adaptive mean shift algorithm based on local invariant feature detection
CN106022212B (en) A kind of gyro Temperature Drift Modeling
Jing et al. Boundary detection method for large-scale coverage holes in wireless sensor network based on minimum critical threshold constraint
Feng et al. Multihop localisation with distance estimation bias for 3D wireless sensor networks

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160302
