CN105138641A - Angle-based high dimensional data outlier detection method - Google Patents
- Publication number
- CN105138641A (application CN201510524427.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an angle-based outlier detection method for high-dimensional data and belongs to the technical field of outlier data mining. The method comprises the following steps: (1) for each data point A in a data set D, obtain the k nearest neighbours of A in D; (2) calculate the angle-based outlier factor of each data point; (3) sort the outlier factors of the data points and select the set of points with the smallest outlier factors as the set of points with the highest degree of outlierness; (4) determine the outlier data. The method can find outliers hidden in large-scale high-dimensional data efficiently and quickly, effectively overcomes the curse of dimensionality that affects outlier detection methods based on high-dimensional distance, nearest neighbours and the like, and can be widely applied to high-dimensional data in credit card fraud detection, traffic accident detection, anomaly detection in scientific measurement data, and similar tasks.
Description
Technical field
The present invention relates to an angle-based outlier detection method for high-dimensional data and belongs to the technical field of outlier data mining.
Background art
Outlier data mining is one of the current research hotspots in data mining and is widely used in fields such as network traffic intrusion detection, traffic accident detection, and anomaly detection in scientific measurement data. Existing outlier mining is mainly based on the concepts of distance or nearest neighbours. In high-dimensional data, however, distance and nearest neighbours no longer have the properties they have in Euclidean space, so distance-based methods suffer from the curse of dimensionality. In high-dimensional data, an outlier lies far from the other data points, so the angles between the vectors from the outlier to the other points vary little. A non-outlier is surrounded by data points, so the angles between the vectors from it to the other points vary widely. Outliers hidden in high-dimensional data can therefore be found from the variance of these angles.
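The angle argument above can be illustrated with a small numerical sketch. It is not taken from the patent: the synthetic data, the helper names `angle` and `angle_variance`, and the use of the plain angle in radians are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def angle(a, b, c):
    """Angle in radians at point a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm = math.sqrt(sum(p * p for p in ab)) * math.sqrt(sum(q * q for q in ac))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def angle_variance(a, others):
    """Variance of the angle at a over all unordered pairs drawn from others."""
    angles = [angle(a, b, c) for b, c in combinations(others, 2)]
    mean = sum(angles) / len(angles)
    return sum((t - mean) ** 2 for t in angles) / len(angles)


# Hypothetical data: one Gaussian cluster plus a single distant point.
random.seed(0)
dim = 20
cluster = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(40)]
outlier = [100.0] + [0.0] * (dim - 1)  # far from the cluster along one axis

inlier_var = angle_variance(cluster[0], cluster[1:])
outlier_var = angle_variance(outlier, cluster)
print(inlier_var, outlier_var)
```

Seen from the distant point, every other point lies in nearly the same direction, so its angle variance comes out far smaller than that of a point inside the cluster.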
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide an angle-based outlier detection method for high-dimensional data. The invention can find outliers hidden in large-scale high-dimensional data efficiently and quickly, and can be widely applied to high-dimensional data in credit card fraud detection, traffic accident detection, anomaly detection in scientific measurement data, and similar tasks.
The technical scheme of the present invention is an angle-based outlier detection method for high-dimensional data, characterised in that it comprises the following steps:
(1) In the data set D, for each data point A ∈ D, obtain the k nearest neighbours of A;
(2) Calculate the angle-based outlier factor of each data point: for each data point A, calculate the variance of the angle between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;
(3) Sort the outlier factors of the data points in ascending order to obtain the outlier factor sequence L, and select the set of points with the smallest outlier factors as the set of points with the highest degree of outlierness. The selection method is: divide the average-spacing sequence L into two classes $C_A$ and $C_B$ by comparing consecutive data in L in turn according to the classification algorithm; if the change in value is less than a threshold ε, that datum and all data after it are assigned to class $C_A$, where ε is set by the user. That is,
$C_A = \Phi,\quad C_B = L$
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average-spacing sequence L and Φ denotes the empty set;
(4) Determine the outlier data: check the class $C_A$ obtained in step (3); if the number of data in $C_A$ is greater than a threshold δ, no outlier is detected in the large-scale high-dimensional data; otherwise the points corresponding to all data in $C_A$ are outliers, where δ is set by the user.
The aforesaid angle-based outlier detection method for high-dimensional data is characterised in that step (1) comprises the following steps:
1-1) Formalise the data set. The high-dimensional data are formalised as follows:
For a given high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $R^d \times R^d \to R$. For points A, B ∈ D, $\overline{AB}$ denotes the vector from A to B. Here $R^d$ denotes the d-dimensional real space, $R^+$ denotes the positive real numbers, $R^d \to R^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $R^d \times R^d \to R$ denotes the inner product operation on two vectors of the d-dimensional real space;
1-2) For a point A ∈ D in the given high-dimensional data set, use the hypersphere search method to obtain the k nearest neighbours of A, denoted as the point set $N_k(A) \subseteq D$.
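The aforesaid angle-based outlier detection method for high-dimensional data is characterised in that step (2) comprises the following steps:
2-1) Define the outlier factor AOF(A):
The outlier factor AOF(A) of A is defined as the variance of the angle between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two points B and C. Specifically:
$A \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$.
Then $AOF(A) = \mathrm{Var}(\varphi_{BAC})$,
where Var denotes the variance of the angle $\varphi_{BAC}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors, and $E\varphi_{BAC}$ denotes the mathematical expectation of the angle $\varphi_{BAC}$;
2-2) Using the k nearest neighbours $N_k(A) \subseteq D$ of A obtained in step 1-2), the outlier factor AOF(A) of 2-1) is described as the outlier factor based on the k nearest neighbours of A, namely:
$N_k(A) \subseteq D$, $B, C \in N_k(A)$, $AOF(A) = \mathrm{Var}(\varphi_{BAC})$,
where Var denotes the variance of the angle $\varphi_{BAC}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors, and $E\varphi_{BAC}$ denotes the mathematical expectation of the angle $\varphi_{BAC}$.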
Beneficial effects of the present invention: the invention can find outliers hidden in large-scale high-dimensional data efficiently and quickly. Because its outlier factor is based on vector angles, it effectively overcomes the curse of dimensionality that affects outlier detection methods based on high-dimensional distance, nearest neighbours and the like. The invention can be widely applied to high-dimensional data in credit card fraud detection, traffic accident detection, anomaly detection in scientific measurement data, and similar tasks.
Brief description of the drawings
Fig. 1 is a flowchart of the angle-based outlier detection method for high-dimensional data of the present invention.
Embodiment
The invention is further described below with reference to the accompanying drawing. The following embodiment is intended only to illustrate the technical scheme of the invention more clearly and does not limit its scope of protection.
As shown in Fig. 1, an angle-based outlier detection method for high-dimensional data comprises the following steps:
1) In the data set D, for each data point A ∈ D, obtain the k nearest neighbours of A;
Obtaining the k nearest neighbours of each data point requires a formal description of the high-dimensional data and a method for computing the k nearest neighbours, as follows:
1-1) Formalise the data set. The high-dimensional data are formalised as follows:
For a given high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $R^d \times R^d \to R$. For points A, B ∈ D, $\overline{AB}$ denotes the vector from A to B. Here $R^d$ denotes the d-dimensional real space, $R^+$ denotes the positive real numbers, $R^d \to R^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $R^d \times R^d \to R$ denotes the inner product operation on two vectors of the d-dimensional real space;
1-2) For a point A ∈ D in the given high-dimensional data set, obtain the k nearest neighbours of A, denoted as the point set $N_k(A) \subseteq D$. The method is: use the hypersphere search method to obtain the k nearest neighbours.
The basic idea of the hypersphere search method is to divide the high-dimensional space into a number of hypercubes of equal volume, called primitive hypercubes, and to encode them in sequence. The search is then carried out in a hypersphere centred on A (covered by several primitive hypercubes), and the radius of the hypersphere is enlarged gradually until the hypersphere contains k samples. The k nearest neighbours within this hypersphere are the k nearest neighbours in the whole space. By pre-organising the feature space, the method allows classification to be carried out within the hypersphere centred on A. The radius of the hypersphere is increased gradually from zero until the hypersphere contains at least k pattern samples. The hypersphere search method has two stages: the first is the organisation stage, in which the pattern space is partitioned and encoded; the second is the search and decision stage, which finds the k nearest neighbours $N_k(A) \subseteq D$ of the sample to be recognised.
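The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `build_grid` and `knn_hypersphere`, the cell-size parameter `cell`, the use of integer tuples as hypercube codes, and the choice to grow the radius in steps of one cell width are all hypothetical. For simplicity the search loops over occupied cells instead of enumerating every cell around A.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Organisation stage: map each hypercube code (tuple of ints) to point indices."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(x / cell) for x in p)].append(idx)
    return grid


def knn_hypersphere(points, grid, cell, a_idx, k):
    """Search stage: grow a hypersphere around points[a_idx] until it holds k samples."""
    if k > len(points) - 1:
        raise ValueError("k must be smaller than the number of points")
    a = points[a_idx]
    home = tuple(math.floor(x / cell) for x in a)
    r = 0
    while True:
        r += 1
        radius = r * cell
        inside = []
        for code, members in grid.items():
            # Cells within r steps of the home cell on every axis cover the
            # whole ball of this radius, so all other cells can be skipped.
            if max(abs(c - h) for c, h in zip(code, home)) > r:
                continue
            for idx in members:
                if idx == a_idx:
                    continue
                dist = math.dist(a, points[idx])
                if dist <= radius:
                    inside.append((dist, idx))
        if len(inside) >= k:
            inside.sort()
            return [idx for _, idx in inside[:k]]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (0.5, 0.5)]
grid = build_grid(points, cell=1.0)
print(knn_hypersphere(points, grid, 1.0, a_idx=0, k=2))  # prints [4, 1]
```

Once the ball holds at least k samples, every point outside it is farther from A than every point inside, so the k closest points inside the ball are the k nearest neighbours in the whole data set.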
2) Calculate the outlier factor of each data point A. This requires a formal definition of the outlier factor and a method for computing it from the k nearest neighbours, as follows:
2-1) Define the outlier factor AOF(A):
The outlier factor AOF(A) of A is defined as the variance of the angle between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two points B and C. Specifically:
$A \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$.
Then $AOF(A) = \mathrm{Var}(\varphi_{BAC})$,
where Var denotes the variance of the angle $\varphi_{BAC}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors, and $E\varphi_{BAC}$ denotes the mathematical expectation of the angle $\varphi_{BAC}$.
2-2) Using the k nearest neighbours $N_k(A) \subseteq D$ of A obtained in step 1-2), the outlier factor AOF(A) of 2-1) can be described as the outlier factor based on the k nearest neighbours of A, namely:
$N_k(A) \subseteq D$, $B, C \in N_k(A)$, $AOF(A) = \mathrm{Var}(\varphi_{BAC})$,
where Var denotes the variance of the angle $\varphi_{BAC}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors, and $E\varphi_{BAC}$ denotes the mathematical expectation of the angle $\varphi_{BAC}$.
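A possible way to compute the neighbour-based outlier factor of step 2-2) is sketched below. It assumes that the angle is the plain angle in radians obtained from the inner product and the norms, and that the variance is taken as E[φ²] − (E[φ])² over all unordered pairs B, C in N_k(A). The function names `angle_at` and `aof` are hypothetical.

```python
import math
from itertools import combinations


def angle_at(a, b, c):
    """Angle phi_BAC in radians between the vectors AB and AC."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))              # <AB, AC>
    denom = math.sqrt(sum(x * x for x in ab)) * math.sqrt(sum(y * y for y in ac))
    # Clamp the cosine to [-1, 1] so rounding errors cannot break acos.
    return math.acos(max(-1.0, min(1.0, dot / denom)))


def aof(a, neighbours):
    """Var(phi_BAC) over all unordered pairs B, C taken from the neighbours of A."""
    angles = [angle_at(a, b, c) for b, c in combinations(neighbours, 2)]
    mean = sum(angles) / len(angles)                      # E[phi]
    mean_sq = sum(t * t for t in angles) / len(angles)    # E[phi^2]
    return max(0.0, mean_sq - mean * mean)


# A point surrounded by its neighbours versus a point that sees them from afar.
surrounded = aof((0.0, 0.0), [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)])
distant = aof((10.0, 0.0), [(0.0, 0.0), (0.0, 0.1), (0.0, -0.1)])
print(surrounded, distant)
```

In this toy case the surrounded point sees angles of π/2, π and π/2, giving a variance of π²/18, while the distant point sees three nearly identical directions and gets a value close to zero.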
3) Sort the outlier factors of the data points and select the set of points with the smallest outlier factors as the set of points with the highest degree of outlierness. This comprises the following steps:
3-1) Sort the average cosine-distance spacing of all points in step 3) in ascending order to obtain the average-spacing sequence L. Because the average spacing of outliers in high-dimensional data is small, the sequence L has the following feature: a small portion of the data have small values, while most of the other data have larger values;
3-2) Divide the data sequence L into two classes $C_A$ and $C_B$, where $C_A$ is the class with smaller values and $C_B$ is the class with larger values.
The classification algorithm is: compare consecutive data in the sequence L in turn; if the change in value is less than a threshold ε, that datum and all data after it are assigned to class $C_A$, where ε can be set by the user. That is,
$C_A = \Phi,\quad C_B = L$
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average-spacing sequence L and Φ denotes the empty set.
4) Determine the outliers, as follows:
Check the class $C_A$ obtained in step 3). If the number of data in $C_A$ is greater than a threshold δ, no outlier is detected in the large-scale high-dimensional data; otherwise the points corresponding to all data in $C_A$ are outliers, where δ can be set by the user.
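Steps 3) and 4) can be sketched together as follows. The text leaves some room for interpretation, so this sketch adopts one reading as an explicit assumption: the ascending sequence is cut at the first gap between consecutive values that reaches ε, the values before the cut form $C_A$ (the smaller values), and the rest form $C_B$. The function name `detect_outliers` and the dictionary input are hypothetical.

```python
def detect_outliers(aof_by_point, eps, delta):
    """Return the labels of the points judged to be outliers.

    aof_by_point maps a point label to its outlier factor.
    eps is the user-set gap threshold, delta the user-set size threshold.
    """
    # Step 3-1: sort the outlier factors in ascending order.
    ranked = sorted(aof_by_point.items(), key=lambda item: item[1])
    values = [value for _, value in ranked]

    # Step 3-2 (assumed reading): cut at the first gap of at least eps.
    cut = len(values)  # no large gap found: everything stays in C_A
    for i in range(len(values) - 1):
        if values[i + 1] - values[i] >= eps:
            cut = i + 1
            break
    c_a = ranked[:cut]

    # Step 4: a C_A larger than delta means no outlier was detected.
    if len(c_a) > delta:
        return []
    return [label for label, _ in c_a]


scores = {"p0": 0.40, "p1": 0.42, "p2": 0.001, "p3": 0.45, "p4": 0.002}
print(detect_outliers(scores, eps=0.1, delta=2))  # prints ['p2', 'p4']
```

Under this reading, a sequence with no gap of at least ε puts every value in $C_A$, so the size check in step 4) reports that no outlier was found.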
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make improvements and variations without departing from the technical principle of the invention, and such improvements and variations shall also fall within the scope of protection of the invention.
Claims (3)
1. An angle-based outlier detection method for high-dimensional data, characterised in that it comprises the following steps:
(1) In the data set D, for each data point A ∈ D, obtain the k nearest neighbours of A;
(2) Calculate the angle-based outlier factor of each data point: for each data point A, calculate the variance of the angle between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;
(3) Sort the outlier factors of the data points in ascending order to obtain the outlier factor sequence L, and select the set of points with the smallest outlier factors as the set of points with the highest degree of outlierness. The selection method is: divide the average-spacing sequence L into two classes $C_A$ and $C_B$ by comparing consecutive data in L in turn according to the classification algorithm; if the change in value is less than a threshold ε, that datum and all data after it are assigned to class $C_A$, where ε is set by the user. That is,
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average-spacing sequence L and Φ denotes the empty set;
(4) Determine the outlier data: check the class $C_A$ obtained in step (3); if the number of data in $C_A$ is greater than a threshold δ, no outlier is detected in the large-scale high-dimensional data; otherwise the points corresponding to all data in $C_A$ are outliers, where δ is set by the user.
2. The angle-based outlier detection method for high-dimensional data according to claim 1, characterised in that step (1) comprises the following steps:
1-1) Formalise the data set. The high-dimensional data are formalised as follows:
For a given high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $R^d \times R^d \to R$. For points A, B ∈ D, $\overline{AB}$ denotes the vector from A to B. Here $R^d$ denotes the d-dimensional real space, $R^+$ denotes the positive real numbers, $R^d \to R^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $R^d \times R^d \to R$ denotes the inner product operation on two vectors of the d-dimensional real space;
1-2) For a point A ∈ D in the given high-dimensional data set, use the hypersphere search method to obtain the k nearest neighbours of A, denoted as the point set $N_k(A) \subseteq D$.
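3. The angle-based outlier detection method for high-dimensional data according to claim 1, characterised in that step (2) comprises the following steps:
2-1) Define the outlier factor AOF(A):
The outlier factor AOF(A) of A is defined as the variance of the angle between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two points B and C. Specifically:
$A \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$.
Then $AOF(A) = \mathrm{Var}(\varphi_{BAC})$,
where Var denotes the variance of the angle $\varphi_{BAC}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors, and $E\varphi_{BAC}$ denotes the mathematical expectation of the angle $\varphi_{BAC}$;
2-2) Using the k nearest neighbours $N_k(A) \subseteq D$ of A obtained in step 1-2), the outlier factor AOF(A) of 2-1) is described as the outlier factor based on the k nearest neighbours of A, namely:
$N_k(A) \subseteq D$, $B, C \in N_k(A)$, $AOF(A) = \mathrm{Var}(\varphi_{BAC})$,
where Var denotes the variance of the angle $\varphi_{BAC}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors, and $E\varphi_{BAC}$ denotes the mathematical expectation of the angle $\varphi_{BAC}$.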
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510524427.6A CN105138641A (en) | 2015-08-24 | 2015-08-24 | Angle-based high dimensional data outlier detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510524427.6A CN105138641A (en) | 2015-08-24 | 2015-08-24 | Angle-based high dimensional data outlier detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105138641A (en) | 2015-12-09 |
Family
ID=54723989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510524427.6A Pending CN105138641A (en) | 2015-08-24 | 2015-08-24 | Angle-based high dimensional data outlier detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105138641A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107426680A (en) * | 2017-08-06 | 2017-12-01 | 深圳市益鑫智能科技有限公司 | Towards the wireless sensor network data collection system of building monitoring |
CN107786368A (en) * | 2016-08-31 | 2018-03-09 | 华为技术有限公司 | Detection of anomaly node method and relevant apparatus |
CN109360099A (en) * | 2018-10-22 | 2019-02-19 | 广东工业大学 | A kind of anti-fraud method of finance based on k- nearest neighbor algorithm |
CN110349662A (en) * | 2019-05-23 | 2019-10-18 | 复旦大学 | The outliers across image collection that result is accidentally surveyed for filtering pulmonary masses find method and system |
WO2022141746A1 (en) * | 2020-12-30 | 2022-07-07 | 佛山科学技术学院 | Method for detecting anomaly in water quality and electronic device |
-
2015
- 2015-08-24 CN CN201510524427.6A patent/CN105138641A/en active Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107786368A (en) * | 2016-08-31 | 2018-03-09 | 华为技术有限公司 | Detection of anomaly node method and relevant apparatus |
CN107426680A (en) * | 2017-08-06 | 2017-12-01 | 深圳市益鑫智能科技有限公司 | Towards the wireless sensor network data collection system of building monitoring |
CN107426680B (en) * | 2017-08-06 | 2018-10-09 | 广州迈傲信息科技有限公司 | Wireless sensor network data collection system towards building monitoring |
CN109360099A (en) * | 2018-10-22 | 2019-02-19 | 广东工业大学 | A kind of anti-fraud method of finance based on k- nearest neighbor algorithm |
CN110349662A (en) * | 2019-05-23 | 2019-10-18 | 复旦大学 | The outliers across image collection that result is accidentally surveyed for filtering pulmonary masses find method and system |
CN110349662B (en) * | 2019-05-23 | 2023-01-13 | 复旦大学 | Cross-image set outlier sample discovery method and system for filtering lung mass misdetection results |
WO2022141746A1 (en) * | 2020-12-30 | 2022-07-07 | 佛山科学技术学院 | Method for detecting anomaly in water quality and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20151209 |