CN105160347A - Method for detecting outlier data of large-scale high dimension data - Google Patents

Method for detecting outlier data of large-scale high dimension data

Info

Publication number
CN105160347A
Authority
CN
China
Prior art keywords
data
outlier
overbar
high dimensional
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510393861.5A
Other languages
Chinese (zh)
Inventor
刘文婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201510393861.5A
Publication of CN105160347A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting outlier data in large-scale high-dimensional data and belongs to the technical field of outlier data mining. The method comprises the following steps: (1) calculating the mean cosine distance of each data point; (2) calculating the cosine distances of each data point; (3) calculating the cosine distance average spacing of each data point; (4) classifying the cosine distance average spacings and selecting the points with the smallest average spacing as the points with the highest outlier degree; and (5) determining the outlier data. The method can quickly and efficiently discover outlier data hidden in large-scale high-dimensional data.

Description

Method for detecting outlier data in large-scale high-dimensional data
Technical field
The present invention relates to the technical field of outlier data mining, and in particular to a method for detecting outlier data in large-scale high-dimensional data.
Background art
Outlier data mining is one of the current research hotspots in data mining and is widely used in fields such as network traffic intrusion detection, credit card fraud detection, and abnormal behavior detection in video surveillance. Existing outlier mining methods rely mainly on the concepts of distance or nearest neighbors. In high-dimensional data, however, if the neighborhood of a data point is examined using distances or nearest neighbors in the high-dimensional space, most of the data end up being judged as outliers. If detection is instead based on the cosine distance between vectors, the outliers hidden in high-dimensional data can be found: the angles between the vectors that an outlier forms with the other points vary little, whereas a non-outlier is surrounded by data points, so the angles between the vectors that it forms with the other points vary considerably. Outliers hidden in high-dimensional data can therefore be found according to how much these angles vary.
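The angle argument in the background can be illustrated with a small sketch. It is not part of the patent: the toy data (a 3x3 grid cluster plus one distant point), the two-dimensional setting, and the helper names `cosine_at` and `cosine_spread` are assumptions chosen only to make the effect visible, and the Euclidean norm is assumed.

```python
import math
from itertools import combinations


def cosine_at(a, b, c):
    """Cosine of the angle at point a between the vectors a->b and a->c."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))
    norm_ab = math.sqrt(sum(x * x for x in ab))
    norm_ac = math.sqrt(sum(x * x for x in ac))
    return dot / (norm_ab * norm_ac)


def cosine_spread(a, points):
    """Range (max minus min) of the cosines seen from a over all pairs of other points."""
    others = [p for p in points if p != a]
    cosines = [cosine_at(a, b, c) for b, c in combinations(others, 2)]
    return max(cosines) - min(cosines)


# Toy data (an assumption for illustration): a 3x3 grid cluster plus one distant point.
cluster = [(float(x), float(y)) for x in range(3) for y in range(3)]
far_point = (10.0, 10.0)
points = cluster + [far_point]

spread_inside = cosine_spread((1.0, 1.0), points)  # point in the middle of the cluster
spread_far = cosine_spread(far_point, points)      # the distant point
print(spread_inside, spread_far)
```

In this toy set the distant point sees every other point in nearly the same direction, so its cosines barely vary, while the point in the middle of the cluster sees neighbors in all directions.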
Summary of the invention
The present invention proposes a method for detecting outlier data in large-scale high-dimensional data. The method can quickly and efficiently find the outlier data hidden in large-scale high-dimensional data and can be widely applied to high-dimensional data in credit card fraud detection, abnormal behavior detection in video surveillance, network traffic intrusion detection, and the like.
To achieve the above object, the technical solution adopted by the present invention is as follows:
A method for detecting outlier data in large-scale high-dimensional data, comprising the following steps:
(1) calculating the mean cosine distance of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ that A forms with every pair of other points B and C;
(2) calculating the cosine distances of each data point A;
(3) calculating the average spacing of all cosine distances of each data point A;
(4) classifying the cosine distance average spacings, and selecting the points with the smallest cosine distance average spacing as the outliers with the highest outlier degree;
(5) determining the outliers.
The aforesaid step (1) comprises the following steps:
1-1) formalizing the data set, the large-scale high-dimensional data being formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $R^d \times R^d \to R$;
for points $A, B \in D$, $\overline{AB}$ denotes the vector from $A$ to $B$;
where $R^d$ denotes the d-dimensional real space, $R^+$ denotes the positive real numbers, $R^d \to R^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $R^d \times R^d \to R$ denotes the inner product operation on two vectors of the d-dimensional real space;
1-2) for all points in the large-scale high-dimensional data set $D$, calculating, for each point $A$, the sum of the vector-angle cosine distances to every two other points, denoted $M_\theta(A)$, by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$M_\theta(A) = \sum_{A \in D,\; B \in D \setminus \{A\},\; C \in D \setminus \{A, B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|_2 \cdot \|\overline{AC}\|_2}$$
where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|_2$ and $\|\overline{AC}\|_2$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$ respectively;
1-3) calculating the mean $\overline{M_\theta(A)}$ of the cosine distances of each point $A$ in the large-scale high-dimensional data set $D$ by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$\overline{M_\theta(A)} = \frac{M_\theta(A)}{\frac{1}{2}(n-1)(n-2)} = \frac{2 M_\theta(A)}{(n-1)(n-2)},$$
where $n$ denotes the number of data points in the large-scale high-dimensional data set $D$.
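Step (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the norm is assumed to be Euclidean, the data points are assumed to be distinct, and the sum is taken over unordered pairs {B, C}, which is the reading that matches the denominator (n-1)(n-2)/2.

```python
import math
from itertools import combinations


def cosine_at(a, b, c):
    """Cosine of the angle at point a between the vectors a->b and a->c."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))
    norm_ab = math.sqrt(sum(x * x for x in ab))
    norm_ac = math.sqrt(sum(x * x for x in ac))
    return dot / (norm_ab * norm_ac)


def cosine_sum_and_mean(a, points):
    """Return (M_theta(A), mean of M_theta(A)) for point a; needs at least 3 points."""
    others = [p for p in points if p != a]
    n = len(points)
    # Assumption: each unordered pair {B, C} is counted once.
    total = sum(cosine_at(a, b, c) for b, c in combinations(others, 2))
    mean = 2.0 * total / ((n - 1) * (n - 2))
    return total, mean


# Toy example (assumed data): A at the origin with three neighbors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
total, mean = cosine_sum_and_mean(points[0], points)
print(total, mean)
```

For this toy set the three pairs give cosines 0, -1 and 0, so the sum is -1 and the mean is -1/3.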
The aforesaid step (2) calculates the cosine distances of the data point A, that is, for each data point A, the cosine distance $M_\theta(\overline{BAC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ that A forms with any two points B and C is calculated by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$M_\theta(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|_2 \cdot \|\overline{AC}\|_2}.$$
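A sketch of step (2) for a single point A is given below. The function `pair_cosines` and the four-dimensional toy points are hypothetical; the Euclidean norm and distinct points are assumed, and each unordered pair of other points is evaluated once.

```python
import math
from itertools import combinations


def cosine_at(a, b, c):
    """Cosine of the angle at point a between the vectors a->b and a->c."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))
    norm_ab = math.sqrt(sum(x * x for x in ab))
    norm_ac = math.sqrt(sum(x * x for x in ac))
    return dot / (norm_ab * norm_ac)


def pair_cosines(a, points):
    """Map each unordered index pair (i, j) of other points to the cosine at a."""
    others = [(i, p) for i, p in enumerate(points) if p != a]
    return {(i, j): cosine_at(a, b, c) for (i, b), (j, c) in combinations(others, 2)}


# Toy example (assumed data) in four dimensions, with A = points[0].
points = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0),
    (1.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 2.0, 0.0),
]
cosines = pair_cosines(points[0], points)
print(cosines)
```

Here the pair (1, 2) subtends a 45 degree angle at A, and the pairs involving point 3 are orthogonal.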
The aforesaid step (3) calculates the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A, that is, it accumulates the absolute values of the differences between each cosine distance $M_\theta(\overline{BAC})$ obtained in step (2) and the mean cosine distance $\overline{M_\theta(A)}$ obtained in step (1), by the formula:
$$\Delta M_\theta(A) = \sum_{B \in D \setminus \{A\},\; C \in D \setminus \{A, B\}} \left| M_\theta(\overline{BAC}) - \overline{M_\theta(A)} \right|.$$
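Step (3) can be sketched in the same way. The function `average_spacing` is a hypothetical name; as in the formula, the result is the accumulated absolute deviation from the mean rather than a normalized average. The Euclidean norm, distinct points, and summation over unordered pairs are assumptions.

```python
import math
from itertools import combinations


def cosine_at(a, b, c):
    """Cosine of the angle at point a between the vectors a->b and a->c."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))
    norm_ab = math.sqrt(sum(x * x for x in ab))
    norm_ac = math.sqrt(sum(x * x for x in ac))
    return dot / (norm_ab * norm_ac)


def average_spacing(a, points):
    """Delta M_theta(A): sum of |cosine - mean cosine| over all pairs of other points."""
    others = [p for p in points if p != a]
    cosines = [cosine_at(a, b, c) for b, c in combinations(others, 2)]
    mean = sum(cosines) / len(cosines)  # equals 2 * M_theta(A) / ((n - 1) * (n - 2))
    return sum(abs(value - mean) for value in cosines)


# Toy example (assumed data): cosines at the origin are 0, -1 and 0, mean -1/3.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
spacing = average_spacing(points[0], points)
print(spacing)
```

For the toy set the deviations are 1/3, 2/3 and 1/3, so the accumulated spacing is 4/3.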
The aforesaid step (4) comprises the following steps:
4-1) sorting the cosine distance average spacings of all points obtained in step (3) in ascending order to obtain the average spacing sequence L;
4-2) dividing the average spacing sequence L into two classes $C_A$ and $C_B$.
The classification algorithm is as follows: adjacent data in the average spacing sequence L are compared in turn, and if the change in value is greater than a threshold ε, that datum and all data after it are assigned to class $C_B$, where ε is determined by the user; that is,
$C_A = \Phi$, $C_B = L$;
if $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average spacing sequence L and Φ denotes the empty set.
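One possible reading of step (4) is sketched below. The update rules in the text leave the handling of the boundary element open, so this sketch makes explicit assumptions: the sequence is cut at the first adjacent gap that is not smaller than ε, everything before the cut goes to C_A, everything from the cut onward goes to C_B, and if no such gap exists all values stay in C_A. The function name `split_spacings` is hypothetical.

```python
def split_spacings(spacings, eps):
    """Sort the average spacings and split them at the first gap >= eps.

    Returns (c_a, c_b): c_a holds the small values before the first large gap,
    c_b holds the remaining values. With no large gap, c_b is empty.
    """
    ordered = sorted(spacings)  # step 4-1: ascending order
    for i in range(len(ordered) - 1):
        # Assumption: the boundary element ordered[i] stays in C_A.
        if abs(ordered[i + 1] - ordered[i]) >= eps:
            return ordered[: i + 1], ordered[i + 1:]
    return ordered, []


# Toy example (assumed values): two small spacings followed by a clear jump.
c_a, c_b = split_spacings([4.3, 0.03, 4.1, 0.02, 4.4], eps=1.0)
print(c_a, c_b)
```

With a sequence that has no jump of at least ε, every value lands in C_A, which is the case that step (5) filters out through the threshold δ.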
The aforesaid step (5) determines the outliers as follows:
The class $C_A$ obtained in step (4) is checked; if the number of data in $C_A$ is greater than a threshold δ, no outlier is detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in $C_A$ are outliers, where δ is set by the user.
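The decision rule of step (5) is short enough to state directly. In this sketch the function name `decide_outliers` is hypothetical, and C_A is assumed to be passed as a list of point identifiers.

```python
def decide_outliers(c_a, delta):
    """Return the members of C_A as outliers unless C_A has more than delta members."""
    if len(c_a) > delta:
        return []  # too many candidates: report that no outlier was detected
    return list(c_a)


# Toy example (assumed identifiers): at most two candidates are accepted as outliers.
accepted = decide_outliers([3, 7], delta=2)
rejected = decide_outliers([1, 2, 3], delta=2)
print(accepted, rejected)
```

The threshold δ thus acts as an upper bound on how many points may be reported as outliers at once.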
Compared with the prior art, the present invention has positive and obvious effects, with the following advantages:
The method for detecting outlier data in large-scale high-dimensional data provided by the present invention is based on the vector-angle cosine distance and effectively overcomes the "curse of dimensionality" that affects outlier detection methods based on high-dimensional distances and nearest neighbors. The invention can be widely applied to high-dimensional data in credit card fraud detection, abnormal behavior detection in video surveillance, network traffic intrusion detection, and the like.
Brief description of the drawings
Fig. 1 is a flowchart of the method for detecting outlier data in large-scale high-dimensional data according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing and an embodiment:
The method for detecting outlier data in large-scale high-dimensional data according to the present invention, as shown in Fig. 1, comprises the following steps:
1) calculating the mean cosine distance of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ that A forms with every pair of other points B and C;
To obtain the mean cosine distance of each data point, a formal description of the large-scale high-dimensional data and the methods for computing the vector-angle cosine distance and the mean cosine distance of a data point are required, as follows:
1-1) formalizing the data set, the large-scale high-dimensional data being formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $R^d \times R^d \to R$;
for points $A, B \in D$, $\overline{AB}$ denotes the vector from $A$ to $B$;
where $R^d$ denotes the d-dimensional real space, $R^+$ denotes the positive real numbers, $R^d \to R^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $R^d \times R^d \to R$ denotes the inner product operation on two vectors of the d-dimensional real space.
1-2) for all points in the large-scale high-dimensional data set $D$, calculating, for each point $A$, the sum of the vector-angle cosine distances to every two other points, denoted $M_\theta(A)$, by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$M_\theta(A) = \sum_{A \in D,\; B \in D \setminus \{A\},\; C \in D \setminus \{A, B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|_2 \cdot \|\overline{AC}\|_2}$$
where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|_2$ and $\|\overline{AC}\|_2$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$ respectively.
1-3) calculating the mean $\overline{M_\theta(A)}$ of the cosine distances of each point $A$ in the large-scale high-dimensional data set $D$ by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$\overline{M_\theta(A)} = \frac{M_\theta(A)}{\frac{1}{2}(n-1)(n-2)} = \frac{2 M_\theta(A)}{(n-1)(n-2)},$$
where $n$ denotes the number of data points in the large-scale high-dimensional data set $D$.
2) calculating the cosine distances of each data point A, that is, for each data point A, calculating the cosine distance $M_\theta(\overline{BAC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ that A forms with any two other points B and C, by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$M_\theta(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|_2 \cdot \|\overline{AC}\|_2}.$$
3) calculating the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A, that is, accumulating the absolute values of the differences between each cosine distance $M_\theta(\overline{BAC})$ obtained in step 2) and the mean cosine distance $\overline{M_\theta(A)}$ obtained in step 1), by the formula:
$$\Delta M_\theta(A) = \sum_{B \in D \setminus \{A\},\; C \in D \setminus \{A, B\}} \left| M_\theta(\overline{BAC}) - \overline{M_\theta(A)} \right|.$$
4) classifying the cosine distance average spacings, and selecting the points with the smallest cosine distance average spacing as the outliers with the highest outlier degree, comprising the following steps:
4-1) sorting the cosine distance average spacings of all points obtained in step 3) in ascending order to obtain the average spacing sequence L,
where, because the average spacing of outliers in high-dimensional data is small, the sequence L is characterized by a small portion of data having small values while most of the other data have larger values;
4-2) dividing the data sequence L into two classes $C_A$ and $C_B$, where $C_A$ is the class with the smaller values and $C_B$ is the class with the larger values.
The classification algorithm is as follows: adjacent data in the data sequence L are compared in turn, and if the change in value is greater than a threshold ε, that datum and all data after it are assigned to class $C_B$, where ε can be determined by the user; that is,
$C_A = \Phi$, $C_B = L$;
if $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average spacing sequence L and Φ denotes the empty set.
5) determining the outliers as follows:
The class $C_A$ obtained in step 4) is checked; if the number of data in $C_A$ is greater than a threshold δ, no outlier is detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in $C_A$ are outliers, where δ can be set by the user.
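The five steps of the embodiment can be combined into one end-to-end sketch. It is an illustration under stated assumptions rather than the patented implementation: the names `detect_outliers` and `average_spacing`, the toy data, and the parameter values are hypothetical; the norm is assumed to be Euclidean; points are assumed to be distinct; pairs {B, C} are unordered; and the sequence is cut at the first gap that is not smaller than ε, with the boundary element kept in C_A.

```python
import math
from itertools import combinations


def cosine_at(a, b, c):
    """Cosine of the angle at point a between the vectors a->b and a->c."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))
    norm_ab = math.sqrt(sum(x * x for x in ab))
    norm_ac = math.sqrt(sum(x * x for x in ac))
    return dot / (norm_ab * norm_ac)


def average_spacing(i, points):
    """Steps 1) to 3): accumulated absolute deviation of the cosines seen from points[i]."""
    a = points[i]
    others = [p for j, p in enumerate(points) if j != i]
    cosines = [cosine_at(a, b, c) for b, c in combinations(others, 2)]
    mean = sum(cosines) / len(cosines)
    return sum(abs(value - mean) for value in cosines)


def detect_outliers(points, eps, delta):
    """Return the sorted indices of the points judged to be outliers."""
    spacing = [average_spacing(i, points) for i in range(len(points))]
    order = sorted(range(len(points)), key=lambda i: spacing[i])  # step 4-1)
    cut = len(order)
    for k in range(len(order) - 1):  # step 4-2): first gap that reaches eps
        if spacing[order[k + 1]] - spacing[order[k]] >= eps:
            cut = k + 1
            break
    c_a = order[:cut]
    if len(c_a) > delta:  # step 5): too many candidates means no outlier
        return []
    return sorted(c_a)


# Toy data (assumed): a 3x3 grid cluster (indices 0..8) plus one distant point (index 9).
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
outliers = detect_outliers(points, eps=1.0, delta=2)
print(outliers)
```

Evaluating every pair of other points for every point costs on the order of n³ cosine computations, so a sketch like this is only practical for small n; how the method is meant to scale to large data sets is not described in the text.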

Claims (6)

1. A method for detecting outlier data in large-scale high-dimensional data, characterized by comprising the following steps:
(1) calculating the mean cosine distance of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ that A forms with every pair of other points B and C;
(2) calculating the cosine distances of each data point A;
(3) calculating the average spacing of all cosine distances of each data point A;
(4) classifying the cosine distance average spacings, and selecting the points with the smallest cosine distance average spacing as the outliers with the highest outlier degree;
(5) determining the outliers.
2. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (1) comprises the following steps:
1-1) formalizing the data set, the large-scale high-dimensional data being formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $R^d \times R^d \to R$;
for points $A, B \in D$, $\overline{AB}$ denotes the vector from $A$ to $B$;
where $R^d$ denotes the d-dimensional real space, $R^+$ denotes the positive real numbers, $R^d \to R^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $R^d \times R^d \to R$ denotes the inner product operation on two vectors of the d-dimensional real space;
1-2) for all points in the large-scale high-dimensional data set $D$, calculating, for each point $A$, the sum of the vector-angle cosine distances to every two other points, denoted $M_\theta(A)$, by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$M_\theta(A) = \sum_{A \in D,\; B \in D \setminus \{A\},\; C \in D \setminus \{A, B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|_2 \cdot \|\overline{AC}\|_2}$$
where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|_2$ and $\|\overline{AC}\|_2$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$ respectively;
1-3) calculating the mean $\overline{M_\theta(A)}$ of the cosine distances of each point $A$ in the large-scale high-dimensional data set $D$ by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$\overline{M_\theta(A)} = \frac{M_\theta(A)}{\frac{1}{2}(n-1)(n-2)} = \frac{2 M_\theta(A)}{(n-1)(n-2)}.$$
3. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (2) calculates the cosine distances of the data point A, that is, for each data point A, the cosine distance $M_\theta(\overline{BAC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ that A forms with any two points B and C is calculated by the formula:
for all $A \in D$, $B \in D$, $C \in D$ with $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,
$$M_\theta(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|_2 \cdot \|\overline{AC}\|_2}.$$
4. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (3) calculates the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A, that is, it accumulates the absolute values of the differences between each cosine distance $M_\theta(\overline{BAC})$ obtained in step (2) and the mean cosine distance $\overline{M_\theta(A)}$ obtained in step (1), by the formula:
$$\Delta M_\theta(A) = \sum_{B \in D \setminus \{A\},\; C \in D \setminus \{A, B\}} \left| M_\theta(\overline{BAC}) - \overline{M_\theta(A)} \right|.$$
5. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (4) comprises the following steps:
4-1) sorting the cosine distance average spacings of all points obtained in said step (3) in ascending order to obtain the average spacing sequence L;
4-2) dividing the average spacing sequence L into two classes $C_A$ and $C_B$.
The classification algorithm is as follows: adjacent data in the average spacing sequence L are compared in turn, and if the change in value is greater than a threshold ε, that datum and all data after it are assigned to class $C_B$, where ε is determined by the user; that is,
$\forall l_i \in L$, $C_A = \Phi$, $C_B = L$;
if $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average spacing sequence L and Φ denotes the empty set.
6. The method for detecting outlier data in large-scale high-dimensional data according to claim 5, characterized in that said step (5) determines the outliers as follows:
The class $C_A$ obtained in said step (4) is checked; if the number of data in $C_A$ is greater than a threshold δ, no outlier is detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in $C_A$ are outliers, where δ is set by the user.
CN201510393861.5A 2015-07-07 2015-07-07 Method for detecting outlier data of large-scale high dimension data Pending CN105160347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510393861.5A CN105160347A (en) 2015-07-07 2015-07-07 Method for detecting outlier data of large-scale high dimension data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510393861.5A CN105160347A (en) 2015-07-07 2015-07-07 Method for detecting outlier data of large-scale high dimension data

Publications (1)

Publication Number Publication Date
CN105160347A (en) 2015-12-16

Family

ID=54801199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510393861.5A Pending CN105160347A (en) 2015-07-07 2015-07-07 Method for detecting outlier data of large-scale high dimension data

Country Status (1)

Country Link
CN (1) CN105160347A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951353A (en) * 2017-03-20 2017-07-14 北京搜狐新媒体信息技术有限公司 Work data method for detecting abnormality and device
CN106951353B (en) * 2017-03-20 2020-05-22 北京搜狐新媒体信息技术有限公司 Method and device for detecting abnormality of operation data
CN110377798A (en) * 2019-06-12 2019-10-25 成都理工大学 Outlier detection method based on angle entropy
CN110377798B (en) * 2019-06-12 2022-10-21 成都理工大学 Outlier detection method based on angle entropy

Similar Documents

Publication Publication Date Title
CN103048041B (en) Fault diagnosis method of electromechanical system based on local tangent space and support vector machine
Kim et al. Structural recurrent neural network for traffic speed prediction
CN104657746A (en) Anomaly detection method based on vehicle trajectory similarity
CN108292369A (en) Visual identity is carried out using deep learning attribute
Huang et al. Network traffic anomaly detection based on growing hierarchical SOM
CN106650297A (en) Satellite subsystem anomaly detection method without domain knowledge
CN102542295A (en) Method for detecting landslip from remotely sensed image by adopting image classification technology
Hussain et al. A novel unsupervised feature‐based approach for electricity theft detection using robust PCA and outlier removal clustering algorithm
CN105574642A (en) Smart grid big data-based electricity price execution checking method
CN102663431A (en) Image matching calculation method on basis of region weighting
CN104807589A (en) Online identification method for gas-liquid two-phase-flow flow pattern in gathering and transportation-vertical pipe system
Dai et al. Complexity–entropy causality plane based on power spectral entropy for complex time series
CN105046275A (en) Large-scale high-dimensional outlier data detection method based on angle variance
CN104881676A (en) Face image convex-and-concave pattern texture feature extraction and recognition method
CN103218617A (en) Multi-linear large space feature extraction method
CN103365999A (en) Text clustering integrated method based on similarity degree matrix spectral factorization
CN103034869A (en) Part maintaining projection method of adjacent field self-adaption
CN105160347A (en) Method for detecting outlier data of large-scale high dimension data
CN104618175A (en) Network abnormity detection method
Cheng et al. Energy theft detection in an edge data center using deep learning
CN103646234A (en) Face identification method based on LGBPH features
CN102982342A (en) Positive semidefinite spectral clustering method based on Lagrange dual
CN103258134A (en) Dimension reduction processing method of high-dimensional vibration signals
CN102346830A (en) Gradient histogram-based virus detection method
Martí et al. YASA: yet another time series segmentation algorithm for anomaly detection in big data problems

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151216
