CN105138641A

CN105138641A - Angle-based high dimensional data outlier detection method

Info

Publication number: CN105138641A
Application number: CN201510524427.6A
Authority: CN
Inventors: 刘文婷
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2015-08-24
Filing date: 2015-08-24
Publication date: 2015-12-09

Abstract

The invention discloses an angle-based outlier detection method for high-dimensional data and belongs to the technical field of outlier data mining. The method comprises the following steps: (1) for each data point A in a data set D, the k nearest neighbors of A are obtained within D; (2) an angle-based outlier factor is calculated for each data point; (3) the outlier factors of the data points are ranked, and the set of points with the smallest outlier factors is selected as the set of points with the highest degree of outlierness; (4) the outlier data are determined. The method can find outlier data hidden in large-scale high-dimensional data efficiently and rapidly, and can effectively overcome the curse of dimensionality that affects outlier detection methods based on high-dimensional distance, nearest neighbors and the like. It can be widely applied to high-dimensional data in credit card fraud detection, traffic accident detection, anomaly detection in scientific measurement data, and similar tasks.

Description

An angle-based outlier detection method for high-dimensional data

Technical field

The present invention relates to an angle-based outlier detection method for high-dimensional data and belongs to the technical field of outlier data mining.

Background

Outlier data mining is one of the current research hotspots in data mining and is widely used in fields such as network traffic intrusion detection, traffic accident detection, and anomaly detection in scientific measurement data. Existing outlier mining methods rely mainly on the concepts of distance or nearest neighbors. In high-dimensional data, however, distances and nearest neighbors no longer have the properties they have in ordinary Euclidean space, and the curse of dimensionality sets in. In high-dimensional data, an outlier lies far from the other data points, so the angles between the vectors from the outlier to the other points vary little. A non-outlier is surrounded by data points, so the angles between the vectors from it to the other points vary widely. Outlier data hidden in high-dimensional data can therefore be found from the variance of these angles.
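The intuition in the background can be checked numerically. The following is a minimal sketch, not part of the disclosed method: it measures the plain (unweighted) variance of the angles seen from a point inside a synthetic Gaussian cluster and from a point placed far away from it. The helper name `angle_variance`, the cluster size, the dimension and the position of the far point are all illustrative assumptions.

```python
import math
import random
from itertools import combinations


def angle_variance(a, others):
    """Variance of the angles (radians) at `a` over all pairs of points in `others`."""
    angles = []
    for b, c in combinations(others, 2):
        ab = [bi - ai for ai, bi in zip(a, b)]
        ac = [ci - ai for ai, ci in zip(a, c)]
        dot = sum(x * y for x, y in zip(ab, ac))
        norm = math.hypot(*ab) * math.hypot(*ac)
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        angles.append(math.acos(cos))
    mean = sum(angles) / len(angles)
    return sum((t - mean) ** 2 for t in angles) / len(angles)


random.seed(0)
dim = 20                                   # assumed dimensionality
cluster = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(60)]
inlier = cluster[0]                        # a point surrounded by the cluster
outlier = [10.0] * dim                     # a point far away from the cluster

var_inlier = angle_variance(inlier, cluster[1:])
var_outlier = angle_variance(outlier, cluster[1:])
print(var_inlier, var_outlier)
```

Under these assumptions the far point sees the whole cluster within a narrow cone, so its angle variance comes out much smaller than that of the point inside the cluster.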

Summary of the invention

To remedy the deficiencies of the prior art, the object of the present invention is to provide an angle-based outlier detection method for high-dimensional data. The invention can find outlier data hidden in large-scale high-dimensional data efficiently and rapidly, and can be widely applied to high-dimensional data in credit card fraud detection, traffic accident detection, anomaly detection in scientific measurement data, and similar tasks.

The technical solution of the present invention is an angle-based outlier detection method for high-dimensional data, characterized by comprising the following steps:

(1) In the data set D, for each data point A ∈ D, obtain the k nearest neighbors of A;

(2) Calculate the angle-based outlier factor of each data point: for each data point A, compute the variance of the angles between the vectors $\overline{AB}$ and $\overline{AC}$ from A to every pair of other points B and C;

(3) Rank the outlier factors of all data points in ascending order to obtain the outlier factor sequence L, and select the set of points with the smallest outlier factors as the set of points with the highest degree of outlierness. The selection works as follows: the sequence L is divided into two classes $C_A$ and $C_B$ by comparing consecutive elements of L in order; if the change in value is less than a threshold ε, that element and all elements following it are assigned to class $C_A$, where ε is chosen by the user. That is,

$$\forall l_i \in L,\quad C_A = \Phi,\quad C_B = L$$

If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;

otherwise, $C_B = C_B \setminus \{l_i\}$,

where $l_i$ denotes the i-th element of the sequence L and Φ denotes the empty set;

(4) Determine the outlier data: inspect the class $C_A$ obtained in step (3). If the number of elements in $C_A$ exceeds a threshold δ, no outlier has been detected in this large-scale high-dimensional data set; otherwise the points corresponding to all elements of $C_A$ are outliers, where δ is set by the user.
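Step (3) ranks values, while step (4) reports points, so an implementation has to remember which point each ranked factor belongs to. The following is a minimal sketch of that bookkeeping; the function name `rank_by_outlier_factor` and the sample factors are hypothetical and not taken from the patent.

```python
def rank_by_outlier_factor(factors):
    """Return (point_index, factor) pairs sorted by factor, smallest first."""
    return sorted(enumerate(factors), key=lambda pair: pair[1])


# Hypothetical outlier factors for five points, indexed 0..4.
factors = [0.42, 0.03, 0.51, 0.38, 0.01]
ranking = rank_by_outlier_factor(factors)

L = [value for _, value in ranking]        # the ascending sequence L
order = [index for index, _ in ranking]    # which point each l_i came from
print(L, order)
```

Keeping `order` alongside `L` lets a later step translate a selected prefix of L back into the data points it corresponds to.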

In the aforementioned angle-based outlier detection method for high-dimensional data, step (1) comprises the following steps:

1-1) Formalize the data set. The high-dimensional data are formalized as follows:

For a given high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ as $R^d \times R^d \to R$. For points $A, B \in D$, $\overline{AB}$ denotes the vector from A to B. Here $R^d$ denotes the d-dimensional real space, $R^+$ the positive real numbers, $R^d \to R^+$ a mapping from elements of the d-dimensional real space to the positive reals, and $R^d \times R^d \to R$ the inner product of two vectors in the d-dimensional real space;

1-2) For a given point $A \in D$ of the high-dimensional data set, use the hypersphere search method to obtain the k nearest neighbors of A, denoted as the point set $N_k(A) \subseteq D$.
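For reference, the point set $N_k(A)$ can also be obtained by exhaustive comparison. The sketch below deliberately swaps in a brute-force scan for the hypersphere search method named in step 1-2); it returns the same neighbor set, only more slowly on large data. The function name `k_nearest_neighbors` and the sample points are assumptions made for illustration.

```python
import heapq
import math


def k_nearest_neighbors(data, a_index, k):
    """Indices of the k points of `data` closest to data[a_index] (Euclidean norm)."""
    a = data[a_index]
    candidates = (
        (math.dist(a, p), j) for j, p in enumerate(data) if j != a_index
    )
    # nsmallest orders by distance first, then by index to break ties.
    return [j for _, j in heapq.nsmallest(k, candidates)]


# Hypothetical two-dimensional data set.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (0.5, 0.0)]
neighbors = k_nearest_neighbors(data, a_index=0, k=2)
print(neighbors)
```

A brute-force scan costs one distance evaluation per point and per query, which is the cost a space-partitioning search is meant to reduce.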

In the aforementioned angle-based outlier detection method for high-dimensional data, step (2) comprises the following steps:

2-1) Define the outlier factor AOF(A): the outlier factor AOF(A) of A is defined as the variance of the angles between the vectors $\overline{AB}$ and $\overline{AC}$ from A to any two points B and C. Specifically,

for $A \in D$, $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,

$$\mathrm{AOF}(A) = \mathrm{Var}_{\substack{B, C \in D \\ A \neq B \neq C}} \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right),$$

Then,

$$
\begin{aligned}
\mathrm{AOF}(A) &= E\varphi_{\overline{BAC}}^{2} - \left(E\varphi_{\overline{BAC}}\right)^{2} \\
&= \frac{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)^{2}}{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \\
&\quad - \left( \frac{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}}{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \right)^{2},
\end{aligned}
$$

where Var denotes the variance of the angle $\varphi_{\overline{BAC}}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the respective vectors, and $E\varphi_{\overline{BAC}}$ denotes the mathematical expectation of the angle $\varphi_{\overline{BAC}}$;

2-2) Using the k nearest neighbors $N_k(A) \subseteq D$ of A obtained in step 1-2), the outlier factor AOF(A) of step 2-1) is expressed as the outlier factor based on the k nearest neighbors of A, namely:

for $N_k(A) \subseteq D$ and $B, C \in N_k(A)$,

$$
\begin{aligned}
\mathrm{AOF}_{N_k(A)}(A) &= \mathrm{Var}_{\substack{B, C \in N_k(A) \\ B \neq C}} \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right) \\
&= E\varphi_{\overline{BAC}}^{2} - \left(E\varphi_{\overline{BAC}}\right)^{2} \\
&= \frac{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)^{2}}{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \\
&\quad - \left( \frac{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}}{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \right)^{2},
\end{aligned}
$$

where Var denotes the variance of the angle $\varphi_{\overline{BAC}}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the respective vectors, and $E\varphi_{\overline{BAC}}$ denotes the mathematical expectation of the angle $\varphi_{\overline{BAC}}$.
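The neighbor-based outlier factor can be sketched directly from the weighted sums in the formula. This is an illustrative implementation under stated assumptions, not the patent's reference code: the function name `aof` is hypothetical, the double sum runs over ordered pairs with B ≠ C, the same weight sum is used to normalize both moments, and no neighbor is assumed to coincide with A (which would give a zero norm).

```python
import math


def aof(a, neighbors):
    """Weighted variance of <AB, AC> / (|AB|^2 |AC|^2) over ordered pairs B != C.

    Each pair is weighted by 1 / (|AB| |AC|). Assumes no neighbor equals `a`.
    """
    vectors = [[x - y for x, y in zip(p, a)] for p in neighbors]
    norms = [math.hypot(*v) for v in vectors]
    weight_sum = first = second = 0.0
    for i, (ab, nb) in enumerate(zip(vectors, norms)):
        for j, (ac, nc) in enumerate(zip(vectors, norms)):
            if i == j:
                continue  # skip B == C
            w = 1.0 / (nb * nc)
            value = sum(x * y for x, y in zip(ab, ac)) / (nb ** 2 * nc ** 2)
            weight_sum += w
            first += w * value          # accumulates the weighted first moment
            second += w * value ** 2    # accumulates the weighted second moment
    mean = first / weight_sum
    return second / weight_sum - mean ** 2


# Hypothetical data: a 5 x 5 grid of points plus one point far from the grid.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
center = (2.0, 2.0)
far = (20.0, 20.0)

aof_center = aof(center, [p for p in grid if p != center])
aof_far = aof(far, grid)
print(aof_center, aof_far)
```

Passing only the k nearest neighbors of A as `neighbors` gives the neighbor-restricted factor, and the double loop then costs on the order of k² inner products per point.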

Beneficial effects of the present invention: the invention can find outlier data hidden in large-scale high-dimensional data efficiently and rapidly. Because the outlier factor is based on the angles between vectors, it effectively overcomes the "curse of dimensionality" that affects outlier detection methods based on high-dimensional distance, nearest neighbors and the like. The invention can be widely applied to high-dimensional data in credit card fraud detection, traffic accident detection, anomaly detection in scientific measurement data, and similar tasks.

Brief description of the drawings

Fig. 1 is a flowchart of the angle-based outlier detection method for high-dimensional data of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawing. The following embodiment serves only to illustrate the technical solution of the present invention more clearly and does not limit the scope of protection of the invention.

As shown in Fig. 1, an angle-based outlier detection method for high-dimensional data comprises the following steps:

1) In the data set D, for each data point A ∈ D, obtain the k nearest neighbors of A.

Obtaining the k nearest neighbors of each data point requires a formal description of the high-dimensional data and a method for computing the k nearest neighbors, given respectively as follows:

1-1) Formalize the data set. The high-dimensional data are formalized as follows: for a given high-dimensional data set $D \subseteq R^d$, the norm $\|\cdot\|$ is defined as $R^d \to R^+$ and the inner product $\langle \cdot, \cdot \rangle$ as $R^d \times R^d \to R$; for points $A, B \in D$, $\overline{AB}$ denotes the vector from A to B.

1-2) For a given point $A \in D$ of the high-dimensional data set, obtain the k nearest neighbors of A, denoted as the point set $N_k(A) \subseteq D$. The k nearest neighbors are obtained with the hypersphere search method.

The basic idea of the hypersphere search method is to divide the high-dimensional space into a number of hypercubes of equal volume, called primitive hypercubes, and to encode them in sequence. The search then runs inside a hypersphere centered on A (covered by several primitive hypercubes), whose radius is expanded gradually until the hypersphere contains k samples. The k nearest neighbors within this hypersphere are the k nearest neighbors in the whole space. By organizing the feature space in advance, the method allows the classification to be carried out within the hypersphere centered on A, whose radius grows gradually from zero until the hypersphere contains at least k pattern samples. The hypersphere search method has two stages: the first is the organization stage, in which the pattern space is partitioned and encoded; the second is the search and decision stage, in which the k nearest neighbors $N_k(A) \subseteq D$ of the sample to be recognized are found.
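The two stages described above can be sketched as a grid index plus an expanding-radius query. This is one possible reading of the description, not the patent's implementation: the names `build_grid` and `hypersphere_knn`, the use of integer cell coordinates as the hypercube code, the radius increment, and the choice to scan only occupied cells are all assumptions.

```python
import math
from collections import defaultdict


def build_grid(data, cell):
    """Organization stage: map each hypercube code to the indices of its points."""
    grid = defaultdict(list)
    for idx, p in enumerate(data):
        grid[tuple(math.floor(x / cell) for x in p)].append(idx)
    return grid


def hypersphere_knn(data, grid, cell, a_index, k, step=None):
    """Search stage: grow a hypersphere around data[a_index] until it holds k points."""
    if k > len(data) - 1:
        raise ValueError("k exceeds the number of other points")
    a = data[a_index]
    home = tuple(math.floor(x / cell) for x in a)
    step = step or cell          # assumed radius increment
    radius = 0.0
    while True:
        radius += step
        # Cells whose code differs from the home cell by more than `reach`
        # in some coordinate cannot contain a point within `radius` of A.
        reach = math.ceil(radius / cell)
        found = []
        for code, members in grid.items():
            if max(abs(c - h) for c, h in zip(code, home)) > reach:
                continue
            for j in members:
                d = math.dist(a, data[j])
                if j != a_index and d <= radius:
                    found.append((d, j))
        if len(found) >= k:
            return [j for _, j in sorted(found)[:k]]


# Hypothetical two-dimensional data set and cell size.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (0.5, 0.0)]
grid = build_grid(data, cell=1.0)
print(hypersphere_knn(data, grid, cell=1.0, a_index=0, k=2))
```

Every point outside the final hypersphere is farther from A than every point inside it, which is why the k closest points found inside are also the k nearest neighbors in the whole data set.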

2) Calculate the outlier factor of each data point A. This requires a formal definition of the outlier factor and a method for computing it from the k nearest neighbors, as follows:

2-1) Define the outlier factor AOF(A): the outlier factor AOF(A) of A is defined as the variance of the angles between the vectors $\overline{AB}$ and $\overline{AC}$ from A to any two points B and C. Specifically,

for $A \in D$, $B \in D \setminus \{A\}$ and $C \in D \setminus \{A, B\}$,

$$\mathrm{AOF}(A) = \mathrm{Var}_{\substack{B, C \in D \\ A \neq B \neq C}} \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)$$

Then,

$$
\begin{aligned}
\mathrm{AOF}(A) &= E\varphi_{\overline{BAC}}^{2} - \left(E\varphi_{\overline{BAC}}\right)^{2} \\
&= \frac{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)^{2}}{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \\
&\quad - \left( \frac{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}}{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \right)^{2}
\end{aligned}
$$

where Var denotes the variance of the angle $\varphi_{\overline{BAC}}$ between the vectors $\overline{AB}$ and $\overline{AC}$, $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of $\overline{AB}$ and $\overline{AC}$, $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the respective vectors, and $E\varphi_{\overline{BAC}}$ denotes the mathematical expectation of the angle $\varphi_{\overline{BAC}}$.

The outlier factor based on the k nearest neighbors of A can then be expressed as follows:

for $N_k(A) \subseteq D$ and $B, C \in N_k(A)$,

$$
\begin{aligned}
\mathrm{AOF}_{N_k(A)}(A) &= \mathrm{Var}_{\substack{B, C \in N_k(A) \\ B \neq C}} \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right) \\
&= E\varphi_{\overline{BAC}}^{2} - \left(E\varphi_{\overline{BAC}}\right)^{2} \\
&= \frac{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)^{2}}{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \\
&\quad - \left( \frac{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}}{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \right)^{2}
\end{aligned}
$$
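A small worked case may make the weighted variance concrete. The configuration below is a hypothetical illustration, not taken from the patent: A sits at the origin of the plane with three neighbors $P_1 = (1, 0)$, $P_2 = (0, 1)$, $P_3 = (-1, 0)$, so every difference vector has norm 1 and every pair weight equals 1. Pairs are counted once here; counting ordered pairs doubles every sum and leaves the result unchanged.

```latex
\begin{aligned}
x_{12} &= \frac{\langle \overline{AP_1}, \overline{AP_2} \rangle}{\|\overline{AP_1}\|^{2} \cdot \|\overline{AP_2}\|^{2}} = \frac{0}{1 \cdot 1} = 0, \qquad
x_{13} = \frac{-1}{1 \cdot 1} = -1, \qquad
x_{23} = \frac{0}{1 \cdot 1} = 0, \\
E[x] &= \frac{1 \cdot 0 + 1 \cdot (-1) + 1 \cdot 0}{1 + 1 + 1} = -\frac{1}{3}, \qquad
E[x^{2}] = \frac{0 + 1 + 0}{3} = \frac{1}{3}, \\
\mathrm{AOF}_{N_3(A)}(A) &= E[x^{2}] - \left(E[x]\right)^{2} = \frac{1}{3} - \frac{1}{9} = \frac{2}{9}.
\end{aligned}
```

The neighbors surround A on three sides, so the pairwise values differ and the variance is clearly non-zero; neighbors that all lie in nearly the same direction would give nearly equal values and a variance close to zero.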

3) Rank the outlier factors of all data points and select the set of points with the smallest outlier factors as the set of points with the highest degree of outlierness. This comprises the following steps:

3-1) Sort the values of all points (the average cosine-distance spacings, that is, the angle-based outlier factors computed in step 2)) in ascending order to obtain the sequence L. Because an outlier in high-dimensional data has a small value, the sequence L is characterized by a small number of elements with small values, while most of the other elements have larger values;

3-2) Divide the sequence L into two classes $C_A$ and $C_B$, where $C_A$ is the class of smaller values and $C_B$ is the class of larger values.

The classification algorithm proceeds as follows: compare consecutive elements of the sequence L in order; if the change in value is less than a threshold ε, that element and all elements following it are assigned to class $C_A$, where ε can be chosen by the user. That is,

$$\forall l_i \in L,\quad C_A = \Phi,\quad C_B = L$$

If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;

otherwise, $C_B = C_B \setminus \{l_i\}$,

where $l_i$ denotes the i-th element of the sequence L and Φ denotes the empty set.
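The classification rule leaves some details open, for example how the element just before a large jump is treated. The sketch below therefore implements one hedged reading, which is an assumption rather than the patent's stated procedure: walk the ascending sequence, keep elements in $C_A$ while the gap to the next element stays below ε, and cut at the first gap of at least ε, so that everything after the cut forms $C_B$. The function name `split_sequence` and the sample values are hypothetical.

```python
def split_sequence(L, eps):
    """Split an ascending sequence at the first gap >= eps; return (C_A, C_B).

    If no gap reaches eps, the whole sequence ends up in C_A and C_B is empty.
    """
    for i in range(len(L) - 1):
        d = abs(L[i + 1] - L[i])
        if d >= eps:
            # l_i is the last small value before the jump, so it stays in C_A.
            return list(L[: i + 1]), list(L[i + 1:])
    return list(L), []


# Hypothetical sorted outlier factors with a clear jump after the second value.
c_a, c_b = split_sequence([0.01, 0.03, 0.38, 0.42, 0.51], eps=0.1)
print(c_a, c_b)
```

Under this reading a sequence without any large jump is placed entirely in $C_A$, which is the case a size threshold on $C_A$ can then reject.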

4) Determine the outliers, as follows:

Inspect the class $C_A$ obtained in step 3). If the number of elements in $C_A$ exceeds a threshold δ, no outlier has been detected in this large-scale high-dimensional data set; otherwise the points corresponding to all elements of $C_A$ are outliers, where δ can be set by the user.
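The decision in step 4) reduces to a size check on $C_A$. A minimal sketch follows, assuming $C_A$ is carried as a list of point indices; the function name `decide_outliers` and the sample indices are hypothetical.

```python
def decide_outliers(c_a_indices, delta):
    """Return the outlier indices, or an empty list if C_A holds more than delta items."""
    if len(c_a_indices) > delta:
        return []                 # C_A is too large: no outlier detected
    return list(c_a_indices)      # every point behind C_A is reported as an outlier


# Hypothetical point indices behind the elements of C_A.
print(decide_outliers([4, 1], delta=3))
print(decide_outliers([4, 1, 3, 0, 2], delta=3))
```

The threshold δ encodes the expectation that outliers are rare: a low-value class containing a large share of the data is treated as ordinary structure rather than as a set of outliers.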

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the technical principle of the present invention, and these improvements and modifications shall also fall within the scope of protection of the present invention.

Claims

1. An angle-based outlier detection method for high-dimensional data, characterized by comprising the following steps:

$$\forall l_i \in L,\quad C_A = \Phi,\quad C_B = L$$

If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;

otherwise, $C_B = C_B \setminus \{l_i\}$,

2. The angle-based outlier detection method for high-dimensional data according to claim 1, characterized in that step (1) comprises the following steps:

1-1) Formalize the data set. The high-dimensional data are formalized as follows:

3. The angle-based outlier detection method for high-dimensional data according to claim 1, characterized in that step (2) comprises the following steps:

$A \in D$, and

$$B \in D \setminus \{A\},\quad C \in D \setminus \{A, B\}$$

$$\mathrm{AOF}(A) = \mathrm{Var}_{\substack{B, C \in D \\ A \neq B \neq C}} \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right),$$

Then,

$$
\begin{aligned}
\mathrm{AOF}(A) &= E\varphi_{\overline{BAC}}^{2} - \left(E\varphi_{\overline{BAC}}\right)^{2} \\
&= \frac{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)^{2}}{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \\
&\quad - \left( \frac{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}}{\sum_{B \in D} \sum_{C \in D} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \right)^{2},
\end{aligned}
$$

which is expressed as the outlier factor based on the k nearest neighbors of A, namely

for $N_k(A) \subseteq D$ and $B, C \in N_k(A)$,

$$
\begin{aligned}
\mathrm{AOF}_{N_k(A)}(A) &= \mathrm{Var}_{\substack{B, C \in N_k(A) \\ B \neq C}} \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right) \\
&= E\varphi_{\overline{BAC}}^{2} - \left(E\varphi_{\overline{BAC}}\right)^{2} \\
&= \frac{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \left( \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}} \right)^{2}}{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \\
&\quad - \left( \frac{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|} \cdot \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}}{\sum_{B \in N_k(A)} \sum_{C \in N_k(A)} \frac{1}{\|\overline{AB}\| \cdot \|\overline{AC}\|}} \right)^{2},
\end{aligned}
$$