CN105160347A - Method for detecting outlier data of large-scale high dimension data

Info

Publication number: CN105160347A
Application number: CN201510393861.5A
Authority: CN
Inventors: 刘文婷
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2015-07-07
Filing date: 2015-07-07
Publication date: 2015-12-16

Abstract

The invention discloses a method for detecting outlier data in large-scale high-dimensional data and belongs to the technical field of outlier data mining. The method comprises the following steps: (1) calculating the mean cosine distance of each data point; (2) calculating the cosine distances of each data point; (3) calculating the average spacing of the cosine distances of each data point; (4) classifying the average spacings and selecting the points with the smallest average spacing as the points with the highest degree of outlierness; and (5) determining the outlier data. The method quickly and efficiently discovers outlier data hidden in large-scale high-dimensional data.

Description

Method for detecting outlier data in large-scale high-dimensional data

Technical field

The present invention relates to the technical field of outlier data mining, and in particular to a method for detecting outlier data in large-scale high-dimensional data.

Background art

Outlier data mining is one of the current research hotspots in data mining and is widely used in fields such as network traffic intrusion detection, credit card fraud detection, and abnormal behavior detection in video surveillance. Existing outlier mining methods are mainly based on the concepts of distance or nearest neighbor. In high-dimensional data, however, if the neighbors of a data point are examined using distance in the high-dimensional space or the nearest-neighbor concept, most of the data end up being judged as outliers. If detection is instead based on the cosine distance between vectors, the outliers hidden in high-dimensional data can be found. The angles between the vectors from an outlier to the other points vary little, whereas a non-outlier is surrounded by data points, so the angles between the vectors from a non-outlier to the other points vary considerably. Outliers hidden in high-dimensional data can therefore be identified from the magnitude of this angle variation.

Summary of the invention

The present invention proposes a method for detecting outlier data in large-scale high-dimensional data. The method quickly and efficiently finds the outlier data hidden in large-scale high-dimensional data and can be widely applied to high-dimensional data in areas such as credit card fraud detection, abnormal behavior detection in video surveillance, and network traffic intrusion detection.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A method for detecting outlier data in large-scale high-dimensional data, comprising the following steps:

(1) calculate the mean cosine distance of each data point in the large-scale high-dimensional data; that is, for each data point A, calculate the mean of the cosine distances between the vectors \overline{AB} and \overline{AC} formed from A to every pair of other points B and C;

(2) calculate the cosine distances of each data point A;

(3) calculate the average spacing of all cosine distances of each data point A;

(4) classify the cosine distance average spacings, and select the few points with the smallest average spacing as the points with the highest degree of outlierness;

(5) determine the outliers.

Step (1) above comprises the following steps:

1-1) Formalize the data set. The large-scale high-dimensional data are formalized as follows:

For a given large-scale high-dimensional data set D ⊆ R^d, the norm ||·|| is defined as R^d → R⁺ and the inner product <·,·> is defined as R^d × R^d → R.

For points A, B ∈ D, \overline{AB} denotes the vector from A to B.

Here R^d denotes the d-dimensional real space, R⁺ denotes the positive real numbers, R^d → R⁺ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and R^d × R^d → R denotes the inner product operation on two vectors of the d-dimensional real space;
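The definitions in 1-1) map onto a few small helper functions. The following is a minimal Python sketch and not part of the patent text: the names `vec`, `inner` and `norm` are hypothetical, the Euclidean norm and the standard dot product are assumed, and \overline{AB} is assumed to be the coordinate-wise difference B − A.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def vec(a: Point, b: Point) -> List[float]:
    """Vector from point a to point b (assumed here to be b - a)."""
    return [bj - aj for aj, bj in zip(a, b)]


def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard dot product, a map from R^d x R^d to R."""
    return sum(x * y for x, y in zip(u, v))


def norm(u: Sequence[float]) -> float:
    """Euclidean norm, a map from R^d to the non-negative reals."""
    return math.sqrt(inner(u, u))


ab = vec((0.0, 0.0), (3.0, 4.0))
print(ab, norm(ab), inner(ab, [1.0, 0.0]))  # [3.0, 4.0] 5.0 3.0
```

Any other norm and inner product pair with the stated signatures could be substituted; the Euclidean choice is only the most common reading.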

1-2) For every point A in the large-scale high-dimensional data set D, calculate the sum of the vector-angle cosine distances from A to every two other points, denoted M_θ(A). The formula is:

For all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A, B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}

where <\overline{AB}, \overline{AC}> denotes the inner product of the vectors \overline{AB} and \overline{AC}, and ||\overline{AB}|| and ||\overline{AC}|| denote the norms of the vectors \overline{AB} and \overline{AC} respectively;

1-3) Calculate the mean cosine distance of each point A in the large-scale high-dimensional data set D, denoted \overline{M_θ(A)}. The formula is:

For all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

\overline{M_{\theta}(A)} = \frac{M_{\theta}(A)}{\frac{1}{2}(n - 1)(n - 2)} = \frac{2 M_{\theta}(A)}{(n - 1)(n - 2)},

where n is the number of data points in D.
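Steps 1-2) and 1-3) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's own implementation: the function names are hypothetical, \overline{AB} is taken as B − A, the points are assumed to be distinct (so no vector has zero norm), n is assumed to be at least 3, and the sum is taken over unordered pairs {B, C}, which is the reading that matches the denominator (n − 1)(n − 2)/2.

```python
from itertools import combinations
from typing import Sequence

Point = Sequence[float]


def weighted_cosine(a: Point, b: Point, c: Point) -> float:
    """<AB, AC> / (||AB||^2 * ||AC||^2) with AB = b - a and AC = c - a."""
    ab = [bj - aj for aj, bj in zip(a, b)]
    ac = [cj - aj for aj, cj in zip(a, c)]
    dot = sum(x * y for x, y in zip(ab, ac))
    return dot / (sum(x * x for x in ab) * sum(x * x for x in ac))


def cosine_sum(i: int, data: Sequence[Point]) -> float:
    """M_theta(A) for A = data[i], summed over unordered pairs {B, C}."""
    others = [p for j, p in enumerate(data) if j != i]
    return sum(weighted_cosine(data[i], b, c) for b, c in combinations(others, 2))


def cosine_mean(i: int, data: Sequence[Point]) -> float:
    """Mean of the pair values: 2 * M_theta(A) / ((n - 1)(n - 2))."""
    n = len(data)
    return 2.0 * cosine_sum(i, data) / ((n - 1) * (n - 2))


# Corners of the unit square; A is the corner at the origin.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(cosine_sum(0, square), cosine_mean(0, square))
```

For the origin corner the three pair values are 0, 0.5 and 0.5, so the sum is 1.0 and the mean is one third.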

Step (2) above calculates the cosine distances of data point A; that is, for each data point A, it calculates the cosine distance M_θ(\overline{BAC}) between the vectors \overline{AB} and \overline{AC} formed from A to any two points B and C. The formula is:

For all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

M_{\theta}(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}.

Step (3) above calculates the average spacing ΔM_θ(A) of all cosine distances of each data point A; that is, it accumulates the absolute value of the difference between each cosine distance M_θ(\overline{BAC}) obtained in step (2) and the mean cosine distance \overline{M_θ(A)} obtained in step (1). The formula is:

\Delta M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A, B\}} \left| M_{\theta}(\overline{BAC}) - \overline{M_{\theta}(A)} \right|.
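Steps (2) and (3) can be combined into one short routine. The sketch below is illustrative only: `pair_values` and `average_spacing` are hypothetical names, \overline{AB} is assumed to be B − A, the points are assumed distinct with n at least 3, and the pairs {B, C} are treated as unordered so that the mean divides by (n − 1)(n − 2)/2 as in step (1).

```python
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def pair_values(i: int, data: Sequence[Point]) -> List[float]:
    """Step (2): M_theta(BAC) for every unordered pair {B, C}, A = data[i]."""
    a = data[i]
    others = [p for j, p in enumerate(data) if j != i]
    values = []
    for b, c in combinations(others, 2):
        ab = [bj - aj for aj, bj in zip(a, b)]
        ac = [cj - aj for aj, cj in zip(a, c)]
        dot = sum(x * y for x, y in zip(ab, ac))
        values.append(dot / (sum(x * x for x in ab) * sum(x * x for x in ac)))
    return values


def average_spacing(i: int, data: Sequence[Point]) -> float:
    """Step (3): sum of |M_theta(BAC) - mean| over all pairs."""
    values = pair_values(i, data)
    mean = sum(values) / len(values)  # len(values) == (n - 1)(n - 2) / 2
    return sum(abs(v - mean) for v in values)


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(average_spacing(0, square))
```

For the origin corner of the unit square the pair values are 0, 0.5 and 0.5 with mean 1/3, giving a spacing of 1/3 + 1/6 + 1/6 = 2/3. Note that, as in the formula, the deviations are summed and not divided by the number of pairs.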

Step (4) above comprises the following steps:

4-1) Sort the cosine distance average spacings of all points obtained in step (3) in ascending order to obtain the average spacing sequence L;

4-2) Divide the average spacing sequence L into two classes, C_A and C_B.

The classification algorithm is as follows: compare each pair of adjacent values in the average spacing sequence L in turn; if the change in value is greater than a threshold ε, assign that value and all values after it to class C_B, where ε is chosen by the user. That is,

C_A = Φ, C_B = L

If d = |l_{i+1} − l_i| < ε, then C_A = C_A ∪ {l_i};

otherwise, C_B = C_B \ {l_i},

where l_i denotes the i-th value in the average spacing sequence L and Φ denotes the empty set.
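The split in step 4-2) can be sketched as a single pass over the sorted sequence. This is one possible reading, following the prose description: values are kept in C_A until the first jump of at least ε between neighbours, the value just before the jump is assumed to stay in C_A, and everything after the jump goes to C_B. The function name `split_sequence` is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_sequence(spacings: Sequence[float], eps: float) -> Tuple[List[float], List[float]]:
    """Return (C_A, C_B): small values before the first jump, the rest after."""
    seq = sorted(spacings)  # step 4-1: ascending order
    for i in range(len(seq) - 1):
        if abs(seq[i + 1] - seq[i]) >= eps:
            # First jump found: seq[i + 1] and all later values form C_B.
            return seq[: i + 1], seq[i + 1:]
    # No jump of size eps anywhere: every value ends up in C_A.
    return seq, []


c_a, c_b = split_sequence([5.0, 0.2, 6.0, 0.1, 5.5], eps=1.0)
print(c_a, c_b)  # [0.1, 0.2] [5.0, 5.5, 6.0]
```

When no neighbouring values differ by ε or more, C_B is empty and C_A holds the whole sequence, which step (5) then treats as "no outliers" for any δ smaller than the data set size.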

Step (5) above determines the outliers as follows:

Examine the class C_A obtained in step (4). If the number of values in C_A is greater than a threshold δ, no outlier has been detected in the large-scale high-dimensional data; otherwise, the points corresponding to all values in C_A are outliers, where δ is set by the user.
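The decision rule in step (5) reduces to a size check on C_A. A minimal sketch, assuming C_A is represented as a list of point indices (a representation chosen here for illustration) and using the hypothetical name `decide_outliers`:

```python
from typing import List, Sequence


def decide_outliers(c_a_indices: Sequence[int], delta: int) -> List[int]:
    """Return the outlier indices, or an empty list if C_A is too large."""
    if len(c_a_indices) > delta:
        # Too many "small spacing" points: treated as no outliers detected.
        return []
    return list(c_a_indices)


print(decide_outliers([7, 2], delta=3))        # [7, 2]
print(decide_outliers([0, 1, 2, 3], delta=3))  # []
```

The threshold δ caps how many points may be reported at once, which guards against a split that places a large share of the data set in C_A.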

Compared with the prior art, the present invention has notable beneficial effects and the following advantages:

The method for detecting outlier data in large-scale high-dimensional data provided by the present invention is based on the vector-angle cosine distance. It effectively overcomes the "curse of dimensionality" that affects outlier detection methods based on high-dimensional distance or nearest neighbors, and it can be widely applied to high-dimensional data in areas such as credit card fraud detection, abnormal behavior detection in video surveillance, and network traffic intrusion detection.

Brief description of the drawings

Fig. 1 is a flowchart of the method for detecting outlier data in large-scale high-dimensional data according to the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawing and an embodiment:

As shown in Fig. 1, the method for detecting outlier data in large-scale high-dimensional data according to the present invention comprises the following steps:

1) Calculate the mean cosine distance of each data point in the large-scale high-dimensional data; that is, for each data point A, calculate the mean of the cosine distances between the vectors \overline{AB} and \overline{AC} formed from A to every pair of other points B and C.

To obtain the mean cosine distance of each data point, a formal description of the large-scale high-dimensional data, a method for computing the vector-angle cosine distance, and a method for computing the mean cosine distance of a data point are needed. These are given below:

1-1) Formalize the data set. The large-scale high-dimensional data can be formalized as follows:

For a given large-scale high-dimensional data set D ⊆ R^d, the norm ||·|| is defined as R^d → R⁺ and the inner product <·,·> is defined as R^d × R^d → R.

For points A, B ∈ D, \overline{AB} denotes the vector from A to B.

Here R^d denotes the d-dimensional real space, R⁺ denotes the positive real numbers, R^d → R⁺ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and R^d × R^d → R denotes the inner product operation on two vectors of the d-dimensional real space.

1-2) For every point A in the large-scale high-dimensional data set D, calculate the sum of the vector-angle cosine distances from A to every two other points, denoted M_θ(A). The formula is:

For all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A, B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}

where <\overline{AB}, \overline{AC}> denotes the inner product of the vectors \overline{AB} and \overline{AC}, and ||\overline{AB}|| and ||\overline{AC}|| denote the norms of the vectors \overline{AB} and \overline{AC} respectively.

1-3) Calculate the mean cosine distance of each point A in the large-scale high-dimensional data set D, denoted \overline{M_θ(A)}. The formula is:

For all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

\overline{M_{\theta}(A)} = \frac{M_{\theta}(A)}{\frac{1}{2}(n - 1)(n - 2)} = \frac{2 M_{\theta}(A)}{(n - 1)(n - 2)},

where n is the number of data points in the large-scale high-dimensional data set D.
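As a quick check of the denominator, consider a worked instance with an assumed data set size of n = 5 (a value chosen here purely for illustration). Each point A then has four other points, which form six unordered pairs {B, C}:

```latex
\frac{1}{2}(n - 1)(n - 2) = \frac{1}{2} \cdot 4 \cdot 3 = 6,
\qquad
\overline{M_{\theta}(A)} = \frac{2\, M_{\theta}(A)}{4 \cdot 3} = \frac{M_{\theta}(A)}{6}
```

The mean is therefore the sum M_θ(A) divided by the number of distinct pairs of other points.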

2) Calculate the cosine distances of each data point A; that is, for each data point A, calculate the cosine distance M_θ(\overline{BAC}) between the vectors \overline{AB} and \overline{AC} formed from A to any two other points B and C. The formula is:

For all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

M_{\theta}(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}.

3) Calculate the average spacing ΔM_θ(A) of all cosine distances of each data point A; that is, accumulate the absolute value of the difference between each cosine distance M_θ(\overline{BAC}) obtained in step 2) and the mean cosine distance \overline{M_θ(A)} obtained in step 1). The formula is:

\Delta M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A, B\}} \left| M_{\theta}(\overline{BAC}) - \overline{M_{\theta}(A)} \right|.

4) Classify the cosine distance average spacings and select the few points with the smallest average spacing as the points with the highest degree of outlierness. This comprises the following steps:

4-1) Sort the cosine distance average spacings of all points obtained in step 3) in ascending order to obtain the average spacing sequence L.

Because the average spacing of an outlier in high-dimensional data is small, the sequence L has the following characteristic: a small portion of its values are small, while the large majority of its values are large.

4-2) Divide the sequence L into two classes, C_A and C_B, where C_A is the class with the smaller values and C_B is the class with the larger values.

The classification algorithm is as follows: compare each pair of adjacent values in the sequence L in turn; if the change in value is greater than a threshold ε, assign that value and all values after it to class C_B, where ε can be chosen by the user. That is,

C_A = Φ, C_B = L

If d = |l_{i+1} − l_i| < ε, then C_A = C_A ∪ {l_i};

otherwise, C_B = C_B \ {l_i},

where l_i denotes the i-th value in the average spacing sequence L and Φ denotes the empty set.

5) Determine the outliers as follows:

Examine the class C_A obtained in step 4). If the number of values in C_A is greater than a threshold δ, no outlier has been detected in the large-scale high-dimensional data; otherwise, the points corresponding to all values in C_A are outliers, where δ can be set by the user.
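Steps 1) to 5) of the embodiment can be chained into one small end-to-end sketch. It is an illustration under stated assumptions, not the patented implementation: the function names, the toy data set and the values of ε and δ are all chosen here for demonstration, \overline{AB} is taken as B − A, the points are assumed distinct, pairs {B, C} are treated as unordered, and the value just before the first jump of at least ε is assumed to stay in C_A.

```python
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def average_spacing(i: int, data: Sequence[Point]) -> float:
    """Steps 1) to 3): sum of |M_theta(BAC) - mean| for A = data[i]."""
    a = data[i]
    others = [p for j, p in enumerate(data) if j != i]
    values = []
    for b, c in combinations(others, 2):
        ab = [bj - aj for aj, bj in zip(a, b)]
        ac = [cj - aj for aj, cj in zip(a, c)]
        dot = sum(x * y for x, y in zip(ab, ac))
        values.append(dot / (sum(x * x for x in ab) * sum(x * x for x in ac)))
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values)


def detect_outliers(data: Sequence[Point], eps: float, delta: int) -> List[int]:
    """Return the indices of the detected outliers (possibly empty)."""
    spacings = [average_spacing(i, data) for i in range(len(data))]
    order = sorted(range(len(data)), key=lambda i: spacings[i])  # step 4-1)
    cut = len(order)
    for k in range(len(order) - 1):                              # step 4-2)
        if abs(spacings[order[k + 1]] - spacings[order[k]]) >= eps:
            cut = k + 1
            break
    c_a = order[:cut]
    return [] if len(c_a) > delta else c_a                       # step 5)


# Four corners of the unit square plus one distant point at index 4.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
print(detect_outliers(data, eps=0.5, delta=2))  # [4]
```

In this toy set the distant point has an average spacing of roughly 0.001, while the four square corners lie between about 1.2 and 1.5, so the first jump in the sorted sequence isolates index 4. The sketch visits every pair of other points for every point, so its cost grows cubically with the number of points.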

Claims

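1. A method for detecting outlier data in large-scale high-dimensional data, characterized by comprising the following steps:

(1) calculating the mean cosine distance of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean of the cosine distances between the vectors \overline{AB} and \overline{AC} formed from A to every pair of other points B and C;

(2) calculating the cosine distances of each data point A;

(3) calculating the average spacing of all cosine distances of each data point A;

(4) classifying the cosine distance average spacings, and selecting the few points with the smallest average spacing as the points with the highest degree of outlierness;

(5) determining the outliers.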

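2. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that step (1) comprises the following steps:

1-1) formalizing the data set, the large-scale high-dimensional data being formalized as follows:

for a given large-scale high-dimensional data set D ⊆ R^d, the norm ||·|| is defined as R^d → R⁺ and the inner product <·,·> is defined as R^d × R^d → R;

for points A, B ∈ D, \overline{AB} denotes the vector from A to B;

1-2) for every point A in the large-scale high-dimensional data set D, calculating the sum of the vector-angle cosine distances from A to every two other points, denoted M_θ(A), by the formula:

for all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A, B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}

1-3) calculating the mean cosine distance of each point A in the large-scale high-dimensional data set D, denoted \overline{M_θ(A)}, by the formula:

for all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

\overline{M_{\theta}(A)} = \frac{M_{\theta}(A)}{\frac{1}{2}(n - 1)(n - 2)} = \frac{2 M_{\theta}(A)}{(n - 1)(n - 2)}.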

3. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that step (2) calculates the cosine distances of data point A, that is, for each data point A, calculates the cosine distance M_θ(\overline{BAC}) between the vectors \overline{AB} and \overline{AC} formed from A to any two points B and C, by the formula:

for all A ∈ D, B ∈ D, C ∈ D, with B ∈ D \ {A} and C ∈ D \ {A, B}:

M_{\theta}(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}.

4. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that step (3) calculates the average spacing ΔM_θ(A) of all cosine distances of each data point A, that is, accumulates the absolute value of the difference between each cosine distance M_θ(\overline{BAC}) obtained in step (2) and the mean cosine distance \overline{M_θ(A)} obtained in step (1), by the formula:

\Delta M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A, B\}} \left| M_{\theta}(\overline{BAC}) - \overline{M_{\theta}(A)} \right|.

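5. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that step (4) comprises the following steps:

4-1) sorting the cosine distance average spacings of all points obtained in step (3) in ascending order to obtain the average spacing sequence L;

4-2) dividing the average spacing sequence L into two classes, C_A and C_B,

\forall l_{i} \in L, \quad C_{A} = \Phi, \quad C_{B} = L

if d = |l_{i+1} − l_i| < ε, then C_A = C_A ∪ {l_i};

otherwise, C_B = C_B \ {l_i},

where l_i denotes the i-th value in the average spacing sequence L, Φ denotes the empty set, and ε is a threshold chosen by the user.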

6. The method for detecting outlier data in large-scale high-dimensional data according to claim 5, characterized in that step (5) determines the outliers as follows: