CN105160347A - Method for detecting outlier data of large-scale high dimension data - Google Patents
- Publication number
- CN105160347A (application number CN201510393861.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- outlier
- overbar
- high dimensional
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for detecting outlier data in large-scale high-dimensional data and belongs to the technical field of outlier data mining. The method comprises the following steps: (1) calculating the cosine distance mean value of each data point; (2) calculating the cosine distances of each data point; (3) calculating the cosine distance average spacing of each data point; (4) classifying the cosine distance average spacings and selecting the points with the smallest average spacing as the points with the largest outlier degree; and (5) determining the outlier data. With this method, outlier data hidden in large-scale high-dimensional data can be discovered rapidly and efficiently.
Description
Technical field
The present invention relates to the technical field of outlier data mining, and in particular to a method for detecting outlier data in large-scale high-dimensional data.
Background art
Outlier data mining is one of the current research hotspots in data mining and is widely used in fields such as network traffic intrusion detection, credit card fraud detection, and abnormal behavior detection in video surveillance. Existing outlier mining methods are mainly based on distance or nearest-neighbor concepts. In high-dimensional data, however, if the neighbors of a data point are examined using distance and nearest-neighbor concepts in the high-dimensional space, most of the data end up being judged as outliers. If detection is instead based on the cosine distance between vectors, the outliers hidden in high-dimensional data can be found: the angles between the vectors formed from an outlier to the other points vary little, whereas a non-outlier is surrounded by data points, so the angles between the vectors it forms to the other points vary widely. Outliers hidden in high-dimensional data can therefore be identified from the magnitude of this angle variation.
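The angle argument above can be checked on a toy example. The following Python sketch is illustrative only: the six 2-D points, the helper names `cosine` and `cosine_range`, and the use of the max-minus-min range as a stand-in for "angle variation" are assumptions made here, not part of the method described in this document.

```python
import math
from itertools import combinations

def cosine(a, b, c):
    """Cosine of the angle at point a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm_ab = math.sqrt(sum(p * p for p in ab))
    norm_ac = math.sqrt(sum(q * q for q in ac))
    return dot / (norm_ab * norm_ac)  # assumes b and c both differ from a

def cosine_range(points, i):
    """Spread (max - min) of the cosines seen from points[i] over all pairs of other points."""
    others = [p for j, p in enumerate(points) if j != i]
    values = [cosine(points[i], b, c) for b, c in combinations(others, 2)]
    return max(values) - min(values)

# Hypothetical data: five clustered points plus one point far away from them.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
far_spread = cosine_range(points, 5)    # point far from the cluster
inner_spread = cosine_range(points, 4)  # point surrounded by the cluster
print(far_spread, inner_spread)
```

Seen from the distant point, all other points lie in nearly the same direction, so its cosines stay close to 1; seen from the surrounded point, the cosines cover the whole range from -1 to 1.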
Summary of the invention
The present invention proposes a method for detecting outlier data in large-scale high-dimensional data. It can find the outlier data hidden in large-scale high-dimensional data rapidly and efficiently, and can be widely applied to high-dimensional data in credit card fraud detection, abnormal behavior detection in video surveillance, network traffic intrusion detection, and similar fields.
To achieve the above object, the technical solution adopted by the present invention is:
A method for detecting outlier data in large-scale high-dimensional data, comprising the following steps:
(1) calculating the cosine distance mean value of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean value of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;
(2) calculating the cosine distances of each data point A;
(3) calculating the average spacing of all cosine distances of each data point A;
(4) classifying the cosine distance average spacings, and selecting the several points with the smallest cosine distance average spacing as the points with the largest outlier degree;
(5) determining the outliers.
The aforesaid step (1) comprises the following steps:
1-1) formalizing the data set; the large-scale high-dimensional data are formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq \mathbb{R}^d$, the norm $\|\cdot\|$ is defined as $\mathbb{R}^d \to \mathbb{R}^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. For points $A, B \in D$, $\overline{AB}$ denotes the vector $B - A$.
Here $\mathbb{R}^d$ denotes the d-dimensional real space, $\mathbb{R}^+$ denotes the positive real numbers, $\mathbb{R}^d \to \mathbb{R}^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ denotes the inner product operation on two vectors of the d-dimensional real space;
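1-2) for all points in the large-scale high-dimensional data set D, calculating for each point A the sum of the cosine distances of the angles between the vectors from A to every two other points, denoted $M_\theta(A)$, with the formula:
$$M_\theta(A) = \sum_{B} \sum_{C} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
$B \in D$, $C \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$,
where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$, respectively;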
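1-3) calculating the mean value $\overline{M_\theta}(A)$ of the cosine distances of each point A in the large-scale high-dimensional data set D, with the formula:
$$\overline{M_\theta}(A) = \frac{1}{(n-1)(n-2)} \sum_{B} \sum_{C} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
$B \in D$, $C \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$,
where n denotes the number of data points in D, so that $(n-1)(n-2)$ is the number of ordered pairs (B, C) in the sum.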
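The aforesaid step (2) calculates the cosine distances of the data point A, that is, for each data point A, the cosine distance $\cos(\overline{AB}, \overline{AC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two points B and C, with the formula:
$$\cos(\overline{AB}, \overline{AC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
$B \in D$, $C \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$.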
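The aforesaid step (3) calculates the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A, that is, it accumulates the absolute value of the difference between each cosine distance $\cos(\overline{AB}, \overline{AC})$ obtained in step (2) and the cosine distance mean value $\overline{M_\theta}(A)$ obtained in step (1), with the formula:
$$\Delta M_\theta(A) = \sum_{B \in D \setminus \{A\}} \sum_{C \in D \setminus \{A, B\}} \left| \cos(\overline{AB}, \overline{AC}) - \overline{M_\theta}(A) \right|.$$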
The aforesaid step (4) comprises the following steps:
4-1) sorting the cosine distance average spacings of all points obtained in step (3) in ascending order to obtain the average spacing sequence L;
4-2) dividing the average spacing sequence L into 2 classes, $C_A$ and $C_B$.
The classification algorithm is: compare successive data in the average spacing sequence L in order; if the change in value is greater than a certain threshold $\varepsilon$, that datum and all data after it are assigned to class $C_B$, where $\varepsilon$ is determined by the user, namely:
$C_A = \Phi$, $C_B = L$
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$
Otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average spacing sequence L and $\Phi$ denotes the empty set.
The aforesaid step (5) determines the outliers, specifically:
Check the class $C_A$ obtained in step (4): if the number of data in $C_A$ is greater than a certain threshold $\delta$, no outlier is detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in $C_A$ are outliers, where $\delta$ is set by the user.
Compared with the prior art, the present invention has positive and evident effects. It has the following advantage:
The method for detecting outlier data in large-scale high-dimensional data provided by the invention is based on the cosine distance of vector angles and can effectively overcome the "curse of dimensionality" that affects outlier detection methods based on high-dimensional distance and nearest neighbors. The invention can be widely applied to high-dimensional data in credit card fraud detection, abnormal behavior detection in video surveillance, network traffic intrusion detection, and similar fields.
Brief description of the drawings
Fig. 1 is the flow chart of the method for detecting outlier data in large-scale high-dimensional data according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing and an embodiment:
As shown in Fig. 1, the method for detecting outlier data in large-scale high-dimensional data according to the present invention comprises the following steps:
1) calculating the cosine distance mean value of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean value of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;
To obtain the cosine distance mean value of each data point, a formal description of the large-scale high-dimensional data, the cosine distance of vector angles, and the method for computing the cosine distance mean value of a data point are needed. They are given as follows:
1-1) formalizing the data set; the large-scale high-dimensional data can be formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq \mathbb{R}^d$, the norm $\|\cdot\|$ is defined as $\mathbb{R}^d \to \mathbb{R}^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. For points $A, B \in D$, $\overline{AB}$ denotes the vector $B - A$.
Here $\mathbb{R}^d$ denotes the d-dimensional real space, $\mathbb{R}^+$ denotes the positive real numbers, $\mathbb{R}^d \to \mathbb{R}^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ denotes the inner product operation on two vectors of the d-dimensional real space.
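1-2) for all points in the large-scale high-dimensional data set D, calculating for each point A the sum of the cosine distances of the angles between the vectors from A to every two other points, denoted $M_\theta(A)$, with the formula:
$$M_\theta(A) = \sum_{B} \sum_{C} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
$B \in D$, $C \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$,
where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$, respectively.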
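1-3) calculating the mean value $\overline{M_\theta}(A)$ of the cosine distances of each point A in the large-scale high-dimensional data set D, with the formula:
$$\overline{M_\theta}(A) = \frac{1}{(n-1)(n-2)} \sum_{B} \sum_{C} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
$B \in D$, $C \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$,
where n denotes the number of data points in the large-scale high-dimensional data set D, so that $(n-1)(n-2)$ is the number of ordered pairs (B, C) in the sum.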
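Steps 1-2) and 1-3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, points are plain tuples of floats, all points are assumed distinct (so no vector has zero norm), and the mean is taken over the ordered pairs (B, C) with B ≠ C, which is one reading of the index sets given above.

```python
import math
from itertools import permutations

def cosine(a, b, c):
    """Cosine of the angle at a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm_ab = math.sqrt(sum(p * p for p in ab))
    norm_ac = math.sqrt(sum(q * q for q in ac))
    return dot / (norm_ab * norm_ac)

def cosine_sum_and_mean(points, i):
    """Return (sum, mean) of the cosines at points[i] over ordered pairs of other points."""
    others = [p for j, p in enumerate(points) if j != i]
    total = sum(cosine(points[i], b, c) for b, c in permutations(others, 2))
    pair_count = len(others) * (len(others) - 1)  # (n - 1)(n - 2) ordered pairs
    return total, total / pair_count

# Hypothetical data: the corners of a unit square, evaluated at the origin.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
total, mean = cosine_sum_and_mean(square, 0)
print(total, mean)
```

Enumerating unordered pairs instead would halve the sum but leave the mean unchanged, since the cosine is symmetric in B and C.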
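2) calculating the cosine distances of each data point A, that is, for each data point A, the cosine distance $\cos(\overline{AB}, \overline{AC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two other points B and C, with the formula:
$$\cos(\overline{AB}, \overline{AC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
$B \in D$, $C \in D$, and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$.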
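Step 2) keeps the individual cosine values rather than only their sum. A minimal Python sketch, assuming distinct points stored as tuples and using the hypothetical name `pairwise_cosines`, could tabulate them by index pair:

```python
import math
from itertools import permutations

def cosine(a, b, c):
    """Cosine of the angle at a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm_ab = math.sqrt(sum(p * p for p in ab))
    norm_ac = math.sqrt(sum(q * q for q in ac))
    return dot / (norm_ab * norm_ac)

def pairwise_cosines(points, i):
    """Map every ordered index pair (j, k), with j != k and both != i, to the cosine at points[i]."""
    idx = [j for j in range(len(points)) if j != i]
    return {(j, k): cosine(points[i], points[j], points[k])
            for j, k in permutations(idx, 2)}

# Hypothetical data: two points on the x-axis and one on the y-axis, seen from the origin.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0)]
table = pairwise_cosines(pts, 0)
print(table[(1, 3)], table[(1, 2)])
```

The two points on the same axis give a cosine of 1 (same direction), and points on perpendicular axes give 0.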
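3) calculating the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A, that is, accumulating the absolute value of the difference between each cosine distance $\cos(\overline{AB}, \overline{AC})$ obtained in step 2) and the cosine distance mean value $\overline{M_\theta}(A)$ obtained in step 1), with the formula:
$$\Delta M_\theta(A) = \sum_{B \in D \setminus \{A\}} \sum_{C \in D \setminus \{A, B\}} \left| \cos(\overline{AB}, \overline{AC}) - \overline{M_\theta}(A) \right|.$$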
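The spacing of step 3) can be sketched as below. The sketch assumes that the absolute deviations are accumulated without a further division by the number of pairs; since every point has the same number of pairs, that choice does not affect the ranking. The function names and the toy data are hypothetical.

```python
import math
from itertools import permutations

def cosine(a, b, c):
    """Cosine of the angle at a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm_ab = math.sqrt(sum(p * p for p in ab))
    norm_ac = math.sqrt(sum(q * q for q in ac))
    return dot / (norm_ab * norm_ac)

def average_spacing(points, i):
    """Accumulated absolute deviation of the cosines at points[i] from their mean."""
    idx = [j for j in range(len(points)) if j != i]
    cosines = [cosine(points[i], points[j], points[k])
               for j, k in permutations(idx, 2)]
    mean = sum(cosines) / len(cosines)
    return sum(abs(value - mean) for value in cosines)

# Hypothetical data: a small cluster plus one distant point (index 5).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
spacings = [average_spacing(points, i) for i in range(len(points))]
smallest = min(range(len(points)), key=lambda i: spacings[i])
print(smallest, spacings[smallest])
```

On this toy data the distant point has by far the smallest spacing, which matches the premise that outliers see the remaining points under nearly constant angles.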
4) classifying the cosine distance average spacings, and selecting the several points with the smallest cosine distance average spacing as the points with the largest outlier degree, comprising the following steps:
4-1) sorting the cosine distance average spacings of all points obtained in step 3) in ascending order to obtain the average spacing sequence L,
where, because the average spacing of outliers in high-dimensional data is small, the sequence L has the following characteristic: a small portion of the data have small values, while most of the other data have large values;
4-2) dividing the data sequence L into 2 classes, $C_A$ and $C_B$, where $C_A$ is the class with the smaller values and $C_B$ is the class with the larger values.
The classification algorithm is: compare successive data in the data sequence L in order; if the change in value is greater than a certain threshold $\varepsilon$, that datum and all data after it are assigned to class $C_B$, where $\varepsilon$ can be determined by the user, namely:
$C_A = \Phi$, $C_B = L$
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$
Otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average spacing sequence L and $\Phi$ denotes the empty set.
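The division of step 4-2) can be sketched in Python as follows. The sketch follows the prose rule (cut at the first jump of at least ε) and makes two assumptions that the text leaves open: the datum just before the jump stays in the small class, and if no jump is found the whole sequence ends up in the small class. The name `split_by_gap` is hypothetical.

```python
def split_by_gap(sorted_values, eps):
    """Split an ascending sequence at the first gap >= eps.

    Returns (c_a, c_b): c_a holds the values before the jump, c_b the rest.
    If no gap reaches eps, c_a is the whole sequence and c_b is empty.
    """
    for i in range(len(sorted_values) - 1):
        if abs(sorted_values[i + 1] - sorted_values[i]) >= eps:
            return list(sorted_values[: i + 1]), list(sorted_values[i + 1:])
    return list(sorted_values), []

# Hypothetical spacing sequence: two small values, then a clear jump.
c_a, c_b = split_by_gap([0.1, 0.12, 3.0, 3.3, 3.5], eps=1.0)
print(c_a, c_b)
```

Only the first jump matters here; later gaps inside the large class are ignored, in line with "that datum and all data after it" being assigned to the large class.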
5) determining the outliers, specifically:
Check the class $C_A$ obtained in step 4): if the number of data in $C_A$ is greater than a certain threshold $\delta$, no outlier is detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in $C_A$ are outliers, where $\delta$ can be set by the user.
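Putting steps 1) to 5) together, an end-to-end sketch might look as follows. It is a plain enumeration under stated assumptions rather than a definitive implementation: points are distinct tuples of floats, there are at least three of them, the spacing is the accumulated absolute deviation, the cut is placed at the first gap of at least ε, and all function names, the toy data, and the values of `eps` and `delta` are hypothetical.

```python
import math
from itertools import permutations

def cosine(a, b, c):
    """Cosine of the angle at a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm_ab = math.sqrt(sum(p * p for p in ab))
    norm_ac = math.sqrt(sum(q * q for q in ac))
    return dot / (norm_ab * norm_ac)

def average_spacing(points, i):
    """Steps 1) to 3): accumulated absolute deviation of the cosines at points[i]."""
    idx = [j for j in range(len(points)) if j != i]
    cosines = [cosine(points[i], points[j], points[k])
               for j, k in permutations(idx, 2)]
    mean = sum(cosines) / len(cosines)
    return sum(abs(value - mean) for value in cosines)

def detect_outliers(points, eps, delta):
    """Return the indices of the points reported as outliers (possibly an empty list)."""
    spacing = [average_spacing(points, i) for i in range(len(points))]
    # Step 4-1): order the point indices by ascending spacing.
    order = sorted(range(len(points)), key=lambda i: spacing[i])
    # Step 4-2): cut at the first jump >= eps; with no jump, every point is in C_A.
    cut = len(order)
    for pos in range(len(order) - 1):
        if spacing[order[pos + 1]] - spacing[order[pos]] >= eps:
            cut = pos + 1
            break
    c_a = order[:cut]
    # Step 5): a small class larger than delta means no outlier is reported.
    return [] if len(c_a) > delta else sorted(c_a)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
print(detect_outliers(points, eps=1.0, delta=2))
```

This direct enumeration evaluates on the order of n³ cosines for n points, so it is meant to show the logic of the steps, not to be fast on large data sets.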
Claims (6)
1. A method for detecting outlier data in large-scale high-dimensional data, characterized by comprising the following steps:
(1) calculating the cosine distance mean value of each data point in the large-scale high-dimensional data, that is, for each data point A, calculating the mean value of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;
(2) calculating the cosine distances of each data point A;
(3) calculating the average spacing of all cosine distances of each data point A;
(4) classifying the cosine distance average spacings, and selecting the several points with the smallest cosine distance average spacing as the points with the largest outlier degree;
(5) determining the outliers.
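2. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (1) comprises the following steps:
1-1) formalizing the data set; the large-scale high-dimensional data are formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq \mathbb{R}^d$, the norm $\|\cdot\|$ is defined as $\mathbb{R}^d \to \mathbb{R}^+$ and the inner product $\langle \cdot, \cdot \rangle$ is defined as $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. For points $A, B \in D$, $\overline{AB}$ denotes the vector $B - A$.
Here $\mathbb{R}^d$ denotes the d-dimensional real space, $\mathbb{R}^+$ denotes the positive real numbers, $\mathbb{R}^d \to \mathbb{R}^+$ denotes a mapping from elements of the d-dimensional real space to the positive real numbers, and $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ denotes the inner product operation on two vectors of the d-dimensional real space;
1-2) for all points in the large-scale high-dimensional data set D, calculating for each point A the sum of the cosine distances of the angles between the vectors from A to every two other points, denoted $M_\theta(A)$, with the formula:
$$M_\theta(A) = \sum_{B} \sum_{C} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$,
where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$, respectively;
1-3) calculating the mean value $\overline{M_\theta}(A)$ of the cosine distances of each point A in the large-scale high-dimensional data set D, with the formula:
$$\overline{M_\theta}(A) = \frac{1}{(n-1)(n-2)} \sum_{B} \sum_{C} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$, where n denotes the number of data points in D.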
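3. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (2) calculates the cosine distances of the data point A, that is, for each data point A, the cosine distance $\cos(\overline{AB}, \overline{AC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two points B and C, with the formula:
$$\cos(\overline{AB}, \overline{AC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\| \cdot \|\overline{AC}\|},$$
and $B \in D \setminus \{A\}$, $C \in D \setminus \{A, B\}$.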
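4. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (3) calculates the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A, that is, it accumulates the absolute value of the difference between each cosine distance $\cos(\overline{AB}, \overline{AC})$ obtained in step (2) and the cosine distance mean value $\overline{M_\theta}(A)$ obtained in step (1), with the formula:
$$\Delta M_\theta(A) = \sum_{B \in D \setminus \{A\}} \sum_{C \in D \setminus \{A, B\}} \left| \cos(\overline{AB}, \overline{AC}) - \overline{M_\theta}(A) \right|.$$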
5. The method for detecting outlier data in large-scale high-dimensional data according to claim 1, characterized in that said step (4) comprises the following steps:
4-1) sorting the cosine distance average spacings of all points obtained in said step (3) in ascending order to obtain the average spacing sequence L;
4-2) dividing the average spacing sequence L into 2 classes, $C_A$ and $C_B$.
The classification algorithm is: compare successive data in the average spacing sequence L in order; if the change in value is greater than a certain threshold $\varepsilon$, that datum and all data after it are assigned to class $C_B$, where $\varepsilon$ is determined by the user, namely:
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$
Otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th datum in the average spacing sequence L and $\Phi$ denotes the empty set.
6. The method for detecting outlier data in large-scale high-dimensional data according to claim 5, characterized in that said step (5) determines the outliers, specifically:
Check the class $C_A$ obtained in said step (4): if the number of data in $C_A$ is greater than a certain threshold $\delta$, no outlier is detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in $C_A$ are outliers, where $\delta$ is set by the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510393861.5A CN105160347A (en) | 2015-07-07 | 2015-07-07 | Method for detecting outlier data of large-scale high dimension data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510393861.5A CN105160347A (en) | 2015-07-07 | 2015-07-07 | Method for detecting outlier data of large-scale high dimension data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105160347A true CN105160347A (en) | 2015-12-16 |
Family
ID=54801199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510393861.5A Pending CN105160347A (en) | 2015-07-07 | 2015-07-07 | Method for detecting outlier data of large-scale high dimension data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105160347A (en) |
- 2015-07-07 CN CN201510393861.5A patent/CN105160347A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951353A (en) * | 2017-03-20 | 2017-07-14 | 北京搜狐新媒体信息技术有限公司 | Work data method for detecting abnormality and device |
CN106951353B (en) * | 2017-03-20 | 2020-05-22 | 北京搜狐新媒体信息技术有限公司 | Method and device for detecting abnormality of operation data |
CN110377798A (en) * | 2019-06-12 | 2019-10-25 | 成都理工大学 | Outlier detection method based on angle entropy |
CN110377798B (en) * | 2019-06-12 | 2022-10-21 | 成都理工大学 | Outlier detection method based on angle entropy |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103048041B (en) | Fault diagnosis method of electromechanical system based on local tangent space and support vector machine | |
Kim et al. | Structural recurrent neural network for traffic speed prediction | |
CN104657746A (en) | Anomaly detection method based on vehicle trajectory similarity | |
CN108292369A (en) | Visual identity is carried out using deep learning attribute | |
Huang et al. | Network traffic anomaly detection based on growing hierarchical SOM | |
CN106650297A (en) | Satellite subsystem anomaly detection method without domain knowledge | |
CN102542295A (en) | Method for detecting landslip from remotely sensed image by adopting image classification technology | |
Hussain et al. | A novel unsupervised feature‐based approach for electricity theft detection using robust PCA and outlier removal clustering algorithm | |
CN105574642A (en) | Smart grid big data-based electricity price execution checking method | |
CN102663431A (en) | Image matching calculation method on basis of region weighting | |
CN104807589A (en) | Online identification method for gas-liquid two-phase-flow flow pattern in gathering and transportation-vertical pipe system | |
Dai et al. | Complexity–entropy causality plane based on power spectral entropy for complex time series | |
CN105046275A (en) | Large-scale high-dimensional outlier data detection method based on angle variance | |
CN104881676A (en) | Face image convex-and-concave pattern texture feature extraction and recognition method | |
CN103218617A (en) | Multi-linear large space feature extraction method | |
CN103365999A (en) | Text clustering integrated method based on similarity degree matrix spectral factorization | |
CN103034869A (en) | Part maintaining projection method of adjacent field self-adaption | |
CN105160347A (en) | Method for detecting outlier data of large-scale high dimension data | |
CN104618175A (en) | Network abnormity detection method | |
Cheng et al. | Energy theft detection in an edge data center using deep learning | |
CN103646234A (en) | Face identification method based on LGBPH features | |
CN102982342A (en) | Positive semidefinite spectral clustering method based on Lagrange dual | |
CN103258134A (en) | Dimension reduction processing method of high-dimensional vibration signals | |
CN102346830A (en) | Gradient histogram-based virus detection method | |
Martí et al. | YASA: yet another time series segmentation algorithm for anomaly detection in big data problems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20151216 |