CN107545273A - A density-based local outlier detection method - Google Patents

A density-based local outlier detection method

Info

Publication number
CN107545273A
Authority
CN
China
Prior art keywords
neighborhood
density
represented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710559390.XA
Other languages
Chinese (zh)
Inventor
肖利民
苏书宾
阮利
何振学
张周杰
李书攀
刘玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201710559390.XA
Publication of CN107545273A

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention provides a density-based local outlier detection method that fully accounts for how dispersed an object and its neighborhood objects are. Compared with conventional algorithms, the outlier values obtained by the invention are more sensitive to the degree of abnormality in scattered data sets, and the detection results are more accurate. The method comprises the following steps: (1) normalize the attributes of the data set; (2) search for the k nearest neighbors of each object under test; (3) compute the mean of the distances between the object and its neighborhood objects, recorded as the k-neighborhood distance of the object; (4) globally normalize the k-neighborhood distances; (5) compute the variance of the distances between the object and its neighborhood objects, recorded as the k-neighborhood variance of the object; (6) globally normalize the k-neighborhood variances; (7) compute the neighborhood dispersion of the object; (8) compute the neighborhood density of the object; (9) compute the local outlier factor of the object; (10) take the objects with the largest outlier values as the outliers.

Description

A density-based local outlier detection method
Technical field
The present invention relates to a density-based local outlier detection method and belongs to the field of computer science and technology.
Background
Anomaly detection is one of the basic tasks of data mining; its purpose is to eliminate noise or to discover potentially meaningful knowledge. Since the 1980s, research on anomaly detection has gone through several cycles of rise and decline. In recent years, the development of information technology, the pull of practical demand in application fields, and rapid advances in sensor technology have made it easy to acquire large volumes of data. As a result, anomaly detection has again become an active branch of information science, attracting wide attention in data streams, data mining, machine learning, statistics, and other fields, with broad application prospects. Today, anomaly detection is widely used in intrusion detection, fraud detection, industrial damage inspection, health-care monitoring, and similar areas.
The earliest outlier detection algorithms operated on the whole data set and produced a set of global outliers. In many practical applications, however, the acquired data are incomplete, and users often care only about local instability. In contrast to global outliers, anomalous objects identified through local analysis are called local outliers. Since the local outlier factor (LOF) was proposed, many local outlier detection methods have appeared. Detecting local outliers requires solving two subproblems: determining the local neighborhood, and computing how an object compares with its neighborhood. Existing algorithms improve and extend LOF from different angles and achieve good detection results on data sets with certain specific distributions. However, existing local outlier detection algorithms consider only the degree to which an object deviates from its neighborhood objects as a whole and ignore the overall dispersion among them. Consequently, when outlier detection is performed on scattered data sets, the accuracy of these algorithms suffers severely.
Summary of the invention
To solve the above problem, the invention defines the k-neighborhood dispersion of an object using the expectation and variance of the distances between the object and its neighborhood data, redefines the local outlier factor in terms of the k-neighborhood dispersion, and proposes a new local outlier detection method. Compared with conventional methods, the outlier values obtained by this method are more sensitive to the degree of abnormality in scattered data sets, and the detection results are more accurate.
Specifically, the invention provides a density-based local outlier detection method, the method comprising:
Step 1, normalize each attribute of the data set;
Step 2, search for the k nearest neighbors of each object;
Step 3, compute the mean of the distances between each object and its neighborhood objects, recorded as the k-neighborhood distance of the object;
Step 4, globally normalize the k-neighborhood distance of each object;
Step 5, compute the variance of the distances between each object and its neighborhood objects, recorded as the k-neighborhood variance of the object;
Step 6, globally normalize the k-neighborhood variance of each object;
Step 7, compute the k-neighborhood dispersion of each object;
Step 8, compute the k-neighborhood density of each object;
Step 9, compute the local outlier factor of each object;
Step 10, determine the outlier objects of the data set;
Wherein, the attribute normalization of step 1 can be expressed as:
$a_{ji}$ denotes the value in the i-th dimension of the j-th data object in the data set.
Wherein, k in step 2 is a threshold given in advance.
Wherein, the neighborhood dispersion $N_{k\text{-disp}}(o)$ of step 7 can be expressed as:
$$N_{k\text{-disp}}(o) = \left(Nn_{k\text{-adist}}(o) + 1\right)^{N_{k\text{-adist}}(o) \cdot Nn_{k\text{-vari}}(o)}$$
$Nn_{k\text{-adist}}(o)$ denotes the globally normalized k-neighborhood distance of the object, $Nn_{k\text{-vari}}(o)$ denotes the globally normalized k-neighborhood variance of the object, and $N_{k\text{-adist}}(o)$ denotes the k-neighborhood distance of the object.
Wherein, the neighborhood density $N_{k\text{-dens}}(o)$ of step 8 can be expressed as:
$$N_{k\text{-dens}}(o) = \frac{N_{k\text{-disp}}(o)}{N_{k\text{-adist}}(o)}$$
$N_{k\text{-dens}}(o)$ denotes the neighborhood density of the object, and $N_{k\text{-adist}}(o)$ denotes the k-neighborhood distance of the object. When object o coincides with all of its neighborhood objects, its k-neighborhood distance is zero and $N_{k\text{-dens}}(o)$ would be undefined. To avoid this while ensuring that o has the largest k-neighborhood density, $N_{k\text{-dens}}(o)$ is directly set to a value slightly larger than the neighborhood density of every other object in the data set.
Wherein, determining the outlier objects in step 10 comprises:
Step 101, sort the data set in descending order of the objects' outlier values;
Step 102, take the $m_{outl}$ objects with the largest outlier values as the outliers of the data set, where $m_{outl}$ is a threshold given in advance.
The beneficial effects of the invention are as follows: the invention takes into account the degree of dispersion of an object and its neighborhood objects, defines the k-neighborhood dispersion of the object using the expectation and variance of the distances between the object and its neighborhood objects, and redefines the local outlier factor using the k-neighborhood dispersion. It thereby fully accounts for the distribution pattern of the data set; the outlier values obtained by the algorithm are more sensitive to the degree of abnormality of data objects in scattered data sets, and the detection results are more accurate.
Brief description of the drawings
Fig. 1 is a flowchart of the density-based local outlier detection method of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawing and specific embodiments.
Assume the data set $O = \{o_1, o_2, \ldots, o_n\}$ consists of n objects, and each object $o_i = \{a_1, a_2, \ldots, a_m\}$ ($1 \le i \le n$) is m-dimensional data. The main idea of the invention is to combine the distribution patterns of an object and its neighborhood objects, using the expectation and variance of the distances between them to express the local outlier factor of the object. The outlier coefficient obtained by the new algorithm represents more accurately the degree to which an object is an outlier within its local range, which improves the accuracy and the scope of applicability of the outlier detection algorithm.
Each step is described in detail below for the data set O and an arbitrary object o in it:
Step 1, normalize each dimension i of the data set, $A_i = [a_{1i}, a_{2i}, \ldots, a_{ni}]$ ($1 \le i \le m$);
Further, the normalization of the values in any dimension can be expressed as:
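The normalization formula itself is not reproduced in this text. As a rough sketch only, and assuming that step 1 rescales each attribute to [0, 1] by min-max scaling (an assumption, not something the source confirms), the operation could look like the following. The function name `normalize_attributes` and the handling of constant columns are illustrative choices.

```python
def normalize_attributes(data):
    """Rescale every attribute (column) of `data` to [0, 1].

    ASSUMPTION: min-max scaling; the patent text does not show its formula.
    `data` is a list of n objects, each a list of m attribute values.
    """
    columns = list(zip(*data))            # one tuple per attribute A_i
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in data:
        normalized.append([
            # A constant attribute carries no information; map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
scaled = normalize_attributes(data)
```

Rescaling all attributes to a common range keeps any single attribute from dominating the Euclidean distances used in step 2.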
Step 2, determine the set $N_k(o)$ of the k objects in data set O nearest to object o, with $|N_k(o)| = k$;
Step 21, compute the Euclidean distance between object o and every other object;
Step 22, sort the objects in data set O other than o by distance in ascending order and take the first k objects as the k-neighborhood of o;
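Steps 2, 21, and 22 amount to a brute-force k-nearest-neighbor search. A minimal sketch in plain Python follows; the function name `k_nearest_neighbors` is hypothetical, and breaking distance ties by object index is an assumption the source does not address.

```python
import math


def k_nearest_neighbors(data, index, k):
    """Return (distance, neighbor_index) pairs for the k objects nearest to data[index]."""
    target = data[index]
    # Step 21: Euclidean distance from o to every other object.
    distances = [
        (math.dist(target, other), j)
        for j, other in enumerate(data)
        if j != index
    ]
    # Step 22: sort ascending by distance (ties fall back to the index) and keep k.
    distances.sort()
    return distances[:k]


data = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
neighbors = k_nearest_neighbors(data, 0, 2)
```

This brute-force search costs O(n) distance evaluations per object; a spatial index could replace it without changing the later steps.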
Step 3, compute the mean of the distances between object o and its k-neighborhood objects, denoted $N_{k\text{-adist}}(o)$;
Further, $N_{k\text{-adist}}(o)$ can be expressed as:
Step 4, globally normalize $N_{k\text{-adist}}(o)$, and denote the result $Nn_{k\text{-adist}}(o)$;
Further, $Nn_{k\text{-adist}}(o)$ can be expressed as:
Step 5, compute the variance of the distances between object o and its k-neighborhood objects, denoted $N_{k\text{-vari}}(o)$;
Further, $N_{k\text{-vari}}(o)$ can be expressed as:
Step 6, globally normalize $N_{k\text{-vari}}(o)$, and denote the result $Nn_{k\text{-vari}}(o)$;
Further, $Nn_{k\text{-vari}}(o)$ can be expressed as:
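The formulas for steps 3 to 6 are not reproduced in this text, so the following is only a sketch under stated assumptions: the variance is taken as the population variance (divide by k), and "global normalization" is taken to be min-max scaling over all objects in the data set. Neither choice is confirmed by the source, and the function names are hypothetical.

```python
def neighborhood_mean_and_variance(distances):
    """Steps 3 and 5: mean and variance of the distances from o to its k neighbors.

    ASSUMPTION: population variance (divide by k, not k - 1).
    """
    k = len(distances)
    mean = sum(distances) / k
    variance = sum((d - mean) ** 2 for d in distances) / k
    return mean, variance


def normalize_globally(values):
    """Steps 4 and 6: rescale one statistic across all objects.

    ASSUMPTION: min-max scaling to [0, 1]; the source does not show the formula.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Distances from three objects to their k = 2 nearest neighbors.
neighbor_distances = [[1.0, 3.0], [2.0, 2.0], [2.0, 6.0]]
stats = [neighborhood_mean_and_variance(d) for d in neighbor_distances]
adist = [mean for mean, _ in stats]          # k-neighborhood distances
vari = [variance for _, variance in stats]   # k-neighborhood variances
nn_adist = normalize_globally(adist)
nn_vari = normalize_globally(vari)
```

Note that the first two objects share the same mean distance but differ in variance, which is exactly the information the method adds on top of a plain average-distance measure.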
Step 7, compute the k-neighborhood dispersion of object o, denoted $N_{k\text{-disp}}(o)$;
Further, $N_{k\text{-disp}}(o)$ can be expressed as:
$$N_{k\text{-disp}}(o) = \left(Nn_{k\text{-adist}}(o) + 1\right)^{N_{k\text{-adist}}(o) \cdot Nn_{k\text{-vari}}(o)}$$
Step 8, compute the k-neighborhood density of object o, denoted $N_{k\text{-dens}}(o)$;
Further, $N_{k\text{-dens}}(o)$ can be expressed as:
$$N_{k\text{-dens}}(o) = \frac{N_{k\text{-disp}}(o)}{N_{k\text{-adist}}(o)}$$
When object o coincides with all of its neighborhood objects, its k-neighborhood distance is zero and $N_{k\text{-dens}}(o)$ would be undefined. To avoid this while ensuring that o has the largest k-neighborhood density, $N_{k\text{-dens}}(o)$ is directly set to a value slightly larger than the neighborhood density of every other object in the data set.
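Steps 7 and 8 can be sketched directly from the expressions given in claims 2 and 3: the dispersion is $(Nn_{k\text{-adist}}(o) + 1)$ raised to the power $N_{k\text{-adist}}(o) \cdot Nn_{k\text{-vari}}(o)$, and the density is the dispersion divided by $N_{k\text{-adist}}(o)$. The function names and the `margin` parameter, which stands in for "slightly larger", are illustrative assumptions; the source does not say how much larger the value should be.

```python
def neighborhood_dispersion(adist, nn_adist, nn_vari):
    """Step 7: (Nn_adist + 1) ** (N_adist * Nn_vari), as written in claim 2."""
    return (nn_adist + 1.0) ** (adist * nn_vari)


def neighborhood_densities(adist, nn_adist, nn_vari, margin=1e-6):
    """Step 8: dispersion / N_adist for every object, as written in claim 3.

    Objects whose k-neighborhood distance is zero (they coincide with all of
    their neighbors) receive a density slightly above every other object's.
    ASSUMPTION: `margin` is a hypothetical choice for "slightly larger".
    """
    densities = [
        neighborhood_dispersion(a, na, nv) / a if a > 0 else None
        for a, na, nv in zip(adist, nn_adist, nn_vari)
    ]
    defined = [d for d in densities if d is not None]
    ceiling = (max(defined) if defined else 1.0) + margin
    return [ceiling if d is None else d for d in densities]


adist = [0.0, 2.0, 4.0]      # k-neighborhood distances
nn_adist = [0.0, 0.5, 1.0]   # globally normalized distances
nn_vari = [0.0, 0.5, 1.0]    # globally normalized variances
dens = neighborhood_densities(adist, nn_adist, nn_vari)
```

Computing all densities in one pass makes the special case straightforward, because the replacement value depends on the maximum over the remaining objects.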
Step 9, compute the k-neighborhood outlier degree of object o, denoted VLDC(o);
Further, VLDC(o) can be expressed as:
Step 10, determine the outlier objects of the data set;
Step 101, sort the data set in descending order of outlier coefficient;
Step 102, take the $m_{outl}$ objects with the largest outlier coefficients as the outliers of the data set, where $m_{outl}$ is a threshold given in advance;
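Steps 101 and 102 reduce to a descending sort followed by a top-m cut. The sketch below assumes the outlier coefficients from step 9 have already been computed and are passed in as a list; the function name `top_outliers` and the sample scores are made up for illustration.

```python
def top_outliers(scores, m_outl):
    """Return the indices of the m_outl objects with the largest outlier coefficients."""
    # Step 101: rank object indices by outlier coefficient, largest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Step 102: keep the first m_outl objects as the outliers.
    return ranked[:m_outl]


scores = [0.9, 2.5, 1.1, 3.0]   # hypothetical outlier coefficients, one per object
outliers = top_outliers(scores, 2)
```

Because the cut is a fixed count rather than a score threshold, the method always reports exactly $m_{outl}$ objects, whether or not the data set actually contains that many anomalies.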
Of course, the invention may have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art may make various corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the protection scope of the appended claims.

Claims (3)

  1. A density-based local outlier detection method, characterized by comprising the following steps:
    Step 1, normalize each attribute of the data set;
    Step 2, search for the k nearest neighbors of each object;
    Step 3, compute the mean of the distances between each object and its neighborhood objects, recorded as the k-neighborhood distance of the object;
    Step 4, globally normalize the k-neighborhood distance of each object;
    Step 5, compute the variance of the distances between each object and its neighborhood objects, recorded as the k-neighborhood variance of the object;
    Step 6, globally normalize the k-neighborhood variance of each object;
    Step 7, compute the k-neighborhood dispersion of each object;
    Step 8, compute the k-neighborhood density of each object;
    Step 9, compute the local outlier factor of each object;
    Step 10, determine the outlier objects of the data set.
  2. The method according to claim 1, wherein the neighborhood dispersion $N_{k\text{-disp}}(o)$ of step 7 can be expressed as:
    $$N_{k\text{-disp}}(o) = \left(Nn_{k\text{-adist}}(o) + 1\right)^{N_{k\text{-adist}}(o) \cdot Nn_{k\text{-vari}}(o)}$$
    $Nn_{k\text{-adist}}(o)$ denotes the globally normalized k-neighborhood distance of the object, $Nn_{k\text{-vari}}(o)$ denotes the globally normalized k-neighborhood variance of the object, and $N_{k\text{-adist}}(o)$ denotes the k-neighborhood distance of the object.
  3. The method according to claim 1, wherein the neighborhood density $N_{k\text{-dens}}(o)$ can be expressed as:
    $$N_{k\text{-dens}}(o) = \frac{N_{k\text{-disp}}(o)}{N_{k\text{-adist}}(o)}$$
    $N_{k\text{-dens}}(o)$ denotes the neighborhood density of the object, and $N_{k\text{-adist}}(o)$ denotes the k-neighborhood distance of the object. When object o coincides with all of its neighborhood objects, to avoid $N_{k\text{-dens}}(o)$ being undefined while ensuring that o has the largest k-neighborhood density, $N_{k\text{-dens}}(o)$ is directly set to a value slightly larger than the neighborhood density of every other object in the data set.
CN201710559390.XA 2017-07-06 2017-07-06 A kind of local outlier detection method based on density Pending CN107545273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710559390.XA CN107545273A (en) 2017-07-06 2017-07-06 A kind of local outlier detection method based on density

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710559390.XA CN107545273A (en) 2017-07-06 2017-07-06 A kind of local outlier detection method based on density

Publications (1)

Publication Number Publication Date
CN107545273A true CN107545273A (en) 2018-01-05

Family

ID=60971124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710559390.XA Pending CN107545273A (en) 2017-07-06 2017-07-06 A kind of local outlier detection method based on density

Country Status (1)

Country Link
CN (1) CN107545273A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108333314A (en) * 2018-04-02 2018-07-27 深圳凯达通光电科技有限公司 A kind of air pollution intelligent monitor system
CN109740175A (en) * 2018-11-18 2019-05-10 浙江大学 A kind of point judging method that peels off towards Wind turbines power curve data
CN110648741A (en) * 2018-06-27 2020-01-03 清华大学 Method and device for identifying doctor with abnormal prescription based on local outlier factor
CN112070109A (en) * 2020-07-21 2020-12-11 广东工业大学 Calla kiln energy consumption abnormity detection method based on improved density peak clustering
CN113158871A (en) * 2021-04-15 2021-07-23 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113191432A (en) * 2021-05-06 2021-07-30 中国联合网络通信集团有限公司 Outlier factor-based virtual machine cluster anomaly detection method, device and medium
CN113408667A (en) * 2021-07-30 2021-09-17 中国南方电网有限责任公司超高压输电公司检修试验中心 State evaluation method, device, equipment and storage medium
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117854279A (en) * 2024-01-09 2024-04-09 南京清正源信息技术有限公司 Road condition prediction method and system based on edge calculation

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108333314A (en) * 2018-04-02 2018-07-27 深圳凯达通光电科技有限公司 A kind of air pollution intelligent monitor system
CN110648741A (en) * 2018-06-27 2020-01-03 清华大学 Method and device for identifying doctor with abnormal prescription based on local outlier factor
CN109740175A (en) * 2018-11-18 2019-05-10 浙江大学 A kind of point judging method that peels off towards Wind turbines power curve data
CN109740175B (en) * 2018-11-18 2020-12-08 浙江大学 Outlier discrimination method for power curve data of wind turbine generator
CN112070109B (en) * 2020-07-21 2023-06-23 广东工业大学 Water chestnut kiln energy consumption abnormality detection method based on improved density peak value clustering
CN112070109A (en) * 2020-07-21 2020-12-11 广东工业大学 Calla kiln energy consumption abnormity detection method based on improved density peak clustering
CN113158871A (en) * 2021-04-15 2021-07-23 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113158871B (en) * 2021-04-15 2022-08-02 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113191432A (en) * 2021-05-06 2021-07-30 中国联合网络通信集团有限公司 Outlier factor-based virtual machine cluster anomaly detection method, device and medium
CN113191432B (en) * 2021-05-06 2023-07-07 中国联合网络通信集团有限公司 Outlier factor-based virtual machine cluster abnormality detection method, device and medium
CN113408667A (en) * 2021-07-30 2021-09-17 中国南方电网有限责任公司超高压输电公司检修试验中心 State evaluation method, device, equipment and storage medium
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117272216B (en) * 2023-11-22 2024-02-09 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117854279A (en) * 2024-01-09 2024-04-09 南京清正源信息技术有限公司 Road condition prediction method and system based on edge calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180105