CN110471946A - LOF outlier detection method and system based on grid pruning - Google Patents

Info

Publication number
CN110471946A
Authority
CN
China
Prior art keywords
data
grid
module
point
outlier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910612053.1A
Other languages
Chinese (zh)
Inventor
张绪升
谢胜利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201910612053.1A
Publication of CN110471946A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an LOF outlier detection method and system based on grid pruning. The method comprises: S1: reading a data set and preprocessing it; S2: dividing the data set into equal intervals, calculating the boundary range of each grid, and numbering the grids; S3: comparing each data object in the data set with the grid boundary ranges to find the grid to which it belongs; S4: calculating the grid density and cluster radius of each grid, and determining the grid density threshold and the cluster radius threshold; S5: grid pruning; S6: outlier detection. The system comprises a data preprocessing module, a data storage module, a data cleaning module, and a Spark distributed computing module. The invention overcomes the problem that existing LOF outlier detection methods have high time and space complexity, and therefore poor practicality, when processing large-scale, high-dimensional data objects in real time, and it improves the efficiency and practicality of the computation.

Description

LOF outlier detection method and system based on grid pruning
Technical field
The present invention relates to the field of data processing, and more particularly to an LOF outlier detection method and system based on grid pruning.
Background
Outlier mining is an important research direction in data mining. During data mining and analysis, one often finds special data points or data segments whose fluctuation differs significantly from that of the other data in the data set. Such rarely occurring data points or segments are called anomalies, also known as outliers. Outliers seriously affect the efficiency of data use and the quality of decisions; at the same time, outlier data often allows potentially useful knowledge to be discovered. With accelerating urbanization, data in everyday life is often high-dimensional, large in volume, and multi-source and heterogeneous, which places higher demands on existing outlier detection methods. Traditional outlier detection methods work well in specific application fields but are no longer suitable for large, high-dimensional data sets: their complex algorithms and low detection efficiency greatly increase the time complexity of the whole outlier detection process. There are currently four main kinds of outlier detection methods: statistics-based, distance-based, density-based, and clustering-based.
Density-based outlier detection is the more practical approach, and its classic implementation is the LOF (Local Outlier Factor) algorithm. LOF introduces the concepts of reachability distance and reachability density for each data object in order to judge whether the object is an outlier. However, computing the reachability distance and reachability density of each data object requires scanning the entire data set, so the computational complexity is very high; with massive data, memory may run out and the computation cannot be completed. Harbin Engineering University, in its patent application "An outlier mining method based on deviation features" (application number: CN201710599251.X, publication number: CN 107562778A), discloses an outlier mining method based on deviation features. This method divides the data space into grids and computes the local outlier factor of data points based on the centroid of each grid, which to some extent addresses the high time and space overhead of mining large-scale data sets. The method still has shortcomings: the computation of the local outlier factor is very complex and requires frequent traversal of the data set; moreover, because the LOF value varies widely with the parameter k, it is difficult to determine a suitable outlier threshold. The outlier factor becomes stable only when k is sufficiently large, but the computational complexity grows with k. For massive high-dimensional data, even if grid division is performed before the local outlier factor is computed, the time complexity needed to obtain stable LOF values remains very high.
Summary of the invention
To overcome the high computational complexity and poor practicality of the existing density-based outlier detection algorithms described above, the present invention provides an LOF outlier detection method and system based on grid pruning.
The method comprises the following steps:
S1: input a data set and preprocess it;
S2: let each dimension of the data set have s equal-length intervals; according to the input value of s, divide every dimension of the data set into equal intervals, calculate the boundary range of each grid, and number the grids;
In a high-dimensional space, several dimensions are each cut into s segments, and the data set is separated by the dividing lines that pass through the cut points marked along each dimension. The irregular regions cut out in this way define the grid boundaries. The specific boundary values are determined jointly by the dimensionality of the data, the size of the data set, and the given number of intervals s.
S3: compare each data object in the data set with the grid boundary ranges to find the grid to which it belongs;
S4: calculate the grid density and cluster radius of each grid, and determine the grid density threshold δ and the cluster radius threshold λ;
Here the cluster radius threshold λ is the distance between the grid centroid and the farthest point in the grid, and the density threshold δ is the average number of data points per grid, that is, the total number of data points divided by the total number of grids;
The more data points a grid contains, the higher its density. The grid density threshold δ is input by the user. λ is the cluster radius, measured as the distance from the grid centroid to the farthest point in the grid; the grid centroid is the mean of all data points in the grid.
S5: grid pruning: mark the grids whose grid density is less than δ or whose cluster radius is greater than λ, retain their neighbor grids as well, and delete the data of all other grids;
S6: use the LOF algorithm to perform outlier detection on the data in the grids that remain after pruning, judge whether each data object is an outlier, and output the result.
To reduce the amount of computation, the search for the k-nearest neighbors of a data object p no longer covers all data objects, but only the data objects in the grid to which p belongs and in its neighbor grids.
To reduce the time complexity of outlier detection on massive data, S1-S5 in effect prune the data set, which avoids some unnecessary computation. The detection step in effect computes the local outlier factor (LOF) value, and the LOF value is then used to judge whether a data point is an outlier. For example, with a threshold ψ, data points whose LOF value is below ψ are judged to be non-outliers, and data points whose LOF value is greater than or equal to ψ are regarded as outliers. The entire data set is thus divided into two classes: the detected outlier data and the normal data.
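Steps S2 to S5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names (`cell_of`, `radius`, `prune`) are hypothetical, the data range is taken from the minimum and maximum of the data, δ defaults to the average number of points per grid as described above, λ is supplied by the caller, and "neighbor grids" are assumed to be the cells whose index differs by at most 1 in every dimension.

```python
from collections import defaultdict
from itertools import product
import math


def cell_of(point, lows, highs, s):
    """Map a point to a 0-based cell index tuple (left-closed, right-open intervals)."""
    idx = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / s
        i = int((x - lo) / width) if width > 0 else 0
        idx.append(min(max(i, 0), s - 1))  # clamp so the maximum value lands in the last interval
    return tuple(idx)


def radius(cell_points):
    """Cluster radius: distance from the cell centroid to the farthest point in the cell."""
    d = len(cell_points[0])
    centroid = [sum(p[j] for p in cell_points) / len(cell_points) for j in range(d)]
    return max(math.dist(p, centroid) for p in cell_points)


def prune(points, s, lam, delta=None):
    """Keep only flagged cells (sparse or spread out) and their neighbor cells."""
    d = len(points[0])
    lows = [min(p[j] for p in points) for j in range(d)]
    highs = [max(p[j] for p in points) for j in range(d)]
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, lows, highs, s)].append(p)
    if delta is None:
        delta = len(points) / s ** d  # assumed default: average number of points per grid
    flagged = {c for c, pts in cells.items() if len(pts) < delta or radius(pts) > lam}
    keep = set()
    for c in flagged:
        # assumed neighborhood: index offset of -1, 0, or +1 in every dimension
        for offset in product((-1, 0, 1), repeat=d):
            n = tuple(ci + oi for ci, oi in zip(c, offset))
            if n in cells:
                keep.add(n)
    return {c: cells[c] for c in keep}
```

With a dense cluster in one corner and a single distant point, only the cell of the distant point survives pruning, so the LOF step would run on far fewer points than the full data set.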
Preferably, S2 comprises the following steps:
S2.1: for a given d-dimensional data space, following the definition of uniform grid division, divide each attribute dimension into s equal-length, disjoint, left-closed and right-open intervals, so that the whole d-dimensional data space is divided into $s^d$ grid cells;
S2.2: divide the grid cells: let the attributes of the data set be $A_1, A_2, \ldots, A_d$, with each dimension divided evenly into s equal-length intervals; a grid cell is then defined as $Cell = C[S_1][S_2]\ldots[S_d]$, where $S_i$ ($1 \le i \le d$) denotes the $S_i$-th grid interval along the i-th dimension of the d-dimensional space, with $0 \le S_i \le s-1$;
S2.3: number the grid cells: first fix the order of the d attributes as $A_1, A_2, \ldots, A_d$; each attribute $A_i$ ($1 \le i \le d$) is divided into s intervals, numbered 1, 2, ..., s in ascending order of range; the number of a grid is then the set of interval numbers of the $A_i$ ($1 \le i \le d$).
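The interval numbering of S2.1 to S2.3 can be illustrated with a short Python sketch. The helper names (`interval_edges`, `interval_number`, `cell_number`) are hypothetical, and the sketch assumes the 1-based numbering of S2.3 and that the maximum value of a dimension is assigned to the last interval, a detail the text does not specify.

```python
from bisect import bisect_right


def interval_edges(lo, hi, s):
    """Return the s + 1 edges of s equal-length intervals covering [lo, hi]."""
    width = (hi - lo) / s
    return [lo + i * width for i in range(s)] + [hi]


def interval_number(x, edges):
    """1-based number of the left-closed, right-open interval that contains x."""
    s = len(edges) - 1
    n = bisect_right(edges, x)  # number of edges that are <= x
    return min(max(n, 1), s)    # clamp: the maximum value falls in interval s


def cell_number(point, edges_per_dim):
    """Grid number of a point: one interval number per attribute, in fixed attribute order."""
    return tuple(interval_number(x, e) for x, e in zip(point, edges_per_dim))
```

Because the edges are kept, a data point that arrives later can be placed in its grid by the same boundary comparison, without dividing the space again.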
Preferably, S6 comprises the following steps:
S6.1: compute the k-distance $dist_k(p)$ of data object p:
Calculate the distance between p and every other object, sort the resulting distances in ascending order, and, for the chosen value of k, take the k-th value in the sorted sequence as the k-distance $dist_k(p)$ of p;
The distance between data objects q and p is computed using the Euclidean distance. Let x be any data object in the data set; each data object consists of several attributes and is written $(x_1, x_2, \ldots, x_j, \ldots, x_n)$, where $x_j$ ($j = 1, 2, \ldots, n$) is a real number, the j-th coordinate of x. The distance $d(p, q)$ between any two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ is then:
$$d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}$$
S6.2: compute the k-nearest neighbors $N_k(p)$ of p:
Let A be the set formed by the data of each small grid that remains after grid pruning. The points in A whose distance from p is no greater than $dist_k(p)$ are the k-nearest neighbors of p, written:
$$N_k(p) = \{\, q \mid d(p, q) \le dist_k(p),\ q \in A \setminus \{p\} \,\}$$
where p is a data point in the grid data set, q is any data point in the grid other than p, and $d(p, q)$ is the Euclidean distance between p and q;
S6.3: determine the reachability distance $R_k(p, q)$:
When the distance between object p and object q is less than or equal to the k-distance of q, the reachability distance of p with respect to q is the k-distance $dist_k(q)$ of q; when the distance between p and q is greater than or equal to the k-distance of q, the reachability distance of p with respect to q is the actual distance $d(p, q)$ between p and q;
The reachability distance of p with respect to q (where q ∈ A and q is among the k-nearest neighbors of p) can therefore be written:
$$R_k(p, q) = \max\{\, dist_k(q),\ d(p, q) \,\}$$
S6.4: compute the k-neighbor distribution density $D_k(p)$:
The k-neighbor distribution density of p is the reciprocal of the average reachability distance between p and all points in its k-neighborhood:
$$D_k(p) = \frac{|N_k(p)|}{\sum_{q \in N_k(p)} R_k(p, q)}$$
S6.5: compute the local outlier factor (LOF) of p, using the formula:
$$LOF_k(p) = \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \frac{D_k(q)}{D_k(p)}$$
S6.6: judge whether the data object is an outlier:
First set an outlier factor threshold, then compare the outlier factor of each data object with the threshold: if the outlier factor is greater than the threshold, the object is an outlier; otherwise it is not.
In the grid-based outlier detection algorithm, the outlier factors of the data objects are finally sorted in descending order. Analysis of the algorithm shows that if a data object is not an outlier, its outlier factor is close to 1; if it is an outlier, its outlier factor is greater than 1, and the greater the degree of outlierness, the larger the factor. In the experiments the outlier factor threshold was set to 1.8 (users may choose a threshold suited to their situation); data objects above this threshold are outliers, and data objects below it are not treated as outliers.
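Steps S6.1 to S6.5 can be followed on a small example with the Python sketch below. It is a direct, unoptimized transcription of the definitions above, not the patented implementation: the function names are hypothetical, points are addressed by their index in the list `data` (which plays the role of the set A), and the data is assumed to contain no duplicate points, so that the sum of reachability distances is never zero.

```python
import math


def k_distance(i, data, k):
    """S6.1: k-th smallest Euclidean distance from data[i] to the other points."""
    dists = sorted(math.dist(data[i], data[j]) for j in range(len(data)) if j != i)
    return dists[k - 1]


def k_neighbors(i, data, k):
    """S6.2: indices of the points within the k-distance of data[i] (ties included)."""
    kd = k_distance(i, data, k)
    return [j for j in range(len(data)) if j != i and math.dist(data[i], data[j]) <= kd]


def reach_distance(i, j, data, k):
    """S6.3: the larger of the k-distance of data[j] and the actual distance."""
    return max(k_distance(j, data, k), math.dist(data[i], data[j]))


def density(i, data, k):
    """S6.4: reciprocal of the mean reachability distance to the k-neighbors."""
    nbrs = k_neighbors(i, data, k)
    return len(nbrs) / sum(reach_distance(i, j, data, k) for j in nbrs)


def lof(i, data, k):
    """S6.5: mean ratio of the neighbors' densities to the density of data[i]."""
    nbrs = k_neighbors(i, data, k)
    return sum(density(j, data, k) for j in nbrs) / (len(nbrs) * density(i, data, k))
```

For the four corners of a unit square plus one distant point, the corners score 1 and the distant point scores far above the 1.8 threshold mentioned in the text, which matches the qualitative behavior described in S6.6.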
The present invention also provides a detection system that applies the LOF outlier detection method based on grid pruning. The system comprises a data preprocessing module, a data storage module, a data cleaning module, and a Spark distributed computing module;
The input of the data preprocessing module is connected to an external data source, and its output is connected to the data storage module; the output of the data storage module is connected to the data cleaning module; the data cleaning module is connected to the Spark distributed computing module, which is in turn connected back to the data storage module;
The data preprocessing module imports and preprocesses the data and outputs the preprocessed data to the data storage module;
The data storage module includes a distributed file system; it uses the distributed file system (HDFS) as the data storage platform and is responsible for storing and serving the data;
The Spark distributed computing module receives the computing tasks assigned by the data cleaning module and analyzes and computes the data as required. It uses the distributed environment of the Spark cluster for data storage and computation, stores and manages data files through the distributed file system, and exploits in-memory computation to speed up the algorithm;
The data cleaning module cleans outliers from the data using the LOF outlier detection algorithm based on grid pruning; the complex computing tasks involved are handed to the Spark distributed computing module, and the processed intermediate data and final results are stored in the distributed file system (HDFS).
Preferably, the data cleaning module is the core module of the whole detection system and contains four submodules: a data loading and processing module, a grid pruning module, an outlier detection module, and a cleaning result storage module;
The data loading and processing module can merge or aggregate data from multiple data sources in any desired manner, enabling integrated data cleaning and other processing that no single system could handle.
The grid pruning module first divides the data set into grids and then prunes the data of the whole data space, removing most of the data that does not contain outliers and passing the remaining data on for outlier detection;
The outlier detection module loads the pruned data set and performs outlier detection on it: using the density-based outlier detection algorithm, it computes the outlier factor (LOF value) of each data point, labels each data object with its degree of outlierness, and so obtains the outliers in the data set.
The cleaning result storage module stores the intermediate results after grid pruning, the detected outliers, and the cleaned data in HDFS.
Preferably, the Spark cluster runs in fully distributed mode and consists of four PCs. One node serves as the Master node, which mainly manages the metadata of the Spark cluster and schedules tasks; the remaining three nodes are worker nodes (slaver), which mainly handle data storage and distributed computing tasks.
With the improved LOF outlier detection algorithm, the invention divides a large data set into grid cells, prunes the grids in which outliers are unlikely to occur, and analyzes only the data objects in the remaining grids, where outliers are more likely; this greatly reduces the computational complexity.
The improved LOF outlier detection method based on grid pruning combines two parameters, grid density and cluster radius, as the pruning criterion, which greatly reduces the probability that outliers are deleted by mistake. It also introduces grid numbering during grid division, so that finding the k-nearest neighbors of a data object does not require traversing the entire data set, only the data in the object's own grid and in the adjacent grids.
The method also uses the Spark big data cluster computing platform. Spark extends the widely used MapReduce computing model and has clear advantages in batch, iterative, and in-memory computation; it can schedule complex computing tasks automatically and is more efficient than MapReduce for complex computations.
The system model proposed by the method can process historical data as well as real-time streaming data. When a new data set is added to an existing one, the grid structure of the existing data points can be reused: comparing each new data point with the grid boundaries is enough to find its grid, without repeating the grid division.
Compared with the prior art, the beneficial effects of the technical solution are as follows: the invention overcomes the problem that existing LOF outlier detection methods have high time and space complexity, and hence poor practicality, when processing large-scale, high-dimensional data objects in real time; it performs outlier detection on large data sets while also improving the efficiency and practicality of the computation.
Brief description of the drawings
Fig. 1 is a flow chart of the LOF outlier detection method based on grid pruning.
Fig. 2 is an architecture diagram of the LOF outlier detection system based on grid pruning.
Fig. 3 is a schematic diagram of the server deployment structure.
Detailed description of the embodiments
The drawings are for illustration only and shall not be construed as limiting this patent;
To better illustrate the embodiments, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;
Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the invention is further described below with reference to the drawings and embodiments.
Embodiment 1:
This embodiment provides an LOF outlier detection method based on grid pruning.
As shown in Fig. 1, the method comprises the following steps:
S1: input a data set and preprocess it;
S2: let each dimension of the data set have s equal-length intervals; according to the input value of s, divide every dimension of the data set into equal intervals, calculate the boundary range of each grid, and number the grids;
In a high-dimensional space, several dimensions are each cut into s segments, and the data set is separated by the dividing lines that pass through the cut points marked along each dimension. The irregular regions cut out in this way define the grid boundaries. The specific boundary values are determined jointly by the dimensionality of the data, the size of the data set, and the given number of intervals s.
S3: compare each data object in the data set with the grid boundary ranges to find the grid to which it belongs;
S4: calculate the grid density and cluster radius of each grid, and determine the grid density threshold δ and the cluster radius threshold λ;
Here the cluster radius threshold λ is the distance between the grid centroid and the farthest point in the grid, and the density threshold δ is the average number of data points per grid, that is, the total number of data points divided by the total number of grids;
The more data points a grid contains, the higher its density. The grid density threshold δ is input by the user. λ is the cluster radius, measured as the distance from the grid centroid to the farthest point in the grid; the grid centroid is the mean of all data points in the grid.
S5: grid pruning: mark the grids whose grid density is less than δ or whose cluster radius is greater than λ, retain their neighbor grids as well, and delete the data of all other grids;
S6: use the LOF algorithm to perform outlier detection on the data in the grids that remain after pruning, judge whether each data object is an outlier, and output the result.
To reduce the amount of computation, the search for the k-nearest neighbors of a data object p no longer covers all data objects, but only the data objects in the grid to which p belongs and in its neighbor grids.
To reduce the time complexity of outlier detection on massive data, S1-S5 in effect perform a pruning preprocessing of the data set, which avoids some unnecessary computation. The detection step in effect computes the local outlier factor (LOF) value, and the LOF value is then used to judge whether a data point is an outlier.
In this embodiment the threshold is set to 1.8: data points whose LOF value is below 1.8 are judged to be non-outliers, and data points whose LOF value is greater than or equal to 1.8 are regarded as outliers. The entire data set is thus divided into two classes: the detected outlier data and the normal data.
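The restricted neighbor search described above, in which only the object's own grid and its neighbor grids are examined, might look like the following Python sketch. The names `candidates` and `k_nearest_local` are hypothetical, `cells` is assumed to be a dictionary from grid index tuples to lists of points, and neighbor grids are assumed to be those whose index differs by at most 1 in each dimension. Points equal to p are skipped, so exact duplicates of p are not returned.

```python
from itertools import product
import math


def candidates(cell, cells):
    """Points stored in `cell` and in every adjacent cell (offset -1, 0, or +1 per dimension)."""
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbor = tuple(c + o for c, o in zip(cell, offset))
        found.extend(cells.get(neighbor, []))
    return found


def k_nearest_local(p, cell, cells, k):
    """The k closest candidates to p, searched only in the local neighborhood of cells."""
    others = [q for q in candidates(cell, cells) if q != p]
    return sorted(others, key=lambda q: math.dist(p, q))[:k]
```

The search cost then depends on the number of points in the neighboring cells rather than on the size of the whole data set. Note that if fewer than k points lie in those cells, the sketch returns fewer than k neighbors; how the method handles that case is not stated in the text.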
S2 comprises the following steps:
S2.1: for a given d-dimensional data space, following the definition of uniform grid division, divide each attribute dimension into s equal-length, disjoint, left-closed and right-open intervals, so that the whole d-dimensional data space is divided into $s^d$ grid cells;
S2.2: divide the grid cells: let the attributes of the data set be $A_1, A_2, \ldots, A_d$, with each dimension divided evenly into s equal-length intervals; a grid cell is then defined as $Cell = C[S_1][S_2]\ldots[S_d]$, where $S_i$ ($1 \le i \le d$) denotes the $S_i$-th grid interval along the i-th dimension of the d-dimensional space, with $0 \le S_i \le s-1$;
S2.3: number the grid cells: first fix the order of the d attributes as $A_1, A_2, \ldots, A_d$; each attribute $A_i$ ($1 \le i \le d$) is divided into s intervals, numbered 1, 2, ..., s in ascending order of range; the number of a grid is then the set of interval numbers of the $A_i$ ($1 \le i \le d$).
S6 comprises the following steps:
S6.1: compute the k-distance $dist_k(p)$ of data object p:
Calculate the distance between p and every other object, sort the resulting distances in ascending order, and, for the chosen value of k, take the k-th value in the sorted sequence as the k-distance $dist_k(p)$ of p;
The distance between data objects q and p is computed using the Euclidean distance. Let x be any data object in the data set; each data object consists of several attributes and is written $(x_1, x_2, \ldots, x_j, \ldots, x_n)$, where $x_j$ ($j = 1, 2, \ldots, n$) is a real number, the j-th coordinate of x. The distance $d(p, q)$ between any two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ is then:
$$d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}$$
S6.2: compute the k-nearest neighbors $N_k(p)$ of p:
Let A be the set formed by the data of each small grid that remains after grid pruning. The points in A whose distance from p is no greater than $dist_k(p)$ are the k-nearest neighbors of p, written:
$$N_k(p) = \{\, q \mid d(p, q) \le dist_k(p),\ q \in A \setminus \{p\} \,\}$$
where p is a data point in the grid data set, q is any data point in the grid other than p, and $d(p, q)$ is the Euclidean distance between p and q;
S6.3: determine the reachability distance $R_k(p, q)$:
When the distance between object p and object q is less than or equal to the k-distance of q, the reachability distance of p with respect to q is the k-distance $dist_k(q)$ of q; when the distance between p and q is greater than or equal to the k-distance of q, the reachability distance of p with respect to q is the actual distance $d(p, q)$ between p and q;
The reachability distance of p with respect to q (where q ∈ A and q is among the k-nearest neighbors of p) can therefore be written:
$$R_k(p, q) = \max\{\, dist_k(q),\ d(p, q) \,\}$$
S6.4: compute the k-neighbor distribution density $D_k(p)$:
The k-neighbor distribution density of p is the reciprocal of the average reachability distance between p and all points in its k-neighborhood:
$$D_k(p) = \frac{|N_k(p)|}{\sum_{q \in N_k(p)} R_k(p, q)}$$
S6.5: compute the local outlier factor (LOF) of p, using the formula:
$$LOF_k(p) = \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \frac{D_k(q)}{D_k(p)}$$
S6.6: judge whether the data object is an outlier:
First set an outlier factor threshold, then compare the outlier factor of each data object with the threshold: if the outlier factor is greater than the threshold, the object is an outlier; otherwise it is not.
In the grid-based outlier detection algorithm, the outlier factors of the data objects are finally sorted in descending order. Analysis of the algorithm shows that if a data object is not an outlier, its outlier factor is close to 1; if it is an outlier, its outlier factor is greater than 1, and the greater the degree of outlierness, the larger the factor. Data objects above the threshold are outliers, and data objects below it are not treated as outliers.
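The final ranking and thresholding step can be illustrated as follows. The function name `split_by_threshold` and the dictionary of scores are hypothetical; the sketch treats a score equal to the threshold as an outlier, following the "greater than or equal to 1.8" wording used earlier in this embodiment.

```python
def split_by_threshold(scores, threshold=1.8):
    """Sort LOF scores in descending order and split them at the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    outliers = [name for name, value in ranked if value >= threshold]
    normal = [name for name, value in ranked if value < threshold]
    return outliers, normal
```

Because the scores are sorted first, the most strongly outlying objects appear at the head of the outlier list, which makes it easy to inspect only the top few.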
Embodiment 2:
The present embodiment provides a kind of detection systems of the LOF outlier detection method of grid beta pruning described in Application Example 1 System.
As shown in Fig. 2, the system comprises: data preprocessing module, data memory module, data cleansing module, Spark Distributed computing module;
The input terminal of data preprocessing module is connect with external data source, and output end and the data of data preprocessing module are deposited Module connection is stored up, the output end of data memory module is connect with data cleansing module again, and data cleansing module and Spark are distributed Computing module connection, Spark distributed computing module are finally connected to data memory module;
Data preprocessing module is responsible for the importing and pretreatment of data, and pretreated data are exported to data and are stored Module;
Data memory module includes distributed file system, and data memory module application distribution formula file system (HDFS) is made For data storing platform, it is responsible for the mobile sms service of data;
Spark distributed computing module receives the given calculating task of data cleansing module, divides as required data Analysis calculates;The storage and calculating that data are realized using the distributed environment of Spark cluster, by distributed file system come Data file is stored and is managed, using the advantage calculated based on memory, improves the calculating speed of algorithm;
Data cleansing module is carried out clear by outlier of the LOF outlier detection algorithm based on grid beta pruning to data It washes, the complicated calculations task being related to gives Spark distributed computing resume module, and will treated intermediate data and most Termination fruit is stored in distributed file system (HDFS).
The data cleaning module is the core module of the whole detection system. It comprises four submodules: a data loading and processing module, a grid pruning module, an outlier detection module, and a cleaning result storage module;
The data loading and processing module can merge or aggregate data from multiple data sources in any desired manner, enabling integrated data cleaning and other processing work that no single system could handle on its own.
The grid pruning module first partitions the data set into grids and then prunes the data set over the whole data space, removing most of the data that contains no outliers; outlier detection is then performed on the remaining data;
The outlier detection module loads the pruned data set and performs outlier detection on it. Using a density-based outlier detection algorithm, it computes the local outlier factor (LOF) value of each data point in the data set and labels each data object with its degree of outlierness, thereby obtaining the outlier data in the data set.
The cleaning result storage module stores the intermediate result data after grid pruning, the detected outliers, and the cleaned data in HDFS.
The server deployment structure of the Spark distributed computing module is shown in Figure 3. In the system, the Spark cluster runs in fully distributed mode and consists of several PCs. One node serves as the Master node, which mainly manages the metadata of the Spark distributed cluster and schedules tasks; the remaining nodes are all worker (slave) nodes, which mainly carry out the actual data storage and distributed computing tasks. The deployment shown here is only an example; in practice the number of worker (slave) servers can be increased or decreased to meet the needs of the actual computing tasks, and the ellipsis in the figure indicates that the cluster is scalable.
The system described in this embodiment has four main functional modules: a data preprocessing module, a data storage module, a Spark distributed computing module, and a data cleaning module (the data cleaning module is the core processing module, but complex computing tasks need to be assigned to the Spark distributed computing module for processing).
Among these, the data preprocessing module can import any type of data from any number of data sources, whether structured, semi-structured, or unstructured. Data from multiple sources can be merged or aggregated in any desired manner and preprocessed, and the preprocessed results are placed in the distributed file system of the data storage module, so that the HDFS platform serves as the data source server in preparation for the next data cleaning step. The Spark distributed computing module uses the distributed environment of the cluster for storage and computation, stores and manages data files through the distributed file system (HDFS), and achieves fast analysis and computation through batch processing, iterative algorithms, interactive queries, stream processing, and similar techniques. The data cleaning module cleans outliers from the data using the LOF outlier detection algorithm based on grid pruning and stores the processed intermediate data and final results, as specified by the module, in the corresponding HDFS distributed file system; the cleaned data can then be used for subsequent data mining work.
The data cleaning module, the core of the method of this embodiment, is implemented in the following steps:
Data stored in HDFS first requires a series of preprocessing operations, such as dimensionality reduction and data format conversion. Dimensionality reduction mainly means deleting redundant or weakly relevant attributes, which greatly reduces the time complexity of the algorithm, makes outlier mining more meaningful, and makes the mined results easier to interpret; performing the reduction with expert domain knowledge can reduce error. Data format conversion mainly covers conversion between different systems and conversion between different data structures.
Grid partitioning and pruning: the data set over the whole data space is pruned to remove most of the data that contains no outliers, and the intermediate results are stored in HDFS.
The pruned intermediate data set is loaded and outlier detection is performed on it. A density-based outlier detection algorithm, namely the LOF algorithm, is used here. It handles data in local regions well: by computing each object's outlier factor and labeling the object with its degree of outlierness, it improves the rigor and accuracy of the detection.
This embodiment uses the detection method described in Embodiment 1 and stores the detected outliers and the cleaned data on the HDFS platform.
The terms describing positional relationships in the drawings are for illustration only and should not be construed as limiting this patent;
Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (6)

1. A LOF outlier detection method based on grid pruning, characterized in that the method comprises the following steps:
S1: input a data set and preprocess it;
S2: let each dimension of the data set have s equal-length intervals; according to the input value of s, divide every dimension of the data set into equal intervals, compute the boundary range of each grid, and number the grids;
S3: compare each data object in the data set with the boundary ranges of the grids to find the grid it belongs to;
S4: compute the grid density and cluster radius of each grid, and determine the grid density threshold δ and the cluster radius threshold λ;
where the cluster radius threshold λ is the distance between the grid centroid and the farthest point in the grid, and the density threshold δ is the average number of data points per grid, that is, the total number of data points divided by the total number of grids;
S5: grid pruning: mark the grids whose grid density is less than δ or whose cluster radius is greater than λ, retain their neighboring grids as well, and delete the data of all other grids;
S6: perform outlier detection on the data in the grids remaining after pruning using the LOF algorithm, judge whether each data object is an outlier, and output the result.
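Steps S2 to S5 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the patented implementation. The function name `grid_prune` is hypothetical, δ is computed as the claim states (total points divided by the s^d grids), and because the claim does not spell out how a single value of λ is chosen, the sketch takes `lam` as an input. Treating a grid's neighbors as all cells whose index differs by at most one in every dimension is also an assumption.

```python
from collections import defaultdict
from itertools import product
from math import dist


def grid_prune(points, s, lam):
    """Return the points kept after grid pruning (flagged grids plus their neighbours)."""
    d = len(points[0])
    lows = [min(p[j] for p in points) for j in range(d)]
    highs = [max(p[j] for p in points) for j in range(d)]
    # Width of each of the s equal intervals per dimension (1.0 if a dimension is constant).
    widths = [((highs[j] - lows[j]) / s) or 1.0 for j in range(d)]

    def cell_of(p):
        # S3: left-closed, right-open intervals; the maximum value is clamped into the last one.
        return tuple(min(int((p[j] - lows[j]) / widths[j]), s - 1) for j in range(d))

    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)

    def radius(members):
        # S4: distance from the grid centroid to the farthest point in the grid.
        centroid = [sum(c) / len(members) for c in zip(*members)]
        return max(dist(centroid, m) for m in members)

    # S4: density threshold = total number of points / total number of grids (s**d).
    delta = len(points) / (s ** d)

    # S5: flag sparse or widely spread grids, then also keep their neighbours.
    flagged = {c for c, m in cells.items() if len(m) < delta or radius(m) > lam}
    keep = set(flagged)
    for c in flagged:
        for offset in product((-1, 0, 1), repeat=d):  # ASSUMPTION: 3**d neighbourhood
            n = tuple(ci + oi for ci, oi in zip(c, offset))
            if n in cells:
                keep.add(n)
    return [p for c in keep for p in cells[c]]
```

Only the returned points would then be passed to the LOF step S6; dense, compact grids far from any flagged grid are dropped without computing any pairwise distances.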
2. The LOF outlier detection method based on grid pruning according to claim 1, characterized in that S2 comprises the following steps:
S2.1: for a given d-dimensional data space, following the definition of uniform grid partitioning, divide each attribute dimension into s equal-length, disjoint, left-closed right-open intervals, so that the whole d-dimensional data space is divided into $s^d$ grid cells;
S2.2: dividing grid cells: let the attributes of the data set be $A_1, A_2, \ldots, A_d$, with each dimension divided evenly into s equal-length intervals; a grid cell is then defined as $Cell = C[S_1][S_2]\ldots[S_d]$, where $S_i$ denotes the $S_i$-th interval of the i-th dimension in the d-dimensional space, with $1 \le i \le d$ and $0 \le S_i \le s-1$;
S2.3: numbering grid cells: first fix the order of the d attributes as $A_1, A_2, \ldots, A_d$; for each attribute $A_i$, number its s intervals 1, 2, ..., s in ascending order of range; the number of a grid cell is then the set of interval numbers over the attributes $A_i$.
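The numbering in S2.1 to S2.3 can be illustrated with a short sketch. The function name `cell_number` and its parameters are hypothetical. The sketch follows the 1-based numbering of S2.3 (S2.2 states the index range as 0 to s-1, so subtract one for that convention), assumes `hi > lo` in every dimension, and assumes that a value equal to the upper bound is placed in the last interval, which the claim does not specify.

```python
def cell_number(point, lows, highs, s):
    """Return the grid number of `point` as a tuple of per-attribute interval numbers (1..s)."""
    number = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / s  # equal-length intervals, assumes hi > lo
        # Left-closed, right-open intervals [lo + k*width, lo + (k+1)*width).
        k = int((x - lo) / width)
        # ASSUMPTION: the upper bound hi itself is clamped into the last interval.
        k = min(k, s - 1)
        number.append(k + 1)  # 1-based numbering as in S2.3
    return tuple(number)


# With s = 5 intervals over [0, 10) in two dimensions there are 5**2 = 25 cells.
print(cell_number((3.9, 4.0), (0.0, 0.0), (10.0, 10.0), 5))  # (2, 3)
```

The value 4.0 falls in the third interval because the intervals are left-closed and right-open, so the boundary belongs to the interval that starts there.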
3. The LOF outlier detection method based on grid pruning according to claim 1, characterized in that S6 comprises the following steps:
S6.1: find the k-distance $dist_k(p)$ of data object p:
Compute the distance between p and every other object, sort the resulting distances in ascending order, and, for the chosen value of k, take the k-th value in the sorted sequence; this is the k-distance $dist_k(p)$ of p;
Suppose x is any data object in the data set. Each data object consists of multiple attributes and is written $(x_1, x_2, \ldots, x_j, \ldots, x_n)$, where $x_j$ is a real number and is the j-th coordinate of x, $j = 1, 2, \ldots, n$. The distance d(p, q) between any two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ is then computed as the Euclidean distance:
$$d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}$$
S6.2: find the k-neighbors $N_k(p)$ of p:
Let A be the set of data in the small grids retained by grid pruning; the points in A whose distance from p is no greater than $dist_k(p)$ are the k-neighbors of p, written:
$$N_k(p) = \{\, q \mid d(p, q) \le dist_k(p),\ q \in A \setminus \{p\} \,\}$$
where p is a data point in the grid data set, q is any data point in the grids other than p, and d(p, q) is the Euclidean distance between points p and q in the space;
S6.3: determine the reachability distance $R_k(p, q)$:
When the distance between object p and object q is less than or equal to the k-distance of q, the reachability distance of p with respect to q is the k-distance $dist_k(q)$ of q; when the distance between p and q is greater than or equal to the k-distance of q, the reachability distance of p with respect to q is the actual distance d(p, q) between p and q;
The reachability distance of p with respect to object q can therefore be expressed as:
$$R_k(p, q) = \max\{\, dist_k(q),\ d(p, q) \,\}$$
where $q \in A$ and q is in the k-neighborhood of p;
S6.4: find the k-neighbor distribution density $D_k(p)$:
The k-neighbor distribution density of p is the reciprocal of the average reachability distance between object p and all points in its k-neighborhood:
$$D_k(p) = \frac{|N_k(p)|}{\sum_{q \in N_k(p)} R_k(p, q)}$$
S6.5: find the local outlier factor of p, which is computed as:
$$LOF_k(p) = \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \frac{D_k(q)}{D_k(p)}$$
S6.6: judge whether a data object is an outlier:
First set an outlier factor threshold, then compare the outlier factor of each data object with the threshold; if the outlier factor of a data object is greater than the threshold, the object is an outlier, otherwise it is not.
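Steps S6.1 to S6.6 can be sketched as follows. This is a brute-force illustration in plain Python using the standard LOF definitions (k-distance, k-neighborhood with ties included, reachability distance, density, local outlier factor); the names `lof_scores` and `detect_outliers` are hypothetical and the code is not taken from the patent. It assumes `points` is the set A retained by grid pruning, that it holds more than k points, and that no point has k or more exact duplicates (otherwise the density in S6.4 would divide by zero).

```python
from math import dist


def lof_scores(points, k):
    """Return the local outlier factor of every point, following S6.1 to S6.5."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    # S6.1: k-distance = k-th smallest distance from a point to the other points.
    k_dist = [sorted(d[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]
    # S6.2: k-neighbours = all other points within the k-distance (ties included).
    nbrs = [[j for j in range(n) if j != i and d[i][j] <= k_dist[i]] for i in range(n)]

    def reach(i, j):
        # S6.3: reachability distance of point i with respect to point j.
        return max(k_dist[j], d[i][j])

    # S6.4: density = reciprocal of the mean reachability distance to the k-neighbours.
    dens = [len(nbrs[i]) / sum(reach(i, j) for j in nbrs[i]) for i in range(n)]
    # S6.5: LOF = mean ratio of the neighbours' densities to the point's own density.
    return [sum(dens[j] for j in nbrs[i]) / (len(nbrs[i]) * dens[i]) for i in range(n)]


def detect_outliers(points, k, threshold):
    """S6.6: a point is an outlier if its LOF exceeds the chosen threshold."""
    return [p for p, score in zip(points, lof_scores(points, k)) if score > threshold]


square = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print(detect_outliers(square + [(5.0, 5.0)], k=2, threshold=1.5))  # [(5.0, 5.0)]
```

Points inside a uniform cluster score close to 1, while the isolated point scores well above it, which is why a threshold somewhat greater than 1 is a common choice.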
4. A detection system using the LOF outlier detection method based on grid pruning according to any one of claims 1-3, characterized in that the system comprises: a data preprocessing module, a data storage module, a data cleaning module, and a Spark distributed computing module;
The input of the data preprocessing module is connected to an external data source, and the output of the data preprocessing module is connected to the data storage module; the output of the data storage module is in turn connected to the data cleaning module; the data cleaning module is connected to the Spark distributed computing module; and the Spark distributed computing module is finally connected back to the data storage module;
The data preprocessing module is responsible for importing and preprocessing data and outputs the preprocessed data to the data storage module;
The data storage module includes a distributed file system; it uses the distributed file system as the data storage platform and is responsible for storing and serving the data;
The Spark distributed computing module receives the computing tasks assigned by the data cleaning module and analyzes and computes the data as required; it uses the distributed environment of the Spark cluster to store and process data, stores and manages data files through the distributed file system, and exploits in-memory computing to speed up the algorithm;
The data cleaning module cleans outliers from the data using the LOF outlier detection algorithm based on grid pruning; the complex computing tasks involved are handed to the Spark distributed computing module, and the processed intermediate data and final results are stored in the distributed file system.
5. The LOF outlier detection system based on grid pruning according to claim 4, characterized in that the data cleaning module comprises four submodules: a data loading and processing module, a grid pruning module, an outlier detection module, and a cleaning result storage module;
The data loading and processing module merges or aggregates data from multiple data sources in any desired manner, enabling integrated data cleaning and other processing work that no single system could handle on its own;
The grid pruning module first partitions the data set into grids and then prunes the data set over the whole data space, removing the data that contains no outliers; outlier detection is performed on the remaining data;
The outlier detection module loads the pruned data set and performs outlier detection on it; using a density-based outlier detection algorithm, it computes the local outlier factor (LOF) value of each data point in the data set and labels each data object with its degree of outlierness, thereby obtaining the outlier data in the data set;
The cleaning result storage module stores the intermediate result data after grid pruning, the detected outliers, and the cleaned data in the distributed file system.
6. The LOF outlier detection system based on grid pruning according to claim 5, characterized in that the Spark cluster of the Spark distributed computing module runs in fully distributed mode and consists of several PCs, one node of which serves as the master node; the master node manages the metadata of the Spark distributed cluster and schedules tasks, and the remaining nodes are worker nodes, which carry out the actual data storage and distributed computing tasks.
CN201910612053.1A 2019-07-08 2019-07-08 A kind of LOF outlier detection method and system based on grid beta pruning Pending CN110471946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910612053.1A CN110471946A (en) 2019-07-08 2019-07-08 A kind of LOF outlier detection method and system based on grid beta pruning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910612053.1A CN110471946A (en) 2019-07-08 2019-07-08 A kind of LOF outlier detection method and system based on grid beta pruning

Publications (1)

Publication Number Publication Date
CN110471946A true CN110471946A (en) 2019-11-19

Family

ID=68507531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910612053.1A Pending CN110471946A (en) 2019-07-08 2019-07-08 A kind of LOF outlier detection method and system based on grid beta pruning

Country Status (1)

Country Link
CN (1) CN110471946A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109102028A (en) * 2018-08-20 2018-12-28 南京邮电大学 Based on improved fast density peak value cluster and LOF outlier detection algorithm
CN109669971A (en) * 2018-12-18 2019-04-23 广东奥博信息产业股份有限公司 A kind of metric space Outliers Detection method based on quickly random intensive supporting point

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘峰等: "基于MapReduce的时序数据离群点挖掘算法", 《铁路计算机应用》 *
李佐军: "《大数据的架构技术与应用实践的探究》", 30 April 2019, 东北师范大学出版社 *
洪沙等: "基于密度的不确定数据离群点检测研究", 《计算机科学》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191119