CN110471946A - LOF outlier detection method and system based on grid pruning

Info

Publication number: CN110471946A
Application number: CN201910612053.1A
Authority: CN
Inventors: 张绪升; 谢胜利
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2019-07-08
Filing date: 2019-07-08
Publication date: 2019-11-19

Abstract

The present invention relates to a LOF outlier detection method and system based on grid pruning. The method comprises: S1: reading a data set and preprocessing it; S2: dividing the data set into equal-width intervals, calculating the boundary range of each grid cell, and numbering the cells; S3: comparing each data object in the data set with the cell boundary ranges to find the cell it belongs to; S4: calculating the grid density and cluster radius of each cell, and determining a grid density threshold and a cluster radius threshold; S5: grid pruning; S6: outlier detection. The system comprises a data preprocessing module, a data storage module, a data cleansing module, and a Spark distributed computing module. The invention addresses the high time and space complexity, and the resulting poor practicality, of existing LOF outlier detection methods when they process large-scale, high-dimensional data objects in real time, and improves the efficiency and practicality of the computation.

Description

A LOF outlier detection method and system based on grid pruning

Technical field

The present invention relates to the field of data processing, and more particularly to a LOF outlier detection method and system based on grid pruning.

Background art

Outlier mining is an important research direction in data mining. During data mining and analysis, one often finds particular data points or data segments whose fluctuation differs significantly from that of the other data in the data set. Such rarely occurring points or segments are called anomalies, or outliers. Outliers seriously reduce the efficiency of data use and the quality of decisions; at the same time, they often allow potentially useful knowledge to be discovered. With accelerating urbanization, everyday data is now typically high-dimensional, large in volume, and multi-source and heterogeneous, which places higher demands on existing outlier detection methods. Traditional outlier detection methods work well in specific application fields but are no longer suitable for large, high-dimensional data sets: their complex algorithms and low detection efficiency greatly increase the running time of the whole detection process. There are currently four main families of outlier detection methods: statistics-based, distance-based, density-based, and clustering-based.

Density-based outlier detection is the more practical approach, and its classical implementation is the LOF (Local Outlier Factor) algorithm. LOF introduces, for each data object, the concepts of reachability distance and reachability density, and uses them to judge whether the object is an outlier. However, computing the reachability distance and reachability density of each object requires scanning the entire data set, so the computational complexity is very high; with massive data, memory may run out and the computation cannot be completed. Harbin Engineering University, in its patent application "An outlier mining method based on deviation features" (application number CN201710599251.X, publication number CN 107562778A), discloses an outlier mining method based on deviation features. That method divides the data space into a grid and computes the local outlier factor of the data points based on the centroid of each grid cell, which to some extent addresses the high time and space overhead of mining large data sets. It still has shortcomings, however: computing the local outlier factor is very complex and requires frequent traversal of the data set. In addition, the LOF value varies widely with the parameter k, so a suitable outlier threshold is hard to determine; the outlier factor becomes stable only when k is large enough, but the computational complexity grows with k. For massive high-dimensional data, even if the grid is built first and the local outlier factor is computed afterwards, the time needed to obtain stable LOF values remains very high.

Summary of the invention

To overcome the high computational complexity and poor practicality of the existing density-based outlier detection algorithms described above, the present invention provides a LOF outlier detection method and system based on grid pruning.

The method comprises the following steps:

S1: input a data set and preprocess it;

S2: the data set is to have s equal-width intervals; according to the input value of s, divide every dimension of the data set into equal-width intervals, calculate the boundary range of each grid cell, and number the cells;

In a high-dimensional space, every dimension is cut into s segments, so the data set is separated along the cut lines through the division points marked on each dimension. The (possibly irregular) regions cut out in this way give the grid boundaries. The specific boundary values are jointly determined by the dimensionality of the data, the size of the data set, and the given number of intervals s.

S3: compare each data object in the data set with the boundary ranges of the grid cells to find the cell it belongs to;

S4: calculate the grid density and cluster radius of each grid cell, and determine the grid density threshold δ and the cluster radius threshold λ;

Here the cluster radius threshold λ is a distance between the cell centroid and the farthest point in the cell, and the density threshold δ is the average number of data points per cell, i.e. the total number of data points divided by the total number of cells;

The more data points a cell contains, the greater its density; the grid density threshold δ is input by the user. λ is the cluster radius, measured as the distance from the cell centroid to the farthest point in the cell; the cell centroid is the mean position of all data points in the cell.

S5: grid pruning: mark the cells whose grid density is less than δ or whose cluster radius is greater than λ, retain them together with their neighboring cells, and delete the data of all other cells;

S6: apply the LOF algorithm to the data in the cells remaining after pruning, judge whether each data object is an outlier, and output the result.

To reduce the amount of computation, the search for the k-nearest neighbors of a data object p no longer covers all data objects, but only the data objects in the cell that p belongs to and in its neighboring cells.

To reduce the time complexity of outlier detection on massive data, S1-S5 in effect prune the data set, which removes a certain amount of unnecessary computation. The detection step computes the outlier factor (LOF value) and finally uses it to decide whether a data point is an outlier. For example, with a threshold ψ, data points whose LOF value is below ψ are judged to be non-outliers, and data points whose LOF value is greater than or equal to ψ are regarded as outliers. The whole data set is thus divided into two classes: the detected outliers and the normal data.
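The pruning in S4 and S5 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `prune_grid`, `cluster_radius`, `radius_threshold` and `density_threshold` are hypothetical, the cells are assumed to be held in a dictionary keyed by cell number, the default δ is assumed to be the total point count divided by the total cell count, λ is assumed to be supplied by the caller, and "neighboring" is assumed to mean cell indices that differ by at most 1 in every dimension.

```python
import math
from itertools import product


def centroid(points):
    """Mean position of the points in one cell."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def cluster_radius(points):
    """Distance from the cell centroid to the farthest point in the cell."""
    c = centroid(points)
    return max(math.dist(c, p) for p in points)


def prune_grid(cells, total_cells, radius_threshold, density_threshold=None):
    """Return the cells kept for LOF: flagged cells plus their neighbors.

    cells: dict mapping a cell number (tuple of indices) to its list of points.
    total_cells: s**d, the number of cells in the whole grid (assumption).
    """
    n_points = sum(len(pts) for pts in cells.values())
    if density_threshold is None:
        # Assumed default: average number of points per cell.
        density_threshold = n_points / total_cells
    # A cell is flagged when it is sparse or widely spread.
    flagged = {
        cid for cid, pts in cells.items()
        if len(pts) < density_threshold or cluster_radius(pts) > radius_threshold
    }
    keep = set()
    for cid in flagged:
        # Offsets in {-1, 0, 1}^d cover the cell itself and all adjacent cells.
        for offset in product((-1, 0, 1), repeat=len(cid)):
            neighbor = tuple(i + o for i, o in zip(cid, offset))
            if neighbor in cells:
                keep.add(neighbor)
    return {cid: cells[cid] for cid in keep}
```

Enumerating all 3^d offsets is only practical for small d; for higher dimensions a lookup restricted to the non-empty cells would be the more natural choice.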

Preferably, S2 comprises the following steps:

S2.1: for a given d-dimensional data space, following the definition of uniform grid division, divide each attribute dimension into s equal-length, disjoint, left-closed right-open intervals, so that the entire d-dimensional data space is divided into $s^d$ grid cells;

S2.2: divide the grid cells: let the attributes of the data set be $A_1, A_2, \ldots, A_d$, and divide each dimension evenly into s equal-width intervals; a grid cell is then defined as $Cell = C[S_1][S_2]\ldots[S_d]$, where $S_i$ ($1 \le i \le d$) denotes the $S_i$-th interval of the i-th dimension of the d-dimensional space, with $0 \le S_i \le s-1$;

S2.3: number the grid cells: first fix the order of the d attributes as $A_1, A_2, \ldots, A_d$; for each attribute $A_i$ ($1 \le i \le d$), number its s intervals 1, 2, ..., s in ascending order of range; the number of a grid cell is then the set of interval numbers over the attributes $A_i$ ($1 \le i \le d$).
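The grid division and cell lookup of S2 and S3 might be sketched in Python as below. This is an illustration only: the function names `grid_bounds`, `cell_of` and `assign_to_grid` are hypothetical, the boundaries are assumed to be taken from the per-dimension minimum and maximum of the data, and cell numbers are represented as tuples of 0-based interval indices (matching $0 \le S_i \le s-1$ in S2.2).

```python
from collections import defaultdict


def grid_bounds(points, s):
    """Per-dimension (minimum, interval width) for an equal-width split into s intervals."""
    bounds = []
    for j in range(len(points[0])):
        lo = min(p[j] for p in points)
        hi = max(p[j] for p in points)
        width = (hi - lo) / s or 1.0  # fall back to 1.0 for a constant attribute
        bounds.append((lo, width))
    return bounds


def cell_of(point, bounds, s):
    """Cell number of a point: one 0-based interval index per dimension."""
    idx = []
    for x, (lo, width) in zip(point, bounds):
        i = int((x - lo) // width)
        # Intervals are left-closed, right-open; clamp so the maximum value
        # (and any out-of-range value) lands in the first or last interval.
        idx.append(min(max(i, 0), s - 1))
    return tuple(idx)


def assign_to_grid(points, s):
    """Build the grid and group the points by the cell they fall into."""
    bounds = grid_bounds(points, s)
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, bounds, s)].append(p)
    return bounds, dict(cells)
```

Because `cell_of` only needs the stored `bounds`, a point that arrives later can be placed in the existing grid without rebuilding it.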

Preferably, S6 comprises the following steps:

S6.1: compute the k-distance $\mathrm{dist}_k(p)$ of data object p:

Calculate the distances between p and all other objects, sort the resulting distance values in ascending order, and, for the chosen value of k, take the k-th value in the sorted sequence; this is the k-distance $\mathrm{dist}_k(p)$ of p;

Specifically, the distance between data objects q and p is computed with the Euclidean distance as the distance measure. Let x be any data object in the data set; each data object consists of several attributes and is written $(x_1, x_2, \ldots, x_j, \ldots, x_n)$, where $x_j$ ($j = 1, 2, \ldots, n$) is a real number and is the j-th coordinate of x. The distance $d(p, q)$ between any two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ is then calculated as follows:

$$d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}$$

S6.2: find the k-nearest neighbors $N_k(p)$ of p:

Let the data of the small grid cells retained by grid pruning form the set A; then the points in A whose distance from p is no greater than $\mathrm{dist}_k(p)$ are the k-nearest neighbors of p, written:

$$N_k(p) = \{\, q \mid d(p, q) \le \mathrm{dist}_k(p),\ q \in A \setminus \{p\} \,\}$$

where p is a data point in the grid data set, q is any data point in the grid other than p, and $d(p, q)$ is the Euclidean distance between p and q;

S6.3: determine the reachability distance $R_k(p, q)$:

When the distance between object p and object q is less than or equal to the k-distance of q, the reachability distance of p with respect to q is the k-distance $\mathrm{dist}_k(q)$ of q; when the distance between p and q is greater than or equal to the k-distance of q, the reachability distance of p with respect to q is the actual distance $d(p, q)$ between p and q;

The reachability distance of p with respect to q (where $q \in A$ and q is in the k-neighborhood of p) can therefore be expressed as:

$$R_k(p, q) = \max\{\mathrm{dist}_k(q),\ d(p, q)\}$$

S6.4: compute the k-neighbor distribution density $D_k(p)$:

The k-neighbor distribution density of p is the reciprocal of the average reachability distance between p and all points in its k-neighborhood:

$$D_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} R_k(p, q) \right)^{-1}$$

S6.5: compute the local outlier factor (LOF) of p; following the standard LOF definition, the local outlier factor is calculated as:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \frac{D_k(q)}{D_k(p)}$$

S6.6: judge whether a data object is an outlier:

First set an outlier factor threshold, then compare the outlier factor of each data object with the threshold: if the outlier factor of a data object is greater than the threshold, the object is an outlier; otherwise it is not.

Finally, the grid-based outlier detection algorithm sorts the outlier factors of the data objects in descending order. Analysis of the algorithm shows that if a data object is not an outlier, its outlier factor is close to 1; if it is an outlier, its outlier factor is greater than 1, and the greater its degree of outlierness, the larger the factor. In the experiments the outlier factor threshold was set to 1.8 (the user may define a suitable threshold according to the actual situation): data objects above this threshold are outliers, and data objects below it are not treated as outliers.
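Steps S6.1 to S6.6 can be illustrated with a short, brute-force Python sketch. It follows the standard LOF definitions and is not taken from the patent's own code; the function names `lof_scores` and `detect_outliers` are hypothetical, `data` is assumed to be the list of points retained after pruning (the set A), and the sketch assumes no point has k or more exact duplicates (which would make a reachability sum zero).

```python
import math


def lof_scores(data, k):
    """Local outlier factor of every point in data (list of coordinate tuples)."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]

    # S6.1: k-distance = k-th smallest distance to any other point.
    kdist = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]

    # S6.2: k-neighborhood = all other points within the k-distance (ties included).
    nbrs = [[j for j in range(n) if j != i and dist[i][j] <= kdist[i]] for i in range(n)]

    # S6.3 + S6.4: density = reciprocal of the mean reachability distance.
    def density(i):
        reach = [max(kdist[j], dist[i][j]) for j in nbrs[i]]
        return len(reach) / sum(reach)

    dens = [density(i) for i in range(n)]

    # S6.5: LOF = mean ratio of the neighbors' densities to the point's own density.
    return [sum(dens[j] for j in nbrs[i]) / (len(nbrs[i]) * dens[i]) for i in range(n)]


def detect_outliers(data, k, threshold=1.8):
    """S6.6: indices of the points whose LOF exceeds the threshold."""
    return [i for i, score in enumerate(lof_scores(data, k)) if score > threshold]
```

For example, with four points on the corners of a unit square and one distant point, `detect_outliers` with k = 2 flags only the distant point, while the corner points score 1. The pairwise distance matrix makes this quadratic in the number of retained points, which is why the pruning and the cell-restricted neighbor search matter for large data sets.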

The present invention also provides a detection system that applies the grid-pruning LOF outlier detection method. The system comprises: a data preprocessing module, a data storage module, a data cleansing module, and a Spark distributed computing module;

The input of the data preprocessing module is connected to an external data source, and its output is connected to the data storage module; the output of the data storage module is in turn connected to the data cleansing module; the data cleansing module is connected to the Spark distributed computing module, which is finally connected back to the data storage module;

The data preprocessing module is responsible for importing and preprocessing data, and outputs the preprocessed data to the data storage module;

The data storage module includes a distributed file system; it uses the distributed file system (HDFS) as the data storage platform and is responsible for storing and retrieving data;

The Spark distributed computing module receives the computing tasks assigned by the data cleansing module and analyzes and computes the data as required. It uses the distributed environment of the Spark cluster for data storage and computation, stores and manages data files through the distributed file system, and exploits in-memory computing to speed up the algorithm;

The data cleansing module cleans outliers from the data using the grid-pruning LOF outlier detection algorithm; the complex computing tasks involved are handed to the Spark distributed computing module, and the processed intermediate data and final results are stored in the distributed file system (HDFS).

Preferably, the data cleansing module is the core module of the whole detection system and comprises four sub-modules: a data loading and processing module, a grid pruning module, an outlier detection module, and a cleaning result storage module;

The data loading and processing module can merge or aggregate data from multiple data sources in any desired manner, enabling integrated data cleansing and other processing work that no single system could handle on its own.

The grid pruning module first divides the data set into a grid, then prunes the data set of the entire data space, removing most of the data that contains no outliers, and passes the remaining data on for outlier detection;

The outlier detection module loads the pruned data set and performs outlier detection on it: using the density-based outlier detection algorithm, it calculates the outlier factor (LOF value) of each data point and labels each data object with its degree of outlierness, thereby obtaining the outliers in the data set.

The cleaning result storage module stores the intermediate results after grid pruning, the detected outliers, and the cleaned data in HDFS.

Preferably, the Spark cluster runs in fully distributed mode and consists of 4 PCs. One node serves as the Master node, which mainly handles metadata management and task scheduling for the Spark distributed cluster; the remaining 3 nodes are all worker nodes (slaves), which mainly handle the actual data storage and distributed computing tasks.

With the improved LOF outlier detection algorithm, the present invention divides a large data set into a number of grid cells, prunes the cells in which outliers are unlikely, and performs outlier analysis only on the data objects in the remaining cells, where outliers are more likely; this greatly reduces the computational complexity.

In the improved grid-pruning LOF outlier detection method proposed by the present invention, the two parameters grid density and cluster radius are combined as the joint pruning criterion, which greatly reduces the probability of deleting outliers by mistake. In addition, grid numbering is introduced when the grid is built, so that finding the k-nearest neighbors of a data object no longer requires traversing the entire data set; only the data in the object's own cell and in the adjacent cells need to be traversed.
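The cell-restricted neighbor search described above might look like the following Python sketch. It is an illustration under assumptions: `candidate_points` and `local_k_nearest` are hypothetical names, cells are assumed to be stored in a dictionary keyed by tuples of interval indices, and adjacency is assumed to mean indices differing by at most 1 in each dimension.

```python
import math
from itertools import product


def candidate_points(cell_id, cells):
    """Points in the given cell and in every adjacent cell that exists."""
    out = []
    for offset in product((-1, 0, 1), repeat=len(cell_id)):
        neighbor = tuple(i + o for i, o in zip(cell_id, offset))
        out.extend(cells.get(neighbor, []))
    return out


def local_k_nearest(p, cell_id, cells, k):
    """The k nearest points to p, searched only in p's cell and its neighbors.

    Points equal to p (including exact duplicates) are skipped.
    """
    others = [q for q in candidate_points(cell_id, cells) if q != p]
    return sorted(others, key=lambda q: math.dist(p, q))[:k]
```

The search cost depends on the occupancy of at most 3^d cells rather than on the size of the whole data set. The trade-off is that a true nearest neighbor lying beyond the adjacent cells is not seen, and fewer than k points are returned when the surrounding cells are sparse.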

The method also introduces the Spark big-data cluster computing platform. Spark extends the widely used MapReduce computing model and has clear advantages in batch, iterative, and in-memory computation; it can schedule complex computing tasks automatically and is more efficient than MapReduce on complex computations.

The system model proposed by the method can process historical data as well as real-time streaming data. When new data is added to the existing data set, the grid structure built from the existing data points can be reused: the cell of each new data point is found simply by comparison with the grid boundaries, without repeating the grid division.

Compared with the prior art, the beneficial effect of the technical solution is as follows: the invention addresses the problem that existing LOF outlier detection methods have high time and space complexity, and hence poor practicality, when processing large-scale high-dimensional data objects in real time; it enables outlier detection on large data sets while also improving the efficiency and practicality of the computation.

Description of the drawings

Fig. 1 is a flow chart of the LOF outlier detection method based on grid pruning.

Fig. 2 is an architecture diagram of the LOF outlier detection system based on grid pruning.

Fig. 3 is a schematic diagram of the server deployment structure.

Detailed description of the embodiments

The drawings are for illustrative purposes only and shall not be construed as limiting this patent;

To better illustrate the embodiments, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and embodiments.

Embodiment 1:

This embodiment provides a LOF outlier detection method based on grid pruning.

As shown in Fig. 1, the method comprises the following steps:

S1: input a data set and preprocess it;

To reduce the time complexity of outlier detection on massive data, S1-S5 in effect form a pruning pre-treatment of the data set, which removes a certain amount of unnecessary computation. The detection step computes the outlier factor (LOF value) and finally uses it to decide whether a data point is an outlier.

This embodiment sets the threshold to 1.8: data points with an LOF value below 1.8 are judged to be non-outliers, and data points with an LOF value greater than or equal to 1.8 are regarded as outliers. The whole data set is thus divided into two classes: the detected outliers and the normal data.

S2 comprises the following steps:

S6 comprises the following steps:

S6.1: compute the k-distance $\mathrm{dist}_k(p)$ of data object p:

S6.2: find the k-nearest neighbors $N_k(p)$ of p:

$$N_k(p) = \{\, q \mid d(p, q) \le \mathrm{dist}_k(p),\ q \in A \setminus \{p\} \,\}$$

S6.3: determine the reachability distance $R_k(p, q)$:

S6.4: compute the k-neighbor distribution density $D_k(p)$:

S6.6: judge whether a data object is an outlier:

Finally, the grid-based outlier detection algorithm sorts the outlier factors of the data objects in descending order. Analysis of the algorithm shows that if a data object is not an outlier, its outlier factor is close to 1; if it is an outlier, its outlier factor is greater than 1, and the greater its degree of outlierness, the larger the factor. Data objects above the threshold are outliers, and data objects below it are not treated as outliers.

Embodiment 2:

This embodiment provides a detection system that applies the grid-pruning LOF outlier detection method of Embodiment 1.

As shown in Fig. 2, the system comprises: a data preprocessing module, a data storage module, a data cleansing module, and a Spark distributed computing module;

The data cleansing module is the core module of the whole detection system and comprises four sub-modules: a data loading and processing module, a grid pruning module, an outlier detection module, and a cleaning result storage module;

The server deployment structure of the Spark distributed computing module is shown in Fig. 3. In the system, the Spark cluster runs in fully distributed mode and consists of several PCs. One node serves as the Master node, which mainly handles metadata management and task scheduling for the Spark distributed cluster; the remaining nodes are all worker nodes (slaves), which mainly handle the actual data storage and distributed computing tasks. The deployment of the Spark cluster servers shown here is only an example; in practice the number of worker (slave) servers can be increased or decreased as needed to meet the demands of the actual computing tasks, and the ellipsis in the figure indicates that the cluster is expandable.

The system of this embodiment has four main functional modules: the data preprocessing module, the data storage module, the Spark distributed computing module, and the data cleansing module (the core processing module, whose complex computing tasks are assigned to the Spark distributed computing module).

The data preprocessing module can import data of any type from multiple data sources, whether structured, semi-structured, or unstructured. Data from multiple sources can be merged or aggregated in any desired manner and, after preprocessing, the results are placed in the distributed file system of the data storage module, so that the HDFS platform serves as the data source server in preparation for the next step, data cleansing. The Spark distributed computing module uses the cluster's distributed environment for storage and computation, stores and manages data files through the distributed file system (HDFS), and achieves fast analysis and computation through techniques such as batch processing, iterative algorithms, interactive queries, and stream processing. The data cleansing module cleans outliers from the data using the grid-pruning LOF outlier detection algorithm; the intermediate data and final results it produces are stored in the corresponding HDFS distributed file system, and the cleaned data can then be used for subsequent data mining work.

The most essential part of the method of this embodiment, the data cleansing module, is implemented in the following steps:

Data stored in HDFS must first go through a series of preprocessing operations, such as dimensionality reduction and data format conversion. Dimensionality reduction mainly means deleting redundant or weakly relevant attributes, which greatly reduces the time complexity of the algorithm, makes outlier mining more meaningful, and makes the mining results easier to interpret; using expert domain knowledge for dimensionality reduction can reduce errors. Data format conversion mainly covers conversion between different systems and between different data structures.

Grid division and pruning: prune the data set of the entire data space, remove most of the data that contains no outliers, and store the intermediate results in HDFS.

Load the intermediate data set produced by pruning and perform outlier detection on it. The density-based outlier detection algorithm, i.e. the LOF algorithm, is used here; it handles data in local regions well, and by computing each object's outlier factor and labeling the object with its degree of outlierness it improves the rigor and accuracy of the algorithm.

Using the detection method of Embodiment 1, this embodiment stores the detected outliers and the cleaned data on the HDFS platform.

The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;

Obviously, the above embodiments are merely examples given to illustrate the present invention clearly, and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A LOF outlier detection method based on grid pruning, characterized in that the method comprises the following steps:

S1: input a data set and preprocess it;

S2: the data set is to have s equal-width intervals; according to the input value of s, divide every dimension of the data set into equal-width intervals, calculate the boundary range of each grid cell, and number the cells;

wherein the cluster radius threshold λ is a distance between the cell centroid and the farthest point in the cell, and the density threshold δ is the average number of data points per cell, i.e. the total number of data points divided by the total number of cells;

S5: grid pruning: mark the cells whose grid density is less than δ or whose cluster radius is greater than λ, retain them together with their neighboring cells, and delete the data of all other cells;

S6: apply the LOF algorithm to the data in the cells remaining after pruning, judge whether each data object is an outlier, and output the result.

2. The LOF outlier detection method based on grid pruning according to claim 1, characterized in that S2 comprises the following steps:

S2.1: for a given d-dimensional data space, following the definition of uniform grid division, divide each attribute dimension into s equal-length, disjoint, left-closed right-open intervals, so that the entire d-dimensional data space is divided into $s^d$ grid cells;

S2.2: divide the grid cells: let the attributes of the data set be $A_1, A_2, \ldots, A_d$, and divide each dimension evenly into s equal-width intervals; a grid cell is then defined as $Cell = C[S_1][S_2]\ldots[S_d]$, where $S_i$ denotes the $S_i$-th interval of the i-th dimension of the d-dimensional space, with $1 \le i \le d$ and $0 \le S_i \le s-1$;

S2.3: number the grid cells: first fix the order of the d attributes as $A_1, A_2, \ldots, A_d$; for each attribute $A_i$, number its s intervals 1, 2, ..., s in ascending order of range; the number of a grid cell is then the set of interval numbers over the attributes $A_i$.

3. The LOF outlier detection method based on grid pruning according to claim 1, characterized in that S6 comprises the following steps:

S6.1: compute the k-distance $\mathrm{dist}_k(p)$ of data object p:

Calculate the distances between p and all other objects, sort the resulting distance values in ascending order, and, for the chosen value of k, take the k-th value in the sorted sequence; this is the k-distance $\mathrm{dist}_k(p)$ of p;

Let x be any data object in the data set; each data object consists of several attributes and is written $(x_1, x_2, \ldots, x_j, \ldots, x_n)$, where $x_j$ is a real number and is the j-th coordinate of x, $j = 1, 2, \ldots, n$; the distance $d(p, q)$ between any two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ is then calculated as follows:

$$d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}$$

S6.2: find the k-nearest neighbors $N_k(p)$ of p:

Let the data of the small grid cells retained by grid pruning form the set A; then the points in A whose distance from p is no greater than $\mathrm{dist}_k(p)$ are the k-nearest neighbors of p, written:

$$N_k(p) = \{\, q \mid d(p, q) \le \mathrm{dist}_k(p),\ q \in A \setminus \{p\} \,\}$$

where p is a data point in the grid data set, q is any data point in the grid other than p, and $d(p, q)$ is the Euclidean distance between p and q;

S6.3: determine the reachability distance $R_k(p, q)$:

When the distance between object p and object q is less than or equal to the k-distance of q, the reachability distance of p with respect to q is the k-distance $\mathrm{dist}_k(q)$ of q; when the distance between p and q is greater than or equal to the k-distance of q, the reachability distance of p with respect to q is the actual distance $d(p, q)$ between p and q;

The reachability distance of p with respect to q can therefore be expressed as:

$$R_k(p, q) = \max\{\mathrm{dist}_k(q),\ d(p, q)\}$$

where $q \in A$ and q is in the k-neighborhood of p;

S6.4: compute the k-neighbor distribution density $D_k(p)$:

$$D_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} R_k(p, q) \right)^{-1}$$

S6.5: compute the local outlier factor of p; following the standard LOF definition, the local outlier factor is calculated as:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \frac{D_k(q)}{D_k(p)}$$

S6.6: judge whether a data object is an outlier:

First set an outlier factor threshold, then compare the outlier factor of each data object with the threshold: if the outlier factor of a data object is greater than the threshold, the object is an outlier; otherwise it is not.

4. A detection system applying the grid-pruning LOF outlier detection method according to any one of claims 1-3, characterized in that the system comprises: a data preprocessing module, a data storage module, a data cleansing module, and a Spark distributed computing module;

The input of the data preprocessing module is connected to an external data source, and its output is connected to the data storage module; the output of the data storage module is in turn connected to the data cleansing module; the data cleansing module is connected to the Spark distributed computing module, which is finally connected back to the data storage module;

The data preprocessing module is responsible for importing and preprocessing data, and outputs the preprocessed data to the data storage module;

The data storage module includes a distributed file system; it uses the distributed file system as the data storage platform and is responsible for storing and retrieving data;

The Spark distributed computing module receives the computing tasks assigned by the data cleansing module and analyzes and computes the data as required; it uses the distributed environment of the Spark cluster for data storage and computation, stores and manages data files through the distributed file system, and exploits in-memory computing to speed up the algorithm;

The data cleansing module cleans outliers from the data using the grid-pruning LOF outlier detection algorithm; the complex computing tasks involved are handed to the Spark distributed computing module, and the processed intermediate data and final results are stored in the distributed file system.

5. The LOF outlier detection system based on grid pruning according to claim 4, characterized in that the data cleansing module comprises four sub-modules: a data loading and processing module, a grid pruning module, an outlier detection module, and a cleaning result storage module;

The data loading and processing module merges or aggregates data from multiple data sources in any desired manner, enabling integrated data cleansing and other processing work that no single system could handle on its own;

The grid pruning module first divides the data set into a grid, then prunes the data set of the entire data space, removing the data that contains no outliers, and passes the remaining data on for outlier detection;

The outlier detection module loads the pruned data set and performs outlier detection on it: using the density-based outlier detection algorithm, it calculates the outlier factor (LOF value) of each data point and labels each data object with its degree of outlierness, thereby obtaining the outliers in the data set;

The cleaning result storage module stores the intermediate results after grid pruning, the detected outliers, and the cleaned data in the distributed file system.

6. The LOF outlier detection system based on grid pruning according to claim 5, characterized in that the Spark cluster of the Spark distributed computing module runs in fully distributed mode and consists of several PCs; one node serves as the master node, which handles metadata management and task scheduling for the Spark distributed cluster; the remaining nodes are worker nodes, which handle the actual data storage and distributed computing tasks.