CN104346445A - Method for rapidly screening outlier data from massive data

Info

Publication number: CN104346445A
Application number: CN201410584552.1A
Authority: CN
Inventors: 王恩东; 张东; 吴楠; 韦鹏; 付兴旺
Original assignee: Langchao Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-10-28
Filing date: 2014-10-28
Publication date: 2015-02-11
Anticipated expiration: 2034-10-28
Also published as: CN104346445B; WO2016065775A1

Abstract

The invention provides a method for rapidly screening outlier data from massive data. The method takes full account of the time and space complexity of outlier mining on massive data, uses random sampling to reduce the number of samples involved in computation, and uses parallel computing to increase computing speed. It thereby addresses the high demands on computing time and memory space that outlier screening on massive data imposes, and achieves rapid and effective outlier data screening.

Description

A method for rapidly screening outlier data from large-scale data

Technical field

The present invention relates to the fields of computer technology and machine learning, and specifically to a method for rapidly screening outlier data from large-scale data.

Background technology

Outlier data are data within a large data set that are inconsistent with the general behavior or model of the rest of the data. Outlier data are generally considered to arise from two causes:

1) Measurement or execution errors. Screening outlier data of this type filters impurities or problematic data out of the mass data and thereby improves overall data quality;

2) Inherent variability of the data. The objective existence of this kind of data determines the importance of screening for it. For example, finding objectively existing but previously unknown outlier data in scientific data can substantially advance the related theoretical research.

As data keep accumulating and data sets keep growing, it becomes increasingly difficult for traditional outlier mining algorithms to screen outlier data under existing computing conditions. To address this problem, the invention discloses a method for rapidly screening outlier data from large-scale data. The method takes full account of the time and space complexity of outlier mining on large-scale data, uses random sampling to reduce the number of samples that participate in the computation, and uses parallel computation to increase computing speed. It thereby addresses the high demands on computing time and memory space in outlier screening of large-scale data and achieves fast and effective outlier data screening.

Summary of the invention

The object of the invention is to provide a method for rapidly screening outlier data from large-scale data.

The object of the invention is achieved as follows: random sampling is used to reduce the number of samples that participate in the computation, and parallel computation is used to increase computing speed, thereby addressing the high demands on computing time and memory space in outlier screening of large-scale data and achieving fast and effective outlier data screening. The method comprises the following steps:

1) Data preprocessing

The data are preprocessed to eliminate inconsistencies between data items and, at the same time, to normalize each item. The specific operations include data cleaning, data integration, data transformation, and data reduction. The resulting feature matrix is denoted T, of size N×M, where N is the number of samples and M is the number of original feature attributes;

2) Feature selection and transformation

Feature selection removes, from all attributes, those that contribute little or nothing to subsequent operations; feature transformation derives the attributes of a new feature space by transforming the current attributes. The resulting feature matrix is denoted Ts, of size N×m, where N is the number of samples and m is the number of attributes after selection and transformation;

3) Variable initialization

Two all-zero vectors of length N are denoted Co and Cs. In the subsequent computation they store, respectively, the running sum of the outlier factors and the number of times each sample has been selected;

4) Iteration

The vectors Co and Cs are updated by the following iteration, which stops after a given number of iterations k:

(1) randomly select a subset of samples of fixed size n;

(2) add 1 to the corresponding elements of vector Cs;

(3) select the corresponding rows from matrix Ts and compute the local isolation factors for the resulting sub-matrix;

(4) add the local isolation factors obtained in the previous step to the corresponding elements of vector Co;

5) Outlier index calculation

The outlier index vector COI is computed from the vectors Co and Cs, element by element, using the formula COI = Co/Cs;

6) Outlier data screening

The samples are sorted by their values in vector COI in descending order, and the first l samples are selected as outlier data.
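Steps 3) to 6) can be sketched sequentially as follows. This is a minimal illustration, not the patented implementation: the text does not say how the per-subset factor is computed, so the hypothetical helper `knn_score` uses the mean distance to the nearest neighbors within the subset as a stand-in. The names `screen_outliers`, `k_neighbors`, and `seed`, and the choice to rank never-sampled rows last, are also assumptions. The subset size n must be at least 2 so that every row has a neighbor.

```python
import math
import random


def knn_score(rows, k_neighbors=2):
    """Stand-in local score: mean Euclidean distance to the nearest rows.

    Assumption: the source does not specify how the per-subset factor
    is computed, so this simple score is used only for illustration.
    """
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        nearest = dists[:k_neighbors]
        scores.append(sum(nearest) / len(nearest))
    return scores


def screen_outliers(Ts, n, k, l, seed=0):
    """Run steps 3) to 6) and return (top-l indices, COI vector)."""
    rng = random.Random(seed)
    N = len(Ts)
    Co = [0.0] * N  # running sum of scores per sample
    Cs = [0] * N    # number of times each sample was drawn
    for _ in range(k):
        idx = rng.sample(range(N), n)              # (1) random subset of size n
        scores = knn_score([Ts[i] for i in idx])   # (3) score the sub-matrix
        for i, s in zip(idx, scores):
            Cs[i] += 1                             # (2) count the draw
            Co[i] += s                             # (4) accumulate the score
    # 5) COI = Co / Cs element-wise; never-drawn samples get -inf (assumption)
    COI = [co / cs if cs else float("-inf") for co, cs in zip(Co, Cs)]
    # 6) sort in descending order and keep the first l indices
    ranked = sorted(range(N), key=lambda i: COI[i], reverse=True)
    return ranked[:l], COI
```

Each iteration scores only n rows, so with this stand-in score the cost per iteration grows with n squared rather than with N squared, which is where the saving from sampling comes from.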

Random sampling yields a small sample whose size is far smaller than that of the original data set; the sampling may be completely random or weighted.
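One possible form of weighted sampling is sketched below. The weight formula 1 / (1 + Cs[i]), which favors samples drawn less often so far, and the function name `weighted_sample` are assumptions made for illustration; the text does not fix them. Drawing without replacement uses the Efraimidis-Spirakis key method: each item gets the key u^(1/w) for a uniform random u, and the n largest keys are kept.

```python
import random


def weighted_sample(Cs, n, rng):
    """Draw n distinct indices, favoring samples with low draw counts.

    Assumption: weight = 1 / (1 + Cs[i]); the source does not give a formula.
    """
    weights = [1.0 / (1 + c) for c in Cs]
    # Efraimidis-Spirakis keys: a larger weight pushes the key toward 1.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:n]]
```

Passing the current Cs vector on every iteration makes the draw counts even out over time, so fewer iterations are needed before every sample has been scored at least once.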

The uncoupled iterations are accelerated by multithreading and multiprocessing; two numerical variables must be shared and accessed among the different threads or processes.
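A thread-based sketch of the shared-variable update is given below, assuming a caller-supplied scoring function. The names `run_parallel`, `score_fn`, `workers`, and `seed` are hypothetical. The two shared variables are the vectors Co and Cs; the score is computed outside the lock, and only the short update of the shared vectors is serialized.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def run_parallel(Ts, n, k, score_fn, workers=4, seed=0):
    """Run k uncoupled iterations on a thread pool; return (Co, Cs)."""
    N = len(Ts)
    Co = [0.0] * N  # shared: running sum of scores
    Cs = [0] * N    # shared: draw counts
    lock = threading.Lock()

    def one_iteration(t):
        rng = random.Random(seed + t)  # separate generator per iteration
        idx = rng.sample(range(N), n)
        scores = score_fn([Ts[i] for i in idx])  # computed outside the lock
        with lock:  # lock, update both shared vectors, unlock
            for i, s in zip(idx, scores):
                Cs[i] += 1
                Co[i] += s

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes the results so that worker exceptions are re-raised
        list(pool.map(one_iteration, range(k)))
    return Co, Cs
```

In standard CPython, threads do not speed up CPU-bound pure-Python scoring because of the global interpreter lock; a process pool, with each worker returning its partial sums to be merged, would be the usual alternative there.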

The two numerical variables shared by the iterations are used to compute the outlier index of each sample. This index characterizes the sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
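Written per sample, the outlier index is the average factor over the iterations in which that sample was drawn. The symbols S_t (the subset drawn in iteration t) and f_t(i) (the factor computed for sample i within S_t) are notation introduced here for illustration; the restriction to Cs_i > 0 is likewise an assumption, since the text does not say how a sample that is never drawn should be handled.

```latex
\mathrm{Co}_i = \sum_{t=1}^{k} \mathbf{1}[\, i \in S_t \,]\, f_t(i), \qquad
\mathrm{Cs}_i = \sum_{t=1}^{k} \mathbf{1}[\, i \in S_t \,], \qquad
\mathrm{COI}_i = \frac{\mathrm{Co}_i}{\mathrm{Cs}_i} \quad (\mathrm{Cs}_i > 0)
```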

The beneficial effects of the invention are as follows: the method for rapidly screening outlier data from large-scale data takes full account of the time and space complexity of outlier mining on large-scale data, uses random sampling to reduce the number of samples that participate in the computation, and uses parallel computation to increase computing speed. It thereby addresses the high demands on computing time and memory space in outlier screening of large-scale data and achieves fast and effective outlier data screening.

Brief description of the drawings

Fig. 1 is a flow chart of screening outlier data from large-scale data;

Fig. 2 is a flow chart of computing the local isolation factors of the small sample obtained by sampling;

Fig. 3 is a flow chart of updating the variables shared by the iterations;

Fig. 4 is a diagram of the outlier index computation;

Fig. 5 is a flow chart of parallelized outlier data screening.

Embodiment

The method of the invention for rapidly screening outlier data from large-scale data is described in detail below with reference to the accompanying drawings.

The design of the method for rapidly screening outlier data from large-scale data is as follows:

1) The method is developed and implemented in six stages: data preprocessing, feature selection and transformation, variable initialization, iteration, outlier index calculation, and outlier data screening. To ensure the consistency of the workflow and the reusability of intermediate results, a single programming language is recommended for development;

2) The basic algorithms used in the invention may be written from scratch, or existing software packages may be used;

3) Distance measures are used repeatedly in the invention. The definition of distance is flexible: Euclidean distance, Manhattan distance, cosine distance and the like may be used. Because cosine distance is simpler and faster to compute, cosine distance is recommended;

4) Sampling may be completely random, or weighted sampling may be used, in which samples that have been sampled less often receive higher weights;

5) Because there is no coupling between the different iterations of step 4, a parallel iterative computation structure can be used (as shown in Fig. 5);

6) The uncoupled iterations are accelerated by multithreading and multiprocessing; two numerical variables must be shared and accessed among the different threads or processes, and a variable must be locked before its value is rewritten and unlocked afterwards;

7) The outlier index characterizes a sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
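The three distance measures named in item 3) can be written as follows. This is a plain-Python sketch using the standard definitions, with cosine distance taken as one minus the cosine similarity; the function names are chosen here for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance depends only on direction, so two samples that differ only in magnitude have distance zero. If the rows are scaled to unit length once in advance, each cosine distance reduces to one minus a dot product.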

The method of the invention provides one definition and computation method for the outlier index; in an actual implementation, the definition and the computation method may be refined on this basis.

Except for the technical features described in the specification, everything is known to those skilled in the art.

Claims

1. A method for rapidly screening outlier data from large-scale data, characterized in that random sampling is used to reduce the number of samples that participate in the computation and parallel computation is used to increase computing speed, thereby addressing the high demands on computing time and memory space in outlier screening of large-scale data and achieving fast and effective outlier data screening, the method comprising the following steps:

1) data preprocessing

2) feature selection and transformation

3) variable initialization

4) iteration

(1) randomly select a subset of samples of fixed size n;

(2) add 1 to the corresponding elements of vector Cs;

5) outlier index calculation

6) outlier data screening

2. The method according to claim 1, characterized in that random sampling yields a small sample whose size is far smaller than that of the original data set, and the sampling is completely random or weighted.

3. The method according to claim 1, characterized in that the uncoupled iterations are accelerated by multithreading and multiprocessing, and two numerical variables must be shared and accessed among the different threads or processes.

4. The method according to claim 1, characterized in that the two numerical variables shared by the iterations are used to compute the outlier index of each sample; this index characterizes the sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier, and the smaller the value, the less likely.