CN104346445A - Method for rapidly screening outlier data from massive data - Google Patents
Method for rapidly screening outlier data from massive data
- Publication number
- CN104346445A (application number CN201410584552.1)
- Authority
- CN
- China
- Prior art keywords
- data
- sample
- screening
- outlier
- vectorial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for rapidly screening outlier data from massive data. The method takes full account of the time and space complexity of outlier mining on massive data: it uses random sampling to reduce the number of samples involved in each computation and parallel computing to increase the computing speed. This addresses the high computing-time and memory requirements of outlier screening on massive data and enables rapid and effective outlier screening.
Description
Technical field
The present invention relates to the fields of computer technology and machine learning, and specifically to a method for rapidly screening outlier data from large-scale data.
Background
Outlier data are data in a large dataset that are inconsistent with the general behavior or model of the rest of the data. Outliers are generally believed to arise for two reasons:
1) Measurement or execution errors. Screening for this type of outlier filters impure or problematic data out of a large dataset and thereby improves the overall quality of the data;
2) Inherent variation in the data. Because this type of data exists objectively, screening for it is important. For example, discovering previously unknown, objectively existing outliers in scientific data can substantially advance the related theory.
As data continue to accumulate and datasets continue to grow, it becomes increasingly difficult for traditional outlier mining algorithms to screen outliers under existing computing conditions. To address this problem, the invention discloses a method for rapidly screening outlier data from large-scale data. The method takes full account of the time and space complexity of outlier mining on large-scale data. It uses random sampling to reduce the number of samples involved in each computation and parallel computing to speed up the computation, which addresses the high computing-time and memory requirements of outlier screening on large-scale data and enables fast and effective outlier screening.
Summary of the invention
The object of the invention is to provide a method for rapidly screening outlier data from large-scale data.
The object is achieved as follows: random sampling reduces the number of samples involved in each computation, and parallel computing speeds up the computation. This addresses the high computing-time and memory requirements of outlier screening on large-scale data and enables fast and effective outlier screening. The method comprises the following steps:
1) Data preprocessing
Preprocess the data to eliminate inconsistencies among the data and to normalize each data item. The specific operations include data cleaning, data integration, data transformation, and data reduction. The resulting feature matrix is denoted T and has size N*M, where N is the number of samples and M is the number of original feature attributes;
2) Feature selection and transformation
Feature selection removes, from all attributes, those that contribute little or nothing to the subsequent operations. Feature transformation derives the attributes of a new feature space from the current attributes. The resulting feature matrix is denoted Ts and has size N*m, where N is the number of samples and m is the number of attributes after selection and transformation;
3) Variable initialization
Create two all-zero vectors of length N, denoted Co and Cs. In the subsequent calculation, Co stores the running sum of the outlier factors of each sample and Cs stores the number of times each sample has been selected;
4) Iteration
Update the vectors Co and Cs by the following iteration, stopping after a given number of iterations k:
(1) Randomly select a subset of samples of fixed size n;
(2) Add 1 to the corresponding elements of the vector Cs;
(3) Extract the corresponding rows from the matrix Ts and compute the local outlier factor of each row of the resulting submatrix;
(4) Add the local outlier factors obtained in the previous step to the corresponding elements of the vector Co;
5) Outlier index calculation
Compute the outlier index vector COI from the vectors Co and Cs using the formula COI=Co/Cs;
6) Outlier data screening
Sort the samples by their COI values in descending order and select the first l samples as outlier data.
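Steps 3) to 6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `k_neighbors` parameter, the Euclidean metric, and the simplified textbook local outlier factor (exactly `k_neighbors` neighbours, ties ignored) are all choices made for the example. The patent does not say how to treat samples that are never drawn, so they are given an index of negative infinity here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_outlier_factors(rows, k_neighbors=3, dist=euclidean):
    """Simplified textbook LOF for every row of a small sample (assumed variant)."""
    n = len(rows)
    if n <= k_neighbors:
        raise ValueError("sample must contain more rows than k_neighbors")
    d = [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    # Indices of the k nearest neighbours of each row, excluding the row itself.
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k_neighbors]
           for i in range(n)]
    k_dist = [d[i][knn[i][-1]] for i in range(n)]
    # Local reachability density; the tiny constant avoids division by zero
    # when a row has exact duplicates.
    lrd = []
    for i in range(n):
        reach = sum(max(k_dist[j], d[i][j]) for j in knn[i])
        lrd.append(k_neighbors / (reach + 1e-12))
    return [sum(lrd[j] for j in knn[i]) / (k_neighbors * lrd[i]) for i in range(n)]


def screen_outliers(Ts, n, k, l, k_neighbors=3, seed=0):
    """Return the indices of the top-l samples by outlier index, and the COI vector."""
    rng = random.Random(seed)
    N = len(Ts)
    Co = [0.0] * N  # step 3: running sum of outlier factors
    Cs = [0] * N    # step 3: number of times each sample was selected
    for _ in range(k):                                   # step 4
        idx = rng.sample(range(N), n)                    # (1) random subset of size n
        lof = local_outlier_factors([Ts[i] for i in idx], k_neighbors)  # (3)
        for i, score in zip(idx, lof):
            Cs[i] += 1                                   # (2)
            Co[i] += score                               # (4)
    # Step 5: COI = Co / Cs element-wise; never-drawn samples rank last (assumption).
    COI = [Co[i] / Cs[i] if Cs[i] else float("-inf") for i in range(N)]
    # Step 6: sort in descending order and keep the first l samples.
    ranked = sorted(range(N), key=lambda i: COI[i], reverse=True)
    return ranked[:l], COI
```

Each iteration only ever builds an n-by-n distance table, so memory use depends on the sample size n rather than on N, which is the point of the sampling step.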
Random sampling yields a small sample whose size is far smaller than that of the original dataset; either completely random sampling or weighted sampling may be used.
The uncoupled iterations are accelerated by multithreading and multiprocessing; the different threads or processes must share access to two numerical variables.
The two numerical variables shared by the iterations are used to compute the outlier index of each sample. This index characterizes the sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
The beneficial effects of the invention are as follows: the method for rapidly screening outlier data from large-scale data takes full account of the time and space complexity of outlier mining on large-scale data. It uses random sampling to reduce the number of samples involved in each computation and parallel computing to speed up the computation, which addresses the high computing-time and memory requirements of outlier screening on large-scale data and enables fast and effective outlier screening.
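The text leaves the weighting scheme open. As one possible sketch, the draw below favours samples that have been selected less often so far, using the per-sample selection counts as input. The weight formula `1 / (1 + count)` and the function name are assumptions made for illustration; the selection itself uses the standard weighted-reservoir key trick (a uniform random number raised to the power `1 / weight`, keeping the n largest keys), which draws without replacement.

```python
import random


def weighted_sample_indices(counts, n, rng=random):
    """Draw n distinct indices; indices with a smaller count are favoured.

    counts[i] is the number of times sample i has been selected so far.
    The weight 1 / (1 + counts[i]) is an illustrative assumption.
    """
    weights = [1.0 / (1 + c) for c in counts]
    # Weighted sampling without replacement: key = u ** (1 / w), keep the largest keys.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(len(counts)), key=lambda i: keys[i], reverse=True)
    return order[:n]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible; completely random sampling corresponds to giving every sample the same weight.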
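The shared-variable update can be sketched with Python threads. This is an illustration under assumptions rather than the patented implementation: `parallel_accumulate`, `score_subset` (a hypothetical callable that returns one outlier factor per row of the sampled submatrix), the per-iteration seeding, and the worker count are all invented for the example. The scoring runs outside the lock; only the update of the two shared vectors is serialized.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_accumulate(Ts, n, k, score_subset, workers=4, seed=0):
    """Run k independent sampling iterations and accumulate into shared vectors."""
    N = len(Ts)
    Co = [0.0] * N            # shared: sum of outlier factors per sample
    Cs = [0] * N              # shared: selection count per sample
    lock = threading.Lock()   # guards every write to Co and Cs

    def one_iteration(t):
        rng = random.Random(seed + t)                 # reproducible stream per iteration
        idx = rng.sample(range(N), n)
        scores = score_subset([Ts[i] for i in idx])   # computed without holding the lock
        with lock:                                    # lock, update, unlock
            for i, s in zip(idx, scores):
                Cs[i] += 1
                Co[i] += s

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(one_iteration, range(k)))
    return Co, Cs
```

In standard CPython builds with the global interpreter lock, threads mainly help when the scoring function releases the lock (for example inside native numerical code). A multi-process variant would need shared memory or would have each process return partial sums to be merged afterwards.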
Brief description of the drawings
Fig. 1 is a flow chart of screening outlier data from large-scale data;
Fig. 2 is a flow chart of computing the local outlier factors of the small sample obtained by sampling;
Fig. 3 is a flow chart of updating the shared variables during the iteration;
Fig. 4 is a diagram of the outlier index calculation process;
Fig. 5 is a flow chart of parallelized outlier data screening.
Detailed description
The method of the invention for rapidly screening outlier data from large-scale data is described in detail below with reference to the accompanying drawings.
The design of the method for rapidly screening outlier data from large-scale data is as follows:
1) Development and implementation are divided into six stages: data preprocessing, feature selection and transformation, variable initialization, iteration, outlier index calculation, and outlier data screening. To keep the workflow consistent and the intermediate results reusable, using a single programming language throughout is recommended;
2) The basic algorithms used in the invention can be written from scratch or taken from existing software packages;
3) Distance metrics are used repeatedly in the invention. The definition of distance is flexible: Euclidean distance, Manhattan distance, cosine distance, and others may be used. Because cosine distance is simple and fast to compute, it is recommended;
4) Sampling may be completely random or weighted; in weighted sampling, samples that have been drawn less often receive higher weights;
5) Because the iterations of step 4 are not coupled to one another, they can be computed in a parallel structure (as shown in Fig. 5);
6) The uncoupled iterations are accelerated by multithreading and multiprocessing, and the different threads or processes must share access to two numerical variables. When a value is rewritten, the variable must be locked before the write and unlocked afterwards;
7) The outlier index characterizes a sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
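The three distance metrics named in item 3) can be written as interchangeable functions, so that the outlier-factor computation can take the metric as a parameter. The sketch below uses the usual definitions; the function names are chosen for the example, and raising an error for a zero vector is an assumption, since the text does not say how that case should be handled.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance depends only on direction, so two samples that differ only by a positive scale factor have distance zero. Whether that is desirable depends on how the features were normalized during preprocessing.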
The method of the invention provides one definition and one calculation method for the outlier index; an actual implementation may refine both on this basis.
Everything other than the technical features described in the specification is known to those skilled in the art.
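For reference, the outlier index produced by steps 3) to 5) can be written per sample. The notation below is introduced only for this illustration: S_t is the subset drawn in iteration t, and F_t(i) is the outlier factor computed for sample i within that subset.

```latex
Cs_i = \sum_{t=1}^{k} \mathbf{1}[\, i \in S_t \,], \qquad
Co_i = \sum_{t=1}^{k} \mathbf{1}[\, i \in S_t \,]\, F_t(i), \qquad
COI_i = \frac{Co_i}{Cs_i} \quad (Cs_i > 0)
```

COI_i is therefore the average outlier factor of sample i over the iterations in which it was drawn, and it is undefined for a sample that was never drawn. Under completely random sampling without replacement, each sample is drawn in a given iteration with probability n/N, so the expected value of Cs_i is k·n/N; for example, k = 200 iterations of size n = 20 over N = 1000 samples draw each sample 4 times on average.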
Claims (4)
1. A method for rapidly screening outlier data from large-scale data, characterized in that random sampling is used to reduce the number of samples involved in each computation and parallel computing is used to speed up the computation, thereby addressing the high computing-time and memory requirements of outlier screening on large-scale data and enabling fast and effective outlier screening, the method comprising the following steps:
1) Data preprocessing
Preprocess the data to eliminate inconsistencies among the data and to normalize each data item. The specific operations include data cleaning, data integration, data transformation, and data reduction. The resulting feature matrix is denoted T and has size N*M, where N is the number of samples and M is the number of original feature attributes;
2) Feature selection and transformation
Feature selection removes, from all attributes, those that contribute little or nothing to the subsequent operations. Feature transformation derives the attributes of a new feature space from the current attributes. The resulting feature matrix is denoted Ts and has size N*m, where N is the number of samples and m is the number of attributes after selection and transformation;
3) Variable initialization
Create two all-zero vectors of length N, denoted Co and Cs. In the subsequent calculation, Co stores the running sum of the outlier factors of each sample and Cs stores the number of times each sample has been selected;
4) Iteration
Update the vectors Co and Cs by the following iteration, stopping after a given number of iterations k:
(1) Randomly select a subset of samples of fixed size n;
(2) Add 1 to the corresponding elements of the vector Cs;
(3) Extract the corresponding rows from the matrix Ts and compute the local outlier factor of each row of the resulting submatrix;
(4) Add the local outlier factors obtained in the previous step to the corresponding elements of the vector Co;
5) Outlier index calculation
Compute the outlier index vector COI from the vectors Co and Cs using the formula COI=Co/Cs;
6) Outlier data screening
Sort the samples by their COI values in descending order and select the first l samples as outlier data.
2. The method according to claim 1, characterized in that random sampling yields a small sample whose size is far smaller than that of the original dataset, and that either completely random sampling or weighted sampling is used.
3. The method according to claim 1, characterized in that the uncoupled iterations are accelerated by multithreading and multiprocessing, and the different threads or processes must share access to two numerical variables.
4. The method according to claim 1, characterized in that the two numerical variables shared by the iterations are used to compute the outlier index of each sample; this index characterizes the sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410584552.1A CN104346445B (en) | 2014-10-28 | 2014-10-28 | Method for rapidly screening outlier data from large-scale data |
PCT/CN2015/072972 WO2016065775A1 (en) | 2014-10-28 | 2015-02-13 | Method for rapid screening of outlier data from large-scale data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410584552.1A CN104346445B (en) | 2014-10-28 | 2014-10-28 | Method for rapidly screening outlier data from large-scale data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104346445A (en) | 2015-02-11 |
CN104346445B (en) | 2016-09-07 |
Family
ID=52502036
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410584552.1A Active CN104346445B (en) | Method for rapidly screening outlier data from large-scale data | 2014-10-28 | 2014-10-28 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104346445B (en) |
WO (1) | WO2016065775A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102243641A (en) * | 2011-04-29 | 2011-11-16 | 西安交通大学 | Method for efficiently clustering massive data |
CN103310122A (en) * | 2013-07-10 | 2013-09-18 | 北京航空航天大学 | Parallel random sampling consensus method and device |
US20130339367A1 (en) * | 2012-06-14 | 2013-12-19 | Santhosh Adayikkoth | Method and system for preferential accessing of one or more critical entities |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7050932B2 (en) * | 2002-08-23 | 2006-05-23 | International Business Machines Corporation | Method, system, and computer program product for outlier detection |
CN102799616B (en) * | 2012-06-14 | 2014-11-05 | 北京大学 | Outlier point detection method in large-scale social network |
CN104008420A (en) * | 2014-05-26 | 2014-08-27 | 中国科学院信息工程研究所 | Distributed outlier detection method and system based on automatic coding machine |
CN104346445B (en) * | 2014-10-28 | 2016-09-07 | 浪潮电子信息产业股份有限公司 | Method for rapidly screening outlier data from large-scale data |
- 2014-10-28: CN application CN201410584552.1A, patent CN104346445B (en), status Active
- 2015-02-13: WO application PCT/CN2015/072972, publication WO2016065775A1 (en), status Application Filing
Non-Patent Citations (1)
Title |
---|
娄圣金 等: "一种基于p权值的离群数据挖掘算法", 《小型微型计算机系统》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016065775A1 (en) * | 2014-10-28 | 2016-05-06 | 浪潮电子信息产业股份有限公司 | Method for rapid screening of outlier data from large-scale data |
CN104966094A (en) * | 2015-05-26 | 2015-10-07 | 浪潮电子信息产业股份有限公司 | Large-scale data set outlier data mining method based on graph theory method |
CN104966094B (en) * | 2015-05-26 | 2018-04-17 | 浪潮电子信息产业股份有限公司 | Large-scale data set outlier data mining method based on graph theory method |
CN105868387A (en) * | 2016-04-14 | 2016-08-17 | 江苏马上游科技股份有限公司 | Method for outlier data mining based on parallel computation |
CN110674373A (en) * | 2019-09-17 | 2020-01-10 | 上海森亿医疗科技有限公司 | Big data processing method, device, equipment and storage medium based on sensitive data |
CN110674373B (en) * | 2019-09-17 | 2020-08-07 | 上海森亿医疗科技有限公司 | Big data processing method, device, equipment and storage medium based on sensitive data |
Also Published As
Publication number | Publication date |
---|---|
CN104346445B (en) | 2016-09-07 |
WO2016065775A1 (en) | 2016-05-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |