CN104346445A - Method for rapidly screening outlier data from massive data - Google Patents

Method for rapidly screening outlier data from massive data

Info

Publication number
CN104346445A
Authority
CN
China
Prior art keywords
data
sample
screening
outlier
vectorial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410584552.1A
Other languages
Chinese (zh)
Other versions
CN104346445B (en)
Inventor
王恩东
张东
吴楠
韦鹏
付兴旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Langchao Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Langchao Electronic Information Industry Co Ltd
Priority to CN201410584552.1A (patent CN104346445B)
Publication of CN104346445A
Priority to PCT/CN2015/072972 (patent WO2016065775A1)
Application granted
Publication of CN104346445B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for rapidly screening outlier data from massive data. The method takes into account the time and space complexity of outlier mining on massive data: it uses random sampling to reduce the number of samples involved in each computation and parallel computing to increase computing speed. This addresses the high demands on computing time and memory space when screening outliers in massive data and enables fast, effective outlier screening.

Description

A method for rapidly screening outlier data from large-scale data
Technical field
The present invention relates to the fields of computer technology and machine learning, and specifically to a method for rapidly screening outlier data from large-scale data.
Background
Outlier data are data within a large data set that are inconsistent with the general behavior or model of the rest of the data. Outlier data are generally considered to arise from two causes:
1) Measurement or execution errors. Screening outliers of this type filters impurities or problematic records out of the data and thereby improves overall data quality;
2) Inherent variability in the data. Because outliers of this type exist objectively, screening them is important. For example, finding previously unknown, objectively existing outliers in scientific data can advance the related theory.
As data keep accumulating and data sets keep growing, it becomes increasingly difficult for traditional outlier mining algorithms to screen outliers with the computing resources available. To address this problem, the invention discloses a method for rapidly screening outlier data from large-scale data. The method takes into account the time and space complexity of outlier mining on large-scale data: it uses random sampling to reduce the number of samples involved in each computation and parallel computing to speed up the computation, which addresses the high demands on computing time and memory space and enables fast, effective outlier screening.
Summary of the invention
The object of the invention is to provide a method for rapidly screening outlier data from large-scale data.
This object is achieved as follows: random sampling reduces the number of samples involved in each computation, and parallel computing speeds up the computation, which addresses the high demands on computing time and memory space when screening outliers in large-scale data and enables fast, effective outlier screening. The method comprises the following steps:
1) Data preprocessing
Preprocess the data to eliminate inconsistencies between records and to normalize each record. The operations include data cleaning, data integration, data transformation, and data reduction. The resulting feature matrix is denoted T and has size N*M, where N is the total number of samples and M is the number of original feature attributes;
2) Feature selection and transformation
Feature selection removes attributes that contribute little or nothing to the subsequent operations; feature transformation derives the attributes of a new feature space from the current attributes. The resulting feature matrix is denoted Ts and has size N*m, where N is the total number of samples and m is the number of attributes after selection and transformation;
3) Variable initialization
Create two all-zero vectors of length N, denoted Co and Cs, which store, respectively, the accumulated sum of outlier factors and the number of times each sample has been selected;
4) Iteration
Update vectors Co and Cs by repeating the following steps, stopping after a set number of iterations k:
(1) randomly select a sample set of fixed size n;
(2) add 1 to the elements of Cs that correspond to the selected samples;
(3) extract the corresponding rows from matrix Ts and compute the local outlier factor of each row of the resulting submatrix;
(4) add the local outlier factors obtained in the previous step to the corresponding elements of Co;
5) Outlier index calculation
Compute the outlier index vector COI from vectors Co and Cs, element by element, using the formula COI=Co/Cs;
6) Outlier data screening
Sort the samples by their COI values in descending order and select the first l samples as outlier data.
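Steps 3) to 6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the per-subsample factor of step 4)(3) is read here as the textbook local outlier factor (LOF), the distance is Euclidean, and the names `local_outlier_factors`, `screen_outliers`, `n_neighbors` and `seed` are hypothetical. The patent does not say how to treat a sample that is never drawn (Cs equal to 0); the sketch assigns it an index of 0.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_outlier_factors(rows, n_neighbors=3, dist=euclidean):
    """Textbook LOF on a small sample (O(n^2)); needs len(rows) > n_neighbors."""
    n = len(rows)
    d = [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    # Indices of the n_neighbors nearest other points, closest first.
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:n_neighbors]
           for i in range(n)]
    k_dist = [d[i][knn[i][-1]] for i in range(n)]  # distance to the k-th neighbour
    lrd = []
    for i in range(n):
        # Reachability distance from point i to each of its neighbours.
        reach = [max(k_dist[j], d[i][j]) for j in knn[i]]
        # Local reachability density; the epsilon guards against duplicate rows.
        lrd.append(1.0 / (sum(reach) / len(reach) + 1e-12))
    return [sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i]) for i in range(n)]


def screen_outliers(Ts, k, n, l, n_neighbors=3, seed=0):
    """Steps 3-6: accumulate LOF over k random subsamples of size n, rank by COI."""
    rng = random.Random(seed)
    N = len(Ts)
    Co = [0.0] * N  # accumulated outlier factors
    Cs = [0] * N    # number of times each sample was selected
    for _ in range(k):
        idx = rng.sample(range(N), n)  # completely random, without replacement
        factors = local_outlier_factors([Ts[i] for i in idx], n_neighbors)
        for i, f in zip(idx, factors):
            Cs[i] += 1
            Co[i] += f
    # Assumption: a sample that was never drawn gets COI = 0.
    COI = [Co[i] / Cs[i] if Cs[i] else 0.0 for i in range(N)]
    ranked = sorted(range(N), key=lambda i: COI[i], reverse=True)
    return ranked[:l], COI
```

Each iteration costs on the order of n squared distance evaluations instead of N squared, which is where the saving in time and memory comes from. The number of iterations k should be large enough that every sample is likely to be drawn at least once.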
Random sampling yields a small sample whose size is far smaller than that of the original data set; the sampling may be completely random or weighted.
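Weighted sampling can be sketched as follows. The specification only says that samples drawn less often receive higher weights; the concrete weight 1/(1+Cs[i]) and the function name `weighted_sample` are assumptions made for illustration. Drawing n distinct indices with these weights uses the Efraimidis-Spirakis key method (each index gets the key u^(1/w) for a uniform random u, and the n largest keys win).

```python
import random


def weighted_sample(Cs, n, rng):
    """Draw n distinct indices; indices selected less often so far weigh more.

    Cs[i] is the number of times sample i has already been selected.
    """
    # Assumed weighting: inverse of (1 + selection count).
    weights = [1.0 / (1 + c) for c in Cs]
    # Efraimidis-Spirakis: key = u ** (1 / w); keep the n largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:n]]


rng = random.Random(0)
subset = weighted_sample([0, 3, 1, 0, 7, 2], n=4, rng=rng)
```

Compared with completely random sampling, this tends to even out the selection counts in Cs, so fewer iterations are needed before every sample has been scored.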
The iterations are not coupled to one another, so they can be accelerated with multiple threads or multiple processes; the threads or processes must share access to the two numerical variables (Co and Cs).
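The shared-variable update can be sketched with Python threads and a lock. This is an illustrative sketch only: `parallel_accumulate`, `score_fn`, `workers` and `seed` are hypothetical names, and `score_fn` stands in for whatever per-subsample outlier factor is used. The scoring runs outside the lock; only the writes to the two shared vectors are serialized.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_accumulate(Ts, k, n, score_fn, workers=4, seed=0):
    """Run k independent iterations in a thread pool, updating shared Co and Cs."""
    N = len(Ts)
    Co = [0.0] * N  # shared: accumulated outlier factors
    Cs = [0] * N    # shared: selection counts
    lock = threading.Lock()

    def one_iteration(it):
        rng = random.Random(seed + it)  # separate random stream per iteration
        idx = rng.sample(range(N), n)
        scores = score_fn([Ts[i] for i in idx])  # expensive part, no lock held
        with lock:  # lock before modifying the shared vectors, unlock after
            for i, s in zip(idx, scores):
                Cs[i] += 1
                Co[i] += s

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(one_iteration, range(k)))  # consume to surface exceptions
    return Co, Cs
```

In CPython, threads do not speed up CPU-bound pure-Python scoring because of the global interpreter lock; a process-based variant would return each iteration's (indices, scores) pair to the parent and accumulate there, which avoids sharing the vectors across processes altogether.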
The two variables shared across the iterations are used to compute the outlier index of each sample. This index characterizes the sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
The beneficial effects of the invention are as follows: the method for rapidly screening outlier data from large-scale data takes into account the time and space complexity of outlier mining on large-scale data, uses random sampling to reduce the number of samples involved in each computation, and uses parallel computing to speed up the computation. It thereby addresses the high demands on computing time and memory space when screening outliers in large-scale data and enables fast, effective outlier screening.
Description of the drawings
Fig. 1 is a flowchart of screening outlier data from large-scale data;
Fig. 2 is a flowchart of computing the local outlier factor of the small sample obtained by sampling;
Fig. 3 is a flowchart of updating the variables shared across iterations;
Fig. 4 is a diagram of the outlier index computation;
Fig. 5 is a flowchart of parallelized outlier data screening.
Embodiments
The method of the invention for rapidly screening outlier data from large-scale data is described in detail below with reference to the accompanying drawings.
The design of the method is as follows:
1) Development and implementation are divided into six stages: data preprocessing, feature selection and transformation, variable initialization, iteration, outlier index calculation, and outlier data screening. To keep the workflow consistent and intermediate results reusable, a single programming language is recommended for all stages;
2) The basic algorithms used in the invention may be written from scratch or taken from existing software packages;
3) The invention uses distance measures repeatedly. The definition of distance is flexible: Euclidean distance, Manhattan distance, cosine distance, and so on may be used. Because cosine distance is comparatively simple and fast to compute, cosine distance is recommended;
4) Sampling may be completely random or weighted; with weighted sampling, samples that have been drawn less often receive higher weights;
5) Because the iterations of step 4 are not coupled to one another, a parallel iteration structure may be used (as shown in Fig. 5);
6) The uncoupled iterations are accelerated with multiple threads or multiple processes, which must share access to the two numerical variables; a variable must be locked before its values are modified and unlocked afterwards;
7) The outlier index characterizes a sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
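The three distance measures named in item 3) can be written as below. This is a plain-Python sketch for illustration; the function names are hypothetical, and cosine distance is taken in its usual form of one minus cosine similarity, which the specification does not spell out. Raising an error for a zero vector is likewise a choice made here, not something the source prescribes.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine distance is undefined for a zero vector.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance ignores vector magnitude and compares direction only, so the choice of measure interacts with the normalization applied during preprocessing.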
The method gives one definition of the outlier index and one way of computing it; practical implementations may refine the definition and the computation on this basis.
Except for the technical features described in this specification, everything else is known to those skilled in the art.

Claims (4)

1. A method for rapidly screening outlier data from large-scale data, characterized in that random sampling is used to reduce the number of samples involved in each computation and parallel computing is used to speed up the computation, thereby addressing the high demands on computing time and memory space when screening outlier data in large-scale data and achieving fast and effective outlier data screening, the method comprising the following steps:
1) Data preprocessing
Preprocess the data to eliminate inconsistencies between records and to normalize each record. The operations include data cleaning, data integration, data transformation, and data reduction. The resulting feature matrix is denoted T and has size N*M, where N is the total number of samples and M is the number of original feature attributes;
2) Feature selection and transformation
Feature selection removes attributes that contribute little or nothing to the subsequent operations; feature transformation derives the attributes of a new feature space from the current attributes. The resulting feature matrix is denoted Ts and has size N*m, where N is the total number of samples and m is the number of attributes after selection and transformation;
3) Variable initialization
Create two all-zero vectors of length N, denoted Co and Cs, which store, respectively, the accumulated sum of outlier factors and the number of times each sample has been selected;
4) Iteration
Update vectors Co and Cs by repeating the following steps, stopping after a set number of iterations k:
(1) randomly select a sample set of fixed size n;
(2) add 1 to the elements of Cs that correspond to the selected samples;
(3) extract the corresponding rows from matrix Ts and compute the local outlier factor of each row of the resulting submatrix;
(4) add the local outlier factors obtained in the previous step to the corresponding elements of Co;
5) Outlier index calculation
Compute the outlier index vector COI from vectors Co and Cs, element by element, using the formula COI=Co/Cs;
6) Outlier data screening
Sort the samples by their COI values in descending order and select the first l samples as outlier data.
2. The method according to claim 1, characterized in that random sampling yields a small sample whose size is far smaller than that of the original data set, the sampling being completely random or weighted.
3. The method according to claim 1, characterized in that the uncoupled iterations are accelerated with multiple threads or multiple processes, the threads or processes sharing access to the two numerical variables.
4. The method according to claim 1, characterized in that the two variables shared across the iterations are used to compute the outlier index of each sample, the index characterizing the sample's tendency to be an outlier: the larger the value, the more likely the sample is an outlier; the smaller the value, the less likely.
CN201410584552.1A 2014-10-28 2014-10-28 Method for rapidly screening outlier data from large-scale data Active CN104346445B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410584552.1A CN104346445B (en) 2014-10-28 2014-10-28 Method for rapidly screening outlier data from large-scale data
PCT/CN2015/072972 WO2016065775A1 (en) 2014-10-28 2015-02-13 Method for rapid screening of outlier data from large-scale data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410584552.1A CN104346445B (en) 2014-10-28 2014-10-28 Method for rapidly screening outlier data from large-scale data

Publications (2)

Publication Number Publication Date
CN104346445A 2015-02-11
CN104346445B 2016-09-07

Family

ID=52502036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410584552.1A Active CN104346445B (en) 2014-10-28 2014-10-28 Method for rapidly screening outlier data from large-scale data

Country Status (2)

Country Link
CN (1) CN104346445B (en)
WO (1) WO2016065775A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7050932B2 (en) * 2002-08-23 2006-05-23 International Business Machines Corporation Method, system, and computer program product for outlier detection
CN102799616B (en) * 2012-06-14 2014-11-05 北京大学 Outlier point detection method in large-scale social network
CN104008420A (en) * 2014-05-26 2014-08-27 中国科学院信息工程研究所 Distributed outlier detection method and system based on automatic coding machine
CN104346445B (en) * 2014-10-28 2016-09-07 浪潮电子信息产业股份有限公司 A kind of method quickly screening Outlier Data from large-scale data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102243641A (en) * 2011-04-29 2011-11-16 西安交通大学 Method for efficiently clustering massive data
US20130339367A1 (en) * 2012-06-14 2013-12-19 Santhosh Adayikkoth Method and system for preferential accessing of one or more critical entities
CN103310122A (en) * 2013-07-10 2013-09-18 北京航空航天大学 Parallel random sampling consensus method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
娄圣金 等: "一种基于p权值的离群数据挖掘算法", 《小型微型计算机系统》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016065775A1 (en) * 2014-10-28 2016-05-06 浪潮电子信息产业股份有限公司 Method for rapid screening of outlier data from large-scale data
CN104966094A (en) * 2015-05-26 2015-10-07 浪潮电子信息产业股份有限公司 Large-scale data set outlier data mining method based on graph theory method
CN104966094B (en) * 2015-05-26 2018-04-17 浪潮电子信息产业股份有限公司 Large-scale data set outlier data mining method based on graph theory method
CN105868387A (en) * 2016-04-14 2016-08-17 江苏马上游科技股份有限公司 Method for outlier data mining based on parallel computation
CN110674373A (en) * 2019-09-17 2020-01-10 上海森亿医疗科技有限公司 Big data processing method, device, equipment and storage medium based on sensitive data
CN110674373B (en) * 2019-09-17 2020-08-07 上海森亿医疗科技有限公司 Big data processing method, device, equipment and storage medium based on sensitive data

Also Published As

Publication number Publication date
CN104346445B (en) 2016-09-07
WO2016065775A1 (en) 2016-05-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant