CN105469123A - Missing data completion method based on k plane regression - Google Patents

Info

Publication number
CN105469123A
Authority
CN
China
Prior art keywords
data
plane
completion
regression
carried out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511025065.2A
Other languages
Chinese (zh)
Inventor
袁玉波
阮彤
邱文强
汤伟
赵婷婷
高炬
殷亦超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology
Priority to CN201511025065.2A
Publication of CN105469123A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention provides a new method for completing missing data in a database. The method comprises the following steps: 1, missing-value detection is carried out on a given data set; 2, the dimension of the input variables is reduced: the correlation between input dimensions is analyzed, principal component analysis (PCA) is used to select the correlated input dimensions, and a new input data set is formed; 3, the training set is partitioned into k parts: K-means clustering is applied to the input training set, yielding k classes of training data; 4, a k-plane regression function is built, the optimal regression coefficients and the geometric center of each plane are solved, and a regression fitting function is given; finally, a data completion test is carried out. Experiments indicate that the completion method is effective: within the permissible error range it yields a completed database of practical value, it addresses to a certain degree the problems that incomplete data pose for machine learning and data mining, and it supports progress in big data applications.

Description

A missing data completion method based on K-plane regression
Technical field
The present invention relates to data mining technology, and specifically to a missing data completion method based on K-plane regression.
Background
Ideally, every record in a data set would be complete. In the real world, however, incomplete and noisy data are ubiquitous. In data mining and pattern recognition, missing data can have a large impact. For example, they can affect the correctness of patterns extracted from the data set and the accuracy of derived rules, which can lead to an erroneous data mining model. Moreover, the overwhelming majority of current data mining algorithms cannot process or analyze data sets that contain missing values. If records with missing data are simply discarded instead of being processed and analyzed, a large amount of information is lost and bias is introduced, producing a systematic difference between the incomplete observations and the complete observations. Analyzing and completing missing data is therefore both necessary and meaningful.
Current missing data completion methods can be roughly divided into the following classes. The first class comprises the simplest and most common methods: global constant imputation and attribute mean imputation. These two methods fill the missing attribute either with a constant or with the mean of the attribute to be filled. In most cases these methods, like discarding the records with missing data, produce biased results.
The second class comprises single imputation and multiple imputation. Single imputation fills a missing value with the value of the object most similar to it. Similarity is most commonly judged by using a correlation matrix to find the attribute most correlated with the attribute containing the missing value; all objects are then sorted by the value of that most correlated attribute, and the missing value is filled with the value of the object ranked immediately before it. Compared with mean imputation, the standard deviation of the variable stays closer to its value before filling, but the method is inconvenient to use, time-consuming, and systematically underestimates variance. Multiple imputation replaces each missing value with a series of plausible values, to reflect the uncertainty of the replaced data. The several data sets generated by the repeated replacement are then analyzed with standard statistical procedures, and finally the statistics from each data set are combined to obtain an estimate of the population parameters.
The third class uses a model to predict the missing data. Such a method first defines a model for the input data and then performs maximum likelihood estimation of the unknown parameters based on this model. Many researchers have explored this approach. In 2012, Ji Liu proposed a tensor estimation method for missing values in visual data. In 2014, Emil Eirola proposed a Gaussian mixture model method for distance estimation with missing data. Also in 2014, Zhengbang Li proposed a mixture regression analysis for block-missing data. Although these methods achieve good results, their completion accuracy on some data still needs improvement.
Summary of the invention
The object of the invention is to address missing values in a data set by proposing a missing data completion method based on k-plane regression. Cluster analysis is first performed on the data to group them into K classes, and regression analysis is then performed on each class; the regression output is the completed data.
The technical scheme of the invention is as follows:
Step 1: perform data preprocessing. Carry out missing-value detection on the data set, select the records without missing values as experimental data, and take the dimension to be completed as the output and the remaining dimensions as the input.
Step 2: initialize the parameters.
These include the permissible completion error, the manually set parameters, the number of iterations of the algorithm, the number of planes K, and the dimension after reduction.
Step 3: reduce the dimension using PCA.
The main purpose is to use PCA to screen the regression variables, selecting the best variables from the subsets of the original variables to form the optimized variable set.
Step 4: normalize the new variable set obtained in step 3 to reduce the interference of noisy data. Select 70% of the data as the training set and 30% as the test set.
Step 5: perform K-means cluster analysis on the training data.
K-means cluster analysis groups the training data into K classes. Each class can be fitted by a corresponding plane, and the center of each class can be regarded as the initial geometric center μ of that plane.
Step 6: solve for the regression coefficients ω and the geometric center μ of each plane.
The geometric center μ and regression coefficients ω of each plane are obtained by iteratively minimizing the error function. The data set S contained in each plane is then redetermined from the regression coefficients and geometric centers, and the new plane centers are computed. This step is repeated until the plane centers no longer change and the regression coefficients are stable, that is, until the error function converges.
Step 7: using the regression coefficients ω and geometric centers μ obtained in step 6, perform regression prediction on the test data; the result is the predicted completion data.
Step 8: evaluate the prediction results using four indexes: maximum deviation, minimum deviation, mean deviation, and prediction precision.
Experimental results show that the missing data completion algorithm based on K-plane regression performs well.
Brief description of the drawings
The various aspects of the invention will become apparent to the reader after reading the specific embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of the missing data completion method based on K-plane regression of the present invention;
Fig. 2 is a table describing the data sets used in the experiments of the present invention;
Fig. 3 shows the experimental results of the present invention.
Detailed description of the embodiment
Step 1: carry out missing-data detection manually, taking the data to be completed as the output and the remaining data as the input.
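The split described in step 1 can be illustrated with a minimal sketch. It assumes that each record is a list of numbers in which `None` marks a missing value; this representation, the function name `split_by_missing`, and the decision to skip records with missing inputs are illustrative assumptions, since the text only says that detection is done manually.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]


def split_by_missing(rows: List[Row], target: int) -> Tuple[List[Row], List[Row]]:
    """Separate rows with a known target value from rows where it is missing.

    `target` is the index of the dimension to be completed (the output);
    all other columns are treated as inputs. Rows with a missing input
    value are skipped, because the regression needs complete inputs.
    """
    complete, to_fill = [], []
    for row in rows:
        inputs_ok = all(v is not None for i, v in enumerate(row) if i != target)
        if not inputs_ok:
            continue  # assumption: such rows are handled elsewhere
        (to_fill if row[target] is None else complete).append(row)
    return complete, to_fill


rows = [[1.0, 2.0, 3.0], [4.0, None, 6.0], [7.0, 8.0, None], [None, 1.0, None]]
complete, to_fill = split_by_missing(rows, target=2)
```

Here `complete` would serve as the experimental data for training and testing, and `to_fill` holds the records whose output dimension is to be completed.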
Step 2: initialize the parameters.
The maximum permissible error is chosen as the difference between the maximum and minimum values of the dimension to be completed, multiplied by a manually set factor α; we use α = 0.1.
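As a small worked example of this rule, the sketch below computes the permissible error for a hypothetical target column. The function name `allowed_error` and the sample values are made up for illustration.

```python
def allowed_error(target_values, alpha=0.1):
    """Maximum permissible completion error: alpha * (max - min)."""
    return alpha * (max(target_values) - min(target_values))


# Hypothetical values of the dimension to be completed.
tolerance = allowed_error([12.0, 30.0, 18.0, 22.0])  # range 18, so about 1.8
```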
Step 3: use PCA to reduce the dimension of the input data.
The covariance matrix C is computed as shown in formula (1), where X is the input of the completion algorithm and m is the number of data records. The eigenvalues of C and their corresponding eigenvectors are then computed. The eigenvectors are arranged as the rows of a matrix, from top to bottom in decreasing order of their eigenvalues, and the first d rows form the matrix P; Y = XP is the data after dimension reduction. Here d is the dimension after reduction. To choose d we define a contribution rate R, as shown in formula (2): if R, the sum of the first d eigenvalues divided by the sum of all eigenvalues, exceeds the chosen threshold, then d is the reduced dimension.
We generally set the threshold on the contribution rate R to 95%.
$$C = \frac{1}{m} X X^{T} \tag{1}$$
$$R = \frac{\sum_{i=1}^{d} e_i}{\sum_{i=1}^{n} e_i} \tag{2}$$
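The dimension reduction of step 3 might be sketched as follows with NumPy. This is an illustration under stated assumptions rather than the exact procedure of the invention: samples are stored as rows (so the covariance is computed as the transpose of formula (1)), the data are mean-centered first (the text does not mention centering), and the function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(X, contribution=0.95):
    """Project X (m samples x n features) onto its leading principal components.

    Keeps the smallest d whose eigenvalues account for at least
    `contribution` of the eigenvalue total, as in formula (2).
    """
    Xc = X - X.mean(axis=0)                 # assumption: centre each feature
    C = Xc.T @ Xc / Xc.shape[0]             # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    d = min(int(np.searchsorted(ratio, contribution)) + 1, len(eigvals))
    P = eigvecs[:, :d]                      # n x d projection matrix
    return Xc @ P, P, d
```

With three strongly correlated input columns, for example, a single component typically exceeds the 95% threshold, so the regression in the later steps works on one input dimension instead of three.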
Step 4: normalize the reduced data obtained in step 3 to the range 0 to 1, then select 70% of the data as the training set and 30% as the test set.
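A minimal sketch of step 4 is given below. Min-max scaling is one common way to map data to the range 0 to 1, and a seeded random shuffle is assumed for choosing the 70% training portion; the text does not state how the records are selected, and both function names are hypothetical.

```python
import random


def min_max_normalize(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]   # constant column: map everything to 0
    return [(v - lo) / (hi - lo) for v in column]


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # assumption: random selection
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


scaled = min_max_normalize([2.0, 4.0, 6.0, 10.0])
train, test = train_test_split(list(range(10)))
```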
Step 5: partition the training set into K parts using the K-means clustering algorithm.
Cluster analysis groups the training data into k classes, where k is the number of planes we define. Each class is fitted by a corresponding plane, and each cluster center can be regarded as the geometric center of the corresponding plane. In this way the initial geometric center of each plane is obtained from the cluster centers.
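The initialization of step 5 can be sketched with a plain implementation of Lloyd's K-means algorithm. The function name `kmeans`, the optional `init` argument, and the handling of empty clusters are illustrative choices and not details taken from the text.

```python
import numpy as np


def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centres, labels)."""
    X = np.asarray(X, dtype=float)
    if init is not None:
        centres = np.array(init, dtype=float)
    else:
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # distance from every sample to every centre, shape (m, k)
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centres = centres.copy()
        for j in range(k):
            if np.any(labels == j):          # an empty cluster keeps its centre
                new_centres[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centres, labels = kmeans(X, k=2, init=X[[0, 4]])
```

The returned centres would then serve as the initial geometric centers μ of the k planes. K-means can settle in a poor local optimum for an unlucky random start, which is why the example passes explicit initial centres.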
Step 6: iteratively solve for the regression coefficients ω and geometric center μ of each plane.
We define the error function of the regression analysis as shown in formula (3).
$$E(\theta) = \sum_{n=1}^{N} \sum_{k \in \{1, \dots, K\}} \left[ \left( \tilde{\omega}_k^{T} \tilde{X}_n - y_n \right)^2 + \gamma \lVert X_n - \mu_k \rVert^2 \right] \tag{3}$$
X_n denotes the input data and y_n the true data. γ is a user-defined parameter that sets the relative weight of the two terms in the formula above; we determine γ by ten-fold cross-validation. Our goal is to minimize this error function. In doing so we ensure not only that the regression prediction error of each plane is minimized, but also that the data used for regression prediction belong to that plane. To this end we define formula (4).
$$S_k = \left\{ X_n \;\middle|\; k = \arg\min_{j \in \{1, \dots, K\}} \left( \tilde{\omega}_j^{T} \tilde{X}_n - y_n \right)^2 + \gamma \lVert X_n - \mu_j \rVert^2 \right\} \tag{4}$$
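The assignment rule of formula (4) might be computed as in the sketch below. It assumes that the tilde denotes augmentation with a constant 1, so that the last coefficient of each plane acts as an intercept; the text does not define the tilde, and the function name `assign_to_planes` and the array layout are hypothetical.

```python
import numpy as np


def assign_to_planes(X, y, W, mu, gamma):
    """Return, for each sample, the index k that minimises formula (4).

    X  : (m, n) inputs          y  : (m,) known outputs
    W  : (K, n + 1) coefficients, intercept stored in the last column
    mu : (K, n) plane centres   gamma : weight of the distance term
    """
    X_aug = np.hstack([X, np.ones((len(X), 1))])                 # augmented inputs
    fit_err = (X_aug @ W.T - y[:, None]) ** 2                    # (m, K) squared residuals
    dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, K) squared distances
    return np.argmin(fit_err + gamma * dist, axis=1)


X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0.0, 1.0, -10.0, -11.0])
W = np.array([[1.0, 0.0], [-1.0, 0.0]])   # planes y = x and y = -x
mu = np.array([[0.5], [10.5]])
labels = assign_to_planes(X, y, W, mu, gamma=0.1)
```

The set S_k then consists of the rows of X whose label equals k. In this example the first two samples fall on plane 0 and the last two on plane 1.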
The goal of formula (3) is to find the sets S_k. Each S_k is a set of input data X_n, namely those X_n for which plane k attains the minimum of the bracketed term in formula (3). Formula (3) then becomes formula (5), as follows.
$$E(\theta) = \sum_{k=1}^{K} \sum_{X_n \in S_k} \left[ \left( \tilde{\omega}_k^{T} \tilde{X}_n - y_n \right)^2 + \gamma \lVert X_n - \mu_k \rVert^2 \right] \tag{5}$$
Because formula (5) is a function of ω, and S_k is also a function of ω, we solve formula (5) with an EM iterative algorithm to obtain the regression coefficients ω and the means μ, iterating until convergence. The optimal value of the user-defined parameter γ is determined by ten-fold cross-validation.
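One way to realize such an iteration is alternating minimization: with the sets S_k fixed, the best ω for each plane is an ordinary least-squares fit and the best μ is the mean of the inputs in S_k; with ω and μ fixed, each sample is reassigned by formula (4). The sketch below follows that scheme. It is an illustration only: the function name `fit_k_planes`, the small `ridge` term added for numerical stability, the nearest-centre initial assignment, and the stopping rule are assumptions and are not taken from the text.

```python
import numpy as np


def fit_k_planes(X, y, mu0, gamma=0.1, n_iter=50, ridge=1e-8):
    """Alternate between refitting each plane and reassigning samples.

    X : (m, n) inputs, y : (m,) outputs, mu0 : (K, n) initial centres
    (for example K-means centres). Returns (W, mu, labels); row k of W
    holds the coefficients of plane k with the intercept in the last entry.
    """
    m, n = X.shape
    K = len(mu0)
    X_aug = np.hstack([X, np.ones((m, 1))])
    mu = np.array(mu0, dtype=float)
    # initial assignment: nearest centre, since no coefficients exist yet
    labels = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
    W = np.zeros((K, n + 1))
    for _ in range(n_iter):
        # update step: least-squares plane and mean centre for each cluster
        for k in range(K):
            idx = labels == k
            if not idx.any():
                continue                      # an empty cluster keeps its old plane
            A = X_aug[idx]
            W[k] = np.linalg.solve(A.T @ A + ridge * np.eye(n + 1), A.T @ y[idx])
            mu[k] = X[idx].mean(axis=0)
        # assignment step: formula (4)
        cost = (X_aug @ W.T - y[:, None]) ** 2
        cost = cost + gamma * ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                             # assignments stable: converged
        labels = new_labels
    return W, mu, labels


# Piecewise-linear toy data: y = 2x + 1 on [0, 1] and y = -3x + 20 on [4, 5].
X = np.concatenate([np.linspace(0, 1, 20), np.linspace(4, 5, 20)])[:, None]
y = np.where(X[:, 0] < 2, 2 * X[:, 0] + 1, -3 * X[:, 0] + 20)
W, mu, labels = fit_k_planes(X, y, mu0=[[0.5], [4.5]])
```

On this toy data the two recovered planes are close to the generating lines. In practice γ would be selected by ten-fold cross-validation as described above, which is not shown here.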
Step 7: using the parameters ω and μ obtained in step 6, predict the completed data with formula (6).
$$y_n^{*} = \tilde{\omega}_k^{T} \tilde{X}_n \tag{6}$$
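Formula (6) needs a plane index k for each record, but the true output is unknown at completion time, so formula (4) cannot be evaluated directly. The sketch below assumes that the plane whose center μ_k is nearest to the input is used; this selection rule is an assumption made for illustration, as are the function name `predict_missing` and the example coefficients.

```python
import numpy as np


def predict_missing(X_new, W, mu):
    """Complete the target for each row of X_new using formula (6).

    Assumption: because y is unknown for these rows, the plane k is chosen
    as the one whose centre mu[k] is closest to the input.
    """
    X_new = np.asarray(X_new, dtype=float)
    dist = ((X_new[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    k = dist.argmin(axis=1)                               # plane index per row
    X_aug = np.hstack([X_new, np.ones((len(X_new), 1))])
    return (X_aug * W[k]).sum(axis=1)                     # row-wise dot product


W = np.array([[2.0, 1.0], [-3.0, 20.0]])   # hypothetical fitted coefficients
mu = np.array([[0.5], [4.5]])              # hypothetical plane centres
y_hat = predict_missing([[0.25], [4.0]], W, mu)
```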
Step 8: we define the maximum deviation, minimum deviation, mean deviation, and prediction precision to evaluate the performance of the completion algorithm. The deviation is the proportion by which the completed data depart from the original data. The following formulas give the maximum deviation, minimum deviation, mean deviation, and prediction precision respectively.
$$\mathrm{max\_dev} = \max_{i=1,\dots,m} \frac{|y_i - y_i^{*}|}{y_i} \tag{7}$$
$$\mathrm{min\_dev} = \min_{i=1,\dots,m} \frac{|y_i - y_i^{*}|}{y_i} \tag{8}$$
$$\mathrm{ave\_dev} = \frac{1}{m} \sum_{i=1}^{m} \frac{|y_i - y_i^{*}|}{y_i} \tag{9}$$
$$\mathrm{sign}_i = \begin{cases} 0, & |y_i - y_i^{*}| - \alpha > 0 \\ +1, & |y_i - y_i^{*}| - \alpha \le 0 \end{cases} \qquad i = 1, \dots, m \tag{10}$$
$$\mathrm{pre} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{sign}_i$$
For the maximum deviation, we take, over all predicted data, the absolute value of the difference between the true value and the predicted value divided by the true value, and find the maximum of these prediction deviations. For the minimum deviation, we compute the same quantity and find the minimum. For the mean deviation, we sum these relative deviations and divide by the number of predicted data. For the prediction precision, we take the absolute value of the difference between each true value and its predicted value and subtract the permissible error α. If the result is greater than 0, the datum is marked 0, indicating an incorrect prediction; if it is less than or equal to 0, the datum is marked +1, indicating a correct prediction. The prediction precision is the number of correctly predicted data divided by the total number of data.
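The four indexes can be computed as in the following sketch. The function name `completion_metrics` and the sample numbers are illustrative. The parameter `tolerance` plays the role of the permissible error written as α in formula (10), and taking the absolute value of the denominator is an assumption so that negative true values also yield a non-negative deviation.

```python
def completion_metrics(y_true, y_pred, tolerance):
    """Return (max_dev, min_dev, ave_dev, precision) per formulas (7)-(10).

    Deviations are relative to the true value, so every y_true must be
    non-zero. `tolerance` is the absolute permissible error.
    """
    devs = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    hits = [1 if abs(t - p) <= tolerance else 0 for t, p in zip(y_true, y_pred)]
    return max(devs), min(devs), sum(devs) / len(devs), sum(hits) / len(hits)


# Hypothetical true and completed values.
max_dev, min_dev, ave_dev, pre = completion_metrics(
    [10.0, 20.0, 40.0], [11.0, 20.0, 30.0], tolerance=2.0
)
```

In this example two of the three completed values fall within the tolerance, so the prediction precision is two thirds.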
We thank Shuguang Hospital and the National 863 Program (project title: Clinical big data management, analysis and application research on integrated traditional Chinese and Western medicine for cardiovascular disease and tumor disease; project approval code: SQ2015AA0201076; funding: 10 million yuan) for their strong support of and help with this patent.

Claims (5)

1. A missing data completion method based on K-plane regression, characterized in that the following steps are carried out when completing missing data:
Step 1: carry out missing-data detection manually, taking the data to be completed as the output and the remaining data as the input;
Step 2: initialize the parameters;
Step 3: use PCA to reduce the dimension of the input data;
Step 4: normalize the data obtained in step 3 to the range 0 to 1, and select 70% of the data as the training set and the remaining 30% as the test set;
Step 5: perform cluster analysis on the training set with the K-means clustering algorithm to obtain the initial geometric centers μ;
Step 6: minimize the error function, iteratively solving for the regression coefficients ω and geometric center μ of each plane;
Step 7: using the parameters ω and μ obtained in step 6, perform regression prediction on the test data; the result is the completed data;
Step 8: after the completed data are obtained in step 7, evaluate the performance of the completion algorithm using four indexes: maximum deviation, minimum deviation, mean deviation, and prediction precision.
2. The missing data completion method based on K-plane regression according to claim 1, characterized in that: in step 3, before missing data are completed, a dimension reduction operation is performed on the data set. Every data set to be completed contains both highly correlated and weakly correlated dimensions. PCA (principal component analysis) is used to select the principal components of the data set: the eigenvalue of each dimension and the corresponding eigenvector are calculated, the highly correlated dimensions are selected as the main input for completion, and a contribution rate is defined as follows,
$$R = \frac{\sum_{i=1}^{d} e_i}{\sum_{i=1}^{n} e_i} \tag{1}$$
R denotes the proportion of the total of all eigenvalues accounted for by the first d eigenvalues; the d at which R exceeds 95% is the dimension after reduction.
3. The missing data completion method based on K-plane regression according to claim 1, characterized in that: in step 5, the training set is partitioned into k parts, where k is also the number of planes. The target data set is divided into k classes, each fitted by a corresponding plane. The initial k-partition is obtained with the K-means clustering algorithm; then, as the regression planes are optimized and their coefficients ω adjusted, the data are reassigned into k classes,
$$S_k = \left\{ X_n \;\middle|\; k = \arg\min_{j \in \{1, \dots, K\}} \left( \tilde{\omega}_j^{T} \tilde{X}_n - y_n \right)^2 + \gamma \lVert X_n - \mu_j \rVert^2 \right\} \tag{2}$$
S_k is the set of data contained in class k.
4. The missing data completion method based on K-plane regression according to claim 1, characterized in that: when the regression function is constructed in step 6, not only is the best fit of each regression plane ensured, but the fitted data are also kept near the geometric center of their corresponding plane,
$$E(\theta) = \sum_{k=1}^{K} \sum_{X_n \in S_k} \left[ \left( \tilde{\omega}_k^{T} \tilde{X}_n - y_n \right)^2 + \gamma \lVert X_n - \mu_k \rVert^2 \right] \tag{3}$$
γ is a user-defined parameter that adjusts the relative weight of the two terms in formula (3); 10-fold cross-validation is used to determine the optimal γ.
5. The missing data completion method based on K-plane regression according to claim 1, characterized in that: in step 8, the completed data are evaluated using four indexes: maximum deviation, minimum deviation, mean deviation, and prediction precision. When the prediction precision is evaluated, the permissible error α is chosen as the maximum value of the dimension to be completed minus its minimum value, multiplied by a threshold, and the threshold is set to 0.1.
CN201511025065.2A 2015-12-30 2015-12-30 Missing data completion method based on k plane regression Pending CN105469123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511025065.2A CN105469123A (en) 2015-12-30 2015-12-30 Missing data completion method based on k plane regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511025065.2A CN105469123A (en) 2015-12-30 2015-12-30 Missing data completion method based on k plane regression

Publications (1)

Publication Number Publication Date
CN105469123A true CN105469123A (en) 2016-04-06

Family

ID=55606794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511025065.2A Pending CN105469123A (en) 2015-12-30 2015-12-30 Missing data completion method based on k plane regression

Country Status (1)

Country Link
CN (1) CN105469123A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220211A (en) * 2016-12-14 2017-09-29 北京理工大学 It is a kind of to merge the data re-establishing method that tensor filling and tensor recover
CN107229916A (en) * 2017-05-27 2017-10-03 南京航空航天大学 A kind of airport noise Monitoring Data restorative procedure based on depth noise reduction own coding
CN107633455A (en) * 2017-09-04 2018-01-26 深圳市华傲数据技术有限公司 Credit estimation method and device based on data model
WO2018045642A1 (en) * 2016-09-09 2018-03-15 国网山西省电力公司晋城供电公司 A bus bar load forecasting method
CN107862409A (en) * 2017-11-06 2018-03-30 重庆大学 A kind of a large amount of missing data complementing methods of transformer station's power transmission and transforming equipment based on regression analysis
CN109146004A (en) * 2018-10-09 2019-01-04 宁波大学 A kind of dynamic process monitoring method based on iteration missing data estimation strategy
CN109658996A (en) * 2018-11-26 2019-04-19 浙江大学山东工业技术研究院 A kind of physical examination Supplementing Data method, apparatus and application based on side information
CN110046152A (en) * 2019-04-19 2019-07-23 国网河南省电力公司经济技术研究院 A method of processing electricity consumption data missing values
CN110874645A (en) * 2019-11-14 2020-03-10 北京首汽智行科技有限公司 Data reduction method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104392456A (en) * 2014-12-09 2015-03-04 西安电子科技大学 SAR (synthetic aperture radar) image segmentation method based on depth autoencoders and area charts
CN104484673A (en) * 2014-12-05 2015-04-01 南京大学 Data complementation method for pattern recognition application of real-time data flow
US20150184199A1 (en) * 2013-12-19 2015-07-02 Amyris, Inc. Methods for genomic integration

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邓超: ""基于支持向量机的流量预测和状态判别研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
韩振兴: ""基于财务报表的上市公司绩效评价研究"", 《中国优秀硕士学位论文全文数据库 经济与管理科学辑》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018045642A1 (en) * 2016-09-09 2018-03-15 国网山西省电力公司晋城供电公司 A bus bar load forecasting method
CN107220211A (en) * 2016-12-14 2017-09-29 北京理工大学 It is a kind of to merge the data re-establishing method that tensor filling and tensor recover
CN107229916A (en) * 2017-05-27 2017-10-03 南京航空航天大学 A kind of airport noise Monitoring Data restorative procedure based on depth noise reduction own coding
CN107633455A (en) * 2017-09-04 2018-01-26 深圳市华傲数据技术有限公司 Credit estimation method and device based on data model
CN107862409A (en) * 2017-11-06 2018-03-30 重庆大学 A kind of a large amount of missing data complementing methods of transformer station's power transmission and transforming equipment based on regression analysis
CN107862409B (en) * 2017-11-06 2021-11-02 重庆大学 Regression analysis-based method for filling large amount of missing data of substation power transmission and transformation equipment
CN109146004A (en) * 2018-10-09 2019-01-04 宁波大学 A kind of dynamic process monitoring method based on iteration missing data estimation strategy
CN109146004B (en) * 2018-10-09 2021-07-23 宁波大学 Dynamic process monitoring method based on iteration missing data estimation strategy
CN109658996A (en) * 2018-11-26 2019-04-19 浙江大学山东工业技术研究院 A kind of physical examination Supplementing Data method, apparatus and application based on side information
CN110046152A (en) * 2019-04-19 2019-07-23 国网河南省电力公司经济技术研究院 A method of processing electricity consumption data missing values
CN110874645A (en) * 2019-11-14 2020-03-10 北京首汽智行科技有限公司 Data reduction method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160406