CN105469123A - Missing data completion method based on k plane regression - Google Patents
Missing data completion method based on k plane regression
- Publication number
- CN105469123A
- Authority
- CN
- China
- Prior art keywords
- data
- plane
- completion
- regression
- carried out
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention provides a new method for completing missing data in a database. The method comprises the following steps: 1, missing-value detection is carried out on a given data set; 2, the dimensionality of the input variables is reduced: the correlation between input dimensions is analyzed, principal component analysis (PCA) is used to select the correlated input dimensions, and a new input data set is formed; 3, the training set is partitioned into k parts: K-means clustering is used to partition the input training set, yielding k classes of training data; 4, a k-plane regression function is built, the optimal regression coefficients and the geometric center of each plane are solved, and a regression fitting function is given; finally, a data completion test is carried out. Experiments show that the method is effective: within the allowed error range, a completed database of practical value is obtained. The method addresses, to a certain degree, the technical problems that incomplete data pose for machine learning and data mining, and advances big data application technology.
Description
Technical field
The present invention relates generally to data mining technology, and specifically to a missing data completion method based on K-plane regression.
Background art
Ideally, every record in a data set would be complete. In the real world, however, incomplete and noisy data are ubiquitous. In data mining and pattern recognition, missing data can have a large impact. For example, they can reduce the correctness of the patterns extracted from the data set and the accuracy of the derived rules, which can lead to erroneous data mining models. Moreover, the vast majority of current data mining algorithms cannot process and analyze data sets that contain missing data. If records with missing data are simply discarded instead of being processed and analyzed, a large amount of information is lost and bias is introduced, producing systematic differences between the incomplete observed data and the complete observed data. Analyzing and completing missing data is therefore both necessary and meaningful.
Current missing data completion methods fall roughly into the following classes. The first class consists of the simple and common methods: global constant filling and attribute mean filling. These two methods fill the missing attribute either with a constant or with the mean of that attribute. In most cases they produce biased results, just as discarding the records with missing data does.
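As a point of comparison for the methods discussed later, attribute mean filling can be sketched in a few lines. This is a minimal illustration only; the function name `fill_with_mean` and the use of `None` to mark a missing value are assumptions made for the example, not part of the patent.

```python
def fill_with_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


# Hypothetical attribute with one missing entry.
filled = fill_with_mean([1.0, None, 3.0])
```

Every missing entry receives the same value, which is why this approach shrinks the spread of the attribute and can bias later analysis.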
The second class consists of single imputation and multiple imputation. Single imputation fills a missing value with the value of the object most similar to it. The most common way to judge similarity is to use the correlation matrix to find the attribute most correlated with the attribute containing the missing value, sort all objects by the value of that most correlated attribute, and fill the missing value with the value of the object ranked just before it. Compared with mean filling, the standard deviation of the variable stays close to its value before filling, but the method is inconvenient to use, time-consuming, and prone to systematic underestimation. Multiple imputation replaces each missing value with a series of plausible values to reflect the uncertainty of the replaced data. The several data sets generated by the repeated replacements are then analyzed with standard statistical procedures, and finally the statistics from each data set are combined to obtain estimates of the population parameters.
The third class uses a model to predict the missing data. Such a method first defines a model for the input data and then performs maximum likelihood estimation of the unknown parameters based on this model. Many researchers have explored this approach. In 2012, Ji Liu proposed a tensor estimation method for missing values in visual data. In 2014, Emil Eirola proposed a distance estimation method based on Gaussian mixture models for data with missing values. In 2014, Zhengbang Li proposed a mixture regression analysis for block-missing data. Although these methods achieve good results, their completion accuracy on some data still needs improvement.
Summary of the invention
The object of the invention is to propose a missing data completion method based on k-plane regression for data sets with missing data. Cluster analysis is first carried out on the data to group them into K classes, and regression analysis is then carried out on each class; the resulting output is the completed data.
The technical scheme of the present invention is as follows:
Step 1: data preprocessing. Missing-value detection is carried out on the data set, the records without missing values are selected as experimental data, the dimension to be completed is taken as the output, and the remaining dimensions are taken as the input.
Step 2: parameter initialization.
The parameters include the error allowed for completion, the manually determined parameters, the number of iterations of the algorithm, the number of planes K, and the dimensionality after reduction.
Step 3: dimensionality reduction using PCA.
The main purpose is to use PCA to screen the regression variables, selecting the optimal variables from the subsets formed by the original variables to form the optimal variable set.
Step 4: the new variable set obtained in step 3 is normalized to reduce the interference of noisy data. 70% of the data set is selected as the training set and 30% as the test set.
Step 5: K-means cluster analysis is carried out on the training data.
K-means cluster analysis groups the training data into K classes. Each class can be fitted by a corresponding plane, and the center of each class is taken as the initial geometric center μ of the corresponding plane.
Step 6: the regression coefficient ω and the geometric center μ of each plane are solved.
The geometric center μ and regression coefficient ω of each plane are found by iteratively minimizing the error function. The data set S contained in each plane is then redetermined from the regression coefficients and geometric centers, and the new plane centers are obtained. This step is repeated until the geometric centers of the planes no longer change and the regression coefficients are stable, that is, until the error function converges.
Step 7: using the regression coefficients ω and geometric centers μ obtained in step 6, regression prediction is carried out on the test data; the result is the predicted completion data.
Step 8: the prediction results are evaluated with four indicators, namely maximum deviation, minimum deviation, mean deviation, and prediction accuracy, to assess the performance of the completion algorithm.
Experimental results show that the missing data completion algorithm based on K-plane regression performs well.
Brief description of the drawings
The various aspects of the present invention will become apparent to the reader after reading the specific embodiments of the invention with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of the missing data completion method based on K-plane regression of the present invention;
Fig. 2 is a table describing the data sets used in the experiments of the present invention;
Fig. 3 shows the experimental results of the present invention.
Detailed description of the embodiments
Step 1: missing-data detection is carried out manually; the data to be completed are taken as the output, and the remaining data as the input.
Step 2: the parameters are initialized.
The maximum allowed error is the difference between the maximum and minimum values of the dimension to be completed, multiplied by a manually set factor α; we set α to 0.1.
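The rule for the maximum allowed error is simple arithmetic and can be sketched as follows. The function name `allowed_error` and the convention of marking missing entries with `None` are assumptions made for this illustration.

```python
def allowed_error(values, alpha=0.1):
    """Maximum allowed completion error: alpha * (max - min) over the observed values."""
    observed = [v for v in values if v is not None]
    return alpha * (max(observed) - min(observed))


# Hypothetical dimension to be completed; its observed range is 12.0 - 2.0 = 10.0.
tolerance = allowed_error([2.0, 5.0, None, 12.0])
```

With α = 0.1, a completed value would count as acceptable here when it lies within about 1.0 of the true value.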
Step 3: PCA is used to reduce the dimensionality of the input data.
The covariance matrix C is obtained as shown in formula (1), where X is the input of the completion algorithm and m is the number of data records. The eigenvalues of C and their corresponding eigenvectors are then computed; the eigenvectors are sorted by decreasing eigenvalue, and the first d of them form the matrix P. Y = XP is the data after dimensionality reduction. Here d is the dimensionality after reduction. To choose d, we define a contribution rate R, as shown in formula (2): if the sum of the first d eigenvalues divided by the sum of all eigenvalues is greater than the contribution rate R, then d is the dimensionality after reduction.
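The images of formulas (1) and (2) are not reproduced in this text. A form consistent with the description, assuming that X is an m × n matrix whose rows are mean-centered records and that λ_1 ≥ … ≥ λ_n are the eigenvalues of C, would be the following; the notation is an assumption, not necessarily the patent's own.

```latex
\begin{align}
C &= \frac{1}{m} X^{\top} X \tag{1} \\
\frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i} &> R \tag{2}
\end{align}
```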
We generally set the contribution rate R to 95%.
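The reduction step can be sketched with NumPy as below. This is a minimal sketch under stated assumptions: the input is mean-centered before the covariance is computed, the covariance is divided by m, and d is taken as the smallest number of leading eigenvalues whose share reaches the contribution rate. The function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(X, contribution=0.95):
    """Project X onto the fewest leading principal components whose
    cumulative eigenvalue share reaches `contribution`."""
    Xc = X - X.mean(axis=0)                   # center each input dimension (assumption)
    C = Xc.T @ Xc / Xc.shape[0]               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    share = np.cumsum(eigvals) / eigvals.sum()
    d = min(int(np.searchsorted(share, contribution)) + 1, X.shape[1])
    P = eigvecs[:, :d]                        # first d eigenvectors as columns
    return Xc @ P, P, d


# Hypothetical data: two perfectly correlated columns plus a tiny noise column.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 1))
X = np.hstack([a, 2 * a, rng.normal(scale=0.01, size=(200, 1))])
Y, P, d = pca_reduce(X)
```

In this example almost all variance lies along one direction, so a single component is kept.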
Step 4: the data obtained after dimensionality reduction in step 3 are normalized to the range 0 to 1; 70% of the data are selected as the training set and 30% as the test set.
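Normalization and the 70/30 split might look like the sketch below. The text does not say how records are assigned to the two sets, so the random shuffle with a fixed seed is an assumption, as are the function names.

```python
import numpy as np


def min_max_normalize(Y):
    """Scale every column of Y to the range [0, 1]."""
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)    # constant columns map to 0 instead of dividing by zero
    return (Y - lo) / span


def split_train_test(Y, t, train_fraction=0.7, seed=0):
    """Shuffle the records (assumption) and split inputs Y and targets t."""
    idx = np.random.default_rng(seed).permutation(len(Y))
    cut = int(round(train_fraction * len(Y)))
    return Y[idx[:cut]], t[idx[:cut]], Y[idx[cut:]], t[idx[cut:]]


# Hypothetical reduced data with 10 records and 2 dimensions.
Yn = min_max_normalize(np.arange(20.0).reshape(10, 2))
train_X, train_t, test_X, test_t = split_train_test(Yn, np.arange(10.0))
```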
Step 5: the training set is partitioned into k parts with the K-means clustering algorithm.
Cluster analysis groups the training data into k classes, where k is the number of planes we define. Each class is fitted by a corresponding plane, and each cluster center is regarded as the geometric center of the corresponding plane. In this way the initial geometric center of each plane is obtained from the cluster centers.
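The initialization can be illustrated with a plain Lloyd-style K-means in NumPy. This is a sketch, not the patent's implementation: the random choice of initial centers, the stopping test, and the function names are assumptions.

```python
import numpy as np


def nearest_center(X, centers):
    """Index of the closest center for every row of X (squared Euclidean distance)."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Return k cluster centers and the cluster label of every record."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):     # centers stopped moving
            break
        centers = updated
    return centers, nearest_center(X, centers)


# Hypothetical training inputs: two well-separated groups of 20 records each.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                  rng.normal(10.0, 0.1, size=(20, 2))])
centers, labels = kmeans(data, k=2)
```

The returned centers would serve as the initial geometric centers μ of the k planes.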
Step 6: the regression coefficient ω and geometric center μ of each plane are found iteratively.
We define the error function of this regression analysis as shown in formula (3).
Here x_n denotes the input data and y_n the true value; γ is a user-defined parameter that sets the relative weight of the two terms in the formula, and it is determined by ten-fold cross-validation. Our goal is to minimize this error function: we must not only minimize the regression prediction error of each plane, but also ensure that the data used for regression prediction belong to that plane. We define formula (4).
Formula (4) gives the set S_k. This set consists of the input data x for which the k-th plane minimizes the error term; with it, formula (3) becomes formula (5), as follows.
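The images of formulas (3) to (5) are not reproduced in this text. One form consistent with the description (a squared regression error plus a γ-weighted squared distance to the geometric center of the plane) is given below. It is offered as an assumption about the intended formulas, with N the number of training records and K the number of planes.

```latex
\begin{align}
E(\omega, \mu) &= \sum_{n=1}^{N} \min_{k \in \{1, \dots, K\}}
  \left[ \left( y_n - \omega_k^{\top} x_n \right)^2
       + \gamma \left\lVert x_n - \mu_k \right\rVert^2 \right] \tag{3} \\
S_k &= \left\{ x_n : k = \arg\min_{j}
  \left[ \left( y_n - \omega_j^{\top} x_n \right)^2
       + \gamma \left\lVert x_n - \mu_j \right\rVert^2 \right] \right\} \tag{4} \\
E(\omega, \mu) &= \sum_{k=1}^{K} \sum_{x_n \in S_k}
  \left[ \left( y_n - \omega_k^{\top} x_n \right)^2
       + \gamma \left\lVert x_n - \mu_k \right\rVert^2 \right] \tag{5}
\end{align}
```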
Since formula (5) is a function of ω, and S_k is also a function of ω, we solve formula (5) with the EM iterative algorithm to obtain the regression coefficients ω and the means μ, iterating until convergence. The optimal value of the user-defined parameter γ is determined by ten-fold cross-validation.
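The iteration can be sketched as a hard-assignment alternating scheme: refit each plane on its current members, then reassign every record to the plane with the smallest combined error. This is a simplified stand-in for the EM iteration named in the text, not a reproduction of it. It assumes an error of the form squared residual plus γ times squared distance to the center, appends a bias column to the inputs, and uses the hypothetical name `fit_k_planes`.

```python
import numpy as np


def fit_k_planes(X, y, centers, gamma=1.0, n_iter=50):
    """Alternate between refitting each plane and reassigning records to planes."""
    k = len(centers)
    Xb = np.hstack([X, np.ones((len(X), 1))])            # bias column (assumption)
    mu = np.array(centers, dtype=float)
    labels = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
    W = np.zeros((k, Xb.shape[1]))
    for _ in range(n_iter):
        for j in range(k):
            members = labels == j
            if members.any():
                # least-squares coefficients and mean of the plane's current members
                W[j] = np.linalg.lstsq(Xb[members], y[members], rcond=None)[0]
                mu[j] = X[members].mean(axis=0)
        residual = (Xb @ W.T - y[:, None]) ** 2           # regression error per plane
        distance = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)
        new_labels = (residual + gamma * distance).argmin(axis=1)
        if np.array_equal(new_labels, labels):            # assignments are stable
            break
        labels = new_labels
    return W, mu, labels


# Hypothetical piecewise-linear data: y = 2x on [0, 1] and y = -3x + 20 on [5, 6].
x = np.concatenate([np.linspace(0, 1, 20), np.linspace(5, 6, 20)])
y = np.where(x < 3, 2 * x, -3 * x + 20)
W, mu, labels = fit_k_planes(x[:, None], y, centers=[[0.5], [5.5]])
```

For a fixed assignment, the least-squares fit minimizes the residual term and the member mean minimizes the distance term, so each pass does not increase the assumed error.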
Step 7: with the parameters ω and μ obtained in step 6, the completed data are predicted using formula (6).
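The image of formula (6) is not reproduced in this text. Since the true output is unknown at completion time, one plausible reading is that each record is assigned to the plane with the nearest geometric center and that plane's regression coefficients produce the prediction. The sketch below follows that assumption; the function name `predict_k_planes` and the trailing bias coefficient are hypothetical.

```python
import numpy as np


def predict_k_planes(X, W, mu):
    """Predict with the plane whose geometric center is nearest to each record."""
    nearest = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # same bias column as used in fitting
    return (Xb * W[nearest]).sum(axis=1)


# Hypothetical fitted parameters for two planes: y = 2x and y = -3x + 20.
W = np.array([[2.0, 0.0], [-3.0, 20.0]])
mu = np.array([[0.5], [5.5]])
completed = predict_k_planes(np.array([[1.0], [5.0]]), W, mu)
```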
Step 8: we define the maximum deviation, minimum deviation, mean deviation, and prediction accuracy to evaluate the performance of the completion algorithm. Deviation is the proportion by which the completed data depart from the original data. The following formulas give the maximum deviation, minimum deviation, mean deviation, and prediction accuracy, respectively.
For the maximum deviation, we take, for every predicted value, the absolute difference between the actual value and the predicted value divided by the actual value, and take the maximum over all prediction deviations. For the minimum deviation, we compute the same quantity and take the minimum over all prediction deviations. For the mean deviation, we sum the absolute differences between actual and predicted values, each divided by the actual value, and divide by the number of predicted values. For the prediction accuracy, we take the absolute difference between each actual value and its predicted value and subtract the allowed error α. If the result is greater than 0, the value is labeled -1, indicating a wrong prediction; if it is less than or equal to 0, the value is labeled +1, indicating a correct prediction. The prediction accuracy is the number of correctly predicted values divided by the total number of values.
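The four indicators described above can be computed as in the following sketch. The function name `completion_metrics` and the dictionary keys are assumptions; the sketch also assumes that no actual value is zero and divides by its absolute value.

```python
def completion_metrics(actual, predicted, tolerance):
    """Maximum, minimum and mean relative deviation, plus prediction accuracy."""
    deviations = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    # a prediction counts as correct when |actual - predicted| - tolerance <= 0
    correct = sum(1 for a, p in zip(actual, predicted) if abs(a - p) - tolerance <= 0)
    return {
        "max_deviation": max(deviations),
        "min_deviation": min(deviations),
        "mean_deviation": sum(deviations) / len(deviations),
        "accuracy": correct / len(deviations),
    }


# Hypothetical actual and completed values with an allowed error of 3.0.
metrics = completion_metrics([10.0, 20.0, 40.0], [11.0, 20.0, 30.0], tolerance=3.0)
```

Here the relative deviations are 0.1, 0 and 0.25, and two of the three predictions fall within the allowed error.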
We thank Shuguang Hospital and the National 863 Program (project title: clinical big data management, analysis and applied research for cardiovascular disease and tumor disease in traditional Chinese and Western medicine; project approval code: SQ2015AA0201076; funding: 10 million yuan) for their strong support of and help with this patent.
Claims (5)
1. A missing data completion method based on K-plane regression, characterized in that the following steps are carried out when completing missing data:
Step 1: missing-data detection is carried out manually; the data to be completed are taken as the output, and the remaining data as the input;
Step 2: the parameters are initialized;
Step 3: PCA is used to reduce the dimensionality of the input data;
Step 4: the data obtained in step 3 are normalized to the range 0 to 1; 70% of the data are selected as the training set and the remaining 30% as the test set;
Step 5: cluster analysis is carried out on the training set with the K-means clustering algorithm to obtain the initial geometric centers μ;
Step 6: the error function is minimized, and the regression coefficient ω and geometric center μ of each plane are found iteratively;
Step 7: using the parameters ω and μ obtained in step 6, regression prediction is carried out on the test data; the result is the completed data;
Step 8: after the completed data are obtained in step 7, four indicators, namely maximum deviation, minimum deviation, mean deviation, and prediction accuracy, are defined to evaluate the performance of the completion algorithm.
2. The missing data completion method based on K-plane regression according to claim 1, characterized in that, in step 3, a dimensionality reduction operation is carried out on the data set before the missing data are completed. Every data set to be completed necessarily contains dimensions of high correlation and dimensions of low correlation. The PCA (principal component analysis) method is used to select the principal components of the data set: the eigenvalue of each dimension and the corresponding eigenvector are computed, the dimensions of high correlation are selected as the main input for completion, and a contribution rate is defined as follows,
R denotes the proportion of the sum of all eigenvalues accounted for by the first d eigenvalues; the d for which R exceeds 95% is the dimensionality after reduction.
3. The missing data completion method based on K-plane regression according to claim 1, characterized in that, in step 5, the training set is partitioned into k parts, where k is also the number of planes. The target data set is divided into k classes, each fitted by a corresponding plane. The initial k-partition is made with the K-means clustering algorithm; the data are then reassigned to the k classes according to the optimized and adjusted ω of each regression plane.
S_k is the data contained in the k-th class.
4. The missing data completion method based on K-plane regression according to claim 1, characterized in that, when the regression function is constructed in step 6, not only is the best fit of the regression plane ensured, but the fitted data are also kept near the geometric center of their corresponding plane,
γ is a user-defined parameter that adjusts the relative weight of the two terms in formula (3); ten-fold cross-validation is used to determine the optimal γ.
5. The missing data completion method based on K-plane regression according to claim 1, characterized in that, in step 8, the completed data are evaluated with four indicators: maximum deviation, minimum deviation, mean deviation, and prediction accuracy. When the prediction accuracy is evaluated, the allowed error α is chosen as the maximum value of the dimension to be completed minus its minimum value, multiplied by a threshold; the threshold is set to 0.1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511025065.2A CN105469123A (en) | 2015-12-30 | 2015-12-30 | Missing data completion method based on k plane regression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511025065.2A CN105469123A (en) | 2015-12-30 | 2015-12-30 | Missing data completion method based on k plane regression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105469123A true CN105469123A (en) | 2016-04-06 |
Family
ID=55606794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511025065.2A Pending CN105469123A (en) | 2015-12-30 | 2015-12-30 | Missing data completion method based on k plane regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105469123A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220211A (en) * | 2016-12-14 | 2017-09-29 | 北京理工大学 | Data reconstruction method fusing tensor filling and tensor recovery |
CN107229916A (en) * | 2017-05-27 | 2017-10-03 | 南京航空航天大学 | Airport noise monitoring data repair method based on deep denoising autoencoders |
CN107633455A (en) * | 2017-09-04 | 2018-01-26 | 深圳市华傲数据技术有限公司 | Credit estimation method and device based on data model |
WO2018045642A1 (en) * | 2016-09-09 | 2018-03-15 | 国网山西省电力公司晋城供电公司 | A bus bar load forecasting method |
CN107862409A (en) * | 2017-11-06 | 2018-03-30 | 重庆大学 | Regression analysis-based method for filling large amount of missing data of substation power transmission and transformation equipment |
CN109146004A (en) * | 2018-10-09 | 2019-01-04 | 宁波大学 | Dynamic process monitoring method based on iteration missing data estimation strategy |
CN109658996A (en) * | 2018-11-26 | 2019-04-19 | 浙江大学山东工业技术研究院 | Physical examination data completion method, device and application based on side information |
CN110046152A (en) * | 2019-04-19 | 2019-07-23 | 国网河南省电力公司经济技术研究院 | Method for processing missing values in electricity consumption data |
CN110874645A (en) * | 2019-11-14 | 2020-03-10 | 北京首汽智行科技有限公司 | Data reduction method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104392456A (en) * | 2014-12-09 | 2015-03-04 | 西安电子科技大学 | SAR (synthetic aperture radar) image segmentation method based on depth autoencoders and area charts |
CN104484673A (en) * | 2014-12-05 | 2015-04-01 | 南京大学 | Data complementation method for pattern recognition application of real-time data flow |
US20150184199A1 (en) * | 2013-12-19 | 2015-07-02 | Amyris, Inc. | Methods for genomic integration |
Non-Patent Citations (2)
Title |
---|
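邓超 (Deng Chao): "Research on flow prediction and state discrimination based on support vector machines", China Master's Theses Full-text Database, Information Science and Technology * |
韩振兴 (Han Zhenxing): "Research on performance evaluation of listed companies based on financial statements", China Master's Theses Full-text Database, Economics and Management Science * |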
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018045642A1 (en) * | 2016-09-09 | 2018-03-15 | 国网山西省电力公司晋城供电公司 | A bus bar load forecasting method |
CN107220211A (en) * | 2016-12-14 | 2017-09-29 | 北京理工大学 | Data reconstruction method fusing tensor filling and tensor recovery |
CN107229916A (en) * | 2017-05-27 | 2017-10-03 | 南京航空航天大学 | Airport noise monitoring data repair method based on deep denoising autoencoders |
CN107633455A (en) * | 2017-09-04 | 2018-01-26 | 深圳市华傲数据技术有限公司 | Credit estimation method and device based on data model |
CN107862409A (en) * | 2017-11-06 | 2018-03-30 | 重庆大学 | Regression analysis-based method for filling large amount of missing data of substation power transmission and transformation equipment |
CN107862409B (en) * | 2017-11-06 | 2021-11-02 | 重庆大学 | Regression analysis-based method for filling large amount of missing data of substation power transmission and transformation equipment |
CN109146004A (en) * | 2018-10-09 | 2019-01-04 | 宁波大学 | Dynamic process monitoring method based on iteration missing data estimation strategy |
CN109146004B (en) * | 2018-10-09 | 2021-07-23 | 宁波大学 | Dynamic process monitoring method based on iteration missing data estimation strategy |
CN109658996A (en) * | 2018-11-26 | 2019-04-19 | 浙江大学山东工业技术研究院 | Physical examination data completion method, device and application based on side information |
CN110046152A (en) * | 2019-04-19 | 2019-07-23 | 国网河南省电力公司经济技术研究院 | Method for processing missing values in electricity consumption data |
CN110874645A (en) * | 2019-11-14 | 2020-03-10 | 北京首汽智行科技有限公司 | Data reduction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20160406 |