CN105469123A

CN105469123A - Missing data completion method based on k plane regression

Info

Publication number: CN105469123A
Application number: CN201511025065.2A
Authority: CN
Inventors: 袁玉波; 阮彤; 邱文强; 汤伟; 赵婷婷; 高炬; 殷亦超
Original assignee: East China University of Science and Technology
Current assignee: East China University of Science and Technology
Priority date: 2015-12-30
Filing date: 2015-12-30
Publication date: 2016-04-06

Abstract

The invention provides a new method for completing missing data in a database. The method comprises the following steps: 1, missing-value detection is carried out on a given data set; 2, the input variables are reduced in dimension: the correlation between input dimensions is analyzed, principal component analysis (PCA) is used to select the correlated input dimensions, and a new input data set is formed; 3, the training set is partitioned into k parts: K-means clustering is applied to the input training set to obtain k classes of training data; 4, a k-plane regression function is built, the optimal regression coefficients and the geometric center of each plane are solved, and a regression fitting function is obtained; finally, a data completion test is carried out. Experiments show that the method is effective: within an allowable error range it yields a completed database of practical value, it addresses to a certain degree the technical problems that incomplete data poses for machine learning and data mining, and it advances big data application technology.

Description

A missing data completion method based on K-plane regression

Technical field

The present invention relates to data mining technology, and specifically to a missing data completion method based on K-plane regression.

Background

Ideally, every record in a data set would be complete. In the real world, however, incomplete and noisy data are ubiquitous. In data mining and pattern recognition, missing data can have a large impact. For example, missing data can affect the correctness of the patterns extracted from a data set and the accuracy of the derived rules, which can lead to an erroneous data mining model. Moreover, the overwhelming majority of current data mining algorithms cannot process and analyze data sets that contain missing data. If records with missing data are simply discarded rather than processed and analyzed, a large amount of information is lost and bias is introduced, producing systematic differences between the incomplete observations and the complete observations. Analyzing and completing missing data is therefore both necessary and meaningful.

Current missing data completion methods can be roughly divided into the following categories. The first category consists of the simple and common methods, global constant imputation and attribute mean imputation. These two methods fill in a missing attribute either with a chosen constant or with the mean of the attribute to be filled. In most cases, these methods produce biased results, just as discarding the records with missing data does.

The second category consists of single imputation and multiple imputation. Single imputation fills a missing value with the value of the object most similar to it. The most common way to judge similarity is to use the correlation matrix to determine the attribute most correlated with the attribute containing the missing value, then sort all objects by the value of that most correlated attribute and fill the missing value with the value of the object ranked immediately before it. Compared with mean imputation, the standard deviation of the variable stays close to its value before imputation, but the method is inconvenient to use, time-consuming, and prone to systematic underestimation. Multiple imputation replaces each missing value with a series of possible values to reflect the uncertainty of the replaced data. The several data sets generated by the repeated replacements are then analyzed with standard statistical procedures, and finally the statistics from each data set are combined to obtain estimates of the population parameters.

The third category uses a model to predict the missing data. A model is first defined for the input data, and the unknown parameters are then estimated by maximum likelihood based on this model. Many researchers have explored this approach. In 2012, Ji Liu proposed a tensor estimation method for missing values in visual data. In 2014, Emil Eirola proposed a distance estimation method for missing data based on a Gaussian mixture model. In 2014, Zhengbang Li proposed a mixed regression analysis for block-missing data. Although these methods achieve good results, their completion accuracy for some data still leaves much room for improvement.

Summary of the invention

The object of the invention is to address missing data in data sets by proposing a missing data completion method based on K-plane regression. The data are first clustered into K classes, and regression analysis is then carried out on each class; the regression output is the completed data.

The technical scheme of the present invention is as follows:

Step 1, perform data preprocessing: carry out missing-value detection on the data set, select the records without missing values as experimental data, and take the dimension to be completed as the output and the remaining dimensions as the inputs.

Step 2, initialize the parameters.

The parameters include the error allowed for completion, the manually set parameters, the number of iterations of the algorithm, the number of planes K, and the dimension after dimensionality reduction.

Step 3, use the PCA method to reduce the dimensionality.

The main purpose is to use PCA to screen the regression variables: the best variables are selected from the subsets formed by the original variables to form the set of optimal variables.

Step 4, normalize the new variable set obtained in step 3 to reduce the interference of noisy data, and select 70% of the data as the training set and 30% as the test set.

Step 5, perform K-means cluster analysis on the training set.

K-means cluster analysis groups the training data into K classes. Each class can be fitted by a corresponding plane, and the center of each class can be regarded as the initial geometric center μ of the corresponding plane.

Step 6, solve for the regression coefficient ω and the geometric center μ of each plane.

The geometric center μ and the regression coefficient ω of each plane are obtained by iterating on the error function; the data set S belonging to each plane is then redetermined from the regression coefficients and the geometric centers, and the new center of each plane is computed. This step is repeated until the geometric centers remain unchanged and the regression coefficients are stable, that is, until the error function converges.

Step 7, using the regression coefficients ω and geometric centers μ obtained in step 6, carry out regression prediction on the test data; the result is the predicted completed data.

Step 8, for the predictions obtained, four indexes are defined to evaluate the performance of the completion algorithm: maximum deviation, minimum deviation, mean deviation, and prediction accuracy.

Experimental results show that the missing data completion algorithm based on K-plane regression performs well.

Brief description of the drawings

The various aspects of the present invention will become apparent to the reader after reading the specific embodiment with reference to the accompanying drawings, in which:

Fig. 1 is the flow chart of the missing data completion method based on K-plane regression of the present invention;

Fig. 2 is a table describing the data sets used in the experiments of the present invention;

Fig. 3 shows the experimental results of the present invention.

Embodiment

Step 1, manually carry out missing-data detection, and take the data to be completed as the output and the remaining data as the inputs.

Step 2, initialize the parameters.

The maximum allowed error is chosen as the difference between the maximum and minimum values of the dimension to be completed, multiplied by a manually set coefficient α; we set α to 0.1.
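The choice of the maximum allowed error can be illustrated with a minimal sketch. The function name `allowed_error` and the use of `None` to mark missing entries are assumptions made for illustration only; the text specifies only the rule α × (max − min) with α = 0.1.

```python
def allowed_error(values, alpha=0.1):
    """Maximum allowed completion error: alpha * (max - min) of the observed values.

    `None` marks a missing entry (an illustrative convention, not from the text).
    """
    observed = [v for v in values if v is not None]
    return alpha * (max(observed) - min(observed))


# Observed values range from 2.0 to 12.0, so the tolerance is 0.1 * 10.0.
tolerance = allowed_error([2.0, 5.0, None, 12.0])
```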

Step 3, use PCA to reduce the dimensionality of the input data.

The covariance matrix C is obtained as shown in formula (1), where X is the input of the completion algorithm and m is the number of data records. The eigenvalues of C and their corresponding eigenvectors are computed, the eigenvectors are arranged by rows from top to bottom in descending order of their eigenvalues to form a matrix, and the first d rows are taken to form the matrix P; Y = XP is the data after dimensionality reduction. Here d is the dimension after reduction. To choose d, we define a contribution rate R, as shown in formula (2): if the sum of the first d eigenvalues divided by the sum of all eigenvalues exceeds the chosen contribution rate, then d is the dimension after reduction.

We generally set the required contribution rate to 95%.

C = \frac{1}{m} X X^{T} \qquad (1)

R = \frac{\sum_{i=1}^{d} e_{i}}{\sum_{i=1}^{n} e_{i}} \qquad (2)
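The selection of d by contribution rate can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the inventors' implementation: the rows of `X` are taken to be samples (so the covariance is computed as XᵀX/m rather than in the orientation of formula (1)), the data are mean-centered first, and the name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(X, contribution=0.95):
    """Project X (m samples x n features) onto the leading principal components.

    Assumptions: rows are samples, and the data are centred before the
    covariance matrix is formed.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each input dimension
    C = Xc.T @ Xc / Xc.shape[0]                # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum() # R for d = 1, 2, ..., n
    # smallest d whose cumulative ratio is strictly greater than the threshold
    d = int(np.searchsorted(ratio, contribution, side="right")) + 1
    d = min(d, X.shape[1])
    P = eigvecs[:, :d]                         # n x d projection matrix
    return Xc @ P, P, d
```

With the default threshold of 0.95, the function returns the reduced data, the projection matrix, and the chosen dimension d.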

Step 4, normalize the reduced data obtained in step 3 to the range 0-1, and select 70% of the data as the training set and the remaining 30% as the test set.
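A minimal sketch of this step is given here. The text does not say how the 70% is chosen, so the seeded random shuffle is an assumption, as are the function names `min_max_normalize` and `split_train_test`.

```python
import random


def min_max_normalize(rows):
    """Scale every column of `rows` (a list of equal-length lists) to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # a constant column has zero span; divide by 1.0 instead to avoid 0/0
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle (assumed; the text only gives the 70/30 proportions) and split."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


data = [[float(i), 10.0 - i] for i in range(10)]
train, test = split_train_test(min_max_normalize(data))
```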

Step 5, partition the training set into K parts with the K-means clustering algorithm.

We carry out cluster analysis on the training set and group the data into k classes, where k is the number of planes we define. Each class is fitted by a corresponding plane, and the cluster center can be regarded as the geometric center of that plane. The initial geometric center of each plane is thus obtained from the corresponding cluster center.
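For illustration, the initial partition could be obtained with a plain Lloyd-style K-means such as the sketch below. The text names only "Kmeans"; the random initialization from data points, the stopping rule, and the function name `kmeans` are assumptions.

```python
import numpy as np


def kmeans(X, K, iters=100, seed=0):
    """Plain Lloyd iteration; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # assumed initialisation: K distinct data points chosen at random
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # squared distance of every sample to every centre, shape (m, K)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The returned centers would serve as the initial geometric centers μ of the K planes.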

Step 6, iteratively solve for the regression coefficient ω and the geometric center μ of each plane.

We define the error function of this regression analysis as shown in formula (3).

E(\theta) = \sum_{n=1}^{N} \sum_{k \in \{1, \ldots, K\}} \left[ \left( \tilde{\omega}_{k}^{T} \tilde{X}_{n} - y_{n} \right)^{2} + \gamma \left\| X_{n} - \mu_{k} \right\|^{2} \right] \qquad (3)

X_n denotes the input data and y_n the true data; γ is a user-defined parameter that weights the two terms of the formula. We determine γ by ten-fold cross-validation. Our goal is to minimize this error function. In doing so, we not only minimize the regression prediction error of each plane but also ensure that the data used for the regression prediction belong to that plane. To this end we define formula (4).

S_{k} = \left\{ X_{n} \;\middle|\; k = \arg\min_{j \in \{1, \ldots, K\}} \left( \tilde{\omega}_{j}^{T} \tilde{X}_{n} - y_{n} \right)^{2} + \gamma \left\| X_{n} - \mu_{j} \right\|^{2} \right\} \qquad (4)
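The assignment rule of formula (4) can be checked by hand on a small case. In the sketch below the tilde notation is read as an augmented vector with a constant 1 appended for the intercept; this reading, and the names `assignment_cost` and `assign_plane`, are assumptions made for illustration.

```python
def assignment_cost(x, y, w, mu, gamma):
    """Squared regression residual plus gamma times squared distance to the centre."""
    x_aug = list(x) + [1.0]   # assumed augmentation: bias term appended last
    residual = sum(wi * xi for wi, xi in zip(w, x_aug)) - y
    dist_sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return residual ** 2 + gamma * dist_sq


def assign_plane(x, y, planes, gamma):
    """Index of the plane (w, mu) with the smallest cost for the sample (x, y)."""
    costs = [assignment_cost(x, y, w, mu, gamma) for w, mu in planes]
    return costs.index(min(costs))


# Two planes over a 1-D input: y = x centred at 0, and y = -x + 4 centred at 4.
planes = [([1.0, 0.0], [0.0]), ([-1.0, 4.0], [4.0])]
k = assign_plane([3.0], 1.0, planes, gamma=0.1)   # costs are 4.9 and 0.1
```

Here the sample (x = 3, y = 1) lies exactly on the second plane and near its center, so it is assigned to index 1.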

The target of formula (3) is to find the sets S_k. Each S_k is a set of input data X, namely the data for which the expression minimized in formula (4) attains its minimum at plane k. With these sets, formula (3) becomes formula (5), as follows.

E(\theta) = \sum_{k=1}^{K} \sum_{X_{n} \in S_{k}} \left[ \left( \tilde{\omega}_{k}^{T} \tilde{X}_{n} - y_{n} \right)^{2} + \gamma \left\| X_{n} - \mu_{k} \right\|^{2} \right] \qquad (5)

Since formula (5) is a function of ω, and S_k also depends on ω, we use an EM iterative algorithm to solve formula (5) for the regression coefficients ω and the means μ until convergence. The optimal value of the user-defined parameter γ is determined by ten-fold cross-validation.
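One way to realize this iteration is an alternating scheme with hard assignments: with the sets S_k fixed, formula (5) is minimized by an ordinary least-squares fit per plane and by setting μ_k to the mean of the members of S_k; the sets are then recomputed by formula (4). The sketch below follows that reading. It is a sketch under assumptions, not the inventors' code: the bias column, the small ridge term for numerical stability, the stopping rule, and the name `fit_k_planes` are all illustrative.

```python
import numpy as np


def fit_k_planes(X, y, centers, gamma=0.1, max_iter=100, ridge=1e-8):
    """Alternate between refitting each plane and reassigning the samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_aug = np.hstack([X, np.ones((len(X), 1))])   # assumed bias column
    mu = np.array(centers, dtype=float)
    K = len(mu)
    # initial sets S_k: nearest initial centre (e.g. from K-means)
    labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    W = np.zeros((K, X_aug.shape[1]))
    for _ in range(max_iter):
        for k in range(K):
            members = labels == k
            if not members.any():
                continue                           # keep old parameters if S_k is empty
            A = X_aug[members]
            # least-squares fit; the ridge term is an added safeguard
            W[k] = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]),
                                   A.T @ y[members])
            mu[k] = X[members].mean(axis=0)
        residual_sq = (X_aug @ W.T - y[:, None]) ** 2
        dist_sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = (residual_sq + gamma * dist_sq).argmin(axis=1)  # formula (4)
        if np.array_equal(new_labels, labels):
            break                                  # assignments stable
        labels = new_labels
    return W, mu, labels
```

Each row of the returned `W` holds the coefficients of one plane, with the intercept in the last position.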

Step 7, with the parameters ω and μ obtained in step 6, the completed data are predicted by formula (6).

y_{n}^{*} = \tilde{\omega}_{k}^{T} \tilde{X}_{n} \qquad (6)
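Formula (6) needs a plane index k for each sample to be completed, and the text does not state how k is chosen when y is unknown. The sketch below assumes that the plane with the nearest geometric center is used; this choice, the appended bias term, and the name `predict` are assumptions.

```python
def predict(x, planes):
    """Predict y for input x from a list of (w, mu) planes.

    Assumption: since y is unknown at completion time, the plane whose
    centre mu is closest to x is selected. The bias is the last entry of w.
    """
    def dist_sq(j):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, planes[j][1]))

    k = min(range(len(planes)), key=dist_sq)
    w, _ = planes[k]
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))


# Planes y = 2x + 1 (centre 0) and y = -x + 3 (centre 10).
fitted = [([2.0, 1.0], [0.0]), ([-1.0, 3.0], [10.0])]
completed = predict([9.0], fitted)   # nearest centre is 10, so y = -9 + 3
```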

Step 8, we define the maximum deviation, minimum deviation, mean deviation, and prediction accuracy to evaluate the performance of the completion algorithm. A deviation is the proportion by which a completed value departs from the original value. The following formulas give the maximum deviation, minimum deviation, mean deviation, and prediction accuracy respectively.

\mathrm{max\_dev} = \max_{i=1}^{m} \left( | y_{i} - y_{i}^{*} | / y_{i} \right) \qquad (7)

\mathrm{min\_dev} = \min_{i=1}^{m} \left( | y_{i} - y_{i}^{*} | / y_{i} \right) \qquad (8)

\mathrm{ave\_dev} = \sum_{i=1}^{m} \left( | y_{i} - y_{i}^{*} | / y_{i} \right) / m \qquad (9)

\mathrm{sign}_{i} = \begin{cases} 0, & | y_{i} - y_{i}^{*} | - \alpha > 0 \\ +1, & | y_{i} - y_{i}^{*} | - \alpha \le 0 \end{cases} \qquad (10)

\mathrm{pre} = \sum_{i=1}^{m} \mathrm{sign}_{i} / m

For the maximum deviation, we take, for every predicted value, the absolute value of the difference between the actual value and the predicted value divided by the actual value, and find the maximum over all prediction deviations. For the minimum deviation, the same per-value deviation is computed and the minimum over all prediction deviations is taken. For the mean deviation, we sum the absolute differences between actual and predicted values divided by the actual values, and divide by the number of predicted data. For the prediction accuracy, we take the absolute value of the difference between each actual value and its predicted value and subtract the permissible error α. If the result is greater than 0, the datum is marked 0, indicating a wrong prediction; if the result is less than or equal to 0, the datum is marked +1, indicating a correct prediction. The prediction accuracy is the number of correctly predicted data divided by the total number of data.
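The four indexes can be computed as in the following sketch. The function name `evaluate` is hypothetical, and dividing by the absolute value of the actual value (rather than the signed value in formulas (7) to (9)) is an assumption that keeps deviations non-negative; actual values equal to zero would need separate handling.

```python
def evaluate(y_true, y_pred, tolerance):
    """Return max, min and mean relative deviation and the prediction accuracy."""
    # relative deviation per sample; abs(t) in the denominator is an assumption
    rel = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    # sign_i is 1 when the absolute error is within the tolerance, else 0
    hits = [1 if abs(t - p) - tolerance <= 0 else 0 for t, p in zip(y_true, y_pred)]
    m = len(y_true)
    return {
        "max_dev": max(rel),
        "min_dev": min(rel),
        "ave_dev": sum(rel) / m,
        "pre": sum(hits) / m,
    }


# Tolerance 0.1 * (40 - 10) = 3.0; absolute errors are 1, 0 and 10.
scores = evaluate([10.0, 20.0, 40.0], [11.0, 20.0, 30.0], tolerance=3.0)
```

In this example two of the three predictions fall within the tolerance, so the prediction accuracy is 2/3.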

We thank Shuguang Hospital and the National 863 Program (title: Research on the management, analysis, and application of clinical big data for cardiovascular disease and tumor disease in traditional Chinese and Western medicine; project approval code: SQ2015AA0201076; funding: 10 million yuan) for their strong support of and help with this patent.

Claims

1. A missing data completion method based on K-plane regression, characterized in that the following steps are carried out when completing missing data:

Step 1, manually carry out missing-data detection, and take the data to be completed as the output and the remaining data as the inputs;

Step 2, initialize the parameters;

Step 3, use PCA to reduce the dimensionality of the input data;

Step 4, normalize the data obtained in step 3 to the range 0-1, and select 70% of the data as the training set and the remaining 30% as the test set;

Step 5, carry out cluster analysis on the training set with the K-means clustering algorithm to obtain the initial geometric centers μ;

Step 6, minimize the error function, iteratively solving for the regression coefficient ω and the geometric center μ of each plane;

Step 7, using the parameters ω and μ obtained in step 6, carry out regression prediction on the test data; the result is the completed data;

Step 8, after the completed data are obtained in step 7, define four indexes, namely maximum deviation, minimum deviation, mean deviation, and prediction accuracy, to evaluate the performance of the completion algorithm.

2. The missing data completion method based on K-plane regression according to claim 1, characterized in that: in step 3, a dimensionality reduction operation is carried out on the data set before the missing data are completed. Every data set to be completed contains dimensions of high correlation and dimensions of low correlation. The PCA (principal component analysis) method is used to select the principal components of the data set: the eigenvalue of each dimension and the corresponding eigenvector are calculated, the highly correlated dimensions are selected as the main inputs for completion, and a contribution rate is defined as follows:

R = \frac{\sum_{i=1}^{d} e_{i}}{\sum_{i=1}^{n} e_{i}} \qquad (1)

R is the proportion of the sum of all eigenvalues accounted for by the first d eigenvalues; the d for which R exceeds 95% is the dimension after reduction.

3. The missing data completion method based on K-plane regression according to claim 1, characterized in that: in step 5, the training set is partitioned into k parts, where k is also the number of planes. The target data set is partitioned into k classes, each fitted by a corresponding plane. The initial k-partition is made with the K-means clustering algorithm; thereafter, as the regression planes are optimized and their coefficients ω adjusted, the data are reassigned into k classes:

S_{k} = \left\{ X_{n} \;\middle|\; k = \arg\min_{j \in \{1, \ldots, K\}} \left( \tilde{\omega}_{j}^{T} \tilde{X}_{n} - y_{n} \right)^{2} + \gamma \left\| X_{n} - \mu_{j} \right\|^{2} \right\} \qquad (2)

S_k is the set of data belonging to the k-th class.

4. The missing data completion method based on K-plane regression according to claim 1, characterized in that: when the regression function is constructed in step 6, not only is the best fit of each regression plane ensured, but the data being fitted are also kept near the geometric center of their corresponding plane:

E(\theta) = \sum_{k=1}^{K} \sum_{X_{n} \in S_{k}} \left[ \left( \tilde{\omega}_{k}^{T} \tilde{X}_{n} - y_{n} \right)^{2} + \gamma \left\| X_{n} - \mu_{k} \right\|^{2} \right] \qquad (3)

γ is a user-defined parameter that adjusts the relative weight of the two terms in formula (3); 10-fold cross-validation is used to determine the optimal γ.

5. The missing data completion method based on K-plane regression according to claim 1, characterized in that: in step 8, the completed data are evaluated with four indexes: maximum deviation, minimum deviation, mean deviation, and prediction accuracy. When the prediction accuracy is evaluated, the permissible error α is chosen as the maximum value minus the minimum value of the dimension to be completed, multiplied by a threshold, and the threshold is set to 0.1.