CN111340069A - Incomplete data fine modeling and missing value filling method based on alternate learning - Google Patents


Info

Publication number
CN111340069A
CN111340069A (application CN202010085968.4A)
Authority
CN
China
Prior art keywords
model
filling
input
features
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010085968.4A
Other languages
Chinese (zh)
Inventor
刘辉 (Liu Hui)
张立勇 (Zhang Liyong)
宋橘超 (Song Juchao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202010085968.4A priority Critical patent/CN111340069A/en
Publication of CN111340069A publication Critical patent/CN111340069A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 18/00 Pattern recognition > G06F 18/10 Pre-processing; Data cleansing
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/23 Clustering techniques
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 7/00 Computing arrangements based on specific mathematical models > G06N 7/02 Computing arrangements based on specific mathematical models using fuzzy logic


Abstract

The invention discloses an incomplete data fine modeling and missing value filling method based on alternate learning, belonging to the field of data mining. First, the input space is divided into several subsets by a fuzzy clustering algorithm, and a specific local linear regression model is established for each subset. A global model is then constructed as the weighted sum of the local linear regression models, which improves the fineness of the model, and a stepwise regression algorithm selects the significant input features of each subset to refine the model further. Treating the missing values as variables, a model-solving strategy is proposed that alternately learns the selection of significant input features, the model parameters, and the filling of the missing values, so that filling is completed together with modeling. The invention improves the fineness of the model established by traditional regression filling, effectively solves the problem of incomplete model input data when modeling incomplete data, and achieves good filling accuracy.

Description

Incomplete data fine modeling and missing value filling method based on alternate learning
Technical Field
The invention belongs to the field of data mining, and relates to a method for performing fine modeling and missing value filling on incomplete data based on alternate learning.
Background
Data mining technology can discover, through algorithms, information hidden in large amounts of data, thereby providing correct guidance for decision making. However, in many fields of real life, missing data is an almost unavoidable problem. High-quality data is a prerequisite for high-quality data mining. Because many data mining algorithms cannot handle incomplete data sets directly, missing value filling has become a research hotspot of incomplete data analysis. Researchers have proposed various missing value filling methods, such as mean filling, hot-deck filling, clustering-based filling, and regression filling.
The mean filling method (H.L. Shashirekha, A.H. Wani, Analysis of imputation algorithms for microarray gene expression data, in: 2015 International Conference on Applied and Theoretical Computing and Communication Technology, Davangere, India, 2015) replaces each missing value with the mean of the existing data in its incomplete attribute column. Although this method fills missing values quickly, it reduces the diversity of the filling values, so the filling effect is poor.
Unlike the mean filling method, the hot-deck filling method (T. Srebotnjak, G. Carr, A. Sherbinin, C. Rickwood, A global Water Quality Index and hot-deck imputation of missing data, Ecological Indicators, 17 (2012) 108-) fills each missing value with the corresponding value of the most similar complete sample. Because the correlation between samples is taken into account, this method generally achieves better filling performance than mean filling.
Similar to hot-deck filling, the cluster-based filling method (C.F. Tsai, M.L. Li, W.C. Lin, A class center based approach for missing value imputation, Knowledge-Based Systems, 151 (2018) 124-) first groups the samples into clusters and then fills the missing values in each incomplete sample using information from the samples within the same cluster.
Unlike the above methods, the regression filling method (C. Crambes, Y. Henchiri, Regression imputation in the functional linear model with missing values in the response, Journal of Statistical Planning and Inference, 201 (2018) 103-119) is a model-based filling method. Its main idea is to build a regression model for the incomplete data according to the dependency relationships between the attributes and then fill in the missing values based on the built regression model. Because the correlation between attributes is taken into account, this method generally outperforms the methods above. However, the filling result of regression filling is usually strongly influenced by the accuracy of the established regression model, so the modeling of incomplete data has attracted the interest of many researchers. How to handle incomplete model input data and how to properly describe the relationships between attributes are the two major issues facing incomplete data modeling.
Currently, a simple way to deal with the incompleteness of the model input data is to delete all incomplete samples containing missing attribute values and model only the complete part of the incomplete data set (F. Honghai, C. Guoshun, Y. Cheng, Y. Bingru, C. Yumei, A SVM regression based approach to filling in missing values, Lecture Notes in Computer Science, 3683 (2005) 581-587). This method suits cases with a low missing rate or few attributes, because when the deleted portion is too large, a large amount of useful information is discarded and the modeling effect deteriorates. Another more popular approach is to pre-fill the missing values before modeling and then model the reconstructed complete data set (H. Kim, G.H. Golub, H. Park, Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics, 21 (2005) 187-198). This approach retains the existing values of incomplete samples and thus makes fuller use of the information, but because the missing values are pre-filled, the quality of the pre-filled values directly affects the model accuracy.
Some researchers build different models for samples in different clusters to describe the relationships between attributes more reasonably. A filling method based on clustering and regression models divides the data set into clusters and establishes a specific least squares regression model in each cluster to predict the missing values (P. Keerin, W. Kurutach, T. Boongoen, An improvement of missing value imputation in DNA microarray data using cluster-based LLS method, in: International Symposium on Communications and Information Technologies, Surat Thani, Thailand, 2013, pp. 559-). Compared with the traditional regression filling method, this method achieves better filling performance. A filling method based on clustering and stacked denoising autoencoders first partitions the samples with the k-means clustering algorithm and then builds a stacked-denoising-autoencoder model in each cluster to fill the missing values (W.C. Ku, G.R. Jagadeesh, A. Prakash, T. Srikanthan, A clustering-based approach for data-driven imputation of missing traffic data, in: IEEE Forum on Integrated and Sustainable Transportation Systems, Beijing, China, 2016).
In recent years, researchers have applied the Takagi-Sugeno (TS) fuzzy model to the analysis and prediction of incomplete data and achieved good filling performance. A filling method based on incomplete-data fuzzy modeling first pre-fills the missing values with cluster centers, then models the reconstructed complete data set with a TS fuzzy model and predicts the missing values with the built model (X. Lai, X. Liu, L. Zhang, et al., Missing Value Imputation by Rule-Based Incomplete Data Fuzzy Modeling, in: IEEE International Conference on Communications (ICC), Shanghai, China, 2019). The main idea of the TS fuzzy model is to divide the input space into several subsets, establish a different linear regression equation on each subset, and connect the linear regression equations through membership degrees. The model consists of a series of "IF-THEN" fuzzy rules whose consequents are usually linear functions of the input variables. Given an incomplete data set X with sample size n and s attributes, let x_k = [x_1k, x_2k, …, x_sk]^T (1 ≤ k ≤ n) be the kth sample and x_jk (1 ≤ j ≤ s) the jth attribute value of x_k. When the jth attribute is taken as the model output and the remaining attributes as the model input, the ith fuzzy rule has the form:
R^(i): IF x_1k is A_1^(i) and … and x_sk is A_s^(i), THEN ŷ_k^(i) = p_0^(i) + Σ_{q≠j} p_q^(i) x_qk, 1 ≤ i ≤ c, (1)

where c is the number of fuzzy rules; A_q^(i) denotes the subset to which the qth input feature in the antecedent of the ith fuzzy rule belongs; p_q^(i) are the consequent parameters of the ith fuzzy rule; and ŷ_k^(i) represents the output of the ith fuzzy rule. The final output of the model is shown in equation (2):

ŷ_k = Σ_{i=1}^{c} w^(i)(x_k) ŷ_k^(i) / Σ_{i=1}^{c} w^(i)(x_k), (2)

where w^(i)(x_k) is the contribution weight of the ith fuzzy rule, obtained from equation (3):

w^(i)(x_k) = ∧_{q} μ_{A_q^(i)}(x_qk), (3)

where the operator ∧ denotes the minimum operation, and μ_{A_q^(i)}(x_qk) denotes the degree to which x_qk belongs to the subset A_q^(i). Compared with the traditional regression model, the TS fuzzy model accounts for the differences among the regression relations in different subsets and is therefore more suitable for describing the relationships between attributes.
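As an illustration only (not part of the patent), the inference of equations (1)-(3) can be sketched in Python; the names `memberships` and `consequents` are hypothetical containers for the antecedent membership functions and consequent parameters of each rule:

```python
import numpy as np

def ts_output(x, memberships, consequents):
    """Output of a TS fuzzy model for one input vector x.

    memberships: list of c callables; memberships[i](x) returns the
                 per-feature membership degrees of x in rule i's subsets.
    consequents: (c, 1 + len(x)) array of parameters [p0, p1, ..., ps].
    """
    c = len(memberships)
    # firing strength of each rule: min over the antecedent memberships, eq. (3)
    w = np.array([np.min(memberships[i](x)) for i in range(c)])
    # linear consequent of each rule: p0 + p1*x1 + ... + ps*xs, eq. (1)
    y_rule = consequents @ np.concatenate(([1.0], x))
    # weighted average of the rule outputs, eq. (2)
    return float((w @ y_rule) / w.sum())
```

For example, with two rules whose consequent parameter rows are identical, the output equals that shared linear function regardless of the firing strengths.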
Disclosure of Invention
The invention provides a method for fine modeling and missing value filling of incomplete data based on alternate learning. The method divides the input space based on the TS fuzzy model, selects significant input features for each subset to improve the fineness of the model, and proposes an alternate learning strategy to solve the fine model and fill the missing values. The alternate learning strategy effectively weakens the influence of the quality of the pre-filled values on the selection of the input features and on the model parameters, yielding better filling results. Compared with the traditional regression filling method, the proposed filling method effectively improves the filling accuracy.
The invention divides the input space into several subsets and establishes a specific linear regression equation for each subset, then uses a stepwise regression algorithm to select the significant input features of each linear regression equation to improve the fineness of the model. On this basis, the missing values are treated as variables, and the selection of significant input features, the model's consequent parameters, and the filling of the missing values are learned alternately until the iteration converges, which solves the problem of incomplete model input data. When the iteration converges, the filling is completed along with the modeling.
The technical scheme of the invention is as follows:
a method for performing fine modeling and missing value filling on incomplete data based on alternate learning specifically comprises the following steps:
(1) modeling
The input space is first partitioned using the fuzzy c-means clustering algorithm based on a partial distance strategy (FCM-PDS). Given an incomplete data set with sample size n and s attributes, the FCM-PDS algorithm divides the input space into c subsets by minimizing the objective function in equation (4):
J = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ki)^m (d_ki)^2, (4)

where u_ki represents the degree to which sample x_k belongs to subset A^(i); m is the weighting exponent of the membership degrees, m ∈ (1, ∞); and d_ki denotes the partial distance between x_k and the cluster center v_i = [v_1i, v_2i, …, v_si] (1 ≤ i ≤ c), calculated as in equation (5):

(d_ki)^2 = (s / Σ_{j=1}^{s} I_jk) Σ_{j=1}^{s} (x_jk − v_ji)^2 I_jk, (5)

where v_ji denotes the jth attribute value of v_i; I_jk marks whether x_jk is missing, with I_jk = 0 if x_jk ∈ X_M and I_jk = 1 if x_jk ∈ X_P; and X_M and X_P are the set of all missing values and the set of all existing values, respectively.
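A minimal sketch of fuzzy c-means with the partial distance of equation (5) may look as follows; it assumes the standard FCM membership and center updates (the function name and interface are illustrative, not from the patent):

```python
import numpy as np

def fcm_pds(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means with a partial distance strategy (FCM-PDS sketch).

    X: (n, s) array with np.nan marking missing values.
    Returns the membership matrix U (n, c) and cluster centers V (c, s).
    """
    rng = np.random.default_rng(seed)
    n, s = X.shape
    M = ~np.isnan(X)
    I = M.astype(float)            # I[k, j] = 1 if x_jk exists, else 0
    Xz = np.where(M, X, 0.0)       # zero-fill so sums skip missing entries
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # centers computed from existing values only
        V = (W.T @ Xz) / np.maximum(W.T @ I, 1e-12)
        # squared partial distances, scaled by s / (number of existing attributes)
        diff2 = (Xz[:, None, :] - V[None, :, :]) ** 2 * I[:, None, :]
        d2 = diff2.sum(axis=2) * (s / np.maximum(I.sum(axis=1), 1.0))[:, None]
        d2 = np.maximum(d2, 1e-12)
        # standard FCM membership update
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```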
Then, a stepwise regression algorithm selects the significant input features of each fuzzy rule. The stepwise regression algorithm introduces, one by one in order of importance, the features that significantly affect the output into the regression model, and each time a new feature is introduced, the features already selected into the regression model are re-tested for significance. If an existing feature in the regression model becomes insignificant due to the introduction of a new feature, the least significant feature is deleted. The algorithm terminates when no new feature can be selected into the regression model and no insignificant feature can be removed from it.
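The forward-backward stepwise loop just described can be sketched as follows; as a simplifying assumption, fixed partial-F thresholds stand in for formal significance tests (classical implementations compare F statistics against critical values or p-values), and the names are illustrative:

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def stepwise_select(X, y, f_in=4.0, f_out=3.9, max_rounds=100):
    """Forward-backward stepwise feature selection with partial-F thresholds."""
    n, s = X.shape
    selected = []
    for _ in range(max_rounds):
        changed = False
        # forward step: add the most significant remaining feature, if any
        remaining = [j for j in range(s) if j not in selected]
        if remaining:
            rss0 = _rss(X[:, selected], y)
            dof = n - len(selected) - 2
            best_f, best_j = -np.inf, None
            for j in remaining:
                rss1 = _rss(X[:, selected + [j]], y)
                f = (rss0 - rss1) / max(rss1 / dof, 1e-12)
                if f > best_f:
                    best_f, best_j = f, j
            if best_f > f_in:
                selected.append(best_j)
                changed = True
        # backward step: drop features that have become insignificant
        if len(selected) > 1:
            for j in list(selected):
                rest = [k for k in selected if k != j]
                rss_full = _rss(X[:, selected], y)
                dof = n - len(selected) - 1
                f = (_rss(X[:, rest], y) - rss_full) / max(rss_full / dof, 1e-12)
                if f < f_out:
                    selected.remove(j)
                    changed = True
        if not changed:
            break
    return selected
```

Setting the removal threshold slightly below the entry threshold (`f_out < f_in`) is the usual guard against a feature being added and dropped in the same round.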
After the input space is divided and the significant input features of each fuzzy rule are selected, let T^(i) = {x_1, x_2, …, x_{m_i}} be the significant input feature set of the ith fuzzy rule and m_i the number of selected features, where each significant input feature x_j = [x_j1, x_j2, …, x_jn]^T (1 ≤ j ≤ m_i). The ith fuzzy rule is simplified from equation (1) to equation (6):

R^(i): IF x_1k is A_1^(i) and … and x_{m_i k} is A_{m_i}^(i), THEN ŷ_k^(i) = p_0^(i) + Σ_{q=1}^{m_i} p_q^(i) x_qk, (6)

where c is the number of fuzzy rules; ŷ_k^(i) represents the output of the ith fuzzy rule; x̃_k = [x_1k, x_2k, …, x_{m_i k}]^T is the simplified kth sample; A_{m_i}^(i) is the subset to which the m_i-th input feature in the antecedent of the simplified ith fuzzy rule belongs; and p_q^(i) are the consequent parameters of the simplified ith fuzzy rule. Moreover, the contribution weight of the ith fuzzy rule changes from w^(i)(x_k) to w^(i)(x̃_k), calculated as in equation (7):

w^(i)(x̃_k) = ∧_{q=1}^{m_i} μ_{A_q^(i)}(x_qk). (7)
In equation (7), the membership degree of a single variable, μ_{A_j^(i)}(x_jk), is obtained from the multivariate membership degrees u_ki through Gaussian projection, as shown in equation (8):

μ_{A_j^(i)}(x_jk) = exp(−(x_jk − a_ji)^2 / (2 b_ji^2)), (8)

where a_ji and b_ji denote the center and the standard deviation of the Gaussian function, respectively, calculated as in equation (9):

a_ji = Σ_{k=1}^{n} u_ki x_jk / Σ_{k=1}^{n} u_ki,  b_ji = ( Σ_{k=1}^{n} u_ki (x_jk − a_ji)^2 / Σ_{k=1}^{n} u_ki )^{1/2}, (9)

where u_ki represents the degree to which sample x_k belongs to fuzzy subset A^(i). The output ŷ_k of the TS fuzzy model can then be calculated from equation (10):

ŷ_k = Σ_{i=1}^{c} w^(i)(x̃_k) ŷ_k^(i) / Σ_{i=1}^{c} w^(i)(x̃_k). (10)
(2) missing value filling
Because a single TS fuzzy model can only fill the missing values of one incomplete attribute column, each incomplete attribute column is taken as the output in turn, with all remaining attributes as input, to establish several TS fuzzy models. To address the incompleteness of the model input data, the missing values are treated as variables, and an alternate learning strategy is proposed for model solving and missing value filling. The alternate learning strategy consists of the following steps:
step 1: the missing values are mean pre-padded to obtain a reconstructed complete data set.
Step 2: significant input features and back-piece parameters of the model are updated based on the reconstructed complete data set.
And step 3: and obtaining model output according to the significant input features and the back-part parameters of the updated model and updating the missing value by using the model output.
And 4, step 4: if the filling error obtained by the existing value and the corresponding model output is larger than or equal to the given threshold value, returning to the step 2; otherwise, the model corresponding to the missing value is used to output the filled missing value and the filled data set is output.
The invention has the following beneficial effects. First, on the basis of regression modeling, the input space is divided, a linear regression equation is established for each subset, and significant input features are selected for each equation; these two steps improve the fineness of the model and enhance the filling performance. Second, to address the incompleteness of the model input, the missing values are treated as variables, and an alternate learning strategy is proposed in which the selection of significant input features, the model's consequent parameters, and the filling of the missing values are learned alternately until the iteration converges. During alternate learning, the model structure and parameters become gradually more accurate as the filling accuracy improves, and in turn the more accurate structure and parameters make the filled values more reasonable.
Drawings
Fig. 1 is an overall workflow diagram of the present invention.
FIG. 2 is a workflow diagram of the alternate learning strategy of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention is provided in conjunction with the accompanying drawings.
Fig. 1 is the overall workflow of the present invention. In the figure, the first row of the incomplete data set, 1, 2, …, s, denotes the attribute indices; black marks denote missing values and white marks denote existing values. The invention first divides the incomplete data set into several subsets using the FCM-PDS algorithm and uses these subsets as input for the subsequent feature selection. The incomplete data set is then pre-filled with attribute means to obtain a reconstructed complete data set, and a stepwise regression algorithm selects the features of each subset based on the reconstructed complete data set to obtain the significant input features of the model. The consequent parameters of the model are then computed by least squares, and the model output is computed from the consequent parameters and the significant input features. Finally, treating the missing values as variables, the reconstructed complete data set is updated based on the model output, after which the significant input features, consequent parameters, and model output are updated for the next iteration. If the change between two adjacent iterations of the reconstruction error computed from the existing values and the corresponding model outputs is smaller than a specified threshold, the iteration has converged, the filling of the missing values is completed along with the modeling, and the filled data set is output. Otherwise, the reconstructed complete data set is updated and the next iteration begins.
Examples
The details of the invention are illustrated with the Blood data set from the UCI machine learning repository. Blood is a complete data set with a sample size of 748 and 4 attributes; part of its data is deleted manually to construct an incomplete data set.
Assume the space of the 748 samples is divided into 2 subsets. Taking the first attribute as the output and all remaining attributes as input, the two fuzzy rules of the established TS fuzzy model are shown in equation (11), and the built model is denoted TS-1:

R^(1): IF x_2k is A_2^(1) and x_3k is A_3^(1) and x_4k is A_4^(1), THEN ŷ_k^(1) = p_0^(1) + p_2^(1) x_2k + p_3^(1) x_3k + p_4^(1) x_4k;
R^(2): IF x_2k is A_2^(2) and x_3k is A_3^(2) and x_4k is A_4^(2), THEN ŷ_k^(2) = p_0^(2) + p_2^(2) x_2k + p_3^(2) x_3k + p_4^(2) x_4k. (11)

Similarly, the 2nd, 3rd, and 4th attributes are taken as the output in turn, with all remaining attributes as input, and models are built based on the TS fuzzy model; the model built for the jth attribute is denoted TS-j (1 ≤ j ≤ 4). The missing values are then pre-filled with attribute means to obtain a reconstructed complete data set, and significant input features are selected for each fuzzy rule. Suppose the significant input feature set of R^(1) in equation (11) is T^(1) = {x_2, …, x_{m_1}} and that of R^(2) is T^(2) = {x_2, …, x_{m_2}}; then equation (11) is simplified to equation (12):

R^(1): IF x_2k is A_2^(1) and … and x_{m_1 k} is A_{m_1}^(1), THEN ŷ_k^(1) = p_0^(1) + p_2^(1) x_2k + … + p_{m_1}^(1) x_{m_1 k};
R^(2): IF x_2k is A_2^(2) and … and x_{m_2 k} is A_{m_2}^(2), THEN ŷ_k^(2) = p_0^(2) + p_2^(2) x_2k + … + p_{m_2}^(2) x_{m_2 k}, (12)

and the output of the model TS-1 can be represented by equation (13):

ŷ_k = Σ_{i=1}^{2} w^(i)(x̃_k) ŷ_k^(i) / Σ_{i=1}^{2} w^(i)(x̃_k). (13)

Let P = [P^(1), P^(2), …, P^(c)]^T, where P^(i) is the consequent parameter vector of fuzzy rule R^(i); then the consequent parameters of the model TS-j can be obtained from equation (14):

P = (B^T B)^{-1} B^T y, (14)

where y = [x_j1, x_j2, …, x_jn]^T (1 ≤ j ≤ 4) is the desired output vector; B = [B^(1), B^(2), …, B^(c)], and B^(i) (1 ≤ i ≤ 2) is given by equation (15), in which each row k of B^(i) is the simplified input vector [1, x_2k, …, x_{m_i k}] scaled by the normalized contribution weight of rule i at sample k:

B^(i) = [ w̄^(i)(x̃_k) · [1, x_2k, …, x_{m_i k}] ]_{k=1,…,n}, with w̄^(i)(x̃_k) = w^(i)(x̃_k) / Σ_{l=1}^{c} w^(l)(x̃_k). (15)

After the consequent parameters of the model TS-j are obtained, the output corresponding to TS-j can be obtained from equation (16):

ŷ_j = B P, (16)

where ŷ_j = [ŷ_j1, ŷ_j2, …, ŷ_jn]^T. Combining the outputs corresponding to the s models TS-j gives the model output.
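The least-squares solution of equations (14)-(16) can be sketched as follows; `np.linalg.lstsq` replaces the explicit normal-equation inverse of equation (14) for numerical stability, and the interface is illustrative rather than the patent's:

```python
import numpy as np

def consequent_params(X_sel, y, W):
    """Consequent parameters of one TS model by least squares (cf. eq. (14)).

    X_sel: (n, m) significant inputs; y: (n,) desired output;
    W: (n, c) normalized rule contribution weights.
    """
    n, m = X_sel.shape
    c = W.shape[1]
    ones = np.column_stack([np.ones(n), X_sel])          # rows [1, x_k]
    # B stacks, per rule i, rows w_i(x_k) * [1, x_k]  (cf. eq. (15))
    B = np.hstack([W[:, [i]] * ones for i in range(c)])
    P, *_ = np.linalg.lstsq(B, y, rcond=None)            # stable form of eq. (14)
    return P.reshape(c, m + 1)

def model_output(X_sel, W, P):
    """Model output y_hat = B P (cf. eq. (16))."""
    n = len(X_sel)
    ones = np.column_stack([np.ones(n), X_sel])
    B = np.hstack([W[:, [i]] * ones for i in range(W.shape[1])])
    return B @ P.ravel()
```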
The invention treats the missing values as variables and designs an alternate learning strategy to weaken the influence of the quality of the pre-filled values on model accuracy; the implementation details of the strategy are shown in Fig. 2. In Fig. 2, X_P denotes the set of all existing values; X_M the set of all missing values; Ŷ_P the set of model outputs corresponding to the existing values; and Ŷ_M the set of model outputs corresponding to the missing values. First, the missing values are updated from Ŷ_M to adjust the reconstructed complete data set. Then, based on the stepwise regression algorithm, the significant input feature set of each fuzzy rule is adjusted from T^(i)(t−1) to T^(i)(t), where T^(i)(t−1) and T^(i)(t) denote the significant input feature set of fuzzy rule R^(i) in the last and the current iteration, respectively. Next, based on the least squares method, the consequent parameters of each fuzzy rule are adjusted from P^(i)(t−1) to P^(i)(t), where P^(i)(t−1) and P^(i)(t) denote the consequent parameters of R^(i) in the last and the current iteration, respectively. The output ŷ^(i) of R^(i) is then computed from its significant input feature set and consequent parameters, and the weighted sum of the rule outputs gives the output corresponding to TS-j. Finally, the outputs corresponding to the s models TS-j are combined into the model output, of which Ŷ_M is used to update the missing values and Ŷ_P is used to compute the reconstruction error f_e between the existing values and their corresponding model outputs. If Δf_e < ε, the iteration terminates and the filled data set is output; if Δf_e ≥ ε, the next iteration continues. Here ε denotes the threshold, and Δf_e = |f_e(t) − f_e(t−1)|, where f_e(t) and f_e(t−1) denote the reconstruction errors computed from the existing values and the corresponding model outputs in the current and the last iteration, respectively.
Comparative example
Three data sets are selected from the UCI machine learning repository to verify the filling performance of the proposed method; the data sets are described in Table 1. To compute the error between the estimated and the true values of the missing entries, all selected data sets are complete, and the experiment constructs incomplete data sets by deleting part of the data manually according to specified missing rates of 5%, 10%, 15%, 20%, 25%, and 30%.
Table 1 data set description
(Table 1 appears as an image in the original publication.)
The experiment compares six methods; all methods pre-fill the missing values with attribute means before modeling, and the sixth is the filling method proposed by the invention.
(1) Incomplete data is modeled with a linear regression model, with all features as input (REG).
(2) Incomplete data is modeled with a linear regression model, with significant features selected as input by stepwise regression (REG-SR).
(3) On the basis of REG-SR, the missing values are treated as variables, and the model structure, model parameters, and missing values are learned alternately until convergence (REG-SR-AL).
(4) Incomplete data is modeled with a TS fuzzy model, with all features as input (TS).
(5) Incomplete data is modeled with a TS fuzzy model, with significant features selected as input for each subset by stepwise regression (TS-SR).
(6) On the basis of TS-SR, the missing values are treated as variables, and the model structure, the model's consequent parameters, and the missing values are learned alternately until convergence (TS-SR-AL, the proposed method).
The root mean square error (RMSE) is used to evaluate the filling effect. The RMSE is the square root of the mean squared deviation between the filled values and the corresponding true values and reflects the filling accuracy well. It is computed as:

RMSE = ( (1/N) Σ_{t=1}^{N} (z_t − ẑ_t)^2 )^{1/2},

where N is the number of missing values in the data set, z_t is the true value at the tth missing position, and ẑ_t is the filled value at the tth missing position. Table 2 shows the RMSE results of the six filling methods, where the best result is bolded and underlined and the second-best result is bolded.
TABLE 2 RMSE indices of six filling methods
(Table 2 appears as an image in the original publication.)
Comparing TS with REG, TS-SR with REG-SR, and TS-SR-AL with REG-SR-AL in Table 2 shows that establishing a linear regression model for each subspace after dividing the input space yields smaller filling errors than establishing a single linear model directly. Comparing REG-SR with REG and TS-SR with TS shows that performing feature selection on the inputs of the linear regression models during modeling also reduces the filling errors. Comparing TS-SR-AL with TS-SR and REG-SR-AL with REG-SR shows that the alternate learning strategy markedly improves the filling accuracy.
In conclusion, the proposed TS-SR-AL achieves the best results, and its filling accuracy is better than that of all compared methods.

Claims (1)

1. A method for performing fine modeling and missing value filling on incomplete data based on alternate learning is characterized by comprising the following steps:
(1) modeling
Firstly, the input space is divided using a fuzzy c-means clustering algorithm based on a partial distance strategy; given an incomplete data set with sample size n and s attributes, the algorithm divides the input space into c subsets by minimizing the objective function in equation (4):
J = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ki)^m (d_ki)^2, (4)

wherein u_ki represents the degree to which sample x_k belongs to subset A^(i); m is the weighting exponent of the membership degrees, m ∈ (1, ∞); d_ki denotes the partial distance between x_k and the cluster center v_i = [v_1i, v_2i, …, v_si], 1 ≤ i ≤ c, and is calculated as in equation (5):

(d_ki)^2 = (s / Σ_{j=1}^{s} I_jk) Σ_{j=1}^{s} (x_jk − v_ji)^2 I_jk, (5)

wherein v_ji denotes the jth attribute value of v_i; I_jk marks whether x_jk is missing, with I_jk = 0 if x_jk ∈ X_M and I_jk = 1 if x_jk ∈ X_P; X_M and X_P are respectively the set of all missing values and the set of all existing values;
then, a stepwise regression algorithm is used for selecting the significant input features of each fuzzy rule: the step-by-step regression algorithm introduces the characteristics which have obvious influence on the output into the regression model one by one according to the importance, and the characteristics which are selected into the regression model are subjected to significance testing again when a new characteristic is introduced; if the existing features in the regression model become insignificant due to the introduction of new features, deleting the least significant features; terminating the algorithm when neither new features can be selected into the regression model nor insignificant features can be removed from the regression model;
After the input space has been partitioned and the significant input features of each fuzzy rule selected, let the significant input feature set of the ith fuzzy rule be F^{(i)} = \{f_1^{(i)}, f_2^{(i)}, \ldots, f_{m_i}^{(i)}\}, where m_i is the number of selected input features. With these significant input features, the ith fuzzy rule is simplified to equation (6):

R^{(i)}: \text{IF } x_k^{(i)} \text{ belongs to } A^{(i)} \text{ THEN } \hat{y}_k^{(i)} = \theta_0^{(i)} + \sum_{j=1}^{m_i} \theta_j^{(i)} x_{f_j^{(i)} k}, \quad i = 1, 2, \ldots, c   (6)

where c is the number of fuzzy rules; \hat{y}_k^{(i)} denotes the output of the ith fuzzy rule; x_k^{(i)} is the simplified kth sample; A^{(i)} is the subset to which the m_i-dimensional input features in the antecedent of the simplified ith fuzzy rule belong; and \theta^{(i)} = [\theta_0^{(i)}, \theta_1^{(i)}, \ldots, \theta_{m_i}^{(i)}] are the consequent parameters of the simplified ith fuzzy rule. The contribution weight w_k^{(i)} of the ith fuzzy rule is computed as shown in equation (7):

w_k^{(i)} = \frac{\prod_{j \in F^{(i)}} \mu_{ji}(x_{jk})}{\sum_{l=1}^{c} \prod_{j \in F^{(l)}} \mu_{jl}(x_{jk})}   (7)

in which the single-variable membership degree \mu_{ji}(x_{jk}) is obtained from the multivariate membership degree u_{ki} by Gaussian projection, as shown in equation (8):

\mu_{ji}(x_{jk}) = \exp\left(-\frac{(x_{jk} - a_{ji})^2}{2 b_{ji}^2}\right)   (8)

where a_{ji} and b_{ji} denote the center and the standard deviation of the Gaussian function, respectively, computed as shown in equation (9):

a_{ji} = \frac{\sum_{k=1}^{n} u_{ki} x_{jk}}{\sum_{k=1}^{n} u_{ki}}, \quad b_{ji} = \sqrt{\frac{\sum_{k=1}^{n} u_{ki} (x_{jk} - a_{ji})^2}{\sum_{k=1}^{n} u_{ki}}}   (9)

where u_{ki} represents the degree to which sample x_k belongs to fuzzy subset A^{(i)}. The output \hat{y}_k of the TS fuzzy model is computed by equation (10):

\hat{y}_k = \sum_{i=1}^{c} w_k^{(i)} \hat{y}_k^{(i)}   (10)
(2) Missing value filling
Taking each incomplete attribute column in turn as the output and all remaining attributes as inputs, several TS fuzzy models are established. The missing values are treated as variables, and an alternate learning strategy is adopted for model solving and missing value filling, with the following steps:
Step 1: pre-fill the missing values with the attribute means to obtain a reconstructed complete data set;
Step 2: update the significant input features and consequent parameters of the models based on the reconstructed complete data set;
Step 3: compute the model outputs from the updated significant input features and consequent parameters, and update the missing values with the model outputs;
Step 4: if the filling error computed from the observed values and the corresponding model outputs is greater than or equal to the given threshold, return to Step 2; otherwise, fill each missing value with the output of its corresponding model and output the completed data set.
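The alternate-learning loop of Steps 1–4 can be sketched as follows, with ordinary least squares standing in for the TS fuzzy model and a change-in-fill threshold standing in for the filling-error criterion; the function name `alternate_fill` and these simplifications are assumptions for illustration only:

```python
import numpy as np

def alternate_fill(X, tol=1e-4, max_iter=100):
    """Alternate-learning fill sketch: mean pre-fill (Step 1), refit one
    model per incomplete column (Step 2), replace missing entries with
    model outputs (Step 3), and repeat until the fill stabilizes (Step 4)."""
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X[mask] = np.take(col_mean, np.where(mask)[1])    # Step 1: mean pre-fill
    for _ in range(max_iter):
        X_prev = X.copy()
        for j in np.where(mask.any(axis=0))[0]:       # each incomplete column
            others = [k for k in range(X.shape[1]) if k != j]
            A = np.column_stack([np.ones(len(X)), X[:, others]])
            beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)  # Step 2: refit
            pred = A @ beta
            X[mask[:, j], j] = pred[mask[:, j]]       # Step 3: update fills
        if np.abs(X - X_prev).max() < tol:            # Step 4: stop criterion
            break
    return X
```

Because the refitted model changes once the fills change, the model update and the fill update alternate until neither moves, which is the fixed point the method iterates toward.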
CN202010085968.4A 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning Withdrawn CN111340069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010085968.4A CN111340069A (en) 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010085968.4A CN111340069A (en) 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning

Publications (1)

Publication Number Publication Date
CN111340069A true CN111340069A (en) 2020-06-26

Family

ID=71185286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010085968.4A Withdrawn CN111340069A (en) 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning

Country Status (1)

Country Link
CN (1) CN111340069A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835884A (en) * 2021-02-19 2021-05-25 大连海事大学 Missing data filling method and system in marine fishing ground fishing situation forecasting system
CN112835884B (en) * 2021-02-19 2023-05-16 大连海事大学 Missing data filling method and system in ocean fishing ground fish condition forecasting system
CN113240213A (en) * 2021-07-09 2021-08-10 平安科技(深圳)有限公司 Method, device and equipment for selecting people based on neural network and tree model
CN115423005A (en) * 2022-08-22 2022-12-02 江苏大学 Big data reconstruction method and device for combine harvester
CN115423005B (en) * 2022-08-22 2023-10-31 江苏大学 Big data reconstruction method and device for combine harvester
CN116861042A (en) * 2023-09-05 2023-10-10 国家超级计算天津中心 Information verification method, device, equipment and medium based on material database
CN116861042B (en) * 2023-09-05 2023-12-05 国家超级计算天津中心 Information verification method, device, equipment and medium based on material database

Similar Documents

Publication Publication Date Title
CN111340069A (en) Incomplete data fine modeling and missing value filling method based on alternate learning
CN107992976B (en) Hot topic early development trend prediction system and prediction method
Zhan et al. A fast kriging-assisted evolutionary algorithm based on incremental learning
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN110232434A (en) A kind of neural network framework appraisal procedure based on attributed graph optimization
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN111597760B (en) Method for obtaining gas path parameter deviation value under small sample condition
CN113326731A (en) Cross-domain pedestrian re-identification algorithm based on momentum network guidance
CN108171012B (en) Gene classification method and device
CN111814907A (en) Quantum generation countermeasure network algorithm based on condition constraint
Song et al. Nonnegative Latent Factor Analysis-Incorporated and Feature-Weighted Fuzzy Double $ c $-Means Clustering for Incomplete Data
CN115730635A (en) Electric vehicle load prediction method
CN107240028B (en) Overlapped community detection method in complex network of Fedora system component
CN111832817A (en) Small world echo state network time sequence prediction method based on MCP penalty function
Lu et al. Robust and scalable Gaussian process regression and its applications
CN111353525A (en) Modeling and missing value filling method for unbalanced incomplete data set
CN109934344A (en) A kind of multiple target Estimation of Distribution Algorithm of improved rule-based model
CN112270047B (en) Urban vehicle path optimization method based on data-driven group intelligent calculation
CN113610350B (en) Complex working condition fault diagnosis method, equipment, storage medium and device
CN112465253B (en) Method and device for predicting links in urban road network
CN114529096A (en) Social network link prediction method and system based on ternary closure graph embedding
Hu et al. Pwsnas: powering weight sharing nas with general search space shrinking framework
Ortelli et al. Faster estimation of discrete choice models via dataset reduction
Wu et al. A training-free neural architecture search algorithm based on search economics
Tian et al. Microbial Network Recovery by Compositional Graphical Lasso

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200626