CN111340069A - Incomplete data fine modeling and missing value filling method based on alternate learning - Google Patents


Info

Publication number
CN111340069A
CN111340069A (application CN202010085968.4A)
Authority
CN
China
Prior art keywords
model
filling
input
features
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010085968.4A
Other languages
Chinese (zh)
Inventor
刘辉 (Liu Hui)
张立勇 (Zhang Liyong)
宋橘超 (Song Juchao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202010085968.4A priority Critical patent/CN111340069A/en
Publication of CN111340069A publication Critical patent/CN111340069A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 18/00 Pattern recognition > G06F 18/10 Pre-processing; Data cleansing
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/23 Clustering techniques
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 7/00 Computing arrangements based on specific mathematical models > G06N 7/02 Computing arrangements based on specific mathematical models using fuzzy logic


Abstract

The invention discloses an incomplete data fine modeling and missing value filling method based on alternate learning, belonging to the field of data mining. First, the input space is divided into several subsets by a fuzzy clustering algorithm, and a specific local linear regression model is established for each subset. A global model is then constructed as the weighted sum of the local linear regression models, which improves the fineness of the model, and a stepwise regression algorithm selects the significant input features of each subset to refine the model further. Treating the missing values as variables, a model-solving strategy is proposed that alternately learns the selection of significant input features, the model parameters, and the filling of the missing values, so that filling is completed together with modeling. The invention improves the fineness of the model established by traditional regression filling, effectively solves the problem of incomplete model input data when modeling incomplete data, and achieves good filling accuracy.

Description

Incomplete data fine modeling and missing value filling method based on alternate learning
Technical Field
The invention belongs to the field of data mining, and relates to a method for performing fine modeling and missing value filling on incomplete data based on alternate learning.
Background
Data mining technology can discover, through algorithms, information hidden in large amounts of data, thereby providing correct guidance for decision making. However, in many fields of real life, missing data is an almost unavoidable problem. High-quality data is a prerequisite for high-quality data mining. Because many data mining algorithms cannot handle incomplete data sets directly, missing value filling has become a research hotspot of incomplete data analysis. Researchers have proposed various missing value filling methods, such as mean filling, hot-deck filling, clustering-based filling, and regression filling.
The mean filling method (H.L. Shashirekha, A.H. Wani, Analysis of imputation algorithms for microarray gene expression data, in: 2015 International Conference on Applied and Theoretical Computing and Communication Technology, Davangere, India, 2015) replaces each missing value with the mean of the existing data in its incomplete attribute column. Although this method fills missing values quickly, it reduces the diversity of the filling values, so the filling effect is poor.
Unlike the mean filling method, the hot-deck filling method (T. Srebotnjak, G. Carr, A. Sherbinin, C. Rickwood, A global Water Quality Index and hot-deck imputation of missing data, Ecological Indicators, 17 (2012) 108-) fills each missing value with the corresponding value of the most similar complete sample. Because the correlation between samples is taken into account, this method generally achieves better filling performance than mean filling.
Similar to hot-deck filling, the cluster-based filling method (C.F. Tsai, M.L. Li, W.C. Lin, A class center based approach for missing value imputation, Knowledge-Based Systems, 151 (2018) 124-) first groups the samples into clusters and then fills the missing values in each incomplete sample using information from the samples within the same cluster.
Unlike the above methods, the regression filling method (C. Crambes, Y. Henchiri, Regression imputation in the functional linear model with missing values in the response, Journal of Statistical Planning and Inference, 201 (2018) 103-119) is a model-based filling method. Its main idea is to build a regression model for the incomplete data according to the dependency relationships between the attributes and then fill in the missing values based on the built regression model. Because the correlation between attributes is taken into account, this method generally outperforms the methods above. However, the filling result of regression filling is usually strongly influenced by the accuracy of the established regression model, so the modeling of incomplete data has attracted the interest of many researchers. How to handle incomplete model input data and how to properly describe the relationships between attributes are the two major issues facing incomplete data modeling.
Currently, a simple way to deal with the incompleteness of the model input data is to delete all incomplete samples containing missing attribute values and model only the complete part of the incomplete data set (F. Honghai, C. Guoshun, Y. Cheng, Y. Bingru, C. Yumei, A SVM regression based approach to filling in missing values, Lecture Notes in Computer Science, 3683 (2005) 581-587). This method suits cases with a low missing rate or few attributes, because when the deleted portion is too large, a large amount of useful information is discarded and the modeling effect deteriorates. Another more popular approach is to pre-fill the missing values before modeling and then model the reconstructed complete data set (H. Kim, G.H. Golub, H. Park, Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics, 21 (2005) 187-198). This approach retains the existing values of incomplete samples and thus makes fuller use of the information, but because the missing values are pre-filled, the quality of the pre-filled values directly affects the model accuracy.
Some researchers build different models for samples in different clusters to describe the relationships between attributes more reasonably. A filling method based on clustering and regression models divides the data set into clusters and establishes a specific least squares regression model in each cluster to predict the missing values (P. Keerin, W. Kurutach, T. Boongoen, An improvement of missing value imputation in DNA microarray data using cluster-based LLS method, in: International Symposium on Communications and Information Technologies, Surat Thani, Thailand, 2013, pp. 559-). Compared with the traditional regression filling method, this method achieves better filling performance. A filling method based on clustering and stacked denoising autoencoders first partitions the samples with the k-means clustering algorithm and then builds a stacked-denoising-autoencoder model in each cluster to fill the missing values (W.C. Ku, G.R. Jagadeesh, A. Prakash, T. Srikanthan, A clustering-based approach for data-driven imputation of missing traffic data, in: IEEE Forum on Integrated and Sustainable Transportation Systems, Beijing, China, 2016).
In recent years, researchers have applied the Takagi-Sugeno (TS) fuzzy model to the analysis and prediction of incomplete data and achieved good filling performance. A filling method based on incomplete-data fuzzy modeling first pre-fills the missing values with cluster centers, then models the reconstructed complete data set with a TS fuzzy model and predicts the missing values with the built model (X. Lai, X. Liu, L. Zhang, et al., Missing Value Imputation by Rule-Based Incomplete Data Fuzzy Modeling, in: IEEE International Conference on Communications (ICC), Shanghai, China, 2019). The main idea of the TS fuzzy model is to divide the input space into several subsets, establish a different linear regression equation on each subset, and connect the linear regression equations through membership degrees. The model consists of a series of "IF-THEN" fuzzy rules whose consequents are usually linear functions of the input variables. Given an incomplete data set X with sample size n and s attributes, let x_k = [x_1k, x_2k, …, x_sk]^T (1 ≤ k ≤ n) be the kth sample and x_jk (1 ≤ j ≤ s) the jth attribute value of x_k. When the jth attribute is taken as the model output and the remaining attributes as the model input, the ith fuzzy rule has the form:
R^(i): IF x_1k is A_1^(i) and … and x_sk is A_s^(i), THEN ŷ_k^(i) = p_0^(i) + Σ_{q≠j} p_q^(i) x_qk, 1 ≤ i ≤ c, (1)

where c is the number of fuzzy rules; A_q^(i) denotes the subset to which the qth input feature in the antecedent of the ith fuzzy rule belongs; p_q^(i) are the consequent parameters of the ith fuzzy rule; and ŷ_k^(i) represents the output of the ith fuzzy rule. The final output of the model is shown in equation (2):

ŷ_k = Σ_{i=1}^{c} w^(i)(x_k) ŷ_k^(i) / Σ_{i=1}^{c} w^(i)(x_k), (2)

where w^(i)(x_k) is the contribution weight of the ith fuzzy rule, obtained from equation (3):

w^(i)(x_k) = ∧_{q} μ_{A_q^(i)}(x_qk), (3)

where the operator ∧ denotes the minimum operation, and μ_{A_q^(i)}(x_qk) denotes the degree to which x_qk belongs to the subset A_q^(i). Compared with the traditional regression model, the TS fuzzy model accounts for the differences among the regression relations in different subsets and is therefore more suitable for describing the relationships between attributes.
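As an illustration only (not part of the patent), the inference of equations (1)-(3) can be sketched in Python; the names `memberships` and `consequents` are hypothetical containers for the antecedent membership functions and consequent parameters of each rule:

```python
import numpy as np

def ts_output(x, memberships, consequents):
    """Output of a TS fuzzy model for one input vector x.

    memberships: list of c callables; memberships[i](x) returns the
                 per-feature membership degrees of x in rule i's subsets.
    consequents: (c, 1 + len(x)) array of parameters [p0, p1, ..., ps].
    """
    c = len(memberships)
    # firing strength of each rule: min over the antecedent memberships, eq. (3)
    w = np.array([np.min(memberships[i](x)) for i in range(c)])
    # linear consequent of each rule: p0 + p1*x1 + ... + ps*xs, eq. (1)
    y_rule = consequents @ np.concatenate(([1.0], x))
    # weighted average of the rule outputs, eq. (2)
    return float((w @ y_rule) / w.sum())
```

For example, with two rules whose consequent parameter rows are identical, the output equals that shared linear function regardless of the firing strengths.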
Disclosure of Invention
The invention provides a method for fine modeling and missing value filling of incomplete data based on alternate learning. The method divides the input space based on the TS fuzzy model, selects significant input features for each subset to improve the fineness of the model, and proposes an alternate learning strategy to solve the fine model and fill the missing values. The alternate learning strategy effectively weakens the influence of the quality of the pre-filled values on the selection of the input features and on the model parameters, yielding better filling results. Compared with the traditional regression filling method, the proposed filling method effectively improves the filling accuracy.
The invention divides the input space into several subsets and establishes a specific linear regression equation for each subset, then uses a stepwise regression algorithm to select the significant input features of each linear regression equation to improve the fineness of the model. On this basis, the missing values are treated as variables, and the selection of significant input features, the model's consequent parameters, and the filling of the missing values are learned alternately until the iteration converges, which solves the problem of incomplete model input data. When the iteration converges, the filling is completed along with the modeling.
The technical scheme of the invention is as follows:
a method for performing fine modeling and missing value filling on incomplete data based on alternate learning specifically comprises the following steps:
(1) modeling
The input space is first partitioned using the fuzzy c-means clustering algorithm based on a partial distance strategy (FCM-PDS). Given an incomplete data set with sample size n and s attributes, the FCM-PDS algorithm divides the input space into c subsets by minimizing the objective function in equation (4):
J = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ki)^m (d_ki)^2, (4)

where u_ki represents the degree to which sample x_k belongs to subset A^(i); m is the weighting exponent of the membership degrees, m ∈ (1, ∞); and d_ki denotes the partial distance between x_k and the cluster center v_i = [v_1i, v_2i, …, v_si] (1 ≤ i ≤ c), calculated as in equation (5):

(d_ki)^2 = (s / Σ_{j=1}^{s} I_jk) Σ_{j=1}^{s} (x_jk − v_ji)^2 I_jk, (5)

where v_ji denotes the jth attribute value of v_i; I_jk marks whether x_jk is missing, with I_jk = 0 if x_jk ∈ X_M and I_jk = 1 if x_jk ∈ X_P; and X_M and X_P are the set of all missing values and the set of all existing values, respectively.
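A minimal sketch of fuzzy c-means with the partial distance of equation (5) may look as follows; it assumes the standard FCM membership and center updates (the function name and interface are illustrative, not from the patent):

```python
import numpy as np

def fcm_pds(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means with a partial distance strategy (FCM-PDS sketch).

    X: (n, s) array with np.nan marking missing values.
    Returns the membership matrix U (n, c) and cluster centers V (c, s).
    """
    rng = np.random.default_rng(seed)
    n, s = X.shape
    M = ~np.isnan(X)
    I = M.astype(float)            # I[k, j] = 1 if x_jk exists, else 0
    Xz = np.where(M, X, 0.0)       # zero-fill so sums skip missing entries
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # centers computed from existing values only
        V = (W.T @ Xz) / np.maximum(W.T @ I, 1e-12)
        # squared partial distances, scaled by s / (number of existing attributes)
        diff2 = (Xz[:, None, :] - V[None, :, :]) ** 2 * I[:, None, :]
        d2 = diff2.sum(axis=2) * (s / np.maximum(I.sum(axis=1), 1.0))[:, None]
        d2 = np.maximum(d2, 1e-12)
        # standard FCM membership update
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```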
Then, a stepwise regression algorithm selects the significant input features of each fuzzy rule. The stepwise regression algorithm introduces, one by one in order of importance, the features that significantly affect the output into the regression model, and each time a new feature is introduced, the features already selected into the regression model are re-tested for significance. If an existing feature in the regression model becomes insignificant due to the introduction of a new feature, the least significant feature is deleted. The algorithm terminates when no new feature can be selected into the regression model and no insignificant feature can be removed from it.
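The forward-backward stepwise loop just described can be sketched as follows; as a simplifying assumption, fixed partial-F thresholds stand in for formal significance tests (classical implementations compare F statistics against critical values or p-values), and the names are illustrative:

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def stepwise_select(X, y, f_in=4.0, f_out=3.9, max_rounds=100):
    """Forward-backward stepwise feature selection with partial-F thresholds."""
    n, s = X.shape
    selected = []
    for _ in range(max_rounds):
        changed = False
        # forward step: add the most significant remaining feature, if any
        remaining = [j for j in range(s) if j not in selected]
        if remaining:
            rss0 = _rss(X[:, selected], y)
            dof = n - len(selected) - 2
            best_f, best_j = -np.inf, None
            for j in remaining:
                rss1 = _rss(X[:, selected + [j]], y)
                f = (rss0 - rss1) / max(rss1 / dof, 1e-12)
                if f > best_f:
                    best_f, best_j = f, j
            if best_f > f_in:
                selected.append(best_j)
                changed = True
        # backward step: drop features that have become insignificant
        if len(selected) > 1:
            for j in list(selected):
                rest = [k for k in selected if k != j]
                rss_full = _rss(X[:, selected], y)
                dof = n - len(selected) - 1
                f = (_rss(X[:, rest], y) - rss_full) / max(rss_full / dof, 1e-12)
                if f < f_out:
                    selected.remove(j)
                    changed = True
        if not changed:
            break
    return selected
```

Setting the removal threshold slightly below the entry threshold (`f_out < f_in`) is the usual guard against a feature being added and dropped in the same round.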
After the input space is divided and the significant input features of each fuzzy rule are selected, let T^(i) = {x_1, x_2, …, x_{m_i}} be the significant input feature set of the ith fuzzy rule and m_i the number of selected features, where each significant input feature x_j = [x_j1, x_j2, …, x_jn]^T (1 ≤ j ≤ m_i). The ith fuzzy rule is simplified from equation (1) to equation (6):

R^(i): IF x_1k is A_1^(i) and … and x_{m_i k} is A_{m_i}^(i), THEN ŷ_k^(i) = p_0^(i) + Σ_{q=1}^{m_i} p_q^(i) x_qk, (6)

where c is the number of fuzzy rules; ŷ_k^(i) represents the output of the ith fuzzy rule; x̃_k = [x_1k, x_2k, …, x_{m_i k}]^T is the simplified kth sample; A_{m_i}^(i) is the subset to which the m_i-th input feature in the antecedent of the simplified ith fuzzy rule belongs; and p_q^(i) are the consequent parameters of the simplified ith fuzzy rule. Moreover, the contribution weight of the ith fuzzy rule changes from w^(i)(x_k) to w^(i)(x̃_k), calculated as in equation (7):

w^(i)(x̃_k) = ∧_{q=1}^{m_i} μ_{A_q^(i)}(x_qk). (7)
In equation (7), the membership degree of a single variable, μ_{A_j^(i)}(x_jk), is obtained from the multivariate membership degrees u_ki through Gaussian projection, as shown in equation (8):

μ_{A_j^(i)}(x_jk) = exp(−(x_jk − a_ji)^2 / (2 b_ji^2)), (8)

where a_ji and b_ji denote the center and the standard deviation of the Gaussian function, respectively, calculated as in equation (9):

a_ji = Σ_{k=1}^{n} u_ki x_jk / Σ_{k=1}^{n} u_ki,  b_ji = ( Σ_{k=1}^{n} u_ki (x_jk − a_ji)^2 / Σ_{k=1}^{n} u_ki )^{1/2}, (9)

where u_ki represents the degree to which sample x_k belongs to fuzzy subset A^(i). The output ŷ_k of the TS fuzzy model can then be calculated from equation (10):

ŷ_k = Σ_{i=1}^{c} w^(i)(x̃_k) ŷ_k^(i) / Σ_{i=1}^{c} w^(i)(x̃_k). (10)
(2) missing value filling
Because a single TS fuzzy model can only fill the missing values of one incomplete attribute column, each incomplete attribute column is taken as the output in turn, with all remaining attributes as input, to establish several TS fuzzy models. To address the incompleteness of the model input data, the missing values are treated as variables, and an alternate learning strategy is proposed for model solving and missing value filling. The alternate learning strategy consists of the following steps:
step 1: the missing values are mean pre-padded to obtain a reconstructed complete data set.
Step 2: significant input features and back-piece parameters of the model are updated based on the reconstructed complete data set.
And step 3: and obtaining model output according to the significant input features and the back-part parameters of the updated model and updating the missing value by using the model output.
And 4, step 4: if the filling error obtained by the existing value and the corresponding model output is larger than or equal to the given threshold value, returning to the step 2; otherwise, the model corresponding to the missing value is used to output the filled missing value and the filled data set is output.
The invention has the following beneficial effects. First, on the basis of regression modeling, the input space is divided, a linear regression equation is established for each subset, and significant input features are selected for each equation; these two steps improve the fineness of the model and enhance the filling performance. Second, to address the incompleteness of the model input, the missing values are treated as variables, and an alternate learning strategy is proposed in which the selection of significant input features, the model's consequent parameters, and the filling of the missing values are learned alternately until the iteration converges. During alternate learning, the model structure and parameters become gradually more accurate as the filling accuracy improves, and in turn the more accurate structure and parameters make the filled values more reasonable.
Drawings
Fig. 1 is an overall workflow diagram of the present invention.
FIG. 2 is a workflow diagram of the alternate learning strategy of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention is provided in conjunction with the accompanying drawings.
Fig. 1 is the overall workflow of the present invention. In the figure, the first row of the incomplete data set, 1, 2, …, s, denotes the attribute indices; black marks denote missing values and white marks denote existing values. The invention first divides the incomplete data set into several subsets using the FCM-PDS algorithm and uses these subsets as input for the subsequent feature selection. The incomplete data set is then pre-filled with attribute means to obtain a reconstructed complete data set, and a stepwise regression algorithm selects the features of each subset based on the reconstructed complete data set to obtain the significant input features of the model. The consequent parameters of the model are then computed by least squares, and the model output is computed from the consequent parameters and the significant input features. Finally, treating the missing values as variables, the reconstructed complete data set is updated based on the model output, after which the significant input features, consequent parameters, and model output are updated for the next iteration. If the change between two adjacent iterations of the reconstruction error computed from the existing values and the corresponding model outputs is smaller than a specified threshold, the iteration has converged, the filling of the missing values is completed along with the modeling, and the filled data set is output. Otherwise, the reconstructed complete data set is updated and the next iteration begins.
Examples
The details of the invention are illustrated with the Blood data set from the UCI machine learning repository. Blood is a complete data set with a sample size of 748 and 4 attributes; part of its data is deleted manually to construct an incomplete data set.
Assume the space of the 748 samples is divided into 2 subsets. Taking the first attribute as the output and all remaining attributes as input, the two fuzzy rules of the established TS fuzzy model are shown in equation (11), and the built model is denoted TS-1:

R^(1): IF x_2k is A_2^(1) and x_3k is A_3^(1) and x_4k is A_4^(1), THEN ŷ_k^(1) = p_0^(1) + p_2^(1) x_2k + p_3^(1) x_3k + p_4^(1) x_4k;
R^(2): IF x_2k is A_2^(2) and x_3k is A_3^(2) and x_4k is A_4^(2), THEN ŷ_k^(2) = p_0^(2) + p_2^(2) x_2k + p_3^(2) x_3k + p_4^(2) x_4k. (11)

Similarly, the 2nd, 3rd, and 4th attributes are taken as the output in turn, with all remaining attributes as input, and models are built based on the TS fuzzy model; the model built for the jth attribute is denoted TS-j (1 ≤ j ≤ 4). The missing values are then pre-filled with attribute means to obtain a reconstructed complete data set, and significant input features are selected for each fuzzy rule. Suppose the significant input feature set of R^(1) in equation (11) is T^(1) = {x_2, …, x_{m_1}} and that of R^(2) is T^(2) = {x_2, …, x_{m_2}}; then equation (11) is simplified to equation (12):

R^(1): IF x_2k is A_2^(1) and … and x_{m_1 k} is A_{m_1}^(1), THEN ŷ_k^(1) = p_0^(1) + p_2^(1) x_2k + … + p_{m_1}^(1) x_{m_1 k};
R^(2): IF x_2k is A_2^(2) and … and x_{m_2 k} is A_{m_2}^(2), THEN ŷ_k^(2) = p_0^(2) + p_2^(2) x_2k + … + p_{m_2}^(2) x_{m_2 k}, (12)

and the output of the model TS-1 can be represented by equation (13):

ŷ_k = Σ_{i=1}^{2} w^(i)(x̃_k) ŷ_k^(i) / Σ_{i=1}^{2} w^(i)(x̃_k). (13)

Let P = [P^(1), P^(2), …, P^(c)]^T, where P^(i) is the consequent parameter vector of fuzzy rule R^(i); then the consequent parameters of the model TS-j can be obtained from equation (14):

P = (B^T B)^{-1} B^T y, (14)

where y = [x_j1, x_j2, …, x_jn]^T (1 ≤ j ≤ 4) is the desired output vector; B = [B^(1), B^(2), …, B^(c)], and B^(i) (1 ≤ i ≤ 2) is given by equation (15), in which each row k of B^(i) is the simplified input vector [1, x_2k, …, x_{m_i k}] scaled by the normalized contribution weight of rule i at sample k:

B^(i) = [ w̄^(i)(x̃_k) · [1, x_2k, …, x_{m_i k}] ]_{k=1,…,n}, with w̄^(i)(x̃_k) = w^(i)(x̃_k) / Σ_{l=1}^{c} w^(l)(x̃_k). (15)

After the consequent parameters of the model TS-j are obtained, the output corresponding to TS-j can be obtained from equation (16):

ŷ_j = B P, (16)

where ŷ_j = [ŷ_j1, ŷ_j2, …, ŷ_jn]^T. Combining the outputs corresponding to the s models TS-j gives the model output.
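The least-squares solution of equations (14)-(16) can be sketched as follows; `np.linalg.lstsq` replaces the explicit normal-equation inverse of equation (14) for numerical stability, and the interface is illustrative rather than the patent's:

```python
import numpy as np

def consequent_params(X_sel, y, W):
    """Consequent parameters of one TS model by least squares (cf. eq. (14)).

    X_sel: (n, m) significant inputs; y: (n,) desired output;
    W: (n, c) normalized rule contribution weights.
    """
    n, m = X_sel.shape
    c = W.shape[1]
    ones = np.column_stack([np.ones(n), X_sel])          # rows [1, x_k]
    # B stacks, per rule i, rows w_i(x_k) * [1, x_k]  (cf. eq. (15))
    B = np.hstack([W[:, [i]] * ones for i in range(c)])
    P, *_ = np.linalg.lstsq(B, y, rcond=None)            # stable form of eq. (14)
    return P.reshape(c, m + 1)

def model_output(X_sel, W, P):
    """Model output y_hat = B P (cf. eq. (16))."""
    n = len(X_sel)
    ones = np.column_stack([np.ones(n), X_sel])
    B = np.hstack([W[:, [i]] * ones for i in range(W.shape[1])])
    return B @ P.ravel()
```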
The invention treats the missing values as variables and designs an alternate learning strategy to weaken the influence of the quality of the pre-filled values on model accuracy; the implementation details of the strategy are shown in Fig. 2. In Fig. 2, X_P denotes the set of all existing values; X_M the set of all missing values; Ŷ_P the set of model outputs corresponding to the existing values; and Ŷ_M the set of model outputs corresponding to the missing values. First, the missing values are updated from Ŷ_M to adjust the reconstructed complete data set. Then, based on the stepwise regression algorithm, the significant input feature set of each fuzzy rule is adjusted from T^(i)(t−1) to T^(i)(t), where T^(i)(t−1) and T^(i)(t) denote the significant input feature set of fuzzy rule R^(i) in the last and the current iteration, respectively. Next, based on the least squares method, the consequent parameters of each fuzzy rule are adjusted from P^(i)(t−1) to P^(i)(t), where P^(i)(t−1) and P^(i)(t) denote the consequent parameters of R^(i) in the last and the current iteration, respectively. The output ŷ^(i) of R^(i) is then computed from its significant input feature set and consequent parameters, and the weighted sum of the rule outputs gives the output corresponding to TS-j. Finally, the outputs corresponding to the s models TS-j are combined into the model output, of which Ŷ_M is used to update the missing values and Ŷ_P is used to compute the reconstruction error f_e between the existing values and their corresponding model outputs. If Δf_e < ε, the iteration terminates and the filled data set is output; if Δf_e ≥ ε, the next iteration continues. Here ε denotes the threshold, and Δf_e = |f_e(t) − f_e(t−1)|, where f_e(t) and f_e(t−1) denote the reconstruction errors computed from the existing values and the corresponding model outputs in the current and the last iteration, respectively.
Comparative example
Three data sets are selected from the UCI machine learning repository to verify the filling performance of the proposed method; the data sets are described in Table 1. To compute the error between the estimated and the true values of the missing entries, all selected data sets are complete, and the experiment constructs incomplete data sets by deleting part of the data manually according to specified missing rates of 5%, 10%, 15%, 20%, 25%, and 30%.
Table 1 data set description
(Table 1 appears as an image in the original publication.)
The experiment compares six methods; all methods pre-fill the missing values with attribute means before modeling, and the sixth is the filling method proposed by the invention.
(1) Incomplete data is modeled with a linear regression model, with all features as input (REG).
(2) Incomplete data is modeled with a linear regression model, with significant features selected as input by stepwise regression (REG-SR).
(3) On the basis of REG-SR, the missing values are treated as variables, and the model structure, model parameters, and missing values are learned alternately until convergence (REG-SR-AL).
(4) Incomplete data is modeled with a TS fuzzy model, with all features as input (TS).
(5) Incomplete data is modeled with a TS fuzzy model, with significant features selected as input for each subset by stepwise regression (TS-SR).
(6) On the basis of TS-SR, the missing values are treated as variables, and the model structure, the model's consequent parameters, and the missing values are learned alternately until convergence (TS-SR-AL, the proposed method).
The root mean square error (RMSE) is used to evaluate the filling effect. The RMSE is the square root of the mean squared deviation between the filled values and the corresponding true values and reflects the filling accuracy well. It is computed as:

RMSE = ( (1/N) Σ_{t=1}^{N} (z_t − ẑ_t)^2 )^{1/2},

where N is the number of missing values in the data set, z_t is the true value at the tth missing position, and ẑ_t is the filled value at the tth missing position. Table 2 shows the RMSE results of the six filling methods, where the best result is bolded and underlined and the second-best result is bolded.
TABLE 2 RMSE indices of six filling methods
(Table 2 appears as an image in the original publication.)
Comparing TS with REG, TS-SR with REG-SR, and TS-SR-AL with REG-SR-AL in Table 2 shows that establishing a linear regression model for each subspace after dividing the input space yields smaller filling errors than establishing a single linear model directly. Comparing REG-SR with REG and TS-SR with TS shows that performing feature selection on the inputs of the linear regression models during modeling also reduces the filling errors. Comparing TS-SR-AL with TS-SR and REG-SR-AL with REG-SR shows that the alternate learning strategy markedly improves the filling accuracy.
In conclusion, the proposed TS-SR-AL achieves the best results, and its filling accuracy is better than that of all compared methods.

Claims (1)

1. A method for performing fine modeling and missing value filling on incomplete data based on alternate learning is characterized by comprising the following steps:
(1) modeling
Firstly, the input space is divided using a fuzzy c-means clustering algorithm based on a partial distance strategy; given an incomplete data set with sample size n and s attributes, the algorithm divides the input space into c subsets by minimizing the objective function in equation (4):
J = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ki)^m (d_ki)^2, (4)

wherein u_ki represents the degree to which sample x_k belongs to subset A^(i); m is the weighting exponent of the membership degrees, m ∈ (1, ∞); d_ki denotes the partial distance between x_k and the cluster center v_i = [v_1i, v_2i, …, v_si], 1 ≤ i ≤ c, and is calculated as in equation (5):

(d_ki)^2 = (s / Σ_{j=1}^{s} I_jk) Σ_{j=1}^{s} (x_jk − v_ji)^2 I_jk, (5)

wherein v_ji denotes the jth attribute value of v_i; I_jk marks whether x_jk is missing, with I_jk = 0 if x_jk ∈ X_M and I_jk = 1 if x_jk ∈ X_P; X_M and X_P are respectively the set of all missing values and the set of all existing values;
then, a stepwise regression algorithm is used for selecting the significant input features of each fuzzy rule: the step-by-step regression algorithm introduces the characteristics which have obvious influence on the output into the regression model one by one according to the importance, and the characteristics which are selected into the regression model are subjected to significance testing again when a new characteristic is introduced; if the existing features in the regression model become insignificant due to the introduction of new features, deleting the least significant features; terminating the algorithm when neither new features can be selected into the regression model nor insignificant features can be removed from the regression model;
After the input space has been partitioned and the significant input features of each fuzzy rule selected, let the significant input feature set of the ith fuzzy rule be F^{(i)} = \{f_1^{(i)}, f_2^{(i)}, \ldots, f_{m_i}^{(i)}\}, where m_i is the number of selected input features. With these significant input features, the ith fuzzy rule is simplified to equation (6):

R^{(i)}: \text{IF } x_k^{(i)} \text{ belongs to } A^{(i)} \text{ THEN } \hat{y}_k^{(i)} = \theta_0^{(i)} + \sum_{j=1}^{m_i} \theta_j^{(i)} x_{f_j^{(i)} k}, \quad i = 1, 2, \ldots, c   (6)

where c is the number of fuzzy rules; \hat{y}_k^{(i)} denotes the output of the ith fuzzy rule; x_k^{(i)} is the simplified kth sample; A^{(i)} is the subset to which the m_i-dimensional input features in the antecedent of the simplified ith fuzzy rule belong; and \theta^{(i)} = [\theta_0^{(i)}, \theta_1^{(i)}, \ldots, \theta_{m_i}^{(i)}] are the consequent parameters of the simplified ith fuzzy rule. The contribution weight w_k^{(i)} of the ith fuzzy rule is computed as shown in equation (7):

w_k^{(i)} = \frac{\prod_{j \in F^{(i)}} \mu_{ji}(x_{jk})}{\sum_{l=1}^{c} \prod_{j \in F^{(l)}} \mu_{jl}(x_{jk})}   (7)

in which the single-variable membership degree \mu_{ji}(x_{jk}) is obtained from the multivariate membership degree u_{ki} by Gaussian projection, as shown in equation (8):

\mu_{ji}(x_{jk}) = \exp\left(-\frac{(x_{jk} - a_{ji})^2}{2 b_{ji}^2}\right)   (8)

where a_{ji} and b_{ji} denote the center and the standard deviation of the Gaussian function, respectively, computed as shown in equation (9):

a_{ji} = \frac{\sum_{k=1}^{n} u_{ki} x_{jk}}{\sum_{k=1}^{n} u_{ki}}, \quad b_{ji} = \sqrt{\frac{\sum_{k=1}^{n} u_{ki} (x_{jk} - a_{ji})^2}{\sum_{k=1}^{n} u_{ki}}}   (9)

where u_{ki} represents the degree to which sample x_k belongs to fuzzy subset A^{(i)}. The output \hat{y}_k of the TS fuzzy model is computed by equation (10):

\hat{y}_k = \sum_{i=1}^{c} w_k^{(i)} \hat{y}_k^{(i)}   (10)
(2) Missing value filling
Taking each incomplete attribute column in turn as the output and all remaining attributes as inputs, several TS fuzzy models are established. The missing values are treated as variables, and an alternate learning strategy is adopted for model solving and missing value filling, with the following steps:
Step 1: pre-fill the missing values with the attribute means to obtain a reconstructed complete data set;
Step 2: update the significant input features and consequent parameters of the models based on the reconstructed complete data set;
Step 3: compute the model outputs from the updated significant input features and consequent parameters, and update the missing values with the model outputs;
Step 4: if the filling error computed from the observed values and the corresponding model outputs is greater than or equal to the given threshold, return to Step 2; otherwise, fill each missing value with the output of its corresponding model and output the completed data set.
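The alternate-learning loop of Steps 1–4 can be sketched as follows, with ordinary least squares standing in for the TS fuzzy model and a change-in-fill threshold standing in for the filling-error criterion; the function name `alternate_fill` and these simplifications are assumptions for illustration only:

```python
import numpy as np

def alternate_fill(X, tol=1e-4, max_iter=100):
    """Alternate-learning fill sketch: mean pre-fill (Step 1), refit one
    model per incomplete column (Step 2), replace missing entries with
    model outputs (Step 3), and repeat until the fill stabilizes (Step 4)."""
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X[mask] = np.take(col_mean, np.where(mask)[1])    # Step 1: mean pre-fill
    for _ in range(max_iter):
        X_prev = X.copy()
        for j in np.where(mask.any(axis=0))[0]:       # each incomplete column
            others = [k for k in range(X.shape[1]) if k != j]
            A = np.column_stack([np.ones(len(X)), X[:, others]])
            beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)  # Step 2: refit
            pred = A @ beta
            X[mask[:, j], j] = pred[mask[:, j]]       # Step 3: update fills
        if np.abs(X - X_prev).max() < tol:            # Step 4: stop criterion
            break
    return X
```

Because the refitted model changes once the fills change, the model update and the fill update alternate until neither moves, which is the fixed point the method iterates toward.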
CN202010085968.4A 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning Withdrawn CN111340069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010085968.4A CN111340069A (en) 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010085968.4A CN111340069A (en) 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning

Publications (1)

Publication Number Publication Date
CN111340069A true CN111340069A (en) 2020-06-26

Family

ID=71185286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010085968.4A Withdrawn CN111340069A (en) 2020-02-11 2020-02-11 Incomplete data fine modeling and missing value filling method based on alternate learning

Country Status (1)

Country Link
CN (1) CN111340069A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835884A (en) * 2021-02-19 2021-05-25 大连海事大学 Missing data filling method and system in marine fishing ground fishing situation forecasting system
CN112835884B (en) * 2021-02-19 2023-05-16 大连海事大学 Missing data filling method and system in ocean fishing ground fish condition forecasting system
CN113240213A (en) * 2021-07-09 2021-08-10 平安科技(深圳)有限公司 Method, device and equipment for selecting people based on neural network and tree model
CN115423005A (en) * 2022-08-22 2022-12-02 江苏大学 Big data reconstruction method and device for combine harvester
CN115423005B (en) * 2022-08-22 2023-10-31 江苏大学 Big data reconstruction method and device for combine harvester
CN116861042A (en) * 2023-09-05 2023-10-10 国家超级计算天津中心 Information verification method, device, equipment and medium based on material database
CN116861042B (en) * 2023-09-05 2023-12-05 国家超级计算天津中心 Information verification method, device, equipment and medium based on material database

Similar Documents

Publication Publication Date Title
CN111340069A (en) Incomplete data fine modeling and missing value filling method based on alternate learning
CN107992976B (en) Hot topic early development trend prediction system and prediction method
Zhan et al. A fast kriging-assisted evolutionary algorithm based on incremental learning
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN110232434A (en) A kind of neural network framework appraisal procedure based on attributed graph optimization
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN111597760B (en) Method for obtaining gas path parameter deviation value under small sample condition
CN113326731A (en) Cross-domain pedestrian re-identification algorithm based on momentum network guidance
CN108171012B (en) Gene classification method and device
CN111814907A (en) Quantum generation countermeasure network algorithm based on condition constraint
Song et al. Nonnegative Latent Factor Analysis-Incorporated and Feature-Weighted Fuzzy Double $ c $-Means Clustering for Incomplete Data
CN115730635A (en) Electric vehicle load prediction method
CN107240028B (en) Overlapped community detection method in complex network of Fedora system component
CN111832817A (en) Small world echo state network time sequence prediction method based on MCP penalty function
Lu et al. Robust and scalable Gaussian process regression and its applications
CN111353525A (en) Modeling and missing value filling method for unbalanced incomplete data set
CN109934344A (en) A kind of multiple target Estimation of Distribution Algorithm of improved rule-based model
CN112270047B (en) Urban vehicle path optimization method based on data-driven group intelligent calculation
CN113610350B (en) Complex working condition fault diagnosis method, equipment, storage medium and device
CN112465253B (en) Method and device for predicting links in urban road network
CN114529096A (en) Social network link prediction method and system based on ternary closure graph embedding
Hu et al. Pwsnas: powering weight sharing nas with general search space shrinking framework
Ortelli et al. Faster estimation of discrete choice models via dataset reduction
Wu et al. A training-free neural architecture search algorithm based on search economics
Tian et al. Microbial Network Recovery by Compositional Graphical Lasso

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200626