CN111353525A - Modeling and missing value filling method for unbalanced incomplete data set - Google Patents


Info

Publication number
CN111353525A
Authority
CN
China
Prior art keywords
data set
sample
formula
filling
model
Prior art date
Legal status
Withdrawn
Application number
CN202010085969.9A
Other languages
Chinese (zh)
Inventor
刘辉
张立勇
陆艺丹
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology
Priority: CN202010085969.9A
Publication: CN111353525A
Legal status: Withdrawn


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 — Matching criteria, e.g. proximity measures
    • G06F 18/23 — Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a modeling and missing value filling method for an unbalanced incomplete data set, belonging to the technical field of data mining. The invention comprises a model-construction part and a filling-scheme part. In the model-construction part, to address the imbalance of the data, a distance density algorithm is designed and applied to the antecedent identification of TS modeling; in the filling-scheme part, to address the incompleteness of the data, the missing values are treated as variables that participate in an iterative-learning filling scheme for conclusion-parameter identification: during filling, the conclusion parameters are computed from the filled data set, the padding values are then updated from the adjusted conclusion parameters, and filling is complete when the iteration converges. The invention reduces the influence of data-set imbalance on TS modeling, makes full use of the data information in the incomplete data set, and achieves ideal filling precision on unbalanced incomplete data sets.

Description

Modeling and missing value filling method for unbalanced incomplete data set
Technical Field
The invention belongs to the technical field of data mining, and relates to a modeling and missing value filling method for an unbalanced incomplete data set.
Background
Missing data and data-set imbalance are two unavoidable problems in the field of data mining. Missing data refers to data values or attributes lost to environmental and other factors when a data set is collected or stored; data-set imbalance means that the class distribution of the data set is uneven, with the numbers of samples in different classes differing greatly. Both problems are widespread in data analysis and mining, so research on such data sets has received increasing attention.
The imbalance of data sets creates difficulties for data mining. When fuzzy partitioning is applied to unbalanced data sets, the "uniform effect" (Zhou K., Yang S. Exploring the uniform effect of FCM clustering: A data distribution perspective [J]. Knowledge-Based Systems, 2016, 96:76-83) easily arises: samples of the majority classes are assigned to the minority classes, so that the clusters in the result contain approximately the same number of samples. To counter this phenomenon, researchers have proposed fuzzy partitioning methods based on undersampling data-preprocessing models, kernel-function-based clustering algorithms, multi-prototype representations, and the like.
Missing data is likewise an inevitable problem in the field of data mining. Directly discarding the incomplete samples and using only the remaining complete samples for analysis biases the results through insufficient data. In contrast, studying the existing data to obtain reasonable padding values for the missing values gives better results in most cases. Researchers have proposed a variety of padding methods. The principle of regression filling is to establish regression equations, from the regression relationship between the existing values and the pre-filled missing values in the data set, to estimate the missing values; regression filling is widely applied in work on incomplete data.
However, the traditional regression filling method cannot capture the varying correlations among sample attributes. To identify the relationships between attributes, one approach is to divide the data into subsets with similar regression relationships using fuzzy clustering and to approximate each subset with a linear model. This yields a rule-based fuzzy model from the existing fuzzy partition matrix and addresses the problem that the correlations among attributes in a real data set are unknown.
The Takagi-Sugeno model (TS model for short) is a typical fuzzy model. It consists of several if-then rules, and its modeling process divides into antecedent identification and consequent identification (T. Takagi, M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. Syst. Man Cybern. SMC-15 (1985) 116-132). It is a nonlinear model expressed by "IF-THEN" fuzzy rules. When modeling data, the input space is first divided into several fuzzy subspaces, a local linear model is then built in each fuzzy subspace, and all local models are connected through membership functions. The ith rule of the TS model is given by formula (1):

R^(i): IF x_j1 is A_1^(i) and x_j2 is A_2^(i) and ... and x_js is A_s^(i),
THEN y_j^(i) = p_0^(i) + p_1^(i) x_j1 + ... + p_s^(i) x_js    (1)

where R^(i) denotes the ith fuzzy rule, i = 1, 2, ..., k, with k the number of rules of the TS model; x_j = {x_j1, x_j2, ..., x_js} is the jth input variable of the system, also called the antecedent variable, j = 1, 2, ..., n, and x_js denotes the sth attribute of the jth sample; A_m^(i) is the fuzzy set of the mth attribute in the ith rule, m = 1, 2, ..., s; p_0^(i), p_1^(i), ..., p_s^(i) are the conclusion parameters, also called consequent parameters, with p_s^(i) the conclusion parameter of the sth attribute in the ith rule; y_j^(i) denotes the output of the jth input variable under the ith rule.
The final output y_j of the jth input variable in the fuzzy system is:

y_j = ( Σ_{i=1}^{k} v_j^(i) y_j^(i) ) / ( Σ_{i=1}^{k} v_j^(i) )    (2)

where v_j^(i), the weight of the jth input variable in the ith rule, is given by formula (3):

v_j^(i) = Π_{m=1}^{s} A_m^(i)(x_jm)    (3)

where A_m^(i)(x_jm) is the degree to which the mth attribute x_jm of the jth sample belongs to the fuzzy set A_m^(i), m = 1, 2, ..., s.
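As a concrete illustration of formulas (1)-(3), the following sketch evaluates a two-rule TS model in Python; the Gaussian membership functions and all numeric values (centers, widths, conclusion parameters) are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def ts_output(x, centers, widths, P):
    """TS inference per formulas (1)-(3): each rule weight is the product of
    per-attribute memberships, and the output is the weighted average of the
    local linear models.  Gaussian memberships are an assumed example."""
    k = centers.shape[0]
    # v_j^(i) = prod_m A_m^(i)(x_jm), formula (3)
    v = np.array([np.prod(np.exp(-((x - centers[i]) ** 2) / (2 * widths[i] ** 2)))
                  for i in range(k)])
    # y_j^(i) = p_0^(i) + sum_m p_m^(i) x_jm, formula (1)
    y_local = P[:, 0] + P[:, 1:] @ x
    # y_j = sum_i v^(i) y^(i) / sum_i v^(i), formula (2)
    return float(v @ y_local / v.sum())

# two rules over two attributes (all values illustrative)
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([[0.5, 0.5], [0.5, 0.5]])
P = np.array([[0.0, 1.0, 1.0],    # rule 1: y = x1 + x2
              [1.0, 2.0, 0.0]])   # rule 2: y = 1 + 2*x1
print(ts_output(np.array([0.0, 0.0]), centers, widths, P))
```

Near the center of rule 1 the output is dominated by rule 1's local model, and near the center of rule 2 by rule 2's, which is exactly the smooth blending that formula (2) provides.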
The TS-model-based filling method (Missing Value Imputations by Rule-based Incomplete Data Fuzzy Modeling. Xiaochen Lai, Xin Liu, Liyong Zhang, et al. IEEE International Conference on Communications (IEEE ICC 2019)) obtains the membership degrees of each rule through the FCM-PDS clustering algorithm and uses the fuzzy sets A_m^(i) as antecedent parameters, divides the incomplete data set into several subsets, and establishes local linear regression models containing only the important input variables of each subset. A global nonlinear model is then obtained by weighted summation of the local linear models, and its output is used as the padding value. Compared with the traditional regression filling method, this method makes full use of the existing values and describes the relationships between attributes more accurately. However, data imbalance is unavoidable in real data sets, and the above fuzzy partitioning method does not consider the influence of data-set imbalance on the fuzzy partition.
Disclosure of Invention
Reasonably partitioning an unbalanced data set improves the accuracy of the regression equations; on this basis the invention provides a modeling and missing value filling method for unbalanced incomplete data sets built on the TS model. The invention comprises two parts: a model-construction part and a filling-scheme part. The former improves the antecedent-parameter identification of the TS model to reduce the influence of data imbalance on the fuzzy partition; the latter uses the incomplete samples in the training process to improve the data utilization of the incomplete data set.
In the antecedent identification of the model, the antecedent parameters of the unbalanced incomplete data set are identified with an idea combining distance density and maximum-minimum distance (the SD algorithm), and the number of rules is determined, reducing the influence of data imbalance on the fuzzy partition. For the incomplete input data in the modeling process, input variables are first selected to fix the model structure; a least-squares method and an iterative-update strategy are then applied to estimate the conclusion parameters and fill the missing values, making full use of the existing data. When the iteration converges, the parameters and the padding values become fixed, completing the missing-value filling.
The padding accuracy of a missing-value filling method can be measured by the root mean square error (RMSE):

RMSE = sqrt( (1/N) Σ_{x_i ∈ X_M} (x_i − x̂_i)² )    (4)

where N is the number of missing values, x_i ∈ X_M is the original actual data value, and x̂_i is the padding value of the missing value under the filling scheme. The smaller the RMSE value, the better the filling effect, and vice versa.
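Formula (4) can be sketched directly; the function name is illustrative.

```python
import numpy as np

def rmse(true_vals, filled_vals):
    """Root mean square error of formula (4) over the N missing positions."""
    true_vals = np.asarray(true_vals, dtype=float)
    filled_vals = np.asarray(filled_vals, dtype=float)
    return float(np.sqrt(np.mean((true_vals - filled_vals) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.1547
```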
The technical scheme of the invention is as follows:
a modeling and missing value filling method for an unbalanced incomplete data set comprises a model building part and a filling scheme, and specifically comprises the following steps:
(1) building models
Combining local density and local distance, a distance density ds_ij is defined for each pair of samples, and a distance density algorithm for antecedent identification (SD algorithm for short) is designed:
Let the incomplete data set be X = {X_M, X_C}, where X_M is the subset formed by the missing values in the data set and X_C is the subset of non-missing values. For arbitrary samples x_i, x_j ∈ X, their distance density ds_ij is:

ds_ij = exp(S(x_i)) × pd(x_i, x_j)    (5)

where S(x_i) is the local density of sample x_i defined in formula (6) and pd(x_i, x_j) is the local distance between x_i and x_j obtained from formula (7).
The local density of sample x_i in data set X is defined as:

[Formula (6) — given only as an image in the original: the local density S(x_i), computed over the set N_i of the K nearest neighbouring samples x_j of x_i, where i = 1, 2, ..., n, n is the number of samples, and j = 1, 2, ..., K with K a user-defined constant.]

pd(x_i, x_j) is the local distance, computed with the partial distance strategy:

pd(x_i, x_j) = sqrt( ( s / Σ_{m=1}^{s} I_im I_jm ) · Σ_{m=1}^{s} I_im I_jm (x_im − x_jm)² )    (7)

where s is the number of sample attributes and I_im marks whether the mth attribute value x_im of the ith sample is missing:

I_im = 1 if x_im exists, and I_im = 0 if x_im is missing    (8)
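The local distance of formulas (7)-(8), with missing entries encoded as NaN, can be sketched as follows; this follows the classical partial distance strategy and the reconstruction given above, so treat it as an assumed reading rather than the patent's exact computation.

```python
import numpy as np

def partial_distance(xi, xj):
    """Local distance pd(x_i, x_j) of formula (7): compare only attributes
    present in both samples (the indicators I of formula (8)), and rescale
    by s / (number of jointly present attributes)."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    s = xi.size
    both = ~np.isnan(xi) & ~np.isnan(xj)        # I_im * I_jm
    if not both.any():
        return float("nan")                     # no comparable attribute
    sq = np.sum((xi[both] - xj[both]) ** 2)
    return float(np.sqrt(s / both.sum() * sq))

print(partial_distance([1.0, np.nan, 3.0], [2.0, 5.0, np.nan]))  # sqrt(3) ≈ 1.732
```

The rescaling keeps distances between sparsely overlapping samples comparable to fully observed ones, which is what allows the SD algorithm to work directly on the incomplete data set.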
The SD algorithm is used to compute the cluster centers of the samples and their number; the membership degrees are then computed from the obtained cluster centers, finally yielding the antecedent parameters of the model.
(2) Filling scheme
The invention updates the conclusion parameters and padding values of the TS model by iterative updating (IU). For an incomplete data set X whose samples have s attributes, each attribute is taken in turn as the output and s TS models are built. The input of the mth TS model is D^(m) = {D_1, D_2, ..., D_{m−1}, D_{m+1}, ..., D_s} and the desired output is D_m, m = 1, 2, ..., s. The incomplete data set is first randomly initialized to obtain a complete data set, and the conclusion parameters are then computed by least squares. In each TS model, the weighted input H_j^(i) of the jth sample x_j under rule R^(i) is obtained from formula (9):

H_j^(i) = v_j^(i) Γ^(i)    (9)

where v_j^(i) denotes the weight and Γ^(i) = [1, x_j1^(i), ..., x_j(q−1)^(i), x_j(q+1)^(i), ..., x_js^(i)] denotes the input of R^(i) after variable selection, in which the input variable x_jq^(i) has been rejected, i = 1, 2, ..., k, j = 1, 2, ..., n, 1 < q < s.
The actual output value ŷ_j of the model is then calculated:

ŷ_j = ( Σ_{i=1}^{k} H_j^(i) P^(i) ) / ( Σ_{i=1}^{k} v_j^(i) )    (10)

where P^(i) is the conclusion-parameter vector of rule R^(i) derived by least squares.
Formulas (9) and (10) yield the output sets Ŷ^(l) of the s TS models, where l denotes the lth iteration: the outputs at missing positions are the padding values to be updated, while the model outputs for the existing data are used, together with the corresponding true values, to compute the root mean square error f^(l). The difference |Δf| = |f^(l) − f^(l−1)| from the root mean square error of the previous round of iterative learning is then computed; if |Δf| is larger than the threshold ε, the above steps are repeated for a new round of learning, otherwise the iteration ends and the filled data set is output.
The invention has the following beneficial effects. First, the distance-density-based algorithm replaces the original FCM method for identifying the antecedent parameters of the TS model and reconstructs the membership degrees, reducing the influence of data imbalance on the fuzzy partition. Second, for the incomplete input data in the modeling process, the missing values are treated as variables and an iterative-learning filling scheme dynamically updates both the missing values and the model conclusion parameters, making full use of the existing data.
Drawings
Fig. 1 is a schematic diagram of the operation of the present invention.
In fig. 1: 1 — the unbalanced incomplete data set containing missing values is input to the model; 2 — the data set is partitioned by the distance density algorithm (SD); 3 — the distance pd(x_i, c_t) between each sample and each center is computed with the local distance strategy; 4 — input variables are selected; 5 — conclusion parameters and padding values are dynamically updated by iterative learning (IU); 6 — the complete data set containing the padding values is output.
FIG. 2 is a workflow diagram of the distance density algorithm (SD) of the present invention.
FIG. 3 is a diagram of an implementation of the iterative learning method (IU) of the present invention.
In fig. 3: step 1, random pre-filling is carried out on an incomplete data set; step 2, inputting the filled data set into an iterative learning model; step 3, when the output condition is not met, continuously updating the filling value; and 4, outputting the data set containing the final filling value when the output condition is reached.
Detailed Description
The following detailed description of the embodiments of the invention is provided in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of the operation of the invention. In the unbalanced incomplete data set shown in the first row, D_1, D_2, ..., D_s denote the attribute names, black marks denote missing values, and gray marks denote padding values. Based on fig. 1, the invention uses the distance density algorithm to identify the antecedent parameters, and then uses an iterative learning method to dynamically realize conclusion-parameter identification and missing-value filling. First, the unbalanced incomplete data set containing missing values is input to the model. In model construction, the n samples of the data set are divided into k classes by the distance density algorithm, with class centers c_1, c_2, ..., c_k. Because data-set attributes may be missing, the invention computes the distance pd(x_i, c_t) between sample and center by formula (7), where i = 1, 2, ..., n and t = 1, 2, ..., k, completing the antecedent-parameter identification of the model. Second, input variables are selected so that the model contains only regression equations of the significant variables. In the filling scheme, the conclusion parameters and padding values are dynamically updated to complete the iterative learning of the model; when the iteration converges, the unbalanced complete data set containing the final padding values is output.
The details of the technical scheme of the invention are explained by taking three data sets of a UCI machine learning database as an example. An incomplete data set is constructed by manually deleting portions of the data in the data set.
(1) Building models
The distance density algorithm (SD algorithm) divides the input unbalanced incomplete data set into k subsets. To handle the imbalance of the data set, its principle is to ensure that each newly obtained cluster center is relatively far from the existing cluster centers, avoiding cluster centers that lie too close together, several cluster centers being selected within the same class, or small clusters receiving no cluster center at all.
Let B denote the cluster-center index set, recording the indices of the cluster centers selected from the data set samples. A sample farthest from the selected class centers is then chosen from the non-center samples; denoting its index by q, q satisfies:

q = arg max_{j ∉ B} ( min_{t ∈ B} pd(x_j, c_t) )    (11)

x_q is then taken as a new cluster center and its index is added to the set B, where c_t denotes the tth cluster center of the data set.
The algorithm does not need to give the number of clusters in advance, and can determine the number of initial cluster centers according to a certain calculation rule. The number of the clustering centers is the regular number of the TS model.
The workflow of the distance density (SD) algorithm is detailed in fig. 2; the specific steps are as follows:
Step 1: input the incomplete data set;
Step 2: initialize an empty set B, the number K of neighbour samples, and a parameter θ, θ < 1;
Step 3: compute the local distances pd(x_i, x_j) from x_i to the remaining samples, j = 1, ..., i−1, i+1, ..., n; then sort the obtained local distances and select the K nearest samples to form the set N_i;
Step 4: compute the local density of each sample by formula (6), take the sample with the largest local density as the first class center c_1, and record c_1 = x_i, B = B + {i};
Step 5: compute the distance densities from the remaining samples to c_1 by formula (5), select the sample with the largest distance density as the second class center c_2, and record c_2 = x_j, B = B + {j};
Step 6: if the maximum minimum distance max_{j∉B} min_{t∈B} pd(x_j, c_t) is still greater than θ × pd(c_1, c_2), go to Step 7, otherwise go to Step 9;
Step 7: record the newly selected center as c_q, where q satisfies formula (11);
Step 8: compute the distance densities from the remaining samples to the new center c_q by formula (5), select the sample with the largest distance density as the next class center c_next, and record c_next = x_l, B = B + {l}; return to Step 6;
Step 9: output the cluster centers {c_1, c_2, ..., c_|B|} and the number |B| of cluster centers.
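Steps 1-9 can be sketched as follows on a complete data set; since formula (6) is only reproduced as an image in the original, the local density used here (negative mean distance to the K nearest neighbours) is an explicit assumption, and Euclidean distance stands in for the local distance pd.

```python
import numpy as np

def sd_centers(X, K=3, theta=0.5):
    """Sketch of SD Steps 1-9 on a complete data set.  Formula (6) is only
    an image in the original patent, so the local density is ASSUMED here
    as S(x_i) = -(mean distance to the K nearest neighbours), so that
    exp(S(x_i)) in formula (5) favours samples in dense regions."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # stand-in for pd
    # Steps 3-4: K nearest neighbours and (assumed) local density S
    S = np.array([-np.sort(D[i])[1:K + 1].mean() for i in range(n)])
    ds = np.exp(S)[:, None] * D              # formula (5): ds_ij
    B = [int(np.argmax(S))]                  # Step 4: densest sample -> c1
    B.append(int(np.argmax(ds[:, B[0]])))    # Step 5: largest ds to c1 -> c2
    d12 = D[B[0], B[1]]
    while True:
        rest = [j for j in range(n) if j not in B]
        if not rest:
            break
        # Step 6: maximum-minimum distance test against theta * pd(c1, c2)
        mins = np.array([min(D[j, t] for t in B) for j in rest])
        q = rest[int(np.argmax(mins))]       # Step 7: candidate c_q, formula (11)
        if mins.max() <= theta * d12:
            break                            # -> Step 9
        # Step 8: next centre = largest distance density w.r.t. c_q
        B.append(max(rest, key=lambda j: ds[j, q]))
    return B                                 # Step 9: centre indices, k = |B|
```

On two well-separated dense groups the sketch picks one center per group and stops, since every remaining sample is much closer to an existing center than θ × pd(c_1, c_2).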
The number |B| of cluster centers equals the number k of fuzzy rules, i.e., |B| = k. The membership degrees are then computed from the cluster centers obtained in Steps 1-9. A^(t)(x_i) denotes the degree to which sample x_i belongs to A^(t), where A^(t) is the multi-dimensional fuzzy set centered on c_t; A^(t)(x_i) is obtained from formula (12):

[Formula (12) — given only as an image in the original: the membership of the ith sample to the tth fuzzy set, computed from the local distances pd(c_t, x_i).]

where pd(c_t, x_i) is the local distance between the tth cluster center and the ith sample, t = 1, 2, ..., k, i = 1, 2, ..., n. The fuzzy sets A^(t) thus obtained complete the identification of the antecedent parameters of the model.
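A sketch of the membership computation follows; because formula (12) is only reproduced as an image in the original, an FCM-type membership based on the local distances pd(c_t, x_i) is assumed here purely for illustration.

```python
import numpy as np

def memberships(dist_to_centers):
    """ASSUMED FCM-type stand-in for formula (12), which is only an image in
    the original: u_t(x_i) = 1 / sum_r (pd(c_t, x_i) / pd(c_r, x_i))^2.
    dist_to_centers: array of pd(c_t, x_i) for one sample, t = 1..k."""
    d = np.asarray(dist_to_centers, dtype=float)
    if np.any(d == 0):                      # sample coincides with a centre
        hits = (d == 0)
        return hits.astype(float) / hits.sum()
    return 1.0 / np.array([np.sum((d[t] / d) ** 2) for t in range(d.size)])
```

Whatever the exact form of formula (12), the memberships sum to one over the k fuzzy sets and decrease with pd(c_t, x_i), which is the property the antecedent identification relies on.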
(2) Filling scheme
After the antecedent parameters are obtained, stepwise regression is first used to select the input variables, so that only significant variables remain in the model. The iterative-learning (IU) method for filling and conclusion-parameter identification is shown in fig. 3. In the first row of fig. 3, D_1, D_2, ..., D_s denote the attribute names; black marks denote the dynamic padding values x̂^(l), where l denotes the lth iteration; gray marks denote the final padding values. v^(i) is the weight of each rule R^(i), i = 1, 2, ..., k; H denotes the weighted input of all rules; P denotes the conclusion parameters, computed as:

P = (H^T H)^{-1} H^T Y    (13)

where Y = [x_1m, x_2m, ..., x_nm]^T denotes all samples in the mth attribute, m = 1, 2, ..., s. |Δf| denotes the absolute difference between the root mean square errors obtained from the existing data and the corresponding model outputs in two adjacent rounds of iterative learning, used to judge whether the iterative learning has finished, and ε denotes the threshold for stopping the iteration. f is computed as:

f = sqrt( (1/|X_C|) Σ_{x_i ∈ X_C} (x_i − ŷ_i)² )    (14)

where |X_C| is the number of existing data values, ŷ_i is the model output for x_i, and x_i ∈ X_C. The specific steps of iterative learning (IU) are as follows:
Step 1: randomly pre-fill the incomplete data set to obtain a data set containing the dynamic padding values x̂^(l);
Step 2: compute the conclusion parameters P from the filled data set and formulas (9) and (13), and obtain the model outputs from formula (10);
Step 3: update the padding values with the outputs ŷ^(l); compute f^(l) from formula (14), compare it with the f^(l−1) obtained in the previous iteration, and take the difference |Δf|; if |Δf| > ε, return to Step 2 and enter the next round of iterative learning;
Step 4: if |Δf| ≤ ε, terminate the iteration and output the data set containing the final padding values.
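The iterative learning Steps 1-4 can be sketched as follows; for brevity a single rule (k = 1) is assumed, so each TS model degenerates to one least-squares regression per attribute via formula (13), with f computed as in formula (14). This is a simplification of the full rule-weighted method, not the patent's exact procedure.

```python
import numpy as np

def iterative_fill(X, eps=1e-6, max_iter=100, seed=0):
    """Sketch of IU Steps 1-4 with a single rule (k = 1): one linear
    regression per attribute, refit on the filled data each round."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(X)
    Xf = X.copy()
    # Step 1: random pre-fill of the missing positions
    for m in range(X.shape[1]):
        col = Xf[:, m]
        obs = col[~miss[:, m]]
        col[miss[:, m]] = rng.uniform(obs.min(), obs.max(), miss[:, m].sum())
    f_prev = np.inf
    for _ in range(max_iter):
        Xnew = Xf.copy()
        sq_err, n_obs = 0.0, 0
        for m in range(X.shape[1]):          # one TS model per attribute
            others = [c for c in range(X.shape[1]) if c != m]
            H = np.hstack([np.ones((X.shape[0], 1)), Xf[:, others]])
            Y = Xf[:, m]
            # Step 2 / formula (13): least-squares conclusion parameters
            P, *_ = np.linalg.lstsq(H, Y, rcond=None)
            Yhat = H @ P                      # formula (10) with k = 1
            Xnew[miss[:, m], m] = Yhat[miss[:, m]]   # Step 3: update fills
            sq_err += np.sum((Y[~miss[:, m]] - Yhat[~miss[:, m]]) ** 2)
            n_obs += int((~miss[:, m]).sum())
        f = np.sqrt(sq_err / n_obs)           # formula (14) on existing data
        Xf = Xnew
        if abs(f - f_prev) <= eps:            # Step 4: |Δf| <= ε -> stop
            break
        f_prev = f
    return Xf
```

On data with an exact linear relation between attributes, the fill converges to the value predicted by that relation regardless of the random pre-fill.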
(3) Experiment of
3 data sets are selected from the UCI machine learning repository to verify the filling performance of the method; the data sets are described in Table 1. To allow computing the error between the missing-value estimates and the true values, the selected data sets are all complete; the experiments construct incomplete data sets by manually deleting part of the data at specified missing rates of 5%, 10%, 15%, 20%, 25%, 30%, 40%, 45%, and 50%.
Table 1 data set description
[Table 1 — given as an image in the original.]
The experiments fill the incomplete data with the proposed method and compare the padding values with the actual values. For the complete data set at each specified missing rate, 5 incomplete data sets are randomly generated and the average RMSE is taken as the final experimental result. The invention compares the following five filling schemes: the traditional regression-model-based filling method (REG); the traditional TS-modeling-based filling method (Basic-TS); the TS-modeling filling method whose model is built with the distance density algorithm (SD-TS); the TS-modeling filling method with iterative learning (TS-IU); and the TS-modeling filling method that builds the model with the distance density algorithm and adopts iterative learning (SD-TS-IU). In each set of comparison experiments, all methods use the same initialization data set. Table 2 gives the RMSE results of the five filling methods, where the best results are bold and underlined and the second-best results are bold.
TABLE 2 RMSE indices of five filling methods
[Table 2 — given as images in the original.]
As can be seen from Table 2, the padding precision of Basic-TS is generally higher than that of REG, showing that filling based on TS modeling is more effective than filling based on regression. Further examination of the data shows that the RMSEs of SD-TS are generally lower than those of Basic-TS, and SD-TS-IU likewise generally obtains better results than TS-IU; moreover, as the degree of imbalance of the data set increases, the effect of the distance density algorithm becomes more pronounced. Comparing the RMSEs of TS-IU and Basic-TS, TS-IU is superior in all but a few special cases, showing that the iterative update strategy effectively improves the filling precision.
In conclusion, the SD-TS-IU of the invention achieves the best results, showing that its filling precision is superior to that of the other compared methods.

Claims (3)

1. A modeling and missing value filling method for an unbalanced incomplete data set is characterized by comprising the following steps:
(1) building models
combining local density and local distance to define a distance density ds_ij for each pair of samples, and designing a distance density algorithm for antecedent identification, namely the SD algorithm:
let the incomplete data set be X = {X_M, X_C}, where X_M is the subset formed by the missing values in the data set and X_C is the subset of non-missing values; for arbitrary samples x_i, x_j ∈ X, their distance density ds_ij is:

ds_ij = exp(S(x_i)) × pd(x_i, x_j)    (5)

where S(x_i) is the local density of sample x_i defined in formula (6) and pd(x_i, x_j) is the local distance between x_i and x_j obtained from formula (7);
the local density of sample x_i in data set X is defined as:

[Formula (6) — given only as an image in the original: the local density S(x_i), computed over the set N_i of the K nearest neighbouring samples x_j of x_i, where i = 1, 2, ..., n, n denotes the number of samples, j = 1, 2, ..., K, and K is a user-defined constant.]

pd(x_i, x_j) is the local distance, computed with the partial distance strategy:

pd(x_i, x_j) = sqrt( ( s / Σ_{m=1}^{s} I_im I_jm ) · Σ_{m=1}^{s} I_im I_jm (x_im − x_jm)² )    (7)

where s is the number of sample attributes and I_im marks whether the mth attribute value x_im of the ith sample is missing:

I_im = 1 if x_im exists, and I_im = 0 if x_im is missing    (8)
calculating the cluster centers of the samples and their number with the SD algorithm, then calculating the membership degrees from the obtained cluster centers, and finally obtaining the antecedent parameters of the model;
(2) filling scheme
updating the conclusion parameters and padding values of the TS model by iterative updating: for the incomplete data set X whose samples have s attributes, taking each attribute in turn as the output and building s TS models, the input of the mth TS model being D^(m) = {D_1, D_2, ..., D_{m−1}, D_{m+1}, ..., D_s} and the desired output D_m, m = 1, 2, ..., s; randomly initializing the incomplete data set to obtain a complete data set, then computing the conclusion parameters by least squares; in each TS model, the weighted input H_j^(i) of the jth sample x_j under rule R^(i) is obtained from formula (9):

H_j^(i) = v_j^(i) Γ^(i)    (9)

where v_j^(i) denotes the weight and Γ^(i) = [1, x_j1^(i), ..., x_j(q−1)^(i), x_j(q+1)^(i), ..., x_js^(i)] denotes the input of R^(i) after variable selection, in which the input variable x_jq^(i) has been rejected, i = 1, 2, ..., k, j = 1, 2, ..., n, 1 < q < s; the actual output value ŷ_j of the model is then calculated:

ŷ_j = ( Σ_{i=1}^{k} H_j^(i) P^(i) ) / ( Σ_{i=1}^{k} v_j^(i) )    (10)

where P^(i) is the conclusion-parameter vector of rule R^(i) derived by least squares;
formulas (9) and (10) yield the output sets Ŷ^(l) of the s TS models, where l denotes the lth iteration: the outputs at missing positions are the padding values to be updated, while the model outputs for the existing data are used, together with the corresponding true values, to compute the root mean square error f^(l); the difference |Δf| = |f^(l) − f^(l−1)| from the previous round of iterative learning is then computed; if |Δf| is larger than the threshold ε, the above steps are repeated for a new round of learning, otherwise the iteration ends and the filled data set is output; thus, TS modeling of the unbalanced incomplete data with each of the s attributes as output is realized.
2. The modeling and missing value filling method for an unbalanced incomplete data set according to claim 1, characterized in that:
B denotes the cluster-center index set, recording the indices of the class centers selected from the data set samples; a sample farthest from the selected class centers is then chosen from the non-center samples; denoting its index by q, q satisfies:

q = arg max_{j ∉ B} ( min_{t ∈ B} pd(x_j, c_t) )    (11)

x_q is then taken as a new cluster center and its index is added to the set B, where c_t denotes the tth cluster center of the data set;
The specific process of constructing the model is as follows:
Step 1: input the incomplete data set;
Step 2: initialize the empty set B, the number of neighbor samples K, and the parameter θ, where θ < 1;
Step 3: calculate the local distance pd(x_i, x_j) from x_i to each of the remaining samples, where j = 1, ..., i-1, i+1, ..., n; sort the obtained local distances and select the K nearest samples to form the set N_i;
Step 4: calculate the local density of each sample according to formula (6), and take the sample with the maximum local density as the first class center c_1, recording c_1 = x_i, B = B + {i};
Step 5: calculate, according to formula (5), the distance-density attribute of the remaining samples with respect to c_1, and select the sample with the largest distance-density attribute as the second class center c_2, recording c_2 = x_j, B = B + {j};
Step 6: if the maximum minimum distance max_{j∉B} min_{t∈B} pd(x_j, c_t) is still greater than θ × pd(c_1, c_2), go to step 7; otherwise go to step 9;
Step 7: record the newly selected center as c_q, where q satisfies formula (11);
Step 8: calculate, according to formula (5), the distance-density attribute of the remaining samples with respect to the new center c_q, and select the sample with the largest distance-density attribute as the next class center c_next, recording c_next = x_l, B = B + {l}; return to step 6;
Step 9: output the cluster centers {c_1, c_2, ..., c_|B|} and the number of cluster centers |B|.
The number of cluster centers |B| equals the number of fuzzy rules k, i.e. |B| = k.
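Steps 1-9 above can be sketched as follows. Formulas (5) and (6) are not reproduced in this excerpt, so the sketch substitutes assumed stand-ins: Euclidean distance for pd(·,·), the inverse mean distance to the K nearest neighbours for the local density of formula (6), and density times minimum distance to the chosen centers for the distance-density attribute of formula (5); under that last assumption, steps 7-8 collapse into a single max-min selection.

```python
import numpy as np

def select_centers(X, K=5, theta=0.5):
    """Max-min cluster-center selection (steps 1-9), a sketch under the
    assumptions stated above; function and parameter names are illustrative."""
    pd = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(pd, axis=1)[:, 1:K + 1]          # K nearest, excluding self
    density = 1.0 / (knn.mean(axis=1) + 1e-12)     # formula (6), assumed form

    c1 = int(np.argmax(density))                   # step 4: densest sample is c1
    B = [c1]
    d_min = pd[c1].copy()                          # min distance to chosen centers
    attr = density * d_min                         # formula (5), assumed form
    c2 = int(np.argmax(attr))                      # step 5: second center c2
    B.append(c2)
    d_min = np.minimum(d_min, pd[c2])
    sep = theta * pd[c1, c2]                       # step 6 threshold

    while True:
        cand = d_min.copy()
        cand[B] = -np.inf
        q = int(np.argmax(cand))                   # formula (11): max-min sample
        if cand[q] <= sep:                         # step 6: centers no longer
            break                                  # well separated -> stop
        B.append(q)                                # steps 7-8 (simplified)
        d_min = np.minimum(d_min, pd[q])
    return B, len(B)                               # step 9: center indices, |B|
```

On a data set with well-separated groups, the loop keeps adding centers until the farthest remaining sample is within θ × pd(c_1, c_2) of some chosen center, so |B| tracks the number of natural clusters without being fixed in advance.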
The membership degrees are then calculated using the cluster centers obtained in steps 1-9. Let u^(t)(x_i) denote the degree to which sample x_i belongs to A^(t), where A^(t) denotes a multi-dimensional fuzzy set centered at c_t; u^(t)(x_i) is obtained from formula (12), in which pd(c_t, x_i) denotes the local distance between the t-th cluster center and the i-th sample, t = 1, 2, ..., |B|. The fuzzy sets A^(1), ..., A^(|B|) are thereby obtained, completing the identification of the antecedent parameters of the model.
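Formula (12) itself is an image in the source, so the mapping from local distances to memberships is sketched here under an assumption: an FCM-style inverse-distance form with fuzzifier m. The function name `memberships` and the input layout (a k × n matrix of local distances pd(c_t, x_i)) are illustrative choices, not from the claim.

```python
import numpy as np

def memberships(pd_centers, m=2.0):
    """Membership u^(t)(x_i) of each sample in each fuzzy set A^(t);
    assumed FCM-style inverse-distance form, not the claim's formula (12)."""
    d = np.asarray(pd_centers, dtype=float) + 1e-12  # guard against zero distance
    inv = d ** (-2.0 / (m - 1.0))                    # closer center -> larger weight
    return inv / inv.sum(axis=0, keepdims=True)      # each column sums to 1
```

Whatever the exact form of formula (12), the useful invariants are the ones checked here: memberships of one sample across all k fuzzy sets sum to 1, and the nearest center receives the largest membership.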
3. The method of claim 1, wherein the conclusion parameters and the iterative learning are determined as follows:
H denotes the weighted input of all rules, and P denotes the conclusion parameters, calculated as:

P = (H^T H)^(-1) H^T Y   (13)

where Y = [x_1m, x_2m, ..., x_nm]^T denotes all samples in the m-th dimensional attribute, m = 1, 2, ..., s. |Δf| denotes the absolute value of the difference between the root mean square errors, obtained from the existing data and the corresponding model outputs, of two adjacent rounds of iterative learning; it is used to judge whether the iterative learning has finished, and ε denotes the threshold for stopping the iteration. f is calculated as:

f = sqrt( (1/|X_C|) Σ_{x_i ∈ X_C} (ŷ_i − x_i)^2 )   (14)

where |X_C| denotes the number of existing data and ŷ_i denotes the model output corresponding to the existing datum x_i ∈ X_C.
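The least-squares step of formula (13) and the RMSE of formula (14) can be sketched directly in NumPy. `np.linalg.lstsq` is used in place of forming (H^T H)^(-1) explicitly; it computes the same least-squares solution but is numerically more stable. Function names are illustrative.

```python
import numpy as np

def conclusion_params(H, Y):
    # P = (H^T H)^(-1) H^T Y, formula (13); lstsq solves the same
    # least-squares problem without inverting H^T H explicitly
    P, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return P

def rmse(pred, true):
    # formula (14): root mean square error over the |X_C| existing data
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))
```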
The specific process of the iterative learning is as follows:
Step 1: randomly pre-fill the incomplete data set to obtain a data set containing dynamic filling values;
Step 2: calculate the conclusion parameters based on the filled data set and formulas (9) and (13), and obtain the model outputs from formula (10);
Step 3: update the filling values with the corresponding model outputs; calculate f^(l) based on the model outputs and formula (14), compare it with f^(l-1) obtained from the last round of iteration, and compute the difference Δf; if |Δf| > ε, return to step 2 and enter the next round of iterative learning;
Step 4: if |Δf| ≤ ε, terminate the iteration and output the data set containing the final filling values.
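The fill, fit, refill control flow of steps 1-4 can be sketched as follows. The s TS fuzzy models of formulas (9)-(10) are replaced here by one plain linear least-squares model per attribute, and the random pre-filling of step 1 by column means, purely to keep the sketch self-contained; only the |Δf| ≤ ε stopping logic follows the claim.

```python
import numpy as np

def iterative_fill(X, eps=1e-4, max_iter=50):
    """Iterative missing-value filling (steps 1-4), a sketch with a linear
    stand-in for the TS models; names and defaults are illustrative."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                     # True where values are missing
    filled = X.copy()
    col_mean = np.nanmean(X, axis=0)
    filled[mask] = np.take(col_mean, np.nonzero(mask)[1])  # step 1: pre-fill

    f_prev = np.inf
    for _ in range(max_iter):
        pred = np.empty_like(filled)
        for m in range(X.shape[1]):        # one model per attribute (step 2)
            others = np.delete(filled, m, axis=1)
            H = np.column_stack([others, np.ones(len(filled))])
            P, *_ = np.linalg.lstsq(H, filled[:, m], rcond=None)  # cf. formula (13)
            pred[:, m] = H @ P             # model output, cf. formula (10)
        filled[mask] = pred[mask]          # step 3: update filling values
        # cf. formula (14): RMSE over the existing (non-missing) data
        f = np.sqrt(np.mean((pred[~mask] - X[~mask]) ** 2))
        if abs(f_prev - f) <= eps:         # step 4: |Δf| <= ε -> stop
            break
        f_prev = f
    return filled
```

Because the filling values feed back into the next fit, the RMSE over the existing data settles down as the fills stabilize, which is exactly what the |Δf| criterion detects; on data with a strong linear relation the converged fills approach the values that relation implies.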
CN202010085969.9A 2020-02-11 2020-02-11 Modeling and missing value filling method for unbalanced incomplete data set Withdrawn CN111353525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010085969.9A CN111353525A (en) 2020-02-11 2020-02-11 Modeling and missing value filling method for unbalanced incomplete data set


Publications (1)

Publication Number Publication Date
CN111353525A true CN111353525A (en) 2020-06-30

Family

ID=71197960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010085969.9A Withdrawn CN111353525A (en) 2020-02-11 2020-02-11 Modeling and missing value filling method for unbalanced incomplete data set

Country Status (1)

Country Link
CN (1) CN111353525A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034042A (en) * 2021-04-19 2021-06-25 上海数禾信息科技有限公司 Data processing method and device for construction of wind control model
CN113034042B (en) * 2021-04-19 2024-04-26 上海数禾信息科技有限公司 Data processing method and device for wind control model construction
CN114328742A (en) * 2021-12-31 2022-04-12 广东泰迪智能科技股份有限公司 Missing data preprocessing method for central air conditioner


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200630