CN109034231A - The deficiency of data fuzzy clustering method of information feedback RBF network valuation - Google Patents

The deficiency of data fuzzy clustering method of information feedback RBF network valuation

Info

Publication number
CN109034231A
Authority
CN
China
Prior art keywords
data
network
value
incomplete
ifrbf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810785729.2A
Other languages
Chinese (zh)
Inventor
张利
石振桔
张皓博
刘洋
王彦杰
肖雪冬
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University filed Critical Liaoning University
Priority to CN201810785729.2A priority Critical patent/CN109034231A/en
Publication of CN109034231A publication Critical patent/CN109034231A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/02Computing arrangements based on specific mathematical models using fuzzy logic

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Automation & Control Theory (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to an incomplete data fuzzy clustering method based on information feedback RBF network estimation. The steps are as follows: 1) an information feedback RBF (IFRBF) network model is proposed; 2) an incomplete data fuzzy clustering method with IFRBF numerical estimation (IFRBF-FCM) is proposed; 3) the nearest neighbor rule is used to select a corresponding training sample set for each incomplete data sample, and an IFRBF network is trained for each missing attribute with the nearest neighbor training sample set, so that the missing attribute values of the incomplete data samples are estimated and a complete data set recovered by IFRBF network estimation is obtained; 4) the estimation interval of each missing attribute is determined, and an incomplete data fuzzy clustering method with IFRBF interval estimation (IFRBF-IFCM) is proposed to obtain the fuzzy clustering result. Compared with the comparison methods, the clustering results obtained by the present invention on the complete data sets recovered by IFRBF network estimation of incomplete data sets have higher accuracy, and the clustering results of the interval estimation are more accurate and more robust than those of the numerical estimation.

Description

Incomplete data fuzzy clustering method for information feedback RBF network estimation
Technical Field
The invention relates to a fuzzy clustering method for incomplete data, and in particular to an incomplete data fuzzy clustering method with interval estimation by an information feedback RBF network.
Background
With the rapid development of information technology, large amounts of data arise in various fields. Such data can no longer be processed manually and must be processed with the help of computers. Cluster analysis plays an important role in many areas. Traditional cluster analysis algorithms perform hard partitioning: each data sample either belongs or does not belong to a given cluster; in other words, the membership value for each cluster is either 0 or 1. However, most real-world data have a certain degree of ambiguity and do not strictly belong to a single cluster, but belong to several clusters to different degrees.
Therefore, as an unsupervised classification method, the fuzzy C-means (FCM) algorithm is the most widely used among the many clustering algorithms. The membership matrix in the fuzzy clustering algorithm represents the degree to which each data sample belongs to each cluster [18] and can express the fuzziness of the data. However, the data set used by this algorithm needs to be complete, and it cannot act directly on an incomplete data set in which values are missing. In practice, incomplete data sets occur frequently, and the reasons for the missing values are manifold. To solve this problem, many scholars at home and abroad have carried out further research on fuzzy cluster analysis of incomplete data.
Disclosure of Invention
The invention provides an incomplete data fuzzy clustering method based on information feedback RBF network estimation, aiming at the problem that the basic FCM algorithm cannot be applied directly to fuzzy clustering of an incomplete data set. In addition, the estimates of incomplete data obtained by training the IFRBF network are numerical; numerical values cannot accurately describe the uncertainty of incomplete data, and certain errors remain. Aiming at this problem, the invention further provides an incomplete data fuzzy clustering method with IFRBF interval estimation.
In order to achieve the above purpose, the invention adopts the following technical scheme: an incomplete data fuzzy clustering method based on information feedback RBF network estimation, characterized by comprising the following steps:
1) propose an information feedback RBF network model: combined with the Kalman filtering idea, with input vector X = (x_1, x_2, …, x_{n+m}) and output vector Y = (y_1, y_2, …, y_m), the error e between the theoretical expected output value of the incomplete data and the actual output value of the network is calculated, and the difference between the predicted value of the RBF neural network and the theoretical expected value of the data is fed back to the input layer, giving the IFRBF model;
2) select a corresponding training sample set for each incomplete data sample using the nearest neighbor rule, and train an IFRBF network for each missing attribute with the nearest neighbor training sample set, so that the missing attribute values of the incomplete data samples are estimated; a complete data set recovered by IFRBF network estimation is thus obtained and fuzzy cluster analysis is performed;
3) interval-type conversion of the incomplete data set: the missing data attributes of incomplete data samples are estimated and filled in by the IFRBF network, the estimation errors of the IFRBF network on the complete attributes of the incomplete data samples are obtained, the interval representation of the missing data attributes is determined from the mean of the absolute values of these estimation errors, and the complete attributes of the data set are also converted to intervals;
4) cluster analysis is carried out on the converted interval data set using an interval fuzzy C-means clustering method, in which each cluster center is represented by an interval vector, thereby obtaining the fuzzy clustering result.
In the step 1), the specific method comprises the following steps:
1.1) normalize the input data set: all data are mapped into the interval [0, 1], eliminating the magnitude differences between dimensions;
1.2) initialize the IFRBF network: set the numbers of nodes in each layer (n + m, l and m), initialize the weights w, the center vectors C_i and the widths σ², and determine the maximum number of training iterations M, the error precision ε_1 and the learning rates η_1, η_2, η_3;
1.3) calculate the hidden-layer output values of the network according to formula (1);
1.4) calculate the output-layer output values of the network according to formula (2),
where w_jk, the connection weight between the hidden layer and the output layer, is obtained by a minimum variance algorithm;
1.5) calculate the error e between the output value of the network and its expected value according to formula (3):
e_k = Y_k − O_k, k = 1, 2, …, m (3)
1.6) adjust the parameters of the network, namely the center vectors, the widths and the connection weights, according to formula (4), formula (5) and formula (6), and feed the obtained error back to the input layer,
where η_1, η_2, η_3 denote the learning rates;
1.7) algorithm termination decision: when the number of training iterations reaches the maximum M, or when the error e < ε_1, the algorithm ends; otherwise, return to step 1.3). An illustrative sketch of the forward pass in steps 1.3) and 1.4) is given below.
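Formulas (1) and (2) are not reproduced in this text. As a purely illustrative sketch, the Python code below assumes the usual form of an RBF network: Gaussian basis functions in the hidden layer and a linear weighted sum in the output layer. The function names and array shapes are illustrative conventions, not the patent's notation.

```python
import numpy as np

def rbf_hidden(x, centers, sigma2):
    """Hidden-layer outputs: Gaussian basis functions (assumed form of formula (1)).

    x       : input vector, shape (n + m,)  -- data attributes plus fed-back errors
    centers : hidden-node center vectors C_i, shape (l, n + m)
    sigma2  : squared widths, shape (l,)
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distance to each center
    return np.exp(-d2 / (2.0 * sigma2))       # shape (l,)

def rbf_output(h, weights):
    """Output-layer values: weighted sum of hidden outputs (assumed form of formula (2)).

    h       : hidden-layer outputs, shape (l,)
    weights : hidden-to-output connection weights w_jk, shape (l, m)
    """
    return h @ weights                        # shape (m,)

# toy example with n + m = 5 input nodes, l = 4 hidden nodes, m = 2 output nodes
rng = np.random.default_rng(0)
x = rng.random(5)
centers = rng.random((4, 5))
sigma2 = np.full(4, 0.5)
weights = rng.standard_normal((4, 2))
print(rbf_output(rbf_hidden(x, centers, sigma2), weights))
```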
In the step 2), the method specifically comprises the following steps:
2.1) select training samples: for an s-dimensional incomplete data set X = {x_1, x_2, …, x_n}, the similarity measure between an incomplete data sample x_a and a data sample x_b is given by formula (11):
where x_ia and x_ib are the i-th attributes of x_a and x_b, respectively, and I_i satisfies the following condition:
the corresponding nearest neighbor samples are selected for the incomplete data through the similarity measures given by formula (11) and formula (12), so that a corresponding training sample set is selected for each incomplete data sample (see the sketch after this list);
2.2) IFRBF network training on incomplete data:
2.2.1) for the incomplete attributes in the training samples, a "0"-substitution method is adopted: the value of each incomplete attribute is replaced with 0 at the corresponding input-layer node, and network training is then carried out;
2.2.2) since there is no feedback value in the first training pass, the corresponding feedback input is also replaced with 0;
2.2.3) the error corresponding to an incomplete attribute is replaced with the mean of the errors of the remaining complete attributes, i.e.:
thereby completing the training of the IFRBF network;
2.3) for each incomplete data sample, the corresponding parameters of the IFRBF network are assigned using the relevant parameters obtained from training;
2.4) calculate the hidden-layer output values of the network according to formula (1);
2.5) calculate the output-layer output values of the network according to formula (2);
2.6) obtain the estimates of the missing attributes of the corresponding incomplete data from the output values obtained in 2.4) and 2.5), thereby filling the incomplete data set into a complete data set on which fuzzy cluster analysis is carried out.
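Formulas (11) and (12) are likewise not reproduced here. The sketch below uses the commonly employed partial-distance similarity measure, in which only attributes present in both samples contribute (I_i = 1), as a stand-in; the names partial_distance and nearest_neighbors are illustrative.

```python
import numpy as np

def partial_distance(xa, xb):
    """Partial-distance dissimilarity between two samples; np.nan marks missing attributes.

    Only attributes present in both samples (I_i = 1) contribute, and the sum is
    rescaled by s / sum(I_i).  This is a stand-in for formulas (11)-(12).
    """
    present = ~np.isnan(xa) & ~np.isnan(xb)
    if not present.any():
        return np.inf
    s = xa.size
    diff = xa[present] - xb[present]
    return np.sqrt((s / present.sum()) * np.sum(diff ** 2))

def nearest_neighbors(X, a, q):
    """Indices of the q samples in X closest to sample X[a] (its training sample set)."""
    dists = np.array([partial_distance(X[a], X[b]) if b != a else np.inf
                      for b in range(len(X))])
    return np.argsort(dists)[:q]

X = np.array([[0.2, np.nan, 0.7],
              [0.1, 0.5,    0.6],
              [0.9, 0.4,    np.nan],
              [0.3, 0.6,    0.8]])
print(nearest_neighbors(X, a=0, q=2))   # nearest-neighbor training set for incomplete sample 0
```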
In the step 3), the method specifically comprises the following steps:
3.1) compare the estimation errors of the complete attributes of the incomplete data samples obtained in step 2) and obtain the mean of the absolute values of these estimation errors;
3.2) convert the numerical estimate of each missing data attribute of the incomplete data set into an interval: the midpoint of the interval is the numerical estimate x and the width of the interval is given by the mean absolute estimation error, so the interval representation of the missing data attribute is [x⁻, x⁺];
3.3) the obtained interval estimate is checked and limited to the interval [0, 1]:
if x⁻ < 0, the left endpoint of the estimation interval of the missing data attribute is set to 0, i.e. x⁻ = 0;
if x⁺ > 1, the right endpoint of the estimation interval of the missing data attribute is set to 1, i.e. x⁺ = 1;
3.4) all the complete attributes of the incomplete data set are also represented in interval form, i.e. the left and right endpoints of the interval are both equal to the original value of the complete attribute.
The interval type fuzzy C-means clustering method in the step 4) specifically comprises the following steps:
The s-dimensional interval data set X = {x_1, x_2, …, x_n} contains n data samples, and each attribute value of a data sample x_k is represented as an interval, i.e. x_kj = [x_kj⁻, x_kj⁺] (1 ≤ j ≤ s). The data set X is divided into c classes, whose cluster centers are denoted V = [v_1, v_2, …, v_c] with v_ik = [v_ik⁻, v_ik⁺] (k = 1, 2, …, s). A c×n membership matrix U is used to represent the clustering result of the interval data set, where each element u_ij of the membership matrix satisfies the following conditions:
The objective function of the IFCM algorithm is as follows:
where d²(x_j, v_i) is the squared Euclidean distance between the data sample x_j and the cluster center v_i, and m is the fuzzy index satisfying m ∈ (1, +∞);
d²(x_j, v_i) is calculated as follows:
where the left and right interval boundary vectors of the interval data attribute x_j are x_j⁻ = [x_1j⁻, x_2j⁻, …, x_sj⁻]ᵀ and x_j⁺ = [x_1j⁺, x_2j⁺, …, x_sj⁺]ᵀ, and the left and right interval boundary vectors of the interval cluster center v_i are v_i⁻ = [v_1i⁻, v_2i⁻, …, v_si⁻]ᵀ and v_i⁺ = [v_1i⁺, v_2i⁺, …, v_si⁺]ᵀ.
Using the Lagrange multiplier method, the necessary conditions for the objective function (15) to reach its minimum under the constraint (14) are obtained as follows:
if the interval data sample x_j lies completely within the interval range of a cluster center v_h (1 ≤ h ≤ c), its membership degree is 1; if x_j lies completely outside the interval range of v_h, its membership degree is 0, i.e.
otherwise, the membership degree of each data sample is updated as shown in formula (20).
In the step 4), the specific steps of obtaining the fuzzy clustering result are as follows:
4.1) parameter initialization: set the number of clusters c, the maximum number of iterations G, the fuzzy index m and the iteration termination threshold ε, and initialize the membership matrix U^(0);
4.2) update the cluster center matrix: at the l-th iteration (l = 1, 2, …), based on U^(l−1), update the left endpoint values V^(l)⁻ and right endpoint values V^(l)⁺ of the cluster center matrix V^(l) using formula (17) and formula (18);
4.3) update the membership matrix: based on V^(l), update the membership matrix U^(l) using formula (19) and formula (20);
4.4) algorithm termination decision: when the number of iterations reaches the maximum, or when max|U^(l+1) − U^(l)| ≤ ε, the algorithm terminates; otherwise set l = l + 1 and return to step 4.2). An illustrative sketch of the membership update in formulas (19) and (20) is given below.
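A minimal sketch of the membership update of step 4.3), in the spirit of formulas (19) and (20): samples lying entirely inside a center's interval receive membership 1, and otherwise the usual FCM distance-ratio formula is applied to the interval distances. Since the patent's exact formulas are not reproduced in this text, the details below are assumptions.

```python
import numpy as np

def interval_membership(X_lo, X_hi, V_lo, V_hi, m=2.0):
    """Membership update in the spirit of formulas (19)-(20).

    X_lo, X_hi : (n, s) left/right endpoints of the interval data samples
    V_lo, V_hi : (c, s) left/right endpoints of the interval cluster centers
    Returns a (c, n) membership matrix U whose columns sum to 1.
    """
    n, c = X_lo.shape[0], V_lo.shape[0]
    # assumed interval distance: squared differences of left plus right endpoints
    d2 = np.zeros((c, n))
    for i in range(c):
        d2[i] = (np.sum((X_lo - V_lo[i]) ** 2, axis=1)
                 + np.sum((X_hi - V_hi[i]) ** 2, axis=1))
    d2 = np.maximum(d2, 1e-12)

    U = np.zeros((c, n))
    for j in range(n):
        inside = np.array([(X_lo[j] >= V_lo[i]).all() and (X_hi[j] <= V_hi[i]).all()
                           for i in range(c)], dtype=float)
        if inside.any():
            U[:, j] = inside / inside.sum()      # sample fully inside a center's interval
        else:
            ratio = d2[:, j][:, None] / d2[:, j][None, :]
            U[:, j] = 1.0 / np.sum(ratio ** (1.0 / (m - 1.0)), axis=1)
    return U

U = interval_membership(np.array([[0.1, 0.2]]), np.array([[0.15, 0.25]]),
                        np.array([[0.0, 0.1], [0.5, 0.6]]), np.array([[0.3, 0.4], [0.9, 0.9]]))
print(U)   # the single sample lies inside the first center's interval, so its column is [1, 0]
```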
The beneficial effects of the invention are as follows. Through analysis of the RBF neural network and the Kalman filtering idea, the difference between the predicted value of the RBF neural network and the theoretical expected value of the data is fed back to the input layer, giving an information feedback RBF network model, called the IFRBF network for short. At the same time, a training sample set is selected for each incomplete data sample using the nearest neighbor rule, an IFRBF network is trained for each missing attribute with the nearest neighbor training sample set, a complete data set recovered by IFRBF network estimation is obtained, and fuzzy cluster analysis is performed on it. The estimates of incomplete data obtained by the IFRBF network are numerical; numerical values cannot accurately describe the uncertainty of incomplete data, and certain errors remain. Aiming at this problem, the numerical estimates of the missing attributes are converted into interval form, the complete attributes of the data set are also converted into interval form, and the interval fuzzy C-means clustering method is then used to perform fuzzy cluster analysis on the resulting interval data set. The experimental results show that the clustering results of the complete recovered data sets obtained by estimating incomplete data sets with the IFRBF network are more accurate than those of the comparison methods, and the clustering results obtained with the interval estimation are more accurate and more robust than those of the numerical estimation.
Drawings
FIG. 1: schematic diagram of artificial data set 1.
FIG. 2: schematic diagram of artificial data set 2.
FIG. 3: trend of the objective function with the number of iterations for the IFRBF-FCM algorithm on the Iris data set under different missing rates.
FIG. 4: trend of the objective function with the number of iterations for the IFRBF-FCM algorithm on the Bupa data set under different missing rates.
FIG. 5: trend of the objective function with the number of iterations for the IFRBF-FCM algorithm on the Breast data set under different missing rates.
FIG. 6: trend of the objective function with the number of iterations for the IFRBF-IFCM algorithm on the Iris data set under different missing rates.
FIG. 7: trend of the objective function with the number of iterations for the IFRBF-IFCM algorithm on the Bupa data set under different missing rates.
FIG. 8: trend of the objective function with the number of iterations for the IFRBF-IFCM algorithm on the Breast data set under different missing rates.
FIG. 9: trend of the objective function with the number of iterations for the IFRBF-IFCM algorithm on artificial data set 1 under different missing rates.
FIG. 10: trend of the objective function with the number of iterations for the IFRBF-IFCM algorithm on artificial data set 2 under different missing rates.
Detailed Description
1) Propose the information feedback RBF network model (IFRBF network for short): combined with the Kalman filtering idea, with input vector X = (x_1, x_2, …, x_{n+m}) and output vector Y = (y_1, y_2, …, y_m), calculate the error e between the theoretical expected output value of the incomplete data and the actual output value of the network, and feed the difference between the predicted value of the RBF neural network and the theoretical expected value of the data back to the input layer, obtaining the information feedback RBF network, i.e. the IFRBF model.
The specific method comprises the following steps:
1.1) normalize the input data set: all data are mapped into the interval [0, 1], eliminating the magnitude differences between dimensions;
1.2) initialize the IFRBF network: set the numbers of nodes in each layer (n + m, l and m), initialize the weights w, the center vectors C_i and the widths σ², and determine the maximum number of training iterations M, the error precision ε_1 and the learning rates η_1, η_2, η_3;
1.3) calculate the hidden-layer output values of the network according to formula (1);
1.4) calculate the output-layer output values of the network according to formula (2),
where w_jk, the connection weight between the hidden layer and the output layer, is obtained by a minimum variance algorithm;
1.5) calculate the error e between the output value of the network and its expected value according to formula (3):
e_k = Y_k − O_k, k = 1, 2, …, m (3)
1.6) adjust the parameters of the network, namely the center vectors, the widths and the connection weights, according to formula (4), formula (5) and formula (6), and feed the obtained error back to the input layer,
where η_1, η_2, η_3 denote the learning rates;
1.7) algorithm termination decision: when the number of training iterations reaches the maximum M, or when the error e < ε_1, the algorithm ends; otherwise, return to step 1.3). An illustrative sketch of this training procedure is given below.
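The update formulas (4) to (6) are not reproduced in this text. The sketch below therefore uses standard gradient-style updates for the centers, widths and weights of a Gaussian RBF network, together with the feedback mechanism described above (the previous output error is appended to the data attributes at the input layer); it should be read as an assumption-laden illustration rather than the patent's exact procedure. The default learning rates and iteration limit follow the experimental settings given later (M = 500, ε_1 = 0.01, η = 0.1).

```python
import numpy as np

def train_ifrbf(X, Y, l=6, M=500, eps1=0.01, etas=(0.1, 0.1, 0.1), seed=0):
    """Illustrative IFRBF training loop (assumed updates standing in for formulas (4)-(6)).

    X : (N, n) normalized data attributes;  Y : (N, m) expected outputs.
    The previous output error of each sample is fed back as extra input at the next pass.
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    m = Y.shape[1]
    C = rng.random((l, n + m))                 # center vectors for n + m input nodes
    sigma2 = np.full(l, 0.5)                   # squared widths
    W = 0.1 * rng.standard_normal((l, m))      # hidden-to-output weights w_jk
    eta1, eta2, eta3 = etas

    e_prev = np.zeros((N, m))                  # no feedback before the first pass
    for _ in range(M):
        max_err = 0.0
        for k in range(N):
            x = np.concatenate([X[k], e_prev[k]])                        # error feedback input
            h = np.exp(-np.sum((C - x) ** 2, axis=1) / (2.0 * sigma2))   # formula (1), assumed
            O = h @ W                                                    # formula (2), assumed
            e = Y[k] - O                                                 # formula (3)
            # assumed gradient-style parameter updates (stand-ins for formulas (4)-(6))
            W += eta3 * np.outer(h, e)
            g = (W @ e) * h
            C += eta1 * (g / sigma2)[:, None] * (x - C)
            sigma2 += eta2 * g * np.sum((x - C) ** 2, axis=1) / sigma2 ** 2
            sigma2 = np.maximum(sigma2, 1e-3)  # keep widths positive
            e_prev[k] = e
            max_err = max(max_err, float(np.abs(e).max()))
        if max_err < eps1:                     # termination test of step 1.7)
            break
    return C, sigma2, W

# toy usage: learn to reproduce the first two attributes of a small random data set
Xtoy = np.random.default_rng(1).random((8, 3))
C, s2, W = train_ifrbf(Xtoy, Xtoy[:, :2], l=4, M=50)
```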
2) Aiming at the problem that the basic FCM algorithm cannot directly perform fuzzy cluster analysis on an incomplete data set, an incomplete data fuzzy clustering method with information feedback RBF numerical estimation, called IFRBF-FCM for short, is proposed. A corresponding training sample set is selected for each incomplete data sample using the nearest neighbor rule, and an IFRBF network is trained for each missing attribute with the nearest neighbor training sample set, so that the missing attribute values of the incomplete data samples are estimated; a complete data set recovered by IFRBF network estimation is then obtained and fuzzy cluster analysis is performed.
the method specifically comprises the following steps:
2.1) select training samples: for an s-dimensional incomplete data set X = {x_1, x_2, …, x_n}, the similarity measure between an incomplete data sample x_a and a data sample x_b (with or without missing attributes) is given by formula (11):
where x_ia and x_ib are the i-th attributes of x_a and x_b, respectively, and I_i satisfies the following condition:
the corresponding nearest neighbor samples are selected for the incomplete data through the similarity measures given by formula (11) and formula (12), so that a corresponding training sample set is selected for each incomplete data sample;
2.2) IFRBF network training on incomplete data:
2.2.1) for the incomplete attributes in the training samples, a "0"-substitution method is adopted: the value of each incomplete attribute is replaced with 0 at the corresponding input-layer node, and network training is then carried out;
2.2.2) since there is no feedback value in the first training pass, the corresponding feedback input is also replaced with 0;
2.2.3) the error corresponding to an incomplete attribute is replaced with the mean of the errors of the remaining complete attributes, i.e.:
thereby completing the training of the IFRBF network;
2.3) for each incomplete data sample, the corresponding parameters of the IFRBF network are assigned using the relevant parameters obtained from training;
2.4) calculate the hidden-layer output values of the network according to formula (1);
2.5) calculate the output-layer output values of the network according to formula (2);
2.6) obtain the estimates of the missing attributes of the corresponding incomplete data from the output values obtained in 2.4) and 2.5), thereby filling the incomplete data set into a complete data set on which fuzzy cluster analysis is carried out. An illustrative sketch of this estimation step is given below.
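A sketch of the estimation step, assuming the trained parameters from the sketch above. Missing attributes and the initially absent feedback inputs are replaced by 0, as in steps 2.2.1) and 2.2.2); how the m network outputs map back onto the missing attributes is not spelled out in this text, so the direct read-out below is an assumption.

```python
import numpy as np

def estimate_missing(x_incomplete, C, sigma2, W):
    """Fill the missing attributes of one sample with a trained IFRBF network.

    x_incomplete : (n,) attribute vector with np.nan marking missing values
    C, sigma2, W : trained IFRBF parameters (centers, squared widths, output weights)
    """
    m = W.shape[1]
    x_in = np.where(np.isnan(x_incomplete), 0.0, x_incomplete)     # "0"-substitution (2.2.1)
    x_in = np.concatenate([x_in, np.zeros(m)])                     # no feedback value (2.2.2)
    h = np.exp(-np.sum((C - x_in) ** 2, axis=1) / (2.0 * sigma2))  # formula (1), assumed form
    O = h @ W                                                      # formula (2), assumed form
    x_filled = x_incomplete.copy()
    missing = np.isnan(x_incomplete)
    x_filled[missing] = O[:missing.sum()]   # assumed mapping of network outputs to missing attrs
    return x_filled
```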
3) Interval-type conversion of the incomplete data set: the missing data attributes of incomplete data samples are estimated and filled in by the IFRBF network, the estimation errors of the IFRBF network on the complete attributes of the incomplete data samples are obtained, the interval representation of the missing data attributes is determined from the mean of the absolute values of these estimation errors, and the complete attributes of the data set are also converted to intervals;
the method specifically comprises the following steps:
3.1) compare the estimation errors of the complete attributes of the incomplete data samples obtained in step 2) and obtain the mean of the absolute values of these estimation errors;
3.2) convert the numerical estimate of each missing data attribute of the incomplete data set into an interval: the midpoint of the interval is the numerical estimate x and the width of the interval is given by the mean absolute estimation error, so the interval representation of the missing data attribute is [x⁻, x⁺];
3.3) the obtained interval estimate is checked and limited to the interval [0, 1]:
if x⁻ < 0, the left endpoint of the estimation interval of the missing data attribute is set to 0, i.e. x⁻ = 0;
if x⁺ > 1, the right endpoint of the estimation interval of the missing data attribute is set to 1, i.e. x⁺ = 1;
3.4) all the complete attributes of the incomplete data set are also represented in interval form, i.e. the left and right endpoints of the interval are both equal to the original value of the complete attribute; for example, if the value of a complete attribute is x_ij, its interval representation is [x_ij, x_ij]. An illustrative sketch of this interval conversion is given below.
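A sketch of the interval conversion of step 3), assuming the interval half-width is the mean absolute estimation error from step 3.1); missing attributes become intervals clamped to [0, 1] and complete attributes become degenerate intervals.

```python
import numpy as np

def to_interval_dataset(X_filled, missing_mask, mean_abs_err):
    """Interval conversion of step 3): missing attributes become [x - err, x + err]
    clamped to [0, 1]; complete attributes become degenerate intervals [x, x].

    X_filled     : (N, s) data set recovered by IFRBF estimation, values in [0, 1]
    missing_mask : (N, s) boolean array, True where the attribute was originally missing
    mean_abs_err : scalar (or per-attribute array) mean absolute estimation error from 3.1)
    """
    err = np.broadcast_to(np.asarray(mean_abs_err, dtype=float), X_filled.shape)
    lo = np.where(missing_mask, X_filled - err, X_filled)
    hi = np.where(missing_mask, X_filled + err, X_filled)
    return np.clip(lo, 0.0, 1.0), np.clip(hi, 0.0, 1.0)   # steps 3.3) and 3.4)

X = np.array([[0.10, 0.95], [0.40, 0.50]])
mask = np.array([[False, True], [True, False]])
print(to_interval_dataset(X, mask, 0.08))
```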
4) Cluster analysis is carried out on the converted interval data set using the interval fuzzy C-means clustering method, in which each cluster center is represented by an interval vector, to obtain the fuzzy clustering result;
the interval fuzzy C-means clustering method is specifically as follows:
the s-dimensional interval data set X = {x_1, x_2, …, x_n} contains n data samples, and each attribute value of a data sample x_k is represented as an interval, i.e. x_kj = [x_kj⁻, x_kj⁺] (1 ≤ j ≤ s); the data set X is divided into c classes, whose cluster centers are denoted V = [v_1, v_2, …, v_c] with v_ik = [v_ik⁻, v_ik⁺] (k = 1, 2, …, s); a c×n membership matrix U is used to represent the clustering result of the interval data set, where each element u_ij of the membership matrix satisfies the following conditions:
The objective function of the IFCM algorithm is as follows:
where d²(x_j, v_i) is the squared Euclidean distance between the data sample x_j and the cluster center v_i, and m is the fuzzy index satisfying m ∈ (1, +∞);
d²(x_j, v_i) is calculated as follows:
where the left and right interval boundary vectors of the interval data attribute x_j are x_j⁻ = [x_1j⁻, x_2j⁻, …, x_sj⁻]ᵀ and x_j⁺ = [x_1j⁺, x_2j⁺, …, x_sj⁺]ᵀ, and the left and right interval boundary vectors of the interval cluster center v_i are v_i⁻ = [v_1i⁻, v_2i⁻, …, v_si⁻]ᵀ and v_i⁺ = [v_1i⁺, v_2i⁺, …, v_si⁺]ᵀ.
Using the Lagrange multiplier method, the necessary conditions for the objective function (15) to reach its minimum under the constraint (14) are obtained as follows:
if the interval data sample x_j lies completely within the interval range of a cluster center v_h (1 ≤ h ≤ c), its membership degree is 1; if x_j lies completely outside the interval range of v_h, its membership degree is 0, i.e.
otherwise, the membership degree of each data sample is updated as shown in formula (20). An illustrative sketch of the interval distance calculation is given below.
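A small sketch of the interval distance: a common definition for interval data sums the squared differences of the left endpoints and of the right endpoints. The patent's exact distance formula is not reproduced here, so this form is an assumption.

```python
import numpy as np

def interval_d2(xj_lo, xj_hi, vi_lo, vi_hi):
    """Assumed squared distance between an interval sample x_j and an interval center v_i:
    sum of squared differences of the left endpoints plus those of the right endpoints."""
    return float(np.sum((xj_lo - vi_lo) ** 2) + np.sum((xj_hi - vi_hi) ** 2))

print(interval_d2(np.array([0.10, 0.40]), np.array([0.20, 0.50]),
                  np.array([0.15, 0.35]), np.array([0.25, 0.55])))   # 0.01
```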
The specific steps for obtaining the fuzzy clustering result are as follows:
4.1) parameter initialization: set the number of clusters c, the maximum number of iterations G, the fuzzy index m and the iteration termination threshold ε, and initialize the membership matrix U^(0);
4.2) update the cluster center matrix: at the l-th iteration (l = 1, 2, …), based on U^(l−1), update the left endpoint values V^(l)⁻ and right endpoint values V^(l)⁺ of the cluster center matrix V^(l) using formula (17) and formula (18);
4.3) update the membership matrix: based on V^(l), update the membership matrix U^(l) using formula (19) and formula (20);
4.4) algorithm termination decision: when the number of iterations reaches the maximum, or when max|U^(l+1) − U^(l)| ≤ ε, the algorithm terminates; otherwise set l = l + 1 and return to step 4.2). A sketch of the complete iteration is given after this list.
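A sketch of the complete IFCM iteration of steps 4.1) to 4.4). The center-update formulas (17) and (18) are not reproduced in this text, so the usual FCM-style weighted means of the left and right endpoints are used as an assumption; for brevity the degenerate inside/outside membership case of the earlier sketch is omitted and only the general update is applied.

```python
import numpy as np

def ifcm(X_lo, X_hi, c=3, m=2.0, G=100, eps=1e-3, seed=0):
    """Interval fuzzy C-means loop following steps 4.1)-4.4) (assumed update formulas).

    X_lo, X_hi : (n, s) left/right endpoints of the interval data set
    Returns the membership matrix U and the interval cluster centers (V_lo, V_hi).
    """
    rng = np.random.default_rng(seed)
    n = X_lo.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                   # 4.1) initialize U^(0)

    for _ in range(G):
        Um = U ** m
        V_lo = (Um @ X_lo) / Um.sum(axis=1, keepdims=True)   # 4.2) assumed form of (17)
        V_hi = (Um @ X_hi) / Um.sum(axis=1, keepdims=True)   #      assumed form of (18)

        d2 = np.zeros((c, n))                            # interval squared distances
        for i in range(c):
            d2[i] = (np.sum((X_lo - V_lo[i]) ** 2, axis=1)
                     + np.sum((X_hi - V_hi[i]) ** 2, axis=1))
        d2 = np.maximum(d2, 1e-12)

        U_new = np.zeros_like(U)                         # 4.3) general membership update
        for j in range(n):
            ratio = d2[:, j][:, None] / d2[:, j][None, :]
            U_new[:, j] = 1.0 / np.sum(ratio ** (1.0 / (m - 1.0)), axis=1)

        done = np.max(np.abs(U_new - U)) <= eps          # 4.4) termination test
        U = U_new
        if done:
            break
    return U, V_lo, V_hi

# toy usage with degenerate intervals (complete data) forming two well-separated clusters
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.2, 0.05, (20, 2)), rng.normal(0.8, 0.05, (20, 2))])
U, V_lo, V_hi = ifcm(pts, pts, c=2)
print(np.round(V_lo, 2))
```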
5) Fuzzy cluster analysis is carried out on the interval data set with the IFCM method described in step 4) to obtain the fuzzy clustering result, which is compared with the IFRBF-FCM method and four classical comparison algorithms (WDS-FCM, PDS-FCM, OCS-FCM and NPS-FCM) to verify the effectiveness of the invention:
(1) Initialization of the experiment: three data sets from the UCI database are selected as the experimental data sample sets, namely the Iris, Bupa and Breast data sets. At the same time, two artificial data sets are selected for comparison experiments between the two algorithms proposed by the invention (IFRBF-FCM and IFRBF-IFCM) and the four comparison algorithms (WDS-FCM, PDS-FCM, OCS-FCM and NPS-FCM).
The Iris data set is a data set for multidimensional attribute analysis of iris flowers. It contains 150 data samples divided into three classes: Iris setosa, Iris versicolor and Iris virginica. Each class contains 50 samples, and each sample contains 4 attributes: petal length, sepal length, petal width and sepal width.
The Bupa data set contains sample data from a liver disease study. It contains 345 data samples; the two categories contain 145 and 200 samples, respectively. Each sample contains 7 attributes, but the 7th attribute is a category identifier and does not participate in the experiment. The remaining 6 valid attributes include mean red blood cell volume, glutamyl transpeptidase and daily alcohol consumption.
The Breast data set describes clinical cases of breast cancer. It contains 699 data samples, but 16 of them have missing attributes, so 683 samples are used in the actual data analysis. The data set is divided into two categories, benign and malignant breast tumor samples, containing 444 and 239 samples, respectively. Each sample has 11 attribute columns, two of which do not participate in the fuzzy clustering experiments: the sample number in the first column and the class label in the last column. The remaining 9 valid attributes include marginal adhesion, single epithelial cell size, mitoses, bare nuclei and bland chromatin, among others. Table 1 describes the above UCI data sets.
Table 1 Description of the UCI data sets
Artificial data set 1 contains 200 data samples in 2 classes, each class containing 100 samples. Artificial data set 2 contains 400 data samples in 3 classes, the classes containing 80, 100 and 220 samples, respectively. The data samples (x_i, y_i) in both artificial data sets obey independent two-dimensional normal distributions.
The artificial data set 1 is generated according to the following parameters:
(i) Class 1: u_1 = 4, u_2 = 4, σ_1² = 2, σ_2² = 2.
(ii) Class 2: u_1 = 6, u_2 = 8, σ_1² = 2, σ_2² = 2.
The distribution of artificial data set 1 generated with the above parameters is shown in FIG. 1. The red markers represent the data samples of the first class and the blue markers represent the data samples of the second class.
The artificial dataset 2 was generated as follows:
(i) Class 1: u_1 = 20, u_2 = 20, σ_1² = 2, σ_2² = 4.
(ii) Class 2: u_1 = 25, u_2 = 30, σ_1² = 9, σ_2² = 25.
(iii) Class 3: u_1 = 36, u_2 = 36, σ_1² = 16, σ_2² = 16.
The distribution of artificial data set 2 generated with the above parameters is shown in FIG. 2. The red markers represent the data samples of the first class, the blue markers the data samples of the second class, and the green markers the data samples of the third class. A sketch for generating both artificial data sets is given below.
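A sketch for generating the two artificial data sets from the parameters listed above (independent two-dimensional normal distributions); the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed

def make_class(n, means, variances):
    """n independent two-dimensional normal samples with the given means and variances."""
    return rng.normal(loc=means, scale=np.sqrt(variances), size=(n, 2))

# artificial data set 1: 2 classes of 100 samples each
ds1 = np.vstack([make_class(100, [4, 4], [2, 2]),
                 make_class(100, [6, 8], [2, 2])])

# artificial data set 2: 3 classes of 80, 100 and 220 samples
ds2 = np.vstack([make_class(80,  [20, 20], [2, 4]),
                 make_class(100, [25, 30], [9, 25]),
                 make_class(220, [36, 36], [16, 16])])
print(ds1.shape, ds2.shape)   # (200, 2) (400, 2)
```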
To make the incomplete data in the experiment closer to the randomness of real missing data, the data used in the experiment are obtained by randomly removing values from a complete data set at a preset ratio, thereby generating the incomplete data set. The position of a missing attribute in the incomplete data set is determined by the row x and the column y in which the attribute lies, and the missing value is replaced with "?". The rules for randomly generating missing attributes in a data set are as follows (a sketch implementing these rules is given after the two rules):
(i) for an s-dimensional data set, it must be ensured that there are at most s-1 missing attribute values for any sample data in the data set.
(ii) It must be ensured that at least one complete value exists for any one-dimensional attribute in the data set.
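A sketch that removes attribute values at a given missing rate while enforcing rules (i) and (ii); missing values are marked with np.nan rather than "?" for convenience.

```python
import numpy as np

def make_incomplete(X, missing_rate, seed=0):
    """Randomly mark attribute values as missing (np.nan) at the given rate while
    enforcing rule (i): at most s - 1 missing attributes per sample, and
    rule (ii): at least one complete value per attribute column."""
    rng = np.random.default_rng(seed)
    X_inc = X.astype(float).copy()
    n, s = X.shape
    target = int(round(missing_rate * n * s))
    removed = 0
    while removed < target:
        r, c = rng.integers(n), rng.integers(s)
        if np.isnan(X_inc[r, c]):
            continue
        if np.isnan(X_inc[r]).sum() >= s - 1:            # would violate rule (i)
            continue
        if (~np.isnan(X_inc[:, c])).sum() <= 1:          # would violate rule (ii)
            continue
        X_inc[r, c] = np.nan
        removed += 1
    return X_inc

X = np.random.default_rng(3).random((10, 4))
print(np.isnan(make_incomplete(X, 0.2)).sum())   # 8 of the 40 values are missing
```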
The maximum number of training iterations of the IFRBF network is set to M = 500, the error precision to ε_1 = 0.01 and the learning rates to η_1 = 0.1, η_2 = 0.1, η_3 = 0.1. For different data sets, the number of nodes in each layer of the IFRBF network differs and is determined by the number of attributes in the data set; the number of hidden-layer nodes is determined experimentally. In addition, the maximum number of iterations of the FCM and IFCM algorithms is set to G = 100, the fuzzy index to m = 2 and the iteration termination threshold to ε = 0.001. The missing rate of each data set is set to 0%, 5%, 10%, 15% and 20%. Considering that the experimental result of each algorithm may be affected by chance, 10 simulation experiments are performed for each algorithm, and the averages of the results of these 10 experiments are analyzed and compared.
The proposed IFRBF-FCM algorithm is evaluated from two aspects: the average error score of the clustering and several external validity evaluation indices. The average error score allows an intuitive comparison of clustering results, while the external validity indices evaluate the degree of similarity between the true partition of the experimental data and the corresponding fuzzy partition results. These indices are the Rand Index, Adjusted Rand Index, Jaccard Coefficient, Minkowski Measure and Γ statistics. For the Minkowski Measure, smaller values indicate better performance of the corresponding clustering algorithm; for the other external indices, larger values indicate better performance. A sketch of two of these indices is given below.
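A small sketch of two of these indices: a plain Rand Index implemented directly, and the Adjusted Rand Index taken from scikit-learn (an assumed external dependency); the Jaccard Coefficient, Minkowski Measure and Γ statistics are not shown.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def rand_index(labels_true, labels_pred):
    """Plain Rand Index: the fraction of sample pairs on which the two partitions agree."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    same_true = labels_true[:, None] == labels_true[None, :]
    same_pred = labels_pred[:, None] == labels_pred[None, :]
    iu = np.triu_indices(len(labels_true), k=1)          # each unordered pair once
    return float((same_true == same_pred)[iu].mean())

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(rand_index(y_true, y_pred), adjusted_rand_score(y_true, y_pred))
```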
(2) Analysis of the experimental results.
(i) Comparative analysis of the IFRBF-FCM experimental results:
For the incomplete data fuzzy clustering method with information feedback RBF network estimation (IFRBF-FCM) proposed by the invention, the experimental results are compared with the other four classical algorithms. The experimental results are shown in Tables 2 to 7, in which the best experimental results are marked in bold and the second-best results are underlined.
Table 2 average error score of 10 experiments on incomplete data set Iris
Table 3 average error score of 10 experiments with incomplete data set Bupa
Table 4 average error score of 10 experiments with the incomplete Breast data set
TABLE 5 average effectiveness evaluation index of 10 experiments on incomplete data set Iris
Table 6 average effectiveness evaluation index of 10 experiments with incomplete data set Bupa
Table 7 average effectiveness evaluation index of 10 experiments with the incomplete Breast data set
As can be seen from Tables 2 to 7, under the different missing rates of the respective data sets the IFRBF-FCM algorithm proposed by the invention is, on the whole, relatively better than the other four comparison algorithms.
Regarding the average error score, the experimental results in Tables 2 to 4 show that the IFRBF-FCM algorithm proposed by the invention obtains relatively better results overall than the other four comparison algorithms; only for the Bupa data set at a missing rate of 15% is the result suboptimal.
Regarding the several average external validity indices, the results in Tables 5 to 7 show that, under the different missing rates of the data sets, the IFRBF-FCM algorithm proposed by the invention also obtains relatively better results overall than the other four comparison algorithms.
FIGS. 3 to 5 show the trend of the objective function with the number of iterations for the IFRBF-FCM algorithm on the three UCI data sets under different missing rates during cluster analysis.
Regarding the convergence of the algorithm, FIGS. 3 to 5 show that, under the different missing rates of the respective data sets, the objective function value of the IFRBF-FCM algorithm changes relatively quickly in the initial stage of the algorithm, but after several iterations it reaches a relatively stable state.
(ii) Comparative analysis of the IFRBF-IFCM experimental results:
The experimental results of the incomplete data fuzzy clustering algorithm with IFRBF interval estimation (IFRBF-IFCM) are compared with those of the incomplete data fuzzy clustering algorithm with IFRBF numerical estimation (IFRBF-FCM). The experimental results are shown in Tables 8 to 11, in which the best results are marked in bold.
Table 8 average error score for incomplete UCI data set 10 experiments
Table 9 average error score for incomplete artificial dataset 10 experiments
Table 10 average number of iterations for 10 experiments with incomplete UCI data set
TABLE 11 average number of iterations for 10 experiments with incomplete artificial data set
For the three UCI data sets, Table 8 shows that, viewed globally, the IFRBF-IFCM algorithm is better than the IFRBF-FCM algorithm in terms of the average error score; only at a missing rate of 15% for the Iris data set and a missing rate of 10% for the Bupa data set are the results of the IFRBF-IFCM algorithm not as good as those of the IFRBF-FCM algorithm.
For artificial data set 1, the data samples are distributed relatively uniformly across the classes. As can be seen from Table 9, at a missing rate of 10% the IFRBF-FCM algorithm is relatively better in terms of the average error score, whereas at missing rates of 5%, 15% and 20% the IFRBF-IFCM algorithm is better; the difference between the two is not particularly large. For artificial data set 2, the data samples are distributed unevenly across the classes and the degree of dispersion of the samples differs greatly between classes: the samples of the first class are relatively concentrated, while those of the second and third classes are relatively dispersed. As can be seen from Table 9, the IFRBF-IFCM algorithm obtains better experimental results than the IFRBF-FCM algorithm under the different missing rates, and at missing rates of 15% and 20% its results are much better than those of the IFRBF-FCM algorithm. Therefore, for an unevenly distributed data set, the IFRBF-IFCM algorithm can represent the uncertainty of the incomplete data attributes more accurately and thus improves robustness.
Tables 10 and 11 give the average number of iterations over 10 experiments for the IFRBF-FCM and IFRBF-IFCM algorithms. As can be seen from Tables 10 and 11, the average number of iterations of the two proposed algorithms differs between data sets; globally, in terms of the average number of iterations, the IFRBF-FCM algorithm is the better of the two. Nevertheless, after several iterations the objective function value of the IFRBF-IFCM algorithm also reaches a relatively stable state.
FIGS. 6 to 10 show the trend of the objective function with the number of iterations for the IFRBF-IFCM algorithm on each data set under different missing rates during cluster analysis. As can be seen from FIGS. 6 to 10, under the different missing rates the objective function values for the incomplete data sets reach a relatively stable state after several iterations.

Claims (6)

1. The incomplete data fuzzy clustering method of the information feedback RBF network estimation is characterized by comprising the following steps:
1) propose an information feedback RBF network model: combined with the Kalman filtering idea, with input vector X = (x_1, x_2, …, x_{n+m}) and output vector Y = (y_1, y_2, …, y_m), the error e between the theoretical expected output value of the incomplete data and the actual output value of the network is calculated, and the difference between the predicted value of the RBF neural network and the theoretical expected value of the data is fed back to the input layer to obtain the IFRBF model;
2) select a corresponding training sample set for each incomplete data sample using the nearest neighbor rule, and train an IFRBF network for each missing attribute with the nearest neighbor training sample set, so that the missing attribute values of the incomplete data samples are estimated; a complete data set recovered by IFRBF network estimation is thus obtained and fuzzy cluster analysis is performed;
3) interval-type conversion of the incomplete data set: the missing data attributes of incomplete data samples are estimated and filled in by the IFRBF network, the estimation errors of the IFRBF network on the complete attributes of the incomplete data samples are obtained, the interval representation of the missing data attributes is determined from the mean of the absolute values of these estimation errors, and the complete attributes of the data set are also converted to intervals;
4) cluster analysis is carried out on the converted interval data set using an interval fuzzy C-means clustering method, in which each cluster center is represented by an interval vector, thereby obtaining the fuzzy clustering result.
2. The fuzzy clustering method for incomplete data of information feedback RBF network estimation as claimed in claim 1, wherein: in the step 1), the specific method comprises the following steps:
1.1) normalize the input data set: all data are mapped into the interval [0, 1], eliminating the magnitude differences between dimensions;
1.2) initialize the IFRBF network: set the numbers of nodes in each layer (n + m, l and m), initialize the weights w, the center vectors C_i and the widths σ², and determine the maximum number of training iterations M, the error precision ε_1 and the learning rates η_1, η_2, η_3;
1.3) calculate the hidden-layer output values of the network according to formula (1);
1.4) calculate the output-layer output values of the network according to formula (2),
where w_jk, the connection weight between the hidden layer and the output layer, is obtained by a minimum variance algorithm;
1.5) calculate the error e between the output value Y_k of the network and its expected value O_k according to formula (3):
e_k = Y_k − O_k, k = 1, 2, …, m (3)
1.6) adjust the parameters of the network, namely the center vectors, the widths and the connection weights, according to formula (4), formula (5) and formula (6), and feed the obtained error back to the input layer,
where η_1, η_2, η_3 denote the learning rates;
1.7) algorithm termination decision: when the number of training iterations reaches the maximum M, or when the error e < ε_1, the algorithm ends; otherwise, return to step 1.3).
3. The fuzzy clustering method for incomplete data of information feedback RBF network estimation as claimed in claim 1, wherein: in the step 2), the method specifically comprises the following steps:
2.1) select training samples: for an s-dimensional incomplete data set X = {x_1, x_2, …, x_n}, the similarity measure between an incomplete data sample x_a and a data sample x_b is given by formula (11):
where x_ia and x_ib are the i-th attributes of x_a and x_b, respectively, and I_i satisfies the following condition:
the corresponding nearest neighbor samples are selected for the incomplete data through the similarity measures given by formula (11) and formula (12), so that a corresponding training sample set is selected for each incomplete data sample;
2.2) IFRBF network training on incomplete data:
2.2.1) for the incomplete attributes in the training samples, a "0"-substitution method is adopted: the value of each incomplete attribute is replaced with 0 at the corresponding input-layer node, and network training is then carried out;
2.2.2) since there is no feedback value in the first training pass, the corresponding feedback input is also replaced with 0;
2.2.3) the error corresponding to an incomplete attribute is replaced with the mean of the errors of the remaining complete attributes, i.e.:
thereby completing the training of the IFRBF network;
2.3) for each incomplete data sample, the corresponding parameters of the IFRBF network are assigned using the relevant parameters obtained from training;
2.4) calculate the hidden-layer output values of the network according to formula (1);
2.5) calculate the output-layer output values of the network according to formula (2);
2.6) obtain the estimates of the missing attributes of the corresponding incomplete data from the output values obtained in 2.4) and 2.5), thereby filling the incomplete data set into a complete data set on which fuzzy cluster analysis is carried out.
4. The fuzzy clustering method for incomplete data of information feedback RBF network estimation as claimed in claim 1, wherein: in the step 3), the method specifically comprises the following steps:
3.1) compare the estimation errors of the complete attributes of the incomplete data samples obtained in step 2) and obtain the mean of the absolute values of these estimation errors;
3.2) convert the numerical estimate of each missing data attribute of the incomplete data set into an interval: the midpoint of the interval is the numerical estimate x and the width of the interval is given by the mean absolute estimation error, so the interval representation of the missing data attribute is [x⁻, x⁺];
3.3) the obtained interval estimate is checked and limited to the interval [0, 1]:
if x⁻ < 0, the left endpoint of the estimation interval of the missing data attribute is set to 0, i.e. x⁻ = 0;
if x⁺ > 1, the right endpoint of the estimation interval of the missing data attribute is set to 1, i.e. x⁺ = 1;
3.4) all the complete attributes of the incomplete data set are also represented in interval form, i.e. the left and right endpoints of the interval are both equal to the original value of the complete attribute.
5. The fuzzy clustering method for incomplete data of information feedback RBF network estimation as claimed in claim 1, wherein: the interval type fuzzy C-means clustering method in the step 4) specifically comprises the following steps:
The s-dimensional interval data set X = {x_1, x_2, …, x_n} contains n data samples, and each attribute value of a data sample x_k is represented as an interval, i.e. x_kj = [x_kj⁻, x_kj⁺] (1 ≤ j ≤ s). The data set X is divided into c classes, whose cluster centers are denoted V = [v_1, v_2, …, v_c] with v_ik = [v_ik⁻, v_ik⁺] (k = 1, 2, …, s). A c×n membership matrix U is used to represent the clustering result of the interval data set, where each element u_ij of the membership matrix satisfies the following conditions:
The objective function of the IFCM algorithm is as follows:
where d²(x_j, v_i) is the squared Euclidean distance between the data sample x_j and the cluster center v_i, and m is the fuzzy index satisfying m ∈ (1, +∞);
d²(x_j, v_i) is calculated as follows:
where the left and right interval boundary vectors of the interval data attribute x_j are x_j⁻ = [x_1j⁻, x_2j⁻, …, x_sj⁻]ᵀ and x_j⁺ = [x_1j⁺, x_2j⁺, …, x_sj⁺]ᵀ, and the left and right interval boundary vectors of the interval cluster center v_i are v_i⁻ = [v_1i⁻, v_2i⁻, …, v_si⁻]ᵀ and v_i⁺ = [v_1i⁺, v_2i⁺, …, v_si⁺]ᵀ.
Using the Lagrange multiplier method, the necessary conditions for the objective function (15) to reach its minimum under the constraint (14) are obtained as follows:
if the interval data sample x_j lies completely within the interval range of a cluster center v_h (1 ≤ h ≤ c), its membership degree is 1; if x_j lies completely outside the interval range of v_h, its membership degree is 0, i.e.
otherwise, the membership degree of each data sample is updated as shown in formula (20).
6. The fuzzy clustering method for incomplete data of information feedback RBF network estimation as claimed in claim 1, wherein: in the step 4), the specific steps of obtaining the fuzzy clustering result are as follows:
4.1) parameter initialization: set the number of clusters c, the maximum number of iterations G, the fuzzy index m and the iteration termination threshold ε, and initialize the membership matrix U^(0);
4.2) update the cluster center matrix: at the l-th iteration (l = 1, 2, …), based on U^(l−1), update the left endpoint values V^(l)⁻ and right endpoint values V^(l)⁺ of the cluster center matrix V^(l) using formula (17) and formula (18);
4.3) update the membership matrix: based on V^(l), update the membership matrix U^(l) using formula (19) and formula (20);
4.4) algorithm termination decision: when the number of iterations reaches the maximum, or when max|U^(l+1) − U^(l)| ≤ ε, the algorithm terminates; otherwise set l = l + 1 and return to step 4.2).
CN201810785729.2A 2018-07-17 2018-07-17 The deficiency of data fuzzy clustering method of information feedback RBF network valuation Pending CN109034231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810785729.2A CN109034231A (en) 2018-07-17 2018-07-17 The deficiency of data fuzzy clustering method of information feedback RBF network valuation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810785729.2A CN109034231A (en) 2018-07-17 2018-07-17 The deficiency of data fuzzy clustering method of information feedback RBF network valuation

Publications (1)

Publication Number Publication Date
CN109034231A true CN109034231A (en) 2018-12-18

Family

ID=64643578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810785729.2A Pending CN109034231A (en) 2018-07-17 2018-07-17 The deficiency of data fuzzy clustering method of information feedback RBF network valuation

Country Status (1)

Country Link
CN (1) CN109034231A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816017A (en) * 2019-01-24 2019-05-28 电子科技大学 Power grid missing data complementing method based on fuzzy clustering and Lagrange's interpolation
CN109948715A (en) * 2019-03-22 2019-06-28 杭州电子科技大学 A kind of water monitoring data missing values complementing method
CN109948715B (en) * 2019-03-22 2021-07-02 杭州电子科技大学 Water quality monitoring data missing value filling method
CN110087207B (en) * 2019-05-05 2020-04-10 江南大学 Method for reconstructing missing data of wireless sensor network
CN110087207A (en) * 2019-05-05 2019-08-02 江南大学 Wireless sensor network missing data method for reconstructing
CN110298434B (en) * 2019-05-27 2022-12-09 湖州师范学院 Integrated deep belief network based on fuzzy partition and fuzzy weighting
CN110298434A (en) * 2019-05-27 2019-10-01 湖州师范学院 A kind of integrated deepness belief network based on fuzzy division and FUZZY WEIGHTED
CN110298382A (en) * 2019-05-27 2019-10-01 湖州师范学院 A kind of integrated TSK Fuzzy Classifier based on IFCM, KNN and data dictionary
CN110298382B (en) * 2019-05-27 2022-12-09 湖州师范学院 Integrated TSK fuzzy classifier based on IFCM, KNN and data dictionary
CN110457770A (en) * 2019-07-18 2019-11-15 中国电力科学研究院有限公司 A kind of distribution transformer heavy-overload judgment method towards time scale
CN110457770B (en) * 2019-07-18 2022-07-01 中国电力科学研究院有限公司 Time scale-oriented overload judgment method for distribution transformer
CN111191687A (en) * 2019-12-14 2020-05-22 贵州电网有限责任公司 Power communication data clustering method based on improved K-means algorithm
CN111191687B (en) * 2019-12-14 2023-02-10 贵州电网有限责任公司 Power communication data clustering method based on improved K-means algorithm
CN111881502A (en) * 2020-07-27 2020-11-03 中铁二院工程集团有限责任公司 Bridge state discrimination method based on fuzzy clustering analysis
CN112183114A (en) * 2020-08-10 2021-01-05 招联消费金融有限公司 Model training and semantic integrity recognition method and device
CN112183114B (en) * 2020-08-10 2024-05-14 招联消费金融股份有限公司 Model training and semantic integrity recognition method and device

Similar Documents

Publication Publication Date Title
CN109034231A (en) The deficiency of data fuzzy clustering method of information feedback RBF network valuation
CN108763590B (en) Data clustering method based on double-variant weighted kernel FCM algorithm
CN105809672B (en) A kind of image multiple target collaboration dividing method constrained based on super-pixel and structuring
CN107203785A (en) Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm
Zhu et al. A novel clustering validity function of FCM clustering algorithm
CN107301328B (en) Cancer subtype accurate discovery and evolution analysis method based on data flow clustering
Ramathilagam et al. Extended Gaussian kernel version of fuzzy c-means in the problem of data analyzing
de Arruda et al. A complex networks approach for data clustering
CN107301430A (en) Broad sense Multivariable Fuzzy c means clustering algorithms
CN108921853B (en) Image segmentation method based on super-pixel and immune sparse spectral clustering
CN111222847A (en) Open-source community developer recommendation method based on deep learning and unsupervised clustering
Bandyopadhyay Multiobjective simulated annealing for fuzzy clustering with stability and validity
CN104615722B (en) Blended data clustering method with quickly dividing is searched for based on density
CN109086831A (en) Hybrid Clustering Algorithm based on Fuzzy C-Means Algorithm and artificial bee colony clustering algorithm
CN111738346A (en) Incomplete data clustering method for generating type confrontation network estimation
CN115098699A (en) Link prediction method based on knowledge graph embedded model
CN114463848A (en) Progressive learning gait recognition method based on memory enhancement
CN102110173A (en) Improved multi-path spectral clustering method for affinity matrix
CN109409394A (en) A kind of cop-kmeans method and system based on semi-supervised clustering
Suresh et al. Data clustering using multi-objective differential evolution algorithms
CN113656707A (en) Financing product recommendation method, system, storage medium and equipment
CN108510080A (en) A kind of multi-angle metric learning method based on DWH model many-many relationship type data
CN111353525A (en) Modeling and missing value filling method for unbalanced incomplete data set
CN109671468B (en) Characteristic gene selection and cancer classification method
CN115273645B (en) Map making method for automatically clustering indoor surface elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20181218)