CN107729943B

CN107729943B - Missing data fuzzy clustering algorithm for optimizing estimated value of information feedback extreme learning machine and application thereof

Info

Publication number: CN107729943B
Application number: CN201710992778.9A
Authority: CN
Inventors: 张利; 刘洋; 高欣; 潘辉; 王军; 赵中洲
Original assignee: Liaoning University
Current assignee: Zhongchangxing Shandong Information Technology Co ltd
Priority date: 2017-10-23
Filing date: 2017-10-23
Publication date: 2021-11-30
Anticipated expiration: 2037-10-23
Also published as: CN107729943A

Abstract

The invention relates to a missing data fuzzy clustering algorithm based on optimized estimation by an information feedback extreme learning machine (FELM), and to its application. The basic steps are: 1) using mutual information to select the data attributes with higher correlation and, according to these attributes, selecting the complete data within the incomplete data set as training samples for the FELM network; 2) initializing the input weights ω and bias values b of the FELM network; 3) pre-filling the missing attributes according to a nearest neighbor rule, and adjusting the pre-filled values according to the error obtained by training the FELM network on the training samples until a reasonable value is found for filling, thereby obtaining a recovered complete data set; 4) initializing the parameters of the fuzzy C-means algorithm: the number of clusters c, the fuzzy coefficient m, the threshold ε and the membership partition matrix $U^{(0)}$; 5) obtaining the final clustering result by iteratively optimizing the membership partition matrix U and the cluster centers V of the fuzzy C-means algorithm. By making full use of the correlation between data samples and attributes, and of the distribution information of both complete and incomplete data samples, the method obtains more reasonable attribute estimates, so that the clustering result for the incomplete data set is more accurate.

Description

Missing data fuzzy clustering algorithm for optimizing estimated value of information feedback extreme learning machine and application thereof

Technical Field

The invention relates to a missing data fuzzy clustering algorithm for optimizing estimation of an information feedback extreme learning machine and application thereof, belonging to an industrial informatization technology.

Background

Steel is an indispensable material for China's construction and for achieving the four modernizations, and the steel industry is a foundation of national development. Since its establishment, the industry has grown steadily and rapidly for more than sixty years, and an industrial strip steel technology system has been established. China is currently at an important stage of industrial development, and the demand for steel remains huge, so the steel industry faces a very large market. How to reform existing strip steel production lines creatively, so that they produce high-quality, high-benefit, high-grade steel with low carbon emissions, is therefore of practical significance. At the present stage, informatization is a strategic measure for modernization as a whole. The steel industry needs to combine information technology with further innovation and modification, integrate the advanced technology of the information industry into the steel rolling process, and achieve the coordinated development of industry and informatization. It is therefore very important to perform cluster analysis on strip steel data and to use the analysis results to strengthen innovation in industrial production.

In recent years, cluster analysis has been adapted to many different types of data and has been widely applied and developed in many research fields. It is worthwhile to determine the relationships between strip steel data samples mathematically, according to the properties of the data and some measure of similarity or difference, to cluster the samples on that basis, and to use the analysis results to adjust the production line. In real production, however, many factors interfere, such as failure of the data acquisition equipment, of the storage medium or of the transmission medium, human omissions, or the limitations of the detection instruments. As a result, the collected data set is incomplete, and traditional clustering methods cannot be applied to it directly. It is therefore very important to choose a suitable way of handling the incomplete data before analyzing the final result and making future industrial plans.

Disclosure of Invention

In order to solve the above problems, the invention provides a missing data fuzzy clustering algorithm based on optimized estimation by an information feedback extreme learning machine, applies the algorithm to the analysis of strip steel data, and uses the analysis results to strengthen the reform of industrial production.

The invention is realized by the following technical scheme: the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine is characterized by comprising the following steps:

1) calculating and selecting data attributes with higher correlation by adopting mutual information, and selecting complete data in incomplete data as a training sample of the FELM network according to the attributes;

$$I(X;Y)=\iint \mu_{XY}(x,y)\,\log\frac{\mu_{XY}(x,y)}{\mu_X(x)\,\mu_Y(y)}\,dx\,dy \qquad (1)$$

where $\mu_X(x)$ denotes the marginal probability density function of the variable $X$; $\mu_Y(y)$ denotes the marginal probability density function of the variable $Y$; and $\mu_{XY}(x, y)$ denotes the joint probability density function of the two variables;

2) determining the parameters of the FELM network: initializing the input weights ω and bias values b, with the initial values of ω and b drawn at random from the interval [-1, 1], and determining the number of hidden-layer nodes of the extreme learning machine;

3) pre-filling the missing attribute according to a nearest neighbor rule, and adjusting the pre-filling value by adopting an error retrieval method according to an error obtained by training the FELM network by a training sample until a reasonable numerical value is found and filled, thereby obtaining a recovered complete data set;

4) initializing parameters of a fuzzy C-means algorithm, clustering number C, fuzzy coefficient m, threshold epsilon and membership degree partition matrix U⁽⁰⁾；

5) clustering the recovered complete data set with the fuzzy C-means algorithm: when the iteration number t equals l, computing the cluster center matrix $V^{(l)}$ from the membership partition matrix $U^{(l-1)}$ according to formula (2), and updating $U^{(l)}$ from $V^{(l)}$ according to formula (3); for a given threshold $\varepsilon$, if

$$\lVert U^{(l)} - U^{(l-1)} \rVert < \varepsilon$$

the algorithm terminates; otherwise, setting $l = l + 1$ and continuing to iteratively update the membership partition matrix and the cluster centers.
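Step 1) above selects attributes by mutual information. The following is a minimal sketch of that selection, not the patented implementation: it assumes a histogram (plug-in) estimate of formula (1), and the names `mutual_information`, `select_attributes`, `bins` and `top` are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of I(X;Y) in nats for two 1-D arrays.
    The binning scheme is an assumption; the source only names mutual information."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of Y, shape (1, bins)
    mask = p_xy > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

def select_attributes(data, target_col, top=2, bins=8):
    """Rank the other columns by mutual information with the target column
    and return the indices of the `top` most informative ones."""
    scores = {j: mutual_information(data[:, j], data[:, target_col], bins)
              for j in range(data.shape[1]) if j != target_col}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Synthetic illustration: column 1 is a noisy copy of column 0, column 2 is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
data = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])
best = select_attributes(data, target_col=0, top=1)
```

In this sketch only rows without missing values would be passed in, which matches the requirement that the training samples be complete data.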

In step 3), the process of pre-filling the missing attributes according to the nearest neighbor rule, and of adjusting the pre-filled values by the error retrieval method according to the error obtained by training the FELM network on the training samples until a reasonable value is found and filled in, thereby obtaining the recovered complete data set, is as follows:

1) pre-filling the missing attributes according to the nearest neighbor rule: selecting the k data samples nearest to the incomplete data sample, calculating the mean of those k samples at the positions corresponding to the missing attributes, and taking that mean as the pre-filled value of the incomplete data;

where $x_{ia}$ and $x_{ib}$ are the $i$-th attribute values of the samples $x_a$ and $x_b$ respectively, and the condition satisfied by $I_i$ is given by formula (5):

2) calculating the output matrix H of the hidden layer of the FELM network using formulas (6) to (8);

where the activation term gives the output of the $i$-th hidden-layer node; $\omega_i \cdot x_j$ is the inner product of $\omega_i$ and $x_j$; $\omega_i$ is the input weight linking the input layer to the hidden layer; $\beta_i$ is the output weight linking the hidden layer to the output layer; and $b_i$ is the bias value of the $i$-th hidden-layer node.

$$H\beta = T \qquad (7)$$

where $H$ is the output matrix of the hidden-layer nodes, $\beta$ is the output weight, and $T$ is the desired output.

3) calculating the output weight of the FELM network from the obtained output matrix $H$ and the expected output values according to formula (9);

$$\hat{\beta} = H^{\dagger} T \qquad (9)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$, and the norm of the solution $\hat{\beta}$ is the smallest and the solution is unique.

4) obtaining the error between the predicted output value and the true output value and feeding the error back: letting the predicted value output by the extreme learning machine be $Y$ and the actual value be $y$, the error $e_0$ is

$$e_0 = Y - y \qquad (10)$$

5) comparing the obtained error with the error obtained on the training samples: if the iteration stop requirement is met, filling in the missing attribute; otherwise, accepting the error, readjusting the pre-filled value, and returning to step 1).
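The nearest-neighbor pre-filling at the start of this procedure can be sketched as follows. The distance formula is not reproduced in this text, so the sketch assumes a partial distance: the mean squared difference over the attributes observed in both samples. The function names and the use of NaN to mark missing values are likewise assumptions.

```python
import numpy as np

def partial_distance(a, b):
    """Mean squared difference over attributes observed in both samples.
    This form is an assumption standing in for the omitted distance formula."""
    both = ~np.isnan(a) & ~np.isnan(b)      # indicator: attribute present in both
    if not both.any():
        return np.inf                       # no shared attributes, treat as far away
    return float(np.sum((a[both] - b[both]) ** 2) / both.sum())

def knn_prefill(data, k=3):
    """Pre-fill each NaN with the mean of that attribute over the k nearest
    samples (by partial distance) in which the attribute is observed."""
    filled = data.copy()
    for r, row in enumerate(data):
        for c in np.flatnonzero(np.isnan(row)):
            candidates = [i for i in range(len(data))
                          if i != r and not np.isnan(data[i, c])]
            candidates.sort(key=lambda i: partial_distance(row, data[i]))
            nearest = candidates[:k]
            if nearest:                     # leave NaN if no donor sample exists
                filled[r, c] = np.mean(data[nearest, c])
    return filled

samples = np.array([[1.0, 2.0],
                    [1.1, np.nan],
                    [5.0, 9.0],
                    [1.2, 2.2]])
prefilled = knn_prefill(samples, k=2)
```

In this example the second sample is closest to the first and fourth samples, so its missing attribute is pre-filled with the mean of 2.0 and 2.2.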

The error retrieval method comprises the following specific processes:

Assume that the initial estimate of the missing attribute obtained by the k-nearest-neighbor rule is $E_k$, and that the mean error of the training samples obtained with the FELM network is

If the output value predicted by FELM learning for the data containing the missing attribute is $Y$ and the true value of the data is $y$, the error $e_0 = Y - y$ is obtained; then calculate

The fill value of the missing attribute is adjusted as follows:

1) if $e < 0$, the fill value of the missing attribute is readjusted to $E_{new} = E_k + \rho e$, that is, the value is increased with a certain probability, and it is then used as the input for FELM learning, where $\rho \in [0, 1]$ is chosen by a random function;

2) if

then the fill value of the missing attribute is readjusted to $E_{new} = E_k - \rho e$, and it is then used as the input for FELM learning;

3) if

then the value predicted by the FELM network is close to the true value and is acceptable, so this value is used to fill the missing attribute of the incomplete data set.

The application of the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine in strip steel data clustering statistics comprises the following process:

1) collecting experimental data: taking the strip steel data collected during a certain period as the data samples;

2) the following attributes are extracted from the collected data sample: the rolling force of a rolling frame, the size of a roll gap between rolling rolls, the roll gap difference between the rolling rolls, the inlet temperature, the outlet temperature, the rolling current, the rolling speed and the SONY value;

3) taking the attribute value acquired in the step 2) as a training data set;

4) normalizing the data set: because the data attributes differ in order of magnitude, all values in the data set are converted to corresponding values in the interval [0, 1] to eliminate the differences between data;

5) selecting and optimizing the training samples: using mutual information to select the data attributes with higher correlation and, according to these attributes, selecting the complete data within the incomplete data set as training samples for the FELM network;

6) determining the parameters of the FELM network: initializing the input weights ω and bias values b, with the initial values of ω and b drawn at random from the interval [-1, 1], and determining the number of hidden-layer nodes of the extreme learning machine;

7) estimating the missing attributes: pre-filling the missing attributes according to a nearest neighbor rule, and adjusting the pre-filled values by the error retrieval method according to the error obtained by training on the training samples until a reasonable value is found for filling;

8) the FCM algorithm is used to perform cluster analysis on the recovered complete data set.
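Step 4) of the process above maps every attribute into [0, 1]. The source does not state the mapping, so the sketch below assumes per-attribute min-max scaling that ignores missing entries; `min_max_normalize` is a hypothetical name, and the numbers are made up for illustration.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column to [0, 1] using its observed minimum and maximum.
    NaN entries (missing attributes) are ignored and stay NaN.
    Min-max scaling is an assumption; the source only requires values in [0, 1]."""
    lo = np.nanmin(data, axis=0)
    hi = np.nanmax(data, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant columns
    return (data - lo) / span

# Two attributes with very different magnitudes (illustrative values only).
raw = np.array([[100.0, 0.5],
                [300.0, np.nan],
                [200.0, 1.5]])
scaled = min_max_normalize(raw)
```

Scaling before the nearest-neighbor search matters because an attribute with large raw values would otherwise dominate every distance.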

The beneficial effects of the invention are as follows. Traditional solutions either consider only the associations between data samples or rely only on the associations between attributes. The method combines the internal and external relations (namely, the relations between data samples and between attributes), uses the FELM network to obtain an optimized estimate of the missing values, and then performs fuzzy clustering analysis on the completed data set. Mutual information is used to calculate the correlation between sample attributes, which provides a theoretical basis for selecting the training samples. A nearest neighbor rule based on local distance is used to select several nearest neighbors of each incomplete data sample and to prepare a pre-filled value for each missing value; these pre-filled values are then used iteratively by the FELM network. Several errors (the differences between actual output and expected output) are obtained from the training sample set, and their mean is calculated. With this mean error as the adjustment criterion, the error retrieval method repeatedly increases or decreases the estimate by the difference. These steps are repeated until the optimal estimate of each missing value is obtained, so that the incomplete data set is completed reasonably and efficiently.

Drawings

Fig. 1 is a topological structure diagram of a feedback type extreme learning machine.

Fig. 2 is a flow chart of the algorithm of the present invention.

Fig. 3 is a signal acquisition diagram of strip rolling data.

Fig. 4 is a graph of the change between the number of iterations of the strip rolling data set and the objective function.

Detailed Description

The invention is based on the following theory:

1. Information feedback extreme learning machine

The extreme learning machine (ELM) is a new type of learning algorithm for single-hidden-layer feedforward neural networks (SLFNs), proposed by Guang-Bin Huang in 2004. In the extreme learning machine, the input weights connecting the input layer to the hidden layer and the bias values of the hidden layer are chosen at random, and the output weights connecting the hidden layer to the output layer are determined analytically by the generalized inverse method. The ELM abandons the gradient descent algorithm and instead uses the least squares idea to solve for the optimal network, with considerable success. However, the conventional extreme learning machine cannot feed the predicted output value back to the network structure, and relies only on the input information during learning. The conventional extreme learning machine is therefore improved using the idea of Kalman filtering to obtain the feedback extreme learning machine, which better estimates, predicts and fills the missing attributes in an incomplete data set.

The core idea of the feedback extreme learning machine is to use the error between the predicted output and the actual output to adjust the filling of the missing attributes, so that the filled values are more reasonable and the clustering is more effective. Fig. 1 shows the feedback extreme learning machine model.

As shown in Fig. 1, the FELM network consists of an input layer, a hidden layer and an output layer. Each circle represents a node. The data are processed and computed by the nodes of the hidden and output layers; the specific number of hidden-layer nodes is determined experimentally.
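The basic ELM training described above (random input weights and biases in [-1, 1], output weights from the generalized inverse) can be sketched as follows. This covers only the plain ELM part, not the feedback adjustment; the sigmoid activation, the number of hidden nodes, the synthetic target and all function names are assumptions.

```python
import numpy as np

def hidden_output(X, W, b):
    """Hidden-layer output matrix H; a sigmoid activation is assumed."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def train_elm(X, T, n_hidden=20, seed=0):
    """Train a single-hidden-layer ELM: random W and b in [-1, 1],
    then beta = pinv(H) @ T (the minimum-norm least-squares solution)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = hidden_output(X, W, b)
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def predict_elm(X, W, b, beta):
    return hidden_output(X, W, b) @ beta

# Synthetic regression task (illustrative only): 3 inputs in [0, 1], 1 output.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 3))
T = X[:, 0] + 2.0 * X[:, 1] * X[:, 2]
W, b, beta = train_elm(X, T)
errors = predict_elm(X, W, b, beta) - T        # e0 = Y - y for every training sample
mean_abs_error = float(np.mean(np.abs(errors)))
```

The per-sample errors computed at the end are the kind of quantity the feedback step would average and compare against when adjusting a pre-filled value.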

2. Fuzzy C-means (FCM) clustering algorithm

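The fuzzy C-means clustering algorithm (Bezdek, 1981) partitions the feature points of the feature space $X = \{x_1, x_2, \dots, x_n\}$ into $c$ classes ($1 < c \le n$) with cluster centers $V = \{v_1, v_2, \dots, v_c\}$, where the cluster center of the $i$-th class is $v_i \in R^s$. For any data point $x_k \in R^s$, the membership $u_{ik}$ denotes the degree to which $x_k$ belongs to class $i$, and $u_{ik}$ satisfies the following conditions:

$$u_{ik} \in [0, 1], \quad i = 1, 2, \dots, c;\ k = 1, 2, \dots, n \qquad (11)$$

$$\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \dots, n \qquad (12)$$

$$0 < \sum_{k=1}^{n} u_{ik} < n, \quad i = 1, 2, \dots, c \qquad (13)$$

The objective function is defined as follows:

$$J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} \lVert x_k - v_i \rVert_2^2 \qquad (14)$$

where $x_k = [x_{1k}, x_{2k}, \dots, x_{sk}]^T$ is the $k$-th data sample and $x_{jk}$ is the $j$-th attribute value of $x_k$; $v_i$ is the $i$-th cluster center; $m$ ($m > 1$) is the weighting exponent, which influences the degree of fuzziness of the membership matrix; and $\lVert \cdot \rVert_2$ denotes the Euclidean distance.

The update formulas for the cluster centers and the memberships are:

$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad i = 1, 2, \dots, c \qquad (15)$$

$$u_{ik} = \left[ \sum_{t=1}^{c} \left( \frac{\lVert x_k - v_i \rVert_2}{\lVert x_k - v_t \rVert_2} \right)^{\frac{2}{m-1}} \right]^{-1} \qquad (16)$$

Under the constraint of formula (12), alternately iterating U and V minimizes formula (14).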
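The alternating optimization of U and V can be sketched as below. This is a minimal illustration of standard fuzzy C-means, not the patented implementation: the random initialization of U, the stopping test (largest absolute change in any membership below ε) and the function name `fuzzy_c_means` are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Standard FCM on complete data X of shape (n, s).
    Returns the membership matrix U (c, n) and the cluster centers V (c, s)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                          # each sample's memberships sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # center update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                   # guard against zero distance
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)           # membership update
        converged = np.max(np.abs(U_new - U)) < eps   # assumed stopping test
        U = U_new
        if converged:
            break
    return U, V

# Two well-separated synthetic groups of 30 points each.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
                    rng.normal(3.0, 0.1, size=(30, 2))])
U, V = fuzzy_c_means(points, c=2)
labels = U.argmax(axis=0)
```

Taking the arg-max of each column of U converts the fuzzy partition into hard labels when a single cluster per sample is needed.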

Secondly, the implementation process of the invention:

where $\mu_X(x)$ denotes the marginal probability density function of the variable $X$; $\mu_Y(y)$ denotes the marginal probability density function of the variable $Y$; and $\mu_{XY}(x, y)$ denotes the joint probability density function of the two variables.

2) Determining the parameters of the FELM network: initializing the input weights ω and bias values b, with the initial values of ω and b drawn at random from the interval [-1, 1], and determining the number of hidden-layer nodes of the extreme learning machine;

3) pre-filling the missing attribute according to a nearest neighbor rule, and adjusting a pre-filling value according to an error obtained by training the FELM network by the training sample until a reasonable numerical value is found for filling, so as to obtain a recovered complete data set;

5) Clustering the recovered complete data set with the fuzzy C-means algorithm: when the iteration number t equals l, computing $V^{(l)}$ from $U^{(l-1)}$ according to formula (2), and updating $U^{(l)}$ from $V^{(l)}$ according to formula (3); if

$$\lVert U^{(l)} - U^{(l-1)} \rVert < \varepsilon$$

the algorithm terminates; otherwise, setting $l = l + 1$ and continuing to iteratively update the membership partition matrix and the cluster centers.

Error retrieval algorithm: assume that the initial estimate of the missing attribute obtained by the k-nearest-neighbor rule is $E_k$, and that the mean error of the training samples obtained with the ELM is

If the output value predicted by ELM learning for the data containing the missing attribute is $Y$ and the true value of the data is $y$, the error $e_0 = Y - y$ is obtained; then calculate

The fill value of the missing attribute is adjusted as follows:

(1) if $e < 0$, the fill value of the missing attribute is readjusted to $E_{new} = E_k + \rho e$, that is, the value is increased with a certain probability, and it is then used as the input for ELM learning, where $\rho \in [0, 1]$ is chosen by a random function;

(2) if

then the fill value of the missing attribute is readjusted to $E_{new} = E_k - \rho e$, and it is then used as the input for ELM learning;

(3) if

then the value predicted by the ELM is close to the true value and is acceptable, so this value is used to fill the missing attribute of the incomplete data set.

Thirdly, the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine is used to analyze strip steel data, and the analysis results are used to strengthen the reform of industrial production. The specific steps are as follows:

1. Collecting experimental data: the strip steel data were collected at a steel mill in China during a certain period, and the data set contains 983 data samples. The following attributes are extracted from the collected data samples: the rolling force of the rolling stand, the size of the roll gap between the rolls, the roll gap difference between the rolls, the inlet temperature, the outlet temperature, the rolling current, the rolling speed and the SONY value. These attributes are related, to varying degrees, to the predicted strip exit thickness, and their values are used as the inputs of the FELM network. Fig. 3 is a signal acquisition plot of the data (the vertical axis shows the parameter values and the horizontal axis shows the acquisition time).

2. Analysis of the experimental results: values are removed from the experimental data artificially to generate rolling data sets with randomly missing data, and a training sample set is then selected for each missing attribute. To illustrate the effectiveness of the incomplete data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine, its experimental results are compared with those of classical methods: mean estimation, zero filling, k-nearest-neighbor estimation and the MBP-FCM algorithm. The estimation deviations under the different algorithms and different missing ratios are compared using three indexes computed between the true and estimated values: the mean absolute deviation (ABS), the mean deviation (Bias) and the root mean square error (RMSE). The smaller these values, the more accurate the estimates. As can be seen from Tables 1 and 2, the algorithm provided by the invention estimates more accurately than the four comparison algorithms, and its estimates are closer to the original data. Across the different missing ratios, the deviation of the filled values increases as the number of missing values increases. Fig. 4 shows how the objective function of the FELM-FCM algorithm changes with the number of iterations on the strip steel data set under four missing ratios. Fig. 4 shows that the objective function value of the proposed algorithm fluctuates noticeably in the initial stage and, after several iterations of optimization, tends to a stable converged state.

TABLE 1 comparison of missing strip data set estimate deviations under different algorithms

TABLE 2 comparison of the estimated deviations of missing strip data sets for different miss ratios
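The three indexes used in Tables 1 and 2 can be computed as in the sketch below. The source names the indexes but does not give their formulas, so the definitions here are assumptions: ABS as the mean absolute difference, Bias as the signed mean of (estimate minus true value), and RMSE as the root of the mean squared difference. The numbers are made up for illustration and are not results from the patent.

```python
import numpy as np

def imputation_errors(true_vals, est_vals):
    """ABS, Bias and RMSE between true and estimated values.
    The sign convention of Bias (estimate minus truth) is an assumption."""
    diff = np.asarray(est_vals, dtype=float) - np.asarray(true_vals, dtype=float)
    return {
        "ABS": float(np.mean(np.abs(diff))),
        "Bias": float(np.mean(diff)),
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),
    }

# Illustrative values only.
scores = imputation_errors([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Reporting all three together is useful because a small Bias can hide large errors of opposite sign, which ABS and RMSE expose.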

Claims

1. The application of the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine in strip steel data clustering statistics, characterized by comprising the following process:

3) taking the attribute value acquired in the step 2) as a training data set;

4) carrying out normalization processing on the data set; for the order of magnitude of data attribute, all values in the data set are converted into corresponding values in the interval of [0, 1] to eliminate the difference between data;

5) selecting and optimizing training samples; calculating and selecting data attributes with higher correlation by adopting mutual information, and selecting complete data in incomplete data as a training sample of the FELM network according to the attributes;

6) determining the parameters of the FELM network: initializing the input weights ω and bias values b, with the initial values of ω and b drawn at random from the interval [-1, 1], and determining the number of hidden-layer nodes of the extreme learning machine;

7) missing attribute evaluation; pre-filling the missing attribute according to a nearest neighbor rule, and adjusting a pre-filling value by adopting an error retrieval method according to an error obtained by training a training sample until a reasonable numerical value is found for filling;

8) performing cluster analysis on the recovered complete data set using the FCM algorithm;

wherein the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine comprises the following steps:

2.1) calculating by adopting mutual information, selecting data attributes with higher correlation, and selecting complete data in incomplete data as a training sample of the FELM network according to the attributes;

2.2) determining the parameters of the FELM network: initializing the input weights ω and bias values b, with the initial values of ω and b drawn at random from the interval [-1, 1], and determining the number of hidden-layer nodes of the extreme learning machine;

2.3) pre-filling the missing attribute according to a nearest neighbor rule, and adjusting the pre-filling value by adopting an error retrieval method according to the error obtained by training the FELM network by the training sample until a reasonable numerical value is found and filled, thereby obtaining a recovered complete data set;

2.4) initializing parameters of the fuzzy C-means algorithm, the cluster number C, the fuzzy coefficient m, the threshold epsilon and the membership degree partition matrix U⁽⁰⁾；

2.5) clustering the recovered complete data set with the fuzzy C-means algorithm: when the iteration number t equals l, computing the cluster center matrix $V^{(l)}$ from the membership partition matrix $U^{(l-1)}$ according to formula (2), and updating $U^{(l)}$ from $V^{(l)}$ according to formula (3); for a given threshold $\varepsilon$, if

$$\lVert U^{(l)} - U^{(l-1)} \rVert < \varepsilon$$

the algorithm terminates; otherwise, setting $l = l + 1$ and continuing to iteratively update the membership partition matrix and the cluster centers.

2. The application of the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine in strip steel data clustering statistics as claimed in claim 1, characterized in that, in step 2.3), the process of pre-filling the missing attributes according to the nearest neighbor rule, and of adjusting the pre-filled values by the error retrieval method according to the error obtained by training the FELM network on the training samples until a reasonable value is found and filled in, thereby obtaining the recovered complete data set, is as follows:

2.3.1) pre-filling the missing attributes according to the nearest neighbor rule: selecting the k data samples nearest to the incomplete data sample, calculating the mean of those k samples at the positions corresponding to the missing attributes, and taking that mean as the pre-filled value of the incomplete data;

where $x_{pa}$ and $x_{pb}$ are the $p$-th attribute values of the samples $x_a$ and $x_b$ respectively, and the condition satisfied by $I_p$ is given by formula (5):

2.3.2) calculating the output matrix H of the hidden layer of the FELM network using formulas (6) to (8);

where the activation term gives the output of the $i$-th hidden-layer node; $\omega_i \cdot x_j$ is the inner product of $\omega_i$ and $x_j$; $\omega_i$ is the input weight linking the input layer to the hidden layer; $\beta_i$ is the output weight linking the hidden layer to the output layer; and $b_i$ is the bias value of the $i$-th hidden-layer node;

$$H\beta = T \qquad (7)$$

where $H$ is the output matrix of the hidden-layer nodes, $\beta$ is the output weight, and $T$ is the desired output;

2.3.3) calculating the output weight of the FELM network from the obtained output matrix $H$ and the expected output values according to formula (9);

$$\hat{\beta} = H^{\dagger} T \qquad (9)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$, and the norm of the solution $\hat{\beta}$ is minimal and the solution is unique;

2.3.4) obtaining the error between the predicted output value and the true output value and feeding the error back, wherein the predicted value output by the extreme learning machine is $Y$, the actual value is $y$, and the error is $e_0$;

$$e_0 = Y - y \qquad (10)$$

2.3.5) comparing the obtained error with the error obtained on the training samples: if the iteration stop requirement is met, filling in the missing attribute; otherwise, accepting the error, readjusting the pre-filled value, and returning to step 2.3.1).

3. The application of the missing data fuzzy clustering algorithm based on optimized estimation by the information feedback extreme learning machine in strip steel data clustering statistics as claimed in claim 2, characterized in that the specific process of the error retrieval method is as follows:

adjusting the fill value of the missing attribute:

4.1) if $e < 0$, readjusting the fill value of the missing attribute to $E_{new} = E_k + \rho e$, that is, increasing the value with a certain probability, and then using it as the input for FELM learning, where $\rho \in [0, 1]$ is chosen by a random function;

4.2) if

4.3) if

then the value predicted by the FELM network is close to the true value and is acceptable, so the predicted value is used to fill the missing attribute of the incomplete data set.