K nearest neighbor regression prediction method for boiler special equipment steam flow prediction
Technical Field
The invention relates to the field of data prediction, in particular to a K nearest neighbor regression prediction method for predicting steam quantity of special boiler equipment.
Background
Special equipment refers to boilers, pressure vessels, pressure pipelines, elevators and other equipment that poses a significant hazard to life and safety. A boiler is a pressure-bearing device of a specified volume that is powered by various fuels, electricity or other energy sources. The steam volume of a boiler is an important index of its stability and production effectiveness, and accurate steam volume prediction can improve boiler efficiency and promote safe, efficient production.
Generally, the factors involved in boiler steam prediction include primary air flow, feedwater flow, secondary air temperature, feedwater pressure, boiler bed temperature, boiler bed pressure, furnace temperature, and the like. Boiler data are sampled at a fixed time interval, and the steam quantity sometimes needs to be predicted from the collected time-series data and related features for monitoring or early warning. The supply of each input can also be adjusted in the model according to the required steam quantity, so that an appropriate supply level is found and the risk of adjustment in actual production is reduced.
Regression analysis is relatively mature in application, has a clear model structure, and is well suited to production practice. However, the features collected for boiler steam prediction exhibit severe multicollinearity. Although multicollinearity is common when models are applied in practice, severe collinearity can make a traditional linear regression model very unstable. Ridge regression is often used to address this instability; it reduces the influence of multicollinearity to some extent, but places higher requirements on the form of the historical data if higher prediction accuracy is needed. K nearest neighbor regression makes good use of historical data, achieves high prediction accuracy, and avoids the instability caused by multicollinearity; however, interfering variables that correlate weakly with the target reduce its accuracy. In addition, how to determine a K value that yields higher accuracy, and how to avoid the loss of efficiency caused by the curse of dimensionality and by an excessively large data set, are problems that must be addressed when K nearest neighbor regression is used for steam prediction of boiler special equipment.
Disclosure of Invention
Purpose of the invention: based on the feature data and the historical steam quantity data collected from boiler special equipment, the invention establishes a K nearest neighbor regression prediction method for predicting the steam quantity of boiler special equipment, so as to improve the accuracy and efficiency of steam quantity prediction.
Technical scheme: a K nearest neighbor regression prediction method for boiler special equipment steam flow prediction, which addresses the following three aspects: first, how to determine the value of K so that existing historical data are better used to predict new observations; second, how to reduce the influence of interfering or weakly correlated variables on the predicted value and avoid the performance degradation caused by the curse of dimensionality; third, how to avoid the loss of efficiency of K nearest neighbor regression when an overly large data set makes the search space too large. The technical scheme of the invention comprises the following steps:
step 1, acquiring historical steam data and characteristic data (including various sensor characteristics and instrument data), and preprocessing the data;
step 2, eliminating variables with low correlation according to the correlation, reducing the influence of interference variables on predicted values, and preventing dimension explosion;
step 3, dividing the historical data from which the irrelevant variables were removed in step 2 into a training set, a validation set and a test set, accelerating the search for the K nearest neighbors through a KD tree algorithm, and further determining an optimal K value;
step 4, screening out relevant variables of the new data according to the step 2, and predicting steam quantity according to the K value determined in the step 3;
step 5, after new data are collected periodically, adding the data with low similarity to the historical data into the data set, and returning to step 1 to repeat the process.
Steps 1 to 3 belong to an offline model updating part, and step 4 belongs to an online prediction part. The overall flow chart of the invention is shown in fig. 1, the flow chart of the off-line model is shown in fig. 2, and the flow chart of the on-line prediction is shown in fig. 3.
Further, in step 1, the collected set of d features is $V = \{V_1, V_2, \ldots, V_d\}$, and the collected historical steam data are $D = \{X, Y\}$, where $X = \{X_1, X_2, \ldots, X_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$. The data set contains m historical records, each with d-dimensional features, namely $X_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\}$. The preprocessing of the features comprises missing value filling, outlier processing, normalization and the like.
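The preprocessing of one feature column can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how missing values are filled, how outliers are handled, or which normalization is used, so median filling, clipping at three standard deviations and min-max scaling are choices made here, and the function name `preprocess_column` and the parameter `z_clip` are hypothetical.

```python
import statistics

def preprocess_column(values, z_clip=3.0):
    """Fill missing values, clip outliers and min-max normalize one feature column.

    Missing entries are represented as None (an assumption of this sketch).
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    # Missing value filling: replace None with the column median (assumed strategy).
    filled = [median if v is None else v for v in values]
    # Outlier processing: clip to mean +/- z_clip standard deviations (assumed strategy).
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)
    low, high = mean - z_clip * std, mean + z_clip * std
    clipped = [min(max(v, low), high) for v in filled]
    # Normalization: min-max scaling to [0, 1]; a constant column maps to 0.0.
    cmin, cmax = min(clipped), max(clipped)
    span = cmax - cmin
    return [(v - cmin) / span if span > 0 else 0.0 for v in clipped]
```

Normalizing every column to a common range matters for the later steps, because the Euclidean distance used by the neighbor search would otherwise be dominated by features with large numeric ranges.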
Further, in step 2, the correlation coefficient $r(V_i, Y)$ between the i-th feature $V_i$ and $Y$ is calculated as

$$r(V_i, Y) = \frac{\mathrm{Cov}(V_i, Y)}{\sqrt{\mathrm{Var}(V_i)\,\mathrm{Var}(Y)}}$$

where $\mathrm{Cov}(V_i, Y)$ is the covariance of $V_i$ and $Y$, $\mathrm{Var}(V_i)$ is the variance of $V_i$, and $\mathrm{Var}(Y)$ is the variance of $Y$;
a threshold z is determined from the d values of r obtained, and variables with $|r| < z$ are removed; the threshold z can be set manually, or the best threshold can be found by dividing the data into a training set and a test set; in statistics, variables whose correlation coefficient has an absolute value below 0.3 are generally regarded as weakly correlated, so the threshold can be set to 0.3 here and adjusted dynamically according to the characteristics of the actual data.
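The correlation-based screening of step 2 can be sketched as below. The function names `pearson_r` and `select_features` are hypothetical, the features are assumed to be stored column-wise as lists of numbers, and treating a constant column as uncorrelated is a choice made for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r = Cov(x, y) / sqrt(Var(x) * Var(y))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information (assumption)
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, y, z=0.3):
    """Return the indices of the feature columns kept, i.e. those with |r| >= z."""
    return [i for i, col in enumerate(columns) if abs(pearson_r(col, y)) >= z]
```

For example, `select_features(columns, steam, z=0.3)` returns the column indices to keep; the same index list would then be applied to new data at prediction time so that offline and online features stay aligned.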
Further, in step 3, the historical data remaining after the feature removal of step 2, $D' = \{X', Y\}$ with $X' = \{X'_1, X'_2, \ldots, X'_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$, are divided into a training set $Train = \{X_{train}, Y_{train}\}$, a validation set $Validation = \{X_{valid}, Y_{valid}\}$ and a test set $Test = \{X_{test}, Y_{test}\}$. The retained feature set is $V' = \{V'_1, V'_2, \ldots, V'_s\}$ $(s \le d)$;
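A minimal sketch of the three-way split follows. The split ratios, the random shuffling and the function name `split_dataset` are assumptions; the text only states that the data are divided into three sets. Because boiler data are collected as a time series, a chronological split may be more appropriate in practice than the shuffled split shown here.

```python
import random

def split_dataset(X, Y, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle record indices and split into (train, validation, test) pairs of (X, Y)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # fixed seed so the split is reproducible
    n_test = int(len(idx) * test_frac)
    n_valid = int(len(idx) * valid_frac)
    test_idx = idx[:n_test]
    valid_idx = idx[n_test:n_test + n_valid]
    train_idx = idx[n_test + n_valid:]

    def pick(ids):
        # Keep each feature record paired with its steam value.
        return [X[i] for i in ids], [Y[i] for i in ids]

    return pick(train_idx), pick(valid_idx), pick(test_idx)
```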
A KD tree is constructed for the training set $\{X_{train}, Y_{train}\}$: the variance of each feature is calculated over the training set, the feature $V'_k$ with the largest variance is taken as the splitting feature at the root node, and the median of $V'_k$ is taken as the dividing point; samples smaller than the median are placed in the left subtree and the remaining samples in the right subtree; the KD tree is generated by applying the same method recursively; the KD tree accelerates the search for the K nearest neighbors, improving search efficiency and reducing complexity;
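The construction rule described above (split on the feature with the largest variance, at its median) can be sketched as a small recursive builder. The dictionary-based node layout and the name `build_kd_tree` are hypothetical, and storing the median sample itself at each node is a choice made for this sketch.

```python
import statistics

def build_kd_tree(points, targets):
    """Recursively build a KD tree over `points`, keeping each steam value in `targets`.

    Each node splits on the feature with the largest variance among its samples.
    """
    if not points:
        return None
    dims = range(len(points[0]))
    axis = max(dims, key=lambda v: statistics.pvariance([p[v] for p in points]))
    order = sorted(range(len(points)), key=lambda i: points[i][axis])
    mid = len(order) // 2
    # Shift left past ties so that only samples strictly smaller than the
    # median value end up in the left subtree.
    while mid > 0 and points[order[mid - 1]][axis] == points[order[mid]][axis]:
        mid -= 1
    m = order[mid]
    left, right = order[:mid], order[mid + 1:]
    return {
        "point": points[m],
        "target": targets[m],
        "axis": axis,
        "left": build_kd_tree([points[i] for i in left], [targets[i] for i in left]),
        "right": build_kd_tree([points[i] for i in right], [targets[i] for i in right]),
    }
```

Choosing the maximum-variance feature at each level tends to separate the samples most along the split, which helps the later neighbor search prune whole subtrees.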
different values of K are taken, and for each value of K the KD tree is searched for the K nearest neighbors of every sample $X_i^{valid}$ in the validation set $\{X_{valid}, Y_{valid}\}$. The distance metric may be the Euclidean distance, the Manhattan distance or the Minkowski distance. Using the Euclidean distance, the neighbor distance is calculated as follows:

$$L(X_i^{valid}, X_j^{train}) = \sqrt{\sum_{v=1}^{s} \left(x_{iv}^{valid} - x_{jv}^{train}\right)^2}$$

where $X_i^{valid}$ is the i-th sample of the validation set, $X_j^{train}$ is a training-set sample found through the KD tree, $x_{iv}^{valid}$ is the value of the v-th dimension of $X_i^{valid}$, and $x_{jv}^{train}$ is the value of the v-th dimension of $X_j^{train}$;
the K neighbors of $X_i^{valid}$ obtained with this metric are as follows:

$$\{X_1^{train}, X_2^{train}, \ldots, X_K^{train}\}$$

the predicted value $\hat{y}_i^{valid}$ of $X_i^{valid}$ is obtained by distance-weighted summation, calculated as follows:

$$\hat{y}_i^{valid} = \sum_{j=1}^{K} w_j \, y_j^{train}$$

where $w_j$ is the distance weight of the j-th neighbor and $y_j^{train}$ is the steam value corresponding to $X_j^{train}$;
for the predicted value $\hat{y}_i^{valid}$ of each $X_i^{valid}$ in the validation set, the squared prediction error is calculated, and from these the mean squared error (MSE) of the validation set; the process is repeated for different values of K, and the K value with the minimum mean squared error is selected as the optimal K.
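The selection of K by validation error can be sketched as follows. Two simplifications are assumed: a brute-force neighbor search stands in for the KD tree lookup to keep the example short, and the weights are taken as normalized inverse distances, since the text does not give the exact form of the distance weight. The names `knn_predict`, `validation_mse` and `select_k` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, X_train, y_train, k, eps=1e-9):
    """Distance-weighted average of the steam values of the k nearest training samples."""
    # Brute-force search; a KD tree would replace this line in practice.
    nearest = sorted((euclidean(query, x), y) for x, y in zip(X_train, y_train))[:k]
    # Assumed weighting: inverse distance, normalized so the weights sum to 1.
    raw = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(raw)
    return sum(w / total * y for w, (_, y) in zip(raw, nearest))

def validation_mse(X_train, y_train, X_valid, y_valid, k):
    """Mean squared prediction error over the validation set for a given k."""
    errors = [(knn_predict(x, X_train, y_train, k) - y) ** 2
              for x, y in zip(X_valid, y_valid)]
    return sum(errors) / len(errors)

def select_k(X_train, y_train, X_valid, y_valid, candidates):
    """Return the candidate k with the smallest validation MSE."""
    return min(candidates,
               key=lambda k: validation_mse(X_train, y_train, X_valid, y_valid, k))
```

A call such as `select_k(X_train, y_train, X_valid, y_valid, range(5, 13))` would try each K from 5 to 12 and return the one with the lowest validation error.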
Further, in step 4, new sensor data or meter data are collected (or values of the input variables are assumed, based on the required steam quantity, and used as input); irrelevant variables are removed according to step 2; the K nearest neighbors are found in the KD tree of step 3 using the K value determined there; weights are set according to distance, and a weighted sum gives the predicted value for the new data. To improve accuracy, the KD tree used for new data may also be rebuilt from all of the historical data.
Furthermore, in step 5, because building the KD tree takes a certain amount of time, new data can be collected periodically and then added to the historical data for periodic offline updating; when selecting new data, the similarity to the historical data can be calculated and only the data with lower similarity added to the data set. This prevents the data set, and hence the KD tree, from growing so large that online search becomes too expensive, while still enriching the data set so that prediction improves.
Beneficial effects: the invention provides a method for predicting the steam quantity of boiler special equipment by K nearest neighbor regression, which improves the prediction effect, provides a more accurate steam prediction value for production, and provides an index for efficient production. The model is simple, practical and efficient, suits the form of the historical steam data of boiler special equipment, and has practical value.
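The similarity-based selection of new data can be sketched as below. The text does not define the similarity measure, so this example assumes that similarity is judged by the Euclidean distance to the nearest historical sample and that a record is added only if this distance exceeds a threshold; the names `select_novel_samples` and `min_dist` are hypothetical.

```python
import math

def select_novel_samples(new_X, new_y, hist_X, min_dist=0.1):
    """Keep new records whose nearest reference sample is farther away than min_dist."""
    kept_X, kept_y = [], []
    reference = list(hist_X)
    for x, y in zip(new_X, new_y):
        nearest = min(math.dist(x, h) for h in reference)
        if nearest > min_dist:
            kept_X.append(x)
            kept_y.append(y)
            # Compare later candidates against this record too, so that
            # near-duplicates within the new batch are not all added.
            reference.append(x)
    return kept_X, kept_y
```

Filtering in this way keeps the data set growing only in regions of the feature space that the historical data do not yet cover, which limits the size of the rebuilt KD tree.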
Drawings
FIG. 1 is an overall flow chart of the present invention.
FIG. 2 is a flow chart of the K-nearest neighbor regression algorithm off-line training in the present invention.
FIG. 3 is a flow chart of the K-nearest neighbor regression algorithm for on-line prediction in the present invention.
FIG. 4 is an example plot of severe multicollinearity (VIF value > 10) among multiple features of boiler special equipment.
FIG. 5 is an example heat map of the correlations among boiler equipment features.
Detailed Description
The invention is further illustrated below with reference to specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention.
The embodiment of the invention relates to a method for predicting steam volume of special equipment of a boiler by using K neighbor regression, and the implementation flow is shown in figure 1. The method comprises the following specific steps:
step 1, acquiring historical steam data and characteristic data (including various sensor characteristics and instrument data), and preprocessing the data, wherein the preprocessing specifically comprises the following steps:
step 11, collecting historical feature data and steam data and storing them in a database; in a further embodiment, the collected set of d features is $V = \{V_1, V_2, \ldots, V_d\}$, and the collected historical steam data are $D = \{X, Y\}$, where $X = \{X_1, X_2, \ldots, X_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$; the data set contains m historical records, each with d-dimensional features, namely $X_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\}$.
step 12, preprocessing each feature column, including missing value filling, outlier processing, normalization and the like.
Step 2, rejecting variables with low correlation according to the correlation, reducing the influence of interference variables on predicted values, and preventing dimension explosion, specifically comprising:
step 21, calculating the correlation r between each feature column and the steam quantity; in a further embodiment, the correlation coefficient $r(V_i, Y)$ between the i-th feature $V_i$ and $Y$ is calculated as

$$r(V_i, Y) = \frac{\mathrm{Cov}(V_i, Y)}{\sqrt{\mathrm{Var}(V_i)\,\mathrm{Var}(Y)}}$$

where $\mathrm{Cov}(V_i, Y)$ is the covariance of $V_i$ and $Y$, $\mathrm{Var}(V_i)$ is the variance of $V_i$, and $\mathrm{Var}(Y)$ is the variance of $Y$;
determining a threshold value z according to the d values of r obtained;
step 22, eliminating the features with $|r| < 0.3$.
Step 3, dividing the historical data from which the irrelevant variables were removed in step 2 into a training set, a validation set and a test set, accelerating the search for the K nearest neighbors through a KD tree algorithm, and further determining an optimal K value, specifically comprising:
step 31, extracting from the database the historical data with the variables removed and dividing them into a training set, a validation set and a test set; in a further embodiment, the historical data remaining after the feature removal of step 2, $D' = \{X', Y\}$ with $X' = \{X'_1, X'_2, \ldots, X'_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$, are divided into a training set $Train = \{X_{train}, Y_{train}\}$, a validation set $Validation = \{X_{valid}, Y_{valid}\}$ and a test set $Test = \{X_{test}, Y_{test}\}$. The retained feature set is $V' = \{V'_1, V'_2, \ldots, V'_s\}$ $(s \le d)$.
step 32, constructing a KD tree from the training set; in a further embodiment, a KD tree is constructed for the training set $\{X_{train}, Y_{train}\}$: the variance of each feature is calculated over the training set, the feature $V'_k$ with the largest variance is taken as the splitting feature at the root node, and the median of $V'_k$ is taken as the dividing point; samples smaller than the median are placed in the left subtree and the remaining samples in the right subtree; the KD tree is generated by applying the same method recursively;
step 33, setting K to 5, and for each record in the validation set searching the KD tree for the K most similar historical records, using the Euclidean distance; calculating the weights of the K neighbors from their distances to the prediction point, and taking the weighted sum of the neighbors' steam values to obtain the steam quantity at the prediction point; calculating the error between the predicted value and the true value of each record, and from these the overall mean square error of the validation set;
step 34, setting K to each value from 6 to 12 in turn, repeating step 33, and selecting the optimal K value as the one with the minimum mean square error;
step 35, evaluating the prediction performance on the test set using the optimal K value.
Step 4, screening out relevant variables for the new data according to the step 2, and predicting the steam quantity according to the K value determined in the step 3, wherein the method specifically comprises the following steps:
step 41, collecting the feature data on line, and forming an array by the feature data according to the feature sequence in the database;
step 42, removing the features identified for elimination in step 22;
step 43, searching the constructed KD tree for the K nearest neighbors of the data from step 42, using the optimal K value;
step 44, calculating the predicted steam value of the new data as the weighted sum of the steam values of the K neighbors, with the weights set in the same way as in step 33.
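The online lookup of step 4 can be sketched as a K nearest neighbor search in a KD tree followed by a weighted sum. This is an illustration under assumptions: a compact builder is included only so the example is self-contained, the weights are taken as normalized inverse distances because the text does not fix their form, and the names `build`, `k_nearest` and `predict` are hypothetical.

```python
import heapq
import math
import statistics

def build(points, targets):
    """Compact KD tree builder: split on the max-variance feature at its median."""
    if not points:
        return None
    axis = max(range(len(points[0])),
               key=lambda v: statistics.pvariance([p[v] for p in points]))
    order = sorted(range(len(points)), key=lambda i: points[i][axis])
    mid = len(order) // 2
    m = order[mid]

    def sub(ids):
        return build([points[i] for i in ids], [targets[i] for i in ids])

    return {"point": points[m], "target": targets[m], "axis": axis,
            "left": sub(order[:mid]), "right": sub(order[mid + 1:])}

def k_nearest(node, query, k, heap=None):
    """Collect the k nearest samples as a heap of (-distance, steam value) pairs."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = math.dist(query, node["point"])
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["target"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["target"]))  # drop the current farthest
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    k_nearest(near, query, k, heap)
    # Search the far side only if the splitting plane is closer than the
    # current k-th neighbor, or if fewer than k neighbors have been found.
    if len(heap) < k or abs(diff) < -heap[0][0]:
        k_nearest(far, query, k, heap)
    return heap

def predict(tree, query, k, eps=1e-9):
    """Predicted steam value: inverse-distance-weighted mean of the k neighbors."""
    neighbors = [(-neg_d, y) for neg_d, y in k_nearest(tree, query, k)]
    raw = [1.0 / (d + eps) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(raw, neighbors)) / sum(raw)
```

The pruning test on the splitting plane is what lets the tree skip whole subtrees, so a query typically inspects far fewer samples than a scan of the full training set.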
Step 5, after new data are collected periodically, adding the data with low similarity to the historical data into the data set and returning to step 1 to repeat the process, specifically comprising:
step 51, periodically summarizing new data;
step 52, calculating the similarity of the new data and the historical data;
step 53, adding the data with lower similarity to the data table of step 11;
step 54, returning to step 1 and repeating the process.
Steps 1-3 belong to an offline model updating part, and step 4 belongs to an online prediction part. The overall flow chart of the invention is shown in fig. 1, the flow chart of the off-line model is shown in fig. 2, and the flow chart of the on-line prediction is shown in fig. 3.
The embodiment provides a specific method, steps and parameters for performing steam prediction on special boiler equipment by k-nearest neighbor regression, so that the prediction effect is improved, a relatively accurate steam prediction value can be provided for production, and an index can be provided for efficient production. In addition, the implementation is only a preferred specific implementation of the present invention, and the setting of the parameters needs to be adjusted according to specific variables and data in the specific implementation process, so as to achieve a better practical effect.