CN108490782B - Method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning
- Publication number: CN108490782B
- Application number: CN201810305512.7A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The present invention relates to the technical field of industrial process control and discloses a method and system, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes. First, variables of different dimensions are drawn from the sampled data to generate multiple sampling sets that serve as the training sets of the submodels. Then each submodel is modeled with three methods: support vector machine, BP neural network, and partial least squares. Finally, a completion-effect evaluation index is proposed, the completion effect of each submodel is evaluated, and the submodels with the best completion effect are chosen for selective ensemble. The invention makes full use of all variables of the training samples and achieves a good data completion effect, which helps an enterprise carry out targeted optimization of production operations according to the actual operating state of the production process obtained from analysis.
Description
Technical field
The present invention relates to the technical field of industrial process control, and in particular to a method and system, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes.
Background art
In complex industrial processes, certain quality indices cannot be measured directly by sensors and must be obtained from manually collected samples that are assayed offline. The assay cycle is long and the quality index data cannot be obtained in real time, so the completion of missing quality index data has long been a focus of attention. Most complex industrial processes have now introduced computer control systems, and the large volume of production process data measured by these systems makes it easier to complete the missing data of hard-to-measure quality indices.
However, data from complex industrial processes often have the following characteristics, which make it difficult for prior-art completion methods to obtain ideal results. First, the control system of a complex industrial process often has hundreds of sensors measuring process variables, so the dimension is very high and the data volume is very large, whereas product quality index data require lengthy offline assays, so their sampling frequency is very low. After data preprocessing, therefore, few samples are available for data completion. Second, the high-dimensional data of industrial systems are often strongly coupled, which can seriously affect parameter estimation and increase model error. Third, industrial processes involve a large number of complex chemical reactions, and the relationships between parameters are nonlinear; the relationships between temperature and entropy and between reaction temperature and reaction rate, for example, are typical nonlinear relationships, which make it very difficult to establish a mathematical model.
Commonly used data completion methods include mean imputation, hot-deck imputation, expectation-maximization imputation, and regression imputation. Because regression imputation can use as much of the information in the data samples as possible, most research concentrates on it. However, because complex industrial process data are high-dimensional, nonlinear, and strongly coupled, the precision of regression imputation may be unstable. In addition, because few samples in a complex industrial process are available for data completion, a completion model built from so few samples may fit poorly.
Summary of the invention
The technical problem to be solved by the present invention is that, in complex industrial processes, the process variables are numerous, the variables are strongly coupled, and the data fluctuate widely. To address these difficulties, a method and system based on selective two-layer ensemble learning for completing missing product quality index data in complex industrial processes are proposed:
A method, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes, characterized by comprising the following steps:
S1. Generate the training sets:
S11. Collect the product quality index data of the complex industrial process to form an original data set with N samples;
S12. Randomly draw, with replacement, a sampling set of the same size as the original data set from the original data set, and repeat this M times to obtain M mutually independent sampling sets;
S13. For each sampling set, randomly draw K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula in P and P is the total dimension of the original data set; this generates M training sets of size N × K;
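Steps S12 and S13 can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `make_training_sets`, the seed, and the toy data are hypothetical, and K is passed in as a parameter rather than computed from P.

```python
import numpy as np

def make_training_sets(X, y, M, K, seed=0):
    """Build M training sets: bootstrap the rows, then pick K columns without replacement."""
    rng = np.random.default_rng(seed)
    N, P = X.shape
    sets = []
    for _ in range(M):
        rows = rng.integers(0, N, size=N)            # S12: draw N samples with replacement
        cols = rng.choice(P, size=K, replace=False)  # S13: draw K variables without replacement
        sets.append((X[np.ix_(rows, cols)], y[rows], cols))
    return sets

# Synthetic stand-in for process variables and one quality index (not real plant data).
X = np.random.default_rng(1).normal(size=(30, 12))
y = X[:, 0] - 2.0 * X[:, 3]
training_sets = make_training_sets(X, y, M=5, K=4)
```

Each entry holds an N × K feature matrix, the matching targets, and the indices of the chosen variables, which are needed later to apply the submodel to new samples.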
S2. Build the submodels and generate the completion results:
S21. Based on the M training sets, complete the data of each training set with three modeling methods, partial least squares, support vector machine, and BP neural network, obtaining three completion results $\hat{y}_{\mathrm{PLS}}$, $\hat{y}_{\mathrm{SVM}}$, and $\hat{y}_{\mathrm{BP}}$;
S22. Use the least squares method to estimate the weights $z_1$, $z_2$, $z_3$ of the three modeling methods (partial least squares, support vector machine, BP neural network), and obtain the completion result of each submodel by weighted calculation, as follows:
Assuming that the actual output is $y$, then
$z_1 \hat{y}_{\mathrm{PLS}} + z_2 \hat{y}_{\mathrm{SVM}} + z_3 \hat{y}_{\mathrm{BP}} = y$
Let $X = [\hat{y}_{\mathrm{PLS}}, \hat{y}_{\mathrm{SVM}}, \hat{y}_{\mathrm{BP}}]$ and $z = [z_1, z_2, z_3]^T$; the above formula can then be abbreviated as:
$Xz = y$
The least squares estimate of the weights $z$ is:
$\hat{z} = (X^T X)^{-1} X^T y$
The completion result of the $i$-th submodel is then $\hat{Y}_i = X\hat{z}$.
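The weighting in S22 amounts to an ordinary least-squares fit of the actual output on the three model outputs. The sketch below assumes NumPy and uses synthetic stand-ins for the three completion results; the names `combine_by_least_squares` and `preds` are illustrative only.

```python
import numpy as np

def combine_by_least_squares(preds, y):
    """preds: (n, 3) matrix whose columns are the PLS, SVM and BP outputs.
    Returns the weights z minimizing ||preds @ z - y|| and the combined output."""
    z, *_ = np.linalg.lstsq(preds, y, rcond=None)
    return z, preds @ z

rng = np.random.default_rng(0)
y = rng.normal(size=40)
# Synthetic model outputs: the true value plus noise of three different sizes.
preds = np.column_stack([y + rng.normal(scale=s, size=40) for s in (0.1, 0.3, 0.5)])
z, y_hat = combine_by_least_squares(preds, y)
```

`np.linalg.lstsq` is used instead of forming the inverse of $X^T X$ explicitly because it stays stable when the three outputs are nearly collinear, which is likely when all three models fit well.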
S3. Determine the final completion result:
Based on the completion result of each submodel, evaluate the completion results of the M submodels according to the proposed completion-effect evaluation index, rank them by evaluation score, and take the average of the S highest-scoring submodels as the final completion result.
Further, after S11 the method also includes a step of standardizing the original data set to eliminate the influence of the differing units of the variables, as follows:
Of the N samples, each has P variables, and $X_{ij}$ ($i = 1, 2, \dots, N$; $j = 1, 2, \dots, P$) is the value of the $j$-th variable of the $i$-th sample. The standardization formula is:
$X'_{ij} = \dfrac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$
where $E(X_j)$ is the mean of the N sample values of the $j$-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of the N sample values of the $j$-th input variable.
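The standardization step can be illustrated with a short NumPy sketch. The function name `standardize` and the example values are hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract each variable's mean, divide by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)   # population standard deviation (ddof=0), an assumption
    return (X - mean) / std, mean, std

# Two variables on very different scales.
X_raw = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
Z, mean, std = standardize(X_raw)
```

The returned `mean` and `std` should be kept so that new samples can be scaled with the same statistics; a variable with zero variance would need to be dropped before this step.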
Further, the phrase "according to the proposed completion-effect evaluation index" in S3 refers to a composite index of precision and stability: the M submodels are ranked by their composite precision-and-stability score, and the average of the several highest-scoring submodels is taken as the final completion result, calculated as follows:
S31. Compute the root-mean-square error of each submodel as its precision index:
$\mathrm{RMSE}(i) = \sqrt{\dfrac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$
where $\hat{Y}_{ij}$ is the completion result of the $i$-th model for the $j$-th sample, $Y_j$ is the true value of the $j$-th sample, and N is the number of test samples;
S32. Compute the standard deviation of the error of each submodel, $\mathrm{std}(e_i)$, as its stability index, where $\bar{e}_i$ is the mean of the error of the $i$-th model;
S33. Normalize the two indices $\mathrm{RMSE}(i)$ and $\mathrm{std}(e_i)$ to the interval [0, 1] with a normalization formula;
S34. Take the composite index computed from the normalized indices as the score of each submodel, rank all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result, where $\mathrm{RMSE}(i)'$ and $\mathrm{std}(e_i)'$ are the normalized precision index and stability index, respectively;
The final result is then:
$\hat{Y} = \dfrac{1}{S}\sum_{s=1}^{S}\hat{Y}_{(s)}$
where $S = \mathrm{floor}(40\% \times M)$, M is the number of submodels, floor() is the round-down function, i.e., it takes the largest integer not greater than the value in the parentheses, and $\hat{Y}_{(1)}, \dots, \hat{Y}_{(S)}$ are the completion outputs of the S highest-scoring submodels.
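The selection in S31 to S34 can be sketched as below, assuming NumPy. Two parts are assumptions made only for illustration and are not taken from the text: min-max scaling is used as the normalization to [0, 1], and the composite score is taken as one minus the average of the two normalized indices, so that a higher score means a better submodel. The function name `select_and_average` is hypothetical.

```python
import math
import numpy as np

def select_and_average(preds, y_true, ratio=0.4):
    """preds: (M, n) completion results of M submodels on n test samples."""
    err = np.abs(preds - y_true)                # absolute errors e_ij
    rmse = np.sqrt(np.mean(err ** 2, axis=1))   # precision index RMSE(i)
    std = err.std(axis=1)                       # stability index std(e_i), ddof=0 assumed

    def minmax(v):                              # ASSUMED normalization to [0, 1]
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)

    score = 1.0 - 0.5 * (minmax(rmse) + minmax(std))   # ASSUMED composite score
    S = math.floor(ratio * len(preds))                 # S = floor(40% x M)
    top = np.argsort(-score)[:S]                       # indices of the S best submodels
    return preds[top].mean(axis=0), top, score

rng = np.random.default_rng(0)
y_true = rng.normal(size=50)
noise_scales = [0.1, 0.5, 1.0, 2.0, 4.0]               # five synthetic submodels, M = 5
preds = np.stack([y_true + rng.normal(scale=s, size=50) for s in noise_scales])
final, top, score = select_and_average(preds, y_true)  # S = floor(0.4 * 5) = 2
```

With fewer than three submodels S would be zero, so M should be at least 3 for the average to be defined.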
Further, the "complex industrial process product quality index" is the final boiling point of the heavy naphtha of a hydrocracking process, with N = 595, M = 50, and P = 139.
Further, the "complex industrial process product quality index" is the diesel flash point, the diesel cetane number, the aviation kerosene flash point, or the aviation kerosene initial boiling point of the hydrocracking process.
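As a worked value for the heavy naphtha case, the number of submodels averaged in the final step follows directly from the rule $S = \mathrm{floor}(40\% \times M)$ with $M = 50$:

```latex
S = \lfloor 0.4 \times M \rfloor = \lfloor 0.4 \times 50 \rfloor = \lfloor 20 \rfloor = 20
```

That is, 20 of the 50 submodels are kept and the remaining 30 are discarded.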
A system, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes, characterized by comprising at least a training set generation module, a completion result generation module, and a final completion result determination module:
The training set generation module collects the original data set and standardizes the data to eliminate the influence of differing units; it generates M sampling sets by bootstrap sampling, randomly draws K variables from each sampling set, and finally uses the resulting M sampling sets of size N × K as the training sets of the M submodels and passes them to the second module, where N is the number of training samples;
The completion result generation module takes the training data set of each submodel passed in by the training set generation module and builds regression models with support vector machine, BP neural network, and partial least squares, obtaining three groups of completion results from these three regression models with different characteristics; it then computes the weights of the three modeling methods by the least squares method and weights the results to obtain the completion result of each submodel; finally it passes the completion result of each submodel to the third module;
The final completion result determination module takes the completion result of each submodel produced by the completion result generation module, evaluates the completion results of the M submodels according to the proposed completion-effect evaluation index, ranks them by evaluation score, and takes the average of the S highest-scoring submodels as the final completion result, which it outputs.
Further, the bootstrap sampling in the training set generation module is specifically:
Randomly draw, with replacement, a sampling set of the same size as the original data set from the original data set, and repeat this M times to obtain M mutually independent sampling sets;
For each sampling set, randomly draw K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula in P, and P is the total dimension of the sampling set variables, set in advance according to the original data set.
Further, the completion result of each submodel in the completion result generation module is calculated as follows:
S1. For the training set of each submodel, complete the data with three modeling methods, partial least squares (PLS), support vector machine (SVM), and BP neural network, obtaining three completion results $\hat{y}_{\mathrm{PLS}}$, $\hat{y}_{\mathrm{SVM}}$, and $\hat{y}_{\mathrm{BP}}$;
S2. Estimate the weight of each modeling method by the least squares method, and weight the results to obtain the completion result $\hat{Y}_i$ of the $i$-th submodel:
Assuming that the actual output is $y$ and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$, then:
$z_1 \hat{y}_{\mathrm{PLS}} + z_2 \hat{y}_{\mathrm{SVM}} + z_3 \hat{y}_{\mathrm{BP}} = y$
Let $X = [\hat{y}_{\mathrm{PLS}}, \hat{y}_{\mathrm{SVM}}, \hat{y}_{\mathrm{BP}}]$ and $z = [z_1, z_2, z_3]^T$; the above formula can then be abbreviated as:
$Xz = y$
The least squares estimate of the weights $z$ is:
$\hat{z} = (X^T X)^{-1} X^T y$
The completion result of the submodel is then $\hat{Y}_i = X\hat{z}$.
Further, the effect evaluation is specifically carried out as follows:
S1. Compute the root-mean-square error of each submodel as its precision index:
$\mathrm{RMSE}(i) = \sqrt{\dfrac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$
where $\hat{Y}_{ij}$ is the completion result of the $i$-th model for the $j$-th sample, $Y_j$ is the true value of the $j$-th sample, and N is the number of test samples;
S2. Compute the standard deviation of the error of each submodel, $\mathrm{std}(e_i)$, as its stability index, where $\bar{e}_i$ is the mean of the error of the $i$-th model;
S3. Normalize the two indices $\mathrm{RMSE}(i)$ and $\mathrm{std}(e_i)$ to the interval [0, 1] with a normalization formula;
S4. Take the composite index computed from the normalized indices as the score of each submodel, rank all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result, where $\mathrm{RMSE}(i)'$ and $\mathrm{std}(e_i)'$ are the normalized precision index and stability index, respectively;
The final result is then:
$\hat{Y} = \dfrac{1}{S}\sum_{s=1}^{S}\hat{Y}_{(s)}$
where $S = \mathrm{floor}(40\% \times M)$, M is the number of submodels, floor() is the round-down function, i.e., it takes the largest integer not greater than the value in the parentheses, and $\hat{Y}_{(1)}, \dots, \hat{Y}_{(S)}$ are the completion outputs of the S highest-scoring submodels.
Compared with existing methods, the beneficial effects of the present invention are as follows. In the proposed data completion method and system based on selective two-layer ensemble learning, the first layer draws variables of different dimensions from the sampled data, builds local models, and integrates them; this ensures that each local model has enough training samples and thus addresses the difficulty of few samples and high dimension described above. In the second layer, each local model is built with three methods, support vector machine (SVM), BP neural network, and partial least squares (PLS): support vector regression is a practical algorithm for small samples, the BP neural network has good nonlinear approximation ability, and partial least squares regression can handle the problems caused by coupling among the input variables. The results of the three modeling methods are then integrated, which ensures the stability of the algorithm on the data completion problem of complex industrial processes and gives it higher generalization ability. Finally, a completion-effect evaluation index is proposed, the completion effect of each submodel is evaluated, and the submodels with the best completion effect are chosen for selective ensemble, which further improves the completion precision. This provides a more credible basis for the enterprise's online analysis of the overall operating condition of the industrial process, and helps the enterprise adjust its production mode according to the analysis results, reduce waste of resources, and improve production efficiency. The proposed completion method can also obtain a good completion effect when the input variable dimension is very high and the data samples are few.
Brief description of the drawings
Fig. 1 is the overall flow chart of the data completion method based on selective two-layer ensemble learning according to an embodiment of the present invention.
Fig. 2 is a structural schematic diagram of the data completion method based on selective two-layer ensemble learning.
Fig. 3 shows the missing data situation of the product quality index data samples of the hydrocracking process.
Fig. 4 shows the test-set completion results for the heavy naphtha final boiling point obtained with the data completion method based on selective two-layer ensemble learning.
Fig. 5 compares the completion errors of the S submodels selected by the proposed evaluation index for the heavy naphtha final boiling point with the completion error after integration.
Fig. 6 shows the test-set completion results for the diesel flash point.
Fig. 7 shows the test-set completion results for the diesel cetane number.
Fig. 8 shows the test-set completion results for the aviation kerosene initial boiling point.
Fig. 9 shows the test-set completion results for the aviation kerosene flash point.
Fig. 10 compares the completion errors of the S selected submodels with the completion error after integration on the diesel flash point test set.
Fig. 11 compares the completion errors of the S selected submodels with the completion error after integration on the diesel cetane number test set.
Fig. 12 compares the completion errors of the S selected submodels with the completion error after integration on the aviation kerosene initial boiling point test set.
Fig. 13 compares the completion errors of the S selected submodels with the completion error after integration on the aviation kerosene flash point test set.
Detailed description of the embodiments
In order to fully disclose the present invention, specific embodiments of the invention are described in further detail below:
The present invention makes full use of all input variables, so that a good completion result can be obtained even when the input variable dimension is very high. The proposed method is as follows:
A method, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes, characterized by comprising the following steps:
S1. Generate the training sets:
S11. Collect the product quality index data of the complex industrial process to form an original data set with N samples;
S12. Randomly draw, with replacement, a sampling set of the same size as the original data set from the original data set, and repeat this M times to obtain M mutually independent sampling sets;
S13. For each sampling set, randomly draw K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula in P and P is the total dimension of the original data set; this generates M training sets of size N × K;
S2. Build the submodels and generate the completion results:
S21. Based on the M training sets, complete the data of each training set with three modeling methods, partial least squares, support vector machine, and BP neural network, obtaining three completion results $\hat{y}_{\mathrm{PLS}}$, $\hat{y}_{\mathrm{SVM}}$, and $\hat{y}_{\mathrm{BP}}$;
S22. Use the least squares method to estimate the weights $z_1$, $z_2$, $z_3$ of the three modeling methods (partial least squares, support vector machine, BP neural network), and obtain the completion result of each submodel by weighted calculation, as follows:
Assuming that the actual output is $y$, then
$z_1 \hat{y}_{\mathrm{PLS}} + z_2 \hat{y}_{\mathrm{SVM}} + z_3 \hat{y}_{\mathrm{BP}} = y$
Let $X = [\hat{y}_{\mathrm{PLS}}, \hat{y}_{\mathrm{SVM}}, \hat{y}_{\mathrm{BP}}]$ and $z = [z_1, z_2, z_3]^T$; the above formula can then be abbreviated as:
$Xz = y$
The least squares estimate of the weights $z$ is:
$\hat{z} = (X^T X)^{-1} X^T y$
The completion result of the $i$-th submodel is then $\hat{Y}_i = X\hat{z}$.
S3. Determine the final completion result:
Based on the completion result of each submodel, evaluate the completion results of the M submodels according to the proposed completion-effect evaluation index, rank them by evaluation score, and take the average of the S highest-scoring submodels as the final completion result.
In a preferred embodiment of the present invention, before the bootstrap sampling of S1 is applied, the original data set is first standardized to eliminate the influence of the differing units of the variables. The input sampled data has N samples in total, each with P variables, and $X_{ij}$ ($i = 1, 2, \dots, N$; $j = 1, 2, \dots, P$) is the value of the $j$-th variable of the $i$-th sample.
The standardization formula is:
$X'_{ij} = \dfrac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$
where $E(X_j)$ is the mean of the N sample values of the $j$-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of the N sample values of the $j$-th input variable.
Then a sampling set of the same size as the original data set is randomly drawn, with replacement, from the standardized original data set, and this is repeated M times to obtain M mutually independent sampling sets;
For each sampling set, K variables are randomly drawn without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula.
The N × K training sets of the M submodels have now been obtained.
S2 uses three modeling methods for integrated modeling, which are further described below:
1) Support vector regression model
By constructing a loss function and following the idea of structural risk minimization, the support vector machine usually determines the regression function with the following minimization model:
$\min_{\omega, b, \xi, \xi^*} \ \dfrac{1}{2}\lVert\omega\rVert^2 + \mu\sum_{i=1}^{N}\left(\xi_i + \xi_i^*\right)$
$\text{s.t. } y_i - \omega^T\varphi(x_i) - b \le \varepsilon + \xi_i, \quad \omega^T\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$
where $\omega$ is the weight vector, $\frac{1}{2}\lVert\omega\rVert^2$ is the term expressing model complexity, $\mu$ is the penalty (regularization) constant, $\xi_i^*$ and $\xi_i$ are slack variables, $\varphi(x)$ is the nonlinear transformation that maps the data into a higher-dimensional space, $b$ is the bias, and $\varepsilon$ is the upper limit of the error.
Introducing the Lagrange multipliers $\alpha_i$ and $\alpha_i^*$, the optimization model above can be converted into its dual optimization problem and solved.
Solving this problem gives the support vector regression function:
$f(X) = \sum_{i=1}^{N}\left(\alpha_i - \alpha_i^*\right)k(X_i, X) + b$
where $k(X_i, X)$ is the kernel function, which must satisfy the Mercer condition. In a preferred embodiment of the present invention, the Gaussian RBF kernel function is chosen.
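To make the kernel form of the regression function concrete, the sketch below evaluates a Gaussian RBF kernel and a kernel-weighted prediction with NumPy. The parametrization $\exp(-\lVert a - b\rVert^2 / (2\sigma^2))$, the width `sigma`, and the names `rbf_kernel` and `svr_predict` are assumptions for illustration; the coefficients stand for $\alpha_i - \alpha_i^*$ and would come from a solver for the dual problem, which is not shown.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B
    (the 2*sigma^2 parametrization is an assumption)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def svr_predict(X_train, coef, b, X_new, sigma=1.0):
    """f(X) = sum_i coef_i * k(X_i, X) + b, with coef_i standing for alpha_i - alpha_i^*."""
    return rbf_kernel(X_new, X_train, sigma) @ coef + b

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(A, A, sigma=1.0)
```

Only training samples with a nonzero coefficient (the support vectors) contribute to the prediction, which is one reason the method copes well with small sample sets.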
2) BP neural network
In a preferred embodiment of the present invention, a three-layer BP neural network is chosen for modeling. Let the number of input-layer nodes be $l$ and the number of output-layer nodes be $o$; the number of hidden nodes $s$ is determined by an empirical formula.
The output of the $j$-th output-layer node is
$\hat{y}_j = f\left(\sum_{h=1}^{s} w_{hj} b_h - \theta_j\right)$
where $w_{hj}$ is the connection weight from the hidden layer to the output layer, $b_h$ is the value of the hidden node, $\theta_j$ is the threshold of the output-layer node, and $f$ is the activation function.
The output of the $h$-th hidden node is
$b_h = f\left(\sum_{i=1}^{l} v_{ih} x_i - \gamma_h\right)$
where $v_{ih}$ is the connection weight from the input layer to the hidden layer, $x_i$ is the value of the input-layer node, and $\gamma_h$ is the threshold of the hidden node.
According to the back-propagation algorithm, the adjustment formulas for the weights and thresholds are:
$w_{hj} = w_{hj} + \eta g_j b_h$
$v_{ih} = v_{ih} + \eta e_h x_i$
$\theta_j = \theta_j - \eta g_j$
$\gamma_h = \gamma_h - \eta e_h$
where $\eta$ is the learning rate, and $g_j$ and $e_h$ are the gradient terms of the output-layer and hidden-layer nodes, respectively, obtained from the output error by back-propagation.
These formulas are iterated until the mean square error of the network output meets the requirement.
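One iteration of these update rules can be sketched in NumPy as follows. The sigmoid activation and the resulting expressions for the gradient terms are assumptions (they are the usual choice for a squared-error loss, but the text does not fix the activation); the function name `bp_step`, the network sizes, and the data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bp_step(x, y, v, gamma, w, theta, eta=0.1):
    """One back-propagation update for a single sample (sigmoid units ASSUMED).
    Shapes: x (l,), y (o,), v (l, s), gamma (s,), w (s, o), theta (o,)."""
    b = sigmoid(x @ v - gamma)               # hidden-node outputs b_h
    y_hat = sigmoid(b @ w - theta)           # output-node outputs
    g = y_hat * (1 - y_hat) * (y - y_hat)    # output-layer gradient term g_j
    e = b * (1 - b) * (w @ g)                # hidden-layer gradient term e_h
    w = w + eta * np.outer(b, g)             # w_hj <- w_hj + eta * g_j * b_h
    v = v + eta * np.outer(x, e)             # v_ih <- v_ih + eta * e_h * x_i
    theta = theta - eta * g                  # theta_j <- theta_j - eta * g_j
    gamma = gamma - eta * e                  # gamma_h <- gamma_h - eta * e_h
    return v, gamma, w, theta, y_hat

rng = np.random.default_rng(0)
l, s, o = 4, 3, 1
x, y = rng.normal(size=l), np.array([0.8])
v, gamma = rng.normal(scale=0.5, size=(l, s)), np.zeros(s)
w, theta = rng.normal(scale=0.5, size=(s, o)), np.zeros(o)
first_error = None
for _ in range(200):
    v, gamma, w, theta, y_hat = bp_step(x, y, v, gamma, w, theta, eta=0.5)
    if first_error is None:
        first_error = abs(y_hat - y).item()  # error before any update
last_error = abs(y_hat - y).item()
```

In practice the loop would run over all training samples and stop once the mean square error falls below a chosen tolerance, as the text describes.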
3) Partial least squares
When a regression model is built with partial least squares, the principal components of the input and the output are extracted so as to retain as much of their variance as possible, while the correlation between the principal components extracted from X and from Y is maximized. This overcomes the shortcomings of ordinary least squares on regression problems that are high-dimensional and have linearly correlated variables.
Assume that X and Y are the data obtained from the original data after zero-mean, unit-variance scaling. The first pair of principal components $t_1$ and $u_1$ of X and Y are then:
$t_1 = Xc_1$
$u_1 = Yd_1$
where $c_1$ and $d_1$ are coefficient vectors that can be obtained by solving the following optimization problem.
The optimization problem is to maximize the correlation between $t_1$ and $u_1$ while making the respective variances of $t_1$ and $u_1$ as large as possible. Mathematically it can be formalized as:
$\max \ \langle Xc_1, Yd_1 \rangle \quad \text{s.t. } \lVert c_1 \rVert = 1, \ \lVert d_1 \rVert = 1$
This optimization problem can be solved by introducing Lagrange multipliers. The solution is that $c_1$ is the eigenvector corresponding to the largest eigenvalue of the symmetric matrix $X^T Y Y^T X$, and $d_1$ is the eigenvector corresponding to the largest eigenvalue of $Y^T X X^T Y$. The first pair of correlated principal components $t_1$ and $u_1$ is then obtained.
Regression modeling is carried out as follows:
$X = t_1 p_1^T + E$
$Y = u_1 q_1^T + G$
$Y = t_1 r_1^T + F$
For these regression equations, $p_1$, $q_1$, $r_1$ can be calculated by the least squares method:
$p_1 = \dfrac{X^T t_1}{\lVert t_1 \rVert^2}, \quad q_1 = \dfrac{Y^T u_1}{\lVert u_1 \rVert^2}, \quad r_1 = \dfrac{Y^T t_1}{\lVert t_1 \rVert^2}$
Then the residual E of X is taken as the new X and the residual F of Y as the new Y, the second pair of principal components is extracted, and the regression is carried out as before. This is repeated until the residual F meets the requirement or the number of principal components reaches the upper limit, at which point the algorithm ends.
If there are k principal components in total, the original X and Y can finally be expressed as:
$X = t_1 p_1^T + t_2 p_2^T + \dots + t_k p_k^T + E$
$Y = t_1 r_1^T + t_2 r_2^T + \dots + t_k r_k^T + F$
In matrix form:
$X = TP^T + E$
$Y = TR^T + F = XCR^T + F$
where $T = [t_1, t_2, \dots, t_k]$, $P = [p_1, p_2, \dots, p_k]$, $C = [c_1, c_2, \dots, c_k]$, and $R = [r_1, r_2, \dots, r_k]$.
It follows that, once C and R have been obtained in the iterative process, the output value can be estimated with the formula above.
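Extracting the first principal component and deflating X and Y can be sketched as follows, assuming NumPy. The function name `first_pls_component` and the synthetic data are hypothetical; the loadings are the least-squares solutions of the regression equations, and later components would be obtained by calling the function again on the residuals E and F.

```python
import numpy as np

def first_pls_component(X, Y):
    """c1 is the leading eigenvector of X^T Y Y^T X; t1 = X c1; p1 and r1 are
    least-squares loadings; E and F are the residuals used for the next component."""
    sym = X.T @ Y @ Y.T @ X
    eigvals, eigvecs = np.linalg.eigh(sym)   # symmetric matrix, eigenvalues ascending
    c1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    t1 = X @ c1
    p1 = X.T @ t1 / (t1 @ t1)
    r1 = Y.T @ t1 / (t1 @ t1)
    E = X - np.outer(t1, p1)
    F = Y - np.outer(t1, r1)
    return c1, t1, p1, r1, E, F

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit variance
Y = 2.0 * X[:, :1] + 0.1 * rng.normal(size=(20, 1))
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
c1, t1, p1, r1, E, F = first_pls_component(X, Y)
```

Because `p1` and `r1` are least-squares fits, both residuals are orthogonal to `t1`, so each new component explains variation that the previous ones left behind.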
After the data have been completed by the three modeling methods above, the weight of each method's completion result must also be determined so that the results can be integrated. The present invention uses the least squares method to obtain the optimal estimate of the weights of the three modeling methods, specifically:
Denote the completion results of the three modeling methods by $\hat{y}_{\mathrm{PLS}}$, $\hat{y}_{\mathrm{SVM}}$, and $\hat{y}_{\mathrm{BP}}$. Assuming that the actual output is $y$ and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$, then:
$z_1 \hat{y}_{\mathrm{PLS}} + z_2 \hat{y}_{\mathrm{SVM}} + z_3 \hat{y}_{\mathrm{BP}} = y$
Let $X = [\hat{y}_{\mathrm{PLS}}, \hat{y}_{\mathrm{SVM}}, \hat{y}_{\mathrm{BP}}]$ and $z = [z_1, z_2, z_3]^T$; the above formula can then be abbreviated as:
$Xz = y$
The least squares estimate of the weights $z$ is:
$\hat{z} = (X^T X)^{-1} X^T y$
The completion result of the $i$-th submodel is then $\hat{Y}_i = X\hat{z}$.
In a preferred embodiment of the present invention, the completion-effect evaluation index described in S3 is specifically:
1) Precision index
The root-mean-square error is chosen as the precision index; the root-mean-square error of the $i$-th submodel is:
$\mathrm{RMSE}(i) = \sqrt{\dfrac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$
where $\hat{Y}_{ij}$ is the completion result of the $i$-th model for the $j$-th test sample, $Y_j$ is the true value of the $j$-th test sample, and N is the number of test samples;
2) Stability index
To reflect the quality of the model more comprehensively, the stability of the completion results over all samples of the test set must also be measured. The standard deviation of the error is therefore chosen as the stability index, giving the standard deviation $\mathrm{std}(e_i)$ of the error of the $i$-th submodel, where $e_{ij} = |\hat{Y}_{ij} - Y_j|$ is the absolute error of the $i$-th model on the $j$-th test sample and $\bar{e}_i$ is the mean of the error of the $i$-th model;
The two indices $\mathrm{RMSE}(i)$ and $\mathrm{std}(e_i)$ are normalized to the interval [0, 1] with a normalization formula.
The composite index computed from the normalized indices is taken as the score of each submodel, all submodels are ranked by score from high to low, and the mean of the completion results of the S highest-scoring models is taken as the final completion result, where $\mathrm{RMSE}(i)'$ and $\mathrm{std}(e_i)'$ are the normalized precision index and stability index, respectively;
The final result is then:
$\hat{Y} = \dfrac{1}{S}\sum_{s=1}^{S}\hat{Y}_{(s)}$
where $S = \mathrm{floor}(40\% \times M)$, M is the number of submodels, floor() is the round-down function, i.e., it takes the largest integer not greater than the value in the parentheses, and $\hat{Y}_{(1)}, \dots, \hat{Y}_{(S)}$ are the completion outputs of the S highest-scoring submodels.
Specific embodiment:
The data completion method based on selective two-layer ensemble learning provided by this embodiment takes a hydrocracking process as its object. Using the historical data of the process variables of the whole process and the quality indices of the product oil as the original data set, it completes the missing quality indices of the product oil. The hydrocracking process is complex, the measured process variables are numerous, and there are large time lags, so the data set is high-dimensional and the model is strongly nonlinear. Because the sampling frequencies of the process variables and of the product oil quality indices are inconsistent, or because of accidents such as failure of the product oil assay equipment, the quality index data of the product oil are severely missing. Fig. 3 shows the missing data situation of the quality data samples: most quality indices yield only 1 data sample every 12 hours, and some yield only 1 data sample per day. The completion method provided by the present invention can effectively complete such severely missing quality indices. The detailed process is as follows:
Step (1): preprocess the operating parameter data of the hydrocracking process. First, the historical data of 160 measurable process variables are extracted through mechanism analysis; then, according to the fluctuation of each variable's data, variables whose data fluctuate abnormally because of sensor faults are removed, leaving 139 main process variables.
Step (2): standardize the original data set to eliminate the influence of the differing units of the variables. The input sampled data has N samples in total, each with P variables, and $X_{ij}$ ($i = 1, 2, \dots, N$; $j = 1, 2, \dots, P$) is the value of the $j$-th variable of the $i$-th sample, where N = 595 and P = 139.
The standardization formula is:
$X'_{ij} = \dfrac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$
where $E(X_j)$ is the mean of the N sample values of the $j$-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of the N sample values of the $j$-th input variable.
Step (3): from the standardized raw data set, randomly draw, with replacement, a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets.
For each sampling set, randomly select, without replacement, K variables as the feature variables of the training set, where the value of K is determined by an empirical formula; in this embodiment, since P is 139, K = 23 is taken.
This yields the N × 23 training sets (N being the number of training samples) of the M sub-models.
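The sampling of step (3) can be sketched as below, purely as an illustration. The function name `make_training_sets`, the `seed` parameter and the toy data are hypothetical; K is passed in directly because the empirical formula for K is not reproduced here, and carrying the quality-index values `y` along with the drawn rows is an assumption about how the training sets are assembled.

```python
import random


def make_training_sets(x, y, m, k, seed=0):
    """Build m training sets from x (N rows, P columns) and targets y (N values)."""
    rng = random.Random(seed)
    n, p = len(x), len(x[0])
    training_sets = []
    for _ in range(m):
        # Rows: N draws with replacement (bootstrap), so duplicates are expected.
        rows = [rng.randrange(n) for _ in range(n)]
        # Columns: K distinct variables, drawn without replacement.
        cols = sorted(rng.sample(range(p), k))
        sub_x = [[x[r][c] for c in cols] for r in rows]
        sub_y = [y[r] for r in rows]
        training_sets.append((sub_x, sub_y, cols))
    return training_sets


# Toy data: 8 samples, 6 variables; 4 sub-models with 3 variables each.
x = [[float(10 * i + j) for j in range(6)] for i in range(8)]
y = [float(i) for i in range(8)]
sets = make_training_sets(x, y, m=4, k=3)
```

With the embodiment's figures the call would use k=23 on data with 139 columns, giving M training sets of size N × 23.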
Step (4): build regression models using support vector machine (SVM), BP neural network and partial least squares (PLS) respectively, obtaining three groups of completion results, one from each modeling method.
Step (5): obtain the optimal estimate of the weights of the three modeling methods by the least squares method. Assume the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$ and $z_3$ respectively. Writing $\hat{y}_1$, $\hat{y}_2$, $\hat{y}_3$ for the completion results of the three modeling methods, then:
$$z_1 \hat{y}_1 + z_2 \hat{y}_2 + z_3 \hat{y}_3 = y$$
Let $z = [z_1, z_2, z_3]^T$ and $X = [\hat{y}_1, \hat{y}_2, \hat{y}_3]$; then the above formula can be abbreviated as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
Then the completion result of the i-th sub-model is
$$Y_i = X\hat{z}$$
The completion errors of the three modeling methods and the completion error after integration are shown in Table 1.
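The weighting of step (5) can be illustrated with a small standard-library sketch that solves the normal equations $X^T X z = X^T y$ directly. The function names (`solve_linear`, `least_squares_weights`, `combine`) and the toy predictions are hypothetical, and the sketch assumes $X^T X$ is nonsingular, that is, the three methods' outputs are not linearly dependent.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares_weights(preds, y):
    """preds: one list of predictions per modeling method (the columns of X)."""
    k = len(preds)
    xtx = [[sum(a * b for a, b in zip(preds[i], preds[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * t for a, t in zip(preds[i], y)) for i in range(k)]
    return solve_linear(xtx, xty)


def combine(preds, z):
    """Weighted sum of the methods' predictions, i.e. X @ z."""
    return [sum(w * p[j] for w, p in zip(z, preds)) for j in range(len(preds[0]))]


# Toy predictions from three methods and a target built from known weights.
preds = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
y = [0.5 * a + 0.3 * b + 0.2 * c for a, b, c in zip(*preds)]
z = least_squares_weights(preds, y)
fused = combine(preds, z)
```

Because the toy target is an exact weighted sum of the three prediction vectors, the estimated weights recover the weights used to build it; on real data the fit is only approximate.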
Step (6): repeat step (5) M times to obtain the completion results of the M sub-models.
Step (7): based on the completion results of the sub-models obtained in step (6), rank the M sub-models by a score that is a composite index of their precision and stability, and take the average of the S highest-scoring sub-models as the final completion result.
The root-mean-square error is chosen as the precision index; the root-mean-square error of the i-th sub-model is:
$$RMSE(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th test sample, $Y_j$ is the true value of the j-th test sample, and N is the number of test samples.
The standard deviation of the error is chosen as the stability index; for the i-th sub-model it is the standard deviation $std(e_i)$ of the errors $e_{ij}$, where $e_{ij} = |\hat{Y}_{ij} - Y_j|$ is the absolute error of the i-th model on the j-th test sample and $\bar{e}_i$, the mean of the errors of the i-th model, is the value about which the deviation is taken.
The two indices RMSE(i) and $std(e_i)$ are normalized to [0, 1] by a normalization formula.
A composite index of RMSE(i)′ and $std(e_i)$′, the normalized precision index and stability index respectively, serves as the score of each sub-model. All sub-models are sorted by score from high to low, and the mean of the completion results of the S highest-scoring models is taken as the final completion result:
$$Y = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of sub-models, floor() is the round-down function, i.e. it returns the largest integer not greater than the value in brackets, and $Y_{(s)}$ denotes the completion output of the s-th of the S highest-scoring sub-models.
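The selection of step (7) can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's exact formulas: min-max scaling stands in for the normalization formula, the score 1 − (RMSE′ + std′)/2 stands in for the composite index, and the population standard deviation is used; none of these is confirmed by the text. All function names and the toy data are hypothetical.

```python
import math
import statistics


def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def error_std(pred, truth):
    # Assumption: population standard deviation of the absolute errors.
    return statistics.pstdev([abs(p - t) for p, t in zip(pred, truth)])


def min_max(values):
    # Assumption: min-max scaling as the normalization to [0, 1].
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def top_count(m):
    """S = floor(40% x M), computed in integer arithmetic."""
    return (40 * m) // 100


def select_and_average(test_preds, truth, outputs):
    """Score each sub-model on test samples, then average the top S outputs."""
    m = len(test_preds)
    s = top_count(m)
    if s < 1:
        raise ValueError("need at least 3 sub-models so that S >= 1")
    r = min_max([rmse(p, truth) for p in test_preds])
    d = min_max([error_std(p, truth) for p in test_preds])
    # Assumed composite score: higher is better, equal weight on both indices.
    scores = [1.0 - (a + b) / 2.0 for a, b in zip(r, d)]
    top = sorted(range(m), key=lambda i: scores[i], reverse=True)[:s]
    final = [sum(outputs[i][j] for i in top) / s for j in range(len(outputs[0]))]
    return top, final


# Toy data: 5 sub-models scored on 4 test samples, each completing 1 missing value.
truth = [1.0, 2.0, 3.0, 4.0]
test_preds = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [2.0, 1.0, 5.0, 4.0],
    [4.0, 5.0, 6.0, 7.0],
    [1.0, 7.0, 3.0, 9.0],
]
outputs = [[10.0], [11.0], [12.0], [13.0], [14.0]]
top, final = select_and_average(test_preds, truth, outputs)
```

With M = 5 the rule gives S = 2, so only the two best-scoring sub-models contribute to the final value; with M = 50 it gives S = 20.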
As the number of sub-models M increases, the completion precision gradually improves (as shown in Table 2).
Table 2: Number of sub-models and the corresponding error when this embodiment completes the heavy naphtha distillation end point data
It can be seen that when the number of sub-models is small, increasing it noticeably improves the completion precision; when the number of sub-models is sufficiently large, the completion effect levels off.
Finally, to verify the generality of the invention, the steps of this embodiment were also applied to complete other product quality index data of the hydrocracking process, namely diesel flash point, diesel cetane number, aviation kerosene flash point and aviation kerosene initial boiling point. The completion results are shown in Figs. 6-9 and the completion errors in Table 3.
Table 3: Errors of the invention when completing other quality indices of the hydrocracking process
The foregoing scheme can be implemented as computer software: a system for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning, characterized by comprising at least a training set generation module, a completion result generation module and a final completion result determination module:
The training set generation module collects the raw data set and standardizes the data to eliminate the influence of dimension; it generates M sampling sets using the bootstrap sampling method, randomly draws K variables from each sampling set, takes the resulting M sampling sets of size N × K as the training sets of the M sub-models, and passes the training sets to the second module; where N is the number of training samples;
The completion result generation module takes the training data set of each sub-model passed in by the training set generation module and builds regression models using support vector machine, BP neural network and partial least squares respectively, obtaining three groups of completion results from these three regression models of different characteristics; it then computes the weights of the three modeling methods by the least squares method, obtains the completion result of each sub-model by weighting, and finally passes the completion result of each sub-model to the third module;
The final completion result determination module takes the completion results of the sub-models produced by the completion result integration module, evaluates the completion results of the M sub-models according to the proposed completion effect evaluation index, ranks them by evaluation score, takes the average of the S highest-scoring sub-models as the final completion result, and outputs it.
The bootstrap sampling method in the training set generation module is specifically:
Randomly draw, with replacement, from the raw data set a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
For each sampling set, randomly select, without replacement, K variables as the feature variables of the training set, where the value of K is determined by an empirical formula in which P is the total dimension of the sampling set variables, set in advance according to the raw data set.
The completion result of each sub-model in the completion result integration module is calculated as follows:
S1. For each sub-model training set, complete the data using three modeling methods, partial least squares (PLS), support vector machine (SVM) and BP neural network, obtaining three completion results $\hat{y}_1$, $\hat{y}_2$, $\hat{y}_3$;
S2. Estimate the weight of each modeling method by the least squares method, and obtain by weighting the completion result $Y_i$ of the i-th sub-model:
Assume the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$ and $z_3$ respectively; then:
$$z_1 \hat{y}_1 + z_2 \hat{y}_2 + z_3 \hat{y}_3 = y$$
Let $z = [z_1, z_2, z_3]^T$ and $X = [\hat{y}_1, \hat{y}_2, \hat{y}_3]$; then the above formula can be abbreviated as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
Then the completion result of the sub-model is
$$Y_i = X\hat{z}$$
Further, the effect evaluation module performs its evaluation as follows:
S1. Calculate the root-mean-square error of each sub-model as the precision index of the model:
$$RMSE(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S2. Calculate the standard deviation of the error of each sub-model, $std(e_i)$, as the stability index of the model, the deviation being taken about $\bar{e}_i$, the mean of the errors of the i-th model;
S3. Normalize the two indices RMSE(i) and $std(e_i)$ to [0, 1] by a normalization formula;
S4. Take a composite index of RMSE(i)′ and $std(e_i)$′, the normalized precision index and stability index respectively, as the score of each sub-model; sort all sub-models by score from high to low and take the mean of the completion results of the S highest-scoring models as the final completion result:
$$Y = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of sub-models, floor() is the round-down function, i.e. it returns the largest integer not greater than the value in brackets, and $Y_{(s)}$ denotes the completion output of the s-th of the S highest-scoring sub-models.
Claims (9)
1. A method for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning, characterized by comprising the following steps:
S1. Generate training sets:
S11. Collect complex industrial process product quality index data, forming a raw data set with N samples;
S12. Randomly draw, with replacement, from the raw data set a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
S13. For each sampling set, randomly select, without replacement, K variables as the feature variables of the training set, the value of K being determined by an empirical formula in which P is the total dimension of the raw data set, thereby generating M training sets of size N × K;
S2. Build sub-models and generate completion results:
S21. Based on the M training sets, complete the data for each training set using three modeling methods, partial least squares, support vector machine and BP neural network, obtaining three completion results $\hat{y}_1$, $\hat{y}_2$, $\hat{y}_3$ respectively;
S22. Estimate by the least squares method the weights $z_1$, $z_2$, $z_3$ of the three modeling methods, partial least squares, support vector machine and BP neural network, and obtain the completion result of each sub-model by weighted calculation, specifically as follows:
Assume the actual output is y; then
$$z_1 \hat{y}_1 + z_2 \hat{y}_2 + z_3 \hat{y}_3 = y$$
Let $z = [z_1, z_2, z_3]^T$ and $X = [\hat{y}_1, \hat{y}_2, \hat{y}_3]$; then the above formula can be abbreviated as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
Then the completion result of the i-th sub-model is
$$Y_i = X\hat{z}$$
S3. Determine the final completion result:
Based on the completion results of the sub-models, evaluate the completion results of the M sub-models according to the proposed completion effect evaluation index, rank them by evaluation score, and take the average of the S highest-scoring sub-models as the final completion result.
2. The method for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized by further comprising, after S11, a step of standardizing the raw data set to eliminate the influence of the different dimensions of the variables, specifically as follows:
Among the N samples, each sample has P variables; then $X_{ij}$ (i = 1, 2, ..., N; j = 1, 2, ..., P) is the value of the j-th variable of the i-th sample, and the standardization formula is:
$$X'_{ij} = \frac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$$
where $E(X_j)$ is the mean of the N sample values of the j-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of the N sample values of the j-th input variable.
3. The method for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that "according to the proposed completion effect evaluation index" in S3 refers to a composite index of precision and stability, i.e. the M sub-models are ranked by a score given by the composite index of precision and stability, and the average of the several highest-scoring sub-models is taken as the final completion result, specifically calculated as follows:
S31. Calculate the root-mean-square error of each sub-model as the precision index of the model:
$$RMSE(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S32. Calculate the standard deviation of the error of each sub-model, $std(e_i)$, as the stability index of the model, the deviation being taken about $\bar{e}_i$, the mean of the errors of the i-th model;
S33. Normalize the two indices RMSE(i) and $std(e_i)$ to [0, 1] by a normalization formula;
S34. Take a composite index of RMSE(i)′ and $std(e_i)$′, the normalized precision index and stability index respectively, as the score of each sub-model; sort all sub-models by score from high to low and take the mean of the completion results of the S highest-scoring models as the final completion result:
$$Y = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of sub-models, floor() is the round-down function, i.e. it returns the largest integer not greater than the value in brackets, and $Y_{(s)}$ denotes the completion output of the s-th of the S highest-scoring sub-models.
4. The method for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that the "complex industrial process product quality index" refers to the heavy naphtha distillation end point of the hydrocracking process, N is 595, M is 50, and P is 139.
5. The method for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that the "complex industrial process product quality index" refers to the diesel flash point, diesel cetane number, aviation kerosene flash point or aviation kerosene initial boiling point of the hydrocracking process.
6. A system for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning, characterized by comprising at least a training set generation module, a completion result generation module and a final completion result determination module:
The training set generation module collects the raw data set and standardizes the data to eliminate the influence of dimension; it generates M sampling sets using the bootstrap sampling method, randomly draws K variables from each sampling set, takes the resulting M sampling sets of size N × K as the training sets of the M sub-models, and passes the training sets to the second module; where N is the number of training samples;
The completion result generation module takes the training data set of each sub-model passed in by the training set generation module and builds regression models using support vector machine, BP neural network and partial least squares respectively, obtaining three groups of completion results from these three regression models of different characteristics; it then computes the weights of the three modeling methods by the least squares method, obtains the completion result of each sub-model by weighting, and finally passes the completion result of each sub-model to the third module;
The final completion result determination module takes the completion results of the sub-models produced by the completion result integration module, evaluates the completion results of the M sub-models according to the proposed completion effect evaluation index, ranks them by evaluation score, takes the average of the S highest-scoring sub-models as the final completion result, and outputs it.
7. The system for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 6, characterized in that the bootstrap sampling method in the training set generation module is specifically:
Randomly draw, with replacement, from the raw data set a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
For each sampling set, randomly select, without replacement, K variables as the feature variables of the training set, where the value of K is determined by an empirical formula in which P is the total dimension of the sampling set variables, set in advance according to the raw data set.
8. The system for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 6, characterized in that the completion result of each sub-model in the completion result integration module is calculated as follows:
S1. For each sub-model training set, complete the data using three modeling methods, partial least squares (PLS), support vector machine (SVM) and BP neural network, obtaining three completion results $\hat{y}_1$, $\hat{y}_2$, $\hat{y}_3$;
S2. Estimate the weight of each modeling method by the least squares method, and obtain by weighting the completion result $Y_i$ of the i-th sub-model:
Assume the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$ and $z_3$ respectively; then:
$$z_1 \hat{y}_1 + z_2 \hat{y}_2 + z_3 \hat{y}_3 = y$$
Let $z = [z_1, z_2, z_3]^T$ and $X = [\hat{y}_1, \hat{y}_2, \hat{y}_3]$; then the above formula can be abbreviated as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
Then the completion result of the sub-model is
$$Y_i = X\hat{z}$$
9. The system for completing missing data of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 6, characterized in that the effect evaluation module performs its evaluation as follows:
S1. Calculate the root-mean-square error of each sub-model as the precision index of the model:
$$RMSE(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S2. Calculate the standard deviation of the error of each sub-model, $std(e_i)$, as the stability index of the model, the deviation being taken about $\bar{e}_i$, the mean of the errors of the i-th model;
S3. Normalize the two indices RMSE(i) and $std(e_i)$ to [0, 1] by a normalization formula;
S4. Take a composite index of RMSE(i)′ and $std(e_i)$′, the normalized precision index and stability index respectively, as the score of each sub-model; sort all sub-models by score from high to low and take the mean of the completion results of the S highest-scoring models as the final completion result:
$$Y = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of sub-models, floor() is the round-down function, i.e. it returns the largest integer not greater than the value in brackets, and $Y_{(s)}$ denotes the completion output of the s-th of the S highest-scoring sub-models.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810305512.7A CN108490782B (en) | 2018-04-08 | 2018-04-08 | A kind of method and system being suitable for the missing data completion of complex industrial process product quality indicator based on selective double layer integrated study |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108490782A (en) | 2018-09-04 |
CN108490782B (en) | 2019-04-09 |
Family
ID=63314950
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |