CN108490782B - Method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning

Publication number
CN108490782B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810305512.7A
Other languages
Chinese (zh)
Other versions
CN108490782A (en
Inventor
袁小锋
吴东哲
王雅琳
李灵
阳春华
桂卫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention relates to the technical field of industrial process control and discloses a method and system, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes. First, different subsets of the variables of the sampled data are extracted to generate multiple sampling sets, which serve as the training sets of the submodels. Then each submodel is modeled with three methods: support vector machine, BP neural network, and partial least squares. Finally, a completion-effect evaluation index is proposed, the completion effect of each submodel is evaluated, and the submodels with the best completion effect are chosen for selective ensemble. The invention makes full use of all variables of the training samples, achieves a good data completion effect, and helps enterprises carry out targeted optimization of production operations according to the actual operating state of the production process obtained from analysis.

Description

Method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning
Technical field
The invention relates to the technical field of industrial process control, and in particular to a method and system, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes.
Background art
In complex industrial processes, certain quality indexes cannot be measured directly by sensors and must be obtained by manual sampling and offline laboratory analysis. The analysis cycle is long and the quality index data cannot be obtained in real time, so the completion of missing quality index data has long been a focus of attention. Most complex industrial processes have now introduced computer control systems, and the large amount of production process data they measure facilitates the completion of missing data for hard-to-measure quality indexes.
However, data in complex industrial processes often have the following characteristics, which make it difficult for prior-art completion methods to obtain ideal results. First, the control system of a complex industrial process often has hundreds of sensors measuring process variables, so the dimension is very high and the data volume is huge, whereas product quality index data require time-consuming offline analysis and are sampled at a very low frequency. After data preprocessing, therefore, few samples remain available for data completion. Second, the high-dimensional data of industrial systems are often strongly coupled, which can seriously affect parameter estimation and increase model error. Third, industrial processes involve a large number of complex chemical reactions, and the relationships between the various parameters are nonlinear; the relationships between temperature and entropy and between reaction temperature and reaction rate are typical examples. Such nonlinearity makes it very difficult to establish a mathematical model.
Commonly used data completion methods include mean imputation, hot-deck imputation, expectation-maximization imputation, and regression imputation. Because regression imputation uses as much of the information in the data samples as possible, most research concentrates on it. However, because complex industrial process data are high-dimensional, nonlinear, and strongly coupled, the accuracy of regression imputation may be unstable. Moreover, because few samples are available for data completion in complex industrial processes, a completion model built from only a small number of data samples may underfit.
Summary of the invention
The technical problem to be solved by the invention is that complex industrial processes have numerous process variables, strong coupling between variables, and large data fluctuations. To address these difficulties, the invention proposes a method and system, based on selective two-layer ensemble learning, for completing missing product quality index data in complex industrial processes:
A method for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning, characterized by comprising the following steps:
S1. Generate the training sets:
S11. Collect the product quality index data of the complex industrial process to form a raw data set with N samples;
S12. Randomly draw, with replacement, a sampling set of the same size as the raw data set from the raw data set; repeat M times to obtain M mutually independent sampling sets;
S13. From each sampling set, randomly draw K variables without replacement as the feature variables of the training set. The value of K is determined by an empirical formula in P, where P is the total dimension of the raw data set. This yields M training sets of size N × K;
S2. Build the submodels and generate the completion results:
S21. For each of the M training sets, complete the data with three modeling methods, partial least squares, support vector machine, and BP neural network, obtaining three completion results, denoted $\hat{y}_{PLS}$, $\hat{y}_{SVM}$, and $\hat{y}_{BP}$;
S22. Use least squares to estimate the weights $z_1$, $z_2$, $z_3$ of the partial least squares, support vector machine, and BP neural network modeling methods, and obtain the completion result of each submodel by weighting, calculated as follows:
Assuming the actual output is y, then
$$y = z_1\hat{y}_{PLS} + z_2\hat{y}_{SVM} + z_3\hat{y}_{BP}$$
Let $X = [\hat{y}_{PLS}, \hat{y}_{SVM}, \hat{y}_{BP}]$ and $z = [z_1, z_2, z_3]^T$; the above can then be written compactly as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
The completion result of the i-th submodel is then
$$\hat{Y}_i = X\hat{z}$$
S3. Determine the final completion result:
Based on the completion result of each submodel, evaluate the completion results of the M submodels according to the proposed completion-effect evaluation index, rank them by evaluation score, and take the average of the S highest-scoring submodels as the final completion result.
Further, after S11 the method includes a step of standardizing the raw data set to eliminate the influence of the differing units of the variables, as follows:
Let each of the N samples have P variables, so that $X_{ij}$ (i = 1, 2, …, N; j = 1, 2, …, P) is the value of the j-th variable of the i-th sample. The standardization formula is:
$$X'_{ij} = \frac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$$
where $E(X_j)$ is the mean of the N sample values of the j-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of the N sample values of the j-th input variable.
Further, the "proposed completion-effect evaluation index" in S3 is a composite index of accuracy and stability. The M submodels are ranked by their scores on this composite index, and the average of the highest-scoring submodels is taken as the final completion result, calculated as follows:
S31. Compute the root-mean-square error of each submodel as its accuracy index:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S32. Compute the standard deviation of each submodel's errors, $\mathrm{std}(e_i)$, as its stability index, where $\bar{e}_i$ is the mean of the errors of the i-th model;
S33. Normalize the two indexes RMSE(i) and $\mathrm{std}(e_i)$ to the interval [0, 1];
S34. Take the composite index of the normalized accuracy index RMSE(i)' and the normalized stability index $\mathrm{std}(e_i)'$ as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result.
The final result is then:
$$\hat{Y} = \frac{1}{S}\sum_{s=1}^{S}\hat{Y}_{(s)}$$
where S = floor(40% × M), M is the number of submodels, floor(·) is the round-down function, i.e. the largest integer not exceeding the value in parentheses, and $\hat{Y}_{(1)}, \ldots, \hat{Y}_{(S)}$ are the completion outputs of the S highest-scoring submodels.
Further, the "complex industrial process product quality index" is the final boiling point of heavy naphtha in a hydrocracking process, with N = 595, M = 50, and P = 139.
Further, the "complex industrial process product quality index" is the diesel flash point, the diesel cetane number, the aviation kerosene flash point, or the aviation kerosene initial boiling point of the hydrocracking process.
A system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning, characterized by comprising at least a training set generation module, a completion result generation module, and a final completion result determination module:
The training set generation module collects the raw data set and standardizes the data to eliminate the influence of differing units. It generates M sampling sets by the bootstrap sampling method, randomly draws K variables from each sampling set, takes the resulting M sampling sets of size N × K as the training sets of the M submodels, and sends the training sets to the second module, where N is the number of training samples;
The completion result generation module takes the training data set of each submodel passed in by the training set generation module and builds regression models with support vector machine, BP neural network, and partial least squares, obtaining three groups of completion results from these three regression models of different character. It then computes the weights of the three modeling methods by least squares, obtains the completion result of each submodel by weighting, and passes the completion result of each submodel to the third module;
The final completion result determination module takes the completion results of the submodels produced by the completion result integration module, evaluates the completion results of the M submodels according to the proposed completion-effect evaluation index, ranks them by evaluation score, takes the average of the S highest-scoring submodels as the final completion result, and outputs it.
Further, the bootstrap sampling method in the training set generation module is specifically:
Randomly draw, with replacement, a sampling set of the same size as the raw data set from the raw data set; repeat M times to obtain M mutually independent sampling sets;
From each sampling set, randomly draw K variables without replacement as the feature variables of the training set. The value of K is determined by an empirical formula in P, where P is the total dimension of the sampling set variables and is set in advance according to the raw data set.
Further, the completion result of each submodel in the completion result integration module is calculated as follows:
S1. For the training set of each submodel, complete the data with three modeling methods, partial least squares (PLS), support vector machine (SVM), and BP neural network, obtaining three completion results, denoted $\hat{y}_{PLS}$, $\hat{y}_{SVM}$, and $\hat{y}_{BP}$;
S2. Estimate the weight of each modeling method by least squares and obtain the completion result $\hat{Y}_i$ of the i-th submodel by weighting:
Assuming the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$, then:
$$y = z_1\hat{y}_{PLS} + z_2\hat{y}_{SVM} + z_3\hat{y}_{BP}$$
Let $X = [\hat{y}_{PLS}, \hat{y}_{SVM}, \hat{y}_{BP}]$ and $z = [z_1, z_2, z_3]^T$; the above can then be written compactly as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
The completion result of the submodel is then
$$\hat{Y}_i = X\hat{z}$$
Further, the effect evaluation module performs the evaluation as follows:
S1. Compute the root-mean-square error of each submodel as its accuracy index:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S2. Compute the standard deviation of each submodel's errors, $\mathrm{std}(e_i)$, as its stability index, where $\bar{e}_i$ is the mean of the errors of the i-th model;
S3. Normalize the two indexes RMSE(i) and $\mathrm{std}(e_i)$ to the interval [0, 1];
S4. Take the composite index of the normalized accuracy index RMSE(i)' and the normalized stability index $\mathrm{std}(e_i)'$ as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result.
The final result is then:
$$\hat{Y} = \frac{1}{S}\sum_{s=1}^{S}\hat{Y}_{(s)}$$
where S = floor(40% × M), M is the number of submodels, floor(·) is the round-down function, i.e. the largest integer not exceeding the value in parentheses, and $\hat{Y}_{(1)}, \ldots, \hat{Y}_{(S)}$ are the completion outputs of the S highest-scoring submodels.
Compared with existing methods, the beneficial effects of the invention are as follows. In the proposed data completion method and system based on selective two-layer ensemble learning, the first layer extracts different subsets of the variables of the sampled data to build local models and ensembles them. This ensures that every local model has enough training samples, which addresses the difficulty of few samples and high dimension described above. In the second layer, each local model is built with three methods: support vector machine (SVM), BP neural network, and partial least squares (PLS). Support vector regression is a practical algorithm for small samples, the BP neural network has good nonlinear approximation ability, and partial least squares regression handles the problems caused by coupling among the input variables. Integrating the results of the three modeling methods ensures the stability of the algorithm on data completion problems in complex industrial processes and gives it strong generalization ability. Finally, a completion-effect evaluation index is proposed, the completion effect of each submodel is evaluated, and the submodels with the best completion effect are chosen for selective ensemble, which further improves completion accuracy. This gives enterprises a more reliable basis for online analysis of the overall operating condition of the industrial process and helps them adjust production according to the analysis results, reduce wasted resources, and improve production efficiency. The proposed completion method obtains a good completion effect even when the input variable dimension is very high and data samples are few.
Brief description of the drawings
Fig. 1 is the overall flowchart of the data completion method based on selective two-layer ensemble learning according to an embodiment of the invention.
Fig. 2 is a structural diagram of the data completion method based on selective two-layer ensemble learning.
Fig. 3 shows the missing-data situation of the hydrocracking process product quality index data samples.
Fig. 4 shows the test-set completion results for the heavy naphtha final boiling point obtained with the data completion method based on selective two-layer ensemble learning.
Fig. 5 compares the completion errors of the S submodels selected by the proposed evaluation index for the heavy naphtha final boiling point with the completion error after ensembling.
Fig. 6 shows the test-set completion results for the diesel flash point.
Fig. 7 shows the test-set completion results for the diesel cetane number.
Fig. 8 shows the test-set completion results for the aviation kerosene initial boiling point.
Fig. 9 shows the test-set completion results for the aviation kerosene flash point.
Fig. 10 compares, on the diesel flash point test set, the completion errors of the S selected submodels with the completion error after ensembling.
Fig. 11 compares, on the diesel cetane number test set, the completion errors of the S selected submodels with the completion error after ensembling.
Fig. 12 compares, on the aviation kerosene initial boiling point test set, the completion errors of the S selected submodels with the completion error after ensembling.
Fig. 13 compares, on the aviation kerosene flash point test set, the completion errors of the S selected submodels with the completion error after ensembling.
Detailed description of the embodiments
To disclose the invention fully, specific embodiments of the invention are described in further detail below:
The invention makes full use of all input variables, so that a good completion result can be obtained even when the input variable dimension is very high. The proposed method is as follows:
A method for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning, characterized by comprising the following steps:
S1. Generate the training sets:
S11. Collect the product quality index data of the complex industrial process to form a raw data set with N samples;
S12. Randomly draw, with replacement, a sampling set of the same size as the raw data set from the raw data set; repeat M times to obtain M mutually independent sampling sets;
S13. From each sampling set, randomly draw K variables without replacement as the feature variables of the training set. The value of K is determined by an empirical formula in P, where P is the total dimension of the raw data set. This yields M training sets of size N × K;
S2. Build the submodels and generate the completion results:
S21. For each of the M training sets, complete the data with three modeling methods, partial least squares, support vector machine, and BP neural network, obtaining three completion results, denoted $\hat{y}_{PLS}$, $\hat{y}_{SVM}$, and $\hat{y}_{BP}$;
S22. Use least squares to estimate the weights $z_1$, $z_2$, $z_3$ of the partial least squares, support vector machine, and BP neural network modeling methods, and obtain the completion result of each submodel by weighting, calculated as follows:
Assuming the actual output is y, then
$$y = z_1\hat{y}_{PLS} + z_2\hat{y}_{SVM} + z_3\hat{y}_{BP}$$
Let $X = [\hat{y}_{PLS}, \hat{y}_{SVM}, \hat{y}_{BP}]$ and $z = [z_1, z_2, z_3]^T$; the above can then be written compactly as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
The completion result of the i-th submodel is then
$$\hat{Y}_i = X\hat{z}$$
S3. Determine the final completion result:
Based on the completion result of each submodel, evaluate the completion results of the M submodels according to the proposed completion-effect evaluation index, rank them by evaluation score, and take the average of the S highest-scoring submodels as the final completion result.
In a preferred embodiment of the invention, before the bootstrap sampling of S1 is applied, the raw data set is first standardized to eliminate the influence of the differing units of the variables. Let the input sampled data have N samples, each with P variables, so that $X_{ij}$ (i = 1, 2, …, N; j = 1, 2, …, P) is the value of the j-th variable of the i-th sample.
The standardization formula is:
$$X'_{ij} = \frac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$$
where $E(X_j)$ is the mean of the N sample values of the j-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of the N sample values of the j-th input variable.
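The standardization step can be illustrated with a minimal NumPy sketch. The function name `standardize` and the toy data are hypothetical, and the population standard deviation (`ddof=0`, NumPy's default) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def standardize(X):
    """Column-wise standardization: (X_ij - E(X_j)) / Std(X_j)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # assumption: population std (ddof=0)
    std[std == 0.0] = 1.0        # guard against constant variables
    return (X - mean) / std, mean, std

# Toy data: 3 samples, 2 variables on very different scales.
X_raw = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
X_std, mean, std = standardize(X_raw)
```

After this step both columns have zero mean and unit standard deviation, so variables measured in different units contribute on the same scale. The returned `mean` and `std` would be reused to standardize new samples.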
Then, from the standardized raw data set, a sampling set of the same size as the raw data set is randomly drawn with replacement; this is repeated M times to obtain M mutually independent sampling sets;
From each sampling set, K variables are randomly drawn without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula.
This yields the N × K training sets of the M submodels.
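The two sampling steps (rows with replacement, variables without replacement) can be sketched as follows. The function name `make_training_sets` and the random placeholder data are hypothetical; K is passed in directly because the empirical formula for K is not reproduced here. The sizes N = 595, P = 139, M = 50, and K = 23 are the values stated elsewhere in this document for the heavy naphtha embodiment.

```python
import numpy as np

def make_training_sets(X, y, M, K, seed=0):
    """Build M training sets: bootstrap N rows, then pick K of the P columns."""
    rng = np.random.default_rng(seed)
    N, P = X.shape
    training_sets = []
    for _ in range(M):
        rows = rng.integers(0, N, size=N)            # rows drawn with replacement
        cols = rng.choice(P, size=K, replace=False)  # variables drawn without replacement
        training_sets.append((X[np.ix_(rows, cols)], y[rows], cols))
    return training_sets

# Placeholder data with the dimensions of the embodiment.
rng = np.random.default_rng(1)
X = rng.normal(size=(595, 139))
y = rng.normal(size=595)
sets = make_training_sets(X, y, M=50, K=23)
```

Each entry keeps the selected column indices so that the same variables can be picked out of new samples when the submodel is later applied.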
S2 uses three modeling methods in an ensemble; they are described further below:
1) Support vector regression model
By constructing a loss function and following the structural risk minimization principle, the support vector machine usually determines the regression function by solving a minimization problem.
In that problem, ω is the weight vector, the term in ω expresses the model complexity, μ is the penalty constant, $\xi_i$ and $\xi_i^*$ are slack variables, φ(x) is the nonlinear transformation that maps the data into a high-dimensional space, b is the bias, and ε is the upper bound on the error.
Introducing the Lagrange multipliers $\alpha_i$ and $\alpha_i^*$, the above optimization model is converted into its dual optimization problem.
Solving the dual problem yields the support vector regression function.
In that function, $k(X_i, X)$ is the kernel function, which must satisfy the Mercer condition. In a preferred embodiment of the invention, the Gaussian RBF kernel function is chosen.
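For orientation, a Gaussian RBF kernel can be computed as in the sketch below. The parameterization $\exp(-\lVert x - x' \rVert^2 / (2\sigma^2))$ and the width parameter `sigma` are assumptions, since the kernel formula itself is not reproduced in this text; the function name `rbf_kernel` is hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix, assuming k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # Squared Euclidean distances between every row of A and every row of B.
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    sq_dist = np.maximum(sq_dist, 0.0)   # clip tiny negatives from round-off
    return np.exp(-sq_dist / (2.0 * sigma**2))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_mat = rbf_kernel(points, points)
```

The resulting matrix is symmetric with ones on the diagonal, and its entries decay toward zero as samples move apart, which is what lets the regression function capture nonlinear relationships.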
2) BP neural network
In a preferred embodiment of the invention, a three-layer BP neural network is used. Let the number of input-layer nodes be l and the number of output-layer nodes be o; the number of hidden nodes s is determined by an empirical formula.
In the output of an output-layer node, $w_{hj}$ is the connection weight from the hidden layer to the output layer, $b_h$ is the value of the hidden node, and $\theta_j$ is the threshold of the output-layer node.
In the output of a hidden node, $v_{ih}$ is the connection weight from the input layer to the hidden layer, $x_i$ is the input-layer value, and $\gamma_h$ is the threshold of the hidden node.
The back-propagation algorithm gives the following update formulas for the weights and thresholds:
$$w_{hj} \leftarrow w_{hj} + \eta g_j b_h$$
$$v_{ih} \leftarrow v_{ih} + \eta e_h x_i$$
$$\theta_j \leftarrow \theta_j - \eta g_j$$
$$\gamma_h \leftarrow \gamma_h - \eta e_h$$
where η is the learning rate, and $g_j$ and $e_h$ are the gradient terms of the output-layer and hidden-layer nodes, respectively.
These formulas are iterated until the mean square error of the network output meets the requirement.
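The update rules can be traced in a single training step, sketched below. This is only an illustration under an assumption: sigmoid activations are used in both layers, which fixes the gradient terms as $g_j = \hat{y}_j(1-\hat{y}_j)(y_j-\hat{y}_j)$ and $e_h = b_h(1-b_h)\sum_j w_{hj} g_j$; the text does not state the activation function. The name `bp_step` and the toy sample are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bp_step(x, y, v, gamma, w, theta, eta=0.5):
    """One back-propagation update for a single sample (sigmoid units assumed)."""
    b = sigmoid(x @ v - gamma)                 # hidden outputs b_h
    y_hat = sigmoid(b @ w - theta)             # network outputs
    g = y_hat * (1.0 - y_hat) * (y - y_hat)    # output-layer gradient term g_j
    e = b * (1.0 - b) * (w @ g)                # hidden-layer gradient term e_h
    w_new = w + eta * np.outer(b, g)           # w_hj <- w_hj + eta * g_j * b_h
    v_new = v + eta * np.outer(x, e)           # v_ih <- v_ih + eta * e_h * x_i
    return v_new, gamma - eta * e, w_new, theta - eta * g, y_hat

rng = np.random.default_rng(0)
l, s, o = 3, 4, 1                              # input, hidden, output node counts
v, gamma = rng.normal(size=(l, s)), np.zeros(s)
w, theta = rng.normal(size=(s, o)), np.zeros(o)
x, y = np.array([0.2, -0.1, 0.4]), np.array([0.8])

first_pred = None
for _ in range(500):
    v, gamma, w, theta, pred = bp_step(x, y, v, gamma, w, theta)
    if first_pred is None:
        first_pred = pred
last_pred = pred
```

Repeating the step drives the prediction for this sample toward its target; a real training loop would cycle over all training samples and stop once the mean square error is small enough.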
3) Partial least squares
When a regression model is built with partial least squares, the principal components extracted from the input and from the output should each capture as much variance as possible, and the correlation between the principal components extracted from X and from Y should be maximized. This overcomes the shortcomings of ordinary least squares on regression problems with high dimension and linearly correlated variables.
Let X and Y be the data obtained from the raw data after zero-mean, unit-variance scaling. The first pair of principal components $t_1$ and $u_1$ of X and Y are then:
$$t_1 = Xc_1$$
$$u_1 = Yd_1$$
where $c_1$ and $d_1$ are coefficient vectors, obtained by solving the following optimization problem.
The optimization problem is to maximize the correlation between $t_1$ and $u_1$ while making the variance of each of $t_1$ and $u_1$ as large as possible. It can be formalized mathematically as:
$$\max\ \langle Xc_1, Yd_1 \rangle$$
This optimization problem can be solved by introducing Lagrange multipliers. The solution is that $c_1$ is the eigenvector of the symmetric matrix $X^T Y Y^T X$ corresponding to its largest eigenvalue, and $d_1$ is the eigenvector of $Y^T X X^T Y$ corresponding to its largest eigenvalue. The first pair of principal components $t_1$ and $u_1$ then follows.
The regression models are:
$$X = t_1 p_1^T + E$$
$$Y = u_1 q_1^T + G$$
$$Y = t_1 r_1^T + F$$
For these regression equations, $p_1$, $q_1$, and $r_1$ can be computed by least squares:
$$p_1 = \frac{X^T t_1}{t_1^T t_1}, \quad q_1 = \frac{Y^T u_1}{u_1^T u_1}, \quad r_1 = \frac{Y^T t_1}{t_1^T t_1}$$
Then the residual E of X is taken as the new X and the residual F of Y as the new Y, a second pair of principal components is extracted, and the regression is carried out as before. This is repeated until the residual F meets the requirement or the number of principal components reaches its upper limit, at which point the algorithm ends.
If there are k principal components in total, the original X and Y can finally be expressed as:
$$X = t_1 p_1^T + t_2 p_2^T + \cdots + t_k p_k^T + E$$
$$Y = t_1 r_1^T + t_2 r_2^T + \cdots + t_k r_k^T + F$$
In matrix form:
$$X = TP^T + E$$
$$Y = TR^T + F = XCR^T + F$$
where $T = [t_1, t_2, \ldots, t_k]$, $P = [p_1, p_2, \ldots, p_k]$, $C = [c_1, c_2, \ldots, c_k]$, and $R = [r_1, r_2, \ldots, r_k]$.
It follows that once C and R have been obtained in the iterative process, the output value can be estimated with the above formula.
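The extraction of the first pair of principal components can be checked numerically with the sketch below. It follows the eigenvector characterization given above; the function name `first_pls_pair` and the synthetic data are hypothetical, and unit-norm coefficient vectors are an assumption.

```python
import numpy as np

def first_pls_pair(X, Y):
    """First PLS components: c1 is the top eigenvector of X^T Y Y^T X."""
    M = X.T @ Y @ Y.T @ X                     # symmetric P x P matrix
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    c1 = eigvecs[:, -1]                       # unit eigenvector of the largest eigenvalue
    d1 = Y.T @ X @ c1                         # proportional to top eigenvector of Y^T X X^T Y
    d1 = d1 / np.linalg.norm(d1)
    return c1, d1, X @ c1, Y @ d1

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = X @ rng.normal(size=(5, 1)) + 0.1 * rng.normal(size=(50, 1))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

c1, d1, t1, u1 = first_pls_pair(X, Y)
p1 = X.T @ t1 / (t1 @ t1)                     # least squares loading of X on t1
r1 = Y.T @ t1 / (t1 @ t1)                     # least squares loading of Y on t1
E = X - np.outer(t1, p1)                      # residuals that become the new X
F = Y - np.outer(t1, r1)                      # residuals that become the new Y
```

Because `p1` and `r1` are least squares loadings, the residuals `E` and `F` are orthogonal to `t1`, which is why the next pair of components extracted from them carries new information.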
After the data have been completed by the three modeling methods, the weight of each method's completion result must also be determined so that the results can be integrated. The invention obtains the optimal estimate of the three weights by least squares, specifically:
Denote the completion results of the three modeling methods by $\hat{y}_{PLS}$, $\hat{y}_{SVM}$, and $\hat{y}_{BP}$. Assuming the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$, then:
$$y = z_1\hat{y}_{PLS} + z_2\hat{y}_{SVM} + z_3\hat{y}_{BP}$$
Let $X = [\hat{y}_{PLS}, \hat{y}_{SVM}, \hat{y}_{BP}]$ and $z = [z_1, z_2, z_3]^T$; the above can then be written compactly as:
$$Xz = y$$
The least squares estimate of the weight z is:
$$\hat{z} = (X^T X)^{-1} X^T y$$
The completion result of the i-th submodel is then
$$\hat{Y}_i = X\hat{z}$$
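The least squares weighting of the three completion results can be sketched as follows. The function name `combine_submodel` and the synthetic predictions are hypothetical; `np.linalg.lstsq` is used rather than forming an explicit inverse, which gives the same least squares solution when the three prediction columns are linearly independent.

```python
import numpy as np

def combine_submodel(preds, y):
    """Estimate weights z for the columns [y_pls, y_svm, y_bp] so that preds @ z ~ y."""
    z, *_ = np.linalg.lstsq(preds, y, rcond=None)
    return z, preds @ z

# Synthetic predictions of the three methods on 100 samples (placeholder data).
rng = np.random.default_rng(0)
preds = rng.normal(size=(100, 3))
true_z = np.array([0.2, 0.5, 0.3])
y = preds @ true_z                     # noise-free target for illustration

z_hat, y_combined = combine_submodel(preds, y)
```

With a noise-free target the estimate recovers the generating weights. With real data the weights simply minimize the squared error of the combined prediction on the samples used for fitting.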
In a preferred embodiment of the invention, the completion-effect evaluation index of S3 is specifically:
1) Accuracy index
The root-mean-square error is chosen as the accuracy index. The root-mean-square error of the i-th submodel is:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^2}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th test sample, $Y_j$ is the true value of the j-th test sample, and N is the number of test samples;
2) Stability index
To reflect the quality of a model more fully, the stability of the completion results over all samples of the test set must also be measured. The standard deviation of the errors, $\mathrm{std}(e_i)$, is therefore chosen as the stability index of the i-th submodel,
where $e_{ij}$ is the absolute value of the error of the i-th model on the j-th test sample and $\bar{e}_i$ is the mean of the errors of the i-th model;
The two indexes RMSE(i) and $\mathrm{std}(e_i)$ are normalized to the interval [0, 1].
The composite index of the normalized accuracy index RMSE(i)' and the normalized stability index $\mathrm{std}(e_i)'$ is taken as the score of each submodel. All submodels are sorted by score from high to low, and the mean of the completion results of the S highest-scoring models is taken as the final completion result.
The final result is then:
$$\hat{Y} = \frac{1}{S}\sum_{s=1}^{S}\hat{Y}_{(s)}$$
where S = floor(40% × M), M is the number of submodels, floor(·) is the round-down function, i.e. the largest integer not exceeding the value in parentheses, and $\hat{Y}_{(1)}, \ldots, \hat{Y}_{(S)}$ are the completion outputs of the S highest-scoring submodels.
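The two raw indexes and the selective averaging can be sketched as below. The helper names are hypothetical. The normalization and composite score formulas are not reproduced in this text, so the scores are taken as an input rather than computed; the population standard deviation (`ddof=0`) is likewise an assumption.

```python
import numpy as np

def rmse(pred, y):
    """Accuracy index: root-mean-square error of one submodel."""
    return float(np.sqrt(np.mean((pred - y) ** 2)))

def error_std(pred, y):
    """Stability index: standard deviation of the absolute errors (ddof=0 assumed)."""
    return float(np.std(np.abs(pred - y)))

def select_and_average(preds, scores):
    """Average the completion results of the S = floor(40% * M) highest-scoring submodels."""
    M = preds.shape[0]
    S = (40 * M) // 100                      # integer form of floor(0.4 * M)
    if S == 0:
        raise ValueError("need at least 3 submodels so that S >= 1")
    top = np.argsort(-np.asarray(scores))[:S]  # indices of the S best scores
    return preds[top].mean(axis=0), top

# Five submodels, two test samples each; scores are placeholder values.
preds = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]])
scores = [0.1, 0.9, 0.5, 0.8, 0.2]
final, chosen = select_and_average(preds, scores)
```

With M = 5 the rule keeps S = 2 submodels, here the second and fourth, and the final completion result is the mean of their outputs. With the M = 50 submodels of the embodiment it would keep S = 20.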
Specific embodiment:
The Supplementing Data method based on selective double layer integrated study that the embodiment provides is pair to be hydrocracked process As using the historical data of whole process process variable and the quality index of product oil as initial data set, to the product oil of missing Quality index carries out completion.It is hydrocracked flow process complexity, the process variable of detection is numerous, and there are biggish time lags, causes Data set dimension is high, and model is with very strong non-linear.Not due to the quality index sample frequency of process variable and product oil Unanimously or there are the accidents such as product oiling experiment device failure, so that the quality index data missing of product oil is serious.Fig. 5 The deletion condition of qualitative data sample is illustrated, from figure 5 it can be seen that most of quality index just obtains 1 number for 12 hours According to sample, there are also some quality index, even 1 talent obtains 1 data sample.Complementing method provided by the invention can be to this The quality index of kind serious loss carries out effectively completion, and detailed process is as follows:
Step (1): preprocess the operating parameter data of the hydrocracking process. First, the historical data of 160 measurable process variables are extracted by mechanism analysis; then, according to the fluctuation of each variable's data, the variables whose data fluctuate abnormally because of sensor faults are removed, leaving 139 preliminary process variables.
Step (2): standardize the raw data set to eliminate the influence of the different dimensions (units) of the variables. Let the input sampled data contain N samples, each with P variables; then $X_{ij}$ (i = 1, 2, ..., N; j = 1, 2, ..., P) is the sample value of the j-th variable of the i-th sample. Here N = 595 and P = 139.
The standardization formula is:
$$X'_{ij} = \frac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$$
where $E(X_j)$ is the mean of the N sample values of the j-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of those N sample values.
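As a non-limiting illustration, the standardization of step (2) may be sketched as follows. The function name `standardize` and the list-of-rows data layout are assumptions made for this example, and the population form of the standard deviation is likewise assumed, since the text does not state whether Std(X_j) divides by N or N - 1.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Column-wise standardization: (X_ij - E(X_j)) / Std(X_j).

    rows: list of samples, each a list of P variable values.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation (divide by N).
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

After this step every variable has zero mean and unit spread, so variables measured in different units contribute on a comparable scale.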
Step (3): from the standardized raw data set, randomly draw with replacement a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
For each sampling set, randomly select K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula: In this embodiment, since P is 139, K = 23 is taken.
This yields the M training sets of size N × 23 for the M submodels (N being the number of training samples).
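By way of example only, the two-level sampling of step (3) (rows drawn with replacement, variables drawn without replacement) may be sketched as below. The function name `make_training_sets`, the `seed` parameter and the returned tuple layout are hypothetical choices for this sketch; K is passed in directly because the empirical formula for K is not reproduced here.

```python
import random


def make_training_sets(data, m, k, seed=0):
    """Build m training sets from `data` (a list of N rows with P values each).

    Each training set is a tuple (row_indices, feature_indices, subset):
    N row indices drawn WITH replacement, k column indices drawn WITHOUT
    replacement, and the resulting N x k sub-table.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    n, p = len(data), len(data[0])
    training_sets = []
    for _ in range(m):
        row_indices = [rng.randrange(n) for _ in range(n)]   # with replacement
        feature_indices = sorted(rng.sample(range(p), k))    # without replacement
        subset = [[data[i][j] for j in feature_indices] for i in row_indices]
        training_sets.append((row_indices, feature_indices, subset))
    return training_sets
```

Keeping `row_indices` alongside each subset allows the corresponding quality-index values to be aligned with the resampled rows when the submodels are trained.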
Step (4): build regression models using support vector machine (SVM), BP neural network and partial least squares (PLS), respectively, obtaining three groups of completion results, one for each modeling method.
Step (5): obtain the optimal estimate of the weights of the three modeling methods by the least-squares method. Assume the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$; then y is expressed as the sum of the three completion results weighted by $z_1$, $z_2$ and $z_3$.
Let $z = [z_1, z_2, z_3]^T$ and let X be the matrix whose columns are the three completion results; the above can then be abbreviated as:
$$Xz = y$$
The least-squares estimate of the weight z is:
$$\hat{z} = (X^{T}X)^{-1}X^{T}y$$
Then the completion result of the i-th submodel is:
$$Y_i = X\hat{z}$$
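As a non-limiting sketch, the least-squares weighting of the three modeling methods may be carried out by solving the normal equations, as shown below. The helper names `solve_linear`, `fit_weights` and `combine` are hypothetical, and the small Gaussian-elimination solver is used only to keep the example self-contained; a numerical library routine could equally be used.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (collinear completion results)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_weights(preds, y):
    """Least-squares weights z from the normal equations X^T X z = X^T y.

    preds: one row per sample, holding the three base-model outputs.
    """
    k = len(preds[0])
    xtx = [[sum(r[a] * r[b] for r in preds) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * t for r, t in zip(preds, y)) for a in range(k)]
    return solve_linear(xtx, xty)


def combine(preds, z):
    """Weighted submodel output: each row of X multiplied by z."""
    return [sum(w * p for w, p in zip(z, row)) for row in preds]
```

In use, `fit_weights` would be applied to samples whose true quality index is known, and `combine` would then produce the submodel output for the samples to be completed.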
The completion errors of the three modeling methods and the completion error after integration are shown in Table 1.
Step (6): repeat step (5) M times to obtain the completion results of the M submodels.
Step (7): based on the completion results of the submodels obtained in step (6), rank the M submodels by a score given by a comprehensive index of accuracy and stability, and take the average of the S highest-scoring submodels as the final completion result.
The root-mean-square error is chosen as the accuracy index; the root-mean-square error of the i-th submodel is:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^{2}}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th test sample, $Y_j$ is the true value of the j-th test sample, and N is the number of test samples;
The standard deviation of the error is chosen as the stability index; the standard deviation of the error of the i-th submodel is:
where $e_{ij}$ denotes the absolute value of the error of the i-th model on the j-th test sample, and $\bar{e}_i$ is the mean of the error of the i-th model;
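For illustration only, the accuracy and stability indices of one submodel may be computed as sketched below. The function names `rmse` and `error_std` are hypothetical, and the population form of the standard deviation (division by N) is an assumption of this sketch rather than a statement of the formula used in the invention.

```python
from math import sqrt
from statistics import pstdev


def rmse(pred, true):
    """Root-mean-square error between completion results and true values."""
    squared = [(p - t) ** 2 for p, t in zip(pred, true)]
    return sqrt(sum(squared) / len(squared))


def error_std(pred, true):
    """Standard deviation of the absolute errors of one submodel.

    Assumption: population standard deviation (divide by N).
    """
    abs_err = [abs(p - t) for p, t in zip(pred, true)]
    return pstdev(abs_err)
```

A submodel with a low RMSE but a high error standard deviation is accurate on average yet erratic across samples, which is why the two indices are considered together.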
The two indices RMSE(i) and std(e_i) are normalized to the interval [0, 1] by the following normalization formula:
The comprehensive index determined by the following formula is taken as the score of each submodel; all submodels are sorted by score from high to low, and the mean of the completion results of the S highest-scoring models is taken as the final completion result:
where RMSE(i)' and std(e_i)' are the normalized accuracy index and stability index, respectively;
The final result is then:
$$Y_{\mathrm{final}} = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e. it returns the largest integer not exceeding the value in parentheses; $Y_{(1)}, \ldots, Y_{(S)}$ denote the completion output values of the S highest-scoring submodels.
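The selective integration of step (7) may be sketched, purely by way of example, as follows. Two parts of this sketch are assumptions and not the formulas of the invention: min-max scaling is assumed for the normalization to [0, 1], and the score 1 - (RMSE' + std') / 2 is a hypothetical comprehensive index chosen only so that a higher score means a more accurate and more stable submodel. The names `top_count`, `min_max` and `select_and_average` are likewise hypothetical.

```python
def top_count(m):
    """S = floor(40% x M), computed in integer arithmetic."""
    return (40 * m) // 100


def min_max(values):
    """Assumed normalization: scale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_and_average(outputs, rmses, stds):
    """Average the outputs of the S highest-scoring submodels.

    outputs: M lists of completion results (one list per submodel).
    rmses, stds: accuracy and stability index of each submodel.
    Returns (final_result, indices_of_selected_submodels).
    """
    m = len(outputs)
    s = top_count(m)
    if s == 0:
        raise ValueError("at least 3 submodels are needed so that S >= 1")
    r_norm, d_norm = min_max(rmses), min_max(stds)
    # Hypothetical comprehensive index: higher means better.
    scores = [1.0 - 0.5 * (r + d) for r, d in zip(r_norm, d_norm)]
    best = sorted(range(m), key=lambda i: scores[i], reverse=True)[:s]
    n = len(outputs[0])
    final = [sum(outputs[i][j] for i in best) / s for j in range(n)]
    return final, best
```

For instance, with M = 50 submodels, `top_count(50)` gives S = 20, so the final completion result is the mean of the 20 best-scoring submodel outputs.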
As the number of submodels M increases, the completion accuracy gradually improves (as shown in Table 2).
Table 2. Number of submodels and the corresponding error in this embodiment for data completion of the heavy naphtha final boiling point
It can be seen that when the number of submodels is small, increasing it markedly improves the completion accuracy; when the number of submodels is sufficiently large, the completion performance levels off.
Finally, to verify the generality of the invention, other hydrocracking product quality index data, such as the diesel flash point, diesel cetane number, aviation kerosene flash point and aviation kerosene initial boiling point, were completed following the above steps of this embodiment. The completion results are shown in Figs. 6-9, and the completion errors are shown in Table 3.
Table 3. Errors of the invention in completing other quality indices of the hydrocracking process
The foregoing scheme can be implemented as computer software: a system for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning, characterized in that it comprises at least a training set generation module, a completion result generation module and a final completion result determination module:
The training set generation module is used to collect the raw data set and standardize the data to eliminate the influence of dimensions; it generates M sampling sets by the bootstrap sampling method, randomly extracts K variables from each sampling set, takes the resulting M sampling sets of size N × K as the training sets of the M submodels, and passes the training sets to the second module; where N is the number of training samples;
The completion result generation module is used to take the training data set of each submodel passed in from the training set generation module and build regression models using support vector machine, BP neural network and partial least squares, respectively, obtaining three groups of completion results from these three regression models with different characteristics; the weights of the three modeling methods are then calculated by the least-squares method, and weighting gives the completion result of each submodel; finally, the completion result of each submodel is passed to the third module;
The final completion result determination module is used to take the completion results of the submodels generated by the completion result integration module, evaluate the completion results of the M submodels according to the proposed completion performance evaluation index, rank them by evaluation score, select the average of the S highest-scoring submodels as the final completion result, and output it.
The bootstrap sampling method in the training set generation module is specifically:
Randomly draw with replacement, from the raw data set, a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
For each sampling set, randomly select K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula: where P is the total dimension of the variables of the sampling set, set in advance according to the raw data set.
The completion result of each submodel in the completion result integration module is calculated as follows:
S1. For each submodel training set, complete the data using three modeling methods, namely partial least squares (PLS), support vector machine (SVM) and BP neural network, obtaining three completion results, one for each method;
S2. Estimate the weight of each modeling method by the least-squares method, and obtain the completion result $Y_i$ of the i-th submodel by weighting:
Assume the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$; then y is expressed as the sum of the three completion results weighted by $z_1$, $z_2$ and $z_3$.
Let $z = [z_1, z_2, z_3]^T$ and let X be the matrix whose columns are the three completion results; the above can then be abbreviated as:
$$Xz = y$$
The least-squares estimate of the weight z is:
$$\hat{z} = (X^{T}X)^{-1}X^{T}y$$
Then the completion result of the submodel is:
$$Y_i = X\hat{z}$$
Further, the performance evaluation module performs its evaluation as follows:
S1. Calculate the root-mean-square error of the submodel as the accuracy index of the model, by the following formula:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^{2}}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S2. Calculate the standard deviation of the error of the submodel as the stability index of the model, by the following formula:
where $\bar{e}_i$ is the mean of the error of the i-th model, and std(e_i) denotes the standard deviation of the error;
S3. Normalize the two indices RMSE(i) and std(e_i) to the interval [0, 1] by the following normalization formula:
S4. Take the comprehensive index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:
where RMSE(i)' and std(e_i)' are the normalized accuracy index and stability index, respectively;
The final result is then:
$$Y_{\mathrm{final}} = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e. it returns the largest integer not exceeding the value in parentheses; $Y_{(1)}, \ldots, Y_{(S)}$ denote the completion output values of the S highest-scoring submodels.

Claims (9)

1. A method for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning, characterized by comprising the following steps:
S1. Generate the training sets:
S11. Collect the complex industrial process product quality indices to form a raw data set with N samples;
S12. Randomly draw with replacement, from the raw data set, a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
S13. For each sampling set, randomly select K variables without replacement as the feature variables of the training set, the value of K being determined by an empirical formula:
where P is the total dimension of the raw data set, thereby generating M training sets of size N × K;
S2. Build the submodels and generate completion results:
S21. Based on the M training sets, complete the data for each training set using three modeling methods, namely partial least squares, support vector machine and BP neural network, obtaining three completion results, one for each method;
S22. Estimate by the least-squares method the weights $z_1$, $z_2$, $z_3$ of the three modeling methods (partial least squares, support vector machine and BP neural network), and obtain the completion result of each submodel by weighted calculation, specifically as follows:
Assume the actual output is y; then y is expressed as the sum of the three completion results weighted by $z_1$, $z_2$ and $z_3$.
Let $z = [z_1, z_2, z_3]^T$ and let X be the matrix whose columns are the three completion results; the above can then be abbreviated as:
$$Xz = y$$
The least-squares estimate of the weight z is:
$$\hat{z} = (X^{T}X)^{-1}X^{T}y$$
Then the completion result of the i-th submodel is:
$$Y_i = X\hat{z}$$
S3. Determine the final completion result:
Based on the completion result of each submodel, evaluate the completion results of the M submodels according to the proposed completion performance evaluation index, rank them by evaluation score, and select the average of the S highest-scoring submodels as the final completion result.
2. The method for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that, after S11, it further comprises a step of standardizing the raw data set to eliminate the influence of the different dimensions of the variables, specifically as follows:
Among the N samples, each sample has P variables; then $X_{ij}$ (i = 1, 2, ..., N; j = 1, 2, ..., P) is the sample value of the j-th variable of the i-th sample, and the standardization formula is:
$$X'_{ij} = \frac{X_{ij} - E(X_j)}{\mathrm{Std}(X_j)}$$
where $E(X_j)$ is the mean of the N sample values of the j-th input variable and $\mathrm{Std}(X_j)$ is the standard deviation of those N sample values.
3. The method for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that "the proposed completion performance evaluation index" in S3 refers to a comprehensive index of accuracy and stability, i.e. the M submodels are ranked by a score given by the comprehensive index of accuracy and stability, and the average of the several highest-scoring submodels is selected as the final completion result, calculated specifically as follows:
S31. Calculate the root-mean-square error of the submodel as the accuracy index of the model, by the following formula:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^{2}}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S32. Calculate the standard deviation of the error of the submodel as the stability index of the model, by the following formula:
where $\bar{e}_i$ is the mean of the error of the i-th model, and std(e_i) denotes the standard deviation of the error;
S33. Normalize the two indices RMSE(i) and std(e_i) to the interval [0, 1] by the following normalization formula:
S34. Take the comprehensive index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:
where RMSE(i)' and std(e_i)' are the normalized accuracy index and stability index, respectively;
The final result is then:
$$Y_{\mathrm{final}} = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e. it returns the largest integer not exceeding the value in parentheses; $Y_{(1)}, \ldots, Y_{(S)}$ denote the completion output values of the S highest-scoring submodels.
4. The method for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that the "complex industrial process product quality index" refers to the heavy naphtha final boiling point of the hydrocracking process, N is 595, M is 50, and P is 139.
5. The method for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 1, characterized in that the "complex industrial process product quality index" refers to the diesel flash point, diesel cetane number, aviation kerosene flash point or aviation kerosene initial boiling point of the hydrocracking process.
6. A system for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning, characterized in that it comprises at least a training set generation module, a completion result generation module and a final completion result determination module:
The training set generation module is used to collect the raw data set and standardize the data to eliminate the influence of dimensions; it generates M sampling sets by the bootstrap sampling method, randomly extracts K variables from each sampling set, takes the resulting M sampling sets of size N × K as the training sets of the M submodels, and passes the training sets to the second module; where N is the number of training samples;
The completion result generation module is used to take the training data set of each submodel passed in from the training set generation module and build regression models using support vector machine, BP neural network and partial least squares, respectively, obtaining three groups of completion results from these three regression models with different characteristics; the weights of the three modeling methods are then calculated by the least-squares method, and weighting gives the completion result of each submodel; finally, the completion result of each submodel is passed to the third module;
The final completion result determination module is used to take the completion results of the submodels generated by the completion result integration module, evaluate the completion results of the M submodels according to the proposed completion performance evaluation index, rank them by evaluation score, select the average of the S highest-scoring submodels as the final completion result, and output it.
7. The system for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 6, characterized in that the bootstrap sampling method in the training set generation module is specifically:
Randomly draw with replacement, from the raw data set, a sampling set of the same size as the raw data set; repeat M times to obtain M mutually independent sampling sets;
For each sampling set, randomly select K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula: where P is the total dimension of the variables of the sampling set, set in advance according to the raw data set.
8. The system for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 6, characterized in that the completion result of each submodel in the completion result integration module is calculated as follows:
S1. For each submodel training set, complete the data using three modeling methods, namely partial least squares (PLS), support vector machine (SVM) and BP neural network, obtaining three completion results, one for each method;
S2. Estimate the weight of each modeling method by the least-squares method, and obtain the completion result $Y_i$ of the i-th submodel by weighting:
Assume the actual output is y and the weights of the three modeling methods are $z_1$, $z_2$, $z_3$; then y is expressed as the sum of the three completion results weighted by $z_1$, $z_2$ and $z_3$.
Let $z = [z_1, z_2, z_3]^T$ and let X be the matrix whose columns are the three completion results; the above can then be abbreviated as:
$$Xz = y$$
The least-squares estimate of the weight z is:
$$\hat{z} = (X^{T}X)^{-1}X^{T}y$$
Then the completion result of the submodel is:
$$Y_i = X\hat{z}$$
9. The system for missing-data completion of complex industrial process product quality indices based on selective double-layer ensemble learning according to claim 6, characterized in that the performance evaluation module performs its evaluation as follows:
S1. Calculate the root-mean-square error of the submodel as the accuracy index of the model, by the following formula:
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{Y}_{ij} - Y_j\right)^{2}}$$
where $\hat{Y}_{ij}$ is the completion result of the i-th model for the j-th sample, $Y_j$ is the true value of the j-th sample, and N is the number of test samples;
S2. Calculate the standard deviation of the error of the submodel as the stability index of the model, by the following formula:
where $\bar{e}_i$ is the mean of the error of the i-th model, and std(e_i) denotes the standard deviation of the error;
S3. Normalize the two indices RMSE(i) and std(e_i) to the interval [0, 1] by the following normalization formula:
S4. Take the comprehensive index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:
where RMSE(i)' and std(e_i)' are the normalized accuracy index and stability index, respectively;
The final result is then:
$$Y_{\mathrm{final}} = \frac{1}{S}\sum_{s=1}^{S} Y_{(s)}$$
where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e. it returns the largest integer not exceeding the value in parentheses; $Y_{(1)}, \ldots, Y_{(S)}$ denote the completion output values of the S highest-scoring submodels.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810305512.7A CN108490782B (en) 2018-04-08 2018-04-08 A kind of method and system being suitable for the missing data completion of complex industrial process product quality indicator based on selective double layer integrated study

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810305512.7A CN108490782B (en) 2018-04-08 2018-04-08 A kind of method and system being suitable for the missing data completion of complex industrial process product quality indicator based on selective double layer integrated study

Publications (2)

Publication Number Publication Date
CN108490782A CN108490782A (en) 2018-09-04
CN108490782B true CN108490782B (en) 2019-04-09

Family

ID=63314950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810305512.7A Active CN108490782B (en) 2018-04-08 2018-04-08 A kind of method and system being suitable for the missing data completion of complex industrial process product quality indicator based on selective double layer integrated study

Country Status (1)

Country Link
CN (1) CN108490782B (en)

Also Published As

Publication number Publication date
CN108490782A (en) 2018-09-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant