CN108490782B

CN108490782B - Method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning

Info

Publication number: CN108490782B
Application number: CN201810305512.7A
Authority: CN
Inventors: 袁小锋; 吴东哲; 王雅琳; 李灵; 阳春华; 桂卫华
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2018-04-08
Filing date: 2018-04-08
Publication date: 2019-04-09
Anticipated expiration: 2038-04-08
Also published as: CN108490782A

Abstract

The present invention relates to the technical field of industrial process control, and discloses a method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning. First, different dimensional variables of the sampled data are extracted to generate multiple sampling sets, which serve as the training sets of the submodels. Then each submodel is modeled with three methods: support vector machine, BP neural network, and partial least squares. Finally, a completion-effect evaluation index is proposed, the completion effect of each submodel is assessed, and the several submodels with the best completion effect are chosen for selective ensemble. The invention makes full use of all variables of the training samples and achieves a good data completion effect, which helps an enterprise carry out targeted optimization of production operations according to the actual operating state of the production process obtained from analysis.

Description

Method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning

Technical field

The present invention relates to the technical field of industrial process control, and in particular to a method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning.

Background art

In complex industrial processes, certain quality indices cannot be measured directly by sensors and must be obtained by manual sampling and offline laboratory assay. The assay period is long and the quality index data cannot be obtained in real time, so the completion of missing quality index data has become a focus of attention. Most complex industrial processes have now introduced computer control systems, and the large volume of production process data they record makes it more convenient to complete missing data for hard-to-measure quality indices.

However, data from complex industrial processes typically have the following characteristics, which make it difficult for existing completion methods to obtain good results. First, the control system of a complex industrial process often uses hundreds of sensors to measure process variables, so the dimension is very high and the data volume is very large, whereas product quality index data require time-consuming offline assay and are sampled at a very low frequency. After data preprocessing, therefore, few samples remain available for data completion. Second, the high-dimensional data of an industrial system are often strongly coupled, which can seriously affect parameter estimation and increase model error. Third, industrial processes involve a large number of complex chemical reactions, and the relationships between the various parameters are nonlinear; the relationships between temperature and entropy and between reaction temperature and reaction rate, for example, are typical nonlinear relationships, which makes building a mathematical model very difficult.

Commonly used data completion methods include mean imputation, hot-deck imputation, expectation-maximization imputation, and regression imputation. Because regression imputation uses as much of the information in the data samples as possible, most research has focused on it. However, because complex industrial process data are high-dimensional, nonlinear, and strongly coupled, the precision of regression imputation may be unstable. In addition, because very few samples are available for data completion in complex industrial processes, a completion model built from so few samples may fit poorly.

Summary of the invention

The technical problem to be solved by the present invention is that complex industrial processes have numerous process variables, strong coupling between variables, and large data fluctuations. To address these difficulties, the invention proposes a method and system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning:

A method for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning, characterized by comprising the following steps:

S1. Generate the training sets:

S11. Collect the complex industrial process product quality index data to form a raw data set with N samples;

S12. Randomly draw, with replacement, a sampling set of the same size as the raw data set from the raw data set; repeat M times to obtain M mutually independent sampling sets;

S13. For each sampling set, randomly select, without replacement, K variables as the feature variables of the training set, where the value of K is determined by an empirical formula:

where P is the total dimension of the raw data set. This generates M training sets of size N × K;
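
Steps S12 and S13 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patented implementation: the function name `make_training_sets`, the toy data, and the seed are hypothetical, and K is passed in as a parameter because the empirical formula for K is not reproduced in the text.

```python
import random

def make_training_sets(data, M, K, seed=0):
    """Return M tuples (rows, cols, subset), one per submodel.

    data: list of N rows, each a list of P variable values.
    rows:   N row indices drawn WITH replacement (bootstrap, step S12).
    cols:   K column indices drawn WITHOUT replacement (step S13).
    subset: the resulting N x K training set.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    training_sets = []
    for _ in range(M):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        cols = sorted(rng.sample(range(n_cols), K))
        subset = [[data[r][c] for c in cols] for r in rows]
        training_sets.append((rows, cols, subset))
    return training_sets

# Hypothetical toy data: 6 samples, 5 variables.
toy = [[10 * i + j for j in range(5)] for i in range(6)]
sets = make_training_sets(toy, M=4, K=2)
```

Each submodel therefore sees all N resampled rows but only K of the P variables, which is how the first layer keeps the sample-to-dimension ratio manageable.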

S2. Build the submodels and generate completion results:

S21. Based on the M training sets, complete the data for each training set with three modeling methods (partial least squares, support vector machine, and BP neural network), obtaining three completion results:

S22. Use the least squares method to estimate the weights z₁, z₂, z₃ of the three modeling methods (partial least squares, support vector machine, BP neural network), and compute the weighted completion result of each submodel, as follows:

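Assuming the actual output is y, the weighted combination of the three completion results should equal y.

Let z = [z₁, z₂, z₃]^T and let X be the matrix whose three columns are the completion results of the three modeling methods. The above relation can then be abbreviated as:

Xz = y

The least-squares estimate of the weight vector z is:

z = (X^T X)^(-1) X^T y

The completion result of the i-th submodel is then Xz, computed with the estimated weights.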
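
The weighting step can be sketched with NumPy as below. This is an illustration under assumptions, not the patent's code: the function name `combine_by_least_squares` and the synthetic data are hypothetical, and `np.linalg.lstsq` is used in place of forming the inverse of X^T X explicitly, which gives the same least-squares solution when X has full column rank.

```python
import numpy as np

def combine_by_least_squares(preds, y):
    """preds: (n_samples, 3) array, one column per modeling method.

    Returns the weight vector z minimizing ||Xz - y||_2 and the
    weighted completion result Xz.
    """
    X = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    z, *_ = np.linalg.lstsq(X, y, rcond=None)
    return z, X @ z

# Synthetic check: y is exactly 0.5*col0 + 0.3*col1 + 0.2*col2.
rng = np.random.default_rng(0)
preds = rng.normal(size=(20, 3))
y = preds @ np.array([0.5, 0.3, 0.2])
z, combined = combine_by_least_squares(preds, y)
```

In practice the weights would be estimated on samples whose true quality values are known and then applied to the three predictions for the missing samples.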

S3. Determine the final completion result:

Based on the completion result of each submodel, the completion results of the M submodels are assessed with the proposed completion-effect evaluation index and ranked by assessment score; the average of the S highest-scoring submodels is taken as the final completion result.

Further, after S11 the method includes a step of standardizing the raw data set to eliminate the influence of the different dimensions (units) of the variables, as follows:

Let there be N samples, each with P variables, so that X_ij (i = 1, 2, ..., N; j = 1, 2, ..., P) is the value of the j-th variable in the i-th sample. The standardization formula is:

X_ij' = (X_ij - E(X_j)) / Std(X_j)

where E(X_j) is the mean of the N sample values of the j-th input variable and Std(X_j) is the standard deviation of those N sample values.
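
A minimal column-wise standardization sketch follows. The function name `standardize_columns` and the toy rows are hypothetical. The text does not say whether Std(X_j) is the population or the sample standard deviation; the population form (`statistics.pstdev`) is assumed here, and a constant column would need separate handling because its standard deviation is zero.

```python
import statistics

def standardize_columns(rows):
    """Subtract each column's mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # ASSUMPTION: population standard deviation; the source does not specify.
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical toy data: two variables on very different scales.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
standardized = standardize_columns(raw)
```

After this step both columns have mean 0 and unit standard deviation, so no variable dominates a model merely because of its unit.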

Further, the "proposed completion-effect evaluation index" in S3 is a composite index of precision and stability. The M submodels are ranked by their composite score, and the average of the several highest-scoring submodels is taken as the final completion result, computed as follows:

S31. Compute the root-mean-square error (RMSE) of each submodel as its precision index:

RMSE(i) = sqrt( (1/N) Σ_{j=1}^{N} (Ŷ_ij - Y_j)^2 )

where Ŷ_ij is the completion result of the i-th model for the j-th sample, Y_j is the true value of the j-th sample, and N is the number of test samples;

S32. Compute the standard deviation of the errors of each submodel as its stability index:

where the standard deviation is taken about the mean of the errors of the i-th model, and std(e_i) denotes the standard deviation of the errors;

S33. Normalize the two indices RMSE(i) and std(e_i) to the interval [0, 1] with the following normalization formula:

S34. Take the composite index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:

where RMSE(i)' and std(e_i)' are the normalized precision index and stability index, respectively;

The final result is then:

where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e., it returns the largest integer not greater than its argument; the quantity averaged is the completion output of the top S highest-scoring submodels.
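
The selection procedure in S31 to S34 can be sketched as below. Several parts are assumptions, because the corresponding formulas are not reproduced in the text: min-max scaling is assumed for the normalization to [0, 1], the population form is assumed for the standard deviation, and the composite score `1 - 0.5 * (RMSE' + std')` is only a placeholder chosen so that a higher score means a better submodel. All function names and the toy data are hypothetical.

```python
import math

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def error_std(pred, truth):
    errs = [abs(p - t) for p, t in zip(pred, truth)]  # absolute errors
    mean = sum(errs) / len(errs)
    return math.sqrt(sum((e - mean) ** 2 for e in errs) / len(errs))

def min_max(values):
    # ASSUMPTION: min-max scaling; the source only says "normalize to [0, 1]".
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def select_and_average(preds, truth):
    M = len(preds)
    r = min_max([rmse(p, truth) for p in preds])
    s = min_max([error_std(p, truth) for p in preds])
    # ASSUMPTION: placeholder composite score, not the patent's formula.
    scores = [1.0 - 0.5 * (ri + si) for ri, si in zip(r, s)]
    S = (40 * M) // 100  # floor(40% x M) in exact integer arithmetic
    best = sorted(range(M), key=lambda i: scores[i], reverse=True)[:S]
    final = [sum(preds[i][j] for i in best) / S for j in range(len(truth))]
    return best, final

truth = [1.0, 2.0, 3.0, 4.0]
preds = [
    list(truth),                 # perfect submodel
    [t + 0.1 for t in truth],    # small constant bias
    [1.5, 1.5, 3.5, 3.5],
    [2.0, 2.0, 3.0, 6.0],
    [0.0, 2.0, 3.0, 4.0],
]
best, final = select_and_average(preds, truth)
```

With M = 5 the rule gives S = 2, so the two best submodels are averaged; with the M = 50 used elsewhere in this document it gives S = 20.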

Further, the "complex industrial process product quality index" is the final boiling point of heavy naphtha from a hydrocracking process, with N = 595, M = 50, and P = 139.

Further, the "complex industrial process product quality index" is the diesel flash point, the diesel cetane number, the aviation kerosene flash point, or the aviation kerosene initial boiling point of a hydrocracking process.

A system for completing missing product quality index data in complex industrial processes based on selective two-layer ensemble learning, characterized by comprising at least a training set generation module, a completion result generation module, and a final completion result determination module:

The training set generation module collects the raw data set and standardizes the data to eliminate the influence of dimensions. It generates M sampling sets by bootstrap sampling, randomly extracts K variables from each sampling set, and uses the resulting M sampling sets of size N × K as the training sets of the M submodels, which it passes to the second module; here N is the number of training samples;

The completion result generation module takes the training data set of each submodel passed in by the training set generation module and builds regression models with support vector machine, BP neural network, and partial least squares, obtaining three sets of completion results from these three regression models of different character. It then computes the weights of the three modeling methods by the least squares method, obtains the weighted completion result of each submodel, and finally passes the completion result of each submodel to the third module;

The final completion result determination module takes the completion result of each submodel produced by the completion result integration module, assesses the completion results of the M submodels with the proposed completion-effect evaluation index, ranks them by assessment score, takes the average of the S highest-scoring submodels as the final completion result, and outputs it.

Further, the bootstrap sampling in the training set generation module is specifically:

A sampling set of the same size as the raw data set is drawn at random, with replacement, from the raw data set; this is repeated M times to obtain M mutually independent sampling sets;

For each sampling set, K variables are selected at random, without replacement, as the feature variables of the training set, where the value of K is determined by an empirical formula in which P is the total dimension of the sampling set variables, set in advance according to the raw data set.

Further, the completion result of each submodel is computed in the completion result integration module as follows:

S1. For the training set of each submodel, complete the data with three modeling methods, partial least squares (PLS), support vector machine (SVM), and BP neural network, obtaining the following three completion results:

S2. Estimate the weight of each modeling method by the least squares method and compute the weighted completion result Y_i of the i-th submodel:

Assuming the actual output is y and the weights of the three modeling methods are z₁, z₂, z₃, the weighted combination of the three completion results should equal y.

Let z = [z₁, z₂, z₃]^T and let X be the matrix whose three columns are the completion results of the three modeling methods. The above relation can then be abbreviated as:

Xz = y

The least-squares estimate of the weight vector z is:

z = (X^T X)^(-1) X^T y

The completion result of the submodel is then Xz, computed with the estimated weights.

Further, the effect evaluation module performs its assessment as follows:

S1. Compute the root-mean-square error of each submodel as its precision index, with the following formula:

S2. Compute the standard deviation of the errors of each submodel as its stability index, with the following formula:

S3. Normalize the two indices RMSE(i) and std(e_i) to the interval [0, 1] with the following normalization formula:

S4. Take the composite index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:

where RMSE(i)' and std(e_i)' are the normalized precision index and stability index, respectively;

The final result is then:

where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e., it returns the largest integer not greater than its argument; the quantity averaged is the completion output of the top S highest-scoring submodels.

Compared with existing methods, the beneficial effects of the present invention are as follows. In the proposed data completion method and system based on selective two-layer ensemble learning, the first layer extracts different dimensional variables of the sampled data, builds local models, and integrates them. This ensures that each local model has enough training samples and thereby addresses the difficulty, described above, of few samples and high dimension. In the second layer, each local model is built with three methods: support vector machine (SVM), BP neural network, and partial least squares (PLS). Support vector regression is a practical algorithm for small samples, the BP neural network has good nonlinear approximation ability, and partial least squares regression can handle the problems caused by coupling among input variables. Integrating the results of the three modeling methods ensures the stability of the algorithm on data completion problems in complex industrial processes and gives it higher generalization ability. Finally, a completion-effect evaluation index is proposed, the completion effect of each submodel is assessed, and the several submodels with the best completion effect are chosen for selective ensemble, which further improves completion precision. This gives the enterprise a more credible basis for online analysis of the global operating condition of the industrial process and helps it adjust production according to the analysis results, reduce resource waste, and improve production efficiency. The proposed completion method achieves a good completion effect even when the input variable dimension is very high and data samples are few.

Brief description of the drawings

Fig. 1 is the overall flow chart of the data completion method based on selective two-layer ensemble learning according to an embodiment of the present invention.

Fig. 2 is a structural schematic diagram of the data completion method based on selective two-layer ensemble learning.

Fig. 3 shows the missing-data situation of the hydrocracking process product quality index data samples.

Fig. 4 shows the test-set completion results for the heavy naphtha final boiling point obtained with the data completion method based on selective two-layer ensemble learning.

Fig. 5 compares the completion errors of the S submodels selected by the proposed evaluation index for the heavy naphtha final boiling point with the completion error after integration.

Fig. 6 shows the test-set completion results for the diesel flash point.

Fig. 7 shows the test-set completion results for the diesel cetane number.

Fig. 8 shows the test-set completion results for the aviation kerosene initial boiling point.

Fig. 9 shows the test-set completion results for the aviation kerosene flash point.

Fig. 10 compares, for the diesel flash point test set, the completion errors of the S selected submodels with the completion error after integration.

Fig. 11 compares, for the diesel cetane number test set, the completion errors of the S selected submodels with the completion error after integration.

Fig. 12 compares, for the aviation kerosene initial boiling point test set, the completion errors of the S selected submodels with the completion error after integration.

Fig. 13 compares, for the aviation kerosene flash point test set, the completion errors of the S selected submodels with the completion error after integration.

Detailed description of the embodiments

To fully disclose the present invention, specific embodiments are described in further detail below:

The invention makes full use of all input variables, so that a good completion result can be obtained even when the input variable dimension is very high. The proposed method is as follows:

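S1. Generate the training sets:

S2. Build the submodels and generate completion results:

Assuming the actual output is y, the weighted combination of the three completion results should equal y.

Let z = [z₁, z₂, z₃]^T and let X be the matrix whose three columns are the completion results of the three modeling methods. The above relation can then be abbreviated as:

Xz = y

The least-squares estimate of the weight vector z is:

z = (X^T X)^(-1) X^T y

The completion result of the i-th submodel is then Xz, computed with the estimated weights.

S3. Determine the final completion result: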

In a preferred embodiment of the present invention, before the bootstrap sampling described in S1, the raw data set is first standardized to eliminate the influence of the different dimensions of the variables. Let the input sampled data contain N samples, each with P variables, so that X_ij (i = 1, 2, ..., N; j = 1, 2, ..., P) is the value of the j-th variable of the i-th sample.

The standardization formula is:

X_ij' = (X_ij - E(X_j)) / Std(X_j), where E(X_j) and Std(X_j) are the mean and the standard deviation of the N sample values of the j-th variable.

Then a sampling set of the same size as the raw data set is drawn at random, with replacement, from the standardized raw data set; this is repeated M times to obtain M mutually independent sampling sets;

For each sampling set, K variables are selected at random, without replacement, as the feature variables of the training set, where the value of K is determined by an empirical formula:

This yields the N × K training sets of the M submodels.

S2 uses three modeling methods for integrated modeling, described further below:

1) Support vector regression model

By constructing a loss function and following the principle of structural risk minimization, the support vector machine typically determines the regression function through the following minimization model:

where ω is the weight vector, whose norm term expresses model complexity, μ is the regularization constant, ξ_i^* and ξ_i are slack variables, φ(x) is the nonlinear transformation that maps the data into a high-dimensional space, b is the bias, and ε is the upper bound on the error.
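
For reference, the standard ε-insensitive support vector regression model, written with the symbols defined above, has the form below. This is the textbook formulation and is given as an assumption about what the missing formula contains; the sample index i and the target values y_i are notation introduced here.

```latex
\min_{\omega,\,b,\,\xi,\,\xi^*}\quad
\frac{1}{2}\lVert\omega\rVert^{2} + \mu\sum_{i=1}^{N}\left(\xi_i + \xi_i^{*}\right)
\qquad\text{s.t.}\qquad
\begin{cases}
y_i - \omega^{T}\varphi(x_i) - b \le \varepsilon + \xi_i,\\[2pt]
\omega^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*},\\[2pt]
\xi_i \ge 0,\quad \xi_i^{*} \ge 0,\qquad i = 1,\dots,N.
\end{cases}
```

The first term penalizes model complexity and the second penalizes deviations larger than ε, with μ setting the trade-off between the two.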

Introducing the Lagrange multipliers α_i and α_i^*, the above optimization model can be converted into the following dual optimization problem and solved:

Solving this problem gives the support vector regression function:

where k(X_i, X) is the kernel function, which must satisfy the Mercer condition. In a preferred embodiment of the present invention, the Gaussian RBF kernel function is chosen:
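
A Gaussian RBF kernel can be computed as in the sketch below. The parameterization exp(-||x - z||^2 / (2σ^2)) is one common convention and is an assumption here, since the kernel formula itself is not reproduced in the text; the function name `rbf_kernel` and the width parameter `sigma` are hypothetical.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel between two equal-length vectors.

    ASSUMPTION: uses the exp(-||x - z||^2 / (2 * sigma^2)) convention.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```

The kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, with `sigma` controlling how quickly.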

2) BP neural network

In a preferred embodiment of the present invention, a three-layer BP neural network is used. Let the number of input-layer nodes be l and the number of output-layer nodes be o; the number of hidden nodes s is determined by an empirical formula:

The output of an output-layer node is

where w_hj is the connection weight from the hidden layer to the output layer, b_h is the value of the hidden node, and θ_j is the threshold of the output-layer node.

The output of a hidden node is

where v_ih is the connection weight from the input layer to the hidden layer, x_i is the value of the input-layer node, and γ_h is the threshold of the hidden node.

From the back-propagation algorithm, the update formulas for the weights and thresholds are:

w_hj = w_hj + η g_j b_h

v_ih = v_ih + η e_h x_i

θ_j = θ_j - η g_j

γ_h = γ_h - η e_h

where η is the learning rate, and g_j and e_h are determined by the following formulas:

These formulas are iterated until the mean square error of the network output meets the requirement.
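
The four update rules can be sketched in NumPy for a single training sample as below. Sigmoid activations in both layers and a squared-error loss are assumed, which gives the usual gradient terms g = ŷ(1 - ŷ)(y - ŷ) for the output layer and e = b(1 - b)(w g) for the hidden layer; the text does not reproduce these formulas, so this is a hedged illustration. The function names, network sizes, and data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bp_step(x, y, v, gamma, w, theta, eta=0.1):
    """One back-propagation step on one sample.

    Shapes: x (l,), y (o,), v (l, s), gamma (s,), w (s, o), theta (o,).
    Returns updated parameters and the squared error before the update.
    """
    b = sigmoid(x @ v - gamma)             # hidden-node outputs
    y_hat = sigmoid(b @ w - theta)         # output-node outputs
    g = y_hat * (1 - y_hat) * (y - y_hat)  # output-layer gradient term
    e = b * (1 - b) * (w @ g)              # hidden-layer gradient term (old w)
    w = w + eta * np.outer(b, g)
    v = v + eta * np.outer(x, e)
    theta = theta - eta * g
    gamma = gamma - eta * e
    return v, gamma, w, theta, float(np.sum((y - y_hat) ** 2))

rng = np.random.default_rng(0)
l, s, o = 3, 4, 1
v, gamma = rng.normal(size=(l, s)), np.zeros(s)
w, theta = rng.normal(size=(s, o)), np.zeros(o)
x, y = np.array([0.2, -0.4, 0.7]), np.array([0.8])

errors = []
for _ in range(200):
    v, gamma, w, theta, err = bp_step(x, y, v, gamma, w, theta, eta=0.5)
    errors.append(err)
```

The hidden-layer term is computed with the weights from before the update, as the derivation requires. A real model would loop over all training samples and stop once the mean square error meets the chosen threshold.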

3) Partial least squares

When building a regression model, partial least squares both extracts as much as possible of the principal components from the input and the output and maximizes the correlation between the principal components extracted from X and from Y. This overcomes the shortcomings of ordinary least squares on regression problems with high dimension and linearly correlated variables.

Assume X and Y are the data obtained from the raw data after zero-mean, unit-variance standardization. The first pair of principal components t₁ and u₁ of X and Y are then:

t₁ = Xc₁

u₁ = Yd₁

where c₁ and d₁ are coefficient vectors, obtained by solving the following optimization problem.

The optimization problem is to maximize the correlation between t₁ and u₁ while making the respective variances of t₁ and u₁ as large as possible. Mathematically it can be formalized as:

max <Xc₁, Yd₁>

This optimization problem can be solved by introducing Lagrange multipliers. The solution is that c₁ is the eigenvector of the symmetric matrix X^T Y Y^T X corresponding to its largest eigenvalue, and d₁ is the eigenvector of Y^T X X^T Y corresponding to its largest eigenvalue. The first pair of correlated principal components t₁ and u₁ then follows.
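
The eigenvector characterization of c₁ and d₁ can be checked numerically with the NumPy sketch below. The function name `first_pls_directions` and the random test data are hypothetical. `np.linalg.eigh` returns unit-norm eigenvectors with eigenvalues in ascending order, so the last column is the one for the largest eigenvalue; unit-norm c₁ and d₁ are therefore an implicit assumption of this sketch.

```python
import numpy as np

def first_pls_directions(X, Y):
    """Return c1, d1 and the first pair of components t1 = X c1, u1 = Y d1."""
    A = X.T @ Y @ Y.T @ X          # symmetric, positive semidefinite
    _, vecs_a = np.linalg.eigh(A)  # eigenvalues in ascending order
    c1 = vecs_a[:, -1]
    B = Y.T @ X @ X.T @ Y
    _, vecs_b = np.linalg.eigh(B)
    d1 = vecs_b[:, -1]
    return c1, d1, X @ c1, Y @ d1

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(30, 2))
c1, d1, t1, u1 = first_pls_directions(X, Y)
```

The sign of each eigenvector is arbitrary, so the inner product of t₁ and u₁ may come out negative; its magnitude equals the square root of the largest eigenvalue of X^T Y Y^T X.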

Regression modeling is then carried out as follows:

X = t₁p₁^T + E

Y = u₁q₁^T + G

Y = t₁r₁^T + F

For these regression equations, p₁, q₁, and r₁ can be computed by the least squares method:

Next, the residual E of X is taken as the new X and the residual F of Y as the new Y, the second pair of principal components is extracted, and the regression is repeated as before. The procedure continues until the residual F meets the requirement or the number of principal components reaches its upper limit, at which point the algorithm terminates.

If k principal components are extracted in total, the original X and Y can finally be expressed as:

X = t₁p₁^T + t₂p₂^T + … + t_k p_k^T + E

Y = t₁r₁^T + t₂r₂^T + … + t_k r_k^T + F

In matrix form:

X = TP^T + E

Y = TR^T + F = XCR^T + F

where T = [t₁, t₂, …, t_k], P = [p₁, p₂, …, p_k], C = [c₁, c₂, …, c_k], and R = [r₁, r₂, …, r_k].

It follows that once C and R have been obtained during the iteration, the output value can be estimated with the above formula.

After the data have been completed by the three modeling methods above, the weight of each method's completion result must still be determined so that the results can be integrated. The present invention obtains the optimal estimate of the weights of the three modeling methods by the least squares method, specifically:

Let the three modeling methods each give a completion result. Assuming the actual output is y and the weights of the three modeling methods are z₁, z₂, z₃, the weighted combination of the three completion results should equal y.

Let z = [z₁, z₂, z₃]^T and let X be the matrix whose three columns are the completion results of the three modeling methods. The above relation can then be abbreviated as:

Xz = y

The least-squares estimate of the weight vector z is:

z = (X^T X)^(-1) X^T y

The completion result of the i-th submodel is then Xz, computed with the estimated weights.

In a preferred embodiment of the present invention, the completion-effect evaluation index described in S3 is specifically:

1) Precision index

The root-mean-square error is chosen as the precision index. The root-mean-square error of the i-th submodel is:

RMSE(i) = sqrt( (1/N) Σ_{j=1}^{N} (Ŷ_ij - Y_j)^2 )

where Ŷ_ij is the completion result of the i-th model for the j-th test sample, Y_j is the true value of the j-th test sample, and N is the number of test samples;

2) Stability index

To reflect the quality of the model more fully, the stability of the completion results over all samples of the test set must also be measured. The standard deviation of the error is therefore chosen as the stability index. The standard deviation of the error of the i-th submodel is:

where the error of the i-th model on the j-th test sample is taken as an absolute value, and the standard deviation is computed about the mean of the errors of the i-th model;

The two indices RMSE(i) and std(e_i) are normalized to the interval [0, 1] with the following normalization formula:

The composite index determined by the following formula is taken as the score of each submodel; all submodels are sorted by score from high to low, and the mean of the completion results of the S highest-scoring models is taken as the final completion result:

The final result is then:

where S = floor(40% × M), M is the number of submodels, and floor() is the round-down function, i.e., it returns the largest integer not greater than its argument; the quantity averaged is the completion output of the top S highest-scoring submodels.

Example:

The data completion method based on selective two-layer ensemble learning provided by this embodiment takes a hydrocracking process as its object. The historical data of the process variables of the whole process and the quality indices of the product oil form the raw data set, and the missing product oil quality indices are completed. The hydrocracking process is complex, with numerous measured process variables and large time delays, so the data set is high-dimensional and the model is strongly nonlinear. Because the sampling frequencies of the process variables and of the product oil quality indices differ, and because of accidents such as failure of the product oil assay equipment, the product oil quality index data are seriously missing. Fig. 3 shows the missing-data situation of the quality data samples: most quality indices yield only 1 data sample every 12 hours, and some yield only 1 data sample per day. The completion method provided by the present invention can effectively complete such seriously missing quality indices. The detailed procedure is as follows:

Step (1): preprocess the operating parameter data of the hydrocracking process. First, the historical data of 160 measurable process variables are extracted through mechanism analysis. Then, according to the fluctuation of each variable's data, the variables with abnormally fluctuating data caused by sensor faults are removed, leaving 139 main process variables.

Step (2): standardize the raw data set to eliminate the influence of the different dimensions of the variables. Let the input sampled data contain N samples, each with P variables, so that X_ij (i = 1, 2, ..., N; j = 1, 2, ..., P) is the value of the j-th variable of the i-th sample. Here N = 595 and P = 139.

The standardization formula is:

Step (3): randomly draw, with replacement, a sampling set of the same size as the raw data set from the standardized raw data set; repeat M times to obtain M mutually independent sampling sets;

For each sampling set, K variables are selected at random, without replacement, as the feature variables of the training set, where the value of K is determined by an empirical formula. In this embodiment, since P is 139, K = 23 is taken.

This yields the N × 23 training sets of the M submodels (N is the number of training samples).

Step (4): build regression models with support vector machine (SVM), BP neural network, and partial least squares (PLS), respectively, obtaining three sets of completion results, one for each modeling method.

Step (5): obtain the optimal estimate of the weights of the three modeling methods by the least squares method. Assuming the actual output is y and the weights of the three modeling methods are z₁, z₂, z₃, the weighted combination of the three completion results should equal y.

Let z = [z₁, z₂, z₃]^T and let X be the matrix whose three columns are the completion results of the three modeling methods. The above relation can then be abbreviated as:

Xz = y

The least-squares estimate of the weight vector z is:

z = (X^T X)^(-1) X^T y

The completion result of the i-th submodel is then Xz, computed with the estimated weights.

The completion errors of the three modeling methods and the completion error after integration are shown in Table 1.

Step (6): repeat step (5) M times to obtain the completion results of the M submodels.

Step (7): based on the completion result of each submodel obtained in step (6), rank the M submodels by the composite index of precision and stability, and take the average of the top S highest-scoring submodels as the final completion result.

The standard deviation of the error is chosen as the stability index. The standard deviation of the error of the i-th submodel is:

The final result is then:

As the number of submodels M increases, the completion precision gradually improves (as shown in Table 2).

Table 2. Number of submodels and corresponding error for completion of the heavy naphtha final boiling point data in this embodiment

It can be seen that when the number of submodels is small, increasing it clearly improves completion precision; when the number of submodels is sufficiently large, the completion effect levels off.

Finally, to verify the generality of the invention, the steps of this embodiment were applied to other hydrocracking process product quality indices: diesel flash point, diesel cetane number, aviation kerosene flash point, and aviation kerosene initial boiling point. The completion results are shown in Figs. 6-9 and the completion errors in Table 3.

Table 3. Completion errors of the present invention for other quality indices of the hydrocracking process

Aforementioned schemes can be written as computer software, and one kind being suitable for complexity based on selective double layer integrated study The system of industrial process product quality indicator missing data completion, characterized by comprising: include at least generate training set module, It generates completion object module and determines final completion object module:

The self-service sampling method of bootstrap in the training set generation module specifically:

The completion result of each submodel calculates as follows in the completion result integration module:

It enablesThen above formula can be abbreviated are as follows:

Xz=y

Using the estimation of the available weight z of least square method:

The completion result of the submodel is then the weighted combination of the three completion results under the estimated weights.
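The weighting step Xz=y can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function and argument names (`combine_by_least_squares`, `y_pls`, `y_svm`, `y_bp`) are hypothetical, and `numpy.linalg.lstsq` is used in place of an explicit normal-equation solve; the two agree when X has full column rank, and `lstsq` is usually the numerically safer choice.

```python
import numpy as np

def combine_by_least_squares(y_pls, y_svm, y_bp, y):
    """Estimate weights z for three base-model outputs and combine them.

    y_pls, y_svm, y_bp : arrays of shape (N,), outputs of the three methods
    y                  : array of shape (N,), the actual output
    Returns (z, combined), where combined = X @ z.
    """
    # Columns of X are the three completion results, so that X z ~ y.
    X = np.column_stack([y_pls, y_svm, y_bp]).astype(float)
    z, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return z, X @ z
```

For example, if the actual output happens to be exactly 0.5, 0.3, and 0.2 times the three base outputs, the function recovers those weights.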

The final result is then the average of the completion results of the S highest-scoring submodels.

Claims

1. A method for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning, characterized by comprising the following steps:

S1. Generate training sets:

S11. Collect complex industrial process product quality index data to form a raw data set with N samples;

S12. Randomly draw, with replacement, a sampling set of the same size as the raw data set from the raw data set; repeat M times to obtain M mutually independent sampling sets;

S13. For each sampling set, randomly select K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula:

S2. Build submodels and generate completion results:

S21. Based on the M training sets, complete the data for each training set using three modeling methods (partial least squares, support vector machine, and BP neural network), obtaining three completion results:

S22. Use the least squares method to estimate the weights z₁, z₂, z₃ of the three modeling methods (partial least squares, support vector machine, and BP neural network), and obtain the completion result of each submodel by weighted calculation, specifically as follows:

Assuming the actual output is y, then

Let z=[z₁,z₂,z₃]^T and let X be the matrix whose columns are the three completion results; then the above formula can be abbreviated as:

Xz=y

The estimate of the weight z is obtained by the least squares method:

ẑ=(X^T X)^{-1}X^T y

The completion result of the i-th submodel is then the weighted combination of the three completion results under the estimated weights.

S3. Determine the final completion result:

Based on the completion result of each submodel, evaluate the completion results of the M submodels according to the proposed completion performance evaluation index, rank them by evaluation score, and take the average of the S highest-scoring submodels as the final completion result.

2. The method for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 1, characterized in that, after S11, it further comprises a step of standardizing the raw data set to eliminate the influence of differing variable dimensions, specifically as follows:

Given N samples, each with P variables, let X_ij (i=1,2,...,N; j=1,2,...,P) be the value of the j-th variable of the i-th sample; the standardization formula is:

X′_ij=(X_ij−E(X_j))/Std(X_j)

where E(X_j) is the mean of the N sample values of the j-th input variable, and Std(X_j) is the standard deviation of the N sample values of the j-th input variable.
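The standardization of claim 2 can be illustrated as below. This is a sketch under assumptions: the function name `standardize` is hypothetical, the population standard deviation (`ddof=0`) is assumed because the text does not say which variant is used, and the guard for constant columns is an addition that the source does not mention.

```python
import numpy as np

def standardize(X, ddof=0):
    """Column-wise standardization: subtract E(X_j), divide by Std(X_j).

    X : array of shape (N, P), N samples with P variables each.
    Returns (Z, mean, std) so new data can be scaled the same way.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                 # E(X_j) for each variable j
    std = X.std(axis=0, ddof=ddof)        # Std(X_j) for each variable j
    # Assumption: constant columns are left at zero instead of dividing by 0.
    safe_std = np.where(std == 0.0, 1.0, std)
    return (X - mean) / safe_std, mean, std
```

Returning the mean and standard deviation allows completed values to be mapped back to the original units afterwards.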

3. The method for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 1, characterized in that the "proposed completion performance evaluation index" in S3 refers to a composite index of accuracy and stability; that is, the M submodels are ranked by score according to the composite index of accuracy and stability, and the average of the several highest-scoring submodels is taken as the final completion result, specifically calculated as follows:

where Ŷ_ij denotes the completion result of the i-th model for the j-th sample, Y_j is the true value of the j-th sample, and N is the number of test samples;

S34. Take the composite index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:

The final result is then the average of the completion results of the S highest-scoring submodels,

where S=floor(40%×M), M is the number of submodels, and floor() is the round-down function, i.e., it returns the largest integer not greater than the value in the brackets; the terms being averaged are the completion results of the S highest-scoring submodels.
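The selection rule with S=floor(40%×M) can be sketched as follows. The function name `selective_average` and its arguments are hypothetical, and the scores are taken as given inputs because the composite index formula is not reproduced in this text; higher scores are treated as better, as stated above.

```python
import numpy as np

def selective_average(preds, scores, percent=40):
    """Average the completion results of the S highest-scoring submodels.

    preds   : array of shape (M, N), one row per submodel
    scores  : array of shape (M,), higher means better
    percent : share of submodels to keep; S = floor(percent% * M)
    Returns (final, top), where top holds the selected row indices.
    """
    preds = np.asarray(preds, dtype=float)
    scores = np.asarray(scores, dtype=float)
    M = preds.shape[0]
    S = (percent * M) // 100          # integer arithmetic gives an exact floor
    if S < 1:
        raise ValueError("M is too small to select any submodel")
    top = np.argsort(scores)[::-1][:S]   # indices sorted by descending score
    return preds[top].mean(axis=0), top
```

With M=50, as in claim 4, this keeps S=20 submodels.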

4. The method for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 1, characterized in that the "complex industrial process product quality index" refers to the final boiling point of heavy naphtha in the hydrocracking process, and N is 595, M is 50, and P is 139.

5. The method for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 1, characterized in that the "complex industrial process product quality index" refers to the diesel flash point, diesel cetane number, aviation kerosene flash point, or aviation kerosene initial boiling point of the hydrocracking process.

6. A system for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning, characterized by comprising at least a training set generation module, a completion result generation module, and a final completion result determination module:

The training set generation module is used to collect the raw data set and standardize the data to eliminate the influence of dimensions; it uses the bootstrap sampling method to generate M sampling sets, randomly extracts K variables from each sampling set, and finally uses these M sampling sets of size N×K as the training sets of the M submodels, sending the training sets to the second module; where N is the number of training samples;

The completion result generation module is used to take each submodel training data set passed in by the training set generation module and build regression models using support vector machine, BP neural network, and partial least squares respectively, obtaining three groups of completion results from these three regression models with different characteristics; the weights of the three modeling methods are then calculated by the least squares method, and weighting yields the completion result of each submodel; finally, the completion result of each submodel is passed to the third module;

The final completion result determination module is used to take the completion results of each submodel produced by the completion result integration module, evaluate the completion results of the M submodels according to the proposed completion performance evaluation index, rank them by evaluation score, take the average of the S highest-scoring submodels as the final completion result, and output it.

7. The system for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 6, characterized in that the bootstrap sampling method in the training set generation module is specifically:

Randomly draw, with replacement, a sampling set of the same size as the raw data set from the raw data set; repeat M times to obtain M mutually independent sampling sets;

For each sampling set, randomly select K variables without replacement as the feature variables of the training set, where the value of K is determined by an empirical formula, in which P is the total number of variables in the sampling set, set in advance according to the raw data set.
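The two sampling steps of claim 7 (rows drawn with replacement, variables drawn without replacement) can be sketched as below. The function name `make_training_sets` and its arguments are hypothetical, and K is passed in as a parameter because the empirical formula for K is not reproduced in this text.

```python
import numpy as np

def make_training_sets(X, y, M, K, seed=0):
    """Build M training sets by bootstrap rows plus a random variable subset.

    X : array of shape (N, P), input variables
    y : array of shape (N,), quality index values
    M : number of sampling sets (submodels)
    K : number of variables kept per set (assumed given, K <= P)
    Returns a list of (X_sub, y_sub, rows, cols) tuples.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    N, P = X.shape
    sets = []
    for _ in range(M):
        rows = rng.integers(0, N, size=N)            # with replacement
        cols = rng.choice(P, size=K, replace=False)  # without replacement
        sets.append((X[np.ix_(rows, cols)], y[rows], rows, cols))
    return sets
```

Keeping `rows` and `cols` alongside each set lets a trained submodel apply the same variable subset to new samples.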

8. The system for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 6, characterized in that the completion result of each submodel in the completion result integration module is calculated as follows:

S1. For each submodel training set, complete the data using three modeling methods, partial least squares (PLS), support vector machine (SVM), and BP neural network, obtaining the following three completion results:

S2. Estimate the weight of each modeling method by the least squares method, and obtain the completion result Y_i of the i-th submodel by weighting:

Xz=y

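The estimate of the weight z is obtained by the least squares method:

ẑ=(X^T X)^{-1}X^T y

The completion result of the submodel is then the weighted combination of the three completion results under the estimated weights.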

9. The system for completing missing data of complex industrial process product quality indices based on selective two-layer ensemble learning according to claim 6, characterized in that the performance evaluation module performs its evaluation as follows:

S4. Take the composite index determined by the following formula as the score of each submodel, sort all submodels by score from high to low, and take the mean of the completion results of the S highest-scoring models as the final completion result:

The final result is then the average of the completion results of the S highest-scoring submodels,

where S=floor(40%×M), M is the number of submodels, and floor() is the round-down function, i.e., it returns the largest integer not greater than the value in the brackets; the terms being averaged are the completion results of the S highest-scoring submodels.