CN108734207B - Method for predicting concentration of butane at bottom of debutanizer tower based on model of double-optimization semi-supervised regression algorithm - Google Patents


Info

Publication number
CN108734207B
CN108734207B (application CN201810454373.4A)
Authority
CN
China
Prior art keywords
sample
samples
unlabeled
label
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810454373.4A
Other languages
Chinese (zh)
Other versions
CN108734207A (en)
Inventor
熊伟丽
程康明
马君霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Minglong Electronic Technology Co ltd
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN201810454373.4A priority Critical patent/CN108734207B/en
Publication of CN108734207A publication Critical patent/CN108734207A/en
Application granted granted Critical
Publication of CN108734207B publication Critical patent/CN108734207B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for predicting the butane concentration at the bottom of a debutanizer column based on a model of a double-optimization semi-supervised regression algorithm, and belongs to the field of semi-supervised regression. The center of the labeled-sample dense region is found through a double-optimization strategy; unlabeled samples are screened by their similarity to this center, and labeled samples are screened by their similarity to one another. A Gaussian process regression model is then built on the selected labeled samples as an auxiliary learner that predicts labels for the selected unlabeled samples. Finally, the pseudo-labeled samples are used to improve the prediction of the main learner. This solves the problem that, when labeled samples are few, the quality of the unlabeled samples cannot be guaranteed and accurate prediction therefore fails, and achieves accurate prediction from only a few labeled samples.

Description

Method for predicting concentration of butane at bottom of debutanizer tower based on model of double-optimization semi-supervised regression algorithm
Technical Field
The invention relates to a method for predicting the concentration of butane at the bottom of a debutanizer tower based on a model of a double-optimization semi-supervised regression algorithm, belonging to the field of semi-supervised regression.
Background
Some important quality variables in chemical, metallurgical, fermentation and other industrial processes cannot be measured by online instruments, and offline laboratory analysis suffers from serious lag; such quality variables therefore need to be predicted from sample data that can be measured directly.
With the development of science and technology, especially industrial big-data technology, large numbers of unlabeled samples are increasingly easy to obtain, while labeled samples remain expensive to acquire. Labeled samples are therefore scarce in some industrial processes, and conventional modeling methods struggle to guarantee prediction performance under this condition.
To address these problems, semi-supervised learning, which uses a small number of labeled samples together with a large number of unlabeled samples to improve learning performance, has attracted much attention. Research on semi-supervised clustering and semi-supervised classification is comparatively mature, while semi-supervised regression has been studied less. Existing semi-supervised regression methods include algorithms based on manifold learning, co-training algorithms, semi-supervised support vector regression, selective ensemble algorithms, and the like. When labeled samples are few, however, these methods cannot guarantee the quality of the unlabeled samples, and accurate prediction cannot be achieved.
Disclosure of Invention
To solve these problems and exploit unlabeled samples more accurately, the invention considers two facts: some unlabeled samples cannot be predicted accurately from a small number of labeled samples, and outliers among the few labeled samples degrade the prediction of the unlabeled samples. Two preference criteria are therefore defined, one for screening unlabeled samples and one for screening labeled samples, so that the selected unlabeled samples can be predicted accurately and the model improves after they are used. The method comprises the following steps:
Step 1: screen the unlabeled samples according to preference criterion 1 and preference criterion 2 with the unlabeled-sample screening algorithm to obtain the unlabeled sample set M1; the unlabeled samples come from actual sampling of the real debutanizer process;
Preference criterion 1 is described as follows: given a threshold θ1, the Mahalanobis distance is used to measure the similarity between an unlabeled sample x′i and the dense-region center C of the labeled samples; if the distance di between x′i and C is less than θ1, then x′i satisfies the criterion, where di is obtained from formulas (1) to (3); the labeled samples come from actual sampling of the real debutanizer process;
di = sqrt[(x′i − C)′ M⁻¹ (x′i − C)]  (1)
M = (1/n) Σi=1..n (x′i − x̄′)(x′i − x̄′)′  (2)
x̄′ = (1/n) Σi=1..n x′i  (3)
where M is the covariance matrix of the unlabeled samples, n is the number of unlabeled samples, and x̄′ is the mean of the unlabeled samples;
Preference criterion 2 is described as follows: given a threshold θ2, the Mahalanobis distance d(xi, xj) is used to measure the similarity between samples; for a sample xi, count the number m of surrounding samples xj whose Mahalanobis distance to xi is less than θ2; if m ≥ 2, then xi satisfies the criterion, where d(xi, xj) is obtained from formulas (4) to (6);
d(xi, xj) = sqrt[(xi − xj)′ S⁻¹ (xi − xj)]  (4)
S = (1/n) Σi=1..n (xi − x̄)(xi − x̄)′  (5)
x̄ = (1/n) Σi=1..n xi  (6)
where S is the covariance matrix of the labeled samples, n is the number of labeled samples, and x̄ is the mean of the labeled samples;
The Mahalanobis distance is a covariance-aware distance between data points and is an effective way to measure the similarity of two unknown sample sets;
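As an illustrative sketch that is not part of the patent text, the Mahalanobis distance of formulas (1) and (4) can be computed with NumPy; the sample values below are hypothetical stand-ins for process data:

```python
import numpy as np

def mahalanobis(x, center, cov):
    # Mahalanobis distance of formulas (1)/(4): sqrt((x - c)' * cov^-1 * (x - c))
    diff = x - center
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical 2-D samples; the last one is an outlier.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
center = X.mean(axis=0)        # stand-in for the dense-region center C
cov = np.cov(X, rowvar=False)  # sample covariance matrix (M or S)
d = [mahalanobis(x, center, cov) for x in X]
```

Unlike the Euclidean distance, this metric rescales each direction by the data covariance, so strongly correlated dimensions are not double-counted.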
The unlabeled-sample screening algorithm is as follows:
Step 1: initialization 1: set i = 1 and choose a threshold θ3;
Step 2: judge in turn whether each labeled sample xi satisfies preference criterion 2 with θ3 substituted for θ2 as the similarity constraint, and collect the qualifying labeled samples into a matrix A;
Step 3: use the matrix A to find the center C of the sample-dense region:
Ci = (1/l) Σj=1..l Aji
where l is the number of dense-region samples contained in A and i indexes the sample dimension;
Step 4: compute the distance di between each unlabeled sample x′i and C according to formulas (1) to (3), select the unlabeled samples that satisfy preference criterion 1, and store them in the matrix M1;
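The screening steps above can be sketched in Python as follows. This is a non-authoritative illustration: the thresholds θ1 and θ3 follow the text, the m ≥ 2 neighbor rule comes from preference criterion 2, and the function names and any data supplied by a caller are assumptions:

```python
import numpy as np

def pairwise_mahalanobis(X):
    # All pairwise Mahalanobis distances within sample set X (formulas (4)-(6)).
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.einsum('ijk,kl,ijl->ij', diffs, S_inv, diffs))

def screen_unlabeled(X_lab, X_unlab, theta1, theta3):
    """Sketch of the screening algorithm: find the dense-region center C from
    the labeled samples (criterion 2 with theta3), then keep the unlabeled
    samples whose Mahalanobis distance to C is below theta1 (criterion 1)."""
    D = pairwise_mahalanobis(X_lab)
    # Criterion 2: at least m >= 2 neighbors closer than theta3 (self excluded).
    neighbor_counts = (D < theta3).sum(axis=1) - 1
    A = X_lab[neighbor_counts >= 2]
    C = A.mean(axis=0)  # dense-region center, per-dimension mean of A
    M_inv = np.linalg.inv(np.cov(X_unlab, rowvar=False))
    d = np.sqrt(np.einsum('ij,jk,ik->i', X_unlab - C, M_inv, X_unlab - C))
    return X_unlab[d < theta1]  # selected unlabeled set M1
```

A quick sanity check of such an implementation: a very large θ1 keeps every unlabeled sample, while a very small θ1 rejects all of them.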
Step 2: select labeled samples according to preference criterion 2 with the auxiliary-learner construction algorithm and build a more targeted auxiliary learner f1;
The auxiliary learner predicts the label of the unlabeled sample by utilizing a model established by the labeled sample;
the auxiliary learner set-up algorithm is as follows:
Step 1: initialization 2: set i = 1;
Step 2: judge in turn whether each labeled sample xi satisfies preference criterion 2, and collect the qualifying labeled samples into a matrix B;
Step 3: build the auxiliary learner f1 from B using Gaussian process regression (GPR);
GPR is a nonparametric probabilistic model based on statistical learning theory; modeling with GPR proceeds as follows:
Given a training sample set X ∈ R^(D×N) and y ∈ R^N, where X = {xi ∈ R^D}, i = 1…N, and y = {yi ∈ R}, i = 1…N, denote the D-dimensional input data and the output data respectively, the relationship between input and output is generated by formula (7):
y = f(x) + ε  (7)
where f is an unknown function and ε is Gaussian noise with mean 0 and variance σn²; for a new input x*, the corresponding probabilistic prediction output y* also follows a Gaussian distribution, whose mean and variance are given by formulas (8) and (9):
y*(x*) = cᵀ(x*) C⁻¹ y  (8)
σ²y*(x*) = c(x*, x*) − cᵀ(x*) C⁻¹ c(x*)  (9)
where c(x*) = [c(x*, x1), …, c(x*, xN)]ᵀ is the covariance vector between the training data and the test datum, C = Σ + σn²·I is the covariance matrix of the training data, I is the N×N identity matrix, and c(x*, x*) is the autocovariance of the test datum;
GPR selects the Gaussian covariance function:
c(xi, xj) = v·exp[−(1/2) Σd=1..D ωd (xi^d − xj^d)²]  (10)
where v controls the overall magnitude of the covariance and ωd represents the relative importance of each component x^d;
The unknown parameters v, ω1, …, ωD in formula (10) and the Gaussian noise variance σn² are collected into a parameter vector θ = [v, ω1, …, ωD, σn²], which is obtained by maximum-likelihood estimation, i.e. by maximizing the log likelihood
L(θ) = −(1/2)·ln|C| − (1/2)·yᵀ C⁻¹ y − (N/2)·ln 2π;
The procedure for finding the value of the parameter θ is as follows:
to escape local optima, the parameter θ is initialized with random values drawn from ranges of different magnitudes, one value per range, the magnitudes being 0.001, 0.01, 0.1, 1, 10, and so on;
the optimized parameters are then obtained with a conjugate-gradient method;
after the optimal parameter θ is obtained, the output value of the GPR model for a test sample x* is estimated by formulas (8) and (9);
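A minimal numerical sketch of the predictor defined by formulas (8)-(10), assuming the hyperparameters v, ωd and σn² are already known; the function names are illustrative, not from the patent:

```python
import numpy as np

def gauss_cov(Xa, Xb, v, omega):
    # Gaussian covariance function of formula (10):
    # c(xi, xj) = v * exp(-0.5 * sum_d omega_d * (xi_d - xj_d)^2)
    diff = Xa[:, None, :] - Xb[None, :, :]
    return v * np.exp(-0.5 * np.einsum('ijd,d->ij', diff**2, omega))

def gpr_predict(X, y, x_star, v, omega, noise_var):
    # Posterior mean (8) and variance (9) with C = Sigma + sigma_n^2 * I.
    C = gauss_cov(X, X, v, omega) + noise_var * np.eye(len(X))
    c = gauss_cov(X, x_star[None, :], v, omega)[:, 0]
    mean = c @ np.linalg.solve(C, y)     # formula (8)
    var = v - c @ np.linalg.solve(C, c)  # formula (9), with c(x*, x*) = v
    return mean, var
```

On a smooth one-dimensional target the predictor interpolates the training points almost exactly when the noise variance is small.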
Step 3: use the auxiliary learner f1 to predict labels for the unlabeled sample set M1, add the resulting pseudo-labeled sample set S1 to the initial labeled sample set S0, and build the main learner with the GPR method, where S0 is the initial labeled sample set;
a pseudo-labeled sample is a sample whose label has been generated by the auxiliary learner rather than measured; the main learner tracks the test samples using a model built from the labeled samples combined with the pseudo-labeled samples.
Optionally, the method further includes:
selecting the dense-region center by first selecting the samples that belong to the sample-dense region;
the sample-dense region is a region in which the samples are concentrated in distribution, and the dense-region center C is the center of this region.
Optionally, the method is applied in industrial processes to predict, using unlabeled samples, variables that cannot be measured directly.
Optionally, the industrial process comprises environmental, metallurgical and chemical processes.
The invention has the beneficial effects that:
Through the double-optimization strategy, the center of the labeled-sample dense region is found; unlabeled samples are screened by their similarity to this center, and labeled samples are screened by their similarity to one another. A Gaussian process regression model is then built on the selected labeled samples as an auxiliary learner that predicts labels for the selected unlabeled samples. Finally, the pseudo-labeled samples are used to improve the prediction of the main learner. This solves the problem that, when labeled samples are few, the quality of the unlabeled samples cannot be guaranteed and accurate prediction therefore fails, and achieves accurate prediction from only a few labeled samples.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general algorithm flow diagram;
FIG. 2 is the histogram distribution of labeled and unlabeled samples;
FIG. 3 is a diagram of a numerically simulated dual-preferred semi-supervised prediction effect;
FIG. 4 is a longitudinal comparison of different methods;
FIG. 5 is a comparison of the prediction errors of different methods;
FIG. 6 shows histogram statistics of predicted versus actual values for the different methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Example (b):
This embodiment provides a model prediction method based on the double-optimization semi-supervised regression algorithm, taking a common chemical process, the debutanizer process, as an example. The experimental data come from actual sampling of the real process, and the butane concentration is predicted. Referring to FIG. 1, the method comprises:
Step 1: screen the unlabeled samples according to preference criterion 1 and preference criterion 2 with the unlabeled-sample screening algorithm to obtain the unlabeled sample set M1.
Preferred criteria 1 are as follows: given a threshold value theta1Measure unlabeled sample x 'using Mahalanobis distance'iSimilarity to labeled sample dense center C, diX'iThe distance from C is less than theta1X'iThe preferred conditions are satisfied. Wherein d isiObtained from the formulas (1) to (3).
di=sqrt[(x′i-C)′M-1(x′i-C)] (1)
Figure GDA0002990457990000051
Figure GDA0002990457990000052
Wherein M is the unlabeled sample covariance matrix, n is the number of unlabeled samples,
Figure GDA0002990457990000053
mean unlabeled samples.
Preference criterion 2 is as follows: given a threshold θ2, the Mahalanobis distance d(xi, xj) is used to measure the similarity between samples; for a sample xi, count the number m of surrounding samples xj whose Mahalanobis distance to xi is less than θ2; if m ≥ 2, then xi satisfies the criterion, where d(xi, xj) is obtained from formulas (4) to (6).
d(xi, xj) = sqrt[(xi − xj)′ S⁻¹ (xi − xj)]  (4)
S = (1/n) Σi=1..n (xi − x̄)(xi − x̄)′  (5)
x̄ = (1/n) Σi=1..n xi  (6)
where S is the covariance matrix of the labeled samples, n is the number of labeled samples, and x̄ is the mean of the labeled samples.
mahalanobis distance (Mahalanobis distance) is proposed by the indian statistician Mahalanobis (p.c. Mahalanobis) and represents the covariance distance of the data. The method is an effective method for calculating the similarity of two unknown sample sets.
The unlabeled-sample screening algorithm is as follows:
Step 1: initialization 1: set i = 1 and choose a threshold θ3;
Step 2: judge in turn whether each labeled sample xi satisfies preference criterion 2 under the θ3 limit (i.e. θ3 is substituted for θ2 as the similarity constraint), and collect the qualifying labeled samples into a matrix A;
Step 3: use the matrix A to find the center C of the sample-dense region:
Ci = (1/l) Σj=1..l Aji
where l is the number of dense-region samples contained in A and i indexes the sample dimension;
Step 4: compute the distance di between each unlabeled sample x′i and C by formulas (1) to (3), select the unlabeled samples that satisfy preference criterion 1, and store them in the matrix M1.
Step 2: utilizing an auxiliary learner to establish an algorithm, selecting labeled samples according to an optimal selection criterion 2, and establishing a more targeted auxiliary learner f1
The secondary learner uses a model built from the labeled exemplars to predict the labels of the unlabeled exemplars.
The auxiliary learner set-up algorithm is as follows:
Step 1: initialization 2: set i = 1;
Step 2: judge in turn whether each labeled sample xi satisfies preference criterion 2, and collect the qualifying labeled samples into a matrix B;
Step 3: build the auxiliary learner f1 from B using Gaussian process regression (GPR).
GPR is a nonparametric probabilistic model based on statistical learning theory; modeling with GPR proceeds as follows:
Given a training sample set X ∈ R^(D×N) and y ∈ R^N, where X = {xi ∈ R^D}, i = 1…N, and y = {yi ∈ R}, i = 1…N, denote the D-dimensional input data and the output data respectively, the relationship between input and output is generated by formula (7):
y = f(x) + ε  (7)
where f is an unknown function and ε is Gaussian noise with mean 0 and variance σn². For a new input x*, the corresponding probabilistic prediction output y* also follows a Gaussian distribution, whose mean and variance are given by formulas (8) and (9):
y*(x*) = cᵀ(x*) C⁻¹ y  (8)
σ²y*(x*) = c(x*, x*) − cᵀ(x*) C⁻¹ c(x*)  (9)
where c(x*) = [c(x*, x1), …, c(x*, xN)]ᵀ is the covariance vector between the training data and the test datum, C = Σ + σn²·I is the covariance matrix of the training data, I is the N×N identity matrix, and c(x*, x*) is the autocovariance of the test datum.
GPR can use different covariance functions c(xi, xj) to generate the covariance matrix Σ, as long as the chosen function guarantees that the generated covariance matrix is positive semi-definite. The Gaussian covariance function is chosen here:
c(xi, xj) = v·exp[−(1/2) Σd=1..D ωd (xi^d − xj^d)²]  (10)
where v controls the overall magnitude of the covariance and ωd represents the relative importance of each component x^d.
For the unknown parameters v, ω in equation (10)1,…,ωDSum of Gaussian noise variance
Figure GDA0002990457990000065
The simplest method is to obtain the parameters by maximum likelihood estimation
Figure GDA0002990457990000066
Figure GDA0002990457990000067
The procedure for finding the value of the parameter θ is as follows:
to escape local optima, the parameter θ is initialized with random values drawn from ranges of different magnitudes, one value per range, the magnitudes being 0.001, 0.01, 0.1, 1 and 10 respectively;
the optimized parameters are then obtained with a conjugate-gradient method;
after the optimal parameter θ is obtained, the output values of the GPR model for a test sample x* can be estimated by formulas (8) and (9).
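The multi-start conjugate-gradient search described above can be sketched as follows. This is an assumed implementation, not the patent's: SciPy's general-purpose CG minimizer stands in for the unspecified conjugate-gradient method, the starts are deterministic at each magnitude instead of random draws, and the objective is the standard Gaussian-process log marginal likelihood optimized in log-parameter space:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    # theta = (v, omega_1..omega_D, sigma_n^2), optimized in log space so all
    # parameters stay positive; clipping keeps the search numerically stable.
    log_theta = np.clip(log_theta, -10.0, 10.0)
    v, *rest = np.exp(log_theta)
    omega, noise = np.array(rest[:-1]), rest[-1]
    diff = X[:, None, :] - X[None, :, :]
    C = v * np.exp(-0.5 * np.einsum('ijd,d->ij', diff**2, omega))
    C += noise * np.eye(len(X))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

def fit_hyperparams(X, y, starts=(0.001, 0.01, 0.1, 1.0, 10.0)):
    # One start per magnitude, each refined by a conjugate-gradient minimizer;
    # the best result over all starts is kept, to escape local optima.
    best = None
    for s in starts:
        x0 = np.log(np.full(X.shape[1] + 2, s))
        res = minimize(neg_log_marginal_likelihood, x0, args=(X, y), method='CG')
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(np.clip(best.x, -10.0, 10.0))  # (v, omega_1..omega_D, sigma_n^2)
```

Keeping the best of several starts is a pragmatic trade-off: each CG run is cheap on small labeled sets, and the restarts guard against the non-convex likelihood surface.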
And step 3: using auxiliary learning devices f1For unlabeled sample set M1Predicting the label, and collecting the obtained pseudo label sample set S1Added to the initially labeled sample set S0(S0An initial set of labeled samples), a master learner is established using the GPR method.
The pseudo label sample is a sample obtained by artificially predicting a non-label sample by using an auxiliary learner, and the main learner tracks the test sample by using a model established by combining the label sample with the pseudo label sample.
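The overall auxiliary-learner/main-learner flow of steps 2 and 3 can be sketched with scikit-learn's Gaussian process regressor standing in for the GPR formulation above; the data, kernel choice and function names are illustrative assumptions, not the patent's specification:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def semi_supervised_gpr(X_lab, y_lab, X_unlab_sel, X_test):
    """Sketch of steps 2-3: an auxiliary GPR f1 built on the screened labeled
    samples assigns pseudo labels to the screened unlabeled set M1; labeled
    and pseudo-labeled samples together then train the main learner."""
    f1 = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    f1.fit(X_lab, y_lab)
    y_pseudo = f1.predict(X_unlab_sel)       # pseudo-label set S1
    X_aug = np.vstack([X_lab, X_unlab_sel])  # S0 together with S1
    y_aug = np.concatenate([y_lab, y_pseudo])
    main = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    main.fit(X_aug, y_aug)
    return main.predict(X_test)
```

Because the pseudo labels come from the auxiliary learner itself, this pipeline only helps when the unlabeled inputs have been screened to lie near the labeled dense region, which is exactly what the double-optimization strategy enforces.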
FIG. 2 shows the histogram distributions of the labeled and unlabeled samples, illustrating the necessity of double optimization theoretically; FIG. 3 and FIGS. 4-6 show the results of the numerical simulation and of the debutanizer process simulation respectively, demonstrating the effectiveness of double optimization experimentally.
According to the invention, through the double-optimization strategy, the center of the labeled-sample dense region is found; unlabeled samples are screened by their similarity to this center, and labeled samples are screened by their similarity to one another. A Gaussian process regression model is then built on the selected labeled samples as an auxiliary learner that predicts labels for the selected unlabeled samples. Finally, the pseudo-labeled samples are used to improve the prediction of the main learner. This solves the problem that, when labeled samples are few, the quality of the unlabeled samples cannot be guaranteed and accurate prediction therefore fails, and achieves accurate prediction from only a few labeled samples.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A method for predicting the concentration of butane at the bottom of a debutanizer tower based on a model of a double-preference semi-supervised regression algorithm, which is characterized by comprising the following steps:
Step 1: screening the unlabeled samples according to preference criterion 1 and preference criterion 2 with the unlabeled-sample screening algorithm to obtain the unlabeled sample set M1, the unlabeled samples coming from actual sampling of the real debutanizer process;
preference criterion 1 being described as follows: given a threshold θ1, the Mahalanobis distance is used to measure the similarity between an unlabeled sample x′i and the dense-region center C of the labeled samples; if the distance di between x′i and C is less than θ1, then x′i satisfies the criterion, where di is obtained from formulas (1) to (3), the labeled samples coming from actual sampling of the real debutanizer process;
di = sqrt[(x′i − C)′ M⁻¹ (x′i − C)]  (1)
M = (1/n) Σi=1..n (x′i − x̄′)(x′i − x̄′)′  (2)
x̄′ = (1/n) Σi=1..n x′i  (3)
where M is the covariance matrix of the unlabeled samples, n is the number of unlabeled samples, and x̄′ is the mean of the unlabeled samples;
preference criterion 2 being described as follows: given a threshold θ2, the Mahalanobis distance d(xi, xj) is used to measure the similarity between samples; for a sample xi, count the number m of surrounding samples xj whose Mahalanobis distance to xi is less than θ2; if m ≥ 2, then xi satisfies the criterion, where d(xi, xj) is obtained from formulas (4) to (6);
d(xi, xj) = sqrt[(xi − xj)′ S⁻¹ (xi − xj)]  (4)
S = (1/n) Σi=1..n (xi − x̄)(xi − x̄)′  (5)
x̄ = (1/n) Σi=1..n xi  (6)
where S is the covariance matrix of the labeled samples, n is the number of labeled samples, and x̄ is the mean of the labeled samples;
the Mahalanobis distance being a covariance-aware distance between data points that effectively measures the similarity of two unknown sample sets;
the unlabeled-sample screening algorithm being as follows:
Step 1: initialization 1: set i = 1 and choose a threshold θ3;
Step 2: judge in turn whether each labeled sample xi satisfies preference criterion 2 with θ3 substituted for θ2 as the similarity constraint, and collect the qualifying labeled samples into a matrix A;
Step 3: use the matrix A to find the center C of the sample-dense region:
Ci = (1/l) Σj=1..l Aji
where l is the number of dense-region samples contained in A and i indexes the sample dimension;
Step 4: compute the distance di between each unlabeled sample x′i and C according to formulas (1) to (3), select the unlabeled samples that satisfy preference criterion 1, and store them in the matrix M1;
Step 2: selecting labeled samples according to preference criterion 2 with the auxiliary-learner construction algorithm and building a more targeted auxiliary learner f1;
The auxiliary learner predicts the label of the unlabeled sample by utilizing a model established by the labeled sample;
the auxiliary learner set-up algorithm is as follows:
Step 1: initialization 2: set i = 1;
Step 2: judge in turn whether each labeled sample xi satisfies preference criterion 2, and collect the qualifying labeled samples into a matrix B;
Step 3: build the auxiliary learner f1 from B using Gaussian process regression GPR;
GPR being a nonparametric probabilistic model based on statistical learning theory, with modeling as follows:
given a training sample set X ∈ R^(D×N) and y ∈ R^N, where X = {xi ∈ R^D}, i = 1…N, and y = {yi ∈ R}, i = 1…N, denote the D-dimensional input data and the output data respectively, the relationship between input and output is generated by formula (7):
y = f(x) + ε  (7)
where f is an unknown function and ε is Gaussian noise with mean 0 and variance σn²; for a new input x*, the corresponding probabilistic prediction output y* also follows a Gaussian distribution, whose mean and variance are given by formulas (8) and (9):
y*(x*) = cᵀ(x*) C⁻¹ y  (8)
σ²y*(x*) = c(x*, x*) − cᵀ(x*) C⁻¹ c(x*)  (9)
where c(x*) = [c(x*, x1), …, c(x*, xN)]ᵀ is the covariance vector between the training data and the test datum, C = Σ + σn²·I is the covariance matrix of the training data, I is the N×N identity matrix, and c(x*, x*) is the autocovariance of the test datum;
GPR selecting the Gaussian covariance function:
c(xi, xj) = v·exp[−(1/2) Σd=1..D ωd (xi^d − xj^d)²]  (10)
where v controls the overall magnitude of the covariance and ωd represents the relative importance of each component x^d;
the unknown parameters v, ω1, …, ωD in formula (10) and the Gaussian noise variance σn² being collected into a parameter vector θ = [v, ω1, …, ωD, σn²], obtained by maximum-likelihood estimation, i.e. by maximizing the log likelihood
L(θ) = −(1/2)·ln|C| − (1/2)·yᵀ C⁻¹ y − (N/2)·ln 2π;
the procedure for finding the value of the parameter θ being as follows:
to escape local optima, the parameter θ is initialized with random values drawn from ranges of different magnitudes, one value per range, the magnitudes being 0.001, 0.01, 0.1, 1 and 10 respectively;
the optimized parameters are then obtained with a conjugate-gradient method;
after the optimal parameter θ is obtained, the output value of the GPR model for a test sample x* is estimated by formulas (8) and (9);
Step 3: using the auxiliary learner f1 to predict labels for the unlabeled sample set M1, adding the resulting pseudo-labeled sample set S1 to the initial labeled sample set S0, and building the main learner with the GPR method, where S0 is the initial labeled sample set;
a pseudo-labeled sample being a sample whose label has been generated by the auxiliary learner rather than measured, the main learner tracking the test samples using a model built from the labeled samples combined with the pseudo-labeled samples; that is, the established model is used to predict the butane concentration at the bottom of the debutanizer.
2. The method of claim 1, further comprising:
selecting the dense-region center by first selecting the samples that belong to the sample-dense region;
the sample-dense region being a region in which the samples are concentrated in distribution, and the dense-region center C being the center of this region.
3. The method of claim 1, wherein the method is applied in an industrial process to predict, using unlabeled samples, variables that cannot be measured directly.
4. The method of claim 3, wherein the industrial process comprises environmental, metallurgical and chemical processes.
CN201810454373.4A 2018-05-14 2018-05-14 Method for predicting concentration of butane at bottom of debutanizer tower based on model of double-optimization semi-supervised regression algorithm Active CN108734207B (en)

Publications (2)

Publication Number / Publication Date
CN108734207A (en) — 2018-11-02
CN108734207B (en) — 2021-05-28


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5814460A (en) * 1990-02-14 1998-09-29 Diatide, Inc. Method for generating and screening useful peptides
CN102693452A (en) * 2012-05-11 2012-09-26 上海交通大学 Multiple-model soft-measuring method based on semi-supervised regression learning
CN104778298A (en) * 2015-01-26 2015-07-15 江南大学 Gaussian process regression soft measurement modeling method based on EGMM (Error Gaussian Mixture Model)
CN104914723A (en) * 2015-05-22 2015-09-16 浙江大学 Industrial process soft measurement modeling method based on cooperative training partial least squares model
CN105205219A (en) * 2015-08-25 2015-12-30 华南师范大学 Production prediction method and system based on nonlinear regression model parameters
CN107451102A (en) * 2017-07-28 2017-12-08 江南大学 Semi-supervised Gaussian process regression soft-sensor modeling method based on an improved self-training algorithm

Non-Patent Citations (2)

Title
Application of multiple linear regression, central composite; Su Sin Chong et al.; Measurement; 2015-07-04; pp. 78–86 *
Research on dynamic soft-sensing methods based on Gaussian mixture regression; Su Yong; China Master's Theses Full-text Database, Basic Sciences; 2018-01-15 (No. 01); full text *


Similar Documents

Publication Publication Date Title
Sun et al. Using Bayesian deep learning to capture uncertainty for residential net load forecasting
WO2022262500A1 (en) Steof-lstm-based method for predicting marine environmental elements
Liu et al. A driving intention prediction method based on hidden Markov model for autonomous driving
CN108734207B (en) Method for predicting concentration of butane at bottom of debutanizer tower based on model of double-optimization semi-supervised regression algorithm
Angelov et al. Identification of evolving fuzzy rule-based models
CN108764295B (en) Method for predicting concentration of butane at bottom of debutanizer tower based on soft measurement modeling of semi-supervised ensemble learning
CN104699894A (en) JITL (just-in-time learning) based multi-model fusion modeling method adopting GPR (Gaussian process regression)
CN113012766B (en) Self-adaptive soft measurement modeling method based on online selective integration
CN109543731A (en) Semi-supervised regression algorithm with triple optimal selection under a self-training framework
CN114912195B (en) Aerodynamic sequence optimization method for commercial vehicle
CN105913078A (en) Multi-mode soft measurement method for improving adaptive affine propagation clustering
CN107704426A (en) Water level prediction method based on extension wavelet-neural network model
Kasiviswanathan et al. Quantification of prediction uncertainty in artificial neural network models
CN115099511A (en) Photovoltaic power probability estimation method and system based on optimized copula
CN115742855A (en) Electric automobile remaining mileage prediction method and device, electric automobile and medium
Lu et al. Neural network interpretability for forecasting of aggregated renewable generation
Billert et al. A method of developing quantile convolutional neural networks for electric vehicle battery temperature prediction trained on cross-domain data
CN110491443B (en) lncRNA protein correlation prediction method based on projection neighborhood non-negative matrix decomposition
Rottmann et al. Learning non-stationary system dynamics online using gaussian processes
JP4220169B2 (en) Actual vehicle coating thickness prediction method, actual vehicle coating thickness prediction system, and recording medium
CN101226521A (en) Machine learning method for ambiguity data object estimation modeling
Yi et al. Efficient global optimization using a multi-point and multi-objective infill sampling criteria
CN112163632A (en) Application of semi-supervised extreme learning machine based on bat algorithm in industrial detection
CN115691140B (en) Analysis and prediction method for space-time distribution of automobile charging demand
Ito et al. Design space exploration using Self-Organizing Map based adaptive sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221024

Address after: 230000 B-1015, Woyuan Garden, 81 Ganquan Road, Shushan District, Hefei, Anhui

Patentee after: HEFEI MINGLONG ELECTRONIC TECHNOLOGY Co.,Ltd.

Address before: No. 1800 Lihu Avenue, Wuxi City, Jiangsu Province

Patentee before: Jiangnan University