CN110674883A - Active learning method based on k nearest neighbor and probability selection - Google Patents

Info

Publication number: CN110674883A (application CN201910936977.7A)
Authority: CN (China)
Prior art keywords: sample, unlabeled, active learning, neighborhood
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN201910936977.7A
Original language: Chinese (zh)
Inventors: 熊伟丽 (Xiong Weili), 代学志 (Dai Xuezhi), 马君霞 (Ma Junxia)
Assignee: Jiangnan University
Filing/priority date: 2019-09-29; application filed by Jiangnan University. Publication date: 2020-01-10 (publication of CN110674883A).

Classifications

    • G06F18/24147 — Pattern recognition; classification based on distances to training or reference patterns; distances to closest patterns, e.g. nearest neighbour classification
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N20/00 — Machine learning

Abstract

The invention discloses an active learning method based on k nearest neighbor and probability selection. The method acquires corresponding data from an industrial control platform system, sets the number of neighbors k, and calculates the number of representative samples; evaluates samples and has them manually labeled; and updates a GPR model and the training set, iterating until the required model accuracy is reached. The step of setting k and calculating the number of representative samples comprises: dividing the training set into a labeled sample set and an unlabeled sample set; setting the number of neighborhood samples k; and calculating, from the value of k, the number of representative samples to be labeled in the unlabeled sample set. By jointly considering the uncertainty and the representative information of the unlabeled sample set, the method selects samples more reasonably and improves the prediction performance of the trained model at minimal labeling cost.

Description

Active learning method based on k nearest neighbor and probability selection
Technical Field
The invention relates to the technical field of industrial processes, in particular to an active learning method based on k nearest neighbor and probability selection.
Background
In complex industrial processes, real-time monitoring and control of certain process variables is important to production, but these variables are difficult to measure online with sensors owing to the limitations of current technology. To estimate such variables accurately in real time, soft sensor technology has developed rapidly. Common soft sensor models include principal component regression, partial least squares regression, support vector machines, artificial neural networks and Gaussian process regression (GPR). GPR is a modeling method developed from Bayesian theory and can handle complex regression problems effectively.
A soft sensor model is constructed from input variables and an output variable. The input variables can be measured accurately by sensors, whereas the output variable is difficult to detect directly because of the harsh environment of the industrial field and the constraints of economic cost. Industrial processes therefore contain a large amount of unlabeled data and only limited labeled data. Traditional soft sensor models are built from the labeled sample set alone and do not exploit the information in the unlabeled sample set; semi-supervised learning improves model performance by using unlabeled and labeled samples together, which addresses this problem well. Traditional semi-supervised methods include self-training, co-training, probabilistic generative models and graph-based semi-supervised learning. Although semi-supervised learning improves model performance to a certain extent, it generally does not incorporate expert knowledge, so the model accuracy may still fall short of the requirements of industrial production. Active learning labels unlabeled samples with expert knowledge, thereby assisting learning from the labeled samples and further improving the model: it raises model performance while acquiring only a small amount of manually labeled data. How to select, from the unlabeled sample set, the samples that improve model performance is therefore the key issue in active learning.
Researchers have studied this problem from different angles. Ge used the output variance of a GPR model as the evaluation index and selected samples with large variance for manual labeling; this evaluates samples effectively for active learning and yields a relatively accurate soft sensor model over the iterations, but the approach applies only to GPR models. Shi et al. proposed an active learning algorithm based on approximate linear dependence, which measures the information of unlabeled samples with approximate linear dependence to some effect, but the method does not use the output information of the samples and easily biases the information evaluation. Zhou et al. proposed an active learning method based on diversity-driven ensemble learning that performs well on classification problems; however, during the iterations of that algorithm some highly similar samples may be selected, causing local overfitting or weak generalization of the soft sensor model.
Disclosure of Invention
This section summarizes some aspects of embodiments of the invention and briefly introduces some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and the title of the application, to avoid obscuring their purpose; such simplifications or omissions are not intended to limit the scope of the invention.
The invention is provided to solve the key problem of active learning, namely how to select, from the unlabeled sample set, the samples that improve model performance.
Therefore, the invention aims to provide an active learning method based on k-nearest neighbor and probability selection.
In order to solve the above technical problem, the invention provides the following technical scheme: an active learning method based on k nearest neighbor and probability selection, comprising: acquiring corresponding data from an industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples; evaluating samples and manually labeling them; and updating the GPR model and the training set, iterating until the required model accuracy is reached.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of acquiring corresponding data from the industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples comprises the following steps:
acquiring a training set from the industrial control platform system and dividing it into a labeled sample set and an unlabeled sample set;
setting the number of neighborhood samples k;
and calculating, from the value of k, the number of representative samples to be labeled in the unlabeled sample set.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the labeled sample set and the unlabeled sample set are respectively

D_l = {(x_i, y_i)}_{i=1}^{n_l}, x_i ∈ R^m, and D_u = {x_j}_{j=1}^{n_u}, x_j ∈ R^m

where n_l and n_u are the numbers of samples in the labeled and unlabeled sample sets, respectively, and m is the number of auxiliary variables.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the number of representative samples p is calculated as:

p = n_u / k

where n_u is the number of samples in the unlabeled sample set and k is the number of neighborhood samples.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of evaluating the samples and manually labeling them comprises:
performing subspace ensemble on the unlabeled sample set by principal component analysis and establishing the corresponding GPR (Gaussian process regression) sub-learners;
calculating the uncertainty of each unlabeled sample from the outputs of all the sub-learners and taking it as the sample evaluation criterion;
and, under the neighborhood information criterion, selecting representative samples for manual labeling by constructing the k nearest neighbors of the sample.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the uncertainty of an unlabeled sample is calculated as

δ(x_u) = sqrt( (1/d) Σ_{i=1}^{d} (ŷ_u^i − ȳ_u)² )

where d is the number of subsets obtained by subspace division of the labeled sample set, ŷ_u^i is the prediction output of the i-th subspace, and ȳ_u is the average of the d subspace prediction outputs:

ȳ_u = (1/d) Σ_{i=1}^{d} ŷ_u^i

The unlabeled sample with the largest uncertainty is x_δ.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample under the neighborhood information criterion comprises:
using the k nearest neighbors of the unlabeled sample x_δ to construct a sample neighborhood S;
selecting the sample most similar to the neighborhood center for labeling;
manually labeling the selected sample;
wherein the labeled sample is a representative sample.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the neighborhood center x̄_δ of the unlabeled sample x_δ is:

x̄_δ = (1/k) Σ_{i=1}^{k} x_i

where k is the number of neighborhood samples and x_i are the sample points in the neighborhood of x_δ.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the selection strategy for the representative sample can be expressed as

x_s = argmin_{x_i ∈ S} ‖x_i − x̄_δ‖

where x_i are the sample points in the neighborhood of x_δ and x̄_δ is its neighborhood center.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of updating the GPR model and the training set and iterating until the model accuracy comprises:
adding the manually labeled representative samples to the labeled sample set and establishing a new GPR model;
the remaining unlabeled sample set entering a new iteration until the model accuracy meets the requirement.
The invention has the following beneficial effects: by jointly considering the uncertainty and the representative information of the unlabeled sample set, samples are selected more reasonably and the prediction performance of the trained model is improved at minimal labeling cost.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. In the drawings:
Fig. 1 is a schematic diagram of the overall flow of the active learning method based on k nearest neighbor and probability selection according to the present invention.
Fig. 2 is a schematic diagram of the step of setting the number of neighbors k and calculating the number of representative samples.
Fig. 3 is a schematic diagram of the step of evaluating samples and manually labeling them.
Fig. 4 is a schematic diagram of the step of selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample under the neighborhood information criterion.
Fig. 5 is a schematic diagram of the step of updating the GPR model and the training set and iterating until the model accuracy is reached.
Fig. 6 is a schematic diagram of the training-set updating process of the active learning strategy.
Fig. 7 is a flow chart of the proposed active learning algorithm.
Fig. 8 is a schematic diagram of the prediction results of random selection.
Fig. 9 is a schematic diagram of the prediction results of probability selection.
Fig. 10 is a schematic diagram of the prediction results of the proposed active learning algorithm.
Fig. 11 is a schematic diagram of the RMSE of the active learning algorithms at different iteration counts.
Fig. 12 is a schematic diagram of the prediction results of the GPR model with random selection at the 60th iteration.
Fig. 13 is a schematic diagram of the prediction results of the GPR model with probability selection at the 60th iteration.
Fig. 14 is a schematic diagram of the prediction results of the GPR model of the proposed active learning algorithm at the 60th iteration.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, the invention may be practiced in other ways than those specifically described here, and those skilled in the art can make similar generalizations without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Furthermore, the present invention is described in detail with reference to schematic drawings; the schematic drawings are examples only and should not limit the scope of the invention.
Example 1
Referring to Figs. 1 to 7, an overall flow of the active learning method based on k nearest neighbor and probability selection is provided; as shown in Fig. 1, the method includes the following steps:
s1: acquiring corresponding data based on an industrial control platform system, setting neighbor k, and calculating the number of representative samples;
s2: evaluating the sample and manually marking the sample;
s3: and updating the GPR model and the training set, and iterating until the model precision is reached.
According to the method, the uncertainty and the representative information of the unlabeled sample set are considered jointly, so that samples are selected more reasonably and the prediction performance of the trained model is improved at minimal labeling cost. The corresponding data is acquired from enterprise industrial control platforms such as a DCS (distributed control system), PCS (process control system), APC (advanced process control) and MES (manufacturing execution system).
Specifically, the main structure of the invention comprises the following steps:
S1: acquiring corresponding data based on an industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples. It should be noted that the corresponding data refers to a training set of labeled and unlabeled data collected in the industrial control platform system. This step comprises:
S11: acquiring a training set from the industrial control platform system and dividing it into a labeled sample set and an unlabeled sample set, wherein the labeled sample set is the data with corresponding quality-variable measurements, while the unlabeled data set contains only auxiliary-variable measurements. It should be noted that the labeled sample set and the unlabeled sample set are respectively

D_l = {(x_i, y_i)}_{i=1}^{n_l}, x_i ∈ R^m, and D_u = {x_j}_{j=1}^{n_u}, x_j ∈ R^m

where n_l and n_u are the numbers of samples in the labeled and unlabeled sample sets, respectively, and m is the number of auxiliary variables.
S12: setting the number k of neighborhood samples;
s13: calculating the number of representative samples to be marked in the label-free sample set according to the k value;
further, the calculation formula representing the number p of samples is as follows:
p=nu/k
in the formula, nuThe number of samples is the number of unlabeled sample set samples; k is the number of neighborhood samples
S2: evaluating the sample and manually marking the sample; the active learning mainly utilizes unlabeled sample information to improve the generalization performance of the model, evaluates the sample and carries out manual marking on the sample:
s21: performing subspace integration on the unlabeled sample set by adopting principal component analysis, and establishing a corresponding GPR (general purpose processor) sub-learner;
It should be noted that GPR is short for Gaussian process regression, a machine learning method based on Gaussian processes that handles high-dimensional, nonlinear and small-sample problems well. Let the training sample set be {X, y}, with input X = {x_i ∈ R^m}, i = 1, 2, ..., n, and output y = {y_i ∈ R}, i = 1, 2, ..., n, where n is the number of samples and m is the number of auxiliary variables. The regression relationship between input and output can be expressed as:

y = f(x) + ε

where f is an unknown function and ε is Gaussian noise with mean 0 and variance δ_n². The regression function is assumed to have a zero-mean Gaussian prior, i.e.

y = [f(x_1), f(x_2), ..., f(x_n)] ~ GP(0, K)

where K is the covariance matrix with entries K_ij = k(x_i, x_j); here a squared exponential function is chosen to construct the covariance function:

k(x_i, x_j) = δ_f² exp(−‖x_i − x_j‖² / (2l²)) + δ_n² δ_ij

where δ_f is the signal standard deviation, l is the scale parameter and δ_n is the noise standard deviation; δ_ij = 1 when i = j and δ_ij = 0 otherwise. The hyperparameters θ = [δ_f, l, δ_n] of the GPR can be obtained by maximum likelihood estimation, i.e. by maximizing the log marginal likelihood

L(θ) = −(1/2) yᵀK⁻¹y − (1/2) log|K| − (n/2) log 2π

For a test sample x_q, the corresponding prediction mean and variance are:

ŷ(x_q) = k_qᵀ K⁻¹ y

σ²(x_q) = k(x_q, x_q) − k_qᵀ K⁻¹ k_q

where k_q = [k(x_q, x_1), k(x_q, x_2), ..., k(x_q, x_n)]ᵀ is the covariance vector between the test sample x_q and the training sample set, and k(x_q, x_q) is the covariance of x_q.
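As an illustration, the prediction equations above can be sketched in a few lines of Python. This is a minimal sketch assuming fixed hyperparameters (in practice θ = [δ_f, l, δ_n] would be fitted by maximizing the likelihood above); se_kernel and gpr_predict are illustrative names, not the patent's notation:

```python
import numpy as np

def se_kernel(A, B, delta_f=1.0, length=1.0):
    # Squared exponential covariance: k(x_i, x_j) = delta_f^2 exp(-||x_i - x_j||^2 / (2 l^2)).
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return delta_f**2 * np.exp(-sq / (2.0 * length**2))

def gpr_predict(X_train, y_train, x_q, delta_f=1.0, length=1.0, delta_n=0.1):
    # Posterior mean and variance at a test point x_q:
    #   mean = k_q^T K^{-1} y,  var = k(x_q, x_q) - k_q^T K^{-1} k_q,
    # where K carries the noise term delta_n^2 on its diagonal.
    K = se_kernel(X_train, X_train, delta_f, length) + delta_n**2 * np.eye(len(X_train))
    k_q = se_kernel(X_train, x_q[None, :], delta_f, length).ravel()
    mean = k_q @ np.linalg.solve(K, y_train)
    var = se_kernel(x_q[None, :], x_q[None, :], delta_f, length)[0, 0] - k_q @ np.linalg.solve(K, k_q)
    return mean, var
```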
Further, subspace ensemble is performed on the unlabeled sample set by principal component analysis (PCA). The process is as follows. Given a data set X ∈ R^{n×m}, where m is the number of auxiliary variables and n is the number of samples, the PCA model is described as:

X = TPᵀ + E

where T ∈ R^{n×d} and P ∈ R^{m×d} are the score matrix and loading matrix of the principal component subspace, respectively, E ∈ R^{n×m} is the residual, and d denotes the number of divided subspaces. The loading matrix may be partitioned as follows:

P = [P_1, P_2, ..., P_d]

The contribution index CI(i, j) of the i-th auxiliary variable in subspace j is defined in terms of P_ij (the formula appears only as an image in the source and is not reproduced), where i = 1, 2, ..., m, j = 1, 2, ..., d, and P_ij denotes the element in row i and column j of the loading matrix. A larger value of CI(i, j) indicates that the variable carries more information in the j-th subspace. The column vectors of the matrix are arranged in descending order of the contribution index, and the variables with the larger indices are determined as auxiliary variables of subspace j.
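A minimal sketch of this variable grouping follows. Because the patent's exact CI(i, j) formula is not reproduced, the squared loading P_ij² is used here as an assumed stand-in for the contribution index, and pca_subspaces is an illustrative name:

```python
import numpy as np

def pca_subspaces(X, d):
    # Center the data and take the loadings of the top-d principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:d].T                          # m x d loading matrix [P_1, ..., P_d]
    CI = P**2                             # assumed contribution index based on P_ij
    # Assign each auxiliary variable to the subspace where its contribution is
    # largest -- one simple reading of the descending-order assignment above.
    return [np.where(CI.argmax(axis=1) == j)[0] for j in range(d)]
```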
S22: calculating the uncertainty of each unlabeled sample from the outputs of all the sub-learners and taking it as the sample evaluation criterion. PCA subspace division of the labeled sample set yields d subsets D_1, D_2, ..., D_d; GPR modeling then gives d sub-models, so that for each unlabeled sample x_u, d prediction outputs

ŷ_u^1, ŷ_u^2, ..., ŷ_u^d

are obtained. It should be noted that the uncertainty of an unlabeled sample is calculated as

δ(x_u) = sqrt( (1/d) Σ_{i=1}^{d} (ŷ_u^i − ȳ_u)² )

where d is the number of subsets obtained by subspace division of the labeled sample set, ŷ_u^i is the prediction output of the i-th subspace, and ȳ_u is the average of the d subspace prediction outputs:

ȳ_u = (1/d) Σ_{i=1}^{d} ŷ_u^i

It should be noted that the unlabeled sample with the largest uncertainty is x_δ.
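The uncertainty score itself reduces to a one-line computation over the stacked sub-learner outputs; a minimal sketch, with the function name and array layout as illustrative assumptions:

```python
import numpy as np

def uncertainty(preds):
    # preds has shape (d, n_u): the d sub-learner outputs for each of the n_u
    # unlabeled samples. A sample's score is the spread (standard deviation)
    # of its d predictions around their mean, as in the formula above.
    return preds.std(axis=0)

# The query candidate x_delta is the unlabeled sample with the largest score:
# idx_delta = int(uncertainty(preds).argmax())
```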
S23: under the neighborhood information criterion, selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample. This step comprises:
S231: using the k nearest neighbors of the unlabeled sample x_δ to construct a sample neighborhood S;
S232: selecting the sample most similar to the neighborhood center for labeling;
S233: manually labeling the selected sample, the labeled sample being a representative sample.
Specifically, in order to describe the overall characteristics of the training set with a limited labeling budget, the k-nearest-neighbor information criterion is further introduced under the probability selection framework. The k nearest neighbors of the selected sample x_δ are used to construct a sample neighborhood S, and the sample most similar to the neighborhood center is preferentially selected for labeling, so that in the early iterations the labeled samples cover the information of the whole training set; such samples are called representative samples.
Note that the neighborhood center x̄_δ of the unlabeled sample x_δ is:

x̄_δ = (1/k) Σ_{i=1}^{k} x_i

where k is the number of neighborhood samples and x_i are the sample points in the neighborhood of x_δ.
The selection strategy for the representative sample can be expressed as:

x_s = argmin_{x_i ∈ S} ‖x_i − x̄_δ‖

where x_i are the sample points in the neighborhood of x_δ and x̄_δ is its neighborhood center. The representative sample selected from the unlabeled sample set is manually labeled and recorded as {x_s, y_s}.
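A minimal sketch of the neighborhood construction and representative-sample selection (representative_sample is an illustrative name; whether the neighborhood S includes x_δ itself — here it does, at distance 0 — is an assumption):

```python
import numpy as np

def representative_sample(X_u, idx_delta, k):
    # Build the neighborhood S of x_delta from its k nearest unlabeled samples
    # (Euclidean distance), compute the neighborhood center, and return the
    # index of the neighbor closest to that center.
    x_delta = X_u[idx_delta]
    dists = np.linalg.norm(X_u - x_delta, axis=1)
    S = np.argsort(dists)[:k]                  # k nearest samples (incl. x_delta)
    center = X_u[S].mean(axis=0)               # neighborhood center
    return S[np.argmin(np.linalg.norm(X_u[S] - center, axis=1))]
```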
S3: updating the GPR model and the training set, and iterating until the required model accuracy is reached. This step comprises:
S31: adding the manually labeled representative sample {x_s, y_s} to the labeled sample set and establishing a new GPR model;
S32: the remaining unlabeled sample set entering a new round of iteration until the model accuracy is reached.
Active learning needs a preset termination condition to limit the number of iterations of the algorithm; the root mean square error (RMSE) is used as the performance index of the GPR model:

RMSE = sqrt( (1/n_t) Σ_{i=1}^{n_t} (y_t,i − ŷ_t,i)² )

where n_t is the number of samples in the test set and y_t and ŷ_t are the true value and the estimated value of the test samples, respectively. As the number of labeled samples in the training set gradually increases, the RMSE of the GPR model decreases continuously; when the RMSE reaches a preset threshold, the active learning process ends.
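In code the stopping metric is a one-liner (rmse is an illustrative helper name):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error over the n_t test samples, as defined above.
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```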
The overall active learning algorithm flow is summarized as follows. Based on the active learning strategy, the information of the unlabeled samples is evaluated through uncertainty, and representative samples are further extracted using the k-nearest-neighbor criterion. The modeling flow of the proposed algorithm is shown in Fig. 7; the detailed steps are as follows (a runnable sketch of the whole loop is given after the list):
(1) Divide the training set into a labeled sample set D_l = {(x_i, y_i)}_{i=1}^{n_l} and an unlabeled sample set D_u = {x_j}_{j=1}^{n_u}. Set the number of neighborhood samples k, and calculate from the value of k the number of representative samples to be labeled in the unlabeled sample set, p = n_u / k.
(2) Calculate the uncertainty δ of each unlabeled sample from the prediction outputs of the sub-learners, and obtain the unlabeled sample x_δ with the largest uncertainty.
(3) Judge whether the number of labeled samples is less than p; if so, go to (4), otherwise go to (5).
(4) Construct the neighborhood of x_δ with its k nearest neighbors and select the representative sample x_s in the neighborhood.
(5) Label the selected sample, update the training set as in Fig. 6, and establish a new GPR model.
(6) If the root mean square error of the soft sensor model meets the accuracy requirement, active learning ends; otherwise go to (2) and enter a new iteration.
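A minimal end-to-end sketch of steps (1)-(6) follows, composing the helper sketches given earlier (gpr_predict, pca_subspaces, uncertainty, representative_sample, rmse). knn_alps, sub_predict and the oracle callback — which stands in for the human expert who labels a queried sample — are illustrative assumptions rather than the patent's notation:

```python
import numpy as np

def sub_predict(groups, X_l, y_l, X):
    # d prediction outputs per sample, stacked as a (d, n) array: one GPR
    # sub-learner per PCA variable group, each seeing only its own variables.
    return np.stack([
        np.array([gpr_predict(X_l[:, g], y_l, x[g])[0] for x in X])
        for g in groups
    ])

def knn_alps(X_l, y_l, X_u, X_test, y_test, k, d, oracle, rmse_threshold):
    p = len(X_u) // k                                      # step (1): p = n_u / k
    labeled = 0
    while len(X_u) > 0:
        groups = [g for g in pca_subspaces(X_l, d) if len(g) > 0]
        delta = uncertainty(sub_predict(groups, X_l, y_l, X_u))
        idx = int(delta.argmax())                          # step (2): x_delta
        if labeled < p:                                    # step (3)
            idx = int(representative_sample(X_u, idx, k))  # step (4): x_s
        X_l = np.vstack([X_l, X_u[idx][None, :]])          # step (5): label and
        y_l = np.append(y_l, oracle(X_u[idx]))             # update the training set
        X_u = np.delete(X_u, idx, axis=0)
        labeled += 1
        y_hat = [gpr_predict(X_l, y_l, x)[0] for x in X_test]
        if rmse(y_test, y_hat) <= rmse_threshold:          # step (6): stop check
            break
    return X_l, y_l
```

Note that p = n_u / k bounds the number of neighborhood-guided selections, matching step (3): once p samples have been labeled, the loop queries the most uncertain sample directly.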
Example 2
In order to verify the effectiveness of the method, simulation data are generated with the function in formula (1), where x_i and y are the input and output variables, respectively, and ε is Gaussian white noise with mean 0 and variance 0.05.

(Formula (1) appears only as an image in the source and is not reproduced here.)

1000 groups of data are generated with formula (1); 500 groups are used as the training set and 500 groups as the test set. In the training set, 10 groups of data are selected as labeled samples and 490 groups as unlabeled samples. The number of neighborhood samples of KNN-ALPS is 29, and in each iteration the training set, the subspace learners and the soft sensor model are updated after labeling 1 sample. The comparison methods are the active learning algorithms with random selection (ALRS) and with probability selection (ALPS). The prediction results of the three methods are shown in Table 1.
TABLE 1. Prediction performance (root mean square error) of the three soft sensor models versus the number of labeled samples. (The table appears only as an image in the source and is not reproduced here.)
As can be seen from Table 1, the unlabeled samples selected randomly by ALRS contribute little to improving model performance and waste labeling cost. ALPS selects unlabeled samples for manual labeling based on probability selection, which improves the model performance to a certain extent. Compared with ALPS, KNN-ALPS jointly considers the uncertainty and the representative information of the unlabeled samples, and the model obtained at the same labeling cost improves considerably more.
Figs. 8 to 10 show the prediction results of the three methods on 200 intercepted test samples after 20 labelings; ALRS and ALPS deviate significantly from the actual values, while KNN-ALPS tracks the actual values better.
Example 3
In order to further verify the effectiveness of the proposed method, the blast furnace ironmaking process is selected as the simulation object. Blast furnace ironmaking continuously produces liquid pig iron in a blast furnace from coke, iron-bearing ore and flux and is a main link of current steel production. Because the temperature inside the blast furnace is high, the silicon content of the molten iron is difficult to measure directly with a sensor; to estimate the silicon content accurately, data-driven soft sensor modeling is an effective solution. The input variables of the blast furnace ironmaking process are described in Table 2.
TABLE 2. Detailed description of the blast furnace ironmaking process input variables. (The table appears only as an image in the source and is not reproduced here.)
1000 groups of data are collected from the process; 500 groups are selected as the training set and 500 groups as the test set. In the training set, only 10 samples are assumed to be labeled and 490 unlabeled, i.e., a sample labeling rate of 2%.
To verify the superiority of the proposed method, three active learning methods are compared: ALRS, ALPS and KNN-ALPS. The number of neighborhood samples of KNN-ALPS is set to 15. During the active learning iterations, the training set, the subspace learners and the soft sensor model are updated after labeling 1 sample each time, so labeling all the unlabeled samples requires 490 iterations.
As can be seen from Fig. 11, the model performance of all three methods improves as the number of labeled samples gradually increases. ALRS selects unlabeled samples randomly, so the RMSE of its GPR model decreases slowly and the model improves noticeably only after many iterations. ALPS selects samples by uncertainty, and its model performance is clearly better than that of ALRS during the iterations; after the k-nearest-neighbor information criterion is introduced, the samples selected by KNN-ALPS are more reasonable and the model performance improves markedly. The model accuracy of KNN-ALPS is consistently higher than that of ALPS during the first 135 iterations, which means that the KNN-ALPS-based active learning algorithm improves the soft sensor model more at the same labeling cost.
The black line in Fig. 11 is the model accuracy threshold; when the GPR model accuracy reaches this threshold, the active learning iteration terminates and the model is output. It can be seen that ALRS reaches satisfactory model accuracy only after labeling 180 samples, while ALPS and KNN-ALPS reach the requirement after labeling 105 and 45 samples, respectively. The KNN-ALPS-based active learning method therefore improves model performance at the lowest labeling cost.
To examine how the three active learning methods improve the GPR model, the prediction results of test samples 0-100 at the 60th iteration are intercepted, as shown in Figs. 12 to 14; the tracking performance of the KNN-ALPS GPR model at the 60th iteration is clearly better than that of ALRS and ALPS.
The numerical example and the simulation of an actual industrial process verify that the proposed method reduces labeling cost while improving model accuracy. The active learning modeling method based on k nearest neighbor and probability selection jointly considers the uncertainty of the unlabeled samples and the representative information of the sample neighborhoods, selects unlabeled samples more reasonably for labeling, and ultimately improves model performance at the lowest labeling cost. Typical numerical simulations and the simulation of an actual industrial process show that the method achieves higher prediction accuracy and faster convergence.
It is important to note that the construction and arrangement of the present application as shown in the various exemplary embodiments is illustrative only. Although only a few embodiments have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible (e.g., variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters (e.g., temperatures, pressures, etc.), mounting arrangements, use of materials, colors, orientations, etc.) without materially departing from the novel teachings and advantages of the subject matter recited in this application. For example, elements shown as integrally formed may be constructed of multiple parts or elements, the position of elements may be reversed or otherwise varied, and the nature or number of discrete elements or positions may be altered or varied. Accordingly, all such modifications are intended to be included within the scope of this invention. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. In the claims, any means-plus-function clause is intended to cover the structures described herein as performing the recited function and not only structural equivalents but also equivalent structures. Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the exemplary embodiments without departing from the scope of the present inventions. Therefore, the present invention is not limited to a particular embodiment, but extends to various modifications that nevertheless fall within the scope of the appended claims.
Moreover, in an effort to provide a concise description of the exemplary embodiments, all features of an actual implementation may not be described (i.e., those unrelated to the presently contemplated best mode of carrying out the invention, or those unrelated to enabling the invention).
It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions may be made. Such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, without undue experimentation.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. An active learning method based on k nearest neighbor and probability selection, characterized by comprising:
acquiring corresponding data from an industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples;
evaluating samples and manually labeling them;
and updating the GPR model and the training set, iterating until the required model accuracy is reached.
2. The active learning method based on k nearest neighbor and probability selection of claim 1, characterized in that the step of acquiring corresponding data from the industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples comprises:
acquiring a training set from the industrial control platform system and dividing it into a labeled sample set and an unlabeled sample set;
setting the number of neighborhood samples k;
and calculating, from the value of k, the number of representative samples to be labeled in the unlabeled sample set.
3. The active learning method based on k nearest neighbor and probability selection of claim 2, characterized in that: the labeled sample set and the unlabeled sample set are respectively

D_l = {(x_i, y_i)}_{i=1}^{n_l}, x_i ∈ R^m, and D_u = {x_j}_{j=1}^{n_u}, x_j ∈ R^m

where n_l and n_u are the numbers of samples in the labeled and unlabeled sample sets, respectively, and m is the number of auxiliary variables.
4. The active learning method based on k nearest neighbor and probability selection of claim 3, characterized in that: the number of representative samples p is calculated as:

p = n_u / k

where n_u is the number of samples in the unlabeled sample set and k is the number of neighborhood samples.
5. The active learning method based on k nearest neighbor and probability selection of any one of claims 1 to 4, characterized in that the step of evaluating the samples and manually labeling them comprises:
performing subspace ensemble on the unlabeled sample set by principal component analysis and establishing the corresponding GPR (Gaussian process regression) sub-learners;
calculating the uncertainty of each unlabeled sample from the outputs of all the sub-learners and taking it as the sample evaluation criterion;
and, under the neighborhood information criterion, selecting representative samples for manual labeling by constructing the k nearest neighbors of the sample.
6. The active learning method based on k nearest neighbor and probability selection of claim 5, characterized in that: the uncertainty of an unlabeled sample is calculated as

δ(x_u) = sqrt( (1/d) Σ_{i=1}^{d} (ŷ_u^i − ȳ_u)² )

where d is the number of subsets obtained by subspace division of the labeled sample set, ŷ_u^i is the prediction output of the i-th subspace, and ȳ_u is the average of the d subspace prediction outputs:

ȳ_u = (1/d) Σ_{i=1}^{d} ŷ_u^i

wherein the unlabeled sample with the largest uncertainty is x_δ.
7. The active learning method based on k nearest neighbor and probability selection of claim 6, characterized in that the step of selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample under the neighborhood information criterion comprises:
using the k nearest neighbors of the unlabeled sample x_δ to construct a sample neighborhood S;
selecting the sample most similar to the neighborhood center for labeling;
manually labeling the selected sample;
wherein the labeled sample is a representative sample.
8. The active learning method based on k nearest neighbor and probability selection of claim 7, characterized in that: the neighborhood center x̄_δ of the unlabeled sample x_δ is:

x̄_δ = (1/k) Σ_{i=1}^{k} x_i

where k is the number of neighborhood samples and x_i are the sample points in the neighborhood of x_δ.
9. The active learning method based on k nearest neighbor and probability selection of claim 7 or 8, characterized in that: the selection strategy for the representative sample can be expressed as

x_s = argmin_{x_i ∈ S} ‖x_i − x̄_δ‖

where x_i are the sample points in the neighborhood of x_δ and x̄_δ is its neighborhood center.
10. The active learning method based on k nearest neighbor and probability selection of claim 9, characterized in that the step of updating the GPR model and the training set and iterating until the model accuracy comprises:
adding the manually labeled representative samples to the labeled sample set and establishing a new GPR model;
the remaining unlabeled sample set entering a new iteration until the model accuracy meets the requirement.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination