CN110674883A - Active learning method based on k nearest neighbor and probability selection - Google Patents

Info

Publication number: CN110674883A (application CN201910936977.7A)
Authority: CN (China)
Prior art keywords: sample, unlabeled, active learning, neighborhood
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN201910936977.7A
Original language: Chinese (zh)
Inventors: 熊伟丽 (Xiong Weili), 代学志 (Dai Xuezhi), 马君霞 (Ma Junxia)
Assignee: Jiangnan University
Filing/priority date: 2019-09-29; application filed by Jiangnan University. Publication date: 2020-01-10 (publication of CN110674883A).

Classifications

    • G06F18/24147 — Pattern recognition; classification based on distances to training or reference patterns; distances to closest patterns, e.g. nearest neighbour classification
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N20/00 — Machine learning

Abstract

The invention discloses an active learning method based on k nearest neighbor and probability selection. The method acquires corresponding data from an industrial control platform system, sets the number of neighbors k, and calculates the number of representative samples; evaluates samples and has them manually labeled; and updates a GPR model and the training set, iterating until the required model accuracy is reached. The step of setting k and calculating the number of representative samples comprises: dividing the training set into a labeled sample set and an unlabeled sample set; setting the number of neighborhood samples k; and calculating, from the value of k, the number of representative samples to be labeled in the unlabeled sample set. By jointly considering the uncertainty and the representative information of the unlabeled sample set, the method selects samples more reasonably and improves the prediction performance of the trained model at minimal labeling cost.

Description

Active learning method based on k nearest neighbor and probability selection
Technical Field
The invention relates to the technical field of industrial processes, in particular to an active learning method based on k nearest neighbor and probability selection.
Background
In complex industrial processes, real-time monitoring and control of certain process variables is important to production, but these variables are difficult to measure online with sensors owing to the limitations of current technology. To estimate such variables accurately in real time, soft sensor technology has developed rapidly. Common soft sensor models include principal component regression, partial least squares regression, support vector machines, artificial neural networks and Gaussian process regression (GPR). GPR is a modeling method developed from Bayesian theory and can handle complex regression problems effectively.
A soft sensor model is constructed from input variables and an output variable. The input variables can be measured accurately by sensors, whereas the output variable is difficult to detect directly because of the harsh environment of the industrial field and the constraints of economic cost. Industrial processes therefore contain a large amount of unlabeled data and only limited labeled data. Traditional soft sensor models are built from the labeled sample set alone and do not exploit the information in the unlabeled sample set; semi-supervised learning improves model performance by using unlabeled and labeled samples together, which addresses this problem well. Traditional semi-supervised methods include self-training, co-training, probabilistic generative models and graph-based semi-supervised learning. Although semi-supervised learning improves model performance to a certain extent, it generally does not incorporate expert knowledge, so the model accuracy may still fall short of the requirements of industrial production. Active learning labels unlabeled samples with expert knowledge, thereby assisting learning from the labeled samples and further improving the model: it raises model performance while acquiring only a small amount of manually labeled data. How to select, from the unlabeled sample set, the samples that improve model performance is therefore the key issue in active learning.
Researchers have studied this problem from different angles. Ge used the output variance of a GPR model as the evaluation index and selected samples with large variance for manual labeling; this evaluates samples effectively for active learning and yields a relatively accurate soft sensor model over the iterations, but the approach applies only to GPR models. Shi et al. proposed an active learning algorithm based on approximate linear dependence, which measures the information of unlabeled samples with approximate linear dependence to some effect, but the method does not use the output information of the samples and easily biases the information evaluation. Zhou et al. proposed an active learning method based on diversity-driven ensemble learning that performs well on classification problems; however, during the iterations of that algorithm some highly similar samples may be selected, causing local overfitting or weak generalization of the soft sensor model.
Disclosure of Invention
This section summarizes some aspects of embodiments of the invention and briefly introduces some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and the title of the application, to avoid obscuring their purpose; such simplifications or omissions are not intended to limit the scope of the invention.
The invention is provided to solve the key problem of active learning, namely how to select, from the unlabeled sample set, the samples that improve model performance.
Therefore, the invention aims to provide an active learning method based on k-nearest neighbor and probability selection.
In order to solve the above technical problem, the invention provides the following technical scheme: an active learning method based on k nearest neighbor and probability selection, comprising: acquiring corresponding data from an industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples; evaluating samples and manually labeling them; and updating the GPR model and the training set, iterating until the required model accuracy is reached.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of acquiring corresponding data from the industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples comprises the following steps:
acquiring a training set from the industrial control platform system and dividing it into a labeled sample set and an unlabeled sample set;
setting the number of neighborhood samples k;
and calculating, from the value of k, the number of representative samples to be labeled in the unlabeled sample set.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the labeled sample set and the unlabeled sample set are respectively

D_l = {(x_i, y_i)}_{i=1}^{n_l}, x_i ∈ R^m, and D_u = {x_j}_{j=1}^{n_u}, x_j ∈ R^m

where n_l and n_u are the numbers of samples in the labeled and unlabeled sample sets, respectively, and m is the number of auxiliary variables.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the number of representative samples p is calculated as:

p = n_u / k

where n_u is the number of samples in the unlabeled sample set and k is the number of neighborhood samples.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of evaluating the samples and manually labeling them comprises:
performing subspace ensemble on the unlabeled sample set by principal component analysis and establishing the corresponding GPR (Gaussian process regression) sub-learners;
calculating the uncertainty of each unlabeled sample from the outputs of all the sub-learners and taking it as the sample evaluation criterion;
and, under the neighborhood information criterion, selecting representative samples for manual labeling by constructing the k nearest neighbors of the sample.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the uncertainty of an unlabeled sample is calculated as

δ(x_u) = sqrt( (1/d) Σ_{i=1}^{d} (ŷ_u^i − ȳ_u)² )

where d is the number of subsets obtained by subspace division of the labeled sample set, ŷ_u^i is the prediction output of the i-th subspace, and ȳ_u is the average of the d subspace prediction outputs:

ȳ_u = (1/d) Σ_{i=1}^{d} ŷ_u^i

The unlabeled sample with the largest uncertainty is x_δ.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample under the neighborhood information criterion comprises:
using the k nearest neighbors of the unlabeled sample x_δ to construct a sample neighborhood S;
selecting the sample most similar to the neighborhood center for labeling;
manually labeling the selected sample;
wherein the labeled sample is a representative sample.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the neighborhood center x̄_δ of the unlabeled sample x_δ is:

x̄_δ = (1/k) Σ_{i=1}^{k} x_i

where k is the number of neighborhood samples and x_i are the sample points in the neighborhood of x_δ.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the selection strategy for the representative sample can be expressed as

x_s = argmin_{x_i ∈ S} ‖x_i − x̄_δ‖

where x_i are the sample points in the neighborhood of x_δ and x̄_δ is its neighborhood center.
As a preferred solution of the active learning method based on k nearest neighbor and probability selection according to the present invention: the step of updating the GPR model and the training set and iterating until the model accuracy comprises:
adding the manually labeled representative samples to the labeled sample set and establishing a new GPR model;
the remaining unlabeled sample set entering a new iteration until the model accuracy meets the requirement.
The invention has the following beneficial effects: by jointly considering the uncertainty and the representative information of the unlabeled sample set, samples are selected more reasonably and the prediction performance of the trained model is improved at minimal labeling cost.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. In the drawings:
Fig. 1 is a schematic diagram of the overall flow of the active learning method based on k nearest neighbor and probability selection according to the present invention.
Fig. 2 is a schematic diagram of the step of setting the number of neighbors k and calculating the number of representative samples.
Fig. 3 is a schematic diagram of the step of evaluating samples and manually labeling them.
Fig. 4 is a schematic diagram of the step of selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample under the neighborhood information criterion.
Fig. 5 is a schematic diagram of the step of updating the GPR model and the training set and iterating until the model accuracy is reached.
Fig. 6 is a schematic diagram of the training-set updating process of the active learning strategy.
Fig. 7 is a flow chart of the proposed active learning algorithm.
Fig. 8 is a schematic diagram of the prediction results of random selection.
Fig. 9 is a schematic diagram of the prediction results of probability selection.
Fig. 10 is a schematic diagram of the prediction results of the proposed active learning algorithm.
Fig. 11 is a schematic diagram of the RMSE of the active learning algorithms at different iteration counts.
Fig. 12 is a schematic diagram of the prediction results of the GPR model with random selection at the 60th iteration.
Fig. 13 is a schematic diagram of the prediction results of the GPR model with probability selection at the 60th iteration.
Fig. 14 is a schematic diagram of the prediction results of the GPR model of the proposed active learning algorithm at the 60th iteration.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, the invention may be practiced in other ways than those specifically described here, and those skilled in the art can make similar generalizations without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Furthermore, the present invention is described in detail with reference to schematic drawings; the schematic drawings are examples only and should not limit the scope of the invention.
Example 1
Referring to Figs. 1 to 7, an overall flow of the active learning method based on k nearest neighbor and probability selection is provided; as shown in Fig. 1, the method includes the following steps:
s1: acquiring corresponding data based on an industrial control platform system, setting neighbor k, and calculating the number of representative samples;
s2: evaluating the sample and manually marking the sample;
s3: and updating the GPR model and the training set, and iterating until the model precision is reached.
According to the method, the uncertainty and the representative information of the unlabeled sample set are considered jointly, so that samples are selected more reasonably and the prediction performance of the trained model is improved at minimal labeling cost. The corresponding data is acquired from enterprise industrial control platforms such as a DCS (distributed control system), PCS (process control system), APC (advanced process control) and MES (manufacturing execution system).
Specifically, the main structure of the invention comprises the following steps:
S1: acquiring corresponding data based on an industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples. It should be noted that the corresponding data refers to a training set of labeled and unlabeled data collected in the industrial control platform system. This step comprises:
S11: acquiring a training set from the industrial control platform system and dividing it into a labeled sample set and an unlabeled sample set, wherein the labeled sample set is the data with corresponding quality-variable measurements, while the unlabeled data set contains only auxiliary-variable measurements. It should be noted that the labeled sample set and the unlabeled sample set are respectively

D_l = {(x_i, y_i)}_{i=1}^{n_l}, x_i ∈ R^m, and D_u = {x_j}_{j=1}^{n_u}, x_j ∈ R^m

where n_l and n_u are the numbers of samples in the labeled and unlabeled sample sets, respectively, and m is the number of auxiliary variables.
S12: setting the number k of neighborhood samples;
s13: calculating the number of representative samples to be marked in the label-free sample set according to the k value;
further, the calculation formula representing the number p of samples is as follows:
p=nu/k
in the formula, nuThe number of samples is the number of unlabeled sample set samples; k is the number of neighborhood samples
S2: evaluating the sample and manually marking the sample; the active learning mainly utilizes unlabeled sample information to improve the generalization performance of the model, evaluates the sample and carries out manual marking on the sample:
s21: performing subspace integration on the unlabeled sample set by adopting principal component analysis, and establishing a corresponding GPR (general purpose processor) sub-learner;
It should be noted that GPR is short for Gaussian process regression, a machine learning method based on Gaussian processes that handles high-dimensional, nonlinear and small-sample problems well. Let the training sample set be {X, y}, with input X = {x_i ∈ R^m}, i = 1, 2, ..., n, and output y = {y_i ∈ R}, i = 1, 2, ..., n, where n is the number of samples and m is the number of auxiliary variables. The regression relationship between input and output can be expressed as:

y = f(x) + ε

where f is an unknown function and ε is Gaussian noise with mean 0 and variance δ_n². The regression function is assumed to have a zero-mean Gaussian prior, i.e.

y = [f(x_1), f(x_2), ..., f(x_n)] ~ GP(0, K)

where K is the covariance matrix with entries K_ij = k(x_i, x_j); here a squared exponential function is chosen to construct the covariance function:

k(x_i, x_j) = δ_f² exp(−‖x_i − x_j‖² / (2l²)) + δ_n² δ_ij

where δ_f is the signal standard deviation, l is the scale parameter and δ_n is the noise standard deviation; δ_ij = 1 when i = j and δ_ij = 0 otherwise. The hyperparameters θ = [δ_f, l, δ_n] of the GPR can be obtained by maximum likelihood estimation, i.e. by maximizing the log marginal likelihood

L(θ) = −(1/2) yᵀK⁻¹y − (1/2) log|K| − (n/2) log 2π

For a test sample x_q, the corresponding prediction mean and variance are:

ŷ(x_q) = k_qᵀ K⁻¹ y

σ²(x_q) = k(x_q, x_q) − k_qᵀ K⁻¹ k_q

where k_q = [k(x_q, x_1), k(x_q, x_2), ..., k(x_q, x_n)]ᵀ is the covariance vector between the test sample x_q and the training sample set, and k(x_q, x_q) is the covariance of x_q.
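As an illustration, the prediction equations above can be sketched in a few lines of Python. This is a minimal sketch assuming fixed hyperparameters (in practice θ = [δ_f, l, δ_n] would be fitted by maximizing the likelihood above); se_kernel and gpr_predict are illustrative names, not the patent's notation:

```python
import numpy as np

def se_kernel(A, B, delta_f=1.0, length=1.0):
    # Squared exponential covariance: k(x_i, x_j) = delta_f^2 exp(-||x_i - x_j||^2 / (2 l^2)).
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return delta_f**2 * np.exp(-sq / (2.0 * length**2))

def gpr_predict(X_train, y_train, x_q, delta_f=1.0, length=1.0, delta_n=0.1):
    # Posterior mean and variance at a test point x_q:
    #   mean = k_q^T K^{-1} y,  var = k(x_q, x_q) - k_q^T K^{-1} k_q,
    # where K carries the noise term delta_n^2 on its diagonal.
    K = se_kernel(X_train, X_train, delta_f, length) + delta_n**2 * np.eye(len(X_train))
    k_q = se_kernel(X_train, x_q[None, :], delta_f, length).ravel()
    mean = k_q @ np.linalg.solve(K, y_train)
    var = se_kernel(x_q[None, :], x_q[None, :], delta_f, length)[0, 0] - k_q @ np.linalg.solve(K, k_q)
    return mean, var
```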
Further, subspace ensemble is performed on the unlabeled sample set by principal component analysis (PCA). The process is as follows. Given a data set X ∈ R^{n×m}, where m is the number of auxiliary variables and n is the number of samples, the PCA model is described as:

X = TPᵀ + E

where T ∈ R^{n×d} and P ∈ R^{m×d} are the score matrix and loading matrix of the principal component subspace, respectively, E ∈ R^{n×m} is the residual, and d denotes the number of divided subspaces. The loading matrix may be partitioned as follows:

P = [P_1, P_2, ..., P_d]

The contribution index CI(i, j) of the i-th auxiliary variable in subspace j is defined in terms of P_ij (the formula appears only as an image in the source and is not reproduced), where i = 1, 2, ..., m, j = 1, 2, ..., d, and P_ij denotes the element in row i and column j of the loading matrix. A larger value of CI(i, j) indicates that the variable carries more information in the j-th subspace. The column vectors of the matrix are arranged in descending order of the contribution index, and the variables with the larger indices are determined as auxiliary variables of subspace j.
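A minimal sketch of this variable grouping follows. Because the patent's exact CI(i, j) formula is not reproduced, the squared loading P_ij² is used here as an assumed stand-in for the contribution index, and pca_subspaces is an illustrative name:

```python
import numpy as np

def pca_subspaces(X, d):
    # Center the data and take the loadings of the top-d principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:d].T                          # m x d loading matrix [P_1, ..., P_d]
    CI = P**2                             # assumed contribution index based on P_ij
    # Assign each auxiliary variable to the subspace where its contribution is
    # largest -- one simple reading of the descending-order assignment above.
    return [np.where(CI.argmax(axis=1) == j)[0] for j in range(d)]
```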
S22: calculating the uncertainty of each unlabeled sample from the outputs of all the sub-learners and taking it as the sample evaluation criterion. PCA subspace division of the labeled sample set yields d subsets D_1, D_2, ..., D_d; GPR modeling then gives d sub-models, so that for each unlabeled sample x_u, d prediction outputs

ŷ_u^1, ŷ_u^2, ..., ŷ_u^d

are obtained. It should be noted that the uncertainty of an unlabeled sample is calculated as

δ(x_u) = sqrt( (1/d) Σ_{i=1}^{d} (ŷ_u^i − ȳ_u)² )

where d is the number of subsets obtained by subspace division of the labeled sample set, ŷ_u^i is the prediction output of the i-th subspace, and ȳ_u is the average of the d subspace prediction outputs:

ȳ_u = (1/d) Σ_{i=1}^{d} ŷ_u^i

It should be noted that the unlabeled sample with the largest uncertainty is x_δ.
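The uncertainty score itself reduces to a one-line computation over the stacked sub-learner outputs; a minimal sketch, with the function name and array layout as illustrative assumptions:

```python
import numpy as np

def uncertainty(preds):
    # preds has shape (d, n_u): the d sub-learner outputs for each of the n_u
    # unlabeled samples. A sample's score is the spread (standard deviation)
    # of its d predictions around their mean, as in the formula above.
    return preds.std(axis=0)

# The query candidate x_delta is the unlabeled sample with the largest score:
# idx_delta = int(uncertainty(preds).argmax())
```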
S23: under the neighborhood information criterion, selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample. This step comprises:
S231: using the k nearest neighbors of the unlabeled sample x_δ to construct a sample neighborhood S;
S232: selecting the sample most similar to the neighborhood center for labeling;
S233: manually labeling the selected sample, the labeled sample being a representative sample.
Specifically, in order to describe the overall characteristics of the training set with a limited labeling budget, the k-nearest-neighbor information criterion is further introduced under the probability selection framework. The k nearest neighbors of the selected sample x_δ are used to construct a sample neighborhood S, and the sample most similar to the neighborhood center is preferentially selected for labeling, so that in the early iterations the labeled samples cover the information of the whole training set; such samples are called representative samples.
Note that the neighborhood center x̄_δ of the unlabeled sample x_δ is:

x̄_δ = (1/k) Σ_{i=1}^{k} x_i

where k is the number of neighborhood samples and x_i are the sample points in the neighborhood of x_δ.
The selection strategy for the representative sample can be expressed as:

x_s = argmin_{x_i ∈ S} ‖x_i − x̄_δ‖

where x_i are the sample points in the neighborhood of x_δ and x̄_δ is its neighborhood center. The representative sample selected from the unlabeled sample set is manually labeled and recorded as {x_s, y_s}.
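A minimal sketch of the neighborhood construction and representative-sample selection (representative_sample is an illustrative name; whether the neighborhood S includes x_δ itself — here it does, at distance 0 — is an assumption):

```python
import numpy as np

def representative_sample(X_u, idx_delta, k):
    # Build the neighborhood S of x_delta from its k nearest unlabeled samples
    # (Euclidean distance), compute the neighborhood center, and return the
    # index of the neighbor closest to that center.
    x_delta = X_u[idx_delta]
    dists = np.linalg.norm(X_u - x_delta, axis=1)
    S = np.argsort(dists)[:k]                  # k nearest samples (incl. x_delta)
    center = X_u[S].mean(axis=0)               # neighborhood center
    return S[np.argmin(np.linalg.norm(X_u[S] - center, axis=1))]
```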
S3: updating the GPR model and the training set, and iterating until the required model accuracy is reached. This step comprises:
S31: adding the manually labeled representative sample {x_s, y_s} to the labeled sample set and establishing a new GPR model;
S32: the remaining unlabeled sample set entering a new round of iteration until the model accuracy is reached.
Active learning needs a preset termination condition to limit the number of iterations of the algorithm; the root mean square error (RMSE) is used as the performance index of the GPR model:

RMSE = sqrt( (1/n_t) Σ_{i=1}^{n_t} (y_t,i − ŷ_t,i)² )

where n_t is the number of samples in the test set and y_t and ŷ_t are the true value and the estimated value of the test samples, respectively. As the number of labeled samples in the training set gradually increases, the RMSE of the GPR model decreases continuously; when the RMSE reaches a preset threshold, the active learning process ends.
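In code the stopping metric is a one-liner (rmse is an illustrative helper name):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error over the n_t test samples, as defined above.
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```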
The overall active learning algorithm flow is summarized as follows. Based on the active learning strategy, the information of the unlabeled samples is evaluated through uncertainty, and representative samples are further extracted using the k-nearest-neighbor criterion. The modeling flow of the proposed algorithm is shown in Fig. 7; the detailed steps are as follows (a runnable sketch of the whole loop is given after the list):
(1) Divide the training set into a labeled sample set D_l = {(x_i, y_i)}_{i=1}^{n_l} and an unlabeled sample set D_u = {x_j}_{j=1}^{n_u}. Set the number of neighborhood samples k, and calculate from the value of k the number of representative samples to be labeled in the unlabeled sample set, p = n_u / k.
(2) Calculate the uncertainty δ of each unlabeled sample from the prediction outputs of the sub-learners, and obtain the unlabeled sample x_δ with the largest uncertainty.
(3) Judge whether the number of labeled samples is less than p; if so, go to (4), otherwise go to (5).
(4) Construct the neighborhood of x_δ with its k nearest neighbors and select the representative sample x_s in the neighborhood.
(5) Label the selected sample, update the training set as in Fig. 6, and establish a new GPR model.
(6) If the root mean square error of the soft sensor model meets the accuracy requirement, active learning ends; otherwise go to (2) and enter a new iteration.
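A minimal end-to-end sketch of steps (1)-(6) follows, composing the helper sketches given earlier (gpr_predict, pca_subspaces, uncertainty, representative_sample, rmse). knn_alps, sub_predict and the oracle callback — which stands in for the human expert who labels a queried sample — are illustrative assumptions rather than the patent's notation:

```python
import numpy as np

def sub_predict(groups, X_l, y_l, X):
    # d prediction outputs per sample, stacked as a (d, n) array: one GPR
    # sub-learner per PCA variable group, each seeing only its own variables.
    return np.stack([
        np.array([gpr_predict(X_l[:, g], y_l, x[g])[0] for x in X])
        for g in groups
    ])

def knn_alps(X_l, y_l, X_u, X_test, y_test, k, d, oracle, rmse_threshold):
    p = len(X_u) // k                                      # step (1): p = n_u / k
    labeled = 0
    while len(X_u) > 0:
        groups = [g for g in pca_subspaces(X_l, d) if len(g) > 0]
        delta = uncertainty(sub_predict(groups, X_l, y_l, X_u))
        idx = int(delta.argmax())                          # step (2): x_delta
        if labeled < p:                                    # step (3)
            idx = int(representative_sample(X_u, idx, k))  # step (4): x_s
        X_l = np.vstack([X_l, X_u[idx][None, :]])          # step (5): label and
        y_l = np.append(y_l, oracle(X_u[idx]))             # update the training set
        X_u = np.delete(X_u, idx, axis=0)
        labeled += 1
        y_hat = [gpr_predict(X_l, y_l, x)[0] for x in X_test]
        if rmse(y_test, y_hat) <= rmse_threshold:          # step (6): stop check
            break
    return X_l, y_l
```

Note that p = n_u / k bounds the number of neighborhood-guided selections, matching step (3): once p samples have been labeled, the loop queries the most uncertain sample directly.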
Example 2
In order to verify the effectiveness of the method, simulation data are generated with the function in formula (1), where x_i and y are the input and output variables, respectively, and ε is Gaussian white noise with mean 0 and variance 0.05.

(Formula (1) appears only as an image in the source and is not reproduced here.)

1000 groups of data are generated with formula (1); 500 groups are used as the training set and 500 groups as the test set. In the training set, 10 groups of data are selected as labeled samples and 490 groups as unlabeled samples. The number of neighborhood samples of KNN-ALPS is 29, and in each iteration the training set, the subspace learners and the soft sensor model are updated after labeling 1 sample. The comparison methods are the active learning algorithms with random selection (ALRS) and with probability selection (ALPS). The prediction results of the three methods are shown in Table 1.
TABLE 1. Prediction performance (root mean square error) of the three soft sensor models versus the number of labeled samples. (The table appears only as an image in the source and is not reproduced here.)
As can be seen from Table 1, the unlabeled samples selected randomly by ALRS contribute little to improving model performance and waste labeling cost. ALPS selects unlabeled samples for manual labeling based on probability selection, which improves the model performance to a certain extent. Compared with ALPS, KNN-ALPS jointly considers the uncertainty and the representative information of the unlabeled samples, and the model obtained at the same labeling cost improves considerably more.
Figs. 8 to 10 show the prediction results of the three methods on 200 intercepted test samples after 20 labelings; ALRS and ALPS deviate significantly from the actual values, while KNN-ALPS tracks the actual values better.
Example 3
In order to further verify the effectiveness of the proposed method, the blast furnace ironmaking process is selected as the simulation object. Blast furnace ironmaking continuously produces liquid pig iron in a blast furnace from coke, iron-bearing ore and flux and is a main link of current steel production. Because the temperature inside the blast furnace is high, the silicon content of the molten iron is difficult to measure directly with a sensor; to estimate the silicon content accurately, data-driven soft sensor modeling is an effective solution. The input variables of the blast furnace ironmaking process are described in Table 2.
TABLE 2. Detailed description of the blast furnace ironmaking process input variables. (The table appears only as an image in the source and is not reproduced here.)
1000 groups of data are collected from the process; 500 groups are selected as the training set and 500 groups as the test set. In the training set, only 10 samples are assumed to be labeled and 490 unlabeled, i.e., a sample labeling rate of 2%.
To verify the superiority of the proposed method, three active learning methods are compared: ALRS, ALPS and KNN-ALPS. The number of neighborhood samples of KNN-ALPS is set to 15. During the active learning iterations, the training set, the subspace learners and the soft sensor model are updated after labeling 1 sample each time, so labeling all the unlabeled samples requires 490 iterations.
As can be seen from Fig. 11, the model performance of all three methods improves as the number of labeled samples gradually increases. ALRS selects unlabeled samples randomly, so the RMSE of its GPR model decreases slowly and the model improves noticeably only after many iterations. ALPS selects samples by uncertainty, and its model performance is clearly better than that of ALRS during the iterations; after the k-nearest-neighbor information criterion is introduced, the samples selected by KNN-ALPS are more reasonable and the model performance improves markedly. The model accuracy of KNN-ALPS is consistently higher than that of ALPS during the first 135 iterations, which means that the KNN-ALPS-based active learning algorithm improves the soft sensor model more at the same labeling cost.
The black line in Fig. 11 is the model accuracy threshold; when the GPR model accuracy reaches this threshold, the active learning iteration terminates and the model is output. It can be seen that ALRS reaches satisfactory model accuracy only after labeling 180 samples, while ALPS and KNN-ALPS reach the requirement after labeling 105 and 45 samples, respectively. The KNN-ALPS-based active learning method therefore improves model performance at the lowest labeling cost.
To examine how the three active learning methods improve the GPR model, the prediction results of test samples 0-100 at the 60th iteration are intercepted, as shown in Figs. 12 to 14; the tracking performance of the KNN-ALPS GPR model at the 60th iteration is clearly better than that of ALRS and ALPS.
The numerical example and the simulation of an actual industrial process verify that the proposed method reduces labeling cost while improving model accuracy. The active learning modeling method based on k nearest neighbor and probability selection jointly considers the uncertainty of the unlabeled samples and the representative information of the sample neighborhoods, selects unlabeled samples more reasonably for labeling, and ultimately improves model performance at the lowest labeling cost. Typical numerical simulations and the simulation of an actual industrial process show that the method achieves higher prediction accuracy and faster convergence.
It is important to note that the construction and arrangement of the present application as shown in the various exemplary embodiments is illustrative only. Although only a few embodiments have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible (e.g., variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters (e.g., temperatures, pressures, etc.), mounting arrangements, use of materials, colors, orientations, etc.) without materially departing from the novel teachings and advantages of the subject matter recited in this application. For example, elements shown as integrally formed may be constructed of multiple parts or elements, the position of elements may be reversed or otherwise varied, and the nature or number of discrete elements or positions may be altered or varied. Accordingly, all such modifications are intended to be included within the scope of this invention. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. In the claims, any means-plus-function clause is intended to cover the structures described herein as performing the recited function and not only structural equivalents but also equivalent structures. Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the exemplary embodiments without departing from the scope of the present inventions. Therefore, the present invention is not limited to a particular embodiment, but extends to various modifications that nevertheless fall within the scope of the appended claims.
Moreover, in an effort to provide a concise description of the exemplary embodiments, all features of an actual implementation may not be described (i.e., those unrelated to the presently contemplated best mode of carrying out the invention, or those unrelated to enabling the invention).
It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions may be made. Such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, without undue experimentation.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. An active learning method based on k nearest neighbor and probability selection, characterized by comprising:
acquiring corresponding data from an industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples;
evaluating samples and manually labeling them;
and updating the GPR model and the training set, iterating until the required model accuracy is reached.
2. The active learning method based on k nearest neighbor and probability selection of claim 1, characterized in that the step of acquiring corresponding data from the industrial control platform system, setting the number of neighbors k, and calculating the number of representative samples comprises:
acquiring a training set from the industrial control platform system and dividing it into a labeled sample set and an unlabeled sample set;
setting the number of neighborhood samples k;
and calculating, from the value of k, the number of representative samples to be labeled in the unlabeled sample set.
3. The active learning method based on k nearest neighbor and probability selection of claim 2, characterized in that: the labeled sample set and the unlabeled sample set are respectively

D_l = {(x_i, y_i)}_{i=1}^{n_l}, x_i ∈ R^m, and D_u = {x_j}_{j=1}^{n_u}, x_j ∈ R^m

where n_l and n_u are the numbers of samples in the labeled and unlabeled sample sets, respectively, and m is the number of auxiliary variables.
4. The active learning method based on k nearest neighbor and probability selection of claim 3, characterized in that: the number of representative samples p is calculated as:

p = n_u / k

where n_u is the number of samples in the unlabeled sample set and k is the number of neighborhood samples.
5. The active learning method based on k nearest neighbor and probability selection of any one of claims 1 to 4, characterized in that the step of evaluating the samples and manually labeling them comprises:
performing subspace ensemble on the unlabeled sample set by principal component analysis and establishing the corresponding GPR (Gaussian process regression) sub-learners;
calculating the uncertainty of each unlabeled sample from the outputs of all the sub-learners and taking it as the sample evaluation criterion;
and, under the neighborhood information criterion, selecting representative samples for manual labeling by constructing the k nearest neighbors of the sample.
6. The active learning method based on k nearest neighbor and probability selection of claim 5, characterized in that: the uncertainty of an unlabeled sample is calculated as

δ(x_u) = sqrt( (1/d) Σ_{i=1}^{d} (ŷ_u^i − ȳ_u)² )

where d is the number of subsets obtained by subspace division of the labeled sample set, ŷ_u^i is the prediction output of the i-th subspace, and ȳ_u is the average of the d subspace prediction outputs:

ȳ_u = (1/d) Σ_{i=1}^{d} ŷ_u^i

wherein the unlabeled sample with the largest uncertainty is x_δ.
7. The active learning method based on k nearest neighbor and probability selection of claim 6, characterized in that the step of selecting a representative sample for manual labeling by constructing the k nearest neighbors of the sample under the neighborhood information criterion comprises:
using the k nearest neighbors of the unlabeled sample x_δ to construct a sample neighborhood S;
selecting the sample most similar to the neighborhood center for labeling;
manually labeling the selected sample;
wherein the labeled sample is a representative sample.
8. The active learning method based on k nearest neighbor and probability selection of claim 7, characterized in that: the neighborhood center x̄_δ of the unlabeled sample x_δ is:

x̄_δ = (1/k) Σ_{i=1}^{k} x_i

where k is the number of neighborhood samples and x_i are the sample points in the neighborhood of x_δ.
9. The active learning method based on k nearest neighbor and probability selection of claim 7 or 8, characterized in that: the selection strategy for the representative sample can be expressed as

x_s = argmin_{x_i ∈ S} ‖x_i − x̄_δ‖

where x_i are the sample points in the neighborhood of x_δ and x̄_δ is its neighborhood center.
10. The active learning method based on k nearest neighbor and probability selection of claim 9, characterized in that the step of updating the GPR model and the training set and iterating until the model accuracy comprises:
adding the manually labeled representative samples to the labeled sample set and establishing a new GPR model;
the remaining unlabeled sample set entering a new iteration until the model accuracy meets the requirement.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination