WO2020140597A1 - Online active learning method applicable to unlabeled unbalanced data stream

Info

Publication number: WO2020140597A1
Application number: PCT/CN2019/114167
Authority: WO
Inventors: 吴庆耀; 张一帆; 谭明奎
Original assignee: 华南理工大学
Priority date: 2018-12-31
Filing date: 2019-10-29
Publication date: 2020-07-09
Also published as: CN109800799A

Abstract

The present invention provides an online active learning method applicable to unlabeled unbalanced data streams. The method comprises: feeding the samples of an unlabeled data stream to a linear classifier in temporal order for prediction, where the class distribution of the stream is highly unbalanced, that is, positive samples are scarce; dynamically determining, by the linear classifier and according to a proposed asymmetric access strategy for unbalanced data, which samples require labeling; and updating the linear classifier, according to a proposed asymmetric update strategy, with the labeled samples it mispredicted, while using second-order information of the samples to increase learning efficiency. The method uses second-order information of the samples to propose a new asymmetric strategy that covers both the labeling of samples and the updating of the model. It thereby better addresses class imbalance and improves the classification performance of active learning models on streaming data.

Description

An online active learning method applicable to unlabeled unbalanced data streams

[0001] The present invention relates to the technical field of online learning and semi-supervised learning, and in particular to an online active learning method applicable to unlabeled unbalanced data streams.

Background technique

[0002] In recent years, artificial intelligence and related industries have developed rapidly and have become a focus of attention in academia, industry, and governments around the world. The State Council recently released the "New Generation Artificial Intelligence Development Plan", highlighting the national strategic position of artificial intelligence research and industry. In the Internet industry, online learning technology has developed rapidly and has made great progress in many application fields.

[0003] However, online learning technology still faces many challenges. First, the original stream data is unlabeled, and the cost of labeling data is often very high. How to select the most discriminative data for labeling under a limited labeling budget, and how to train a well-performing learner from it, is an important issue for online learning and its industrial application. Second, in many practical tasks the classes are unbalanced, that is, positive samples are far fewer than negative samples. How to handle this class imbalance is also a key problem that urgently needs to be solved in industrial applications.

Summary of the invention

Technical solution

[0004] In view of this, to solve the above problems in the prior art, the present invention provides an online active learning method applicable to unlabeled unbalanced data streams. The method proposes an asymmetric access strategy for unbalanced data that dynamically determines which samples require labeling. To update the model effectively, the method further proposes an asymmetric update strategy that uses the second-order information of the samples to update the model efficiently. The method is therefore well suited to the problems of scarce labeled data, sample imbalance, and streaming data that arise in practical classification applications.

[0005] To achieve the above objective, the technical solution of the present invention is as follows.

[0006] An online active learning method applicable to unlabeled unbalanced data streams includes the following steps:

[0007] Step 1. The unlabeled data stream is input into the linear classifier in temporal order for prediction, where the class distribution of the data stream is highly unbalanced and the positive class is taken to be the sparse (minority) class;

[0008] Step 2. According to the proposed asymmetric access strategy, the linear classifier determines, in temporal order, which samples of the unlabeled unbalanced data need to be labeled;

[0009] Step 3. According to the proposed asymmetric update strategy, the linear classifier is updated with the labeled data it mispredicted, and the second-order information of the samples is used to improve the learning efficiency.

[0010] Further, in step 1, the unlabeled data stream may be expressed as $\{x_t \in \mathbb{R}^d \mid t = 1, \ldots, T\}$, where $d$ represents the number of features of a sample and $T$ represents the total number of unlabeled samples. The budget of samples that can be labeled is $s$, and the category of the label is $y_t \in \{-1, +1\}$, where the positive samples ($y_t = +1$) are far fewer than the negative samples ($y_t = -1$). The linear classifier is used as follows:

[0011] Step 11. The linear classifier is represented as $w \in \mathbb{R}^d$, which follows the multivariate Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ represents the mean of the linear classifier $w$ and $\Sigma$ represents the variance of the linear classifier $w$;

[0012] Step 12. The classification prediction of the linear classifier is expressed as $\hat{y}_t = \operatorname{sign}(\mu^\top x_t)$;

[0013] Step 13. The prediction result of the linear classifier is expressed as: if $\hat{y}_t = y_t$, the linear classifier classifies correctly; otherwise the linear classifier classifies incorrectly.
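Steps 11 to 13 can be illustrated with a minimal Python sketch. It assumes that the prediction is the sign of the inner product between the mean of the classifier and the sample, that the mean is stored as a plain list of floats, and that a margin of exactly zero counts as a positive prediction. The function names `margin` and `predict` and the example values are illustrative and do not come from the patent text.

```python
def margin(mean, x):
    """Prediction margin: inner product of the classifier mean and the sample."""
    return sum(m * xi for m, xi in zip(mean, x))


def predict(mean, x):
    """Predicted label: +1 for a non-negative margin, otherwise -1."""
    return 1 if margin(mean, x) >= 0 else -1


mean = [0.5, -1.0, 0.25]   # illustrative mean vector (d = 3)
x_t = [1.0, 0.2, 2.0]      # illustrative sample
y_t = 1                    # true label, known only once it has been requested
correct = predict(mean, x_t) == y_t
```

The comparison in the last line corresponds to step 13 and is only possible for samples whose label has actually been obtained.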

[0014] Further, the steps of the asymmetric access strategy in step 2 are as follows:

[0015] Step 21. Based on the second-order information $\Sigma$ of the sample (that is, the variance of the linear classifier), calculate the confidence of the linear classifier on the current sample;

[0016] Step 22. Based on the confidence, calculate the asymmetric access parameter of the current sample;

[0017] Step 23. Based on the asymmetric access parameter, perform Bernoulli sampling to obtain the sampled value;

[0018] Step 24. If the sampled value is 1, it is determined that the label of the sample needs to be accessed; otherwise, the label is not accessed.
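The second-order quantity that step 21 relies on can be sketched as follows. The sketch assumes that the variance of the classifier is stored as a list of rows and that the relevant quantity is the quadratic form of the sample with that matrix, which is the variance of the classifier output on the sample when the classifier follows a Gaussian distribution. The helper name `sample_variance` and the example values are illustrative.

```python
def sample_variance(cov, x):
    """Quadratic form x^T * cov * x, the variance of w^T x when w ~ N(mu, cov)."""
    cov_x = [sum(c * xi for c, xi in zip(row, x)) for row in cov]
    return sum(xi * v for xi, v in zip(x, cov_x))


cov = [[1.0, 0.0], [0.0, 0.25]]   # illustrative variance matrix
v_t = sample_variance(cov, [2.0, 2.0])   # 2*2*1.0 + 2*2*0.25 = 5.0
```

A larger value indicates a direction in which the classifier is still uncertain, which is why this quantity is a natural input to the confidence calculation.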

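[0019] Further, the steps of the asymmetric update strategy in step 3 are as follows:

[0020] Step 31: Obtain the mispredicted labeled data;

[0021] Step 32: Based on the mispredicted labeled data, calculate the asymmetric loss function value of the data;

[0022] Step 33: Based on the asymmetric loss function value and the optimization strategy, update the variance $\Sigma$ of the linear classifier by the following formula:

[0023] [formula not reproduced in this text]

[0024] where $\gamma$ represents the regularization coefficient;

Step 34: Based on the asymmetric loss function value and the optimization strategy, update the mean $\mu$ of the linear classifier by the following formula:

[formula not reproduced in this text]

[0027] where $\eta$ represents the learning rate of the linear classifier, and the gradient of the asymmetric loss function value is obtained by differentiating the loss function.

Further, the confidence $c_t$ is calculated by the following formula:

[formula not reproduced in this text]

[0030] where $\eta$ represents the learning rate of the linear classifier, $\gamma$ represents the regularization coefficient, $\rho_{\max} = \max(1, \rho)$, and $\rho$ represents the misclassification cost of positive samples. In addition, $v_t = x_t^\top \Sigma x_t$ represents the confidence of the model in the current sample, that is, the familiarity of the model with the current sample, and is used to calculate the confidence $c_t$.

[0031] Based on the confidence $c_t$, the asymmetric access parameter of the current sample is calculated by the following formula:

[formula not reproduced in this text]

[0033] where $p_t = \mu^\top x_t$ represents the prediction margin of the linear classifier for the current sample, and $|p_t|$, the absolute value of the prediction margin, represents the distance of the sample from the classification plane of the model;

[0034] Based on the asymmetric access parameter, Bernoulli sampling is performed to obtain the sampled value. Different sampling coefficients are set for different types of samples, and the sampling probability is expressed by the following formula:

[formula not reproduced in this text]

where one sampling coefficient applies to positive predictions (that is, $p_t \geq 0$) and the other applies to negative predictions (that is, $p_t < 0$); Bernoulli sampling is performed with the sampling probability to obtain the sampled value $z_t$.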

[0037] Further, the value of the asymmetric loss function is calculated by the following formula:

[0038] [formula not reproduced in this text]

[0039] where $\rho$ represents the weight of misclassification of positive samples, and $\mathbb{I}(\cdot)$ represents the indicator function, which equals 1 if its condition is satisfied and 0 otherwise.
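The loss formula itself is not legible in this text, so the following Python sketch only illustrates one cost-sensitive loss that matches the description: a hinge loss on the prediction margin whose value is multiplied by the weight $\rho$ when the true label is positive and by 1 when it is negative. The hinge form, the function name, and the parameter names are assumptions made for illustration, not content recovered from the patent.

```python
def asymmetric_hinge_loss(margin, y, rho):
    """Hinge loss weighted by rho for positive samples and by 1 for negative ones."""
    weight = rho if y == 1 else 1.0      # rho * I(y = +1) + I(y = -1)
    return weight * max(0.0, 1.0 - y * margin)


# With rho = 10, a mispredicted positive sample costs ten times as much
# as a mispredicted negative sample with the same margin magnitude.
loss_pos = asymmetric_hinge_loss(margin=-0.5, y=1, rho=10.0)   # 10 * 1.5
loss_neg = asymmetric_hinge_loss(margin=0.5, y=-1, rho=10.0)   # 1 * 1.5
```

Choosing a weight greater than 1 makes errors on the scarce positive class more expensive, which is the asymmetry the description refers to.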

[0040] Based on the asymmetric loss function value and the optimization strategy, the variance $\Sigma$ and the mean $\mu$ of the linear classifier are updated through the formulas in steps 33 and 34.

Beneficial effects of invention

[0041] Compared with the prior art, the online active learning method applicable to unlabeled unbalanced data streams of the present invention has the following advantages and technical effects:

[0042] The present invention uses the second-order information of the samples to propose a new asymmetric strategy. The asymmetric strategy considers both the labeling of samples and the updating of the model, which better addresses sample imbalance and improves the classification performance of active learning models based on streaming data.

Brief description of the drawings

[0043] FIG. 1 is a schematic flowchart of an online active learning method applicable to unlabeled unbalanced data streams in an embodiment.

[0044] FIG. 2 is a schematic flowchart of an asymmetric access strategy in an embodiment.

[0045] FIG. 3 is a schematic flowchart of an asymmetric update strategy in an embodiment.

[0046] FIG. 4 is a verification result of the online active learning method in the embodiment.

Embodiments of the invention

[0047] The specific implementation of the present invention will be further described below with reference to the drawings and specific embodiments. It should be noted that the described embodiments are only a part of the embodiments of the present invention, but not all the embodiments.

[0048] FIG. 1 is a schematic flowchart of the online active learning method applicable to unlabeled unbalanced data streams according to this embodiment, which includes the following steps:

[0049] Step 1. The unlabeled data stream is input into the linear classifier in temporal order for prediction, where the class distribution of the data stream is highly unbalanced and the positive class is taken to be the sparse (minority) class;

[0050] Step 2. According to the proposed asymmetric access strategy, the linear classifier determines, in temporal order, which samples of the unlabeled unbalanced data need to be labeled;

[0051] Step 3. According to the proposed asymmetric update strategy, the linear classifier is updated with the labeled data it mispredicted, and the second-order information of the samples is used to improve the learning efficiency.

[0052] In step 1, the unlabeled data stream may be expressed as $\{x_t \in \mathbb{R}^d \mid t = 1, \ldots, T\}$, where $d$ represents the number of features of a sample and $T$ represents the total number of unlabeled samples. The budget of samples that can be labeled is $s$, and the category of the label is $y_t \in \{-1, +1\}$, where the positive samples ($y_t = +1$) are far fewer than the negative samples ($y_t = -1$). The linear classifier is used as follows:

[0053] Step 11. The linear classifier is expressed as $w \in \mathbb{R}^d$, which follows the multivariate Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ represents the mean of the linear classifier $w$ and $\Sigma$ represents the variance of the linear classifier $w$;

[0054] Step 12. The classification prediction of the linear classifier is expressed as $\hat{y}_t = \operatorname{sign}(\mu^\top x_t)$;

[0055] Step 13. The prediction result of the linear classifier is expressed as: if $\hat{y}_t = y_t$, the linear classifier classifies correctly; otherwise the linear classifier classifies incorrectly.

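[0056] FIG. 2 is a schematic flowchart of the asymmetric access strategy of the present invention. The steps of the asymmetric access strategy in step 2 are as follows:

[0057] Step 21. Based on the second-order information $\Sigma$ of the sample (that is, the variance of the linear classifier), the confidence $c_t$ of the linear classifier on the current sample is calculated by the following formula:

[formula not reproduced in this text]

[0058] where $\eta$ represents the learning rate of the linear classifier, $\gamma$ represents the regularization coefficient, $\rho_{\max} = \max(1, \rho)$, and $\rho$ represents the misclassification cost of positive samples. In addition, $v_t = x_t^\top \Sigma x_t$ represents the confidence of the model in the current sample, that is, the familiarity of the model with the current sample, and is used to better calculate the confidence $c_t$;

[0059] Step 22. Based on the confidence $c_t$, the asymmetric access parameter of the current sample is calculated by the following formula:

[formula not reproduced in this text]

[0060] where $p_t = \mu^\top x_t$ represents the prediction margin of the linear classifier for the current sample, and $|p_t|$, the absolute value of the prediction margin, represents the distance of the sample from the classification plane of the model;

[0061] Step 23. Based on the asymmetric access parameter, Bernoulli sampling is performed to obtain the sampled value $z_t$. Different sampling coefficients are set for different types of samples, and the sampling probability is expressed by the following formula:

[formula not reproduced in this text]

[0062] where one sampling coefficient applies to positive predictions (that is, $p_t \geq 0$) and the other applies to negative predictions (that is, $p_t < 0$);

[0063] Step 24: If the sampled value $z_t$ is 1, it is determined that the label of the sample needs to be accessed, and the budget is used to obtain its label; otherwise, if $z_t$ is 0, it is determined that its label does not need to be accessed.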
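Steps 21 to 24 can be sketched in Python as follows. Because the formulas are not legible in this text, the sketch assumes a specific form for each of them: a confidence term that grows in magnitude with the sample variance, an access parameter equal to the absolute margin plus that confidence term, and a sampling probability of the form coefficient / (coefficient + access parameter) with separate coefficients for positive and negative predictions. These forms, and the names `query_probability`, `should_query`, `delta_pos`, and `delta_neg`, are assumptions for illustration and are not recovered from the patent.

```python
import random


def query_probability(p, v, eta, gamma, rho, delta_pos, delta_neg):
    """Probability of requesting a label under the assumed formulas (needs v > 0)."""
    rho_max = max(1.0, rho)
    # Assumed confidence term: more negative when the sample variance v is large.
    c = -0.5 * eta * rho_max / (1.0 / v + 1.0 / gamma)
    q = abs(p) + c                       # assumed asymmetric access parameter
    if q <= 0:
        return 1.0                       # low-confidence sample: always query
    delta = delta_pos if p >= 0 else delta_neg
    return delta / (delta + q)


def should_query(p, v, rng=random, **params):
    """Steps 23 and 24: draw the sampled value (1 = request the label)."""
    return 1 if rng.random() < query_probability(p, v, **params) else 0


params = dict(eta=1.0, gamma=1.0, rho=4.0, delta_pos=1.0, delta_neg=1.0)
z_t = should_query(p=0.5, v=1.0, **params)   # access parameter <= 0, so z_t is 1
```

Under these assumptions, setting `delta_pos` larger than `delta_neg` spends more of the labeling budget on samples predicted to be positive, which is one way to favor the scarce class.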

[0064] FIG. 3 is a schematic flowchart of the asymmetric update strategy of the present invention. The steps of the asymmetric update strategy in step 3 are as follows:

[0065] Step 31: Obtain the mispredicted labeled data;

[0066] Step 32: Based on the mispredicted labeled data, calculate the asymmetric loss value by the following formula:

[formula not reproduced in this text]

where $\rho$ represents the weight of misclassification of positive samples, and $\mathbb{I}(\cdot)$ represents the indicator function, which equals 1 if its condition is satisfied and 0 otherwise. Through this cost-sensitive loss function, the linear classifier can be updated asymmetrically;

[0068] Step 33: Based on the asymmetric loss function value and the optimization strategy, the variance $\Sigma$ of the linear classifier is updated by the following formula:

[formula not reproduced in this text]

[0069] where $\gamma$ represents the regularization coefficient;

[0070] Step 34: Based on the asymmetric loss function value and the optimization strategy, the mean $\mu$ of the linear classifier is updated by the following formula:

[formula not reproduced in this text]

[0071] where $\eta$ represents the learning rate of the linear classifier, and the gradient of the asymmetric loss function value can be obtained by differentiating the loss function.
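Steps 33 and 34 can be sketched in Python as follows. The update formulas are not legible in this text, so the sketch assumes a common second-order form: the variance is reduced by a rank-one term scaled by the regularization coefficient plus the sample variance, and the mean takes a gradient step preconditioned by the updated variance. It also assumes a weighted hinge loss and a symmetric variance matrix. The function name `second_order_update` and all parameter names are illustrative, not taken from the patent.

```python
def second_order_update(mean, cov, x, y, eta, gamma, rho):
    """One assumed second-order update on a labeled sample (x, y)."""
    d = len(x)
    margin = sum(m * xi for m, xi in zip(mean, x))
    if 1.0 - y * margin <= 0:
        return mean, cov                 # zero loss: nothing to update
    cov_x = [sum(cov[i][j] * x[j] for j in range(d)) for i in range(d)]
    v = sum(x[i] * cov_x[i] for i in range(d))
    # Assumed variance update: cov - (cov x)(cov x)^T / (gamma + x^T cov x).
    new_cov = [[cov[i][j] - cov_x[i] * cov_x[j] / (gamma + v) for j in range(d)]
               for i in range(d)]
    weight = rho if y == 1 else 1.0
    grad = [-weight * y * xi for xi in x]    # gradient of the assumed hinge loss
    step = [sum(new_cov[i][j] * grad[j] for j in range(d)) for i in range(d)]
    new_mean = [mean[i] - eta * step[i] for i in range(d)]
    return new_mean, new_cov


mean, cov = second_order_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                                x=[1.0, 0.0], y=1, eta=1.0, gamma=1.0, rho=2.0)
```

In this example the variance shrinks only along the direction of the observed sample, so later samples along that direction move the mean less than samples along unseen directions.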

[0072] FIG. 4 shows the performance of the online active learning method applicable to unlabeled unbalanced data streams on the network security data set w8a. In FIG. 4 the method appears under the names OA3 and OA3_diag, where OA3_diag is a simple variant of the method that is not described in detail here. The other comparison methods, such as PAA, OAAL, CSOAL, and SOAL, are classic solutions to this problem and serve as an experimental reference for the proposed method.

[0073] The w8a data set is a classic open source data set used to determine whether a web page is abnormal. The data set has 64,700 samples and 300 features. Normal web pages far outnumber abnormal web pages, that is, the data is unbalanced, with an imbalance ratio of 1:32.5. In this example, the abnormal web pages are the positive samples (minority), and the normal web pages are the negative samples (majority).
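As a quick check of these figures, the class counts implied by the stated ratio can be estimated. The sketch assumes that the ratio 1:32.5 means one positive sample for every 32.5 negative samples; the resulting counts are estimates derived from that assumption, not values reported in the text.

```python
total = 64_700
negatives_per_positive = 32.5

positives = total / (1 + negatives_per_positive)   # about 1,931 abnormal pages
negatives = total - positives                      # about 62,769 normal pages
positive_share = positives / total                 # roughly 3% of the stream
```

With roughly 3% positive samples, a classifier that always predicts the negative class would already be right about 97% of the time, which is why accuracy alone is a weak measure on such a stream.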

[0074] During the experiment, all training samples arrive in temporal order and have no labels. For the web page that arrives at each time step, the proposed active learning method decides according to step 2 whether it needs to be labeled. If so, the label is obtained at a certain monetary labeling cost, and the model is updated according to step 3.
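The protocol in this paragraph can be sketched as a budgeted loop in Python. The sketch assumes that the budget counts label requests and that an update is applied only when a requested label disagrees with the prediction. The function `run_stream` and the stand-ins `oracle`, `always_query`, and `perceptron_step` are hypothetical and chosen only to make the example runnable; they are not the decision and update rules of the patent.

```python
def run_stream(stream, oracle, budget, should_query, update, mean):
    """Process samples in arrival order; request labels only while budget remains."""
    used = 0
    for x in stream:
        p = sum(m * xi for m, xi in zip(mean, x))   # prediction margin
        if used >= budget or not should_query(x, p):
            continue                                # sample stays unlabeled
        y = oracle(x)                               # paid labeling request
        used += 1
        y_hat = 1 if p >= 0 else -1
        if y_hat != y:                              # only mispredictions trigger an update
            mean = update(mean, x, y)
    return mean, used


def oracle(x):
    """Hypothetical annotator: positive when the first feature is positive."""
    return 1 if x[0] > 0 else -1


def always_query(x, p):
    """Hypothetical stand-in for the access decision of step 2."""
    return True


def perceptron_step(mean, x, y):
    """Hypothetical first-order stand-in for the update of step 3."""
    return [m + y * xi for m, xi in zip(mean, x)]


stream = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
final_mean, labels_used = run_stream(stream, oracle, 2, always_query,
                                     perceptron_step, [0.0, 0.0])
```

Passing the access decision and the update as functions keeps the loop independent of the particular formulas, so any query rule and update rule with these signatures can be plugged in.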

[0075] The detailed experimental results are shown in FIG. 4. The proposed online active learning method for unlabeled unbalanced data streams achieves the best performance among the compared methods.

[0076] The online active learning method applicable to unlabeled unbalanced data streams in this embodiment proposes an asymmetric access strategy for unbalanced data that dynamically determines which samples need to be labeled. To update the model effectively, the method further proposes an asymmetric update strategy that uses the second-order information of the samples to update the model efficiently. The method is therefore well suited to the problems of scarce labeled data, unbalanced samples, and streaming data that arise in practical classification applications.

Claims

1. An online active learning method for unlabeled unbalanced data streams, which is characterized by the following steps:

Step 1. Obtain an unlabeled data stream and input it into a linear classifier for prediction in a time series, where the category of the data stream has a high imbalance problem, and set the positive sample to be a sparse sample;

Step 2. According to the proposed asymmetric access strategy, the linear classifier determines the samples that need to be labeled in time series for unlabeled unbalanced data;

Step 3. According to the proposed asymmetric update strategy, the linear classifier updates the linear classifier with mispredicted labeled data, and uses the second-order information of the samples to improve learning efficiency.

2. An online active learning method applicable to unlabeled unbalanced data streams according to claim 1, wherein in step 1, the unlabeled data stream is represented as $\{x_t \in \mathbb{R}^d \mid t = 1, \ldots, T\}$, where $d$ represents the number of features of a sample and $T$ represents the total number of unlabeled samples; the budget of samples that can be labeled is $s$, and the category of the label is $y_t \in \{-1, +1\}$, where the positive-class samples ($y_t = +1$) are far fewer than the negative-class samples ($y_t = -1$); the linear classifier is used as follows:

Step 11. The linear classifier is represented as $w \in \mathbb{R}^d$, which follows the multivariate Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ represents the mean of the linear classifier $w$, and $\Sigma$ represents the variance of the linear classifier $w$;

Step 13. The prediction result of the linear classifier is expressed as: if $\hat{y}_t = y_t$, the linear classifier classifies correctly; otherwise the linear classifier classifies incorrectly.

3. An online active learning method applicable to unlabeled unbalanced data streams according to claim 1, wherein the steps of the asymmetric access strategy in step 2 are as follows:

Step 21. Based on the second-order information $\Sigma$ of the sample, that is, the variance of the linear classifier, calculate the confidence of the linear classifier on the current sample;

Step 22. Based on the confidence, calculate the asymmetric access parameter of the current sample;

Step 23. Based on the asymmetric access parameter, perform Bernoulli sampling to obtain the sampled value;

Step 24. If the sampled value is 1, it is determined that the label of the sample needs to be accessed; otherwise, the label is not accessed.

4. An online active learning method applicable to unlabeled unbalanced data streams according to claim 1, characterized in that the steps of the asymmetric update strategy in step 3 are as follows:

Step 31: Obtain the mispredicted labeled data;

Step 32: Based on the mispredicted labeled data, calculate the asymmetric loss function value of the data;

Step 33: Based on the asymmetric loss function value and the optimization strategy, update the variance $\Sigma$ of the linear classifier;

Step 34: Based on the asymmetric loss function value and the optimization strategy, update the mean $\mu$ of the linear classifier.

5. An online active learning method applicable to unlabeled unbalanced data streams according to claim 3, characterized in that the confidence is calculated by the following formula:

[formula not reproduced in this text]

where $\eta$ represents the learning rate of the linear classifier, $\gamma$ represents the regularization coefficient, $\rho_{\max} = \max(1, \rho)$, and $\rho$ represents the misclassification cost of positive samples; in addition, $v_t = x_t^\top \Sigma x_t$ represents the confidence of the model in the current sample, that is, the familiarity of the model with the current sample, and is used to better calculate the confidence $c_t$;

based on the confidence $c_t$, the asymmetric access parameter of the current sample is calculated by the following formula:

[formula not reproduced in this text]

where $p_t = \mu^\top x_t$ represents the prediction margin of the linear classifier on the current sample, and $|p_t|$, the absolute value of the prediction margin, represents the distance of the sample from the classification plane of the model;

based on the asymmetric access parameter, Bernoulli sampling is performed to obtain the sampled value; different sampling coefficients are set for different types of samples, and the sampling probability is expressed by the following formula:

[formula not reproduced in this text]

the sampling probability is used to perform Bernoulli sampling to obtain the sampled value $z_t$.

6. An online active learning method applicable to unlabeled unbalanced data streams according to claim 4, characterized in that the asymmetric loss function value is calculated by the following formula:

[formula not reproduced in this text]

where $\rho$ represents the weight of misclassification of positive samples, and $\mathbb{I}(\cdot)$ represents the indicator function, which equals 1 if its condition is satisfied and 0 otherwise.

7. An online active learning method applicable to unlabeled unbalanced data streams according to claim 4, wherein in step 33 the variance $\Sigma$ of the linear classifier is updated, based on the asymmetric loss function value and the optimization strategy, by the following formula:

[formula not reproduced in this text]

where $\gamma$ represents the regularization coefficient.

8. An online active learning method applicable to unlabeled unbalanced data streams according to claim 4, characterized in that in step 34 the mean $\mu$ of the linear classifier is updated, based on the asymmetric loss function value and the optimization strategy, by the following formula:

[formula not reproduced in this text]

where $\eta$ represents the learning rate of the linear classifier, and the gradient of the asymmetric loss function value can be obtained by differentiating the loss function.
