WO2020140597A1 - Online active learning method applicable to unlabeled unbalanced data stream - Google Patents

Online active learning method applicable to unlabeled unbalanced data stream

Info

Publication number
WO2020140597A1
Authority
WO
WIPO (PCT)
Prior art keywords
linear classifier
sample
unlabeled
asymmetric
active learning
Prior art date
Application number
PCT/CN2019/114167
Other languages
French (fr)
Chinese (zh)
Inventor
吴庆耀
张一帆
谭明奎
Original Assignee
华南理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华南理工大学
Publication of WO2020140597A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Abstract

The present invention provides an online active learning method applicable to unlabeled unbalanced data streams. An unlabeled data stream is fed sequentially into a linear classifier for prediction; the classes of the stream are highly unbalanced, that is, positive samples are scarce. Under a proposed asymmetric query strategy, the linear classifier dynamically decides which samples of the unbalanced data need to be labeled. Under a proposed asymmetric update strategy, the linear classifier is updated with the labeled samples it mispredicted, and second-order information of the samples is used to improve learning efficiency. By using second-order information of the samples, the method yields a new asymmetric strategy that considers both the labeling of samples and the updating of the model, which better addresses class imbalance and improves the classification performance of active learning models on streaming data.

Description

An online active learning method applicable to unlabeled unbalanced data streams
Technical field
[0001] The present invention relates to the technical field of online learning and semi-supervised learning, and in particular to an online active learning method applicable to unlabeled unbalanced data streams.
Background
[0002] In recent years, artificial intelligence and related industries have grown rapidly and have become a focus of attention for academia, industry, and governments around the world. Recently, the State Council issued the "New Generation Artificial Intelligence Development Plan", highlighting the national strategic position of artificial intelligence research and industry. In the Internet industry, online learning technology has developed rapidly and has made considerable progress in many application areas.
[0003] However, existing online learning technology still faces many challenges. First, raw stream data is unlabeled, and labeling it is often very expensive. How to select the most discriminative data for labeling under a limited labeling budget, and still train a well-performing learner, is an important problem for online learning and its industrial applications. Second, in many practical tasks the classes are unbalanced, that is, positive samples are far fewer than negative samples. How to handle this class imbalance is likewise a key problem that industrial applications urgently need to solve.
Summary of the invention
Technical problem
Solution to the problem
Technical solution
[0004] In view of this, to solve the above problems in the prior art, the present invention provides an online active learning method applicable to unlabeled unbalanced data streams. For unbalanced data it proposes an asymmetric query strategy that dynamically decides which samples need to be labeled. To update the model effectively, the method further proposes an asymmetric update strategy and uses second-order information of the samples to update the model efficiently. The method thereby copes well with problems found in practical classification applications, such as sparse labeled data, class imbalance, and streaming data.
[0005] To achieve the above objective, the technical solution of the present invention is as follows.
[0006] An online active learning method applicable to unlabeled unbalanced data streams, comprising the following steps:
[0007] Step 1. The unlabeled data stream is fed sequentially into a linear classifier for prediction, where the classes of the data stream are highly unbalanced and the positive class is usually taken to be the rare class;
[0008] Step 2. According to the proposed asymmetric query strategy, the linear classifier decides sequentially, for the unlabeled unbalanced data, which samples need to be labeled;
[0009] Step 3. According to the proposed asymmetric update strategy, the linear classifier is updated with the labeled data it mispredicted, and second-order information of the samples is used to improve learning efficiency.
[0010] Further, in step 1, the unlabeled data stream may be expressed as $\{x_t \in \mathbb{R}^d \mid t = 1, \ldots, T\}$, where $\mathbb{R}^d$ indicates that each sample has $d$ features and $T$ is the total number of unlabeled samples. The labeling budget is $s$ samples and the label classes are $y_t \in \{-1, +1\}$, so positive samples ($y_t = +1$) are far fewer than negative samples ($y_t = -1$). The linear classifier is used as follows:
[0011] Step 11. The linear classifier is denoted $w \in \mathbb{R}^d$ and follows a multivariate Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean of the linear classifier $w$ and $\Sigma$ is its variance;
[0012] Step 12. The classification prediction of the linear classifier is $\hat{y}_t = \operatorname{sign}(\mu^\top x_t)$;
[0013] Step 13. The prediction result is judged as follows: if $\hat{y}_t = y_t$, the linear classifier has classified the sample correctly; otherwise it has misclassified it.
[0014] Further, the asymmetric query strategy in step 2 comprises the following steps:
[0015] Step 21. Based on the second-order information $\Sigma$ of the samples (that is, the variance of the linear classifier), compute the confidence of the linear classifier in the current sample;
[0016] Step 22. Based on the confidence, compute the asymmetric query parameter of the current sample;
[0017] Step 23. Based on the asymmetric query parameter, perform Bernoulli sampling and obtain the sampled value;
[0018] Step 24. If the sampled value is 1, the label of the sample needs to be queried; otherwise it does not.
[0019] Further, the asymmetric update strategy in step 3 comprises the following steps:
[0020] Step 31. Obtain the mispredicted labeled data;
[0021] Step 32. Based on the mispredicted labeled data, compute the asymmetric loss function value of that data;
[0022] Step 33. Based on the asymmetric loss function value and the optimization strategy, update the variance $\Sigma$ of the linear classifier:
[0023] [Formula image imgf000006_0001: update of the variance $\Sigma$]
[0024] where [definition image imgf000006_0002]
[0027] where $\eta$ is the learning rate of the linear classifier, and the remaining symbol relates to the asymmetric loss function value [formula image imgf000007_0001]
[0030] where $\eta$ is the learning rate of the linear classifier, and [formula image imgf000007_0002] represents the model's confidence in the current sample, that is, how familiar the model is with that sample, which allows the confidence to be computed more accurately;
[0031] based on the confidence, the asymmetric query parameter of the current sample is computed by the following formula:
[Formula image imgf000008_0002: asymmetric query parameter]
[0033] where $p_t = \mu^\top x_t$ is the prediction margin of the linear classifier on the current sample, and its absolute value $|p_t|$ represents how far the model places the sample from the classification hyperplane;
[0034] based on the asymmetric query parameter, Bernoulli sampling is performed to obtain a sampled value; different sampling coefficients are set for the different classes, and the sampling probability is expressed as follows:
[Formula image imgf000008_0003: sampling probability]
where one sampling coefficient applies to positive predictions ($p_t \ge 0$) and the other to negative predictions ($p_t < 0$); Bernoulli sampling with this probability yields the sampled value.
[0037] Further, the asymmetric loss function value is computed by the following formula:
[0038] [Formula image imgf000009_0001: asymmetric loss function]
[0039] where $\rho$ is the misclassification weight of positive samples and $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 if its condition holds and 0 otherwise.
[0040] Based on the asymmetric loss function value and the optimization strategy, the variance $\Sigma$ and the mean $\mu$ of the linear classifier are updated by the formulas of step 33 and step 34.
Beneficial effects of the invention
Beneficial effects
[0041] Compared with the prior art, the online active learning method applicable to unlabeled unbalanced data streams of the present invention has the following advantages and technical effects:
[0042] The invention uses second-order information of the samples to propose a new asymmetric strategy. This strategy considers both the labeling of samples and the updating of the model, so it better addresses class imbalance and improves the classification performance of active learning models on streaming data.
Brief description of the drawings
Description of the drawings
[0043] FIG. 1 is a flowchart of an online active learning method applicable to unlabeled unbalanced data streams in an embodiment.
[0044] FIG. 2 is a flowchart of the asymmetric query strategy in the embodiment.
[0045] FIG. 3 is a flowchart of the asymmetric update strategy in the embodiment.
[0046] FIG. 4 shows the validation results of the online active learning method in the embodiment.
Embodiments of the invention
Modes for carrying out the invention
[0047] The specific implementation of the present invention is further described below with reference to the drawings and specific embodiments. Note that the described embodiments are only some of the embodiments of the present invention, not all of them.
[0048] FIG. 1 is a flowchart of the online active learning method applicable to unlabeled unbalanced data streams of this embodiment, which comprises the following steps:
[0049] Step 1. The unlabeled data stream is fed sequentially into a linear classifier for prediction, where the classes of the data stream are highly unbalanced and the positive class is usually taken to be the rare class;
[0050] Step 2. According to the proposed asymmetric query strategy, the linear classifier decides sequentially, for the unlabeled unbalanced data, which samples need to be labeled;
[0051] Step 3. According to the proposed asymmetric update strategy, the linear classifier is updated with the labeled data it mispredicted, and second-order information of the samples is used to improve learning efficiency.
[0052] In step 1, the unlabeled data stream may be expressed as $\{x_t \in \mathbb{R}^d \mid t = 1, \ldots, T\}$, where $\mathbb{R}^d$ indicates that each sample has $d$ features and $T$ is the total number of unlabeled samples. The labeling budget is $s$ samples and the label classes are $y_t \in \{-1, +1\}$, so positive samples ($y_t = +1$) are far fewer than negative samples ($y_t = -1$). The linear classifier is used as follows:
[0053] Step 11. The linear classifier is denoted $w \in \mathbb{R}^d$ and follows a multivariate Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean of the linear classifier $w$ and $\Sigma$ is its variance;
[0054] Step 12. The classification prediction of the linear classifier is $\hat{y}_t = \operatorname{sign}(\mu^\top x_t)$;
[0055] Step 13. The prediction result is judged as follows: if $\hat{y}_t = y_t$, the linear classifier has classified the sample correctly; otherwise it has misclassified it.
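Steps 11 to 13 can be illustrated with a minimal Python sketch. It assumes that the prediction is the sign of the margin computed from the mean vector, with a margin of exactly zero counted as a positive prediction; the function names and the toy numbers are hypothetical and are not taken from the publication.

```python
def margin(mu, x):
    """Prediction margin: dot product of the classifier mean mu and the sample x."""
    return sum(m * xi for m, xi in zip(mu, x))


def predict(mu, x):
    """Predicted label: +1 when the margin is non-negative, otherwise -1 (assumed tie rule)."""
    return 1 if margin(mu, x) >= 0 else -1


def is_correct(mu, x, y):
    """Step 13: the classification is correct when the prediction equals the true label."""
    return predict(mu, x) == y


# Toy example with a 3-feature sample (values are illustrative only).
mu = [0.5, -1.0, 0.25]
x_t = [1.0, 0.2, 2.0]
print(predict(mu, x_t), is_correct(mu, x_t, +1), is_correct(mu, x_t, -1))
```

Only the mean enters the prediction in this sketch; the variance is used later, when deciding whether to request a label and when updating the model.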
[0056] FIG. 2 is a flowchart of the asymmetric query strategy of the present invention. The asymmetric query strategy in step 2 comprises the following steps:
[0057] Step 21. Based on the second-order information $\Sigma$ of the samples (that is, the variance of the linear classifier), compute the confidence $c_t$ of the linear classifier in the current sample by the following formula:
[Formula image imgf000012_0003: confidence $c_t$]
[0058] where $\eta$ is the learning rate of the linear classifier, $\gamma$ is the regularization coefficient, $\rho_{\max} = \max(1, \rho)$, and $\rho$ is the misclassification cost of positive samples. In addition, $v_t = x_t^\top \Sigma x_t$ represents the model's confidence in the current sample, that is, how familiar the model is with that sample, which allows the confidence $c_t$ to be computed more accurately;
[0059] Step 22. Based on the confidence $c_t$, compute the asymmetric query parameter of the current sample by the following formula:
[Formula image imgf000013_0002: asymmetric query parameter]
[0060] where $p_t = \mu^\top x_t$ is the prediction margin of the linear classifier on the current sample, and its absolute value $|p_t|$ represents how far the model places the sample from the classification hyperplane;
[0061] Step 23. Based on the asymmetric query parameter, perform Bernoulli sampling to obtain a sampled value. Different sampling coefficients are set for the different classes, and the sampling probability is expressed as follows:
[Formula image imgf000014_0001: sampling probability]
[0062] where one sampling coefficient applies to positive predictions ($p_t \ge 0$) and the other to negative predictions ($p_t < 0$); Bernoulli sampling with this probability yields the sampled value $z_t$;
[0063] Step 24. If the sampled value $z_t$ is 1, the label of the sample needs to be queried, and budget is spent to obtain it; if $z_t$ is 0, the label does not need to be queried.
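The flow of steps 21 to 24 can be sketched in Python as follows. The formulas of the publication are images that did not survive in this text, so the functional forms used here for the confidence, the query parameter and the sampling probability are assumptions chosen only to be consistent with the symbols named above (learning rate `eta`, regularization coefficient `gamma`, cost `rho`, and one sampling coefficient per predicted class, here called `delta_pos` and `delta_neg`). All names and default values are hypothetical.

```python
import random


def quadratic_form(sigma, x):
    """Return x^T Sigma x for a matrix given as a list of rows."""
    n = len(x)
    return sum(x[i] * sigma[i][j] * x[j] for i in range(n) for j in range(n))


def query_probability(p, v, eta=1.0, gamma=1.0, rho=2.0,
                      delta_pos=1.0, delta_neg=1.0):
    """Probability of requesting a label, given margin p and familiarity v."""
    rho_max = max(1.0, rho)
    # ASSUMED confidence term (step 21): a non-positive correction that grows
    # in magnitude with v, so less familiar samples are queried more readily.
    c = -(eta * rho_max / 2.0) / (1.0 / v + 1.0 / gamma) if v > 0 else 0.0
    # ASSUMED query parameter (step 22): distance to the boundary plus the correction.
    q = abs(p) + c
    if q <= 0:
        return 1.0  # assumption: always query when the parameter is non-positive
    # ASSUMED sampling probability (step 23), with a class-dependent coefficient.
    delta = delta_pos if p >= 0 else delta_neg
    return delta / (delta + q)


def sample_query(p, v, rng, **params):
    """Steps 23 and 24: draw the Bernoulli value (1 means request the label)."""
    return 1 if rng.random() < query_probability(p, v, **params) else 0


rng = random.Random(0)
sigma = [[1.0, 0.0], [0.0, 1.0]]
mu = [1.0, 0.0]
x_t = [3.0, 0.0]
p_t = sum(m * xi for m, xi in zip(mu, x_t))
v_t = quadratic_form(sigma, x_t)
print(sample_query(p_t, v_t, rng, delta_pos=2.0, delta_neg=1.0))
```

Choosing a larger coefficient for positive predictions than for negative ones makes the sketch request labels more often for samples predicted to belong to the rare class, which is one way to realize the asymmetry the text describes.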
[0064] FIG. 3 is a flowchart of the asymmetric update strategy of the present invention. The asymmetric update strategy in step 3 comprises the following steps:
[0065] Step 31. Obtain the mispredicted labeled data [formula image imgf000015_0001];
[0066] Step 32. Based on the mispredicted labeled data, compute the asymmetric loss value by the following formula:
[Formula image imgf000015_0002: asymmetric loss function]
where $\rho$ is the misclassification weight of positive samples and $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 if its condition holds and 0 otherwise. This cost-sensitive loss function allows the linear classifier to be updated asymmetrically;
[0068] Step 33. Based on the asymmetric loss function value and the optimization strategy, update the variance $\Sigma$ of the linear classifier by the following formula:
[Formula image imgf000015_0003: update of the variance $\Sigma$]
[0069] where $\gamma$ is the regularization coefficient;
[0070] Step 34. Based on the asymmetric loss function value and the optimization strategy, update the mean $\mu$ of the linear classifier by the following formula:
[Formula image imgf000016_0001: update of the mean $\mu$]
[0071] where $\eta$ is the learning rate of the linear classifier, and the gradient of the asymmetric loss function value is obtained by differentiating the loss function.
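Steps 31 to 34 can be sketched in Python under explicit assumptions, because the loss and update formulas are images that are not available in this text. The sketch assumes a cost-weighted hinge loss, a rank-one shrinkage of the variance in the style of second-order online learners, and a mean step along the gradient scaled by the updated variance. The function names and default parameter values are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def asymmetric_loss(mu, x, y, rho=2.0):
    """ASSUMED loss: hinge loss weighted by rho for positive samples and by 1 otherwise."""
    weight = rho if y == 1 else 1.0
    return weight * max(0.0, 1.0 - y * dot(mu, x))


def asymmetric_update(mu, sigma, x, y, eta=0.5, gamma=1.0, rho=2.0):
    """Return the new (mu, sigma); only mispredicted labeled samples trigger an update."""
    y_hat = 1 if dot(mu, x) >= 0 else -1
    if y_hat == y:
        return mu, sigma  # step 31: correctly predicted samples are skipped
    n = len(x)
    sx = [dot(sigma[i], x) for i in range(n)]  # Sigma x (sigma assumed symmetric)
    v = dot(x, sx)                             # x^T Sigma x
    # ASSUMED variance update (step 33): Sigma - (Sigma x)(Sigma x)^T / (gamma + v)
    new_sigma = [[sigma[i][j] - sx[i] * sx[j] / (gamma + v) for j in range(n)]
                 for i in range(n)]
    # Gradient of the assumed loss at a mispredicted sample (hinge term is active).
    weight = rho if y == 1 else 1.0
    grad = [-weight * y * xi for xi in x]
    # ASSUMED mean update (step 34): mu - eta * Sigma_new * grad
    new_mu = [mu[i] - eta * dot(new_sigma[i], grad) for i in range(n)]
    return new_mu, new_sigma


mu, sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
mu, sigma = asymmetric_update(mu, sigma, [1.0, 0.0], -1)
print(mu, sigma)
```

In this sketch a mispredicted positive sample moves the mean `rho` times further than a mispredicted negative sample with the same features, which is where the asymmetry enters.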
[0072] FIG. 4 shows the performance of the online active learning method applicable to unlabeled unbalanced data streams on the network security data set w8a. In FIG. 4 the method appears under the names OA3 and OA3_diag, where OA3_diag is a simple variant of the method that is not described in detail. The other methods compared, namely PAA, OAAL, CSOAL and SOAL, are classic approaches to this problem and serve as experimental baselines for the proposed method.
[0073] The w8a data set is a classic open-source data set used to determine whether a web page is abnormal. It contains 64,700 samples with 300 features each. Normal web pages far outnumber abnormal ones, so the data is unbalanced, with an imbalance ratio of 1:32.5. In this example, abnormal web pages are the positive (minority) class and normal web pages are the negative (majority) class.
[0074] In the experiment, all training samples arrive sequentially and without labels. For each web page that arrives, the proposed active learning method decides according to step 2 whether it needs to be labeled. If so, the label is obtained at a monetary cost, and the model is updated according to step 3.
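The experimental protocol just described (predict every arriving sample, decide whether to pay for its label, update only on mispredicted labeled samples, stop paying once the budget is spent) can be sketched as a Python loop. The learner below is a deliberately simple perceptron-style stand-in with a placeholder query rule, not the method of the publication; the class, the function names and the toy stream are hypothetical.

```python
import random


class ToyLearner:
    """Stand-in learner; its query rule and update are placeholders, not the patent's formulas."""

    def __init__(self, dim):
        self.mu = [0.0] * dim

    def margin(self, x):
        return sum(m * xi for m, xi in zip(self.mu, x))

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def wants_label(self, x, rng):
        # Placeholder for step 2: query more often near the decision boundary.
        return rng.random() < 1.0 / (1.0 + abs(self.margin(x)))

    def update(self, x, y):
        # Placeholder for step 3: perceptron step on a mispredicted sample.
        self.mu = [m + y * xi for m, xi in zip(self.mu, x)]


def run_stream(stream, learner, budget, rng):
    """Process (features, hidden_label) pairs in order; labels are read only when paid for."""
    spent = 0
    predictions = []
    for x, hidden_label in stream:
        y_hat = learner.predict(x)           # step 1: predict every sample
        predictions.append(y_hat)
        if spent < budget and learner.wants_label(x, rng):
            spent += 1                       # paying for the label consumes budget
            if y_hat != hidden_label:        # step 3 applies to mispredicted samples only
                learner.update(x, hidden_label)
    return predictions, spent


# Toy unbalanced stream: three negative samples for every positive one.
stream = [([1.0, 0.0], -1), ([0.0, 1.0], -1), ([1.0, 1.0], -1), ([-1.0, -1.0], 1)] * 5
predictions, spent = run_stream(stream, ToyLearner(2), budget=3, rng=random.Random(0))
print(len(predictions), spent)
```

Keeping the budget check outside the learner means the same loop can drive any query rule and any update rule, including sketches of steps 2 and 3.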
[0075] The detailed experimental results are shown in FIG. 4: the proposed online active learning method applicable to unlabeled unbalanced data streams achieved the best performance of the methods compared.
[0076] The online active learning method applicable to unlabeled unbalanced data streams of this embodiment proposes an asymmetric query strategy for unbalanced data that dynamically decides which samples need to be labeled. To update the model effectively, the method further proposes an asymmetric update strategy and uses second-order information of the samples to update the model efficiently. The method thereby copes well with problems found in practical classification applications, such as sparse labeled data, class imbalance, and streaming data.

Claims

1. An online active learning method applicable to unlabeled unbalanced data streams, characterized by comprising the following steps:
Step 1. Obtain an unlabeled data stream and feed it sequentially into a linear classifier for prediction, where the classes of the data stream are highly unbalanced and the positive class is taken to be the rare class;
Step 2. According to the proposed asymmetric query strategy, the linear classifier decides sequentially, for the unlabeled unbalanced data, which samples need to be labeled;
Step 3. According to the proposed asymmetric update strategy, the linear classifier is updated with the labeled data it mispredicted, and second-order information of the samples is used to improve learning efficiency.
2. The online active learning method applicable to unlabeled unbalanced data streams according to claim 1, characterized in that in step 1 the unlabeled data stream is expressed as $\{x_t \in \mathbb{R}^d \mid t = 1, \ldots, T\}$, where $\mathbb{R}^d$ indicates that each sample has $d$ features and $T$ is the total number of unlabeled samples; the labeling budget is $s$ samples and the label classes are $y_t \in \{-1, +1\}$, so positive samples ($y_t = +1$) are far fewer than negative samples ($y_t = -1$); the linear classifier is used as follows:
Step 11. The linear classifier is denoted $w \in \mathbb{R}^d$ and follows a multivariate Gaussian distribution $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean of the linear classifier $w$ and $\Sigma$ is its variance;
Step 12. The classification prediction of the linear classifier is $\hat{y}_t = \operatorname{sign}(\mu^\top x_t)$;
Step 13. The prediction result is judged as follows: if $\hat{y}_t = y_t$, the linear classifier has classified the sample correctly; otherwise it has misclassified it.
3. The online active learning method applicable to unlabeled unbalanced data streams according to claim 1, characterized in that the asymmetric query strategy in step 2 comprises the following steps:
Step 21. Based on the second-order information $\Sigma$ of the samples, that is, the variance of the linear classifier, compute the confidence of the linear classifier in the current sample;
Step 22. Based on the confidence, compute the asymmetric query parameter of the current sample;
Step 23. Based on the asymmetric query parameter, perform Bernoulli sampling and obtain the sampled value;
Step 24. If the sampled value is 1, the label of the sample needs to be queried; otherwise it does not.
4. The online active learning method applicable to unlabeled unbalanced data streams according to claim 1, characterized in that the asymmetric update strategy in step 3 comprises the following steps:
Step 31. Obtain the mispredicted labeled data;
Step 32. Based on the mispredicted labeled data, compute the asymmetric loss function value of that data;
Step 33. Based on the asymmetric loss function value and the optimization strategy, update the variance $\Sigma$ of the linear classifier;
Step 34. Based on the asymmetric loss function value and the optimization strategy, update the mean $\mu$ of the linear classifier.
5. The online active learning method applicable to unlabeled unbalanced data streams according to claim 3, characterized in that the confidence is computed by the following formula:
[Formula image imgf000019_0001: confidence $c_t$]
where $\eta$ is the learning rate of the linear classifier, $\gamma$ is the regularization coefficient, $\rho_{\max} = \max(1, \rho)$, and $\rho$ is the misclassification cost of positive samples; in addition, $v_t = x_t^\top \Sigma x_t$ represents the model's confidence in the current sample, that is, how familiar the model is with that sample, which allows the confidence $c_t$ to be computed more accurately;
based on the confidence $c_t$, the asymmetric query parameter of the current sample is computed by the following formula:
[Formula image imgf000019_0002: asymmetric query parameter]
where $p_t = \mu^\top x_t$ is the prediction margin of the linear classifier on the current sample, and its absolute value $|p_t|$ represents how far the model places the sample from the classification hyperplane;
based on the asymmetric query parameter, Bernoulli sampling is performed to obtain a sampled value; different sampling coefficients are set for the different classes, and the sampling probability is expressed as follows:
[Formula image: sampling probability and the definitions of the sampling coefficients]
Bernoulli sampling with this probability yields the sampled value $z_t$.
6. The online active learning method applicable to unlabeled unbalanced data streams according to claim 4, characterized in that the asymmetric loss function value is computed by the following formula:
[Formula image imgf000020_0001: asymmetric loss function]
where $\rho$ is the misclassification weight of positive samples and $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 if its condition holds and 0 otherwise.
7. The online active learning method applicable to unlabeled unbalanced data streams according to claim 4, characterized in that in step 33, based on the asymmetric loss function value and the optimization strategy, the variance $\Sigma$ of the linear classifier is updated by the following formula:
[Formula image imgf000020_0002: update of the variance $\Sigma$]
where $\gamma$ is the regularization coefficient.
8. The online active learning method applicable to unlabeled unbalanced data streams according to claim 4, characterized in that in step 34, based on the asymmetric loss function value and the optimization strategy, the mean $\mu$ of the linear classifier is updated by the following formula:
[Formula image imgf000020_0003: update of the mean $\mu$]
where $\eta$ is the learning rate of the linear classifier, and the gradient of the asymmetric loss function value is obtained by differentiating the loss function.
Replacement sheet (Rule 26)
PCT/CN2019/114167 2018-12-31 2019-10-29 Online active learning method applicable to unlabeled unbalanced data stream WO2020140597A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201811652531 2018-12-31
CN201910001840.2A CN109800799A (en) 2018-12-31 2019-01-02 A kind of online Active Learning Method suitable for no label unbalanced data stream
CN201910001840.2 2019-01-02

Publications (1)

Publication Number Publication Date
WO2020140597A1 true WO2020140597A1 (en) 2020-07-09

Family

ID=66558426

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114167 WO2020140597A1 (en) 2018-12-31 2019-10-29 Online active learning method applicable to unlabeled unbalanced data stream

Country Status (2)

Country Link
CN (1) CN109800799A (en)
WO (1) WO2020140597A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095423A (en) * 2021-04-21 2021-07-09 南京大学 Stream data classification method based on-line inverse deductive learning and implementation device thereof
CN113537630A (en) * 2021-08-04 2021-10-22 支付宝(杭州)信息技术有限公司 Training method and device of business prediction model

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800799A (en) * 2018-12-31 2019-05-24 华南理工大学 A kind of online Active Learning Method suitable for no label unbalanced data stream
CN110647117B (en) * 2019-09-06 2020-12-18 青岛科技大学 Chemical process fault identification method and system
CN111882063B (en) * 2020-08-03 2022-12-02 清华大学 Data annotation request method, device, equipment and storage medium suitable for low budget
CN113360512B (en) * 2021-06-21 2023-10-27 特赞(上海)信息科技有限公司 Image processing model updating method and device based on user feedback and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150235160A1 (en) * 2014-02-20 2015-08-20 Xerox Corporation Generating gold questions for crowdsourcing
CN106056130A (en) * 2016-05-18 2016-10-26 天津大学 Combined downsampling linear discrimination classification method for unbalanced data sets
CN109101993A (en) * 2018-07-05 2018-12-28 杭州电子科技大学 A kind of classification method for online non-equilibrium flow data
CN109800799A (en) * 2018-12-31 2019-05-24 华南理工大学 A kind of online Active Learning Method suitable for no label unbalanced data stream

Also Published As

Publication number Publication date
CN109800799A (en) 2019-05-24

Similar Documents

Publication Publication Date Title
WO2020140597A1 (en) Online active learning method applicable to unlabeled unbalanced data stream
Zhang et al. Learning structured representation for text classification via reinforcement learning
Wu et al. Distilled person re-identification: Towards a more scalable system
Qu et al. Adversarial category alignment network for cross-domain sentiment classification
WO2022135121A1 (en) Molecular graph representation learning method based on contrastive learning
WO2022088444A1 (en) Multi-task language model-oriented meta-knowledge fine tuning method and platform
CN105117429A (en) Scenario image annotation method based on active learning and multi-label multi-instance learning
CN112115993B (en) Zero sample and small sample evidence photo anomaly detection method based on meta-learning
CN113254675B (en) Knowledge graph construction method based on self-adaptive few-sample relation extraction
CN114491024B (en) Specific field multi-label text classification method based on small sample
CN107729921A (en) A kind of machine Active Learning Method and learning system
CN114841151A (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN112199505B (en) Cross-domain emotion classification method and system based on feature representation learning
CN109947938A (en) Multiple labeling classification method, system, readable storage medium storing program for executing and computer equipment
CN112801162A (en) Adaptive soft label regularization method based on image attribute prior
CN111191033A (en) Open set classification method based on classification utility
CN112116063B (en) Feature offset correction method based on meta learning
CN113435190B (en) Chapter relation extraction method integrating multilevel information extraction and noise reduction
He et al. Addressing the Overfitting in Partial Domain Adaptation with Self-Training and Contrastive Learning
Xia et al. LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition
Chen et al. Unsupervised domain adaptation with joint domain-adversarial reconstruction networks
CN113204975A (en) Sensitive character wind identification method based on remote supervision
Peng et al. Named entity recognition based on reinforcement learning and adversarial training
Wang et al. DualMatch: Robust Semi-supervised Learning with Dual-Level Interaction
Li et al. A Novel Semi-supervised Adaboost Technique Based on Improved Tri-training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19907105

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05/11/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19907105

Country of ref document: EP

Kind code of ref document: A1