Online soft-margin kernel learning algorithm based on step-size control
Technical Field
The invention belongs to the field of data mining and machine learning, relates to a method for data mining and data processing, and particularly relates to an online soft-margin kernel learning algorithm (OSKL) based on step-size control.
Background
The classification problem is a classic problem in the field of data mining and machine learning. Traditional classification methods based on batch processing first collect data, build a learning model on the collected data, and select an optimization algorithm to solve the model and obtain a classifier. With the rapid development of technologies such as e-commerce, social media, the mobile internet, and the internet of things, more and more application scenarios require real-time processing of large-scale data streams. Traditional batch classification methods suffer from high computational complexity and low model-update efficiency when processing large-scale data streams. Online learning follows the basic framework of point-by-point learning: data are learned one point at a time through a dynamically updated model, and the computational complexity of a single model update is only O(1). Online learning therefore offers low computational complexity, high model-update efficiency, and strong real-time performance, and is a natural tool for processing and analyzing data stream problems. In addition, erroneous labels are inevitable in large-scale labeled data, and such labels can seriously affect the construction and performance of a classifier. It is therefore desirable to design a data stream mining algorithm with fault tolerance.
Disclosure of Invention
The invention aims to provide an online soft-margin kernel learning algorithm based on step-size control, directed at the problems that existing batch classification methods cannot efficiently handle data stream classification and that existing online learning algorithms cannot suppress the influence of noise. The algorithm reduces model storage space, effectively controls the influence of noise, remarkably improves model-update efficiency, and meets the real-time requirements of practical application problems.
According to an embodiment of the present invention, an online soft-margin kernel learning algorithm based on step-size control is provided, which includes the following steps:
(I) initializing model parameters, a decision function, and a model kernel function.
And (II) collecting the data stream, and predicting the class label of the data stream sample by using a classification decision function.
And (III) acquiring a sample real label, appointing a loss function, and calculating a sample loss value.
And (IV) calculating the updating step size of the decision function of the classifier.
And (V) updating a classifier decision function based on the basic framework of the online gradient descent algorithm.
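The five steps above can be sketched end to end in Python. The sketch below is illustrative, under our reading of the steps detailed later (hinge loss, capped step τ_t = min(C, l_t / k(x_t, x_t)), Gaussian kernel with bandwidth equal to the input dimension); class and method names are ours, not the claimed implementation:

```python
import math

def gaussian_kernel(xi, xj):
    # Gaussian kernel k(xi, xj) = exp(-||xi - xj||^2 / d);
    # the bandwidth d = dim(x) follows the embodiment described later
    d = len(xi)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / d)

class OSKL:
    """Sketch of the online soft-margin kernel learner (steps I-V)."""

    def __init__(self, C=0.05):
        self.C = C        # soft-margin threshold: caps the update step
        self.sv = []      # support vectors x_i
        self.coef = []    # coefficients tau_i * y_i

    def decision(self, x):
        # f_{t-1}(x) = sum_i tau_i y_i k(x_i, x); initially f_0 = 0
        return sum(c * gaussian_kernel(s, x) for s, c in zip(self.sv, self.coef))

    def predict(self, x):
        # step II: predict the class label with the current decision function
        return 1 if self.decision(x) >= 0 else -1

    def update(self, x, y):
        # step III: hinge loss l_t = max(0, 1 - y_t f_{t-1}(x_t))
        loss = max(0.0, 1.0 - y * self.decision(x))
        if loss > 0.0:
            # step IV: capped step tau_t = min(C, l_t / k(x_t, x_t))
            tau = min(self.C, loss / gaussian_kernel(x, x))
            # step V: f_t = f_{t-1} + tau_t * y_t * k(x_t, .)
            self.sv.append(list(x))
            self.coef.append(tau * y)
```

On a separable toy stream, the learner converges to the correct labels after a few passes while each update touches at most one new support vector.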
In the learning algorithm according to the embodiment of the present invention, in step (one), the specific steps of model initialization are:
determining a training sample set and a test sample set, initializing the model threshold parameter C, initializing the binary classification decision function as f_0 = 0, and designating a Gaussian kernel function as the model kernel function k(·, ·).
In the learning algorithm according to the embodiment of the present invention, in the step (two), the specific steps of predicting the class label of the data stream sample by using the classification decision function are as follows:
the data stream is collected sample by sample in the form {(x_t, y_t)}, t = 1, 2, …, where x_t denotes the t-th sample input and y_t the t-th sample output (class label). The decision function f_{t-1} is used to predict the label of the t-th sample in the data stream:

ŷ_t = sign(f_{t-1}(x_t)).
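A minimal sketch of the label prediction, assuming ŷ_t = sign(f_{t-1}(x_t)) with ties assigned to +1 (a convention of ours, not stated in the embodiment):

```python
def predict_label(f_value):
    # y_hat_t = sign(f_{t-1}(x_t)); f_value is the current decision function
    # evaluated at x_t, and a value of exactly 0 is mapped to +1 here
    return 1 if f_value >= 0 else -1
```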
in the learning algorithm according to the embodiment of the present invention, in step (three), the specific operation flow for calculating the sample loss is as follows:
the most common hinge loss of the binary classification problem is designated as the loss function, and the hinge loss of the sample point (x_t, y_t) is computed as:

l_t = max(0, 1 − y_t f_{t-1}(x_t)).
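This loss is the standard hinge loss of binary classification; a minimal sketch of its computation (function name ours):

```python
def hinge_loss(f_value, y):
    # l_t = max(0, 1 - y_t * f_{t-1}(x_t));
    # zero exactly when x_t is classified correctly with margin >= 1
    return max(0.0, 1.0 - y * f_value)
```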
in the learning algorithm according to the embodiment of the present invention, in step (four), the specific operation flow for calculating the update step τ_t is as follows. The update step τ_t is determined based on the following two considerations: first, achieving correct classification of the current sample point x_t with the highest possible confidence, i.e., reaching zero loss (l_t = 0); second, ensuring the stability of the algorithm as far as possible, i.e., reducing the fluctuation of the decision function during updating. The optimal step τ_t is the solution of the following optimization problem:

min_f (1/2) ‖f − f_{t-1}‖²  subject to  l(f; (x_t, y_t)) = 0.
On the other hand, large-scale sampled data inevitably contain a large amount of mislabeled data, and erroneous labels can seriously affect the construction of the decision function and the performance of the corresponding classifier. To this end, we introduce the soft-margin threshold parameter C to control the update step, requiring τ_t ≤ C, so that the influence of mislabeled data on the model is limited and the stability of the classifier is ensured. Based on the hinge loss l_t of the sample point (x_t, y_t) calculated in step (three) and the step-size control parameter C, the update step τ_t is determined as:

τ_t = min{C, l_t / k(x_t, x_t)}.
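A minimal sketch of the capped update step; the rule τ_t = min(C, l_t / k(x_t, x_t)) is the standard soft-margin passive-aggressive step consistent with the constraint τ_t ≤ C above (function and argument names ours):

```python
def update_step(loss, k_tt, C):
    # tau_t = min(C, l_t / k(x_t, x_t)): large enough to move toward zero loss,
    # but capped by the soft-margin threshold C so that a single mislabeled
    # point cannot force an arbitrarily large change to the decision function
    return min(C, loss / k_tt)
```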
in the learning algorithm according to the embodiment of the present invention, in step (five), the specific operation flow for updating the classifier decision function is as follows: based on the update step τ_t calculated in step (four), under the basic framework of the online gradient descent algorithm, the decision function is updated as

f_t = f_{t-1} + τ_t y_t k(x_t, ·),

obtaining the new decision function f_t.
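In a kernel-expansion representation of the decision function, this update only appends one term; a minimal sketch (names ours):

```python
def apply_update(support, coefs, x_t, y_t, tau_t):
    # f_t = f_{t-1} + tau_t * y_t * k(x_t, .): with the expansion
    # f_t(x) = sum_i coefs[i] * k(support[i], x), the update appends one pair
    if tau_t > 0.0:
        support.append(x_t)
        coefs.append(tau_t * y_t)
    return support, coefs
```

A zero step (loss already zero, or a capped-out mislabeled point with C = 0) leaves the expansion unchanged, which is what keeps the model storage small.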
The invention relates to an online soft-margin kernel learning algorithm based on step-size control. An online kernel learning classifier is established by introducing the hinge loss function, a Gaussian kernel function, and the soft-margin threshold parameter C, realizing online prediction for data streams. The soft-margin threshold parameter makes the update of the classifier decision function smoother and more robust. Compared with the classical online learning algorithms Kernel Perceptron and Pegasos, the proposed algorithm OSKL significantly improves classification accuracy. The online classification algorithm OSKL can flexibly handle classification problems in data stream scenarios, and compared with the traditional static classification mode based on batch processing, it greatly reduces computational complexity and model running time.
Drawings
FIG. 1 is a schematic diagram of the online soft-margin kernel learning algorithm based on step-size control
FIG. 2 is a schematic diagram showing the comparison of classification accuracy of the three algorithms on the benchmark data sets
FIG. 3 is a schematic diagram showing the comparison of average test classification accuracy of the three algorithms on the noisy-label data set ijcnn
FIG. 4 is a schematic diagram showing the comparison of average test classification accuracy of the three algorithms on the noisy-label data set codrna
FIG. 5 is a schematic diagram showing the comparison of average test classification accuracy of the three algorithms on the noisy-label data set eegeye
Detailed Description
The specific steps of the present invention are explained below with reference to the drawings.
The first embodiment is as follows: the online classification experiments on the original benchmark data sets ijcnn, codrna, and eegeye are taken as an example. FIG. 1 is a schematic diagram of the online soft-margin kernel learning algorithm based on step-size control according to an embodiment of the present invention; the online learning algorithm includes the following steps:
Step one: initialize model parameters, the decision function, and the model kernel function. The specific steps are as follows:
initializing the model threshold parameter C = 0.05 and the binary classification decision function f_0 = 0, and designating a Gaussian kernel function as the model kernel function, i.e., k(x_i, x_j) = exp(−‖x_i − x_j‖² / d), where d is taken as the dimension of the sample input x.
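The kernel of this embodiment can be written down directly; a minimal sketch with the bandwidth d taken as the input dimension, as stated above:

```python
import math

def gaussian_kernel(xi, xj):
    # k(xi, xj) = exp(-||xi - xj||^2 / d), with d = dim(x) as in the embodiment;
    # note k(x, x) = 1 for every x, so the step cap becomes tau_t = min(C, l_t)
    d = len(xi)
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / d)
```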
Step two: collect the data stream and predict the class label of the data stream sample using the classification decision function. The specific steps are as follows: the data stream is collected sample by sample in the form {(x_t, y_t)}, t = 1, 2, …, where x_t denotes the t-th sample input and y_t the t-th sample output (class label). The decision function f_{t-1} is used to predict the label of the t-th sample in the data stream:

ŷ_t = sign(f_{t-1}(x_t)).
step three: obtainingAnd (4) a sample real label appoints a loss function and calculates a sample loss value. The method comprises the following specific steps: the sample point (x) is calculated by assigning the most common change function of the two-classification problem as the loss functiontYt) loss of change:
step four: and calculating the updating step length of the decision function of the classifier. The method comprises the following specific steps: introducing soft interval threshold parameter to control update step length tautC is less than or equal to C, so that the influence of error label data on the model is limited, and the stability of the classifier is ensured. Based on the sample points (x) calculated in step (three)t,yt) 1 change loss ltAnd step size control parameter C, determining the updating step size tau of the t steptComprises the following steps:
step five: and updating a decision function of the classifier based on a basic framework of an online gradient descent algorithm. The method comprises the following specific steps: based on the update step τ calculated in step (IV)tUnder the basic framework of the online gradient descent algorithm, a decision function f is settPerform the update
Obtaining a new decision function ft。
FIG. 2 compares the average online test accuracy of the online learning algorithm of the present invention with that of the existing online learning algorithms Kernel Perceptron and Pegasos on the benchmark data sets ijcnn, codrna, and eegeye. As can be seen from FIG. 2, the average test accuracy of the online learning algorithm of the present invention on these three benchmark data sets is better than that of the other two methods.
Example two: on the basis of the original benchmark data sets ijcnn, codrna, and eegeye, noisy labels are added, and the online classifier is trained on the data sets containing noisy labels. Unlike the first embodiment, 30% of each data set is randomly selected as the test set, and noisy labels are added to the remaining data to construct the training set. Specifically, we take the sample index modulo 20, modulo 10, or modulo 5, and multiply the labels of the sample points whose remainder is 0 by −1 to obtain the noisy-label data.
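The noise-injection procedure can be sketched as follows (0-based indexing is our assumption; the embodiment only states that labels at indices with remainder 0 are multiplied by −1):

```python
def add_label_noise(labels, mod):
    # flip the label of every sample whose index is divisible by `mod`;
    # mod = 20, 10, 5 give noise rates of 5%, 10%, and 20% respectively
    return [(-y if i % mod == 0 else y) for i, y in enumerate(labels)]
```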
FIGS. 3-5 show the average classification performance (average test classification accuracy, ACA) on the noise-free test sets (the original 30% of each data set) of the online classifiers Kernel Perceptron, Pegasos, and OSKL trained on the noisy-label data sets ijcnn, codrna, and eegeye. The experimental results show that the classification accuracy of all three algorithms decreases as the training-sample noise increases from mod 20 to mod 10 to mod 5, but the proposed OSKL algorithm effectively controls the influence of noise, and its classification performance is clearly higher than that of the online classifiers Kernel Perceptron and Pegasos.
The above-described embodiments are intended to illustrate rather than limit the invention; any modifications and variations of the present invention within the spirit and scope of the claims are possible.