CN108875962A

CN108875962A - Kernel ridge regression online learning method based on a fixed budget

Info

Publication number: CN108875962A
Application number: CN201810593893.3A
Authority: CN
Inventors: 宋允全; 高富豪; 梁锡军; 渐令
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2018-11-23

Abstract

The present invention relates to a kernel ridge regression online learning method based on a fixed budget. The budget value is first determined by numerical experiment and an initial learning sample set is constructed; a kernel ridge regression model is then established and solved to obtain a predictor. The kernel ridge regression model is updated using a low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain an online predictor, which performs online prediction on the data stream. Because the method uses a fixed-budget strategy, it effectively controls the size of the online learning model, saves storage space, reduces computational complexity, and is easy to implement. The online learning method of the invention flexibly handles online prediction problems with data-stream characteristics, and data can be collected in blocks. Compared with the traditional batch-processing approach and current online learning methods, it considerably reduces computational complexity and model running time, and it can efficiently handle both regression and classification problems.

Description

Kernel ridge regression online learning method based on a fixed budget

Technical field

The invention belongs to the field of data mining and machine learning and relates to methods of data mining and data processing; specifically, it relates to a kernel ridge regression online learning method based on a fixed budget.

Background art

Ridge regression gives up the unbiasedness of the regression coefficient estimates and, at the cost of losing some information and reducing precision, obtains more robust coefficient estimates; on ill-conditioned data it fits better than least squares. Kernel ridge regression, which incorporates the kernel trick, handles nonlinear problems effectively and has therefore been applied more widely. Traditional kernel ridge regression models are solved with batch algorithms whose computational complexity is O(n³), where n is the number of samples. However, the data in more and more practical problems have data-stream characteristics, for example dynamic industrial process optimization and real-time sensor monitoring, where sampled data are acquired continuously over time as a data stream. Because of their high computational complexity, batch algorithms are not suited to such data-stream problems. For this reason, researchers in China and abroad have begun to study online learning algorithms for kernel ridge regression in order to reduce computational complexity and model running time. A notable result is the incremental kernel ridge regression online learning method proposed by B. W. Chen, which iteratively updates the kernel ridge regression model using the Sherman-Morrison-Woodbury formula and reduces the computational complexity of each model update from O(n³) to O(n²). However, since the number of samples grows linearly with time, the size, storage space, and running time of the kernel ridge regression model all keep increasing. To solve this problem, there is an urgent need for a kernel ridge regression online learning method based on a fixed-budget learning sample set that effectively controls the storage space and learning time of the model while maintaining model accuracy, so as to suit data-stream environments.

Summary of the invention

The object of the invention is to address the shortcomings of existing kernel ridge regression online learning methods, such as their inability to effectively control model size, by proposing a kernel ridge regression online learning method based on a fixed budget. The method reduces model storage space and running time and meets the real-time requirements of practical applications.

According to an embodiment of the invention, a kernel ridge regression online learning method based on a fixed-budget learning sample set is proposed, comprising the following steps:

(1) The budget value is determined by numerical experiment;

(2) Initial learning samples are randomly selected according to the budget to construct the initial learning sample set, and a ridge regression model is established; the ridge regression model is converted by a centering method into a ridge regression model without intercept and the ridge regression solution is obtained; the kernel trick is then introduced to convert the ridge regression predictor equivalently into a kernel ridge regression predictor;

(3) The data stream is acquired in mini-batch or one-by-one form, and the predictor is used to predict the samples in the data stream;

(4) Noise in the data stream is rejected using the 3σ rule, so as to keep the predictor stable;

(5) Some samples are added to the learning sample set according to their contribution values, and a corresponding number of samples are removed according to the minimum-contribution criterion, so that the budget stays constant;

(6) The kernel ridge regression model is updated using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, which performs online prediction on the data stream.

In the learning method according to an embodiment of the invention, the specific steps for determining the budget in step (1) are:

(1) A training sample set and a test sample set are determined.

(2) The candidate budget values are chosen in turn; for each candidate budget value, the corresponding number of samples is randomly selected from the training sample set, a kernel ridge regression model is established, and the accuracy for that budget value is tested on the test sample set.

(3) Step (2) is executed 10 times, and the average test accuracy and mean test time are computed for each budget.

(4) Dual-vertical-axis curves are plotted from the average test accuracy and mean test time, and a reasonable budget is determined by weighing time cost against the accuracy of the kernel ridge regression model.
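The budget-selection procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`rbf_kernel`, `fit_predict`, `evaluate_budgets`), the Gaussian kernel parameterisation, the default values of `lam` and `sigma`, and the use of RMSE as the accuracy measure are all choices made for the example.

```python
import time
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel; this parameterisation is an assumption.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def fit_predict(X_tr, y_tr, X_te, lam, sigma):
    # Batch kernel ridge regression; the label mean serves as intercept estimate.
    y_mean = y_tr.mean()
    K = rbf_kernel(X_tr, X_tr, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr - y_mean)
    return rbf_kernel(X_te, X_tr, sigma) @ alpha + y_mean

def evaluate_budgets(X_tr, y_tr, X_te, y_te, budgets,
                     repeats=10, lam=1.0, sigma=1.0, seed=0):
    """Return {budget: (mean RMSE, mean fit-and-test time in seconds)}."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in budgets:
        errs, times = [], []
        for _ in range(repeats):
            # Random subsample of size n from the training set
            idx = rng.choice(len(X_tr), size=n, replace=False)
            t0 = time.perf_counter()
            pred = fit_predict(X_tr[idx], y_tr[idx], X_te, lam, sigma)
            times.append(time.perf_counter() - t0)
            errs.append(np.sqrt(np.mean((pred - y_te) ** 2)))
        results[n] = (float(np.mean(errs)), float(np.mean(times)))
    return results
```

Plotting the two averages against the candidate budgets on twin vertical axes gives the kind of curve from which a budget can be chosen.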

In the learning method according to an embodiment of the invention, the specific steps for obtaining the predictor in step (2) are:

Training samples are randomly selected according to the determined budget n to construct the learning sample set, and a ridge regression model is established. The ridge regression model is expressed as:

$$\min_{\beta,\,b}\ \sum_{i=1}^{n} e_i^{2} + \lambda\,\lVert\beta\rVert^{2},\qquad e_i = y_i - \beta^{T}\phi(x_i) - b,\quad i = 1,\dots,n, \tag{1}$$

where β is the coefficient vector of the ridge regression predictor, b is the intercept term, e_i is the error term, λ is the regularization parameter of the model, and φ(·) denotes the feature map, which is determined implicitly by specifying a kernel function;

The intercept term is removed from the model by the following centering method: x_ij is replaced by x_ij − x̄_j, where x̄_j denotes the sample mean of the j-th input variable, and ȳ, the sample mean of the outputs, is used as the estimate of the intercept term b. The coefficient vector β can then be solved for, giving the ridge regression solution:

$$\beta = [\phi^{T}(X)\phi(X) + \lambda I]^{-1}\phi^{T}(X)\,y, \tag{2}$$

where φ(X) = [φ(x₁)^T; …; φ(x_n)^T] and y = [y₁; …; y_n].

The ridge regression solution (2) is equivalently converted into the following inner-product form:

$$\beta = \phi^{T}(X)\,[\phi(X)\phi^{T}(X) + \lambda I]^{-1}y. \tag{3}$$

Introducing the kernel trick then gives the kernel ridge regression predictor:

$$f(x) = k(x, X)\,(K + \lambda I)^{-1}y, \tag{4}$$

where K = φ(X)φ^T(X) is the kernel matrix with entries K_ij = k(x_i, x_j), k(x, X) = [k(x, x₁), k(x, x₂), …, k(x, x_n)], and k(·,·) is the kernel function, specified by the user.
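The relationship between the ridge solution (2), its inner-product form (3), and the kernel predictor (4) can be checked numerically. The sketch below assumes an explicit, finite-dimensional feature map (the identity map, i.e. a linear kernel) so that both forms can be computed directly; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

def center(X, y):
    # Centering step: subtract column means of X and the mean of y.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    return X - x_mean, y - y_mean, x_mean, y_mean

def ridge_primal(Phi, y, lam):
    # Eq. (2): beta = (Phi^T Phi + lam I)^-1 Phi^T y, one row of Phi per sample.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

def ridge_dual(Phi, y, lam):
    # Eq. (3): beta = Phi^T (Phi Phi^T + lam I)^-1 y.
    n = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

def linear_kernel(A, B):
    # Kernel induced by the identity feature map (assumption for this demo).
    return A @ B.T

def krr_predict(x, X, y, lam, kernel):
    # Eq. (4): f(x) = k(x, X) (K + lam I)^-1 y, using kernel values only.
    K = kernel(X, X)
    return kernel(x, X) @ np.linalg.solve(K + lam * np.eye(len(X)), y)
```

With a nonlinear kernel such as the Gaussian radial basis function, only `krr_predict` remains computable, which is the reason for moving from form (2) to form (4).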

In the learning method according to an embodiment of the invention, in step (4), after the true labels of the samples have been collected, they are compared with the prediction output of the predictor to compute the contribution of each sample, |y_i − f(x_i)|, and noise in the data stream is rejected according to the 3σ rule. Preferably, in step (5), the samples with the largest contribution values are added to the fixed-budget learning sample set, and a corresponding number of samples are removed from the budget learning sample set according to the minimum-contribution criterion, so as to maintain the budget.
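A minimal sketch of the noise rejection and sample swap described above is given below. The text does not specify which statistics the 3σ rule is applied to, nor how the contribution of samples already in the budget set is measured; this example assumes that reference statistics `ref_mean` and `ref_std` of past residuals are supplied by the caller, and that the contribution of a budget sample is its own absolute residual. The function name `select_swaps` is hypothetical.

```python
import numpy as np

def select_swaps(res_chunk, res_budget, ref_mean, ref_std, m):
    """Return (chunk indices to add, budget indices to drop), m of each at most.

    res_chunk  : residuals y_i - f(x_i) of the newly labelled chunk
    res_budget : residuals of the samples currently in the budget set
    """
    res_chunk = np.abs(res_chunk)
    res_budget = np.abs(res_budget)
    # 3-sigma rule: discard chunk samples whose contribution lies more than
    # three standard deviations from the reference mean (treated as noise).
    keep = np.flatnonzero(np.abs(res_chunk - ref_mean) <= 3.0 * ref_std)
    m = min(m, keep.size)
    # Largest contributions among the surviving chunk samples are added ...
    add = keep[np.argsort(res_chunk[keep])[::-1][:m]]
    # ... and the same number of smallest-contribution budget samples leave.
    drop = np.argsort(res_budget)[:m]
    return add, drop

add, drop = select_swaps(
    res_chunk=np.array([0.1, 5.0, 0.8, 0.3]),
    res_budget=np.array([0.4, 0.05, 0.9, 0.2]),
    ref_mean=0.5, ref_std=0.5, m=2,
)
# add -> chunk indices 2 and 3 (index 1 is rejected as noise)
# drop -> budget indices 1 and 3
```

Because the numbers added and dropped are always equal, the size of the learning sample set stays at the budget.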

In the learning method according to an embodiment of the invention, in step (6), the specific steps for updating the kernel ridge regression model using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor are:

(1) The samples in the data stream are used to replace samples in the original learning sample set.

(2) The symmetric positive definite matrix (K + λI) that must be inverted in the old model is denoted A, i.e., A = K + λI.

A correction matrix U ∈ R^{n×m} is constructed, expressed as:

and a correction matrix V ∈ R^{n×m}, expressed as:

(3) The constructed correction matrices U ∈ R^{n×m} and V ∈ R^{n×m} are used to correct the symmetric positive definite matrix A, i.e.:

$$A + UV^{T} + VU^{T} \tag{8}$$

(4) The inverse of the symmetric positive definite matrix in (8) is updated using the Sherman-Morrison-Woodbury formula:

$$Q^{-1} - Q^{-1}V\,(I + U^{T}Q^{-1}V)^{-1}U^{T}Q^{-1}, \tag{9}$$

where $Q^{-1} = A^{-1} - A^{-1}U\,(I + V^{T}A^{-1}U)^{-1}V^{T}A^{-1}$;

(5) The right-hand-side vector y is updated according to the learning sample set, giving the updated predictor, i.e., the online predictor.
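The two-stage Sherman-Morrison-Woodbury update in (8) and (9) can be illustrated as follows. The explicit forms of U and V are not available in this text, so the sketch assumes one possible construction: U selects the m replaced rows and columns, and V holds the change in those columns with the overlapping m×m block halved, so that A + UVᵀ + VUᵀ equals the new matrix. It also assumes the intermediate matrix Q = A + UVᵀ is nonsingular. The names `rbf_kernel` and `smw_replace` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel; this parameterisation is an assumption.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def smw_replace(A_inv, D_cols, idx):
    """Update A^-1 after the rows/columns `idx` of symmetric A have changed.

    A_inv  : inverse of the old matrix A = K + lam*I, shape (n, n)
    D_cols : new minus old values of the columns `idx` of A, shape (n, m)
    idx    : distinct indices of the replaced samples, length m
    """
    n, m = A_inv.shape[0], len(idx)
    U = np.zeros((n, m))
    U[idx, np.arange(m)] = 1.0                  # selection matrix (assumed form)
    V = D_cols - 0.5 * U @ D_cols[idx, :]       # halve the overlapping block
    I = np.eye(m)
    # Stage 1: Q = A + U V^T, inverse via the formula below Eq. (9)
    AU = A_inv @ U
    Q_inv = A_inv - AU @ np.linalg.solve(I + V.T @ AU, V.T @ A_inv)
    # Stage 2: (Q + V U^T)^-1 via Eq. (9)
    QV = Q_inv @ V
    return Q_inv - QV @ np.linalg.solve(I + U.T @ QV, U.T @ Q_inv)
```

Each stage only solves an m×m system and multiplies n×n by n×m matrices, so replacing m samples costs on the order of n²m operations instead of the n³ needed to invert the new matrix from scratch.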

The kernel ridge regression online learning method based on a fixed budget proposed by the invention determines the size of the learning sample set (the budget), selects the initial learning sample set, establishes a kernel ridge regression model and solves it to obtain a predictor, and then updates the kernel ridge regression model using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain an online predictor, thereby realizing online prediction on the data stream. Because the method uses a fixed-budget strategy, it effectively controls the size of the online learning model, saves storage space, reduces computational complexity, and is easy to implement. The kernel ridge regression online learning method based on a fixed budget according to an embodiment of the invention flexibly handles online prediction problems with data-stream characteristics, and data can be collected in blocks. Compared with the traditional batch-processing approach and current online learning methods, it considerably reduces computational complexity and model running time, and it can flexibly handle both regression and classification problems. In particular, when handling leave-one-out cross-validation problems, it can reduce the computational complexity from O(n⁴) to O(n³).

Brief description of the drawings

Fig. 1 is a schematic diagram of the kernel ridge regression online learning method based on a fixed budget according to an embodiment of the invention.

Fig. 2 is an analysis chart of the influence of the budget size (Budget Size) on model accuracy and running time on the benchmark dataset Cpusmall in an embodiment of the invention.

Fig. 3 is a schematic diagram of the influence of different data block sizes (Chunk Size) on the mean test time of the learning method of the invention and of existing learning methods on the benchmark dataset Cpusmall.

Fig. 4 is a schematic diagram of the influence of different data block sizes (Chunk Size) on the mean test time of the learning method of the invention and of existing learning methods on the benchmark dataset Casp.

Table 1 compares the average online test accuracy and mean test time of the learning method of the invention and of existing learning methods on six benchmark datasets.

Detailed description of the embodiments

Embodiments of the invention are further described below with reference to the drawings.

Embodiment: A regression problem is taken as an example. As shown in Fig. 1, a kernel ridge regression online learning method based on a fixed budget provided according to an embodiment of the invention comprises the following steps:

Step 1: The budget value is determined by numerical experiment. The specific steps are:

(1) The dataset to be processed is selected; in this embodiment, the benchmark dataset Cpusmall is taken as an example. The Cpusmall dataset contains 8192 samples in total. The sample block size is set to 1; 5000 samples are randomly selected from Cpusmall to construct the training sample set, and the remaining samples form the test set. The Gaussian radial basis function is selected as the kernel function, and the kernel width parameter σ takes its default value, namely the dimension of the samples.

(2) The set of candidate budgets is determined to be {200, 300, …, 4900, 5000}.

(3) Budgets are chosen in turn from the set in step (2); for each budget, the corresponding number of samples is randomly selected from the training sample set to establish a kernel ridge regression model, and the accuracy for that budget is tested on the test set.

(4) Step (3) is executed 10 times, and the average test accuracy and running time are computed for each budget.

(5) Dual-vertical-axis curves are plotted from the mean test time and the average test accuracy, as shown in Fig. 2. Weighing time cost against model accuracy, the budget range is determined to be between 3500 and 4500. In this embodiment, without loss of generality, a budget of 4000 is selected.

Step 2: An initial training set is randomly selected according to the budget and a ridge regression model is established; the model is converted by the centering method into a ridge regression model without intercept and the ridge regression solution is obtained; the ridge regression predictor is then equivalently converted into a kernel ridge regression predictor by the kernel trick. The specific steps are:

The intercept term is removed from the model by the following centering method: x_ij is replaced by x_ij − x̄_j, where x̄_j denotes the sample mean of the j-th input variable, and ȳ, the sample mean of the outputs, is used as the estimate of the intercept term b. The coefficient vector β can then be solved for, giving the ridge regression solution:

$$\beta = [\phi^{T}(X)\phi(X) + \lambda I]^{-1}\phi^{T}(X)\,y, \tag{2}$$

where φ(X) = [φ(x₁)^T; …; φ(x_n)^T] and y = [y₁; …; y_n].

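The ridge regression solution (2) can be equivalently converted into the following inner-product form:

$$\beta = \phi^{T}(X)\,[\phi(X)\phi^{T}(X) + \lambda I]^{-1}y, \tag{3}$$

which, after the kernel trick is introduced, gives the kernel ridge regression predictor:

$$f(x) = k(x, X)\,(K + \lambda I)^{-1}y. \tag{4}$$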
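The equivalence between (2) and (3), and the step from (3) to (4), follow from a standard matrix identity. The short derivation below uses the shorthand Φ = φ(X), which is introduced here only for readability, and assumes λ > 0 so that both inverses exist.

```latex
\begin{aligned}
\Phi^{T}\,(\Phi\Phi^{T} + \lambda I) &= (\Phi^{T}\Phi + \lambda I)\,\Phi^{T} \\
\Longrightarrow\quad (\Phi^{T}\Phi + \lambda I)^{-1}\Phi^{T} &= \Phi^{T}\,(\Phi\Phi^{T} + \lambda I)^{-1} \\
f(x) = \phi(x)^{T}\beta &= \phi(x)^{T}\Phi^{T}\,(\Phi\Phi^{T} + \lambda I)^{-1}y
      = k(x, X)\,(K + \lambda I)^{-1}y
\end{aligned}
```

The last line uses only inner products of feature vectors, φ(x)ᵀφ(x_i) = k(x, x_i) and (ΦΦᵀ)_ij = k(x_i, x_j), so the feature map never has to be evaluated explicitly.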

Step 3: As shown in Fig. 1, the data stream is acquired in mini-batch form, and the predictor is used to predict the samples in the data stream.

Step 4: Noise in the data stream is rejected using the 3σ rule, so as to keep the predictor stable.

Step 5: After the true labels of the samples have been collected, they are compared with the prediction output of the predictor to compute the contribution of each sample, |y_i − f(x_i)|. The samples with the largest contributions are added to the fixed-budget learning sample set, and a corresponding number of learning samples are removed from the budget learning sample set according to the minimum-contribution criterion, so that the budget stays constant.

Step 6: The kernel ridge regression model is updated using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, which performs online prediction on the data stream.
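The flow of Steps 3 to 6 can be sketched end to end as follows. This is an illustrative loop under several assumptions rather than the method as claimed: the model is refitted with a direct linear solve in each round in place of the Sherman-Morrison-Woodbury update, the 3σ rule is applied against a running history of past prediction residuals once at least ten are available, and the contribution of a budget sample is taken to be its own absolute residual. All names and default parameter values are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel; this parameterisation is an assumption.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def online_krr(X0, y0, chunks, lam=0.1, sigma=1.0, m=1):
    """X0, y0: initial budget set; chunks: iterable of (X_chunk, y_chunk)."""
    X, y = X0.copy(), y0.copy()
    n = len(X)                      # the budget, fixed for the whole run
    history, predictions = [], []
    for Xc, yc in chunks:
        K = rbf_kernel(X, X, sigma)
        # Direct solve stands in for the SMW inverse update (simplification).
        alpha = np.linalg.solve(K + lam * np.eye(n), y)
        # Step 3: predict the chunk before its labels are used.
        pred = rbf_kernel(Xc, X, sigma) @ alpha
        predictions.append(pred)
        # Step 4: 3-sigma rule against past prediction residuals (assumption).
        r_chunk = np.abs(yc - pred)
        if len(history) >= 10:
            mu, sd = np.mean(history), np.std(history)
            keep = np.flatnonzero(np.abs(r_chunk - mu) <= 3.0 * sd)
        else:
            keep = np.arange(len(r_chunk))
        history.extend(r_chunk[keep])
        # Step 5: swap in the largest-contribution chunk samples and swap out
        # the smallest-contribution budget samples, keeping the size at n.
        r_budget = np.abs(y - K @ alpha)
        k = min(m, keep.size)
        add = keep[np.argsort(r_chunk[keep])[::-1][:k]]
        drop = np.argsort(r_budget)[:k]
        X[drop], y[drop] = Xc[add], yc[add]
    return X, y, predictions
```

In a version closer to the description, the solve inside the loop would be replaced by maintaining the inverse of K + λI and updating it with the Sherman-Morrison-Woodbury formula after each swap.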

Fig. 3 compares, for different chunk sizes (Chunk Size), the mean test time on the benchmark datasets Casp and Cpusmall of the online learning method of the invention, the sliding-window kernel ridge regression method, and the LS-SVMs online learning method based on a budget support vector set. As can be seen from Fig. 3, the test time of the online learning method of the invention is better than that of the other two methods at every chunk size.

Table 1 lists the average online test accuracy and mean test time of the online learning method of the invention, the existing incremental kernel ridge regression method, the sliding-window kernel ridge regression method, and the LS-SVMs online learning method based on a budget support vector set on the benchmark datasets Abalonescale, Kin, Letters, Pendigits, Cpusmall, and Poker. As can be seen from Table 1, while maintaining test accuracy, the online learning method of the invention consistently achieves a better test time than the other methods.

Table 1

The above embodiment is intended to explain the invention, not to limit it. Any modification or change made to the invention within its spirit and the scope of protection of the claims falls within the scope of protection of the invention.

Claims

1. A kernel ridge regression online learning method based on a fixed budget, characterized by comprising the following steps:

(1) The budget value is determined by numerical experiment;

(2) Initial learning samples are randomly selected according to the budget to construct the initial learning sample set, and a ridge regression model is established; the ridge regression model is converted by a centering method into a ridge regression model without intercept and the ridge regression solution is obtained; the kernel trick is introduced to convert the ridge regression predictor equivalently into a kernel ridge regression predictor;

(3) The data stream is acquired in mini-batch or one-by-one form, and the predictor is used to predict the samples in the data stream;

(4) Noise in the data stream is rejected using the 3σ rule, so as to keep the predictor stable;

(5) Some samples are added to the learning sample set according to their contribution values, and a corresponding number of samples are removed according to the minimum-contribution criterion, so that the budget stays constant;

(6) The kernel ridge regression model is updated using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, which performs online prediction on the data stream.

2. The kernel ridge regression online learning method based on a fixed budget according to claim 1, characterized in that, in step (1), the specific steps for determining the budget value are:

(1) A training sample set and a test sample set are determined.

(2) The candidate budget values are chosen in turn; for each candidate budget value, the corresponding number of samples is randomly selected from the training sample set, a kernel ridge regression model is established, and the accuracy for that budget value is tested on the test sample set.

(3) Step (2) is executed 10 times, and the average test accuracy and mean test time are computed for each budget.

(4) Dual-vertical-axis curves are plotted from the average test accuracy and mean test time, and a reasonable budget is determined by weighing time cost against the accuracy of the kernel ridge regression model.

3. The kernel ridge regression online learning method based on a fixed budget according to claim 1, characterized in that, in step (2), the specific steps for obtaining the predictor are:

Training samples are randomly selected according to the determined budget n to construct the learning sample set, and a ridge regression model is established. The ridge regression model is expressed as:

$$\min_{\beta,\,b}\ \sum_{i=1}^{n} e_i^{2} + \lambda\,\lVert\beta\rVert^{2},\qquad e_i = y_i - \beta^{T}\phi(x_i) - b,\quad i = 1,\dots,n, \tag{1}$$

where β is the coefficient vector of the ridge regression predictor, b is the intercept term, e_i is the error term, λ is the regularization parameter of the model, and φ(·) denotes the feature map, which is determined implicitly by specifying a kernel function;

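The intercept term is removed from the model by the following centering method: x_ij is replaced by x_ij − x̄_j, where x̄_j denotes the sample mean of the j-th input variable, and ȳ, the sample mean of the outputs, is used as the estimate of the intercept term b. The coefficient vector β can then be solved for, giving the ridge regression solution:

$$\beta = [\phi^{T}(X)\phi(X) + \lambda I]^{-1}\phi^{T}(X)\,y, \tag{2}$$

where φ(X) = [φ(x₁)^T; …; φ(x_n)^T] and y = [y₁; …; y_n].

The ridge regression solution (2) is equivalently converted into the following inner-product form:

$$\beta = \phi^{T}(X)\,[\phi(X)\phi^{T}(X) + \lambda I]^{-1}y, \tag{3}$$

and introducing the kernel trick gives the kernel ridge regression predictor:

$$f(x) = k(x, X)\,(K + \lambda I)^{-1}y. \tag{4}$$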

4. The kernel ridge regression online learning method based on a fixed budget according to claim 1, characterized in that, in step (6), the specific steps for updating the kernel ridge regression model using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor are:

A correction matrix U ∈ R^{n×m} is constructed, expressed as:

and a correction matrix V ∈ R^{n×m}, expressed as:

The symmetric positive definite matrix A is corrected using U and V, i.e.:

$$A + UV^{T} + VU^{T} \tag{8}$$

and its inverse is updated using the Sherman-Morrison-Woodbury formula:

$$Q^{-1} - Q^{-1}V\,(I + U^{T}Q^{-1}V)^{-1}U^{T}Q^{-1}, \tag{9}$$

where $Q^{-1} = A^{-1} - A^{-1}U\,(I + V^{T}A^{-1}U)^{-1}V^{T}A^{-1}$;