CN108875962A - Kernel ridge regression online learning method based on fixed budget

Info

Publication number
CN108875962A
Authority
CN
China
Prior art keywords
ridge regression
budget
sample
model
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810593893.3A
Other languages
Chinese (zh)
Inventor
宋允全
高富豪
梁锡军
渐令
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-06-11
Filing date
2018-06-11
Publication date
2018-11-23
Application filed by China University of Petroleum East China
Priority to CN201810593893.3A
Publication of CN108875962A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a kernel ridge regression online learning method based on a fixed budget. The budget value is first determined by numerical experiment and an initial learning sample set is constructed; a kernel ridge regression model is established and solved to obtain a predictor; the model is then updated using a low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain an online predictor, which performs online prediction on a data stream. Because the method uses a fixed-budget strategy, it effectively controls the size of the online learning model, saves storage space, reduces computational complexity, and is easy to implement. The method flexibly handles online prediction problems with data-stream characteristics, and data can be collected in blocks. Compared with traditional batch processing and current online learning methods, it considerably reduces computational complexity and model running time, and it can efficiently handle both regression and classification problems.

Description

Kernel ridge regression online learning method based on fixed budget
Technical field
The invention belongs to the field of data mining and machine learning and concerns methods of data mining and data processing; specifically, it relates to a kernel ridge regression online learning method based on a fixed budget.
Background
Ridge regression gives up the unbiasedness of the regression coefficient estimates: at the cost of losing some information and some precision, it obtains more robust coefficient estimates, and it fits ill-conditioned data better than least squares. Kernel ridge regression, which incorporates the kernel trick, can handle nonlinear problems effectively and has therefore found wider application. Traditional kernel ridge regression models are solved with batch algorithms whose computational complexity is $O(n^3)$, where n is the number of samples. However, the data in more and more practical problems arrive as a stream, for example in dynamic optimization of industrial processes and real-time sensor monitoring, where sampled data are acquired continuously over time. Because of their high computational complexity, batch algorithms are unsuitable for such data-stream problems. Researchers in China and abroad have therefore begun to study online learning algorithms for kernel ridge regression in order to reduce computational complexity and model running time; a representative result is the incremental kernel ridge regression online learning method proposed by B. W. Chen. That method updates the kernel ridge regression model iteratively with the Sherman-Morrison-Woodbury formula, reducing the computational complexity of each model update from $O(n^3)$ to $O(n^2)$. However, because the number of samples grows linearly over time, the size, storage requirement, and running time of the kernel ridge regression model grow continually as well. To solve this problem, there is an urgent need for a kernel ridge regression online learning method based on a fixed-budget learning sample set, which effectively controls the model's storage space and learning time while preserving model accuracy, so as to suit data-stream environments.
Summary of the invention
The object of the invention is to address the shortcomings of existing kernel ridge regression online learning methods, such as their inability to control model size effectively, by proposing a kernel ridge regression online learning method based on a fixed budget. The method reduces model storage space and running time and meets the real-time requirements of applications.
According to an embodiment of the invention, a kernel ridge regression online learning method based on a fixed-budget learning sample set is proposed, comprising the following steps:
(1) determine the budget value by numerical experiment;
(2) randomly select initial learning samples according to the budget to construct the initial learning sample set, and establish a ridge regression model; convert it by centering into a ridge regression model without an intercept and obtain the ridge regression solution; introduce the kernel trick to convert the ridge regression predictor equivalently into a kernel ridge regression predictor;
(3) acquire the data stream in mini-batch or one-by-one form, and use the predictor to predict the samples in the data stream;
(4) reject noise in the data stream using the 3σ rule, to keep the predictor stable;
(5) add some samples to the learning sample set according to their contribution values, and remove the same number of samples according to the minimum-contribution criterion, keeping the budget constant;
(6) update the kernel ridge regression model using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, and use it to perform online prediction on the data stream.
In the learning method according to an embodiment of the invention, the budget is determined in step (1) as follows:
(1) Determine the training sample set and the test sample set.
(2) Take each candidate budget value in turn, randomly select that number of samples from the training sample set, establish a kernel ridge regression model, and test the accuracy obtained with that budget value on the test sample set.
(3) Execute step (2) 10 times, and compute the average test accuracy and average test time for each budget.
(4) Plot the average test accuracy and average test time as a dual-vertical-axis curve, and determine a reasonable budget by weighing time cost against kernel ridge regression model accuracy.
In the learning method according to an embodiment of the invention, the predictor is obtained in step (2) as follows:
Randomly select training samples according to the determined budget n to construct the learning sample set, and establish the ridge regression model, expressed as:
$$\min_{\beta,\,b}\ \sum_{i=1}^{n} e_i^2 + \lambda\|\beta\|^2 \quad \text{s.t.}\quad y_i = \varphi(x_i)^T\beta + b + e_i,\quad i=1,\dots,n, \qquad (1)$$
where β is the coefficient vector of the ridge regression predictor, b is the intercept, $e_i$ is the error term, λ is the regularization parameter of the model, and $\varphi(\cdot)$ denotes the feature map, which is determined implicitly by specifying a kernel function.
The intercept is removed from the model by centering: replace $x_{ij}$ with $x_{ij}-\bar{x}_j$, where $\bar{x}_j$ denotes the sample mean of the j-th input variable, and use $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$ as the estimate of the intercept b. The coefficient vector β can then be solved for, giving the ridge regression solution:
$$\beta = [\varphi^T(X)\varphi(X)+\lambda I]^{-1}\varphi^T(X)\,y, \qquad (2)$$
where $\varphi(X) = [\varphi(x_1)^T;\ldots;\varphi(x_n)^T]$ and $y=[y_1;\ldots;y_n]$.
The ridge regression solution (2) is equivalently converted into the following inner-product form:
$$\beta = \varphi^T(X)[\varphi(X)\varphi^T(X)+\lambda I]^{-1}y. \qquad (3)$$
Introducing the kernel trick then gives the kernel ridge regression predictor:
$$f(x) = k(x,X)(K+\lambda I)^{-1}y, \qquad (4)$$
where $K = \varphi(X)\varphi^T(X)$ with $K_{ij}=k(x_i,x_j)$, $k(x,X) = [k(x,x_1),k(x,x_2),\ldots,k(x,x_n)]$, and $k(\cdot,\cdot)$ is the kernel function, specified by the user.
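The equivalence of forms (2) and (3) follows from the standard push-through identity. A short derivation is sketched below, writing $\Phi=\varphi(X)$ and using $I_d$, $I_n$ for identity matrices of the feature and sample dimensions; this shorthand is introduced here for illustration and does not appear in the original text. Both bracketed matrices are invertible whenever λ > 0.

```latex
\begin{aligned}
\Phi^{T}\left(\Phi\Phi^{T}+\lambda I_{n}\right)
  &= \left(\Phi^{T}\Phi+\lambda I_{d}\right)\Phi^{T} \\
\Longrightarrow\quad
\left(\Phi^{T}\Phi+\lambda I_{d}\right)^{-1}\Phi^{T}
  &= \Phi^{T}\left(\Phi\Phi^{T}+\lambda I_{n}\right)^{-1} \\
\Longrightarrow\quad
f(x)=\varphi(x)^{T}\beta
  &= \varphi(x)^{T}\Phi^{T}\left(\Phi\Phi^{T}+\lambda I_{n}\right)^{-1}y
   = k(x,X)\,(K+\lambda I)^{-1}y .
\end{aligned}
```

The last line uses only inner products of feature vectors, which is why the feature map never has to be evaluated explicitly.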
In the learning method according to an embodiment of the invention, in step (4), after the true label of a sample has been collected, it is compared with the predictor's output to compute the sample's contribution $|y_i - f(x_i)|$, and noise in the data stream is rejected according to the 3σ rule. Preferably, in step (5), the samples with the largest contribution values are added to the fixed-budget learning sample set, and the same number of samples are removed from it according to the minimum-contribution criterion so that the budget is maintained.
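The noise-rejection step can be illustrated with a minimal sketch. The text does not say which quantity the 3σ rule is applied to; the sketch assumes it is applied to the signed residuals within one data block, and the function name `three_sigma_mask` is hypothetical.

```python
import numpy as np

def three_sigma_mask(residuals):
    """Return True for samples kept, False for samples flagged as noise.

    Assumption: a sample is noise when its residual lies more than three
    standard deviations from the mean residual of the current block.
    """
    r = np.asarray(residuals, dtype=float)
    mu, sd = r.mean(), r.std()
    if sd == 0.0:
        return np.ones(r.shape, dtype=bool)
    return np.abs(r - mu) <= 3.0 * sd

# Synthetic block: predictions are close to the labels except for one outlier.
rng = np.random.default_rng(1)
y_true = rng.standard_normal(100)
y_pred = y_true + 0.1 * rng.standard_normal(100)
y_true[7] += 25.0                      # inject one gross label error
residuals = y_true - y_pred
keep = three_sigma_mask(residuals)
contribution = np.abs(residuals)       # |y_i - f(x_i)| as defined in the text
```

Only the samples with `keep == True` would then compete for a place in the learning sample set on the basis of `contribution`.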
In the learning method according to an embodiment of the invention, in step (6), the kernel ridge regression model is updated using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, as follows:
(1) Replace samples in the original learning sample set with samples from the data stream.
(2) Denote the symmetric positive definite matrix (K + λI) that must be inverted in the old model by A, i.e.,
$$A = K+\lambda I. \qquad (5)$$
Construct the correction matrix $U \in R^{n\times m}$ [equation (6), image not available] and the correction matrix $V \in R^{n\times m}$ [equation (7), image not available].
(3) Use the constructed correction matrices $U \in R^{n\times m}$ and $V \in R^{n\times m}$ to correct the symmetric positive definite matrix A, i.e.:
$$A + UV^{T} + VU^{T} \qquad (8)$$
(4) Update the inverse of the symmetric positive definite matrix in (8) using the Sherman-Morrison-Woodbury formula:
$$Q^{-1}-Q^{-1}V\left(I+U^{T}Q^{-1}V\right)^{-1}U^{T}Q^{-1}, \qquad (9)$$
where $Q^{-1}=A^{-1}-A^{-1}U\left(I+V^{T}A^{-1}U\right)^{-1}V^{T}A^{-1}$.
(5) Update the right-hand-side vector y according to the learning sample set to obtain the updated predictor, i.e. the online predictor.
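The two-stage inverse update in (9) can be checked numerically. The sketch below is an illustration under stated assumptions, not the patent's own construction: the definitions of U and V are not available in this text, so U is assumed to be a column selector for the replaced positions and V the change in the affected columns of A with the overlapping block halved. It also assumes that m is the number of replaced samples and that the Gaussian kernel has the form exp(-||a-b||^2 / (2σ^2)). All function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix (assumed form exp(-||a-b||^2 / (2 sigma^2)))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def woodbury(A_inv, U, V):
    """Inverse of (A + U V^T) given A^{-1}, via Sherman-Morrison-Woodbury."""
    m = U.shape[1]
    AiU = A_inv @ U
    core = np.linalg.solve(np.eye(m) + V.T @ AiU, V.T @ A_inv)
    return A_inv - AiU @ core

def replace_samples(A_inv, A, X, idx, X_in, lam, sigma=1.0):
    """Swap rows idx of X for X_in and update A^{-1} with two rank-m corrections."""
    n, m = len(X), len(idx)
    X_new = X.copy()
    X_new[idx] = X_in
    # Columns of the new matrix A at the replaced positions.
    cols = rbf_kernel(X_new, X_new[idx], sigma)
    cols[idx, np.arange(m)] += lam
    D = cols - A[:, idx]                 # change in the affected columns
    U = np.eye(n)[:, idx]                # selector matrix (assumed form)
    V = D - 0.5 * U @ D[idx, :]          # halve the overlap block (assumed form)
    Q_inv = woodbury(A_inv, U, V)        # inverse of Q = A + U V^T
    A_new_inv = woodbury(Q_inv, V, U)    # inverse of Q + V U^T, as in (9)
    return X_new, A + U @ V.T + V @ U.T, A_new_inv

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
lam = 0.1
A = rbf_kernel(X, X) + lam * np.eye(30)
A_inv = np.linalg.inv(A)
idx = np.array([2, 11, 17])
X_in = rng.standard_normal((3, 3))
X_new, A_new, A_new_inv = replace_samples(A_inv, A, X, idx, X_in, lam)
A_direct = rbf_kernel(X_new, X_new) + lam * np.eye(30)
```

With this choice each update solves only m-by-m systems and performs n-by-n times n-by-m products, so its cost grows as O(n²m) rather than the O(n³) of a fresh inversion.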
The kernel ridge regression online learning method based on a fixed budget proposed by the invention determines the size of the learning sample set (the budget), selects an initial learning sample set, establishes a kernel ridge regression model and solves it to obtain a predictor, and updates the model with the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain an online predictor that performs online prediction on the data stream. The method uses a fixed-budget strategy, which effectively controls the size of the online learning model, saves storage space, reduces computational complexity, and is easy to implement. The method according to the embodiments of the invention can flexibly handle online prediction problems with data-stream characteristics, and data can be collected in blocks. Compared with traditional batch processing and current online learning methods, it considerably reduces computational complexity and model running time, and it can flexibly handle regression and classification problems. In particular, when handling leave-one-out cross-validation it reduces the computational complexity from $O(n^4)$ to $O(n^3)$.
Description of the drawings
Fig. 1 is a schematic diagram of the kernel ridge regression online learning method based on a fixed budget according to an embodiment of the invention.
Fig. 2 shows the influence of the budget size on model accuracy and running time on the benchmark dataset Cpusmall in an embodiment of the invention.
Fig. 3 shows the influence of different data block sizes (chunk sizes) on the average test time of the learning method of the invention and of existing learning methods on the benchmark dataset Cpusmall.
Fig. 4 shows the influence of different data block sizes (chunk sizes) on the average test time of the learning method of the invention and of existing learning methods on the benchmark dataset Casp.
Table 1 compares the average online test accuracy and average test time of the learning method of the invention and of existing learning methods on six benchmark datasets.
Detailed description
Embodiments of the invention are further described below with reference to the drawings.
Embodiment: a regression problem is taken as the example. As shown in Fig. 1, the kernel ridge regression online learning method based on a fixed budget provided according to an embodiment of the invention comprises the following steps:
Step 1: Determine the budget value by numerical experiment. The specific steps are:
(1) Select the dataset to be processed; in this embodiment the benchmark dataset Cpusmall is used. Cpusmall contains 8192 samples in total. The sample block size is set to 1, and 5000 samples are randomly selected from Cpusmall to construct the training sample set; the remaining samples form the test set. The Gaussian radial basis function is selected as the kernel function, and the kernel width parameter σ takes its default value, namely the dimension of the samples.
(2) Set the candidate budgets to {200, 300, ..., 4900, 5000}.
(3) Take each budget in turn from the set in step (2), randomly select that number of samples from the training sample set to establish a kernel ridge regression model, and test the accuracy of that budget on the test set.
(4) Execute step (3) 10 times, and compute the average test accuracy and running time for each budget.
(5) Plot the average test time and average test accuracy as a dual-vertical-axis curve, as shown in Fig. 2; weighing time cost against model accuracy, a suitable budget lies between 3500 and 4500. In this embodiment, without loss of generality, the budget is set to 4000.
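The budget experiment of Step 1 can be sketched as a small sweep. This is an illustration on synthetic data rather than on Cpusmall, which is not bundled here. The accuracy measure is not specified in the text, so root-mean-square error is assumed; the Gaussian kernel form, the parameter values, and the name `sweep_budgets` are likewise assumptions, and centering is omitted for brevity.

```python
import time
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix (assumed form exp(-||a-b||^2 / (2 sigma^2)))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def sweep_budgets(X_tr, y_tr, X_te, y_te, budgets,
                  lam=1e-2, sigma=1.0, repeats=10, seed=0):
    """For each budget, average test RMSE and run time over random subsets."""
    rng = np.random.default_rng(seed)
    results = []
    for n in budgets:
        errs, secs = [], []
        for _ in range(repeats):
            pick = rng.choice(len(X_tr), size=n, replace=False)
            t0 = time.perf_counter()
            K = rbf_kernel(X_tr[pick], X_tr[pick], sigma)
            alpha = np.linalg.solve(K + lam * np.eye(n), y_tr[pick])
            pred = rbf_kernel(X_te, X_tr[pick], sigma) @ alpha
            secs.append(time.perf_counter() - t0)
            errs.append(np.sqrt(np.mean((pred - y_te) ** 2)))
        results.append((n, float(np.mean(errs)), float(np.mean(secs))))
    return results

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.standard_normal(600)
budgets = [20, 50, 100, 200]
results = sweep_budgets(X[:400], y[:400], X[400:], y[400:], budgets)
```

Plotting the second and third entries of each tuple against the budget on two vertical axes gives the kind of trade-off curve from which a budget would be read off.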
Step 2: Randomly select the initial training set according to the budget and establish a ridge regression model; convert it by centering into a ridge regression model without an intercept and obtain the ridge regression solution; use the kernel trick to convert the ridge regression predictor equivalently into a kernel ridge regression predictor. The specific steps are:
Randomly select training samples according to the determined budget n to construct the learning sample set, and establish the ridge regression model, expressed as:
$$\min_{\beta,\,b}\ \sum_{i=1}^{n} e_i^2 + \lambda\|\beta\|^2 \quad \text{s.t.}\quad y_i = \varphi(x_i)^T\beta + b + e_i,\quad i=1,\dots,n, \qquad (1)$$
where β is the coefficient vector of the ridge regression predictor, b is the intercept, $e_i$ is the error term, λ is the regularization parameter of the model, and $\varphi(\cdot)$ denotes the feature map, which is determined implicitly by specifying a kernel function.
The intercept is removed from the model by centering: replace $x_{ij}$ with $x_{ij}-\bar{x}_j$, where $\bar{x}_j$ denotes the sample mean of the j-th input variable, and use $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$ as the estimate of the intercept b. The coefficient vector β can then be solved for, giving the ridge regression solution:
$$\beta = [\varphi^T(X)\varphi(X)+\lambda I]^{-1}\varphi^T(X)\,y, \qquad (2)$$
where $\varphi(X) = [\varphi(x_1)^T;\ldots;\varphi(x_n)^T]$ and $y=[y_1;\ldots;y_n]$.
The ridge regression solution (2) can be equivalently converted into the following inner-product form:
$$\beta = \varphi^T(X)[\varphi(X)\varphi^T(X)+\lambda I]^{-1}y. \qquad (3)$$
Introducing the kernel trick then gives the kernel ridge regression predictor:
$$f(x) = k(x,X)(K+\lambda I)^{-1}y, \qquad (4)$$
where $K = \varphi(X)\varphi^T(X)$ with $K_{ij}=k(x_i,x_j)$, $k(x,X) = [k(x,x_1),k(x,x_2),\ldots,k(x,x_n)]$, and $k(\cdot,\cdot)$ is the kernel function, specified by the user.
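A minimal NumPy sketch of predictor (4) with the centering step is given below. It assumes a Gaussian kernel of the form exp(-||a-b||^2 / (2σ^2)) and illustrative values for λ and σ; the function names `fit_krr` and `predict_krr` are hypothetical and the data are synthetic.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix (assumed form exp(-||a-b||^2 / (2 sigma^2)))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_krr(X, y, lam, sigma):
    """Center the data, then solve (K + lam*I) alpha = y - mean(y)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    K = rbf_kernel(Xc, Xc, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(Xc)), y - y_mean)
    return {"Xc": Xc, "alpha": alpha, "x_mean": x_mean,
            "y_mean": y_mean, "sigma": sigma}

def predict_krr(model, X_new):
    """f(x) = k(x, X) (K + lam*I)^{-1} y, plus the intercept estimate."""
    k = rbf_kernel(X_new - model["x_mean"], model["Xc"], model["sigma"])
    return k @ model["alpha"] + model["y_mean"]

# Synthetic regression problem: noisy samples of sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)
model = fit_krr(X, y, lam=1e-2, sigma=1.0)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
y_hat = predict_krr(model, X_test)
max_err = np.max(np.abs(y_hat - np.sin(X_test[:, 0])))
```

The linear system is solved directly rather than by forming the inverse; an online variant would instead keep the inverse of (K + λI) so that it can be corrected incrementally.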
Step 3: As shown in Fig. 1, acquire the data stream in mini-batch form, and use the predictor to predict the samples in the data stream.
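The acquisition order in Step 3 (predict first, receive labels afterwards) might be organized as in the sketch below. The helper names `iter_chunks` and `prequential_errors` are hypothetical, and the predictor is a stand-in function rather than a trained model.

```python
import numpy as np

def iter_chunks(X, y, chunk_size):
    """Yield consecutive (X_chunk, y_chunk) blocks; chunk_size=1 is one-by-one."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

def prequential_errors(predict, X, y, chunk_size):
    """Predict each chunk before its labels are used (test-then-train order)."""
    errors = []
    for X_chunk, y_chunk in iter_chunks(X, y, chunk_size):
        y_hat = predict(X_chunk)                 # labels not yet visible
        errors.append(np.abs(y_chunk - y_hat))   # contribution |y_i - f(x_i)|
        # the model update of Steps 4 to 6 would go here
    return np.concatenate(errors)

X = np.arange(10, dtype=float).reshape(10, 1)
y = 2.0 * X[:, 0]
errs = prequential_errors(lambda Z: 2.0 * Z[:, 0] + 1.0, X, y, chunk_size=4)
sizes = [len(c[0]) for c in iter_chunks(X, y, 4)]
```

The last chunk may be shorter than `chunk_size`, so downstream steps should not assume a fixed block length.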
Step 4: Reject noise in the data stream using the 3σ rule, to keep the predictor stable.
Step 5: After the true labels of the samples have been collected, compare them with the predictor's output to compute each sample's contribution $|y_i - f(x_i)|$; add the maximum-contribution samples to the fixed-budget learning sample set, and remove the same number of learning samples from it according to the minimum-contribution criterion so that the budget stays constant.
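Keeping the learning sample set at a fixed size in Step 5 amounts to overwriting slots rather than appending. The sketch below assumes that the minimum-contribution criterion means dropping the learning-set samples with the smallest |y_i - f(x_i)|, which the text does not spell out; the function names are hypothetical.

```python
import numpy as np

def swap_indices(train_contrib, chunk_contrib, m):
    """Pick m chunk samples to add and m learning-set samples to drop.

    Assumption: 'minimum contribution' means the smallest |y_i - f(x_i)|
    among the samples currently in the learning set.
    """
    m = min(m, len(chunk_contrib), len(train_contrib))
    add = np.argsort(chunk_contrib)[::-1][:m]   # largest contributions in chunk
    drop = np.argsort(train_contrib)[:m]        # smallest contributions in set
    return add, drop

def maintain_budget(X_set, y_set, X_chunk, y_chunk,
                    train_contrib, chunk_contrib, m=1):
    """Overwrite the dropped slots so the set size never changes."""
    add, drop = swap_indices(train_contrib, chunk_contrib, m)
    X_set, y_set = X_set.copy(), y_set.copy()
    X_set[drop], y_set[drop] = X_chunk[add], y_chunk[add]
    return X_set, y_set, drop

X_set = np.arange(10, dtype=float).reshape(5, 2)
y_set = np.arange(5, dtype=float)
X_chunk = -np.ones((3, 2)) * np.array([[1.0], [2.0], [3.0]])
y_chunk = np.array([10.0, 20.0, 30.0])
train_contrib = np.array([0.5, 0.1, 0.9, 0.3, 0.7])
chunk_contrib = np.array([0.2, 1.5, 0.4])
X_new, y_new, replaced = maintain_budget(X_set, y_set, X_chunk, y_chunk,
                                         train_contrib, chunk_contrib, m=1)
```

Returning the replaced positions is convenient because a low-rank update of the kernel matrix needs to know which rows and columns changed.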
Step 6: Update the kernel ridge regression model using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, and use it to perform online prediction on the data stream.
Figs. 3 and 4 compare, for different chunk sizes, the average test time of the online learning method of the invention with that of the sliding-window kernel ridge regression method and the LS-SVMs online learning method based on a budget support vector set, on the benchmark datasets Casp and Cpusmall. As the figures show, the test time of the online learning method of the invention is better than that of the other two methods at every chunk size.
Table 1 lists the average online test accuracy and average test time of the online learning method of the invention and of the existing incremental kernel ridge regression method, the sliding-window kernel ridge regression method, and the LS-SVMs online learning method based on a budget support vector set, on the benchmark datasets Abalonescale, Kin, Letters, Pendigits, Cpusmall and Poker. As Table 1 shows, with test accuracy maintained, the test time of the online learning method of the invention is consistently better than that of the other methods.
Table 1
The above embodiment is intended to explain the invention, not to limit it; any modification or change made to the invention within its spirit and the scope of protection of the claims falls within the scope of protection of the invention.

Claims (4)

1. A kernel ridge regression online learning method based on a fixed budget, characterized in that it comprises the following steps:
(1) determine the budget value by numerical experiment;
(2) randomly select initial learning samples according to the budget to construct the initial learning sample set, and establish a ridge regression model; convert it by centering into a ridge regression model without an intercept and obtain the ridge regression solution; introduce the kernel trick to convert the ridge regression predictor equivalently into a kernel ridge regression predictor;
(3) acquire the data stream in mini-batch or one-by-one form, and use the predictor to predict the samples in the data stream;
(4) reject noise in the data stream using the 3σ rule, to keep the predictor stable;
(5) add some samples to the learning sample set according to their contribution values, and remove the same number of samples according to the minimum-contribution criterion, keeping the budget constant;
(6) update the kernel ridge regression model using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, and use it to perform online prediction on the data stream.
2. The kernel ridge regression online learning method based on a fixed budget according to claim 1, characterized in that in step (1) the budget value is determined as follows:
(1) Determine the training sample set and the test sample set.
(2) Take each candidate budget value in turn, randomly select that number of samples from the training sample set, establish a kernel ridge regression model, and test the accuracy obtained with that budget value on the test sample set.
(3) Execute step (2) 10 times, and compute the average test accuracy and average test time for each budget.
(4) Plot the average test accuracy and average test time as a dual-vertical-axis curve, and determine a reasonable budget by weighing time cost against kernel ridge regression model accuracy.
3. The kernel ridge regression online learning method based on a fixed budget according to claim 1, characterized in that in step (2) the predictor is obtained as follows:
Randomly select training samples according to the determined budget n to construct the learning sample set, and establish the ridge regression model, expressed as:
$$\min_{\beta,\,b}\ \sum_{i=1}^{n} e_i^2 + \lambda\|\beta\|^2 \quad \text{s.t.}\quad y_i = \varphi(x_i)^T\beta + b + e_i,\quad i=1,\dots,n, \qquad (1)$$
where β is the coefficient vector of the ridge regression predictor, b is the intercept, $e_i$ is the error term, λ is the regularization parameter of the model, and $\varphi(\cdot)$ denotes the feature map, which is determined implicitly by specifying a kernel function.
The intercept is removed from the model by centering: replace $x_{ij}$ with $x_{ij}-\bar{x}_j$, where $\bar{x}_j$ denotes the sample mean of the j-th input variable, and use $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$ as the estimate of the intercept b. The coefficient vector β can then be solved for, giving the ridge regression solution:
$$\beta = [\varphi^T(X)\varphi(X)+\lambda I]^{-1}\varphi^T(X)\,y, \qquad (2)$$
where $\varphi(X) = [\varphi(x_1)^T;\ldots;\varphi(x_n)^T]$ and $y=[y_1;\ldots;y_n]$.
The ridge regression solution (2) is equivalently converted into the following inner-product form:
$$\beta = \varphi^T(X)[\varphi(X)\varphi^T(X)+\lambda I]^{-1}y. \qquad (3)$$
Introducing the kernel trick then gives the kernel ridge regression predictor:
$$f(x) = k(x,X)(K+\lambda I)^{-1}y, \qquad (4)$$
where $K = \varphi(X)\varphi^T(X)$ with $K_{ij}=k(x_i,x_j)$, $k(x,X) = [k(x,x_1),k(x,x_2),\ldots,k(x,x_n)]$, and $k(\cdot,\cdot)$ is the kernel function, specified by the user.
4. The kernel ridge regression online learning method based on a fixed budget according to claim 1, characterized in that in step (6) the kernel ridge regression model is updated using the low-rank matrix correction technique and the Sherman-Morrison-Woodbury formula to obtain the online predictor, as follows:
(1) Replace samples in the original learning sample set with samples from the data stream.
(2) Denote the symmetric positive definite matrix (K + λI) that must be inverted in the old model by A, i.e.,
$$A = K+\lambda I. \qquad (5)$$
Construct the correction matrix $U \in R^{n\times m}$ [equation (6), image not available] and the correction matrix $V \in R^{n\times m}$ [equation (7), image not available].
(3) Use the constructed correction matrices $U \in R^{n\times m}$ and $V \in R^{n\times m}$ to correct the symmetric positive definite matrix A, i.e.:
$$A + UV^{T} + VU^{T} \qquad (8)$$
(4) Update the inverse of the symmetric positive definite matrix in (8) using the Sherman-Morrison-Woodbury formula:
$$Q^{-1}-Q^{-1}V\left(I+U^{T}Q^{-1}V\right)^{-1}U^{T}Q^{-1}, \qquad (9)$$
where $Q^{-1}=A^{-1}-A^{-1}U\left(I+V^{T}A^{-1}U\right)^{-1}V^{T}A^{-1}$.
(5) Update the right-hand-side vector y according to the learning sample set to obtain the updated predictor, i.e. the online predictor.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810593893.3A CN108875962A (en) 2018-06-11 2018-06-11 Kernel ridge regression online learning method based on fixed budget

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810593893.3A CN108875962A (en) 2018-06-11 2018-06-11 Kernel ridge regression online learning method based on fixed budget

Publications (1)

Publication Number Publication Date
CN108875962A true CN108875962A (en) 2018-11-23

Family

ID=64337912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810593893.3A Pending CN108875962A (en) 2018-06-11 2018-06-11 Kernel ridge regression online learning method based on fixed budget

Country Status (1)

Country Link
CN (1) CN108875962A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111815681A (en) * 2020-09-04 2020-10-23 中国科学院自动化研究所 Target tracking method based on deep learning and discriminant model training and memory
CN117952566A (en) * 2024-03-25 2024-04-30 南京审计大学 Project cost prediction method and computer system based on ridge regression machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20181123