USRE42440E1 - Robust modeling - Google Patents
Robust modeling
- Publication number
- USRE42440 E1 (application US 11/494,753)
- Authority
- US
- United States
- Prior art keywords
- training
- weights
- model
- generating
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
Definitions
- the present invention relates generally to a learning machine that models a system. More specifically, a robust modeling system that determines an optimum complexity for a given criteria is disclosed.
- the robust model of a system strikes a compromise between accurately fitting outputs in a known data set and effectively predicting outputs for unknown data.
- a learning machine is a device that maps an unknown set of inputs (X1, X2, . . . , Xn), which may be referred to as an input vector, to an output Y.
- Y may be a vector or Y may be a single value.
- Appropriate thresholds may be applied to Y so that the input data is classified by the output Y.
- When Y is a number, the process of associating Y with an input vector is referred to as scoring, and when Y is thresholded into classes, the process of associating Y with an input vector is referred to as classification.
- a learning machine models the system that generates the output from the input using a mathematical model. The mathematical model is trained using a set of inputs and outputs generated by the system. Once the mathematical model is trained using the system generated data, the model may be used to predict future outputs based on given inputs.
- a learning machine can be trained using various techniques.
- Statistical Learning Theory by Vladimir Vapnik, published by John Wiley and Sons, ©1998, which is herein incorporated by reference for all purposes, and Advances in Kernel Methods: Support Vector Learning, published by MIT Press ©1999, which is herein incorporated by reference for all purposes, describe how a linear model having a high dimensional feature space can be developed for a system that includes a large number of input parameters and an output.
- One example of a system that may be modeled is electricity consumption by a household over time.
- the output of the system is the amount of electricity consumed by a household and the inputs may be a wide variety of data associated with empirical electricity consumption such as day of the week, month, average temperature, wind speed, household income, number of persons in the household, time of day, etc. It might be desirable to predict future electricity consumption by households given different inputs.
- a learning machine can be trained to predict electricity consumption for various inputs using a training data set that includes sets of input parameters (input vectors) and outputs associated with the input parameters.
- a model trained using available empirical data can then be used to predict future outputs from different inputs.
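The train-then-predict workflow just described can be sketched as follows. This is a minimal illustration with a hypothetical two-feature electricity data set and a plain linear model; the feature names and numbers are invented for the example.

```python
import numpy as np

# Hypothetical training data: each row is an input vector
# (average temperature in C, persons in household); outputs are kWh consumed.
X_train = np.array([[5.0, 2], [10.0, 3], [20.0, 4], [30.0, 2], [15.0, 5]])
y_train = np.array([42.0, 50.0, 55.0, 38.0, 65.0])

# Train a simple linear model (with a bias column) by least squares.
A = np.hstack([np.ones((len(X_train), 1)), X_train])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Predict consumption for an input vector not in the training set.
x_new = np.array([1.0, 12.0, 3])  # bias term prepended
y_pred = float(x_new @ w)
```

The fitted weights play the role of the trained mathematical model; any future input vector can be scored the same way.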
- FIG. 1A illustrates a model that is complex but is not robust.
- Trace 102 passes very close to all of the data points shown, which are included in the training set. However, because of the complex nature of curve 102 , it is unlikely to successfully approximate the output Y for values of X that are not in the training set.
- FIG. 1B is a graph illustrating a model that is very robust, but does not provide as good a fit as the model shown in FIG. 1A .
- Curve 104 does not pass as close to the data points in the training set shown as Curve 102 did in FIG. 1A .
- Curve 104 is more robust because future data points shown as circles are closer to Curve 104 than to Curve 102 .
- the ability of the model to provide a good fit for data points not included in the training set is referred to as the model's robustness. The question of how to determine a model complex enough to fit the training set well yet simple enough to remain robust is the subject of considerable research.
- U.S. Pat. No. 5,684,929 (hereinafter the “'929 patent”) issued to Cortes and Jackel illustrates one approach to determining an appropriate complexity for a model used to predict the output of a system.
- Cortes and Jackel teach that, if data is provided in a training set used to train a model and a test data set used to test the model, then an approximation of the percentage error expected for a given level of complexity using a training set of infinite size can be accurately estimated. Based on such an estimate, Cortes and Jackel teach that combining such an estimate with other estimates obtained for different levels of capacity or complexity models can be used so that the error decreases asymptotically towards some minimum error E m .
- Cortes and Jackel then describe increasing the complexity of the modeling machine until the diminishing gains, realized as the theoretical error for an infinite training set is asymptotically approached, fall below a threshold.
- the threshold may be adjusted to indicate when further decrease in error does not warrant increasing the complexity of the modeling function.
- FIG. 2 is a graph illustrating how the error for a training data set and the error for data not included in the training set behave as the complexity of a model derived using the training data set increases.
- Curve 200 shows that as the complexity or capacity of the modeling function increases, the aggregate error calculated when comparing the output of the model to the output provided in the training data set for the same inputs decreases. In fact, the difference between the output of the model and the data provided in the training set can be reduced to zero if a sufficiently complex modeling function is used.
- Curve 202 illustrates the error determined by the difference between the output of the model and real output data obtained for inputs not included in the training set.
- the error at first decreases until it reaches a minimum and then begins to increase. This result is caused by an overly complex model becoming excessively dependent on the vagaries of the training set. This phenomenon is referred to as over-training and results in a complex model that is a very good fit of the training data but is not robust.
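The over-training behavior of curves 200 and 202 can be reproduced numerically. The sketch below fits polynomials of increasing degree to noisy samples of a sine wave (a hypothetical stand-in for a real system): training error can only shrink as complexity grows, while validation error need not.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

# Points not in the training set, used to probe robustness.
x_val = np.linspace(0.05, 0.95, 10)
y_val = np.sin(2 * np.pi * x_val)

train_err, val_err = [], []
for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Aggregate squared error on the training set and on unseen points.
    train_err.append(np.sum((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_err.append(np.sum((np.polyval(coeffs, x_val) - y_val) ** 2))
```

Because each higher-degree model family contains the lower-degree ones, the least squares training error is non-increasing in the degree, mirroring curve 200; the validation error follows curve 202 and is not guaranteed to keep falling.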
- a robust model is generated using a technique that optimizes the complexity of the model based on data obtained from the system being modeled.
- Data is split into a training data set and a generalization or cross validation data set.
- weights are determined so that the error between the model output and the training data set is minimized.
- a degree of complexity is found that enables weights to be determined that best minimize some measure of error between the model output and the cross validation data, or that best accomplish some goal related to the cross validation data.
- the degree of complexity is measured by a complexity parameter, Lambda.
- a polynomial function is used to model a system.
- the coefficients of the polynomial are determined using data in a training set with a regression method used to minimize the error between the output of the model function and the output data in the training set.
- a regularization coefficient is used to help calculate the weights.
- the regularization coefficient is also a measure of the complexity of the modeling function and may be used as a complexity parameter. By varying the complexity parameter and checking a criteria defined for comparing the output of the model and data in a cross validation set, an optimum complexity parameter may be derived for the modeling function.
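The loop implied by the last few paragraphs, fitting regularized weights on the training set for each candidate complexity parameter and scoring each fit on the cross validation set, can be sketched like this. The data here are synthetic and the Lambda grid is an arbitrary illustrative choice.

```python
import numpy as np

def fit_ridge(Z, y, lam):
    """Weights minimizing sum((y - Z w)^2) + lam * ||w||^2 (ridge regression)."""
    n = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n), Z.T @ y)

# Synthetic system: a known linear rule plus noise.
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 4))
y = Z @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, 30)

# Split into a training set and a cross validation set.
Z_train, y_train = Z[:20], y[:20]
Z_cv, y_cv = Z[20:], y[20:]

best_lam, best_err, best_w = None, np.inf, None
for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    w = fit_ridge(Z_train, y_train, lam)          # minimize training error
    cv_err = np.sum((Z_cv @ w - y_cv) ** 2)       # check the criteria
    if cv_err < best_err:
        best_lam, best_err, best_w = lam, cv_err, w
```

The weights associated with `best_lam` are the robust model; a finer search over Lambda can replace the coarse grid.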
- the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, a device, a method, or a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication lines.
- a method of generating a robust model of a system includes selecting a modeling function having a set of weights wherein the modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter an associated set of weights of the modeling function is determined such that a training error is minimized for a training data set. An error for a cross validation data set is determined for each set of weights associated with one of the plurality of values of the complexity parameter and the set of weights associated with the value of the complexity parameter is selected that best satisfies a cross validation criteria. Thus, the selected set of weights used with the modeling function provides the robust model.
- a method of generating a robust model of a system includes selecting a modeling function having a set of weights wherein the modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter, an associated set of weights of the modeling function is determined such that a training error is minimized for a training data set. A cross validation error for a cross validation data set is determined for each set of weights associated with one of the plurality of values of the complexity parameter.
- An optimal value of the complexity parameter is determined that minimizes the cross validation error and an output set of weights of the modeling function using the determined optimal value of the complexity parameter and an aggregate training data set that includes the training data set and the cross validation data set is determined such that an aggregate training error is minimized for the aggregate training data set.
- the output set of weights used with the modeling function provides the robust model.
- a robust modeling engine includes a memory configured to store a training data set and a cross validation data set.
- a processor is configured to select a modeling function having a set of weights.
- the modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter, the processor determines an associated set of weights of the modeling function such that a training error is minimized for a training data set.
- the processor determines an error for a cross validation data set for each set of weights associated with one of the plurality of values of the complexity parameter and selects the set of weights associated with the value of the complexity parameter that best satisfies a cross validation criteria.
- An output is configured to output the set of weights associated with the value of the complexity parameter that best satisfies a cross validation criteria.
- FIG. 1A illustrates a model that is complex but is not robust.
- FIG. 1B is a graph illustrating a model that is very robust, but does not provide as good a fit as the model shown in FIG. 1A .
- FIG. 2 is a graph illustrating how the error for a training data set and the error for data not included in the training set behave as the complexity of a model derived using the training data set increases.
- FIG. 3 is a flowchart illustrating a process for determining an optimum Lambda and outputting a model based on that optimum Lambda.
- FIG. 4 is a block diagram illustrating a system used to generate a robust model as described above.
- the output of such a robust model can be used in a raw form as a score, or the output can be converted to a classification using thresholds. For example, output above a certain electricity usage threshold level could be classified as high and output below that level could be classified as low. Errors in the output of the learning machine can be evaluated either in terms of the number of misclassifications or in terms of the difference between the model output and actual usage. Using raw values is referred to as scoring and using classifications is referred to as classification.
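The scoring-versus-classification distinction above amounts to one extra thresholding step. A minimal sketch, with a hypothetical kWh cutoff:

```python
# Scoring: use the raw model outputs directly.
scores = [42.3, 57.1, 50.0, 66.8]  # hypothetical model outputs in kWh

# Classification: threshold the scores into "high" / "low" usage.
threshold = 50.0  # hypothetical cutoff
classes = ["high" if s > threshold else "low" for s in scores]
# classes == ["low", "high", "low", "high"]
```

Errors can then be evaluated either on the raw scores (e.g. squared error) or as a count of misclassifications against known labels.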
- the inputs are first mapped into a high dimensional feature space.
- new attributes may be derived from the inputs. For example, various cross products and squares such as X1^2 or X1*X2 may be generated.
- an embodiment using a polynomial of degree 2 across the input attributes will be described. In other embodiments, different degree polynomials are used. Other types of functions may be used as well.
- the attributes include all of the squares and cross products of the inputs, and a bias constant Z0.
- the model is linear in attributes Zi.
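The degree-2 attribute mapping described above can be sketched as a small function. The ordering of the attributes here is one reasonable convention, not dictated by the text; linear terms are included alongside the bias, squares, and cross products, matching the worked example later in the document.

```python
import numpy as np

def degree2_attributes(x):
    """Map inputs (X1, ..., Xn) to a bias constant Z0, the inputs
    themselves, and all squares and cross products, so the model
    Y = w0 + w1*Z1 + ... + wN*ZN is linear in the attributes Zi."""
    x = np.asarray(x, dtype=float)
    attrs = [1.0]                      # bias constant Z0
    attrs.extend(x)                    # linear terms
    n = len(x)
    for i in range(n):
        for j in range(i, n):
            attrs.append(x[i] * x[j])  # squares (i == j) and cross products
    return np.array(attrs)

z = degree2_attributes([2.0, 3.0])
# z = [1, 2, 3, 4, 6, 9]: bias, X1, X2, X1^2, X1*X2, X2^2
```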
- Empirical data is used to train the model and derive the weights.
- One way of deriving the weights is to use the least squares method on a set of training data comprising outputs Yi and attributes (Z1i, Z2i, . . . , ZNi) derived from inputs (X1i, X2i, . . . , XNi) to derive weights w that minimize the error between the model output Y′i and the empirical system outputs Yi.
- the least squares method is described in Numerical Recipes in C, Second Edition by Press et al., published by Cambridge University Press ©1997, which is herein incorporated by reference.
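As a sketch of the least squares step with a tiny hypothetical attribute matrix (the numbers are chosen so the system has an exact solution):

```python
import numpy as np

# Hypothetical attribute matrix: each row is (Z0=1, Z1, Z2) for one sample.
Z = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 5.0])

# Least squares: w minimizes ||Z w - y||^2.
w, residuals, rank, sv = np.linalg.lstsq(Z, y, rcond=None)
y_model = Z @ w  # model outputs Y'i for the training inputs
```

For this data the exact fit is w = (0, 2, 1); in general the residual measures how far the linear-in-attributes model is from the empirical outputs.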
- a model of the form described above may be the basis of a support vector machine.
- support vector regression low dimensional input data is mapped into a high dimensional feature space via a nonlinear mapping and linear regression is used in the high dimensional feature space.
- Linear regression in the high dimensional feature space corresponds to nonlinear regression in the low dimensional input space.
- the empirical error minimized on the training set is Remp(f) = Σ (i=1, . . . ,L) [ Yi − Σ (j=0, . . . ,N) wj*Zij ]^2
- a regularization function Rreg(f) is derived by adding a regularization coefficient to Remp(f).
- Rreg(f) = Remp(f) + Lambda*||w||^2, where Lambda is a constant to be specified. This regularized error is then minimized to derive the weights.
- This mathematical technique for deriving the weights wi using a regularization coefficient is referred to as ridge regression.
- Given Remp(f) = Σ (i=1, . . . ,L) [ Yi − Σ (j=0, . . . ,N) wj*Zij ]^2, for a given Lambda it is possible to compute the exact solution (w0, . . . , wN) by computing the partial derivatives of Rreg(f) with respect to w0, . . . , wN and setting them to zero.
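Setting those partial derivatives to zero gives the standard ridge regression normal equations, (Zᵀ Z + Lambda·I) w = Zᵀ y, which the following sketch solves directly (the small matrices are illustrative):

```python
import numpy as np

def ridge_weights(Z, y, lam):
    """Exact solution of d(Rreg)/dw = 0: (Z^T Z + lam*I) w = Z^T y."""
    n = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n), Z.T @ y)

# As lam -> 0 this approaches the plain least squares solution.
Z = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 1.0, 2.0])
w = ridge_weights(Z, y, 1e-9)  # close to the least squares answer (1, 1)
```

Note that a strictly positive Lambda also guarantees the matrix Zᵀ Z + Lambda·I is invertible, which is why the text calls Lambda a nonzero term required to solve the equations.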
- a model can be described that models a system using weights that are derived by minimizing a regularization function.
- the regularization function includes an error term Remp(f) and a regularization coefficient.
- the error term shown above is derived from the polynomial least square error. Other error terms may be used.
- the regularization coefficient constrains the ability of the polynomial least square error to be minimized by increasing the regularization function when large weights are used.
- Lambda is a nonzero term that is required to solve the above equations for w0, . . . , wN. Lambda also has an important interpretation. Referring back to the equation for Rreg(f), Lambda is multiplied by the sum of the squares of the weights, ||w||^2. As Lambda becomes larger, the regularization term increasingly penalizes models that have large weights. Significantly, Lambda determines the complexity of the model that is derived from a given set of training data. Another way of stating this is that increasing Lambda decreases the VC dimension of the model. The VC dimension, as described in Statistical Learning Theory, which was previously incorporated by reference, is a measure of the complexity of the modeling function. Thus, changing Lambda controls the complexity of the model.
- It is not strictly true that the VC dimension decreases monotonically with an increase in Lambda. Rather, we say that Lambda enforces the VC dimension: the VC dimension generally decreases as Lambda increases, but not necessarily in a monotonic manner.
- As Lambda increases, the weights derived from a given training data set become smaller. If Lambda is infinite, the weights converge to 0 and the model reduces to a single constant, the average of the training outputs. In the terminology of support vector regression, it is said that the ||w||^2 term enforces the flatness of the feature space and Lambda controls the degree of flatness. Lambda is also referred to as a regularization constant.
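The shrinking of the weights with growing Lambda can be observed directly. One interpretation consistent with the model reducing to the training average is that the bias weight w0 is left unpenalized; the text does not spell this out, so the sketch below makes that assumption explicit.

```python
import numpy as np

def ridge_weights_unpenalized_bias(Z, y, lam):
    """Ridge solution penalizing every weight except the bias w0
    (an assumption: it makes the model converge to the training
    average as Lambda -> infinity, as the text describes)."""
    n = Z.shape[1]
    P = np.eye(n)
    P[0, 0] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(Z.T @ Z + lam * P, Z.T @ y)

# Toy data: bias column plus one centered attribute.
Z = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0]])
y = np.array([0.0, 1.0, 3.0])

w_small = ridge_weights_unpenalized_bias(Z, y, 1e-6)  # near least squares
w_huge = ridge_weights_unpenalized_bias(Z, y, 1e6)    # slope driven to ~0
# As Lambda grows, w1 -> 0 and the model output -> mean(y) = 4/3.
```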
- the data is first divided into two sets.
- the first 10 lines of data are the training set and the next 5 lines are used as the cross-validation set.
- the attributes Z1, Z2, Z3, Z4, Z5 are computed by normalizing X1 and X2 to get X1N and X2N, each with zero mean and variance equal to 1. Then we compute Z1 = X1N, Z2 = X2N, Z3 = X1N^2, Z4 = X2N^2, and Z5 = X1N*X2N.
- the target Y is normalized to get YN.
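The normalization step (zero mean, unit variance) used for the inputs and the target is a one-liner; the sample values here are hypothetical:

```python
import numpy as np

def normalize(col):
    """Shift a column to zero mean and scale it to unit variance,
    as done for X1, X2, and the target Y in the example."""
    col = np.asarray(col, dtype=float)
    return (col - col.mean()) / col.std()

X1 = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical raw input column
X1N = normalize(X1)
# X1N now has mean 0 and standard deviation 1.
```

In practice the mean and standard deviation computed on the training data should be saved and reused to normalize any future inputs before scoring.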
- the data set is separated into a training data set and a cross validation data set.
- the first 10 lines of data are used for training and the last 5 lines are used for cross validation.
- Rreg(f) = (1.62 − 0.70*w1 + 1.87*w2 + 0.49*w3 + 3.50*w4 − 1.31*w5)^2 + (0.19 − 1.37*w1 + 0.90*w2 + 1.88*w3 + 0.81*w4 − 1.23*w5)^2 + (0.06 − 0.36*w1 − 0.06*w2 + 0.13*w3 + 0.00*w4 + 0.02*w5)^2 + (−0.63 − 1.37*w1 − 1.03*w2 + 1.88*w3 + 1.06*w4 + 1.41*w5)^2 + (−0.63 − 0.02*w1 − 1.03*w2 + 0.00*w3 + 1.06*w4 + 0.02*w5)^2 + (−0.32 + 1.33*w1 − 0.06*w2 + 1.77*w3 + 0.00*w4 − 0.08*w5)^2 + . . .
- the remainder of the data set shown below is then used to select an optimum Lambda.
- the output of the model is compared to the output specified in the validation set for each input.
- a goal is defined for the comparison and a search is performed by varying Lambda to determine an optimum value of Lambda that best achieves the goal. Selecting an optimum Lambda is described in further detail below.
- the goal can be in various forms.
- the goal can be to minimize the number of misclassifications on the verification set by the model.
- the goal may also be to minimize the sum of the squares of the errors.
- In this example, the optimum value found is Lambda = 0.1.
- the entire data set can be used to derive w 0 , w 1 , w 2 , w 3 , w 4 , w 5 .
- the resulting model fits the entire data set as well as possible for the given lambda selected to maximize robustness.
- the weights wi shown above may be derived by minimizing the regularization function for a given Lambda. Next, a technique will be described for determining an optimum value of Lambda.
- Lambda is optimized using the data set aside as the cross-validation data set separate from the training data set.
- the cross validation set is used to determine or learn the best value of Lambda for the model.
- increasing Lambda in general lowers the VC dimension of the model.
- the best Lambda, and therefore the best VC dimension, is determined by a process of selecting different values of Lambda, deriving an optimum set of weights for each Lambda, and then evaluating the performance of the resulting model when applied to the cross validation data set.
- different criteria are used to evaluate the performance of the model corresponding to each Lambda.
- the sum of the squares of the differences between the model outputs and the cross validation data set outputs for corresponding inputs is minimized.
- Different values of Lambda are selected and a search is made for the Lambda that best achieves that goal. That value of Lambda is adopted as the best Lambda for the criteria selected.
- Minimizing the sum of the squares of the differences between the model outputs and the cross validation data set outputs is referred to as a goal or a criteria for selecting Lambda. In other embodiments, other criteria are used. For example, minimizing the sum of the absolute values of the differences between the model outputs and the cross validation data set outputs may be the goal, or minimizing the maximum difference between the model outputs and the cross validation data set outputs may be the goal.
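The three criteria just mentioned are each a one-line function of the model outputs and the cross validation outputs; the sample vectors below are hypothetical:

```python
import numpy as np

# Three cross validation criteria for selecting Lambda.
def sum_squared_error(y_model, y_cv):
    return np.sum((y_model - y_cv) ** 2)

def sum_absolute_error(y_model, y_cv):
    return np.sum(np.abs(y_model - y_cv))

def max_absolute_error(y_model, y_cv):
    return np.max(np.abs(y_model - y_cv))

y_model = np.array([1.0, 2.0, 4.0])  # hypothetical model outputs
y_cv = np.array([1.0, 3.0, 2.0])     # hypothetical cross validation outputs
sse = sum_squared_error(y_model, y_cv)   # 0 + 1 + 4 = 5
sae = sum_absolute_error(y_model, y_cv)  # 0 + 1 + 2 = 3
mae = max_absolute_error(y_model, y_cv)  # 2
```

Any of these (or a more specialized goal such as lift) can serve as the objective of the search over Lambda.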
- Any criteria may be defined and a search for an optimum Lambda for maximizing or minimizing the criteria can be performed. Any suitable search method for finding an optimum Lambda may be used. In one embodiment, a Newton type minimization method is used to find the best Lambda for the given criteria. Other methods, such as the Brent method, are used in other embodiments.
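As one concrete stand-in for those 1-D searches, a golden-section search over log10(Lambda) works when the cross validation error is roughly unimodal in Lambda. The error function below is a hypothetical placeholder with a known minimum at Lambda = 0.1; in practice it would refit the weights and apply the chosen criteria for each candidate Lambda.

```python
import math

def golden_section_min(f, lo, hi, tol=1e-4):
    """Minimize a unimodal 1-D function on [lo, hi]; a simple
    stand-in for the Brent/Newton type searches mentioned above."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c           # minimum lies in [a, d]
            c = b - phi * (b - a)
        else:
            a, c = c, d           # minimum lies in [c, b]
            d = a + phi * (b - a)
    return (a + b) / 2

# Search over log10(Lambda). Placeholder cross validation error with
# its minimum at log10(Lambda) = -1, i.e. Lambda = 0.1.
cv_error = lambda log_lam: (log_lam + 1.0) ** 2
best_log_lam = golden_section_min(cv_error, -4.0, 2.0)
best_lam = 10 ** best_log_lam  # close to 0.1
```

Searching in log space is a common choice because useful Lambda values often span several orders of magnitude.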
- the steps for deriving a model based on a set of data described above include dividing the set of data into two sets, a training set and a cross-validation set.
- a best Remp(f) on the learning set is derived using the regularized least square method, with Lambda being the strength of the regularization (this is sometimes referred to as the vistamboire effect).
- This yields a model in the form: Y = w0 + w1*Z1 + . . . + wN*ZN
- a criteria for evaluating the model is selected and the model is evaluated using the criteria and the cross-validation set.
- the criteria can be the sum of the squares of the errors for each sample in the cross validation data set or any other criteria that may be desired.
- highly specialized criteria are defined based on a specific goal that may be deemed desirable for the model.
- the goal that is defined for the purpose of optimizing Lambda is maximizing the “lift” determined by a “lift curve” for the data in the cross validation set.
- a lift curve is used by direct marketers to determine how effective a selection of certain members of a population has been in terms of maximizing some characteristic of the users (e.g. net sales or total customer value).
- the lift compares the performance of the selected group to the expected performance of a randomly selected group and of a group selected by a “wizard” having perfect knowledge.
- the criteria for selecting Lambda and optimizing the robustness of the model may be very specifically adapted to a specific application of the model.
- Lambda is optimized for the criteria. It should be noted that, for different criteria, the optimization may be either maximizing a defined parameter (such as lift) or minimizing a defined parameter (such as a type of error).
- a final model is retrained using the optimum Lambda.
- the retraining is done by using both the training data set and the cross validation data set to determine the final set of weights.
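The retraining step amounts to stacking the two data sets and refitting with the Lambda already selected. A sketch with synthetic data and an illustrative best Lambda:

```python
import numpy as np

def fit_ridge(Z, y, lam):
    """Ridge regression weights: (Z^T Z + lam*I) w = Z^T y."""
    n = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n), Z.T @ y)

# Hypothetical split data and the Lambda selected by cross validation.
rng = np.random.default_rng(2)
Z_train, Z_cv = rng.normal(size=(10, 3)), rng.normal(size=(5, 3))
y_train, y_cv = rng.normal(size=10), rng.normal(size=5)
best_lam = 0.1  # illustrative value from the earlier search

# Retrain on the aggregate set (training + cross validation data).
Z_all = np.vstack([Z_train, Z_cv])
y_all = np.concatenate([y_train, y_cv])
final_w = fit_ridge(Z_all, y_all, best_lam)
```

Using all available data for the final fit extracts the most information from the sample while keeping the complexity fixed at the level the cross validation step found to be robust.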
- FIG. 3 is a flowchart illustrating a process for determining an optimum Lambda and outputting a model based on that optimum Lambda.
- the process starts at 300 .
- the data is split into a training set and a cross validation set.
- the training set is used to determine a model with the best fit for the training set for a given Lambda.
- the cross validation set is used to evaluate the robustness of different models based on different Lambdas.
- a search is made to determine the best Lambda.
- a first Lambda is selected.
- the best fit to the training set is found for the selected Lambda.
- the best fit is found using a selected type of model.
- one such model may be a polynomial for which determining the best fit means determining the polynomial coefficients or weights for each of the terms in the polynomial. In other embodiments, other types of modeling functions are used.
- Lambda for a polynomial is the regularization coefficient. In other embodiments, the Lambda is a parameter that similarly controls the complexity or VC dimension of the model.
- the best fit model determined in step 314 is evaluated according to a specified criteria using the cross validation set.
- the criteria specified may be a general criteria such as minimizing the error or the sum of the squared errors between the output of the model and the outputs specified in the cross validation set. In other embodiments, as described above, highly specialized criteria or goals may be specified such as maximizing the lift curve for the model applied to the cross validation set.
- In a step 318, the performance of the model according to the criteria is compared to other results for other Lambdas. If, as a result of the comparison, it is determined that the optimum Lambda has been found, then control is transferred to a step 322. If the optimum Lambda has not been found, then control is transferred to a step 320 where the next Lambda is determined and then control is transferred back to step 314.
- different methods of selecting Lambdas to check and deciding when an optimum Lambda has been found are used. In general, a search is made for an optimum Lambda that satisfies the criteria and the search is complete when the search method being used determines that improvements in performance gained by selecting other Lambdas fall below some threshold.
- the threshold may be set in a manner that makes an acceptable tradeoff between speed and how precisely an optimum Lambda is determined.
- the model is recomputed in a similar manner as is done in step 314 using all of the data (i.e., both the data in the training set and the data in cross validation set).
- the model is output.
- Lambda may also be output along with various metrics that evaluate the performance of the model. Such metrics may include measures of the error between the outputs derived from the model and the outputs specified in the data set. Performance may also be indicated by the number of misclassifications occurring for the entire data set.
- the process then ends at 326 .
- FIG. 4 is a block diagram illustrating a system used to generate a robust model as described above.
- a processor 400 implements the methods described above.
- a memory 414 stores the data set and model parameters.
- An input interface 410 receives input data and specifications of the type of model to be used, the type of error to be minimized when fitting the training data set, and the goal or criteria to be used to evaluate the performance of the models generated using the cross validation data set.
- An output interface 416 outputs the robust model, along with metrics used to evaluate the performance of the robust model.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Feedback Control In General (AREA)
Abstract
A system and method are disclosed for generating a robust model of a system that selects a modeling function. The modeling function has a set of weights and the modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter an associated set of weights of the modeling function is determined such that a training error is minimized for a training data set. An error for a cross validation data set is determined for each set of weights associated with one of the plurality of values of the complexity parameter and the set of weights associated with the value of the complexity parameter is selected that best satisfies a cross validation criteria. Thus, the selected set of weights used with the modeling function provides the robust model.
Description
The present invention relates generally to a learning machine that models a system. More specifically, a robust modeling system that determines an optimum complexity for a given criteria is disclosed. The robust model of a system strikes a compromise between accurately fitting outputs in a known data set and effectively predicting outputs for unknown data.
A learning machine is a device that maps an unknown set of inputs (X1, X2, . . . , Xn), which may be referred to as an input vector, to an output Y. Y may be a vector or Y may be a single value. Appropriate thresholds may be applied to Y so that the input data is classified by the output Y. When Y is a number, the process of associating Y with an input vector is referred to as scoring, and when Y is thresholded into classes, the process of associating Y with an input vector is referred to as classification. A learning machine models the system that generates the output from the input using a mathematical model. The mathematical model is trained using a set of inputs and outputs generated by the system. Once the mathematical model is trained using the system generated data, the model may be used to predict future outputs based on given inputs.
A learning machine can be trained using various techniques. Statistical Learning Theory by Vladimir Vapnik, published by John Wiley and Sons, ©1998, which is herein incorporated by reference for all purposes, and Advances in Kernel Methods: Support Vector Learning, published by MIT Press ©1999, which is herein incorporated by reference for all purposes, describe how a linear model having a high dimensional feature space can be developed for a system that includes a large number of input parameters and an output.
One example of a system that may be modeled is electricity consumption by a household over time. The output of the system is the amount of electricity consumed by a household and the inputs may be a wide variety of data associated with empirical electricity consumption such as day of the week, month, average temperature, wind speed, household income, number of persons in the household, time of day, etc. It might be desirable to predict future electricity consumption by households given different inputs. A learning machine can be trained to predict electricity consumption for various inputs using a training data set that includes sets of input parameters (input vectors) and outputs associated with the input parameters. A model trained using available empirical data can then be used to predict future outputs from different inputs.
An important measure of the effectiveness of a trained model is its robustness. Robustness is a measure of how well the model performs on unknown data after training. As a more and more complex model is used to fit the training data set, the aggregate error produced by the model when applied to the entire training set can be lowered all the way to zero, if desired. However, as the complexity or capacity of the model increases, the error that is experienced on input data that is not included in the training set increases. That is because, as the model gets more and more complex, it becomes strongly customized to the training set. As it exactly models the vagaries of the data in the training set, the model tends to lose its ability to provide useful generalized results for data not included in the training set. FIG. 1A illustrates a model that is complex but is not robust. The output of the model is illustrated by trace 102. Trace 102 passes very close to all of the data points shown, which are included in the training set. However, because of the complex nature of curve 102, it is unlikely to successfully approximate the output Y for values of X that are not in the training set.
For example, U.S. Pat. No. 5,684,929 (hereinafter the “'929 patent”) issued to Cortes and Jackel illustrates one approach to determining an appropriate complexity for a model used to predict the output of a system. Cortes and Jackel teach that, given a training data set used to train a model and a test data set used to test the model, the percentage error expected for a given level of complexity with a training set of infinite size can be accurately estimated. Cortes and Jackel further teach that combining such estimates obtained for models of different levels of capacity or complexity shows the error decreasing asymptotically toward some minimum error Em. Cortes and Jackel then describe increasing the complexity of the modeling machine until the diminishing gains realized as the theoretical error for an infinite training set is asymptotically approached fall below a threshold. The threshold may be adjusted to indicate when further decrease in error does not warrant increasing the complexity of the modeling function.
For very large training sets where the error on the test data set and the training data set both approximate the error on an infinite training set, this approach is useful. Generally, as complexity increases, the error decreases, and it is reasonable to specify a minimum decrease in error below which it is not deemed worthwhile to further increase the complexity of the modeling function. However, the technique taught by Cortes and Jackel does not address the possibility that the error on new data actually increases as the modeling function complexity increases. Because it assumes the training set is very large, or infinite if necessary, the '929 patent assumes that the error asymptotically reaches a minimum. That is not the case for finite data sets, and therefore the phenomenon of reduced robustness with increased complexity should be addressed in practical systems with limited training data. What is needed is a way of varying the capacity or complexity of a modeling function and determining an optimum complexity for modeling a given system.
Again, the tradeoff between fit and robustness as the complexity of a model increases suggests the desirability of finding an optimal level of complexity for a model so that the error of the model when applied to future input data may be minimized. However, a simple and effective method of deriving an optimally complex model has not been found. What is needed is a method of determining a model that has optimum or nearly optimum complexity so that when the best fit possible given the optimum complexity is achieved for the training set, the model tends to robustly describe the output of the system for inputs not included in the training set. Specifically, a method of varying the complexity of a model and predicting the performance of a model on future unknown inputs to the system is needed.
A robust model is generated using a technique that optimizes the complexity of the model based on data obtained from the system being modeled. Data is split into a training data set and a generalization or cross validation data set. For a given complexity, weights are determined so that the error between the model output and the training data set is minimized. A degree of complexity is then found that enables weights to be determined that best minimize some measure of error between the model output and the cross validation data, or that best accomplish some goal related to the cross validation data. The degree of complexity is measured by a complexity parameter, Lambda. Once the optimum complexity has been determined, weights for that complexity may be determined using both the training data set and the generalization data set.
In one embodiment, a polynomial function is used to model a system. The coefficients of the polynomial are determined using data in a training set with a regression method used to minimize the error between the output of the model function and the output data in the training set. A regularization coefficient is used to help calculate the weights. The regularization coefficient is also a measure of the complexity of the modeling function and may be used as a complexity parameter. By varying the complexity parameter and checking a criteria defined for comparing the output of the model and data in a cross validation set, an optimum complexity parameter may be derived for the modeling function.
It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, a device, a method, or a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication lines. Several inventive embodiments of the present invention are described below.
In one embodiment, a method of generating a robust model of a system includes selecting a modeling function having a set of weights wherein the modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter an associated set of weights of the modeling function is determined such that a training error is minimized for a training data set. An error for a cross validation data set is determined for each set of weights associated with one of the plurality of values of the complexity parameter and the set of weights associated with the value of the complexity parameter is selected that best satisfies a cross validation criteria. Thus, the selected set of weights used with the modeling function provides the robust model.
In one embodiment, a method of generating a robust model of a system includes selecting a modeling function having a set of weights wherein the modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter, an associated set of weights of the modeling function is determined such that a training error is minimized for a training data set. A cross validation error for a cross validation data set is determined for each set of weights associated with one of the plurality of values of the complexity parameter. An optimal value of the complexity parameter is determined that minimizes the cross validation error and an output set of weights of the modeling function using the determined optimal value of the complexity parameter and an aggregate training data set that includes the training data set and the cross validation data set is determined such that an aggregate training error is minimized for the aggregate training data set. The output set of weights used with the modeling function provides the robust model.
In one embodiment, a robust modeling engine includes a memory configured to store a training data set and a cross validation data set. A processor is configured to select a modeling function having a set of weights. The modeling function has a complexity that is determined by a complexity parameter. For each of a plurality of values of the complexity parameter, the processor determines an associated set of weights of the modeling function such that a training error is minimized for a training data set. The processor determines an error for a cross validation data set for each set of weights associated with one of the plurality of values of the complexity parameter and selects the set of weights associated with the value of the complexity parameter that best satisfies a cross validation criteria. An output is configured to output the set of weights associated with the value of the complexity parameter that best satisfies a cross validation criteria.
These and other features and advantages of the present invention will be presented in more detail in the following detailed description and the accompanying figures which illustrate by way of example the principles of the invention.
The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:
A detailed description of a preferred embodiment of the invention is provided below. While the invention is described in conjunction with that preferred embodiment, it should be understood that the invention is not limited to any one embodiment. On the contrary, the scope of the invention is limited only by the appended claims and the invention encompasses numerous alternatives, modifications and equivalents. For the purpose of example, numerous specific details are set forth in the following description in order to provide a thorough understanding of the present invention. The present invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, details relating to technical material that is known in the technical fields related to the invention have not been described in order not to unnecessarily obscure the present invention.
The Regularization Function and Lambda
Using empirical data to derive an optimally robust model will now be described in detail. It should be noted that the output of such a robust model can be used in a raw form as a score, or the output can be converted to a classification using thresholds. For example, output above a certain electricity usage threshold level could be classified as high and output below that level could be classified as low. Errors in the output of the learning machine can be evaluated either in terms of the number of misclassifications or in terms of the difference between the model output and actual usage. Using raw values is referred to as scoring and using classifications is referred to as classification.
In general, a system may be described in terms of an output Y and various inputs X1,X2, . . . ,XN such that Y=F(X1,X2, . . . ,XN), where F is some unknown function of X1,X2, . . . ,XN. To model the system, the inputs are first mapped into a high dimensional feature space. Depending on the function used by the model, new attributes may be derived from the inputs. For example, various cross products and squares such as X1^2 or X1*X2 may be generated. For the purpose of example, an embodiment using a polynomial of degree 2 across the input attributes will be described. In other embodiments, different degree polynomials are used. Other types of functions may be used as well.
After starting with inputs X1,X2, . . . ,Xn, a set of attributes can be defined. For example, using a polynomial of degree 2 over n inputs, the number of attributes, including a bias constant Z0, is n(n+3)/2+1 (that is, N=n(n+3)/2 in the indexing below). The attributes include the inputs themselves, all of their squares and cross products, and the bias constant Z0. The model derived for the input data is obtained in the form Y=w0*Z0+w1*Z1+w2*Z2+ . . . +wN*ZN, where each wi is a weight that is derived for the model. In the new feature space (Z0,Z1,Z2, . . . ,ZN), the model is linear in the attributes Zi. Empirical data is used to train the model and derive the weights.
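The degree-2 attribute mapping described above can be sketched in Python (an illustrative sketch assuming NumPy, not part of the patent; the function name `degree2_features` is hypothetical, and the ordering of the quadratic terms here differs slightly from the Z0..Z5 ordering used later in the text):

```python
import numpy as np

def degree2_features(x):
    # Map raw inputs (X1, ..., Xn) to the degree-2 attributes: a bias
    # constant Z0, the inputs themselves, and all squares and cross
    # products, giving n(n+3)/2 + 1 attributes in total.
    x = np.asarray(x, dtype=float)
    n = len(x)
    feats = [1.0]                          # Z0, the bias constant
    feats.extend(x)                        # linear terms
    feats.extend(x[i] * x[j]               # squares and cross products
                 for i in range(n) for j in range(i, n))
    return np.array(feats)

# Two inputs -> 1 bias + 2 linear + 3 quadratic = 6 attributes (Z0..Z5)
print(len(degree2_features([3.0, 8.0])))  # -> 6
```

With n=2 inputs this yields the six attributes used in the worked example below.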
One way of deriving the weights is to use the least squares method on a set of training data comprising outputs Yi and attributes (Z1i,Z2i, . . . , ZNi) derived from inputs X1i,X2i, . . . ,XNi to derive weights w that minimize the error between the model output Y′i and the empirical system outputs Yi. The least squares method is described in Numerical Recipes in C Second Edition by Press et al published by Cambridge University Press ©1997, which is herein incorporated by reference.
A model of the form described above may be the basis of a support vector machine. In support vector regression, low dimensional input data is mapped into a high dimensional feature space via a nonlinear mapping and linear regression is used in the high dimensional feature space. Linear regression in the high dimensional feature space corresponds to nonlinear regression in the low dimensional input space.
The problem of deriving the weights for a support vector machine is described in detail in Advances in Kernel Methods: Support Vector Learning, which was previously incorporated by reference. Deriving the weights w is accomplished by minimizing the sum of the empirical risk Remp(f) and a complexity term ∥w∥^2 which determines the complexity of the model.
Remp(f) is an error term that is defined as follows:
Remp(f)=Σi=1, . . . ,L (Yi−Σj=0, . . . ,N wj*Zij)^2
A regularization function Rreg(f) is derived by adding a regularization term to Remp(f):
Rreg(f)=Remp(f)+Lambda*∥w∥^2, where Lambda is a constant to be specified. This mathematical technique for deriving the weights wi using a regularization coefficient is referred to as ridge regression. In the case where Remp(f)=Σi=1, . . . ,L (Yi−Σj=0, . . . ,N wj*Zij)^2, for a given Lambda it is possible to compute the exact solution (w0, . . . ,wN) by computing the partial derivatives of Rreg(f) with respect to w0, . . . ,wN.
At the minimum of Rreg(f), the partial derivatives are equal to zero. This yields N+1 linear equations in N+1 variables that can be solved by inverting a symmetric (N+1, N+1) matrix. A Cholesky algorithm, as described in Numerical Recipes in C, which was previously incorporated by reference, is used in one embodiment to invert the matrix so that the weights that minimize the regularization function are determined.
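The closed-form solve just described can be sketched as follows (an illustrative Python/NumPy sketch, not the patent's implementation; the name `ridge_weights` and the toy data are hypothetical):

```python
import numpy as np

def ridge_weights(Z, y, lam):
    # Minimize Rreg = sum_i (y_i - w . z_i)^2 + lam * ||w||^2.
    # Setting the partial derivatives to zero gives the linear system
    #   (Z^T Z + lam * I) w = Z^T y,
    # whose matrix is symmetric positive definite for lam > 0, so it
    # can be factored by Cholesky as the text suggests.
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    b = Z.T @ y
    L = np.linalg.cholesky(A)              # A = L L^T
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

# Smoke test: with lam ~ 0 the fit recovers an exact linear relation.
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])              # y = 1 + 2*z1
w = ridge_weights(Z, y, 1e-9)
print(np.round(w, 6))                      # ~ [1. 2.]
```

Note that, as in Equation 1 below, the bias weight w0 is penalized along with the other weights.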
Thus, a model can be described that models a system using weights that are derived by minimizing a regularization function. The regularization function includes an error term Remp(f) and a regularization coefficient. The error term shown above is derived from the polynomial least square error. Other error terms may be used. The regularization coefficient constrains the ability of the polynomial least square error to be minimized by increasing the regularization function when large weights are used.
Lambda is a nonzero term that is required to solve the above equations for w0, . . . ,wN. Lambda also has an important interpretation. Referring back to Equation 1, Lambda is multiplied by the sum of the squares of the weights w. As Lambda becomes larger, the regularization term increasingly penalizes models that have large weights. Significantly, Lambda determines the complexity of the model that is derived from a given set of training data. Another way of stating this is that increasing Lambda decreases the VC dimension of the model. The VC dimension, as described in Statistical Learning Theory, which was previously incorporated by reference, is a measure of the complexity of the modeling function. Thus, changing Lambda controls the complexity of the model. It should be noted that, in general, it need not be true (and in fact is not true) that the VC dimension decreases monotonically with an increase in Lambda. Rather, we say that Lambda enforces the VC dimension. That is, the VC dimension generally decreases as Lambda increases, but not necessarily in a monotonic manner.
As Lambda increases, the weights derived from a given training data set become smaller. If Lambda is infinite, the weights converge to 0 and the model reduces to a single constant, the average of the training outputs. In the terminology of support vector regression, it is said that the ∥w∥^2 term enforces the flatness of the feature space and Lambda controls the flatness of the feature space. Lambda is also referred to as a regularization constant.
Example Data Set
At this point it is useful to consider an example with 2 input attributes in the original attribute space using a polynomial of degree 2 to model the system. There are 2 input attributes X1 and X2, and in the high dimensional feature space, there are 6 attributes Z0, Z1, Z2, Z3, Z4, Z5 (here, N=5).
Considering an example with 15 lines of data, the data is first divided into two sets. The first 10 lines of data are the training set and the next 5 lines are used as the cross-validation set.
X1 | X2 | Y | ||
Training | 3 | 8 | 64 | |
1 | 6 | 8 | ||
4 | 4 | 3 | ||
1 | 2 | −24 | ||
5 | 2 | −24 | ||
9 | 4 | −12 | ||
2 | 3 | −14 | ||
6 | 2 | −29 | ||
5 | 5 | 21 | ||
2 | 5 | 6 | ||
Validation | 4 | 5 | 18 | |
8 | 2 | −47 | ||
9 | 4 | −12 | ||
8 | 2 | −47 | ||
9 | 8 | 98 | ||
The attributes Z1, Z2, Z3, Z4, Z5 are computed by normalizing X1 and X2 to get X1N and X2N with zero mean and variance equal to 1. Then we compute:
- Z0=1
- Z1=X1N
- Z2=X2N
- Z3=X1N^2
- Z4=X2N^2
- Z5=X1N*X2N
Accordingly, the target Y is normalized to get YN.
Z1 | Z2 | Z3 | Z4 | Z5 | YN |
Training | |||||
−0.70 | 1.87 | 0.49 | 3.50 | −1.31 | 1.62 |
−1.37 | 0.90 | 1.88 | 0.81 | −1.23 | 0.19 |
−0.36 | −0.06 | 0.13 | 0.00 | 0.02 | 0.06 |
−1.37 | −1.03 | 1.88 | 1.06 | 1.41 | −0.63 |
−0.02 | −1.03 | 0.00 | 1.06 | 0.02 | −0.63 |
1.33 | −0.06 | 1.77 | 0.00 | −0.08 | −0.32 |
−1.03 | −0.55 | 1.06 | 0.30 | 0.57 | −0.37 |
0.31 | −1.03 | 0.10 | 1.06 | −0.32 | −0.76 |
−0.02 | 0.42 | 0.00 | 0.18 | 0.01 | 0.52 |
−1.03 | 0.42 | 1.06 | 0.18 | −0.43 | 0.14 |
Validation | |||||
−0.36 | 0.42 | 0.13 | 0.18 | 0.15 | 0.44 |
0.99 | −1.03 | 0.98 | 1.06 | −1.02 | −1.22 |
1.33 | −0.06 | 1.77 | 0.00 | −0.08 | −0.32 |
0.99 | −1.03 | 0.98 | 1.06 | −1.02 | −1.22 |
1.33 | 1.87 | 1.77 | 3.50 | 2.49 | 2.49 |
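The normalization and attribute computation just tabulated can be reproduced as follows (an illustrative Python/NumPy sketch, not part of the patent; matching the tabulated values appears to require the sample (n−1) standard deviation computed over all 15 rows, which is an inference from the data rather than something the text states):

```python
import numpy as np

# The 15 rows of example data (X1, X2, Y) from the table above.
x1 = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 2, 4, 8, 9, 8, 9], dtype=float)
x2 = np.array([8, 6, 4, 2, 2, 4, 3, 2, 5, 5, 5, 2, 4, 2, 8], dtype=float)
y  = np.array([64, 8, 3, -24, -24, -12, -14, -29, 21, 6,
               18, -47, -12, -47, 98], dtype=float)

def normalize(col):
    # Zero mean, unit variance; the sample (n-1) standard deviation
    # over all 15 rows reproduces the tabulated values.
    return (col - col.mean()) / col.std(ddof=1)

x1n, x2n, yn = normalize(x1), normalize(x2), normalize(y)
Z = np.column_stack([np.ones(15), x1n, x2n, x1n**2, x2n**2, x1n * x2n])
# First training row: Z1..Z5 come out near -0.70, 1.87, 0.49, 3.50, -1.31
# and YN near 1.62, matching the first tabulated row.
print(np.round(Z[0], 2), round(float(yn[0]), 2))
```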
The model has the form:
YN=w0+w1*Z1+w2*Z2+w3*Z3+w4*Z4+w5*Z5
The data set is separated into a training data set and a cross validation data set. In this example, the first 10 lines of data are used for training and the last 5 lines are used for cross validation.
Using the data shown,
Rreg(f)=
(1.62−0.70*w1+1.87*w2+0.49*w3+3.50*w4−1.31*w5)^2
+(0.19−1.37*w1+0.90*w2+1.88*w3+0.81*w4−1.23*w5)^2
+(0.06−0.36*w1−0.06*w2+0.13*w3+0.00*w4+0.02*w5)^2
+(−0.63−1.37*w1−1.03*w2+1.88*w3+1.06*w4+1.41*w5)^2
+(−0.63−0.02*w1−1.03*w2+0.00*w3+1.06*w4+0.02*w5)^2
+(−0.32+1.33*w1−0.06*w2+1.77*w3+0.00*w4−0.08*w5)^2
+(−0.37−1.03*w1−0.55*w2+1.06*w3+0.30*w4+0.57*w5)^2
+(−0.76+0.31*w1−1.03*w2+0.10*w3+1.06*w4−0.32*w5)^2
+(0.52−0.02*w1+0.42*w2+0.00*w3+0.18*w4−0.01*w5)^2
+(0.14−1.03*w1+0.42*w2+1.06*w3+0.18*w4−0.43*w5)^2
+Lambda*(w0^2+w1^2+w2^2+w3^2+w4^2+w5^2)
=Remp(f)+Lambda*∥w∥^2 Equation 1
For any given nonzero Lambda, e.g. from 10^−4 to 10^4, Rreg(f) is a positive definite quadratic form in W=(w0, w1, w2, w3, w4, w5) (a sum of squared terms plus Lambda times the squared norm of the weights, hence positive definite for any nonzero Lambda), so it has a unique minimum in W.
This gives a vector W for every Lambda and the model is expressed by:
YN=w0+w1*Z1+w2*Z2+w3*Z3+w4*Z4+w5*Z5
The remainder of the data set shown below is then used to select an optimum Lambda. The output of the model is compared to the output specified in the validation set for each input. A goal is defined for the comparison and a search is performed by varying Lambda to determine an optimum value of Lambda that best achieves the goal. Selecting an optimum Lambda is described in further detail below.
Z1 | Z2 | Z3 | Z4 | Z5 | YN | ||
Validation | |||||||
−0.36 | 0.42 | 0.13 | 0.18 | −0.15 | 0.44 | ||
0.99 | −1.03 | 0.98 | 1.06 | −1.02 | −1.22 | ||
1.33 | −0.06 | 1.77 | 0.00 | −0.08 | −0.32 | ||
0.99 | −1.03 | 0.98 | 1.06 | −1.02 | −1.22 | ||
1.33 | 1.87 | 1.77 | 3.50 | 2.49 | 2.49 | ||
In different embodiments, the goal can be in various forms. For example, the goal can be to minimize the number of misclassifications on the cross validation set by the model. The goal may also be to minimize the sum of the squares of the errors. Using this latter goal and searching for an optimum Lambda using the Brent method yields Lambda=0.1. Once the optimum Lambda is obtained, the entire data set can be used to derive w0, w1, w2, w3, w4, w5. The resulting model fits the entire data set as well as possible for the given Lambda selected to maximize robustness.
Selecting an Optimum Lambda
The weights wi shown above may be derived by minimizing the regularization function for a given Lambda. Next, a technique will be described for determining an optimum value of Lambda.
Lambda is optimized using the data set aside as the cross-validation data set, separate from the training data set. The cross validation set is used to determine or learn the best value of Lambda for the model. As described above, increasing Lambda in general lowers the VC dimension of the model. The best Lambda, and therefore the best VC dimension, is determined by a process of selecting different values of Lambda, deriving an optimum set of weights for each Lambda, and then evaluating the performance of the resulting model when applied to the cross validation data set. In different embodiments, different criteria are used to evaluate the performance of the model corresponding to each Lambda.
For example, in the embodiment illustrated above, the sum of the squares of the differences between the model outputs and the cross validation data set outputs for corresponding inputs is minimized. Different values of Lambda are selected and a search is made for the Lambda that best achieves that goal. That value of Lambda is adopted as the best Lambda for the criteria selected. Minimizing the sum of the squares of the differences between the model outputs and the cross validation data set outputs is referred to as a goal or a criteria for selecting Lambda. In other embodiments, other criteria are used. For example, minimizing the absolute value of the differences between the model outputs and the cross validation data set outputs may be the goal or minimizing the maximum difference between the model outputs and the cross validation data set output may be the goal. Any criteria may be defined and a search for an optimum Lambda for maximizing or minimizing the criteria can be performed. Any suitable search method for finding an optimum Lambda may be used. In one embodiment, a Newton type minimization method is used to find the best Lambda for the given criteria. Other methods are used in other embodiments, such as the Brent method used above.
In general, as Lambda increases, the error between the model output and the training data set increases because the complexity of the model is constrained. However, the robustness of the model increases as Lambda increases. Selecting a value of Lambda by minimizing an error criteria defined for the cross validation set results in a Lambda with a good trade off between fit and robustness.
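The search just described can be sketched as follows (an illustrative Python/NumPy sketch with synthetic stand-in data; the names `ridge_weights` and `best_lambda` are hypothetical, and a log-spaced grid stands in for the Newton or Brent searches mentioned in the text):

```python
import numpy as np

def ridge_weights(Z, y, lam):
    # Closed-form ridge solve: (Z^T Z + lam*I) w = Z^T y.
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)

def best_lambda(Z_tr, y_tr, Z_cv, y_cv, grid):
    # Fit on the training rows for each candidate Lambda, score the
    # sum of squared errors on the cross validation rows, keep the best.
    best_lam, best_err = None, np.inf
    for lam in grid:
        w = ridge_weights(Z_tr, y_tr, lam)
        err = float(np.sum((y_cv - Z_cv @ w) ** 2))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# Synthetic stand-in data: 30 rows, 6 attributes, first 20 for training.
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 6)); Z[:, 0] = 1.0
y = Z @ np.array([0.5, 1.0, -1.0, 0.0, 0.0, 0.0]) + 0.3 * rng.normal(size=30)
grid = np.logspace(-4, 4, 17)
lam, err = best_lambda(Z[:20], y[:20], Z[20:], y[20:], grid)
print(lam, err)
```

Any other criteria described in the text (misclassification count, maximum absolute error, lift) can be substituted for the sum-of-squares score inside the loop.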
Thus, the steps for deriving a model based on a set of data described above include dividing the set of data into two sets, a training set and a cross-validation set. For a given Lambda, a best Remp(f) on the training set is derived using the regularized least square method, with Lambda being the strength of the regularization (this is sometimes referred to as the vistamboire effect). This yields a model in the form:
Y=w0+w1*Z1+ . . . +wN*ZN
A criteria for evaluating the model is selected and the model is evaluated using the criteria and the cross-validation set. The criteria can be the sum of the squares of the errors for each sample in the cross validation data set or any other criteria that may be desired. In some embodiments, highly specialized criteria are defined based on a specific goal that may be deemed desirable for the model.
For example, in one embodiment, the goal that is defined for the purpose of optimizing Lambda is maximizing the “lift” determined by a “lift curve” for the data in the cross validation set. A lift curve is used by direct marketers to determine how effective a selection of certain members of a population has been in terms of maximizing some characteristic of the users (e.g. net sales or total customer value). The lift compares the performance of the selected group to the expected performance of a randomly selected group and to that of a group selected by a “wizard” having perfect knowledge. Thus, the criteria for selecting Lambda and optimizing the robustness of the model may be very specifically adapted to a specific application of the model.
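One plausible reading of the cumulative lift criterion can be sketched in plain Python (an illustrative sketch, not the patent's definition; the name `lift_at` and the toy data are hypothetical):

```python
def lift_at(scores, outcomes, frac):
    # Cumulative lift at a selection fraction: the rate of positive
    # outcomes among the top-scored `frac` of the population, divided
    # by the base rate a random selection would achieve.
    ranked = sorted(zip(scores, outcomes), key=lambda p: -p[0])
    k = max(1, int(round(frac * len(ranked))))
    top_rate = sum(o for _, o in ranked[:k]) / k
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate

scores   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # model outputs
outcomes = [1,   1,   0,   1,   0,   0]      # observed responses
print(lift_at(scores, outcomes, 1/3))  # top 2 both positive, base rate 0.5 -> 2.0
```

The “wizard” with perfect knowledge gives the upper bound on this quantity, and selecting Lambda to maximize it tailors the model to the marketing application described above.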
Once a criteria is selected, Lambda is optimized for the criteria. It should be noted that, for different criteria, the optimization may be either maximizing a defined parameter (such as lift) or minimizing a defined parameter (such as a type of error).
In some embodiments, once an optimal Lambda is selected, a final model is retrained using the optimum Lambda. The retraining is done by using both the training data set and the cross validation data set to determine the final set of weights w.
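The retraining step can be sketched as follows (an illustrative Python/NumPy sketch; the arrays and the name `ridge_weights` are hypothetical stand-ins):

```python
import numpy as np

def ridge_weights(Z, y, lam):
    # Same closed-form ridge solve used during the Lambda search.
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)

# Hypothetical training and cross validation feature matrices.
rng = np.random.default_rng(1)
Z_tr, Z_cv = rng.normal(size=(10, 6)), rng.normal(size=(5, 6))
y_tr, y_cv = rng.normal(size=10), rng.normal(size=5)

lam_opt = 0.1  # stand-in for the Lambda selected by cross validation
w_final = ridge_weights(np.vstack([Z_tr, Z_cv]),
                        np.concatenate([y_tr, y_cv]), lam_opt)
print(w_final.shape)  # -> (6,)
```

The key point is that Lambda stays fixed at its cross-validated value while the weights are refit on all available rows.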
In a step 312, a first Lambda is selected. Then, in a step 314, the best fit to the training set is found for the selected Lambda. The best fit is found using a selected type of model. As described above, one such model may be a polynomial for which determining the best fit means determining the polynomial coefficients or weights for each of the terms in the polynomial. In other embodiments, other types of modeling functions are used. As described above, Lambda for a polynomial is the regularization coefficient. In other embodiments, Lambda is a parameter that similarly controls the complexity or VC dimension of the model. In a step 316, the best fit model determined in step 314 is evaluated according to a specified criteria using the cross validation set. The criteria specified may be a general criteria such as minimizing the sum of the squared errors between the output of the model and the outputs specified in the cross validation set. In other embodiments, as described above, highly specialized criteria or goals may be specified such as maximizing the lift curve for the model applied to the cross validation set.
In a step 318, the performance of the model according to the criteria is compared to other results for other Lambdas. If, as a result of the comparison, it is determined that the optimum Lambda has been found, then control is transferred to a step 322. If the optimum Lambda has not been found, then control is transferred to a step 320 where the next Lambda is determined and then control is transferred back to step 314. In different embodiments, different methods of selecting Lambdas to check and deciding when an optimum Lambda has been found are used. In general, a search is made for an optimum Lambda that satisfies the criteria and the search is complete when the search method being used determines that improvements in performance gained by selecting other Lambdas fall below some threshold. The threshold may be set in a manner that makes an acceptable tradeoff between speed and how precisely an optimum Lambda is determined.
Once the optimum Lambda has been found, in a step 322 the model is recomputed in a similar manner as is done in step 314 using all of the data (i.e., both the data in the training set and the data in the cross validation set). Next, in a step 324, the model is output. Along with outputting the coefficients or other parameters that specify the model, Lambda may also be output along with various metrics that evaluate the performance of the model. Such metrics may include measures of the error between the outputs derived from the model and the outputs specified in the data set. Performance may also be indicated by the number of misclassifications occurring for the entire data set. The process then ends at 326.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. It should be noted that there are many alternative ways of implementing both the process and apparatus of the present invention. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
Claims (24)
1. A computer-implemented method of generating a robust model of a system comprising:
selecting a modeling function having a set of weights, wherein the modeling function has a complexity that is determined by a complexity parameter;
receiving, via an input interface, model specification data of the modeling function;
retrieving a training data set from a memory;
for each of a plurality of values of the complexity parameter, determining an associated set of weights of the modeling function such that a training error is minimized for the training data set;
determining an error for a cross validation data set for each set of weights associated with one of the plurality of values of the complexity parameter; and
selecting the set of weights associated with a value of the complexity parameter that best satisfies a cross validation criteria, whereby the selected set of weights used with the modeling function provides the robust model,
wherein the cross validation criteria comprises maximizing lift for the data in the cross validation set; and
outputting the set of weights via an output interface.
2. A method of generating a robust model of a system as recited in claim 1 wherein the training error is calculated using a training error criteria that is a function of a difference between training output values associated with training input values determined from the training data set and output values determined from the modeling function and the associated set of weights applied to the training input values.
3. A method of generating a robust model of a system as recited in claim 1 wherein the complexity parameter affects how the training error is minimized.
4. A method of generating a robust model of a system as recited in claim 3 wherein the complexity parameter causes the training error to be decreased for sets of weights that are more complex.
5. A method of generating a robust model of a system as recited in claim 4 wherein the complexity of a modeling function having a set of weights is determined by squared weights of said set.
6. A method of generating a robust model of a system as recited in claim 1 wherein the complexity parameter is a regularization factor.
7. A method of generating a robust model of a system as recited in claim 1 wherein the complexity parameter controls an amount of noise that is added to input data of the training set.
8. A method of generating a robust model of a system as recited in claim 1 wherein the modeling function is a first order polynomial.
9. A method of generating a robust model of a system as recited in claim 1 wherein the modeling function is a second order polynomial.
10. A method of generating a robust model of a system as recited in claim 1 wherein the modeling function is a second order polynomial that includes cross products between input values.
11. A method of generating a robust model of a system as recited in claim 1 wherein the plurality of values of the complexity parameter are selected to best satisfy the cross validation criteria using a Newtonian minimization scheme.
12. A method of generating a robust model of a system as recited in claim 1 wherein the plurality of values of the complexity parameter are selected to best satisfy the cross validation criteria using a Brent method.
13. A method of generating a robust model of a system as recited in claim 1 further including separating an empirical data set into a training data set and a cross validation data set.
14. A method of generating a robust model of a system as recited in claim 1 wherein a threshold is applied to an output of the robust model to classify a set of inputs that generated the output of the robust model.
15. A method of generating a robust model of a system as recited in claim 1 wherein the training error for a training data set having input elements and output elements is defined as a sum of squared differences between said output elements and outputs of the modeling function associated with corresponding ones of said input elements.
16. A method of generating a robust model of a system as recited in claim 1 wherein the training error for a training data set having input elements and output elements is defined as a sum of differences between said output elements and outputs of the modeling function associated with corresponding ones of said input elements.
17. A method of generating a robust model of a system as recited in claim 1 wherein the training error for a training data set having input elements and output elements is defined as a maximum difference between output elements of the training data and outputs of the modeling function associated with corresponding ones of said input elements.
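Claims 15 through 17 name three alternative training error criteria. Assuming the "sum of differences" of claim 16 is read as a sum of absolute differences, the three can be computed as follows (the function name is illustrative):

```python
import numpy as np

def training_errors(y_true, y_pred):
    """The three training error criteria of claims 15-17."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return {
        "sum_squared": float(np.sum(diff ** 2)),     # claim 15
        "sum_abs": float(np.sum(np.abs(diff))),      # claim 16 (read as absolute differences)
        "max_abs": float(np.max(np.abs(diff))),      # claim 17
    }

errs = training_errors([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(errs)  # {'sum_squared': 1.25, 'sum_abs': 1.5, 'max_abs': 1.0}
```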
18. A method of generating a robust model of a system as recited in claim 1 further including normalizing the training data.
19. A method of generating a robust model of a system as recited in claim 1 further including splitting a set of data into a training data set and a cross validation data set.
20. A method of generating a robust model of a system as recited in claim 1 further including recalculating the set of weights using both the training data set and the cross validation data set.
21. A method of generating a robust model of a system as recited in claim 1 wherein the cross validation criteria is maximizing lift.
22. A method of generating a robust model of a system as recited in claim 1 wherein the cross validation criteria is minimizing a measure of error between the robust model and the cross validation set.
23. A method of generating a robust model of a system comprising:
selecting a modeling function having a set of weights wherein the modeling function has a complexity that is determined by a complexity parameter;
for each of a plurality of values of the complexity parameter, determining an associated set of weights of the modeling function such that a training error is minimized for a training data set;
determining a cross validation error for a cross validation data set for each set of weights associated with one of the plurality of values of the complexity parameter;
determining an optimal value of the complexity parameter that minimizes the cross validation error; and
determining an output set of weights of the modeling function using the optimal value of the complexity parameter and an aggregate training data set that includes the training data set and the cross validation data set such that an aggregate training error is minimized for the aggregate training data set;
whereby the output set of weights used with the modeling function provides the robust model.
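The method of claim 23 can be sketched end to end, assuming a ridge-style complexity parameter as in claims 5 and 6; all names and data below are illustrative, not from the patent:

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Weights minimizing training error plus lam times squared weights.
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def robust_model(X_train, y_train, X_cv, y_cv, lambdas):
    """Sketch of claim 23:
    1. For each complexity value, fit weights on the training set.
    2. Score each weight set by its cross validation error.
    3. Refit with the optimal complexity value on the aggregate
       (training + cross validation) data set.
    """
    cv_err = [np.sum((X_cv @ fit_ridge(X_train, y_train, lam) - y_cv) ** 2)
              for lam in lambdas]
    best_lam = lambdas[int(np.argmin(cv_err))]
    X_all = np.vstack([X_train, X_cv])
    y_all = np.concatenate([y_train, y_cv])
    return fit_ridge(X_all, y_all, best_lam), best_lam

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.2 * rng.normal(size=80)
w, lam = robust_model(X[:60], y[:60], X[60:], y[60:], [0.01, 0.1, 1.0, 10.0])
```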
24. A robust modeling engine comprising:
a memory configured to store a training data set and a cross validation data set;
an input interface configured to receive model specification data of a modeling function;
a processor configured to:
select a modeling function having a set of weights, wherein the modeling function has a complexity that is determined by a complexity parameter;
for each of a plurality of values of the complexity parameter, determine an associated set of weights of the modeling function such that a training error is minimized for a training data set;
determine a quantity for a cross validation data set for each set of weights associated with one of the plurality of values of the complexity parameter; and
select the set of weights associated with the complexity parameter that best satisfies a cross validation criteria; and
an output interface configured to output the set of weights associated with the value of the complexity parameter that best satisfies a cross validation criteria, wherein the cross validation criteria comprises maximizing lift.
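The lift criterion of claims 21 and 24 is commonly computed as the response rate among the top-scoring fraction of examples divided by the overall response rate. A hypothetical helper (the name, fraction, and data are illustrative):

```python
import numpy as np

def lift_at(scores, labels, fraction=0.1):
    """Response rate in the top-scoring fraction divided by the overall
    response rate. Maximizing this over candidate weight sets is the
    cross validation criteria of claims 21 and 24.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    k = max(1, int(len(scores) * fraction))
    top = labels[np.argsort(scores)[::-1][:k]]
    return float(np.mean(top) / np.mean(labels))

# A model that ranks all positives first achieves the maximum lift.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0]
print(round(lift_at(scores, labels, fraction=0.4), 3))  # 2.5
```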
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/494,753 USRE42440E1 (en) | 1999-10-14 | 2006-07-28 | Robust modeling |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/418,537 US6523015B1 (en) | 1999-10-14 | 1999-10-14 | Robust modeling |
US11/494,753 USRE42440E1 (en) | 1999-10-14 | 2006-07-28 | Robust modeling |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/418,537 Reissue US6523015B1 (en) | 1999-10-14 | 1999-10-14 | Robust modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
USRE42440E1 true USRE42440E1 (en) | 2011-06-07 |
Family
ID=23658541
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/418,537 Ceased US6523015B1 (en) | 1999-10-14 | 1999-10-14 | Robust modeling |
US11/494,753 Expired - Lifetime USRE42440E1 (en) | 1999-10-14 | 2006-07-28 | Robust modeling |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/418,537 Ceased US6523015B1 (en) | 1999-10-14 | 1999-10-14 | Robust modeling |
Country Status (5)
Country | Link |
---|---|
US (2) | US6523015B1 (en) |
EP (2) | EP1727051A1 (en) |
AU (1) | AU1853401A (en) |
CA (2) | CA2353992A1 (en) |
WO (1) | WO2001027789A2 (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
NL1018387C2 (en) * | 2001-06-26 | 2003-01-07 | Ecicm B V | Linear motor with improved function approximator in the control system. |
US7060632B2 (en) * | 2002-03-14 | 2006-06-13 | Amberwave Systems Corporation | Methods for fabricating strained layers on semiconductor substrates |
US7283982B2 (en) * | 2003-12-05 | 2007-10-16 | International Business Machines Corporation | Method and structure for transform regression |
US20050210015A1 (en) * | 2004-03-19 | 2005-09-22 | Zhou Xiang S | System and method for patient identification for clinical trials using content-based retrieval and learning |
US8170841B2 (en) * | 2004-04-16 | 2012-05-01 | Knowledgebase Marketing, Inc. | Predictive model validation |
US20050234761A1 (en) * | 2004-04-16 | 2005-10-20 | Pinto Stephen K | Predictive model development |
US8165853B2 (en) * | 2004-04-16 | 2012-04-24 | Knowledgebase Marketing, Inc. | Dimension reduction in predictive model development |
US7561158B2 (en) * | 2006-01-11 | 2009-07-14 | International Business Machines Corporation | Method and apparatus for presenting feature importance in predictive modeling |
US20110160555A1 (en) * | 2008-07-31 | 2011-06-30 | Jacques Reifman | Universal Models for Predicting Glucose Concentration in Humans |
US20100125585A1 (en) * | 2008-11-17 | 2010-05-20 | Yahoo! Inc. | Conjoint Analysis with Bilinear Regression Models for Segmented Predictive Content Ranking |
US8498845B2 (en) | 2010-04-21 | 2013-07-30 | Exxonmobil Upstream Research Company | Method for geophysical imaging |
US8380605B2 (en) * | 2010-09-22 | 2013-02-19 | Parametric Portfolio Associates, Llc | System and method for generating cross-sectional volatility index |
US8880446B2 (en) | 2012-11-15 | 2014-11-04 | Purepredictive, Inc. | Predictive analytics factory |
US10423889B2 (en) | 2013-01-08 | 2019-09-24 | Purepredictive, Inc. | Native machine learning integration for a data management product |
WO2014189523A1 (en) * | 2013-05-24 | 2014-11-27 | Halliburton Energy Services, Inc. | Methods and systems for reservoir history matching for improved estimation of reservoir performance |
US9218574B2 (en) | 2013-05-29 | 2015-12-22 | Purepredictive, Inc. | User interface for machine learning |
US9646262B2 (en) | 2013-06-17 | 2017-05-09 | Purepredictive, Inc. | Data intelligence using machine learning |
US10068186B2 (en) * | 2015-03-20 | 2018-09-04 | Sap Se | Model vector generation for machine learning algorithms |
RU2632133C2 (en) | 2015-09-29 | 2017-10-02 | Общество С Ограниченной Ответственностью "Яндекс" | Method (versions) and system (versions) for creating prediction model and determining prediction model accuracy |
RU2692048C2 (en) | 2017-11-24 | 2019-06-19 | Общество С Ограниченной Ответственностью "Яндекс" | Method and a server for converting a categorical factor value into its numerical representation and for creating a separating value of a categorical factor |
RU2693324C2 (en) | 2017-11-24 | 2019-07-02 | Общество С Ограниченной Ответственностью "Яндекс" | Method and a server for converting a categorical factor value into its numerical representation |
US11314783B2 (en) | 2020-06-05 | 2022-04-26 | Bank Of America Corporation | System for implementing cognitive self-healing in knowledge-based deep learning models |
US11756290B2 (en) | 2020-06-10 | 2023-09-12 | Bank Of America Corporation | System for intelligent drift matching for unstructured data in a machine learning environment |
US11475332B2 (en) | 2020-07-12 | 2022-10-18 | International Business Machines Corporation | Selecting forecasting models by machine learning based on analysis of model robustness |
US11429601B2 (en) | 2020-11-10 | 2022-08-30 | Bank Of America Corporation | System for generating customized data input options using machine learning techniques |
US11966360B2 (en) | 2021-01-04 | 2024-04-23 | Bank Of America Corporation | System for optimized archival using data detection and classification model |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5592589A (en) | 1992-07-08 | 1997-01-07 | Massachusetts Institute Of Technology | Tree-like perceptron and a method for parallel distributed training of such perceptrons |
US5640492A (en) | 1994-06-30 | 1997-06-17 | Lucent Technologies Inc. | Soft margin classifier |
US5649068A (en) | 1993-07-27 | 1997-07-15 | Lucent Technologies Inc. | Pattern recognition system using support vectors |
US5659667A (en) * | 1995-01-17 | 1997-08-19 | The Regents Of The University Of California Office Of Technology Transfer | Adaptive model predictive process control using neural networks |
US5684929A (en) | 1994-10-27 | 1997-11-04 | Lucent Technologies Inc. | Method and apparatus for determining the limit on learning machine accuracy imposed by data quality |
US5720003A (en) | 1994-10-27 | 1998-02-17 | Lucent Technologies Inc. | Method and apparatus for determining the accuracy limit of a learning machine for predicting path performance degradation in a communications network |
US5745383A (en) * | 1996-02-15 | 1998-04-28 | Barber; Timothy P. | Method and apparatus for efficient threshold inference |
US5819247A (en) * | 1995-02-09 | 1998-10-06 | Lucent Technologies, Inc. | Apparatus and methods for machine learning hypotheses |
US5987444A (en) * | 1997-09-23 | 1999-11-16 | Lo; James Ting-Ho | Robust neutral systems |
US6393413B1 (en) * | 1998-02-05 | 2002-05-21 | Intellix A/S | N-tuple or RAM based neural network classification system and method |
US6714925B1 (en) * | 1999-05-01 | 2004-03-30 | Barnhill Technologies, Llc | System for identifying patterns in biological data using a distributed network |
US6760715B1 (en) * | 1998-05-01 | 2004-07-06 | Barnhill Technologies Llc | Enhancing biological knowledge discovery using multiples support vector machines |
US7072841B1 (en) * | 1999-04-29 | 2006-07-04 | International Business Machines Corporation | Method for constructing segmentation-based predictive models from data that is particularly well-suited for insurance risk or profitability modeling purposes |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2786002B1 (en) | 1998-11-17 | 2001-02-09 | Sofresud | MODELING TOOL WITH CONTROLLED CAPACITY |
1999
- 1999-10-14 US US09/418,537 patent/US6523015B1/en not_active Ceased
2000
- 2000-10-13 AU AU18534/01A patent/AU1853401A/en not_active Abandoned
- 2000-10-13 WO PCT/EP2000/010114 patent/WO2001027789A2/en active Application Filing
- 2000-10-13 EP EP06016693A patent/EP1727051A1/en not_active Withdrawn
- 2000-10-13 EP EP00981200A patent/EP1224562A2/en not_active Ceased
- 2000-10-13 CA CA002353992A patent/CA2353992A1/en not_active Abandoned
- 2000-10-13 CA CA2550180A patent/CA2550180C/en not_active Expired - Lifetime
2006
- 2006-07-28 US US11/494,753 patent/USRE42440E1/en not_active Expired - Lifetime
Non-Patent Citations (15)
Title |
---|
"La Théorie Statistique de Vladimir N. Vapnik", Fiche de Référence, Conférence Nouvelles méthodes en statistiques appliquées-Application aux problèmes de prévision, May 6, 1994, Neuristique S.A., Paris. |
Grace Wahba, "Generalization and Regularization in Nonlinear Learning Systems", The Handbook of Brain Theory and Neural Networks, Michael Arbib editor, MIT Press, 1995, pp. 426-430. |
I. Guyon, V. Vapnik, B. Boser, L. Bottou and S.A. Solla, "Structural Risk Minimization for Character Recognition", Advances in Neural Information Processing Systems 4, Morgan Kaufmann 1992, pp. 471-479. |
I. Guyon, V. Vapnik, B. Boser, L. Bottou and S.A. Solla: Capacity control in linear classifiers for pattern recognition, Proceedings of the 11th IAPR International Conference on Pattern Recognition, Conference B: Pattern Recognition Methodology and Systems , II:385-388, IEEE, Sep. 1992. |
Isabelle Guyon, "A Scaling Law for the Validation-Set Training-Set Size Ratio", AT&T Bell Laboratories, 1997. |
Joel Ratsaby et al., Towards Robust Model Selection Using Estimation and Approximation Error Bounds, 1996, ACM, 57-67. * |
Léon Bottou, "La Mise en Oeuvre des Idées de Vladimir N. Vapnik", Statistiques et Méthodes Neuronales, chapter 16, Dunod, Paris, 1997, pp. 262-274. |
Michael P. Perrone, "Averaging/Modular Techniques for Neural Networks", The Handbook of Brain Theory and Neural Networks, Michael Arbib editor, MIT Press, 1995, pp. 126-129. |
Paul J. Werbos, "Backpropagation: Basics and New Developments", The Handbook of Brain Theory and Neural Networks, Michael Arbib editor, MIT Press, 1995, pp. 134-139. |
Richard P. Brent; Fast Training Algorithms for Multilayer Neural Nets; 1991; IEEE; 1045-9227/91/0500-0346; 346-354. |
Tan, Y., et al., A Support Vector Machine with a Hybrid Kernel and Minimal Vapnik-Chervonenkis Dimension, IEEE Transactions on knowledge and Data Engineering, vol. 16., No. 4, Apr. 2004, pp. 385-395. * |
Vladimir Cherkassky et al; Regularization Effect of Weight Initialization in Back Propagation Networks; 1998; IEEE; 0-7803-4859-1/98; 2258-2261. |
Vladimir N. Vapnik, "Controlling the Generalization Ability of Learning Processes", The Nature of Statistical Learning Theory, chapter 4, Springer-Verlag, New-York, 1995, pp. 89-118. |
W.H. Highleyman, "The Design and Analysis of Pattern Recognition Experiments", The Bell System Technical Journal, Mar. 1962, pp. 723-744. |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9465773B2 (en) | 2012-08-17 | 2016-10-11 | International Business Machines Corporation | Data-driven distributionally robust optimization |
US9946972B2 (en) | 2014-05-23 | 2018-04-17 | International Business Machines Corporation | Optimization of mixed-criticality systems |
CN111291657A (en) * | 2020-01-21 | 2020-06-16 | 同济大学 | Crowd counting model training method based on difficult case mining and application |
CN111291657B (en) * | 2020-01-21 | 2022-09-16 | 同济大学 | Crowd counting model training method based on difficult case mining and application |
Also Published As
Publication number | Publication date |
---|---|
EP1727051A1 (en) | 2006-11-29 |
WO2001027789A2 (en) | 2001-04-19 |
CA2550180A1 (en) | 2001-04-19 |
AU1853401A (en) | 2001-04-23 |
CA2353992A1 (en) | 2001-04-19 |
CA2550180C (en) | 2011-11-29 |
EP1224562A2 (en) | 2002-07-24 |
US6523015B1 (en) | 2003-02-18 |
WO2001027789A3 (en) | 2002-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
USRE42440E1 (en) | Robust modeling | |
US10510003B1 (en) | Stochastic gradient boosting for deep neural networks | |
JP7315748B2 (en) | Data classifier training method, data classifier training device, program and training method | |
Chan et al. | Bayesian poisson regression for crowd counting | |
US6728690B1 (en) | Classification system trainer employing maximum margin back-propagation with probabilistic outputs | |
US7043462B2 (en) | Approximate fitness functions | |
US6516309B1 (en) | Method and apparatus for evolving a neural network | |
US7162085B2 (en) | Pattern recognition method and apparatus | |
CN111144552B (en) | Multi-index grain quality prediction method and device | |
CN108734321A (en) | A kind of short-term load forecasting method based on the Elman neural networks for improving ABC algorithms | |
US20220187772A1 (en) | Method and device for the probabilistic prediction of sensor data | |
CN112529683A (en) | Method and system for evaluating credit risk of customer based on CS-PNN | |
CN115952832A (en) | Adaptive model quantization method and apparatus, storage medium, and electronic apparatus | |
CN110895772A (en) | Electricity sales amount prediction method based on combination of grey correlation analysis and SA-PSO-Elman algorithm | |
CN113705724B (en) | Batch learning method of deep neural network based on self-adaptive L-BFGS algorithm | |
CN114202065B (en) | Stream data prediction method and device based on incremental evolution LSTM | |
CN112541530B (en) | Data preprocessing method and device for clustering model | |
JP2020204909A (en) | Machine learning device | |
US12033658B2 (en) | Acoustic model learning apparatus, acoustic model learning method, and program | |
CN110728292A (en) | Self-adaptive feature selection algorithm under multi-task joint optimization | |
CN117877587A (en) | Deep learning algorithm of whole genome prediction model | |
KR20190129422A (en) | Method and device for variational interference using neural network | |
CN112801971A (en) | Target detection method based on improvement by taking target as point | |
CN113947030A (en) | Equipment demand prediction method based on gradient descent gray Markov model | |
CN113642784A (en) | Wind power ultra-short term prediction method considering fan state |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FPAY | Fee payment |
Year of fee payment: 12 |