WO2023280316A1 - Data analysis method, pricing method and related device based on an improved XGBoost-class method - Google Patents


Info

Publication number
WO2023280316A1
WO2023280316A1 · PCT/CN2022/104694 · CN2022104694W
Authority
WO
WIPO (PCT)
Prior art keywords: distribution, parameter, xgboost, pricing, class
Prior art date
Application number
PCT/CN2022/104694
Other languages
English (en)
French (fr)
Inventor
杨光
Original Assignee
杨光
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杨光
Publication of WO2023280316A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/004 — Artificial life, i.e. computing arrangements simulating life
    • G06N5/00 — Computing arrangements using knowledge-based models
    • G06N5/01 — Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 — Commerce
    • G06Q30/02 — Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 — Market modelling; Market analysis; Collecting market data
    • G06Q30/0206 — Price or cost determination based on market factors
    • G06Q40/00 — Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 — Insurance

Definitions

  • the invention relates to machine learning technology and actuarial technology, and in particular to a corresponding big data analysis method.
  • the insurance company measures the net (pure) premium of the insured, i.e., the expected net compensation of the insured. Because non-life insurance periods are short, the pure premium here does not consider interest. To measure the pure premium, it is best to estimate the probability distribution of the loss (payment) amount (per accident, or summed over the insurance period), rather than merely its expected value: in compensatory insurance there is generally a deductible (or limit) applied to a single accident's loss or to the total loss over the insurance period, and only with the probability distribution of the loss (compensation) can an adjustment of the deductible (or limit) be translated into the corresponding adjustment of the pure premium.
  • under the standard assumptions, the methods for finding the probability distribution of the total loss (total compensation) include characteristic-function transform methods (Fourier-transform methods) and stochastic simulation methods.
  • for assumption b, because there are too many parameters to estimate, there is a risk of over-fitting, so it is rarely used in industry.
  • methods of the second category are more refined and offer many advantages over those of the first category.
  • the XGBoost method is an extreme gradient boosting tree method, which has excellent prediction performance and has achieved very good results in many fields.
  • a sample set $D=\{(x_i,y_i)\}$ ($i=1,\dots,n$; $x_i\in\mathbb{R}^m$, $y_i\in\mathbb{R}$) has m features and n samples.
  • an ensemble tree model predicts by adding K tree functions together: $\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$.
  • $\Omega(f_k)$ is a regularization term.
  • the XGBoost algorithm uses the boosting-tree algorithm to minimize the objective function: let $\hat{y}_i^{(t-1)}$ be the prediction for the i-th sample after the (t−1)-th iteration, add a function $f_t$ to it, and minimize the objective $L^{(t)}=\sum_{i=1}^{n}l\big(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t)$, approximated by its second-order Taylor expansion with $g_i=\partial_{\hat{y}^{(t-1)}}l(y_i,\hat{y}^{(t-1)})$ and $h_i=\partial^2_{\hat{y}^{(t-1)}}l(y_i,\hat{y}^{(t-1)})$.
  • for a fixed tree structure, the optimal objective function value is $\tilde{L}^{(t)}(q)=-\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$, where $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$.
  • the tree structure q is obtained using a greedy algorithm that iteratively adds branches starting from a single leaf node.
  • the split gain $\tfrac{1}{2}\Big[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}-\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}\Big]-\gamma$ is used to evaluate candidate split points.
  • the shrinkage technique scales the newly added tree weights by a factor η after each boosting step, and is also used to prevent overfitting.
  • for a fixed tree structure, find the value of each leaf weight $w_j$ that minimizes the objective, giving the optimal weight score of leaf node j: $w_j^{*}=-\frac{G_j}{H_j+\lambda}$.
  • the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
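The leaf-weight and split-gain formulas above can be sketched numerically. The sketch below assumes the standard XGBoost expressions $w_j^{*}=-G_j/(H_j+\lambda)$ and the candidate-split gain; the squared-error loss and the toy data are illustrative assumptions:

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -G / (H + lambda) for gradients g and hessians h."""
    return -np.sum(g) / (np.sum(h) + lam)

def leaf_score(g, h, lam):
    """Contribution of one leaf to the optimal objective: -G^2 / (2 (H + lambda))."""
    G, H = np.sum(g), np.sum(h)
    return -0.5 * G * G / (H + lam)

def split_gain(g, h, mask, lam, gamma):
    """Gain of splitting a node into left (mask) and right (~mask) children."""
    parent = leaf_score(g, h, lam)
    left = leaf_score(g[mask], h[mask], lam)
    right = leaf_score(g[~mask], h[~mask], lam)
    return parent - (left + right) - gamma

# squared-error loss: g_i = yhat_i - y_i, h_i = 1
y = np.array([1.0, 2.0, 10.0, 11.0])
yhat = np.zeros(4)
g, h = yhat - y, np.ones(4)
mask = np.array([True, True, False, False])   # candidate split: first two vs last two
w = leaf_weight(g, h, lam=0.0)                # with lam = 0 this is the mean of y
gain = split_gain(g, h, mask, lam=0.0, gamma=0.0)
```

With squared-error loss and λ = γ = 0 the optimal leaf weight reduces to the mean of the targets in the leaf, which the toy data confirms.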
  • the XGBoost algorithm places stricter requirements on the loss function l: it must be second-order differentiable and convex. If l is not a globally convex function, there is no guarantee that the original objective function converges to the global minimum. An example is as follows:
  • if the learning rate η is not controlled, the loss may be concave in a neighborhood of $\hat{y}_i^{(t-1)}$ after the (t−1)-th iteration, so that its first derivative $g_i$ is positive while its second derivative $h_i$ is negative; the optimal weight score of the sample in the t-th iteration then moves in the wrong direction or diverges.
  • existing XGBoost-like methods are limited to fitting single-parameter probability distributions.
  • existing XGBoost methods cannot simultaneously optimize multiple parameters, and thus often cannot attain optimal prediction performance. For example, if the loss frequency in general insurance pricing follows a two-parameter negative binomial distribution, fitting it with a single-parameter Poisson distribution is inappropriate.
  • the object of the present invention is to provide a data analysis method based on an improved XGBoost method, thereby effectively improving the performance of big data analysis and prediction.
  • the present invention further provides a pricing method based on an improved XGBoost method, which effectively overcomes the defects in the existing solutions.
  • the data analysis method based on the improved XGBoost-class method provided by the present invention uses the improved XGBoost-class method to predict and evaluate based on the obtained variable parameters; the improved XGBoost-class method modifies the second-order Taylor expansion that serves as the approximate expression of the objective function in the XGBoost-class algorithm.
  • the improved XGBoost method extends the XGBoost method from univariate prediction to multi-parameter prediction of parameter probability distribution, forming a multi-cycle improved XGBoost data analysis method.
  • within the scope of discussion, the loss function is set to be second-order differentiable; it has one and only one local minimum point, the derivative is 0 only at that point, or the function is strictly monotonic.
  • the present invention provides a pricing method, which performs non-life insurance actuarial pricing based on the above data analysis method.
  • the pricing method includes:
  • the sample set is divided into a training set, a validation set and a test set; the training set is used to train the learning model that predicts the predictor variable, the validation set is used to tune hyperparameters, and the test set is used to evaluate the learning model's performance;
  • the pricing method obtains the conditional probability distribution of predictor variables based on the improved XGBoost method, including:
  • the expression for the expected value of the predictor variable is taken as the expectation parameter, the probability-distribution expression is rewritten accordingly, the expectation parameter is used as the predicted parameter, and the parameters other than the predicted parameter are treated as nuisance parameters and hyperparameters; if the distribution expression already contains the expectation parameter, no rewriting is needed, and the predicted parameter and hyperparameters are set directly;
  • the present invention provides a data analysis method that directly extends the improved XGBoost-class method to the multivariate case, forming a multivariate regularized boosting-tree method; this method corrects the second-order Taylor expansion that approximates the objective function in the XGBoost-class algorithm, modifying its $h_i$-related terms so that the applicability of the improved XGBoost-class method is no longer limited to convex loss functions.
  • This method can optimize and solve multiple variables in the multivariate loss function (ie, the parameters to be estimated under consideration) at the same time.
  • the loss function l is set, within the scope of discussion, to be: (1) second-order or first-order differentiable, with one and only one local minimum point; (2) after any parameter to be estimated is selected as the variable of interest, with the remaining parameters fixed, there is one and only one local minimum point; the partial derivative with respect to that parameter is 0 only at this local minimum point, or the function is strictly monotonic.
  • $y_i$, being an observed value, is regarded as a fixed parameter, not as a variable or a parameter to be estimated.
  • the scope of discussion of the parameters to be estimated can be chosen with reasonable freedom; in practical applications, reasonable predictions will not fall exactly on the theoretical boundary points.
  • the interval discussed can be taken as a closed interval, whose boundary can be kept a reasonable distance from the theoretical boundary points.
  • $\Omega(f)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2}$ is a regularization term;
  • an $l_1$ regularization term $\alpha\sum_{j=1}^{T}\lvert w_j\rvert$ can also be added to Ω:
  • the differentiability condition for the loss function can be relaxed to first-order differentiability.
  • the present invention provides a pricing method, which performs actuarial pricing based on the above data analysis method.
  • the pricing method includes:
  • the sample set is divided into a training set, a verification set and a test set;
  • the training set is used to train the learning model that predicts the parameters to be estimated of the parametric distribution,
  • the verification set is used to adjust hyperparameters, and
  • the test set is used to evaluate learning model performance;
  • the pricing method obtains the conditional probability distribution of the predictor variables based on the multivariate regularized boosting tree method, including:
  • the invention adopts the improved XGBoost method for data analysis, effectively overcoming various defects in the prior art solutions.
  • the data analysis method based on the multi-round cyclic improved XGBoost-class method provided by the present invention uses the improved XGBoost-class method for cyclic multi-parameter modeling, further improving the model's prediction performance.
  • the multivariate regularized boosting-tree method provided by the present invention further improves the prediction performance of big-data prediction methods including the non-life insurance pricing method, and improves computational efficiency and model interpretability.
  • the present invention further provides a computer-readable storage medium on which a program is stored, and when the program is executed by a processor, the steps of the above-mentioned data analysis method or pricing method are implemented.
  • the present invention further provides a processor, the processor is used to run a program, and when the program runs, the steps of the above data analysis method or pricing method are implemented.
  • the present invention further provides a terminal device, which includes a processor, a memory, and a program stored in the memory and runnable on the processor; the program code is loaded and executed by the processor to implement the steps of the above data analysis method or pricing method.
  • the present invention further provides a computer program product, which is suitable for performing the steps of the data analysis method or the pricing method when executed on the data processing device.
  • Figure 1 is an example diagram of a non-convex loss function image in the existing XGBoost algorithm
  • Figure 2 is an example diagram of a non-convex loss function image when predicting loss strength in Example 2;
  • Figure 3 is an example diagram of a non-convex loss function image when predicting the number of losses in Example 2;
  • Fig. 4 is an example image of the loss function l after the corresponding parameters are fixed in Example 3;
  • Fig. 5 is an example image of the loss function l after the corresponding parameters are fixed in Example 4.
  • this solution improves the XGBoost method, realizes the combination of accurate prediction performance and traditional statistical technology, and further improves the prediction performance.
  • non-life insurance pricing is taken as an example.
  • this scheme applies the obtained improved XGBoost-class method and the derived multivariate regularized boosting-tree method to non-life insurance pricing, thereby effectively overcoming the deficiencies of the prior art set forth in the background while retaining its advantages; excellent prediction performance for the number of losses (payments), the loss (payment) intensity and the total loss amount (or total compensation amount) is achieved, realizing the ideal effect of measuring pure premiums.
  • the corresponding improved XGBoost method is constructed by improving the XGBoost method, so as to overcome the requirement that the loss function of the XGBoost method in the prior art must be a convex function.
  • the loss function l is set to be the negative log-likelihood function of the predictor variable's probability distribution. It is further set that, within the scope of discussion, l has second-order partial derivatives; there is one and only one local minimum point, the derivative is 0 only at that point, or the function is strictly monotonic.
  • the value of g i can be set, so that
  • the optimal objective function value is:
  • the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
  • This formula is used to calculate candidate split points.
  • the optimal objective function value is:
  • the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
  • This formula is used to calculate candidate split points.
  • the optimal objective function value is:
  • the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
  • This formula is used to calculate candidate split points.
  • the improved XGBoost-class method can also be applied.
  • the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
  • M can be set from prior experience, or treated as a hyperparameter.
  • the maximum-likelihood estimate of the predicted random variable can be used as the initial iteration value of the predictor variable, so as to improve the algorithm's convergence speed and the interpretability of the model.
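Initializing the boosting iteration at the maximum-likelihood estimate can be sketched for a Poisson frequency model, where the MLE of the mean (ignoring the features $x_i$) is simply the sample mean; the Poisson choice and the toy claim counts are illustrative assumptions:

```python
import numpy as np

def poisson_mle_init(y):
    """MLE of the Poisson mean lambda (ignoring features x_i) is the sample mean;
    with a log link the boosting iteration starts at F0 = log(lambda_hat)."""
    lam_hat = np.mean(y)
    return np.log(lam_hat)

claims = np.array([0, 1, 0, 2, 1, 0, 0, 3, 1, 2])   # observed loss counts
F0 = poisson_mle_init(claims)                        # initial raw score for every sample
```

Starting every sample at the same MLE-based score means the first boosting rounds only have to learn deviations from the portfolio-wide average, which is where the convergence-speed and interpretability benefits come from.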
  • the improved XGBoost class method formed in Example 1 is used to form a non-life insurance pricing method.
  • the negative log-likelihood function is used as the loss function
  • the mean parameter is used as the estimated parameter of the XGBoost class method.
  • this example uses the improved XGBoost-class method to improve the calculation of the probability distribution of the loss (payment) intensity or the number of losses (payments) in non-life insurance pricing. It mainly includes the following steps:
  • Collect sample data including sample attributes and observed values of predictor variables.
  • the sample attributes may include vehicle model, mileage driven, vehicle price, age of the car owner, prior-year claims, traffic-violation records, etc.
  • the observed value of the predictor variable is the single-accident loss amount within the insurance period.
  • the training set is used to train the learning model that predicts the predictor variable
  • the validation set is used to adjust the hyperparameters
  • the test set is used to evaluate the model performance.
  • the hold-out method, k-fold cross-validation, etc. can be used.
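The hold-out method mentioned above can be sketched as a random index split; the 60/20/20 proportions and the seed are illustrative assumptions, not values fixed by the method:

```python
import numpy as np

def holdout_split(n, frac_train=0.6, frac_valid=0.2, seed=0):
    """Shuffle sample indices and split them into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * frac_train)
    n_valid = int(n * frac_valid)
    return (idx[:n_train],                     # train: fit the learning model
            idx[n_train:n_train + n_valid],    # validation: tune hyperparameters
            idx[n_train + n_valid:])           # test: evaluate model performance

train, valid, test = holdout_split(1000)
```

The three index sets are disjoint by construction, so no sample used for hyperparameter tuning or final evaluation leaks into training.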
  • the process of using the improved XGBoost class method to obtain the conditional probability distribution of the predictor variable includes:
  • the expectation expression of the distribution is substituted into the parametric distribution, the expectation is used as a parameter of the probability distribution (the expectation parameter), and the expectation parameter is further used as the predictor variable to be estimated in the improved XGBoost-class method; if the distribution expression already contains the expectation parameter, no rewriting is needed, and the predicted parameter and hyperparameters are set directly.
  • a link can also be applied to the expectation parameter, for example a logarithmic link. Adding a link amounts to a different parameterization; every parameterization has a corresponding loss function, and the method applies as long as its conditions are met.
  • the improved XGBoost method can combine the advantages of the generalized linear model method and the XGBoost method, and overcome their respective shortcomings.
  • this example adds an evaluation index method to the improved XGBoost class method, using the loss function of the training set as the evaluation index of the verification set and the test set, so that the loss function and evaluation index are perfectly unified.
  • so the objective function can be solved optimally; using the log-likelihood function of the predictor variable's probability distribution, or its negative, as the evaluation index conforms to statistical convention.
  • the specific method to obtain the conditional probability distribution of the predictor variable is as follows:
  • the type of distribution for predicting the random variable Y is selected empirically from the candidate parametric distributions.
  • Y i are independent of each other (conditionally independent with their own characteristics and parameters).
  • taking the loss function over the entire training set, minimize the following objective function with the improved XGBoost-class method:
  • the training set is trained by the improved XGBoost class method
  • the above process obtains the estimated value of ⁇ i .
  • Scaled distribution: if a random variable follows a parametric distribution, and multiplying it by a positive constant yields a new random variable that still follows the same parametric distribution, that parametric distribution is called a scaled distribution.
  • Scaling parameter: let a random variable follow a scaled distribution with non-negative support. A parameter of the scaled distribution is called a scaling parameter if it satisfies the following two conditions: when the random variable is multiplied by a positive constant to form a new random variable, the scaling parameter of the new scaled distribution is multiplied by the same positive constant, and the remaining parameters of the new scaled distribution are unchanged.
  • Scaled distributions are particularly convenient for handling loss amounts when faced with inflation and currency unit conversions, and scaling distributions are preferred as candidate distributions for loss amount random variables.
  • the scaling parameter is denoted as θ.
  • the expectation μ of this scaled distribution can then be written in the form θ·f, where f is a function of the parameters other than θ.
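The scaling-parameter property can be checked numerically for the gamma distribution, whose scale θ is a scaling parameter: multiplying a Gamma(α, θ) sample by a constant c should yield a fit close to Gamma(α, c·θ), with the shape α unchanged. The Monte-Carlo check below uses SciPy's shape/scale parameterization; the particular parameter values are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, theta, c = 2.0, 3.0, 5.0        # shape, scale (the scaling parameter), constant

x = stats.gamma.rvs(alpha, scale=theta, size=200_000, random_state=rng)
y = c * x                               # new random variable after scaling

# fit the scaled sample with location fixed at 0; the fitted scale
# should be close to c * theta while the fitted shape stays near alpha
fit_alpha, _, fit_scale = stats.gamma.fit(y, floc=0)
```

Note also that the expectation is μ = α·θ, i.e., of the form θ·f with f = α, matching the statement above.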
  • Example (1) is used here to illustrate the prediction of loss (compensation) intensity.
  • the gamma distribution is a fat-tailed scaled distribution, and its scale parameter θ is the scaling parameter; with shape parameter α, its probability density function is $f(y)=\frac{y^{\alpha-1}e^{-y/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}$ for $y>0$.
  • the loss function of the training set is
  • the minimum value of the original objective function, the predicted values of the predictor variable, the corresponding loss-function values and the conditional probability distribution of the loss (payment) intensity can then be obtained.
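A sketch of the per-sample gradient $g_i$ and Hessian $h_i$ of the gamma negative log-likelihood with respect to a log-link prediction — the pair an XGBoost-style custom objective supplies to the trainer. The parameterization (mean μ = exp(F), fixed shape α treated as a nuisance/hyper-parameter) is an assumption of this sketch, not necessarily the patent's exact formulation:

```python
import numpy as np

def gamma_nll_grad_hess(y, F, alpha=1.0):
    """Gradient and Hessian of the gamma negative log-likelihood
    l = alpha * (F + y * exp(-F)) + const, where mu = exp(F) is the mean
    and alpha is the (fixed) shape parameter."""
    mu = np.exp(F)
    grad = alpha * (1.0 - y / mu)   # dl/dF
    hess = alpha * y / mu           # d^2 l / dF^2
    return grad, hess

y = np.array([0.5, 2.0, 4.0])
F = np.log(np.full(3, 2.0))         # current prediction: mu = 2 for every sample
g, h = gamma_nll_grad_hess(y, F)
# the gradient vanishes exactly where y_i = mu_i
```

With this log link the Hessian α·y/μ is positive for y > 0, so this particular parameterization remains convex; other parameterizations need not be, which is what motivates the improved method.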
  • this distribution belongs to the (a, b, 1) class and does not belong to the exponential family.
  • the loss function of the training set is
  • the minimum value of the original objective function, the predicted values of the predictor variable, the corresponding loss-function values and the conditional probability distribution of the number of losses (payments) can then be obtained.
  • the negative log-likelihood on the validation and test sets can be used as the corresponding evaluation index, where n is the number of samples in the corresponding set. Since σ is an unknown parameter, the hyperparameters must be tuned on the validation set by methods such as grid search: σ is treated as a nuisance parameter and hyperparameter, and the value minimizing the loss function on the validation set is taken as the estimate of σ.
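The grid search over a nuisance parameter such as σ can be sketched as evaluating the validation-set loss on a grid and keeping the minimizer; the quadratic stand-in for the validation loss and the grid bounds are illustrative assumptions:

```python
import numpy as np

def grid_search_sigma(val_loss, grid):
    """Evaluate the validation loss at each candidate sigma and keep the minimizer."""
    losses = [val_loss(s) for s in grid]
    best = int(np.argmin(losses))
    return grid[best], losses[best]

# illustrative validation loss with its minimum near sigma = 1.3
val_loss = lambda s: (s - 1.3) ** 2 + 0.7
grid = np.linspace(0.5, 2.5, 21)            # candidate sigma values, step 0.1
sigma_hat, loss_hat = grid_search_sigma(val_loss, grid)
```

In practice `val_loss` would refit or re-score the model on the validation set for each candidate σ; the same loop extends to a Cartesian grid when several hyperparameters are tuned jointly.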
  • the evaluation indices on the validation set are used to select the hyperparameter values and to determine the optimal model structure.
  • the training set and the verification set are combined as a new training set, and the model structure is set to retrain the model to obtain the updated model and model parameters.
  • Use the updated model to predict the samples of the test set, and obtain the evaluation index value of the model on the test set.
  • select other candidate parametric distributions and repeat the previous steps to re-model (with the test set unchanged), obtaining new evaluation index values; repeat until all plausible parametric distributions have been modeled. Compare the corresponding evaluation indices and select the model(s) with the best values as the prediction model. Keeping the model structure settings, retrain the updated model on all sample data (including the test set) to obtain the final prediction model.
  • if the k-fold cross-validation method is used, the mean of the k training results can be used as the estimate of σ.
  • this example uses the pure premium calculation model to obtain the pure premium, the probability distribution of the total loss amount, and the probability distribution of the total compensation amount and other non-life insurance pricing elements.
  • the improved XGBoost-class method can be further extended from univariate prediction to multi-parameter prediction of parametric probability distributions, forming a multi-round cyclic improved XGBoost-class data analysis method, so that the boosting-tree method predicts all parameters of common parametric probability distributions of random variables.
  • the improved XGBoost method model is used to model the predictive random variable Y i in multiple rounds, which can improve the predictive performance.
  • the random variable Y i here refers to the random variable of the loss (payment) intensity or the number of losses (payment) during the insurance period.
  • this example can be further extended for the scheme of Example 2.
  • the loss function is the corresponding $l(y_i,\mu_i,\theta_{1,i},\theta_2,\dots,\theta_l)$; it is required that, for any values of $y_i,\mu_i,\theta_2,\dots,\theta_l$, l has a second-order partial derivative with respect to $\theta_{1,i}$ (or the corresponding first-order partial derivative); there is one and only one local minimum point, the partial derivative is 0 only at that point, or the function is strictly monotonic.
  • take $\theta_{1,i}$ as the predictor variable and use the improved XGBoost-class method to predict it, obtaining the predicted value of $\theta_{1,i}$.
  • the regularization term of the XGBoost-class method keeps the scores of the leaf nodes from differing too much.
  • each $\theta_i$ is fixed, μ is regarded as the predictor variable, and the loss function is
  • repeat step 4 until the evaluation indices on the validation set converge; keep the model from each of the above steps, and use the test set to select the optimal probability distribution and parameter structure.
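The cyclic scheme (fix all parameters but one, refit that one, repeat) can be sketched for a gamma model in which the per-sample means μ_i come from the already-trained model and the shared shape α is refit by one-dimensional likelihood minimization. Using a single common α and `scipy.optimize.minimize_scalar` is a simplifying assumption of this sketch; the patent's scheme fits each parameter with its own boosting model:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def gamma_nll(y, mu, alpha):
    """Negative log-likelihood of a gamma with mean mu and shape alpha."""
    return np.sum(gammaln(alpha) - alpha * np.log(alpha / mu)
                  - (alpha - 1.0) * np.log(y) + alpha * y / mu)

def refit_alpha(y, mu):
    """One cycle step: with the means mu fixed, refit the shared shape alpha."""
    res = minimize_scalar(lambda a: gamma_nll(y, mu, a),
                          bounds=(1e-3, 1e3), method="bounded")
    return res.x

rng = np.random.default_rng(0)
true_alpha, mu = 3.0, np.full(50_000, 2.0)
y = rng.gamma(true_alpha, mu / true_alpha)   # mean = mu, shape = true_alpha
alpha_hat = refit_alpha(y, mu)               # recovers a value close to true_alpha
```

A full cycle would alternate this α-step with retraining the boosting model for μ at the new α, until the validation evaluation index converges as described above.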
  • if a traditional evaluation index such as mean squared error is used, the validation process is the same as in step (2). If the negative log-likelihood on the validation set is used as the evaluation index, then for the model whose predictor variable is $\theta_{j,i}$, the remaining parameters of the negative log-likelihood are fixed at the prediction-function values obtained from the trained improved XGBoost-class models, with n the number of validation-set samples.
  • This example extends the improved XGBoost method to predict multiple parameters to be estimated, and uses one algorithm model to predict multiple parameters to be estimated in the parameter probability distribution at the same time, which can increase the prediction performance of the model and improve operational efficiency and interpretability.
  • $y_i$, being an observed value, is regarded as a fixed parameter, not as a variable or a parameter to be estimated.
  • the scope of discussion of the parameters to be estimated can be chosen with reasonable freedom; in practical applications, reasonable predictions will not fall exactly on the theoretical boundary points.
  • the interval discussed can be taken as a closed interval, whose boundary can be kept a reasonable distance from the theoretical boundary points.
  • a sample set $D=\{(x_i,y_i)\}$ ($i=1,\dots,n$; $x_i\in\mathbb{R}^m$, $y_i\in\mathbb{R}$) has m features and n samples. Adding $K_j$ tree functions gives the prediction of parameter $\theta_j$: $\hat{\theta}_{j,i}=\sum_{k=1}^{K_j}f_{j,k}(x_i)$.
  • the multivariate regularized boosting-tree method is not limited to cases where some $h_i$ is not always non-negative; it also applies when all $h_i$ are always non-negative.
  • the approximate expression (2) is formally simplified as:
  • Each round of training can train up to l trees at the same time, and each tree has its own hyperparameters.
  • a smaller number of training rounds K can be set separately.
  • the preferred solution is to set the interval of iteration rounds to reduce the total number of training rounds.
  • the initial iterative value of the parameter ⁇ j to be estimated can be obtained by the maximum likelihood estimation of the training set (without considering xi ).
  • a loss function l is, within the scope of discussion: second-order differentiable, with one and only one local minimum point; if approximate expression (1) is used, the requirement can be relaxed to first-order differentiable with one and only one local minimum point; after any parameter to be estimated is selected, with the remaining parameters fixed, there is one and only one local minimum point; the partial derivative with respect to the parameter to be estimated is 0 only at that local minimum point, or the function is strictly monotonic.
  • the loss function of the training set is
  • the parameters $\alpha_i$ and $\theta_i$ to be estimated can be set to any reasonable discussion range.
  • one method is to set $\alpha_i\in[\varepsilon_1,M_1]$ and $\theta_i\in[\varepsilon_2,M_2]$, where $\varepsilon_1,\varepsilon_2$ are sufficiently small positive numbers and $M_1,M_2$ are sufficiently large positive numbers.
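Keeping estimates inside a closed interval $[\varepsilon, M]$, a reasonable distance from the theoretical boundary points, can be sketched as a clip applied after each update; the particular ε and M values below are illustrative assumptions:

```python
import numpy as np

def clip_params(theta, eps=1e-6, M=1e6):
    """Keep parameter estimates inside the discussed closed interval [eps, M],
    a reasonable distance from the theoretical boundary points 0 and infinity."""
    return np.clip(theta, eps, M)

theta = np.array([-0.5, 0.2, 2.0e7])   # raw updates, two of them out of range
clipped = clip_params(theta)
```

Clipping after each iteration prevents the optimizer from ever evaluating the loss at or beyond the theoretical boundary, where the likelihood may be undefined.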
  • the specific conditional probability distribution of the predictor variable Y i can be obtained by using the multivariate regularized boosting tree method.
  • the loss function may not be a convex function of ⁇ i .
  • the loss function l is a concave function of ⁇ i , and its function image is shown in FIG. 5 .
  • the above modeling process can adopt different feature-engineering schemes. Combine the training and validation sets and retrain the model with the learned hyperparameters. Change the candidate probability distribution type of the predictor variable and repeat the modeling and training. Use the learned models to predict on the test set, and select the probability distribution(s) with the smallest evaluation index and the corresponding prediction model(s) as the optimal model. Finally, combine all sample sets, reuse the learned hyperparameters, retrain the model, and put the final model into production.
  • a preferred evaluation index is the negative log-likelihood function.
  • the improvement of the XGBoost method in this patent refers to the improvement of all methods similar to the XGBoost method, such as the well-known LightGBM and CatBoost methods.
  • in practical application, the multi-round cyclic XGBoost-class method and the multivariate regularized boosting-tree method only require an optimization problem of minimizing an objective function whose loss function satisfies the stated conditions, or the maximum-likelihood estimation of the parameters of a parametric probability distribution satisfying those conditions (the conditional maximum-likelihood estimation for sample points with different features); they are applicable not only to non-life insurance pricing but also to various other fields.
  • the embodiment of the present invention also provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the steps of any one or more solutions in the above-mentioned examples 1-4 are implemented.
  • An embodiment of the present invention also provides a processor, the processor is configured to run a program, wherein the program executes any one or more of the steps in the above-mentioned examples 1-4 when running.
  • the embodiment of the present invention also provides a terminal device. The device includes a processor, a memory, and a program stored in the memory and runnable on the processor; the program code is loaded and executed by the processor to implement the steps of any one or more of the schemes in Examples 1-4.
  • the present invention also provides a computer program product, which, when executed on a data processing device, is suitable for executing the steps of any one or more of the above-mentioned examples 1-4.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read only memory (ROM) or flash RAM.
  • Computer-readable media including both permanent and non-permanent, removable and non-removable media, can be implemented by any method or technology for storage of information.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Technology Law (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a data analysis method, a pricing method and related devices based on an improved XGBoost-type method. The scheme adopts an improved XGBoost-type algorithm that corrects the second-order Taylor expansion of the objective function by modifying its h_i terms, so that the improved XGBoost-type method is no longer restricted to convex loss functions. On this basis, the scheme further proposes a multivariate regularized boosting tree method that generalizes the probability distribution of the predicted variable from a single parameter to multiple parameters, and can be widely applied in various fields, in particular non-life (General) insurance pricing.

Description

Data analysis method, pricing method and related devices based on an improved XGBoost-type method

Technical field
The present invention relates to machine learning technology and actuarial technology, and in particular to corresponding big-data analysis methods.
Background
1. Pure premium models.
In non-life insurance pricing, the insurer estimates the insured's pure premium, i.e. the expected net claim amount. Because the non-life policy period is short, interest is ignored in the pure premium here. To estimate the pure premium it is best to estimate the probability distribution of the loss (claim) amount (per occurrence, or aggregated over the policy period), and not merely its expected value: in indemnity insurance there is usually a deductible (or limit) on the loss per occurrence or on the total loss over the policy period, and only with the estimated probability distribution of the loss (claim) amount can the pure premium be adjusted correspondingly when the deductible (or limit) is adjusted.
There are two families of methods for estimating the probability distribution of the aggregate loss (claim) amount:
1. Estimate the distribution of the aggregate loss (claim) amount over the policy period directly.
2. Estimate separately the distribution of the number of occurrences (claims) in the policy period and the distribution of the loss severity (claim severity) per occurrence, then integrate the two with a compound distribution model to obtain the distribution of the aggregate loss (aggregate claims). Two assumptions are common:
a. The standard assumption: the two distributions are mutually independent and the per-occurrence severities are independent and identically distributed.
b. The two distributions are dependent, or the severities are not independent and identically distributed.
Under the standard assumption a, the distribution of the aggregate loss (aggregate claims) can be obtained by characteristic-function transform methods (Fourier transform methods) or by stochastic simulation. Assumption b is rarely used in practice because it has too many parameters to estimate and risks overfitting. In general, the second family is the finer method and has many advantages over the first.
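The compound-distribution approach under the standard assumption (a) can be illustrated with a small stochastic simulation. The Poisson frequency and gamma severity used here, and the function and parameter names, are illustrative assumptions for the sketch, not distributions prescribed by the patent:

```python
import numpy as np

def simulate_aggregate_loss(lam, shape, scale, deductible=0.0,
                            n_sims=40_000, seed=0):
    """Monte Carlo draw of the aggregate claim amount under the standard
    assumption: claim count ~ Poisson(lam), severities i.i.d.
    Gamma(shape, scale), count and severities independent."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(lam, size=n_sims)
    totals = np.empty(n_sims)
    for i, n in enumerate(counts):
        sev = rng.gamma(shape, scale, size=n)
        # per-occurrence deductible: insurer pays only the excess
        totals[i] = np.maximum(sev - deductible, 0.0).sum()
    return totals

# pure premium = mean of the simulated aggregate net payments
totals = simulate_aggregate_loss(lam=0.3, shape=2.0, scale=500.0,
                                 deductible=200.0)
pure_premium = totals.mean()
```

The empirical distribution of `totals` also gives quantiles of the aggregate loss, which is exactly what the deductible/limit adjustments described above require.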
XGBoost is an extreme gradient boosting tree method with excellent predictive performance that has achieved very good results in many fields.
Its main procedure is described as follows:
Given a sample set D={(x_i, y_i)} (|D|=n, x_i∈R^m, y_i∈R) with m features and n samples, an ensemble tree model sums K tree functions to obtain the prediction result.
ŷ_i = φ(x_i) = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F
where F = {f(x) = ω_{q(x)}} (q: R^m → T, ω ∈ R^T) is the space of regression trees. q denotes the structure of each tree and maps a sample to its leaf; T is the number of leaves of a tree. Each f_k corresponds to an independent tree structure q and leaf weights ω. Every leaf of every regression tree carries a continuous score, with ω_j denoting the score of the j-th leaf. To learn the tree functions in the model, the following regularized objective is minimized:
L(φ) = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k)
where
Ω(f) = γT + ½ λ‖ω‖²
l is a differentiable convex function, the loss function, and Ω(f_k) is the regularization term.
The XGBoost algorithm minimizes the objective with a boosting-tree algorithm. Let ŷ_i^{(t-1)} be the prediction for the i-th sample after iteration t-1; a tree function f_t is added by minimizing the following objective:
L^{(t)} = Σ_{i=1}^{n} l(y_i, ŷ_i^{(t-1)} + f_t(x_i)) + Ω(f_t)
In the general case, to optimize the objective quickly, it is approximated by a second-order Taylor expansion:
L^{(t)} ≈ Σ_{i=1}^{n} [ l(y_i, ŷ_i^{(t-1)}) + g_i f_t(x_i) + ½ h_i f_t²(x_i) ] + Ω(f_t)
where
g_i = ∂_{ŷ^{(t-1)}} l(y_i, ŷ^{(t-1)}),  h_i = ∂²_{ŷ^{(t-1)}} l(y_i, ŷ^{(t-1)})
Removing the constant term gives the objective of the t-th iteration:
L̃^{(t)} = Σ_{i=1}^{n} [ g_i f_t(x_i) + ½ h_i f_t²(x_i) ] + Ω(f_t)
Define I_j = {i | q(x_i) = j}, the set of sample points assigned to leaf j, and rewrite L̃^{(t)} to obtain
L̃^{(t)} = Σ_{j=1}^{T} [ (Σ_{i∈I_j} g_i) ω_j + ½ (Σ_{i∈I_j} h_i + λ) ω_j² ] + γT
For a fixed tree structure q(x), setting the partial derivative of L̃^{(t)} with respect to each ω_j to zero gives the optimal weight score of leaf j:
ω_j* = − Σ_{i∈I_j} g_i / ( Σ_{i∈I_j} h_i + λ )
and the optimal objective value:
L̃^{(t)}(q) = −½ Σ_{j=1}^{T} ( Σ_{i∈I_j} g_i )² / ( Σ_{i∈I_j} h_i + λ ) + γT
The tree structure q is found with a greedy algorithm, iteratively adding branches starting from a single leaf.
Let I_L and I_R denote the sample sets of the left and right node after a split, with I = I_L ∪ I_R.
The reduction of the objective after the split is given by:
L_split = ½ [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ
This formula is used to score candidate split points.
Similar to a learning rate, the shrinkage technique scales each newly added tree by a factor η after every boosting step, which also prevents overfitting; column subsampling is a further technique against overfitting.
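The closed-form leaf weight and split gain above can be sketched directly. This is a minimal re-implementation of the standard XGBoost formulas (not the patent's modified version), with illustrative function names:

```python
import numpy as np

def leaf_weight(g, h, lam):
    # optimal leaf score: w*_j = -sum(g_i) / (sum(h_i) + lambda)
    return -np.sum(g) / (np.sum(h) + lam)

def leaf_score(g, h, lam):
    # a leaf's contribution to the (negated) objective: G^2 / (H + lambda)
    return np.sum(g) ** 2 / (np.sum(h) + lam)

def split_gain(g, h, mask, lam, gamma):
    """Objective reduction when a node splits into left (mask) and right
    (~mask) children, minus the complexity penalty gamma."""
    return 0.5 * (leaf_score(g[mask], h[mask], lam)
                  + leaf_score(g[~mask], h[~mask], lam)
                  - leaf_score(g, h, lam)) - gamma
```

For squared loss, g_i = ŷ_i − y_i and h_i = 1, so a leaf weight is simply the (shrunken) negative mean residual.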
In addition, some open-source implementations provide an extra l1 regularization term:
Again define I_j = {i | q(x_i) = j}, the set of sample points assigned to leaf j, and rewrite the iteration-t objective to obtain
L̃^{(t)} = Σ_{j=1}^{T} [ (Σ_{i∈I_j} g_i) ω_j + ½ (Σ_{i∈I_j} h_i + λ) ω_j² + β|ω_j| ] + γT
where β ≥ 0. Writing G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i, finding the best value of each ω_j minimizing L̃^{(t)} gives the soft-thresholded optimal weight score of leaf j:
when G_j > β:  ω_j* = −(G_j − β)/(H_j + λ);
when |G_j| ≤ β:  ω_j* = 0;
when G_j < −β:  ω_j* = −(G_j + β)/(H_j + λ).
Substituting ω_j* back into L̃^{(t)} gives the optimal objective value.
The tree structure q is found with a greedy algorithm, iteratively adding branches starting from a single leaf.
The optimal objective values of the left and right node sample sets are computed and the gain of the split recorded as the criterion for the best split point.
Shortcomings of XGBoost-type methods:
The XGBoost algorithm places rather strict requirements on the loss function l(y_i, ŷ_i): it must be differentiable in ŷ_i and convex. If l is not globally convex, convergence of the original objective to its global minimum is not guaranteed. An example:
Suppose there is a single sample point (x_1, y_1), and view l as a function of the argument ŷ_1 with y_1 regarded as a parameter; its shape is as in Fig. 1.
Take the standard regularization term as an example: let γ and λ be small enough to neglect, so the objective is approximately the loss function; studying the loss instead of the objective does not change the conclusion.
Since there is only one sample point, T = 1. Possibly because the learning rate η is not controlled, after iteration t-1 the prediction ŷ_1^{(t-1)} may land in a neighborhood where l is concave in ŷ_1, with first derivative g_1 positive and second derivative h_1 negative. The optimal weight score for this sample at iteration t is then ω* = −g_1/(h_1 + λ); when λ < |h_1| the denominator is negative, so ω* > 0, and ŷ_1^{(t)} = ŷ_1^{(t-1)} + ω* moves even further away from the global minimum of l.
Moreover, existing XGBoost-type methods are limited to fitting single-parameter probability distributions. For multi-parameter distributions they cannot optimize several parameters simultaneously and often fail to reach the best predictive performance. For example, if the loss frequency in general insurance pricing follows a two-parameter negative binomial distribution, fitting it with a one-parameter Poisson distribution is inappropriate.
Summary of the invention
In view of the problems of existing big-data predictive analytics, a new data analysis scheme is needed.
To this end, the purpose of the present invention is to provide a data analysis method based on an improved XGBoost-type method, thereby effectively improving the performance of big-data predictive analytics. On this basis, the invention further provides a pricing method based on the improved XGBoost-type method that effectively overcomes the defects of existing schemes.
To achieve the above purpose, the data analysis method provided by the invention performs prediction and evaluation on the acquired variable parameters with an improved XGBoost-type method. The improved method corrects the second-order Taylor expansion used to approximate the objective function in XGBoost-type algorithms: when h_i is not always nonnegative, the h_i-related terms are modified so that the applicability of the improved XGBoost-type method is not limited to convex loss functions.
Further, the improved XGBoost-type method generalizes XGBoost-type methods from single-variable prediction to multi-parameter prediction of a parametric probability distribution, forming a multi-round cyclic improved XGBoost-type data analysis method.
Further, in the improved XGBoost-type method the loss function l(y_i, ŷ_i) is assumed, within the range under discussion, to be twice differentiable in ŷ_i, and to have exactly one local minimum at which alone the derivative is zero, or to be strictly monotonic.
Further, in the improved XGBoost-type method the objective function of the t-th iteration may be approximated by one of:
(1)
Figure PCTCN2022104694-appb-000043
(2)
Figure PCTCN2022104694-appb-000044
or a weighted-average expression of the h_i-related terms of (1) and (2).
For approximation (1), the differentiability requirement on the loss function l(y_i, ŷ_i) can be relaxed to first-order differentiability in ŷ_i.
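The patent's corrected h_i terms in approximations (1) and (2) are given only as figure references above, so their exact form is not reproducible here. As a loudly-labeled assumption for illustration only: two common ways to keep the leaf-weight denominator positive under a non-convex loss are to use |h_i| or to floor h_i at a small positive ε. The sketch below implements these two illustrative variants, not the patent's exact correction:

```python
import numpy as np

def robust_leaf_weight(g, h, lam, eps=1e-6, variant="abs"):
    """Leaf weight with a modified h-term so the denominator stays positive
    even when some h_i < 0 (non-convex loss). 'abs' uses |h_i|; 'floor'
    clips each h_i below at eps. Illustrative choices only."""
    h = np.asarray(h, dtype=float)
    h_mod = np.abs(h) if variant == "abs" else np.maximum(h, eps)
    return -np.sum(g) / (np.sum(h_mod) + lam)
```

With any nonnegative h-modification and λ > 0, the weight always has the opposite sign to the summed gradient, so the counterexample above (positive g_1, negative h_1 driving the update the wrong way) no longer occurs.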
To achieve the above purpose, the pricing method provided by the invention performs non-life actuarial pricing based on the above data analysis method.
Further, the pricing method comprises:
(1) first choosing the random variable to predict and collecting sample data, including sample attributes and observed values of the predicted variable;
(2) preprocessing the sample data;
(3) performing feature engineering to obtain the updated sample set D = {(x_i, y_i)}; x_i is the feature vector of the i-th sample;
(4) splitting the sample set into a training set, a validation set and a test set; the training set trains the learning model that predicts the predicted variable, the validation set tunes hyperparameters, and the test set evaluates model performance;
(5) choosing the parametric distribution type of the predicted random variable and obtaining the conditional probability distribution of the predicted variable with the improved XGBoost-type method;
(6) reselecting a distribution to fit from the candidate distributions, repeating step (5), and determining the optimal parametric distribution by the evaluation metric on the test set. When one is confident about the parametric distribution type of the predicted variable, the optimal distribution may also be specified directly, in which case it is the only candidate parametric distribution.
Further, obtaining the conditional probability distribution of the predicted variable based on the improved XGBoost-type method comprises:
(1) choosing a distribution from the candidate parametric probability distributions and determining its parameters; the same distribution may have different parameterizations;
(2) taking the expected-value expression of the predicted variable as the expectation parameter, transforming the expression of the probability distribution accordingly, taking the expectation parameter as the predicted parameter, and treating the parameters other than the predicted parameter as nuisance parameters/hyperparameters; if the expression already contains the expectation parameter, no transformation is needed and the predicted parameter and hyperparameters are set directly;
(3) determining the objective function, with the negative log-likelihood of the distribution as the loss function, and verifying that the loss satisfies the requirements of the improved XGBoost method;
(4) determining the hyperparameter values by grid search, prior experience or other methods;
(5) with the hyperparameters fixed, obtaining the predicted value of the predicted parameter with the improved XGBoost-type algorithm;
(6) changing the hyperparameter values, repeating step (5), and determining the optimal parameter predictions and hyperparameter values by the evaluation metric on the validation set, thereby obtaining the predicted value of the predicted variable and its probability distribution. If one is confident about some hyperparameter's value, it may also be fixed directly as the only candidate.
To achieve the above purpose, the invention provides a data analysis method that generalizes the improved XGBoost-type method directly to the multivariate case, forming a multivariate regularized boosting tree method. This method likewise corrects the second-order Taylor expansion used to approximate the objective function in XGBoost-type algorithms by modifying its h_i-related terms, so that the applicability of the method is not limited to convex loss functions. The method can simultaneously optimize several variables (i.e. the parameters to be estimated) of a multivariate loss function.
Further, in the multivariate regularized boosting tree method the loss function l is assumed, within the range under discussion, to satisfy: (1) it is twice differentiable or once differentiable, with exactly one local minimum; (2) after any single parameter to be estimated is selected as the variable under consideration, with the remaining parameters fixed, there is exactly one local minimum; and the partial derivative with respect to the parameter is zero only at that local minimum, or the function is strictly monotonic.
Note: the observation y_i is treated as a fixed parameter, not as a variable or a parameter to be estimated. The range of discussion of the parameters to be estimated can be chosen freely and reasonably; in practical use, reasonable predictions never fall exactly on the theoretical extreme boundary. Sometimes the range may be taken as a closed interval, or its boundary may be kept at a reasonable distance from the theoretical boundary point.
Further, the objective function of the multivariate regularized boosting tree method is expressed as:
Figure PCTCN2022104694-appb-000047
where Ω is the regularization term;
Figure PCTCN2022104694-appb-000048
…,
Figure PCTCN2022104694-appb-000049
are the regularization hyperparameters of the tree functions
Figure PCTCN2022104694-appb-000050
Figure PCTCN2022104694-appb-000051
and
Figure PCTCN2022104694-appb-000052
is the number of leaves of one tree in
Figure PCTCN2022104694-appb-000053
; l is the number of parameters to be estimated and k is the number of boosting-tree rounds predicting the corresponding parameter.
An l1 regularization term may also be added to Ω:
Figure PCTCN2022104694-appb-000054
Further, in the multivariate regularized boosting tree method the objective function of the t-th iteration is approximated by one of:
(1)
Figure PCTCN2022104694-appb-000057
Figure PCTCN2022104694-appb-000058
(2)
Figure PCTCN2022104694-appb-000059
Figure PCTCN2022104694-appb-000060
or the weighted-average expression (3) of the h_i-related terms of (1) and (2):
(3)
Figure PCTCN2022104694-appb-000061
Figure PCTCN2022104694-appb-000062
where the g_i are the partial derivatives of the loss function with respect to the corresponding parameters to be estimated, and the h_i are the corresponding second partial derivatives of the loss function.
For approximation (1), the differentiability condition on the loss function can be relaxed to first-order differentiability.
To achieve the above purpose, the invention provides a pricing method that performs actuarial pricing based on the above data analysis method.
Further, the pricing method comprises:
(1) first choosing the random variable to predict and collecting sample data, including sample attributes and observed values of the predicted variable;
(2) preprocessing the sample data;
(3) performing feature engineering to obtain the updated sample set D = {(x_i, y_i)}; x_i is the feature vector of the i-th sample;
(4) splitting the sample set into a training set, a validation set and a test set; the training set trains the learning model that predicts the parameters to be estimated of the parametric distribution, the validation set tunes hyperparameters, and the test set evaluates model performance;
(5) choosing the parametric distribution type of the predicted random variable and obtaining the conditional probability distribution of the predicted variable with the multivariate regularized boosting tree method;
(6) reselecting a distribution to fit from the candidate distributions, repeating step (5), and determining the optimal parametric distribution by the evaluation metric on the test set. When one is confident about the parametric distribution type of the predicted variable, the optimal distribution may also be specified directly, in which case it is the only candidate parametric distribution.
Further, obtaining the conditional probability distribution of the predicted variable based on the multivariate regularized boosting tree method comprises:
(1) choosing a distribution from the candidate parametric probability distributions and determining its parametric form; the same distribution may have different parameterizations;
(2) determining the objective function, with the negative log-likelihood of the distribution as the loss function, and verifying that it satisfies the requirements of the multivariate regularized boosting tree method;
(3) taking the parameters of interest as the independent variables, obtaining the predicted values of all parameters of the distribution with the multivariate regularized boosting tree method, thereby obtaining the explicit expression of the probability distribution of the predicted variable. Parameters whose values one is confident about can be fixed by experience or other methods and excluded from the boosting-tree iterations.
The invention performs data analysis with the improved XGBoost-type methods, effectively overcoming the various defects of the prior art.
The multi-round cyclic improved XGBoost-type data analysis method provided by the invention performs cyclic multi-parameter modeling with the improved XGBoost-type method, further improving the predictive performance of the model.
The multivariate regularized boosting tree method provided by the invention, applied to data analysis, further improves the predictive performance of big-data prediction methods, including non-life pricing methods, and improves computational efficiency and model interpretability.
On the basis of the above schemes, the invention further provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the steps of the above data analysis or pricing methods.
On the basis of the above schemes, the invention further provides a processor configured to run a program which, when running, implements the steps of the above data analysis or pricing methods.
On the basis of the above schemes, the invention further provides a terminal device comprising a processor, a memory and a program stored on the memory and runnable on the processor; the program code is loaded and executed by the processor to implement the steps of the above data analysis or pricing methods.
On the basis of the above schemes, the invention further provides a computer program product which, when executed on a data processing device, is adapted to execute the steps of the data analysis or pricing methods.
Brief description of the drawings
The invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is an example graph of a non-convex loss function in the existing XGBoost algorithm;
Fig. 2 is an example graph of the non-convex loss function when predicting loss severity in Example 2;
Fig. 3 is an example graph of the non-convex loss function when predicting loss counts in Example 2;
Fig. 4 shows example graphs of the loss function l in Example 3 after fixing the corresponding parameters;
Fig. 5 is an example graph of the loss function l in Example 4 after fixing the corresponding parameters.
Detailed description
To make the technical means, creative features, purposes and effects of the invention easy to understand, the invention is further explained below with reference to the figures.
Addressing the defects of the prior art, this scheme improves XGBoost-type methods, combining accurate predictive performance with traditional statistical techniques to further improve prediction.
Taking non-life pricing as an example: when applied to non-life pricing, the improved XGBoost-type method and the derived multivariate regularized boosting tree method can effectively overcome the prior-art defects described in the background while retaining the prior art's advantages, achieving excellent predictive performance for loss (claim) counts, loss (claim) severities and aggregate loss (or aggregate claim) amounts, and hence the desired pure-premium estimates.
Example 1
In this example the XGBoost-type method is improved to construct the corresponding improved XGBoost-type method, overcoming the prior-art requirement that the loss function be convex.
In the improved XGBoost-type algorithm given in this example, the second-order Taylor expansion approximating the objective function is corrected by modifying its h_i-related terms, so that the applicability of the improved method is not limited to convex losses.
This is further illustrated below.
In this example, the loss function l(y_i, ŷ_i) is set to the negative log-likelihood of the probability distribution of the predicted variable. It is further assumed that, within the range under discussion, l is twice differentiable in ŷ_i, and has exactly one local minimum at which alone the derivative is zero, or is strictly monotonic.
On this basis, the objective function may be approximated by either of:
(1)
Figure PCTCN2022104694-appb-000073
(2)
Figure PCTCN2022104694-appb-000074
When approximation (1) is used, the differentiability requirement on the loss can be relaxed to first-order differentiability in ŷ_i.
Clearly, some weighted average (linear combination) of (1) and (2) can also be seen as a variant of the approximation formula, such as
Figure PCTCN2022104694-appb-000076
(3)
Figure PCTCN2022104694-appb-000077
If |g_i| is extremely large, i.e. |g_i| exceeds some sufficiently large positive number M, the value of g_i can be capped: replace g_i by M·sign(g_i), the capped value still being written g_i below. When |g_i| is extremely large, this replacement reduces the absolute value of the resulting update, making the algorithm converge faster; in particular, when g_i is infinite at some point, capping makes the algorithm converge.
Substituting variables into expression (1) gives:
Figure PCTCN2022104694-appb-000083
For a fixed tree structure q(x), setting the partial derivative of the objective with respect to each ω_j to zero gives the optimal weight score of leaf j:
Figure PCTCN2022104694-appb-000085
The optimal objective value is:
Figure PCTCN2022104694-appb-000086
The tree structure q is found with a greedy algorithm, iteratively adding branches starting from a single leaf.
Let I_L and I_R denote the sample sets of the left and right node after a split, with I = I_L ∪ I_R.
The reduction of the objective after the split is given by:
Figure PCTCN2022104694-appb-000087
This formula is used to score candidate split points.
Substituting variables into expression (2) gives:
Figure PCTCN2022104694-appb-000088
For a fixed tree structure q(x), setting the partial derivative of the objective with respect to each ω_j to zero gives the optimal weight score of leaf j:
Figure PCTCN2022104694-appb-000090
The optimal objective value is:
Figure PCTCN2022104694-appb-000091
The tree structure q is found with a greedy algorithm, iteratively adding branches starting from a single leaf.
Let I_L and I_R denote the sample sets of the left and right node after a split, with I = I_L ∪ I_R.
The reduction of the objective after the split is given by:
Figure PCTCN2022104694-appb-000092
This formula is used to score candidate split points.
For expression (3), the corresponding derivation is as follows.
Substituting variables gives
Figure PCTCN2022104694-appb-000093
For a fixed tree structure q(x), setting the partial derivative of the objective with respect to each ω_j to zero gives the optimal weight score of leaf j:
Figure PCTCN2022104694-appb-000095
The optimal objective value is:
Figure PCTCN2022104694-appb-000096
The tree structure q is found with a greedy algorithm, iteratively adding branches starting from a single leaf.
Let I_L and I_R denote the sample sets of the left and right node after a split, with I = I_L ∪ I_R.
The reduction of the objective after the split is given by:
Figure PCTCN2022104694-appb-000097
This formula is used to score candidate split points.
In addition, the improved XGBoost-type method applies equally when the l1 regularization term is used.
Noting that (1) and (2) are special cases of (3), take (3) as the illustration: define I_j = {i | q(x_i) = j}, the set of sample points assigned to leaf j, and rewrite
Figure PCTCN2022104694-appb-000098
to obtain
Figure PCTCN2022104694-appb-000099
For a fixed tree structure q(x), find the best value of each ω_j minimizing the objective; the optimal weight score of leaf j is:
when
Figure PCTCN2022104694-appb-000101
then
Figure PCTCN2022104694-appb-000102
when
Figure PCTCN2022104694-appb-000103
then
Figure PCTCN2022104694-appb-000104
when
Figure PCTCN2022104694-appb-000105
then
Figure PCTCN2022104694-appb-000106
where β ≥ 0. Substituting
Figure PCTCN2022104694-appb-000107
into
Figure PCTCN2022104694-appb-000108
gives the optimal objective value.
The tree structure q is found with a greedy algorithm, iteratively adding branches starting from a single leaf.
The optimal objective values of the left and right node sample sets are computed and the gain of the split recorded as the criterion for the best split point.
On this basis, the other components of this improved XGBoost-type method may adopt the corresponding components of existing XGBoost-type algorithms and are not repeated here.
M may be set from prior experience or treated as a hyperparameter.
Since the denominator of the expression for
Figure PCTCN2022104694-appb-000109
is always positive, the leaf weight always has the opposite sign to the average gradient of the samples in that leaf; this guarantees that the algorithm converges when the conditions are met.
When the loss function l satisfies the corresponding conditions, setting a fairly small learning rate η, a suitable M and a nonzero λ makes the objective function converge to its global minimum. A suitable initial iterate reduces the number of training rounds and speeds up convergence.
Preferably, the maximum likelihood estimate of the predicted random variable is used as the initial iterate of the predicted variable, improving convergence speed and model interpretability.
After the t-th iteration, the prediction may leave the range under discussion. If this happens, simply adjust the value of f_t(x_i), or the value of the hyperparameter η of this round for this sample point, so that the prediction lands exactly on the boundary of the range under discussion.
Example 2
In this example the improved XGBoost-type method of Example 1 is used to form a non-life insurance pricing method. Under the independence assumption, the negative log-likelihood is taken as the loss function and the mean parameter as the parameter estimated by the XGBoost-type method.
In this example the improved XGBoost-type method is used to improve the estimation of the probability distribution of loss (claim) severity or loss (claim) counts in non-life pricing.
Accordingly, the process mainly comprises the following steps:
(1) First choose the random variable to predict, such as the loss-count random variable or the loss-severity random variable. Collect sample data, including sample attributes and observed values of the predicted variable. Taking the per-occurrence loss amount in motor insurance as an example, the sample attributes may include vehicle model, mileage driven, vehicle price, owner age, last year's claims, traffic violation records, etc.; the observed value of the predicted variable is the per-occurrence loss amount during the policy period.
(2) Preprocess the sample data, including handling outliers, etc.
(3) Perform feature engineering to obtain the updated sample set D = {(x_i, y_i)}; x_i is the feature vector of the i-th sample.
(4) Split the sample set into a training set, a validation set and a test set. The training set trains the model, i.e. the learning model that predicts the variable to be predicted; the validation set tunes the hyperparameters; the test set evaluates model performance. The hold-out method, k-fold cross-validation, etc. may be used.
(5) Choose the parametric distribution type of the predicted random variable from the candidate parametric distributions and obtain the conditional probability distribution of the predicted variable with the improved XGBoost-type method formed in Example 1.
(6) Reselect a distribution to fit from the candidate distributions, repeat step (5), and determine the optimal parametric distribution by the evaluation metric on the test set. If there is only one candidate distribution, no reselection is needed.
In this example, obtaining the conditional probability distribution with the improved XGBoost-type method comprises:
(5.1) Choose a distribution from the candidate parametric probability distributions and determine its parameters.
In this step, substitute the expectation expression into the parametric distribution so that the expectation appears as a parameter of the distribution, i.e. the expectation parameter, and take the expectation parameter as the variable to be estimated by the improved XGBoost-type method; if the expression already contains the expectation parameter, no transformation is needed and the predicted parameter and hyperparameters are set directly.
Note that, as in generalized linear models, different links may be added to the expectation parameter, e.g. a log link. Adding a link amounts to a different parameterization; every parameterization has a corresponding loss function, and any that satisfies the conditions of the method is usable.
(5.2) Treat the remaining parameters as nuisance parameters/hyperparameters and determine their values by grid search, prior experience or other methods.
(5.3) With the hyperparameters fixed, obtain the prediction of the expectation parameter with the improved XGBoost-type algorithm.
(5.4) Change the hyperparameter values, repeat step (5.3), and determine the optimal parameter predictions and hyperparameter values by the evaluation metric on the validation set, thereby obtaining the prediction of the predicted variable and the explicit expression of its probability distribution. Hyperparameters with settled values can be fixed by other means, e.g. experience, and need not be varied.
The principle is similar to that of generalized linear models; the difference is that a GLM links the expectation of the predicted variable to a linear combination model, whereas this method links the expectation of the variable to be estimated to an improved XGBoost-type boosting tree model. The improved XGBoost-type method can thus combine the advantages of GLM methods and XGBoost-type methods while overcoming their respective drawbacks.
On this basis, this example adds an evaluation-metric scheme for the improved XGBoost-type method: use the training-set loss function as the evaluation metric on the validation and test sets, unifying loss function and evaluation metric perfectly. When the objective can be solved optimally, using the log-likelihood of the predicted variable's distribution (or its negative) as the metric accords with statistical convention.
Taking the hold-out method as an example, the conditional probability distribution of the predicted variable is obtained as follows.
Choose the distribution type of the predicted random variable Y from the candidate parametric distributions by experience.
In this example it is assumed that the analyzed random variables Y_i (i = 1, …, n; n the number of samples in the set) follow the same type of parametric distribution and have the following property:
the Y_i are mutually independent (conditionally on their own features and parameters).
Write the probability value or probability density of Y_i as f(y_i; μ_i, θ) (if Y_i is discrete, f(y_i; μ_i, θ) denotes its probability value; if continuous, its probability density), where μ_i and θ are the parameters of the distribution, θ being the parameters other than μ_i, if any.
Here E(Y_i) = μ_i; θ is unrelated to μ_i, takes the same value for every Y_i, and is treated as a nuisance parameter or hyperparameter. Take μ_i as the variable estimated by the XGBoost model, with the prediction given by the XGBoost tree functions.
To stay consistent with the notation of Tianqi Chen's paper, ŷ_i is used below in place of μ̂_i.
Define the loss function of sample (x_i, y_i):
Figure PCTCN2022104694-appb-000117
If, within the range under discussion, l is twice differentiable (or correspondingly once differentiable) in ŷ_i for every possible θ and y_i, with exactly one local minimum at which alone the derivative is zero, or strictly monotonic, then continue; otherwise, change the fitted distribution among the candidate parametric distributions.
The loss function over the whole set is
Figure PCTCN2022104694-appb-000120
The improved XGBoost-type method minimizes the objective:
Figure PCTCN2022104694-appb-000121
where
Figure PCTCN2022104694-appb-000122
When θ is known, training on the training set with the improved XGBoost-type method yields the prediction function, giving the estimate of μ_i.
(a) Predicting loss (claim) severity.
Definitions:
Scale distribution: if a random variable follows some parametric distribution and multiplying it by any positive constant yields a new random variable that still follows that parametric distribution, the distribution is called a scale distribution.
Scale parameter: for a random variable following a scale distribution with nonnegative possible values, a parameter of the distribution is a scale parameter if it satisfies two conditions: when the variable is multiplied by a positive constant to form a new random variable, the scale parameter of the new scale distribution is multiplied by the same positive constant, and the remaining parameters of the new scale distribution are unchanged.
Scale distributions handle loss amounts particularly conveniently under inflation and currency-unit conversion, so scale distributions are preferred candidates for the loss-amount random variable. Denote the scale parameter β. The expectation μ of the scale distribution can be written in the form β·f, with f a function of the parameters other than β; then β = μ/f.
Example (1) illustrates the prediction of loss (claim) severity.
Example 1:
The gamma distribution is a heavy-tailed scale distribution with scale parameter β and probability density
f(y; α, β) = y^{α−1} e^{−y/β} / (Γ(α) β^α)
Its expectation is μ = α·β, so β = μ/α.
Writing this density in the form f(y; μ, θ):
f(y; μ, α) = (α/μ)^α y^{α−1} e^{−αy/μ} / Γ(α)
Assume the analyzed loss (claim) severity random variables Y_i follow the gamma distribution and are mutually independent (conditionally on their features and parameters), with density f(y_i; μ_i, α), where the prediction of μ_i is given by the XGBoost-type tree functions, α > 0, μ_i > 0.
The training-set loss function is
Figure PCTCN2022104694-appb-000129
It is twice differentiable in ŷ_i, and has exactly one local minimum at which alone the derivative is zero, or is strictly monotonic; but it is not a convex function of ŷ_i.
When α = 5 and y_i = 4, the graph of l as a function of ŷ_i is shown in Fig. 2.
Once the values of α and the hyperparameters are determined, the improved XGBoost-type method yields the predicted minimum of the original objective, the predictions of the predicted variable, the corresponding loss values and the conditional probability distribution of the loss (claim) severity.
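For the gamma example, with α fixed and a log link on the mean (the link choice is an assumption here, in the spirit of the GLM analogy above), the per-sample negative log-likelihood is, up to constants in μ, l = α(log μ + y/μ). Its gradient and Hessian in η = log μ are as follows, in the grad/hess form an XGBoost-style custom objective expects:

```python
import numpy as np

def gamma_nll_grad_hess(y, eta, alpha=1.0):
    """Gradient and Hessian of l = alpha * (log(mu) + y/mu) with respect to
    eta = log(mu). Under the log link the Hessian alpha*y/mu is nonnegative,
    even though l is not convex in mu itself."""
    mu = np.exp(eta)
    grad = alpha * (1.0 - y / mu)
    hess = alpha * y / mu
    return grad, hess
```

At μ = y the gradient vanishes, matching the unique minimum of the per-sample loss; in μ itself the second derivative α(2y − μ)/μ³ turns negative for μ > 2y, which is the non-convexity Fig. 2 illustrates.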
Predicting loss (claim) counts:
This is illustrated by example (2).
Example 2:
Let Y follow a mixture of the distribution degenerate at 0 and a Poisson distribution, with probabilities
P(Y = 0) = 1 − α + α e^{−λ};  P(Y = k) = α e^{−λ} λ^k / k!,  k = 1, 2, …
This distribution belongs to the (a, b, 1) class and does not belong to the exponential family; μ = E(Y) = αλ.
Assume the numbers of losses (claims) Y_i in the policy period follow this distribution and are mutually independent, with probability function
Figure PCTCN2022104694-appb-000135
The training-set loss function is
Figure PCTCN2022104694-appb-000136
It is twice differentiable in ŷ_i, and has exactly one local minimum at which alone the derivative is zero, or is strictly monotonic; but when y_i = 0 it is not a convex function of ŷ_i.
When α = 0.5 and y_i = 0, the graph of l as a function of ŷ_i is shown in Fig. 3.
Once the values of α and the hyperparameters are determined, the improved XGBoost-type method yields the predicted minimum of the original objective, the predictions of the predicted variable, the corresponding loss values and the conditional probability distribution of the loss (claim) counts.
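The mixture probability function just described can be evaluated directly. The parameterization below, by the mean μ = αλ, is reconstructed from the stated mixture and mean (the patent's own formula appears only as a figure reference):

```python
import math

def zip_nll(y, mu, alpha):
    """Negative log-likelihood of one count observation under the
    0/Poisson mixture: P(0) = 1-alpha + alpha*exp(-lam),
    P(k) = alpha*exp(-lam)*lam**k/k!, with lam = mu/alpha."""
    lam = mu / alpha
    if y == 0:
        p = (1.0 - alpha) + alpha * math.exp(-lam)
    else:
        p = alpha * math.exp(-lam) * lam ** y / math.factorial(y)
    return -math.log(p)
```

Setting α = 1 recovers the plain Poisson negative log-likelihood, which is a quick sanity check on the mixture.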
Once an estimate of θ is obtained, the conditional probability distribution of the predicted random variable follows.
For the choice of evaluation metric, it is best to keep the metric unified with the loss function.
Preferably, the negative of the log-likelihood on the validation and test sets,
Figure PCTCN2022104694-appb-000141
is used as the corresponding evaluation metric, n being the number of samples in the corresponding set. θ is an unknown parameter, and the hyperparameters γ and λ must be tuned on the validation set by grid search or similar methods. θ is therefore treated as a nuisance parameter/hyperparameter: search, by grid search or similar, for the value θ̂ minimizing the validation-set loss function as the estimate of θ.
On this basis, the validation-set metric then selects the hyperparameters and the value of θ̂ and determines the optimal model structure. Having obtained θ̂, the hyperparameter values and the model structure, merge the training and validation sets into a new training set and retrain with the same structure settings, obtaining the updated model and model parameters. Predict the test-set samples with the updated model to obtain its test-set metric value. Choose other plausible parametric distributions and repeat the previous steps to remodel, with the test set unchanged, obtaining new metric values; repeat this step until all plausibly suitable parametric distributions have been modeled. Compare the corresponding metric values and keep the model or models with the best evaluation as prediction models. Keeping the model structure settings, retrain and update the model with all sample data (including the test set) to obtain the final prediction model.
If k-fold cross-validation is used, the average of the k estimates θ̂ obtained in the k trainings may be taken as the estimate of θ.
The symbols above have the same meanings as introduced in the background.
Different feature-engineering schemes may be adopted by repeating the above steps and comparing the schemes with the validation-set metric.
On the basis of the above, once the conditional distributions of loss (claim) counts and loss (claim) severities are obtained, the pure-premium model yields the pure premium, the aggregate-loss distribution, the aggregate-claim distribution and other non-life pricing elements.
Example 3
In the improved XGBoost-type method of this example, the method can further be generalized from single-variable prediction to multi-parameter prediction of a parametric distribution, forming the multi-round cyclic improved XGBoost-type data analysis method, which predicts all parameters of the common parametric probability distributions of the predicted random variable with boosting trees.
In this example, the improved XGBoost-type model is used to model the predicted random variable Y_i over multiple rounds, which can improve predictive performance.
Here Y_i denotes the loss (claim) severity or the number of losses (claims) in the policy period.
Specifically, this example further extends the scheme of Example 2. After the estimates of μ_i and the nuisance parameters θ_1, …, θ_l (l the number of nuisance parameters) have been obtained:
(1) Treat the estimates of μ_i and θ_2, …, θ_l as fixed parameters; the loss function is the corresponding l(y_i; μ_i, θ_{1,i}, θ_2, …, θ_l). If this loss is twice (or correspondingly once) partially differentiable in θ_{1,i} for any values of y_i, μ_i, θ_2, …, θ_l, with exactly one local minimum at which alone the derivative is zero, or strictly monotonic, take θ_{1,i} as the predicted variable and build a prediction model for it with the improved XGBoost-type method, obtaining the prediction θ̂_{1,i}.
Optionally, use the estimate of θ_1 obtained in (*) as the initial value of θ̂_{1,i} to speed up convergence.
(2) Treat the estimates of μ_i and θ_{1,i}, θ_3, …, θ_l as fixed parameters; the loss function is the corresponding function of θ_{2,i}. If it is twice (or correspondingly once) partially differentiable in θ_{2,i} for any values of y_i, μ_i, θ̂_{1,i}, θ_3, …, θ_l, with exactly one local minimum at which alone the derivative is zero, or strictly monotonic,
take θ_{2,i} as the predicted variable and build a prediction model for it with the XGBoost method, obtaining the prediction θ̂_{2,i}.
Optionally, use the estimate of θ_2 obtained in (*) as the initial value to speed up convergence.
(3) Repeat the above steps to obtain the predictions of θ_{3,i}, …, θ_{l,i}.
Note: the regularization terms of XGBoost-type methods keep the scores of the leaves from differing too much.
An example, continuing example (1) of Example 2:
After the improved XGBoost method has produced estimates of μ_i and α, fix each μ_i and treat α as the predicted variable, with loss function l(y_i; μ_i, α). This l is twice partially differentiable in α for any y_i and μ_i, and has exactly one local minimum at which alone the derivative is zero, or is strictly monotonic, satisfying the convergence requirements of the improved XGBoost-type method.
After fixing the corresponding parameters, several example graphs of l are shown in Fig. 4.
Build the improved XGBoost-type prediction model to obtain α̂_i.
(4) Take the predicted θ̂ values as the value of θ and predict μ_i with the improved XGBoost-type method.
Repeating the above steps yields a new round of predictions. Optionally, use the estimates of θ_j (j = 1, 2, …, l) obtained in (*) as the corresponding initial iterates to speed up convergence.
(5) Repeat step (4) until the evaluation metric on the validation set converges. Keep the model of every step above and use the test set to select the optimal probability distribution and parameter structure.
On the choice of validation metric: if a traditional metric such as mean squared error is used, validation proceeds as in step (2). If the negative log-likelihood on the validation set is used as the metric, then for the model whose predicted variable is θ_{j,i} the fixed parameters of the negative log-likelihood are the prediction function values of the trained improved XGBoost-type models, with n the number of validation samples.
Optionally, carve a portion of the test set out as a second validation set (or repartition all samples into a training set, a first validation set, a second validation set and a test set) to validate the predictive performance of a given probability distribution of the original predicted variable Y_i under the various parameter structures (different cycling rounds and different parameter-iteration counts give different distribution parameter structures), i.e. the fit of the model obtained from each iteration above; the test set is then used to evaluate the fit of that probability distribution. Partitioning two validation sets in this way helps avoid overfitting.
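The multi-round cycling in steps (1)-(5), fitting one parameter with the others frozen, rotating through the parameters, and repeating until the validation metric converges, has the control-flow skeleton below. `fit_param` and `val_metric` are caller-supplied placeholders (assumptions) standing for the improved-XGBoost fit of a single parameter and the validation metric:

```python
def cyclic_fit(init_params, fit_param, val_metric, max_rounds=20, tol=1e-6):
    """Coordinate-wise boosting: params maps parameter name -> current
    model/value. Each round refits every parameter in turn with the others
    held fixed, stopping when the validation metric stops improving."""
    params = dict(init_params)
    best = val_metric(params)
    for _ in range(max_rounds):
        for name in params:
            params[name] = fit_param(name, params)
        score = val_metric(params)
        if best - score < tol:
            break
        best = score
    return params, best
```

The warm-start option described above corresponds to seeding `init_params` with the estimates from (*).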
Example 4
On the basis of the improved XGBoost-type scheme, this example further gives the multivariate regularized boosting tree scheme.
This example generalizes the improved XGBoost-type method to the prediction of multiple parameters to be estimated: a single algorithmic model simultaneously predicts multiple parameters of the parametric probability distribution, which can increase the model's predictive performance and improve computational efficiency and interpretability.
Let the l-variate loss function be l(y_i; θ_{1,i}, …, θ_{l,i}). Assume that within the range under discussion it is twice differentiable and has exactly one local minimum; if the approximate expression (1) of the objective below is used, the requirement on l can be relaxed to once differentiable with exactly one local minimum.
After any single parameter to be estimated is selected, with the remaining parameters fixed, there is exactly one local minimum;
and the partial derivative with respect to the parameter is zero only at the local minimum described above, or the function is strictly monotonic.
Note: y_i is an observation and is treated as a fixed parameter, not as a variable or a parameter to be estimated. The range of discussion of the parameters to be estimated can be chosen freely and reasonably; in practical use, reasonable predictions never fall exactly on the theoretical extreme boundary. Sometimes the range may be taken as a closed interval, or its boundary may be kept at a reasonable distance from the theoretical boundary point.
A sample set D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R) has m features and n samples. K_j tree functions are summed to obtain the prediction of the parameter of the j-th kind:
Figure PCTCN2022104694-appb-000168
where F = {f(x) = ω_{q(x)}} (q: R^m → T, ω ∈ R^T) is the regression-tree space. q denotes the structure of each tree and maps a sample to its leaf; T is the number of leaves of a tree. Each tree function corresponds to an independent tree structure q and leaf weights ω. To learn these tree functions in the model, minimize the following regularized objective:
Figure PCTCN2022104694-appb-000170
where the regularization hyperparameters are those of the tree functions predicting the respective parameters, and T denotes the number of leaves of one tree.
The objective function of the t-th iteration is approximated by one of:
(1)
Figure PCTCN2022104694-appb-000178
(2)
Figure PCTCN2022104694-appb-000179
Similarly to the improved XGBoost method's approximation of the t-th iteration objective, some weighted average (linear combination) of the h_i-related terms of (1) and (2) may also be seen as a variant of the approximation formula:
(3)
Figure PCTCN2022104694-appb-000180
where the g_i are the partial derivatives of the loss function with respect to the corresponding parameters to be estimated, and the h_i are the corresponding second partial derivatives of the loss function.
The multivariate regularized boosting tree method is not limited to the case where some h_i is not always nonnegative; it also applies when all h_i are always nonnegative, in which case the approximate expression (2) formally simplifies to:
Figure PCTCN2022104694-appb-000187
Each training round trains at most l trees simultaneously, each tree with its own hyperparameters.
If some gradient component is extremely large, i.e. its absolute value exceeds a sufficiently large positive number M_j, its value may be capped (replaced by M_j times its sign, the capped value still being written with the same symbol below); this makes the algorithm converge faster, and in particular makes the algorithm converge when the gradient is infinite at some point.
Treating each parameter θ_j independently, the structure and functional expression of its tree functions are the same as in the improved XGBoost-type algorithm.
For each parameter to be estimated θ_j there is a learning rate η_j, a number of training rounds K_j and a hyperparameter M_j.
For parameters with high certainty, a smaller number of training rounds K can be set individually; a preferred scheme is to set an interval between iteration rounds so that the total number of training rounds decreases.
The remaining details of the algorithm, including tree splitting, the predicted values, and the optional additional l1 regularization term, are the same as in the improved XGBoost-type method of Example 1.
The initial iterate of a parameter θ_j can be obtained as its maximum likelihood estimate on the training set (ignoring x_i).
Taking non-life pricing as an example, this improves the way the conditional probability distribution of the predicted variable is obtained in step 5 of the Example 2 scheme. Choose a suitable parametric probability distribution and, under the independence assumption, take its negative log-likelihood as the loss function l.
When the loss function satisfies the corresponding conditions, continue; otherwise change the fitted distribution or the parametric form among the candidates. Assume that, within the range under discussion, the loss function l is twice differentiable with exactly one local minimum (once differentiable with exactly one local minimum suffices if approximation (1) is used); that after any single parameter to be estimated is selected, with the remaining parameters fixed, there is exactly one local minimum; and that the partial derivative with respect to the parameter is zero only at that local minimum, or the function is strictly monotonic.
Example (3) illustrates.
Example 3:
Assume the numbers of losses Y_i in the policy period follow a negative binomial distribution, taken as the predicted variable; the Y_i are mutually independent. One classic form of its probability function is
f(y_i; β_i, γ_i) = C(y_i + γ_i − 1, y_i) · (1/(1+β_i))^{γ_i} · (β_i/(1+β_i))^{y_i}
The training-set loss function is the corresponding negative log-likelihood
Figure PCTCN2022104694-appb-000201
Any reasonable range of discussion can be set for the parameters to be estimated β_i, γ_i; one way is to set β_i ∈ [ε_1, M_1], γ_i ∈ [ε_2, M_2], with ε_1, ε_2 sufficiently small positive numbers and M_1, M_2 sufficiently large positive numbers.
One can verify that within the range under discussion the loss function is twice differentiable and has exactly one local minimum;
after any single parameter to be estimated (β_i or γ_i) is selected, with the other fixed, there is exactly one local minimum;
and the partial derivative with respect to the selected parameter is zero only at the local minimum described above, or the function is strictly monotonic.
(Note: in this example μ_i = β_i γ_i.)
The requirements of the multivariate regularized boosting tree method on the loss function are satisfied, and the method can be used to obtain the explicit conditional probability distribution of the predicted variable Y_i.
However, with y_i and γ_i fixed, the loss function is not necessarily a convex function of β_i.
For example: when y_i = 0 and γ_i = 1, the loss function l is a concave function of β_i; its graph is shown in Fig. 5.
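The concavity claim for y_i = 0, γ_i = 1 can be checked numerically. Under the classic actuarial negative binomial form quoted above (itself a reconstruction, since the patent's formula is an image), Pr(Y = 0) = 1/(1+β) when γ = 1, so the per-sample loss is l(β) = log(1+β):

```python
import math

def nb_nll_y0_g1(beta):
    # NLL of observing y = 0 with gamma = 1: Pr(Y=0) = 1/(1+beta)
    return math.log(1.0 + beta)

def second_diff(f, x, h=1e-4):
    # central second difference, a numerical proxy for f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2
```

The second derivative of log(1+β) is −1/(1+β)², which is negative everywhere: the loss is concave in β, exactly the non-convex case the modified h_i terms are designed to handle.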
Taking the hold-out method as an example, determine the model's hyperparameters by grid search or other methods so as to minimize the validation-set evaluation metric, obtaining the model structure, the parameter values inside the boosting-tree model and the optimal hyperparameter values.
The modeling process above can adopt different feature-engineering schemes. Merge the training and validation sets and retrain the model with the learned hyperparameters. Change the candidate probability-distribution type of the predicted variable and repeat the modeling and training. Predict the test set with the learned models and select one or several probability distributions with the smallest evaluation metric, together with the corresponding prediction models, as the optimal model. Merge all sample sets, retrain with the learned hyperparameters, and put the resulting final model into production. The preferred evaluation metric is the negative log-likelihood.
Since LightGBM, CatBoost and similar methods closely resemble XGBoost, the improvements to XGBoost-type methods in this patent refer to improvements to all methods similar to the XGBoost method, such as the well-known LightGBM and CatBoost methods.
The improved XGBoost-type method, the multi-round cyclic XGBoost-type method and the multivariate regularized boosting tree method can be applied in practice whenever one solves the optimization problem of minimizing an objective function whose loss function satisfies the stated conditions, or obtains maximum likelihood estimates of the parameters of a probability distribution satisfying those conditions (conditional maximum likelihood estimates at sample points with different features); they are not limited to non-life insurance pricing and can be widely applied in many fields.
An embodiment of the invention further provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the steps of any one or more of the schemes of Examples 1-4.
An embodiment of the invention further provides a processor configured to run a program which, when running, executes the steps of any one or more of the schemes of Examples 1-4.
An embodiment of the invention further provides a terminal device comprising a processor, a memory and a program stored on the memory and runnable on the processor; the program code is loaded and executed by the processor to implement the steps of any one or more of the schemes of Examples 1-4.
The invention further provides a computer program product which, when executed on a data processing device, is adapted to execute the steps of any one or more of the schemes of Examples 1-4.
In the above embodiments, each is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of the other embodiments.
Those skilled in the art will clearly appreciate that, for convenience and brevity of description, for the specific working processes of the systems, devices and modules described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
Those skilled in the art should understand that embodiments of the invention may be provided as methods, systems or computer program products. The invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects, and may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that every flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in that memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on it to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include non-persistent storage in computer-readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read-only memory (ROM) or flash RAM. The memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
It should also be noted that the terms "comprising", "including" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of further identical elements in the process, method, article or device comprising it.
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited by the above embodiments, which, together with the description, merely illustrate the principles of the invention; various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope of the invention, which is defined by the appended claims and their equivalents.

Claims (18)

  1. 基于改进型XGBoost类方法的数据分析方法,其特征在于,采用改进型XGBoost类方法基于获取到的变量参数进行预测评估,所述改进型XGBoost类方法对XGBoost类算法中的目标函数近似表达的二阶泰勒展开做修正,h i不恒为非负时,通过修改其h i相关项,改进型XGBoost类方法的适用性不局限于凸损失函数。
  2. 根据权利要求1所述的基于改进型XGBoost类方法的数据分析方法,其特征在于,所述改进型XGBoost类方法将XGBoost类方法从单变量预测推广到参数分布的多参数预测,形成多轮循环改进型XGBoost类数据分析方法。
  3. 根据权利要求1所述的基于改进型XGBoost类方法的数据分析方法,其特征在于,所述改进型XGBoost类方法中,设定损失函数
    Figure PCTCN2022104694-appb-100001
    在讨论的范围内:对
    Figure PCTCN2022104694-appb-100002
    二阶可导或对
    Figure PCTCN2022104694-appb-100003
    一阶可导;有且仅有一个局部极小值点并且仅在该点导数为0,或者严格单调。
  4. 根据权利要求3所述的基于改进型XGBoost类方法的数据分析方法,其特征在于,所述改进型XGBoost类方法中,对第t次迭代的目标函数
    Figure PCTCN2022104694-appb-100004
    采用以下近似之一:
    Figure PCTCN2022104694-appb-100005
    Figure PCTCN2022104694-appb-100006
    (1)式和(2)式的加权平均表达。
  5. 一种定价方法,其特征在于,所述定价方法基于权利要求1-4中任一项所述的数据分析方法进行精算定价。
  6. 根据权利要求5所述的定价方法,其特征在于,所述定价方法包括:
    (1)首先选择要预测的随机变量,收集样本数据,包括样本属性和预测变量的观测值;
    (2)对样本数据进行预处理;
    (3)进行特征工程,得到更新后的样本集D={(x i,y i)};x i是第i个样本的特征向量;
    (4)将样本集划分为训练集,验证集和测试集;所述训练集用来训练用于预测预测变量的学习模型,验证集用来调整超参数,测试集用来评估学习模型性能;
    (5)选择预测随机变量的参数分布类型,用改进型XGBoost类方法求得预测变量的条件概率分布;
    (6)在候选分布中重新选择需要拟合的分布,重复以上步骤(5),用测试集的评估指标确定最优参数分布。
  7. 根据权利要求6所述的定价方法,其特征在于,所述定价方法基于改进型XGBoost类方法求得预测变量的条件概率分布,包括:
    (1)从候选参数概率分布中选择某一分布,确定其参数;
    (2)将预测变量的期望值表达式作为期望参数,对该概率分布的表达式进行变形,将期望参数作为预测参数,预测参数以外的参数看作麻烦参数、超参数;如该分布表达式本身已含期望参数,则不需要变形,直接设定预测参数和超参数;
    (3)确定目标函数,以该分布的负对数似然函数作为损失函数;
    (4)对超参数确定其值;
    (5)当超参数固定时,用改进型XGBoost类算法求得预测参数的预测值;
    (6)更换超参数取值,重复步骤(5),用验证集的评估指标确定最优参数预测值和最优超参数取值;从而得到预测变量的预测值和其具体概率分布表达式。
  8. 一种数据分析方法,其特征在于,形成改进型XGBoost类方法,并直接推广至多元,形成多元正则化提升树方法,所述多元正则化提升树方法对XGBoost类方法中的目标函数近似表达的二阶泰勒展开做修正,修改其h i相关项,使得多元正则化提升树方法的适用性不局限于凸损失函数,并在算法层面同时最优化求解多元目标函数的多个变量。
  9. 根据权利要求8所述的数据分析方法,其特征在于,所述多元正则化提升树方法中,设定损失函数l在讨论的范围内:(1)二阶可微,有且仅有一个局部极小值点;或一阶可微,有且仅有一个局部极小值点;(2)选定任意的某个待估参数作为考察变量后,当其余参数固定时,有且仅有一个局部极小值点;
    仅在前段所述局部极小值点对考察变量的偏导数为0,或者严格单调。
  10. 根据权利要求8所述的数据分析方法,其特征在于,所述多元正则化提升树方法中目标函数的表达式为:
    Figure PCTCN2022104694-appb-100007
    其中Ω是正则化项;
    Figure PCTCN2022104694-appb-100008
    Figure PCTCN2022104694-appb-100009
    Figure PCTCN2022104694-appb-100010
    的正则项超参数,
    Figure PCTCN2022104694-appb-100011
    Figure PCTCN2022104694-appb-100012
    中一棵树的叶子结点个数,l是待估参数的个数,k是对应的预测待估参数的提升树的层数,
    也可将l 1正则化项额外加入到Ω中。
  11. 根据权利要求8所述的数据分析方法,其特征在于,所述多元正则化提升树方法中,对第t次迭 代的目标函数
    Figure PCTCN2022104694-appb-100013
    采用以下近似之一:
    Figure PCTCN2022104694-appb-100014
    Figure PCTCN2022104694-appb-100015
    或(1)式和(2)式各h i相关项的加权平均表达;
    其中,
    Figure PCTCN2022104694-appb-100016
    是损失函数
    Figure PCTCN2022104694-appb-100017
    Figure PCTCN2022104694-appb-100018
    的偏导数,
    Figure PCTCN2022104694-appb-100019
    是损失函数
    Figure PCTCN2022104694-appb-100020
    Figure PCTCN2022104694-appb-100021
    的二阶偏导数。
  12. 一种定价方法,其特征在于,所述定价方法基于权利要求8-11项中任一项所述的数据分析方法进行精算定价。
  13. The pricing method according to claim 12, wherein the pricing method comprises:
    (1) First select the random variable to be predicted and collect sample data, including sample attributes and observed values of the predicted variable;
    (2) Preprocess the sample data;
    (3) Perform feature engineering to obtain the updated sample set D = {(x_i, y_i)}, where x_i is the feature vector of the i-th sample;
    (4) Divide the sample set into a training set, a validation set, and a test set; the training set is used to train the learning model that predicts the parameters to be estimated of the parametric distribution, the validation set is used to tune hyperparameters, and the test set is used to evaluate the learning model's performance;
    (5) Select a parametric distribution type for the predicted random variable, and obtain the conditional probability distribution of the predicted variable using the multivariate regularized boosting tree method;
    (6) Reselect a distribution to fit from the candidate distributions, repeat step (5), and determine the optimal parametric distribution using the evaluation metric on the test set.
  14. The pricing method according to claim 13, wherein the pricing method obtains the conditional probability distribution of the predicted variable based on the multivariate regularized boosting tree method, comprising:
    (1) Select a distribution from the candidate parametric probability distributions and determine its parameter form;
    (2) Determine the objective function, using the negative log-likelihood function of the distribution as the loss function;
    (3) Obtain the predicted values of all parameters of the distribution using the multivariate regularized boosting tree method, thereby obtaining the concrete expression of the probability distribution of the predicted variable.
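The joint estimation of all distribution parameters in step (3) can be illustrated as follows. This is a hypothetical sketch, not the patented multivariate regularized boosting tree: it alternates gradient steps on both parameters (mu, log sigma) of a Normal under the negative log-likelihood, using constant learners in place of trees to keep the example short.

```python
import numpy as np

# Jointly estimate mu and log(sigma) of a Normal by alternating
# gradient-boosting-style updates on the negative log-likelihood
# NLL = log(sigma) + (y - mu)^2 / (2 sigma^2) (constant terms dropped).
rng = np.random.default_rng(1)
y = rng.normal(2.0, 0.5, 1000)

mu, log_sigma = 0.0, 0.0          # one boosted score per parameter
lr = 0.1
for _ in range(300):
    sigma = np.exp(log_sigma)
    g_mu = (mu - y) / sigma ** 2                 # d NLL / d mu
    g_ls = 1.0 - (y - mu) ** 2 / sigma ** 2      # d NLL / d log(sigma)
    # A constant learner fitted to the negative gradient is its mean
    mu += lr * (-g_mu.mean())
    log_sigma += lr * (-g_ls.mean())

print(mu, np.exp(log_sigma))      # approaches the sample MLE (mean, std)
```

Parameterizing the scale as log(sigma) keeps sigma positive without constraints; in the claimed method the two constant scores would each be replaced by a stack of regression trees, updated jointly against the multivariate objective.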
  15. A computer-readable storage medium on which a program is stored, characterized in that, when the program is executed by a processor, it implements the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
  16. A processor for running a program, characterized in that, when the program runs, it implements the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
  17. A terminal device comprising a processor, a memory, and a program stored in the memory and runnable on the processor, characterized in that the program code is loaded and executed by the processor to implement the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
  18. A computer program product, characterized in that, when executed on a data processing device, it is adapted to execute the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
PCT/CN2022/104694 2021-07-09 2022-07-08 Data analysis method and pricing method based on improved XGBoost-type method, and related device WO2023280316A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110781586.X 2021-07-09
CN202110781586 2021-07-09

Publications (1)

Publication Number Publication Date
WO2023280316A1 (zh)

Family

ID=84801333

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104694 WO2023280316A1 (zh) 2021-07-09 2022-07-08 Data analysis method and pricing method based on improved XGBoost-type method, and related device

Country Status (2)

Country Link
CN (1) CN115601182A (zh)
WO (1) WO2023280316A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402252A (zh) * 2023-03-30 2023-07-07 重庆市生态环境大数据应用中心 用于水污染防治的智能分析决策方法及系统
CN116595872A (zh) * 2023-05-12 2023-08-15 西咸新区大熊星座智能科技有限公司 基于多目标学习算法的焊接参数自适应预测方法
CN116628970A (zh) * 2023-05-18 2023-08-22 浙江大学 基于数据挖掘的航天薄壁件旋压成型工艺参数优化方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116451034A (zh) * 2023-03-30 2023-07-18 重庆大学 基于xgboost算法的压力源与水质关系的分析方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536650A (zh) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 生成梯度提升树模型的方法和装置
CN108777674A (zh) * 2018-04-24 2018-11-09 东南大学 一种基于多特征融合的钓鱼网站检测方法
WO2020247949A1 (en) * 2019-06-07 2020-12-10 The Regents Of The University Of California General form of the tree alternating optimization (tao) for learning decision trees
CN112821420A (zh) * 2021-01-26 2021-05-18 湖南大学 一种基于XGBoost的ASFR模型中动态阻尼因子、多维频率指标的预测方法及系统


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402252A (zh) * 2023-03-30 2023-07-07 重庆市生态环境大数据应用中心 用于水污染防治的智能分析决策方法及系统
CN116595872A (zh) * 2023-05-12 2023-08-15 西咸新区大熊星座智能科技有限公司 基于多目标学习算法的焊接参数自适应预测方法
CN116595872B (zh) * 2023-05-12 2024-02-02 西咸新区大熊星座智能科技有限公司 基于多目标学习算法的焊接参数自适应预测方法
CN116628970A (zh) * 2023-05-18 2023-08-22 浙江大学 基于数据挖掘的航天薄壁件旋压成型工艺参数优化方法

Also Published As

Publication number Publication date
CN115601182A (zh) 2023-01-13

Similar Documents

Publication Publication Date Title
WO2023280316A1 (zh) Data analysis method and pricing method based on improved XGBoost-type method, and related device
Weng et al. Gold price forecasting research based on an improved online extreme learning machine algorithm
Pei et al. Wind speed prediction method based on empirical wavelet transform and new cell update long short-term memory network
WO2021007812A1 (zh) 一种深度神经网络超参数优化方法、电子设备及存储介质
Froelich et al. Evolutionary learning of fuzzy grey cognitive maps for the forecasting of multivariate, interval-valued time series
Yu et al. A novel elastic net-based NGBMC (1, n) model with multi-objective optimization for nonlinear time series forecasting
US20230075100A1 (en) Adversarial autoencoder architecture for methods of graph to sequence models
Gong et al. Forecasting stock volatility process using improved least square support vector machine approach
Barratt et al. Least squares auto-tuning
CN114817571B (zh) 基于动态知识图谱的成果被引用量预测方法、介质及设备
Chu et al. Comparing out-of-sample performance of machine learning methods to forecast US GDP growth
US20230306505A1 (en) Extending finite rank deep kernel learning to forecasting over long time horizons
Bui et al. Gaussian process for predicting CPU utilization and its application to energy efficiency
Yu et al. Ceam: A novel approach using cycle embeddings with attention mechanism for stock price prediction
Alizadeh et al. Simulating monthly streamflow using a hybrid feature selection approach integrated with an intelligence model
Zhang et al. Latent adversarial regularized autoencoder for high-dimensional probabilistic time series prediction
Wang et al. An enhanced interval-valued decomposition integration model for stock price prediction based on comprehensive feature extraction and optimized deep learning
Park et al. DeepGate: Global-local decomposition for multivariate time series modeling
Kisiel et al. Portfolio transformer for attention-based asset allocation
Xing et al. Application of a hybrid model based on GA–ELMAN neural networks and VMD double processing in water level prediction
Co et al. Comparison between ARIMA and LSTM-RNN for VN-index prediction
Lian et al. A tweedie compound poisson model in reproducing kernel hilbert space
Yan et al. Transferability and robustness of a data-driven model built on a large number of buildings
Prüser et al. Nonlinearities in macroeconomic tail risk through the lens of big data quantile regressions
Chen et al. A novel expectation–maximization-based separable algorithm for parameter identification of RBF-AR model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22837061

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE