WO2023280316A1 - Data analysis method and pricing method based on an improved XGBoost-class method, and related devices - Google Patents
- Publication number
- WO2023280316A1 (PCT/CN2022/104694)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0206—Price or cost determination based on market factors
- G06Q40/08—Insurance
Definitions
- the invention relates to machine learning technology and actuarial technology, and in particular to a corresponding big data analysis method.
- the insurance company measures the net (pure) premium of the insured, i.e. the expected net compensation. Because non-life insurance terms are short, the pure premium here does not consider interest. To measure the pure premium, it is best to estimate the probability distribution of the loss (payment) amount, for a single accident or summed over the insurance period, rather than merely its expected value. In indemnity insurance there is generally a deductible (or limit) applied to the compensation for a single accident or to the total loss over the insurance period; only with the probability distribution of the loss (compensation) in hand can an adjustment of the deductible (or limit) be translated into the corresponding adjustment of the pure premium.
- under the standard assumptions, the methods for finding the probability distribution of the total loss (total compensation) include characteristic-function transform methods (Fourier transform methods) and stochastic simulation methods.
- for assumption b, there are too many parameters to estimate and thus a risk of over-fitting, so it is rarely used in the industry.
- methods of the second category are more refined methods that offer many benefits over methods of the first category.
- the XGBoost method is an extreme gradient boosting tree method, which has excellent prediction performance and has achieved very good results in many fields.
- given a sample set D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R) with m features and n samples,
- an ensemble tree model predicts by summing K tree functions: ŷ_i = Σ_{k=1}^{K} f_k(x_i), where
- Ω(f_k) is a regularization term penalizing the complexity of each tree.
- the XGBoost algorithm minimizes the objective with a boosting-tree procedure: letting ŷ_i^(t-1) be the prediction for the i-th sample after t-1 iterations, a function f_t is added to it and the following objective is minimized: L^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t).
- for a fixed tree structure q, the optimal objective function value is L̃^(t)(q) = -(1/2) Σ_{j=1}^{T} (Σ_{i∈I_j} g_i)^2 / (Σ_{i∈I_j} h_i + λ) + γT.
- the tree structure q is obtained using a greedy algorithm, iteratively adding branches starting from a single leaf node.
- the split-gain formula L_split = (1/2)[ G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - (G_L+G_R)^2/(H_L+H_R+λ) ] - γ is used to evaluate candidate split points.
- the shrinkage technique scales the newly added tree weights by a factor η after each boosting step, and is also used to prevent overfitting.
- for a fixed structure, the most suitable value of each leaf weight w_j is found by minimizing the quadratic approximation, giving the optimal weight score of leaf node j: w_j* = -Σ_{i∈I_j} g_i / (Σ_{i∈I_j} h_i + λ).
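A minimal numeric sketch of the standard formulas for the optimal leaf weight and the per-leaf objective contribution; the gradient statistics below are made-up illustrative values:

```python
# Optimal leaf weight and objective value from first/second-order gradient
# statistics, as in the standard XGBoost derivation.  G and H are the sums
# of g_i and h_i over the samples falling in leaf j.

def leaf_weight(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def leaf_objective(G, H, lam, gamma):
    # one leaf's contribution to the optimal objective:
    # -1/2 * G_j^2 / (H_j + lambda) + gamma
    return -0.5 * G * G / (H + lam) + gamma

G, H, lam, gamma = 4.0, 3.0, 1.0, 0.1   # illustrative values
w = leaf_weight(G, H, lam)              # -4 / 4 = -1.0
obj = leaf_objective(G, H, lam, gamma)  # -0.5*16/4 + 0.1 = -1.9
print(w, obj)
```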
- the XGBoost algorithm places stricter requirements on the loss function l: it must be differentiable and convex. If l is not globally convex, there is no guarantee that the objective converges to the global minimum. An example is as follows:
- if the learning rate η is not controlled, the prediction after the t-1-th iteration may land in a neighborhood where l is concave, so that the first derivative g_i is positive while the second derivative h_i is negative; the "optimal" weight score computed for the sample in the t-th iteration then points away from the minimum and the iteration can diverge.
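The failure mode described above can be shown numerically. The loss below, l(w) = -cos(w), is a hypothetical non-convex example, not one from the patent; around w = 2.5 it is concave, so h + λ is negative and the XGBoost-style step moves uphill:

```python
import math

# Hypothetical non-convex loss l(w) = -cos(w).  Near w = 2.5 the second
# derivative h is negative while the first derivative g is positive, so the
# XGBoost-style update w* = -g / (h + lambda) points uphill and the loss
# increases instead of decreasing.

def l(w):  return -math.cos(w)
def g(w):  return math.sin(w)   # first derivative
def h(w):  return math.cos(w)   # second derivative

w, lam = 2.5, 0.1
step = -g(w) / (h(w) + lam)     # h(w) + lam < 0 here
w_new = w + step

print(l(w), l(w_new))           # the loss goes up, not down
```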
- existing XGBoost-like methods are limited to fitting single-parameter probability distributions.
- the existing XGBoost methods cannot optimize multiple parameters simultaneously and often cannot achieve optimal prediction performance. For example, if the loss frequency in general insurance pricing follows a two-parameter negative binomial distribution, fitting it with a single-parameter Poisson distribution is inappropriate.
- the object of the present invention is to provide a data analysis method based on an improved XGBoost method, thereby effectively improving the performance of big data analysis and prediction.
- the present invention further provides a pricing method based on an improved XGBoost method, which effectively overcomes the defects in the existing solutions.
- the data analysis method based on the improved XGBoost-class method provided by the present invention uses the improved method to predict and evaluate based on the obtained variable parameters; the improvement modifies the second-order Taylor expansion used to approximate the objective function in XGBoost-class algorithms.
- the improved XGBoost method extends the XGBoost method from univariate prediction to multi-parameter prediction of parameter probability distribution, forming a multi-cycle improved XGBoost data analysis method.
- the loss function is required, within the scope of discussion, to be second-order differentiable and to have one and only one local minimum point, at which alone the derivative is 0, or else to be strictly monotonic.
- the present invention provides a pricing method, which performs non-life insurance actuarial pricing based on the above data analysis method.
- the pricing method includes:
- the sample set is divided into a training set, a verification set, and a test set; the training set trains the learning model used to predict the predictor variable, the verification set is used to tune hyperparameters, and the test set is used to evaluate model performance;
- the pricing method obtains the conditional probability distribution of predictor variables based on the improved XGBoost method, including:
- the expression for the expected value of the predictor variable is taken as an expected parameter, and the probability distribution expression is rewritten in terms of it; the expected parameter serves as the predicted parameter, while the remaining parameters are treated as nuisance parameters and hyperparameters. If the distribution expression already contains the expected parameter, no rewriting is needed, and the prediction parameters and hyperparameters are set directly;
- the present invention provides a data analysis method that extends the improved XGBoost-class method directly to the multivariate case, forming a multivariate regularized boosting tree method; this method corrects the second-order Taylor expansion that approximates the objective function in XGBoost-class algorithms, modifying the h_i-related terms so that the improved method is no longer limited to convex loss functions.
- This method can optimize and solve multiple variables in the multivariate loss function (ie, the parameters to be estimated under consideration) at the same time.
- the loss function l is required, within the scope of discussion, to be (1) second-order (or first-order) differentiable with one and only one local minimum point; and (2) such that, after any parameter to be estimated is selected as the variable under investigation with the remaining parameters fixed, there is one and only one local minimum point, at which alone the partial derivative with respect to that parameter is 0, or else the function is strictly monotonic.
- y i is regarded as a fixed parameter as an observed value, not as a variable or a parameter to be estimated.
- the scope of discussion for the parameters to be estimated can be chosen with reasonable freedom; in practice, reasonable predictions will not fall exactly on the theoretical boundary points.
- the discussed range can be taken as a closed interval whose endpoints keep a reasonable distance from the theoretical boundary points.
- ⁇ is a regularization term
- an l_1 regularization term can additionally be added to Ω, e.g. Ω(f) = γT + (1/2)λ Σ_j w_j^2 + α Σ_j |w_j|.
- the differentiability condition for the loss function can be relaxed to first-order differentiability.
- the present invention provides a pricing method, which performs actuarial pricing based on the above data analysis method.
- the pricing method includes:
- the sample set is divided into a training set, a verification set and a test set;
- the training set is used to train the learning model of the parameter to be estimated for predicting the parameter distribution,
- the verification set is used to adjust hyperparameters, and
- the test set is used to evaluate learning model performance;
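The three-way split above can be sketched as follows; the 60/20/20 proportions are an illustrative assumption, not prescribed by the patent:

```python
import random

# Hold-out split of the sample set into training / verification / test sets.
# The 60/20/20 proportions and the toy integer "samples" are assumptions.

def split(samples, seed=0):
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.6 * n)
    n_valid = int(0.2 * n)
    train = shuffled[:n_train]                      # trains the model
    valid = shuffled[n_train:n_train + n_valid]     # tunes hyperparameters
    test = shuffled[n_train + n_valid:]             # evaluates performance
    return train, valid, test

train, valid, test = split(list(range(100)))
print(len(train), len(valid), len(test))  # 60 20 20
```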
- the pricing method obtains the conditional probability distribution of the predictor variables based on the multivariate regularized boosting tree method, including:
- the invention adopts the improved XGBoost method for data analysis, effectively overcoming various defects in the prior art solutions.
- the data analysis method based on the multi-cycle improved XGBoost method provided by the present invention uses the improved XGBoost method to carry out cycle multi-parameter modeling, which further improves the prediction performance of the model.
- the multivariate regularized boosting tree method provided by the present invention further improves the prediction performance of big-data prediction methods, including the non-life insurance pricing method, and improves computational efficiency and model interpretability.
- the present invention further provides a computer-readable storage medium on which a program is stored, and when the program is executed by a processor, the steps of the above-mentioned data analysis method or pricing method are implemented.
- the present invention further provides a processor, the processor is used to run a program, and when the program runs, the steps of the above data analysis method or pricing method are implemented.
- the present invention further provides a terminal device comprising a processor, a memory, and a program stored in the memory and operable on the processor; the program code is loaded and executed by the processor to realize the steps of the above-mentioned data analysis method or pricing method.
- the present invention further provides a computer program product, which is suitable for performing the steps of the data analysis method or the pricing method when executed on the data processing device.
- Figure 1 is an example diagram of a non-convex loss function image in the existing XGBoost algorithm
- Figure 2 is an example diagram of a non-convex loss function image when predicting loss strength in Example 2;
- Figure 3 is an example diagram of a non-convex loss function image when predicting the number of losses in Example 2;
- Fig. 4 is an example image of the loss function l after the corresponding parameters are fixed in Example 3;
- FIG. 5 is an example diagram of an example function image of l (loss function) after the corresponding parameters are fixed in Example 4.
- this solution improves the XGBoost method, combining strong prediction performance with traditional statistical techniques to further improve predictive performance.
- non-life insurance pricing is taken as an example.
- this scheme applies the improved XGBoost method and the derived multivariate regularized boosting tree method to non-life insurance pricing, thereby effectively overcoming the deficiencies of the prior art set forth in the background while retaining its advantages.
- excellent prediction performance is achieved for the number of losses (payments), the loss (payment) intensity, and the total loss amount (or total compensation amount), realizing the desired measurement of pure premiums.
- the corresponding improved XGBoost method is constructed by improving the XGBoost method, so as to overcome the requirement that the loss function of the XGBoost method in the prior art must be a convex function.
- the loss function is set to the negative log-likelihood function of the predictor variable's probability distribution; within the scope of discussion it is further required to have second-order partial derivatives and one and only one local minimum point, at which alone the derivative is 0, or else to be strictly monotonic.
- the value of g_i can be set so that the modified approximation remains valid.
- the optimal objective function value is:
- the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
- This formula is used to calculate candidate split points.
- the improved XGBoost-class method can also be applied in this case.
- the tree structure q is obtained using a greedy algorithm, and iteratively adds branches from a single leaf node.
- M can be set from prior experience, or treated as a hyperparameter.
- the maximum likelihood estimated value of the predicted random variable can be used as the initial iteration value of the predicted variable, so as to improve the convergence speed of the algorithm and the interpretability of the method model.
- the improved XGBoost class method formed in Example 1 is used to form a non-life insurance pricing method.
- the negative log-likelihood function is used as the loss function
- the mean parameter is used as the estimated parameter of the XGBoost class method.
- the improved XGBoost method is used to improve the calculation of the probability distribution of loss (payment) intensity or the number of losses (payments) in non-life insurance pricing. This example mainly includes the following steps:
- Collect sample data including sample attributes and observed values of predictor variables.
- the sample attributes may include model type, mileage driven, car price, age of the car owner, claims in the previous year, traffic violation records, etc.
- the observed value of the predictor variable is the single-accident loss amount within the insurance period.
- the training set is used to train the learning model that predicts the predictor variables,
- the validation set is used to adjust the hyperparameters
- the test set is used to evaluate the model performance.
- the hold-out method, k-fold cross-validation, etc. can be used.
- the process of using the improved XGBoost class method to obtain the conditional probability distribution of the predictor variable includes:
- the expected-value expression of the distribution is substituted into the parametric distribution, so that the expectation becomes a parameter of the probability distribution (the expected parameter), which is then used as the quantity to be estimated by the improved XGBoost method; if the distribution expression already contains the expected parameter, no rewriting is needed, and the prediction parameters and hyperparameters are set directly.
- a link function can also be applied to the expected parameter, e.g. a logarithmic link. Adding a link amounts to a different parameterization; each parameterization has a corresponding loss function, and any of them can be used as long as the method's conditions are met.
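The effect of a log link can be sketched as a pure reparameterization: the raw score η is unconstrained while the expected parameter μ = exp(η) stays positive. The Poisson negative log-likelihood below (constant terms dropped) is used only as an illustration:

```python
import math

# A log link maps the unconstrained raw score eta to a positive expected
# parameter mu = exp(eta); the loss is then expressed in terms of eta.
# Illustrative Poisson NLL, dropping the log(y!) constant.

def nll_poisson_mu(y, mu):
    return mu - y * math.log(mu)

def nll_poisson_eta(y, eta):
    # same loss, reparameterized through the log link
    return nll_poisson_mu(y, math.exp(eta))

y, mu = 3.0, 2.0
# the two parameterizations give identical loss values
print(nll_poisson_mu(y, mu), nll_poisson_eta(y, math.log(mu)))
```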
- the improved XGBoost method can combine the advantages of the generalized linear model method and the XGBoost method, and overcome their respective shortcomings.
- this example adds an evaluation-index scheme to the improved XGBoost-class method, using the loss function of the training set as the evaluation index for the verification and test sets, so that the loss function and evaluation index are unified.
- besides allowing the objective function to be solved optimally, using the log-likelihood function of the predictor variable's probability distribution (or its negative) as the evaluation index follows statistical convention.
- the specific method to obtain the conditional probability distribution of the predictor variable is as follows:
- the type of distribution for predicting the random variable Y is selected empirically from the candidate parametric distributions.
- Y i are independent of each other (conditionally independent with their own characteristics and parameters).
- the loss function over the entire training set is taken as the following objective function, minimized with the improved XGBoost-class method:
- the training set is trained by the improved XGBoost class method
- the above process obtains the estimated value of ⁇ i .
- Scaled distribution: if a random variable follows a parametric distribution, and multiplying it by any positive constant yields a new random variable that still follows the same parametric family, the family is called a scaled distribution.
- Scaling parameter: suppose a random variable follows a scaled distribution with non-negative support.
- a parameter of a scaled distribution is called a scaling parameter if it satisfies the following condition: when the random variable is multiplied by a positive constant, the scaling parameter of the resulting distribution is multiplied by the same constant, while all other parameters are unchanged.
- Scaled distributions are particularly convenient for handling loss amounts when faced with inflation and currency unit conversions, and scaling distributions are preferred as candidate distributions for loss amount random variables.
- the scaling parameter is denoted as ⁇ .
- the expectation μ of such a scaled distribution can be written in the form θf, where f is a function of the parameters other than θ.
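The scaling property can be checked numerically for the gamma distribution, which the next example uses: if X ~ Gamma(α, θ), then cX ~ Gamma(α, cθ). The check below verifies the density identity f_{cX}(x) = f_X(x/c)/c against the Gamma(α, cθ) density; the parameter values are arbitrary:

```python
import math

# Gamma is a scaled distribution with scaling parameter theta:
# if X ~ Gamma(alpha, theta), then c*X ~ Gamma(alpha, c*theta).
# We verify the density identity f_{cX}(x) = f_X(x/c)/c numerically.

def gamma_pdf(x, alpha, theta):
    return x ** (alpha - 1) * math.exp(-x / theta) / (math.gamma(alpha) * theta ** alpha)

alpha, theta, c, x = 2.5, 1.3, 4.0, 3.7   # arbitrary illustrative values
lhs = gamma_pdf(x / c, alpha, theta) / c  # density of c*X at x
rhs = gamma_pdf(x, alpha, c * theta)      # Gamma(alpha, c*theta) density at x
print(lhs, rhs)                           # the two agree
```

The shape α is untouched by the rescaling, which is exactly the scaling-parameter condition; this is why scaled distributions handle inflation and currency conversion conveniently.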
- Example (1) is used here to illustrate the prediction of loss (compensation) intensity.
- the gamma distribution is a fat-tailed scaled distribution with scaling parameter θ; its probability density function is as follows:
- the loss function of the training set is
- the minimum of the objective function, the predicted values of the predictor variable, the corresponding loss function values, and the conditional probability distribution of the loss (payment) intensity can then be obtained.
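For the gamma loss, the per-sample gradient g_i and hessian h_i with respect to the raw score are what an XGBoost-style custom objective would return. The sketch below assumes a log link μ = exp(η) and a mean/shape parameterization with the shape φ held fixed as a hyperparameter (both are assumptions for illustration), and checks the analytic derivatives against finite differences:

```python
import math

# Gradient and hessian of the gamma negative log-likelihood w.r.t. the raw
# score eta, under a log link mu = exp(eta) and fixed shape phi.  Only the
# eta-dependent terms of the NLL are kept.  These g_i / h_i are the values
# an XGBoost-style custom objective would return per sample.

def gamma_nll(y, eta, phi):
    mu = math.exp(eta)
    return phi * (y / mu + eta)

def grad(y, eta, phi):
    # d/d_eta [phi * (y*exp(-eta) + eta)] = phi * (1 - y*exp(-eta))
    return phi * (1.0 - y * math.exp(-eta))

def hess(y, eta, phi):
    # second derivative: phi * y * exp(-eta), always positive for y > 0
    return phi * y * math.exp(-eta)

# finite-difference check of the analytic derivatives
y, eta, phi, eps = 2.0, 0.3, 1.7, 1e-6
fd_g = (gamma_nll(y, eta + eps, phi) - gamma_nll(y, eta - eps, phi)) / (2 * eps)
fd_h = (grad(y, eta + eps, phi) - grad(y, eta - eps, phi)) / (2 * eps)
print(abs(fd_g - grad(y, eta, phi)), abs(fd_h - hess(y, eta, phi)))
```

Note that under this parameterization the hessian is positive for y > 0, so the standard update is well behaved; the patent's modification targets losses where that is not the case.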
- this distribution belongs to the (a, b, 1) class and does not belong to the exponential family.
- the loss function of the training set is
- the minimum of the objective function, the predicted values of the predictor variable, the corresponding loss function values, and the conditional probability distribution of the number of losses (payments) can then be obtained.
- the negative of the log-likelihood function on the validation and test sets can be used as the corresponding evaluation index, where n is the number of samples in the corresponding set. Since φ is an unknown parameter, the hyperparameters γ and λ are tuned on the validation set by methods such as grid search; φ is likewise treated as a nuisance parameter and hyperparameter, and the value that minimizes the loss function on the verification set is taken as the estimate of φ.
- the evaluation indicators on the verification set are used to select the hyperparameter values and determine the optimal model structure.
- the training set and the verification set are combined as a new training set, and the model structure is set to retrain the model to obtain the updated model and model parameters.
- Use the updated model to predict the samples of the test set, and obtain the evaluation index value of the model on the test set.
- other candidate parametric distributions are then selected and the previous steps repeated to re-model (with the test set unchanged), yielding new evaluation index values; this is repeated until all plausible parametric distributions have been modeled. The evaluation values are compared, and the model (or models) with the best value is selected as the prediction model; keeping the model structure settings, the model is retrained on all sample data (including the test set) to obtain the final prediction model.
- if the k-fold cross-validation method is used, the mean of the k training estimates can be taken as the estimate of φ.
- this example uses the pure premium calculation model to obtain the pure premium, the probability distribution of the total loss amount, and the probability distribution of the total compensation amount and other non-life insurance pricing elements.
- the improved XGBoost-class method can be further extended from univariate prediction to multi-parameter prediction of a parametric probability distribution, forming a multi-cycle improved XGBoost-class data analysis method in which the boosting tree method predicts all parameters of common parametric probability distributions of the random variable.
- the improved XGBoost method model is used to model the predictive random variable Y i in multiple rounds, which can improve the predictive performance.
- the random variable Y i here refers to the random variable of the loss (payment) intensity or the number of losses (payment) during the insurance period.
- this example can be further extended for the scheme of Example 2.
- the loss function is the corresponding l(y_i, β_i, φ_{1,i}, φ_2, ..., φ_l); it is required that, for any values of y_i, β_i, φ_2, ..., φ_l, l has second-order partial derivatives with respect to φ_{1,i} (or the corresponding first-order partial derivative), and that there is one and only one local minimum point, at which alone the derivative is 0, or else the function is strictly monotonic.
- ⁇ 1,i as the predictor variable
- the improved XGBoost-class method is used to predict φ_{1,i} and obtain its predicted value.
- the regularization term of the XGBoost-class method keeps the scores of the leaf nodes from differing too much.
- each ⁇ i is fixed, and ⁇ is regarded as a predictor variable, and the loss function is
- step 4 is repeated until the evaluation metrics on the validation set converge; the model of each step is kept, and the test set is used to select the optimal probability distribution and parameter structure.
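The cyclic scheme, fix one parameter and minimize over the other, then swap, can be sketched on a pooled two-parameter gamma likelihood. Here a crude 1-D line scan stands in for each round of model fitting (in the patent each round would instead train an XGBoost-style model for the active parameter); the data are made up:

```python
import math

# Sketch of the multi-round cyclic scheme: coordinate descent on the pooled
# gamma NLL over (mu, phi), alternating which parameter is free.  The line
# scan is a crude stand-in for one round of boosted-model training.

def nll(mu, phi, ys):
    return sum(y * phi / mu + phi * math.log(mu)
               - phi * math.log(phi) + math.lgamma(phi)
               - (phi - 1) * math.log(y) for y in ys)

def line_scan(f, lo, hi, steps=400):
    # minimize f over an evenly spaced grid in (lo, hi)
    xs = [lo + (hi - lo) * i / steps for i in range(1, steps)]
    return min(xs, key=f)

ys = [0.8, 1.1, 1.4, 2.3, 0.9, 1.7]     # made-up observations
mu, phi = 1.0, 1.0
for _ in range(5):                       # a few cycles
    mu = line_scan(lambda m: nll(m, phi, ys), 0.1, 5.0)   # phi fixed
    phi = line_scan(lambda p: nll(mu, p, ys), 0.1, 10.0)  # mu fixed

# with phi fixed, the gamma NLL is minimized at mu = sample mean
print(mu, phi, sum(ys) / len(ys))
```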
- if a traditional evaluation index such as mean squared error is used, the verification process is the same as in step (2). If the negative log-likelihood function on the validation set is used as the evaluation index, then for the model whose predictor variable is φ_{j,i}, the remaining parameters of the negative log-likelihood are held fixed, n is the number of validation-set samples, and β_i and the remaining φ's are the prediction-function values obtained by training the improved XGBoost-class models.
- This example extends the improved XGBoost method to predict multiple parameters to be estimated, and uses one algorithm model to predict multiple parameters to be estimated in the parameter probability distribution at the same time, which can increase the prediction performance of the model and improve operational efficiency and interpretability.
- y i is an observed value, which is regarded as a fixed parameter, not as a variable or a parameter to be estimated.
- the scope of discussion of the parameters to be estimated can be chosen reasonably freely. In practical applications, reasonable predictions will not fall exactly on the theoretical extreme boundary points.
- the range interval discussed can be regarded as a closed interval, and the boundary of the interval can also be kept a reasonable distance from the theoretical boundary point.
- given a sample set D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R) with m features and n samples, K_j tree functions are summed to obtain the prediction for parameter β_j.
- this multivariate regularized boosting tree method is not limited to cases in which some h_i fails to be always non-negative; it is equally applicable when all h_i are always non-negative.
- the approximate expression (2) is formally simplified as:
- Each round of training can train up to l trees at the same time, and each tree has its own hyperparameters.
- a smaller number of training rounds K can be set separately.
- the preferred solution is to set the interval of iteration rounds to reduce the total number of training rounds.
- the initial iterative value of the parameter ⁇ j to be estimated can be obtained by the maximum likelihood estimation of the training set (without considering xi ).
- a loss function l is required, within the scope of discussion, to be second-order differentiable with one and only one local minimum point; if approximate expression (1) is used, this can be relaxed to first-order differentiability with one and only one local minimum point. After any parameter to be estimated is selected, with the remaining parameters fixed, there is one and only one local minimum point, at which alone the partial derivative with respect to the parameter to be estimated is 0, or else the function is strictly monotonic.
- the loss function of the training set is
- the parameters ⁇ i and ⁇ i to be estimated can be set to any reasonable discussion range.
- one choice is to set β_i ∈ [ε_1, M_1] and φ_i ∈ [ε_2, M_2], where ε_1, ε_2 are sufficiently small positive numbers and M_1, M_2 are sufficiently large positive numbers.
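Keeping each estimated parameter inside its chosen closed discussion interval, away from the theoretical boundary, can be done by clipping. The interval endpoints below are arbitrary illustrative choices:

```python
# Keep each parameter to be estimated inside its closed discussion interval
# [epsilon, M].  The endpoint values are arbitrary illustrative choices.

def clip(value, lo, hi):
    return min(max(value, lo), hi)

EPS_1, M_1 = 1e-6, 1e6      # assumed range for beta_i
EPS_2, M_2 = 1e-6, 1e3      # assumed range for phi_i

print(clip(-3.0, EPS_1, M_1))   # out-of-range value clipped up to 1e-06
print(clip(42.0, EPS_2, M_2))   # value already inside is unchanged
```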
- the specific conditional probability distribution of the predictor variable Y i can be obtained by using the multivariate regularized boosting tree method.
- the loss function may not be a convex function of ⁇ i .
- the loss function l is a concave function of ⁇ i , and its function image is shown in FIG. 5 .
- the above modeling process can adopt different feature-engineering schemes: combine the training and validation sets and retrain the model with the learned hyperparameters; change the candidate probability distribution type of the predictor variable and repeat the modeling and training; use the learned models to predict on the test set, and select the probability distribution (or several) with the smallest evaluation index, together with the corresponding prediction model, as the optimal model; finally, combine all sample sets, retrain with the learned hyperparameters, and put the final model into production.
- a preferred evaluation index is the negative log-likelihood function.
- The improvement of the XGBoost method in this patent refers to the improvement of all methods similar to XGBoost, such as the well-known LightGBM and CatBoost methods.
- For the multi-round cyclic XGBoost-type method and the multivariate regularized boosting-tree method, practical application only requires solving an optimization problem that minimizes an objective function whose loss satisfies the stated conditions, or computing the maximum likelihood estimates of the parameters of a probability distribution whose loss satisfies those conditions (conditional maximum likelihood estimates for sample points with different features); they are applicable not only to non-life insurance pricing but to a wide range of fields.
- the embodiment of the present invention also provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the steps of any one or more solutions in the above-mentioned examples 1-4 are implemented.
- An embodiment of the present invention also provides a processor, the processor is configured to run a program, wherein the program executes any one or more of the steps in the above-mentioned examples 1-4 when running.
- The embodiment of the present invention also provides a terminal device.
- The device includes a processor, a memory, and a program stored in the memory and executable on the processor.
- The program code is loaded and executed by the processor to implement the steps of any one or more of the solutions in Examples 1-4 above.
- the present invention also provides a computer program product, which, when executed on a data processing device, is suitable for executing the steps of any one or more of the above-mentioned examples 1-4.
- the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
- These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
- a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
- Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read only memory (ROM) or flash RAM.
- Computer-readable media, including permanent and non-permanent, removable and non-removable media, can implement information storage by any method or technology.
- Information may be computer readable instructions, data structures, modules of a program, or other data.
- Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- General Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Economics (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- Technology Law (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention discloses a data analysis method, a pricing method, and related devices based on an improved XGBoost-type method. The scheme adopts an improved XGBoost-type algorithm that corrects the second-order Taylor expansion of the objective function by modifying its h_i term, so that the applicability of the improved XGBoost-type method is no longer limited to convex loss functions. On this basis, the scheme further proposes a multivariate regularized boosting-tree method that generalizes the probability distribution of the predictor variable from a single parameter to multiple parameters, and can be widely applied in many fields, in particular non-life (general) insurance pricing.
Description
The present invention relates to machine learning and actuarial technology, and specifically to corresponding big-data analysis methods.

I. The pure premium model.

In non-life insurance pricing, the insurer estimates the pure premium of the insured, i.e. the expected net claim amount. Because non-life policy periods are short, interest is ignored in this document. To estimate the pure premium it is best to estimate the probability distribution of the loss (claim) amount (per occurrence, or aggregated over the policy period), not merely its expected value: in indemnity insurance there is usually a deductible (or limit) on the loss per occurrence or on the total loss over the policy period, and only with the full probability distribution of the loss (claim) amount can the pure premium be adjusted correspondingly when the deductible (or limit) changes.

There are two classes of methods for estimating the probability distribution of the total loss (claim) amount:

1. Estimate the probability distribution of the total loss (claim) amount over the policy period directly.

2. Estimate separately the probability distribution of the number of claims during the policy period and the probability distribution of the severity of each claim, then combine the two with a compound distribution model to obtain the distribution of the total loss (total claims). Two assumptions are common:

a. The standard assumption: the two distributions are mutually independent, and the severities of individual claims are i.i.d.

b. The two distributions are dependent, or the severities are not i.i.d.

The standard assumption is assumption a; under it, the distribution of the total loss (total claims) can be obtained by characteristic-function transform methods (Fourier transform) or by stochastic simulation. Assumption b is rarely used in industry because the large number of parameters to estimate risks overfitting. In general, the second class of methods is the more refined one and has many advantages over the first.
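Under the standard assumption a, the stochastic-simulation route can be sketched in a few lines. The helper names below are illustrative, and a Poisson claim frequency with i.i.d. exponential severities is assumed purely for the example:

```python
import random
from math import exp

def poisson_draw(rng, lam):
    # Knuth's inversion method: multiply uniforms until the product drops below e^-lam
    L, k, p = exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_total_loss(freq_lambda, severity_sampler, n_sims=20000, seed=0):
    # Standard assumption: claim count ~ Poisson, severities i.i.d. and
    # independent of the count; returns simulated aggregate losses.
    rng = random.Random(seed)
    return [sum(severity_sampler(rng) for _ in range(poisson_draw(rng, freq_lambda)))
            for _ in range(n_sims)]

# pure premium approx. = E[N] * E[X] = 2.0 * 1.0 in this illustrative setup
totals = simulate_total_loss(2.0, lambda rng: rng.expovariate(1.0))
```

The empirical mean of `totals` approximates the pure premium, and empirical quantiles of `totals` give the aggregate-loss distribution needed to price deductibles and limits.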
XGBoost is an extreme gradient boosting tree method with excellent predictive performance that has achieved very good results in many fields.

The main procedure of the method is as follows:
A sample set D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R) has m features and n samples. An ensemble tree model sums K tree functions to obtain the prediction.
Here F = {f(x) = ω_{q(x)}} (q: R^m → T, ω ∈ R^T) is the space of regression trees; q denotes the structure of each tree, mapping a sample to its leaf node, and T is the number of leaves of a tree. Each f_k corresponds to an independent tree structure q and leaf weights ω. Every leaf of every regression tree carries a continuous score, with ω_i denoting the score of the i-th leaf. To learn the tree functions of the model, the following regularized objective is minimized:

where l is a differentiable convex function, the loss function, and Ω(f_k) is the regularization term.
In general, to optimize the objective quickly, it is approximated by its second-order Taylor expansion:

The optimal objective value is:

The tree structure q is found by a greedy algorithm, iteratively adding branches starting from a single leaf node.
Let I_L and I_R denote the sample sets of the left and right child nodes after a split, with I = I_L ∪ I_R.

The reduction of the objective function after the split is given by the following formula:

This formula is used to evaluate candidate split points.
Analogous to a learning rate, the shrinkage technique scales each newly boosted tree by a factor η, also to prevent overfitting; column subsampling is a further safeguard against overfitting.

In addition, some open-source implementations provide an extra l_1 regularization term:

The tree structure q is found by a greedy algorithm, iteratively adding branches starting from a single leaf node. The optimal objective values of the left and right node sample sets are computed and the split gain recorded as the criterion for the best split.
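The standard (unmodified) XGBoost leaf-weight, leaf-objective, and split-gain formulas described above can be sketched as follows; the function names are illustrative, not from any library:

```python
def leaf_weight(grads, hesss, lam=1.0):
    # Closed-form optimal leaf score: w* = -G / (H + lambda)
    G, H = sum(grads), sum(hesss)
    return -G / (H + lam)

def leaf_objective(grads, hesss, lam=1.0, gamma=0.0):
    # Contribution of one leaf to the optimal objective: -G^2 / (2(H + lambda)) + gamma
    G, H = sum(grads), sum(hesss)
    return -G * G / (2.0 * (H + lam)) + gamma

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    # gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    GL, HL = sum(g_left), sum(h_left)
    GR, HR = sum(g_right), sum(h_right)
    def term(G, H):
        return G * G / (H + lam)
    return 0.5 * (term(GL, HL) + term(GR, HR) - term(GL + GR, HL + HR)) - gamma
```

The greedy tree builder scores every candidate split with `split_gain` and keeps the split with the largest gain; note that all three formulas assume H + λ > 0, which is exactly what fails for non-convex losses.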
The shortcomings of XGBoost-type methods are as follows:

Take the standard regularization term as an example, and suppose γ and λ are small enough to be neglected, so that the objective approximately equals the loss function; studying the loss function instead of the objective does not affect the conclusions.

With a single sample point, T = 1. Possibly because the learning rate η is not controlled, the loss after the (t-1)-th iteration may be concave in some neighborhood of the current prediction, with its first derivative g_1 positive and its second derivative h_1 negative there. When λ < |h_1|, the optimal weight score of this sample in the t-th iteration then moves further away from the global minimum of the loss.

Moreover, existing XGBoost-type methods are limited to fitting single-parameter probability distributions. For multi-parameter probability distributions, existing XGBoost-type methods cannot optimize several parameters simultaneously and often cannot reach optimal predictive performance. For example, if the loss frequency in general insurance pricing follows a two-parameter negative binomial distribution, fitting it with a one-parameter Poisson distribution is inappropriate.
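A small numerical illustration of this point (the data and the NB(r, p) parameterisation below are hypothetical, chosen only so that both fits share the same mean): on an overdispersed count sample, even a coarsely tuned two-parameter negative binomial attains a lower negative log-likelihood than the one-parameter Poisson.

```python
from math import lgamma, log

def poisson_nll(ys, mu):
    # -sum_i log f(y_i; mu) for a Poisson distribution
    return sum(mu - y * log(mu) + lgamma(y + 1) for y in ys)

def negbin_nll(ys, r, p):
    # NB(r, p): P(Y=y) = C(y+r-1, y) p^r (1-p)^y, with mean r(1-p)/p
    return -sum(lgamma(y + r) - lgamma(r) - lgamma(y + 1)
                + r * log(p) + y * log(1 - p) for y in ys)

ys = [0, 0, 0, 0, 0, 0, 0, 0, 5, 5]    # mean 1, variance far above the mean
mu = sum(ys) / len(ys)                  # Poisson MLE of the mean
# coarse grid over r, with p chosen so the NB mean equals mu:
best_nb = min(negbin_nll(ys, r, r / (r + mu)) for r in [0.1, 0.2, 0.5, 1.0, 2.0])
print(best_nb < poisson_nll(ys, mu))
```

The extra dispersion parameter is what the single-parameter XGBoost-type fit cannot learn per sample, which motivates the multi-parameter methods below.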
Summary of the invention

In view of the problems of existing big-data analysis and prediction techniques, a new data analysis and processing scheme is needed.

To this end, the object of the present invention is to provide a data analysis method based on an improved XGBoost-type method, thereby effectively improving the performance of big-data analysis and prediction. On this basis, the present invention further provides a pricing method based on the improved XGBoost-type method, effectively overcoming the defects of existing schemes.

To achieve the above object, the data analysis method based on the improved XGBoost-type method provided by the present invention uses the improved method to perform prediction and evaluation on the obtained variable parameters. The improved XGBoost-type method corrects the second-order Taylor expansion of the approximate objective in the XGBoost-type algorithm: when h_i is not always non-negative, the h_i-related terms are modified so that the applicability of the improved method is no longer limited to convex loss functions.

Further, the improved XGBoost-type method generalizes XGBoost-type prediction from a single variable to the multiple parameters of a parametric probability distribution, forming a multi-round cyclic improved XGBoost-type data analysis method.

Alternatively, the h_i-related terms of expressions (1) and (2) may be combined as a weighted average.
To achieve the above object, the pricing method provided by the present invention performs non-life actuarial pricing based on the above data analysis method.

Further, the pricing method includes:

(1) First select the random variable to predict, and collect sample data, including sample attributes and observed values of the predictor variable;

(2) Preprocess the sample data;

(3) Perform feature engineering to obtain an updated sample set D = {(x_i, y_i)}, where x_i is the feature vector of the i-th sample;

(4) Split the sample set into a training set, a validation set, and a test set; the training set trains the learning model that predicts the predictor variable, the validation set tunes the hyperparameters, and the test set evaluates the performance of the learning model;

(5) Select the parametric distribution type of the predicted random variable, and use the improved XGBoost-type method to obtain the conditional probability distribution of the predictor variable;

(6) Reselect the distribution to fit from the candidate distributions and repeat step (5), using the evaluation index on the test set to determine the optimal parametric distribution. When one is confident about the distribution type of the predictor variable, the optimal parametric distribution may also be specified directly; the candidate set then contains only that one parametric distribution.
Further, the pricing method obtains the conditional probability distribution of the predictor variable with the improved XGBoost-type method as follows:

(1) Select a distribution from the candidate parametric probability distributions and determine its parameters; the same distribution may have different parameterizations;

(2) Take the expected-value expression of the predictor variable as the expectation parameter and rewrite the distribution's expression accordingly, using the expectation parameter as the prediction parameter and treating the remaining parameters as nuisance parameters/hyperparameters; if the expression already contains the expectation parameter, no rewriting is needed, and the prediction parameter and hyperparameters are set directly;

(3) Determine the objective function, using the negative log-likelihood of the distribution as the loss function, and verify that it satisfies the requirements of the improved XGBoost-type method on the loss function;

(4) Determine the hyperparameter values by grid search, prior experience, or other methods;

(5) With the hyperparameters fixed, obtain the predicted value of the prediction parameter with the improved XGBoost-type algorithm;

(6) Change the hyperparameter values and repeat step (5), using the evaluation index on the validation set to determine the optimal parameter predictions and hyperparameter values, thereby obtaining the predicted value of the predictor variable and its probability distribution. If one is confident about the value of some hyperparameter, that single value may also be set directly.
To achieve the above object, the present invention provides a data analysis method that generalizes the improved XGBoost-type method directly to the multivariate case, forming a multivariate regularized boosting-tree method. This method corrects the second-order Taylor expansion of the approximate objective in the XGBoost-type algorithm by modifying its h_i-related terms, so that the improved method's applicability is not limited to convex loss functions; it can simultaneously optimize multiple variables (the parameters under examination) of a multivariate loss function.

Further, in the multivariate regularized boosting-tree method, the loss function l is assumed, within the range of discussion, to be: (1) twice differentiable, or once differentiable, with exactly one local minimum; (2) such that, after any parameter to be estimated is selected as the examined variable with the remaining parameters fixed, there is exactly one local minimum, the partial derivative with respect to that parameter being zero only at that local minimum, or the function being strictly monotonic.

Note: y_i, being an observed value, is treated as a fixed parameter rather than as a variable or a parameter to be estimated. The discussion range of the parameters to be estimated may be chosen freely within reason. In practice a reasonable prediction never falls exactly on a theoretical extreme boundary point; the range may sometimes be treated as a closed interval, possibly with its boundary kept at some reasonable distance from the theoretical boundary point.
Further, the objective function of the multivariate regularized boosting-tree method is expressed as:

…,

The l_1 regularization term may also be added to Ω:

or the h_i-related terms of expressions (1) and (2) may be combined as a weighted average, expression (3);

For approximation (1), the differentiability condition on the loss function can be relaxed to first order.
To achieve the above object, the present invention provides a pricing method that performs actuarial pricing based on the above data analysis method.

Further, the pricing method includes:

(1) First select the random variable to predict, and collect sample data, including sample attributes and observed values of the predictor variable;

(2) Preprocess the sample data;

(3) Perform feature engineering to obtain an updated sample set D = {(x_i, y_i)}, where x_i is the feature vector of the i-th sample;

(4) Split the sample set into a training set, a validation set, and a test set; the training set trains the learning model that predicts the parameters to be estimated of the parametric distribution, the validation set tunes the hyperparameters, and the test set evaluates the performance of the learning model;

(5) Select the parametric distribution type of the predicted random variable, and use the multivariate regularized boosting-tree method to obtain the conditional probability distribution of the predictor variable;

(6) Reselect the distribution to fit from the candidate distributions and repeat step (5), using the evaluation index on the test set to determine the optimal parametric distribution. When one is confident about the distribution type, the optimal parametric distribution may also be specified directly; the candidate set then contains only that one parametric distribution.
Further, the pricing method obtains the conditional probability distribution of the predictor variable with the multivariate regularized boosting-tree method as follows:

(1) Select a distribution from the candidate parametric probability distributions and determine its parametric form; the same distribution may have different parameterizations.

(2) Determine the objective function, using the negative log-likelihood of the distribution as the loss function, and verify that it satisfies the requirements of the multivariate regularized boosting-tree method on the loss function.

(3) Taking the parameters of interest as the independent variables, obtain the predicted values of all parameters of the distribution with the multivariate regularized boosting-tree method, thereby obtaining the concrete probability distribution expression of the predictor variable. Parameters whose values one is confident about may be fixed by experience or other methods; such fixed parameters do not take part in the boosting-tree iteration.
The present invention performs data analysis with the improved XGBoost-type methods, effectively overcoming the various defects of existing technical schemes.

The data analysis method based on the multi-round cyclic improved XGBoost-type method uses the improved method for cyclic multi-parameter modeling, further improving the model's predictive performance.

The multivariate regularized boosting-tree method provided by the present invention, applied to data analysis, further improves the predictive performance of big-data prediction methods, including non-life insurance pricing, while improving computational efficiency and model interpretability.

On the basis of the above schemes, the present invention further provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the steps of the above data analysis method or pricing method.

On this basis, the present invention further provides a processor configured to run a program that, when running, implements the steps of the above data analysis method or pricing method.

On this basis, the present invention further provides a terminal device including a processor, a memory, and a program stored in the memory and executable on the processor; the program code is loaded and executed by the processor to implement the steps of the above data analysis method or pricing method.

On this basis, the present invention further provides a computer program product which, when executed on a data processing device, is adapted to execute the steps of the data analysis method or pricing method.
The present invention is further described below with reference to the drawings and specific embodiments.

FIG. 1 is an example image of a non-convex loss function in the existing XGBoost algorithm;

FIG. 2 is an example image of a non-convex loss function when predicting loss severity in Example 2;

FIG. 3 is an example image of a non-convex loss function when predicting loss counts in Example 2;

FIG. 4 is an example image of the loss function l in Example 3 after fixing the corresponding parameters;

FIG. 5 is an example image of the loss function l in Example 4 after fixing the corresponding parameters.

To make the technical means, creative features, objectives, and effects of the present invention easy to understand, the invention is further explained below with reference to the figures.

In view of the defects of the prior art, this scheme improves the XGBoost-type methods, combining accurate predictive performance with traditional statistical techniques to further improve prediction.

Taking non-life insurance pricing as an example: when applied to non-life pricing, the improved XGBoost-type method and the derived multivariate regularized boosting-tree method effectively overcome the defects of the prior art described in the background while retaining its advantages, achieving excellent predictive performance for loss (claim) counts, loss (claim) severities, and total loss (total claim) amounts, and thus the desired pure premium estimates.
Example 1

In this example the XGBoost-type method is improved to construct a corresponding improved XGBoost-type method, removing the existing requirement that the loss function be convex.

In the improved XGBoost-type algorithm given in this example, the second-order Taylor expansion of the approximate objective is corrected by modifying its h_i-related terms, so that the applicability of the improved method is not limited to convex loss functions.
This is further illustrated below.

Substituting variables into expression (1) gives:

The optimal objective value is:

The tree structure q is found by a greedy algorithm, iteratively adding branches starting from a single leaf node.

Let I_L and I_R denote the sample sets of the left and right child nodes after a split, with I = I_L ∪ I_R.

The reduction of the objective function after the split is given by the following formula:

This formula is used to evaluate candidate split points.
Substituting variables into expression (2) gives:

The optimal objective value is:

The tree structure q is found by a greedy algorithm, iteratively adding branches starting from a single leaf node.

Let I_L and I_R denote the sample sets of the left and right child nodes after a split, with I = I_L ∪ I_R.

The reduction of the objective function after the split is given by the following formula:

This formula is used to evaluate candidate split points.
For expression (3), the corresponding derivation is as follows. Substituting variables gives:

The optimal objective value is:

The tree structure q is found by a greedy algorithm, iteratively adding branches starting from a single leaf node.

Let I_L and I_R denote the sample sets of the left and right child nodes after a split, with I = I_L ∪ I_R.

The reduction of the objective function after the split is given by the following formula:

This formula is used to evaluate candidate split points.
In addition, the improved XGBoost-type method applies equally with the l_1 regularization term.

Noting that expressions (1) and (2) are special cases of expression (3), expression (3) is used for the illustration:

where β ≥ 0.

The tree structure q is found by a greedy algorithm, iteratively adding branches starting from a single leaf node. The optimal objective values of the left and right node sample sets are computed and the split gain recorded as the criterion for the best split.

On this basis, the other components of the improved XGBoost-type method can adopt the corresponding components of existing XGBoost-type algorithms, which are not repeated here.

Here M may be set from prior experience or treated as a hyperparameter.

Preferably, the maximum likelihood estimate of the predicted random variable is used as the initial iterate of the predictor variable, to speed convergence and improve the interpretability of the model.
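The patent's corrected expressions (1)-(3) appear only as images in this text and are not reproduced here. As one hedged, hypothetical reading of the idea (keep the second-order term from flipping the update direction when some h_i is negative, with M acting as the cap mentioned above), a leaf-weight sketch might look like:

```python
def corrected_leaf_weight(grads, hesss, lam=1.0, M=1e6):
    # Hypothetical correction, NOT the patented formula: each h_i is replaced
    # by |h_i| capped at M, so the denominator stays positive even when the
    # loss is locally concave (h_i < 0) and the update cannot run away from
    # the minimum.
    G = sum(grads)
    H = sum(min(abs(h), M) for h in hesss)
    return -G / (H + lam)
```

With a single sample where g_1 > 0 and h_1 < 0, the standard weight -g_1/(h_1 + λ) points uphill once λ < |h_1|, while this corrected weight still points downhill, which is the behavior the modification aims for.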
Example 2

In this example the improved XGBoost-type method of Example 1 is used to form a non-life insurance pricing method. Under the independence assumption, the negative log-likelihood is used as the loss function, and the mean parameter is the parameter estimated by the XGBoost-type method.

This example uses the improved XGBoost-type method to improve the estimation of the probability distribution of loss (claim) severity or loss (claim) counts in non-life pricing.

Accordingly, the procedure mainly includes the following steps:

(1) First select the random variable to predict, such as the loss-count or loss-severity random variable. Collect sample data, including sample attributes and observed values of the predictor variable. Taking the per-occurrence loss amount in motor insurance as an example, sample attributes may include vehicle model, mileage, vehicle price, owner age, last year's claims, traffic violation records, and so on; the observed value of the predictor variable is the per-occurrence loss amount of claims during the policy period.

(2) Preprocess the sample data, including handling outliers.

(3) Perform feature engineering to obtain an updated sample set D = {(x_i, y_i)}, where x_i is the feature vector of the i-th sample.

(4) Split the sample set into a training set, a validation set, and a test set. The training set trains the learning model that predicts the target variable, the validation set tunes hyperparameters, and the test set evaluates model performance. The hold-out method, k-fold cross-validation, and similar schemes may be used.

(5) Select the parametric distribution type of the predicted random variable from the candidates, and use the improved XGBoost-type method of Example 1 to obtain the conditional probability distribution of the predictor variable.

(6) Reselect the distribution to fit from the candidates and repeat step (5), using the evaluation index on the test set to determine the optimal parametric distribution. If there is only one candidate distribution, no reselection is needed.

The procedure for obtaining the conditional probability distribution of the predictor variable with the improved XGBoost-type method is:

(5.1) Select a distribution from the candidate parametric probability distributions and determine its parameters.

In this step, the expectation expression of the distribution is substituted into the parametric distribution, so that the expectation expression becomes a parameter of the distribution, the expectation parameter, which is then used as the variable estimated by the improved XGBoost-type method; if the expression already contains the expectation parameter, no rewriting is needed, and the prediction parameter and hyperparameters are set directly.

It should be noted that, as in generalized linear models, different links may be applied to the expectation parameter, e.g. a log link. Adding a link amounts to a different parameterization; every parameterization has a corresponding loss function, and the method applies as long as its conditions are met.

(5.2) Treat the remaining parameters as nuisance parameters/hyperparameters, and determine their values by grid search, prior experience, or other methods;

(5.3) With the hyperparameters fixed, obtain the predicted value of the expectation parameter with the improved XGBoost-type algorithm.

(5.4) Change the hyperparameter values and repeat step (5.3), using the evaluation index on the validation set to determine the optimal parameter predictions and hyperparameter values, thereby obtaining the predicted value of the predictor variable and its concrete probability distribution expression. Hyperparameters with definite values may be fixed by other means, e.g. experience, and need not be varied.
The principle is similar to that of generalized linear models: a GLM links the expectation of the predictor variable to a linear combination model, whereas this method links the expectation of the estimated variable to an improved XGBoost-type boosting-tree model. The improved XGBoost-type method thus combines the advantages of GLMs and XGBoost-type methods while overcoming their respective shortcomings.

On this basis, this example adds an evaluation-index scheme for the improved XGBoost-type method: the training-set loss function is used as the evaluation index on the validation and test sets, so that loss function and evaluation index are perfectly unified. When the objective can be optimally solved, using the log-likelihood of the predictor variable's distribution (or its negative) as the evaluation index conforms to statistical convention.
Taking the hold-out method as an example, the conditional probability distribution of the predictor variable is obtained as follows.

Based on experience, select the distribution type of the predicted random variable Y from the candidate parametric distributions.

This example assumes that the analyzed random variables Y_i (i = 1, …, n, with n the number of samples in the set) follow the same type of parametric distribution and have the following properties:

The Y_i are mutually independent (conditionally on their respective features and parameters).

Write the probability mass or density of Y_i in the form f(y_i; μ_i, θ) (if Y_i is discrete, f(y_i; μ_i, θ) denotes its probability mass; if Y_i is continuous, it denotes its density), where μ_i and θ are the parameters of the distribution, θ being the parameters other than μ_i, if any.

With θ known, the improved XGBoost-type method is trained on the training set; this process yields estimates of μ_i.
On this technical basis, examples follow:

(a) Prediction of loss (claim) severity:

Definitions:

Scale distribution: if a random variable follows a parametric distribution, and multiplying it by any positive constant yields a new random variable that still follows that parametric distribution, the distribution is called a scale distribution.

Scale parameter: for a random variable following a scale distribution with non-negative support, a parameter of the distribution is called a scale parameter if it satisfies the following two conditions: when the variable is multiplied by a positive constant, the scale parameter of the new distribution is multiplied by the same constant, and the remaining parameters of the new distribution are unchanged.

Example (1) illustrates the prediction of loss (claim) severity.

Example (1):

The gamma distribution is a heavy-tailed scale distribution with scale parameter β; its probability density function is:

Writing this density in the form f(y; μ, θ):

Assume the loss (claim) severity random variables Y_i follow gamma distributions and are mutually independent (conditionally on their respective features and parameters). Their probability density function is

The loss function on the training set is

If α and the hyperparameter values are fixed, the improved XGBoost-type method yields the predicted minimum of the initial objective, the predicted values of the predictor variable, the corresponding loss values, and the conditional probability distribution of the loss (claim) severity.
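For a mean-parameterised gamma loss of the kind just described (identity link, mean μ, shape α as a nuisance hyperparameter, constant terms dropped; the derivation is assumed here, not copied from the patent), the per-sample gradient and Hessian in μ can be written down directly. Note that the Hessian turns negative once μ > 2y, which is exactly the kind of non-convexity illustrated in FIG. 2:

```python
from math import log

def gamma_nll(y, mu, alpha=1.0):
    # NLL of a gamma with mean mu and shape alpha, dropping terms free of mu:
    # l = alpha * (y / mu + log(mu))
    return alpha * (y / mu + log(mu))

def gamma_grad_hess(y, mu, alpha=1.0):
    # dl/dmu   = alpha * (mu - y) / mu^2
    # d2l/dmu2 = alpha * (2y - mu) / mu^3   (negative whenever mu > 2y)
    g = alpha * (mu - y) / mu ** 2
    h = alpha * (2 * y - mu) / mu ** 3
    return g, h
```

With y = 1 and μ = 3, the Hessian is α(2 - 3)/27 < 0, so the unmodified second-order XGBoost update would divide by a quantity that λ may not keep positive.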
For the prediction of loss (claim) counts, example (2) is used.

Example (2):

Let Y follow a mixture of a distribution degenerate at 0 and a Poisson distribution, with probability distribution:

This distribution belongs to the (a, b, 1) class and not to the exponential family; μ = E(Y) = αλ.

Assume the loss (claim) counts Y_i during the policy period follow this distribution and are mutually independent. Their probability function is:

If α and the hyperparameter values are fixed, the improved XGBoost-type method yields the predicted minimum of the initial objective, the predicted values of the predictor variable, the corresponding loss values, and the conditional probability distribution of the loss (claim) counts.
Once an estimate of θ is obtained, the conditional probability distribution of the predicted random variable follows.

For the choice of evaluation index, it is best to keep the evaluation index consistent with the loss function.

Preferably, the negative of the log-likelihood on the validation and test sets is used as the corresponding evaluation index, with n the number of samples in the corresponding set. Since θ is an unknown parameter, and the hyperparameters γ and λ must be tuned on the validation set by grid search or similar methods, θ is treated as a nuisance parameter/hyperparameter: a grid search (or similar method) finds the value minimizing the validation-set loss function, and that value is taken as the estimate of θ.
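The grid search over the nuisance parameter θ described above can be sketched with two small helpers (names illustrative); keeping the hold-out metric identical to the training loss is what unifies loss function and evaluation index:

```python
from math import log

def nll_index(densities):
    # Evaluation index on a hold-out set: -sum_i log f(y_i; mu_i, theta)
    return -sum(log(p) for p in densities)

def pick_nuisance(theta_grid, densities_for):
    # densities_for(theta) returns [f(y_i; mu_i, theta)] on the hold-out set;
    # the theta minimising the hold-out NLL is taken as the estimate of theta.
    return min(theta_grid, key=lambda t: nll_index(densities_for(t)))
```

In practice `densities_for` would refit or re-evaluate the boosting model for each candidate θ; here it is a stand-in function.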
On this basis, the validation-set evaluation index is used to select the hyperparameters and the estimate of θ, and to determine the optimal model structure. After obtaining the estimate of θ, the hyperparameter values, and the model structure, the training and validation sets are merged into a new training set and the model is retrained with that structure, giving an updated model and model parameters. The updated model predicts the test-set samples, yielding the model's evaluation value on the test set. Other candidate parametric distributions are then selected and the previous steps repeated for new modeling (with the test set unchanged), giving new evaluation values; this is repeated until all plausible parametric distributions have been modeled. The evaluation values are compared, and the one or several best-scoring models are chosen as prediction models. Keeping the model structure settings, the model is retrained and updated on all sample data (including the test set) to obtain the final prediction model.
The symbols above have the same meaning as in the background description.

Different feature-engineering schemes may be adopted, repeating the above steps and judging the schemes by the validation-set evaluation index.

On the basis of the above scheme, once the conditional probability distributions of the loss (claim) counts and the loss (claim) severity are obtained, this example applies the pure premium model to obtain the pure premium, the distribution of the total loss amount, the distribution of the total claim amount, and other non-life pricing elements.
Example 3

In the improved XGBoost-type method of this example, prediction is further generalized from a single variable to the multiple parameters of a parametric random distribution, forming a multi-round cyclic improved XGBoost-type data analysis method, so that all parameters of the common parametric probability distributions of the predicted random variable can be predicted by the boosting-tree method.

In this example, multi-round cyclic modeling of the predicted random variable Y_i with the improved XGBoost-type model can improve predictive performance. Here the random variable Y_i denotes the loss (claim) severity or the loss (claim) count during the policy period.
Specifically, this example extends the scheme of Example 2. After the estimates of μ_i and of the nuisance parameters θ_1, …, θ_l (l being the number of nuisance parameters) are obtained:

(1) Treat the estimates of μ_i and θ_2, …, θ_l as fixed parameters, so the loss function is the corresponding l(y_i, μ_i, θ_{1,i}, θ_2, …, θ_l). If l(y_i, μ_i, θ_{1,i}, …, θ_l) is twice partially differentiable in θ_{1,i} (or correspondingly once partially differentiable) for all values of y_i, μ_i, θ_2, …, θ_l, and has exactly one local minimum with zero derivative only at that point, or is strictly monotonic, then take θ_{1,i} as the predictor variable and model it with the improved XGBoost-type method, obtaining the predicted value of θ_{1,i}.

(2) Treat the estimates of μ_i and θ_{1,i}, θ_3, …, θ_l as fixed parameters, with the corresponding loss function. If it is twice partially differentiable in θ_{2,i} (or correspondingly once partially differentiable) for all values of y_i, μ_i, θ_3, …, θ_l, and has exactly one local minimum with zero derivative only at that point, or is strictly monotonic, take θ_{2,i} as the predictor variable and model it with the XGBoost-type method, obtaining the predicted value of θ_{2,i}.

(3) Repeat the above steps to obtain the predicted values of θ_{3,i}, …, θ_{l,i}.

Note: the regularization term of XGBoost-type methods keeps the scores of the leaf nodes from differing too much.
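The cyclic freeze-and-refit scheme above can be illustrated with a toy two-parameter gamma fit in which each "model" is just a constant. In the actual method each update is itself a per-sample boosting-tree fit; this sketch (with made-up data) only shows the cycling:

```python
from math import log, lgamma

def gamma_full_nll(ys, mu, alpha):
    # full gamma NLL with mean mu and shape alpha
    return sum(alpha * y / mu + alpha * log(mu) - alpha * log(alpha)
               - (alpha - 1) * log(y) + lgamma(alpha) for y in ys)

ys = [0.5, 1.2, 2.0, 0.8, 1.5]
mu, alpha = 1.0, 1.0
for _ in range(5):                          # multi-round cycle
    # round A: freeze alpha, refit mu (closed form: the sample mean)
    mu = sum(ys) / len(ys)
    # round B: freeze mu, refit alpha by a coarse grid search
    alpha = min((a / 10 for a in range(1, 101)),
                key=lambda a: gamma_full_nll(ys, mu, a))
```

Each round can only decrease the joint negative log-likelihood, which is the property the multi-round cycling relies on.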
For example, continuing example (1) of Example 2: after the improved XGBoost method yields estimates of μ_i and α, fix each μ_i and treat α as the predictor variable, with loss function
(5) Repeat step 4 until the evaluation index on the validation set converges. Keep the model from each of the above steps, and use the test set to select the optimal probability distribution and parameter structure.

On the choice of the validation evaluation index: if a traditional index such as mean squared error is adopted, validation proceeds as in step (2). If the negative log-likelihood on the validation set is used as the index, then for the model whose predictor variable is θ_{j,i}, the fixed parameters of the negative log-likelihood are the predicted values of μ_i and the other parameters produced by the trained improved XGBoost-type models, with n the number of validation samples.

Optionally, part of the test set may be set aside as a second validation set (or the whole sample may be re-split into a training set, a first validation set, a second validation set, and a test set), used to verify the predictive performance of a given probability distribution of the initial predictor variable Y_i under the various parameter structures (different cycle rounds and different parameter iteration counts give different distribution parameter structures), i.e. the fit of the model obtained at each iteration; the test set is then used to assess the fit of that probability distribution. Splitting off two validation sets in this way helps avoid overfitting.
Example 4

On the basis of the improved XGBoost-type scheme, this example further presents the multivariate regularized boosting-tree scheme.

This example generalizes the improved XGBoost-type method to predict multiple parameters to be estimated, using a single algorithmic model to predict several parameters of a parametric probability distribution simultaneously, which increases predictive performance and improves computational efficiency and interpretability.

Assume that within the range of discussion the loss is twice differentiable with exactly one local minimum; if the approximate objective expression (1) below is adopted, the requirement on the loss function l can be relaxed to once differentiable with exactly one local minimum. After any parameter to be estimated is selected, with the remaining parameters fixed, there is exactly one local minimum; the partial derivative with respect to that parameter is zero only at that local minimum, or the function is strictly monotonic.

Note: y_i is an observed value, treated as a fixed parameter rather than as a variable or a parameter to be estimated. The discussion range of the parameters to be estimated may be chosen freely within reason. In practice a reasonable prediction never falls exactly on a theoretical extreme boundary point; the range may sometimes be treated as a closed interval, possibly with its boundary kept at a reasonable distance from the theoretical boundary point.
Here F = {f(x) = ω_{q(x)}} (q: R^m → T, ω ∈ R^T) is the space of regression trees; q denotes the structure of each tree, mapping a sample to its leaf node, and T is the number of leaves of a tree. Each tree function corresponds to an independent tree structure q and leaf weights ω. To learn these tree functions, the following regularized objective is minimized:

…,

As with the improved XGBoost method's approximation of the objective at the t-th iteration, a weighted average (linear combination) of the h_i-related terms of expressions (1) and (2) can also be regarded as a variant of the approximate formula:

The multivariate regularized boosting-tree method is not restricted to the case where some h_i is not always non-negative; it also applies when every h_i is always non-negative, in which case the approximate expression (2) formally simplifies to:
Each training round trains at most l trees simultaneously, each tree with its own hyperparameters: for every parameter θ_j to be estimated there is a learning rate η_j, a number of training rounds K_j, and a hyperparameter M_j.

For parameters with high certainty, a smaller number of training rounds K can be set individually; preferably, an interval between iteration rounds is set so that the total number of training rounds decreases.

The initial iterate of each parameter θ_j to be estimated can be obtained from the maximum likelihood estimate on the training set (ignoring the features x_i).
Taking non-life pricing as an example, this improves step (5) of the Example 2 scheme, i.e. obtaining the conditional probability distribution of the predictor variable. Choose a suitable parametric probability distribution and, under the independence assumption, use its negative log-likelihood as the loss function.

If the loss function satisfies the required conditions, proceed; otherwise replace the fitted distribution from the candidates or change the parameterization. Assume the loss function l, within the range of discussion, is twice differentiable with exactly one local minimum (relaxed to once differentiable if the approximate expression (1) is adopted); after any parameter to be estimated is selected, with the remaining parameters fixed, there is exactly one local minimum; the partial derivative with respect to that parameter is zero only at that local minimum, or the function is strictly monotonic.
Example (3) illustrates this.

Example (3):

Assume the loss counts Y_i during the policy period follow negative binomial distributions, taken as the predictor variables, with the Y_i mutually independent. A classic form of the probability function is:

The parameters to be estimated, β_i and γ_i, may be given any reasonable discussion range; one method is to set β_i ∈ [ε_1, M_1] and γ_i ∈ [ε_2, M_2], where ε_1 and ε_2 are sufficiently small positive numbers and M_1 and M_2 are sufficiently large positive numbers.
The partial derivative with respect to each parameter to be estimated is zero only at the local minimum described above, or the function is strictly monotonic, so the loss satisfies the requirements of the multivariate regularized boosting-tree method on the loss function.

The multivariate regularized boosting-tree method can therefore be used to obtain the specific conditional probability distribution of the predictor variables Y_i.

However, with y_i and γ_i fixed, the loss function is not necessarily a convex function of β_i.
For example: when y_i = 0 and γ_i = 1, the loss function l is a concave function of β_i, with the graph shown in FIG. 5.

Taking the hold-out method as an example, the model's hyperparameters are determined by grid search or other methods so as to minimize the validation-set evaluation index, yielding the model structure, the parameter values inside the boosting-tree model, and the optimal hyperparameter values.
The above modeling process can adopt different feature-engineering schemes. Merge the training and validation sets and retrain the model with the learned hyperparameters. Change the candidate probability distribution type of the predictor variable and repeat the modeling and training. Predict on the test set with the learned models, and select the one or several probability distributions with the smallest evaluation index, together with the corresponding prediction models, as the optimal model. Merge all sample sets, retrain with the learned hyperparameters, obtain the final model, and put it into production. The preferred evaluation index is the negative log-likelihood.
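The selection loop described in this paragraph can be sketched generically; `fit` and `nll` below are stand-ins for the boosting-tree trainer and the negative log-likelihood index, and the dictionary keys are illustrative:

```python
def select_model(candidates, train, val, test, fit, nll):
    scored = []
    for dist in candidates:
        # tune hyperparameters on the validation set
        hp = min(dist["grid"], key=lambda h: nll(fit(dist, train, h), val))
        # merge train + validation and retrain with the learned hyperparameters
        model = fit(dist, train + val, hp)
        scored.append((nll(model, test), dist, hp))
    # the distribution with the smallest test evaluation index wins
    _, dist, hp = min(scored, key=lambda t: t[0])
    # final refit on all samples before production
    return fit(dist, train + val + test, hp)
```

Because the same `nll` serves as both loss and evaluation index, the comparison across candidate distributions stays consistent with the training objective.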
Since LightGBM, CatBoost, and similar methods are very close to XGBoost, the improvement of XGBoost-type methods in this patent refers to the improvement of all methods similar to the XGBoost method, such as the well-known LightGBM and CatBoost methods.

For the improved XGBoost-type method, the multi-round cyclic XGBoost-type method, and the multivariate regularized boosting-tree method, practical application only requires solving an optimization problem that minimizes an objective function whose loss satisfies the stated conditions, or computing the maximum likelihood estimates of the parameters of a parametric probability distribution whose loss satisfies those conditions (conditional maximum likelihood estimates for sample points with different features); they are applicable not only to non-life insurance pricing but to a wide range of fields.
An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the steps of any one or more of the schemes in Examples 1-4 above.

An embodiment of the present invention also provides a processor configured to run a program that, when running, executes the steps of any one or more of the schemes in Examples 1-4 above.

An embodiment of the present invention also provides a terminal device including a processor, a memory, and a program stored in the memory and executable on the processor; the program code is loaded and executed by the processor to implement the steps of any one or more of the schemes in Examples 1-4 above.

The present invention also provides a computer program product which, when executed on a data processing device, is adapted to execute the steps of any one or more of the schemes in Examples 1-4 above.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and modules described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

Those skilled in the art should understand that embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in that memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include non-persistent storage in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. Information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
It should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.

Those skilled in the art should understand that embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the invention is not limited by the above embodiments; the above embodiments and the description merely illustrate its principles, and various changes and improvements may be made without departing from the spirit and scope of the invention, all of which fall within the claimed scope. The scope of protection of the present invention is defined by the appended claims and their equivalents.
Claims (18)
- A data analysis method based on an improved XGBoost-type method, characterized in that an improved XGBoost-type method performs prediction and evaluation based on the obtained variable parameters; the improved XGBoost-type method corrects the second-order Taylor expansion of the approximate objective in the XGBoost-type algorithm: when h_i is not always non-negative, its h_i-related terms are modified so that the applicability of the improved method is not limited to convex loss functions.
- The data analysis method based on an improved XGBoost-type method according to claim 1, characterized in that the improved XGBoost-type method generalizes XGBoost-type prediction from a single variable to the multiple parameters of a parametric distribution, forming a multi-round cyclic improved XGBoost-type data analysis method.
- A pricing method, characterized in that it performs actuarial pricing based on the data analysis method according to any one of claims 1-4.
- The pricing method according to claim 5, characterized by comprising: (1) first selecting the random variable to predict and collecting sample data, including sample attributes and observed values of the predictor variable; (2) preprocessing the sample data; (3) performing feature engineering to obtain an updated sample set D = {(x_i, y_i)}, where x_i is the feature vector of the i-th sample; (4) splitting the sample set into a training set, a validation set, and a test set, the training set training the learning model that predicts the predictor variable, the validation set tuning hyperparameters, and the test set evaluating the learning model's performance; (5) selecting the parametric distribution type of the predicted random variable and obtaining the conditional probability distribution of the predictor variable with the improved XGBoost-type method; (6) reselecting the distribution to fit from the candidate distributions, repeating step (5), and determining the optimal parametric distribution by the evaluation index on the test set.
- The pricing method according to claim 6, characterized in that obtaining the conditional probability distribution of the predictor variable with the improved XGBoost-type method comprises: (1) selecting a distribution from candidate parametric probability distributions and determining its parameters; (2) taking the expected-value expression of the predictor variable as the expectation parameter, rewriting the distribution's expression, using the expectation parameter as the prediction parameter and treating the parameters other than the prediction parameter as nuisance parameters/hyperparameters (if the expression already contains the expectation parameter, no rewriting is needed, and the prediction parameter and hyperparameters are set directly); (3) determining the objective function, with the distribution's negative log-likelihood as the loss function; (4) determining the hyperparameter values; (5) with the hyperparameters fixed, obtaining the predicted value of the prediction parameter with the improved XGBoost-type algorithm; (6) changing the hyperparameter values, repeating step (5), and determining the optimal parameter predictions and hyperparameter values by the validation-set evaluation index, thereby obtaining the predicted value of the predictor variable and its concrete probability distribution expression.
- A data analysis method, characterized by forming an improved XGBoost-type method and generalizing it directly to the multivariate case, forming a multivariate regularized boosting-tree method that corrects the second-order Taylor expansion of the approximate objective in the XGBoost-type method, modifying its h_i-related terms so that the applicability of the multivariate regularized boosting-tree method is not limited to convex loss functions, and optimally solving for multiple variables of the multivariate objective function simultaneously at the algorithmic level.
- The data analysis method according to claim 8, characterized in that, in the multivariate regularized boosting-tree method, the loss function l is assumed, within the range of discussion, to be: (1) twice differentiable with exactly one local minimum, or once differentiable with exactly one local minimum; (2) such that, after any parameter to be estimated is selected as the examined variable with the remaining parameters fixed, there is exactly one local minimum, the partial derivative with respect to the examined variable being zero only at that local minimum, or the function being strictly monotonic.
- A pricing method, characterized in that it performs actuarial pricing based on the data analysis method according to any one of claims 8-11.
- The pricing method according to claim 12, characterized by comprising: (1) first selecting the random variable to predict and collecting sample data, including sample attributes and observed values of the predictor variable; (2) preprocessing the sample data; (3) performing feature engineering to obtain an updated sample set D = {(x_i, y_i)}, where x_i is the feature vector of the i-th sample; (4) splitting the sample set into a training set, a validation set, and a test set, the training set training the learning model that predicts the parameters to be estimated of the parametric distribution, the validation set tuning hyperparameters, and the test set evaluating the learning model's performance; (5) selecting the parametric distribution type of the predicted random variable and obtaining the conditional probability distribution of the predictor variable with the multivariate regularized boosting-tree method; (6) reselecting the distribution to fit from the candidate distributions, repeating step (5), and determining the optimal parametric distribution by the evaluation index on the test set.
- The pricing method according to claim 13, characterized in that obtaining the conditional probability distribution of the predictor variable with the multivariate regularized boosting-tree method comprises: (1) selecting a distribution from candidate parametric probability distributions and determining its parametric form; (2) determining the objective function, with the distribution's negative log-likelihood as the loss function; (3) obtaining the predicted values of all parameters of the distribution with the multivariate regularized boosting-tree method, thereby obtaining the concrete probability distribution expression of the predictor variable.
- A computer-readable storage medium on which a program is stored, characterized in that, when executed by a processor, the program implements the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
- A processor configured to run a program, characterized in that, when running, the program implements the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
- A terminal device comprising a processor, a memory, and a program stored in the memory and executable on the processor, characterized in that the program code is loaded and executed by the processor to implement the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
- A computer program product, characterized in that, when executed on a data processing device, it is adapted to execute the steps of the data analysis method according to any one of claims 1-4 or claims 8-11, or of the pricing method according to any one of claims 5-7 or claims 12-14.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110781586.X | 2021-07-09 | ||
CN202110781586 | 2021-07-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023280316A1 true WO2023280316A1 (zh) | 2023-01-12 |
Family
ID=84801333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/104694 WO2023280316A1 (zh) | 2021-07-09 | 2022-07-08 | 一种基于改进型XGBoost类方法的数据分析方法、定价方法以及相关设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115601182A (zh) |
WO (1) | WO2023280316A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402252A (zh) * | 2023-03-30 | 2023-07-07 | 重庆市生态环境大数据应用中心 | 用于水污染防治的智能分析决策方法及系统 |
CN116595872A (zh) * | 2023-05-12 | 2023-08-15 | 西咸新区大熊星座智能科技有限公司 | 基于多目标学习算法的焊接参数自适应预测方法 |
CN116628970A (zh) * | 2023-05-18 | 2023-08-22 | 浙江大学 | 基于数据挖掘的航天薄壁件旋压成型工艺参数优化方法 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116451034A (zh) * | 2023-03-30 | 2023-07-18 | 重庆大学 | 基于xgboost算法的压力源与水质关系的分析方法及系统 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108536650A (zh) * | 2018-04-03 | 2018-09-14 | 北京京东尚科信息技术有限公司 | 生成梯度提升树模型的方法和装置 |
CN108777674A (zh) * | 2018-04-24 | 2018-11-09 | 东南大学 | 一种基于多特征融合的钓鱼网站检测方法 |
WO2020247949A1 (en) * | 2019-06-07 | 2020-12-10 | The Regents Of The University Of California | General form of the tree alternating optimization (tao) for learning decision trees |
CN112821420A (zh) * | 2021-01-26 | 2021-05-18 | 湖南大学 | 一种基于XGBoost的ASFR模型中动态阻尼因子、多维频率指标的预测方法及系统 |
-
2021
- 2021-08-14 CN CN202110928092.XA patent/CN115601182A/zh active Pending
-
2022
- 2022-07-08 WO PCT/CN2022/104694 patent/WO2023280316A1/zh unknown
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108536650A (zh) * | 2018-04-03 | 2018-09-14 | 北京京东尚科信息技术有限公司 | 生成梯度提升树模型的方法和装置 |
CN108777674A (zh) * | 2018-04-24 | 2018-11-09 | 东南大学 | 一种基于多特征融合的钓鱼网站检测方法 |
WO2020247949A1 (en) * | 2019-06-07 | 2020-12-10 | The Regents Of The University Of California | General form of the tree alternating optimization (tao) for learning decision trees |
CN112821420A (zh) * | 2021-01-26 | 2021-05-18 | 湖南大学 | 一种基于XGBoost的ASFR模型中动态阻尼因子、多维频率指标的预测方法及系统 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402252A (zh) * | 2023-03-30 | 2023-07-07 | 重庆市生态环境大数据应用中心 | 用于水污染防治的智能分析决策方法及系统 |
CN116595872A (zh) * | 2023-05-12 | 2023-08-15 | 西咸新区大熊星座智能科技有限公司 | 基于多目标学习算法的焊接参数自适应预测方法 |
CN116595872B (zh) * | 2023-05-12 | 2024-02-02 | 西咸新区大熊星座智能科技有限公司 | 基于多目标学习算法的焊接参数自适应预测方法 |
CN116628970A (zh) * | 2023-05-18 | 2023-08-22 | 浙江大学 | 基于数据挖掘的航天薄壁件旋压成型工艺参数优化方法 |
Also Published As
Publication number | Publication date |
---|---|
CN115601182A (zh) | 2023-01-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023280316A1 (zh) | 一种基于改进型XGBoost类方法的数据分析方法、定价方法以及相关设备 | |
Weng et al. | Gold price forecasting research based on an improved online extreme learning machine algorithm | |
Pei et al. | Wind speed prediction method based on empirical wavelet transform and new cell update long short-term memory network | |
WO2021007812A1 (zh) | 一种深度神经网络超参数优化方法、电子设备及存储介质 | |
Froelich et al. | Evolutionary learning of fuzzy grey cognitive maps for the forecasting of multivariate, interval-valued time series | |
Yu et al. | A novel elastic net-based NGBMC (1, n) model with multi-objective optimization for nonlinear time series forecasting | |
US20230075100A1 (en) | Adversarial autoencoder architecture for methods of graph to sequence models | |
Gong et al. | Forecasting stock volatility process using improved least square support vector machine approach | |
Barratt et al. | Least squares auto-tuning | |
CN114817571B (zh) | 基于动态知识图谱的成果被引用量预测方法、介质及设备 | |
Chu et al. | Comparing out-of-sample performance of machine learning methods to forecast US GDP growth | |
US20230306505A1 (en) | Extending finite rank deep kernel learning to forecasting over long time horizons | |
Bui et al. | Gaussian process for predicting CPU utilization and its application to energy efficiency | |
Yu et al. | Ceam: A novel approach using cycle embeddings with attention mechanism for stock price prediction | |
Alizadeh et al. | Simulating monthly streamflow using a hybrid feature selection approach integrated with an intelligence model | |
Zhang et al. | Latent adversarial regularized autoencoder for high-dimensional probabilistic time series prediction | |
Wang et al. | An enhanced interval-valued decomposition integration model for stock price prediction based on comprehensive feature extraction and optimized deep learning | |
Park et al. | DeepGate: Global-local decomposition for multivariate time series modeling | |
Kisiel et al. | Portfolio transformer for attention-based asset allocation | |
Xing et al. | Application of a hybrid model based on GA–ELMAN neural networks and VMD double processing in water level prediction | |
Co et al. | Comparison between ARIMA and LSTM-RNN for VN-index prediction | |
Lian et al. | A tweedie compound poisson model in reproducing kernel hilbert space | |
Yan et al. | Transferability and robustness of a data-driven model built on a large number of buildings | |
Prüser et al. | Nonlinearities in macroeconomic tail risk through the lens of big data quantile regressions | |
Chen et al. | A novel expectation–maximization-based separable algorithm for parameter identification of RBF-AR model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22837061 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |