CN108009931A - Insurance data decision tree using a variable gain algorithm and a breadth in-layer gain algorithm - Google Patents


Info

Publication number
CN108009931A
CN108009931A (application CN201711422828.6A)
Authority
CN
China
Prior art keywords
gain
variable
variables
data
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711422828.6A
Other languages
Chinese (zh)
Other versions
CN108009931B (en)
Inventor
赵昕
毛耀鋆
张鲁嘉
尹龙
涂闪
杨明锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qihuang Information Technology Co Ltd
Original Assignee
Hangzhou Seven Kyung Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Seven Kyung Mdt Infotech Ltd filed Critical Hangzhou Seven Kyung Mdt Infotech Ltd
Priority to CN201711422828.6A priority Critical patent/CN108009931B/en
Publication of CN108009931A publication Critical patent/CN108009931A/en
Application granted granted Critical
Publication of CN108009931B publication Critical patent/CN108009931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR SUCH PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 — Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 — Insurance

Abstract

The present invention relates to an insurance data decision tree, and in particular to an insurance data decision tree using a variable gain algorithm and a breadth in-layer gain algorithm. It proceeds in the following steps: variable gain algorithm → breadth in-layer gain algorithm → feature selection mode. The resulting decision tree effectively improves modeling accuracy and ease of operation.

Description

Insurance data decision tree adopting variable gain algorithm and breadth in-layer gain algorithm
Technical Field
The invention relates to an insurance data decision tree, in particular to an insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm.
Background
Insurance industry data and modeling have their own characteristics: 1) continuous and discrete data coexist in insurance industry data; 2) insurance industry models must be sufficiently compact without loss of accuracy; 3) insurance modeling requires that the features of each layer can be manually selected.
Publicly available algorithms do not meet the requirements of the insurance industry or of the applicant: the ID3 algorithm cannot process continuous data, the C4.5 algorithm splits variables into multiple branches rather than two, and the Cart tree algorithm uses a different variable at each layer. In view of this, the applicant developed the "Qijiong (七炅) decision tree algorithm", which processes both discrete and continuous variables, uses the same variable throughout each layer of the tree, splits every variable into exactly two branches, and allows the customer to manually specify the features of each layer.
Disclosure of Invention
The invention mainly overcomes the defects of the prior art and provides an insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm, which effectively balances the usability, accuracy, simplicity and flexibility of modeling in the insurance industry.
The technical problem of the invention is mainly solved by the following technical scheme:
An insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm is built according to the following steps:
(I) Variable gain algorithm: solves the problem of handling discrete and continuous variables simultaneously, with binary splitting, in insurance data modeling.
(1) Analysis:
In insurance data, discrete and continuous variables typically coexist, each with its own characteristics. For example, the applicant's age on a policy is a continuous variable (30, 31, 32 years) with a natural range (an age of 200 is impossible), while vehicle type is a discrete variable ("flatbed", "van", "refrigerated truck"). Discrete variables in insurance data usually have 2-10 levels after grouping, each level has a substantial number of samples, and the levels differ clearly on the target variable. In addition, the insurance industry requires binary splitting to keep the model simple and understandable.
Among traditional open algorithms, ID3 cannot process continuous variables, while C4.5 and Cart cannot process both continuous and discrete variables under the constraint of binary splitting; none meets the requirements of insurance data.
The key to solving this problem is how to define the gains of discrete and continuous variables and how to perform binary splitting by those gains. With the discrete and continuous variable gain algorithms, whichever kind a variable is, only the corresponding gain algorithm needs to be called; a binary splitting mode is produced along with the gain, effectively meeting the modeling requirement of coexisting discrete and continuous variables in insurance data.
The continuous variable gain algorithm splits the data set into two parts according to a continuous variable cv so as to obtain the maximum gain on the target variable t; the gain is the amount by which variance decreases, i.e. after splitting, the variance of the data set is reduced.
The specific steps are as follows:
(1) set gain = 0 and leave the splitting point best_q empty;
(2) sort the data set in ascending order of the continuous variable to be split; suppose the n sorted data are:
x_1, x_2, …, x_n
(3) take in turn the 1%, 2%, …, 99% percentiles of the continuous variable cv; for each percentile q, split the data into two parts according to q: set1 = {x | x_cv ≤ q}, set2 = {x | x_cv > q}, then go to (4);
(4) if the number of samples in set1 or set2 is less than the parameter min_leaf_num, go to (3); otherwise go to (5);
(5) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively, Var(·) is the variance of the target variable over a set, and x_ti denotes the value of the target variable of sample x_t;
(6) if tmp_gain is greater than gain, set gain = tmp_gain and best_q = q, then go to (7);
(7) if the traversal of the percentiles is finished, return gain; otherwise go to (3);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_q, generating two child nodes whose data are the set1 and set2 of step (4); the target variable value of each child node is the mean of the target variable on that node;
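The continuous-variable steps above can be sketched as follows. The function name and data layout are illustrative, and the variance normalization is an assumption about the gain formula, which appears only as an image in the source:

```python
import numpy as np

def continuous_gain(data, cv, t, min_leaf_num=1):
    # Sketch of the continuous variable gain algorithm, steps (1)-(7).
    # `data` is a list of dicts; names and the exact variance
    # normalization are illustrative assumptions.
    gain, best_q = 0.0, None                          # step (1)
    values = np.array(sorted(x[cv] for x in data), dtype=float)   # step (2)
    target = np.array([x[t] for x in sorted(data, key=lambda x: x[cv])],
                      dtype=float)
    base_var = target.var()
    n = len(target)
    for p in range(1, 100):                           # step (3): 1%..99% percentiles
        q = np.percentile(values, p)
        left, right = target[values <= q], target[values > q]
        if len(left) < min_leaf_num or len(right) < min_leaf_num:
            continue                                  # step (4)
        # step (5): gain = decrease in variance after the split
        tmp_gain = base_var - (len(left) * left.var() +
                               len(right) * right.var()) / n
        if tmp_gain > gain:                           # step (6)
            gain, best_q = tmp_gain, q
    return gain, best_q                               # step (7)
```

A strict "greater than" comparison, as in step (6), keeps the first percentile that attains the best gain.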
The discrete variable gain algorithm splits the data set into two parts according to a discrete variable dv so as to obtain the maximum gain on the target variable t; the maximum gain is likewise the reduction in variance.
The specific steps are as follows:
(1) set gain = 0; the best left branch best_left and best right branch best_right are empty;
(2) obtain the set of all values of the discrete variable dv, namely d1, d2, …, dm;
(3) obtain all binary combinations of the discrete variable:
first, the left branch set is empty;
for the number of left-branch elements left_num from 1 to m−1:
find every left_num-element combination from the set {d1, d2, …, dm} and add it to the collection of left branch sets;
(4) take each left branch set left_set in turn; the right branch set right_set is the difference between the full set of values and left_set; for each pair of sets, go to (5);
(5) divide the data set into two parts according to whether the discrete variable dv takes its value in the left or the right branch set:
set1 = {x | x_dv ∈ left_set}, set2 = {x | x_dv ∈ right_set}
(6) if the number of samples in set1 or set2 is less than the leaf-node parameter min_leaf_num, go to (4); otherwise go to (7);
(7) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively;
(8) if tmp_gain is greater than gain, set gain = tmp_gain, best_left = set1 and best_right = set2;
(9) if the left branch sets have been fully traversed, return gain; otherwise go to (4);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_left and best_right, generating two child nodes whose data are the set1 and set2 of step (5); the target variable value of each child node is the mean of the target variable on that node;
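A sketch of the discrete-variable steps, enumerating left-branch value subsets of sizes 1 to m−1 as in step (3); the names and the variance normalization are illustrative assumptions:

```python
from itertools import combinations
import numpy as np

def discrete_gain(data, dv, t, min_leaf_num=1):
    # Sketch of the discrete variable gain algorithm, steps (1)-(9).
    gain, best_left, best_right = 0.0, None, None     # step (1)
    levels = sorted({x[dv] for x in data})            # step (2): d1..dm
    target_all = np.array([x[t] for x in data], dtype=float)
    n = len(data)
    for left_num in range(1, len(levels)):            # step (3)
        for left_set in combinations(levels, left_num):
            # steps (4)-(5): right branch is the complement of the left branch
            left = np.array([float(x[t]) for x in data if x[dv] in left_set])
            right = np.array([float(x[t]) for x in data if x[dv] not in left_set])
            if len(left) < min_leaf_num or len(right) < min_leaf_num:
                continue                              # step (6)
            # step (7): gain = decrease in variance after the split
            tmp_gain = target_all.var() - (len(left) * left.var() +
                                           len(right) * right.var()) / n
            if tmp_gain > gain:                       # step (8)
                gain = tmp_gain
                best_left = set(left_set)
                best_right = set(levels) - best_left
    return gain, best_left, best_right                # step (9)
```

Note that enumerating all subset sizes from 1 to m−1 visits each binary partition twice (once from each side); this mirrors the literal procedure in the text and does not change the result.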
(II) Breadth in-layer gain algorithm: solves the problem that each layer of the insurance data decision tree must use the same variable.
The users of insurance data modeling are insurance practitioners, so the model must be sufficiently simple while preserving accuracy. One index of simplicity is whether each layer of the decision tree uses the same variable: if it does, a 5-layer decision tree uses only 4 variables, making it concise, easy to understand, and convenient for the insurance industry.
Current open algorithms such as ID3 and Cart cannot make every layer use the same variable, so an innovation is needed to meet the modeling requirements of the insurance industry. The key lies in defining how good a variable is within a layer: the optimal variable of each layer is selected by this measure, and the layer is then split using it.
Suppose there are d nodes within a layer, node_1, node_2, …, node_d, carrying n_1, n_2, …, n_d samples respectively, and m alternative features f_1, f_2, …, f_m. The optimal feature is selected using the following formula:
f* = argmax_i Σ_{j=1..d} (n_j / (n_1 + … + n_d)) · gain_ij
where gain_ij represents the gain produced by feature i at node j;
The optimal feature is calculated according to the following steps:
set the maximum gain max_gain = 0 and leave the optimal feature max_feat empty;
(1) take each feature f_i in turn;
(2) initialize the current feature's weighted gain feature_gain = 0;
(3) calculate feature_gain: for each node_j in the node set of the layer, compute gain_ij with the gain algorithm and accumulate it into feature_gain; all nodes in the layer must be traversed to obtain the gain of the current feature f_i;
if feature_gain is greater than max_gain, tentatively take the optimal feature of the layer to be f_i;
after all features have been traversed, the optimal feature of the current layer is obtained;
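The layer-level selection above can be sketched as follows. `nodes` holds the data sets of the d nodes of one layer, and `gain_fn(node, f)` stands for whichever gain routine applies to feature f; weighting each node's gain by its sample share is an assumption about the selection formula, which appears as an image in the source:

```python
def best_layer_feature(nodes, features, gain_fn):
    # Sketch of the breadth in-layer gain algorithm: one feature is
    # chosen for every node in the layer. Names are illustrative.
    total = sum(len(node) for node in nodes)      # n_1 + ... + n_d
    max_gain, max_feat = 0.0, None                # max_gain = 0, max_feat empty
    for f in features:                            # step (1)
        feature_gain = 0.0                        # step (2)
        for node in nodes:                        # step (3): traverse the layer
            feature_gain += len(node) / total * gain_fn(node, f)
        if feature_gain > max_gain:               # tentatively optimal
            max_gain, max_feat = feature_gain, f
    return max_feat, max_gain
```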
(III) Feature selection mode: solves the problem of autonomously specifying the variables used in insurance data modeling.
Insurance data modeling differs from other modeling in that the interpretability of variables is very important; modeling results are interpreted through variables such as age and vehicle load, so it is sometimes necessary to manually specify the variables involved in modeling and their order. Traditional algorithms cannot fix the variable adopted by each layer; this method solves that problem.
A tree-building mode, automatic or manual, is set; if the mode is manual, the client may specify which variable each layer of the decision tree uses.
The decision tree algorithm generates a decision tree from certain independent variables and a target variable in the data. The input has two parts, the original data and the constraint parameters, and the constraint parameters comprise five items: 1) the maximum depth of the tree, max_depth; 2) the minimum number of samples in a leaf node, min_leaf_num; 3) the set of usable variables, x_list; 4) the variable type specification, discrete or continuous; 5) whether the tree-building mode is manual or automatic.
The fifth parameter enables manual specification of the modeling variables.
The overall framework of the algorithm is as follows:
Step one: set the root node as the first layer of the tree;
the data on the root node is all the data;
the target value on the root node is the mean of the claim variable over all data;
the alternative features are the original variable set x_list;
Step two: search for the optimal variable layer by layer and split the nodes within each layer;
traverse the layers as follows:
a) find all nodes whose depth meets the requirement;
b) find the optimal feature on the node set: if the tree-building mode is automatic, search the alternative features for the optimal feature using the independently developed in-layer gain algorithm; if manual, select the first remaining feature of x_list as the optimal feature;
c) using the optimal feature found in b) and the independently developed variable gain algorithm, attempt a binary split of each node in the layer;
d) delete the feature selected in b) from the alternative features.
From the framework above it can be seen that step two b) achieves manual specification of variables and variable order.
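A minimal, self-contained sketch of the framework, assuming manual mode and continuous variables only; automatic mode would instead pick each layer's feature with the in-layer gain algorithm, and all names are illustrative:

```python
import numpy as np

def build_tree(rows, x_list, t, max_depth=5, min_leaf_num=1):
    # Minimal sketch of the overall framework (steps one and two),
    # manual mode: features are taken in x_list order, one per layer.
    def mean_t(rs):
        return float(np.mean([r[t] for r in rs]))

    def split(rs, f):
        # simplified variable gain search: try each observed value as best_q
        best, base = None, np.var([r[t] for r in rs])
        for q in sorted({r[f] for r in rs})[:-1]:
            left = [r for r in rs if r[f] <= q]
            right = [r for r in rs if r[f] > q]
            if len(left) < min_leaf_num or len(right) < min_leaf_num:
                continue
            g = base - (len(left) * np.var([r[t] for r in left]) +
                        len(right) * np.var([r[t] for r in right])) / len(rs)
            if best is None or g > best[0]:
                best = (g, q, left, right)
        return best

    # step one: root node = first layer, holding all data and the claim mean
    root = {'value': mean_t(rows), 'rows': rows, 'children': []}
    layer, feats = [root], list(x_list)
    for _ in range(max_depth - 1):                # step two: layer by layer
        if not layer or not feats:
            break
        f = feats.pop(0)                          # b) manual; d) each feature used once
        nxt = []
        for node in layer:                        # c) try to split every node
            found = split(node['rows'], f)
            if found is None:
                continue
            _, q, left, right = found
            node['split'] = (f, q)
            node['children'] = [{'value': mean_t(s), 'rows': s, 'children': []}
                                for s in (left, right)]
            nxt.extend(node['children'])
        layer = nxt
    return root
```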
The method is used for decision tree modeling of insurance company data, in particular modeling of insurance claim data against various attributes of the insured object (such as driver age, vehicle type, and load capacity). A decision tree model predicts the target variable by building a tree from certain independent variables.
With the applicant's technical scheme, an insurance data decision tree model can be generated quickly and reliably. For example, using the applicant's test data,
the list of input variables is:
x_list=['biz_type_group','CARKIND_group','TONCOUNT_token',
'veh_seat_token','EXHAUSTSCALEJY_token','ALARMFLAG_group',
'cst_age_token','COMPLETEKERBMASS_token','PRICING_PROPERTY']
the maximum depth of the decision tree is set to be 5, the number of the minimum leaf nodes is 200, and the tree building mode is manual.
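The test configuration above corresponds to a set of constraint parameters such as the following. The dict layout and the suffix-based type guesses ('_group' as discrete) are assumptions for illustration, not the patent's actual interface:

```python
# The five constraint parameters for the test run described above.
x_list = ['biz_type_group', 'CARKIND_group', 'TONCOUNT_token',
          'veh_seat_token', 'EXHAUSTSCALEJY_token', 'ALARMFLAG_group',
          'cst_age_token', 'COMPLETEKERBMASS_token', 'PRICING_PROPERTY']

params = {
    'max_depth': 5,            # 1) maximum depth of the tree
    'min_leaf_num': 200,       # 2) minimum leaf-node sample count
    'x_list': x_list,          # 3) usable variables
    'x_types': {v: 'discrete' if v.endswith('_group') else 'continuous'
                for v in x_list},  # 4) variable types (assumed mapping)
    'mode': 'manual',          # 5) manual tree-building
}
```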
Thus, by adopting the insurance data decision tree with the variable gain algorithm and the breadth in-layer gain algorithm, modeling accuracy and operational convenience are effectively improved.
Drawings
FIG. 1 is a diagram of a decision tree model of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments.
Example 1: An insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm is built according to the following steps:
(I) Variable gain algorithm: solves the problem of handling discrete and continuous variables simultaneously, with binary splitting, in insurance data modeling.
(1) Analysis:
In insurance data, discrete and continuous variables typically coexist, each with its own characteristics. For example, the applicant's age on a policy is a continuous variable (30, 31, 32 years) with a natural range (an age of 200 is impossible), while vehicle type is a discrete variable ("flatbed", "van", "refrigerated truck"). Discrete variables in insurance data usually have 2-10 levels after grouping, each level has a substantial number of samples, and the levels differ clearly on the target variable. In addition, the insurance industry requires binary splitting to keep the model simple and understandable.
Among traditional open algorithms, ID3 cannot process continuous variables, while C4.5 and Cart cannot process both continuous and discrete variables under the constraint of binary splitting; none meets the requirements of insurance data.
The key to solving this problem is how to define the gains of discrete and continuous variables and how to perform binary splitting by those gains. With the discrete and continuous variable gain algorithms, whichever kind a variable is, only the corresponding gain algorithm needs to be called; a binary splitting mode is produced along with the gain, effectively meeting the modeling requirement of coexisting discrete and continuous variables in insurance data.
The continuous variable gain algorithm splits the data set into two parts according to a continuous variable cv so as to obtain the maximum gain on the target variable t; the gain is the amount by which variance decreases, i.e. after splitting, the variance of the data set is reduced.
The specific steps are as follows:
(1) set gain = 0 and leave the splitting point best_q empty;
(2) sort the data set in ascending order of the continuous variable to be split; suppose the n sorted data are:
x_1, x_2, …, x_n
(3) take in turn the 1%, 2%, …, 99% percentiles of the continuous variable cv; for each percentile q, split the data into two parts according to q: set1 = {x | x_cv ≤ q}, set2 = {x | x_cv > q}, then go to (4);
(4) if the number of samples in set1 or set2 is less than the parameter min_leaf_num, go to (3); otherwise go to (5);
(5) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively, Var(·) is the variance of the target variable over a set, and x_ti denotes the value of the target variable of sample x_t;
(6) if tmp_gain is greater than gain, set gain = tmp_gain and best_q = q, then go to (7);
(7) if the traversal of the percentiles is finished, return gain; otherwise go to (3);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_q, generating two child nodes whose data are the set1 and set2 of step (4); the target variable value of each child node is the mean of the target variable on that node;
The discrete variable gain algorithm splits the data set into two parts according to a discrete variable dv so as to obtain the maximum gain on the target variable t; the maximum gain is likewise the reduction in variance.
The specific steps are as follows:
(1) set gain = 0; the best left branch best_left and best right branch best_right are empty;
(2) obtain the set of all values of the discrete variable dv, namely d1, d2, …, dm;
(3) obtain all binary combinations of the discrete variable:
first, the left branch set is empty;
for the number of left-branch elements left_num from 1 to m−1:
find every left_num-element combination from the set {d1, d2, …, dm} and add it to the collection of left branch sets;
(4) take each left branch set left_set in turn; the right branch set right_set is the difference between the full set of values and left_set; for each pair of sets, go to (5);
(5) divide the data set into two parts according to whether the discrete variable dv takes its value in the left or the right branch set:
set1 = {x | x_dv ∈ left_set}, set2 = {x | x_dv ∈ right_set}
(6) if the number of samples in set1 or set2 is less than the leaf-node parameter min_leaf_num, go to (4); otherwise go to (7);
(7) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively;
(8) if tmp_gain is greater than gain, set gain = tmp_gain, best_left = set1 and best_right = set2;
(9) if the left branch sets have been fully traversed, return gain; otherwise go to (4);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_left and best_right, generating two child nodes whose data are the set1 and set2 of step (5); the target variable value of each child node is the mean of the target variable on that node;
(II) Breadth in-layer gain algorithm: solves the problem that each layer of the insurance data decision tree must use the same variable.
The users of insurance data modeling are insurance practitioners, so the model must be sufficiently simple while preserving accuracy. One index of simplicity is whether each layer of the decision tree uses the same variable: if it does, a 5-layer decision tree uses only 4 variables, making it concise, easy to understand, and convenient for the insurance industry.
Current open algorithms such as ID3 and Cart cannot make every layer use the same variable, so an innovation is needed to meet the modeling requirements of the insurance industry. The key lies in defining how good a variable is within a layer: the optimal variable of each layer is selected by this measure, and the layer is then split using it.
Suppose there are d nodes within a layer, node_1, node_2, …, node_d, carrying n_1, n_2, …, n_d samples respectively, and m alternative features f_1, f_2, …, f_m. The optimal feature is selected using the following formula:
f* = argmax_i Σ_{j=1..d} (n_j / (n_1 + … + n_d)) · gain_ij
where gain_ij represents the gain produced by feature i at node j;
The optimal feature is calculated according to the following steps:
set the maximum gain max_gain = 0 and leave the optimal feature max_feat empty;
(1) take each feature f_i in turn;
(2) initialize the current feature's weighted gain feature_gain = 0;
(3) calculate feature_gain: for each node_j in the node set of the layer, compute gain_ij with the gain algorithm and accumulate it into feature_gain; all nodes in the layer must be traversed to obtain the gain of the current feature f_i;
if feature_gain is greater than max_gain, tentatively take the optimal feature of the layer to be f_i;
after all features have been traversed, the optimal feature of the current layer is obtained;
(III) Feature selection mode: solves the problem of autonomously specifying the variables used in insurance data modeling.
Insurance data modeling differs from other modeling in that the interpretability of variables is very important; modeling results are interpreted through variables such as age and vehicle load, so it is sometimes necessary to manually specify the variables involved in modeling and their order. Traditional algorithms cannot fix the variable adopted by each layer; this method solves that problem.
A tree-building mode, automatic or manual, is set; if the mode is manual, the client may specify which variable each layer of the decision tree uses.
The decision tree algorithm generates a decision tree from certain independent variables and a target variable in the data. The input has two parts, the original data and the constraint parameters, and the constraint parameters comprise five items: 1) the maximum depth of the tree, max_depth; 2) the minimum number of samples in a leaf node, min_leaf_num; 3) the set of usable variables, x_list; 4) the variable type specification, discrete or continuous; 5) whether the tree-building mode is manual or automatic.
The fifth parameter enables manual specification of the modeling variables.
The overall framework of the algorithm is as follows:
Step one: set the root node as the first layer of the tree;
the data on the root node is all the data;
the target value on the root node is the mean of the claim variable over all data;
the alternative features are the original variable set x_list;
Step two: search for the optimal variable layer by layer and split the nodes within each layer;
traverse the layers as follows:
a) find all nodes whose depth meets the requirement;
b) find the optimal feature on the node set: if the tree-building mode is automatic, search the alternative features for the optimal feature using the independently developed in-layer gain algorithm; if manual, select the first remaining feature of x_list as the optimal feature;
c) using the optimal feature found in b) and the independently developed variable gain algorithm, attempt a binary split of each node in the layer;
d) delete the feature selected in b) from the alternative features.
From the framework above it can be seen that step two b) achieves manual specification of variables and variable order.

Claims (1)

1. An insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm, characterized by being built according to the following steps:
(I) Variable gain algorithm: solves the problem of handling discrete and continuous variables simultaneously, with binary splitting, in insurance data modeling.
(1) Analysis:
In insurance data, discrete and continuous variables typically coexist, each with its own characteristics. For example, the applicant's age on a policy is a continuous variable (30, 31, 32 years) with a natural range (an age of 200 is impossible), while vehicle type is a discrete variable ("flatbed", "van", "refrigerated truck"). Discrete variables in insurance data usually have 2-10 levels after grouping, each level has a substantial number of samples, and the levels differ clearly on the target variable. In addition, the insurance industry requires binary splitting to keep the model simple and understandable.
Among traditional open algorithms, ID3 cannot process continuous variables, while C4.5 and Cart cannot process both continuous and discrete variables under the constraint of binary splitting; none meets the requirements of insurance data.
The key to solving this problem is how to define the gains of discrete and continuous variables and how to perform binary splitting by those gains. With the discrete and continuous variable gain algorithms, whichever kind a variable is, only the corresponding gain algorithm needs to be called; a binary splitting mode is produced along with the gain, effectively meeting the modeling requirement of coexisting discrete and continuous variables in insurance data.
The continuous variable gain algorithm splits the data set into two parts according to a continuous variable cv so as to obtain the maximum gain on the target variable t; the gain is the amount by which variance decreases, i.e. after splitting, the variance of the data set is reduced.
The specific steps are as follows:
(1) set gain = 0 and leave the splitting point best_q empty;
(2) sort the data set in ascending order of the continuous variable to be split; suppose the n sorted data are:
x_1, x_2, …, x_n
(3) take in turn the 1%, 2%, …, 99% percentiles of the continuous variable cv; for each percentile q, split the data into two parts according to q: set1 = {x | x_cv ≤ q}, set2 = {x | x_cv > q}, then go to (4);
(4) if the number of samples in set1 or set2 is less than the parameter min_leaf_num, go to (3); otherwise go to (5);
(5) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively, Var(·) is the variance of the target variable over a set, and x_ti denotes the value of the target variable of sample x_t;
(6) if tmp_gain is greater than gain, set gain = tmp_gain and best_q = q, then go to (7);
(7) if the traversal of the percentiles is finished, return gain; otherwise go to (3);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_q, generating two child nodes whose data are the set1 and set2 of step (4); the target variable value of each child node is the mean of the target variable on that node;
The discrete variable gain algorithm splits the data set into two parts according to a discrete variable dv so as to obtain the maximum gain on the target variable t; the maximum gain is likewise the reduction in variance.
The specific steps are as follows:
(1) set gain = 0; the best left branch best_left and best right branch best_right are empty;
(2) obtain the set of all values of the discrete variable dv, namely d1, d2, …, dm;
(3) obtain all binary combinations of the discrete variable:
first, the left branch set is empty;
for the number of left-branch elements left_num from 1 to m−1:
find every left_num-element combination from the set {d1, d2, …, dm} and add it to the collection of left branch sets;
(4) take each left branch set left_set in turn; the right branch set right_set is the difference between the full set of values and left_set; for each pair of sets, go to (5);
(5) divide the data set into two parts according to whether the discrete variable dv takes its value in the left or the right branch set:
set1 = {x | x_dv ∈ left_set}, set2 = {x | x_dv ∈ right_set}
(6) if the number of samples in set1 or set2 is less than the leaf-node parameter min_leaf_num, go to (4); otherwise go to (7);
(7) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively;
(8) if tmp_gain is greater than gain, then gain = tmp_gain, best_left = set1 and best_right = set2;
(9) if all left branch sets have been traversed, return gain; otherwise go to (4);
after the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_left and best_right, generating two child nodes; the data on the two child nodes are set1 and set2 of step (5), and the target value of each child node is the mean of the target variable over the samples on that node;
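The discrete-variable steps (2)-(9) can be sketched as follows; enumerating the binary partitions with `itertools.combinations` and the sample-weighted variance gain are assumptions of this illustration:

```python
from itertools import combinations
from statistics import pvariance

def best_discrete_split(dv, t, min_leaf_num=1):
    """Enumerate binary partitions of the value set of discrete variable
    dv and return (gain, best_left, best_right) on target t."""
    values = sorted(set(dv))                   # step (2): d1..dm
    m = len(values)
    n = len(t)
    base_var = pvariance(t)                    # variance before the split
    gain, best_left, best_right = 0.0, None, None
    # step (3): the left branch takes left_num values, left_num = 1..m-1
    for left_num in range(1, m):
        for combo in combinations(values, left_num):
            left_set = set(combo)
            right_set = set(values) - left_set           # step (4)
            set1 = [ti for v, ti in zip(dv, t) if v in left_set]   # step (5)
            set2 = [ti for v, ti in zip(dv, t) if v in right_set]
            k1, k2 = len(set1), len(set2)
            if k1 < min_leaf_num or k2 < min_leaf_num:   # step (6)
                continue
            # step (7): variance reduction
            tmp_gain = base_var - (k1 * pvariance(set1) + k2 * pvariance(set2)) / n
            if tmp_gain > gain:                          # step (8)
                gain, best_left, best_right = tmp_gain, left_set, right_set
    return gain, best_left, best_right
```

Note the enumeration is exponential in the number of distinct values m, so it is only practical for low-cardinality variables.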
(II) the breadth in-layer gain algorithm: solves the problem of making each layer of the insurance data decision tree use the same variable;
the users of insurance data modeling are insurance practitioners, so an insurance data model must be simple while its accuracy is still ensured; one sufficiently concise criterion is whether each layer of the decision tree uses the same variable: if it does, a decision tree of 5 layers uses only 4 variables, which keeps the tree concise, easy to understand and convenient to use in the insurance industry;
current open algorithms such as ID3 and CART cannot make each layer use the same variable, so this problem needs an innovative solution to fit the modeling requirements of the insurance industry; the key lies in how to measure the quality of a variable within a layer: the quality of each variable in the layer is defined, the optimal variable is selected by it, and every node of the layer is then split with that optimal variable;
suppose there are d nodes within a layer, node_1, node_2, …, node_d, holding n_1, n_2, …, n_d samples respectively, and there are m alternative features f_1, f_2, …, f_m; the optimal feature is selected using the following formula:
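The selection formula is likewise missing from the extracted text; given the node sample counts n_1, …, n_d and the "weighted gain" wording of the steps below, a plausible reconstruction (an assumption) is:

```latex
f^{*} \;=\; \arg\max_{f_i}\; \sum_{j=1}^{d}\frac{n_j}{\sum_{l=1}^{d} n_l}\; gain_{ij}
```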
wherein gain_ij represents the gain produced by feature f_i at node_j;
the optimal feature is calculated according to the following steps:
set the maximum gain max_gain = 0, and let the optimal feature max_feat be empty;
(1) select each feature f_i in turn;
(2) initialize the current feature's weighted gain feature_gain = 0;
(3) calculate feature_gain:
for each node_j in the node set of the layer, calculate gain_ij with the variable gain algorithm and update feature_gain;
the gain of the current feature f_i is obtained only after all nodes in the layer have been traversed;
if feature_gain is greater than max_gain, tentatively take the optimal feature of the layer to be f_i;
after all features have been traversed once, the optimal feature of the layer is obtained;
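A compact sketch of this per-layer selection; the node representation and the callable standing in for the variable gain algorithm are illustrative, not from the patent:

```python
def best_layer_feature(nodes, features, gain_fn):
    """Pick the single feature whose sample-weighted gain, summed over
    all nodes of the layer, is largest.  nodes: list of (node, n_j)
    pairs; gain_fn(feature, node) stands in for the variable gain
    algorithm and returns gain_ij."""
    total = sum(n_j for _, n_j in nodes)
    max_gain, max_feat = 0.0, None
    for f in features:                   # step (1): each feature in turn
        feature_gain = 0.0               # step (2)
        for node, n_j in nodes:          # step (3): traverse the layer
            feature_gain += (n_j / total) * gain_fn(f, node)
        if feature_gain > max_gain:      # tentative optimum
            max_gain, max_feat = feature_gain, f
    return max_feat, max_gain
```

Because the sum runs over every node of the layer, a feature that splits only one node well cannot dominate one that helps the whole layer.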
(III) the feature selection mode: solves the problem of allowing designated variables to be selected in insurance data modeling;
insurance data modeling differs from other modeling in that the interpretability of variables is very important; modeling results are interpreted through variables such as age and vehicle load, so it is sometimes necessary to manually specify the variables involved in modeling and their order; traditional algorithms cannot support specifying the variable adopted by each layer, and this method solves that problem;
the tree-building mode is set to automatic or manual; if manual, the client is allowed to specify which variable each layer of the decision tree uses;
the decision tree algorithm generates a decision tree from some independent variables and a target variable in the data; its input has two parts, the original data and the constraint parameters, of which there are five: 1) the maximum depth of the tree max_depth; 2) the minimum leaf node sample number min_leaf_num; 3) the set of usable variables x_list; 4) the variable type specification, discrete or continuous; 5) whether the tree-building mode is manual or automatic;
the fifth parameter enables manual specification of the modeling variables;
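The five constraint parameters could be grouped, for illustration, as follows; only max_depth, min_leaf_num and x_list appear verbatim in the text, the other field names are assumptions:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TreeParams:
    max_depth: int             # 1) maximum depth of the tree
    min_leaf_num: int          # 2) minimum number of samples in a leaf node
    x_list: List[str]          # 3) usable variables, in order (used by manual mode)
    var_types: Dict[str, str]  # 4) variable name -> "discrete" or "continuous"
    mode: str = "auto"         # 5) tree-building mode: "auto" or "manual"
```

In manual mode the order of x_list doubles as the per-layer variable assignment.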
we first look at the overall framework of the algorithm, which is as follows:
firstly, setting a root node as a first layer of a tree;
the corresponding data on the root node is all data;
the target value on the root node is the mean of the claim variable over all the data;
the alternative characteristic is an original variable set x _ list;
secondly, search for the optimal variable layer by layer and split the nodes within each layer;
traverse the layers as follows:
a) Find all nodes whose depth matches the current layer;
b) Find the optimal feature on this set of nodes: if the tree-building mode is automatic, search the alternative features for the optimal one with the independently developed "in-layer gain algorithm"; if manual, select the first feature of x_list as the optimal feature;
c) Using the optimal feature found in b), try to binary-split each node in the layer with the independently developed variable gain algorithm;
d) Delete the feature selected in b) from the alternative features;
from the above framework it can be seen that step b) of the second stage realizes the manual specification of variables and of the variable order.
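The framework of the second step, including the manual/automatic choice in b) and the feature deletion in d), can be sketched as follows; the dict-based node representation and the two helper callables are assumptions of this sketch:

```python
def build_tree(data, params, find_best_layer_feature, split_node):
    """Layer-by-layer growth: one feature per layer, chosen
    automatically or taken from x_list order in manual mode.
    find_best_layer_feature stands in for the in-layer gain algorithm,
    split_node for the variable gain binary split."""
    root = {"depth": 1, "data": data, "children": []}
    candidates = list(params["x_list"])          # alternative features
    layer = [root]
    for _ in range(1, params["max_depth"]):      # grow until max_depth
        if not layer or not candidates:
            break
        # b) one feature is chosen for the whole layer
        if params["mode"] == "auto":
            feat = find_best_layer_feature(layer, candidates)
        else:                                    # manual: keep the given order
            feat = candidates[0]
        # c) try to binary-split every node of the layer on that feature
        next_layer = []
        for node in layer:
            next_layer.extend(split_node(node, feat, params["min_leaf_num"]))
        # d) a feature is used by at most one layer
        candidates.remove(feat)
        layer = next_layer
    return root
```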
CN201711422828.6A 2017-12-25 2017-12-25 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm Active CN108009931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711422828.6A CN108009931B (en) 2017-12-25 2017-12-25 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm

Publications (2)

Publication Number Publication Date
CN108009931A true CN108009931A (en) 2018-05-08
CN108009931B CN108009931B (en) 2021-08-06

Family

ID=62061127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711422828.6A Active CN108009931B (en) 2017-12-25 2017-12-25 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm

Country Status (1)

Country Link
CN (1) CN108009931B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828306A (en) * 2005-03-01 2006-09-06 中国海洋大学 Method for realizing formulated product sensing index prediction based on M5' model tree
CN102331992A (en) * 2010-06-09 2012-01-25 微软公司 Distributed decision tree training
US20160162793A1 (en) * 2014-12-05 2016-06-09 Alibaba Group Holding Limited Method and apparatus for decision tree based search result ranking
US20170061318A1 (en) * 2015-08-24 2017-03-02 International Business Machines Corporation Scalable streaming decision tree learning
CN106096748A (en) * 2016-04-28 2016-11-09 武汉宝钢华中贸易有限公司 Entrucking forecast model in man-hour based on cluster analysis and decision Tree algorithms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUOMING JI: "Segment based decision tree induction with continuous valued attributes", IEEE Transactions on Cybernetics *
HUANG QI: "Personal Income Analysis Based on CHAID Decision Trees", Mathematical Theory and Applications *

Also Published As

Publication number Publication date
CN108009931B (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN110245802B (en) Cigarette empty-head rate prediction method and system based on improved gradient lifting decision tree
Peukert et al. AMC-A framework for modelling and comparing matching systems as matching processes
WO2016101628A1 (en) Data processing method and device in data modeling
CN115392592B (en) Storage product parameter configuration recommendation method, device, equipment and medium
Bolancé et al. Inverse Beta transformation in kernel density estimation
Belov et al. Nevanlinna domains with large boundaries
CN108009931B (en) Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm
EP4250301A1 (en) Method for estimating a variable of interest associated to a given disease as a function of a plurality of different omics data, corresponding device, and computer program product
Yu et al. A partially linear tree-based regression model for multivariate outcomes
JPH0816400A (en) Case inference support equipment
CN115952426A (en) Distributed noise data clustering method based on random sampling and user classification method
CN115828116A (en) Forest resource asset homogeneous region partitioning method based on two-step distance and dynamic constraint
CN111967616B (en) Automatic time series regression method and device
CN109284393B (en) Fusion method for family tree character attribute names
CN104598591B (en) A kind of model element matching process for type attribute graph model
CN109993193B (en) Method and device for identifying key points of three-dimensional curve
CN109614456B (en) Deep learning-based geographic information positioning and partitioning method and device
CN116756431B (en) Information or article recommendation method based on approximate concepts under incomplete form background
Pietrus NON DIFFERENTIABLE PERTURBED NEWTON'S METHOD FOR FUNCTIONS WITH VALUES IN A CONE
CN110689158B (en) Method, device and storage medium for predicting destination
Riasanovsky Two problems in extremal combinatorics
CN110083930A (en) The construction method and device of shale weathering index
CN110990463B (en) Method and device for mining frequent symmetric pattern of time series
CN117497056B (en) Non-contrast HRD detection method, system and device
Jung et al. Using J48 and REPTree to Predict Risk Factors in Medicine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20181228

Address after: Room 2201, Liangyou Building, 618 Shangcheng Road, Pudong New District, Shanghai

Applicant after: Shanghai Qihuang Information Technology Co., Ltd.

Address before: 311201 11th Floor, Block B1, Hangzhou Bay Information Port, Xiaoshan Economic Development Zone, Hangzhou City, Zhejiang Province

Applicant before: Hangzhou seven Kyung Mdt InfoTech Ltd

GR01 Patent grant
GR01 Patent grant