CN108009931A - Insurance data decision tree using a variable gain algorithm and a breadth in-layer gain algorithm - Google Patents


Info

Publication number
CN108009931A
CN108009931A (application CN201711422828.6A)
Authority
CN
China
Prior art keywords
gain
variable
variables
data
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711422828.6A
Other languages
Chinese (zh)
Other versions
CN108009931B (en)
Inventor
赵昕
毛耀鋆
张鲁嘉
尹龙
涂闪
杨明锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qihuang Information Technology Co Ltd
Original Assignee
Hangzhou Seven Kyung Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Seven Kyung Mdt Infotech Ltd filed Critical Hangzhou Seven Kyung Mdt Infotech Ltd
Priority to CN201711422828.6A priority Critical patent/CN108009931B/en
Publication of CN108009931A publication Critical patent/CN108009931A/en
Application granted granted Critical
Publication of CN108009931B publication Critical patent/CN108009931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR SUCH PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 — Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 — Insurance

Abstract

The present invention relates to an insurance data decision tree, and in particular to an insurance data decision tree using a variable gain algorithm and a breadth in-layer gain algorithm. It proceeds in the following steps: variable gain algorithm → breadth in-layer gain algorithm → feature selection mode. The resulting decision tree effectively improves modeling accuracy and ease of operation.

Description

Insurance data decision tree adopting variable gain algorithm and breadth in-layer gain algorithm
Technical Field
The invention relates to an insurance data decision tree, in particular to an insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm.
Background
Insurance industry data and modeling have their own characteristics: 1) continuous and discrete data coexist in insurance industry data; 2) insurance industry models must be sufficiently compact without loss of accuracy; 3) insurance modeling requires that the features of each layer can be manually selected.
Publicly available algorithms do not meet the requirements of the insurance industry or of the applicant: the ID3 algorithm cannot process continuous data, the C4.5 algorithm splits variables into multiple branches rather than two, and the Cart tree algorithm uses a different variable at each layer. In view of this, the applicant developed the "Qijiong (七炅) decision tree algorithm", which processes both discrete and continuous variables, uses the same variable throughout each layer of the tree, splits every variable into exactly two branches, and allows the customer to manually specify the features of each layer.
Disclosure of Invention
The invention mainly overcomes the defects of the prior art and provides an insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm, which effectively balances the usability, accuracy, simplicity and flexibility of modeling in the insurance industry.
The technical problem of the invention is mainly solved by the following technical scheme:
An insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm is built according to the following steps:
(I) Variable gain algorithm: solves the problem of handling discrete and continuous variables simultaneously, with binary splitting, in insurance data modeling.
(1) Analysis:
In insurance data, discrete and continuous variables typically coexist, each with its own characteristics. For example, the applicant's age on a policy is a continuous variable (30, 31, 32 years) with a natural range (an age of 200 is impossible), while vehicle type is a discrete variable ("flatbed", "van", "refrigerated truck"). Discrete variables in insurance data usually have 2-10 levels after grouping, each level has a substantial number of samples, and the levels differ clearly on the target variable. In addition, the insurance industry requires binary splitting to keep the model simple and understandable.
Among traditional open algorithms, ID3 cannot process continuous variables, while C4.5 and Cart cannot process both continuous and discrete variables under the constraint of binary splitting; none meets the requirements of insurance data.
The key to solving this problem is how to define the gains of discrete and continuous variables and how to perform binary splitting by those gains. With the discrete and continuous variable gain algorithms, whichever kind a variable is, only the corresponding gain algorithm needs to be called; a binary splitting mode is produced along with the gain, effectively meeting the modeling requirement of coexisting discrete and continuous variables in insurance data.
The continuous variable gain algorithm splits the data set into two parts according to a continuous variable cv so as to obtain the maximum gain on the target variable t; the gain is the amount by which variance decreases, i.e. after splitting, the variance of the data set is reduced.
The specific steps are as follows:
(1) set gain = 0 and leave the splitting point best_q empty;
(2) sort the data set in ascending order of the continuous variable to be split; suppose the n sorted data are:
x_1, x_2, …, x_n
(3) take in turn the 1%, 2%, …, 99% percentiles of the continuous variable cv; for each percentile q, split the data into two parts according to q: set1 = {x | x_cv ≤ q}, set2 = {x | x_cv > q}, then go to (4);
(4) if the number of samples in set1 or set2 is less than the parameter min_leaf_num, go to (3); otherwise go to (5);
(5) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively, Var(·) is the variance of the target variable over a set, and x_ti denotes the value of the target variable of sample x_t;
(6) if tmp_gain is greater than gain, set gain = tmp_gain and best_q = q, then go to (7);
(7) if the traversal of the percentiles is finished, return gain; otherwise go to (3);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_q, generating two child nodes whose data are the set1 and set2 of step (4); the target variable value of each child node is the mean of the target variable on that node;
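The continuous-variable steps above can be sketched as follows. The function name and data layout are illustrative, and the variance normalization is an assumption about the gain formula, which appears only as an image in the source:

```python
import numpy as np

def continuous_gain(data, cv, t, min_leaf_num=1):
    # Sketch of the continuous variable gain algorithm, steps (1)-(7).
    # `data` is a list of dicts; names and the exact variance
    # normalization are illustrative assumptions.
    gain, best_q = 0.0, None                          # step (1)
    values = np.array(sorted(x[cv] for x in data), dtype=float)   # step (2)
    target = np.array([x[t] for x in sorted(data, key=lambda x: x[cv])],
                      dtype=float)
    base_var = target.var()
    n = len(target)
    for p in range(1, 100):                           # step (3): 1%..99% percentiles
        q = np.percentile(values, p)
        left, right = target[values <= q], target[values > q]
        if len(left) < min_leaf_num or len(right) < min_leaf_num:
            continue                                  # step (4)
        # step (5): gain = decrease in variance after the split
        tmp_gain = base_var - (len(left) * left.var() +
                               len(right) * right.var()) / n
        if tmp_gain > gain:                           # step (6)
            gain, best_q = tmp_gain, q
    return gain, best_q                               # step (7)
```

A strict "greater than" comparison, as in step (6), keeps the first percentile that attains the best gain.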
The discrete variable gain algorithm splits the data set into two parts according to a discrete variable dv so as to obtain the maximum gain on the target variable t; the maximum gain is likewise the reduction in variance.
The specific steps are as follows:
(1) set gain = 0; the best left branch best_left and best right branch best_right are empty;
(2) obtain the set of all values of the discrete variable dv, namely d1, d2, …, dm;
(3) obtain all binary combinations of the discrete variable:
first, the left branch set is empty;
for the number of left-branch elements left_num from 1 to m−1:
find every left_num-element combination from the set {d1, d2, …, dm} and add it to the collection of left branch sets;
(4) take each left branch set left_set in turn; the right branch set right_set is the difference between the full set of values and left_set; for each pair of sets, go to (5);
(5) divide the data set into two parts according to whether the discrete variable dv takes its value in the left or the right branch set:
set1 = {x | x_dv ∈ left_set}, set2 = {x | x_dv ∈ right_set}
(6) if the number of samples in set1 or set2 is less than the leaf-node parameter min_leaf_num, go to (4); otherwise go to (7);
(7) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively;
(8) if tmp_gain is greater than gain, set gain = tmp_gain, best_left = set1 and best_right = set2;
(9) if the left branch sets have been fully traversed, return gain; otherwise go to (4);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_left and best_right, generating two child nodes whose data are the set1 and set2 of step (5); the target variable value of each child node is the mean of the target variable on that node;
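A sketch of the discrete-variable steps, enumerating left-branch value subsets of sizes 1 to m−1 as in step (3); the names and the variance normalization are illustrative assumptions:

```python
from itertools import combinations
import numpy as np

def discrete_gain(data, dv, t, min_leaf_num=1):
    # Sketch of the discrete variable gain algorithm, steps (1)-(9).
    gain, best_left, best_right = 0.0, None, None     # step (1)
    levels = sorted({x[dv] for x in data})            # step (2): d1..dm
    target_all = np.array([x[t] for x in data], dtype=float)
    n = len(data)
    for left_num in range(1, len(levels)):            # step (3)
        for left_set in combinations(levels, left_num):
            # steps (4)-(5): right branch is the complement of the left branch
            left = np.array([float(x[t]) for x in data if x[dv] in left_set])
            right = np.array([float(x[t]) for x in data if x[dv] not in left_set])
            if len(left) < min_leaf_num or len(right) < min_leaf_num:
                continue                              # step (6)
            # step (7): gain = decrease in variance after the split
            tmp_gain = target_all.var() - (len(left) * left.var() +
                                           len(right) * right.var()) / n
            if tmp_gain > gain:                       # step (8)
                gain = tmp_gain
                best_left = set(left_set)
                best_right = set(levels) - best_left
    return gain, best_left, best_right                # step (9)
```

Note that enumerating all subset sizes from 1 to m−1 visits each binary partition twice (once from each side); this mirrors the literal procedure in the text and does not change the result.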
(II) Breadth in-layer gain algorithm: solves the problem that each layer of the insurance data decision tree must use the same variable.
The users of insurance data modeling are insurance practitioners, so the model must be sufficiently simple while preserving accuracy. One index of simplicity is whether each layer of the decision tree uses the same variable: if it does, a 5-layer decision tree uses only 4 variables, making it concise, easy to understand, and convenient for the insurance industry.
Current open algorithms such as ID3 and Cart cannot make every layer use the same variable, so an innovation is needed to meet the modeling requirements of the insurance industry. The key lies in defining how good a variable is within a layer: the optimal variable of each layer is selected by this measure, and the layer is then split using it.
Suppose there are d nodes within a layer, node_1, node_2, …, node_d, carrying n_1, n_2, …, n_d samples respectively, and m alternative features f_1, f_2, …, f_m. The optimal feature is selected using the following formula:
f* = argmax_i Σ_{j=1..d} (n_j / (n_1 + … + n_d)) · gain_ij
where gain_ij represents the gain produced by feature i at node j;
The optimal feature is calculated according to the following steps:
set the maximum gain max_gain = 0 and leave the optimal feature max_feat empty;
(1) take each feature f_i in turn;
(2) initialize the current feature's weighted gain feature_gain = 0;
(3) calculate feature_gain: for each node_j in the node set of the layer, compute gain_ij with the gain algorithm and accumulate it into feature_gain; all nodes in the layer must be traversed to obtain the gain of the current feature f_i;
if feature_gain is greater than max_gain, tentatively take the optimal feature of the layer to be f_i;
after all features have been traversed, the optimal feature of the current layer is obtained;
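The layer-level selection above can be sketched as follows. `nodes` holds the data sets of the d nodes of one layer, and `gain_fn(node, f)` stands for whichever gain routine applies to feature f; weighting each node's gain by its sample share is an assumption about the selection formula, which appears as an image in the source:

```python
def best_layer_feature(nodes, features, gain_fn):
    # Sketch of the breadth in-layer gain algorithm: one feature is
    # chosen for every node in the layer. Names are illustrative.
    total = sum(len(node) for node in nodes)      # n_1 + ... + n_d
    max_gain, max_feat = 0.0, None                # max_gain = 0, max_feat empty
    for f in features:                            # step (1)
        feature_gain = 0.0                        # step (2)
        for node in nodes:                        # step (3): traverse the layer
            feature_gain += len(node) / total * gain_fn(node, f)
        if feature_gain > max_gain:               # tentatively optimal
            max_gain, max_feat = feature_gain, f
    return max_feat, max_gain
```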
(III) Feature selection mode: solves the problem of autonomously specifying the variables used in insurance data modeling.
Insurance data modeling differs from other modeling in that the interpretability of variables is very important; modeling results are interpreted through variables such as age and vehicle load, so it is sometimes necessary to manually specify the variables involved in modeling and their order. Traditional algorithms cannot fix the variable adopted by each layer; this method solves that problem.
A tree-building mode, automatic or manual, is set; if the mode is manual, the client may specify which variable each layer of the decision tree uses.
The decision tree algorithm generates a decision tree from certain independent variables and a target variable in the data. The input has two parts, the original data and the constraint parameters, and the constraint parameters comprise five items: 1) the maximum depth of the tree, max_depth; 2) the minimum number of samples in a leaf node, min_leaf_num; 3) the set of usable variables, x_list; 4) the variable type specification, discrete or continuous; 5) whether the tree-building mode is manual or automatic.
The fifth parameter enables manual specification of the modeling variables.
The overall framework of the algorithm is as follows:
Step one: set the root node as the first layer of the tree;
the data on the root node is all the data;
the target value on the root node is the mean of the claim variable over all data;
the alternative features are the original variable set x_list;
Step two: search for the optimal variable layer by layer and split the nodes within each layer;
traverse the layers as follows:
a) find all nodes whose depth meets the requirement;
b) find the optimal feature on the node set: if the tree-building mode is automatic, search the alternative features for the optimal feature using the independently developed in-layer gain algorithm; if manual, select the first remaining feature of x_list as the optimal feature;
c) using the optimal feature found in b) and the independently developed variable gain algorithm, attempt a binary split of each node in the layer;
d) delete the feature selected in b) from the alternative features.
From the framework above it can be seen that step two b) achieves manual specification of variables and variable order.
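A minimal, self-contained sketch of the framework, assuming manual mode and continuous variables only; automatic mode would instead pick each layer's feature with the in-layer gain algorithm, and all names are illustrative:

```python
import numpy as np

def build_tree(rows, x_list, t, max_depth=5, min_leaf_num=1):
    # Minimal sketch of the overall framework (steps one and two),
    # manual mode: features are taken in x_list order, one per layer.
    def mean_t(rs):
        return float(np.mean([r[t] for r in rs]))

    def split(rs, f):
        # simplified variable gain search: try each observed value as best_q
        best, base = None, np.var([r[t] for r in rs])
        for q in sorted({r[f] for r in rs})[:-1]:
            left = [r for r in rs if r[f] <= q]
            right = [r for r in rs if r[f] > q]
            if len(left) < min_leaf_num or len(right) < min_leaf_num:
                continue
            g = base - (len(left) * np.var([r[t] for r in left]) +
                        len(right) * np.var([r[t] for r in right])) / len(rs)
            if best is None or g > best[0]:
                best = (g, q, left, right)
        return best

    # step one: root node = first layer, holding all data and the claim mean
    root = {'value': mean_t(rows), 'rows': rows, 'children': []}
    layer, feats = [root], list(x_list)
    for _ in range(max_depth - 1):                # step two: layer by layer
        if not layer or not feats:
            break
        f = feats.pop(0)                          # b) manual; d) each feature used once
        nxt = []
        for node in layer:                        # c) try to split every node
            found = split(node['rows'], f)
            if found is None:
                continue
            _, q, left, right = found
            node['split'] = (f, q)
            node['children'] = [{'value': mean_t(s), 'rows': s, 'children': []}
                                for s in (left, right)]
            nxt.extend(node['children'])
        layer = nxt
    return root
```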
The method is used for decision tree modeling of insurance company data, in particular modeling of insurance claim data against various attributes of the insured object (such as driver age, vehicle type, and load capacity). A decision tree model predicts the target variable by building a tree from certain independent variables.
With the applicant's technical scheme, an insurance data decision tree model can be generated quickly and reliably. For example, using the applicant's test data,
the list of input variables is:
x_list=['biz_type_group','CARKIND_group','TONCOUNT_token',
'veh_seat_token','EXHAUSTSCALEJY_token','ALARMFLAG_group',
'cst_age_token','COMPLETEKERBMASS_token','PRICING_PROPERTY']
the maximum depth of the decision tree is set to be 5, the number of the minimum leaf nodes is 200, and the tree building mode is manual.
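The test configuration above corresponds to a set of constraint parameters such as the following. The dict layout and the suffix-based type guesses ('_group' as discrete) are assumptions for illustration, not the patent's actual interface:

```python
# The five constraint parameters for the test run described above.
x_list = ['biz_type_group', 'CARKIND_group', 'TONCOUNT_token',
          'veh_seat_token', 'EXHAUSTSCALEJY_token', 'ALARMFLAG_group',
          'cst_age_token', 'COMPLETEKERBMASS_token', 'PRICING_PROPERTY']

params = {
    'max_depth': 5,            # 1) maximum depth of the tree
    'min_leaf_num': 200,       # 2) minimum leaf-node sample count
    'x_list': x_list,          # 3) usable variables
    'x_types': {v: 'discrete' if v.endswith('_group') else 'continuous'
                for v in x_list},  # 4) variable types (assumed mapping)
    'mode': 'manual',          # 5) manual tree-building
}
```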
Thus, by adopting the insurance data decision tree with the variable gain algorithm and the breadth in-layer gain algorithm, modeling accuracy and operational convenience are effectively improved.
Drawings
FIG. 1 is a diagram of a decision tree model of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments.
Example 1: An insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm is built according to the following steps:
(I) Variable gain algorithm: solves the problem of handling discrete and continuous variables simultaneously, with binary splitting, in insurance data modeling.
(1) Analysis:
In insurance data, discrete and continuous variables typically coexist, each with its own characteristics. For example, the applicant's age on a policy is a continuous variable (30, 31, 32 years) with a natural range (an age of 200 is impossible), while vehicle type is a discrete variable ("flatbed", "van", "refrigerated truck"). Discrete variables in insurance data usually have 2-10 levels after grouping, each level has a substantial number of samples, and the levels differ clearly on the target variable. In addition, the insurance industry requires binary splitting to keep the model simple and understandable.
Among traditional open algorithms, ID3 cannot process continuous variables, while C4.5 and Cart cannot process both continuous and discrete variables under the constraint of binary splitting; none meets the requirements of insurance data.
The key to solving this problem is how to define the gains of discrete and continuous variables and how to perform binary splitting by those gains. With the discrete and continuous variable gain algorithms, whichever kind a variable is, only the corresponding gain algorithm needs to be called; a binary splitting mode is produced along with the gain, effectively meeting the modeling requirement of coexisting discrete and continuous variables in insurance data.
The continuous variable gain algorithm splits the data set into two parts according to a continuous variable cv so as to obtain the maximum gain on the target variable t; the gain is the amount by which variance decreases, i.e. after splitting, the variance of the data set is reduced.
The specific steps are as follows:
(1) set gain = 0 and leave the splitting point best_q empty;
(2) sort the data set in ascending order of the continuous variable to be split; suppose the n sorted data are:
x_1, x_2, …, x_n
(3) take in turn the 1%, 2%, …, 99% percentiles of the continuous variable cv; for each percentile q, split the data into two parts according to q: set1 = {x | x_cv ≤ q}, set2 = {x | x_cv > q}, then go to (4);
(4) if the number of samples in set1 or set2 is less than the parameter min_leaf_num, go to (3); otherwise go to (5);
(5) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively, Var(·) is the variance of the target variable over a set, and x_ti denotes the value of the target variable of sample x_t;
(6) if tmp_gain is greater than gain, set gain = tmp_gain and best_q = q, then go to (7);
(7) if the traversal of the percentiles is finished, return gain; otherwise go to (3);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_q, generating two child nodes whose data are the set1 and set2 of step (4); the target variable value of each child node is the mean of the target variable on that node;
The discrete variable gain algorithm splits the data set into two parts according to a discrete variable dv so as to obtain the maximum gain on the target variable t; the maximum gain is likewise the reduction in variance.
The specific steps are as follows:
(1) set gain = 0; the best left branch best_left and best right branch best_right are empty;
(2) obtain the set of all values of the discrete variable dv, namely d1, d2, …, dm;
(3) obtain all binary combinations of the discrete variable:
first, the left branch set is empty;
for the number of left-branch elements left_num from 1 to m−1:
find every left_num-element combination from the set {d1, d2, …, dm} and add it to the collection of left branch sets;
(4) take each left branch set left_set in turn; the right branch set right_set is the difference between the full set of values and left_set; for each pair of sets, go to (5);
(5) divide the data set into two parts according to whether the discrete variable dv takes its value in the left or the right branch set:
set1 = {x | x_dv ∈ left_set}, set2 = {x | x_dv ∈ right_set}
(6) if the number of samples in set1 or set2 is less than the leaf-node parameter min_leaf_num, go to (4); otherwise go to (7);
(7) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively;
(8) if tmp_gain is greater than gain, set gain = tmp_gain, best_left = set1 and best_right = set2;
(9) if the left branch sets have been fully traversed, return gain; otherwise go to (4);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_left and best_right, generating two child nodes whose data are the set1 and set2 of step (5); the target variable value of each child node is the mean of the target variable on that node;
(II) Breadth in-layer gain algorithm: solves the problem that each layer of the insurance data decision tree must use the same variable.
The users of insurance data modeling are insurance practitioners, so the model must be sufficiently simple while preserving accuracy. One index of simplicity is whether each layer of the decision tree uses the same variable: if it does, a 5-layer decision tree uses only 4 variables, making it concise, easy to understand, and convenient for the insurance industry.
Current open algorithms such as ID3 and Cart cannot make every layer use the same variable, so an innovation is needed to meet the modeling requirements of the insurance industry. The key lies in defining how good a variable is within a layer: the optimal variable of each layer is selected by this measure, and the layer is then split using it.
Suppose there are d nodes within a layer, node_1, node_2, …, node_d, carrying n_1, n_2, …, n_d samples respectively, and m alternative features f_1, f_2, …, f_m. The optimal feature is selected using the following formula:
f* = argmax_i Σ_{j=1..d} (n_j / (n_1 + … + n_d)) · gain_ij
where gain_ij represents the gain produced by feature i at node j;
The optimal feature is calculated according to the following steps:
set the maximum gain max_gain = 0 and leave the optimal feature max_feat empty;
(1) take each feature f_i in turn;
(2) initialize the current feature's weighted gain feature_gain = 0;
(3) calculate feature_gain: for each node_j in the node set of the layer, compute gain_ij with the gain algorithm and accumulate it into feature_gain; all nodes in the layer must be traversed to obtain the gain of the current feature f_i;
if feature_gain is greater than max_gain, tentatively take the optimal feature of the layer to be f_i;
after all features have been traversed, the optimal feature of the current layer is obtained;
(III) Feature selection mode: solves the problem of autonomously specifying the variables used in insurance data modeling.
Insurance data modeling differs from other modeling in that the interpretability of variables is very important; modeling results are interpreted through variables such as age and vehicle load, so it is sometimes necessary to manually specify the variables involved in modeling and their order. Traditional algorithms cannot fix the variable adopted by each layer; this method solves that problem.
A tree-building mode, automatic or manual, is set; if the mode is manual, the client may specify which variable each layer of the decision tree uses.
The decision tree algorithm generates a decision tree from certain independent variables and a target variable in the data. The input has two parts, the original data and the constraint parameters, and the constraint parameters comprise five items: 1) the maximum depth of the tree, max_depth; 2) the minimum number of samples in a leaf node, min_leaf_num; 3) the set of usable variables, x_list; 4) the variable type specification, discrete or continuous; 5) whether the tree-building mode is manual or automatic.
The fifth parameter enables manual specification of the modeling variables.
The overall framework of the algorithm is as follows:
Step one: set the root node as the first layer of the tree;
the data on the root node is all the data;
the target value on the root node is the mean of the claim variable over all data;
the alternative features are the original variable set x_list;
Step two: search for the optimal variable layer by layer and split the nodes within each layer;
traverse the layers as follows:
a) find all nodes whose depth meets the requirement;
b) find the optimal feature on the node set: if the tree-building mode is automatic, search the alternative features for the optimal feature using the independently developed in-layer gain algorithm; if manual, select the first remaining feature of x_list as the optimal feature;
c) using the optimal feature found in b) and the independently developed variable gain algorithm, attempt a binary split of each node in the layer;
d) delete the feature selected in b) from the alternative features.
From the framework above it can be seen that step two b) achieves manual specification of variables and variable order.

Claims (1)

1. An insurance data decision tree adopting a variable gain algorithm and a breadth in-layer gain algorithm, characterized by being built according to the following steps:
(I) Variable gain algorithm: solves the problem of handling discrete and continuous variables simultaneously, with binary splitting, in insurance data modeling.
(1) Analysis:
In insurance data, discrete and continuous variables typically coexist, each with its own characteristics. For example, the applicant's age on a policy is a continuous variable (30, 31, 32 years) with a natural range (an age of 200 is impossible), while vehicle type is a discrete variable ("flatbed", "van", "refrigerated truck"). Discrete variables in insurance data usually have 2-10 levels after grouping, each level has a substantial number of samples, and the levels differ clearly on the target variable. In addition, the insurance industry requires binary splitting to keep the model simple and understandable.
Among traditional open algorithms, ID3 cannot process continuous variables, while C4.5 and Cart cannot process both continuous and discrete variables under the constraint of binary splitting; none meets the requirements of insurance data.
The key to solving this problem is how to define the gains of discrete and continuous variables and how to perform binary splitting by those gains. With the discrete and continuous variable gain algorithms, whichever kind a variable is, only the corresponding gain algorithm needs to be called; a binary splitting mode is produced along with the gain, effectively meeting the modeling requirement of coexisting discrete and continuous variables in insurance data.
The continuous variable gain algorithm splits the data set into two parts according to a continuous variable cv so as to obtain the maximum gain on the target variable t; the gain is the amount by which variance decreases, i.e. after splitting, the variance of the data set is reduced.
The specific steps are as follows:
(1) set gain = 0 and leave the splitting point best_q empty;
(2) sort the data set in ascending order of the continuous variable to be split; suppose the n sorted data are:
x_1, x_2, …, x_n
(3) take in turn the 1%, 2%, …, 99% percentiles of the continuous variable cv; for each percentile q, split the data into two parts according to q: set1 = {x | x_cv ≤ q}, set2 = {x | x_cv > q}, then go to (4);
(4) if the number of samples in set1 or set2 is less than the parameter min_leaf_num, go to (3); otherwise go to (5);
(5) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively, Var(·) is the variance of the target variable over a set, and x_ti denotes the value of the target variable of sample x_t;
(6) if tmp_gain is greater than gain, set gain = tmp_gain and best_q = q, then go to (7);
(7) if the traversal of the percentiles is finished, return gain; otherwise go to (3);
After the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_q, generating two child nodes whose data are the set1 and set2 of step (4); the target variable value of each child node is the mean of the target variable on that node;
The discrete variable gain algorithm splits the data set into two parts according to a discrete variable dv so as to obtain the maximum gain on the target variable t; the maximum gain is likewise the reduction in variance.
The specific steps are as follows:
(1) set gain = 0; the best left branch best_left and best right branch best_right are empty;
(2) obtain the set of all values of the discrete variable dv, namely d1, d2, …, dm;
(3) obtain all binary combinations of the discrete variable:
first, the left branch set is empty;
for the number of left-branch elements left_num from 1 to m−1:
find every left_num-element combination from the set {d1, d2, …, dm} and add it to the collection of left branch sets;
(4) take each left branch set left_set in turn; the right branch set right_set is the difference between the full set of values and left_set; for each pair of sets, go to (5);
(5) divide the data set into two parts according to whether the discrete variable dv takes its value in the left or the right branch set:
set1 = {x | x_dv ∈ left_set}, set2 = {x | x_dv ∈ right_set}
(6) if the number of samples in set1 or set2 is less than the leaf-node parameter min_leaf_num, go to (4); otherwise go to (7);
(7) calculate the gain tmp_gain according to the following formula:
tmp_gain = Var(set1 ∪ set2) − (k1·Var(set1) + k2·Var(set2)) / (k1 + k2)
where k1 and k2 are the numbers of elements in set1 and set2 respectively;
(8) if tmp_gain is greater than gain, then gain = tmp_gain, best_left = set1 and best_right = set2;
(9) if all left branch sets have been traversed, return gain; otherwise go to (4);
after the maximum gain of the node is obtained, the data on the node is divided into two parts according to best_left and best_right, generating two child nodes; the data on the two child nodes are set1 and set2 of step (5), and the target value of each child node is the mean of the target variable over the samples on that node;
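The discrete-variable steps (2)-(9) can be sketched as follows; enumerating the binary partitions with `itertools.combinations` and the sample-weighted variance gain are assumptions of this illustration:

```python
from itertools import combinations
from statistics import pvariance

def best_discrete_split(dv, t, min_leaf_num=1):
    """Enumerate binary partitions of the value set of discrete variable
    dv and return (gain, best_left, best_right) on target t."""
    values = sorted(set(dv))                   # step (2): d1..dm
    m = len(values)
    n = len(t)
    base_var = pvariance(t)                    # variance before the split
    gain, best_left, best_right = 0.0, None, None
    # step (3): the left branch takes left_num values, left_num = 1..m-1
    for left_num in range(1, m):
        for combo in combinations(values, left_num):
            left_set = set(combo)
            right_set = set(values) - left_set           # step (4)
            set1 = [ti for v, ti in zip(dv, t) if v in left_set]   # step (5)
            set2 = [ti for v, ti in zip(dv, t) if v in right_set]
            k1, k2 = len(set1), len(set2)
            if k1 < min_leaf_num or k2 < min_leaf_num:   # step (6)
                continue
            # step (7): variance reduction
            tmp_gain = base_var - (k1 * pvariance(set1) + k2 * pvariance(set2)) / n
            if tmp_gain > gain:                          # step (8)
                gain, best_left, best_right = tmp_gain, left_set, right_set
    return gain, best_left, best_right
```

Note the enumeration is exponential in the number of distinct values m, so it is only practical for low-cardinality variables.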
(II) the breadth in-layer gain algorithm: solves the problem of making each layer of the insurance data decision tree use the same variable;
the users of insurance data modeling are insurance practitioners, so an insurance data model must be simple while its accuracy is still ensured; one sufficiently concise criterion is whether each layer of the decision tree uses the same variable: if it does, a decision tree of 5 layers uses only 4 variables, which keeps the tree concise, easy to understand and convenient to use in the insurance industry;
current open algorithms such as ID3 and CART cannot make each layer use the same variable, so this problem needs an innovative solution to fit the modeling requirements of the insurance industry; the key lies in how to measure the quality of a variable within a layer: the quality of each variable in the layer is defined, the optimal variable is selected by it, and every node of the layer is then split with that optimal variable;
suppose there are d nodes within a layer, node_1, node_2, …, node_d, holding n_1, n_2, …, n_d samples respectively, and there are m alternative features f_1, f_2, …, f_m; the optimal feature is selected using the following formula:
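The selection formula is likewise missing from the extracted text; given the node sample counts n_1, …, n_d and the "weighted gain" wording of the steps below, a plausible reconstruction (an assumption) is:

```latex
f^{*} \;=\; \arg\max_{f_i}\; \sum_{j=1}^{d}\frac{n_j}{\sum_{l=1}^{d} n_l}\; gain_{ij}
```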
wherein gain_ij represents the gain produced by feature f_i at node_j;
the optimal feature is calculated according to the following steps:
set the maximum gain max_gain = 0, and let the optimal feature max_feat be empty;
(1) select each feature f_i in turn;
(2) initialize the current feature's weighted gain feature_gain = 0;
(3) calculate feature_gain:
for each node_j in the node set of the layer, calculate gain_ij with the variable gain algorithm and update feature_gain;
the gain of the current feature f_i is obtained only after all nodes in the layer have been traversed;
if feature_gain is greater than max_gain, tentatively take the optimal feature of the layer to be f_i;
after all features have been traversed once, the optimal feature of the layer is obtained;
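A compact sketch of this per-layer selection; the node representation and the callable standing in for the variable gain algorithm are illustrative, not from the patent:

```python
def best_layer_feature(nodes, features, gain_fn):
    """Pick the single feature whose sample-weighted gain, summed over
    all nodes of the layer, is largest.  nodes: list of (node, n_j)
    pairs; gain_fn(feature, node) stands in for the variable gain
    algorithm and returns gain_ij."""
    total = sum(n_j for _, n_j in nodes)
    max_gain, max_feat = 0.0, None
    for f in features:                   # step (1): each feature in turn
        feature_gain = 0.0               # step (2)
        for node, n_j in nodes:          # step (3): traverse the layer
            feature_gain += (n_j / total) * gain_fn(f, node)
        if feature_gain > max_gain:      # tentative optimum
            max_gain, max_feat = feature_gain, f
    return max_feat, max_gain
```

Because the sum runs over every node of the layer, a feature that splits only one node well cannot dominate one that helps the whole layer.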
(III) the feature selection mode: solves the problem of allowing designated variables to be selected in insurance data modeling;
insurance data modeling differs from other modeling in that the interpretability of variables is very important; modeling results are interpreted through variables such as age and vehicle load, so it is sometimes necessary to manually specify the variables involved in modeling and their order; traditional algorithms cannot support specifying the variable adopted by each layer, and this method solves that problem;
the tree-building mode is set to automatic or manual; if manual, the client is allowed to specify which variable each layer of the decision tree uses;
the decision tree algorithm generates a decision tree from some independent variables and a target variable in the data; its input has two parts, the original data and the constraint parameters, of which there are five: 1) the maximum depth of the tree max_depth; 2) the minimum leaf node sample number min_leaf_num; 3) the set of usable variables x_list; 4) the variable type specification, discrete or continuous; 5) whether the tree-building mode is manual or automatic;
the fifth parameter enables manual specification of the modeling variables;
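The five constraint parameters could be grouped, for illustration, as follows; only max_depth, min_leaf_num and x_list appear verbatim in the text, the other field names are assumptions:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TreeParams:
    max_depth: int             # 1) maximum depth of the tree
    min_leaf_num: int          # 2) minimum number of samples in a leaf node
    x_list: List[str]          # 3) usable variables, in order (used by manual mode)
    var_types: Dict[str, str]  # 4) variable name -> "discrete" or "continuous"
    mode: str = "auto"         # 5) tree-building mode: "auto" or "manual"
```

In manual mode the order of x_list doubles as the per-layer variable assignment.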
we first look at the overall framework of the algorithm, which is as follows:
firstly, setting a root node as a first layer of a tree;
the corresponding data on the root node is all data;
the target value on the root node is the mean of the claim variable over all the data;
the alternative characteristic is an original variable set x _ list;
secondly, search for the optimal variable layer by layer and split the nodes within each layer;
traverse the layers as follows:
a) Find all nodes whose depth matches the current layer;
b) Find the optimal feature on this set of nodes: if the tree-building mode is automatic, search the alternative features for the optimal one with the independently developed "in-layer gain algorithm"; if manual, select the first feature of x_list as the optimal feature;
c) Using the optimal feature found in b), try to binary-split each node in the layer with the independently developed variable gain algorithm;
d) Delete the feature selected in b) from the alternative features;
from the above framework it can be seen that step b) of the second stage realizes the manual specification of variables and of the variable order.
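The framework of the second step, including the manual/automatic choice in b) and the feature deletion in d), can be sketched as follows; the dict-based node representation and the two helper callables are assumptions of this sketch:

```python
def build_tree(data, params, find_best_layer_feature, split_node):
    """Layer-by-layer growth: one feature per layer, chosen
    automatically or taken from x_list order in manual mode.
    find_best_layer_feature stands in for the in-layer gain algorithm,
    split_node for the variable gain binary split."""
    root = {"depth": 1, "data": data, "children": []}
    candidates = list(params["x_list"])          # alternative features
    layer = [root]
    for _ in range(1, params["max_depth"]):      # grow until max_depth
        if not layer or not candidates:
            break
        # b) one feature is chosen for the whole layer
        if params["mode"] == "auto":
            feat = find_best_layer_feature(layer, candidates)
        else:                                    # manual: keep the given order
            feat = candidates[0]
        # c) try to binary-split every node of the layer on that feature
        next_layer = []
        for node in layer:
            next_layer.extend(split_node(node, feat, params["min_leaf_num"]))
        # d) a feature is used by at most one layer
        candidates.remove(feat)
        layer = next_layer
    return root
```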
CN201711422828.6A 2017-12-25 2017-12-25 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm Active CN108009931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711422828.6A CN108009931B (en) 2017-12-25 2017-12-25 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm

Publications (2)

Publication Number Publication Date
CN108009931A true CN108009931A (en) 2018-05-08
CN108009931B CN108009931B (en) 2021-08-06

Family

ID=62061127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711422828.6A Active CN108009931B (en) 2017-12-25 2017-12-25 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm

Country Status (1)

Country Link
CN (1) CN108009931B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828306A (en) * 2005-03-01 2006-09-06 中国海洋大学 Method for realizing formulated product sensing index prediction based on M5' model tree
CN102331992A (en) * 2010-06-09 2012-01-25 微软公司 Distributed decision tree training
US20160162793A1 (en) * 2014-12-05 2016-06-09 Alibaba Group Holding Limited Method and apparatus for decision tree based search result ranking
US20170061318A1 (en) * 2015-08-24 2017-03-02 International Business Machines Corporation Scalable streaming decision tree learning
CN106096748A (en) * 2016-04-28 2016-11-09 武汉宝钢华中贸易有限公司 Entrucking forecast model in man-hour based on cluster analysis and decision Tree algorithms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUOMING JI: "Segment based decision tree induction with continuous valued attributes", IEEE Transactions on Cybernetics *
HUANG QI: "Personal Income Analysis Based on CHAID Decision Trees", Mathematical Theory and Applications *

Also Published As

Publication number Publication date
CN108009931B (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN110245802B (en) Cigarette empty-head rate prediction method and system based on improved gradient lifting decision tree
Peukert et al. AMC-A framework for modelling and comparing matching systems as matching processes
WO2016101628A1 (en) Data processing method and device in data modeling
CN115392592B (en) Storage product parameter configuration recommendation method, device, equipment and medium
Bolancé et al. Inverse Beta transformation in kernel density estimation
Belov et al. Nevanlinna domains with large boundaries
CN108009931B (en) Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm
EP4250301A1 (en) Method for estimating a variable of interest associated to a given disease as a function of a plurality of different omics data, corresponding device, and computer program product
Yu et al. A partially linear tree-based regression model for multivariate outcomes
JPH0816400A (en) Case inference support equipment
CN115952426A (en) Distributed noise data clustering method based on random sampling and user classification method
CN115828116A (en) Forest resource asset homogeneous region partitioning method based on two-step distance and dynamic constraint
CN111967616B (en) Automatic time series regression method and device
CN109284393B (en) Fusion method for family tree character attribute names
CN104598591B (en) A kind of model element matching process for type attribute graph model
CN109993193B (en) Method and device for identifying key points of three-dimensional curve
CN109614456B (en) Deep learning-based geographic information positioning and partitioning method and device
CN116756431B (en) Information or article recommendation method based on approximate concepts under incomplete form background
Pietrus NON DIFFERENTIABLE PERTURBED NEWTON'S METHOD FOR FUNCTIONS WITH VALUES IN A CONE
CN110689158B (en) Method, device and storage medium for predicting destination
Riasanovsky Two problems in extremal combinatorics
CN110083930A (en) The construction method and device of shale weathering index
CN110990463B (en) Method and device for mining frequent symmetric pattern of time series
CN117497056B (en) Non-contrast HRD detection method, system and device
Jung et al. Using J48 and REPTree to Predict Risk Factors in Medicine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20181228

Address after: Room 2201, Liangyou Building, 618 Shangcheng Road, Pudong New District, Shanghai

Applicant after: Shanghai Qihuang Information Technology Co., Ltd.

Address before: 311201 11th Floor, Block B1, Hangzhou Bay Information Port, Xiaoshan Economic Development Zone, Hangzhou City, Zhejiang Province

Applicant before: Hangzhou seven Kyung Mdt InfoTech Ltd

GR01 Patent grant
GR01 Patent grant