CN108959187B - Variable box separation method and device, terminal equipment and storage medium

Info

Publication number: CN108959187B
Application number: CN201810309822.6A
Authority: CN
Inventors: 黄严汉; 曾凡刚
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-04-09
Filing date: 2018-04-09
Publication date: 2023-09-05
Anticipated expiration: 2038-04-09
Also published as: CN108959187A

Abstract

The invention relates to the technical field of computers and provides a variable binning method, a device, terminal equipment and a storage medium. The variable binning method comprises the following steps: acquiring sample data; determining, according to a preset variable configuration, a nominal variable to be binned and the feature values corresponding to the nominal variable from the sample data; storing the feature values in a preset feature value set; for each feature value in the feature value set, dividing the nominal variable into two bins with the feature value as a test split point, and calculating the association index value corresponding to the feature value; taking the feature value corresponding to the maximum association index value as the target split point to perform the binning operation, and removing that feature value from the feature value set; and stopping the binning if the binning result reaches a preset bin number threshold, and otherwise continuing to perform the binning operation. In this technical scheme, the nominal variable is binned automatically on the basis of association index values, which reduces manual intervention and time consumption and improves binning efficiency.

Description

Variable box separation method and device, terminal equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a variable box-dividing method, a variable box-dividing device, a terminal device, and a storage medium.

Background

At present, the common binning methods are equal-width binning and equal-frequency binning. Equal-width binning divides the value range of a feature into a intervals of equal width, each interval being regarded as one bin. Equal-frequency binning arranges the feature values in ascending order and divides them into a parts containing equal numbers of feature values, each part being regarded as one bin. However, both methods require the number of bins to be set manually in advance: if the number of bins is too small, more information is lost, and if it is too large, the purpose of binning cannot be achieved.

If bins are merged manually after equal-frequency or equal-width binning, an improvement in the predictive power of the feature cannot be guaranteed, because manual merging relies on subjective experience; it is also time-consuming and inefficient.

When the volume of sample data is small, the distribution of the feature values can be analyzed manually, and the feature values can be segmented and binned by hand according to that distribution. However, on the one hand, this approach depends on subjective experience, the resulting division may not truly reflect the characteristics of the sample variable, and an improvement in model prediction capability cannot be guaranteed; on the other hand, when the volume of sample data is huge, the manual approach brings an enormous workload and binning efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a variable binning method, a variable binning device, terminal equipment and a storage medium, which are used for solving the problems of inaccurate binning results and low binning efficiency of equal-frequency binning or equal-width binning in the prior art.

In a first aspect, an embodiment of the present invention provides a variable binning method, including:

acquiring sample data;

according to preset variable configuration, determining a nominal variable to be divided into boxes and m characteristic values corresponding to the nominal variable from the sample data, wherein m is a positive integer greater than 1;

storing the m feature values in a preset feature value set, setting the initial value of the number of binning rounds k to 0, and setting the binning result of the 0th round to empty, wherein k ∈ [0, m-1];

for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value, to obtain m-k association index values;

taking the feature value corresponding to the maximum of the m-k association index values as the target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, taking these k+2 bins as the binning result of the (k+1)-th round, and removing the feature value from the feature value set;

and if k+2 reaches a preset bin number threshold, stopping the binning and determining the binning result of the (k+1)-th round as the final binning result; otherwise, adding 1 to k and returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value to obtain m-k association index values.

In a second aspect, an embodiment of the present invention provides a variable box-sorting device, including:

the acquisition module is used for acquiring sample data;

the determining module is used for determining a nominal variable to be divided into boxes and m characteristic values corresponding to the nominal variable from the sample data according to preset variable configuration, wherein m is a positive integer greater than 1;

the storage module is used for storing the m feature values in a preset feature value set, setting the initial value of the number of binning rounds k to 0, and setting the binning result of the 0th round to empty, wherein k ∈ [0, m-1];

the calculation module is used for, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value, to obtain m-k association index values;

the removing module is used for taking the feature value corresponding to the maximum of the m-k association index values as the target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, taking these k+2 bins as the binning result of the (k+1)-th round, and removing the feature value from the feature value set;

and the loop module is used for stopping the binning if k+2 reaches a preset bin number threshold and determining the binning result of the (k+1)-th round as the final binning result; otherwise, adding 1 to k and returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value to obtain m-k association index values.

In a third aspect, an embodiment of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the variable binning method when executing the computer program.

In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the variable binning method.

According to the variable binning method, device, terminal equipment and storage medium provided by the embodiment of the invention, sample data are acquired from a preset database, and a nominal variable to be binned and the feature values corresponding to the nominal variable are determined from the sample data according to a preset variable configuration. The feature values are stored in a preset feature value set, each feature value in the set is used as a test split point to divide the nominal variable into two bins, and the association index value corresponding to each feature value is calculated. The feature value corresponding to the maximum association index value is selected as the target split point to perform the binning operation. If the binning result reaches a preset bin number threshold, the binning operation is stopped; otherwise, the binning operation continues. The nominal variable is thus binned automatically on the basis of association index values, so that the original sample data information is preserved to the greatest extent, feature extraction is performed rapidly and accurately, manual intervention and time consumption are reduced, binning efficiency is improved, and a model can be constructed rapidly from the encoded features.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart showing the implementation of the variable binning method provided in embodiment 1 of the present invention;

FIG. 2 is an exemplary diagram of a specific binning process in the variable binning method provided in embodiment 1 of the present invention;

FIG. 3 is a flowchart showing the implementation of step S2 in the variable binning method provided in embodiment 1 of the present invention;

FIG. 4 is a schematic view of the variable binning device provided in embodiment 2 of the present invention;

FIG. 5 is a schematic diagram of a terminal device provided in embodiment 4 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Embodiment 1

Referring to FIG. 1, FIG. 1 shows the implementation flow of the variable binning method provided in this embodiment. The variable binning method is applied to a feature coding process based on the Spark platform and is used for automatically binning sample data; it can extract features quickly and accurately while preserving the original sample data information to the greatest extent, and thereby enables rapid modeling. The details are as follows:

S1: sample data is acquired.

In the embodiment of the invention, sample data is collected from a preset database, and the sample data is mainly insurance service data.

S2: according to preset variable configuration, determining a nominal variable to be divided into boxes and m characteristic values corresponding to the nominal variable from sample data, wherein m is a positive integer greater than 1.

In the embodiment of the invention, the preset variable configuration is used for configuring the variables needing to be binned, and the variable configuration can be flexibly set by a user according to modeling requirements or application requirements.

Variables include continuous variables and nominal variables. A continuous variable is a variable whose feature values can take any value within a certain interval: the feature values are continuous, the interval between any two feature values can be subdivided indefinitely, and the variable has a unit and can be ordered, such as distance. A nominal variable is a variable whose values can be listed one by one but have neither unit nor order, such as gender.

If the variable configuration specifies a continuous variable, the continuous variable is discretized, that is, converted into a nominal variable, and the resulting nominal variable and its m feature values are extracted; if the variable configuration specifies a nominal variable, the m feature values corresponding to the nominal variable to be binned are determined directly from the sample data.

Binning divides the data and distributes them into different bins according to preset conditions. For example, given the data 5, 8, 10, 12, 15 and 20 and the preset conditions "less than or equal to 10" and "greater than 10", the data are divided into two bins: one bin stores the data less than or equal to 10, and the other stores the data greater than 10.
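The two-bin example above can be sketched in a few lines of Python. This is only an illustration of the preset condition "less than or equal to 10 / greater than 10"; the function name split_into_two_bins is hypothetical and does not come from the embodiment.

```python
def split_into_two_bins(values, threshold):
    """Place values <= threshold in one bin and values > threshold in the other."""
    low_bin = [v for v in values if v <= threshold]
    high_bin = [v for v in values if v > threshold]
    return low_bin, high_bin


low_bin, high_bin = split_into_two_bins([5, 8, 10, 12, 15, 20], threshold=10)
print(low_bin)   # [5, 8, 10]
print(high_bin)  # [12, 15, 20]
```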

For example, the obtained nominal variable is a housing condition, and the characteristic value of the nominal variable may specifically be: villas, apartments and ordinary residences.

S3: storing the m feature values in a preset feature value set, setting the initial value of the number of binning rounds k to 0, and setting the binning result of the 0th round to empty, wherein k ∈ [0, m-1].

In the embodiment of the invention, the m feature values obtained in step S2 are stored in a preset feature value set, and the number of binning rounds k is initialized to 0. By default, when k equals 0, the binning result of the 0th round is empty. The value range of k is greater than or equal to 0 and less than or equal to m-1.

It should be noted that the preset feature value set is used for storing the feature values of the nominal variable, in preparation for the subsequent binning operations based on those feature values.

For example, assume the three feature values of the nominal variable are 1560, 2240 and 3200. All three are stored in the preset feature value set, and because no binning operation has yet been performed, that is, the number of binning rounds is 0, the binning result of the 0th round is empty.

S4: aiming at each characteristic value in the characteristic value set, taking the characteristic value as a test splitting point, dividing a nominal variable into k+2 boxes on the basis of a box dividing result of a k-th round of box dividing, and calculating an association index value corresponding to the characteristic value to obtain m-k association index values.

In the embodiment of the invention, the feature values in the feature value set are used as test split points, and the nominal variable is binned at these test split points. The m feature values give m test split points, and one trial binning operation is performed for each test split point, so that m trial binning operations are performed in the first round.

Specifically, k = 0 denotes the 0th round, in which no binning has been performed. In the 1st round (k = 0), the nominal variable to be binned is divided by 1 split point into 2 bins, that is, k+2 bins, starting from the unbinned state of the 0th round. In the 2nd round (k = 1), starting from the 2 bins of the 1st round, the bin that contains the split point is divided by that split point into 2 bins, so that the nominal variable is divided into 3 bins in total, that is, k+2 bins. By analogy, the binning result of the (k+1)-th round is obtained by dividing the nominal variable to be binned into k+2 bins on the basis of the binning result of the k-th round.

In each round of binning, the association index value corresponding to each test split point is calculated. The number of association index values equals the number of feature values currently in the feature value set, that is, the difference between m and the number of binning rounds k.

In each round of binning, the association index value may be an information value (IV), a Gini variance index value, or a Pearson chi-square statistic. The IV value is a coefficient that measures the predictive power of an independent variable; the Gini variance index value is the proportion by which the impurity of the sample set is reduced when the set is divided by a specific attribute; and the Pearson chi-square statistic measures the correlation between two nominal variables.
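The embodiment does not give a formula for the IV value. As a hedged illustration, the sketch below uses the definition commonly used in credit scoring for a binary target, IV = Σ (g_i/G − b_i/B) · ln((g_i/G)/(b_i/B)), where g_i and b_i are the counts of the two target classes in bin i and G and B are their totals. The function name information_value and the (good, bad) input layout are assumptions made for this example.

```python
import math


def information_value(bins):
    """Compute IV for a binary target.

    bins: list of (good_count, bad_count) pairs, one pair per bin.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        # Assumption: every bin holds at least one sample of each class;
        # otherwise the logarithm is undefined and smoothing would be needed.
        good_share = good / total_good
        bad_share = bad / total_bad
        iv += (good_share - bad_share) * math.log(good_share / bad_share)
    return iv


# Two bins with identical class mix carry no information about the target.
print(information_value([(50, 50), (50, 50)]))  # 0.0
```

A larger IV for a trial split indicates that the resulting bins separate the two target classes more strongly, which is why the maximum value is selected in step S5.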

S5: taking the feature value corresponding to the maximum of the m-k association index values as the target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, taking these k+2 bins as the binning result of the (k+1)-th round, and removing the feature value from the feature value set.

In the embodiment of the invention, from the m-k association index values calculated in step S4, the feature value corresponding to the largest association index value is selected as the target split point, and the binning operation is performed at that point. Each round builds on the result of the previous round: the bin that contains the target split point is divided into 2 bins at the target split point, so the result of the (k+1)-th round has one more bin than the result of the k-th round, that is, k+2 bins. After the binning operation is performed, the feature value used as the target split point is removed from the feature value set.

S6: if k+2 reaches a preset bin number threshold, stopping the binning and determining the binning result of the (k+1)-th round as the final binning result; otherwise, adding 1 to k and returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value to obtain m-k association index values.

Specifically, according to step S5, the binning result of the (k+1)-th round consists of k+2 bins. If k+2 reaches the preset bin number threshold, no further binning operation is performed, and the k+2 bins of the (k+1)-th round are the final binning result. If k+2 has not reached the preset bin number threshold, k is increased by 1 and the flow returns to step S4 to continue the binning operation.
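Steps S3 to S6 amount to a greedy loop, which the following minimal Python sketch illustrates. It is not the embodiment's implementation: the function greedy_binning, the score callable (standing in for whichever association index is used) and the toy weights are all hypothetical, and the sketch represents a binning result simply as the sorted list of split points chosen so far.

```python
from bisect import insort


def greedy_binning(feature_values, score, max_bins):
    """Choose one split point per round until max_bins bins are reached.

    feature_values: candidate split points (the feature value set).
    score: hypothetical callable mapping a sorted list of split points
           to an association index value.
    max_bins: the preset bin number threshold.
    """
    candidates = set(feature_values)  # S3: the feature value set
    chosen = []                       # the result of round 0 is empty
    k = 0
    while candidates:
        # S4: score every remaining value as a test split point.
        scored = []
        for value in sorted(candidates):
            trial = sorted(chosen + [value])
            scored.append((score(trial), value))
        # S5: keep the value with the largest index value and remove it.
        _, best_value = max(scored, key=lambda item: item[0])
        insort(chosen, best_value)
        candidates.remove(best_value)
        # S6: round k+1 yields k+2 bins; stop at the threshold.
        if k + 2 >= max_bins:
            break
        k += 1
    return chosen


# Toy index for demonstration only: each split point contributes a fixed weight.
weights = {15: 1, 21: 2, 35: 5, 45: 3}


def toy_score(split_points):
    return sum(weights[p] for p in split_points)


result = greedy_binning([15, 21, 35, 45], toy_score, max_bins=3)
print(result)  # [35, 45]
```

In each round the loop evaluates m-k candidates, matching the count of association index values stated in step S4.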

Further, during the binning operation, the association index value can also serve as the condition for stopping: when the improvement rate of the association index value is smaller than a preset improvement rate threshold, the binning operation is stopped; otherwise, k is increased by 1 and the flow returns to step S4 to continue the binning operation.

The improvement rate of the association index value may be specifically calculated according to the formula (1), which is described in detail as follows:

$$v = \frac{X_p - X_{p-1}}{X_p} \qquad \text{Formula (1)}$$

where $v$ is the improvement rate of the association index value, $X_p$ is the association index value corresponding to the target split point determined in the $p$-th round of binning, and $p \in [1, m]$.
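Formula (1) and the stopping rule based on it can be sketched as follows. The function names and the threshold value 0.05 are hypothetical, chosen only to show the comparison against a preset improvement rate threshold; the sketch also assumes that an index value from the previous round is available.

```python
def improvement_rate(current, previous):
    """Formula (1): v = (X_p - X_{p-1}) / X_p."""
    return (current - previous) / current


def should_stop(current, previous, min_rate):
    """Stop binning when the improvement rate falls below min_rate."""
    return improvement_rate(current, previous) < min_rate


# Hypothetical index values for two consecutive rounds.
print(should_stop(current=35, previous=25, min_rate=0.05))    # False
print(should_stop(current=35.5, previous=35, min_rate=0.05))  # True
```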

It should be noted that, when massive sample data is involved, the binning process of the embodiment of the invention can run on the Spark distributed computing framework. Spark distributed parallel computing improves computing efficiency, and binning efficiency is improved effectively when a large data volume and many nominal variables need to be binned.

For a better understanding of the embodiments of the present invention, a specific example is described below:

For example, taking age as the nominal variable, the boundary values of the age ranges [10, 15), [15, 21), [21, 24), [24, 29), [29, 35), [35, 38), [38, 41), [41, 45), [45, 55) and [55, 80], namely 15, 21, 24, 29, 35, 38, 41, 45 and 55, 9 in total, are taken as feature values to construct the feature value set.

The 9 feature values in the set are taken as test split points, the IV value is used as the association index value, and the preset bin number threshold is 6, so the binning operation stops when the binning result reaches 6 bins.

First, a trial binning operation is performed at each test split point and the corresponding IV value is calculated. With test split point 15, values less than 15 form one bin and values greater than or equal to 15 form another, giving the two bins [10, 15) and [15, 80], and the IV value corresponding to test split point 15 is calculated to be 20. With test split point 21, values less than 21 form one bin and values greater than or equal to 21 form another. The other 7 test split points are traversed in the same way, giving the test split points and corresponding IV values shown in Table 1:

Table 1

Test split point	IV value
15	20
21	21
24	22
29	23
35	25
38	21.1
41	22.2
45	23.1
55	22.2

At this point, the IV value corresponding to test split point 35 is the highest, so the feature value 35 is taken as the target split point to perform the binning operation. This gives the two bins [10, 35) and [35, 80] as the binning result of the first round, and the feature value 35 is removed from the feature value set.

After the first round of binning, the current bin number is 2, which has not reached the preset bin number threshold, so the second round of binning is performed with the remaining 8 test split points in the feature value set: 15, 21, 24, 29, 38, 41, 45 and 55.

In the second round, [10, 35) is split with 15, 21, 24 and 29 respectively, and [35, 80] is split with 38, 41, 45 and 55 respectively. With test split point 15, the result is the 3 bins [10, 15), [15, 35) and [35, 80], and the corresponding IV value is calculated to be 26; with test split point 21, the result is the 3 bins [10, 21), [21, 35) and [35, 80], and the corresponding IV value is calculated to be 27. The other 6 test split points are traversed in turn, giving the test split points and corresponding IV values shown in Table 2:

Table 2

Test split point	IV value
15	26
21	27
24	28
29	29
38	30
41	31
45	35
55	30.5

At this point, the IV value corresponding to test split point 45 is the highest, so the feature value 45 is taken as the target split point to perform the binning operation. This gives the binning result of the second round, namely the 3 bins [10, 35), [35, 45) and [45, 80], and the feature value 45 is removed from the feature value set.
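The way the chosen target split points determine the bins in this age example can be sketched as below. The helper name bins_from_split_points is hypothetical; each returned pair (start, end) stands for a left-closed interval, with only the last interval also including its upper end, as in the ranges above.

```python
def bins_from_split_points(lower, upper, split_points):
    """Return (start, end) pairs bounded by the sorted split points."""
    edges = [lower] + sorted(split_points) + [upper]
    return list(zip(edges[:-1], edges[1:]))


# After round 1 (target split point 35) and round 2 (target split point 45).
print(bins_from_split_points(10, 80, [35]))      # [(10, 35), (35, 80)]
print(bins_from_split_points(10, 80, [45, 35]))  # [(10, 35), (35, 45), (45, 80)]
```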

After the second round of binning, the current bin number is 3, which has not reached the preset bin number threshold, so the third round of binning continues with the remaining 7 test split points in the feature value set: 15, 21, 24, 29, 38, 41 and 55. In the same way, the test split points are traversed in turn, the IV value corresponding to each is calculated, and the test split point with the maximum IV value is taken as the target split point for the third round. Binning continues in this manner until the binning result reaches 6 bins, at which point the binning operation stops.

FIG. 2 shows a specific binning process. As shown in FIG. 2, suppose the data to be binned is ABCDEFGHI and the preset bin number threshold is 5. If E is the split point in the 1st round, the binning result of the 1st round is ABCD and EFGHI; since the bin number 2 is smaller than the preset bin number threshold 5, a 2nd round of binning is needed. If F is the split point in the 2nd round, the data is divided into the 3 bins ABCD, E and FGHI on the basis of the 1st-round result; since the bin number 3 is still smaller than the preset bin number threshold 5, binning continues in the same way. After the 4th round, the bin number equals the preset bin number threshold 5, so the binning operation is stopped and the binning result of the 4th round is determined as the final binning result.

In the embodiment corresponding to FIG. 1, sample data are acquired from a preset database, and the nominal variable and its m feature values are determined according to the preset variable configuration. The m feature values are stored in a preset feature value set, and the initial value of the number of binning rounds k is set to 0. Each feature value in the feature value set is used as a test split point: in the (k+1)-th round, the nominal variable is divided into k+2 bins on the basis of the binning result of the k-th round, and the association index value corresponding to each feature value is calculated, giving m-k association index values. The feature value corresponding to the maximum of these m-k values is selected as the target split point, added to a preset split point set, and removed from the feature value set. The nominal variable is then divided into k+2 bins by the k+1 split points in the split point set, and these k+2 bins are taken as the binning result of the (k+1)-th round. If k+2 reaches the preset bin number threshold, binning stops; otherwise, k is increased by 1 and the next round is performed. Binning the nominal variable automatically on the basis of association index values in this way reduces manual intervention and time consumption and improves binning efficiency.

Next, on the basis of the corresponding embodiment of fig. 1, before determining, from the sample data, the nominal variable to be binned and m feature values corresponding to the nominal variable according to the preset variable configuration mentioned in step S2, further parameters of the binning configuration need to be obtained, where the variable binning method further includes:

obtaining a box division configuration parameter from a preset configuration file, wherein the box division configuration parameter comprises a box number threshold.

Specifically, the preset configuration file includes definitions of the box-dividing configuration parameters required to be used in the box-dividing process. The preset configuration file may be an extensible markup language (Extensible Markup Language, XML) configuration file in particular.

In the preset configuration file, a preset binning tag marks the definition area of the binning configuration parameters, and the binning configuration parameters are defined within the area marked by that tag.

For example, using cross as a binning tag, the bin count threshold max_bins is configured in an XML configuration file as follows:

<cross>

</cross>
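Reading such a parameter from an XML file could look like the sketch below, which uses Python's standard xml.etree.ElementTree module. The layout is purely an assumption for illustration: it supposes that max_bins is a child element of the cross tag inside a hypothetical config root element, and the value 6 is borrowed from the age example; the actual structure of the configuration file may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: max_bins as a child element of the cross tag.
CONFIG_TEXT = """
<config>
  <cross>
    <max_bins>6</max_bins>
  </cross>
</config>
"""


def read_max_bins(xml_text):
    """Return the bin number threshold defined inside the cross tag."""
    root = ET.fromstring(xml_text)
    node = root.find("./cross/max_bins")
    if node is None or node.text is None:
        raise ValueError("max_bins is not configured")
    return int(node.text)


max_bins = read_max_bins(CONFIG_TEXT)
print(max_bins)  # 6
```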

According to the embodiment of the invention, the binning configuration parameters can be adjusted flexibly through the preset configuration file; for example, the bin number threshold can be adjusted in real time according to actual application requirements. This makes parameter adjustment convenient, improves the flexibility of the configuration parameters across different application scenarios, and improves the practicability of binning operations driven by the binning configuration parameters.

Based on the corresponding embodiment of fig. 1, a detailed description will be given below of a specific implementation method for determining, from sample data, a nominal variable to be binned and m feature values corresponding to the nominal variable according to the preset variable configuration mentioned in step S2 by means of a specific embodiment.

Referring to fig. 3, fig. 3 shows a specific implementation flow of step S2 provided in the embodiment of the present invention, which is described in detail below:

S21: If the variable is configured as a continuous variable, equal-width binning or equal-frequency binning is performed on the continuous variable to obtain an initial binning result.

Equal-width binning divides the value range of the feature into a intervals of equal width, where a is the number of intervals, and each interval is regarded as one bin.

Equal-frequency binning arranges the feature values from small to large and divides them into a parts containing equal numbers of feature values, and each part is regarded as one bin.

When the variable is configured as a continuous variable, the continuous variable is first discretized, that is, converted into a nominal variable through equal-width or equal-frequency binning. During equal-width or equal-frequency binning, the order of the bin numbers is consistent with the order of the feature values: the values of the continuous variable are arranged from small to large and divided into a parts, and each part is regarded as one bin. The resulting bins are taken as the initial binning result.
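The two initial binning schemes can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function names, the sample ages and the choice of cut positions for uneven part sizes are all hypothetical.

```python
from typing import List


def equal_width_edges(values: List[float], a: int) -> List[float]:
    """Split the value range [min, max] into `a` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / a
    return [lo + i * width for i in range(a + 1)]


def equal_frequency_bins(values: List[float], a: int) -> List[List[float]]:
    """Sort the values ascending and cut them into `a` parts of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut positions i*n//a keep part sizes within one element of each other.
    cuts = [i * n // a for i in range(a + 1)]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(a)]


ages = [23, 35, 41, 19, 52, 30, 28, 47, 60, 38]
width_edges = equal_width_edges(ages, 4)
freq_bins = equal_frequency_bins(ages, 5)
```

In both cases the bins come out ordered by value, which matches the statement that bin numbers follow the order of the feature values.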

S22: The nominal variable to be binned and the m feature values corresponding to the nominal variable are determined according to the initial binning result.

In the embodiment of the present invention, the nominal variable to be binned and the feature values corresponding to the nominal variable are obtained from the initial binning result of step S21, where the total number of feature values is m.

For example, assuming that the continuous variable is age, the ages are divided into 10 bins by equal-frequency binning according to step S21: [10, 15), [15, 21), [21, 24), [24, 29), [29, 35), [35, 38), [38, 41), [41, 45), [45, 55), and [55, 80]. The resulting nominal variable is age, and its feature values are the boundary values between the age ranges: 15, 21, 24, 29, 35, 38, 41, 45 and 55, 9 in total.
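The age example can be reproduced in a few lines of Python. The variable names are hypothetical; the edges are those listed above.

```python
# Bin edges from the age example: [10, 15), [15, 21), ..., [55, 80].
edges = [10, 15, 21, 24, 29, 35, 38, 41, 45, 55, 80]

# Interior boundaries are the candidate feature values; the outer limits
# 10 and 80 do not separate two bins, so they are not split points.
feature_values = edges[1:-1]
m = len(feature_values)
```

With 10 initial bins there are 11 edges and therefore m = 9 interior boundaries.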

In the embodiment corresponding to fig. 3, after equal-width or equal-frequency binning is performed on the continuous variable in the sample data, the nominal variable and its feature values are extracted from the resulting bins in preparation for the subsequent binning operation. Converting the continuous variable into a nominal variable before binning ensures the accuracy of the binning operation on continuous variables and improves binning efficiency.

Based on the embodiment corresponding to fig. 1, the calculation of the association index value corresponding to the feature value, mentioned in step S4, is described in detail below by means of a specific embodiment:

If the nominal variable belongs to the binary classification feature, the association index value corresponding to the feature value is calculated according to formula (2).

Where IV is the association index value, $n_{i1}$ is the number of white samples in the i-th bin determined according to the binary classification feature, $n_{i2}$ is the number of black samples in the i-th bin determined according to the binary classification feature, $n_{*1}$ is the total number of white samples in the sample data, and $n_{*2}$ is the total number of black samples in the sample data.
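Formula (2) itself is not reproduced in this text. The symbols defined above match the standard definition of the information value (IV), so the formula is presumably of the following form. This is a reconstruction from the symbol definitions, not the patent's verbatim formula:

```latex
IV = \sum_{i} \left( \frac{n_{i1}}{n_{*1}} - \frac{n_{i2}}{n_{*2}} \right)
     \ln \frac{n_{i1} / n_{*1}}{n_{i2} / n_{*2}}
```

Each term compares the share of white samples and the share of black samples that fall into the i-th bin, so a bin whose two shares are equal contributes zero.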

In the embodiment of the invention, a binary classification feature is a feature for which the preset classification condition has only two classification values. The preset classification condition is a condition, set according to application requirements during construction of the feature coding model, for classifying the sample data by means of the nominal variable. According to the binary classification feature, the sample data can be divided into white samples and black samples: sample data that meets the classification condition is a white sample, and sample data that does not meet it is a black sample.

For example, the classification condition "purchasing critical illness insurance" has only two classification values, "yes" and "no". When the nominal variable is age and the sample data is judged against this condition, the result can only be "yes" or "no", so the nominal variable can be considered to belong to the binary classification feature. Sample data with the classification value "yes" is a white sample, and sample data with the classification value "no" is a black sample.

In the embodiment of the invention, if the nominal variable belongs to the binary classification feature, the IV value corresponding to the feature value can be calculated quickly and accurately through formula (2). For binary classification features, using the IV value as the association index value accurately reflects the correlation between the two nominal variables, so that the target split point can be accurately selected according to the IV value and the accuracy of the binning is effectively improved.
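A minimal Python sketch of the IV calculation for one trial split, assuming the standard information value definition (the patent's formula (2) is not reproduced in this text). The function name and the sample counts are hypothetical, and the sketch assumes every bin holds at least one white and one black sample.

```python
import math
from typing import List, Tuple


def information_value(bins: List[Tuple[int, int]]) -> float:
    """Compute IV from per-bin (white_count, black_count) pairs.

    Assumes every bin contains at least one white and one black sample;
    otherwise the logarithm is undefined and smoothing would be needed.
    """
    total_white = sum(w for w, _ in bins)
    total_black = sum(b for _, b in bins)
    iv = 0.0
    for white, black in bins:
        p_white = white / total_white   # n_i1 / n_*1
        p_black = black / total_black   # n_i2 / n_*2
        iv += (p_white - p_black) * math.log(p_white / p_black)
    return iv


# Two bins produced by one trial split point (illustrative counts).
iv_split = information_value([(30, 10), (20, 40)])
# A split that leaves both bins with the same white/black mix carries no information.
iv_flat = information_value([(25, 25), (25, 25)])
```

A split point whose bins separate white from black samples more sharply yields a larger IV, which is why the maximum is chosen as the target split point.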

If the nominal variable belongs to the multi-element classification feature, the association index value corresponding to the feature value is calculated according to formula (3).

Where $G_r$ is the association index value, $Y$ is the total sample set of the sample data, $n$ is the number of sample classes determined according to the multi-element classification feature, $p_g$ is the proportion of samples belonging to the g-th class in the total sample set, $\mathrm{Gini}(Y)$ is the Gini index of the total sample set, $\mathrm{Gini}(Y_j)$ is the Gini index of the j-th bin, $Y_j$ is the sample set of the nominal variable in the j-th bin, $|Y_j|$ is the number of samples in $Y_j$, and $|Y|$ is the number of samples in the total sample set.
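Formula (3) is likewise not reproduced in this text. The Gini index of a sample set with n classes has the standard form shown in the first line below. How formula (3) combines the per-bin indices into $G_r$ cannot be recovered from the text; the second line shows one common form, a size-weighted Gini variance ratio, that is consistent with the listed symbols. It is given here as an assumption only:

```latex
\mathrm{Gini}(Y) = 1 - \sum_{g=1}^{n} p_g^{2}

G_r = 1 - \frac{\sum_{j} |Y_j| \, \mathrm{Gini}(Y_j)}{|Y| \, \mathrm{Gini}(Y)}
```

Under this assumed form, $G_r$ grows as the bins become purer, which agrees with selecting the maximum association index value as the target split point.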

In the embodiment of the invention, a multi-element classification feature is a feature for which the preset classification condition has more than two classification values. The number of sample classes is the number of classification values. For example, the classification condition "processing stage of a vehicle insurance claim" has 3 classification values: "loss assessment stage", "maintenance stage" and "payment stage", so the number of sample classes is 3. When the nominal variable is age and the sample data is classified using this condition, there are 3 different classification values, so the nominal variable can be considered to belong to the multi-element classification feature.

$\mathrm{Gini}(Y)$ is the Gini index calculated over the total sample set, and $\mathrm{Gini}(Y_j)$ is the Gini index calculated over the j-th bin. The two differ only in the range of samples they cover; the calculation method is identical and is not repeated here.

In the embodiment of the invention, if the nominal variable belongs to the multi-element classification feature, the Gini variance index value corresponding to the feature value can be calculated quickly and accurately through formula (3). For multi-element classification features, using the Gini variance index value as the association index value accurately reflects the association between two or more nominal variables, so that the target split point can be accurately selected according to the Gini variance index value and the accuracy of the binning is effectively improved.
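A Python sketch of a Gini-based association index for multi-class labels. The Gini index itself follows the standard definition; the way the per-bin indices are combined into `gini_variance` is an assumed form, since formula (3) is not reproduced in this text. The function names and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini index of a sample set: 1 minus the sum of squared class shares."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def gini_variance(bins: List[Sequence[Hashable]]) -> float:
    """Assumed form of G_r: 1 - sum(|Y_j| * Gini(Y_j)) / (|Y| * Gini(Y)).

    Requires Gini(Y) > 0, i.e. the total sample set has at least two classes.
    """
    all_labels = [label for b in bins for label in b]
    weighted = sum(len(b) * gini(b) for b in bins)
    return 1.0 - weighted / (len(all_labels) * gini(all_labels))


# Claim stages as class labels (illustrative data).
pure_split = [["loss", "loss"], ["repair", "repair"], ["pay", "pay"]]
mixed_split = [["loss", "repair", "pay"], ["loss", "repair", "pay"]]
g_pure = gini_variance(pure_split)
g_mixed = gini_variance(mixed_split)
```

Bins that each hold a single class score highest, while bins that merely mirror the overall class mix score lowest.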

It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an execution order. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiment of the present invention in any way.

Example 2

Corresponding to the variable binning method in embodiment 1, fig. 4 shows a variable binning apparatus in one-to-one correspondence with the variable binning method provided in embodiment 1, and for convenience of explanation, only the portions relevant to the embodiments of the present invention are shown.

As shown in fig. 4, the variable binning apparatus includes an acquisition module 41, a determination module 42, a storage module 43, a calculation module 44, a removal module 45 and a loop module 46. The functional modules are described in detail as follows:

an acquisition module 41, configured to acquire sample data;

a determination module 42, configured to determine, from the sample data and according to a preset variable configuration, a nominal variable to be binned and m feature values corresponding to the nominal variable, where m is a positive integer greater than 1;

a storage module 43, configured to store the m feature values into a preset feature value set, set the initial value of the binning round number k to 0, and set the binning result of round 0 to empty, where k ∈ [0, m-1];

a calculation module 44, configured to, for each feature value in the feature value set, take the feature value as a test split point, divide the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculate the association index value corresponding to the feature value, so as to obtain m-k association index values;

a removal module 45, configured to take the feature value corresponding to the maximum of the m-k association index values as the target split point, divide the nominal variable into k+2 bins on the basis of the binning result of the k-th round, take this division as the binning result of the (k+1)-th round, and remove the feature value from the feature value set;

a loop module 46, configured to stop the binning if k+2 reaches the preset bin count threshold and determine the binning result of the (k+1)-th round as the final binning result; otherwise, to add 1 to k and return to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value to obtain m-k association index values.
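The loop carried out by modules 43 to 46 can be sketched in Python as follows. This is a sketch under several assumptions, not the patent's implementation: feature values are taken to be ordered numeric boundaries (as in the age example), labels are binary (1 = white, 0 = black), the score is an IV with +0.5 smoothing per cell to avoid a logarithm of zero, and all function names and sample data are hypothetical.

```python
import math
from bisect import bisect_right
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (value of the variable, label: 1 = white, 0 = black)


def assign_bins(samples: Sequence[Sample], splits: List[float]) -> List[List[int]]:
    """Group labels into len(splits) + 1 left-closed bins by value."""
    ordered = sorted(splits)
    bins: List[List[int]] = [[] for _ in range(len(ordered) + 1)]
    for value, label in samples:
        bins[bisect_right(ordered, value)].append(label)
    return bins


def smoothed_iv(bins: List[List[int]]) -> float:
    """IV with +0.5 smoothing per cell (an assumption, to avoid log of zero)."""
    white = [sum(b) + 0.5 for b in bins]
    black = [len(b) - sum(b) + 0.5 for b in bins]
    tw, tb = sum(white), sum(black)
    return sum((w / tw - b / tb) * math.log((w / tw) / (b / tb))
               for w, b in zip(white, black))


def greedy_binning(samples: Sequence[Sample], feature_values: Sequence[float],
                   max_bins: int,
                   score: Callable[[List[List[int]]], float] = smoothed_iv) -> List[float]:
    """Each round k adds the test split with the highest score, giving k + 2 bins."""
    candidates = list(feature_values)   # the feature value set
    chosen: List[float] = []            # round 0 starts with an empty result
    k = 0
    while candidates:
        # Score every remaining feature value as a test split point.
        scored = [(score(assign_bins(samples, chosen + [c])), c) for c in candidates]
        _, best = max(scored, key=lambda t: t[0])
        chosen.append(best)             # target split point of this round
        candidates.remove(best)         # remove it from the feature value set
        if k + 2 >= max_bins:           # bin count threshold reached
            break
        k += 1
    return sorted(chosen)


data = [(18, 0), (22, 0), (25, 0), (28, 0), (31, 1), (36, 1), (42, 1), (50, 1)]
two_bins = greedy_binning(data, [20, 30, 40], max_bins=2)
three_bins = greedy_binning(data, [20, 30, 40], max_bins=3)
```

Because each round keeps the splits chosen earlier and only evaluates the remaining feature values, round k scores m-k candidates, which matches the count stated for the calculation module.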

Further, the variable binning apparatus further includes:

a configuration module 47, configured to obtain a binning configuration parameter from a preset configuration file, where the binning configuration parameter includes a bin count threshold.

Further, the determination module 42 includes:

an initial binning submodule 421, configured to perform equal-width binning or equal-frequency binning on the continuous variable if the variable is configured as a continuous variable, so as to obtain an initial binning result;

a nominal variable determining submodule 422, configured to determine the nominal variable to be binned and the m feature values corresponding to the nominal variable according to the initial binning result.

Further, the calculation module 44 includes:

the binary calculation sub-module 441 is configured to calculate the association index value according to the following formula if the nominal variable belongs to the binary classification feature:

Where IV is the association index value, $n_{i1}$ is the number of white samples in the i-th bin determined according to the binary classification feature, $n_{i2}$ is the number of black samples in the i-th bin determined according to the binary classification feature, $n_{*1}$ is the total number of white samples in the sample data, and $n_{*2}$ is the total number of black samples in the sample data.

Further, the calculation module 44 further includes:

the multiple calculation sub-module 442 is configured to calculate the association index value according to the following formula if the nominal variable belongs to the multiple classification feature:

For the process by which each module of the variable binning apparatus provided in this embodiment implements its function, reference may be made to the description of embodiment 1, which is not repeated here.

Example 3

The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the variable binning method of embodiment 1, or which, when executed by a processor, implements the functions of the modules in the variable binning apparatus of embodiment 2. In order to avoid repetition, a description thereof is omitted.

It will be appreciated that the computer readable storage medium may comprise: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), an electrical carrier wave signal, a telecommunications signal, and the like.

Example 4

Fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 5, the terminal device 50 of this embodiment includes: a processor 51, a memory 52 and a computer program 53, such as a variable binning program, stored in the memory 52 and executable on the processor 51. The steps in the above-described respective variable binning method embodiments, such as steps S1 to S6 shown in fig. 1, are implemented when the processor 51 executes the computer program 53. Alternatively, the processor 51, when executing the computer program 53, performs the functions of the modules in the above-described apparatus embodiments, such as the functions of the modules 41 to 46 shown in fig. 4.

By way of example, the computer program 53 may be divided into one or more modules, which are stored in the memory 52 and executed by the processor 51 to implement the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and these segments describe the execution of the computer program 53 in the terminal device 50. For example, the computer program 53 may be divided into an acquisition module, a determination module, a storage module, a calculation module, a removal module and a loop module, each module having the following specific functions:

the acquisition module is used for acquiring sample data;

the determination module is used for determining, from the sample data and according to a preset variable configuration, a nominal variable to be binned and m feature values corresponding to the nominal variable, wherein m is a positive integer greater than 1;

the storage module is used for storing the m feature values into a preset feature value set, setting the initial value of the binning round number k to 0 and setting the binning result of round 0 to empty, wherein k ∈ [0, m-1];

the calculation module is used for, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value, so as to obtain m-k association index values;

the removal module is used for taking the feature value corresponding to the maximum of the m-k association index values as the target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, taking this division as the binning result of the (k+1)-th round, and removing the feature value from the feature value set;

the loop module is used for stopping the binning if k+2 reaches the preset bin count threshold and determining the binning result of the (k+1)-th round as the final binning result; otherwise, adding 1 to k and returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the association index value corresponding to the feature value to obtain m-k association index values.

Further, the computer program 53 may also be split into:

the configuration module is used for acquiring the box division configuration parameters from a preset configuration file, wherein the box division configuration parameters comprise a box number threshold value.

Further, the determining module includes:

the initial binning submodule is used for performing equal-width binning or equal-frequency binning on the continuous variable if the variable is configured as a continuous variable, so as to obtain an initial binning result;

the nominal variable determining sub-module is used for determining a nominal variable to be divided and m eigenvalues corresponding to the nominal variable according to an initial division result.

Further, the computing module includes:

the binary calculation sub-module is used for calculating the association index value according to the following formula if the nominal variable belongs to the binary classification feature:

wherein IV is the association index value, $n_{i1}$ is the number of white samples in the i-th bin determined according to the binary classification feature, $n_{i2}$ is the number of black samples in the i-th bin determined according to the binary classification feature, $n_{*1}$ is the total number of white samples in the sample data, and $n_{*2}$ is the total number of black samples in the sample data.

Further, the computing module further includes:

the multi-element calculation sub-module is used for calculating the association index value according to the following formula if the nominal variable belongs to the multi-element classification characteristic:

The terminal device 50 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. Terminal device 50 may include, but is not limited to, a processor 51, a memory 52. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the terminal device 50 and is not limiting of the terminal device 50, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the terminal device 50 may also include input-output devices, network access devices, buses, etc.

The processor 51 may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 52 may be an internal storage unit of the terminal device 50, such as a hard disk or a memory of the terminal device 50. The memory 52 may also be an external storage device of the terminal device 50, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 50. Further, the memory 52 may also include both internal storage units and external storage devices of the terminal device 50. The memory 52 is used to store computer programs and other programs and data required by the terminal device 50. The memory 52 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method of the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium, and when executed by a processor it implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims

1. A variable binning method, characterized in that the variable binning method comprises:

acquiring sample data, wherein the sample data is insurance service data;

determining a nominal variable to be binned and m characteristic values corresponding to the nominal variable from the insurance business data according to preset variable configuration, wherein the nominal variable comprises at least one of gender, age and housing condition, and m is a positive integer greater than 1;

2. The variable binning method of claim 1, wherein prior to obtaining a nominal variable to be binned and m eigenvalues corresponding to the nominal variable, the variable binning method further comprises:

Obtaining a box division configuration parameter from a preset configuration file, wherein the box division configuration parameter comprises the box number threshold.

3. The variable binning method according to claim 1, wherein the determining, from the insurance service data, a nominal variable to be binned and m feature values corresponding to the nominal variable according to a preset variable configuration includes:

if the variable is configured as a continuous variable, performing equal-width binning or equal-frequency binning on the continuous variable to obtain an initial binning result;

and determining the nominal variable to be binned and the m characteristic values corresponding to the nominal variable according to the initial binning result.

4. The variable binning method of claim 1, wherein the calculating the associated index value corresponding to the feature value comprises:

if the nominal variable belongs to the binary classification feature, calculating the association index value according to the following formula:

wherein IV is the association index value, $n_{i1}$ is the number of white samples in the i-th bin determined according to the binary classification feature, $n_{i2}$ is the number of black samples in the i-th bin determined according to the binary classification feature, $n_{*1}$ is the total number of the white samples in the insurance business data, and $n_{*2}$ is the total number of the black samples in the insurance business data.

5. The variable binning method of claim 1, wherein calculating the associated index value corresponding to the feature value further comprises:

if the nominal variable belongs to the multivariate classification feature, calculating the association index value according to the following formula:

wherein $G_r$ is the association index value, $Y$ is the total sample set of the insurance business data, $n$ is the number of sample classes determined according to the multi-element classification feature, $p_g$ is the proportion of samples belonging to the g-th class in the total sample set, $\mathrm{Gini}(Y)$ is the Gini index of the total sample set, $\mathrm{Gini}(Y_j)$ is the Gini index of the j-th bin, $Y_j$ is the sample set of the nominal variable in the j-th bin, $|Y_j|$ is the number of samples in $Y_j$, and $|Y|$ is the number of samples in the total sample set.

6. A variable box-dividing device, characterized in that the variable box-dividing device comprises:

the acquisition module is used for acquiring sample data, wherein the sample data is insurance service data;

the determining module is used for determining a nominal variable to be binned and m characteristic values corresponding to the nominal variable from the insurance business data according to preset variable configuration, wherein the nominal variable comprises at least one of gender, age and housing condition, and m is a positive integer greater than 1;

7. The variable box-handling device of claim 6, wherein the determination module comprises:

and the nominal variable determining submodule is used for determining a nominal variable to be divided and m eigenvalues corresponding to the nominal variable according to the initial division result.

8. The variable binning apparatus of claim 6, wherein the calculation module comprises:

wherein IV is the association index value, $n_{i1}$ is the number of white samples in the i-th bin determined according to the binary classification feature, $n_{i2}$ is the number of black samples in the i-th bin determined according to the binary classification feature, $n_{*1}$ is the total number of the white samples in the insurance business data, and $n_{*2}$ is the total number of the black samples in the insurance business data;

the multi-element calculation sub-module is used for calculating the association index value according to the following formula if the nominal variable belongs to multi-element classification characteristics:

wherein $G_r$ is the association index value, $Y$ is the total sample set of the insurance business data, $n$ is the number of sample classes determined according to the multi-element classification feature, $p_g$ is the proportion of samples belonging to the g-th class in the total sample set, $\mathrm{Gini}(Y)$ is the Gini index of the total sample set, $\mathrm{Gini}(Y_j)$ is the Gini index of the j-th bin, $Y_j$ is the sample set of the nominal variable in the j-th bin, $|Y_j|$ is the number of samples in $Y_j$, and $|Y|$ is the number of samples in the total sample set.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the variable binning method according to any of claims 1 to 5 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the variable binning method according to any one of claims 1 to 5.