CN116028788A - Feature binning method, device, computer equipment and storage medium - Google Patents

Feature binning method, device, computer equipment and storage medium

Info

Publication number
CN116028788A
CN116028788A
Authority
CN
China
Prior art keywords
result
evidence weight
feature
candidate
bin
Prior art date
Legal status
Pending
Application number
CN202310136258.3A
Other languages
Chinese (zh)
Inventor
刘帅
斯洪标
姜桂林
唐丽华
张东阳
刘逾江
闫宁
Current Assignee
Hunan Caixin Digital Technology Co ltd
Original Assignee
Hunan Caixin Digital Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Caixin Digital Technology Co ltd filed Critical Hunan Caixin Digital Technology Co ltd
Priority to CN202310136258.3A
Publication of CN116028788A

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this application belong to the field of artificial intelligence and relate to a feature binning method, a feature binning device, a computer device, and a storage medium. The method comprises: obtaining a modeling sample set containing a plurality of modeling samples, where each modeling sample contains a label variable and a feature variable to be binned; splitting the modeling sample set along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result; and screening the candidate binning results according to the evidence weight of each bin in each candidate binning result to obtain the target binning result of the feature variable. The method and device improve both the speed and the accuracy of feature binning.

Description

Feature binning method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular to a feature binning method and device, a computer device, and a storage medium.
Background
The scorecard model is one of the most important models in credit evaluation in the financial field. It is typically built with logistic regression, which requires continuous variables to be discretized: the modeling features must be binned, and the quality of the binning largely determines the quality of the scorecard model.
To ensure binning quality, current feature binning techniques typically rely on human experience and knowledge. In big-data modeling scenarios, the amount of data involved in building a scorecard is extremely large, which makes manual binning time-consuming and cumbersome, so feature binning is inefficient.
Disclosure of Invention
An object of the embodiments of the present application is to provide a feature binning method and device, a computer device, and a storage medium, so as to solve the problem of low feature binning efficiency.
To solve the above technical problem, an embodiment of the present application provides a feature binning method, which adopts the following technical solution:
obtaining a modeling sample set containing a plurality of modeling samples, where each modeling sample contains a label variable and a feature variable to be binned;
splitting the modeling sample set along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and screening the candidate binning results according to the evidence weight of each bin in each candidate binning result to obtain the target binning result of the feature variable.
To solve the above technical problem, an embodiment of the present application further provides a feature binning device, which adopts the following technical solution:
a sample set acquisition module, configured to obtain a modeling sample set containing a plurality of modeling samples, where each modeling sample contains a label variable and a feature variable to be binned;
a sample set segmentation module, configured to split the modeling sample set along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and a result screening module, configured to screen the candidate binning results according to the evidence weight of each bin in each candidate binning result to obtain the target binning result of the feature variable.
To solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solution:
obtaining a modeling sample set containing a plurality of modeling samples, where each modeling sample contains a label variable and a feature variable to be binned;
splitting the modeling sample set along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and screening the candidate binning results according to the evidence weight of each bin in each candidate binning result to obtain the target binning result of the feature variable.
To solve the above technical problem, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical solution:
obtaining a modeling sample set containing a plurality of modeling samples, where each modeling sample contains a label variable and a feature variable to be binned;
splitting the modeling sample set along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and screening the candidate binning results according to the evidence weight of each bin in each candidate binning result to obtain the target binning result of the feature variable.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects: a modeling sample set containing a plurality of modeling samples is obtained, where each modeling sample contains a label variable and a feature variable to be binned; the modeling sample set is split along the feature variable dimension during the growth of decision trees, so feature binning is performed automatically, and using evidence weight and information value directly as the splitting criteria reduces the amount of computation and improves the accuracy of the generated decision trees and candidate binning results; the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result, which enriches the candidate binning results; and the candidate binning results are automatically evaluated and screened as a whole according to the evidence weight of each bin in each candidate binning result, which further improves the accuracy and generation speed of the final target binning result.
Drawings
To describe the solution of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a feature binning method according to the present application;
FIG. 3 is a schematic structural diagram of one embodiment of a feature binning device according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the feature binning method provided in the embodiments of the present application is generally executed by a terminal device, and accordingly, the feature binning device is generally disposed in the terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a feature binning method according to the present application is illustrated. The feature binning method comprises the following steps:
Step S201: a modeling sample set including a plurality of modeling samples is obtained, where each modeling sample includes a label variable and a feature variable to be binned.
In this embodiment, the electronic device (for example, the terminal device shown in FIG. 1) on which the feature binning method runs may communicate with the server through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G, Wi-Fi, Bluetooth, WiMAX, ZigBee, UWB (ultra wideband), and other now known or later developed wireless connections.
Specifically, a modeling sample set is obtained. The modeling sample set comprises a plurality of modeling samples, each containing a label variable and a feature variable; the label variable is the label of the modeling sample, and the feature variable is what this application bins. It should be noted that a modeling sample may contain several feature variables, and feature binning may be performed for each of them.
For example, in credit evaluation in the financial field, a modeling sample in the modeling sample set may be a user sample, and the feature variables may be feature attributes of the user, such as age and income. Each user sample carries a label variable indicating whether the user is overdue on repayment; for example, 0 denotes a user who is not overdue and 1 denotes a user who is overdue.
Step S202: the modeling sample set is split along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result.
Specifically, this application trains decision trees on the modeling sample set. While a decision tree is being generated, the training sample set is split step by step, which drives the splitting and growth of the tree.
The root node of a decision tree holds the whole modeling sample set. Splitting the root node produces two leaf nodes, the modeling samples at the root node are divided between these two leaf nodes, and the depth of the decision tree increases by 1. The two leaf nodes keep splitting until the depth of the decision tree reaches a preset depth, which is less than or equal to the preset maximum depth. This application can generate a number of decision trees with different depths. For example, if the preset maximum depth is 5, a decision tree stops growing once its depth reaches 5; the preset depth may be 2, 3, 4, or 5 (never exceeding the maximum depth of 5), so decision trees of depths 2, 3, 4, and 5 can all be generated, and there may be more than one decision tree at each of these depths.
When a node is split, a value A of the feature variable (also called point A) is selected; the modeling samples at the node are divided into two parts according to A, and the two parts fall into the two newly generated leaf nodes respectively. Point A may have several candidate values (the original shows the candidate values as a formula image).
After the split is completed, the modeling samples at each of the two leaf nodes form one bin. Because these modeling samples carry label variables, the evidence weight (Weight of Evidence, WOE) and information value (Information Value, IV) can be calculated from them. A suitable candidate value of point A is then selected according to the evidence weight and information value, and the binning results produced by unsuitable candidate values are discarded.
Evidence weight (WOE) is an encoding of the independent variable, commonly used for feature transformation, that measures the correlation between the independent variable (here, a feature variable) and the dependent variable (here, the label variable). WOE describes the direction and magnitude of the influence that the current grouping of the variable has on the dependent variable, i.e., on whether a sample (or the class a sample belongs to) responds. When WOE is positive, the current value of the variable has a positive influence on whether the sample responds; when WOE is negative, the influence is negative. The magnitude of the WOE value indicates the magnitude of this influence.
The information value (IV), also called information quantity, measures the predictive power of the independent variable and is calculated from the WOE values.
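As a concrete reference, the sketch below computes WOE and IV with the standard formulas; it is a minimal illustration, not the patent's own code, and zero good or bad counts in a bin would need smoothing or the minimum bin sample count described later:

```python
import math

def woe_iv(bins, total_good, total_bad):
    """Compute the WOE of each bin and the total IV.

    bins: list of (good_count, bad_count) pairs, one pair per bin.
    WOE_i = ln((good_i / total_good) / (bad_i / total_bad));
    IV    = sum_i (good_i / total_good - bad_i / total_bad) * WOE_i.
    """
    woes, iv = [], 0.0
    for good, bad in bins:
        p_good = good / total_good  # bin's share of all good samples
        p_bad = bad / total_bad     # bin's share of all bad samples
        woe = math.log(p_good / p_bad)
        woes.append(woe)
        iv += (p_good - p_bad) * woe
    return woes, iv

# Numbers from the worked example later in this description:
# 647 good / 54 bad in one bin, 3662 good / 338 bad overall.
woes, iv = woe_iv([(647, 54)], total_good=3662, total_bad=338)
print(round(woes[0], 4), round(iv, 4))  # 0.1006 0.0017
```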
In the decision tree field, whether to split is usually judged by computing information entropy, and evidence weight and information value are only calculated afterwards to evaluate the split, which is inefficient. This application calculates evidence weight and information value directly and uses them as the splitting criteria of the decision tree, which is more direct and more efficient and yields more accurate splits.
In conventional practice for scorecard models, the feature variables are binned first, the evidence weight WOE is calculated afterwards, the binning effect is evaluated from the WOE, and the binning is redone if the effect is poor; the WOE must also remain interpretable. This application instead bins directly with evidence weight and information value, splitting each time at the point with the largest information value, so the best binning effect is already selected and no extra judgment or manual selection is needed, unlike conventional binning techniques.
Each time the decision tree splits, the leaf nodes generated by the previous split are split again, until the depth of the decision tree reaches the preset depth N and splitting stops. This yields at most 2^N bins, which form the candidate binning result corresponding to that decision tree.
Step S203: each candidate binning result is screened according to the evidence weight of each bin in it, to obtain the target binning result of the feature variable.
Specifically, this application can obtain several decision trees of different depths, and there may also be several different decision trees at the same depth. Each decision tree corresponds to one candidate binning result, and each candidate binning result comprises several bins. Each bin corresponds to a value interval of the feature variable and contains a certain number of modeling samples, and the value interval of each bin is unique.
An evidence weight can be calculated for each bin in a candidate binning result. A bin evaluation strategy is preset in this application: each candidate binning result is evaluated from the evidence weights of all of its bins taken together, and the candidate binning result that satisfies the bin evaluation strategy is selected as the target binning result of the feature variable, completing the feature binning of that variable.
In this embodiment, a modeling sample set containing a plurality of modeling samples is obtained, where each modeling sample contains a label variable and a feature variable to be binned; the modeling sample set is split along the feature variable dimension during the growth of decision trees, so feature binning is performed automatically, and using evidence weight and information value directly as the splitting criteria reduces the amount of computation and improves the accuracy of the generated decision trees and candidate binning results; the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result, which enriches the candidate binning results; and the candidate binning results are automatically evaluated and screened as a whole according to the evidence weight of each bin in each candidate binning result, which further improves the accuracy and generation speed of the final target binning result.
Further, after step S201, the method may further include: obtaining decision tree configuration information, where the decision tree configuration information includes a maximum depth and a minimum bin sample count and is used to constrain the decision tree.
Specifically, decision tree configuration information can be obtained; it can be configured manually and includes the maximum depth and the minimum bin sample count. The maximum depth is the maximum depth of a decision tree. When the decision tree splits at a node L according to a value A of the feature variable, the modeling samples at that node are divided between new leaf nodes M and N, and the modeling samples at M and N each form a bin; as A takes different values, the numbers of modeling samples in M and N change accordingly. To avoid abnormal decision trees, the value of A is controlled by the minimum bin sample count: a selected A must ensure that the number of modeling samples in each of M and N is greater than or equal to the minimum bin sample count, and a candidate value of A that cannot satisfy this is discarded.
In one embodiment, the decision tree configuration information may instead consist of a maximum depth and a minimum bin sample ratio. The ratio is a percentage that works the same way as the minimum bin sample count, except that the proportion of modeling samples in each bin's leaf node must be greater than or equal to the minimum bin sample ratio (e.g., 5%, meaning the number of modeling samples in each bin may not be less than 5% of the total number of modeling samples).
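As an illustration of how this constraint limits the candidate values of point A, the following sketch enumerates split points so that both child bins keep at least min_samples modeling samples. The function and parameter names are assumptions, not from the patent; the minimum sample ratio variant reduces to the same logic once the ratio is converted to a count:

```python
def candidate_split_points(values, min_samples):
    """values: feature values of the modeling samples at the node to split.

    A split at A sends samples with value <= A to one leaf and the rest to
    the other; candidates that would leave either leaf with fewer than
    min_samples samples are discarded, as the description requires.
    """
    vals = sorted(values)
    n = len(vals)
    points = []
    for i in range(min_samples - 1, n - min_samples):
        if vals[i] < vals[i + 1]:  # only real boundaries between values
            points.append((vals[i] + vals[i + 1]) / 2)
    return points

# For a 5% minimum sample ratio: min_samples = max(1, int(0.05 * n)).
```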
In this embodiment, decision tree configuration information including the maximum depth and the minimum bin sample count is obtained and used to control the splitting and generation of the decision trees, which ensures that the generated decision trees are reasonable.
Further, step S202 may include: at each depth less than or equal to the maximum depth, determining several feature split points within the value range of the feature variable according to the minimum bin sample count; performing a first-level split of the modeling sample set at each feature split point to obtain several first-level binning results; calculating the evidence weight of each first-level binning result from the label variables of the modeling samples in it, and calculating the information value of each first-level binning result from its evidence weight; screening the first-level binning results according to the obtained information values; and iteratively splitting the screened first-level binning results until the depth of the resulting decision tree reaches the preset depth, to obtain the candidate binning results.
Specifically, this application may generate decision trees of several depths, but each tree's depth must be less than or equal to the maximum depth.
When a decision tree of a given preset depth is generated, the modeling sample set is split over multiple levels. In the first-level split, several feature split points, i.e., values A of the feature variable, are determined within the value range of the feature variable subject to the minimum bin sample count. The modeling sample set is then split at each feature split point to obtain several first-level binning results. It should be noted that the splits at the different feature split points are parallel, and the resulting first-level binning results are also parallel: the modeling sample set is split in two at the first feature split point to obtain the first first-level binning result, split in two at the second feature split point to obtain the second first-level binning result, and so on.
Each first-level binning result comprises two bins, and the modeling samples in the two bins carry label variables, so the evidence weight of each first-level binning result can be calculated and the information value can be calculated from the evidence weight. The first-level binning results are screened by information value so that at least one satisfactory first-level binning result is kept; those not kept are discarded. For each kept first-level binning result, several feature split points are determined within each of its two bins, and the next-level split is performed in the same way as the first-level split, iterating level by level. After each level of splitting, the depth of the decision tree increases by 1; once the depth reaches the preset depth, no further splits are performed, and the bins at the final leaf nodes are taken as the candidate binning result of that decision tree.
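The sketch below condenses this level-by-level procedure, reusing woe_iv and candidate_split_points from the earlier sketches. For brevity it keeps only the single best split per node, whereas the description retains several screened first-level results and grows parallel trees from them; all helper names are illustrative:

```python
def grow_binning(samples, preset_depth, min_samples):
    """samples: list of (feature_value, label) pairs, label 0 = good, 1 = bad.
    Returns one candidate binning result: the list of leaf-node bins."""
    total_good = sum(1 for _, y in samples if y == 0)
    total_bad = len(samples) - total_good

    def counts(part):
        good = sum(1 for _, y in part if y == 0)
        return good, len(part) - good

    def best_split(node):
        best = None  # (iv, left, right)
        for a in candidate_split_points([x for x, _ in node], min_samples):
            left = [s for s in node if s[0] <= a]
            right = [s for s in node if s[0] > a]
            if 0 in counts(left) or 0 in counts(right):
                continue  # avoid log(0) in the WOE formula
            _, iv = woe_iv([counts(left), counts(right)],
                           total_good, total_bad)
            if best is None or iv > best[0]:  # screen by information value
                best = (iv, left, right)
        return best

    leaves = [samples]  # the root node holds the whole modeling sample set
    for _ in range(preset_depth):  # each level adds 1 to the tree depth
        next_leaves = []
        for node in leaves:
            split = best_split(node)
            if split:
                next_leaves += [split[1], split[2]]
            else:
                next_leaves.append(node)  # too small to split further
        leaves = next_leaves
    return leaves  # at most 2**preset_depth bins
```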
In this embodiment, determining the feature split points within the value range of the feature variable according to the minimum bin sample count ensures that the feature split points are reasonable; the modeling sample set is split at each feature split point to obtain several first-level binning results; the evidence weight of each first-level binning result is calculated from the label variables of its modeling samples, and its information value is calculated from the evidence weight; screening the first-level binning results by information value keeps only the satisfactory ones and improves their accuracy; and the kept first-level binning results are split iteratively until the depth of the resulting decision tree reaches the preset depth, yielding the candidate binning results and thus automatic feature binning.
Further, the step of screening the first-level binning results according to the obtained information values may include: sorting the obtained information values in descending order to obtain an information value sequence; and selecting at least one information value from the sequence and keeping the first-level binning results corresponding to the selected information values.
Specifically, the obtained information values are sorted in descending order to obtain an information value sequence. A preset number (greater than or equal to 1) of information values is selected from the sequence; the first-level binning results corresponding to the selected information values are kept and split further, while those corresponding to the unselected information values are discarded.
For example, consider a feature variable H in the modeling samples with a value range of 0-420 and a label variable of 0 or 1, where 0 indicates a good sample (e.g., the user represented by the sample is not overdue) and 1 indicates a bad sample (e.g., the user is overdue).
Assume the modeling sample set contains 3662 good samples and 338 bad samples, and that a certain bin corresponds to the value interval 1-100 of the feature variable and contains 647 good samples and 54 bad samples. The good samples in this bin account for about 18% of all good samples and the bad samples for about 16% of all bad samples, so the evidence weight WOE of the bin is 0.1006 and the corresponding information value IV is 0.0017. Assume the feature split point is H = 50; the binning result after the split is shown in Table 1:
[Table 1: binning result after splitting at H = 50; shown as an image in the original]
Here, the calculations follow the standard formulas for the evidence weight WOE and the information value IV.
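Substituting the example's numbers into those standard formulas (a consistency check added here, not part of the original text):

WOE = ln((647/3662) / (54/338)) = ln(0.1767 / 0.1598) ≈ 0.1006
IV = (0.1767 - 0.1598) × 0.1006 ≈ 0.0017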
After splitting at the feature split point H = 50, the total information value is 0.0050. Continuing the example, assume the total information values after splitting at the feature split points H = 45/55/60/65 are 0.0047/0.0021/0.0046/0.0031 respectively. This gives the information value sequence 0.0050 (H = 50), 0.0047 (H = 45), 0.0046 (H = 60), 0.0031 (H = 65), 0.0021 (H = 55). The largest information value, 0.0050, may be selected and the binning result corresponding to it kept; alternatively, the first N information values are selected from the sequence, and splitting continues on the binning results corresponding to the selected values.
In one embodiment, the binning results may be selected by the increment of the information value. Taking Table 1 as an example, the information value before the split is 0.0017; after splitting at the feature split point H = 50 the information value is 0.0050, so the increment is 0.0033. The increments are sorted in descending order, the N largest increments are selected, and the corresponding binning results are kept.
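A small sketch of this increment-based screening, with the example's numbers (function and variable names are illustrative, not from the patent):

```python
def top_n_by_iv_gain(parent_iv, split_results, n):
    """split_results: list of (split_point, iv_after_split) pairs."""
    gains = sorted(((iv - parent_iv, point) for point, iv in split_results),
                   reverse=True)  # descending increment of information value
    return [point for _, point in gains[:n]]

# Parent bin IV 0.0017; splitting at H = 50 gives 0.0050, an increment
# of 0.0033, the largest among the five candidates from the example above.
best = top_n_by_iv_gain(0.0017, [(50, 0.0050), (45, 0.0047), (60, 0.0046),
                                 (65, 0.0031), (55, 0.0021)], n=1)  # [50]
```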
It will be appreciated that each split performed after the first-level split screens the binning results in the same way.
In this embodiment, the obtained information values are sorted in descending order into an information value sequence, at least one information value is selected from the sequence, and the corresponding first-level binning results are kept. The kept first-level binning results have the larger information values, which ensures that the best first-level binning results are selected.
Further, step S203 may include: for each candidate binning result, obtaining the evidence weight of each bin in it; plotting the evidence weight curve of the candidate binning result from the obtained evidence weights; and screening the candidate binning results according to the evidence weight curves to obtain the target binning result of the feature variable.
Specifically, for each candidate binning result, which contains several bins, the evidence weight of each bin can be calculated and the evidence weight curve of the candidate binning result plotted. On the evidence weight curve the ordinate is the evidence weight value, and bins covering larger values of the feature variable correspond to points with larger abscissas. For example, each bin corresponds to one coordinate point, whose abscissa may be the midpoint of the feature variable interval of the bin and whose ordinate is the evidence weight of the bin. The candidate binning results can thus be evaluated as a whole through their evidence weight curves, and a suitable candidate binning result is selected as the target binning result of the feature variable.
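A minimal sketch of this curve construction, using the midpoint convention just described (names are illustrative, and the WOE values in the usage line are made up):

```python
def woe_curve(bins):
    """bins: list of ((low, high), woe) per bin, in ascending interval order.
    Returns the curve as (abscissa, ordinate) points: the abscissa is the
    midpoint of the bin's feature interval, the ordinate its WOE."""
    return [((low + high) / 2, woe) for (low, high), woe in bins]

curve = woe_curve([((0, 50), -0.21), ((50, 100), 0.35)])
# -> [(25.0, -0.21), (75.0, 0.35)]
```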
In this embodiment, the evidence weight curve of each candidate binning result is plotted from the evidence weights of its bins. The curve shows, as a whole, how the evidence weight changes across the binning result, so a suitable candidate binning result can be selected accurately as the target binning result.
Further, the step of screening the candidate binning results according to the evidence weight curves to obtain the target binning result of the feature variable may include: when one monotonic evidence weight curve exists, determining the candidate binning result corresponding to it as the target binning result of the feature variable; and when more than one monotonic evidence weight curve exists, obtaining the slope of each monotonic evidence weight curve, selecting the monotonic evidence weight curve with the largest slope, and taking the candidate binning result corresponding to it as the target binning result of the feature variable.
Specifically, since there are several candidate binning results, there may be several evidence weight curves, and an evidence weight curve may or may not be monotonic. In general, a binning result is better when its evidence weight is monotonic, so if exactly one monotonic evidence weight curve exists after the curves are plotted, the candidate binning result corresponding to it is directly determined as the target binning result of the feature variable.
If more than one monotonic evidence weight curve exists, the slope of each monotonic evidence weight curve is obtained; the slope reflects how steep the curve is. In general, the steeper the evidence weight curve, the better the binning effect, so the monotonic evidence weight curve with the largest slope (i.e., the largest absolute value of slope) is selected, and the candidate binning result corresponding to it is taken as the target binning result of the feature variable.
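The following sketch implements this screening rule under stated assumptions: monotonicity is checked on the ordered WOE values, and the slope is taken as the least-squares slope of WOE against the bin midpoints (the patent does not fix how the slope is measured):

```python
def is_monotonic(ys):
    pairs = list(zip(ys, ys[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def pick_steepest_monotonic(curves):
    """curves: one [(x, woe), ...] point list per candidate binning result.
    Returns the index of the monotonic curve with the largest |slope|,
    or None if no curve is monotonic (handled separately by feature type)."""
    best, best_slope = None, -1.0
    for idx, pts in enumerate(curves):
        xs, ys = [x for x, _ in pts], [y for _, y in pts]
        if len(pts) < 2 or not is_monotonic(ys):
            continue
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        slope = (sum((x - mx) * (y - my) for x, y in pts)
                 / sum((x - mx) ** 2 for x in xs))  # least-squares slope
        if abs(slope) > best_slope:
            best, best_slope = idx, abs(slope)
    return best
```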
In this embodiment, an evidence weight curve is selected by monotonicity and steepness, and the candidate binning result corresponding to the selected curve is taken as the target binning result, which improves the accuracy of the target binning result.
Further, the step of screening the candidate binning results according to the evidence weight curves to obtain the target binning result of the feature variable may further include: when no monotonic evidence weight curve exists, obtaining the feature type of the feature variable; when the feature type is the first feature type, obtaining a preset evidence weight standard curve, calculating the similarity between each evidence weight curve and the evidence weight standard curve, selecting the evidence weight curve with the largest similarity, and determining the candidate binning result corresponding to it as the target binning result of the feature variable; and when the feature type is the second feature type, re-iterating over the modeling sample set.
Specifically, when no monotonic evidence weight curve exists, the feature type of the feature variable is obtained; it is determined by the business attribute of the feature variable. The feature types include a first feature type and a second feature type. This application may preset a feature type table and look up the feature type of the feature variable in it.
For the first feature type, the non-monotonicity of the evidence weight curve can be explained from the business perspective of the feature variable, so the curve is not required to be monotonic; for example, when the feature variable is age, the evidence weight curve may legitimately be non-monotonic.
The evidence weight standard curve is preset and monotonic and can be generated from manual experience. There may be several standard curves, each corresponding to a specific set of feature variables; the feature variables corresponding to a standard curve can be configured manually, or the correspondence between standard curves and feature variables can be determined from the value ranges of the feature variables.
The similarity between each evidence weight curve and the evidence weight standard curve is then calculated; existing curve similarity algorithms can be used. The similarity characterizes how close an evidence weight curve is to the standard curve: the larger the similarity, the closer the evidence weight curve is to monotonic. The evidence weight curve with the largest similarity is therefore selected, and the candidate binning result corresponding to it is determined as the target binning result of the feature variable.
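The description defers to existing curve similarity algorithms. As one simple stand-in, the sketch below resamples both curves onto a common grid by linear interpolation and scores similarity as the negative mean squared distance (larger means more similar); this particular choice is an assumption, not specified by the source:

```python
def similarity(curve_a, curve_b, grid_size=50):
    """curve_a, curve_b: [(x, y), ...] point lists sorted by x."""
    def interp(pts, x):  # linear interpolation, flat beyond the endpoints
        if x <= pts[0][0]:
            return pts[0][1]
        if x >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    lo = min(p[0] for p in curve_a + curve_b)
    hi = max(p[0] for p in curve_a + curve_b)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    mse = sum((interp(curve_a, x) - interp(curve_b, x)) ** 2
              for x in grid) / grid_size
    return -mse  # the curve with the largest value is closest to the standard
```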
In one embodiment, a fitted line may be obtained by linearly fitting the points of each evidence weight curve; a fitted line is monotonic, so the fitted line with the largest slope is selected and the candidate binning result corresponding to it is taken as the target binning result of the feature variable.
For the second feature type, a non-monotonic evidence weight curve cannot be explained from the business perspective of the feature variable and is therefore erroneous. In that case the modeling sample set must be iterated over again, or the feature variable is simply discarded and not binned.
In this embodiment, when no monotonic evidence weight curve exists and the feature type of the feature variable is the first feature type, a non-monotonic evidence weight curve can be accepted: a preset evidence weight standard curve is obtained, the similarity between each evidence weight curve and the standard curve is calculated, the evidence weight curve with the largest similarity is closest to monotonic, and the candidate binning result corresponding to it is determined as the target binning result of the feature variable, which ensures the accuracy of the target binning result. When the feature type is the second feature type, the calculation is erroneous and the modeling sample set must be re-iterated to avoid an erroneous target binning result.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a feature binning device. This device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can be applied to various electronic devices.
As shown in FIG. 3, the feature binning device 300 of this embodiment includes: a sample set acquisition module 301, a sample set segmentation module 302, and a result screening module 303, wherein:
the sample set acquisition module 301 is configured to obtain a modeling sample set including a plurality of modeling samples, where each modeling sample includes a label variable and a feature variable to be binned;
the sample set segmentation module 302 is configured to split the modeling sample set along the feature variable dimension, using evidence weight and information value as the decision tree splitting criteria, to obtain candidate binning results under a plurality of decision trees, where the evidence weight and information value are calculated from the label variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and the result screening module 303 is configured to screen the candidate binning results according to the evidence weight of each bin in each candidate binning result, to obtain the target binning result of the feature variable.
In this embodiment, a modeling sample set containing a plurality of modeling samples is obtained, where each modeling sample contains a label variable and a feature variable to be binned; the modeling sample set is split along the feature variable dimension during the growth of decision trees, so feature binning is performed automatically, and using evidence weight and information value directly as the splitting criteria reduces the amount of computation and improves the accuracy of the generated decision trees and candidate binning results; the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result, which enriches the candidate binning results; and the candidate binning results are automatically evaluated and screened as a whole according to the evidence weight of each bin in each candidate binning result, which further improves the accuracy and generation speed of the final target binning result.
In some optional implementations of this embodiment, the feature binning device 300 may further comprise: a configuration acquisition module, configured to obtain decision tree configuration information, where the decision tree configuration information includes the maximum depth and the minimum bin sample count and is used to constrain the decision tree.
In this embodiment, decision tree configuration information including the maximum depth and the minimum bin sample count is obtained and used to control the splitting and generation of the decision trees, which ensures that the generated decision trees are reasonable.
In some optional implementations of this embodiment, the sample set segmentation module 302 may include: a split point determination sub-module, a first-level split sub-module, a calculation sub-module, a first-level screening sub-module, and an iterative split sub-module, wherein:
the split point determination sub-module is configured to determine, at each depth less than or equal to the maximum depth, several feature split points within the value range of the feature variable according to the minimum bin sample count;
the first-level split sub-module is configured to perform a first-level split of the modeling sample set at each feature split point to obtain several first-level binning results;
the calculation sub-module is configured to calculate the evidence weight of each first-level binning result from the label variables of the modeling samples in it, and to calculate the information value of each first-level binning result from its evidence weight;
the first-level screening sub-module is configured to screen the first-level binning results according to the obtained information values;
and the iterative split sub-module is configured to iteratively split the screened first-level binning results until the depth of the resulting decision tree reaches the preset depth, to obtain the candidate binning results.
In this embodiment, determining the feature split points within the value range of the feature variable according to the minimum bin sample count ensures that the feature split points are reasonable; the modeling sample set is split at each feature split point to obtain several first-level binning results; the evidence weight of each first-level binning result is calculated from the label variables of its modeling samples, and its information value is calculated from the evidence weight; screening the first-level binning results by information value keeps only the satisfactory ones and improves their accuracy; and the kept first-level binning results are split iteratively until the depth of the resulting decision tree reaches the preset depth, yielding the candidate binning results and thus automatic feature binning.
In some optional implementations of this embodiment, the first-level screening sub-module may include a descending sort unit and an information value selection unit, wherein:
the descending sort unit is configured to sort the obtained information values in descending order to obtain an information value sequence;
and the information value selection unit is configured to select at least one information value from the information value sequence and keep the first-level binning results corresponding to the selected information values.
In this embodiment, the obtained information values are sorted in descending order into an information value sequence, at least one information value is selected from the sequence, and the corresponding first-level binning results are kept. The kept first-level binning results have the larger information values, which ensures that the best first-level binning results are selected.
In some optional implementations of this embodiment, the result screening module 303 may include a weight acquisition sub-module, a curve plotting sub-module, and a candidate screening sub-module, wherein:
the weight acquisition sub-module is configured to obtain, for each candidate binning result, the evidence weight of each bin in the candidate binning result;
the curve plotting sub-module is configured to plot the evidence weight curve of the candidate binning result from the obtained evidence weights;
and the candidate screening sub-module is configured to screen the candidate binning results according to the evidence weight curves to obtain the target binning result of the feature variable.
In this embodiment, the evidence weight curve of each candidate binning result is plotted from the evidence weights of its bins. The curve shows, as a whole, how the evidence weight changes across the binning result, so a suitable candidate binning result can be selected accurately as the target binning result.
In some optional implementations of this embodiment, the candidate screening sub-module may include a curve determination unit, a slope acquisition unit, and a slope selection unit, wherein:
the curve determination unit is configured to determine, when one monotonic evidence weight curve exists, the candidate binning result corresponding to it as the target binning result of the feature variable;
the slope acquisition unit is configured to obtain, when more than one monotonic evidence weight curve exists, the slope of each monotonic evidence weight curve;
and the slope selection unit is configured to select the monotonic evidence weight curve with the largest slope and take the candidate binning result corresponding to it as the target binning result of the feature variable.
In this embodiment, an evidence weight curve is selected by monotonicity and steepness, and the candidate binning result corresponding to the selected curve is taken as the target binning result, which improves the accuracy of the target binning result.
In some optional implementations of this embodiment, the candidate screening sub-module may further include a type acquisition unit, a standard acquisition unit, a similarity calculation unit, a similarity selection unit, and a re-iteration unit, wherein:
the type acquisition unit is configured to obtain the feature type of the feature variable when no monotonic evidence weight curve exists;
the standard acquisition unit is configured to obtain a preset evidence weight standard curve when the feature type is the first feature type;
the similarity calculation unit is configured to calculate the similarity between each evidence weight curve and the evidence weight standard curve;
the similarity selection unit is configured to select the evidence weight curve with the largest similarity and determine the candidate binning result corresponding to it as the target binning result of the feature variable;
and the re-iteration unit is configured to re-iterate over the modeling sample set when the feature type is the second feature type.
In this embodiment, when no monotonic evidence weight curve exists and the feature type of the feature variable is the first feature type, a non-monotonic evidence weight curve can be accepted: a preset evidence weight standard curve is obtained, the similarity between each evidence weight curve and the standard curve is calculated, the evidence weight curve with the largest similarity is closest to monotonic, and the candidate binning result corresponding to it is determined as the target binning result of the feature variable, which ensures the accuracy of the target binning result. When the feature type is the second feature type, the calculation is erroneous and the modeling sample set must be re-iterated to avoid an erroneous target binning result.
To solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to FIG. 4, FIG. 4 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It should be noted that the figure shows only a computer device 4 having components 41-43, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. As understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device may interact with a user through a keyboard, a mouse, a remote control, a touch pad, a voice control device, or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store an operating system and various application software installed on the computer device 4, such as the computer readable instructions of the feature binning method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the feature binning method.
The network interface 43 may comprise a wireless network interface or a wired network interface; the network interface 43 is typically used to establish a communication connection between the computer device 4 and other electronic devices.
The computer device provided in this embodiment may perform the feature binning method described above, which may be the feature binning method of any of the foregoing embodiments.
In this embodiment, a modeling sample set containing a plurality of modeling samples is obtained, where each modeling sample contains a tag variable and a feature variable to be binned. The modeling sample set is segmented from the feature variable dimension during the splitting of the decision trees, so feature binning is performed automatically. Using evidence weight and information value directly as the decision tree segmentation criteria reduces the amount of computation and improves the accuracy of the generated decision trees and candidate binning results. The depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result, which enriches the candidate binning results. Finally, each candidate binning result is automatically evaluated and screened as a whole according to the evidence weight of each of its bins, further improving the accuracy and generation speed of the final target binning result.
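To make the split criterion concrete, the following is a minimal Python sketch of the standard WOE/IV calculation on which such splits are typically based, assuming a binary tag variable where 1 marks a positive (e.g., default) sample; the smoothing constant eps, the function name, and the grouping of labels by bin are illustrative assumptions rather than the patented implementation.

```python
# Minimal sketch of the evidence weight (WOE) and information value (IV)
# computation, assuming a binary tag variable (1 = positive/bad sample).
# `bins` groups the tag variables of the modeling samples falling into each
# bin; `eps` is a small smoothing constant avoiding division by zero / log(0).
import math


def woe_iv(bins: list[list[int]], eps: float = 0.5) -> tuple[list[float], float]:
    """Return the per-bin WOE values and the total IV of a binning result."""
    total_bad = sum(sum(labels) for labels in bins)
    total_good = sum(len(labels) - sum(labels) for labels in bins)
    woes: list[float] = []
    iv = 0.0
    for labels in bins:
        bad = sum(labels) + eps
        good = len(labels) - sum(labels) + eps
        bad_pct = bad / (total_bad + eps * len(bins))
        good_pct = good / (total_good + eps * len(bins))
        w = math.log(good_pct / bad_pct)  # WOE of this bin
        woes.append(w)
        iv += (good_pct - bad_pct) * w    # each IV term is non-negative
    return woes, iv
```

Under this sketch, a candidate split would be scored by computing woe_iv over the bins it induces and keeping the splits with the highest IV, which mirrors the screening by information value described above.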
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the feature binning method as described above.
The beneficial effects of this embodiment are the same as those of the computer device embodiment described above and are not repeated here.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is preferred. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The embodiments described above are only some, not all, of the embodiments of the present application; the drawings show preferred embodiments but do not limit the patent scope of the application. This application may be embodied in many different forms, and these embodiments are provided so that the disclosure will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of the present application.

Claims (10)

1. A feature binning method, characterized by comprising the following steps:
obtaining a modeling sample set comprising a plurality of modeling samples, wherein the modeling samples comprise tag variables and feature variables to be binned;
segmenting the modeling sample set from the feature variable dimension by taking evidence weight and information value as decision tree segmentation criteria, to obtain candidate binning results under a plurality of decision trees, wherein the evidence weight and the information value are calculated based on the tag variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and screening each candidate binning result according to the evidence weight of each bin in each candidate binning result, to obtain a target binning result of the feature variable.
2. The feature binning method of claim 1, further comprising, after the step of obtaining a modeling sample set comprising a plurality of modeling samples:
obtaining decision tree configuration information, wherein the decision tree configuration information comprises a maximum depth and a minimum number of samples per bin, and is used to constrain the decision trees.
3. The feature binning method according to claim 2, wherein the step of segmenting the modeling sample set from the feature variable dimension by taking evidence weight and information value as decision tree segmentation criteria to obtain candidate binning results under a plurality of decision trees comprises:
determining a plurality of feature split points within the value range of the feature variable according to the minimum number of samples per bin, under the condition that each depth is less than or equal to the maximum depth;
performing first-stage segmentation on the modeling sample set according to each feature split point to obtain a plurality of first-stage binning results;
calculating the evidence weight of each first-stage binning result according to the tag variable of each modeling sample in that result, and calculating the information value of each first-stage binning result according to its evidence weight;
screening the first-stage binning results according to the obtained information values;
and iteratively segmenting the screened first-stage binning results until the depth of the segmented decision tree reaches a preset depth, to obtain the candidate binning results.
4. The feature binning method of claim 3, wherein the step of screening the first-stage binning results according to the obtained information values comprises:
arranging the obtained information values in descending order to obtain an information value sequence;
and selecting at least one information value from the information value sequence, and retaining the first-stage binning result corresponding to each selected information value.
5. The feature binning method according to claim 1, wherein the step of screening each candidate binning result according to the evidence weight of each bin in each candidate binning result to obtain a target binning result of the feature variable comprises:
for each candidate binning result, acquiring the evidence weight of each bin in the candidate binning result;
drawing an evidence weight curve of the candidate binning result according to the acquired evidence weights;
and screening each candidate binning result according to each evidence weight curve to obtain the target binning result of the feature variable.
6. The feature binning method according to claim 5, wherein the step of screening each candidate binning result according to each evidence weight curve to obtain the target binning result of the feature variable comprises:
when a monotonic evidence weight curve exists, determining the candidate binning result corresponding to the monotonic evidence weight curve as the target binning result of the feature variable;
when more than one monotonic evidence weight curve exists, acquiring the gradient of each monotonic evidence weight curve;
and selecting the monotonic evidence weight curve with the largest gradient, and taking the candidate binning result corresponding to that curve as the target binning result of the feature variable.
7. The feature binning method of claim 6, wherein the step of screening each candidate binning result according to each evidence weight curve to obtain the target binning result of the feature variable further comprises:
when no monotonic evidence weight curve exists, acquiring the feature type of the feature variable;
when the feature type is a first feature type, acquiring a preset evidence weight standard curve;
calculating the similarity between each evidence weight curve and the evidence weight standard curve;
selecting the evidence weight curve with the greatest similarity, and determining the candidate binning result corresponding to that curve as the target binning result of the feature variable;
and when the feature type is a second feature type, re-iterating over the modeling sample set.
8. A feature binning device, comprising:
a sample set acquisition module, configured to acquire a modeling sample set containing a plurality of modeling samples, wherein the modeling samples contain tag variables and feature variables to be binned;
a sample set segmentation module, configured to segment the modeling sample set from the feature variable dimension by taking evidence weight and information value as decision tree segmentation criteria, to obtain candidate binning results under a plurality of decision trees, wherein the evidence weight and the information value are calculated based on the tag variable, the depth of each decision tree is less than or equal to a preset maximum depth, and each depth corresponds to at least one candidate binning result;
and a result screening module, configured to screen each candidate binning result according to the evidence weight of each bin in each candidate binning result, to obtain a target binning result of the feature variable.
9. A computer device, comprising a memory having stored therein computer readable instructions which, when executed by a processor, implement the steps of the feature binning method according to any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the feature binning method according to any one of claims 1 to 7.
CN202310136258.3A 2023-02-20 2023-02-20 Feature binning method, device, computer equipment and storage medium Pending CN116028788A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310136258.3A | 2023-02-20 | 2023-02-20 | Feature binning method, device, computer equipment and storage medium


Publications (1)

Publication Number | Publication Date
CN116028788A | 2023-04-28

Family ID: 86070742


Country Status (1)

CN (1) CN116028788A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20230428)