CN113052222A

CN113052222A - Feature binning method, electronic device and storage medium

Info

Publication number: CN113052222A
Application number: CN202110288268.XA
Authority: CN
Inventors: 蔡石林; 管胜; 陈树华
Original assignee: Beijing Dingxiang Technology Co ltd
Current assignee: Beijing Dingxiang Technology Co ltd
Priority date: 2021-03-17
Filing date: 2021-03-17
Publication date: 2021-06-29

Abstract

The application provides a feature binning method, an electronic device and a storage medium, and relates to the technical field of data processing. The feature binning method comprises: obtaining a sample data set to be binned, wherein the sample data set comprises a plurality of sample data, and each sample data is marked with a sample label; performing binning processing on the sample data set based on an initial binning method to obtain a plurality of initial bins; and calculating an evidence weight value of each initial bin, and merging the plurality of initial bins according to the evidence weight values of the initial bins to obtain a plurality of merged target bins.

Description

Feature binning method, electronic device and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a feature binning method, an electronic device, and a storage medium.

Background

Feature binning is a feature preprocessing technique that discretizes continuous variables and merges multi-state discrete variables into fewer states, and it can be applied to the construction of various models.

In the prior art, feature binning is generally realized based on unsupervised binning methods such as equal-frequency binning and equidistant binning.

However, the existing feature binning methods are relatively simple, so that models to which they are applied tend to have poor stability.

Disclosure of Invention

An object of the present invention is to provide a feature binning method, an electronic device, and a storage medium, which can improve the stability of a model when applied to feature processing of various models, in view of the above-described drawbacks of the related art.

In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:

in a first aspect, the present invention provides a feature binning method, comprising:

acquiring a sample data set to be binned, wherein the sample data set comprises a plurality of sample data, and each sample data is marked with a sample label;

performing binning processing on the sample data set based on an initial binning method to obtain a plurality of initial bins;

calculating an evidence weight value of each initial bin, and merging the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins.

In an alternative embodiment, each of the sample data is multi-dimensional feature data;

the performing binning processing on the sample data set based on the initial binning method to obtain a plurality of initial bins comprises:

if the multi-dimensional feature data comprises continuous feature data, performing binning processing on each dimension of the continuous feature data based on an unsupervised binning method to obtain a plurality of initial bins for each dimension of the continuous feature data, wherein the unsupervised binning method comprises at least one of the following: an equal-frequency binning method, an equidistant binning method and a clustering binning method; and/or, if the multi-dimensional feature data comprises discrete feature data, determining a plurality of initial bins for each dimension of the discrete feature data according to the number of discrete values of that dimension, wherein each discrete value corresponds to one initial bin.

In an optional embodiment, the calculating an evidence weight value of each initial bin, and merging the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins, comprises:

calculating the evidence weight values of the plurality of initial bins of each dimension of feature data in the multi-dimensional feature data;

and merging the plurality of initial bins of each dimension of feature data according to a preset bin number threshold, a preset evidence weight threshold and the evidence weight values of the plurality of initial bins of that dimension, to obtain a plurality of merged target bins for each dimension of feature data.

In an optional embodiment, the merging the plurality of initial bins of each dimension of feature data according to the preset bin number threshold, the preset evidence weight threshold and the evidence weight values of the plurality of initial bins of that dimension, to obtain the plurality of merged target bins for each dimension of feature data, comprises:

if it is determined that the number of initial bins of a dimension of feature data is greater than the preset bin number threshold, and the minimum difference between the evidence weight value of any one initial bin and the evidence weight values of its two adjacent initial bins is greater than the preset evidence weight threshold, acquiring the initial bin corresponding to the minimum difference;

and merging that initial bin with the initial bin corresponding to the minimum difference, to obtain the plurality of merged target bins for each dimension of feature data.

In an optional embodiment, after the merging of that initial bin with the initial bin corresponding to the minimum difference, the method further comprises:

if it is determined that the number of initial bins is greater than the preset bin number threshold, calculating the number of samples and the sample proportion of each initial bin, wherein the sample proportion indicates the ratio of the number of samples in the initial bin to the total number of samples;

if the number of samples or the sample proportion of any initial bin meets a preset requirement, acquiring the minimum difference between the evidence weight value of that initial bin and the evidence weight values of its two adjacent initial bins, and merging that initial bin with the initial bin corresponding to the minimum difference.

In an alternative embodiment, the multi-dimensional feature data comprises: discrete feature data;

before the acquiring, if it is determined that the number of initial bins of a dimension of feature data is greater than the preset bin number threshold and the minimum difference between the evidence weight value of any one initial bin and the evidence weight values of its two adjacent initial bins is greater than the preset evidence weight threshold, of the initial bin corresponding to the minimum difference, the method further comprises:

calculating the evidence weight value of the initial bin corresponding to each discrete value in each dimension of the discrete feature data;

sorting the plurality of initial bins of each dimension of feature data according to the evidence weight values of the initial bins corresponding to the discrete values in that dimension, to obtain an initial bin ordering for each dimension of feature data;

and determining the two adjacent initial bins of any one initial bin of each dimension of feature data according to the initial bin ordering of that dimension.

In an alternative embodiment, the method further comprises:

and configuring a missing-value bin for each dimension of the multi-dimensional feature data.

In an optional embodiment, after the merging of the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins, the method further comprises:

in response to a bin editing instruction, determining at least one target bin to be edited among the plurality of target bins, and editing the at least one target bin to be edited to obtain edited target bins;

calculating the information quantity value of the target bins of each dimension of feature data according to the edited target bins;

and filtering the target bins of each dimension of feature data according to the information quantity value of those target bins and a preset information quantity threshold, to obtain the filtered target bins.

In a second aspect, the present invention provides a feature binning apparatus comprising:

an acquisition module, configured to acquire a sample data set to be binned, wherein the sample data set comprises a plurality of sample data, and each sample data is marked with a sample label; a binning module, configured to perform binning processing on the sample data set based on an initial binning method to obtain a plurality of initial bins; and a merging module, configured to calculate an evidence weight value of each initial bin and merge the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins.

In an alternative embodiment, each of the sample data is multi-dimensional feature data; the binning module is specifically configured to: if the multi-dimensional feature data comprises continuous feature data, perform binning processing on each dimension of the continuous feature data based on an unsupervised binning method to obtain a plurality of initial bins for each dimension of the continuous feature data, wherein the unsupervised binning method comprises at least one of the following: an equal-frequency binning method, an equidistant binning method and a clustering binning method; and/or, if the multi-dimensional feature data comprises discrete feature data, determine a plurality of initial bins for each dimension of the discrete feature data according to the number of discrete values of that dimension, wherein each discrete value corresponds to one initial bin.

In an optional embodiment, the binning module is specifically configured to calculate the evidence weight values of the plurality of initial bins of each dimension of feature data in the multi-dimensional feature data; and merge the plurality of initial bins of each dimension of feature data according to a preset bin number threshold, a preset evidence weight threshold and the evidence weight values of the plurality of initial bins of that dimension, to obtain a plurality of merged target bins for each dimension of feature data.

In an optional embodiment, the binning module is specifically configured to: if it is determined that the number of initial bins of a dimension of feature data is greater than the preset bin number threshold, and the minimum difference between the evidence weight value of any one initial bin and the evidence weight values of its two adjacent initial bins is greater than the preset evidence weight threshold, acquire the initial bin corresponding to the minimum difference; and merge that initial bin with the initial bin corresponding to the minimum difference, to obtain the plurality of merged target bins for each dimension of feature data.

In an optional embodiment, the binning module is further configured to: if it is determined that the number of initial bins is greater than the preset bin number threshold, calculate the number of samples and the sample proportion of each initial bin, wherein the sample proportion indicates the ratio of the number of samples in the initial bin to the total number of samples; and, if the number of samples or the sample proportion of any initial bin meets a preset requirement, acquire the minimum difference between the evidence weight value of that initial bin and the evidence weight values of its two adjacent initial bins, and merge that initial bin with the initial bin corresponding to the minimum difference.

In an alternative embodiment, the multi-dimensional feature data comprises discrete feature data; the binning module is further configured to calculate the evidence weight value of the initial bin corresponding to each discrete value in each dimension of the discrete feature data; sort the plurality of initial bins of each dimension of feature data according to the evidence weight values of the initial bins corresponding to the discrete values in that dimension, to obtain an initial bin ordering for each dimension of feature data; and determine the two adjacent initial bins of any one initial bin of each dimension of feature data according to the initial bin ordering of that dimension.

In an optional embodiment, the binning module is further configured to configure a missing-value bin for each dimension of the multi-dimensional feature data.

In an alternative embodiment, the merging module is further configured to: in response to a bin editing instruction, determine at least one target bin to be edited among the plurality of target bins, and edit the at least one target bin to be edited to obtain edited target bins; calculate the information quantity value of the target bins of each dimension of feature data according to the edited target bins; and filter the target bins of each dimension of feature data according to the information quantity value of those target bins and a preset information quantity threshold, to obtain the filtered target bins.

In a third aspect, the present invention provides an electronic device comprising: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the storage medium communicate through the bus, and the processor executes the machine-readable instructions to perform the steps of the feature binning method according to any one of the preceding embodiments.

In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the feature binning method according to any of the preceding embodiments.

The beneficial effects of this application are:

in the feature binning method, the electronic device and the storage medium provided by the embodiments of the application, a sample data set to be binned is obtained, wherein the sample data set comprises a plurality of sample data, and each sample data is marked with a sample label; binning processing is performed on the sample data set based on an initial binning method to obtain a plurality of initial bins; and the evidence weight value of each initial bin is calculated, and the plurality of initial bins are merged according to the evidence weight values of the initial bins to obtain a plurality of merged target bins.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of a feature binning method provided in an embodiment of the present application;

Fig. 2 is a schematic flow chart of another feature binning method provided in an embodiment of the present application;

Fig. 3 is a schematic flow chart of another feature binning method provided in an embodiment of the present application;

Fig. 4 is a schematic flow chart of another feature binning method provided in an embodiment of the present application;

Fig. 5 is a schematic flow chart of another feature binning method provided in an embodiment of the present application;

Fig. 6 is a schematic flow chart of another feature binning method provided in an embodiment of the present application;

Fig. 7 is a functional block diagram of a feature binning apparatus provided in an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Before introducing the present application, the related terms in the present application are explained first:

Equal-frequency binning: the number of samples in each bin is approximately equal.

Equidistant binning: the feature data is partitioned into intervals of equal width.

Clustering binning: a binning method derived from the unsupervised KMeans clustering algorithm; it is an unsupervised learning process and is generally used for grouping data objects according to their feature attributes.
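The first two terms above can be illustrated with a minimal sketch. It assumes NumPy is available; the function names and the sample ages are hypothetical and are not taken from the application. Clustering binning is not shown.

```python
import numpy as np

def equal_width_edges(values, n_bins):
    """Equidistant binning: split [min, max] into n_bins intervals of equal width."""
    values = np.asarray(values, dtype=float)
    return np.linspace(values.min(), values.max(), n_bins + 1)

def equal_frequency_edges(values, n_bins):
    """Equal-frequency binning: cut at quantiles so that each bin
    holds roughly the same number of samples."""
    values = np.asarray(values, dtype=float)
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# Hypothetical ages, for illustration only.
ages = [18, 22, 25, 31, 40, 58, 63, 90]
width_edges = equal_width_edges(ages, 4)
freq_edges = equal_frequency_edges(ages, 4)
# Number of samples that fall between consecutive equal-frequency edges.
freq_counts = np.histogram(ages, bins=freq_edges)[0]
```

With skewed data such as these ages, the equal-width edges leave some intervals nearly empty, while the equal-frequency edges place the same number of samples in each bin.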

Feature binning is a feature preprocessing technique that discretizes continuous variables and merges multi-state discrete variables into fewer states, and it has the following advantages: discrete features are easy to add and remove, which makes rapid iteration of the model easy; inner products of sparse vectors are fast to compute, and the results are convenient to store and easy to extend; discretized features are robust to abnormal data; after a single variable is discretized into N variables, each variable has its own weight, which is equivalent to introducing nonlinearity into the model and can improve the expressive power and the fit of the model; after discretization, feature crossing can be performed, turning M + N variables into M x N variables, which introduces further nonlinearity and improves expressive power; after the features are discretized, the model is more stable, for example, if the user age is discretized with 20-30 as one interval, a user does not become a completely different person merely by growing one year older; and discretizing the features simplifies a logistic regression model and reduces the risk of model overfitting.

In conclusion, binning features more accurately can reduce the risk of model overfitting, make the data sparser, reduce the amount of calculation, and provide strong robustness to abnormal data, so that the model is more stable and performs better. However, the existing feature binning methods are relatively simple, so that models to which they are applied tend to have poor stability.

In view of this, the embodiment of the present application provides a feature binning method, which is applied to feature processing of various models, so as to improve stability of the models.

Fig. 1 is a schematic flowchart of a feature binning method provided in an embodiment of the present application. The execution subject of the method may be a user module of a computer, a server, a processor, or the like. As shown in Fig. 1, the method may include:

S101, obtaining a sample data set to be binned, wherein the sample data set comprises a plurality of sample data, and each sample data is marked with a sample label.

Optionally, the sample data set may be obtained from a database, a file system, a web page, or the like, which may store data from any scenario, such as a shopping scenario, a loan scenario, or a taxi-hailing scenario; this is not limited herein. The sample data set comprises a plurality of sample data, and the sample label of each sample data indicates the type of the sample, namely a positive sample or a negative sample.

Taking a loan scenario as an example, each sample data may correspond to the personal information of a loan user, and the sample label of each loan user may be a positive sample or a negative sample, indicating the credit level of the loan user. For example, if the sample label is a negative sample, the credit level of the loan user is low (a loan, if granted, is likely to become overdue); if the sample label is a positive sample, the credit level of the loan user is high (a loan, if granted, is likely to be repaid on time).

S102, performing binning processing on the sample data set based on an initial binning method to obtain a plurality of initial bins.

The initial binning method may be a preconfigured binning method, for example an unsupervised binning method or a supervised binning method, and is not limited herein. The initial binning method performs initial binning on the obtained sample data set to obtain a plurality of initial bins, so that the binning efficiency can be improved when a re-binning operation is subsequently performed on the basis of these initial bins.

It should be noted that the number of the obtained initial bins may be different according to different initial binning methods, and is not limited herein.

S103, calculating an evidence weight value of each initial bin, and merging the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins.

After the plurality of initial bins are obtained, the evidence weight value (Weight of Evidence, WOE) of each initial bin can be calculated. The evidence weight value characterizes the distribution of positive and negative sample data within a bin, so merging the plurality of initial bins according to their evidence weight values merges them according to differences in sample distribution, which can improve model stability when the merged target bins are used for feature processing of various models. Optionally, the applied models may include, but are not limited to, a scoring model, a risk assessment model, a product recommendation model, and the like.

In some embodiments, the WOE value may be calculated and obtained with reference to the following formula:

where i > 0, WOE_i denotes the WOE value of the i-th initial bin, Bad_i denotes the number of negative samples in the i-th initial bin, Bad_T denotes the total number of negative samples across the initial bins, Good_i denotes the number of positive samples in the i-th initial bin, and Good_T denotes the total number of positive samples across the initial bins.
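The formula itself is not reproduced in this text. A minimal sketch of the calculation, assuming the commonly used form WOE_i = ln((Bad_i / Bad_T) / (Good_i / Good_T)), is given below. This form matches the order in which the symbols are defined, but the sign convention (bad over good rather than good over bad) is an assumption, and the function name and the counts are hypothetical.

```python
import math

def woe_per_bin(bad_counts, good_counts):
    """Return WOE_i = ln((Bad_i / Bad_T) / (Good_i / Good_T)) for each bin.

    Assumption: bad-over-good sign convention; the opposite convention
    only flips the sign of every value.
    """
    bad_total = sum(bad_counts)
    good_total = sum(good_counts)
    woe = []
    for bad_i, good_i in zip(bad_counts, good_counts):
        # A bin with no negative or no positive samples has no finite WOE;
        # this sketch rejects such input instead of smoothing the counts.
        if bad_i == 0 or good_i == 0:
            raise ValueError("each bin needs at least one positive and one negative sample")
        woe.append(math.log((bad_i / bad_total) / (good_i / good_total)))
    return woe

# Hypothetical negative (bad) and positive (good) sample counts for three bins.
bad = [10, 20, 70]
good = [300, 400, 300]
woe_values = woe_per_bin(bad, good)
```

Under this convention, a bin whose share of negative samples exceeds its share of positive samples has a positive WOE, and a bin with the opposite balance has a negative WOE.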

To sum up, the feature binning method provided by the embodiment of the present application includes: acquiring a sample data set to be binned, wherein the sample data set comprises a plurality of sample data, and each sample data is marked with a sample label; performing binning processing on the sample data set based on an initial binning method to obtain a plurality of initial bins; and calculating an evidence weight value of each initial bin and merging the plurality of initial bins according to the evidence weight values of the initial bins to obtain a plurality of merged target bins.

Optionally, each sample data is multidimensional feature data, and the binning processing is performed on the sample data set based on the initial binning method to obtain a plurality of initial bins, which may include:

if the multi-dimensional feature data comprises continuous feature data, performing binning processing on each dimension of the continuous feature data based on an unsupervised binning method to obtain a plurality of initial bins for each dimension of the continuous feature data, wherein the unsupervised binning method comprises at least one of the following: an equal-frequency binning method, an equidistant binning method and a clustering binning method; and/or, if the multi-dimensional feature data comprises discrete feature data, determining a plurality of initial bins for each dimension of the discrete feature data according to the number of discrete values of that dimension, wherein each discrete value corresponds to one initial bin.

Each sample data may be multi-dimensional feature data, and may of course also be single-dimensional feature data. For multi-dimensional feature data, the sample data set may be binned by the following process to obtain a plurality of initial bins. The multi-dimensional feature data may include continuous feature data and/or discrete feature data, that is, each dimension of feature data may be continuous or discrete. Continuous feature data can be understood as data with numerical values, such as age and income; correspondingly, discrete feature data can be understood as data with text values, such as gender, education level and marital status.

For example, in a loan scenario, each sample data may correspond to the personal information of a loan user, and the personal information may include, but is not limited to, multi-dimensional feature information such as gender, age, income, education level and marital status. Based on the above description, age and income are continuous feature data, and gender, education level and marital status are discrete feature data.

Different binning processing is performed for different data types. For continuous feature data, binning processing may be performed on each dimension of the continuous feature data based on an unsupervised binning method to obtain a plurality of initial bins for each dimension, wherein the unsupervised binning method may be any one of an equal-frequency binning method, an equidistant binning method and a clustering binning method, or may be another type of binning method, and is not limited herein.

Based on the above example, for each sample data, the initial bins of the age dimension and of the income dimension may be obtained respectively. The number of initial bins of each dimension is not limited herein and may differ according to the initial binning method used for that dimension.
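The per-dimension handling of continuous features described above can be sketched as follows. The sketch assumes NumPy and uses equidistant binning only; the function name, the column names, the values and the per-dimension bin counts are hypothetical.

```python
import numpy as np

def assign_equal_width_bins(values, n_bins):
    """Return, for each value, the index (0 .. n_bins - 1) of the
    equal-width initial bin it falls into."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Only interior edges are passed, so the maximum value lands in the last bin.
    return np.digitize(values, edges[1:-1])

# Hypothetical continuous dimensions of four loan users.
samples = {"age": [21, 35, 48, 62], "income": [3000, 4500, 12000, 8000]}
# Each dimension may use its own number of initial bins.
n_bins_per_dim = {"age": 2, "income": 3}

initial_bins = {
    dim: assign_equal_width_bins(column, n_bins_per_dim[dim])
    for dim, column in samples.items()
}
```

Each dimension is binned independently, so the age and income columns end up with different numbers of initial bins.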

For the discrete feature data, a plurality of initial bins of each dimension feature data corresponding to the discrete feature data can be determined according to the number of discrete values of each dimension feature data corresponding to the discrete feature data.

Based on the above example, for each sample data, the initial bins of the gender dimension, the education level dimension and the marital status dimension may be obtained respectively. Taking the gender dimension as an example, it includes two discrete values, so the gender dimension has two initial bins: the discrete value "male" corresponds to one initial bin, and the discrete value "female" corresponds to another. The other dimensions are handled in the same way as the gender dimension, and details are not repeated here.
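The one-bin-per-discrete-value rule can be sketched in a few lines. The function name and the sample column are hypothetical; numbering the bins in order of first appearance is a choice made only for this illustration.

```python
def initial_bins_for_discrete(values):
    """Map each distinct value to its own initial bin index,
    numbered in order of first appearance."""
    bin_of_value = {}
    for v in values:
        if v not in bin_of_value:
            bin_of_value[v] = len(bin_of_value)
    return bin_of_value

# Hypothetical gender column of five samples.
gender = ["male", "female", "female", "male", "female"]
gender_bins = initial_bins_for_discrete(gender)
assigned = [gender_bins[v] for v in gender]
```

The number of initial bins equals the number of distinct values, which is two for the gender dimension.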

Fig. 2 is a schematic flow chart of another feature binning method provided in the embodiment of the present application. Optionally, as shown in Fig. 2, the calculating an evidence weight value of each initial bin, and merging the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins, may include:

S201, calculating the evidence weight values of the plurality of initial bins of each dimension of feature data in the multi-dimensional feature data.

S202, merging the plurality of initial bins of each dimension of feature data according to a preset bin number threshold, a preset evidence weight threshold and the evidence weight values of the plurality of initial bins of that dimension, to obtain a plurality of merged target bins for each dimension of feature data.

The evidence weight values of the plurality of initial bins of each dimension of feature data can be calculated with reference to the above calculation formula, and details are not described herein again. The preset bin number threshold is a preset limit on the target number of bins; its value may be 5, 10, 20, or any other value according to the actual application scenario, which is not limited herein. When the sample data is multi-dimensional feature data, each dimension of feature data may have its own preset bin number threshold and preset evidence weight threshold; optionally, these thresholds may be the same or different across dimensions, which is not limited herein.

It can be understood that, when the sample data is multi-dimensional feature data, the plurality of initial bins of each dimension can be merged according to the preset bin number threshold, the preset evidence weight threshold, and the evidence weight values of the initial bins of that dimension. The initial bins are thus merged according to differences in sample distribution, and the number of target bins of each dimension after merging can be smaller than the preset bin number threshold.
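To make step S201 concrete, the sketch below computes an evidence weight (WOE) value for each initial bin of one dimension. It assumes the common definition WOE_i = ln((Bad_i / Bad_T) / (Good_i / Good_T)), which is an assumption of this example and may differ in sign convention from the calculation formula referred to above. The function name `woe_per_bin` and the counts are hypothetical, and every bin is assumed to contain at least one positive and one negative sample.

```python
import math


def woe_per_bin(bins):
    """bins: list of (good_count, bad_count), one pair per initial bin.

    Returns one WOE value per bin, assuming
    WOE_i = ln((bad_i / bad_total) / (good_i / good_total)).
    """
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    return [math.log((b / bad_total) / (g / good_total)) for g, b in bins]


# Hypothetical (good, bad) counts for three initial bins of the income dimension.
income_bins = [(40, 10), (30, 20), (30, 30)]
income_woe = woe_per_bin(income_bins)
```

Under this convention, a bin whose share of negative samples equals its share of positive samples has a WOE of zero, so neighbouring bins with close WOE values have similar sample distributions.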

Fig. 3 is a schematic flow chart of another feature binning method provided in the embodiment of the present application. Optionally, as shown in fig. 3, the merging of the plurality of initial bins of each dimension of feature data according to the preset bin number threshold, the preset evidence weight threshold, and the evidence weight values of the plurality of initial bins of that dimension, to obtain the plurality of merged target bins of each dimension of feature data, may include:

S301, if it is determined that the number of initial bins of a dimension of feature data is greater than the preset bin number threshold, and the minimum difference between the evidence weight value of any initial bin and the evidence weight values of its two adjacent initial bins is greater than the preset evidence weight threshold, acquiring the initial bin corresponding to the minimum difference.

S302, merging that initial bin with the initial bin corresponding to the minimum difference, to obtain the plurality of merged target bins of each dimension of feature data.

Specifically, during merging, the number of initial bins of each dimension of feature data may be compared with the preset bin number threshold, and for each initial bin the minimum difference between its evidence weight value and the evidence weight values of its two adjacent initial bins may be calculated. If the number of initial bins is greater than the preset bin number threshold and the minimum difference is greater than the preset evidence weight threshold, the adjacent initial bin corresponding to the minimum difference is obtained and merged with the initial bin in question.

It should be noted that, for a plurality of initial bins of continuous feature data, the plurality of initial bins may be sorted according to their continuity, and two adjacent initial bins of any initial bin may be determined according to the sorting; for a plurality of initial bins of discrete feature data, two adjacent initial bins of any initial bin can be determined according to the evidence weight value of each initial bin.
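Steps S301 and S302 can be sketched as a loop over an ordered list of bins. This is only an illustration under stated assumptions: the WOE definition ln((Bad_i / Bad_T) / (Good_i / Good_T)) is assumed, the gating condition is implemented as worded in S301 (merging continues only while the smallest adjacent WOE difference exceeds the preset evidence weight threshold), the WOE values are recomputed after every merge, and the names `woe_values` and `merge_by_woe` and all counts are hypothetical.

```python
import math


def woe_values(bins):
    """WOE per bin for a list of (good, bad) counts (assumed definition)."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    return [math.log((b / bad_total) / (g / good_total)) for g, b in bins]


def merge_by_woe(bins, max_bins, woe_threshold):
    """Repeat S301/S302 on an ordered list of (good, bad) bins."""
    bins = list(bins)
    while len(bins) > max_bins:
        w = woe_values(bins)
        # Smallest WOE gap between neighbouring bins, and the left index of that pair.
        gap, left = min((abs(w[k + 1] - w[k]), k) for k in range(len(bins) - 1))
        if gap <= woe_threshold:  # condition as worded in S301
            break
        g1, b1 = bins[left]
        g2, b2 = bins[left + 1]
        bins[left:left + 2] = [(g1 + g2, b1 + b2)]  # S302: merge the pair
    return bins


# Hypothetical (good, bad) counts for four ordered initial bins.
initial = [(50, 5), (40, 10), (38, 11), (20, 30)]
merged = merge_by_woe(initial, max_bins=3, woe_threshold=0.05)
unchanged = merge_by_woe(initial, max_bins=3, woe_threshold=0.5)
```

In this example the second and third bins have the closest WOE values, so they are the pair merged first; with the larger threshold the S301 condition is not met and the bins are left as they are.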

Fig. 4 is a schematic flow chart of another feature binning method provided in the embodiments of the present application. Based on the foregoing embodiment, as shown in fig. 4, after the merging of any initial bin with the initial bin corresponding to the minimum difference, the method may further include:

S401, if it is determined that the number of initial bins is greater than the preset bin number threshold, calculating and obtaining the number of samples and the sample proportion of each initial bin, wherein the sample proportion indicates the ratio of the number of samples in the initial bin to the total number of samples.

S402, if it is determined that the number of samples or the sample proportion of any initial bin meets a preset requirement, obtaining the minimum difference between the evidence weight value of that initial bin and the evidence weight values of its two adjacent initial bins, and merging that initial bin with the initial bin corresponding to the minimum difference.

It can be understood that, after the merging of the above embodiments, the number of initial bins may still be greater than the preset bin number threshold. In this case, the number of samples and the sample proportion of each initial bin may be calculated, and it may be determined whether they meet the preset requirement (for example, the number of samples of the initial bin is less than a preset number threshold, or the sample proportion is less than a preset sample proportion threshold). If so, the minimum difference between the evidence weight value of that initial bin and the evidence weight values of its two adjacent initial bins is calculated, and the initial bin is merged with the adjacent initial bin corresponding to the minimum difference, yielding a plurality of merged target bins. By applying the embodiments of the present application, the total number of bins, the number of samples, and the sample proportion of the target bins can all meet the preset requirements, so that the sample data in each target bin is reasonably representative; when the method is applied to feature processing for various models, the stability of the models can be improved.
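The small-bin handling of S401 and S402 might look like the following sketch. The preset requirement is taken from the example given above (sample count below a count threshold, or sample proportion below a proportion threshold); the WOE definition, the function names `woe_values` and `merge_small_bins`, the parameter names, and the counts are all assumptions of this example.

```python
import math


def woe_values(bins):
    """WOE per bin for a list of (good, bad) counts (assumed definition)."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    return [math.log((b / bad_total) / (g / good_total)) for g, b in bins]


def merge_small_bins(bins, max_bins, min_count, min_ratio):
    """While too many bins remain, merge each undersized bin into the
    adjacent bin whose WOE value is closest to its own."""
    bins = list(bins)
    total = sum(g + b for g, b in bins)
    while len(bins) > max_bins:
        small = [k for k, (g, b) in enumerate(bins)
                 if g + b < min_count or (g + b) / total < min_ratio]
        if not small:
            break
        k = small[0]
        w = woe_values(bins)
        neighbours = [j for j in (k - 1, k + 1) if 0 <= j < len(bins)]
        j = min(neighbours, key=lambda n: abs(w[n] - w[k]))
        left = min(k, j)
        g1, b1 = bins[left]
        g2, b2 = bins[left + 1]
        bins[left:left + 2] = [(g1 + g2, b1 + b2)]
    return bins


# Hypothetical bins; the second bin holds only 5 of 185 samples.
bins = [(50, 10), (3, 2), (40, 20), (30, 30)]
result = merge_small_bins(bins, max_bins=3, min_count=10, min_ratio=0.05)
```

Here the undersized bin is absorbed by its right-hand neighbour, whose WOE value is closer than that of the left-hand neighbour.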

Fig. 5 is a schematic flow chart of another feature binning method provided in the embodiment of the present application. Optionally, the multi-dimensional feature data comprises discrete feature data. As shown in fig. 5, before the step of acquiring the initial bin corresponding to the minimum difference if it is determined that the number of initial bins of a dimension of feature data is greater than the preset bin number threshold and the minimum difference between the evidence weight value of any initial bin and the evidence weight values of its two adjacent initial bins is greater than the preset evidence weight threshold, the method further includes:

S501, calculating the evidence weight value of the initial bin corresponding to each discrete value in each dimension of feature data corresponding to the discrete feature data.

S502, sorting the plurality of initial bins of each dimension of feature data according to the evidence weight value of the initial bin corresponding to each discrete value, to obtain an initial bin ordering for that dimension.

S503, determining, according to the initial bin ordering of each dimension of feature data, the two adjacent initial bins of any initial bin of that dimension.

For each dimension of feature data corresponding to the discrete feature data, the evidence weight value of the initial bin corresponding to each discrete value can be calculated, and the initial bins of that dimension can be sorted by evidence weight value to obtain an initial bin ordering. The two adjacent initial bins of any initial bin can then be determined from this ordering, so that the initial bin whose sample distribution differs least from that of a given initial bin can be found quickly.
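Steps S501 to S503 can be illustrated as follows: discrete values have no natural order, so adjacency is defined by sorting their initial bins by evidence weight value. The WOE definition, the names `order_by_woe` and `neighbours`, and the categories and counts of the academic dimension are assumptions made for this sketch.

```python
import math


def order_by_woe(counts):
    """counts: {discrete_value: (good, bad)} -> values sorted by ascending WOE."""
    good_total = sum(g for g, _ in counts.values())
    bad_total = sum(b for _, b in counts.values())

    def woe(value):
        g, b = counts[value]
        return math.log((b / bad_total) / (g / good_total))

    return sorted(counts, key=woe)


def neighbours(order, value):
    """Adjacent bins of value in the WOE ordering (None at either end)."""
    k = order.index(value)
    prev_bin = order[k - 1] if k > 0 else None
    next_bin = order[k + 1] if k + 1 < len(order) else None
    return prev_bin, next_bin


# Hypothetical (good, bad) counts per discrete value of the academic dimension.
academic = {"primary": (20, 20), "bachelor": (50, 10), "master": (30, 5)}
order = order_by_woe(academic)
```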

Optionally, the method further includes: configuring a missing-value bin for each dimension of feature data corresponding to the multi-dimensional feature data.

It can be understood that, for a given sample, the feature data of some dimensions may be missing or null. Taking a loan scenario as an example, each sample may correspond to the personal information of a loan user, and the marital dimension of a certain loan user may be null.
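One possible way to configure a missing-value bin is to reserve a dedicated bin index for null values, as in the sketch below. The sentinel `MISSING_BIN`, the function name `assign_with_missing`, and the cut points are hypothetical; the application does not specify how the missing-value bin is represented.

```python
import math
from bisect import bisect_right

MISSING_BIN = -1  # hypothetical index reserved for the missing-value bin


def assign_with_missing(value, edges):
    """Return the missing-value bin for None/NaN, otherwise the regular bin index."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING_BIN
    return bisect_right(edges, value)


# Hypothetical cut points for the age dimension.
age_edges = [30, 50]
assigned = [assign_with_missing(v, age_edges) for v in (42, None, float("nan"), 18)]
```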

Fig. 6 is a schematic flow chart of another feature binning method provided in the embodiments of the present application. Optionally, as shown in fig. 6, after the merging of the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins, the method may further include:

S601, in response to a bin editing instruction, determining at least one target bin to be edited among the plurality of target bins, and editing the at least one target bin to be edited to obtain edited target bins.

S602, calculating the information quantity value of the target bins corresponding to each dimension of feature data according to the edited target bins.

S603, filtering the target bins corresponding to each dimension of feature data according to the information quantity value of those target bins and a preset information quantity threshold, to obtain filtered target bins.

Based on the above embodiment, after the plurality of target bins are obtained, a user may further edit them according to the actual application scenario. Optionally, the user may generate a bin editing instruction through an editing control, or through sliding, dragging, clicking, or similar operations on a touch display screen. The bin editing instruction may include an identifier of the target bin to be edited, and the editing manner may include, but is not limited to, merging and splitting: merging combines at least two target bins to be edited into one, and splitting divides one target bin to be edited into at least two. Any editing can thus be carried out according to the requirements of the application scenario to obtain the edited target bins.

Optionally, for the edited target bins, the information quantity value (Information Value, IV) of each target bin corresponding to each dimension of feature data may be further calculated. The IV indicates the importance of the feature information carried by the sample data in a bin: the greater the IV, the higher the importance, and vice versa. After the IVs of the target bins of a dimension are obtained, they can be summed to obtain the total information quantity value of that dimension of feature data; optionally, if the total information quantity value of a dimension is smaller than the preset information quantity threshold, that dimension of feature data may be filtered out, and the filtered target bins are obtained.

Optionally, the IV value may be calculated according to the following formula:

$$IV_i = \left(\frac{Bad_i}{Bad_T} - \frac{Good_i}{Good_T}\right) \times WOE_i$$

wherein $i > 0$; $IV_i$ represents the IV value of the i-th target bin; $WOE_i$ represents the WOE value of the i-th target bin; $Bad_i$ represents the number of negative samples in the i-th target bin; $Bad_T$ represents the total number of negative samples in the target bins; $Good_i$ represents the number of positive samples in the i-th target bin; and $Good_T$ represents the total number of positive samples in the target bins.
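Steps S602 and S603 can be sketched as follows, assuming the per-bin form IV_i = (Bad_i / Bad_T - Good_i / Good_T) x WOE_i with WOE_i = ln((Bad_i / Bad_T) / (Good_i / Good_T)). Both forms, the function names `information_value` and `filter_features`, the threshold 0.02, and the counts are assumptions of this example rather than values from the application.

```python
import math


def information_value(bins):
    """Total IV of one dimension: sum of per-bin IV over its (good, bad) target bins."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    iv = 0.0
    for g, b in bins:
        bad_share = b / bad_total
        good_share = g / good_total
        iv += (bad_share - good_share) * math.log(bad_share / good_share)
    return iv


def filter_features(feature_bins, iv_threshold):
    """Keep only the dimensions whose total IV reaches the preset threshold."""
    return {name: bins for name, bins in feature_bins.items()
            if information_value(bins) >= iv_threshold}


# Hypothetical target bins per dimension, as (good, bad) counts.
features = {
    "age": [(50, 5), (78, 21), (20, 30)],
    "gender": [(50, 25), (50, 25)],
}
kept = filter_features(features, iv_threshold=0.02)
```

In this example the two gender bins have identical sample distributions, so the dimension carries no information under the assumed formula and is filtered out, while the age dimension is retained.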

Fig. 7 is a schematic diagram of the functional modules of a feature binning apparatus provided in an embodiment of the present application. The basic principle and technical effect of the apparatus are the same as those of the corresponding method embodiment; for brevity, for parts not mentioned in this embodiment, reference may be made to the corresponding content of the method embodiment. As shown in fig. 7, the feature binning apparatus 100 includes:

an obtaining module 110, configured to obtain a sample data set to be binned, where the sample data set includes a plurality of sample data, and each sample data is labeled with a sample label; a binning module 120, configured to perform binning processing on the sample data set based on an initial binning method to obtain a plurality of initial bins; and a merging module 130, configured to calculate and obtain an evidence weight value of each initial bin, and merge the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins.

In an alternative embodiment, each of the sample data is multi-dimensional feature data; the binning module 120 is specifically configured to, if the multi-dimensional feature data includes continuous feature data, perform binning processing on each dimension of feature data corresponding to the continuous feature data based on an unsupervised binning method to obtain a plurality of initial bins of each dimension of feature data corresponding to the continuous feature data, where the unsupervised binning method includes at least one of the following: an equal-frequency binning method, an equidistant binning method, and a clustering binning method; and/or, if the multi-dimensional feature data includes discrete feature data, determine a plurality of initial bins of each dimension of feature data corresponding to the discrete feature data according to the number of discrete values of that dimension, where each discrete value corresponds to one initial bin.

In an optional embodiment, the binning module 120 is specifically configured to calculate and obtain evidence weight values of the plurality of initial bins of each dimension of feature data in the multi-dimensional feature data; and merge the plurality of initial bins of each dimension of feature data according to a preset bin number threshold, a preset evidence weight threshold, and the evidence weight values of the plurality of initial bins of that dimension, to obtain a plurality of merged target bins of each dimension of feature data.

In an optional embodiment, the binning module 120 is specifically configured to, if it is determined that the number of initial bins of a dimension of feature data is greater than the preset bin number threshold, and the minimum difference between the evidence weight value of any initial bin and the evidence weight values of its two adjacent initial bins is greater than the preset evidence weight threshold, obtain the initial bin corresponding to the minimum difference; and merge that initial bin with the initial bin corresponding to the minimum difference to obtain the plurality of merged target bins of each dimension of feature data.

In an alternative embodiment, the binning module 120 is further configured to, if it is determined that the number of initial bins is greater than the preset bin number threshold, calculate and obtain the number of samples and the sample proportion of each initial bin, where the sample proportion indicates the ratio of the number of samples in the initial bin to the total number of samples; and, if the number of samples or the sample proportion of any initial bin meets the preset requirement, obtain the minimum difference between the evidence weight value of that initial bin and the evidence weight values of its two adjacent initial bins, and merge that initial bin with the initial bin corresponding to the minimum difference.

In an alternative embodiment, the multi-dimensional feature data comprises discrete feature data; the binning module 120 is further configured to calculate and obtain the evidence weight value of the initial bin corresponding to each discrete value in each dimension of feature data corresponding to the discrete feature data; sort the plurality of initial bins of each dimension of feature data according to the evidence weight value of the initial bin corresponding to each discrete value, to obtain an initial bin ordering for that dimension; and determine, according to the initial bin ordering of each dimension of feature data, the two adjacent initial bins of any initial bin of that dimension.

In an optional embodiment, the binning module 120 is further configured to configure a missing-value bin for each dimension of feature data corresponding to the multi-dimensional feature data.

The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device may include: a processor 210, a storage medium 220, and a bus 230, wherein the storage medium 220 stores machine-readable instructions executable by the processor 210, and when the electronic device is operated, the processor 210 communicates with the storage medium 220 via the bus 230, and the processor 210 executes the machine-readable instructions to perform the steps of the above-mentioned method embodiments. The specific implementation and technical effects are similar, and are not described herein again.

Optionally, the present application further provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the above method embodiments. The specific implementation and technical effects are similar, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims

1. A feature binning method, comprising:

2. The method of claim 1, wherein each of said sample data is multi-dimensional feature data;

if the multi-dimensional feature data comprises continuous feature data, performing binning processing on each dimension of feature data corresponding to the continuous feature data based on an unsupervised binning method to obtain a plurality of initial bins of each dimension of feature data corresponding to the continuous feature data, wherein the unsupervised binning method comprises at least one of the following: an equal-frequency binning method, an equidistant binning method, and a clustering binning method; and/or,

if the multi-dimensional feature data comprises discrete feature data, determining a plurality of initial bins of the feature data corresponding to the discrete feature data according to the number of discrete values of the feature data corresponding to the discrete feature data, wherein each discrete value in the feature data corresponding to the discrete feature data corresponds to one initial bin.

3. The method according to claim 2, wherein the calculating to obtain an evidence weight value of each initial bin, and merging the plurality of initial bins according to the evidence weight value of each initial bin to obtain a plurality of merged target bins comprises:

4. The method according to claim 3, wherein the merging the multiple initial bins of each dimensional feature data according to a preset bin count threshold, a preset evidence weight threshold, and an evidence weight value of the multiple initial bins of each dimensional feature data to obtain multiple target bins after merging of each dimensional feature data comprises:

5. The method of claim 4, wherein after combining any of the initial bins and the initial bin corresponding to the minimum difference value, further comprising:

6. The method of claim 4, wherein the multi-dimensional feature data comprises: discrete feature data;

7. The method of claim 2, further comprising:

8. The method according to any one of claims 1 to 7, wherein said merging the plurality of initial bins according to the evidence weight value of each of the initial bins, and after obtaining a plurality of merged target bins, further comprises:

9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method of any of claims 1-8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the feature binning method according to any of claims 1 to 8.