CN112580825A - Unsupervised data binning method and unsupervised data binning device - Google Patents

Unsupervised data binning method and unsupervised data binning device

Info

Publication number
CN112580825A
CN112580825A
Authority
CN
China
Prior art keywords
value
binning
setting
box
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110196070.9A
Other languages
Chinese (zh)
Inventor
顾凌云
谢旻旗
段湾
潘峻
朱盈迪
朱江
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai IceKredit Inc
Original Assignee
Shanghai IceKredit Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai IceKredit Inc filed Critical Shanghai IceKredit Inc
Priority to CN202110196070.9A
Publication of CN112580825A
Legal status: Pending

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00  Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The unsupervised data binning method and unsupervised data binning device provided by the invention work as follows: when missing values exist among the original values, all the missing values are set as a first bin and a first bin code is set for the first bin; after the missing values are removed, if the first remaining feature values are all the same fixed value, they are set as a second bin and a second bin code is set for the second bin; after the missing values are removed, if the second remaining feature values are not a fixed value, quantiles are set for them, the values are binned according to the quantiles, and third bin codes are set; and the data feature to be binned is mapped according to the first bin code, the second bin code and the third bin codes to obtain a mapping result. In this way, information loss and bins with too few samples are avoided during binning, fluctuation of data within the same bin has little influence on the model, and the robustness and stability of the model are ensured.

Description

Unsupervised data binning method and unsupervised data binning device
Technical Field
The invention relates to the technical field of data binning, in particular to an unsupervised data binning method and an unsupervised data binning device.
Background
With the development of big data, data services have become increasingly mature, and many business processes depend on analyzing and identifying data. The modeling stability of an artificial intelligence model and the accuracy with which data are identified are therefore key to normal business processing. During the modeling stage, the feature data have a large influence on the stability and goodness of fit of the model. How to bin data accurately and reasonably, so as to avoid information loss and bins with too few samples, is thus a technical problem that urgently needs to be solved at the present stage.
Disclosure of Invention
To address these problems, the invention provides an unsupervised data binning method and an unsupervised data binning device.
An embodiment of the invention provides an unsupervised data binning method, applied to a computer device, which comprises the following steps:
acquiring the original values corresponding to the data feature to be binned;
if missing values exist among the original values, setting all the missing values as a first bin, and setting a first bin code for the first bin;
after the missing values are removed from the original values, if the first remaining feature values are all the same fixed value, setting the first remaining feature values as a second bin, and setting a second bin code for the second bin;
after the missing values are removed from the original values, if the second remaining feature values are not a fixed value, setting quantiles for the second remaining feature values, binning the second remaining feature values according to the quantiles, and setting third bin codes;
and mapping the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and storing the mapping result.
Preferably, setting quantiles for the second remaining feature values includes:
setting quantiles that form an arithmetic progression for the second remaining feature values.
Preferably, binning the second remaining feature values according to the quantiles and setting third bin codes includes:
binning the second remaining feature values according to the quantiles to obtain a plurality of third bins, and setting a third bin code for each third bin.
Preferably, setting a third bin code for each third bin includes:
calculating the mean of the feature values in each third bin;
and setting the third bin codes for the third bins in descending order of the means.
Preferably, after calculating the mean of the feature values in each third bin, the method further comprises:
merging adjacent bins that have the same mean.
An embodiment of the invention also provides an unsupervised data binning device, used in a computer device, which comprises:
a feature acquisition module, configured to acquire the original values corresponding to the data feature to be binned;
a first binning module, configured to set all missing values among the original values as a first bin if missing values exist, and to set a first bin code for the first bin;
a second binning module, configured to, after the missing values are removed from the original values, set the first remaining feature values as a second bin if they are all the same fixed value, and to set a second bin code for the second bin;
a third binning module, configured to, after the missing values are removed from the original values, set quantiles for the second remaining feature values if they are not a fixed value, bin the second remaining feature values according to the quantiles, and set third bin codes;
and a feature mapping module, configured to map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and to store the mapping result.
Preferably, the third binning module is configured to:
set quantiles that form an arithmetic progression for the second remaining feature values.
Preferably, the third binning module is configured to:
bin the second remaining feature values according to the quantiles to obtain a plurality of third bins, and set a third bin code for each third bin.
Preferably, the third binning module is configured to:
calculate the mean of the feature values in each third bin;
and set the third bin codes for the third bins in descending order of the means.
Preferably, after calculating the mean of the feature values in each third bin, the third binning module is further configured to:
merge adjacent bins that have the same mean.
With the unsupervised data binning method and unsupervised data binning device provided by the invention, when missing values exist among the original values, all the missing values are set as a first bin and a first bin code is set for the first bin; after the missing values are removed from the original values, if the first remaining feature values are all the same fixed value, they are set as a second bin and a second bin code is set for the second bin; after the missing values are removed from the original values, if the second remaining feature values are not a fixed value, quantiles are set for them, the values are binned according to the quantiles, and third bin codes are set; finally, the data feature to be binned is mapped according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, which is stored. In this way, information loss and bins with too few samples are avoided during binning, fluctuation of data within the same bin has little influence on the model, and the robustness and stability of the model are ensured.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the invention and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of an unsupervised data binning method according to an embodiment of the present invention.
Fig. 2 is a block diagram of an unsupervised data binning apparatus according to an embodiment of the present invention.
Detailed Description
To better understand the technical solutions of the invention, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the invention are detailed descriptions of the technical solutions, not limitations on them, and that the technical features in the embodiments and examples may be combined with each other provided no conflict arises.
The inventor has found that existing unsupervised binning mainly comes in two types: (1) equidistant binning, in which the feature values between the minimum and the maximum are divided into N intervals of equal width, so the number of samples in each interval may differ; (2) equal-frequency binning, in which the feature values are grouped according to a series of quantiles that form an arithmetic progression, so that each group contains approximately the same number of samples.
Feature binning typically divides the variable values into 5 to 15 groups. With equidistant binning, some intervals may then contain too few samples to be statistically meaningful; with equal-frequency binning, samples whose feature values are close to an extreme and samples whose values are not are placed in the same bin, so information is lost: the behavior of the feature near its maximum or minimum cannot be fully observed.
Further, the general steps of the equidistant binning method are:
1) the computer acquires N samples of a certain feature, whose maximum value is Max and whose minimum value is Min;
2) the number of groups is set to n, the length of each interval is L = (Max - Min)/n, and the interval boundary values are Min, Min + L, Min + 2L, …, Max - L, Max;
3) the feature values are binned according to the interval boundary values.
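As a minimal illustrative sketch of these steps (not part of the patent; the function name equidistant_bins and the sample data are assumptions), equidistant binning can be written in Python with NumPy as follows:

    import numpy as np

    def equidistant_bins(values, n_bins):
        """Split values into n_bins intervals of equal width between Min and Max."""
        values = np.asarray(values, dtype=float)
        vmin, vmax = values.min(), values.max()
        # Interval boundary values: Min, Min + L, Min + 2L, ..., Max, with L = (Max - Min) / n
        edges = np.linspace(vmin, vmax, n_bins + 1)
        # Assign each value an interval index in [0, n_bins - 1]
        idx = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, n_bins - 1)
        return edges, idx

    edges, idx = equidistant_bins([1, 2, 2, 3, 10, 50, 100], n_bins=5)
    print(edges)  # interval boundary values
    print(idx)    # bin index per sample; the counts per bin may differ widely

Note how the five small values all land in the first interval, which is exactly the sparse-interval problem discussed below.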
The general steps of the equal-frequency binning method are as follows:
1) the computer acquires N samples of a certain feature;
2) the number of groups is set to n, and a series of quantiles Q forming an arithmetic progression is determined: q1, q2, q3, q4, …, qn; the feature value corresponding to each quantile is taken as an interval boundary value;
3) the feature values are binned according to the interval boundary values.
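For comparison, a correspondingly minimal sketch of equal-frequency binning (again illustrative only; the helper name equal_frequency_bins is an assumption):

    import numpy as np

    def equal_frequency_bins(values, n_bins):
        """Use quantiles in arithmetic progression as interval boundary values."""
        values = np.asarray(values, dtype=float)
        # n_bins groups need the n_bins - 1 interior quantiles 1/n, 2/n, ..., (n-1)/n
        quantiles = np.arange(1, n_bins) / n_bins
        edges = np.quantile(values, quantiles)
        # Each group then contains roughly the same number of samples
        idx = np.searchsorted(edges, values, side="right")
        return edges, idx

    edges, idx = equal_frequency_bins([1, 2, 2, 3, 10, 50, 100, 200], n_bins=4)
    print(edges)  # feature values at the quantiles
    print(idx)    # bin index per sample, 0 .. n_bins - 1

The group sizes are now similar, but the largest value 200 shares a bin with 100, so its extreme behavior is no longer visible on its own; this is the information-loss problem described next.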
However, equidistant binning only guarantees that the intervals have equal width, so some intervals may contain too few samples; chance effects then dominate and the statistics become unreliable. Equal-frequency binning makes the sample counts of the groups similar, but samples whose feature values are close to an extreme and samples whose values are not are placed in the same bin, so the behavior of the feature near its maximum or minimum cannot be fully observed and information is lost.
The technical solution provided by the embodiments of the invention therefore both reduces the chance that an interval contains too few samples and ensures that larger and smaller feature values are grouped and observed separately.
Referring first to Fig. 1, a flowchart of an unsupervised data binning method applied to a computer device is shown; the method may include the contents described in the following steps S110 to S150.
Step S110: obtain the original values corresponding to the data feature to be binned.
Step S120: if missing values exist among the original values, set all the missing values as a first bin, and set a first bin code for the first bin.
For example, the first bin code may be -1.
Step S130: after the missing values are removed from the original values, if the first remaining feature values are all the same fixed value, set the first remaining feature values as a second bin, and set a second bin code for the second bin.
For example, the second bin code may be 0.
Step S140: after the missing values are removed from the original values, if the second remaining feature values are not a fixed value, set quantiles for the second remaining feature values, bin the second remaining feature values according to the quantiles, and set third bin codes.
For example, the quantiles Q may be q1, q2, q3, q4, …, qn, where q1 is the quantile close to the minimum, qn is the quantile close to the maximum, and q1 to qn may form an arithmetic progression.
Further, the non-missing feature values that are not a fixed value are binned according to the preset quantiles Q. For example, take q1 = 0.02 as the smallest quantile and let V1 be the corresponding feature value; the feature values falling in the interval (-∞, V1) form one group. Take q2 = 0.14, q3 = 0.26, q4 = 0.38, q5 = 0.50, q6 = 0.62, q7 = 0.74, q8 = 0.86 and bin the feature values accordingly. Take q9 = 0.98 as the largest quantile and let V9 be the corresponding feature value; the feature values falling in the interval (V9, +∞) form one group.
As another example, q1 may instead take a value such as 0.005, 0.01, 0.02 or 0.03, and qn a value such as 0.995, 0.99, 0.98 or 0.97; 0.02 and 0.98 above are given only as examples.
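A hedged sketch of how the interval boundary values of step S140 could be computed from these example quantiles (Python/NumPy; the function name quantile_edges is an assumption, and V1 and V9 are the feature values at q1 = 0.02 and q9 = 0.98):

    import numpy as np

    # Example quantiles: 0.02 near the minimum, 0.98 near the maximum,
    # and the rest in an arithmetic progression with step 0.12.
    Q = [0.02, 0.14, 0.26, 0.38, 0.50, 0.62, 0.74, 0.86, 0.98]

    def quantile_edges(values, quantiles=Q):
        """Return the edges (-inf, V1, ..., V9, +inf) for the non-missing, non-fixed values."""
        cuts = np.quantile(np.asarray(values, dtype=float), quantiles)
        # Drop duplicated cut points so that empty intervals do not appear
        cuts = np.unique(cuts)
        return np.concatenate(([-np.inf], cuts, [np.inf]))

    rng = np.random.default_rng(0)
    edges = quantile_edges(rng.normal(size=1000))
    print(edges)  # (-inf, V1) and (V9, +inf) isolate the values near the extremes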
Step S150: map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and store the mapping result.
For example, step S140 yields a mapping M between code values and bin intervals. If a new sample also carries this feature variable, its feature value can generally be mapped according to M, which facilitates subsequent modeling or other operations. If the feature belonged to the second bin during training but, when mapping according to M, a value appears that is neither missing nor equal to the fixed value X, then all such other values are combined into one group, whose code is 1.
Thus, based on the contents of steps S110 to S150 above: when missing values exist among the original values, all the missing values are set as a first bin and a first bin code is set for it; after the missing values are removed, if the remaining feature values are all the same fixed value, they are set as a second bin and a second bin code is set for it; after the missing values are removed, if the remaining feature values are not a fixed value, quantiles are set for them, the values are binned according to the quantiles, and third bin codes are set; finally, the data feature to be binned is mapped according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, which is stored. In this way, information loss and bins with too few samples are avoided during binning, fluctuation of data within the same bin has little influence on the model, and the robustness and stability of the model are ensured.
It will be appreciated that, when binning, the first bin code may instead be -9999 and the second -999; if the feature belonged to the second bin during training but a value appears during mapping that is neither missing nor equal to the fixed value X, all such other values are combined into one group coded as -99; and each third bin may be coded as the mean or median of the data within the bin. The values -9999, -999 and -99 are given only as examples; their main purpose is to fall outside the range of possible means or medians of the feature data.
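The following is a minimal sketch, under the assumptions just stated, of how a stored mapping M might be applied to new samples; the codes -9999, -999 and -99 and the function name map_value are illustrative, not mandated by the patent:

    import math
    import numpy as np

    MISSING_CODE = -9999   # first bin: missing values
    FIXED_CODE = -999      # second bin: the single fixed value X seen in training
    OTHER_CODE = -99       # values that are neither missing nor equal to X

    def map_value(x, fixed_value=None, edges=None, bin_codes=None):
        """Map one raw feature value to its bin code according to the stored mapping M."""
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return MISSING_CODE
        if fixed_value is not None:
            # The feature formed a second bin during training
            return FIXED_CODE if x == fixed_value else OTHER_CODE
        # Third bins: edges are (-inf, V1, ..., Vn, +inf); bin_codes holds one code per
        # interval, for example the mean or median of the training data in that interval.
        i = np.searchsorted(edges, x, side="right") - 1
        return bin_codes[i]

    edges = np.array([-np.inf, 0.0, 1.0, np.inf])
    bin_codes = [-0.7, 0.4, 1.9]   # e.g. per-bin means learned during training
    print(map_value(float("nan"), edges=edges, bin_codes=bin_codes))  # -9999
    print(map_value(0.5, edges=edges, bin_codes=bin_codes))           # 0.4
    print(map_value(7, fixed_value=3))                                # -99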
Optionally, setting quantiles for the second remaining feature values in step S140 includes: setting quantiles that form an arithmetic progression for the second remaining feature values.
Optionally, binning the second remaining feature values according to the quantiles and setting third bin codes in step S140 includes: binning the second remaining feature values according to the quantiles to obtain a plurality of third bins, and setting a third bin code for each third bin. Further, setting a third bin code for each third bin includes what is described in steps S141 and S142 below.
Step S141: calculate the mean of the feature values in each third bin.
Step S142: set the third bin codes for the third bins in descending order of the means.
Further, between step S141 and step S142, the method further includes: merging adjacent bins that have the same mean.
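A short sketch, again under assumptions (pandas for grouping; the function name third_bin_codes is illustrative), of steps S141 and S142 together with the merging rule: per-bin means are computed, adjacent bins with equal means are merged, and the surviving bins are coded in descending order of their means:

    import numpy as np
    import pandas as pd

    def third_bin_codes(values, edges):
        """Bin values by edges, merge adjacent bins with equal means, code bins by descending mean."""
        s = pd.Series(np.asarray(values, dtype=float))
        bins = pd.cut(s, bins=edges, labels=False, include_lowest=True)
        means = s.groupby(bins).mean().sort_index()
        # Merge adjacent bins whose means are identical
        merged = (means != means.shift()).cumsum()
        merged_means = means.groupby(merged).mean()
        # Codes 1, 2, 3, ... in descending order of the (merged) means
        codes = merged_means.rank(ascending=False, method="first").astype(int)
        return merged, codes

    edges = [-np.inf, 0.0, 1.0, 2.0, np.inf]
    merged, codes = third_bin_codes([-1.0, 0.5, 0.5, 1.5, 3.0], edges)
    print(codes)  # the group with the largest mean receives code 1, and so on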
Based on the same inventive concept, as shown in Fig. 2, an unsupervised data binning apparatus 200 applied to a computer device is also provided, the apparatus comprising:
a feature acquisition module 210, configured to acquire the original values corresponding to the data feature to be binned;
a first binning module 220, configured to set all missing values among the original values as a first bin if missing values exist, and to set a first bin code for the first bin;
a second binning module 230, configured to, after the missing values are removed from the original values, set the first remaining feature values as a second bin if they are all the same fixed value, and to set a second bin code for the second bin;
a third binning module 240, configured to, after the missing values are removed from the original values, set quantiles for the second remaining feature values if they are not a fixed value, bin the second remaining feature values according to the quantiles, and set third bin codes;
and a feature mapping module 250, configured to map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and to store the mapping result.
Optionally, the third binning module 240 is configured to:
set quantiles that form an arithmetic progression for the second remaining feature values.
Optionally, the third binning module 240 is configured to:
bin the second remaining feature values according to the quantiles to obtain a plurality of third bins, and set a third bin code for each third bin.
Optionally, the third binning module 240 is configured to:
calculate the mean of the feature values in each third bin;
and set the third bin codes for the third bins in descending order of the means.
Optionally, after calculating the mean of the feature values in each third bin, the third binning module 240 is further configured to: merge adjacent bins that have the same mean.
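To make the module structure of Fig. 2 concrete, below is a minimal, illustrative sketch of the apparatus as a Python class with fit/transform methods; the class name UnsupervisedBinner, the attribute names and the codes -1, 0, 1, 2, … are assumptions, not the patent's required implementation:

    import numpy as np

    class UnsupervisedBinner:
        """Sketch of apparatus 200: acquire values, build the three kinds of bins, map new data."""

        def __init__(self, quantiles=(0.02, 0.14, 0.26, 0.38, 0.50, 0.62, 0.74, 0.86, 0.98)):
            self.quantiles = quantiles
            self.fixed_value = None   # set when the feature forms a second bin
            self.edges = None         # set when the feature forms third bins

        def fit(self, raw_values):
            # Feature acquisition module plus first/second/third binning modules
            v = np.asarray(raw_values, dtype=float)
            non_missing = v[~np.isnan(v)]
            if non_missing.size and np.unique(non_missing).size == 1:
                self.fixed_value = non_missing[0]                         # second bin
            elif non_missing.size:
                cuts = np.unique(np.quantile(non_missing, self.quantiles))
                self.edges = np.concatenate(([-np.inf], cuts, [np.inf]))  # third bins
            return self

        def transform(self, raw_values):
            # Feature mapping module: -1 for missing, 0 for the fixed value, 1, 2, ... otherwise
            v = np.asarray(raw_values, dtype=float)
            out = np.full(v.shape, -1, dtype=int)
            mask = ~np.isnan(v)
            if self.fixed_value is not None:
                out[mask] = np.where(v[mask] == self.fixed_value, 0, 1)
            elif self.edges is not None:
                out[mask] = np.searchsorted(self.edges, v[mask], side="right")
            return out

    codes = UnsupervisedBinner().fit([np.nan, 1.0, 2.0, 3.0]).transform([np.nan, 2.5, 100.0])
    print(codes)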
In summary, with the unsupervised data binning method and unsupervised data binning device provided by the invention, feature values close to the maximum or minimum are grouped separately from the other feature values, so the information they carry is not lost when the values are large or small, the behavior of the feature in the large-value and small-value groups is captured accurately, and the feature is used more effectively. Equal-frequency binning is used in the intervals other than the large-value and small-value groups, which guarantees that no bin contains too few samples. After the different values within the same bin are transformed to a standard code, data fluctuation within a group has almost no influence on the model, so the model is robust to abnormal data and its stability is improved.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. An unsupervised data binning method, applied to a computer device, the method comprising the steps of:
acquiring the original values corresponding to the data feature to be binned;
if missing values exist among the original values, setting all the missing values as a first bin, and setting a first bin code for the first bin;
after removing the missing values from the original values, if the first remaining feature values are all the same fixed value, setting the first remaining feature values as a second bin, and setting a second bin code for the second bin;
after removing the missing values from the original values, if the second remaining feature values are not a fixed value, setting quantiles for the second remaining feature values, binning the second remaining feature values according to the quantiles, and setting third bin codes;
and mapping the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and storing the mapping result.
2. The method of claim 1, wherein setting quantiles for the second remaining feature values comprises:
setting quantiles that form an arithmetic progression for the second remaining feature values.
3. The method of claim 1, wherein binning the second remaining feature values according to the quantiles and setting third bin codes comprises:
binning the second remaining feature values according to the quantiles to obtain a plurality of third bins, and setting a third bin code for each third bin.
4. The method of claim 3, wherein setting a third bin code for each third bin comprises:
calculating the mean of the feature values in each third bin;
and setting the third bin codes for the third bins in descending order of the means.
5. The method of claim 4, wherein after calculating the mean of the feature values in each third bin, the method further comprises:
merging adjacent bins that have the same mean.
6. An unsupervised data binning device for use with a computer device, the device comprising:
a feature acquisition module, configured to acquire the original values corresponding to the data feature to be binned;
a first binning module, configured to set all missing values among the original values as a first bin if missing values exist, and to set a first bin code for the first bin;
a second binning module, configured to, after the missing values are removed from the original values, set the first remaining feature values as a second bin if they are all the same fixed value, and to set a second bin code for the second bin;
a third binning module, configured to, after the missing values are removed from the original values, set quantiles for the second remaining feature values if they are not a fixed value, bin the second remaining feature values according to the quantiles, and set third bin codes;
and a feature mapping module, configured to map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and to store the mapping result.
7. The device of claim 6, wherein the third binning module is configured to:
set quantiles that form an arithmetic progression for the second remaining feature values.
8. The device of claim 6, wherein the third binning module is configured to:
bin the second remaining feature values according to the quantiles to obtain a plurality of third bins, and set a third bin code for each third bin.
9. The device of claim 8, wherein the third binning module is configured to:
calculate the mean of the feature values in each third bin;
and set the third bin codes for the third bins in descending order of the means.
10. The device of claim 9, wherein after calculating the mean of the feature values in each third bin, the third binning module is further configured to:
merge adjacent bins that have the same mean.
CN202110196070.9A 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device Pending CN112580825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110196070.9A CN112580825A (en) 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110196070.9A CN112580825A (en) 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device

Publications (1)

Publication Number Publication Date
CN112580825A true CN112580825A (en) 2021-03-30

Family

ID=75113904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110196070.9A Pending CN112580825A (en) 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device

Country Status (1)

Country Link
CN (1) CN112580825A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344626A (en) * 2021-06-03 2021-09-03 上海冰鉴信息科技有限公司 Data feature optimization method and device based on advertisement push
CN114329127A (en) * 2021-12-30 2022-04-12 北京瑞莱智慧科技有限公司 Characteristic box dividing method, device and storage medium
CN114329127B (en) * 2021-12-30 2023-06-20 北京瑞莱智慧科技有限公司 Feature binning method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210330