CN112580825A - Unsupervised data binning method and unsupervised data binning device - Google Patents

Unsupervised data binning method and unsupervised data binning device

Info

Publication number
CN112580825A
CN112580825A
Authority
CN
China
Prior art keywords
value
binning
setting
box
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110196070.9A
Other languages
Chinese (zh)
Inventor
顾凌云
谢旻旗
段湾
潘峻
朱盈迪
朱江
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai IceKredit Inc
Original Assignee
Shanghai IceKredit Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai IceKredit Inc filed Critical Shanghai IceKredit Inc
Priority to CN202110196070.9A
Publication of CN112580825A
Legal status: Pending

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00  Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The unsupervised data binning method and unsupervised data binning device provided by the invention work as follows: when missing values exist among the original values, all the missing values are set as a first bin and a first bin code is set for the first bin; after the missing values are removed, if the first remaining feature values are all the same fixed value, they are set as a second bin and a second bin code is set for the second bin; after the missing values are removed, if the second remaining feature values are not a fixed value, quantiles are set for them, the values are binned according to the quantiles, and third bin codes are set; and the data feature to be binned is mapped according to the first bin code, the second bin code and the third bin codes to obtain a mapping result. In this way, information loss and bins with too few samples are avoided during binning, fluctuation of data within the same bin has little influence on the model, and the robustness and stability of the model are ensured.

Description

Unsupervised data binning method and unsupervised data binning device
Technical Field
The invention relates to the technical field of data binning, in particular to an unsupervised data binning method and an unsupervised data binning device.
Background
With the development of big data, data services have become increasingly mature, and many business processes depend on analyzing and identifying data. The modeling stability of an artificial intelligence model and the accuracy with which data are identified are therefore key to normal business processing. During the modeling stage, the feature data have a large influence on the stability and goodness of fit of the model. How to bin data accurately and reasonably, so as to avoid information loss and bins with too few samples, is thus a technical problem that urgently needs to be solved at the present stage.
Disclosure of Invention
To address these problems, the invention provides an unsupervised data binning method and an unsupervised data binning device.
An embodiment of the invention provides an unsupervised data binning method, applied to a computer device, which comprises the following steps:
acquiring the original values corresponding to the data feature to be binned;
if missing values exist among the original values, setting all the missing values as a first bin, and setting a first bin code for the first bin;
after the missing values are removed from the original values, if the first remaining feature values are all the same fixed value, setting the first remaining feature values as a second bin, and setting a second bin code for the second bin;
after the missing values are removed from the original values, if the second remaining feature values are not a fixed value, setting quantiles for the second remaining feature values, binning the second remaining feature values according to the quantiles, and setting third bin codes;
and mapping the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and storing the mapping result.
Preferably, setting quantiles for the second remaining feature values includes:
setting quantiles that form an arithmetic progression for the second remaining feature values.
Preferably, binning the second remaining feature values according to the quantiles and setting third bin codes includes:
binning the second remaining feature values according to the quantiles to obtain a plurality of third bins, and setting a third bin code for each third bin.
Preferably, setting a third bin code for each third bin includes:
calculating the mean of the feature values in each third bin;
and setting the third bin codes for the third bins in descending order of the means.
Preferably, after calculating the mean of the feature values in each third bin, the method further comprises:
merging adjacent bins that have the same mean.
An embodiment of the invention also provides an unsupervised data binning device, used in a computer device, which comprises:
a feature acquisition module, configured to acquire the original values corresponding to the data feature to be binned;
a first binning module, configured to set all missing values among the original values as a first bin if missing values exist, and to set a first bin code for the first bin;
a second binning module, configured to, after the missing values are removed from the original values, set the first remaining feature values as a second bin if they are all the same fixed value, and to set a second bin code for the second bin;
a third binning module, configured to, after the missing values are removed from the original values, set quantiles for the second remaining feature values if they are not a fixed value, bin the second remaining feature values according to the quantiles, and set third bin codes;
and a feature mapping module, configured to map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and to store the mapping result.
Preferably, the third binning module is configured to:
set quantiles that form an arithmetic progression for the second remaining feature values.
Preferably, the third binning module is configured to:
bin the second remaining feature values according to the quantiles to obtain a plurality of third bins, and set a third bin code for each third bin.
Preferably, the third binning module is configured to:
calculate the mean of the feature values in each third bin;
and set the third bin codes for the third bins in descending order of the means.
Preferably, after calculating the mean of the feature values in each third bin, the third binning module is further configured to:
merge adjacent bins that have the same mean.
With the unsupervised data binning method and unsupervised data binning device provided by the invention, when missing values exist among the original values, all the missing values are set as a first bin and a first bin code is set for the first bin; after the missing values are removed from the original values, if the first remaining feature values are all the same fixed value, they are set as a second bin and a second bin code is set for the second bin; after the missing values are removed from the original values, if the second remaining feature values are not a fixed value, quantiles are set for them, the values are binned according to the quantiles, and third bin codes are set; finally, the data feature to be binned is mapped according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, which is stored. In this way, information loss and bins with too few samples are avoided during binning, fluctuation of data within the same bin has little influence on the model, and the robustness and stability of the model are ensured.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the invention and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of an unsupervised data binning method according to an embodiment of the present invention.
Fig. 2 is a block diagram of an unsupervised data binning apparatus according to an embodiment of the present invention.
Detailed Description
To better understand the technical solutions of the invention, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the invention are detailed descriptions of the technical solutions, not limitations on them, and that the technical features in the embodiments and examples may be combined with each other provided no conflict arises.
The inventor has found that existing unsupervised binning mainly comes in two types: (1) equidistant binning, in which the feature values between the minimum and the maximum are divided into N intervals of equal width, so the number of samples in each interval may differ; (2) equal-frequency binning, in which the feature values are grouped according to a series of quantiles that form an arithmetic progression, so that each group contains approximately the same number of samples.
Feature binning typically divides the variable values into 5 to 15 groups. With equidistant binning, some intervals may then contain too few samples to be statistically meaningful; with equal-frequency binning, samples whose feature values are close to an extreme and samples whose values are not are placed in the same bin, so information is lost: the behavior of the feature near its maximum or minimum cannot be fully observed.
Further, the general steps of the equidistant binning method are:
1) the computer acquires N samples of a certain feature, whose maximum value is Max and whose minimum value is Min;
2) the number of groups is set to n, the length of each interval is L = (Max - Min)/n, and the interval boundary values are Min, Min + L, Min + 2L, …, Max - L, Max;
3) the feature values are binned according to the interval boundary values.
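As a minimal illustrative sketch of these steps (not part of the patent; the function name equidistant_bins and the sample data are assumptions), equidistant binning can be written in Python with NumPy as follows:

    import numpy as np

    def equidistant_bins(values, n_bins):
        """Split values into n_bins intervals of equal width between Min and Max."""
        values = np.asarray(values, dtype=float)
        vmin, vmax = values.min(), values.max()
        # Interval boundary values: Min, Min + L, Min + 2L, ..., Max, with L = (Max - Min) / n
        edges = np.linspace(vmin, vmax, n_bins + 1)
        # Assign each value an interval index in [0, n_bins - 1]
        idx = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, n_bins - 1)
        return edges, idx

    edges, idx = equidistant_bins([1, 2, 2, 3, 10, 50, 100], n_bins=5)
    print(edges)  # interval boundary values
    print(idx)    # bin index per sample; the counts per bin may differ widely

Note how the five small values all land in the first interval, which is exactly the sparse-interval problem discussed below.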
The general steps of the equal-frequency binning method are as follows:
1) the computer acquires N samples of a certain feature;
2) the number of groups is set to n, and a series of quantiles Q forming an arithmetic progression is determined: q1, q2, q3, q4, …, qn; the feature value corresponding to each quantile is taken as an interval boundary value;
3) the feature values are binned according to the interval boundary values.
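For comparison, a correspondingly minimal sketch of equal-frequency binning (again illustrative only; the helper name equal_frequency_bins is an assumption):

    import numpy as np

    def equal_frequency_bins(values, n_bins):
        """Use quantiles in arithmetic progression as interval boundary values."""
        values = np.asarray(values, dtype=float)
        # n_bins groups need the n_bins - 1 interior quantiles 1/n, 2/n, ..., (n-1)/n
        quantiles = np.arange(1, n_bins) / n_bins
        edges = np.quantile(values, quantiles)
        # Each group then contains roughly the same number of samples
        idx = np.searchsorted(edges, values, side="right")
        return edges, idx

    edges, idx = equal_frequency_bins([1, 2, 2, 3, 10, 50, 100, 200], n_bins=4)
    print(edges)  # feature values at the quantiles
    print(idx)    # bin index per sample, 0 .. n_bins - 1

The group sizes are now similar, but the largest value 200 shares a bin with 100, so its extreme behavior is no longer visible on its own; this is the information-loss problem described next.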
However, equidistant binning only guarantees that the intervals have equal width, so some intervals may contain too few samples; chance effects then dominate and the statistics become unreliable. Equal-frequency binning makes the sample counts of the groups similar, but samples whose feature values are close to an extreme and samples whose values are not are placed in the same bin, so the behavior of the feature near its maximum or minimum cannot be fully observed and information is lost.
The technical solution provided by the embodiments of the invention therefore both reduces the chance that an interval contains too few samples and ensures that larger and smaller feature values are grouped and observed separately.
Referring first to Fig. 1, a flowchart of an unsupervised data binning method applied to a computer device is shown; the method may include the contents described in the following steps S110 to S150.
Step S110: obtain the original values corresponding to the data feature to be binned.
Step S120: if missing values exist among the original values, set all the missing values as a first bin, and set a first bin code for the first bin.
For example, the first bin code may be -1.
Step S130: after the missing values are removed from the original values, if the first remaining feature values are all the same fixed value, set the first remaining feature values as a second bin, and set a second bin code for the second bin.
For example, the second bin code may be 0.
Step S140: after the missing values are removed from the original values, if the second remaining feature values are not a fixed value, set quantiles for the second remaining feature values, bin the second remaining feature values according to the quantiles, and set third bin codes.
For example, the quantiles Q may be q1, q2, q3, q4, …, qn, where q1 is the quantile close to the minimum, qn is the quantile close to the maximum, and q1 to qn may form an arithmetic progression.
Further, the non-missing feature values that are not a fixed value are binned according to the preset quantiles Q. For example, take q1 = 0.02 as the smallest quantile and let V1 be the corresponding feature value; the feature values falling in the interval (-∞, V1) form one group. Take q2 = 0.14, q3 = 0.26, q4 = 0.38, q5 = 0.50, q6 = 0.62, q7 = 0.74, q8 = 0.86 and bin the feature values accordingly. Take q9 = 0.98 as the largest quantile and let V9 be the corresponding feature value; the feature values falling in the interval (V9, +∞) form one group.
As another example, q1 may instead take a value such as 0.005, 0.01, 0.02 or 0.03, and qn a value such as 0.995, 0.99, 0.98 or 0.97; 0.02 and 0.98 above are given only as examples.
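A hedged sketch of how the interval boundary values of step S140 could be computed from these example quantiles (Python/NumPy; the function name quantile_edges is an assumption, and V1 and V9 are the feature values at q1 = 0.02 and q9 = 0.98):

    import numpy as np

    # Example quantiles: 0.02 near the minimum, 0.98 near the maximum,
    # and the rest in an arithmetic progression with step 0.12.
    Q = [0.02, 0.14, 0.26, 0.38, 0.50, 0.62, 0.74, 0.86, 0.98]

    def quantile_edges(values, quantiles=Q):
        """Return the edges (-inf, V1, ..., V9, +inf) for the non-missing, non-fixed values."""
        cuts = np.quantile(np.asarray(values, dtype=float), quantiles)
        # Drop duplicated cut points so that empty intervals do not appear
        cuts = np.unique(cuts)
        return np.concatenate(([-np.inf], cuts, [np.inf]))

    rng = np.random.default_rng(0)
    edges = quantile_edges(rng.normal(size=1000))
    print(edges)  # (-inf, V1) and (V9, +inf) isolate the values near the extremes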
Step S150: map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and store the mapping result.
For example, step S140 yields a mapping M between code values and bin intervals. If a new sample also carries this feature variable, its feature value can generally be mapped according to M, which facilitates subsequent modeling or other operations. If the feature belonged to the second bin during training but, when mapping according to M, a value appears that is neither missing nor equal to the fixed value X, then all such other values are combined into one group, whose code is 1.
Thus, based on the contents of steps S110 to S150 above: when missing values exist among the original values, all the missing values are set as a first bin and a first bin code is set for it; after the missing values are removed, if the remaining feature values are all the same fixed value, they are set as a second bin and a second bin code is set for it; after the missing values are removed, if the remaining feature values are not a fixed value, quantiles are set for them, the values are binned according to the quantiles, and third bin codes are set; finally, the data feature to be binned is mapped according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, which is stored. In this way, information loss and bins with too few samples are avoided during binning, fluctuation of data within the same bin has little influence on the model, and the robustness and stability of the model are ensured.
It will be appreciated that, when binning, the first bin code may instead be -9999 and the second -999; if the feature belonged to the second bin during training but a value appears during mapping that is neither missing nor equal to the fixed value X, all such other values are combined into one group coded as -99; and each third bin may be coded as the mean or median of the data within the bin. The values -9999, -999 and -99 are given only as examples; their main purpose is to fall outside the range of possible means or medians of the feature data.
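The following is a minimal sketch, under the assumptions just stated, of how a stored mapping M might be applied to new samples; the codes -9999, -999 and -99 and the function name map_value are illustrative, not mandated by the patent:

    import math
    import numpy as np

    MISSING_CODE = -9999   # first bin: missing values
    FIXED_CODE = -999      # second bin: the single fixed value X seen in training
    OTHER_CODE = -99       # values that are neither missing nor equal to X

    def map_value(x, fixed_value=None, edges=None, bin_codes=None):
        """Map one raw feature value to its bin code according to the stored mapping M."""
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return MISSING_CODE
        if fixed_value is not None:
            # The feature formed a second bin during training
            return FIXED_CODE if x == fixed_value else OTHER_CODE
        # Third bins: edges are (-inf, V1, ..., Vn, +inf); bin_codes holds one code per
        # interval, for example the mean or median of the training data in that interval.
        i = np.searchsorted(edges, x, side="right") - 1
        return bin_codes[i]

    edges = np.array([-np.inf, 0.0, 1.0, np.inf])
    bin_codes = [-0.7, 0.4, 1.9]   # e.g. per-bin means learned during training
    print(map_value(float("nan"), edges=edges, bin_codes=bin_codes))  # -9999
    print(map_value(0.5, edges=edges, bin_codes=bin_codes))           # 0.4
    print(map_value(7, fixed_value=3))                                # -99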
Optionally, setting quantiles for the second remaining feature values in step S140 includes: setting quantiles that form an arithmetic progression for the second remaining feature values.
Optionally, binning the second remaining feature values according to the quantiles and setting third bin codes in step S140 includes: binning the second remaining feature values according to the quantiles to obtain a plurality of third bins, and setting a third bin code for each third bin. Further, setting a third bin code for each third bin includes what is described in steps S141 and S142 below.
Step S141: calculate the mean of the feature values in each third bin.
Step S142: set the third bin codes for the third bins in descending order of the means.
Further, between step S141 and step S142, the method further includes: merging adjacent bins that have the same mean.
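A short sketch, again under assumptions (pandas for grouping; the function name third_bin_codes is illustrative), of steps S141 and S142 together with the merging rule: per-bin means are computed, adjacent bins with equal means are merged, and the surviving bins are coded in descending order of their means:

    import numpy as np
    import pandas as pd

    def third_bin_codes(values, edges):
        """Bin values by edges, merge adjacent bins with equal means, code bins by descending mean."""
        s = pd.Series(np.asarray(values, dtype=float))
        bins = pd.cut(s, bins=edges, labels=False, include_lowest=True)
        means = s.groupby(bins).mean().sort_index()
        # Merge adjacent bins whose means are identical
        merged = (means != means.shift()).cumsum()
        merged_means = means.groupby(merged).mean()
        # Codes 1, 2, 3, ... in descending order of the (merged) means
        codes = merged_means.rank(ascending=False, method="first").astype(int)
        return merged, codes

    edges = [-np.inf, 0.0, 1.0, 2.0, np.inf]
    merged, codes = third_bin_codes([-1.0, 0.5, 0.5, 1.5, 3.0], edges)
    print(codes)  # the group with the largest mean receives code 1, and so on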
Based on the same inventive concept, as shown in Fig. 2, an unsupervised data binning apparatus 200 applied to a computer device is also provided, the apparatus comprising:
a feature acquisition module 210, configured to acquire the original values corresponding to the data feature to be binned;
a first binning module 220, configured to set all missing values among the original values as a first bin if missing values exist, and to set a first bin code for the first bin;
a second binning module 230, configured to, after the missing values are removed from the original values, set the first remaining feature values as a second bin if they are all the same fixed value, and to set a second bin code for the second bin;
a third binning module 240, configured to, after the missing values are removed from the original values, set quantiles for the second remaining feature values if they are not a fixed value, bin the second remaining feature values according to the quantiles, and set third bin codes;
and a feature mapping module 250, configured to map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and to store the mapping result.
Optionally, the third binning module 240 is configured to:
set quantiles that form an arithmetic progression for the second remaining feature values.
Optionally, the third binning module 240 is configured to:
bin the second remaining feature values according to the quantiles to obtain a plurality of third bins, and set a third bin code for each third bin.
Optionally, the third binning module 240 is configured to:
calculate the mean of the feature values in each third bin;
and set the third bin codes for the third bins in descending order of the means.
Optionally, after calculating the mean of the feature values in each third bin, the third binning module 240 is further configured to: merge adjacent bins that have the same mean.
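To make the module structure of Fig. 2 concrete, below is a minimal, illustrative sketch of the apparatus as a Python class with fit/transform methods; the class name UnsupervisedBinner, the attribute names and the codes -1, 0, 1, 2, … are assumptions, not the patent's required implementation:

    import numpy as np

    class UnsupervisedBinner:
        """Sketch of apparatus 200: acquire values, build the three kinds of bins, map new data."""

        def __init__(self, quantiles=(0.02, 0.14, 0.26, 0.38, 0.50, 0.62, 0.74, 0.86, 0.98)):
            self.quantiles = quantiles
            self.fixed_value = None   # set when the feature forms a second bin
            self.edges = None         # set when the feature forms third bins

        def fit(self, raw_values):
            # Feature acquisition module plus first/second/third binning modules
            v = np.asarray(raw_values, dtype=float)
            non_missing = v[~np.isnan(v)]
            if non_missing.size and np.unique(non_missing).size == 1:
                self.fixed_value = non_missing[0]                         # second bin
            elif non_missing.size:
                cuts = np.unique(np.quantile(non_missing, self.quantiles))
                self.edges = np.concatenate(([-np.inf], cuts, [np.inf]))  # third bins
            return self

        def transform(self, raw_values):
            # Feature mapping module: -1 for missing, 0 for the fixed value, 1, 2, ... otherwise
            v = np.asarray(raw_values, dtype=float)
            out = np.full(v.shape, -1, dtype=int)
            mask = ~np.isnan(v)
            if self.fixed_value is not None:
                out[mask] = np.where(v[mask] == self.fixed_value, 0, 1)
            elif self.edges is not None:
                out[mask] = np.searchsorted(self.edges, v[mask], side="right")
            return out

    codes = UnsupervisedBinner().fit([np.nan, 1.0, 2.0, 3.0]).transform([np.nan, 2.5, 100.0])
    print(codes)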
In summary, with the unsupervised data binning method and unsupervised data binning device provided by the invention, feature values close to the maximum or minimum are grouped separately from the other feature values, so the information they carry is not lost when the values are large or small, the behavior of the feature in the large-value and small-value groups is captured accurately, and the feature is used more effectively. Equal-frequency binning is used in the intervals other than the large-value and small-value groups, which guarantees that no bin contains too few samples. After the different values within the same bin are transformed to a standard code, data fluctuation within a group has almost no influence on the model, so the model is robust to abnormal data and its stability is improved.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. An unsupervised data binning method, applied to a computer device, the method comprising the steps of:
acquiring the original values corresponding to the data feature to be binned;
if missing values exist among the original values, setting all the missing values as a first bin, and setting a first bin code for the first bin;
after removing the missing values from the original values, if the first remaining feature values are all the same fixed value, setting the first remaining feature values as a second bin, and setting a second bin code for the second bin;
after removing the missing values from the original values, if the second remaining feature values are not a fixed value, setting quantiles for the second remaining feature values, binning the second remaining feature values according to the quantiles, and setting third bin codes;
and mapping the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and storing the mapping result.
2. The method of claim 1, wherein setting quantiles for the second remaining feature values comprises:
setting quantiles that form an arithmetic progression for the second remaining feature values.
3. The method of claim 1, wherein binning the second remaining feature values according to the quantiles and setting third bin codes comprises:
binning the second remaining feature values according to the quantiles to obtain a plurality of third bins, and setting a third bin code for each third bin.
4. The method of claim 3, wherein setting a third bin code for each third bin comprises:
calculating the mean of the feature values in each third bin;
and setting the third bin codes for the third bins in descending order of the means.
5. The method of claim 4, wherein after calculating the mean of the feature values in each third bin, the method further comprises:
merging adjacent bins that have the same mean.
6. An unsupervised data binning device for use with a computer device, the device comprising:
a feature acquisition module, configured to acquire the original values corresponding to the data feature to be binned;
a first binning module, configured to set all missing values among the original values as a first bin if missing values exist, and to set a first bin code for the first bin;
a second binning module, configured to, after the missing values are removed from the original values, set the first remaining feature values as a second bin if they are all the same fixed value, and to set a second bin code for the second bin;
a third binning module, configured to, after the missing values are removed from the original values, set quantiles for the second remaining feature values if they are not a fixed value, bin the second remaining feature values according to the quantiles, and set third bin codes;
and a feature mapping module, configured to map the data feature to be binned according to the first bin code, the second bin code and the third bin codes to obtain a mapping result, and to store the mapping result.
7. The device of claim 6, wherein the third binning module is configured to:
set quantiles that form an arithmetic progression for the second remaining feature values.
8. The device of claim 6, wherein the third binning module is configured to:
bin the second remaining feature values according to the quantiles to obtain a plurality of third bins, and set a third bin code for each third bin.
9. The device of claim 8, wherein the third binning module is configured to:
calculate the mean of the feature values in each third bin;
and set the third bin codes for the third bins in descending order of the means.
10. The device of claim 9, wherein after calculating the mean of the feature values in each third bin, the third binning module is further configured to:
merge adjacent bins that have the same mean.
CN202110196070.9A 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device Pending CN112580825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110196070.9A CN112580825A (en) 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110196070.9A CN112580825A (en) 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device

Publications (1)

Publication Number Publication Date
CN112580825A true CN112580825A (en) 2021-03-30

Family

ID=75113904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110196070.9A Pending CN112580825A (en) 2021-02-22 2021-02-22 Unsupervised data binning method and unsupervised data binning device

Country Status (1)

Country Link
CN (1) CN112580825A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344626A (en) * 2021-06-03 2021-09-03 上海冰鉴信息科技有限公司 Data feature optimization method and device based on advertisement push
CN114329127A (en) * 2021-12-30 2022-04-12 北京瑞莱智慧科技有限公司 Characteristic box dividing method, device and storage medium
CN114329127B (en) * 2021-12-30 2023-06-20 北京瑞莱智慧科技有限公司 Feature binning method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210330