CN111191731A - Data processing method and device, storage medium and electronic equipment - Google Patents

Data processing method and device, storage medium and electronic equipment

Info

Publication number
CN111191731A
CN111191731A (application CN202010003252.5A)
Authority
CN
China
Prior art keywords
sample data
binning
result
value
box
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010003252.5A
Other languages
Chinese (zh)
Inventor
付小勇
Current Assignee
Tongdun Holdings Co Ltd
Original Assignee
Tongdun Holdings Co Ltd
Priority date
Application filed by Tongdun Holdings Co Ltd

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a data processing method, a data processing device, a storage medium and electronic equipment. The method comprises the following steps: determining an alternative demarcation point set based on the Kolmogorov-Smirnov (KS) value of sample data; determining target demarcation points from the alternative demarcation point set based on the number of binning groups and the information value (IV) of the bins; obtaining a binning result of the sample data based on the target demarcation points; and training a model based on the binning result. By discretizing the sample data with a binning scheme based on the KS value and IV, sample data with the same effect on the model's prediction is placed into the same bin, which improves the stability and accuracy of the trained model and reduces the risk of over-fitting.

Description

Data processing method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, an apparatus, a storage medium, and an electronic device.
Background
Generally, when a model is constructed, the sample variables need to be discretized and the model trained on the discretized samples; the trained model is then more stable and the risk of over-fitting is reduced. For example, a logistic regression (Logistic) model used to build an application scorecard requires discretization of the sample variables.
Sample discretization is usually performed by binning. Binning is robust to abnormal data, and after a sample variable is discretized into N dummy variables in a logistic regression model, each dummy variable carries an independent weight, which is equivalent to introducing non-linear features into the model; this improves the model's expressive power, improves its fit, and raises its accuracy. Binning (i.e., sample discretization) is therefore often a central step in the pre-processing of modeling data, and the quality of the binning often affects the scoring performance of the model.
Currently, common binning can be classified into unsupervised binning and supervised binning.
Wherein, the unsupervised sub-box can be divided into:
Equidistant (equal-width) binning: the data are divided into bins that span value ranges of equal width.
Equal-frequency binning: the data are divided into bins that each contain the same number of samples.
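As an illustration only (a minimal sketch, not part of the patent; the helper names are hypothetical), the bin edges for the two unsupervised schemes can be computed as follows:

```python
def equal_width_edges(vals, n):
    """Equidistant binning: interior edges of n bins of equal value width."""
    lo, hi = min(vals), max(vals)
    step = (hi - lo) / n
    return [lo + step * i for i in range(1, n)]

def equal_freq_edges(vals, n):
    """Equal-frequency binning: interior edges of n bins holding (roughly) equal counts."""
    s = sorted(vals)
    return [s[len(s) * i // n] for i in range(1, n)]

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # one abnormal value, 100
print(equal_width_edges(values, 3))      # [34.0, 67.0]
print(equal_freq_edges(values, 3))       # [4, 7]
```

Note how the single abnormal value drags the equal-width edges past almost all of the data, while the equal-frequency edges stay near the bulk of the samples.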
Supervised binning computes the split criterion from the labels. Common supervised schemes include chi-square binning and decision-tree binning, whose split criteria are the chi-square value and the information gain, respectively.
Existing binning methods are coarse, fixed and one-size-fits-all: they offer limited improvement to the model, leave the model's expressive power weak, and easily over-fit, which lowers the model's accuracy; in marketing and risk-control modeling scenarios in particular, they are easily affected by abnormal values.
Therefore, a new data processing method, device, storage medium and electronic device are needed to improve the stability of the model and reduce the risk of over-fitting.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure, and therefore it may contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present invention provides a data processing method, device, storage medium and electronic device, which at least to some extent improve the stability of the model and reduce the risk of over-fitting.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the embodiments of the present invention, there is provided a data processing method, wherein the method includes: determining an alternative demarcation point set based on a Kolmogorov-Smirnov (KS) value of sample data; determining target demarcation points from the alternative demarcation point set based on the number of binning groups and the information value (IV) of the bins; obtaining a binning result of the sample data based on the target demarcation points; and training a model based on the binning result.
In some exemplary embodiments of the invention, based on the foregoing scheme, determining the alternative demarcation point set based on the KS value of the sample data comprises: cyclically binning the sample data based on the KS value of the sample data, and determining the alternative demarcation point set based on a binning result meeting a preset condition.
In some exemplary embodiments of the present invention, based on the foregoing scheme, performing cyclic binning on the sample data based on a KS value of the sample data, and determining an alternative demarcation point set based on a binning result meeting a preset condition, includes: obtaining a box dividing result of the sample data; judging whether the box separation result meets the preset condition or not; if the judgment result is negative, performing box separation based on the KS value of the box separation result, and updating the box separation result of the sample data; and if so, determining an alternative demarcation point set based on the binning result.
In some exemplary embodiments of the present invention, based on the foregoing scheme, obtaining a binning result of the sample data includes: obtaining the updated binning result of the sample data, or obtaining an initial binning result of the sample data, wherein obtaining the initial binning result of the sample data comprises: calculating a KS value of the sample data, and binning the sample data based on the KS value to obtain the initial binning result of the sample data.
In some exemplary embodiments of the invention, based on the foregoing scheme, determining a target demarcation point from the set of alternative demarcation points based on the number of binning groups and the binned information value IV includes: determining all the demarcation point combinations meeting the box grouping number from the alternative demarcation point set; and calculating the IV of each demarcation point combination, and determining a target demarcation point based on the demarcation point combination corresponding to the maximum IV value.
In some exemplary embodiments of the invention, based on the foregoing, the method further comprises: judging whether, among the binning results corresponding to the target demarcation points, there is a binning result whose number of sample data is smaller than a first threshold; and if so, removing from the target demarcation points the demarcation point corresponding to that binning result.
In some exemplary embodiments of the present invention, based on the foregoing scheme, the preset condition includes: at least one of the number of groups binned reaching the group number threshold and the number of sample data within the binning result being less than a second threshold.
According to another aspect of the embodiments of the present invention, there is provided a data processing apparatus, wherein the apparatus includes: a first determining module configured to determine an alternative demarcation point set based on a Kolmogorov-Smirnov (KS) value of sample data; a second determining module configured to determine target demarcation points from the alternative demarcation point set based on the number of binning groups and the information value (IV) of the bins; an obtaining module configured to obtain a binning result of the sample data based on the target demarcation points; and a training module configured to train a model based on the binning result.
In some exemplary embodiments of the present invention, based on the foregoing solution, the first determining module is configured to perform cyclic binning on the sample data based on the KS value of the sample data, and determine the set of alternative demarcation points based on a binning result meeting a preset condition.
In some exemplary embodiments of the invention, based on the foregoing, the first determining module includes: the acquisition unit is configured to acquire the binning result of the sample data; the judging unit is configured to judge whether the box separation result meets the preset condition; the box dividing unit is configured to divide the box based on the KS value of the box dividing result and update the box dividing result of the sample data if the judgment result of the judging unit is negative; and the determining unit is configured to determine the alternative demarcation point set based on the binning result if the judgment result of the judging unit is positive.
In some exemplary embodiments of the present invention, based on the foregoing scheme, the obtaining unit is configured to obtain the updated binning result of the sample data, or to obtain an initial binning result of the sample data; when obtaining the initial binning result, the obtaining unit is configured to calculate a KS value of the sample data and bin the sample data based on the KS value.
In some exemplary embodiments of the invention, based on the foregoing, the second determining module is configured to determine, from the set of alternative cut points, all cut point combinations satisfying the number of binning groups; and calculating the IV of each demarcation point combination, and determining a target demarcation point based on the demarcation point combination corresponding to the maximum IV value.
In some exemplary embodiments of the present invention, based on the foregoing, the apparatus further includes: a judging module configured to judge whether, among the binning results corresponding to the target demarcation points, there is a binning result whose number of sample data is smaller than a first threshold; and a removing module configured to, when the judgment of the judging module is yes, remove from the target demarcation points the demarcation point corresponding to the binning result whose number of sample data is smaller than the first threshold.
In some exemplary embodiments of the present invention, based on the foregoing scheme, the preset condition includes: at least one of the number of groups binned reaching the group number threshold and the number of sample data within the binning result being less than a second threshold.
According to a further aspect of embodiments of the present invention, there is provided a computer readable storage medium having a computer program stored thereon, wherein the program when executed by a processor implements the method steps of the first aspect.
According to still another aspect of the embodiments of the present invention, there is provided an electronic apparatus, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method steps as described in the first aspect.
In embodiments of the invention, an alternative demarcation point set is determined based on the Kolmogorov-Smirnov (KS) value of the sample data; target demarcation points are determined from the alternative demarcation point set based on the number of binning groups and the information value (IV) of the bins; a binning result of the sample data is obtained based on the target demarcation points; and a model is trained based on the binning result. By discretizing the sample data with a binning scheme based on the KS value and IV, sample data with the same effect on the model's prediction is placed into the same bin, which improves the stability and accuracy of the trained model and reduces the risk of over-fitting.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a flow diagram illustrating a method of data processing in accordance with an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of determining a set of alternate demarcation points in accordance with an exemplary embodiment;
FIG. 3 is a block diagram illustrating an apparatus for data processing in accordance with an exemplary embodiment;
FIG. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
The following describes the data processing method according to the embodiment of the present invention in detail with reference to specific embodiments. It should be noted that the execution subject executing the embodiment of the present invention may include a device with computing processing capability to execute, for example: servers and/or terminal devices, but the invention is not limited thereto.
FIG. 1 is a flow diagram illustrating a method of data processing in accordance with an exemplary embodiment.
As shown in fig. 1, the method may include, but is not limited to, the following steps:
In S110, an alternative demarcation point set is determined based on the Kolmogorov-Smirnov (KS) value of the sample data.
In the embodiment of the invention, sample data may first be obtained; the sample data can then be used to train a model. Different models are trained on different sample data: for a risk-control scorecard model, for example, the sample data may be the user's age, income, and so on. In the embodiment of the invention, each dimension of the sample data is binned, the binning result of each dimension is obtained and then input to the model to be trained, and a combined prediction is obtained based on the coefficient (weight) of each dimension in the model.
In the embodiment of the invention, after the sample data is obtained, the alternative boundary point set can be determined based on the KS value of the sample data.
It should be noted that the KS value evaluates the ability to separate positive and negative samples, measured as the difference between the cumulative distributions of positive and negative samples. The larger the cumulative difference between positive and negative samples, the larger the KS value and the stronger the separating power.
The embodiment of the invention provides a method of calculating the KS value: the sample data are sorted by value and divided into several groups, and the cumulative good-sample count and cumulative bad-sample count of each group are computed in sorted order. The cumulative good-sample count of a group is the total number of good samples in that group and all preceding groups; the cumulative bad-sample count is defined analogously for bad samples. Then the cumulative bad-sample proportion and cumulative good-sample proportion of each group are computed: the cumulative bad-sample proportion of a group is its cumulative bad-sample count divided by the total number of bad samples, and the cumulative good-sample proportion is its cumulative good-sample count divided by the total number of good samples. The KS value is the maximum, over all groups, of the absolute difference between the cumulative bad-sample proportion and the cumulative good-sample proportion.
For example, suppose certain sample data are divided into 2 groups, where the first group has 100 good samples and 400 bad samples, and the second group has 400 good samples and 400 bad samples, so that there are 500 good samples and 800 bad samples in total. For the first group, the cumulative good-sample count is 100 and the cumulative bad-sample count is 400, giving a cumulative good-sample proportion of 100/500 = 0.2 and a cumulative bad-sample proportion of 400/800 = 0.5; the absolute difference is 0.3. For the second group, the cumulative good-sample count is 500 and the cumulative bad-sample count is 800, giving proportions 500/500 = 1 and 800/800 = 1; the absolute difference is 0. The KS value of the sample data is therefore 0.3.
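The calculation just described can be sketched as follows (an illustrative helper, not the patent's implementation, assuming per-group (good, bad) counts given in sorted order; the function name is hypothetical):

```python
def ks_value(groups):
    """KS value from per-group (good_count, bad_count) tuples, ordered by
    the sorted sample value: the maximum absolute gap between the cumulative
    bad-sample proportion and the cumulative good-sample proportion."""
    total_good = sum(g for g, _ in groups)
    total_bad = sum(b for _, b in groups)
    cum_good = cum_bad = 0
    ks = 0.0
    for good, bad in groups:
        cum_good += good
        cum_bad += bad
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
    return ks

# The worked example above: (100 good, 400 bad), then (400 good, 400 bad)
print(round(ks_value([(100, 400), (400, 400)]), 6))  # 0.3
```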
It should be noted that the grouping of the sample data used when calculating the KS value does not need to satisfy the preset condition; the preset condition applies to the binning results corresponding to the alternative demarcation points, not to the grouping used when calculating the KS value.
In the embodiment of the invention, when the alternative demarcation point set is determined, the sample data can be subjected to cyclic binning based on the KS value of the sample data, and the alternative demarcation point set is determined based on the binning result meeting the preset condition. Wherein, the preset condition may include but is not limited to: at least one of the number of groups binned reaching the group number threshold and the number of sample data within the binning result being less than a second threshold.
It should be noted that at least one cut point may be included in the alternative set of cut points.
In S120, target demarcation points are determined from the alternative demarcation point set based on the number of binning groups and the information value (IV) of the bins, the target demarcation points being at least one point of the alternative set.
In the embodiment of the invention, when determining the target demarcation points, the number of binning groups may first be obtained; it can be determined by the model to be trained. For example, if the model is used to predict gender, the number of bins may be 2. The invention is not limited to this: the number of bins may also be set by the user.
In the embodiment of the invention, after the number of the sub-box groups is obtained, all the boundary point combinations meeting the number of the sub-box groups can be determined from the alternative boundary point set, the IV of each boundary point combination is calculated, and the target boundary point is determined based on the boundary point combination corresponding to the maximum IV.
For example, suppose there are 10 alternative demarcation points in the alternative demarcation point set and the number of binning groups is 2. Since binning into k groups requires k − 1 demarcation points, the number of demarcation point combinations satisfying the group count is the number of ways to choose 1 point from the 10 alternatives, i.e. C(10, 1) = 10.
After the demarcation point combinations are obtained, the sample data are binned according to each combination, the IV of each combination is calculated, and the combination corresponding to the maximum IV is determined as the target demarcation points.
It should be noted that IV (Information Value) is mainly used to measure the predictive power of an input variable: the larger a variable's IV, the stronger the variable's predictive power. The value range of IV is [0, +∞), and IV is calculated as:

IV = Σᵢ (pyᵢ − pnᵢ) × ln(pyᵢ / pnᵢ)

where pyᵢ is the proportion that the positive samples in group i (the individuals for which the model's predicted variable takes the value "yes", or 1) make up of all positive samples, and pnᵢ is the proportion that the negative samples in group i make up of all negative samples. The larger the IV value, the more the distributions of good and bad samples differ on this variable, i.e., the stronger the variable's discriminating power.
In the embodiment of the present invention, when calculating the IV of each demarcation point combination, the sample data are first binned according to the demarcation points in the combination (for example, a value equal to a demarcation point may be assigned to either the preceding or the following bin, yielding several groupings), and the IV of the combination is then obtained from the IV formula. After the IV of every combination has been calculated, the combination corresponding to the maximum IV is determined as the target demarcation points.
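The selection in S120 can be sketched end to end (an illustrative implementation under assumptions the patent leaves open: labels use 1 for positive/good and 0 for negative/bad, a value equal to a demarcation point falls in the following bin, and zero-proportion terms are skipped to avoid log(0); all names are hypothetical):

```python
import itertools
import math

def iv_of_bins(bins):
    """IV = sum over bins of (p_pos - p_neg) * ln(p_pos / p_neg), where
    p_pos / p_neg are the bin's shares of all positive / negative samples."""
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)
    iv = 0.0
    for pos, neg in bins:
        p_pos, p_neg = pos / total_pos, neg / total_neg
        if p_pos > 0 and p_neg > 0:      # skip terms that would take log(0)
            iv += (p_pos - p_neg) * math.log(p_pos / p_neg)
    return iv

def bin_pos_neg(values, labels, cuts):
    """Count (positive, negative) samples in each bin induced by the cuts."""
    cuts = sorted(cuts)
    counts = [[0, 0] for _ in range(len(cuts) + 1)]
    for v, y in zip(values, labels):
        i = sum(v >= c for c in cuts)    # index of the bin containing v
        counts[i][0 if y == 1 else 1] += 1
    return [tuple(c) for c in counts]

def best_cut_points(values, labels, candidates, n_bins):
    """Try every combination of n_bins - 1 alternative demarcation points
    and keep the combination with the maximum IV."""
    best, best_iv = None, float("-inf")
    for combo in itertools.combinations(sorted(candidates), n_bins - 1):
        iv = iv_of_bins(bin_pos_neg(values, labels, combo))
        if iv > best_iv:
            best, best_iv = combo, iv
    return best, best_iv

values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
print(best_cut_points(values, labels, [2.5, 5, 11.5], n_bins=2))  # ((5,), 1.0986...)
```

Here the cut at 5 wins because it concentrates the bad samples in the lower bin and the good samples in the upper bin, giving the largest IV.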
In S130, the binning result of the sample data based on the target demarcation points is obtained.
In the embodiment of the invention, after the target demarcation points are obtained, the binning result of the sample data based on those points can be obtained.
In the embodiment of the invention, after the binning result for the target demarcation points is obtained, it can further be judged whether any bin in that result contains fewer sample data than a first threshold; if so, the demarcation point corresponding to that bin is removed from the target demarcation points.
For example, suppose the target demarcation points are 5 and 15, so the corresponding binning result consists of three bins: values less than 5, values greater than or equal to 5 and less than 15, and values greater than or equal to 15. If the bin of values less than 5 contains 25 sample data, which is below the first threshold of 30, the demarcation point 5 is removed from the target demarcation points.
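This pruning step can be sketched as follows (an illustrative helper; the patent does not spell out which neighbouring bin absorbs an undersized bin, so this sketch drops the undersized bin's right-hand demarcation point, or its left-hand one for the last bin; names are hypothetical):

```python
def prune_cut_points(values, cuts, min_count=30):
    """Remove any target demarcation point whose bin holds fewer than
    min_count sample data, as in the example above."""
    cuts = sorted(cuts)
    edges = [float("-inf"), *cuts, float("inf")]
    keep = list(cuts)
    for i in range(len(edges) - 1):
        n = sum(edges[i] <= v < edges[i + 1] for v in values)
        if n < min_count:
            # drop the demarcation point bounding the undersized bin
            victim = edges[i + 1] if edges[i + 1] != float("inf") else edges[i]
            if victim in keep:
                keep.remove(victim)
    return keep

# Example above: 25 samples below 5 (< threshold 30) -> point 5 is removed
samples = [4] * 25 + [10] * 40 + [20] * 40
print(prune_cut_points(samples, [5, 15]))  # [15]
```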
It should be noted that the first threshold and the second threshold may be the same or different.
In S140, a model is trained based on the binning result.
In the embodiment of the invention, the model may be any of various models, such as a risk-control scorecard model.
In embodiments of the invention, an alternative demarcation point set is determined based on the Kolmogorov-Smirnov (KS) value of the sample data; target demarcation points are determined from the alternative demarcation point set based on the number of binning groups and the information value (IV) of the bins; a binning result of the sample data is obtained based on the target demarcation points; and a model is trained based on the binning result. By discretizing the sample data with a binning scheme based on the KS value and IV, sample data with the same effect on the model's prediction is placed into the same bin, which improves the stability and accuracy of the trained model and reduces the risk of over-fitting.
It should be noted that, compared with prior-art binning methods, binning based on the KS value and IV in the embodiment of the present invention places sample data with the same effect on the model's prediction into one bin, so that when the model is subsequently trained on the binned sample data, the relationship between the model's prediction and the sample data in each bin is stronger, enhancing the model's expressiveness, interpretability, and accuracy. For example, for a credit evaluation model, the method provided in the embodiment of the present invention can divide the sample data in the age dimension into 5 bins, such as: 0-18, 19-24, 25-45, 46-100, where the age range of each bin has a more concentrated effect on the result of the credit evaluation model. By contrast, under equidistant binning the second bin would be 19-36; but research shows that the credit data of users aged 19-24 is completely different from that of users aged 25-36: users aged 19-24 are generally students who have just entered, or have not yet entered, society and have a high bad-debt rate, while users aged 25-45 are generally office workers with a lower bad-debt rate, so the two ranges 19-24 and 25-36 should not be placed in one bin. Users aged 36-45 generally have working capacity and a bad-debt rate as low as that of the 25-36 range, so they can be placed in the same bin.
In addition, with the binning method based on the KS value and IV, abnormal data are automatically placed in a bin of their own, so that abnormal data can be identified.
In one embodiment, when binning is performed, special values or missing values in the sample data can additionally be placed in a bin of their own, which can further improve the model's performance, reduce its computational complexity, and increase its running speed.
In one embodiment, after the model has been trained, target data may likewise be binned according to the method above, and the binning result input to the trained model, so that the model's prediction is more accurate. When the target data are binned, they can be binned based on the determined target demarcation points; after the binning result (the bin number) of the target data is obtained, it can be input to the trained model to obtain a prediction for that bin number.
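Mapping new target data onto the learned bins is then a lookup against the target demarcation points (a minimal sketch assuming, as in the example above, that a value equal to a demarcation point falls in the following bin; `bin_index` is a hypothetical name):

```python
import bisect

def bin_index(value, cuts):
    """Bin number of a raw value under the target demarcation points; the
    bin number, not the raw value, is what feeds the trained model."""
    return bisect.bisect_right(sorted(cuts), value)

print(bin_index(4, [5, 15]))   # 0 -> the "< 5" bin
print(bin_index(7, [5, 15]))   # 1 -> the "[5, 15)" bin
print(bin_index(20, [5, 15]))  # 2 -> the ">= 15" bin
```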
In one embodiment, cyclically binning the sample data based on its KS value, and determining the set of alternative demarcation points based on a binning result satisfying a preset condition, may be achieved through the following process. FIG. 2 is a flowchart illustrating a method of determining a set of alternative demarcation points according to an exemplary embodiment. It should be noted that the execution subject of the embodiment of the present invention may be any device with computing and processing capability, for example a server and/or a terminal device, but the invention is not limited thereto.
As shown in fig. 2, the method may include, but is not limited to:
In step S210, a binning result of the sample data is obtained.
In the embodiment of the present invention, obtaining the binning result of the sample data may include: acquiring the updated binning result of the sample data, and acquiring the initial binning result of the sample data.
The step of obtaining the updated binning result of the sample data corresponds to step S230.
Before the loop begins, after the sample data is acquired, the sample data is binned for the first time to obtain its initial binning result. In this first binning, the KS value of the sample data can be calculated, and the sample data binned based on that KS value to obtain the initial binning result.
It should be noted that, when calculating the KS value of the sample data, the sample data first needs to be grouped. For example, each sample may form its own group, although the present invention is not limited thereto; the sample data may also be grouped at a preset interval or a preset frequency. The KS value of the grouped sample data is then calculated, and the sample data is divided into 2 bins based on the KS value.
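The KS computation described here can be sketched as follows, assuming each sample forms its own group, the samples are sorted by value, both classes are present, and label 1 marks a bad sample (the function name and conventions are illustrative):

```python
def ks_split_point(samples):
    """Find the index at which the KS statistic, i.e. the maximum of
    |cumulative bad rate - cumulative good rate|, is reached.

    samples: (value, label) pairs sorted by value, label 1 = bad, 0 = good.
    Returns (index, ks): the last index of the left-hand bin and the KS value.
    """
    total_bad = sum(label for _, label in samples)
    total_good = len(samples) - total_bad
    cum_bad = cum_good = 0
    best_index, best_ks = 0, -1.0
    for i, (_, label) in enumerate(samples):
        cum_bad += label
        cum_good += 1 - label
        ks = abs(cum_bad / total_bad - cum_good / total_good)
        if ks > best_ks:
            best_index, best_ks = i, ks
    return best_index, best_ks
```

Splitting after the returned index then yields the 2 bins mentioned above.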
It should be noted that the grouping performed when calculating the KS value does not need to be checked against the preset condition. The preset condition applies to the binning result corresponding to the alternative demarcation points, not to the grouping used when calculating the KS value.
In step S220, it is determined whether the binning result satisfies the preset condition.
In the embodiment of the present invention, the preset conditions include: at least one of the number of groups binned reaching the group number threshold and the number of sample data within the binning result being less than a second threshold.
It should be noted that the number of binning groups is related to the number of alternative demarcation points obtained: for example, 5 binning groups correspond to 4 alternative demarcation points.
In the embodiment of the invention, when the sample data is binned for the first time, it can be divided into two bins based on the KS value. If the sample data in either bin does not satisfy the preset condition, that bin is binned again, and the cycle repeats. The alternative demarcation points are the sample data corresponding to the KS values from the first binning to the last binning (the KS value of each group of sample data counts as one demarcation point); when every bin is split in each round, the number of demarcation points is 2^n - 1, where n is the number of binning rounds and is a positive integer greater than or equal to 1.
If the number of sample data in a binning result is less than the second threshold, that bin is not binned further; the other bins in which the number of sample data is not yet less than the second threshold are binned again, until every binning result satisfies that its number of sample data is less than the second threshold.
If the determination result is no, S230 is executed, and if the determination result is yes, S240 is executed.
In step S230, binning is performed based on the KS value of the binning result, and the binning result of the sample data is updated.
In the embodiment of the present invention, if the binning result does not satisfy the preset condition, the KS value of the binning result is calculated, the binning result is binned again based on that KS value, the original binning result is updated with the new one, and step S210 is performed again.
For example, let the sample data be a, b, c, d, e, and f. In the first binning, the KS value KS1 is calculated and corresponds to c, so the data is divided into 2 bins based on KS1. The sample corresponding to the KS value may be placed in either the preceding or the following bin; this embodiment places it in the preceding bin, giving bin 1 = [a, b, c] and bin 2 = [d, e, f]. Whether the binning result of each bin satisfies the preset condition is then determined. If neither does, the KS value KS21 of the sample data in bin 1 is calculated and corresponds to b, and the data in bin 1 is binned again based on KS21 to obtain [a, b] and [c]. Likewise, the KS value KS22 of the sample data in bin 2 is calculated and corresponds to e, and the data in bin 2 is binned again based on KS22 to obtain [d, e] and [f]. Whether the binning results satisfy the preset condition is then checked again.
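The walk-through above can be reproduced with a short sketch. Because the actual KS positions depend on the labels, the `split_index` argument below is a stand-in that returns the same split positions (c, then b and e) as the example; it is not the KS computation itself:

```python
def cyclic_bin(values, max_groups, split_index=lambda seg: (len(seg) - 1) // 2):
    """Cyclically split bins, breadth-first, until max_groups is reached.

    `split_index(seg)` returns the last index of the left-hand bin; the
    default stand-in mimics the KS positions of the a..f walk-through.
    """
    bins, cuts = [list(values)], []
    while len(bins) < max_groups:
        new_bins = []
        for seg in bins:
            if len(seg) < 2:            # too small to split further
                new_bins.append(seg)
                continue
            i = split_index(seg)
            cuts.append(seg[i])         # the KS sample is a candidate point
            new_bins.append(seg[:i + 1])
            new_bins.append(seg[i + 1:])
        if new_bins == bins:            # nothing left to split
            break
        bins = new_bins
    return bins, cuts
```

Calling `cyclic_bin("abcdef", 4)` returns the bins [a, b], [c], [d, e], [f] and the candidate points c, b, e, matching the walk-through.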
In step S240, a set of alternative demarcation points is determined based on the binning results.
For example, continuing the above example, suppose the preset condition is that the number of groups reaches the group number threshold, and that threshold is 4. The example yields sample data corresponding to 3 KS values, namely b, c, and e, which divide the data into 4 groups; the preset condition is therefore reached, and the alternative demarcation points are determined to be b, c, and e.
In the embodiment of the invention, the KS value is used to cyclically bin the sample data, and the set of alternative demarcation points is determined based on a binning result satisfying the preset condition, so that the bins corresponding to the determined alternative demarcation points have the same effect on the prediction result of the model, which improves the stability and accuracy of the trained model and reduces the risk of overfitting.
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
The following are apparatus embodiments of the present invention, which may be used to perform the method embodiments of the present invention. In the following description of the apparatus, parts that are the same as in the foregoing method are not described again.
Fig. 3 is a schematic structural diagram illustrating a data processing apparatus according to an exemplary embodiment, wherein the apparatus 300 includes: a first determination module 310, a second determination module 320, an acquisition module 330, and a training module 340.
Wherein the first determination module 310 is configured to determine a set of alternative demarcation points based on a Kolmogorov-Smirnov (KS) value of the sample data;
the second determination module 320 is configured to determine a target demarcation point from the set of alternative demarcation points based on the number of binning groups and the information value (IV) of the bins;
the acquisition module 330 is configured to acquire a binning result of the sample data based on the target demarcation point;
and the training module 340 is configured to train the model based on the binning result.
In the embodiment of the invention, a set of alternative demarcation points is determined based on the Kolmogorov-Smirnov (KS) value of the sample data; a target demarcation point is determined from the set of alternative demarcation points based on the number of binning groups and the information value (IV) of the bins; the binning result of the sample data is obtained based on the target demarcation point; and the model is trained based on the binning result. By discretizing the sample data through binning based on the KS value and the IV, sample data with the same effect on the prediction result of the model is placed into one bin, which improves the stability and accuracy of the trained model and reduces the risk of overfitting.
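The selection step summarized above (computing the IV of candidate demarcation-point combinations and keeping the combination with the maximum IV) might be sketched as follows. The zero-cell handling, skipping bins in which one class is empty, is a simplifying assumption rather than something specified by the embodiment:

```python
import math
from bisect import bisect_left
from itertools import combinations

def bin_counts(samples, cuts):
    """samples: (value, label) pairs, label 0 = good, 1 = bad.
    cuts: sorted inclusive upper bounds.  Returns (n_good, n_bad) per bin."""
    counts = [[0, 0] for _ in range(len(cuts) + 1)]
    for value, label in samples:
        counts[bisect_left(cuts, value)][label] += 1
    return [tuple(c) for c in counts]

def information_value(bins):
    """IV = sum over bins of (pct_good - pct_bad) * ln(pct_good / pct_bad);
    bins with an empty class are skipped (a simple smoothing choice)."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for g, b in bins:
        if g == 0 or b == 0:
            continue
        pg, pb = g / total_good, b / total_bad
        iv += (pg - pb) * math.log(pg / pb)
    return iv

def best_cut_points(samples, candidates, n_groups):
    """Enumerate every combination of candidate demarcation points that
    yields n_groups bins, and return the combination with the largest IV."""
    best, best_iv = None, float("-inf")
    for combo in combinations(sorted(candidates), n_groups - 1):
        iv = information_value(bin_counts(samples, list(combo)))
        if iv > best_iv:
            best, best_iv = combo, iv
    return best, best_iv
```

For a candidate set {3, 5, 7} and 2 binning groups, the sketch enumerates the single-point combinations (3,), (5,), (7,) and keeps the one whose split yields the largest IV.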
Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment. It should be noted that the electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The above-described functions defined in the terminal of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 401.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a first determination module, a second determination module, an acquisition module, and a training module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of data processing, the method comprising:
determining a set of alternative demarcation points based on a Kolmogorov-Smirnov (KS) value of sample data;
determining a target demarcation point from the set of alternative demarcation points based on a number of binning groups and an information value (IV) of the bins;
acquiring a binning result of the sample data based on the target demarcation point;
and training a model based on the binning result.
2. The method of claim 1, wherein determining a set of alternative demarcation points based on a KS value of said sample data comprises:
and circularly binning the sample data based on the KS value of the sample data, and determining an alternative demarcation point set based on a binning result meeting a preset condition.
4. The method of claim 2, wherein circularly binning the sample data based on the KS value of the sample data, and determining the set of alternative demarcation points based on a binning result satisfying a preset condition, comprises:
obtaining a binning result of the sample data;
determining whether the binning result satisfies the preset condition;
if the determination result is negative, performing binning based on the KS value of the binning result, and updating the binning result of the sample data;
and if the determination result is affirmative, determining the set of alternative demarcation points based on the binning result.
4. The method of claim 3, wherein obtaining the binning result of the sample data comprises: acquiring the updated binning result of the sample data, and acquiring an initial binning result of the sample data;
wherein acquiring the initial binning result of the sample data comprises:
calculating a KS value of the sample data, and binning the sample data based on the KS value to obtain the initial binning result of the sample data.
5. The method of claim 1, wherein determining a target demarcation point from the set of alternative demarcation points based on a number of binning groups and an information value (IV) of the bins comprises:
determining, from the set of alternative demarcation points, all demarcation point combinations satisfying the number of binning groups;
and calculating the IV of each demarcation point combination, and determining the target demarcation point based on the demarcation point combination corresponding to the maximum IV value.
6. The method of claim 1, wherein the method further comprises:
determining whether, among the binning results corresponding to the target demarcation points, there is a binning result in which the number of sample data is less than a first threshold;
and if so, removing, from the target demarcation points, the target demarcation point corresponding to the binning result in which the number of sample data is less than the first threshold.
7. The method according to any one of claims 1 to 6, wherein the preset conditions include: at least one of the number of groups binned reaching the group number threshold and the number of sample data within the binning result being less than a second threshold.
8. An apparatus for data processing, the apparatus comprising:
a first determination module configured to determine a set of alternative demarcation points based on a Kolmogorov-Smirnov (KS) value of sample data;
a second determination module configured to determine a target demarcation point from the set of alternative demarcation points based on a number of binning groups and an information value (IV) of the bins;
an acquisition module configured to acquire a binning result of the sample data based on the target demarcation point;
and a training module configured to train a model based on the binning result.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010003252.5A 2020-01-02 2020-01-02 Data processing method and device, storage medium and electronic equipment Pending CN111191731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010003252.5A CN111191731A (en) 2020-01-02 2020-01-02 Data processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010003252.5A CN111191731A (en) 2020-01-02 2020-01-02 Data processing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111191731A true CN111191731A (en) 2020-05-22

Family

ID=70708367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010003252.5A Pending CN111191731A (en) 2020-01-02 2020-01-02 Data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111191731A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183644A (en) * 2020-09-29 2021-01-05 中国平安人寿保险股份有限公司 Index stability monitoring method and device, computer equipment and medium
CN112183644B (en) * 2020-09-29 2024-05-03 中国平安人寿保险股份有限公司 Index stability monitoring method and device, computer equipment and medium
CN112801775A (en) * 2021-01-29 2021-05-14 中国工商银行股份有限公司 Client credit evaluation method and device
CN113610175A (en) * 2021-08-16 2021-11-05 上海冰鉴信息科技有限公司 Service strategy generation method and device and computer readable storage medium
CN114297454A (en) * 2021-12-30 2022-04-08 医渡云(北京)技术有限公司 Method and device for discretizing features, electronic equipment and computer readable medium
CN114996371A (en) * 2022-08-03 2022-09-02 广东中盈盛达数字科技有限公司 Associated enterprise anti-fraud model construction method and system based on graph theory algorithm

Similar Documents

Publication Publication Date Title
CN111191731A (en) Data processing method and device, storage medium and electronic equipment
CN111915580A (en) Tobacco leaf grading method, system, terminal equipment and storage medium
CN111460250A (en) Image data cleaning method, image data cleaning device, image data cleaning medium, and electronic apparatus
US20200210776A1 (en) Question answering method, terminal, and non-transitory computer readable storage medium
CN112116225A (en) Fighting efficiency evaluation method and device for equipment system, and storage medium
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN110807159B (en) Data marking method and device, storage medium and electronic equipment
CN112231299A (en) Method and device for dynamically adjusting feature library
CN116911824A (en) Intelligent decision method and system based on electric power big data
CN115860147A (en) Customs sheet pre-judging model training method and device based on unbalanced ensemble learning
CN113469237B (en) User intention recognition method, device, electronic equipment and storage medium
CN115579069A (en) Construction method and device of scRNA-Seq cell type annotation database and electronic equipment
CN113554307B (en) RFM model-based user grouping method, device and readable medium
CN111984842B (en) Bank customer data processing method and device
CN110298690B (en) Object class purpose period judging method, device, server and readable storage medium
CN112685610A (en) False registration account identification method and related device
CN109934604B (en) Sales data processing method and system, storage medium and electronic equipment
EP3748549B1 (en) Learning device and learning method
CN115831219B (en) Quality prediction method, device, equipment and storage medium
CN116912919B (en) Training method and device for image recognition model
CN117334323A (en) Training method of cognitive dysfunction prediction model and related equipment
CN114386520A (en) GC early warning method and system based on gradient lifting regression and storage medium
CN113793496A (en) Main data acquisition method and system
CN114118281A (en) Data sample generation method, object evaluation method, model training method and device
CN115147593A (en) Sampling method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200522