CN114049195A - Chi-square binning method and device - Google Patents

Chi-square binning method and device

Info

Publication number
CN114049195A
Authority
CN
China
Prior art keywords
result
binning
value
candidate
chi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111320379.0A
Other languages
Chinese (zh)
Inventor
陈翱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xurong Network Technology Co ltd
Original Assignee
Shanghai Xurong Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xurong Network Technology Co ltd
Priority to CN202111320379.0A
Publication of CN114049195A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Telephone Function (AREA)

Abstract

The invention relates to a chi-square binning method and device. The method comprises: performing zero-value binning on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin; performing a chi-square test on adjacent bins based on the first binning result to obtain a second binning result; performing a WOE value monotonicity test on the second binning result; and, if the WOE values of the second binning result satisfy the monotonicity condition, determining the second binning result as a candidate binning result. The method and device address the problem in the related art that chi-square binning algorithms require bins to be designed manually, which is time-consuming, labor-intensive, and yields poor binning results, and they achieve the technical effect of improving binning efficiency and accuracy.

Description

Chi-square binning method and device
Technical Field
The invention relates to the technical field of binning, and in particular to a chi-square binning method, a chi-square binning device, computer equipment, and a computer-readable storage medium.
Background
In the field of financial risk control, particularly in the modeling of credit scorecards, whether and how the independent variables are binned plays a vital role in the effectiveness of the whole model. A binning scheme that clearly distinguishes users is also a solid foundation for the scorecard model's ability to separate good users from bad ones. Specifically, robust and reliable binning greatly aids feature screening and narrows the candidate set of independent variables on the one hand, and the final scorecard scores depend directly on the binning on the other. The efficiency of binning therefore has a substantial impact on downstream modeling and variable evaluation.
For numerical variables, common binning methods include equal-frequency binning, equal-width binning, custom binning, chi-square binning, and the like. In theory, chi-square binning aims to guarantee two points: 1) the difference between bins is large; 2) the difference within a bin is small. The criterion for measuring the difference is the chi-square test.
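As a point of reference, the Pearson chi-square statistic for two adjacent bins can be written as below. This is the standard textbook form for a table of two bins by two categories (good and bad users); the symbols A, E, R, C, and N are notation introduced here for illustration and do not appear in the application.

```latex
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(A_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

Here A_{ij} is the observed count of category j in bin i, R_i is the total count of bin i, C_j is the total count of category j over both bins, and N is the total count of the two bins. A small statistic indicates that the two bins have similar category distributions, which makes them candidates for merging.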
Scorecard modeling in the related art mainly uses the scorecardpy library for Python, which includes a chi-square binning algorithm, namely the sc.woebin function. When this function is used, the method parameter is set to 'chimerge'. For each variable, the function outputs a table containing information such as the binning details, the numbers of good and bad users, the ratio of good to bad users, the WOE value, and the IV value, as shown in fig. 1.
The chi-square binning algorithm in the existing scorecardpy library is not chi-square binning in the usual sense; instead, it first initializes the variable's bins using quantiles. Although this initialization step speeds up the computation, it does not follow the design concept of chi-square binning, so neither the binning after initialization nor the final binning can be guaranteed to be effective. Furthermore, the scorecardpy implementation cannot guarantee that the sequence of WOE values after binning is monotonic. Monotonic WOE values are indispensable for binning in a scorecard model; without monotonicity, the final scorecard scores are ambiguous or even contradictory. As shown in fig. 1, the first row corresponds to the case where the variable is empty, and its WOE value is excluded from the monotonicity test; the remaining three WOE values (-0.321718, -0.647835, 0.139838) are not monotonic. This binning of the variable var1 is therefore of no practical use. Note that monotonicity cannot be guaranteed even if the method parameter of the sc.woebin function is set to 'tree'. Practitioners therefore have to design the bins manually, which is time-consuming and labor-intensive and does not achieve the desired binning effect.
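The non-monotonic sequence quoted from fig. 1 can be checked with a few lines of Python. This is a minimal sketch: the helper name is_monotonic is hypothetical, and treating "monotonic" as either non-decreasing or non-increasing is an assumption, since the application does not state whether strict monotonicity is required.

```python
def is_monotonic(values):
    """Return True if values are all non-decreasing or all non-increasing.

    Assumption: non-strict monotonicity; the application does not specify.
    """
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# WOE values of var1 quoted from fig. 1 (the empty-value row is excluded).
woe_var1 = [-0.321718, -0.647835, 0.139838]
print(is_monotonic(woe_var1))  # False: the sequence falls, then rises
```

The sequence first decreases and then increases, so it fails the check in either direction.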
At present, no effective solution has been proposed for the problem in the related art that chi-square binning algorithms require bins to be designed manually, which is time-consuming, labor-intensive, and yields poor binning results.
Disclosure of Invention
The application aims to provide a chi-square binning method, a chi-square binning device, computer equipment, and a computer-readable storage medium that address the defects in the prior art, so as to at least solve the problem in the related art that chi-square binning algorithms require bins to be designed manually, which is time-consuming, labor-intensive, and yields poor binning results.
In order to achieve the purpose, the technical scheme adopted by the application is as follows:
In a first aspect, an embodiment of the present application provides a chi-square binning method, including:
performing zero-value binning on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin;
performing a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
performing a WOE value monotonicity test on the second binning result;
and if the WOE values of the second binning result satisfy the monotonicity condition, determining the second binning result as a candidate binning result.
In some embodiments, performing a chi-square test on adjacent bins based on the first binning result to obtain a second binning result comprises:
repeatedly performing the following steps based on the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold and whether the number of bins in the current binning result does not exceed a second threshold;
if the P value corresponding to the current binning result exceeds the first threshold and/or the number of bins in the current binning result exceeds the second threshold, merging the adjacent bins with the minimum chi-square value to obtain a third binning result, and taking the third binning result as the current binning result;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
In some of these embodiments, after determining the second binning result as a candidate binning result, the method further comprises:
judging whether the number of bins in the candidate binning result is a third threshold;
and if the number of bins in the candidate binning result is the third threshold, determining that the candidate binning result is the final binning result.
In some embodiments, after judging whether the number of bins in the candidate binning result is the third threshold, the method further comprises:
and if the number of bins in the candidate binning results is less than the third threshold, determining, among the candidate binning results, the one with the largest IV value as the final binning result.
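Choosing the candidate with the largest IV value can be sketched as follows. The application does not give formulas for WOE or IV, so this sketch assumes the common convention WOE = ln(good share / bad share) and IV = sum over bins of (good share - bad share) * WOE. The function names woe_iv and pick_by_iv are hypothetical, each candidate is assumed to be a list of (good, bad) counts per bin, and every count is assumed to be positive.

```python
import math


def woe_iv(bins):
    """Return (list of WOE values, IV) for bins given as (good, bad) counts.

    Assumed convention: WOE_i = ln(good_share_i / bad_share_i).
    All counts must be positive, otherwise the logarithm is undefined.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv


def pick_by_iv(candidates):
    """Return the candidate binning result whose IV is largest."""
    return max(candidates, key=lambda bins: woe_iv(bins)[1])


# Two illustrative candidates (made-up counts, not data from the application).
candidates = [
    [(50, 50), (50, 50)],  # bins do not separate good from bad: IV is 0
    [(80, 20), (20, 80)],  # bins separate the classes well: IV is large
]
best = pick_by_iv(candidates)
```

Because IV is symmetric in the good and bad shares, the opposite WOE sign convention, ln(bad share / good share), would select the same candidate.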
In a second aspect, an embodiment of the present application provides a chi-square binning device, including:
a zero-value binning unit, configured to perform zero-value binning on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin;
a chi-square test unit, configured to perform a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
a monotonicity test unit, configured to perform a WOE value monotonicity test on the second binning result;
a first determining unit, configured to determine the second binning result as a candidate binning result if the WOE values of the second binning result satisfy the monotonicity condition.
In some of these embodiments, the chi-square test unit is configured to:
repeatedly perform the following steps based on the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold and whether the number of bins in the current binning result does not exceed a second threshold;
if the P value corresponding to the current binning result exceeds the first threshold and/or the number of bins in the current binning result exceeds the second threshold, merging the adjacent bins with the minimum chi-square value to obtain a third binning result, and taking the third binning result as the current binning result;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
In some of these embodiments, the device further comprises:
a judging unit, configured to judge whether the number of bins in the candidate binning result is a third threshold after the second binning result is determined as a candidate binning result;
and a second determining unit, configured to determine that the candidate binning result is the final binning result if the number of bins in the candidate binning result is the third threshold.
In some of these embodiments, the device further comprises:
and a third determining unit, configured to, after it is judged whether the number of bins in the candidate binning result is the third threshold, determine, if the number of bins in the candidate binning results is smaller than the third threshold, the candidate binning result with the largest IV value among the candidate binning results as the final binning result.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the chi-square binning method according to the first aspect is implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the chi-square binning method according to the first aspect.
By adopting the above technical scheme, compared with the prior art, the chi-square binning method provided by the embodiment of the application performs zero-value binning on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin; performs a chi-square test on adjacent bins based on the first binning result to obtain a second binning result; performs a WOE value monotonicity test on the second binning result; and, if the WOE values of the second binning result satisfy the monotonicity condition, determines the second binning result as a candidate binning result. This solves the problem in the related art that chi-square binning algorithms require bins to be designed manually, which is time-consuming, labor-intensive, and yields poor binning results, and achieves the technical effect of improving binning efficiency and accuracy.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic illustration of chi-square binning data details according to the related art;
FIG. 2 is a block diagram of a mobile terminal according to an embodiment of the present application;
FIG. 3 is a flow chart of a chi-square binning method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a chi-square binning procedure according to a preferred embodiment of the present application;
FIG. 5 is a block diagram of a chi-square binning device according to an embodiment of the present application;
FIG. 6 is a hardware structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The embodiment provides a mobile terminal. Fig. 2 is a block diagram of a mobile terminal according to an embodiment of the present application. As shown in fig. 2, the mobile terminal includes: a Radio Frequency (RF) circuit 220, a memory 220, an input unit 230, a display unit 240, a sensor 250, an audio circuit 260, a wireless fidelity (WiFi) module 270, a processor 280, and a power supply 290. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 2 is not intended to be limiting of mobile terminals and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile terminal in detail with reference to fig. 2:
The RF circuit 220 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, it receives downlink information from the base station and passes it to the processor 280 for processing, and it transmits uplink data to the base station. In general, RF circuits include, but are not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 220 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 220 may be used to store software programs and modules, and the processor 280 executes various functional applications and data processing of the mobile terminal by operating the software programs and modules stored in the memory 220. The memory 220 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile terminal, and the like. Further, the memory 220 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 230 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the input unit 230 may include a touch panel 231 and other input devices 232. The touch panel 231, also referred to as a touch screen, can collect touch operations of a user on or near it (e.g., operations of the user on or near the touch panel 231 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 231 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of a user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it to touch point coordinates, provides the touch point coordinates to the processor 280, and can receive and execute commands from the processor 280. In addition, the touch panel 231 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 230 may include other input devices 232 in addition to the touch panel 231. In particular, the other input devices 232 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 240 may be used to display information input by a user or information provided to the user and various menus of the mobile terminal. The display unit 240 may include a display panel 241, and optionally, the display panel 241 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 231 may cover the display panel 241; when the touch panel 231 detects a touch operation on or near it, the touch operation is transmitted to the processor 280 to determine the type of the touch event, and the processor 280 then provides a corresponding visual output on the display panel 241 according to the type of the touch event. Although the touch panel 231 and the display panel 241 are shown as two separate components in fig. 2 to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 231 and the display panel 241 may be integrated to implement the input and output functions of the mobile terminal.
The mobile terminal may also include at least one sensor 250, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 241 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 241 and/or the backlight when the mobile terminal is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the mobile terminal, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile terminal, further description is omitted here.
A speaker 261 and a microphone 262 in the audio circuit 260 may provide an audio interface between the user and the mobile terminal. The audio circuit 260 may transmit the electrical signal converted from the received audio data to the speaker 261, which converts the electrical signal into a sound signal and outputs it; on the other hand, the microphone 262 converts the collected sound signal into an electrical signal, which the audio circuit 260 receives and converts into audio data; the audio data is then output to the processor 280 for processing and sent through the RF circuit 220 to, for example, another mobile terminal, or output to the memory 220 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and the mobile terminal can help a user to send and receive e-mails, browse webpages, access streaming media and the like through the WiFi module 270, and provides wireless broadband internet access for the user. Although fig. 2 illustrates the WiFi module 270, it is understood that it does not belong to the essential components of the mobile terminal, and it can be omitted or replaced with other short-range wireless transmission modules, such as Zigbee module or WAPI module, etc., as required within the scope not changing the essence of the invention.
The processor 280 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 220 and calling data stored in the memory 220, thereby integrally monitoring the mobile terminal. Alternatively, processor 280 may include one or more processing units; preferably, the processor 280 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 280.
The mobile terminal also includes a power supply 290 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 280 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile terminal may further include a camera, a bluetooth module, and the like, which will not be described herein.
In this embodiment, the processor 280 is configured to:
performing zero-value binning on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin;
performing a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
performing a WOE value monotonicity test on the second binning result;
and if the WOE values of the second binning result satisfy the monotonicity condition, determining the second binning result as a candidate binning result.
In some of these embodiments, the processor 280 is further configured to:
repeatedly performing the following steps based on the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold and whether the number of bins in the current binning result does not exceed a second threshold;
if the P value corresponding to the current binning result exceeds the first threshold and/or the number of bins in the current binning result exceeds the second threshold, merging the adjacent bins with the minimum chi-square value to obtain a third binning result, and taking the third binning result as the current binning result;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
In some of these embodiments, the processor 280 is further configured to:
after the second binning result is determined as a candidate binning result, judging whether the number of bins in the candidate binning result is a third threshold;
and if the number of bins in the candidate binning result is the third threshold, determining that the candidate binning result is the final binning result.
In some of these embodiments, the processor 280 is further configured to:
after judging whether the number of bins in the candidate binning result is the third threshold, if the number of bins in the candidate binning results is smaller than the third threshold, determining, among the candidate binning results, the one with the largest IV value as the final binning result.
The embodiment provides a chi-square binning method. Fig. 3 is a flowchart of a chi-square binning method according to an embodiment of the present application, and as shown in fig. 3, the flow includes the following steps:
step S301, performing zero-value binning on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin;
step S302, performing a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
step S303, performing a WOE value monotonicity test on the second binning result;
step S304, if the WOE values of the second binning result satisfy the monotonicity condition, determining the second binning result as a candidate binning result.
Through the above steps, zero-value binning is performed on original data to obtain a first binning result, wherein zero-value binning means that two adjacent bins in which the count of the same category is zero are merged into one bin; a chi-square test is performed on adjacent bins based on the first binning result to obtain a second binning result; a WOE value monotonicity test is performed on the second binning result; and, if the WOE values of the second binning result satisfy the monotonicity condition, the second binning result is determined as a candidate binning result. This solves the problem in the related art that chi-square binning algorithms require bins to be designed manually, which is time-consuming, labor-intensive, and yields poor binning results, and achieves the technical effect of improving binning efficiency and accuracy.
The embodiment of the application follows chi-square binning theory and does not initialize the bins. Starting from the rawest discrete data (that is, the original finest bins), adjacent bins are merged if the count of a certain category is zero in both. This step does not affect the final result and increases speed. In scorecard modeling this situation is common because of sample imbalance. For example, if two adjacent bins have category counts of (0, a) and (0, b) respectively, they are merged into one bin with category counts (0, a + b). If zero-value bins are not merged, an error may result, because the chi-square test cannot produce a result for a zero value. Replacing the zero with a small value is not recommended either, because it introduces a large error into the finally calculated WOE.
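The zero-value merge can be illustrated with a minimal Python sketch. The representation of a bin as a `(bad_count, good_count)` tuple and the function name `merge_zero_bins` are assumptions made for illustration; they do not come from the application.

```python
from typing import List, Tuple

# Assumed representation: one (bad_count, good_count) tuple per bin,
# ordered by the value of the variable being binned.
Bin = Tuple[int, int]


def merge_zero_bins(bins: List[Bin]) -> List[Bin]:
    """Merge adjacent bins in which the same category has a count of zero."""
    merged: List[Bin] = []
    for bad, good in bins:
        if merged:
            prev_bad, prev_good = merged[-1]
            # Both bins lack bad samples, or both bins lack good samples.
            same_zero = (prev_bad == 0 and bad == 0) or (prev_good == 0 and good == 0)
            if same_zero:
                merged[-1] = (prev_bad + bad, prev_good + good)
                continue
        merged.append((bad, good))
    return merged


if __name__ == "__main__":
    print(merge_zero_bins([(0, 3), (0, 4), (2, 5), (1, 0), (3, 0)]))
```

In this sketch (0, 3) and (0, 4) collapse into (0, 7), matching the (0, a + b) example above, and the two trailing bins with no good samples are merged in the same way.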
Performing the chi-square test on adjacent bins based on the first binning result to obtain the second binning result may include:
repeatedly performing the following steps, starting from the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold and whether the number of bins in the current binning result does not exceed a second threshold. The P value can be understood as the tail probability of the chi-square value under the corresponding chi-square distribution; the first threshold may be 0.05 and the second threshold may be 6.
If the P value corresponding to the current binning result exceeds the first threshold and/or the number of bins in the current binning result exceeds the second threshold, merging the pair of adjacent bins with the smallest chi-square value to obtain a third binning result, taking the third binning result as the current binning result, and repeating the judging step;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
The adjacent-bin chi-square test typically uses a 0.95 confidence level. Specifically, suppose two adjacent bins are distributed as follows:
Variable interval    Number of bad samples    Number of good samples
[a, b)               x                        y
[b, c)               z                        w
A chi-square test is then performed on this 2 × 2 contingency table and the chi-square value is recorded.
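For the 2 × 2 table above, the chi-square value and its P value can be computed as in the following Python sketch. It assumes the Pearson statistic without continuity correction, which the application does not specify, and uses the fact that for one degree of freedom the tail probability equals erfc(sqrt(chi2 / 2)). The function names are illustrative.

```python
import math


def chi_square_2x2(x: int, y: int, z: int, w: int) -> float:
    """Pearson chi-square statistic for the table [[x, y], [z, w]].

    Assumption: no continuity correction is applied.
    """
    n = x + y + z + w
    row1, row2 = x + y, z + w
    col1, col2 = x + z, y + w
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        # A whole row or column is empty, so expected counts would be zero.
        raise ValueError("row or column total is zero; merge zero-value bins first")
    return n * (x * w - y * z) ** 2 / denom


def p_value_df1(chi2: float) -> float:
    """Tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    chi2 = chi_square_2x2(10, 20, 30, 40)
    print(chi2, p_value_df1(chi2))
```

The `ValueError` branch shows why zero-value bins are merged first: when both bins have no samples of one category, that column total is zero and the statistic is undefined.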
All pairs of adjacent bins are traversed to find the pair with the smallest chi-square value, and that pair is merged; for the table above, the merged bin is [a, c). However, if at this point the number of bins does not exceed the predetermined value (e.g., the data has been divided into 6 bins or fewer) and the adjacent-bin chi-square value is greater than the chi-square threshold corresponding to the 0.95 confidence level, merging stops. Otherwise, merging continues.
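The merge loop described above can be sketched in Python as follows. This is an illustrative reading, not the application's implementation: it assumes the "P value corresponding to the current binning result" is the P value of the adjacent pair with the smallest chi-square value, and the names `chi_merge`, `alpha` and `max_bins` are hypothetical.

```python
import math
from typing import List, Tuple

Bin = Tuple[int, int]  # assumed representation: (bad_count, good_count)


def chi_square_pair(a: Bin, b: Bin) -> float:
    """Pearson chi-square statistic for two adjacent bins (no correction)."""
    (x, y), (z, w) = a, b
    denom = (x + y) * (z + w) * (x + z) * (y + w)
    if denom == 0:
        # Assumption: a degenerate pair is treated as indistinguishable,
        # so it becomes the first candidate for merging.
        return 0.0
    return (x + y + z + w) * (x * w - y * z) ** 2 / denom


def chi_merge(bins: List[Bin], alpha: float = 0.05, max_bins: int = 6) -> List[Bin]:
    """Merge the least distinguishable adjacent pair until both stop conditions hold."""
    bins = list(bins)
    while len(bins) > 1:
        chis = [chi_square_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i_min = min(range(len(chis)), key=chis.__getitem__)
        # P value for 1 degree of freedom of the smallest chi-square value.
        p_max = math.erfc(math.sqrt(chis[i_min] / 2.0))
        if p_max <= alpha and len(bins) <= max_bins:
            break  # every adjacent pair differs significantly and the count is acceptable
        bad = bins[i_min][0] + bins[i_min + 1][0]
        good = bins[i_min][1] + bins[i_min + 1][1]
        bins[i_min:i_min + 2] = [(bad, good)]
    return bins


if __name__ == "__main__":
    print(chi_merge([(5, 95), (6, 94), (30, 70), (32, 68), (60, 40)]))
```

With the default thresholds of 0.05 and 6, the two pairs with nearly identical bad rates are merged and three clearly different bins remain.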
Since the final binning must be usable in practice, a bin in which some category has too few samples is of limited practical reference value. This is equivalent to the recommendation that, when a chi-square test is performed, no cell should have a value of less than 5; otherwise the convergence of the test statistic is subject to considerable error.
In some embodiments, after the chi-square test on adjacent bins is performed based on the first binning result to obtain the second binning result, a WOE value monotonicity test may be performed on the second binning result. If the WOE values of the second binning result satisfy the monotonicity condition, the second binning result is determined as a candidate binning result; otherwise, the second binning result is discarded. WOE here stands for "Weight of Evidence".
The WOE monotonicity test checks that, if the WOE array is (a, b, c, d), either a > b > c > d or a < b < c < d holds. Because a missing value carries a different meaning, the missing-value bin is excluded from the monotonicity test.
For example, suppose monotonicity does not hold. For some variable var, the bins may be [1, 3), [3, 5), [5, inf), the variable values of users A, B and C may be (2, 4, 6) respectively, and the scores may be (10, 5, 20). That is, user B's value of var lies between those of A and C, but B's score does not lie between the scores of A and C. This runs against the original design intent of the scorecard model.
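The monotonicity test can be sketched in Python as below. The application does not give a WOE formula, so the sketch assumes the common definition WOE_i = ln((good_i / good_total) / (bad_i / bad_total)); the opposite sign convention would give the same monotonicity verdict. Function names are illustrative, and the missing-value bin is assumed to have been removed by the caller.

```python
import math
from typing import List, Sequence, Tuple

Bin = Tuple[int, int]  # assumed representation: (bad_count, good_count)


def woe_values(bins: Sequence[Bin]) -> List[float]:
    """WOE per bin, assuming WOE = ln(good share / bad share)."""
    total_bad = sum(bad for bad, _ in bins)
    total_good = sum(good for _, good in bins)
    woes = []
    for bad, good in bins:
        if bad == 0 or good == 0:
            raise ValueError("WOE is undefined for a bin with a zero count")
        woes.append(math.log((good / total_good) / (bad / total_bad)))
    return woes


def is_strictly_monotonic(values: Sequence[float]) -> bool:
    """True if the values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


if __name__ == "__main__":
    woes = woe_values([(11, 189), (62, 138), (60, 40)])
    print(woes, is_strictly_monotonic(woes))
    # The score pattern (10, 5, 20) from the example above is not monotonic.
    print(is_strictly_monotonic([10, 5, 20]))
```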
In some of these embodiments, after determining the second binning result as a candidate binning result, the method further comprises:
judging whether the number of bins in the candidate binning result is a third threshold; the third threshold here may be 5 or 6.
If the number of bins in the candidate binning result is the third threshold, determining that the candidate binning result is the final binning result.
If the number of bins in the candidate binning results is less than the third threshold, determining the candidate binning result with the largest IV value among the candidate binning results as the final binning result. IV stands for Information Value.
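The selection rule can be illustrated with the following Python sketch. The IV formula used, the sum over bins of (good share - bad share) × ln(good share / bad share), is the standard definition and is an assumption here, since the application does not state one. How several candidates arise, and the tie-breaking by IV among candidates that already have 5 or 6 bins, are also assumptions; `select_final` and `preferred_counts` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Bin = Tuple[int, int]  # assumed representation: (bad_count, good_count)


def information_value(bins: Sequence[Bin]) -> float:
    """IV = sum((good share - bad share) * ln(good share / bad share)).

    Assumes every bin has non-zero bad and good counts.
    """
    total_bad = sum(bad for bad, _ in bins)
    total_good = sum(good for _, good in bins)
    iv = 0.0
    for bad, good in bins:
        bad_share = bad / total_bad
        good_share = good / total_good
        iv += (good_share - bad_share) * math.log(good_share / bad_share)
    return iv


def select_final(candidates: List[List[Bin]], preferred_counts=(5, 6)) -> List[Bin]:
    """Prefer candidates with 5 or 6 bins; otherwise take the largest IV."""
    preferred = [c for c in candidates if len(c) in preferred_counts]
    pool = preferred or candidates
    return max(pool, key=information_value)


if __name__ == "__main__":
    two_bins = [(10, 90), (40, 60)]
    three_bins = [(10, 90), (15, 35), (25, 25)]
    print(information_value(two_bins), information_value(three_bins))
    print(select_final([two_bins, three_bins]))
```

Neither candidate in the usage example has 5 or 6 bins, so the one with the larger IV is returned.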
The embodiments of the present application prefer a binning result with 5 or 6 bins. In short, the principle of binning is that good customers should score high, average customers should score in the middle, and bad customers should score low. In scorecard modeling practice, if the number of bins is too small (e.g., 2, not counting the missing-value group), the variable often has insufficient power to identify default or fraud. Conversely, if the number of bins can reach 5 or 6, the variable can distinguish default or fraud at several levels, and because that distinction rests on the sound statistical theory of the chi-square test, the final scores separate well. In other words, a given variable may admit both a 3-bin and a 5-bin result, and the 3-bin result may be better on some index; but if the 5-bin result also meets the requirement on that index, the 5-bin result is selected. On the other hand, the number of bins should not be too large, otherwise the amount of data in each bin may not meet statistical requirements.
The embodiments of the present application are described and illustrated below by means of preferred embodiments.
FIG. 4 is a schematic diagram of a chi-square binning procedure according to a preferred embodiment of the present application, which may be described as follows, as shown in FIG. 4:
Zero-value binning is performed on the original data, and adjacent zero-value bins are merged. The adjacent-bin chi-square test is then performed on the non-zero-value bins. If the P value exceeds the chi-square threshold (i.e., the first threshold), the adjacent bins with the smallest chi-square value are merged; if the P value does not exceed the chi-square threshold, the chi-square value and the number of bins are checked. If the number of bins is acceptable and the chi-square value is high, the WOE value monotonicity test is performed; if there are too many bins or the chi-square value is low, the adjacent-bin chi-square test continues. A binning result that satisfies WOE monotonicity is directly determined as a candidate result, and a binning result that does not satisfy WOE monotonicity is directly discarded. If the number of bins of a candidate result is 5 or 6, that candidate result is directly determined as the final binning result; if the number of bins of the candidate results is less than 5, the candidate result with the largest IV value is selected and determined as the final binning result.
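The flow of FIG. 4 up to the candidate result can be condensed into one Python sketch. It is a sketch under stated assumptions rather than the application's implementation: bins are `(bad_count, good_count)` tuples, the chi-square statistic is the uncorrected Pearson statistic, WOE is ln(good share / bad share), a result that still contains a zero count is discarded because its WOE is undefined, and `bin_variable` is a hypothetical name.

```python
import math
from typing import List, Optional, Tuple

Bin = Tuple[int, int]  # assumed representation: (bad_count, good_count)


def _chi2(a: Bin, b: Bin) -> float:
    (x, y), (z, w) = a, b
    denom = (x + y) * (z + w) * (x + z) * (y + w)
    return 0.0 if denom == 0 else (x + y + z + w) * (x * w - y * z) ** 2 / denom


def _monotonic(values: List[float]) -> bool:
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def bin_variable(raw: List[Bin], alpha: float = 0.05, max_bins: int = 6) -> Optional[List[Bin]]:
    """Return a candidate binning result, or None if it is discarded."""
    # Step 1: merge adjacent bins whose count of the same category is zero.
    bins: List[Bin] = []
    for bad, good in raw:
        if bins and ((bins[-1][0] == 0 and bad == 0) or (bins[-1][1] == 0 and good == 0)):
            bins[-1] = (bins[-1][0] + bad, bins[-1][1] + good)
        else:
            bins.append((bad, good))
    # Step 2: merge the adjacent pair with the smallest chi-square value until
    # that pair is significant (P <= alpha) and there are at most max_bins bins.
    while len(bins) > 1:
        chis = [_chi2(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)
        if math.erfc(math.sqrt(chis[i] / 2.0)) <= alpha and len(bins) <= max_bins:
            break
        bins[i:i + 2] = [(bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])]
    # Step 3: WOE monotonicity test (assumption: zero counts mean discard).
    if any(bad == 0 or good == 0 for bad, good in bins):
        return None
    total_bad = sum(bad for bad, _ in bins)
    total_good = sum(good for _, good in bins)
    woes = [math.log((good / total_good) / (bad / total_bad)) for bad, good in bins]
    return bins if _monotonic(woes) else None


if __name__ == "__main__":
    raw = [(0, 5), (0, 4), (5, 86), (6, 94), (30, 70), (32, 68), (60, 40)]
    print(bin_variable(raw))
```

The final choice among candidates by bin count and IV value is left out of this sketch; it only produces or discards a single candidate.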
The key technical features and innovations of the preferred embodiment arise from the practical application of scorecard modeling. Specifically: WOE monotonicity of the bins is checked at every stage; and 5 or 6 bins are preferred rather than using a greedy algorithm that fixates on a single index. Zero-value bins are merged to increase speed, the WOE monotonicity test ensures that the final binning is usable, and fixing the number of bins at 5 or 6 further improves its usability.
The preferred embodiment of the present application guarantees WOE monotonicity, which greatly improves the usability of variables. The popular scorecardy binning algorithm cannot guarantee monotonicity, so manual binning is needed; manual binning cannot guarantee an optimal result, is time-consuming and labor-intensive, and, more importantly, easily fails, so that usable variables are lost. Preferring 5 or 6 bins improves the discriminating power of the model. Generally speaking, a continuous variable that can be divided into 5 or 6 bins performs better overall once it enters the model.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The present embodiment provides a chi-square binning device, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a block diagram of a chi-square binning device according to an embodiment of the present application. As shown in Fig. 5, the device includes:
a zero-value binning unit 51, configured to perform zero-value binning on original data to obtain a first binning result, where the zero-value binning is used to indicate that two adjacent bins with zero number in the same category are merged into one bin;
a chi-square checking unit 52, configured to perform a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
a monotonicity test unit 53, configured to perform a WOE value monotonicity test on the second binning result;
a first determining unit 54, configured to determine the second binning result as a candidate binning result if the WOE value of the second binning result satisfies a monotonicity condition.
In some of these embodiments, the chi-square checking unit 52 is configured to:
repeatedly perform the following steps based on the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold and whether the number of bins in the current binning result does not exceed a second threshold;
if the P value corresponding to the current binning result exceeds the first threshold and/or the number of bins in the current binning result exceeds the second threshold, merging the adjacent bins with the smallest chi-square value to obtain a third binning result, and taking the third binning result as the current binning result;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
In some of these embodiments, the apparatus further comprises:
a judging unit, configured to judge whether the number of bins in the candidate binning result is a third threshold after the second binning result is determined as a candidate binning result;
and a second determining unit, configured to determine that the candidate binning result is a final binning result if the number of bins in the candidate binning result is the third threshold.
In some of these embodiments, the apparatus further comprises:
and a third determining unit, configured to determine, after determining whether the number of bins in the candidate binning result is a target threshold, if the number of bins in the candidate binning result is smaller than the third threshold, a candidate binning result with a largest IV value in the candidate binning result as a final binning result.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
An embodiment provides a computer device. The chi-square binning method described in the embodiments of the present application can be implemented by the computer device. Fig. 6 is a hardware structure diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 61 and a memory 62 in which computer program instructions are stored.
Specifically, the processor 61 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 62 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 62 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 62 may include removable or non-removable (or fixed) media, where appropriate. The memory 62 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 62 is a non-volatile memory. In particular embodiments, memory 62 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Out Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory 62 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 61.
The processor 61 implements any of the chi-square binning methods in the above embodiments by reading and executing computer program instructions stored in the memory 62.
In some of these embodiments, the computer device may also include a communication interface 63 and a bus 60. As shown in fig. 6, the processor 61, the memory 62, and the communication interface 63 are connected via a bus 60 to complete mutual communication.
The communication interface 63 is used for implementing communication between modules, devices, units and/or apparatuses in the embodiments of the present application. The communication interface 63 may also carry out data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.
Bus 60 comprises hardware, software, or both, coupling the components of the computer device to each other. Bus 60 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 60 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of these. Bus 60 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
In addition, in combination with the chi-square binning method in the foregoing embodiments, an embodiment of the present application may provide a computer-readable storage medium for implementing the method. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the chi-square binning methods in the above-described embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A chi-square binning method, characterized by comprising the following steps:
performing zero-value binning on original data to obtain a first binning result, wherein the zero-value binning is used for indicating that two adjacent bins with zero number in the same category are merged into one bin;
performing a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
performing a WOE value monotonicity test on the second binning result;
and if the WOE value of the second binning result meets the monotonicity condition, determining the second binning result as a candidate binning result.
2. The chi-square binning method of claim 1, wherein performing chi-square inspection of adjacent bins based on the first binning result, resulting in a second binning result, comprises:
repeatedly performing the following steps based on the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold value and whether the number of bins in the current binning result does not exceed a second threshold value;
if the P value corresponding to the current binning result exceeds the first threshold value and/or the number of bins in the current binning result exceeds the second threshold value, merging the adjacent bins with the minimum chi-square value to obtain a third binning result, and taking the third binning result as the current binning result;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
3. The chi-squared binning method according to claim 1 or 2, wherein after determining the second binning result as a candidate binning result, the method further comprises:
judging whether the number of bins in the candidate binning result is a third threshold value or not;
and if the number of bins in the candidate binning result is the third threshold value, determining that the candidate binning result is the final binning result.
4. The chi-squared binning method of claim 3, wherein after determining whether the number of bins in the candidate binning result is a target threshold, the method further comprises:
and if the number of bins in the candidate binning results is less than the third threshold, determining the candidate binning result with the maximum IV value among the candidate binning results as a final binning result.
5. A chi-square binning device, characterized by comprising:
a zero-value binning unit, configured to perform zero-value binning on original data to obtain a first binning result, wherein the zero-value binning is used for indicating that two adjacent bins with zero number in the same category are merged into one bin;
a chi-square checking unit, configured to perform a chi-square test on adjacent bins based on the first binning result to obtain a second binning result;
a monotonicity test unit, configured to perform a WOE value monotonicity test on the second binning result;
a first determining unit, configured to determine the second binning result as a candidate binning result if the WOE value of the second binning result satisfies a monotonicity condition.
6. The chi-square binning device of claim 5, wherein the chi-square checking unit is configured to:
repeatedly perform the following steps based on the first binning result:
judging whether the P value corresponding to the current binning result does not exceed a first threshold value and whether the number of bins in the current binning result does not exceed a second threshold value;
if the P value corresponding to the current binning result exceeds the first threshold value and/or the number of bins in the current binning result exceeds the second threshold value, merging the adjacent bins with the minimum chi-square value to obtain a third binning result, and taking the third binning result as the current binning result;
and if the P value corresponding to the current binning result does not exceed the first threshold and the number of bins in the current binning result does not exceed the second threshold, taking the current binning result as the second binning result.
7. The chi-squared binning device according to claim 5 or 6, further comprising:
a judging unit, configured to judge whether the number of bins in the candidate binning result is a third threshold after the second binning result is determined as a candidate binning result;
and a second determining unit, configured to determine that the candidate binning result is a final binning result if the number of bins in the candidate binning result is the third threshold.
8. The chi-square binning device of claim 7, wherein the device further comprises:
and a third determining unit, configured to determine, after determining whether the number of bins in the candidate binning result is a target threshold, if the number of bins in the candidate binning result is smaller than the third threshold, a candidate binning result with a largest IV value in the candidate binning result as a final binning result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the chi-square binning method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the chi-square binning method according to any one of claims 1 to 4.
CN202111320379.0A 2021-11-09 2021-11-09 Square carton splitting method and device Pending CN114049195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111320379.0A CN114049195A (en) 2021-11-09 2021-11-09 Square carton splitting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111320379.0A CN114049195A (en) 2021-11-09 2021-11-09 Square carton splitting method and device

Publications (1)

Publication Number Publication Date
CN114049195A true CN114049195A (en) 2022-02-15

Family

ID=80207613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111320379.0A Pending CN114049195A (en) 2021-11-09 2021-11-09 Square carton splitting method and device

Country Status (1)

Country Link
CN (1) CN114049195A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115423600A (en) * 2022-08-22 2022-12-02 前海飞算云创数据科技(深圳)有限公司 Data screening method, device, medium and electronic equipment
CN115423600B (en) * 2022-08-22 2023-08-04 前海飞算云创数据科技(深圳)有限公司 Data screening method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination