CN110084376B - Method and device for automatically separating data into boxes - Google Patents

Method and device for automatically separating data into boxes

Info

Publication number
CN110084376B
Authority
CN
China
Prior art keywords
initial vector
function
function value
box
bringing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910362666.4A
Other languages
Chinese (zh)
Other versions
CN110084376A (en
Inventor
李骥东
何智福
蓝科
覃进学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sefon Software Co Ltd
Original Assignee
Chengdu Sefon Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sefon Software Co Ltd
Priority to CN201910362666.4A
Publication of CN110084376A
Application granted
Publication of CN110084376B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Operations Research (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to a method and a device for automatically binning data. The method comprises: acquiring basic feature data and binning conditions input by a user; substituting the binning conditions into a predefined function to obtain an objective function; determining an initial vector according to the binning conditions; and substituting the initial vector into the objective function to determine a search direction for the basic feature data. The initial vector is then taken as a reference point, adjusted along the search direction, and substituted into the objective function to obtain a corresponding function value. When the difference between the next function value and the current function value is smaller than a preset convergence precision, the adjusted initial vector corresponding to the next function value is determined as a segmentation point, and the basic feature data input by the user is binned according to the plurality of determined segmentation points. The scheme bins data quickly, minimizes the degree of association between bins, and makes it convenient to score the data input by the user objectively.

Description

Method and device for automatically separating data into boxes
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for automatically separating data into boxes.
Background
With the development and spread of big data and artificial intelligence technologies, more and more financial institutions are paying attention to machine learning and are gradually replacing traditional management based on manual decision-making with data-driven intelligent decision-making. In the personal financial business of banks in particular, such as credit cards and consumer finance, traditional manual approval cannot meet business needs because individual amounts are small, applications are frequent, and timeliness requirements are high. Among machine learning methods for risk management, the scorecard model based on logistic regression has been adopted by most banks because it is easy to interpret, quick to iterate, mature, and stable. Binning is an especially important step in building a scorecard: it can improve both the stability of the model and its computational performance. However, how to bin automatically and how to optimize the binning process have remained open problems in machine learning modeling.
The main binning methods include equal-frequency binning, equal-width binning, and automatic binning. Equal-frequency binning divides the data by proportion, for example taking every 10% of the data as one bin. Equal-width binning divides the data according to the maximum and minimum feature values: for example, if the span between the maximum and minimum age is 50 years and every 10 years is taken as one bin, the data is divided into 5 bins. The drawback of these two methods is that they weaken the influence of different feature values on the response variable.
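As an illustration of the two fixed-rule methods described above, the following minimal Python sketch computes equal-width edges and equal-frequency cut points. The function names and the sample ages are hypothetical and are not part of the patent.

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 edges that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def equal_frequency_cuts(values, n_bins):
    """Return n_bins - 1 interior cut points so each bin holds about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut i is the value at rank i * n / n_bins in the sorted data.
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


ages = list(range(20, 71))           # hypothetical ages 20..70, a span of 50 years
print(equal_width_edges(ages, 5))    # [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
print(equal_frequency_cuts(ages, 5)) # [30, 40, 50, 60]
```

Neither function looks at the response variable, which is the weakness the text points out.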
The automatic binning methods in wide use today are decision-tree-based binning and chi-square binning (ChiMerge). The core idea of decision-tree-based binning is to use entropy and information gain to find the point at which splitting the feature yields the largest information gain, and to bin automatically by repeatedly splitting child nodes. The core idea of chi-square binning is to merge adjacent categories step by step according to the chi-square value of the feature until the iteration reaches a termination condition.
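The decision-tree idea described above can be sketched as a single split search. This is a minimal illustration that assumes a binary 0/1 response; the helper names `entropy` and `best_split` are hypothetical, and the repeated splitting of child nodes is left out.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def best_split(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with maximal gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (pairs[i - 1][0], gain)
    return best
```

A stopping rule such as a maximum depth or a minimum bin size would be needed to turn this into a full binning procedure, and those are the parameters the following paragraph describes as sensitive.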
Both of these automatic binning methods are overly sensitive to their iteration termination conditions, such as tree depth and minimum bin capacity, and are prone to overfitting. They also offer limited support for constraints (for example, requiring that a certain class of data form its own bin, or specifying bin subintervals), and so cannot fully meet the binning requirements of actual modeling.
Disclosure of Invention
The invention aims to provide a method for automatically binning data, which is used for rapidly and effectively binning the data to ensure that the correlation degree between two adjacent bins is the lowest, so that the effect of automatic binning is achieved.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
In a first aspect, an embodiment of the present invention provides a method for automatically binning data, the method comprising: acquiring basic feature data and binning conditions input by a user; substituting the binning conditions into a predefined function to obtain an objective function; determining an initial vector according to the binning conditions, substituting the initial vector into the objective function, and determining a search direction for the basic feature data; taking the initial vector as a reference point, adjusting the initial vector along the search direction, and substituting the adjusted initial vector into the objective function to obtain a corresponding function value; when the difference between the next function value and the current function value is smaller than a preset convergence precision, determining the adjusted initial vector corresponding to the next function value as a segmentation point; and binning the basic feature data input by the user according to the plurality of determined segmentation points.
In a second aspect, an embodiment of the present invention further provides an apparatus for automatically binning data, the apparatus comprising: a transceiver module, configured to acquire basic feature data and binning conditions input by a user; and a processing module, configured to substitute the binning conditions into a predefined function to obtain an objective function; determine an initial vector according to the binning conditions, substitute the initial vector into the objective function, and determine a search direction for the basic feature data; take the initial vector as a reference point, adjust the initial vector along the search direction, and substitute the adjusted initial vector into the objective function to obtain a corresponding function value; when the difference between the next function value and the current function value is smaller than a preset convergence precision, determine the adjusted initial vector corresponding to the next function value as a segmentation point; and bin the basic feature data input by the user according to the plurality of determined segmentation points.
The embodiments of the invention provide a method and a device for automatically binning data. The method comprises: acquiring basic feature data and binning conditions input by a user; substituting the binning conditions into a predefined function to obtain an objective function; determining an initial vector according to the binning conditions; and substituting the initial vector into the objective function to determine a search direction for the basic feature data. The initial vector is then taken as a reference point, adjusted along the search direction, and substituted into the objective function to obtain a corresponding function value. When the difference between the next function value and the current function value is smaller than a preset convergence precision, the adjusted initial vector corresponding to the next function value is determined as a segmentation point, and the basic feature data input by the user is binned according to the plurality of determined segmentation points. The scheme bins data quickly, minimizes the degree of association between bins, and makes it convenient to score the data input by the user objectively.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart illustrating a method for automatically binning data according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating functional modules of an apparatus for automatically binning data according to an embodiment of the present invention.
Reference numerals: 200 - apparatus for automatically binning data; 210 - transceiver module; 220 - processing module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
In the personal financial business of banks, such as credit cards and consumer finance, individual amounts are small and applications are frequent, so manual review involves a heavy workload. At present, banks and other financial institutions generally score the basic data input by users with a scorecard model and decide from the score whether to provide the user with a financial service. Binning is an important step in the scorecard model: it divides the data input by the user into several groups, and the scorecard model scores each group according to a certain logic to produce the final score. Binning the data into groups with the lowest degree of association therefore makes it easier for the scorecard model to score the data and makes the final score more accurate. This scheme provides a method for automatically binning data such that the degree of association between two adjacent bins is the lowest, which achieves a better binning result.
Referring to fig. 1, a schematic flow chart of a method for automatically binning data according to an embodiment of the present invention is shown, where the method includes:
S110: acquire basic feature data and binning conditions input by the user.
Specifically, the basic feature data input by the user includes basic information about the user, such as age, height, weight, and income. The binning conditions include the number of bins and the proportion of data in each bin. For example, if the number of bins is 5 and the proportion of data in each bin is 10%, the basic feature data input by the user is divided into 5 bins, each containing no less than 10% of the total data.
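A minimal sketch of how the binning conditions might be represented, assuming the proportion is read as a per-bin minimum as in the example above. The class name, the field names, and the feasibility check are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinningConditions:
    n_bins: int            # m: number of bins requested by the user
    min_proportion: float  # p: minimum share of the data in each bin

    def is_feasible(self) -> bool:
        """m bins of at least p each can only exist if m * p <= 1."""
        return (
            self.n_bins >= 1
            and self.min_proportion > 0
            and self.n_bins * self.min_proportion <= 1
        )


conditions = BinningConditions(n_bins=5, min_proportion=0.10)
print(conditions.is_feasible())  # True: 5 * 0.10 = 0.5 <= 1
```

Checking feasibility before optimizing avoids searching for cut points that cannot satisfy both conditions at once.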
S120: substitute the binning conditions into a predefined function to obtain an objective function.
Specifically, the binning conditions include the number of bins and the proportion of data in each bin. Substituting these two quantities into the predefined function yields the objective function, which is expressed as follows:
Figure BDA0002047304980000051
where
Figure BDA0002047304980000052
represents the minimum degree of correlation; s.t. denotes the constraints; $C_i(x) - m$ represents the bin-number constraint, where $m$ is the number of bins; $C_i(x) - p$ represents the minimum proportion per bin; and $C_i(x)$ is a constraint function of $x$.
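The formula images are not reproduced in this text. As a reading aid only, one form consistent with the definitions above is sketched below. The equality and inequality signs are assumptions, chosen because the number of bins is fixed at $m$ and each bin must hold no less than the proportion $p$ of the data.

```latex
\min_{x} \; f(x)
\quad \text{s.t.} \quad
C_i(x) - m = 0, \qquad C_i(x) - p \ge 0
```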
To solve this problem, the nonlinear optimization must be reduced to a quadratic programming problem: the Lagrangian function of the objective function is constructed first, and a quadratic approximation of the Lagrangian function then yields the quadratic programming problem.
The first step constructs the Lagrangian function of the objective function as follows:
L(x)=f(x)+λG(x)+μS(x)
where $L(x)$ is the Lagrangian function; $G(x)$ is the bin-number constraint, $G(x) = C_i(x) - m$; $S(x)$ is the per-bin proportion constraint, $S(x) = C_i(x) - p$; $\lambda$ is the Lagrange multiplier; and $\mu$ is the bin-proportion coefficient.
The second step takes a quadratic approximation of the Lagrangian function, which turns the search for the optimal solution of the original nonlinear optimization into a quadratic programming problem, computed as follows:
Figure BDA0002047304980000053
where
Figure BDA0002047304980000054
$H_k$ denotes the Hessian matrix at the $k$-th iteration, i.e. the second derivative of the objective function; $x_k$ denotes a specific value of $x$; and $d$ denotes the search direction of the variable.
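The formula image for the quadratic programming problem is not reproduced in this text. For orientation, the textbook SQP subproblem built from the symbols defined above is shown below. It is an assumption that the patent uses this form, and textbook SQP takes $H_k$ as the Hessian of the Lagrangian rather than of the objective function alone.

```latex
\min_{d} \; \tfrac{1}{2}\, d^{\top} H_k\, d + \nabla f(x_k)^{\top} d
\quad \text{s.t.} \quad
G(x_k) + \nabla G(x_k)^{\top} d = 0, \qquad
S(x_k) + \nabla S(x_k)^{\top} d \ge 0
```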
S130: determine an initial vector according to the binning conditions, substitute the initial vector into the objective function, and determine the search direction for the basic feature data.
Specifically, the binning conditions include the number of bins. If the number of bins input by the user is 5, the initial vector $x_k$ can be defined as $x_1$ to $x_4$; that is, the basic feature data input by the user is cut 4 times to give 5 groups of data. The initial vector is then substituted into the objective function, now converted into a quadratic programming problem, to determine the search direction for the basic feature data. The search direction is determined as follows:
First, the first-order derivative of the quadratic programming problem is taken to obtain the gradient vector.
It is computed as follows:
Figure BDA0002047304980000061
where $g_k$ denotes the gradient vector.
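The gradient formula image is not reproduced in this text. When no analytic derivative is available, the gradient vector can be approximated numerically. The sketch below uses central differences, which is an assumption and not necessarily what the patent does; the function name and the step `h` are hypothetical.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad
```

For an objective that changes only when a cut point crosses a data value, finite differences are zero almost everywhere, so a smooth objective is needed for this to be informative.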
Second, the second-order derivative of the quadratic programming problem is taken to obtain the Hessian matrix.
Computing the Hessian matrix requires differentiating the original function at different $x_k$. To reduce the amount of computation, when the number of bins of the basic feature data is less than a preset threshold (for example, 100), an approximately optimal solution of the Hessian matrix is obtained with Newton's method; when the number of bins is greater than the preset threshold (for example, 100), it is obtained with the BFGS algorithm. This approximately optimal solution is then used as the result of the second-order derivation of the quadratic programming problem.
The approximately optimal solution of the Hessian matrix is obtained with Newton's method as follows:
Figure BDA0002047304980000062
The approximately optimal solution of the Hessian matrix is obtained with the BFGS algorithm as follows:
Let $y_k = g_{k+1} - g_k$ and $s_k = x_{k+1} - x_k$.
The Hessian matrix of the iterative process can be approximated by $B_k$, i.e. $H \approx B$:
$B_{k+1} = B_k + \Delta B_k$
where the initial $B_k$ is the identity matrix, i.e. a matrix with ones on the diagonal, and $\Delta B_k$ denotes the increment of $B_k$:
Figure BDA0002047304980000063
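The update formula image is not reproduced in this text. The sketch below uses the textbook BFGS update, $\Delta B_k = \frac{y_k y_k^{\top}}{y_k^{\top} s_k} - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k}$, which fits the definitions of $y_k$ and $s_k$ above; that the patent uses exactly this form is an assumption. The function name is hypothetical, and matrices are plain nested lists.

```python
def bfgs_update(B, s, y):
    """Return B + dB with dB = y y^T / (y^T s) - (B s)(B s)^T / (s^T B s).

    B is assumed symmetric, so (B s)(B s)^T equals B s s^T B.
    """
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    ys = sum(y[i] * s[i] for i in range(n))
    sBs = sum(s[i] * Bs[i] for i in range(n))
    if ys <= 0 or sBs <= 0:
        return [row[:] for row in B]  # skip the update to keep B positive definite
    return [
        [B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs for j in range(n)]
        for i in range(n)
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]
print(bfgs_update(identity, [1.0, 0.0], [2.0, 1.0]))  # [[2.0, 1.0], [1.0, 1.5]]
```

After the update the new matrix satisfies the secant condition $B_{k+1} s_k = y_k$, which is what lets it stand in for the Hessian without computing second derivatives.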
Finally, the gradient vector and the Hessian matrix are combined according to a preset rule to obtain a direction vector, which represents the search direction for the basic feature data.
It is computed as follows:
Figure BDA0002047304980000064
where $H_k$ denotes the Hessian matrix, $g_k$ denotes the gradient vector, and $d_k$ denotes the direction vector, i.e. the search direction for the basic feature data.
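The formula image for the direction vector is not reproduced in this text. A common choice, assumed here, is the Newton direction $d_k = -H_k^{-1} g_k$, obtained by solving $H_k d_k = -g_k$ rather than inverting the matrix. The helper names below are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def search_direction(H, g):
    """Return d with H d = -g (the Newton or quasi-Newton direction)."""
    return solve_linear(H, [-gi for gi in g])


print(search_direction([[2, 0], [0, 4]], [2, 4]))  # [-1.0, -1.0]
```

The same routine works whether `H` comes from exact second derivatives or from a BFGS approximation.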
S140: taking the initial vector as a reference point, adjust the initial vector along the search direction and substitute it into the objective function to obtain a corresponding function value.
Specifically, the user also inputs an iteration step size and a number of iterations. The iteration step size, denoted $\alpha_k$, can be set to any value from 1 to 1000, with a default of 1. The number of iterations, denoted $k$, can be set to any number greater than 1, with a default of 10. With the initial vector as the reference point, the initial vector is adjusted along the search direction: for example, if the initial vector $x_k$ is $x_1$ to $x_4$, the step size is added to each value of the initial vector along the search direction, and the adjusted initial vector is substituted into the objective function to obtain the corresponding function value. The operation stops when the difference between the computed function value and the function value corresponding to the initial vector satisfies the convergence condition, or when the number of iterations is reached.
S150: when the difference between the next function value and the current function value is smaller than the preset convergence precision, determine the adjusted initial vector corresponding to the next function value as a segmentation point.
Specifically, substituting the adjusted initial vector into the objective function gives the next function value, and substituting the initial vector into the objective function gives the current function value. If the difference between the next function value and the current function value is smaller than the preset convergence precision, the degree of association between the groups is the lowest under the current grouping, and the adjusted initial vector corresponding to the next function value is used as the segmentation point. If the difference is greater than the preset convergence precision, the initial vector is reassigned: $\alpha_k + x_k$ is taken as the new initial vector (that is, the step size is added to the previous initial vector), the algorithm above is repeated on the new initial vector to determine the search direction, and the function values computed by the objective function are compared again to redetermine the segmentation point.
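Steps S140 and S150 can be sketched as the loop below. It is an illustration under assumptions: the update is taken as $x_{k+1} = x_k + \alpha_k d_k$ (the text describes adding the step size to the initial vector along the search direction), and `f`, `direction`, `eps`, and `max_iter` are hypothetical names for the objective function, the search-direction routine, the convergence precision, and the number of iterations.

```python
def iterate(f, direction, x0, alpha=1.0, eps=1e-6, max_iter=10):
    """Step along direction(x) until successive function values differ by less than eps."""
    x = list(x0)
    current = f(x)
    for _ in range(max_iter):
        d = direction(x)
        x_next = [xi + alpha * di for xi, di in zip(x, d)]
        following = f(x_next)
        converged = abs(following - current) < eps
        x, current = x_next, following
        if converged:
            break
    return x


# Toy quadratic whose Newton direction points straight at the minimizer.
target = [1.0, 2.0, 3.0, 4.0]
f = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))
newton_dir = lambda x: [ti - xi for xi, ti in zip(x, target)]
print(iterate(f, newton_dir, [0.0, 0.0, 0.0, 0.0]))  # [1.0, 2.0, 3.0, 4.0]
```

The loop also stops after `max_iter` steps, mirroring the user-supplied number of iterations with its default of 10.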
S160: bin the basic feature data input by the user according to the plurality of determined segmentation points.
Specifically, each segmentation point corresponds to a position at which the basic feature data is cut. Binning the basic feature data input by the user at the determined segmentation points yields multiple groups of data that satisfy the number of bins and the bin proportion input by the user. Because the degree of association between the resulting groups is low, the scorecard model can conveniently score the grouped data, which improves calculation accuracy.
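Once the segmentation points are known, assigning each record to a bin is a lookup. A minimal sketch using Python's standard `bisect` module follows; the function names, sample ages, and cut points are hypothetical.

```python
from bisect import bisect_right


def assign_bins(values, cut_points):
    """Map each value to a bin index in 0..len(cut_points) using sorted cut points.

    A value equal to a cut point goes into the bin above that cut point.
    """
    cuts = sorted(cut_points)
    return [bisect_right(cuts, v) for v in values]


def bin_proportions(bin_ids, n_bins):
    """Return the share of the data that falls in each bin."""
    counts = [0] * n_bins
    for b in bin_ids:
        counts[b] += 1
    return [c / len(bin_ids) for c in counts]


ages = [22, 25, 31, 38, 44, 47, 52, 58, 63, 69]
ids = assign_bins(ages, [30, 40, 50, 60])
print(ids)                      # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(bin_proportions(ids, 5))  # [0.2, 0.2, 0.2, 0.2, 0.2]
```

Comparing the proportions with the user's minimum proportion gives a direct check that the result meets the binning conditions.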
Therefore, with the method for automatically binning data provided by the invention, the user only needs to input the basic feature data, the binning conditions, the iteration step size, the number of iterations, and other basic data and constraints. The optimal segmentation points are then calculated by the configured algorithm to bin the basic feature data, so that a subsequent model can conveniently score the binned data. The scheme has two main benefits:
1. It compensates for the way the traditional equal-frequency and equal-width methods weaken the influence of variable values on the response variable. Binning with those methods ignores differences between feature intervals. For example, for the relation between age and overdue payment over an age span of 20 to 50 years, the equal-width method puts every 5 years in one bin, even though the overdue rate is higher among young people.
2. It solves the problem that traditional automatic binning is sensitive to preset parameters and therefore overfits. With the SQP method, the user only needs to set the step size and the number of iterations; the algorithm completes the IV optimization automatically, which reduces dependence on the experience of modeling personnel.
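The text refers to optimizing IV without defining it. Assuming IV means Information Value as commonly used in scorecard modeling, it can be computed for a given binning as sketched below. The function name, the 0/1 label convention (1 = bad), and the additive smoothing that avoids division by zero are assumptions.

```python
import math


def information_value(bin_ids, labels, n_bins, smoothing=0.5):
    """Information Value of a binary label over bins, with additive smoothing."""
    good = [smoothing] * n_bins
    bad = [smoothing] * n_bins
    for b, y in zip(bin_ids, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    total_good, total_bad = sum(good), sum(bad)
    iv = 0.0
    for g, b in zip(good, bad):
        share_good, share_bad = g / total_good, b / total_bad
        # (share difference) * weight of evidence for this bin
        iv += (share_good - share_bad) * math.log(share_good / share_bad)
    return iv
```

An optimizer that maximizes this quantity could minimize its negative as the objective function; whether the patent's $f(x)$ is defined that way is not stated in the text.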
Fig. 2 is a schematic diagram of the functional modules of an apparatus 200 for automatically binning data according to an embodiment of the present invention. The apparatus includes a transceiver module 210 and a processing module 220.
The transceiver module 210 is configured to acquire the basic feature data and the binning conditions input by the user.
In the embodiment of the present invention, S110 may be performed by the transceiver module 210.
The processing module 220 is configured to substitute the binning conditions into a predefined function to obtain an objective function; determine an initial vector according to the binning conditions, substitute the initial vector into the objective function, and determine the search direction for the basic feature data; take the initial vector as a reference point, adjust the initial vector along the search direction, and substitute the adjusted initial vector into the objective function to obtain a corresponding function value; when the difference between the next function value and the current function value is smaller than the preset convergence precision, determine the adjusted initial vector corresponding to the next function value as a segmentation point; and bin the basic feature data input by the user according to the plurality of determined segmentation points.
In the embodiment of the present invention, S120 to S160 may be performed by the processing module 220.
Since the method for automatically binning data has been described in detail in the method section above, it is not repeated here.
In summary, the embodiments of the present invention provide a method and an apparatus for automatically binning data. The method comprises: acquiring basic feature data and binning conditions input by a user; substituting the binning conditions into a predefined function to obtain an objective function; determining an initial vector according to the binning conditions; and substituting the initial vector into the objective function to determine a search direction for the basic feature data. The initial vector is then taken as a reference point, adjusted along the search direction, and substituted into the objective function to obtain a corresponding function value. When the difference between the next function value and the current function value is smaller than a preset convergence precision, the adjusted initial vector corresponding to the next function value is determined as a segmentation point, and the basic feature data input by the user is binned according to the plurality of determined segmentation points. The scheme bins data quickly, minimizes the degree of association between bins, and makes it convenient to score the data input by the user objectively.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for automatically binning data, characterized in that the method is applied in the field of financial services,
the method comprising the following steps:
acquiring basic feature data and a binning condition input by a user, wherein the binning condition comprises the number of bins and the proportion of data in each bin; the basic feature data comprises basic information of the user,
the basic information comprising age, height, weight and income;
substituting the binning condition into a predefined function to obtain an objective function, wherein the objective function is expressed as
Figure FDA0003008604490000011
wherein
Figure FDA0003008604490000012
represents the minimum degree of correlation, s.t. denotes the constraints, Ci(x) - m represents the bin-number constraint, m represents the number of bins, the constraint on Ci(x) involving p represents the minimum proportion per bin, and Ci(x) represents a constraint function of x;
determining an initial vector according to the binning condition, substituting the initial vector into the objective function, and determining a search direction for the basic feature data;
taking the initial vector as a reference point, adjusting the initial vector along the search direction, and substituting the adjusted initial vector into the objective function to obtain a corresponding function value, this function value being the next function value;
when the difference between the next function value and the current function value is smaller than a preset convergence precision, determining the adjusted initial vector corresponding to the next function value as a segmentation point, wherein the current function value is the function value obtained by substituting the initial vector into the objective function;
and binning the basic feature data input by the user according to the determined plurality of segmentation points to obtain a plurality of groups of data.
2. The method of claim 1, wherein substituting the binning condition into a predefined function to obtain an objective function comprises the following steps:
obtaining a Lagrangian function of the objective function;
and performing a quadratic approximation of the Lagrangian function to obtain a quadratic programming problem.
3. The method of claim 2, wherein determining an initial vector according to the binning condition, substituting the initial vector into the objective function, and determining a search direction for the basic feature data comprises:
determining an initial vector according to the number of bins included in the binning condition, and substituting the initial vector into the quadratic programming problem;
taking the first-order derivative of the quadratic programming problem to obtain a gradient vector;
taking the second-order derivative of the quadratic programming problem to obtain a Hessian matrix;
and performing a calculation on the gradient vector and the Hessian matrix according to a preset rule to obtain a direction vector, wherein the direction vector represents the search direction for the basic feature data.
4. The method of claim 3, wherein taking the second-order derivative of the quadratic programming problem to obtain a Hessian matrix comprises:
when the number of bins is smaller than a preset threshold, obtaining an approximate optimal solution of the Hessian matrix using a Newton algorithm;
and when the number of bins is larger than the preset threshold, obtaining the approximate optimal solution of the Hessian matrix using a BFGS algorithm.
5. The method of claim 1, wherein taking the initial vector as a reference point, adjusting the initial vector along the search direction, and substituting the adjusted initial vector into the objective function to obtain a corresponding function value comprises:
acquiring an iteration step length and a number of iterations input by a user;
and adjusting the initial vector according to the iteration step length, substituting the adjusted initial vector into the objective function to obtain a corresponding function value, and stopping once the number of iterations is reached.
6. A device for automatically binning data, characterized in that the device is applied in the field of financial services,
the device comprising:
a transceiver module, configured to acquire basic feature data and a binning condition input by a user, wherein the binning condition comprises the number of bins and the proportion of data in each bin; the basic feature data comprises basic information of the user, the basic information comprising age, height, weight and income;
a processing module, configured to substitute the binning condition into a predefined function to obtain an objective function, wherein the objective function is expressed as
Figure FDA0003008604490000031
wherein
Figure FDA0003008604490000032
represents the minimum degree of correlation, s.t. denotes the constraints, Ci(x) - m represents the bin-number constraint, m represents the number of bins, the constraint on Ci(x) involving p represents the minimum proportion per bin, and Ci(x) represents a constraint function of x;
determine an initial vector according to the binning condition, substitute the initial vector into the objective function, and determine a search direction for the basic feature data;
take the initial vector as a reference point, adjust the initial vector along the search direction, and substitute the adjusted initial vector into the objective function to obtain a corresponding function value, this function value being the next function value;
when the difference between the next function value and the current function value is smaller than a preset convergence precision, determine the adjusted initial vector corresponding to the next function value as a segmentation point, wherein the current function value is the function value obtained by substituting the initial vector into the objective function;
and bin the basic feature data input by the user according to the determined plurality of segmentation points to obtain a plurality of groups of data.
7. The apparatus of claim 6, wherein the processing module is further configured to:
obtain a Lagrangian function of the objective function;
and perform a quadratic approximation of the Lagrangian function to obtain a quadratic programming problem.
8. The apparatus of claim 7, wherein the processing module is specifically configured to:
determine an initial vector according to the number of bins included in the binning condition, and substitute the initial vector into the quadratic programming problem;
take the first-order derivative of the quadratic programming problem to obtain a gradient vector;
take the second-order derivative of the quadratic programming problem to obtain a Hessian matrix;
and perform a calculation on the gradient vector and the Hessian matrix according to a preset rule to obtain a direction vector, wherein the direction vector represents the search direction for the basic feature data.
9. The apparatus of claim 8, wherein the processing module is specifically configured to:
when the number of bins is smaller than a preset threshold, obtain an approximate optimal solution of the Hessian matrix using a Newton algorithm;
and when the number of bins is larger than the preset threshold, obtain the approximate optimal solution of the Hessian matrix using a BFGS algorithm.
10. The apparatus of claim 6, wherein the processing module is specifically configured to:
acquire an iteration step length and a number of iterations input by a user;
and adjust the initial vector according to the iteration step length, substitute the adjusted initial vector into the objective function to obtain a corresponding function value, and stop once the number of iterations is reached.
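
The iteration recited in the claims (initial vector, search direction from gradient and Hessian, step adjustment, convergence test, final binning) can be illustrated with a minimal Python sketch. It is not the patented implementation: the objective function appears only as an image in the source, so the sketch assumes a stand-in quadratic objective whose minimum lies at the cut points matching the requested proportions. All function names, the equal-width initial vector, and the default values of `step`, `max_iter` and `tol` are assumptions made for illustration.

```python
from bisect import bisect_right


def target_cut_points(values, proportions):
    """Hypothetical helper: midpoints between neighbours at the cumulative proportions."""
    ordered = sorted(values)
    n = len(ordered)
    cuts, cumulative = [], 0.0
    for p in proportions[:-1]:  # m bins need m - 1 cut points
        cumulative += p
        index = min(max(int(round(cumulative * n)), 1), n - 1)
        cuts.append((ordered[index - 1] + ordered[index]) / 2.0)
    return cuts


def objective(x, targets):
    # Stand-in objective (an assumption, NOT the patent's formula): squared distance.
    return sum((xi - ti) ** 2 for xi, ti in zip(x, targets))


def search_direction(x, targets):
    # For the stand-in objective the gradient is g = 2(x - t) and the Hessian is H = 2I,
    # so the Newton direction d = -H^{-1} g reduces to -(x - t).
    return [-(xi - ti) for xi, ti in zip(x, targets)]


def find_cut_points(values, proportions, step=0.5, max_iter=100, tol=1e-9):
    """Iterate from an initial vector until successive function values differ by < tol."""
    targets = target_cut_points(values, proportions)
    lo, hi = min(values), max(values)
    m = len(proportions)
    # Initial vector: equal-width cut points derived from the number of bins.
    x = [lo + (hi - lo) * k / m for k in range(1, m)]
    current = objective(x, targets)
    for _ in range(max_iter):  # user-supplied iteration count
        d = search_direction(x, targets)
        x = [xi + step * di for xi, di in zip(x, d)]  # user-supplied step length
        nxt = objective(x, targets)
        if abs(nxt - current) < tol:  # preset convergence precision
            break
        current = nxt
    return x


def assign_bins(values, cuts):
    """Return the bin index of each value given ascending cut points."""
    return [bisect_right(cuts, v) for v in values]
```

Calling `find_cut_points(ages, [0.3, 0.4, 0.3])` followed by `assign_bins(ages, cuts)` splits a list of ages into three groups. With a real constrained objective, the direction step would typically be delegated to an SQP or BFGS solver rather than computed in closed form as it is here.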
CN201910362666.4A 2019-04-30 2019-04-30 Method and device for automatically separating data into boxes Active CN110084376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910362666.4A CN110084376B (en) 2019-04-30 2019-04-30 Method and device for automatically separating data into boxes

Publications (2)

Publication Number Publication Date
CN110084376A CN110084376A (en) 2019-08-02
CN110084376B true CN110084376B (en) 2021-05-14

Family

ID=67418143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910362666.4A Active CN110084376B (en) 2019-04-30 2019-04-30 Method and device for automatically separating data into boxes

Country Status (1)

Country Link
CN (1) CN110084376B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909085A (en) * 2019-11-25 2020-03-24 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium
CN112819034A (en) * 2021-01-12 2021-05-18 平安科技(深圳)有限公司 Data binning threshold calculation method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537067A (en) * 2014-12-30 2015-04-22 广东电网有限责任公司信息中心 Box separation method based on k-means clustering
CN106547758A (en) * 2015-09-17 2017-03-29 阿里巴巴集团控股有限公司 Method and apparatus for data binning
CN107169511A (en) * 2017-04-27 2017-09-15 华南理工大学 Clustering ensemble method based on mixing clustering ensemble selection strategy
CN108399255A (en) * 2018-03-06 2018-08-14 中国银行股份有限公司 A kind of input data processing method and device of Classification Data Mining model
CN108984790A (en) * 2018-07-31 2018-12-11 蜜小蜂智慧(北京)科技有限公司 Data binning method and device
CN109063222A (en) * 2018-11-04 2018-12-21 吉铁磊 A kind of self-adapting data searching method based on big data
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050079508A1 (en) * 2003-10-10 2005-04-14 Judy Dering Constraints-based analysis of gene expression data
US8346783B2 (en) * 2009-12-11 2013-01-01 International Business Machines Corporation Method and system for merchandise hierarchy refinement by incorporation of product correlation
CN109598346A (en) * 2017-09-30 2019-04-09 日本电气株式会社 For estimating the causal methods, devices and systems between observational variable

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
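Improved Ant Colony Optimization for One-Dimensional Bin Packing Problem with Precedence Constraints; Zeqiang Zhang et al; 《Third International Conference on Natural Computation (ICNC 2007)》; 20071105; full text *
SAS/OR: Rigorous constrained optimized binning for credit scoring; Ivan Oliveira et al; 《Data Mining and Predictive Modeling》; 20081231; full text *
FCM algorithm based on binning statistics and its application in network intrusion detection (基于分箱统计的FCM算法及其在网络入侵检测中的应用); Fu Tao et al; 《计算机科学》 (Computer Science); 20081231; 36-39 *
Research and implementation of a distributed cooperative network intrusion detection system based on feature matching and binning technology (基于特征匹配与分箱技术的分布式网络入侵协同检测系统研究及实现); Wang Jiesong; 《中国硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 20070615; I139-176 *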

Also Published As

Publication number Publication date
CN110084376A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN103336790A (en) Hadoop-based fast neighborhood rough set attribute reduction method
CN108320171A (en) Hot item prediction technique, system and device
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN102567464A (en) Theme map expansion based knowledge resource organizing method
CN111815432B (en) Financial service risk prediction method and device
CN113590698B (en) Artificial intelligence technology-based data asset classification modeling and hierarchical protection method
CN111967971B (en) Bank customer data processing method and device
CN112417176B (en) Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics
CN103336791A (en) Hadoop-based fast rough set attribute reduction method
CN110084376B (en) Method and device for automatically separating data into boxes
CN102117411A (en) Method and system for constructing multi-level classification model
CN109345007A (en) A kind of Favorable Reservoir development area prediction technique based on XGBoost feature selecting
CN112232944B (en) Method and device for creating scoring card and electronic equipment
CN111143685A (en) Recommendation system construction method and device
CN110109902A (en) A kind of electric business platform recommender system based on integrated learning approach
CN106407379A (en) Hadoop platform based movie recommendation method
CN111275485A (en) Power grid customer grade division method and system based on big data analysis, computer equipment and storage medium
CN109978023A (en) Feature selection approach and computer storage medium towards higher-dimension big data analysis
CN106651461A (en) Film personalized recommendation method based on gray theory
CN109977131A (en) A kind of house type matching system
Zou et al. A multiobjective particle swarm optimization algorithm based on grid technique and multistrategy
CN109977977A (en) A kind of method and corresponding intrument identifying potential user
CN116993548A (en) Incremental learning-based education training institution credit assessment method and system for LightGBM-SVM
CN116756391A (en) Unbalanced graph node neural network classification method based on graph data enhancement
CN111984842B (en) Bank customer data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant