CN111191825A - User default prediction method and device and electronic equipment

User default prediction method and device and electronic equipment

Info

Publication number
CN111191825A
CN111191825A (application CN201911328631.5A)
Authority
CN
China
Prior art keywords
user
data
default
model
prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911328631.5A
Other languages
Chinese (zh)
Inventor
于晓栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd filed Critical Beijing Qilu Information Technology Co Ltd
Priority to CN201911328631.5A priority Critical patent/CN111191825A/en
Publication of CN111191825A publication Critical patent/CN111191825A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201: Market modelling; Market analysis; Collecting market data
    • G06Q30/0202: Market predictions or forecasting for commercial activities

Abstract

The disclosure relates to a user default prediction method, a user default prediction device, an electronic device and a computer readable medium. The method comprises: acquiring financial data of a user, the financial data comprising feature data of a plurality of categories; performing character conversion on the financial data to generate user data; and generating a default category and a default probability of the user through the user data and a user default prediction model, the user default prediction model being obtained by training a distributed gradient boosting decision tree model. The user default prediction method, device, electronic device and computer readable medium can save model training and computation time, save the grid search time of the model, and improve the efficiency and accuracy of user default prediction.

Description

User default prediction method and device and electronic equipment
Technical Field
The present disclosure relates to the field of computer information processing, and in particular, to a user default prediction method, apparatus, electronic device, and computer readable medium.
Background
In general, a machine learning model needs to learn from positive samples and negative samples: positive samples are samples belonging to the class to be recognized, while in principle any other samples not belonging to that class can be selected as negative samples. A machine learning model is set up for a specific task based on the positive and negative samples and then trained on task-specific data; after training is finished, a machine learning model suited to that specific task is obtained.
Most existing methods for analyzing user default risk are decision tree-based methods such as GBDT (Gradient Boosting Decision Tree), which rely on decision tree algorithms using pre-sorting (pre-sorted). Such algorithms need to store both the feature values of the data and the results of feature sorting (such as the sorted indexes) so that split points can be computed quickly later, so a risk analysis model generated in this way consumes roughly twice as much memory as the training data itself. In addition, such a risk analysis model is expensive in computation time: the split gain must be calculated for every candidate split point traversed, which is costly. Meanwhile, as financial service platforms serve more and more users and users' requirements on response time become stricter, most existing default risk analysis models cannot meet the timeliness requirement.
Therefore, a new user default prediction method, apparatus, electronic device, and computer readable medium are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present disclosure provides a method, an apparatus, an electronic device, and a computer-readable medium for predicting a user default, which can save model training and calculation time, save time for grid search of a model, and improve efficiency and accuracy of user default prediction.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, a method for predicting a user default is provided, the method including: acquiring financial data of a user, wherein the financial data comprises feature data of a plurality of categories; performing character conversion on the financial data to generate user data; and generating a default category and a default probability of the user through the user data and a user default prediction model, wherein the user default prediction model is obtained by training a distributed gradient boosting decision tree model.
Optionally, the method further comprises: acquiring financial data and credit data of historical users; performing character conversion on the financial data to generate historical user data; determining a label for the historical user data according to the credit data; and training the distributed gradient boosting decision tree model through the labeled historical user data to generate the user default prediction model.
Optionally, performing character conversion on the financial data includes: converting character-type data in the financial data into integer-type data.
Optionally, performing character conversion on the financial data further includes: declaring the field names of the integer-type data.
Optionally, determining a label for the historical user data according to the credit data includes: acquiring a credit category label and a default label from the credit data; determining a category label for the historical user data according to the credit category label; and determining a default label for the historical user data according to the default label.
Optionally, training a distributed gradient boosting decision tree model through the labeled historical user data to generate the user default prediction model includes: inputting the labeled historical user data into the user default prediction model; the user default prediction model performing discretization processing on floating point feature values in the historical user data; constructing a histogram according to the discretization processing result and the labels; and determining the user default prediction model based on the discretization processing result and the histogram.
Optionally, determining the user default prediction model based on the discretization processing result and the histogram includes: determining an optimal segmentation point based on the discretization processing result and the histogram; determining model parameters based on the optimal segmentation point; and generating the user default prediction model based on the model parameters.
Optionally, determining an optimal segmentation point based on the discretization processing result and the histogram includes: traversing and accumulating the statistics in the histogram using the values in the discretization processing result as indexes; and determining the optimal segmentation point according to the histogram statistics after the accumulation is finished.
Optionally, generating the default category and default probability of the user through the user data and the user default prediction model includes: inputting the user data into the user default prediction model; the user default prediction model calculating the user data through a plurality of classifiers based on preset parameters; and generating at least one default category and the default probability corresponding to the default category according to the calculation result.
Optionally, the method further comprises: and generating a financial service policy for the user according to the default category and the default probability.
According to an aspect of the present disclosure, there is provided a user default prediction apparatus, the apparatus including: the data module is used for acquiring financial data of a user, wherein the financial data comprises a plurality of categories of characteristic data; the conversion module is used for performing character conversion on the financial data to generate user data; and the model module is used for generating default categories and default probabilities of the users through the user data and the user default prediction model, and the user default prediction model is obtained through training of a distributed gradient boost decision tree model.
Optionally, the apparatus further comprises a model training module, which includes: a historical data unit for acquiring financial data and credit data of historical users; a history conversion unit for performing character conversion on the financial data to generate historical user data; a history label unit for determining a label for the historical user data according to the credit data; and a data training unit for training the distributed gradient boosting decision tree model through the labeled historical user data to generate the user default prediction model.
Optionally, the history conversion unit is further configured to convert character-type data in the financial data into integer-type data.
Optionally, the history conversion unit is further configured to declare a field name of the integer type data.
Optionally, the history label unit is further configured to acquire a credit category label and a default label from the credit data; determine a category label for the historical user data according to the credit category label; and determine a default label for the historical user data according to the default label.
Optionally, the data training unit includes: an input unit for inputting the tagged historical user data into the user default prediction model; the discrete unit is used for discretizing the floating point characteristic value in the historical user data by the user default prediction model; the composition unit is used for constructing a histogram according to the discretization processing result and the label; and the model unit is used for determining the user default prediction model based on the discretization processing result and the histogram.
Optionally, the model unit is further configured to determine an optimal segmentation point based on the discretization processing result and the histogram; determine model parameters based on the optimal segmentation point; and generate the user default prediction model based on the model parameters.
Optionally, the model unit is further configured to traverse and accumulate the statistics in the histogram using the values in the discretization processing result as indexes, and to determine the optimal segmentation point according to the histogram statistics after the accumulation is finished.
Optionally, the model module comprises: a prediction unit for inputting the user data into the user default prediction model; a calculation unit for the user default prediction model to calculate the user data through a plurality of classifiers based on preset parameters; and a result unit for generating at least one default category and the default probability corresponding to the default category according to the calculation result.
Optionally, the apparatus further comprises: a policy module for generating a financial service policy for the user according to the default category and the default probability.
According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; and storage means for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
According to an aspect of the disclosure, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
According to the user default prediction method, the device, the electronic equipment and the computer readable medium, financial data of a user are obtained, wherein the financial data comprise characteristic data of a plurality of categories; performing character conversion on the financial data to generate user data; and generating default categories and default probabilities of the users through the user data and the user default prediction model, wherein the user default prediction model is obtained through training of a distributed gradient boosting decision tree model, so that model training and calculation time can be saved, time of grid search of the model can be saved, and efficiency and accuracy of user default prediction can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a system block diagram illustrating a method and apparatus for user default prediction, according to an exemplary embodiment.
FIG. 2 is a flow chart illustrating a method of user default prediction, according to an exemplary embodiment.
FIG. 3 is a flow chart illustrating a method of user default prediction in accordance with another exemplary embodiment.
FIG. 4 is a flow chart illustrating a method of user default prediction in accordance with another exemplary embodiment.
FIG. 5 is a block diagram illustrating a user breach prediction apparatus, according to an example embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 7 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
The terms to which this disclosure relates are to be construed as follows:
MOB (Month on Book): a term of art in the credit industry denoting the number of billing months from the time a customer takes out a loan to a given point in time.
LightGBM: short for Light Gradient Boosting Machine, a distributed gradient boosting framework based on decision tree algorithms.
XGBoost: short for eXtreme Gradient Boosting, an ensemble gradient boosting method.
AUC (Area Under the Curve): the area under the ROC curve enclosed with the coordinate axes; this area is clearly not larger than 1. Since the ROC curve generally lies above the line y = x, the AUC ranges between 0.5 and 1. A classifier with a larger AUC value performs better.
KS (Kolmogorov-Smirnov): the KS curve consists of two lines, with the threshold on the horizontal axis and TPR (True Positive Rate) and FPR (False Positive Rate) on the vertical axis. The threshold at which the two curves are farthest apart is the threshold that best separates the model's outputs. The KS value is MAX(TPR - FPR), i.e., the maximum distance between the two curves.
Boosting: an ensemble learning approach that strengthens weak classifiers into a strong classifier through training, thereby achieving accurate classification.
Gradient Boosting: in the Boosting framework, the central idea is not to solve for the model in a single iteration, but to add sub-models one by one while ensuring that the loss function keeps decreasing during the process.
Decision Tree: a classification and regression method, mostly used for classification. The training idea is similar to Gradient Boosting: a model is built according to the principle of loss function minimization. There are two splitting modes: leaf-wise splitting, in which the tree grows by splitting the chosen leaf, and level-wise splitting, in which the tree grows level by level.
GBDT: short for Gradient Boosting Decision Tree. It combines Gradient Boosting and Decision Trees, using decision trees as sub-models, and has advantages such as good training performance and resistance to overfitting.
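By way of illustration only, the following minimal Python sketch (not part of the original disclosure) computes the AUC and KS metrics defined above from placeholder labels and scores using scikit-learn; the arrays are hypothetical.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                        # 1 = default, 0 = non-default (placeholder labels)
y_score = np.array([0.10, 0.30, 0.70, 0.20, 0.80, 0.65, 0.40, 0.90])  # placeholder predicted default probabilities

auc = roc_auc_score(y_true, y_score)          # area under the ROC curve, between 0.5 and 1 for a useful model
fpr, tpr, thresholds = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)                        # KS value = MAX(TPR - FPR)
print("AUC =", round(auc, 3), "KS =", round(ks, 3))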
The inventor of the present disclosure has found that many categorical features, such as bank codes, customer channels, mobile phone brands, and education information, are encountered in current modeling. XGBoost handles them by one-hot encoding, converting them into multi-dimensional 0/1 features. When there are many categorical features, the number of features after dummy-variable processing becomes too large, the learned tree grows very unbalanced, and a very deep tree is needed to achieve good accuracy. In the present disclosure this problem is solved by a LightGBM model, which supports directly importing categorical variables without one-hot encoding the categorical features and is therefore much faster than one-hot encoding.
LightGBM uses a dedicated algorithm to determine the split values of categorical features. The basic idea is to reorder the categories according to their correlation with the target label: specifically, the histogram storing the categorical feature is reordered according to its accumulated statistics, and the optimal split position is then selected on the sorted histogram, thereby improving the accuracy of the model.
Therefore, on the premise of guaranteeing model accuracy, the LightGBM-based modeling method provided by the present disclosure improves model training speed and reduces feature dimensionality. It differs from the commonly used XGBoost algorithm in two respects: optimization of training speed and memory consumption on the one hand, and a different way of handling categorical features on the other. The present disclosure will be described in detail below with reference to specific embodiments.
FIG. 1 is a system block diagram illustrating a method and apparatus for user default prediction, according to an exemplary embodiment.
As shown in fig. 1, the system architecture 10 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a financial services application, a shopping application, a web browser application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background management server that supports financial services websites browsed by the user using the terminal apparatuses 101, 102, and 103. The background management server can analyze and process the received user data, and feed back the processing result (such as default categories and default probabilities of the user) to the administrator of the financial service website.
The server 105 may, for example, obtain financial data for a user, the financial data including a plurality of categories of feature data; the server 105 may, for example, perform character conversion on the financial data to generate user data; server 105 may generate the default categories and default probabilities for the users, for example, from the user data and a user default prediction model obtained by distributed gradient boosting decision tree model training.
The server 105 may also, for example, obtain financial data and credit data of historical users; the server 105 may also, for example, perform character conversion on the financial data to generate historical user data; the server 105 may also determine labels for the historical user data, for example, according to the credit data; and the server 105 may also train a distributed gradient boosting decision tree model, for example, with the labeled historical user data, to generate the user default prediction model.
The server 105 may be a single physical server or may be composed of a plurality of servers. It should be noted that the user default prediction method provided by the embodiments of the present disclosure is generally executed by the server 105, and accordingly, the user default prediction apparatus may be disposed in the server 105, while the web pages provided for users to browse the financial service platform are generally located on the terminal devices 101, 102 and 103.
FIG. 2 is a flow chart illustrating a method of user default prediction, according to an exemplary embodiment. The user default prediction method 20 includes at least steps S202 to S206.
As shown in fig. 2, in S202, financial data of a user is acquired, where the financial data includes feature data of a plurality of categories. The feature data includes, for example: age, occupation, bank code, customer channel, mobile phone brand, education information, and the like.
In S204, character conversion is performed on the financial data to generate user data. For example, character-type data in the financial data is converted into integer-type data.
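By way of illustration, a minimal sketch of the character conversion in S204, assuming pandas-style tabular data; the field names and values are hypothetical examples, not data from the disclosure.

import pandas as pd

financial_data = pd.DataFrame({
    "age": [25, 34, 41],
    "occupation": ["teacher", "driver", "engineer"],
    "bank_code": ["ICBC", "CCB", "ABC"],
    "customer_channel": ["app", "web", "app"],
})

user_data = financial_data.copy()
for column in ["occupation", "bank_code", "customer_channel"]:
    # Convert character-type data into integer-type data, as described in S204.
    user_data[column] = user_data[column].astype("category").cat.codes
print(user_data.dtypes)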
In S206, the default category and default probability of the user are generated through the user data and a user default prediction model, and the user default prediction model is obtained through training of a distributed gradient boost decision tree model.
In one embodiment, the specific steps may include: inputting the user data into the user default prediction model; the user default prediction model calculating the user data through a plurality of classifiers based on preset parameters; and generating at least one default category and the default probability corresponding to the default category according to the calculation result.
In one embodiment, the method further includes: generating a financial service policy for the user according to the default category and the default probability. For example, when the default category is installment repayment default and the default probability is 90%, financial services other than the installment repayment service may be provided to the user; when the default category is overdue repayment and the default probability is 80%, financial services with a higher default penalty may be provided to the user. The present disclosure is not limited thereto.
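The following is a simplified, hypothetical sketch of the prediction and policy steps, reduced to a binary default/no-default case; the synthetic training data, thresholds and policy texts are assumptions for illustration and are not prescribed by the disclosure.

import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 4))                  # placeholder converted historical user data
y_train = rng.randint(0, 2, size=500)                # placeholder labels: 1 = default, 0 = no default

model = lgb.LGBMClassifier(n_estimators=50, num_leaves=7)
model.fit(X_train, y_train)

user_data = rng.normal(size=(1, 4))                  # one incoming user's converted feature vector
default_probability = float(model.predict_proba(user_data)[0][1])

# Illustrative policy rule (thresholds and policies are assumed, not from the disclosure).
if default_probability >= 0.9:
    policy = "offer services other than installment repayment"
elif default_probability >= 0.8:
    policy = "offer services with a higher default penalty"
else:
    policy = "standard financial services"
print(round(default_probability, 3), policy)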
According to the user default prediction method, the financial data of a user are obtained, wherein the financial data comprise characteristic data of a plurality of categories; performing character conversion on the financial data to generate user data; and generating default categories and default probabilities of the users through the user data and the user default prediction model, wherein the user default prediction model is obtained through training of a distributed gradient boosting decision tree model, so that model training and calculation time can be saved, time of grid search of the model can be saved, and efficiency and accuracy of user default prediction can be improved.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
FIG. 3 is a flow chart illustrating a method of user default prediction in accordance with another exemplary embodiment. FIG. 3 depicts a process for building the user default prediction model.
As shown in fig. 3, in S302, financial data and credit data of the historical user are acquired.
In S304, character conversion is performed on the financial data to generate historical user data. This may include: converting character-type data in the financial data into integer-type data, and may further include: declaring the field names of the integer-type data.
In S306, a label is determined for the historical user data according to the credit data. This includes: acquiring a credit category label and a default label from the credit data; determining a category label for the historical user data according to the credit category label; and determining a default label for the historical user data according to the default label.
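By way of illustration, a minimal sketch of S306, assuming the credit data and historical user data are pandas tables joined on a user identifier; all column names are hypothetical.

import pandas as pd

historical_user_data = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [30, 45, 28],
    "bank_code": [3, 7, 3],
})
credit_data = pd.DataFrame({
    "user_id": [1, 2, 3],
    "credit_category": ["A", "C", "B"],   # credit category label
    "is_default": [0, 1, 0],              # default label
})

# Attach category and default labels to the historical user data.
labels = credit_data.rename(columns={"credit_category": "category_label", "is_default": "default_label"})
historical_user_data = historical_user_data.merge(labels, on="user_id", how="left")
print(historical_user_data)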
In S308, the distributed gradient boosting decision tree model is trained through the labeled historical user data, and the user default prediction model is generated.
LightGBM uses a Histogram-based decision tree algorithm. Its basic idea is to discretize continuous floating-point feature values into k integers and to construct a histogram of width k. When the data is traversed, statistics are accumulated in the histogram using the discretized value as the index; after one pass over the data, the histogram has accumulated the needed statistics, and the optimal segmentation point is then found by traversing the histogram according to its discretized values.
The histogram algorithm has many advantages. Most obviously, it reduces memory consumption: it does not need to additionally store the pre-sorting results and can store only the discretized feature values, which generally fit in an 8-bit integer, reducing memory consumption to about 1/8. It also greatly reduces the computation cost: the pre-sorting algorithm must compute the split gain once for every feature value traversed, whereas the histogram algorithm only needs to do so k times (k can be regarded as a constant).
Of course, the histogram algorithm is not perfect. Because the features are discretized, the segmentation points found are not exact, which can affect the result. However, results on different datasets show that discretized splitting has little effect on final accuracy and sometimes even improves it.
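The following minimal sketch illustrates the binning idea described above using simple quantile binning; it is an approximation for illustration, not LightGBM's exact bin-construction algorithm.

import numpy as np

k = 255                                                     # number of bins; bin indexes fit in an 8-bit integer
feature = np.random.RandomState(0).lognormal(size=100_000)  # a continuous floating-point feature

# Discretize the continuous values into k integer bins via quantile edges.
bin_edges = np.quantile(feature, np.linspace(0.0, 1.0, k + 1)[1:-1])
binned = np.digitize(feature, bin_edges).astype(np.uint8)   # stored as uint8, roughly 1/8 of float64 memory

print("float64:", feature.nbytes, "bytes; uint8 bins:", binned.nbytes, "bytes")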
FIG. 4 is a flow chart illustrating a method of user default prediction in accordance with another exemplary embodiment. The flow shown in fig. 4 is a detailed description of S308 "training the distributed gradient boost decision tree model by using labeled historical user data to generate the user default prediction model" in the flow shown in fig. 3.
As shown in fig. 4, in S402, the labeled historical user data is input into the user default prediction model.
In S404, the user default prediction model discretizes the floating point feature values in the historical user data.
In S406, a histogram is constructed according to the discretization processing result and the labels.
In S408, the user default prediction model is determined based on the discretization processing result and the histogram. This includes: determining an optimal segmentation point based on the discretization processing result and the histogram; determining model parameters based on the optimal segmentation point; and generating the user default prediction model based on the model parameters.
In one embodiment, determining an optimal segmentation point based on the discretization processing result and the histogram includes: traversing and accumulating the statistics in the histogram using the values in the discretization processing result as indexes; and determining the optimal segmentation point according to the histogram statistics after the accumulation is finished.
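The following simplified sketch illustrates the histogram-based search for the optimal segmentation point described in S408: statistics are accumulated into a histogram using the discretized values as indexes, and the k bins are then traversed to pick the split with the largest gain. The gain formula and single-feature setting are simplifications for illustration, not LightGBM's full implementation.

import numpy as np

def best_split_from_histogram(binned, gradients, hessians, k=255, reg_lambda=1.0):
    # One pass over the data: accumulate gradient/hessian statistics per bin, using the bin value as index.
    grad_hist = np.zeros(k)
    hess_hist = np.zeros(k)
    np.add.at(grad_hist, binned, gradients)
    np.add.at(hess_hist, binned, hessians)

    total_g, total_h = grad_hist.sum(), hess_hist.sum()
    best_gain, best_bin = -np.inf, None
    left_g = left_h = 0.0
    for b in range(k - 1):                      # traverse the k bins, not every raw feature value
        left_g += grad_hist[b]
        left_h += hess_hist[b]
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = (left_g ** 2 / (left_h + reg_lambda)
                + right_g ** 2 / (right_h + reg_lambda)
                - total_g ** 2 / (total_h + reg_lambda))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Example with random placeholder data.
rng = np.random.RandomState(0)
binned = rng.randint(0, 255, size=10_000)
print(best_split_from_histogram(binned, rng.normal(size=10_000), np.ones(10_000)))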
LightGBM abandons the level-wise decision tree growth strategy used by most GBDT tools and uses a leaf-wise growth algorithm with a depth constraint. Level-wise growth splits all leaves on the same level at once, which makes multi-thread optimization easy, keeps model complexity well under control, and is not prone to overfitting. In practice, however, level-wise growth is an inefficient algorithm because it treats leaves on the same level indiscriminately: many leaves actually have low split gain and do not need to be searched and split, which causes a lot of unnecessary overhead. Leaf-wise growth is a more efficient strategy: each time, the leaf with the largest split gain is found among all current leaves and split, and the process is repeated. Compared with level-wise growth, leaf-wise growth can therefore reduce the error more and achieve better accuracy for the same number of splits. A drawback of leaf-wise growth is that it may grow a deeper decision tree and cause overfitting; LightGBM therefore adds a maximum depth limit on top of leaf-wise growth, preventing overfitting while maintaining high efficiency.
In fact, most machine learning tools cannot directly support categorical features and generally need to convert them into multi-dimensional 0/1 features, which reduces space and time efficiency, even though categorical features are common in practice. Based on this consideration, LightGBM optimizes its support for categorical features: categorical features can be input directly, no additional 0/1 expansion is needed, and decision rules for categorical features are added to the decision tree algorithm. LightGBM is the first GBDT tool to directly support categorical features.
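By way of illustration, the following sketch contrasts one-hot expansion with feeding a categorical column to LightGBM directly; the data are placeholders, and the pandas categorical dtype is used here as one way to mark the column as categorical.

import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({
    "bank_code": pd.Categorical(["ICBC", "CCB", "ABC", "BOC", "CMB", "CCB"] * 100),
    "age": [25, 34, 41, 29, 52, 38] * 100,
    "label": [0, 1, 0, 0, 1, 1] * 100,
})

# One-hot encoding expands bank_code into one 0/1 column per bank.
one_hot = pd.get_dummies(df[["bank_code", "age"]], columns=["bank_code"])
print(one_hot.shape)   # feature dimension grows with the number of categories

# LightGBM consumes the categorical column directly, without 0/1 expansion.
model = lgb.LGBMClassifier(n_estimators=50, num_leaves=7)
model.fit(df[["bank_code", "age"]], df["label"])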
Nine-month test data of the most recently developed model were selected, with non-standard customers retained, giving 88,538 records of which 1,571 were bad samples; the data were divided into a training set and a test set. The categorical features were processed in two different ways: on one side, traditional one-hot encoding was used to expand them, giving a final feature dimension of 1425; on the other side, the categorical features were converted into consecutive integers without expansion, giving a final feature dimension of 1246, with 34 categorical features. The model performance and the feature contribution of the two processing modes were then compared.
The same model parameters were selected for the model comparison, with two differences. One is the num_leaves parameter unique to LightGBM: this parameter is related to max_depth, and the value of num_leaves should be less than or equal to 2^(max_depth); exceeding that may cause overfitting. The other is an additional parameter, categorical_feature, which specifies the categorical variables. The parameters selected when comparing the models were as follows:
from xgboost import XGBClassifier
import lightgbm as lgb

xgboost_model = XGBClassifier(
    n_estimators=200,        # number of iterations
    learning_rate=0.1,       # learning rate
    max_depth=3,             # maximum tree depth
    min_child_weight=9,      # minimum sum of sample weights in a child node
    subsample=0.8,           # random sample ratio
    colsample_bytree=0.8,    # random feature ratio
    seed=27)

lightgbm_model = lgb.LGBMClassifier(
    boosting_type='gbdt',
    metric='auc',
    categorical_feature=[3, 8, 9, 10, 11, 13, 20, 49, 69, 70, 81, 113],  # column indexes of the categorical variables
    n_estimators=200,        # number of iterations
    learning_rate=0.1,       # learning rate
    max_depth=3,             # maximum tree depth
    num_leaves=7,            # number of leaves per tree
    min_child_weight=9,      # minimum sum of sample weights in a child node
    subsample=0.8,           # random sample ratio
    colsample_bytree=0.8,    # random feature ratio
    seed=27)
The model comparison results are as follows: the user default prediction model established by the LightGBM method performs far better than the user default prediction models established with the other models.
More specifically, compared with XGBoost, LightGBM greatly shortens the time consumed in cross-validation, improving speed by more than 5 times, while the model accuracy is slightly improved; with only 34 categorical features, the feature dimension is reduced by 14.4%. The biggest improvement is still the saving in model training time, and especially during parameter tuning the grid search time can be greatly reduced.
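The following is a minimal, hypothetical sketch of such a grid search over an LGBMClassifier; the parameter grid and the synthetic data are illustrative assumptions, not the tuned settings reported above.

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {
    "num_leaves": [7, 15, 31],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    estimator=lgb.LGBMClassifier(n_estimators=200, subsample=0.8, colsample_bytree=0.8),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))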
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. When executed by the CPU, performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 5 is a block diagram illustrating a user default prediction apparatus, according to an example embodiment. As shown in fig. 5, the user default prediction apparatus 50 includes: a data module 502, a conversion module 504, a model module 506, and a model training module 508; the user default prediction apparatus 50 may further include: a policy module 510.
The data module 502 is configured to obtain financial data of a user, where the financial data includes feature data of multiple categories;
the conversion module 504 is configured to perform character conversion on the financial data to generate user data; and
the model module 506 is configured to generate default categories and default probabilities of the user through the user data and a user default prediction model, where the user default prediction model is obtained through training of a distributed gradient boost decision tree model. The model module 506 includes: a prediction unit to input the user data into the user violation prediction model; the calculation unit is used for calculating the user data through a plurality of classifiers by the default prediction model based on preset parameters; and the result unit is used for generating at least one default category and the default probability corresponding to the default category according to the calculation result.
The model training module 508 includes: the historical data unit is used for acquiring financial data and credit data of historical users; the history conversion unit is used for performing character conversion on the financial data to generate history user data; the history conversion unit is further used for converting character type data in the financial data into integer type data. The history conversion unit is also used for declaring the field names of the integer type data.
The model training module 508 includes: a history tag unit for determining a tag for the history user data according to the credit data; the history label unit is further used for acquiring a credit category label and a default label in the credit data; determining a category label for the historical user data according to the credit category label; and determining a breach label for the historical user data based on the breach label.
The model training module 508 includes: a data training unit for training the distributed gradient boosting decision tree model through the labeled historical user data to generate the user default prediction model. The data training unit includes: an input unit for inputting the labeled historical user data into the user default prediction model; a discrete unit for the user default prediction model to discretize the floating point feature values in the historical user data; a composition unit for constructing a histogram according to the discretization processing result and the labels; and a model unit for determining the user default prediction model based on the discretization processing result and the histogram. The model unit is further configured to determine an optimal segmentation point based on the discretization processing result and the histogram, determine model parameters based on the optimal segmentation point, and generate the user default prediction model based on the model parameters. The model unit is also configured to traverse and accumulate the statistics in the histogram using the values in the discretization processing result as indexes, and to determine the optimal segmentation point according to the histogram statistics after the accumulation is finished.
The policy module 510 is configured to generate a financial service policy for the user according to the default category and the default probability.
According to the user default prediction device, the financial data of a user are obtained, wherein the financial data comprise characteristic data of a plurality of categories; performing character conversion on the financial data to generate user data; and generating default categories and default probabilities of the users through the user data and the user default prediction model, wherein the user default prediction model is obtained through training of a distributed gradient boosting decision tree model, so that model training and calculation time can be saved, time of grid search of the model can be saved, and efficiency and accuracy of user default prediction can be improved.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 600 according to this embodiment of the disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 that connects the various system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
The storage unit stores program code executable by the processing unit 610, so as to cause the processing unit 610 to perform the steps according to various exemplary embodiments of the present disclosure described in this specification. For example, the processing unit 610 may perform the steps shown in fig. 2, fig. 3 and fig. 4.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 600' (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 7, the technical solution according to the embodiment of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present disclosure.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: acquiring financial data of a user, wherein the financial data comprises feature data of a plurality of categories; performing character conversion on the financial data to generate user data; and generating a default category and a default probability of the user through the user data and a user default prediction model, wherein the user default prediction model is obtained by training a distributed gradient boosting decision tree model.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly in one or more apparatuses unique from the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements, instrumentalities, or instrumentalities described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for predicting a user breach, comprising:
acquiring financial data of a user, wherein the financial data comprises characteristic data of a plurality of categories;
performing character conversion on the financial data to generate user data; and
generating a default category and a default probability of the user through the user data and a user default prediction model, wherein the user default prediction model is obtained by training a distributed gradient boosting decision tree model.
2. The method of claim 1, further comprising:
acquiring financial data and credit data of a historical user;
performing character conversion on the financial data to generate historical user data;
determining a label for the historical user data based on the credit data; and
training a distributed gradient boosting decision tree model through the labeled historical user data to generate the user default prediction model.
3. The method of claims 1-2, wherein the character converting the financial data comprises:
and converting character type data in the financial data into integer type data.
4. The method of claims 1-3, wherein the financial data is character converted, further comprising:
the field names of integer type data are declared.
5. The method of claims 1-4, wherein determining a label for the historical user data based on the credit data comprises:
acquiring a credit category label and a default label in the credit data;
determining a category label for the historical user data according to the credit category label; and
determining a default label for the historical user data according to the default label.
6. The method of claims 1-5, wherein training a distributed gradient boosting decision tree model through the labeled historical user data to generate the user default prediction model comprises:
inputting the labeled historical user data into the user default prediction model;
the user default prediction model discretizing floating point feature values in the historical user data;
constructing a histogram according to the discretization processing result and the labels; and
determining the user default prediction model based on the discretization processing result and the histogram.
7. The method of claims 1-6, wherein determining the user default prediction model based on the discretization processing result and the histogram comprises:
determining an optimal segmentation point based on the discretization processing result and the histogram;
determining model parameters based on the optimal segmentation point; and
generating the user default prediction model based on the model parameters.
8. A user breach prediction apparatus, comprising:
the data module is used for acquiring financial data of a user, wherein the financial data comprises a plurality of categories of characteristic data;
the conversion module is used for performing character conversion on the financial data to generate user data; and
the model module is used for generating a default category and a default probability of the user through the user data and a user default prediction model, wherein the user default prediction model is obtained by training a distributed gradient boosting decision tree model.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201911328631.5A 2019-12-20 2019-12-20 User default prediction method and device and electronic equipment Pending CN111191825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328631.5A CN111191825A (en) 2019-12-20 2019-12-20 User default prediction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911328631.5A CN111191825A (en) 2019-12-20 2019-12-20 User default prediction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111191825A true CN111191825A (en) 2020-05-22

Family

ID=70709276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328631.5A Pending CN111191825A (en) 2019-12-20 2019-12-20 User default prediction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111191825A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308293A (en) * 2020-10-10 2021-02-02 北京贝壳时代网络科技有限公司 Default probability prediction method and device
CN112927719A (en) * 2021-01-22 2021-06-08 中信银行股份有限公司 Risk information evaluation method, device, equipment and storage medium
CN113409135A (en) * 2021-06-30 2021-09-17 中国工商银行股份有限公司 Model training method and device, behavior prediction method and device, equipment and medium
TWI750687B (en) * 2020-06-05 2021-12-21 臺灣銀行股份有限公司 Finger vein identification risk assessment system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018090657A1 (en) * 2016-11-18 2018-05-24 同济大学 Bp_adaboost model-based method and system for predicting credit card user default
CN109063931A (en) * 2018-09-06 2018-12-21 盈盈(杭州)网络技术有限公司 A kind of model method for predicting freight logistics driver Default Probability
CN109657977A (en) * 2018-12-19 2019-04-19 重庆誉存大数据科技有限公司 A kind of Risk Identification Method and system
CN110288459A (en) * 2019-04-24 2019-09-27 武汉众邦银行股份有限公司 Loan prediction technique, device, equipment and storage medium
CN110348721A (en) * 2019-06-29 2019-10-18 北京淇瑀信息科技有限公司 Financial default risk prediction technique, device and electronic equipment based on GBST

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018090657A1 (en) * 2016-11-18 2018-05-24 同济大学 Bp_adaboost model-based method and system for predicting credit card user default
CN109063931A (en) * 2018-09-06 2018-12-21 盈盈(杭州)网络技术有限公司 A kind of model method for predicting freight logistics driver Default Probability
CN109657977A (en) * 2018-12-19 2019-04-19 重庆誉存大数据科技有限公司 A kind of Risk Identification Method and system
CN110288459A (en) * 2019-04-24 2019-09-27 武汉众邦银行股份有限公司 Loan prediction technique, device, equipment and storage medium
CN110348721A (en) * 2019-06-29 2019-10-18 北京淇瑀信息科技有限公司 Financial default risk prediction technique, device and electronic equipment based on GBST

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI750687B (en) * 2020-06-05 2021-12-21 臺灣銀行股份有限公司 Finger vein identification risk assessment system and method
CN112308293A (en) * 2020-10-10 2021-02-02 北京贝壳时代网络科技有限公司 Default probability prediction method and device
CN112927719A (en) * 2021-01-22 2021-06-08 中信银行股份有限公司 Risk information evaluation method, device, equipment and storage medium
CN113409135A (en) * 2021-06-30 2021-09-17 中国工商银行股份有限公司 Model training method and device, behavior prediction method and device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111191825A (en) User default prediction method and device and electronic equipment
CN110390408B (en) Transaction object prediction method and device
CN106844407B (en) Tag network generation method and system based on data set correlation
CN111199474B (en) Risk prediction method and device based on network map data of two parties and electronic equipment
CN112270545A (en) Financial risk prediction method and device based on migration sample screening and electronic equipment
CN112015562A (en) Resource allocation method and device based on transfer learning and electronic equipment
CN116109373A (en) Recommendation method and device for financial products, electronic equipment and medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN111210332A (en) Method and device for generating post-loan management strategy and electronic equipment
CN111190967A (en) User multi-dimensional data processing method and device and electronic equipment
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN114036921A (en) Policy information matching method and device
CN110807097A (en) Method and device for analyzing data
US20230342606A1 (en) Training method and apparatus for graph neural network
CN112446777B (en) Credit evaluation method, device, equipment and storage medium
CN111930944A (en) File label classification method and device
WO2023221359A1 (en) User security level identification method and apparatus based on multi-stage time sequence and multiple tasks
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
CN110852078A (en) Method and device for generating title
CN112527851B (en) User characteristic data screening method and device and electronic equipment
CN115099875A (en) Data classification method based on decision tree model and related equipment
CN114610953A (en) Data classification method, device, equipment and storage medium
CN113392920A (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN113612777A (en) Training method, traffic classification method, device, electronic device and storage medium
CN112231299A (en) Method and device for dynamically adjusting feature library

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination