CN112037009A - Risk assessment method for consumption credit scene based on random forest algorithm - Google Patents

Info

Publication number
CN112037009A
Authority
CN
China
Prior art keywords
model
random forest
feature
module
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010784787.0A
Other languages
Chinese (zh)
Inventor
江远强
韩璐
李兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baiweijinke Shanghai Information Technology Co ltd
Original Assignee
Baiweijinke Shanghai Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baiweijinke Shanghai Information Technology Co ltd filed Critical Baiweijinke Shanghai Information Technology Co ltd
Priority to CN202010784787.0A priority Critical patent/CN112037009A/en
Publication of CN112037009A publication Critical patent/CN112037009A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Technology Law (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a risk assessment method for consumer credit scenarios based on a random forest algorithm. It comprises an information acquisition module, a data preprocessing module, a feature engineering module, a model training and parameter tuning module, a feature importance assessment module, a model evaluation and selection module, and a model deployment monitoring module. Beneficial effects: data preprocessing is simple, feature engineering and model training are efficient, and model accuracy is high. The method applies the random forest model to the consumer credit scenario of internet finance. Owing to the characteristics of the algorithm, a random forest copes better with the high-dimensional sparsity, heavy noise, and variable redundancy of internet data, and it offers higher risk prediction accuracy and stability than traditional scorecard models. This improves credit risk identification and provides reference value for practical internet consumer credit applications.

Description

Risk assessment method for consumption credit scene based on random forest algorithm
Technical Field
The invention relates to the technical field of risk control in the internet consumer credit industry, and in particular to a risk assessment method for consumer credit scenarios based on a random forest algorithm.
Background
With the rise of the "Internet Plus" concept, internet consumer credit companies in areas such as P2P lending, consumer finance, and car leasing have sprung up like bamboo shoots after rain. After this period of unchecked growth, however, the pace of development and the fate of each company depend largely on risk control. Traditional risk control review relies on scorecard models built on machine learning algorithms such as logistic regression, decision trees, support vector machines, and neural networks. These algorithms are highly interpretable and easy to understand, the weight of each feature can be read directly, and new data can easily be absorbed to update the model. As a result, the traditional scorecard model remains a common method for risk assessment in the consumer credit industry even though ensemble algorithms such as GBDT, random forest, and LightGBM have appeared one after another.
With the development of big data, internet credit data is no longer limited to application and credit bureau data; it increasingly incorporates third-party data such as online shopping, social networking, and APP usage behavior. The combined data is high-dimensional and sparse, and the traditional scorecard model shows clear limitations on it. The specific problems and difficulties are as follows:
data preprocessing is cumbersome: the traditional scorecard model places very high demands on data preprocessing and, for computational convenience, directly deletes large numbers of samples with sparse data or missing values, so much of the value of the data is lost;
feature engineering is difficult: feature engineering for the traditional scorecard model is complex. Continuous data must be discretized and features must be screened: features are selected by Weight of Evidence (WOE) transformation and Information Value (IV), and are also removed according to collinearity among variables. Today's high-dimensional big-data risk control data far exceeds the processing capacity of the traditional scorecard modeling system, and more advanced machine learning algorithms are urgently needed;
model accuracy is insufficient: compared with an ensemble model that combines many weak classifiers, the single model of a traditional scorecard lacks exploration and cross-checking among models, and may suffer from insufficient stability and weak generalization.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a risk assessment method of a consumption credit scene based on a random forest algorithm, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
a risk assessment method for a consumption credit scene based on a random forest algorithm comprises an information acquisition module, a data preprocessing module, a feature engineering module, a model training and parameter adjusting module, a feature importance assessment module, a model evaluation and selection module and a model deployment monitoring module, wherein the information acquisition module is connected with the feature engineering module through the data preprocessing module, the feature engineering module is connected with the feature importance assessment module through the model training and parameter adjusting module, and the feature importance assessment module is connected with the model deployment monitoring module through the model evaluation and selection module.
Further, the risk assessment method for the consumption credit scene based on the random forest algorithm comprises the following steps:
obtaining modeling data: modeling sample customers are randomly drawn from the company's business system by application month. For imbalanced samples whose performance is not yet mature (for example, recent applicants whose bad-debt rate is clearly lower than that of earlier customers), the modeling customers are obtained with SMOTE (Synthetic Minority Oversampling Technique). Using the customer number as the primary key, the application data, credit bureau data, APP event-tracking data, and customer-authorized third-party data are joined into a modeling data set;
data preprocessing: the uniqueness of the user number and the completeness of the samples serve as the data quality criteria. Descriptive statistics are computed for the variables of the modeling sample: a distribution plot shows the range of each variable, and the mean, quantiles, outliers, and missing values are counted. If a variable with high business relevance has a high missing rate, decision trees built with the random forest algorithm can be used to predict and impute the missing values, which helps recover the missing information. Sparser variables can be clustered with the K-Means algorithm, which simplifies the subsequent feature engineering;
feature engineering: after preprocessing, the raw data is processed into features. More predictive and interpretable variables are usually obtained by constructing derived variables. Common derivation methods include counts, sums, ratios, time differences, and volatility. Two variables linked by business logic can also be combined by addition, subtraction, multiplication, or division to produce derived variables. The result is a higher-dimensional wide feature table, and feature selection can be performed during random forest training and tuning;
model training and parameter tuning: model training and parameter tuning are carried out with the random forest classifier (RandomForestClassifier) in the sklearn module of Python;
feature importance assessment and feature selection: unlike traditional scorecard models, the random forest model can output feature importances, which are the normalized importance values of the features; the higher the importance, the more the feature contributes to the prediction function. Random forest feature importance is implemented in sklearn: after training, the importance of each feature is read directly from the feature_importances_ attribute. The features are sorted in descending order of importance, and the TOP500 or TOP100 features are selected, depending on the total number of features, to form a new feature subset. Training and tuning are then repeated to obtain a random forest model with better generalization and stability. A consumer credit risk assessment system and its index weights can be built from the feature importances to assess each customer's credit score and overdue risk level;
model evaluation and selection: the overall performance of the random forest model is evaluated with KS and AUC. The KS value reflects how well the model separates good customers from bad ones; the AUC value allows the model to be evaluated reliably when the classes are imbalanced. The accuracy and stability of the random forest model are assessed comprehensively by comparing it with LR, SVM, GBDT, XGBoost, and traditional scorecard models;
model deployment monitoring: after a random forest model with the optimal parameter combination is obtained through repeated training with grid search and five-fold cross-validation, the model is deployed to the system platform. The model is then updated and optimized by monitoring indicators such as the IV and mean of each variable, the PSI (Population Stability Index) of the model score distribution, KS (Kolmogorov-Smirnov statistic), and AUC (area under the ROC curve).
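The feature importance step above can be sketched in Python with scikit-learn. This is a minimal illustration, not the applicant's implementation: the synthetic data from make_classification, the cut-off top_k = 10 (standing in for TOP500 or TOP100), and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wide feature table (hypothetical data).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# First pass: train on all features.
rf_full = RandomForestClassifier(n_estimators=200, random_state=0)
rf_full.fit(X_train, y_train)

# feature_importances_ is normalized, so the values sum to 1.
importances = rf_full.feature_importances_

# Keep the top_k most important features (assumed cut-off for this sketch).
top_k = 10
top_idx = np.argsort(importances)[::-1][:top_k]

# Second pass: retrain on the selected subset only.
rf_top = RandomForestClassifier(n_estimators=200, random_state=0)
rf_top.fit(X_train[:, top_idx], y_train)
test_accuracy = rf_top.score(X_test[:, top_idx], y_test)
```

In practice the retrained model would be tuned again and compared with the full model on KS and AUC rather than on accuracy alone.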
Further, the random forest classifier (RandomForestClassifier) includes the following model parameters:
number of sub-models (n_estimators): this parameter relates to the complexity of the random forest model. In theory, the more sub-models there are, the more stable the result, but the computational cost rises sharply, and beyond a certain number the performance gain is small, so in practice a moderate value is chosen by tuning;
maximum number of features per decision tree (max_features): specifies the maximum number of features randomly considered by a single tree at each split. The smaller max_features is, the smaller the overall variance of the model, which improves accuracy. For random forest classification, max_features is usually set to the square root (sqrt) of the total number of features;
maximum depth of tree (max_depth): by experience this is usually set to None (unlimited), so that all features are considered when splitting and each tree grows fully;
minimum number of samples to split a node (min_samples_split): limits the condition under which a subtree continues to split; min_samples_split is the minimum number of samples a node must contain to be split further. The default value in scikit-learn is 2. If there are many samples and features, min_samples_split can be increased to reduce computation;
maximum number of leaf nodes (max_leaf_nodes): limiting the maximum number of leaf nodes can prevent overfitting. The default is None (no limit); if the model has many features, a specific value can be chosen by cross-validation;
minimum number of samples per leaf (min_samples_leaf): specifies the minimum number of samples a leaf must contain. With the default value of 1, each decision tree can grow fully, so a leaf may contain a single sample. If a leaf would contain fewer samples than the threshold, it is pruned together with its sibling node. If the sample size is very large, increasing this value is recommended;
minimum weighted fraction per leaf (min_weight_fraction_leaf): limits the minimum sum of the sample weights at a leaf node; if the sum falls below the threshold, the leaf is pruned together with its sibling node. The default value is 0, meaning weights are not considered. In general, min_weight_fraction_leaf is adjusted when many samples have missing values or the sample distribution is heavily skewed; introducing sample weights makes the model more robust to missing and imbalanced data.
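As a rough illustration of the parameters listed above, the sketch below builds a scikit-learn RandomForestClassifier with each of them set explicitly and reads the library defaults back with get_params(). The value n_estimators=300 is an arbitrary assumption, and defaults such as max_features may differ between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# Each parameter from the list above, set explicitly (values are illustrative).
clf = RandomForestClassifier(
    n_estimators=300,              # number of sub-models (trees)
    max_features="sqrt",           # features considered at each split
    max_depth=None,                # no depth limit: trees grow fully
    min_samples_split=2,           # minimum samples needed to split a node
    max_leaf_nodes=None,           # no limit on the number of leaves
    min_samples_leaf=1,            # minimum samples required in a leaf
    min_weight_fraction_leaf=0.0,  # minimum weighted fraction in a leaf
    random_state=0,
)
params = clf.get_params()

# Library defaults, for comparison with the values described in the text.
defaults = RandomForestClassifier().get_params()
```

Inspecting `defaults` is a quick way to check which values the installed scikit-learn version actually uses before tuning starts.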
Further, when tuning a single tree, adjusting max_leaf_nodes or max_depth changes the structure of the tree at a coarse granularity: the more leaf nodes or the deeper the tree, the lower the bias of the sub-model and the higher its variance. Adjusting min_samples_split, min_samples_leaf, and min_weight_fraction_leaf changes the structure of the tree at a fine granularity: the fewer samples required for a split or for a leaf node, the more complex the sub-model, which yields a more accurate and efficient model. In addition, adjusting other model parameters, including warm_start (warm start; the scikit-learn default is False), n_jobs (number of parallel jobs), and criterion (split criterion; the default is the Gini index 'gini', as in the CART algorithm), enables multithreaded parallel operation of the random forest model and faster training.
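The system-level parameters can be illustrated with a short scikit-learn sketch, assuming synthetic data and arbitrary tree counts. With warm_start=True, a second call to fit keeps the existing trees and only adds new ones, while n_jobs=-1 asks scikit-learn to use all available cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data standing in for the modeling data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,
    warm_start=True,   # reuse trees from earlier fit() calls
    n_jobs=-1,         # build trees in parallel on all cores
    criterion="gini",  # split criterion (Gini index)
    random_state=0,
)
rf.fit(X, y)
n_first = len(rf.estimators_)

# Raise the tree count and fit again: only the 50 additional trees are built.
rf.set_params(n_estimators=100)
rf.fit(X, y)
n_second = len(rf.estimators_)
```

Growing the forest in stages like this is one way to check how quickly performance levels off as n_estimators increases.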
Further, the information acquisition module extracts modeling sample customers from the company's business system and obtains customer application data, APP event-tracking data, and authorized third-party data, which form the initial modeling data. The data preprocessing module computes descriptive statistics for the data variables, imputes missing values with the random forest algorithm, and clusters sparse variables with the K-Means algorithm. The feature engineering module constructs derived variables from business logic and the relationships among variables to produce a higher-dimensional wide feature table; feature selection can be performed during random forest training.
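The preprocessing described above might look like the following sketch, which assumes synthetic NumPy arrays in place of real customer data. It imputes one partly missing variable with a RandomForestRegressor trained on the rows where the variable is observed, and clusters a mostly-zero variable with K-Means. The 20% missing rate, the three clusters, and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Fully observed predictor variables (hypothetical).
X = rng.normal(size=(n, 4))

# Variable to impute: depends on the predictors, with about 20% missing.
target = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=n)
missing = rng.random(n) < 0.2
target_obs = target.copy()
target_obs[missing] = np.nan

# Fit on rows where the variable is observed, predict the rest.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X[~missing], target_obs[~missing])
target_filled = target_obs.copy()
target_filled[missing] = reg.predict(X[missing])

# Sparse variable: mostly zeros with occasional positive values.
sparse_var = np.where(
    rng.random(n) < 0.85, 0.0, rng.exponential(scale=50.0, size=n)
)

# Replace the raw sparse values with a cluster label.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
sparse_cluster = km.fit_predict(sparse_var.reshape(-1, 1))
```

The cluster label can then be used as a categorical feature in place of the raw sparse variable.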
Furthermore, the model training and parameter tuning module trains the random forest model with multithreaded parallelism and an optimal parameter combination by adjusting the ensemble settings, the weak classifier settings, and the system settings, which achieves faster training. The feature importance assessment module obtains the importance of each feature after the random forest model is trained, selects features in descending order of importance, and retrains and retunes to obtain a random forest model with better generalization and stability; it also builds a consumer credit risk assessment system and its index weights from the feature importances.
Further, the model evaluation and selection module evaluates the overall performance of the random forest model with KS and AUC and compares it with the results of models such as LR, SVM, GBDT, and XGBoost trained on the initial modeling data. The model deployment monitoring module deploys the final random forest model to the system platform, monitors indicators such as the IV and mean of each variable and the PSI, KS, and AUC of the model score distribution, and updates and optimizes the model accordingly.
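The evaluation and monitoring indicators can be computed as in the sketch below. It assumes the common definitions: KS as the maximum gap between the true-positive and false-positive rate curves, and PSI as the sum over score bins of (actual share - expected share) * ln(actual share / expected share). The helper names ks_statistic and psi, the ten quantile bins, and the simulated scores are assumptions, not part of the patent.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ks_statistic(y_true, y_score):
    """KS: largest gap between the TPR and FPR curves."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index over quantile bins of `expected`."""
    # Inner bin edges taken from the development (expected) sample.
    inner = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    e_frac = np.bincount(e_idx, minlength=n_bins) / len(expected)
    a_frac = np.bincount(a_idx, minlength=n_bins) / len(actual)
    # Clip so that empty bins do not produce log(0).
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Simulated labels and scores (hypothetical).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=2000), 0, 1)

auc = roc_auc_score(y_true, y_score)
ks = ks_statistic(y_true, y_score)
psi_same = psi(y_score, y_score)                             # no drift
psi_shift = psi(y_score, np.clip(y_score + 0.1, 0, 1))       # shifted scores
```

A PSI close to zero indicates that the monitored score distribution matches the development sample; the thresholds used to trigger retraining are a business decision and are not specified here.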
The invention has the beneficial effects that:
1. Data preprocessing is simple: a random forest handles sparse data effectively, treating sparse values as a category of their own without imputation. As a tree model, it handles continuous, discrete, and other data types and does not require normalization or standardization. The combination of bootstrap aggregation (bagging) and random feature selection at each split makes it more tolerant of outliers and noisy data. In addition, decision trees built with the random forest algorithm can predict and impute missing values, which recovers much of the latent information behind them.
2. Feature engineering and model training are efficient: each decision tree of a random forest can be built independently and simultaneously, so the algorithm parallelizes well and is easy to train in a distributed manner. It can handle thousands of high-dimensional variables without prior feature selection, which meets the speed and efficiency requirements of training on high-dimensional data in the big data era.
3. Model accuracy is high: bagging and random feature selection at each split introduce two sources of randomness, random samples and random features. An unbiased estimate of the true error is obtained without reducing the amount of training data. When the classes in the modeling sample are imbalanced, a random forest provides an effective way to balance the data set error. Column sampling (feature sampling) reduces the correlation between sub-models and thus the variance of the ensemble, and by reducing the variance of the weak classifiers the random forest lowers the generalization error, which improves the accuracy of the ensemble model.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required by the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of a risk assessment method for a consumer credit scenario based on a random forest algorithm according to an embodiment of the present invention;
FIG. 2 is a flow chart of a risk assessment method for a consumer credit scenario based on a random forest algorithm according to an embodiment of the present invention.
In the figure:
1. an information acquisition module; 2. a data preprocessing module; 3. a feature engineering module; 4. a model training and parameter tuning module; 5. a feature importance assessment module; 6. a model evaluation and selection module; 7. a model deployment monitoring module.
Detailed Description
To further explain the embodiments, the invention is accompanied by drawings. These drawings form part of the disclosure; they illustrate the embodiments and, together with the description, explain their principles of operation. With reference to them, those of ordinary skill in the art will understand other possible embodiments and the advantages of the invention. The components in the drawings are not drawn to scale, and like reference numerals generally denote like components.
According to the embodiment of the invention, a risk assessment method for a consumption credit scene based on a random forest algorithm is provided.
The first embodiment is as follows:
As shown in FIGS. 1-2, the risk assessment method for a consumer credit scenario based on a random forest algorithm according to an embodiment of the present invention includes an information acquisition module 1, a data preprocessing module 2, a feature engineering module 3, a model training and parameter tuning module 4, a feature importance assessment module 5, a model evaluation and selection module 6, and a model deployment monitoring module 7. The information acquisition module 1 is connected to the feature engineering module 3 through the data preprocessing module 2, the feature engineering module 3 is connected to the feature importance assessment module 5 through the model training and parameter tuning module 4, and the feature importance assessment module 5 is connected to the model deployment monitoring module 7 through the model evaluation and selection module 6.
In one embodiment, the risk assessment method for the consumption credit scene based on the random forest algorithm comprises the following steps:
step S101, obtaining modeling data: modeling sample customers are randomly drawn from the company's business system by application month. For imbalanced samples whose performance is not yet mature (for example, recent applicants whose bad-debt rate is clearly lower than that of earlier customers), the modeling customers are obtained with SMOTE (Synthetic Minority Oversampling Technique). Using the customer number as the primary key, the application data, credit bureau data, APP event-tracking data, and customer-authorized third-party data are joined into a modeling data set;
step S103, data preprocessing: the uniqueness of the user number and the completeness of the samples serve as the data quality criteria. Descriptive statistics are computed for the variables of the modeling sample: a distribution plot shows the range of each variable, and the mean, quantiles, outliers, and missing values are counted. If a variable with high business relevance has a high missing rate, decision trees built with the random forest algorithm can be used to predict and impute the missing values, which helps recover the missing information. Sparser variables can be clustered with the K-Means algorithm, which simplifies the subsequent feature engineering;
step S105, feature engineering: after preprocessing, the raw data is processed into features. More predictive and interpretable variables are usually obtained by constructing derived variables. Common derivation methods include counts, sums, ratios, time differences, and volatility. Two variables linked by business logic can also be combined by addition, subtraction, multiplication, or division to produce derived variables. The result is a higher-dimensional wide feature table, and feature selection can be performed during random forest training and tuning;
step S107, model training and parameter tuning: model training and parameter tuning are carried out with the random forest classifier (RandomForestClassifier) in the sklearn module of Python;
step S109, feature importance assessment and feature selection: unlike traditional scorecard models, the random forest model can output feature importances, which are the normalized importance values of the features; the higher the importance, the more the feature contributes to the prediction function. Random forest feature importance is implemented in sklearn: after training, the importance of each feature is read directly from the feature_importances_ attribute. The features are sorted in descending order of importance, and the TOP500 or TOP100 features are selected, depending on the total number of features, to form a new feature subset. Training and tuning are then repeated to obtain a random forest model with better generalization and stability. A consumer credit risk assessment system and its index weights can be built from the feature importances to assess each customer's credit score and overdue risk level;
step S111, model evaluation and selection: the overall performance of the random forest model is evaluated with KS and AUC. The KS value reflects how well the model separates good customers from bad ones; the AUC value allows the model to be evaluated reliably when the classes are imbalanced. The accuracy and stability of the random forest model are assessed comprehensively by comparing it with LR, SVM, GBDT, XGBoost, and traditional scorecard models;
step S113, model deployment monitoring: after a random forest model with the optimal parameter combination is obtained through repeated training with grid search and five-fold cross-validation, the model is deployed to the system platform. The model is then updated and optimized by monitoring indicators such as the IV and mean of each variable, the PSI (Population Stability Index) of the model score distribution, KS (Kolmogorov-Smirnov statistic), and AUC (area under the ROC curve).
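The grid search with five-fold cross-validation mentioned in step S113 can be sketched with scikit-learn's GridSearchCV. The parameter grid, the roc_auc scoring choice, the stratified folds, and the synthetic data are assumptions for illustration; the patent does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data standing in for the modeling data set.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Small illustrative grid; a real search would cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 8],
    "min_samples_leaf": [1, 5],
}

# Five stratified folds keep the bad-customer rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # select the combination with the best mean AUC
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
```

After the search, `search.best_estimator_` holds a model refitted on all of the data with the selected parameters, which is the candidate for deployment.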
In one embodiment, the random forest classifier (RandomForestClassifier) includes the following model parameters:
number of sub-models (n_estimators): this parameter relates to the complexity of the random forest model. In theory, the more sub-models there are, the more stable the result, but the computational cost rises sharply, and beyond a certain number the performance gain is small, so in practice a moderate value is chosen by tuning;
maximum number of features per decision tree (max_features): specifies the maximum number of features randomly considered by a single tree at each split. The smaller max_features is, the smaller the overall variance of the model, which improves accuracy. For random forest classification, max_features is usually set to the square root (sqrt) of the total number of features;
maximum depth of tree (max_depth): by experience this is usually set to None (unlimited), so that all features are considered when splitting and each tree grows fully;
minimum number of samples to split a node (min_samples_split): limits the condition under which a subtree continues to split; min_samples_split is the minimum number of samples a node must contain to be split further. The default value in scikit-learn is 2. If there are many samples and features, min_samples_split can be increased to reduce computation;
maximum leaf node number (max _ leaf _ nodes): overfitting can be prevented by limiting the maximum leaf node number, the default is 'None' (namely the maximum leaf node number is not limited), and if the model features are more, a specific value can be selected through cross validation to be limited;
leaf node minimum sample number (min _ samples _ leaf): each decision tree is specified to be completely generated, namely, a leaf only contains a single sample, if the number of nodes of a certain leaf is less than the threshold value, the leaf and the sibling node are pruned together, the default value is 1, and if the magnitude order of the sample size is very large, the value is recommended to be increased;
leaf node minimum weight total (min _ weight _ fraction _ leaf): the minimum value of the sum of the weights of all samples of the leaf node is limited, if the minimum value is less than the threshold value, the leaf node and the sibling node are pruned together, the default value is 0, that is, the weight problem is not considered, generally, if the missing value of the sample is more or the distribution deviation of the sample is larger, the min _ weight _ fraction _ leaf is adjusted, and the introduction of the sample weight can make the missing data and the unbalanced data more robust.
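The parameters listed above map directly onto the constructor of scikit-learn's `RandomForestClassifier`, and the grid search with five-fold cross-validation described in step S113 can be expressed with `GridSearchCV`. The following is a minimal sketch under stated assumptions: the data set is synthetic, and the grid values are hypothetical and far smaller than a production search would use. Note that scikit-learn requires `min_samples_split` to be at least 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for a credit modeling table (hypothetical data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)

# Fixed parameters follow the defaults and conventions discussed in the text.
base = RandomForestClassifier(
    max_features="sqrt",           # square root of the total feature count
    max_depth=None,                # grow each tree fully
    min_samples_split=2,           # smallest value scikit-learn accepts
    max_leaf_nodes=None,           # no cap on the number of leaves
    min_weight_fraction_leaf=0.0,  # sample weights not considered
    random_state=0,
)

# Hypothetical grid over two of the parameters.
param_grid = {
    "n_estimators": [30, 60],
    "min_samples_leaf": [1, 5],
}

# cv=5 gives five-fold (stratified) cross-validation; AUC is the selection metric.
search = GridSearchCV(base, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
best_params = search.best_params_
best_auc = search.best_score_
```

After fitting, `search.best_estimator_` holds a forest refitted on all rows with the selected combination, which is the object that would move on to evaluation.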
In one embodiment, when tuning a single tree, adjusting max_leaf_nodes or max_depth adjusts the structure of the tree at a coarse granularity: the more leaf nodes or the deeper the tree, the lower the bias and the higher the variance of the sub-model; adjusting min_samples_split, min_samples_leaf and min_weight_fraction_leaf adjusts the structure of the tree at a fine granularity: the fewer samples required for a split or for a leaf node, the more complex the sub-model, which lowers its bias but raises its variance. In addition, other parameters of the model are adjusted, including warm_start (warm start, which keeps the trees already fitted when more are added; the scikit-learn default is False), n_jobs (number of jobs run in parallel) and criterion (splitting criterion; the default is the Gini impurity 'gini', as used by the CART algorithm), so that multi-threaded parallel operation of the random forest model can be realized and faster training achieved.
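The effect of `warm_start` and `n_jobs` can be seen in a short sketch. The data is synthetic and the tree counts are hypothetical; `warm_start` is set explicitly because its default in scikit-learn is False.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=500, n_features=15, random_state=1)

forest = RandomForestClassifier(
    n_estimators=20,
    criterion="gini",   # Gini impurity as the splitting criterion
    warm_start=True,    # keep fitted trees when n_estimators is raised
    n_jobs=-1,          # build trees in parallel on all available cores
    random_state=1,
)
forest.fit(X, y)
trees_before = len(forest.estimators_)

# With warm_start=True, raising n_estimators and refitting adds new trees
# instead of rebuilding the existing ones.
forest.set_params(n_estimators=40)
forest.fit(X, y)
trees_after = len(forest.estimators_)
```

Here the second call to `fit` trains only the 20 additional trees, which is where the saving in training time comes from when the forest is grown incrementally.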
In one embodiment, the information acquisition module 1 extracts modeling sample customers from the company business system and obtains customer application data, APP event-tracking (buried point) data and authorized third-party data to form the initial modeling data; the data preprocessing module 2 performs descriptive statistics on the data variables, imputes missing values using the random forest algorithm, and clusters sparse variables using the K-Means algorithm; and the feature engineering module 3 constructs derived variables according to business logic and the relationships between variables to generate a higher-dimensional wide feature table, on which feature selection can be performed during random forest model training.
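The two preprocessing steps named here, imputing missing values with a random forest and clustering a sparse variable with K-Means, can be sketched as follows. This is one possible reading of the text, not the exact procedure of the invention: the column names, the 20% missing rate, the number of clusters and all data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical table: three complete predictors and one column ("income")
# that depends on them and has gaps.
predictors = rng.normal(size=(300, 3))
income = 2.0 * predictors[:, 0] - predictors[:, 1] + rng.normal(scale=0.1, size=300)
missing = rng.random(300) < 0.2
income_observed = np.where(missing, np.nan, income)

# Train on rows where the column is present, then predict it where it is missing.
imputer = RandomForestRegressor(n_estimators=100, random_state=0)
imputer.fit(predictors[~missing], income_observed[~missing])
income_filled = income_observed.copy()
income_filled[missing] = imputer.predict(predictors[missing])

# Cluster a sparse variable (mostly zeros) into a few groups and use the
# cluster label as a denser categorical feature.
sparse_var = np.where(rng.random(300) < 0.85, 0.0, rng.exponential(5.0, size=300))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_label = kmeans.fit_predict(sparse_var.reshape(-1, 1))
```

For a categorical column with gaps, a `RandomForestClassifier` would take the place of the regressor in the same pattern.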
In one embodiment, the model training and parameter adjusting module 4 tunes the random forest ensemble algorithm, the weak classifiers and the system settings so that the random forest model is trained in a multi-threaded, parallel manner with the optimal parameter combination, thereby achieving faster training; the feature importance evaluation module 5 obtains the importance of each feature after the random forest model is trained, selects features in descending order of importance, and performs training and parameter tuning again to obtain a random forest model with better generalization and stability; a risk assessment system for the consumer credit industry, together with its index weights, is constructed according to the feature importance.
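The importance-based selection performed by module 5 can be sketched with the `feature_importances_` attribute of a fitted scikit-learn forest. The data is synthetic, and the cut-off `top_k` is a hypothetical stand-in for whatever cut-off a real project would choose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the five informative features in the first five columns.
X, y = make_classification(n_samples=800, n_features=30, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_  # normalized so that they sum to 1

# Sort features by descending importance and keep the strongest ones.
top_k = 10  # hypothetical cut-off
selected = np.argsort(importances)[::-1][:top_k]

# Retrain on the reduced feature subset.
refit = RandomForestClassifier(n_estimators=200, random_state=0)
refit.fit(X[:, selected], y)
```

The refitted model would then go through parameter tuning again, as the text describes, before its importances are used as index weights.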
In one embodiment, the model evaluation and selection module 6 evaluates the overall effect of the random forest model through KS and AUC and compares it with the results of models such as LR, SVM, GBDT and XGBoost built on the initial modeling data; and the model deployment monitoring module 7 deploys the final random forest model to the system platform, monitors indicators such as the IV and mean of each variable and the PSI, KS and AUC of the model, and updates, adjusts and optimizes the model.
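Both evaluation metrics can be read off the same ROC sweep: KS is the largest gap between the cumulative bad-customer rate and the cumulative good-customer rate, and AUC is the area under the curve those two rates trace out. The following dependency-free sketch assumes label 1 marks a bad customer and a higher score means higher risk; the function name `ks_and_auc` and the sample data are hypothetical.

```python
def ks_and_auc(labels, scores):
    """Return (KS, AUC) for binary labels (1 = bad customer) and risk scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    tp = fp = 0
    ks = auc = 0.0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(pairs):
        # Consume all samples that share a score as one threshold step.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        tpr, fpr = tp / n_bad, fp / n_good
        ks = max(ks, abs(tpr - fpr))
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid rule
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return ks, auc


# Hypothetical predictions for six customers.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ks, auc = ks_and_auc(labels, scores)
```

The same function can be applied to the scores of each candidate model (LR, SVM, GBDT, XGBoost and the random forest) on a common hold-out set so that the comparison is like for like.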
In summary, the random forest is a relatively advanced ensemble algorithm that combines bagging (bootstrap aggregating) with random selection of features at each split; it has relatively low variance and bias and good generalization performance, can improve prediction accuracy without significantly increasing the amount of computation, and can also process complex, high-dimensional sparse data; it overcomes the insufficient stability, accuracy and generalization of the single model used by a traditional scorecard, and has the following advantages:
Data preprocessing is simple: the random forest can handle sparse data effectively, and sparse data can be treated as a category of its own without filling; as a tree model, the random forest can process continuous, discrete and other data types and does not require normalization or standardization of the data; the combination of bagging and random feature selection makes it more tolerant of outliers and noisy data; in addition, decision trees built with the random forest algorithm can be used to predict and impute missing values, which supplements the information latent in the missing data.
Feature engineering and model training are efficient: the decision trees of a random forest can be built independently and simultaneously, so the algorithm is easy to parallelize and to train in a distributed manner; it can process thousands of high-dimensional variables without prior feature selection, which meets the speed and efficiency requirements of training on high-dimensional data in the current big-data era.
Model accuracy is high: bagging and random feature selection introduce two kinds of randomness, random samples and random features; the out-of-bag samples give an approximately unbiased estimate of the true error without reducing the amount of training data used in modeling; when the modeling samples are class-imbalanced, the random forest provides an effective means of balancing the errors of the data set; through column sampling (feature sampling) the random forest reduces the correlation between sub-models and thereby the variance of the ensemble; by reducing variance relative to its weak classifiers, the random forest lowers the generalization error and so improves the accuracy of the ensemble model.
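The two sources of randomness and the error estimate obtained without a separate hold-out set correspond to the `bootstrap`, `max_features` and `oob_score` options of scikit-learn's `RandomForestClassifier`; `class_weight="balanced"` is one way, assumed here, to address class imbalance. The sketch below uses synthetic data with a hypothetical 90/10 class split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced data (roughly 10% positives), for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,           # random samples: rows drawn with replacement per tree
    max_features="sqrt",      # random features: a column subset at each split
    oob_score=True,           # score each row using only trees that never saw it
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=0,
)
forest.fit(X, y)
oob_accuracy = forest.oob_score_
```

Because each row is scored only by trees whose bootstrap sample excluded it, `oob_accuracy` behaves like a validation score while every row still contributes to training.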
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. The risk assessment method for the credit consumption scene based on the random forest algorithm is characterized by comprising an information acquisition module (1), a data preprocessing module (2), a feature engineering module (3), a model training and parameter adjusting module (4), a feature importance assessment module (5), a model evaluation and selection module (6) and a model deployment monitoring module (7), wherein the information acquisition module (1) is connected with the feature engineering module (3) through the data preprocessing module (2), the feature engineering module (3) is connected with the feature importance assessment module (5) through the model training and parameter adjusting module (4), and the feature importance assessment module (5) is connected with the model deployment monitoring module (7) through the model evaluation and selection module (6).
2. The risk assessment method for the consumer credit scenario based on the random forest algorithm, characterized in that the method comprises the following steps:
obtaining modeling data: modeling sample customers are randomly extracted from the company business system by application month; for imbalanced samples with insufficient performance data (for example, recently applied customers whose bad-debt rate is markedly lower than that of earlier customers), modeling customers are obtained using SMOTE (Synthetic Minority Over-sampling Technique); using the modeling customer number as the primary key, the application data, credit investigation data, APP event-tracking (buried point) data and customer-authorized third-party data are extracted by association and merged into a modeling data set;
data preprocessing: the uniqueness of the user number and the completeness of the sample are checked as the sample data quality standard; statistical analysis is performed on the variables of the modeling sample, the distribution range of each variable can be visualized with a distribution plot, and the mean, quantiles, outliers and missing values of each variable are counted; if a variable with high business relevance has a high missing rate, decision trees built with the random forest algorithm can be used to predict and impute the missing values, which effectively helps fill in the missing variable information; sparse variables can be clustered using the K-Means algorithm, and clustering sparse variables facilitates the subsequent feature engineering;
feature engineering: after preprocessing, the raw data undergoes feature processing; variables with more predictive and explanatory power are generally obtained by constructing derived variables; common feature derivation methods include counting, summing, proportions, time differences, volatility and the like, which mine further useful variables; two variables linked by business logic can be combined by addition, subtraction, multiplication, division and similar operations to generate derived variables; a higher-dimensional wide feature table is finally generated, on which feature selection can be performed during random forest model training and optimization;
model training and parameter adjustment: model training and parameter tuning are performed using the random forest classifier (RandomForestClassifier) in the sklearn module of Python;
feature importance assessment and feature selection: unlike a traditional scorecard model, the random forest model can output the importance of each feature, where the importance is the normalized importance value of each feature; the higher the importance of a feature, the greater its contribution to the prediction function; feature importance evaluation with random forests is implemented in sklearn: after the random forest model is trained, the importance of each feature can be obtained directly from feature_importances_; the features are sorted in descending order, the TOP500 or TOP100 most important features are selected according to the total number of features in the sample to obtain a new feature subset, and training and parameter tuning are then performed again to finally obtain a random forest model with better generalization and stability; a risk assessment system for the consumer credit industry and its index weights can also be constructed according to the feature importance, so as to assess the credit score and overdue risk level of a customer;
model evaluation and selection: the overall effect of the random forest model is evaluated through KS and AUC; the KS value reflects whether the model discriminates sufficiently between good and bad customers; the AUC value allows the quality of the model to be evaluated accurately even when the sample is imbalanced; the accuracy and stability of the random forest model are comprehensively evaluated by comparison with LR, SVM, GBDT, XGBoost and other traditional scorecard models;
model deployment and monitoring: after a random forest model with the optimal parameter combination is obtained through repeated training with grid-search parameter tuning and five-fold cross-validation, the model is deployed to the system platform; the model is then updated, adjusted and optimized by monitoring indicators such as the IV and mean of each variable, the PSI (Population Stability Index) of the model score distribution, the KS (Kolmogorov-Smirnov) statistic and the AUC (Area Under the ROC Curve).
3. The risk assessment method for the consumer credit scenario based on the random forest algorithm according to claim 2, characterized in that the random forest classifier (RandomForestClassifier) comprises the following model parameters:
number of sub-models (n_estimators): this parameter governs the complexity of the random forest model; in theory, the more sub-models there are, the more stable the result, but the amount of computation increases greatly, and once n_estimators reaches a certain number the gain in model performance is small, so in practice a moderate value is generally chosen through parameter tuning;
maximum number of features per decision tree (max_features): specifies the maximum number of features randomly selected by a single tree at each split; the smaller max_features is, the less correlated the trees are and the smaller the overall variance of the model, which tends to improve accuracy, although each individual tree becomes more biased; in random forest classification problems max_features is generally set to the square root (sqrt) of the total number of features;
maximum depth of the tree (max_depth): by experience it is generally set to None (that is, unlimited), so that the candidate features are considered at every split and each tree is grown fully;
minimum number of samples required to split a node (min_samples_split): restricts the condition under which a subtree continues to split; min_samples_split is the minimum number of samples that a node must contain before it can be split further; the default value in scikit-learn is 2; if there are many samples and features, min_samples_split can be increased appropriately to reduce computation;
maximum number of leaf nodes (max_leaf_nodes): overfitting can be prevented by limiting the maximum number of leaf nodes; the default is None (that is, the number of leaf nodes is not limited); if the model has many features, a specific limit can be chosen through cross-validation;
minimum number of samples per leaf node (min_samples_leaf): with the default value of 1 each decision tree is grown fully, that is, a leaf may contain only a single sample; if a leaf would contain fewer samples than this threshold, the leaf and its sibling node are pruned together; if the sample size is very large, it is recommended to increase this value;
minimum weighted fraction per leaf node (min_weight_fraction_leaf): limits the minimum value of the sum of the sample weights in a leaf node; if the sum is below this threshold, the leaf and its sibling node are pruned together; the default value is 0, that is, sample weights are not considered; generally, if the samples have many missing values or the class distribution is strongly skewed, min_weight_fraction_leaf is adjusted, and introducing sample weights can make the model more robust to missing and imbalanced data.
4. The risk assessment method for the consumer credit scenario based on the random forest algorithm according to claim 2, characterized in that, when tuning a single tree, adjusting max_leaf_nodes or max_depth adjusts the structure of the tree at a coarse granularity: the more leaf nodes or the deeper the tree, the lower the bias and the higher the variance of the sub-model; adjusting min_samples_split, min_samples_leaf and min_weight_fraction_leaf adjusts the structure of the tree at a fine granularity: the fewer samples required for a split or for a leaf node, the more complex the sub-model, which lowers its bias but raises its variance. In addition, by adjusting other hyper-parameters of the model, including warm_start (warm start, which keeps the trees already fitted when more are added; the scikit-learn default is False), n_jobs (number of jobs run in parallel) and criterion (splitting criterion; the default is the Gini impurity 'gini', as used by the CART algorithm), multi-threaded parallel operation of the random forest model can be realized and faster training achieved.
5. The risk assessment method for the consumer credit scenario based on the random forest algorithm, characterized in that the information acquisition module (1) extracts modeling sample customers from the company business system and obtains customer application data, APP event-tracking (buried point) data and authorized third-party data to form the initial modeling data; the data preprocessing module (2) performs descriptive statistics on the data variables, imputes missing values using the random forest algorithm, and clusters sparse variables using the K-Means algorithm; and the feature engineering module (3) constructs derived variables according to business logic and the relationships between variables to generate a higher-dimensional wide feature table, on which feature selection can be performed during random forest model training.
6. The risk assessment method for the consumer credit scenario based on the random forest algorithm according to claim 1, characterized in that the model training and parameter adjusting module (4) tunes the random forest ensemble algorithm, the weak classifiers and the system settings so that the random forest model is trained in a multi-threaded, parallel manner with the optimal parameter combination, thereby achieving faster training; the feature importance evaluation module (5) obtains the importance of each feature after the random forest model is trained, selects features in descending order of importance, and performs training and parameter tuning again to obtain a random forest model with better generalization and stability; and a risk assessment system for the consumer credit industry, together with its index weights, is constructed according to the feature importance.
7. The risk assessment method for the consumer credit scenario based on the random forest algorithm according to claim 1, characterized in that the model evaluation and selection module (6) evaluates the overall effect of the random forest model through KS and AUC and compares it with the results of models such as LR, SVM, GBDT and XGBoost built on the initial modeling data; and the model deployment monitoring module (7) deploys the final random forest model to the system platform, monitors indicators such as the IV and mean of each variable and the PSI, KS and AUC of the model, and updates, adjusts and optimizes the model.
CN202010784787.0A 2020-08-06 2020-08-06 Risk assessment method for consumption credit scene based on random forest algorithm Withdrawn CN112037009A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010784787.0A CN112037009A (en) 2020-08-06 2020-08-06 Risk assessment method for consumption credit scene based on random forest algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010784787.0A CN112037009A (en) 2020-08-06 2020-08-06 Risk assessment method for consumption credit scene based on random forest algorithm

Publications (1)

Publication Number Publication Date
CN112037009A true CN112037009A (en) 2020-12-04

Family

ID=73582353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010784787.0A Withdrawn CN112037009A (en) 2020-08-06 2020-08-06 Risk assessment method for consumption credit scene based on random forest algorithm

Country Status (1)

Country Link
CN (1) CN112037009A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200272A (en) * 2020-12-07 2021-01-08 上海冰鉴信息科技有限公司 Service classification method and device
CN112419050A (en) * 2020-12-24 2021-02-26 浙江工商大学 Credit evaluation method and device based on telephone communication network and social behavior
CN112561376A (en) * 2020-12-23 2021-03-26 北京橙色云科技有限公司 Method and device for splitting project and storage medium
CN112785418A (en) * 2021-01-22 2021-05-11 深圳前海微众银行股份有限公司 Credit risk modeling method, credit risk modeling device, credit risk modeling equipment and computer readable storage medium
CN112801563A (en) * 2021-04-14 2021-05-14 支付宝(杭州)信息技术有限公司 Risk assessment method and device
CN112862594A (en) * 2021-02-01 2021-05-28 深圳无域科技技术有限公司 Financial risk control method, system, device and computer readable medium
CN112907359A (en) * 2021-03-24 2021-06-04 四川奇力韦创新科技有限公司 Bank loan business qualification auditing and risk control system and method
CN113409139A (en) * 2021-07-27 2021-09-17 深圳前海微众银行股份有限公司 Credit risk identification method, apparatus, device, and program
CN113610366A (en) * 2021-07-23 2021-11-05 上海淇玥信息技术有限公司 Risk warning generation method and device and electronic equipment
CN113705904A (en) * 2021-08-31 2021-11-26 国网上海市电力公司 Chemical plant area power utilization fault prediction method based on random forest algorithm
CN115409613A (en) * 2022-09-13 2022-11-29 中债金科信息技术有限公司 Bond risk detection model training method and bond risk detection method
CN115953239A (en) * 2023-03-15 2023-04-11 无锡锡商银行股份有限公司 Surface examination video scene evaluation method based on multi-frequency flow network model
CN115993444A (en) * 2022-12-19 2023-04-21 郑州大学 Dual-color immunofluorescence detection method for human serum cerebrospinal fluid GFAP antibody
CN116702052A (en) * 2023-08-02 2023-09-05 云南香农信息技术有限公司 Community social credit system information processing system and method
CN116862643A (en) * 2023-06-25 2023-10-10 福建润楼数字科技有限公司 Automatic wind control feature screening method for multi-channel fund integration credit business
CN117150389A (en) * 2023-07-14 2023-12-01 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117171533A (en) * 2023-11-02 2023-12-05 山东省国土测绘院 Real-time acquisition and processing method and system for geographical mapping operation data
CN117370827A (en) * 2023-12-07 2024-01-09 飞特质科(北京)计量检测技术有限公司 Fan quality grade assessment method based on deep clustering model
CN117874654A (en) * 2024-03-13 2024-04-12 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200272A (en) * 2020-12-07 2021-01-08 上海冰鉴信息科技有限公司 Service classification method and device
CN112561376A (en) * 2020-12-23 2021-03-26 北京橙色云科技有限公司 Method and device for splitting project and storage medium
CN112419050B (en) * 2020-12-24 2022-05-24 浙江工商大学 Credit evaluation method and device based on telephone communication network and social behavior
CN112419050A (en) * 2020-12-24 2021-02-26 浙江工商大学 Credit evaluation method and device based on telephone communication network and social behavior
CN112785418A (en) * 2021-01-22 2021-05-11 深圳前海微众银行股份有限公司 Credit risk modeling method, credit risk modeling device, credit risk modeling equipment and computer readable storage medium
CN112785418B (en) * 2021-01-22 2024-02-06 深圳前海微众银行股份有限公司 Credit risk modeling method, apparatus, device and computer readable storage medium
CN112862594A (en) * 2021-02-01 2021-05-28 深圳无域科技技术有限公司 Financial risk control method, system, device and computer readable medium
CN112907359A (en) * 2021-03-24 2021-06-04 四川奇力韦创新科技有限公司 Bank loan business qualification auditing and risk control system and method
CN112907359B (en) * 2021-03-24 2024-03-12 四川奇力韦创新科技有限公司 Bank loan business qualification auditing and risk control system and method
CN112801563B (en) * 2021-04-14 2021-08-17 支付宝(杭州)信息技术有限公司 Risk assessment method and device
CN112801563A (en) * 2021-04-14 2021-05-14 支付宝(杭州)信息技术有限公司 Risk assessment method and device
CN113610366A (en) * 2021-07-23 2021-11-05 上海淇玥信息技术有限公司 Risk warning generation method and device and electronic equipment
CN113409139A (en) * 2021-07-27 2021-09-17 深圳前海微众银行股份有限公司 Credit risk identification method, apparatus, device, and program
CN113705904A (en) * 2021-08-31 2021-11-26 国网上海市电力公司 Chemical plant area power utilization fault prediction method based on random forest algorithm
CN115409613A (en) * 2022-09-13 2022-11-29 中债金科信息技术有限公司 Bond risk detection model training method and bond risk detection method
CN115993444A (en) * 2022-12-19 2023-04-21 郑州大学 Dual-color immunofluorescence detection method for human serum cerebrospinal fluid GFAP antibody
CN115953239A (en) * 2023-03-15 2023-04-11 无锡锡商银行股份有限公司 Surface examination video scene evaluation method based on multi-frequency flow network model
CN116862643A (en) * 2023-06-25 2023-10-10 福建润楼数字科技有限公司 Automatic wind control feature screening method for multi-channel fund integration credit business
CN117150389A (en) * 2023-07-14 2023-12-01 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117150389B (en) * 2023-07-14 2024-04-12 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN116702052A (en) * 2023-08-02 2023-09-05 云南香农信息技术有限公司 Community social credit system information processing system and method
CN116702052B (en) * 2023-08-02 2023-10-27 云南香农信息技术有限公司 Community social credit system information processing system and method
CN117171533A (en) * 2023-11-02 2023-12-05 山东省国土测绘院 Real-time acquisition and processing method and system for geographical mapping operation data
CN117171533B (en) * 2023-11-02 2024-01-16 山东省国土测绘院 Real-time acquisition and processing method and system for geographical mapping operation data
CN117370827A (en) * 2023-12-07 2024-01-09 飞特质科(北京)计量检测技术有限公司 Fan quality grade assessment method based on deep clustering model
CN117874654A (en) * 2024-03-13 2024-04-12 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm

Similar Documents

Publication Publication Date Title
CN112037009A (en) Risk assessment method for consumption credit scene based on random forest algorithm
CN108898479B (en) Credit evaluation model construction method and device
Isa et al. Using the self organizing map for clustering of text documents
Fan et al. Robust deep auto-encoding Gaussian process regression for unsupervised anomaly detection
CN111311401A (en) Financial default probability prediction model based on LightGBM
CN108898154A (en) A kind of electric load SOM-FCM Hierarchical clustering methods
CN112069310B (en) Text classification method and system based on active learning strategy
CN110826618A (en) Personal credit risk assessment method based on random forest
Pandey et al. An analysis of machine learning techniques (J48 & AdaBoost)-for classification
CN112001788B (en) Credit card illegal fraud identification method based on RF-DBSCAN algorithm
CN111488917A (en) Garbage image fine-grained classification method based on incremental learning
CN104702465A (en) Parallel network flow classification method
CN110717610A (en) Wind power prediction method based on data mining
CN110826617A (en) Situation element classification method and training method and device of model thereof, and server
CN111967971A (en) Bank client data processing method and device
AU2018101531A4 (en) Stock forecast model based on text news by random forest
CN115048988A (en) Unbalanced data set classification fusion method based on Gaussian mixture model
CN113139570A (en) Dam safety monitoring data completion method based on optimal hybrid valuation
CN114463036A (en) Information processing method and device and storage medium
Zhu et al. Loan default prediction based on convolutional neural network and LightGBM
CN111797899B (en) Low-voltage transformer area kmeans clustering method and system
CN113239199A (en) Credit classification method based on multi-party data set
CN115641177B (en) Second-prevention killing pre-judging system based on machine learning
CN116933947A (en) Landslide susceptibility prediction method based on soft voting integrated classifier
Zhang et al. Credit risk control algorithm based on stacking ensemble learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201204
