CN112037009A - Risk assessment method for consumption credit scene based on random forest algorithm - Google Patents
Risk assessment method for consumption credit scene based on random forest algorithm
- Publication number
- CN112037009A CN202010784787.0A
- Authority
- CN
- China
- Prior art keywords
- model
- random forest
- feature
- module
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- Marketing (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Entrepreneurship & Innovation (AREA)
- General Engineering & Computer Science (AREA)
- Economics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Business, Economics & Management (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Technology Law (AREA)
- Game Theory and Decision Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a risk assessment method for a consumption credit scene based on a random forest algorithm, comprising an information acquisition module, a data preprocessing module, a feature engineering module, a model training and parameter adjusting module, a feature importance assessment module, a model evaluation and selection module and a model deployment monitoring module. Beneficial effects: data preprocessing is simple, feature engineering and model training are efficient, and model accuracy is high. The random forest model is combined with the consumption credit scene of internet finance; owing to its algorithmic strengths, a random forest copes better with the high-dimensional sparsity, heavy noise and variable redundancy of internet data. Compared with traditional scoring card models, the random forest model offers higher risk prediction accuracy and stability, improves credit risk identification, and has reference value for practical applications of internet finance consumption credit.
Description
Technical Field
The invention relates to the technical field of risk control in the internet finance consumption credit industry, in particular to a risk assessment method for a consumption credit scene based on a random forest algorithm.
Background
With the rise of the "Internet+" concept, internet finance consumption credit companies, represented by P2P lending, consumer finance, car rental and the like, have sprung up like bamboo shoots after rain; after this period of unchecked growth, however, the pace of development and the fate of these companies hinge on risk control. Traditional risk control auditing relies on scoring card models based on machine learning algorithms such as logistic regression, decision trees, support vector machines and neural networks. These algorithms are highly interpretable and easy to understand, the weight of each feature can be read directly, and new data can easily be absorbed to update the model. As a result, even though ensemble algorithms such as GBDT, random forest and LightGBM have appeared one after another, the traditional scoring card model remains a common method for risk assessment in the consumer credit industry.
With the development of big data, internet credit data is no longer limited to application and credit investigation data; it increasingly incorporates third-party data such as online shopping consumption, social networking and APP usage behavior, so the data as a whole is high-dimensional and sparse. The traditional scoring card model shows obvious limitations on such internet data. The specific problems and difficulties are as follows:
data preprocessing is cumbersome: the traditional scoring card model places extremely high demands on data preprocessing and, for computational convenience, directly deletes large numbers of samples with sparse data or missing values, so much of the data's value is lost;
feature engineering is difficult: feature engineering for the traditional scoring card model is complex. Continuous data must be discretized and features screened; features are selected according to the weight of evidence (WOE) transformation and the information value (IV), and are also rejected according to collinearity among variables. Today's high-dimensional big-data risk control data far exceeds the processing capability of the traditional scoring card modeling system, and more advanced machine learning algorithms are urgently needed;
model accuracy is insufficient: compared with an ensemble model that combines a plurality of weak classifiers, the single model of the traditional scoring card lacks exploration and validation across models, and may suffer from insufficient stability and weak generalization capability.
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a risk assessment method of a consumption credit scene based on a random forest algorithm, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
a risk assessment method for a consumption credit scene based on a random forest algorithm comprises an information acquisition module, a data preprocessing module, a feature engineering module, a model training and parameter adjusting module, a feature importance assessment module, a model evaluation and selection module and a model deployment monitoring module, wherein the information acquisition module is connected with the feature engineering module through the data preprocessing module, the feature engineering module is connected with the feature importance assessment module through the model training and parameter adjusting module, and the feature importance assessment module is connected with the model deployment monitoring module through the model evaluation and selection module.
Further, the risk assessment method for the consumption credit scene based on the random forest algorithm comprises the following steps:
obtaining modeling data: modeling sample customers are randomly extracted from the company business system by application month. For imbalanced samples with insufficient performance (the bad-debt rate of recent applicants is noticeably lower than that of earlier applicants), modeling customers are obtained through SMOTE (Synthetic Minority Oversampling Technique). Using the modeling customer number as the primary key, the application data, credit investigation data, APP event-tracking (buried point) data and customer-authorized third-party data are joined and combined into the modeling data set;
data preprocessing: the uniqueness of the user number and the completeness of the samples are checked as the quality standard for the sample data. The variables of the modeling samples are analyzed statistically: a distribution plot shows each variable's range visually, and the mean, quantiles, outliers and missing values of each variable are counted. If a variable with high business relevance has a high missing rate, decision trees built with the random forest algorithm can predict and impute the missing values, which helps fill in the missing variable information effectively. Sparser variables can be clustered with the K-Means algorithm, and clustering sparse variables facilitates the subsequent feature engineering;
feature engineering: after preprocessing, the raw data undergoes feature processing, and more predictive and explanatory variables are generally obtained by constructing derived variables. Common feature derivation methods include counts, sums, ratios, time differences and volatility, which dig out more useful variables. Two variables linked by business logic can also be added, subtracted, multiplied or divided to generate derived variables. The result is a higher-dimensional wide feature table, and feature selection can be performed during random forest model training and optimization;
model training and parameter adjustment: model training and parameter adjustment are carried out with the random forest classifier (RandomForestClassifier) in the sklearn module of Python;
feature importance assessment and feature selection: unlike traditional scoring card models, the random forest model can output the importance of each feature. The importance is the normalized importance value of each feature; the higher a feature's importance, the better it matches the prediction function. Feature importance evaluation with random forests is implemented in sklearn: after the random forest model is trained, the importance of each feature is obtained directly from feature_importances_. The features are sorted in descending order, and the TOP500 or TOP100 features by importance are selected, depending on the total number of features in the sample, to obtain a new feature subset. Training and parameter adjustment are then carried out again to obtain a random forest model with better generalization and stability. A risk assessment system for the consumption credit industry, together with its index weights, can be constructed from the feature importance, so that the credit score and overdue risk level of a customer are assessed;
model evaluation and selection: the overall effect of the random forest model is evaluated through KS and AUC. The KS value reflects whether the model is accurate and whether it discriminates sufficiently between good and bad customers; the AUC value allows model quality to be evaluated reliably when the samples are imbalanced. The accuracy and stability of the random forest model are evaluated comprehensively by comparison with LR, SVM, GBDT, XGBoost and other traditional scoring card models;
model deployment monitoring: after a random forest model with an optimal parameter combination is obtained through repeated training with grid-search parameter adjustment and five-fold cross-validation, the model is deployed to the system platform. The model is then updated, adjusted and optimized by monitoring indexes such as the IV and mean of the variables, the PSI (Population Stability Index) of the model, KS (Kolmogorov-Smirnov statistic) and AUC (Area Under the ROC Curve).
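The KS and AUC indexes named in the evaluation step can be illustrated with a minimal sketch. It is not taken from the patent: the function name `ks_and_auc`, the convention that label 1 marks a bad customer, and the example data are assumptions made for illustration only.

```python
from typing import List, Tuple


def ks_and_auc(labels: List[int], scores: List[float]) -> Tuple[float, float]:
    """Return (KS, AUC) for binary labels (1 = bad customer) and risk scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Walk through the samples from highest to lowest score,
    # handling tied scores as a single group.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    ks = auc = 0.0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        # KS is the largest gap between the cumulative bad and good rates.
        ks = max(ks, abs(tpr - fpr))
        # AUC is accumulated with the trapezoid rule along the ROC curve.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return ks, auc
```

Calling `ks_and_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])` on this toy sample gives a KS of 2/3 and an AUC of 8/9; a production system would more likely rely on a tested library routine.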
Further, the random forest classifier (RandomForestClassifier) includes the following model parameters:
number of sub-models (n_estimators): this is related to the complexity of the random forest model. In theory, the more sub-models there are, the more stable the result, but the computational cost rises sharply, and once n_estimators reaches a certain number the performance gain is small; in practice a moderate value is therefore generally chosen by parameter tuning;
maximum number of features per decision tree (max_features): specifies the maximum number of features randomly considered by a single tree at each split. The smaller max_features is, the smaller the overall variance of the model, which improves accuracy; in random forest classification problems, max_features is generally set to the square root (sqrt) of the total number of features;
maximum depth of tree (max_depth): by experience this is generally set to None (that is, unlimited), so that all features are considered during splitting and the tree grows fully;
minimum number of samples for splitting a node (min_samples_split): limits the condition under which a subtree continues to split; min_samples_split is the minimum number of samples a tree node must hold to be split further. The default value in scikit-learn is 2, and if there are many samples and features, min_samples_split can be increased appropriately for computational convenience;
maximum number of leaf nodes (max_leaf_nodes): limiting the maximum number of leaf nodes prevents overfitting. The default is None (that is, the number of leaf nodes is unlimited); if the model has many features, a specific limit can be chosen through cross-validation;
minimum number of samples per leaf node (min_samples_leaf): with the default value of 1, each decision tree is grown fully, that is, a leaf may contain only a single sample. If the number of samples in a leaf is less than this threshold, the leaf and its sibling node are pruned together. If the sample size is very large, increasing this value is recommended;
minimum weight fraction per leaf node (min_weight_fraction_leaf): limits the minimum value of the sum of the weights of all samples in a leaf node; if the sum is below this threshold, the leaf and its sibling node are pruned together. The default value is 0, that is, weights are not considered. In general, min_weight_fraction_leaf is adjusted when the samples have many missing values or the sample distribution is strongly skewed; introducing sample weights makes the model more robust to missing and imbalanced data.
Further, when tuning a single tree, adjusting max_leaf_nodes or max_depth controls the structure of the tree at coarse granularity: the more leaf nodes or the deeper the tree, the lower the bias and the higher the variance of the sub-model. Adjusting min_samples_split, min_samples_leaf and min_weight_fraction_leaf controls the structure of the tree at fine granularity: the fewer samples required for a split or for a leaf node, the more complex the sub-model, so that a more accurate and efficient model can be obtained through tuning. In addition, adjusting further model parameters, including warm_start (warm start; the scikit-learn default is False), n_jobs (number of parallel jobs) and criterion (split criterion; the default is the Gini impurity 'gini', as in the CART algorithm), enables multithreaded parallel operation of the random forest model and faster training.
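As a minimal sketch of how the parameters listed above map onto scikit-learn's RandomForestClassifier, the following builds a classifier on synthetic data. The data set, the chosen values and the variable names are assumptions for illustration, not settings prescribed by the patent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for a credit data set (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,              # number of sub-models
    max_features="sqrt",           # features considered at each split
    max_depth=None,                # trees grow without a depth limit
    min_samples_split=2,           # minimum samples needed to split a node
    min_samples_leaf=1,            # minimum samples per leaf
    min_weight_fraction_leaf=0.0,  # minimum weighted fraction per leaf
    max_leaf_nodes=None,           # no cap on the number of leaves
    criterion="gini",              # split criterion
    n_jobs=-1,                     # use all available cores
    warm_start=False,              # do not reuse trees from an earlier fit
    random_state=0,
)
clf.fit(X, y)

# Predicted probability of the minority ("bad") class for each sample.
bad_probability = clf.predict_proba(X)[:, 1]
```

Every value shown except `n_jobs` and `random_state` equals the library default in recent scikit-learn releases, so the call mainly serves to make the tunable knobs explicit.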
Further, the information acquisition module extracts modeling sample customers from the company business system and obtains customer application data, APP event-tracking (buried point) data and authorized third-party data, yielding the initial modeling data. The data preprocessing module computes descriptive statistics for the data variables, imputes missing values with the random forest algorithm and clusters sparse variables with the K-Means algorithm. The feature engineering module constructs derived variables according to business logic and variable relationships to generate a higher-dimensional wide feature table, and feature selection can be carried out during random forest model training.
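One way the random forest imputation of missing values might look is sketched below. The patent does not specify the procedure, so this is an assumption: a RandomForestRegressor is trained on rows where the variable is observed and then predicts the missing entries. The variables `income`, `age` and `credit_limit` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
income = rng.normal(8000, 2000, n)
age = rng.integers(20, 60, n).astype(float)
# Hypothetical variable that depends on the other two and has gaps.
credit_limit = 2.0 * income + 100.0 * age + rng.normal(0, 500, n)
missing = rng.random(n) < 0.2
credit_limit_obs = credit_limit.copy()
credit_limit_obs[missing] = np.nan

# Train on rows where the variable is known, predict where it is missing.
X = np.column_stack([income, age])
known = ~np.isnan(credit_limit_obs)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[known], credit_limit_obs[known])

credit_limit_filled = credit_limit_obs.copy()
credit_limit_filled[~known] = rf.predict(X[~known])
```

Observed values are left untouched; only the gaps are filled with the forest's predictions.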
Furthermore, the model training and parameter adjusting module adjusts the random forest ensemble algorithm, the weak classifiers and the system settings to train the random forest model with multithreaded parallelization and find the optimal parameter combination, thereby achieving faster training. After the random forest model is trained, the feature importance evaluation module obtains the importance of each feature, selects features in descending order of importance, and trains and tunes again to obtain a random forest model with better generalization and stability; it also constructs a risk assessment system for the consumption credit industry and its index weights according to the feature importance.
Further, the model evaluation and selection module evaluates the overall effect of the random forest model through KS and AUC and compares it with the results of models such as LR, SVM, GBDT and XGBoost built on the initial modeling data. The model deployment monitoring module deploys the final random forest model to the system platform, monitors indexes such as the IV and mean of the variables, the PSI of the model distribution, KS and AUC, and updates, adjusts and optimizes the model.
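The PSI monitored after deployment can be sketched as follows. The bin edges, the small floor applied to empty bins and the function name `psi` are assumptions for illustration; the patent does not give a formula, so the commonly used definition, the sum over bins of (actual share - expected share) * ln(actual share / expected share), is assumed here.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], edges: List[float]) -> float:
    """Population Stability Index between a baseline and a current sample."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of (ascending) edges that v reaches or exceeds.
            counts[sum(v >= e for e in edges)] += 1
        # Floor at a small value so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of 0 means the two score distributions fall into the bins identically; larger values indicate that the population scored in production has drifted away from the modeling sample.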
The invention has the beneficial effects that:
1. Data preprocessing is simple: a random forest handles sparse data effectively, treating sparse values as a category of their own without any filling. As a tree model, a random forest can process various data types, both continuous and discrete, and the data needs no standardization or normalization. The combination of bootstrap aggregation (bagging) and random feature selection at each split makes it more tolerant of outliers and noisy data. In addition, decision trees built with the random forest algorithm can predict and impute missing values, supplementing the latent information behind them well.
2. Feature engineering and model training are efficient: each decision tree of a random forest can be generated independently and simultaneously, so the method is easy to parallelize heavily and to train in a distributed manner. It can process thousands of high-dimensional variables without prior feature selection, meeting the speed and efficiency requirements for training on high-dimensional data in the big data era.
3. Model accuracy is high: bagging and random feature selection at each split introduce two kinds of randomness, random samples and random features. Modeling obtains an unbiased estimate of the true error without sacrificing training data. When the modeling samples are class-imbalanced, the random forest offers an effective way to balance the errors across the data set. Through column sampling (feature sampling), the random forest reduces the correlation among the sub-models and thus the variance of the ensemble; by reducing the variance of the weak classifiers it improves the generalization error and thereby the accuracy of the ensemble model.
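The error estimate obtained without setting training data aside can be illustrated with scikit-learn's out-of-bag (OOB) score, sketched below on synthetic data. Reading the patent's claim as an OOB estimate, and using `class_weight="balanced"` as one possible way of handling class imbalance, are assumptions on the editor's part.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for a credit data set (assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,            # score each sample with trees that never saw it
    class_weight="balanced",   # reweight classes inversely to their frequency
    random_state=0,
)
forest.fit(X, y)

# Accuracy estimated only from out-of-bag predictions.
oob_accuracy = forest.oob_score_
```

Because each tree is trained on a bootstrap sample, roughly a third of the rows are left out of any given tree, and those rows serve as its validation set.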
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of a risk assessment method for a consumption credit scene based on a random forest algorithm according to an embodiment of the present invention;
FIG. 2 is a flow chart of a risk assessment method for a consumption credit scene based on a random forest algorithm according to an embodiment of the present invention.
In the figure:
1. an information acquisition module; 2. a data preprocessing module; 3. a feature engineering module; 4. a model training and parameter adjusting module; 5. a feature importance evaluation module; 6. a model evaluation and selection module; 7. a model deployment monitoring module.
Detailed Description
For further explanation of the embodiments, the invention is provided with drawings. These drawings form part of the disclosure; they mainly illustrate the embodiments and, together with the description, serve to explain their principles of operation. With reference to them, those of ordinary skill in the art will understand other possible embodiments and the advantages of the invention. Components in the drawings are not drawn to scale, and like reference numerals generally denote like components.
According to the embodiment of the invention, a risk assessment method for a consumption credit scene based on a random forest algorithm is provided.
The first embodiment is as follows:
As shown in FIGS. 1-2, the risk assessment method for a consumption credit scene based on a random forest algorithm according to an embodiment of the present invention includes an information acquisition module 1, a data preprocessing module 2, a feature engineering module 3, a model training and parameter adjusting module 4, a feature importance assessment module 5, a model evaluation and selection module 6, and a model deployment monitoring module 7, where the information acquisition module 1 is connected to the feature engineering module 3 through the data preprocessing module 2, the feature engineering module 3 is connected to the feature importance assessment module 5 through the model training and parameter adjusting module 4, and the feature importance assessment module 5 is connected to the model deployment monitoring module 7 through the model evaluation and selection module 6.
In one embodiment, the risk assessment method for the consumption credit scene based on the random forest algorithm comprises the following steps:
Step S101, obtaining modeling data: modeling sample customers are randomly extracted from the company business system by application month. For imbalanced samples with insufficient performance (the bad-debt rate of recent applicants is noticeably lower than that of earlier applicants), modeling customers are obtained through SMOTE (Synthetic Minority Oversampling Technique). Using the modeling customer number as the primary key, the application data, credit investigation data, APP event-tracking (buried point) data and customer-authorized third-party data are joined and combined into the modeling data set;
Step S103, data preprocessing: the uniqueness of the user number and the completeness of the samples are checked as the quality standard for the sample data. The variables of the modeling samples are analyzed statistically: a distribution plot shows each variable's range visually, and the mean, quantiles, outliers and missing values of each variable are counted. If a variable with high business relevance has a high missing rate, decision trees built with the random forest algorithm can predict and impute the missing values, which helps fill in the missing variable information effectively. Sparser variables can be clustered with the K-Means algorithm, and clustering sparse variables facilitates the subsequent feature engineering;
Step S105, feature engineering: after preprocessing, the raw data undergoes feature processing, and more predictive and explanatory variables are generally obtained by constructing derived variables. Common feature derivation methods include counts, sums, ratios, time differences and volatility, which dig out more useful variables. Two variables linked by business logic can also be added, subtracted, multiplied or divided to generate derived variables. The result is a higher-dimensional wide feature table, and feature selection can be performed during random forest model training and optimization;
Step S107, model training and parameter adjustment: model training and parameter adjustment are carried out with the random forest classifier (RandomForestClassifier) in the sklearn module of Python;
Step S109, feature importance evaluation and feature selection: unlike traditional scoring card models, the random forest model can output the importance of each feature. The importance is the normalized importance value of each feature; the higher a feature's importance, the better it matches the prediction function. Feature importance evaluation with random forests is implemented in sklearn: after the random forest model is trained, the importance of each feature is obtained directly from feature_importances_. The features are sorted in descending order, and the TOP500 or TOP100 features by importance are selected, depending on the total number of features in the sample, to obtain a new feature subset. Training and parameter adjustment are then carried out again to obtain a random forest model with better generalization and stability. A risk assessment system for the consumption credit industry, together with its index weights, can be constructed from the feature importance, so that the credit score and overdue risk level of a customer are assessed;
Step S111, model evaluation and selection: the overall effect of the random forest model is evaluated through KS and AUC. The KS value reflects whether the model is accurate and whether it discriminates sufficiently between good and bad customers; the AUC value allows model quality to be evaluated reliably when the samples are imbalanced. The accuracy and stability of the random forest model are evaluated comprehensively by comparison with LR, SVM, GBDT, XGBoost and other traditional scoring card models;
Step S113, model deployment monitoring: after a random forest model with an optimal parameter combination is obtained through repeated training with grid-search parameter adjustment and five-fold cross-validation, the model is deployed to the system platform. The model is then updated, adjusted and optimized by monitoring indexes such as the IV and mean of the variables, the PSI (Population Stability Index) of the model distribution, KS (Kolmogorov-Smirnov statistic) and AUC (Area Under the ROC Curve).
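The oversampling idea behind SMOTE in step S101 can be illustrated with a simplified NumPy sketch. It is not the reference SMOTE implementation and not the patent's code: the function name `smote_sketch`, the neighbour count `k` and the toy minority sample are assumptions, and `k` must be smaller than the number of minority rows.

```python
import numpy as np


def smote_sketch(minority: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority rows by interpolating toward near neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base row, one of its k nearest neighbours,
    # and a random position on the segment between them.
    base = rng.integers(0, len(minority), n_new)
    picked = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])
```

Each synthetic row lies on the line segment between a real bad-customer sample and one of its nearest minority neighbours, which raises the share of the minority class without duplicating rows exactly.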
In one embodiment, the random forest classifier (RandomForestClassifier) includes the following model parameters:
number of sub-models (n_estimators): this is related to the complexity of the random forest model. In theory, the more sub-models there are, the more stable the result, but the computational cost rises sharply, and once n_estimators reaches a certain number the performance gain is small; in practice a moderate value is therefore generally chosen by parameter tuning;
maximum number of features per decision tree (max_features): specifies the maximum number of features randomly considered by a single tree at each split. The smaller max_features is, the smaller the overall variance of the model, which improves accuracy; in random forest classification problems, max_features is generally set to the square root (sqrt) of the total number of features;
maximum depth of tree (max_depth): by experience this is generally set to None (that is, unlimited), so that all features are considered during splitting and the tree grows fully;
minimum number of samples for splitting a node (min_samples_split): limits the condition under which a subtree continues to split; min_samples_split is the minimum number of samples a tree node must hold to be split further. The default value in scikit-learn is 2, and if there are many samples and features, min_samples_split can be increased appropriately for computational convenience;
maximum number of leaf nodes (max_leaf_nodes): limiting the maximum number of leaf nodes prevents overfitting. The default is None (that is, the number of leaf nodes is unlimited); if the model has many features, a specific limit can be chosen through cross-validation;
minimum number of samples per leaf node (min_samples_leaf): with the default value of 1, each decision tree is grown fully, that is, a leaf may contain only a single sample. If the number of samples in a leaf is less than this threshold, the leaf and its sibling node are pruned together. If the sample size is very large, increasing this value is recommended;
minimum weight fraction per leaf node (min_weight_fraction_leaf): limits the minimum value of the sum of the weights of all samples in a leaf node; if the sum is below this threshold, the leaf and its sibling node are pruned together. The default value is 0, that is, weights are not considered. In general, min_weight_fraction_leaf is adjusted when the samples have many missing values or the sample distribution is strongly skewed; introducing sample weights makes the model more robust to missing and imbalanced data.
In one embodiment, when tuning a single tree, adjusting max_leaf_nodes or max_depth changes the structure of the tree at a coarse granularity: the more leaf nodes or the deeper the tree, the lower the bias and the higher the variance of the sub-model. Adjusting min_samples_split, min_samples_leaf and min_weight_fraction_leaf changes the structure of the tree at a fine granularity: the fewer samples required for a split or for a leaf node, the more complex the sub-model; tuning these parameters balances bias against variance so that a more accurate and efficient model is obtained. In addition, general parameters of the model can be set, including warm_start (warm start; the scikit-learn default is False), n_jobs (number of jobs run in parallel) and criterion (split criterion; the default is the Gini index 'gini', as used in the CART algorithm); n_jobs in particular enables multi-threaded parallel training of the random forest model and therefore faster training.
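The coarse and fine tuning described above can be sketched as a small grid search with five-fold cross-validation. This is an illustration under assumptions: the grid values, the synthetic data and the choice of ROC AUC as the scoring metric are hypothetical, and a real search would cover a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           weights=[0.85, 0.15], random_state=1)

param_grid = {
    "max_depth": [4, None],       # coarse-grained control of tree structure
    "min_samples_leaf": [1, 10],  # fine-grained control of tree structure
}

search = GridSearchCV(
    # n_jobs=2 builds the trees of each forest in parallel.
    RandomForestClassifier(n_estimators=30, n_jobs=2, random_state=1),
    param_grid,
    cv=5,               # five-fold cross-validation
    scoring="roc_auc",  # rank parameter combinations by AUC
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
```

After fitting, search.best_estimator_ holds a forest refitted on all the data with the best parameter combination.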
In one embodiment, the information acquisition module 1 extracts modeling sample customers from the company's business system and obtains customer application data, in-app event-tracking (buried-point) data and authorized third-party data, which together form the initial modeling data; the data preprocessing module 2 computes descriptive statistics for the data variables, imputes missing values with the random forest algorithm and clusters sparse variables with the K-Means algorithm; and the feature engineering module 3 constructs derived variables from business logic and the relationships between variables to generate a higher-dimensional wide feature table, on which feature selection can be performed during random forest model training.
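The missing-value step can be sketched as follows: a random forest is trained on the rows where a variable is observed and then predicts the variable for the rows where it is missing. This is one plausible reading of the preprocessing step, not the applicant's code; the data are synthetic and the names X, target_with_gaps and filled are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Three fully observed variables and one variable that depends on them (synthetic).
X = rng.normal(size=(200, 3))
target = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 20% of the values to simulate a high missing rate.
target_with_gaps = target.copy()
target_with_gaps[rng.random(200) < 0.2] = np.nan

observed = ~np.isnan(target_with_gaps)

# Fit on rows where the variable is observed, predict it where it is missing.
imputer = RandomForestRegressor(n_estimators=50, random_state=0)
imputer.fit(X[observed], target_with_gaps[observed])

filled = target_with_gaps.copy()
filled[~observed] = imputer.predict(X[~observed])
```

For a categorical variable the same pattern applies with RandomForestClassifier in place of the regressor.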
In one embodiment, the model training and parameter adjusting module 4 tunes the random forest ensemble parameters, the weak classifier (single-tree) parameters and the system settings, so that the random forest model is trained in parallel across multiple threads with the optimal parameter combination, which speeds up training; the feature importance evaluation module 5 obtains the importance of each feature after the random forest model is trained, selects features in descending order of importance, and retrains and retunes the model to obtain a random forest model with better generalization and stability; a risk assessment system for the consumption credit industry and its index weights are then constructed from the feature importances.
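The importance-based selection loop can be sketched with scikit-learn's feature_importances_ attribute. The data are synthetic, top_k=10 is an arbitrary illustrative cut-off, and deriving index weights by renormalizing the selected importances is an assumption about how such weights might be computed rather than something the text specifies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wide feature table (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=6,
                           random_state=2)

# First pass: train on all features and read the normalized importances.
full_model = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)
importances = full_model.feature_importances_

# Sort features by descending importance and keep the top k.
top_k = 10
ranked = np.argsort(importances)[::-1]
selected = ranked[:top_k]

# Second pass: retrain on the reduced feature subset.
reduced_model = RandomForestClassifier(n_estimators=100, random_state=2)
reduced_model.fit(X[:, selected], y)

# Hypothetical index weights: importances of the kept features, renormalized to sum to 1.
index_weights = importances[selected] / importances[selected].sum()
```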
In one embodiment, the model evaluation and selection module 6 evaluates the overall performance of the random forest model using KS and AUC and compares it with the results of models such as LR, SVM, GBDT and XGBoost trained on the same initial modeling data; the model deployment monitoring module 7 deploys the final random forest model to the system platform, monitors indicators such as the IV and mean of each variable and the PSI, KS and AUC of the model score distribution, and updates, adjusts and optimizes the model accordingly.
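The evaluation and monitoring metrics named above can be written out in a few lines of plain Python. This is a sketch using the standard textbook definitions of KS, AUC and PSI; the function names, the toy scores and the bin edges are hypothetical, and production code would normally use a library implementation.

```python
import math

def ks_statistic(y_true, scores):
    """Largest gap between the cumulative bad rate and good rate over score cut-offs."""
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Samples sharing a score cross the cut-off together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks

def auc(y_true, scores):
    """Chance that a random bad customer scores above a random good one (ties count half)."""
    bad = [s for s, y in zip(scores, y_true) if y == 1]
    good = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

def psi(expected, actual, edges):
    """Population Stability Index of two score samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]
    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Toy data: label 1 marks a bad customer, a higher score means higher predicted risk.
labels = [1, 1, 0, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
ks_value = ks_statistic(labels, model_scores)
auc_value = auc(labels, model_scores)

baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
recent = [0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 1.0]
drift = psi(baseline, recent, edges=[0.25, 0.5, 0.75])
```

A PSI of zero means the two score distributions fall into the bins in identical proportions; the value grows as the monitored distribution drifts away from the baseline.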
In summary, the random forest is a relatively advanced ensemble algorithm that combines bagging (bootstrap aggregating) with splitting on randomly selected features. It has relatively low variance and bias and good generalization performance, can improve prediction accuracy without significantly increasing the amount of computation, can handle complex, high-dimensional, sparse data, and overcomes the insufficient stability, accuracy and generalization of the single model used in a traditional scorecard. It has the following advantages:
Data preprocessing is simple: the random forest handles sparse data effectively, treating sparse values as a category of their own without filling; as a tree model it handles continuous, discrete and other data types and needs no normalization or standardization of the data; the combination of bagging and random feature selection at each split makes it more tolerant of outliers and noisy data; in addition, decision trees built with the random forest algorithm can predict and impute missing values, which recovers much of the information latent in the missing data.
Feature engineering and model training are efficient: the decision trees of a random forest can be generated independently and simultaneously, so the algorithm is easy to parallelize and to train in a distributed fashion; it can handle thousands of high-dimensional variables without prior feature selection, which meets the training speed and efficiency requirements for high-dimensional data in the current big data era.
Model accuracy is high: bagging and random feature selection introduce two sources of randomness, random samples and random features, so that an unbiased estimate of the true error is obtained without giving up any training data; when the classes in the modeling sample are imbalanced, the random forest provides an effective way to balance the errors across the data set; column sampling (feature sampling) reduces the correlation between sub-models and thereby the variance of the ensemble; by reducing the variance contributed by the weak classifiers, the random forest lowers the generalization error and thus improves the accuracy of the ensemble model.
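The error estimate obtained without giving up training data is most naturally read as the out-of-bag (OOB) estimate, in which each tree is scored on the samples left out of its bootstrap draw. The sketch below shows that reading with scikit-learn; treating the text as referring to OOB, using class_weight="balanced" for the imbalance handling, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a 9:1 class imbalance (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=3)

model = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",      # random feature subset at each split
    oob_score=True,           # score each sample with the trees that never saw it
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=3,
)
model.fit(X, y)

# Accuracy estimated from out-of-bag samples only; no separate hold-out set is needed.
oob_accuracy = model.oob_score_
```

On imbalanced data, plain accuracy can look high even for a weak model, so the OOB class probabilities in model.oob_decision_function_ are usually more informative when combined with KS or AUC.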
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. The risk assessment method for the credit consumption scene based on the random forest algorithm is characterized by comprising an information acquisition module (1), a data preprocessing module (2), a feature engineering module (3), a model training and parameter adjusting module (4), a feature importance assessment module (5), a model evaluation and selection module (6) and a model deployment monitoring module (7), wherein the information acquisition module (1) is connected with the feature engineering module (3) through the data preprocessing module (2), the feature engineering module (3) is connected with the feature importance assessment module (5) through the model training and parameter adjusting module (4), and the feature importance assessment module (5) is connected with the model deployment monitoring module (7) through the model evaluation and selection module (6).
2. The risk assessment method for a consumption credit scenario based on the random forest algorithm, characterized in that the method comprises the following steps:
obtaining modeling data: modeling sample customers are randomly drawn from the company's business system by application month; for recent applicants whose performance period is still too short, so that the sample is imbalanced (the bad debt rate is noticeably lower than that of earlier customers), the modeling customers are obtained with SMOTE (Synthetic Minority Over-sampling Technique); using the modeling customer number as the primary key, the application data, credit investigation data, in-app event-tracking (buried-point) data and customer-authorized third-party data are joined and merged into a modeling data set;
data preprocessing: the uniqueness of the user number and the completeness of the sample are checked as the sample data quality standard; the variables of the modeling sample are analysed statistically, with distribution plots used to show each variable's range visually and with the mean, quantiles, outliers and missing values of each variable counted; if a variable with high business relevance has a high missing rate, decision trees built with the random forest algorithm can be used to predict and impute its missing values, which effectively helps to fill in the missing information; sparser variables can be clustered with the K-Means algorithm, which facilitates the subsequent feature engineering of those variables;
feature engineering: after preprocessing, the raw data undergoes feature processing, in which more predictive and more interpretable variables are generally obtained by constructing derived variables; common derivation methods include counts, sums, proportions, time differences and volatility, which mine additional useful variables; two variables linked by business logic can also be combined by addition, subtraction, multiplication, division and similar operations to generate derived variables; the result is a higher-dimensional wide feature table, on which feature selection can be performed during random forest model training and tuning;
model training and parameter adjustment: model training and parameter adjustment are carried out with the random forest classifier (RandomForestClassifier) in the sklearn module of Python;
feature importance assessment and feature selection: unlike traditional scorecard models, the random forest model can output the importance of each feature, which is the normalized importance value of that feature; the higher the importance, the more the feature contributes to the prediction function; feature importance evaluation with random forests is implemented in sklearn, and after the random forest model is trained the importance of each feature can be obtained directly from feature_importances_; the features are sorted in descending order of importance, and the TOP 500 or TOP 100 features are selected, depending on the total number of features in the sample, to form a new feature subset; training and parameter adjustment are then repeated to obtain a random forest model with better generalization and stability; a risk assessment system for the consumption credit industry and its index weights can be constructed from the feature importances, so as to assess a customer's credit score and overdue risk level;
model evaluation and selection: the overall performance of the random forest model is evaluated with KS and AUC; the KS value reflects how accurate the model is and whether it discriminates sufficiently between good and bad customers; the AUC value allows the quality of the model to be assessed reliably even when the sample is imbalanced; the accuracy and stability of the random forest model are evaluated comprehensively by comparison with LR, SVM, GBDT, XGBoost and other traditional scorecard models;
model deployment and monitoring: after a random forest model with the optimal parameter combination has been obtained through repeated training with grid-search parameter tuning and five-fold cross-validation, the model is deployed to the system platform; the model is then updated, adjusted and optimized by monitoring indicators such as the IV and mean of each variable, the PSI (Population Stability Index) of the model score distribution, KS (Kolmogorov-Smirnov statistic) and AUC (Area Under the ROC Curve).
3. The risk assessment method for a consumption credit scenario based on the random forest algorithm according to claim 2, characterized in that the random forest classifier (RandomForestClassifier) comprises the following model parameters:
number of sub-models (n_estimators): this is related to the complexity of the random forest model; in theory, the more sub-models there are, the more stable the result, but the amount of computation also grows considerably, and once n_estimators reaches a certain size the performance of the model improves only marginally, so in practice a moderate value is generally chosen by tuning;
maximum number of features per decision tree (max_features): specifies the maximum number of features randomly drawn as candidates when a single tree splits a node; the smaller max_features is, the less correlated the trees are and the lower the overall variance of the model, which can improve accuracy, although each individual tree becomes more biased; for random forest classification, max_features is generally set to the square root of the total number of features (sqrt);
maximum depth of tree (max_depth): by experience this is generally set to None (that is, unlimited), so that each tree grows fully until its other stopping conditions are met;
minimum number of samples for splitting a node (min_samples_split): restricts the condition under which a subtree continues to split; min_samples_split is the minimum number of samples a node must contain before it can be split further; the default value in scikit-learn is 2, and if there are many samples and features, min_samples_split can be increased to reduce computation;
maximum leaf node number (max_leaf_nodes): limiting the maximum number of leaf nodes helps to prevent overfitting; the default is None (that is, the number of leaf nodes is not limited), and if the model has many features, a specific limit can be chosen through cross-validation;
leaf node minimum sample number (min_samples_leaf): the minimum number of samples a leaf node must contain; if a leaf would contain fewer samples than this threshold, it is pruned together with its sibling node; the default value is 1, under which each decision tree can grow fully until a leaf contains a single sample, and if the sample size is very large, a larger value is recommended;
leaf node minimum weight total (min_weight_fraction_leaf): limits the minimum sum of the sample weights in a leaf node; if the sum is below this threshold, the leaf is pruned together with its sibling node; the default value is 0, that is, sample weights are not considered; in general, min_weight_fraction_leaf is adjusted when the sample has many missing values or its distribution is strongly skewed, because introducing sample weights makes the model more robust to missing and imbalanced data.
4. The risk assessment method for the consumption credit scenario based on the random forest algorithm according to claim 2, wherein, when tuning a single tree, adjusting max_leaf_nodes or max_depth changes the structure of the tree at a coarse granularity: the more leaf nodes or the deeper the tree, the lower the bias and the higher the variance of the sub-model; adjusting min_samples_split, min_samples_leaf and min_weight_fraction_leaf changes the structure of the tree at a fine granularity: the fewer samples required for a split or for a leaf node, the more complex the sub-model; tuning these parameters balances bias against variance so that a more accurate and efficient model is obtained. In addition, by setting general parameters of the model, including warm_start (warm start; the scikit-learn default is False), n_jobs (number of jobs run in parallel) and criterion (split criterion; the default is the Gini index 'gini', as used in the CART algorithm), multi-threaded parallel training of the random forest model can be realized and faster training achieved.
5. The risk assessment method for the consumption credit scenario based on the random forest algorithm, characterized in that the information acquisition module (1) extracts modeling sample customers from the company's business system and obtains customer application data, in-app event-tracking (buried-point) data and authorized third-party data, which together form the initial modeling data; the data preprocessing module (2) computes descriptive statistics for the data variables, imputes missing values with the random forest algorithm and clusters sparse variables with the K-Means algorithm; and the feature engineering module (3) constructs derived variables from business logic and the relationships between variables to generate a higher-dimensional wide feature table, on which feature selection can be performed during random forest model training.
6. The risk assessment method for the consumption credit scenario based on the random forest algorithm according to claim 1, characterized in that the model training and parameter adjusting module (4) tunes the random forest ensemble parameters, the weak classifier (single-tree) parameters and the system settings, so that the random forest model is trained in parallel across multiple threads with the optimal parameter combination, which speeds up training; the feature importance evaluation module (5) obtains the importance of each feature after the random forest model is trained, selects features in descending order of importance, and retrains and retunes the model to obtain a random forest model with better generalization and stability; and a risk assessment system for the consumption credit industry and its index weights are constructed from the feature importances.
7. The risk assessment method for the consumption credit scenario based on the random forest algorithm according to claim 1, wherein the model evaluation and selection module (6) evaluates the overall performance of the random forest model using KS and AUC and compares it with the results of models such as LR, SVM, GBDT and XGBoost trained on the same initial modeling data; and the model deployment monitoring module (7) deploys the final random forest model to the system platform, monitors indicators such as the IV and mean of each variable and the PSI, KS and AUC of the model score distribution, and updates, adjusts and optimizes the model accordingly.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010784787.0A CN112037009A (en) | 2020-08-06 | 2020-08-06 | Risk assessment method for consumption credit scene based on random forest algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112037009A (en) | 2020-12-04 |
Family
ID=73582353
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010784787.0A Withdrawn CN112037009A (en) | 2020-08-06 | 2020-08-06 | Risk assessment method for consumption credit scene based on random forest algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112037009A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112200272A (en) * | 2020-12-07 | 2021-01-08 | 上海冰鉴信息科技有限公司 | Service classification method and device |
CN112419050A (en) * | 2020-12-24 | 2021-02-26 | 浙江工商大学 | Credit evaluation method and device based on telephone communication network and social behavior |
CN112561376A (en) * | 2020-12-23 | 2021-03-26 | 北京橙色云科技有限公司 | Method and device for splitting project and storage medium |
CN112785418A (en) * | 2021-01-22 | 2021-05-11 | 深圳前海微众银行股份有限公司 | Credit risk modeling method, credit risk modeling device, credit risk modeling equipment and computer readable storage medium |
CN112801563A (en) * | 2021-04-14 | 2021-05-14 | 支付宝(杭州)信息技术有限公司 | Risk assessment method and device |
CN112862594A (en) * | 2021-02-01 | 2021-05-28 | 深圳无域科技技术有限公司 | Financial risk control method, system, device and computer readable medium |
CN112907359A (en) * | 2021-03-24 | 2021-06-04 | 四川奇力韦创新科技有限公司 | Bank loan business qualification auditing and risk control system and method |
CN113409139A (en) * | 2021-07-27 | 2021-09-17 | 深圳前海微众银行股份有限公司 | Credit risk identification method, apparatus, device, and program |
CN113610366A (en) * | 2021-07-23 | 2021-11-05 | 上海淇玥信息技术有限公司 | Risk warning generation method and device and electronic equipment |
CN113705904A (en) * | 2021-08-31 | 2021-11-26 | 国网上海市电力公司 | Chemical plant area power utilization fault prediction method based on random forest algorithm |
CN115409613A (en) * | 2022-09-13 | 2022-11-29 | 中债金科信息技术有限公司 | Bond risk detection model training method and bond risk detection method |
CN115953239A (en) * | 2023-03-15 | 2023-04-11 | 无锡锡商银行股份有限公司 | Surface examination video scene evaluation method based on multi-frequency flow network model |
CN115993444A (en) * | 2022-12-19 | 2023-04-21 | 郑州大学 | Dual-color immunofluorescence detection method for human serum cerebrospinal fluid GFAP antibody |
CN116702052A (en) * | 2023-08-02 | 2023-09-05 | 云南香农信息技术有限公司 | Community social credit system information processing system and method |
CN116862643A (en) * | 2023-06-25 | 2023-10-10 | 福建润楼数字科技有限公司 | Automatic wind control feature screening method for multi-channel fund integration credit business |
CN117150389A (en) * | 2023-07-14 | 2023-12-01 | 广州易尊网络科技股份有限公司 | Model training method, carrier card activation prediction method and equipment thereof |
CN117171533A (en) * | 2023-11-02 | 2023-12-05 | 山东省国土测绘院 | Real-time acquisition and processing method and system for geographical mapping operation data |
CN117370827A (en) * | 2023-12-07 | 2024-01-09 | 飞特质科(北京)计量检测技术有限公司 | Fan quality grade assessment method based on deep clustering model |
CN117874654A (en) * | 2024-03-13 | 2024-04-12 | 杭州小策科技有限公司 | Risk monitoring method and system based on random forest algorithm |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112200272A (en) * | 2020-12-07 | 2021-01-08 | 上海冰鉴信息科技有限公司 | Service classification method and device |
CN112561376A (en) * | 2020-12-23 | 2021-03-26 | 北京橙色云科技有限公司 | Method and device for splitting project and storage medium |
CN112419050B (en) * | 2020-12-24 | 2022-05-24 | 浙江工商大学 | Credit evaluation method and device based on telephone communication network and social behavior |
CN112419050A (en) * | 2020-12-24 | 2021-02-26 | 浙江工商大学 | Credit evaluation method and device based on telephone communication network and social behavior |
CN112785418A (en) * | 2021-01-22 | 2021-05-11 | 深圳前海微众银行股份有限公司 | Credit risk modeling method, credit risk modeling device, credit risk modeling equipment and computer readable storage medium |
CN112785418B (en) * | 2021-01-22 | 2024-02-06 | 深圳前海微众银行股份有限公司 | Credit risk modeling method, apparatus, device and computer readable storage medium |
CN112862594A (en) * | 2021-02-01 | 2021-05-28 | 深圳无域科技技术有限公司 | Financial risk control method, system, device and computer readable medium |
CN112907359A (en) * | 2021-03-24 | 2021-06-04 | 四川奇力韦创新科技有限公司 | Bank loan business qualification auditing and risk control system and method |
CN112907359B (en) * | 2021-03-24 | 2024-03-12 | 四川奇力韦创新科技有限公司 | Bank loan business qualification auditing and risk control system and method |
CN112801563B (en) * | 2021-04-14 | 2021-08-17 | 支付宝(杭州)信息技术有限公司 | Risk assessment method and device |
CN112801563A (en) * | 2021-04-14 | 2021-05-14 | 支付宝(杭州)信息技术有限公司 | Risk assessment method and device |
CN113610366A (en) * | 2021-07-23 | 2021-11-05 | 上海淇玥信息技术有限公司 | Risk warning generation method and device and electronic equipment |
CN113409139A (en) * | 2021-07-27 | 2021-09-17 | 深圳前海微众银行股份有限公司 | Credit risk identification method, apparatus, device, and program |
CN113705904A (en) * | 2021-08-31 | 2021-11-26 | 国网上海市电力公司 | Chemical plant area power utilization fault prediction method based on random forest algorithm |
CN115409613A (en) * | 2022-09-13 | 2022-11-29 | 中债金科信息技术有限公司 | Bond risk detection model training method and bond risk detection method |
CN115993444A (en) * | 2022-12-19 | 2023-04-21 | 郑州大学 | Dual-color immunofluorescence detection method for human serum cerebrospinal fluid GFAP antibody |
CN115953239A (en) * | 2023-03-15 | 2023-04-11 | 无锡锡商银行股份有限公司 | Surface examination video scene evaluation method based on multi-frequency flow network model |
CN116862643A (en) * | 2023-06-25 | 2023-10-10 | 福建润楼数字科技有限公司 | Automatic wind control feature screening method for multi-channel fund integration credit business |
CN117150389A (en) * | 2023-07-14 | 2023-12-01 | 广州易尊网络科技股份有限公司 | Model training method, carrier card activation prediction method and equipment thereof |
CN117150389B (en) * | 2023-07-14 | 2024-04-12 | 广州易尊网络科技股份有限公司 | Model training method, carrier card activation prediction method and equipment thereof |
CN116702052A (en) * | 2023-08-02 | 2023-09-05 | 云南香农信息技术有限公司 | Community social credit system information processing system and method |
CN116702052B (en) * | 2023-08-02 | 2023-10-27 | 云南香农信息技术有限公司 | Community social credit system information processing system and method |
CN117171533A (en) * | 2023-11-02 | 2023-12-05 | 山东省国土测绘院 | Real-time acquisition and processing method and system for geographical mapping operation data |
CN117171533B (en) * | 2023-11-02 | 2024-01-16 | 山东省国土测绘院 | Real-time acquisition and processing method and system for geographical mapping operation data |
CN117370827A (en) * | 2023-12-07 | 2024-01-09 | 飞特质科(北京)计量检测技术有限公司 | Fan quality grade assessment method based on deep clustering model |
CN117874654A (en) * | 2024-03-13 | 2024-04-12 | 杭州小策科技有限公司 | Risk monitoring method and system based on random forest algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112037009A (en) | Risk assessment method for consumption credit scene based on random forest algorithm | |
CN108898479B (en) | Credit evaluation model construction method and device | |
Isa et al. | Using the self organizing map for clustering of text documents | |
Fan et al. | Robust deep auto-encoding Gaussian process regression for unsupervised anomaly detection | |
CN111311401A (en) | Financial default probability prediction model based on LightGBM | |
CN108898154A (en) | A kind of electric load SOM-FCM Hierarchical clustering methods | |
CN112069310B (en) | Text classification method and system based on active learning strategy | |
CN110826618A (en) | Personal credit risk assessment method based on random forest | |
Pandey et al. | An analysis of machine learning techniques (J48 & AdaBoost)-for classification | |
CN112001788B (en) | Credit card illegal fraud identification method based on RF-DBSCAN algorithm | |
CN111488917A (en) | Garbage image fine-grained classification method based on incremental learning | |
CN104702465A (en) | Parallel network flow classification method | |
CN110717610A (en) | Wind power prediction method based on data mining | |
CN110826617A (en) | Situation element classification method and training method and device of model thereof, and server | |
CN111967971A (en) | Bank client data processing method and device | |
AU2018101531A4 (en) | Stock forecast model based on text news by random forest | |
CN115048988A (en) | Unbalanced data set classification fusion method based on Gaussian mixture model | |
CN113139570A (en) | Dam safety monitoring data completion method based on optimal hybrid valuation | |
CN114463036A (en) | Information processing method and device and storage medium | |
Zhu et al. | Loan default prediction based on convolutional neural network and LightGBM | |
CN111797899B (en) | Low-voltage transformer area kmeans clustering method and system | |
CN113239199A (en) | Credit classification method based on multi-party data set | |
CN115641177B (en) | Second-prevention killing pre-judging system based on machine learning | |
CN116933947A (en) | Landslide susceptibility prediction method based on soft voting integrated classifier | |
Zhang et al. | Credit risk control algorithm based on stacking ensemble learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201204 |