CN111951097A - Enterprise credit risk assessment method, device, equipment and storage medium - Google Patents

Info

Publication number
CN111951097A
Authority
CN
China
Prior art keywords
variable
data
sample
model
enterprise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010805252.7A
Other languages
Chinese (zh)
Inventor
许卫
温水根
何志坚
薛永营
赵彦晖
耿心伟
曾源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Weizhong Credit Technology Co ltd
Original Assignee
Shenzhen Weizhong Credit Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Weizhong Credit Technology Co ltd
Priority to CN202010805252.7A
Publication of CN111951097A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/123 Tax preparation or submission

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The application discloses an enterprise credit risk assessment method. By receiving enterprise tax data, the method quantifies the operating credit risk of an enterprise from the tax data dimension, laying a foundation for accurate assessment of enterprise credit risk. The small and micro enterprise credit risk model called by the method is built on the XGBoost algorithm, which preserves the model's ability to learn feature crosses among weak variables. During training, based on analysis of tax sample data and after variable preprocessing, the variable stability and model stability of the sample variable data are used as evaluation indexes to screen the model-entry variables. This filters out the influence of abnormal sample variables on model training, alleviates the overfitting that arises when the small and micro enterprise credit risk model adopts the XGBoost algorithm, and improves the accuracy of enterprise credit risk assessment. The application also provides an enterprise credit risk assessment apparatus, device, and readable storage medium, which have the same beneficial effects.

Description

Enterprise credit risk assessment method, device, equipment and storage medium
Technical Field
The present application relates to the technical field of enterprise credit risk assessment, and in particular, to a method, an apparatus, and a device for evaluating enterprise credit risk, and a readable storage medium.
Background
The wide application of big data and internet technology has had a profound influence on China's financial ecosystem and has provided a new platform and channel for the financing of small and micro enterprises; the innovative application of big data technology in internet finance creates more possibilities for developing financial services for small and micro enterprises.
An enterprise credit investigation system can address information asymmetry, reduce information and transaction costs, and thereby mitigate adverse selection. Such a system can collect and process transaction information efficiently and at scale, reduce uncertainty in the transaction process as far as possible, lower the information costs of banks, and improve the quality of bank loans. It also makes the risk of small and medium-sized enterprises more transparent, which increases their financing opportunities. In addition, an enterprise credit investigation system can form an enterprise operating risk constraint mechanism: because the system gives enterprises a platform to display their operating risk level and credit, enterprises spontaneously constrain themselves and tend to disclose real information, which ultimately forms a socially recognized credit transaction mechanism.
At present, credit risk models are generally traditional logistic regression models. Although logistic regression has good business interpretability, such a model cannot learn the feature crosses of some weak variables, so in the internet era more and more machine learning algorithms are being applied to small and micro enterprise credit risk models.
At present, small and micro enterprise credit risk models usually adopt XGBoost (eXtreme Gradient Boosting), an ensemble learning method, for data processing. Because risk control modeling for small and micro enterprises involves few samples and complex enterprise types, a model built with XGBoost is prone to overfitting; an overfitted model generalizes poorly, which reduces its identification accuracy.
Therefore, how to preserve the feature-crossing capability of weak variables while avoiding the effect of model overfitting on identification accuracy is an urgent problem for those skilled in the art.
Disclosure of Invention
An object of the present application is to provide an enterprise credit risk assessment method that preserves the feature-crossing capability of weak variables while avoiding the effect of model overfitting on identification accuracy; another object of the present application is to provide an enterprise credit risk assessment apparatus, device, and readable storage medium.
In order to solve the above technical problem, the present application provides an enterprise credit risk assessment method, including:
receiving enterprise tax data of an enterprise to be evaluated;
calling a pre-trained small and micro enterprise credit risk model built based on the XGBoost algorithm to perform operating credit risk assessment on the enterprise tax data and obtain an assessment result;
wherein the training method of the small and micro enterprise credit risk model includes:
acquiring tax sample data of an enterprise;
performing variable preprocessing on the tax sample data to obtain sample variable data;
performing variable screening on the sample variable data, with the variable stability and the model stability of the sample variable data as evaluation indexes, to determine the model-entry variables in the sample variable data;
determining the model parameters of the small and micro enterprise credit risk model built based on the XGBoost algorithm;
and calling the sample variable data to train the small and micro enterprise credit risk model.
Optionally, performing variable preprocessing on the tax sample data to obtain sample variable data includes:
performing variable analysis on the tax sample data, and taking the data output by the variable analysis as preprocessed sample data;
and performing WOE (weight of evidence) binning on the preprocessed sample data to obtain binned variable data, and taking the binned variable data as the sample variable data.
Optionally, performing variable analysis on the tax sample data, and taking the data output by the variable analysis as preprocessed sample data, includes:
performing statistical analysis on the distribution of the tax sample data to obtain sample distribution statistical information;
and performing data filling on the missing values and abnormal values identified in the sample distribution statistical information, and taking the processed data as the preprocessed sample data.
Optionally, performing variable screening on the sample variable data, with the variable stability and the model stability of the sample variable data as evaluation indexes, to determine the model-entry variables in the sample variable data includes:
screening the sample variable data according to the correlation among the sample variable data and the variable importance to obtain first variables;
and calculating a model stability index for each first variable, and taking the first variables whose model stability index is lower than a threshold as the model-entry variables.
Optionally, determining the model parameters of the small and micro enterprise credit risk model built based on the XGBoost algorithm includes:
determining the type of the XGBoost base learner, wherein the base learner types include gbtree and gblinear;
determining the learning objective function and the model evaluation index of XGBoost, wherein the objective functions include logistic regression and linear regression, and the model evaluation indexes include auc, logloss, rmse, mae, and error;
and tuning the XGBoost algorithm parameters, and taking the resulting optimal parameter combination as the XGBoost model parameters.
The application also provides an enterprise credit risk assessment apparatus, including:
a data receiving unit, used for receiving enterprise tax data of an enterprise to be evaluated;
a model evaluation unit, used for calling a pre-trained small and micro enterprise credit risk model built based on the XGBoost algorithm to perform operating credit risk assessment on the enterprise tax data and obtain an assessment result;
wherein the model training unit for training the small and micro enterprise credit risk model called by the model evaluation unit includes:
a data acquisition subunit, used for acquiring tax sample data of an enterprise;
a variable preprocessing subunit, used for performing variable preprocessing on the tax sample data to obtain sample variable data;
a variable screening subunit, used for performing variable screening on the sample variable data, with the variable stability and the model stability of the sample variable data as evaluation indexes, to determine the model-entry variables in the sample variable data;
a parameter determining subunit, used for determining the model parameters of the small and micro enterprise credit risk model built based on the XGBoost algorithm;
and a training subunit, used for calling the sample variable data to train the small and micro enterprise credit risk model.
Optionally, the variable preprocessing subunit includes:
a variable analysis subunit, used for performing variable analysis on the tax sample data and taking the data output by the variable analysis as preprocessed sample data;
and a binning subunit, used for performing WOE (weight of evidence) binning on the preprocessed sample data to obtain binned variable data, and taking the binned variable data as the sample variable data.
Optionally, the variable analysis subunit includes:
a statistical analysis subunit, used for performing statistical analysis on the distribution of the tax sample data to obtain sample distribution statistical information;
and an exception processing subunit, used for performing data filling on the missing values and abnormal values identified in the sample distribution statistical information and taking the processed data as the preprocessed sample data.
The present application further provides an enterprise credit risk assessment device, comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the enterprise credit risk assessment method when executing the computer program.
The present application also provides a readable storage medium having a program stored thereon which, when executed by a processor, performs the steps of the enterprise credit risk assessment method.
In the enterprise credit risk assessment method provided by the application, enterprise tax data is received and the operating credit risk of the enterprise is quantified from the tax data dimension, which, relative to other assessment dimensions, allows the enterprise's credit to be assessed and lays a foundation for accurate assessment of enterprise credit risk. The small and micro enterprise credit risk model called by the method is built on the XGBoost algorithm, which preserves the model's ability to learn feature crosses among weak variables. During training, feature engineering is built on analysis of the tax sample data; after variable preprocessing, the variable stability and model stability of the sample variable data are used as evaluation indexes to screen the model-entry variables. This filters out the influence of abnormal sample variables on model training and alleviates the overfitting of the small and micro enterprise credit risk model when the XGBoost algorithm is adopted, which improves the identification performance of the trained model and the accuracy of enterprise credit risk assessment.
The application also provides an enterprise credit risk assessment apparatus, device, and readable storage medium, which have the same beneficial effects and are not described again here.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of an enterprise credit risk assessment method according to an embodiment of the present application;
Fig. 2 is a block diagram of an enterprise credit risk assessment apparatus according to an embodiment of the present application;
Fig. 3 is a block diagram of an alternative embodiment of the enterprise credit risk assessment apparatus;
Fig. 4 is a schematic structural diagram of an enterprise credit risk assessment device according to an embodiment of the present application.
Detailed Description
The core of the application is to provide an enterprise credit risk assessment method that preserves the feature-crossing capability of weak variables while avoiding the effect of model overfitting on identification accuracy; another core of the application is to provide an enterprise credit risk assessment apparatus, device, and readable storage medium.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to Fig. 1, Fig. 1 is a flowchart of the enterprise credit risk assessment method provided by this embodiment. The method mainly includes:
Step s110, receiving enterprise tax data of an enterprise to be evaluated;
Enterprise tax data of the enterprise to be evaluated is received. The types of information included in the enterprise tax data are not limited and can be set according to the needs of actual enterprise operation management; for example, the data may include value-added tax, consumption tax, urban construction tax, real estate tax, land use tax, vehicle and vessel use tax, enterprise and individual income tax, stamp tax, and the like, and may be obtained from the enterprise's balance sheet and income statement. In this embodiment, the operating risk of the enterprise is quantified from the tax data dimension; a machine learning rating method based on tax data can assess the operating risk of the enterprise more comprehensively and accurately and can help form an enterprise operating risk constraint mechanism.
The enterprise tax data may be collected by the system, or previously collected enterprise tax data may be imported directly; this embodiment does not limit how the enterprise tax data is acquired, and the acquisition mode can be set according to actual data acquisition needs.
Step s120, calling a pre-trained small and micro enterprise credit risk model built based on the XGBoost algorithm to perform operating credit risk assessment on the enterprise tax data and obtain an assessment result;
Compared with a traditional logistic regression model, the pre-trained small and micro enterprise credit risk model built on the XGBoost algorithm can learn the interactions among some weak variables and has better predictive capability.
The training method of the small and micro enterprise credit risk model called in this embodiment specifically includes the following steps:
(1) Acquiring tax sample data of an enterprise;
(2) Performing variable preprocessing on the tax sample data to obtain sample variable data;
Variable preprocessing mainly refers to performing variable analysis on the sample data and removing irrelevant variables, abnormal variables, and the like, so that such data does not affect subsequent analysis.
The specific variable preprocessing means are not limited in this embodiment and may be set according to the data items of the actual sample data and the requirements of data analysis.
(3) Performing variable screening on the sample variable data, with the variable stability and the model stability of the sample variable data as evaluation indexes, to determine the model-entry variables in the sample variable data;
Variable stability refers to how stably a variable expresses the underlying feature; it may be addressed by, for example, removing or reassigning abnormal data and filling in missing data, without limitation. Model stability refers to the stability of the model once the stable variables have been applied to it; a specific measure may be a stability index or the like, without limitation.
(4) Determining the model parameters of the small and micro enterprise credit risk model built based on the XGBoost algorithm;
Model parameters are selected through the XGBoost parameters, which may specifically include: the base learner (booster), the objective function (objective), the model evaluation index (eval_metric), the number of iterations (n_estimators), the maximum depth of a tree (max_depth), the minimum loss reduction required to split a node (gamma), the minimum sum of sample weights in a leaf node (min_child_weight), the proportion of the whole sample set used as the subsample for training (subsample), the proportion of randomly sampled features (colsample_bytree), the L1 regularization term weight coefficient (alpha), the L2 regularization term weight coefficient (lambda), the learning rate (learning_rate), and the like.
The specific strategy for determining the model parameters is not limited in this embodiment and may be set according to actual risk assessment requirements.
(5) Calling the sample variable data to train the small and micro enterprise credit risk model.
For the specific implementation steps of model training, reference may be made to implementations in the related art; they are not limited in this embodiment and are not described again here.
Based on the above, in the enterprise credit risk assessment method provided by this embodiment, enterprise tax data is received and the operating credit risk of the enterprise is quantified from the tax data dimension, which, relative to other assessment dimensions, allows the enterprise's credit to be assessed and lays a foundation for accurate assessment of enterprise credit risk. The small and micro enterprise credit risk model called by the method is built on the XGBoost algorithm, which preserves the model's ability to learn feature crosses among weak variables. During training, feature engineering is built on analysis of the tax sample data; after variable preprocessing, the variable stability and model stability of the sample variable data are used as evaluation indexes to screen the model-entry variables. This filters out the influence of abnormal sample variables on model training and alleviates the overfitting of the small and micro enterprise credit risk model when the XGBoost algorithm is adopted, which improves the identification performance of the trained model and the accuracy of enterprise credit risk assessment.
The above embodiment does not limit how variable preprocessing is performed on the tax sample data when training the small and micro enterprise credit risk model. Optionally, the variable preprocessing may specifically include the following steps:
(1) Performing variable analysis on the tax sample data, and taking the data output by the variable analysis as preprocessed sample data;
The actual process of performing variable analysis on the samples is not limited here and can be set according to the requirements of actual data analysis.
Optionally, performing variable analysis on the tax sample data may specifically include the following steps:
(1.1) Performing statistical analysis on the distribution of the tax sample data to obtain sample distribution statistical information, for example by visualizing the distribution of the sample target variable and the distributions of the continuous and categorical variables;
(1.2) Performing data filling on the missing values and abnormal values identified in the sample distribution statistical information, and taking the processed data as the preprocessed sample data.
The data cleaning module is started to clean the tax data and process its missing values and abnormal values; the processing specifically includes operations such as transposing and summing the data. This embodiment describes only the above preprocessing process as an example; other implementations can refer to this description and are not described again here.
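The filling step (1.2) can be illustrated with a minimal sketch. The patent does not specify the filling rule, so everything here is an assumption for illustration: the hypothetical helper `fill_and_clip` fills missing entries with the median and clips abnormal values to nearest-rank quantile bounds, and the `monthly_vat` figures are made up.

```python
from statistics import median


def fill_and_clip(values, lower_q=0.01, upper_q=0.99):
    """Fill missing entries (None) with the median, then clip outliers.

    Median filling and quantile clipping are illustrative choices only;
    the source text does not prescribe a specific rule.
    """
    present = sorted(v for v in values if v is not None)
    if not present:
        raise ValueError("column has no observed values")
    fill = median(present)
    # Nearest-rank quantiles over the observed values.
    low = present[int(lower_q * (len(present) - 1))]
    high = present[int(upper_q * (len(present) - 1))]
    return [min(max(fill if v is None else v, low), high) for v in values]


# Hypothetical monthly VAT amounts with one missing and one abnormal value.
monthly_vat = [120.0, None, 95.0, 110.0, 10_000.0, 105.0]
cleaned = fill_and_clip(monthly_vat, lower_q=0.0, upper_q=0.8)
print(cleaned)
```

In practice the bounds and the fill value would be estimated on the training samples only and then reused unchanged on new data, so that the evaluation data does not leak into preprocessing.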
(2) Performing WOE (weight of evidence) binning on the preprocessed sample data to obtain binned variable data, and taking the binned variable data as the sample variable data.
The preprocessed sample data is binned to obtain a binned sample data set.
Because the sample size available for a small and micro enterprise risk control model is small, performing WOE binning on the variables before training a model with the XGBoost algorithm helps prevent the model from overfitting. The binning may specifically use decision tree binning, chi-square binning, equal-frequency binning, equal-width binning, and the like; reference may be made to the operation steps of the related binning techniques, and the specific binning steps are not limited in this embodiment.
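As a rough illustration of WOE binning, the sketch below uses equal-frequency binning, one of the options listed above. The function names, the sample figures, and the additive smoothing are assumptions made for this example; the sign convention used is WOE_i = ln((good_i / good_total) / (bad_i / bad_total)), and some references use the inverse ratio.

```python
import math


def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 cut points splitting the sorted values into roughly equal bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def assign_bin(value, edges):
    """Bin index of a value; bin i covers edges[i-1] <= value < edges[i]."""
    return sum(value >= e for e in edges)


def woe_table(values, labels, edges, smoothing=0.5):
    """WOE per bin, with label 1 = bad (default) and 0 = good.

    Additive smoothing avoids log(0) when a bin contains a single class.
    """
    n_bins = len(edges) + 1
    good = [smoothing] * n_bins
    bad = [smoothing] * n_bins
    for v, y in zip(values, labels):
        b = assign_bin(v, edges)
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    good_total, bad_total = sum(good), sum(bad)
    return [math.log((g / good_total) / (b / bad_total)) for g, b in zip(good, bad)]


# Hypothetical tax amounts and default flags for ten enterprises.
tax_paid = [5, 8, 12, 20, 33, 41, 58, 77, 90, 120]
defaulted = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

edges = equal_frequency_edges(tax_paid, n_bins=2)
woe = woe_table(tax_paid, defaulted, edges)
# Replace each raw value with the WOE of its bin before model training.
encoded = [woe[assign_bin(v, edges)] for v in tax_paid]
print(edges, woe)
```

Replacing raw values with a handful of per-bin WOE values reduces the number of distinct split points a tree can exploit, which is one plausible reason the text associates binning with less overfitting on small samples.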
After the sample data is binned, the samples can be split into a training set and a test set to meet the sample data requirements of different model usage scenarios.
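A minimal sketch of such a split follows. The 70/30 ratio, the fixed seed, and the helper name `train_test_split` are assumptions for illustration and are not taken from the source.

```python
import random


def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical binned sample rows, identified by index.
sample_rows = list(range(10))
train_rows, test_rows = train_test_split(sample_rows)
print(len(train_rows), len(test_rows))
```

With few samples and a low default rate, a stratified split that keeps the share of defaulting enterprises equal in both sets may be preferable to the plain shuffle shown here.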
The variable screening approach provided by this embodiment screens the model-entry variables using conditions such as the variable missing rate and the feature importance; it is simple to implement, keeps the retained variables highly effective, and can effectively alleviate model overfitting.
The above embodiment does not limit the specific steps for screening the sample variable data and determining the model-entry variables. This embodiment introduces one variable screening implementation, which mainly includes the following steps:
(1) Screening the sample variable data according to the correlation among the sample variable data and the variable importance to obtain first variables;
The specific means of evaluating correlation and variable importance are not limited in this embodiment. For example, when evaluating the correlation of the sample data, the WOE values may be used as the evaluation basis, or the relative distance between two variables may be calculated; when evaluating variable importance, a random forest or GBDT (Gradient Boosting Decision Tree) algorithm or the like may be used. For example, a screening rule may select the first variables according to the correlation between the WOE values of the sample variables (less than 0.6) and the variable importance given by a random forest or GBDT algorithm.
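One way to combine the two criteria is a greedy pass, sketched below under stated assumptions: variables are visited in descending importance and kept only if their absolute Pearson correlation with every already-kept variable is below 0.6. The greedy order, the use of Pearson correlation, the variable names, and the importance scores (which would come from a random forest or GBDT in practice) are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def screen_by_correlation(columns, importance, max_corr=0.6):
    """Keep the most important variable of each highly correlated group."""
    kept = []
    for name in sorted(columns, key=lambda n: importance[n], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical WOE-encoded columns; the second is nearly a copy of the first.
woe_columns = {
    "vat_woe": [0.2, -0.1, 0.4, -0.3, 0.1, 0.5],
    "vat_growth_woe": [0.21, -0.12, 0.38, -0.29, 0.12, 0.47],
    "stamp_tax_woe": [0.3, 0.3, -0.2, -0.2, 0.1, -0.3],
}
# Hypothetical importance scores, e.g. from a random forest or GBDT.
importance = {"vat_woe": 0.5, "vat_growth_woe": 0.3, "stamp_tax_woe": 0.2}

first_variables = screen_by_correlation(woe_columns, importance)
print(first_variables)
```

Here `vat_growth_woe` is dropped because it duplicates the information in the more important `vat_woe`, while `stamp_tax_woe` survives because its correlation with `vat_woe` is weak.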
(2) Calculating a model stability index for each first variable, and taking the first variables whose model stability index is lower than a threshold as the model-entry variables.
The model stability index (PSI, commonly expanded as population stability index) measures the difference between the distribution of the test samples and that of the model development samples. If a first variable whose stability index is not lower than the threshold were used as a model-entry variable, the distributions of the test samples and the model development samples would differ considerably, and the model's accuracy in actual assessment could be low; if a first variable whose stability index is lower than the threshold is used, the distributions differ little, and the model's accuracy in actual assessment can be high. For example, the PSI of each variable may be calculated, and the variables with a PSI less than 0.1 retained as the final model-entry variables.
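The PSI screening can be sketched as follows, using the usual definition PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the shares of development (training) and test samples falling in bin i. The bin counts and variable names below are made up for illustration, and the small `eps` floor is an assumption added to keep the logarithm defined for empty bins.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)  # floor avoids log(0) on empty bins
        a_share = max(a / a_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Hypothetical per-bin sample counts for two first variables.
train_bins = {"vat_woe": [50, 30, 20], "stamp_tax_woe": [40, 40, 20]}
test_bins = {"vat_woe": [24, 16, 10], "stamp_tax_woe": [5, 20, 25]}

# Keep only variables whose PSI is below the 0.1 threshold from the text.
model_entry = [
    name for name in train_bins
    if psi(train_bins[name], test_bins[name]) < 0.1
]
print(model_entry)
```

In this example `vat_woe` has nearly the same bin shares in both sets and is kept, while `stamp_tax_woe` shifts heavily toward the last bin in the test set and is screened out.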
The variable screening approach provided by this embodiment screens the model-entry variables using conditions such as the variable correlation and the PSI of the variables between the training and test samples; it is simple to implement, keeps the retained variables highly effective, and can effectively alleviate model overfitting.
In addition, the foregoing embodiments do not limit the specific steps for selecting and determining the model parameters; one specific implementation is described in this embodiment to aid understanding.
The implementation includes the following steps:
(1) Determining the type of the XGBoost base learner, wherein the base learner types include gbtree and gblinear;
The XGBoost base learner is selected. There are mainly two types: gbtree (decision tree) and gblinear (linear model). Different types of base learners can be configured for different application scenarios and usage requirements, which is not limited in this embodiment.
(2) Determining the learning objective function and the model evaluation index of XGBoost, wherein the objective functions include logistic regression and linear regression, and the model evaluation indexes include auc (area under the curve), logloss, rmse (root mean squared error), mae (mean absolute error), and error (error rate);
The learning objective function and the model evaluation index of XGBoost are selected. The objective functions mainly include logistic regression and linear regression, and the model evaluation indexes mainly include auc, logloss, rmse, mae, error, and the like.
(3) And adjusting and optimizing the XG boost algorithm parameters, and combining the obtained optimal model parameters to serve as the XG boost model parameters.
And adjusting and optimizing the commonly used parameters to obtain the optimal model parameter combination. Because the sample size of the small and micro enterprise wind control model is small, the maximum depth of the tree can be generally set to be 5, and regular parameters of L1 and L2 can also be set to be larger.
The determination mode of the model parameters can be widely applied to risk assessment scenes of different enterprises, and can also ensure a better model training effect when the sample size is less, and improve the accuracy of model identification.
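The three steps above can be sketched as a small parameter builder. This is an illustrative sketch under stated assumptions, not the configuration used in this embodiment: the helper name `build_params` and the default regularization values 1.0 and 5.0 are hypothetical (the text only says the L1 and L2 parameters may be set larger), while the dictionary keys and values (`booster`, `objective`, `eval_metric`, `max_depth`, `alpha`, `lambda`, `binary:logistic`, `reg:squarederror`) are the parameter names used by the XGBoost library.

```python
BASE_LEARNERS = {"gbtree", "gblinear"}
# Map the two objective families named in the text to XGBoost objective strings.
OBJECTIVES = {"logistic": "binary:logistic", "linear": "reg:squarederror"}
EVAL_METRICS = {"auc", "logloss", "rmse", "mae", "error"}


def build_params(booster="gbtree", objective="logistic", eval_metric="auc",
                 max_depth=5, reg_l1=1.0, reg_l2=5.0):
    """Validate the choices from steps (1) to (3) and return a parameter dict."""
    if booster not in BASE_LEARNERS:
        raise ValueError(f"unknown base learner: {booster}")
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective}")
    if eval_metric not in EVAL_METRICS:
        raise ValueError(f"unknown evaluation index: {eval_metric}")
    params = {
        "booster": booster,
        "objective": OBJECTIVES[objective],
        "eval_metric": eval_metric,
        "alpha": reg_l1,    # L1 regularization weight (illustrative value)
        "lambda": reg_l2,   # L2 regularization weight (illustrative value)
    }
    if booster == "gbtree":
        params["max_depth"] = max_depth  # tree depth applies to tree boosters only
    return params
```

The returned dictionary is in the form that could be passed to the XGBoost training interface; the actual values would come out of the tuning in step (3).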
Referring to fig. 2, which is a structural block diagram of the enterprise credit risk assessment apparatus provided in this embodiment, the apparatus mainly includes a data receiving unit 110, a model evaluation unit 120, and a model training unit 130. The enterprise credit risk assessment apparatus provided in this embodiment and the enterprise credit risk assessment method described above may be read in conjunction with each other.
The data receiving unit 110 is mainly used for receiving enterprise tax data of an enterprise to be evaluated;
the model evaluation unit 120 is mainly used for calling a pre-trained small and micro enterprise credit risk model built based on the XGBoost algorithm to perform operating credit risk evaluation on the enterprise tax data to obtain an evaluation result;
the model training unit 130, which is mainly used for training the small and micro enterprise credit risk model called by the model evaluation unit, includes:
the data acquiring subunit 131 is mainly used for acquiring tax sample data of an enterprise;
the variable preprocessing subunit 132 is mainly configured to perform variable preprocessing on the tax sample data to obtain sample variable data;
the variable screening subunit 133 is mainly configured to perform variable screening on the sample variable data by using the variable stability of the sample variable data and the model stability as evaluation indexes, and determine the model-entry variables in the sample variable data;
the parameter determining subunit 134 is mainly used for determining model parameters in a small and micro enterprise credit risk model built based on the XGBoost algorithm;
the training subunit 135 is mainly used for calling the sample variable data to train the small and micro enterprise credit risk model.
Optionally, the variable preprocessing subunit may specifically include:
the variable analysis subunit is used for carrying out variable analysis on the tax sample data and taking data output by the variable analysis as preprocessed sample data;
and the binning subunit is used for performing WOE binning on the preprocessed sample data to obtain binned variable data, and taking the binned variable data as the sample variable data.
Optionally, the variable analysis subunit may specifically include:
the statistical analysis subunit is used for performing statistical analysis on the distribution of the tax sample data to obtain sample distribution statistical information;
and the exception processing subunit is used for performing data filling processing on the missing values and the abnormal values in the sample distribution statistical information and taking the processed data as the preprocessed sample data.
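The handling of missing and abnormal values by the exception processing subunit could look like the following minimal sketch. The text does not say which filling rule is used, so everything here is an assumption: missing values (represented as `None`) are filled with the median, abnormal values are replaced by the nearest interquartile-range fence, and the function name `fill_and_clip` and the factor `k=1.5` are hypothetical.

```python
import statistics


def fill_and_clip(values, k=1.5):
    """Fill None with the median, then pull outliers back to the IQR fences."""
    present = sorted(v for v in values if v is not None)
    median = statistics.median(present)
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles of observed values
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    # Replace missing entries first, then clamp every entry into [lo, hi].
    return [min(max(median if v is None else v, lo), hi) for v in values]
```

For example, `fill_and_clip([1, 2, 3, 4, 5, None, 1000])` leaves the first five values unchanged, fills the missing entry with the median 3.5, and pulls the extreme value 1000 down to the upper fence.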
The present embodiment provides another enterprise credit risk assessment apparatus, whose structural block diagram is shown in fig. 3. The apparatus mainly includes a variable selection background and a model parameter console.
The variable selection background is responsible for processing the enterprise data and selecting the variables.
The enterprise data processing unit is mainly used for cleaning the data, performing descriptive analysis on it, and visually displaying the distribution of the sample features, so as to give a preliminary understanding of the data.
The variable selection unit screens the variables by methods such as variable missing rate, feature importance, variable binning, variable correlation, and correlation after variable binning. Feature engineering is built with ensemble learning methods such as random forest. Finally, the variable PSI between the training set and the test set is calculated from the binning results, and the variables with a PSI below 0.1 are retained as the final model-entry variables.
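The correlation-based part of this screening can be sketched as a greedy filter. This is one possible reading, not the procedure of this embodiment: the function names `pearson` and `drop_correlated`, the cutoff `max_corr=0.8`, and the assumption that an importance score per variable is already available (for example from a random forest) are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(data, importance, max_corr=0.8):
    """Visit variables by descending importance and keep a variable only if
    its absolute correlation with every variable kept so far is small."""
    kept = []
    for name in sorted(data, key=lambda n: importance[n], reverse=True):
        if all(abs(pearson(data[name], data[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept
```

When two variables are strongly correlated, this keeps the one with the higher importance score, which is a common way to reduce collinearity while losing as little predictive signal as possible.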
Specifically, the variable selection background comprises the following units:
(1) A data acquisition unit: collects the original enterprise sample data.
(2) A variable distribution unit: performs statistical analysis of the distribution of the sample variables and visualizes the variable distribution charts.
(3) A data cleaning unit: cleans the sample data and handles missing values and abnormal values by filling them; the specific operations include transposition of the data, mathematical operations, and the like.
(4) A variable binning unit: because the sample size for a small and micro enterprise risk control model is small, when the XGBoost algorithm is used for modeling, the variables are WOE-binned before model training to prevent the model from overfitting. This unit is therefore responsible for binning the preprocessed sample data, specifically by decision tree binning, chi-square binning, equal-frequency binning, or equal-width binning, and supports graphical output of the binned trend charts.
(5) A variable selection unit: supports splitting the data into a training set and a test set, screens the variables by their correlation after WOE binning, and provides several algorithms (random forest, GBDT, and others) for selecting variables by importance. Finally, the variables are selected according to their PSI values across the split data sets.
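The WOE binning performed by the variable binning unit can be illustrated with the equal-frequency option. The sketch below is an assumption-laden example, not the unit's implementation: it uses the convention WOE = ln(share of good samples / share of bad samples), with label 1 meaning a bad sample, adds a smoothing constant so that empty classes do not cause division by zero, and assigns bins by rank (so tied values may be split across bins). The name `woe_encode` and the parameters `n_bins` and `smooth` are hypothetical.

```python
import math


def woe_encode(values, labels, n_bins=4, smooth=0.5):
    """Equal-frequency binning followed by WOE encoding (label 1 = bad)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_of = [0] * len(values)
    for rank, i in enumerate(order):
        bin_of[i] = rank * n_bins // len(values)  # equal-frequency bin index

    bad = [smooth] * n_bins
    good = [smooth] * n_bins
    for b, y in zip(bin_of, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1

    total_bad, total_good = sum(bad), sum(good)
    woe = [math.log((good[b] / total_good) / (bad[b] / total_bad))
           for b in range(n_bins)]
    # Return each sample's WOE value and the per-bin WOE table.
    return [woe[b] for b in bin_of], woe
```

Replacing the raw values with a handful of WOE values reduces the number of distinct inputs the trees can split on, which is consistent with the stated aim of limiting overfitting on small samples.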
The model parameter console is mainly responsible for tuning the XGBoost model parameters.
After the data has been processed by the variable selection background, the sample data enters the model parameter console, where a final suitable parameter combination is determined by tuning the commonly used XGBoost model parameters.
Specifically, the model parameter console comprises the following units:
(1) A general parameter unit: controls the macro-level behavior of the XGBoost model; the main parameter is the base learner type.
(2) A learning target parameter unit: controls the model objective function and the model evaluation index.
(3) A booster parameter unit: controls the commonly used booster parameters, specifically the number of iterations, the maximum tree depth, the minimum loss reduction required to split a node, the minimum sum of sample weights in a leaf node, the proportion of the whole sample set that is subsampled to train the model, the proportion of features randomly sampled, the L1 regularization term weight coefficient, the L2 regularization term weight coefficient, and the learning rate.
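A simple way to explore combinations of the booster parameters listed above is an exhaustive grid, sketched below. The text does not state which search strategy the console uses, so the grid search, the candidate values, and the names `param_grid`, `best_params`, and `SEARCH_SPACE` are assumptions. The dictionary keys are the XGBoost names for the listed parameters (`gamma` for the minimum loss reduction, `min_child_weight`, `subsample`, `colsample_bytree`, `alpha`, `lambda`, `eta`); the number of iterations is omitted because XGBoost takes it as a separate training argument rather than as a parameter key.

```python
from itertools import product

# Illustrative candidate values only; not taken from the specification.
SEARCH_SPACE = {
    "max_depth": [3, 5],
    "gamma": [0.0, 1.0],
    "min_child_weight": [1, 5],
    "subsample": [0.8],
    "colsample_bytree": [0.8],
    "alpha": [1.0],
    "lambda": [5.0],
    "eta": [0.05, 0.1],
}


def param_grid(space):
    """Yield one parameter dict per combination in the search space."""
    names = list(space)
    for combo in product(*(space[n] for n in names)):
        yield dict(zip(names, combo))


def best_params(space, score):
    """Return the combination with the highest score.

    `score` is a caller-supplied function, for example one that trains a
    model with the given parameters and returns its validation AUC.
    """
    return max(param_grid(space), key=score)
```

With the sample space above, `param_grid(SEARCH_SPACE)` yields 16 combinations, which stays cheap to evaluate on the small samples discussed in the text.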
The enterprise credit risk assessment apparatus provided in this embodiment strictly screens the model-entry variables, checking them for collinearity, importance, and stability, to obtain the final effective and stable model-entry variables. It then builds the XGBoost model on these variables, avoiding the overfitting caused by the small modeling sample size of small and micro enterprise risk models.
The embodiment provides an enterprise credit risk assessment device, mainly including: a memory and a processor.
Wherein, the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the enterprise credit risk assessment method described in the above embodiments, which may be referred to in the above description of the enterprise credit risk assessment method.
Referring to fig. 4, which is a schematic structural diagram of the enterprise credit risk assessment device provided in this embodiment, the device may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 322, a memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) storing applications 342 or data 344. The memory 332 and the storage media 330 may be transient or persistent storage. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations for a data processing device. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the enterprise credit risk assessment device 301, the series of instruction operations in the storage medium 330.
The enterprise credit risk assessment device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input-output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps in the enterprise credit risk assessment method described in fig. 1 above can be implemented by the structure of the enterprise credit risk assessment apparatus introduced in this embodiment.
The present embodiment discloses a readable storage medium, on which a program is stored, and the program, when executed by a processor, implements the steps of the enterprise credit risk assessment method described in the above embodiments, which may be referred to in the description of the enterprise credit risk assessment method in the above embodiments.
The readable storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other readable storage media capable of storing program codes.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is brief, and the relevant details can be found in the description of the method.
Those of skill would further appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The enterprise credit risk assessment method, device, equipment and readable storage medium provided by the application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. An enterprise credit risk assessment method, comprising:
receiving enterprise tax data of an enterprise to be evaluated;
calling a pre-trained small and micro enterprise credit risk model built based on an XGBoost algorithm to perform operating credit risk evaluation on the enterprise tax data to obtain an evaluation result;
wherein the training method of the small and micro enterprise credit risk model comprises the following steps:
acquiring tax sample data of an enterprise;
performing variable preprocessing on the tax sample data to obtain sample variable data;
taking the variable stability and the model stability of the sample variable data as evaluation indexes, performing variable screening on the sample variable data, and determining the model-entry variables in the sample variable data;
determining model parameters in the small and micro enterprise credit risk model built based on the XGBoost algorithm;
and calling the sample variable data to train the small and micro enterprise credit risk model.
2. The enterprise credit risk assessment method of claim 1, wherein performing variable preprocessing on the tax sample data to obtain sample variable data comprises:
performing variable analysis on the tax sample data, and taking data output by the variable analysis as preprocessed sample data;
and performing WOE binning on the preprocessed sample data to obtain binned variable data, and taking the binned variable data as the sample variable data.
3. The enterprise credit risk assessment method of claim 2, wherein performing variable analysis on the tax sample data and using data output by the variable analysis as pre-processed sample data comprises:
performing statistical analysis on the distribution of the tax sample data to obtain sample distribution statistical information;
and performing data filling processing on the missing values and the abnormal values in the sample distribution statistical information, and taking the processed data as pre-processing sample data.
4. The enterprise credit risk assessment method of claim 1, wherein performing variable screening on the sample variable data, with the variable stability and the model stability of the sample variable data as evaluation indexes, to determine the model-entry variables in the sample variable data comprises:
screening the sample variable data according to the correlation and the variable importance among the sample variable data to obtain a first variable;
and calculating a model stability index of the first variable, and taking the first variable with the model stability index lower than a threshold value as a model-entry variable.
5. The enterprise credit risk assessment method of claim 1, wherein the determining model parameters in the small micro enterprise credit risk model built based on the XGBoost algorithm comprises:
determining the base learner type of the XGBoost model; wherein the base learner types of the XGBoost model comprise: gbtree and gblinear;
determining a learning objective function and a model evaluation index of XGBoost; wherein the objective function comprises: logistic regression and linear regression, and the model evaluation indexes comprise: auc, logloss, rmse, mae, error;
and tuning the XGBoost algorithm parameters, and using the resulting optimal parameter combination as the XGBoost model parameters.
6. An enterprise credit risk assessment device, comprising:
the data receiving unit is used for receiving enterprise tax data of an enterprise to be evaluated;
the model evaluation unit is used for calling a pre-trained small and micro enterprise credit risk model built based on the XGBoost algorithm to perform operating credit risk evaluation on the enterprise tax data to obtain an evaluation result;
wherein the model training unit for training the small and micro enterprise credit risk model called by the model evaluation unit comprises:
the data acquisition subunit is used for acquiring tax sample data of an enterprise;
the variable preprocessing subunit is used for performing variable preprocessing on the tax sample data to obtain sample variable data;
the variable screening subunit is used for performing variable screening on the sample variable data by taking the variable stability and the model stability of the sample variable data as evaluation indexes to determine the model-entry variables in the sample variable data;
the parameter determining subunit is used for determining model parameters in the small and micro enterprise credit risk model built based on the XGBoost algorithm;
and the training subunit is used for calling the sample variable data to train the small and micro enterprise credit risk model.
7. The enterprise credit risk assessment device of claim 6, wherein the variable preprocessing subunit comprises:
the variable analysis subunit is used for carrying out variable analysis on the tax sample data and taking data output by the variable analysis as preprocessed sample data;
and the binning subunit is used for performing WOE binning on the preprocessed sample data to obtain binned variable data, and taking the binned variable data as the sample variable data.
8. The enterprise credit risk assessment device of claim 7, wherein the variable analysis subunit comprises:
the statistical analysis subunit is used for performing statistical analysis on the distribution of the tax sample data to obtain sample distribution statistical information;
and the exception processing subunit is used for performing data filling processing on the missing values and the abnormal values in the sample distribution statistical information and taking the processed data as pre-processing sample data.
9. An enterprise credit risk assessment device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the enterprise credit risk assessment method according to any one of claims 1 to 5 when executing the computer program.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a program which, when being executed by a processor, realizes the steps of the enterprise credit risk assessment method according to any one of claims 1 to 5.
CN202010805252.7A 2020-08-12 2020-08-12 Enterprise credit risk assessment method, device, equipment and storage medium Pending CN111951097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010805252.7A CN111951097A (en) 2020-08-12 2020-08-12 Enterprise credit risk assessment method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111951097A true CN111951097A (en) 2020-11-17

Family

ID=73332732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010805252.7A Pending CN111951097A (en) 2020-08-12 2020-08-12 Enterprise credit risk assessment method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111951097A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529477A (en) * 2020-12-29 2021-03-19 平安普惠企业管理有限公司 Credit evaluation variable screening method, device, computer equipment and storage medium
CN112633635A (en) * 2020-11-29 2021-04-09 龙马智芯(珠海横琴)科技有限公司 Exhibitor risk assessment method, exhibitor risk assessment device, exhibitor risk assessment server and readable storage medium
CN112749922A (en) * 2021-02-01 2021-05-04 深圳无域科技技术有限公司 Wind control model training method, system, equipment and computer readable medium
CN113205403A (en) * 2021-03-30 2021-08-03 北京中交兴路信息科技有限公司 Method and device for calculating enterprise credit level, storage medium and terminal
CN113222731A (en) * 2021-04-25 2021-08-06 北京工业大学 Small sample credit evaluation method, system and medium based on machine learning
CN113393328A (en) * 2021-06-21 2021-09-14 深圳微众信用科技股份有限公司 Method and device for assessing pre-financing and pre-loan approval and computer storage medium
CN113409150A (en) * 2021-06-21 2021-09-17 深圳微众信用科技股份有限公司 Operation risk and credit risk assessment method, device and computer storage medium
CN113793212A (en) * 2021-09-24 2021-12-14 重庆富民银行股份有限公司 Credit assessment method
CN114492929A (en) * 2021-12-23 2022-05-13 江南大学 XGboost-based financial credit enterprise credit prediction method
CN115329207A (en) * 2022-10-17 2022-11-11 启客(北京)科技有限公司 Intelligent sales information recommendation method and system
CN115860926A (en) * 2023-02-20 2023-03-28 江西汉辰信息技术股份有限公司 Wind control decision method and system based on decision tree
CN116051296A (en) * 2022-12-28 2023-05-02 中国银行保险信息技术管理有限公司 Customer evaluation analysis method and system based on standardized insurance data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779457A (en) * 2016-12-29 2017-05-31 深圳微众税银信息服务有限公司 A kind of rating business credit method and system
CN110163743A (en) * 2019-04-28 2019-08-23 钛镕智能科技(苏州)有限公司 A kind of credit-graded approach based on hyperparameter optimization
CN111507822A (en) * 2020-04-13 2020-08-07 深圳微众信用科技股份有限公司 Enterprise risk assessment method based on feature engineering


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴锦华 等: "特征选择方法在信用评分系统中的应用", 信息与电脑(理论版), no. 08, 25 April 2019 (2019-04-25) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633635A (en) * 2020-11-29 2021-04-09 龙马智芯(珠海横琴)科技有限公司 Exhibitor risk assessment method, exhibitor risk assessment device, exhibitor risk assessment server and readable storage medium
CN112529477A (en) * 2020-12-29 2021-03-19 平安普惠企业管理有限公司 Credit evaluation variable screening method, device, computer equipment and storage medium
CN112749922A (en) * 2021-02-01 2021-05-04 深圳无域科技技术有限公司 Wind control model training method, system, equipment and computer readable medium
CN113205403A (en) * 2021-03-30 2021-08-03 北京中交兴路信息科技有限公司 Method and device for calculating enterprise credit level, storage medium and terminal
CN113222731A (en) * 2021-04-25 2021-08-06 北京工业大学 Small sample credit evaluation method, system and medium based on machine learning
CN113409150A (en) * 2021-06-21 2021-09-17 深圳微众信用科技股份有限公司 Operation risk and credit risk assessment method, device and computer storage medium
CN113393328A (en) * 2021-06-21 2021-09-14 深圳微众信用科技股份有限公司 Method and device for assessing pre-financing and pre-loan approval and computer storage medium
CN113793212A (en) * 2021-09-24 2021-12-14 重庆富民银行股份有限公司 Credit assessment method
CN114492929A (en) * 2021-12-23 2022-05-13 江南大学 XGboost-based financial credit enterprise credit prediction method
CN115329207A (en) * 2022-10-17 2022-11-11 启客(北京)科技有限公司 Intelligent sales information recommendation method and system
CN116051296A (en) * 2022-12-28 2023-05-02 中国银行保险信息技术管理有限公司 Customer evaluation analysis method and system based on standardized insurance data
CN116051296B (en) * 2022-12-28 2023-09-29 中国银行保险信息技术管理有限公司 Customer evaluation analysis method and system based on standardized insurance data
CN115860926A (en) * 2023-02-20 2023-03-28 江西汉辰信息技术股份有限公司 Wind control decision method and system based on decision tree


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination