CN114492929A - XGboost-based financial credit enterprise credit prediction method - Google Patents

Publication number
CN114492929A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111587189.5A
Other languages
Chinese (zh)
Inventor
谢振平
翟彬
陈丽芳
刘渊
崔乐乐
宋设
杨宝华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Jiangnan University
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Application filed by Jiangnan University and Chaozhou Zhuoshu Big Data Industry Development Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Abstract

The invention discloses an XGBoost-based financial credit enterprise credit prediction method. The method comprises: preliminary screening, in which data related to credit assessment is screened out; data processing, in which abnormal values and missing values are handled and the data is classified; feature engineering, in which the features are processed; data set division, in which the data is divided into a training set and a validation set; model training, in which a model is trained on the training set with the XGBoost algorithm; and model evaluation and optimization, in which the model is evaluated on the validation set, each feature in the XGBoost model is analyzed, and the model is optimized accordingly. The method enables accurate and efficient evaluation of enterprise credit, has good robustness and stability, can be further interpreted and optimized in combination with existing evaluation systems and conclusions, and meets requirements for practicality and performance.

Description

XGBoost-based financial credit enterprise credit prediction method
Technical Field
The invention relates to the technical field of enterprise credit prediction, and in particular to an XGBoost-based financial credit enterprise credit prediction method.
Background
With the rapid development of artificial intelligence technology, enterprise credit assessment faces new opportunities: credit predictions made with machine learning methods are more accurate than those made with traditional statistical means. Existing statistical means have certain limitations. First, they depend on experience: many statistical rules and calculation methods rely on expert experience and manual processing, so both accuracy and efficiency are deficient. Second, they depend on manual labor, so the cost is high, adaptability to different scenarios is poor, and efficiency is low.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above problems and/or the problems of existing enterprise credit prediction methods.
Therefore, the problem to be solved by the invention is how to provide an XGBoost-based financial credit enterprise credit prediction method.
In order to solve the above technical problem, the invention provides the following technical solution: an XGBoost-based financial credit enterprise credit prediction method, comprising: preliminary screening, in which data related to credit assessment is screened out; data processing, in which abnormal values and missing values are handled and the data is classified; feature engineering, in which the features are processed; data set division, in which the data is divided into a training set and a validation set; model training, in which a model is trained on the training set with the XGBoost algorithm; and model evaluation and optimization, in which the model is evaluated on the validation set, each feature in the XGBoost model is analyzed, and the model is optimized accordingly.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, when the data is classified, the selected samples are classified according to the past transaction behavior, credit records and tax payment records of the enterprises.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, processing the features comprises removing redundant features and then further reducing the features with variance selection, principal component analysis and other dimensionality reduction methods.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, the redundant features are features that can be simply calculated from existing features.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, after the data is divided into a training set and a validation set, oversampling is performed when the number of samples of a given class is small.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, when data related to credit assessment is screened out, the data is preliminarily screened according to previously used credit scorecard models and existing conclusions.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, missing values are handled by deletion, filling and replacement, and abnormal values are handled with a clustering-based method or the isolation forest algorithm.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, the oversampling is performed with the SMOTE method.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, the existing conclusions are manually annotated data labels.
In a preferred embodiment of the XGBoost-based financial credit enterprise credit prediction method of the present invention, the model is evaluated using the ROC curve.
The invention has the beneficial effects that: the method can realize accurate and efficient evaluation of enterprise credit, has good robustness and stability, can be further explained and optimized by combining the existing evaluation system and conclusion, and can meet the requirements of practicability and performance.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort. Wherein:
Fig. 1 is a parsing frame diagram of the XGboost-based financial credit enterprise credit prediction method in embodiment 1.
Fig. 2 is a map hierarchy diagram of the XGboost-based financial credit enterprise credit prediction method in embodiment 1.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, the present invention may be practiced in ways other than those specifically described, and those of ordinary skill in the art may make similar generalizations without departing from the spirit of the present invention; the present invention is therefore not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Example 1
Referring to Figs. 1 and 2, a first embodiment of the present invention provides an XGBoost-based financial credit enterprise credit prediction method, which includes the following steps:
S1: preliminary screening, in which data related to credit assessment is screened out;
S2: data processing, in which abnormal values and missing values are handled and the data is classified;
S3: feature engineering, in which the features are processed;
S4: data set division, in which the data is divided into a training set and a validation set;
S5: model training, in which a model is trained on the training set with the XGBoost algorithm;
S6: model evaluation and optimization, in which the model is evaluated on the validation set, each feature in the XGBoost model is analyzed, and the model is optimized accordingly.
In step S1, data that may be relevant to credit assessment is preliminarily selected by removing, from the collected enterprise data, data that is essentially unrelated to credit. Existing conclusions can be drawn on here, such as credit scorecard models and empirical conclusions used in the past. For example, features of high importance in such models are selected from the analysis of enterprise financial condition as reference standards. The screening criteria should take into account industry conditions such as national policy and industry characteristics, as well as the analyses commonly used in traditional credit systems, such as basic enterprise quality, financial condition, corporate system and development potential. Credit assessment generally considers the following factors: basic enterprise quality factors (financial condition, system construction, duration of continuous operation, management efficiency, employee quality and asset quality); external environmental factors (government policy, industry awareness, industry status, and the situation of upstream and downstream vendors); and development potential (profit growth rate and growth in scientific research investment). Data labels are formed by manual annotation with reference to these factors. Operations such as data desensitization can also be completed in this step.
In step S2, abnormal values and missing values are processed, and enterprises with complete and reliable data are selected. Based on the past transaction behavior of the enterprises, data with high reliability and timeliness is selected as model samples. The samples are then classified into good credit and poor credit according to enterprise data such as credit records and tax payment records. The rating issued by the local tax authority may serve as a reference: for example, a D rating may be regarded as poor credit and all other ratings as good credit. Enterprises with records of dishonesty or judicial records may also be regarded as having poor credit.
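The labelling rule described for step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Enterprise` record and its field names (`tax_rating`, `has_dishonesty_record`, `has_judicial_record`) are hypothetical, since the document does not define a data schema.

```python
from dataclasses import dataclass


@dataclass
class Enterprise:
    # All field names are hypothetical; the document does not define a schema.
    name: str
    tax_rating: str              # rating from the local tax authority, e.g. "A" to "D"
    has_dishonesty_record: bool  # record of loss of credit
    has_judicial_record: bool    # judicial record


def credit_label(enterprise: Enterprise) -> int:
    """Return 1 for good credit and 0 for poor credit."""
    if enterprise.tax_rating.strip().upper() == "D":
        return 0  # a D tax rating is treated as poor credit
    if enterprise.has_dishonesty_record or enterprise.has_judicial_record:
        return 0  # dishonesty or judicial records are treated as poor credit
    return 1      # every other case is treated as good credit


samples = [
    Enterprise("alpha", "A", False, False),
    Enterprise("beta", "d", False, False),
    Enterprise("gamma", "B", True, False),
]
labels = [credit_label(e) for e in samples]
```

Keeping the rule in one small function makes it easy to compare the resulting labels with the manually annotated labels mentioned in step S1.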
In step S3, redundant features are removed first; the features are then further reduced with a common feature selection method such as variance selection, followed by a dimensionality reduction method such as principal component analysis. Redundant features are features that can be simply calculated from existing features.
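The first two parts of step S3 can be sketched in plain Python as follows. The column names and values are invented for illustration, and the choice of `debt_ratio` as the redundant feature is an assumption (it is simply liabilities divided by assets); principal component analysis is not shown.

```python
from statistics import pvariance


def drop_columns(rows, names, redundant):
    """Remove the columns whose names are listed in `redundant`."""
    keep = [i for i, name in enumerate(names) if name not in redundant]
    return [[row[i] for i in keep] for row in rows], [names[i] for i in keep]


def drop_low_variance(rows, names, threshold=0.0):
    """Variance selection: keep columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], [names[i] for i in keep]


# Hypothetical toy data: one row per enterprise.
names = ["assets", "liabilities", "debt_ratio", "region_code"]
rows = [
    [100.0, 40.0, 0.40, 1.0],
    [200.0, 50.0, 0.25, 1.0],
    [150.0, 90.0, 0.60, 1.0],
]

# debt_ratio can be computed from liabilities / assets, so it is redundant.
rows, names = drop_columns(rows, names, redundant={"debt_ratio"})
# region_code is constant (zero variance) and is removed by variance selection.
rows, names = drop_low_variance(rows, names)
print(names)  # ['assets', 'liabilities']
```

In practice the remaining columns would then be passed to a dimensionality reduction step such as principal component analysis.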
In step S4, the data is divided into a training set and a validation set, with the sampling adapted to the specific situation. For example, enterprises with poor credit generally account for a small proportion of the data; in that case the samples generally need to be oversampled or otherwise processed so that the data set is more reasonably distributed and the model performs better.
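One way to divide the data so that the small poor-credit class is represented in both sets is a stratified split. The sketch below is an assumption about how step S4 could be carried out, not a procedure stated in the document; the function name and the 20% validation fraction are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, validation_fraction=0.2, seed=0):
    """Split samples so that each class keeps roughly the same proportion in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        if len(indices) > 1:
            # keep at least one validation sample for every class with 2+ samples
            n_valid = max(1, round(len(indices) * validation_fraction))
        else:
            n_valid = 0
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])

    def pick(idxs):
        idxs = sorted(idxs)
        return [samples[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(valid_idx)


# 90 good-credit (1) and 10 poor-credit (0) toy samples.
toy_labels = [1] * 90 + [0] * 10
toy_samples = list(range(100))
(train_x, train_y), (valid_x, valid_y) = stratified_split(toy_samples, toy_labels)
```

Any oversampling would then be applied to the training portion only, so that the validation set keeps the original class distribution.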
In step S5, the parameters are tuned to bring the accuracy of the model to a high level; an accuracy above 96% is generally considered to meet the requirement.
In step S6, the resulting model is evaluated with a model evaluation method, and each feature in the XGBoost model is analyzed, for example its importance. The model is generally evaluated using the ROC curve.
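ROC-based evaluation is often summarized by the area under the ROC curve (AUC). The sketch below computes the AUC with the pairwise rank formulation; the function name `roc_auc` and the toy scores are assumptions for illustration, and the document itself does not state which ROC statistic is used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen positive sample receives a
    higher score than a randomly chosen negative sample (ties count as 0.5).
    labels: 1 = positive class, 0 = negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small validation set; a sort-based computation would be preferable for large ones.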
The data processing in step S2 mainly ensures the reliability of the data used in model training. Abnormal values and missing values are generally handled by deletion. Missing values can also be filled or replaced according to the actual situation by exploiting the similarity between similar samples; if a value depends strongly on other data, as a ratio does, it can be filled through that relationship. This preserves the authenticity of the data the model learns from and eliminates errors introduced during data collection. Filling can use data imputation methods common in machine learning, such as random forest imputation and KNN imputation. Abnormal values are usually deleted or replaced with the mean; machine learning methods such as clustering-based methods or the isolation forest algorithm may also be used.
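The idea of filling a missing value from similar samples can be illustrated with a small KNN-style imputation. This is a simplified sketch, not the document's procedure: missing entries are represented by `None`, the distance is a root-mean-square difference over the columns both rows have observed, and `k=2` is an arbitrary choice.

```python
import math


def knn_fill(rows, k=2):
    """Fill None entries with the mean of that column over the k most similar rows."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a donor row must have the column observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda item: item[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy data: the last row is missing its second value.
data = [
    [1.00, 10.0],
    [1.10, 12.0],
    [5.00, 50.0],
    [1.05, None],
]
completed = knn_fill(data, k=2)
```

Here the incomplete row is closest to the first two rows, so its missing value is filled with their average rather than being pulled toward the distant third row.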
Because far fewer enterprises in real data have poor credit than good credit, the data set contains very few negative examples (enterprises with poor credit), which easily degrades the final performance of the model. Methods such as the SMOTE oversampling method therefore need to be applied during sampling to counter this effect, making the imbalanced data set more reasonable and easier to learn from.
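The core of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that idea, not a full implementation; the function signature, `k=3` and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples from the minority-class samples."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed")
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical poor-credit samples with two numeric features each.
poor_credit = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(poor_credit, n_new=6)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. The oversampling is applied to the training set only.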
XGBoost is a particular implementation of gradient boosting decision trees and is an ensemble learning method based on tree structures.
In model training, XGBoost parameters are generally divided into three types. The first type is the general parameters, which control basic functions; for example, nthread controls multithreading. The second type is the booster parameters, which mainly control the booster used at each step of training, such as the tree booster and the linear booster. The third type is the learning task parameters, which specify the learning objective and the evaluation metric. The objective function of XGBoost is as follows:
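$$Obj = \sum_{j=1}^{T}\left[G_j\,\omega_j + \frac{1}{2}\left(H_j + \lambda\right)\omega_j^{2}\right] + \gamma T$$
In the above formula, $G_j$ is the sum of the first derivatives of the loss over the samples in leaf node $j$, $\omega_j$ is the weight of leaf node $j$, $H_j$ is the corresponding sum of second derivatives, $\lambda$ is the L2 regularization coefficient on the leaf weights, $\gamma$ penalizes the number of leaf nodes to prevent over-fitting, and $T$ is the number of leaf nodes.
Substituting the optimal leaf weight $\omega_j^{*} = -G_j/(H_j+\lambda)$, the objective simplifies to the structure score:
$$Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda} + \gamma T$$
The structure score is used as the basis for tree splitting in the algorithm: the smaller the result, the better the effect after the feature split.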
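For a concrete feel of these quantities, the sketch below evaluates the standard XGBoost expressions for the optimal leaf weight, $-G_j/(H_j+\lambda)$, and for the structure score, $-\frac{1}{2}\sum_j G_j^2/(H_j+\lambda) + \gamma T$. The logistic loss and the numeric values are assumptions chosen for illustration; the document does not state which loss function is used.

```python
import math


def logistic_grad_hess(label, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - label, p * (1.0 - p)


def leaf_weight(G, H, lam=1.0):
    """Optimal weight of one leaf: -G / (H + lambda)."""
    return -G / (H + lam)


def structure_score(leaves, lam=1.0, gamma=0.0):
    """-1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T.

    `leaves` holds one (G_j, H_j) pair per leaf, so T = len(leaves).
    """
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)


# Two hypothetical leaves with summed derivatives (G_j, H_j).
leaves = [(2.0, 3.0), (-1.0, 1.0)]
print(leaf_weight(2.0, 3.0))                         # -0.5
print(structure_score(leaves, lam=1.0, gamma=0.5))   # 0.25
```

A lower structure score indicates a better tree structure, which is why the score can serve as the criterion when candidate splits are compared.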
For the loss function, XGBoost uses the second-order Taylor expansion as an approximation of the objective function. Regression trees are added one after another: features are split according to the structure score to grow a new tree, and each new tree fits the residual of the previous prediction. At prediction time, the features of a sample route it to a corresponding leaf node in each tree, and the sum of the scores from all trees is the final predicted value for that sample.
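The second-order approximation can be written out as follows. This is the standard form from the XGBoost literature rather than a formula given in this document; $l$, $f_t$, $g_i$, $h_i$, $I_j$, $\Omega$, $n$ and $K$ are symbols introduced here for illustration ($I_j$ is the set of samples falling in leaf $j$, and $K$ is the number of trees).

```latex
\begin{aligned}
Obj^{(t)} &= \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n}\Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \Bigr] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^{2} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}}, \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^{2}, \qquad
G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \\
\hat{y}_i &= \sum_{k=1}^{K} f_k(x_i).
\end{aligned}
```

Dropping the constant term $l(y_i, \hat{y}_i^{(t-1)})$ and grouping the samples by leaf gives a per-leaf sum of $G_j\omega_j + \frac{1}{2}(H_j+\lambda)\omega_j^2$ plus $\gamma T$, which is the form in which the objective is usually stated.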
When leaf nodes are split, an exact greedy algorithm that enumerates all candidate splits is generally used; when the data is too large for direct computation, an approximate algorithm can be used instead. Different algorithms can be applied in different scenarios to meet their requirements.
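An exact greedy search over one feature can be sketched as below. The gain expression is the standard XGBoost split gain; the function name, the toy gradients and the convention of placing the threshold halfway between adjacent values are assumptions made for illustration.

```python
def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan every threshold of one feature and return (best_gain, threshold).

    gain = 1/2 * [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma
    Returns (0.0, None) when no split has positive gain.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    parent = G * G / (H + lam)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += grads[i]   # accumulate statistics of the left child
        HL += hess[i]
        if values[i] == values[nxt]:
            continue     # no threshold fits between equal feature values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2.0
    return best_gain, best_threshold


# Hypothetical feature values with per-sample gradients and Hessians.
gain, threshold = best_split(
    values=[1.0, 2.0, 3.0, 4.0],
    grads=[-1.0, -1.0, 1.0, 1.0],
    hess=[1.0, 1.0, 1.0, 1.0],
)
```

In this toy case the best threshold is 2.5, which separates the samples with negative gradients from those with positive gradients. An approximate algorithm would evaluate only a reduced set of candidate thresholds instead of every adjacent pair.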
XGBoost training specifically consists of building the model and tuning its parameters with the XGBoost algorithm on the data obtained in the previous step. The XGBoost library for Python may be used. Once the accuracy meets the requirement, the model is evaluated with the evaluation metric, and it is further interpreted and improved by analyzing the importance of each feature and comparing the results with the actual situation and existing conclusions. After the model meets the expected requirements, the relevant data of the enterprise to be evaluated is input into the model to obtain a credit score that is more objective and efficient and has reference value.
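As a rough starting point for parameter tuning, the three parameter types of the XGBoost library (general, booster and learning task parameters) could be collected in a dictionary such as the one below. The parameter names are those of the XGBoost library, but every value is a hypothetical starting point and none of them comes from this document.

```python
# Hypothetical starting values; none of these numbers come from the patent.
params = {
    # 1) General parameters: basic behaviour of the library
    "booster": "gbtree",    # tree booster; "gblinear" selects the linear booster
    "nthread": 4,           # number of threads used for training
    # 2) Booster parameters: how each boosting step builds a tree
    "eta": 0.1,             # learning rate applied to each new tree
    "max_depth": 4,         # maximum depth of a tree
    "gamma": 0.0,           # minimum loss reduction required to make a split
    "lambda": 1.0,          # L2 regularization on leaf weights
    "subsample": 0.8,       # fraction of training rows sampled per tree
    # 3) Learning task parameters: objective and evaluation metric
    "objective": "binary:logistic",  # good credit vs. poor credit
    "eval_metric": "auc",            # area under the ROC curve
}

general = {key: params[key] for key in ("booster", "nthread")}
task = {key: params[key] for key in ("objective", "eval_metric")}
```

With the Python package, a dictionary like this is typically passed to `xgboost.train` together with an `xgboost.DMatrix` built from the training set, and the values are then adjusted until the validation accuracy meets the requirement.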
The method enables accurate and efficient evaluation of enterprise credit, has good robustness and stability, can be further interpreted and optimized in combination with existing evaluation systems and conclusions, and meets requirements for practicality and performance. By applying artificial intelligence methods, it promotes the automation and intelligence of enterprise credit assessment and provides an effective reference for bank credit, enterprise risk control, policy making and other areas.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. An XGBoost-based financial credit enterprise credit prediction method, characterized by comprising the following steps:
preliminary screening, in which data related to credit assessment is screened out;
data processing, in which abnormal values and missing values are handled and the data is classified;
feature engineering, in which the features are processed;
data set division, in which the data is divided into a training set and a validation set;
model training, in which a model is trained on the training set with the XGBoost algorithm;
and model evaluation and optimization, in which the model is evaluated on the validation set, each feature in the XGBoost model is analyzed, and the model is optimized accordingly.
2. The XGBoost-based financial credit enterprise credit prediction method of claim 1, wherein: when the data is classified, the selected samples are classified according to the past transaction behavior, credit records and tax payment records of the enterprises.
3. The XGBoost-based financial credit enterprise credit prediction method of claim 1 or 2, wherein: processing the features comprises removing redundant features and then further reducing the features with variance selection, principal component analysis and other dimensionality reduction methods.
4. The XGBoost-based financial credit enterprise credit prediction method of claim 3, wherein: the redundant features are features that can be simply calculated from existing features.
5. The XGBoost-based financial credit enterprise credit prediction method of any of claims 1, 2 or 4, wherein: after the data is divided into a training set and a validation set, oversampling is performed when the number of samples of a given class is small.
6. The XGBoost-based financial credit enterprise credit prediction method of claim 5, wherein: when data related to credit assessment is screened out, the data is preliminarily screened according to previously used credit scorecard models and existing conclusions.
7. The XGBoost-based financial credit enterprise credit prediction method of claim 6, wherein: missing values are handled by deletion, filling and replacement, and abnormal values are handled with a clustering-based method or the isolation forest algorithm.
8. The XGBoost-based financial credit enterprise credit prediction method of claim 6 or 7, wherein: the oversampling is performed with the SMOTE method.
9. The XGBoost-based financial credit enterprise credit prediction method of claim 8, wherein: the existing conclusions are manually annotated data labels.
10. The XGBoost-based financial credit enterprise credit prediction method of any of claims 1, 2, 4, 6, 7 or 9, wherein: the model is evaluated using the ROC curve.
CN202111587189.5A 2021-12-23 2021-12-23 XGboost-based financial credit enterprise credit prediction method Pending CN114492929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111587189.5A CN114492929A (en) 2021-12-23 2021-12-23 XGboost-based financial credit enterprise credit prediction method


Publications (1)

Publication Number Publication Date
CN114492929A true CN114492929A (en) 2022-05-13

Family

ID=81493942


Country Status (1)

Country Link
CN (1) CN114492929A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
CN111311402A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 XGboost-based internet financial wind control model
CN111951097A (en) * 2020-08-12 2020-11-17 深圳微众信用科技股份有限公司 Enterprise credit risk assessment method, device, equipment and storage medium
CN112053234A (en) * 2020-09-04 2020-12-08 天元大数据信用管理有限公司 Enterprise credit rating method based on macroscopic region economic index and microscopic factor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
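Sun Jiaqi (孙嘉琪): "Design of an enterprise credit rating prediction scheme based on an improved XGBoost algorithm", China Master's Theses Full-text Database (electronic journal), no. 7, pages 152 - 448 *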

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114429398A (en) * 2022-04-06 2022-05-03 北京市农林科学院信息技术研究中心 Data-driven novel agricultural operation main body credit grade generation method and device
CN114429398B (en) * 2022-04-06 2023-12-22 北京市农林科学院信息技术研究中心 Data-driven novel agricultural operation subject credibility level generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination