CN114492929A - XGboost-based financial credit enterprise credit prediction method

Info

Publication number: CN114492929A
Application number: CN202111587189.5A
Authority: CN
Inventors: 谢振平; 翟彬; 陈丽芳; 刘渊; 崔乐乐; 宋设; 杨宝华
Original assignee: Jiangnan University; Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Current assignee: Jiangnan University; Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date: 2021-12-23
Filing date: 2021-12-23
Publication date: 2022-05-13

Abstract

The invention discloses an XGboost-based financial credit enterprise credit prediction method, which comprises the following steps: preliminary screening, in which data related to credit assessment are screened out; data processing, in which outliers and missing values are handled and the samples are classified; feature engineering, in which the features are processed; data set division, in which the data are divided into a training set and a validation set; model training, in which a model is trained on the training set data with the XGBoost algorithm; and model evaluation and optimization, in which the model is evaluated on the validation set data, each feature in the XGBoost model is analyzed, and the model is optimized accordingly. The method can realize accurate and efficient evaluation of enterprise credit, has good robustness and stability, can be further explained and optimized by combining the existing evaluation system and conclusions, and can meet the requirements of practicability and performance.

Description

XGboost-based financial credit enterprise credit prediction method

Technical Field

The invention relates to the technical field of enterprise credit prediction, in particular to a financial credit enterprise credit prediction method based on XGboost.

Background

Today, the rapid development of artificial intelligence technology brings new opportunities for enterprise credit assessment, and credit prediction using machine learning methods is more accurate than prediction by traditional statistical means. Existing statistical means have certain limitations: first, they depend on experience, since many statistical rules and calculation methods rely on expert experience and manual processing, so both accuracy and efficiency are deficient; second, traditional means rely on manual labor, so the cost is high, the adaptability to different scenarios is poor, and the efficiency is low.

Disclosure of Invention

This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.

The present invention has been made in view of the above problems and/or other problems with existing financial credit enterprise credit prediction methods.

Therefore, the problem to be solved by the invention is how to provide a financial credit enterprise credit prediction method based on XGboost.

In order to solve the above technical problem, the invention provides the following technical scheme: an XGboost-based financial credit enterprise credit prediction method, comprising: preliminary screening, in which data related to credit assessment are screened out; data processing, in which outliers and missing values are handled and the samples are classified; feature engineering, in which the features are processed; data set division, in which the data are divided into a training set and a validation set; model training, in which a model is trained on the training set data with the XGBoost algorithm; and model evaluation and optimization, in which the model is evaluated on the validation set data, each feature in the XGBoost model is analyzed, and the model is optimized accordingly.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: when the data is classified, the selected samples are classified according to the past transaction behaviors, credit records and tax payment records of the enterprises.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: the processing of the features comprises removing redundant features, and then further reducing the features by using a variance selection method, principal component analysis, or other dimensionality reduction methods.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: the redundant features are features that can be simply calculated from existing features.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: after the data are divided into a training set and a verification set, when the number of samples of a certain type is small, oversampling processing is carried out.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: when screening out data related to credit assessment, the data is preliminarily screened according to the credit card scoring model used in the past and the existing conclusion.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: when missing values are processed, deletion, filling, and replacement operations are carried out, and when outliers are processed, a clustering-based method or the isolation forest algorithm is used.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: the oversampling is performed using the SMOTE method.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: the existing conclusion is a manually marked data tag.

As a preferable aspect of the XGboost-based financial credit enterprise credit prediction method of the present invention, wherein: the model is evaluated using ROC.

The invention has the beneficial effects that: the method can realize accurate and efficient evaluation of enterprise credit, has good robustness and stability, can be further explained and optimized by combining the existing evaluation system and conclusion, and can meet the requirements of practicability and performance.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. Wherein:

Fig. 1 is an analysis framework diagram of the XGboost-based financial credit enterprise credit prediction method in embodiment 1.

Fig. 2 is a map hierarchy diagram of the XGboost-based financial credit enterprise credit prediction method in embodiment 1.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Example 1

Referring to fig. 1 and 2, a first embodiment of the present invention provides an XGboost-based financial credit enterprise credit prediction method, which includes the steps of:

S1: preliminary screening, in which data related to credit assessment are screened out;

S2: data processing, in which outliers and missing values are handled and the samples are classified;

S3: feature engineering, in which the features are processed;

S4: data set division, in which the data are divided into a training set and a validation set;

S5: model training, in which a model is trained on the training set data with the XGBoost algorithm;

S6: model evaluation and optimization, in which the model is evaluated on the validation set data, each feature in the XGBoost model is analyzed, and the model is optimized accordingly.

In step S1, data that may be relevant to credit assessment are preliminarily selected by removing, from the collected enterprise data, data that have essentially no bearing on credit. The selection can draw on existing conclusions, namely the credit card scoring models and the empirical conclusions used in the past. For example, features of high importance in such a model can be taken from the analysis of enterprise financial condition as reference standards. The screening criteria should take into account industry conditions such as national policy and industry characteristics, as well as the analyses commonly used in traditional credit systems, such as enterprise basic quality, financial condition, company system, and development potential. Credit assessment can generally consider enterprise basic quality factors (financial status, system construction, duration of continuous operation, management efficiency, employee quality, and asset quality), external environmental factors (government policies, industry awareness, industry status, and the situation of upstream and downstream vendors), and development potential (profit growth rate and growth in scientific research investment). The data labels are formed by manual annotation with reference to these factors. Operations such as data desensitization can also be completed in this step.

In step S2, outliers and missing values are processed and enterprises with complete and reliable data are selected. From these, data with high reliability and timeliness are chosen as model samples according to the past transaction behavior of the enterprises. The samples are then classified as having good credit or poor credit according to enterprise data such as credit records and tax payment records. The rating assigned by the local tax authority may serve as a reference: for example, a D rating may be regarded as poor credit and all other ratings as good credit. Enterprises with records of dishonesty or of judicial proceedings may also be regarded as having poor credit.
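The labeling rule described above can be sketched as a small function. This is only an illustration: the field names `tax_rating`, `has_dishonesty_record`, and `has_judicial_record`, and the encoding of good credit as 1 and poor credit as 0, are assumptions that do not come from the source.

```python
from typing import Mapping

POOR, GOOD = 0, 1  # hypothetical label encoding (poor credit is the negative class)


def credit_label(record: Mapping[str, object]) -> int:
    """Return POOR for a D tax rating or any dishonesty/judicial record, else GOOD."""
    tax_rating = str(record.get("tax_rating", "")).strip().upper()
    if tax_rating == "D":
        return POOR
    if record.get("has_dishonesty_record") or record.get("has_judicial_record"):
        return POOR
    return GOOD


# Hypothetical enterprise records, for illustration only.
enterprises = [
    {"name": "A", "tax_rating": "A", "has_dishonesty_record": False},
    {"name": "B", "tax_rating": "d"},
    {"name": "C", "tax_rating": "B", "has_judicial_record": True},
]
labels = [credit_label(e) for e in enterprises]
```

Enterprise A is labeled good; B and C are labeled poor, B because of its D rating and C because of its judicial record.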

In step S3, redundant features are removed first; the features are then further reduced with a common feature selection method such as variance selection, followed by a dimensionality reduction method such as principal component analysis. Redundant features are features that can be easily calculated from existing features.
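A minimal sketch of the first two reductions, assuming the features are held as a list of dictionaries. The column names and the zero variance threshold are hypothetical, and principal component analysis is not shown. Because variance depends on the scale of a feature, a real pipeline would probably normalize the features before applying a nonzero threshold.

```python
from statistics import pvariance


def drop_redundant(rows, derived):
    """Remove columns named in `derived` (features computable from other columns)."""
    return [{k: v for k, v in row.items() if k not in derived} for row in rows]


def variance_select(rows, threshold=0.0):
    """Keep only columns whose population variance exceeds `threshold`."""
    keep = [c for c in rows[0] if pvariance([r[c] for r in rows]) > threshold]
    return [{c: r[c] for c in keep} for r in rows], keep


# Hypothetical data: profit = revenue - cost, and is_registered never varies.
rows = [
    {"revenue": 120.0, "cost": 80.0, "profit": 40.0, "is_registered": 1.0},
    {"revenue": 200.0, "cost": 150.0, "profit": 50.0, "is_registered": 1.0},
    {"revenue": 90.0, "cost": 70.0, "profit": 20.0, "is_registered": 1.0},
]
rows = drop_redundant(rows, derived={"profit"})
reduced, kept = variance_select(rows)
```

Here `profit` is dropped as redundant and `is_registered` is dropped for having zero variance, leaving `revenue` and `cost`.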

In step S4, the data are divided into a training set and a validation set, and the sampling is adapted to the specific situation. Enterprises with poor credit, for example, generally account for a small proportion of the samples; in that case the samples usually need to be oversampled or otherwise processed so that the data set is more reasonably distributed and the model performs better.
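One way to keep the scarce poor-credit class represented in both sets is a stratified split. The sketch below is an assumption about how the division could be done, since the source does not prescribe a splitting scheme; the 20% validation fraction and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # At least one validation sample per class, otherwise a rounded share.
        n_val = max(1, round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced labels: 90 good-credit (1) and 10 poor-credit (0) enterprises.
labels = [1] * 90 + [0] * 10
train_idx, val_idx = stratified_split(labels)
```

Any oversampling would then be applied to the training indices only, so that the validation set keeps the real class distribution.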

In step S5, the parameters are tuned to raise the accuracy of the model; an accuracy above 96% is generally considered to meet the requirement.

In step S6, the resulting model is evaluated with a model evaluation method, and each feature in the XGBoost model is analyzed, for example its importance. The model is generally evaluated using ROC.
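As an illustration of ROC-based evaluation, the sketch below computes the points of the ROC curve and the area under it (AUC) from validation labels and predicted scores. It assumes good credit is encoded as the positive class 1; the sample labels and scores are invented for the example.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(y_true, scores):
    """(false positive rate, true positive rate) at each distinct score threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical validation labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)
points = roc_points(y_true, scores)
```

In this example eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9.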

The data processing in step S2 mainly serves to ensure that the data used in model training are reliable. Outliers and missing values are generally handled by deletion. Missing values can also be filled or replaced according to the actual situation, using the similarity between similar samples; if a value depends strongly on other data, as a ratio does, it can be filled through that relationship. This preserves the authenticity of the data the model learns from and eliminates errors introduced during data collection. Filling can use data imputation methods common in machine learning, such as random forest imputation and KNN imputation. Outliers are usually deleted or replaced with the mean value, and a machine learning method such as a clustering-based method or the isolation forest algorithm may also be used to handle them.
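The idea of filling a missing value from similar samples can be sketched with a simple nearest-neighbour rule. This is an illustrative simplification rather than the exact procedure of the source: `k`, the distance measure, and the sample data are assumptions, and outlier handling is not shown. Distances are computed on raw values, so features on very different scales would need normalizing first.

```python
import math


def knn_fill(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows that have it."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                # Compare only on features observed in both rows.
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from; leave the gap as it is
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical samples with two features; the last row is missing the second one.
rows = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [1.05, None],
]
filled = knn_fill(rows)
```

The last row is closest to the first two rows, so its missing value is filled with the mean of 10.0 and 12.0.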

Because enterprises with poor credit are far fewer than enterprises with good credit in real data, the resulting data set contains very few negative examples, that is, enterprises with poor credit, which easily degrades the final performance of the model. Methods such as the SMOTE oversampling method therefore need to be applied during sampling to offset this effect, so that the imbalanced data set becomes more reasonable and easier to learn from.
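The core of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that idea and is not a full implementation; the neighbour count `k`, the seed, and the sample data are assumptions.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples and neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding the point itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical feature vectors of poor-credit enterprises in the training set.
poor_credit = [[0.2, 1.0], [0.3, 1.2], [0.25, 0.9], [0.4, 1.1]]
new_points = smote(poor_credit, n_new=6)
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region already occupied by that class.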

XGBoost is a particular form of gradient boosting decision tree and is an ensemble learning method based on tree structures.

In model training, XGBoost parameters are generally divided into three types. The first type, general parameters, controls basic functions; nthread, for example, controls multithreading. The second type, booster parameters, mainly controls the booster used at each step of training, such as the tree booster or the linear booster. The objective function of XGBoost is as follows:
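The formula is not reproduced in this text. Assuming it is the standard second-order XGBoost objective written per leaf, which agrees with the symbols defined in the next paragraph, it would read as below. The symbol λ, the L2 regularization coefficient on the leaf weights, is an assumption taken from the standard formulation and is not named in the source.

$$\mathrm{Obj} = \sum_{j=1}^{T}\left[ G_j\,\omega_j + \frac{1}{2}\left(H_j + \lambda\right)\omega_j^{2} \right] + \gamma T$$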

In the above formula, G_j represents the first derivative for leaf node j, ω is the weight of the leaf node, H_j is the second derivative, γ controls the number of nodes to prevent overfitting, and T is the number of leaf nodes. The structure score serves as the basis for tree splitting in the algorithm: the smaller the score, the better the effect after the feature split.

The structure is simplified as follows:
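The simplified formula is likewise not reproduced in this text. Under the standard formulation, and assuming an L2 regularization coefficient λ on the leaf weights (a symbol not named in the source), minimizing the objective over each leaf weight gives ω_j = -G_j / (H_j + λ), and substituting this back yields the structure score:

$$\mathrm{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T$$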

For the loss function, XGBoost uses a second-order Taylor expansion as an approximation of the objective function. The method repeatedly adds regression trees: features are split according to the structure score to grow a new tree, and each new tree fits the residual of the previous prediction. Finally, according to its features, a sample falls onto a corresponding leaf node in each tree, and the sum of the scores from all trees is the predicted value for that sample.
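A small worked example may make the second-order statistics concrete. It assumes logistic loss (the source does not name the loss function) and an L2 coefficient `lam` on the leaf weights, both taken from the standard XGBoost formulation. For one leaf it computes G, H, the leaf weight -G / (H + lam), and the resulting structure score.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y, margin):
    """First and second derivatives of log loss with respect to the raw margin."""
    p = sigmoid(margin)
    return p - y, p * (1.0 - p)


def leaf_weight(G, H, lam=1.0):
    """Leaf weight that minimizes G*w + 0.5*(H + lam)*w**2."""
    return -G / (H + lam)


def structure_score(leaves, lam=1.0, gamma=0.0):
    """-0.5 * sum(G^2 / (H + lam)) + gamma * T over (G, H) pairs; lower is better."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)


# Three samples in one leaf, all starting from margin 0 (predicted probability 0.5).
ys = [1, 1, 0]
stats = [grad_hess(y, 0.0) for y in ys]
G = sum(g for g, _ in stats)  # -0.5 - 0.5 + 0.5 = -0.5
H = sum(h for _, h in stats)  # 3 * 0.25 = 0.75
w = leaf_weight(G, H)         # 0.5 / 1.75
new_margin = 0.0 + w          # the prediction is the sum of leaf scores across trees
score = structure_score([(G, H)])
```

Because two of the three samples are positive, the leaf weight is positive and pushes the margin of every sample in the leaf upward.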

When leaf nodes are split, a greedy algorithm that enumerates the candidate splits is generally used. When the data are too large for direct calculation, an approximate algorithm can be used instead; different algorithms can be applied in different scenarios to meet their requirements.
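The greedy search can be sketched for a single feature as follows: sort the samples by feature value, scan every boundary between distinct values, and keep the split with the largest gain. The gain formula and the parameters `lam` and `gamma` follow the standard XGBoost formulation and are assumptions here; the sample values are invented.

```python
def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan every threshold on one feature; return (gain, threshold) of the best split."""

    def score(g, h):
        return g * g / (h + lam)

    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    best = (0.0, None)  # a split is kept only if its gain is positive
    GL = HL = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += grads[i]
        HL += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between equal feature values
        gain = 0.5 * (score(GL, HL) + score(G - GL, H - HL) - score(G, H)) - gamma
        if gain > best[0]:
            best = (gain, (values[i] + values[nxt]) / 2)
    return best


# Hypothetical feature values with per-sample gradients and Hessians.
values = [1.0, 2.0, 3.0, 4.0]
grads = [-0.5, -0.5, 0.5, 0.5]
hess = [0.25, 0.25, 0.25, 0.25]
gain, threshold = best_split(values, grads, hess)
```

The best threshold is 2.5, which separates the two samples with negative gradients from the two with positive gradients.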

XGBoost training consists of building the model with the XGBoost algorithm on the data obtained in the previous steps and tuning its parameters. The XGBoost library for Python may be used. Once the accuracy meets the requirement, the model is evaluated with the evaluation metric, and it is further explained and improved by analyzing, among other things, the importance of each feature and comparing the results with the actual situation and existing conclusions. When the model meets expectations, the relevant data of the enterprise to be assessed are input into it to obtain a credit score that is more objective and efficient and has reference value.
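A minimal sketch of this training and evaluation loop with the Python `xgboost` package and scikit-learn, assuming a recent version of both is installed. The parameter values are hypothetical starting points, not values from the source. The grouping comments follow the three parameter types of the XGBoost documentation; the third group, learning task parameters, is not named in the text above.

```python
def build_params(n_threads=4):
    """Hypothetical starting values, grouped by XGBoost parameter type."""
    return {
        # general parameters
        "booster": "gbtree",
        "n_jobs": n_threads,       # scikit-learn wrapper name for nthread
        # booster parameters
        "n_estimators": 200,
        "max_depth": 4,
        "learning_rate": 0.1,
        "gamma": 0.0,              # minimum loss reduction required to split
        "reg_lambda": 1.0,         # L2 penalty on leaf weights
        # learning task parameters
        "objective": "binary:logistic",
        "eval_metric": "auc",
    }


def train_and_evaluate(X_train, y_train, X_val, y_val, feature_names):
    """Fit an XGBClassifier and return (model, validation AUC, importances by name)."""
    import xgboost as xgb                      # imported lazily; needs the xgboost package
    from sklearn.metrics import roc_auc_score  # needs scikit-learn

    model = xgb.XGBClassifier(**build_params())
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    val_scores = model.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, val_scores)
    importances = dict(zip(feature_names, model.feature_importances_))
    return model, auc, importances
```

The returned importances can be compared with the factors used in the preliminary screening, and the parameters adjusted until the validation metric is acceptable.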

The method can realize accurate and efficient evaluation of enterprise credit, has good robustness and stability, can be further explained and optimized by combining the existing evaluation system and conclusions, and can meet the requirements of practicability and performance. By incorporating artificial intelligence methods, it promotes the automation and intelligence of enterprise credit assessment and provides an effective reference for bank credit, enterprise risk control, policy making, and other areas.

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. An XGboost-based financial credit enterprise credit prediction method, characterized by comprising the following steps:

preliminary screening, in which data related to credit assessment are screened out;

data processing, in which outliers and missing values are handled and the samples are classified;

feature engineering, in which the features are processed;

data set division, in which the data are divided into a training set and a validation set;

model training, in which a model is trained on the training set data with the XGBoost algorithm;

and model evaluation and optimization, in which the model is evaluated on the validation set data, each feature in the XGBoost model is analyzed, and the model is optimized accordingly.

2. The XGboost-based financial credit enterprise credit prediction method of claim 1, wherein: when the data is classified, the selected samples are classified according to the past transaction behaviors, credit records and tax payment records of the enterprises.

3. The XGboost-based financial credit enterprise credit prediction method of claim 1 or 2, wherein: the processing of the features comprises removing redundant features, and then further reducing the features by using a variance selection method, principal component analysis, or other dimensionality reduction methods.

4. The XGboost-based financial credit enterprise credit prediction method of claim 3, wherein: the redundant features are features that can be simply calculated from existing features.

5. The XGboost-based financial credit enterprise credit prediction method of any of claims 1, 2 or 4, wherein: after the data are divided into a training set and a validation set, when the number of samples of a certain type is small, oversampling processing is carried out.

6. The XGboost-based financial credit enterprise credit prediction method of claim 5, wherein: when screening out data related to credit assessment, the data is preliminarily screened according to the credit card scoring model used in the past and the existing conclusion.

7. The XGboost-based financial credit enterprise credit prediction method of claim 6, wherein: when missing values are processed, deletion, filling, and replacement operations are carried out, and when outliers are processed, a clustering-based method or the isolation forest algorithm is used.

8. The XGboost-based financial credit enterprise credit prediction method of claim 6 or 7, wherein: the oversampling is performed using the SMOTE method.

9. The XGboost-based financial credit enterprise credit prediction method of claim 8, wherein: the existing conclusion is a manually marked data tag.

10. The XGboost-based financial credit enterprise credit prediction method of any of claims 1, 2, 4, 6, 7 or 9, wherein: the model is evaluated using ROC.