CN112085593A

CN112085593A - Small and medium-sized enterprise credit data mining method

Info

Publication number: CN112085593A
Application number: CN202010958951.5A
Authority: CN
Inventors: 崔光裕; 边松华; 崔乐乐
Original assignee: Tianyuan Big Data Credit Management Co Ltd
Current assignee: Tianyuan Big Data Credit Management Co Ltd
Priority date: 2020-09-14
Filing date: 2020-09-14
Publication date: 2020-12-15
Anticipated expiration: 2040-09-14
Also published as: CN112085593B

Abstract

The invention discloses a credit data mining method for small and medium-sized enterprises, relating to the technical field of big data and credit evaluation. The method mines the credit data of small and medium-sized enterprises on the basis of automated feature engineering: the original credit feature data in a training sample feature data set of small and medium-sized enterprises are preprocessed, feature subsets are formed through feature distance calculation, and linear and nonlinear feature combination are performed on each feature subset. The original credit features that have undergone only data preprocessing, the linear combination features and the nonlinear combination features together serve as the training feature set; a baseline model is trained on this set, features are ranked by importance according to the training results, and features with predictive value are selected. The method can improve the efficiency of credit feature mining for small and medium-sized enterprises, reduce manual intervention, improve the validity of the mining results, and thereby improve the accuracy of credit evaluation.

Description

Small and medium-sized enterprise credit data mining method

Technical Field

The invention relates to the technical field of big data and credit evaluation, in particular to a credit data mining method for small and medium-sized enterprises.

Background

In credit evaluation of small and medium-sized enterprises, credit features are an important factor in how well the evaluation performs. However, because the credit risks of small and medium-sized enterprises are complex and diverse, the degree to which a given credit feature correlates with credit risk varies widely across different types of enterprises. As a result, selecting and constructing credit features is currently difficult, manual screening involves an excessive workload, and it places strict demands on the experience of the screening personnel. How to build a credit data mining system for small and medium-sized enterprises that is based on automated feature engineering and predicts well is therefore a problem that urgently needs to be solved.

Disclosure of Invention

The technical task of the invention is to provide a credit data mining method for small and medium-sized enterprises that yields a result feature set with high predictive power when credit data mining and feature engineering are performed for such enterprises, and that provides a basis for their subsequent comprehensive credit evaluation.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A credit data mining method for small and medium-sized enterprises, which mines the credit data of small and medium-sized enterprises on the basis of automated feature engineering: the original credit feature data in a training sample feature data set of small and medium-sized enterprises are preprocessed, feature subsets are then formed through distance calculation, and linear feature combination and nonlinear feature combination are performed on the feature subsets;

the original credit features that have undergone only data preprocessing, the linear combination features produced by linear feature combination and the nonlinear combination features produced by nonlinear feature combination are used together as the training feature set; a baseline model is formed through training, features are ranked by importance according to the training results, and features with predictive value are selected.

The credit data of small and medium-sized enterprises are numerous and complicated, unstable in quality and rich in dimensions, and contain many features with weak predictive power. To address these problems, the method starts from a data mining model based on automated feature engineering and combines several kinds of numerical analysis with machine learning algorithms. This improves both the efficiency of credit feature mining and the validity of the mining results, and hence the accuracy of credit evaluation. The method also overcomes the one-sidedness and subjectivity of selecting credit features by experience and reduces the operational risk caused by limited experience.

Preferably, the preprocessing comprises feature filtering, missing value filling, discretization and normalization of the original credit feature data in the training sample feature data set of small and medium-sized enterprises;

the feature processing calculates the similarity of multi-dimensional feature vectors and merges similar features into feature subsets by setting a similarity measurement threshold;

in the linear feature combination, a logistic regression model is trained for each feature subset by backward stepwise regression, forming the linear feature combination of that feature subset;

in the nonlinear feature combination, a decision tree classifier that uses information gain as its splitting criterion is trained for each feature subset, a series of IF-THEN rules is then obtained from the paths from the root node to each leaf node of the decision tree classifier, and these rules are taken as the result of the nonlinear feature combination.

Further, the feature processing sets a normalized threshold for the similarity metric of the multi-dimensional feature vectors, merges two features whose metric value is smaller than the threshold into a feature subset, and iteratively updates the similarity metric between features until the feature relationships are stable;

in the nonlinear feature combination, the series of IF-THEN rules is taken as a simple rule set, the simple rule set is simplified, and the simplified rules are taken as the result of the nonlinear feature combination.

The rule conditions are simplified as follows: the antecedent of a single rule may contain irrelevant conditions, i.e. conditions that have no effect on the conclusion. Such redundant conditions, which do not affect the correctness of the rule set, can be removed to prune the rules.

Preferably, an XGBoost classifier is trained jointly on the features of the training feature set to form the baseline model; the training process is as follows:

an XGBoost classifier is used as the base model;

the base model parameters are tuned by automatic Bayesian optimization using the HyperOpt library in Python, the AUC value of the model is taken as the criterion for testing the effect of the baseline model, and the best-performing set of hyper-parameters is selected as the final model parameters to form the baseline model;

the baseline model is fitted to the training set sample data, and the number of times each feature occurs in the decision tree model generated by each iteration is recorded.

Further, the numbers of times each feature occurs in each iteration of the baseline model are summed to give the importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; the feature importance coefficients are sorted in descending order, a threshold is set for the coefficient, and only the features whose importance coefficients exceed the threshold are retained as the result feature set.

Preferably, credit feature data and classification label data of training samples of small and medium-sized enterprises are obtained, wherein an enterprise is classified as a bad sample in the classification label data if:

the enterprise is a person subject to enforcement in commercial lending litigation;

the enterprise, or the actual controller of the enterprise, has been listed as a dishonest person subject to enforcement;

the enterprise has seriously defaulted on a loan;

the enterprise has seriously defaulted on commercial contracts or bills;

the enterprise has a record of penalties related to its credit standing, for example for counterfeiting;

the credit feature data of the training samples are divided into four classes: enterprise basic information, enterprise performance status, enterprise guarantee information and enterprise financial status, wherein

the enterprise basic information class comprises: the actual shareholding ratio of the largest shareholder, the annual income of the enterprise, the ownership of the enterprise's business premises, the number of years the enterprise has been operating, and the accumulated credit amount of the enterprise;

the enterprise performance status class comprises: the historical loan performance rate, the historical performance amount, the historical maximum number of days overdue, the business transaction performance rate, the contract performance rate, the number of credit-related complaints and the amount of credit-related fines;

the enterprise guarantee information class comprises: the effective guarantee value of the enterprise and the guarantee mode of the enterprise owner;

the enterprise financial status class comprises: return on net assets, return on total assets, asset-liability ratio, quick ratio, cash flow to liability ratio, revenue growth rate and net profit growth rate.

Preferably, preprocessing the original credit feature data in the training sample feature data set of small and medium-sized enterprises specifically comprises:

filtering out features whose missing rate is too large;

filling missing values of continuous and discrete features;

checking the same-value rate of discrete features and filtering out features whose same-value rate is too large;

checking the fluctuation of continuous features and filtering out features whose variance is too small;

discretizing continuous features, i.e. converting continuous features that take few distinct values into discrete features;

filtering out abnormal feature values;

normalizing the features.

Preferably, the feature normalization adopts the max-min normalization method.

The formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are, respectively, the minimum and maximum values of the feature observed in the training samples of small and medium-sized enterprises.
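The preprocessing steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the use of `None` to mark missing values, the median fill strategy and the thresholds `max_missing` and `min_variance` are all hypothetical choices that the source does not specify. Only missing-rate filtering, missing value filling, variance filtering and max-min normalization are shown.

```python
from statistics import median, pvariance


def min_max_normalize(values):
    """Scale values to [0, 1] using x' = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(columns, max_missing=0.5, min_variance=1e-6):
    """columns maps feature name -> list of numbers, with None for missing.

    Both thresholds are illustrative assumptions, not values from the source.
    """
    result = {}
    for name, values in columns.items():
        missing_rate = sum(v is None for v in values) / len(values)
        if missing_rate > max_missing:
            continue  # filter out features whose missing rate is too large
        observed = [v for v in values if v is not None]
        fill = median(observed)  # assumed fill strategy for continuous features
        filled = [fill if v is None else v for v in values]
        if pvariance(filled) < min_variance:
            continue  # filter out features whose variance is too small
        result[name] = min_max_normalize(filled)
    return result


raw = {
    "annual_income": [100.0, 300.0, None, 500.0],
    "mostly_missing": [None, None, None, 1.0],
    "constant": [2.0, 2.0, 2.0, 2.0],
}
features = preprocess(raw)
```

In this toy input only `annual_income` survives: `mostly_missing` exceeds the assumed missing-rate limit and `constant` has zero variance.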

The invention also claims a medium and small-sized enterprise credit data mining device, which comprises: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor is used for calling the machine readable program and executing the method.

The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above-described method.

Compared with the prior art, the credit data mining method for small and medium-sized enterprises has the following beneficial effects:

starting from a data mining model, the method provides credit data mining for small and medium-sized enterprises based on automated feature engineering and can greatly improve the efficiency of credit feature mining;

it overcomes the one-sidedness and subjectivity of selecting credit features by experience and reduces the operational risk caused by limited experience;

the resulting data mining and feature engineering results are highly interpretable and reusable, so the method innovates while remaining comprehensible and usable;

by combining several kinds of numerical analysis with machine learning algorithms, it improves the validity of credit feature mining results and the accuracy of credit evaluation, improves the efficiency of inclusive financial services, and reduces the risk of default losses;

it can be used in many scenarios, such as pre-loan status evaluation, post-loan change tracking and financial anti-fraud, and effectively supports business and credit decisions.

Drawings

FIG. 1 is a flow chart of a method for mining credit data of small and medium-sized enterprises according to an embodiment of the present invention;

FIG. 2 is a block diagram of a process flow for linear combination of features provided by an embodiment of the present invention;

FIG. 3 is a block diagram of a process for nonlinear combination of features provided by an embodiment of the present invention;

FIG. 4 is a block diagram of the feature screening and evaluation process provided by an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific examples.

Credit evaluation of small and medium-sized enterprises is a quantitative process of defining, collecting, evaluating and analyzing data on their credit risk, and credit features are the quantitative expression of their credit traits. Feature combination and feature selection are the two main parts of feature engineering. Data and features are the key to machine learning: they determine the upper limit of the performance a machine learning model can reach, so features hold an important position in machine learning.

In general, the more features there are, the more completely the attributes of the original data can be reflected, but more features are not always better. Feature combination refers to a family of calculation methods that combine attributes of the original data to generate more expressive features. The main approaches are linear and nonlinear combination of features; linear models include logistic regression and linear regression, while nonlinear models include decision trees and neural networks. Feature selection simplifies the feature set, improves model accuracy and reduces the time the model needs to run; the fewer the features, the simpler the model and the easier it is for researchers to understand the data generation process.

The patent document with application No. CN 202010055739.8 and publication No. CN111275447A discloses an online network payment fraud detection system based on automated feature engineering. Real-time transaction records between users and merchants, made over the network from their respective PCs or mobile terminals, are received and aggregated by a bank data center. The bank data center screens out the required feature fields through secondary processing and provides these original features to an automated feature engineering module. The automated feature engineering module constructs features from the original online payment features to obtain a set of construction processes for all new features, and provides this set to a fraud detection module for anomaly identification. The fraud detection module constructs the new features according to the construction process set, inputs all features and labels into a machine learning model for judgment, lets normal transactions through and requires secondary identity authentication from users for abnormal transactions. If the secondary authentication succeeds, the user is allowed to make the transaction again; otherwise the user account is locked and the user is refused any transaction. That method uses longitudinal, transverse and time-window conversion functions to transform features. It aims to transform a single feature and enhance its information expression capability, and does not provide a way to combine and screen multiple features.

The embodiment of the invention provides a credit data mining method for small and medium-sized enterprises, which mines their credit data on the basis of automated feature engineering. The original credit feature data in the training sample feature data set are preprocessed, the preprocessing comprising feature filtering, missing value filling, discretization and normalization;

feature processing is then performed: the similarity of the multi-dimensional feature vectors is calculated, and similar features are merged into feature subsets by setting a similarity measurement threshold;

linear feature combination and nonlinear feature combination are performed on the feature subsets. In the linear feature combination, a logistic regression model is trained for each feature subset by backward stepwise regression, forming the linear feature combination of that subset. In the nonlinear feature combination, a decision tree classifier that uses information gain as its splitting criterion is trained for each feature subset, a series of IF-THEN rules is then obtained from the paths from the root node to each leaf node of the decision tree classifier, and these rules are taken as the result of the nonlinear feature combination;

the original credit features that have undergone only data preprocessing, the linear combination features and the nonlinear combination features are used together as the training feature set to jointly train an XGBoost classifier, forming a baseline model; features are ranked by importance according to the training results, and features with predictive value are selected.

In this embodiment, the data preprocessing step is implemented as follows:

credit feature data and classification label data of training samples of small and medium-sized enterprises are obtained, wherein an enterprise is classified as a bad sample in the classification label data if:

the enterprise is a person subject to enforcement in commercial lending litigation;

the enterprise has seriously defaulted on a loan;

In this embodiment, preprocessing the credit feature data of small and medium-sized enterprises comprises:

filtering out features whose missing rate is too large;

filling missing values of continuous and discrete features;

filtering out abnormal feature values;

normalizing the features.

The feature normalization adopts the max-min normalization method.

The formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are, respectively, the minimum and maximum values of the feature observed in the training samples.

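The feature processing calculates the similarity of the multi-dimensional feature vectors and forms feature subsets by setting a similarity threshold. The similarity metric of two multi-dimensional feature vectors $x$ and $y$ is the cosine of the angle $\theta$ between them:

$$\cos\theta = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$$

where $x_i$ and $y_i$ are the $i$-th components of the two vectors.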

After the similarity of the multi-dimensional feature vectors has been calculated, the larger the cosine of the angle between two features, the smaller the angle and the greater the correlation between them; conversely, the smaller the cosine, the smaller the correlation. Based on the cosine of the angle between features, a normalized threshold of 0.65 is set, two features whose distance is less than the threshold are merged into one feature set, and the similarity metric of the multi-dimensional feature vectors is updated until the feature relationships are stable.
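A minimal sketch of threshold-based feature grouping is given below. Several points are assumptions rather than details from the source: the distance is taken as 1 minus the absolute cosine of the angle between two feature columns, the iterative update is replaced by a single union-find pass that merges every pair below the threshold, and the names `cosine` and `group_features` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_features(columns, threshold):
    """Merge features whose pairwise distance is below the threshold.

    Assumption: distance = 1 - |cosine|, so a small distance means the
    two features are closely related. The source's exact metric may differ.
    """
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1.0 - abs(cosine(columns[a], columns[b]))
            if distance < threshold:
                parent[find(a)] = find(b)  # merge the two subsets

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


columns = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],   # parallel to "a"
    "c": [1.0, -1.0, 0.0],  # nearly orthogonal to "a" and "b"
}
groups = group_features(columns, threshold=0.65)
```

With this toy data, `a` and `b` fall into one subset while `c` stays on its own. A single-linkage pass like this is the simplest stable grouping; an implementation closer to the text would recompute the metric between merged subsets on each iteration.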

In the linear feature combination step, a logistic regression model is trained for each feature subset by backward stepwise regression, forming the linear feature combination of that subset. The logistic regression model is trained by backward stepwise regression as follows:

first, all features are put into the model;

second, one feature at a time is tentatively removed from the model, and F-tests, t-tests and model evaluation indexes are used to judge whether the amount of variation in the target variable explained by the whole model changes significantly;

third, the feature whose removal reduces the explained variation of the target variable the least is removed;

fourth, the iteration continues until no feature meets the condition for elimination, which yields the logistic regression model of the feature subset, expressed as:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$

where $z = w \cdot x + b$,

$w$ is the vector of logistic regression coefficients corresponding to the features, and $b$ is a constant intercept term.
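The backward stepwise procedure and the logistic form can be sketched as follows. This is an illustration, not the patented implementation: the F-test and t-test criteria are replaced by a caller-supplied `score` function and a `tolerance`, both hypothetical, and the toy score in the example simply adds fixed per-feature contributions instead of fitting a model.

```python
import math


def logistic(x, w, b):
    """P(y = 1 | x) = 1 / (1 + exp(-z)) with z = w . x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def backward_stepwise(features, score, tolerance):
    """Backward elimination over a list of feature names.

    score(subset) is a caller-supplied function (hypothetical here) that
    fits a model on the subset and returns how well it explains the target.
    Features are dropped while the smallest loss in score is <= tolerance.
    """
    selected = list(features)
    while len(selected) > 1:
        base = score(selected)
        # Loss in score caused by removing each feature in turn.
        drops = {
            f: base - score([g for g in selected if g != f])
            for f in selected
        }
        weakest = min(drops, key=drops.get)
        if drops[weakest] > tolerance:
            break  # every remaining feature matters
        selected.remove(weakest)
    return selected


# Toy score: each feature contributes a fixed amount (illustrative only).
contribution = {"loan_rate": 0.30, "contract_rate": 0.20, "noise": 0.001}
kept = backward_stepwise(
    list(contribution),
    score=lambda subset: sum(contribution[f] for f in subset),
    tolerance=0.01,
)
```

Here `noise` is eliminated because removing it barely changes the score, while the two informative features are kept. The output of `logistic` for the fitted `w` and `b` is what would serve as the linear combination feature of the subset.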

In the nonlinear feature combination step, a decision tree classifier that uses information gain as its splitting criterion is first trained for each feature subset. Assume $P(X = 1) = p$ and $P(X = 0) = 1 - p$, let $D_1, D_2, D_3, \ldots, D_n$ be the $n$ subsets into which the data set $D$ is divided according to the values of the feature, and let $|D|$ be the number of samples in the data set. The information entropy is:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

The information gain of a discrete feature $A$ on $D$ is:

$$\mathrm{gain}(A) = H(D) - H(D \mid A)$$

where

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$
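The entropy and information gain used as the splitting criterion can be computed directly. The sketch below uses only the standard library; the function names are hypothetical, and the conditional entropy is taken as the size-weighted average of the subset entropies.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """gain(A) = H(D) - H(D|A) for a discrete feature A."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # H(D|A): entropy of each subset D_i weighted by |D_i| / |D|.
    conditional = sum(
        len(subset) / total * entropy(subset) for subset in subsets.values()
    )
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
gain_perfect = information_gain(["a", "a", "b", "b"], labels)
gain_useless = information_gain(["a", "b", "a", "b"], labels)
```

A feature that separates the two classes exactly recovers the full entropy of the labels as gain, while a feature that is independent of the labels yields no gain.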

Then, a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier of the feature subset, and these rules are taken as the result of the nonlinear feature combination. The IF part contains all the tests along a path, and the THEN part is the final classification. The rules are converted as follows:

1) obtain the simple rules;

2) simplify the rule conditions: the antecedent of a single rule may contain irrelevant conditions, i.e. conditions that have no effect on the conclusion. Redundant conditions that do not affect the correctness of the rule set can be deleted to simplify the rules;

3) the simplification rule is: let rule R be IF A THEN class C; the simplified rule R' is IF A' THEN class C, where A = A' ∪ X, meaning that condition X has no effect on the conclusion "class C".
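Rule extraction and simplification can be illustrated with a small hand-built tree. The nested-dict tree format, the function names and the correctness test used in `simplify` (a rule is correct when every sample it covers carries the rule's class) are assumptions made for this sketch.

```python
def extract_rules(tree, path=()):
    """Return one (conditions, label) rule per root-to-leaf path.

    A tree is either a class label (leaf) or a dict
    {"feature": name, "branches": {value: subtree}}.
    """
    if not isinstance(tree, dict):
        return [(list(path), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(subtree, path + ((tree["feature"], value),))
    return rules


def covers(conditions, sample):
    """True if the sample satisfies every (feature, value) condition."""
    return all(sample[feature] == value for feature, value in conditions)


def simplify(conditions, label, samples):
    """Drop each condition X whose removal keeps the rule correct."""
    kept = list(conditions)
    for condition in list(conditions):
        trial = [c for c in kept if c != condition]
        if all(s["label"] == label for s in samples if covers(trial, s)):
            kept = trial  # X had no effect on the conclusion
    return kept


tree = {
    "feature": "guarantee",
    "branches": {
        "yes": {"feature": "overdue", "branches": {"yes": "bad", "no": "good"}},
        "no": {"feature": "overdue", "branches": {"yes": "bad", "no": "good"}},
    },
}
samples = [
    {"guarantee": g, "overdue": o, "label": "bad" if o == "yes" else "good"}
    for g in ("yes", "no")
    for o in ("yes", "no")
]
rules = extract_rules(tree)
simplified = [(simplify(c, label, samples), label) for c, label in rules]
```

In this example the class depends only on `overdue`, so the `guarantee` test is dropped from every rule: IF guarantee = yes AND overdue = yes THEN bad becomes IF overdue = yes THEN bad.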

In the feature screening and evaluation step, the original credit features that have undergone only data preprocessing, the linear combination features and the nonlinear combination features are first used together as the training feature set to jointly train the XGBoost classifier. The training process is as follows:

1) the XGBoost classifier is used as the base model;

2) the base model parameters are tuned by automatic Bayesian optimization using the HyperOpt library in Python, the AUC value of the model is taken as the criterion for testing the effect of the baseline model, and the best-performing set of hyper-parameters is selected as the final model parameters to form the baseline model;

3) the baseline model is fitted to the training set sample data, and the number of times each feature occurs in the decision tree model generated by each iteration is recorded.
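The AUC criterion used to compare hyper-parameter sets can be shown without any third-party library. The sketch below does not perform XGBoost training or HyperOpt's Bayesian search; it only computes AUC by pairwise comparison and picks the best of several already-scored candidates. The candidate names and scores are invented for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample (label 1) gets the higher score; ties count one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_by_auc(candidates, labels):
    """Pick the hyper-parameter set whose scores give the highest AUC.

    candidates maps a hypothetical parameter-set name to the scores a
    model trained with those parameters produced on validation data.
    """
    return max(candidates, key=lambda name: auc(labels, candidates[name]))


labels = [1, 0, 1, 0]
candidates = {
    "depth_3": [0.9, 0.2, 0.4, 0.5],
    "depth_5": [0.9, 0.1, 0.8, 0.3],
}
best = best_by_auc(candidates, labels)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based computation would be preferable on large validation sets.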

Then, the numbers of times each feature occurs in each iteration of the baseline model are summed to give the importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; the feature importance coefficients are sorted in descending order, a threshold is set for the coefficient, and only the features whose importance coefficients exceed the threshold are retained as the result feature set.
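The counting, normalization and thresholding described here can be sketched as follows. The input format (one list of split features per boosting iteration), the function names and the threshold value are assumptions for illustration; with a real model the counts would come from the trained trees.

```python
from collections import Counter


def importance_coefficients(trees):
    """Sum feature occurrences over all trees, then max-min normalize.

    trees is a list of lists: the features used as split points in the
    tree produced by each boosting iteration (assumed input format).
    """
    counts = Counter(feature for tree in trees for feature in tree)
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    return {
        f: (c - lo) / span if span else 1.0
        for f, c in counts.items()
    }


def select_features(trees, threshold):
    """Rank features by coefficient and keep those above the threshold."""
    coefficients = importance_coefficients(trees)
    ranked = sorted(
        coefficients.items(), key=lambda item: item[1], reverse=True
    )
    return [f for f, coefficient in ranked if coefficient > threshold]


trees = [
    ["loan_rate", "overdue_days"],
    ["loan_rate", "income"],
    ["loan_rate", "overdue_days", "rule_3"],
]
coefficients = importance_coefficients(trees)
selected = select_features(trees, threshold=0.3)
```

Features that never appear in any tree have no count at all in this sketch, so they are implicitly excluded from the result feature set.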

Processing and screening the credit features of small and medium-sized enterprises often requires a large amount of manual intervention, the feature dimensions are rich and their content is numerous and complicated, and the unstable quality of the selected credit features makes credit evaluation results inaccurate. To address these problems, the method introduces automated feature engineering, which removes the restriction on the number of feature types, reduces manual intervention and obtains credit features of small and medium-sized enterprises that are as good as possible.

The embodiment of the invention also provides a device for mining credit data of medium and small enterprises, which comprises: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor is configured to invoke the machine-readable program to execute the method for mining credit data of small and medium-sized enterprises in the above embodiments of the present invention.

The embodiment of the invention also provides a computer readable medium, which stores computer instructions, and when the computer instructions are executed by a processor, the processor executes the method for mining credit data of the medium and small enterprises in the embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

While the invention has been shown and described in detail in the drawings and the preferred embodiments, the invention is not limited to the disclosed embodiments. It will be apparent to those skilled in the art that the means in the various embodiments described above may be combined in various ways to obtain further embodiments of the invention, which also fall within the scope of the invention.

Claims

1. A credit data mining method for small and medium-sized enterprises, characterized in that the credit data of small and medium-sized enterprises are mined on the basis of automated feature engineering: original credit feature data in a training sample feature data set of small and medium-sized enterprises are preprocessed, feature subsets are formed through feature distance calculation, and linear feature combination and nonlinear feature combination are performed on the feature subsets;

2. The method for mining the credit data of the small and medium-sized enterprises according to claim 1, wherein the preprocessing comprises the steps of performing feature filtering, missing value filling, discretization and normalization processing on the original credit feature data in the training sample feature data set of the small and medium-sized enterprises;

3. The method according to claim 2, wherein the feature processing sets a normalized threshold for the similarity metric of the multi-dimensional feature vectors, merges two features whose metric value is smaller than the threshold into a feature subset, and iteratively updates the similarity metric between features until the feature relationships are stable;

4. The credit data mining method for small and medium-sized enterprises according to claim 1, 2 or 3, characterized in that an XGBoost classifier is trained jointly on the features of the training feature set to form the baseline model; the training process is as follows:

an XGBoost classifier is used as the base model;

5. The method of claim 4, wherein the numbers of times each feature occurs in each iteration of the baseline model are summed to give the importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; and a threshold is set for the feature importance coefficient, and only the features whose importance coefficients exceed the threshold are retained as the result feature set.

6. The credit data mining method for small and medium-sized enterprises as claimed in claim 1, wherein credit feature data and classification label data of training samples of small and medium-sized enterprises are obtained, and an enterprise is classified as a bad sample in the classification label data if:

the enterprise is a person subject to enforcement in commercial lending litigation;

the enterprise has seriously defaulted on a loan;

7. The credit data mining method for small and medium-sized enterprises according to claim 1, 2 or 6, wherein preprocessing the original credit feature data in the training sample feature data set of small and medium-sized enterprises specifically comprises:

filtering out features whose missing rate is too large;

filling missing values of continuous and discrete features;

filtering out abnormal feature values;

normalizing the features.

8. The method as claimed in claim 7, wherein the feature normalization process adopts a max-min normalization method.

9. A credit data mining device for small and medium-sized enterprises is characterized by comprising: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor, configured to invoke the machine readable program to perform the method of any of claims 1 to 8.

10. Computer readable medium, characterized in that it has stored thereon computer instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 8.