CN112085593B

CN112085593B - Credit data mining method for small and medium enterprises

Info

Publication number: CN112085593B
Application number: CN202010958951.5A
Authority: CN
Inventors: 崔光裕; 边松华; 崔乐乐
Original assignee: Tianyuan Big Data Credit Management Co Ltd
Current assignee: Tianyuan Big Data Credit Management Co Ltd
Priority date: 2020-09-14
Filing date: 2020-09-14
Publication date: 2024-03-08
Anticipated expiration: 2040-09-14
Also published as: CN112085593A

Abstract

The invention discloses a credit data mining method for small and medium enterprises, relating to the technical field of big data and credit evaluation. The method mines small and medium enterprise credit data based on automatic feature engineering: the original credit feature data in a small and medium enterprise training sample feature data set is preprocessed, feature distances are then calculated to form feature subsets, and linear and nonlinear feature combination is performed on the feature subsets. The preprocessed original credit features, the linear combination features and the nonlinear combination features are used together as the training feature set; a baseline model is formed through training, feature importance is ranked according to the training results, and features with predictive value are selected. The invention can improve the efficiency of credit feature mining for small and medium enterprises, reduce manual intervention, improve the effectiveness of the mining results, and thereby improve the accuracy of credit evaluation of small and medium enterprises.

Description

Credit data mining method for small and medium enterprises

Technical Field

The invention relates to the technical field of big data and credit evaluation, in particular to a method for mining credit data of small and medium enterprises.

Background

In the field of credit evaluation of small and medium enterprises, credit features are an important factor influencing the quality of the evaluation. However, because the credit risks of small and medium enterprises are complex and diverse, the degree of correlation between a given credit feature and credit risk differs greatly across enterprise types. At present, selecting and constructing credit features for such evaluations is difficult, the manual screening workload is excessive, and the demands on the screening personnel's experience are very strict. How to construct a credit data mining system for small and medium enterprises that uses automatic feature engineering and achieves a good prediction effect is a problem to be solved urgently.

Disclosure of Invention

In view of the above defects, the technical task of the invention is to provide a credit data mining method for small and medium enterprises that yields a result feature set with high predictive capability when performing credit data mining and feature engineering, and that provides a basis for the subsequent comprehensive credit evaluation of small and medium enterprises.

The technical scheme adopted for solving the technical problems is as follows:

a credit data mining method for small and medium enterprises, in which credit data mining is realized based on automatic feature engineering: the original credit feature data in a small and medium enterprise training sample feature data set is preprocessed, feature subsets are then formed through feature distance calculation, and linear feature combination and nonlinear feature combination are performed on the feature subsets;

the preprocessed original credit features, the linear combination features and the nonlinear combination features are used as the training feature set, a baseline model is formed through training, feature importance is ranked according to the training results, and features with predictive value are selected.

Small and medium enterprise credit data is voluminous and heterogeneous, unstable in quality, high-dimensional, and contains many features with weak predictive capability. To address these problems, the method starts from a data mining model based on automatic feature engineering and, through the combined use of several numerical analysis and machine learning algorithms, can improve the efficiency of credit feature mining, the effectiveness of the mining results, and the accuracy of credit evaluation of small and medium enterprises. It also overcomes the one-sidedness and subjectivity of credit features selected on the basis of experience, and reduces the operational risk caused by the limits of experience.

Preferably, the preprocessing comprises feature filtering, missing value filling, discretization and normalization of the original credit feature data in the training sample feature data set of the middle and small enterprises;

the feature processing calculates the similarity of the multidimensional feature vectors, and combines the similarity features to form a feature subset by setting a similarity measurement threshold;

the feature linear combination is characterized in that for each feature subset, a logistic regression model of each feature subset is trained by using backward stepwise regression, so that the feature linear combination of each feature subset is formed;

the feature nonlinear combination is characterized in that, for each feature subset, a decision tree classifier using information gain as the splitting criterion is trained, then a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier of that feature subset, and the rules are used as the result of the feature nonlinear combination.

Further, the feature processing sets a normalized similarity measurement threshold for the similarity measurement values of the multidimensional feature vectors, combines any two features whose measurement value is smaller than the threshold into a feature subset, and iteratively updates the similarity measurements between features until the feature relationships are stable;

the feature nonlinear combination takes the series of IF-THEN rules as a simple rule set, simplifies the simple rule set, and takes the simplified rule as a result of the feature nonlinear combination.

The rule conditions are reduced as follows: the antecedent of a single rule may include irrelevant conditions, i.e. conditions that have no effect on the conclusion. Such redundant conditions, which do not affect the correctness of the rule set, may be deleted so that the rules are simplified.

Preferably, an XGBoost classifier is jointly trained on the features of the training feature set to form the baseline model; the training process is as follows:

taking the XGBoost classifier as a basic model;

adjusting the base model parameters by using an automated Bayesian optimization method in Python, taking the model AUC value as the performance criterion of the baseline model, and selecting the optimal group of baseline model hyperparameters as the final model parameters to form the baseline model;

and fitting the training set sample data by using a baseline model, and recording the occurrence times of each feature in a decision tree model generated by each iteration.

Further, the numbers of times each feature appears in the iterations of the baseline model are summed to give the importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; the coefficients are sorted from largest to smallest, a feature importance coefficient threshold is set, and only features whose importance coefficient is larger than the threshold are retained as the result feature set.

Preferably, credit feature data of the small and medium enterprise training samples and classification label data of the small and medium enterprise training samples are obtained, wherein the small and medium enterprises classified as bad samples in the classification label data include those for which:

the enterprise is a person subject to enforcement in commercial loan litigation;

the enterprise is listed as a dishonest judgment debtor, or the actual controller of the enterprise is listed as a dishonest judgment debtor;

the enterprise has a serious default on a loan;

the enterprise has a serious default on a business contract or bill;

the enterprise has penalty records related to its credit standing, for example for counterfeit products;

the credit feature data of the small and medium enterprise training samples is divided into types including enterprise basic information, enterprise performance information, enterprise guarantee information and enterprise financial status, wherein

the enterprise basic information class includes: the actual shareholding ratio of the largest shareholder, enterprise annual income, ownership of the enterprise's place of business, years in operation, and total accumulated credit of the enterprise;

the enterprise performance class includes: enterprise historical loan performance rate, enterprise historical performance amount, maximum expiration date of enterprise history, enterprise business transaction performance rate, enterprise contract performance rate, frequency of credit-related complaints against the enterprise, and amount of credit-related fines imposed on the enterprise;

the enterprise guarantee information class includes: the effective guarantee value of the enterprise and the guarantee mode of the enterprise owner;

the enterprise financial status class includes: equity rate, total equity rate, quick ratio, cash flow equity rate, business revenue growth rate, and equity profit growth rate.

Preferably, the specific implementation manner of preprocessing the original credit feature data in the training sample feature data set of the middle and small enterprises comprises the following steps:

filtering out features whose missing rate is too large;

filling continuous features and discrete feature missing values;

testing the same value rate of discrete features, and filtering the features with overlarge same value rate;

continuous feature volatility test, filtering features with too small variance;

discretizing continuous features, namely discretizing the continuous features with fewer values into discrete features;

filtering the characteristic outliers;

and normalizing the features.

Preferably, the feature normalization process uses a max-min normalization method.

The formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x_{\min}$ and $x_{\max}$ are respectively the minimum value and the maximum value of the feature observed in the small and medium enterprise training samples.
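As an illustration only, max-min normalization of a single feature can be sketched in Python as follows. The function name `min_max_normalize` and the handling of a constant feature are assumptions made for this sketch and are not part of the described method.

```python
def min_max_normalize(values):
    """Scale one feature to [0, 1] using the minimum and maximum observed in the training sample."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical annual income values for three enterprises.
annual_income = [120.0, 300.0, 480.0]
print(min_max_normalize(annual_income))  # [0.0, 0.5, 1.0]
```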

The invention also claims a medium and small enterprise credit data mining device, which comprises: at least one memory and at least one processor;

the at least one memory for storing a machine readable program;

the at least one processor is configured to invoke the machine-readable program to perform the method described above.

The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above-described method.

Compared with the prior art, the credit data mining method for the medium and small enterprises has the following beneficial effects:

starting from a data mining model, the method provides a small and medium-sized enterprise credit data mining method based on automatic feature engineering, so that the credit feature mining efficiency of the small and medium-sized enterprises can be greatly improved;

the one-sided and subjectivity of the credit features of small and medium enterprises selected based on experience is overcome, and the operation risk caused by experience limitation is reduced;

the data mining and characteristic engineering result generated by the method has high interpretability and strong reusability, and the innovation of the method is realized on the basis of guaranteeing easy understandability and usability;

through the combined use of a plurality of numerical analysis and machine learning algorithms, the effectiveness of the credit feature mining result of the middle and small enterprises is improved, the credit evaluation accuracy of the middle and small enterprises is improved, the general financial service efficiency is improved, and the default loss risk of the middle and small enterprises is reduced;

the method can be used for various occasions such as credit condition evaluation before credit, credit change tracking after credit, financial anti-fraud and the like, and can effectively assist business and credit decision.

Drawings

FIG. 1 is a flow chart of the credit data mining method for small and medium enterprises provided by an embodiment of the invention;

FIG. 2 is a flow chart of a feature linear combination step provided by an embodiment of the present invention;

FIG. 3 is a block flow diagram of a feature nonlinear combination step provided by an embodiment of the present invention;

FIG. 4 is a flow chart of the feature screening and evaluation step provided by an embodiment of the invention.

Detailed Description

The invention will be further described with reference to the drawings and the specific examples.

Credit evaluation of small and medium enterprises is a quantitative process of defining, collecting, evaluating and analyzing data on their credit risks, and credit features are the quantified expression of an enterprise's credit characteristics. Feature combination and feature selection are the two main parts of feature engineering. Data and features are the key to machine learning: they determine the upper limit of the performance that a machine learning model can reach.

In general, a larger number of features reflects the properties of the original data more completely, but more features are not always better. Feature combination refers to a family of calculation methods that combine attributes of the original data to generate features with greater expressive power. It mainly comprises linear and nonlinear combinations of features: linear models include logistic regression, linear regression and the like, and nonlinear models include decision trees, neural networks and the like. Feature selection simplifies the feature set, improves model accuracy and reduces model running time; in addition, the fewer the features, the simpler the model and the easier it is for researchers to understand the process that generated the data.

The patent document with application number CN 202010055739.8 and publication number CN111275447A discloses an online network payment fraud detection system based on automated feature engineering. Real-time transaction records generated online between users and merchants through their respective PCs or mobile terminals are received and aggregated by a bank data center. The bank data center screens out the required feature fields through secondary processing and provides the original features to an automatic feature engineering module. On the basis of the original online payment features, the automatic feature engineering module performs feature construction to obtain the set of construction processes of all new features, and provides this set to a fraud detection module for anomaly identification. The fraud detection module constructs the new features according to the construction process set, inputs all features and labels into a machine learning model for discrimination, releases normal transactions, and requires secondary identity authentication from users with abnormal transactions. If the secondary authentication succeeds, the user is allowed to conduct the transaction again; otherwise the user account is locked and all transactions are refused. That method performs feature conversion using longitudinal, transverse and time-window conversion functions; it aims at converting single features to enhance their information expression capability, and does not provide a method for combining and screening features.

The embodiment of the invention provides a medium and small enterprise credit data mining method, which is used for realizing medium and small enterprise credit data mining based on automatic feature engineering, and preprocessing is carried out on original credit feature data in a medium and small enterprise training sample feature data set, wherein the preprocessing comprises feature filtering, missing value filling, discretization and standardization processing on the original credit feature data in the medium and small enterprise training sample feature data set;

and further carrying out characteristic treatment: calculating the similarity of the multidimensional feature vectors, and combining the similarity features by setting a similarity measurement threshold value to form a feature subset;

linear feature combination and nonlinear feature combination are performed on the feature subsets: in the feature linear combination, for each feature subset, a logistic regression model is trained by backward stepwise regression to form the linear feature combination of that subset; in the feature nonlinear combination, for each feature subset, a decision tree classifier using information gain as the splitting criterion is trained, then a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier of that feature subset, and the rules are used as the result of the feature nonlinear combination;

the preprocessed original credit features, the linear combination features and the nonlinear combination features are used as the training feature set to jointly train an XGBoost classifier, forming a baseline model; feature importance is ranked according to the training results, and features with predictive value are selected.

In this embodiment, the data preprocessing step includes the following implementation procedures:

credit feature data of the small and medium enterprise training samples and classification label data of the small and medium enterprise training samples are obtained, wherein the small and medium enterprises classified as bad samples in the classification label data include those for which:

the enterprise is a person subject to enforcement in commercial loan litigation;

the enterprise has a serious default on a loan;

In this embodiment, the adopted method for preprocessing the credit characteristic data of the medium and small enterprises includes:

filtering out features whose missing rate is too large;

filling continuous features and discrete feature missing values;

continuous feature volatility test, filtering features with too small variance;

filtering the characteristic outliers;

and normalizing the features.

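The feature normalization uses the max-min normalization method.

The formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are respectively the minimum value and the maximum value of the feature observed in the small and medium enterprise training samples.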
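The preprocessing steps of this embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the method: the thresholds `max_missing` and `min_variance`, the use of the mean to fill missing values, and the function name `preprocess` are all hypothetical, and outlier filtering is omitted for brevity.

```python
from statistics import mean, pvariance


def preprocess(columns, max_missing=0.5, min_variance=1e-6):
    """columns: dict mapping feature name -> list of floats, with None marking a missing value."""
    cleaned = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        missing_rate = 1 - len(observed) / len(values)
        if not observed or missing_rate > max_missing:
            continue  # filter out features whose missing rate is too large
        fill = mean(observed)  # assumption: fill continuous missing values with the mean
        filled = [fill if v is None else v for v in values]
        lo, hi = min(filled), max(filled)
        if hi == lo or pvariance(filled) < min_variance:
            continue  # volatility test: filter out features whose variance is too small
        cleaned[name] = [(v - lo) / (hi - lo) for v in filled]  # max-min normalization
    return cleaned


# Hypothetical sample of four enterprises.
sample = {
    "operating_years": [2.0, None, 6.0, 10.0],
    "mostly_missing": [None, None, None, 1.0],
    "constant": [3.0, 3.0, 3.0, 3.0],
}
print(preprocess(sample))  # {'operating_years': [0.0, 0.5, 0.5, 1.0]}
```

In this sketch only `operating_years` survives: `mostly_missing` fails the missing-rate filter and `constant` fails the variance test.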

Feature processing calculates the similarity of the multidimensional feature vectors and forms feature subsets by setting a similarity threshold. The multidimensional feature vector similarity measurement value is calculated by the following formula:

After the multidimensional feature vector similarity calculation is completed, the larger the cosine value of the included angle between two features, the smaller the correlation between them; conversely, the smaller the value, the larger the correlation. A normalized included-angle cosine threshold of 0.65 is set for the cosine values between features, any two features whose distance is smaller than the threshold are combined into a feature subset, and the multidimensional feature vector similarity measurement is updated until the feature relationships are stable.
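The grouping step can be sketched in Python under several assumptions that the text does not state. The measure is treated here as a cosine distance, 1 − cos θ, so that a smaller value means more strongly related features, in line with the "distance smaller than the threshold" wording above. The features are assumed to be max-min normalized, so the distance lies in [0, 1]. The iterative update is approximated by transitive merging with a union-find structure. The names `cosine` and `group_features` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the included angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_features(columns, threshold=0.65):
    """Merge features whose assumed cosine distance is below the threshold into subsets."""
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative of n's subset (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1.0 - cosine(columns[a], columns[b])  # assumed distance definition
            if distance < threshold:
                parent[find(a)] = find(b)  # merge the two subsets

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical normalized features observed on four enterprises.
features = {"a": [1, 0, 1, 0], "b": [1, 0, 0.9, 0], "c": [0, 1, 0, 1]}
print(group_features(features))  # [['a', 'b'], ['c']]
```

Because merging is transitive, a feature joins a subset as soon as it is close to any member of it; other update rules would give different subsets.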

In the feature linear combination step, a logistic regression model of each feature subset is trained by applying backward stepwise regression aiming at the feature subset, so as to form feature linear combination of each feature subset. The step of training the logistic regression model by using backward stepwise regression comprises the following steps:

firstly, putting all the features into a model;

secondly, trying to remove one of the features from the model, and judging whether the variation of the interpretation target variable of the whole model has significant variation or not based on F-test, t-test and model evaluation indexes;

thirdly, eliminating the characteristics which reduce the interpretation quantity of the target variable to the minimum;

and fourthly, iterating until no feature satisfies the elimination condition, which yields the logistic regression model of the feature subset, expressed as:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$

wherein $z = w \cdot x + b$,

where $w$ is the vector of logistic regression model coefficients corresponding to the features and $b$ is the constant intercept term.
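The backward stepwise procedure and the resulting linear combination feature can be sketched as follows. This is a simplified illustration: the callable `score` is a hypothetical stand-in for the F-test, t-test and model evaluation indexes named above, `tolerance` is an assumed stopping parameter, and the coefficients passed to `linear_combination_feature` are assumed to come from an already fitted logistic regression model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_combination_feature(x, w, b):
    """Value of the linear combination feature for one sample: sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def backward_eliminate(features, score, tolerance):
    """Start from all features and repeatedly drop the one whose removal lowers
    `score` the least, stopping when every removal would cost more than `tolerance`."""
    selected = list(features)
    while len(selected) > 1:
        base = score(selected)
        # Loss in explained target variation caused by removing each remaining feature.
        losses = {f: base - score([g for g in selected if g != f]) for f in selected}
        weakest = min(losses, key=losses.get)
        if losses[weakest] > tolerance:
            break  # no feature satisfies the elimination condition
        selected.remove(weakest)
    return selected


# Hypothetical per-feature contributions used as a toy scoring function.
contribution = {"a": 0.30, "b": 0.25, "c": 0.001}
kept = backward_eliminate(["a", "b", "c"], lambda fs: sum(contribution[f] for f in fs), 0.01)
print(kept)  # ['a', 'b']
```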

In the feature nonlinear combination step, first, for each feature subset, a decision tree classifier using information gain as the splitting criterion is trained. Assume $P(X=1) = p$ and $P(X=0) = 1 - p$, let $D_1, D_2, D_3, \ldots, D_n$ be the $n$ subsets into which the data set $D$ is divided according to the values of the feature, and let $|D|$ be the number of samples in the data set. The information entropy is then expressed as:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

the information gain of a discrete feature $A$ on $D$ is:

$$\mathrm{gain}(A) = H(D) - H(D \mid A)$$

wherein

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$
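For illustration, the entropy and information gain of a discrete feature can be computed with the short Python sketch below. The function names `entropy` and `information_gain` are hypothetical, and the conditional entropy is taken as the size-weighted average of the subset entropies, which is the standard definition assumed here.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Information entropy (base 2) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """gain(A) = H(D) - H(D|A) for a discrete feature A."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)  # D_i: labels of the samples sharing one feature value
    conditional = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


# Hypothetical labels (1 = bad sample, 0 = good sample) and two discrete features.
labels = [1, 1, 0, 0]
print(information_gain(["x", "x", "y", "y"], labels))  # 1.0, the feature separates the classes
print(information_gain(["x", "y", "x", "y"], labels))  # 0.0, the feature is uninformative
```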

Then, a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier of the feature subset, and these rules are used as the result of the nonlinear combination of features. The IF part contains all the tests along one path, and the THEN part is the final classification. The rules are converted as follows:

1) Obtaining simple rules;

2) Reducing rule conditions: the antecedent of a single rule may include irrelevant conditions, i.e. conditions that have no effect on the conclusion. Such redundant conditions, which do not affect the correctness of the rule set, can be deleted to simplify the rules;

3) Rule reduction criterion: let the rule R be: IF A THEN class C, and let the reduced rule R' be: IF A' THEN class C, where A = A' ∪ X; the reduction applies when condition X has no effect on the conclusion "class C".
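The rule conversion above can be sketched in Python as follows. The nested-dictionary tree format, the feature names, and the test used to decide that a condition "has no effect on the conclusion" (every training sample matching the remaining conditions still belongs to class C) are assumptions made for this illustration.

```python
def extract_rules(node, path=()):
    """Return one (conditions, label) rule per root-to-leaf path of the tree."""
    if "label" in node:
        return [(list(path), node["label"])]
    f, t = node["feature"], node["threshold"]
    return (extract_rules(node["left"], path + ((f, "<=", t),))
            + extract_rules(node["right"], path + ((f, ">", t),)))


def matches(sample, conditions):
    """True if the sample satisfies every condition in the rule antecedent."""
    return all(sample[f] <= t if op == "<=" else sample[f] > t for f, op, t in conditions)


def reduce_rule(conditions, label, samples):
    """Drop each condition X for which all samples matching A' = A minus X are still in class `label`."""
    kept = list(conditions)
    for cond in list(conditions):
        trial = [c for c in kept if c != cond]
        covered = [s for s in samples if matches(s, trial)]
        if covered and all(s["label"] == label for s in covered):
            kept = trial  # X has no effect on the conclusion, so it is removed
    return kept


# Hypothetical tree: the split on "years" leads to the same class on both branches.
tree = {"feature": "overdue_rate", "threshold": 0.2,
        "left": {"feature": "years", "threshold": 3,
                 "left": {"label": "good"}, "right": {"label": "good"}},
        "right": {"label": "bad"}}
samples = [{"overdue_rate": 0.1, "years": 2, "label": "good"},
           {"overdue_rate": 0.1, "years": 5, "label": "good"},
           {"overdue_rate": 0.5, "years": 2, "label": "bad"},
           {"overdue_rate": 0.6, "years": 7, "label": "bad"}]

rules = extract_rules(tree)
conditions, label = rules[0]
print(reduce_rule(conditions, label, samples))  # [('overdue_rate', '<=', 0.2)]
```

Here the first simple rule, IF overdue_rate <= 0.2 AND years <= 3 THEN good, reduces to IF overdue_rate <= 0.2 THEN good.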

In the feature screening and evaluating step, firstly, the original credit features which are only subjected to data preprocessing processing, the linear combination features which are subjected to feature linear combination processing and the nonlinear combination features which are subjected to feature nonlinear combination processing are used as training feature sets, and an XGBoost classifier is trained together, wherein the training process is as follows:

1) Taking the XGBoost classifier as a basic model;

2) Adjusting the base model parameters by using an automated Bayesian optimization method in Python, taking the model AUC value as the performance criterion of the baseline model, and selecting the optimal group of baseline model hyperparameters as the final model parameters to form the baseline model;

3) And fitting the training set sample data by using a baseline model, and recording the occurrence times of each feature in a decision tree model generated by each iteration.

Then, the numbers of times each feature appears in the iterations of the baseline model are summed to give the importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; the coefficients are sorted from largest to smallest, a feature importance coefficient threshold is set, and only features whose importance coefficient is larger than the threshold are retained as the result feature set.
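The screening step can be sketched in Python once the per-feature occurrence counts are available. In the sketch below the counts are supplied directly as a dictionary; the feature names, the threshold value and the function name `select_features` are hypothetical.

```python
def select_features(split_counts, threshold):
    """split_counts: dict mapping feature -> summed number of appearances across all iterations."""
    lo, hi = min(split_counts.values()), max(split_counts.values())
    span = hi - lo
    # Max-min normalization of the importance measures into importance coefficients.
    coefficients = {f: (c - lo) / span if span else 0.0 for f, c in split_counts.items()}
    # Sort from largest to smallest, then keep only coefficients above the threshold.
    ranked = sorted(coefficients.items(), key=lambda kv: kv[1], reverse=True)
    return [(f, coef) for f, coef in ranked if coef > threshold]


# Hypothetical counts for one original, one financial and one rule-based feature.
counts = {"loan_performance_rate": 40, "quick_ratio": 25, "rule_3": 10}
print(select_features(counts, threshold=0.3))
# [('loan_performance_rate', 1.0), ('quick_ratio', 0.5)]
```

With the XGBoost Python package, counts of this kind correspond to what `Booster.get_score(importance_type="weight")` reports, namely the number of times each feature is used to split the data across all trees.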

Processing and screening the credit features of small and medium enterprises has required a large amount of manual intervention; the features are high-dimensional and their content is voluminous and heterogeneous; the quality of the selected credit features is unstable; and the resulting credit evaluations of small and medium enterprises are inaccurate. To address these problems, the method introduces automatic feature engineering, which removes the restriction on the number of feature types, reduces manual intervention, and obtains credit features for small and medium enterprises that are as good as possible.

An embodiment of the invention also provides a credit data mining device for small and medium enterprises, comprising: at least one memory and at least one processor;

the at least one memory for storing a machine readable program;

the at least one processor is configured to invoke the machine-readable program to perform a method for mining credit data of small and medium enterprises according to the above embodiments of the present invention.

An embodiment of the invention also provides a computer readable medium on which computer instructions are stored; when executed by a processor, the computer instructions cause the processor to execute the credit data mining method for small and medium enterprises of the above embodiment. Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.

In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.

Examples of the storage medium for providing the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer by a communication network.

Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.

Further, it is understood that the program code read out from the storage medium may be written into a memory provided in an expansion board inserted into the computer or into a memory provided in an expansion unit connected to the computer, and a CPU or the like mounted on the expansion board or the expansion unit may then be caused to perform part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above embodiments.

While the invention has been illustrated and described in detail in the drawings and in the preferred embodiments, the invention is not limited to the disclosed embodiments, and it will be appreciated by those skilled in the art that the code audits of the various embodiments described above may be combined to produce further embodiments of the invention, which are also within the scope of the invention.

Claims

1. A credit data mining method for small and medium enterprises, characterized in that credit data mining for small and medium enterprises is realized based on automatic feature engineering: the original credit feature data in a small and medium enterprise training sample feature data set is preprocessed, feature distance calculation is then carried out to form feature subsets, and linear feature combination and nonlinear feature combination are performed on the feature subsets;

the preprocessed original credit features, the linear combination features and the nonlinear combination features are used as the training feature set, a baseline model is formed through training, feature importance is ranked according to the training results, and features with predictive value are selected;

an XGBoost classifier is jointly trained on the features of the training feature set to form the baseline model; the training process is as follows:

taking the XGBoost classifier as a basic model;

fitting training set sample data by using a baseline model, and recording the occurrence times of each feature in a decision tree model generated by each iteration;

summing the number of times each feature appears in each iteration of the baseline model as a feature importance measure for the feature; performing maximum-minimum normalization processing on importance measures of all the features to form feature importance coefficients; and setting a feature importance coefficient threshold value, and only preserving features with importance coefficients larger than the threshold value as a result feature set.

2. The method for mining credit data of small and medium enterprises according to claim 1, wherein the preprocessing comprises feature filtering, missing value filling, discretizing and normalizing of original credit feature data in a small and medium enterprise training sample feature data set;

the feature preprocessing calculates the similarity of the multidimensional feature vectors, and combines the similarity features to form a feature subset by setting a similarity measurement threshold;

3. The method for mining credit data of small and medium enterprises according to claim 2, wherein the feature preprocessing sets a normalized similarity measurement threshold value for a multi-dimensional feature vector similarity measurement value, combines two features smaller than the threshold value to form a feature subset, and updates the similarity measurement between iterative features until the feature relationship is stable;

4. The method for mining credit data of small and medium enterprises according to claim 1, wherein credit feature data of the small and medium enterprise training samples and classification label data of the small and medium enterprise training samples are obtained, and the small and medium enterprises classified as bad samples in the classification label data include those for which:

the enterprise is a person subject to enforcement in commercial loan litigation;

the enterprise has a serious default on a loan;

5. The method for mining credit data of small and medium enterprises according to claim 1, 2 or 4, wherein the specific implementation manner of preprocessing the original credit feature data in the training sample feature data set of the small and medium enterprises comprises:

filtering out features whose missing rate is too large;

filling continuous features and discrete feature missing values;

continuous feature volatility test, filtering features with too small variance;

filtering the characteristic outliers;

and normalizing the features.

6. The method for mining credit data of small and medium enterprises according to claim 5, wherein the feature normalization process adopts a maximum-minimum value normalization method.

7. A medium and small enterprise credit data mining apparatus, comprising: at least one memory and at least one processor;

the at least one memory for storing a machine readable program;

the at least one processor being configured to invoke the machine readable program to perform the method of any of claims 1 to 6.

8. A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 6.