CN112085593B - Credit data mining method for small and medium enterprises - Google Patents

Publication number
CN112085593B
Authority
CN
China
Prior art keywords
feature
enterprise
credit
small
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010958951.5A
Other languages
Chinese (zh)
Other versions
CN112085593A (en
Inventor
崔光裕
边松华
崔乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyuan Big Data Credit Management Co Ltd
Original Assignee
Tianyuan Big Data Credit Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyuan Big Data Credit Management Co Ltd
Priority to CN202010958951.5A
Publication of CN112085593A
Application granted
Publication of CN112085593B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a credit data mining method for small and medium-sized enterprises (SMEs), relating to the technical field of big data and credit evaluation. SME credit data mining is implemented on the basis of automated feature engineering: the original credit feature data in an SME training sample feature data set is preprocessed, feature distances are then calculated to form feature subsets, and linear and nonlinear feature combination is performed on each feature subset. The preprocessed original credit features, the linear combination features and the nonlinear combination features together serve as the training feature set; a baseline model is formed through training, the features are ranked by importance according to the training results, and the features with predictive value are selected. The invention can improve the efficiency of SME credit feature mining, reduce manual intervention, improve the effectiveness of the mining results, and thereby improve the accuracy of SME credit evaluation.

Description

Credit data mining method for small and medium enterprises
Technical Field
The invention relates to the technical field of big data and credit evaluation, in particular to a method for mining credit data of small and medium enterprises.
Background
In the field of credit evaluation of small and medium-sized enterprises (SMEs), credit features are an important factor affecting evaluation quality. However, because SME credit risk is complex and diverse, the degree of correlation between a credit feature and credit risk differs greatly across credit features and across types of SME. At present, selecting and constructing credit features for SME credit evaluation is difficult, the manual screening workload is excessive, and the demands on the experience of the screening personnel are very strict. How to build an SME credit data mining system with automated feature engineering and good predictive performance is therefore a problem to be solved urgently.
Disclosure of Invention
In view of the above defects, the technical task of the invention is to provide a credit data mining method for small and medium-sized enterprises (SMEs) that yields a result feature set with high predictive power when SME credit data mining and feature engineering are carried out, and that provides a basis for subsequent comprehensive SME credit evaluation.
The technical solution adopted by the invention to solve the technical problem is as follows:
A credit data mining method for SMEs, based on automated feature engineering, in which the original credit feature data in an SME training sample feature data set is preprocessed, feature subsets are formed through distance calculation, and linear feature combination and nonlinear feature combination are performed on the feature subsets;
the preprocessed original credit features, the linear combination features produced by linear feature combination and the nonlinear combination features produced by nonlinear feature combination are used as the training feature set, a baseline model is formed through training, the features are ranked by importance according to the training results, and the features with predictive value are selected.
SME credit data is voluminous and heterogeneous, unstable in quality and high-dimensional, and contains many features with weak predictive power. To address these problems, the method starts from a data mining model based on automated feature engineering and, through the combined use of several numerical analysis and machine learning algorithms, can improve the efficiency of SME credit feature mining, the effectiveness of the mining results and the accuracy of SME credit evaluation. It also overcomes the one-sidedness and subjectivity of selecting SME credit features by experience and reduces the operational risk caused by the limits of experience.
Preferably, the preprocessing comprises feature filtering, missing-value filling, discretization and normalization of the original credit feature data in the SME training sample feature data set;
the feature processing calculates the similarity of the multidimensional feature vectors and, by setting a similarity measurement threshold, merges similar features to form feature subsets;
in the linear feature combination, for each feature subset, a logistic regression model is trained using backward stepwise regression, thereby forming the linear feature combination of that subset;
in the nonlinear feature combination, for each feature subset, a decision tree classifier that uses information gain as its splitting criterion is trained, a series of IF-THEN rules is then obtained from the paths from the root node to each leaf node of the decision tree classifier, and the rules are used as the result of the nonlinear feature combination.
Further, the feature processing sets a normalized similarity measurement threshold for the similarity measurement values of the multidimensional feature vectors, merges two features whose measurement value is smaller than the threshold into a feature subset, and iteratively updates the similarity measurements between features until the feature relationships are stable;
the nonlinear feature combination takes the series of IF-THEN rules as a simple rule set, reduces the simple rule set, and takes the reduced rules as the result of the nonlinear feature combination.
The rule conditions are reduced as follows: the antecedent of a single rule may contain irrelevant conditions, i.e. conditions that have no effect on the conclusion. Such redundant conditions, which do not affect the correctness of the rule set, may be deleted to simplify the rules.
Preferably, an XGBoost classifier is trained on all the features of the training feature set together to form the baseline model; the training process is as follows:
the XGBoost classifier is taken as the base model;
the base model parameters are tuned using an automated Bayesian optimization method in Python, the AUC value of the model is taken as the performance criterion for the baseline model, and the optimal set of hyperparameters is selected as the final model parameters to form the baseline model;
the baseline model is fitted to the training set sample data, and the number of times each feature appears in the decision tree model generated at each iteration is recorded.
Further, the number of times each feature appears across the iterations of the baseline model is summed and used as the feature importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; the features are sorted by importance coefficient in descending order, a threshold on the importance coefficient is set, and only the features whose importance coefficient is greater than the threshold are retained as the result feature set.
Preferably, credit feature data of the SME training samples and classification label data of the SME training samples are obtained, wherein an SME is classified as a bad sample in the classification label data if:
the enterprise is a judgment debtor in commercial loan litigation;
the enterprise is listed as a dishonest judgment debtor, or the actual controller of the enterprise is listed as a dishonest judgment debtor;
the enterprise has seriously defaulted on a loan;
the enterprise has seriously defaulted on business contracts or bills;
the enterprise has penalty records bearing on its credit standing, for example for counterfeit products;
the credit feature data of the SME training samples is divided into classes comprising enterprise basic information, enterprise performance information, enterprise guarantee information and enterprise financial status, wherein
the enterprise basic information class comprises: the actual shareholding ratio of the largest shareholder, annual enterprise income, ownership of the enterprise's business premises, years in operation, and total accumulated credit of the enterprise;
the enterprise performance information class comprises: historical loan performance rate, historical performance amount, maximum historical overdue period, business transaction performance rate, contract performance rate, number of credit-related complaints, and amount of credit-related fines;
the enterprise guarantee information class comprises: the effective guarantee value of the enterprise and the guarantee mode of the enterprise owner;
the enterprise financial status class comprises: equity rate, total equity rate, quick ratio, cash flow equity rate, operating revenue growth rate, and equity profit growth rate.
Preferably, preprocessing the original credit feature data in the SME training sample feature data set specifically comprises:
filtering out features with an excessive missing rate;
filling missing values of continuous and discrete features;
testing the same-value rate of discrete features and filtering out features whose same-value rate is too high;
testing the volatility of continuous features and filtering out features whose variance is too small;
discretizing continuous features, i.e. converting continuous features that take few distinct values into discrete features;
filtering out feature outliers;
normalizing the features.
Preferably, the feature normalization uses the max-min normalization method.
The formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values of the feature observed in the SME training samples.
The invention also claims a credit data mining device for small and medium-sized enterprises, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to invoke the machine-readable program to perform the method described above.
The invention also claims a computer-readable medium storing computer instructions which, when executed by a processor, cause the processor to perform the method described above.
Compared with the prior art, the credit data mining method for small and medium-sized enterprises (SMEs) has the following beneficial effects:
starting from a data mining model, the method provides SME credit data mining based on automated feature engineering, which can greatly improve the efficiency of SME credit feature mining;
the one-sidedness and subjectivity of selecting SME credit features by experience are overcome, and the operational risk caused by the limits of experience is reduced;
the data mining and feature engineering results produced by the method are highly interpretable and reusable, so the method innovates while remaining easy to understand and use;
through the combined use of several numerical analysis and machine learning algorithms, the effectiveness of the SME credit feature mining results is improved, the accuracy of SME credit evaluation is improved, the efficiency of inclusive financial services is improved, and the risk of loss from SME default is reduced;
the method can be used in scenarios such as pre-loan credit evaluation, post-loan credit change tracking and financial anti-fraud, and can effectively support business and credit decisions.
Drawings
FIG. 1 is a flow chart of a credit data mining method for small and medium-sized enterprises provided by an embodiment of the invention;
FIG. 2 is a flow chart of the linear feature combination step provided by an embodiment of the invention;
FIG. 3 is a flow chart of the nonlinear feature combination step provided by an embodiment of the invention;
FIG. 4 is a flow chart of the feature screening and evaluation step provided by an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
Credit evaluation of small and medium-sized enterprises (SMEs) is a quantitative process of defining, collecting, evaluating and analysing data on SME credit risk, and credit features are the quantified expression of an SME's credit characteristics. Feature combination and feature selection are the two main components of feature engineering. Data and features are key to machine learning: they determine the upper limit of performance that a machine learning model can reach.
In general, more features reflect the properties of the original data more completely, but more features are not always better. Feature combination refers to a family of computational methods that combine attributes of the original data to generate more expressive features. It mainly comprises linear and nonlinear combination of features; linear models include logistic regression, linear regression and the like, and nonlinear models include decision trees, neural networks and the like. Feature selection simplifies the feature set, improves model accuracy and reduces model running time; in addition, the fewer the features, the simpler the model and the easier it is for researchers to understand the data-generating process.
The patent document with application number CN 202010055739.8 and publication number CN111275447A discloses an online network payment fraud detection system based on automated feature engineering. Real-time transaction records generated over the network between users and merchants, through their respective PCs or mobile terminals, are received and aggregated by a bank data center. The bank data center screens out the required feature fields through secondary processing and provides the original features to an automated feature engineering module. On the basis of the original online payment features, the automated feature engineering module performs feature construction to obtain the set of construction processes for all new features, and provides this set to a fraud detection module for anomaly identification. The fraud detection module constructs the new features according to this set, feeds all features and labels into a machine learning model for discrimination, lets normal transactions through, and requires secondary identity authentication from users with abnormal transactions. If the secondary authentication succeeds, the user is allowed to retry the transaction; otherwise the user account is locked and all transactions are refused. That method applies transformation functions in a longitudinal mode, a transverse mode and a time-window mode to transform features. It aims to transform single features and strengthen their expressive power, and does not provide a method for combining and screening features.
The embodiment of the invention provides a credit data mining method for small and medium-sized enterprises (SMEs) that implements SME credit data mining on the basis of automated feature engineering. The original credit feature data in the SME training sample feature data set is preprocessed, the preprocessing comprising feature filtering, missing-value filling, discretization and normalization;
feature processing is then carried out: the similarity of the multidimensional feature vectors is calculated, and similar features are merged into feature subsets by setting a similarity measurement threshold;
linear and nonlinear feature combination is performed on the feature subsets. In the linear feature combination, for each feature subset, a logistic regression model is trained using backward stepwise regression to form the linear feature combination of that subset. In the nonlinear feature combination, for each feature subset, a decision tree classifier that uses information gain as its splitting criterion is trained, a series of IF-THEN rules is then obtained from the paths from the root node to each leaf node of the decision tree classifier, and the rules are used as the result of the nonlinear feature combination;
the preprocessed original credit features, the linear combination features and the nonlinear combination features are used together as the training feature set to train an XGBoost classifier, forming a baseline model; the features are ranked by importance according to the training results, and the features with predictive value are selected.
In this embodiment, the data preprocessing step is implemented as follows:
credit feature data of the SME training samples and classification label data of the SME training samples are obtained, wherein an SME is classified as a bad sample in the classification label data if:
the enterprise is a judgment debtor in commercial loan litigation;
the enterprise is listed as a dishonest judgment debtor, or the actual controller of the enterprise is listed as a dishonest judgment debtor;
the enterprise has seriously defaulted on a loan;
the enterprise has seriously defaulted on business contracts or bills;
the enterprise has penalty records bearing on its credit standing, for example for counterfeit products;
the credit feature data of the SME training samples is divided into classes comprising enterprise basic information, enterprise performance information, enterprise guarantee information and enterprise financial status, wherein
the enterprise basic information class comprises: the actual shareholding ratio of the largest shareholder, annual enterprise income, ownership of the enterprise's business premises, years in operation, and total accumulated credit of the enterprise;
the enterprise performance information class comprises: historical loan performance rate, historical performance amount, maximum historical overdue period, business transaction performance rate, contract performance rate, number of credit-related complaints, and amount of credit-related fines;
the enterprise guarantee information class comprises: the effective guarantee value of the enterprise and the guarantee mode of the enterprise owner;
the enterprise financial status class comprises: equity rate, total equity rate, quick ratio, cash flow equity rate, operating revenue growth rate, and equity profit growth rate.
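The bad-sample criteria listed in this step lend themselves to a simple labelling rule. The following is a minimal sketch, assuming each criterion has already been reduced to a boolean flag; the class name, field names and the encoding 1 = bad sample, 0 = otherwise are all hypothetical and are not specified by the source.

```python
from dataclasses import dataclass


@dataclass
class BadSampleFlags:
    """One boolean per bad-sample criterion (field names are hypothetical)."""
    judgment_debtor_in_loan_litigation: bool = False
    listed_as_dishonest_judgment_debtor: bool = False
    serious_loan_default: bool = False
    serious_contract_or_bill_default: bool = False
    credit_related_penalty_record: bool = False


def label_sample(flags: BadSampleFlags) -> int:
    """Return 1 (bad sample) if any criterion holds, else 0 (assumed encoding)."""
    return int(any(vars(flags).values()))


clean = label_sample(BadSampleFlags())
defaulted = label_sample(BadSampleFlags(serious_loan_default=True))
```

A record is marked bad as soon as a single criterion is met, which matches the way the criteria are enumerated as alternatives.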
In this embodiment, the method adopted for preprocessing the SME credit feature data comprises:
filtering out features with an excessive missing rate;
filling missing values of continuous and discrete features;
testing the same-value rate of discrete features and filtering out features whose same-value rate is too high;
testing the volatility of continuous features and filtering out features whose variance is too small;
discretizing continuous features, i.e. converting continuous features that take few distinct values into discrete features;
filtering out feature outliers;
normalizing the features.
The feature normalization uses the max-min normalization method.
The formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values of the feature observed in the SME training samples.
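Part of the preprocessing chain above can be sketched in a few lines. The example below covers only missing-rate filtering, missing-value filling, variance filtering and max-min normalization; the same-value-rate test, discretization and outlier filtering are omitted. Mean imputation, the cut-off values `max_missing` and `min_variance`, and the feature names are assumptions for illustration, since the source does not state them.

```python
from statistics import pvariance


def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def fill_missing(values):
    """Fill None with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Scale to [0, 1] using (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(features, max_missing=0.5, min_variance=1e-6):
    """features maps a feature name to a list of numbers or None."""
    cleaned = {}
    for name, values in features.items():
        if missing_rate(values) > max_missing:
            continue  # too many missing values
        filled = fill_missing(values)
        if pvariance(filled) < min_variance:
            continue  # almost constant, carries little information
        cleaned[name] = min_max_normalize(filled)
    return cleaned


raw = {
    "annual_income": [120.0, None, 300.0, 210.0],
    "mostly_missing": [None, None, None, 1.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
}
result = preprocess(raw)
```

Here `mostly_missing` is dropped by the missing-rate filter and `constant` by the variance filter, leaving only the normalized `annual_income` column.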
The feature processing calculates the similarity of the multidimensional feature vectors and forms feature subsets by setting a similarity threshold. The similarity measurement value of the multidimensional feature vectors is based on the cosine of the included angle between features.
After the similarity calculation is completed, the larger the cosine of the included angle between features, the smaller the correlation between them, and vice versa. A normalized threshold of 0.65 is set for the cosine of the included angle between features, two features whose distance is smaller than the threshold are merged into a feature set, and the multidimensional feature vector similarity measurements are updated until the feature relationships are stable.
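The grouping step can be illustrated as follows. The exact similarity formula is not reproduced in this text, so this sketch assumes a cosine distance of the form 1 - cos θ, under which a smaller value means more similar features; that choice, the single-linkage merging rule and the feature names are assumptions, and the patent's own measure may differ. Only the threshold 0.65 and the "merge until stable" loop come from the description.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def group_features(vectors, threshold=0.65):
    """Merge feature groups until no two groups are closer than the threshold.

    vectors maps a feature name to its column of values over the samples.
    The distance between two groups is the smallest distance between any pair
    of their members (single linkage, an assumed choice).
    """
    groups = [[name] for name in vectors]
    merged = True
    while merged:  # repeat until the grouping is stable
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(cosine_distance(vectors[p], vectors[q])
                        for p in groups[i] for q in groups[j])
                if d < threshold:
                    groups[i] = groups[i] + groups[j]
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


columns = {
    "loan_performance_rate": [1.0, 0.0, 0.0],
    "contract_performance_rate": [0.9, 0.1, 0.0],
    "quick_ratio": [0.0, 0.0, 1.0],
}
subsets = group_features(columns)
```

In this toy input the two performance-rate columns point in nearly the same direction and end up in one subset, while `quick_ratio` is orthogonal to both and stays on its own.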
In the linear feature combination step, for each feature subset, a logistic regression model is trained using backward stepwise regression to form the linear feature combination of that subset. Training the logistic regression model with backward stepwise regression comprises the following steps:
first, all features are put into the model;
second, one feature at a time is tentatively removed from the model, and F-tests, t-tests and model evaluation metrics are used to judge whether the amount of variation in the target variable explained by the whole model changes significantly;
third, the feature whose removal reduces the explained variation of the target variable the least is eliminated;
fourth, the iteration continues until no feature meets the elimination condition, giving the logistic regression model of the feature subset, which has the form:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$
where $z = w \cdot x + b$,
$w$ is the vector of logistic regression coefficients corresponding to the features, and $b$ is a constant intercept term.
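The elimination loop and the resulting combined feature can be sketched as below. Instead of the F-test and t-test named in the steps above, this sketch takes a caller-supplied `score` function and a `tolerance`, both hypothetical stand-ins for a model evaluation metric and a significance cut-off; fitting the logistic regression itself is left out.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_combination_feature(x, w, b):
    """Value of the combined feature for one sample: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def backward_eliminate(features, score, tolerance=0.01):
    """Start from all features and repeatedly drop the one whose removal
    costs the least score, as long as that cost stays within tolerance."""
    selected = list(features)
    while len(selected) > 1:
        base = score(selected)
        costs = [(base - score([f for f in selected if f != g]), g)
                 for g in selected]
        cost, candidate = min(costs)
        if cost > tolerance:
            break  # every remaining feature matters
        selected.remove(candidate)
    return selected


# Toy score: each feature contributes a fixed amount (illustrative only).
contribution = {"a": 0.5, "b": 0.3, "noise": 0.001}


def toy_score(subset):
    return sum(contribution[f] for f in subset)


kept = backward_eliminate(["a", "b", "noise"], toy_score)
midpoint = linear_combination_feature([0.0, 0.0], [1.0, 1.0], 0.0)
```

With the toy score, `noise` is removed because dropping it barely changes the score, while `a` and `b` are retained.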
In the nonlinear feature combination step, a decision tree classifier that uses information gain as its splitting criterion is first trained for each feature subset. Assume $P(X=1)=p$ and $P(X=0)=1-p$, let $D_1, D_2, D_3, \ldots, D_n$ be the $n$ subsets into which the data set $D$ is divided according to the values of the feature, and let $|D|$ be the number of samples in the data set. The information entropy is:
$$H(p) = -p\log_2 p - (1-p)\log_2(1-p)$$
The information gain of a discrete feature $A$ on $D$ is:
$$\mathrm{gain}(A) = H(D) - H(D \mid A)$$
where
$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$
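The entropy and information gain definitions can be checked numerically with a short sketch. The function names and the toy data are illustrative; the computation is the standard one, with H(D|A) taken as the size-weighted average entropy of the subsets induced by the feature values.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """gain(A) = H(D) - H(D|A) for a discrete feature A."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    total = len(labels)
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
perfect = information_gain(["x", "x", "y", "y"], labels)
useless = information_gain(["x", "y", "x", "y"], labels)
```

A feature that separates the two classes perfectly recovers the full one bit of entropy in the labels, whereas a feature that splits each class evenly has zero gain.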
Then, a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier of each feature subset, and the rules are used as the result of the nonlinear feature combination. The IF part contains all the tests along one path, and the THEN part is the final class. The rules are converted as follows:
1) Obtain the simple rules;
2) Reduce the rule conditions: the antecedent of a single rule may contain irrelevant conditions, i.e. conditions that have no effect on the conclusion. Such redundant conditions, which do not affect the correctness of the rule set, may be deleted to simplify the rules;
3) Rule reduction criterion: let the rule R be "IF A THEN class C" and the reduced rule R' be "IF A' THEN class C", where A = A' ∪ X; this means that condition X has no effect on the conclusion "class C".
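The conversion from tree paths to rules, and the reduction of redundant conditions, can be sketched as follows. The tree is represented as a nested dictionary, which is an assumed format rather than the output of any particular library, and "condition X has no effect on the conclusion" is approximated by checking that every labelled sample matching the remaining conditions A' still has class C. The feature names and class labels are made up.

```python
def extract_rules(tree, path=()):
    """Turn every root-to-leaf path into a rule (conditions, class)."""
    if not isinstance(tree, dict):  # a leaf holds the class label
        return [(list(path), tree)]
    rules = []
    for value, child in tree["branches"].items():
        rules.extend(extract_rules(child, path + ((tree["feature"], value),)))
    return rules


def matches(sample, conditions):
    """True if the sample satisfies every (feature, value) condition."""
    return all(sample.get(f) == v for f, v in conditions)


def reduce_rule(conditions, label, samples):
    """Drop each condition X for which the remaining conditions A' still
    select only samples of the rule's class (assumed redundancy test)."""
    kept = list(conditions)
    for cond in list(conditions):
        trial = [c for c in kept if c != cond]
        hit = [y for x, y in samples if matches(x, trial)]
        if hit and all(y == label for y in hit):
            kept = trial
    return kept


tree = {
    "feature": "guarantee",
    "branches": {
        "strong": {"feature": "overdue", "branches": {"yes": "bad", "no": "good"}},
        "weak": "bad",
    },
}
samples = [
    ({"guarantee": "strong", "overdue": "yes"}, "bad"),
    ({"guarantee": "strong", "overdue": "no"}, "good"),
    ({"guarantee": "weak", "overdue": "yes"}, "bad"),
    ({"guarantee": "weak", "overdue": "no"}, "bad"),
]
rules = extract_rules(tree)
reduced = reduce_rule(rules[0][0], rules[0][1], samples)
```

The first simple rule, "IF guarantee = strong AND overdue = yes THEN bad", reduces to "IF overdue = yes THEN bad" because the guarantee condition does not change the conclusion on these samples.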
In the feature screening and evaluation step, the original credit features that have only been preprocessed, the linear combination features and the nonlinear combination features are first used together as the training feature set to train an XGBoost classifier. The training process is as follows:
1) Take the XGBoost classifier as the base model;
2) Tune the base model parameters using an automated Bayesian optimization method in Python, take the AUC value of the model as the performance criterion for the baseline model, and select the optimal set of hyperparameters as the final model parameters to form the baseline model;
3) Fit the baseline model to the training set sample data, and record the number of times each feature appears in the decision tree model generated at each iteration.
Then, the number of times each feature appears across the iterations of the baseline model is summed and used as the feature importance measure of that feature; max-min normalization is applied to the importance measures of all features to form feature importance coefficients; the features are sorted by importance coefficient in descending order, a threshold on the importance coefficient is set, and only the features whose importance coefficient is greater than the threshold are retained as the result feature set.
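The importance ranking described above can be sketched independently of the model library. The example assumes the per-iteration occurrence counts are already available as dictionaries; with the xgboost package, totals of this kind are what `Booster.get_score(importance_type="weight")` reports, though whether the authors used that call is not stated. The threshold value 0.2 and the feature names are hypothetical.

```python
from collections import Counter


def total_occurrences(per_iteration_counts):
    """Sum, over all boosting iterations, how often each feature appears."""
    totals = Counter()
    for counts in per_iteration_counts:
        totals.update(counts)
    return dict(totals)


def importance_coefficients(totals):
    """Max-min normalize the occurrence totals to [0, 1]."""
    lo, hi = min(totals.values()), max(totals.values())
    span = hi - lo
    return {f: (c - lo) / span if span else 0.0 for f, c in totals.items()}


def select_features(totals, threshold=0.2):
    """Rank features by coefficient (descending) and keep those above the threshold."""
    coeffs = importance_coefficients(totals)
    ranked = sorted(coeffs.items(), key=lambda item: item[1], reverse=True)
    return [f for f, c in ranked if c > threshold]


iterations = [
    {"loan_performance_rate": 3, "quick_ratio": 1},
    {"loan_performance_rate": 2, "years_in_operation": 1},
    {"quick_ratio": 1},
]
totals = total_occurrences(iterations)
selected = select_features(totals)
```

Because max-min normalization maps the least-used feature to 0, that feature is always removed by any positive threshold.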
Processing and screening the credit features of small and medium-sized enterprises (SMEs) has required extensive manual intervention, and because the feature dimensions are rich and the content voluminous, the quality of the selected credit features has been unstable and SME credit evaluation results inaccurate. By introducing automated feature engineering, the method removes the limit on the number of feature types, reduces manual intervention, and obtains the best SME credit features it can.
The embodiment of the invention also provides a device for mining the credit data of the small and medium enterprises, which comprises the following steps: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to perform a method for mining credit data of small and medium enterprises according to the above embodiments of the present invention.
An embodiment of the invention also provides a computer readable medium on which computer instructions are stored; when executed by a processor, the computer instructions cause the processor to execute the small and medium enterprise credit data mining method of the above embodiments of the invention. Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.
Examples of the storage medium for providing the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer by a communication network.
Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.
Further, it is understood that the program code read out from the storage medium may be written into a memory provided on an expansion board inserted into the computer or into a memory provided in an expansion unit connected to the computer, and a CPU or the like mounted on the expansion board or the expansion unit may then be caused to perform part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above embodiments.
While the invention has been illustrated and described in detail in the drawings and in the preferred embodiments, the invention is not limited to the disclosed embodiments, and it will be appreciated by those skilled in the art that the technical means of the various embodiments described above may be combined to produce further embodiments of the invention, which are also within the scope of the invention.

Claims (8)

1. A credit data mining method for small and medium enterprises, characterized in that the method mines the credit data of small and medium enterprises based on automated feature engineering: the original credit feature data in the small and medium enterprise training sample feature data set is preprocessed, feature distance calculation is then performed to form feature subsets, and feature linear combination and feature nonlinear combination are performed on the feature subsets;
using the original credit features after data preprocessing, the linear combination features obtained by feature linear combination and the nonlinear combination features obtained by feature nonlinear combination as the training feature set, training to form a baseline model, and ranking feature importance according to the training results to select features with predictive value;
jointly training an XGBoost classifier on the features of the training feature set to form the baseline model; the training process is as follows:
taking the XGBoost classifier as a basic model;
tuning the parameters of the base model using an automated Bayesian optimization method in Python, taking the model's AUC value as the criterion for evaluating the baseline model, and selecting the optimal set of hyperparameters as the final model parameters to form the baseline model;
fitting the training set sample data with the baseline model, and recording the number of times each feature occurs in the decision tree generated at each iteration;
summing the number of times each feature appears across all iterations of the baseline model as the importance measure of that feature; applying max-min normalization to the importance measures of all features to form feature importance coefficients; and setting a feature importance coefficient threshold, and retaining only the features whose importance coefficients exceed the threshold as the resulting feature set.
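The use of AUC as the criterion for choosing among candidate hyperparameter sets can be illustrated with the sketch below. It does not implement Bayesian optimization itself; it only shows, under stated assumptions, how candidates proposed by any search procedure might be scored and compared. The function names, the candidate labels and the convention that label 1 marks the positive class are assumptions made for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_by_auc(candidate_scores, labels):
    """Pick the candidate whose validation scores give the highest AUC.

    candidate_scores maps a hypothetical hyperparameter-set name to the
    scores that the corresponding model assigns to the validation samples.
    """
    return max(candidate_scores, key=lambda name: auc(labels, candidate_scores[name]))


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    candidates = {
        "max_depth=3": [0.1, 0.4, 0.35, 0.8],
        "max_depth=6": [0.2, 0.3, 0.6, 0.9],
    }
    print(best_by_auc(candidates, y))
```

The pairwise computation is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based formula would be the usual choice for larger validation sets.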
2. The method for mining credit data of small and medium enterprises according to claim 1, wherein the preprocessing comprises feature filtering, missing value filling, discretization and normalization of the original credit feature data in the small and medium enterprise training sample feature data set;
the feature preprocessing calculates the similarity of the multidimensional feature vectors and, by setting a similarity measurement threshold, merges similar features to form feature subsets;
the feature linear combination is characterized in that, for each feature subset, a logistic regression model is trained using backward stepwise regression, thereby forming the feature linear combination of each feature subset;
the feature nonlinear combination is characterized in that, for each feature subset, a decision tree classifier using information gain as the measurement standard is trained, then a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier, and the rules are used as the result of the feature nonlinear combination.
3. The method for mining credit data of small and medium enterprises according to claim 2, wherein the feature preprocessing sets a normalized similarity measurement threshold for the multidimensional feature vector similarity measure, merges two features whose similarity measure is smaller than the threshold to form a feature subset, and iteratively updates the similarity measures between features until the feature relationships are stable;
the feature nonlinear combination takes the series of IF-THEN rules as a simple rule set, simplifies the simple rule set, and takes the simplified rules as the result of the feature nonlinear combination.
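The derivation of IF-THEN rules from root-to-leaf paths can be sketched as follows. The sketch assumes an already trained binary decision tree represented as nested dictionaries; this representation, the feature names and the leaf labels are hypothetical, and neither the information-gain training nor the rule simplification step is shown.

```python
def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit one IF-THEN rule per leaf.

    An internal node is a dict with keys "feature", "threshold", "left"
    and "right"; a leaf is a dict with the single key "label".
    """
    if "label" in node:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    # The left branch holds samples at or below the threshold, the right branch the rest.
    left_rules = tree_to_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right_rules = tree_to_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left_rules + right_rules


if __name__ == "__main__":
    tree = {
        "feature": "quick_ratio",
        "threshold": 1.0,
        "left": {"label": "default"},
        "right": {
            "feature": "loan_performance_rate",
            "threshold": 0.9,
            "left": {"label": "default"},
            "right": {"label": "normal"},
        },
    }
    for rule in tree_to_rules(tree):
        print(rule)
```

Each rule can then be evaluated on a sample to yield a binary indicator, which is one plausible way to use a rule as a nonlinear combination feature.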
4. The method for mining credit data of small and medium enterprises according to claim 1, wherein small and medium enterprise training sample credit feature data and small and medium enterprise training sample classification label data are obtained, wherein the classification label data comprises:
the enterprise is a person subject to enforcement in commercial lending litigation;
the enterprise is listed as a dishonest judgment debtor, or the actual controller of the enterprise is listed as a dishonest judgment debtor;
the enterprise has seriously defaulted on loans;
the enterprise has seriously defaulted on business contracts or notes;
the enterprise has penalty records related to its credit standing, such as for counterfeit products;
the credit feature data of the small and medium enterprise training samples is divided into classes including enterprise basic information, enterprise performance information, enterprise guarantee information and enterprise financial status, wherein
the enterprise basic information class includes: the actual shareholding ratio of the largest shareholder, enterprise annual income, ownership of the enterprise's business premises, enterprise years in operation and enterprise accumulated credit total;
the enterprise performance information class includes: enterprise historical loan performance rate, enterprise historical performance amount, maximum overdue period in the enterprise's history, enterprise business transaction performance rate, enterprise contract performance rate, enterprise credit-related complaint frequency and enterprise credit-related fine amount;
the enterprise guarantee information class includes: the effective guarantee value of the enterprise and the guarantee mode of the enterprise owner;
the enterprise financial status class includes: equity rate, total equity rate, quick ratio, cash flow equity rate, business revenue growth rate, and equity profit growth rate.
5. The method for mining credit data of small and medium enterprises according to claim 1, 2 or 4, wherein the specific implementation manner of preprocessing the original credit feature data in the training sample feature data set of the small and medium enterprises comprises:
filtering out features with an excessively high missing rate;
filling missing values of continuous features and discrete features;
testing the same-value rate of discrete features, and filtering out features with an excessively high same-value rate;
testing the volatility of continuous features, and filtering out features with too small a variance;
discretizing continuous features, namely converting continuous features with few distinct values into discrete features;
filtering feature outliers;
feature normalization.
6. The method for mining credit data of small and medium enterprises according to claim 5, wherein the feature normalization process adopts a maximum-minimum value normalization method.
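A few of the preprocessing steps recited in claims 5 and 6 can be sketched as follows: missing-rate filtering, missing value filling, removal of constant features, and max-min normalization. This is a simplified illustration under stated assumptions. The column layout (a dict of lists with None for missing values), the mean-fill strategy, the default threshold of 0.5 and all names are assumptions; the claims do not specify them.

```python
def missing_rate(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(columns, max_missing_rate=0.5):
    """Filter, fill and normalize continuous feature columns.

    columns maps a feature name to a list of numbers or None.
    """
    result = {}
    for name, values in columns.items():
        if missing_rate(values) > max_missing_rate:
            continue  # drop features with too many missing values
        present = [v for v in values if v is not None]
        if not present:
            continue  # nothing to fill from
        fill = sum(present) / len(present)  # mean fill is an assumption
        filled = [fill if v is None else v for v in values]
        if max(filled) == min(filled):
            continue  # zero variance: the feature carries no information
        result[name] = min_max_normalize(filled)
    return result


if __name__ == "__main__":
    data = {
        "annual_income": [1.0, None, 3.0],
        "fine_amount": [None, None, 5.0],
        "constant_flag": [2.0, 2.0, 2.0],
    }
    print(preprocess(data))
```

Discretization, the same-value-rate test for discrete features and outlier filtering are omitted to keep the sketch short.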
7. A small and medium enterprise credit data mining apparatus, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor being configured to invoke the machine readable program to perform the method of any of claims 1 to 6.
8. A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 6.
CN202010958951.5A 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises Active CN112085593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010958951.5A CN112085593B (en) 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010958951.5A CN112085593B (en) 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises

Publications (2)

Publication Number Publication Date
CN112085593A CN112085593A (en) 2020-12-15
CN112085593B true CN112085593B (en) 2024-03-08

Family

ID=73737011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010958951.5A Active CN112085593B (en) 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises

Country Status (1)

Country Link
CN (1) CN112085593B (en)

Also Published As

Publication number Publication date
CN112085593A (en) 2020-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant