CN112085593A - Small and medium-sized enterprise credit data mining method - Google Patents

Info

Publication number
CN112085593A
Authority
CN
China
Prior art keywords
feature
credit
enterprise
medium
small
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010958951.5A
Other languages
Chinese (zh)
Other versions
CN112085593B (en)
Inventor
崔光裕
边松华
崔乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyuan Big Data Credit Management Co Ltd
Original Assignee
Tianyuan Big Data Credit Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyuan Big Data Credit Management Co Ltd
Priority to CN202010958951.5A
Publication of CN112085593A
Application granted
Publication of CN112085593B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a credit data mining method for small and medium-sized enterprises, in the technical field of big data and credit evaluation. The method mines credit data of small and medium-sized enterprises on the basis of automated feature engineering: the original credit feature data in a training sample feature data set of small and medium-sized enterprises is preprocessed, feature subsets are formed through feature distance calculation, and linear and nonlinear feature combination is performed on each feature subset. The original credit features (processed only by data preprocessing), the linear combination features and the nonlinear combination features together serve as the training feature set; a baseline model is trained on it, the features are ranked by importance according to the training results, and the features with predictive value are selected. The method can improve the efficiency of credit feature mining for small and medium-sized enterprises, reduce manual intervention, improve the validity of the mining results, and thereby improve the accuracy of credit evaluation of small and medium-sized enterprises.

Description

Small and medium-sized enterprise credit data mining method
Technical Field
The invention relates to the technical field of big data and credit evaluation, in particular to a credit data mining method for small and medium-sized enterprises.
Background
In the field of credit evaluation of small and medium-sized enterprises, credit features are an important factor affecting the quality of the evaluation. However, because the credit risks of small and medium-sized enterprises are complex and diverse, the degree to which a given credit feature correlates with credit risk differs greatly between types of enterprise and between individual enterprises. In current credit evaluation practice, selecting and constructing credit features is difficult, the manual screening workload is excessive, and the demands on the experience of the screening personnel are very high. How to build a credit data mining system for small and medium-sized enterprises that is based on automated feature engineering and predicts well is therefore a problem in urgent need of a solution.
Disclosure of Invention
The technical task of the invention is to provide a credit data mining method for small and medium-sized enterprises that yields a result feature set with high predictive power when credit data mining and feature engineering are performed, and that provides a basis for the subsequent comprehensive credit evaluation of small and medium-sized enterprises.
The technical solution adopted by the invention to solve this technical problem is as follows:
A credit data mining method for small and medium-sized enterprises mines their credit data on the basis of automated feature engineering: the original credit feature data in a training sample feature data set of small and medium-sized enterprises is preprocessed, feature subsets are then formed through distance calculation, and linear feature combination and nonlinear feature combination are performed on each feature subset;
the original credit features (processed only by data preprocessing), the linear combination features produced by linear feature combination and the nonlinear combination features produced by nonlinear feature combination together serve as the training feature set; a baseline model is trained on it, the features are ranked by importance according to the training results, and the features with predictive value are selected.
The credit data of small and medium-sized enterprises is voluminous and complex, of unstable quality and high-dimensional, and contains many features with weak predictive power. To address these problems, the method starts from a data mining model based on automated feature engineering and combines several kinds of numerical analysis with machine learning algorithms. This improves the efficiency of credit feature mining and the validity of the mining results, and thereby the accuracy of credit evaluation of small and medium-sized enterprises. The method also overcomes the one-sidedness and subjectivity of selecting credit features on the basis of experience, and reduces the operational risk caused by limited experience.
Preferably, the preprocessing comprises feature filtering, missing value filling, discretization and normalization of the original credit feature data in the training sample feature data set of small and medium-sized enterprises;
the feature processing calculates the similarity of the multi-dimensional feature vectors and, by setting a similarity metric threshold, merges similar features to form feature subsets;
in the linear feature combination, backward stepwise regression is applied to each feature subset to train a logistic regression model for that subset, forming the linear feature combination of each feature subset;
in the nonlinear feature combination, a decision tree classifier that uses information gain as its splitting criterion is trained for each feature subset, a series of IF-THEN rules is then obtained from the paths from the root node to each leaf node of that decision tree classifier, and these rules are taken as the result of the nonlinear feature combination.
Further, the feature processing sets a normalized threshold on the similarity metric of the multi-dimensional feature vectors, merges two features whose metric is smaller than the threshold into a feature subset, and iteratively updates the similarity metrics between features until the feature relationships are stable;
in the nonlinear feature combination, the series of IF-THEN rules is taken as a simple rule set, the simple rule set is simplified, and the simplified rules are taken as the result of the nonlinear feature combination.
The rule conditions are simplified as follows: the antecedent of a single rule may include irrelevant conditions, that is, conditions that have no effect on the conclusion. Such redundant conditions, which do not affect the correctness of the rule set, can be removed to prune the rules.
Preferably, an XGBoost classifier is trained jointly on the features of the training feature set to form the baseline model; the training process is as follows:
an XGBoost classifier is used as the base model;
the parameters of the base model are tuned by automated Bayesian optimization using the HyperOpt library in Python, the AUC value of the model is taken as the performance criterion for the baseline model, and the best-performing set of hyperparameters is selected as the final model parameters to form the baseline model;
the baseline model is fitted to the training set sample data, and the number of times each feature occurs in the decision tree generated at each iteration is recorded.
Further, the numbers of occurrences of each feature over all iterations of the baseline model are summed to give the importance measure of that feature; the importance measures of all features are max-min normalized to form feature importance coefficients; the feature importance coefficients are sorted in descending order, a threshold on the coefficient is set, and only the features whose importance coefficient exceeds the threshold are retained as the result feature set.
Preferably, the credit feature data and the classification label data of the training samples of small and medium-sized enterprises are obtained, wherein an enterprise is classified as a bad sample in the classification label data on the basis of the following information:
the enterprise is a party subject to enforcement in commercial loan litigation;
the enterprise, or the actual controller of the enterprise, has been listed as a dishonest judgment debtor;
the enterprise has seriously defaulted on its loans;
the enterprise has seriously defaulted on its commercial contracts or bills;
the enterprise has a record of penalties related to its credit standing, for example for counterfeiting;
the credit feature data of the training samples of small and medium-sized enterprises is divided into an enterprise basic information class, an enterprise performance class, an enterprise guarantee information class and an enterprise financial status class, wherein
the enterprise basic information class comprises: the actual shareholding ratio of the largest shareholder, the annual revenue of the enterprise, the ownership of the enterprise's business premises, the enterprise's years in operation and the enterprise's cumulative credit amount;
the enterprise performance class comprises: the enterprise's historical loan performance rate, historical performance amount, historical maximum number of days overdue, commercial transaction performance rate, contract performance rate, number of credit-related complaints and amount of credit-related fines;
the enterprise guarantee information class comprises: the effective guarantee value of the enterprise and the guarantee method of the enterprise owner;
the enterprise financial status class comprises: return on net assets, return on total assets, asset-liability ratio, quick ratio, cash flow to liabilities ratio, revenue growth rate and net profit growth rate.
Preferably, the preprocessing of the original credit feature data in the training sample feature data set of small and medium-sized enterprises is implemented as follows:
filtering out features whose missing rate is too high;
filling the missing values of continuous and discrete features;
checking the same-value rate of discrete features and filtering out features whose same-value rate is too high;
checking the variability of continuous features and filtering out features whose variance is too small;
discretizing continuous features, that is, converting continuous features that take few distinct values into discrete features;
filtering out feature outliers;
normalizing the features.
Preferably, the feature normalization uses the max-min normalization method.
The formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are, respectively, the minimum and maximum values of the feature observed in the training samples of small and medium-sized enterprises.
The invention also claims a credit data mining device for small and medium-sized enterprises, comprising at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to invoke the machine-readable program and execute the above method.
The invention also claims a computer-readable medium storing computer instructions which, when executed by a processor, cause the processor to perform the above method.
Compared with the prior art, the credit data mining method for small and medium-sized enterprises has the following beneficial effects:
starting from a data mining model, the method provides credit data mining for small and medium-sized enterprises based on automated feature engineering, and can greatly improve the efficiency of credit feature mining;
it overcomes the one-sidedness and subjectivity of selecting credit features of small and medium-sized enterprises on the basis of experience, and reduces the operational risk caused by limited experience;
the data mining and feature engineering results it produces are highly interpretable and reusable, so the method innovates while remaining understandable and usable;
by combining several kinds of numerical analysis with machine learning algorithms, it improves the validity of the credit feature mining results and the accuracy of credit evaluation of small and medium-sized enterprises, improves the efficiency of inclusive financial services, and reduces the risk of loss from defaults by small and medium-sized enterprises;
the method can be used in scenarios such as pre-loan status evaluation, post-loan change tracking and financial anti-fraud, and effectively supports business and credit decisions.
Drawings
FIG. 1 is a flow chart of a method for mining credit data of small and medium-sized enterprises according to an embodiment of the present invention;
FIG. 2 is a block diagram of a process flow for linear combination of features provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a process for nonlinear combination of features provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a feature screening and evaluation process provided by an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
Credit evaluation of small and medium-sized enterprises is a quantitative process of defining, collecting, evaluating and analyzing data on their credit risk, and credit features are the quantitative expression of an enterprise's credit traits. Feature combination and feature selection are the two main parts of feature engineering. Data and features are the key to machine learning: they determine the upper limit of performance that a machine learning model can reach.
In general, the more features there are, the more completely the attributes of the original data can be reflected; however, more features are not always better. Feature combination refers to a family of computational methods that combine attributes of the original data to generate more expressive features. The main approaches are linear and nonlinear combination of features: linear models include logistic regression and linear regression, and nonlinear models include decision trees and neural networks. Feature selection simplifies the feature set, improves model accuracy and reduces the time the model needs to run; the fewer the features, the simpler the model, and the easier it is for researchers to understand the process that generated the data.
The patent document with application No. CN 202010055739.8 and publication No. CN111275447A discloses an online network payment fraud detection system based on automated feature engineering. Records of real-time transactions that users and merchants carry out over the network from their PCs or mobile terminals are received and aggregated by a bank data center. The bank data center screens out the required feature fields through secondary processing and provides these original features to an automated feature engineering module. The automated feature engineering module constructs features from the original online payment features to obtain the set of construction procedures for all new features, and provides this set to a fraud detection module for anomaly identification. The fraud detection module constructs the new features according to that set, feeds all features and labels into a machine learning model for judgment, lets normal transactions pass, and requires secondary identity authentication from users whose transactions are abnormal. If the secondary authentication succeeds, the user is allowed to carry out the transaction again; otherwise the user account is locked and the user is refused any transaction. That method uses a longitudinal conversion function, a transverse conversion function and a time-window conversion function for feature transformation. It aims to transform single features and strengthen their expressive power, and does not provide a method for combining and screening multiple features.
The embodiment of the invention provides a credit data mining method for small and medium-sized enterprises that mines their credit data on the basis of automated feature engineering. The original credit feature data in the training sample feature data set of small and medium-sized enterprises is preprocessed, the preprocessing comprising feature filtering, missing value filling, discretization and normalization of that data;
feature processing is then performed: the similarity of the multi-dimensional feature vectors is calculated and, by setting a similarity metric threshold, similar features are merged to form feature subsets;
linear feature combination and nonlinear feature combination are performed on the feature subsets. In the linear feature combination, backward stepwise regression is applied to each feature subset to train a logistic regression model for that subset, forming the linear feature combination of each feature subset. In the nonlinear feature combination, a decision tree classifier that uses information gain as its splitting criterion is trained for each feature subset, a series of IF-THEN rules is then obtained from the paths from the root node to each leaf node of that decision tree classifier, and these rules are taken as the result of the nonlinear feature combination;
the original credit features (processed only by data preprocessing), the linear combination features and the nonlinear combination features together serve as the training feature set, on which an XGBoost classifier is trained to form the baseline model; the features are ranked by importance according to the training results, and the features with predictive value are selected.
In this embodiment, the data preprocessing step is implemented as follows:
the credit feature data and the classification label data of the training samples of small and medium-sized enterprises are obtained, wherein an enterprise is classified as a bad sample in the classification label data on the basis of the following information:
the enterprise is a party subject to enforcement in commercial loan litigation;
the enterprise, or the actual controller of the enterprise, has been listed as a dishonest judgment debtor;
the enterprise has seriously defaulted on its loans;
the enterprise has seriously defaulted on its commercial contracts or bills;
the enterprise has a record of penalties related to its credit standing, for example for counterfeiting;
the credit feature data of the training samples of small and medium-sized enterprises is divided into an enterprise basic information class, an enterprise performance class, an enterprise guarantee information class and an enterprise financial status class, wherein
the enterprise basic information class comprises: the actual shareholding ratio of the largest shareholder, the annual revenue of the enterprise, the ownership of the enterprise's business premises, the enterprise's years in operation and the enterprise's cumulative credit amount;
the enterprise performance class comprises: the enterprise's historical loan performance rate, historical performance amount, historical maximum number of days overdue, commercial transaction performance rate, contract performance rate, number of credit-related complaints and amount of credit-related fines;
the enterprise guarantee information class comprises: the effective guarantee value of the enterprise and the guarantee method of the enterprise owner;
the enterprise financial status class comprises: return on net assets, return on total assets, asset-liability ratio, quick ratio, cash flow to liabilities ratio, revenue growth rate and net profit growth rate.
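As an illustration of how the bad-sample criteria listed above could become a binary training label, here is a minimal sketch. The field names are hypothetical (the text gives the criteria but no data schema), and treating any single criterion as sufficient is an assumption.

```python
# Hypothetical boolean fields, one per bad-sample criterion in the text.
BAD_SAMPLE_FLAGS = (
    "enforced_in_loan_litigation",
    "listed_dishonest_debtor",        # the enterprise or its actual controller
    "serious_loan_default",
    "serious_contract_or_bill_default",
    "credit_related_penalty",
)

def label_sample(record: dict) -> int:
    """Return 1 (bad sample) if any criterion is flagged, else 0 (good sample).

    Assumption: one satisfied criterion is enough; missing fields count as False.
    """
    return int(any(record.get(flag, False) for flag in BAD_SAMPLE_FLAGS))
```

For example, `label_sample({"serious_loan_default": True})` yields the bad-sample label, while a record with no flags set yields the good-sample label.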
In this embodiment, the credit feature data of small and medium-sized enterprises is preprocessed as follows:
filtering out features whose missing rate is too high;
filling the missing values of continuous and discrete features;
checking the same-value rate of discrete features and filtering out features whose same-value rate is too high;
checking the variability of continuous features and filtering out features whose variance is too small;
discretizing continuous features, that is, converting continuous features that take few distinct values into discrete features;
filtering out feature outliers;
normalizing the features.
The feature normalization uses the max-min normalization method.
The formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are, respectively, the minimum and maximum values of the feature observed in the training samples of small and medium-sized enterprises.
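The preprocessing steps for one continuous feature can be sketched as follows. This is a minimal illustration, not the patented implementation: the threshold values, the use of `None` for missing values and mean filling are all assumptions, since the text does not specify them.

```python
from collections import Counter
from statistics import pvariance

def missing_rate(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def same_value_rate(values):
    """Share of the most frequent non-missing value (for discrete features)."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][1] / len(present)

def min_max(values):
    """Max-min normalization: (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def preprocess_continuous(values, max_missing=0.5, min_variance=1e-6):
    """Return the normalized column, or None if the feature is filtered out.

    max_missing and min_variance are hypothetical thresholds.
    """
    if missing_rate(values) > max_missing:
        return None                   # missing rate too high
    present = [v for v in values if v is not None]
    fill = sum(present) / len(present)            # mean filling (assumption)
    filled = [fill if v is None else v for v in values]
    if pvariance(filled) < min_variance:
        return None                   # variance too small
    return min_max(filled)
```

A discrete feature would instead be checked with `same_value_rate` against a similar, equally hypothetical, threshold.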
The feature processing calculates the similarity of the multi-dimensional feature vectors and forms feature subsets by setting a similarity threshold. The similarity metric of the multi-dimensional feature vectors is calculated by the following formula:
[Formula image BDA0002679698650000071 not reproduced: similarity metric of multi-dimensional feature vectors]
After the similarity of the multi-dimensional feature vectors has been calculated, the larger the cosine of the angle between two features, the smaller the correlation between them, and vice versa. Based on the cosine of the angle between features, the normalized cosine threshold is set to 0.65; two features whose distance is less than the threshold are merged into one feature set, and the similarity metrics of the multi-dimensional feature vectors are updated until the feature relationships are stable.
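The merging procedure can be sketched as a simple agglomerative loop. The sketch assumes that the distance between two features is one minus the cosine similarity of their value vectors (so a smaller distance means more similar features) and that a merged group is represented by the weighted mean of its members; neither detail is stated in the text, and only the 0.65 threshold comes from it.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumed distance definition."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0                    # treat zero vectors as maximally distant
    return 1.0 - dot / norm

def group_features(columns, threshold=0.65):
    """columns: dict mapping feature name -> list of values.

    Repeatedly merges the closest pair of groups whose distance is below the
    threshold, then recomputes distances, until no pair qualifies.
    """
    groups = [([name], list(col)) for name, col in columns.items()]
    while True:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = cosine_distance(groups[i][1], groups[j][1])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:              # relationships are stable
            return [sorted(names) for names, _ in groups]
        _, i, j = best
        ni, nj = len(groups[i][0]), len(groups[j][0])
        centroid = [(a * ni + b * nj) / (ni + nj)
                    for a, b in zip(groups[i][1], groups[j][1])]
        merged = (groups[i][0] + groups[j][0], centroid)
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
```

Each returned list of names is one feature subset on which the later combination steps would operate.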
In the linear feature combination step, backward stepwise regression is applied to each feature subset to train a logistic regression model for that subset, forming the linear feature combination of each feature subset. The logistic regression model is trained with backward stepwise regression as follows:
first, all features are put into the model;
second, the removal of one feature from the model is tried, and whether the amount of variation in the target variable explained by the whole model changes significantly is judged on the basis of the F-test, the t-test and model evaluation metrics;
third, the feature whose removal reduces the explained variation of the target variable the least is removed;
fourth, the procedure is iterated until no feature meets the removal condition, yielding the logistic regression model of the feature subset, which has the following form:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$
where $z = w \cdot x + b$,
w is the vector of logistic regression coefficients corresponding to the features, and b is a constant intercept term.
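The backward elimination loop and the resulting combined feature can be sketched as below. Note the substitution: instead of the F-test and t-test named in the text, this sketch takes a caller-supplied `score_fn` (hypothetical, for example a validation AUC of a model fitted on the given features) and a hypothetical tolerance `max_drop`.

```python
import math

def backward_eliminate(features, score_fn, max_drop=0.01):
    """Start from all features; repeatedly remove the feature whose removal
    lowers the score the least, and stop when even that removal would lower
    the score by more than max_drop."""
    selected = list(features)
    current = score_fn(selected)
    while len(selected) > 1:
        candidates = []
        for f in selected:
            reduced = [g for g in selected if g != f]
            candidates.append((current - score_fn(reduced), f, reduced))
        drop, _, reduced = min(candidates, key=lambda c: c[0])
        if drop > max_drop:
            break                     # no feature meets the removal condition
        selected, current = reduced, score_fn(reduced)
    return selected

def linear_combination_feature(w, b, x):
    """Logistic output 1 / (1 + e^(-z)) with z = w . x + b, used as the
    linear-combination feature of one feature subset."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

In practice `score_fn` would refit the logistic regression on each candidate feature list; it is left abstract here so that the elimination logic stays visible.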
In the nonlinear feature combination step, a decision tree classifier that uses information gain as its splitting criterion is first trained for each feature subset. Assume that $P(X = 1) = p$ and $P(X = 0) = 1 - p$, that $D_1, D_2, D_3, \ldots, D_n$ are the n subsets into which data set D is divided according to the values of the feature, and that $|D|$ is the number of samples in the data set. The information entropy is:
$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
The information gain of a discrete feature A on D is:
$$\mathrm{gain}(A) = H(D) - H(D \mid A)$$
where
$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$
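The entropy and information gain expressions can be checked numerically with a short sketch. The function names are illustrative; labels and feature values are passed as plain lists.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(D) = -sum over classes of p * log2(p); for binary labels this
    reduces to H(p) = -p log2 p - (1 - p) log2 (1 - p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """gain(A) = H(D) - H(D|A), where H(D|A) = sum_i |D_i| / |D| * H(D_i)
    and the D_i are the subsets of D sharing one value of feature A."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    n = len(labels)
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional
```

A feature that separates the two classes perfectly recovers the full entropy of the labels as gain, while a feature independent of the labels has a gain of zero.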
Then, a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier of the feature subset, and these rules are taken as the result of the nonlinear feature combination. The IF part contains all the tests along one path, and the THEN part is the final classification. The rules are converted as follows:
1) obtain the simple rules;
2) simplify the rule conditions: the antecedent of a single rule may include irrelevant conditions, that is, conditions that have no effect on the conclusion. Redundant conditions that do not affect the correctness of the rule set can be deleted to simplify the rules;
3) the criterion for simplifying a rule is as follows: let rule R be "IF A THEN class C"; the simplified rule R' is "IF A' THEN class C", where A = A' ∪ X and condition X has no effect on the conclusion "class C".
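The conversion from tree paths to rules, and one possible way to test whether a condition X "has no effect", can be sketched as follows. The nested-dict tree representation is hypothetical, and the redundancy check (dropping X must leave every covered training sample in class C) is an assumed interpretation of the simplification criterion.

```python
def extract_rules(node, path=()):
    """Walk every root-to-leaf path; each path yields one
    (conditions, label) rule, i.e. IF conditions THEN label."""
    if "label" in node:
        return [(list(path), node["label"])]
    f, t = node["feature"], node["threshold"]
    return (extract_rules(node["left"], path + ((f, "<=", t),))
            + extract_rules(node["right"], path + ((f, ">", t),)))

def satisfies(sample, conditions):
    """True if the sample meets every (feature, op, threshold) condition."""
    return all(sample[f] <= t if op == "<=" else sample[f] > t
               for f, op, t in conditions)

def simplify_rule(conditions, label, samples, labels):
    """Drop each condition X whose removal still leaves every sample
    covered by the remaining conditions in class `label`."""
    kept = list(conditions)
    for cond in list(conditions):
        trial = [c for c in kept if c != cond]
        covered = [y for s, y in zip(samples, labels) if satisfies(s, trial)]
        if covered and all(y == label for y in covered):
            kept = trial
    return kept
```

Each simplified rule can then be turned into a 0/1 feature by evaluating `satisfies` on a sample.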
In the feature screening and evaluation step, the original credit features (processed only by data preprocessing), the linear combination features and the nonlinear combination features are first used together as the training feature set to train the XGBoost classifier. The training process is as follows:
1) an XGBoost classifier is used as the base model;
2) the parameters of the base model are tuned by automated Bayesian optimization using the HyperOpt library in Python, the AUC value of the model is taken as the performance criterion for the baseline model, and the best-performing set of hyperparameters is selected as the final model parameters to form the baseline model;
3) the baseline model is fitted to the training set sample data, and the number of times each feature occurs in the decision tree generated at each iteration is recorded.
Then, the numbers of occurrences of each feature over all iterations of the baseline model are summed to give the importance measure of that feature; the importance measures of all features are max-min normalized to form feature importance coefficients; the feature importance coefficients are sorted in descending order, a threshold on the coefficient is set, and only the features whose importance coefficient exceeds the threshold are retained as the result feature set.
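The counting, normalization and thresholding of feature importance can be sketched without any model library. The input format (one list of split-feature names per boosting iteration) and the threshold value are assumptions made for illustration; with a fitted XGBoost model, comparable split counts are what the "weight" importance type reports.

```python
from collections import Counter

def importance_coefficients(trees):
    """trees: one list of split-feature names per boosting iteration.

    Sums each feature's occurrences over all iterations, then applies
    max-min normalization to obtain importance coefficients in [0, 1].
    """
    counts = Counter(f for tree in trees for f in tree)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1             # all counts equal: avoid dividing by zero
    return {f: (c - lo) / span for f, c in counts.items()}

def select_features(trees, threshold=0.2):
    """Rank features by coefficient (descending) and keep those strictly
    above the (hypothetical) threshold."""
    coeffs = importance_coefficients(trees)
    ranked = sorted(coeffs.items(), key=lambda kv: kv[1], reverse=True)
    return [f for f, c in ranked if c > threshold]
```

Features that never appear in any tree are absent from the counts and are therefore never selected.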
Processing and screening the credit features of small and medium-sized enterprises often requires a large amount of manual intervention, involves feature dimensions that are numerous and complex, and yields credit features of unstable quality, so that the credit evaluation results for such enterprises are inaccurate. To address these problems, the method introduces automatic feature engineering, which removes the restriction on the number of feature types, reduces manual intervention, and obtains credit features of small and medium-sized enterprises that are as good as possible.
An embodiment of the invention also provides a device for mining credit data of small and medium-sized enterprises, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to invoke the machine-readable program to execute the method for mining credit data of small and medium-sized enterprises in the above embodiments of the invention.
The embodiment of the invention also provides a computer readable medium, which stores computer instructions, and when the computer instructions are executed by a processor, the processor executes the method for mining credit data of the medium and small enterprises in the embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
While the invention has been shown and described in detail in the drawings and in the preferred embodiments, it is not intended to limit the invention to the embodiments disclosed, and it will be apparent to those skilled in the art that various combinations of the code auditing means in the various embodiments described above may be used to obtain further embodiments of the invention, which are also within the scope of the invention.

Claims (10)

1. A method for mining credit data of small and medium-sized enterprises, characterized in that the mining of the credit data of small and medium-sized enterprises is realized on the basis of automatic feature engineering: the original credit feature data in a training sample feature data set of small and medium-sized enterprises are preprocessed, feature subsets are formed through feature distance calculation, and feature linear combination and feature nonlinear combination are performed on the feature subsets;
the original credit features that have undergone only data preprocessing, the linear combination features produced by feature linear combination, and the nonlinear combination features produced by feature nonlinear combination are used as the training feature set; a baseline model is formed by training; the features are ranked by importance according to the training results; and the features with predictive value are selected.
2. The method for mining credit data of small and medium-sized enterprises according to claim 1, wherein the preprocessing comprises performing feature filtering, missing value filling, discretization and normalization on the original credit feature data in the training sample feature data set of small and medium-sized enterprises;
the feature processing calculates the similarity of multi-dimensional feature vectors and, by setting a similarity metric threshold, merges similar features to form feature subsets;
in the feature linear combination, for each feature subset, a logistic regression model is trained by backward stepwise regression to form the linear feature combination of that feature subset;
in the feature nonlinear combination, for each feature subset, a decision tree classifier using information gain as its splitting criterion is trained, then a series of IF-THEN rules is obtained from the paths from the root node to each leaf node of the decision tree classifier, and these rules are taken as the result of the feature nonlinear combination.
3. The method according to claim 2, wherein the feature processing sets a normalized similarity metric threshold for the similarity metric of the multi-dimensional feature vectors, merges two features whose metric is smaller than the threshold to form a feature subset, and iteratively updates the similarity metric between features until the feature relationships are stable;
in the feature nonlinear combination, the series of IF-THEN rules is taken as a simple rule set, the simple rule set is simplified, and the simplified rules are taken as the result of the feature nonlinear combination.
4. The method for mining credit data of small and medium-sized enterprises according to claim 1, 2 or 3, characterized in that an XGBoost classifier is trained jointly on the features of the training feature set to form the baseline model; the training process is as follows:
an XGBoost classifier is used as the base model;
the hyper-parameters of the base model are tuned by automatic Bayesian optimization using the HyperOpt library in Python, with the AUC value of the model as the evaluation criterion for the baseline model, and the best set of hyper-parameters is selected as the final model parameters to form the baseline model;
the training-set sample data are fitted with the baseline model, and the number of times each feature appears in the decision tree generated at each iteration is recorded.
5. The method of claim 4, wherein, for each feature, the numbers of occurrences over all iterations of the baseline model are summed to give the feature importance measure of that feature; the importance measures of all features are max-min normalized to form feature importance coefficients; and a threshold for the feature importance coefficient is set, and only the features whose importance coefficient is larger than the threshold are retained as the resulting feature set.
6. The method for mining credit data of small and medium-sized enterprises according to claim 1, wherein training sample credit feature data and training sample classification label data of small and medium-sized enterprises are obtained, and wherein the enterprises classified as bad samples in the training sample classification label data include enterprises for which:
the enterprise is a person subject to enforcement in commercial loan litigation;
the enterprise has been listed as a dishonest judgment debtor, or the actual controller of the enterprise has been listed as a dishonest judgment debtor;
the enterprise has seriously defaulted on a loan;
the enterprise has seriously defaulted on business contracts or bills;
the enterprise has a record of penalties related to its credit standing, such as counterfeiting;
the training sample credit feature data of the small and medium-sized enterprises are classified into types including an enterprise basic information class, an enterprise performance status class, an enterprise security information class and an enterprise financial status class, wherein
the enterprise basic information class includes: the actual shareholding ratio of the largest shareholder, the annual income of the enterprise, the ownership of the enterprise's business premises, the number of years the enterprise has been operating, and the cumulative credit amount of the enterprise;
the enterprise performance status class includes: the enterprise's historical loan fulfillment rate, historical fulfillment amount, historical maximum number of overdue days, business transaction fulfillment rate, contract fulfillment rate, number of credit-related complaints, and amount of credit-related fines;
the enterprise security information class includes: the effective guarantee value of the enterprise and the guarantee mode of the enterprise owner;
the enterprise financial status class includes: return on net assets, return on total assets, asset-liability ratio, quick ratio, cash flow liability ratio, revenue growth rate, and net profit growth rate.
7. The method for mining credit data of small and medium-sized enterprises according to claim 1, 2 or 6, wherein preprocessing the original credit feature data in the training sample feature data set of small and medium-sized enterprises comprises:
filtering out features whose missing rate is too high;
filling the missing values of continuous features and discrete features;
checking the same-value rate of discrete features and filtering out features whose same-value rate is too high;
checking the fluctuation of continuous features and filtering out features whose variance is too small;
discretizing continuous features, namely converting continuous features that take few distinct values into discrete features;
filtering out abnormal feature values;
performing feature normalization.
8. The method as claimed in claim 7, wherein the feature normalization process adopts a max-min normalization method.
9. A credit data mining device for small and medium-sized enterprises is characterized by comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor, configured to invoke the machine readable program to perform the method of any of claims 1 to 8.
10. Computer readable medium, characterized in that it has stored thereon computer instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 8.
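As an illustration only, several of the preprocessing steps recited in claim 7 and the max-min normalization of claim 8 can be sketched as below. The claims give no thresholds or filling strategy, so the thresholds, the mean filling and all names here are assumptions; for brevity the sketch treats every column as numeric and omits discretization and outlier filtering.

```python
from collections import Counter
from statistics import pvariance
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value


def missing_rate(col: Column) -> float:
    """Fraction of entries that are missing."""
    return sum(v is None for v in col) / len(col)


def same_value_rate(values: List[float]) -> float:
    """Fraction of entries equal to the most frequent value."""
    return Counter(values).most_common(1)[0][1] / len(values)


def min_max(values: List[float]) -> List[float]:
    """Max-min normalization to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(table: Dict[str, Column],
               max_missing: float = 0.5,   # assumed threshold (must be < 1)
               max_same: float = 0.95,     # assumed threshold
               min_var: float = 1e-6       # assumed threshold
               ) -> Dict[str, List[float]]:
    result: Dict[str, List[float]] = {}
    for name, col in table.items():
        if missing_rate(col) > max_missing:
            continue  # filter features with too many missing values
        present = [v for v in col if v is not None]
        mean = sum(present) / len(present)  # mean filling is an assumption
        filled = [mean if v is None else v for v in col]
        if same_value_rate(filled) > max_same:
            continue  # filter features dominated by a single value
        if pvariance(filled) < min_var:
            continue  # filter features with too little variance
        result[name] = min_max(filled)
    return result


# Hypothetical feature table with three columns.
table = {
    "quick_ratio": [1.0, None, 3.0, 2.0],
    "mostly_missing": [None, None, None, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0],
}
cleaned = preprocess(table)
print(cleaned)
```

Here `mostly_missing` is dropped by the missing-rate filter and `constant` by the same-value filter, while `quick_ratio` is filled with its mean and rescaled to the interval [0, 1].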
CN202010958951.5A 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises Active CN112085593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010958951.5A CN112085593B (en) 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises

Publications (2)

Publication Number Publication Date
CN112085593A (en) 2020-12-15
CN112085593B (en) 2024-03-08

Family

ID=73737011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010958951.5A Active CN112085593B (en) 2020-09-14 2020-09-14 Credit data mining method for small and medium enterprises

Country Status (1)

Country Link
CN (1) CN112085593B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8386377B1 (en) * 2003-05-12 2013-02-26 Id Analytics, Inc. System and method for credit scoring using an identity network connectivity
CN103761426A (en) * 2014-01-02 2014-04-30 中国科学院数学与系统科学研究院 Method and system for quickly recognizing feature combinations in high-dimensional data
US20170213280A1 (en) * 2016-01-27 2017-07-27 Huawei Technologies Co., Ltd. System and method for prediction using synthetic features and gradient boosted decision tree
CN108475393A (en) * 2016-01-27 2018-08-31 华为技术有限公司 The system and method that decision tree is predicted are promoted by composite character and gradient
CN111652291A (en) * 2020-05-18 2020-09-11 温州医科大学 Method for establishing student growth portrait based on group sparse fusion hospital big data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
时晨: "基于高维性数据特征驱动的网贷信用风险评价研究", 《中国优秀硕士学位论文全文数据库 经济与管理科学辑》, no. 6, 15 June 2020 (2020-06-15), pages 157 - 69 *
李勇: "面向信用风险预测的特征工程研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 1, 15 January 2022 (2022-01-15), pages 140 - 282 *
游文杰;吉国力;袁明顺;: "高维少样本数据的特征压缩", 计算机工程与应用, no. 36, 21 December 2009 (2009-12-21), pages 165 - 169 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861064A (en) * 2021-01-20 2021-05-28 重庆第二师范学院 Social credit evaluation source data processing method, system, terminal and medium
CN112861064B (en) * 2021-01-20 2023-02-03 重庆第二师范学院 Social credit evaluation source data processing method, system, terminal and medium
CN113538132A (en) * 2021-07-26 2021-10-22 天元大数据信用管理有限公司 Credit scoring method, device and medium based on regression tree algorithm
CN113538132B (en) * 2021-07-26 2024-04-23 天元大数据信用管理有限公司 Credit scoring method, equipment and medium based on regression tree algorithm

Also Published As

Publication number Publication date
CN112085593B (en) 2024-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant