CN112330441A - Method for evaluating business value credit loan of medium and small enterprises - Google Patents


Info

Publication number
CN112330441A
Authority
CN
China
Prior art keywords
data
credit
enterprises
sample
enterprise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011261813.8A
Other languages
Chinese (zh)
Inventor
金昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chenxin Credit Information Co ltd
Original Assignee
Beijing Chenxin Credit Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chenxin Credit Information Co ltd filed Critical Beijing Chenxin Credit Information Co ltd
Priority to CN202011261813.8A priority Critical patent/CN112330441A/en
Publication of CN112330441A publication Critical patent/CN112330441A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Medical Informatics (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

A business value credit loan evaluation method for small and medium-sized enterprises. The period from the earliest to the latest credit grant in the total valid sample data set is selected as the time window, and the window is used to determine the level interval in which the enterprise default frequency or trend is stable. A database table is established from the business indicators of enterprise financing applications, and enterprises that were refused credit are compared on those indicators with enterprises that actually received loan disbursement. A data mining algorithm then yields a business value credit loan evaluation model that distinguishes whether an enterprise can obtain bank credit, and the model is used to identify whether an enterprise applying for financing meets the bank's credit granting requirements. The method evaluates business value credit loans for small and medium-sized enterprises on the basis of historical data, identifies enterprises that have a high probability of obtaining a loan from the current cooperating banks, raises the loan approval rate of enterprises applying for financing through the platform, and addresses the difficulty, high cost and slowness of financing for small and medium-sized enterprises.

Description

Method for evaluating business value credit loan of medium and small enterprises
Technical Field
The invention relates to the technical field of data processing, in particular to a business value credit loan evaluation method for small and medium-sized enterprises.
Background
At present, business value credit loans for small and medium-sized enterprises have been created to address the financing difficulties of those enterprises. Such a loan is issued on the strength of the borrower's credit: the borrower obtains the loan without providing collateral or a third-party guarantee, and the borrower's creditworthiness serves as the guarantee of repayment. Because this form of lending carries higher risk, the borrower's economic performance, operational management level, development prospects and similar conditions are generally examined in detail to reduce the risk.
Management platforms for such loans have already appeared. For example, a pilot reform of business value credit loans for small and medium-sized enterprises has been started by the credit committee in Chongqing, offering credit without mortgage, pledge or guarantee. Among the first pilot cooperating banks, the Chongqing branch of Bank of China and the Chongqing branch of Guangdong Bank have taken a number of measures to put the policy into practice and to bring tangible benefits to private enterprises.
Credit risk control is indispensable to business value credit loans for small and medium-sized enterprises. In its early stage the management platform relied mainly on an enterprise financing evaluation model built from expert experience. As the financing and credit data accumulated by the platform grow richer, an important goal for the platform's new stage of development is to train a model that effectively identifies good enterprises with a high probability of obtaining a loan from the current cooperating banks. How to evaluate business value credit loans for small and medium-sized enterprises on the basis of historical data has therefore become a technical problem to be solved urgently.
Disclosure of Invention
The invention therefore provides a business value credit loan evaluation method for small and medium-sized enterprises. The method meets the evaluation requirements on the basis of historical data, effectively identifies enterprises that have a high probability of obtaining a loan from the current cooperating banks, and addresses the difficulty, high cost and slowness of financing for small and medium-sized enterprises.
To achieve the above purpose, the invention provides the following technical solution: a business value credit loan evaluation method for small and medium-sized enterprises, comprising the following steps:
(1) Data preparation: select enterprise data with a history of financing loan applications as the original sample data and, according to the status of the credit feedback information, define an approved credit grant as a positive sample and a refused credit grant as a negative sample;
(2) Sorting the positive and negative sample sets: review the original sample data, and clean and sort out samples whose credit feedback information is uncertain, as well as samples that are actually positive according to the feedback information but are shown as negative in the credit business status feedback;
(3) Determining the enterprise positive and negative sample sets: take as the total valid sample data set the records in the original sample data that show a definite financing application and for which definite feedback on whether the bank granted credit has been obtained, and label the data in that set according to the definitions of positive and negative samples;
(4) Sample extraction: draw a sample set and a test set from the total valid sample data set by stratified random sampling;
(5) Time window determination statistics: select the period from the earliest to the latest credit grant in the total valid sample data set as the time window, and use the time window to determine the level interval in which the enterprise default frequency or trend is stable;
(6) Constructing the business value credit loan evaluation model: establish a database table from the business indicators of enterprise financing applications, compare enterprises that were refused credit with enterprises that actually received loan disbursement on those indicators, use a data mining algorithm to obtain a business value credit loan evaluation model that distinguishes whether an enterprise can obtain bank credit, and use the model to identify whether an enterprise applying for financing meets the bank's credit granting requirements.
As a preferred scheme of the business value credit loan evaluation method for small and medium-sized enterprises, in determining the enterprise positive and negative sample sets, sample data whose bank feedback shows actual loan disbursement are set as positive samples, and sample data whose bank feedback is refusal of disbursement or refusal of pre-credit are technically assumed to be negative samples.
As a preferred scheme of the method, the enterprise data comprise tax data, financial data, social security data, patent data, software copyright data, trademark data, real estate data, electric power data, customs data and special fund data.
As a preferred scheme of the method, derived variable generation, missing value processing, extreme value identification and processing, and key variable discovery are performed on the total valid sample data set.
As a preferred scheme of the method, the derived variables comprise total house area, mortgaged house area, total land area, mortgaged land area, patent count, software copyright count, trademark count, quarterly electricity consumption and quarterly electricity charges paid.
As a preferred scheme of the method, the missing value processing is applied to the tax data, and missing values are filled with 0, the mean, the median or the mode.
As a preferred scheme of the method, the extreme values are identified as follows: a decision tree is used to find continuous nodes containing a preset number of observations; and a clustering algorithm is used to divide the data into several subsets, with clusters containing a preset number of observations regarded as extreme values.
The extreme values are processed as follows: when the number of extreme values is less than a preset number, the extreme values are deleted; when the number of extreme values reaches the preset number, the extreme values are replaced by the mean.
As a preferred scheme of the method, the key variable discovery uses a data mining algorithm to mine and analyse all financing-related quantitative data and obtain the importance of each variable in model identification and classification prediction; the importance is expressed as the importance frequency with which enterprises are accurately predicted.
As a preferred scheme of the method, one-hot encoding is used in constructing the business value credit loan evaluation model, so that the values of discrete features are expanded into Euclidean space.
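The one-hot step can be illustrated with a minimal sketch. The function name `one_hot` and the industry labels are hypothetical and are not taken from the patent; the sketch only shows how each discrete value becomes a 0/1 vector, so that every category sits at the same distance from every other in Euclidean space.

```python
def one_hot(values):
    """Map each discrete value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one position is set per value
        vectors.append(row)
    return categories, vectors


# Hypothetical industry labels, used only for illustration.
categories, vectors = one_hot(["manufacturing", "retail", "software", "retail"])
```

Encoding categories as integers instead (0, 1, 2) would impose an ordering that the categories do not have, which is the usual reason for preferring one-hot vectors.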
As a preferred scheme of the method, the business value credit loan evaluation model is validated and evaluated using the ROC curve, the AUC value and the P-R (precision-recall) method;
and a master scale is developed so that enterprise financing matching services are managed by grade and class.
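As a rough illustration of the evaluation measures named above, the following sketch computes the AUC and one precision-recall point in plain Python. The function names, labels and scores are hypothetical; the patent does not state how the measures were computed.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores at or above the threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical labels (1 = credit granted) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores, threshold=0.5)
```

Sweeping the threshold over all score values traces out the P-R curve; the pairwise AUC above is quadratic in the sample count and is meant for illustration rather than for large data sets.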
The invention has the following advantages. Enterprise data with a history of financing loan applications are selected as the original sample data and, according to the status of the credit feedback information, an approved credit grant is defined as a positive sample and a refused credit grant as a negative sample. The original sample data are reviewed, and samples whose credit feedback information is uncertain, or that are actually positive but are shown as negative in the credit business status feedback, are cleaned and sorted out. Records that show a definite financing application and definite feedback on whether the bank granted credit form the total valid sample data set, which is labeled according to the definitions of positive and negative samples. A sample set and a test set are drawn from the total valid sample data set by stratified random sampling. The period from the earliest to the latest credit grant in the set is selected as the time window and is used to determine the level interval in which the enterprise default frequency or trend is stable. A database table is established from the business indicators of enterprise financing applications, enterprises that were refused credit are compared with enterprises that actually received loan disbursement on those indicators, a data mining algorithm yields a business value credit loan evaluation model that distinguishes whether an enterprise can obtain bank credit, and the model identifies whether an enterprise applying for financing meets the bank's credit granting requirements.
The method evaluates business value credit loans for small and medium-sized enterprises on the basis of historical data and effectively identifies enterprises that have a high probability of obtaining a loan from the current cooperating banks. Creditworthy enterprises that are likely to obtain bank credit can therefore be pushed to the cooperating banks in time, which raises the loan approval rate of enterprises applying for financing through the platform and addresses the difficulty, high cost and slowness of financing for small and medium-sized enterprises.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
FIG. 1 is a schematic diagram of a method for evaluating business value credit of a medium-sized or small-sized enterprise according to an embodiment of the present invention;
FIG. 2 is a statistical chart of time window determination in the business value credit evaluation of medium and small enterprises provided in the embodiment of the present invention;
FIG. 3 is a diagram illustrating extreme value identification and processing in the business value credit evaluation of medium and small enterprises according to an embodiment of the present invention.
Detailed Description
The present invention is described in terms of particular embodiments, other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure, and it is to be understood that the described embodiments are merely exemplary of the invention and that it is not intended to limit the invention to the particular embodiments disclosed. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, a business value credit loan evaluation method for small and medium-sized enterprises is provided, comprising the following steps:
S1, data preparation: select enterprise data with a history of financing loan applications as the original sample data and, according to the status of the credit feedback information, define an approved credit grant as a positive sample and a refused credit grant as a negative sample;
S2, sorting the positive and negative sample sets: review the original sample data, and clean and sort out samples whose credit feedback information is uncertain, as well as samples that are actually positive according to the feedback information but are shown as negative in the credit business status feedback;
S3, determining the enterprise positive and negative sample sets: take as the total valid sample data set the records in the original sample data that show a definite financing application and for which definite feedback on whether the bank granted credit has been obtained, and label the data in that set according to the definitions of positive and negative samples;
S4, sample extraction: draw a sample set and a test set from the total valid sample data set by stratified random sampling;
S5, time window determination statistics: select the period from the earliest to the latest credit grant in the total valid sample data set as the time window, and use the time window to determine the level interval in which the enterprise default frequency or trend is stable;
S6, constructing the business value credit loan evaluation model: establish a database table from the business indicators of enterprise financing applications, compare enterprises that were refused credit with enterprises that actually received loan disbursement on those indicators, use a data mining algorithm to obtain a business value credit loan evaluation model that distinguishes whether an enterprise can obtain bank credit, and use the model to identify whether an enterprise applying for financing meets the bank's credit granting requirements.
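Steps S1 to S3 amount to filtering and labeling records, which can be sketched as follows. The status strings, field names and the rule of keeping one row per enterprise are assumptions made for illustration; the patent does not list the platform's actual feedback values or its deduplication rule.

```python
# Hypothetical status strings; the platform's real feedback values are not given.
POSITIVE_STATUSES = {"disbursed"}
NEGATIVE_STATUSES = {"disbursement_refused", "pre_credit_refused"}


def label_samples(raw_records):
    """Keep records with a definite bank decision and attach label 1 or 0."""
    labeled = {}
    for rec in raw_records:
        status = rec["bank_feedback"]
        if status in POSITIVE_STATUSES:
            label = 1
        elif status in NEGATIVE_STATUSES:
            label = 0
        else:
            continue  # uncertain feedback: excluded from the valid sample set
        # Assumption: a later record for the same enterprise replaces the
        # earlier one, so repeat applications collapse to a single row.
        labeled[rec["enterprise_id"]] = {**rec, "label": label}
    return list(labeled.values())


raw = [
    {"enterprise_id": "A", "bank_feedback": "disbursed"},
    {"enterprise_id": "B", "bank_feedback": "under_review"},
    {"enterprise_id": "C", "bank_feedback": "pre_credit_refused"},
    {"enterprise_id": "A", "bank_feedback": "disbursed"},
]
valid = label_samples(raw)
```

In the embodiment the ambiguous cases were resolved by manual review rather than by a fixed rule, so a lookup like this would only cover the records whose status is already unambiguous.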
The data in this embodiment come from the Chongqing business value credit loan management platform. Enterprise data with a financing loan application history on the platform were selected as the original sample data, covering 9,178 enterprises. As of August 2020, after sorting and manual judgment, 4,850 enterprises in the original sample data were found to have a definite financing application and definite feedback on whether the bank granted credit, and these form the total valid sample data set.
For the 9,178 enterprise credit records in the financing sample set that have actual credit feedback from a bank, each record was reviewed manually. Samples whose credit feedback was ambiguous, and special samples that were actually positive but were shown as negative (credit refused) in the platform's credit business status feedback, were cleaned and sorted out. This ensures as far as possible that the final valid samples are correctly assigned as positive or negative and supports the accuracy of machine learning during model construction.
After sorting, the uncertain credit granting statuses found in the original credit sample data fall mainly into the following types:
[Table image not reproduced: types of uncertain credit granting status]
Comprehensive sorting and manual judgment yielded an accurate and credible set of 4,850 positive and negative samples (records from repeated applications were removed), in which 1 denotes a positive sample and 0 a negative sample. The relevant data table was output after sorting, and the raw data of these 4,850 valid samples were used as model input data for model training and machine learning.
[Table images not reproduced: business list of the positive and negative sample set]
Note: only the first 20 entries of the business list of the 4,850 valid positive and negative samples are shown here for illustration.
Samples whose bank feedback shows actual loan disbursement are set as positive samples. Because no actual default samples currently exist on the platform, and in order to keep the risk control level of the financing evaluation model strict, samples whose bank feedback is "disbursement refused" or "pre-credit refused" are technically assumed to be negative samples. To ensure sample quality, samples marked NO (negative) in the bank feedback status but actually YES (positive) were judged manually one by one and added to the positive sample set. Once the positive and negative sample sets were finalized, the sample set and the test set were drawn by stratified random sampling in proportions of 80% and 20% respectively, and data mining and model training followed.
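The 80%/20% stratified random extraction can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name, the fixed seed and the toy sample list are hypothetical, and the patent does not say which tool performed the split.

```python
import random


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split samples into (train, test), sampling each label group separately."""
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s["label"], []).append(s)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])   # 20% of this label goes to the test set
        train.extend(group[n_test:])  # the remaining 80% goes to the sample set
    return train, test


# Hypothetical data: 30 positive and 70 negative samples.
samples = [{"id": i, "label": 1 if i < 30 else 0} for i in range(100)]
train, test = stratified_split(samples)
```

Sampling within each label group keeps the share of positive samples the same in both sets, which matters when one class is much smaller than the other.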
The quality of the sample data as a whole basically meets the requirements of data modeling, but the sample data for individual industry segments are poor: there are only 26 valid samples in the electronics segment, 71 in the automotive segment and 13 in the pharmaceutical segment. Training on such segments is prone to bias, and a model trained on them would not be fully representative.
The newly added customs data play a significant role in model training and machine learning. When the model is periodically upgraded later, more use should be made of such high-quality data with good time-series properties and continuity, since they are of clear benefit in classifying enterprises.
Electric power data fluctuate considerably because of factors such as industry, scale, production and operating cycle, off-peak and peak seasons, and power outages. The data are real, and intuitively they should improve the model. In model training, however, the correlation analysis against the banks' actual credit decisions shows that electric power data carry a low weight in identifying whether an enterprise can obtain a bank credit loan. This result depends to a large extent on the collection cycle of the electric power data, on how much of the sample they cover, and on the structural attributes of the financial products; the banks also rely on electric power data markedly less than on tax and financial data.
FIG. 2 shows the statistics used to determine the optimal performance time window. Determining that window requires choosing a point in time at which enterprise defaults are in a stable state. No substantive default samples currently exist, but for enterprises that were refused disbursement or refused pre-credit with a clear reason fed back by the bank, it can be technically assumed that at some future point they would face a risk event of seriously insufficient repayment capacity and hence a high probability of default. FIG. 2 shows that this determination only reflects, indirectly, how the risk exposure of each cooperating bank on the business value credit loan management platform changes over a given period; it cannot establish that the substantive enterprise default frequency or trend is stable within a certain level interval. The time window is therefore set to the entire credit granting period from the start of the platform's financing loan application data to the present.
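A sketch of the time window statistics may help here. It assumes each record carries a feedback date and a 0/1 label (both field names are hypothetical) and uses the monthly share of negative samples as a stand-in for default frequency, in line with the technical assumption stated above.

```python
from collections import defaultdict
from datetime import date


def time_window(records):
    """Return (earliest, latest) credit feedback date across all records."""
    dates = [r["feedback_date"] for r in records]
    return min(dates), max(dates)


def monthly_negative_rate(records):
    """Share of negative samples (label 0) per calendar month."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for r in records:
        key = (r["feedback_date"].year, r["feedback_date"].month)
        totals[key] += 1
        negatives[key] += (r["label"] == 0)
    return {k: negatives[k] / totals[k] for k in sorted(totals)}


# Hypothetical records, for illustration only.
records = [
    {"feedback_date": date(2019, 11, 5), "label": 1},
    {"feedback_date": date(2019, 11, 20), "label": 0},
    {"feedback_date": date(2020, 3, 2), "label": 0},
    {"feedback_date": date(2020, 8, 14), "label": 1},
]
window = time_window(records)
rates = monthly_negative_rate(records)
```

If the monthly rates settled into a narrow band, a shorter window could be justified; the embodiment found no such stable interval and kept the full period.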
Enterprises recorded in the platform's credit granting records as refused credit or refused pre-credit can technically be assumed to be bad enterprises with high credit risk. Strictly speaking, they are negative samples only in the sense that, within a certain period of the platform's financing application business, the banks judged them to have high credit risk and weak repayment ability, so that they could not obtain credit from the platform's cooperating banks. They are not typical negative samples of enterprises with extremely high credit risk and poor repayment ability and willingness in real commercial activity. The data indicators available for enterprises on the business value credit loan platform are listed below:
[Table images not reproduced: list of data indicators available for platform enterprises]
Specifically, the data are checked as a whole before the model is built so that data quality meets the requirements of modeling; missing value imputation and duplicate value handling are the most common steps. The sample data contain many correlated variables before model construction, so an important-variable discovery step is needed to select the model input variables.
Derived variable generation covers total house area, mortgaged house area, total land area, mortgaged land area, patent count, software copyright count, trademark count, quarterly electricity consumption and quarterly electricity charges paid.
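The derived variables are simple aggregations of raw records. The sketch below covers a subset of them; the input layout (a dictionary with `houses`, `patents`, `trademarks` and `monthly_power` keys) is a hypothetical structure chosen for illustration and is not described in the patent.

```python
def derive_variables(enterprise):
    """Summarise raw property, IP and electricity records into per-enterprise totals."""
    houses = enterprise.get("houses", [])
    power = enterprise.get("monthly_power", {})  # {(year, month): (kwh, fee)}
    quarterly_kwh, quarterly_fee = {}, {}
    for (year, month), (kwh, fee) in power.items():
        quarter = (year, (month - 1) // 3 + 1)
        quarterly_kwh[quarter] = quarterly_kwh.get(quarter, 0.0) + kwh
        quarterly_fee[quarter] = quarterly_fee.get(quarter, 0.0) + fee
    return {
        "house_area_total": sum(h["area"] for h in houses),
        "house_area_mortgaged": sum(h["area"] for h in houses if h["mortgaged"]),
        "patent_count": len(enterprise.get("patents", [])),
        "trademark_count": len(enterprise.get("trademarks", [])),
        "quarterly_kwh": quarterly_kwh,
        "quarterly_fee": quarterly_fee,
    }


# Hypothetical enterprise record.
enterprise = {
    "houses": [{"area": 120.0, "mortgaged": True}, {"area": 80.0, "mortgaged": False}],
    "patents": ["p1", "p2"],
    "trademarks": ["t1"],
    "monthly_power": {(2020, 1): (100.0, 60.0), (2020, 2): (150.0, 90.0), (2020, 4): (200.0, 120.0)},
}
derived = derive_variables(enterprise)
```

Land area and software copyright counts would follow the same pattern as the house and patent fields.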
Take tax data as an example. A missing value does not mean that a normally operating enterprise has no tax information: such an enterprise generally has tax-related data, and a gap may simply mean that its actual tax payable did not reach the taxable threshold. XGBoost treats missing values as sparse entries and does not consider them when searching for a node's split point. Instead, the records with missing values are assigned in turn to the left subtree and to the right subtree, the loss is computed for each case, and the better direction is kept. If no values are missing during training but missing values appear at prediction time, they are sent to the right subtree by default. Because the algorithm handles missing values well, and in order to obtain a model closer to reality, the missing values are not filled with 0, the mean, the median or the mode.
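The idea of trying both directions for missing values can be sketched as follows. This is a simplified illustration, not XGBoost's implementation: it scores a split with the sum of squared errors instead of the gradient-based gain that XGBoost uses, and the function names and data are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean; 0 for an empty group."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_missing_direction(xs, ys, threshold):
    """Send missing values (None) left, then right, and return the lower-loss side."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    if not missing:
        return "right"  # default direction described in the text above
    loss_left = sse(left + missing) + sse(right)
    loss_right = sse(left) + sse(right + missing)
    return "left" if loss_left < loss_right else "right"


# Hypothetical tax feature with two missing entries and 0/1 labels.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0, 0, 0, 1, 1, 0]
direction = best_missing_direction(xs, ys, threshold=5.0)
```

Here the two missing records carry label 0, like the low-value records, so sending them left gives the lower loss.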
Referring to FIG. 3, extreme values of the variables in the sample are identified mainly by the following two methods:
(1) Using a decision tree to find persistent nodes that contain only a small number of observations.
(2) Dividing the data into several subsets with a clustering algorithm and treating clusters that contain only a few observations (ideally a single sample or observation) as extreme values. Which method to use is chosen by the modeler according to the characteristics of each data type. In this case, extreme values are identified by combining cluster analysis with expert experience.
For example, the 2018 tax data are divided into 8 subsets with a clustering algorithm (the number of subsets is set from expert experience). Groups that form small clusters or contain significant outliers are then treated as extreme values; they are removed when there are few of them and replaced by the mean when there are many.
FIG. 3 shows the business revenue variable being clustered, its extreme values found, and those values processed. In the cluster analysis of the revenue data, group 1 of the 8 groups, (8e+07, 1e+08), is clearly separated from the other groups and contains many extreme values, about 400. As shown in FIG. 3, this first cluster corresponds to array 1 in the marked frame of the cluster analysis output, and in such a case the extreme values in the group are replaced by the mean. When there are few extreme values (1 to 3), they are deleted directly. When there are many (3 or more), they are replaced by the mean of the variable.
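The cluster-then-process rule above can be sketched in plain Python. Several choices here are assumptions made only for illustration: a tiny one-dimensional k-means stands in for the unnamed clustering algorithm, a cluster size threshold (`small_cluster`) stands in for the expert judgment, and the overlapping boundary at 3 in the text is resolved as "fewer than 3 deletes, 3 or more replaces".

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means with evenly spaced initial centres; returns a cluster index per value."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iterations):
        assign = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centres[c] = sum(members) / len(members)
    return assign


def handle_extremes(values, k=8, small_cluster=3, delete_below=3):
    """Flag values in small clusters as extreme, then delete or mean-replace them."""
    assign = kmeans_1d(values, k)
    sizes = {c: assign.count(c) for c in set(assign)}
    extreme = [sizes[a] <= small_cluster for a in assign]
    normal = [v for v, e in zip(values, extreme) if not e]
    if not normal:
        return list(values)  # nothing to compare against; leave data unchanged
    if sum(extreme) < delete_below:
        return normal  # few extreme values: delete them
    mean = sum(normal) / len(normal)
    return [mean if e else v for v, e in zip(values, extreme)]  # many: mean replacement


base = [float(v) for v in range(10, 20)]
few = handle_extremes(base + [1000.0])
many = handle_extremes(base + [1000.0, 1001.0, 1002.0])
```

In the embodiment the flagged group held about 400 values and was chosen because it was clearly separated from the rest, so a fixed size threshold like the one above would not reproduce that judgment on its own.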
Many factors determine whether an enterprise becomes a good enterprise approved for bank credit, and each bank focuses on different ones. If every factor of concern to every bank were introduced into the analysis model as an explanatory variable, the resulting model would be very complicated, and it would be uneconomical and impractical both in deployment and in operation. The important variables that affect bank credit decisions must therefore be found by data mining and used as the model input variables, so that the model captures and portrays, as fully as possible, the features by which each bank defines a good enterprise, that is, an enterprise with actual loan disbursement on the platform. Through repeated model training the prediction probability converges toward its optimum, which makes it possible to identify the variables that separate good enterprises from bad ones and to rank them by weight. Model algorithms such as XGBoost, LightGBM and BiLSTM were used to mine and analyse all financing-related quantitative data on the platform, giving the importance of each variable in model identification and classification prediction, that is, an importance ranking. Only the top 10 variables are listed below.
Serial number   Explanatory variable             Importance frequency (enterprise accurately predicted)
1               tax2017RD_INPUT                  257
2               tax2017ENT_PAY_TAXES             250
3               tax2017ENT_PAY_TAXES_DUE         244
4               tax2017BUSINESS_PROFIT           236
5               tax2019BUSINESS_REVENUE          231
6               plat_knowledge_totalSB_TOTAL     228
7               tax2018ENT_PAY_TAXES_DUE         223
8               tax2018BUSINESS_REVENUE          221
9               tax2017TOTAL_PROFIT              211
10              tax2018BUSINESS_PROFIT           208
Based on the actual business requirements of the commercial value credit loan management platform, the XGBoost, LR, GBDT, LightGBM and BiLSTM algorithms are selected to identify effectively whether enterprises applying for financing on the platform meet the bank credit granting requirements. Specifically, a wide table is built around business indexes of enterprise financing applications, and the differences on these indexes between enterprises whose (pre-)credit was refused and enterprises that actually received loans are compared. Data mining then yields business rules that distinguish whether an enterprise can obtain bank credit, that is, the importance of the characteristic variables that separate enterprises that actually received loans from those whose (pre-)credit was refused.
Specifically, of these machine learning algorithms, XGBoost, GBDT and LightGBM are essentially based on decision trees, whereas LR is a linear model and BiLSTM is a recurrent neural network. XGBoost is a tree ensemble model that sums the results of K trees (K being the number of trees) as the final predicted value. Its advantages are that it borrows column sampling and row sampling from the random forest algorithm, which reduces both the risk of overfitting and the amount of computation, and that it also supports a linear classifier, which is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms. During training, the optimal split point is chosen at each tree node, and the model parameters are learned and tuned until accuracy and recall reach the best balance, so that the probability distribution and sensitivity with which the model classifies enterprises are optimal.
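The additive form described above, in which the outputs of K trees are summed into one prediction, can be sketched as follows. This is a minimal illustration, not the XGBoost implementation: the two stump functions, their thresholds and leaf values, the field names `revenue` and `tax_paid`, and the use of a logistic (sigmoid) link to turn the sum into a probability are all assumptions made for the example.

```python
import math

def predict_probability(trees, x):
    """Sum the raw outputs of K trees and squash the sum into a probability."""
    raw_score = sum(tree(x) for tree in trees)
    return 1.0 / (1.0 + math.exp(-raw_score))

# Two hypothetical depth-1 trees (stumps); each returns a leaf value.
def tree_revenue(x):
    return 0.8 if x["revenue"] > 1_000_000 else -0.5

def tree_tax(x):
    return 0.4 if x["tax_paid"] > 50_000 else -0.3

enterprise = {"revenue": 2_000_000, "tax_paid": 10_000}
# Raw score is 0.8 + (-0.3) = 0.5 before the sigmoid is applied.
p_credit = predict_probability([tree_revenue, tree_tax], enterprise)
```

In a trained ensemble each new tree is fitted to correct the errors of the trees before it; the sketch only shows how their outputs are combined at prediction time.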
Specifically, when new financing loan application data enters the new commercial value financing evaluation model, the trained industry-subdivision cluster model automatically identifies, according to its judgment rules, the enterprise's basic data, tax data, social security data, customs data, electric power data, intellectual property data and the like, which also include attributes such as the enterprise's industry type code and enterprise scale. Each identified data item leads to a corresponding next decision node, and the sequence of decision nodes finally yields the classification result for the enterprise.
To train a better commercial value credit evaluation model, the following 6 parameters were tested repeatedly.
max_depth: the maximum depth of a tree. This value is also used to avoid overfitting. 6 values are preselected;
lambda: the L2 regularization term on the weights, which enhances generalization ability. 10 values are preselected;
subsample: the proportion of samples drawn at random, which controls the degree of fitting. 8 values are preselected;
colsample_bytree: the proportion of columns drawn at random for each tree. 8 values are preselected;
min_child_weight: the minimum sum of sample weights in a leaf node, so that the model is optimized as a whole and local optima are avoided. 8 values are preselected;
num: the number of learning rounds of the model, which controls the degree of fitting. 5 values are preselected;
After the extreme values of the above parameters were removed, 6 × 10 × 8 × 8 × 8 × 5 = 153,600 models were trained, and the best among them was selected according to the P-R curve and the AUC value.
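The size of this search can be checked by enumerating a grid with the stated number of candidates per parameter. The candidate values below are placeholders chosen only to have the right counts (6, 10, 8, 8, 8 and 5); the source does not list the actual values, and the column-sampling parameter is written here with the XGBoost spelling `colsample_bytree`.

```python
from itertools import islice, product
from math import prod

# Hypothetical candidate values; only the number of candidates per parameter
# (6, 10, 8, 8, 8, 5) is taken from the text.
grid = {
    "max_depth": [3, 4, 5, 6, 7, 8],
    "lambda": [0.1 * i for i in range(1, 11)],
    "subsample": [0.5 + 0.05 * i for i in range(8)],
    "colsample_bytree": [0.5 + 0.05 * i for i in range(8)],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8],
    "num": [100, 200, 300, 400, 500],
}

def iter_configs(grid):
    """Yield one dict per combination of candidate values."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))

n_models = prod(len(candidates) for candidates in grid.values())
first_config = next(islice(iter_configs(grid), 1))
```

Each configuration would be trained and scored on the held-out data, and the configuration with the best P-R curve and AUC kept.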
The target value of the commercial value credit loan evaluation model is 0 or 1 (0 means refusal, 1 means the loan is granted), and the model outputs a probability between 0 and 100% that the bank approves the loan.
Specifically, two categorical features, the subdivided industry and the enterprise scale, need to be identified and processed for the XGBoost algorithm. Whether in XGBoost or in other boosting trees, the trees used are CART regression trees, which means that a boosting tree algorithm accepts only numerical feature input and does not directly support categorical features; by default XGBoost treats categorical features as numerical. Obviously neither the industry nor the enterprise scale is numerical, and neither takes continuous values, so a suitable method is needed before such unordered, non-numerical features can be used in a machine learning task.
The method used to solve this problem is One-Hot Encoding. The main reasons for using One-Hot Encoding are as follows:
One-hot encoding extends the values of a discrete feature into Euclidean space, so that each value of the discrete feature corresponds to a point in that space. In machine learning algorithms for regression, classification, clustering and the like, computing distances or similarities between features is very important, and the commonly used distance and similarity measures, cosine similarity included, are computed in Euclidean space.
Applying one-hot encoding to discrete features makes distance calculations between feature values more reasonable. For example, for a discrete feature representing the type of work with three possible values, integer codes 0, 1 and 2 would place the first and third values twice as far apart as the first and second, which is not reasonable for unordered categories. With one-hot encoding, every pair of distinct values is equally far apart, which is more reasonable.
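The distance argument can be checked numerically. The sketch below compares Euclidean distances for a three-valued discrete feature under integer codes and under one-hot codes; the category names `A`, `B` and `C` are placeholders, not values from the source.

```python
import math

# Integer (serial) codes impose an artificial order on unordered categories.
label_codes = {"A": (0,), "B": (1,), "C": (2,)}
# One-hot codes place each category on its own axis.
one_hot_codes = {"A": (1, 0, 0), "B": (0, 1, 0), "C": (0, 0, 1)}

def pairwise_distances(codes):
    """Euclidean distance between every pair of distinct categories."""
    names = sorted(codes)
    return {
        (a, b): math.dist(codes[a], codes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

label_dist = pairwise_distances(label_codes)
one_hot_dist = pairwise_distances(one_hot_codes)
```

Under integer codes `A` and `C` end up twice as far apart as `A` and `B`, while under one-hot codes all three pairs are at the same distance, the square root of 2.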
Taking a practical example from this case, there are three feature attributes:
Enterprise scale: [large, medium and small]
Industry category: [complete vehicle manufacturing, special equipment manufacturing for electronic and electrical machinery, chemical drug preparation]
District/county: [Yubei district, Wanzhou district, Beibei district, Jiangnan district]
For a given sample such as ["small", "complete vehicle manufacturing", "Yubei district"], the categorical values must be converted to numbers. The most direct approach is serial numbering: [0, 1, 3]. However, even after this conversion the data cannot be used directly in a classifier, because classifiers tend to assume by default that data is continuous and ordered, whereas the numbers in this representation carry no order and are assigned arbitrarily. Features processed this way cannot be fed directly into a machine learning algorithm.
To solve the problem that the subdivided-industry and enterprise-scale variables are non-numerical and non-continuous, the method adopts One-Hot Encoding (also called one-bit effective encoding). This method uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time. In other words, a feature with m possible values becomes m binary features after one-hot encoding. These features are mutually exclusive, with only one active at a time, so the data becomes sparse.
One-hot encoding also addresses the linear separability of the variable data. For the sample ["small", "complete vehicle manufacturing", "Yubei district"], "small" corresponds to [1, 0], "complete vehicle manufacturing" corresponds to [0, 1, 0], and "Yubei district" corresponds to [0, 0, 0, 1]. The complete digitized feature vector is therefore [1, 0, 0, 1, 0, 0, 0, 0, 1]. In this way unordered categorical features can enter the classifier algorithm for learning and training. In this case, the industry category feature and the enterprise scale feature are incorporated into the model algorithm in a principled way, so that the final model fully reflects what was learned from the enterprise scale and industry category features.
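The worked example can be reproduced with a few lines of code. The sketch below is illustrative only: the helper names are hypothetical, and the order of categories inside each list is an assumption chosen so that the serial codes match the [0, 1, 3] used in the example, since the source assigns those codes arbitrarily.

```python
def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Assumed category orders (index 0, 1 and 3 respectively for the sample below).
SCALE = ["small", "large"]
INDUSTRY = [
    "special equipment manufacturing",
    "complete vehicle manufacturing",
    "chemical drug preparation",
]
DISTRICT = ["Wanzhou district", "Beibei district", "Jiangnan district", "Yubei district"]

def encode(sample):
    """Concatenate the one-hot codes of the three categorical features."""
    scale, industry, district = sample
    return one_hot(scale, SCALE) + one_hot(industry, INDUSTRY) + one_hot(district, DISTRICT)

encoded = encode(["small", "complete vehicle manufacturing", "Yubei district"])
# encoded == [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

The encoded vector has 2 + 3 + 4 = 9 positions, one per category value, with exactly one active position per feature.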
In the embodiment of the invention, the ROC curve, the AUC value and the P-R method are used to test and evaluate the discriminative power and actual performance of the model. Before model development and training, the effective samples are divided by stratified random sampling into a training set and a test set in proportions of 80% and 20%. The training set is used mainly to train the model and construct the prediction model, and the test set is used mainly to test the performance and discriminative power of the trained prediction model.
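A stratified 80/20 split of this kind can be sketched with the standard library alone. The function name, the fixed seed and the toy label list are assumptions for illustration; the point is that each class is shuffled and split separately so that both sets keep the original ratio of positive to negative samples.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    train, test = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy data: 50 positive samples (credit granted) and 50 negative samples (refused).
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)
```

With 100 samples this gives 80 training and 20 test indices, and the test set contains 10 samples of each class.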
Specifically, the ROC curve is the most common index for measuring the discriminative power of a default probability model. It uses the test set data to check how accurately the model built on the sample set identifies positive and negative samples, and to assess the scalability, discriminative power and accuracy of the prediction model on other data sets. The ROC curve is drawn by constructing the confusion matrix of the model's predictions on the test set.
The AUC value is an important and very common indicator of the classification performance of a machine learning model, and it applies only to binary classification. It therefore suits the practical problem of distinguishing good enterprises (which can obtain bank credit) from poor enterprises (whose (pre-)credit is refused). In essence, for any pair consisting of one positive and one negative sample, AUC reflects the probability that the model scores the positive sample as more likely to be positive than the negative sample, and it expresses the classification ability shown by the ROC curve more intuitively. In general, the magnitude of AUC indicates model quality according to the following numerical ranges.
AUC = 1: a perfect classifier
0.5 < AUC < 1: better than a random classifier
AUC = 0.5: equivalent to random classification, of no value
0 < AUC < 0.5: worse than random classification, erroneous results
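AUC can be read as the probability that a randomly drawn positive sample receives a higher score than a randomly drawn negative one, and for small data sets it can be computed directly from that definition. The sketch below is a plain pairwise implementation with ties counted as one half; the function name and the toy scores are illustrative, not taken from the source.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Four samples: one of the four positive/negative pairs is ranked wrongly.
auc_value = pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Here three of the four pairs are ranked correctly, so the value is 0.75, which falls in the "better than random" range.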
In the test of the universal version of the model, which uses no data filled in by enterprises, the AUC value is 0.91. This shows that the discriminative power of the new model built on the sample set data is in the excellent range and that the model meets the initial requirements for deployment in actual business scenarios.
Specifically, the P-R curve tests the performance of the model from another angle. The new model is applied to test data with known outcomes so that wrongly identified samples and missed samples can be found, which gives the recall (the true positive rate) and the precision. Each percentile of the threshold gives one point, and traversing all thresholds forms the curve. The P-R curve shows that the precision of the model begins to decrease when recall is around 0.4, while recall can still reach complete coverage of the results. This indicates that the new prediction probability model has good discriminating and classifying ability. The P-R curve evaluates a model better than the ROC curve when negative samples greatly outnumber positive samples; in this example the numbers of positive and negative samples do not differ greatly, so the new model meets the preliminary requirements for deployment in actual business scenarios.
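Tracing a P-R curve by sweeping the threshold can be sketched as follows. It is assumed here that "each percentile" means thresholds at 0.01, 0.02, ..., 0.99 on the predicted probability; the function name and the toy data are hypothetical.

```python
def precision_recall_points(labels, scores, n_thresholds=100):
    """Return (threshold, precision, recall) for each threshold with predictions."""
    total_positive = sum(1 for y in labels if y == 1)
    if total_positive == 0:
        raise ValueError("recall is undefined without positive samples")
    points = []
    for k in range(1, n_thresholds):
        threshold = k / n_thresholds
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        if tp + fp == 0:
            continue  # precision is undefined when nothing is predicted positive
        points.append((threshold, tp / (tp + fp), tp / total_positive))
    return points

points = precision_recall_points([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
# Look-up table keyed by threshold rounded to two decimals.
by_threshold = {round(t, 2): (p, r) for t, p, r in points}
```

At a low threshold every sample is predicted positive, so recall is 1.0 and precision equals the share of positives; raising the threshold trades recall for precision.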
To remain consistent with industry practice, a master scale can be developed on the basis of the technical scheme of the invention to support hierarchical classification management of the enterprise financing matching service. Master scale design is the process of mapping default probability to risk levels, where each risk level symbol corresponds to an average default probability. Its purpose is that, when an enterprise applies for commercial value credit loan financing, a clear credit level and the corresponding credit line guidance are displayed visually once the commercial value credit evaluation is complete. The master scale makes the commercial value credit loan evaluation results more standardized, and clearly subdivided enterprise credit levels are easier to apply in other financing scenarios. The subdivided levels express, in a unified way, the probability interval in which each enterprise can obtain a bank loan, which brings significant benefits for fine-grained management and for implementing complex incentive policies.
The design of the master scale follows these principles:
(1) The master scale should have risk differentiation capability, with different levels representing different default risks.
(2) The master scale should map default probability to risk levels continuously and without overlap.
(3) The risk levels should be fine enough to distinguish different types of risk; the default probability values of adjacent levels should not differ too much, and the span of each default probability interval should be monotonic, preferably increasing in a geometric progression.
(4) Customers must not be overly concentrated in a single risk level; the number of customers in any one risk level must not exceed 30% of the total.
(5) The default probability mapping should take the industry distribution of the subjects into account and should not deviate from it significantly.
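Two of these principles, continuous non-overlapping intervals and the 30% concentration limit, lend themselves to an automated check. The sketch below assumes a scale represented by the ascending upper bound of each level, which makes the intervals contiguous by construction; the function name, the bounds and the customer assignments are all hypothetical.

```python
from collections import Counter

def check_master_scale(upper_bounds, customer_levels, max_share=0.30):
    """Return (scale_is_valid, overcrowded_levels) for a candidate scale."""
    # Strictly increasing upper bounds give contiguous, non-overlapping intervals.
    increasing = all(a < b for a, b in zip(upper_bounds, upper_bounds[1:]))
    # The last interval must end at probability 1.0 to cover the whole range.
    covers_all = bool(upper_bounds) and upper_bounds[-1] == 1.0
    counts = Counter(customer_levels)
    total = len(customer_levels)
    overcrowded = sorted(
        level for level, count in counts.items() if count / total > max_share
    )
    return increasing and covers_all, overcrowded

# Ten customers on a four-level toy scale; level 0 holds 40% of them.
scale_ok, overcrowded = check_master_scale(
    [0.01, 0.02, 0.04, 1.0],
    [0, 0, 0, 0, 1, 1, 2, 2, 3, 3],
)
```

In this toy case the scale itself is well formed, but level 0 is flagged because it holds more than 30% of the customers, so its interval would need to be split.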
A master scale is normally developed from the central tendency of default probability over multiple years, which determines the upper and lower limits of each credit level. Because multi-year default probability statistics are not available here, the correspondence between default probability and credit level can for now only be determined by a simple correspondence method, giving a simplified master scale. For the present invention, a master scale that fits the actual conditions of the platform can be designed from the default probability distribution of all enterprises.
Based on the model training results, the master scale is designed with 22 levels, each corresponding to the actual default probability of a subject. To map an enterprise's default probability to the corresponding credit level, the whole default probability range must be binned. In this embodiment the default probability is divided into four classes and twenty-two levels, and each level interval corresponds to a default probability range, which in the present case is actually the probability that the enterprise cannot obtain bank (pre-)credit. By collating the probability distribution chart and the binning results, the corresponding credit level mapping master scale can be compiled, which completes the master scale design work in the model upgrade.
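Mapping a probability to one of 22 levels by binning can be sketched as follows. The source does not give the level boundaries, so the geometric widths (ratio 1.5) used here are purely an assumption, as are the function names; level 1 is taken to be the lowest probability of refusal.

```python
from bisect import bisect_left

N_LEVELS = 22
WIDTH_RATIO = 1.5  # hypothetical: each interval is 1.5 times as wide as the previous one

def build_upper_bounds(n_levels=N_LEVELS, ratio=WIDTH_RATIO):
    """Upper bounds of n_levels contiguous intervals covering [0, 1]."""
    widths = [ratio ** i for i in range(n_levels)]
    total = sum(widths)
    bounds, cumulative = [], 0.0
    for width in widths:
        cumulative += width / total
        bounds.append(cumulative)
    bounds[-1] = 1.0  # remove floating-point drift at the top of the scale
    return bounds

def level_of(probability, bounds):
    """Return the 1-based level whose interval contains `probability`."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return bisect_left(bounds, probability) + 1

bounds = build_upper_bounds()
best_level = level_of(0.0, bounds)
worst_level = level_of(1.0, bounds)
```

Because each probability falls in exactly one interval, the mapping is continuous and free of overlap; real boundaries would be fitted to the platform's observed probability distribution.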
By analyzing the actual business problem addressed by the commercial value credit loan evaluation model, the invention selects algorithms suitable for training it (five algorithms: XGBoost, LR, GBDT, LightGBM and BiLSTM). Because no actual default samples of commercial value credit loans exist at present, positive and negative sample sets are defined by a technical assumption of default, and all connected data fields are used for training and learning. According to the invention, enterprise data with a history of financing loan applications is selected as original sample data and, according to the state of the credit feedback information, approved credit is defined as a positive sample and refused credit as a negative sample. The original sample data is combed and judged, and the samples are cleaned and sorted into those with uncertain credit feedback information, those whose feedback information is actually positive, and those whose credit service state feedback shows a negative sample state. The feedback data with a definite financing application and a definite bank decision on whether to grant credit is taken as the total effective sample data set, and its data is labeled according to the definitions of positive and negative samples. A sample set and a test set are extracted from the total effective sample data set by stratified random sampling. The period between the first and last credit granting times in the total effective sample data set is selected as the time window, and the time interval in which the enterprise default frequency or trend is stable is determined through this window. A database table is established from the business indexes of enterprise financing applications, the differences on these indexes between enterprises refused credit and enterprises that actually received loans are compared, and a data mining algorithm is used to produce a commercial value credit loan evaluation model that distinguishes whether an enterprise can obtain bank credit, so that whether a financing applicant meets the bank credit granting requirements can be identified effectively. Based on historical data, the invention meets the need to evaluate commercial value credit loans for small and medium-sized enterprises and can effectively identify enterprises that have a high probability of obtaining a loan from the current partner banks. Enterprises with good credit that are likely to obtain bank credit can thus be pushed to the partner banks in time, which raises the loan success rate of enterprises applying for financing on the platform and addresses the difficult, expensive and slow financing faced by small and medium-sized enterprises.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. A business value credit loan evaluation method for small and medium-sized enterprises is characterized by comprising the following steps:
(1) preparing data: selecting enterprise data with financing loan application history records as original sample data, defining the approval crediting as a positive sample and defining the refusal crediting as a negative sample according to the credit feedback information state;
(2) combing positive and negative sample sets: combing and judging the original sample data, and cleaning and sorting out samples with uncertain credit feedback information, samples whose feedback information is actually positive, and samples whose credit service state feedback shows a negative sample state;
(3) judging positive and negative sample sets of the enterprise: obtaining, as a total effective sample data set, the feedback data in the original sample data that has a definite financing application and a definite bank decision on whether to grant credit, and marking the data of the total effective sample data set according to the definitions of a positive sample and a negative sample;
(4) sample extraction: randomly extracting a sample set and a test set from the total effective sample data set in a layering way;
(5) time window determination statistics: selecting the initial and ending credit granting time periods in the total effective sample data set as time window periods, and determining the time level interval with stable enterprise default frequency or trend through the time window;
(6) constructing a commercial value credit loan evaluation model: establishing a database table according to business indexes of enterprise financing application, comparing difference performance of refusal credit and actual paying enterprises on the business indexes, giving out a commercial value credit loan evaluation model for distinguishing whether the enterprises can obtain bank credit or not by using a data mining algorithm, and effectively identifying whether the financing application enterprises meet bank credit granting requirements or not through the commercial value credit loan evaluation model.
2. The method for evaluating the commercial value credit of the medium and small enterprises according to claim 1, wherein in the process of judging the positive and negative sample sets of the enterprises, sample data for which the bank feedback information shows an actual loan disbursement state is set as a positive sample; and, as a technical assumption, sample data for which the bank feedback information shows refused disbursement or refused pre-credit is set as a negative sample.
3. The method as claimed in claim 1, wherein the enterprise data includes tax data, financial data, social security data, patent data, software copyright data, copyright data, trademark data, real estate data, power data, customs data and special fund data.
4. The method of claim 1, wherein the total valid sample data set is subjected to derivative variable generation, missing value processing, extreme value identification processing and key variable discovery.
5. The method according to claim 4, wherein the derivative variable generation comprises a total housing area summary, a mortgaged housing area summary, a total land area summary, a mortgaged land area summary, a patent quantity summary, a software copyright quantity summary, a trademark quantity summary, quarterly electricity consumption and quarterly electricity charges paid.
6. The method as claimed in claim 4, wherein the missing value processing is applied to tax data, and missing values are filled with 0, the mean, the median or the mode.
7. The method as claimed in claim 4, wherein the manner of identifying the extreme value includes: finding continuous nodes containing a preset number of observed values by using a decision tree; dividing the data into a plurality of subsets by using a clustering algorithm, and taking the clusters containing a preset number as extreme values;
the extreme value processing mode comprises the following steps: when the number of the extreme values is less than the preset number, deleting the extreme values; and when the number of the extreme values reaches the preset number, carrying out mean value replacement processing on the extreme values.
8. The method as claimed in claim 4, wherein the key variable discovery is implemented by mining and analyzing all financing quantitative data through a data mining algorithm, so as to obtain the degree of importance of each variable in model identification and classification prediction, wherein the degree of importance is expressed by the importance frequency of accurate prediction of enterprises.
9. The method as claimed in claim 1, wherein in the establishment of the business value credit evaluation model, one-hot coding is used to extend the values of the discrete features into the Euclidean space.
10. The method for evaluating the commercial value credit of the medium and small enterprises as claimed in claim 1, wherein the commercial value credit evaluation model is checked and evaluated by adopting an ROC curve, an AUC value and a P-R method;
and carrying out hierarchical classification management on enterprise financing matching service work in a form of developing a main scale.
CN202011261813.8A 2020-11-12 2020-11-12 Method for evaluating business value credit loan of medium and small enterprises Pending CN112330441A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011261813.8A CN112330441A (en) 2020-11-12 2020-11-12 Method for evaluating business value credit loan of medium and small enterprises

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011261813.8A CN112330441A (en) 2020-11-12 2020-11-12 Method for evaluating business value credit loan of medium and small enterprises

Publications (1)

Publication Number Publication Date
CN112330441A true CN112330441A (en) 2021-02-05

Family

ID=74318085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011261813.8A Pending CN112330441A (en) 2020-11-12 2020-11-12 Method for evaluating business value credit loan of medium and small enterprises

Country Status (1)

Country Link
CN (1) CN112330441A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011751A (en) * 2021-03-19 2021-06-22 天道金科股份有限公司 Small and medium-sized micro enterprise credit evaluation method based on big data
CN113095712A (en) * 2021-04-25 2021-07-09 国家电网有限公司 Enterprise credit granting score obtaining method and device and computer equipment
CN113506174A (en) * 2021-08-19 2021-10-15 北京中数智汇科技股份有限公司 Method, device and equipment for training risk early warning model of medium and small enterprises

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205
