CN111047193A

CN111047193A - Enterprise credit scoring model generation algorithm based on credit big data label

Info

Publication number: CN111047193A
Application number: CN201911278580.XA
Authority: CN
Inventors: 刘海滨; 郭佳劼; 叶林; 沙凌峰; 冉作舟
Original assignee: Shanghai Dolphin Enterprise Credit Reporting Service Co Ltd
Current assignee: Shanghai Dolphin Enterprise Credit Reporting Service Co Ltd
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2020-04-21

Abstract

The invention provides an enterprise credit scoring model generation algorithm based on credit big data labels, and relates to the technical field of scoring model generation. The algorithm comprises the following steps: 1. labeling original data using a label classification and quantitative analysis method based on credit big data to construct an enterprise label matrix; 2. screening according to the identity tags, and constructing an enterprise scene tag library; 3. processing the enterprise label matrix using the (k, ε)-coreset; 4. performing index screening on the enterprise credit data using a random forest algorithm; 5. taking the IV value as the single-variable screening criterion; 6. fitting the screened variables to a logistic regression model. The method uses big data credit labels and adopts a parent-child model structure: sparse big data information is processed into dense information by the child models, the outputs of the child models are then used as input variables of the parent model, and the information is processed layer by layer, forming a technical framework of models nested within models.

Description

Enterprise credit scoring model generation algorithm based on credit big data label

Technical Field

The invention relates to the technical field of scoring model generation, in particular to an enterprise credit scoring model generation algorithm based on credit big data labels.

Background

Enterprise credit is a product of the market economy. It is a comprehensive analysis and determination of the ability of market participants to fulfill their economic contracts and of their overall credibility. In market economies, an enterprise's credit level is directly linked to its financing cost. Enterprises (units) with a high credit rating and excellent credit standing pay low interest rates when issuing bonds or applying for loans, while those with a low credit rating and poor credit standing pay correspondingly higher rates; enterprises (entities) without a credit rating, i.e., those with no credit record, are not allowed to issue bonds in the market and generally find it difficult to obtain loans.

As is known, the prevailing standard is the three-grade, ten-level credit rating system. AAA is the highest level: it indicates that the enterprise has a high degree of credit, low debt risk, an excellent credit record, good operating conditions, strong profitability and broad development prospects, and that uncertain factors have very little influence on its operation and development. At present, however, there is no effective enterprise credit scoring model generation algorithm based on credit big data labels, so enterprise credit scoring is subject to considerable uncertainty.

Disclosure of Invention

(I) Technical problem to be solved

In view of the shortcomings of the prior art, the invention provides an enterprise credit scoring model generation algorithm based on credit big data labels, which solves the problem that enterprise credit scoring is subject to considerable uncertainty because the prior art lacks an effective enterprise credit scoring model generation algorithm based on credit big data labels.

(II) technical scheme

In order to achieve the purpose, the invention is realized by the following technical scheme: an enterprise credit scoring model generation algorithm based on credit big data labels comprises the following steps:

1. labeling original data based on a label classification and quantitative analysis method of credit big data to construct an enterprise label matrix;

2. screening according to the identity tags, and constructing an enterprise scene tag library;

3. processing the enterprise label matrix by using the (k, ε)-coreset;

4. carrying out index screening on the enterprise credit data by using a random forest algorithm;

5. taking the IV value as a single variable screening standard;

6. fitting the screened variables to a logistic regression model;

7. and combining the scoring card model to obtain the credit score of the enterprise.

Preferably, in the step 1, the original data of the enterprise obtained from the public information credit platform is labeled by a label classification and quantitative analysis method based on credit big data, so as to construct an enterprise label matrix.

Preferably, in the step 2, different identity tags are screened according to different analysis scenarios, and different enterprise scenario tag databases are constructed.

Preferably, the (k, ε)-coreset-based algorithm in the step 3 is used to compress the sparse matrix and reduce the space and time complexity of the computation.

Preferably, constructing the random forest in the step 4 includes:

1) taking blacklist enterprises in the current year as bad samples, and taking the rest enterprises as good samples to fit a random forest model;

2) after the importance of every index is obtained, removing the indexes whose share of importance is less than 0.1%, so as to obtain the preliminarily screened data indexes.

Preferably, in the step 5, the data index generated in the step 3 is evaluated according to the WOE value and the IV value;

1) WOE value calculation formula:

$$\mathrm{WOE}_i = \ln\left(\frac{p_{good,i}}{p_{bad,i}}\right)$$

wherein $p_{good,i}$ is the proportion of good samples under the i-th value of the label;

$p_{bad,i}$ is the proportion of bad samples under the i-th value of the label;

2) IV value calculation formula:

$$\mathrm{IV} = \sum_{i=1}^{N} \left(p_{good,i} - p_{bad,i}\right) \cdot \mathrm{WOE}_i$$

wherein N is the number of possible values of the label.

Preferably, in the step 6, a logistic regression model is fitted to the variables obtained after the screening and the weight of each dimension is calculated; using the scorecard model mentioned in the step 7, the WOE value and the regression coefficient of each dimension are then combined to obtain the credit score of the enterprise.

(III) advantageous effects

The invention provides an enterprise credit scoring model generation algorithm based on a credit big data label.

The method has the following beneficial effects:

1. The method uses big data credit labels and adopts a parent-child model structure: sparse big data information is processed into dense information by the child models, the outputs of the child models are then used as input variables of the parent model, and the information is processed layer by layer, forming a technical framework of models nested within models.

2. Through the enterprise credit scoring algorithm model, the invention analyzes the behaviors and needs of credit subjects across different industries, fields, regions and application directions, and accurately characterizes and scores the credit risk features of enterprises. Using a data-driven credit evaluation system, it brings as much credit-related data as possible into the credit rating index system; combined with a big data credit rating model, it reduces manual intervention in the results, and the rating results are applied to fields such as social governance, public services, economic activities and public welfare.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment:

As shown in FIG. 1, an embodiment of the present invention provides an enterprise credit scoring model generation algorithm based on credit big data labels, including the following specific contents:

Firstly, labeling the original data by using the label classification and quantitative analysis method based on credit big data, and constructing an enterprise label matrix:

Enterprises that once obtained a business qualification are taken as samples, both those still in operation (surviving enterprises) and those that have since closed, and all available public credit records and industry credit records from the last two years are collected. The credit information data are derived from the data of registration management departments such as market supervision authorities, public credit information platform data, and credit-related internet data.

After data cleaning is carried out using the company's credit big data label technology, all index data are quantized and an enterprise label matrix is constructed, in which each value represents the quantized information of a given enterprise in a specific dimension.
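As an illustration of the label matrix described above, the following minimal Python sketch turns per-enterprise tag records into a matrix with one row per enterprise and one column per tag. The tag names, the ordinal quantization of the tax credit level and the use of 0.0 for missing values are assumptions made for this example; the patent does not specify them.

```python
# Hypothetical tag columns; the real tag set would come from the credit label system.
TAGS = ["asset_liability_ratio", "tax_credit_level", "years_established"]

# Assumed ordinal quantization of a categorical tag (illustrative only).
TAX_LEVEL_SCORE = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}


def quantize(tag, value):
    """Map a raw tag value to a number; missing values become 0.0 (an assumption)."""
    if value is None:
        return 0.0
    if tag == "tax_credit_level":
        return TAX_LEVEL_SCORE.get(value, 0.0)
    return float(value)


def build_label_matrix(records):
    """Return (enterprise_ids, matrix) with one row per enterprise, one column per tag."""
    ids = sorted(records)
    matrix = [[quantize(tag, records[eid].get(tag)) for tag in TAGS] for eid in ids]
    return ids, matrix


# Invented records; the second enterprise is sparse (two tags missing).
records = {
    "E001": {"asset_liability_ratio": 0.62, "tax_credit_level": "A", "years_established": 8},
    "E002": {"tax_credit_level": "C"},
}
ids, matrix = build_label_matrix(records)
print(ids)
print(matrix)
```

Encoding missing values as 0.0 keeps the matrix rectangular but makes "missing" indistinguishable from a true zero; a real pipeline might prefer a separate missing-value indicator.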

Secondly, processing the enterprise label matrix by using the (k, ε)-coreset:

Because enterprise credit data are acquired through diverse and non-standardized channels, the rate of missing client information is high and the information on the same client is often incomplete across dimensions, which ultimately shows up as data sparsity. Processing the enterprise label matrix with the (k, ε)-coreset effectively reduces the dimension of the sparse enterprise matrix and highlights the core information in the matrix.

The algorithm is defined as follows:

For a point set $S \subseteq \mathbb{R}^n$ in n-dimensional space and a vector $x \in \mathbb{R}^n$, the minimum Euclidean distance from the vector x to the point set S is defined as:

$$\mathrm{dist}(x, S) = \min_{s \in S} \lVert x - s \rVert_2 \quad (1)$$

For a matrix A of dimension (m × n) with row vectors $(a_1, \ldots, a_m)$, the sum of the squares of the distances from A to S is defined as:

$$\mathrm{dist}^2(A, S) = \sum_{i=1}^{m} \mathrm{dist}^2(a_i, S) \quad (2)$$

For the coreset:

For an (m × n) matrix A, the row vectors $(a_1, \ldots, a_m)$ can be understood as m points in n-dimensional space, and the coreset is the set C obtained by weighting these row vectors, i.e. $C = \{\omega_1 a_1, \ldots, \omega_m a_m\}$.

Here all weights are greater than or equal to 0, so the coreset C is a weighted subset of the set of row vectors of the matrix A; when all weights are equal to 1, C is exactly the set of row vectors of A. C must also satisfy that, for every k-dimensional subspace S, the distance from S to A can be approximated by the distance from S to the coreset C of A, which is expressed as:

$$\left| \mathrm{dist}^2(A, S) - \mathrm{dist}^2(C, S) \right| \le \varepsilon \cdot \mathrm{dist}^2(A, S) \quad (3)$$

In short, the distance from S to A can be approximated by the distance from S to C.
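To make inequality (3) concrete, the following Python sketch checks it numerically for the simplest case k = 1, where each subspace S is a line through the origin. It is only a verification of the bound for a hand-built weighted set, not a coreset construction algorithm; the matrix, the weights and the test directions are invented for the example.

```python
import math


def dist2_to_line(point, direction):
    """Squared distance from `point` to the line through the origin along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    proj = sum(p * u for p, u in zip(point, unit))
    return sum(p * p for p in point) - proj * proj


def dist2_rows(rows, direction):
    """Sum of squared distances from every row to the line."""
    return sum(dist2_to_line(row, direction) for row in rows)


def satisfies_bound(A, C, direction, eps):
    """Check |dist^2(A,S) - dist^2(C,S)| <= eps * dist^2(A,S) for one line S."""
    full = dist2_rows(A, direction)
    approx = dist2_rows(C, direction)
    return abs(full - approx) <= eps * full + 1e-12  # small slack for rounding


# A has a duplicated row; weighting one copy by sqrt(2) and the other by 0
# gives a smaller set C with the same summed squared distance to any line.
A = [(1.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
weights = [math.sqrt(2.0), 0.0, 1.0]
C = [tuple(w * x for x in row) for row, w in zip(A, weights) if w > 0]

directions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
all_ok = all(satisfies_bound(A, C, d, eps=1e-9) for d in directions)
print(len(A), len(C), all_ok)
```

The weight sqrt(2) works because scaling a point by ω scales its squared distance to a subspace through the origin by ω², so one weighted row can stand in for two identical rows.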

Thirdly, performing index screening on the enterprise credit data by using a random forest algorithm:

The relative weights of the indexes are obtained using a random forest. After the indexes with an importance of less than 0.1% are removed, the remaining indexes are ranked from largest to smallest by weight. It is found that three indexes, namely the asset-liability ratio, the tax payment credit level and the number of years since establishment, have the greatest influence on judging whether an enterprise continues to survive. The output results also differ according to how the target variable is defined.
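The screening rule can be sketched as follows. This minimal Python example assumes the importances have already been obtained from a fitted random forest (for instance the `feature_importances_` attribute of scikit-learn's `RandomForestClassifier`); the function name, the fourth index name and all importance values are invented for illustration and are not results from the patent.

```python
def screen_by_importance(importances, min_share=0.001):
    """Keep indexes whose share of total importance is at least `min_share`
    (0.1% by default), ranked from most to least important."""
    total = sum(importances.values())
    shares = {name: value / total for name, value in importances.items()}
    kept = [(name, share) for name, share in shares.items() if share >= min_share]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Illustrative importances only.
importances = {
    "tax_credit_level": 0.33,
    "asset_liability_ratio": 0.41,
    "rarely_filled_tag": 0.0005,
    "years_established": 0.2595,
}
ranked = screen_by_importance(importances)
ranked_names = [name for name, _ in ranked]
print(ranked_names)
```

Normalizing by the total makes the 0.1% threshold a share of overall importance, which matches how tree-ensemble importances are usually reported (summing to 1).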

Fourthly, WOE and IV of the labels:

1) the labels produced by the algorithm in the third step are evaluated according to the WOE (Weight of Evidence) value and the IV (Information Value) value;

2) WOE value calculation formula:

$$\mathrm{WOE}_i = \ln\left(\frac{P_{good,i}}{P_{bad,i}}\right)$$

$P_{good,i}$ is the proportion of good samples under the i-th value of the label;

$P_{bad,i}$ is the proportion of bad samples under the i-th value of the label;

3) IV value calculation formula:

$$\mathrm{IV} = \sum_{i=1}^{n} \left(P_{good,i} - P_{bad,i}\right) \cdot \mathrm{WOE}_i$$

n is the number of possible values of the label.

The criteria for screening variables are shown in the following table, selecting tags with IV values greater than 0.03;
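The WOE and IV evaluation can be illustrated with a short Python sketch. It assumes the customary definitions WOE = ln(P_good / P_bad) per label value and IV = Σ (P_good − P_bad) · WOE, where P_good and P_bad are the shares of all good and all bad samples that take that value; the sample counts are invented, and only the 0.03 threshold comes from the text.

```python
import math


def woe_iv(good_counts, bad_counts):
    """Return (list of WOE values, IV) for one label.

    good_counts[i] / bad_counts[i] are the numbers of good / bad samples that
    take the i-th value of the label. Every count is assumed to be > 0;
    zero counts would need smoothing, which is not covered here.
    """
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    woes, iv = [], 0.0
    for g, b in zip(good_counts, bad_counts):
        p_good = g / total_good  # share of all good samples with this value
        p_bad = b / total_bad    # share of all bad samples with this value
        woe = math.log(p_good / p_bad)
        woes.append(woe)
        iv += (p_good - p_bad) * woe
    return woes, iv


# Invented counts for a two-valued label.
woes, iv = woe_iv(good_counts=[80, 20], bad_counts=[10, 10])
keep_label = iv > 0.03  # screening threshold stated in the text
print(woes, iv, keep_label)
```

Each term of the IV sum is non-negative, because (P_good − P_bad) and its WOE always share the same sign, so a larger IV indicates a label that separates good and bad samples more strongly.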

Fifthly, logistic regression model:

Traditional credit risk scoring models take the logistic regression method as their core. Logistic regression has unique advantages in handling binary dependent variables; moreover, the model makes only weak assumptions about the data distribution and performs well even when the data are not normally distributed. It is therefore currently the method most widely used by financial institutions and credit reporting agencies at home and abroad. By changing the target variable of the training set, the variables of each dimension in the model change accordingly, yielding a logistic regression model specific to that target variable.
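As a minimal sketch of this step, the following Python code fits a logistic regression to a single WOE-encoded variable with plain batch gradient descent. The solver, the learning rate, the number of iterations and the data are all assumptions chosen for illustration; the patent does not state how the model is fitted, and a library implementation would normally be used in practice.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus actual
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_bad(x, w, b):
    """Predicted probability that a sample is bad (y = 1)."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Invented data: one WOE-encoded variable, y = 1 marks a bad sample.
X = [[-0.9], [-0.9], [-0.9], [0.5], [0.5], [0.5], [0.5], [0.5]]
y = [1, 1, 0, 0, 0, 0, 0, 1]
w, b = fit_logistic(X, y)
p_low, p_high = predict_bad([-0.9], w, b), predict_bad([0.5], w, b)
print(w, b, p_low, p_high)
```

With WOE defined as ln(P_good / P_bad), a lower WOE means relatively more bad samples, so the fitted coefficient is negative here and the predicted bad probability is higher for the low-WOE group.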

Sixthly, scorecard model:

Single-dimension scoring algorithm:

score = A + B*ln(odds),

wherein odds is computed from P, the probability that the user is a bad user;

the single-dimension score interval algorithm is as follows:

$$B = \frac{PDO}{\ln 2}$$

wherein PDO (points to double the odds) is the number of points by which the score increases each time odds doubles.

B is obtained by substituting the score p0 when odds is θ_0 and the score p0 + PDO when odds is 2θ_0 into the score formula:

$$p_0 = A + B\ln\theta_0, \qquad p_0 + PDO = A + B\ln(2\theta_0)$$

the final score determination is shown in the following table:

In the model, the potential score of the enterprise for the year is obtained from the above formula, with a benchmark score of 50 points and a PDO (points to double the odds) of 5 points. Through this process, the scores of all enterprises in the travel industry are calculated, completing the construction of an industry credit development trend analysis model.
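The calibration of score = A + B*ln(odds) can be sketched in Python as follows. The benchmark score of 50 and the PDO of 5 are taken from the text; the base odds θ_0 = 1.0 and the function names are assumptions for illustration, since the text does not give θ_0. The odds value is passed in directly so that the sketch does not depend on how odds is derived from P.

```python
import math


def scorecard_params(base_score, pdo, base_odds):
    """Solve score = A + B*ln(odds) for A and B from two calibration points:
    odds = base_odds gives base_score, odds = 2*base_odds gives base_score + pdo."""
    b = pdo / math.log(2)
    a = base_score - b * math.log(base_odds)
    return a, b


def score(odds, a, b):
    """Score for a given odds value."""
    return a + b * math.log(odds)


# base_odds = 1.0 is an assumed value for theta_0.
A, B = scorecard_params(base_score=50, pdo=5, base_odds=1.0)
for odds in (1.0, 2.0, 4.0):
    print(odds, round(score(odds, A, B), 6))
```

Subtracting the two calibration equations eliminates A and gives B = PDO / ln 2, which is why each doubling of odds moves the score by exactly PDO points.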

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

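1. An enterprise credit scoring model generation algorithm based on credit big data labels, characterized in that the method comprises the following steps:

1. labeling original data based on a label classification and quantitative analysis method of credit big data to construct an enterprise label matrix;

2. screening according to the identity tags, and constructing an enterprise scene tag library;

3. processing the enterprise label matrix by using the (k, ε)-coreset;

4. carrying out index screening on the enterprise credit data by using a random forest algorithm;

5. taking the IV value as a single variable screening standard;

6. fitting the screened variables to a logistic regression model;

7. combining the scoring card model to obtain the credit score of the enterprise.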

2. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in the step 1, the original data of the enterprise obtained from the public information credit platform is labeled by the label classification and quantitative analysis method based on credit big data, and an enterprise label matrix is constructed.

3. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in the step 2, different identity tags are screened according to different analysis scenes, and different enterprise scene tag databases are constructed.

4. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in the step 3, the (k, ε)-coreset-based algorithm is used to compress the sparse matrix and reduce the space and time complexity of the computation.

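5. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: constructing the random forest in the step 4 comprises the following steps:

1) taking blacklist enterprises in the current year as bad samples, and taking the rest enterprises as good samples to fit a random forest model;

2) after the importance of every index is obtained, removing the indexes whose share of importance is less than 0.1%, so as to obtain the preliminarily screened data indexes.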

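6. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in the step 5, the data index generated in the step 3 is evaluated according to the WOE value and the IV value;

1) WOE value calculation formula:

$$\mathrm{WOE}_i = \ln\left(\frac{p_{good,i}}{p_{bad,i}}\right)$$

wherein $p_{good,i}$ is the proportion of good samples under the i-th value of the label;

$p_{bad,i}$ is the proportion of bad samples under the i-th value of the label;

2) IV value calculation formula:

$$\mathrm{IV} = \sum_{i=1}^{N} \left(p_{good,i} - p_{bad,i}\right) \cdot \mathrm{WOE}_i$$

wherein N is the number of possible values of the label.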

7. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in the step 6, a logistic regression model is fitted to the variables obtained after the screening and the weight of each dimension is calculated; using the scoring card model mentioned in the step 7, the WOE value and the regression coefficient of each dimension are then combined to obtain the credit score of the enterprise.