CN111047193A - Enterprise credit scoring model generation algorithm based on credit big data label - Google Patents
Enterprise credit scoring model generation algorithm based on credit big data label
- Publication number
- CN111047193A (application CN201911278580.XA)
- Authority
- CN
- China
- Prior art keywords
- credit
- enterprise
- big data
- label
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Landscapes
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Educational Administration (AREA)
- Operations Research (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides an algorithm for generating an enterprise credit scoring model based on credit big data labels, and relates to the technical field of scoring model generation. The algorithm comprises the following steps: 1. labeling the original data using a label classification and quantitative analysis method for credit big data, to construct an enterprise label matrix; 2. screening by identity label to construct an enterprise scenario label library; 3. processing the enterprise label matrix with the (k, ε)-core set; 4. screening indexes of the enterprise credit data with a random forest algorithm; 5. using the IV value as the screening criterion for individual variables; 6. fitting the screened variables to a logistic regression model. The method uses big data credit labels and a parent-child model structure: the child models condense sparse big data information into dense information, the outputs of the child models serve as input variables of the parent model, and the information is processed layer by layer, forming a technical framework of models nested within models.
Description
Technical Field
The invention relates to the technical field of scoring model generation, in particular to an enterprise credit scoring model generation algorithm based on credit big data labels.
Background
Enterprise credit is a product of the market economy. It is a comprehensive analysis and assessment of the ability of market participants to fulfil their economic contracts and of their overall credibility. In market economies, an enterprise's credit level is directly linked to its financing cost: enterprises (entities) with high credit ratings and good standing pay lower interest rates when issuing bonds or applying for loans, while those with low ratings and poor standing pay correspondingly higher rates. Enterprises (entities) without a credit rating, that is, with no credit record, are not allowed to issue bonds in the market and generally find it difficult to obtain loans.
As is known, the prevailing standard is the three-grade, ten-level credit rating scale. AAA is the highest level: it indicates that the enterprise has high creditworthiness, low debt risk, an excellent credit record, sound operations, strong profitability and broad development prospects, and that uncertain factors have very little influence on its operation and development. At present, however, there is no effective algorithm for generating an enterprise credit scoring model based on credit big data labels, so enterprise credit scoring carries considerable uncertainty.
Disclosure of Invention
(I) technical problem to be solved
To address the defects of the prior art, the invention provides an algorithm for generating an enterprise credit scoring model based on credit big data labels. It solves the problem that, because the prior art lacks an effective algorithm of this kind, enterprise credit scoring carries considerable uncertainty.
(II) technical scheme
To achieve the above purpose, the invention adopts the following technical scheme: an enterprise credit scoring model generation algorithm based on credit big data labels, comprising the following steps:
1. labeling the original data using a label classification and quantitative analysis method for credit big data, to construct an enterprise label matrix;
2. screening by identity label to construct an enterprise scenario label library;
3. processing the enterprise label matrix with the (k, ε)-core set;
4. screening indexes of the enterprise credit data with a random forest algorithm;
5. using the IV value as the screening criterion for individual variables;
6. fitting the screened variables to a logistic regression model;
7. combining the scorecard model to obtain the credit score of the enterprise.
Preferably, in step 1, the original enterprise data obtained from the public credit information platform are labeled using the label classification and quantitative analysis method for credit big data, so as to construct the enterprise label matrix.
Preferably, in step 2, different identity labels are screened for different analysis scenarios, and a separate enterprise scenario label library is constructed for each.
Preferably, the (k, ε)-core set algorithm in step 3 is used to compress the sparse matrix and to reduce the space and time complexity of the computation.
Preferably, constructing the random forest in step 4 comprises:
1) taking enterprises blacklisted in the current year as bad samples and all remaining enterprises as good samples, and fitting a random forest model;
2) after the importance of every index has been obtained, removing the indexes whose share of importance is less than 0.1 percent, which gives the preliminarily screened data indexes.
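Preferably, in step 5, the data indexes generated in step 3 are evaluated according to their WOE and IV values;
1) WOE value calculation formula (the formula image is not reproduced in this text; the standard form consistent with the definitions below is):
$$\mathrm{WOE}_i = \ln\left(\frac{p_{good,i}}{p_{bad,i}}\right)$$
where p_good,i is the proportion of good samples that take the i-th value of the label;
p_bad,i is the proportion of bad samples that take the i-th value of the label;
2) IV value calculation formula (standard form, under the same caveat):
$$\mathrm{IV} = \sum_{i=1}^{N} \left(p_{good,i} - p_{bad,i}\right) \cdot \mathrm{WOE}_i$$
where N is the number of possible values of the label.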
Preferably, in step 6, a logistic regression model is fitted to the variables that remain after screening, and the weight of each dimension is calculated; then, using the scorecard model of step 7, the WOE value and the regression coefficient of each dimension are combined to obtain the credit score of the enterprise.
(III) advantageous effects
The invention provides an enterprise credit scoring model generation algorithm based on a credit big data label.
The method has the following beneficial effects:
1. The invention uses big data credit labels and a parent-child model structure: the child models condense sparse big data information into dense information, the outputs of the child models serve as input variables of the parent model, and the information is processed layer by layer, forming a technical framework of models nested within models.
2. Through the enterprise credit scoring algorithm model, the invention analyzes the behaviors and needs of credit subjects across different industries, fields, regions and application directions, and accurately characterizes and scores the credit risk features of each enterprise. Using a data-driven credit rating system, it brings as much credit-related data as possible into the credit rating index system; combined with a big data credit rating model, this reduces manual intervention in the result. The rating results can be applied to social governance, public services, economic activity, public welfare and other fields.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example:
As shown in FIG. 1, an embodiment of the present invention provides an enterprise credit scoring model generation algorithm based on credit big data labels, with the following specific contents:
First, labeling the original data using the label classification and quantitative analysis method for credit big data, and constructing the enterprise label matrix:
Enterprises that renewed their business qualification and enterprises that closed after once obtaining a business qualification are taken as samples, and all available public credit records and industry credit records from the last two years are collected. The credit information data are drawn from the information data of registration management departments such as market supervision authorities, from public credit information platform data, and from credit-related internet data.
After the data are cleaned using the company's credit big data label technology, all index data are quantized and the enterprise label matrix is constructed; each value in the matrix represents the quantized information of one enterprise in one specific dimension.
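As a rough illustration of such a matrix, the sketch below quantizes a few raw records into rows of numbers. The label names, the ordinal encoding of the tax credit level and the use of `None` for missing values are all assumptions made for the example; the source does not specify the labels or the quantization rules.

```python
# Hypothetical labels and quantization rules; the source does not list them.
LABELS = ["tax_credit_level", "years_established", "on_blacklist"]

TAX_LEVELS = {"A": 4, "B": 3, "C": 2, "D": 1}  # assumed ordinal encoding


def quantize(record):
    """Turn one enterprise's raw record into a numeric row (None = missing)."""
    return [
        TAX_LEVELS.get(record.get("tax_credit_level")),
        record.get("years_established"),
        1 if record.get("on_blacklist") else 0,
    ]


def build_label_matrix(records):
    """Rows are enterprises; columns follow the order of LABELS."""
    return [quantize(r) for r in records]


records = [
    {"tax_credit_level": "A", "years_established": 12, "on_blacklist": False},
    {"years_established": 3, "on_blacklist": True},  # tax level missing
]
matrix = build_label_matrix(records)
```

Missing entries such as the second enterprise's tax level are what make the real matrix sparse.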
Secondly, processing the enterprise label matrix with the (k, ε)-core set:
Because the channels through which enterprise credit data are acquired are diverse and non-standardized, the rate of missing client information is high: the information on the same client is often incomplete across dimensions, which ultimately shows up as sparse data. Processing the enterprise label matrix with the (k, ε)-core set effectively reduces the dimension of the sparse enterprise matrix and highlights the core information in it.
The algorithm is defined as follows:
The minimum Euclidean distance from a vector x to a point set S is defined as:
$$\mathrm{dist}(x, S) = \min_{s \in S} \lVert x - s \rVert_2 \quad (1)$$
For a matrix A of dimension (m × n) with row vectors (a_1, ..., a_m), the sum of the squared distances from A to S is defined as:
$$\mathrm{dist}^2(A, S) = \sum_{i=1}^{m} \mathrm{dist}^2(a_i, S) \quad (2)$$
For the core set: the row vectors (a_1, ..., a_m) of the (m × n) matrix A can be understood as m points in n-dimensional space, and the core set C is the set of these row vectors after weighting, i.e. C = {ω_1 a_1, ..., ω_m a_m}.
Here all weights are greater than or equal to 0, so the core set C is a weighted subset of the row vectors of A; when all weights equal 1, C is exactly the set of row vectors of A. The core set must also satisfy the condition that, for every k-dimensional subspace S, the distance from S to A is approximated by the distance from S to the core set C of A:
$$\lvert \mathrm{dist}^2(A, S) - \mathrm{dist}^2(C, S) \rvert \le \varepsilon \cdot \mathrm{dist}^2(A, S) \quad (3)$$
In short, the distance from S to A can be approximated by the distance from S to C.
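Inequality (3) can be checked numerically on a toy matrix. The sketch below is not the core set construction itself, which the source does not describe. It only evaluates the squared distances and tests the bound on a handful of candidate subspaces. It assumes that each subspace passes through the origin and is given by an orthonormal basis; the function names and the example data are hypothetical.

```python
import math


def dist2_point(x, basis):
    """Squared distance from x to the subspace spanned by an orthonormal basis."""
    norm2 = sum(v * v for v in x)
    proj2 = sum(sum(a * b for a, b in zip(x, u)) ** 2 for u in basis)
    return norm2 - proj2


def dist2_rows(rows, basis):
    """Sum of squared distances from every row to the subspace."""
    return sum(dist2_point(r, basis) for r in rows)


def satisfies_bound(A, C, subspaces, eps, tol=1e-12):
    """Spot-check inequality (3) on the given subspaces only, not on all of them."""
    for S in subspaces:
        d_a = dist2_rows(A, S)
        d_c = dist2_rows(C, S)
        if abs(d_a - d_c) > eps * d_a + tol:
            return False
    return True


# Toy matrix with a repeated row; one weighted row stands in for both copies.
a, b = [1.0, 2.0, 0.0], [0.0, 1.0, 3.0]
A = [a, a, b]
w = math.sqrt(2.0)              # two copies of a -> weight sqrt(2)
C = [[w * v for v in a], b]     # weighted rows {w*a, 1*b}

# A few 1-dimensional subspaces (k = 1), each given by one unit vector.
subspaces = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.6, 0.8]]]
ok = satisfies_bound(A, C, subspaces, eps=1e-9)
```

Because scaling a row by ω scales its squared distance to a subspace through the origin by ω², the two-row set C reproduces the distances of the three-row matrix A up to rounding. Dropping a row without reweighting, for example C = [b], breaks the bound.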
Thirdly, performing index screening on the enterprise credit data with the random forest algorithm:
The random forest is used to obtain the relative weight of each index; indexes whose importance is less than 0.1 percent are removed, and the remaining indexes are ranked by weight from largest to smallest. Three indexes, namely the asset-liability ratio, the tax payment credit level and the number of years since establishment, are found to have the largest influence on judging whether an enterprise continues to operate. The output also differs according to how the target variable is defined.
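The screening rule itself is simple once importance scores are available. The sketch below assumes the scores have already been produced by a fitted random forest, which is not shown; the index names and values are invented for illustration, and only the 0.1 percent threshold comes from the text.

```python
def screen_by_importance(importances, min_share=0.001):
    """Drop indexes below min_share of total importance, then rank the rest."""
    total = sum(importances.values())
    shares = {name: value / total for name, value in importances.items()}
    kept = {name: s for name, s in shares.items() if s >= min_share}
    # Largest share first.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Hypothetical importance scores, as a fitted random forest might report them.
importances = {
    "asset_liability_ratio": 0.41,
    "tax_credit_level": 0.33,
    "years_established": 0.2595,
    "office_floor_number": 0.0005,
}
ranked = screen_by_importance(importances)
```

Here `office_floor_number` falls below the threshold and is removed; the other three indexes are returned in descending order of importance.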
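Fourthly, WOE and IV of the labels:
1) the labels produced by the algorithm in the third step are evaluated according to their WOE (Weight of Evidence) and IV (Information Value) values;
2) WOE value calculation formula (the formula image is not reproduced in this text; the standard form consistent with the definitions below is):
$$\mathrm{WOE}_i = \ln\left(\frac{P_{good,i}}{P_{bad,i}}\right)$$
where P_good,i is the proportion of good samples that take the i-th value of the label;
P_bad,i is the proportion of bad samples that take the i-th value of the label;
3) IV value calculation formula (standard form, under the same caveat):
$$\mathrm{IV} = \sum_{i=1}^{N} \left(P_{good,i} - P_{bad,i}\right) \cdot \mathrm{WOE}_i$$
where N is the number of possible values of the label.
The criteria for screening variables are shown in the following table; labels with IV values greater than 0.03 are selected.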
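A minimal sketch of the WOE and IV calculation for one label follows. It assumes the common convention WOE = ln(P_good / P_bad), since the formula itself is not reproduced in this text and some references use the inverse ratio; the IV is the same under either convention. The sample data and function name are hypothetical, and only the 0.03 threshold comes from the text.

```python
import math


def woe_iv(values, is_bad):
    """Return ({label value: WOE}, IV) for one categorical label.

    Assumes WOE = ln(p_good / p_bad); some references use the inverse ratio.
    """
    n_good = sum(1 for b in is_bad if not b)
    n_bad = sum(1 for b in is_bad if b)
    woe, iv = {}, 0.0
    for v in sorted(set(values)):
        good = sum(1 for x, b in zip(values, is_bad) if x == v and not b)
        bad = sum(1 for x, b in zip(values, is_bad) if x == v and b)
        p_good, p_bad = good / n_good, bad / n_bad
        # A value with no good or no bad samples would need smoothing;
        # that case is not handled in this sketch.
        woe[v] = math.log(p_good / p_bad)
        iv += (p_good - p_bad) * woe[v]
    return woe, iv


# Hypothetical label with two values: "A" is mostly good, "B" mostly bad.
values = ["A", "A", "A", "A", "B", "B", "B", "B"]
is_bad = [False, False, False, True, True, True, True, False]
woe, iv = woe_iv(values, is_bad)
keep_label = iv > 0.03  # screening threshold taken from the text
```

With three of the four good samples under "A" and three of the four bad samples under "B", the IV works out to ln 3, well above the threshold, so the label would be kept.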
Fifthly, the logistic regression model:
The traditional credit risk scoring model is built around logistic regression. Logistic regression is particularly well suited to data with a binary dependent variable; it also makes only weak assumptions about the distribution of the data and performs well when the data are not normally distributed. It is therefore currently the most widely used method among financial institutions and credit reporting agencies at home and abroad. By changing the target variable of the training set, the variables of each dimension in the model are adjusted accordingly, giving a logistic regression model specific to that target variable.
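To make the fitting step concrete, the sketch below fits a logistic regression by plain batch gradient descent on a single variable. The source does not say which fitting procedure is used, so the optimizer, learning rate and epoch count are assumptions; in practice a statistics library would normally do this. The input column stands for a hypothetical WOE-encoded label, and the target is 1 for a bad enterprise.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent on log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for row, target in zip(X, y):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))
            err = p - target
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


# Hypothetical WOE-encoded variable: high WOE goes with fewer bad outcomes.
X = [[1.1], [1.1], [1.1], [1.1], [-1.1], [-1.1], [-1.1], [-1.1]]
y = [0, 0, 0, 1, 1, 1, 1, 0]  # 1 = bad enterprise
w = fit_logistic(X, y)
```

The fitted coefficient `w[1]` is negative, reflecting that a higher WOE value (under the ln(good/bad) convention) lowers the estimated probability of a bad outcome.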
Sixthly, the scorecard model:
The single-dimension scoring algorithm is:
score = A + B*ln(odds),
where odds is computed from P, the probability of a bad user;
The single-dimension score interval algorithm is as follows:
B is determined by PDO (points to double the odds), the increase in score for each doubling of odds.
Substituting the score p0 at odds = θ_0 and the score p0 + PDO at odds = 2θ_0 into the score formula gives:
$$p_0 = A + B\ln(\theta_0), \qquad p_0 + \mathrm{PDO} = A + B\ln(2\theta_0)$$
so that B = PDO / ln 2 and A = p_0 − B·ln(θ_0).
The final score determination is shown in the following table:
In the model, the potential score of each enterprise for the year is obtained from the formula above, with a base score of 50 points and a PDO of 5 points. Through this process, the scores of all enterprises in the travel industry are calculated, completing the construction of the industry credit development trend analysis model.
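The two calibration conditions above determine A and B directly. The sketch below solves them and evaluates score = A + B*ln(odds). The base score of 50 and PDO of 5 come from the text; the base odds θ_0 = 1.0 and the function names are assumptions, and how odds is obtained from the model's probability is left to the caller.

```python
import math


def scorecard_params(base_score, pdo, base_odds):
    """Solve p0 = A + B*ln(theta0) and p0 + PDO = A + B*ln(2*theta0)."""
    B = pdo / math.log(2.0)
    A = base_score - B * math.log(base_odds)
    return A, B


def score(odds, A, B):
    """Score formula from the text: score = A + B*ln(odds)."""
    return A + B * math.log(odds)


# Base score 50 and PDO 5 are from the text; base odds of 1.0 are assumed.
A, B = scorecard_params(base_score=50.0, pdo=5.0, base_odds=1.0)
at_base = score(1.0, A, B)
at_double = score(2.0, A, B)
```

Each doubling of the odds adds PDO points, so `at_double` exceeds `at_base` by 5 points up to rounding.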
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (7)
1. An enterprise credit scoring model generation algorithm based on credit big data labels, characterized in that it comprises the following steps:
1. labeling the original data using a label classification and quantitative analysis method for credit big data, to construct an enterprise label matrix;
2. screening by identity label to construct an enterprise scenario label library;
3. processing the enterprise label matrix with the (k, ε)-core set;
4. screening indexes of the enterprise credit data with a random forest algorithm;
5. using the IV value as the screening criterion for individual variables;
6. fitting the screened variables to a logistic regression model;
7. combining the scorecard model to obtain the credit score of the enterprise.
2. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in step 1, the original enterprise data obtained from the public credit information platform are labeled using the label classification and quantitative analysis method for credit big data, and the enterprise label matrix is constructed.
3. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in step 2, different identity labels are screened for different analysis scenarios, and a separate enterprise scenario label library is constructed for each.
4. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: the (k, ε)-core set algorithm in step 3 is used to compress the sparse matrix and to reduce the space and time complexity of the computation.
5. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: constructing the random forest in step 4 comprises:
1) taking enterprises blacklisted in the current year as bad samples and all remaining enterprises as good samples, and fitting a random forest model;
2) after the importance of every index has been obtained, removing the indexes whose share of importance is less than 0.1 percent, which gives the preliminarily screened data indexes.
6. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in step 5, the data indexes generated in step 3 are evaluated according to their WOE and IV values;
1) WOE value calculation formula:
pgood is the proportion of good samples that take the given value of the label;
pbad is the proportion of bad samples that take the given value of the label;
2) IV value calculation formula:
where N is the number of possible values of the label.
7. The enterprise credit scoring model generation algorithm based on credit big data labels as claimed in claim 1, wherein: in step 6, a logistic regression model is fitted to the variables that remain after screening, and the weight of each dimension is calculated; then, using the scorecard model of step 7, the WOE value and the regression coefficient of each dimension are combined to obtain the credit score of the enterprise.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911278580.XA CN111047193A (en) | 2019-12-13 | 2019-12-13 | Enterprise credit scoring model generation algorithm based on credit big data label |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111047193A (en) | 2020-04-21 |
Family
ID=70236178
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201911278580.XA, CN111047193A (en) | Pending | 2019-12-13 | 2019-12-13 | Enterprise credit scoring model generation algorithm based on credit big data label |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111047193A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106779457A (en) * | 2016-12-29 | 2017-05-31 | 深圳微众税银信息服务有限公司 | A kind of rating business credit method and system |
CN109784731A (en) * | 2019-01-17 | 2019-05-21 | 上海三零卫士信息安全有限公司 | A kind of private education mechanism credit scoring system and its construction method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582466A (en) * | 2020-05-09 | 2020-08-25 | 深圳市卡数科技有限公司 | Scoring card configuration method, device, equipment and storage medium for simulation neural network |
CN111582466B (en) * | 2020-05-09 | 2023-09-01 | 深圳市卡数科技有限公司 | Score card configuration method, device and equipment for simulating neural network and storage medium |
CN112182333A (en) * | 2020-09-25 | 2021-01-05 | 山东亿云信息技术有限公司 | Talent space-time big data processing method and system based on random forest |
CN112418987A (en) * | 2020-11-20 | 2021-02-26 | 厦门大学 | Method and system for rating credit of transportation unit, electronic device and storage medium |
CN112418987B (en) * | 2020-11-20 | 2022-04-29 | 厦门大学 | Method and system for rating credit of transportation unit, electronic device and storage medium |
CN113159709A (en) * | 2021-03-24 | 2021-07-23 | 深圳闪回科技有限公司 | Automatic label system and system |
CN114462516A (en) * | 2022-01-21 | 2022-05-10 | 天元大数据信用管理有限公司 | Enterprise credit score sample labeling method and device |
CN114462516B (en) * | 2022-01-21 | 2024-04-16 | 天元大数据信用管理有限公司 | Enterprise credit scoring sample labeling method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200421 |