CN115330526A

CN115330526A - Enterprise credit scoring method and device

Info

Publication number: CN115330526A
Application number: CN202211023231.5A
Authority: CN
Inventors: 王利鑫; 李仰允; 崔乐乐; 徐宏伟
Original assignee: Tianyuan Big Data Credit Management Co Ltd
Current assignee: Tianyuan Big Data Credit Management Co Ltd
Priority date: 2022-08-25
Filing date: 2022-08-25
Publication date: 2022-11-11

Abstract

The invention relates to the field of enterprise credit scoring, and in particular provides an enterprise credit scoring method comprising the following steps: S1, establishing a standard data warehouse through data aggregation and fusion; S2, screening enterprise credit evaluation indicators; S3, forming the model-input indicators of the credit evaluation model through feature engineering; S4, establishing a deep learning model; S5, training the deep learning model; S6, evaluating the importance of the model-input features; and S7, forming the enterprise score. Compared with the prior art, the method constructs a relatively accurate enterprise credit scoring model, applies deep learning to the high-dimensional features of customers to mine and analyze their potential risks in depth, and makes the credit approval service more efficient and faster.

Description

Enterprise credit scoring method and device

Technical Field

The invention relates to the field of enterprise credit scoring, and particularly provides an enterprise credit scoring method and device.

Background

Deep learning is derived from neural networks and recognizes specific patterns by simulating the human brain's ability to learn and process knowledge. Compared with traditional scoring methods, deep learning has strong parallel distributed processing capability and strong distributed storage and learning capability. It can be used in supervised settings (classification and prediction) and unsupervised settings (feature derivation), and can learn intricate hidden associations and patterns among a large number of data features. Enterprise credit scoring based on deep learning is one extended application of deep learning in this field, and it lays a foundation for later building various models in the enterprise risk control field by applying deep learning to large volumes of data and features.

Enterprise credit scoring is one of the important links in managing and controlling enterprise credit risk. It provides a reference for the probability of overdue repayment based on existing data and measures risk probability in the form of a score; generally, the higher the score, the lower the risk. Enterprise credit score modeling usually adopts machine learning methods such as logistic regression, decision trees and ensemble models. With the spread of artificial intelligence in financial risk control, credit scoring models based on deep learning have become widely used. In the credit finance industry, loans are small and dispersed and the customer base extends further down-market, so every link of lending, approval, customer service and post-loan management needs to become continuously more intelligent in order to reduce customer risk.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a highly practical enterprise credit scoring method.

The invention further provides an enterprise credit scoring device that is reasonably designed, safe and applicable.

The technical solution adopted by the invention to solve the technical problem is as follows:

An enterprise credit scoring method comprises the following steps:

S1, establishing a standard data warehouse through data aggregation and fusion;

S2, screening enterprise credit evaluation indicators;

S3, forming the model-input indicators of the credit evaluation model through feature engineering;

S4, establishing a deep learning model;

S5, training the deep learning model;

S6, evaluating the importance of the model-input features;

and S7, forming the enterprise score.

Further, in step S1, a unified data standard is first established to manage the warehoused multi-source data in a standardized way; secondly, the multi-source data are governed and processed with an ETL data governance tool: storable data such as internet data are pulled periodically, real-time interface data are processed in memory, and data processing, data standardization, indicator calculation and lightweight feature mining are performed on the data by combining batch and stream processing;

and finally, the multi-source data from the three sources are fused and converged into a unified data warehouse through horizontal and vertical data fusion, wherein the data warehouse stores the standard library data obtained after multi-source data fusion, together with the indicator library and the feature library obtained through processing.

Further, in step S2, based on the established enterprise multi-source data standard library, a three-level enterprise standard library is established, and an enterprise credit evaluation indicator system is established on the basis of that library, wherein the third-level indicators are specific enterprise credit evaluation indicators extracted from database tables; the second-level indicators are enterprise credit evaluation indicator categories obtained by classifying and organizing the third-level indicators using business knowledge;

the first-level indicators are the evaluation dimensions finally determined for assessing enterprise credit risk, and the first-level dimensions are used to display a radar chart of the enterprise profile and to evaluate the enterprise's credit risk in each sub-dimension.

Further, in step S3, exploratory data analysis is performed first. The exploratory data analysis mainly includes computing simple descriptive statistics on the training samples and the candidate indicators; after this simple statistical analysis, the data of specific indicators are segmented, and the dynamic changes of the data and the values taken under particular conditions are analyzed in depth; the model-input sample indicators are then analyzed visually by plotting the histogram of each single variable and the relation curve between each single variable and the target variable.

Further, in step S3, data cleaning is performed, and the missing indicators in the training samples are filled by a random forest method: first, the list of features with missing values and the list of features without missing values are compiled; each indicator with missing values is then taken in turn as the prediction target, a random forest model is trained using the samples in which that indicator is not missing as training examples, and the trained random forest model is output and stored to predict the missing values of that feature. The training samples that have undergone data cleaning and missing value filling are Z-Score standardized to form standardized training vectors, which are input into the neural network for model training once the neural network structure has been established.

Further, in step S4, the neural network structure is determined first, then the activation function, and finally the weight search strategy.

Further, in step S5, the deep learning model is usually trained with the open-source packages TensorFlow and Keras, and a model learning curve is plotted during training to show how the loss function, the training sample accuracy and the test sample accuracy change over the model iterations, so as to judge whether the model has converged.

Further, in step S6, during construction of the deep learning network, input perturbation feature importance is chosen to evaluate the importance of the model-input indicators: each feature of the data X is perturbed, the new X is fed into the network to obtain predicted values, and the loss function is calculated as the importance score of that feature;

after input perturbation feature importance has been applied to the model-input features, the features are sorted from high to low by perturbation importance, different thresholds are selected in turn to screen them, the model is trained multiple times on the screened features using the determined deep learning network structure, the final model-input features are determined from the training performance of the deep learning model, and the optimal deep learning network model is finally determined, output and stored.

Further, in step S7, there are two standard scorecard conversion methods. The first is based on WOE conversion and calculates feature scores from the WOE values of the features and the coefficients estimated by a logistic regression model;

the second performs standard score conversion according to the enterprise default probability obtained by model prediction: the default probability of the enterprise is predicted by the deep learning network, and the predicted default probability is converted into the standard score of the enterprise by the standard scorecard conversion method;

and finally, the overall score distribution of the training samples is tested with a normality test, and scoring results whose distribution does not conform to the normal distribution are adjusted through score adjustment and score conversion.

An enterprise credit scoring device, comprising: at least one memory and at least one processor;

the at least one memory is configured to store a machine-readable program;

the at least one processor is configured to invoke the machine-readable program to perform the enterprise credit scoring method.

Compared with the prior art, the enterprise credit scoring method and device of the invention have the following outstanding beneficial effects:

the method constructs a relatively accurate enterprise credit scoring model, applies deep learning to the high-dimensional features of customers to mine and analyze their potential risks in depth, and makes the credit approval service more efficient and faster.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of an enterprise credit scoring method.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A preferred embodiment is given below:

As shown in FIG. 1, in the enterprise credit scoring method of this embodiment, the implementation steps of the deep-learning-based enterprise credit scoring model mainly include the following. Data aggregation, data governance and data fusion are carried out on the collected enterprise data, internet data and third-party interface data, and the fused standard data are finally stored in the standard library of a data warehouse.

Enterprise credit evaluation indicators are screened on the basis of the standard library, and a three-level enterprise credit evaluation indicator system is established; feature cleaning and feature screening are performed on the candidate credit evaluation indicators to determine the final model-input indicators; deep learning and logistic regression are combined to establish the deep learning neural network, including the network structure, the activation function, the weight search strategy and so on; the enterprise credit evaluation model is trained on the resulting training vectors with the established deep learning model and tuned iteratively until the optimal model is determined; the default probability of the enterprise is predicted with the optimal model, a standard scorecard conversion based on the default probability yields the enterprise standard score, and the score distribution is checked to form the final enterprise credit score.

The method comprises the following specific steps:

S1, establishing a standard data warehouse through data aggregation and fusion;

The multi-source data of an enterprise cover three sources: government data, including industry and commerce registration, housing provident fund, social security, development and reform commission, banking and insurance regulatory, and administrative penalty information; internet data, including e-commerce data, marketing information, certification information, online store information, litigation, dishonest judgment debtor records, and bidding information; and third-party data, including enterprise business registration information, personnel information, and person-enterprise relationship data. First, a unified data standard specification is established to manage the warehoused multi-source data in a standardized way. Secondly, the multi-source data are governed and processed with ETL and other data governance tools: storable data such as internet data are pulled periodically, real-time interface data are processed in memory, and data processing, data standardization, indicator calculation, lightweight feature mining and the like are performed by combining batch and stream processing. Finally, the multi-source data from the three sources are fused and converged into a unified data warehouse through horizontal and vertical data fusion; the data warehouse stores the standard library data obtained after fusion, together with the indicator library, the feature library and other information obtained through processing.

S2, screening enterprise credit evaluation indicators;

Based on the established enterprise multi-source data standard library, a three-level enterprise standard library is established, and an enterprise credit evaluation indicator system is built on it. The third-level indicators are specific enterprise credit evaluation indicators extracted from database tables, such as the number of customs enterprise ratings obtained, the paid-in capital of the enterprise, its years in operation, its staff size, whether it is on a blacklist, and the number of contract-honoring ratings received in the last year. The second-level indicators are indicator categories obtained by classifying and organizing the third-level indicators using business knowledge, such as risk, legal representative, associated relationships, management team, industry, legal compliance, operations, and region. The first-level indicators are the evaluation dimensions finally determined for assessing enterprise credit risk, such as repayment, industry, operation, performance, region, cash flow, and operations; the first-level dimensions are used to display a radar chart of the enterprise profile and to evaluate the enterprise's credit risk in each sub-dimension.

S3, forming the model-input indicators of the credit evaluation model through feature engineering;

The candidate indicators screened from the enterprise multi-source data go through several processes, such as exploratory data analysis, data cleaning, variable selection and variable derivation, to finally form the model-input features required for model training.

1) Exploratory data analysis

Exploratory data analysis mainly consists of computing simple descriptive statistics on the training samples and the candidate indicators, analyzing the variance, mean, median, data distribution and so on of each indicator. After this simple statistical analysis, the data of specific indicators are segmented (for example by time series, by a given period, by a given country, or by how the indicator changes over time), and the dynamic changes of the data and the values taken under particular conditions are analyzed in depth. The model-input sample indicators are then analyzed visually by plotting the histogram of each single variable, the relation curve between each single variable and the target variable, and so on.
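The descriptive statistics named above can be sketched with the Python standard library. This is only a minimal illustration: the function name `describe`, the use of `None` to mark a missing value, and the sample figures are assumptions made for the example and are not taken from the invention.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for one indicator, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing_rate": 1 - len(present) / len(values),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "variance": statistics.pvariance(present),  # population variance
    }

# Hypothetical indicator values for six enterprises; None marks a missing record.
paid_in_capital = [5.0, 12.0, None, 8.0, 40.0, 5.0]
summary = describe(paid_in_capital)
print(summary)
```

A large gap between mean and median, as in this made-up sample, is the kind of skew that the histogram of a single variable would also reveal.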

2) Data cleansing

Data cleaning first handles invalid values in the indicators and converts the quantifiable indicators into numerical form. Missing value statistics are then computed for the model-input indicators, and indicators with a missing rate greater than 60% are removed. The same-value ratio of the remaining indicators is computed: features that take only a single value are removed, as are indicators whose same-value ratio exceeds 60%. Unreasonable indicators identified during exploratory data analysis are removed. VIF collinearity analysis is performed on the remaining evaluation indicators, and collinear features are removed. The feature missing rate of each training sample is computed, and training samples with a feature missing rate greater than 50% are removed. Outliers in the indicators are detected with the box-plot interquartile range (IQR) method; outliers of some indicators are screened according to the quartile criterion, and the screened outliers are treated as missing values and filled with the specific value '-999'.
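Two of the cleaning rules above, the 60% missing-rate filter and the IQR outlier screen with the '-999' fill value, can be sketched as follows. The function names, the column-oriented dictionary layout, the use of `None` for missing values and the whisker multiplier k = 1.5 (the usual box-plot convention) are assumptions for illustration; the text only specifies "the quartile criterion".

```python
import statistics

MISSING = -999  # fill value named in the text for screened outliers

def drop_sparse_features(table, max_missing=0.60):
    """Keep only features (columns) whose share of missing values is at most max_missing."""
    kept = {}
    for name, column in table.items():
        missing_rate = sum(v is None for v in column) / len(column)
        if missing_rate <= max_missing:
            kept[name] = column
    return kept

def flag_outliers_iqr(column, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the MISSING fill value."""
    present = [v for v in column if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if v is None or low <= v <= high else MISSING for v in column]

# Hypothetical indicator table: one usable column and one that is 75% missing.
table = {
    "staff_size": [10, 12, 11, 13, 12, 500, 11, 10],
    "rating_count": [None, None, None, None, None, 1, None, 2],
}
cleaned = {name: flag_outliers_iqr(col) for name, col in drop_sparse_features(table).items()}
print(cleaned)
```

In this made-up table the sparse column is dropped and the value 500 is replaced by -999, so that it can later be handled by the missing-value filling step.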

The missing indicators in the training samples are filled by a random forest method. First, the list of features with missing values and the list of features without missing values are compiled; each indicator with missing values is then taken in turn as the prediction target, a random forest model is trained using the samples in which that indicator is not missing as training examples, and the trained random forest model is output and stored to predict the missing values of that feature. The training samples that have undergone data cleaning and missing value filling are Z-Score standardized to form standardized training vectors, which are input into the neural network for model training once the neural network structure has been established.
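The Z-Score standardization mentioned above maps each indicator to zero mean and unit standard deviation, z = (x - mean) / std. The sketch below covers only this standardization step, not the random forest filling; the function name and the choice of the population standard deviation are assumptions for illustration.

```python
import statistics

def z_score(column):
    """Standardize a numeric column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

# Hypothetical indicator values after cleaning and missing value filling.
standardized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(standardized)
```

In practice the mean and standard deviation would be estimated on the training samples and reused unchanged when scoring new enterprises, so that training and prediction inputs share one scale.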

S4, establishing a deep learning model;

A deep learning model can take many network structures, such as a fully connected MLP network or a CNN convolutional network. The modeling process mainly comprises determining the neural network structure (the number of input layer nodes, the number of hidden layers, the number of hidden layer nodes, and how the hidden layers are connected), determining the activation function, determining the weight optimization strategy of the neural network (the loss function, learning rate and number of iterations), and training the deep learning model.

1) Determining neural network structure

The neural network comprises an input layer, hidden layers and an output layer. The number of input layer nodes equals the number of features of the input training samples, so it is determined by the number of model-input features; the number of output layer nodes corresponds to the number of classes of the training samples; the number of hidden layers and the number of nodes in each hidden layer are usually chosen by comparing experimental results and by experience. If there are too few hidden nodes, the network lacks the necessary learning and information processing capacity; if there are too many, the complexity of the network structure increases greatly, the network is more likely to fall into a local minimum during learning, and learning becomes very slow.

2) Determining activation functions

The most common neural network activation functions include Sigmoid, tanh, softplus, ReLU (Rectified Linear Units) and so on; the activation function of the final neural network is usually determined through comparative analysis during modeling.
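For reference, the four activation functions listed above can be written in a few lines of standard-library Python. These are the textbook scalar definitions in a numerically stable form, given only as an illustration; a deep learning framework provides its own vectorized versions.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), computed without overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(x)

def softplus(x):
    """Smooth approximation of ReLU: log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("softplus", softplus), ("relu", relu)]:
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
```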

3) Determining a weight search policy

The weight search strategy of the neural network mainly comprises determining the loss function, the optimizer, the learning rate and the number of iterations. The loss function measures the difference between the predicted output and the true value; during training, the neural network computes this difference with the loss function, propagates it backward to adjust the weights and parameters (the back-propagation strategy), and then updates the model parameters by gradient descent. As for the optimizer, deep learning networks are usually trained with open-source packages such as TensorFlow and Keras, which integrate several optimizers; in actual modeling, the best optimizer can be selected by comparative analysis on the actual training samples. The learning rate is the step size in gradient descent: too small a learning rate makes computation slow, while too large a learning rate prevents convergence, and its value is usually set from experience and the actual data during modeling. The number of iterations determines when the learning process ends: too few iterations give low accuracy, while too many incur excessive time cost, so the specific number is adjusted dynamically according to model convergence during actual training.
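The effect of the learning rate described above can be seen on a toy problem. The sketch below runs plain gradient descent on the one-parameter loss f(w) = w^2, which is an assumption chosen purely for illustration and has nothing to do with the credit model itself; the three learning rates are likewise made-up values.

```python
def gradient_descent(learning_rate, steps=50, start=10.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final weight."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w  # step against the gradient
    return w

too_small = gradient_descent(0.001)  # barely moves away from the start value
reasonable = gradient_descent(0.1)   # approaches the minimum at w = 0
too_large = gradient_descent(1.1)    # overshoots more on every step and diverges

print(too_small, reasonable, too_large)
```

With the same number of iterations, the smallest rate is still far from the minimum, the middle rate has converged, and the largest rate has moved further away than where it started.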

S5, deep learning model training;

The deep learning model is usually trained with the open-source packages TensorFlow and Keras, and a model learning curve is plotted during training to show how the loss function, the training sample accuracy and the test sample accuracy change over the model iterations, so as to judge whether the model has converged.
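Besides inspecting the plotted learning curve by eye, convergence can be checked numerically on the recorded loss values. The rule below (the loss improved by less than a tolerance over the last few epochs) is one hypothetical heuristic, not a criterion stated by the invention; the function name, window, tolerance and loss histories are all illustrative assumptions.

```python
def has_converged(losses, window=5, tolerance=1e-3):
    """Return True when the loss improved by less than `tolerance` over the last `window` epochs."""
    if len(losses) <= window:
        return False  # not enough history to judge
    return losses[-window - 1] - losses[-1] < tolerance

# Hypothetical per-epoch loss values read off two learning curves.
flat_curve = [0.9, 0.6, 0.45, 0.40, 0.3995, 0.3993, 0.3992, 0.3992, 0.3991, 0.3991]
falling_curve = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

print(has_converged(flat_curve), has_converged(falling_curve))
```

A check of this kind is one way to implement the dynamic adjustment of the number of iterations: training continues while the curve is still falling and stops once it has flattened.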

S6, evaluating the importance of the model-input features;

During construction of the deep learning network, input perturbation feature importance is chosen to evaluate the importance of the model-input indicators: each feature of the data X is perturbed, the new X is fed into the network to obtain predicted values, and the loss function is calculated as the importance score of that feature. After input perturbation feature importance has been applied to the model-input features, the features are sorted from high to low by perturbation importance, different thresholds are selected in turn to screen them, the model is trained multiple times on the screened features using the determined deep learning network structure, the final model-input features are determined from the training performance of the deep learning model, and the optimal deep learning network model is finally determined, output and stored.
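The perturbation procedure above can be sketched as follows. Several choices here are assumptions for illustration: the perturbation is a random shuffle of one column at a time, the score is the increase in log loss over the unperturbed baseline (which gives the same ranking as the perturbed loss itself), and `stand_in_model` is a made-up fixed function standing in for the trained network.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def perturbation_importance(predict, rows, labels, seed=0):
    """Score each feature by the loss increase after shuffling its column."""
    rng = random.Random(seed)
    baseline = log_loss(labels, [predict(r) for r in rows])
    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # perturb feature j only
        perturbed = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(log_loss(labels, [predict(r) for r in perturbed]) - baseline)
    return scores

def stand_in_model(row):
    """Hypothetical 'trained network': uses feature 0 and ignores feature 1."""
    return sigmoid(3.0 * row[0])

rows = [[(i - 19.5) / 10.0, float(i % 3)] for i in range(40)]
labels = [1 if r[0] > 0 else 0 for r in rows]
scores = perturbation_importance(stand_in_model, rows, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

The feature the model ignores receives a score of exactly zero, while the feature it relies on receives a positive score; sorting by score gives the high-to-low ordering to which the thresholds are then applied.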

S7, forming enterprise scores;

Through multi-threshold selection based on perturbation feature importance and multiple model iterations, a deep learning network model with optimal and stable performance is finally obtained; the default probability of the enterprise is predicted with this model, and the total score of the enterprise is calculated from the default probability by a scorecard conversion method. There are two main standard scorecard conversion methods. The first is based on WOE conversion and calculates feature scores from the WOE values of the features and the coefficients estimated by a logistic regression model. The second performs standard score conversion according to the enterprise default probability obtained by model prediction: the default probability is predicted by the deep learning network and converted into the standard score of the enterprise by the standard scorecard conversion method. Finally, the overall score distribution of the training samples is tested with a normality test, and scoring results whose distribution does not conform to the normal distribution are adjusted by methods such as score adjustment and score conversion.
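The second conversion method, from default probability to a standard score, is commonly implemented with the log-odds scaling score = offset + factor * ln(odds), where odds = (1 - p) / p, factor = PDO / ln 2 and offset = base_score - factor * ln(base_odds). The sketch below follows that common convention; the invention does not state its scaling parameters, so the base score of 600 at odds of 50:1 and the 20 points to double the odds (PDO) are purely illustrative assumptions.

```python
import math

def probability_to_score(p_default, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Convert a default probability into a scorecard score (higher score = lower risk)."""
    p = min(max(p_default, 1e-9), 1.0 - 1e-9)  # keep the odds finite
    factor = pdo / math.log(2)                 # points added each time the odds double
    offset = base_score - factor * math.log(base_odds)
    odds = (1.0 - p) / p                       # odds of not defaulting
    return offset + factor * math.log(odds)

# Hypothetical predicted default probabilities for three enterprises.
for p in (0.01, 0.05, 0.30):
    print(p, round(probability_to_score(p), 1))
```

Under these assumed parameters, an enterprise whose odds of not defaulting are 50:1 scores 600, and each doubling of those odds adds 20 points, which matches the statement that a higher score indicates lower risk.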

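Based on the foregoing method, the enterprise credit scoring device in this embodiment includes: at least one memory and at least one processor;

the at least one memory is configured to store a machine-readable program;

the at least one processor is configured to invoke the machine-readable program to perform the enterprise credit scoring method.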

The above embodiments are only specific examples of the present invention, and the scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of the enterprise credit scoring method and apparatus of the present invention and are made by those skilled in the art should fall within the scope of the present invention.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An enterprise credit scoring method, characterized by comprising the following steps:

S1, establishing a standard data warehouse through data aggregation and fusion;

S2, screening enterprise credit evaluation indicators;

S3, forming the model-input indicators of the credit evaluation model through feature engineering;

S4, establishing a deep learning model;

S5, training the deep learning model;

S6, evaluating the importance of the model-input features;

and S7, forming the enterprise score.

2. The enterprise credit scoring method according to claim 1, wherein in step S1, a unified data standard is first established to manage the warehoused multi-source data in a standardized way; secondly, the multi-source data are governed and processed with an ETL data governance tool: storable data such as internet data are pulled periodically, real-time interface data are processed in memory, and data processing, data standardization, indicator calculation and lightweight feature mining are performed on the data by combining batch and stream processing;

and finally, the multi-source data from the three sources are fused and converged into a unified data warehouse through horizontal and vertical data fusion, wherein the data warehouse stores the standard library data obtained after multi-source data fusion, together with the indicator library and the feature library obtained through processing.

3. The enterprise credit scoring method according to claim 1 or 2, wherein in step S2, based on the established enterprise multi-source data standard library, a three-level enterprise standard library is established, and an enterprise credit evaluation indicator system is established on the basis of that library, wherein the third-level indicators are specific enterprise credit evaluation indicators extracted from database tables; the second-level indicators are enterprise credit evaluation indicator categories obtained by classifying and organizing the third-level indicators using business knowledge;

the first-level indicators are the evaluation dimensions finally determined for assessing enterprise credit risk, and the first-level dimensions are used to display a radar chart of the enterprise profile and to evaluate the enterprise's credit risk in each sub-dimension.

4. The enterprise credit scoring method according to claim 3, wherein in step S3, exploratory data analysis is performed first; the exploratory data analysis mainly includes computing simple descriptive statistics on the training samples and the candidate indicators; after this simple statistical analysis, the data of specific indicators are segmented, and the dynamic changes of the data and the values taken under particular conditions are analyzed in depth; and the model-input sample indicators are analyzed visually by plotting the histogram of each single variable and the relation curve between each single variable and the target variable.

5. The enterprise credit scoring method according to claim 4, wherein in step S3, data cleaning is performed, and the missing indicators in the training samples are filled by a random forest method,

first, the list of features with missing values and the list of features without missing values are compiled; each indicator with missing values is then taken in turn as the prediction target, a random forest model is trained using the samples in which that indicator is not missing as training examples, and the trained random forest model is output and stored to predict the missing values of that feature. The training samples that have undergone data cleaning and missing value filling are Z-Score standardized to form standardized training vectors, which are input into the neural network for model training once the neural network structure has been established.

6. The enterprise credit scoring method of claim 5, wherein in step S4, the neural network structure is determined, the activation function is determined, and the weight search strategy is determined.

7. The enterprise credit scoring method according to claim 6, wherein in step S5, the deep learning model is trained with the open-source packages TensorFlow and Keras, and a model learning curve is plotted during training to show how the loss function, the training sample accuracy and the test sample accuracy change over the model iterations, so as to judge whether the model has converged.

8. The enterprise credit scoring method according to claim 7, wherein in step S6, during construction of the deep learning network, input perturbation feature importance is chosen to evaluate the importance of the model-input indicators: each feature of the data X is perturbed, the new X is fed into the network to obtain predicted values, and the loss function is calculated as the importance score of that feature;

after input perturbation feature importance has been applied to the model-input features, the features are sorted from high to low by perturbation importance, different thresholds are selected in turn to screen them, the model is trained multiple times on the screened features using the determined deep learning network structure, the final model-input features are determined from the training performance of the deep learning model, and the optimal deep learning network model is finally determined, output and stored.

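9. The enterprise credit scoring method according to claim 8, wherein in step S7, there are two standard scorecard conversion methods: the first is based on WOE conversion and calculates feature scores from the WOE values of the features and the coefficients estimated by a logistic regression model;

the second performs standard score conversion according to the enterprise default probability obtained by model prediction: the default probability of the enterprise is predicted by the deep learning network, and the predicted default probability is converted into the standard score of the enterprise by the standard scorecard conversion method;

and finally, the overall score distribution of the training samples is tested with a normality test, and scoring results whose distribution does not conform to the normal distribution are adjusted through score adjustment and score conversion.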

10. An enterprise credit scoring apparatus, comprising: at least one memory and at least one processor;

the at least one memory is configured to store a machine-readable program;

the at least one processor is configured to invoke the machine-readable program to perform the method of any one of claims 1 to 9.