CN115760373A

CN115760373A - Method and device for establishing enterprise credit risk control model

Info

Publication number: CN115760373A
Application number: CN202211505130.1A
Authority: CN
Inventors: 齐阳
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-03-07

Abstract

The invention discloses a method and a device for establishing an enterprise credit risk control model, which can be used in the technical field of artificial intelligence. The method comprises the following steps: receiving credit information data of a first type of enterprise as training samples and adding the training samples to a source domain, and receiving credit information data of a second type of enterprise as test samples and adding the test samples to a target domain; measuring the distance between the test samples and the training samples to obtain the corresponding characteristics of the test samples and the training samples; classifying the test samples in the target domain, and constructing a model of the test samples; matching the training samples in the source domain one-to-one with the closest test samples in the target domain, and performing consistency learning on the characteristics of the matched samples to optimize the model of the test samples; and taking the model as the credit risk control model of the second type of enterprise. The invention can establish a suitable enterprise risk control model, which is used to analyze the credit information of small and micro enterprises so as to reduce the loan risk of small and micro enterprises.

Description

Method and device for establishing enterprise credit risk control model

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for establishing an enterprise credit risk control model.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

In recent years, commercial banks have received a growing number of credit applications from enterprises, mainly small and micro enterprises, so credit risk control for small and micro enterprises is very important to commercial banks.

Credit risk control for small and micro enterprises is currently characterized by few training samples and a data distribution that deviates from that of large and medium enterprises. Most existing enterprise credit risk control models are obtained through training on large and medium enterprise credit evaluations, so general credit risk control models are mostly unsuitable for small and micro enterprise credit risk control in existing scenarios.

Disclosure of Invention

The embodiment of the invention provides a method for establishing an enterprise credit risk control model, which is used for establishing a suitable enterprise risk control model, analyzing the credit information of small and micro enterprises and reducing the loan risk of small and micro enterprises. The method comprises the following steps:

the method comprises the steps of receiving credit information data of a first type of enterprise as a training sample and adding the training sample to a source domain, receiving credit information data of a second type of enterprise as a testing sample and adding the testing sample to a target domain, wherein the first type of enterprise is an enterprise with the scale larger than a preset threshold value; the second type of enterprises are enterprises with the scale smaller than a preset threshold;

classifying the training samples in the source domain according to the characteristics of the training samples, and measuring the distance between the test samples and the training samples to obtain the characteristics corresponding to the test samples and the training samples;

classifying the test sample in the target domain according to the corresponding characteristics and the classification of the training sample in the source domain, and constructing a model of the test sample;

matching the training samples in the source domain with the samples closest to the test samples in the target domain in a one-to-one manner, and performing consistency learning on the characteristics of the matched samples to optimize the model of the test samples;

and taking the model as the credit risk control model of the second type of enterprises, and performing credit risk control on the second type of enterprises by using the credit risk control model.

The embodiment of the invention also provides a device for establishing the enterprise credit risk control model, which is used for establishing a suitable enterprise risk control model, analyzing the credit information of small and micro enterprises and reducing the loan risk of small and micro enterprises. The device comprises:

the data receiving module is used for receiving credit information data of a first type of enterprise, taking the credit information data as a training sample, adding the training sample into a source domain, receiving credit information data of a second type of enterprise, taking the credit information data as a testing sample, and adding the testing sample into a target domain, wherein the first type of enterprise is an enterprise of which the scale is larger than a preset threshold value; the second type of enterprises are enterprises with the scale smaller than a preset threshold;

the metric learning module is used for classifying the training samples in the source domain according to the characteristics of the training samples and obtaining the characteristics corresponding to the test samples and the training samples by measuring the distance between the test samples and the training samples;

the model building module is used for classifying the test samples in the target domain according to the corresponding characteristics and the classification of the training samples in the source domain, and building a model of the test samples;

the sample matching module is used for matching the training samples in the source domain with the test samples in the target domain in a one-to-one manner, performing consistency learning on the characteristics of the matched samples and optimizing the model of the test samples; and taking the model as the credit risk control model of the second type of enterprises, and performing credit risk control on the second type of enterprises by using the credit risk control model.

The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor implements the method for establishing the enterprise credit risk control model when executing the computer program.

An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for establishing an enterprise credit risk control model.

Embodiments of the present invention also provide a computer program product including a computer program which, when executed by a processor, implements the method for establishing an enterprise credit risk control model.

In the prior art, most enterprise credit risk control models are obtained through training on large and medium enterprise credit evaluations, so general credit risk control models are mostly unsuitable for small and micro enterprise credit risk control in existing scenarios. In contrast, in the embodiment of the invention, the scheme for establishing the enterprise credit risk control model proceeds as follows: credit information data of a first type of enterprise is received as training samples and added to a source domain, and credit information data of a second type of enterprise is received as test samples and added to a target domain, wherein the first type of enterprise is an enterprise whose scale is larger than a preset threshold and the second type of enterprise is an enterprise whose scale is smaller than the preset threshold; the training samples in the source domain are classified according to their characteristics, and the distance between the test samples and the training samples is measured to obtain the characteristics corresponding to the test samples and the training samples; the test samples in the target domain are classified according to the corresponding characteristics and the classification of the training samples in the source domain, and a model of the test samples is constructed; the training samples in the source domain are matched one-to-one with the closest test samples in the target domain, and consistency learning is performed on the characteristics of the matched samples to optimize the model of the test samples; and the model is taken as the credit risk control model of the second type of enterprise and used to perform credit risk control on the second type of enterprise. In this way, a suitable enterprise risk control model can be established, the credit information of small and micro enterprises can be analyzed, and the loan risk of small and micro enterprises can be reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:

FIG. 1 is a schematic flow chart illustrating a method for establishing an enterprise credit risk control model according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart illustrating a method for establishing an enterprise credit risk control model according to another embodiment of the present invention;

FIG. 3 is a diagram illustrating metric learning according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a maximum flow based sample pairing method according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating domain distribution matching according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for establishing an enterprise credit risk control model according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an apparatus for establishing an enterprise credit risk control model according to another embodiment of the invention;

FIG. 8 is a schematic structural diagram of an apparatus for establishing an enterprise credit risk control model according to another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

According to the technical scheme, the data acquisition, storage, use, processing and the like meet relevant regulations of national laws and regulations.

In the description of the present specification, the terms "comprising," "including," "having," "containing," and the like are used in an open-ended fashion, i.e., to mean including, but not limited to. Reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the various embodiments is provided to schematically illustrate the practice of the invention, and the sequence of steps is not limited and can be suitably adjusted as desired.

The following explains terms involved in the embodiments of the present invention:

Transfer learning: a machine learning method in which a model developed for a source domain is taken as the starting point and reused when developing a model for a target domain.

Overdue rate analysis (vintage): a dynamic analysis method for examining the performance of assets across different periods. Based on the account age of the loans, it observes the overdue status of each batch of loans 1, 2, 3, … N months after the loans are issued; vintage is essentially an overdue rate based on a lagged index.

WOE and IV indices: in the scorecard modeling process, WOE (Weight of Evidence) is commonly used for feature transformation, and IV (Information Value) is used for measuring the prediction capability of features.

XGB and LGB algorithms: GBDT (Gradient Boosting Decision Tree) is a long-established and enduringly popular model in machine learning. Its main idea is to iteratively train weak classifiers (decision trees) to obtain an optimal model, and it offers good training results and resistance to overfitting. GBDT is widely used in industry and is often applied to tasks such as multi-class classification, click-through rate prediction, and search ranking. XGB, in full XGBoost (eXtreme Gradient Boosting), is a tool for large-scale parallel boosted trees and is currently regarded as one of the fastest and best open-source boosted tree toolkits. The algorithm applied by XGBoost is an improvement on GBDT and can be used for both classification and regression problems. LGB (LightGBM, Light Gradient Boosting Machine) is also a framework implementing the GBDT algorithm; it supports efficient parallel training and offers faster training speed, lower memory consumption, better accuracy, distributed support, and the ability to process massive data quickly.

SMOTE algorithm: SMOTE is a common algorithm for handling data imbalance. Its basic idea is to analyze the minority class samples, synthesize new samples from them, and add the synthesized samples to the data set, so that the classes in the original data are no longer severely imbalanced. The synthesis process uses the KNN technique, and the steps for generating new samples are as follows: using a nearest neighbor algorithm, calculating the K neighbors of each minority class sample; randomly selecting N samples from the K neighbors and performing random linear interpolation; constructing new minority class samples; and combining the new samples with the original data to generate a new training set.

Data standardization: standardizing (normalizing) data means scaling the data so that it falls within a small, specific interval. In some index processing for comparison and evaluation, the unit of the data is removed and the data is converted into a dimensionless pure number, so that indices with different units or orders of magnitude can be compared and weighted. The most typical approach is normalization, in which the data is uniformly mapped to the [0,1] interval. A common normalization method is min-max normalization, also called dispersion normalization, which applies the linear transformation x' = (x - min) / (max - min) to the original data so that the result falls in the [0,1] interval.

Discretizing: the continuous attribute of the data is converted into a classification attribute, namely, the continuous attribute is discretized, a plurality of discrete division points are set in the value range of the numerical value, the value range is divided into a plurality of discretized intervals, and finally different symbols or integer values are used for representing the data value in each subinterval.

Metric learning: a model framework suited to small sample learning, through which model learning from few samples is realized. When the model is used for prediction, the test sample is predicted by measuring the distance between the test sample and the training samples.

In recent years, commercial banks have received a large number of credit applications from enterprises, mainly small and micro enterprises, so credit management of small and micro enterprises is extremely important to commercial banks.

With the development of machine learning, researchers have introduced many machine learning algorithms, such as the multilayer perceptron, random forest, support vector machine, and boosting, into credit risk control models. These models depend heavily on training with large data sets, and their performance degrades severely when training samples are insufficient or the data distribution deviates significantly.

Credit risk control for small and micro enterprises is characterized by few training samples and a data distribution that deviates from that of large and medium enterprises. However, most existing enterprise credit risk control models are obtained through training on large and medium enterprise credit evaluations, so general credit risk control models are mostly unsuitable for small and micro enterprise credit risk control in existing scenarios. Therefore, to improve the model accuracy of credit risk control for small and micro enterprises, a transfer learning method is urgently needed.

Transfer learning first completes model learning using samples in the source domain and then fine-tunes the model using a small number of samples in the target domain, so as to predict target domain data. The challenges are that, on the one hand, the distributions of the source and target domains differ and, on the other hand, the number of samples in the target domain is usually small.

Aiming at the technical problems in the prior art, the invention provides a scheme for establishing an enterprise credit risk control model, which is used for establishing a suitable enterprise risk control model, analyzing the credit information of small and micro enterprises and reducing the loan risk of small and micro enterprises. The scheme for establishing the enterprise credit risk control model is described in detail below.

Fig. 1 is a schematic flow chart of establishing an enterprise credit risk control model in an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:

step 101: the method comprises the steps of receiving credit information data of a first class of enterprises as training samples and adding the training samples to a source domain, receiving credit information data of a second class of enterprises as testing samples and adding the testing samples to a target domain, wherein the first class of enterprises are enterprises of which the scale is larger than a preset threshold value; the second type of enterprises are enterprises with the scale smaller than a preset threshold;

step 102: classifying the training samples in the source domain according to the characteristics of the training samples, and measuring the distance between the test samples and the training samples to obtain the characteristics corresponding to the test samples and the training samples;

step 103: classifying the test sample in the target domain according to the corresponding characteristics and the classification of the training sample in the source domain, and constructing a model of the test sample;

step 104: matching the training samples in the source domain with the samples closest to the test samples in the target domain in a one-to-one manner, and performing consistency learning on the characteristics of the matched samples to optimize the model of the test samples;

step 105: taking the model as the credit risk control model of the second type of enterprise, and performing credit risk control on the second type of enterprise by using the credit risk control model.

In step 101, the credit information data of the first type of enterprise can be the credit information data of large and medium enterprises, and the credit information data of the second type of enterprise can be the credit information data of small and micro enterprises; large and medium enterprises are distinguished from small and micro enterprises according to the preset threshold.
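As an illustration of step 101, the split into a source domain and a target domain can be sketched as follows. This is a minimal sketch, not the patented implementation: the `Enterprise` record, its `scale` field (for example annual revenue) and the function name `split_domains` are hypothetical, and enterprises whose scale equals the threshold are left out because the text only defines "larger than" and "smaller than".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Enterprise:
    name: str
    scale: float                 # hypothetical size measure, e.g. annual revenue
    features: List[float]        # credit information features
    label: Optional[int] = None  # 1 = overdue, 0 = not overdue, None = unknown


def split_domains(
    enterprises: List[Enterprise], threshold: float
) -> Tuple[List[Enterprise], List[Enterprise]]:
    """Return (source_domain, target_domain) according to the preset threshold."""
    # First type: scale larger than the threshold -> training samples (source domain).
    source = [e for e in enterprises if e.scale > threshold]
    # Second type: scale smaller than the threshold -> test samples (target domain).
    target = [e for e in enterprises if e.scale < threshold]
    return source, target
```

For example, with a threshold of 100.0, an enterprise of scale 500.0 would be placed in the source domain and one of scale 20.0 in the target domain.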

In one embodiment, the credit information data of the first type of business in the source domain and the credit information data of the second type of business in the target domain are preprocessed, wherein the preprocessing comprises data analysis, missing value completion and relevance feature screening.

The data analysis comprises the following steps: analyzing the credit information data by generating function plots;

missing value completion includes: interpolating missing values of the credit information data according to a mean interpolation method; detecting and deleting abnormal values of the credit information data according to the 3 sigma principle; uniformly mapping the credit information data to the [0,1] interval through standardization of the credit information data, and discretizing the characteristics of continuous credit information data through a binning operation; analyzing the characteristics of the credit information data according to an overdue rate analysis method, and handling the class imbalance problem of the test samples by using the SMOTE algorithm;

the screening of the relevant characteristics comprises the following steps: detecting and deleting repeated features in the features of the credit information data, and storing the unrepeated features of the credit information data; and analyzing the correlation between the saved features and the credit card approval result, and removing irrelevant features.
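The numeric preprocessing operations named above (mean interpolation, the 3 sigma principle, mapping to [0,1], and binning) can be sketched for a single feature column as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, missing values are represented as `None`, the population standard deviation is used for the 3 sigma test, and equal-width bins are assumed because the text does not specify a binning scheme.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def drop_outliers_3sigma(values):
    """Keep only values within three standard deviations of the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) <= 3 * s]


def min_max_scale(values):
    """Linearly map values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(scaled_values, n_bins):
    """Discretize values already in [0, 1] into bin indices 0..n_bins-1."""
    return [min(int(v * n_bins), n_bins - 1) for v in scaled_values]


filled = impute_mean([10.0, None, 30.0, 20.0])
scaled = min_max_scale(filled)
bins = equal_width_bins(scaled, 4)
```

In a fuller pipeline each function would be applied per feature column, and outlier removal would drop whole samples rather than single values.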

In one embodiment, characteristics of credit information data are analyzed according to an overdue rate analysis method, WOE and IV indexes corresponding to the characteristics are obtained, and indexes with values larger than a preset value in the IV indexes are screened out.
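The WOE and IV computation for one binned feature can be sketched as follows. The sketch assumes one common convention, WOE = ln(share of good samples in the bin / share of bad samples in the bin), and IV as the sum over bins of (good share - bad share) × WOE; the smoothing count `eps`, the function names and the feature names in `select_features` are illustrative assumptions, not taken from the text.

```python
import math
from collections import defaultdict


def woe_iv(bins, labels, eps=0.5):
    """Compute per-bin WOE and the total IV of one discretized feature.

    bins:   bin index of each sample
    labels: 1 = bad (overdue), 0 = good
    eps:    assumed smoothing count so that empty cells do not give log(0)
    """
    good, bad = defaultdict(float), defaultdict(float)
    for b, y in zip(bins, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    keys = sorted(set(bins))
    total_good = sum(good[k] + eps for k in keys)
    total_bad = sum(bad[k] + eps for k in keys)
    woe, iv = {}, 0.0
    for k in keys:
        p_good = (good[k] + eps) / total_good
        p_bad = (bad[k] + eps) / total_bad
        woe[k] = math.log(p_good / p_bad)
        iv += (p_good - p_bad) * woe[k]
    return woe, iv


def select_features(iv_by_feature, min_iv):
    """Keep the features whose IV is larger than the preset value."""
    return [name for name, iv in iv_by_feature.items() if iv > min_iv]
```

A bin with proportionally more good than bad samples gets a positive WOE under this convention, and a feature whose bins separate good from bad samples more strongly gets a larger IV.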

In one embodiment, K neighbors of each test sample are calculated according to a KNN algorithm, N test samples are randomly selected from the K neighbors to perform random linear interpolation, a new test sample is constructed, the new test sample is synthesized with the test sample in the target domain, and the classes of the test samples in the target domain are balanced based on the synthesized test sample.
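The oversampling steps described above can be sketched in plain Python as follows. This is a simplified sketch rather than a reference SMOTE implementation: the function name `smote`, the use of squared Euclidean distance for the neighbor search, and the fixed random seed are assumptions made for illustration.

```python
import random


def smote(minority, k, n, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbors.

    minority: list of feature vectors of the minority class
    k:        number of nearest neighbors considered for each sample
    n:        number of neighbors drawn (without replacement) per sample
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # K nearest neighbors of x among the other minority samples.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        # Randomly pick N of the K neighbors and interpolate linearly.
        for nb in rng.sample(neighbours, min(n, len(neighbours))):
            t = rng.random()  # random position on the segment from x to nb
            synthetic.append([a + t * (b - a) for a, b in zip(x, nb)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority, k=2, n=1)
balanced_minority = minority + new_samples
```

Each synthetic sample lies on the line segment between an original sample and one of its neighbors, so it stays inside the region already occupied by the minority class.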

In step 102, the distance between the test sample and the training sample is a P-norm distance, a cosine similarity, or an Earth Mover's Distance (EMD).
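The three measures can be sketched as follows. The function names are hypothetical, and the EMD shown is the simplified one-dimensional case for two histograms over the same equally spaced bins with equal total mass, where it reduces to the sum of absolute differences of the cumulative sums; the general EMD requires solving a transport problem and is not shown.

```python
import math


def p_norm_distance(x, y, p=2.0):
    """Minkowski (P-norm) distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def emd_1d(p, q):
    """1-D Earth Mover's Distance between histograms with equal total mass."""
    total, carried = 0.0, 0.0
    for a, b in zip(p, q):
        carried += a - b       # mass that still has to move to later bins
        total += abs(carried)  # cost of moving it one bin further
    return total
```

Note that cosine similarity grows as samples become more alike, so a distance-style quantity such as `1 - cosine_similarity(x, y)` would be needed wherever "smaller means closer" is assumed.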

In one embodiment, the features of the test sample and the training sample corresponding to each other are obtained by measuring the distance between the test sample and the training sample according to the metric learning method.
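One simple way to picture distance-based prediction is a nearest-neighbor rule: a target domain test sample takes the class of the closest source domain training sample. The sketch below assumes this rule and a plain Euclidean distance on raw feature vectors; in a metric learning model the distance would normally be computed on learned feature embeddings, which are not modeled here, and the function names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify_by_distance(test_x, train_xs, train_ys, distance=euclidean):
    """Return (predicted_label, index_of_nearest_training_sample)."""
    nearest = min(range(len(train_xs)), key=lambda i: distance(test_x, train_xs[i]))
    return train_ys[nearest], nearest


# Toy source domain: two labelled training samples.
train_xs = [[0.1, 0.2], [0.9, 0.8]]
train_ys = [0, 1]
label, idx = classify_by_distance([0.8, 0.9], train_xs, train_ys)
```

The returned index identifies which training sample the test sample corresponds to, which is the kind of correspondence the later pairing step relies on.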

In step 104, based on the method of transfer learning, the training samples in the source domain are paired with the nearest samples in the target domain one-to-one.

In one embodiment, the characteristics of all samples are divided into the characteristics of a source domain characteristic group and the characteristics of a target domain characteristic group according to different domains, wherein the characteristics of the source domain characteristic group are used for constructing source domain characteristic group nodes, and the characteristics of the target domain characteristic group are used for constructing target domain characteristic group nodes.

And constructing a network flow model, wherein the network flow model comprises source nodes, target nodes, source domain feature group nodes and target domain feature group nodes, the source nodes are respectively connected with all the source domain feature group nodes, the target nodes are respectively connected with all the target domain feature group nodes, and each source domain feature group node is respectively connected with each target domain feature group node.

And determining the weight of the connection between the source domain feature group node and the target domain feature group node according to the distance between the training sample and the test sample.

And solving the network flow model according to the Dijkstra algorithm to obtain a matching method of a global optimal solution between the source domain feature group nodes and the target domain feature group nodes.

And matching the training samples in the source domain with the test samples in the target domain in a one-to-one mode according to the matching method.

In one embodiment, the network flow model is a minimum cost maximum flow model.
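The pairing described above can be sketched as a minimum cost maximum flow problem solved by successive shortest paths, using Dijkstra's algorithm with node potentials. This is a minimal sketch under assumptions the text does not state: every edge has capacity 1, edges from the source node and to the target node cost 0, and the input `cost[i][j]` is the precomputed distance between source domain sample i and target domain sample j. The function name `min_cost_pairing` is hypothetical.

```python
import heapq


def min_cost_pairing(cost):
    """Return one-to-one pairs (i, j) minimizing the total cost[i][j]."""
    n_s, n_t = len(cost), len(cost[0])
    src, sink = n_s + n_t, n_s + n_t + 1
    n = n_s + n_t + 2
    graph = [[] for _ in range(n)]  # edge = [to, capacity, cost, reverse index]

    def add_edge(u, v, cap, c):
        graph[u].append([v, cap, c, len(graph[v])])
        graph[v].append([u, 0, -c, len(graph[u]) - 1])

    for i in range(n_s):
        add_edge(src, i, 1, 0.0)                  # source node -> source domain node
        for j in range(n_t):
            add_edge(i, n_s + j, 1, cost[i][j])   # weighted by sample distance
    for j in range(n_t):
        add_edge(n_s + j, sink, 1, 0.0)           # target domain node -> target node

    inf = float("inf")
    potential = [0.0] * n
    while True:
        # Dijkstra on reduced costs, which stay non-negative thanks to potentials.
        dist, prev = [inf] * n, [None] * n
        dist[src] = 0.0
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for idx, (v, cap, c, _) in enumerate(graph[u]):
                nd = d + c + potential[u] - potential[v]
                if cap > 0 and nd < dist[v] - 1e-12:
                    dist[v], prev[v] = nd, (u, idx)
                    heapq.heappush(heap, (nd, v))
        if dist[sink] == inf:
            break  # no augmenting path left: the flow is maximal
        for v in range(n):
            if dist[v] < inf:
                potential[v] += dist[v]
        v = sink
        while v != src:  # push one unit of flow along the shortest path
            u, idx = prev[v]
            graph[u][idx][1] -= 1
            graph[v][graph[u][idx][3]][1] += 1
            v = u

    # A saturated edge between a source domain node and a target domain node is a pair.
    return [(i, e[0] - n_s) for i in range(n_s) for e in graph[i]
            if n_s <= e[0] < n_s + n_t and e[1] == 0]


pairs = min_cost_pairing([[1.0, 2.0], [1.0, 10.0]])
```

In this example a greedy nearest-first pairing would take (0, 0) and be forced into (1, 1) for a total cost of 11, whereas the flow solution pairs (0, 1) and (1, 0) for a total cost of 3, which illustrates why a globally optimal matching is sought.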

In one embodiment, the characteristics of the matched samples are subjected to consistency learning through a regular term loss function, and regular term loss in the regular term loss function is calculated according to an MSE loss function or a KL divergence loss function.
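The regular term loss over matched samples can be sketched as follows. The sketch assumes that each matched pair is given as two feature vectors of equal length, that the KL variant first turns the features into probability distributions with a softmax, and that the loss is averaged over pairs; these choices and the function names are illustrative assumptions.

```python
import math


def mse_loss(source_feat, target_feat):
    """Mean squared error between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(source_feat, target_feat)) / len(source_feat)


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def consistency_loss(pairs, kind="mse"):
    """Average regular term loss over matched (source, target) feature pairs."""
    if kind == "mse":
        losses = [mse_loss(s, t) for s, t in pairs]
    else:
        losses = [kl_divergence(softmax(s), softmax(t)) for s, t in pairs]
    return sum(losses) / len(losses)
```

During training such a term would typically be added to the classification loss with a weighting factor, so that minimizing it pulls the features of paired samples from the two domains closer together.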

In step 105, credit risk control is performed on the second type of enterprises by using the credit risk control model, where the second type of enterprises are the small and micro enterprises identified according to the preset threshold.

In one embodiment, before the credit risk control model is used for performing credit risk control on the second type of enterprises, the XGB and LGB algorithms are combined to optimize the credit risk control model, a training set and a test set are constructed according to the IV indexes screened out by overdue rate analysis, the performance of the optimized credit risk control model is tested using model accuracy and a confusion matrix, and the credit risk control model is modified according to the performance test result.

To facilitate an understanding of how the present invention may be implemented, the method for establishing an enterprise credit risk control model will be described in detail below with reference to fig. 2-5.

As shown in fig. 2, the method for establishing the enterprise credit risk control model is divided into 4 steps:
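The performance test with model accuracy and a confusion matrix can be sketched for a binary overdue / not-overdue prediction as follows. The label encoding (1 = overdue, 0 = not overdue) and the function names are assumptions; how the XGB and LGB models are combined is not specified in the text and is not modeled here.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = overdue)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        elif t == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(cm):
    """Share of correctly classified samples."""
    total = cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"]
    return (cm["tp"] + cm["tn"]) / total


cm = confusion_matrix([1, 0, 1, 0, 1], [1, 0, 0, 0, 1])
acc = accuracy(cm)
```

Because overdue samples are usually the minority, the false negative count in the confusion matrix is often more informative than accuracy alone when deciding whether the model needs to be modified.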

1. Data preprocessing, comprising: data import, missing value completion and correlation characteristic screening.

In specific implementation, the data import comprises: receiving credit information data of a first type of enterprise as training samples and adding the training samples to a source domain, and receiving credit information data of a second type of enterprise as test samples and adding the test samples to a target domain, wherein the first type of enterprise is an enterprise whose scale is larger than a preset threshold and the second type of enterprise is an enterprise whose scale is smaller than the preset threshold. Missing value completion includes: interpolating missing values of the credit information data according to a mean interpolation method; detecting and deleting abnormal values of the credit information data according to the 3 sigma principle; uniformly mapping the credit information data to the [0,1] interval through standardization of the credit information data, and discretizing the characteristics of continuous credit information data through a binning operation; analyzing the characteristics of the credit information data according to an overdue rate analysis method, and handling the class imbalance problem of the test samples by using the SMOTE algorithm. The screening of the relevant characteristics comprises the following steps: detecting and deleting repeated features in the features of the credit information data, and storing the unrepeated features of the credit information data; and analyzing the correlation between the saved features and the credit card approval result, and removing irrelevant features.

2. Metric learning, comprising: model construction, feature extraction and model measurement.

In one embodiment, as shown in fig. 3, the training samples in the source domain and the test samples in the target domain are classified according to sample features, i.e., training samples and test samples that have the same features are classified into the same class.

In specific implementation, the training samples are classified in the source domain according to the characteristics of the training samples, and the characteristics corresponding to the testing samples and the training samples are extracted by measuring the distance between the testing samples and the training samples; and classifying the test sample in the target domain according to the corresponding features and the classification of the training sample in the source domain, and constructing a model of the test sample.

3. Domain distribution matching, comprising: sample pairing, construction of regular term constraints and model optimization.

In an embodiment, the maximum flow-based sample pairing method is shown in fig. 4, where Source in the minimum cost maximum flow model in the graph is a Source node and is connected to all Source domain feature group nodes, and Receiver is a target node and is connected to all target domain feature group nodes.

In specific implementation, the characteristics of all samples are divided into the characteristics of a source domain characteristic group and the characteristics of a target domain characteristic group according to different domains, wherein the characteristics of the source domain characteristic group are used for constructing source domain characteristic group nodes, and the characteristics of the target domain characteristic group are used for constructing target domain characteristic group nodes; constructing a minimum cost maximum flow model, wherein the model comprises a source node, a target node, a source domain feature group node and a target domain feature group node, the source node is respectively connected with all the source domain feature group nodes, the target node is respectively connected with all the target domain feature group nodes, and each source domain feature group node is respectively connected with each target domain feature group node; the weight of the connection between the source domain feature group nodes and the target domain feature group nodes is determined according to the distance between the training sample and the test sample; solving the network flow model according to dijkstra algorithm to obtain a matching method of a global optimal solution between source domain feature group nodes and target domain feature group nodes; and matching the training samples in the source domain with the test samples in the target domain in a one-to-one mode according to the matching method.

In one embodiment, as shown in the domain distribution matching diagram of fig. 5, regular term constraints are constructed and the model is optimized after sample pairing. Fig. 5 shows the effect of domain distribution matching, including the sample pairing, the distributions of the source domain and the target domain, and the source domain and target domain samples.

In specific implementation, after the closest samples from the different domains are paired one to one, the characteristics of the paired samples are subjected to consistency learning through a regular term loss function. The regular term loss can be realized by adopting an MSE loss function or a KL divergence loss function, and the process of optimizing the loss function is the process of bringing the characteristics of the two domains closer together. Through the regular term loss, the difference between the two domains is reduced, which improves the generalization of the model in the target domain.
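The two regular term losses mentioned above can be sketched as follows. The softmax normalisation before the KL divergence, the `weight` coefficient and the way the regular term is added to a task loss are assumptions made for illustration; the description only states that an MSE or KL divergence loss is applied to the paired characteristics.

```python
import math

def mse_loss(source_feats, target_feats):
    """Mean squared error over all components of the paired feature vectors."""
    total, count = 0.0, 0
    for s, t in zip(source_feats, target_feats):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count

def softmax(v):
    """Turn a feature vector into a probability distribution."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [e / z for e in exps]

def kl_loss(source_feats, target_feats):
    """Mean KL(P_source || P_target) over pairs, after softmax normalisation."""
    total = 0.0
    for s, t in zip(source_feats, target_feats):
        p, q = softmax(s), softmax(t)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(source_feats)

def total_loss(task_loss, source_feats, target_feats, weight=0.1, kind="mse"):
    """Hypothetical combined objective: task loss plus the weighted regular term."""
    if kind == "mse":
        reg = mse_loss(source_feats, target_feats)
    else:
        reg = kl_loss(source_feats, target_feats)
    return task_loss + weight * reg
```

Both regular terms are zero when each paired source and target feature vector is identical, so minimising them pulls the paired characteristics of the two domains together.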

4. Model testing, comprising: test data acquisition and model migration.

In specific implementation, test data acquisition includes: constructing a training set and a testing set according to the screened IV indexes. Model migration includes: testing the performance of the optimized credit wind control model by adopting the model accuracy and the confusion matrix; and modifying the credit wind control model according to the performance test result.
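The two performance measures named above can be computed as in the following minimal sketch. It assumes binary labels, with 1 denoting an overdue (bad) sample and 0 a normal (good) sample; the label coding and function names are illustrative assumptions.

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels (1 = overdue)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["tn" if truth == 0 else "fn"] += 1
    return counts

def accuracy(counts):
    """Share of samples whose predicted label equals the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total
```

The confusion matrix complements the accuracy on imbalanced credit data, because it separates overdue samples that were missed (`fn`) from normal samples that were wrongly flagged (`fp`).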

The embodiment of the invention also provides a device for establishing the enterprise credit wind control model, which is described in the following embodiment. Because the principle of the device for solving the problems is similar to the method for establishing the enterprise credit wind control model, the implementation of the device can refer to the implementation of the method for establishing the enterprise credit wind control model, and repeated parts are not repeated.

Fig. 6 is a schematic structural diagram of an apparatus for establishing an enterprise credit wind control model according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes:

the data receiving module 01 is used for receiving credit information data of a first type of enterprises, taking the credit information data as a training sample, adding the training sample into a source domain, receiving credit information data of a second type of enterprises, taking the credit information data as a testing sample, and adding the testing sample into a target domain, wherein the first type of enterprises are enterprises of which the scale is larger than a preset threshold value; the second type of enterprises are enterprises with the scale smaller than a preset threshold;

the metric learning module 02 is used for classifying the training samples in the source domain according to the characteristics of the training samples, and obtaining the characteristics corresponding to the test samples and the training samples by measuring the distance between the test samples and the training samples;

the model building module 03 is used for classifying the test samples in the target domain according to the corresponding features and the classification of the training samples in the source domain, and building a model of the test samples;

the sample pairing module 04 is used for pairing the training samples in the source domain and the testing samples in the target domain in a one-to-one mode, performing consistency learning on the characteristics of the paired samples, and optimizing the model of the testing samples; and taking the model as a credit wind control model of the second type of enterprises, and performing credit wind control on the second type of enterprises by using the credit wind control model.

In one embodiment, the sample pairing module is specifically configured to:

dividing the characteristics of all samples into the characteristics of a source domain characteristic group and the characteristics of a target domain characteristic group according to different domains, wherein the characteristics of the source domain characteristic group are used for constructing source domain characteristic group nodes, and the characteristics of the target domain characteristic group are used for constructing target domain characteristic group nodes;

constructing a network flow model, wherein the network flow model comprises source nodes, target nodes, source domain feature group nodes and target domain feature group nodes, the source nodes are respectively connected with all the source domain feature group nodes, the target nodes are respectively connected with all the target domain feature group nodes, and each source domain feature group node is respectively connected with each target domain feature group node;

determining the weight of the connection between the source domain feature group node and the target domain feature group node according to the distance between the training sample and the test sample;

solving the network flow model according to Dijkstra's algorithm to obtain a globally optimal matching between the source domain feature group nodes and the target domain feature group nodes;

As shown in fig. 7, the above apparatus for establishing an enterprise credit wind control model may further include: a data pre-processing module 05 for:

preprocessing credit information data of a first type of enterprise in a source domain and credit information data of a second type of enterprise in a target domain, wherein the preprocessing comprises data analysis, missing value completion and correlation feature screening;

the data analysis comprises: analyzing credit information data according to a generating function image method;

the missing value completion comprises: interpolating missing values of the credit information data according to a mean interpolation method; detecting and deleting abnormal values of the credit information data according to the 3-sigma principle; uniformly mapping the credit information data to the [0,1] interval through standardization of the credit information data, and discretizing the characteristics of continuous credit information data through a binning operation; analyzing the characteristics of the credit information data according to an overdue rate analysis method, and processing the class imbalance problem of the test samples by utilizing the SMOTE algorithm;

the relevant feature screening comprises: detecting and deleting repeated features in the features of the credit information data, and storing the unrepeated features of the credit information data; and analyzing the correlation between the saved features and the credit card approval result, and removing irrelevant features.
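The numeric preprocessing steps listed above (mean interpolation, 3-sigma outlier deletion, mapping to the [0,1] interval, and binning) can be sketched for a single feature column as follows. Representing a missing value as `None`, using the population standard deviation, and using equal-width bins are assumptions made for illustration.

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def drop_3sigma_outliers(values):
    """Keep only values within three standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= 3 * sd]

def min_max_scale(values):
    """Map values linearly onto the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(scaled, n_bins):
    """Discretize values already in [0, 1] into bin indexes 0 .. n_bins - 1."""
    return [min(int(v * n_bins), n_bins - 1) for v in scaled]
```

A column would typically pass through the four functions in the order shown, so that the bins are computed on imputed, outlier-free, scaled values.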

In one embodiment, the data preprocessing module is specifically configured to:

analyzing the characteristics of the credit information data according to an overdue rate analysis method by the following steps:

analyzing the characteristics of credit information data to obtain WOE and IV indexes corresponding to the characteristics;

and screening out the indexes with the numerical values larger than the preset value in the IV indexes.
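The WOE and IV computation and the threshold screening can be sketched as follows. The WOE sign convention (good share divided by bad share), the `eps` replacement for empty cells and the function names are assumptions; the description does not fix them. The sketch also assumes that both overdue and non-overdue samples are present.

```python
import math
from collections import defaultdict

def woe_iv(bins, labels, eps=0.5):
    """bins[i]: bin id of sample i; labels[i]: 1 = overdue (bad), 0 = not overdue (good).
    Returns ({bin: WOE}, IV) for one binned feature."""
    good, bad = defaultdict(int), defaultdict(int)
    for b, y in zip(bins, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    total_good, total_bad = sum(good.values()), sum(bad.values())
    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        # eps stands in for an empty cell so that the logarithm stays defined
        good_share = max(good[b], eps) / total_good
        bad_share = max(bad[b], eps) / total_bad
        woe[b] = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe[b]
    return woe, iv

def select_by_iv(iv_by_feature, threshold):
    """Keep the features whose IV is larger than the preset value."""
    return [name for name, iv in iv_by_feature.items() if iv > threshold]
```

Reversing the sign convention changes the sign of every WOE value but leaves the IV unchanged, so the screening result does not depend on that choice.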

In one embodiment, the data preprocessing module is specifically configured to:

processing the class imbalance problem of the test samples by utilizing the SMOTE algorithm through the following steps:

calculating K neighbors of each test sample according to a KNN algorithm;

randomly selecting N test samples from K neighbors to carry out random linear interpolation, and constructing new test samples;

the new test sample is synthesized with the test samples in the target domain, and the classes of the test samples in the target domain are balanced based on the synthesized test samples.
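The three steps above can be sketched as follows. The sketch assumes that oversampling is applied to the samples of the under-represented class, that neighbors are found by squared Euclidean distance, and that a fixed random seed is used for reproducibility; the function and parameter names are hypothetical.

```python
import random

def smote(minority, k=3, n=1, seed=0):
    """Build synthetic samples by interpolating each sample toward n of its k neighbors.
    minority: list of feature vectors belonging to the under-represented class."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        # K nearest neighbors of x among the other samples
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
        )[:k]
        # randomly select N of the K neighbors and interpolate linearly toward each
        for nb in rng.sample(neighbours, min(n, len(neighbours))):
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

def balance(samples, synthetic):
    """Combine the original samples with the synthetic ones."""
    return list(samples) + list(synthetic)
```

Every synthetic sample lies on the line segment between an original sample and one of its neighbors, so no value outside the range of the original data is produced.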

As shown in fig. 8, the above apparatus for establishing an enterprise credit wind control model may further include: a model test module 06 for:

optimizing a credit wind control model by combining XGB and LGB algorithms;

constructing a training set and a testing set according to the screened IV indexes, and testing the performance of the optimized credit wind control model by adopting the model accuracy and a confusion matrix;

and modifying the credit wind control model according to the performance test result.

In summary, in the embodiment of the present invention, credit information data of a first type of enterprise is received as a training sample and added to a source domain, and credit information data of a second type of enterprise is received as a testing sample and added to a target domain, where the first type of enterprise is an enterprise whose scale is greater than a preset threshold; the second type of enterprises are enterprises with the scale smaller than a preset threshold; classifying the training samples in the source domain according to the characteristics of the training samples, and measuring the distance between the test samples and the training samples to obtain the characteristics corresponding to the test samples and the training samples; classifying the test sample in the target domain according to the corresponding characteristics and the classification of the training sample in the source domain, and constructing a model of the test sample; matching the training samples in the source domain with the samples closest to the test samples in the target domain in a one-to-one manner, and performing consistency learning on the characteristics of the matched samples to optimize the model of the test samples; and taking the model as a credit wind control model of the second type of enterprises, and performing credit wind control on the second type of enterprises by using the credit wind control model. The method can establish a proper small and micro enterprise credit wind control model, analyzes the small and micro enterprise credit information according to the model, is favorable for banks to design credit products with strong pertinence, individuation and differentiation, and helps the banks to strengthen loan business management, thereby reducing the loan risk of small and micro enterprises.

Embodiments of the present invention further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method for establishing the enterprise credit wind control model.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of establishing an enterprise credit wind control model, comprising:

2. The method of claim 1, further comprising:

the data analysis comprises: analyzing the credit information data according to a generating function image method;

the missing value completion comprises: interpolating missing values of the credit information data according to a mean interpolation method; detecting and deleting abnormal values of the credit information data according to the 3-sigma principle; uniformly mapping the credit information data to the [0,1] interval through standardization of the credit information data, and discretizing the characteristics of continuous credit information data through a binning operation; analyzing the characteristics of the credit information data according to an overdue rate analysis method, and processing the class imbalance problem of the test samples by utilizing the SMOTE algorithm;

3. The method of claim 2, wherein analyzing the characteristics of the credit information data according to an overdue rate analysis method comprises:

4. The method of claim 2, wherein processing the class imbalance problem of the test samples by utilizing the SMOTE algorithm comprises:

calculating K neighbors of each test sample according to a KNN algorithm;

5. The method of claim 1, wherein the distance between the test sample and the training sample is a P-norm distance, a cosine similarity, or an EMD distance.

6. The method of claim 1, wherein pairing the training samples in the source domain with the nearest distance samples in the target domain comprises:

the weight of the connection between the source domain feature group nodes and the target domain feature group nodes is determined according to the distance between the training sample and the test sample;

7. The method of claim 6, wherein the network flow model is a minimum cost maximum flow model.

8. The method of claim 1, wherein the consistency learning of the features of the paired samples comprises:

and carrying out consistency learning on the characteristics of the matched samples through a regular term loss function, wherein the regular term loss in the regular term loss function is obtained by calculation according to an MSE loss function or a KL divergence loss function.

9. The method of claim 3, further comprising:

optimizing a credit wind control model by combining XGB and LGB algorithms;

constructing a training set and a testing set according to the screened IV indexes, and testing the performance of the optimized credit wind control model by adopting the model accuracy and the confusion matrix;

10. An apparatus for establishing an enterprise credit wind control model, comprising:

the data receiving module is used for receiving credit information data of a first type of enterprises, taking the credit information data as a training sample, adding the training sample into a source domain, receiving credit information data of a second type of enterprises, taking the credit information data as a testing sample, and adding the testing sample into a target domain, wherein the first type of enterprises are enterprises of which the scale is larger than a preset threshold value; the second type of enterprises are enterprises with the scale smaller than a preset threshold;

11. The apparatus of claim 10, further comprising a data pre-processing module to:

the missing value completion comprises: interpolating missing values of the credit information data according to a mean interpolation method; detecting and deleting abnormal values of the credit information data according to the 3-sigma principle; uniformly mapping the credit information data to the [0,1] interval through standardization of the credit information data, and discretizing the characteristics of continuous credit information data through a binning operation; analyzing the characteristics of the credit information data according to an overdue rate analysis method, and processing the class imbalance problem of the test samples by utilizing the SMOTE algorithm;

12. The apparatus of claim 11, wherein the data pre-processing module is specifically configured to:

13. The apparatus of claim 11, wherein the data pre-processing module is specifically configured to:

processing the class imbalance problem of the test samples by utilizing the SMOTE algorithm through the following steps:

calculating K neighbors of each test sample according to a KNN algorithm;

the new test sample is synthesized with the test samples in the target domain, balancing the classes of the test samples in the target domain based on the synthesized test samples.

14. The apparatus of claim 10, wherein the distance between the test sample and the training sample is a P-norm distance, a cosine similarity, or an EMD distance.

15. The apparatus of claim 10, wherein the sample pairing module is specifically configured to:

solving the network flow model according to Dijkstra's algorithm to obtain a globally optimal matching between the source domain feature group nodes and the target domain feature group nodes;

16. The apparatus of claim 15, wherein the network flow model is a minimum cost maximum flow model.

17. The apparatus of claim 10, wherein the sample pairing module is specifically configured to:

18. The apparatus of claim 12, further comprising a model test module to:

optimizing a credit wind control model by combining XGB and LGB algorithms;

19. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 9 when executing the computer program.

20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.

21. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 9.