CN113159461A - Small and medium-sized micro-enterprise credit evaluation method based on sample transfer learning

Info

Publication number: CN113159461A
Application number: CN202110567611.4A
Authority: CN
Inventors: 唐嘉成; 陈瑞勇; 孙秀文; 焦春明
Original assignee: Tiandao Jinke Co ltd
Current assignee: Tiandao Jinke Co ltd
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2021-07-23

Abstract

The invention discloses a credit evaluation method for small and medium-sized micro-enterprises based on sample transfer learning, which comprises the following steps: merging the acquired target domain data and source domain data, and performing data preprocessing and feature derivation on the merged data; splitting the merged data after feature derivation to obtain sample-expanded target domain data and source domain data; performing feature binning and screening-based dimension reduction on the source domain data obtained by splitting; transferring the source domain data that has undergone feature binning and screening-based dimension reduction to the target domain data obtained by splitting, and training an enterprise credit scoring model with the source domain data as the training set and the target domain data as the test set; and inputting the acquired original target domain data into the enterprise credit scoring model to predict and output the credit scores of small and medium-sized micro-enterprises. The method can accurately predict and evaluate the actual creditworthiness of small and medium-sized micro-enterprises with insufficient samples.

Description

Small and medium-sized micro-enterprise credit evaluation method based on sample transfer learning

Technical Field

The invention relates to the technical field of enterprise credit evaluation, in particular to a method for evaluating credit of small and medium-sized micro enterprises based on sample transfer learning.

Background

Whether a model has sufficient samples directly affects its prediction accuracy. In the field of enterprise credit evaluation, traditional machine learning is mostly used to build credit evaluation models for enterprises in broad industry categories that have sufficient samples (for example, manufacturing is a broad industry category, and the mold industry under manufacturing is a subclass industry), whereas it is difficult to train a highly accurate credit evaluation model for enterprises in a subclass industry (generally small and medium-sized micro-enterprises) with insufficient samples. In some industries, especially subdivided subclass industries such as the mold industry, the sample size is seriously insufficient, default samples are too few, and the ratio of default to non-default samples is extremely imbalanced, so small and medium-sized micro-enterprises in these subdivided industries cannot be evaluated accurately by applying a big-data decision model directly.

Because the actual creditworthiness of small and medium-sized micro-enterprises with insufficient samples is difficult to reflect truthfully, local financial institutions lack a basis for approving them. As a result, when such enterprises apply for loans, the approval process is often more complicated, the approval period is longer, and the applications are more likely to be rejected by banks; at the same time, financial institutions find it difficult to expand their customer base of small and medium-sized micro-enterprises whose actual creditworthiness is good. An enterprise credit evaluation method that truthfully and accurately reflects the actual creditworthiness of small and medium-sized micro-enterprises with insufficient samples is therefore urgently needed.

Disclosure of Invention

The invention provides a method for evaluating credit of small and medium-sized micro enterprises based on sample transfer learning, aiming at truly and accurately reflecting the actual credit worthiness of the small and medium-sized micro enterprises with insufficient samples.

In order to achieve the purpose, the invention adopts the following technical scheme:

A credit evaluation method for small and medium-sized micro-enterprises based on sample transfer learning is provided, comprising the following steps:

1) merging the obtained target domain data and source domain data, and performing data preprocessing and feature derivation on the merged data;

2) splitting the merged data after feature derivation to obtain the sample-expanded target domain data and source domain data;

3) performing feature binning and screening-based dimension reduction on the source domain data obtained by splitting;

4) transferring the source domain data that has undergone feature binning and screening-based dimension reduction to the target domain data obtained by splitting in step 2), and training an enterprise credit scoring model with the source domain data as the training set and the target domain data as the test set;

5) inputting the acquired original target domain data into the enterprise credit scoring model, and predicting and outputting the credit scores of the small and medium-sized micro-enterprises.

As a preferable aspect of the present invention, in step 1), the data preprocessing of the merged data includes performing missing value processing and/or abnormal value processing on the merged data.

As a preferable aspect of the present invention, in step 1), performing feature derivation on the merged data includes deriving any one or more of record-type features, statistical-type features, and other types of features besides the record-type and statistical-type features;

the statistical-type features comprise any one or more of magnitude statistical features, data fluctuation statistical features and data change statistical features;

the magnitude statistical features comprise any one or more of mean, maximum and sum statistical features;

the data fluctuation statistical features comprise variance statistical features and/or fluctuation rate statistical features;

the data change statistical features comprise growth rate statistical features and/or trend statistical features.

As a preferred embodiment of the present invention, after the feature derivation of step 1) is completed, invalid samples need to be detected and deleted, and the method for detecting and deleting invalid samples comprises:

judging whether a normal-enterprise sample has a variable-value missing ratio greater than 70%, or whether a default-enterprise sample has a variable-value missing ratio greater than 80%,

if so, the sample is regarded as an invalid sample and deleted;

if not, the sample is retained.

As a preferable aspect of the present invention, in step 3), the method for performing feature binning on the source domain data specifically includes:

3.1) placing the null values of each variable in the source domain data into a separate bin, then performing feature binning on the non-null part of each variable, and calculating the WOE value and the single-bin IV value of each bin i by the following formula (1) and formula (2) respectively:

$$WOE_i = \ln\left(\frac{\#B_i / \#B_T}{\#G_i / \#G_T}\right) \qquad (1)$$

$$IV_i = \left(\frac{\#B_i}{\#B_T} - \frac{\#G_i}{\#G_T}\right) \times WOE_i \qquad (2)$$

in formula (1), WOE_i represents the WOE value corresponding to bin i;

#B_i is the number of default enterprises in bin i;

#G_i is the number of normal enterprises in bin i;

#B_T is the number of all default enterprises;

#G_T is the number of all normal enterprises;

in formula (2), IV_i represents the IV value corresponding to bin i;

3.2) judging whether each variable needs feature binning adjustment,

if yes, performing feature binning adjustment on the variable;

and if not, the feature binning of the variable is complete.
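The per-bin calculation in step 3.1) can be sketched as follows. This is a minimal illustration that assumes the usual scorecard definitions, WOE_i = ln((#B_i/#B_T)/(#G_i/#G_T)) and IV_i = (#B_i/#B_T - #G_i/#G_T) × WOE_i, which match the symbol definitions above; the function name `woe_iv` and the example counts are hypothetical and not taken from the embodiment.

```python
import math

def woe_iv(bins):
    """bins: list of (n_default, n_normal) counts, one pair per bin.

    Returns (woe_per_bin, iv_per_bin, total_iv). Every bin must contain at
    least one default and one normal enterprise, otherwise the log is undefined.
    """
    b_total = sum(b for b, _ in bins)   # #B_T: all default enterprises
    g_total = sum(g for _, g in bins)   # #G_T: all normal enterprises
    woe, iv = [], []
    for b, g in bins:
        bad_share = b / b_total         # #B_i / #B_T
        good_share = g / g_total        # #G_i / #G_T
        w = math.log(bad_share / good_share)
        woe.append(w)
        iv.append((bad_share - good_share) * w)
    return woe, iv, sum(iv)

# Hypothetical counts for a variable split into three bins.
woe, iv, total_iv = woe_iv([(10, 400), (20, 350), (30, 250)])
```

Because the sign of the share difference always matches the sign of the logarithm, each single-bin IV is non-negative, so summing them gives a non-negative IV for the whole variable.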

As a preferred embodiment of the present invention, in step 3.1), the method for performing feature binning on the non-null part of a continuous variable comprises the following steps:

3.11) segmenting the data of the continuous variable to obtain a plurality of candidate split points, traversing the candidate split points, and, for each one, calculating the IV value of the continuous variable when that point is used as a bin boundary;

3.12) calculating the increment c of the IV value of the continuous variable after splitting at the point relative to its IV value before splitting at the point;

3.13) judging whether the increment c is greater than a preset increment threshold,

if yes, binning the non-null part of the continuous variable with the point as a bin boundary;

and if not, not using the point as a bin boundary for the non-null part of the continuous variable.

In a preferred embodiment of the present invention, the increment threshold is 0.01.

As a preferred aspect of the present invention, the number of samples in each of the bins i is not less than 20% of the non-empty data amount in the source domain data, and each of the bins i includes at least one positive sample and one negative sample.
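Steps 3.11) to 3.13), together with the 0.01 threshold and the per-bin constraints, can be sketched as a greedy search. The sketch makes several assumptions that the text does not fix: candidate split points are supplied by the caller, "the IV value before" is read as the IV of the bins accepted so far, and the WOE/IV definitions are the usual ones. The names `total_iv` and `choose_cuts` and the toy data are hypothetical.

```python
import math

MIN_SHARE = 0.20      # each bin holds at least 20% of the non-null values
IV_THRESHOLD = 0.01   # preferred increment threshold from the text

def total_iv(values, labels, cuts):
    """Total IV of a variable binned at `cuts`; None if any bin breaks a constraint."""
    edges = [-math.inf] + sorted(cuts) + [math.inf]
    n_bad = sum(labels)                      # label 1 = default enterprise
    n_good = len(labels) - n_bad
    iv = 0.0
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [y for x, y in zip(values, labels) if lo < x <= hi]
        bad = sum(in_bin)
        good = len(in_bin) - bad
        if len(in_bin) < MIN_SHARE * len(values) or bad == 0 or good == 0:
            return None
        bad_share, good_share = bad / n_bad, good / n_good
        iv += (bad_share - good_share) * math.log(bad_share / good_share)
    return iv

def choose_cuts(values, labels, candidates, threshold=IV_THRESHOLD):
    """Accept a candidate split only if it raises the IV by more than `threshold`."""
    cuts, current_iv = [], 0.0               # a single bin has IV 0
    for c in candidates:
        trial = total_iv(values, labels, cuts + [c])
        if trial is not None and trial - current_iv > threshold:
            cuts.append(c)
            current_iv = trial
    return sorted(cuts), current_iv

# Toy data: defaults are rare below 50 and common from 50 upward.
values = list(range(100))
labels = [1 if (v < 50 and v % 5 == 0) or (v >= 50 and v % 2 == 0) else 0 for v in values]
cuts, iv = choose_cuts(values, labels, [24.5, 49.5, 74.5])
```

In this toy run the splits at 24.5 and 49.5 each add more than 0.01 to the IV and are kept, while the split at 74.5 separates two bins with nearly identical default rates, adds only about 0.0035, and is rejected.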

As a preferable scheme of the present invention, in step 3.2), the method for determining whether each variable needs to be subjected to feature binning adjustment comprises:

judging whether the WOE value of the variable in the corresponding sub-box shows monotonous change or U-shaped trend change along with the increase of the value of the variable,

if yes, judging that the variable does not need to be subjected to characteristic binning adjustment;

if not, the variable is judged to need to be subjected to box separation adjustment.

As a preferable aspect of the present invention, during binning adjustment, if the WOE values of the variable across its bins still show neither a monotonic nor a U-shaped trend as the value of the variable increases while the share of non-null data in bin i is gradually raised from 10% to 50%, the variable is discarded.
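The shape test used to decide whether a variable needs binning adjustment can be sketched as below. The input is the list of WOE values ordered by increasing bin value. Treating an inverted U as "U-shaped" is an assumption of this sketch, and the function names are hypothetical.

```python
def is_monotonic(woe):
    """True if the WOE sequence never decreases or never increases."""
    pairs = list(zip(woe, woe[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def is_u_shaped(woe):
    """True if the sequence changes direction exactly once (U or inverted U)."""
    rising = [b > a for a, b in zip(woe, woe[1:]) if b != a]
    turns = sum(1 for s, t in zip(rising, rising[1:]) if s != t)
    return turns == 1

def needs_adjustment(woe):
    """A variable needs binning adjustment when its WOE is neither monotonic nor U-shaped."""
    return not (is_monotonic(woe) or is_u_shaped(woe))
```

For example, `needs_adjustment([-0.5, 0.1, 0.7])` and `needs_adjustment([0.6, -0.2, 0.5])` are both false, while a zig-zag sequence such as `[0.1, 0.5, -0.3, 0.4]` is flagged for adjustment.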

As a preferable scheme of the present invention, in step 3), the method for screening and reducing dimensions of the variables comprises:

judging whether the IV value corresponding to each variable is larger than a preset IV threshold value,

if yes, the variable is reserved;

if not, the variable is eliminated.

As a preferable scheme of the present invention, in step 3), the method for screening and reducing dimensions of each variable comprises:

judging whether the correlation between the variables is larger than a preset correlation threshold value,

if yes, eliminating all the variables with correlation larger than the correlation threshold;

if not, the variable is reserved.
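The two screening rules can be combined in one pass, as in the sketch below. The text does not give the threshold values or say which member of a highly correlated pair is removed, so the thresholds 0.02 and 0.8, the use of the Pearson coefficient, and the choice to keep the higher-IV variable of each pair are all assumptions of this example; `pearson` and `screen` are hypothetical names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen(columns, iv, iv_threshold, corr_threshold):
    """Keep variables whose IV exceeds the threshold, then drop correlated duplicates."""
    kept = [name for name in columns if iv[name] > iv_threshold]
    kept.sort(key=lambda name: iv[name], reverse=True)   # strongest variables first
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= corr_threshold for s in selected):
            selected.append(name)
    return selected

# Hypothetical variables: b is almost a multiple of a, d has a negligible IV.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
    "d": [1, 1, 2, 2, 3],
}
iv = {"a": 0.30, "b": 0.20, "c": 0.10, "d": 0.005}
selected = screen(columns, iv, iv_threshold=0.02, corr_threshold=0.8)
```

Here `d` fails the IV threshold, `b` is dropped because it is almost perfectly correlated with the stronger variable `a`, and `a` and `c` remain.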

As a preferred embodiment of the present invention, in step 4), the enterprise credit scoring model is obtained by training using KLIEP transfer learning technology.
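KLIEP (Kullback-Leibler Importance Estimation Procedure) estimates an importance weight for each source sample so that the weighted source distribution resembles the target distribution. The text does not describe how the weights enter model training, so the sketch below only shows the weight estimation, in its common formulation: Gaussian kernels centred on target samples, gradient ascent on the target log-likelihood, and projection onto the constraint that the weights average to 1 over the source. The kernel width, step size, iteration count and toy data are arbitrary assumptions.

```python
import math

def gauss(x, c, sigma):
    """Gaussian kernel between two points given as tuples."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * sigma ** 2))

def kliep_weights(source, target, sigma=1.0, lr=0.01, iters=300):
    """Return one importance weight per source sample (assumed settings, not from the text)."""
    centers = target                                     # kernel centres on target samples
    A = [[gauss(x, c, sigma) for c in centers] for x in target]
    b = [sum(gauss(x, c, sigma) for x in source) / len(source) for c in centers]
    bb = sum(v * v for v in b)
    alpha = [1.0] * len(centers)
    for _ in range(iters):
        w_t = [sum(a * al for a, al in zip(row, alpha)) for row in A]
        grad = [sum(A[j][l] / w_t[j] for j in range(len(target)))
                for l in range(len(centers))]
        alpha = [al + lr * g for al, g in zip(alpha, grad)]          # ascent step
        ba = sum(bv * al for bv, al in zip(b, alpha))
        alpha = [al + (1 - ba) * bv / bb for al, bv in zip(alpha, b)]  # project onto b.alpha = 1
        alpha = [max(0.0, al) for al in alpha]                        # keep weights non-negative
        ba = sum(bv * al for bv, al in zip(b, alpha))
        alpha = [al / ba for al in alpha]                             # renormalise
    return [sum(al * gauss(x, c, sigma) for al, c in zip(alpha, centers)) for x in source]

# Toy 1-D example: source spread over [-3, 3], target concentrated around 1.0.
source = [(i / 10,) for i in range(-30, 31)]
target = [(0.5 + i / 10,) for i in range(11)]
weights = kliep_weights(source, target)
```

One common way to use such weights, offered here only as a possibility, is to pass them as per-sample weights when fitting the scoring model on the source domain, so that source enterprises resembling the target subclass industry count for more.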

The invention has the following beneficial effects:

1. the enterprise credit scoring model can accurately predict the creditworthiness of enterprises in industries with insufficient samples, which solves the previous difficulty of accurately evaluating the credit of small and medium-sized micro-enterprises with insufficient samples;

2. the trained enterprise credit scoring model scores the credit of small and medium-sized micro-enterprises in industries with insufficient samples, so that a financial institution can accurately screen out enterprise customers that meet its loan admission conditions, which facilitates precise customer acquisition and expands its customer base of small and medium-sized micro-enterprises;

3. the enterprise credit scoring model trained by the invention truthfully and accurately reflects the actual creditworthiness of small and medium-sized micro-enterprises with insufficient samples, helps financial institutions improve the decision efficiency of their own risk-control process, provides a scientific basis for their loan approval, and helps them avoid missing potential creditworthy small and medium-sized micro-enterprise customers.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 is a diagram of the implementation steps of a credit evaluation method for small and medium-sized micro-enterprises based on sample transfer learning according to an embodiment of the present invention;

Fig. 2 is a flow block diagram of the credit evaluation method for small and medium-sized micro-enterprises according to an embodiment of the present invention;

Fig. 3 is a diagram of the method steps for feature binning of source domain data;

Fig. 4 is a diagram of the method steps for feature binning of non-null continuous variables;

Fig. 5 is a distribution diagram of the credit scores assigned to small and medium-sized micro-enterprises by the enterprise credit scoring model obtained through transfer learning.

Detailed Description

The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.

The drawings are for illustration only; they are schematic rather than depictions of actual form and are not to be construed as limiting this patent. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.

In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

This embodiment takes the credit evaluation of small and medium-sized micro-enterprises in the mold industry as an example. Based on enterprise data from the financial service credit information sharing platform of the region where the enterprises are located, a number of enterprise credit evaluation indexes are constructed from several dimensions, including enterprise production and operation, enterprise revenue and tax payment, enterprise credit information, enterprise basic information and enterprise risk behavior. Combining machine learning with transfer learning, the embodiment first uses machine learning to build an enterprise credit evaluation model for a broad industry category, and then transfers that model to the target subclass industry by transfer learning. (For example, the mold industry is a subclass industry of manufacturing, and manufacturing is the broad industry category of the mold industry; "broad industry category" is used here in a wide sense and covers every classification level above the subclass.) This effectively solves the problem that the creditworthiness of small and medium-sized micro-enterprises cannot be assessed with big data because the sample size of the target subclass industry is too small or its positive and negative samples are imbalanced (default enterprise data are negative samples and non-default enterprise data are positive samples).

The technical scheme provided by the invention for evaluating the creditworthiness of small and medium-sized micro-enterprises with insufficient samples consists of the following four parts: 1) obtaining modeling samples; 2) preprocessing the modeling sample data; 3) feature engineering; 4) model construction, validation and evaluation. In this embodiment, 29 model evaluation indexes are finally determined through evaluation, covering four data dimensions: enterprise production and operation, enterprise revenue and tax payment, enterprise credit information, and enterprise basic information.

The following describes in detail an implementation process of the sample transfer learning-based credit evaluation method for small and medium-sized micro-enterprises, which is provided by the embodiment of the present invention, with reference to the accompanying drawings. As shown in fig. 1 and fig. 2, the method for evaluating credit of a small and medium-sized micro-enterprise provided in this embodiment includes the following steps:

Step 1) merging the obtained target domain data and source domain data, and performing data preprocessing and feature derivation on the merged data; the target domain data is subclass-industry data with insufficient samples (such as enterprise data of the mold industry), and the source domain data is broad-category industry data that has sufficient samples and is industrially related to the subclass industry (such as enterprise data of the manufacturing industry);

In sample data selection, the target domain data is selected first. In this embodiment, according to the national economic industry classification revised in 2017, the data of enterprises whose industry is recorded as "mold industry" (industry code 3525) in the enterprise registry of the industrial and commercial administration is selected as the target domain data of the model. Since the model is built with transfer learning, source domain data with sufficient samples is also required once the target domain data (target industry data) has been acquired, so data of enterprises that likewise belong to the manufacturing industry but not to the mold industry is selected as the source domain data. Enterprises belonging to the manufacturing industry are those whose industry code has first two digits between 13 and 43.

In addition, for an enterprise without a loan record, it cannot be determined whether it is a default enterprise (for example, an enterprise with an overdue loan is regarded as a default enterprise) or a normal enterprise, and such sample data is difficult to use for model training. Therefore only data of enterprises with loan records is selected, based on the information in the financial institutions' enterprise credit granting table, so that every sample carries a default-enterprise or normal-enterprise label.

Since the transfer learning technique only processes data distribution differences, the feature definitions of the source domain data and the target domain data are required to be the same, and therefore, when data preprocessing and feature derivation are performed on sample data, the target domain data and the source domain data need to be merged and processed in the same way. Therefore, the first step of the method for evaluating credit of small and medium-sized micro-enterprises provided by the embodiment is as follows: and merging the acquired target domain data and the acquired source domain data.

In addition, because the business occurrence time of the sample data may be incomplete and some sample data may show obvious anomalies on the time axis, one year (12 months) is selected as the sample performance window and three years (36 months) as the sample observation window in order to obtain stable data. Taking the 2018 sample data as an example, an enterprise with a non-performing loan in 2018 is regarded as a default enterprise, and the scorecard is then modeled by observing the enterprise's variables between 2016 and 2018. In the subsequent feature engineering, feature derivation is performed according to the selected window.

In order to expand the sample size and obtain more information, sample windows with three different time spans are considered: 2018 as the performance window with 2016-2018 as the observation window; 2019 as the performance window with 2017-2019 as the observation window; and 2020 as the performance window with 2018-2020 as the observation window. Feature engineering is then carried out separately for the three sample windows to obtain the sample features required for modeling, together with the default label of each sample. Finally, the data of the three sample windows is merged to obtain the final sample data set.
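The three sample windows can be generated mechanically from the performance year, as in this small sketch. It assumes, as in the examples above, that the three-year observation window ends in the performance year; the function name and the dictionary layout are hypothetical.

```python
def sample_windows(performance_years, observation_span=3):
    """Pair each performance year with the observation window that ends in that year."""
    return [
        {"performance": year, "observation": (year - observation_span + 1, year)}
        for year in performance_years
    ]

windows = sample_windows([2018, 2019, 2020])
for w in windows:
    print(w["performance"], w["observation"])
```

Features and default labels would then be derived once per window and the three resulting tables concatenated, which is how the same enterprise can contribute up to three samples.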

In this embodiment, the data preprocessing of the merged data mainly includes missing value processing and abnormal value processing of the data.

Data missing value handling

Some variables may be missing from a single record in an original data table. For example, in the original credit line table, a record may lack the "credit line" while variables such as "used credit line", "credit-granting financial institution", "effective start date" and "effective end date" are present. Such cases are handled differently according to the extent of the loss. If too many variables are missing, or important variables such as amounts and quantities are all missing, the record is deleted; if few variables are missing and the important variables are present, the record is retained so that the missing values can later be placed in a separate feature bin, which increases the applicability of the scorecard.

Where variables are missing as a result of integrating the data tables, the usual reason is that the corresponding business never actually occurred at the enterprise. These missing values are therefore not filled, so that they can be placed in a separate feature bin later.

Data outlier handling

Abnormal values are handled mainly by replacing them with missing values or by correcting them. After a value is identified as abnormal, if a normal value cannot be inferred and filled in from other information in the same table, the abnormal value is replaced directly with a missing value. For example, in the "enterprise sales revenue information" data, the original table may contain a negative sales revenue for which no normal value can be inferred from other information in the table, so the negative value is marked as abnormal and replaced with a missing value.

Abnormal values are corrected mainly by filling them in according to other information. If a value is identified as abnormal and a normal value can be inferred from other information in the table, the abnormal value is replaced with the logically consistent normal value. For example, in the "enterprise housing administration debt amount information" data, records of the same enterprise stored in different data warehousing periods may use inconsistent units in the original table. Suppose the same enterprise has two records, 3287 and 32870000, which can be understood as 3287 in units of ten thousand and 32870000 in single units. To unify the units, "32870000" is marked as abnormal and replaced by the result of dividing it by 10000.
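Both kinds of abnormal-value handling can be sketched in a few lines. The rule used here to detect a unit mismatch (a record that is exactly 10000 times the smallest record of the same enterprise) is an assumption made for illustration, since the text gives only the one example; the function names are hypothetical, and `None` stands for a missing value.

```python
def clean_revenue(value):
    """A negative sales revenue cannot be repaired from other fields, so it becomes missing."""
    if value is not None and value < 0:
        return None
    return value

def unify_units(amounts, factor=10000):
    """Rescale records that look like the same amount expressed in single units.

    Assumed detection rule: a record equal to `factor` times the smallest
    record of the same enterprise is treated as a unit mismatch.
    """
    base = min(amounts)
    return [a / factor if base > 0 and a == base * factor else a for a in amounts]

revenues = [clean_revenue(v) for v in [1200.0, -500.0, None]]
debts = unify_units([3287, 32870000])
```

After cleaning, the negative revenue is missing and both debt records read 3287, in units of ten thousand.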

In this embodiment, the characteristic derivation of the target domain data and the source domain data includes characteristic derivation of any one or more of a record-type characteristic, a statistical-type characteristic, and other types of characteristics except for the record-type characteristic and the statistical-type characteristic. The derivation of the record-type, statistical-type, and other types of features is illustrated separately below:

Derivation of record-type features

For each record-type feature, the first concern is whether the corresponding business has occurred: if the enterprise has a record of the business, the feature is set to 1, otherwise to 0. For example, for the record-type feature "enterprise deregistration status", the original data table generally records only enterprises that have gone through deregistration (for example, enterprise A once handled a deregistration). The feature can therefore be derived as follows: an enterprise with a deregistration record is marked 1, and any other enterprise is marked 0.
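A record-type feature is simply a membership flag, as the following sketch shows. The enterprise identifiers and the function name are hypothetical.

```python
def record_flag(enterprise_id, recorded_ids):
    """Return 1 if the enterprise appears in the record table, otherwise 0."""
    return 1 if enterprise_id in recorded_ids else 0

# The original table lists only enterprises that have a record of the business.
recorded = {"A"}
flags = {e: record_flag(e, recorded) for e in ["A", "B", "C"]}
```

Enterprises absent from the record table receive 0 rather than a missing value, because absence here means the business never occurred.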

Derivation of statistical features

The invention focuses on the variables related to the quantity and the amount, and generates a series of statistical characteristics of sample data of the last three years and sample data of the last one year by setting different time spans.

The statistical characteristics mainly comprise quantity statistical characteristics, data fluctuation statistical characteristics and data change statistical characteristics. The quantity statistical characteristics mainly comprise mean statistical characteristics (the mean describes the average level of the single business occurrence amount or quantity of the enterprise), maximum statistical characteristics (the maximum describes the highest level of the single business occurrence amount or quantity of the enterprise) and sum statistical characteristics (the sum describes the accumulated value of the business occurrence amount or quantity of the enterprise in a target time span);

the data fluctuation statistical characteristics mainly comprise variance statistical characteristics and fluctuation rate statistical characteristics (the fluctuation statistical characteristics describe the deviation degree of the single service occurrence amount or quantity from the mean value thereof in the target time span). Taking monthly data as an example, the fluctuation rate calculation formula is as follows:

the data change condition statistical characteristics mainly comprise growth rate statistical characteristics and trend statistical characteristics.

The growth rate mainly measures the quantity change condition of the average value in the last two years, such as the power consumption data of an enterprise, and assuming that the selected sample window period is 2016-:

The trend extracts, on the basis of the growth rate, whether the value is increasing or decreasing. Specifically, if the mean business occurrence value of a variable in 2018 is greater than its mean in 2017, the variable shows an increasing trend and its trend statistical feature is recorded as 1; otherwise it is recorded as 0.
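The magnitude features and the trend flag can be sketched directly from their descriptions. The fluctuation rate and growth rate are left out because their formulas are not reproduced in the text, and the use of the population variance is an assumption. The function names and monthly figures are hypothetical.

```python
from statistics import mean, pvariance

def stat_features(values):
    """Mean, maximum, sum and variance of the amounts observed in one time span."""
    return {
        "mean": mean(values),
        "max": max(values),
        "sum": sum(values),
        "variance": pvariance(values),   # population variance (assumed)
    }

def trend_flag(previous_year, last_year):
    """1 if the mean of the last year exceeds the mean of the year before, else 0."""
    return 1 if mean(last_year) > mean(previous_year) else 0

features = stat_features([10, 20, 30])
rising = trend_flag([100, 120], [130, 150])
falling = trend_flag([100, 120], [90, 100])
```

The same functions would be applied once to the last three years and once to the last year of data to obtain features for both time spans.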

Derivation of other types of features

Other types of features are, for example, basic information features of the enterprise (such as registered capital and age of the enterprise), and the derivation methods for other types of features are not described herein.

After the feature derivation of step 1) is completed, the source domain data and the target domain data together comprise, for example, 40646 samples with 56 predictor variables and 1 target variable (label). Of these, 657 samples are default enterprises (label "1") and 39989 samples are normal enterprises (label "0"), so the ratio of 0-labeled to 1-labeled samples is as high as 60.9:1 (default and non-default samples are extremely imbalanced). In addition, some samples have a large share of missing values, which would affect the prediction accuracy of the finally trained enterprise credit scoring model.

In order to reduce the proportion of 0-1 samples on the premise of ensuring the number of default enterprise samples and reduce the influence of invalid samples (seriously-missing samples) on subsequent modeling, the invalid samples need to be detected and deleted after the characteristic derivation is completed. The method for detecting and deleting invalid samples preferably comprises the following steps:

judging whether a sample with the variable value missing ratio larger than 70% exists in a normal enterprise sample or whether a sample with the variable value missing ratio larger than 80% exists in a default enterprise sample (considering the scarcity and the importance of a 1-label sample, different invalid sample deleting conditions are set for a 0-1 sample),

if yes, the sample is regarded as an invalid sample and deleted;

if not, the sample is retained.

After the invalid sample deletion, the sample size was reduced from 40646 to 38253, with 621 samples for the default business, 37632 samples for the normal business, and a 60.6:1 ratio of 0-1 samples.
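The invalid-sample rule and the reported class ratios can be checked with a short sketch. Missing values are represented by `None`; the function name and the ten-variable toy samples are hypothetical, while the counts come from the figures above.

```python
def is_invalid(sample, label, normal_limit=0.70, default_limit=0.80):
    """A sample is invalid when its share of missing variables exceeds the limit for its label."""
    missing_ratio = sum(v is None for v in sample) / len(sample)
    limit = default_limit if label == 1 else normal_limit   # label 1 = default enterprise
    return missing_ratio > limit

eight_missing = [None] * 8 + [1.0, 2.0]
nine_missing = [None] * 9 + [1.0]

normal_dropped = is_invalid(eight_missing, label=0)    # 80% > 70%
default_kept = not is_invalid(eight_missing, label=1)  # 80% is not above 80%
default_dropped = is_invalid(nine_missing, label=1)    # 90% > 80%

# Class ratios reported before and after deletion.
ratio_before = round(39989 / 657, 1)
ratio_after = round(37632 / 621, 1)
```

The looser limit for default enterprises keeps more of the scarce label-1 samples, which is why the ratio improves slightly from 60.9:1 to 60.6:1.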

Because the invention uses transfer learning to build the enterprise credit scoring model for small and medium-sized micro-enterprises, the target domain data and the source domain data need to be split after feature derivation is completed, so that the target variable of the target domain data can be predicted from the source domain data. Therefore, as shown in fig. 1, the method for evaluating credit of a small and medium-sized micro enterprise provided in this embodiment further includes:

Step 2) splitting the merged data after feature derivation to obtain sample-expanded target domain data and source domain data; in this embodiment, after the split there are 36558 source domain samples, of which 603 are default enterprises and 35955 are normal enterprises, and 1695 target domain samples, of which 18 are default enterprises and 1677 are normal enterprises.
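The split can be driven by the industry code, following the selection rule of this embodiment: code 3525 marks the target domain, and any other code whose first two digits lie between 13 and 43 marks the source domain. The sketch assumes four-digit codes stored as strings; the function name and the sample codes other than 3525 are hypothetical.

```python
def domain_of(industry_code, target_code="3525"):
    """Assign a sample to the target domain, the source domain, or neither."""
    if industry_code == target_code:
        return "target"                      # mold industry
    if 13 <= int(industry_code[:2]) <= 43:
        return "source"                      # other manufacturing
    return None                              # outside manufacturing

domains = [domain_of(code) for code in ["3525", "3311", "1310", "5210"]]

# The reported counts are consistent with the totals after invalid-sample deletion.
source_total = 603 + 35955
target_total = 18 + 1677
```

With only 18 default enterprises in the target domain, training directly on the target data would be unreliable, which is why the model is trained on the source domain instead.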

Step 3) performing characteristic binning and screening dimension reduction processing on the source domain data obtained by splitting; the characteristic binning is to discretize continuous variables so as to facilitate model iteration; the screening and dimension reduction aims to reduce the sample data size so as to improve the model training speed. It should be emphasized that, because the target domain data itself is insufficient and scarce in terms of samples, feature binning and feature screening are not performed on the target domain data, and objects of feature binning and dimension reduction in screening are both source domain data.

In the embodiment, a supervised decision tree binning method is selected to perform feature binning on the variables, the principle is to construct a tree model on the variables, and the binning mode fully considers data distribution and label information and is more favorable for subsequent model training. The specific method steps for performing feature binning on source domain data are shown in fig. 3, and include:

step 3.1) placing the null values of each variable in the source domain data into a separate bin, then performing feature binning on the non-null part of each variable, and calculating the WOE value and the single-bin IV value of each bin i by the following formula (1) and formula (2), respectively:

$$\mathrm{WOE}_i = \ln\left(\frac{\#B_i / \#B_T}{\#G_i / \#G_T}\right) \qquad (1)$$

in formula (1), WOE_i represents the WOE value corresponding to bin i;

#B_i is the number of default enterprises (label 1) in bin i;

#G_i is the number of normal enterprises (label 0) in bin i;

#B_T is the number of all default enterprises;

#G_T is the number of all normal enterprises;

the larger the proportion of default enterprises in bin i, the larger the value of WOE_i; the value range of WOE_i is (-∞, +∞);

$$\mathrm{IV}_i = \left(\frac{\#B_i}{\#B_T} - \frac{\#G_i}{\#G_T}\right) \times \mathrm{WOE}_i \qquad (2)$$

in formula (2), IV_i represents the IV value corresponding to bin i;

the greater the difference between the share of all default enterprises that falls in bin i and the share of all normal enterprises that falls in bin i, the larger the value of IV_i. The IV value of a variable is the sum of the IV values of all its bins, calculated as follows:

$$\mathrm{IV} = \sum_{i=1}^{n} \mathrm{IV}_i$$

n is the number of bins of the variable;
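As a worked illustration of the WOE and IV quantities defined above, the following sketch assumes the standard definitions WOE_i = ln((#B_i/#B_T)/(#G_i/#G_T)) and IV_i = (#B_i/#B_T - #G_i/#G_T) × WOE_i. The function name and the `eps` smoothing count for empty cells are assumptions added for the example and are not described in the text.

```python
import math
from typing import List, Sequence, Tuple


def woe_iv(bins: Sequence[Tuple[int, int]], eps: float = 0.5):
    """Compute per-bin WOE, per-bin IV and the variable's total IV.

    bins: one (n_default, n_normal) pair per bin, i.e. (#B_i, #G_i).
    eps:  hypothetical smoothing count used only when a cell is empty,
          so that the logarithm stays finite.
    """
    b_total = sum(b for b, _ in bins)   # #B_T
    g_total = sum(g for _, g in bins)   # #G_T
    woes: List[float] = []
    ivs: List[float] = []
    for b, g in bins:
        bad_share = (b if b > 0 else eps) / b_total
        good_share = (g if g > 0 else eps) / g_total
        woe = math.log(bad_share / good_share)
        woes.append(woe)
        ivs.append((bad_share - good_share) * woe)
    return woes, ivs, sum(ivs)


if __name__ == "__main__":
    # Two bins: the first is richer in default enterprises than the second.
    woes, ivs, total_iv = woe_iv([(30, 70), (10, 90)])
    print(woes, ivs, total_iv)
```

A bin whose default share exceeds its normal share gets a positive WOE, a bin with the opposite imbalance gets a negative WOE, and every bin contributes a non-negative IV because the two factors in IV_i always share the same sign.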

it should be noted that the present invention does not perform feature binning on non-null discrete variables; such variables are processed, for example, as follows:

if the variable is an ordered discrete variable, for example education level (primary school, junior high school, senior high school, undergraduate), its values can be converted to 1, 2, 3 and 4, respectively;

if the variable is an unordered discrete variable, for example gender (male, female), its values are converted to 1 and 0, respectively.

The method for performing feature binning on non-empty continuous variables is shown in fig. 4, and specifically includes the following steps:

step 3.11) segmenting the continuous variable (the specific segmentation mode is not explained here) to obtain a plurality of candidate split points, traversing each split point, and calculating the IV value of the continuous variable after feature binning with that split point as the bin boundary;

step 3.12) calculating the increment c of the IV value of the continuous variable after splitting at the split point, relative to its IV value before the split;

step 3.13) judging whether the increment c is greater than a preset increment threshold (preferably 0.01),

if yes, the non-null part of the continuous variable is binned with the split point as a bin boundary;

if not, the non-null part of the continuous variable is not binned at that split point.

It should be noted that, when the split points are determined, the number of samples in each bin also needs to be considered: preferably, the number of samples in each bin is not less than 20% of the non-null data volume of the source domain data, and each bin contains at least one positive sample (normal enterprise data) and at least one negative sample (default enterprise data).
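Steps 3.11) to 3.13) and the bin-size constraints can be sketched for a single split of one continuous variable as follows. This is an illustrative sketch under stated assumptions, not the patented procedure: candidate split points are taken as midpoints between consecutive distinct values (the text leaves the segmentation mode open), the IV before the first split is taken as 0 because a single bin carries no information, and all function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def iv_of_bins(bins: Sequence[Tuple[int, int]]) -> float:
    """Total IV of (n_default, n_normal) bins; assumes no empty cell."""
    b_t = sum(b for b, _ in bins)
    g_t = sum(g for _, g in bins)
    total = 0.0
    for b, g in bins:
        bad, good = b / b_t, g / g_t
        total += (bad - good) * math.log(bad / good)
    return total


def count_labels(labels: List[int]) -> Tuple[int, int]:
    """Return (n_default, n_normal) for labels where 1 = default, 0 = normal."""
    n_default = sum(labels)
    return n_default, len(labels) - n_default


def best_split(values: Sequence[float], labels: Sequence[int],
               iv_before: float = 0.0, min_share: float = 0.20,
               min_gain: float = 0.01) -> Optional[Tuple[float, float]]:
    """Return (split_point, IV increment c) of the best admissible split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no boundary between equal values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        if min(len(left), len(right)) < min_share * n:
            continue  # each bin must hold at least 20% of the non-null data
        l_bin, r_bin = count_labels(left), count_labels(right)
        if 0 in l_bin or 0 in r_bin:
            continue  # each bin needs at least one default and one normal sample
        gain = iv_of_bins([l_bin, r_bin]) - iv_before  # increment c of step 3.12)
        if gain > min_gain and (best is None or gain > best[1]):
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, gain)
    return best
```

Further bins would be obtained by repeating the same search inside each resulting bin, passing the IV of the current binning as `iv_before`; this recursion is omitted here to keep the sketch short.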

During feature binning, the WOE value and the IV value of each variable in each bin are obtained. Before WOE encoding is performed, the reasonableness of the WOE values of each variable needs to be evaluated; the evaluation criterion is whether the WOE value changes monotonically or in a U-shaped trend as the value of the variable increases. For variables that do not meet this evaluation criterion, the feature binning needs to be adjusted. Therefore, with continued reference to fig. 3, the method for performing feature binning on source domain data further includes:

step 3.2) judging whether each variable needs feature binning adjustment,

if yes, performing feature binning adjustment on the variable;

if not, completing the feature binning of the variable.

The method for judging whether a variable needs feature binning adjustment comprises the following step:

judging whether the WOE values of the variable across its bins change monotonically or in a U-shaped trend as the value of the variable increases; if so, the variable does not need feature binning adjustment; otherwise, the variable needs feature binning adjustment.

In feature binning adjustment, the minimum number of samples per bin is changed: it is gradually increased, starting from 10% of the non-null data volume of the source domain data, until the trend of the WOE values meets the evaluation criterion of "changing monotonically or in a U-shaped trend as the value of the variable increases", at which point the binning adjustment of the variable is complete. If the minimum number of samples has been increased to 50% of the non-null data volume and the trend of the WOE values across the bins still does not meet the above evaluation criterion, the variable is rejected.
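The trend check and the adjustment loop can be sketched as follows. Several points are assumptions made for the example: an inverted U is accepted alongside a U, the minimum bin share is raised in steps of 5 percentage points (the text only says "gradually"), and `rebin` is a hypothetical callback that re-bins the variable for a given minimum share and returns its WOE values in bin order.

```python
from typing import Callable, List, Optional, Sequence


def _non_decreasing(seq: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(seq, seq[1:]))


def _non_increasing(seq: Sequence[float]) -> bool:
    return all(a >= b for a, b in zip(seq, seq[1:]))


def woe_trend_ok(woes: Sequence[float]) -> bool:
    """True if WOE is monotonic or changes direction only once (U or inverted U)."""
    if _non_decreasing(woes) or _non_increasing(woes):
        return True
    for t in range(1, len(woes) - 1):
        left, right = woes[:t + 1], woes[t:]
        if _non_increasing(left) and _non_decreasing(right):
            return True  # U shape
        if _non_decreasing(left) and _non_increasing(right):
            return True  # inverted U (assumed to be acceptable as well)
    return False


def adjust_binning(rebin: Callable[[float], List[float]],
                   start_pct: int = 10, stop_pct: int = 50,
                   step_pct: int = 5) -> Optional[float]:
    """Raise the minimum bin share until the WOE trend is acceptable.

    Returns the first minimum share that passes the check, or None when the
    variable should be rejected because even the 50% setting fails.
    """
    for pct in range(start_pct, stop_pct + 1, step_pct):
        min_share = pct / 100
        if woe_trend_ok(rebin(min_share)):
            return min_share
    return None
```

Integer percentages are used for the loop counter so that the 10% to 50% range is traversed without floating-point drift.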

After the above feature binning and WOE encoding, 7 variables whose WOE values did not meet the evaluation criterion were removed, and 49 variables remained. Each remaining variable has at least 2 and at most 6 bins.

After feature derivation and feature binning, the samples still have many usable variables (49 predictor variables), and these variables are not necessarily all suitable for subsequent model construction, so further screening and dimension reduction is required. In this embodiment, the IV value method and the correlation method are used for dimension reduction. The IV value measures the amount of information carried by a variable and can be used to represent its predictive power: the larger the IV value, the stronger the variable's ability to distinguish good customers from bad ones. The IV value dimension reduction method is as follows: an IV threshold is set (preferably 0.02), variables whose IV value is greater than the IV threshold are retained, and variables whose IV value is less than the IV threshold are eliminated. With the IV value dimension reduction method, 11 variables with too low an IV value are removed, reducing the 49 predictor variables to 38.

The correlation dimension reduction method comprises the following steps:

In logistic regression, the model estimates are distorted if the independent variables (model input variables) are strongly correlated. To avoid this as far as possible, strongly correlated independent variables are eliminated during variable screening. The invention sets a correlation threshold (preferably 0.7) and judges whether the correlation between variables is greater than the correlation threshold (the specific judgment method is not explained here); if yes, variables whose correlation exceeds the threshold are eliminated, and if not, the variables are retained. Through this correlation-based screening, 9 highly correlated variables are eliminated, and 29 variables finally remain.
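The two screening passes can be sketched together as follows. The thresholds 0.02 and 0.7 come from the text; everything else is an assumption made for illustration, since the text does not explain how correlation is judged: here the Pearson coefficient on WOE-encoded columns is used, and within a correlated pair the variable with the higher IV is kept.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def screen_variables(columns: Dict[str, List[float]], iv: Dict[str, float],
                     iv_threshold: float = 0.02,
                     corr_threshold: float = 0.7) -> List[str]:
    """Return the variable names that survive IV and correlation screening."""
    # Pass 1: keep variables whose IV exceeds the IV threshold.
    kept = [name for name in columns if iv[name] > iv_threshold]
    # Pass 2: visit variables from highest to lowest IV and keep one only if
    # it is not strongly correlated with any variable already selected.
    kept.sort(key=lambda name: iv[name], reverse=True)
    selected: List[str] = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= corr_threshold
               for s in selected):
            selected.append(name)
    return selected
```

Visiting variables in descending IV order is a design choice that makes the outcome deterministic and biases the selection toward the more informative member of each correlated group.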

Referring to fig. 1, the method for evaluating credit of a small and medium-sized micro enterprise provided in this embodiment further includes:

step 4) transferring the source domain data that has undergone feature binning and screening-based dimension reduction to the target domain data obtained by splitting in step 2), and training an enterprise credit scoring model by taking the source domain data and the target domain data as the training set and the test set, respectively;

step 5) inputting the acquired original target domain data into the enterprise credit scoring model, and predicting and outputting the credit scores of the small and medium-sized micro enterprises.

The training method of the enterprise credit scoring model is briefly described as follows:

Under the transfer learning framework, the source domain data that has undergone transfer learning processing is taken as the training set, the target domain data obtained by splitting after feature derivation is taken as the test set, the 29 finally screened model-input features (variables) are applied, the model basis function is set to a Gaussian kernel function, the optimal model parameters are determined by the LCV method, and the final transfer sample weights are obtained from the optimal parameters.

The process by which the LCV method selects the optimal model is briefly described as follows:

LCV (Likelihood Cross Validation) is a way of selecting a model through cross validation. LCV uses the value of the likelihood function as the model selection criterion and selects the model with the largest likelihood function value as the optimal model.

In the KLIEP transfer learning method, the objective function can be expressed as:

$$J = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \log \hat{w}\left(x_j^{te}\right), \qquad \hat{w}(x) = \sum_{l} \alpha_l \varphi_l(x)$$

where $x_j^{te}$ are the target domain samples, $n_{te}$ is their number, $\alpha_l$ are the weight parameters and $\varphi_l$ are the basis functions. Since the optimization problem is convex (J is concave in the weight parameters), an optimal set of weight parameters $\hat{\alpha}_l$ can be found by maximizing J, and the estimated weight is then obtained through $\hat{w}(x) = \sum_{l} \hat{\alpha}_l \varphi_l(x)$. In this process, the weight estimate is strongly influenced by the basis functions $\varphi_l(x)$. When a Gaussian kernel function is used as the basis function, the parameter that determines the kernel function is the window width parameter σ. Different values of σ lead to different optimal weight parameters $\hat{\alpha}_l$; therefore, cross validation is required to select the most appropriate hyper-parameter σ.

Based on the idea of LCV, J is regarded as a likelihood function, and the model with the largest J is selected as the optimal model. In the cross-validation design, (1) the target domain data is split into R disjoint subsets, of which R-1 subsets are used for weight parameter training, yielding $\hat{\alpha}_l$; (2) the remaining subset $X_r^{te}$ is used to calculate the likelihood function value, with the calculation formula:

$$\hat{J}_r = \frac{1}{\lvert X_r^{te} \rvert} \sum_{x \in X_r^{te}} \log \hat{w}_r(x)$$

The process is repeated over the R subsets and the mean of the likelihood function values is computed. Finally, the results under different hyper-parameters σ are compared, and the hyper-parameter with the largest mean likelihood function value is selected as the hyper-parameter of the optimal model.

In addition, to address the label imbalance of the samples, a cost-sensitive method is used in model training: the distribution of the original training data is adjusted through sample weights so that a better model is obtained. The sample weights are determined by the sample imbalance ratio. In the training set, the ratio of normal enterprises to default enterprises is 35955:603; the weights are scaled so that they average 1 over the training set (which is convenient for model calculation), so the sample weight of a normal enterprise is set to 0.51 and the sample weight of a default enterprise is set to 30.31.
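The text does not state the formula behind the two weights. One assumption that reproduces both quoted values is the class-balancing rule weight = N / (2 × n_class), where N is the training set size and n_class the number of samples with that label; this is the same heuristic used by the "balanced" class-weight option found in common machine learning libraries. The sketch below checks the arithmetic under that assumption.

```python
from typing import Dict


def balanced_class_weights(n_normal: int, n_default: int) -> Dict[int, float]:
    """Assumed rule: weight of a class = N / (2 * class count).

    Label 0 = normal enterprise, label 1 = default enterprise.
    """
    total = n_normal + n_default
    return {0: total / (2 * n_normal), 1: total / (2 * n_default)}


if __name__ == "__main__":
    # Training-set counts quoted in the text.
    n_normal, n_default = 35955, 603
    w = balanced_class_weights(n_normal, n_default)
    mean_weight = (n_normal * w[0] + n_default * w[1]) / (n_normal + n_default)
    print(round(w[0], 2), round(w[1], 2), mean_weight)
```

Under this rule each class carries half of the total weight, and the weights average exactly 1 over the training set, so the effective sample size seen by the model is unchanged.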

A Logistic model is then constructed from the 29 finally screened variables and the transfer sample weights. During model construction, some strongly correlated variables may still be present and invalidate the model, so the variables need to be screened further. The screening method considers the correlation and the IV value of the variables together: within each group of highly correlated variables, the variables with higher IV values are selected and those with lower IV values are discarded. In the end 12 variables are discarded, and the remaining 17 variables are used as the input variables for model training.

Fig. 5 shows the distribution of the scores that the enterprise credit scoring model obtained through transfer learning assigns to non-default and default enterprises (fig. 5a shows the score distribution of non-default enterprises, and fig. 5b shows the score distribution of default enterprises). As can be seen from the scoring results, the scores of the non-default customer group are mostly concentrated above 580 points, whereas the scores of the default customer group are generally low, mostly below 580 points, which indicates that the enterprise credit scoring model trained by the invention has good predictive performance.

After the model training is finished, the performance of the model needs to be evaluated, and the model can be practically applied after the evaluation result reaches the standard. The method for evaluating the model performance of the enterprise credit scoring model is briefly described as follows:

By combining the credit scores output by the model with the actual default status of the enterprises in the validation customer group (target domain data), the values of the relevant model evaluation indices can be calculated; for example, the fitting capability of the model can be evaluated by the ROC curve and the AUC value, and its ability to distinguish good customers from bad ones can be evaluated by the KS index. By calculation, the enterprise credit scoring model trained by the invention has an AUC value of 0.85 and a KS value of 0.56.
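For reference, the KS index can be computed as the largest gap between the cumulative share of default enterprises and the cumulative share of normal enterprises when samples are ranked by predicted default probability. The sketch below is a minimal illustration; the function name and the choice to also return the score at which the gap is largest are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def ks_statistic(scores: Sequence[float],
                 labels: Sequence[int]) -> Tuple[float, Optional[float]]:
    """Return (KS value, score at which the maximum gap occurs).

    scores: predicted default probabilities.
    labels: 1 = default enterprise, 0 = normal enterprise.
    """
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    pairs = sorted(zip(scores, labels), reverse=True)  # highest risk first
    cum_bad = cum_good = 0
    best_ks, best_score = 0.0, None
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Evaluate the gap only after the last sample of a tied score group.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        gap = abs(cum_bad / n_bad - cum_good / n_good)
        if gap > best_ks:
            best_ks, best_score = gap, score
    return best_ks, best_score
```

The returned score can serve as a candidate classification threshold: samples at or above it are predicted as default, which is the threshold-selection idea used with the KS value in this embodiment.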

Secondly, considering the label imbalance of the original data, the KS value is further used to select an optimal threshold on the prediction probability. Specifically, the prediction probability is segmented, the difference between the cumulative proportion of good samples and the cumulative proportion of bad samples in each interval is calculated, and the prediction probability corresponding to the maximum difference (namely the KS value) is selected as the optimal threshold (for example, 0.75). The confusion matrix of the Logistic model on the test set at this threshold is shown in table a below; taking default enterprises as the positive class, the precision of the Logistic model is 11.0% (11/100) and the recall is 61.1% (11/18).

Threshold = 0.75	Predicted default enterprise	Predicted normal enterprise
Actual default enterprise	11	7
Actual normal enterprise	89	1588

Table a
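The figures derived from table a can be checked with a few lines of arithmetic. The sketch assumes the usual definitions with default enterprises as the positive class: precision = TP / (TP + FP) and recall = TP / (TP + FN); the variable names are chosen for the example.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall for the positive (default enterprise) class."""
    return tp / (tp + fp), tp / (tp + fn)


# Counts read from table a.
TP, FN = 11, 7       # actual default: predicted default / predicted normal
FP, TN = 89, 1588    # actual normal:  predicted default / predicted normal

precision, recall = precision_recall(TP, FP, FN)
accuracy = (TP + TN) / (TP + FN + FP + TN)

if __name__ == "__main__":
    print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Of the 100 enterprises predicted to default, 11 actually defaulted, and of the 18 actual defaults, 11 were caught, so the threshold favours recall of the scarce default class at the cost of many false alarms.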

The Logistic regression model and the transfer learning technique are briefly described below:

Logistic regression model

In traditional credit score modeling, the target variable Y represents the default behavior of the borrower: if default occurs, Y is usually marked as 1, otherwise Y is marked as 0, so the credit scoring problem can be regarded as a binary classification problem.

The most commonly used credit scoring model is the Logistic regression model, which does not model Y directly but models the default probability p(X) = Pr(Y = 1 | X), where X is the values of the independent variables of each borrower. When the default probability is greater than a certain critical value, the borrower can be marked as default.

The default probability takes any continuous value between 0 and 1, i.e., 0 ≤ p(X) ≤ 1. To satisfy this constraint, the Logistic function is used for modeling, as in formula (3) below, where β contains the intercept and the coefficient of each X variable:

$$p(X) = \frac{e^{\beta X}}{1 + e^{\beta X}} \qquad (3)$$

after formula (3) is transformed, the following formula (4) is obtained:

$$\frac{p(X)}{1 - p(X)} = e^{\beta X} \qquad (4)$$

in formula (4), $\frac{p(X)}{1 - p(X)}$ is called the odds, which can be understood intuitively as the ratio of the default probability to the non-default probability; its value lies between 0 and ∞, and the larger the value, the larger the default probability. Taking the logarithm of both sides of formula (4) gives formula (5):

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta X \qquad (5)$$

the greatest advantage of the Logistic regression model is that its output is interpretable, which is why it is widely used in scorecard models. To calculate the coefficients of the regression model, the maximum likelihood method is considered; the log-likelihood function adopted by the invention is expressed by the following formula (6):

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \ln p(x_i) + (1 - y_i) \ln\left(1 - p(x_i)\right) \right] \qquad (6)$$

in formula (6), y_i represents the actual value of the target variable for sample i;

i represents a sample;

n represents the total number of samples;

according to formula (6), the estimate of the model coefficients can be expressed by the following formula (7):

$$\hat{\beta} = \arg\max_{\beta} \; \ell(\beta) \qquad (7)$$

according to the convex optimization theory, the optimal solution of the model coefficient can be obtained by a classical numerical optimization algorithm such as a gradient descent method, a Newton method and the like.
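As a minimal sketch of the numerical optimization mentioned above, the following code maximizes a (weighted) log-likelihood of the Logistic model by plain gradient ascent. It is illustrative only: the learning rate, the iteration count and the optional `sample_weight` argument (a hypothetical hook through which cost-sensitive or transfer sample weights could enter) are assumptions, and a production model would normally rely on an established solver.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(beta: Sequence[float], x: Sequence[float]) -> float:
    """Default probability p(x); beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 sample_weight: Optional[Sequence[float]] = None,
                 lr: float = 0.1, n_iter: int = 2000) -> List[float]:
    """Gradient ascent on the weighted log-likelihood; returns beta."""
    n, d = len(X), len(X[0])
    w = list(sample_weight) if sample_weight is not None else [1.0] * n
    w_sum = sum(w)
    beta = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for xi, yi, wi in zip(X, y, w):
            err = wi * (yi - predict_proba(beta, xi))  # weighted residual
            grad[0] += err
            for k, v in enumerate(xi):
                grad[k + 1] += err * v
        # Step along the gradient, averaged over the total sample weight.
        beta = [b + lr * g / w_sum for b, g in zip(beta, grad)]
    return beta
```

The gradient of the log-likelihood with respect to each coefficient is the sum of residuals (y_i minus the predicted probability) multiplied by the corresponding input, which is why no explicit derivative of the logistic function appears in the update.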

Transfer learning

In the invention, transfer learning takes the data of the industry category ("door" industry) as the source domain and the data of the subclass industry as the target domain, and transfers the information in the source domain data to the target domain, thereby solving the problem that the target domain has too little data for the model to be trained stably. The invention preferably adopts the KLIEP (Kullback-Leibler Importance Estimation Procedure) transfer learning technique to build an enterprise credit scoring model suitable for the subclass industry.

KLIEP is a transfer learning method based on sample weight, and comprises three steps of weight problem transformation, hyper-parameter selection and model weight solving.

1. Weight problem transformation

The sample weight is decomposed into a sum of products of weight parameters and basis functions:

$$\hat{w}(x) = \sum_{l} \alpha_l \varphi_l(x)$$

The basis functions are set to Gaussian kernel functions centred on the target domain data:

$$\varphi_l(x) = \exp\left(-\frac{\lVert x - x_l^{te} \rVert^2}{2\sigma^2}\right)$$

where $x_l^{te}$ is target domain data and σ is the parameter of the Gaussian function, whereby the problem of solving the weights is transformed into the problem of solving the weight parameters $\alpha_l$.

2. Hyper-parameter selection

A suitable range is set for the hyper-parameter σ, and the optimal hyper-parameter is selected using the likelihood cross validation (LCV) technique. The target domain data is split into R disjoint subsets; the weight parameters are estimated on R-1 of these subsets, and the objective function of the estimation is expressed by the following formulas (8) and (9):

$$\max_{\alpha} \; \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \log\left(\sum_{l} \alpha_l \varphi_l\left(x_j^{te}\right)\right) \qquad (8)$$

$$\text{s.t.} \quad \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \sum_{l} \alpha_l \varphi_l\left(x_i^{tr}\right) = 1, \qquad \alpha_l \ge 0 \qquad (9)$$

in formula (8) and formula (9),

$x_j^{te}$ is the target domain data in the R-1 subsets;

$x_i^{tr}$ is the source domain data;

$\varphi_l$ is the l-th Gaussian basis function and $\alpha_l$ is its weight parameter;

te (abbreviation of test) represents the target domain data (i.e., the mold industry data in the application case);

n_te represents the number of target domain samples;

tr (abbreviation of train) represents the source domain data, i.e., the manufacturing industry data other than the mold industry in the application case;

n_tr represents the number of source domain samples.

Then the remaining subset of target domain data and the estimated weight parameters α_l are used to calculate the likelihood score; after all R subsets have been traversed, the mean of the likelihood scores is obtained. The likelihood score is expressed by the following formula (10):

$$\hat{J}_r = \frac{1}{\lvert X_r^{te} \rvert} \sum_{x \in X_r^{te}} \log \hat{w}_r(x) \qquad (10)$$

in formula (10),

$X_r^{te}$ represents the r-th of the R equal parts into which the target domain samples are divided;

$\hat{w}_r(x)$ represents the weight of a target domain sample in the r-th part, wherein the weight parameters $\hat{\alpha}_l$ are estimated from the other R-1 parts;

$\hat{J}_r$ represents the likelihood function value (objective function value) estimated on the r-th part.

The mean likelihood score $\hat{J}$ is calculated by the following formula (11):

$$\hat{J} = \frac{1}{R} \sum_{r=1}^{R} \hat{J}_r \qquad (11)$$

The mean likelihood score $\hat{J}$ is taken as the cross-validation result; by comparing the cross-validation results under different hyper-parameters, the hyper-parameter with the largest mean likelihood score is selected as the model hyper-parameter.

3. Model weight solving

The model weight parameters are then estimated from the selected optimal hyper-parameter, the target domain data and the source domain data, to obtain the optimal sample weights. In the stage of training the classification prediction model, these sample weights are added to model training, which completes the transfer learning step of the model.
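The three KLIEP steps can be sketched in pure Python as follows. This is a simplified illustration based on the commonly published form of KLIEP (maximize the mean log weight on target data, subject to the weights averaging 1 on source data and non-negative parameters), solved here by projected gradient ascent. The step size, iteration count, fold assignment, numerical floor `EPS` and all function names are assumptions for the example, and candidate σ values are assumed large enough for the source data to overlap the kernel centres.

```python
import math
from typing import List, Sequence

Point = Sequence[float]
EPS = 1e-12  # numerical floor to keep divisions and logarithms finite


def gauss(x: Point, c: Point, sigma: float) -> float:
    """Gaussian kernel basis function centred on c."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, c))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def fit_kliep(x_tr: List[Point], x_te: List[Point], centers: List[Point],
              sigma: float, lr: float = 0.01, n_iter: int = 200) -> List[float]:
    """Estimate the weight parameters alpha by projected gradient ascent."""
    A = [[gauss(x, c, sigma) for c in centers] for x in x_te]
    b = [sum(gauss(x, c, sigma) for x in x_tr) / len(x_tr) for c in centers]
    bb = sum(v * v for v in b)

    def project(a: List[float]) -> List[float]:
        # Restore the equality constraint, clip to alpha >= 0, then rescale
        # so that the weights average exactly 1 over the source data.
        slack = 1.0 - sum(ai * bi for ai, bi in zip(a, b))
        a = [max(0.0, ai + slack * bi / bb) for ai, bi in zip(a, b)]
        norm = sum(ai * bi for ai, bi in zip(a, b))
        return [ai / norm for ai in a]

    alpha = project([1.0] * len(centers))
    for _ in range(n_iter):
        w = [max(sum(ai * v for ai, v in zip(alpha, row)), EPS) for row in A]
        grad = [sum(row[l] / wj for row, wj in zip(A, w)) / len(A)
                for l in range(len(centers))]
        alpha = project([ai + lr * g for ai, g in zip(alpha, grad)])
    return alpha


def weight(x: Point, alpha: List[float], centers: List[Point], sigma: float) -> float:
    """Estimated sample weight w(x) = sum_l alpha_l * phi_l(x)."""
    return sum(a * gauss(x, c, sigma) for a, c in zip(alpha, centers))


def lcv_score(x_tr: List[Point], x_te: List[Point], sigma: float,
              n_folds: int = 3) -> float:
    """Mean held-out log weight over the folds of the target data."""
    folds = [x_te[r::n_folds] for r in range(n_folds)]
    scores = []
    for r, held_out in enumerate(folds):
        train = [x for k, f in enumerate(folds) if k != r for x in f]
        alpha = fit_kliep(x_tr, train, train, sigma)
        scores.append(sum(math.log(max(weight(x, alpha, train, sigma), EPS))
                          for x in held_out) / len(held_out))
    return sum(scores) / n_folds


def select_sigma(x_tr: List[Point], x_te: List[Point],
                 candidates: Sequence[float]) -> float:
    """Pick the kernel width with the largest mean likelihood score."""
    return max(candidates, key=lambda s: lcv_score(x_tr, x_te, s))
```

After `select_sigma` has chosen the kernel width, `fit_kliep` is run once more on all target domain data and `weight` is evaluated on every source domain sample; the resulting values are the sample weights that would be passed to the classifier during training.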

It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims

1. A method for evaluating credit of small and medium-sized micro enterprises based on sample transfer learning is characterized by comprising the following specific steps:

2. The method for credit evaluation of small and medium-sized micro-enterprises according to claim 1, wherein in step 1), the data cleaning of the merged data comprises missing value processing and/or abnormal value processing of the merged data.

3. The medium and small micro enterprise credit evaluation method according to claim 1, wherein in step 1), the characteristic derivation of the merged data comprises characteristic derivation of any one or more of a record-type characteristic, a statistical-type characteristic and other types of characteristics except the record-type characteristic and the statistical-type characteristic;

4. The method for evaluating credit of small and medium-sized micro-enterprises according to claim 1, wherein after the characteristics of step 1) are derived, invalid samples need to be detected and deleted, and the method for detecting and deleting the invalid samples comprises the following steps:

if so, the sample is regarded as an invalid sample and deleted;

if not, the sample is retained.

5. The method for credit evaluation of small and medium-sized micro-enterprises according to claim 1, wherein in the step 3), the method for performing feature binning on the source domain data specifically comprises the following steps:

in formula (1), WOE_i represents the WOE value corresponding to bin i;

#B_i is the number of default enterprises in bin i;

#G_i is the number of normal enterprises in bin i;

#B_T is the number of all default enterprises;

#G_T is the number of all normal enterprises;

in formula (2), IV_i represents the IV value corresponding to bin i;

if yes, performing characteristic box-dividing adjustment on the variable;

and if not, completing the characteristic box separation of the variables.

6. The medium and small micro enterprise credit evaluation method of claim 5, wherein in the step 3.1), the method for performing feature binning on non-empty continuous variables comprises the following steps:

7. The medium-small micro enterprise credit evaluation method according to claim 6, wherein the increment threshold is 0.01.

8. The method of claim 5, wherein the number of samples in each bin i is not less than 20% of the amount of non-null data in the source domain data, and each bin i comprises at least one positive sample and one negative sample.

9. The method for evaluating the credit of the small and medium-sized micro enterprises according to claim 5, wherein in the step 3.2), the method for judging whether each variable needs to be subjected to characteristic binning adjustment comprises the following steps:

10. The method according to claim 5, wherein in the process of performing binning adjustment, if the amount of non-empty data in the bin i gradually increases from 10% to 50%, and the WOE value of the variable corresponding to the bin i still does not show monotonic change or U-shaped change trend along with the increase of the value of the variable, the variable is discarded.

11. The method for evaluating the credit of the small and medium-sized enterprises according to claim 1 or 5, wherein the step 3) comprises the steps of:

if yes, the variable is reserved;

if not, the variable is eliminated.

12. The method for credit evaluation of small and medium-sized enterprises according to claim 11, wherein in the step 3), the method for screening and reducing dimensions of each variable comprises the following steps:

if not, the variable is reserved.

13. The method according to claim 1, wherein in step 4), the enterprise credit scoring model is obtained by training using KLIEP transfer learning technique.