CN114022269A - Enterprise credit risk assessment method in public credit field - Google Patents

Info

Publication number
CN114022269A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111260166.3A
Other languages
Chinese (zh)
Inventor
程亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Weizhi Technology Co ltd
Original Assignee
Jiangsu Weizhi Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud

Abstract

The invention provides a credit risk assessment method for enterprises in the public credit field. The method constructs an enterprise credit assessment model from historical data of enterprises in the public credit field and uses it to obtain a credit risk assessment result for an enterprise in that field. In the technical scheme of the invention, historical data of enterprises in the public credit field is used as basic sample data. The numerical and non-numerical characteristic variables in this data are each subjected to variable binning, followed by a variable screening operation. A predictive mathematical model, namely a logistic regression model, is then fitted by regression to estimate the probability that an enterprise poses a serious dishonesty risk. Finally, a credit risk scoring card for enterprises in the public credit field is established and used to evaluate the credit risk of the enterprise.

Description

Enterprise credit risk assessment method in public credit field
Technical Field
The invention relates to the technical field of enterprise credit risk assessment, in particular to an enterprise credit risk assessment method in the field of public credit.
Background
In a market economy, the credit of an enterprise is becoming increasingly important in the environment in which it operates. However, existing credit assessment methods are mostly built on the financial data of the enterprise, such as its operating capital, revenue, profit margin, and profit growth rate. Evaluating an enterprise only in financial terms yields a result that mainly serves the needs of investors.
At present, the state is promoting the construction of a social credit system, and local governments are researching and exploring enterprise credit evaluation so that enterprises can be better supervised according to their credit condition. However, most basic information data of an enterprise in the public credit field is non-numerical, and the evaluation dimensions of the public credit field differ completely from those of the financial field, so existing risk assessment methods based on financial data cannot be used directly. How to evaluate the credit of an enterprise in the public credit field is therefore a problem that urgently needs to be solved.
Disclosure of Invention
To solve the problem that the prior art lacks a method for evaluating the credit of an enterprise in the public credit field, the invention provides an enterprise credit risk assessment method for the public credit field.
The technical scheme of the invention is as follows: an enterprise credit risk assessment method in the public credit field, characterized by comprising the following steps:
S1: collecting historical data of various types of enterprises in the public credit field as basic sample data;
the basic sample data comprises: enterprise basic information and enterprise public credit information;
the enterprise basic information is information describing the current state of the enterprise;
the enterprise public credit information is data describing the penalty records the enterprise has received in the public credit field;
S2: preprocessing all sample data in the basic sample data according to its type to obtain a sample data set;
dividing the sample data set into a training sample basic data set and a test data set;
S3: performing variable binning on the data in the training sample basic data set;
binning the non-numerical characteristic variables in the training sample basic data set using a direct binning method;
S4: performing a variable screening operation on the binned training sample basic data set to obtain a final training data set;
the variable screening operation comprises the following steps:
a1: calculating the information value of each characteristic variable in the training sample basic data set;
a2: deleting the characteristic variables whose information value is lower than a preset information value threshold;
a3: calculating the Pearson correlation coefficient of every pair of characteristic variables in the training sample basic data set, and removing characteristic variables from the training sample basic data set based on sample correlation;
a4: performing a logistic regression fit for each characteristic variable in the training sample basic data set, based on the independent variable and the dependent variable, and removing all characteristic variables that cannot significantly predict the dependent variable, to obtain the training data set;
S5: fitting a logistic regression model to the training data set, and removing the characteristic variables that do not meet a preset significance index, to obtain the credit risk assessment model of enterprises in the public credit field;
S6: constructing a credit risk scoring card for enterprises in the public credit field based on the parameters of the credit risk assessment model;
S7: obtaining the basic sample data of an enterprise to be evaluated as the sample data to be evaluated, inputting it into the credit risk scoring card for enterprises in the public credit field, obtaining the score of the interval to which the value of each characteristic variable belongs, and adding all the scores to obtain the final risk assessment score of the enterprise in the public credit field, which is the enterprise credit risk assessment result of the enterprise to be evaluated.
It is further characterized in that:
in step S1, based on the data characteristics of the basic sample data, the basic sample data is acquired as follows:
b1: the data obtained by directly collecting the enterprise basic information is put into the basic sample data to participate in calculation;
b2: the enterprise public credit information data is classified by the way its values are calculated, into: historical behavior record quantity variables and latest behavior record time variables;
the historical behavior record quantity variables comprise: the number of administrative penalty records, the number of judicial litigation records, the number of records of being placed on the dishonest judgment debtor list, the number of records of being placed on the abnormal operation list, and the number of business registration change records;
the latest behavior record time variables comprise: the date of the administrative penalty, the date of the litigation, the date of being placed on the dishonest judgment debtor list, the date of being placed on the abnormal operation list, and the date of the business registration change;
b3: for the historical behavior record quantity data, all public data is collected and accumulated by type, and each sum is used as the value of the corresponding characteristic variable and put into the basic sample data to participate in calculation;
b4: for the business registration change records, the public business registration data is first collected and then compared by time, and the most recent record time is used as the value of the corresponding characteristic variable and put into the basic sample data to participate in calculation;
the enterprise basic information comprises: date of establishment, registered capital, industry category, economic type, enterprise equity structure, number of affiliated companies, total outbound investment, total number of employees, and the enterprise's social security contribution base;
the enterprise public credit information comprises: the number of administrative penalty records, the number of judicial litigation records, the number of records of being placed on the dishonest judgment debtor list, the number of records of being placed on the abnormal operation list, the number of business registration change records, the date of the administrative penalty, the date of the litigation, the date of being placed on the dishonest judgment debtor list, the date of being placed on the abnormal operation list, and the date of the business registration change;
the preprocessing operation in step S2 comprises:
c1: for the data related to the enterprise basic information, counting the degree of missingness of the sample data for each characteristic variable;
if the degree of missingness of a characteristic variable in the basic sample data exceeds a preset missing threshold, deleting the sample data of that characteristic variable from the basic sample data;
otherwise, if the preset missing threshold is not exceeded, filling the corresponding missing sample values with -1;
c2: for the historical behavior record quantity variables and the latest behavior record time variables, finding the missing values of each characteristic variable, treating them as meaning that no such behavior record exists, and setting the corresponding missing sample values to 0;
in step S3, binning the numerical characteristic variables using the information-value optimal binning method comprises:
d1: sorting the characteristic variable to be processed by value and dividing it into intervals to obtain characteristic binning data;
when dividing the intervals, the number of variable values in each interval does not exceed a preset binning threshold;
d2: performing a count merge operation on all the characteristic binning data;
the count merge operation comprises the following steps:
d2.1: determining the number of samples in each characteristic binning data;
d2.2: finding all characteristic binning data whose number of samples is smaller than a preset sample threshold, and recording them as to-be-merged binning data;
d2.3: judging whether the to-be-merged binning data lies at either end of the sorted characteristic binning data;
if so, merging the to-be-merged binning data with its adjacent binning data;
otherwise, performing step d2.4;
d2.4: calculating the merging IV loss value of the to-be-merged binning data with the adjacent binning data on each side;
merging the to-be-merged binning data with the adjacent binning data on the side with the smaller merging IV loss value;
d3: performing a similar merge operation on the characteristic binning data;
the similar merge operation comprises the following steps:
d3-1: acquiring all characteristic binning data, and calculating the merging IV loss value between every two adjacent bins to obtain all merging IV loss values;
d3-2: finding the smallest merging IV loss value, and comparing it with a preset merging loss threshold;
merging the two corresponding characteristic binning data if the smallest merging IV loss value is less than the merging loss threshold;
d3-3: repeating d3-1 to d3-2 until all merging IV loss values are greater than the merging loss threshold;
d4: calculating a weight of evidence (WOE) value for each characteristic binning data, and replacing each characteristic binning data with its corresponding WOE value;
in step S3, after the similar merge operation has been performed on the characteristic binning data and before step d4 is performed, it is determined whether the characteristic binning data needs to keep a linear characteristic;
if the linear characteristic needs to be kept, a linear adjustment is performed; otherwise, step d4 is performed directly;
the linear adjustment comprises the following steps:
c1: calculating the weight of evidence value of each characteristic binning data;
c2: sorting all weight of evidence values in the order of their corresponding characteristic binning data;
c3: finding the weight of evidence values that do not follow the linear trend, and merging the corresponding characteristic binning data with adjacent binning data;
in step a3 of the variable screening operation, removing characteristic variables from the training sample basic data set based on sample correlation comprises the following steps:
a31: acquiring all characteristic variables and recording them as the variables whose correlation is to be calculated;
a32: calculating the Pearson correlation coefficient of every pair of these variables;
a33: finding all variable pairs whose Pearson correlation coefficient is larger than a preset correlation threshold, and removing from each pair the characteristic variable with the smaller information value;
a34: calculating the variance inflation factor of each remaining characteristic variable in the training sample basic data set;
a35: comparing every variance inflation factor with a preset variance inflation factor threshold;
if any characteristic variable has a variance inflation factor larger than the preset threshold, removing that variable, recording the remaining variables as the variables whose correlation is to be calculated, and repeating steps a34 to a35;
if the variance inflation factors of all characteristic variables are less than or equal to the threshold, performing step a4;
in step S5, the logistic regression model is fitted by backward stepwise regression;
in step S3, the numerical characteristic variables in the training sample basic data set are binned using the information-value optimal binning method.
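The correlation-based pruning of steps a31 to a33 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `pearson` and `prune_correlated`, the 0.8 correlation threshold, the sample data, and the use of the absolute value of the coefficient are all hypothetical choices. The variance inflation factor check of steps a34 and a35 is not covered.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_correlated(features, info_values, corr_threshold=0.8):
    """features: name -> list of values; info_values: name -> IV.

    For each highly correlated pair, drop the variable with the smaller
    information value (threshold and abs() are assumptions).
    """
    removed = set()
    for a, b in combinations(features, 2):
        if a in removed or b in removed:
            continue  # simplification: skip pairs involving a dropped variable
        if abs(pearson(features[a], features[b])) > corr_threshold:
            removed.add(a if info_values[a] < info_values[b] else b)
    return [name for name in features if name not in removed]

# Hypothetical data: x1 and x2 are almost collinear, x3 is not.
features = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [5, 1, 4, 2, 3],
}
info_values = {"x1": 0.3, "x2": 0.1, "x3": 0.2}
kept = prune_correlated(features, info_values)
print(kept)
```

Here `x2` is dropped because it is strongly correlated with `x1` and has the smaller information value.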
In the enterprise credit risk assessment method provided by the invention, historical data of enterprises in the public credit field is used as basic sample data. The numerical and non-numerical characteristic variables in this data are subjected to variable binning, followed by a variable screening operation. A predictive mathematical model, namely a logistic regression model, is then fitted to estimate the probability that an enterprise poses a serious dishonesty risk. Finally, a credit risk scoring card for enterprises in the public credit field is established to evaluate the credit risk of the enterprise. Because the technical scheme uses historical data in the public credit field as basic sample data, the evaluation of an enterprise's public credit rests on public data, and an evaluator can readily obtain the underlying data when needed, which makes the technical scheme practical. Processing the historical data in the public credit field ensures the stability and accuracy of the logistic regression model. With the scoring card, the evaluator can score an enterprise quickly and complete the evaluation of its public credit without complex calculation. The user can also see clearly which parameters in the scoring card carry the most weight in the risk assessment result, which makes the result of the risk assessment method easier to understand and further improves the practicality of the technical scheme.
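The scoring step S7, in which each characteristic variable is mapped to the score of its interval and the scores are summed, can be illustrated with a short Python sketch. Everything concrete here is hypothetical: the variable names, the bin edges, the points, and the `score` function are invented for illustration, and the way points are derived from the regression parameters in step S6 is not shown.

```python
import bisect

# Hypothetical scoring card. For each variable, "edges" holds the sorted
# upper bounds of all bins except the last, and "points" the score per bin.
SCORECARD = {
    # bins: (-inf, 0], (0, 2], (2, +inf)
    "administrative_penalty_count": {"edges": [0, 2], "points": [30, 10, -20]},
    # bins: (-inf, 3], (3, 10], (10, +inf)
    "years_since_establishment": {"edges": [3, 10], "points": [5, 15, 25]},
}

def score(sample):
    """Sum the points of the interval that each variable value falls into."""
    total = 0
    for var, card in SCORECARD.items():
        idx = bisect.bisect_left(card["edges"], sample[var])
        total += card["points"][idx]
    return total

sample = {"administrative_penalty_count": 1, "years_since_establishment": 12}
print(score(sample))  # 10 + 25 = 35
```

The lookup needs only a table and an addition, which matches the point made above that the evaluation can be completed without complex calculation.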
Drawings
FIG. 1 is a schematic flow chart of a method for assessing credit risk of an enterprise in the public credit domain;
FIG. 2 is an embodiment of variance inflation factor calculation.
Detailed Description
As shown in FIG. 1, the invention provides an enterprise credit risk assessment method in the public credit field, comprising the following steps:
S1: collecting historical data of various types of enterprises in the public credit field as basic sample data;
the basic sample data includes: enterprise basic information and enterprise public credit information;
the enterprise basic information is information describing the current state of the enterprise;
the enterprise public credit information is data describing the penalty records the enterprise has received in the public credit field.
In this embodiment, the enterprise basic information includes: date of establishment, registered capital, industry category, economic type, enterprise equity structure, number of affiliated companies, total outbound investment, total number of employees, and the enterprise's social security contribution base;
the enterprise public credit information includes: the number of administrative penalty records, the number of judicial litigation records, the number of records of being placed on the dishonest judgment debtor list, the number of records of being placed on the abnormal operation list, the number of business registration change records, the date of the administrative penalty, the date of the litigation, the date of being placed on the dishonest judgment debtor list, the date of being placed on the abnormal operation list, and the date of the business registration change.
The historical enterprise data used in the public credit field is public. Evaluators can obtain it through legitimate channels and evaluate the enterprise on that basis, without paying extra to acquire the data or worrying about its authenticity, so the public-credit risk assessment of the enterprise can be carried out fairly and impartially. The names and meanings of the independent variables in the classified basic sample data are listed in Table 1 below.
Table 1: Names and meanings of the independent variables (provided as image BDA0003325282920000051 in the original publication)
In step S1, based on the data characteristics of the basic sample data, the basic sample data is acquired as follows:
b1: the data obtained by directly collecting the enterprise basic information is put into the basic sample data to participate in calculation;
b2: the enterprise public credit information data is classified by the way its values are calculated, into: historical behavior record quantity variables and latest behavior record time variables;
the historical behavior record quantity variables include: the number of administrative penalty records, the number of judicial litigation records, the number of records of being placed on the dishonest judgment debtor list, the number of records of being placed on the abnormal operation list, and the number of business registration change records;
the latest behavior record time variables include: the date of the administrative penalty, the date of the litigation, the date of being placed on the dishonest judgment debtor list, the date of being placed on the abnormal operation list, and the date of the business registration change;
b3: for the historical behavior record quantity data, all public data is collected and accumulated by type, and each sum is used as the value of the corresponding characteristic variable and put into the basic sample data to participate in calculation;
b4: for the business registration change records, the public business registration data is first collected and then compared by time, and the most recent record time is used as the value of the corresponding characteristic variable and put into the basic sample data to participate in calculation.
Classifying the basic sample data by data type and then collecting each class in its own way makes the technical scheme of this patent match real-world conditions more closely and makes it more practical.
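The accumulation in b3 and the latest-time selection in b4 can be sketched in Python as follows. This is an illustrative sketch only: the record layout, the record type names, the enterprise IDs, and the `aggregate` function are hypothetical, and b4 is read here as applying to the latest behavior record time of every record type, which is an interpretation rather than something the text states outright.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw public records: (enterprise_id, record_type, record_date).
records = [
    ("E1", "administrative_penalty", date(2019, 5, 1)),
    ("E1", "administrative_penalty", date(2021, 3, 9)),
    ("E1", "judicial_litigation", date(2020, 7, 15)),
    ("E2", "abnormal_operation", date(2018, 1, 20)),
]

def aggregate(records):
    """Return per-enterprise record counts (b3) and latest record dates (b4)."""
    counts = defaultdict(lambda: defaultdict(int))
    latest = defaultdict(dict)
    for enterprise, kind, when in records:
        counts[enterprise][kind] += 1  # b3: accumulate by record type
        # b4: keep only the most recent date for each record type
        if kind not in latest[enterprise] or when > latest[enterprise][kind]:
            latest[enterprise][kind] = when
    return counts, latest

counts, latest = aggregate(records)
print(counts["E1"]["administrative_penalty"])  # 2
print(latest["E1"]["administrative_penalty"])  # 2021-03-09
```

A single pass over the records yields both variable classes, so the two collection modes can share one data source.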
S2: preprocessing all sample data in the basic sample data according to its type to obtain a sample data set;
dividing the sample data set into a training sample basic data set and a test data set;
the preprocessing operation comprises the following steps:
c1: for the data related to the enterprise basic information, counting the degree of missingness of the sample data for each characteristic variable;
if the degree of missingness of a characteristic variable in the basic sample data exceeds a preset missing threshold, deleting the sample data of that characteristic variable from the basic sample data;
otherwise, if the preset missing threshold is not exceeded, filling the corresponding missing sample values with -1;
c2: for the historical behavior record quantity variables and the latest behavior record time variables, finding the missing values of each characteristic variable, treating them as meaning that no such behavior record exists, and setting the corresponding missing sample values to 0;
in this embodiment, the missing threshold is set to 95%. That is, if a characteristic variable is missing in more than 95% of the enterprises' basic sample data, the variable has no data basis for participating in the evaluation and is deleted from the basic sample data, which ensures that the sample data subsequently used for evaluation is representative.
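The preprocessing rules c1 and c2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `preprocess` function, the variable names, the sample rows, and the use of `None` to mark a missing value are hypothetical; only the 95% threshold and the fill values -1 and 0 come from the text.

```python
MISSING_THRESHOLD = 0.95  # value used in this embodiment

def preprocess(rows, basic_vars, behaviour_vars, threshold=MISSING_THRESHOLD):
    """rows: list of dicts in which None marks a missing value (assumption)."""
    n = len(rows)
    # c1: drop basic-information variables whose missing share exceeds the threshold
    dropped = [
        var for var in basic_vars
        if sum(1 for r in rows if r.get(var) is None) / n > threshold
    ]
    cleaned = []
    for r in rows:
        new = {}
        for var in basic_vars:
            if var not in dropped:
                # c1: remaining missing basic-information values become -1
                new[var] = -1 if r.get(var) is None else r[var]
        for var in behaviour_vars:
            # c2: a missing behavior record means "no record", stored as 0
            new[var] = 0 if r.get(var) is None else r[var]
        cleaned.append(new)
    return cleaned, dropped

rows = [
    {"registered_capital": 500.0, "employees": None, "penalty_count": 2},
    {"registered_capital": None, "employees": None, "penalty_count": None},
    {"registered_capital": 120.0, "employees": None, "penalty_count": None},
]
cleaned, dropped = preprocess(
    rows, ["registered_capital", "employees"], ["penalty_count"]
)
print(dropped)     # ['employees']
print(cleaned[1])  # {'registered_capital': -1, 'penalty_count': 0}
```

Using -1 for basic information and 0 for behavior records keeps "value unknown" distinguishable from "no record exists" in the later binning step.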
S3: performing variable binning on the data in the training sample basic data set, as detailed in Table 2 below;
the non-numerical characteristic variables in the training sample basic data set are binned using a direct binning method, in which each value of the characteristic variable is placed in its own interval and replaced with its weight of evidence (WOE) value; the numerical characteristic variables in the training sample basic data set are binned using the information-value optimal binning method.
Table 2: Variable binning methods (provided as image BDA0003325282920000061 in the original publication)
In the technical scheme of the invention, the non-numerical variables are discretized by the direct binning method so that they can be used as numerical variables in the risk assessment calculation, which keeps the data participating in the calculation complete. There are various binning methods for numerical characteristic variables, for example equal-width binning, equal-frequency binning, and clustering binning among the unsupervised methods, and chi-square binning and the information-value optimal method among the supervised methods. This embodiment adopts the information-value optimal method, which bins so as to lose as little information value as possible. This discretizes the variable while avoiding loss of its information value as far as possible, and the information value of a variable is directly related to its predictive power. Binning the numerical characteristic variables by the information-value optimal method therefore ensures the accuracy of their binning in the technical scheme of the invention, and in turn the accuracy of the subsequent fitting process and of the scoring card evaluation.
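Direct binning of a non-numerical variable followed by WOE replacement can be sketched in Python as follows. The sketch assumes the common definition WOE = ln((P_i / ΣP) / (N_i / ΣN)), with P_i and N_i the dishonest and trustworthy counts of a category; the text does not spell out the WOE formula, so this sign convention is an assumption. The function name `woe_by_category` and the sample data are hypothetical.

```python
import math
from collections import Counter

def woe_by_category(values, labels):
    """Direct binning: one bin per distinct value, then one WOE per bin.

    labels: 1 = dishonest (positive), 0 = trustworthy (negative).
    """
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    woe = {}
    for cat in set(values):
        # Assumes every category holds at least one sample of each class;
        # otherwise the logarithm is undefined and smoothing would be needed.
        woe[cat] = math.log((pos[cat] / total_pos) / (neg[cat] / total_neg))
    return woe

# Hypothetical industry-category variable with two values.
values = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
woe = woe_by_category(values, labels)
encoded = [woe[v] for v in values]  # the variable after WOE replacement
print(woe["A"], woe["B"])  # about -1.0986 and 1.0986
```

After replacement the categorical variable is an ordinary numerical column, which is what allows it to enter the logistic regression alongside the numerical variables.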
The procedure for binning the numerical characteristic variables using the information-value optimal binning method is as follows. The count merge operation adjusts the number of samples in each bin so that every bin is representative; the similar merge operation reduces the number of bins, which keeps the model stable; and the linear adjustment fine-tunes all the data, which keeps the binned data stable overall.
d1: sorting the characteristic variable to be processed by value and dividing it into intervals to obtain characteristic binning data;
when dividing the intervals, the number of variable values in each interval does not exceed a preset binning threshold;
in a specific implementation, the binning threshold is set according to the amount of sample data in the basic sample data, so that each characteristic binning data has enough data to be representative and the number of intervals is not too small. In general, to ensure the usability and validity of the model, the amount of sample data in the basic sample data should be no less than 10,000; this embodiment uses 800,000 pieces of basic sample data for the calculation.
d2: performing a count merge operation on all the characteristic binning data;
the count merge operation comprises the following steps:
d2.1: determining the number of samples in each characteristic binning data;
d2.2: finding all characteristic binning data whose number of samples is smaller than a preset sample threshold, and recording them as to-be-merged binning data;
d2.3: judging whether the to-be-merged binning data lies at either end of the sorted characteristic binning data;
if so, merging the to-be-merged binning data with its adjacent binning data;
otherwise, performing step d2.4;
d2.4: calculating the merging IV loss value of the to-be-merged binning data with the adjacent binning data on each side;
and merging the to-be-merged binning data with the adjacent binning data on the side with the smaller merging IV loss value.
The count merge operation requires that the number of trustworthy samples (dependent variable equal to 0) and the number of dishonest samples (dependent variable equal to 1) in each bin both reach a minimum, which ensures that the samples in each bin are sufficiently representative. Because data sets differ in size, the sample threshold is expressed as a proportion of the total number of trustworthy samples and of the total number of dishonest samples; in this embodiment the sample threshold is 0.02.
For example, suppose there are 10,000 samples, of which 9,000 are trustworthy and 1,000 are dishonest.
A sample threshold of 0.02 then means that each bin must contain at least 180 trustworthy samples and at least 20 dishonest samples.
The count merge operation thus ensures that enough samples participate in the calculation and that the samples in each bin are sufficiently representative.
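The minimum-count check behind the count merge operation can be reproduced with a few lines of Python. The function name `bins_below_minimum` and the four-bin example are hypothetical; the totals (9,000 trustworthy, 1,000 dishonest) and the 0.02 threshold follow the worked example above.

```python
def bins_below_minimum(bins, sample_threshold=0.02):
    """bins: list of (trustworthy_count, dishonest_count) in sorted bin order.

    Returns the two per-bin minimums and the indices of bins to be merged.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    min_good = total_good * sample_threshold
    min_bad = total_bad * sample_threshold
    flagged = [
        i for i, (g, b) in enumerate(bins) if g < min_good or b < min_bad
    ]
    return min_good, min_bad, flagged

# Hypothetical bins summing to 9,000 trustworthy and 1,000 dishonest samples.
bins = [(4000, 500), (3000, 300), (1900, 191), (100, 9)]
min_good, min_bad, flagged = bins_below_minimum(bins)
print(min_good, min_bad, flagged)
```

The last bin is flagged because it holds fewer than 180 trustworthy and fewer than 20 dishonest samples. Since it sits at the end of the sorted bins, step d2.3 would merge it directly with its only neighbor.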
The merging IV loss value is calculated as follows:
$$\text{merging IV loss value} = (\text{sum of the IV values of the two bins before merging}) - (\text{IV value of the merged bin})$$
The IV value is calculated as follows:
$$IV = \sum_{i=1}^{k} \left( \frac{P_i}{\sum P_i} - \frac{N_i}{\sum N_i} \right) \ln\left( \frac{P_i / \sum P_i}{N_i / \sum N_i} \right)$$
Here IV is the current information value of the variable; k is the current total number of intervals of the variable; ln denotes the natural logarithm; $P_i$ is the number of dishonest samples in interval i (positive samples, for which the model prediction should be 1); $N_i$ is the number of trustworthy samples in interval i (negative samples, for which the model prediction should be 0); $\sum P_i$ is the sum of all $P_i$ (the total number of dishonest samples), and $\sum N_i$ is likewise the total number of trustworthy samples.
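The IV value and the merging IV loss value can be computed with the following Python sketch. It assumes the standard information value formula, IV = Σ (P_i/ΣP − N_i/ΣN) · ln((P_i/ΣP) / (N_i/ΣN)), with P_i and N_i as defined above. The function names `iv` and `merge_iv_loss` and the example bins are hypothetical.

```python
import math

def iv(bins):
    """bins: list of (P_i, N_i) = (dishonest, trustworthy) counts per interval."""
    total_p = sum(p for p, _ in bins)
    total_n = sum(n for _, n in bins)
    value = 0.0
    for p, n in bins:
        dp, dn = p / total_p, n / total_n  # assumes p > 0 and n > 0
        value += (dp - dn) * math.log(dp / dn)
    return value

def merge_iv_loss(bins, i):
    """IV lost by merging bin i with its right-hand neighbor, bin i + 1."""
    merged_bin = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
    merged = bins[:i] + [merged_bin] + bins[i + 2:]
    # The other bins contribute the same amount before and after the merge,
    # so this equals (IV of the two bins) minus (IV of the merged bin).
    return iv(bins) - iv(merged)

# Hypothetical bins: the first two have the same dishonest-to-trustworthy ratio.
bins = [(10, 90), (20, 180), (70, 30)]
print(iv(bins))
print(merge_iv_loss(bins, 0))  # close to 0: the bins are statistically alike
print(merge_iv_loss(bins, 1))  # clearly positive: the bins differ
```

Bins with the same class ratio can be merged at almost no cost in information value, which is the property that both the count merge and the similar merge operations rely on.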
As shown in Table 3 below (binning example 1), the bin in row 33 contains only 9 dishonest samples, which does not meet the minimum requirement for a bin and is the smallest number of dishonest samples among all bins, so it should be merged with another bin. Its merging IV loss value is 8.9E-07 with the previous bin and 0.00012 with the next bin, so it should be merged with the previous bin.
Table 3: case separation example 1
Figure BDA0003325282920000081
The merged result is shown in Table 4 (binning example 2), where the row with serial number 32 is the merged bin.
Table 4: Binning example 2
Figure BDA0003325282920000082
d3: performing a similarity-merge operation on the feature binning data;
the similarity-merge operation comprises the following steps:
d3-1: acquiring all feature binning data, and calculating the merge IV loss value between every two adjacent bins to obtain all merge IV loss values;
d3-2: finding the smallest merge IV loss value, and comparing it with a preset merge loss threshold;
if the smallest merge IV loss value is smaller than the merge loss threshold, merging the two corresponding bins;
d3-3: repeating d3-1 to d3-2 until no merge IV loss value is smaller than the merge loss threshold.
The similarity-merge operation is performed on the feature binning data that has already undergone the count-merge operation; it ensures that samples in different bins differ from each other while samples within the same bin are similar. The merge loss threshold is set according to the quality of the sample data; in this embodiment it is set to 0.0001, and in the similarity-merge stage only the two bins with the smallest loss are merged at each step.
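Steps d3-1 to d3-3 can be sketched as a simple loop. The following Python example is a minimal illustration under stated assumptions: bins are represented as (loss-of-credit count, trustworthy count) pairs, every bin contains both classes, and the function names and example counts are hypothetical.

```python
import math


def bin_iv(p, n, total_p, total_n):
    """IV contribution of one bin (p loss-of-credit, n trustworthy samples)."""
    p_share, n_share = p / total_p, n / total_n
    return (p_share - n_share) * math.log(p_share / n_share)


def similarity_merge(bins, loss_threshold=0.0001):
    """Repeatedly merge the adjacent pair with the smallest IV loss (d3-1 to d3-3)."""
    bins = list(bins)
    total_p = sum(p for p, _ in bins)
    total_n = sum(n for _, n in bins)
    while len(bins) > 1:
        losses = []
        for (p1, n1), (p2, n2) in zip(bins, bins[1:]):
            before = bin_iv(p1, n1, total_p, total_n) + bin_iv(p2, n2, total_p, total_n)
            losses.append(before - bin_iv(p1 + p2, n1 + n2, total_p, total_n))
        i = min(range(len(losses)), key=losses.__getitem__)
        if losses[i] >= loss_threshold:
            break  # no remaining pair can be merged cheaply enough
        bins[i:i + 2] = [(bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])]
    return bins


# Hypothetical bins: the first two have the same class ratio, so merging them loses no IV.
merged_bins = similarity_merge([(10, 100), (20, 200), (50, 100)])
print(merged_bins)  # prints [(30, 300), (50, 100)]
```

Only one pair is merged per pass, which matches the rule that only the two bins with the smallest loss are merged at each step.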
The last column of Table 2 is used to find the row with the lowest loss. The table shows that the row with serial number 8 currently has the lowest loss: merging it with the previous bin loses only 7.3E-09 of IV. Because the two merge operations in the Table 2 example do not affect each other, they are performed together here for convenience. In actual operation, count merging is generally performed first and similarity merging afterwards, and only one merge is performed at a time. The merged result is shown in Table 4, where the row with serial number 7 is the merged bin.
After the similarity-merge operation has been performed on the feature binning data and before step d4 is performed, it is determined whether the feature binning data needs to keep a linear trend; if so, linear adjustment is performed; otherwise, step d4 is performed directly;
the linear adjustment comprises the following steps:
c1: calculating the evidence weight value of each feature bin;
c2: arranging all evidence weight values in the order of their corresponding feature bins;
c3: finding any evidence weight value that does not follow the linear trend, and merging the corresponding feature bin with an adjacent bin.
In the technical scheme of the invention, linear adjustment ensures that the influence of the independent variable on the dependent variable is monotonic, or takes a simple U-shaped or inverted-U-shaped form, which increases the stability of the model. Bins whose values change discontinuously are identified by observing how the evidence weight value (Weight of Evidence, WOE) changes as the bin value increases, and are then merged.
The WOE, which represents the relative risk of an interval, is calculated as follows:
$$W_i = \ln\left(\frac{P_i/\sum P_i}{N_i/\sum N_i}\right)$$
where $W_i$ is the evidence weight of bin $i$ of the variable; $\ln$ denotes the natural logarithm; $P_i$ is the number of loss-of-credit samples in the interval (positive samples, for which the model's prediction should be 1); $N_i$ is the number of trustworthy samples in the interval (negative samples, for which the model's prediction should be 0); $\sum P_i$ is the sum over all $P_i$ (the total number of loss-of-credit samples); and $\sum N_i$ is the sum over all $N_i$ (the total number of trustworthy samples).
Referring to Table 4, the WOE values of the feature bins follow a continuous inverted-U shape, so no linear adjustment is required. In binning example 3, shown in Table 5, the row with serial number 19 changes abruptly: the WOE values decrease gradually from top to bottom but rise suddenly at row 19. This does not conform to the overall trend and is likely caused by a peculiarity of the data set, so the bin should be merged with an adjacent bin.
Table 5: Binning example 3
Figure BDA0003325282920000092
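The check described for row 19 can be sketched as follows. This Python example is an illustration only: it assumes WOE is the logarithm of the loss-of-credit share over the trustworthy share, it handles only a monotonic trend (not the U-shaped or inverted-U-shaped cases), and the function names and numbers are hypothetical.

```python
import math


def woe_values(bins):
    """WOE per bin for bins given as (loss-of-credit, trustworthy) counts."""
    total_p = sum(p for p, _ in bins)
    total_n = sum(n for _, n in bins)
    return [math.log((p / total_p) / (n / total_n)) for p, n in bins]


def trend_breaks(woe):
    """Indices of bins whose WOE moves against the dominant direction of change."""
    diffs = [b - a for a, b in zip(woe, woe[1:])]
    falling = sum(d < 0 for d in diffs) >= sum(d > 0 for d in diffs)
    return [i + 1 for i, d in enumerate(diffs) if d != 0 and (d > 0) == falling]


# Hypothetical WOE sequence that decreases overall but jumps up at index 3.
print(trend_breaks([0.9, 0.6, 0.3, 0.5, 0.1]))  # prints [3]
```

A bin reported by `trend_breaks` is a candidate for merging with one of its neighbours, after which the WOE values would be recalculated.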
d4: calculating the evidence weight value of each feature bin, and replacing the data in each feature bin with the corresponding evidence weight value. Details are shown in Tables 6 and 7 below.
Table 6 gives an example of the binning results obtained after binning some of the numerical characteristic variables; Table 7 gives an example in which the values of some variables are replaced with evidence weight values during binning.
Table 6: Example binning results for some variables
Figure BDA0003325282920000101
Table 7: Example of evidence-weight replacement for some variables
Figure BDA0003325282920000102
S4: performing a variable screening operation on the training sample basic data set after variable binning to obtain the final training data set;
the variable screening operation comprises the following steps:
a1: calculating the information value of each characteristic variable in the training sample basic data set;
a2: deleting the characteristic variables whose information values are lower than a preset information value threshold;
a3: calculating the Pearson correlation coefficient of every pair of characteristic variables in the training sample basic data set, and removing characteristic variables from the training sample basic data set based on sample correlation;
a4: performing a logistic regression fit of the dependent variable on each characteristic variable (as the independent variable) in the training sample basic data set, and removing all characteristic variables that cannot significantly predict the dependent variable, to obtain the training data set.
The information value of a characteristic variable represents its correlation with the dependent variable, that is, its ability to predict the dependent variable. The predictive power represented by different values is shown in Table 8. In this embodiment, the information value threshold is set to 0.1, i.e., variables with information values greater than 0.1 are selected to enter the model, so that every variable entering the model has predictive power for the dependent variable. In particular, when few variables have information values greater than 0.1 (for example, fewer than 10), the screening criterion is relaxed according to the specific data to ensure that enough data is available, for example to an information value greater than 0.02, so that more variables can enter the model and the overall predictive power of the model improves.
Table 8: Variable IV values and usage recommendations
Serial number | IV value | Predictive power | Usage recommendation
1 | 0.02 or less | No predictive power | Do not use
2 | 0.02~0.1 | Weak predictive power | Generally not used; may be used as appropriate when there are few variables
3 | 0.1~0.2 | Medium predictive power | Generally usable; may be dropped as appropriate when there are many variables
4 | 0.2 or more | Strong predictive power | Can be used
In this embodiment, the IV values of some variables after binning are shown in Table 9.
Table 9: IV values of some variables after binning
Serial number | Variable name | IV value
1 | Establishment time | 0.306
2 | Registered capital | 0.530
3 | Industry category | 0.286
4 | Economic type | 0.204
5 | Enterprise equity structure | 0.163
6 | Total external investment | 0.256
7 | Total number of enterprise employees | 0.112
8 | Number of administrative penalty records | 0.124
9 | Number of judicial litigation records | 1.016
10 | Number of records on the list of dishonest judgment debtors | 0.620
11 | Number of records on the abnormal operation list | 0.125
12 | Number of business registration change records | 0.200
13 | Time of administrative penalty | 0.232
14 | Time of involvement in litigation | 1.305
15 | Time listed on the list of dishonest judgment debtors | 0.769
16 | Time listed on the abnormal operation list | 0.212
17 | Time of business registration change | 0.234
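The screening rule of step a2, including the relaxed threshold used when too few variables qualify, can be sketched in Python as follows. The IV values are those of Table 9; the function name, the shortened variable keys, and the `min_variables` parameter are hypothetical choices made for this illustration.

```python
def screen_by_iv(iv_by_variable, threshold=0.1, fallback=0.02, min_variables=10):
    """Keep variables with IV above the threshold; relax it if too few remain."""
    kept = {v: iv for v, iv in iv_by_variable.items() if iv > threshold}
    if len(kept) < min_variables:
        kept = {v: iv for v, iv in iv_by_variable.items() if iv > fallback}
    return kept


table_9 = {
    "establishment_time": 0.306, "registered_capital": 0.530,
    "industry_category": 0.286, "economic_type": 0.204,
    "equity_structure": 0.163, "external_investment": 0.256,
    "employee_count": 0.112, "penalty_count": 0.124,
    "litigation_count": 1.016, "dishonest_list_count": 0.620,
    "abnormal_list_count": 0.125, "registration_change_count": 0.200,
    "penalty_time": 0.232, "litigation_time": 1.305,
    "dishonest_list_time": 0.769, "abnormal_list_time": 0.212,
    "registration_change_time": 0.234,
}
selected = screen_by_iv(table_9)
print(len(selected))  # prints 17: every variable in Table 9 exceeds 0.1
```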
In step a3 of the variable screening operation, removing characteristic variables from the training sample basic data set based on sample correlation comprises the following steps:
a31: acquiring all characteristic variables and recording them as the variables whose correlation is to be calculated;
a32: calculating the Pearson correlation coefficient of every pair of these variables;
a33: finding all variable pairs whose Pearson correlation coefficient is larger than a preset correlation threshold, and removing from each pair the characteristic variable with the smaller information value;
a34: calculating the variance inflation factor of each remaining characteristic variable in the training sample basic data set;
a35: comparing every variance inflation factor with a preset variance inflation factor threshold;
if the variance inflation factor of a characteristic variable is larger than the preset variance inflation factor threshold, removing that variable, recording the remaining variables as the variables whose correlation is to be calculated, and repeating steps a34 to a35;
if the variance inflation factors of all characteristic variables are less than or equal to the variance inflation factor threshold, performing step a4.
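Step a33 can be sketched as follows. This Python example is a minimal illustration: the function name and data layout are hypothetical, the two IV values for the list-related variables and for registered capital are taken from Table 9, and the correlation values 0.8 and 0.1 are invented for the example (the description states only that one pair exceeds the threshold).

```python
def drop_correlated(iv_by_variable, correlations, threshold=0.7):
    """For each pair correlated above the threshold, drop the variable with lower IV.

    correlations maps (variable_a, variable_b) -> Pearson correlation coefficient.
    """
    removed = set()
    for pair, r in correlations.items():
        if r > threshold:
            removed.add(min(pair, key=iv_by_variable.__getitem__))
    return [v for v in iv_by_variable if v not in removed]


iv = {"dishonest_list_count": 0.620, "dishonest_list_time": 0.769,
      "registered_capital": 0.530}
corr = {
    ("dishonest_list_count", "dishonest_list_time"): 0.8,   # hypothetical value
    ("dishonest_list_count", "registered_capital"): 0.1,    # hypothetical value
}
print(drop_correlated(iv, corr))  # prints ['dishonest_list_time', 'registered_capital']
```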
In the technical scheme of the invention, enterprise basic information and enterprise public credit information are used as basic data. The correlation between two variables is described by the Pearson correlation coefficient, and the correlation between one characteristic variable and several other characteristic variables is described by the variance inflation factor. During variable screening, removing variables lowers the Pearson correlation coefficients and the variance inflation factors, which reduces the correlation among the independent variables and improves the stability of the model.
The Pearson correlation coefficient is calculated as follows:
$$r_{XY} = \frac{\sum_m (X_m-\bar{X})(Y_m-\bar{Y})}{\sqrt{\sum_m (X_m-\bar{X})^2}\,\sqrt{\sum_m (Y_m-\bar{Y})^2}}$$
where $X$ and $Y$ denote different variables, $X_m$ and $Y_m$ are the values of variables $X$ and $Y$ for sample $m$, $\bar{X}$ is the mean of variable $X$, $\bar{Y}$ is the mean of variable $Y$, and $\sum$ denotes summation.
In this embodiment, the correlation threshold is set to 0.7. For example, in the variable correlation matrix of Table 10 below, only the correlation between the number of records on the list of dishonest judgment debtors and the time listed on the list of dishonest judgment debtors exceeds 0.7; according to the IV values in Table 9, the number-of-records variable, which has the smaller IV value (0.620 versus 0.769), should be eliminated.
Table 10: Example correlation coefficients for some variables
Figure BDA0003325282920000124
Table 11 below lists the values of the establishment-time variable and the registered-capital variable for 30 samples; the Pearson correlation coefficient calculated from them is about 0.15.
Table 11: Pearson correlation coefficient calculation example
Sample | Establishment time | Registered capital | Sample | Establishment time | Registered capital | Sample | Establishment time | Registered capital
Sample 1 | 0.4534 | -0.2080 | Sample 11 | 0.4534 | 1.5022 | Sample 21 | 0.4534 | -0.0595
Sample 2 | 0.4534 | 0.8285 | Sample 12 | -0.2540 | -0.2080 | Sample 22 | 0.4534 | -0.2080
Sample 3 | 0.4534 | -0.2080 | Sample 13 | 0.4534 | -0.4077 | Sample 23 | 0.2296 | 1.5022
Sample 4 | 0.4534 | -0.4077 | Sample 14 | -0.9493 | 0.2909 | Sample 24 | 0.4534 | 1.5022
Sample 5 | 0.4534 | 1.5022 | Sample 15 | 0.4534 | -0.2080 | Sample 25 | 0.4534 | 0.2909
Sample 6 | 0.4534 | 0.2909 | Sample 16 | 0.3654 | -0.7070 | Sample 26 | 0.4534 | -0.2080
Sample 7 | 0.4534 | 0.8285 | Sample 17 | 0.4534 | 1.5022 | Sample 27 | 0.4534 | 0.8285
Sample 8 | 0.4534 | -0.4077 | Sample 18 | 0.4534 | 0.2909 | Sample 28 | -0.0111 | -0.4077
Sample 9 | -0.0111 | -0.2080 | Sample 19 | -0.0111 | 0.2909 | Sample 29 | -0.9493 | -0.0595
Sample 10 | 0.4534 | -0.2080 | Sample 20 | 0.4534 | 0.8285 | Sample 30 | -0.2540 | 0.2909
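The calculation for Table 11 can be reproduced with a few lines of Python. The data below is copied from Table 11 in sample order; the function and variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


establishment_time = [
    0.4534, 0.4534, 0.4534, 0.4534, 0.4534, 0.4534, 0.4534, 0.4534, -0.0111, 0.4534,
    0.4534, -0.2540, 0.4534, -0.9493, 0.4534, 0.3654, 0.4534, 0.4534, -0.0111, 0.4534,
    0.4534, 0.4534, 0.2296, 0.4534, 0.4534, 0.4534, 0.4534, -0.0111, -0.9493, -0.2540,
]
registered_capital = [
    -0.2080, 0.8285, -0.2080, -0.4077, 1.5022, 0.2909, 0.8285, -0.4077, -0.2080, -0.2080,
    1.5022, -0.2080, -0.4077, 0.2909, -0.2080, -0.7070, 1.5022, 0.2909, 0.2909, 0.8285,
    -0.0595, -0.2080, 1.5022, 1.5022, 0.2909, -0.2080, 0.8285, -0.4077, -0.0595, 0.2909,
]
r = pearson(establishment_time, registered_capital)
print(round(r, 2))  # prints 0.15
```

The result is well below the correlation threshold of 0.7, so neither variable would be removed on the basis of this pair.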
The variance inflation factor of a variable $X$ is calculated as follows:
First, taking $X$ as the dependent variable and all other variables as independent variables, perform a linear regression fit to obtain the predicted value $\hat{X}$ of $X$.
Second, calculate the Pearson correlation coefficient between $X$ and $\hat{X}$; this is the multiple correlation coefficient $R$ between $X$ and the other independent variables.
Third, calculate the variance inflation factor of $X$ as:
$$\text{variance inflation factor} = \frac{1}{1-R^2}$$
Fig. 2 of the drawings shows an example of calculating the variance inflation factor: for the 30 samples, the variance inflation factor of the litigation-date variable with respect to the other three variables (establishment time, registered capital, and number of judicial litigation records) is calculated. First, the litigation-date variable is taken as the dependent variable and the other three variables as independent variables, and a multiple linear regression is fitted to obtain a regression equation. The regression equation is then used to predict the litigation-date value of each sample; the correlation coefficient between the predicted values and the actual values is the multiple correlation coefficient $R$, from which the variance inflation factor follows. In the example of Fig. 2, $R$ is 0.170712, $R^2$ is 0.029143, and the variance inflation factor is:
$$\frac{1}{1-R^2} = \frac{1}{1-0.1707^2} \approx 1.03$$
This is less than 5, so step a4 may be performed.
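The arithmetic of the Fig. 2 example can be checked with a short Python sketch. The function name is hypothetical, and the value 5 is used as the variance inflation factor threshold as implied by the comparison above.

```python
def vif_from_multiple_r(r):
    """Variance inflation factor from the multiple correlation coefficient R."""
    return 1.0 / (1.0 - r ** 2)


vif = vif_from_multiple_r(0.170712)   # R from the Fig. 2 example
print(round(vif, 2))                  # prints 1.03
print(vif <= 5)                       # prints True: the variable is kept
```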
In step a4, a logistic regression of the dependent variable is fitted on each characteristic variable in the training sample basic data set separately, and all characteristic variables that cannot significantly predict the dependent variable are removed to obtain the training data set. In the technical scheme of the invention, an independent variable enters the final model fitting only when its single-variable fit shows that it significantly predicts the dependent variable, so that every independent variable in the final model fit significantly predicts the dependent variable.
Table 12 below shows the single-variable test result for the independent variable establishment time. Its significance index is 0.000 (0 after rounding), which is less than 0.05, so the variable significantly predicts the dependent variable and can be used. The significance index measures how well an independent variable predicts the dependent variable in the regression and is calculated by existing techniques; in this embodiment, an independent variable is considered to predict the dependent variable significantly if its significance index is less than 0.05.
Table 12: Logistic regression results for the establishment-time variable
Figure BDA0003325282920000131
S5: performing model fitting on the training data set based on a logistic regression model, and removing the characteristic variables that do not meet a preset significance index, to obtain the credit risk assessment model for enterprises in the public credit field; in this embodiment, the logistic regression model is fitted by backward stepwise regression.
The logistic regression model is as follows:
$$\ln\left(\frac{prob}{1-prob}\right) = \beta_0 + \sum_i \beta_i x_i$$
where $prob$ is the loss-of-credit probability of the sample; $\frac{prob}{1-prob}$ is called the loss-of-credit ratio; $i$ is the index that distinguishes the variables; $x_i$ is the value of variable $i$ of the sample after evidence-weight replacement (i.e., the value in the "variable value after replacement" column of Table 7); $\beta_i$ is the regression coefficient of variable $i$; and $\beta_0$ is the constant term of the model. $\beta_0$ and $\beta_i$ are the parameters to be estimated; in this embodiment, they are estimated by the maximum likelihood method.
A significance index threshold is preset; in this embodiment it is set to 0.05. The model is fitted using the characteristic variables in the training data set. If the significance indexes of all characteristic variables are smaller than the threshold, the parameters of the model are output directly; otherwise, the characteristic variable with the largest significance index, i.e., the least significant one, is removed and the model is fitted again. As shown in Table 13 below, the significance index of the characteristic variable "enterprise equity structure" is greater than the significance index threshold, so this variable is eliminated.
Table 13: Example intermediate result of backward stepwise regression
Figure BDA0003325282920000141
This cycle is repeated until the significance indexes of all characteristic variables in the training data set are smaller than the significance index threshold. The final fitting results are shown in Table 14.
Table 14: Example fitting results of the logistic regression model
Figure BDA0003325282920000142
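The backward stepwise loop described above can be sketched in Python as follows. The sketch leaves the actual model fitting to a caller-supplied function `fit`, which is a hypothetical interface assumed to return the significance index of each variable; the stub used in the example returns fixed, invented values, whereas a real fit would recompute them after every removal.

```python
def backward_eliminate(variables, fit, threshold=0.05):
    """Drop the least significant variable and refit until all pass the threshold.

    fit(variables) must return {variable: significance index} for the fitted model.
    """
    variables = list(variables)
    while variables:
        p_values = fit(variables)
        worst = max(variables, key=p_values.__getitem__)
        if p_values[worst] < threshold:
            return variables, p_values   # every variable is significant
        variables.remove(worst)          # remove the least significant one
    return variables, {}


# Stub standing in for a logistic regression fit (hypothetical, fixed values).
stub = {"establishment_time": 0.000, "registered_capital": 0.010,
        "equity_structure": 0.300}
kept, _ = backward_eliminate(stub, lambda vs: {v: stub[v] for v in vs})
print(kept)  # prints ['establishment_time', 'registered_capital']
```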
In this embodiment, the final regression equation obtained after fitting is:
Figure BDA0003325282920000143
Here prob is the loss-of-credit probability of the sample, and the value of each independent variable is its value after evidence-weight replacement. The left side of the regression equation is the natural logarithm of the loss-of-credit ratio of the sample, from which the loss-of-credit probability prob can be obtained by a simple transformation.
S6: constructing the credit risk scoring card for enterprises in the public credit field based on the parameters of the credit risk assessment model for enterprises in the public credit field.
The prediction of the logistic regression model is the loss-of-credit probability of the sample, whose value range is (0,1); the closer the value is to 1, the higher the probability of losing credit. The loss-of-credit probability and the enterprise credit score are converted into each other by the following formula:
$$Score = A - B\,\ln\left(\frac{prob}{1-prob}\right)$$
where A and B are constants that mainly control the range and the rate of change of the enterprise credit score and are solved from preset values, and prob is the loss-of-credit probability of the sample predicted by the logistic regression model.
Solving for A and B requires two assumptions. The first is a specific loss-of-credit ratio $\theta_0$ together with its expected score $S_0$: in this embodiment, $\theta_0$ is set to 20, which corresponds to a loss-of-credit probability of about 95.24%, and $S_0$ is set to 400, i.e., an enterprise credit score of 400 corresponds to a loss-of-credit probability of about 95.24%. The second is the score change $dS$ when the loss-of-credit ratio doubles, which is set to 20 in this embodiment.
Substituting these two assumptions into the credit score formula gives the constants A and B:
$$S_0 = A - B\ln\theta_0,\qquad S_0 - dS = A - B\ln(2\theta_0)$$
$$B = \frac{dS}{\ln 2} \approx 28.85,\qquad A = S_0 + B\ln\theta_0 \approx 486.44$$
The desired credit score range can be set with the above formulas. The conversion between credit score and loss-of-credit probability is therefore:
$$Score = 486.44 - 28.85\,\ln\left(\frac{prob}{1-prob}\right)$$
Substituting a loss-of-credit probability of 95% gives the corresponding credit score:
Figure BDA0003325282920000153
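The score conversion can be checked numerically. The following Python sketch assumes the form Score = A - B·ln(prob/(1-prob)), which is consistent with the score ranges of Table 15, and uses the preset values of this embodiment (loss-of-credit ratio 20 at score 400, and 20 points per doubling of the ratio); the function names are hypothetical.

```python
import math


def scorecard_constants(theta0=20.0, s0=400.0, ds=20.0):
    """Solve A and B from the reference ratio, its score, and the doubling step."""
    b = ds / math.log(2)
    a = s0 + b * math.log(theta0)
    return a, b


def score_from_prob(prob, a, b):
    """Credit score for a given loss-of-credit probability."""
    return a - b * math.log(prob / (1 - prob))


def prob_from_score(score, a, b):
    """Loss-of-credit probability for a given credit score."""
    odds = math.exp((a - score) / b)
    return odds / (1 + odds)


a, b = scorecard_constants()
print(round(a, 2), round(b, 2))                 # prints 486.44 28.85
print(round(score_from_prob(0.95, a, b), 2))    # prints 401.48
print(round(prob_from_score(500, a, b), 4))     # prints 0.3846
```

Under these constants, scores of 300, 400, 500, 600 and 700 correspond to loss-of-credit probabilities of about 99.84%, 95.24%, 38.46%, 1.9% and 0.06%, which are the boundaries listed in Table 15.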
The parameters fitted by the logistic regression equation are used to calculate the base score and the score corresponding to each interval of each variable. The base score is the default score every sample receives; it is determined by the constant term of the logistic regression equation and equals $A - B\beta_0$. The score of each interval of variable $i$ is $-B\beta_i x_i$, where $\beta_i$ is the regression coefficient of variable $i$ and $x_i$ is the evidence weight value of the interval into which the variable's value falls. The score of a sample is the base score plus the scores of the intervals into which its variable values fall.
That is, each variable of the sample is scored according to the interval to which its value belongs, and all the scores are then added to obtain the final score.
In the technical scheme of this patent, the scoring card allows enterprises to be scored quickly, which is convenient in practice: the score of each index is determined directly from the interval in which the enterprise's index value lies, the scores are simply added to obtain the total score, and the total score determines the enterprise's loss-of-credit probability level. In this embodiment, the correspondence between the final score and the loss-of-credit probability is shown in Table 15.
Table 15: Correspondence between score range and loss-of-credit probability
Score range | Loss-of-credit probability
<300 | >99.84%
[300,400) | (95.24%,99.84%]
[400,500) | (38.46%,95.24%]
[500,600) | (1.91%,38.46%]
[600,700) | (0.06%,1.91%]
>=700 | <=0.06%
S7: obtaining the basic sample data of the enterprise to be evaluated as the sample data to be evaluated, inputting it into the credit risk scoring card for enterprises in the public credit field, obtaining the score of the interval to which the value of each characteristic variable of the sample data belongs, and adding all the scores to obtain the enterprise credit risk assessment result of the enterprise to be evaluated.
For example, for the sample in Table 16, the base score is the score every sample receives directly; it derives from the constant term of the logistic regression result and mainly controls the approximate range of the scores. The establishment-time variable of the sample is 16 years, which falls in the interval (6, 18) of the scoring card, and the corresponding score is -2; the industry category of the sample is class F (wholesale and retail), and the corresponding score is 6; and so on. The scores of all variables finally add up to 430, i.e., the score of the sample is 430, and according to Table 15 the corresponding loss-of-credit probability is between 38.46% and 95.24%.
Table 16: Example of scoring a sample with the scoring card
Figure BDA0003325282920000161
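Looking up and adding scores can be sketched as follows. Only the two scores mentioned in the example (-2 for an establishment time in the interval (6, 18) and 6 for industry class F) come from the description; the base score of 426, the other intervals and scores, the half-open interval convention, and all names are hypothetical values chosen so that this two-variable card reproduces the total of 430.

```python
def score_sample(sample, base_score, numeric_bins, category_scores):
    """Base score plus the score of the interval or category of each variable."""
    total = base_score
    for name, bins in numeric_bins.items():
        value = sample[name]
        # Each bin is (low, high, points); assumed to cover low < value <= high.
        total += next(pts for low, high, pts in bins if low < value <= high)
    for name, table in category_scores.items():
        total += table[sample[name]]
    return total


numeric_bins = {"establishment_years": [(0, 6, 5), (6, 18, -2), (18, float("inf"), -8)]}
category_scores = {"industry": {"F": 6, "C": -3}}
sample = {"establishment_years": 16, "industry": "F"}
total = score_sample(sample, 426, numeric_bins, category_scores)
print(total)  # prints 430
```

A score of 430 lies in the range [400,500) of Table 15, i.e., a loss-of-credit probability between 38.46% and 95.24%.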
The most distinctive feature of the technical scheme of the invention is that it uses only data from the enterprise public credit field, comprising public credit information and basic information. Such data is publicly disclosed, so it is easy to obtain and highly reliable. Records in an enterprise's public credit information, such as being placed on the list of dishonest judgment debtors, are very serious acts of bad faith that reflect the enterprise's willingness or ability to repay. The technical scheme of the invention therefore takes the public credit information records that can reflect enterprise credit risk as basic data to evaluate the risk of enterprises in the public credit field, and the evaluation method it provides is an important supplement to traditional enterprise credit risk assessment.

Claims (9)

1. A credit risk assessment method for enterprises in the public credit field is characterized by comprising the following steps:
S1: collecting historical data of various types of enterprises in the public credit field as basic sample data;
the basic sample data comprises: enterprise basic information and enterprise public credit information;
the enterprise basic information is information describing the current state of the enterprise;
the enterprise public credit information is data describing the penalty records the enterprise has received in the public credit field;
S2: preprocessing all sample data in the basic sample data according to the type of the sample data to obtain a sample data set;
dividing the sample data set into a training sample basic data set and a test data set;
S3: performing variable binning on the data in the training sample basic data set;
binning the non-numerical characteristic variables in the training sample basic data set based on a direct binning method;
S4: performing a variable screening operation on the training sample basic data set after variable binning to obtain a final training data set;
the variable screening operation comprises the following steps:
a1: calculating the information value of each characteristic variable in the training sample basic data set;
a2: deleting the characteristic variables whose information values are lower than a preset information value threshold;
a3: calculating the Pearson correlation coefficient of every pair of characteristic variables in the training sample basic data set, and removing characteristic variables from the training sample basic data set based on sample correlation;
a4: performing a logistic regression fit of the dependent variable on each characteristic variable (as the independent variable) in the training sample basic data set, and removing all characteristic variables that cannot significantly predict the dependent variable, to obtain the training data set;
S5: performing model fitting on the training data set based on a logistic regression model, and removing the characteristic variables that do not meet a preset significance index, to obtain a credit risk assessment model for enterprises in the public credit field;
S6: constructing a credit risk scoring card for enterprises in the public credit field based on the parameters of the credit risk assessment model for enterprises in the public credit field;
S7: obtaining the basic sample data of an enterprise to be evaluated as sample data to be evaluated, inputting the sample data to be evaluated into the credit risk scoring card for enterprises in the public credit field, obtaining the score of the interval to which the value of each characteristic variable of the sample data to be evaluated belongs, and adding all the scores to obtain the final risk assessment score of the enterprise to be evaluated in the public credit field, namely the enterprise credit risk assessment result of the enterprise to be evaluated.
2. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: in step S1, based on the data characteristics of the basic sample data, the method for acquiring the basic sample data comprises:
b1: directly collecting the enterprise basic information and putting the collected data into the basic sample data to participate in calculation;
b2: classifying the enterprise public credit information data according to how the data is calculated and obtained, the classes comprising: historical behavior record quantity variables and latest behavior record time variables;
the historical behavior record quantity variables comprise: the number of administrative penalty records, the number of judicial litigation records, the number of records on the list of dishonest judgment debtors, the number of records on the abnormal operation list, and the number of business registration change records;
the latest behavior record time variables comprise: the date of administrative penalty, the date of involvement in litigation, the date listed on the list of dishonest judgment debtors, the date listed on the abnormal operation list, and the date of business registration change;
b3: for the historical behavior record quantity data, collecting all public data, accumulating it by type, taking each sum as the value of the corresponding characteristic variable, and putting it into the basic sample data to participate in calculation;
b4: for the data of the record quantity of the business registration change, first acquiring the public business registration data, then comparing by time to find the data with the latest record time, taking it as the value of the corresponding characteristic variable, and putting it into the basic sample data to participate in calculation.
3. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: the enterprise basic information comprises: establishment time, registered capital, industry category, economic type, enterprise equity structure, number of associated companies, total external investment, total number of enterprise employees, and the enterprise's social security payment base;
the enterprise public credit information comprises: the number of administrative penalty records, the number of judicial litigation records, the number of records on the list of dishonest judgment debtors, the number of records on the abnormal operation list, the number of business registration change records, the date of administrative penalty, the date of involvement in litigation, the date listed on the list of dishonest judgment debtors, the date listed on the abnormal operation list, and the date of business registration change.
4. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: the preprocessing operation in step S2 comprises:
c1: for the data related to the enterprise basic information, counting the degree of missing sample data for each characteristic variable;
if the degree of missing samples of a characteristic variable in the basic sample data exceeds a preset missing threshold, deleting the sample data of that characteristic variable from the basic sample data;
otherwise, if the preset missing threshold is not exceeded, filling the corresponding missing sample data with the value -1;
c2: for the historical behavior record quantity variables and the latest behavior record time variables, finding the missing values of each characteristic variable, regarding a missing value as meaning that no such behavior record exists, and setting the value of the missing sample data to 0.
5. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: in step S3, binning the numerical characteristic variables based on the information value (IV) optimal binning method includes:
d1: sorting the characteristic variable to be processed by variable value and dividing it into intervals to obtain feature binning data;
when the intervals are divided, the number of variable values in each interval does not exceed a preset binning threshold;
d2: performing a count merging operation on all the feature binning data;
the count merging operation includes the following steps:
d2.1: determining the number of samples in each feature binning data;
d2.2: finding all feature binning data whose number of samples is smaller than a preset sample threshold, and recording them as to-be-merged binning data;
d2.3: judging whether the to-be-merged binning data lies at either end of the sorted feature binning data;
if so, merging the to-be-merged binning data with its adjacent binning data;
otherwise, performing step d2.4;
d2.4: calculating the merged IV loss value of the to-be-merged binning data with the adjacent binning data on each of its two sides;
merging the to-be-merged binning data with the adjacent binning data on the side with the smaller merged IV loss value;
d3: performing a similar merging operation on the feature binning data;
the similar merging operation includes the following steps:
d3-1: acquiring all feature binning data and calculating the merged IV loss value for every pair of adjacent bins to obtain all merged IV loss values;
d3-2: finding the smallest merged IV loss value and comparing it with a preset merging loss threshold;
if the smallest merged IV loss value is less than the merging loss threshold, merging the two corresponding feature binning data;
d3-3: repeating d3-1 to d3-2 until all merged IV loss values are greater than the merging loss threshold;
d4: calculating an evidence weight (weight of evidence, WOE) value for each feature binning data, and replacing the values in each feature binning data with the corresponding evidence weight value.
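As a non-limiting illustration, steps d1 to d4 might be sketched as follows. The claims do not give formulas, so this sketch assumes the commonly used definitions WOE = ln(good share / bad share) and IV = (good share - bad share) × WOE per bin, with the merged IV loss taken as the IV of the two bins minus the IV of their union. The function names, the `[good, bad]` count representation, the label convention (1 = bad) and the `eps` smoothing of zero counts are all assumptions of the sketch.

```python
import math
from typing import List, Sequence

Bin = List[int]  # [good_count, bad_count]; label 1 = bad, 0 = good (assumed)


def initial_bins(values: Sequence[float], labels: Sequence[int], max_per_bin: int) -> List[Bin]:
    """d1: sort by value and cut into consecutive chunks of at most max_per_bin samples.
    Simplification: tied values may end up in different chunks."""
    pairs = sorted(zip(values, labels))
    bins = []
    for start in range(0, len(pairs), max_per_bin):
        chunk = pairs[start:start + max_per_bin]
        bad = sum(label for _, label in chunk)
        bins.append([len(chunk) - bad, bad])
    return bins


def _shares(b: Bin, tg: int, tb: int, eps: float = 0.5):
    # eps replaces zero counts so the logarithm stays finite (an assumption).
    return max(b[0], eps) / tg, max(b[1], eps) / tb


def woe(b: Bin, tg: int, tb: int) -> float:
    """d4: evidence weight of one bin; tg / tb are the total good / bad counts."""
    g, d = _shares(b, tg, tb)
    return math.log(g / d)


def iv(b: Bin, tg: int, tb: int) -> float:
    g, d = _shares(b, tg, tb)
    return (g - d) * math.log(g / d)


def _joined(bins: List[Bin], i: int) -> Bin:
    return [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]


def merge_loss(bins: List[Bin], i: int, tg: int, tb: int) -> float:
    """IV lost when bins[i] and bins[i + 1] are merged."""
    return iv(bins[i], tg, tb) + iv(bins[i + 1], tg, tb) - iv(_joined(bins, i), tg, tb)


def count_merge(bins: List[Bin], min_samples: int, tg: int, tb: int) -> List[Bin]:
    """d2: merge every bin holding fewer than min_samples samples."""
    bins = [list(b) for b in bins]
    while len(bins) > 1:
        small = [i for i, b in enumerate(bins) if sum(b) < min_samples]
        if not small:
            break
        i = small[0]
        if i == 0:                       # d2.3: left end, only one neighbor
            left = 0
        elif i == len(bins) - 1:         # d2.3: right end, only one neighbor
            left = i - 1
        else:                            # d2.4: pick the side with the smaller IV loss
            left = i - 1 if merge_loss(bins, i - 1, tg, tb) <= merge_loss(bins, i, tg, tb) else i
        bins[left:left + 2] = [_joined(bins, left)]
    return bins


def similar_merge(bins: List[Bin], loss_threshold: float, tg: int, tb: int) -> List[Bin]:
    """d3: keep merging the adjacent pair with the smallest IV loss while it is below the threshold."""
    bins = [list(b) for b in bins]
    while len(bins) > 1:
        losses = [merge_loss(bins, i, tg, tb) for i in range(len(bins) - 1)]
        i = min(range(len(losses)), key=losses.__getitem__)
        if losses[i] >= loss_threshold:
            break
        bins[i:i + 2] = [_joined(bins, i)]
    return bins
```

After `similar_merge`, the replacement of step d4 amounts to mapping every sample to `woe(b, tg, tb)` for the bin `b` that contains it.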
6. The method for assessing the credit risk of a public credit domain enterprise according to claim 5, wherein: in step S3, after the similar merging operation is performed on the feature binning data and before step d4 is performed, it is determined whether the feature binning data needs to maintain a linear characteristic;
if the linear characteristic needs to be maintained, linear adjustment is performed; otherwise, step d4 is performed directly;
the linear adjustment includes the following steps:
c1: calculating an evidence weight value for each feature binning data;
c2: ordering all the evidence weight values according to the order of the corresponding feature binning data;
c3: finding any evidence weight value that does not conform to the linear variation trend, and merging the corresponding feature binning data with adjacent binning data.
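As a non-limiting illustration, the linear adjustment of claim 6 might be sketched as follows. The sketch reads "linear variation trend" as a monotonic trend of the evidence weight values across the ordered bins, which is an interpretation and not something the claim states. The trend direction is inferred from the first and last bins, an offending bin is merged with its left neighbor, and zero counts are smoothed with `eps`; these choices and all names are assumptions.

```python
import math
from typing import List

Bin = List[int]  # [good_count, bad_count] for one bin, in sorted bin order (assumed)


def woe_list(bins: List[Bin], eps: float = 0.5) -> List[float]:
    """c1 + c2: evidence weight of every bin, kept in bin order.
    WOE = ln(good share / bad share) is an assumed convention."""
    tg = sum(b[0] for b in bins)
    tb = sum(b[1] for b in bins)
    return [math.log((max(b[0], eps) / tg) / (max(b[1], eps) / tb)) for b in bins]


def enforce_monotonic(bins: List[Bin]) -> List[Bin]:
    """c3: merge bins until the evidence weights rise or fall monotonically."""
    bins = [list(b) for b in bins]
    while len(bins) > 2:  # two bins are always monotonic
        w = woe_list(bins)
        increasing = w[-1] >= w[0]  # assumed way of choosing the trend direction
        violation = next(
            (i for i in range(1, len(w))
             if (w[i] < w[i - 1] if increasing else w[i] > w[i - 1])),
            None,
        )
        if violation is None:
            break
        i = violation
        # Merge the offending bin with its left neighbor (the claim only says "adjacent").
        bins[i - 1:i + 1] = [[bins[i - 1][0] + bins[i][0], bins[i - 1][1] + bins[i][1]]]
    return bins
```

Each pass removes one bin, so the loop ends after at most `len(bins) - 2` merges.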
7. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: in step a3 of the variable screening operation, removing feature variables from the training sample basic data set based on sample correlation includes the following steps:
a31: acquiring all feature variables and recording them as the variables whose correlation is to be calculated;
a32: calculating the Pearson correlation coefficient for every pair of the variables whose correlation is to be calculated;
a33: finding all variable pairs whose Pearson correlation coefficient is greater than a preset correlation threshold, and removing from each such pair the feature variable with the smaller information value;
a34: calculating the variance inflation factor of each remaining feature variable in the training sample basic data set;
a35: comparing each variance inflation factor with a preset variance inflation factor threshold;
if any feature variable has a variance inflation factor greater than the variance inflation factor threshold, removing that variable, recording the remaining variables as the variables whose correlation is to be calculated, and repeating steps a34 to a35;
if the variance inflation factors of all feature variables are less than or equal to the variance inflation factor threshold, performing step a4.
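As a non-limiting illustration, steps a31 to a35 might be sketched as follows. The variance inflation factor is computed here as the diagonal of the inverse of the predictors' correlation matrix, which is a standard identity but is not stated in the claim. The sketch also assumes that the absolute value of the Pearson coefficient is compared with the threshold and that only the variable with the largest variance inflation factor is removed in each round. All names and default thresholds are hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; assumes neither column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(m: List[List[float]]) -> List[List[float]]:
    """Gauss-Jordan inverse with partial pivoting; fails on an exactly singular matrix."""
    n = len(m)
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def vif(data: Dict[str, List[float]], names: List[str]) -> Dict[str, float]:
    """a34: VIF of each variable = diagonal of the inverse correlation matrix."""
    corr = [[pearson(data[a], data[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


def screen(
    data: Dict[str, List[float]],
    iv: Dict[str, float],
    corr_threshold: float = 0.8,   # hypothetical preset correlation threshold
    vif_threshold: float = 10.0,   # hypothetical preset VIF threshold
) -> List[str]:
    names = list(data)             # a31
    removed = set()
    for i, a in enumerate(names):  # a32 + a33
        for b in names[i + 1:]:
            if a in removed or b in removed:
                continue
            # Assumption: the absolute value is compared, so strong negative
            # correlation is treated like strong positive correlation.
            if abs(pearson(data[a], data[b])) > corr_threshold:
                removed.add(a if iv[a] < iv[b] else b)
    keep = [n for n in names if n not in removed]
    while len(keep) > 1:           # a34 + a35, repeated until all VIFs pass
        factors = vif(data, keep)
        worst = max(keep, key=factors.get)
        if factors[worst] <= vif_threshold:
            break
        keep.remove(worst)         # assumption: drop one variable per round
    return keep
```

Removing one variable per round is the more conservative reading, because dropping a single collinear variable usually lowers the factors of the others.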
8. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: in step S5, logistic regression model fitting is performed by backward stepwise regression.
9. The method for assessing the credit risk of a public credit domain enterprise according to claim 1, wherein: in step S3, binning is performed on the numerical characteristic variables in the training sample basic data set based on an information value optimal binning method.
CN202111260166.3A 2021-10-28 2021-10-28 Enterprise credit risk assessment method in public credit field Pending CN114022269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111260166.3A CN114022269A (en) 2021-10-28 2021-10-28 Enterprise credit risk assessment method in public credit field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111260166.3A CN114022269A (en) 2021-10-28 2021-10-28 Enterprise credit risk assessment method in public credit field

Publications (1)

Publication Number Publication Date
CN114022269A true CN114022269A (en) 2022-02-08

Family

ID=80058065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111260166.3A Pending CN114022269A (en) 2021-10-28 2021-10-28 Enterprise credit risk assessment method in public credit field

Country Status (1)

Country Link
CN (1) CN114022269A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841279A (en) * 2023-02-20 2023-03-24 塔比星信息技术(深圳)有限公司 Supply chain data evaluation method, device, equipment and storage medium
CN116012143A (en) * 2023-01-03 2023-04-25 睿智合创(北京)科技有限公司 Variable selection and parameter estimation method under case-division regression
CN117808578A (en) * 2024-03-01 2024-04-02 杭银消费金融股份有限公司 Intelligent pedestrian credit information data analysis method and system
CN117874654A (en) * 2024-03-13 2024-04-12 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012143A (en) * 2023-01-03 2023-04-25 睿智合创(北京)科技有限公司 Variable selection and parameter estimation method under case-division regression
CN116012143B (en) * 2023-01-03 2023-10-13 睿智合创(北京)科技有限公司 Variable selection and parameter estimation method under case-division regression
CN115841279A (en) * 2023-02-20 2023-03-24 塔比星信息技术(深圳)有限公司 Supply chain data evaluation method, device, equipment and storage medium
CN117808578A (en) * 2024-03-01 2024-04-02 杭银消费金融股份有限公司 Intelligent pedestrian credit information data analysis method and system
CN117874654A (en) * 2024-03-13 2024-04-12 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm
CN117874654B (en) * 2024-03-13 2024-05-24 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm

Similar Documents

Publication Publication Date Title
CN114022269A (en) Enterprise credit risk assessment method in public credit field
CN111754345B (en) Bitcoin address classification method based on improved random forest
CN107633265A (en) For optimizing the data processing method and device of credit evaluation model
CN116108758B (en) Landslide susceptibility evaluation method
CN106384282A (en) Method and device for building decision-making model
CN110909963A (en) Credit scoring card model training method and taxpayer abnormal risk assessment method
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
Utari et al. Implementation of data mining for drop-out prediction using random forest method
CN110689437A (en) Communication construction project financial risk prediction method based on random forest
CN113256409A (en) Bank retail customer attrition prediction method based on machine learning
CN113159461A (en) Small and medium-sized micro-enterprise credit evaluation method based on sample transfer learning
CN112330441A (en) Method for evaluating business value credit loan of medium and small enterprises
Yıldırım et al. Robust Mahalanobis distance based TOPSIS to evaluate the economic development of provinces
CN114926299A (en) Prediction method for predicting vehicle accident risk based on big data analysis
CN109992592B (en) College poverty and poverty identification method based on flow data of campus consumption card
CN115689713A (en) Abnormal risk data processing method and device, computer equipment and storage medium
CN112506930B (en) Data insight system based on machine learning technology
CN114626940A (en) Data analysis method and device and electronic equipment
CN114219606A (en) Power data-based power consumption enterprise credit evaluation method and system
CN113240213A (en) Method, device and equipment for selecting people based on neural network and tree model
WO1992017853A2 (en) Direct data base analysis, forecasting and diagnosis method
CN112396507A (en) Shadow division-based integrated SVM personal credit evaluation method
Munandar et al. Data Mining for Development Inequality.
CN113537734B (en) Energy data application catalog extraction method based on maximum correlation minimum redundancy
CN117934150A (en) Personal credit assessment method and device based on improved class unbalance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220208