CN114663102A - Method, equipment and storage medium for predicting debt subject default based on semi-supervised model - Google Patents

Publication number
CN114663102A
Authority
CN
China
Prior art keywords
model
semi
default
debt
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011395004.6A
Other languages
Chinese (zh)
Inventor
王专
郝玉爽
田鑫涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Life Insurance Asset Management Co ltd
Original Assignee
China Life Insurance Asset Management Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Life Insurance Asset Management Co ltd
Priority to CN202011395004.6A
Publication of CN114663102A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention relates to the field of computer technology and discloses a method, equipment and a storage medium for predicting debt subject default based on a semi-supervised model. The method comprises the following steps: S1: acquiring subject data of a debt subject, the subject data comprising news and public opinion information, industrial and commercial information, market evaluation information and external information, and constructing from the subject data an index system for the credit default risk of the debt subject; S2: constructing bottom-layer features through statistical analysis, business judgment and derivation to generate bottom-layer factors; S3: establishing a semi-supervised model that combines an unlabeled-sample weighting method with a scorecard model; S4: judging and predicting the default risk of the debt subject based on the semi-supervised model. The modeling method combines an unlabeled-sample weighting method with a scorecard model: an XGB classifier trained on positive samples and unlabeled samples is used, by virtue of its risk-ranking ability, to expand the set of positive samples; the samples with the highest risk probability are taken as new positive samples, the scorecard model is trained on them, and the semi-supervised model is thereby constructed.

Description

Method, equipment and storage medium for predicting debt subject default based on semi-supervised model
Technical Field
The invention relates to the technical field of computers, and particularly provides a method, equipment and a storage medium for predicting default of a debt subject based on a semi-supervised model.
Background
Traditional methods for predicting the default of debt-issuing enterprises mainly rate enterprises using financial data and credit researchers' scores to obtain their default probability. News and public opinion data is unstructured; it cannot be used directly by a computer model and is difficult to use as model input. How to automatically build a model that predicts debt subject default from news and public opinion is therefore a problem the prior art needs to solve.
In the prior art, credit researchers score enterprises along different dimensions using financial data, rate the debt-issuing enterprises, and output a default probability. Public opinion data is processed manually, a large number of business personnel take part in rating and subjective evaluation, and a large number of early-warning rules are formulated. As a result, the prior art suffers from insufficient mining and analysis, difficulty in evaluating credit risk levels comprehensively, low efficiency, and dependence on subjective judgment.
In addition, the default of a debt-issuing enterprise is a low-probability event, so positive samples are very few during data modeling; how to increase the proportion of positive samples using the existing samples is the key to solving model distortion.
Disclosure of Invention
The invention provides a method, equipment and a storage medium for predicting debt subject default based on a semi-supervised model, aiming to solve the problems in the prior art that public opinion data is processed manually, that a large number of business personnel are involved, and that early-warning rules are formulated through subjective evaluation.
The technical scheme of the invention is as follows:
A method for predicting debt subject default based on a semi-supervised model, comprising:
S1: acquiring subject data of a debt subject, the subject data comprising news and public opinion information, industrial and commercial information, market evaluation information and external information, and constructing from the subject data an index system for the credit default risk of the debt subject;
S2: constructing bottom-layer features through statistical analysis, business judgment and derivation to generate bottom-layer factors;
S3: establishing a semi-supervised model that combines an unlabeled-sample weighting method with a scorecard model;
S4: judging and predicting the default risk of the debt subject based on the semi-supervised model.
Further, the index system of S1 rates the debt subject on basic qualification information, financial management information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk-association information.
Further, S2 mines potential information in the debt subject data using statistical indicators such as the logarithm, mean, mode and extreme values.
Further, establishing the semi-supervised model of S3 includes the following steps:
S21: tuning hyperparameters by grid search with AUC as the objective, and training an XGBoost model to obtain a classifier that identifies whether a sample is labeled;
S22: performing probability calibration with a calibrated classifier, calibrating the XGBoost output to an approximately standard probability;
S23: using the calibrated samples and the original negative labels as the modeling target for the subsequent scorecard training;
S24: computing sample weights using balanced class weights;
S25: converting all features into ordinal categorical variables using chi-square binning;
S26: analyzing the association between each feature and the modeling target and the collinearity among features, and selecting high-quality features that can enter the model;
S27: manually optimizing feature interpretability;
S28: training the scorecard model after recoding the features as weight of evidence;
S29: manually checking the scoring rules and correcting the few rules that are inconsistent with the response-rate distribution.
Further, the scorecard model of S2 is based on logistic regression: the distribution of positive samples within each feature is converted into a weight-of-evidence encoding, and scores are generated by combining the weight of evidence with the regression coefficients β. The resulting data-driven scorecard reflects the information mined from the data and the operating logic of the model, and thus provides the scoring process for the debt-issuing subject and the score contribution of each individual factor.
Further, the discriminative power of the semi-supervised model is tested with the KS statistic, with KS > 0.4.
Further, the AUC in S21 satisfies AUC > 0.7.
The invention also provides a device for predicting debt-issuing subject default based on the semi-supervised model, comprising:
a memory, a processor, a communication bus, and a program for predicting debt subject default based on the semi-supervised model stored in the memory;
the communication bus is used to realize the communication connection between the processor and the memory;
the processor is used to execute the program for predicting debt subject default based on the semi-supervised model, so as to implement the steps of the method for predicting debt subject default based on the semi-supervised model.
The invention also provides a computer-readable storage medium storing executable instructions. The storage medium stores a program for predicting debt subject default based on a semi-supervised model which, when executed by a processor, implements the steps of any one of the above methods for predicting debt subject default based on semi-supervised machine learning.
The beneficial effects of the invention at least comprise:
(1) the modeling method combines an unlabeled-sample weighting method with a scorecard model: an XGB classifier trained on positive and unlabeled samples is used, by virtue of its risk-ranking ability, to expand the set of positive samples, and the samples with the highest risk probability are taken as new positive samples to train the scorecard model; because the scorecard model is highly interpretable and its training process is transparent (white-box), it is used as the evaluation model that outputs the final result;
(2) by using positive and unlabeled learning from semi-supervised learning, the set of positive samples is enlarged and the originally severely biased modeling sample is corrected; on the one hand, this takes into account the possibility that samples which should be labeled exist among the unlabeled samples, and on the other hand, the model can better learn the characteristics of bad samples, reducing the risk that sample imbalance leads the model to fit more noise;
(3) the model is generated in a data-driven manner based on machine learning, which reduces the information loss caused by subjective intervention, makes risk early warning more objective, and more effectively captures risk changes ahead of a subject's default.
Drawings
Fig. 1 is a flow chart of predicting default of a debt subject based on a semi-supervised model according to the present invention.
FIG. 2 is a flow chart of the semi-supervised model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With reference to fig. 1 and fig. 2, a method for predicting a debt subject default based on a semi-supervised model includes:
S1: acquiring subject data of a debt subject, the subject data comprising news and public opinion information, industrial and commercial information, market evaluation information and external information, and constructing from the subject data an index system for the credit default risk of the debt subject;
S2: constructing bottom-layer features through statistical analysis, business judgment and derivation to generate bottom-layer factors;
S3: establishing a semi-supervised model that combines an unlabeled-sample weighting method with a scorecard model;
S4: judging and predicting the default risk of the debt subject based on the semi-supervised model.
The main modeling objective of the invention is to use a quantitative model to predict subjects with a high probability of default, so that default risk is detected in advance. The analysis data comes mainly from enterprise industrial and commercial data and news and public opinion data. The objects of analysis are subjects for which public opinion data exists. Working mainly from the two angles of news and public opinion and basic industrial and commercial information, the method mines the latent patterns and relationships of these subjects along three dimensions: basic qualification, industrial and commercial changes, and public opinion changes. It enriches the bottom-level indicators through feature engineering and explores the risk factors associated with default risk, then builds a scorecard model based on semi-supervised learning to evaluate the likelihood of default for the predicted subject.
To this end, first, the target variable is determined: the target variable is defined as bond default subjects that have early-warning data and can be matched with industrial and commercial information; such subjects constitute a severely biased sample;
second, bottom-layer factors are generated through feature engineering. Factors generated from industrial and commercial information cover basic qualification evaluation, financial information evaluation, penalty information evaluation and equity pledge evaluation; factors generated from early-warning data cover the time, exposure, type, rating and sentiment labels of bond early warnings;
to mine as much latent information from the data as possible, the feature engineering applies statistical indicators such as the logarithm, mean, mode and extreme values; judging from the indicators that finally entered the model, these statistical indicators have evident predictive power;
finally, the model is evaluated. A semi-supervised scorecard model is adopted in this modeling, which addresses the problem of modeling a severely biased sample, and the model discriminates well.
The analysis process comprises the following steps:
1. Definition of target variables
The objective of the modeling is to predict the default risk of bond-issuing subjects, so whether a subject has defaulted is taken as the target variable.
Default data was accumulated from April 2014 to December 2019, comprising 437 records involving 180 subjects (one subject may correspond to multiple default records). After removing 90 subjects that had no early-warning data before default and 4 subjects that could not be matched with industrial and commercial information, the remaining 86 default subjects serve as the positive samples for modeling, accounting for 0.57% of all modeling samples (15134 subjects).
2. Data preparation
2.1 construction of the index System
To meet the modeling requirement of using public opinion to warn of debt-issuing subjects' default events in advance, an index system for analyzing the credit default risk of debt-issuing subjects is established from four angles: news and public opinion, industrial and commercial information, market evaluation, and external information. The index system rates debt-issuing subjects comprehensively along 8 sub-dimensions, such as basic qualification information, financial management information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk-association information. In total, 133 factors were implemented in this modeling; the factors are shown in Table 1:
Figure BDA0002814499380000061
Figure BDA0002814499380000071
Table 1
2.2 feature construction
Data tables such as industry and commerce types, news public opinion types, rating types and the like in the relational database are integrated, and processing and implementation of bottom layer features are achieved from the aspects of statistical analysis, business judgment and derivative construction through feature engineering.
2.3 data verification
Data quality is examined to ensure that the features are computed correctly. The verification is divided into an accuracy test and a logic test.
2.3.1 accuracy test
The missing values, duplicates, field types and outliers of each feature are tallied to guide subsequent outlier handling. The specific checks are as follows:
Missing values: the number of missing values, the number of rows examined, and the missing proportion of each field;
Duplicates: whether a field takes only a single value (all values repeated), and the proportion of unique values in each field;
Field type: whether the field type is consistent with the design;
Outliers: the 3σ principle, i.e. values outside the range of the mean plus or minus three times the standard deviation.
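The accuracy checks listed above can be illustrated with a minimal Python sketch. The function names, the use of None to mark a missing value, and the choice of the population standard deviation are assumptions made for illustration; the original text does not specify an implementation.

```python
import statistics


def profile_field(values):
    """Summarize one field; None marks a missing value (an assumption)."""
    rows = len(values)
    present = [v for v in values if v is not None]
    missing = rows - len(present)
    distinct = len(set(present))
    return {
        "rows": rows,
        "missing": missing,
        "missing_ratio": missing / rows if rows else 0.0,
        # Share of distinct values among the non-missing values.
        "unique_ratio": distinct / len(present) if present else 0.0,
        # True when the field carries a single repeated value.
        "single_valued": distinct <= 1,
    }


def three_sigma_outliers(values):
    """Return values farther than three standard deviations from the mean."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return []
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)  # population standard deviation (assumption)
    return [v for v in present if abs(v - mean) > 3 * sd]
```

With very few rows the 3σ rule cannot flag anything, since no value can lie more than sqrt(n - 1) population standard deviations from the mean, so the check is only informative on reasonably large fields.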
2.3.2 logical test
Outliers defined by business rules;
Computation logic check: a certain proportion of the data is drawn by random sampling, the indicators are computed with different tools, and the results are compared.
2.4 data cleansing
The missing values and the abnormal values can influence the prediction capability of the factors on the final model result, and through the statistical analysis of the modeling samples, the data noise in the modeling samples is identified, the data quality is improved, and the model effect is improved.
2.4.1 missing value handling
Missing values are handled according to the feature type, which is either numerical or categorical. Depending on the meaning of the feature, missing values can be treated as a separate class or filled with the mean, median or mode, and fields with severe missingness (more than 80% missing) are removed. For removed features, their predictive power is analyzed to decide whether to convert them into rules that enter the model. The specific treatment is as follows:
Numerical variables: the imputation strategy (mean, median, mode, etc.) is chosen from the meaning of the feature, following the principle of minimizing data noise. For example, in this modeling, missing values are set to 99 for features such as time since the latest industrial and commercial change, time since the latest annual report disclosure, time since the latest financial report disclosure, time since establishment, and time since registration; to the mean for features such as paid-in capital, registered capital and number of employees; and to 0 for features such as number of branches and number of legal representative changes.
Categorical variables: a missing value may itself carry business meaning, so to retain as much information as possible, the missing values of a categorical variable are usually grouped into a separate category and assigned a value. For example, in this modeling, missing values are treated as a separate category and set to -1 for features such as whether the subject has been profitable for two or more consecutive years, whether it has made losses for two or more consecutive years, whether it was profitable in the latest financial year, and whether it made a loss in the latest financial year; missing values are set to 0 (meaning the corresponding event did not occur) for features such as whether a shareholder is a judgment debtor and whether the subject is a judgment debtor.
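The missing-value treatment described above can be sketched as follows. This is an illustration only: the function names, the dictionary-of-columns layout, and the use of None for a missing value are assumptions, while the 80% removal threshold and the fill values 99, 0 and -1 come from the text.

```python
import statistics


def drop_sparse_fields(table, max_missing=0.8):
    """Keep only fields whose missing proportion does not exceed max_missing.

    table maps a field name to a list of values, with None for missing.
    """
    return {
        name: column
        for name, column in table.items()
        if column.count(None) / len(column) <= max_missing
    }


def impute(values, strategy="mean", constant=None):
    """Fill missing values with the mean, median, mode, or a fixed constant."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    elif strategy == "constant":
        fill = constant  # e.g. 99 for durations, 0 for counts, -1 for flags
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

A duration feature would then be filled with impute(column, "constant", constant=99), a capital feature with impute(column, "mean"), and a yes/no flag whose absence is informative with impute(column, "constant", constant=-1).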
2.4.2 outlier handling
No factor in this modeling has outliers that can be defined by business rules. For features whose outliers cannot be defined by business rules, outliers can be filtered using the Pauta criterion (3σ criterion). Although outlier handling makes the model input more stable and helps the model capture the overall characteristics of the data, outliers may reflect genuine characteristics of the data, so the Pauta criterion is not applied to outliers in this modeling.
2.5 training, test set partitioning
To verify the accuracy and stability of the model training results and to ensure good generalization, all samples are divided into a training set and a test set at a ratio of 7:3.
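A 7:3 split can be sketched as below. The text does not say whether the split is stratified; because positive samples are so rare, this sketch assumes a stratified split so that both sets receive positives in roughly the same proportion. The function name and the seed parameter are illustrative.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), split separately within each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```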
3. Model construction
3.1 exploratory data analysis
Exploratory analysis of the data shows that the debt default subjects defined as the target variable, i.e. the risk-exposed samples with Y = 1, total 86 and account for only 0.57% of all debt subject samples; this is a typical imbalanced-sample modeling problem.
3.1.1 unbalanced sample handling
A semi-supervised method based on PU learning (positive and unlabeled learning) is used: by learning from the existing labeled samples, the data points among the unlabeled samples that are most similar to the labeled samples are found and taken as newly added labeled samples;
since no more bad samples can be obtained for this modeling, a PU-learning-based approach is chosen to address sample imbalance; the principle of imbalanced-sample handling and the model application are explained in detail below.
3.2 semi-supervised model
Positive and unlabeled learning, a branch of semi-supervised learning, has broad application prospects in imbalanced-sample handling, potential target identification and similar tasks, including scenarios with few adverse labels such as complaints and credit risk exposure. From the perspective of positive and unlabeled learning, the known bond default subjects are only part of the full set of subjects with high default risk, and enterprises with high default risk still exist among the remaining debt subjects. In other words, the remaining samples are not purely risk-free samples but unlabeled samples, so the default early-warning model is built using positive samples (subjects for which a default event has occurred) and unlabeled samples (subjects whose default risk has not been exposed).
The core idea of the modeling method is that the probability that a sample is labeled is proportional to the probability that it is a positive sample. That is, in terms of ranking ability, training with positive and unlabeled samples is equivalent to training with positive and negative samples. A classifier can therefore be trained on positive and unlabeled samples, the probability of being labeled that it outputs for each sample is converted into a sample weight, and a classifier is then trained again with these sample weights to obtain the probability that a sample is positive (the modeling target).
This modeling combines the above method with a scorecard model: the risk-ranking ability of the classifier trained on positive and unlabeled samples is used to expand the set of positive samples, and the 5% of samples with the highest risk probability are taken as new positive samples to train the scorecard model. This addresses problems caused by too few high-risk labels, such as difficulty distinguishing random features from genuinely effective ones and weak model stability and extensibility.
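The expansion of the positive sample described above can be sketched as follows, assuming the classifier's calibrated probabilities are already available as a list. The function name is illustrative, and keeping originally labeled positives that fall outside the top 5% as positives is an assumption; the text only states that the top 5% become the new positive samples.

```python
def expand_positive_labels(probabilities, labels, top_fraction=0.05):
    """Relabel the highest-ranked samples as positive.

    probabilities: risk probability per sample from the labeled/unlabeled classifier.
    labels: 1 for known default subjects, 0 for unlabeled subjects.
    """
    n = len(probabilities)
    k = max(1, round(n * top_fraction))
    # Rank sample indices from highest to lowest risk probability.
    ranked = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    top = set(ranked[:k])
    # Samples outside the top fraction keep their original label (assumption).
    return [1 if i in top else labels[i] for i in range(n)]
```

The returned list would serve as the enlarged modeling target on which the scorecard model is trained.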
The scorecard model in this modeling rests on the following principle and methodology: a logistic-regression-based scorecard is used, the distribution of positive samples within each feature is converted into a weight-of-evidence encoding, and scores are generated by combining the weight of evidence with the regression coefficients β. The resulting data-driven scorecard directly reflects the information mined from the data and the operating logic of the model, so the scoring process for a debt subject and the score contribution of each factor can be stated clearly. The formula for converting logistic regression parameters into scorecard scores is as follows:
$P_0$ is the base score of the scorecard, and PDO is the number of points that doubles the specified ratio. $P_0$ and PDO are two hyperparameters of the scorecard model that control the central tendency and the dispersion of the scores; in this modeling they are set to 60 and -10;
β is the coefficient obtained by training the logistic regression, intercept is the intercept obtained by training the logistic regression, and n is the number of features in the model;
Calculation of the constant B:
[Formula provided as an image in the original: Figure BDA0002814499380000101]
Calculation of the constant A: $A = P_0 + B \times \ln(P_0)$;
Fixed score of the scorecard: $\mathrm{FixedScore} = A - B \times \mathrm{intercept}$;
Score for each bin:
[Formula provided as an image in the original: Figure BDA0002814499380000111]
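The score scaling can be sketched in a few lines of Python. Because the formulas for B and for the per-bin score survive only as images, the sketch assumes the common scorecard scaling B = PDO / ln 2 and a per-bin score of -B × β × WOE, chosen so that the total score equals A - B × (log-odds) and stays consistent with FixedScore = A - B × intercept. The text writes the argument of the logarithm in A as P0; the sketch takes it as a separate base_odds parameter, which is an assumption, as are all function and parameter names.

```python
import math


def scorecard_constants(p0, pdo, base_odds, intercept):
    """Return (A, B, FixedScore) under the assumed scaling."""
    b = pdo / math.log(2)             # assumed: PDO points per doubling of the odds
    a = p0 + b * math.log(base_odds)  # base_odds is a hypothetical parameter
    fixed_score = a - b * intercept   # as stated in the text
    return a, b, fixed_score


def bin_score(b, beta, woe):
    """Assumed score contribution of one bin of one feature."""
    return -b * beta * woe


def total_score(p0, pdo, base_odds, intercept, betas, woes):
    """Fixed score plus the contribution of the bin each feature falls into."""
    _, b, fixed_score = scorecard_constants(p0, pdo, base_odds, intercept)
    return fixed_score + sum(bin_score(b, beta, w) for beta, w in zip(betas, woes))
```

Under these assumptions, a subject whose fitted odds equal base_odds scores exactly P0, and with PDO = -10 each doubling of the odds raises the score by 10 points, so a higher score means higher risk.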
The semi-supervised model is built in the following steps:
(1) Training a classifier for whether a sample is labeled: an XGBoost model is trained to classify whether a sample is labeled, with hyperparameters tuned by grid search using AUC as the objective. The optimal hyperparameters are shown in Table 2:
Hyperparameter      Value
Colsample_bytree    1
Learning_rate       0.01
Max_depth           10
N_estimators        200
Table 2
XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which solves many data science problems quickly and accurately.
AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes, so its value cannot exceed 1. Because the ROC curve generally lies above the line y = x, AUC usually ranges between 0.5 and 1. The closer AUC is to 1.0, the better the detection method discriminates; at 0.5 it discriminates no better than chance and has little practical value.
(2) Probability calibration: probability calibration is performed with CalibratedClassifierCV so that the XGBoost output is calibrated to an approximately standard probability. The hyperparameters used are Method_Calibrated = Isotonic and CV = 3, as shown in Table 3:
Hyperparameter       Value
Method_Calibrated    Isotonic
CV                   3
Table 3
(3) Constructing the enlarged modeling target: the samples ranked in the top 5% by calibrated probability, together with the original negative labels, are used as the modeling target for the subsequent scorecard training;
(4) Imbalanced-sample weighting: so that the model does not neglect misclassified positive samples merely because positives make up only 5% of the data, weights are computed with sklearn.utils.class_weight;
(5) Feature discretization: all features are converted into ordinal categorical variables using chi-square binning, with the hyperparameters shown in Table 4:
Hyperparameter       Value
Max_intervals        10
Min_intervals        5
Initial_intervals    100
Table 4
(6) Predictive power and collinearity analysis: the association between each feature and the modeling target and the collinearity among features are analyzed, and high-quality features that can enter the model are selected;
(7) Manual optimization of feature interpretability: the high-quality candidate features are checked one by one to determine whether the frequency distribution and response-rate distribution of each value can be explained in business terms or may stem from random fluctuation in the data; feature groupings are adjusted accordingly, and the validation set is checked for the same trend as the training set. Features that cannot be explained, are likely the result of random fluctuation, or show inconsistent trends between the training and validation sets are not allowed into the model;
(8) Weight-of-evidence conversion and scorecard training: the scorecard model is trained after the features are re-encoded with weight of evidence (WOE); the hyper-parameters are shown in Table 5:
Max_intervals | 10
Min_intervals | 5
Initial_intervals | 100
Table 5
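The weight-of-evidence conversion and the way per-feature scores combine can be sketched as below. The sign convention (log of positive share over negative share), the additive smoothing, and the absence of any point scaling are assumptions made for illustration; the text only states that scores combine the WOE values with the regression coefficients. All names are hypothetical.

```python
import math


def woe_table(bins, smoothing=0.5):
    """Compute one WOE value per bin from (n_negative, n_positive) counts."""
    total_neg = sum(b[0] for b in bins)
    total_pos = sum(b[1] for b in bins)
    k = len(bins)
    woes = []
    for neg, pos in bins:
        # Smoothing keeps empty cells from producing log(0).
        pos_share = (pos + smoothing) / (total_pos + smoothing * k)
        neg_share = (neg + smoothing) / (total_neg + smoothing * k)
        woes.append(math.log(pos_share / neg_share))
    return woes


def score_subject(bin_indices, woes_by_feature, betas, intercept=0.0):
    """Return the total log-odds score and each feature's contribution."""
    contributions = {
        name: betas[name] * woes_by_feature[name][bin_indices[name]]
        for name in betas
    }
    return intercept + sum(contributions.values()), contributions
```

Because the total is a plain sum of per-feature terms, the share of each single factor in a subject's score can be read directly from `contributions`.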
(9) Scorecard adjustment: manually checking the scoring rules and correcting the few rules that do not agree with the response rate distribution;
The English terms used in the modeling steps of the semi-supervised model refer to parameter settings in the code.
The indexes that finally entered the semi-supervised default model are listed in Table 6:
Figure BDA0002814499380000131
Figure BDA0002814499380000141
Table 6
4. Model evaluation
4.1 evaluation method
Default early warning for debt subjects is not a conventional classification problem with positive and negative samples, but a semi-supervised problem with labeled positive samples and unlabeled samples, and the sample proportions can hardly meet the training requirements of a traditional classification model. In the modeling sample, the number of defaulting subjects is very small, and the remaining subjects are in fact a mixture of high-risk non-defaulting subjects and low-risk subjects. The accuracy metric counts a subject that the model predicts as high-risk but that has not defaulted as a prediction error, although such a subject is genuinely close to defaulting subjects in its risk characteristics; its risk has simply not been exposed yet, or factors outside the model have so far kept it from defaulting. The traditional accuracy metric is therefore no longer suitable for default early warning of debt subjects. In this modeling, the overall risk-ranking ability of the model is evaluated by AUC, and its ability to separate positive and negative samples is evaluated by the KS statistic.
AUC (area under the ROC curve): tests the ranking ability of the model; an AUC above 0.7 is recommended. The higher the AUC, the better the classification performance of the model and the higher the probability that a default sample is ranked ahead of a non-default sample.
0.5 < AUC < 1: better than random guessing, with predictive value; AUC = 0.5: the same as random guessing, with no predictive value;
K-S statistic: tests the discrimination ability of the model; a KS above 0.40 is recommended.
KS > 0.4: good discrimination ability; 0.2 < KS ≤ 0.4: moderate discrimination ability; KS ≤ 0.2: poor discrimination ability.
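Both metrics can be computed directly from labels and scores. The sketch below is a plain-Python illustration using the pairwise definition of AUC and the maximum gap between the cumulative positive and negative rates for KS; the function names are hypothetical, and the quadratic AUC loop is meant for small examples only.

```python
def auc_score(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """Largest gap between cumulative positive and negative capture rates."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(zip(scores, labels), reverse=True)
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(order):
        j = i
        # Consume all samples that share the same score before comparing.
        while j < len(order) and order[j][0] == order[i][0]:
            cum_pos += order[j][1]
            cum_neg += 1 - order[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best
```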
4.2 semi-supervised model assessment
As shown in Tables 7 and 8, the final AUC on the test set is 0.9617, close to 1, and the KS is 0.7779, greater than 0.4, which indicates that the semi-supervised model has good risk-ranking and discrimination ability for predicting the default probability of debt subjects. In addition, after the early-warning scores are sorted in descending order, the recall reaches 88.37% at the top 2% threshold, which also shows that the model predicts default risk well.
Evaluation index | Full sample
AUC | 0.9611
KS | 0.8191
Table 7
Abnormal level (descending order) | Cumulative recall rate
Top 1% | 74.42%
Top 2% | 81.40%
Top 5% | 87.21%
Top 10% | 90.70%
All | 100.00%
Table 8
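The cumulative recall figures in a table of this kind can be computed as sketched below: sort subjects by early-warning score in descending order, take the top fraction, and divide the defaults captured by the total number of defaults. The function name `cumulative_recall` and the rounding of the cutoff are assumptions for illustration.

```python
def cumulative_recall(labels, scores, top_fraction):
    """Share of all positives found among the top-scored fraction."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, round(len(order) * top_fraction))
    captured = sum(labels[i] for i in order[:n_top])
    return captured / sum(labels)
```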
The invention provides a method for predicting debt subject default based on a semi-supervised model, which comprises the following steps:
taking whether the debt subject defaults as a target variable;
acquiring subject data of the debt subject, including news and public opinion information, industrial and commercial information, market evaluation information, and external information, and constructing from this data an index system for the credit default risk of the debt subject, the index system rating the debt subject comprehensively across 8 subdivided dimensions such as basic qualification information, financial management information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk association information;
integrating, through feature engineering, the data tables of the industrial and commercial, news and public opinion, and rating categories in a relational database; processing the underlying features from three aspects (statistical analysis, business judgment, and derived construction) to produce underlying factors; checking data quality through an accuracy check and a logic check to ensure that features are calculated correctly; screening outliers and eliminating missing values to improve data quality and predictive power; and, in order to mine latent information in the debt subject data, applying the statistical measures of logarithm, mean, mode, and extreme value in the feature engineering;
establishing a semi-supervised model that combines an unlabeled-sample weighting method with a scorecard model: the risk-ranking ability of a classifier trained on positive and unlabeled samples is used to enlarge the set of positive samples, the samples with the highest risk probability are taken as new positive samples, and the scorecard model is then trained, which addresses problems caused by too few high-risk labels, such as difficulty in telling random features from truly effective ones and weak model stability and extensibility. The scorecard model is based on logistic regression: the distribution of positive samples within each feature is converted into weight-of-evidence (WOE) codes, and scores are generated by combining the WOE values with the regression coefficients (beta). The resulting data-driven scorecard model reflects both the information mined from the data and the operating logic of the model, so that the scoring process for a debt subject and the share of each single factor in the score are stated clearly;
and judging and predicting default risks of the debt subject based on the semi-supervised model.
By applying the positive-and-unlabeled learning approach from semi-supervised learning, the set of positive samples is enlarged and the originally severely biased modeling sample is corrected. On the one hand, this takes into account the possibility that the unlabeled samples contain samples that should be labeled; on the other hand, it lets the model learn the characteristics of bad samples better and reduces the risk that sample imbalance causes the model to fit more noise.
In the method for predicting default of a debt subject based on a semi-supervised model provided by the invention, establishing the semi-supervised model comprises the following steps:
tuning parameters through grid search with AUC as the objective, and training an XGBoost model to obtain a classifier that identifies whether a sample is labeled;
performing probability calibration with a calibration classifier, so that the output of XGBoost is calibrated to an approximately standard probability;
using the calibrated samples and the original negative labels as the modeling target for the subsequent scorecard training;
calculating weights using balanced sample weights;
converting all features into ordinal categorical variables using chi-square binning;
analyzing the degree of association between each feature and the modeling target, as well as the collinearity among features, and screening the high-quality features that may enter the model;
manually optimizing feature interpretability: checking the high-quality features that may enter the model one by one, analyzing whether the frequency distribution and response rate distribution of each value can be explained in business terms and whether they may stem from random fluctuation in the data, adjusting the grouping of each feature accordingly, and checking whether the validation set shows the same trend as the training set; features that cannot be explained, that are likely to stem from random fluctuation in the data, or whose trends differ between the training set and the validation set may not enter the model;
training the scorecard model after the features are re-encoded with weight of evidence;
and manually checking the scoring rules and correcting the few rules that do not agree with the response rate distribution.
In the method for predicting default of a debt subject based on the semi-supervised model, the discrimination ability of the semi-supervised model is tested with the KS statistic; a KS greater than 0.4 indicates good discrimination ability.
In the method for predicting default of a debt subject based on the semi-supervised model, the ranking ability of the model is tested by AUC, with the AUC required to be greater than 0.7; the higher the AUC, the better the classification performance of the model and the higher the probability that a default sample is ranked ahead of a non-default sample.
The invention also provides a device for predicting default of a debt subject based on the semi-supervised model, which comprises:
a memory, a processor, a communication bus, and a program, stored on the memory, for predicting default of a debt subject based on the semi-supervised model;
the communication bus is used for realizing the communication connection between the processor and the memory;
the processor is used for executing the program for predicting default of a debt subject based on the semi-supervised model, so as to implement the steps of the method for predicting default of a debt subject based on the semi-supervised model.
The invention also provides a computer-readable storage medium storing executable instructions. The storage medium stores a program for predicting default of a debt subject based on the semi-supervised model, and when this program is executed by a processor, the steps of the method for predicting default of a subject based on semi-supervised machine learning described in any of the above are implemented.
The above description is only an embodiment of the present invention and is not intended to limit its scope. All equivalent structures or equivalent process modifications made on the basis of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims (9)

1. A method for predicting default of a debt subject based on a semi-supervised model, characterized in that the method comprises the following steps:
S1: acquiring subject data of a debt subject, wherein the subject data comprises news and public opinion information, industrial and commercial information, market evaluation information, and external information, and constructing an index system for the credit default risk of the debt subject from the subject data;
S2: constructing underlying features through statistical analysis, business judgment, and derivation to generate underlying factors;
S3: establishing a semi-supervised model based on the combination of an unlabeled-sample weighting method and a scorecard model;
S4: judging and predicting the default risk of the debt subject based on the semi-supervised model.
2. The method for predicting default of a debt subject based on a semi-supervised model as recited in claim 1, wherein: the index system of S1 rates the debt subject through basic qualification information, financial management information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk association information.
3. The method for predicting default of a debt subject based on a semi-supervised model as recited in claim 1, wherein: S2 mines latent information in the debt subject data using the statistical measures of logarithm, mean, mode, and extreme value.
4. The method for predicting default of a debt subject based on a semi-supervised model as recited in claim 1, wherein: establishing the semi-supervised model in S3 comprises the following steps:
S21: tuning parameters through grid search with AUC as the objective, and training an XGBoost model to obtain a classifier that identifies whether a sample is labeled;
S22: performing probability calibration with a calibration classifier, so that the output of XGBoost is calibrated to an approximately standard probability;
S23: using the calibrated samples and the original negative labels as the modeling target for the subsequent scorecard training;
S24: calculating weights using balanced sample weights;
S25: converting all features into ordinal categorical variables using chi-square binning;
S26: analyzing the degree of association between each feature and the modeling target, as well as the collinearity among features, and screening the high-quality features that may enter the model;
S27: manually optimizing feature interpretability;
S28: training the scorecard model after the features are re-encoded with weight of evidence;
S29: manually checking the scoring rules and correcting the few rules that do not agree with the response rate distribution.
5. The method for predicting default of a debt subject based on a semi-supervised model as recited in claim 1, wherein: the scorecard model of S3 is a scorecard model based on logistic regression; the distribution of positive samples within each feature is converted into weight-of-evidence codes, and scores are then generated by combining the weight of evidence with the regression coefficients (beta); the resulting data-driven scorecard model reflects the information mined from the data and the operating logic of the model, and gives the scoring process for a debt subject and the share of each single factor in the score.
6. The method for predicting default of a debt subject based on a semi-supervised model as recited in claim 1, wherein: the discrimination ability of the semi-supervised model of S3 is tested with the KS statistic, with KS > 0.4.
7. The method for predicting default of a debt subject based on a semi-supervised model as recited in claim 4, wherein: the AUC of S21 satisfies AUC > 0.7.
8. A device for predicting default of a debt subject based on a semi-supervised model, characterized in that the device comprises:
a memory, a processor, a communication bus, and a program, stored on the memory, for predicting default of a debt subject based on the semi-supervised model;
the communication bus is used for realizing the communication connection between the processor and the memory;
the processor is used for executing the program for predicting default of a debt subject based on the semi-supervised model, so as to implement the method for predicting default of a debt subject based on a semi-supervised model as claimed in any one of claims 1 to 7.
9. A computer-readable storage medium storing executable instructions, wherein: the storage medium stores a program for predicting default of a debt subject based on a semi-supervised model, and when the program is executed by a processor, the steps of the method for predicting default of a subject based on semi-supervised machine learning according to any one of claims 1 to 7 are implemented.
CN202011395004.6A 2020-12-03 2020-12-03 Method, equipment and storage medium for predicting debt subject default based on semi-supervised model Pending CN114663102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011395004.6A CN114663102A (en) 2020-12-03 2020-12-03 Method, equipment and storage medium for predicting debt subject default based on semi-supervised model

Publications (1)

Publication Number Publication Date
CN114663102A true CN114663102A (en) 2022-06-24

Family

ID=82025389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011395004.6A Pending CN114663102A (en) 2020-12-03 2020-12-03 Method, equipment and storage medium for predicting debt subject default based on semi-supervised model

Country Status (1)

Country Link
CN (1) CN114663102A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579651A (en) * 2023-05-11 2023-08-11 中国矿业报社 Mining project evaluation method based on semi-supervised learning
CN116579651B (en) * 2023-05-11 2023-11-10 中国矿业报社 Mining project evaluation method based on semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination