CN114663102A

CN114663102A - Method, device and storage medium for predicting default of bond issuer based on semi-supervised model

Info

Publication number: CN114663102A
Application number: CN202011395004.6A
Authority: CN
Inventors: 王专; 郝玉爽; 田鑫涛
Original assignee: China Life Insurance Asset Management Co ltd
Current assignee: China Life Insurance Asset Management Co ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2022-06-24

Abstract

The invention relates to the field of computer technology and discloses a method, a device and a storage medium for predicting the default of a bond issuer based on a semi-supervised model. The method comprises the following steps. S1: obtain entity data of the bond issuer, including news and public opinion information, industrial and commercial information, market evaluation information and external information, and use the entity data to construct an indicator system for the issuer's credit default risk. S2: construct underlying features through statistical analysis, business judgment and derivation to generate underlying factors. S3: establish a semi-supervised model that combines an unlabeled-sample weighting method with a scorecard model. S4: assess and predict the issuer's default risk with the semi-supervised model. The modeling approach uses the risk-ranking ability of an XGB classifier trained on positive and unlabeled samples to enlarge the set of positive samples: the samples with the highest predicted risk probability are taken as new positive samples, the scorecard model is trained on them, and the semi-supervised model is thereby constructed.

Description

Method, device and storage medium for predicting default of bond issuer based on semi-supervised model

Technical Field

The invention relates to the field of computer technology, and in particular provides a method, device and storage medium for predicting the default of a bond issuer based on a semi-supervised model.

Background Art

Traditional default prediction methods for bond-issuing companies mainly rely on financial data and credit researchers' scores to rate a company and derive its default probability. News and public opinion data is unstructured, so it cannot be consumed directly by a computer model and is difficult to use as model input. How to use news and public opinion automatically to build a model that predicts bond issuer default is therefore a problem the prior art needs to solve.

The prior art typically rates bond-issuing companies and outputs default probabilities from financial data and credit researchers' scores across different dimensions of the company, processes public opinion data manually, and formulates a large number of early-warning rules that depend on the participation and subjective evaluation of many business personnel. As a result, data mining and analysis are insufficient, credit risk assessment is hard to apply comprehensively, efficiency is low, and the outcome depends on subjective judgment.

In addition, default by a bond-issuing company is a low-probability event, so positive samples are very scarce in data modeling. Using the existing samples to raise the proportion of positive samples is the key to solving the problem of model distortion.

Summary of the Invention

To solve the problems of the prior art, in which public opinion data is processed manually and early-warning rules are formulated through the participation and subjective evaluation of many business personnel, the invention provides a method, device and storage medium for predicting the default of a bond issuer based on a semi-supervised model.

The technical solution of the invention is as follows:

A method for predicting the default of a bond issuer based on a semi-supervised model, comprising:

S1: obtaining entity data of the bond issuer, the entity data including news and public opinion information, industrial and commercial information, market evaluation information and external information, and constructing an indicator system for the credit default risk of the bond issuer from the entity data;

S2: constructing underlying features through statistical analysis, business judgment and derivation to generate underlying factors;

S3: establishing a semi-supervised model based on the combination of an unlabeled-sample weighting method and a scorecard model;

S4: assessing and predicting the default risk of the bond issuer based on the semi-supervised model.

Further, the indicator system of S1 rates the bond issuer using basic qualification information, financial and operating information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk association information.

Further, S2 uses statistical indicators (logarithm, mean, mode and extreme values) to mine the latent information in the bond issuer data.

Further, establishing the semi-supervised model of S3 includes the following steps:

S21: tuning hyperparameters by grid search with AUC as the objective, and training an XGBoost model to obtain a classifier that identifies whether a sample is labeled;

S22: performing probability calibration with a calibrated classifier, so that the XGBoost output is calibrated into an approximately standard probability;

S23: taking the union of the calibrated samples and the original adverse (default) labels as the modeling target for the subsequent scorecard training;

S24: computing weights with balanced sample weighting;

S25: converting all features into ordinal categorical variables by chi-square binning;

S26: analyzing the degree of association between each feature and the modeling target, as well as the collinearity among features, and selecting high-quality features that can enter the model;

S27: manually optimizing feature interpretability;

S28: training the scorecard model after encoding the features with weight of evidence;

S29: manually reviewing the scoring rules and correcting the few rules that are inconsistent with the response-rate distribution.

Further, the scorecard model of S2 is a scorecard model based on logistic regression. It converts the distribution of positive samples within each feature into weight-of-evidence (WOE) encoding and generates scores by combining the WOE with the regression coefficients β. The resulting data-driven scorecard model reflects the information mined from the data and the computational logic of the model, and shows the scoring process for each bond issuer and the share contributed by each single factor.

Further, the discriminative power of the semi-supervised model is tested with the KS statistic, with KS > 0.4.

Further, the AUC in S21 satisfies AUC > 0.7.

The invention also provides a device for predicting the default of a bond issuer based on a semi-supervised model, the device comprising:

a memory, a processor, a communication bus, and a program, stored in the memory, for predicting the default of a bond issuer based on a semi-supervised model,

the communication bus being used to implement the communication connection between the processor and the memory;

the processor being configured to execute the program for predicting the default of a bond issuer based on a semi-supervised model, so as to implement the steps of the method for predicting the default of a bond issuer based on a semi-supervised model according to any one of the above.

The invention also provides a computer-readable storage medium storing executable instructions. The storage medium stores a program for predicting the default of a bond issuer based on a semi-supervised model, and when the program is executed by a processor it implements the steps of the method for predicting issuer default based on semi-supervised machine learning according to any one of the above.

The beneficial effects of the invention include at least the following:

(1) The modeling approach combines an unlabeled-sample weighting method with a scorecard model. The risk-ranking ability of an XGB classifier trained on positive and unlabeled samples is used to enlarge the set of positive samples, and the samples with the highest predicted risk probability are taken as new positive samples to train the scorecard model. Because the scorecard model is highly interpretable and its training process is a white box, it serves as the evaluation model that outputs the final result;

(2) The positive-and-unlabeled learning method from semi-supervised learning enlarges the set of positive samples and corrects the originally severely biased modeling sample. On the one hand it acknowledges that the unlabeled samples may contain samples that should be labeled; on the other hand it lets the model learn the characteristics of bad samples better, reducing the risk that sample imbalance causes the model to fit noise;

(3) The method generates the model in a data-driven way using machine learning, which reduces the information loss caused by subjective intervention, makes risk warnings more objective, and captures changes in an issuer's risk ahead of default more effectively.

Description of the Drawings

FIG. 1 is a flow chart of the method of the invention for predicting the default of a bond issuer based on a semi-supervised model.

FIG. 2 is a flow chart of the semi-supervised model of the invention.

Detailed Description

To make the purposes, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.

With reference to FIG. 1 and FIG. 2, a method for predicting the default of a bond issuer based on a semi-supervised model is described as follows.

The main goal of the modeling is to use a quantitative model to predict issuers with a high probability of default and thereby detect default risk in advance. The data analyzed comes mainly from corporate industrial and commercial data and from news and public opinion data. The objects of analysis are issuers for which public opinion data exists. Starting from two perspectives, news and public opinion and basic industrial and commercial information, the modeling mines latent patterns and relationships of the issuers along three dimensions: basic qualifications, industrial and commercial changes, and changes in public opinion. Feature engineering enriches the underlying indicators and explores risk factors associated with default risk, and a scorecard model based on semi-supervised learning is built to assess the likelihood that an issuer will default.

First, the target variable is determined: the target variable is defined as bond-defaulting issuers that have early-warning data and can be matched to industrial and commercial information; these form a severely biased sample;

Second, underlying factors are generated through feature engineering. Factors generated from industrial and commercial information cover basic qualification evaluation, financial report evaluation, penalty evaluation and equity pledge evaluation; factors generated from early-warning data cover the time, exposure, type, rating and sentiment label of bond warnings;

To mine the latent information in the data as fully as possible, the feature engineering uses statistical processing methods such as logarithm, mean, mode and extreme values. Judging from the indicators that finally enter the model, the statistical indicators have significant predictive power;

Finally, the model is evaluated. The modeling uses a semi-supervised scorecard model, which addresses the problem of modeling a severely biased sample and achieves good discrimination.

The analysis process includes:

1. Definition of the target variable

The goal of the modeling is to predict the default risk of bond issuers, so whether an issuer has defaulted is used as the target variable.

The default data covers April 2014 to December 2019 and contains 437 records involving 180 issuers (one issuer may have several default records). After excluding 90 issuers with no early-warning data before their first default and 4 issuers that cannot be matched to industrial and commercial information, the remaining 86 defaulting issuers are used as positive modeling samples, accounting for 0.57% of all modeling samples (15,134 issuers).

2. Data preparation

2.1 Construction of the indicator system

To meet the modeling requirement of warning of bond issuer defaults in advance from public opinion, an indicator system for analyzing the credit default risk of bond issuers is established from four perspectives: news and public opinion, industrial and commercial information, market evaluation, and external information. The indicator system rates bond issuers comprehensively along eight sub-dimensions, including basic qualification information, financial and operating information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk association information. A total of 133 factors have been implemented in this modeling; the factors involved are shown in Table 1:

Table 1

2.2 Feature construction

The industrial and commercial, news and public opinion, and rating tables in the relational database are integrated, and the underlying features are processed and implemented through feature engineering from three angles: statistical analysis, business judgment, and derivation.

2.3 Data verification

Data quality is inspected to ensure that the features are computed correctly. Verification consists of two parts: an accuracy check and a logic check.

2.3.1 Accuracy check

Missing values, duplicates, field types and anomalies are tallied for each feature to guide subsequent outlier handling. The specific checks are as follows:

Missing: the number of missing values, the number of rows inspected, and the missing ratio for each field;

Duplicates: fields with only one value (all values identical), and the proportion of unique values in each field;

Field type: whether the field type is consistent with the design;

Outliers: the 3σ rule, i.e. values outside the mean plus or minus three standard deviations.
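
The four accuracy checks above can be sketched for a single column as follows. This is a minimal illustration, not the tooling used in the patent: the function name `profile_column`, the use of `None` for missing values, and the example column are all assumptions.

```python
import statistics

def profile_column(values):
    """Summarise one feature column; None marks a missing value (assumed convention)."""
    present = [v for v in values if v is not None]
    n_rows = len(values)
    n_missing = n_rows - len(present)
    summary = {
        "rows": n_rows,
        "missing": n_missing,
        "missing_ratio": n_missing / n_rows if n_rows else 0.0,
        # Share of distinct values among the non-missing entries
        "unique_ratio": len(set(present)) / len(present) if present else 0.0,
        # True when the field carries a single value only
        "constant": len(set(present)) <= 1,
        # Observed Python types, to compare against the designed field type
        "types": sorted({type(v).__name__ for v in present}),
    }
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) >= 2:
        mean = statistics.fmean(numeric)
        std = statistics.pstdev(numeric)
        # 3-sigma rule: flag values outside mean +/- 3 standard deviations
        summary["outliers"] = [v for v in numeric if abs(v - mean) > 3 * std]
    else:
        summary["outliers"] = []
    return summary

# Hypothetical column: 20 ordinary values, one missing value, one extreme value
registered_capital = [100.0] * 20 + [None, 5000.0]
report = profile_column(registered_capital)
```

With a very small column a single extreme value inflates the standard deviation enough to hide itself, so the 3σ check is only informative on reasonably large samples.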

2.3.2 Logic check

Outliers as defined by the business;

Calculation logic check: a certain proportion of the data is drawn by random sampling, the indicators are computed with different tools, and the results are compared.

2.4 Data cleaning

Missing values and outliers affect how well a factor predicts the final model outcome. Statistical analysis of the modeling samples identifies noise in the data, improving data quality and model performance.

2.4.1 Missing value handling

Depending on the feature type, missing values belong either to numerical variables or to categorical variables. Depending on the meaning of the feature, missing values can be treated as a separate category or filled with the mean, median or mode; fields that are severely incomplete (more than 80% missing) are removed. For removed features, their predictive power is analyzed to decide whether to convert them into rules that enter the model. The specific handling is as follows:

Numerical variables: starting from the meaning of the feature, and following the principle of minimizing data noise, a fill strategy (mean, median, mode, etc.) is chosen for the feature. For example, in this modeling, missing values of features such as time since the most recent industrial and commercial change, time since the most recent annual report disclosure, time since the most recent financial report disclosure, time since establishment, and time since registration are set to 99; missing values of features such as paid-in capital, registered capital and number of employees are set to the mean; and missing values of features such as number of branches and number of changes of legal representative are set to 0.

Categorical variables: missingness may itself carry business meaning. To retain as much information as possible, the missing values of a categorical variable are usually placed in a separate group and assigned a value. For example, in this modeling, missing values of features such as whether the issuer has been profitable for two or more consecutive years, whether it has made losses for two or more consecutive years, whether it was profitable in the most recent fiscal year, and whether it made a loss in the most recent fiscal year are treated as a separate category with the value -1; missing values of features such as whether a shareholder holding more than half of the shares is a judgment debtor (person subject to enforcement) and whether the issuer is a judgment debtor are set to 0 (meaning the event did not occur).
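
The fill strategies described above can be expressed as a per-feature rule table. The sketch below is illustrative only: the feature names, the `FILL_RULES` mapping and the `fill_missing` helper are hypothetical, and `None` is assumed to mark a missing value. Only the fill values (99, mean, 0, -1) and the 80% removal threshold come from the text.

```python
import statistics

# Hypothetical feature names; the fill rules mirror the examples in the text.
FILL_RULES = {
    "years_since_establishment": ("constant", 99),
    "registered_capital": ("mean", None),
    "branch_count": ("constant", 0),
    "profitable_last_year": ("constant", -1),  # missing kept as its own category
    "is_judgment_debtor": ("constant", 0),     # 0 = event did not occur
}

def fill_missing(rows, rules, drop_threshold=0.8):
    """Return (filled rows, dropped feature names); the input rows are not modified."""
    filled = [dict(r) for r in rows]
    dropped = []
    for name, (strategy, constant) in rules.items():
        column = [r.get(name) for r in rows]
        present = [v for v in column if v is not None]
        missing_ratio = 1 - len(present) / len(column)
        if missing_ratio > drop_threshold:
            # Severely incomplete field: remove it from every row
            dropped.append(name)
            for r in filled:
                r.pop(name, None)
            continue
        fill = statistics.fmean(present) if strategy == "mean" else constant
        for r in filled:
            if r.get(name) is None:
                r[name] = fill
    return filled, dropped

rows = [
    {"years_since_establishment": 12, "registered_capital": 100.0,
     "branch_count": 3, "profitable_last_year": 1, "is_judgment_debtor": None},
    {"years_since_establishment": None, "registered_capital": None,
     "branch_count": None, "profitable_last_year": None, "is_judgment_debtor": 1},
    {"years_since_establishment": 5, "registered_capital": 300.0,
     "branch_count": 0, "profitable_last_year": 0, "is_judgment_debtor": 0},
]
filled, dropped = fill_missing(rows, FILL_RULES)
```

Keeping the rules in one table makes each fill decision reviewable by business staff, which matches the text's emphasis on choosing the strategy from the meaning of the feature.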

2.4.2 Outlier handling

This modeling has not yet produced any factor whose outliers can be defined by the business. For features whose outliers cannot be defined from a business perspective, the Pauta criterion (3σ criterion) can be used for screening. Handling outliers makes the model input more stable and helps the model capture the overall characteristics of the data, but outliers may also be genuine characteristics of the data, so the Pauta criterion was not applied in this modeling.

2.5 Training and test set split

To verify that the trained model is accurate and stable and generalizes well, all samples are split into a training set and a test set at a ratio of 7:3.
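
A 7:3 split can be sketched as below. The text does not say whether the split was stratified; stratifying by label is an assumption made here because, with only 86 positives among 15,134 samples, a purely random split could leave the test set with very few defaults. The function name and seed are illustrative.

```python
import random

def stratified_split(labels, test_ratio=0.3, seed=42):
    """Return (train_idx, test_idx), splitting each class separately (assumed design)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Sample sizes taken from the text: 86 defaulting issuers out of 15,134
labels = [1] * 86 + [0] * (15134 - 86)
train_idx, test_idx = stratified_split(labels)
positives_in_test = sum(labels[i] for i in test_idx)
```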

3. Model building

3.1 Exploratory data analysis

Exploratory analysis of the data shows that the defaulting bond issuers defined as the target variable, i.e. the risk-exposed samples with Y = 1, number 86 in total and account for only 0.57% of the full sample of bond issuers. This is a typical imbalanced-sample modeling problem.

3.1.1 Handling the imbalanced sample

A semi-supervised method based on PU Learning (learning from positive and unlabeled samples) is used. By learning from the existing labeled samples, the data points among the unlabeled samples that most resemble the labeled samples are identified and used as additional labeled samples;

Since no more bad samples could be obtained for this modeling, the PU Learning approach was chosen to address the sample imbalance. The principle of imbalanced-sample handling and its application in the model are described in detail below.

3.2 Semi-supervised model

Positive-Unlabeled Learning, the modeling of positive and unlabeled samples within semi-supervised learning, has broad application prospects in imbalanced-sample handling and latent target identification, including scenarios with few adverse labels such as complaints and credit risk exposure. From the Positive-Unlabeled Learning perspective, the known bond-defaulting issuers are only part of the full set of issuers with high default risk; companies with high default risk still exist among the other bond issuers. In other words, the remaining samples are not purely risk-free samples but unlabeled samples, so the default early-warning model is in fact built from positive samples (issuers that have already defaulted) and unlabeled samples (issuers whose default risk has not yet been exposed).

The core idea of the modeling method is that the probability that a sample is labeled is proportional to the probability that it is a positive sample. In terms of ranking ability alone, training on positive and unlabeled samples is therefore equivalent to training on positive and negative samples. A classifier can first be trained on positive and unlabeled samples; the probability of being labeled that it outputs for each sample is converted into a sample weight, and the classifier is trained again with these weights to obtain the probability that each sample is positive (the modeling target).

This modeling combines the above method with a scorecard model. The risk-ranking ability of the positive-and-unlabeled classifier is used to enlarge the set of positive samples: the 5% of samples with the highest predicted risk probability are used as new positive samples to train the scorecard model. This addresses problems caused by too few high-risk labels, such as difficulty in distinguishing random features from genuinely effective ones and weak model stability and extensibility.
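
The label-expansion step can be sketched as follows, assuming the classifier's calibrated scores are already available as a list. The function name `expand_positive_labels` and the toy data are hypothetical; only the 5% fraction and the union with the known defaults come from the text.

```python
def expand_positive_labels(labels, scores, top_fraction=0.05):
    """Union of the known positives and the top-scored fraction of all samples."""
    n_top = round(len(scores) * top_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])
    return [1 if (y == 1 or i in top) else 0 for i, y in enumerate(labels)]

# Toy data: 100 samples whose score increases with the index,
# two known defaults, one of which already ranks in the top 5%.
scores = [i / 100 for i in range(100)]
labels = [0] * 100
labels[10] = 1
labels[99] = 1
expanded = expand_positive_labels(labels, scores)
```

A known default with a low score (index 10 here) stays positive because the result is a union, so the expanded target never loses an observed default.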

The scorecard model used in this modeling rests on the following principle and methodology. A scorecard model based on logistic regression is chosen; the distribution of positive samples within each feature is converted into weight-of-evidence (WOE) encoding, and scores are generated by combining the WOE with the regression coefficients β. The resulting data-driven scorecard directly reflects the information mined from the data and the computational logic of the model, and clearly shows the scoring process for each bond issuer and the share contributed by each single factor. The formula for converting the logistic regression parameters into scorecard scores is as follows:

P₀ is the base score of the scorecard, and PDO is the number of points over which the specified odds double. P₀ and PDO are the two hyperparameters of the scorecard model; they control the central tendency and the dispersion of the scores and are set to 60 and -10 respectively in this modeling;

β denotes the coefficients obtained by training the logistic regression, intercept is the intercept obtained by training the logistic regression, and n is the number of features in the model;

Calculation of constant B:

Calculation of constant A: A = P₀ + B × ln(P₀);

Fixed score of the scorecard: FixedScore = A - B × intercept;

Score of each bin:
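
The expressions for the overall score, for constant B and for the per-bin score are not reproduced in this text. For orientation only, the conventional scaling used for logistic-regression scorecards is sketched below. It is an assumption that the patent follows this convention. The symbol θ₀ (the base odds at which the score equals P₀) and the bin index j are introduced here and do not appear in the source, which writes ln(P₀) in the expression for A.

```latex
\begin{aligned}
B &= \frac{\mathrm{PDO}}{\ln 2}, \\
A &= P_0 + B \ln(\theta_0), \\
\mathrm{Score} &= A - B\Bigl(\mathrm{intercept} + \sum_{i=1}^{n} \beta_i\,\mathrm{WOE}_i\Bigr)
               = \mathrm{FixedScore} - B \sum_{i=1}^{n} \beta_i\,\mathrm{WOE}_i, \\
\mathrm{Score}_{ij} &= -B\,\beta_i\,\mathrm{WOE}_{ij}.
\end{aligned}
```

Under this convention the second form of the score agrees with FixedScore = A - B × intercept as stated in the text. With PDO = -10, B is negative, so a higher predicted log-odds of default yields a higher score, i.e. the score behaves as a risk score.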

The steps for building the semi-supervised model are as follows:

(1) Train a classifier for whether a sample is labeled: an XGBoost model is trained to classify whether a sample will be labeled, with hyperparameters tuned by grid search using AUC as the objective. The optimal hyperparameters are shown in Table 2:

Hyperparameter      Value
Colsample_bytree    1
Learning_rate       0.01
Max_depth           10
N_estimators        200

Table 2

XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which solves many data science problems quickly and accurately.

AUC (Area Under Curve) is defined as the area enclosed between the ROC curve and the coordinate axes, so its value cannot exceed 1. Because the ROC curve generally lies above the line y = x, AUC generally takes values between 0.5 and 1. The closer AUC is to 1.0, the more reliable the detection method; at 0.5 the reliability is lowest and the method has no practical value.
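
The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which gives a direct way to compute it. The sketch below is a plain pairwise implementation for illustration (quadratic in the number of samples, so suitable for small data only); the function name and toy data are assumptions.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = auc_score(labels, scores)
```

A value below 0.5 is possible and means the scores rank the classes in the wrong direction.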

(2) Probability calibration: CalibratedClassifierCV is used for probability calibration, so that the XGBoost output is calibrated into an approximately standard probability. The hyperparameters used are method_calibrated = isotonic and cv = 3, as shown in Table 3:

Hyperparameter       Value
Method_Calibrated    Isotonic
CV                   3

Table 3
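
In scikit-learn, CalibratedClassifierCV takes the calibration method through its `method` parameter, and "isotonic" selects isotonic regression. To show what isotonic calibration does, the sketch below implements the pool-adjacent-violators idea in plain Python. It is a simplified stand-in, not the library's implementation: it omits cross-validation and predicts with a step function rather than interpolating. All function names are hypothetical.

```python
def isotonic_fit(scores, labels):
    """Pool adjacent violators: non-decreasing step function from score to event rate."""
    # Aggregate samples that share the same score
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)
    # Each block holds [sum of labels, sample count, largest score in the block]
    blocks = []
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge while the previous block's mean exceeds the current block's mean
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def isotonic_predict(model, score):
    """Calibrated probability of the first block whose upper bound covers the score."""
    for upper, prob in model:
        if score <= upper:
            return prob
    return model[-1][1]

raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
observed = [0, 1, 0, 0, 1, 1]
model = isotonic_fit(raw_scores, observed)
calibrated = isotonic_predict(model, 0.25)
```

Because the fitted function is non-decreasing, calibration changes the probability values but preserves the ranking produced by the classifier, which is what the later top-5% selection relies on.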

(3)构建扩大的建模目标：使用校准后的概率排序Top5％的样本，与原负面标签取并集作为后续训练评分卡的建模目标；(3) Build an expanded modeling target: use the calibrated probability to sort the Top5% samples, and take the union with the original negative label as the modeling target for the subsequent training scorecard;

(4)不均衡样本加权：使得模型不会因为正样本只有5％而不重视正样本的误分类，使用sklearn.utils.class_weight计算权重；(4) Unbalanced sample weighting: The model will not ignore the misclassification of positive samples because the positive samples are only 5%, and use sklearn.utils.class_weight to calculate the weight;

(5)特征离散化：使用卡方分箱将特征全部转化为序数型分类变量，卡方分箱的超参数如表4所示：(5) Feature discretization: Chi-square binning is used to convert all features into ordinal categorical variables. The hyperparameters of chi-square binning are shown in Table 4:

Max_intervalsMax_intervals 1010 Min_intervalsMin_intervals 55 Initial_intervalsInitial_intervals 100100

表(4)Table 4)

(6)预测力和共线性分析：分析特征与建模目标的关联程度，以及特征之间的共线性，筛选可以入模的优质特征；(6) Predictive power and collinearity analysis: analyze the degree of association between features and modeling targets, as well as the collinearity between features, and screen high-quality features that can be modeled;

(7)人工优化特征可解释性：逐个查看可以入模的优质特征，分析其各取值的频率分布和响应率分布是否可以在业务上解释、是否可能源自于数据的随机波动，据此调整特征的分组，并查看验证集上是否具备与训练集相同的趋势。无法解释、源自数据随机波动的可能性高、或训练集和验证集趋势不符的特征不能入模；(7) Manually optimized feature interpretability: Check the high-quality features that can be entered into the model one by one, and analyze whether the frequency distribution and response rate distribution of each value can be explained in business, and whether it may be derived from random fluctuations in the data. Adjust the grouping of features and see if the validation set has the same trend as the training set. Features that cannot be explained, have a high probability of random fluctuations in the data, or that do not match the trends of the training set and the validation set cannot be included in the model;

(8) Weight-of-evidence (WOE) transformation and scorecard training: the features are WOE-encoded and the scorecard model is then trained; the hyperparameters are shown in Table 5:

Table (5)

(9) Scorecard adjustment: manually review the scoring rules and correct the few rules that are inconsistent with the response-rate distribution.

The English terms that appear in the semi-supervised modeling steps are parameter settings in the code.
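As a minimal sketch of the weighting in step (4), the snippet below reproduces the "balanced" heuristic, weight = n_samples / (n_classes × class_count), in plain Python. This is the formula that `sklearn.utils.class_weight.compute_class_weight` applies when called with `class_weight="balanced"`. The function name `balanced_class_weights` and the 100-issuer toy data are illustrative assumptions, not taken from the patent.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical expanded target: 5 positives among 100 issuers (5%).
y = [1] * 5 + [0] * 95
weights = balanced_class_weights(y)

# Per-sample weights, in the shape usually passed as sample_weight when fitting.
sample_weight = [weights[label] for label in y]
pos_mass = sum(w for w, label in zip(sample_weight, y) if label == 1)
neg_mass = sum(w for w, label in zip(sample_weight, y) if label == 0)
print(weights, pos_mass, neg_mass)
```

With these weights both classes carry the same total weight, so a misclassified positive costs as much as many misclassified negatives.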

The indicators that enter the final semi-supervised default model are shown in Table 6:

Table (6)

4. Model evaluation

4.1 Evaluation method

Early warning of bond issuer default is not a conventional classification problem with positive and negative samples but a semi-supervised problem of labeled positive samples plus unlabeled samples, and the sample proportions can hardly meet the training requirements of traditional classification models. In this modeling sample the number of defaulted issuers is very small, and the remaining issuers are in fact a mixture of high-risk issuers that have not yet defaulted and lower-risk issuers. A precision metric would count issuers that the model predicts as high risk but that have not yet defaulted as prediction errors, whereas such issuers genuinely resemble defaulted issuers in their risk characteristics; their risk has simply not yet been exposed, or factors outside the model have so far kept them from defaulting. The traditional precision metric is therefore no longer suitable for this scenario. This modeling uses AUC to evaluate the model's overall risk-ranking ability and KS to evaluate its ability to separate positive and negative samples.

AUC (area under the ROC curve): tests the model's ranking ability; an AUC above 0.7 is recommended. The higher the AUC, the better the classification performance and the greater the probability that a default sample is ranked ahead of a non-default sample.

0.5 < AUC < 1: better than random guessing, with predictive value; AUC = 0.5: equivalent to random guessing, with no predictive value;

K-S statistic: tests the model's discriminating ability; a KS value above 0.40 is recommended.

KS > 0.4: good discriminating ability; 0.2 < KS ≤ 0.4: average discriminating ability; KS ≤ 0.2: poor discriminating ability.
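The two metrics can be sketched in a few lines of plain Python. AUC is computed here as the fraction of (default, non-default) pairs in which the default sample receives the higher score, with ties counted as one half. KS is computed as the largest gap between the cumulative capture rates of defaults and non-defaults over all score thresholds. The function names and the eight-issuer toy data are illustrative assumptions; a production system would more likely call a library routine.

```python
def auc_score(y_true, scores):
    """Probability that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, scores):
    """Maximum |TPR - FPR| over all score thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in sorted(set(scores), reverse=True):
        tpr = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t) / n_neg
        best = max(best, abs(tpr - fpr))
    return best

# Hypothetical warning scores for 8 issuers (1 = defaulted, 0 = not defaulted).
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
auc = auc_score(y_true, scores)
ks = ks_statistic(y_true, scores)
print(round(auc, 4), round(ks, 4))
```

In this toy example 14 of the 15 default/non-default pairs are ordered correctly, and the largest TPR/FPR gap occurs at the threshold 0.6.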

4.2 Evaluation of the semi-supervised model

As shown in Tables (7) and (8), the final AUC on the test set is 0.9617, close to 1, and the KS is 0.7779, above 0.4, indicating that the semi-supervised model has good risk-ranking and discriminating ability for predicting the default probability of bond issuers. In addition, with issuers sorted by warning score in descending order, recall reaches 88.37% at the top 2% threshold, which also reflects the model's good predictive ability for default risk.

Evaluation indicator | Full sample
AUC | 0.9611
KS | 0.8191

Table (7)

Anomaly level (descending order) | Cumulative recall
Top 1% | 74.42%
Top 2% | 81.40%
Top 5% | 87.21%
Top 10% | 90.70%
All | 100.00%

Table (8)
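A cumulative recall figure of the kind reported in Table (8) can be sketched as follows: sort issuers by warning score in descending order, keep the top fraction, and divide the number of defaults captured by the total number of defaults. The function name `cumulative_recall` and the 20-issuer toy data are illustrative assumptions and do not reproduce the patent's data.

```python
import math

def cumulative_recall(y_true, scores, top_fraction):
    """Share of all positives found in the top `top_fraction` of ranked samples."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, math.ceil(top_fraction * len(ranked)))
    captured = sum(label for _, label in ranked[:k])
    return captured / sum(y_true)

# Hypothetical data: 20 issuers with strictly decreasing scores,
# of which the issuers at ranks 1, 2, 4 and 10 have defaulted.
scores = [(20 - i) / 20 for i in range(20)]
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 10

recall_25 = cumulative_recall(y_true, scores, 0.25)
recall_50 = cumulative_recall(y_true, scores, 0.50)
print(recall_25, recall_50)
```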

The present invention provides a method for predicting the default of a bond issuer based on a semi-supervised model, comprising:

taking whether the bond issuer defaults as the target variable;

acquiring subject data of the bond issuer, comprising news and public opinion information, industrial and commercial information, market evaluation information and external information, and constructing from the subject data an indicator system for the issuer's credit default risk; the indicator system rates the bond issuer comprehensively across 8 sub-dimensions, including basic qualification information, financial and operating information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk association information;

integrating, through feature engineering, the industrial and commercial, news and public opinion, and rating data tables in the relational database; processing the underlying features from three aspects (statistical analysis, business judgment and derived construction) to produce underlying factors; checking data quality through accuracy checks and logic checks to ensure that the features are computed correctly; screening outliers and removing missing values to improve data quality and predictive power; and, to mine the latent information in the bond issuer data, processing statistical indicators such as logarithm, mean, mode and extreme values in the feature engineering;

establishing a semi-supervised model based on the combination of the unlabeled sample weighting method and the scorecard model: the risk-ranking ability of a classifier trained on positive and unlabeled samples is used to enlarge the set of positive samples, the samples with the highest risk probability are taken as new positive samples, and the scorecard model is trained on them, which addresses the problems caused by too few high-risk labels, such as difficulty in distinguishing random features from truly effective ones and weak model stability and extensibility. The scorecard model is based on logistic regression: the distribution of positive samples within each feature is converted into a weight-of-evidence encoding, and scores are generated by combining the weight of evidence with the regression coefficients β. The resulting data-driven scorecard model reflects the information mined from the data and the model's computation logic, and clearly shows the scoring process for each bond issuer and the share contributed by each single factor;

judging and predicting the default risk of the bond issuer based on the semi-supervised model.

By using the positive and unlabeled sample learning method from semi-supervised learning, the set of positive samples is enlarged and the originally severely biased modeling sample is corrected. On the one hand, this acknowledges that the unlabeled samples may contain samples that should be labeled; on the other hand, it lets the model better learn the characteristics of bad samples and reduces the risk, brought by sample imbalance, that the model fits more noise.
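The enlargement of the positive set described above can be sketched as follows, assuming that calibrated probabilities from the labeled-versus-unlabeled classifier are already available. The samples are ranked by calibrated probability, the top 5% are selected, and the selection is unioned with the originally labeled samples. The function name `expand_positive_labels`, the integer `top_percent` parameter and the 40-issuer toy data are illustrative assumptions.

```python
def expand_positive_labels(labeled_positive, calibrated_prob, top_percent=5):
    """Union of the original labels with the top `top_percent`% by probability."""
    n = len(calibrated_prob)
    k = (n * top_percent + 99) // 100  # integer ceiling of n * top_percent / 100
    order = sorted(range(n), key=lambda i: calibrated_prob[i], reverse=True)
    top_idx = set(order[:k])
    return [1 if (labeled_positive[i] == 1 or i in top_idx) else 0 for i in range(n)]

# Hypothetical data: 40 issuers, of which issuers 0 and 7 carry an original label.
labeled_positive = [0] * 40
labeled_positive[0] = 1
labeled_positive[7] = 1

# Hypothetical calibrated probabilities; issuer 3 is unlabeled but scores high.
calibrated_prob = [0.05] * 40
calibrated_prob[0] = 0.95
calibrated_prob[3] = 0.90
calibrated_prob[7] = 0.20

expanded = expand_positive_labels(labeled_positive, calibrated_prob)
expanded_idx = [i for i, label in enumerate(expanded) if label == 1]
print(expanded_idx)
```

Issuer 7 stays in the target through the union even though its probability is low, while issuer 3 is added because it ranks in the top 5%.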

The method for predicting the default of a bond issuer based on a semi-supervised model provided by the present invention includes the following steps for establishing the semi-supervised model:

tuning parameters by grid search with AUC as the objective, and training an XGBoost model to obtain a classifier that identifies whether a sample is labeled;

performing probability calibration with a calibrated classifier, so that the output of XGBoost is calibrated to approximately standard probabilities;

taking the union of the calibrated samples and the original negative labels as the modeling target for the subsequent scorecard training;

computing weights with balanced sample weighting;

converting all features into ordinal categorical variables with chi-square binning;

analyzing the degree of association between the features and the modeling target, as well as the collinearity among features, and selecting the high-quality features eligible to enter the model;

manually optimizing feature interpretability: reviewing the eligible high-quality features one by one, analyzing whether the frequency distribution and response-rate distribution of each value can be explained in business terms or may stem from random fluctuation in the data, adjusting the grouping of the feature accordingly, and checking whether the validation set shows the same trend as the training set; features that cannot be explained, that are likely to stem from random fluctuation in the data, or whose trends differ between the training set and the validation set must not enter the model;

training the scorecard model after the features are weight-of-evidence encoded;

manually reviewing the scoring rules and correcting the few rules that are inconsistent with the response-rate distribution.
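The weight-of-evidence encoding and scoring steps above can be illustrated with a short sketch. The patent does not state its WOE sign convention or score scaling, so the sketch assumes WOE = ln(share of positives in the bin / share of negatives in the bin) and a per-bin contribution of factor × β × WOE. The coefficient `beta`, the scaling `factor` (20 points per doubling of the odds) and the binned toy data are hypothetical. The sketch also assumes every bin contains both classes, otherwise the logarithm is undefined.

```python
import math

def woe_table(bins, target):
    """WOE per bin: ln(share of positives in bin / share of negatives in bin)."""
    pos_total = sum(target)
    neg_total = len(target) - pos_total
    stats = {}
    for b, y in zip(bins, target):
        pos, neg = stats.get(b, (0, 0))
        stats[b] = (pos + y, neg + (1 - y))
    return {b: math.log((pos / pos_total) / (neg / neg_total))
            for b, (pos, neg) in stats.items()}

# Hypothetical binned feature for 100 issuers: 10 positives, 90 negatives.
bins = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
target = [1] * 2 + [0] * 48 + [1] * 3 + [0] * 27 + [1] * 5 + [0] * 15

woe = woe_table(bins, target)

beta = 0.8                   # hypothetical logistic-regression coefficient
factor = 20 / math.log(2)    # hypothetical scaling: 20 points double the odds
points = {b: factor * beta * w for b, w in woe.items()}
print({b: round(p, 1) for b, p in points.items()})
```

Under this convention a bin with an above-average share of positives receives a positive contribution, which matches a warning score where higher values mean higher risk.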

The method for predicting the default of a bond issuer based on a semi-supervised model provided by the present invention includes testing the discriminating ability of the semi-supervised model with the KS statistic; when KS > 0.4, the discriminating ability is good.

The method for predicting the default of a bond issuer based on a semi-supervised model provided by the present invention includes testing the model's ranking ability with AUC, where AUC > 0.7; the higher the AUC, the better the classification performance and the greater the probability that a default sample is ranked ahead of a non-default sample.

The present invention also provides a device for predicting the default of a bond issuer based on a semi-supervised model, the device comprising:

a communication bus for implementing the communication connection between the processor and the memory;

a processor for executing the program for predicting the default of a bond issuer based on a semi-supervised model, so as to implement the steps of any one of the above methods for predicting the default of a bond issuer based on a semi-supervised model.

The present invention also provides a computer-readable storage medium storing executable instructions; the storage medium stores a program for predicting the default of a bond issuer based on a semi-supervised model, and when the program is executed by a processor it implements the steps of any one of the above methods for predicting issuer default based on semi-supervised machine learning.

The above are only embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for predicting default of a bond issuer based on a semi-supervised model, characterized in that the method comprises the following steps:

S1: acquiring subject data of a bond issuer, wherein the subject data comprises news and public opinion information, industrial and commercial information, market evaluation information and external information, and constructing an indicator system for the credit default risk of the bond issuer from the subject data;

S2: constructing underlying features through statistical analysis, business judgment and derivation to generate underlying factors;

S3: establishing a semi-supervised model based on the combination of an unlabeled sample weighting method and a scorecard model;

S4: judging and predicting the default risk of the bond issuer based on the semi-supervised model.

2. The method for predicting default of a bond issuer based on a semi-supervised model as recited in claim 1, wherein: the indicator system of S1 rates the bond issuer on basic qualification information, financial and operating information, penalty information, equity pledge information, news and public opinion information, internal and external rating information, and risk association information.

3. The method for predicting default of a bond issuer based on a semi-supervised model as recited in claim 1, wherein: S2 mines latent information in the bond issuer data using the statistical indicators of logarithm, mean, mode and extreme value.

4. The method for predicting default of a bond issuer based on a semi-supervised model as recited in claim 1, wherein: establishing the semi-supervised model in S3 comprises the following steps:

S21: tuning parameters by grid search with AUC as the objective, and training an XGBoost model to obtain a classifier that identifies whether a sample is labeled;

S22: performing probability calibration with a calibrated classifier, so that the output of XGBoost is calibrated to approximately standard probabilities;

S23: taking the union of the calibrated samples and the original negative labels as the modeling target for the subsequent scorecard training;

S24: calculating weights using balanced sample weighting;

S25: converting all features into ordinal categorical variables using chi-square binning;

S26: analyzing the degree of association between the features and the modeling target and the collinearity among the features, and selecting high-quality features eligible to enter the model;

S27: manually optimizing feature interpretability;

S28: training a scorecard model after the features are weight-of-evidence encoded;

S29: manually reviewing the scoring rules, and correcting the few rules that are inconsistent with the response-rate distribution.

5. The method for predicting default of a bond issuer based on a semi-supervised model as recited in claim 1, wherein: the scorecard model of S3 is a scorecard model based on logistic regression, in which the distribution of positive samples within each feature is converted into a weight-of-evidence encoding and scores are generated by combining the weight of evidence with the regression coefficients β; the resulting data-driven scorecard model reflects the information mined from the data and the computation logic of the model, and gives the scoring process of the bond issuer and the share contributed by each single factor.

6. The method for predicting default of a bond issuer based on a semi-supervised model as recited in claim 1, wherein: the discriminating ability of the semi-supervised model of S3 is tested with the KS statistic, with KS > 0.4.

7. The method for predicting default of a bond issuer based on a semi-supervised model as recited in claim 4, wherein: the AUC of S21 is in the range AUC > 0.7.

8. A device for predicting default of a bond issuer based on a semi-supervised model, characterized in that the device comprises:

a memory, a processor, a communication bus, and a program, stored on the memory, for predicting default of a bond issuer based on a semi-supervised model,

the communication bus being used to implement the communication connection between the processor and the memory;

the processor being used to execute the program for predicting default of a bond issuer based on a semi-supervised model, so as to implement the method for predicting default of a bond issuer based on a semi-supervised model as claimed in any one of claims 1 to 7.

9. A computer-readable storage medium storing executable instructions, wherein: the storage medium stores a program for predicting default of a bond issuer based on a semi-supervised model, and the program, when executed by a processor, implements the steps of the method for predicting issuer default based on semi-supervised machine learning according to any one of claims 1 to 7.