CN114676932A - Bond default prediction method and device based on class imbalance machine learning framework - Google Patents

Bond default prediction method and device based on class imbalance machine learning framework

Info

Publication number
CN114676932A
Authority
CN
China
Prior art keywords
bond
data
training
model
default
Prior art date
Legal status
Pending
Application number
CN202210407035.1A
Other languages
Chinese (zh)
Inventor
李孜
杨帆
吴皓
孙彦杰
Current Assignee
Icbc Credit Suisse Fund Management Co ltd
Original Assignee
Icbc Credit Suisse Fund Management Co ltd
Priority date
Filing date
Publication date
Application filed by Icbc Credit Suisse Fund Management Co ltd
Priority to CN202210407035.1A
Publication of CN114676932A

Classifications

    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F 18/2433: Classification techniques relating to the number of classes; single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • G06N 20/20: Machine learning; ensemble learning
    • G06N 5/01: Computing arrangements using knowledge-based models; dynamic search techniques; heuristics; dynamic trees; branch-and-bound
    • G06Q 10/0635: Risk analysis of enterprise or organisation activities
    • G06Q 40/03: Finance; credit; loans; processing thereof


Abstract

The present disclosure provides a bond default prediction method, apparatus, device, medium and product based on a class-imbalanced machine learning framework. The method comprises: acquiring relevant data of a debt subject; preprocessing the acquired relevant data of the debt subject; selecting, from the relevant data of the debt subject, the features that contribute most to model training; constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model; and deploying the optimal model and using it to predict bond defaults. By selecting the features that contribute most to model training and training the prediction model with the Self-Paced Ensemble method, the method and apparatus achieve more accurate bond default prediction.

Description

Bond default prediction method and device based on class imbalance machine learning framework
Technical Field
The invention relates to the field of financial technology, and in particular to a bond default prediction method, device, equipment, medium and product based on a class-imbalanced machine learning framework.
Background
With the rapid development of China's bond market, credit risk in the bond market has continuously accumulated, bond default events have occurred more frequently year by year, and defaults have become the norm. The failure of the "11 Chaori Bond" to pay interest in 2014 is widely regarded as the first default of a credit bond, and the number and amount of defaulted bonds grew from 6 bonds and 1.34 billion yuan in 2014 to 155 bonds and 175.772 billion yuan in 2020. Defaulting issuers have also spread from small and micro enterprises to large and medium-sized enterprises, state-owned enterprises and listed companies. The number of highly rated defaulters has likewise increased: in 2020, issuers rated AA accounted for 27.59% of first-time defaulters, already exceeding the 17.20% share of issuers rated BBB and below. Geographically, defaulting issuers have spread from the southeast coastal regions to central and western China, and the affected industries include diversified holdings, oil and gas refining and marketing, conglomerates, construction and engineering, food processing and meat, real estate development and the like.
Fixed-income assets account for a large share of the investment targets of the asset management industry. When a bond defaults, holders lose principal or interest, and the valuations of all bonds issued by the defaulting subject fall sharply, which drags down the net value of products and triggers redemptions by some clients. Client redemptions force the product manager to sell the most liquid assets first, so investors who redeem early receive the cash from those assets, while investors who redeem later can only bear the losses on the remaining illiquid assets. This unfairness accelerates redemptions, increases the liquidity pressure on asset management products, and drives the product net value down further. To prevent this, it is necessary to predict bond defaults and avoid, in advance, bonds that may "blow up".
Some machine learning-based bond default prediction methods have been developed in the industry. These methods usually use data such as the financial statements, credit investigation records, credit ratings and company announcements of bond-issuing subjects, and train a machine learning model and search for optimal parameters after data cleaning and feature engineering. However, such methods suffer from severely imbalanced data classes: defaulting debt subjects account for a very low proportion of all credit bond debt subjects, with only about 1% of debt subjects ever defaulting on a bond. This class imbalance in the training data biases the model toward the majority class during training and degrades the prediction effect.
Disclosure of Invention
In view of the above problem that imbalanced data classes in training a machine learning model lead to poor bond default prediction, the present disclosure provides a bond default prediction method, device, apparatus, medium and product based on a class-imbalanced machine learning framework, which achieves more accurate bond default prediction by selecting the features that contribute most to model training and training the prediction model with the Self-Paced Ensemble method.
A first aspect of the present disclosure provides a bond default prediction method based on a class imbalance machine learning framework, the prediction method comprising:
selecting features with the highest contribution degree to model training from relevant data of a debt subject, wherein the relevant data comprises annual financial statement data and default condition data of the debt subject;
constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model;
and deploying the optimal model, and using the optimal model to predict bond default.
According to an embodiment of the present disclosure, before the selecting a feature with the highest contribution degree to model training from the data related to the debt subject, the method further includes:
acquiring related data of a debt subject;
and preprocessing the acquired data related to the debt subject.
According to the embodiment of the disclosure, the acquiring of the relevant data of the debt subject specifically includes:
acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the credit bond debt subject.
According to an embodiment of the present disclosure, the preprocessing the acquired data related to the debt subject includes:
counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features of the relevant data of the debt subject, and binning the features of the relevant data of the debt subject.
According to an embodiment of the present disclosure, the selecting a feature with the highest contribution degree to model training from the relevant data of the debt subject specifically includes:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
According to the embodiment of the disclosure, the constructing a training set and a test set, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model specifically includes:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method;
and selecting an optimal model.
According to an embodiment of the present disclosure, the determining a base classifier in an ensemble learning framework specifically includes: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set specifically includes: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set by using a 5-fold cross-validation method specifically includes:
randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
According to the embodiment of the present disclosure, the performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
According to an embodiment of the present disclosure, the selecting an optimal model specifically includes:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
According to the embodiment of the disclosure, the evaluation index used for evaluating the quality of the model comprises accuracy, precision, recall, F1 or ROC-AUC.
According to an embodiment of the present disclosure, the hyper-parameters include hyper-parameters of the Self-Paced Ensemble classifier and hyper-parameters of the LightGBM base classifier, wherein:
the hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
According to an embodiment of the present disclosure, the deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
According to an embodiment of the present disclosure, the predicting bond default by using the optimal model specifically includes:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
A second aspect of the present disclosure provides a bond default prediction apparatus based on a category-unbalanced machine learning framework, the prediction apparatus including a feature selection module, a model training module, and a default prediction module, wherein:
the characteristic selection module is used for selecting the characteristics with the highest contribution degree to model training from the relevant data of the debt main body, wherein the relevant data comprises annual financial statement data and default condition data of the debt main body;
the model training module is used for constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model;
and the default prediction module is used for deploying the optimal model and predicting bond default by using the optimal model.
According to an embodiment of the present disclosure, the apparatus further comprises a data acquisition module and a data preprocessing module, wherein:
the data acquisition module is used for acquiring related data of the debt subject;
and the data preprocessing module is used for preprocessing the acquired related data of the debt subject.
According to the embodiment of the disclosure, the acquiring of the relevant data of the debt subject specifically includes:
acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the credit bond debt subject.
According to an embodiment of the present disclosure, the preprocessing the acquired data related to the debt subject includes:
counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features of the relevant data of the debt subject, and binning the features of the relevant data of the debt subject.
According to an embodiment of the present disclosure, the selecting a feature with a highest contribution degree to model training from the relevant data of the debt subject specifically includes:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
According to the embodiment of the disclosure, the constructing a training set and a test set, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model specifically includes:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method;
and selecting an optimal model.
According to an embodiment of the present disclosure, the determining a base classifier in an ensemble learning framework specifically includes: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set specifically includes: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set by using a 5-fold cross-validation method specifically includes:
randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
According to the embodiment of the present disclosure, the performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
According to an embodiment of the present disclosure, the selecting an optimal model specifically includes:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
According to the embodiment of the disclosure, the evaluation index used for evaluating the quality of the model comprises accuracy, precision, recall, F1 or ROC-AUC.
According to an embodiment of the present disclosure, the hyper-parameters include hyper-parameters of the Self-Paced Ensemble classifier and hyper-parameters of the LightGBM base classifier, wherein:
the hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
According to an embodiment of the present disclosure, the deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
According to an embodiment of the present disclosure, the using the optimal model to predict bond default specifically includes:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the bond default prediction method based on a class-imbalanced machine learning framework as described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the bond default prediction method based on a class-imbalanced machine learning framework as described above.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program that, when executed by a processor, implements the bond default prediction method based on a class-imbalanced machine learning framework as described above.
Compared with the prior art, the bond default prediction method, device, equipment, medium and product based on the class-imbalanced machine learning framework train the prediction model with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, so that more accurate bond default prediction is achieved, more bond-issuing subjects with potential risks are found, and losses caused by credit risk are avoided as early as possible.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates a flow chart of a bond default prediction method based on a category imbalance machine learning framework according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram for model training and selecting an optimal model using a training set and a test set, according to an embodiment of the disclosure;
fig. 3 schematically illustrates a block diagram of a bond default prediction apparatus based on a category-based unbalanced machine learning framework according to an embodiment of the present disclosure;
fig. 4 schematically illustrates a block diagram of an electronic device implementing a method for bond default prediction based on a category imbalance machine learning framework according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
Compared with the prior art, the disclosure provides a bond default prediction method, device, equipment, medium and product based on a class-imbalanced machine learning framework. The bond default prediction method based on the class-imbalanced machine learning framework comprises the following steps: acquiring relevant data of a debt subject; preprocessing the acquired relevant data of the debt subject; selecting the features that contribute most to model training from the relevant data of the debt subject; constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and a first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method (also called the self-paced ensemble learning method), and selecting an optimal model; and deploying the optimal model and using the optimal model to predict bond defaults.
According to the scheme, the prediction model is trained with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, thereby achieving more accurate bond default prediction, finding more debt subjects with potential risks, and avoiding losses caused by credit risk as early as possible.
A bond default prediction method, apparatus, device, medium, and product based on a category imbalance machine learning framework according to embodiments of the present disclosure will be described in detail below with reference to fig. 1-4.
Fig. 1 schematically illustrates a flowchart of a bond default prediction method based on a category imbalance machine learning framework according to an embodiment of the present disclosure.
As shown in fig. 1, this embodiment provides a bond default prediction method based on a category-unbalanced machine learning framework, where the method includes operations S101 to S105, and specifically the following steps:
in operation S101, related data of a debt subject is acquired, the related data including annual financial statement data and default condition data of the debt subject.
The acquiring of the relevant data of the debt subject specifically includes: and acquiring relevant data of the debt subject from the database.
The acquiring of the data related to the debt subject from the database specifically includes:
the method comprises the steps of acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the bond subject.
The database may be a Wind database.
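As an illustrative sketch only (the table layout and column names below, such as is_credit_bond, is_chengtou and default_date, are assumptions and not part of the disclosure), the screening and first-default labelling step could be expressed in Python with pandas as follows:

```python
# Illustrative sketch: build the issuer-level data set and first-default labels
# from raw tables pulled from a database such as Wind. Column names are hypothetical.
import pandas as pd

def build_issuer_dataset(financials: pd.DataFrame, defaults: pd.DataFrame) -> pd.DataFrame:
    """financials: one row per (issuer_id, year) of annual statement features;
    defaults: one row per recorded bond default with issuer_id and default_date."""
    # keep credit bond issuers only and drop urban investment (chengtou) issuers
    credit = financials[(financials["is_credit_bond"] == 1) &
                        (financials["is_chengtou"] == 0)].copy()

    # first default year per issuer (NaN if the issuer never defaulted)
    first_default = (defaults.assign(year=pd.to_datetime(defaults["default_date"]).dt.year)
                             .groupby("issuer_id")["year"].min()
                             .rename("first_default_year"))

    credit = credit.merge(first_default, on="issuer_id", how="left")
    # first-default label: 1 in the year the issuer first defaults, otherwise 0
    credit["first_default_label"] = (credit["year"] == credit["first_default_year"]).astype(int)
    return credit
```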
In operation S102, the acquired data related to the debt subject is preprocessed.
The preprocessing of the acquired data related to the debt subject comprises: counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features, and binning the features.
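A minimal preprocessing sketch is shown below, assuming a pandas DataFrame of numeric features; the missing-rate threshold, the 1st/99th-percentile outlier clipping and the equal-frequency binning rule are illustrative choices rather than values specified in the disclosure:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing: float = 0.5, n_bins: int = 10) -> pd.DataFrame:
    # 1) drop features whose missing rate exceeds the threshold
    missing_rate = df.isna().mean()
    df = df.loc[:, missing_rate[missing_rate <= max_missing].index]

    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        s = df[col]
        # 2) clip outliers to the 1st/99th percentiles
        lo, hi = s.quantile(0.01), s.quantile(0.99)
        s = s.clip(lo, hi)
        # 3) equal-frequency binning of the clipped feature
        out[col] = pd.qcut(s, q=n_bins, labels=False, duplicates="drop")
    return out
```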
In operation S103, a feature having the highest contribution degree to model training is selected from the data related to the debt subject.
Selecting the feature with the highest contribution degree to model training from the relevant data of the debt subject, specifically comprising: and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
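As one concrete example of this screening, the Information Value (IV) of each binned feature against the binary default label can be computed and ranked; the sketch below assumes the features have already been binned as in the preprocessing step, and the selection threshold is an assumption rather than a value given in the disclosure:

```python
import numpy as np
import pandas as pd

def information_value(binned_feature: pd.Series, label: pd.Series) -> float:
    tab = pd.crosstab(binned_feature, label)             # rows: bins, columns: 0 / 1
    good = (tab[0] / tab[0].sum()).clip(lower=1e-6)      # distribution of non-defaults
    bad = (tab[1] / tab[1].sum()).clip(lower=1e-6)       # distribution of defaults
    woe = np.log(bad / good)                             # weight of evidence per bin
    return float(((bad - good) * woe).sum())

def select_features(df: pd.DataFrame, label: pd.Series, min_iv: float = 0.02) -> list:
    ivs = {c: information_value(df[c], label) for c in df.columns}
    return [c for c, iv in sorted(ivs.items(), key=lambda kv: -kv[1]) if iv >= min_iv]
```

In practice the IV ranking would be combined with the correlation, entropy and Gini measures mentioned above and reviewed against business experience before the final feature list is fixed.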
In operation S104, a training set and a test set are constructed by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, model training is performed based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and an optimal model is selected.
The annual financial statement data of the debt subject are from two years before the year for which default is predicted; for example, to predict defaults in 2020, the 2018 annual financial statement data of the subject are selected.
As shown in fig. 2, the constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble method, and selecting an optimal model specifically includes:
in operation S1041, a base classifier in the ensemble learning framework is determined.
The determining of the base classifier in the ensemble learning framework specifically includes: and determining a LightGBM binary classifier as a base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
LightGBM is an efficient implementation of the gradient boosting decision tree algorithm, which ensembles decision trees by the Boosting method and obtains an optimal model through iterative training.
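A sketch of such a LightGBM binary base classifier is shown below; the hyper-parameter values are placeholders, since the actual values are chosen later by the grid search described in operation S1044:

```python
from lightgbm import LGBMClassifier

base_clf = LGBMClassifier(
    objective="binary",      # binary classification: default vs. non-default
    boosting_type="gbdt",    # gradient-boosted decision trees
    n_estimators=100,        # number of decision trees
    num_leaves=31,           # number of leaves per tree
    min_child_samples=20,    # minimum amount of data in one leaf
    max_depth=-1,            # maximum tree depth (-1 means no limit)
    learning_rate=0.1,
    reg_alpha=0.0,           # L1 regularization coefficient
    reg_lambda=0.0,          # L2 regularization coefficient
)
```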
In operation S1042, a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject is divided into a training set and a test set.
In order to find the optimal model hyper-parameter combination, a part of the data set is required to be divided as a test set, and the rest is required to be used as a training set when the hyper-parameters are adjusted.
The data set including the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject is divided into a training set and a testing set, and the method specifically comprises the following steps: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
Dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set by using a 5-fold cross-validation method, and specifically comprising the following steps of: randomly dividing a data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the rest 1 part of the data set as a test set, and repeating the selecting step five times, wherein the selected training sets are different each time.
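A minimal 5-fold split sketch using scikit-learn is given below; X and y are assumed to be arrays holding the selected annual statement features and the first-default labels, and StratifiedKFold is used here as an illustrative choice so that the rare default class appears in every fold (the disclosure itself only requires a random 5-fold split):

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # each of the 5 folds serves as the test set exactly once
```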
In operation S1043, model training is performed based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method.
The performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes the following steps (a code sketch of this procedure is given after the steps):
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
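The following is a compact, from-scratch sketch of the self-paced ensemble training loop in steps 1) to 11), using LightGBM base classifiers; it is illustrative rather than the patented implementation, and details such as using the absolute prediction error as the hardness function and the equal-width hardness binning are assumptions. X and y are assumed to be numpy arrays.

```python
import numpy as np
from lightgbm import LGBMClassifier

def train_self_paced_ensemble(X, y, n_estimators=10, k_bins=10, random_state=0):
    rng = np.random.RandomState(random_state)
    P_idx = np.flatnonzero(y == 1)            # minority (default) samples
    N_idx = np.flatnonzero(y == 0)            # majority (non-default) samples
    classifiers = []

    def fit_base(majority_subset):
        idx = np.concatenate([majority_subset, P_idx])
        clf = LGBMClassifier(objective="binary")
        clf.fit(X[idx], y[idx])
        return clf

    def ensemble_proba(clfs, X_):
        return np.mean([c.predict_proba(X_)[:, 1] for c in clfs], axis=0)

    # step 2): first base classifier on a random undersample of N with |N_0| = |P|
    classifiers.append(fit_base(rng.choice(N_idx, size=len(P_idx), replace=False)))

    for i in range(1, n_estimators):
        # steps 3)-4): hardness of each majority sample under the current ensemble
        hardness = np.abs(ensemble_proba(classifiers, X[N_idx]) - y[N_idx])
        bins = np.minimum((hardness / (hardness.max() + 1e-12) * k_bins).astype(int),
                          k_bins - 1)

        # steps 5)-7): average hardness per bin, self-paced factor, sampling weights
        alpha = np.tan(i * np.pi / (2 * n_estimators))
        weights = np.zeros(k_bins)
        for b in range(k_bins):
            members = hardness[bins == b]
            if len(members) > 0:
                weights[b] = 1.0 / (members.mean() + alpha)
        weights /= weights.sum()

        # step 8): undersample about |P| * p_l majority samples from each bin
        sampled = []
        for b in range(k_bins):
            pool = N_idx[bins == b]
            n_b = min(len(pool), int(round(weights[b] * len(P_idx))))
            if n_b > 0:
                sampled.append(rng.choice(pool, size=n_b, replace=False))
        majority_subset = (np.concatenate(sampled) if sampled
                           else rng.choice(N_idx, size=len(P_idx), replace=False))

        # step 9): train the next base classifier on the new undersampled set plus P
        classifiers.append(fit_base(majority_subset))

    # step 11): the final model averages the probabilities of all base classifiers
    return classifiers
```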
In operation S1044, an optimal model is selected.
The selecting the optimal model specifically includes: setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
The evaluation indexes used for evaluating the quality of the model comprise accuracy, precision, recall, F1 or ROC-AUC.
The hyper-parameters comprise the hyper-parameters of the Self-Paced Ensemble classifier and the hyper-parameters of the LightGBM base classifier.
The hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function.
The hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
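A hedged sketch of the grid search described above is given below: a small grid over the ensemble-level hyper-parameters is evaluated by ROC-AUC on the 5 cross-validation splits. The candidate values are placeholders, and train_self_paced_ensemble and skf refer to the sketches shown earlier; in the same way the grid could be extended with the LightGBM hyper-parameters listed above.

```python
from itertools import product
import numpy as np
from sklearn.metrics import roc_auc_score

param_grid = {
    "n_estimators": [5, 10, 20],   # number of base classifiers
    "k_bins": [5, 10],             # number of hardness bins
}

best_auc, best_params = -np.inf, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    fold_aucs = []
    for train_idx, test_idx in skf.split(X, y):          # the 5-fold splits from above
        clfs = train_self_paced_ensemble(X[train_idx], y[train_idx], **params)
        proba = np.mean([c.predict_proba(X[test_idx])[:, 1] for c in clfs], axis=0)
        fold_aucs.append(roc_auc_score(y[test_idx], proba))
    if np.mean(fold_aucs) > best_auc:
        best_auc, best_params = np.mean(fold_aucs), params
```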
In operation S105, the optimal model is deployed, and bond default prediction is performed using the optimal model.
The deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
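A minimal FastAPI sketch of how the trained model could be wrapped as an HTTP service is shown below; the endpoint path, payload schema and pickle-based model loading are assumptions for illustration, and the service would be started with an ASGI server such as uvicorn on the production host:

```python
import pickle
from typing import List

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("spe_model.pkl", "rb") as f:     # the list of trained base classifiers
    classifiers = pickle.load(f)

class PredictRequest(BaseModel):
    features: List[List[float]]            # one row of selected features per debt subject

@app.post("/predict")
def predict(req: PredictRequest):
    X_new = np.asarray(req.features)
    proba = np.mean([c.predict_proba(X_new)[:, 1] for c in classifiers], axis=0)
    return {"default_probability": proba.tolist()}
```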
The predicting bond default by using the optimal model specifically comprises: and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
The periodically generating prediction data for model prediction specifically includes: developing the feature extraction and feature engineering steps into stored procedures of the database, which periodically generate the prediction data used for model prediction.
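An illustrative client-side call is sketched below: the periodically generated prediction data are converted to JSON, posted to the model service, and the returned probabilities are passed on to the downstream application system. The service URL and the variables prediction_df and selected_features are hypothetical names, not part of the disclosure:

```python
import requests

payload = {"features": prediction_df[selected_features].values.tolist()}
resp = requests.post("http://model-server:8000/predict", json=payload, timeout=30)
resp.raise_for_status()
default_probability = resp.json()["default_probability"]
# push default_probability to the downstream application system for display
```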
By means of the bond default prediction method based on the class-imbalanced machine learning framework, the prediction model can be trained with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, so that more accurate bond default prediction is achieved, more bond-issuing subjects with potential risks are found, and losses caused by credit risk are avoided early.
Based on the bond default prediction method based on the category-based unbalanced machine learning framework shown in fig. 1, the present disclosure also provides a bond default prediction apparatus based on the category-based unbalanced machine learning framework. The apparatus will be described in detail below with reference to fig. 3.
Fig. 3 schematically shows a block diagram of a bond default prediction apparatus based on a category imbalance machine learning framework according to an embodiment of the present disclosure.
As shown in fig. 3, this embodiment provides a bond default prediction apparatus 300 based on a category-unbalanced machine learning framework, where the apparatus 300 includes a data acquisition module 301, a data preprocessing module 302, a feature selection module 303, a model training module 304, and a default prediction module 305.
The data acquisition module 301 is configured to acquire relevant data of a debt issue subject, where the relevant data includes annual financial statement data and default condition data of the debt issue subject.
The acquiring of the relevant data of the debt subject specifically includes: and acquiring relevant data of the debt subject from the database.
The acquiring of the data related to the debt subject from the database specifically includes:
the method comprises the steps of acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the bond subject.
The database may be a Wind database.
The data preprocessing module 302 is configured to preprocess the acquired data related to the debt subject.
The preprocessing of the acquired data related to the debt subject comprises the following steps: counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features, and binning the features.
The feature selection module 303 is configured to select a feature with the highest contribution degree to model training from the relevant data of the debt subject.
Selecting the feature with the highest contribution degree to model training from the relevant data of the debt subject, specifically comprising: and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
The model training module 304 is configured to construct a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, perform model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and select an optimal model.
The annual financial statement data of the debt subject are from two years before the year for which default is predicted; for example, to predict defaults in 2020, the 2018 annual financial statement data of the subject are selected.
Constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble method, and selecting an optimal model specifically includes the following steps:
(1) a base classifier in an ensemble learning framework is determined.
The determining of the base classifier in the ensemble learning framework specifically includes: and determining a LightGBM binary classification classifier as a base classifier in the ensemble learning framework, wherein the LightGBM binary classification classifier comprises a plurality of decision trees.
LightGBM is an efficient implementation of the gradient boosting decision tree algorithm, which ensembles decision trees by the Boosting method and obtains an optimal model through iterative training.
(2) Dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set.
In order to find the optimal model hyper-parameter combination, a part of the data set is required to be divided as a test set, and the rest is required to be used as a training set when the hyper-parameters are adjusted.
The data set including the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject is divided into a training set and a testing set, and the method specifically comprises the following steps: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
Dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set by using a 5-fold cross-validation method, and specifically comprising the following steps of: randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
(3) Model training is performed based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method.
The performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
(4) And selecting an optimal model.
The selecting the optimal model specifically includes: setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
The evaluation indexes used for evaluating the quality of the model comprise accuracy, precision, recall, F1 or ROC-AUC.
The hyper-parameters comprise the hyper-parameters of the Self-Paced Ensemble classifier and the hyper-parameters of the LightGBM base classifier.
The hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function.
The hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
The default prediction module 305 is configured to deploy the optimal model, and perform bond default prediction by using the optimal model.
The deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
The predicting bond default by using the optimal model specifically comprises: and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
The periodically generating prediction data for model prediction specifically includes: developing the feature extraction and feature engineering steps into stored procedures of the database, which periodically generate the prediction data used for model prediction.
By means of the bond default prediction device based on the class-imbalanced machine learning framework, the prediction model can be trained with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, so that more accurate bond default prediction is achieved, more bond-issuing subjects with potential risks are found, and losses caused by credit risk are avoided early.
Figure 4 schematically illustrates a block diagram of an electronic device suitable for implementing the bond default prediction method based on a class-imbalanced machine learning framework, in accordance with an embodiment of the present disclosure.
As shown in fig. 4, an electronic device 400 according to an embodiment of the present disclosure includes a processor 401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. Processor 401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 401 may also include onboard memory for caching purposes. Processor 401 may include a single processing unit or multiple processing units for performing the different actions of the method flows in accordance with embodiments of the present disclosure.
In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are stored. The processor 401, ROM 402 and RAM 403 are connected to each other by a bus 404. The processor 401 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 402 and/or the RAM 403. Note that the programs may also be stored in one or more memories other than the ROM 402 and RAM 403. The processor 401 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, electronic device 400 may also include an input/output (I/O) interface 405, input/output (I/O) interface 405 also being connected to bus 404. Electronic device 400 may also include one or more of the following components connected to I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A driver 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is mounted into the storage section 408 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM 402 and/or RAM 403 and/or one or more memories other than ROM 402 and RAM 403 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code causes the computer system to implement the bond default prediction method provided by the embodiments of the present disclosure.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 401. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, downloaded and installed through the communication section 409, and/or installed from the removable medium 411. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (33)

1. A bond default prediction method based on a class imbalance machine learning framework, the prediction method comprising:
selecting features with the highest contribution degree to model training from relevant data of a debt subject, wherein the relevant data comprises annual financial statement data and default condition data of the debt subject;
constructing a training set and a testing set by using the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject, carrying out model training by using a Self-paced Ensemble learning method based on the training set and the testing set, and selecting an optimal model;
and deploying the optimal model, and using the optimal model to predict bond default.
2. The method of claim 1, wherein prior to selecting the features that contribute most to model training from the data related to the debt subject, the method further comprises:
acquiring related data of a debt subject;
and preprocessing the acquired data related to the debt subject.
3. The method for bond default prediction based on the class imbalance machine learning framework as claimed in claim 2, wherein the acquiring of the data related to the debt subject specifically comprises:
acquiring the balance sheet, income statement, cash flow statement, financial indicators, bond default table, bond classification sector, bond concept sector and ChinaBond yield curve of bond subjects from a database, screening out all credit bond subjects and excluding urban investment bonds, acquiring the annual financial statement data of the credit bond subjects, judging whether each credit bond subject has defaulted, marking whether it has defaulted, and setting the first default label of the credit bond subject.
4. The method for bond default prediction based on the class imbalance machine learning framework of claim 2, wherein the preprocessing of the acquired data related to the debt subject specifically comprises:
calculating the missing rate of the features of the data related to the debt subject, removing abnormal values of the features of the data related to the debt subject, and binning the features of the data related to the debt subject.
5. The method for bond default prediction based on the class imbalance machine learning framework of claim 1, wherein the selecting the feature with the highest contribution degree to model training from the relevant data of the debt subjects specifically comprises:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
6. The bond default prediction method based on the class imbalance machine learning framework as claimed in claim 1, wherein the constructing a training set and a testing set, performing model training based on the training set and the testing set by using a Self-paced Ensemble learning method, and selecting an optimal model specifically comprises:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using a Self-paced Ensemble learning method;
and selecting an optimal model.
7. The method for bond default prediction based on the class imbalance machine learning framework according to claim 6, wherein the determining the base classifier in the ensemble learning framework specifically comprises: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
8. The method according to claim 6, wherein the dividing a data set containing the selected annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set specifically comprises: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
9. The method of claim 8, wherein the dividing a dataset comprising the characteristics of the annual financial statement data of the selected debt subject and the first default label of the debt subject into a training set and a testing set by using a 5-fold cross-validation method comprises:
randomly dividing a data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the rest 1 part of the data set as a test set, and repeating the selecting step five times, wherein the selected training sets are different each time.
10. The bond default prediction method based on the class imbalance machine learning framework according to any one of claims 6-9, wherein the model training based on the training set and the test set and using a Self-paced Ensemble learning method specifically comprises:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the first base classifier f_0 by using a randomly under-sampled subset N_0 of the majority samples N together with the minority samples P, wherein the size of N_0 is consistent with that of P, i.e. |N_0| = |P|;
3) taking the ensemble of all base classifiers trained so far as the current ensemble model F_i, namely:
F_i(x) = (1/i) · Σ_{j=0}^{i-1} f_j(x)
wherein i represents the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k groups B_1, B_2, …, B_k according to the classification hardness, wherein for a given model F(·) and a sample (x, y), the classification hardness of the sample is H(x, y, F), H is the classification hardness function and k is the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = (1/|B_l|) · Σ_{(x,y)∈B_l} H(x, y, F_i);
6) updating the self-paced factor
α = tan(iπ/(2n));
7) calculating the sampling weight p_l of each bin according to the classification hardness and the self-paced factor, wherein:
p_l = (1/(h_l + α)) / Σ_{m=1}^{k} (1/(h_m + α));
8) based on the sampling weights p_l, under-sampling the binned majority samples N so that the total sampled amount is consistent with the minority samples P, wherein the number of samples drawn from the l-th bin is
n_l = |P| · p_l (rounded to an integer);
9) training the base classifier f_i on the new under-sampled sample set;
10) returning to step 3) to start a new iteration, until n iterations have been completed;
11) after the n iterations are completed, combining all the base classifiers into the final ensemble classifier.
11. The method for bond default prediction based on the class imbalance machine learning framework of any one of claims 6 to 9, wherein the selecting the optimal model specifically comprises:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using GridSearch, observing indexes of the model on the test set, evaluating the quality of the model according to the results of the 5 times of model training, and selecting the optimal model.
12. The method of claim 11, wherein the evaluation metrics used to evaluate the quality of the model include accuracy, precision, recall, F1, or ROC-AUC.
13. The method of claim 11, wherein the hyper-parameters comprise the hyper-parameters of a Self-paced Ensemble classifier and the hyper-parameters of a LightGBM base classifier, wherein:
the hyper-parameters of the Self-paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier include: the number of leaves, the minimum data size per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient, and the L2 regularization coefficient.
14. The method for bond default prediction based on the class-imbalance machine learning framework according to claim 1, wherein the deploying the optimal model specifically comprises: and packaging the trained optimal model into an HTTP service by using Python fastapi, and deploying the HTTP service into a production environment server.
15. The method for bond default prediction based on the class imbalance machine learning framework of claim 1 or 14, wherein the predicting bond default by using the optimal model specifically comprises:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
16. A bond default prediction device based on a class imbalance machine learning framework, the prediction device comprising a feature selection module, a model training module, and a default prediction module, wherein:
the characteristic selection module is used for selecting the characteristics with the highest contribution degree to model training from the relevant data of the debt main body, wherein the relevant data comprises annual financial statement data and default condition data of the debt main body;
the model training module is used for constructing a training set and a testing set by using the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject, carrying out model training by using a Self-paced Ensemble learning method based on the training set and the testing set, and selecting an optimal model;
and the default prediction module is used for deploying the optimal model and using the optimal model to predict bond default.
17. The bond default prediction apparatus based on the class imbalance machine learning framework of claim 16, further comprising a data acquisition module and a data preprocessing module, wherein:
the data acquisition module is used for acquiring related data of the debt subject;
and the data preprocessing module is used for preprocessing the acquired related data of the debt subject.
18. The bond default prediction device based on the class imbalance machine learning framework of claim 17, wherein the acquiring of the data related to the debt subject specifically comprises:
acquiring the balance sheet, income statement, cash flow statement, financial indicators, bond default table, bond classification sector, bond concept sector and ChinaBond yield curve of bond subjects from a database, screening out all credit bond subjects and excluding urban investment bonds, acquiring the annual financial statement data of the credit bond subjects, judging whether each credit bond subject has defaulted, marking whether it has defaulted, and setting the first default label of the credit bond subject.
19. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 17, wherein the preprocessing of the acquired data related to the debt subject comprises:
calculating the missing rate of the features of the data related to the debt subject, removing abnormal values of the features of the data related to the debt subject, and binning the features of the data related to the debt subject.
20. The bond default prediction apparatus based on the class imbalance machine learning framework of claim 16, wherein the selecting the feature with the highest contribution degree to model training from the data related to the debt subjects specifically comprises:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
21. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 16, wherein the constructing a training set and a testing set, performing model training based on the training set and the testing set by using a Self-paced Ensemble learning method, and selecting an optimal model specifically comprises:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using a Self-paced Ensemble learning method;
and selecting an optimal model.
22. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 21, wherein the determining the base classifier in the ensemble learning framework specifically comprises: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
23. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 21, wherein the dividing of the data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set specifically comprises: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
24. The bond default prediction device according to claim 23, wherein the data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject is divided into a training set and a testing set by using the 5-fold cross-validation method, specifically comprising:
randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
25. The bond default prediction apparatus based on the class imbalance machine learning framework according to any one of claims 21-24, wherein the model training based on the training set and the test set and using a Self-paced Ensemble learning method specifically comprises:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the first base classifier f_0 by using a randomly under-sampled subset N_0 of the majority samples N together with the minority samples P, wherein the size of N_0 is consistent with that of P, i.e. |N_0| = |P|;
3) taking the ensemble of all base classifiers trained so far as the current ensemble model F_i, namely:
F_i(x) = (1/i) · Σ_{j=0}^{i-1} f_j(x)
wherein i represents the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k groups B_1, B_2, …, B_k according to the classification hardness, wherein for a given model F(·) and a sample (x, y), the classification hardness of the sample is H(x, y, F), H is the classification hardness function and k is the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = (1/|B_l|) · Σ_{(x,y)∈B_l} H(x, y, F_i);
6) updating the self-paced factor
α = tan(iπ/(2n));
7) calculating the sampling weight p_l of each bin according to the classification hardness and the self-paced factor, wherein:
p_l = (1/(h_l + α)) / Σ_{m=1}^{k} (1/(h_m + α));
8) based on the sampling weights p_l, under-sampling the binned majority samples N so that the total sampled amount is consistent with the minority samples P, wherein the number of samples drawn from the l-th bin is
n_l = |P| · p_l (rounded to an integer);
9) training the base classifier f_i on the new under-sampled sample set;
10) returning to step 3) to start a new iteration, until n iterations have been completed;
11) after the n iterations are completed, combining all the base classifiers into the final ensemble classifier.
26. The bond default prediction device of any one of claims 21-24, wherein the selecting the optimal model specifically comprises:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using GridSearch, observing indexes of the model on the test set, evaluating the quality of the model according to the results of the 5 times of model training, and selecting the optimal model.
27. The bond default prediction device according to claim 26, wherein the evaluation metrics used for evaluating the quality of the model include accuracy, precision, recall, F1 or ROC-AUC.
28. The bond default prediction device based on the class imbalance machine learning framework of claim 26, wherein the hyper-parameters comprise the hyper-parameters of a Self-paced Ensemble classifier and the hyper-parameters of a LightGBM base classifier, wherein:
the hyper-parameters of the Self-paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier include: the number of leaves, the minimum data size per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient, and the L2 regularization coefficient.
29. The device for bond default prediction based on class-imbalance machine learning framework according to claim 16, wherein the deploying the optimal model specifically comprises: and packaging the trained optimal model into an HTTP service by using Python fastapi, and deploying the HTTP service into a production environment server.
30. The bond default prediction device based on the class imbalance machine learning framework of claim 16 or 29, wherein the performing the bond default prediction by using the optimal model specifically comprises:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
31. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-15.
32. A computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-15.
33. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-15.
CN202210407035.1A 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework Pending CN114676932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210407035.1A CN114676932A (en) 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210407035.1A CN114676932A (en) 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework

Publications (1)

Publication Number Publication Date
CN114676932A true CN114676932A (en) 2022-06-28

Family

ID=82078455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210407035.1A Pending CN114676932A (en) 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework

Country Status (1)

Country Link
CN (1) CN114676932A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846521A (en) * 2018-06-22 2018-11-20 西安电子科技大学 Shield-tunneling construction unfavorable geology type prediction method based on Xgboost
CN111340236A (en) * 2020-03-03 2020-06-26 中债金融估值中心有限公司 Bond default prediction method based on bond valuation data and integrated machine learning
CN111524015A (en) * 2020-04-10 2020-08-11 易方达基金管理有限公司 Method and device for training prediction model, computer equipment and readable storage medium
CN112767172A (en) * 2021-01-22 2021-05-07 上海析鲸信息科技有限公司 Bond default early warning and identification technology based on machine learning model algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHINING LIU et al.: "Self-paced Ensemble for Highly Imbalanced Massive Data Classification", 2020 IEEE 36th International Conference on Data Engineering (ICDE) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306958A (en) * 2022-09-13 2023-06-23 中债金科信息技术有限公司 Training method of default risk prediction model, default risk prediction method and device
CN115907972A (en) * 2023-01-16 2023-04-04 齐鲁工业大学(山东省科学院) Unbalanced credit investigation data risk assessment method and system based on double self-walking learning
CN115907972B (en) * 2023-01-16 2023-09-12 齐鲁工业大学(山东省科学院) Unbalanced credit investigation data risk assessment method and system based on double self-step learning
CN117763356A (en) * 2023-12-26 2024-03-26 中国地质科学院地质力学研究所 Rapid earthquake phase identification method based on LightGBM algorithm

Similar Documents

Publication Publication Date Title
CN114676932A (en) Bond default prediction method and device based on class imbalance machine learning framework
Fernández-Rodríguez et al. Business and institutional determinants of Effective Tax Rate in emerging economies
Girma et al. Evaluating the foreign ownership wage premium using a difference-in-differences matching approach
Titko et al. Measuring bank efficiency: DEA application
CN107545422B (en) Cashing detection method and device
Rezende et al. Predicting financial distress in publicly-traded companies
CN109583966A (en) A kind of high value customer recognition methods, system, equipment and storage medium
US20170221075A1 (en) Fraud inspection framework
CN113919886A (en) Data characteristic combination pricing method and system based on summer pril value and electronic equipment
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
Balcilar et al. South African stock returns predictability using domestic and global economic policy uncertainty: Evidence from a nonparametric causality-in-quantiles approach
Korhonen et al. Factors driving investment in planted forests: a comparison between OECD and non-OECD countries
CN111626855A (en) Bond credit interest difference prediction method and system
CN112434862B (en) Method and device for predicting financial dilemma of marketing enterprises
Gresnigt et al. Specification testing in hawkes models
Narimani et al. A multivariate decomposition–ensemble model for estimating long-term rainfall dynamics
CN114707883A (en) Bond default prediction method, device, equipment and medium based on time sequence characteristics
CN111815435A (en) Visualization method, device, equipment and storage medium for group risk characteristics
Ruotsalainen et al. The effects of sample plot selection strategy and the number of sample plots on inoptimality losses in forest management planning based on airborne laser scanning data
CN114092215B (en) Auditing method and system for export tax refund loan
Brumma et al. Modeling downturn LGD in a Basel framework
Rahman et al. Detecting accounting fraud in family firms: Evidence from machine learning approaches
Niknya et al. Financial distress prediction of Tehran Stock Exchange companies using support vector machine
US20200265521A1 (en) Multimedia risk summarizer
CN111612626A (en) Method and device for preprocessing bond evaluation data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wu Hao

Inventor after: Yang Fan

Inventor after: Sun Yanjie

Inventor before: Li Zi

Inventor before: Yang Fan

Inventor before: Wu Hao

Inventor before: Sun Yanjie

CB03 Change of inventor or designer information
RJ01 Rejection of invention patent application after publication

Application publication date: 20220628

RJ01 Rejection of invention patent application after publication