CN114676932A - Bond default prediction method and device based on class imbalance machine learning framework - Google Patents

Bond default prediction method and device based on class imbalance machine learning framework

Info

Publication number
CN114676932A
Authority
CN
China
Prior art keywords
bond
data
training
model
default
Prior art date
Legal status
Pending
Application number
CN202210407035.1A
Other languages
Chinese (zh)
Inventor
李孜
杨帆
吴皓
孙彦杰
Current Assignee
Icbc Credit Suisse Fund Management Co ltd
Original Assignee
Icbc Credit Suisse Fund Management Co ltd
Priority date
Filing date
Publication date
Application filed by Icbc Credit Suisse Fund Management Co ltd
Priority to CN202210407035.1A
Publication of CN114676932A

Classifications

    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F 18/2433: Classification techniques relating to the number of classes; single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • G06N 20/20: Machine learning; ensemble learning
    • G06N 5/01: Computing arrangements using knowledge-based models; dynamic search techniques; heuristics; dynamic trees; branch-and-bound
    • G06Q 10/0635: Risk analysis of enterprise or organisation activities
    • G06Q 40/03: Finance; credit; loans; processing thereof


Abstract

The present disclosure provides a bond default prediction method, apparatus, device, medium and product based on a class-imbalanced machine learning framework. The method comprises: acquiring relevant data of a debt subject; preprocessing the acquired relevant data of the debt subject; selecting, from the relevant data of the debt subject, the features that contribute most to model training; constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model; and deploying the optimal model and using it to predict bond defaults. By selecting the features that contribute most to model training and training the prediction model with the Self-Paced Ensemble method, the method and apparatus achieve more accurate bond default prediction.

Description

Bond default prediction method and device based on class imbalance machine learning framework
Technical Field
The invention relates to the field of financial technology, and in particular to a bond default prediction method, device, equipment, medium and product based on a class-imbalanced machine learning framework.
Background
With the rapid development of China's bond market, credit risk in the bond market has continuously accumulated, bond default events have occurred more frequently year by year, and defaults have become the norm. The failure of the "11 Chaori Bond" to pay interest in 2014 is widely regarded as the first default of a credit bond, and the number and amount of defaulted bonds grew from 6 bonds and 1.34 billion yuan in 2014 to 155 bonds and 175.772 billion yuan in 2020. Defaulting issuers have also spread from small and micro enterprises to large and medium-sized enterprises, state-owned enterprises and listed companies. The number of highly rated defaulters has likewise increased: in 2020, issuers rated AA accounted for 27.59% of first-time defaulters, already exceeding the 17.20% share of issuers rated BBB and below. Geographically, defaulting issuers have spread from the southeast coastal regions to central and western China, and the affected industries include diversified holdings, oil and gas refining and marketing, conglomerates, construction and engineering, food processing and meat, real estate development and the like.
Fixed-income assets account for a large share of the investment targets of the asset management industry. When a bond defaults, holders lose principal or interest, and the valuations of all bonds issued by the defaulting subject fall sharply, which drags down the net value of products and triggers redemptions by some clients. Client redemptions force the product manager to sell the most liquid assets first, so investors who redeem early receive the cash from those assets, while investors who redeem later can only bear the losses on the remaining illiquid assets. This unfairness accelerates redemptions, increases the liquidity pressure on asset management products, and drives the product net value down further. To prevent this, it is necessary to predict bond defaults and avoid, in advance, bonds that may "blow up".
Some machine learning-based bond default prediction methods have been developed in the industry. These methods usually use data such as the financial statements, credit investigation records, credit ratings and company announcements of bond-issuing subjects, and train a machine learning model and search for optimal parameters after data cleaning and feature engineering. However, such methods suffer from severely imbalanced data classes: defaulting debt subjects account for a very low proportion of all credit bond debt subjects, with only about 1% of debt subjects ever defaulting on a bond. This class imbalance in the training data biases the model toward the majority class during training and degrades the prediction effect.
Disclosure of Invention
In view of the above problem that imbalanced data classes in training a machine learning model lead to poor bond default prediction, the present disclosure provides a bond default prediction method, device, apparatus, medium and product based on a class-imbalanced machine learning framework, which achieves more accurate bond default prediction by selecting the features that contribute most to model training and training the prediction model with the Self-Paced Ensemble method.
A first aspect of the present disclosure provides a bond default prediction method based on a class imbalance machine learning framework, the prediction method comprising:
selecting features with the highest contribution degree to model training from relevant data of a debt subject, wherein the relevant data comprises annual financial statement data and default condition data of the debt subject;
constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model;
and deploying the optimal model, and using the optimal model to predict bond default.
According to an embodiment of the present disclosure, before the selecting a feature with the highest contribution degree to model training from the data related to the debt subject, the method further includes:
acquiring related data of a debt subject;
and preprocessing the acquired data related to the debt subject.
According to the embodiment of the disclosure, the acquiring of the relevant data of the debt subject specifically includes:
acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the credit bond debt subject.
According to an embodiment of the present disclosure, the preprocessing the acquired data related to the debt subject includes:
counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features of the relevant data of the debt subject, and binning the features of the relevant data of the debt subject.
According to an embodiment of the present disclosure, the selecting a feature with the highest contribution degree to model training from the relevant data of the debt subject specifically includes:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
According to the embodiment of the disclosure, the constructing a training set and a test set, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model specifically includes:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method;
and selecting an optimal model.
According to an embodiment of the present disclosure, the determining a base classifier in an ensemble learning framework specifically includes: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set specifically includes: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set by using a 5-fold cross-validation method specifically includes:
randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
According to the embodiment of the present disclosure, the performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
According to an embodiment of the present disclosure, the selecting an optimal model specifically includes:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
According to the embodiment of the disclosure, the evaluation index used for evaluating the quality of the model comprises accuracy, precision, recall, F1 or ROC-AUC.
According to an embodiment of the present disclosure, the hyper-parameters include hyper-parameters of the Self-Paced Ensemble classifier and hyper-parameters of the LightGBM base classifier, wherein:
the hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
According to an embodiment of the present disclosure, the deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
According to an embodiment of the present disclosure, the predicting bond default by using the optimal model specifically includes:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
A second aspect of the present disclosure provides a bond default prediction apparatus based on a category-unbalanced machine learning framework, the prediction apparatus including a feature selection module, a model training module, and a default prediction module, wherein:
the characteristic selection module is used for selecting the characteristics with the highest contribution degree to model training from the relevant data of the debt main body, wherein the relevant data comprises annual financial statement data and default condition data of the debt main body;
the model training module is used for constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model;
and the default prediction module is used for deploying the optimal model and predicting bond default by using the optimal model.
According to an embodiment of the present disclosure, the apparatus further comprises a data acquisition module and a data preprocessing module, wherein:
the data acquisition module is used for acquiring related data of the debt subject;
and the data preprocessing module is used for preprocessing the acquired related data of the debt subject.
According to the embodiment of the disclosure, the acquiring of the relevant data of the debt subject specifically includes:
acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the credit bond debt subject.
According to an embodiment of the present disclosure, the preprocessing the acquired data related to the debt subject includes:
counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features of the relevant data of the debt subject, and binning the features of the relevant data of the debt subject.
According to an embodiment of the present disclosure, the selecting a feature with a highest contribution degree to model training from the relevant data of the debt subject specifically includes:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
According to the embodiment of the disclosure, the constructing a training set and a test set, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and selecting an optimal model specifically includes:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method;
and selecting an optimal model.
According to an embodiment of the present disclosure, the determining a base classifier in an ensemble learning framework specifically includes: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set specifically includes: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
According to an embodiment of the present disclosure, the dividing a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set by using a 5-fold cross-validation method specifically includes:
randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
According to the embodiment of the present disclosure, the performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
According to an embodiment of the present disclosure, the selecting an optimal model specifically includes:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
According to the embodiment of the disclosure, the evaluation index used for evaluating the quality of the model comprises accuracy, precision, recall, F1 or ROC-AUC.
According to an embodiment of the present disclosure, the hyper-parameters include hyper-parameters of the Self-Paced Ensemble classifier and hyper-parameters of the LightGBM base classifier, wherein:
the hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
According to an embodiment of the present disclosure, the deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
According to an embodiment of the present disclosure, the using the optimal model to predict bond default specifically includes:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the bond default prediction method based on a class-imbalanced machine learning framework as described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the bond default prediction method based on a class-imbalanced machine learning framework as described above.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program that, when executed by a processor, implements the bond default prediction method based on a class-imbalanced machine learning framework as described above.
Compared with the prior art, the bond default prediction method, device, equipment, medium and product based on the class-imbalanced machine learning framework train the prediction model with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, so that more accurate bond default prediction is achieved, more bond-issuing subjects with potential risks are found, and losses caused by credit risk are avoided as early as possible.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates a flow chart of a bond default prediction method based on a category imbalance machine learning framework according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram for model training and selecting an optimal model using a training set and a test set, according to an embodiment of the disclosure;
fig. 3 schematically illustrates a block diagram of a bond default prediction apparatus based on a category-based unbalanced machine learning framework according to an embodiment of the present disclosure;
fig. 4 schematically illustrates a block diagram of an electronic device implementing a method for bond default prediction based on a category imbalance machine learning framework according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
Compared with the prior art, the disclosure provides a bond default prediction method, device, equipment, medium and product based on a class-imbalanced machine learning framework. The bond default prediction method based on the class-imbalanced machine learning framework comprises the following steps: acquiring relevant data of a debt subject; preprocessing the acquired relevant data of the debt subject; selecting the features that contribute most to model training from the relevant data of the debt subject; constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and a first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method (also called the self-paced ensemble learning method), and selecting an optimal model; and deploying the optimal model and using the optimal model to predict bond defaults.
According to the scheme, the prediction model is trained with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, thereby achieving more accurate bond default prediction, finding more debt subjects with potential risks, and avoiding losses caused by credit risk as early as possible.
A bond default prediction method, apparatus, device, medium, and product based on a category imbalance machine learning framework according to embodiments of the present disclosure will be described in detail below with reference to fig. 1-4.
Fig. 1 schematically illustrates a flowchart of a bond default prediction method based on a category imbalance machine learning framework according to an embodiment of the present disclosure.
As shown in fig. 1, this embodiment provides a bond default prediction method based on a category-unbalanced machine learning framework, where the method includes operations S101 to S105, and specifically the following steps:
in operation S101, related data of a debt subject is acquired, the related data including annual financial statement data and default condition data of the debt subject.
The acquiring of the relevant data of the debt subject specifically includes: and acquiring relevant data of the debt subject from the database.
The acquiring of the data related to the debt subject from the database specifically includes:
the method comprises the steps of acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the bond subject.
The database may be a Wind database.
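As an illustrative sketch only (the table layout and column names below, such as is_credit_bond, is_chengtou and default_date, are assumptions and not part of the disclosure), the screening and first-default labelling step could be expressed in Python with pandas as follows:

```python
# Illustrative sketch: build the issuer-level data set and first-default labels
# from raw tables pulled from a database such as Wind. Column names are hypothetical.
import pandas as pd

def build_issuer_dataset(financials: pd.DataFrame, defaults: pd.DataFrame) -> pd.DataFrame:
    """financials: one row per (issuer_id, year) of annual statement features;
    defaults: one row per recorded bond default with issuer_id and default_date."""
    # keep credit bond issuers only and drop urban investment (chengtou) issuers
    credit = financials[(financials["is_credit_bond"] == 1) &
                        (financials["is_chengtou"] == 0)].copy()

    # first default year per issuer (NaN if the issuer never defaulted)
    first_default = (defaults.assign(year=pd.to_datetime(defaults["default_date"]).dt.year)
                             .groupby("issuer_id")["year"].min()
                             .rename("first_default_year"))

    credit = credit.merge(first_default, on="issuer_id", how="left")
    # first-default label: 1 in the year the issuer first defaults, otherwise 0
    credit["first_default_label"] = (credit["year"] == credit["first_default_year"]).astype(int)
    return credit
```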
In operation S102, the acquired data related to the debt subject is preprocessed.
The preprocessing of the acquired data related to the debt subject comprises: counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features, and binning the features.
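A minimal preprocessing sketch is shown below, assuming a pandas DataFrame of numeric features; the missing-rate threshold, the 1st/99th-percentile outlier clipping and the equal-frequency binning rule are illustrative choices rather than values specified in the disclosure:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing: float = 0.5, n_bins: int = 10) -> pd.DataFrame:
    # 1) drop features whose missing rate exceeds the threshold
    missing_rate = df.isna().mean()
    df = df.loc[:, missing_rate[missing_rate <= max_missing].index]

    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        s = df[col]
        # 2) clip outliers to the 1st/99th percentiles
        lo, hi = s.quantile(0.01), s.quantile(0.99)
        s = s.clip(lo, hi)
        # 3) equal-frequency binning of the clipped feature
        out[col] = pd.qcut(s, q=n_bins, labels=False, duplicates="drop")
    return out
```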
In operation S103, a feature having the highest contribution degree to model training is selected from the data related to the debt subject.
Selecting the feature with the highest contribution degree to model training from the relevant data of the debt subject, specifically comprising: and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
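As one concrete example of this screening, the Information Value (IV) of each binned feature against the binary default label can be computed and ranked; the sketch below assumes the features have already been binned as in the preprocessing step, and the selection threshold is an assumption rather than a value given in the disclosure:

```python
import numpy as np
import pandas as pd

def information_value(binned_feature: pd.Series, label: pd.Series) -> float:
    tab = pd.crosstab(binned_feature, label)             # rows: bins, columns: 0 / 1
    good = (tab[0] / tab[0].sum()).clip(lower=1e-6)      # distribution of non-defaults
    bad = (tab[1] / tab[1].sum()).clip(lower=1e-6)       # distribution of defaults
    woe = np.log(bad / good)                             # weight of evidence per bin
    return float(((bad - good) * woe).sum())

def select_features(df: pd.DataFrame, label: pd.Series, min_iv: float = 0.02) -> list:
    ivs = {c: information_value(df[c], label) for c in df.columns}
    return [c for c, iv in sorted(ivs.items(), key=lambda kv: -kv[1]) if iv >= min_iv]
```

In practice the IV ranking would be combined with the correlation, entropy and Gini measures mentioned above and reviewed against business experience before the final feature list is fixed.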
In operation S104, a training set and a test set are constructed by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, model training is performed based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and an optimal model is selected.
The annual financial statement data of the debt subject are from two years before the year for which default is predicted; for example, to predict defaults in 2020, the 2018 annual financial statement data of the subject are selected.
As shown in fig. 2, the constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble method, and selecting an optimal model specifically includes:
in operation S1041, a base classifier in the ensemble learning framework is determined.
The determining of the base classifier in the ensemble learning framework specifically includes: and determining a LightGBM binary classifier as a base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
LightGBM is an efficient implementation of the gradient boosting decision tree algorithm, which ensembles decision trees by the Boosting method and obtains an optimal model through iterative training.
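A sketch of such a LightGBM binary base classifier is shown below; the hyper-parameter values are placeholders, since the actual values are chosen later by the grid search described in operation S1044:

```python
from lightgbm import LGBMClassifier

base_clf = LGBMClassifier(
    objective="binary",      # binary classification: default vs. non-default
    boosting_type="gbdt",    # gradient-boosted decision trees
    n_estimators=100,        # number of decision trees
    num_leaves=31,           # number of leaves per tree
    min_child_samples=20,    # minimum amount of data in one leaf
    max_depth=-1,            # maximum tree depth (-1 means no limit)
    learning_rate=0.1,
    reg_alpha=0.0,           # L1 regularization coefficient
    reg_lambda=0.0,          # L2 regularization coefficient
)
```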
In operation S1042, a data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject is divided into a training set and a test set.
In order to find the optimal model hyper-parameter combination, a part of the data set is required to be divided as a test set, and the rest is required to be used as a training set when the hyper-parameters are adjusted.
The data set including the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject is divided into a training set and a testing set, and the method specifically comprises the following steps: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
Dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set by using a 5-fold cross-validation method, and specifically comprising the following steps of: randomly dividing a data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the rest 1 part of the data set as a test set, and repeating the selecting step five times, wherein the selected training sets are different each time.
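A minimal 5-fold split sketch using scikit-learn is given below; X and y are assumed to be arrays holding the selected annual statement features and the first-default labels, and StratifiedKFold is used here as an illustrative choice so that the rare default class appears in every fold (the disclosure itself only requires a random 5-fold split):

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # each of the 5 folds serves as the test set exactly once
```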
In operation S1043, model training is performed based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method.
The performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes the following steps (a code sketch of this procedure is given after the steps):
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
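The following is a compact, from-scratch sketch of the self-paced ensemble training loop in steps 1) to 11), using LightGBM base classifiers; it is illustrative rather than the patented implementation, and details such as using the absolute prediction error as the hardness function and the equal-width hardness binning are assumptions. X and y are assumed to be numpy arrays.

```python
import numpy as np
from lightgbm import LGBMClassifier

def train_self_paced_ensemble(X, y, n_estimators=10, k_bins=10, random_state=0):
    rng = np.random.RandomState(random_state)
    P_idx = np.flatnonzero(y == 1)            # minority (default) samples
    N_idx = np.flatnonzero(y == 0)            # majority (non-default) samples
    classifiers = []

    def fit_base(majority_subset):
        idx = np.concatenate([majority_subset, P_idx])
        clf = LGBMClassifier(objective="binary")
        clf.fit(X[idx], y[idx])
        return clf

    def ensemble_proba(clfs, X_):
        return np.mean([c.predict_proba(X_)[:, 1] for c in clfs], axis=0)

    # step 2): first base classifier on a random undersample of N with |N_0| = |P|
    classifiers.append(fit_base(rng.choice(N_idx, size=len(P_idx), replace=False)))

    for i in range(1, n_estimators):
        # steps 3)-4): hardness of each majority sample under the current ensemble
        hardness = np.abs(ensemble_proba(classifiers, X[N_idx]) - y[N_idx])
        bins = np.minimum((hardness / (hardness.max() + 1e-12) * k_bins).astype(int),
                          k_bins - 1)

        # steps 5)-7): average hardness per bin, self-paced factor, sampling weights
        alpha = np.tan(i * np.pi / (2 * n_estimators))
        weights = np.zeros(k_bins)
        for b in range(k_bins):
            members = hardness[bins == b]
            if len(members) > 0:
                weights[b] = 1.0 / (members.mean() + alpha)
        weights /= weights.sum()

        # step 8): undersample about |P| * p_l majority samples from each bin
        sampled = []
        for b in range(k_bins):
            pool = N_idx[bins == b]
            n_b = min(len(pool), int(round(weights[b] * len(P_idx))))
            if n_b > 0:
                sampled.append(rng.choice(pool, size=n_b, replace=False))
        majority_subset = (np.concatenate(sampled) if sampled
                           else rng.choice(N_idx, size=len(P_idx), replace=False))

        # step 9): train the next base classifier on the new undersampled set plus P
        classifiers.append(fit_base(majority_subset))

    # step 11): the final model averages the probabilities of all base classifiers
    return classifiers
```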
In operation S1044, an optimal model is selected.
The selecting the optimal model specifically includes: setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
The evaluation indexes used for evaluating the quality of the model comprise accuracy, precision, recall, F1 or ROC-AUC.
The hyper-parameters comprise the hyper-parameters of the Self-Paced Ensemble classifier and the hyper-parameters of the LightGBM base classifier.
The hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function.
The hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
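A hedged sketch of the grid search described above is given below: a small grid over the ensemble-level hyper-parameters is evaluated by ROC-AUC on the 5 cross-validation splits. The candidate values are placeholders, and train_self_paced_ensemble and skf refer to the sketches shown earlier; in the same way the grid could be extended with the LightGBM hyper-parameters listed above.

```python
from itertools import product
import numpy as np
from sklearn.metrics import roc_auc_score

param_grid = {
    "n_estimators": [5, 10, 20],   # number of base classifiers
    "k_bins": [5, 10],             # number of hardness bins
}

best_auc, best_params = -np.inf, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    fold_aucs = []
    for train_idx, test_idx in skf.split(X, y):          # the 5-fold splits from above
        clfs = train_self_paced_ensemble(X[train_idx], y[train_idx], **params)
        proba = np.mean([c.predict_proba(X[test_idx])[:, 1] for c in clfs], axis=0)
        fold_aucs.append(roc_auc_score(y[test_idx], proba))
    if np.mean(fold_aucs) > best_auc:
        best_auc, best_params = np.mean(fold_aucs), params
```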
In operation S105, the optimal model is deployed, and bond default prediction is performed using the optimal model.
The deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
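A minimal FastAPI sketch of how the trained model could be wrapped as an HTTP service is shown below; the endpoint path, payload schema and pickle-based model loading are assumptions for illustration, and the service would be started with an ASGI server such as uvicorn on the production host:

```python
import pickle
from typing import List

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("spe_model.pkl", "rb") as f:     # the list of trained base classifiers
    classifiers = pickle.load(f)

class PredictRequest(BaseModel):
    features: List[List[float]]            # one row of selected features per debt subject

@app.post("/predict")
def predict(req: PredictRequest):
    X_new = np.asarray(req.features)
    proba = np.mean([c.predict_proba(X_new)[:, 1] for c in classifiers], axis=0)
    return {"default_probability": proba.tolist()}
```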
The predicting bond default by using the optimal model specifically comprises: and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
The periodically generating prediction data for model prediction specifically includes: developing the feature extraction and feature engineering steps into stored procedures of the database, which periodically generate the prediction data used for model prediction.
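An illustrative client-side call is sketched below: the periodically generated prediction data are converted to JSON, posted to the model service, and the returned probabilities are passed on to the downstream application system. The service URL and the variables prediction_df and selected_features are hypothetical names, not part of the disclosure:

```python
import requests

payload = {"features": prediction_df[selected_features].values.tolist()}
resp = requests.post("http://model-server:8000/predict", json=payload, timeout=30)
resp.raise_for_status()
default_probability = resp.json()["default_probability"]
# push default_probability to the downstream application system for display
```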
By means of the bond default prediction method based on the class-imbalanced machine learning framework, the prediction model can be trained with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, so that more accurate bond default prediction is achieved, more bond-issuing subjects with potential risks are found, and losses caused by credit risk are avoided early.
Based on the bond default prediction method based on the category-based unbalanced machine learning framework shown in fig. 1, the present disclosure also provides a bond default prediction apparatus based on the category-based unbalanced machine learning framework. The apparatus will be described in detail below with reference to fig. 3.
Fig. 3 schematically shows a block diagram of a bond default prediction apparatus based on a category imbalance machine learning framework according to an embodiment of the present disclosure.
As shown in fig. 3, this embodiment provides a bond default prediction apparatus 300 based on a category-unbalanced machine learning framework, where the apparatus 300 includes a data acquisition module 301, a data preprocessing module 302, a feature selection module 303, a model training module 304, and a default prediction module 305.
The data acquisition module 301 is configured to acquire relevant data of a debt issue subject, where the relevant data includes annual financial statement data and default condition data of the debt issue subject.
The acquiring of the relevant data of the debt subject specifically includes: and acquiring relevant data of the debt subject from the database.
The acquiring of the data related to the debt subject from the database specifically includes:
the method comprises the steps of acquiring, from a database, the balance sheets, income statements, cash flow statements, financial indicators, bond default tables, bond sector classifications, bond concept sectors and ChinaBond yield curves of bond subjects, screening out all credit bond debt subjects and excluding urban investment (chengtou) bonds, acquiring the annual financial statement data of the credit bond debt subjects, determining whether each credit bond debt subject has defaulted, marking the default status, and setting a first default label for the bond subject.
The database may be a Wind database.
The data preprocessing module 302 is configured to preprocess the acquired data related to the debt subject.
The preprocessing of the acquired data related to the debt subject comprises the following steps: counting the missing rate of each feature of the relevant data of the debt subject, removing outliers from the features, and binning the features.
The feature selection module 303 is configured to select a feature with the highest contribution degree to model training from the relevant data of the debt subject.
Selecting the feature with the highest contribution degree to model training from the relevant data of the debt subject, specifically comprising: and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
The model training module 304 is configured to construct a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, perform model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method, and select an optimal model.
The annual financial statement data of the debt subject are from two years before the year for which default is predicted; for example, to predict defaults in 2020, the 2018 annual financial statement data of the subject are selected.
Constructing a training set and a test set by using the selected features of the annual financial statement data of the debt subject and the first default label of the debt subject, performing model training based on the training set and the test set by using the Self-Paced Ensemble method, and selecting an optimal model specifically includes the following steps:
(1) a base classifier in an ensemble learning framework is determined.
The determining of the base classifier in the ensemble learning framework specifically includes: and determining a LightGBM binary classification classifier as a base classifier in the ensemble learning framework, wherein the LightGBM binary classification classifier comprises a plurality of decision trees.
LightGBM is an efficient implementation of the gradient boosting decision tree algorithm, which ensembles decision trees by the Boosting method and obtains an optimal model through iterative training.
(2) Dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set.
In order to find the optimal model hyper-parameter combination, a part of the data set is required to be divided as a test set, and the rest is required to be used as a training set when the hyper-parameters are adjusted.
The data set including the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject is divided into a training set and a testing set, and the method specifically comprises the following steps: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
Dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set by using a 5-fold cross-validation method, and specifically comprising the following steps of: randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
(3) Model training is performed based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method.
The performing model training based on the training set and the test set by using the Self-Paced Ensemble ensemble learning method specifically includes:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the 1st base classifier f_0 using a randomly undersampled subset N_0 of the majority samples N together with the minority samples P, the size of N_0 being consistent with that of P, i.e. |N_0| = |P|;
3) taking the average of all base classifiers trained so far as the ensemble model F_i, namely:
F_i(x) = \frac{1}{i} \sum_{j=0}^{i-1} f_j(x)
wherein i denotes the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k bins B_1, B_2, ..., B_k according to classification hardness, wherein for a given model F(·) and sample (x, y) the classification hardness of the sample is H(x, y, F), H being the classification hardness function and k being the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = \frac{1}{|B_l|} \sum_{(x, y) \in B_l} H(x, y, F_i)
6) updating the self-paced factor
\alpha = \tan\left(\frac{i \pi}{2n}\right)
7) calculating the sampling weight p_l of each bin from the classification hardness and the self-paced factor, wherein:
p_l = \frac{1 / (h_l + \alpha)}{\sum_{m=1}^{k} 1 / (h_m + \alpha)}
8) undersampling the binned majority samples N based on the sampling weights p_l, the total sample size being consistent with the minority samples P, so that the number of samples drawn from the l-th bin is:
|P| \cdot p_l
9) training the base classifier f_i on the newly undersampled sample set together with the minority samples P;
10) returning to step 3) to continue a new iteration until n iterations have been completed in step 3);
11) after the n iterations are completed, integrating all the base classifiers into one ensemble classifier.
(4) And selecting an optimal model.
The selecting the optimal model specifically includes: setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using Grid Search, observing indexes of the model on the test set, evaluating the quality of the model according to results of the 5 times of model training, and selecting an optimal model.
The evaluation indexes used for evaluating the quality of the model comprise accuracy, precision, recall, F1 or ROC-AUC.
The hyper-parameters comprise the hyper-parameters of the Self-Paced Ensemble classifier and the hyper-parameters of the LightGBM base classifier.
The hyper-parameters of the Self-Paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function.
The hyper-parameters of the LightGBM base classifier comprise: the number of leaves, the minimum amount of data per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient and the L2 regularization coefficient.
The default prediction module 305 is configured to deploy the optimal model, and perform bond default prediction by using the optimal model.
The deploying the optimal model specifically includes: packaging the trained optimal model into an HTTP service with Python FastAPI and deploying it onto a production environment server.
The predicting bond default by using the optimal model specifically comprises: and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
The periodically generating prediction data for model prediction specifically includes: developing the feature extraction and feature engineering steps into stored procedures of the database, which periodically generate the prediction data used for model prediction.
By means of the bond default prediction device based on the class-imbalanced machine learning framework, the prediction model can be trained with the Self-Paced Ensemble method, which trains the model serially by combining undersampling with Boosting ensembling; the undersampling in each training iteration selects the training samples that contribute most to the current ensemble model rather than sampling randomly, so that more accurate bond default prediction is achieved, more bond-issuing subjects with potential risks are found, and losses caused by credit risk are avoided early.
Figure 4 schematically illustrates a block diagram of an electronic device suitable for implementing the bond default prediction method based on a class-imbalanced machine learning framework, in accordance with an embodiment of the present disclosure.
As shown in fig. 4, an electronic device 400 according to an embodiment of the present disclosure includes a processor 401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. Processor 401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 401 may also include onboard memory for caching purposes. Processor 401 may include a single processing unit or multiple processing units for performing the different actions of the method flows in accordance with embodiments of the present disclosure.
In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are stored. The processor 401, ROM 402 and RAM 403 are connected to each other by a bus 404. The processor 401 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 402 and/or the RAM 403. Note that the programs may also be stored in one or more memories other than the ROM 402 and RAM 403. The processor 401 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, electronic device 400 may also include an input/output (I/O) interface 405, input/output (I/O) interface 405 also being connected to bus 404. Electronic device 400 may also include one or more of the following components connected to I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A driver 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is mounted into the storage section 408 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM 402 and/or RAM 403 and/or one or more memories other than ROM 402 and RAM 403 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code causes the computer system to implement the bond default prediction method provided by the embodiments of the present disclosure.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 401. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, downloaded and installed through the communication section 409, and/or installed from the removable medium 411. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (33)

1. A bond default prediction method based on a class imbalance machine learning framework, the prediction method comprising:
selecting features with the highest contribution degree to model training from relevant data of a debt subject, wherein the relevant data comprises annual financial statement data and default condition data of the debt subject;
constructing a training set and a testing set by using the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject, carrying out model training by using a Self-paced Ensemble learning method based on the training set and the testing set, and selecting an optimal model;
and deploying the optimal model, and using the optimal model to predict bond default.
2. The method of claim 1, wherein prior to selecting the features that contribute most to model training from the data related to the debt subject, the method further comprises:
acquiring related data of a debt subject;
and preprocessing the acquired data related to the debt subject.
3. The method for bond default prediction based on the class imbalance machine learning framework as claimed in claim 2, wherein the acquiring of the data related to the debt subject specifically comprises:
acquiring the balance sheet, income statement, cash flow statement, financial indicators, bond default table, bond classification sector, bond concept sector and ChinaBond yield curve of bond subjects from a database, screening out all credit bond subjects and excluding urban investment bonds, acquiring the annual financial statement data of the credit bond subjects, judging whether each credit bond subject has defaulted, marking whether it has defaulted, and setting the first default label of the credit bond subject.
4. The method for bond default prediction based on the class imbalance machine learning framework of claim 2, wherein the preprocessing of the acquired data related to the debt subject specifically comprises:
calculating the missing rate of the features of the data related to the debt subject, removing abnormal values of the features of the data related to the debt subject, and binning the features of the data related to the debt subject.
5. The method for bond default prediction based on the class imbalance machine learning framework of claim 1, wherein the selecting the feature with the highest contribution degree to model training from the relevant data of the debt subjects specifically comprises:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
6. The bond default prediction method based on the class imbalance machine learning framework as claimed in claim 1, wherein the constructing a training set and a testing set, performing model training based on the training set and the testing set by using a Self-paced Ensemble learning method, and selecting an optimal model specifically comprises:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using a Self-paced Ensemble learning method;
and selecting an optimal model.
7. The method for bond default prediction based on the class imbalance machine learning framework according to claim 6, wherein the determining the base classifier in the ensemble learning framework specifically comprises: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
8. The method according to claim 6, wherein the dividing a data set containing the selected annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set specifically comprises: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
9. The method of claim 8, wherein the dividing a dataset comprising the characteristics of the annual financial statement data of the selected debt subject and the first default label of the debt subject into a training set and a testing set by using a 5-fold cross-validation method comprises:
randomly dividing a data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the rest 1 part of the data set as a test set, and repeating the selecting step five times, wherein the selected training sets are different each time.
10. The bond default prediction method based on the class imbalance machine learning framework according to any one of claims 6-9, wherein the model training based on the training set and the test set and using a Self-paced Ensemble learning method specifically comprises:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the first base classifier f_0 by using a randomly under-sampled subset N_0 of the majority samples N together with the minority samples P, wherein the size of N_0 is consistent with that of P, i.e. |N_0| = |P|;
3) taking the ensemble of all base classifiers trained so far as the current ensemble model F_i, namely:
F_i(x) = (1/i) · Σ_{j=0}^{i-1} f_j(x)
wherein i represents the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k groups B_1, B_2, …, B_k according to the classification hardness, wherein for a given model F(·) and a sample (x, y), the classification hardness of the sample is H(x, y, F), H is the classification hardness function and k is the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = (1/|B_l|) · Σ_{(x,y)∈B_l} H(x, y, F_i);
6) updating the self-paced factor
α = tan(iπ/(2n));
7) calculating the sampling weight p_l of each bin according to the classification hardness and the self-paced factor, wherein:
p_l = (1/(h_l + α)) / Σ_{m=1}^{k} (1/(h_m + α));
8) based on the sampling weights p_l, under-sampling the binned majority samples N so that the total sampled amount is consistent with the minority samples P, wherein the number of samples drawn from the l-th bin is
n_l = |P| · p_l (rounded to an integer);
9) training the base classifier f_i on the new under-sampled sample set;
10) returning to step 3) to start a new iteration, until n iterations have been completed;
11) after the n iterations are completed, combining all the base classifiers into the final ensemble classifier.
11. The method for bond default prediction based on the class imbalance machine learning framework of any one of claims 6 to 9, wherein the selecting the optimal model specifically comprises:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using GridSearch, observing indexes of the model on the test set, evaluating the quality of the model according to the results of the 5 times of model training, and selecting the optimal model.
12. The method of claim 11, wherein the evaluation metrics used to evaluate the quality of the model include accuracy, precision, recall, F1, or ROC-AUC.
13. The method of claim 11, wherein the hyper-parameters comprise the hyper-parameters of a Self-paced Ensemble classifier and the hyper-parameters of a LightGBM base classifier, wherein:
the hyper-parameters of the Self-paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier include: the number of leaves, the minimum data size per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient, and the L2 regularization coefficient.
14. The method for bond default prediction based on the class-imbalance machine learning framework according to claim 1, wherein the deploying the optimal model specifically comprises: and packaging the trained optimal model into an HTTP service by using Python fastapi, and deploying the HTTP service into a production environment server.
15. The method for bond default prediction based on the class imbalance machine learning framework of claim 1 or 14, wherein the predicting bond default by using the optimal model specifically comprises:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
16. A bond default prediction device based on a class imbalance machine learning framework, the prediction device comprising a feature selection module, a model training module, and a default prediction module, wherein:
the characteristic selection module is used for selecting the characteristics with the highest contribution degree to model training from the relevant data of the debt main body, wherein the relevant data comprises annual financial statement data and default condition data of the debt main body;
the model training module is used for constructing a training set and a testing set by using the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject, carrying out model training by using a Self-paced Ensemble learning method based on the training set and the testing set, and selecting an optimal model;
and the default prediction module is used for deploying the optimal model and using the optimal model to predict bond default.
17. The bond default prediction apparatus based on the class imbalance machine learning framework of claim 16, further comprising a data acquisition module and a data preprocessing module, wherein:
the data acquisition module is used for acquiring related data of the debt subject;
and the data preprocessing module is used for preprocessing the acquired related data of the debt subject.
18. The bond default prediction device based on the class imbalance machine learning framework of claim 17, wherein the acquiring of the data related to the debt subject specifically comprises:
acquiring the balance sheet, income statement, cash flow statement, financial indicators, bond default table, bond classification sector, bond concept sector and ChinaBond yield curve of bond subjects from a database, screening out all credit bond subjects and excluding urban investment bonds, acquiring the annual financial statement data of the credit bond subjects, judging whether each credit bond subject has defaulted, marking whether it has defaulted, and setting the first default label of the credit bond subject.
19. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 17, wherein the preprocessing of the acquired data related to the debt subject comprises:
calculating the missing rate of the features of the data related to the debt subject, removing abnormal values of the features of the data related to the debt subject, and binning the features of the data related to the debt subject.
20. The bond default prediction apparatus based on the class imbalance machine learning framework of claim 16, wherein the selecting the feature with the highest contribution degree to model training from the data related to the debt subjects specifically comprises:
and calculating the univariate correlation, the multivariate correlation, the IV value, the information entropy and the Gini coefficient of the characteristics of the relevant data of the debt subject, and selecting the characteristics with the highest contribution degree to model training from the characteristics by combining business experience.
21. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 16, wherein the constructing a training set and a testing set, performing model training based on the training set and the testing set by using a Self-paced Ensemble learning method, and selecting an optimal model specifically comprises:
determining a base classifier in an ensemble learning framework;
dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set;
performing model training based on the training set and the test set by using a Self-paced Ensemble learning method;
and selecting an optimal model.
22. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 21, wherein the determining the base classifier in the ensemble learning framework specifically comprises: determining a LightGBM binary classifier as the base classifier in the ensemble learning framework, wherein the LightGBM binary classifier comprises a plurality of decision trees.
23. The bond default prediction apparatus based on the class imbalance machine learning framework according to claim 21, wherein the dividing of the data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a testing set specifically comprises: dividing a data set containing the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject into a training set and a test set using 5-fold cross-validation.
24. The bond default prediction device according to claim 23, wherein the data set including the selected characteristics of the annual financial statement data of the debt subject and the first default label of the debt subject is divided into a training set and a testing set by using the 5-fold cross-validation method, specifically comprising:
randomly dividing the data set containing the selected annual financial statement data characteristics of the debt subject and the first default label of the debt subject into 5 parts, selecting 4 parts of the data set as a training set each time, using the remaining 1 part as a test set, repeating the selecting step five times, wherein the training sets selected each time are different.
25. The bond default prediction apparatus based on the class imbalance machine learning framework according to any one of claims 21-24, wherein the model training based on the training set and the test set and using a Self-paced Ensemble learning method specifically comprises:
1) initializing the minority samples P and the majority samples N in the training set D;
2) training the first base classifier f_0 by using a randomly under-sampled subset N_0 of the majority samples N together with the minority samples P, wherein the size of N_0 is consistent with that of P, i.e. |N_0| = |P|;
3) taking the ensemble of all base classifiers trained so far as the current ensemble model F_i, namely:
F_i(x) = (1/i) · Σ_{j=0}^{i-1} f_j(x)
wherein i represents the total number of base classifiers that have been trained, with a maximum value of n;
4) binning the majority samples N into k groups B_1, B_2, …, B_k according to the classification hardness, wherein for a given model F(·) and a sample (x, y), the classification hardness of the sample is H(x, y, F), H is the classification hardness function and k is the number of bins;
5) calculating the average classification hardness of each bin, the average hardness of the l-th bin being
h_l = (1/|B_l|) · Σ_{(x,y)∈B_l} H(x, y, F_i);
6) updating the self-paced factor
α = tan(iπ/(2n));
7) calculating the sampling weight p_l of each bin according to the classification hardness and the self-paced factor, wherein:
p_l = (1/(h_l + α)) / Σ_{m=1}^{k} (1/(h_m + α));
8) based on the sampling weights p_l, under-sampling the binned majority samples N so that the total sampled amount is consistent with the minority samples P, wherein the number of samples drawn from the l-th bin is
n_l = |P| · p_l (rounded to an integer);
9) training the base classifier f_i on the new under-sampled sample set;
10) returning to step 3) to start a new iteration, until n iterations have been completed;
11) after the n iterations are completed, combining all the base classifiers into the final ensemble classifier.
26. The bond default prediction device of any one of claims 21-24, wherein the selecting the optimal model specifically comprises:
setting candidate values of each hyper-parameter, combining the candidate values into a hyper-parameter matrix, performing 5 times of model training on the divided training set by using GridSearch, observing indexes of the model on the test set, evaluating the quality of the model according to the results of the 5 times of model training, and selecting the optimal model.
27. The bond default prediction device according to claim 26, wherein the evaluation metrics used for evaluating the quality of the model include accuracy, precision, recall, F1 or ROC-AUC.
28. The bond default prediction device based on the class imbalance machine learning framework of claim 26, wherein the hyper-parameters comprise the hyper-parameters of a Self-paced Ensemble classifier and the hyper-parameters of a LightGBM base classifier, wherein:
the hyper-parameters of the Self-paced Ensemble classifier comprise: the number of base classifiers, the number of bins and the classification hardness function;
the hyper-parameters of the LightGBM base classifier include: the number of leaves, the minimum data size per leaf, the maximum depth, the number of decision trees, the learning rate, the decision tree type, the L1 regularization coefficient, and the L2 regularization coefficient.
29. The device for bond default prediction based on class-imbalance machine learning framework according to claim 16, wherein the deploying the optimal model specifically comprises: and packaging the trained optimal model into an HTTP service by using Python fastapi, and deploying the HTTP service into a production environment server.
30. The bond default prediction device based on the class imbalance machine learning framework of claim 16 or 29, wherein the performing the bond default prediction by using the optimal model specifically comprises:
and generating prediction data for model prediction periodically, converting the processed prediction data into a JSON format, sending the JSON format to the optimal model packaged as HTTP service for bond default prediction, retrieving a prediction result after prediction is finished, and pushing the prediction result to a downstream application system for display.
31. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-15.
32. A computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-15.
33. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-15.
CN202210407035.1A 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework Pending CN114676932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210407035.1A CN114676932A (en) 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210407035.1A CN114676932A (en) 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework

Publications (1)

Publication Number Publication Date
CN114676932A true CN114676932A (en) 2022-06-28

Family

ID=82078455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210407035.1A Pending CN114676932A (en) 2022-04-18 2022-04-18 Bond default prediction method and device based on class imbalance machine learning framework

Country Status (1)

Country Link
CN (1) CN114676932A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846521A (en) * 2018-06-22 2018-11-20 西安电子科技大学 Shield-tunneling construction unfavorable geology type prediction method based on Xgboost
CN111340236A (en) * 2020-03-03 2020-06-26 中债金融估值中心有限公司 Bond default prediction method based on bond valuation data and integrated machine learning
CN111524015A (en) * 2020-04-10 2020-08-11 易方达基金管理有限公司 Method and device for training prediction model, computer equipment and readable storage medium
CN112767172A (en) * 2021-01-22 2021-05-07 上海析鲸信息科技有限公司 Bond default early warning and identification technology based on machine learning model algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHINING LIU et al.: "Self-paced Ensemble for Highly Imbalanced Massive Data Classification", 2020 IEEE 36th International Conference on Data Engineering (ICDE) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306958A (en) * 2022-09-13 2023-06-23 中债金科信息技术有限公司 Training method of default risk prediction model, default risk prediction method and device
CN115907972A (en) * 2023-01-16 2023-04-04 齐鲁工业大学(山东省科学院) Unbalanced credit investigation data risk assessment method and system based on double self-walking learning
CN115907972B (en) * 2023-01-16 2023-09-12 齐鲁工业大学(山东省科学院) Unbalanced credit investigation data risk assessment method and system based on double self-step learning
CN117763356A (en) * 2023-12-26 2024-03-26 中国地质科学院地质力学研究所 Rapid earthquake phase identification method based on LightGBM algorithm

Similar Documents

Publication Publication Date Title
CN114676932A (en) Bond default prediction method and device based on class imbalance machine learning framework
Fernández-Rodríguez et al. Business and institutional determinants of Effective Tax Rate in emerging economies
Girma et al. Evaluating the foreign ownership wage premium using a difference-in-differences matching approach
Titko et al. Measuring bank efficiency: DEA application
CN107545422B (en) Cashing detection method and device
Rezende et al. Predicting financial distress in publicly-traded companies
CN109583966A (en) A kind of high value customer recognition methods, system, equipment and storage medium
US20170221075A1 (en) Fraud inspection framework
CN113919886A (en) Data characteristic combination pricing method and system based on summer pril value and electronic equipment
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
Balcilar et al. South African stock returns predictability using domestic and global economic policy uncertainty: Evidence from a nonparametric causality-in-quantiles approach
Korhonen et al. Factors driving investment in planted forests: a comparison between OECD and non-OECD countries
CN111626855A (en) Bond credit interest difference prediction method and system
CN112434862B (en) Method and device for predicting financial dilemma of marketing enterprises
Gresnigt et al. Specification testing in hawkes models
Narimani et al. A multivariate decomposition–ensemble model for estimating long-term rainfall dynamics
CN114707883A (en) Bond default prediction method, device, equipment and medium based on time sequence characteristics
CN111815435A (en) Visualization method, device, equipment and storage medium for group risk characteristics
Ruotsalainen et al. The effects of sample plot selection strategy and the number of sample plots on inoptimality losses in forest management planning based on airborne laser scanning data
CN114092215B (en) Auditing method and system for export tax refund loan
Brumma et al. Modeling downturn LGD in a Basel framework
Rahman et al. Detecting accounting fraud in family firms: Evidence from machine learning approaches
Niknya et al. Financial distress prediction of Tehran Stock Exchange companies using support vector machine
US20200265521A1 (en) Multimedia risk summarizer
CN111612626A (en) Method and device for preprocessing bond evaluation data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wu Hao

Inventor after: Yang Fan

Inventor after: Sun Yanjie

Inventor before: Li Zi

Inventor before: Yang Fan

Inventor before: Wu Hao

Inventor before: Sun Yanjie

CB03 Change of inventor or designer information
RJ01 Rejection of invention patent application after publication

Application publication date: 20220628

RJ01 Rejection of invention patent application after publication