CN111612627A - Method for evaluating bond risk influence indexes

Info

Publication number: CN111612627A
Application number: CN202010464996.7A
Authority: CN
Inventors: 袁豪
Original assignee: Shenzhen Bopu Technology Co ltd
Current assignee: Shenzhen Bopu Technology Co ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2020-09-01

Abstract

The embodiment of the invention provides a method for evaluating bond risk influence indexes, which comprises the following steps: acquiring a data source of the bond sample; constructing data characteristics and classification targets according to the data sources; calculating the importance of the data characteristics by adopting a random forest algorithm, and generating importance ranking of the data characteristics; according to the importance sorting sequence, adding the data features to the classification model one by one, calculating corresponding accuracy, and selecting a feature subset reaching the highest accuracy; and obtaining an important influence index according to the feature subset. The method provided by the embodiment of the invention is used for sequencing the feature importance of the data set of the bond sample, calculating the accuracy through the classification model, finding out the optimal feature subset, and removing the redundant features in the data set under the condition of ensuring the classification capability of the feature subset, thereby screening out the important indexes influencing the bond risk and reducing the workload of bond information acquisition.

Description

Method for evaluating bond risk influence indexes

Technical Field

The invention relates to the technical field of big data, in particular to a method for evaluating risk influence indexes of bonds.

Background

Bond defaults have occurred frequently over the last two years, and as policies change, bond default will become a common risk event. Existing bond risk prediction techniques mainly extract useful data features from a wide range of bond data and train a classification model on the bond data by machine learning, so that the model can predict bond default. Factors such as credit investigation information, financial data, third-party credit rating reports and research reports may all affect bond risk, so how to collect effective indexes from massive data has become an urgent problem in evaluating bond risk.

In the prior art, a random forest algorithm is mainly used to evaluate and analyze a number of data features, find the features with high importance, and trace those features back to their data sources in order to determine which obtainable indexes are important for predicting bond default.

However, existing data feature evaluation methods have difficulty accurately distinguishing effective features from redundant features, so a large amount of index information containing redundant features still has to be collected before a classification model can be used for risk prediction, which makes data acquisition time-consuming.

Disclosure of Invention

The invention mainly aims to provide a method for evaluating bond risk influence indexes, which aims to solve the technical problem that the existing index information contains a large number of redundant features.

The invention provides a method for evaluating bond risk influence indexes, which comprises the following steps:

acquiring a data source of the bond sample;

constructing data characteristics and classification targets according to the data sources;

calculating the importance of the data characteristics by adopting a random forest algorithm, and generating importance ranking of the data characteristics;

according to the importance sorting sequence, adding the data features to the classification model one by one, calculating corresponding accuracy, and selecting a feature subset reaching the highest accuracy;

and obtaining an important influence index according to the feature subset.

Preferably, the acquiring a data source of the bond sample comprises: acquiring a data source of the bond sample, wherein the data source comprises two or more items of index information and a default record of the bond sample.

Preferably, the constructing the data features and the classification targets according to the data source includes:

constructing the data characteristics according to the index information;

and constructing the classification target according to the default record.

Preferably, the calculating the importance of the data features by using a random forest algorithm, and the generating the importance ranking of the data features includes:

constructing a decision tree according to the data characteristics and the classification target, and generating a random forest;

calculating the importance of the data features through the random forest;

and arranging the data features according to the sequence of the importance from high to low to generate the importance ranking of the data features.

Preferably, the classification model is obtained by training data of the bond sample by adopting an SVM algorithm, a random forest algorithm, a naive Bayes algorithm, a CART algorithm or a Bagging algorithm.

The method provided by the embodiment of the invention is used for sequencing the feature importance of the data set of the bond sample, calculating the accuracy through the classification model, finding out the optimal feature subset, and removing the redundant features in the data set under the condition of ensuring the classification capability of the feature subset, thereby screening out the important indexes influencing the bond risk and reducing the workload of bond information acquisition.

Drawings

FIG. 1 is a flowchart of the steps of embodiment 1 of a method for evaluating risk impact indicators of bonds of the present invention;

FIG. 2 is a flowchart illustrating the steps of embodiment 2 of a method for evaluating risk impact indicators of bonds according to the present invention;

FIG. 3 is a flowchart of the steps of embodiment 3 of a method for evaluating risk impact indicators of bonds of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

One of the core ideas of the embodiment of the invention is to provide a method for evaluating bond risk influence indexes, screen out important indexes influencing bond risk and reduce the workload of information acquisition.

Referring to fig. 1, a flowchart illustrating steps of embodiment 1 of a method for evaluating risk influence indicators of bonds of the present invention is shown, which may specifically include the following steps:

S101, obtaining a data source of the bond sample.

Specifically, the data source is either the original data itself or data randomly extracted from the original data.

S102, constructing data characteristics and classification targets according to the data source.

Specifically, a data set for feature importance analysis is constructed according to the data source, and the data set comprises two or more feature quantities and classification targets.

S103, calculating the importance of the data features by adopting a random forest algorithm, and generating the importance ranking of the data features.

Specifically, a random forest algorithm is adopted, the importance of the data features is calculated using information entropy or Gini impurity as the feature importance measure, and an importance ranking of the data features is generated.
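The two measures named above can be sketched as follows. This is a minimal illustration, not the implementation used in the embodiment; the function names `gini_impurity`, `entropy` and `impurity_decrease` are hypothetical, and labels are assumed to be the 0/1 classification targets.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_c ** 2) over the class proportions p_c."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Information entropy in bits: -sum(p_c * log2(p_c))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def impurity_decrease(parent, left, right, measure=gini_impurity):
    """Weighted impurity reduction obtained by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - weighted
```

A split that separates defaulted from non-defaulted samples perfectly removes all impurity, which is why features that produce such splits tend to rank high.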

S104, adding the data features to the classification model one by one according to the importance sorting sequence, calculating corresponding accuracy, and selecting the feature subset reaching the highest accuracy.

Specifically, the data features included in the feature subset are effective features that guarantee the prediction effect of the classification model, and the features outside the feature subset are redundant features.

S105, obtaining an important influence index according to the feature subset.

Specifically, an index corresponding to the effective characteristic is found in a data source, and the index is an important index influencing bond risk.

The method of the embodiment sorts the feature importance of the data set of the bond sample, calculates the accuracy through the classification model, finds the optimal feature subset, and removes the redundant features in the data set under the condition of ensuring the classification capability of the feature subset, thereby screening out the important indexes influencing the bond risk and reducing the workload of collecting the bond information.

Referring to fig. 2, a flowchart illustrating steps of embodiment 2 of the method for evaluating risk influence indicators of bonds of the present invention is shown, and specifically, the method may include the following steps:

S201, acquiring a data source of the bond sample, wherein the data source comprises two or more items of index information and a default record of the bond sample.

Specifically, a plurality of items of index information and the default records of the bond sample are extracted and screened, for each bond, from data such as historical bond equity data, historical data of the corresponding industry index, historical financial statements of the bond subject, and the historical ratings of the individual bond and its subject.

S202, constructing the data characteristics according to the index information.

Specifically, 30 data features are constructed according to the index information of the bond sample, as follows: C_level (bond rating), CLD (whether the bond rating has been downgraded), D_level (subject rating), DLD (whether the subject rating has been downgraded), DTAR (asset-liability ratio), DTAR_DIFF (difference from the asset-liability ratio in the previous financial report), PM (gross profit margin), PM_DIFF (difference from the gross profit margin in the previous financial report), DC (debt-to-capital ratio), DC_DIFF (difference from the debt-to-capital ratio in the previous financial report), ROE (return on net assets), ROE_DIFF (difference from the return on net assets in the previous financial report), OCF (net operating cash flow), OCF_DIFF (difference from the net operating cash flow in the previous financial report), OCF/D (net operating cash flow / total liabilities), OCF/D_DIFF (difference from "net operating cash flow / total liabilities" in the previous financial report), avg_day_diff (average daily fluctuation of the ChinaBond valuation), avg_day_absdiff (average daily absolute fluctuation of the ChinaBond valuation), avg_price (average price within the ChinaBond valuation quarter), max_diff (maximum rise within the ChinaBond valuation quarter), min_diff (maximum fall within the ChinaBond valuation quarter), max_min_diff (difference between the highest and lowest values within the ChinaBond valuation quarter, signed to indicate whether the maximum fluctuation was a fall or a rise), diff_rate (overall rise or fall ratio over the ChinaBond valuation quarter), is_stop (whether an overdue event occurred within the ChinaBond valuation quarter), concept_avg_day_diff (average daily fluctuation of the industry index), concept_avg_day_absdiff (average daily absolute fluctuation of the industry index), concept_max_diff (maximum rise within the industry index quarter), concept_min_diff (maximum drop within the industry index quarter), concept_max_min_diff (difference between the highest and lowest values within the industry index quarter), and concept_avg_price (average index value within the industry index quarter divided by the index value on the first day of the quarter).
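The "_DIFF" features in the list above compare each financial figure with the previous financial report. A minimal sketch of that construction follows; the function name `add_diff_features`, the dict-per-report layout, and the sign convention (current value minus previous value) are assumptions for illustration and are not specified in the original.

```python
def add_diff_features(reports, fields=("DTAR", "PM", "DC", "ROE", "OCF")):
    """Add <field>_DIFF columns to a chronological list of financial reports.

    reports: list of dicts ordered oldest to newest, one dict per report.
    Assumption: <field>_DIFF = current value - value in the previous report.
    The first report has no predecessor, so its _DIFF values are None.
    """
    rows = []
    previous = None
    for report in reports:
        row = dict(report)  # copy so the caller's data is not modified
        for field in fields:
            if previous is None:
                row[field + "_DIFF"] = None
            else:
                row[field + "_DIFF"] = report[field] - previous[field]
        rows.append(row)
        previous = report
    return rows
```

For example, calling it with `fields=("DTAR", "ROE")` on two consecutive reports yields DTAR_DIFF and ROE_DIFF for the second report and None for the first.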

S203, constructing the classification target according to the default record.

Specifically, the classification target is constructed according to whether the bond sample has a default record in the corresponding quarter: if a default occurred, the target is marked as 1; if no default occurred, it is marked as 0.
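The labelling rule can be sketched in a few lines. The function name `build_targets` and the representation of a sample as a (bond identifier, quarter) pair are hypothetical choices made for this illustration.

```python
def build_targets(samples, default_records):
    """Return the 0/1 classification target for each bond sample.

    samples: list of (bond_id, quarter) pairs, one per sample.
    default_records: set of (bond_id, quarter) pairs in which a default occurred.
    """
    return [1 if sample in default_records else 0 for sample in samples]
```

With samples `[("B1", "2019Q4"), ("B2", "2019Q4")]` and a default recorded only for `("B2", "2019Q4")`, the targets are 0 for the first sample and 1 for the second.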

S204, constructing a decision tree according to the data characteristics and the classification target, and generating a random forest.

Specifically, the number of bond samples is denoted K, and k samples are randomly drawn from the K samples as a training set; the number of data features is denoted M, and m features are randomly extracted from the data features as the branch basis (m ≤ M); a decision tree is constructed from the training set, the branch basis and the classification target, using Gini impurity or information entropy as the measure; these steps are repeated to construct a plurality of decision trees and generate a random forest, and the number of decision trees in the random forest is denoted N.
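The sampling performed for each decision tree can be sketched as follows. This is an illustration under stated assumptions: the rows are drawn with replacement (the usual bootstrap, which the text does not state explicitly), the m branch features are drawn once per tree as described above, and the function name `draw_tree_inputs` is hypothetical.

```python
import random


def draw_tree_inputs(num_samples, feature_names, k, m, rng):
    """Pick the training rows and branch features for one decision tree.

    num_samples: K, the number of bond samples.
    feature_names: the M data features.
    k: number of rows drawn for the training set.
    m: number of features used as the branch basis (m <= M).
    Assumption: rows are drawn with replacement, so some rows are never
    drawn; those rows form the tree's out-of-bag set.
    """
    if m > len(feature_names):
        raise ValueError("m must not exceed the number of features M")
    train_idx = [rng.randrange(num_samples) for _ in range(k)]
    drawn = set(train_idx)
    oob_idx = [i for i in range(num_samples) if i not in drawn]
    branch_features = rng.sample(list(feature_names), m)
    return train_idx, oob_idx, branch_features
```

Repeating this N times, each time with a fresh draw, gives the N trees of the forest; the out-of-bag rows returned here are the ones later used to measure each tree's prediction error.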

S205, calculating the importance of the data characteristics through the random forest.

Specifically, for each decision tree in the random forest, the prediction error on the corresponding out-of-bag (OOB) data is calculated and recorded as errOOB1; the out-of-bag data are the samples remaining after the decision tree has drawn its k samples. The values of data feature X in all out-of-bag samples are then replaced with random numbers, the out-of-bag error is calculated again, and it is recorded as errOOB2. The importance of data feature X is Σ(errOOB2 − errOOB1)/N, where the sum runs over the N decision trees.
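The importance formula can be sketched as below. Note one substitution: instead of replacing the values of feature X with arbitrary random numbers, this sketch shuffles the existing values of X among the out-of-bag samples, which is a common way to apply the same idea. The names `error_rate` and `feature_importance`, and the representation of a tree as a `(predict, oob_rows, oob_labels)` triple, are assumptions for illustration.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows whose predicted class differs from the true label."""
    if not rows:
        return 0.0
    wrong = sum(1 for row, label in zip(rows, labels) if predict(row) != label)
    return wrong / len(rows)


def feature_importance(trees, feature, rng):
    """Importance of one feature: sum(errOOB2 - errOOB1) / N over N trees.

    trees: list of (predict, oob_rows, oob_labels); each row is a dict that
    maps feature name to value, and predict(row) returns 0 or 1.
    """
    if not trees:
        raise ValueError("at least one tree is required")
    total = 0.0
    for predict, oob_rows, oob_labels in trees:
        err_oob1 = error_rate(predict, oob_rows, oob_labels)
        # Perturb the feature by shuffling its values across the OOB rows.
        shuffled = [row[feature] for row in oob_rows]
        rng.shuffle(shuffled)
        perturbed = [{**row, feature: value}
                     for row, value in zip(oob_rows, shuffled)]
        err_oob2 = error_rate(predict, perturbed, oob_labels)
        total += err_oob2 - err_oob1
    return total / len(trees)
```

A feature that no tree relies on leaves the out-of-bag error unchanged and therefore scores 0, while a feature the trees depend on raises the error once it is perturbed.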

S206, arranging the data features according to the sequence of the importance from high to low, and generating the importance ranking of the data features.

Specifically, on the basis of S205, the importance of all data features is calculated one by one, and the data features are arranged in order of importance from high to low to generate the importance ranking of the data features, which is, in order: DLD, concept_avg_day_absdiff, concept_max_diff, C_level, concept_max_min_diff, concept_min_diff, CLD, D_level, concept_avg_day_diff, DTAR_DIFF, max_diff, OCF/D_DIFF, is_stop, OCF/D, OCF_DIFF, DC_DIFF, max_min_diff, avg_day_absdiff, concept_avg_price, ROE_DIFF, diff_rate, avg_day_diff, ROE, avg_price, DTAR, min_diff, PM_DIFF, DC, OCF.
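Once every feature has an importance score, the ranking is a plain descending sort. In the sketch below the function name `rank_features` is hypothetical, and the scores used in the usage note are invented purely for illustration; the original does not report numeric importance values.

```python
def rank_features(importance):
    """Return feature names ordered by importance, highest first.

    importance: dict mapping feature name to its importance score.
    Features with equal scores keep their insertion order (stable sort).
    """
    return sorted(importance, key=importance.get, reverse=True)
```

For instance, `rank_features({"OCF": 0.01, "DLD": 0.30, "C_level": 0.20})` places DLD first and OCF last.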

S207, adding the data features to the classification model one by one according to the importance sorting sequence, calculating corresponding accuracy, and selecting the feature subset reaching the highest accuracy.

Specifically, the data features are added to the classification model one by one in the order of the importance ranking, the corresponding accuracy is calculated, and the feature subset reaching the highest accuracy is selected. The selected subset includes: DLD, concept_avg_day_absdiff, concept_max_diff, C_level, concept_max_min_diff, concept_min_diff, CLD, D_level, concept_avg_day_diff, DTAR_DIFF, max_diff, OCF/D_DIFF, is_stop, OCF/D, OCF_DIFF, DC_DIFF.
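The one-by-one addition can be sketched as a loop over the ranked features. Here `evaluate` is a hypothetical callback that trains the chosen classification model on the given features and returns its accuracy; how that accuracy is measured (for example, on a held-out set) is not specified in the original. Keeping the smaller subset when two subsets tie is also an assumption.

```python
def select_best_prefix(ranked_features, evaluate):
    """Add features in ranked order and keep the subset with the best accuracy.

    ranked_features: feature names, most important first.
    evaluate: callable taking a list of feature names and returning accuracy.
    Returns (best_subset, best_accuracy, history), where history[i] is the
    accuracy obtained with the first i + 1 features.
    Assumption: on ties, the earlier (smaller) subset is kept.
    """
    best_subset, best_accuracy = [], float("-inf")
    history = []
    current = []
    for feature in ranked_features:
        current.append(feature)
        accuracy = evaluate(list(current))
        history.append(accuracy)
        if accuracy > best_accuracy:
            best_subset, best_accuracy = list(current), accuracy
    return best_subset, best_accuracy, history
```

The features left out of `best_subset` are the ones treated as redundant: adding them did not raise the accuracy above the best value already reached.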

S208, obtaining an important influence index according to the feature subset.

Specifically, the important influence indexes obtained according to the feature subset include: whether the subject rating has been downgraded, the average daily absolute fluctuation of the industry index, the maximum rise within the industry index quarter, the bond rating, the difference between the highest and lowest values within the industry index quarter, the maximum drop within the industry index quarter, whether the bond rating has been downgraded, the subject rating, the average daily fluctuation of the industry index, the difference from the asset-liability ratio in the previous financial report, the maximum rise within the ChinaBond valuation quarter, the difference from "net operating cash flow / total liabilities" in the previous financial report, whether an overdue event occurred within the ChinaBond valuation quarter, net operating cash flow / total liabilities, the difference from the net operating cash flow in the previous financial report, and the difference from the debt-to-capital ratio in the previous financial report.
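Tracing the selected features back to their indexes amounts to a lookup. The sketch below is illustrative only: the table `FEATURE_TO_INDEX` holds a handful of entries paraphrased from the feature list of this embodiment, and the function name `indexes_for` is hypothetical.

```python
# A few entries paraphrased from the embodiment's feature list (not exhaustive).
FEATURE_TO_INDEX = {
    "DLD": "whether the subject rating has been downgraded",
    "C_level": "bond rating",
    "CLD": "whether the bond rating has been downgraded",
    "D_level": "subject rating",
    "OCF/D": "net operating cash flow / total liabilities",
}


def indexes_for(feature_subset, feature_to_index=FEATURE_TO_INDEX):
    """Look up the index behind each selected feature, keeping ranking order."""
    missing = [f for f in feature_subset if f not in feature_to_index]
    if missing:
        raise KeyError(f"no index recorded for: {missing}")
    return [feature_to_index[f] for f in feature_subset]
```

Only the indexes returned by this lookup would then need to be collected for future risk evaluation.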

According to the method, 30 data features in the bond sample data set are subjected to importance sorting, the feature subset reaching the highest accuracy is calculated through the classification model, and a large number of redundant features in the data set are removed, so that important indexes influencing bond risks are screened out, collection of non-important index information is avoided, workload of information processing can be reduced, and evaluation efficiency of the bond risks is improved.

Referring to fig. 3, a flowchart illustrating steps of embodiment 3 of the method for evaluating risk influence indicators of bonds of the present invention is shown, and specifically, the method may include the following steps:

S301, acquiring a data source of the bond sample, wherein the data source comprises two or more items of index information and a default record of the bond sample.

S302, constructing the data characteristics according to the index information.

S303, constructing the classification target according to the default record.

S304, according to the data characteristics and the classification target, a decision tree is constructed, and a random forest is generated.

S305, calculating the importance of the data characteristics through the random forest.

S306, arranging the data features according to the sequence of the importance from high to low, and generating the importance ranking of the data features.

S307, adding the data features to the classification model one by one according to the importance sorting sequence, calculating corresponding accuracy, and selecting a feature subset reaching the highest accuracy; the classification model is obtained by training data of the bond samples by adopting an SVM algorithm, a random forest algorithm, a naive Bayes algorithm, a CART algorithm or a Bagging algorithm.

Specifically, in this embodiment the classification model is obtained by training on the data of the bond samples with the SVM algorithm. The SVM algorithm is a machine learning method developed on the basis of statistical learning theory; based on the principle of structural risk minimization, it shows particular advantages in small-sample, nonlinear and high-dimensional pattern recognition problems. The SVM classification model can perform classification prediction according to the data features of the bond samples.
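For reference, the structural risk minimization principle mentioned above is commonly written as the soft-margin SVM objective. The notation is standard textbook notation and does not appear in the original: w and b are the weight vector and bias, ξ_i are slack variables, C is the penalty parameter, φ is the feature map induced by the kernel, x_i is the feature vector of the i-th of the K bond samples, and y_i ∈ {−1, +1} assumes the 1/0 default target has been remapped to +1/−1.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{K}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0,\qquad i=1,\dots,K
```

The first term controls model complexity and the second penalizes training errors, which is the trade-off that structural risk minimization refers to.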

S308, obtaining an important influence index according to the feature subset.

According to the method, the characteristic subset reaching the highest accuracy is calculated by using the SVM classification model, and the effective characteristics aiming at the SVM classification model can be screened out under the condition that the classification capability is guaranteed, so that the important indexes influencing the bond risk are screened out, and the bond risk evaluation efficiency based on the SVM algorithm is improved.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily required by the invention.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The method for evaluating risk influence indexes of bonds provided by the invention is described in detail, a specific example is applied in the method to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method of assessing a bond risk impact indicator, comprising:

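acquiring a data source of the bond sample;

constructing data characteristics and classification targets according to the data source;

calculating the importance of the data characteristics by adopting a random forest algorithm, and generating importance ranking of the data characteristics;

according to the importance sorting sequence, adding the data features to the classification model one by one, calculating corresponding accuracy, and selecting a feature subset reaching the highest accuracy;

and obtaining an important influence index according to the feature subset.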

2. The method of claim 1, wherein the obtaining a data source of the bond sample comprises: a data source of a bond sample is obtained, wherein the data source comprises two or more index information and a default record of the bond sample.

3. The method of claim 2, wherein constructing data features and classification targets from the data sources comprises:

constructing the data characteristics according to the index information;

and constructing the classification target according to the default record.

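4. The method of claim 3, wherein the calculating the importance of the data features using a random forest algorithm and generating the importance ranking of the data features comprises:

constructing a decision tree according to the data characteristics and the classification target, and generating a random forest;

calculating the importance of the data features through the random forest;

and arranging the data features in order of importance from high to low to generate the importance ranking of the data features.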

5. The method of claim 1, wherein the classification model is obtained by training data of the bond sample using an SVM algorithm, a random forest algorithm, a naive Bayes algorithm, a CART algorithm, or a Bagging algorithm.