CN110689437A

CN110689437A - Communication construction project financial risk prediction method based on random forest

Info

Publication number: CN110689437A
Application number: CN201910949059.8A
Authority: CN
Inventors: 王翔; 杨帆
Original assignee: HUBEI TELECOM ENGINEERING Co Ltd
Current assignee: HUBEI TELECOM ENGINEERING Co Ltd
Priority date: 2019-10-08
Filing date: 2019-10-08
Publication date: 2020-01-14

Abstract

The invention discloses a random forest-based method for predicting the financial risk of communication construction projects, which relates to the field of data processing and comprises the following steps: acquiring and classifying the financial data of an enterprise, preprocessing the financial data and obtaining a plurality of data factors; judging whether the data factors exhibit multicollinearity; if so, reducing the dimensionality of the data factors to obtain a data set and applying the SMOTE algorithm to the data set; otherwise, applying the SMOTE algorithm directly to the data factors; then training a model on the processed data and verifying it with a test set.

Description

Communication construction project financial risk prediction method based on random forest

Technical Field

The invention relates to the field of data processing, in particular to a communication construction project financial risk prediction method based on random forests.

Background

In recent years, China's information industry has advanced by leaps and bounds, and the internet era has entered a period of large-scale integration. To meet the growing demands of national economic development, the demand for communication engineering construction keeps increasing, which brings many business opportunities to communication construction enterprises.

Owing to the particular requirements of customers and of project construction management, the industry in which communication construction enterprises operate differs markedly from other industries. Projects generally have long construction periods and place large demands on capital stock and capital flow, and during construction large accounts receivable generally arise from the credit relationship with Party A (the client), which ties up capital. Meanwhile, the capital recovery period of a communication construction enterprise is relatively long, and a certain proportion of quality guarantee funds is generally reserved before construction, so that as the enterprise grows and undertakes more projects, large capital requirements and capital investments arise.

Furthermore, because the engineering projects of communication construction enterprises run for a long time, accounts receivable that are not cleared promptly and effectively can cause cash flow shortages and even a break in the capital chain. In a communication construction enterprise, accounts receivable are therefore important creditor's-right assets, but in practice these resources are held by the debtor: the enterprise has the right to collect the money but cannot dispose of the assets. Until they are collected, accounts receivable exist only on the books and may be lost as bad debts because of factors such as a credit default by the debtor. The enterprise truly holds these assets and resources only after the corresponding cash actually flows in.

In addition, the existence of accounts receivable means that part of the enterprise's main business income has not produced a cash inflow and represents only a right of collection. Before the accounts receivable are recovered, the corresponding profit is only a book profit, a number in the financial statements rather than an actual profit; the income corresponding to the accounts receivable is not backed by a real cash inflow and is merely "wealth on paper".

When accounts receivable and debts are cleared properly, a communication construction enterprise obtains a relatively stable cash flow, can develop healthily, and can sustain its daily operation and management activities. A communication construction enterprise should therefore monitor its claims and debts at all times and promptly clear items that have not been received or paid. Given the accounts receivable problems of the communication industry, the enterprise should also reasonably analyze the projects it takes on before construction, including the state of the industry, the financial condition of the company, and the policy background.

Researchers in related fields have already done a great deal of research and validation work on financial risk prediction. The Chinese economist Zhouyghua added factors related to cash flow to the original Z model and constructed the F model for financial risk prediction; the model used the financial statement data of 27 enterprises, divided into listed and non-listed enterprises, and obtained a good prediction effect. Zhang Jin Gui, Huang Shu, Wang Jun Nu and others used principal component analysis to analyze the importance of each financial indicator, selected the data of 40 listed and non-listed enterprises, analyzed the causes of enterprise financial failure, and obtained good results. Wu Shannon, Gunn and other scholars analyzed financial data by constructing a regression model, selected the financial data of more than 70 listed companies, and evaluated financial risk with a logistic regression algorithm.

In addition, methods such as neural network models, the Z-Score model, the analytic hierarchy process, factor analysis, and the efficacy coefficient method are widely used for financial risk prediction, but these methods have several disadvantages: the models are constructed without regard to industry characteristics, and because conditions differ between industries, simply transferring a model does not yield good results; each method is one-dimensional and cannot meet the requirements of the model during prediction and analysis; and none of them addresses the problem of imbalanced data.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a communication construction project financial risk prediction method based on random forests.

In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:

A communication construction project financial risk prediction method based on random forests comprises the following steps:

acquiring and classifying the financial data of an enterprise, preprocessing the financial data and obtaining a plurality of data factors; judging whether the data factors exhibit multicollinearity; if so, reducing the dimensionality of the data factors to obtain a data set and applying the SMOTE algorithm to the data set; otherwise, applying the SMOTE algorithm directly to the data factors; then training a model on the processed data and verifying it with a test set.

Further, the step of preprocessing the financial data and obtaining a plurality of data factors comprises: normalizing all the data with a preprocessing function in a machine learning library so that every value lies between 0 and 1;

the processed data are divided into solvency, operating capacity, profitability and growth capacity, wherein solvency includes the current ratio, quick ratio and cash ratio factors; operating capacity includes the accounts receivable turnover, inventory turnover and inventory turnover days factors; profitability includes the net profit margin, gross profit margin and earnings per share factors; and growth capacity includes the total asset growth rate, net asset growth rate and shareholders' equity growth rate factors.

Further, judging whether the data factors exhibit multicollinearity comprises:

calculating the VIF (variance inflation factor) of each data factor, and considering that collinearity exists for the corresponding data factor when its VIF is greater than 10:

VIF_i = 1/(1 - R_i²)

wherein R_i² is calculated by a ridge regression model; R_i is the multiple correlation coefficient between the i-th variable X_i and the other variables X_j (j = 1, 2, 3, ..., k, j ≠ i), i.e., the arithmetic square root of the coefficient of determination R² (the goodness of fit); the coefficient of determination R_i² is the one obtained by taking X_i as the dependent variable and running a new regression on all the other X_j (j = 1, 2, 3, ..., k, j ≠ i).

Further, the importance of all factors is screened with the ReliefF algorithm, the idea of which is as follows: features are weighted to distinguish how relevant each feature is to the different classes, and each weight is calculated from the ability of the feature to discriminate between samples of different classes within a neighborhood. A weight threshold is set; features whose weight exceeds the threshold are retained and features whose weight is below it are filtered out, which finally yields the feature subset. The idea of the Relief algorithm is shown as follows:

wherein x^(i) is a sample in the training set, H is the set of same-class samples in its neighborhood, and M is the set of different-class samples in its neighborhood. If the difference coefficient D(P, x^(i), h^(j)) between x^(i) and H on feature P is small and the difference coefficient D(P, x^(i), m^(k)) between x^(i) and M is large, feature P contributes positively to classification and its weight should be raised; otherwise, feature P contributes negatively to classification and its weight should be lowered. This operation is repeated m times to obtain the feature weight vector.
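The weighting idea described above can be illustrated with a small sketch. It is not the patent's formula, which is not reproduced in this text: it assumes a two-class problem, features already scaled to [0, 1], a difference coefficient D equal to the absolute difference of feature values, and averaging over the k nearest same-class and different-class neighbors. The function name `relief_weights` and the sample data are hypothetical.

```python
import random

def relief_weights(samples, labels, n_iter, k, seed=0):
    """Relief-style feature weights: reward features that differ across classes."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def dist(a, b):
        # Manhattan distance used to find neighbors (an assumption of this sketch).
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        same = [s for j, s in enumerate(samples) if j != i and labels[j] == y]
        other = [s for j, s in enumerate(samples) if labels[j] != y]
        hits = sorted(same, key=lambda s: dist(x, s))[:k]      # neighborhood H
        misses = sorted(other, key=lambda s: dist(x, s))[:k]   # neighborhood M
        for p in range(n_features):
            hit_diff = sum(abs(x[p] - h[p]) for h in hits) / len(hits)
            miss_diff = sum(abs(x[p] - m[p]) for m in misses) / len(misses)
            # Small difference to H and large difference to M raises the weight.
            weights[p] += (miss_diff - hit_diff) / n_iter
    return weights

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
samples = [[0.10, 0.50], [0.20, 0.10], [0.15, 0.90],
           [0.90, 0.40], [0.80, 0.95], [0.85, 0.05]]
labels = [0, 0, 0, 1, 1, 1]
weights = relief_weights(samples, labels, n_iter=20, k=2)

threshold = 0.0  # hypothetical weight threshold
selected = [p for p, w in enumerate(weights) if w > threshold]
```

With this data the discriminative feature receives a positive weight and the noise feature a negative one, so only feature 0 survives the threshold.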

Further, the specific method by which the SMOTE algorithm processes the data is as follows: calculating the set of K nearest same-class neighbors of each minority-class sample, selecting samples from this set, synthesizing new samples, and classifying the new samples with a classifier.

Further, before the SMOTE algorithm processing, the method further includes the following steps:

selecting an imbalanced-data processing algorithm by processing the imbalanced data with the KM-SMOTE algorithm, the RM-SMOTE algorithm and the SMOTE algorithm and comparing the results;

KM-SMOTE algorithm: the core idea of the algorithm is to combine the K-Means method with the SMOTE algorithm to solve the boundary ambiguity problem that may exist in SMOTE; the algorithm mainly comprises three parts: determining the boundary points of the minority class, judging danger points, and correcting the oversampling formula; correcting the oversampling formula is the most critical step, and two oversampling formulas are selected for correction; first, an oversampling formula that performs the oversampling operation directly is introduced, in which new data are inserted on the basis of a selected sample point; the formula is as follows:

X_new＝u_i+rand(0,1)*(X-u_i),i＝1,2,3,...,k

wherein X_new is the newly interpolated sample, u_i is a cluster center, X is an original sample in the cluster centered at u_i, rand(0,1) represents a random number between 0 and 1, and k is the number of clusters; all newly interpolated data lie between the cluster center and the data sample point, which avoids the overfitting caused by the interpolation space of traditional SMOTE being too small;

RM-SMOTE algorithm: the minority-class samples of an imbalanced data set are first preprocessed to form clusters, and on this basis a spherical region with a determined radius is designed according to the Euclidean distance between the cluster centers and the clustered data samples, within which random interpolation is performed; the algorithm mainly comprises four parts: determining the boundary points of the minority class, judging danger points, determining an A-dimensional spherical space distance, and correcting the oversampling formula; correcting the oversampling interpolation formula is the most important part, and the algorithm requires that the randomly generated synthetic data satisfy the following three formulas:

||X_new-u_i||≤D_max

wherein ||X_new-u_i|| represents the Euclidean distance from the synthetic data X_new to the cluster center u_i, and D_max is the maximum Euclidean distance from the clustered data samples to the cluster center u_i;

X_newj＝u_ij+rand(0,1)*(b_j-a_j),1≤j≤E

wherein X_newj represents the value of the j-th attribute of the synthetic sample X_new, rand(0,1) represents a random number in (0,1), and a_j and b_j satisfy:

a_j＝u_ij-|x_maxj-u_ij|,b_j＝u_ij+|x_maxj-u_ij|,1≤j≤E

wherein |x_maxj-u_ij| represents the absolute value of the difference in the j-th attribute between the data X_max that attains the maximum Euclidean distance and the cluster center u_i.

Compared with the prior art, the invention has the advantages that:

(1) In the random forest-based communication construction project financial risk prediction method, the multicollinear factors are reduced in dimensionality by the ReliefF algorithm, which improves the independence among the factors, effectively reduces the dimensionality of the computation space, and keeps the complexity of the algorithm as low as possible while preserving the main features, so that the model is computed faster and more accurately. Three algorithms for processing imbalanced data are analyzed and the SMOTE algorithm is selected for processing the imbalanced data; down-sampling is used to reduce the number of samples of the majority class, so that the structure of the data is more reasonable and the validity of the model's results is improved. Compared with the existing logistic regression algorithm, support vector machine algorithm and voting algorithm, the method is more accurate and more effective.

Drawings

FIG. 1 is a flowchart of a method for predicting financial risk of a communication construction project based on a random forest according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present invention provides a method for predicting financial risk of a communication construction project based on a random forest, where the method includes the following steps:

A. Acquiring and classifying the enterprise financial data.

B. Preprocessing the financial data and obtaining a plurality of data factors.

C. Judging whether the data factors exhibit multicollinearity; if so, going to step D; otherwise, taking all the data factors as the data set and going to step E.

D. Reducing the dimensionality of the data factors to obtain a data set, and going to step E.

E. Applying the SMOTE algorithm to the data set, and going to step F.

F. Training a model on the processed data and verifying it with a test set.

The method specifically comprises the following steps:

S1, acquiring enterprise financial data from the Wind database, including the balance sheet, the income statement and the cash flow statement, and classifying the enterprises by nature and size into a main board class, a small and medium enterprise (SME) board class and a ChiNext (growth enterprise) board class.

S2, cleaning and preprocessing the enterprise financial data: because of the input requirements of the model, every value needs to lie between 0 and 1, so all the data are normalized with a preprocessing function in a machine learning library (in this embodiment, the min-max normalization method).
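As an illustration of the min-max normalization named in S2, the following minimal sketch rescales one column of values to [0, 1] in plain Python. The function name and the sample values are hypothetical; in scikit-learn the same operation is provided by `sklearn.preprocessing.MinMaxScaler`.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1] (min-max normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 by convention.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical current-ratio values for five enterprises.
current_ratio = [0.8, 1.2, 2.0, 1.5, 3.2]
scaled = min_max_scale(current_ratio)
```

Each factor column would be scaled independently, so that factors measured in different units become comparable.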

S3, dividing the processed data into four categories: solvency, operating capacity, profitability and growth capacity. Solvency includes the current ratio, quick ratio, cash ratio, capital turnover ratio, liquidation value ratio and interest coverage ratio factors; operating capacity includes the accounts receivable turnover, inventory turnover days, operating cycle, current asset turnover and total asset turnover factors; profitability includes the net profit margin, gross profit margin, earnings per share, operating profit margin, ratio of profit to costs and expenses, cash coverage of earnings, return on total assets, return on net assets and return on capital factors; growth capacity includes the total asset growth rate, net asset growth rate, shareholders' equity growth rate and other factors. Each category represents the financial capacity of the enterprise in a different dimension.

S4, analyzing the structure of each category of data: the Shanghai and Shenzhen stock exchanges apply Special Treatment to the stock trading of listed companies with abnormal financial or other conditions, and the abbreviation "ST" is prefixed to the short name of such stocks, which are therefore called ST stocks. The financial condition of main board and SME board enterprises is distinguished by ST and non-ST stocks: a non-ST stock indicates that the financial condition of the enterprise is good, and an ST stock indicates that the finances of the listed company have serious problems.

For the ChiNext (startup) board, enterprises whose return on assets is negative in two consecutive quarters are labeled 0 and the rest are labeled 1, where 0 represents a poor financial condition and 1 represents a good financial condition.
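The labeling rule for this board can be sketched as a short function. This is only an illustration under the assumption that each enterprise is described by a chronological list of quarterly return-on-assets values; the function name and the sample values are hypothetical.

```python
def label_by_consecutive_losses(quarterly_roa):
    """Return 0 (poor) if ROA is negative in two consecutive quarters, else 1 (good)."""
    for previous, current in zip(quarterly_roa, quarterly_roa[1:]):
        if previous < 0 and current < 0:
            return 0
    return 1

# Hypothetical quarterly return-on-assets series for two enterprises.
label_a = label_by_consecutive_losses([0.02, -0.01, -0.03, 0.01])  # two negative quarters in a row
label_b = label_by_consecutive_losses([0.02, -0.01, 0.03, -0.01])  # negatives are not consecutive
```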

The statistics show that the numbers of positive samples (non-ST stocks) and negative samples (ST stocks) in the main board data set are seriously imbalanced: ST stocks account for a low proportion of the Shanghai and Shenzhen main boards, so the numbers of positive and negative samples differ greatly.

S5, calculating the coefficient of determination between each factor and the other factors in the current category and judging whether the current factors exhibit multicollinearity; if so, going to step S6; otherwise, taking all the current factors as the data set and going to step S7.

The VIF (Variance Inflation Factor) of each factor is calculated from its coefficient of determination to identify which factors have serious multicollinearity; collinearity is considered to exist for an indicator when its VIF is greater than 10.

The coefficient of determination R² in the Variance Inflation Factor (VIF) is calculated by a ridge regression model:

VIF_i = 1/(1 - R_i²)

wherein R_i is the multiple correlation coefficient between the i-th variable X_i and the other variables X_j (j = 1, 2, 3, ..., k, j ≠ i), i.e., the arithmetic square root of the coefficient of determination R² (the goodness of fit). The coefficient of determination R_i² is the one obtained by taking X_i as the dependent variable and running a new regression on all the other X_j (j = 1, 2, 3, ..., k, j ≠ i).
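The VIF check can be sketched in plain Python as follows. This is a minimal illustration rather than the embodiment's implementation: it assumes the standard definition VIF_i = 1/(1 - R_i²), fits each auxiliary regression as a ridge regression through the normal equations, and uses a very small penalty `alpha` chosen here only for numerical stability. All function names and the sample data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, cols, alpha):
    """R^2 of a ridge regression of y on the given columns (data are centered first)."""
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    xc = [[v - sum(col) / n for v in col] for col in cols]
    p = len(xc)
    # Normal equations of ridge regression: (X^T X + alpha I) beta = X^T y.
    a = [[sum(xc[i][t] * xc[j][t] for t in range(n)) + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(xc[i][t] * yc[t] for t in range(n)) for i in range(p)]
    beta = solve(a, b)
    residuals = [yc[t] - sum(beta[i] * xc[i][t] for i in range(p)) for t in range(n)]
    return 1.0 - sum(e * e for e in residuals) / sum(v * v for v in yc)

def vif(columns, alpha=1e-8):
    """VIF_i = 1 / (1 - R_i^2), regressing each column on all the others."""
    return [1.0 / (1.0 - r_squared(col, columns[:i] + columns[i + 1:], alpha))
            for i, col in enumerate(columns)]

# Hypothetical factors: x3 is almost a copy of x1, x2 is only weakly related to them.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
x3 = [1.1, 1.9, 3.05, 4.0, 4.9, 6.1]
vifs = vif([x1, x2, x3])
collinear = [i for i, v in enumerate(vifs) if v > 10]  # threshold taken from the text
```

In this example the first and third factors exceed the threshold of 10 and would be passed to the dimensionality reduction step, while the second would not.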

S6, deleting the multicollinear factors, that is, the factors with lower relevance, to reduce the dimensionality and volume of the data and obtain a data set, and then going to step S7. The reduction of the data dimensionality and volume is implemented with ReliefF, as follows: features are weighted to distinguish how relevant each feature is to the different classes, and each weight is calculated from the ability of the feature to discriminate between samples of different classes within a neighborhood. A weight threshold is set; features whose weight exceeds the threshold are retained and features whose weight is below it are filtered out, which finally yields the feature subset. The idea of the Relief algorithm is shown as follows:

In this embodiment, the number of factors remaining after the VIF calculation is still large, so the screened factors need to be screened further. This has two purposes: a. factors with low relevance to the final result are deleted to reduce the dimensionality and volume of the data, which is done with the ReliefF algorithm; b. reducing the data dimensionality and volume speeds up the model and reduces the complexity of the program. The imbalanced sample data are then processed with the SMOTE algorithm (Synthetic Minority Over-sampling TEchnique), an over-sampling method that synthesizes artificial data by interpolating between similar samples.

S7, because the collected data contain few ST stocks but a very large number of non-ST stocks (that is, many positive examples and few negative examples), the SMOTE algorithm is used to adjust the proportion of positive and negative examples and so resolve the imbalance between them. The data structure can be adjusted by up-sampling or by down-sampling: up-sampling increases the number of negative samples so that the proportions of positive and negative samples are relatively balanced; down-sampling reduces the number of positive samples so that the ratio of positive to negative samples is more reasonable. The present embodiment adopts down-sampling, i.e., reduces the number of positive samples.

The specific processing method of the SMOTE algorithm is as follows: calculating the set of K nearest same-class neighbors of each minority-class sample, selecting samples from this set, synthesizing new samples, and classifying the new samples with a classifier.
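The neighbor-and-interpolate step of SMOTE can be sketched as below. This is a simplified illustration under stated assumptions: Euclidean distance is used to find neighbors, one neighbor is drawn uniformly from the K nearest, and the final classification of the new samples is left out. The function name and the sample points are hypothetical.

```python
import random

def smote(minority, k, n_new, seed=0):
    """Synthesize n_new points by interpolating between minority samples and their neighbors."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance is enough for ranking neighbors.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # K nearest same-class neighbors of x, excluding x itself.
        neighbors = sorted((s for s in minority if s is not x),
                           key=lambda s: dist2(x, s))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment from x to the neighbor
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic

# Hypothetical minority-class samples (two normalized factors each).
minority = [[0.10, 0.20], [0.20, 0.30], [0.15, 0.25], [0.30, 0.10]]
new_points = smote(minority, k=2, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, it stays inside the region already occupied by the minority class.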

Step S7 further includes selecting an imbalanced-data processing algorithm: the KM-SMOTE algorithm, the RM-SMOTE algorithm and the SMOTE algorithm are each used to process the imbalanced data, and the results are compared.

KM-SMOTE algorithm: the core idea of the algorithm is to combine the K-Means method with the SMOTE algorithm to solve the boundary ambiguity problem that may exist in SMOTE. The algorithm mainly comprises three parts: determining the boundary points of the minority class, judging danger points, and correcting the oversampling formula. Correcting the oversampling formula is the most critical step, and two oversampling formulas are selected for correction. First, an oversampling formula that performs the oversampling operation directly is introduced, in which new data are inserted on the basis of a selected sample point. The formula is as follows:

X_new＝u_i+rand(0,1)*(X-u_i),i＝1,2,3,...,k

wherein X_new is the newly interpolated sample, u_i is a cluster center, X is an original sample in the cluster centered at u_i, rand(0,1) represents a random number between 0 and 1, and k is the number of clusters. All newly interpolated data lie between the cluster center and the data sample point, which avoids the overfitting caused by the interpolation space of traditional SMOTE being too small.
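The KM-SMOTE interpolation formula X_new = u_i + rand(0,1)*(X - u_i) can be illustrated with the sketch below. It assumes that the minority samples have already been clustered (for example by K-Means, which is not shown) and takes the cluster center as the mean of the cluster members; the function names and the sample cluster are hypothetical.

```python
import random

def centroid(cluster):
    """Cluster center u_i, taken here as the per-attribute mean of the members."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def km_smote_sample(cluster, rng):
    """One synthetic point X_new = u_i + rand(0,1) * (X - u_i) for a given cluster."""
    u = centroid(cluster)
    x = rng.choice(cluster)   # original sample X from the cluster centered at u_i
    r = rng.random()          # rand(0,1)
    return [ui + r * (xi - ui) for ui, xi in zip(u, x)]

# Hypothetical cluster of minority samples produced by a prior clustering step.
cluster = [[0.10, 0.20], [0.20, 0.30], [0.15, 0.25], [0.25, 0.15]]
rng = random.Random(0)
km_points = [km_smote_sample(cluster, rng) for _ in range(5)]
```

Each synthetic point lies on the segment between the cluster center and an original sample, which matches the statement that the new data fall between the cluster center and the sample point.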

RM-SMOTE algorithm: the minority-class samples of an imbalanced data set are first preprocessed to form clusters, and on this basis a spherical region with a determined radius is designed according to the Euclidean distance between the cluster centers and the clustered data samples, within which random interpolation is performed. The algorithm mainly comprises four parts: determining the boundary points of the minority class, judging danger points, determining an A-dimensional spherical space distance, and correcting the oversampling formula. Correcting the oversampling interpolation formula is the most important part, and the algorithm requires that the randomly generated synthetic data satisfy the following three formulas:

||X_new-u_i||≤D_max

wherein ||X_new-u_i|| represents the Euclidean distance from the synthetic data X_new to the cluster center u_i, and D_max is the maximum Euclidean distance from the clustered data samples to the cluster center u_i.

X_newj＝u_ij+rand(0,1)*(b_j-a_j),1≤j≤E

a_j＝u_ij-|x_maxj-u_ij|,b_j＝u_ij+|x_maxj-u_ij|,1≤j≤E
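The three RM-SMOTE conditions can be transcribed into a short sketch. It follows the formulas literally as they are written in this text and makes several assumptions: the cluster and its center are given, an independent random number is drawn for each attribute, and a candidate that violates the distance condition is simply redrawn. The function name and the sample cluster are hypothetical.

```python
import math
import random

def rm_smote_sample(cluster, center, rng, max_tries=1000):
    """Draw one synthetic point satisfying the three RM-SMOTE conditions as written."""
    # X_max: the cluster member farthest from the center; D_max: that distance.
    x_max = max(cluster, key=lambda x: math.dist(x, center))
    d_max = math.dist(x_max, center)
    # a_j = u_ij - |x_maxj - u_ij|,  b_j = u_ij + |x_maxj - u_ij|
    a = [u - abs(xm - u) for u, xm in zip(center, x_max)]
    b = [u + abs(xm - u) for u, xm in zip(center, x_max)]
    for _ in range(max_tries):
        # X_newj = u_ij + rand(0,1) * (b_j - a_j), one random number per attribute.
        candidate = [u + rng.random() * (bj - aj) for u, aj, bj in zip(center, a, b)]
        # Keep the candidate only if ||X_new - u_i|| <= D_max.
        if math.dist(candidate, center) <= d_max:
            return candidate
    raise RuntimeError("no candidate satisfied the distance condition")

# Hypothetical cluster of minority samples and its center (mean of the members).
cluster = [[0.10, 0.20], [0.20, 0.30], [0.15, 0.25], [0.25, 0.15]]
center = [sum(col) / len(cluster) for col in zip(*cluster)]
rng = random.Random(0)
rm_point = rm_smote_sample(cluster, center, rng)
```

Read literally, the second formula places each attribute between u_ij and u_ij + (b_j - a_j), so the distance condition acts as a filter on the candidates; the sketch keeps that behavior rather than guessing at a different interpolation range.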

After computing with all three algorithms, the data processed by the SMOTE algorithm are found to be better balanced, so the SMOTE algorithm is selected for processing the imbalanced data.

S8, randomly dividing the processed data set into a training set and a test set at a ratio of 8:2 by using the cross_validation.train_test_split method (a function for splitting data sets) in the sklearn framework (a machine learning library), wherein the processed data set in this embodiment is the data processed by the ReliefF algorithm and SMOTE.
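The 8:2 random split can be illustrated without any library, as in the sketch below; the function name and the toy data are hypothetical. In scikit-learn, recent releases expose the corresponding function as `sklearn.model_selection.train_test_split`, since the older `cross_validation` module has been removed.

```python
import random

def split_80_20(rows, labels, seed=0):
    """Shuffle the indices and split rows and labels into 80% training and 20% test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) * 8 // 10
    train_idx, test_idx = indices[:cut], indices[cut:]
    train_x = [rows[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [rows[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

# Hypothetical data set of ten samples with one factor each.
rows = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
train_x, test_x, train_y, test_y = split_80_20(rows, labels)
```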

S9, training the model by fitting it to the training data, wherein the training time varies with the number of iterations and the size of the data.

S10, verifying the effect of the model with the test data, wherein this embodiment uses two indicators, accuracy and recall, to represent the prediction effect of the model.

Accuracy = (TP + TN)/(TP + FN + FP + TN)

Wherein: TP is the number of the positive classes predicted to be the positive classes, FN is the number of the positive classes predicted to be the negative classes, FP is the number of the negative classes predicted to be the positive classes, TN is the number of the negative classes predicted to be the negative classes, and the accuracy can effectively reflect the judgment capability of the algorithm on the samples.

When the data set is imbalanced (for example, when the numbers of positive and negative class labels differ significantly), accuracy cannot reflect the overall situation of the data, and its usefulness as an indicator is greatly reduced.

In this embodiment, recall is used to compensate for this shortcoming of accuracy. Recall is the proportion of the samples that actually belong to the positive class that are correctly predicted as positive, and the formula is as follows:

Recall = TP/(TP + FN)

wherein TP is the number of positive-class samples predicted as positive, and FN is the number of positive-class samples predicted as negative.
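The two indicators can be computed directly from the four counts defined above, as in this minimal sketch. Recall is taken here as TP/(TP + FN), which matches the verbal definition of recall as the share of actual positives that are predicted as positive; the counts in the example are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all samples that are classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):
    """Share of actual positive samples that are predicted as positive."""
    return tp / (tp + fn)

# Hypothetical confusion-matrix counts from a test set of 200 samples.
tp, fn, fp, tn = 80, 20, 10, 90
acc = accuracy(tp, tn, fp, fn)   # (80 + 90) / 200
rec = recall(tp, fn)             # 80 / (80 + 20)
```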

S11, predicting the financial risk with three other classification algorithms, namely the logistic regression algorithm, the support vector machine algorithm and the voting algorithm, and comparing the results with those of step S10 to demonstrate the advantage of this embodiment in accuracy.

As can be seen from Tables 1 to 4, the algorithm of this embodiment predicts the main board, SME board and ChiNext (startup) board data better than the logistic regression, support vector machine and voting algorithms, and the random forest also performs well in terms of recall. The recall on the main board data set is not high, which may be related to the imbalanced ratio of positive to negative samples in the main board data.

Table 1 Logistic regression results

Table 2 Support vector machine results

Table 3 Voting algorithm results

Table 4 Random forest results

The present invention is not limited to the above-mentioned preferred embodiments, and any other products in various forms can be obtained by anyone with the teaching of the present invention, but any changes in the shape or structure thereof, which have the same or similar technical solutions as the present invention, are within the protection scope.

Claims

1. A communication construction project financial risk prediction method based on random forests, characterized by comprising the following steps:

2. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: the method for preprocessing the financial data and obtaining a plurality of data factors comprises the following steps: uniformly controlling the range of each datum between 0 and 1, and standardizing all the data by using a preprocessing function in a machine learning library to ensure that all the data are between 0 and 1;

3. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: judging whether the data factors exhibit multicollinearity comprises the following steps:

4. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: the importance of all factors was screened using the ReliefF algorithm:

5. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: the specific method by which the SMOTE algorithm processes the data comprises: calculating the set of K nearest same-class neighbors of each minority-class sample, selecting samples from this set, synthesizing new samples, and classifying the new samples with a classifier.

6. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: before SMOTE algorithm processing, the method also comprises the following steps:

X_new＝u_i+rand(0,1)*(X-u_i),i＝1,2,3,...,k

||X_new-u_i||≤D_max

X_newj＝u_ij+rand(0,1)*(b_j-a_j),1≤j≤E

a_j＝u_ij-|x_maxj-u_ij|,b_j＝u_ij+|x_maxj-u_ij|,1≤j≤E

wherein |x_maxj-u_ij| denotes the absolute value of the difference in the j-th attribute between the data X_max that attains the maximum Euclidean distance and the cluster center u_i.