CN110689437A - Communication construction project financial risk prediction method based on random forest - Google Patents


Info

Publication number
CN110689437A
Authority
CN
China
Prior art keywords
data
algorithm
samples
new
factors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910949059.8A
Other languages
Chinese (zh)
Inventor
王翔
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUBEI TELECOM ENGINEERING Co Ltd
Original Assignee
HUBEI TELECOM ENGINEERING Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUBEI TELECOM ENGINEERING Co Ltd filed Critical HUBEI TELECOM ENGINEERING Co Ltd
Priority to CN201910949059.8A priority Critical patent/CN110689437A/en
Publication of CN110689437A publication Critical patent/CN110689437A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a random-forest-based method for predicting the financial risk of communication construction projects, in the field of data processing. The method comprises the following steps: acquiring and classifying the financial data of an enterprise; preprocessing the financial data to obtain a plurality of data factors; judging whether the data factors exhibit multicollinearity; if so, reducing the dimensionality of the data factors to obtain a data set and applying the SMOTE algorithm to the data set; otherwise, applying the SMOTE algorithm directly to the data factors; then training a model on the processed data and verifying it with a test set.

Description

Communication construction project financial risk prediction method based on random forest
Technical Field
The invention relates to the field of data processing, in particular to a communication construction project financial risk prediction method based on random forests.
Background
In recent years, China's information industry has advanced by leaps and bounds, and the internet era has entered a period of large-scale integration. To meet the growing demands of national economic development, the demand for communication engineering construction keeps increasing, which brings many business opportunities to communication construction enterprises.
Because of the particular requirements of customers and of project construction management, the communication construction industry differs markedly from other industries: construction periods are long, and the demand for both capital stock and capital flow is large. During construction, the credit relationship with the client (Party A) usually produces high accounts receivable, which ties up capital. At the same time, the capital recovery period of a communication construction enterprise is relatively long, and a certain proportion of quality guarantee funds is generally set aside before construction. As the enterprise grows and undertakes more projects, large capital requirements and capital investment arise.
Furthermore, because the engineering projects of a communication construction enterprise run for a long time, failure to collect receivables promptly and effectively can cause cash-flow shortages or even a break in the capital chain. Accounts receivable are therefore an important creditor asset of the enterprise, but in practice the resources are held by the debtor: the enterprise has the right to collect the money but cannot dispose of the assets. Until they are collected, receivables exist only on the books and may turn into bad debts through factors such as the debtor's credit default. The enterprise truly holds the assets and resources only once the cash actually flows in.
In addition, accounts receivable mean that the enterprise's main business income has not yet produced a cash inflow and represents only a right to collect. Before the receivables are recovered, the profit realized by the enterprise is only book profit, a number in the financial statements rather than actual profit; the income corresponding to the receivables has no real cash inflow behind it and is merely "wealth on paper".
Only when accounts receivable and debts are cleared properly can a communication construction enterprise generate a relatively stable cash flow, develop healthily, and sustain its daily operation and management. The enterprise should therefore monitor its claims and debts at all times and clear outstanding receivables and payables promptly. Given the receivables problems of the communication industry, it should also analyze each project reasonably before undertaking construction, covering the state of the industry, the company's financial condition, the policy background, and so on.
Scholars and researchers in related fields have done a great deal of research and validation work on financial risk prediction. The Chinese economist Zhouyghua added cash-flow factors to the original Z model to construct the F model for financial risk prediction; the model used the financial statement data of 27 enterprises, divided into listed and non-listed, and obtained a good prediction effect. Zhang jin Gui, Huang Shu, Wang Jun Nu and others used principal component analysis to analyze the importance of each financial indicator, selected data from 40 listed and non-listed enterprises, analyzed the causes of enterprise financial failure, and obtained good results. Wu Shannon, Gunn and other scholars analyzed financial data by constructing a regression model, selected the financial data of more than 70 listed companies, and evaluated financial risk with a logistic regression algorithm.
In addition, neural network models, the Z-Score model, the analytic hierarchy process, factor analysis, and the efficacy coefficient method are widely used for financial risk prediction, but these methods have several disadvantages: the models are built without considering industry characteristics, and because industry conditions differ, simply transferring a model does not give good results; each relies on a single method and cannot meet the needs of prediction and analysis; and none of them solves the problem of imbalanced data.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a communication construction project financial risk prediction method based on random forests.
In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:
a communication construction project financial risk prediction method based on random forests comprises the following steps:
acquiring and classifying the financial data of an enterprise; preprocessing the financial data to obtain a plurality of data factors; judging whether the data factors exhibit multicollinearity; if so, reducing the dimensionality of the data factors to obtain a data set and applying the SMOTE algorithm to the data set; otherwise, applying the SMOTE algorithm directly to the data factors; then training a model on the processed data and verifying it with a test set.
Further, preprocessing the financial data and obtaining a plurality of data factors comprises the following steps: standardizing all the data with a preprocessing function of a machine learning library so that every value lies between 0 and 1;
dividing the processed data into solvency, operating capacity, profitability and growth capacity, where solvency includes the current ratio, quick ratio and cash ratio factors; operating capacity includes the accounts receivable turnover, inventory turnover and inventory turnover days factors; profitability includes the net profit margin, gross profit margin and earnings per share factors; and growth capacity includes the total asset growth rate, net asset growth rate and shareholders' equity growth rate factors.
Further, judging whether the data factors exhibit multicollinearity comprises:
calculating the VIF (Variance Inflation Factor) of each data factor, and considering that collinearity exists between the corresponding data factors when the VIF is larger than 10;
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is calculated with a ridge regression model; $R_i$ is the multiple correlation coefficient between the $i$-th variable $X_i$ and the other variables $X_j$ ($i, j = 1, 2, 3, \ldots, k$; $i \neq j$), that is, the arithmetic square root of the coefficient of determination (goodness of fit) $R_i^2$. The coefficient of determination $R_i^2$ is obtained by taking $X_i$ as the dependent variable and running a new regression on all the other $X_j$.
Further, the importance of all factors is screened with the ReliefF algorithm, whose idea is as follows: features are weighted to distinguish their relevance to the different classes, and each weight is computed from the feature's ability to discriminate between samples of different classes within a neighborhood. A weight threshold is set; features whose weight exceeds the threshold are retained and those below it are filtered out, which finally yields the feature subset. The idea of the Relief algorithm is expressed as follows:
Figure BDA0002225164470000042
Figure BDA0002225164470000043
where $x^{(i)}$ is a sample in the training set, $H$ is the set of same-class samples in its neighborhood, and $M$ is the set of different-class samples in its neighborhood. If the difference coefficient $D(P, x^{(i)}, h^{(j)})$ between $x^{(i)}$ and $H$ on feature $P$ is small, and the difference coefficient $D(P, x^{(i)}, m^{(k)})$ between $x^{(i)}$ and $M$ is large, feature $P$ contributes positively to classification and its weight should be raised; otherwise feature $P$ has a negative effect on classification and its weight should be reduced. Repeating this operation $m$ times yields the feature weight vector.
Further, the SMOTE algorithm processes the data as follows: for each minority-class sample, its K nearest neighbors of the same class are computed; samples are selected from these neighbors and new samples are synthesized; and the new samples are classified with a classifier.
Further, before the SMOTE algorithm processing, the method further includes the following steps:
selecting an unbalanced data processing algorithm, and selecting a KM-SMOTE algorithm, a RM-SMOTE algorithm and a SMOTE algorithm to process and compare unbalanced data;
KM-SMOTE algorithm: the core idea is to combine the K-Means method with the SMOTE algorithm to solve the boundary-ambiguity problem that may exist in SMOTE; the algorithm has three main parts: determining the boundary points of the minority class, judging danger points, and correcting the oversampling formula; correcting the oversampling formula is the most critical step, and two oversampling formulas are selected for correction; first, an oversampling formula that performs oversampling directly is introduced, which inserts new data based on a selected sample point:
$$X_{new} = u_i + \mathrm{rand}(0,1) \times (X - u_i), \quad i = 1, 2, 3, \ldots, k$$
where $X_{new}$ is the newly interpolated sample, $u_i$ is a cluster center, $X$ is an original sample in the cluster centered at $u_i$, $\mathrm{rand}(0,1)$ is a random number between 0 and 1, and $k$ is the number of clusters; every interpolated point lies between the cluster center and a data sample point, which avoids the overfitting caused by the overly small interpolation space of conventional SMOTE;
RM-SMOTE algorithm: the minority-class samples of the imbalanced data set are first clustered; on that basis, a spherical region with a determined radius is constructed from the Euclidean distances between each cluster center and the samples in its cluster, and random interpolation is performed inside it; the algorithm has four main parts: determining the boundary points of the minority class, judging danger points, determining the A-dimensional spherical space distance, and correcting the oversampling formula; the corrected oversampling interpolation formula is the most important part, and the algorithm requires randomly generated synthetic data to satisfy the following three formulas:
$$\|X_{new} - u_i\| \le D_{max}$$
where $\|X_{new} - u_i\|$ is the Euclidean distance from the synthetic sample $X_{new}$ to the cluster center $u_i$, and $D_{max}$ is the maximum Euclidean distance from the samples in the cluster to the cluster center $u_i$;
$$x_{newj} = u_{ij} + \mathrm{rand}(0,1) \times (b_j - a_j), \quad 1 \le j \le E$$
where $x_{newj}$ is the value of the $j$-th attribute of the synthetic sample $X_{new}$, $\mathrm{rand}(0,1)$ is a random number in (0, 1), and $a_j$ and $b_j$ satisfy
$$a_j = u_{ij} - |x_{maxj} - u_{ij}|, \quad b_j = u_{ij} + |x_{maxj} - u_{ij}|, \quad 1 \le j \le E$$
where $|x_{maxj} - u_{ij}|$ is the absolute difference in the $j$-th attribute between $X_{max}$, the sample at the maximum Euclidean distance, and the cluster center $u_i$.
Compared with the prior art, the invention has the advantages that:
(1) In the random-forest-based method for predicting the financial risk of communication construction projects, the multicollinear factors are reduced in dimensionality with the ReliefF algorithm, which improves the independence among the factors, effectively reduces the dimensionality of the computation space, and reduces the complexity of the algorithm as far as possible while preserving the main features, so that the model computes faster and more accurately. Three algorithms for processing imbalanced data are analyzed and the SMOTE algorithm is selected; down-sampling is used to reduce the number of samples in the majority class, so that the data structure is more reasonable and the model's results are more effective. Compared with the existing logistic regression algorithm, support vector machine algorithm and voting algorithm, the method is more accurate and effective.
Drawings
FIG. 1 is a flowchart of a method for predicting financial risk of a communication construction project based on a random forest according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a method for predicting financial risk of a communication construction project based on a random forest, where the method includes the following steps:
A. Acquire and classify the enterprise financial data.
B. Preprocess the financial data to obtain a plurality of data factors.
C. Judge whether the data factors exhibit multicollinearity; if so, go to step D; otherwise, take all the data factors as the data set and go to step E.
D. Reduce the dimensionality of the data factors to obtain a data set, and go to step E.
E. Apply the SMOTE algorithm to the data set, and go to step F.
F. Train the model on the processed data and verify it with a test set.
The method specifically comprises the following steps:
S1, acquiring enterprise financial data from the Wande (Wind) database, including the balance sheet, the income statement and the cash flow statement, and classifying the enterprises by nature and size into the main board, the small and medium-sized enterprise (SME) board and the startup board.
S2, cleaning and preprocessing the enterprise financial data: because of the input requirements of the model, every value must lie between 0 and 1, so all data are standardized with a preprocessing function of a machine learning library (min-max standardization in this embodiment).
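As an illustration of the min-max standardization named in S2, the following is a minimal sketch in plain Python rather than the library function used in the embodiment. The function name `min_max_scale` and the sample values are hypothetical, and mapping a constant column to 0.0 is an assumed convention.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so that they lie in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumed convention: a constant column is mapped to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical current-ratio values for five enterprises.
current_ratio = [0.8, 1.2, 2.0, 1.6, 3.2]
scaled = min_max_scale(current_ratio)
print(scaled)
```

Each factor column would be scaled independently, so that factors measured in different units become comparable.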
S3, dividing the processed data into four categories: solvency, operating capacity, profitability and growth capacity. Solvency includes the current ratio, quick ratio, cash ratio, capital turnover ratio, liquidation value ratio and interest coverage multiple factors. Operating capacity includes the accounts receivable turnover, inventory turnover days, operating cycle, current asset turnover and total asset turnover factors. Profitability includes the net profit margin, gross profit margin, earnings per share, operating profit margin, cost-expense profit margin, surplus cash coverage multiple, return on total assets, return on net assets and return on capital factors. Growth capacity includes the total asset growth rate, net asset growth rate, shareholders' equity growth rate and other factors. Each category represents a different dimension of the enterprise's financial capacity.
S4, analyzing the structure of each category of data: the Shanghai and Shenzhen stock exchanges apply special treatment to the trading of stocks of listed companies with abnormal financial or other conditions. Because "special treatment" is abbreviated ST and prefixed to the stock's short name, such stocks are called ST stocks. The financial condition of main-board and SME-board enterprises is therefore distinguished by ST and non-ST stocks: an ST stock indicates that the listed company has serious financial problems, and a non-ST stock indicates that the enterprise's financial condition is good.
For the startup board, an enterprise whose return on assets is negative in two consecutive quarters is labeled 0 and all others are labeled 1, where 0 represents poor financial condition and 1 represents good financial condition.
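The startup-board labeling rule can be sketched as below. This is only an illustration: the function name `label_startup_board` is hypothetical, and the sketch assumes that "two consecutive quarters" means any adjacent pair within the observed series of quarterly return-on-assets values.

```python
def label_startup_board(quarterly_roa):
    """Return 0 (poor) if return on assets is negative in two adjacent quarters, else 1."""
    for previous, current in zip(quarterly_roa, quarterly_roa[1:]):
        if previous < 0 and current < 0:
            return 0
    return 1


# Hypothetical quarterly return-on-assets series for two enterprises.
print(label_startup_board([0.02, -0.01, -0.03, 0.01]))  # two adjacent negative quarters
print(label_startup_board([0.02, -0.01, 0.03, -0.02]))  # negatives are not adjacent
```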
The statistics show a serious imbalance between the numbers of positive samples (non-ST stocks) and negative samples (ST stocks) in the main-board data set: ST stocks make up only a small proportion of the Shanghai and Shenzhen main boards, so the numbers of positive and negative samples differ greatly.
S5, calculating, for each factor, the coefficient of determination with respect to the other factors in the current category and judging whether the factors exhibit multicollinearity; if so, go to step S6; otherwise, take all current factors as the data set and go to step S7.
The VIF (Variance Inflation Factor) of each factor is computed from its coefficient of determination to identify the factors with serious multicollinearity; when the VIF is greater than 10, collinearity is considered to exist between the indicators.
The coefficient of determination $R_i^2$ in the Variance Inflation Factor (VIF) is calculated with a ridge regression model:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i$ is the multiple correlation coefficient between the $i$-th variable $X_i$ and the other variables $X_j$ ($i, j = 1, 2, 3, \ldots, k$; $i \neq j$), that is, the arithmetic square root of the coefficient of determination (goodness of fit) $R_i^2$. The coefficient of determination $R_i^2$ is obtained by taking $X_i$ as the dependent variable and running a new regression on all the other $X_j$.
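The VIF screening in S5 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: the helper names (`solve`, `r_squared`, `vif`), the ridge penalty value `alpha=1e-3` and the three sample factor columns are all hypothetical. Each factor is regressed on the remaining factors, and VIF is taken as 1 / (1 - R²).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, predictors, alpha=0.0):
    """R^2 of regressing y on the predictor columns (ridge penalty alpha; 0 gives OLS)."""
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # Centering every column removes the need for an intercept term.
    cols = [[v - sum(col) / n for v in col] for col in predictors]
    p = len(cols)
    gram = [[sum(cols[i][t] * cols[j][t] for t in range(n)) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(cols[i][t] * yc[t] for t in range(n)) for i in range(p)]
    beta = solve(gram, rhs)
    residuals = [yc[t] - sum(beta[i] * cols[i][t] for i in range(p)) for t in range(n)]
    return 1.0 - sum(e * e for e in residuals) / sum(v * v for v in yc)


def vif(columns, alpha=0.0):
    """Variance inflation factor of each column with respect to all the others."""
    result = []
    for i, col in enumerate(columns):
        r2 = r_squared(col, columns[:i] + columns[i + 1:], alpha)
        result.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return result


# Hypothetical factors: the second is almost twice the first, the third is unrelated.
factors = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
]
vifs = vif(factors, alpha=1e-3)
collinear = [i for i, v in enumerate(vifs) if v > 10]
print(vifs, collinear)
```

With the threshold of 10 stated in the text, the first two hypothetical factors are flagged as collinear and the third is not.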
S6, deleting the multicollinear factors, that is, the factors with lower relevance, to reduce the dimensionality and volume of the data and obtain a data set, and then going to step S7. The reduction is implemented with ReliefF as follows: features are weighted to distinguish their relevance to the different classes, and each weight is computed from the feature's ability to discriminate between samples of different classes within a neighborhood. A weight threshold is set; features whose weight exceeds the threshold are retained and those below it are filtered out, which finally yields the feature subset. The idea of the Relief algorithm is expressed as follows:
Figure BDA0002225164470000091
Figure BDA0002225164470000092
where $x^{(i)}$ is a sample in the training set, $H$ is the set of same-class samples in its neighborhood, and $M$ is the set of different-class samples in its neighborhood. If the difference coefficient $D(P, x^{(i)}, h^{(j)})$ between $x^{(i)}$ and $H$ on feature $P$ is small, and the difference coefficient $D(P, x^{(i)}, m^{(k)})$ between $x^{(i)}$ and $M$ is large, feature $P$ contributes positively to classification and its weight should be raised; otherwise feature $P$ has a negative effect on classification and its weight should be reduced. Repeating this operation $m$ times yields the feature weight vector.
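The weighting idea described above can be sketched as follows. The formula images are not reproduced in this text, so the sketch rests on assumptions: it implements a basic two-class Relief with k nearest same-class and different-class neighbors, not the full multi-class ReliefF; the difference coefficient is taken as the absolute difference of values already scaled to [0, 1]; and each update is normalized by m·k. The names `relief_weights`, `diff` and `distance` and the sample data are hypothetical.

```python
import random


def diff(p, a, b):
    """Assumed difference coefficient on feature p for values already scaled to [0, 1]."""
    return abs(a[p] - b[p])


def distance(a, b):
    """Euclidean distance between two samples."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def relief_weights(samples, labels, m, k=1, seed=0):
    """Basic Relief: draw m random samples and update every feature weight each time."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(len(samples))
        x = samples[i]
        # k nearest same-class samples (excluding x itself) and k nearest other-class samples.
        hits = sorted((s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]),
                      key=lambda s: distance(x, s))[:k]
        misses = sorted((s for j, s in enumerate(samples) if labels[j] != labels[i]),
                        key=lambda s: distance(x, s))[:k]
        for p in range(n_features):
            hit_term = sum(diff(p, x, h) for h in hits) / (m * k)
            miss_term = sum(diff(p, x, s) for s in misses) / (m * k)
            # Small same-class difference and large other-class difference raise the weight.
            weights[p] += miss_term - hit_term
    return weights


# Hypothetical data: feature 0 separates the two classes, feature 1 does not.
samples = [[0.1, 0.9], [0.2, 0.1], [0.15, 0.5], [0.9, 0.8], [0.8, 0.2], [0.85, 0.5]]
labels = [0, 0, 0, 1, 1, 1]
weights = relief_weights(samples, labels, m=20)
selected = [p for p, w in enumerate(weights) if w > 0.0]  # assumed threshold of 0
print(weights, selected)
```

The threshold of 0 is only an example; the text leaves the threshold value open.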
In this embodiment, the number of factors remaining after the VIF screening is still large, so the screened factors are screened further, for two purposes: a. factors with low relevance to the final result are deleted to reduce the dimensionality and volume of the data, which is done with the ReliefF algorithm; b. reducing the data dimensionality and volume speeds up the model and reduces the complexity of the program. The imbalanced sample data are then processed with the SMOTE algorithm (Synthetic Minority Over-sampling TEchnique), an over-sampling method that synthesizes artificial samples by interpolating between similar samples.
S7, because the collected data contain few ST stocks but very many non-ST stocks (that is, many positive examples and few negative examples), the SMOTE algorithm is used to adjust the proportion of positive and negative examples and so resolve the imbalance between them. The data structure can be adjusted by up-sampling or down-sampling: up-sampling increases the number of negative samples so that the proportions of positive and negative samples are relatively balanced; down-sampling reduces the number of positive samples so that the numbers of positive and negative samples are more reasonable. This embodiment adopts down-sampling, that is, it reduces the number of positive samples.
The SMOTE algorithm processes the data as follows: for each minority-class sample, its K nearest neighbors of the same class are computed; samples are selected from these neighbors and new samples are synthesized; and the new samples are classified with a classifier.
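The neighbor-and-interpolate step of SMOTE can be sketched in plain Python as below. This is an illustrative sketch, not the embodiment's code: the function name `smote`, its parameters and the sample minority points are hypothetical, and the final classification of the new samples is left out.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Synthesize n_new samples by interpolating between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # K nearest neighbors of x within the minority class (x itself excluded).
        neighbours = sorted((s for s in minority if s is not x),
                            key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment from x to the neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical minority-class samples with two scaled factors each.
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10], [0.30, 0.30], [0.25, 0.15]]
new_samples = smote(minority, n_new=4)
print(new_samples)
```

Because every synthetic point lies on a segment between two existing minority samples, it stays inside the range already covered by the minority class.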
Step S7 further includes selecting an unbalanced data processing algorithm, and selecting the KM-SMOTE algorithm, the RM-SMOTE algorithm, and the SMOTE algorithm to perform unbalanced data processing and comparison.
KM-SMOTE algorithm: the core idea is to combine the K-Means method with the SMOTE algorithm to solve the boundary-ambiguity problem that may exist in SMOTE. The algorithm has three main parts: determining the boundary points of the minority class, judging danger points, and correcting the oversampling formula. Correcting the oversampling formula is the most critical step, and two oversampling formulas are selected for correction. First, an oversampling formula that performs oversampling directly is introduced; it inserts new data based on a selected sample point:
$$X_{new} = u_i + \mathrm{rand}(0,1) \times (X - u_i), \quad i = 1, 2, 3, \ldots, k$$
where $X_{new}$ is the newly interpolated sample, $u_i$ is a cluster center, $X$ is an original sample in the cluster centered at $u_i$, $\mathrm{rand}(0,1)$ is a random number between 0 and 1, and $k$ is the number of clusters. Every interpolated point lies between the cluster center and a data sample point, which avoids the overfitting caused by the overly small interpolation space of conventional SMOTE.
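The KM-SMOTE interpolation formula can be sketched as follows for a single cluster. The sketch assumes the K-Means clustering has already been done and takes the cluster mean as the cluster center; the boundary-point and danger-point steps are omitted. The names `centroid` and `km_smote_interpolate` and the sample cluster are hypothetical.

```python
import math
import random


def centroid(cluster):
    """Mean of the samples in a cluster, used here as the cluster center u_i."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def km_smote_interpolate(cluster, n_new, seed=0):
    """Apply X_new = u_i + rand(0,1) * (X - u_i) to randomly chosen samples X of one cluster."""
    rng = random.Random(seed)
    u = centroid(cluster)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(cluster)
        r = rng.random()
        synthetic.append([uj + r * (xj - uj) for uj, xj in zip(u, x)])
    return synthetic


# Hypothetical minority-class cluster produced by a prior K-Means step.
cluster = [[0.2, 0.3], [0.4, 0.5], [0.3, 0.1], [0.5, 0.4]]
km_samples = km_smote_interpolate(cluster, n_new=5)
print(km_samples)
```

Each synthetic point lies on the segment between the cluster center and an original sample, so it can never be farther from the center than the farthest original sample.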
RM-SMOTE algorithm: the minority-class samples of the imbalanced data set are first clustered; on that basis, a spherical region with a determined radius is constructed from the Euclidean distances between each cluster center and the samples in its cluster, and random interpolation is performed inside it. The algorithm has four main parts: determining the boundary points of the minority class, judging danger points, determining the A-dimensional spherical space distance, and correcting the oversampling formula. The corrected oversampling interpolation formula is the most important part; the algorithm requires randomly generated synthetic data to satisfy the following three formulas:
$$\|X_{new} - u_i\| \le D_{max}$$
where $\|X_{new} - u_i\|$ is the Euclidean distance from the synthetic sample $X_{new}$ to the cluster center $u_i$, and $D_{max}$ is the maximum Euclidean distance from the samples in the cluster to the cluster center $u_i$.
$$x_{newj} = u_{ij} + \mathrm{rand}(0,1) \times (b_j - a_j), \quad 1 \le j \le E$$
where $x_{newj}$ is the value of the $j$-th attribute of the synthetic sample $X_{new}$, $\mathrm{rand}(0,1)$ is a random number in (0, 1), and $a_j$ and $b_j$ satisfy
$$a_j = u_{ij} - |x_{maxj} - u_{ij}|, \quad b_j = u_{ij} + |x_{maxj} - u_{ij}|, \quad 1 \le j \le E$$
where $|x_{maxj} - u_{ij}|$ is the absolute difference in the $j$-th attribute between $X_{max}$, the sample at the maximum Euclidean distance, and the cluster center $u_i$.
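The three RM-SMOTE conditions can be sketched for one cluster as below. This is a hedged sketch: it follows the second formula literally as printed, so each attribute is offset from the cluster center by rand(0,1)·(b_j − a_j), and it enforces the sphere condition by rejecting candidates farther than D_max from the center. The cluster mean is assumed to be the cluster center, and the names `centroid`, `euclid`, `rm_smote` and `max_tries` are hypothetical.

```python
import math
import random


def centroid(cluster):
    """Mean of the samples in a cluster, used here as the cluster center u_i."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rm_smote(cluster, n_new, seed=0, max_tries=10000):
    """Draw candidates attribute by attribute and keep those inside the sphere of radius D_max."""
    rng = random.Random(seed)
    u = centroid(cluster)
    x_max = max(cluster, key=lambda x: euclid(x, u))  # sample farthest from the center
    d_max = euclid(x_max, u)
    a = [uj - abs(xj - uj) for uj, xj in zip(u, x_max)]
    b = [uj + abs(xj - uj) for uj, xj in zip(u, x_max)]
    synthetic = []
    for _ in range(max_tries):
        if len(synthetic) == n_new:
            break
        # Literal reading of x_newj = u_ij + rand(0,1) * (b_j - a_j).
        candidate = [uj + rng.random() * (bj - aj) for uj, aj, bj in zip(u, a, b)]
        if euclid(candidate, u) <= d_max:  # condition ||X_new - u_i|| <= D_max
            synthetic.append(candidate)
    return synthetic


# Hypothetical minority-class cluster.
cluster = [[0.2, 0.3], [0.4, 0.5], [0.3, 0.1], [0.5, 0.4]]
rm_samples = rm_smote(cluster, n_new=5)
print(rm_samples)
```

Rejection sampling is one simple way to satisfy the sphere condition; the text does not say how the embodiment enforces it.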
Through the three algorithm calculations, the data processed by the SMOTE algorithm has better balance, so the SMOTE algorithm is selected for carrying out unbalanced data processing.
S8, randomly dividing the processed data set into a training set and a test set at a ratio of 8:2 using the cross_validation.train_test_split method (a function for splitting data sets) of the sklearn framework (a machine learning library); in this embodiment the processed data set is the data processed by the ReliefF algorithm and SMOTE.
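The 8:2 random split can be sketched with the standard library alone, as a stand-in for the sklearn function named above (in current scikit-learn releases that function is exposed as `sklearn.model_selection.train_test_split`). The function name `split_train_test`, the seed and the sample rows are hypothetical.

```python
import random


def split_train_test(rows, labels, test_ratio=0.2, seed=42):
    """Shuffle indices once, then cut them into a training part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical processed data set: ten samples with two factors each.
rows = [[i / 10, 1 - i / 10] for i in range(10)]
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
x_train, x_test, y_train, y_test = split_train_test(rows, labels)
print(len(x_train), len(x_test))
```

A fixed seed keeps the split reproducible between runs, which makes the later model comparison fair.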
S9, fitting the model to the training data; the training time varies with the number of iterations and the size of the data.
S10, verifying the effect of the model with the test data; this embodiment uses two indicators, accuracy and recall, to represent the prediction effect of the model.
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
Wherein: TP is the number of the positive classes predicted to be the positive classes, FN is the number of the positive classes predicted to be the negative classes, FP is the number of the negative classes predicted to be the positive classes, TN is the number of the negative classes predicted to be the negative classes, and the accuracy can effectively reflect the judgment capability of the algorithm on the samples.
When a data set with unbalanced classification is used (for example, there is a significant difference between the numbers of positive class labels and negative class labels), the accuracy rate cannot reflect the overall situation of the data, and the effect of the accuracy rate index is greatly reduced.
In this embodiment, the random sample recall rate is used to compensate for this shortcoming of accuracy. Recall is the proportion of the samples that are actually in the positive class which are correctly predicted as positive, and the formula is as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Wherein: TP is the number of positive-class samples predicted as positive, and FN is the number of positive-class samples predicted as negative.
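A short Python sketch shows why recall complements accuracy on imbalanced data. Following the verbal definition above (the share of actually positive samples that are predicted as positive), recall is computed here as TP / (TP + FN). The counts are hypothetical: 95 negative and 5 positive samples, with a classifier that predicts every sample as negative.

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):
    """Recall = TP / (TP + FN); defined as 0.0 when there are no positives."""
    total_positive = tp + fn
    return tp / total_positive if total_positive else 0.0

# Hypothetical imbalanced case: all 100 samples are predicted as negative.
tp, fn, fp, tn = 0, 5, 0, 95
acc = accuracy(tp, fn, fp, tn)   # high despite missing every positive sample
rec = recall(tp, fn)             # exposes that no positive sample was found
```

In this illustration the accuracy is 0.95 while the recall is 0.0, which is the situation the recall index is meant to reveal.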
S11, predicting the financial risk with three other classification algorithms, namely a logistic regression algorithm, a support vector machine algorithm, and a voting algorithm, and comparing the results with those of step S10 to demonstrate the advantage of this embodiment in prediction accuracy.
As can be seen from Tables 1 to 4, whether applied to the main board, the small and medium-sized enterprise board, or the growth enterprise (startup) board data, the algorithm of this embodiment predicts better than the logistic regression, support vector machine, and voting algorithms. The random forest also performs well in terms of recall; its recall on the main board data set is not high, which may be related to the low ratio of positive to negative samples in the main board data.
Table 1 Logistic regression results
Figure BDA0002225164470000121
Table 2 Support vector machine results
Figure BDA0002225164470000131
Table 3 Voting algorithm results
Table 4 Random forest results
The present invention is not limited to the preferred embodiments described above. Anyone may derive other products in various forms under the teaching of the present invention; however, any change in shape or structure that yields a technical solution identical or similar to that of the present invention falls within its scope of protection.

Claims (6)

1. A communication construction project financial risk prediction method based on random forests, characterized by comprising the following steps:
acquiring and classifying financial data of an enterprise, preprocessing the financial data and obtaining a plurality of data factors; judging whether multicollinearity exists among the data factors; if so, reducing the dimensionality of the data factors to obtain a data set and applying the SMOTE algorithm to the data set; otherwise, applying the SMOTE algorithm directly to the data factors; and training a model on the processed data and verifying it with a test set.
2. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: preprocessing the financial data and obtaining a plurality of data factors comprises: uniformly constraining the range of each datum to between 0 and 1 by standardizing all the data with a preprocessing function in a machine learning library, so that all the data lie between 0 and 1;
the processed data are divided into solvency, operating capability, profitability, and growth capability; solvency includes the current ratio, quick ratio, and cash ratio factors; operating capability includes the accounts receivable turnover, inventory turnover, and inventory turnover days factors; profitability includes the net profit margin, gross profit margin, and earnings per share factors; and growth capability includes the total asset growth rate, net asset growth rate, and equity growth rate factors.
3. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: judging whether multicollinearity exists among the data factors comprises:
calculating the VIF coefficient of each data factor, $VIF_i = 1 / (1 - R_i^2)$, and considering that collinearity exists between the corresponding data factors when the VIF is larger than 10;
wherein $R_i^2$ is calculated by a ridge regression model; $R_i$ is the multiple correlation coefficient between the i-th variable $X_i$ and the other variables $X_j$ (i = 1, 2, 3, ..., k; i ≠ j), i.e., the arithmetic square root of the coefficient of determination $R_i^2$ (the goodness of fit); and the coefficient of determination $R_i^2$ is obtained by taking $X_i$ as the dependent variable and regressing it on all the other $X_j$ (i = 1, 2, 3, ..., k; i ≠ j).
4. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: the importance of all factors is screened using the ReliefF algorithm:
Figure FDA0002225164460000021
Figure FDA0002225164460000022
wherein x is(i)The samples in the training set are similar samples in the neighborhood of H, and non-similar samples are M. If x(i)Difference coefficient D (P, x) to H on feature P(i),h(j)) Is smaller, and x(i)With the difference coefficient D (P,x(i),m(k)) Larger, indicating that the feature P plays a positive role in classification, the feature weight should be raised; on the contrary, the feature P has a negative effect on classification, the feature weight should be reduced, and this operation is repeated m times to obtain a feature weight vector.
5. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: the specific method for processing data with the SMOTE algorithm comprises: calculating the K-nearest-neighbor same-class set of the minority-class samples, selecting samples from the K-nearest-neighbor same-class set, synthesizing new samples, and classifying the new samples with a classifier.
6. The random forest-based communication construction project financial risk prediction method of claim 1, wherein: before SMOTE algorithm processing, the method also comprises the following steps:
selecting an unbalanced data processing algorithm, and selecting a KM-SMOTE algorithm, a RM-SMOTE algorithm and a SMOTE algorithm to process and compare unbalanced data;
KM-SMOTE algorithm: the core idea of the algorithm is to combine the K-Means method with the SMOTE algorithm to solve the boundary ambiguity problem that may exist in SMOTE; the algorithm mainly comprises three parts: determining the boundary points of the minority class, judging danger points, and correcting the oversampling formula; the correction of the oversampling formula is the most critical step, and two oversampling formulas are available for the correction; firstly, an oversampling formula that performs the oversampling operation directly is introduced, wherein the oversampling operation inserts new data based on a selected sample point; the formula is as follows:
$$X_{new} = u_i + \mathrm{rand}(0,1) \cdot (X - u_i), \quad i = 1, 2, 3, \ldots, k$$
wherein $X_{new}$ is the newly interpolated sample, $u_i$ is the cluster center, X is an original data sample in the cluster centered at $u_i$, rand(0,1) represents a random number between 0 and 1, and k is the number of clusters; all newly interpolated data lie between the cluster center and the data sample point, which avoids the overfitting problem caused by the overly small interpolation space of the traditional SMOTE;
RM-SMOTE algorithm: the minority-class samples of the unbalanced data set are first clustered, and on this basis a spherical region with a determined radius is designed according to the Euclidean distance between the cluster centers and the clustered data samples, within which random interpolation is performed; the algorithm mainly comprises four parts: determining the boundary points of the minority class, judging danger points, determining an A-dimensional spherical space distance, and correcting the oversampling formula; the corrected oversampling interpolation formula is the most important part, and the algorithm requires that the randomly generated synthetic data satisfy the following three formulas:
$$\|X_{new} - u_i\| \le D_{max}$$
wherein $\|X_{new} - u_i\|$ represents the Euclidean distance from the synthetic data $X_{new}$ to the cluster center $u_i$, and $D_{max}$ is the maximum Euclidean distance from the clustered data samples to the cluster center $u_i$;
$$x_{new,j} = u_{ij} + \mathrm{rand}(0,1) \cdot (b_j - a_j), \quad 1 \le j \le E$$
wherein $x_{new,j}$ represents the value of the j-th attribute of the synthetic sample $X_{new}$, rand(0,1) represents a random number in the interval (0,1), and $(b_j - a_j)$ satisfies:
$$a_j = u_{ij} - |x_{max,j} - u_{ij}|, \quad b_j = u_{ij} + |x_{max,j} - u_{ij}|, \quad 1 \le j \le E$$
wherein $|x_{max,j} - u_{ij}|$ represents the absolute difference in the j-th attribute between the data $X_{max}$ that attains the maximum Euclidean distance and the cluster center $u_i$.
CN201910949059.8A 2019-10-08 2019-10-08 Communication construction project financial risk prediction method based on random forest Pending CN110689437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910949059.8A CN110689437A (en) 2019-10-08 2019-10-08 Communication construction project financial risk prediction method based on random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910949059.8A CN110689437A (en) 2019-10-08 2019-10-08 Communication construction project financial risk prediction method based on random forest

Publications (1)

Publication Number Publication Date
CN110689437A true CN110689437A (en) 2020-01-14

Family

ID=69111585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910949059.8A Pending CN110689437A (en) 2019-10-08 2019-10-08 Communication construction project financial risk prediction method based on random forest

Country Status (1)

Country Link
CN (1) CN110689437A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930250A (en) * 2020-02-12 2020-03-27 成都数联铭品科技有限公司 Enterprise credit risk prediction method and system, storage medium and electronic equipment
CN111612640A (en) * 2020-05-27 2020-09-01 上海海事大学 Data-driven vehicle insurance fraud identification method
CN113393488A (en) * 2021-06-08 2021-09-14 南京师范大学 Behavior track sequence multi-feature simulation method based on quantum migration
CN113506160A (en) * 2021-06-17 2021-10-15 山东师范大学 Risk early warning method and system for unbalanced financial text data
CN113948207A (en) * 2021-10-18 2022-01-18 东北大学 Blood glucose data processing method for hypoglycemia early warning

Similar Documents

Publication Publication Date Title
Perols et al. Finding needles in a haystack: Using data analytics to improve fraud prediction
CN110689437A (en) Communication construction project financial risk prediction method based on random forest
CN110852856B (en) Invoice false invoice identification method based on dynamic network representation
Zopounidis et al. Multi-group discrimination using multi-criteria analysis: Illustrations from the field of finance
KR20010103784A (en) Valuation prediction models in situations with missing inputs
CN114048436A (en) Construction method and construction device for forecasting enterprise financial data model
CN111783829A (en) Financial anomaly detection method and device based on multi-label learning
CN111985937A (en) Method, system, storage medium and computer equipment for evaluating value information of transaction traders
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN110930038A (en) Loan demand identification method, loan demand identification device, loan demand identification terminal and loan demand identification storage medium
CN113095927A (en) Method and device for identifying suspicious transactions of anti-money laundering
CN112037006A (en) Credit risk identification method and device for small and micro enterprises
CN115545886A (en) Overdue risk identification method, overdue risk identification device, overdue risk identification equipment and storage medium
Chi et al. Debt rating model based on default identification: Empirical evidence from Chinese small industrial enterprises
CN112862585A (en) Personal loan type bad asset risk rating method based on LightGBM decision tree algorithm
CN116503174A (en) Financial data prediction system based on big data
Zhou et al. A two-stage credit scoring model based on random forest: Evidence from Chinese small firms
CN117114812A (en) Financial product recommendation method and device for enterprises
CN115271442A (en) Modeling method and system for evaluating enterprise growth based on natural language
CN113421014A (en) Target enterprise determination method, device, equipment and storage medium
CN112685563A (en) Cash flow data processing method and device under production and operation view
Shilbayeh et al. Creditworthiness pattern prediction and detection for GCC Islamic banks using machine learning techniques
Li et al. Parametric and non-parametric combination model to enhance overall performance on default prediction
Bakhshi et al. Developing a hybrid approach to credit priority based on accounting variables (using analytical network process (ANP) and multi-criteria decision-making)
Zeng A comparison study on the era of internet finance China construction of credit scoring system model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination