CN114841239A - Method for analyzing financial abnormality of a listed company based on machine learning

Info

Publication number: CN114841239A
Application number: CN202210337280.XA
Authority: CN
Inventors: 喻华丽; 曾海泉; 王美华; 周玉臣; 孙倩南
Original assignee: SHENZHEN STOCK EXCHANGE
Current assignee: SHENZHEN STOCK EXCHANGE
Priority date: 2022-03-31
Filing date: 2022-03-31
Publication date: 2022-08-02

Abstract

The invention discloses a method for analyzing financial abnormality of a listed company based on machine learning, which comprises the following steps: acquiring financial data and non-financial data for training, and preprocessing the financial data and the non-financial data to obtain financial characteristic data and non-financial characteristic data; constructing a derivative index according to the financial characteristic data, and screening important characteristics to obtain input characteristics; training a financial anomaly analysis model based on the non-financial characteristic data and the input characteristics, and performing anomaly grade prediction on unknown samples based on the trained financial anomaly analysis model; obtaining a model result vector corresponding to a historical abnormal sample and an unknown sample predicted to be abnormal, judging the financial abnormal type of the unknown sample through a k-nearest neighbor algorithm based on the model result vector, and performing index analysis. The interpretability of the financial abnormality analysis result of the listed company is enhanced.

Description

Method for analyzing financial abnormality of a listed company based on machine learning

Technical Field

The invention relates to the technical field of financial data processing, in particular to a method for analyzing financial abnormality of a listed company based on machine learning.

Background

Traditional financial anomaly analysis of listed companies generally starts from financial accounting items and business operations: professional accountants, drawing on domain knowledge, compare the various accounting items against those of industry peers to discover risk points of financial anomalies. This traditional approach requires the deep involvement of many professionals and consumes a great deal of time and effort. In recent years, many financial anomaly analysis methods based on machine learning have appeared; that is, big-data analysis is performed on a large number of samples and feature data by means of data mining, machine learning and similar methods in order to discover individual anomalies. Such methods are fast and low in cost, and can predict the degree of financial abnormality of a listed company. However, although machine-learning-based methods can help a regulator gauge the degree of financial abnormality of a listed company to a certain extent, they suffer from poor interpretability: they cannot accurately indicate which specific aspects and indexes of the company are abnormal.

Disclosure of Invention

The main purpose of the invention is to provide a method for analyzing financial abnormality of a listed company based on machine learning, which aims to make the financial abnormalities of listed companies simpler and easier to explain.

In order to achieve the above object, the present invention provides a method for analyzing financial abnormality of a listed company based on machine learning, comprising the steps of:

acquiring financial data and non-financial data for training, and preprocessing the financial data and the non-financial data to obtain financial characteristic data and non-financial characteristic data;

constructing a derivative index according to the financial characteristic data, and screening important characteristics to obtain input characteristics;

training a financial anomaly analysis model based on the non-financial characteristic data and the input characteristics, and performing anomaly grade prediction on unknown samples based on the trained financial anomaly analysis model;

obtaining a model result vector corresponding to a historical abnormal sample and an unknown sample predicted to be abnormal, judging the financial abnormal type of the unknown sample through a k-nearest neighbor algorithm based on the model result vector, and performing index analysis.

Optionally, the financial data includes: financial data of high-risk abnormal listed companies that have been subject to administrative penalties by the China Securities Regulatory Commission, self-regulatory measures, or attention letters; financial data of medium abnormal listed companies that carry a delisting risk warning or other risk warning mark; financial data of abnormal listed companies, other than the high-risk and medium abnormal listed companies, whose information disclosure rating is C or D; and financial data of listed companies without significant abnormality, other than the three preceding groups.

Optionally, the obtaining a model result vector corresponding to the historical abnormal sample and the unknown sample predicted to be abnormal, and based on the model result vector, determining the financial abnormal category of the unknown sample by using a k-nearest neighbor algorithm, and performing index analysis includes:

marking the abnormal category corresponding to the historical abnormal sample;

vectorizing the analysis result of the financial anomaly analysis model corresponding to the historical anomaly sample;

obtaining a model result vector corresponding to the unknown sample characteristic data through the financial anomaly analysis model;

taking the model result vector corresponding to the historical abnormal sample and the model result vector corresponding to the unknown sample as the input of the k-nearest neighbor algorithm, and judging the abnormal category of the unknown sample through the k-nearest neighbor algorithm;

determining normal intervals and abnormal intervals of various indexes in different industries based on historical samples;

and performing index analysis based on the index interpretation library.

Optionally, the step of obtaining financial data and non-financial data for training and preprocessing the financial data and the non-financial data to obtain financial characteristic data and non-financial characteristic data includes:

obtaining the financial data and the non-financial data for training;

determining financial indexes and non-financial indexes;

and cleaning and reconstructing the financial data and the non-financial data according to the financial indexes and the non-financial indexes to obtain the financial characteristic data and the non-financial characteristic data.

Optionally, the step of constructing a derivative index according to the financial characteristic data, performing important characteristic screening, and acquiring an input characteristic includes:

grouping the financial indexes in the financial characteristic data for which the existence of an implicit association relationship is to be determined;

acquiring a regression equation based on the financial characteristic data and the financial index grouping result;

constructing a derivative index based on the regression equation;

and screening important features to obtain the input features.

The embodiment of the invention provides a method for analyzing financial abnormality of a listed company based on machine learning. In the method, implicit association relationships among the financial indexes are mined by linear regression, and derived financial features are constructed. Feature screening and model training are then performed with the LightGBM algorithm to obtain a financial anomaly analysis model, which is used to predict the anomaly grade of unknown samples; at the same time, the model output vectors of the historical abnormal samples and the unknown samples are obtained. Finally, the anomaly category of each unknown sample is judged with the k-nearest neighbor algorithm and index analysis is performed, thereby enhancing the interpretability of the financial anomaly analysis results for listed companies.

Drawings

FIG. 1 is a flowchart illustrating an exemplary method for analyzing financial anomalies of listed companies based on machine learning according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

To address the above defects, the main solution of the embodiments of the method for analyzing financial abnormality of a listed company based on machine learning is as follows:

training a financial anomaly analysis model based on the non-financial characteristic data and the input characteristics, and performing anomaly grade prediction on an unknown sample based on the trained financial anomaly analysis model;

The method mainly constructs financial features around profit quality, asset quality, cash flow quality and similar aspects, based on the three financial statements, and extracts important non-financial data, such as abnormal changes of accounting firm, to construct non-financial features. Implicit association relationships among the financial indexes are then mined by linear regression to obtain regression equations, and derived financial features are constructed from them. Next, feature screening and model training are performed with the LightGBM algorithm to obtain a financial anomaly analysis model, which is used to predict the anomaly grade of unknown samples; at the same time, the model output vectors of the historical abnormal samples and the unknown samples are obtained. Finally, the anomaly category of each unknown sample is judged with the k-nearest neighbor algorithm and index analysis is performed, thereby enhancing the interpretability of the financial anomaly analysis results for listed companies.

Referring to FIG. 1, in an embodiment of the method for analyzing financial abnormality of a listed company based on machine learning according to the present invention, the method includes the following specific steps:

Step S10, acquiring financial data and non-financial data for training, and preprocessing the financial data and the non-financial data to obtain financial characteristic data and non-financial characteristic data;

In this embodiment, the financial data includes: financial data of high-risk abnormal listed companies that have been subject to administrative penalties by the China Securities Regulatory Commission, self-regulatory measures, or attention letters; financial data of medium abnormal listed companies that carry a delisting risk warning or other risk warning mark; financial data of abnormal listed companies, other than the high-risk and medium abnormal listed companies, whose information disclosure rating is C or D; and financial data of listed companies without significant abnormality, other than the three preceding groups. For example, the financial data of high-risk abnormal listed companies may be labeled 1, that of medium abnormal listed companies 2, that of abnormal listed companies 3, and that of listed companies without significant abnormality 0.
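The labeling scheme in the example above can be sketched as a small function. This is only an illustration: the function name, its parameters, and the order in which overlapping conditions are checked (high-risk first, then medium, then abnormal) are assumptions, not details stated in the text.

```python
def anomaly_label(penalized: bool, risk_warning: bool, disclosure_rating: str) -> int:
    """Return the training label for one company-year sample.

    penalized         -- subject to a regulatory penalty, measure, or letter (assumed flag)
    risk_warning      -- carries a delisting or other risk warning mark (assumed flag)
    disclosure_rating -- information disclosure rating, e.g. "A", "B", "C", "D"
    """
    if penalized:                       # high-risk abnormal
        return 1
    if risk_warning:                    # medium abnormal
        return 2
    if disclosure_rating in ("C", "D"):  # abnormal
        return 3
    return 0                            # no significant abnormality
```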

Specifically, the financial data may include an enterprise's balance sheet, income statement, cash flow statement, and financial statement notes, and the non-financial data may include the enterprise's litigation announcements, violation announcements, abnormal changes of accounting firm, rankings of the accounting audit firm (for example, based on the ranking of firms by the Chinese Institute of Certified Public Accountants, or on whether the firm is an accounting firm qualified for securities business), arbitration records, and prosecution records. After the financial data and the non-financial data are acquired, they are preprocessed. For example, the required financial indexes and non-financial indexes may be determined first.

Specifically, to improve the accuracy and reliability of the financial anomaly analysis result, the financial indexes may be constructed around profit quality, asset quality, cash flow quality, and similar aspects. For example, the financial indexes may include the profit growth rate, the ratio of monetary funds to total assets, the cash flow generated by operating activities, and the like. It can be understood that this embodiment does not limit the specific sub-items of the financial indexes; the user may customize them according to actual requirements, and they are not enumerated here. The non-financial indexes may include the number of abnormal changes of accounting firm, the number of lawsuits, and the like. The financial data and the non-financial data are then cleaned and recomputed, and the financial index data and the non-financial index data are merged by company code and reporting year to obtain the complete financial characteristics and non-financial characteristics.
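The merge by company code and reporting year can be sketched as a simple keyed join. The field names `company_code` and `report_year`, and the choice of an inner join (dropping rows without a match), are assumptions made for illustration.

```python
def merge_features(financial, non_financial):
    """Join two lists of row dicts on (company_code, report_year).

    Rows of `financial` without a matching non-financial row are dropped
    (inner join); this choice is an assumption of the sketch.
    """
    index = {(r["company_code"], r["report_year"]): r for r in non_financial}
    merged = []
    for row in financial:
        key = (row["company_code"], row["report_year"])
        if key in index:
            # Combine both feature sets into one record per company-year.
            merged.append({**row, **index[key]})
    return merged
```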

Step S20, constructing a derivative index according to the financial characteristic data, and screening important characteristics to obtain input characteristics;

In this embodiment, a derivative index may be constructed according to the financial characteristic data, with linear regression used to search for implicit association relationships among the financial indexes. First, the financial indexes for which the existence of an implicit association is to be determined are grouped; for example, one group may contain the financial items related to the enterprise's operating activities, such as operating revenue, operating cost, selling expenses, and net cash flow from operating activities.

For each group, all index combinations are traversed (considering the traversal time overhead and the complexity of the regression equation, the regression equation is limited to at most five independent variables). Specifically, one index in the group is first selected as the dependent variable and two to five of the remaining indexes are selected as independent variables; a regression with two to five independent variables is then fitted on the financial index data, and the goodness of fit of the regression equation is calculated (the goodness of fit indicates how well the regression works). A threshold range is set for the goodness of fit R_square, with a lower limit of 0.7 and an upper limit of 0.95, and the regression equations whose goodness of fit falls within this range are retained. Suppose that one regression equation selected in this way is: selling expenses ≈ k1 · monetary funds + k2 · operating cost. This shows that a large number of samples satisfy this relationship. If the financial index data of a certain company does not conform to the relationship, then the larger the deviation, the more likely it is that the company's financial condition is abnormal. The residual between the predicted value of the regression equation (i.e., the value calculated from the right-hand side of the equation) and the actual value of the financial index (i.e., the actual value of the dependent variable) can therefore be used as a newly generated derivative index for model training, to improve the recognition performance of the model.
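The traversal and residual construction described above can be sketched in plain Python as follows. Several details are assumptions of this sketch rather than statements of the source: the regression is fitted without an intercept (mirroring the form of the example equation), the goodness of fit is computed as 1 - SS_res / SS_tot, the residual is taken as predicted minus actual, and the names `solve`, `fit`, and `derived_indices` are hypothetical. The sketch also assumes the selected columns are not collinear and the dependent variable is not constant.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def predict(coefs, columns, n):
    return [sum(k * col[t] for k, col in zip(coefs, columns)) for t in range(n)]

def fit(columns, y):
    """Least-squares fit y ~ sum(k_j * columns[j]) (no intercept); returns (coefs, r_square)."""
    p = len(columns)
    gram = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
            for i in range(p)]
    rhs = [sum(u * v for u, v in zip(columns[i], y)) for i in range(p)]
    coefs = solve(gram, rhs)  # normal equations
    pred = predict(coefs, columns, len(y))
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return coefs, 1.0 - ss_res / ss_tot

def derived_indices(group, lower=0.7, upper=0.95, max_vars=5):
    """group: {index_name: [value per sample]}. Returns {equation_name: residuals}."""
    out = {}
    for target, y in group.items():
        others = [name for name in group if name != target]
        for size in range(2, min(max_vars, len(others)) + 1):
            for combo in combinations(others, size):
                cols = [group[name] for name in combo]
                coefs, r2 = fit(cols, y)
                if lower <= r2 <= upper:  # keep equations inside the threshold range
                    pred = predict(coefs, cols, len(y))
                    out[f"{target}~{'+'.join(combo)}"] = [p - a for p, a in zip(pred, y)]
    return out
```

Each retained equation contributes one residual column, which can then be appended to the feature table as a derivative index.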

Further, after the derivative indexes are obtained, the LightGBM model is selected to screen important indexes, taking into account factors such as time overhead, interpretability, sample size, and model performance. The derived features and the financial characteristic data are used as algorithm input for multiple rounds of training, and the set of features actually used by the model is recorded in each round. Because a decision tree uses the feature with the largest information gain when it is constructed, the features that are used relatively more often are selected as the input features of the model.
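The counting step of this screening can be sketched independently of the model library. The sketch assumes that the set of features used in each training round has already been collected (with LightGBM, one plausible way is to keep the features whose split count from `Booster.feature_importance(importance_type="split")` is nonzero); the function name and the `min_share` cut-off are hypothetical, since the text only says "relatively more often".

```python
from collections import Counter

def screen_features(used_per_run, min_share=0.6):
    """Keep features used in at least `min_share` of the training runs.

    used_per_run -- list of sets, one per training run, each holding the
                    names of the features the trained model actually used.
    """
    counts = Counter(name for used in used_per_run for name in set(used))
    threshold = min_share * len(used_per_run)
    return sorted(name for name, n in counts.items() if n >= threshold)
```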

Step S30, training a financial anomaly analysis model based on the non-financial characteristic data and the input characteristics, and performing anomaly grade prediction on an unknown sample based on the trained financial anomaly analysis model;

In this embodiment, the LightGBM algorithm is used to train a multi-class financial anomaly analysis model on the non-financial characteristic data and the input features, and the best-performing model is retained through grid-search parameter tuning and five-fold cross-validation. The unknown samples are processed in the same way as the training samples to obtain their characteristic data, which is input into the financial anomaly analysis model to predict the anomaly grade of each unknown sample; samples classified as 1, 2 or 3 are regarded as abnormal.
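The tuning loop can be sketched without committing to a particular training library. In the sketch below, `score_fn` is a hypothetical callback that would train a model on the training fold with the given parameters and return a validation score (higher is better); the contiguous fold layout and the function names are likewise assumptions.

```python
from itertools import product

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size

def grid_search(param_grid, n_samples, score_fn, k=5):
    """Return (best_params, best_mean_score) over all parameter combinations."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

def is_abnormal(predicted_grade):
    """Grades 1, 2 and 3 are treated as abnormal; 0 means no significant abnormality."""
    return predicted_grade in (1, 2, 3)
```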

Step S40, obtaining model result vectors corresponding to historical abnormal samples and unknown samples predicted to be abnormal, judging the financial abnormal types of the unknown samples through a k-nearest neighbor algorithm based on the model result vectors, and performing index analysis.

Specifically, to improve the interpretability of the financial anomaly analysis result of a listed company, interpretation is assisted by calculating the financial anomaly similarity between companies and by constructing an index anomaly interpretation library for index analysis, so that the analysis result is more credible and easier for users to understand and analyze.

For the financial anomaly similarity, the similarity between companies is calculated from the decision-tree path vectors output by the financial anomaly analysis model. Each historical abnormal sample is first labeled with an anomaly category: revenue anomaly, cost anomaly, cash flow anomaly, liability anomaly, asset anomaly, or transfer of benefits to related parties. The path of each abnormal sample through the trained financial anomaly analysis model is then converted into a vector, so that each abnormal sample yields a vector of a fixed dimension that serves as the basis for judging the anomaly category of an unknown company.
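One way to turn a sample's path through a tree ensemble into a fixed-length vector is to record, for each tree, the leaf the sample ends in, since a leaf identifies the path taken in that tree. The sketch below uses a toy tree representation (nested dicts with hypothetical keys `feature`, `threshold`, `left`, `right`, `leaf`); it is an assumption about the encoding, not the encoding specified by the source. With LightGBM, a comparable vector can be obtained from `Booster.predict(data, pred_leaf=True)`.

```python
def leaf_index(tree, sample):
    """Walk one decision tree and return the id of the leaf the sample reaches."""
    node = tree
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

def path_vector(trees, sample):
    """One entry per tree, so every sample gets a vector of the same dimension."""
    return [leaf_index(tree, sample) for tree in trees]
```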

Optionally, the path vector of the characteristic data of an unknown sample can likewise be obtained through the financial anomaly analysis model, and the financial anomaly category of the unknown sample is then judged with the k-nearest neighbor algorithm: the K instances nearest to the unknown instance are found among the abnormal samples of known category, and if most of these K instances belong to a certain category, the unknown sample is predicted to belong to that category as well. For the proximity measure, a distance algorithm can be selected to calculate the similarity between the unknown sample vector and each known abnormal sample vector. Taking the Manhattan distance in two-dimensional space as an example, the Manhattan distance between point $i$ with coordinates $(x_1, y_1)$ and point $j$ with coordinates $(x_2, y_2)$ is:

$$d(i,j) = |x_1 - x_2| + |y_1 - y_2|$$

Cosine similarity, Euclidean distance and other measures can also be used. The optimal value of K can be selected by cross-validation on the samples of known anomaly category.
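The category judgment can be sketched as a majority vote among the K nearest labeled vectors under the Manhattan distance. The function names and the default K are hypothetical; ties are resolved in favor of the category seen first among the nearest neighbors, which is a choice of this sketch.

```python
from collections import Counter

def manhattan(u, v):
    """Manhattan distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

def knn_category(unknown, labelled, k=3):
    """Predict the anomaly category of `unknown` by majority vote.

    labelled -- list of (vector, category) pairs for historical abnormal samples.
    """
    nearest = sorted(labelled, key=lambda item: manhattan(unknown, item[0]))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]
```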

For index analysis, the normal and abnormal intervals of each index in each industry are solved on the basis of the input characteristic data, taking the best separation of abnormal and non-abnormal samples as the criterion. Taking industry A and index x as an example: the historical sample data of industry A is first selected and sorted in ascending order of x; the 25%-75% industry quantile range is taken as the initial normal interval, which is then expanded to the left and right in sliding steps of 5%, 10%, or another quantile width. In each round, values inside the window form the normal interval and values outside it form the abnormal interval; whether the actual index value of a sample falls in the abnormal interval is used as the criterion for judging the sample abnormal; the recall and precision on abnormal samples are computed under this criterion, and the F1 value is computed from them. After multiple rounds, the normal and abnormal intervals that give the maximum F1 value are selected.

For example, the first round may take the 25%-75% industry quantile range as the normal interval, and values below the 25% quantile or above the 75% quantile as the abnormal interval. Under this criterion, the following counts are computed: TP, the number of abnormal samples judged abnormal; TN, the number of non-abnormal samples judged non-abnormal; FP, the number of non-abnormal samples judged abnormal; and FN, the number of abnormal samples judged non-abnormal. The precision is calculated as:

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

The recall is calculated as:

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

The F1 value is calculated from the precision and recall:

$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

In the second round, the 20%-80% quantile range may be taken as the normal interval, and values below the 20% industry quantile or above the 80% industry quantile as the abnormal interval, and the F1 value is calculated again. Proceeding in the same way, the normal interval and abnormal interval corresponding to the maximum F1 value are finally found and used as the normal interval and abnormal interval of index x in industry A.
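The interval search can be sketched as follows for one industry and one index. The sketch assumes a symmetric expansion (25%-75%, then 20%-80%, and so on, as in the two rounds described), linear interpolation for quantiles, and integer percentages for the step; the function names are hypothetical.

```python
def quantile(ordered, pct):
    """Linear-interpolation quantile of an ascending list; pct is an integer percentage."""
    pos = pct * (len(ordered) - 1) / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def f1_score(values, abnormal, left, right):
    """F1 of the rule 'a value outside [left, right] marks the sample abnormal'."""
    flagged = [not (left <= v <= right) for v in values]
    tp = sum(1 for f, a in zip(flagged, abnormal) if f and a)
    fp = sum(1 for f, a in zip(flagged, abnormal) if f and not a)
    fn = sum(1 for f, a in zip(flagged, abnormal) if not f and a)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_normal_interval(values, abnormal, start_pct=25, step_pct=5):
    """Widen the normal interval step by step and keep the one with the highest F1."""
    ordered = sorted(values)
    best_f1, best_interval = -1.0, None
    pct = start_pct
    while pct > 0:
        left, right = quantile(ordered, pct), quantile(ordered, 100 - pct)
        score = f1_score(values, abnormal, left, right)
        if score > best_f1:
            best_f1, best_interval = score, (left, right)
        pct -= step_pct
    return best_f1, best_interval
```

Here `values` holds the index values of the industry's historical samples and `abnormal` the matching true/false labels; the returned interval would be stored per industry and per index.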

Index analysis is then performed based on the index interpretation library. On the one hand, an index whose actual value is greater than the right boundary of the normal interval is regarded as high for the industry, an index whose actual value is smaller than the left boundary is regarded as low for the industry, and all other indexes are regarded as normal; the high and low indexes are flagged. For example, if the financial expense ratio is high, the prompt states that the company's debt is abnormal relative to other companies in the industry, and business personnel are advised to pay attention to the borrowing items on the company's balance sheet and to check, for example, whether financing exists under other payables with additional interest being paid, whether the repayment policy has changed, and whether cash discounts have increased. On the other hand, abnormal path matching is performed: the experience of financial experts is codified into abnormal paths, where the rule of a path can be defined in the form "feature XX is high, feature XX is low, feature XX is high". If an enterprise's financial data satisfies an abnormal path, that path serves as a business explanation of the anomaly that may have occurred. For example, the path "administrative expense ratio is low, administrative expense growth rate is low, and non-operating expense ratio is high" can be interpreted as: the company's administrative expense ratio for the current period is low and its administrative expense growth rate is low, while its non-operating expense ratio is high, so the company may be booking unreasonable period expenses as non-operating expenses in order to inflate its operating profit.
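Both parts of the index analysis can be sketched with a small rule matcher. The data layout is an assumption: `intervals` maps each index to its normal interval for the company's industry, and `paths` maps an explanation text to the high/low pattern an expert has defined; the names `judge` and `matched_paths` are hypothetical.

```python
def judge(value, normal_interval):
    """Classify one index value against its normal interval."""
    left, right = normal_interval
    if value > right:
        return "high"
    if value < left:
        return "low"
    return "normal"

def matched_paths(values, intervals, paths):
    """Return the explanations of all abnormal paths the company satisfies.

    values    -- {index_name: actual value}
    intervals -- {index_name: (left, right)} normal interval per index
    paths     -- {explanation: {index_name: "high" or "low"}}
    """
    status = {name: judge(values[name], intervals[name]) for name in values}
    return [text for text, rule in paths.items()
            if all(status.get(name) == wanted for name, wanted in rule.items())]
```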

In the technical scheme disclosed in this embodiment, the financial data and non-financial data used for training are first acquired; financial features are constructed around profit quality, asset quality, cash flow quality and similar aspects, and important non-financial data, such as abnormal changes of accounting firm, is extracted to construct non-financial features. Implicit association relationships among the financial indexes are then mined by linear regression to obtain regression equations, and derived financial features are constructed from them. Next, feature screening and model training are performed with the LightGBM algorithm to obtain a financial anomaly analysis model, which is used to predict the anomaly grade of unknown samples; at the same time, the model output vectors of the historical abnormal samples and the unknown samples are obtained. Finally, the anomaly category of each unknown sample is judged with the k-nearest neighbor algorithm and index analysis is performed, thereby enhancing the interpretability of the financial anomaly analysis results for listed companies.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above and includes several instructions for enabling a terminal device (which may be a computer or a server) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for analyzing financial abnormality of a listed company based on machine learning is characterized by comprising the following steps:

2. The method for analyzing financial abnormality of a listed company based on machine learning according to claim 1, wherein the financial data comprises: financial data of high-risk abnormal listed companies that have been subject to administrative penalties by the China Securities Regulatory Commission, self-regulatory measures, or attention letters; financial data of medium abnormal listed companies that carry a delisting risk warning or other risk warning mark; financial data of abnormal listed companies, other than the high-risk and medium abnormal listed companies, whose information disclosure rating is C or D; and financial data of listed companies without significant abnormality, other than the three preceding groups.

3. The method for analyzing financial abnormality of listed company based on machine learning according to claim 1, wherein said step of obtaining a model result vector corresponding to historical abnormal samples and unknown samples predicted to be abnormal, and judging the financial abnormality category of said unknown samples by k-nearest neighbor algorithm based on said model result vector, and performing index analysis comprises:

marking the abnormal category corresponding to the historical abnormal sample;

and performing index analysis based on the index interpretation library.

4. The machine learning-based method for analyzing financial anomalies of a listed company as claimed in claim 1 wherein said step of obtaining financial data and non-financial data for training and pre-processing said financial data and said non-financial data to obtain financial characteristic data and non-financial characteristic data comprises:

obtaining the financial data and the non-financial data for training;

determining financial indexes and non-financial indexes;

5. The machine learning-based method for analyzing financial anomalies of listed companies according to claim 1, wherein said step of constructing derived metrics from said financial characteristic data and performing significant feature screening to obtain input features comprises:

constructing a derivative index based on the regression equation;

and screening important features to obtain the input features.