CN111783829A - Financial anomaly detection method and device based on multi-label learning

Info

Publication number: CN111783829A
Application number: CN202010474735.3A
Authority: CN
Inventors: 林康
Original assignee: GF Securities Co., Ltd.
Current assignee: GF Securities Co., Ltd.
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2020-10-16

Abstract

The invention discloses a financial anomaly detection method and device based on multi-label learning. The method first generates a feature vector sample for each enterprise from the financial information of existing enterprises, and then balances the samples according to preset sampling parameters to obtain a training sample set. The training sample set is labeled with a plurality of labels obtained from historical accountability data, and a financial anomaly detection model is constructed based on a multi-label learning algorithm. Finally, the financial information of the enterprise to be detected is acquired, a sample input vector is constructed and input into the financial anomaly detection model, and a detection result is obtained. The technical scheme of the invention addresses the low accuracy caused by scarce data samples and the inability to identify the specific means of financial whitewashing, and thereby improves detection accuracy.

Description

Financial anomaly detection method and device based on multi-label learning

Technical Field

The invention relates to the technical field of computers, in particular to a financial anomaly detection method and device based on multi-label learning.

Background

Financial information is important data for evaluating an enterprise. Because the requirements for listing are high, more and more listed companies whitewash the financial information that they must disclose regularly. These activities cause enormous losses to investors in the capital market and hinder the healthy development of the domestic market. Firms that specialize in auditing enterprise financial information exist in China, and the audits are generally carried out manually by audit experts. However, detecting financial whitewashing by analyzing huge volumes of financial information is very time-consuming and labor-intensive, and different audit experts may hold different opinions on the same problem. It is therefore difficult to analyze the financial status of listed companies on a large scale, let alone that of millions of small, medium-sized and micro enterprises.

Financial anomaly detection based on big-data and artificial-intelligence modeling can assist this work. At present, many modeling techniques based on financial indexes exist abroad, which mainly involve the following indexes: 1. Whitewashing opportunity indexes, which measure how easy whitewashing is, such as the shareholding of major shareholders, the number of supervisory board members, and the number of independent directors. 2. Whitewashing motivation indexes. For example, to avoid an ST (special treatment) designation, a company that records a loss in one year may whitewash its financial data in the following year. 3. Whitewashing characteristic indexes, i.e. the operating indexes of the company, which cover a series of indexes such as accounts receivable turnover, inventory turnover, and profit margin.

These modeling techniques are widely applied abroad, and the models include logistic regression, neural networks, Bayesian networks, decision trees and the like. In addition to these, support vector machines, multivariate discriminant analysis, index positive-and-negative probability discrimination and similar methods are used in China. The recognition rate of foreign models is higher than that of domestic models. First, domestic and foreign market information differs, so the institutional backgrounds and the whitewashing risk factors differ. Second, the number of domestic companies known to whitewash their finances is small, with fewer than 50 companies penalized each year, so modeling is affected by sample imbalance and identification accuracy is reduced. Moreover, the penalty decisions of the regulatory authorities describe a number of specific whitewashing problems; that is, a company may whitewash its finances by several means. An audit should therefore attend not only to whether the financial reports are whitewashed, but also to which items are whitewashed and which means are used, so that the motivation and business scenario behind the whitewashing can be inferred. The prior art does not consider these aspects.

Disclosure of Invention

The embodiment of the invention provides a financial anomaly detection method and device based on multi-label learning, which can solve the problems of low accuracy caused by scarce data samples and the inability to detect the means of financial whitewashing, and improve detection accuracy.

The invention provides a financial abnormity detection method based on multi-label learning, which comprises the following steps:

acquiring financial information of a plurality of enterprises, and generating a feature vector sample of each enterprise according to the financial information of each enterprise; the characteristic vector samples are divided into positive samples and negative samples according to whether the enterprises have financial violations or not;

balancing the quantity of positive samples and negative samples in all the feature vector samples according to preset sampling parameters to obtain a training sample set;

labeling the training sample set according to a plurality of labels obtained from historical accountability data, and constructing a financial anomaly detection model based on a multi-label learning algorithm according to the training sample set after labeling;

acquiring financial information of an enterprise to be detected, and constructing a sample input vector according to the financial information of the enterprise to be detected;

and inputting the sample input vector into the financial abnormity detection model to obtain a multi-label result vector of the enterprise to be detected, and obtaining a detection result of the enterprise to be detected according to the multi-label result vector.

Further, the obtaining financial information of a plurality of enterprises, and generating a feature vector sample of each enterprise according to the financial information of each enterprise specifically includes:

acquiring financial information of a plurality of enterprises in a time region; the financial information includes: an asset liability statement, a profit statement and a cash flow statement;

generating a financial index table, a derivative financial index table and a derivative financial index attached table according to the asset liability table, the profit table and the cash flow table;

and extracting various indexes from the asset liability statement, the profit statement, the cash flow statement, the financial index statement, the derived financial index statement and the derived financial index attached statement, and generating a characteristic vector sample of each enterprise.

Further, according to preset sampling parameters, the number of positive samples and the number of negative samples in all feature vector samples are balanced to obtain a training sample set, which specifically comprises:

increasing the number of the positive samples through random oversampling, reducing the number of the negative samples through random undersampling, and balancing data of the positive samples and the negative samples according to a sampling parameter r to obtain a training sample set;

wherein the sampling parameter r is defined in terms of N_neg and N_pos; N_neg is the number of negative samples and N_pos is the number of positive samples.

Further, the plurality of tags obtained from the historical accountability data specifically include:

the historical accountability data comprises: penalty decisions and disclosure materials published by regulatory bodies, proprietary data owned by financial institutions, and legal data owned by third-party data companies;

the plurality of labels are respectively: on-balance-sheet recognition, off-balance-sheet disclosure, revenue, expenses, assets, liabilities, cash flows, concealed guarantees, concealed related-party financing, concealed related-party transactions, concealed major litigation, revision of performance disclosures, administrative penalties, and internal control issues.

Further, the method includes the steps of constructing a financial anomaly detection model based on a multi-label learning algorithm according to the training sample set labeled with the labels, and specifically includes the following steps:

and configuring a classifier for the label of each category according to a one-vs-all strategy algorithm, and constructing the financial anomaly detection model by combining the training sample set labeled by the label.

Correspondingly, the invention also provides a financial abnormity detection device based on multi-label learning, which comprises: the system comprises a training sample acquisition module, a sample balance module, a label marking module, a model construction module, a to-be-detected sample acquisition module and a detection module;

the training sample acquisition module is used for acquiring financial information of a plurality of enterprises and generating a characteristic vector sample of each enterprise according to the financial information of each enterprise; the characteristic vector samples are divided into positive samples and negative samples according to whether the enterprises have financial violations or not;

the sample balancing module is used for balancing the number of positive samples and negative samples in all the feature vector samples according to preset sampling parameters to obtain a training sample set;

the label labeling module is used for labeling the training sample set according to a plurality of labels obtained from historical accountability data;

the model construction module is used for constructing a financial abnormity detection model based on a multi-label learning algorithm according to the training sample set labeled with the labels;

the to-be-detected sample acquisition module is used for acquiring financial information of an enterprise to be detected and constructing a sample input vector according to the financial information of the enterprise to be detected;

the detection module is used for inputting the sample input vector into the financial abnormity detection model, obtaining a multi-label result vector of the enterprise to be detected, and obtaining a detection result of the enterprise to be detected according to the multi-label result vector.

Further, the training sample acquisition module comprises a first acquisition unit, a first generation unit and a second generation unit;

the first acquisition unit is used for acquiring financial information of a plurality of enterprises in a time region; the financial information includes: an asset liability statement, a profit statement and a cash flow statement;

the first generation unit is used for generating a financial index table, a derived financial index table and a derived financial index attached table according to the asset liability table, the profit table and the cash flow table;

the second generating unit is used for extracting various indexes from the asset liability statement, the profit statement, the cash flow statement, the financial index statement, the derived financial index statement and the derived financial index attached statement and generating a characteristic vector sample of each enterprise.

Further, the sample balancing module is configured to balance the number of positive samples and the number of negative samples in all feature vector samples according to a preset sampling parameter, so as to obtain a training sample set, specifically:

the sample balancing module increases the number of the positive samples through random oversampling, decreases the number of the negative samples through random undersampling, and balances the data of the positive samples and the negative samples according to a sampling parameter r to obtain a training sample set;

wherein the sampling parameter

Further, the model construction module is used for constructing a financial anomaly detection model based on a multi-label learning algorithm according to the training sample set labeled with the labels, and specifically comprises:

and the model construction module configures a classifier for each class of label according to a one-vs-all strategy algorithm, and constructs the financial anomaly detection model by combining the training sample set labeled by the label.

In view of the above, in the financial anomaly detection method and device based on multi-label learning provided by the invention, a feature vector sample of each enterprise is first generated from the financial information of existing enterprises, and the samples are then balanced according to preset sampling parameters to obtain a training sample set; the training sample set is labeled with a plurality of labels obtained from historical accountability data, and a financial anomaly detection model is constructed based on a multi-label learning algorithm; finally, the financial information of the enterprise to be detected is acquired, a sample input vector is constructed and input into the financial anomaly detection model, and a detection result is obtained. Compared with the prior art, in which accuracy is low because samples are scarce or the whitewashing motivation cannot be obtained, the technical scheme of the invention balances the number of samples and so improves model accuracy; it also obtains multiple categories of labels from historical accountability data, infers the whitewashing means with a multi-label learning method, and judges the final financial whitewashing event from the labels jointly, thereby further improving detection accuracy. In addition, the invention can assist audit experts in analyzing and judging the financial information disclosed by companies, thereby reducing labor cost and judgment errors and improving the efficiency of financial audits.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the financial anomaly detection method based on multi-label learning according to the present invention;

FIG. 2 is a schematic flow chart of another embodiment of the financial anomaly detection method based on multi-label learning according to the present invention;

FIG. 3 is a schematic structural diagram of an embodiment of the financial anomaly detection device based on multi-label learning according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a schematic flowchart of an embodiment of a financial anomaly detection method based on multi-label learning according to the present invention is shown. The method shown in fig. 1 includes steps 101 to 105, and each step is as follows:

step 101: acquiring financial information of a plurality of enterprises, and generating a feature vector sample of each enterprise according to the financial information of each enterprise; the feature vector samples are divided into positive samples and negative samples according to whether the enterprises have financial violations or not.

In this embodiment, step 101 specifically includes: acquiring financial information of a plurality of enterprises in a time region; the financial information includes: an asset liability statement, a profit statement and a cash flow statement; generating a financial index table, a derivative financial index table and a derivative financial index attached table according to the asset liability table, the profit table and the cash flow table; and extracting various indexes from the asset liability statement, the profit statement, the cash flow statement, the financial index statement, the derived financial index statement and the derived financial index attached statement, and generating a characteristic vector sample of each enterprise.

In this embodiment, the financial analysis relies primarily on three statements: the balance sheet, the income statement, and the cash flow statement. The balance sheet reflects the company's assets, liabilities and equity at a given moment. The income statement reflects the company's income, expenses, profit and other conditions within a given period, and reveals its operating results. The cash flow statement reflects changes in the company's cash and reveals the direction of cash flows in its operating, investing and financing activities. The index table and the derived tables generated by the invention also depend mainly on these three statements, and the derived tables can include but are not limited to: a financial index table, a derived financial index table, and a derived financial index attached table. A small part of the index table is taken directly from the original data of the three statements, while most core indexes are derived from them. For example, in addition to the financial index analysis systems commonly used in the industry (solvency, profitability, etc.), self-developed analysis indexes are adopted as necessary supplements, such as the index of the capital investment proportion of the equity of the stock and the equity of the capitalization.

In this embodiment, data may be collected by purchasing it from a data supplier, and the data is typically presented to the user as tables stored in a database. The user constructs the required tables with database extraction technologies such as Hive and SQL; the main fields comprise the date, the enterprise code and the various indexes, forming the financial index table, the derived financial index table and the derived financial index attached table. Various indexes are extracted from the enterprise data to generate a feature vector sample for each enterprise. A sample consists of the feature vector of one enterprise over one time interval, which may be a quarter, half a year, or a full year.

In this embodiment, the feature vector samples are divided into positive and negative samples before training. If the enterprise has no financial violation within the time interval, the sample is negative; otherwise, the sample is positive.

Step 102: balancing the quantity of positive samples and negative samples in all the feature vector samples according to preset sampling parameters to obtain a training sample set.

In this embodiment, sample balancing is achieved by oversampling and undersampling. The positive and negative samples output in step 101 are not balanced: the negative samples outnumber the positive samples by many orders of magnitude. To reduce the risk of model failure due to sample imbalance, a sampling algorithm from a mature Python algorithm library may be employed as the sampling tool. The small number of positive samples is oversampled, and the large number of negative samples is undersampled, so that only a part of the negative samples participates in subsequent model learning. In this way, a training sample set in which positive and negative samples are of the same order of magnitude is obtained.

In this embodiment, step 102 specifically includes: increasing the number of the positive samples through random oversampling, reducing the number of the negative samples through random undersampling, and balancing data of the positive samples and the negative samples according to a sampling parameter r to obtain a training sample set;

wherein the sampling parameter

In order to avoid over-fitting and under-fitting of features, the degree of oversampling and undersampling is controlled through the sampling parameter r, and the training sample set is obtained through the sampling algorithm.
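Combined random oversampling and undersampling can be sketched as follows. The definition of r is not reproduced in this text, so the sketch substitutes its own assumption: r is treated as a hypothetical knob in [0, 1] that places the target class size between N_pos (r = 0, pure undersampling of negatives) and N_neg (r = 1, pure oversampling of positives). The function name `balance_samples` and the `seed` argument are likewise illustrative.

```python
import random

def balance_samples(positives, negatives, r, seed=0):
    """Randomly oversample positives and undersample negatives to a common size.

    ASSUMPTION: r in [0, 1] interpolates the target size between the number
    of positives and the number of negatives. This is an illustrative
    interpretation, not the formula of the invention.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    n_pos, n_neg = len(positives), len(negatives)
    if n_pos == 0 or n_neg < n_pos:
        raise ValueError("expects at least one positive and n_neg >= n_pos")

    rng = random.Random(seed)  # seeded for reproducibility
    target = round(n_pos + r * (n_neg - n_pos))

    # Random oversampling: draw extra positives with replacement.
    extra = [rng.choice(positives) for _ in range(target - n_pos)]
    balanced_pos = list(positives) + extra

    # Random undersampling: keep a random subset of negatives, no replacement.
    balanced_neg = rng.sample(list(negatives), target)
    return balanced_pos, balanced_neg

pos = [("pos", i) for i in range(5)]
neg = [("neg", i) for i in range(100)]
bal_pos, bal_neg = balance_samples(pos, neg, r=0.2)
```

With 5 positives, 100 negatives and r = 0.2, both classes end up with 24 samples. A small r discards more negatives, while a large r duplicates more positives, which is one way to trade off the two risks mentioned above.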

Step 103: labeling the training sample set according to a plurality of labels obtained from historical accountability data, and constructing a financial anomaly detection model based on a multi-label learning algorithm according to the training sample set after labeling.

In this embodiment, the historical accountability data includes: penalty decisions and disclosure materials published by regulatory bodies, proprietary data owned by financial institutions, and legal data owned by third-party data companies. The plurality of labels are respectively: on-balance-sheet recognition, off-balance-sheet disclosure, revenue, expenses, assets, liabilities, cash flows, concealed guarantees, concealed related-party financing, concealed related-party transactions, concealed major litigation, revision of performance disclosures, administrative penalties, and internal control issues.

In this embodiment, the historical accountability data typically details which financial violations the sanctioned enterprise committed at a particular time.

In the method, labels of multiple categories are obtained by refining the categories of financial problems according to the historical accountability data. Data that cannot be labeled is eliminated; such data generally accounts for less than 5% of all violation samples, so the modeling work is not affected. In addition, an enterprise whose fraud lasted continuously for 2 years or more is analyzed according to the specific conditions of each year; that is, it is divided by year into several samples that are analyzed independently, and the stock code together with the fraud year identifies each sample.
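The per-year splitting can be sketched in a few lines of Python. The function name `split_by_year`, the stock code `"000001"` and the label strings are hypothetical placeholders; the only behavior taken from the description is that each (stock code, year) pair identifies one sample and that records without any usable label are dropped.

```python
def split_by_year(stock_code, violations_by_year):
    """Split a multi-year violation record into one sample per year.

    violations_by_year maps a year to the set of labels found for that year.
    Years with no label are dropped, mirroring the removal of data that
    cannot be labeled.
    """
    return {
        (stock_code, year): sorted(labels)
        for year, labels in sorted(violations_by_year.items())
        if labels
    }

# Hypothetical enterprise with violations in 2016 and 2017, nothing usable in 2018.
samples = split_by_year(
    "000001",
    {2016: {"revenue"}, 2017: {"revenue", "assets"}, 2018: set()},
)
```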

In this embodiment, labeling means analyzing and tagging the data in a sample. For example, if an item of data in a sample falls under the "revenue" label, the sample is marked 1 for that label; otherwise it is marked 0. Each sample may thus correspond to several different labels, and one sample corresponds to a multi-dimensional vector such as [0,1,1,0, ……,0]. The problem thereby becomes a typical multi-label learning problem, i.e. predicting attributes of a data point that are not mutually exclusive.
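The 0/1 encoding of a sample's labels can be sketched as follows. The identifiers in `LABELS` are shorthand chosen for this sketch to stand for the fourteen labels listed above, and their order is an assumption; only the idea of one binary component per label comes from the description.

```python
# Shorthand identifiers for the fourteen labels; names and order are illustrative.
LABELS = [
    "on_sheet_recognition", "off_sheet_disclosure", "revenue", "expenses",
    "assets", "liabilities", "cash_flows", "concealed_guarantees",
    "concealed_related_financing", "concealed_related_transactions",
    "concealed_major_litigation", "performance_disclosure_revision",
    "administrative_penalties", "internal_control_issues",
]

def encode_labels(violations):
    """Return a 14-dimensional 0/1 vector with a 1 for every label present."""
    unknown = set(violations) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in violations else 0 for name in LABELS]

# A sample with both a revenue problem and an expenses problem.
label_vector = encode_labels({"revenue", "expenses"})
```

Because several components can be 1 at once, the targets are not mutually exclusive, which is what distinguishes this multi-label setting from ordinary multi-class classification.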

In this embodiment, a one-vs-all strategy is adopted: a classifier is configured for each category of label, and the financial anomaly detection model is constructed in combination with the labeled training sample set. In this strategy, assuming there are n classes, n binary classifiers are built, each of which separates one class from the remaining classes. During prediction, the n binary classifiers are applied to obtain the probability that the data belongs to each class, and the class with the highest probability is selected as the final prediction result. The one-vs-all strategy adopted by the invention is computationally efficient (only n classifiers are needed) and interpretable: since each label is represented by exactly one classifier, knowledge about a class can be obtained by examining its corresponding classifier.

The invention can implement the above model construction using, but is not limited to, OneVsRestClassifier from the Python library scikit-learn. The financial anomaly detection model can output various performance indexes such as precision and recall.
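A dependency-free sketch may make the one-vs-all construction concrete. It trains one binary classifier per label and, for multi-label output, thresholds each classifier independently. Several points are assumptions rather than part of the description: the base classifier (a small logistic regression trained by stochastic gradient descent), the 0.5 threshold, the hyperparameters, and the toy data. A library class such as OneVsRestClassifier would normally replace the hand-written training loop.

```python
import math

def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(X, y, epochs=200, lr=0.5):
    """Fit one logistic-regression classifier with stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of the log loss with respect to the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict_proba(model, x):
    """Probability that x carries the label this binary model was trained on."""
    w, b = model
    return _sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_one_vs_rest(X, Y):
    """Train one binary classifier per label column of Y (one-vs-all)."""
    n_labels = len(Y[0])
    return [train_binary(X, [row[k] for row in Y]) for k in range(n_labels)]

def predict_labels(models, x, threshold=0.5):
    """Multi-label result vector: each label is decided by its own classifier."""
    return [1 if predict_proba(m, x) >= threshold else 0 for m in models]

# Toy data (hypothetical): label 0 follows feature 0, label 1 follows feature 1.
X_toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
ovr_models = train_one_vs_rest(X_toy, Y_toy)
```

Since each label has exactly one model, inspecting the weights of `ovr_models[k]` shows which features drive label k, which is the interpretability property noted above.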

As an example of this embodiment, a one-vs-one strategy may also be used when constructing the financial anomaly detection model. Assuming there are n classes, one binary classifier is established for every pair of classes, giving k = n × (n - 1)/2 classifiers. When new data is classified, the k classifiers are applied in turn; each classification counts as one vote for the class it outputs. After all k classifiers have been applied, the class with the most votes is selected as the final classification result.
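The pairwise voting of the one-vs-one strategy can be sketched as follows. The pairwise classifier used here, which picks whichever of the two class centroids is nearer, is a stand-in chosen only to keep the example short; the class names and points are hypothetical.

```python
from collections import Counter
from itertools import combinations

def _centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]

def _sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_one_vs_one(samples_by_class):
    """Build one pairwise classifier for each of the n * (n - 1) / 2 class pairs.

    ASSUMPTION: each pairwise classifier is a nearest-centroid rule, stored
    as (class_a, class_b, centroid_a, centroid_b).
    """
    centroids = {c: _centroid(pts) for c, pts in samples_by_class.items()}
    return [(a, b, centroids[a], centroids[b])
            for a, b in combinations(sorted(centroids), 2)]

def predict_by_vote(classifiers, x):
    """Each pairwise classifier casts one vote; the class with most votes wins."""
    votes = Counter()
    for a, b, ca, cb in classifiers:
        votes[a if _sq_dist(x, ca) <= _sq_dist(x, cb) else b] += 1
    return votes.most_common(1)[0][0]

toy_classes = {
    "A": [[0.0, 0.0], [0.0, 1.0]],
    "B": [[5.0, 5.0], [6.0, 5.0]],
    "C": [[10.0, 0.0], [10.0, 1.0]],
    "D": [[0.0, 10.0], [1.0, 10.0]],
}
ovo_classifiers = train_one_vs_one(toy_classes)
```

With n = 4 classes the sketch builds k = 6 pairwise classifiers, compared with 4 for one-vs-all, which illustrates why the pairwise strategy becomes more expensive as the number of classes grows.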

Step 104: acquiring financial information of the enterprise to be detected, and constructing a sample input vector according to the financial information of the enterprise to be detected.

In this embodiment, in order to make an overall determination of whether the company has whitewashed its finances, the financial information of the enterprise to be detected is first acquired, and the sample generation method described in step 101 is then used to obtain the features of the enterprise in a certain time interval, forming a sample input vector.

Step 105: inputting the sample input vector into the financial anomaly detection model, obtaining a multi-label result vector of the enterprise to be detected, and obtaining a detection result of the enterprise to be detected according to the multi-label result vector.

In this embodiment, the financial anomaly detection model outputs a multi-label result vector, and the detection result is obtained by overall inference on this vector. The more labels the multi-label result vector contains, the more problems the sample has and the more extensive the financial whitewashing is, and the more attention the enterprise deserves.

In this embodiment, audit experts are concerned with the specific types of financial whitewashing, but they also want an overall assessment. Assume that the prediction result is L, a 14-dimensional vector in which each component is a binary value of 0 or 1. The invention sums all components of this result vector and takes the sum as the overall evaluation result, which is an integer S in the interval 0 to 14. An attention threshold can generally be set; for example, an enterprise with S of 2 or more needs attention and investigation.
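The overall inference reduces to a sum and a comparison, as the following minimal sketch shows. The function names are illustrative, and the default threshold of 2 simply reuses the example value given above.

```python
def overall_score(result_vector):
    """Sum the 0/1 components of a multi-label result vector to obtain S."""
    if any(component not in (0, 1) for component in result_vector):
        raise ValueError("every component must be 0 or 1")
    return sum(result_vector)

def needs_attention(result_vector, threshold=2):
    """Flag the enterprise when S reaches the attention threshold."""
    return overall_score(result_vector) >= threshold

# Hypothetical prediction with three labels raised out of fourteen.
prediction = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
score = overall_score(prediction)
flagged = needs_attention(prediction)
```

Here S = 3, so the enterprise is flagged, and the positions of the nonzero components indicate which whitewashing types the auditor should examine first.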

In addition, the multi-label result vector obtained by the method can be used as auxiliary data to assist an audit expert in analyzing and judging financial information disclosed by a company, so that the labor cost and the judgment error are reduced, and the efficiency of financial audit is improved. As described above, through the screening of the invention, the number of samples which need to be concerned by an auditing specialist is reduced, and the auditing efficiency is improved.

For better explaining the principle and the flow of the present invention, referring to fig. 2, fig. 2 is a schematic flow chart of another embodiment of the financial anomaly detection method based on multi-tag learning provided by the present invention. Fig. 2 describes the flows of feature selection, sample balancing, label annotation, multi-label learning model and overall inference, and the specific principles of each flow may refer to the above related description without limitation.

Correspondingly, the invention further provides a financial abnormity detection device based on multi-label learning. Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of the financial anomaly detection apparatus based on multi-tag learning according to the present invention. As shown in fig. 3, the apparatus includes: a training sample acquisition module 301, a sample balancing module 302, a label labeling module 303, a model building module 304, a to-be-detected sample acquisition module 305 and a detection module 306.

The training sample acquisition module 301 is configured to acquire financial information of a plurality of enterprises, and generate a feature vector sample of each enterprise according to the financial information of each enterprise; the feature vector samples are divided into positive samples and negative samples according to whether the enterprises have financial violations or not.

The sample balancing module 302 is configured to balance the number of positive samples and the number of negative samples in all feature vector samples according to a preset sampling parameter, so as to obtain a training sample set.

The label labeling module 303 is configured to label the training sample set according to a plurality of labels obtained from the historical accountability data.

The model construction module 304 is configured to construct a financial anomaly detection model based on a multi-label learning algorithm according to the training sample set labeled with the labels.

The to-be-detected sample acquisition module 305 is configured to acquire financial information of the to-be-detected enterprise, and construct a sample input vector according to the financial information of the to-be-detected enterprise.

The detection module 306 is configured to input the sample input vector into the financial anomaly detection model, obtain a multi-tag result vector of the enterprise to be detected, and obtain a detection result of the enterprise to be detected according to the multi-tag result vector.

In the present embodiment, the training sample acquisition module 301 includes a first acquisition unit, a first generation unit, and a second generation unit.

The first acquisition unit is used for acquiring financial information of a plurality of enterprises in a time region; the financial information includes: an asset liability statement, a profit statement and a cash flow statement.

The first generating unit is used for generating a financial index table, a derived financial index table and a derived financial index attached table according to the asset liability table, the profit table and the cash flow table.

And the second generating unit is used for extracting various indexes from the asset liability statement, the profit statement, the cash flow statement, the financial index statement, the derived financial index statement and the derived financial index attached statement and generating a characteristic vector sample of each enterprise.

In this embodiment, the sample balancing module 302 is configured to balance the number of positive samples and the number of negative samples in all feature vector samples according to a preset sampling parameter, so as to obtain a training sample set, specifically:

wherein the sampling parameter

In this embodiment, the model construction module 304 is configured to construct a financial anomaly detection model based on a multi-label learning algorithm according to a training sample set labeled with a label, specifically:

the model construction module 304 configures a classifier for each class of label according to a one-vs-all policy algorithm, and constructs the financial anomaly detection model by combining the training sample set labeled by the label.

Furthermore, combining the financial anomaly detection method of the invention with the business inference of financial audit experts can address many applications related to company financial reporting, such as analysis of company operating conditions and stock price control, reduce the cost, time and resources of training audit experts, and broaden the application range of the invention.

It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A financial anomaly detection method based on multi-label learning is characterized by comprising the following steps:

2. The financial anomaly detection method based on multi-label learning according to claim 1, wherein the financial information of a plurality of enterprises is obtained, and a feature vector sample of each enterprise is generated according to the financial information of each enterprise, specifically:

3. The financial anomaly detection method based on multi-label learning according to claim 1, wherein the number of positive samples and negative samples in all feature vector samples is balanced according to preset sampling parameters to obtain a training sample set, specifically:

wherein the sampling parameter

4. The financial anomaly detection method based on multi-label learning according to claim 1, wherein the plurality of labels obtained from historical accountability data are specifically:

5. The financial anomaly detection method based on multi-label learning according to claim 4, wherein a financial anomaly detection model is constructed based on a multi-label learning algorithm according to the training sample set labeled with labels, specifically:

6. A financial anomaly detection device based on multi-label learning, the financial anomaly detection device comprising: a training sample acquisition module, a sample balancing module, a label marking module, a model construction module, a to-be-detected sample acquisition module and a detection module;

7. The multi-label learning-based financial anomaly detection device according to claim 6, wherein said training sample acquisition module comprises a first acquisition unit, a first generation unit and a second generation unit;

8. The financial anomaly detection device based on multi-label learning according to claim 6, wherein the sample balancing module is configured to balance the number of positive samples and negative samples in all feature vector samples according to preset sampling parameters to obtain a training sample set, specifically:

wherein the sampling parameter

9. The financial anomaly detection device based on multi-label learning according to claim 6, wherein the plurality of labels obtained from historical accountability data are specifically:

10. The financial anomaly detection device based on multi-label learning according to claim 9, wherein the model construction module is configured to construct a financial anomaly detection model based on a multi-label learning algorithm according to the training sample set labeled with labels, specifically: