CN117151867B

CN117151867B - Enterprise exception identification method and system based on big data

Info

Publication number: CN117151867B
Application number: CN202311220859.9A
Authority: CN
Inventors: 卢煜晟; 赵鹏; 陈正国; 吴建君; 邓李鑫; 蒋新国
Original assignee: Jiangsu Shucheng Information Technology Co ltd
Current assignee: Jiangsu Shucheng Information Technology Co ltd
Priority date: 2023-09-20
Filing date: 2023-09-20
Publication date: 2024-04-30
Anticipated expiration: 2043-09-20
Also published as: CN117151867A

Abstract

The invention provides an enterprise anomaly identification method and system based on big data, which address the problems of conventional enterprise risk identification, such as a required ratio of positive to negative samples that is difficult to satisfy and serious prediction lag. The method mainly comprises the following steps: S1, classifying and labeling enterprises with a classification model according to enterprise public credit data, outputting the labeling results, and, according to the number and proportion of labels, inputting the labeled enterprise data into the self-encoder model of the corresponding label category for training; S2, inputting unlabeled enterprise public credit data into each self-encoder model trained in S1 to calculate the similarity between the unlabeled enterprises and the differently labeled enterprises, obtaining the abnormal probability P_o of each unlabeled enterprise from the similarity, and outputting an alarm for enterprises whose P_o is smaller than a certain threshold; S3, analyzing the public credit data of historical enterprise changes, extracting data set features, selecting prediction models from a model pool according to those features, integrating the top N best-performing models, outputting future state data of the enterprise, calling the trained self-encoder models of S2 to calculate the corresponding similarity and abnormal probability, and outputting an alarm.

Description

Enterprise exception identification method and system based on big data

Technical Field

The invention relates to the technical field of data identification, in particular to an enterprise anomaly identification method and system based on big data.

Background

Currently, abnormal enterprise identification generally relies on a big data scoring model. Such a model identifies anomalies from the current state of an enterprise: it can only judge whether the enterprise is abnormal at present and cannot predict the probability that the enterprise becomes abnormal in the future. For financial risk control and government supervision of enterprises, raising an alarm only once the enterprise is already in an abnormal state introduces a lag and cannot meet the need for early warning and early intervention. In fields such as enterprise loan management in smart finance and enterprise supervision in smart government affairs, identifying abnormal enterprises in time is very important for avoiding risk and reducing losses.

Abnormal enterprise identification is a component of enterprise evaluation, and the current mainstream method is as follows: first, a scoring model is constructed based on big data; then, a score is calculated for each enterprise with the scoring model; finally, enterprises with lower scores are identified as abnormal enterprises. The method makes effective use of the historical public credit data of enterprises and is comparatively systematic and objective.

However, this method has the following problems:

(1) It only judges whether the current state of the enterprise is abnormal and cannot judge the probability that the enterprise becomes abnormal in the future;

(2) Constructing the classification model imposes a minimum requirement on the ratio of positive to negative samples. In practice, this requirement is not always satisfied, in which case the classification model fails.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an enterprise anomaly identification method and system based on big data, which can meet the early-warning and early-intervention needs of financial risk control and government supervision of enterprises, and thereby help prevent and mitigate losses that may be caused by improper enterprise operation.

In order to solve the technical problems, the invention adopts the following technical scheme: an enterprise anomaly identification method based on big data comprises the following steps:

S1, classifying and labeling enterprises with a classification model according to enterprise public credit data, outputting the labeling results, and, according to the number and proportion of labels, inputting the labeled enterprise data into the self-encoder model of the corresponding label category for training;

S2, inputting unlabeled enterprise public credit data into each self-encoder model trained in S1 to calculate the similarity between the unlabeled enterprises and the differently labeled enterprises, obtaining the abnormal probability P_o of each unlabeled enterprise from the similarity, and outputting an alarm for enterprises whose P_o is smaller than a certain threshold;

S3, analyzing the public credit data of historical enterprise changes, extracting data set features, selecting prediction models from a model pool according to those features, integrating the top N best-performing models, outputting future state data of the enterprise, calling the trained self-encoder models of S2 to calculate the corresponding similarity and abnormal probability, and outputting an alarm.

Further, the public credit data in step S1 comprises historical data and current data, including industrial and commercial data, annual report data, blacklist data, administrative license data, administrative penalty data, equity pledge data, litigation data, and intellectual property data.

Further, the similarity calculation in step S2 is as follows.

Given the enterprise feature sets, a trained high-quality enterprise self-encoder $\Delta_G$ and a trained risk enterprise self-encoder $\Delta_B$:

Input $f_1$ into $\Delta_G$ to obtain $r_g$, for each $f_1 \in G$; input $f_2$ into $\Delta_B$ to obtain $r_b$, for each $f_2 \in B$; the representation of each unlabeled enterprise $f_3 \in O$ is denoted $r_o$. Initialize $P = \varnothing$; then, for each $o \in O$, with $g \in G$ and $b \in B$, compute the quantities below. Here $f_1$ represents a high-quality enterprise, $G$ the set of high-quality enterprises, $\Delta_G$ the trained high-quality enterprise self-encoder model, $f_2$ a risk enterprise, $B$ the set of risk enterprises, $\Delta_B$ the trained risk enterprise self-encoder model, $O$ the set of unlabeled enterprises, $g$ the features corresponding to a high-quality enterprise, and $b$ the features corresponding to a risk enterprise;

Feature similarity with high-quality enterprises: $S_g = \frac{1}{N} \sum_{g} \mathrm{Euclid}(r_o, r_g)$; feature similarity with risk enterprises: $S_b = \frac{1}{M} \sum_{b} \mathrm{Euclid}(r_o, r_b)$;

where $N$ and $M$ are the numbers of elements in the corresponding sets, $\mathrm{Euclid}(r_o, r_b)$ denotes the Euclidean distance between $r_o$ and $r_b$, and the anomaly probability of enterprise $o$ is $P_o = S_b / S_g$;

Add $P_o$ to the set $P$, and output the abnormal probability set $P$ over all enterprises.
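The calculation above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes the representations are already available as plain lists of floats, that S_g averages distances to the high-quality representations and S_b averages distances to the risk representations, and the names `euclid`, `anomaly_probability` and the toy vectors are hypothetical.

```python
import math

def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def anomaly_probability(r_o, good_reprs, risk_reprs):
    """Return P_o = S_b / S_g for one unlabeled enterprise representation r_o."""
    s_g = sum(euclid(r_o, r_g) for r_g in good_reprs) / len(good_reprs)  # N = len(good_reprs)
    s_b = sum(euclid(r_o, r_b) for r_b in risk_reprs) / len(risk_reprs)  # M = len(risk_reprs)
    return s_b / s_g  # undefined if r_o coincides with every high-quality representation

# Toy representations (assumed values, two dimensions for readability).
good = [[0.0, 0.0], [0.0, 2.0]]
risk = [[4.0, 0.0], [4.0, 2.0]]
unlabeled = {"near_good": [0.0, 1.0], "near_risk": [4.0, 1.0]}

# The output set P, keyed by enterprise.
P = {name: anomaly_probability(r, good, risk) for name, r in unlabeled.items()}
```

Because S_g and S_b are average distances, an enterprise that lies close to the risk set and far from the high-quality set receives a small P_o, which matches the rule that an alarm is raised when P_o falls below the threshold.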

Further, in the step S3, the prediction model selection and integration steps are specifically as follows,

Given the prediction model set $M = \{m_1, m_2, \ldots, m_k\}$, a data set $D$, and a K-class classification model $\delta: f(D) \to m$ that maps data set features to prediction models:

Calculate the features $f(D)$ of the data set $D$;

Input $f(D)$ into the K-class classification model $\delta: f(D) \to m$ to obtain the top N best-performing models, and use these N models to predict the future state of the enterprise;

Combine the N prediction results by a voting method, using model performance as the weights, and take the weighted sum as the final prediction result.
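The weighted combination in the last step can be illustrated as follows. This is a sketch under stated assumptions: the text does not say how performance is turned into weights, so the example simply normalizes non-negative performance scores to sum to one; `weighted_ensemble` and the numbers are hypothetical.

```python
def weighted_ensemble(predictions, performances):
    """Weighted sum of N model predictions, weights proportional to model performance.

    Assumes higher performance scores are better and that all scores are positive.
    """
    if len(predictions) != len(performances):
        raise ValueError("one performance score is needed per prediction")
    total = sum(performances)
    weights = [p / total for p in performances]  # normalize so the weights sum to 1
    return sum(w * y for w, y in zip(weights, predictions))

# Two models predict a future feature value; the second model performed three times better.
combined = weighted_ensemble([10.0, 20.0], [1.0, 3.0])
```

If the performance measure is an error such as MAPE, where lower is better, it would first need to be converted into a score where higher is better, for example by taking its reciprocal.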

Further, the model pool in step S3 includes ARIMA, logistic regression, polynomial regression, Bayesian regression, support vector machine, random forest, LightGBM, CatBoost, and XGBoost.

Further, the ARIMA model prediction steps are as follows:

Organize the enterprise historical public credit data into a time series with quarterly time granularity;

Construct an ARIMA model, train and optimize it, and, after inputting the integrated data set features, predict the state of the enterprise for the coming year;

Evaluate the trained ARIMA model by the mean absolute percentage error (MAPE).
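For reference, MAPE is the average of the absolute errors expressed as a percentage of the true values. The sketch below is a plain-Python illustration; the function name `mape` is hypothetical, and skipping quarters whose true value is zero is an assumption made here to avoid division by zero, not something the text specifies.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Pairs whose actual value is zero are skipped (an assumption of this sketch).
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)

# Two quarters, each predicted with a 10% error.
score = mape([100, 200], [110, 180])
```

Count-type features such as penalty counts are often zero, so in practice the handling of zero actual values deserves an explicit decision.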

An enterprise anomaly identification system based on big data, comprising:

the classification module, which classifies and labels enterprises according to enterprise public credit data through the classification model, the labeling results comprising high-quality enterprises, risk enterprises and other enterprises;

the calculation module, which calculates, through the self-encoder models, the similarity between the other enterprises and the high-quality enterprises and risk enterprises respectively to obtain the enterprise abnormal probability, wherein each self-encoder model is trained on the data features of the enterprises with the corresponding label;

the prediction module, which extracts the changes in the historical public credit data of the enterprises and inputs them into the corresponding prediction models, outputs the integrated state data of each enterprise for the coming year, and invokes the similarity algorithm of the calculation module to calculate the abnormal probability corresponding to the future state data of each enterprise;

and the alarm module, which outputs an alarm for enterprises whose abnormal probability from the calculation module or the prediction module is smaller than a specified threshold.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the big data based enterprise anomaly identification method as described above when the program is executed.

A non-transitory computer readable storage medium having stored thereon a computer program for implementing the steps of the big data based enterprise anomaly identification method described above when executed by a processor.

Compared with the prior art, the invention has the following beneficial effects. Public credit data of enterprises is classified and labeled by a classification model, and the labeled data is used to train separate self-encoder models, which copes with imbalanced proportions of positive and negative samples in enterprise anomaly identification. Predicting the future state data of an enterprise with prediction models greatly reduces the lag of anomaly identification. During future state prediction, different data sets are matched to different prediction models whose outputs are integrated by voting, which improves the reliability of the prediction and the accuracy of its result, and is of significance for avoiding enterprise financial risk.

Drawings

The disclosure of the present invention is described with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:

FIG. 1 schematically shows a flow chart of the overall enterprise anomaly identification method according to the present invention;

FIG. 2 schematically shows a flow chart for identifying enterprise anomaly probabilities with the self-encoder model according to the present invention;

FIG. 3 schematically shows a flow chart for predicting the future state anomaly probability of an enterprise with the prediction model according to the present invention;

FIG. 4 schematically shows a flow chart of the model selection and integration process based on data set features according to the present invention.

Detailed Description

It is to be understood that, according to the technical solution of the present invention, those skilled in the art may propose various alternative structural modes and implementation modes without changing the true spirit of the present invention. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit the invention to the precise form disclosed.

An embodiment according to the present invention is shown in connection with fig. 1-4.

As shown in fig. 1, for the overall recognition method, the main steps include,

S3, analyzing public credit data of enterprise historical change, extracting data set characteristics, selecting different prediction models in a model pool according to different data set characteristics, integrating the top N models with the best performance, outputting future state data of the enterprise, calling the self-encoder model trained in S2, calculating corresponding similarity and abnormal probability, and outputting an alarm.

The following specifically describes the above steps:

For step S1, when identifying abnormal enterprises in scenarios such as loan management in smart finance and enterprise supervision, the enterprise categories are mainly high-quality enterprises and risk enterprises. As shown in FIG. 1 and FIG. 2, the invention first classifies and screens each enterprise in the sample with an existing classification model. The classification model may be any of several existing model types, including logistic regression, decision tree, random forest, KNN, LightGBM, XGBoost, CatBoost, and so on. During identification, the historical and current public credit data of the enterprise is read first. The data may be provided by third-party data suppliers such as Tianyancha and Qichacha, and includes industrial and commercial data, annual report data, blacklist data, administrative license data, administrative penalty data, equity pledge data, litigation data, intellectual property data, and so on.

After identification by the classification model, samples identified as risk enterprises or high-quality enterprises are labeled accordingly and the labeling results are output. Each labeled sample is then input into the self-encoder model of the corresponding label class to train that model. Labeling is based mainly on the blacklist and the red list. The blacklist includes: the list of dishonest judgment debtors, the list of judgment debtors, the list of enterprises with serious violations, serious administrative penalties, etc. Enterprises that appear only on the blacklist are labeled as risk enterprises. The red list includes: the list of A-grade taxpayer enterprises, major ideas business listings, etc. Enterprises that appear only on the red list are labeled as high-quality enterprises.
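The labeling rule just described reduces to a small decision function. The sketch below is illustrative only: the function name, the boolean inputs and the label strings are hypothetical, and treating an enterprise that appears on both lists as "other" is an assumption, since the text only covers enterprises that appear on exactly one list.

```python
def label_enterprise(on_blacklist: bool, on_red_list: bool) -> str:
    """Label an enterprise from its blacklist and red list membership."""
    if on_blacklist and not on_red_list:
        return "risk"      # only blacklisted
    if on_red_list and not on_blacklist:
        return "premium"   # only red-listed, i.e. a high-quality enterprise
    return "other"         # on neither list, or on both (assumed handling)

labels = [label_enterprise(b, r) for b, r in [(True, False), (False, True), (False, False)]]
```

Enterprises labeled "other" are the unlabeled samples whose abnormal probability is estimated later from the trained self-encoder models.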

After enterprise labeling is completed, the numbers of high-quality enterprises and risk enterprises, and their proportions among all enterprises, are calculated. These counts and proportions serve as the decision basis for the subsequent self-encoder model training. Note that a classification model imposes a requirement on the ratio of positive to negative samples: in general, positive samples should account for no less than one tenth. A self-encoder only imposes a requirement on the number of samples: neither the positive nor the negative sample count may be less than 1000. In the field of enterprise public credit evaluation, the total number of enterprises available for model training is relatively large, while the number of positive samples is often small. The requirement on sample counts is therefore easier to meet than the requirement on the sample ratio, which is why the self-encoder applies to a wider range of scenarios than the classification model.
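A short numeric illustration of why the count requirement is easier to meet than the ratio requirement: the thresholds of one tenth and 1000 come from the paragraph above, while the function names and the sample figures are hypothetical.

```python
MIN_POSITIVE_RATIO = 0.1   # classification model: positive samples at least one tenth
MIN_SAMPLE_COUNT = 1000    # self-encoder: at least 1000 samples of the class

def classifier_feasible(n_positive: int, n_total: int) -> bool:
    """True if the positive share meets the classification model's ratio requirement."""
    return n_total > 0 and n_positive / n_total >= MIN_POSITIVE_RATIO

def self_encoder_feasible(n_samples: int) -> bool:
    """True if one class has enough samples to train its self-encoder."""
    return n_samples >= MIN_SAMPLE_COUNT

# Hypothetical portfolio: 100,000 enterprises, of which 2,000 are positive samples.
ratio_ok = classifier_feasible(2_000, 100_000)   # 2% share, below one tenth
count_ok = self_encoder_feasible(2_000)          # 2,000 samples, above 1000
```

With a large total and a small positive class, the share falls short of one tenth even though the absolute count comfortably exceeds 1000.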

The self-encoder (autoencoder) converts high-dimensional enterprise public credit data into a low-dimensional representation, retaining high-value information and discarding low-value information. It consists of two main parts: an encoder, which encodes the input, and a decoder, which uses the encoding to reconstruct the input. The self-encoder generally reconstructs the input signal only approximately, and the reconstruction contains only the most relevant parts of the original signal.

Based on the feature dimensions and sample size of the enterprise public credit data, a feed-forward, acyclic neural network is employed that links the input to the output through one or more hidden layers, similar to a multi-layer perceptron. The output layer has the same number of nodes (neurons) as the input layer, because the aim is to reconstruct the input.
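To make the structure concrete, here is a minimal pure-Python sketch of such a network: one tanh hidden layer as the encoder, a linear output layer of the same width as the input as the decoder, trained by per-sample gradient descent on squared reconstruction error. The layer sizes, learning rate, activation and synthetic data are all assumptions for illustration; the patent does not specify them, and a real system would more likely use a deep learning framework.

```python
import math
import random

def train_autoencoder(data, hidden=2, lr=0.05, epochs=300, seed=0):
    """Train a one-hidden-layer autoencoder; returns (encode, decode, losses)."""
    rnd = random.Random(seed)
    d = len(data[0])  # input width equals output width
    w1 = [[rnd.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d)]
    b1 = [0.0] * hidden
    w2 = [[rnd.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden)]
    b2 = [0.0] * d

    def encode(x):
        # Low-dimensional representation of the input.
        return [math.tanh(sum(x[i] * w1[i][j] for i in range(d)) + b1[j])
                for j in range(hidden)]

    def decode(h):
        # Reconstruction with the same number of nodes as the input.
        return [sum(h[j] * w2[j][k] for j in range(hidden)) + b2[k] for k in range(d)]

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = encode(x)
            err = [y - t for y, t in zip(decode(h), x)]
            total += sum(e * e for e in err) / d
            # Back-propagate the squared error through the decoder into the hidden layer.
            g_h = [sum(err[k] * w2[j][k] for k in range(d)) * (1.0 - h[j] ** 2)
                   for j in range(hidden)]
            for j in range(hidden):
                for k in range(d):
                    w2[j][k] -= lr * err[k] * h[j]
                b1[j] -= lr * g_h[j]
                for i in range(d):
                    w1[i][j] -= lr * g_h[j] * x[i]
            for k in range(d):
                b2[k] -= lr * err[k]
        losses.append(total / len(data))  # mean reconstruction error for this epoch
    return encode, decode, losses

# Synthetic 4-feature samples driven by two hidden factors (stand-ins for credit features).
_rnd = random.Random(1)
data = [[0.2 + 0.5 * a, 0.8 - 0.5 * a, 0.3 + 0.4 * b, 0.5 * a + 0.3 * b]
        for a, b in ((_rnd.random(), _rnd.random()) for _ in range(40))]
encode, decode, losses = train_autoencoder(data)
```

The `encode` function is what a similarity step would call to obtain a representation for an enterprise, and the falling values in `losses` indicate that the reconstruction is improving.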

When training the self-encoder models for high-quality enterprises and risk enterprises, which models are trained depends on the numbers of positive and negative samples. According to industry experience, the minimum sample count is 1000. If the number of positive-sample enterprises exceeds 1000, the self-encoder model of high-quality enterprises is trained; if the number of negative-sample enterprises exceeds 1000, the self-encoder model of risk enterprises is trained; if both exceed 1000, the self-encoder models of high-quality enterprises and risk enterprises are both trained.
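The selection rule can be written as a small helper. The function name and label strings are hypothetical; the comparison uses "at least 1000", on the assumption that the stated minimum of 1000 is itself acceptable, which the text leaves slightly ambiguous.

```python
def self_encoders_to_train(n_premium: int, n_risk: int, min_count: int = 1000):
    """Return the list of self-encoder models that have enough samples to be trained."""
    selected = []
    if n_premium >= min_count:
        selected.append("premium")  # high-quality enterprise self-encoder
    if n_risk >= min_count:
        selected.append("risk")     # risk enterprise self-encoder
    return selected

only_premium = self_encoders_to_train(1500, 400)
both = self_encoders_to_train(1500, 1200)
```

An empty result means neither class has enough samples, a case the text does not address.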

After the self-encoder models for high-quality enterprises and risk enterprises have been trained, for enterprise samples labeled neither as risk enterprises nor as high-quality enterprises, the similarity to the enterprises of each label is calculated to obtain their abnormal probability. The specific calculation steps are as follows:

Input $f_1$ into $\Delta_G$ to obtain $r_g$, for each $f_1 \in G$; input $f_2$ into $\Delta_B$ to obtain $r_b$, for each $f_2 \in B$; the representation of each unlabeled enterprise $f_3 \in O$ is denoted $r_o$. Initialize $P = \varnothing$; then, for each $o \in O$, with $g \in G$ and $b \in B$, compute the quantities below. Here $f_1$ represents a high-quality enterprise, $G$ the set of high-quality enterprises, $\Delta_G$ the trained high-quality enterprise self-encoder model, $f_2$ a risk enterprise, $B$ the set of risk enterprises, $\Delta_B$ the trained risk enterprise self-encoder model, $O$ the set of unlabeled enterprises, $g$ the features corresponding to a high-quality enterprise, and $b$ the features corresponding to a risk enterprise;

Feature similarity with high-quality enterprises: $S_g = \frac{1}{N} \sum_{g} \mathrm{Euclid}(r_o, r_g)$; feature similarity with risk enterprises: $S_b = \frac{1}{M} \sum_{b} \mathrm{Euclid}(r_o, r_b)$; the anomaly probability of enterprise $o$ is then $P_o = S_b / S_g$.

The threshold may be set by the user during abnormal enterprise identification and alarming. For example, if 10% of the highest abnormal probability is set as the threshold, an enterprise sample whose P_o is below that 10% level is identified as an abnormal enterprise and an alarm is raised, reminding the user to take measures in advance, avoid risks and reduce losses.
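The alarm step is a simple filter over the probability set P. The sketch below follows the direction stated in the text (an alarm when P_o is below the threshold); the function name, enterprise identifiers and values are hypothetical, and the threshold is passed in directly because the exact way the user derives it is not fixed.

```python
def enterprises_to_alarm(probabilities, threshold):
    """Return the identifiers of enterprises whose P_o is strictly below the threshold."""
    return sorted(name for name, p_o in probabilities.items() if p_o < threshold)

# Hypothetical probability set P and a user-chosen threshold of 0.10.
P = {"ent_a": 0.05, "ent_b": 0.40, "ent_c": 0.09}
alarms = enterprises_to_alarm(P, threshold=0.10)
```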

With reference to FIG. 3, recall that identifying abnormal enterprises with a big data scoring model relies on the current enterprise state: it can only determine whether the enterprise is abnormal at present and cannot predict its future abnormal probability. For financial risk control and government supervision of enterprises, raising an alarm only once the enterprise is already abnormal introduces a lag and cannot meet the need for early warning and early intervention.

To overcome these shortcomings, the invention further provides, in step S3, an abnormal enterprise identification method based on a regression model. The method predicts the state of the enterprise in the coming year from the historical trend of the enterprise state, and from it the abnormal probability of the enterprise in the coming year. This meets the early-warning and early-intervention needs of financial risk control and government supervision of enterprises and helps prevent losses caused by improper enterprise operation.

Regression models are predictive modeling techniques commonly used for predictive analysis, time series modeling, and finding causal relationships between dependent and independent variables. Here the regression model is used to find how enterprise features change over time, and thereby to predict the future enterprise state.

Furthermore, because different prediction models perform differently on different data sets, which makes model selection difficult, the invention also provides, in step S3, a prediction model integration and optimization method based on data set features. First, the data set is analyzed and its features are extracted; then prediction models are selected according to those features; finally, the top N best-performing models are integrated. The method makes full use of the data set features and improves the generalization ability of the model.

Model selection is based on a classification model. From historical experience, the performance of different prediction models on different data sets is known, and a model-selection classification model is built from this historical knowledge. When a new data set arrives, its features are input into the classification model, which outputs a ranked list of recommended prediction models, and the top N models on the list are selected. The model pool comprises: ARIMA, logistic regression, polynomial regression, Bayesian regression, support vector machines, random forests, LightGBM, CatBoost, XGBoost, etc.
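Choosing the top N models from the classifier's recommendation can be sketched as a ranking over per-model scores. The assumption here is that the model-selection classifier exposes one score per candidate model (for example a class probability); the function name and the scores are hypothetical.

```python
def top_n_models(scores, n):
    """Return the names of the n highest-scoring prediction models, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Hypothetical classifier output for one data set's features f(D).
recommendation = {"ARIMA": 0.50, "XGBoost": 0.30, "RandomForest": 0.15, "SVM": 0.05}
selected = top_n_models(recommendation, n=2)
```

The selected models would then each produce a prediction of the future enterprise state, to be combined by weighted voting.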

As shown in fig. 4, the prediction model selection and integration steps are specifically as follows,

calculating a feature f (D) of the dataset D;

Take the ARIMA regression model as an example; ARIMA is one of the most widely used time series prediction methods.

In the enterprise state regression prediction process, enterprise historical public credit data is first organized into a time series with quarterly time granularity. The data content comprises: number of address changes, number of shareholder changes, number of bids and winning bids, number of employees, number and amount of external investments, number and amount of equity pledges, grade and number of administrative penalties, number of lawsuits, number of intellectual property items, and the like. An ARIMA model is then constructed to predict future enterprise states. Finally, the trained ARIMA model is evaluated using the mean absolute percentage error (MAPE).
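Organizing event records into a quarterly series might look like the following sketch, which counts dated events (for example address changes) per quarter and fills quarters without events with zero. The function names and the sample dates are hypothetical, and zero-filling is an assumption of this sketch.

```python
from collections import Counter
from datetime import date

def quarter_of(d: date):
    """Map a date to its (year, quarter) pair, with quarters numbered 1 to 4."""
    return (d.year, (d.month - 1) // 3 + 1)

def quarterly_counts(event_dates, start, end):
    """Count events per quarter from start to end inclusive, both given as (year, quarter)."""
    counts = Counter(quarter_of(d) for d in event_dates)
    series = []
    year, quarter = start
    while (year, quarter) <= end:
        series.append(counts.get((year, quarter), 0))
        quarter += 1
        if quarter == 5:  # roll over to the first quarter of the next year
            year, quarter = year + 1, 1
    return series

# Hypothetical address-change dates for one enterprise.
events = [date(2022, 1, 15), date(2022, 2, 1), date(2022, 11, 30)]
series = quarterly_counts(events, start=(2022, 1), end=(2023, 1))
```

One such series per feature gives the multi-feature quarterly history that the prediction model consumes.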

The specific ARIMA model is given below, with ARIMA (p, d, q) consisting of three parts:

(1) AR(p), the autoregressive model: the value at the current time point is a regression on the values at several past time points. Because it does not depend on other explanatory variables, only on its own past history, it is called autoregressive; if it depends on the most recent p historical values, the order is p and the model is denoted AR(p). Here p is set to 20 quarters based on statistics of the data available for all enterprises. If fewer than 20 quarters of data are available for an enterprise, interpolation is performed.

(2) I(d), differencing of the time series: time series analysis requires stationarity, so a non-stationary series must be converted into a stationary one, usually by differencing. d denotes the order of differencing: subtracting the value at time t-1 from the value at time t yields a new series called the first-order difference series; the first-order difference of the first-order difference series is called the second-order difference series, and so on. Experiments show that second-order differencing works better here.

(3) MA(q), the moving average model: the value at the current time point is a regression on the prediction errors at several past time points, where prediction error = model prediction value - true value. If the series depends on the most recent q historical prediction errors, the order is q and the model is denoted MA(q). Based on experiments, q is set to 8 here.
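The differencing step of part (2) is easy to show directly. The sketch below applies d rounds of first-order differencing in plain Python; the function name and the sample series are hypothetical.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times: y[t] = x[t] - x[t-1]."""
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out

# A series with a quadratic trend becomes constant after second-order differencing.
raw = [1, 4, 9, 16, 25]
first = difference(raw, order=1)    # [3, 5, 7, 9]
second = difference(raw, order=2)   # [2, 2, 2]
```

Each round of differencing shortens the series by one point. With the orders given in the text, the configuration corresponds to ARIMA(20, 2, 8).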

Once the enterprise features for the coming year have been predicted by the ARIMA model, they are taken as input to the enterprise feature similarity algorithm of the self-encoder model to predict the abnormal probability of the enterprise in the coming year. If the prediction indicates that the enterprise is likely to be abnormal in the coming year, an alarm is raised to remind financial risk control or enterprise supervision users to take precautions in advance.

Similarly, the enterprise anomaly identification system based on big data constructed based on the method can achieve the effects, and the system can specifically comprise:

Furthermore, the method steps and system described above may be stored on a computer readable storage medium. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or a part of the technical solution, may be embodied as a software product stored in a storage medium and comprising several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The technical scope of the present invention is not limited to the above description, and those skilled in the art may make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and these changes and modifications should be included in the scope of the present invention.

Claims

1. The enterprise anomaly identification method based on big data is characterized by comprising the following steps:

the similarity calculation step in step S2 is as follows,

Adding P _o into the set P, and outputting an abnormal enterprise probability set P of each enterprise;

2. The method for identifying the abnormality of the enterprise based on the big data according to claim 1, wherein: the public credit data in step S1 comprises historical data and current data, including industrial and commercial data, annual report data, blacklist data, administrative license data, administrative penalty data, equity pledge data, litigation data and intellectual property data.

3. The method for identifying the abnormality of the enterprise based on the big data according to claim 1, wherein the method comprises the following steps: in the step S3, the prediction model selection and integration steps are specifically as follows,

calculating a feature f (D) of the dataset D;

4. The method for identifying the abnormality of the enterprise based on the big data according to claim 1, wherein: the model pool in step S3 comprises ARIMA, logistic regression, polynomial regression, Bayesian regression, support vector machine, random forest, LightGBM, CatBoost and XGBoost.

5. The method for identifying the abnormality of the enterprise based on the big data as claimed in claim 4, wherein: the ARIMA prediction procedure is as follows,

Constructing an ARIMA model, training and optimizing it, and, after inputting the integrated data set features, predicting the state of the enterprise for the coming year;

6. A system applying the big data based enterprise anomaly identification method according to any one of claims 1 to 5, comprising:

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the big data based enterprise anomaly identification method as claimed in any one of claims 1 to 5.

8. A non-transitory computer readable storage medium having a computer program stored thereon, characterized by: the computer program, when executed by a processor, is adapted to carry out the method for identifying corporate anomalies based on big data according to any one of claims 1 to 5.