CN117151867B - Enterprise exception identification method and system based on big data - Google Patents

Info

Publication number
CN117151867B
Authority
CN
China
Prior art keywords
enterprise
data
enterprises
model
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311220859.9A
Other languages
Chinese (zh)
Other versions
CN117151867A (en)
Inventor
卢煜晟
赵鹏
陈正国
吴建君
邓李鑫
蒋新国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Shucheng Information Technology Co ltd
Original Assignee
Jiangsu Shucheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Shucheng Information Technology Co ltd
Priority to CN202311220859.9A
Publication of CN117151867A
Application granted
Publication of CN117151867B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2219Large Object storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067Enterprise or organisation modelling

Abstract

The invention provides a big-data-based method and system for identifying abnormal enterprises. It addresses two problems of conventional enterprise risk identification: the required proportion of positive and negative samples is high and hard to meet, and predictions lag seriously. The method mainly comprises the following steps: S1, a classification model classifies and labels enterprises according to their public credit data and outputs the labeling results; according to the number and proportion of each label, the labeled enterprise data are input into the autoencoder model of the corresponding label category for training; S2, the public credit data of unlabeled enterprises are input into each of the autoencoder models trained in S1 to calculate the similarity between each unlabeled enterprise and the differently labeled enterprises; the anomaly probability $P_o$ of each unlabeled enterprise is obtained from the similarity, and an alarm is output for enterprises whose $P_o$ is smaller than a given threshold; S3, the historical changes in enterprise public credit data are analyzed and dataset meta-features are extracted; different prediction models are selected from a model pool according to the dataset features, and the top N best-performing models are integrated to output the future state data of the enterprise; the trained autoencoder models from S2 are then called to calculate the corresponding similarity and anomaly probability, and an alarm is output.

Description

Enterprise exception identification method and system based on big data
Technical Field
The invention relates to the technical field of data identification, in particular to an enterprise anomaly identification method and system based on big data.
Background
Currently, abnormal enterprises are generally identified with a big data scoring model. Such a model recognizes anomalies from the current state of the enterprise: it can only judge whether the enterprise is abnormal now and cannot predict the probability that it becomes abnormal in the future. For financial risk control and government supervision of enterprises, raising the alarm only once the enterprise is already abnormal introduces a lag and cannot meet the need for early warning and early intervention. In many fields, such as enterprise loan management in smart finance and enterprise supervision in smart government, identifying abnormal enterprises in time is very important for avoiding risk and reducing losses.
Abnormal enterprise identification is a component of enterprise evaluation. The current mainstream method is as follows: first, a scoring model is constructed based on big data; then, a score is calculated for each enterprise with the scoring model; finally, enterprises with lower scores are identified as abnormal. This method makes effective use of the historical public credit data of enterprises and is comparatively scientific and objective.
However, this method has the following problems:
(1) It can only judge whether the current state of the enterprise is abnormal, and cannot judge the probability that the enterprise becomes abnormal in the future;
(2) When the classification model is constructed, there is a minimum requirement on the proportion of positive and negative samples. In practice this requirement is not always satisfied, in which case the classification model fails.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a big-data-based enterprise anomaly identification method and system that meets the need for early warning and early intervention in financial risk control and government supervision of enterprises, and thereby helps prevent and mitigate losses that improper enterprise operation may cause.
In order to solve the technical problems, the invention adopts the following technical scheme: an enterprise anomaly identification method based on big data comprises the following steps:
S1, a classification model classifies and labels enterprises according to their public credit data and outputs the labeling results; according to the number and proportion of each label, the labeled enterprise data are input into the autoencoder model of the corresponding label category for training;
S2, the public credit data of unlabeled enterprises are input into each of the autoencoder models trained in S1 to calculate the similarity between each unlabeled enterprise and the differently labeled enterprises; the anomaly probability $P_o$ of each unlabeled enterprise is obtained from the similarity, and an alarm is output for enterprises whose $P_o$ is smaller than a given threshold;
S3, the historical changes in enterprise public credit data are analyzed and dataset meta-features are extracted; different prediction models are selected from a model pool according to the dataset features, and the top N best-performing models are integrated to output the future state data of the enterprise; the trained autoencoder models from S2 are then called to calculate the corresponding similarity and anomaly probability, and an alarm is output.
Further, the public credit data in step S1 comprise historical and current data, including industrial and commercial registration data, annual report data, blacklist data, administrative license data, administrative penalty data, equity pledge and guarantee data, litigation data, and intellectual property data.
Further, the similarity in step S2 is calculated as follows.
Given the enterprise feature sets, a high-quality-enterprise autoencoder $\Delta_G$ and a risk-enterprise autoencoder $\Delta_B$:
Input $f_1$ into $\Delta_G$ to obtain $r_g$, for $f_1 \in G$; input $f_2$ into $\Delta_B$ to obtain $r_b$, for $f_2 \in B$; obtain $r_o$ likewise for $f_3 \in O$. Initialize $P = \varnothing$. Then, for each $o \in O$, over all $g \in G$ and $b \in B$, compute the quantities below. Here $f_1$ denotes a high-quality enterprise, $G$ the set of high-quality enterprises, $\Delta_G$ the trained high-quality-enterprise autoencoder model, $f_2$ a risk enterprise, $B$ the set of risk enterprises, $\Delta_B$ the trained risk-enterprise autoencoder model, $g$ the features corresponding to a high-quality enterprise, and $b$ the features corresponding to a risk enterprise;
Feature similarity with high-quality enterprises: $S_g = \frac{1}{N}\sum_{g \in G} \mathrm{Euclid}(r_o, r_g)$; feature similarity with risk enterprises: $S_b = \frac{1}{M}\sum_{b \in B} \mathrm{Euclid}(r_o, r_b)$;
where $N$ and $M$ are the sizes of the corresponding sets, $\mathrm{Euclid}(r_o, r_b)$ is the Euclidean distance between $o$ and $b$, and the anomaly probability of enterprise $o$ is $P_o = S_b / S_g$;
Add $P_o$ to the set $P$, and output the set $P$ of anomaly probabilities of all enterprises.
Further, in step S3, the prediction models are selected and integrated as follows.
Given the prediction model set $M = \{m_1, m_2, \ldots, m_k\}$, a dataset $D$, and a $k$-class classification model $\delta: f(D) \to m$ that maps dataset features to prediction models:
calculate the features $f(D)$ of the dataset $D$;
input $f(D)$ into the $k$-class classification model $\delta: f(D) \to m$ to obtain the top N best-performing models, and use these N models to predict the future state of the enterprise;
combine the N prediction results by a voting method, using each model's performance as its weight, and take the weighted sum as the final prediction result.
Further, the model pool in step S3 includes ARIMA, logistic regression, polynomial regression, Bayesian regression, support vector machine, random forest, LightGBM, CatBoost, and XGBoost.
Further, prediction with the ARIMA model proceeds as follows:
organize the historical public credit data of the enterprise into a time series with quarterly time granularity;
construct an ARIMA model, train and tune it, and, after inputting the integrated dataset meta-features, predict the state of the enterprise over the coming year;
evaluate the trained ARIMA model with the mean absolute percentage error (MAPE).
An enterprise anomaly identification system based on big data, comprising:
a classification module, which classifies and labels enterprises according to their public credit data through the classification model, the labeling results comprising high-quality enterprises, risk enterprises, and other enterprises;
a calculation module, which uses the autoencoder models to calculate the similarity of the other enterprises to the high-quality enterprises and to the risk enterprises respectively and so obtains their anomaly probability, wherein each autoencoder model is trained on the data features of the enterprises carrying the corresponding label;
a prediction module, which extracts the changes in the historical public credit data of each enterprise, inputs them into the corresponding prediction models, outputs the integrated state data of the enterprise for the coming year, and calls the similarity algorithm of the calculation module to calculate the anomaly probability corresponding to the future state data of each enterprise;
and an alarm module, which outputs an alarm for enterprises whose anomaly probability from the calculation module or the prediction module is smaller than a specified threshold.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the big data based enterprise anomaly identification method as described above when the program is executed.
A non-transitory computer readable storage medium having stored thereon a computer program for implementing the steps of the big data based enterprise anomaly identification method described above when executed by a processor.
Compared with the prior art, the invention has the following beneficial effects. A classification model classifies and labels enterprises from their public credit data, and the labeled data are used to train separate autoencoder models, which copes with imbalanced proportions of positive and negative samples in enterprise anomaly identification. Predicting the future state data of an enterprise with prediction models greatly reduces the lag in anomaly identification. In predicting the future state, different datasets are matched to different prediction models whose outputs are integrated by voting, which greatly improves the reliability of the data source and ensures the accuracy of the prediction result. The method is therefore of great significance for avoiding enterprise financial risks.
Drawings
The disclosure of the present invention is described with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:
FIG. 1 schematically shows a flow chart of an overall method for identifying anomalies in an enterprise in accordance with the present invention;
FIG. 2 schematically shows a flow chart for identifying enterprise anomaly probabilities with the autoencoder model in accordance with the present invention;
FIG. 3 schematically shows a flow chart for predicting the future state anomaly probability of an enterprise according to a prediction model provided by the invention;
FIG. 4 schematically shows a flow chart of the model selection and integration process based on dataset features according to the present invention.
Detailed Description
It is to be understood that, according to the technical solution of the present invention, those skilled in the art may propose various alternative structural modes and implementation modes without changing the true spirit of the present invention. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit the invention to the precise form disclosed.
An embodiment according to the present invention is shown in connection with fig. 1-4.
As shown in fig. 1, for the overall recognition method, the main steps include,
S1, a classification model classifies and labels enterprises according to their public credit data and outputs the labeling results; according to the number and proportion of each label, the labeled enterprise data are input into the autoencoder model of the corresponding label category for training;
S2, the public credit data of unlabeled enterprises are input into each of the autoencoder models trained in S1 to calculate the similarity between each unlabeled enterprise and the differently labeled enterprises; the anomaly probability $P_o$ of each unlabeled enterprise is obtained from the similarity, and an alarm is output for enterprises whose $P_o$ is smaller than a given threshold;
S3, the historical changes in enterprise public credit data are analyzed and dataset features are extracted; different prediction models are selected from a model pool according to the dataset features, and the top N best-performing models are integrated to output the future state data of the enterprise; the trained autoencoder models from S2 are then called to calculate the corresponding similarity and anomaly probability, and an alarm is output.
The following specifically describes the above steps:
For step S1, when identifying abnormal enterprises in scenarios such as loan management in smart finance and enterprise supervision, the enterprise categories are mainly high-quality enterprises and risk enterprises. As shown in fig. 1 and 2, the invention first classifies and screens each enterprise in the sample with an existing classification model. The classification model may be any of several existing types, including logistic regression, decision tree, random forest, KNN, LightGBM, XGBoost, CatBoost, etc. During identification, the historical and current public credit data of the enterprise are read first. The data may be provided by third-party data suppliers such as Tianyancha and Qichacha, and include industrial and commercial registration data, annual report data, blacklist data, administrative license data, administrative penalty data, equity pledge and guarantee data, litigation data, intellectual property data, etc.
After identification by the classification model, samples recognized as risk enterprises or high-quality enterprises are labeled accordingly and the labeling results are output; each labeled sample is also input into the autoencoder model of its label category to train that model. Labeling is based mainly on the blacklist and the red list. The blacklist includes: the list of dishonest judgment debtors, the list of judgment debtors, the list of seriously illegal enterprises, serious administrative penalties, etc. Enterprises that appear only on the blacklist are labeled as risk enterprises. The red list includes: tax level A business listings, major ideas business listings, etc. Enterprises that appear only on the red list are labeled as high-quality enterprises.
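The list-based labeling rule can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `label_enterprise`, the label strings, and the use of plain sets of enterprise identifiers are assumptions made for the example. Enterprises on both lists or on neither are placed in an "other" category here, which is one reading of "only on the blacklist" and "only on the red list".

```python
def label_enterprise(enterprise_id, blacklist, redlist):
    """Label one enterprise from list membership (hypothetical helper)."""
    in_black = enterprise_id in blacklist
    in_red = enterprise_id in redlist
    if in_black and not in_red:
        return "risk"            # only on the blacklist
    if in_red and not in_black:
        return "high_quality"    # only on the red list
    return "other"               # on both lists or on neither (assumption)


# Illustrative identifiers, not real data.
blacklist = {"E1", "E3"}
redlist = {"E2", "E3"}
labels = {e: label_enterprise(e, blacklist, redlist)
          for e in ["E1", "E2", "E3", "E4"]}
```

Only the "risk" and "high_quality" groups would go on to autoencoder training; the "other" group is scored later by similarity.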
After the enterprises are labeled, the numbers of high-quality enterprises and risk enterprises, and their proportions among all enterprises, are calculated. The numbers and the positive-to-negative sample ratio serve as the decision basis for the subsequent autoencoder training. Note that a classification model places a requirement on the positive-to-negative sample ratio: in general, positive samples must make up at least one tenth. An autoencoder only places a requirement on the number of positive and negative samples, which must be at least 1000. In enterprise public credit evaluation, the total number of enterprises available for model training is relatively large, while the number of positive samples is often small. The requirement on sample count is therefore easier to meet than the requirement on sample ratio, which is why the autoencoder applies to a wider range of scenarios than the classification model.
The autoencoder converts high-dimensional enterprise public credit data into a low-dimensional representation, retaining high-value information and discarding low-value information. It consists of two main parts: an encoder, which encodes the input, and a decoder, which reconstructs the input from the encoding. An autoencoder generally reconstructs the input only approximately, so that the reconstruction contains only the most relevant parts of the original signal.
Given the feature dimensionality and sample size of enterprise public credit data, a feed-forward, non-recurrent neural network is used, similar to the single-layer perceptrons that make up a multi-layer perceptron: an input layer and an output layer are connected by one or more hidden layers. The output layer has the same number of nodes (neurons) as the input layer, because the aim is to reconstruct the input.
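To make the encoder/decoder structure concrete, the sketch below trains a toy linear autoencoder in pure Python. It is an illustration under stated assumptions, not the network used in the patent: the single hidden unit, the absence of activation functions, the learning rate, the constant weight initialization, and the five two-dimensional sample points are all chosen only to keep the example small. The names `forward` and `train_autoencoder` are hypothetical.

```python
def forward(enc, dec, x):
    """Encode x into the hidden code h, then decode h back to input size."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in enc]
    x_hat = [sum(w * hk for w, hk in zip(row, h)) for row in dec]
    return h, x_hat


def train_autoencoder(data, hidden=1, lr=0.05, epochs=200):
    """Full-batch gradient descent on the mean squared reconstruction error."""
    dim = len(data[0])
    # Constant init is fine for one hidden unit; use random init for more.
    enc = [[0.1] * dim for _ in range(hidden)]   # hidden x dim
    dec = [[0.1] * hidden for _ in range(dim)]   # dim x hidden
    n = len(data)
    losses = []
    for _ in range(epochs):
        g_enc = [[0.0] * dim for _ in range(hidden)]
        g_dec = [[0.0] * hidden for _ in range(dim)]
        total = 0.0
        for x in data:
            h, x_hat = forward(enc, dec, x)
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            total += sum(e * e for e in err)
            for j in range(dim):                 # decoder gradient
                for k in range(hidden):
                    g_dec[j][k] += 2 * err[j] * h[k]
            for k in range(hidden):              # encoder gradient
                dh = sum(2 * err[j] * dec[j][k] for j in range(dim))
                for i in range(dim):
                    g_enc[k][i] += dh * x[i]
        losses.append(total / n)                 # loss before this update
        for j in range(dim):
            for k in range(hidden):
                dec[j][k] -= lr * g_dec[j][k] / n
        for k in range(hidden):
            for i in range(dim):
                enc[k][i] -= lr * g_enc[k][i] / n
    return enc, dec, losses


# Toy 2-D points lying on a line, so one hidden unit can represent them.
data = [[-1.0, -2.0], [-0.5, -1.0], [0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]
enc, dec, losses = train_autoencoder(data)
code, reconstruction = forward(enc, dec, data[-1])
```

The hidden code `code` is the low-dimensional representation, and `reconstruction` has the same length as the input, matching the requirement that the output layer has as many nodes as the input layer.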
Which autoencoder is trained, the one for high-quality enterprises or the one for risk enterprises, depends on the number of positive and negative samples. According to industry experience, the minimum number is 1000. If the number of positive-sample enterprises exceeds 1000, the high-quality-enterprise autoencoder is trained; if the number of negative-sample enterprises exceeds 1000, the risk-enterprise autoencoder is trained; if both exceed 1000, both autoencoders are trained.
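The training decision can be written as a small rule. This is a sketch only: the function name `autoencoders_to_train` and the label strings are hypothetical, and treating exactly 1000 samples as sufficient is an assumption, since the text says both "cannot be less than 1000" and "exceeds 1000".

```python
def autoencoders_to_train(n_high_quality, n_risk, min_samples=1000):
    """Return which autoencoders have enough labeled samples to be trained."""
    models = []
    if n_high_quality >= min_samples:   # boundary handling is an assumption
        models.append("high_quality")
    if n_risk >= min_samples:
        models.append("risk")
    return models


only_good = autoencoders_to_train(5000, 300)
both = autoencoders_to_train(1200, 1500)
neither = autoencoders_to_train(10, 20)
```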
After the autoencoder models for the high-quality enterprises and the risk enterprises have been trained, the similarity between each enterprise sample that was judged neither a risk enterprise nor a high-quality enterprise and the enterprises of each label is calculated to obtain its anomaly probability. The specific calculation steps are as follows.
Given the enterprise feature sets, a high-quality-enterprise autoencoder $\Delta_G$ and a risk-enterprise autoencoder $\Delta_B$:
Input $f_1$ into $\Delta_G$ to obtain $r_g$, for $f_1 \in G$; input $f_2$ into $\Delta_B$ to obtain $r_b$, for $f_2 \in B$; obtain $r_o$ likewise for $f_3 \in O$. Initialize $P = \varnothing$. Then, for each $o \in O$, over all $g \in G$ and $b \in B$, compute the quantities below. Here $f_1$ denotes a high-quality enterprise, $G$ the set of high-quality enterprises, $\Delta_G$ the trained high-quality-enterprise autoencoder model, $f_2$ a risk enterprise, $B$ the set of risk enterprises, $\Delta_B$ the trained risk-enterprise autoencoder model, $g$ the features corresponding to a high-quality enterprise, and $b$ the features corresponding to a risk enterprise;
Feature similarity with high-quality enterprises: $S_g = \frac{1}{N}\sum_{g \in G} \mathrm{Euclid}(r_o, r_g)$; feature similarity with risk enterprises: $S_b = \frac{1}{M}\sum_{b \in B} \mathrm{Euclid}(r_o, r_b)$;
where $N$ and $M$ are the sizes of the corresponding sets, $\mathrm{Euclid}(r_o, r_b)$ is the Euclidean distance between $o$ and $b$, and the anomaly probability of enterprise $o$ is $P_o = S_b / S_g$;
Add $P_o$ to the set $P$, and output the set $P$ of anomaly probabilities of all enterprises.
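The ratio of mean distances described above can be sketched in a few lines. This is an illustration under assumptions: the representations are taken to be plain lists of floats already produced by the autoencoders, the names `euclid` and `anomaly_ratio` are hypothetical, and the guard against a zero denominator is an addition not mentioned in the source.

```python
import math


def euclid(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def anomaly_ratio(r_o, good_reps, risk_reps):
    """P_o = S_b / S_g, where each S is a mean distance to a labeled set."""
    s_g = sum(euclid(r_o, r_g) for r_g in good_reps) / len(good_reps)
    s_b = sum(euclid(r_o, r_b) for r_b in risk_reps) / len(risk_reps)
    if s_g == 0:
        # Added guard (assumption): r_o coincides with every high-quality sample.
        raise ValueError("mean distance to high-quality enterprises is zero")
    return s_b / s_g


# Illustrative 2-D representations, not real data.
good_reps = [[0.0, 0.0], [0.0, 2.0]]
risk_reps = [[4.0, 0.0]]
p_near_good = anomaly_ratio([0.0, 1.0], good_reps, risk_reps)
p_near_risk = anomaly_ratio([4.0, 1.0], good_reps, risk_reps)
```

In this toy data the enterprise close to the risk sample gets the smaller ratio, which is consistent with alarming on values of $P_o$ below a threshold.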
The threshold used for abnormal enterprise identification and alarming can be set by the user. Since $S_g$ and $S_b$ are computed as mean Euclidean distances, a smaller $P_o$ means the enterprise lies closer to the risk enterprises than to the high-quality enterprises. For example, if the threshold is set to 10% of the highest anomaly probability, enterprise samples whose $P_o$ falls below that value are identified as abnormal enterprises and an alarm is raised, reminding the user to take measures in advance, avoid risks, and reduce losses.
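A possible reading of this alarm rule is sketched below. The interpretation of the threshold as a fraction of the largest value in the set $P$ is an assumption, as are the function name `enterprises_to_alarm` and the dictionary layout mapping enterprise identifiers to $P_o$.

```python
def enterprises_to_alarm(p, fraction=0.10):
    """Return enterprises whose P_o is below fraction * max(P) (one reading)."""
    threshold = fraction * max(p.values())
    return sorted(name for name, p_o in p.items() if p_o < threshold)


# Illustrative values, not real data.
p = {"A": 0.2, "B": 1.0, "C": 5.0}
alarms = enterprises_to_alarm(p)
```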
As described in the Background, abnormal enterprises are conventionally identified with a big data scoring model, which recognizes anomalies from the current enterprise state only: it can judge whether an enterprise is abnormal now but cannot predict the probability of a future anomaly. For financial risk control and government supervision of enterprises, raising the alarm only once the enterprise is already abnormal introduces a lag and cannot meet the need for early warning and early intervention. The flow for predicting the future anomaly probability of an enterprise is shown in fig. 3.
To overcome these defects, the invention also provides, in step S3, an abnormal enterprise identification method based on a regression model. The method predicts the state of the enterprise in the next year from the historical trend of the enterprise state, and from that predicts the anomaly probability of the enterprise in the next year. This meets the need for early warning and early intervention in financial risk control and government supervision of enterprises, and helps prevent and mitigate losses that improper enterprise operation may cause.
Regression models are predictive modeling techniques that are commonly used for predictive analysis, time series models, and finding causal relationships between dependent and independent variables. The regression model is used for finding the law of the change of the enterprise characteristics along with time, so that the future enterprise state is predicted.
Furthermore, considering the problem that different prediction models are different in performance on different data sets, and further the selection of the prediction models is difficult, the invention also provides a prediction model integration optimization method based on the characteristics of the data sets in step S3. Firstly, analyzing a data set, and extracting characteristics of the data set; then selecting different prediction models according to different data set characteristics; and finally integrating the top N models with the best performance. The method can fully utilize the characteristics of the data set and effectively improve the generalization capability of the model.
Model selection is itself based on a classification model. From historical experience, the performance of different prediction models on different datasets is known, and a model-selection classifier is built from this historical knowledge. When a new dataset arrives, its features are input into the classifier, which returns a list of recommended prediction models, and the top N models are selected from this list. The model pool includes: ARIMA, logistic regression, polynomial regression, Bayesian regression, support vector machines, random forests, LightGBM, CatBoost, XGBoost, etc.
As shown in fig. 4, the prediction model selection and integration steps are specifically as follows,
Given the prediction model set $M = \{m_1, m_2, \ldots, m_k\}$, a dataset $D$, and a $k$-class classification model $\delta: f(D) \to m$ that maps dataset features to prediction models:
calculate the features $f(D)$ of the dataset $D$;
input $f(D)$ into the $k$-class classification model $\delta: f(D) \to m$ to obtain the top N best-performing models, and use these N models to predict the future state of the enterprise;
combine the N prediction results by a voting method, using each model's performance as its weight, and take the weighted sum as the final prediction result.
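The selection and weighted-voting steps can be sketched as follows. This is a minimal illustration: the per-model performance scores stand in for the output of the classifier $\delta$, the predictions are single numbers, and normalizing the weights so that they sum to 1 is an assumption. The names `select_top_n` and `weighted_ensemble` are hypothetical.

```python
def select_top_n(scores, n):
    """Pick the n model names with the highest performance score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def weighted_ensemble(predictions, scores, n):
    """Weighted sum of the top-n predictions, weights proportional to score."""
    top = select_top_n(scores, n)
    total = sum(scores[m] for m in top)
    return sum(scores[m] / total * predictions[m] for m in top)


# Illustrative scores and predictions, not real results.
scores = {"arima": 0.5, "random_forest": 0.3, "svm": 0.2}
predictions = {"arima": 10.0, "random_forest": 20.0, "svm": 100.0}
top_two = select_top_n(scores, 2)
final_prediction = weighted_ensemble(predictions, scores, 2)
```

With these numbers the two selected models receive weights 0.625 and 0.375, and the third model is ignored.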
Take the ARIMA regression model as an example. ARIMA is one of the most widely used time series forecasting methods.
In the regression prediction of enterprise state, the historical public credit data of the enterprise are first organized into a time series. The time series has quarterly granularity, and its content includes: number of address changes, number of shareholder changes, number of bids and bids won, number of employees, number and amount of outbound investments, number and amount of equity pledges and guarantees, level and number of administrative penalties, number of lawsuits, number of intellectual property items, etc. An ARIMA model is then constructed to predict the future enterprise state. Finally, the trained ARIMA model is evaluated with the mean absolute percentage error (MAPE).
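For reference, MAPE averages the absolute errors relative to the true values and expresses the result as a percentage. The helper below is a minimal sketch with a hypothetical name and made-up quarterly values; it assumes every true value is nonzero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Illustrative quarterly values, not real data.
error_pct = mape([100.0, 200.0], [110.0, 180.0])
```

Both forecasts here are off by 10% of the true value, so the MAPE is 10%.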
The specific ARIMA model is given below, with ARIMA (p, d, q) consisting of three parts:
(1) AR(p), the autoregressive model: the value at the current time point is a regression on the values at several past time points. Because it does not depend on other explanatory variables, only on its own history, it is called autoregressive; if it depends on the most recent p historical values, the order is p and the model is written AR(p). Here p is set to 20 (quarters), based on statistics of the data available for all enterprises. If an enterprise has fewer than 20 quarters of data, the missing values are interpolated.
(2) I(d), differencing of the time series: time series analysis requires stationarity, so a non-stationary series must be converted into a stationary one, usually by differencing. d is the order of differencing: subtracting the value at time t-1 from the value at time t yields a new series called the first-order difference series; the first-order difference of the first-order difference series is the second-order difference series, and so on. Tests showed that second-order differencing works better here.
(3) MA(q), the moving average model: the value at the current time point is a regression on the prediction errors at several past time points, where prediction error = model prediction value - true value. If the series depends on the most recent q historical prediction errors, the order is q and the model is written MA(q). Through experiments, q is set to 8 here.
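The differencing step I(d) described above can be illustrated directly. This is a small sketch with a hypothetical helper name and a made-up series; the quadratic series is chosen so that the second-order difference is constant.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing: x[t] - x[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# Illustrative series, not real enterprise data.
quarterly = [1, 4, 9, 16, 25]
first_order = difference(quarterly, d=1)    # [3, 5, 7, 9]
second_order = difference(quarterly, d=2)   # [2, 2, 2]
```

Each round of differencing shortens the series by one point, so with d = 2 a 20-quarter history leaves 18 differenced values.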
The ARIMA model thus predicts the enterprise features for the next year. These predicted features are taken as input to the enterprise feature similarity algorithm of the autoencoder model, which yields the anomaly probability of the enterprise for the next year. If the enterprise is likely to be abnormal in the next year, an alarm is raised to remind the financial risk control or enterprise supervision user to take precautions in advance.
Similarly, a big-data-based enterprise anomaly identification system built on this method achieves the same effects. The system may specifically comprise:
a classification module, which classifies and labels enterprises with the classification model according to enterprise public credit data, the labeling results comprising high-quality enterprises, risk enterprises and other enterprises;
a calculation module, which uses the self-encoder models to calculate the similarity of other enterprises to high-quality enterprises and to risk enterprises respectively, so as to obtain the anomaly probability of risk enterprises, wherein each self-encoder model is trained on the data features of enterprises carrying the corresponding label;
a prediction module, which extracts the historical changes in the enterprises' public credit data, inputs them into the corresponding prediction models, outputs the enterprises' state data for future years after integration, and calls the similarity algorithm of the calculation module to calculate the anomaly probability corresponding to the future state data of each enterprise;
and an alarm module, which outputs an alarm for enterprises whose anomaly probability in the calculation module or the prediction module is smaller than a specified threshold.
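The alarm module's decision rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `enterprises_to_alert`, the enterprise identifiers, the sample values and the threshold of 0.8 are all hypothetical. It assumes, as in claim 1, that the anomaly probability is the ratio of the distance to risk enterprises over the distance to high-quality enterprises, so a smaller value means the enterprise lies closer to the risk enterprises.

```python
def enterprises_to_alert(current_probs, predicted_probs, threshold):
    """Return the sorted IDs whose current or predicted anomaly probability is below threshold."""
    flagged = set()
    for probs in (current_probs, predicted_probs):
        for enterprise_id, p in probs.items():
            if p < threshold:
                flagged.add(enterprise_id)
    return sorted(flagged)


# Hypothetical outputs of the calculation module and the prediction module.
current = {"A": 1.8, "B": 0.6, "C": 1.1}
predicted = {"A": 1.5, "B": 0.7, "C": 0.4}

print(enterprises_to_alert(current, predicted, threshold=0.8))
```

Here enterprise B is flagged on its current state and enterprise C only on its predicted state, which is the early-warning case the prediction module is meant to cover.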
Furthermore, the method steps and system described above may be stored on a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The technical scope of the present invention is not limited to the above description, and those skilled in the art may make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and these changes and modifications should be included in the scope of the present invention.

Claims (8)

1. The enterprise anomaly identification method based on big data is characterized by comprising the following steps:
S1, classifying and labeling enterprises with a classification model according to enterprise public credit data, outputting the labeling results, and, according to the number and proportion of the labels, inputting the labeled enterprise data into the self-encoder models of the corresponding label categories for training;
S2, inputting unlabeled enterprise public credit data into each of the self-encoder models trained in S1, so as to calculate the similarity between the unlabeled enterprise and the differently labeled enterprises, obtaining the anomaly probability $P_o$ of the unlabeled enterprise from the similarity, and outputting an alarm for enterprises whose $P_o$ is smaller than a certain threshold;
the similarity calculation in step S2 proceeds as follows:
given the enterprise feature sets, a high-quality enterprise self-encoder $\Delta_G$ and a risk enterprise self-encoder $\Delta_B$;
for each $f_1 \in G$, the features are input into $\Delta_G$ to obtain the trained high-quality enterprise self-encoder model; for each $f_2 \in B$, the features are input into $\Delta_B$ to obtain the trained risk enterprise self-encoder model; for $f_3 \in O$, initialize $P = \varnothing$; then for each $o \in O$, and for $g \in G$, $b \in B$, where $f_1$ denotes a high-quality enterprise, $G$ the set of high-quality enterprises, $f_2$ a risk enterprise, $B$ the set of risk enterprises, $g$ the features of a high-quality enterprise, and $b$ the features of a risk enterprise;
the feature similarity to high-quality enterprises is $S_g = \left(\sum_{g} \mathrm{Euclid}(r_o, r_g)\right)/N$, and the feature similarity to risk enterprises is $S_b = \left(\sum_{b} \mathrm{Euclid}(r_o, r_b)\right)/M$;
where $N$ and $M$ are the sizes of the corresponding sets, $\mathrm{Euclid}(r_o, r_b)$ denotes the distance between $o$ and $b$, and the anomaly probability of enterprise $o$ is $P_o = S_b / S_g$;
$P_o$ is added to the set $P$, and the set $P$ of anomaly probabilities of the enterprises is output;
S3, analyzing the public credit data of the enterprise's historical changes, extracting data set features, selecting different prediction models from a model pool according to the data set features, integrating the top N best-performing models, outputting future state data of the enterprise, calling the trained self-encoder model from S2, calculating the corresponding similarity and anomaly probability, and outputting an alarm.
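The similarity calculation of step S2 can be sketched in plain Python as follows. This is a minimal illustration, not the claimed implementation: the self-encoder training and encoding are omitted, the feature vectors below are hypothetical stand-ins for the encoded features of each enterprise, and it is assumed that the similarity to high-quality enterprises averages the distances to the high-quality set while the similarity to risk enterprises averages the distances to the risk set. The function names `euclid` and `anomaly_probabilities` are also assumptions.

```python
import math


def euclid(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def anomaly_probabilities(other, good, risk):
    """other: dict of id -> features; good, risk: non-empty lists of feature vectors."""
    P = {}
    for o_id, r_o in other.items():
        s_g = sum(euclid(r_o, r_g) for r_g in good) / len(good)  # mean distance to high-quality set
        s_b = sum(euclid(r_o, r_b) for r_b in risk) / len(risk)  # mean distance to risk set
        # Guard against division by zero when o coincides with every high-quality enterprise.
        P[o_id] = s_b / s_g if s_g > 0 else float("inf")
    return P


# Hypothetical encoded features (illustrative values only).
good = [[0.0, 0.0], [6.0, 0.0]]
risk = [[3.0, 4.0]]
other = {"x": [3.0, 0.0], "y": [3.0, 4.0]}

P = anomaly_probabilities(other, good, risk)
print(P)
```

Enterprise "y" has the same features as the risk enterprise, so its value is 0 and it would fall below any positive alarm threshold; enterprise "x" lies closer to the high-quality set and receives a value above 1.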
2. The big-data-based enterprise anomaly identification method according to claim 1, characterized in that: the public credit data in step S1 comprise historical data and existing data, including business data, annual report data, blacklist data, administrative license data, administrative penalty data, equity quality and guarantee data, lawsuit data and intellectual property data.
3. The big-data-based enterprise anomaly identification method according to claim 1, characterized in that: in step S3, the prediction model selection and integration steps are specifically as follows:
given the prediction model set $M = \{m_1, m_2, \ldots, m_k\}$, the data set $D$, and a K-class classification model $\Delta: \{f(D) \to m\}$ that maps data set features to prediction models;
calculating the features $f(D)$ of the data set $D$;
inputting $f(D)$ into the K-class classification model $\Delta: \{f(D) \to m\}$ to obtain the top N best-performing models, and using these top N models to obtain prediction results for the future state of the enterprise;
and combining the N prediction results by a voting method, using each model's performance as its weight, and computing the weighted sum to obtain the final prediction result.
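The weighted integration of the top N models can be sketched as follows. This is a simplified illustration under assumptions: the K-class classification model is replaced by a given table of performance scores (higher taken to be better), the weights are normalized so that they sum to 1, and the function name `weighted_ensemble`, the model names and all values are hypothetical.

```python
def weighted_ensemble(predictions, performances, top_n):
    """Weighted sum of the top_n best-scoring models' predictions.

    predictions: dict of model name -> predicted value
    performances: dict of model name -> performance score (assumed: higher is better)
    """
    ranked = sorted(performances, key=performances.get, reverse=True)[:top_n]
    total = sum(performances[m] for m in ranked)
    # Each model's weight is its share of the total performance of the selected models.
    return sum(predictions[m] * performances[m] / total for m in ranked)


# Hypothetical predictions of one enterprise indicator and hypothetical scores.
preds = {"arima": 10.0, "random_forest": 20.0, "xgboost": 40.0}
perf = {"arima": 3.0, "random_forest": 1.0, "xgboost": 0.5}

final = weighted_ensemble(preds, perf, top_n=2)
print(final)
```

With `top_n=2` only the two best-scoring models contribute, so the lowest-scoring model's prediction has no influence on the result.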
4. The big-data-based enterprise anomaly identification method according to claim 1, characterized in that: the model pool in step S3 comprises ARIMA, logistic regression, polynomial regression, Bayesian regression, support vector machine, random forest, LightGBM, CatBoost and XGBoost.
5. The big-data-based enterprise anomaly identification method according to claim 4, characterized in that: the ARIMA prediction procedure is as follows:
organizing the enterprise's historical public credit data into a time series whose time granularity is counted by quarter;
constructing an ARIMA model, training and optimizing it, and, after inputting the integrated data set features, predicting the enterprise's state for the coming year;
evaluating the trained ARIMA model by the mean absolute percentage error (MAPE).
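The MAPE evaluation can be sketched as follows, using the standard definition of mean absolute percentage error. The function name `mape` and the sample values are hypothetical; the sketch assumes that all true values are non-zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true and forecast values for three quarters.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

score = mape(actual, forecast)
print(score)
```

A lower value indicates that the forecasts deviate less, in relative terms, from the true values.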
6. A system for the big-data-based enterprise anomaly identification method according to any one of claims 1 to 5, comprising:
a classification module, which classifies and labels enterprises with the classification model according to enterprise public credit data, the labeling results comprising high-quality enterprises, risk enterprises and other enterprises;
a calculation module, which uses the self-encoder models to calculate the similarity of other enterprises to high-quality enterprises and to risk enterprises respectively, so as to obtain the anomaly probability of risk enterprises, wherein each self-encoder model is trained on the data features of enterprises carrying the corresponding label;
a prediction module, which extracts the historical changes in the enterprises' public credit data, inputs them into the corresponding prediction models, outputs the enterprises' state data for future years after integration, and calls the similarity algorithm of the calculation module to calculate the anomaly probability corresponding to the future state data of each enterprise;
and an alarm module, which outputs an alarm for enterprises whose anomaly probability in the calculation module or the prediction module is smaller than a specified threshold.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the big data based enterprise anomaly identification method as claimed in any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, carries out the big-data-based enterprise anomaly identification method according to any one of claims 1 to 5.
CN202311220859.9A 2023-09-20 2023-09-20 Enterprise exception identification method and system based on big data Active CN117151867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311220859.9A CN117151867B (en) 2023-09-20 2023-09-20 Enterprise exception identification method and system based on big data

Publications (2)

Publication Number Publication Date
CN117151867A CN117151867A (en) 2023-12-01
CN117151867B CN117151867B (en) 2024-04-30

Family

ID=88902508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311220859.9A Active CN117151867B (en) 2023-09-20 2023-09-20 Enterprise exception identification method and system based on big data

Country Status (1)

Country Link
CN (1) CN117151867B (en)

Also Published As

Publication number Publication date
CN117151867A (en) 2023-12-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant