CN112580992B

CN112580992B - Illegal fund collecting risk monitoring system for financial-like enterprises

Info

Publication number: CN112580992B
Application number: CN202011539768.8A
Authority: CN
Inventors: 梅剑; 陈文�; 周凡吟; 吴桐; 曾途
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2024-04-09
Anticipated expiration: 2040-12-23
Also published as: CN112580992A

Abstract

The invention relates to an illegal fund-raising risk monitoring system for financial-like enterprises, comprising a data preparation system, a model training system and a risk monitoring system. The data preparation system collects the training data required for model training and the monitoring data required for monitoring a target enterprise. The model training system acquires the training data from the data preparation system and, based on at least two algorithms, trains a separate prediction model for each algorithm. The risk monitoring system acquires the monitoring data from the data preparation system, obtains a predicted value from each prediction model output by the model training system, and integrates all the predicted values into a single total predicted value for output. Because the total predicted value is spread across several algorithms, its dependence on any single algorithm is reduced and the stability of the model is enhanced.

Description

Illegal fund collecting risk monitoring system for financial-like enterprises

Technical Field

The invention relates to the technical field of risk supervision, and in particular to an illegal fund-raising risk monitoring system for financial-like enterprises.

Background

A financial-like enterprise is an enterprise whose business has the attributes of financial activity but which holds no financial license and is not directly supervised by the national financial regulators (i.e., it falls outside the remit of the "one bank and three commissions"). As networks grow more complex and information and communication technology develops rapidly, some structurally complex businesses have become intertwined with internet finance channels, combining online and offline operations and increasing business complexity. Irregular operations by some of these enterprises have caused risks to surge, illegal fund-raising by financial-like enterprises has become rampant, and criminal methods are constantly updated, all of which places higher demands on regulators.

Chinese patent application publication No. CN 110704572A discloses an early-warning method for suspected illegal fund-raising risk. The method predicts risk values for multiple risk types in a modular manner and then fuses those risk values into a final risk prediction value. It yields a more accurate prediction, but it has shortcomings: only one algorithm is used to predict the risk value of each module, so the stability of the prediction model depends entirely on that algorithm, i.e., the dependence on a single algorithm is high.

Disclosure of Invention

The invention aims to solve the technical problem that prediction results in the prior art depend heavily on a single algorithm. It provides an illegal fund-raising risk monitoring system for financial-like enterprises that spreads the prediction across several algorithms, thereby reducing the dependence of the prediction model's stability on any single algorithm and improving the stability and reliability of the system.

To achieve the above objective, the present invention provides an illegal fund-raising risk monitoring system for financial-like enterprises, comprising a data preparation system, a model training system and a risk monitoring system, wherein:

the data preparation system is used for collecting the training data required for model training and the monitoring data required for monitoring a target enterprise;

the model training system is used for acquiring the training data from the data preparation system and, based on at least two algorithms, training a separate prediction model for each algorithm;

the risk monitoring system is used for acquiring the monitoring data from the data preparation system, obtaining a predicted value from each prediction model output by the model training system, and integrating all the predicted values into a single total predicted value for output.

Further, the model training system comprises at least two training units, each of which acquires the training data from the data preparation system and trains a corresponding prediction model based on one algorithm.

In this scheme, a separate training unit is configured for each algorithm, so that the algorithms do not interfere with one another, can run in parallel, and are processed more efficiently.

Further, each training unit comprises at least two training subunits and an integration subunit. Each training subunit extracts different data from the training data and trains on the same algorithm to obtain a corresponding module model, and the integration subunit integrates the module models obtained by the training subunits into the prediction model.

In this scheme, the prediction model under each algorithm is built on a modular approach: each module model is trained separately, which simplifies a complex model, reduces the amount of data each training subunit has to process, and helps ensure the stability and reliability of each training subunit.

Further, when the training subunit trains a module model based on the Logistic LASSO algorithm, the industry to which the enterprise belongs is used as an input characteristic variable of the module model.

When the training subunit trains a module model based on the GBDT algorithm, it trains separately by industry to obtain a module model for each industry.

In this scheme, the industry is either introduced as an input characteristic variable of the model or used to train a separate module model for each industry. The model is thereby targeted to the industry, the influence of industry differences on the prediction result can be eliminated, and the reliability of the prediction result is enhanced.

Further, when integrating all the predicted values into one total predicted value, the risk monitoring system weights the predicted value of each algorithm based on the AUC values of the different algorithms to obtain and output the total predicted value.

In this scheme, the predicted value of each algorithm is weighted based on its AUC value. Because the AUC value reflects the distinguishing capability of a model, the weights are allocated according to the distinguishing capability of each algorithm. This spreads the risk across algorithms while still accounting for how well each one discriminates, and improves the accuracy of the output.

The system of the invention supports automatic, real-time and accurate intelligent monitoring of the illegal fund-raising risk of financial-like enterprises. By spreading the prediction across algorithms, it reduces the dependence of the prediction model on any single algorithm and enhances the stability and reliability of the model.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope; a person skilled in the art may derive other related drawings from them without inventive effort.

FIG. 1 is a block diagram of an illegal funding risk monitoring system for a financial-like enterprise in an embodiment.

Fig. 2 is a block diagram showing the components of the data preparation system in the embodiment.

FIG. 3 is a block diagram of a model training system in an embodiment.

FIG. 4 is a flow chart of illegal funding risk monitoring for a financial-like enterprise in an embodiment.

Figs. 5a and 5b are schematic diagrams of the ROC-AUC curve of the Logistic LASSO algorithm model and the ROC-AUC curve of the GBDT algorithm model in the embodiment, respectively.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the figures may be arranged and designed in a wide variety of configurations. The following detailed description of the embodiments, as presented in the figures, is therefore not intended to limit the scope of the invention as claimed, but merely represents selected embodiments of the invention. All other embodiments that a person skilled in the art can obtain without inventive effort fall within the scope of the present invention.

Fig. 1 is a block diagram of an illegal fund-raising risk monitoring system for financial-like enterprises according to an embodiment of the present invention. As shown in Fig. 1, the system comprises a data preparation system, a model training system and a risk monitoring system. The model training system extracts the corresponding data from the data preparation system, performs model training and outputs prediction models. The risk monitoring system communicates with both the data preparation system and the model training system: it extracts monitoring data from the data preparation system, obtains the prediction models from the model training system, and then performs risk monitoring based on the monitoring data and the prediction models.

Specifically, the data preparation system is used for collecting training data required for model training and monitoring data required for target enterprise monitoring.

In a more complete implementation, the system may optionally further include an early warning system that communicates with the risk monitoring system, obtains the total predicted value from it, compares the total predicted value with a set risk threshold, and issues an early warning prompt when required, for example when the total predicted value is greater than or equal to the risk threshold. The early warning prompt can be implemented in various ways, for example by sending early warning information to a designated supervisor by e-mail or short message, or by giving an on-site warning with an audible and visual alarm signal.
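The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration: the `Alert` record, the function name `check_early_warning` and the example values are hypothetical and do not come from the patent, and delivery of the prompt (e-mail, short message, alarm) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical record describing one early warning prompt."""
    enterprise: str
    total_score: float
    threshold: float


def check_early_warning(enterprise: str, total_score: float,
                        threshold: float) -> Optional[Alert]:
    """Return an Alert when the total predicted value reaches the threshold."""
    if total_score >= threshold:
        return Alert(enterprise, total_score, threshold)
    return None


# Example values are made up for illustration.
alert = check_early_warning("Example Co", 82.5, 80.0)
if alert is not None:
    print(f"WARNING: {alert.enterprise} scored {alert.total_score} "
          f"(threshold {alert.threshold})")
```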

More specifically, the data preparation system collects the training data from a plurality of data sources based on a first keyword set, and collects the monitoring data from a plurality of data sources based on a second keyword set. The training data and the monitoring data are structured data; any unstructured or semi-structured data collected is first converted into structured data. The detailed components of the data preparation system are shown in Fig. 2 and are described in greater detail below.

As shown in fig. 3, the model training system includes at least two training units, each of which acquires the training data from the data preparation system and trains based on an algorithm to obtain a corresponding predictive model. In the test example, two algorithms were used for training, the Logistic LASSO algorithm and the GBDT algorithm, respectively, but it will be readily appreciated that many other classification algorithms could be used. Each training unit performs training of an algorithm model, so that mutual interference can be avoided, and the processing efficiency can be accelerated by parallel operation.

It should be appreciated that the data used for model training is the same for any algorithm, i.e., training of different algorithms is performed based on the same data to avoid differences in model discrimination capability due to differences in data.

As one implementation, the prediction model for illegal fund-raising risk can be trained directly on all the data at once, but this requires a large one-off computation and consumes a great deal of time. This embodiment therefore adopts another implementation based on a modular approach: the illegal fund-raising risk is divided into multiple risk types, each of which can be understood as a module; one module model is trained for each risk type; and the module models are finally integrated into one model, which is the prediction model under the algorithm.

As shown in Fig. 3, each training unit includes at least two training subunits and an integration subunit. Each training subunit extracts different data from the training data and trains on the same algorithm to obtain a corresponding module model, and the integration subunit integrates the module models trained by the training subunits into the prediction model.

By way of example, the illegal fund-raising risk may be split into 5 risk types: enterprise comprehensive strength risk, business behavior risk, integrity risk, judicial litigation risk and related-party risk. In that case each training unit includes 5 training subunits. This is only an example; the scheme allows a different number of risk types and different ways of classifying them.

It will be readily appreciated that each training subunit needs different data to train its module model; the data sets may overlap but are never identical. A data list therefore needs to be configured for each training subunit, and the training subunit extracts the corresponding data from the data preparation system based on that list.

After each training subunit has trained its module model, the module models are output to the integration subunit, which integrates them into one model. Integration is the process of assigning a weight to each module model. The simplest way is to set the weights manually, but objectivity cannot then be guaranteed. This embodiment therefore uses a genetic algorithm to optimize the weights and obtain the weight of each module model in the prediction model. The genetic algorithm itself is prior art; in this scheme, the output values of the module models are simply used as the input values of the genetic algorithm model for training, and the weight finally obtained for each input value is the weight of the corresponding module model. Because the genetic algorithm model is derived from the data relationship between input and output, weight optimization by genetic algorithm is more objective and the resulting weights are more accurate.

Assume that two algorithms, algorithm A and algorithm B, are used to generate prediction models. As shown in Fig. 4, when the illegal fund-raising risk monitoring system executes a risk monitoring task, the industry of the target enterprise and the preset characteristic variables are first taken as the input of each module model and a risk value as its output, and the module models of all modules are trained under algorithm A and under algorithm B respectively. For example, "score of module 1-A" in the figure denotes the score of module 1 under algorithm A (the risk value is expressed as a score). Next, the scores of the module models are integrated within each algorithm. For example, in the figure, score A = k1 × score of module 1-A + k2 × score of module 2-A, where k1 and k2 are the weights of module model 1 and module model 2 under algorithm A. Finally, the scores under the algorithms are integrated to obtain the final risk score of the target enterprise: the risk monitoring system weights the risk score (predicted value) of each algorithm based on the AUC values of the different algorithms, and obtains and outputs the final risk score (total predicted value) of the target enterprise. For example, in the model layer in the figure, risk score = ka × score A + kb × score B, where ka and kb are the weights of the prediction models under algorithm A and algorithm B respectively. The risk monitoring system configures ka and kb based on the AUC values of the different algorithms.
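The two-level fusion (module scores into an algorithm score, algorithm scores into the total risk score) can be sketched as follows. The text only says that ka and kb are configured based on AUC values; dividing each AUC by the sum of the AUCs is one plausible choice and is an assumption of this sketch. The module scores and module weights are made-up numbers; the AUC values 0.862 and 0.85 are the ones reported for the test example.

```python
def weighted_sum(scores, weights):
    """Fuse scores with the given weights (one weight per score)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


def auc_weights(auc_by_algorithm):
    """Assumed rule: each algorithm's weight is its share of the summed AUC."""
    total = sum(auc_by_algorithm.values())
    return {name: auc / total for name, auc in auc_by_algorithm.items()}


# Hypothetical module scores and module weights (k1, k2) per algorithm.
module_scores = {"A": [70.0, 55.0], "B": [64.0, 60.0]}
module_weights = {"A": [0.6, 0.4], "B": [0.5, 0.5]}
auc_by_algorithm = {"A": 0.862, "B": 0.85}

# Level 1: score A = k1 * module 1-A + k2 * module 2-A (same for B).
algo_scores = {name: weighted_sum(module_scores[name], module_weights[name])
               for name in module_scores}

# Level 2: risk score = ka * score A + kb * score B.
k = auc_weights(auc_by_algorithm)
risk_score = sum(k[name] * algo_scores[name] for name in algo_scores)
print(algo_scores, k, round(risk_score, 3))
```

With this rule, two algorithms with similar AUC values receive nearly equal weights, so the total score stays close to the average of the two algorithm scores.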

There are a number of classification algorithms that can be used for risk prediction, here schematically listing two preferred algorithms, the Logistic LASSO and GBDT algorithms.

This embodiment makes some improvements on the conventional approach. For example, the industry is added to the model as an input characteristic variable (i.e., an industry variable) of the Logistic LASSO algorithm. The models for all industries share the same independent and dependent variables, and a random effect model is constructed through the interaction of each independent variable with the industry variable, so that a single independent variable can take a different coefficient value in each industry. A unified regression model can thus be built instead of a separate model per industry, which saves degrees of freedom and makes use of more of the data.

In this embodiment, a specific modeling procedure of the Logistic LASSO algorithm is as follows:

1) Construct the industry variables of an enterprise. Assuming there are M industry classifications in total, if an enterprise belongs to the m-th industry, its industry variables are d_m = 1 and d_0 = ⋯ = d_(m-1) = d_(m+1) = ⋯ = d_M = 0, where m and M are positive integers and m ≤ M.

2) Put the preset characteristic variables and the industry variables into the model together and train the Logistic LASSO model module by module. That is, each training subunit puts several characteristic variables together with the industry variables into the model for training and obtains a corresponding module model. Because the module model obtained by each training subunit corresponds to one risk type, the number of characteristic variables and their values may differ between module models.

3) After each module model is trained, a weight matrix is obtained in which each industry variable corresponds to one group of index coefficients. Training thus yields the index coefficients for each industry, and these may differ between industries. For example, this embodiment uses 4 industries (M = 4). The industry variable is put into the model as a random-effect variable; during model calculation it automatically selects the data of the same industry, and the index coefficients of that industry are trained on those data. After the 4 industries are trained together (the training sample contains data from all industries), an n × 4 weight matrix is obtained (n is the number of characteristic variables of the module model), in which each column holds the index coefficients of one industry. All trained index coefficients are stored.

4) Build a Logistic model with the finally determined index coefficients to obtain a probability output. That is, for any new enterprise, multiply its standardized index values by the index coefficients corresponding to its industry to obtain the probability p that the enterprise carries the illegal fund-raising label in each module, and obtain the enterprise's total risk score with the formula score = p^0.2 × 100.

5) After the Logistic model of each risk module is completed, use a genetic algorithm to optimize the weights of the module prediction results, then weight and fuse the module results to obtain the enterprise's predicted value under the Logistic algorithm.
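Steps 1), 3) and 4) can be illustrated with a minimal Python sketch that builds the industry variables, picks the coefficient column of the enterprise's industry from an n × M weight matrix, and converts the resulting probability into a score. It makes several assumptions: industries are numbered 1..M, the score formula is read as score = p^0.2 × 100, the coefficient values are made up, and an optional intercept is added. Fitting the LASSO-penalized model itself is not shown.

```python
import math


def industry_dummies(m, num_industries):
    """Industry variables: 1 at the enterprise's industry m, 0 elsewhere.

    Industries are numbered 1..M here, which is an assumption of this sketch.
    """
    if not 1 <= m <= num_industries:
        raise ValueError("m must satisfy 1 <= m <= M")
    return [1 if j == m else 0 for j in range(1, num_industries + 1)]


def module_probability(x, d, coef, intercept=0.0):
    """Logistic probability for one module.

    x    : standardized index values (length n)
    d    : industry variables (length M)
    coef : n x M weight matrix, coef[i][j] = coefficient of index i in industry j
    Each index interacts with the industry variables, so only the column of
    the enterprise's own industry contributes.
    """
    z = intercept + sum(x[i] * coef[i][j] * d[j]
                        for i in range(len(x)) for j in range(len(d)))
    return 1.0 / (1.0 + math.exp(-z))


def score(p):
    """Assumed reading of the score formula: score = p^0.2 * 100."""
    return p ** 0.2 * 100


# Made-up 2 x 4 weight matrix (n = 2 indices, M = 4 industries).
coef = [[0.2, 0.8, -0.1, 0.3],
        [0.1, -0.4, 0.5, 0.0]]
d = industry_dummies(2, 4)                  # enterprise belongs to industry 2
p = module_probability([1.0, -0.5], d, coef)
print(d, round(p, 4), round(score(p), 2))
```

Under this reading, the exponent 0.2 stretches small probabilities upward, so low-probability enterprises are still spread over a readable score range.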

Training the model also involves setting the parameters of the genetic algorithm, such as the population size (population), the probability of gene exchange (cross_prob), the mutation coefficient (mutation_prob), the number of evolution iterations (generation) and the early-termination parameter (epsilon).

After parameter tuning, the model weights that give the best AUC and KS results are selected to construct the model.
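The weight optimization can be sketched with a small genetic algorithm in plain Python. The patent does not give the algorithm's details, so everything beyond the parameter names (population, cross_prob, mutation_prob, generation, epsilon) is an assumption: AUC alone is used as the fitness, weights are kept positive and normalized to sum to 1, selection keeps the better half, and epsilon is interpreted as the minimum improvement needed to reset a hypothetical `patience` counter. The data are synthetic.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fuse(weights, module_scores):
    """Weighted sum of the module scores of every enterprise."""
    return [sum(w * s for w, s in zip(weights, row)) for row in module_scores]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def optimize_weights(module_scores, labels, population=30, cross_prob=0.7,
                     mutation_prob=0.2, generation=40, epsilon=1e-4,
                     patience=5, seed=0):
    rng = random.Random(seed)
    n = len(module_scores[0])           # number of module models (>= 2)

    def fitness(w):
        return auc(labels, fuse(w, module_scores))

    pop = [normalize([rng.random() + 1e-9 for _ in range(n)])
           for _ in range(population)]
    best, best_fit, stale = pop[0], -1.0, 0
    for _ in range(generation):
        ranked = sorted(pop, key=fitness, reverse=True)
        top_fit = fitness(ranked[0])
        if top_fit > best_fit + epsilon:
            best, best_fit, stale = ranked[0], top_fit, 0
        else:
            stale += 1
            if stale >= patience:       # early termination
                break
        parents = ranked[:max(2, population // 2)]
        children = [ranked[0]]          # keep the current best (elitism)
        while len(children) < population:
            a, b = rng.sample(parents, 2)
            child = list(a)
            if rng.random() < cross_prob:       # single-point crossover
                cut = rng.randrange(1, n)
                child = a[:cut] + b[cut:]
            if rng.random() < mutation_prob:    # perturb one weight
                i = rng.randrange(n)
                child[i] = max(1e-9, child[i] + rng.gauss(0.0, 0.1))
            children.append(normalize(child))
        pop = children
    return best, best_fit


# Synthetic data: module 0 is informative, module 1 is pure noise.
data_rng = random.Random(1)
labels = [1 if i % 4 == 0 else 0 for i in range(80)]
module_scores = [[y * 0.6 + data_rng.random() * 0.6, data_rng.random()]
                 for y in labels]
weights, fit = optimize_weights(module_scores, labels)
print([round(w, 3) for w in weights], round(fit, 3))
```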

AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes. It is a general-purpose index for evaluating classification models and usually takes a value in (0.5, 1). The closer the AUC is to 1, the better the model separates good samples from bad samples overall. In general, an AUC above 0.6 indicates that the model has some distinguishing capability, and an AUC above 0.7 is considered to indicate good distinguishing capability.
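For concreteness, the AUC and the KS index mentioned above can both be computed from the ROC points. The sketch below uses the standard definitions (trapezoidal area under the ROC curve, and KS as the largest gap between the true positive rate and the false positive rate); the labels and scores are made-up toy data, with label 1 marking a bad sample.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the decision threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks(points):
    """KS statistic: largest gap between TPR and FPR."""
    return max(tpr - fpr for fpr, tpr in points)


labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
pts = roc_points(labels, scores)
print(round(auc(pts), 4), round(ks(pts), 4))
```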

Referring to Fig. 5a, in the test example the AUC value of the finally selected model is 0.862, indicating a good model fit.

For the GBDT algorithm in this embodiment, the training subunit trains separately by industry to obtain a module model for each industry; in application, the module model of the industry to which the target enterprise belongs is called. This eliminates inaccuracy in the prediction results caused by industry differences. For each model, the GBDT model was fitted within each industry using all the black samples, with a black-to-white sample ratio of about 1:30. The following parameters were tuned with GridSearchCV in scikit-learn to fit the GBDT model: n_estimators (number of trees in the final fitted model), learning_rate (learning rate), min_samples_split (minimum number of samples required to split a node), min_samples_leaf (minimum number of samples in each leaf node), max_depth (maximum depth of each tree), max_features (maximum number of features considered when searching for a split), subsample (proportion of samples randomly selected when fitting each tree) and random_state (random seed). After tuning, the model with the best AUC and KS results is selected as the current best model.
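A per-industry GBDT fit with GridSearchCV might look like the sketch below. It is not the patent's actual configuration: the data are synthetic (with roughly a 1:9 class ratio rather than 1:30, to keep the example fast), the grid is deliberately tiny, the industry labels are only examples, and only AUC is used for model selection. The helper name `fit_industry_models` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Deliberately small grid; a real search would cover more values.
PARAM_GRID = {
    "n_estimators": [30, 60],
    "max_depth": [2, 3],
}


def fit_industry_models(X, y, industry):
    """Train one tuned GBDT module model per industry."""
    models = {}
    for name in np.unique(industry):
        mask = industry == name
        base = GradientBoostingClassifier(
            learning_rate=0.1,
            min_samples_split=10,
            min_samples_leaf=5,
            max_features="sqrt",
            subsample=0.8,
            random_state=0,
        )
        search = GridSearchCV(base, PARAM_GRID, scoring="roc_auc", cv=3)
        search.fit(X[mask], y[mask])
        models[str(name)] = search.best_estimator_
    return models


# Synthetic, imbalanced data split across two example industries.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9],
                           random_state=0)
industry = np.where(np.arange(600) % 2 == 0, "small loan", "private fund")
models = fit_industry_models(X, y, industry)

# At monitoring time, call the model of the target enterprise's industry.
risk = models["small loan"].predict_proba(X[:1])[0, 1]
print(sorted(models), round(float(risk), 4))
```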

Referring to Fig. 5b, the AUC value obtained in the test is 0.85, which also indicates a good model fit.

In the Logistic LASSO algorithm the industry is input to the model as a characteristic variable, whereas with GBDT the model is trained separately by industry to obtain a module model for each industry. Either way, model training is performed per module and per industry, so the influence of industry differences on the model output can be eliminated and the reliability of the module models is improved.

Referring to Fig. 2, the data preparation system includes a data collection unit, a data analysis unit and a data processing unit. The data collection unit collects related structured, semi-structured and unstructured data from various data sources based on a predefined first keyword set to form a first data set. The data analysis unit extracts characteristic information from the first data set according to the keywords to form a first structured data set, which is stored in a database. The data processing unit screens the structured data in the first structured data set based on set screening conditions to form an effective structured sample data set. The characteristic information contained in each piece of structured data in the structured sample data set is the data value of a characteristic variable input to the model during model training.

The data sources include web pages, internet text, data published by the competent authorities, public opinion databases, other business databases, association knowledge graphs, and the like. The data may be obtained by web crawlers, provided directly by the corresponding institutions, and so on.

In more detail, the data collection unit is configured to collect structured data, semi-structured data and unstructured data related to keywords contained in the first keyword set from various data sources based on a predefined first keyword set, the collected structured data, semi-structured data and unstructured data forming the first data set. The data analysis unit extracts characteristic information from the first data set according to the keywords, one piece of characteristic information forms one piece of structured data, a plurality of pieces of structured data form the first structured data set, and the first structured data set is stored in the database. The data processing unit performs data screening on the structured data in the first structured data set mainly based on the set screening conditions to form an effective structured sample data set. The structured sample data set can be randomly divided into a training sample set and a verification sample set according to a certain proportion during model training.

In one embodiment, the predefined first keyword set includes keyword sets such as an enterprise name keyword set, an enterprise type keyword set, an enterprise business scope keyword set, an APP name keyword set and a specific list keyword set. Enterprise names and related information are searched according to these keyword sets and stored in the first data table.

For example, it is determined whether the enterprise name contains the keyword "small loan"; if so, the enterprise name is stored in the first data table and the industry of the enterprise is recorded as "small loan". As another example, if the business scope of an enterprise contains key fields such as "real estate mortgage pawn" or "property right mortgage pawn", the enterprise name is stored in the first data table and its industry is recorded as "pawn". As a further example, for private fund enterprises crawled from the website of the China Securities Investment Fund Association (http://gs.amac.orc.cn/), the names in the enterprise list are stored in the first data table and the industry of each enterprise is recorded as "private fund".
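The keyword matching that fills the first data table can be sketched as a simple rule lookup. The rule format, the function name `build_first_table`, the sample enterprises and the "factoring" scope rule are all hypothetical; only the "small loan" name keyword and the "private fund" list label follow the examples in the text.

```python
def build_first_table(enterprises, name_rules, scope_rules, listed):
    """Return rows of (enterprise name, industry) for matched enterprises.

    name_rules / scope_rules : lists of (keyword, industry label)
    listed                   : {industry label: set of names from a specific list}
    """
    table = []
    for ent in enterprises:
        industry = None
        for keyword, label in name_rules:          # match on enterprise name
            if keyword in ent["name"]:
                industry = label
                break
        if industry is None:
            for keyword, label in scope_rules:     # match on business scope
                if keyword in ent.get("scope", ""):
                    industry = label
                    break
        if industry is None:
            for label, names in listed.items():    # match on a specific list
                if ent["name"] in names:
                    industry = label
                    break
        if industry is not None:
            table.append({"name": ent["name"], "industry": industry})
    return table


name_rules = [("small loan", "small loan")]
scope_rules = [("factoring", "commercial factoring")]      # hypothetical rule
listed = {"private fund": {"Gamma Capital"}}               # e.g. a crawled list

enterprises = [
    {"name": "Alpha small loan Co", "scope": "lending"},
    {"name": "Beta Trading", "scope": "commercial factoring services"},
    {"name": "Gamma Capital", "scope": "asset management"},
    {"name": "Delta Bakery", "scope": "food retail"},
]
first_table = build_first_table(enterprises, name_rules, scope_rules, listed)
print(first_table)
```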

Using the enterprise list recorded in the first data table as the search condition, a second search is performed to obtain a second data table, which contains the associated subjects of each enterprise and their association relations. According to how closely they are related to the enterprise, the associated subjects are divided into the following subject types: enterprise entity, first circle, second circle, history dimension and hidden related party. The enterprise entity is mainly the enterprise itself and its legal representative. The first circle comprises the first-degree related parties most closely connected with the enterprise, including its shareholders, branches, the companies in which it invests, and its directors, supervisors and senior managers. The second circle comprises the enterprise's current second-degree and third-degree related parties. The history dimension comprises enterprises and natural persons that once had an investment or financing relation with the target company but have since exited, including those at the second-circle level. Hidden related parties are related parties, beyond the registered business relations between the enterprise and other enterprises or natural persons, that have not been disclosed. The hidden related party model mines multi-source data to extract hidden relations between enterprises such as a shared contact method, a shared office address, bidding competition, and guarantees or shared debts, and through this model the core controller of key enterprises can be identified. The acquired information is stored in the second data table.

Using the enterprise list of the first data table together with the associated subjects and association relations recorded in the second data table as the third search condition, a further search is performed to obtain a third data table, which contains the enterprise events of the related subjects. The retrieved enterprise events fall into 3 types: enterprise red list events, enterprise gray list events and enterprise blacklist events. A "red list event" is an event reflecting good credit performance, and includes information such as stock listing, qualifications and administrative licenses. A "blacklist event" is an event reflecting poor credit, such as the enterprise's involvement in violations of laws and regulations, and includes information such as administrative penalties and dishonest conduct. A "gray list event" lies between the red list and the blacklist, such as changes to enterprise information and enterprise recruitment. The obtained information is stored in the third data table. Finally, association relations are established among the first, second and third data tables to obtain the first data set.

The first structured data set stores information on a plurality of financial-like enterprises, including enterprise information, enterprise website information, the industry of each enterprise, enterprise event information, and the like. Because the structured data in the first structured data set may contain extreme, abnormal or incomplete values, the data processing unit screens all the structured data to improve the stability of risk monitoring, rejecting redundant data with low correlation and retaining the data with the highest stability and predictive power. When the data analysis unit extracts characteristic information according to the keywords, it does so based on preliminarily set characteristic variables; the characteristic information in the structured data retained by the data processing unit is the data value of the characteristic variables input to the model during model training.

In one embodiment, the data processing unit may calculate the IV value of each single variable of the structured data to evaluate how well that characteristic variable distinguishes good samples from bad samples. The IV value reflects the difference between the proportions of good and bad samples across the value groups of the variable; a larger IV value indicates a stronger ability to distinguish good samples from bad ones. By way of example, the IV value may be interpreted as follows: for IV in [0.02, 0.1) the characteristic variable has weak distinguishing ability; for IV in [0.1, 0.3) it has medium distinguishing ability; and for IV greater than or equal to 0.3 it has strong distinguishing ability.
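The IV calculation can be sketched as follows, using the commonly used definition IV = Σ (good share − bad share) × ln(good share / bad share) over the value groups of a variable. The patent does not spell out the formula, so this definition, the additive smoothing term `eps` (to avoid division by zero in empty groups) and the "negligible" label for IV below 0.02 are assumptions of the sketch; the other thresholds follow the text.

```python
import math
from collections import Counter


def information_value(groups, labels, eps=0.5):
    """IV of one variable; groups[i] is the value group of sample i,
    labels[i] is 1 for a bad sample and 0 for a good sample."""
    good, bad = Counter(), Counter()
    for g, y in zip(groups, labels):
        (bad if y == 1 else good)[g] += 1
    names = set(good) | set(bad)
    n_good = sum(good.values()) + eps * len(names)
    n_bad = sum(bad.values()) + eps * len(names)
    iv = 0.0
    for g in names:
        p_good = (good[g] + eps) / n_good    # smoothed share of good samples
        p_bad = (bad[g] + eps) / n_bad       # smoothed share of bad samples
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


def iv_strength(iv):
    """Map an IV value to the distinguishing-ability bands."""
    if iv >= 0.3:
        return "strong"
    if iv >= 0.1:
        return "medium"
    if iv >= 0.02:
        return "weak"
    return "negligible"    # below 0.02: label assumed, not given in the text


# Toy variable: the "hi" group is dominated by bad samples.
groups = ["lo"] * 8 + ["hi"] * 8
labels = [0] * 7 + [1] + [1] * 7 + [0]
iv = information_value(groups, labels)
print(round(iv, 3), iv_strength(iv))
```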

The data preparation system prepares both the sample data used for model training and the monitoring data for a target enterprise. Accordingly, the data collection unit is also used for collecting, from various data sources and based on a preset second keyword set, the structured data, semi-structured data and unstructured data related to the second keywords in that set, to form a second data set; the data analysis unit is also used for extracting feature information from the second data set according to the keywords to form a second structured data set, which is stored in a database; and the data processing unit also screens the structured data in the second structured data set based on the structured sample data in the structured sample data set, to form an effective structured monitoring data set. In other words, when the structured monitoring data set is formed, the data are screened not according to the set screening conditions but according to the data used in model training, so that the structured monitoring data set contains the data values of the feature variables required by each prediction model.
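As a rough illustration of screening monitoring data against the data used in training, the sketch below keeps only the feature variables that the training samples contain and reports any that are absent from the monitoring record. The function name, the feature names and the record layout are hypothetical.

```python
def screen_monitoring_record(record, training_features):
    """Keep only the feature variables that were used in model training.

    Returns the retained values and the list of training features that the
    monitoring record does not provide.
    """
    kept = {name: record[name] for name in training_features if name in record}
    missing = [name for name in training_features if name not in record]
    return kept, missing


# Hypothetical feature variables retained in the structured sample data set.
training_features = ["black_list_events", "related_subject_count", "industry"]

# Hypothetical structured record of a target enterprise.
target_record = {
    "black_list_events": 2,
    "related_subject_count": 7,
    "registered_capital": 5000000,  # not used in training, so it is dropped
}

monitoring_row, missing_features = screen_monitoring_record(
    target_record, training_features
)
```

How a missing feature value should be handled (for example, by a default value or by re-collecting data) is not specified in the text and is left open here.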

The second keywords may include the enterprise name of the target enterprise (i.e., the monitored enterprise) and associated fields; the resulting second structured data set is a complete combination of the various subjects and events associated with that enterprise.

The second data set is formed in the same manner as the first data set, except that the first data set includes related data of all industries and all enterprises, and the second data set includes only data of the target enterprise and related subjects and related events thereof.

The above description of the data preparation system gives only an example of how the training data and the monitoring data may be prepared; different processing methods may be used for different applications or by different practitioners.

The risk monitoring system is described here in the context of illegal fund-collecting risk for financial-like enterprises, but it is readily understood that the system can also be applied to other applications in the field of risk monitoring.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Those of ordinary skill in the art will appreciate that the modules of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the constituent modules and steps of the examples have been described generally in terms of functionality in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in this application, it should be understood that the disclosed system may be implemented in other ways. The system embodiments described above are merely illustrative: for example, the division into modules is merely a division by logical function, and other divisions are possible in an actual implementation; multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive of falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An illegal fund-collecting risk monitoring system for financial-like enterprises, characterized by comprising a data preparation system, a model training system and a risk monitoring system; wherein,

the data preparation system collects the training data from a plurality of data sources based on a first set of keywords; and/or, collecting the monitoring data from a plurality of data sources based on a second set of keywords;

the data preparation system comprises a data collection unit, a data analysis unit and a data processing unit, wherein the data collection unit is used for collecting related structured data, semi-structured data and unstructured data from various data sources based on a predefined first keyword set to form a first data set; the data analysis unit is used for extracting characteristic information from the first data set according to the keywords to form a first structured data set, and storing the first structured data set in a database; the data processing unit screens the structured data in the first structured data set based on the set screening conditions to form a structured sample data set;

the predefined first keyword set comprises an enterprise name keyword set, an enterprise type keyword set, a keyword set of an enterprise operation range, an APP name keyword set and a specific list keyword set, and enterprise names and related information are searched and acquired according to the set keyword set and stored in a first data table; searching again to obtain a second data table according to the obtained enterprise list record of the first data table as a search condition, wherein the second data table comprises an association subject and an association relation related to enterprises; searching again to obtain a third data table according to the obtained enterprise list of the first data table and the association subject and association relation list of the enterprises recorded by the second data table as third searching conditions, wherein the third data table comprises related subject enterprise events; establishing an association relationship among a first data table, a second data table and a third data table to obtain the first data set; forming a second data set in the same way, wherein the second data set only comprises data of the target enterprise, the related subject and the related event;

the model training system comprises at least two training units, wherein each training unit acquires the training data from the data preparation system and trains based on an algorithm to obtain a corresponding prediction model;

each training unit comprises at least two training subunits and an integrated subunit, each training subunit extracts different data from the training data and trains based on the same algorithm to obtain a corresponding module model, and the integrated subunit integrates the module model obtained by each training subunit into the prediction model;

2. The illegal fund-collecting risk monitoring system for financial-like enterprises of claim 1, wherein when the training subunit trains based on the Logistic LASSO algorithm to obtain the module model, the industry to which the enterprise belongs is used as an input feature variable of the module model.

3. The illegal fund-collecting risk monitoring system for financial-like enterprises of claim 1, wherein when the training subunit trains based on the GBDT algorithm to obtain the module model, training is performed separately by industry to obtain a module model corresponding to each industry.

4. The illegal fund-collecting risk monitoring system for financial-like enterprises of claim 1, wherein when the integration subunit integrates the module model obtained by each training subunit into the prediction model, a genetic algorithm is used for weight optimization to obtain the weight of each module model in the prediction model.

5. The illegal fund-collecting risk monitoring system for financial-like enterprises of claim 1, wherein the risk monitoring system, when integrating all of the predicted values into one total predicted value for output, weights the predicted value of each algorithm based on the AUC values of the different algorithms to obtain and output the total predicted value.

6. The illegal fund-collecting risk monitoring system for financial-like enterprises of claim 1, wherein the training data and/or the monitoring data is structured data.