CN116468536A

CN116468536A - Automatic risk control rule generation method

Info

Publication number: CN116468536A
Application number: CN202310334618.0A
Authority: CN
Inventors: 林日英; 于溦; 董菲
Original assignee: Guangzhou Xinjing Information Technology Service Co ltd
Current assignee: Guangzhou Xinjing Information Technology Service Co ltd
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-07-21

Abstract

The invention discloses a method for automatically generating risk control rules, comprising the following steps: acquiring data; establishing a rule generation framework; automatically generating a target variable library; automatically generating a feature variable library; selecting features by combining multiple algorithms; and automatically generating rules. The method generates loan risk rules automatically. Because rule generation depends closely on the definition of the target variables and the content of the feature variables, the target variable library and the feature variable library are both generated automatically, and the final risk control rules are then obtained efficiently and quickly through automatic rule formation and rule optimization, thereby providing risk monitoring information for the in-loan and post-loan stages.

Description

Automatic risk control rule generation method

Technical Field

The invention relates to the technical field of risk prevention and control for financial products, and in particular to a method for automatically generating risk control rules.

Background

In industrial supply chain finance, the main financing clients are manufacturing enterprises, and the most common way of monitoring post-loan risk is the post-loan on-site visit. At present, few enterprises in China have credit records available for credit investigation, and the records of most enterprises are thin, so a credit profile with reference value cannot be formed. A supply chain finance loan monitoring platform is therefore usually built on the ERP data of core enterprises and the data of a credit management system: it centrally integrates and shares scattered operating data and loan-related information, tracks and monitors clients, detects potential risks with analysis tools, monitors and analyzes each borrower's operations, credit extension and use of funds, sends early warnings on risk points to the relevant business departments, and so provides a basis for pre-loan approval decisions and post-loan management. In the traditional risk management process for bank loan clients, enterprises differ greatly in scale and operating and other non-business information lags, so a basis for quantitative comparison is lacking; in addition, the relevant information is analyzed and discussed mainly by hand, which takes a long time, so banks cannot manage risk accurately and quickly.

Disclosure of Invention

The invention aims to provide a method for automatically generating risk control rules that supplies risk monitoring information for the in-loan and post-loan stages, so as to solve the problems described in the Background.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A method for automatically generating risk control rules, comprising the following steps:

S1: acquiring data, including internal and external data of the enterprise;

S2: establishing a framework for rule generation: automatic rule generation comprises determining the target variable, forming a feature variable library, joining the target variable with the feature variable library to form a wide table, selecting training data for analysis and generating rules by binning, and then verifying the validity of the rules with data from a validation period;

S3: automatically generating a target variable library: the target variable identifies overdue clients, and the target variable of a rule is determined by the granularity of analysis, which comprises the client level, the bill level and the individual-loan level; for each granularity, target variables are generated in a loop from the template "first-N-periods-overdue-M-days", based on the repayment period and the number of days overdue; all loan disbursement records as of the current day, including the repayment-period and days-overdue fields, are then traversed daily, the value of each target variable is computed according to its definition, and the values are updated incrementally each day;

S4: automatically generating a feature variable library: the raw data of each topic as of the current day is traversed, and feature variables for the different topics are derived automatically using the window-variable statistical derivation technique and the feature-combination derivation technique; with the target variable library as the main table, the feature variable library is joined to it automatically to form the feature wide table used for rule generation, namely x1, ..., xn and the target value Y;

S5: selecting features by combining multiple algorithms: features are selected by combining index analysis with several algorithms;

S6: automatically generating rules: based on the feature set screened by index analysis and the multi-algorithm models, univariate and multivariate rules are designed using chi-square binning and decision-tree binning.

Further, the internal data of the enterprise in S1 includes basic enterprise information, enterprise transaction information, IoT production information from enterprise equipment, and enterprise financial information; the external data includes enterprise business information, enterprise financial information, enterprise judicial information, and information on the enterprise's borrowing-intention behavior.

Further, the target variable in S3 is specifically defined as follows:

bad client: days overdue in the first N periods > M days, recorded as 1;

good client: days overdue in the first N periods = 0, recorded as 0;

intermediate client: days overdue in the first N periods between 0 and M days, recorded as -1.
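As an illustration only, the three-way labeling and the looped template can be sketched in Python. The function names `label_client` and `build_target_library`, the variable-name format, and the reading of "days overdue in the first N periods" as the maximum days overdue in any of those periods are assumptions made for this sketch; the method itself does not specify them.

```python
def label_client(overdue_days_by_period, n_periods, m_days):
    """Label one repayment history under the first-N-periods / M-days template.

    Assumption: "days overdue in the first N periods" is taken as the
    maximum days overdue observed in any of the first N periods.
    """
    worst = max(overdue_days_by_period[:n_periods], default=0)
    if worst > m_days:
        return 1       # bad client
    if worst == 0:
        return 0       # good client
    return -1          # intermediate client


def build_target_library(history, n_values, m_values):
    """Loop over the template to produce one target variable per (N, M) pair."""
    return {
        f"first_{n}_periods_overdue_{m}_days": label_client(history, n, m)
        for n in n_values
        for m in m_values
    }


# Hypothetical history: days overdue in each repayment period.
history = [0, 2, 0, 35]
targets = build_target_library(history, n_values=[3, 4], m_values=[1, 30])
```

The same history can be bad under one (N, M) pair and intermediate under another, which is why the library keeps every combination as a separate target variable.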

Further, features are constructed in S4 using several techniques, as follows:

window-variable statistical derivation: features are constructed from the template "last-N-(time unit)-(action)-(item)-(statistic)";

feature-combination derivation: two or more categorical attributes are combined into one through an operation; the operations include the four arithmetic operations (addition, subtraction, multiplication and division) and the logical operations AND, OR and NOT;

category-decomposition derivation: a feature is converted into dummy-variable features by testing its value;

numeric-reconstruction derivation: the integer part is separated from the fractional part, and staged statistical features are constructed.
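The window-variable template lends itself to a loop over its slots. The following Python sketch covers only that first technique; the record layout (`date`, `action`, `amount` keys), the supported statistics and the feature-name format are hypothetical choices for illustration.

```python
from datetime import date, timedelta


def window_feature(events, as_of, n_days, action, item, statistic):
    """Compute one "last-N-days / action / item / statistic" feature."""
    start = as_of - timedelta(days=n_days)
    values = [
        e[item]
        for e in events
        if e["action"] == action and start < e["date"] <= as_of
    ]
    if statistic == "count":
        return len(values)
    if statistic == "sum":
        return sum(values)
    if statistic == "max":
        return max(values, default=0)
    if statistic == "mean":
        return sum(values) / len(values) if values else 0.0
    raise ValueError(f"unsupported statistic: {statistic}")


def derive_window_features(events, as_of, windows, actions, items, statistics):
    """Loop over the template slots and name each feature after them."""
    return {
        f"last_{n}_day_{a}_{i}_{s}": window_feature(events, as_of, n, a, i, s)
        for n in windows
        for a in actions
        for i in items
        for s in statistics
    }


# Hypothetical transaction records for one enterprise.
events = [
    {"date": date(2023, 3, 1), "action": "payment", "amount": 100.0},
    {"date": date(2023, 3, 20), "action": "payment", "amount": 250.0},
    {"date": date(2023, 3, 25), "action": "refund", "amount": 40.0},
]
features = derive_window_features(
    events,
    as_of=date(2023, 3, 30),
    windows=[7, 30],
    actions=["payment"],
    items=["amount"],
    statistics=["count", "sum"],
)
```

Because each feature name encodes its template slots, adding a new window length or statistic enlarges the feature library without any hand-written feature code.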

Further, in S4 the update cycle of the feature variables differs by topic: basic information is refreshed in full every day, whereas features that change daily, such as transaction information, financial information and production information, are updated incrementally every day.

Further, the specific method in S5 is as follows:

S501: the data of the feature wide table is processed and divided into a training set and a validation set according to the training window and the validation window;

S502: index analysis and machine-learning modeling are performed on the training set in order to screen out features that are both effective and stable.
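The window-based split in S501 can be sketched as follows. The row layout, the `obs_date` key and the requirement that the validation window start after the training window ends are assumptions for this sketch; the method only states that the data is divided according to a training window and a validation window.

```python
from datetime import date


def split_by_window(rows, train_window, validation_window, date_key="obs_date"):
    """Split wide-table rows into training and validation sets by date window."""
    train_start, train_end = train_window
    valid_start, valid_end = validation_window
    if train_end >= valid_start:
        # Assumption: the validation period lies strictly after training.
        raise ValueError("validation window must start after training ends")
    train = [r for r in rows if train_start <= r[date_key] <= train_end]
    valid = [r for r in rows if valid_start <= r[date_key] <= valid_end]
    return train, valid


# Hypothetical wide-table rows: one feature x1 and the target y.
rows = [
    {"obs_date": date(2022, 10, 15), "x1": 3, "y": 0},
    {"obs_date": date(2022, 12, 1), "x1": 7, "y": 1},
    {"obs_date": date(2023, 1, 20), "x1": 5, "y": 0},
]
train_set, validation_set = split_by_window(
    rows,
    train_window=(date(2022, 10, 1), date(2022, 12, 31)),
    validation_window=(date(2023, 1, 1), date(2023, 3, 31)),
)
```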

Still further, the index analysis is as follows: univariate analysis is performed on the feature data of the training set; the indexes analyzed comprise the IV value, KS value, GINI coefficient, information entropy and PSI stability coefficient; effectiveness and stability are evaluated comprehensively from these indexes, and the features are screened accordingly. Specifically:

IV value: evaluates the predictive power of a variable and can be used to screen variables quickly; a variable with IV > 0.02 is defined as having predictive effect:

$$\mathrm{IV} = \sum_i \left( \frac{Bad_i}{Bad_T} - \frac{Good_i}{Good_T} \right) \ln\!\left( \frac{Bad_i / Bad_T}{Good_i / Good_T} \right)$$

where Bad_i is the number of bad clients in bin i; Bad_T is the total number of bad clients; Good_i is the number of good clients in bin i; Good_T is the total number of good clients;
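A minimal Python sketch of the IV screen, assuming the standard information-value definition over per-bin good and bad counts. The function name and the example counts are hypothetical, and the sketch rejects bins with a zero count rather than smoothing them, which is a simplification.

```python
import math


def information_value(bad_counts, good_counts):
    """Information value of one binned variable from per-bin client counts."""
    bad_total = sum(bad_counts)
    good_total = sum(good_counts)
    iv = 0.0
    for bad_i, good_i in zip(bad_counts, good_counts):
        if bad_i == 0 or good_i == 0:
            # Simplification: no smoothing for empty bins in this sketch.
            raise ValueError("each bin needs at least one good and one bad client")
        bad_share = bad_i / bad_total
        good_share = good_i / good_total
        iv += (bad_share - good_share) * math.log(bad_share / good_share)
    return iv


# Hypothetical two-bin variable.
iv = information_value(bad_counts=[10, 30], good_counts=[40, 20])
keep_variable = iv > 0.02  # threshold stated in the text
```

Each term is non-negative, so IV is zero only when the good and bad clients are distributed identically across the bins.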

KS value: measures the degree to which a variable separates positive from negative clients; the larger the KS value, the better the variable distinguishes bad clients. KS ranges from 0 to 1, and a variable with KS > 0.2 is defined as having good discriminating power:

$$\mathrm{KS} = \max(\mathrm{TPR} - \mathrm{FPR})$$

where TPR is the true positive rate, equal to the number of clients that are actually positive and predicted positive divided by the number of clients that are actually positive; FPR is the false positive rate, equal to the number of clients that are actually negative but predicted positive divided by the number of clients that are actually negative;
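The KS value can be computed by scanning every candidate threshold and keeping the largest gap between TPR and FPR. This Python sketch assumes that a higher variable value means "more likely positive (bad)" and that labels are 1 for positive and 0 for negative; both conventions, the function name and the example data are assumptions.

```python
def ks_statistic(scores, labels):
    """Maximum of TPR - FPR over all thresholds of the form score >= t."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        best = max(best, true_pos / n_pos - false_pos / n_neg)
    return best


# Hypothetical variable values and overdue labels.
ks_mixed = ks_statistic([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ks_perfect = ks_statistic([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

The quadratic scan keeps the sketch short; a production version would sort once and accumulate counts.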

GINI coefficient: represents the probability that a sample drawn at random from the sample set is misclassified; the smaller the GINI index, the lower that probability, i.e. the purer the set, and conversely, the larger the index, the less pure the set:

$$\mathrm{GINI}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

where p_k is the probability that the selected sample belongs to class k, so the probability that it is misclassified is (1 - p_k); the sample set has K classes and a randomly selected sample may belong to any of them, hence the sum over classes; for two classes, GINI(p) = 2p(1 - p);

information entropy: used for feature selection; it measures the uncertainty of an outcome, and the smaller the information entropy, the less uncertain the outcome:

$$H = -\sum_i p_i \log p_i$$

where p_i is the probability of class i;
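Both purity measures depend only on the class probabilities. The Python sketch below assumes the standard definitions (GINI as the sum of p_k(1 - p_k), Shannon entropy as the negative sum of p_i log p_i); the base-2 logarithm is an assumption, since the text does not state a base.

```python
import math


def gini(probabilities):
    """Probability that a randomly drawn sample is misclassified."""
    return sum(p * (1.0 - p) for p in probabilities)


def entropy(probabilities, base=2.0):
    """Shannon entropy; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


# Two-class case with a hypothetical bad rate p.
p = 0.3
gini_two_class = gini([p, 1.0 - p])
entropy_even = entropy([0.5, 0.5])
```

For two classes the GINI value reduces to 2p(1 - p), the special case given in the text, and both measures are zero for a pure set.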

PSI stability coefficient: PSI measures the stability of a variable; the smaller the PSI value, the smaller the difference between the two distributions and the more stable the variable. When PSI is less than 0.1, the variable is highly stable; when PSI is between 0.1 and 0.25, stability is moderate; when PSI is greater than 0.25, stability is poor and the variable is not recommended for selection:

$$\mathrm{PSI} = \sum_i \left( \frac{Actual_i}{Actual_T} - \frac{Expect_i}{Expect_T} \right) \ln\!\left( \frac{Actual_i / Actual_T}{Expect_i / Expect_T} \right)$$

where Actual_i is the number of clients whose variable takes attribute value i in the first period; Actual_T is the total number of clients for the variable in the first period; Expect_i is the number of clients whose variable takes attribute value i in the second period; Expect_T is the total number of clients for the variable in the second period.
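The PSI bands stated above translate directly into a small check. This Python sketch assumes the standard PSI definition over per-bin client counts from two periods; the function names, the band labels and the example counts are hypothetical.

```python
import math


def psi(actual_counts, expected_counts):
    """Population stability index between two binned distributions."""
    actual_total = sum(actual_counts)
    expected_total = sum(expected_counts)
    value = 0.0
    for actual_i, expected_i in zip(actual_counts, expected_counts):
        a = actual_i / actual_total
        e = expected_i / expected_total
        value += (a - e) * math.log(a / e)  # assumes no empty bins
    return value


def stability_band(psi_value):
    """Map a PSI value to the three bands given in the text."""
    if psi_value < 0.1:
        return "high"
    if psi_value <= 0.25:
        return "moderate"
    return "poor"


# Hypothetical bin counts for the first and second periods.
shifted = psi(actual_counts=[30, 70], expected_counts=[50, 50])
band = stability_band(shifted)
```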

Further, the machine-learning modeling and training process in S502 is as follows:

model one: when the XGBoost model is used for classification, the feature data of the training set is used as input; the learning rate, the maximum tree depth, the per-tree sample subsampling rate and the regularization coefficients are set, and a classification result is obtained for each sample; the performance of the model is then evaluated, the feature importance of the model's variables is obtained, and features are screened according to that importance;

model two: for tabular data, the TabNet model combines characteristics of tree structures and neural networks; it uses a sequential attention mechanism to select, at each decision step, a subset of semantically meaningful features for processing, that is, it performs feature selection and feature processing at every decision step;

fusion of index analysis with the models: on the training samples, let X1 be the feature set screened by index analysis, X2 the feature set screened by the XGBoost model, and X3 the feature set screened by the TabNet model; to improve the effectiveness of the rules, the methods are fused by taking the intersection of these sets to obtain the final feature set X.
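The fusion step is a plain set intersection. In this Python sketch the feature names and the contents of X1, X2 and X3 are hypothetical; how each set is screened (index thresholds, importance cut-offs) is outside the sketch.

```python
def fuse_feature_sets(*feature_sets):
    """Keep only the features selected by every screening method."""
    if not feature_sets:
        return set()
    return set.intersection(*(set(s) for s in feature_sets))


# Hypothetical screening results.
x1 = {"last_30_day_payment_amount_sum", "debt_ratio", "overdue_count"}  # index analysis
x2 = {"debt_ratio", "overdue_count", "device_uptime"}                   # XGBoost importance
x3 = {"overdue_count", "debt_ratio", "lawsuit_count"}                   # TabNet importance
x = fuse_feature_sets(x1, x2, x3)
```

Intersection is the strictest way to fuse the three sets: a feature survives only if index analysis and both models agree on it, which trades coverage for robustness.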

Still further, the univariate rules in S6 are generated as follows: the final feature set from S4 is selected from the training-set data and subjected to chi-square binning and decision-tree binning; the rules are optimized by measuring the lift, recall, precision and hit rate, and the rules whose chi-square and decision-tree binning results meet the conditions are retained; the same bin boundaries are then applied to the validation-set data, the lift, recall, precision and hit rate of each rule are computed on the validation set, and finally the rule content and its training-set performance are joined with its validation-set performance and output.
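The four rule metrics can be sketched in Python as follows. The text does not define them, so the definitions here are assumptions: hit rate is the share of samples the rule fires on, precision is the bad rate among hits, recall is the share of all bad clients that are hit, and lift is precision divided by the overall bad rate. Labels are assumed to be 1 for bad and 0 for good, and the rule `x > 5` is hypothetical.

```python
def rule_metrics(values, labels, rule):
    """Evaluate one univariate rule on labeled data (1 = bad, 0 = good)."""
    hits = [y for v, y in zip(values, labels) if rule(v)]
    total_bad = sum(labels)
    base_rate = total_bad / len(labels)
    precision = sum(hits) / len(hits) if hits else 0.0
    return {
        "hit_rate": len(hits) / len(labels),
        "precision": precision,
        "recall": sum(hits) / total_bad if total_bad else 0.0,
        "lift": precision / base_rate if base_rate else 0.0,
    }


def high_value_rule(value):
    """Hypothetical rule taken from one bin boundary: x > 5."""
    return value > 5


train = rule_metrics([1, 2, 6, 7, 8, 3, 9, 4], [0, 0, 1, 1, 0, 0, 1, 1], high_value_rule)
validation = rule_metrics([2, 6, 9, 1], [0, 1, 0, 0], high_value_rule)

# Join training and validation performance for the same rule, as in the text.
report = {"rule": "x > 5", "train": train, "validation": validation}
```

Reporting both columns side by side makes it easy to drop rules whose lift collapses on the validation period.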

Still further, the multivariate rules in S6 are generated as follows: the final feature set from S4 is selected from the training-set data, and multivariate rules are generated in two ways: in the first, the features are binned by chi-square binning and the bins of several features are then cross-combined to form a batch of rules; in the second, several features are selected at random each time and binned with a decision-tree method based on the GINI coefficient to form a batch of rules; the rules are then optimized by computing the lift, recall, precision and hit rate of each rule and applying screening conditions based on these indexes; the same bin boundaries are applied to the validation-set data, the corresponding lift, recall, precision and hit rate are computed, and finally the rule content and its training-set performance are joined with its validation-set performance and output.
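The first way of generating multivariate rules, cross-combining bins, can be sketched in Python. The bin boundaries below are hypothetical stand-ins for the output of chi-square binning, which is not implemented here; half-open intervals `[low, high)` and the function names are likewise assumptions.

```python
from itertools import combinations, product


def cross_rules(bins_by_feature, n_features=2):
    """Form one candidate rule per combination of bins across n features."""
    rules = []
    for names in combinations(sorted(bins_by_feature), n_features):
        for intervals in product(*(bins_by_feature[name] for name in names)):
            rules.append(dict(zip(names, intervals)))
    return rules


def rule_hits(rule, row):
    """A row hits the rule when every feature falls inside its interval."""
    return all(low <= row[name] < high for name, (low, high) in rule.items())


# Hypothetical bins, assumed to come from chi-square binning.
bins_by_feature = {
    "overdue_count": [(0, 1), (1, float("inf"))],
    "debt_ratio": [(0.0, 0.5), (0.5, 0.8), (0.8, float("inf"))],
}
rules = cross_rules(bins_by_feature)
risky = {"overdue_count": (1, float("inf")), "debt_ratio": (0.8, float("inf"))}
hit = rule_hits(risky, {"overdue_count": 2, "debt_ratio": 0.9})
```

The number of candidate rules grows multiplicatively with the bins per feature, which is why the text follows generation with screening on lift, recall, precision and hit rate.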

Compared with the prior art, the invention has the beneficial effects that:

With the method for automatically generating risk control rules, information such as the actual production and operation and the borrowing intention of a loan client is obtained; the client's production and operating condition and repayment capacity are evaluated comprehensively on the basis of the automatically generated risk control rules; a loan risk early-warning mechanism is established; and post-loan risk management and credit evaluation are carried out, so that accurate monitoring and loan risk management are achieved.

Drawings

FIG. 1 is a flow chart of a method of automatically generating rules in accordance with the present invention;

FIG. 2 is a block diagram of acquiring enterprise data in accordance with the present invention;

FIG. 3 is a diagram of a design framework for rule generation of the present invention;

FIG. 4 is a flow chart of the automatic generation of target variables of the present invention;

FIG. 5 is a flow chart of automated feature wide table generation of the present invention;

FIG. 6 is a flow chart of automated feature screening for rules of the present invention;

FIG. 7 is a flow chart of an automated rule generation of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, an embodiment of the present invention provides a method for automatically generating risk control rules, including the following steps:

Step 1: acquiring data, including internal and external data of the enterprise. The internal data mainly comprises basic enterprise information, enterprise transaction information, IoT production information from enterprise equipment, enterprise financial information and the like; the external data mainly comprises enterprise business information, enterprise financial information, enterprise judicial information, and enterprise borrowing-intention behavior, as shown in FIG. 2.

Step 2: establishing a framework for rule generation: automatic rule generation comprises determining the target variable, forming a feature variable library, joining the target variable with the feature variable library to form a wide table, selecting training data for analysis and generating rules by binning, and then verifying the validity of the rules with data from a validation period, as shown in FIG. 3.

Step 3: automatically generating a target variable library: this step targets loan risk rules, so the target variable generally identifies overdue clients. The target variable of a rule is determined by the granularity of analysis, which is generally the client level, the bill level or the individual-loan level. For each granularity, target variables can therefore be generated in a loop from templates such as "first-N-periods-overdue-M-(days)", based on the repayment period and the number of days overdue. All loan disbursement records as of the current day are then traversed daily, focusing on fields such as the repayment period and the days overdue; the value of each target variable is computed according to its definition and updated incrementally each day, as shown in FIG. 4. The target variable is defined as in the disclosure above: a bad client (days overdue in the first N periods > M days) is recorded as 1, a good client (days overdue = 0) as 0, and an intermediate client (days overdue between 0 and M days) as -1.

Step 4: automatically forming the feature wide table: in this step, the raw data of each topic as of the current day is traversed, and feature variables for the different topics are derived automatically using methods such as the window-variable statistical derivation technique and the feature-combination derivation technique. The update cycle of the feature variables differs by topic: basic information is refreshed in full every day, whereas features that change daily, such as transaction information, financial information and production information, are more suitably updated incrementally every day. As shown in FIG. 5, the method constructs features using several techniques:

window-variable statistical derivation: features are constructed from the template "last-N-(time unit)-(action)-(item)-(statistic)";

feature-combination derivation: two or more categorical attributes are combined into one through an operation; the operations include the four arithmetic operations (addition, subtraction, multiplication and division), the logical operations AND, OR and NOT, and the like;

category-decomposition derivation: a feature can be converted into dummy-variable features by testing its value.

With the target variable library as the main table, the feature variable library is joined to it automatically to form the feature wide table used for rule generation, namely x1, ..., xn and the target value Y.

Step 5: selecting features by combining multiple algorithms: features are selected by combining index analysis with several algorithms, and different binning techniques are then applied, which improves the effectiveness of the rules.

Specifically, the data of the feature wide table is processed and divided into a training set and a validation set according to the training window and the validation window; index analysis and machine-learning modeling (XGBoost and TabNet) are then performed on the training set to screen out features that are both effective and stable.

Index analysis: univariate analysis is performed on the feature data of the training set; the indexes analyzed comprise the IV value, KS value, GINI coefficient, information entropy and PSI stability coefficient; effectiveness and stability are evaluated comprehensively from these indexes, and the features are screened accordingly. The indexes are defined and calculated as set out in the Disclosure of Invention above; among them:

$$\mathrm{KS} = \max(\mathrm{TPR} - \mathrm{FPR})$$

$$H = -\sum_i p_i \log p_i$$

where p_i is the probability of class i.

The machine-learning modeling and training process in this step is as follows:

Model one: when the XGBoost model is used for classification, the feature data of the training set is used as input; parameters such as the learning rate, the maximum tree depth, the per-tree sample subsampling rate and the regularization coefficients are set, and a classification result is obtained for each sample; the performance of the model is then evaluated, the feature importance of the model's variables is obtained, and features are screened according to that importance.

Model two: for tabular data, the TabNet model combines characteristics of tree structures and neural networks; it uses a sequential attention mechanism to select, at each decision step, a subset of semantically meaningful features for processing, that is, it performs feature selection and feature processing at every decision step, so that its performance on classification problems matches or exceeds that of other tabular learning models. Because the feature wide table is tabular data, the TabNet method is used to train a model from a different angle, and features are screened according to the feature importance of that model.

Fusion of index analysis with the models: on the training samples, let X1 be the feature set screened by index analysis, X2 the feature set screened by the XGBoost model, and X3 the feature set screened by the TabNet model; to improve the effectiveness of the rules, the methods are fused by taking the intersection of these sets to obtain the final feature set X, as shown in FIG. 6.

Step 6: automatically generating rules: based on the feature set screened by index analysis and the multi-algorithm models, univariate and multivariate rules are designed using chi-square binning and decision-tree binning, as shown in FIG. 7. Specifically:

Univariate rules: the final feature set from step 4 is selected from the training-set data and subjected to chi-square binning and decision-tree binning; the rules are optimized by measuring the lift, recall, precision and hit rate, and the rules whose chi-square and decision-tree binning results meet the conditions are retained; the same bin boundaries are then applied to the validation-set data, the lift, recall, precision and hit rate of each rule are computed on the validation set, and finally the rule content and its training-set performance are joined with its validation-set performance and output.

Multivariate rules: the final feature set from step 4 is selected from the training-set data, and multivariate rules are generated in two ways. In the first, the features are binned by chi-square binning and the bins of several features are then cross-combined to form a batch of rules. In the second, several features are selected at random each time and binned with a decision-tree method based on the GINI coefficient to form a batch of rules. The rules are then optimized by computing the lift, recall, precision and hit rate of each rule and applying screening conditions based on these indexes; the same bin boundaries are applied to the validation-set data, the corresponding lift, recall, precision and hit rate are computed, and finally the rule content and its training-set performance are joined with its validation-set performance and output.

The method for automatically generating risk control rules forms risk control rules efficiently and quickly. First, the rules are generated automatically: because rule generation depends closely on the definition of the target variables and the content of the feature variables, the target variable library and the feature variable library are generated automatically, and the final risk control rules are then obtained efficiently and quickly through automatic rule formation and rule optimization, providing risk monitoring information for the in-loan and post-loan stages. Second, the rule variables are selected using measurement indexes and data mining algorithms: by selecting feature variables through index analysis of the feature wide table and through the XGBoost and TabNet classification models, and then combining different binning techniques, rules with better performance can be obtained.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A method of automated risk control rule generation, comprising the steps of:

S1: acquiring data, including internal and external data of the enterprise;

2. A method of automated risk control rule generation according to claim 1, wherein: the internal data of the enterprise in S1 comprises basic information of the enterprise, transaction information of the enterprise, IOT production information of enterprise equipment and financial information of the enterprise; the external data includes business information, business financial information, business judicial information, and business lending intent behavior information.

3. A method of automated risk control rule generation according to claim 1, wherein: specific definition of target variable in S3:

4. A method of automated risk control rule generation according to claim 1, wherein: s4, adopting various technologies to perform the characteristic construction process as follows:

5. The method of automated risk control rule generation of claim 1, wherein in S4 the update cycle of the feature variables differs by topic: basic information is refreshed in full every day, whereas features that change daily, such as transaction information, financial information and production information, are updated incrementally every day.

6. The method for generating an automated risk control rule according to claim 1, wherein the specific method in S5 is as follows:

7. The method of automated risk control rule generation of claim 6, wherein the index analysis is as follows: univariate analysis is performed on the feature data of the training set; the indexes analyzed comprise the IV value, KS value, GINI coefficient, information entropy and PSI stability coefficient; effectiveness and stability are evaluated comprehensively from these indexes, and the features are screened accordingly, specifically:

$$\mathrm{KS} = \max(\mathrm{TPR} - \mathrm{FPR})$$

$$H = -\sum_i p_i \log p_i$$

where p_i is the probability of class i;

8. The method of automated risk control rule generation of claim 6, wherein the machine learning algorithm modeling training process in S502 is as follows:

9. The method of automated risk control rule generation of claim 1, wherein the univariate rules in S6 are generated as follows: the final feature set from S4 is selected from the training-set data and subjected to chi-square binning and decision-tree binning; the rules are optimized by measuring the lift, recall, precision and hit rate, and the rules whose chi-square and decision-tree binning results meet the conditions are retained; the same bin boundaries are then applied to the validation-set data, the lift, recall, precision and hit rate of each rule are computed on the validation set, and finally the rule content and its training-set performance are joined with its validation-set performance and output.

10. The method of automated risk control rule generation of claim 1, wherein the multivariate rules in S6 are generated as follows: the final feature set from S4 is selected from the training-set data, and multivariate rules are generated in two ways: in the first, the features are binned by chi-square binning and the bins of several features are then cross-combined to form a batch of rules; in the second, several features are selected at random each time and binned with a decision-tree method based on the GINI coefficient to form a batch of rules; the rules are then optimized by computing the lift, recall, precision and hit rate of each rule and applying screening conditions based on these indexes; the same bin boundaries are applied to the validation-set data, the corresponding lift, recall, precision and hit rate are computed, and finally the rule content and its training-set performance are joined with its validation-set performance and output.