CN116468536A - Automatic risk control rule generation method - Google Patents

Publication number
CN116468536A
Authority
CN
China
Prior art keywords
variable
feature
rule
data
clients
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310334618.0A
Other languages
Chinese (zh)
Inventor
林日英
于溦
董菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xinjing Information Technology Service Co ltd
Original Assignee
Guangzhou Xinjing Information Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xinjing Information Technology Service Co ltd
Priority to CN202310334618.0A
Publication of CN116468536A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a method for automatically generating risk control rules, comprising the following steps: acquiring data, establishing a rule generation framework, automatically generating a target variable library, automatically generating a feature variable library, selecting features by combining multiple algorithms, and automatically generating rules. The method automatically generates loan risk rules. Because rule generation depends closely on the definition of the target variables and the content of the feature variables, the target variable library and the feature variable library are both generated automatically, and the final risk control rules are then obtained efficiently and quickly through automatic rule formation and rule optimization, providing risk monitoring information for lending and post-loan management.

Description

Automatic risk control rule generation method
Technical Field
The invention relates to the technical field of risk prevention and control for financial products, and in particular to a method for automatically generating risk control rules.
Background
In industrial supply chain finance, the main financing customers are manufacturing enterprises, and the most common method of post-loan risk monitoring is the post-loan on-site visit. At present, few enterprises in China have credit records available for credit investigation, and the credit records of most enterprises are thin, so an enterprise credit profile with reference value cannot be formed. A supply chain finance loan monitoring platform is therefore usually built on the ERP data of core enterprises and the data of a credit management system. It centrally integrates and shares scattered operating data and loan-related information, tracks and monitors clients, detects potential risks with analysis tools, monitors and analyzes the operations, credit status and fund usage of loan clients, and sends early warnings of risk points to the relevant business departments, providing a basis for pre-loan approval decisions and post-loan management. In the traditional risk management process for bank loan clients, enterprise scales differ widely, business and non-business information lags, and a basis for quantitative comparison is lacking. In addition, the relevant information is mainly analyzed and discussed manually, which takes a long time, so the bank cannot manage risk accurately and quickly.
Disclosure of Invention
The invention aims to provide a method for automatically generating risk control rules that supplies risk monitoring information for lending and post-loan management, so as to solve the problems described in the background art.
To achieve the above purpose, the present invention provides the following technical solution:
A method of automated risk control rule generation, comprising the following steps:
S1: acquiring data: the data include internal and external data of the enterprise;
S2: establishing a framework for rule generation: automatic rule generation comprises determining the target variable, forming a feature variable library, joining the target variable with the feature variable library to form a wide table, selecting training data for analysis and binning to generate rules, and then using data from a validation period to verify the validity of the rules;
S3: automatically generating a target variable library: the target variable is overdue clients, and the target variable of a rule is determined by the granularity of the analysis, which comprises the client level, the bill level and the individual loan level; for each granularity, target variables are generated in a loop from the term and the number of days overdue using a template of the form pre-N-overdue-M-(time); all current repayment behavior data, including the overdue term and days-overdue fields, are then traversed every day, the value of the target variable is obtained according to its definition, and the values are updated incrementally by day;
S4: automatically generating a feature variable library: the raw data of each topic as of each day are traversed, and feature variables for the different topics are derived automatically using the window variable statistical derivation technique and the feature combination derivation technique; with the target variable library as the main table, the feature variable library is joined automatically to form the feature wide table used for rule generation, namely x1, ..., xn and the target value Y;
S5: selecting features by combining multiple algorithms: features are selected by combining index analysis with multiple algorithms;
S6: automatically generating rules: based on the feature set screened by index analysis and the multi-algorithm models, univariate and multivariate rules are designed using the chi-square binning method and the decision tree binning method.
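The chi-square binning named in S6 is not spelled out in the text. One common form is ChiMerge, which repeatedly merges the pair of adjacent bins whose good/bad distributions are most alike. The following is a minimal sketch under that assumption; the function names `chi2_pair` and `chimerge`, the stopping criterion (a maximum number of bins) and the input layout are illustrative choices, not taken from the patent.

```python
def chi2_pair(a, b):
    """Chi-square statistic of two adjacent bins, each given as (good, bad) counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in range(2):
            column_total = a[j] + b[j]
            expected = sum(row) * column_total / total
            if expected > 0:  # an empty column contributes nothing
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, max_bins):
    """Merge adjacent bins bottom-up until at most max_bins remain.

    labels: 1 = bad client, 0 = good client.
    Returns the inclusive upper edge of each final bin.
    """
    counts = {}
    for v, y in zip(values, labels):
        good, bad = counts.get(v, (0, 0))
        counts[v] = (good + (y == 0), bad + (y == 1))
    # start with one bin per distinct value: [upper_edge, (good, bad)]
    bins = [[v, counts[v]] for v in sorted(counts)]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[i][1], bins[i + 1][1])
                  for i in range(len(bins) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        good = bins[i][1][0] + bins[i + 1][1][0]
        bad = bins[i][1][1] + bins[i + 1][1][1]
        bins[i:i + 2] = [[bins[i + 1][0], (good, bad)]]
    return [edge for edge, _ in bins]


print(chimerge([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1], max_bins=2))
```

Each returned edge can be read directly as a candidate univariate rule threshold, for example "x > edge". In practice a significance threshold on the chi-square value is often used instead of, or together with, a fixed bin count.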
Further, the internal data of the enterprise in S1 include basic enterprise information, enterprise transaction information, IoT production information from enterprise equipment, and enterprise financial information; the external data include enterprise business information, enterprise financial information, enterprise judicial information, and enterprise lending-intent behavior information.
Further, the target variable in S3 is specifically defined as follows:
bad client: days overdue in the first N periods > M days, recorded as 1;
good client: days overdue in the first N periods = 0, recorded as 0;
intermediate client: days overdue in the first N periods between 0 and M days, recorded as -1.
Further, the features in S4 are constructed using the following techniques:
window variable statistical derivation technique: features are constructed from a template of the form last-N-(time unit)-(action)-(item)-(statistic);
feature combination derivation technique: two or more category attributes are combined into one through an operation, where the operations include the four arithmetic operations (addition, subtraction, multiplication and division) and the logical operations AND, OR and NOT;
category decomposition derivation technique: a feature is converted into dummy-variable features by testing its value;
numerical reconstruction derivation technique: the integer part is separated from the fractional part, and staged (binned) statistical features are constructed.
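The last three derivation techniques can be illustrated with a few small helpers. This is a sketch only: the helper names (`combine`, `to_dummies`, `split_number`, `stage`) are hypothetical, and reading "staged statistical feature" as assignment to a numeric band is an assumption.

```python
def combine(a, b, op):
    """Feature combination: arithmetic or logical combination of two values."""
    if op == "add":
        return a + b
    if op == "sub":
        return a - b
    if op == "mul":
        return a * b
    if op == "div":
        return a / b if b else None  # undefined ratio is left as missing
    if op == "and":
        return bool(a) and bool(b)
    if op == "or":
        return bool(a) or bool(b)
    raise ValueError(f"unsupported operation: {op}")


def to_dummies(value, categories, prefix):
    """Category decomposition: one 0/1 dummy feature per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def split_number(x):
    """Numerical reconstruction: separate the integer and fractional parts."""
    whole = int(x)  # truncates toward zero
    return whole, x - whole


def stage(x, edges):
    """Index of the first band whose upper edge is not exceeded by x."""
    for i, edge in enumerate(edges):
        if x <= edge:
            return i
    return len(edges)


print(combine(6, 3, "div"))
print(to_dummies("B", ["A", "B", "C"], "grade"))
print(split_number(12.75))
print(stage(35, [30, 60, 90]))
```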
Further, the update period of the feature variables in S4 differs by topic: basic information is updated in full every day, while features that change daily, such as transaction, financial and production information, are updated incrementally every day.
Further, the specific method in S5 is as follows:
S501: the data of the feature wide table are processed and divided into a training set and a validation set according to a training window and a validation window;
S502: index analysis and machine learning modeling are carried out on the training set data to screen out features that are both effective and stable.
Still further, the index analysis: univariate analysis is performed on the feature data of the training set. The indexes analyzed comprise the IV value, the KS value, the GINI coefficient, the information entropy and the PSI stability coefficient; the effectiveness and stability of each feature are evaluated comprehensively on these indexes in order to screen the features. Specifically:
IV value: evaluates the predictive power of a variable and can be used to screen variables quickly; a variable with IV > 0.02 is considered effective:
$$ IV = \sum_{i} \left( \frac{Bad_i}{Bad_T} - \frac{Good_i}{Good_T} \right) \ln \frac{Bad_i / Bad_T}{Good_i / Good_T} $$
in the above formula, Bad_i: the number of bad clients in bin i; Bad_T: the total number of bad clients; Good_i: the number of good clients in bin i; Good_T: the total number of good clients;
KS value: measures how well the variable separates positive and negative clients; the larger the KS value, the better the variable distinguishes bad clients. The KS value ranges from 0 to 1, and a variable with KS > 0.2 is considered to have good discriminating power:
$$ KS = \max(TPR - FPR) $$
in the above formula, TPR: the true positive rate, equal to the number of clients that are actually positive and predicted positive divided by the number of clients that are actually positive; FPR: the false positive rate, equal to the number of clients that are actually negative but predicted positive divided by the number of clients that are actually negative;
GINI coefficient: represents the probability that a randomly selected sample in the sample set is misclassified; the smaller the GINI coefficient, the smaller the probability that a selected sample is misclassified, that is, the purer the set, and conversely the less pure the set:
$$ GINI(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 $$
in the above formula, p_k is the probability that the selected sample belongs to class k, and the probability that this sample is misclassified is (1 - p_k); the sample set has K classes and a randomly selected sample may belong to any of them, so the sum runs over the classes; for two classes, GINI(p) = 2p(1 - p);
information entropy: used for feature selection; it measures the uncertainty of an outcome, and the smaller the information entropy, the simpler the outcome:
$$ H = -\sum_{i} p_i \log p_i $$
in the above formula, p_i is the probability of each class;
PSI stability coefficient: PSI measures the stability of a variable; the smaller the PSI value, the smaller the difference between the two distributions and the more stable the variable. When PSI is less than 0.1, the variable is highly stable; when PSI is between 0.1 and 0.25, its stability is moderate; when PSI is greater than 0.25, its stability is poor and selecting it is not recommended:
$$ PSI = \sum_{i} \left( \frac{Actual_i}{Actual_T} - \frac{Expect_i}{Expect_T} \right) \ln \frac{Actual_i / Actual_T}{Expect_i / Expect_T} $$
in the above formula, Actual_i: the number of clients whose value of the variable is i in the first period; Actual_T: the total number of clients for the variable in the first period; Expect_i: the number of clients whose value of the variable is i in the second period; Expect_T: the total number of clients for the variable in the second period.
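IV and PSI share the same functional form: a sum over bins of (share difference) times the log of the share ratio. The sketch below assumes the standard definitions of both indexes, since the formulas themselves are not legible in the source, and works from per-bin counts. The function names and the decision to reject empty bins rather than smooth them are illustrative assumptions; the thresholds 0.02, 0.1 and 0.25 are the ones stated in the text.

```python
import math


def _share_divergence(counts_a, counts_b):
    """Sum over bins of (share_a - share_b) * ln(share_a / share_b)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both groups must use the same bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    value = 0.0
    for a, b in zip(counts_a, counts_b):
        if a <= 0 or b <= 0:
            # assumption: empty bins are rejected instead of smoothed
            raise ValueError("every bin needs a positive count in both groups")
        share_a, share_b = a / total_a, b / total_b
        value += (share_a - share_b) * math.log(share_a / share_b)
    return value


def information_value(bad_counts, good_counts):
    """IV from the number of bad and good clients in each bin."""
    return _share_divergence(bad_counts, good_counts)


def psi(actual_counts, expect_counts):
    """PSI from the number of clients in each bin in two periods."""
    return _share_divergence(actual_counts, expect_counts)


def psi_band(value):
    """Stability bands as stated in the text."""
    if value < 0.1:
        return "high"
    if value <= 0.25:
        return "moderate"
    return "poor"


iv = information_value([10, 30], [40, 20])
print(iv, iv > 0.02)
drift = psi([60, 40], [50, 50])
print(drift, psi_band(drift))
```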
Further, the machine learning modeling and training process in S502 is as follows:
model one: the XGBoost model is used for classification prediction. The feature data of the training set serve as input, and the classification result for each sample is obtained after setting the learning rate, the maximum tree depth, the per-tree sample subsampling rate and the regularization coefficients. The model's results are then evaluated to obtain the feature importance of the modeling variables, and features are screened according to that importance;
model two: for tabular data, the TabNet model combines the characteristics of tree structures and neural networks. It uses a sequential attention mechanism to select a semantically meaningful subset of features to process at each decision step, so that feature selection and feature processing are both carried out at every decision step;
fusion of index analysis with the models: on the training sample data, let X1 be the feature set screened by index analysis, X2 the feature set screened by the XGBoost model, and X3 the feature set screened by the TabNet model. To improve the effectiveness of the rules, the methods are fused by taking the intersection of these sets to obtain the final feature set X.
Still further, the univariate rules in S6: for the final feature set from S4, chi-square binning and decision tree binning are applied to the training set data, and the rules are optimized using the lift, recall, precision and hit rate metrics, keeping the rules whose chi-square and decision tree binning results meet the conditions; the same bins are then applied to the validation set data, the lift, recall, precision and hit rate of each rule are calculated on the validation set, and finally the rule content and its effect on the training set are associated with its effect on the validation set and output.
Still further, the multivariate rules in S6: for the final feature set from S4 on the training set data, multivariate rules are generated in two ways. One way applies chi-square binning to the features and then cross-combines the bins of several features to form rules in batches; the other randomly selects several features each time and bins them with a decision tree method based on the GINI coefficient to form rules in batches. The rules are then optimized by calculating the lift, recall, precision and hit rate of each rule and setting screening conditions on these metrics; the same bins are applied to the validation set data and the corresponding lift, recall, precision and hit rate are calculated; finally, the rule content and its effect on the training set are associated with its effect on the validation set and output.
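The text names four rule metrics without defining them. The sketch below assumes the usual definitions: hit rate is the share of all samples the rule fires on, precision is the bad rate among the samples hit, recall is the share of all bad clients that are hit, and lift is precision divided by the overall bad rate. The function names and the row layout (dictionaries with a label column `Y`) are hypothetical.

```python
def rule_metrics(rows, rule, label_key="Y"):
    """Evaluate one rule (a predicate on a row) against labelled rows."""
    n = len(rows)
    hit_rows = [r for r in rows if rule(r)]
    bad_total = sum(1 for r in rows if r[label_key] == 1)
    bad_hit = sum(1 for r in hit_rows if r[label_key] == 1)
    precision = bad_hit / len(hit_rows) if hit_rows else 0.0
    recall = bad_hit / bad_total if bad_total else 0.0
    hit_rate = len(hit_rows) / n if n else 0.0
    base_rate = bad_total / n if n else 0.0
    lift = precision / base_rate if base_rate else 0.0
    return {"hit_rate": hit_rate, "precision": precision,
            "recall": recall, "lift": lift}


def compare_train_valid(train_rows, valid_rows, rule):
    """Apply the same rule (same bin) to both sets and pair the results."""
    return {"train": rule_metrics(train_rows, rule),
            "valid": rule_metrics(valid_rows, rule)}


xs = [80, 70, 60, 40, 30, 20, 10, 5, 45, 15]
ys = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
train = [{"x": x, "Y": y} for x, y in zip(xs, ys)]
valid = [{"x": 90, "Y": 1}, {"x": 55, "Y": 0}, {"x": 10, "Y": 0}, {"x": 20, "Y": 1}]
print(compare_train_valid(train, valid, lambda r: r["x"] > 50))
```

A rule whose lift drops sharply between the training and validation results would be a candidate for removal during screening.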
Compared with the prior art, the invention has the following beneficial effects:
the method for automated risk control rule generation obtains information such as the actual production and operation of a loan client and its lending intent, comprehensively evaluates the client's production and operating condition and repayment capacity based on the automatically generated risk control rules, establishes a lending risk early-warning mechanism, and carries out post-loan risk management and credit evaluation, thereby achieving accurate monitoring and loan risk management.
Drawings
FIG. 1 is a flow chart of the automatic rule generation method of the present invention;
FIG. 2 is a block diagram of enterprise data acquisition in the present invention;
FIG. 3 is a diagram of the design framework for rule generation in the present invention;
FIG. 4 is a flow chart of the automatic generation of target variables in the present invention;
FIG. 5 is a flow chart of the automated generation of the feature wide table in the present invention;
FIG. 6 is a flow chart of the automated screening of rule features in the present invention;
FIG. 7 is a flow chart of automated rule generation in the present invention.
Detailed Description
The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to fig. 1, an embodiment of the present invention provides a method for automated risk control rule generation, comprising the following steps:
Step 1: Acquiring data: the data include internal and external data of the enterprise. The internal data mainly comprise basic enterprise information, enterprise transaction information, IoT production information from enterprise equipment, enterprise financial information and the like; the external data mainly comprise enterprise business information, enterprise financial information, enterprise judicial information and enterprise lending-intent behavior, as shown in fig. 2.
Step 2: Establishing a framework for rule generation: automatic rule generation comprises determining the target variable, forming a feature variable library, joining the target variable with the feature variable library to form a wide table, selecting training data for analysis and binning to generate rules, and then using data from a validation period to verify the validity of the rules, as shown in fig. 3.
Step 3: Automatically generating a target variable library: this step targets loan risk rules, so the target variable is generally overdue clients. The target variable of a rule is determined by the granularity of the analysis, which is generally the client level, the bill level or the individual loan level. For each granularity, target variables can therefore be generated in a loop from the term and the number of days overdue, using a template such as pre-N-overdue-M-(time). All current repayment behavior data are then traversed every day, focusing on fields such as the repayment term and days overdue; the value of the target variable is obtained according to its definition and updated incrementally by day, as shown in fig. 4. The target variable is specifically defined as follows:
bad client: days overdue in the first N periods > M days, recorded as 1;
good client: days overdue in the first N periods = 0, recorded as 0;
intermediate client: days overdue in the first N periods between 0 and M days, recorded as -1.
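The labeling rule above can be sketched in Python as follows. This is an illustration rather than the patent's implementation: the function names `label_target` and `target_names`, the input layout (one days-overdue count per repayment period) and the use of the maximum to summarize "days overdue in the first N periods" are all assumptions. A record with exactly M days overdue is treated here as intermediate.

```python
from typing import List, Optional, Sequence


def label_target(overdue_days: Sequence[int], n: int, m: int) -> Optional[int]:
    """Label one client (or bill, or loan) for the target 'pre-N-overdue-M'.

    overdue_days[k] is the number of days overdue in repayment period k
    (assumed layout). Returns 1 (bad), 0 (good), -1 (intermediate), or
    None when no period has been observed yet.
    """
    window = list(overdue_days[:n])
    if not window:
        return None
    worst = max(window)  # assumption: summarize the N periods by their maximum
    if worst > m:
        return 1
    if worst == 0:
        return 0
    return -1


def target_names(ns: Sequence[int], ms: Sequence[int]) -> List[str]:
    """Generate target-variable names in a loop over N and M."""
    return [f"pre_{n}_overdue_{m}_days" for n in ns for m in ms]


history = [0, 12, 45, 0]
print(target_names([3, 6], [30, 60]))
for n, m in [(1, 30), (2, 30), (3, 30)]:
    print(n, m, label_target(history, n, m))
```

Running the same function once per day over the latest repayment records, and writing back only the labels that changed, is one way to realize the daily incremental update described in the step.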
Step 4: Automatically forming a feature wide table: in this step, the raw data of each topic as of each day are traversed, and feature variables for the different topics are derived automatically using methods such as the window variable statistical derivation technique and the feature combination derivation technique. The update period of the feature variables differs by topic: basic information is updated in full every day, while features that change daily, such as transaction, financial and production information, are better updated incrementally every day. As shown in fig. 5, the method constructs features using a variety of techniques:
window variable statistical derivation technique: features are constructed from a template of the form last-N-(time unit)-(action)-(item)-(statistic);
feature combination derivation technique: two or more category attributes are combined into one through an operation, where the operations include the four arithmetic operations (addition, subtraction, multiplication and division), the logical operations AND, OR and NOT, and the like;
category decomposition derivation technique: a feature can be converted into dummy-variable features by testing its value;
numerical reconstruction derivation technique: the integer part is separated from the fractional part, and staged (binned) statistical features are constructed.
With the target variable library as the main table, the feature variable library is joined automatically to form the feature wide table used for rule generation, namely x1, ..., xn and the target value Y.
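As a rough sketch of this step, the code below derives one window-statistic feature from raw events and then left-joins the features onto the target table. The event layout, the key `client_id`, the feature name and both function names are hypothetical, and the window is taken as the `days` days ending at the cutoff date, which is an assumption.

```python
from datetime import date, timedelta


def window_stat(events, as_of, days, action, item, stat):
    """Feature 'last-<days>-days-<action>-<item>-<stat>' as of a cutoff date."""
    start = as_of - timedelta(days=days)
    values = [e[item] for e in events
              if e["action"] == action and start < e["date"] <= as_of]
    if stat == "count":
        return len(values)
    if not values:
        return 0.0
    if stat == "sum":
        return sum(values)
    if stat == "max":
        return max(values)
    if stat == "mean":
        return sum(values) / len(values)
    raise ValueError(f"unsupported statistic: {stat}")


def build_wide_table(targets, features):
    """Left-join features onto the target table (the main table) by client id."""
    wide = []
    for row in targets:
        merged = dict(features.get(row["client_id"], {}))
        merged.update(row)  # target columns win on a name clash
        wide.append(merged)
    return wide


events = [
    {"date": date(2023, 3, 1), "action": "purchase", "amount": 100.0},
    {"date": date(2023, 3, 20), "action": "purchase", "amount": 50.0},
    {"date": date(2023, 3, 25), "action": "refund", "amount": 20.0},
]
as_of = date(2023, 3, 31)
features = {"c1": {"last_30d_purchase_amount_sum":
                   window_stat(events, as_of, 30, "purchase", "amount", "sum")}}
targets = [{"client_id": "c1", "Y": 1}, {"client_id": "c2", "Y": 0}]
print(build_wide_table(targets, features))
```

Because the target table drives the join, clients without any derived features still appear in the wide table, with their feature columns missing.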
Step 5: Selecting features by combining multiple algorithms: features are selected by combining index analysis with multiple algorithms, and different binning techniques are then applied, which improves the effectiveness of the rules.
Specifically, the data of the feature wide table are processed and divided into a training set and a validation set according to a training window and a validation window, and index analysis and machine learning modeling (XGBoost and TabNet) are carried out on the training set data to screen out features that are both effective and stable.
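A window-based split can be sketched as below. The column name `snapshot_date`, the inclusive window bounds and the function name are assumptions made for illustration; the text does not say how the windows are encoded.

```python
from datetime import date


def split_by_window(rows, train_window, valid_window, date_key="snapshot_date"):
    """Split wide-table rows into training and validation sets by date window.

    Each window is a (start, end) pair of dates, both inclusive.
    """
    train_start, train_end = train_window
    valid_start, valid_end = valid_window
    train = [r for r in rows if train_start <= r[date_key] <= train_end]
    valid = [r for r in rows if valid_start <= r[date_key] <= valid_end]
    return train, valid


rows = [
    {"snapshot_date": date(2022, 11, 30), "x1": 0.2, "Y": 0},
    {"snapshot_date": date(2022, 12, 31), "x1": 0.9, "Y": 1},
    {"snapshot_date": date(2023, 1, 31), "x1": 0.4, "Y": 0},
    {"snapshot_date": date(2023, 2, 28), "x1": 0.7, "Y": 1},
]
train, valid = split_by_window(
    rows,
    train_window=(date(2022, 11, 1), date(2022, 12, 31)),
    valid_window=(date(2023, 1, 1), date(2023, 2, 28)),
)
print(len(train), len(valid))
```

Keeping the validation window strictly later than the training window means the rules are checked on data they could not have seen, which is the usual reason for splitting by time rather than at random.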
Index analysis: univariate analysis is performed on the feature data of the training set. The indexes analyzed comprise the IV value, the KS value, the GINI coefficient, the information entropy and the PSI stability coefficient; the effectiveness and stability of each feature are evaluated comprehensively on these indexes in order to screen the features. The definition and calculation formula of each index are as follows:
IV value: evaluates the predictive power of a variable and can be used to screen variables quickly; a variable with IV > 0.02 is considered effective:
$$ IV = \sum_{i} \left( \frac{Bad_i}{Bad_T} - \frac{Good_i}{Good_T} \right) \ln \frac{Bad_i / Bad_T}{Good_i / Good_T} $$
in the above formula, Bad_i: the number of bad clients in bin i; Bad_T: the total number of bad clients; Good_i: the number of good clients in bin i; Good_T: the total number of good clients;
KS value: measures how well the variable separates positive and negative clients; the larger the KS value, the better the variable distinguishes bad clients. The KS value ranges from 0 to 1, and a variable with KS > 0.2 is considered to have good discriminating power:
$$ KS = \max(TPR - FPR) $$
in the above formula, TPR: the true positive rate, equal to the number of clients that are actually positive and predicted positive divided by the number of clients that are actually positive; FPR: the false positive rate, equal to the number of clients that are actually negative but predicted positive divided by the number of clients that are actually negative;
GINI coefficient: represents the probability that a randomly selected sample in the sample set is misclassified; the smaller the GINI coefficient, the smaller the probability that a selected sample is misclassified, that is, the purer the set, and conversely the less pure the set:
$$ GINI(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 $$
in the above formula, p_k is the probability that the selected sample belongs to class k, and the probability that this sample is misclassified is (1 - p_k); the sample set has K classes and a randomly selected sample may belong to any of them, so the sum runs over the classes; for two classes, GINI(p) = 2p(1 - p);
information entropy: used for feature selection; it measures the uncertainty of an outcome, and the smaller the information entropy, the simpler the outcome:
$$ H = -\sum_{i} p_i \log p_i $$
in the above formula, p_i is the probability of each class;
PSI stability coefficient: PSI measures the stability of a variable; the smaller the PSI value, the smaller the difference between the two distributions and the more stable the variable. When PSI is less than 0.1, the variable is highly stable; when PSI is between 0.1 and 0.25, its stability is moderate; when PSI is greater than 0.25, its stability is poor and selecting it is not recommended:
$$ PSI = \sum_{i} \left( \frac{Actual_i}{Actual_T} - \frac{Expect_i}{Expect_T} \right) \ln \frac{Actual_i / Actual_T}{Expect_i / Expect_T} $$
in the above formula, Actual_i: the number of clients whose value of the variable is i in the first period; Actual_T: the total number of clients for the variable in the first period; Expect_i: the number of clients whose value of the variable is i in the second period; Expect_T: the total number of clients for the variable in the second period.
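Three of the indexes above (KS, GINI and information entropy) can be computed with a few lines of plain Python. The sketch below is illustrative only: the function names are hypothetical, KS is taken over thresholds of the form "flag the client when the value is at least t" on the assumption that larger values point towards bad clients, and the entropy uses base-2 logarithms, which the text does not specify.

```python
import math
from collections import Counter


def ks_statistic(values, labels):
    """KS = max(TPR - FPR); labels: 1 = bad (positive), 0 = good (negative)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(values, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # consume every sample tied at this threshold before measuring
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            elif pairs[i][1] == 0:
                fp += 1
            i += 1
        best = max(best, tp / positives - fp / negatives)
    return best


def class_probabilities(labels):
    """Empirical class probabilities p_k of a label list."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]


def gini(probs):
    """GINI(p) = sum over classes of p_k * (1 - p_k)."""
    return sum(p * (1.0 - p) for p in probs)


def entropy(probs):
    """H = -sum of p_i * log2(p_i); zero-probability classes contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


labels = [1, 0, 1, 0]
print(ks_statistic([0.9, 0.8, 0.7, 0.6], labels))
print(gini(class_probabilities(labels)), entropy(class_probabilities(labels)))
```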
The machine learning modeling and training process in this step is as follows:
model one: the XGBoost model is used for classification prediction. The feature data of the training set serve as input, and the classification result for each sample is obtained after setting parameters such as the learning rate, the maximum tree depth, the per-tree sample subsampling rate and the regularization coefficients. The model's results are then evaluated to obtain the feature importance of the modeling variables, and features are screened according to that importance;
model two: for tabular data, the TabNet model combines the characteristics of tree structures and neural networks. It uses a sequential attention mechanism to select a semantically meaningful subset of features to process at each decision step, so that feature selection and feature processing are both carried out at every decision step, and its performance on classification problems matches or exceeds that of other tabular learning models. Because the feature wide table is tabular data, TabNet is used to train a model from a different angle, and features are screened according to the model's feature importance;
fusion of index analysis with the models: on the training sample data, let X1 be the feature set screened by index analysis, X2 the feature set screened by the XGBoost model, and X3 the feature set screened by the TabNet model. To improve the effectiveness of the rules, the methods are fused by taking the intersection of these sets to obtain the final feature set X, as shown in FIG. 6.
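The fusion step reduces to a set intersection once each method has produced its own list of features. In the sketch below, the importance scores are made-up numbers, and selecting the top k features by importance is an assumed screening criterion, since the text does not state a cutoff; `top_features` and `fuse` are hypothetical names.

```python
def top_features(importance, k):
    """Keep the k features with the highest importance score."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return {name for name, _ in ranked[:k]}


def fuse(*feature_sets):
    """Final feature set X: features kept by every screening method."""
    sets = [set(s) for s in feature_sets]
    return set.intersection(*sets) if sets else set()


x1 = {"a", "b", "c"}  # e.g. passed the IV / KS / PSI thresholds
x2 = top_features({"a": 0.5, "b": 0.3, "d": 0.2, "c": 0.0}, k=3)  # tree model
x3 = top_features({"a": 0.4, "b": 0.4, "e": 0.2}, k=3)  # attention model
print(sorted(fuse(x1, x2, x3)))
```

Intersection is the strictest way to combine the three views: a feature survives only if index analysis and both models agree on it, which trades coverage for robustness.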
Step 6: Automatically generating rules: based on the feature set screened by index analysis and the multi-algorithm models, univariate and multivariate rules are designed using the chi-square binning method and the decision tree binning method, as shown in fig. 7. Specifically:
Univariate rules: for the final feature set from step 4, chi-square binning and decision tree binning are applied to the training set data, and the rules are optimized using the lift, recall, precision and hit rate metrics, keeping the rules whose chi-square and decision tree binning results meet the conditions; the same bins are then applied to the validation set data, the lift, recall, precision and hit rate of each rule are calculated on the validation set, and finally the rule content and its effect on the training set are associated with its effect on the validation set and output.
Multivariate rules: for the final feature set from step 4 on the training set data, multivariate rules are generated in two ways. One way applies chi-square binning to the features and then cross-combines the bins of several features to form rules in batches; the other randomly selects several features each time and bins them with a decision tree method based on the GINI coefficient to form rules in batches. The rules are then optimized by calculating the lift, recall, precision and hit rate of each rule and setting screening conditions on these metrics; the same bins are applied to the validation set data and the corresponding lift, recall, precision and hit rate are calculated; finally, the rule content and its effect on the training set are associated with its effect on the validation set and output.
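The first way of generating multivariate rules, cross-combining the bins of several features, can be sketched with `itertools`. The representation of a rule as a tuple of `(feature, low, high)` conditions with half-open intervals `low < x <= high`, and the function names, are assumptions for illustration; the bins themselves would come from the chi-square binning step.

```python
from itertools import combinations, product


def cross_rules(bins_by_feature, order=2):
    """Build every rule that takes one bin from each of `order` features.

    bins_by_feature maps a feature name to a list of (low, high) bins.
    """
    rules = []
    for feats in combinations(sorted(bins_by_feature), order):
        for combo in product(*(bins_by_feature[f] for f in feats)):
            rules.append(tuple((f, low, high)
                               for f, (low, high) in zip(feats, combo)))
    return rules


def rule_hits(rule, row):
    """A row is hit when every condition low < row[feature] <= high holds."""
    return all(low < row[f] <= high for f, low, high in rule)


bins = {
    "a": [(0, 1), (1, 2)],
    "b": [(0, 5), (5, 10), (10, 15)],
    "c": [(0, 1), (1, 9)],
}
rules = cross_rules(bins, order=2)
print(len(rules))
print(rule_hits(rules[0], {"a": 0.5, "b": 3, "c": 4}))
```

The number of candidate rules grows multiplicatively with the number of bins and features, which is why the text pairs batch generation with metric-based screening on the training and validation sets.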
The method for automated risk control rule generation forms risk control rules efficiently and quickly. First, it automatically generates loan risk rules: because rule generation depends closely on the definition of the target variables and the content of the feature variables, the target variable library and the feature variable library are both generated automatically, and the final risk control rules are then obtained efficiently and quickly through automatic rule formation and rule optimization, providing risk monitoring information for lending and post-loan management. Second, the rule variables are selected using both evaluation metrics and data mining algorithms: feature variables are selected through index analysis of the feature wide table together with the XGBoost and TabNet classification models, and different binning techniques are then applied, so that more effective rules can be obtained.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited thereto. Any equivalent substitution or modification of the technical solution and inventive concept made by a person skilled in the art within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A method of automated risk control rule generation, comprising the following steps:
S1: acquiring data: the data comprises internal and external data of the enterprise;
S2: establishing a framework for rule generation: automatic rule generation comprises determining a target variable, forming a feature variable library, forming a wide table by associating the target variable with the feature variable library, selecting training data for analysis and generating rules by binning, and then using data from a verification period to verify the validity of the rules;
S3: automatically generating a target variable library: the target variable is the overdue client; the target variable of a rule is determined according to the granularity of analysis, the granularity comprising client level, bill level and loan level; for each granularity, target variables are generated in a loop from the template "first N periods, overdue more than M days", based on the period and the number of overdue days; all repayment behavior data up to the current date, including the overdue period and overdue days fields, is then traversed every day, the value of each target variable is obtained according to its definition, and the values are updated incrementally by day;
S4: automatically generating a feature variable library: the feature variables are derived by traversing the raw data of the different topics as of each day and applying the window variable statistical derivation technique and the feature combination derivation technique, so that feature variables of the different topics are derived automatically; with the target variable library as the main table, the feature variable library is automatically associated to form a feature wide table for rule generation, namely x1, ..., xn and a target value Y;
S5: selecting features with a combination of multiple algorithms: features are selected by index analysis combined with multiple algorithms;
S6: automatically generating rules: based on the feature set screened by index analysis and the multi-algorithm models, univariate and multivariate rules are designed using chi-square binning and decision tree binning.
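Step S6 names chi-square binning without spelling out the procedure. The following is a minimal sketch of one common variant (ChiMerge-style bottom-up merging), assuming binary labels where 1 marks a bad client and 0 a good client. The function names `chi2_pair` and `chi_merge` and the stopping criterion `max_bins` are illustrative assumptions, not taken from the patent.

```python
from typing import Dict, List, Sequence


def chi2_pair(a: Sequence[int], b: Sequence[int]) -> float:
    """Chi-square statistic of two adjacent bins, each given as [good, bad] counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in range(2):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # skip empty columns to avoid division by zero
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chi_merge(values: Sequence[float], labels: Sequence[int], max_bins: int) -> List[float]:
    """Merge the most similar adjacent bins until max_bins remain; return the cut points."""
    counts: Dict[float, List[int]] = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, [0, 0])[y] += 1  # index 0 = good, index 1 = bad
    # Start with one bin per distinct value: [lower edge, [good, bad]].
    bins = [[v, c] for v, c in sorted(counts.items())]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[i][1], bins[i + 1][1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))  # the adjacent pair with the most similar bad rate
        merged = [bins[i][0], [bins[i][1][0] + bins[i + 1][1][0],
                               bins[i][1][1] + bins[i + 1][1][1]]]
        bins[i:i + 2] = [merged]
    # Each remaining bin after the first starts at a cut point.
    return [lower for lower, _ in bins[1:]]


cuts = chi_merge([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1], max_bins=2)
```

Each cut point yields a candidate univariate rule of the form "feature >= cut"; production implementations usually also stop on a chi-square significance threshold, which this sketch omits.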
2. The method of automated risk control rule generation according to claim 1, wherein: the internal data of the enterprise in S1 comprises basic information of the enterprise, transaction information of the enterprise, IoT production information of the enterprise's equipment, and financial information of the enterprise; the external data comprises enterprise business information, enterprise financial information, enterprise judicial information, and enterprise loan-intent behavior information.
3. The method of automated risk control rule generation according to claim 1, wherein the target variable in S3 is specifically defined as follows:
bad clients: the number of overdue days in the first N periods is greater than M days, recorded as 1;
good clients: the number of overdue days in the first N periods is 0, recorded as 0;
intermediate clients: the number of overdue days in the first N periods is between 0 and M days, recorded as -1.
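The three-way definition above, together with the looped "first N periods, overdue more than M days" template from S3, can be sketched as follows. This is an illustration under stated assumptions: the claim does not say how overdue days are aggregated across periods, so the sketch uses the maximum over the first N periods, and the names `label_client` and `build_target_library` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Sequence


def label_client(overdue_days: Sequence[int], n_periods: int, m_days: int) -> Optional[int]:
    """Return 1 (bad), 0 (good), -1 (intermediate), or None if history is too short."""
    if n_periods < 1:
        raise ValueError("n_periods must be at least 1")
    window = list(overdue_days[:n_periods])
    if len(window) < n_periods:
        return None  # assumption: no label until N periods have been observed
    worst = max(window)  # assumption: aggregate by the worst period
    if worst > m_days:
        return 1
    if worst == 0:
        return 0
    return -1


def build_target_library(overdue_days: Sequence[int],
                         n_values: Iterable[int],
                         m_values: Iterable[int]) -> Dict[str, Optional[int]]:
    """Generate one target variable per (N, M) pair from the template."""
    m_list = list(m_values)
    return {
        f"first{n}_overdue{m}": label_client(overdue_days, n, m)
        for n in n_values
        for m in m_list
    }


targets = build_target_library([0, 10, 40], n_values=[2, 3], m_values=[7, 30])
```

The same function can be applied at client, bill, or loan granularity by changing what one entry of `overdue_days` represents.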
4. The method of automated risk control rule generation according to claim 1, wherein the feature construction in S4 uses the following techniques:
window variable statistical derivation technique: features are constructed from the template "last N time units - action - item - statistic";
feature combination derivation technique: two or more category attributes are combined into one through an operation, the types of operation comprising the four arithmetic operations (addition, subtraction, multiplication and division) and the logical operations AND, OR and NOT;
category decomposition derivation technique: the value of a feature is tested and converted into dummy-variable features;
numerical reconstruction derivation technique: the integer part is separated from the fractional part, and staged statistical features are constructed.
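To make the derivation techniques concrete, here is a small sketch of three of them: the window statistic template, dummy-variable decomposition, and integer/fractional splitting. The event layout `(date, action, item, amount)`, the feature naming scheme, and the set of supported statistics are assumptions chosen for illustration.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Event = Tuple[date, str, str, float]  # (event date, action, item, amount)

STATS: Dict[str, Callable[[List[float]], float]] = {
    "count": lambda v: float(len(v)),
    "sum": lambda v: float(sum(v)),
    "mean": lambda v: float(mean(v)) if v else 0.0,
    "max": lambda v: float(max(v)) if v else 0.0,
}


def window_feature(events: Iterable[Event], as_of: date, n_days: int,
                   action: str, item: str, stat: str) -> Tuple[str, float]:
    """Template: last N time units (days here) - action - item - statistic."""
    start = as_of - timedelta(days=n_days)
    values = [amount for d, a, i, amount in events
              if start < d <= as_of and a == action and i == item]
    return f"last{n_days}d_{action}_{item}_{stat}", STATS[stat](values)


def dummy_features(name: str, value: str, categories: Sequence[str]) -> Dict[str, int]:
    """Category decomposition: one 0/1 dummy variable per known category."""
    return {f"{name}_is_{c}": int(value == c) for c in categories}


def split_number(name: str, value: float) -> Dict[str, float]:
    """Numerical reconstruction: separate the integer part from the fractional part."""
    integer_part = int(value)  # truncates toward zero
    return {f"{name}_int": integer_part, f"{name}_frac": value - integer_part}


events = [
    (date(2023, 3, 1), "pay", "invoice", 100.0),
    (date(2023, 3, 20), "pay", "invoice", 50.0),
    (date(2023, 3, 25), "refund", "invoice", 10.0),
]
feature = window_feature(events, date(2023, 3, 30), 30, "pay", "invoice", "sum")
```

Looping `window_feature` over several window lengths, actions, items and statistics produces the batch of derived variables that the template implies.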
5. The method of automated risk control rule generation according to claim 1, wherein in S4 the update period of the feature variables differs by topic: basic information is updated in full every day, and the features of transaction information, financial information and production information, which change every day, are updated incrementally every day.
6. The method of automated risk control rule generation according to claim 1, wherein the specific method in S5 is as follows:
S501: processing the data of the feature wide table, and dividing it into a training set and a verification set according to the training window and the verification window;
S502: performing index analysis and machine learning algorithm modeling training on the data of the training set, so as to screen out features that are both effective and highly stable.
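Step S501 amounts to a time-based split of the wide table. A minimal sketch follows, assuming each row carries an observation date under the hypothetical key `obs_date` and that the training window must end before the verification window begins; neither detail is stated in the claim.

```python
from datetime import date
from typing import Dict, List, Tuple

Window = Tuple[date, date]  # inclusive (start, end)
Row = Dict[str, object]


def split_by_window(rows: List[Row], train_window: Window, valid_window: Window,
                    date_key: str = "obs_date") -> Tuple[List[Row], List[Row]]:
    """Split wide-table rows into a training set and a verification set by date."""
    if train_window[1] >= valid_window[0]:
        # Assumption: windows must not overlap, so verification data stays out of time.
        raise ValueError("training window must end before the verification window starts")
    train = [r for r in rows if train_window[0] <= r[date_key] <= train_window[1]]
    valid = [r for r in rows if valid_window[0] <= r[date_key] <= valid_window[1]]
    return train, valid


rows = [
    {"obs_date": date(2022, 6, 1), "x1": 0.2, "Y": 0},
    {"obs_date": date(2022, 12, 15), "x1": 0.9, "Y": 1},
    {"obs_date": date(2023, 2, 1), "x1": 0.4, "Y": 0},
]
train, valid = split_by_window(
    rows,
    train_window=(date(2022, 1, 1), date(2022, 12, 31)),
    valid_window=(date(2023, 1, 1), date(2023, 3, 31)),
)
```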
7. The method of automated risk control rule generation according to claim 6, wherein the index analysis comprises: performing univariate analysis on the feature data of the training set, the analyzed indices comprising the IV value, the KS value, the GINI coefficient, the information entropy and the PSI stability coefficient, and comprehensively evaluating the effectiveness and stability indicated by each index so as to screen the features, specifically:
IV value: used to evaluate the predictive power of a variable, and can be used to screen variables quickly; a variable with IV > 0.02 is defined as having predictive effect:
$$\mathrm{IV} = \sum_{i} \left( \frac{\mathrm{Bad}_i}{\mathrm{Bad}_T} - \frac{\mathrm{Good}_i}{\mathrm{Good}_T} \right) \ln\left( \frac{\mathrm{Bad}_i / \mathrm{Bad}_T}{\mathrm{Good}_i / \mathrm{Good}_T} \right)$$
in the above formula, Bad_i is the number of bad clients in segment i; Bad_T is the total number of bad clients; Good_i is the number of good clients in segment i; Good_T is the total number of good clients;
KS value: measures the degree to which a variable can distinguish positive clients from negative clients; the larger the KS value, the stronger the variable's ability to identify bad clients; the KS value ranges from 0 to 1, and a variable with KS > 0.2 is defined as having good discriminating ability:
$$\mathrm{KS} = \max(\mathrm{TPR} - \mathrm{FPR})$$
in the above formula, TPR is the true positive rate, equal to the number of clients that are truly positive and predicted as positive divided by the number of clients that are truly positive; FPR is the false positive rate, equal to the number of clients that are truly negative but predicted as positive divided by the number of clients that are truly negative;
GINI coefficient: represents the probability that a randomly selected sample in the sample set is misclassified; the smaller the GINI index, the smaller the probability that a selected sample is misclassified, i.e. the higher the purity of the set, and conversely the less pure the set:
$$\mathrm{GINI}(P) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
in the above formula, p_k is the probability that the selected sample belongs to class k, and the probability that this sample is misclassified is (1 - p_k); the sample set has K classes, and a randomly selected sample may belong to any one of them, so the terms are summed over the classes; for binary classification, GINI(P) = 2p(1 - p);
information entropy: used for feature selection; it measures the uncertainty of the outcome, and the smaller the information entropy, the purer the outcome:
$$H = -\sum_{i} p_i \log p_i$$
in the above formula, p_i is the probability of each class;
PSI stability coefficient: PSI measures the stability of a variable; the smaller the PSI value, the smaller the difference between the two distributions and the more stable the variable; when PSI is less than 0.1, the variable is highly stable; when PSI is between 0.1 and 0.25, the stability is moderate; when PSI is greater than 0.25, the stability is poor and selecting the variable is not recommended:
$$\mathrm{PSI} = \sum_{i} \left( \frac{\mathrm{Actual}_i}{\mathrm{Actual}_T} - \frac{\mathrm{Expect}_i}{\mathrm{Expect}_T} \right) \ln\left( \frac{\mathrm{Actual}_i / \mathrm{Actual}_T}{\mathrm{Expect}_i / \mathrm{Expect}_T} \right)$$
in the above formula, Actual_i is the number of clients whose value of the variable falls in attribute i in the first period; Actual_T is the total number of clients for the variable in the first period; Expect_i is the number of clients whose value of the variable falls in attribute i in the second period; Expect_T is the total number of clients for the variable in the second period.
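The indices of claim 7 can be computed directly from per-segment client counts. The sketch below uses only the Python standard library; the small floor `EPS` for empty segments, the base-2 logarithm for entropy, and the assumption that segments passed to `ks_statistic` are ordered from highest to lowest risk are choices made for illustration, not details given in the claim.

```python
import math
from typing import List, Sequence

EPS = 1e-6  # assumption: floor on segment shares to avoid log(0) and division by zero


def _shares(counts: Sequence[float]) -> List[float]:
    total = sum(counts)
    return [max(c / total, EPS) for c in counts]


def information_value(bad: Sequence[float], good: Sequence[float]) -> float:
    """IV from per-segment bad and good client counts."""
    b, g = _shares(bad), _shares(good)
    return sum((bi - gi) * math.log(bi / gi) for bi, gi in zip(b, g))


def ks_statistic(bad: Sequence[float], good: Sequence[float]) -> float:
    """KS as the largest gap between cumulative bad share (TPR) and good share (FPR)."""
    bad_t, good_t = sum(bad), sum(good)
    cum_bad = cum_good = 0.0
    best = 0.0
    for bi, gi in zip(bad, good):
        cum_bad += bi
        cum_good += gi
        best = max(best, abs(cum_bad / bad_t - cum_good / good_t))
    return best


def gini_impurity(probs: Sequence[float]) -> float:
    return sum(p * (1 - p) for p in probs)


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log2(p) for p in probs if p > 0)  # assumption: base 2


def psi(actual: Sequence[float], expect: Sequence[float]) -> float:
    """PSI between the segment distributions of two periods."""
    a, e = _shares(actual), _shares(expect)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


iv = information_value(bad=[80, 20], good=[20, 80])
ks = ks_statistic(bad=[80, 20], good=[20, 80])
```

A feature would then pass index screening when, for example, `iv > 0.02`, `ks > 0.2` and `psi(...) < 0.1`, following the thresholds stated in the claim.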
8. The method of automated risk control rule generation according to claim 6, wherein the machine learning algorithm modeling training process in S502 is as follows:
model one: when the XGBoost model is used for classification prediction, the feature data of the training set is used as input data; the classification result of each sample is obtained by setting the learning rate of the iterations, the maximum depth of the trees, the sample subsampling rate of each tree and the regularization term coefficients; the feature importance of the modeling variables is then obtained by evaluating the effect of the model results, and features are screened according to the feature importance;
model two: for tabular data, the TabNet model combines the characteristics of tree structures and neural networks, and uses a sequential attention mechanism to select, at each decision step, a feature subset with semantic value for processing, thereby realizing feature selection and feature processing at each decision step;
fusion of index analysis with the models: based on the training sample data, the feature set screened by index analysis is X1, the feature set screened by the XGBoost model is X2, and the feature set screened by the TabNet model is X3; to improve the effectiveness of the rules, the intersection of the sets is taken, fusing the several methods to obtain the final feature set X.
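The fusion step is a plain set intersection, X = X1 ∩ X2 ∩ X3. A short sketch, assuming each model exposes a feature-importance mapping and that the top k features per model are kept; the cutoff `k`, the helper names, and the importance values are hypothetical.

```python
from typing import Dict, List, Sequence


def top_k_by_importance(importances: Dict[str, float], k: int) -> List[str]:
    """Keep the k most important features; ties are broken by name for determinism."""
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def fuse_feature_sets(x1: Sequence[str], x2: Sequence[str], x3: Sequence[str]) -> List[str]:
    """Final feature set X: features selected by all three methods, in the order of x1."""
    keep = set(x2) & set(x3)
    return [f for f in x1 if f in keep]


x1 = ["txn_sum_30d", "overdue_cnt_90d", "equipment_uptime"]           # index analysis
x2 = top_k_by_importance(
    {"overdue_cnt_90d": 0.41, "txn_sum_30d": 0.33, "revenue_growth": 0.26}, k=2
)                                                                      # stand-in for XGBoost
x3 = ["txn_sum_30d", "revenue_growth", "overdue_cnt_90d"]             # stand-in for TabNet
x_final = fuse_feature_sets(x1, x2, x3)
```

Taking the intersection trades coverage for robustness: a feature survives only if every method agrees on it.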
9. The method of automated risk control rule generation according to claim 1, wherein the univariate rules in S6 are generated as follows: the final feature set in S4 is selected from the data of the training set, and chi-square binning and decision tree binning are performed on it; the rules are optimized by measuring the lift value, recall rate, precision rate and hit rate, and the rules whose chi-square binning and decision tree binning results meet the conditions are screened out; the same binning is then applied to the verification set data, and the lift value, recall rate, precision rate and hit rate of each rule on the verification set are calculated; finally, the rule content and effect on the training set are associated with the rule effect on the verification set, and the result is output.
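The claim names four rule metrics without defining them. The sketch below uses definitions that are common in credit risk practice and should be read as assumptions: hit rate is the share of all clients the rule fires on, precision is the bad rate among hit clients, recall is the share of all bad clients that are hit, and lift is precision divided by the overall bad rate. The label key `y` and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Mapping

Row = Mapping[str, float]
Rule = Callable[[Row], bool]


def evaluate_rule(rule: Rule, rows: List[Row], label_key: str = "y") -> Dict[str, float]:
    """Compute hit rate, precision, recall and lift of one rule on one data set."""
    total = len(rows)
    total_bad = sum(1 for r in rows if r[label_key] == 1)
    hits = [r for r in rows if rule(r)]
    bad_hits = sum(1 for r in hits if r[label_key] == 1)
    precision = bad_hits / len(hits) if hits else 0.0
    base_rate = total_bad / total if total else 0.0
    return {
        "hit_rate": len(hits) / total if total else 0.0,
        "precision": precision,
        "recall": bad_hits / total_bad if total_bad else 0.0,
        "lift": precision / base_rate if base_rate else 0.0,
    }


def compare_train_valid(rule: Rule, train: List[Row], valid: List[Row]) -> Dict[str, float]:
    """Associate the training-set effect of a rule with its verification-set effect."""
    result = {f"train_{k}": v for k, v in evaluate_rule(rule, train).items()}
    result.update({f"valid_{k}": v for k, v in evaluate_rule(rule, valid).items()})
    return result


train_rows = [{"x": float(i), "y": 1.0 if i >= 7 else 0.0} for i in range(10)]
metrics = evaluate_rule(lambda r: r["x"] > 5, train_rows)
```

Here the rule `x > 5` stands in for one bin boundary produced by chi-square or decision tree binning; a rule would be kept only if its metrics clear the chosen thresholds on both data sets.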
10. The method of automated risk control rule generation according to claim 1, wherein the multivariate rules in S6 are generated as follows: the final feature set in S4 is selected from the data of the training set, and multivariate rules are generated in two ways: one is to perform chi-square binning on the features and then cross-combine several features to form a batch of rules; the other is to randomly select several features each time and generate a batch of binned rules with a decision tree method based on the GINI coefficient; the rules are then optimized by calculating the lift value, recall rate, precision rate and hit rate of each rule and setting screening conditions on these indices; the same binning is applied to the verification set data, and the corresponding lift value, recall rate, precision rate and hit rate are calculated; finally, the rule content and effect on the training set are associated with the rule effect on the verification set, and the result is output.
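The first of the two approaches, cross-combining binned features into a batch of rules, can be sketched with `itertools`. The sketch assumes the bins have already been produced (for example by chi-square binning) and are given as half-open intervals `(low, high]`; the rule representation and the function names are illustrative only.

```python
from itertools import combinations, product
from typing import Dict, List, Mapping, Tuple

Interval = Tuple[float, float]                 # (low, high], i.e. low < value <= high
Rule = Tuple[Tuple[str, Interval], ...]        # one (feature, interval) condition per feature

INF = float("inf")


def cross_rules(bins: Dict[str, List[Interval]], n_features: int = 2) -> List[Rule]:
    """Form every combination of one bin from each of n_features different features."""
    rules: List[Rule] = []
    for features in combinations(sorted(bins), n_features):
        for intervals in product(*(bins[f] for f in features)):
            rules.append(tuple(zip(features, intervals)))
    return rules


def rule_matches(rule: Rule, row: Mapping[str, float]) -> bool:
    """A multivariate rule fires only when every one of its conditions holds."""
    return all(low < row[feature] <= high for feature, (low, high) in rule)


bins = {
    "overdue_cnt_90d": [(-INF, 1.0), (1.0, INF)],
    "txn_sum_30d": [(-INF, 10.0), (10.0, 20.0), (20.0, INF)],
}
rules = cross_rules(bins, n_features=2)
```

The number of candidate rules grows multiplicatively with the number of bins and features, which is why each candidate is then filtered on lift, recall, precision and hit rate before being kept.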
CN202310334618.0A 2023-03-30 2023-03-30 Automatic risk control rule generation method Pending CN116468536A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310334618.0A CN116468536A (en) 2023-03-30 2023-03-30 Automatic risk control rule generation method

Publications (1)

Publication Number Publication Date
CN116468536A (en) 2023-07-21

Family

ID=87183500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310334618.0A Pending CN116468536A (en) 2023-03-30 2023-03-30 Automatic risk control rule generation method

Country Status (1)

Country Link
CN (1) CN116468536A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196823A (en) * 2023-09-08 2023-12-08 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium
CN117196823B (en) * 2023-09-08 2024-03-19 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium
CN117455417A (en) * 2023-12-22 2024-01-26 深圳刷宝科技有限公司 Automatic iterative optimization method and system for intelligent wind control approval strategy
CN117455417B (en) * 2023-12-22 2024-04-09 深圳刷宝科技有限公司 Automatic iterative optimization method and system for intelligent wind control approval strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination