CN112200659A

CN112200659A - Method and device for establishing wind control model and storage medium

Info

Publication number: CN112200659A
Application number: CN202011045716.5A
Authority: CN
Inventors: 邵俊; 李越; 蔡艺齐; 周炬; 路林林; 张磊
Original assignee: Shenzhen Suoxinda Data Technology Co ltd
Current assignee: Shenzhen Suoxinda Data Technology Co ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2021-01-08

Abstract

The application discloses a method, a device, and a storage medium for establishing a wind control model. The method comprises: acquiring a user historical data set; performing weight of evidence (WOE) pre-coding on each first feature according to the values that different users take for that feature and the label indicating whether each user has defaulted; screening second features out of the first features, the number of second features being smaller than the number of first features; determining, for each second feature, a binning scheme whose WOE values are monotonic, thereby obtaining a plurality of second bins for each second feature; and establishing the wind control model with a binary classification model based on the plurality of second bins of each second feature. In this way, manual binning can be replaced, a large amount of time is saved, and the same accuracy as manual modeling is achieved.

Description

Method and device for establishing wind control model and storage medium

Technical Field

The present application relates to the field of big data analysis and data mining technologies, and in particular, to a method, an apparatus, and a storage medium for establishing a wind control model.

Background

The current procedure for establishing a credit wind control model comprises the following steps: data acquisition, weight of evidence (WOE) transformation, variable clustering, and regression modeling. The WOE transformation involves two sub-steps: (1) merging the values of each variable into groups, treating each merged group of samples as a bin, and numbering the bins; and (2) transforming the variable in each bin according to the WOE formula. These two sub-steps require the bins to be adjusted manually, again and again, until the WOE values of the bins exhibit a monotonic trend. The modeling process is therefore lengthy and consumes considerable labor and time.

Disclosure of Invention

Based on the above, the application provides a method, a device and a storage medium for establishing a wind control model.

In a first aspect, the present application provides a method for building a wind control model, the method comprising:

acquiring a user history data set, wherein the user history data set comprises user history data, and the user history data comprise the values of a plurality of different first features of a user and a label indicating whether the user has defaulted;

performing weight of evidence (WOE) pre-coding on each first feature according to the values that different users take for that first feature and the label indicating whether each user has defaulted;

screening second features out of the first features based on a plurality of first bins obtained after the WOE pre-coding of each first feature, wherein the number of second features is smaller than the number of first features;

determining, based on all possible binning schemes of the plurality of first bins of each second feature, a binning scheme in which the WOE values of that second feature satisfy monotonicity, thereby obtaining a plurality of second bins of each second feature;

and establishing the wind control model with a binary classification model based on the plurality of second bins of each second feature.

In a second aspect, the present application provides a computer apparatus comprising: a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and, when executing the computer program, implement the method of establishing a wind control model as described above.

In a third aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the method of establishing a wind control model as described above.

The embodiments of the application provide a method, a device, and a storage medium for establishing a wind control model. A user history data set is acquired, wherein the user history data set comprises user history data, and the user history data comprise the values of a plurality of different first features of a user and a label indicating whether the user has defaulted. WOE pre-coding is performed on each first feature according to the values that different users take for that first feature and the label indicating whether each user has defaulted. Second features are screened out of the first features based on a plurality of first bins obtained after the WOE pre-coding of each first feature, the number of second features being smaller than the number of first features. Based on all possible binning schemes of the plurality of first bins of each second feature, a binning scheme in which the WOE values of that second feature satisfy monotonicity is determined, thereby obtaining a plurality of second bins of each second feature. The wind control model is established with a binary classification model based on the plurality of second bins of each second feature.
Because the WOE transformation involves only WOE pre-coding, with no special consideration of WOE monotonicity, it can be processed automatically by a computer. The first features are screened after the WOE pre-coding, and a small number of second features are selected from them. WOE monotonicity is then considered only for this small number of second features: all possible binning schemes are traversed and, using the monotonicity property of WOE values (the WOE values are monotonically increasing or monotonically decreasing), a binning scheme in which the WOE values of each second feature satisfy monotonicity is determined. The computer can thus automatically and quickly bin the small number of second features so that the resulting WOE values are monotonic, which replaces manual binning and saves a large amount of time. The wind control model established from the second bins of each second feature, whose WOE values satisfy monotonicity, can achieve the same accuracy as manual modeling.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

FIG. 1 is a flow chart of an embodiment of a method for creating a wind control model according to the present application;

FIG. 2 is a flow chart of another embodiment of a method for building a wind control model according to the present application;

FIG. 3 is a flow chart of another embodiment of a method of building a wind control model according to the present application;

FIG. 4 is a flow chart of another embodiment of a method of creating a wind control model according to the present application;

FIGS. 5-7 are schematic diagrams of bin readjustment in the method of building a wind control model according to the present application;

FIG. 8 is a flow chart of yet another embodiment of a method of building a wind control model according to the present application;

FIG. 9 is a schematic structural diagram of an embodiment of a computer apparatus according to the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

Referring to fig. 1, fig. 1 is a flowchart of an embodiment of a method for establishing a wind control model according to the present application.

The wind control model is short for risk control model. It is commonly used in financial institutions, such as credit and guarantee businesses, to control business risk. Financial institutions today introduce quantitative analysis methods into every link of risk management wherever possible, carry out retrospective analysis and review with the help of big data, and keep optimizing and adjusting, so that they can more quickly strike a balance in the trade-off between risk and return and maximize profit locally or more broadly. The wind control model is an essential tool for differentiating risk precisely when the risk control strategy reaches that balance. Broadly speaking, a wind control model is any risk management model constructed using data.

The method comprises the following steps: step S101, step S102, step S103, step S104, and step S105.

Step S101: acquiring a user history data set, wherein the user history data set comprises user history data, and the user history data comprise the values of a plurality of different first features of a user and a label indicating whether the user has defaulted.

The user history data set may refer to the set of user history data used to build the wind control model. It may include a user history data set for training (the training set for short) and a user history data set for testing (the test set for short). The user history data set can be divided into a training set and a test set in a certain proportion. Typically the training set is larger than the test set; for example, the user history data set is divided into a training set and a test set at a ratio of …:3.

In this embodiment, obtaining the user history data set may be obtaining the user history data set used for training the wind control model. After the wind control model is established, a user historical data set for testing the wind control model can be obtained; and testing the wind control model by using the user historical data set for testing the wind control model.

A feature of a user, also called a variable, independent variable, variable feature, or feature variable, is an attribute related to the user, and the value of the feature is the actual content or data corresponding to that attribute for the user. For example, in "gender is female", "gender" is a feature and "female" is its value; in "age is 40", "age" is a feature and "40" is its value; in "average monthly spending is 10,000", "average monthly spending" is a feature and "10,000" is its value; and so on.

In this embodiment, a first feature is a feature before screening, so named to distinguish it from a feature after screening (the second feature described later). The first features of a user include, but are not limited to, basic attributes of the user, asset data, transaction data, and activity data. Basic attributes of a user include, but are not limited to, age and gender. The asset data include data such as account balance; the transaction data include data such as the amount spent within a period (e.g. 7 days); and the activity data include data such as the number of logins to an app on a terminal (e.g. a mobile phone) within a period (e.g. one month).

According to whether the value of a feature is text or a number, the first features can be divided into text-type features (whose values are text) and numerical features (whose values are numbers). Text-type features may include gender, education level, occupation, and the like; numerical features may include age, asset data, transaction data, activity data, and the like.

The user history data comprise the first features of a user, the values of those first features, and the label indicating whether the user has defaulted. A user usually has a plurality of different first features, each with its own value. The label distinguishes users who have defaulted (bad users) from users who have not (good users). The label is also called the dependent variable and may take the value 0 or 1: 0 indicates a good user who has not defaulted, and 1 indicates a bad user who has defaulted. A plurality of user history data form the user history data set.

Step S102: performing weight of evidence (WOE) pre-coding on each first feature according to the values that different users take for that first feature and the label indicating whether each user has defaulted.

Weight of evidence (WOE), also called variable weight, is a form of encoding of the original independent variable: it converts the values of a variable into a measure of the default rate. For example, for the variable age, if the WOE value of the age group [27, 30] is 0.3 and the WOE value of the age group [31, 35] is 0.1, then users aged [27, 30] have a higher default probability than users aged [31, 35].

If every value of a variable were encoded separately, the amount of calculation would be very large, the resulting model would be unstable, and outliers of the variable would easily interfere with the model. Therefore, before WOE encoding, the variable is first binned (also called discretized or grouped) to simplify calculation, improve the stability of the model, and reduce the disturbance caused by outliers. Because WOE converts the values of a variable into a measure of the default rate, the magnitude of the WOE reflects the magnitude of the default rate.

To WOE-transform a feature, its values are first divided into bins, and the WOE transformation is then applied to each bin. The main binning method currently used in the industry is chi-square binning. For the binning to be reasonable, the WOE values of different bins should differ as much as possible, and the WOE should be monotonic after the transformation: if the difference between two bins is not large enough, there is no need to separate them, and a monotonic WOE agrees better with business logic.

Because there are too many variables, and too many ways to choose the cut points of a single numerical variable, the amount of calculation is very large for a computer, whereas a person can roughly pick suitable cut points by eye, judging the trend of the WOE, so that the WOE is as monotonic as possible and the WOE values of different bins differ as much as possible. Therefore, to ensure that the binning meets business needs and business logic, modelers in the prior art tend to analyze the binning of the variables manually, one by one.

In this embodiment, the WOE pre-coding does not need to consider WOE monotonicity. The bins are an initial, rough division made in a preset manner, and the WOE after binning is not required to be monotonic. The WOE pre-coding can therefore be carried out quickly and automatically by a computer, without manual intervention.

For a given variable, binning yields a plurality of bins, and for the i-th bin the WOE is calculated as follows:

$$WOE_i = \ln\left(\frac{p_{yi}}{p_{ni}}\right) = \ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$$

where $p_{yi}$ is the proportion of the bad users in this bin to all bad users, i.e. the proportion of the labels with value 1 in this bin to all labels with value 1 (in the risk control model, bad users are the defaulting users: 0 denotes a good user who has not defaulted, also called a non-response sample, and 1 denotes a bad user who has defaulted, also called a response sample);

$p_{ni}$ is the proportion of the good users in this bin to all good users (i.e. the proportion of the labels with value 0 in this bin to all labels with value 0);

$y_i$ is the number of bad users in this bin;

$n_i$ is the number of good users in this bin;

$y_T$ is the number of bad users among all users;

$n_T$ is the number of good users among all users.

From the above formula, WOE can be understood as the difference between the distribution of bad users and the distribution of good users in each bin, or equivalently as the difference between the bad-to-good ratio in each bin and the overall bad-to-good ratio. The larger the absolute WOE, the greater these two differences. In the original data, good users and bad users are mixed together and cannot be distinguished; binning separates good users from bad users as far as possible, and the WOE measures how well they are separated after binning.
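As a minimal sketch of the WOE calculation described above, the following Python function takes the bad and good counts of each bin and returns one WOE value per bin. The function name, the input layout, and the example counts are illustrative assumptions, not part of the application; bins with a zero count are rejected here because the logarithm is undefined for them.

```python
import math

def woe_per_bin(bins):
    """bins: list of (bad_count, good_count) pairs, one pair per bin.

    Returns WOE_i = ln((y_i / y_T) / (n_i / n_T)) for every bin.
    """
    y_total = sum(bad for bad, _ in bins)    # y_T: all bad users
    n_total = sum(good for _, good in bins)  # n_T: all good users
    woe = []
    for bad, good in bins:
        if bad == 0 or good == 0:
            raise ValueError("each bin needs at least one bad and one good user")
        p_y = bad / y_total    # share of all bad users that fall in this bin
        p_n = good / n_total   # share of all good users that fall in this bin
        woe.append(math.log(p_y / p_n))
    return woe

# Hypothetical counts for three bins of one feature: (bad users, good users)
example_bins = [(30, 70), (20, 130), (10, 240)]
example_woe = woe_per_bin(example_bins)
```

With these hypothetical counts the bad-to-good ratio falls from bin to bin, so the WOE values decrease monotonically.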

Binning methods include unsupervised binning and supervised binning; among the supervised methods, chi-square binning is the most widely used. In practical applications an appropriate binning method may be selected, which is not limited herein.

Step S103: screening second features out of the first features based on a plurality of first bins obtained after the WOE pre-coding of each first feature, wherein the number of second features is smaller than the number of first features.

When a wind control model is built with a binary classification model, the independent variables usually need to be screened. For example, given 200 candidate independent variables, the 200 variables are usually not all put directly into the model for fitting; instead, some of them are selected by some method and put into the model, forming the list of model variables.

To screen the variables, the embodiments of the application may use any method that can screen on the basis of the plurality of first bins, and their WOE values, obtained after WOE pre-coding of each first feature, so as to select the second features (the screened features) from the first features (the unscreened features). Screening model variables is a relatively complex process that must take many factors into account, such as the predictive power of the variables, the correlation between variables, the simplicity of the variables (easy to generate and use), the robustness of the variables (not easily bypassed), and the interpretability of the variables in business terms (can be explained when challenged). The most important and most direct of these measures is predictive power. Methods of screening variables that may be employed include, but are not limited to: variable clustering, to eliminate collinearity among the variables; and preliminary screening by IV value, deleting the variables whose IV value is below an IV threshold.

Step S104: determining, based on all possible binning schemes of the plurality of first bins of each second feature, a binning scheme in which the WOE values of that second feature satisfy monotonicity, thereby obtaining a plurality of second bins of each second feature.

In this embodiment, after the first features have been screened, the first bins of each selected second feature are adjusted so that the WOE values of the adjusted bins satisfy monotonicity. Monotonic WOE values may be monotonically increasing or monotonically decreasing. If the WOE is monotonically increasing, the WOE value of each bin is always greater than that of the preceding bin, taken in bin order; if the WOE is monotonically decreasing, the WOE value of each bin is always less than that of the preceding bin. By going through all possible binning schemes that can result from readjusting the first bins (the bins before adjustment) of a second feature, the computer can use the monotonicity rule to find a binning scheme whose WOE values are monotonic, thereby obtaining the second bins (the bins after adjustment) of each second feature. If the WOE values of the first bins of a second feature already satisfy monotonicity, no adjustment is necessary, and the first bins serve directly as the second bins of that feature. In this way, the computer can automatically and quickly bin the small number of second features so that the resulting WOE values are monotonic, which replaces manual binning and saves a large amount of time. The wind control model established from the second bins of each second feature, whose WOE values satisfy monotonicity, can achieve the same accuracy as manual modeling and remains interpretable in business terms.
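The search over binning schemes can be sketched as follows, under two stated assumptions: a binning scheme is taken to be any way of merging adjacent first bins, and schemes with more bins are tried first. The text above does not say how to choose among several monotonic schemes, so that preference is illustrative only, as are all function names and the example counts. The enumeration visits 2^(n-1) - 1 schemes for n first bins, so it is only practical when chi-square merging has already left few first bins.

```python
import math
from itertools import combinations

def woe(bad, good, bad_total, good_total):
    return math.log((bad / bad_total) / (good / good_total))

def is_monotonic(values):
    increasing = all(a < b for a, b in zip(values, values[1:]))
    decreasing = all(a > b for a, b in zip(values, values[1:]))
    return increasing or decreasing

def monotonic_binnings(first_bins):
    """first_bins: ordered list of (bad, good) counts of adjacent first bins.

    Yields (cut_points, merged_bins, woe_values) for every way of merging
    adjacent first bins whose WOE sequence is strictly monotonic.
    Schemes with more bins are yielded first (an assumption of this sketch).
    """
    n = len(first_bins)
    bad_total = sum(b for b, _ in first_bins)
    good_total = sum(g for _, g in first_bins)
    # A scheme is a set of cut positions between adjacent first bins.
    for k in range(n - 1, 0, -1):          # k cuts give k + 1 bins (at least 2)
        for cuts in combinations(range(1, n), k):
            edges = (0,) + cuts + (n,)
            merged = [
                (sum(b for b, _ in first_bins[lo:hi]),
                 sum(g for _, g in first_bins[lo:hi]))
                for lo, hi in zip(edges, edges[1:])
            ]
            if any(b == 0 or g == 0 for b, g in merged):
                continue                   # WOE is undefined for such a bin
            values = [woe(b, g, bad_total, good_total) for b, g in merged]
            if is_monotonic(values):
                yield cuts, merged, values

# Hypothetical first bins whose WOE values are not monotonic as they stand
first_bins = [(30, 70), (12, 88), (20, 80), (5, 95)]
best_cuts, second_bins, second_woe = next(monotonic_binnings(first_bins))
```

In this example the four first bins are not monotonic, and the first monotonic scheme found merges the two middle bins into one.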

Step S105: establishing the wind control model with a binary classification model based on the plurality of second bins of each second feature.

Binary classification models include, but are not limited to, logistic regression models and decision tree models. In one embodiment, the binary classification model is a logistic regression model, which is widely used in practical applications. Logistic regression is a machine learning method for binary classification problems (0 or 1) that estimates the probability of an event; in this embodiment, the logistic regression model is used to estimate the probability that a user defaults.
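To illustrate how a logistic regression model can estimate the default probability from WOE-encoded features, here is a minimal sketch that fits the model by plain batch gradient descent in pure Python. The training routine, its parameters (`lr`, `epochs`), and the tiny data set are assumptions made for illustration; a production model would normally rely on an established statistics or machine learning library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss.

    rows:   list of feature lists (here: the WOE value of each user's bin)
    labels: list of 0 (good user) / 1 (bad user)
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y                      # gradient of the log-loss w.r.t. z
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def default_probability(weights, x):
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Hypothetical data: one WOE-encoded feature with two bins (WOE 0.8 and -0.8)
rows = [[0.8]] * 10 + [[-0.8]] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
weights = fit_logistic(rows, labels)
p_high = default_probability(weights, [0.8])
p_low = default_probability(weights, [-0.8])
```

Because the feature is WOE-encoded, the fitted coefficient is positive and the predicted default probability is higher for the bin with the higher WOE.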


In an embodiment, step S102 of performing WOE pre-coding on each first feature, according to the values that different users take for that first feature and the label indicating whether each user has defaulted, may include sub-step S1021 and sub-step S1022, as shown in FIG. 2.

Sub-step S1021: pre-binning each first feature based on the values that different users take for that first feature and the label indicating whether each user has defaulted, to obtain a plurality of first bins of each first feature.

Sub-step S1022: performing WOE transformation on the plurality of first bins of each first feature, to obtain a plurality of WOE values of each first feature.

The pre-binning of sub-step S1021 covers two cases, depending on the type of the first feature:

In the first case, the first features include text-type features:

A1: if the first features include text-type features, take each value of each text-type feature as a bin and determine the WOE value of each bin.

A2: sort all bins by WOE value, either from largest to smallest or from smallest to largest.

A3: perform chi-square merging on the sorted bins to obtain a plurality of first bins of each text-type feature.

In this embodiment, because text-type features are few in number, each value of a text-type feature is taken as one bin, all bins are sorted by WOE value, and chi-square merging is then performed. As a result, the WOE values of a text-type feature already satisfy monotonicity after step S102.
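Steps A1 and A2 can be sketched as below: each distinct value becomes one bin, its WOE is computed, and the bins are sorted by WOE. The function name, the ascending sort direction, and the occupation data are illustrative assumptions; the sketch also assumes every category contains at least one bad and one good user, since the WOE is otherwise undefined.

```python
import math
from collections import defaultdict

def initial_text_bins(values, labels):
    """values: category of each user; labels: 0 (good) / 1 (bad).

    Returns a list of (category, bad, good, woe) sorted by WOE ascending.
    """
    counts = defaultdict(lambda: [0, 0])      # category -> [bad, good]
    for v, y in zip(values, labels):
        counts[v][0 if y == 1 else 1] += 1
    bad_total = sum(c[0] for c in counts.values())
    good_total = sum(c[1] for c in counts.values())
    bins = []
    for category, (bad, good) in counts.items():
        # Assumes bad > 0 and good > 0 for every category.
        w = math.log((bad / bad_total) / (good / good_total))
        bins.append((category, bad, good, w))
    return sorted(bins, key=lambda b: b[3])

# Hypothetical occupation feature for 30 users
values = ["clerk"] * 10 + ["teacher"] * 10 + ["driver"] * 10
labels = [1] * 5 + [0] * 5 + [1] * 1 + [0] * 9 + [1] * 3 + [0] * 7
sorted_bins = initial_text_bins(values, labels)
```

Sorting first means that the chi-square merging of step A3 only ever combines bins that are neighbours in WOE order, which is why the merged bins keep a monotonic WOE.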

In the second case, the first features include numerical features:

B1: if the first features include numerical features, sort the users by the value of each numerical feature.

B2: divide the values of the numerical feature, in sorted order, into a plurality of bins of equal frequency. In this embodiment, the values are divided into N equal-frequency bins, where N is a large number, generally N > 1000, and the WOE value of each bin is calculated.

B3: perform chi-square merging on the sorted bins to obtain a plurality of first bins of each numerical feature.
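Steps B1 and B2 can be sketched as follows. The function name and the returned dictionary layout are illustrative assumptions, and the example uses 4 bins on 40 values purely for readability, whereas the embodiment uses a much larger N. This simple version cuts by position only, so users with identical values may land in neighbouring bins.

```python
def equal_frequency_bins(values, labels, n_bins):
    """Sort users by feature value and cut them into n_bins groups of
    (almost) equal size.

    Returns one dict per bin with its value range and bad/good counts.
    labels: 0 (good user) / 1 (bad user).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    bins = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        idx = order[lo:hi]
        if not idx:
            continue                      # more bins requested than users
        bad = sum(labels[i] for i in idx)
        bins.append({"min": values[idx[0]], "max": values[idx[-1]],
                     "bad": bad, "good": len(idx) - bad})
    return bins

# Hypothetical age feature: 40 users aged 20..59, those under 30 defaulted
ages = list(range(20, 60))
defaulted = [1 if age < 30 else 0 for age in ages]
age_bins = equal_frequency_bins(ages, defaulted, 4)
```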

Binning is the discretization of continuous variables (numerical features) and the merging of multi-state discrete variables (text-type features) into fewer states. The basic idea of binning is as follows: for an accurate discretization, the relative class frequencies should be consistent within an interval. Two adjacent intervals can therefore be merged if they have very similar class distributions; otherwise they should be kept separate. A low chi-square value indicates that two intervals have similar class distributions.

The chi-square merging step is as follows:

Calculate the chi-square values of all pairs of adjacent bins (if the preceding step produced 1000 bins, 999 chi-square values must be calculated), and merge the two adjacent bins with the smallest chi-square value; repeat until the chi-square values of all adjacent bins are greater than a preset value p₀ (e.g. p₀ = 0.1).

The chi-square value of two adjacent bins, numbered 1 and 2, is calculated as follows:

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$

where $A_{11}$ is the number of bad users who have defaulted (label Y = 1) in bin 1;

$A_{12}$ is the number of good users who have not defaulted (label Y = 0) in bin 1;

$A_{21}$ is the number of bad users who have defaulted in bin 2;

$A_{22}$ is the number of good users who have not defaulted in bin 2;

$E_{ij}$ is the expected count corresponding to $A_{ij}$:

$$E_{11} = \frac{R_1 C_1}{N},\quad E_{12} = \frac{R_1 C_2}{N},\quad E_{21} = \frac{R_2 C_1}{N},\quad E_{22} = \frac{R_2 C_2}{N}$$

where $C_1 = A_{11} + A_{21}$, $C_2 = A_{12} + A_{22}$, $R_1 = A_{11} + A_{12}$, $R_2 = A_{21} + A_{22}$, and $N = C_1 + C_2$.
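The chi-square merging loop can be sketched as below. Each bin is represented as a (bad, good) pair; the expected count of each cell is taken as its row total times its column total divided by N, which is the standard expected count for a 2 x 2 contingency table. The function names and the example counts are assumptions made for illustration, and the stopping threshold 0.1 is the example value of p₀ mentioned in the text.

```python
def chi_square(bin1, bin2):
    """bin1, bin2: (bad, good) counts of two adjacent bins."""
    table = [bin1, bin2]                                   # A[i][j]
    rows = [sum(r) for r in table]                         # R_i: users per bin
    cols = [table[0][j] + table[1][j] for j in range(2)]   # C_j: users per class
    n = sum(rows)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n               # E_ij
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

def chi_merge(bins, threshold):
    """Repeatedly merge the adjacent pair with the smallest chi-square value
    until every adjacent pair exceeds `threshold` (or one bin remains)."""
    bins = list(bins)
    while len(bins) > 1:
        scores = [chi_square(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold:
            break
        bins[i:i + 2] = [(bins[i][0] + bins[i + 1][0],
                          bins[i][1] + bins[i + 1][1])]
    return bins

# Hypothetical (bad, good) counts of four adjacent bins
merged = chi_merge([(10, 90), (11, 89), (30, 70), (31, 69)], threshold=0.1)
```

In this example the first two bins and the last two bins have nearly identical default rates, so each pair is merged and two clearly different bins remain.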

In an embodiment, step S103 of screening the second features out of the first features, based on the plurality of first bins obtained after WOE pre-coding of each first feature, may include sub-step S103A1 and sub-step S103A2, as shown in FIG. 3.

Sub-step S103A1: dividing the plurality of first features into a plurality of classes by factor analysis.

Sub-step S103A2: in each class, screening out for the first time the first feature with the highest information value (IV) and the first feature with the highest goodness-of-fit R² value; the first features screened out for the first time are the second features.
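The per-class selection of sub-step S103A2 can be sketched as follows, taking the class assignment and the IV and R² values as given inputs. How those values are computed is outside this sketch, and the function name, feature names, and numbers are all hypothetical.

```python
def select_per_class(classes, iv, r2):
    """classes: dict mapping class id -> list of feature names
    iv, r2:   dicts mapping feature name -> IV value / R-squared value

    Returns, without duplicates, the feature with the highest IV and the
    feature with the highest R-squared of every class.
    """
    selected = []
    for members in classes.values():
        best_iv = max(members, key=lambda f: iv[f])
        best_r2 = max(members, key=lambda f: r2[f])
        for feature in (best_iv, best_r2):
            if feature not in selected:
                selected.append(feature)
    return selected

# Hypothetical classes and scores for five first features
classes = {0: ["age", "income", "balance"], 1: ["logins", "spend_7d"]}
iv = {"age": 0.12, "income": 0.30, "balance": 0.25,
      "logins": 0.08, "spend_7d": 0.15}
r2 = {"age": 0.40, "income": 0.55, "balance": 0.70,
      "logins": 0.35, "spend_7d": 0.20}
second_features = select_per_class(classes, iv, r2)
```

When the same feature has both the highest IV and the highest R² in its class, it is kept only once, so a class contributes one or two second features.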

The principle of factor analysis is as follows.

Assume there are N candidate variables $X_1, X_2, \ldots, X_N$ on which factor analysis is to be performed. Factor analysis assumes that there exist k common factors $F_1, F_2, \ldots, F_k$ such that each original variable can be written as a linear combination of the k common factors plus a specific factor $\varepsilon$; that is, any variable $X_i$ can be written as:

$$X_i = a_{i1}F_1 + a_{i2}F_2 + \cdots + a_{ik}F_k + \varepsilon_i$$

where the coefficients $a_{i1}, a_{i2}, \ldots, a_{ik}$ are called the factor loadings; taken over all $i \in [1, N]$, they form a matrix A of size N × k, called the load matrix.

The load matrix may be estimated by the principal component method, the principal factor method, or maximum likelihood estimation. This embodiment uses the principal component method, as follows.

In the notation below, $\sum$ denotes summation, and $\sum_{i=1}^{N}$ denotes the sum over i from 1 to N.

In the factor analysis model construction, the estimation of the number k of common factors and the estimation of a load matrix are involved. The present embodiment below estimates the above parameters using a principal component method.

From the original feature vectors of the N candidate variables, the covariance matrix is calculated. It is an N x N matrix M, where the element M_ij in row i and column j of M is the covariance of X_i and X_j.

The N eigenvalues and eigenvectors of the covariance matrix M are then calculated. The N eigenvalues, sorted from largest to smallest, are denoted λ₁, λ₂, ..., λ_N, and the N normalized eigenvectors corresponding to the eigenvalues in that order are denoted v₁, v₂, ..., v_N.

The number k of common factors is determined as follows: the present embodiment selects the minimum k such that the sum of the k largest eigenvalues is greater than 0.75.

The load matrix estimated using the principal component method is as follows:
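A common form of the principal component estimate takes column j of the load matrix to be sqrt(λ_j) multiplied by v_j. The following is a minimal sketch under that assumption. It further assumes that the 0.75 threshold is read as a share of the total eigenvalue sum and that the eigenvalues and normalized eigenvectors have already been computed; the function names `choose_k` and `load_matrix` are hypothetical.

```python
import math


def choose_k(eigenvalues, threshold=0.75):
    """Smallest k whose k largest eigenvalues exceed `threshold` of the total.

    Assumption: 0.75 is interpreted as a share of the eigenvalue sum.
    """
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    running = 0.0
    for k, value in enumerate(lam, start=1):
        running += value
        if running / total > threshold:
            return k
    return len(lam)


def load_matrix(eigenvalues, eigenvectors, k):
    """N x k load matrix whose column j is sqrt(lambda_j) * v_j.

    `eigenvalues` must be sorted in descending order, and `eigenvectors[j]`
    must be the normalized eigenvector paired with `eigenvalues[j]`.
    """
    n = len(eigenvectors[0])
    return [[math.sqrt(eigenvalues[j]) * eigenvectors[j][i] for j in range(k)]
            for i in range(n)]


# Toy input: three variables with already-diagonal covariance.
lam = [2.0, 0.9, 0.1]
vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
k = choose_k(lam)
A = load_matrix(lam, vecs, k)
print(k, A)
```

Row i of the result holds the load factors a_{i1}, ..., a_{ik} of variable X_i.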

In another embodiment, in step S103, the step of screening out the second features from the first features based on a plurality of first bins obtained after performing the WOE pre-coding processing on each first feature may include: sub-step S103B1, sub-step S103B2, sub-step S103B3, and sub-step S103B4, as shown in FIG. 4.

Sub-step S103B1: an IV value for each first feature is determined based on the plurality of WOE values of each first feature.

Sub-step S103B2: the first features having an IV value greater than or equal to an IV threshold are screened out for the first time.

Sub-step S103B3: the first features screened out for the first time are divided into a plurality of classes by a factor analysis method.

Sub-step S103B4: in each class, the first feature with the highest IV value and the first feature with the highest R2 value are screened out for the second time, wherein the first features screened out for the second time are the second features.
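Sub-steps S103B1 and S103B2 can be sketched as follows, using the IV expression given later in this description. The sketch assumes that WOE is the natural logarithm of the ratio between the share of Y=1 samples and the share of Y=0 samples in a bin, that every count is positive, and that the IV threshold is 0.02; the threshold value and the function names are assumptions for illustration only.

```python
import math


def woe_and_iv(bins):
    """Return (per-bin WOE list, IV) for one feature.

    `bins` is a list of (n_default, n_non_default) counts, one pair per bin,
    i.e. (#{Y=1, X=x}, #{Y=0, X=x}). Every count is assumed to be positive.
    """
    total_bad = sum(b for b, _ in bins)
    total_good = sum(g for _, g in bins)
    woes, iv = [], 0.0
    for bad, good in bins:
        p_bad = bad / total_bad          # #{Y=1, X=x} / #{Y=1}
        p_good = good / total_good       # #{Y=0, X=x} / #{Y=0}
        woe = math.log(p_bad / p_good)   # assumed WOE sign convention
        woes.append(woe)
        iv += (p_bad - p_good) * woe
    return woes, iv


def screen_by_iv(feature_bins, iv_threshold=0.02):
    """Keep the features whose IV is greater than or equal to the threshold."""
    return [name for name, bins in feature_bins.items()
            if woe_and_iv(bins)[1] >= iv_threshold]


features = {
    "a": [(10, 90), (30, 70)],   # default rate differs between bins
    "b": [(20, 80), (20, 80)],   # identical bins carry no information
}
print(screen_by_iv(features))
```

With this convention each term of the IV sum is non-negative, so a feature whose bins all share the same default rate has an IV of zero and is screened out.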

On this basis, in step S104, before determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity so as to obtain the plurality of second bins of each second feature, the method may further include: performing a third screening, based on the first features screened out the second time, through a logistic regression model using a backward elimination method, wherein the first features screened out the third time are the second features.

In another embodiment, in step S103, the step of screening out the second features from the first features based on a plurality of first bins obtained after performing the WOE pre-coding processing on each first feature may include the following steps:

The first step: the plurality of first features are divided into a plurality of classes by a factor analysis method.

The second step: in each class, the first feature with the highest information value (IV) and the first feature with the highest goodness-of-fit R2 value are screened out for the first time.

The third step: based on the first features screened out for the first time, a second screening is performed through a logistic regression model using a backward elimination method, wherein the first features screened out for the second time are the second features.

Assuming there are 10 candidate variables, the above steps can be decomposed into:

(1) After the WOE transformation is performed, there are 10 candidate variables. Factor analysis is performed on all the candidate variables to obtain 3 common factors, so the load matrix has a size of 10 × 3, and the load matrix is solved.

(2) Each variable is assigned to the class of the common factor on which its load matrix coefficient is the largest.

(3) In each of the k classes (3 classes in this example), two variables are selected: the variable with the highest IV value and the variable with the highest R2 value. A high IV value means that the variable contributes more to the model result, and a high R2 value means that the variable is the most representative in its class. Here, a high contribution means that the variable has a large influence on the probability value output by the model; for simplicity, influence is taken to mean that the variable has the largest correlation with the output probability value. Being representative means having the largest Pearson correlation coefficient with the principal component of the class. The IV value is expressed as follows:

IV = Σ_x ((#{Y=1, X=x} / #{Y=1}) - (#{Y=0, X=x} / #{Y=0})) * WOE(X=x)

where #{A} denotes a count, i.e., the number of samples satisfying condition A, and #{A, B} denotes the number of samples satisfying both conditions A and B.

R2 is a measure of how representative a variable is within its class; it is obtained by squaring the Pearson correlation coefficient between the variable and the first principal component of the class to which it belongs.

The two variables selected for each class may be the same variable, thus leaving a maximum of 2k variables, and the remaining variables are eliminated.

(4) Variable screening is performed on the remaining variables (at most 2k) by iterating a backward elimination method. Specifically, logistic regression modeling is performed on all candidate variables entering this step, and the variance inflation factor (VIF) of every variable is observed; if the VIF of any variable is greater than 4, collinearity exists, and the variable with the highest p value is eliminated.

The variance inflation factor (VIF) is a measure of the severity of multicollinearity in a multiple linear regression model; it represents the ratio of the variance of a regression coefficient estimate to the variance that the estimate would have if the independent variables were not linearly correlated.

(5) Rejecting variables with p values greater than a specified value;

(6) The above steps are repeated until the p values of all variables are less than the specified value (e.g., 0.05) and the VIFs of all variables are less than 4, i.e., until the collinearity of the model has been eliminated.
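Steps (2) to (6) above can be sketched as follows. This is a simplified illustration rather than the embodiment's implementation: `fit` is a hypothetical callback that refits the logistic regression on a list of variable names and returns a p value and a VIF for each, the load matrix, IV values and R2 values are assumed to be precomputed, and the sketch removes one variable per iteration.

```python
def assign_classes(loadings):
    """Map each variable to the index of the common factor with its largest loading."""
    return {name: max(range(len(row)), key=lambda j: row[j])
            for name, row in loadings.items()}


def select_per_class(classes, iv, r2):
    """Keep, in each class, the variable with the highest IV and the one with the highest R2."""
    kept = set()
    for c in set(classes.values()):
        members = [v for v in classes if classes[v] == c]
        kept.add(max(members, key=lambda v: iv[v]))
        kept.add(max(members, key=lambda v: r2[v]))
    return sorted(kept)  # at most 2k variables, fewer when both picks coincide


def backward_eliminate(variables, fit, p_max=0.05, vif_max=4.0):
    """Drop the highest-p variable until all p values and VIFs pass.

    `fit` is a hypothetical callback: given a list of variable names, it
    refits the model and returns {name: (p_value, vif)}.
    """
    current = list(variables)
    while current:
        stats = fit(current)
        worst = max(current, key=lambda v: stats[v][0])
        collinear = any(stats[v][1] > vif_max for v in current)
        if not collinear and stats[worst][0] <= p_max:
            break
        current.remove(worst)
    return current


loadings = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.7]}
classes = assign_classes(loadings)
kept = select_per_class(classes, iv={"a": 0.5, "b": 0.3, "c": 0.2},
                        r2={"a": 0.6, "b": 0.9, "c": 0.4})
print(classes, kept)
```

In practice the callback would wrap whatever regression routine is used for modeling; the thresholds 0.05 and 4 are the example values given in the steps above.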

In an embodiment, in step S104, determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity so as to obtain a plurality of second bins of each second feature may include: determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity and the number of bins is the largest, thereby obtaining a plurality of second bins of each second feature.

WOE reflects the correspondence between the values of the independent variable and the probability of the dependent variable, so the monotonicity of WOE is very important for a good model. For example, when considering the correlation between a user's historical number of overdue payments and the user's default probability, the default probability is expected to increase as the number of overdue payments increases, and is not expected to increase first and then decrease, or to decrease first and then increase. Whether the WOE is monotonic directly determines whether the logic of the model can be interpreted and whether the business party has sufficient confidence to use the model, and monotonicity is also an important way to reduce model overfitting.

In this embodiment, the number of bins is required to be the largest among the binning modes whose WOE values satisfy monotonicity. This makes the interpretation of the second feature more fine-grained while its WOE values remain monotonic, and also reduces the amount of calculation performed by the computer, thereby further improving the modeling speed.

Referring to FIGS. 5-7, the abscissa represents the bin number, the left ordinate represents the number of samples, and the number of samples in each bin (#obs) is shown by the height of the dark gray bars; the right ordinate represents the WOE value, and the WOE value of each bin is shown by the light gray broken line. As shown in FIG. 5, the WOE values do not satisfy monotonicity; in this step, the 2nd bin and the 3rd bin are merged (as shown in FIG. 6) to obtain a WOE plot that is monotonic (as shown in FIG. 7).

The above process may be carried out manually, based on experience, by noticing that merging the 2nd and 3rd bins makes the WOE curve monotonic. The present embodiment, however, uses the monotonicity rule to have a computer adjust the binning automatically.

In step S104, determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity and the number of bins is the largest, so as to obtain a plurality of second bins of each second feature, may further include: sub-step S1041, sub-step S1042, sub-step S1043, sub-step S1044, and sub-step S1045, as shown in FIG. 8.

Sub-step S1041: based on the plurality of first bins of each second feature, the WOE values of every two adjacent bins are compared; if the WOE value of the later bin is larger than that of the earlier bin, the monotonic index is determined to be 1, otherwise the monotonic index is determined to be -1. The current number of first bins is N.

When the number of bins is N, the number of monotonic indices is N-1. If the WOE values of the N bins satisfy monotonicity, the result is N-1 values of 1 or N-1 values of -1. If the WOE values of the N bins do not satisfy monotonicity, the number of 1s and the number of -1s are each smaller than N-1.

Sub-step S1042: it is judged whether the cumulative absolute value r of the monotonic indices of the plurality of first bins of each second feature is equal to N-1, wherein r is equal to the absolute value of the sum of all monotonic indices of the plurality of first bins of each second feature.

In the present embodiment, if the WOE values of the N bins satisfy monotonicity, N-1 values of 1 or N-1 values of -1 are obtained, so r is equal to the absolute value of (N-1) or of -(N-1), i.e., r is equal to N-1. When the WOE values satisfy monotonicity, r is equal to the current number of bins minus 1; when they do not, r is smaller than the current number of bins minus 1.

Sub-step S1043: if r of the plurality of first bins of the second feature is equal to N-1, it is determined that the WOE values of the plurality of first bins of the second feature satisfy monotonicity and that this is the binning mode with the largest number of bins, wherein the plurality of first bins of the second feature are the plurality of second bins of the second feature.

In this step, it is judged whether the WOE values of the current N first bins satisfy monotonicity. If they do, the subsequent steps are skipped; otherwise, the subsequent steps are continued.

Sub-step S1044: if r of the plurality of first bins of the second feature is not equal to N-1, it is determined whether a binning mode with r = N-2 exists among all binning modes of the second feature having N-1 bins; if such a binning mode exists, it is determined that the WOE values of the binning mode with r = N-2 satisfy monotonicity and that its number of bins is the largest, wherein the plurality of bins corresponding to the binning mode with r = N-2 are the plurality of second bins of the second feature.

In this step, when the WOE values of the current N first bins do not satisfy monotonicity, all binning modes having N-1 bins are examined. If a binning mode whose cumulative absolute value r is equal to N-2 can be found among them, the WOE values of that binning mode satisfy monotonicity and its number of bins is the largest, and the subsequent steps can be skipped; otherwise, the subsequent steps are continued.

Sub-step S1045: if no binning mode with r = N-2 exists, it is further determined whether a binning mode with r = N-3 exists among all binning modes of the second feature having N-2 bins, and so on, until the binning mode in which the WOE values of the second feature satisfy monotonicity and the number of bins is the largest is determined, thereby obtaining a plurality of second bins of each second feature.

In this step, when no binning mode having N-1 bins satisfies monotonicity, all binning modes having N-2 bins are examined, and so on, until the binning mode in which the WOE values of the second feature satisfy monotonicity and the number of bins is the largest is determined.
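The monotonic index and the cumulative absolute value r used in sub-steps S1041 to S1043 can be sketched as follows. The function names are hypothetical; equal adjacent WOE values receive an index of -1, following the "otherwise" branch of sub-step S1041.

```python
def monotonic_indices(woes):
    """+1 where the later bin's WOE exceeds the earlier one's, otherwise -1."""
    return [1 if later > earlier else -1
            for earlier, later in zip(woes, woes[1:])]


def cumulative_abs(woes):
    """r: absolute value of the sum of all monotonic indices."""
    return abs(sum(monotonic_indices(woes)))


def is_monotonic(woes):
    """True when r equals the number of bins minus 1."""
    return cumulative_abs(woes) == len(woes) - 1


print(is_monotonic([-0.5, 0.1, 0.4]))   # increasing WOE values
print(is_monotonic([0.1, -0.2, 0.3]))   # falls and then rises
```

Because each index is either 1 or -1, the sum can only reach N-1 in absolute value when all N-1 indices agree, which is why a single comparison of r with N-1 is sufficient.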

In this embodiment, all possible binning modes are traversed. A binning mode is a partition obtained by arbitrarily merging the N previously divided first bins without changing their order, that is, any two or more consecutive first bins can be merged together. If there are N first bins to be partitioned, there are 2^(N-1) binning modes.

In this embodiment, starting from the N previously divided first bins, the WOE values of every two adjacent bins are compared; if the WOE value of the later bin is greater than that of the earlier bin, the monotonic index is determined to be 1, otherwise the monotonic index is determined to be -1. If m is the current number of bins, then after all monotonic indices are summed and the absolute value is taken to obtain the cumulative absolute value r, r is equal to m-1 if and only if the WOE values increase monotonically or decrease monotonically (i.e., satisfy monotonicity). Among all binning modes that satisfy monotonicity, the one with the largest number of bins is selected as the final WOE binning mode; in this way, interpretability is refined and the modeling speed is increased.
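The traversal described above can be sketched as an exhaustive search that tries N bins first, then N-1, and so on. The sketch assumes that the WOE of a merged bin is recomputed from the summed counts of the first bins it contains, with WOE taken as the natural logarithm of the ratio between the Y=1 share and the Y=0 share; when several binning modes with the same number of bins are monotonic, it simply returns the first one found. The function names are hypothetical.

```python
import math
from itertools import combinations


def merged_woes(bins, cuts):
    """WOE of each merged bin; `cuts` are the boundaries kept between first bins."""
    total_bad = sum(b for b, _ in bins)
    total_good = sum(g for _, g in bins)
    edges = [0, *cuts, len(bins)]
    woes = []
    for start, end in zip(edges, edges[1:]):
        bad = sum(b for b, _ in bins[start:end])
        good = sum(g for _, g in bins[start:end])
        woes.append(math.log((bad / total_bad) / (good / total_good)))
    return woes


def best_monotonic_binning(bins):
    """Try N bins, then N-1, ... and return the first binning with r = m - 1.

    `bins` holds (n_default, n_non_default) counts of the ordered first bins.
    Returns (edges, woes), where merged bin i covers first bins
    edges[i] .. edges[i + 1] - 1.
    """
    n = len(bins)
    for m in range(n, 0, -1):                        # m = current number of bins
        for cuts in combinations(range(1, n), m - 1):
            woes = merged_woes(bins, cuts)
            steps = [1 if b > a else -1 for a, b in zip(woes, woes[1:])]
            if abs(sum(steps)) == m - 1:             # monotonic
                return [0, *cuts, n], woes
    return None                                      # only reached for empty input


first_bins = [(10, 90), (30, 70), (20, 80), (50, 50)]
edges, woes = best_monotonic_binning(first_bins)
print(edges, [round(w, 3) for w in woes])
```

Choosing m-1 of the N-1 boundaries for each m enumerates every binning mode exactly once, and summing over m gives the 2^(N-1) modes mentioned above.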

After the wind control model is established, scoring can be performed: application data of new users is received in real time, the application data including all the user features that need to be input into the wind control model to calculate a score; the established logistic regression model is then applied to output the default probability of each new user, the users are sorted by probability value, and whether to grant a loan to each new applicant is determined in combination with the credit granting target.
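The scoring step can be sketched as follows, assuming that each applicant's features have already been replaced by the WOE values of the bins they fall into and that the coefficients and intercept come from the fitted logistic regression. The numeric values and function names below are illustrative assumptions only.

```python
import math


def default_probability(woe_values, coefficients, intercept):
    """Logistic regression output for one applicant's WOE-encoded features."""
    z = intercept + sum(c * x for c, x in zip(coefficients, woe_values))
    return 1.0 / (1.0 + math.exp(-z))


def rank_applicants(applicants, coefficients, intercept):
    """Return (applicant_id, probability) pairs sorted from lowest to highest risk."""
    scored = [(aid, default_probability(x, coefficients, intercept))
              for aid, x in applicants.items()]
    return sorted(scored, key=lambda pair: pair[1])


applicants = {"u1": [0.5, 0.2], "u2": [-0.4, -0.1]}   # hypothetical WOE inputs
for aid, prob in rank_applicants(applicants, coefficients=[1.0, 0.8], intercept=-2.0):
    print(aid, round(prob, 4))
```

The sorted list can then be cut at whatever approval rate or risk level the credit granting target calls for.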

The method of the embodiments of the present application is described in detail below, taking as an example the specific modeling flow of a wind control application scorecard for a joint-stock bank.

The user history data set comprises 55,596 labeled records and 109 feature variables in total: 5 text-type variables, including education level, industry, gender, marital status and region; and 104 numerical variables, including credit balance, number of credit inquiries, income level and the like.

(1) WOE pre-coding is carried out by an automated procedure, after which these variables are converted into the corresponding WOE values. The WOE values at this step may not be monotonic; for example, the inRangeDate variable does not satisfy the WOE monotonicity requirement at this step.

(2) The data is preliminarily screened to eliminate all variables whose IV values do not meet the requirement, and 69 variables are retained.

(3) Through variable clustering, the variables are divided into 18 classes by an automated factor analysis algorithm; in each class, the variable with the highest IV value and the variable with the highest R2 value are retained, so that 36 variables are retained.

(4) Logistic regression modeling is performed and the variables are screened by the backward elimination method; 19 variables finally remain in the model.

(5) Binning fine-tuning is performed on the 19 variables, including the inRangeDate variable of the above example, whose WOE values satisfy the monotonicity requirement after fine-tuning.

The automatic binning fine-tuning process comprises:

(a) For each variable X, the cumulative absolute value r of the monotonic indices is calculated, and it is judged whether r is equal to N-1 (N being the current number of bins); if so, the current binning of the variable is monotonic and no further fine-tuning is needed.

(b) If not, all binning modes having N-1 bins are checked to see whether r = N-2 holds for any of them; if so, the process stops; otherwise, it continues to the next step.

(c) All binning modes having N-2 bins are checked to see whether r = N-3 holds for any of them; if so, the process stops; otherwise, it continues to the next step.

(d) This is repeated until a binning mode that is monotonic is found, at which point fine-tuning of the variable stops.

(6) After the binning fine-tuning is finished, logistic regression modeling by the backward elimination method is performed again, and 17 variables finally remain in the model; at this point every indicator of the model variables meets the requirements and the variables satisfy monotonicity, so the modeling is complete.

(7) Scorecard conversion is carried out according to the finally established logistic regression model, and a credit granting strategy is formulated.

If the wind control model is built manually, binning and WOE conversion of all 109 variables takes two modeling engineers three working days. With the automated modeling process of the embodiment of the present application, the program runs the full modeling process in only two hours, with almost no manual workload.

Referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of a computer device 100 according to the present application, including: a memory 1 and a processor 2; the memory 1 is used for storing a computer program; the processor 2 is configured to execute the computer program and, when executing the computer program, implement the method for establishing a wind control model according to any one of the above. For a detailed description of the related contents, please refer to the related contents of the above method for establishing the wind control model, which will not be described in detail herein.

Wherein the memory 1 and the processor 2 are connected by a bus.

The processor 2 may be a micro-control unit, a central processing unit, a digital signal processor, or the like.

The memory 1 may be a Flash chip, a read-only memory, a magnetic disk, an optical disk, a USB flash drive, or a removable hard disk.

The present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to implement the method of creating a wind control model as defined in any one of the above. For a detailed description of the related contents, please refer to the related contents of the above method for establishing the wind control model, which will not be described in detail herein.

The computer-readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card, a secure digital card, a flash memory card, or the like.

It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

The above description is only for the specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of creating a wind control model, the method comprising:

2. The method of claim 1, wherein performing the feature weight WOE pre-coding processing on each first feature according to the values of each first feature for different users and the label of whether each user defaults comprises:

performing pre-binning processing on each first feature based on the values of each first feature for different users and the label of whether each user defaults, to obtain a plurality of first bins of each first feature;

and performing WOE conversion processing on the plurality of first bins of each first feature respectively, to obtain a plurality of WOE values of each first feature.

3. The method of claim 2, wherein performing pre-binning processing on each first feature based on the values of each first feature for different users and the label of whether each user defaults, to obtain a plurality of first bins of each first feature, comprises:

if the first features comprise text-type features, taking each value of each text-type feature as a bin, and determining the WOE value of each bin;

sorting all the bins according to the WOE value;

performing chi-square merging on all the sorted bins to obtain a plurality of first bins of each text-type feature;

and/or,

if the first features comprise numerical features, sorting according to the values of the numerical features;

dividing the values of the numerical features into a plurality of bins in an equal-frequency manner according to the sorted order;

and performing chi-square merging on all the sorted bins to obtain a plurality of first bins of each numerical feature.

4. The method of claim 1, wherein determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity, to obtain the plurality of second bins of each second feature, comprises:

determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity and the number of bins is the largest, thereby obtaining a plurality of second bins of each second feature.

5. The method of claim 4, wherein determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity and the number of bins is the largest, to obtain the plurality of second bins of each second feature, comprises:

based on the plurality of first bins of each second feature, comparing the WOE values of every two adjacent bins; if the WOE value of the later bin is larger than that of the earlier bin, determining that the monotonic index is 1, otherwise determining that the monotonic index is -1, the current number of first bins being N;

judging whether the cumulative absolute value r of the monotonic indices of the plurality of first bins of each second feature is equal to N-1, wherein r is equal to the absolute value of the sum of all monotonic indices of the plurality of first bins of each second feature;

if r of the plurality of first bins of the second feature is equal to N-1, determining that the WOE values of the plurality of first bins of the second feature satisfy monotonicity and that this is the binning mode with the largest number of bins, wherein the plurality of first bins of the second feature are the plurality of second bins of the second feature;

if r of the plurality of first bins of the second feature is not equal to N-1, determining whether a binning mode with r = N-2 exists among all binning modes of the second feature having N-1 bins, and if such a binning mode exists, determining that the WOE values of the binning mode with r = N-2 satisfy monotonicity and that its number of bins is the largest, wherein the plurality of bins corresponding to the binning mode with r = N-2 are the plurality of second bins of the second feature;

and if no binning mode with r = N-2 exists, further determining whether a binning mode with r = N-3 exists among all binning modes of the second feature having N-2 bins, and so on, until the binning mode in which the WOE values of the second feature satisfy monotonicity and the number of bins is the largest is determined, thereby obtaining a plurality of second bins of each second feature.

6. The method of claim 1, wherein the step of screening out second features from the first features based on a plurality of first bins obtained after the WOE pre-coding processing is performed on each first feature comprises:

dividing the plurality of first features into a plurality of classes by a factor analysis method;

and screening out for the first time, in each class, the first feature with the highest information value IV and the first feature with the highest goodness-of-fit R2 value, wherein the first features screened out for the first time are the second features.

7. The method of claim 1, wherein the step of screening out second features from the first features based on a plurality of first bins obtained after the WOE pre-coding processing is performed on each first feature comprises:

determining an IV value for each first feature based on the plurality of WOE values of each first feature;

screening out for the first time the first features having an IV value greater than or equal to an IV threshold;

dividing the first features screened out for the first time into a plurality of classes by a factor analysis method;

and screening out for the second time, in each class, the first feature with the highest IV value and the first feature with the highest R2 value, wherein the first features screened out for the second time are the second features.

8. The method of claim 7, wherein before determining, based on all possible binning modes of the plurality of first bins of each second feature, the binning mode in which the WOE values of each second feature satisfy monotonicity to obtain the plurality of second bins of each second feature, the method further comprises:

performing a third screening, based on the first features screened out for the second time, through a logistic regression model using a backward elimination method, wherein the first features screened out for the third time are the second features.

9. The method of claim 1, wherein the classification model is a binary classification model comprising a logistic regression model; the first features include text-type features and numerical features, the text-type features including gender, education level and occupation, and the numerical features including age, asset data, transaction data and activity data.

10. The method of claim 1, wherein the obtaining a user history data set comprises: acquiring a user historical data set for training the wind control model;

the method further comprises the following steps: acquiring a user historical data set for testing the wind control model; and testing the wind control model by using the user historical data set for testing the wind control model.

11. A computer device, the computer device comprising: a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and, when executing the computer program, to implement the method of establishing a wind control model according to any of claims 1-10.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of establishing a wind control model according to any one of claims 1-10.