CN114170000A

CN114170000A - Credit card user risk category identification method, device, computer equipment and medium

Info

Publication number: CN114170000A
Application number: CN202111485316.0A
Authority: CN
Inventors: 汪志艺; 王伟权; 杨俊勉; 吴佳文
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2022-03-11

Abstract

The application relates to a credit card user risk category identification method, a credit card user risk category identification device, a computer device, a storage medium and a computer program product, which are applied to the field of big data. The method and the device can improve the identification accuracy of the risk categories of the credit card users. The method comprises the following steps: acquiring user information of a credit card user; screening a plurality of risk influence factors from the user information by using a stepwise regression model and a multivariate logistic regression model; and inputting the attribute values corresponding to the risk influence factors into a pre-constructed decision tree model for traversal, and acquiring the risk category to which the credit card user belongs, which is output by the decision tree model.

Description

Credit card user risk category identification method, device, computer equipment and medium

Technical Field

The present application relates to the field of big data technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for identifying a risk category of a credit card user.

Background

With the growing popularity of credit cards, the credit card business plays an increasingly important role in the banking system. Predicting whether a credit card user is a risk user allows risk users and high-quality users to be identified as early as possible. The bank can then promptly reduce the credit limit of risk users and moderately raise the credit limit of high-quality users, which reduces the losses caused by risk users and allows better service to be provided to high-quality users.

At present, techniques for identifying credit card user risk categories generally build models on all of the user information related to a user. Because these techniques do not distinguish among or screen that information, the user information that has a more significant influence on credit card risk is difficult to extract effectively, and the accuracy with which such methods identify credit card user risk categories is not high enough.

Disclosure of Invention

In view of the above, it is necessary to provide a method, an apparatus, a computer device, a computer readable storage medium and a computer program product for identifying a risk category of a credit card user.

In a first aspect, the present application provides a method for identifying risk categories of credit card users. The method comprises the following steps:

acquiring user information of a credit card user;

screening the user information through a stepwise regression model and a multiple logistic regression model to obtain a plurality of risk influence factors;

and inputting the attribute values corresponding to the risk influencing factors into a pre-constructed decision tree model for traversal, and acquiring the risk category to which the credit card user belongs, which is output by the decision tree model.

In one embodiment, before the attribute values corresponding to the risk influencing factors are input into a pre-constructed decision tree model for traversal to obtain the credit risk value of the credit card user, the method further includes:

acquiring training sample data corresponding to the plurality of influence factors;

and training a decision tree model taking C4.5 as a basic structure by using the training sample data to obtain the pre-constructed decision tree model.

In one embodiment, the training of the C4.5-based decision tree model using the training sample data to obtain the pre-constructed decision tree model includes:

placing a training sample data set on a parent node;

for each parent node, traversing the segmentation modes corresponding to the training sample data set on the parent node to find the optimal segmentation mode;

and dividing the training sample data set into a plurality of sub data sets by taking the threshold value in the optimal segmentation mode as a division point, respectively distributing the plurality of sub data sets to the child nodes corresponding to the parent node, and adjusting model parameters in the decision tree model taking C4.5 as the basic structure until the sub data sets on the child nodes cannot be further divided, so as to obtain the pre-constructed decision tree model.

In one embodiment, the stepwise regression model is a backward stepwise regression model; the screening of the user information through the stepwise regression model and the multiple logistic regression model to obtain a plurality of influence factors comprises the following steps:

screening the user information through the backward stepwise regression model to obtain a first variable set;

and screening the plurality of influence factors from the first variable set through a multiple logistic regression model.

In one embodiment, the screening of the user information through the backward stepwise regression model to obtain a first variable set includes:

constructing an explanatory variable set based on all the user information;

constructing a linear regression model by taking each explanatory variable in the explanatory variable set as an independent variable and taking the risk user category as the dependent variable;

performing multiple rounds of significance verification on the linear regression model; in each round of significance verification, calculating the contribution value of each independent variable to the dependent variable and eliminating the independent variable with the minimum contribution value to obtain the next linear regression model, until no further independent variable is eliminated, and taking the independent variables that have not been eliminated as the first variable set.

In one embodiment, the multiple logistic regression model is constructed based on a Sigmoid function.

In a second aspect, the application also provides a credit card user risk category identification device. The device comprises:

the user information acquisition module is used for acquiring the user information of the credit card user;

the risk influence factor screening module is used for screening a plurality of risk influence factors from the user information through a stepwise regression model and a multiple logistic regression model;

and the risk category output module is used for inputting the attribute values corresponding to the risk influence factors into a pre-constructed decision tree model for traversal, and acquiring the risk category to which the credit card user belongs, which is output by the decision tree model.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the credit card user risk category identification method embodiment when executing the computer program.

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps in the above-described credit card user risk category identification method embodiments.

In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above-described credit card user risk category identification method embodiments.

According to the above method, device, computer equipment, storage medium and computer program product for identifying the risk category of a credit card user, user information of a credit card user is acquired; a plurality of risk influence factors are screened from the user information using a stepwise regression model and a multiple logistic regression model; and the attribute values corresponding to the risk influence factors are input into a pre-constructed decision tree model for traversal, and the risk category to which the credit card user belongs, output by the decision tree model, is acquired. Because the risk influence factors are effectively screened out of the user information and the risk category of the credit card user is identified by means of the decision tree model, the accuracy of identifying credit card user risk categories is improved. Whether a user poses a credit default risk can be identified specifically from existing user data, which helps reduce the credit card default rate, and credit card business adjustments can further be applied for users without credit default risk to optimize overall user service.

Drawings

FIG. 1 is a diagram of an exemplary embodiment of a method for identifying risk categories of credit card users;

FIG. 2 is a flow diagram illustrating a method for identifying risk categories for credit card users in one embodiment;

FIG. 3 is a verification result of significance verification using a multiple logistic regression model in one embodiment;

FIG. 4 is a graph illustrating the results of screening variables using a stepwise regression model according to one embodiment;

FIG. 5 is a graph showing the results of one embodiment of variable screening using stepwise regression models after gender culling;

FIG. 6 is a diagram of a decision tree model in one embodiment;

FIG. 7 is a block diagram showing the structure of a risk category identification means for credit card users in one embodiment;

FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.

FIG. 9 is a diagram illustrating an internal structure of a computer device in another embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. In addition, the data and information to which the present disclosure relates may be data and information that is authorized by a user or sufficiently authorized by various parties.

The credit card user risk category identification method provided by the embodiment of the application can be applied to the application environment shown in fig. 1, in which the terminal 101 communicates with the server 102 via a network. A data storage system may store the data that the server 102 needs to process. The data storage system may be integrated on the server 102, or may be located on the cloud or on another network server. The terminal 101 may be, but is not limited to, any of various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices. The internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable devices may be smart watches, smart bracelets, head-mounted devices, and the like. The server 102 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.

In one embodiment, as shown in fig. 2, a method for identifying risk categories of credit card users is provided, which is illustrated by applying the method to the server 102 in fig. 1, and includes the following steps:

Step S201, obtaining user information of a credit card user;

Here, a credit card user refers to a consumer who holds a credit card. User information refers to data related to the credit card user, including gender, age, marital status, education level, residence type, occupation category, working age (years of employment), personal income, insurance payment, vehicle condition, and the like.

It should be emphasized that the user information is information that the user has voluntarily made public, or that has been certified by a lawful institution and made public for lawful and legitimate reasons. In addition, the user information used in this patent refers to enterprise-related business data that may be used within the scope permitted by law and that the party using this patent has acquired within its own business scope or program permissions, or to data related to the target enterprise that the party using this patent has been authorized to obtain and use lawfully, for example user information used only for internal analysis, or only for analysis that supports precision marketing or service improvement within the permitted business scope.

Specifically, the server 102 obtains the credit card user information from each consumer terminal through a data collection tool, for example from the information the user registered in an installed credit application. The user information may also be a data set of credit card users entered by bank staff through a terminal device, within the scope permitted by law and by their responsibilities.

Step S202, a plurality of risk influence factors are screened from the user information through a stepwise regression model and a multiple logistic regression model;

Here, stepwise regression is a statistical method for understanding the relationship between independent variables and a dependent variable. It is also a variable-screening procedure: a stepwise regression model builds a regression model from a set of candidate variables and lets the procedure identify the influential variables automatically. The basic idea is to introduce variables into the model one at a time. After each explanatory variable is introduced, an F-test is performed, and the explanatory variables already selected are t-tested one by one; an explanatory variable introduced earlier is deleted if it is no longer significant once later explanatory variables have been introduced. This ensures that the regression equation contains only significant variables before each new variable is introduced. The process iterates until no significant explanatory variable remains to be added to the regression equation and no insignificant explanatory variable remains to be removed from it, so that the final set of explanatory variables is optimal.
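The backward variant of stepwise regression named in the disclosure (fit on all variables, then repeatedly drop the variable that contributes least) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the applicant's implementation: the "contribution value" is assumed here to be the partial F statistic of each variable, the stopping threshold `f_to_remove=4.0` is an arbitrary illustrative choice, and the helper names `solve`, `rss` and `backward_eliminate` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(rows, y, cols):
    """Residual sum of squares of a least-squares fit of y on an intercept plus cols."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, x))) ** 2 for x, t in zip(X, y))


def backward_eliminate(rows, y, names, f_to_remove=4.0):
    """Drop the weakest variable each round until every remaining one passes the test.

    Assumption: "contribution" is measured by the partial F statistic, i.e. how
    much the residual sum of squares grows when the variable is left out.
    """
    kept = list(range(len(names)))
    while kept:
        full = rss(rows, y, kept)
        dof = len(rows) - len(kept) - 1  # residual degrees of freedom
        scores = {
            c: (rss(rows, y, [k for k in kept if k != c]) - full) / (full / dof)
            for c in kept
        }
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= f_to_remove:
            break  # every remaining variable contributes enough
        kept.remove(weakest)
    return [names[c] for c in kept]
```

The sketch assumes the fit is not perfect (a zero residual would make the F statistic undefined) and that there are more samples than variables.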

Multiple logistic regression, also known as multi-class logistic regression, is a statistical model that in essence uses several logistic regression functions, each of which describes one category in comparison with a reference category.
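Since the disclosure states that the multiple logistic regression model is built on a Sigmoid function, the two-category case used here (risk user or not) can be sketched as below. The function names `sigmoid` and `risk_probability`, and the idea of passing coefficients in explicitly, are illustrative assumptions; the source does not give any fitted coefficients.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def risk_probability(features, weights, intercept):
    """Probability that Y = 1 (risk user) for one user under a logistic model.

    features and weights are parallel sequences; the coefficients are assumed
    to come from a model fitted elsewhere.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)
```

A probability above 0.5 would correspond to predicting the risk category, although the patent itself uses the regression only to judge which variables are significant.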

Specifically, because the user information obtained in step S201 is extensive and varied, the factors that correlate strongly with, or strongly influence, the classification result need to be selected from it. A multiple logistic regression model is established with gender X1, age X2, marital status X3, education level X4, residence type X5, occupation category X6, working age X7, personal income X8, insurance payment X9 and vehicle condition X10 as independent variables and the risk category Y (that is, whether the user is a credit risk user) as the dependent variable. According to the multiple logistic regression result (shown in fig. 3), the two variables marital status and occupation category are not significant. In fig. 3, Pr(>|z|) is the significance measure of the result: a value below 0.05 indicates that the variable is significant, a value above 0.05 indicates that it is not, and the larger the value of Pr(>|z|), the less significant the variable. The variables are then screened by stepwise regression, and the variables that remain are gender X1, age X2, education level X4, residence type X5, working age X7, personal income X8, insurance payment X9 and vehicle condition X10. Gender X1 proves insufficiently significant when the model is re-established on these variables and is also removed, so the risk influence factors that finally have a more significant influence on the risk category are the following 7: age X2, education level X4, residence type X5, working age X7, personal income X8, insurance payment X9 and vehicle condition X10.

Step S203, inputting the attribute values corresponding to the risk influencing factors into a pre-constructed decision tree model for traversal, and acquiring the risk category to which the credit card user belongs, which is output by the decision tree model.

The decision tree model is a method for approximating discrete-valued functions. It is a typical classification method that first processes the data, generates readable rules and a decision tree using an induction algorithm, and then uses the resulting decision tree to analyze new data. In essence, a decision tree classifies data through a series of rules. Common decision tree algorithms include the ID3 algorithm, the C4.5 algorithm, the C5.0 algorithm and the CART algorithm. Because the variables used in this application include both discrete variables (e.g., education level) and continuous variables (e.g., personal income), the C4.5 algorithm is used herein to construct the decision tree model.

Specifically, the attribute values of the user's 7 risk influence factors are input into the pre-constructed decision tree model, which traverses and classifies them and finally outputs the risk category to which the user belongs, for example that the user is a credit risk user or that the user is a user without credit risk.

In this embodiment, user information of a credit card user is acquired; a plurality of risk influence factors are screened from the user information using a stepwise regression model and a multiple logistic regression model; and the attribute values corresponding to the risk influence factors are input into a pre-constructed decision tree model for traversal, and the risk category to which the credit card user belongs, output by the decision tree model, is acquired. Because the risk influence factors are effectively screened out of the user information and the risk category of the credit card user is identified by means of the decision tree model, the accuracy of identifying credit card user risk categories is improved. Whether a user poses a credit default risk can be identified specifically from existing user data, which helps reduce the credit card default rate, and credit card business adjustments can further be applied for users without credit default risk to optimize overall user service.

In an embodiment, before step S203, the method further includes: acquiring training sample data corresponding to the plurality of influence factors; and training a decision tree model that takes C4.5 as its basic structure using the training sample data, to obtain the pre-constructed decision tree model.

Specifically, the training process includes:

(1) Acquiring training sample data

The training data of this embodiment is derived from a classic credit card data set (a data set downloaded from an open source platform, whose data has been authorized by the users). It includes gender, age, marital status, education level, residence type, occupation category, working age, personal income, insurance payment, vehicle condition, and whether the user is a risk user. The data set covers essentially all relevant aspects, so the accuracy of predicting users' credit card risk with this data set is relatively high.

(2) Data pre-processing

Because the data in the classic credit card data set has typically already been cleaned (that is, it is assumed to contain no outliers or missing values), only the variables need to be converted to numeric codes. Specifically, the gender column has two values, male and female: female is converted to 0 and male to 1. The marital status column has four values, unmarried, married, divorced and widowed: unmarried is converted to 0, married to 1, divorced to 2 and widowed to 3. The education level column has five values, junior high school and below, senior high school, junior college, bachelor's degree, and master's degree and above: junior high school and below is converted to 0, senior high school to 1, junior college to 2, bachelor's degree to 3, and master's degree and above to 4. For the occupation category column, self-employed is converted to 0, state-owned enterprise to 1, private enterprise to 2, foreign enterprise to 3 and other enterprises to 4. For the residence type column, owned housing is converted to 0, rented housing to 1 and other to 2. For the vehicle condition column, no vehicle is converted to 0 and having a vehicle to 1. For the insurance payment column, no payment is converted to 0 and payment to 1. For the personal income column, a personal annual income of 100,000 yuan or less is regarded as low income and converted to 0; a personal annual income of more than 100,000 yuan and less than 500,000 yuan is regarded as medium income and converted to 1; and a personal annual income of 500,000 yuan or more is regarded as high income and converted to 2.
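The numeric coding described above can be sketched as a set of lookup tables plus one banding function. The English category labels, the dictionary names and the record field names below are assumptions made for illustration; only the numeric codes and the income thresholds (100,000 and 500,000) come from the description.

```python
# Category-to-code tables following the description; the label strings are
# hypothetical English renderings of the data set's category names.
GENDER = {"female": 0, "male": 1}
MARITAL_STATUS = {"unmarried": 0, "married": 1, "divorced": 2, "widowed": 3}
EDUCATION = {
    "junior_high_or_below": 0,
    "senior_high": 1,
    "junior_college": 2,
    "bachelor": 3,
    "master_or_above": 4,
}
OCCUPATION = {
    "self_employed": 0,
    "state_owned": 1,
    "private": 2,
    "foreign": 3,
    "other": 4,
}
RESIDENCE = {"owned": 0, "rented": 1, "other": 2}
YES_NO = {"no": 0, "yes": 1}  # used for vehicle condition and insurance payment


def income_level(annual_income):
    """Band annual income: <= 100,000 is low (0), < 500,000 is medium (1), else high (2)."""
    if annual_income <= 100_000:
        return 0
    if annual_income < 500_000:
        return 1
    return 2


def encode_record(record):
    """Convert one raw record (a dict with hypothetical field names) to numeric codes."""
    return {
        "gender": GENDER[record["gender"]],
        "marital_status": MARITAL_STATUS[record["marital_status"]],
        "education": EDUCATION[record["education"]],
        "occupation": OCCUPATION[record["occupation"]],
        "residence": RESIDENCE[record["residence"]],
        "vehicle": YES_NO[record["vehicle"]],
        "insurance": YES_NO[record["insurance"]],
        "income": income_level(record["annual_income"]),
    }
```

Age and working age are already numeric and would be passed through unchanged.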

(3) Screening variables

The data set is divided into a training set and a test set, where the training set accounts for 80% of the data set and the test set for 20%. A multiple logistic regression model is established with gender X1, age X2, marital status X3, education level X4, residence type X5, occupation category X6, working age X7, personal income X8, insurance payment X9 and vehicle condition X10 as independent variables and whether the user is a risk user, Y, as the dependent variable (see fig. 3). As fig. 3 shows, the two variables marital status and occupation category are not significant. The variables are then screened by stepwise regression (see fig. 4), and the variables remaining after screening are gender X1, age X2, education level X4, residence type X5, working age X7, personal income X8, insurance payment X9 and vehicle condition X10.

A multiple logistic regression model is then established on the remaining variables (see fig. 5), and gender X1 is found to be insufficiently significant. After gender X1 is removed and the multiple logistic regression model is re-established, all remaining variables are found to be significant. The decision tree model is therefore built with age X2, education level X4, residence type X5, working age X7, personal income X8, insurance payment X9 and vehicle condition X10 as independent variables and whether the user is a risk user, Y, as the dependent variable.
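The 80%/20% division of the data set mentioned at the start of this step can be sketched as follows. The source does not say how records are assigned to the two sets, so the random shuffle, the fixed `seed` and the function name `split_train_test` are assumptions for illustration.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle the records reproducibly and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

For example, `split_train_test(range(100))` yields 80 training records and 20 test records with no record in both sets.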

(4) Training decision trees

The construction process of the decision tree comprises the following steps:

4.1, putting all training data sets on the root node.

4.2, traversing each division mode of each attribute and finding the best division point.

4.3, dividing the root node into a plurality of child nodes (two or more) according to the best division point found in 4.2.

4.4, repeating steps 4.2 and 4.3 for the remaining samples and attributes until the data in each child node belongs to the same class.

In this embodiment, the pre-constructed decision tree model is obtained by acquiring the training data set and training the C4.5 model with it, which provides the basis for subsequently classifying users.

In an embodiment, the training of the decision tree model with C4.5 as a basic structure using the training sample data to obtain the pre-constructed decision tree model includes:

placing a training sample data set on a parent node; for each parent node, traversing the segmentation modes corresponding to the training sample data set on the parent node to find the optimal segmentation mode; and taking the threshold value in the optimal segmentation mode as a division point, dividing the training sample data set into a plurality of sub data sets, respectively distributing the plurality of sub data sets to the child nodes corresponding to the parent node, and adjusting model parameters in the decision tree model taking C4.5 as the basic structure until the sub data sets on the child nodes cannot be further divided, so as to obtain the pre-constructed decision tree model.

The decision tree is essentially a tree of nodes, composed of nodes and branches. In the tree, if a node (attribute) has a level above it, the node at that level is called its parent node; if it has no level above it, it has no parent node. The nodes at the level below a node are called its child nodes. The topmost node is called the root node, and every node except the root node has a parent node.

Specifically, the C4.5 algorithm is used as the base model from which the trained decision tree model is obtained. C4.5 is a classic algorithm for generating decision trees and is an extension and optimization of the ID3 algorithm. It uses the information gain ratio to split nodes.
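The information gain ratio that C4.5 uses as its splitting criterion can be sketched for a single categorical attribute as below. This is a generic textbook formulation rather than code from the application, and the function names `entropy` and `gain_ratio` are illustrative.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain ratio of splitting labels by a categorical attribute.

    gain       = H(labels) - sum over groups of |group| / |all| * H(group labels)
    split info = H(values), the entropy of the attribute itself
    ratio      = gain / split info (0.0 if the attribute takes a single value)
    """
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

At each node the attribute with the highest gain ratio would be chosen. For a continuous attribute, candidate thresholds can be turned into a two-valued attribute (at or below the threshold, above it) and scored the same way.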

The construction process of the decision tree comprises the following steps:

4.1, putting all training data sets on the root node.

In this embodiment, using age X2, education level X4, residence type X5, working age X7, personal income X8, insurance payment X9 and vehicle condition X10, modeling is performed using C4.5 algorithm, and the decision tree model is shown in fig. 6, where the first layer used by the decision tree is personal income. And if the annual income of the individual is less than 50 ten thousand, the first category is taken as the first layer of the decision tree. And if the annual income is more than or equal to 50 ten thousand, the second type is used as the first layer of the decision tree. For personal annual income less than 50 million, the vehicle condition is taken as the second level of the decision tree. The vehicle condition is none, as a first class at the second level of the decision tree, and the vehicle condition is present, as a second class at the second level of the decision tree. As can be seen from the figure, 100% is among the at risk users for annual incomes less than 50 ten thousand and vehicle conditions are none. And regarding the personal income less than 50 ten thousand and the vehicle condition is certain, regarding the personal annual income as the third layer of the decision tree, wherein the annual income less than or equal to 10 ten thousand yuan is the first type of the third layer of the decision tree, and the annual income more than 10 ten thousand yuan is the second type of the third layer of the decision tree. When the annual income is less than or equal to 10 ten thousand yuan and the vehicle condition is in some cases, the vehicle condition has a probability of 94.29% belonging to the risk user, so the user in the condition is judged as the risk user by the application. 
And in the case that the annual income is between 10 ten thousand and 50 ten thousand and the vehicle condition is certain, the age is taken as the fourth layer of the decision tree, the age of less than or equal to 28 years is taken as the first class of the fourth layer of the decision tree, and the age of more than 28 years is taken as the second class of the fourth layer of the decision tree. When the annual income is between 10 and 50 ten thousand, the vehicle condition is present and the age is equal to or less than 28 years old, there is a probability of 92.47% belonging to the risky user, and therefore the patent judges the user in such a condition as the risky user. When the annual income is between 10 and 50 ten thousand, the vehicle condition is present and the age is more than 28 years old, there is a probability of 9.54% belonging to the risky users, so the patent judges the users in such a condition as users without risk. For the condition that the annual income of the individual is more than or equal to 50 ten thousand, insurance payment is of a first type which is not used as the second layer of the decision tree algorithm, and if the insurance payment is available, the probability of 100 percent is the user without risk. And if the annual income of the individual is more than or equal to 50 ten thousand and the insurance payment is zero, taking the age as the third layer of the decision tree. And taking the age less than or equal to 35 as the first class of the third layer of the decision tree, and taking the age more than 35 as the second class of the third layer of the decision tree. And regarding the condition that the annual income of the individual is more than or equal to 50 ten thousand, the insurance contribution is absent and the age is more than 35, the education degree is taken as the fourth layer of the decision tree. 
The first category of the fourth layer of the decision tree is the first category of the fourth layer of the decision tree with the academic history at high school and above, and the second category of the fourth layer of the decision tree is the second category of the fourth layer of the decision tree with the academic history at junior school and below. When the annual income of an individual is more than or equal to 50 ten thousand, insurance payment is zero, the age is more than 35, and the school calendar is high and above, and the probability of 6.78 percent is a risk user, so the user in the condition is judged to be a non-risk user by the application. When the annual income of an individual is more than or equal to 50 ten thousand, insurance payment is zero, the annual income is more than 35, and the academic history is in the junior middle school and below, and 80% of probability is risk users. And regarding the condition that the annual income of the individual is more than or equal to 50 ten thousand, the insurance contribution is absent and the age is less than or equal to 35, and regarding the education degree as the fourth layer of the decision tree. The academic records of high school, middle school, lower school, and subject are taken as the first class of the fourth layer of the decision tree, the academic records are taken as the second class of the fourth layer of the decision tree in the profession, and the academic records are taken as the third class of the fourth layer of the decision tree in the greater or higher school. When the personal income is more than or equal to 50 ten thousand, the insurance payment is zero, the age is less than or equal to 35, the academic history is high, middle and lower, and the probability of 97.22% of the users in the family is the risk users, so the users in the situation are judged to be the risk users by the application. 
When the personal annual income is 500,000 or more, there is no insurance payment, the age is 35 or less and the education level is master's degree or above, the user is a non-risk user with a probability of 100%, so the present application judges such users to be non-risk users. When the personal annual income is 500,000 or more, there is no insurance payment, the age is 35 or less and the education level is junior college, age is taken as the fifth layer of the decision tree, with an age of 32 or less as the first class of that layer and an age above 32 as the second class. When the personal annual income is 500,000 or more, there is no insurance payment, the age is 32 or less and the education level is junior college, 100% of users are risk users. When the personal annual income is 500,000 or more, there is no insurance payment, the age is above 32 and the education level is junior college, 100% of users are non-risk users.
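The traversal of the branch for a personal annual income of 500,000 or more can be sketched as nested rules. This is only an illustrative sketch, not the applicant's implementation: the function name, the education labels, and the choice to label the 80% leaf as a risk leaf (by majority) are assumptions, as is the reading of the first education class below age 35 as including undergraduate.

```python
# Hypothetical education labels, ordered from lowest to highest.
HIGH_SCHOOL_OR_ABOVE = {"high_school", "junior_college", "undergraduate", "master_or_above"}

def classify_high_income(has_insurance, age, education):
    """Traverse the branch for personal annual income >= 500,000.

    Returns "risk" or "non-risk" following the leaf probabilities in the text.
    """
    # Layer 2: insurance payment present -> non-risk (100%).
    if has_insurance:
        return "non-risk"
    # Layer 3: age split at 35.
    if age > 35:
        # Layer 4: high school or above -> 6.78% risk; otherwise 80% risk
        # (treating the 80% leaf as "risk" is an assumption).
        return "non-risk" if education in HIGH_SCHOOL_OR_ABOVE else "risk"
    # Age <= 35, layer 4: three education classes.
    if education == "master_or_above":
        return "non-risk"            # 100% non-risk
    if education == "junior_college":
        # Layer 5: age split at 32.
        return "risk" if age <= 32 else "non-risk"
    return "risk"                    # first class: 97.22% risk

print(classify_high_income(False, 30, "junior_college"))
```

Each `if` corresponds to one layer of the tree, so a user is classified by following a single root-to-leaf path.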

Optionally, the method calculates the model accuracy of the decision tree model through a confusion matrix or an AUC value, in order to verify the accuracy of the model's predictions. A confusion matrix (also called an error matrix or probability table) is a matrix used to visualize model performance and is commonly used with supervised learning models. In the confusion matrix, each column represents a predicted category and each row represents an actual category. In this implementation, to evaluate the model, the test-set data is fed into the model trained on the training set for prediction, the predictions are compared with whether each bank user is actually a risk user, and a confusion matrix is established. The result is shown in Table 1:

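TABLE 1 Confusion matrix (0 = non-risk user, 1 = risk user)

Actual category | Predicted 0 | Predicted 1
0 | 275 | 11
1 | 12 | 892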

In the confusion matrix, the first column gives the actual category. In this embodiment, the number of bank users who are actually non-risk users (denoted by 0 in the first column of the table) is 275 + 11 = 286, of whom 275 are correctly predicted by the model as non-risk users and 11 are incorrectly predicted as risk users; the accuracy for non-risk customers is therefore 275/286 = 96.15%. The number of bank users who are actually risk users (denoted by 1 in the first column of the table) is 12 + 892 = 904, of whom 892 are correctly predicted as risk users and 12 are incorrectly predicted as non-risk users, so the accuracy for risk customers is 892/904 = 98.67%. Overall, the accuracy of the model is (275 + 892)/1190 = 98.07%, which is very high.
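The arithmetic above can be reproduced with a short sketch. The dictionary layout and function names are illustrative assumptions; only the four counts come from the text.

```python
# Counts keyed by (actual, predicted); 0 = non-risk user, 1 = risk user.
confusion = {(0, 0): 275, (0, 1): 11, (1, 0): 12, (1, 1): 892}

def per_class_accuracy(cm, label):
    """Share of users whose actual category is `label` that were predicted correctly."""
    total = sum(n for (actual, _), n in cm.items() if actual == label)
    return cm[(label, label)] / total

def overall_accuracy(cm):
    """Share of all users that were predicted correctly."""
    correct = sum(n for (actual, predicted), n in cm.items() if actual == predicted)
    return correct / sum(cm.values())

non_risk_acc = per_class_accuracy(confusion, 0)
risk_acc = per_class_accuracy(confusion, 1)
overall = overall_accuracy(confusion)
print(f"{non_risk_acc:.2%} {risk_acc:.2%} {overall:.2%}")  # 96.15% 98.67% 98.07%
```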

Optionally, in this embodiment, in order to verify the validity of the detection method, the quality of the model may further be judged by the AUC (Area Under Curve) value. The AUC value is defined as the area enclosed by the ROC (Receiver Operating Characteristic) curve and the coordinate axes, and is a standard measure of the quality of a classification model: the larger the AUC value, the better the classifier. When AUC = 1, the classifier is perfect, meaning that with this prediction model a perfect prediction is obtained regardless of the threshold that is set. In most prediction scenarios, no perfect classifier exists. If the AUC value is greater than 0.9, the model performs well. The results of the present application are shown in fig. 3-3, from which it can be seen that AUC = 0.9730, which is greater than 0.9 and shows that the model performs well.
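As a minimal sketch of how an AUC value can be computed, the function below uses the pairwise ranking formulation, which is equal to the area under the ROC curve: the fraction of (risk, non-risk) pairs in which the risk user receives the higher score, with ties counted as one half. The function name and the toy scores are assumptions for illustration and are not data from the application.

```python
def auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count a win when the positive scores higher, half a win on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic-time form is adequate for a test set of about a thousand users; a sort-based variant would be preferable for much larger sets.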

In the embodiment, the C4.5 model is trained through the training sample data set to obtain the pre-constructed decision tree model, so that a mathematical basis is provided for subsequently identifying and predicting the risk category to which the user belongs. And the C4.5 model is used for training, so that the accuracy of identifying the risk category by the model is further improved.
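C4.5 is generally described as choosing the split attribute by gain ratio, that is, information gain divided by the split information of the attribute. The passage above does not spell out the criterion, so the following is only a hedged sketch of that standard measure for a categorical attribute; the function names and sample values are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the categorical attribute `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)  # penalizes attributes with many distinct values
    return info_gain / split_info if split_info > 0 else 0.0

# An attribute that separates the two classes perfectly has gain ratio 1.
print(gain_ratio(["yes", "yes", "no", "no"], [0, 0, 1, 1]))
```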

In one embodiment, the stepwise regression model is a backward stepwise regression model; the step S202 includes: screening user information through a backward stepwise regression model to obtain a first variable set; and screening a plurality of influence factors from the first variable set through a multiple logistic regression model.

Backward stepwise regression (also called the backward elimination method) is a method commonly used in regression modelling, in which one variable is removed from the model at a time.

Specifically, a regression equation containing all variables is fitted first, and the hypothesis-test criterion that an independent variable must meet to remain in the regression equation is specified in advance. The independent variables are then tested in ascending order of their contribution to the dependent variable Y, and those without statistical significance are removed one at a time. Each time an independent variable is removed, the equation is refitted, the contributions of the remaining independent variables to Y are re-checked, and it is decided whether the variable that now contributes least to the model should be removed. This process is repeated until every independent variable in the regression equation meets the retention criterion and none can be removed. Only removal is considered in this process: once an independent variable has been removed, it is never reintroduced into the regression equation. The first variable set obtained by screening the user information through the backward stepwise regression model comprises: gender X1, age X2, education level X4, residence type X5, working years X7, personal income X8, insurance payment X9 and vehicle condition X10. A multiple logistic regression model is then established on the first variable set. The multiple logistic regression model applies to binary classification and is solved in three steps: the first step is to determine the regression function (usually the Sigmoid function); the second step is to determine a cost function (which contains the parameters); the third step is to solve for the model parameters, using either a gradient descent algorithm or a maximum likelihood estimation algorithm, where gradient descent is an optimization algorithm for finding the parameters.
In this embodiment, 7 risk influence factors are finally screened out from the first variable set: age X2, education level X4, residence type X5, working years X7, personal income X8, insurance payment X9 and vehicle condition X10.
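The three solving steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the applicant's implementation: the cross-entropy cost, the learning rate, the step count and the one-feature toy data are all assumed, since the text names only the Sigmoid function and gradient descent.

```python
from math import exp, log

def sigmoid(z):
    """Step 1: the regression function."""
    return 1.0 / (1.0 + exp(-z))

def cost(weights, bias, X, y):
    """Step 2: cross-entropy cost (an assumed, commonly used choice)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(bias + sum(w * v for w, v in zip(weights, xi)))
        total -= yi * log(p + eps) + (1 - yi) * log(1 - p + eps)
    return total / len(y)

def fit(X, y, lr=0.1, steps=2000):
    """Step 3: solve for the parameters by batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(y)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, xi))) - yi
            grad_b += err
            for j, v in enumerate(xi):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy data: one feature, label 1 for the larger values.
X_toy, y_toy = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
w_fit, b_fit = fit(X_toy, y_toy)
print(cost(w_fit, b_fit, X_toy, y_toy) < cost([0.0], 0.0, X_toy, y_toy))  # True
```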

In this embodiment, the first variable set is obtained by screening the user information through the backward stepwise regression model, which helps to obtain the explanatory variables more relevant to the user risk category.

In an embodiment, screening the user information through the backward stepwise regression model to obtain the first variable set includes:

constructing an explanatory variable set based on all user information; constructing a linear regression model with each explanatory variable in the set as an independent variable and the risk user category as the dependent variable; and performing multiple rounds of significance testing on the linear regression model. Significance testing is a kind of statistical hypothesis test, used to determine whether there is a difference between an experimental group and a control group in a scientific experiment and whether that difference is significant. In each round of significance testing, the contribution value of each independent variable to the dependent variable is calculated, and the independent variable with the smallest contribution value is eliminated to obtain the next linear regression model; this continues until no independent variable is eliminated, and the independent variables that remain are taken as the first variable set.

Specifically, in the process of screening variables, this embodiment adopts the backward elimination method: variables are removed from the linear regression model one at a time, and the significance of each remaining variable for the dependent variable is examined. That is, an explanatory variable set is constructed based on all the user information; a linear regression model is constructed with each explanatory variable in the set as an independent variable and the risk user category as the dependent variable; and multiple rounds of significance testing are performed on the linear regression model. In each round of significance testing, the contribution value of each independent variable to the dependent variable is calculated, the contribution value being a t-test value, and the independent variable with the smallest contribution value is eliminated to obtain the next linear regression model; this continues until no independent variable is eliminated, and the independent variables that remain are taken as the first variable set. The first variable set includes: gender X1, age X2, education level X4, residence type X5, working years X7, personal income X8, insurance payment X9 and vehicle condition X10.
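The elimination loop can be sketched in plain Python as below. This is a hedged illustration, not the applicant's code: the retention threshold |t| >= 2.0, the function names and the toy data are assumptions, and the toy dependent variable is continuous purely to keep the arithmetic easy to follow.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def t_values(columns, y):
    """Fit ordinary least squares with an intercept; return |t| per variable."""
    names = list(columns)
    rows = [[1.0] + [columns[k][i] for k in names] for i in range(len(y))]
    p = len(names) + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - p)  # residual variance
    return {k: abs(beta[j + 1]) / (sigma2 * inv[j + 1][j + 1]) ** 0.5
            for j, k in enumerate(names)}

def backward_eliminate(columns, y, t_threshold=2.0):
    """Drop the weakest variable each round until all remaining ones pass."""
    kept = dict(columns)
    while kept:
        t = t_values(kept, y)
        weakest = min(t, key=t.get)
        if t[weakest] >= t_threshold:
            break  # every remaining variable is significant
        del kept[weakest]  # a removed variable is never reintroduced
    return list(kept)

# Toy data: y depends on x1 only; x2 is unrelated to y.
data = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [1, -1, -1, 1, 1, -1, -1, 1]}
target = [2.1, 3.9, 6.1, 7.9, 9.9, 12.1, 13.9, 16.1]
print(backward_eliminate(data, target))  # ['x1']
```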

According to the embodiment, the first variable set is obtained through the backward screening, so that the independent variables obviously related to the risk categories of the users can be extracted, and favorable conditions are provided for constructing the decision tree model.

Specifically, the multiple logistic regression model applies to binary classification and is solved in three steps: determining a regression function (usually the Sigmoid function), determining a cost function (which contains the parameters), and solving for the model parameters.
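Written out, the three steps commonly take the following form. The notation (parameter vector \theta, m training samples, learning rate \alpha) and the cross-entropy cost are assumptions, since the text names only the Sigmoid function and a parameterized cost function.

```latex
% Step 1: regression (Sigmoid) function
h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}

% Step 2: cost function (cross-entropy, i.e. the negative mean log-likelihood)
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
       + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]

% Step 3: gradient descent update for each parameter \theta_j
\theta_j \leftarrow \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m}
  \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```

Minimizing J(\theta) is equivalent to maximizing the likelihood, which is why either gradient descent or maximum likelihood estimation can be used to solve for the parameters.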

In the embodiment, the Sigmoid function is used as the regression function, so that the model accuracy is further improved.

It should be understood that, although the steps in the flowcharts of the embodiments described above are displayed in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the embodiments described above may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a credit card user risk category identification device for realizing the credit card user risk category identification method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so the specific limitations in the embodiment of one or more credit card user risk category identification devices provided below can be referred to the limitations on the credit card user risk category identification method in the above, and details are not repeated herein.

In one embodiment, as shown in fig. 7, there is provided a credit card user risk category identification device 700, including: a user information obtaining module 701, a risk influence factor screening module 702, and a risk category output module 703, wherein:

a user information obtaining module 701, configured to obtain user information of a credit card user;

a risk factor screening module 702, configured to screen multiple risk factors from the user information through a stepwise regression model and a multiple logistic regression model;

a risk category output module 703, configured to input the attribute values corresponding to the multiple risk influence factors into a pre-constructed decision tree model for traversal, and to obtain the risk category to which the credit card user belongs, as output by the decision tree model.

In one embodiment, the apparatus further comprises a model training unit configured to:

In one embodiment, the model training unit is further configured to:

placing the training sample data set on a parent node;

In one embodiment, the stepwise regression model is a backward stepwise regression model; the risk influence factor screening module is further configured to:

In one embodiment, the risk influencing factor screening module 702 is further configured to:

constructing an explanatory variable set based on all the user information;

In one embodiment, the apparatus 700 further includes a model accuracy detection unit for calculating the model accuracy of the decision tree model by using a confusion matrix or AUC values.

Each module in the credit card user risk category identification device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store user information, as well as risk category data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a credit card user risk category identification method.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a credit card user risk category identification method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the configurations illustrated in fig. 8-9 are merely block diagrams of portions of configurations related to aspects of the present application, and do not constitute limitations on the computing devices to which aspects of the present application may be applied, as particular computing devices may include more or less components than those illustrated, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the above mentioned credit card user risk category identification method embodiment when executing the computer program.

In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the above-described credit card user risk category identification method embodiments.

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of the above-described credit card user risk category identification method embodiments.

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

1. A credit card user risk category identification method, the method comprising:

acquiring user information of a credit card user;

2. The method of claim 1, wherein before inputting the attribute values corresponding to the plurality of risk influencing factors into a pre-constructed decision tree model for traversal, and obtaining the credit risk value of the credit card user, the method further comprises:

3. The method according to claim 2, wherein said training a C4.5-based decision tree model using said training sample data to obtain said pre-constructed decision tree model comprises:

placing the training sample data set on a parent node;

4. The method of claim 1, wherein the stepwise regression model is a backward stepwise regression model; the screening of the user information through the stepwise regression model and the multiple logistic regression model to obtain a plurality of influence factors comprises:

5. The method of claim 4, wherein the screening of the first variable set from the user information through the backward stepwise regression model comprises:

constructing an explanatory variable set based on all the user information;

6. The method according to any one of claims 1 to 5, further comprising: calculating the model precision of the decision tree model through a confusion matrix or AUC value.

7. A credit card user risk category identification device, the device comprising:

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.