CN112613920A

CN112613920A - Loss probability prediction method and device

Info

Publication number: CN112613920A
Application number: CN202011640346.XA
Authority: CN
Inventors: 朱红伟; 吴正良
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-06

Abstract

The application discloses a loss probability prediction method and device. The method comprises: acquiring feature data of a target customer in a target observation time interval; and predicting the loss probability of the target customer in a target performance time interval according to that feature data and a pre-trained XGBoost model, where the start time of the target performance time interval is after the end time of the target observation time interval. The XGBoost model is trained on training samples that comprise feature data of a customer set in an observation time interval and the customers in that set who are labeled as lost; a lost customer is a customer in the customer set whose level has dropped. With the pre-trained XGBoost model, the loss probability of a customer in the performance time interval can therefore be obtained efficiently from the customer's feature data in the target observation time interval.

Description

Loss probability prediction method and device

Technical Field

The application relates to the field of computers, in particular to a loss probability prediction method and device.

Background

The predicted probability of customer loss has become information that many industries need. At present, it is mainly predicted manually from customer information. In some scenarios, however, the number of customers is huge; collating customer information and calculating the loss probability by hand consumes a large amount of manpower and is inefficient. How to predict the loss probability efficiently has therefore become a technical problem that urgently needs to be solved in the field.

Disclosure of Invention

In order to solve the technical problem, the application provides a loss probability prediction method and a loss probability prediction device, which are used for efficiently predicting the loss probability of a customer.

In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:

the embodiment of the application provides a loss probability prediction method, which comprises the following steps:

acquiring characteristic data of a target client in a target observation time interval;

predicting the loss probability of the target customer in a target performance time interval according to the feature data of the target customer in the target observation time interval and a pre-trained XGBoost model; the start time of the target performance time interval is after the end time of the target observation time interval;

the XGBoost model is trained on training samples; the training samples comprise feature data of a customer set in an observation time interval and the customers in that set who are labeled as lost in a performance time interval; a lost customer is a customer in the customer set whose level has dropped; the start time of the performance time interval is after the end time of the observation time interval.

Optionally, the method further comprises:

determining the group type of each customer in the customer set; each group type has a corresponding level division mode;

and determining whether the level of each client is degraded or not according to the level division mode corresponding to the group type to which each client belongs.

Optionally, when the feature data comprises a plurality of feature data types, the method further comprises:

when the loss probability of the target customer in the target performance time interval exceeds a preset threshold, obtaining, according to the feature data of the target customer in the target observation time interval and the pre-trained XGBoost model, the feature data type with the greatest weight among the plurality of feature data types.

Optionally, the method further comprises:

obtaining an analysis report of a target customer set; the analysis report comprises a list of the customers in the target customer set whose loss probability exceeds the preset threshold, together with the loss reasons; a loss reason is obtained by analyzing the feature data type with the greatest weight among the plurality of feature data types.

Optionally, the feature data comprises:

at least one of customer basic information, customer financial asset information, customer bank card information, customer transaction information, customer insurance information, and information on customer payroll disbursed through the bank.

The embodiment of the present application further provides a loss probability prediction device, the device includes:

the obtaining module is used for obtaining the characteristic data of the target client in the target observation time interval;

the prediction module is used for predicting the loss probability of the target customer in a target performance time interval according to the feature data of the target customer in the target observation time interval and a pre-trained XGBoost model; the start time of the target performance time interval is after the end time of the target observation time interval.

Optionally, the apparatus further comprises:

the determining module is used for determining the group type of each customer in the customer set; each group type has a corresponding level division mode;

and the dividing module is used for determining whether the level of each client is degraded or not according to the level dividing mode corresponding to the group type to which each client belongs.

Optionally, when the feature data comprises a plurality of feature data types, the device further comprises:

the reason obtaining module, which is used for obtaining, when the loss probability of the target customer in the target performance time interval exceeds a preset threshold, the feature data type with the greatest weight among the plurality of feature data types according to the feature data of the target customer in the target observation time interval and the pre-trained XGBoost model.

Optionally, the apparatus further comprises:

the report acquisition module is used for acquiring an analysis report of the target customer set; the analysis report comprises a list of the customers in the target customer set whose loss probability exceeds the preset threshold, together with the loss reasons; a loss reason is obtained by analyzing the feature data type with the greatest weight among the plurality of feature data types.

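Optionally, the feature data comprises at least one of customer basic information, customer financial asset information, customer bank card information, customer transaction information, customer insurance information, and information on customer payroll disbursed through the bank.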

According to the technical scheme, the method has the following beneficial effects:

The embodiment of the application provides a loss probability prediction method and device. The method comprises: acquiring feature data of a target customer in a target observation time interval; and predicting the loss probability of the target customer in a target performance time interval according to that feature data and a pre-trained XGBoost model, where the start time of the target performance time interval is after the end time of the target observation time interval. The XGBoost model is trained on training samples; the training samples comprise feature data of a customer set in an observation time interval and the customers in that set who are labeled as lost in a performance time interval; a lost customer is a customer in the customer set whose level has dropped; the start time of the performance time interval is after the end time of the observation time interval. With the pre-trained XGBoost model, the loss probability of a customer in the performance time interval can therefore be obtained efficiently from the customer's feature data in the target observation time interval.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a loss probability prediction method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a performance time interval and an observation time interval provided in an embodiment of the present application;

Fig. 3 is a schematic diagram of a method for training an XGBoost model according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a loss probability prediction device according to an embodiment of the present application.

Detailed Description

In order to help better understand the scheme provided by the embodiment of the present application, before describing the method provided by the embodiment of the present application, a scenario of an application of the scheme of the embodiment of the present application is described.

In order to solve the above technical problem, an embodiment of the present application provides a loss probability prediction method and device. The method includes: acquiring feature data of a target customer in a target observation time interval; and predicting the loss probability of the target customer in a target performance time interval according to that feature data and a pre-trained XGBoost model, where the start time of the target performance time interval is after the end time of the target observation time interval. The XGBoost model is trained on training samples; the training samples comprise feature data of a customer set in an observation time interval and the customers in that set who are labeled as lost in a performance time interval; a lost customer is a customer in the customer set whose level has dropped; the start time of the performance time interval is after the end time of the observation time interval. With the pre-trained XGBoost model, the loss probability of a customer in the performance time interval can therefore be obtained efficiently from the customer's feature data in the observation time interval.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.

Referring to Fig. 1, a schematic flow chart of a loss probability prediction method provided in the embodiment of the present application is shown. As shown in Fig. 1, the method includes the following steps:

S101: Acquire the feature data of the target customer in the target observation time interval.

S102: Predict the loss probability of the target customer in the target performance time interval according to the feature data of the target customer in the target observation time interval and a pre-trained XGBoost model; the start time of the target performance time interval is after the end time of the target observation time interval.

It should be noted that, in the embodiment of the present application, the XGBoost model is trained on training samples; the training samples comprise feature data of a customer set in an observation time interval and the customers in that set who are labeled as lost in a performance time interval; a lost customer is a customer in the customer set whose level has dropped; the start time of the performance time interval is after the end time of the observation time interval.

In this embodiment, as a possible implementation, the feature data includes at least one of customer basic information, customer financial asset information, customer bank card information, customer transaction information, customer insurance information, and information on customer payroll disbursed through the bank.

To further improve prediction accuracy, as a possible implementation, the method provided in the embodiment of the present application further includes: determining the group type of each customer in the customer set, where each group type has a corresponding level division mode; and determining whether the level of each customer has dropped according to the level division mode corresponding to the group type to which the customer belongs. Because each customer group has its own level division mode, a suitable division rule can be formulated according to the characteristics of each group, which makes the prediction more accurate.

A specific application of the customer group division provided in the embodiment of the present application is described below:

Features such as the financial asset balance, current deposit balance, time deposit balance, loan balance and investment balance of personal VIP customers are obtained according to the customer level flag, and the customers are divided into 6 groups by a K-means clustering algorithm combined with business experience.

(1) Type division of customer groups:

High-asset current-deposit customers: monthly daily-average financial assets ≥ 1 million (100 ten-thousand), and the monthly daily-average current deposit balance accounts for more than 80% of total financial assets.

High-asset time-deposit customers: monthly daily-average financial assets ≥ 1 million, and the monthly daily-average time deposit balance accounts for more than 80% of total financial assets.

High-asset investment customers: monthly daily-average financial assets ≥ 1 million, and the monthly daily-average investment balance accounts for more than 80% of total financial assets.

Low-asset current-deposit customers: monthly daily-average financial assets < 1 million, and the monthly daily-average current deposit balance accounts for more than 80% of total financial assets.

Low-asset time-deposit customers: monthly daily-average financial assets < 1 million, and the monthly daily-average time deposit balance accounts for more than 80% of total financial assets.

Low-asset investment customers: monthly daily-average financial assets < 1 million, and the monthly daily-average investment balance accounts for more than 80% of total financial assets.
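The six grouping rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the applicant's implementation: the function name `classify_group`, the constant names, and the choice to return `None` when no product type exceeds the 80% share are all hypothetical, and "100 ten thousand" is read as 1,000,000.

```python
from typing import Optional

HIGH_ASSET_THRESHOLD = 1_000_000  # "100 ten thousand" in the text, read as 1 million
DOMINANT_SHARE = 0.80             # share of total financial assets that defines the type


def classify_group(total_assets: float, current_deposit: float,
                   time_deposit: float, investment: float) -> Optional[str]:
    """Map monthly daily-average balances to one of the six customer groups.

    Returns None when no product type exceeds 80% of total assets
    (how such customers are handled is not stated in the source).
    """
    if total_assets <= 0:
        return None
    tier = "high-asset" if total_assets >= HIGH_ASSET_THRESHOLD else "low-asset"
    shares = {
        "current-deposit": current_deposit / total_assets,
        "time-deposit": time_deposit / total_assets,
        "investment": investment / total_assets,
    }
    for kind, share in shares.items():
        if share > DOMINANT_SHARE:
            return f"{tier} {kind}"
    return None
```

At most one share can exceed 80%, so the order in which the three product types are checked does not affect the result.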

(2) Loss marking

Customers are marked as downgraded according to their customer group. Each type of customer is divided into N levels by the scale of the customer's financial assets, and a drop of one level is regarded as customer loss.

High-asset current-deposit customers: a 20% reduction in the customer's current deposit assets counts as a drop of one level and is defined as loss;

High-asset time-deposit customers: a 20% reduction in the customer's time deposit assets counts as a drop of one level and is defined as loss;

High-asset investment customers: a 20% reduction in the customer's investment assets counts as a drop of one level and is defined as loss;

Low-asset current-deposit customers: a 40% reduction in the customer's current deposit assets counts as a drop of one level and is defined as loss;

Low-asset time-deposit customers: a 40% reduction in the customer's time deposit assets counts as a drop of one level and is defined as loss;

Low-asset investment customers: a 40% reduction in the customer's investment assets counts as a drop of one level and is defined as loss.
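The loss-marking rule can be illustrated with a short sketch. It assumes that the comparison is made between the balance of the group's dominant asset type in the observation interval and the same balance in the performance interval, which the text does not spell out; the names `LEVEL_STEP` and `is_lost` are hypothetical.

```python
# Relative drop in the group's dominant asset that counts as one level (from the text).
LEVEL_STEP = {"high-asset": 0.20, "low-asset": 0.40}


def is_lost(tier: str, balance_observation: float, balance_performance: float) -> bool:
    """Label a customer as lost when the dominant asset fell by at least one level.

    tier is "high-asset" or "low-asset"; the balances are for the asset type
    that defines the customer's group (current deposit, time deposit or investment).
    """
    if balance_observation <= 0:
        return False  # assumption: with no starting balance, no level can be dropped
    step = LEVEL_STEP[tier]
    return balance_performance <= balance_observation * (1 - step)
```

Using a larger step for low-asset customers means that the same absolute fluctuation is less likely to be labeled as loss for small balances, which matches the idea of a per-group division rule.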

The performance time interval and the observation time interval in the embodiment of the present application are described below:

Referring to Fig. 2, a schematic diagram of the performance time interval and the observation time interval provided in an embodiment of the present application is shown. To improve prediction accuracy, as shown in Fig. 2, the end time of the observation period (observation time interval) and the start time of the performance period (performance time interval) may be the same time. As an example, the observation time interval may be 6 months and the performance time interval may be 3 months.
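As an illustration of the two windows, the sketch below derives a 6-month observation interval and a 3-month performance interval from a single observation point. It assumes that the performance interval begins at the moment the observation interval ends, that intervals are half-open `[start, end)`, and that the observation point falls on the first day of a month; `add_months` and `build_windows` are hypothetical helper names.

```python
from datetime import date


def add_months(first_of_month: date, months: int) -> date:
    """Shift a first-of-month date by a (possibly negative) number of months."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def build_windows(observation_point: date, obs_months: int = 6, perf_months: int = 3):
    """Return (observation window, performance window) as half-open (start, end) pairs."""
    obs_start = add_months(observation_point, -obs_months)
    perf_end = add_months(observation_point, perf_months)
    return (obs_start, observation_point), (observation_point, perf_end)
```

With this layout, features are computed only from data inside the observation window and the loss label only from data inside the performance window, so no information from the labeled period leaks into the features.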

After the customers are grouped, the method provided by the embodiment of the application can also preprocess the customers' feature data. How the feature data is generated is described in a specific embodiment as follows:

(1) Feature construction.

M feature items for personal VIP customers are generated along dimensions such as type, amount, count and time from the bank's own data, such as customer basic information, customer level, account-management relationship, financial assets, bank card information, debit card transactions, credit card transactions, channel transactions, transfers, products held, deposits, loans, wealth management, treasury bonds, funds, precious metals, insurance, third-party depository and payroll disbursement.

(2) Feature derivation.

Feature derivation extracts from the data the features that have a greater influence on the result, based on expert experience and brute-force search. It mainly covers four kinds: summary features, combination features, statistical methods and user-defined methods.

For amount and count features, summaries over 1, 2, 3, 6, 9 and 12 months are derived along the time dimension, which makes the features more significant and better aligned with business logic; customer basic information is combined with flag features to obtain combination features, such as age combined with whether a credit card is held, or gender combined with whether a wealth management product is held; amount and count features can be further derived by statistical methods to obtain features such as the mean, maximum, minimum and standard deviation; in addition, the original customer features can be derived from a business perspective based on the experience of customer managers.
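The time-window summaries and statistical derivations described above might look like the following sketch. It assumes one value per month with the most recent month last, skips windows longer than the available history, and uses the population standard deviation; the function name `derive_window_features` and the generated feature names are hypothetical.

```python
from statistics import mean, pstdev

WINDOWS = (1, 2, 3, 6, 9, 12)  # window lengths in months, as listed in the text


def derive_window_features(name: str, monthly_values: list) -> dict:
    """Derive sum/mean/max/min/std over trailing windows of a monthly series.

    monthly_values holds one amount or count per month, most recent month last.
    """
    features = {}
    for months in WINDOWS:
        if len(monthly_values) < months:
            continue  # not enough history for this window
        window = monthly_values[-months:]
        features[f"{name}_sum_{months}m"] = sum(window)
        features[f"{name}_mean_{months}m"] = mean(window)
        features[f"{name}_max_{months}m"] = max(window)
        features[f"{name}_min_{months}m"] = min(window)
        features[f"{name}_std_{months}m"] = pstdev(window)
    return features
```

A single monthly series thus expands into up to 30 derived features, which is one reason a feature selection step is needed afterwards.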

(3) Feature preprocessing.

The generated features are preprocessed using commonly used methods.

First, missing value filling: feature variables whose share of missing values is above a certain percentage (e.g., 90%) are deleted; feature variables whose share of missing values is below 90% are filled according to the following rules: numerical variables are filled with the mean or with 0; enumerated variables are filled with the mode.

Second, capping: two end points are calculated for amount features (e.g., the 5th percentile as the lower end point and the 95th percentile as the upper end point); values below the lower end point are set to the lower end point value, and values above the upper end point are set to the upper end point value.

Third, continuous feature binning: continuous features are mapped to intervals to discretize them. The binning thresholds are calculated based on the maximum entropy of a decision tree, and the continuous features are thereby converted into discrete features.
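The three preprocessing steps can be sketched with the Python standard library. Several points are assumptions rather than statements from the source: missing values are represented as `None`, numerical variables are filled with the mean (the text also allows 0), percentiles come from `statistics.quantiles`, and the entropy-based binning is interpreted as picking the single split point with the largest information gain, as one node of a decision tree would. All function names are hypothetical.

```python
from math import log2
from statistics import mean, mode, quantiles


def fill_missing(values, categorical=False, drop_ratio=0.9):
    """Return the filled column, or None if the feature should be deleted."""
    present = [v for v in values if v is not None]
    if not present or 1 - len(present) / len(values) > drop_ratio:
        return None
    fill = mode(present) if categorical else mean(present)
    return [fill if v is None else v for v in values]


def cap(values, lower_pct=5, upper_pct=95):
    """Clip values to the range between the lower and upper percentiles."""
    cuts = quantiles(values, n=100)  # cuts[k - 1] is the k-th percentile
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p == 0 or p == 1:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def best_threshold(values, labels):
    """Split point with the largest information gain, or None if no split helps."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    base, best_gain, best = entropy(ys), 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left, right = ys[:i], ys[i:]
        split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if base - split > best_gain:
            best_gain = base - split
            best = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best
```

Applying `best_threshold` recursively to each resulting segment would yield several bins per feature; the sketch stops at one split to stay short.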

(4) Feature selection.

For the preprocessed features, the features finally used by the model are selected in two steps.

Step one: for the samples of each customer group, model training is performed with a combination of a logistic regression algorithm and a random forest algorithm; the results of each model training are ranked by feature importance, and the top a features are selected for each customer group as the features to be modeled.

Step two: variables of the same type are removed. The variables to be modeled are clustered into b classes with a K-means clustering algorithm, and the most relevant feature variable in each class is retained.

These two steps yield the c features that enter the final model.
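The two selection steps can be sketched as follows. The sketch takes the importance scores and the cluster assignment as given inputs, since producing them (logistic regression, random forest and K-means) depends on tooling the source does not specify. It also assumes that "most relevant" within a cluster means "highest importance score"; the function name and feature names are hypothetical.

```python
def select_features(importance: dict, cluster_of: dict, top_a: int) -> list:
    """Two-step selection: top-a by importance, then one feature per cluster.

    importance maps feature name -> importance score;
    cluster_of maps feature name -> cluster id (for example from K-means).
    """
    # Step one: keep the a most important features.
    ranked = sorted(importance, key=importance.get, reverse=True)[:top_a]
    # Step two: within each cluster keep only the highest-ranked feature.
    best_per_cluster = {}
    for feature in ranked:
        cluster = cluster_of[feature]
        if cluster not in best_per_cluster:
            best_per_cluster[cluster] = feature  # ranked order: first seen is the best
    return list(best_per_cluster.values())
```

For example, if a 3-month and a 6-month balance feature fall in the same cluster, only the one with the higher importance survives, which removes near-duplicate variables from the final model.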

After the feature data of the client is formed, the XGBoost model is trained in the embodiment of the present application. One specific embodiment of training the XGBoost model is described below:

Referring to Fig. 3, a schematic diagram of a method for training the XGBoost model according to an embodiment of the present application is shown. As shown in Fig. 3, the training method includes: after the customers are grouped (data clustering), the customers belonging to the same customer group are divided into a training set and a test set; the XGBoost model is then tried out, and its parameters are tuned by comparing precision and recall (in general, the higher the preset loss probability threshold, the higher the precision and the lower the recall), so as to obtain the model that best balances precision and recall. As an example, the ratio of the number of customers in the training set to that in the test set may be 7:3.
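The 7:3 split and the precision/recall comparison can be illustrated without any particular model library. In the sketch below the predicted probabilities are plain numbers that would, in the described method, come from the trained XGBoost model; the function names, the fixed random seed and the sample values are hypothetical.

```python
import random


def split_7_3(customers: list, seed: int = 0):
    """Shuffle the customers of one group and split them 7:3 into train and test."""
    shuffled = customers[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def precision_recall(probabilities: list, labels: list, threshold: float):
    """Precision and recall when customers above the threshold are predicted as lost."""
    predicted = [p > threshold for p in probabilities]
    tp = sum(1 for hit, y in zip(predicted, labels) if hit and y)
    fp = sum(1 for hit, y in zip(predicted, labels) if hit and not y)
    fn = sum(1 for hit, y in zip(predicted, labels) if not hit and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test-set scores and true labels (1 = lost).
probs = [0.9, 0.8, 0.6, 0.55, 0.2]
truth = [1, 1, 0, 1, 0]
low_threshold = precision_recall(probs, truth, 0.5)
high_threshold = precision_recall(probs, truth, 0.7)
```

On this toy data, raising the threshold from 0.5 to 0.7 increases precision and lowers recall, which is the trade-off the parameter tuning has to balance.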

In the embodiment of the present application, as a possible implementation, when the feature data includes a plurality of feature data types, the method further includes: when the loss probability of the target customer in the target performance time interval exceeds a preset threshold, obtaining, according to the feature data of the target customer in the target observation time interval and the pre-trained XGBoost model, the feature data type with the greatest weight among the plurality of feature data types. The feature data type with the greatest weight indicates the likely reason for the customer's loss and thus provides support for downstream applications.

Further, as a possible implementation, the method provided by the embodiment of the present application further includes: obtaining an analysis report of a target customer set; the analysis report comprises a list of the customers in the target customer set whose loss probability exceeds the preset threshold, together with the loss reasons; a loss reason is obtained by analyzing the feature data type with the greatest weight among the plurality of feature data types.
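A minimal sketch of the reason lookup and the analysis report follows. It assumes that each customer record already carries a predicted loss probability and a weight per feature data type; how those weights are read out of the XGBoost model is not specified in the source, so they are treated as given. The record layout, field names and function names are hypothetical.

```python
def top_feature(feature_weights: dict) -> str:
    """Feature data type with the greatest weight."""
    return max(feature_weights, key=feature_weights.get)


def build_report(customers: list, threshold: float = 0.5) -> list:
    """List customers whose loss probability exceeds the threshold, with a reason.

    Each customer is a dict with keys "customer_id", "loss_probability"
    and "feature_weights" (feature data type -> weight).
    """
    rows = []
    for customer in customers:
        if customer["loss_probability"] > threshold:
            rows.append({
                "customer_id": customer["customer_id"],
                "loss_probability": customer["loss_probability"],
                "loss_reason": top_feature(customer["feature_weights"]),
            })
    # Highest risk first, so that follow-up can be prioritized.
    return sorted(rows, key=lambda row: row["loss_probability"], reverse=True)
```

In practice the raw feature name would probably be mapped to a business-readable reason before the report is handed to a customer manager.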

To sum up, the embodiment of the present application provides a loss probability prediction method, including: acquiring feature data of a target customer in a target observation time interval; and predicting the loss probability of the target customer in a target performance time interval according to that feature data and a pre-trained XGBoost model, where the start time of the target performance time interval is after the end time of the target observation time interval. The XGBoost model is trained on training samples; the training samples comprise feature data of a customer set in an observation time interval and the customers in that set who are labeled as lost in a performance time interval; a lost customer is a customer in the customer set whose level has dropped; the start time of the performance time interval is after the end time of the observation time interval. With the pre-trained XGBoost model, the loss probability of a customer in the performance time interval can therefore be obtained efficiently from the customer's feature data in the observation time interval.

Based on the loss probability prediction method provided by the embodiment, the embodiment of the application also provides a loss probability prediction device.

Referring to Fig. 4, a schematic structural diagram of a loss probability prediction device according to an embodiment of the present application is shown. As shown in Fig. 4, the loss probability prediction device provided in the embodiment of the present application includes:

An obtaining module 100, configured to obtain feature data of a target customer in a target observation time interval.

A prediction module 200, configured to predict the loss probability of the target customer in a target performance time interval according to the feature data of the target customer in the target observation time interval and a pre-trained XGBoost model; the start time of the target performance time interval is after the end time of the target observation time interval.

In the embodiment of the present application, as a possible implementation, the device further includes: a determining module, configured to determine the group type of each customer in the customer set, where each group type has a corresponding level division mode; and a dividing module, configured to determine whether the level of each customer has dropped according to the level division mode corresponding to the group type to which the customer belongs.

In the embodiment of the present application, as a possible implementation, when the feature data includes a plurality of feature data types, the device further includes: a reason obtaining module, configured to obtain, when the loss probability of the target customer in the target performance time interval exceeds a preset threshold, the feature data type with the greatest weight among the plurality of feature data types according to the feature data of the target customer in the target observation time interval and the pre-trained XGBoost model.

In the embodiment of the present application, as a possible implementation, the device further includes: a report acquisition module, configured to acquire an analysis report of the target customer set; the analysis report comprises a list of the customers in the target customer set whose loss probability exceeds the preset threshold, together with the loss reasons; a loss reason is obtained by analyzing the feature data type with the greatest weight among the plurality of feature data types.

In summary, the embodiment of the application provides a loss probability prediction device which, with the pre-trained XGBoost model, can efficiently obtain the loss probability of a customer in the performance time interval from the customer's feature data in the target observation time interval.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. The device disclosed in the embodiments corresponds to the method disclosed in the embodiments, so its description is relatively brief; for relevant details, refer to the description of the method.

It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A loss probability prediction method, comprising:

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein, when the feature data comprises a plurality of feature data types, the method further comprises:

4. The method of claim 1, further comprising:

5. The method of claim 1, wherein the feature data comprises:

6. A loss probability prediction device, the device comprising:

7. The device of claim 6, further comprising:

8. The device of claim 6, wherein, when the feature data comprises a plurality of feature data types, the device further comprises:

9. The device of claim 6, further comprising:

10. The device of claim 6, wherein the feature data comprises: