CN112765451A

CN112765451A - Client intelligent screening method and system based on ensemble learning algorithm

Info

Publication number: CN112765451A
Application number: CN202011605956.6A
Authority: CN
Inventors: 王玮; 钟严堃; 徐勤燕; 杨阳; 骆天; 顾佳盛
Original assignee: Shanghai Data Center of China Life Insurance Co Ltd
Current assignee: Shanghai Data Center of China Life Insurance Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-05-07

Abstract

The invention relates to an intelligent customer screening method and system based on an ensemble learning algorithm. The method comprises the following steps. Modeling feature acquisition: acquiring customer data, extracting customer features from the customer data, performing feature transformation, and screening the transformed customer features to obtain modeling features. Model training: acquiring a model training data set corresponding to the modeling features, dividing it into a training set and a test set, and training a pre-established stacking ensemble learning algorithm model. Customer score prediction: obtaining customer data corresponding to the modeling features, loading it into the trained stacking ensemble learning algorithm model, and obtaining the customer score of each customer. Customer screening: screening customers according to the customer score. Compared with the prior art, the method has the advantages of comprehensive data extraction, high calculation efficiency and accurate prediction results.

Description

Client intelligent screening method and system based on ensemble learning algorithm

Technical Field

The invention relates to the field of intelligent client screening methods, in particular to an intelligent client screening method and system based on an ensemble learning algorithm.

Background

In the field of precision marketing in the insurance industry, most existing customer recommendation solutions are either 'customer screening based on sales experience rules' or 'user-based collaborative filtering customer recommendation'. The former makes effective use of the sales experience of insurance salespeople, but its practical effect is very limited because it lacks data support or does not fully consider user behavior characteristics. The latter is a model commonly used in large-scale user-item interaction settings such as shopping websites; it makes recommendations by using users' past product purchase records to match customers with similar purchasing behavior.

The invention with publication number CN110223102A provides a customer recommendation method, device, electronic device and storage medium. The method comprises the following steps: acquiring negative sample data of each customer who has not placed an order within a test time period; preprocessing the negative sample data to obtain standard sample data; establishing a data matrix by combining the generation time of each dimension of data; scoring the data matrix with a scoring model to obtain the order-placing likelihood score of the customer corresponding to the data matrix; and recommending the corresponding customers to sales personnel in descending order of that score. According to that publication, the method can accurately calculate the order-placing likelihood of each customer who has not yet placed an order, and thus accurately recommend target customers with a high likelihood to sales personnel, so that the sales success rate is greatly improved.

That method obtains the recommended target customers by scoring each customer's data matrix with the scoring model, but the accuracy of the scoring model's results cannot be guaranteed.

Disclosure of Invention

The invention aims to provide a customer intelligent screening method and system based on an ensemble learning algorithm to overcome the defects of lack of data support or incomplete consideration of user behavior characteristics and low customer screening accuracy in the prior art.

The purpose of the invention can be realized by the following technical scheme:

a customer intelligent screening method based on an ensemble learning algorithm comprises the following steps:

a modeling feature obtaining step: acquiring customer data, extracting customer characteristics from the customer data, and performing characteristic transformation processing; screening the client characteristics after the characteristic transformation processing to obtain modeling characteristics;

model training: acquiring a model training data set corresponding to the modeling characteristics, dividing the model training data set into a training set and a test set, and training a pre-established stacking ensemble learning algorithm model;

customer score prediction step: obtaining client data corresponding to the modeling characteristics, loading the client data into a trained stacking ensemble learning algorithm model, and obtaining a client score corresponding to the client;

customer screening: and performing customer screening according to the customer score.

Further, the stacking ensemble learning algorithm model is a model with two layers superposed, wherein the first layer comprises a plurality of sub models, and the second layer comprises an Xgboost algorithm;

the model training step comprises: based on the training set and the test set, the multiple sub-models are subjected to modeling fitting by using a cross validation mode respectively to obtain a training prediction set, a test prediction set, training labels and test labels of all the sub-models;

and transversely combining the training prediction set and the training labels into a second-layer training set, transversely combining the testing prediction set and the testing labels into a second-layer testing set, inputting the second-layer training set into the Xgboost algorithm for training optimization, and performing model verification by adopting the second-layer testing set.

Further, the plurality of submodels includes a Random Forest submodel, an Adaboost submodel, a Gradient Boosting Tree submodel, a Support Vector Machine submodel, and a Logistic Regression submodel.

Further, the customer score is a purchase intention score of the customer.

Further, the modeling feature obtaining step further includes performing exception data processing on the customer data, where the exception data processing specifically includes: filling missing data, and deleting or correcting error data.

Further, in the modeling feature obtaining step, the feature transformation processing includes dummy variable processing, and the dummy variable processing specifically includes: carrying out numerical coding treatment on the binary variables by adopting 0 and 1; and processing the multi-class dummy variables by adopting one-hot coding.

Further, in the modeling feature obtaining step, the feature transformation processing includes feature discretization processing, and the feature discretization specifically includes: and carrying out discretization treatment on the estimated customer characteristics by taking quantiles as boundaries.

Further, in the modeling feature obtaining step, the feature transformation processing includes numerical value transformation, and the numerical value transformation specifically includes: and processing the long tail distribution data by adopting logarithmic transformation or box-cox transformation.

Further, in the modeling feature obtaining step, the screening of the client features after the feature transformation processing specifically includes the following steps:

performing a chi-square test on the customer features after the feature transformation processing to achieve preliminary screening;

and screening the preliminarily screened customer characteristics by adopting a stepwise regression method.

The invention also provides a client intelligent screening system based on the ensemble learning algorithm, which comprises the following components:

the data extraction module is used for acquiring customer data, extracting customer characteristics from the customer data and carrying out characteristic transformation processing;

the characteristic screening module is used for screening the client characteristics subjected to the characteristic transformation processing to obtain modeling characteristics;

the model construction module is used for acquiring a model training data set corresponding to the modeling characteristics, dividing the model training data set into a training set and a test set, and training a pre-established stacking ensemble learning algorithm model;

the client score prediction module is used for acquiring client data corresponding to the modeling characteristics, loading the client data into a trained stacking ensemble learning algorithm model and acquiring a client score corresponding to the client;

and the client screening module is used for screening the clients according to the client scores.

Compared with the prior art, the invention has the following advantages:

The service is oriented to salespeople: it accurately recommends high-quality customers to salespeople rather than recommending products to customers. The insurance industry is currently still a sales-driven industry, and customers rarely pay attention to product recommendations of their own accord. Once the main product to be promoted in a given period is determined, the model of the invention can select the customer group with the strongest purchase intention for that product and recommend it to the corresponding marketers, which avoids blind visits by marketers, greatly improves marketing efficiency and also improves customer satisfaction.

Meanwhile, the invention uses a stacking ensemble algorithm. Its principle is to combine the prediction results of nearly ten machine learning algorithms such as 'Random Forest' and 'Adaboost' and to perform secondary modeling on those results, so as to combine the advantages of the various algorithms and weaken the defects of individual ones. The algorithm has an excellent prediction effect on insurance data and is stronger than any single base model.

The process by which the method derives modeling features from customer data is also rigorous and comprehensive. First, abnormal data is filled, deleted or corrected, which improves the reliability of the basic customer data. Dummy variables in the customer data, such as gender and occupation, are processed with numerical coding and one-hot coding. Long-tailed data is processed with a logarithmic or box-cox transformation: for example, the customer's historical premium gives poor results when put into the model directly, so the invention applies a logarithmic transformation to compress the data, making the overall distribution more even and closer to common statistical distributions. These measures are all targeted at insurance customer data, so that the data can be used in the subsequent stacking ensemble learning algorithm model; this facilitates comprehensive extraction of customer data and reduces the complexity of subsequent data processing.

The invention also designs the screening of customer features: a chi-square test provides a preliminary screening of the features that have predictive power, and random logistic regression feature screening removes the least significant features. This helps to reduce the data-processing load in the stacking ensemble learning algorithm model and improves calculation efficiency.

Drawings

FIG. 1 is a schematic flow chart of a customer intelligent screening method based on an ensemble learning algorithm according to the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.

Example 1

As shown in fig. 1, the present embodiment provides a customer intelligent screening method based on an ensemble learning algorithm, including the following steps:

a modeling feature obtaining step including S1: acquiring customer data, extracting customer characteristics from the customer data, and performing characteristic transformation processing;

S2: screening the client characteristics after the characteristic transformation processing to obtain modeling characteristics;

model training step S3: obtaining a model training data set corresponding to modeling characteristics, dividing the model training data set into a training set and a test set, and training a pre-established stacking ensemble learning algorithm model;

customer score prediction step S4: obtaining client data corresponding to modeling characteristics, loading the client data into a trained stacking ensemble learning algorithm model, and obtaining a client score corresponding to the client;

customer filtering step S5: and performing customer screening according to the customer scores.

As a preferred embodiment, the stacking ensemble learning algorithm model is a model with two layers superposed, wherein the first layer comprises a plurality of sub models, and the second layer comprises an Xgboost algorithm;

and transversely combining the training prediction set and the training labels into a second-layer training set, transversely combining the testing prediction set and the testing labels into a second-layer testing set, inputting the second-layer training set into an Xgboost algorithm for training optimization, and performing model verification by adopting the second-layer testing set.

In a preferred embodiment, the plurality of submodels includes a Random Forest submodel, an Adaboost submodel, a Gradient Boosting Tree submodel, a Support Vector Machine submodel, and a Logistic Regression submodel.

As a preferred embodiment, the AUC index, the recall rate and the accuracy index predicted in the test set are used as stopping conditions of the model training step.

As a preferred embodiment, the customer score is a purchase intention score of the customer.

As a preferred embodiment, the modeling feature obtaining step further includes performing exception data processing on the customer data, where the exception data processing specifically includes: filling missing data, and deleting or correcting error data.

As a preferred embodiment, in the modeling feature obtaining step, the feature transformation process includes a dummy variable process, and the dummy variable process specifically includes: carrying out numerical coding treatment on the binary variables by adopting 0 and 1; and processing the multi-class dummy variables by adopting one-hot coding.

As a preferred embodiment, in the modeling feature obtaining step, the feature transformation process includes a feature discretization process, and the feature discretization specifically includes: and carrying out discretization treatment on the estimated customer characteristics by taking quantiles as boundaries.

As a preferred embodiment, in the modeling feature obtaining step, the feature transformation process includes numerical transformation, and the numerical transformation specifically includes: and processing the long tail distribution data by adopting logarithmic transformation or box-cox transformation.

As a preferred embodiment, in the modeling feature obtaining step, the screening of the client features after the feature transformation processing specifically includes the following steps:

The above preferred embodiments are combined to obtain an optimal embodiment, and the implementation process of the optimal embodiment is described below.

1. Customer data processing in a modeling feature acquisition step

The customer data comprises demographic characteristics, i.e. variables such as gender, age, height and marital status, and business behavior data such as the customer's policy business behavior and historical insurance purchase records. In addition, each customer is labeled as an active customer or not according to recent purchase records, and this label is used as the dependent variable of the model. In this step, more than 100 relevant features are extracted in total, and abnormal data such as missing data and erroneous data is processed. Feature transformation is then performed by methods such as encoding, discretization and logarithmic transformation, and the transformed data is finally stored in a distributed manner in a big data warehouse for subsequent modeling.

Detailed methods of data processing implemented in the examples:

1.1 Exception data handling

a. Missing data: most categorical variables are filled by the mode, i.e. with the most frequent category; continuous variables can be filled with the median or the mean.

b. Error data: erroneous data is usually deleted or corrected. For example, records with an age less than 0 are deleted; if the customer's identity card data is valid, the embodiment may also correct the erroneous value using the date of birth in the identity card. However, if a feature has too many erroneous or null values, the feature itself is deleted.
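The two rules above can be sketched in a few lines. This is only an illustrative sketch: the column names (`gender`, `age`), the sample records and the helper names `fill_missing` and `drop_invalid_age` are assumptions, and the median is chosen here for continuous variables although the text allows the mean as well.

```python
from collections import Counter
from statistics import median

def fill_missing(rows, categorical, continuous):
    """Fill None values: mode for categorical columns, median for continuous ones."""
    fills = {}
    for col in categorical:
        observed = [r[col] for r in rows if r[col] is not None]
        fills[col] = Counter(observed).most_common(1)[0][0]  # most frequent category
    for col in continuous:
        observed = [r[col] for r in rows if r[col] is not None]
        fills[col] = median(observed)
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]

def drop_invalid_age(rows):
    """Delete records whose age is negative (an obvious entry error)."""
    return [r for r in rows if r["age"] is None or r["age"] >= 0]

# Hypothetical customer records; None marks a missing value.
customers = [
    {"gender": "F", "age": 34},
    {"gender": None, "age": 51},
    {"gender": "F", "age": -3},
    {"gender": "M", "age": None},
    {"gender": "F", "age": 42},
]
# Erroneous rows are removed first so they do not distort the fill values.
cleaned = fill_missing(drop_invalid_age(customers), ["gender"], ["age"])
```

Removing erroneous records before computing the fill values is a design choice of this sketch, not something the text prescribes.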

1.2 Feature transformation processing

a. Dummy variable processing: this embodiment uses two approaches. For two-category variables, only the numerical codes 0 and 1 are needed, for example 0 for male and 1 for female. Multi-category variables such as occupation are processed by one-hot coding: for example, if there are 20 major occupation classes, the occupation feature is coded as a 20-dimensional vector in which one bit is 1 and the other bits are 0; each dimension represents an occupation, and the position of the 1 indicates the customer's occupation.

b. Feature discretization: for example, the income feature is generally only a rough estimate filled in by the customer and often deviates considerably from the true value. Income is therefore divided into five levels from low to high, with quantiles as boundaries; from an analysis standpoint, the significance of the transformed feature is often enhanced.

c. Numerical transformation: the embodiment uses a logarithmic transformation, a box-cox transformation and similar methods to process long-tailed data. For example, for customers' historical premiums, most customers are concentrated at low premiums; customers become rarer as the premium increases, but the premium values differ enormously, ranging from single digits to tens of millions. Putting such data directly into the model gives poor results, so this embodiment applies a logarithmic transformation to compress the data, making the overall distribution more even and closer to common statistical distributions.
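The three transformations of section 1.2 can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names, the three-occupation list and the income values are made up for illustration, the quantile boundaries are taken directly from the sorted sample, and `log1p` (log of 1 + x) is used instead of a plain logarithm so that a premium of 0 stays defined.

```python
import math

def encode_binary(value, positive):
    """Code a two-category variable as 0/1."""
    return 1 if value == positive else 0

def one_hot(value, categories):
    """Vector with a single 1 at the position of `value` in `categories`."""
    return [1 if value == c else 0 for c in categories]

def quantile_edges(values, n_bins):
    """Interior cut points splitting the sorted values into n_bins groups of roughly equal size."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def to_level(value, edges):
    """Level 0..len(edges): the number of cut points that `value` reaches or exceeds."""
    return sum(value >= e for e in edges)

def log_transform(x):
    """Compress a long-tailed non-negative amount (assumption: log1p rather than log)."""
    return math.log1p(x)

# Hypothetical data for illustration only.
gender_code = encode_binary("female", positive="female")           # male -> 0, female -> 1
occupation_vec = one_hot("engineer", ["teacher", "engineer", "doctor"])
incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
edges = quantile_edges(incomes, n_bins=5)                          # five income levels
levels = [to_level(v, edges) for v in incomes]
premium_small, premium_large = log_transform(5), log_transform(10_000_000)
```

After the transformation, premiums that differ by seven orders of magnitude end up within roughly one order of magnitude of each other, which is the compression effect the text describes.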

2. Customer feature screening in modeling feature acquisition step

Feature screening mainly comprises two steps: a chi-square test and random logistic regression feature screening.

In the first step, univariate analysis is carried out on each feature, and a chi-square test on a contingency table is used to judge preliminarily whether the feature has predictive power. The chi-square test mainly tests whether variables are independent of each other. Taking gender as an example, the null hypothesis is that gender is unrelated to insurance purchase intention; under this hypothesis, the expected purchase proportions of men and women can be estimated from the overall purchase proportion of customers in a given period. The chi-square statistic is then calculated from the actual purchase proportions of men and women. It is found to be far larger than the critical value of the chi-square test, and the resulting p value is far smaller than 0.01, so the null hypothesis is rejected and the relationship between gender and insurance purchase intention is considered statistically significant. In the same way, the present embodiment tests all features that potentially affect the customer's purchase intention to screen out a first set of valid features.
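As a worked illustration of the gender example, the following sketch computes the Pearson chi-square statistic for a 2x2 contingency table. The counts are invented for illustration and are not data from the embodiment; no continuity correction is applied, and the p value uses the fact that, with one degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table.

    table[i][j] is the number of customers in group i (e.g. male / female)
    with outcome j (purchased / did not purchase).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were independent (null hypothesis).
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # One degree of freedom: survival function of chi-square is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows = male, female; columns = purchased, did not purchase.
observed = [[30, 70], [60, 40]]
stat, p = chi_square_2x2(observed)
keep_feature = p < 0.01  # reject independence, so keep gender as a candidate feature
```

For these counts the statistic is about 18.2, well above the 1% critical value of about 6.63 for one degree of freedom, so the feature would pass the preliminary screening.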

In the second step, a logistic regression model is fitted with the preliminarily screened features to obtain a significance ranking of the features, and the least significant features are eliminated one by one by stepwise regression to obtain the final usable features.

In the stepwise regression method adopted in this embodiment, all candidate features are first used to fit a model with the logistic regression algorithm, and the AIC (Akaike information criterion) of the model is obtained to judge its goodness of fit. At the same time, the Z-score of each variable is calculated, the variable with the smallest Z-score is removed, and a new round of regression is fitted with the remaining variables. The AIC values of the second and first models are then compared: if the AIC of the second model is smaller than that of the first, the variables of the second model are retained and the removal step above is repeated (the flow can be explained by the inset). Regression continues until the AIC no longer decreases, and the remaining variables are retained as the features for final modeling.
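The elimination loop described above can be sketched independently of any particular regression library. In this sketch, `fit` is a hypothetical callback that would fit a logistic regression on the given features and return its AIC together with each feature's Z-score (taken here as an absolute value, which is an assumption); `toy_fit` is only a stand-in with invented numbers so that the loop can be run.

```python
def backward_stepwise(features, fit):
    """Backward elimination guided by AIC.

    `fit(features)` is a hypothetical callback returning
    (aic, {feature: abs_z_score}) for a model fitted on those features.
    """
    current = list(features)
    best_aic, z_scores = fit(current)
    while len(current) > 1:
        weakest = min(current, key=lambda f: z_scores[f])   # least significant variable
        candidate = [f for f in current if f != weakest]
        aic, candidate_z = fit(candidate)
        if aic >= best_aic:      # AIC no longer decreases: keep the current set
            break
        current, best_aic, z_scores = candidate, aic, candidate_z
    return current, best_aic

def toy_fit(features):
    """Stand-in for a real logistic regression fit; all numbers are invented."""
    gain = {"age": 20.0, "income": 10.0, "noise_a": 0.5, "noise_b": 1.0}
    z = {"age": 5.0, "income": 3.0, "noise_a": 0.2, "noise_b": 0.4}
    aic = 200.0 - sum(gain[f] for f in features) + 2 * len(features)
    return aic, {f: z[f] for f in features}

selected, final_aic = backward_stepwise(["age", "income", "noise_a", "noise_b"], toy_fit)
```

With the toy numbers, the two noise features are removed in turn because each removal lowers the AIC, and the loop stops when removing `income` would raise it.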

3. Procedure for model training step

The invention uses a stacking integration algorithm to construct the model. The algorithm principle is to combine a plurality of base models, take the result predicted by the base models as the characteristic, and then use the predicted values as the input of the next layer model to carry out the second round of training. The algorithm weakens the defects of a single base model by combining various machine learning algorithms, integrates the advantages of each model and achieves the result with better prediction effect.

In this example, the detailed implementation process of the stacking integration algorithm is as follows:

In this embodiment a two-layer stacked model is used. The first layer consists of five models, "Random Forest", "Adaboost", "Gradient Boosting Tree", "Support Vector Machine" and "Logistic Regression", each fitted using cross validation. Taking "Random Forest" as an example, the data is divided into a training set (75%) and a test set (25%), both of which contain labels of the customers' actual purchases. On the training set, 5-fold cross validation is used: the training data is divided into five equal parts, and each time four parts are used to train a model while the remaining part is used for prediction. This yields a model Mi (i = 1,2,3,4,5) and a prediction set Pi (i = 1,2,3,4,5) for the held-out part of each fold, and each model Mi is also used to predict the test set, giving a test prediction set ti (i = 1,2,3,4,5). After the five rounds of training and prediction, the predictions Pi are spliced, in the order of the training samples, into the training prediction set D1 of the Random Forest model, and the five test prediction sets are averaged sample by sample to obtain the test prediction set T1 of the Random Forest model; the training set label L1 and the test set label L2 remain fixed. The remaining four models are processed in the same way, finally giving the training prediction sets Dm and test prediction sets Tm (m = 1,2,3,4,5, where each subscript denotes one of the five models above), together with the training label L1 and the test label L2.
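The data flow for a single first-layer model can be sketched as below. The sub-model itself is replaced by a deliberately trivial, hypothetical learner (`fit_rate_by_side`) so that the example needs no external library; the interleaved assignment of samples to folds and all data values are likewise assumptions. Only the assembly of D (out-of-fold predictions in sample order) and T (test predictions averaged over the five fold models) follows the description.

```python
from statistics import mean

def out_of_fold(train_x, train_y, test_x, fit, k=5):
    """Return (D, T) for one sub-model.

    D: out-of-fold predictions for the training set, in training-sample order.
    T: test-set predictions averaged over the k fold models.
    """
    n = len(train_x)
    oof = [None] * n
    test_preds = []
    for i in range(k):
        held_out = set(range(i, n, k))            # assumption: every k-th sample forms fold i
        fit_x = [x for j, x in enumerate(train_x) if j not in held_out]
        fit_y = [y for j, y in enumerate(train_y) if j not in held_out]
        model = fit(fit_x, fit_y)                 # M_i, trained on the other four parts
        for j in held_out:                        # P_i, written back at the sample's position
            oof[j] = model(train_x[j])
        test_preds.append([model(x) for x in test_x])   # t_i
    averaged = [mean(col) for col in zip(*test_preds)]   # average the five t_i per sample
    return oof, averaged

def fit_rate_by_side(xs, ys):
    """Toy stand-in for a real sub-model: purchase rate above / below the mean of x."""
    cut = mean(xs)
    hi = [y for x, y in zip(xs, ys) if x >= cut]
    lo = [y for x, y in zip(xs, ys) if x < cut]
    return lambda x: mean(hi) if x >= cut else mean(lo)

# Invented one-feature data set: ten training customers, three test customers.
train_x = list(range(10))
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
test_x = [2, 4.6, 7]
D1, T1 = out_of_fold(train_x, train_y, test_x, fit_rate_by_side)
```

Because every training prediction comes from a model that never saw that sample, D1 can be fed to the second layer without leaking the training labels through the first layer.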

The second layer of the ensemble model uses the Xgboost algorithm. The training prediction sets Dm are combined horizontally with L1 to form the training set D of the second-layer model, and the test prediction sets Tm are combined horizontally with L2 to form its test set T. The Xgboost algorithm is fitted on the training set D, and the effect of the final model is verified with the test set T.
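The horizontal combination can be pictured as placing the five prediction columns side by side and appending the label as a final column. The sketch below only builds that table; the prediction values are invented, `stack_columns` is a hypothetical helper name, and the Xgboost fit itself is intentionally left out.

```python
def stack_columns(prediction_sets, labels):
    """Place each sub-model's predictions side by side and append the label as the last column."""
    return [list(row) + [label] for row, label in zip(zip(*prediction_sets), labels)]

# Hypothetical first-layer outputs D_m for three training customers.
D_m = [
    [0.90, 0.20, 0.60],  # Random Forest
    [0.80, 0.30, 0.50],  # Adaboost
    [0.70, 0.10, 0.70],  # Gradient Boosting Tree
    [0.60, 0.40, 0.40],  # Support Vector Machine
    [0.85, 0.25, 0.55],  # Logistic Regression
]
L1 = [1, 0, 1]                 # training labels
D = stack_columns(D_m, L1)     # one row per customer: five predictions + label

# The second-layer learner would use the first five columns as features
# and the last column as the target.
features = [row[:-1] for row in D]
targets = [row[-1] for row in D]
```

The test set T is built the same way from Tm and L2.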

In this embodiment, the model parameters are tuned according to the AUC, recall and accuracy of the predictions on the test set. When the results meet expectations, tuning stops and the model is saved as the final model for the subsequent prediction application. This part is also called the model training module; after the parameters are determined, the latest training data is extracted at regular intervals to update the model.
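For reference, the three indicators can be computed from test-set labels and scores as in the sketch below. The labels and scores are invented, and the 0.5 decision threshold used for recall and accuracy is an assumption that the text does not specify.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_and_accuracy(labels, scores, threshold=0.5):
    """Recall and accuracy after turning scores into 0/1 predictions at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    true_pos = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    recall = true_pos / sum(labels)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    return recall, accuracy

# Invented test-set labels (1 = purchased) and model scores.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.7, 0.3, 0.1]
test_auc = auc(y_true, y_score)
test_recall, test_accuracy = recall_and_accuracy(y_true, y_score)
```

AUC does not depend on the threshold, whereas recall and accuracy do, so the three indicators give complementary views when tuning parameters.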

4. Process of customer score prediction step and customer filtering step

After model training is finished, the model file is stored in the background. Every day, the feature data of all customers is re-extracted as model input, and the model outputs each customer's purchase intention score for the insurance product on that day. Finally, for each marketer, the system lists the customers most worth visiting in the marketer's mobile sales management APP in order of score; once a customer has been visited, that customer is temporarily hidden from the recommendation list.
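The daily distribution step might look like the following sketch. How customers are assigned to marketers is not stated in the text, so the `assignments` mapping, the `visited` set, the `top_n` cut-off and all identifiers are assumptions made purely for illustration.

```python
def recommend(scores, assignments, visited, top_n=3):
    """For each marketer, list their not-yet-visited customers by descending score."""
    result = {}
    for marketer, customer_ids in assignments.items():
        candidates = [c for c in customer_ids if c not in visited]   # hide visited customers
        candidates.sort(key=lambda c: scores[c], reverse=True)
        result[marketer] = candidates[:top_n]
    return result

# Invented daily scores and a hypothetical marketer-to-customer assignment.
scores = {"c1": 0.91, "c2": 0.35, "c3": 0.78, "c4": 0.66}
assignments = {"marketer_a": ["c1", "c2", "c3"], "marketer_b": ["c4"]}
visited = {"c1"}
daily_lists = recommend(scores, assignments, visited, top_n=2)
```

Here `c1` has the highest score but is omitted from the list of `marketer_a` because it has already been visited.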

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A customer intelligent screening method based on an ensemble learning algorithm is characterized by comprising the following steps:

2. The method for intelligently screening clients based on ensemble learning algorithm according to claim 1, wherein the stacking ensemble learning algorithm model is a two-layer superimposed model, the first layer comprises a plurality of sub models, and the second layer comprises Xgboost algorithm;

3. The method as claimed in claim 2, wherein the sub-models include a Random Forest sub-model, an Adaboost sub-model, a Gradient Boosting Tree sub-model, a Support Vector Machine sub-model and a Logistic Regression sub-model.

4. The method as claimed in claim 1, wherein the customer score is a purchase intention score of the customer.

5. The method according to claim 1, wherein the modeling feature obtaining step further comprises abnormal data processing on the customer data, and the abnormal data processing specifically comprises: filling missing data, and deleting or correcting error data.

6. The method for intelligently screening clients based on the ensemble learning algorithm according to claim 1, wherein in the step of obtaining modeled features, the feature transformation processing includes dummy variable processing, and the dummy variable processing specifically includes: carrying out numerical coding treatment on the binary variables by adopting 0 and 1; and processing the multi-class dummy variables by adopting one-hot coding.

7. The method for intelligently screening clients based on ensemble learning algorithm as claimed in claim 1, wherein in the modeling feature obtaining step, the feature transformation process includes a feature discretization process, and the feature discretization is specifically as follows: and carrying out discretization treatment on the estimated customer characteristics by taking quantiles as boundaries.

8. The method for intelligently screening clients based on ensemble learning algorithm as claimed in claim 1, wherein in the step of obtaining modeling features, the feature transformation process includes numerical transformation, and the numerical transformation is specifically: and processing the long tail distribution data by adopting logarithmic transformation or box-cox transformation.

9. The method for intelligently screening clients based on the ensemble learning algorithm as claimed in claim 1, wherein in the step of obtaining modeled features, the step of screening client features after feature transformation specifically comprises the steps of:

10. A customer intelligent screening system based on an ensemble learning algorithm is characterized by comprising: