CN112149352B

CN112149352B - Prediction method for marketing activity clicking by combining GBDT automatic characteristic engineering

Info

Publication number: CN112149352B
Application number: CN202011007410.0A
Authority: CN
Inventors: 项亮; 方同星
Original assignee: Shanghai Shuming Artificial Intelligence Technology Co ltd
Current assignee: Shanghai Shuming Artificial Intelligence Technology Co ltd
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2021-08-31
Anticipated expiration: 2040-09-23
Also published as: CN112149352A

Abstract

A method for predicting marketing-campaign clicks by combining GBDT automatic feature engineering comprises a data preprocessing step, a GBDT prediction model establishing step, a step of establishing a prediction model with a regularization term, and a marketing-campaign click prediction step. The data preprocessing step comprises extracting original feature information from the users' original information, sequentially processing the original feature information in all batches that have task batch numbers, performing One-hot encoding on the home-location feature of the user's mobile phone number, and sorting all task batch numbers in ascending order to obtain the ordering of the task batches. The user prediction model is a combination of an LR model with a regularization term and a GBDT prediction model, and it is used to predict the click willingness of the user group targeted by the internet product marketing. The invention thus provides a way to directly predict users' advertisement click willingness and can process data with large-scale sparse features.

Description

Prediction method for marketing activity clicking by combining GBDT automatic characteristic engineering

Technical Field

The invention relates to the technical field of artificial intelligence marketing on the Internet, and in particular to a method for predicting marketing-campaign clicks by combining GBDT automatic feature engineering.

Background

With increasingly intense market competition in the internet industry, the application of big data has become a new mode of internet marketing, namely precise customer acquisition systems built on the big data of network operators. Such an intelligent customer acquisition system is centered on an operator's big database, directly captures the contact information of users who meet user-defined conditions, and communicates with those customers directly, which reduces an enterprise's customer acquisition cost and improves its profits.

Currently, advertisement marketing behavior is often predicted from user portraits and user behavior features. The commonly used machine learning algorithms fall into two classes: linear models, represented by Logistic Regression (LR) and the Factorization Machine (FM), and the Gradient Boosting Decision Tree (GBDT).

However, both classes of algorithm have inherent disadvantages:

First, because the expressive power of a linear model is limited, it cannot effectively learn interaction information between features. For example, logistic regression can learn only first-order features, and even the factorization machine, which does consider feature interaction, can learn only second-order interaction information. Linear models therefore rely heavily on feature engineering by algorithm engineers, who enhance the model's learning ability by manually selecting features and constructing high-order interaction features.

Second, the gradient boosting decision tree model can easily realize interaction among features by traversing the features and partitioning the samples in feature space, so it has strong learning capacity. However, in the field of marketing advertisement recommendation, user features often include a large number of sparse one-hot features, such as the home location or the URLs of pages accessed over 4G, and only a few of these features have a value for any given user.

Therefore, algorithms based on the gradient boosting decision tree are not well suited to data containing a large number of sparse features: they overfit easily, and feature information is wasted because many features may never be used as split nodes of a decision tree.
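The contrast drawn above between first-order and second-order models can be written out using the standard textbook forms of the two models. The symbols $w_0$, $w_i$, $\mathbf{v}_i$ and $\sigma$ are notation assumed here for illustration; they do not appear in the patent text.

```latex
% Logistic regression: only first-order terms in the features x_i
\hat{y}_{\mathrm{LR}}(\mathbf{x}) = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i\Big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Factorization machine (degree 2): adds pairwise interactions only
\hat{y}_{\mathrm{FM}}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```

Neither form contains a product of three or more features, which is why higher-order interactions must be supplied by hand-built features or, as proposed here, by another model.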

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a method for predicting marketing-campaign clicks that combines GBDT automatic feature engineering. The method uses GBDT to model high-order feature interactions among the continuous features in the user features, and combines the leaf nodes of that model, as new high-order-interaction sparse features, with the original sparse user features. This makes full use of the user feature information and also solves the problem that GBDT is insensitive to sparse features.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a prediction method for marketing activity clicking by combining GBDT automatic characteristic engineering is characterized by comprising a data preprocessing step S1, a GBDT prediction model establishing step S2 and an LR prediction model establishing step S3 with regularization terms;

the data preprocessing step S1 includes the steps of:

step S11: acquiring original information of a user, and extracting original feature information from the original information of the user; the original feature information comprises a user ID, the home location of the user's mobile phone number, a task batch number, the DPIs accessed by the user, and the user's DPI access frequency; the task batch number represents the user's original information within a date time period, and the DPIs accessed by the user and the user's DPI access frequency are measured per task batch number;

step S12: sequentially processing the original feature information in all batches that have task batch numbers, and performing One-hot encoding on the home-location feature of the user's mobile phone number; wherein the One-hot encoding process comprises:

sequentially expanding, by task batch number, all the different DPIs accessed by users into independent features, and expanding the DPI access frequency within each task batch number, across all the different DPIs accessed by users, into relationship features between the DPI accessed by the user and the user's DPI access frequency;

step S13: sorting all task batch numbers in ascending order to obtain the ordering of the task batches; the task batch numbers increase with date and time, so the more recent the date and time, the larger the task batch number;

the GBDT prediction model building step S2 includes the steps of:

step S21: after preprocessing, taking a user ID as the sample unit, treating the home-location feature and/or the feature indicating whether the user accessed a DPI as the sparse features of the sample, and defining the user's DPI access frequency as the continuous feature;

step S22: selecting the data in the task batch with the largest task batch number as the verification set, and using the data of the remaining task batches as the training set;

step S23: providing a GBDT prediction model to be established, taking the continuous features of each sample in the training set as the input of the GBDT prediction model, taking the relationship features between the DPI accessed by the user and the user's DPI access frequency of each sample in the training set as the output of the GBDT prediction model, training the GBDT prediction model, and verifying it with each sample in the verification set to obtain the parameter-optimized GBDT prediction model;

the LR prediction model establishing step S3 with the regularization term specifically includes:

step S31: taking a user ID as the sample unit, treating the home-location feature and/or the feature indicating whether the user accessed a DPI as the sparse features of the sample, and passing the user's DPI access frequency through the GBDT prediction model to obtain the leaf-node-position sparse features of the sample, wherein the number of leaf-node-position sparse features of the sample is the number of leaf nodes for the samples in the training set multiplied by the number of subtrees;

step S32: selecting the data in the task batch with the largest task batch number as the verification set, and using the data of the remaining task batches as the training set;

step S33: providing an LR model with a regularization term, splicing the leaf-node-position sparse features of each sample in the training set with the sparse features of that sample as the input of the LR model with the regularization term, taking the relationship features between the DPI accessed by the user and the user's DPI access frequency of each sample in the training set as the output of the LR model with the regularization term, training the LR model with the regularization term, and verifying it with each sample in the verification set to obtain the parameter-optimized LR model with the regularization term, which together with the optimized GBDT prediction model forms the user prediction model; wherein the output values of the LR model and the GBDT prediction model are weighted, and the weighted result is used as the output value of the user prediction model.
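The feature construction in steps S31 and S33 can be illustrated with a minimal sketch. It assumes two hand-written toy trees in place of a trained GBDT; the tree layout, thresholds, and the names `TREES`, `leaf_index` and `leaf_position_features` are hypothetical and chosen only for illustration. Each sample is routed to one leaf per tree, the leaf position is one-hot encoded, and the result is spliced with the sample's original sparse features.

```python
# Two hand-written toy trees stand in for a trained GBDT (an assumption:
# a real model would learn the split features and thresholds from data).
# Internal node: (feature_index, threshold, left_subtree, right_subtree).
# Leaf: an integer leaf index within its tree.
TREES = [
    (0, 2.0, 0, (1, 5.0, 1, 2)),   # tree 0 has leaves 0, 1, 2
    (1, 1.0, (0, 4.0, 0, 1), 2),   # tree 1 has leaves 0, 1, 2
]
LEAVES_PER_TREE = 3


def leaf_index(tree, x):
    """Walk one tree and return the index of the leaf that x falls into."""
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree


def leaf_position_features(x):
    """One-hot encode the reached leaf of every tree and concatenate them."""
    encoded = []
    for tree in TREES:
        one_hot = [0] * LEAVES_PER_TREE
        one_hot[leaf_index(tree, x)] = 1
        encoded.extend(one_hot)
    return encoded


continuous = [3.0, 4.0]   # hypothetical access frequencies for two DPIs
sparse = [1, 0, 0, 1]     # hypothetical one-hot location + DPI-accessed flags
lr_input = leaf_position_features(continuous) + sparse
print(lr_input)
```

When every tree has the same number of leaves, the leaf-position part of the vector has (leaves per tree) x (number of trees) entries, with exactly one nonzero entry per tree.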

Further, the method for predicting a click on a marketing campaign by combining GBDT automated feature engineering further includes a step S4 of predicting a click on a marketing campaign, where the step S4 specifically includes:

step S41: acquiring the user group whose marketing-campaign clicks are to be predicted and the users' original information for that group, and extracting original feature information from the users' original information; the original feature information comprises a user ID, the home location of the user's mobile phone number, a current task batch number, the DPIs accessed by the user, and the user's DPI access frequency; the DPIs accessed by the user and the user's DPI access frequency are measured per task batch number;

step S42: performing One-hot encoding on the home-location feature of the user's mobile phone number in the original feature information of the current task batch number; wherein the One-hot encoding process comprises:

expanding, for the current task batch number, all the different DPIs accessed by users into independent features, and expanding the DPI access frequency within the current task batch number, across all the different DPIs accessed by users, into relationship features between the DPI accessed by the user and the user's DPI access frequency;

step S43: taking a user ID as the sample unit, defining the user's DPI access frequency as the continuous feature, treating the home-location feature and/or the feature indicating whether the user accessed a DPI as the sparse features of the sample, and passing the user's DPI access frequency through the GBDT prediction model to obtain the leaf-node-position sparse features of the sample, wherein the number of leaf-node-position sparse features of the sample is the number of leaf nodes for the samples in the training set multiplied by the number of subtrees;

step S44: providing an established user prediction model, taking continuous features of each sample in the sample set as input of the GBDT prediction model to obtain the first prediction probability value of the GBDT prediction model, and performing feature splicing on leaf node position sparse features of the samples in the sample set and sparse features of the samples to serve as input of the LR model with the regularization term to obtain the second prediction probability value of the LR model with the regularization term; wherein the user prediction model is the LR model with regularization term + the GBDT prediction model;

step S45: weighting the first prediction probability value and the second prediction probability value, and taking the weighted result as the output value of the LR model with the regularization term + the GBDT prediction model.

Further, the LR model output value with the regularization term is weighted 0.8, and the GBDT prediction model output value is weighted 0.2.
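The weighting can be shown with a short sketch. The weights 0.8 and 0.2 come from the text; the function name `blend` and the two example probabilities are hypothetical.

```python
LR_WEIGHT = 0.8    # weight of the regularized LR output (from the text)
GBDT_WEIGHT = 0.2  # weight of the GBDT output (from the text)


def blend(p_gbdt, p_lr, w_lr=LR_WEIGHT, w_gbdt=GBDT_WEIGHT):
    """Weighted combination of the two model outputs."""
    return w_lr * p_lr + w_gbdt * p_gbdt


# Hypothetical outputs of the two models for one user.
score = blend(p_gbdt=0.30, p_lr=0.70)
print(round(score, 2))
```

Because the two weights sum to 1, the blended value stays in the range [0, 1] whenever both inputs are probabilities.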

Further, the click prediction method for marketing activities in combination with GBDT automatic feature engineering further includes:

step S46: according to the actual delivery requirement, selecting all or some of the users whose output value from the LR model with the regularization term + the GBDT prediction model exceeds a certain threshold, and performing the precision-marketing task on them.

According to the above technical scheme, the method for predicting marketing-campaign clicks by combining GBDT automatic feature engineering effectively uses GBDT to perform high-order interaction on the users' continuous features and output the result as sparse features, combines these with the original sparse features, models them with a logistic regression model, which is good at handling sparse features, and finally takes a weighted average of the logistic regression output and the GBDT output as the final result. The method can significantly improve the accuracy of user click behavior prediction.

Drawings

FIG. 1 is a flow chart illustrating a method for predicting a click on a marketing campaign by combining GBDT automatic feature engineering according to an embodiment of the present invention.

FIG. 2 is a schematic diagram illustrating the implementation of the process from step S2 to step S4 in the embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

In the following detailed description of the embodiments of the present invention, in order to clearly illustrate the structure of the present invention and to facilitate explanation, the structures shown in the drawings are not drawn to scale and are partially enlarged, deformed and simplified, so they should not be understood as a limitation of the present invention.

It should be noted that, in the following specific embodiments of the present invention, the method for predicting clicks on marketing campaigns by combining GBDT automatic feature engineering may include a data preprocessing step, a model building step, and a model using step. Compared with conventional algorithms based on the gradient boosting decision tree, the present invention provides a way to directly predict users' advertisement click willingness and is also suitable for processing data with large-scale sparse features.

Referring to fig. 1, fig. 1 is a flow chart illustrating a method for predicting a click on a marketing campaign by combining GBDT automatic feature engineering according to an embodiment of the present invention. As shown in fig. 1, the method for predicting a click on a marketing campaign by combining GBDT automatic feature engineering includes a data preprocessing step S1, a GBDT prediction model building step S2, an LR prediction model building step S3 with a regularization term, and a step S4 of predicting a click on a marketing campaign.

In an embodiment of the present invention, the data preprocessing step S1 includes the following steps:

step S11: acquiring original information of a user, and extracting original feature information from the original information of the user; the original feature information comprises a user ID (id), the home location of the user's mobile phone number (location), a task batch number (batch number), the DPIs accessed by the user (DPI) and the user's DPI access frequency (DPI frequency); the task batch number represents the user's original information within a date time period, and the DPIs accessed by the user and the user's DPI access frequency are measured per task batch number.

Step S12: sequentially processing the original feature information in all batches that have task batch numbers, and performing One-hot encoding on the home-location feature of the user's mobile phone number (One-hot encoding is a common data preprocessing method that maps each distinct value of a categorical feature to a new 0/1 feature); wherein the One-hot encoding process comprises:

sequentially expanding, by task batch number, all the different DPIs accessed by users into independent features, and expanding the DPI access frequency within each task batch number, across all the different DPIs accessed by users, into relationship features between the DPI accessed by the user and the user's DPI access frequency.

Specifically, one task batch number (batch number) can be regarded as corresponding to one day of user data. Within the same task batch number, a user may appear more than once in the original information, because the same user may access multiple DPIs. Therefore, every distinct DPI accessed by users is expanded into a separate feature: if a user has accessed that DPI, the user's value under the feature is 1, and otherwise 0.

Similarly, the users' DPI access frequency is expanded, across all the distinct DPIs accessed by users, into features relating each DPI to the user's access frequency: if the user accessed a DPI m times, the user's value under the corresponding feature is m, and otherwise 0.
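The expansion described above can be sketched as follows. The record layout, the column names (`location_*`, `visited_*`, `freq_*`), the function name `expand_batch` and the sample values are all hypothetical; the patent does not specify them.

```python
# Hypothetical raw records for one task batch:
# (user_id, location, batch_no, dpi, dpi_frequency)
raw_records = [
    ("u1", "Shanghai", 1, "dpi_a", 3),
    ("u1", "Shanghai", 1, "dpi_b", 1),
    ("u2", "Beijing", 1, "dpi_a", 5),
]


def expand_batch(records):
    """Collapse one batch to one row per user with one-hot and frequency columns."""
    locations = sorted({r[1] for r in records})
    dpis = sorted({r[3] for r in records})
    rows = {}
    for user_id, location, batch_no, dpi, freq in records:
        row = rows.get(user_id)
        if row is None:
            # Start every expanded column at 0, then fill in what was observed.
            row = {"user_id": user_id, "batch_no": batch_no}
            row.update({f"location_{loc}": 0 for loc in locations})
            row.update({f"visited_{d}": 0 for d in dpis})
            row.update({f"freq_{d}": 0 for d in dpis})
            rows[user_id] = row
        row[f"location_{location}"] = 1   # one-hot home location
        row[f"visited_{dpi}"] = 1         # 1 if the user accessed this DPI
        row[f"freq_{dpi}"] += freq        # m if the user accessed it m times
    return list(rows.values())


expanded = expand_batch(raw_records)
for row in expanded:
    print(row)
```

After this step each user ID appears once per batch, with a 0/1 column and a frequency column for every distinct DPI seen in that batch.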

Referring to Table 1 below, Table 1 describes the raw data before preprocessing, taking the data of one batch as an example:

Table 1: raw data before preprocessing

Referring to Table 2 below, Table 2 describes the data after preprocessing, taking the data of the same batch as an example:

Table 2: data after preprocessing

Step S13: sorting all task batch numbers in ascending order to obtain the ordering of the task batches; the task batch numbers increase with date and time, so the more recent the date and time, the larger the task batch number.

After this processing, each user ID within a task batch is unique. The user data of all batches are then processed and merged along the batch direction, and sorted in ascending order of task batch number (batch number), where a later batch date means a larger task batch number; this yields the processed samples.

After the data preprocessing step is completed, the data of the last batch can be selected as the verification sample set, and all other samples form the training sample set. The training sample set is used to train the model, and the verification sample set is used to select the model parameters.
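The batch-ordered split can be sketched in a few lines. The sample rows and the function name `split_by_latest_batch` are hypothetical.

```python
# Hypothetical preprocessed samples: one dict per (user, batch) row.
samples = [
    {"user_id": "u1", "batch_no": 3},
    {"user_id": "u2", "batch_no": 1},
    {"user_id": "u1", "batch_no": 2},
    {"user_id": "u3", "batch_no": 3},
]


def split_by_latest_batch(rows):
    """Sort by batch number (ascending) and hold out the largest batch."""
    ordered = sorted(rows, key=lambda r: r["batch_no"])
    latest = ordered[-1]["batch_no"]
    train = [r for r in ordered if r["batch_no"] != latest]
    validation = [r for r in ordered if r["batch_no"] == latest]
    return train, validation


train_set, validation_set = split_by_latest_batch(samples)
print(len(train_set), len(validation_set))
```

Holding out the most recent batch, rather than a random subset, means the parameters are selected on data that is later in time than anything seen in training.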

The idea of the embodiment of the invention is a marketing-campaign click prediction method combining GBDT automatic feature engineering, in which the user prediction model is the LR model with a regularization term plus the GBDT prediction model. That is, GBDT is used to model high-order feature interactions among the continuous features in the user features, and the leaf nodes of that model are combined, as new high-order-interaction sparse features, with the original sparse user features, so that the user feature information is fully used and the problem that GBDT is insensitive to sparse features is solved.

Therefore, in the embodiment of the present invention, the user prediction model actually includes two models, one is a GBDT prediction model, and the other is an LR model with a regularization term, that is, the user prediction model is a combination of the LR model with the regularization term + the GBDT prediction model.

The GBDT prediction model building step S2 includes the steps of:

step S23: providing a GBDT prediction model to be established, taking the continuous features of each sample in the training set as the input of the GBDT prediction model, taking the relationship features between the DPI accessed by the user and the user's DPI access frequency of each sample in the training set as the output of the GBDT prediction model, and training and verifying the GBDT prediction model to obtain the parameter-optimized GBDT prediction model;

step S33: providing an LR model with a regularization term, splicing the leaf-node-position sparse features of each sample in the training set with the sparse features of that sample as the input of the LR model with the regularization term, taking the relationship features between the DPI accessed by the user and the user's DPI access frequency of each sample in the training set as the output of the LR model with the regularization term, and training and verifying the LR model with the regularization term to obtain the parameter-optimized LR model with the regularization term.
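As an illustration of what "an LR model with a regularization term" can look like, the following is a minimal sketch of logistic regression trained by batch gradient descent with an L2 penalty. The patent does not state which regularizer is used, so L2 is an assumption here, as are the function names, the hyperparameters (`lam`, `lr`, `epochs`), the toy inputs, and the made-up binary labels.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lr_l2(X, y, lam=0.01, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent with an L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The lam * w term is the gradient of the (lam / 2) * ||w||^2 penalty.
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy spliced inputs (leaf one-hot + sparse features) with made-up labels.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
w, b = train_lr_l2(X, y)
print([round(predict_proba(w, b, x), 3) for x in X])
```

The penalty strength `lam` is the kind of parameter that would be chosen on the verification set; a larger value shrinks the weights more strongly, which matters when the spliced input is wide and sparse.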

That is, for the processed data, the last batch of data is selected as the verification sample set to perform the selection of the model parameters, and all the samples except the verification sample set constitute the training sample set for establishing the model. The user prediction model is selected as a combination of the LR model with regularization term + the GBDT prediction model.

Specifically, referring to fig. 2, fig. 2 is a schematic diagram illustrating the process from step S2 to step S4 according to an embodiment of the present invention. The model training and validation processes of step S2 and step S3 are not repeated here.

In an embodiment of the present invention, the method for predicting a click on a marketing campaign by combining GBDT automatic feature engineering includes a step S4 of predicting a click on a marketing campaign, which may specifically include:

step S41: acquiring the user group whose marketing-campaign clicks are to be predicted and the users' original information for that group, and extracting original feature information from the users' original information; the original feature information comprises a user ID, the home location of the user's mobile phone number, a current task batch number, the DPIs accessed by the user, and the user's DPI access frequency; the DPIs accessed by the user and the user's DPI access frequency are measured per task batch number.

The above step mainly performs feature extraction on the user group targeted by the internet product marketing; the original feature information of the current task batch number is then preprocessed in step S42 as follows:

step S42: performing One-hot encoding on the home-location feature of the user's mobile phone number in the original feature information of the current task batch number; the One-hot encoding process comprises expanding, for the current task batch number, all the different DPIs accessed by users into independent features, and expanding the DPI access frequency within that task batch number, across all the different DPIs accessed by users, into relationship features between the DPI accessed by the user and the user's DPI access frequency.

After the above processing steps are completed, the features are fed into the user prediction model, so that the users with high willingness can be screened out before the advertisement is delivered, and the marketing advertisement can be delivered precisely to those users.

In an embodiment of the present invention, the LR model output value with the regularization term may be weighted to 0.8, and the GBDT prediction model output value may be weighted to 0.2.

Of course, the present invention may further include step S46: according to the actual delivery requirement, selecting all or some of the users whose output value from the LR model with the regularization term + the GBDT prediction model exceeds a certain threshold, and performing the precision-marketing task on them.
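The selection in step S46 can be sketched as a threshold filter with an optional cap. The function name `select_users`, the `top_k` parameter, the scores and the threshold value are hypothetical; the patent leaves the threshold and the portion selected to the delivery requirement.

```python
def select_users(scores, threshold, top_k=None):
    """Return user IDs scoring above threshold, highest first; optionally only top_k."""
    chosen = sorted(
        (uid for uid, s in scores.items() if s > threshold),
        key=lambda uid: scores[uid],
        reverse=True,
    )
    return chosen if top_k is None else chosen[:top_k]


# Hypothetical blended scores per user ID.
scores = {"u1": 0.62, "u2": 0.15, "u3": 0.81, "u4": 0.47}
print(select_users(scores, threshold=0.4))           # all users above the threshold
print(select_users(scores, threshold=0.4, top_k=2))  # only part of them
```

Ranking before truncating means that when only part of the qualifying users can be targeted, the ones with the highest predicted willingness are kept.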

The result shows that a large number of users with low willingness can be directly screened out from the putting targets through the user prediction model, so that a large amount of marketing cost is saved, and the profit margin is increased.

The above description is only for the preferred embodiment of the present invention, and the embodiment is not intended to limit the scope of the present invention, so that all the equivalent structural changes made by using the contents of the description and the drawings of the present invention should be included in the scope of the present invention.

Claims

1. A prediction method for marketing activity clicking by combining GBDT automatic characteristic engineering is characterized by comprising a data preprocessing step S1, a GBDT prediction model establishing step S2 and an LR prediction model establishing step S3 with regularization terms;

the data preprocessing step S1 includes the steps of:

step S11: acquiring original information of a user, and extracting original feature information from the original information of the user; the original feature information comprises a user ID, the home location of the user's mobile phone number, a task batch number, the DPIs accessed by the user, and the user's DPI access frequency; the task batch number represents the user's original information within a date time period, and the DPIs accessed by the user and the user's DPI access frequency are measured per task batch number;

the GBDT prediction model building step S2 includes the steps of:

2. The method for predicting a click on a marketing campaign in conjunction with GBDT automation feature engineering of claim 1, further comprising a step S4 of predicting a click on a marketing campaign, wherein the step S4 specifically comprises:

step S44: providing an established user prediction model, taking continuous features of each sample in the sample set as input of the GBDT prediction model to obtain a first prediction probability value of the GBDT prediction model, and performing feature splicing on leaf node position sparse features of the samples in the sample set and sparse features of the samples to serve as input of the LR model with the regularization term to obtain a second prediction probability value of the LR model with the regularization term; wherein the user prediction model is the LR model with regularization term + the GBDT prediction model;

3. The method of predicting a marketing campaign click in conjunction with GBDT automated feature engineering of claim 2, wherein the LR model output value with regularization term is weighted 0.8 and the GBDT prediction model output value is weighted 0.2.

4. The method of predicting a marketing campaign click in conjunction with GBDT automated feature engineering of claim 2 or 3, further comprising: