CN112967088A - Marketing activity prediction model structure and prediction method based on knowledge distillation - Google Patents

Marketing activity prediction model structure and prediction method based on knowledge distillation

Info

Publication number
CN112967088A
Authority
CN
China
Prior art keywords
training
model
user
net
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110235391.5A
Other languages
Chinese (zh)
Inventor
项亮 (Xiang Liang)
潘信法 (Pan Xinfa)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuming Artificial Intelligence Technology Co ltd
Original Assignee
Shanghai Shuming Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuming Artificial Intelligence Technology Co., Ltd.
Priority to CN202110235391.5A
Publication of CN112967088A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0271 Personalized advertisement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A marketing campaign prediction method based on knowledge distillation comprises the steps of data preprocessing, data set partitioning and network-training-framework formation for a teacher model, data set partitioning and network-training-framework formation for a student model, prediction model building, and marketing campaign prediction. First, a relatively complex teacher model Net-T centered on a residual neural network is constructed; then a student model Net-S consisting of a simple neural network is constructed. A soft loss, formed at a high temperature from the soft labels produced by the teacher model Net-T, and a hard loss, formed from the true hard labels used in training the student model Net-S, are weighted to obtain the total knowledge-distillation loss function. With this total loss function as the objective function of the student model Net-S for actual deployment, the final neural network model is trained and used to make predictions. The results show that the hybrid model effectively extends the application of deep learning to computational advertising and recommender-system algorithms, and the accuracy of user click prediction is significantly improved.

Description

Marketing activity prediction model structure and prediction method based on knowledge distillation
Technical Field
The invention relates to the technical field of Internet artificial-intelligence marketing, and in particular to a marketing campaign prediction model structure and prediction method based on knowledge distillation.
Background
Deep learning algorithms have developed rapidly and been applied successfully in many fields. For example, residual neural networks (ResNet) in the field of computer vision (CV) largely solve the problem of vanishing gradients during training, while the Transformer and BERT models in natural language processing (NLP) provide extremely strong processing capability for text data. These revolutionary techniques have rapidly improved the effectiveness of deep learning in different fields and accelerated its deployment. However, as training data grow, network models become more complex, and parameter counts tend to grow rapidly, even to the order of hundreds of millions.
Taking computational advertising and recommender-system algorithms as an example, the following problems exist in practical application:
(1) The traffic actually faced by computational advertising and recommender systems is often very large, and such systems usually must respond in near real time. Although a deep learning model can rely on hardware (such as GPU acceleration) for offline testing, once deployed online an overly complex model with too many parameters responds too slowly; in particular, under the heavy traffic of a given business scenario, the demand for timely pushing cannot be met.
(2) Ordinarily, no deliberate distinction is made between the offline training model and the online deployment model: the model that trains best offline is moved online as-is. However, as is clear to those skilled in the art, there are inconsistencies between the training model and the deployment model. The models that perform well in training are either large and complex, or can only be realized by integrating several relatively simple models through ensemble learning. In computational advertising and recommendation algorithms, these simple models include logistic regression (LR), factorization machines (FM) and simple deep neural networks (DNN). In addition, large models place high demands on deployment resources (memory, GPU memory, etc.), while deployment imposes strict limits on latency and computational resources; large models are therefore generally inconvenient to deploy directly in a service.
(3) When a recommender system operates online, it may face adjustments and changes to features and other data structures; a complex large model is generally less flexible to adjust than a small model, which adds extra computational overhead.
(4) The relationship between a network model's parameter count and the amount of 'knowledge' it can capture or learn from data is not stably linear, but is closer to a growth curve with gradually diminishing marginal returns. Moreover, for exactly the same model architecture and parameter count, the amount of 'knowledge' that can be captured or learned from exactly the same training data is not necessarily the same. That is, a proper training method can keep the total parameter count small while capturing as much 'knowledge' as possible.
Given the above problems, the industry needs to reduce the number of model parameters while preserving performance, i.e., model compression, in order to find an effective path for deploying computational advertising and recommendation algorithms online.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a marketing campaign prediction method based on knowledge distillation. To this end, the technical scheme of the invention is as follows:
a marketing campaign prediction method based on knowledge distillation comprises a data preprocessing step S1, a teacher model data set division and network training frame forming step S2, a student model data set division and network training frame forming step S3 and a prediction model establishing step S4; the data preprocessing step S1 includes the steps of:
step S11: acquiring original information of N users, and extracting original characteristic information from the original information of each user; the original characteristic information comprises a user ID, a user mobile phone number attribution, a task batch number, a user access DPI frequency, an access time, an access duration characteristic and/or a digital label which is clicked or not by the user; the task batch number represents original information of a user in a date time period, and the DPI access frequency of the user are each task batch number as a measurement unit;
step S12: sequentially processing the original characteristic information in all batches with the task batch numbers, and performing One-hot coding processing on the attribution characteristics of the user mobile phone number; wherein the One-hot encoding process comprises:
sequentially expanding all different user access DPIs as independent features according to the task batch numbers, and expanding the DPI access frequency in the task batch numbers into the relationship features of the DPI and the DPI access frequency of the users according to all different user access DPIs;
the teacher model data set partitioning and network training framework forming step S2 uses a hybrid method of stratified sampling and k-fold cross-validation, specifically comprising:
S21: after preprocessing, equally dividing all data within the task batch number into k+1 sets, of which 1 set serves as a test set and the data of the remaining sets serve as a training set;
S22: computing from the training set the overall proportion of the two sample classes, click and no-click, and then dividing the training set into k sets such that the proportion of click and no-click samples drawn from each set is consistent with the overall proportion;
S23: selecting in turn one of the k sets as a validation set and the remaining k-1 sets as training sets, forming k validation/training set pairs; using the k pairs in turn to train the initialized teacher model, validating each training result with the corresponding validation set, and testing with the test set to obtain k groups of test results; wherein the teacher model is a residual neural network;
S24: averaging the k groups of test results to obtain their mean;
the student model data set partitioning and network training framework forming step S3 comprises:
S31: equally dividing all data within the task batch number into k+1 sets, of which 1 set serves as a validation set and the remaining data serve as a training set;
S32: adopting a neural network as the student model Net-S, wherein the student model Net-S comprises an input layer, M fully connected hidden layers and an output layer;
the prediction model building step S4 comprises:
step S41: providing an initialized knowledge distillation-based training model, wherein the training model comprises a teacher model Net-T training channel, a student model Net-S training channel and an output module;
step S42: training the teacher model Net-T on the data set according to the teacher model's data set partition and network training framework, taking the average of the k-fold cross-validation results as the final classifier, and obtaining soft labels;
step S43: training the student model Net-S on the data set according to the student model Net-S's data set partition and network training framework, obtaining the hard prediction by training at a lower temperature t, the hard prediction and the true hard label forming the hard loss function L_hard;
step S44: distilling the knowledge of the teacher model Net-T into the student model Net-S at a higher temperature T, i.e., training at the high temperature T to obtain the soft prediction, which together with the soft label forms the soft loss function L_soft;
step S45: weighting the soft and hard loss functions to obtain the total loss function Loss, i.e.
Loss = α·L_soft + β·L_hard, where α is the weight of the soft loss function L_soft and β is the weight of the hard loss function L_hard;
step S46: using the total loss function as the objective function of the student model Net-S for actual deployment, training to obtain the student model Net-S with optimized parameters, and using the finally optimized student model Net-S as the knowledge distillation-based marketing campaign prediction model.
Further, the overall architecture of the teacher model Net-T network in the teacher model training channel comprises:
an input layer, for inputting the data obtained by partitioning the teacher model Net-T's data set;
an embedding layer, for extracting information from and reducing the dimensionality of the data features input from the input layer;
a product layer, for performing outer-product and inner-product feature interactions on the features processed by the embedding layer;
a factorization layer, for factorizing the weight matrix after feature interaction;
a fully connected block comprising N hidden layers, wherein the hidden layers may take four network shapes: increasing, constant, diamond, or decreasing; N is greater than M;
an output layer, which outputs the predicted probability via a sigmoid function and, by defining a threshold, forms the binary click/no-click classification, i.e., an output classified as a positive or negative label.
Further, the student model Net-S comprises two or three fully connected hidden layers.
Further, in knowledge distillation the activation function of the output layer of the teacher model Net-T uses a generalized normalized exponential (softmax) function,
q_i = exp(z_i / T) / Σ_j exp(z_j / T),
and the higher the temperature T, the softer the soft labels obtained by training with the teacher model Net-T.
Further, in the knowledge distillation process the choice of the high temperature T is related to the parameter count of the student model Net-S: when the student model Net-S has fewer parameters, a lower temperature T is chosen; conversely, when it has more parameters, a higher temperature T is chosen.
Further, the knowledge distillation-based marketing campaign prediction method further comprises a marketing campaign prediction step S5, which specifically comprises:
step S51: acquiring the user group targeted by the planned Internet product marketing and the group's original user information, and extracting original characteristic information from the original user information; the task batch number denotes a user's original information within a date-time period, and the user's accessed DPIs and DPI access frequency are measured per task batch number;
step S52: applying One-hot encoding to the original characteristic information of the task batch number according to the user mobile phone number attribution characteristic; wherein the One-hot encoding comprises:
expanding all distinct user-accessed DPIs into independent features according to the task batch number, and expanding the DPI access frequency within the task batch number into user DPI/DPI-access-frequency relationship features according to all the distinct accessed DPIs;
step S53: providing the established knowledge distillation-based marketing campaign prediction model; wherein the sigmoid function limits the predicted value's probability range to between 0 and 1, and a defined threshold turns the prediction into the binary click/no-click classification, i.e., the predicted value of the knowledge distillation-based marketing campaign prediction model is the user's degree of click willingness.
Further, in the knowledge distillation-based marketing campaign prediction method, the model prediction step S5 further comprises:
step S54: according to actual delivery requirements, selecting all or part of the users from the set of users whose model prediction is 1 (willing to click) to carry out precision marketing tasks.
Further, after step S11, the method further comprises an anomaly detection and processing step, a continuous feature processing step and/or a dimensionality reduction step for the user's original information; the continuous feature processing step adjusts the data distribution of continuous features using the RankGauss method, and the dimensionality reduction step reduces high-dimensional features using principal component analysis.
Further, the knowledge distillation-based marketing campaign prediction method further comprises a step S47 of performing model evaluation index processing and tuning on the knowledge distillation-based marketing campaign prediction model; the model evaluation indexes comprise the AUC value, the Log loss value and the relative information gain (RIG) value.
Further, the model tuning process comprises one or more of the following steps:
adding batch normalization to address the internal covariate shift problem in the data;
adding to the network a mechanism that puts some neurons into a dormant state during training;
adjusting the learning rate, generally via strategies such as exponential decay during training;
training with multiple seeds and averaging, to better address insufficient generalization caused by high data variance;
adding L1 or L2 regularization, applying a penalty to the loss function to reduce the risk of overfitting;
and tuning the hyperparameters.
According to the technical scheme, a relatively complex teacher model Net-T centered on a residual neural network is first constructed, followed by a student model Net-S consisting of a simple neural network. A soft loss, formed at a high temperature from the soft labels produced by the teacher model Net-T, and a hard loss, formed from the true hard labels used in training the student model Net-S, are weighted to obtain the total knowledge-distillation loss function; with this total loss function as the objective function of the student model Net-S for actual deployment, the final neural network model is trained and used to make predictions.
By this method, Bayesian inference can be effectively exploited and prediction uncertainty introduced into the neural network, making the model more robust. Through the combined inner/outer-product method, features are crossed to extract high-dimensional latent features. The hybrid model effectively extends the application of deep learning to computational advertising and recommender-system algorithm problems, significantly improves the accuracy of user click-behavior prediction, and can screen a large number of low-intent users out of the delivery target, thereby saving substantial marketing cost and increasing profit margins.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for predicting a marketing campaign based on knowledge distillation according to an embodiment of the present invention
FIG. 2 is a schematic diagram of original data and data obtained after RankGauss processing in the embodiment of the present invention
FIG. 3 is a diagram of a teacher model data set partitioning and network training framework according to an embodiment of the present invention
FIG. 4 is a diagram of a teacher model in an embodiment of the invention
FIG. 5 is a schematic diagram of a student model according to an embodiment of the invention
FIG. 6 is a schematic diagram of a knowledge distillation based marketing campaign prediction model in an embodiment of the present invention
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
In the following detailed description of the embodiments of the present invention, in order to illustrate the structure of the invention clearly and conveniently, the structures shown in the drawings are not drawn to a common scale and have been partially enlarged, deformed and simplified; this should not be understood as limiting the invention.
Referring to fig. 1, fig. 1 is a flow chart illustrating a marketing campaign prediction method based on knowledge distillation according to an embodiment of the present invention. As shown in fig. 1, the knowledge distillation-based marketing campaign prediction method includes a data preprocessing step S1, a teacher model data set partitioning and network training framework forming step S2, a student model data set partitioning and network training framework forming step S3, and a prediction model building step S4.
In an embodiment of the present invention, the data preprocessing step S1 includes the following steps:
step S11: acquiring original information of N users and extracting original characteristic information from the users' original information; the original characteristic information comprises a user ID (id), the user's mobile phone number attribution (location), a task batch number (batch number), the DPIs accessed by the user (DPI), the user's DPI access frequency (DPI frequency), access time, access duration characteristics and/or a digital label indicating whether the user clicked; the task batch number denotes a user's original information within a date-time period; the user's DPI access frequency, DPI access time and/or access time are measured per task batch number; and the DPIs accessed by the user on the day and the user's mobile phone number attribution are categorical features.
Referring to Table 1 below, which describes the raw data before preprocessing, taking data of the same batch as an example:
[Table 1: example of raw data before preprocessing; in the original publication this table is available only as an image.]
In the embodiment of the present invention, the raw data further undergo anomaly detection and processing, categorical feature processing, continuous feature processing, dimensionality reduction and similar steps.
Anomaly detection and processing: in combination with business requirements, missing values, oversized values and the like in the raw data must be deleted, filled or otherwise handled. Since the number of users is generally on the order of millions, values may go missing during data acquisition; if the amount of missing data is small, the affected records can generally be removed directly; if it cannot be determined whether the missing data will affect the final training effect, missing values can be filled with the mean, mode, median, etc.
In addition, excessively large values may be encountered during data acquisition, for example a user accessing a DPI ten thousand times within a day. In actual modeling such values generally do not help improve the model's generalization ability, so they can be handled by removal or by a filling method.
Further, in the embodiment of the present invention, continuous features can also be processed, i.e., access time and access duration data of different scales can be mapped to a uniform interval. Specifically, for characteristics such as access time and access duration, the data distribution can be adjusted using, for example, the RankGauss method. RankGauss is similar to conventional normalization or standardization methods: its basic function is to map data of different scales to a uniform interval, such as 0 to 1 or -1 to 1, which is very important for gradient-based algorithms such as deep learning. On this basis, RankGauss further applies the inverse of the error function so that the normalized data approximate a Gaussian distribution. Referring to FIG. 2, which shows the original data and the data obtained after RankGauss processing in the embodiment of the present invention: panel (a) is the original data and panel (b) is the data after RankGauss.
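As a concrete illustration (not part of the patent text), a minimal RankGauss transform along these lines can be sketched in Python; the feature name and toy data are invented for the example:

```python
import numpy as np
from scipy.special import erfinv

def rank_gauss(x: np.ndarray, epsilon: float = 1e-6) -> np.ndarray:
    """Map a 1-D continuous feature to an approximately Gaussian distribution.

    Ranks the values, rescales the ranks to the open interval (-1, 1),
    then applies the inverse error function, as described above.
    """
    ranks = x.argsort().argsort().astype(np.float64)         # ranks 0 .. n-1
    scaled = ranks / (len(x) - 1) * 2.0 - 1.0                # rescale to [-1, 1]
    scaled = np.clip(scaled, -1.0 + epsilon, 1.0 - epsilon)  # keep erfinv finite
    return erfinv(scaled)

# Example: normalize a hypothetical "access_duration" feature
durations = np.random.exponential(scale=30.0, size=10_000)
gaussian_durations = rank_gauss(durations)
```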
In the embodiment of the invention, principal component analysis (PCA) can further be adopted to reduce the dimensionality of high-dimensional features. As seen from the categorical feature processing above, one-hot encoding generally produces a high-dimensional sparse matrix. For neural network training this means that many positions yield no useful gradient during backpropagation of the error, which is clearly unfavorable for network training; at the same time, high-dimensional features also increase computational overhead. It is therefore necessary to reduce the dimensionality of high-dimensional features. PCA achieves dimensionality reduction by finding the projection directions in which the original data have maximum variance; it reduces the feature dimensionality while losing as little as possible of the information contained in the original features, so that the collected data can be analyzed comprehensively.
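A sketch of such PCA-based reduction, here using scikit-learn as one possible implementation (the patent does not prescribe a library, and the 95% variance threshold is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

# X stands in for the high-dimensional one-hot feature matrix (random toy data here)
X = np.random.rand(1000, 500)

# Keep enough components to retain roughly 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```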
Step S12: processing the categorical features; applying One-hot encoding to the user's mobile phone number attribution and to the DPIs accessed by the user. The One-hot encoding comprises expanding all distinct user-accessed DPIs into independent features in task-batch-number order, and expanding the DPI access frequency within the task batch numbers into user DPI/DPI-access-frequency relationship features according to all the distinct accessed DPIs.
Specifically, One-hot encoding can first be applied to the DPIs accessed by the user on the day and to the user's mobile phone number attribution, expanding them into columns. Taking accessed DPIs as an example: if a user accessed a given DPI, that DPI is recorded as 1 and the remaining DPIs as 0. Thus, with 10 distinct DPIs, 10 feature columns are formed, and in each column only the corresponding users are 1 while the rest are 0.
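The expansion just described might be sketched as follows; the column names (user_id, dpi, frequency) and the toy access log are assumptions for illustration:

```python
import pandas as pd

# Toy per-batch access log: one row per (user, DPI) pair with its visit count
log = pd.DataFrame({
    "user_id":   [1, 1, 2, 3],
    "dpi":       ["dpi_a", "dpi_b", "dpi_a", "dpi_c"],
    "frequency": [3, 1, 5, 2],
})

# 0/1 indicator features: whether the user accessed each DPI at all
visited = pd.crosstab(log["user_id"], log["dpi"]).clip(upper=1).add_prefix("visited_")

# Relationship features: DPI x access frequency, one column per DPI
freq = log.pivot_table(index="user_id", columns="dpi",
                       values="frequency", fill_value=0).add_prefix("freq_")

features = visited.join(freq)
print(features)
```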
After preprocessing, the data take the form of Table 2 below:
[Table 2: data after preprocessing; in the original publication this table is available only as an image.]
After the data processing step, the teacher model data set partitioning and network training framework forming step S2 can be performed; step S2 uses a hybrid method combining stratified sampling and k-fold cross-validation, specifically comprising:
s21: after preprocessing, selecting all data in the task batch number to be equally divided into k +1 sets; wherein, 1 set is used as a test set, and the data of the rest sets are used as a training set;
s22: calculating the total proportion of each sample in the two types of samples of clicking and non-clicking from the training set, and then, assuming that the training set is divided into k sets and the requirement of meeting the requirement that the proportion of the two types of samples of clicking and non-clicking in the samples obtained from each set is consistent with the total proportion;
s23: sequentially selecting one of the K sets as a verification set, and the rest K-1 sets as training sets to form K groups of verification set pairs and training set pairs; sequentially using K groups of verification sets and training set pairs to train the initialized teacher model, verifying the training result by using the corresponding verification sets, and testing by using the test sets to obtain K groups of test results; wherein the teacher model is a residual error neural network;
s24: averaging the K groups of test results to obtain an average value of the K groups of test results as a final classifier; the error generated by the final average of the network training framework using the K-fold interactive verification of the teacher model may also be referred to as out-of-bag (oob) error.
Referring to FIG. 3, which is a schematic diagram of the teacher model's data set partitioning and network training framework according to an embodiment of the present invention. As shown in FIG. 3, the hybrid of stratified sampling and k-fold cross-validation used for data set partitioning and the network training framework makes the teacher model more accurate.
Stratified sampling is a sampling method that preserves class proportions: the overall sample is divided into several strata according to a certain characteristic, and a purely random sample is then drawn from each stratum. Specifically, the proportion of the two classes, click and no-click, can first be computed from the training set; the training set is then divided into k strata, and sampling within each stratum keeps the class proportions of the drawn samples essentially consistent with the overall proportions.
In the embodiment of the present invention, 5-fold cross-validation is taken as an example: the 80% training split is subdivided into 5 parts, and each time 1 part is selected as the validation set with the other 4 parts as the training set. The model is thus trained 5 times, and after each training the 20% test set is used to obtain the model's evaluation index; the average of the 5-fold cross-validation results serves as the final classifier. Although the training process is thus relatively heavy, the error produced by the final averaging in the teacher model's k-fold cross-validation network training framework is very small.
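A sketch of this stratified hold-out plus stratified 5-fold loop, assuming scikit-learn and a hypothetical train_and_score helper that stands in for teacher training and test-set scoring:

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# X: preprocessed features, y: 0/1 click labels (toy data here)
X = np.random.rand(6000, 20)
y = np.random.binomial(1, 0.1, size=6000)

# Hold out 1 of the k+1 = 6 equal parts as the test set, preserving class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/6, stratify=y, random_state=42)

# Stratified 5-fold cross-validation over the remaining data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    # hypothetical helper: trains the teacher on (X_tr, y_tr), validates on
    # (X_val, y_val), and returns its score on the held-out test set
    fold_scores.append(train_and_score(X_tr, y_tr, X_val, y_val, X_test, y_test))

print("mean of the k test results:", np.mean(fold_scores))
```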
In the embodiment of the invention, the teacher model is generally complex and must be able to express the characteristics in the data strongly. Referring to FIG. 4, which is a schematic diagram of a teacher model according to an embodiment of the present invention: as shown in FIG. 4, the overall architecture of the teacher model Net-T network in the teacher model training channel comprises the following layers (a simplified code sketch follows the list):
an Input layer, for inputting the data obtained by partitioning the teacher model Net-T's data set; the features can be divided into individual fields according to feature type (e.g., DPI duration, gender, age distribution), with One-hot encoding applied to the categorical features;
an Embedding layer, for extracting information from and reducing the dimensionality of the data features input from the input layer;
a Product layer, for performing outer-product and inner-product feature interactions on the features processed by the embedding layer;
a Factorization layer, for factorizing the weight matrix after feature interaction; that is, to reduce the amount of computation, the weight matrix is factorized as shown in FIG. 4;
a Fully-connected block comprising N hidden layers, wherein the hidden layers may take four network shapes: increasing, constant, diamond, or decreasing; N is greater than M;
an output layer, which outputs the predicted probability via a sigmoid function and, by defining a threshold, forms the binary click/no-click classification, i.e., an output classified as a positive or negative label.
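The figures themselves are not reproduced here, but a heavily simplified PyTorch sketch of this embedding + product + fully connected pipeline may help; the field counts, embedding size and layer widths are illustrative assumptions, and the residual connections and weight factorization of the actual Net-T are omitted for brevity:

```python
import torch
import torch.nn as nn

class SimplifiedTeacher(nn.Module):
    """Toy product-based network: embeddings, pairwise inner products, MLP head."""

    def __init__(self, field_dims, embed_dim=8, hidden=(256, 128, 64)):
        super().__init__()
        # One embedding table per categorical field
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, embed_dim) for card in field_dims)
        n_fields = len(field_dims)
        n_pairs = n_fields * (n_fields - 1) // 2
        layers, in_dim = [], n_fields * embed_dim + n_pairs
        for h in hidden:                      # a "decreasing" hidden-layer shape
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.mlp = nn.Sequential(*layers)
        self.out = nn.Linear(in_dim, 1)       # single output neuron; sigmoid applied later

    def forward(self, x):                     # x: LongTensor of shape [batch, n_fields]
        embs = [emb(x[:, i]) for i, emb in enumerate(self.embeddings)]
        e = torch.stack(embs, dim=1)                     # [batch, fields, dim]
        # Pairwise inner products: the "product layer"
        inner = torch.matmul(e, e.transpose(1, 2))       # [batch, fields, fields]
        iu = torch.triu_indices(e.size(1), e.size(1), offset=1)
        products = inner[:, iu[0], iu[1]]                # [batch, n_pairs]
        z = torch.cat([e.flatten(1), products], dim=1)
        return self.out(self.mlp(z)).squeeze(-1)         # logits
```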
Next, the student model data set partitioning and network training framework forming step S3 is performed, comprising:
S31: equally dividing all data within the task batch number into k+1 sets, of which 1 set serves as a validation set and the remaining data serve as a training set;
S32: adopting a neural network as the student model Net-S, which comprises an input layer, M fully connected hidden layers and an output layer.
In the embodiment of the invention, considering flexibility of model deployment, low consumption of computing resources and similar concerns, the student model can be a simple deep neural network; for example, the student model Net-S can comprise two or three fully connected hidden layers. Referring to FIG. 5, which is a schematic diagram of a student model according to an embodiment of the invention: as shown in FIG. 5, the student model contains only three fully connected hidden layers between the input layer and the output layer, and the number of network parameters is greatly reduced compared with the teacher model.
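A matching minimal sketch of such a student network (layer sizes again illustrative):

```python
import torch.nn as nn

class StudentNet(nn.Module):
    """Net-S sketch: input layer, three fully connected hidden layers, one output."""

    def __init__(self, in_dim, hidden=(64, 32, 16)):
        super().__init__()
        layers, d = [], in_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers.append(nn.Linear(d, 1))   # single click/no-click logit
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)   # logits; sigmoid applied in the loss/inference
```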
After the above models are built, the prediction model building step S4 can be executed. In the embodiment of the invention, the prediction model is based on the overall knowledge-distillation framework. Referring to FIG. 6, which is a schematic diagram of the knowledge distillation-based marketing campaign prediction model according to an embodiment of the present invention; the steps of building this prediction model are described below with reference to FIG. 6.
The prediction model building step S4 comprises:
step S41: providing an initialized knowledge distillation-based training model, wherein the training model comprises a teacher model Net-T training channel, a student model Net-S training channel and an output module;
step S42: training the teacher model Net-T on the data set according to the teacher model's data set partition and network training framework, taking the average of the k-fold cross-validation results as the final classifier, and obtaining soft labels;
step S43: training the student model Net-S on the data set according to the student model Net-S's data set partition and network training framework, obtaining the hard prediction by training at a lower temperature t, the hard prediction and the true hard label forming the hard loss function L_hard;
step S44: distilling the knowledge of the teacher model Net-T into the student model Net-S at a higher temperature T, i.e., training at the high temperature T to obtain the soft prediction, which together with the soft label forms the soft loss function L_soft;
step S45: weighting the soft and hard loss functions to obtain the total loss function Loss, i.e.
Loss = α·L_soft + β·L_hard, where α is the weight of the soft loss function L_soft and β is the weight of the hard loss function L_hard; high-temperature distillation is the most important step in building the prediction model;
step S46: using the total loss function as the objective function of the student model Net-S for actual deployment, training to obtain the student model Net-S with optimized parameters, and using the finally optimized student model Net-S as the knowledge distillation-based marketing campaign prediction model.
It should be noted here that for multi-class problems the activation function of the output layer generally uses the normalized exponential (softmax) function; to introduce the concept of temperature in knowledge distillation, it is modified to a generalized softmax. That is, the activation function of the output layer of the teacher model Net-T uses the generalized normalized exponential function
q_i = exp(z_i / T) / Σ_j exp(z_j / T),
and the higher the temperature T, the softer the soft labels obtained by training with the teacher model Net-T.
In addition, in the knowledge distillation process the choice of the high temperature T is related to the parameter count of the student model Net-S: when the student model Net-S has fewer parameters, a lower temperature T is chosen; conversely, when it has more parameters, a higher temperature T is chosen.
That is, during knowledge distillation an appropriate temperature is selected for extracting the feature knowledge. Generally, the magnitude of the high temperature T determines how much attention the student model Net-S pays to negative labels during training (for click-through rate, the negative label is no-click, i.e., label 0). When the temperature is lower, less attention is paid to negative labels, especially those significantly below average; when the temperature is higher, the values associated with negative labels are relatively amplified and the student model Net-S pays relatively more attention to them. In general, the choice of the high temperature T is related to the size of the student model Net-S: when the student model Net-S has fewer parameters, a relatively lower temperature T can be chosen.
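Combining the generalized softmax with the weighted loss of step S45, a minimal distillation loss for the single-logit binary click task might look as follows; treating both terms as cross-entropies on temperature-scaled outputs is an assumption, since the patent does not spell out the exact loss forms, and the values of T, α and β are tunable:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=4.0, alpha=0.7, beta=0.3):
    """Loss = alpha * L_soft + beta * L_hard for a single-logit binary classifier.

    With one output neuron, the generalized softmax q_i = exp(z_i/T) / sum_j exp(z_j/T)
    reduces to a temperature-scaled sigmoid.
    """
    # Soft loss: student at high temperature T vs. the teacher's soft labels at T
    soft_targets = torch.sigmoid(teacher_logits.detach() / T)
    l_soft = F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
    # Hard loss: student at low temperature (t = 1 here) vs. the true labels
    l_hard = F.binary_cross_entropy_with_logits(student_logits, hard_labels.float())
    return alpha * l_soft + beta * l_hard

# Usage inside a training loop (teacher, student, x, y assumed to exist):
#   loss = distillation_loss(student(x), teacher(x), y)
#   loss.backward(); optimizer.step()
```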
Specifically, after the above training, the finally optimized student model Net-S serves as the knowledge distillation-based marketing campaign prediction model, and its output classifies predicted users into two categories, 'click' and 'no click'; one output-layer neuron is therefore added at the end of the network structure.
After model training is finished, the method further comprises a step S47 of performing model evaluation index processing and tuning on the knowledge distillation-based marketing campaign prediction model; the model evaluation indexes comprise the AUC (Area Under Curve) value, the Log loss value and the relative information gain (RIG) value. In general, the closer the AUC value is to 1, the better the model's classification; the smaller the Log loss value, the more accurate the click-rate estimate; and the larger the relative information gain value, the better the model.
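These indexes can be computed as sketched below; the RIG formula used here (improvement of the model's log loss over a constant base-rate predictor) is a common definition and an assumption, as the patent does not define one:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

def evaluate(y_true, y_pred):
    """y_true: 0/1 labels; y_pred: predicted click probabilities in (0, 1)."""
    auc = roc_auc_score(y_true, y_pred)
    ll = log_loss(y_true, y_pred)
    # Log loss of always predicting the empirical click rate (entropy of the base rate)
    p = np.mean(y_true)
    baseline = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    rig = 1.0 - ll / baseline            # relative information gain (assumed form)
    return {"AUC": auc, "LogLoss": ll, "RIG": rig}
```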
For example, after the data are processed according to the above steps and the model is trained, the training effect can be judged from the locally validated AUC value; if the effect is poor, the model generally needs tuning. For a deep learning algorithm, tuning can generally proceed along the following lines:
(1) Add batch normalization to address the internal covariate shift problem in the data.
(2) Add dropout to the network, i.e., put some of the neurons into a dormant state during training.
(3) Adjust the learning rate, generally via strategies such as exponential decay during training.
(4) Train with multiple seeds and average, reducing the risk of overfitting during training.
(5) Add L1 or L2 regularization, applying a penalty to the loss function to reduce the risk of overfitting.
(6) Tune the hyperparameters.
For hyperparameter optimization, grid search or random search can generally be adopted; however, both consume considerable computing resources and are not very efficient. The embodiment of the present invention therefore employs a Bayesian optimization strategy. Bayesian optimization uses Gaussian process regression over the first n sampled points to compute a posterior probability distribution, giving the mean and variance of the objective at each candidate hyperparameter value; by balancing mean against variance and following the joint probability distribution among the hyperparameters, it finally selects a better set of hyperparameters.
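As one possible realization (the patent does not name a library), such a Gaussian-process search can be run with scikit-optimize; train_and_validate is a hypothetical helper returning a validation loss, and the search space is illustrative:

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Real(1.0, 10.0, name="temperature_T"),
    Real(0.1, 0.9, name="alpha"),        # soft-loss weight; beta = 1 - alpha
    Integer(16, 128, name="hidden_units"),
]

def objective(params):
    lr, T, alpha, hidden = params
    # hypothetical helper: trains the student with these hyperparameters
    # and returns the validation log loss to be minimized
    return train_and_validate(lr=lr, T=T, alpha=alpha, hidden=hidden)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best hyperparameters:", result.x, "best loss:", result.fun)
```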
After all the processing steps are completed, the features are fed into the user prediction model, so that high-intent users can be screened out in advance of advertisement delivery and targeted precisely with marketing advertisements.
That is, the present invention may further comprise a marketing campaign prediction step S5, which specifically comprises:
step S51: acquiring the user group targeted by the planned Internet product marketing and the group's original user information, and extracting original characteristic information from the original user information; the task batch number denotes a user's original information within a date-time period, and the user's accessed DPIs and DPI access frequency are measured per task batch number;
step S52: applying One-hot encoding to the original characteristic information of the task batch number according to the user mobile phone number attribution characteristic; wherein the One-hot encoding comprises:
expanding all distinct user-accessed DPIs into independent features according to the task batch number, and expanding the DPI access frequency within the task batch number into user DPI/DPI-access-frequency relationship features according to all the distinct accessed DPIs;
step S53: providing the established knowledge distillation-based marketing campaign prediction model; the sigmoid function limits the predicted value's probability range to between 0 and 1, and a defined threshold turns the prediction into the binary click/no-click classification, i.e., the predicted value of the knowledge distillation-based marketing campaign prediction model is the user's degree of click willingness;
step S54: according to actual delivery requirements, selecting all or part of the users from the set of users whose model prediction is 1 (willing to click) to carry out precision marketing tasks.
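A final scoring-and-selection pass might be sketched as follows; student, user_features and user_ids are assumed to come from the preceding steps, and the 0.5 threshold is an illustrative choice:

```python
import torch

@torch.no_grad()
def select_target_users(student, user_features, user_ids, threshold=0.5):
    """Score users with the distilled Net-S and keep those predicted to click."""
    probs = torch.sigmoid(student(user_features))   # click willingness in (0, 1)
    return [(uid, p.item())                         # high-intent users and scores
            for uid, p in zip(user_ids, probs) if p >= threshold]

# All or part of the returned high-intent users can then be chosen
# according to the actual delivery requirements.
```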
The results show that the method of the invention can effectively exploit Bayesian inference and introduce prediction uncertainty into the neural network, making the model more robust. Through the combined inner/outer-product method, features are crossed to extract high-dimensional latent features. The hybrid model effectively extends the application of deep learning to computational advertising and recommender-system algorithm problems and significantly improves the accuracy of user click-behavior prediction, thereby saving substantial marketing cost and increasing profit margins.
The above description covers only preferred embodiments of the present invention; these embodiments are not intended to limit the scope of the invention, and all equivalent structural changes made using the contents of the description and drawings of the present invention shall fall within the scope of the invention.

Claims (10)

1. A marketing campaign prediction method based on knowledge distillation, characterized by comprising a data preprocessing step S1, a teacher model data set partitioning and network training framework forming step S2, a student model data set partitioning and network training framework forming step S3, and a prediction model building step S4;
the data preprocessing step S1 comprises:
step S11: acquiring original information of N users and extracting original characteristic information from each user's original information; the original characteristic information comprises a user ID, the user's mobile phone number attribution, a task batch number, the DPIs accessed by the user, the user's DPI access frequency, access time, access duration characteristics and/or a digital label indicating whether the user clicked; the task batch number denotes a user's original information within a date-time period, and the user's accessed DPIs and DPI access frequency are measured per task batch number;
step S12: processing the original characteristic information of all batches in task-batch-number order, and applying One-hot encoding to the user mobile phone number attribution characteristic; wherein the One-hot encoding comprises:
expanding all distinct user-accessed DPIs into independent features in task-batch-number order, and expanding the DPI access frequency within the task batch numbers into user DPI/DPI-access-frequency relationship features according to all the distinct accessed DPIs;
the teacher model data set partitioning and network training framework forming step S2 uses a hybrid method of stratified sampling and k-fold cross-validation, specifically comprising:
S21: after preprocessing, equally dividing all data within the task batch number into k+1 sets, of which 1 set serves as a test set and the data of the remaining sets serve as a training set;
S22: computing from the training set the overall proportion of the two sample classes, click and no-click, and then dividing the training set into k sets such that the proportion of click and no-click samples drawn from each set is consistent with the overall proportion;
S23: selecting in turn one of the k sets as a validation set and the remaining k-1 sets as training sets, forming k validation/training set pairs; using the k pairs in turn to train the initialized teacher model, validating each training result with the corresponding validation set, and testing with the test set to obtain k groups of test results; wherein the teacher model is a residual neural network;
S24: averaging the k groups of test results to obtain their mean;
the student model data set partitioning and network training framework forming step S3 comprises:
S31: equally dividing all data within the task batch number into k+1 sets, of which 1 set serves as a validation set and the remaining data serve as a training set;
S32: adopting a neural network as the student model Net-S, wherein the student model Net-S comprises an input layer, M fully connected hidden layers and an output layer;
the prediction model building step S4 comprises:
step S41: providing an initialized knowledge distillation-based training model, wherein the training model comprises a teacher model Net-T training channel, a student model Net-S training channel and an output module;
step S42: training the teacher model Net-T on the data set according to the teacher model's data set partition and network training framework, taking the average of the k-fold cross-validation results as the final classifier, and obtaining soft labels;
step S43: training the student model Net-S on the data set according to the student model Net-S's data set partition and network training framework, obtaining the hard prediction by training at a lower temperature t, the hard prediction and the true hard label forming the hard loss function L_hard;
step S44: distilling the knowledge of the teacher model Net-T into the student model Net-S at a higher temperature T, i.e., training at the high temperature T to obtain the soft prediction, which together with the soft label forms the soft loss function L_soft;
step S45: weighting the soft and hard loss functions to obtain the total loss function Loss, i.e.
Loss = α·L_soft + β·L_hard, where α is the weight of the soft loss function L_soft and β is the weight of the hard loss function L_hard;
step S46: using the total loss function as the objective function of the student model Net-S for actual deployment, training to obtain the student model Net-S with optimized parameters, and using the finally optimized student model Net-S as the knowledge distillation-based marketing campaign prediction model.
2. The knowledge distillation-based marketing campaign prediction method of claim 1, wherein the overall architecture of the teacher model Net-T network in the teacher model training channel comprises:
an input layer, for inputting the data obtained by partitioning the teacher model Net-T's data set;
an embedding layer, for extracting information from and reducing the dimensionality of the data features input from the input layer;
a product layer, for performing outer-product and inner-product feature interactions on the features processed by the embedding layer;
a factorization layer, for factorizing the weight matrix after feature interaction;
a fully connected block comprising N hidden layers, wherein the hidden layers may take four network shapes: increasing, constant, diamond, or decreasing; and N is greater than M;
an output layer, which outputs the predicted probability via a sigmoid function and, by defining a threshold, forms the binary click/no-click classification, i.e., an output classified as a positive or negative label.
3. The knowledge distillation-based marketing campaign prediction method of claim 1, wherein the student model Net-S comprises two or three fully connected hidden layers.
4. The knowledge distillation-based marketing campaign prediction method of claim 1, wherein, in knowledge distillation, the activation function of the output layer of the teacher model Net-T uses a generalized normalized exponential (softmax) function,
q_i = exp(z_i / T) / Σ_j exp(z_j / T),
and the higher the temperature T, the softer the soft labels obtained by training with the teacher model Net-T.
5. The knowledge distillation-based marketing campaign prediction method of claim 1, wherein, in the knowledge distillation process, the choice of the high temperature T is related to the parameter count of the student model Net-S: when the student model Net-S has fewer parameters, a lower temperature T is chosen; conversely, when it has more parameters, a higher temperature T is chosen.
6. The knowledge-distillation-based marketing campaign prediction method of claim 1, further comprising a marketing campaign prediction step S5, wherein step S5 specifically comprises:
step S51: acquiring a user group targeted for Internet product marketing and the raw user information of that group, and extracting the original feature information from the raw user information; wherein a task batch number denotes a user's raw information within a given date period, and the user's accessed DPIs and DPI access frequencies are measured per task batch number;
step S52: performing One-hot encoding on the original feature information of the task batch number according to the attribution features of the user's mobile phone number; wherein the One-hot encoding comprises:
expanding all distinct DPIs accessed by users into independent features per task batch number, and expanding the DPI access frequency within the task batch number into user-DPI access-frequency relation features over all distinct accessed DPIs;
step S53: applying the established knowledge-distillation-based marketing campaign prediction model; wherein a sigmoid function limits the predicted probability to between 0 and 1, and a defined threshold casts the prediction as the binary classification of clicking or not clicking, i.e. the predicted value of the knowledge-distillation-based marketing campaign prediction model is the user's click willingness.
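As a sketch of the feature expansion in step S52, assuming pandas; the column names and the toy log below are hypothetical:

```python
import pandas as pd

# Hypothetical raw log: one row per (user, task batch, DPI) with a visit count.
raw = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2", "u2"],
    "batch_no": ["b202103", "b202103", "b202103", "b202103"],
    "dpi":      ["dpi_news", "dpi_video", "dpi_news", "dpi_shop"],
    "visits":   [3, 1, 7, 2],
})

# One-hot style expansion: every distinct DPI becomes its own column, and the
# cell value is the user's access frequency for that DPI within the batch
# (0 when the DPI was never visited).
wide = (raw.pivot_table(index=["user_id", "batch_no"], columns="dpi",
                        values="visits", fill_value=0)
           .reset_index())
print(wide)
```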
7. The knowledge-distillation-based marketing campaign prediction method of claim 6, wherein the marketing campaign prediction step S5 further comprises:
step S54: selecting, according to the actual delivery requirements, all or some of the users from the set of users whose predicted click willingness is 1, and carrying out the precision marketing task on them.
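As a toy illustration of this selection step (the scores and the 0.5 cut-off are made up; in practice the threshold comes from the defined click / no-click boundary and the delivery budget):

```python
# Hypothetical predicted click-willingness scores for a user group.
scores = {"u1": 0.83, "u2": 0.41, "u3": 0.95, "u4": 0.66}

# Keep the users classified as willing to click (predicted label 1), ranked
# so a campaign with a limited budget can take only the top of the list.
targets = sorted((u for u, s in scores.items() if s >= 0.5),
                 key=lambda u: -scores[u])
print(targets)   # ['u3', 'u1', 'u4'] -> candidates for the precision campaign
```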
8. The knowledge-distillation-based marketing campaign prediction method of claim 1, further comprising, after step S11, performing an anomaly detection and processing step, a continuous-feature processing step and/or a dimensionality reduction step on the user's raw information; wherein the continuous-feature processing step adjusts the data distribution of the continuous features with the RankGauss method, and the dimensionality reduction step reduces the high-dimensional features with principal component analysis.
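A sketch of these two preprocessing steps, assuming scikit-learn, where `QuantileTransformer` with a normal output distribution is a common stand-in for the RankGauss transform; the synthetic data and the 95% variance cut-off are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 50))   # skewed continuous features

# RankGauss-style transform: rank the values, then map the ranks onto a
# standard normal distribution.
rank_gauss = QuantileTransformer(output_distribution="normal",
                                 n_quantiles=min(1000, len(X)))
X_gauss = rank_gauss.fit_transform(X)

# Principal component analysis to reduce the high-dimensional features,
# keeping enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_low = pca.fit_transform(X_gauss)
print(X_low.shape)
```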
9. The knowledge-distillation-based marketing campaign prediction method of claim 1, further comprising a step S47 of performing model evaluation index processing and tuning on the knowledge-distillation-based marketing campaign prediction model; wherein the model evaluation indexes comprise the AUC value, the Log loss value, and the relative information gain (RIG) value.
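A sketch of computing the three indexes, assuming scikit-learn for AUC and Log loss; the patent does not spell out the RIG formula, so the snippet uses the common definition of one minus the ratio of the model's log loss to the entropy of always predicting the base click rate, on made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_pred = np.array([0.1, 0.8, 0.3, 0.7, 0.9, 0.2, 0.4, 0.6])

auc = roc_auc_score(y_true, y_pred)
ll = log_loss(y_true, y_pred)

# Relative information gain: improvement of the model's log loss over the
# entropy of the base click rate.
p = y_true.mean()
baseline_entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
rig = 1.0 - ll / baseline_entropy
print(f"AUC={auc:.3f}  LogLoss={ll:.3f}  RIG={rig:.3f}")
```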
10. The knowledge-distillation-based marketing campaign prediction method of claim 9, wherein the model tuning process comprises one or more of the following (a combined sketch follows this list):
adding batch normalization to alleviate the internal covariate shift of the data;
adding dropout, i.e. a mechanism that puts a portion of the neurons into a dormant state during training;
adjusting the learning rate, typically through strategies such as exponential decay during training;
averaging over multiple training runs to better address the insufficient generalization caused by large data variance;
adding L1 or L2 regularization, applying a penalty to the loss function to reduce the risk of overfitting;
and optimizing the hyper-parameters.
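A minimal PyTorch sketch combining several of these tricks (all constants are illustrative, and the training pass itself is elided):

```python
import torch
import torch.nn as nn

# Student-style network with the first two tuning tricks baked in.
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),   # batch normalization against internal covariate shift
    nn.ReLU(),
    nn.Dropout(p=0.3),    # randomly "sleeps" 30% of neurons during training
    nn.Linear(64, 1),
)

# L2 regularization via weight decay, plus an exponentially decaying
# learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(10):
    # ... the forward/backward pass over the training batches goes here ...
    scheduler.step()      # decay the learning rate once per epoch
```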
CN202110235391.5A 2021-03-03 2021-03-03 Marketing activity prediction model structure and prediction method based on knowledge distillation Withdrawn CN112967088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110235391.5A CN112967088A (en) 2021-03-03 2021-03-03 Marketing activity prediction model structure and prediction method based on knowledge distillation


Publications (1)

Publication Number Publication Date
CN112967088A (en) 2021-06-15

Family

ID=76276304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110235391.5A Withdrawn CN112967088A (en) 2021-03-03 2021-03-03 Marketing activity prediction model structure and prediction method based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN112967088A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880A (en) * 2019-09-27 2020-01-10 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN111950806A (en) * 2020-08-26 2020-11-17 上海数鸣人工智能科技有限公司 Marketing activity prediction model structure and prediction method based on factorization machine
CN112116030A (en) * 2020-10-13 2020-12-22 浙江大学 Image classification method based on vector standardization and knowledge distillation
CN112258223A (en) * 2020-10-13 2021-01-22 上海数鸣人工智能科技有限公司 Marketing advertisement click prediction method based on decision tree
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CODINGLING: "How to tune deep-learning parameters: these 12 tricks - Zhihu", pages 1 - 2, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/52189739> *
JIEMING ZHU et al.: "Ensembled CTR Prediction via Knowledge Distillation", CIKM '20, pages 2941 - 2948 *
TOTOTO: "What is RankGauss? - Tototo's blog", pages 1 - 7, Retrieved from the Internet <URL:https://tsumit.hatenablog.com/entry/2020/06/20/044835> *
LIU ZHENPENG: "HRS-DC: a hybrid recommendation model based on deep learning", Computer Engineering and Applications, pages 169 - 174 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378866A (en) * 2021-08-16 2021-09-10 深圳市爱深盈通信息技术有限公司 Image classification method, system, storage medium and electronic device
CN114037653B (en) * 2021-09-23 2024-08-06 上海仪电人工智能创新院有限公司 Industrial machine vision defect detection method and system based on two-stage knowledge distillation
CN114037653A (en) * 2021-09-23 2022-02-11 上海仪电人工智能创新院有限公司 Industrial machine vision defect detection method and system based on two-stage knowledge distillation
CN114022811A (en) * 2021-10-29 2022-02-08 长视科技股份有限公司 Water surface floater monitoring method and system based on continuous learning
CN114139724A (en) * 2021-11-30 2022-03-04 支付宝(杭州)信息技术有限公司 Method and device for training gain model
CN114139724B (en) * 2021-11-30 2024-08-09 支付宝(杭州)信息技术有限公司 Training method and device for gain model
CN114331531A (en) * 2021-12-28 2022-04-12 上海数鸣人工智能科技有限公司 Prediction method of WaveNet technology for individual behavior insight based on simulated annealing thought
CN114677673A (en) * 2022-03-30 2022-06-28 中国农业科学院农业信息研究所 Potato disease identification method based on improved YOLO V5 network model
CN115170919A (en) * 2022-06-29 2022-10-11 北京百度网讯科技有限公司 Image processing model training method, image processing device, image processing equipment and storage medium
CN115170919B (en) * 2022-06-29 2023-09-12 北京百度网讯科技有限公司 Image processing model training and image processing method, device, equipment and storage medium
CN115147376A (en) * 2022-07-06 2022-10-04 南京邮电大学 Skin lesion intelligent identification method based on deep Bayesian distillation network
CN115271272A (en) * 2022-09-29 2022-11-01 华东交通大学 Click rate prediction method and system for multi-order feature optimization and mixed knowledge distillation
CN115271272B (en) * 2022-09-29 2022-12-27 华东交通大学 Click rate prediction method and system for multi-order feature optimization and mixed knowledge distillation
CN116911956A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Recommendation model training method and device based on knowledge distillation and storage medium
CN117057852A (en) * 2023-10-09 2023-11-14 北京光尘环保科技股份有限公司 Internet marketing system and method based on artificial intelligence technology
CN117057852B (en) * 2023-10-09 2024-01-26 头流(杭州)网络科技有限公司 Internet marketing system and method based on artificial intelligence technology

Similar Documents

Publication Publication Date Title
CN112967088A (en) Marketing activity prediction model structure and prediction method based on knowledge distillation
CN114117220B (en) Deep reinforcement learning interactive recommendation system and method based on knowledge enhancement
WO2022161202A1 (en) Multimedia resource classification model training method and multimedia resource recommendation method
CN111104595B (en) Deep reinforcement learning interactive recommendation method and system based on text information
CN108829763B (en) Deep neural network-based attribute prediction method for film evaluation website users
CN113344615B (en) Marketing campaign prediction method based on GBDT and DL fusion model
Alshmrany Adaptive learning style prediction in e-learning environment using levy flight distribution based CNN model
CN113297936B (en) Volleyball group behavior identification method based on local graph convolution network
CN113255844B (en) Recommendation method and system based on graph convolution neural network interaction
CN112819523B (en) Marketing prediction method combining inner/outer product feature interaction and Bayesian neural network
Wang et al. Learning performance prediction via convolutional GRU and explainable neural networks in e-learning environments
CN112288554B (en) Commodity recommendation method and device, storage medium and electronic device
CN113590965B (en) Video recommendation method integrating knowledge graph and emotion analysis
CN112257841A (en) Data processing method, device and equipment in graph neural network and storage medium
CN110110372B (en) Automatic segmentation prediction method for user time sequence behavior
CN113591971B (en) User individual behavior prediction method based on DPI time sequence word embedded vector
CN113689234B (en) Platform-related advertisement click rate prediction method based on deep learning
Chen et al. Poverty/investment slow distribution effect analysis based on Hopfield neural network
CN115310004A (en) Graph nerve collaborative filtering recommendation method fusing project time sequence relation
CN115689639A (en) Commercial advertisement click rate prediction method based on deep learning
Ma Artificial intelligence-driven education evaluation and scoring: Comparative exploration of machine learning algorithms
CN113360772A (en) Interpretable recommendation model training method and device
CN112581177A (en) Marketing prediction method combining automatic feature engineering and residual error neural network
Zhang et al. An interpretable neural model with interactive stepwise influence
CN115098787B (en) Article recommendation method based on cosine ranking loss and virtual edge map neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200436 room 406, 1256 and 1258 Wanrong Road, Jing'an District, Shanghai

Applicant after: Shanghai Shuming Artificial Intelligence Technology Co.,Ltd.

Address before: Room 1601-026, 238 JIANGCHANG Third Road, Jing'an District, Shanghai, 200436

Applicant before: Shanghai Shuming Artificial Intelligence Technology Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20210615
