CN111538761A

CN111538761A - Click rate prediction method based on attention mechanism

Info

Publication number: CN111538761A
Application number: CN202010317646.8A
Authority: CN
Inventors: 邓晓衡; 刘良知; 李海霞; 刘梦杰
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2020-08-14

Abstract

The invention provides a click rate prediction method based on an attention mechanism, comprising the following steps: step 1, preprocessing user features and applying one-hot encoding to user features of the same type to obtain high-dimensional sparse feature vectors; step 2, reducing the dimension of the high-dimensional sparse feature vectors through an embedding layer, and feeding the reduced feature vectors, as the input vectors of the click rate model, into a compressed interaction network and a deep neural network respectively; and step 3, taking the Hadamard product of the initial input feature vectors and the input vectors of each hidden layer and using the result as the input of the next hidden layer, so that each additional hidden layer raises the order of the feature combinations by one. The method jointly considers the low-dimensional features, the explicit high-dimensional features and the implicit high-dimensional features of the user, screens useful feature combinations through a self-attention mechanism, improves prediction efficiency, requires no manual feature extraction, and can extract high-dimensional feature combinations.

Description

Click rate prediction method based on attention mechanism

Technical Field

The invention relates to the technical field of internet application, in particular to a click rate prediction method based on an attention mechanism.

Background

With the explosive growth of internet information, the field of computer science, especially artificial intelligence technology, has made great progress. As a branch of computer science and applied science, artificial intelligence mainly studies how to use machines to simulate, extend and expand the mental processes of the human brain (such as memory, learning, reasoning and decision making). At present, artificial intelligence technology is successfully applied in fields such as automatic driving, medical diagnosis, language recognition, image recognition and financial big data.

Although the industry has studied click rate estimation in depth, existing models still face problems such as large data volumes and sparse data. The industry tends to use shallow models to address these problems, yet such models are difficult to train, difficult to deploy in a production environment, and weak in interpretability. Shallow models focus on constructing explicit combined features, through manual feature construction and simple operations among features, to improve the performance of the click rate estimation model; they do not deeply mine implicit information in the data, such as implicit combined features and the highly nonlinear relationships inherent in the features. The advertisement click rate estimation problem is therefore of great research significance. The algorithms most widely applied at present are the GBDT + LR model and the Wide & Deep model. However, these models require manual feature extraction and cannot extract high-dimensional feature combinations. Some models that extract features automatically, such as the DeepFM model, learn features implicitly, which easily leads to excessively high dimensionality. Although the Deep & Cross model can address this problem, its interactions take place at the element level and cannot represent vector-level feature interactions well.

Disclosure of Invention

The invention provides a click rate prediction method based on an attention mechanism, and aims to solve the problems that a traditional model needs manual feature extraction, high-dimensionality feature combination cannot be extracted, and dimensionality is easily overhigh.

In order to achieve the above object, an embodiment of the present invention provides a click rate prediction method based on an attention mechanism, including:

step 1, preprocessing user features and applying one-hot encoding to user features of the same type to obtain high-dimensional sparse feature vectors;

step 2, reducing the dimension of the high-dimensional sparse feature vectors through an embedding layer, and feeding the reduced feature vectors, as the input vectors of the click rate model, into a compressed interaction network and a deep neural network respectively;

step 3, taking the Hadamard product of the initial input feature vectors and the input vectors of each hidden layer and using the result as the input of the next hidden layer, so that each additional hidden layer raises the order of the feature combinations by one;

step 4, passing the result vectors obtained at each layer through an attention mechanism to obtain useful combined features, and applying sum pooling to the combined features;

step 5, compressing and concatenating the pooled result and the result obtained by the deep neural network into a new feature vector, and feeding the new feature vector into an output layer to obtain the predicted value.

Wherein, the step 1 specifically comprises:

collecting a data set $X = \{x_1, x_2, \ldots, x_N\}$ of user features, where $N$ is the total number of training samples, $x_i \in \{x_1, x_2, \ldots, x_N\}$, and $x_i$ represents the $i$-th user feature data to be processed.

Wherein, the step 1 further comprises:

the user features are converted into a high-dimensional sparse feature vector using one-hot encoding.

Wherein, the step 2 specifically comprises:

the high-dimensional sparse features are converted into low-dimensional combined features through an embedding layer, and the sparse vectors are mapped to relatively dense vectors whose elements are non-zero.

Wherein, the step 2 further comprises:

processing the raw data into data with a mean of 0 and a variance of 1 by a normalization method, wherein the normalized data is denoted $x_{norm}$ and is calculated as follows:

$$x_{norm} = \frac{x - \mu}{\sigma}$$

where $x$ denotes continuous-valued data, $\mu$ denotes the mean of the original data, and $\sigma$ denotes the standard deviation of the original data.
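The normalization of continuous values can be illustrated with a minimal sketch. It assumes the usual z-score form (subtract the mean, divide by the population standard deviation); the function name `standardize` and the sample values are made up for illustration and do not come from the patent.

```python
import math


def standardize(values):
    """Z-score normalization: (x - mean) / standard deviation."""
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (divide by n), an assumption of this sketch.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


# Made-up continuous feature values with mean 5.0 and standard deviation 2.0.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normed = standardize(raw)
print(normed)
```

After this transformation the values have mean 0 and variance 1, which is the property the method asks for.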

Wherein, the step 3 specifically comprises:

the feature vectors obtained from the embedding layer are concatenated into an $m \times d$ matrix $x^0$, where $m$ is the number of feature vectors and $d$ is the dimension of each feature vector; $x^k$ represents the state of the $k$-th hidden layer in the compressed interaction network,

$$x^k \in \mathbb{R}^{H_k \times d}$$

is a matrix in which $H_k$ represents the number of compressed feature vectors in the $k$-th hidden layer; the feature embedding layer is regarded as the 0-th hidden layer, so $H_0 = m$. The state of each hidden layer $k$ in the compressed interaction network is calculated as:

$$x^k_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij} \left( x^{k-1}_{i,*} \circ x^0_{j,*} \right)$$

wherein $1 \le h \le H_k$,

$$W^{k,h} \in \mathbb{R}^{H_{k-1} \times m}$$

is the parameter matrix for the $h$-th feature vector, and $\circ$ represents the Hadamard product, i.e. the element-wise product of two vectors. $x^k$ is obtained through explicit interaction between $x^{k-1}$ and $x^0$, so the interaction order of $x^k$ is one higher than that of $x^{k-1}$, and the maximum order of the obtained feature interactions increases by 1 each time a hidden layer is added to the compressed interaction network.
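A minimal sketch of one hidden-layer update in the compressed interaction network follows. It assumes that row $h$ of $x^k$ is a weighted sum of the Hadamard products of every row of $x^{k-1}$ with every row of $x^0$; the function names, the toy weights and the toy inputs are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def cin_layer(x_prev, x0, weights):
    """One compressed-interaction step.

    x_prev:  H_{k-1} rows of length d (state of the previous hidden layer)
    x0:      m rows of length d (embedding-layer output)
    weights: H_k matrices, each of shape H_{k-1} x m
    returns: H_k rows of length d
    """
    d = len(x0[0])
    x_next = []
    for w_h in weights:
        row = [0.0] * d
        for i, prev_vec in enumerate(x_prev):
            for j, base_vec in enumerate(x0):
                inter = hadamard(prev_vec, base_vec)
                for t in range(d):
                    row[t] += w_h[i][j] * inter[t]
        x_next.append(row)
    return x_next


# Toy input: m = 2 feature vectors of dimension d = 2.
x0 = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[[1.0, 0.0], [0.0, 1.0]]]  # H_1 = 1, one 2 x 2 weight matrix
w2 = [[[1.0, 1.0]]]              # H_2 = 1, one 1 x 2 weight matrix

x1 = cin_layer(x0, x0, w1)  # second-order interactions
x2 = cin_layer(x1, x0, w2)  # third-order interactions
print(x1, x2)
```

Each call multiplies the previous state by $x^0$ once more, which is why stacking one more layer raises the maximum interaction order by one.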

Wherein, the step 4 specifically comprises:

the results of the vector interactions at each layer are passed through a self-attention mechanism, which assigns different weights to different interaction vectors, and the weighted results are sum-pooled to obtain the high-dimensional interaction result.
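The text does not give the attention formula, so the following is only a sketch under an assumed form: each interaction vector is scored by a dot product with a learnable vector (here called `query`, a hypothetical name), the scores are turned into weights with a softmax, and the weighted vectors are summed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_sum_pool(vectors, query):
    """Weight each interaction vector by attention, then sum them.

    The dot-product scoring with `query` is an assumption of this sketch,
    not a formula stated in the source.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    d = len(vectors[0])
    pooled = [sum(w * vec[t] for w, vec in zip(weights, vectors)) for t in range(d)]
    return pooled, weights


layer_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy interaction vectors
pooled, weights = attention_sum_pool(layer_output, [0.0, 0.0])
focused, focused_weights = attention_sum_pool(layer_output, [2.0, 0.0])
print(pooled, weights)
```

With a zero `query` all vectors receive equal weight; a non-zero `query` shifts weight toward the vectors it aligns with, which is the screening behaviour the step describes.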

Wherein, the step 5 specifically comprises:

the embedding-layer vectors are fed into a deep neural network to obtain the result of multi-layer interaction; the result obtained by the deep neural network and the result obtained by the compressed interaction network are compressed and concatenated into a new matrix, which is fed into a single-layer perceptron to obtain the final result. The output is given by:

$$\hat{y} = \sigma\left( w_{linear}^{T} x_f + w_{dnn}^{T} y_{dnn} + w_{cin}^{T} y_{cin} + b \right)$$

where $\sigma$ is the sigmoid function, $x_f$ is the original feature, $y_{dnn}$ is the output of the DNN output layer, $y_{cin}$ is the output of the CIN output layer, $w_{linear}$, $w_{dnn}$ and $w_{cin}$ are the weights of the linear regression, the DNN output layer and the CIN output layer respectively, and $b$ is a learnable parameter.
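As a rough illustration of the output layer, the sketch below assumes the prediction is a sigmoid applied to the sum of a linear term on the original features, a term on the DNN output, a term on the CIN output and a bias. The parameter names (`w_linear`, `w_dnn`, `w_cin`, `y_dnn`) and the numeric values are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(w, x))


def predict(x_f, y_dnn, y_cin, w_linear, w_dnn, w_cin, b):
    """Combine the linear, DNN and CIN parts and squash with a sigmoid."""
    z = dot(w_linear, x_f) + dot(w_dnn, y_dnn) + dot(w_cin, y_cin) + b
    return sigmoid(z)


# Hypothetical toy values.
x_f = [1.0, 0.0]
y_dnn = [0.5, -0.5]
y_cin = [2.0]
neutral = predict(x_f, y_dnn, y_cin, [0.0, 0.0], [0.0, 0.0], [0.0], 0.0)
positive = predict(x_f, y_dnn, y_cin, [0.0, 0.0], [0.0, 0.0], [1.0], 0.0)
print(neutral, positive)
```

The result can be read as a click probability: all-zero weights give 0.5, and a positive contribution from any of the three parts pushes it above 0.5.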

Wherein, the step 5 further comprises:

the weight parameters of the model are continuously updated through the loss function and gradient descent; the loss function is:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

wherein $\hat{y}_i$ represents the value predicted by the model, $y_i$ represents the true value of the actual data, and $N$ is the total number of training instances. The optimization process minimizes the following objective function:

$$J = L + \lambda \lVert \theta \rVert_2^2$$

where $\lambda$ represents the regularization coefficient and $\theta$ represents the parameter set, including the parameters of the linear part, the CIN part and the DNN part.
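The training objective can be sketched as follows, assuming the loss is the binary log loss that is standard for click prediction and that the regularization term is a squared L2 penalty on the parameters. Both choices are assumptions of this sketch; the function names and numbers are illustrative.

```python
import math


def log_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def objective(y_true, y_pred, params, lam):
    """Log loss plus lam times the squared L2 norm of the parameters."""
    penalty = sum(w * w for w in params)
    return log_loss(y_true, y_pred) + lam * penalty


labels = [1, 0]
predictions = [0.5, 0.5]
loss = log_loss(labels, predictions)
total = objective(labels, predictions, params=[1.0, 2.0], lam=0.1)
print(loss, total)
```

Gradient descent would then update the linear, CIN and DNN parameters in the direction that lowers this value.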

The scheme of the invention has the following beneficial effects:

according to the click rate prediction method based on the attention mechanism, the dense vectors output by the embedding layer interact in a manner similar to a residual network, the results of the multiple interactions are sum-pooled through the attention mechanism, and the result of the deep neural network and the result of the compressed interaction network are concatenated into a new vector from which the output is obtained, making the prediction more accurate and reliable. The method jointly considers the low-dimensional features, the explicit high-dimensional features and the implicit high-dimensional features of the user and screens useful feature combinations through the attention mechanism, which improves prediction efficiency; it extracts high-dimensional feature combinations without manual feature extraction and is not prone to excessively high dimensionality.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a model architecture diagram of the present invention;

FIG. 3 is a schematic diagram of each layer of the interactive network of the present invention;

FIG. 4 is a schematic diagram of the self-attention sum pooling of the present invention;

FIG. 5 is a graph showing the results of the experiment according to the present invention;

FIG. 6 is a schematic diagram illustrating the influence of different numbers of network layers on the experimental results.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The invention provides a click rate prediction method based on an attention mechanism, aiming at the problems that the existing model needs manual feature extraction, cannot extract feature combinations with high dimensionality and easily causes overhigh dimensionality.

As shown in FIG. 1 to FIG. 6, an embodiment of the present invention provides a click rate prediction method based on an attention mechanism, including: step 1, preprocessing user features and applying one-hot encoding to user features of the same type to obtain high-dimensional sparse feature vectors; step 2, reducing the dimension of the high-dimensional sparse feature vectors through an embedding layer, and feeding the reduced feature vectors, as the input vectors of the click rate model, into a compressed interaction network and a deep neural network respectively; step 3, taking the Hadamard product of the initial input feature vectors and the input vectors of each hidden layer and using the result as the input of the next hidden layer, so that each additional hidden layer raises the order of the feature combinations by one; step 4, passing the result vectors obtained at each layer through an attention mechanism to obtain useful combined features, and applying sum pooling to the combined features; and step 5, compressing and concatenating the pooled result and the result obtained by the deep neural network into a new feature vector, and feeding the new feature vector into an output layer to obtain the predicted value.

Wherein, the step 1 specifically comprises: collecting a data set $X = \{x_1, x_2, \ldots, x_N\}$ of user features, where $N$ is the total number of training samples, $x_i \in \{x_1, x_2, \ldots, x_N\}$, and $x_i$ represents the $i$-th user feature data to be processed.

Wherein, the step 1 further comprises: the user features are converted into a high-dimensional sparse feature vector using one-hot encoding.

In the click rate prediction method based on the attention mechanism according to the above embodiment of the present invention, one-hot encoding is relatively simple: N states are encoded using an N-bit state register. For example, if the basic information of a user is user = [user ID = 02, gender = male, interest = rock and roll], the vector obtained according to the definition of one-hot encoding consists of 0s and 1s, such as user = [0, 1, 0, …, 0] [1, 0] [0, 1, 0, …, 0].
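The user example can be reproduced with a short sketch. The vocabularies below (four user IDs, two genders, three interests) are hypothetical, since the text only shows the shape of the result.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec


# Hypothetical vocabularies for each field.
user_ids = ["01", "02", "03", "04"]
genders = ["male", "female"]
interests = ["pop", "rock and roll", "jazz"]

user = {"user_id": "02", "gender": "male", "interest": "rock and roll"}
encoded = [
    one_hot(user["user_id"], user_ids),
    one_hot(user["gender"], genders),
    one_hot(user["interest"], interests),
]
print(encoded)  # [[0, 1, 0, 0], [1, 0], [0, 1, 0]]
```

With realistic vocabularies (millions of user IDs) each field vector becomes very long and almost entirely zero, which is the sparsity the next step addresses.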

Wherein, the step 2 specifically comprises: the high-dimensional sparse features are converted into low-dimensional combined features through an embedding layer, and the sparse vectors are mapped to relatively dense vectors whose elements are non-zero.
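A minimal sketch of the embedding step, assuming the embedding layer is a table with one dense row per one-hot position: multiplying a one-hot vector by the table is then the same as looking up a single row. The sizes, the random initial values and the function name are illustrative assumptions.

```python
import random

random.seed(0)

vocab_size, d = 5, 3  # hypothetical vocabulary size and embedding dimension
# Embedding table with small random initial values; training would refine them.
embedding = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(vocab_size)]


def embed_one_hot(one_hot_vec, table):
    """Multiply a one-hot row vector by the embedding table."""
    dim = len(table[0])
    return [
        sum(one_hot_vec[i] * table[i][j] for i in range(len(table)))
        for j in range(dim)
    ]


one_hot_vec = [0, 0, 1, 0, 0]
dense = embed_one_hot(one_hot_vec, embedding)
lookup = embedding[one_hot_vec.index(1)]  # equivalent direct row lookup
print(dense)
```

In practice the lookup form is used, because it avoids materializing the sparse vector at all.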

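Wherein, the step 2 further comprises: processing the raw data into data with a mean of 0 and a variance of 1 by a normalization method, wherein the normalized data is denoted $x_{norm}$ and is calculated as follows:

$$x_{norm} = \frac{x - \mu}{\sigma}$$

where $x$ denotes continuous-valued data, $\mu$ denotes the mean of the original data, and $\sigma$ denotes the standard deviation of the original data.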

In the click rate prediction method based on the attention mechanism, because the feature dimension produced by one-hot encoding is too high, an embedding layer is used to convert the features into low-dimensional combined features, mapping sparse vectors to relatively dense vectors whose elements are non-zero. The embedding vectors are initialized with random numbers and updated iteratively through gradient descent, finally yielding accurate embedding values. Continuous values need to be normalized: the original data is processed into data with a mean of 0 and a variance of 1. This normalization method can change the distribution of the original data, is insensitive to abnormal values, and is suitable for big-data scenarios.

The feature vectors obtained from the embedding layer are concatenated into an $m \times d$ matrix $x^0$, where $m$ is the number of feature vectors and $d$ is the dimension of each feature vector; $x^k \in \mathbb{R}^{H_k \times d}$ represents the state of the $k$-th hidden layer in the compressed interaction network, and is calculated as:

$$x^k_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij} \left( x^{k-1}_{i,*} \circ x^0_{j,*} \right)$$

wherein $1 \le h \le H_k$,

$$W^{k,h} \in \mathbb{R}^{H_{k-1} \times m}$$

is the parameter matrix for the $h$-th feature vector, and $\circ$ represents the Hadamard product, i.e. the element-wise product of two vectors. $x^k$ is obtained through explicit interaction between $x^{k-1}$ and $x^0$, so the interaction order of $x^k$ is one higher than that of $x^{k-1}$, and the maximum order of the obtained feature interactions increases by 1 each time a hidden layer is added to the compressed interaction network.
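To make the growth in interaction order concrete, the first two hidden layers can be written out. This is a worked sketch under the assumption that row $h$ of layer $k$ is a weighted sum, with weights $W^{k,h}_{ij}$, of the Hadamard products of rows of $x^{k-1}$ with rows of $x^0$, and that $H_0 = m$:

```latex
\begin{aligned}
x^1_{h,*} &= \sum_{i=1}^{m}\sum_{j=1}^{m} W^{1,h}_{ij}\,\bigl(x^0_{i,*} \circ x^0_{j,*}\bigr)
  && \text{(products of two embedding vectors: order 2)} \\
x^2_{h,*} &= \sum_{i=1}^{H_1}\sum_{j=1}^{m} W^{2,h}_{ij}\,\bigl(x^1_{i,*} \circ x^0_{j,*}\bigr)
  && \text{(an order-2 term times one more embedding vector: order 3)}
\end{aligned}
```

By induction, layer $k$ contains products of $k + 1$ embedding vectors, so each added hidden layer raises the maximum order by exactly one.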

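In the click rate prediction method based on the attention mechanism according to the above embodiment of the present invention, the Hadamard product is the element-wise product of two vectors; for example, $\langle a_1, a_2, a_3 \rangle \circ \langle b_1, b_2, b_3 \rangle = \langle a_1 b_1, a_2 b_2, a_3 b_3 \rangle$.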

Wherein, the step 4 specifically comprises: the results of the vector interactions at each layer are passed through a self-attention mechanism, which assigns different weights to different interaction vectors, and the weighted results are sum-pooled to obtain the high-dimensional interaction result.

In the click rate prediction method based on the attention mechanism according to the embodiment of the present invention, because vector interaction has high time complexity, the self-attention mechanism assigns different weights to different interaction vectors in the result of each layer's vector interaction, which can save a large amount of time.

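Wherein, the step 5 specifically comprises: the embedding-layer vectors are fed into a deep neural network to obtain the result of multi-layer interaction; the result obtained by the deep neural network and the result obtained by the compressed interaction network are compressed and concatenated into a new matrix, which is fed into a single-layer perceptron to obtain the final result. The output is given by:

$$\hat{y} = \sigma\left( w_{linear}^{T} x_f + w_{dnn}^{T} y_{dnn} + w_{cin}^{T} y_{cin} + b \right)$$

where $\sigma$ is the sigmoid function, $x_f$ is the original feature, $y_{dnn}$ and $y_{cin}$ are the outputs of the DNN and CIN output layers, $w_{linear}$, $w_{dnn}$ and $w_{cin}$ are the corresponding weights, and $b$ is a learnable parameter.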

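Wherein, the step 5 further comprises: the weight parameters of the model are continuously updated through the loss function and gradient descent; the loss function is:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

wherein $\hat{y}_i$ represents the value predicted by the model, $y_i$ represents the true value of the actual data, and $N$ is the total number of training instances.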

In the click rate prediction method based on the attention mechanism according to the embodiment of the present invention, the experimental part of model training and prediction uses public industry data sets: the Criteo data set for large-scale advertisement click-through rate prediction and the Frappe data set for context-based app recommendation. The Criteo data set contains 7 consecutive days of user behavior logs, 11 GB in total, with about 41 million historical records. Each training sample comprises 39 features from different fields, of which I1 to I13 are continuous-valued anonymous features and C1 to C26 are discrete-valued anonymous features. These desensitized anonymous features mainly comprise user features, item features and environment features, and the specific meaning of each field is not disclosed. The other data set is the Frappe data set for app recommendation. Besides the user ID and the item ID, each log contains 8 contextual categorical features such as weather, city and time; the 10 fields C1 to C10 in the Frappe data set are all categorical features, with no numerical features. The Frappe data set is relatively small, with 288609 training samples in total.
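The Criteo field layout can be illustrated with a small parsing sketch. It assumes the commonly distributed tab-separated form of the data set, with a click label followed by 13 continuous fields and 26 categorical fields and with empty strings for missing values; the file format, the function name and the sample line are assumptions of this sketch.

```python
def parse_criteo_line(line):
    """Split one tab-separated record into label, 13 continuous and 26 categorical fields."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    # Missing values appear as empty strings and are mapped to None here.
    continuous = [float(v) if v else None for v in fields[1:14]]
    categorical = [v if v else None for v in fields[14:40]]
    return label, continuous, categorical


# Made-up record: label 1, thirteen numbers, twenty-six category tokens.
sample = "\t".join(["1"] + [str(i) for i in range(13)] + ["c%d" % i for i in range(26)])
label, continuous, categorical = parse_criteo_line(sample)
print(label, len(continuous), len(categorical))
```

The continuous fields would then be normalized and the categorical fields one-hot encoded and embedded.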

For both the Criteo data set and the Frappe data set, 1/10 of the data was randomly selected as the validation set, and the remaining data was used as the training set. The click rate prediction method based on the attention mechanism is implemented based on Tensorflow3+ python3.6, and an optimal set of hyper-parameters is found for each model by grid search on the validation set. The optimizer is Adam, the learning rate is 0.001, the batch size is 4096, and L2 regularization is applied with a coefficient of 0.0001. The default number of hidden nodes is 400 for the DNN output layer; for the CIN output layer it is 200 on the Criteo data set and 100 on the Frappe data set. For the CrossNet in Deep & Cross and for the CIN model, because the data sets differ, experiments are performed with varying hidden-layer depths, and the best experimental result of each model is compared.
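The experimental setup can be summarized in a short sketch. The hyper-parameter values are the ones stated in the text; the dictionary keys, the function name, the fixed seed and the placeholder samples are illustrative assumptions.

```python
import random

# Values from the experimental description; key names are hypothetical.
HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "batch_size": 4096,
    "l2_coefficient": 0.0001,
    "dnn_hidden_nodes": 400,
    "cin_hidden_nodes": {"Criteo": 200, "Frappe": 100},
}


def split_train_validation(samples, validation_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the samples as the validation set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = round(len(samples) * validation_fraction)
    val_idx = set(indices[:n_val])
    train = [s for i, s in enumerate(samples) if i not in val_idx]
    validation = [s for i, s in enumerate(samples) if i in val_idx]
    return train, validation


samples = list(range(1000))  # placeholder records
train, validation = split_train_validation(samples)
print(len(train), len(validation))
```

A grid search would repeat training over candidate settings and keep the one that scores best on `validation`.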

As shown in FIG. 5, the click rate prediction method based on the attention mechanism (Ours) is compared with the experimental results of other models. The LR model performs worst of all the models, because it can only handle simple, low-dimensional feature combinations; this indicates that extracting implicit features from sparse data through deep learning is very necessary. Other models trained through deep learning, such as PNN, Wide & Deep, DeepFM and Deep & Cross, perform better than the FM model. This shows that real data features are generally very complex: the FM model can only handle second-order feature combinations and cannot handle combinations of third order and above well, so it is not very effective at high-dimensional feature interaction. The hybrid DeepFM and Deep & Cross models perform better than the PNN model, which considers only high-dimensional features, indicating that low-dimensional and high-dimensional interaction features need to be considered simultaneously. The Wide & Deep model performs worse than the PNN model because its feature combinations are still constructed manually. The prediction result of the click rate prediction method based on the attention mechanism is better than those of the three hybrid models Wide & Deep, DeepFM and Deep & Cross, which indicates the need to further subdivide the explicit features into high-dimensional and low-dimensional features; combining these with the training of implicit high-dimensional features (the features learned by the DNN) achieves a certain improvement.
Compared with the dozens of layers used in computer vision networks, the network of the click rate prediction model based on the attention mechanism is not particularly deep, and about 3 layers suffice to achieve a good effect. As can be seen from FIG. 6, the model's results improve as the number of network layers increases up to 3 and deteriorate when the number of layers exceeds 3, which indicates that an overly deep network trains worse and is prone to overfitting.

The click rate prediction method based on the attention mechanism according to the above embodiment of the present invention maps user features of the same type into high-dimensional sparse vectors through one-hot encoding, converts them into low-dimensional dense vectors through the embedding layer, and feeds the feature vectors into the compressed interaction network and the deep neural network respectively. In the compressed interaction network, the element-wise product of the initial input and the input of each hidden layer yields the input of the next layer, and the input vector of each hidden layer is obtained through the calculation of multiple hidden layers. The attention mechanism computes weights for the input vector of each hidden layer, and sum pooling yields the high-dimensional explicit interaction vector. The result learned by the deep neural network is concatenated with the result of the compressed interaction network, and the output is produced through the activation function. On the public Criteo and Frappe data sets, the method jointly considers the low-dimensional features, the explicit high-dimensional features and the implicit high-dimensional features of the user, screens useful feature combinations through the attention mechanism, improves prediction efficiency, requires no manual feature extraction, can extract high-dimensional feature combinations, and is not prone to excessively high dimensionality. It improves the ability of the wide-and-deep model to extract complex combined features, and, by multiplying at the vector level rather than the element level and incorporating the attention mechanism, it achieves a good prediction effect.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A click rate prediction method based on an attention mechanism is characterized by comprising the following steps:

2. The attention mechanism-based click rate prediction method according to claim 1, wherein the step 1 specifically comprises:

3. The attention mechanism-based click rate prediction method according to claim 2, wherein the step 1 further comprises:

4. The attention mechanism-based click rate prediction method according to claim 3, wherein the step 2 specifically comprises:

5. The attention mechanism-based click rate prediction method of claim 4, wherein the step 2 further comprises:

6. The attention mechanism-based click rate prediction method according to claim 5, wherein the step 3 specifically comprises:

wherein $1 \le h \le H_k$, and $W^{k,h} \in \mathbb{R}^{H_{k-1} \times m}$ is the parameter matrix for the $h$-th feature vector.

7. The attention mechanism-based click rate prediction method according to claim 6, wherein the step 4 specifically comprises:

8. The attention mechanism-based click rate prediction method according to claim 7, wherein the step 5 specifically comprises:

9. The attention mechanism-based click rate prediction method of claim 8, wherein the step 5 further comprises:

wherein,