CN116757747A - Click rate prediction method based on behavior sequence and feature importance - Google Patents

Click rate prediction method based on behavior sequence and feature importance

Info

Publication number
CN116757747A
CN116757747A
Authority
CN
China
Prior art keywords: user, feature, vector, features, interest
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310568077.8A
Other languages
Chinese (zh)
Inventor
王瑛琦 (Wang Yingqi)
季会勤 (Ji Huiqin)
Current Assignee
Henan University
Original Assignee
Henan University
Application filed by Henan University
Priority to CN202310568077.8A
Publication of CN116757747A
Legal status: Pending


Classifications

    • G06Q30/0242: Determining effectiveness of advertisements (under G06Q30/02 Marketing; G06Q30/0241 Advertisements)
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045: Combinations of networks
    • G06N3/048: Activation functions
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N3/09: Supervised learning
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a click rate prediction method based on a behavior sequence and feature importance, which comprises the following steps: preprocess a public internet platform data set to obtain candidate item sequence features, user historical behavior features, user portrait features, and other features; input the processed features into an embedding layer, converting the high-dimensional sparse features into low-dimensional dense embedding features; input the embedded features corresponding to the user's historical behavior features and the candidate item sequence features into a user behavior sequence network, which comprises an interest extraction layer and an interest update layer, to model the user behavior sequence and obtain the user's interest state vector; input the embedded user portrait features and other features into a hierarchical attention network; and splice the two outputs and input the result into a multi-layer neural network for training to obtain the click rate prediction result. The method reduces computational complexity and improves the click rate prediction efficiency of the model.

Description

Click rate prediction method based on behavior sequence and feature importance
Technical Field
The application relates to the technical field of information recommendation systems, in particular to a click rate prediction method based on behavior sequences and feature importance.
Background
With the development of information technology and the internet industry, people's daily lives have become closely connected with the internet, and the amount of information has grown explosively. How to mine effective information from massive data and help users find the items they are most interested in is an important problem; the emergence of recommendation systems has greatly alleviated it.
A recommendation system is a technique for recommending personalized content to a user based on the user's historical behavior data and other related information. Recommender systems have become an effective way to overcome information overload: they mainly use user information, item information, and the user's explicit or implicit feedback to help the user find valuable items. A recommendation system framework is roughly divided into two stages, recall and ranking. Recall finds the items a user is interested in within a massive item set; ranking scores the recalled items and sorts them from high to low by the user's click probability. The click rate of an item is an important indicator of a user's preference for it, so click rate prediction in the ranking stage plays a vital role in industrial applications. Accurate click rate prediction helps improve the performance of the recommendation system and maximizes commercial benefit.
To improve the performance of click rate prediction models and give users a good experience, many CTR models based on deep learning have been proposed; compared with traditional methods, they bring large improvements to recommendation systems. Feature-interaction models and behavior-sequence models are the two most important modeling approaches in CTR prediction. Feature-interaction models originate from POLY2 and Factorization Machines (FM), and this class of methods focuses on modeling high-order feature combinations and interactions. Deep learning methods are used by Cheng et al. (Cheng H.-T., Koc L., Harmsen J., et al., Wide & Deep learning for recommender systems, Workshop on Deep Learning for Recommender Systems (2016), ACM, 7-10) and Qu et al. (Qu Y., Cai H., Ren K., et al., Product-based neural networks for user response prediction (2016), IEEE, 1149-1154) to extract item-level features and feature-interaction information. With the widespread use of attention models, some sequential methods extract users' interest representations from behavior sequences using recurrent-network-based attention, multi-head self-attention, and similar mechanisms; Zhou et al. (Zhou G., Zhu X., Song C., et al., Deep interest network for click-through rate prediction (2018), ACM, 1059-1068) propose to model the user behavior sequence, thereby obtaining representations of user interests. Higher-order interest mining of the user's historical behavior significantly enhances the representation capability of the model, further improving CTR prediction performance.
CTR models based on feature interaction cross-combine all features in different ways but ignore the influence of user history data, which limits model efficiency. Behavior-sequence click rate prediction models such as DIN and DIEN model users' historical data, but they are not expressive enough to capture a user's multiple interests, and their computational complexity is high. The item a user purchases at the next moment is affected not only by the history sequence but also by the user's attributes. For example, a user purchases lipstick, books, men's clothing, etc. in time order on an e-commerce website, but considering the user's gender, age of 26, time, and other characteristics, we should rather recommend goods such as eyebrow pencils and dresses.
Disclosure of Invention
Aiming at the problem that existing click rate prediction methods cannot account for both feature interaction and behavior sequence modeling, which limits prediction performance, the application provides a click rate prediction method based on the behavior sequence and feature importance. A global-local gating module and a Post-LN Informer are designed in the interest extraction layer to extract the user's interests, and for non-sequential attributes a feature interaction network is used to capture target items and non-sequential features, realizing a nonlinear transformation of the sample space and improving the nonlinear capability of the model.
In order to achieve the above purpose, the present application adopts the following technical scheme:
a click rate prediction method based on behavior sequences and feature importance comprises the following steps:
step 1: preprocess a public internet platform data set to obtain candidate item sequence features, user historical behavior features, user portrait features, and other features; the other features include brand and price;
step 2: input the processed features into an embedding layer, converting the high-dimensional sparse features into low-dimensional dense embedding features;
step 3: input the embedded features corresponding to the user's historical behavior features and the candidate item sequence features into a user behavior sequence network for user behavior sequence modeling, obtaining the user's interest state vector; the user behavior sequence network comprises an interest extraction layer and an interest update layer;
step 4: input the embedded user portrait features and other features into a hierarchical attention network;
step 5: splice the output of step 3 and the output of step 4, and input the result into a multi-layer neural network for training to obtain the click rate prediction result.
Further, in step 2, the embedded features are constructed as follows:
Let the input data X_i, X_b, X_u, X_c respectively represent the candidate item sequence features, user historical behavior features, user portrait features, and other features.
After passing through the embedding layer, the sparse features are converted into low-dimensional dense embedded features, denoted E_i, E_b, E_u, E_c respectively.
For X_b, the position coding E_pos of the behavior sequence is obtained by positional encoding, and the behavior sequence embedding is:

E_bs = E_b + E_pos = [e_1s, e_2s, ..., e_ts, ..., e_Ts]

where E_bs represents the final embedded feature vector of the user's historical behavior and E_b is the embedding of the user's historical behavior.
further, in the step 3, the interest extraction layer includes a global-local gating module, where the global-local gating module includes a global gating module and a local module; the interest update layer includes a attention mechanism based gating loop unit a-GRU.
Further, the global gating module is specifically configured to perform the following steps:
For the embedded feature vector E_bs, global information fusion is considered: each feature embedding in E_bs is compressed by mean pooling to compute the global information p_t, forming a statistical vector P, which is passed through two fully connected layers to obtain the gating vector G.
From the embedded feature vector E_bs and the global gating vector G, the global feature embedding is constructed by re-weighting: V_g = F_g_reweight(G, E_bs) = [g_1·e_1s, ..., g_T·e_Ts].
Further, the local module is specifically configured to perform the following steps:
For the embedded feature vector E_bs, local information fusion is considered and a local feature embedded representation is constructed; the contribution of each single feature is computed through a dimension-reduction-then-expansion mechanism, giving the local gating vector L = F_l_ex(E_bs) = σ_1(W_3 σ_2(W_4 E_bs)), where σ_1 and σ_2 are nonlinear activation functions and W_3 and W_4 are learnable parameters.
From the embedded feature vector E_bs and the local gating vector L, the local feature embedding is obtained as V_l = F_l_reweight(L, E_bs) = [l_1·e_1s, ..., l_T·e_Ts].
Further, in step 3, the user behavior sequence is modeled as follows:
The global feature embedding and the local feature embedding are combined:

R = V_g ⊕ V_l

where R represents the total sequence embedded representation, and ⊙ and ⊕ respectively represent element-wise multiplication and element-wise addition.
Let q_i, k_i, v_i respectively represent the i-th rows of Q, K, V, which are obtained by assigning the acquired R to them. In the self-attention mechanism, the input sequence is mapped into three different vector spaces: the query Q (i.e., query), key K (i.e., key), and value V (i.e., value) vector spaces; T_Q = T_K = T_V = T represents the length of the sequence, and d_v represents the embedding dimension. Then u keys are randomly sampled from K, and the dot product of Q with the sampled keys is computed.
From the resulting scores, the sparsity measure M(q_i, K) = max_j(q_i·k_j^T/√d_v) − (1/T)Σ_j(q_i·k_j^T) is computed for each query, i.e., the difference between the maximum and the average of its scores. The measures are sorted from large to small and the index set Q_index of the top u queries is recorded; the corresponding rows are then taken from the original Q and denoted Q̄. Q̄ is multiplied with K, followed by a scale operation.
The average of the original value matrix V is computed and assigned to the remaining index rows (1 − Q_index), restoring the number of rows of the remaining part to the same dimension as the original Q. As in other multi-head self-attention mechanisms, the attention output of Q̄ with K and V is computed and then spliced with the mean rows to finally obtain the interest vector.
Multiple attention heads are employed and their outputs are spliced:

MHPA(Q, K, V) = concat(head_1, ..., head_h) W_o

where W_o is a parameter matrix, concat(head_1, ..., head_h) represents splicing the outputs of the h attention heads, and h = 4. Although the dimension of each head is reduced, the overall computational cost is similar to that of a single head with full dimension; MHPA(Q, K, V) is the variable to which the spliced multi-head output is assigned.
The spliced result is input into an FFN network, and Dropout and ReLU are used to obtain the final multi-interest vector representation F.
An auxiliary loss function L_aux is used, which uses the next behavior to supervise the learning of the interest state at the current step:

L_aux = −(1/N) Σ_{i=1}^{N} Σ_t [ log σ(<F_t^i, e_{t+1}^i>) + log(1 − σ(<F_t^i, ê_{t+1}^i>)) ]

where F_t^i is the i-th row vector of F at time t, {e^i, ê^i} represent N pairs of embedded sequences, σ(·) is the sigmoid activation function, <,> represents the inner product, ê is a negative sample selected from the original embedding, and N represents the number of training samples. Finally, the multi-interest vector F is input into the interest update layer to obtain the interest update vector H.
Further, the hierarchical attention network is trained as follows:
The item embedding, user embedding, and other feature embeddings are spliced and input into the hierarchical attention network; the resulting vector is expressed as C_1 = [E_i; E_u; E_c].
In each layer l, the attention mechanism a^l is used together with the aggregated hidden vector U^l, and the high-order features are finally fused into a dense real-valued vector:

U_j^(l+1) = a_j^l ⊙ U_j^l

where ⊙ denotes the Hadamard product of two vectors, U_j^(l+1) is the aggregation vector of the (l+1)-th layer for the j-th feature, and U_j^l is the aggregation vector of the l-th layer for the j-th feature.
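As a minimal sketch of this layer-wise attentional aggregation (the softmax scoring form, the per-layer parameter vectors `w_list`, and the final sum over features are assumptions, since the patent does not spell them out):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(C1, w_list):
    """Layer-wise attentional aggregation over m spliced feature embeddings.

    C1     : (m, d) first-layer feature matrix [E_i; E_u; E_c]
    w_list : one assumed (d,) attention parameter vector per layer

    Each layer computes attention scores a^l over the aggregation vectors
    and re-weights them: U_j^{l+1} = a_j^l * U_j^l. The final dense
    real-valued vector is taken as the sum over features of the last layer.
    """
    U = C1.copy()
    for w in w_list:
        a = softmax(U @ w)       # (m,) attention score per feature
        U = a[:, None] * U       # re-weight each aggregation vector
    return U.sum(axis=0)         # fused dense vector
```

With zero-valued attention parameters the scores are uniform (1/m), so the output is easy to check by hand.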
Further, in the step 5, a negative log likelihood loss function is used in the training of the multi-layer neural network.
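For reference, a minimal sketch of the negative log-likelihood objective (binary cross-entropy over N samples, assuming the network outputs a click probability p):

```python
import numpy as np

def nll_loss(y_true, p_pred, eps=1e-12):
    """Negative log-likelihood over N samples:
    L = -(1/N) * sum( y*log(p) + (1-y)*log(1-p) ).
    eps clipping guards against log(0)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

The loss is minimized when predicted probabilities match the click labels and grows as they diverge.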
Compared with the prior art, the application has the beneficial effects that:
the application discloses a click rate prediction method integrating user behaviors and feature importance, which not only captures user interests and interest updating processes from user historical behaviors, but also models high-order interaction between non-time sequence features by using a hierarchical attention mechanism. Specifically, first, we have designed and implemented an interest extraction layer to extract the interests of the user. Meanwhile, an auxiliary loss function is introduced to supervise the extraction of the user interest features. Secondly, a gating circulation unit based on an attention mechanism is introduced in an interest updating layer to enhance the influence of the target advertisement related interest. Finally, capturing target items and other features by utilizing a feature interaction network to realize nonlinear transformation of a sample space, and increasing the nonlinear capability of the model. Compared with other methods, the method converts the modeling behavior sequence problem into a time sequence prediction problem at the interest extraction layer, and uses the reconstructed Informier structure, so that the calculation complexity is reduced, and the model efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of a click rate prediction method based on behavior sequences and feature importance according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a global-local gating module according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the Post-LN Informer architecture in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of a hierarchical attention network architecture according to an embodiment of the present application;
FIG. 5 is a graph showing the results of the learning rate influence experiment;
FIG. 6 is a graph showing the results of a head number effect experiment;
FIG. 7 is a graph showing the results of a hierarchical attention network layer number impact experiment;
FIG. 8 is a graph of predicted performance of different models on a dataset;
FIG. 9 is a graph showing the performance of different variants of the application on different data sets.
Detailed Description
The application is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
the application combines characteristic interaction with a behavior sequence, and provides a click rate prediction method based on the behavior sequence and the characteristic importance, and the method has a framework shown in figure 1 and comprises the following steps:
step 1: preprocess a public internet platform data set to obtain candidate item sequence features, user historical behavior features, user portrait features, and other features; the other features include brand and price;
step 2: input the processed features into an embedding layer, converting the high-dimensional sparse features into low-dimensional dense embedding features;
step 3: input the embedded features corresponding to the user's historical behavior features and the candidate item sequence features into a user behavior sequence network for user behavior sequence modeling, obtaining the user's interest state vector; the user behavior sequence network comprises an interest extraction layer and an interest update layer;
step 4: input the embedded user portrait features and other features into a hierarchical attention network;
step 5: splice the output of step 3 and the output of step 4, and input the result into a multi-layer neural network for training to obtain the click rate prediction result.
1. Embedding layer
The application divides the original data into four groups: the candidate item sequence, user historical behavior, user portrait, and other feature information. Each category consists of several fields; for example, the user portrait includes age, gender, occupation, etc., and the user historical behavior includes the items visited by the user and the categories those items belong to. In addition, important factors such as price and brand are introduced into the item features and also taken into consideration.
In the NLP domain, each feature may be encoded as a high-dimensional one-hot vector. The original data are typically sparse vectors; for example, "male" in the gender field may be encoded as [0,1]. Assume that the concatenated one-hot vectors of the different fields are X_i, X_b, X_u, X_c, representing the candidate items, historical behavior, user portrait, and other feature information, respectively. These sparse features are transformed through the embedding layer into low-dimensional dense features, denoted E_i, E_b, E_u, E_c. Taking the embedding of the user behavior sequence as an example, the formula is as follows:

E_b = [e_1, e_2, ..., e_T] ∈ R^(T×d_v)

where T represents the length of the user behavior and d_v the embedding dimension of item e_i. Passing X_b through the embedding layer yields E_b. Since the interest extraction layer cannot by itself capture the sequential nature of the sequence, position embeddings built from sinusoidal signals of varying frequency are adopted. The user behavior sequence embedding is ultimately expressed as:
E_bs = E_b + E_pos = [e_1s, e_2s, ..., e_ts, ..., e_Ts]

where E_b is the embedding of the user's historical behavior, E_pos is the position coding of the behavior sequence, and E_bs is the final embedded representation of the behavior sequence.
2. Interest extraction layer
A user behavior sequence conceals the user's dynamically evolving interests. In an e-commerce system, user behavior is the carrier of latent interest, and the interest changes after the user takes an action. In the interest extraction layer, we extract a series of interest states from consecutive user behaviors. In the user's historical purchase data, short-term, high-frequency purchases reflect the user's interest preferences. Therefore, we introduce a global-local gating module that focuses on the part of the user behavior information that is valuable for the candidate items, and propose a Post-LN Informer to capture the behavior relationships within the sequence.
Different features in click rate prediction have different importance for the target task. For example, when predicting whether a person will watch a certain movie, the hobby and gender features are more important than occupation and address. As shown in FIG. 2, the application designs a global-local gating module that considers both global and local sequence information. It consists of two sub-parts, which consider the influence of different sequence features globally and locally, respectively.
The global gating module consists of three steps: a squeeze step, an excitation step, and a re-weighting step. Given the field embedding vector E_bs, we use mean pooling to compress each feature embedding e_ts, computing the global information p_t and forming the statistical vector P. The excitation step then learns a weight for each field embedding based on the statistical vector; we learn these weights using two fully connected (FC) layers. The formulas are as follows:

p_t = (1/d_v) Σ_{j=1}^{d_v} e_ts^(j)
G = F_ex(P) = σ_2(W_2 σ_1(W_1 P))
V_g = F_g_reweight(G, E_bs) = [g_1·e_1s, ..., g_T·e_Ts]

where e_ts^(j) denotes the j-th component of e_ts, G is the global gating vector, W_1 ∈ R^(T/r×T) and W_2 ∈ R^(T×T/r) are learnable parameters, r is a scale factor, and V_g represents the global feature embedding.
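A minimal sketch of the squeeze-excitation-reweight pipeline, assuming ReLU for the inner activation and sigmoid for the outer one (the patent only says two FC layers with nonlinear activations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_gating(E_bs, W1, W2):
    """Squeeze-excitation-reweight over a (T, d_v) behavior embedding.

    Squeeze : mean-pool each embedding e_ts to a scalar p_t -> P in R^T.
    Excite  : two FC layers (assumed ReLU then sigmoid) give gate G in R^T.
    Reweight: V_g[t] = g_t * e_ts.
    """
    P = E_bs.mean(axis=1)                        # (T,) statistical vector
    G = sigmoid(W2 @ np.maximum(W1 @ P, 0.0))    # (T,) gating vector
    V_g = G[:, None] * E_bs                      # broadcast re-weighting
    return V_g, G
```

With zero weights the gate is sigmoid(0) = 0.5 everywhere, which makes the behavior easy to verify.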
To capture the feature information of each individual feature in feature-importance modeling, a local module is designed. Instead of average-pooling the user behavior sequence as in the global module, the local module directly uses a dimension-reduction-then-expansion mechanism to calculate the contribution of a single feature to the target task:

L = F_l_ex(E_bs) = σ_3(W_3 σ_4(W_4 E_bs))
V_l = F_l_reweight(L, E_bs) = [l_1·e_1s, ..., l_T·e_Ts]

where L is the local gating vector, ⊙ and ⊕ respectively represent element-wise multiplication and addition, σ_3 and σ_4 are nonlinear activation functions, W_3 and W_4 are learnable parameters, and V_l represents the local feature embedding. The global-local gating module comprehensively emphasizes features along both the global and local distributions and dynamically adjusts the weight of each feature according to its contribution.
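The local module can be sketched as follows, assuming the reduction acts along the sequence axis and that σ_3 is a sigmoid and σ_4 a ReLU (the patent does not fix the activations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_gating(E_bs, W3, W4):
    """Reduce-then-expand gating on a (T, d_v) behavior embedding.

    L = sigma3(W3 @ sigma4(W4 @ E_bs)) yields an element-wise gate with
    the same shape as E_bs, which re-weights each embedding component.
    """
    hidden = np.maximum(W4 @ E_bs, 0.0)   # dimension reduction (k, d_v)
    L = sigmoid(W3 @ hidden)              # expansion back to (T, d_v)
    V_l = L * E_bs                        # element-wise re-weighting
    return V_l, L
```

Unlike the global gate (one scalar per position), this gate weights every component of every embedding.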
User behavior sequences are usually modeled with a multi-head self-attention mechanism or an RNN sequence model. The application improves the Informer; the main idea is that the attention coefficients follow a long-tail distribution, i.e., a few query-key pairs contribute most of the attention, so each key is allowed to attend only to the several dominant queries.
The actual operation steps of the Post-LN Informer are as follows:
Step one: let q_i, k_i, v_i respectively represent the i-th rows of Q, K, V, which are obtained by assigning the previously acquired R to all three; each has dimension T × d_v. In the self-attention mechanism, the input sequence is mapped into three different vector spaces: the query Q (i.e., query), key K (i.e., key), and value V (i.e., value) vector spaces; T_Q = T_K = T_V = T represents the length of the sequence, and d_v represents the embedding dimension.
Step two: randomly sample u keys from K and compute the dot product of Q with the sampled keys. The measure M(q_i, K) is then:

M(q_i, K) = max_j(q_i·k_j^T/√d_v) − (1/T) Σ_{j=1}^{T} q_i·k_j^T

where the first term is the maximum score of query i over the key-value pairs and the second term is their arithmetic mean. If the query of item i obtains a larger M(q_i, K), the extracted user interest is richer. According to the sorted results, the top u queries are selected from Q as Q̄; Q̄ is multiplied with K, followed by a scale operation, expressed as Q̄·K^T/√d_v.
Step three: as shown in fig. 3, to make the dimensions of the key-value pair the same as the original dimensions, we replace trivial attention with the average of V, i.e., V.
Step four: multi-headed attention enables models to focus together on information from different representation subspaces at different locations. This is suppressed if there is only one attention head. Splicing the plurality of heads to obtain the following formula:
MHPA(Q,K,V)=concat(head 1 ,...,head h )W o
where head_i = PA(Q, K, V), concat(·) represents concatenation along the feature dimension, and W_o is a parameter matrix. MHPA(Q, K, V) is the variable to which the spliced multi-head output is assigned.
Step five: next, a Feed Forward Network (FFN) is added to further enhance the nonlinear model, to avoid overcommitted and hierarchically learn meaningful features, dropout and ReLU are used in both the above steps and FFN, the output being as follows:
S = LayerNorm(MHPA(Q, K, V) + R)
F = LayerNorm(S + Dropout(ReLU(S·W_1 + b_1)·W_2 + b_2))
where W_1, W_2, b_1, b_2 are learnable parameters, LayerNorm is the standard layer normalization, and R is the original embedded vector.
The preceding steps model the user behavior sequence to capture the relationships between behaviors, but this alone does not effectively represent the user's interests. An auxiliary loss function is therefore used to supervise learning: it uses the next behavior to supervise the learning of the interest state at the current step. F_t is the row vector of F at time t. The auxiliary loss is defined as:

L_aux = −(1/N) Σ_{i=1}^{N} Σ_t [ log σ(<F_t^i, e_{t+1}^i>) + log(1 − σ(<F_t^i, ê_{t+1}^i>)) ]

where {e^i, ê^i} represent N pairs of embedded sequences, σ(·) is the sigmoid activation function, <,> represents the inner product, ê is a negative sample selected from the original embedding, and N represents the number of training samples.
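A minimal sketch of this auxiliary loss (the batched tensor layout is an assumption; the formula follows the text above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def auxiliary_loss(F, pos_next, neg_next):
    """Next-behavior supervision of interest states.

    F        : (N, T, d) interest states
    pos_next : (N, T, d) embeddings of the actual next behaviors e_{t+1}
    neg_next : (N, T, d) embeddings of sampled negative behaviors e_hat_{t+1}

    The interest state at step t should score the true next behavior high
    and the sampled negative low.
    """
    pos = sigmoid(np.sum(F * pos_next, axis=-1))   # <F_t, e_{t+1}>
    neg = sigmoid(np.sum(F * neg_next, axis=-1))   # <F_t, e_hat_{t+1}>
    return -np.mean(np.log(pos + 1e-12) + np.log(1.0 - neg + 1e-12))
```

When the interest states align with the true next behaviors and oppose the negatives, the loss approaches zero.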
3. Interest update layer
With the interest extraction layer of the previous section, a representation of the user's interest states can be obtained. However, the user's interests may change continuously under the influence of the external environment or other factors, random jumps may exist in the user's historical behavior sequence, and each interest is updated continuously and may evolve gradually over time. To address these issues, a gated recurrent unit based on an attention mechanism is introduced in the interest update layer.
At the interest update layer, a correlation weight between each interest and the candidate advertisement, namely the attention score, can be obtained. The attention score reflects the correlation between the target advertisement and the input interest, and the attention function used in the interest evolution process can be expressed as:

a_t = exp(H_t W e_ts) / Σ_{j=1}^{T} exp(H_j W e_ts)

where e_ts denotes the concatenation of the target item's embedding vectors from different fields, W ∈ R^{n_1×n_2} is a parameter matrix, n_1 is the dimension of the hidden state vector, and n_2 is the dimension of the embedding vector.
The attention score reflects the correlation between the target advertisement and the input interest state: the higher the correlation, the greater the attention score. We introduce the attention score into the update gate of the GRU as an update strategy, yielding the attention-based gated recurrent unit A-GRU. This structure determines the update strength of the hidden interest state according to the attention score: interest states related to the target advertisement participate more strongly in the update of the final interest state, while unrelated interest states participate weakly or not at all. The hidden output state is computed as follows:
H_i = (1 - a_i)·H_{i-1} + a_i·H̃_i

where H_i, H_{i-1} and H̃_i are hidden states and a_i is the attention score. Compared with the original GRU structure, the A-GRU uses the attention score in place of the original update gate. Under the A-GRU, the interest update layer can treat historical behaviors differently, providing not only the final interest representation and more relevant historical information, but also predicting the click rate of the target item following the evolution trend of the interests.
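The A-GRU update above, in which the attention score replaces the GRU's update gate, can be sketched as follows; weight names and initialization are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def a_gru_step(h_prev, x, a, Wr, Ur, Wh, Uh):
    """One A-GRU step: the attention score a replaces the GRU update gate, so
    H_t = (1 - a) * H_{t-1} + a * H~_t. Weight names are illustrative."""
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate (unchanged from GRU)
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate hidden state
    return (1.0 - a) * h_prev + a * h_tilde        # attention-weighted update

d = 4
rng = np.random.default_rng(3)
W = lambda: rng.normal(size=(d, d)) * 0.1
h0 = rng.normal(size=d)
x = rng.normal(size=d)
Wr, Ur, Wh, Uh = W(), W(), W(), W()
h_keep = a_gru_step(h0, x, 0.0, Wr, Ur, Wh, Uh)  # a = 0: state passes through unchanged
h_new = a_gru_step(h0, x, 1.0, Wr, Ur, Wh, Uh)   # a = 1: state fully replaced by candidate
```

With a = 0 an interest state unrelated to the target advertisement is left untouched, which is exactly the behavior the text describes.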
4. Hierarchical attention layer
The significance of feature crossing is to improve the nonlinear modeling capability of the model and thereby its effect. Existing CTR models only attend to interactions between the behavior sequence and the target item, ignoring the relationships between other attributes and the target item. The application proposes hierarchical attention, which models higher-order feature interactions using the attention mechanism of a hierarchical structure.
According to the formula C_1 = [E_i; E_u; E_c], the target item and the other attributes are spliced as the input of the interaction layer, where ';' denotes splicing and C_1 is the input to the hierarchical attention network;
since enumerating all possible combinations to compute high-order multi-feature interactions is expensive, in order to obtain the vector representation C^{l+1} of layer l+1, the aggregated hidden vector of layer l is obtained according to the formula

U^l = Σ_j a_j^l C_j^l

where a_j^l is the attention aggregation score of the j-th feature at layer l;
according to the formulaCalculate the attention-aggregated score of layer i, where W l Is the weight of the first layer, c l Is a vector of the context of the first layer.
According to the first layer C^1 and the aggregated vector U^l of layer l, the attention aggregation formula is expressed as:

C_j^{l+1} = (C_j^1 ⊙ U^l) + C_j^l

where ⊙ represents the Hadamard product of two vectors.
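One plausible reading of the hierarchical attention layer above can be sketched as follows; the exact score function (here a context-vector softmax over tanh-transformed feature vectors) is an assumption, as are all names and shapes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_layer(C1, Cl, Wl, cl):
    """One hierarchical-attention layer (illustrative reading of the text above):
    a_j = softmax_j(cl . tanh(Wl @ Cl[j])) gives per-feature aggregation scores,
    U = sum_j a_j * Cl[j] is the layer's aggregated hidden vector, and the next
    layer is C_{l+1}[j] = (C1[j] * U) + Cl[j] (Hadamard product plus residual)."""
    scores = np.array([cl @ np.tanh(Wl @ c) for c in Cl])  # one score per feature field
    a = softmax(scores)                                    # aggregation scores sum to 1
    U = (a[:, None] * Cl).sum(axis=0)                      # aggregated hidden vector
    return np.stack([(c1 * U) + c for c1, c in zip(C1, Cl)]), a

rng = np.random.default_rng(4)
C1 = rng.normal(size=(3, 4))   # 3 feature fields, embedding dim 4
Wl = rng.normal(size=(4, 4))
cl = rng.normal(size=4)
C2, a = hierarchical_layer(C1, C1, Wl, cl)  # layer 1 -> layer 2
```

Stacking three such layers matches the interaction depth the experiments later find optimal.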
5. Prediction layer
The prediction layer connects the modeled user behavior sequences and feature interaction layers into a representation vector as a prediction result.
In order to evaluate the model, an objective function is specified for optimization; its goal is to minimize the cross entropy between the predicted values and the true labels. Because click rate prediction is a binary classification task, the cross-entropy loss is chosen, generally defined as:
L_target = -(1/N)Σ_{(x,y)∈D} [y log ŷ + (1 - y) log(1 - ŷ)]

where y ∈ {0, 1} is the true label and ŷ is the predicted probability. All parameters are optimized using the standard back-propagation algorithm. To better mine user interest, the auxiliary loss function introduced above uses the next behavior to supervise the learning of the interest state at the current step. The global loss function of the CTR model is:
L=(1-λ)*L target +λ*L aux
where λ is a hyper-parameter used to balance the two subtasks.
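The combined objective L = (1 - λ)L_target + λL_aux above can be sketched as follows; this is a minimal illustration and the variable names are assumptions:

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Binary cross entropy between labels y and predicted probabilities p."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def global_loss(y, p, l_aux, lam):
    """L = (1 - lambda) * L_target + lambda * L_aux, as in the formula above."""
    return (1.0 - lam) * cross_entropy(y, p) + lam * l_aux

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
L = global_loss(y, p, l_aux=0.5, lam=0.2)
```

Setting λ = 0 recovers the plain click-rate objective; increasing λ shifts weight toward the auxiliary interest-supervision task.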
The method comprehensively considers the user behavior sequence and user attributes, nonlinearly fits the gating network, the attention network and the deep neural network, constructs a click rate prediction model for the recommendation system, trains the model to obtain prediction results, and deeply mines the user's diverse interests.
To verify the effect of the application, the following experiments were performed:
6. experimental part
6.1 Experimental setup
This section introduces the datasets and baseline methods used in the experiments, and gives the evaluation metrics and experimental details.
6.1.1 data sets
The Amazon dataset consists of Amazon product reviews and metadata. We use two subsets of the Amazon dataset, Beauty and Electronics, to verify the effect of CUBFI. These datasets record timestamped user behaviors. Assuming there are k reviewed products in a user's behavior sequence, the goal is to predict whether user u will write a review for the k-th product given the first k-1 reviewed products. Each user has at least 5 historical behaviors. We split the original dataset into training, validation and test sets by random sampling at 80%, 10% and 10%.
Table 1. Basic statistics of the dataset.
6.1.2 Baseline
To evaluate the performance of the proposed method of the application (abbreviated as CUBFI), the application is compared with the most advanced method of CTR prediction, which is widely used:
(1) Wide & Deep: proposes a wide-and-deep model architecture that combines memorization and generalization abilities.
(2) FiBiNet: uses the SENET mechanism to dynamically learn feature importance and bilinear interaction layers to model fine-grained feature interactions.
(3) DIN: an early work that utilizes the user's historical behavior and uses an attention mechanism to activate the user's behaviors relevant to different items.
(4) DIEN: a recent study on CTR modeling of sequential user behavior data. It integrates a GRU with candidate-centric attention to capture evolving interests.

(5) DMIN: uses a multi-head self-attention behavior refinement layer to better capture the user's historical item representations, then applies a multi-interest extraction layer to extract multiple user interests.
6.1.3 evaluation index
The present example uses Accuracy, Logloss and AUC as model evaluation indices. Accuracy indicates the proportion of correctly predicted cases among all cases; a higher value indicates stronger discrimination ability of the classifier. Logloss is a commonly used loss function for binary classification, and click rate can be treated as a binary distribution of click and non-click; the smaller the Logloss, the higher the model's CTR prediction accuracy. AUC is the area under the ROC curve; it is insensitive to whether positive and negative samples are balanced and depends only on the ranking quality. It is computed by counting, over all positive-negative sample pairs, the pairs in which the positive sample is ranked above the negative one, divided by the total number of such pairs.
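The pairwise definition of AUC described above can be computed directly; in this sketch ties between a positive and a negative score count as half a win:

```python
def pairwise_auc(labels, scores):
    """AUC from its pairwise definition: the fraction of (positive, negative)
    pairs in which the positive sample is ranked higher (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a perfectly ranked toy example
auc = pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.1])
```

This quadratic-time formulation is only practical for small evaluation sets, but it makes the ranking-only nature of AUC explicit.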
6.1.4 Experimental details
The experiments implement CUBFI and all baseline methods in TensorFlow on an RTX 3080 Ti GPU. In the embedding layer, the feature embedding dimension on both datasets is set to 18; each user's last 20 historical behaviors are taken, and since the historical behavior sequence includes both item and category parts, the total embedding dimension is K = 36. To optimize the model, Adam was used during the training phase on the Electronics and Beauty datasets with a batch size of 128. The experimental data are divided into training, test and validation sets at a ratio of 8:1:1. To ensure the reliability of model performance, all models were processed with the same datasets and the reported results are the average of 5 runs. Multiple experiments verified that the CUBFI model uses different parameter settings on different datasets, as shown in Table 2.
Table 2. Training parameters of the dataset.
6.2 Performance analysis
In this section, several groups of comparative experiments were set up to verify the performance of the method. First, the effect of hyper-parameters on the model is studied. Then, the model is evaluated against the baseline models. Finally, the experimental performance of the interest extraction layer and the feature interaction layer is analyzed.
6.2.1 design of training parameters
(1) Influence of learning Rate
First, the number of attention heads is fixed at 2 and the number of feature interaction layers at 3. The learning rate is searched between 0.001 and 0.020, and the model's performance on the validation set is observed as it is adjusted. As shown in fig. 4, the optimal learning rate differs across datasets: 0.002 on the Electronics dataset and 0.004 on the Beauty dataset.
(2) Influence of the number of heads
Multiple heads in the self-attention mechanism are essentially multiple independent attention computations, which act as an ensemble and help prevent overfitting. As can be seen from fig. 5, the trends of the model's AUC and Logloss are substantially consistent as the number of heads increases. Since the embedding dimension of the historical behavior sequence is K = 36, the number of heads must be a factor of 36. As can be seen from fig. 5, the optimal value is 4 on the Electronics dataset and 3 on the Beauty dataset.
(3) Influence of hierarchical attention network
In this study we kept the other factors unchanged and only increased the number of layers of the hierarchical attention network. The performance of the model steadily increases as the number of layers grows from 1 to 3. Model complexity also increases with depth, and at 4 layers the performance begins to decline. As can be seen from fig. 6, the optimal number of interaction layers on both datasets is 3.
6.2.2 model Performance analysis on dataset
In this section, the model of the present application was compared to the baseline method over two data sets to evaluate the overall performance of the model. The prediction results of the different models are evaluated by using the indexes introduced in 6.1.3, and specific evaluation indexes of each model are shown in table 3:
table 3: performance evaluation table for each model.
Comparing the baseline methods with the method of the present application, the results under the three indicators show that the proposed method performs best. Compared with the base Wide & Deep model, the AUC of the method improves by 3.65%, accuracy by 4.89%, and Logloss by 9.02%. As can be seen from fig. 7, the evaluation indices of the method on the Electronics dataset improve substantially over the other baseline methods. The experimental results show that the proposed CUBFI model can effectively improve click rate prediction performance. From these results, several observations can be made.
1. The CUBFI model performs better than the other baselines on all indicators over both datasets. Wide & Deep and FiBiNet are relatively classical approaches to click rate prediction using feature interactions, while DIN, DIEN and DMIN introduce the user's historical behavior sequence to model the user's interest representation. Overall, the latter models outperform the former on these indices, which also shows that modeling with user behavior sequences is effective.
2. In the paper [Xiao Z, Yang L, Jiang W, Wei Y, Hu Y, Wang H (2020) Deep multi-interest network for click-through rate prediction. In: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pp 2265-2268], the FiBiNet model is considered very efficient. FiBiNet outperforms Wide & Deep because it uses a SENET network, enabling the importance of each feature to be captured accurately. Compared with the FiBiNet model, the AUC of CUBFI improves by 2.14% on the Electronics dataset and 5.29% on the Beauty dataset. Compared with the DIN, DIEN and DMIN models, which model user behavior sequences, the proposed CUBFI model performs better, with AUC on the Electronics dataset improved by 1.98%, 1.31% and 1.02% respectively, which also demonstrates the effectiveness of combining feature interactions with user behavior sequences.
3. As shown in fig. 8, it can also be observed that CUBFI has the best performance on all datasets for all metrics. The superiority of CUBFI is mainly manifested in three aspects: (1) A global-local gating module is designed to adaptively select meaningful features. (2) The proposed Post-LN Informer module models the user behavior sequence more accurately. (3) The application provides a click rate prediction model that integrates user interests and feature interactions, comprehensively considering user behavior sequences and user feature importance.
6.2.3 performance analysis of interest extraction and feature interaction layers
To analyze the effectiveness of the interest extraction layer and the feature interaction layer, we designed a comparative experiment. Model_A removes the global-local gating module to verify its influence on performance; Model_B removes the Post-LN Informer module; Model_C is the full model whose results are reported herein; Model_D removes only the hierarchical attention network. The experimental results are shown in table 4:
table 4: different variants of CUBFI.
Each module in CUBFI is validated to determine whether it is necessary and whether it improves the model's performance. In these experiments, only one part was removed at a time while the rest remained unchanged. As can be seen from fig. 9:
1. deleting any module in the CUBFI results in performance degradation, which verifies that any module of the CUBFI model proposed by the present application plays a critical role in performance.
2. On the Electronics dataset, the AUC and Logloss of the Model_A variant both degrade significantly after the global-local gating module is deleted. The experimental results indicate that the module's selection of meaningful user behavior features is effective.
As can be seen from the Beauty dataset, the overall performance of the model also drops after the hierarchical attention network is deleted. The experimental results indicate that feature interactions are effective in improving the model's performance.
The application provides a click rate prediction method based on the behavior sequence and feature importance. At the interest extraction layer, a global-local gating module and a Post-LN Informer module are adopted to model the user behavior sequence, combined with an auxiliary loss function in which the next behavior supervises the learning of the interest state at the current step. An attention-based gated recurrent unit is then employed to model the interest update process most relevant to the target advertisement. Finally, a multi-interaction module extracts non-sequential feature information during feature interaction. Experimental results show that the method can effectively improve click rate prediction accuracy.
The foregoing is merely illustrative of the preferred embodiments of this application, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this application, and it is intended to cover such modifications and changes as fall within the true scope of the application.

Claims (8)

1. The click rate prediction method based on the behavior sequence and the feature importance is characterized by comprising the following steps of:
step 1: preprocessing a public internet platform data set to obtain candidate object sequence characteristics, user historical behavior characteristics, user portrait characteristics and other characteristics; the other features include branding, price;
step 2: inputting the processed features into an embedding layer, and converting the high-dimensional sparse features into low-dimensional dense embedding features;
step 3: inputting embedded features corresponding to historical behavior features and candidate object sequence features of a user into a user behavior sequence network to perform user behavior sequence modeling to obtain an interest state vector of the user, wherein the user behavior sequence network comprises an interest extraction layer and an interest update layer;
step 4: inputting the data with the user portrait features and other features embedded into a hierarchical attention network;
step 5: splicing the output of step 3 and the output of step 4, and inputting the result into a multi-layer neural network for training to obtain the click rate prediction result.
2. The click-through rate prediction method based on behavior sequence and feature importance according to claim 1, wherein in step 2, embedded features are constructed as follows:
set the input data X_i, X_b, X_u, X_c to respectively represent candidate object sequence features, user historical behavior features, user portrait features and other features;
the sparse features are converted into low-dimensional dense embedded features after passing through an embedding layer, and the embedded features are respectively expressed as E i 、E b 、E u 、E c
For X b Position coding E for obtaining behavior sequence by position coding pos The behavior sequence embedding is performed as follows:
E bs =E b +E pos
where E_bs denotes the final embedded feature vector of the user's historical behavior, and E_b is the embedding of the user's historical behavior.
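The behavior-sequence embedding E_bs = E_b + E_pos above can be sketched as follows; the claim does not specify the position coding, so the standard sinusoidal encoding is assumed here, and all names are illustrative:

```python
import numpy as np

def positional_encoding(T, d):
    """Standard sinusoidal position encoding (an assumption; the claim only
    says 'position coding'). Returns E_pos with shape (T, d)."""
    pos = np.arange(T)[:, None]                       # (T, 1) positions
    i = np.arange(d)[None, :]                         # (1, d) dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    # even dimensions use sin, odd dimensions use cos
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

T, d = 20, 36                       # 20 behaviors, total embedding dim K = 36
E_b = np.zeros((T, d))              # stand-in behavior embeddings
E_bs = E_b + positional_encoding(T, d)
```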
3. The click-through rate prediction method based on behavior sequence and feature importance according to claim 2, wherein in the step 3, the interest extraction layer comprises a global-local gating module, and the global-local gating module comprises a global gating module and a local module; the interest update layer includes a attention mechanism based gating loop unit a-GRU.
4. The click-through rate prediction method based on behavior sequence and feature importance of claim 3, wherein the global gating module is specifically configured to perform the following steps:
for embedded feature vector E bs Consider global information fusion, to embed feature vector E bs Each feature embedding in (a) is compressed using a mean pool, and global information p is calculated i Forming a statistical vector P, and obtaining a vector G after passing through two full-connection layers;
will embed the feature vector E bs And global gating vector G, and constructing global feature embedding V by a re-weighting method g
5. The click-through rate prediction method based on behavior sequence and feature importance according to claim 4, wherein the local module is specifically configured to perform the following steps:
for embedded feature vector E bs Considering local information fusion, constructing local feature embedded representation, and calculating contribution of single features through a mechanism for reducing the dimension and increasing the dimension to obtain local gating vector representation L;
according to the embedded feature vector E bs And local gating vector L to obtain local feature embedding V l
6. The click-through rate prediction method based on behavior sequence and feature importance according to claim 5, wherein in step 3, user behavior sequence modeling is performed in the following manner:
combining global feature embedding and local feature embedding:
combining the global feature embedding V_g and the local feature embedding V_l yields the total sequence embedded representation R, where ⊙ and ⊕ respectively denote element-wise multiplication and element-wise addition;
let q_i, k_i, v_i respectively denote the i-th rows of Q, K, V, and assign the obtained R to Q, K and V, which represent the query, key and value spaces respectively; T_Q = T_K = T_V denotes the sequence length and d_v the embedding dimension; u keys are then randomly sampled from K, and the dot product of Q with the sampled keys is computed to obtain the sparsity scores M(q_i, K);
from the scores M(q_i, K) = max_j(q_i k_j^T/√d_v) - mean_j(q_i k_j^T/√d_v), i.e. the difference between the maximum and the average of the scaled dot products, the u largest values are selected, sorted in descending order, and their index numbers Q_index in Q are recorded; the corresponding rows are then taken from the original Q according to Q_index and named Q̄; Q̄ is dot-multiplied with K followed by the scale operation;

the mean of the original vector V, denoted V̄, is computed and assigned to the remaining index positions (those not in Q_index) so that the output is restored to the same dimension as the original Q; the attention of Q̄ with K and V is computed and then spliced with V̄ to finally obtain the interest vector;
adopting a plurality of attention heads, and splicing the outputs of the plurality of attention heads;
inputting the spliced result into an FFN network, and obtaining a final multi-interest vector representation F by using Dropout and ReLU;
and using an auxiliary loss function, using the next action to monitor learning of the interesting state in the current step, and finally inputting the multi-interest vector F into an interest updating layer to obtain an interest updating vector H.
7. The click-through rate prediction method based on behavior sequence and feature importance of claim 1, wherein the hierarchical attention network is trained in the following manner:
user-embedded, item-embedded, and other feature-embedded representations are spliced and input into the hierarchical attention, and the vectors are expressed as: c (C) 1 =[E i ;E u ;E c ];
using the attention aggregation score a^l and the aggregated hidden vector U^l in each layer, the high-order features are finally fused into a dense real-valued vector:

C_j^{l+1} = (C_j^1 ⊙ U^l) + C_j^l

where ⊙ denotes the Hadamard product of two vectors, C_j^1 is the layer-1 vector of the j-th feature, and C_j^l is the layer-l vector of the j-th feature.
8. The click-through rate prediction method based on behavior sequence and feature importance according to claim 1, wherein in the step 5, a negative log likelihood loss function is used in the training of the multi-layer neural network.
CN202310568077.8A 2023-05-17 2023-05-17 Click rate prediction method based on behavior sequence and feature importance Pending CN116757747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310568077.8A CN116757747A (en) 2023-05-17 2023-05-17 Click rate prediction method based on behavior sequence and feature importance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310568077.8A CN116757747A (en) 2023-05-17 2023-05-17 Click rate prediction method based on behavior sequence and feature importance

Publications (1)

Publication Number Publication Date
CN116757747A true CN116757747A (en) 2023-09-15

Family

ID=87952281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310568077.8A Pending CN116757747A (en) 2023-05-17 2023-05-17 Click rate prediction method based on behavior sequence and feature importance

Country Status (1)

Country Link
CN (1) CN116757747A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117408786A (en) * 2023-12-11 2024-01-16 深圳须弥云图空间科技有限公司 Article recommendation method and device based on gating mechanism
CN117408786B (en) * 2023-12-11 2024-04-16 深圳须弥云图空间科技有限公司 Article recommendation method and device based on gating mechanism

Similar Documents

Publication Publication Date Title
Wang et al. Learning hierarchical representation model for nextbasket recommendation
Wu et al. Improving performance of tensor-based context-aware recommenders using bias tensor factorization with context feature auto-encoding
Cheuque et al. Recommender systems for online video game platforms: The case of steam
CN110245285B (en) Personalized recommendation method based on heterogeneous information network
CN108665311B (en) Electric commercial user time-varying feature similarity calculation recommendation method based on deep neural network
CN111737578B (en) Recommendation method and system
Liu et al. Large-scale recommender system with compact latent factor model
Yakhchi et al. Towards a deep attention-based sequential recommender system
Yue et al. Multiple auxiliary information based deep model for collaborative filtering
Xian et al. ReGNN: a repeat aware graph neural network for session-based recommendations
Yu et al. Complementary recommendations: A brief survey
Yu et al. Spectrum-enhanced pairwise learning to rank
Leng et al. Recurrent convolution basket map for diversity next-basket recommendation
Wang et al. Research on CTR prediction based on deep learning
Zheng et al. Graph-convolved factorization machines for personalized recommendation
CN116757747A (en) Click rate prediction method based on behavior sequence and feature importance
Kim et al. Deep user segment interest network modeling for click-through rate prediction of online advertising
CN115080868A (en) Product pushing method, product pushing device, computer equipment, storage medium and program product
Xiao et al. A click-through rate model of e-commerce based on user interest and temporal behavior
Koniew Classification of the User's Intent Detection in Ecommerce systems-Survey and Recommendations.
Yan et al. Merging visual features and temporal dynamics in sequential recommendation
Arthur et al. A heterogeneous couplings and persuasive user/item information model for next basket recommendation
Huang et al. Fusing frequent sub-sequences in the session-based recommender system
CN115687757A (en) Recommendation method fusing hierarchical attention and feature interaction and application system thereof
Varasteh et al. An Improved Hybrid Recommender System: Integrating Document Context-Based and Behavior-Based Methods

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination