CN116976505A

CN116976505A - Click rate prediction method of decoupling attention network based on information sharing

Info

Publication number: CN116976505A
Application number: CN202310837811.6A
Authority: CN
Inventors: 王瑛琦; 季会勤; 韩宏宇; 何欣; 于俊洋; 翟锐
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2023-07-08
Filing date: 2023-07-08
Publication date: 2023-10-31

Abstract

The invention belongs to the technical field of intelligent recommendation and deep learning, and discloses a click rate prediction method of a decoupling attention network based on information sharing. Specifically, a high-dimensional sparse sample passes through an embedding layer and then a decoupled multi-head self-attention layer, whose output serves as the input of a parallel network architecture; the explicit part uses an interaction function with a hierarchical attention mechanism, which improves the expressive power of the model. In addition, the invention provides a shared interaction layer to address insufficient information sharing between the parallel networks. The invention explicitly models feature interactions in a low-dimensional space, so that the whole model can adapt to large-scale internet platform datasets in an end-to-end manner. Finally, experiments on two real datasets, Criteo and Avazu, show that the model markedly improves the loss and accuracy of click rate prediction as well as algorithm efficiency.

Description

Click rate prediction method of decoupling attention network based on information sharing

Technical Field

The invention relates to the technical field of intelligent recommendation and deep learning, in particular to a click rate prediction method of a decoupling attention network based on information sharing.

Background

In recent years, deep learning has been widely used in computer vision, speech recognition, and natural language processing. Because image, speech and text signals are spatially or temporally correlated, deep models with unsupervised training can exploit this local dependence and build a dense representation of the feature space, enabling a neural network to learn higher-order features directly from the raw features. Owing to this learning capability, deep learning has become an effective approach for estimating online user response rates, such as advertisement click-through rates. Click-through rate (CTR) prediction is critical to industrial recommendation systems and online advertising: it determines whether an item is recommended to a user by estimating the probability that the user clicks on it. However, in CTR prediction most input features are multi-field, discrete categorical features, such as the user's city, the device type and the advertisement category, and the dependencies among them are unknown. Therefore, how to improve CTR estimation by learning feature representations of large-scale discrete categorical data is a key challenge. Some feature interactions are easily understood, but most are hidden in the data, are difficult to identify a priori, and can only be captured automatically through machine learning. Even for easily understood interactions, it is difficult to model them in detail when the number of features is large.

Effective modeling of feature interactions is one of the most common ways to improve the prediction accuracy of CTR models. For example, a factorization machine (FM) embeds feature i into a latent factor vector $q_i = [q_{i1}, q_{i2}, \ldots, q_{ik}]$ and models the interaction between two features as the inner product of their latent vectors. FM can be extended to arbitrary high-order feature interactions, but these interactions contain both useful and useless feature combinations. Since deep neural networks (DNNs) have strong feature representation learning capabilities, learning complex and selective feature interactions with DNNs is a promising approach. For example, FNN proposes a factorization-machine-supported neural network to learn high-order feature interactions, using a pre-trained factorization machine for field embedding before the DNN is applied. PNN further proposes a product-based neural network that introduces a product layer between the embedding layer and the DNN layer, without relying on pre-training. The main disadvantage of FNN and PNN is that they focus on high-order feature interactions while capturing few low-order interactions. To better model both low-order and high-order feature interactions, Google proposed the Wide & Deep model in 2016, which combines a linear model with a DNN and thereby retains memorization capacity while improving the generalization capacity of the model. The Deep & Cross and DeepFM models, by introducing a hybrid architecture, not only overcome the problem of focusing only on high-order feature interactions but also remove the need for manual feature crossing. According to how they combine explicit and implicit feature interaction modeling, these deep CTR models can be divided into two classes: parallel network architectures and serial network architectures.
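As a minimal sketch of the FM idea described above, the second-order score can be written as a sum of inner products of latent vectors over all feature pairs. The function name `fm_second_order` and the toy latent vectors are illustrative assumptions and do not come from the invention.

```python
from itertools import combinations

def fm_second_order(latent, x):
    """Sum of <q_i, q_j> * x_i * x_j over all feature pairs i < j."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        inner = sum(a * b for a, b in zip(latent[i], latent[j]))
        total += inner * x[i] * x[j]
    return total

# Hypothetical latent factor vectors q_i (k = 2) for three active features.
latent = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
x = [1.0, 1.0, 1.0]
score = fm_second_order(latent, x)
```

Every pair contributes, whether or not the combination is useful, which is the limitation the paragraph above points out.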

How to find meaningful high-order feature interactions is a significant challenge for click rate estimation. One widely used approach to modeling high-order feature interactions is to calculate the inner product of feature embeddings, as in the self-attention neural networks of deep learning. Intuitively, each pair of feature vectors encodes the pairwise semantics of a feature interaction, but this ignores the general impact of each individual feature. To explicitly model such unary semantics, a unary term that computes the general impact of one feature on all other features is decoupled from the ordinary self-attention network.

In summary, most CTR models adopt a parallel network architecture, with one network for explicit feature interactions and one for implicit feature interactions. However, the existing methods have the following disadvantages:

1. The ambiguity of feature interactions in different semantic subspaces is ignored; most models consider only the interactions between features and ignore the effect of one feature on the other features.

2. Explicit and implicit interactions in the parallel network architecture are fused only at the last layer, and the middle layers share no information, so that the interaction signals between the two networks are weakened.

3. For sharing in a parallel network architecture, the DCN used by existing models performs high-order interaction at the bit-wise level and cannot effectively capture interaction features.

Disclosure of Invention

In view of the above problems, in order to better model complex feature interaction behavior and thereby predict the click rate of a user more accurately, the invention provides a click rate prediction method of a decoupling attention network based on information sharing, DSAN (Disentangled Self-attention Neural Network). The self-attention mechanism is first decoupled into two parts: a pairwise term models the specific interaction between two features, and a unary term models the influence of one feature on the other features; the result is input into a parallel network architecture. The invention sets two modules in the shared interaction layer to enhance the interaction signals between the parallel networks. In addition, a hierarchical attention mechanism is used in the explicit interaction part, so that each feature takes part in high-order interactions more meaningfully.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a click rate prediction method of a decoupling attention network based on information sharing comprises the following steps:

step 1: processing the categorical data and the numerical data in a public internet platform dataset, and converting them into dense, low-dimensional embedded vectors; the categorical data includes advertisement categories; the numerical data includes user age;

step 2: inputting the embedded features into a decoupling attention model for processing to obtain a final feature vector representation;

step 3: passing the final feature vector representation through a shared interaction layer to model high-order interactions;

step 4: concatenating the information of the interaction layer and feeding it to the output layer to obtain a click rate prediction result.

Further, the step 1 includes:

step 11, converting each sparse vector corresponding to the categorical data into a low-dimensional dense vector according to the following formula:

$e_i = w_i x_i$

where $e_i$ is the low-dimensional dense vector corresponding to feature $x_i$, $w_i \in \mathbb{R}^{d \times \lambda}$ is the embedding matrix of feature $x_i$, λ is the number of feature values in the i-th field, and d is the dimension of the embedded vector;

step 12, converting the numerical data into the low-dimensional space according to the following formula:

$e_j = w_j x_j$

where $w_j \in \mathbb{R}^{d \times \lambda'}$ is the embedding vector of feature $x_j$, λ' = 1 indicates a scalar value with only one class, and d is the corresponding embedding dimension;

step 13, applying an embedding layer on the original input layer, compressing the high-dimensional sparse vector into a low-dimensional dense vector, and expressing the result of the embedding layer as:

$e = [e_1; e_2; \ldots; e_i; \ldots; e_j; \ldots; e_m]$

where ";" denotes the concatenation operation of the matrix and m is the number of features.

Further, the step 2 includes:

step 21, treating each feature interaction as a key-value pair, and learning the importance of each feature interaction by multiplying the embedded features, so that important key-value pairs obtain higher attention scores; formally, each candidate feature is linearly transformed into a new embedding space:

$q_i = W_q e_i,\quad k_j = W_k e_j,\quad k_{j2} = W_{k2} e_j,\quad v_k = W_v e_k$

where the query $q_i$, the keys $k_j$ and $k_{j2}$, and the value $v_k$ are obtained through the linear transformation matrices $W_q$, $W_k$, $W_{k2}$ and $W_v \in \mathbb{R}^{d' \times d}$, respectively; d is the dimension of the embedding layer and d' is the attention dimension;

step 22, using a pairwise term to model purely specific interactions and a unary term to model the general effect of all feature fields, adding the two terms to obtain the attention score, and multiplying the score with the value; in order for the model to learn different feature interactions in different subspaces, multiple attention heads are used:

$H_i = \left[\sigma\left((q_i - \mu_q)^T (k_j - \mu_k)\right) + \sigma\left(\mu_p k_j^T\right)\right] v_k$

where σ is the activation function; $\mu_q$ and $\mu_k$ are the mean values of the query vectors and the key vectors, respectively, each obtained by linearly transforming the embedded features; M is the total number of features of the user and the item; and $\mu_p$ denotes a mean value of the key vectors;

step 23, concatenating all attention heads to obtain $H = \mathrm{concat}[H_1; H_2; \ldots; H_h]$, and combining a residual network to obtain the final feature vector representation, the formula being defined as follows:

where a linear projection matrix is used, ReLU is the activation function, and $L_0$ denotes the final feature vector representation, which serves as the first layer of the shared interaction layer.

Further, the step 3 includes:

step 31, taking the obtained augmented matrix $L_0$ as the input of the shared interaction layer, learning feature crossing and sharing parallel network information;

step 32, obtaining two parts, $C_l$ and $D_l$, by using the decomposition module:

where, for $g_i$ and $g'_i$, i denotes the i-th feature in a sample; $g_i$ is the gating score of the i-th feature; ⊙ is the Hadamard product of two vectors; $C_l$ is the explicit high-order interaction part and $D_l$ is the implicit high-order interaction part;

step 33, to obtain the (l+1)-th order feature cross combination $C_{l+1}$, first aggregating the feature cross combination $C_l$ of the l-th layer, with $C_l = C_0$ at initialization, from which $C_{l+1}$ can be obtained; the attention aggregation formula is as follows:

where the attention score of the j-th feature in the l-th layer is expressed as:

step 34, at each layer of high-order feature interaction, using a sharing module to solve the above problem; given the two network outputs $C_l$ and $D_l$ of a certain layer, the sharing module may be expressed as:

where ⊙ is the Hadamard product of two vectors;

step 35, repeating steps 32, 33 and 34 to generate feature representations from the 1st order to the l-th order, and then inputting the three vectors obtained at the l-th order, denoted $C_l$, $D_l$ and $L_l$, to a standard logistic layer, expressed as follows:

where W and b are the weight and bias parameters, respectively.

Further, in the output layer, the click rate prediction is performed by using the Logloss loss function.

Further, after the step 4, the method further includes:

and recommending advertisements to the user according to the click rate prediction result.

Compared with the prior art, the invention has the beneficial effects that:

1. A decoupled multi-head attention mechanism is proposed, which defines a pairwise term and a unary term. The multi-head mechanism can analyze the interactions between features in different latent semantic subspaces.

2. The invention provides a parallel network architecture with a multi-level sharing mechanism, and sets two modules in the shared interaction layer to enhance the interaction signals in the parallel network: a decomposition module and a sharing module. The decomposition module uses a field gating network to differentiate the feature distributions fed to the different networks, and the sharing module captures layer-wise interaction signals in the parallel network.

3. In the parallel network architecture part, a hierarchical attention interaction network is proposed, which interacts at the vector-wise level.

4. Extensive comparative experiments were performed on two real datasets, Criteo and Avazu. The experimental results on the CTR prediction task show that the method outperforms existing click rate prediction methods in terms of both the accuracy and the loss of click rate prediction.

Drawings

Fig. 1 is a schematic diagram of a network structure adopted by a click rate prediction method of a decoupling attention network based on information sharing according to an embodiment of the present invention;

FIG. 2 is a flowchart of a click rate prediction method of a decoupled attention network based on information sharing according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of feature embedding (embedding dimension 4) according to an embodiment of the present invention;

FIG. 4 is a graph showing the effect of the dimensions of an embedded layer on model performance according to an embodiment of the present invention;

FIG. 5 is a graph showing the effect of the number of interaction layers on the performance of a model in accordance with an embodiment of the present invention;

FIG. 6 is a comparative ablation test of a model of an embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:

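As shown in fig. 1 and fig. 2, a click rate prediction method of a decoupling attention network based on information sharing includes:

step 1: processing the categorical data and the numerical data in a public internet platform dataset, and converting them into dense, low-dimensional embedded vectors;

step 2: inputting the embedded features into a decoupling attention model for processing to obtain combined individual features;

step 3: passing the combined individual features through a shared interaction layer to model high-order interactions;

step 4: concatenating the information of the interaction layer and feeding it to the output layer to obtain a click rate prediction result.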

Most existing CTR models adopt a parallel network architecture to learn explicit and implicit feature interactions: the implicit part automatically learns high-order feature interactions with a DNN, and the explicit part usually learns high-order feature interactions with a fixed function or an FM variant. However, existing models ignore the ambiguity of feature interactions in different semantic subspaces, and the parallel network architecture shares little information and has no interaction between layers, which limits the expressive power of the models. To solve these problems, the invention proposes DSAN, a click rate prediction model of a decoupling attention network based on information sharing.

As shown in fig. 1, the DSAN model consists of four parts: (1) an embedding layer, (2) a decoupled multi-head self-attention layer, (3) a shared interaction layer, and (4) an output layer, of which the shared interaction layer is the core. First, the input features are converted into dense, low-dimensional embedded vectors; second, the embedded features are input into the decoupled self-attention model; then high-order interactions are modeled through the shared interaction layer; finally, the information of the interaction layer is concatenated and fed to the output layer to obtain the predicted click rate.

1. Embedding layer

In computer vision or natural language processing, the input data is typically an image or text signal, which is correlated in space or time. In recommendation systems, however, the input features tend to be sparse and high-dimensional, with no obvious spatio-temporal correlation. To address the sparsity and high dimensionality of samples in recommendation systems, it is common practice to use feature embedding. Assume there are N samples and m features, some of which are categorical and some numerical. For categorical data, the usual method is feature embedding, i.e. converting each sparse vector into a low-dimensional dense vector, expressed as follows:

$e_i = w_i x_i$ (1)

where $e_i$ is the low-dimensional dense vector corresponding to feature $x_i$, $w_i \in \mathbb{R}^{d \times \lambda}$ is the embedding matrix of feature $x_i$, λ is the number of feature values in the i-th field, and d is the dimension of the embedded vector. If $x_i$ is one-hot encoded and its j-th element equals 1, the representation of $x_i$ is the j-th column of $w_i$. If $x_i$ is multi-hot encoded, several of its elements equal 1, and the representation of $x_i$ is the sum or the average of the embeddings of those elements. For example, gender is a feature that, with one-hot encoding, is [0, 1] for male and [1, 0] for female. Numerical values can likewise be converted into the low-dimensional space:

$e_j = w_j x_j$ (2)

where $w_j \in \mathbb{R}^{d \times \lambda'}$ is the embedding vector of feature $x_j$, λ' = 1 indicates a scalar value with only one class, and d is the corresponding embedding dimension.

By the above method, an embedding layer is applied on the original input layer, and the high-dimensional sparse vector is compressed into a low-dimensional dense vector, as shown in fig. 3. The results of the embedded layer are expressed as:

$e = [e_1; e_2; \ldots; e_i; \ldots; e_j; \ldots; e_m]$ (3)
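The embedding layer of equations (1) to (3) can be sketched as follows. This is a minimal illustration, assuming an embedding matrix stored as d rows by λ columns and averaging for multi-hot inputs; the helper names `embed_categorical` and `embed_numeric` and the random initial weights are hypothetical.

```python
import random

def embed_categorical(w, x):
    """e_i = w_i x_i for a one-hot or multi-hot x (multi-hot is averaged)."""
    active = [j for j, v in enumerate(x) if v == 1]
    if not active:
        raise ValueError("x must have at least one active element")
    return [sum(row[j] for j in active) / len(active) for row in w]

def embed_numeric(w, x):
    """e_j = w_j x_j with lambda' = 1: scale a d-dimensional vector by x."""
    return [w_r * x for w_r in w]

random.seed(0)
d, lam = 4, 3  # embedding dimension and number of values in the field
w_cat = [[random.gauss(0.0, 0.1) for _ in range(lam)] for _ in range(d)]
w_num = [random.gauss(0.0, 0.1) for _ in range(d)]

e_1 = embed_categorical(w_cat, [0, 1, 0])  # one-hot: selects column 1
e_2 = embed_numeric(w_num, 0.35)           # a numerical feature value
e = [e_1, e_2]                             # m = 2 embeddings of size d
```

For a one-hot input the product reduces to a column lookup, which is why practical implementations use an index-based embedding table instead of a matrix multiplication.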

2. Decoupled multi-head self-attention layer

Multi-head self-attention networks have achieved remarkable results in modeling complex relations. For example, they have advantages in modeling arbitrary word dependencies in machine translation and sentence embedding, and have been successfully applied to capturing node similarity in graph embedding. The present invention improves on this technique and uses it to model the correlation between different feature fields.

For each sparse input feature $x_i$, an embedding lookup converts it into a dense embedding vector $e_i$. After the low-dimensional representation of each feature is obtained, a scaled dot-product attention scheme is used to model the high-order interactions between features. Specifically, each feature interaction is formulated as a key-value pair, and the importance of each feature interaction is learned by multiplying the embedded features, so that important key-value pairs obtain higher attention scores. Formally, each candidate feature is linearly transformed into a new embedding space:

$q_i = W_q e_i,\quad k_j = W_k e_j,\quad k_{j2} = W_{k2} e_j,\quad v_k = W_v e_k$ (4)

The query $q_i$, the keys $k_j$ and $k_{j2}$, and the value $v_k$ are obtained through the linear transformation matrices $W_q$, $W_k$, $W_{k2}$ and $W_v \in \mathbb{R}^{d' \times d}$, respectively; d is the dimension of the embedding layer and d' is the attention dimension.

The standard multi-head self-attention mechanism has been shown in earlier visual learning tasks to hinder feature learning. The present invention therefore uses a decoupled self-attention mechanism in which the pairwise term and the unary term are decoupled by using separate softmax functions and embedding matrices, which greatly reduces the difficulty of jointly learning the two terms. The pairwise term models purely specific interactions, the unary term models the general effect of all feature fields, and the two terms are added to obtain the attention score, which is then multiplied with the value. In order for the model to learn different feature interactions in different subspaces, multiple attention heads are used:

$H_i = \left[\sigma\left((q_i - \mu_q)^T (k_j - \mu_k)\right) + \sigma\left(\mu_p k_j^T\right)\right] v_k$ (5)

where σ is the activation function; $\mu_q$ and $\mu_k$ are the mean values of the query vectors and the key vectors, respectively, each obtained by linearly transforming the embedded features; M is the total number of features of the user and the item; and $\mu_p$ denotes a mean value of the key vectors. All attention heads are concatenated to give $H = \mathrm{concat}[H_1; H_2; \ldots; H_h]$. Finally, a residual network is combined to obtain the final feature vector representation, defined as follows:

where a linear projection matrix is used to avoid dimension mismatch, ReLU is the activation function, and $L_0$ denotes the final feature vector representation, which serves as the first layer of the shared interaction layer.
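A single decoupled attention head can be sketched as below. Several details are assumptions made only for illustration: σ is taken to be a softmax over the key index j, $\mu_p$ is taken to be the mean of the second key projections $k_{j2}$ and is scored against those same projections, and the residual step is assumed to have the common form $L_0 = \mathrm{ReLU}(H + W_{res} e)$. All function and variable names are hypothetical.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(z):
    top = max(z)
    ex = [math.exp(v - top) for v in z]
    total = sum(ex)
    return [v / total for v in ex]

def decoupled_head(E, Wq, Wk, Wk2, Wv):
    """Pairwise softmax plus unary softmax, applied to the values."""
    Q = [matvec(Wq, e) for e in E]
    K = [matvec(Wk, e) for e in E]
    K2 = [matvec(Wk2, e) for e in E]
    V = [matvec(Wv, e) for e in E]
    mu_q, mu_k, mu_p = mean(Q), mean(K), mean(K2)
    K_centered = [[a - b for a, b in zip(k, mu_k)] for k in K]
    unary = softmax([dot(mu_p, k2) for k2 in K2])  # shared by all queries
    H = []
    for q in Q:
        q_centered = [a - b for a, b in zip(q, mu_q)]
        pairwise = softmax([dot(q_centered, k) for k in K_centered])
        score = [p + u for p, u in zip(pairwise, unary)]
        H.append([sum(s * v[c] for s, v in zip(score, V))
                  for c in range(len(V[0]))])
    return H

def residual_relu(H, E, W_res):
    """Assumed residual form: L0 = ReLU(H + W_res e)."""
    return [[max(0.0, h + r) for h, r in zip(h_i, matvec(W_res, e_i))]
            for h_i, e_i in zip(H, E)]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
m, d, d_att = 3, 4, 2  # features, embedding dim d, attention dim d'
E = rand_matrix(m, d)
Wq, Wk, Wk2, Wv, W_res = (rand_matrix(d_att, d) for _ in range(5))
H = decoupled_head(E, Wq, Wk, Wk2, Wv)
L0 = residual_relu(H, E, W_res)
```

Because the two softmax terms are normalized separately, the combined scores for each query sum to 2 rather than 1; a multi-head version would run several such heads with different matrices and concatenate their outputs.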

3. Shared interaction layer

In the shared interaction layer, there are two main modules: a decomposition module and a sharing module. The decomposition module uses a field gating network to decide, by soft selection, how features are distributed to the different networks. The sharing module performs dense fusion by establishing a connection between the cross network and the deep network, thereby capturing layer-wise interaction signals between the parallel networks. Both modules are lightweight and model-agnostic, have low time and space complexity, and generalize well to CTR models with a parallel network architecture.

Decomposition module: a CTR model with a parallel network architecture uses both explicit and implicit feature interactions, and existing models feed all features equally to both networks. In the decomposition module, different features are matched to different interaction functions. First, the obtained augmented matrix $L_0$ is taken as the input of the shared interaction layer, to learn feature crossing and share parallel network information. Then the decomposition module is used to obtain two parts, $C_l$ and $D_l$:

where, for $g_i$ and $g'_i$, i denotes the i-th feature in a sample; $g_i$ is the gating score of the i-th feature; and ⊙ is the Hadamard product of two vectors.
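A minimal sketch of the decomposition step follows. It assumes, purely for illustration, that the gating scores $g_i$ and $g'_i$ come from a softmax over learnable per-feature logits and that each score scales the whole embedding row of its feature; the gating network's actual parametrization may differ, and the names `decompose`, `logits_c` and `logits_d` are hypothetical.

```python
import math

def softmax(z):
    top = max(z)
    ex = [math.exp(v - top) for v in z]
    total = sum(ex)
    return [v / total for v in ex]

def decompose(L, logits_c, logits_d):
    """Split L into an explicit input C and an implicit input D via gates."""
    g = softmax(logits_c)        # gating scores g_i for the explicit network
    g_prime = softmax(logits_d)  # gating scores g'_i for the implicit network
    C = [[g_i * v for v in row] for g_i, row in zip(g, L)]
    D = [[g_i * v for v in row] for g_i, row in zip(g_prime, L)]
    return C, D

L_in = [[1.0, 2.0], [3.0, 4.0]]  # two features, embedding size 2
C, D = decompose(L_in, [0.0, 0.0], [math.log(3.0), 0.0])
```

With equal logits the explicit branch sees every feature weighted by 0.5, while the second set of logits sends the first feature to the implicit branch with weight 0.75 and the second with weight 0.25.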

In a parallel network architecture, explicit feature interactions are typically modeled with predefined interaction functions in order to explore bounded-order interactions efficiently. Here the explicit feature interaction part uses an interaction function with hierarchical attention to learn the salient features of different orders, which also serve as an explanation. Implicit feature interactions are mainly learned through fully connected layers. $C_l$ is the explicit high-order interaction part and $D_l$ is the implicit high-order interaction part; of the two, the explicit high-order interaction part is the more important. To obtain the (l+1)-th order feature cross combination $C_{l+1}$, the feature cross combination $C_l$ of the l-th layer is first aggregated. At initialization $C_l = C_0$, from which $C_{l+1}$ can be obtained. The attention aggregation formula is as follows:

where the attention score of the j-th feature in the l-th layer is expressed as:
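One way to picture the attention aggregation is the following sketch. It assumes an InterHAt-style form: the score of feature j is a softmax over $c^T \mathrm{ReLU}(W C_l[j])$, the aggregate is the score-weighted sum of the rows of $C_l$, and the next order is the Hadamard product of the aggregate with $C_0$ plus a residual $C_l$. This form and the names `W`, `c` and `attention_aggregate` are assumptions for illustration only.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    top = max(z)
    ex = [math.exp(v - top) for v in z]
    total = sum(ex)
    return [v / total for v in ex]

def attention_aggregate(C_l, C_0, W, c):
    """Return (C_{l+1}, attention scores) under the assumed form above."""
    hidden = [[max(0.0, h) for h in matvec(W, row)] for row in C_l]
    alpha = softmax([dot(c, h) for h in hidden])
    width = len(C_l[0])
    u = [sum(a * row[k] for a, row in zip(alpha, C_l)) for k in range(width)]
    C_next = [[u_k * x0 + xl for u_k, x0, xl in zip(u, row0, row_l)]
              for row0, row_l in zip(C_0, C_l)]
    return C_next, alpha

C_0 = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability
c = [0.0, 0.0]                # zero context vector gives uniform attention
C_1, alpha = attention_aggregate(C_0, C_0, W, c)
```

The scores in `alpha` indicate which features dominate each order, which is the sense in which hierarchical attention can double as an explanation.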

and a sharing module: the existing parallel network architecture respectively learns explicit and implicit characteristic interaction, and the two networks are separately and independently executed until the last layer is subjected to information fusion. This mode fails to capture the correlation between two parallel networks in the middle layer, weakening the interaction signal between explicit and implicit feature interactions. To address this problem, the present invention uses a sharing module to capture a layer-by-layer interaction signal between two parallel networks.

At each layer of high-order feature interactions, a sharing module is used to solve the above-mentioned problems. Suppose that two network data C of a certain layer are obtained _l And D _l The sharing module may be expressed as:

wherein the method comprises the steps ofIs the hadamard product of the two vectors.
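As a minimal sketch of the sharing step, assume the fused signal is simply the element-wise (Hadamard) product of the two branch outputs and that both branches produce vectors of the same length. The function name `share` is hypothetical, and the real module may add further transformations.

```python
def share(c_l, d_l):
    """Fuse explicit and implicit outputs with a Hadamard product."""
    if len(c_l) != len(d_l):
        raise ValueError("both branches must have the same width")
    return [a * b for a, b in zip(c_l, d_l)]

c_l = [1.0, 2.0, 3.0]   # output of the explicit branch at layer l
d_l = [0.5, 2.0, -1.0]  # output of the implicit branch at layer l
fused = share(c_l, d_l)
```

The product adds no parameters, which is consistent with the description of the module as lightweight.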

Repeated application of equations 7, 8 and 10 produces representations from the 1st order to the l-th order. The three vectors obtained at the l-th order, denoted $C_l$, $D_l$ and $L_l$, are input to a standard logistic layer, expressed as follows:

where W and b are the weight and bias parameters, respectively.
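The logistic layer can be sketched as follows, assuming the three vectors are concatenated and passed through a sigmoid of a linear map; the function name `predict_ctr` and the example weights are hypothetical.

```python
import math

def predict_ctr(c_l, d_l, l_l, weights, bias):
    """Sigmoid of W . [C_l; D_l; L_l] + b, giving a click probability."""
    z = c_l + d_l + l_l  # list concatenation of the three vectors
    if len(z) != len(weights):
        raise ValueError("weights must match the concatenated length")
    logit = sum(w * x for w, x in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

p = predict_ctr([1.0], [2.0], [3.0], [1.0, 1.0, -1.0], 0.0)
```

In this example the weighted sum is exactly zero, so the predicted click probability is 0.5.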

4. Output layer

The loss function defined in the model is Logloss, which is formulated as follows:

$\mathrm{Logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right)$

where $y_i$ and $\hat{y}_i$ denote the true value and the predicted value, respectively, i indexes the training samples, N is the total number of samples, and the model weights are updated by a gradient descent algorithm.
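Logloss is the standard binary cross-entropy, so it can be computed as in the sketch below. The clipping constant `eps` is an implementation assumption that avoids taking the logarithm of zero; it is not part of the definition.

```python
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

loss = logloss([1, 0], [0.5, 0.5])
```

A model that always predicts 0.5 scores ln 2 (about 0.693), which is a useful reference point when reading Logloss values.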

To verify the effect of the invention, the following experiments were performed:

5. Experiments

In this section, the validity of the model DSAN of the invention will be evaluated by answering the following three questions.

Question 1: Is the proposed DSAN model superior to the state-of-the-art CTR baseline methods? Is modeling high-order feature interactions effective?

Question 2: How does the parameter configuration of the model affect its accuracy?

Question 3: Can the proposed decoupled self-attention and hierarchical attention improve the performance of the model?

Before answering these questions, the experimental setup is first described.

5.1 Experimental settings

5.1.1 Datasets

Experiments were set up in this section to compare the performance of the model DSAN of the invention with other models. 80% of the data was randomly selected as the training set, 10% as the test set, and the remaining 10% as the validation set. Table 1 summarizes the two public datasets used in the experiments:

Table 1. Statistics of the evaluation datasets

(1) Criteo: this dataset contains display advertising traffic logs spanning more than 7 days, with 45 million click records of users on displayed advertisements. Each sample contains 13 numerical feature fields and 26 categorical feature fields; the numerical features include prices and the like, and the categorical features include brands, user IDs, advertisement categories and the like.

(2) Avazu: this dataset is provided by Avazu and is used to predict whether a mobile advertisement will be clicked. It contains 40 million click-log records over 10 days and consists of 23 categorical features such as domain name, category, connection type, advertisement ID, user ID, app ID and device information; the serial number ID field, which does not help CTR prediction, is removed.
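The random 80/10/10 split described above can be sketched as follows. The fixed seed and the function name `split_dataset` are assumptions added for reproducibility of the illustration; the original experiments do not state a seed.

```python
import random

def split_dataset(samples, seed=2023):
    """Shuffle and split samples into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_train = len(indices) * 8 // 10
    n_test = len(indices) // 10
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:n_train + n_test]]
    valid = [samples[i] for i in indices[n_train + n_test:]]
    return train, valid, test

train, valid, test = split_dataset(list(range(1000)))
```

Integer arithmetic is used for the split sizes so that every sample lands in exactly one subset.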

5.1.2 Evaluation metrics

In the experiments, two evaluation metrics were used: AUC and Logloss. AUC is the area under the ROC curve and is a common metric for evaluating CTR prediction; the higher the AUC, the better the performance. Logloss, the loss function in CTR prediction, is a widely used metric in binary classification that measures the distance between two distributions; the smaller the value, the better the performance.
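AUC can be read as the probability that a randomly chosen clicked sample is scored higher than a randomly chosen non-clicked one. The sketch below computes it by direct pair counting, with ties counted as one half; this quadratic version is for illustration only, and large datasets would use a rank-based implementation. The function name `auc` is hypothetical.

```python
def auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

value = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.5])
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.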

5.1.3 Baseline models

The method DSAN of the invention is compared with the following state-of-the-art methods for the CTR task.

● LR: the LR model can only learn first-order features, which is insufficient to represent the interactions between features.

● FM: uses factorization techniques to model second-order feature interactions.

● NFM: stacks a deep neural network on top of a second-order feature interaction layer. This model is also an improvement of FM.

● Wide & Deep (WDL): consists of two parts. Its deep model is identical to the base model, and its wide model is a linear model. It is the first model to combine the generalization of a deep model with the memorization of a wide linear model.

● Deep & Cross (DCN): the cross network is the core of the Deep & Cross model; it takes the outer product of the concatenated feature vectors at the bit-wise level and explicitly models feature interactions.

● DeepFM: combines a deep MLP with a factorization machine to calculate CTR. FM handles low-order combinations of features, and the deep part handles high-order combinations of features.

● xDeepFM: the Compressed Interaction Network (CIN) is the core of the xDeepFM model; it generates feature interactions explicitly at the vector-wise level.

● FiBiNet: introduces two modules. The SENet module dynamically learns the importance of features, and the bilinear interaction layer improves the way feature interactions are modeled.

● InterHAt: uses a Transformer with multi-head self-attention for feature learning; on this basis, it uses hierarchical attention layers to predict CTR while providing interpretable insight into the prediction.

5.1.4 Experimental details

All models are implemented using the PyTorch framework. The hyperparameters of each model are tuned, and the optimal settings of the model are described in 5.3. The learning rate is set to 0.001, and the Adam optimizer is used with a mini-batch size of 1024. The hidden layers of the deep network are set to [400, 400, 400] by default. The feature embedding dimension of all models is fixed at 16, the number of attention heads is set to 2, the attention embedding dimension is fixed at 20, and dropout is set to 0.2 to avoid overfitting. For the CIN network of the xDeepFM model, 3 interaction layers are used.
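For reference, the settings listed above can be collected in one configuration object, as in the sketch below. The class and field names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Training settings stated in the experimental details."""
    learning_rate: float = 0.001
    optimizer: str = "adam"
    batch_size: int = 1024
    hidden_units: tuple = (400, 400, 400)  # deep network hidden layers
    embedding_dim: int = 16                # feature embedding dimension
    num_heads: int = 2                     # attention heads
    attention_dim: int = 20                # attention embedding dimension
    dropout: float = 0.2
    cin_layers: int = 3                    # xDeepFM baseline only

config = ExperimentConfig()
```

Freezing the dataclass keeps the shared settings identical across all compared models, in line with the stated goal of a uniform protocol.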

Table 2. Comparison of the effectiveness of different algorithms

5.2 Model performance (RQ 1)

In CTR prediction tasks, an improvement in AUC or Logloss on the order of $10^{-3}$ is considered significant. Table 2 summarizes the performance of the different models on the two datasets, from which the following observations can be made:

(1) During training, all models were found to exhibit the one-epoch phenomenon, so each model was trained for only one epoch. As can be seen from Table 2, model performance varies from dataset to dataset, and DSAN has an advantage over all baseline models. The LR model performs worse than all other models, which suggests that deep networks and factorization improve model performance. FM predicts better than LR, indicating that second-order feature interactions have a positive effect on prediction. NFM and DeepFM add a deep neural network on top of FM and further improve prediction accuracy, which shows that the high-order feature interactions learned by a deep network improve the prediction. On the Criteo dataset, the Deep & Cross model was observed to outperform all other baseline methods, even the more recent EDCN. This may be due to inconsistent data processing and evaluation protocols in previous work. The invention therefore uses the same evaluation protocol and the same data processing method throughout, so that the results are comparable.

(2) Because the same evaluation protocol and the same data processing method are used, the results are comparable. DSAN has a distinct advantage in predictive performance over the classical NFM and DeepFM models. On the Criteo dataset, AUC increased by 2.05% and 1.49%, and Logloss decreased by 2.62% and 3.4%, respectively. The DSAN model also shows a small improvement over WDL, DCN and xDeepFM. On the Criteo dataset, the accuracy was improved by 1.73%, 0.56% and 1.7%, respectively, and Logloss was reduced by about 3.86%, 1.07% and 3.7%, respectively. These parallel-architecture models do not perform better than DSAN, which demonstrates the effectiveness of the shared interaction layer proposed by the present invention.

(3) The model we propose is superior to FiBiNet. Compared to FiBiNet, the AUC of our model increased by 0.58% and 0.72% on the Criteo and Avazu datasets, respectively. Logloss was reduced by about 0.35% and 0.36%, respectively, and AUC was increased by 0.41% and 0.37% compared to the best baseline method on the Criteo and Avazu datasets. There are two reasons for this improvement. First, the shared interaction part of the DSAN model uses two modules to distinguish features and to integrate features. Second, the decoupled multi-head self-attention mechanism can analyze possible interactions among features in different latent semantic subspaces, which enhances the expressive capability of the model. Overall, the table shows that the DSAN proposed by the present invention achieves the best performance on all datasets.
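For reference, the two metrics compared in the observations above can be computed as in the following minimal sketch. It uses the standard definitions of Logloss (mean binary cross-entropy) and AUC (the probability that a random positive sample is scored above a random negative one); the function names and the toy data are illustrative and are not taken from the experiments.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted click probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def auc(labels, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not experimental data).
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.7]
toy_auc = auc(labels, probs)           # 3 of 4 pairs are ranked correctly
toy_logloss = log_loss(labels, probs)
```

The pairwise AUC above is quadratic in the number of samples and is meant only to make the definition concrete; on datasets of the scale used here, a rank-based implementation would be preferable.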

5.3 Parameter analysis (RQ2)

To further verify and gain insight into the proposed model, this section investigates the effect of hyperparameters on the model. Verification is performed on both datasets, following the principle of changing one hyperparameter at a time while keeping all other settings unchanged.

5.3.1 Dimension of the embedding layer

The embedding layer is the part of a deep learning model that converts discrete input features into dense vector representations so that the model can better understand and process these inputs; the embedding size therefore has a large impact on the model. To study this effect, the embedding size was varied over {8, 16, 24, 32, 40} on both the Criteo and Avazu datasets. Fig. 4 shows the experimental results for different embedding sizes on the two datasets. It can be seen that smaller embedding vectors may not capture all the features in the input data, which reduces the generalization ability of the model. Larger embedding vectors require more computing resources and memory to store and process, which may slow down the model. Setting the embedding size to 16 and 32 yields the best performance on the Criteo and Avazu datasets, respectively.

5.3.2 Number of attention heads

In this section, the number of attention heads is the hyperparameter under study. The number of attention heads is varied from 1 to 5 while the other parameters are kept at their best settings. It can be seen that increasing the number of heads allows the model to better capture the different information in the input and to increase its representation capability. This is because each head can focus on a different part of the input, enabling the model to better understand the different input features. As shown in Fig. 5, as the number of attention heads increases, performance may improve, but the number of model parameters also increases, resulting in a higher storage cost. On both datasets, the best performance is achieved when the number of attention heads is 2.
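The one-at-a-time protocol used in this parameter analysis can be sketched as follows. The helper `one_at_a_time` and the dictionary keys are hypothetical names introduced for illustration; the candidate values are those stated in the text, and the base settings are the defaults reported earlier.

```python
def one_at_a_time(base, grid):
    """Yield (name, value, settings), varying one hyperparameter while the rest stay at base."""
    for name, values in grid.items():
        for value in values:
            settings = dict(base)   # copy so the base settings are never mutated
            settings[name] = value
            yield name, value, settings


base = {"embedding_dim": 16, "num_heads": 2}
grid = {
    "embedding_dim": [8, 16, 24, 32, 40],  # embedding sizes studied in 5.3.1
    "num_heads": [1, 2, 3, 4, 5],          # attention head counts studied in 5.3.2
}

runs = list(one_at_a_time(base, grid))
```

Each entry in `runs` would correspond to one training run; unlike a full grid search, the number of runs grows with the sum of the candidate counts (10 here) rather than their product (25).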

5.3.3 Influence of variants

In this section, the impact of three different variants of the sharing module is studied. To capture signals between the different networks of the parallel architecture, the variant that takes the Hadamard product of the two vectors is denoted DSAN-HP; this is the approach used in this work and is therefore also referred to simply as DSAN. The variant that fuses features with both the inner product and the Hadamard product is denoted DSAN-IH, and the variant that concatenates the two vectors and passes them through a feedforward neural network is denoted DSAN-CN.

As can be seen from Table 3, DSAN works best. This is because the Hadamard product operates element-wise and does not involve any weights or coefficients. DSAN-IH and DSAN-CN perform significantly worse on the Criteo dataset, since the Criteo dataset has more samples and these variants add more parameters.
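The difference in parameter cost between the variants can be illustrated with a small sketch. It covers only the Hadamard-product fusion (DSAN-HP) and the concatenation fusion (DSAN-CN); DSAN-IH is omitted because the text does not specify how the inner product and Hadamard product are combined. The function names, the single-layer feedforward network, and the ReLU activation are assumptions made for illustration.

```python
def hadamard_fusion(c, d):
    """DSAN-HP style fusion: element-wise product, no trainable parameters."""
    return [ci * di for ci, di in zip(c, d)]


def concat_fusion(c, d, weights, bias):
    """DSAN-CN style fusion (assumed form): concatenate, then one ReLU feedforward layer."""
    x = c + d  # list concatenation, length 2 * dim
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def num_parameters(weights, bias):
    """Count the trainable values of the assumed feedforward layer."""
    return sum(len(row) for row in weights) + len(bias)


c = [1.0, 2.0, 3.0]
d = [4.0, 5.0, 6.0]
fused_hp = hadamard_fusion(c, d)

# Illustrative weights: each output unit sums the matching entries of c and d.
weights = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0]
fused_cn = concat_fusion(c, d, weights, bias)
extra_params = num_parameters(weights, bias)
```

For vectors of dimension n, the assumed concatenation layer adds 2n² + n parameters per sharing module, while the Hadamard fusion adds none.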

Table 3. Performance comparison of different variants

5.4 Ablation experiment (RQ3)

To investigate the effect of decoupled multi-head self-attention and hierarchical attention on DSAN, the decoupled multi-head self-attention part is denoted DSAN-DA and the hierarchical attention part is denoted DSAN-HA. As can be seen from Fig. 6, the decoupled multi-head self-attention layer has a great influence on the model and plays an important role in improving the efficiency of DSAN.

In summary, the invention provides DSAN, an efficient CTR prediction model. DSAN learns the ambiguity of feature interactions using the decoupled self-attention layer and feeds the learned result into the shared interaction layer. The proposed shared interaction layer uses a hierarchical attention mechanism to learn the importance of different features in the explicit interaction part and derives interpretations from the learned importance distribution. Compared to other models, DSAN achieves better offline AUC and Logloss.

The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims

1. The click rate prediction method of the decoupling attention network based on information sharing is characterized by comprising the following steps of:

2. The method for predicting click through rate of a decoupled attention network based on information sharing according to claim 1, wherein the step 1 comprises:

$$e_i = w_i x_i$$

$$e_j = w_j x_j$$

$$e = [e_1; e_2; \ldots; e_i; \ldots; e_j; \ldots; e_m]$$
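The embedding equations above can be illustrated with a small sketch. It assumes that each $x_i$ is a one-hot vector for field $i$ and that $w_i$ is a $d \times n_i$ embedding matrix, so the product $w_i x_i$ simply selects one column of $w_i$; these shapes, the function names, and the toy values are assumptions and are not stated in the claim.

```python
def one_hot(index, size):
    """One-hot vector x_i of the given size with a 1 at position index."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed_field(w, x):
    """Compute e_i = w_i x_i for a d x n matrix w (list of rows) and a length-n vector x."""
    return [sum(w_rk * x_k for w_rk, x_k in zip(row, x)) for row in w]


def embed_sample(matrices, indices):
    """Concatenate the field embeddings: e = [e_1; e_2; ...; e_m]."""
    e = []
    for w, index in zip(matrices, indices):
        e.extend(embed_field(w, one_hot(index, len(w[0]))))
    return e


# Two toy fields with embedding dimension d = 2 (made-up values).
w1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6]]   # field 1 has 3 categories
w2 = [[1.0, 2.0],
      [3.0, 4.0]]        # field 2 has 2 categories

e = embed_sample([w1, w2], [1, 0])  # category 1 of field 1, category 0 of field 2
```

In practice the matrix product is replaced by a direct column lookup, which gives the same result without materializing the one-hot vector.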

3. The click-through rate prediction method of an information sharing-based decoupled attention network of claim 1, wherein the step 2 comprises:

$$H_i = \left[\sigma\left((q_i - \mu_q)^T (k_j - \mu_k)\right) + \sigma\left(\mu_p k_j^T\right)\right] v_k$$
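A minimal sketch of this decoupled attention computation for a single head is given below. Several points are assumptions, since the claim does not define them here: $\sigma$ is taken to be a softmax over the keys $j$, $\mu_q$ and $\mu_k$ are taken to be the mean query and mean key vectors, the vector written $\mu_p$ in the unary term is taken to be the mean query, and the value vector is indexed with the key it belongs to. All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decoupled_attention(queries, keys, values):
    """Sum of a centered pairwise attention term and a query-independent unary term."""
    mu_q = mean_vector(queries)
    mu_k = mean_vector(keys)
    centered_keys = [[k - m for k, m in zip(key, mu_k)] for key in keys]
    # Unary term (assumed to use the mean query): identical for every query i.
    unary = softmax([dot(mu_q, key) for key in keys])
    outputs = []
    for q in queries:
        centered_q = [a - m for a, m in zip(q, mu_q)]
        # Pairwise term: softmax over j of (q_i - mu_q)^T (k_j - mu_k).
        pairwise = softmax([dot(centered_q, ck) for ck in centered_keys])
        weights = [p + u for p, u in zip(pairwise, unary)]
        dim = len(values[0])
        outputs.append([sum(w * v[t] for w, v in zip(weights, values)) for t in range(dim)])
    return outputs


# Toy inputs: 3 features, query/key dimension 2, value dimension 1 (made-up values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3]]
values = [[1.0], [1.0], [1.0]]
H = decoupled_attention(queries, keys, values)
```

Because the unary term does not depend on the query, it is computed once and reused for every $i$; only the centered pairwise term is specific to each query.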

4. The click-through rate prediction method of an information sharing-based decoupled attention network of claim 3, wherein said step 3 comprises:

step 31, the obtained augmented matrix $L_0$ is used as the input of the shared interaction layer, which learns feature interactions while sharing information between the parallel networks;

step 32, using the decomposition module to obtain two parts, $C_l$ and $D_l$:

step 34, at each layer of higher-order feature interaction, a sharing module is used to solve the above problem; assuming that the outputs $C_l$ and $D_l$ of the two networks at a certain layer are obtained, the sharing module may be expressed as:

w and b are parameters of weight and bias, respectively.

5. The click-through rate prediction method of the decoupling attention network based on information sharing of claim 1, wherein in the output layer, a Logloss loss function is adopted for click-through rate prediction.

6. The method for predicting click through rate of a decoupled attention network based on information sharing according to claim 1, further comprising, after said step 4: