CN113887694A - Click rate estimation model based on characteristic representation under attention mechanism - Google Patents
Click rate estimation model based on characteristic representation under attention mechanism
- Publication number
- CN113887694A CN113887694A CN202010629307.3A CN202010629307A CN113887694A CN 113887694 A CN113887694 A CN 113887694A CN 202010629307 A CN202010629307 A CN 202010629307A CN 113887694 A CN113887694 A CN 113887694A
- Authority
- CN
- China
- Prior art keywords
- characteristic
- feature
- attention
- features
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
To complete click-through rate estimation from the object features of an object to be estimated, the method can serve as the data fine-ranking stage of enterprise-level recommendation systems, search systems, online advertising systems and the like. The invention provides a click-through rate estimation model based on feature characterization under an attention mechanism, comprising: a feature embedding layer that vectorizes the continuous and discrete features to form stacked features; an explicit feature cross network that performs explicit feature combination on the stacked features through an attention cross network; an implicit feature cross network that performs implicit feature combination on the stacked features through a multilayer perceptron; and an estimated-probability output layer that estimates the click-through rate from the received combined features. The attention cross network removes the dependence of the estimation model on manual feature engineering, while the introduction of the attention mechanism distinguishes the importance of each combined feature to the model's estimate and eliminates the influence of useless and redundant features on the model.
Description
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to an end-to-end click-through rate prediction technology.
Background
Click-through rate (CTR) estimation, as a key technology that directly affects user experience on a platform and advertising revenue, is one of the core research topics in industry. At present, research at home and abroad focuses mainly on feature representation, and existing methods fall into two main categories: machine-learning CTR models and deep-learning CTR models.
In the early days, limited by computing power, online learning and model deployment, industry mainly built lightweight machine-learning models, the most classic being the Logistic Regression (LR) model. LR quickly became the mainstream CTR estimation model in industry thanks to its sound mathematical meaning, strong interpretability and ease of engineering deployment. In 2010, Brendan McMahan et al. proposed the online learning algorithm FTRL (Follow The Regularized Leader) for LR, which further promoted its industrial application; however, LR is linear in nature, has limited learning capability, and its prediction performance usually depends on the feature-engineering skill of data scientists. Industry therefore began to explore building second-order combined features with degree-2 polynomial regression models, performing explicit feature combination through pairwise feature crossing. This brute-force crossing alleviates the feature-combination problem to some extent, but it can only learn co-occurring features that appear in the training data and generalizes poorly to non-co-occurring feature combinations in large-scale sparse scenarios such as recommendation and advertising. To overcome this shortcoming, Steffen Rendle of the University of Konstanz in Germany proposed FM (Factorization Machines) in 2010, which learns a latent weight vector (latent vector) for each feature and uses the inner product of latent vectors as the feature-crossing weight, solving the feature-combination problem well in sparse-feature scenarios; in addition, FM further reduces training complexity by transforming the form of the objective function, so FM gradually became an important choice for industrial CTR models around 2012 to 2014. In 2015, FFM (Field-aware Factorization Machines), proposed on the basis of FM, shone in several CTR prediction competitions and was subsequently applied in recommendation and advertising scenarios by companies such as Criteo and Meituan. Compared with FM, FFM mainly introduces the concept of a "field": when crossing features, each feature selects the latent vector corresponding to the field of the combined feature for the inner-product operation to obtain the cross-feature weight, giving the model stronger expressive power; however, FFM is limited by its high space complexity and the restriction to second-order feature crossing, so it has not been widely used in industry. In addition, in 2014 Xinran He et al. proposed a solution based on the combined GBDT (Gradient Boosting Decision Tree) + LR model for high-dimensional feature combination and screening: GBDT automatically performs feature screening and combination, the leaf nodes are one-hot encoded, and the encoded features are fed into an LR model to complete CTR prediction. This opened the precedent of using models for high-order feature construction and screening, solved the previously laborious problem of feature combination and screening more efficiently, and greatly advanced the important trend of modeling feature engineering.
During this period, researchers found that high-order combined features more easily uncover personalized demands and achieve the "a thousand faces for a thousand users" recommendation effect. However, as sparse features increased sharply, the business logic behind high-order combined features became hard to understand, and traditional manual feature engineering could hardly keep up with mining high-order feature combinations; people therefore began to rely on the strength of models in extracting features to complete personalized recommendation for users in big-data scenarios. With the great success of deep learning in computer vision and natural language processing, attempts were made to use neural networks to perform feature characterization automatically, replacing manual feature engineering to complete click-through rate estimation.
In 2016, deep learning began to be applied to click-through rate prediction on a large scale. Microsoft's Ying Shan et al. proposed the Deep Crossing serial network structure, which covers the classic elements of a CTR-prediction neural network: an embedding layer converts sparse features into low-dimensional dense features, a stacking layer concatenates the segmented feature vectors, multiple neural network layers complete the combination and transformation of features, and a Sigmoid activation function finally produces the CTR prediction; a residual network structure was also formally introduced into the click-through rate model to enhance its high-order feature extraction capability. In the same year, Weinan Zhang et al. of Shanghai Jiao Tong University proposed FNN, which, building on the earlier Deep Crossing structure, uses the latent vectors of FM as the embeddings of users and items, avoiding training an embedding matrix entirely from a random state and greatly reducing the training time and the instability of the embedding layer. Using pre-training to complete the training of the embedding layer is undoubtedly valuable engineering experience for reducing the complexity and training instability of deep-learning models. However, conventional DNNs directly use multiple fully connected layers to complete feature cross combination and lack "pertinence" of feature combination for the click-through rate scenario, so Yanru Qu et al. proposed PNN (Product-based Neural Network), adding a product layer between the embedding layer and the fully connected layers to perform feature combination between different feature fields and enhance the model's ability to represent different data patterns.
In 2016, Google's Heng-Tze Cheng et al. proposed the Wide & Deep parallel network structure, in which a Wide part consisting of a single input layer and a Deep part passing through a multilayer perceptron are concatenated and passed to the output layer. The Wide part provides memorization and the Deep part provides generalization: the DNN mines implicit high-order feature combinations, and LR connects the Wide and Deep parts into a unified CTR model. Wide & Deep established the parallel deep-learning framework for click-through rate estimation, but it did not escape the reliance on manual feature engineering. Addressing the insufficient capacity of the Wide part, Huifeng Guo et al. proposed DeepFM in 2017, which keeps the Wide-and-Deep parallel structure but replaces the original Wide part with FM, strengthening the feature-combination capability of the shallow network while removing the dependence of deep CTR models on manual feature engineering. In the same year, Ruoxi Wang et al. proposed the Deep & Cross Network (DCN), replacing the original Wide part with a Cross Network to realize bit-level explicit feature interaction and further refine the crossing granularity of the Wide part. Xiangnan He et al. proposed NFM (Neural Factorization Machines) to improve the Deep part, introducing a Bi-Interaction Pooling layer in place of FM for feature crossing and further strengthening the deep feature-combination capability. In 2018, Alibaba's Guorui Zhou et al. proposed DIN (Deep Interest Network), a deep-learning network based on the attention mechanism that extracts real-time interest features from the user behavior sequence, further improving feature characterization on the user side. In the same year, Jianxun Lian et al. proposed the xDeepFM parallel network structure, modeling vector-level explicit feature interaction and adopting a CIN (Compressed Interaction Network) in the Wide part to enhance the explicit feature-combination capability of the model, achieving a certain effect. Subsequently, in 2019 Guorui Zhou et al. proposed DIEN (Deep Interest Evolution Network), introducing the sequence model AUGRU on the basis of DIN, linking user interests at different times into an interest-evolution chain, and finally feeding the "interest vector" of the current moment together with other features into the upper multilayer perceptron to complete click-through rate estimation, obtaining better results.
In summary, current deep-learning models still cannot fully replace manual feature engineering with neural networks for feature extraction, feature combination and feature screening. Moreover, even when a neural network is used in place of manual feature engineering, the variation trend of features cannot be accurately estimated, and accurate click-through rate estimation cannot be obtained automatically.
Disclosure of Invention
In order to solve the problems, a click rate estimation model based on feature characterization under an attention mechanism is provided. The invention adopts the following technical scheme:
the click rate estimation model based on the characteristic representation under the attention mechanism comprises the following steps: the system comprises a characteristic embedding layer, an explicit characteristic cross network, an implicit characteristic cross network and a pre-estimation probability output layer. The characteristic embedding layer is used for carrying out vectorization processing on the continuous characteristic and the discrete characteristic and then carrying out stacking embedding processing to form a stacking characteristic; an explicit feature crossover network that forms an explicit output vector by inputting the stacking features into the attention crossover network for explicit feature combining; the implicit characteristic cross network is used for inputting the stacking characteristics into a multilayer perceptron to carry out implicit characteristic combination to form an implicit output vector; a probability output layer is pre-estimated, the explicit output vector and the implicit output vector are combined to form a high-order nonlinear combined characteristic, and the combined characteristic is transmitted to a Sigmoid activation function to predict the click rate, so that the click rate is obtained; wherein the attention crossing network comprises: the cross layer processes the stacking features through a cross algorithm and generates a multi-dimensional vector; and an attention layer for processing the multidimensional vector through a fully connected neural network to generate an attention score, performing normalization processing on the attention score to generate a characteristic coefficient, and further generating an explicit output vector through an output calculation formula based on the characteristic coefficient.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention has the technical feature that the vectorization processing is as follows: one-hot encoding is performed on the discrete features, and the encoded discrete features are taken as embedding vectors; data standardization according to the data distribution characteristics is performed on the continuous features to form dense features; the embedding vectors and the dense features are stacked and embedded to serve as the stacked features, wherein the matrix calculation formula of the one-hot encoding conversion is:

x_embed,i = W_embed,i · x_i  #(1)

where x_embed,i is the embedding vector, x_i is the binary (one-hot) input of the i-th category field, W_embed,i ∈ R^(n_e × n_v) is an embedding matrix optimized together with the other parameters of the network, and n_e and n_v are the embedding-vector dimension and the input dimension, respectively.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the calculation formula of the cross algorithm is:

x_{l+1} = x_0 · x_l^T · w_l + b_l + x_l = f(x_l, w_l, b_l) + x_l  #(2)

where x_l and x_{l+1} are column vectors representing the outputs of the l-th and (l+1)-th cross layers, respectively; w_l and b_l are the weight and bias of the l-th layer, and the function f denotes the feature-vector crossing mapping of each layer.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the calculation logic of the normalization processing in the attention layer is:

a'_i = h^T ReLU(W x_i + b)  #(3)

where W, b and h are model parameters, and the attention scores a'_i are normalized by Softmax, i.e. a_i = exp(a'_i) / Σ_j exp(a'_j).
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the output calculation formula of the attention cross network is:

x_ACN = Σ_i a_i x_i

where a_i is the attention weight and x_i is the output of the i-th cross layer.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the calculation logic of each layer of the multilayer perceptron is:

H_{l+1} = f(W_l H_l + b_l)  #(7)

where H_{l+1} denotes the (l+1)-th hidden layer, W_l and b_l are the weight and bias of the l-th layer, and f(·) is the ReLU function.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the formula of the Sigmoid activation function is:

ŷ = σ(w_out^T [x_ACN; H_MLP] + b_out)

where x_ACN and H_MLP are the outputs of the explicit feature cross network and of the multilayer perceptron, respectively, and the final click-through rate prediction ŷ is obtained through the Sigmoid function σ.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the model performs error back-propagation on the estimated click-through rate through a Logloss loss function until the click-through rate output by the output layer converges, thereby completing the parameter update of the click-through rate estimation model based on feature characterization under the attention mechanism.
The click-through rate estimation model based on feature characterization under the attention mechanism provided by the invention also has the technical feature that the formula of the Logloss loss function is:

L = -(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ] + λ Σ_l ‖W_l‖²

where p_i is the output of the click-through rate estimation model, y_i is the label of the corresponding sample, N is the number of training samples, and λ is the L2 regularization coefficient; error back-propagation is performed through the Logloss loss function and the parameters are updated accordingly until convergence, completing the training of the final click-through rate model.
Action and Effect of the invention
According to the click-through rate estimation model based on feature characterization under the attention mechanism, the feature embedding layer vectorizes the continuous and discrete features and thereby solves the problem of excessively large vector dimensions after one-hot encoding. Meanwhile, the model has an explicit feature cross network, which realizes dynamic weighting of the combination terms through the attention cross network, uses combined features more efficiently, and eliminates the influence of redundant features on the click-through rate estimation model. It further comprises an implicit feature cross network, which captures highly nonlinear interaction features with a multilayer perceptron, so that the feature-expression capability of the model is no longer limited by the parameter scale. Finally, the invention provides an estimated-probability output layer, which outputs the click-through rate estimate from the outputs of the explicit and implicit feature cross networks through a Sigmoid activation function, so that the obtained estimate is more accurate. Further, the estimated data can serve as the data fine-ranking stage of enterprise-level recommendation systems, search systems, online advertising systems and the like.
Drawings
FIG. 1 is a block diagram of a click through rate prediction model based on feature characterization under an attention mechanism in an embodiment of the present invention;
FIG. 2 is a flow chart of the operation of a feature embedding layer in an embodiment of the present invention;
FIG. 3 is a network architecture diagram of an attention crossing network in an embodiment of the present invention;
FIG. 4 is a network architecture diagram of a multi-tier perceptron in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a training process of a feature characterization-based click rate estimation model under an attention mechanism according to an embodiment of the present invention; and
FIG. 6 is a flowchart of the deployment of the click-through rate prediction model in the embodiment of the present invention.
Detailed Description
In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the present invention easy to understand, the click rate estimation model based on the characteristic characterization under the attention mechanism of the present invention is specifically described below with reference to the embodiments and the drawings.
< example >
FIG. 1 is a block diagram of a click through rate prediction model based on feature characterization under the attention mechanism in the embodiment of the present invention.
As shown in FIG. 1, the feature characterization-based click through rate prediction model 100 under the attention mechanism includes: a feature embedding layer 101, an explicit feature crossing network 102, an implicit feature crossing network 103, and a predictive probability output layer 104.
In this embodiment, the user-side features, advertisement-side features and context features are collected and divided into continuous features and discrete features; these serve as the input features and are organized into a training data set, through which the model training of the click-through rate estimation model 100 based on feature characterization under the attention mechanism is completed. The training data set consists of n samples (x, y), where the input features x are the continuous and discrete features. The goal is to construct a click-through rate estimation model y = model(x), x ∈ R^n, which predicts the probability y ∈ [0, 1] that a user u clicks a candidate item v in a specified context.
The feature embedding layer 101 performs vectorization processing on the received input features, and stacks the obtained embedded vectors and dense features to form stacked features and output the stacked features.
FIG. 2 is a flow chart of the operation of the feature embedding layer in an embodiment of the present invention.
As shown in fig. 2, the steps of forming the stacked features by the feature embedding layer 101 are as follows:
step S1, carrying out one-hot code conversion on the discrete type features, taking the encoded discrete type features as embedded vectors, and then entering step S2;
step S2, carrying out data standardization based on data distribution characteristics on the continuous features to form dense features, and then entering step S3;
in step S3, the embedding vector formed in step S1 and the dense feature formed in step S2 are subjected to a stack embedding process, i.e., the embedding vector and the dense feature are stacked into one vector, and the vector is referred to as a stacked feature, and then an end state is entered.
In this embodiment, the matrix calculation formula of the embedding conversion in the stack embedding process is:

x_embed,i = W_embed,i · x_i  #(1)

where x_embed,i is the embedding vector, x_i is the binary (one-hot) input of the i-th category field, W_embed,i ∈ R^(n_e × n_v) is an embedding matrix optimized together with the other parameters of the network, and n_e and n_v are the embedding-vector dimension and the input dimension, respectively.
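For illustration only, a minimal NumPy sketch of the embedding step described by formula (1) follows; the vocabulary size, embedding dimension and random initialization are assumptions made for the example and are not part of the claimed model:

```python
import numpy as np

def embed_discrete(index, vocab_size, W_embed):
    """One-hot encode a discrete feature and map it to a dense embedding, as in formula (1)."""
    x = np.zeros(vocab_size)
    x[index] = 1.0                       # one-hot vector x_i
    return W_embed @ x                   # x_embed,i = W_embed,i · x_i

def standardize_continuous(values, mean, std):
    """Z-score standardization of continuous features to form dense features."""
    return (values - mean) / (std + 1e-8)

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 8
W_embed = rng.normal(scale=0.01, size=(embed_dim, vocab_size))   # embedding matrix, learned jointly

e = embed_discrete(index=42, vocab_size=vocab_size, W_embed=W_embed)
d = standardize_continuous(np.array([3.5, 120.0]), mean=np.array([2.0, 100.0]), std=np.array([1.0, 25.0]))
x0 = np.concatenate([e, d])              # stacked feature fed to both cross networks
```

In practice the one-hot multiplication reduces to a column lookup in the embedding matrix, which is how embedding layers are usually implemented.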
The explicit feature crossing network 102 receives the stacked features formed by the feature embedding layer 101, and performs a crossing algorithm on the stacked features through the attention crossing network to generate a multi-dimensional vector;
fig. 3 is a network configuration diagram of an attention crossover network in an embodiment of the present invention.
As shown in diagram a of fig. 3, the attention-crossing network includes: a cross-layer 21 and an attention layer 22.
As shown in the B diagram of FIG. 3, the number of neurons in each cross layer 21 is the same and equal to the dimension of the input vector x_0.
In this embodiment, the calculation formula of the cross algorithm is:

x_{l+1} = x_0 · x_l^T · w_l + b_l + x_l = f(x_l, w_l, b_l) + x_l  #(2)

where x_l and x_{l+1} are column vectors representing the outputs of the l-th and (l+1)-th cross layers, respectively; w_l and b_l are the weight and bias of the l-th layer, and the function f denotes the feature-vector crossing mapping of each layer.
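A minimal NumPy sketch of one such cross layer is given below, assuming the bit-level crossing form of the Deep & Cross Network cited in the background; dimensions and initialization are illustrative only:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """x_{l+1} = x_0 · (x_l^T w_l) + b_l + x_l  -- one explicit bit-level crossing step."""
    return x0 * (xl @ w) + b + xl        # xl @ w is a scalar, broadcast over x0

rng = np.random.default_rng(1)
d = 10
x0 = rng.normal(size=d)                  # stacked feature from the embedding layer
x = x0
for _ in range(3):                       # three stacked cross layers, raising the crossing order
    w, b = rng.normal(size=d), np.zeros(d)
    x = cross_layer(x0, x, w, b)
```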
The attention layer 22 processes the multidimensional vector through the fully-connected neural network and generates an attention score, and normalizes the attention score to generate a feature coefficient, and generates an explicit output vector by an output calculation formula based on the feature coefficient.
In this embodiment, an Attention network is used as the fully-connected neural network, ReLU is used as the activation function, and the size of the network is expressed by the attention factor. The calculation logic of the normalization processing in the attention layer 22 is:

a'_i = h^T ReLU(W x_i + b)  #(3)

where W, b and h are model parameters, and the attention scores a'_i are normalized by Softmax, i.e. a_i = exp(a'_i) / Σ_j exp(a'_j). Taking this result as the feature coefficient, the output calculation formula of the attention cross network is:

x_ACN = Σ_i a_i x_i

where a_i is the attention weight and x_i is the output of the i-th cross layer.
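The attention weighting of the cross-layer outputs can be sketched as follows; the attention factor of 16 and the stacking of three cross outputs are assumptions made for the example:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_cross_output(cross_outputs, W, b, h):
    """cross_outputs: (num_vectors, d) stack of cross-layer outputs x_i; returns the explicit output vector."""
    scores = np.array([h @ relu(W @ x + b) for x in cross_outputs])   # a'_i = h^T ReLU(W x_i + b)
    a = softmax(scores)                                               # attention weights a_i
    return (a[:, None] * cross_outputs).sum(axis=0)                   # x_ACN = sum_i a_i x_i

rng = np.random.default_rng(2)
d, attn_factor, num_vec = 10, 16, 3
X = rng.normal(size=(num_vec, d))                                     # outputs of the stacked cross layers
W, b, h = rng.normal(size=(attn_factor, d)), np.zeros(attn_factor), rng.normal(size=attn_factor)
x_acn = attention_cross_output(X, W, b, h)
```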
The implicit feature crossing network 103 forms an implicit output vector by inputting the stacked features into a multi-layer perceptron for implicit feature combination.
Fig. 4 is a network structure diagram of a multi-layer perceptron in an embodiment of the invention.
As shown in FIG. 4, in the present embodiment the multilayer perceptron is a fully connected feed-forward neural network, and the calculation logic of each layer is:

H_{l+1} = f(W_l H_l + b_l)  #(7)

where H_{l+1} denotes the (l+1)-th hidden layer, W_l and b_l are the weight and bias of the l-th layer, and f(·) is the ReLU function.
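A matching sketch of the implicit cross network follows; the 400-400 layer sizes mirror the experiment settings reported later, and the input width is an assumption:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x0, weights, biases):
    """Fully connected feed-forward pass: H_{l+1} = f(W_l H_l + b_l) with f = ReLU."""
    h = x0
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

rng = np.random.default_rng(3)
dims = [10, 400, 400]                    # input width and two hidden layers
weights = [rng.normal(scale=0.05, size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
h_dnn = mlp_forward(rng.normal(size=dims[0]), weights, biases)        # implicit output vector
```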
The pre-estimated probability output layer 104 combines the explicit output vector and the implicit output vector to form a high-order nonlinear combined feature, and simultaneously transmits the combined feature to a Sigmoid activation function to predict the click rate, so as to obtain the pre-estimated click rate.
In this embodiment, the formula of the Sigmoid activation function is:

ŷ = σ(w_out^T [x_ACN; H_MLP] + b_out)

where x_ACN and H_MLP are the outputs of the explicit feature cross network and of the multilayer perceptron, respectively, and the final click-through rate prediction ŷ is obtained through the Sigmoid function σ;
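The output layer can then be sketched as a logistic unit over the concatenated explicit and implicit vectors; w_out and b_out are assumed trainable parameters introduced for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ctr(x_acn, h_dnn, w_out, b_out):
    combined = np.concatenate([x_acn, h_dnn])     # high-order nonlinear combined feature
    return sigmoid(w_out @ combined + b_out)      # estimated click-through rate in (0, 1)

rng = np.random.default_rng(4)
x_acn, h_dnn = rng.normal(size=10), rng.normal(size=400)
w_out, b_out = rng.normal(scale=0.01, size=410), 0.0
p = predict_ctr(x_acn, h_dnn, w_out, b_out)
```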
after the click rate predicted value is obtained, the click rate estimation model 100 based on the characteristic representation under the attention mechanism performs error back transmission on the estimated click rate through a Logloss loss function until the estimated click rate output by the output layer is converged, and completes parameter updating of the click rate estimation model based on the characteristic representation under the attention mechanism, thereby completing model training of the click rate estimation model 100 based on the training data set in the embodiment under the attention mechanism.
In this embodiment, the formula of the Logloss loss function is:

L = -(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ] + λ Σ_l ‖W_l‖²

where p_i is the output of the click-through rate estimation model, y_i is the label of the corresponding sample, N is the number of training samples, and λ is the L2 regularization coefficient.
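A small sketch of this objective, assuming the L2 term sums the squared layer weights, is:

```python
import numpy as np

def logloss_l2(p, y, layer_weights, lam=1e-5, eps=1e-12):
    """Binary cross-entropy over N samples plus an L2 penalty on the layer weights."""
    p = np.clip(p, eps, 1.0 - eps)
    ce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    l2 = lam * sum(np.sum(W ** 2) for W in layer_weights)
    return ce + l2

p = np.array([0.9, 0.2, 0.65])                    # model outputs p_i
y = np.array([1, 0, 1])                           # labels y_i
loss = logloss_l2(p, y, layer_weights=[np.array([0.1, -0.2])])
```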
FIG. 5 is a flowchart illustrating a training process of a feature-characterization-based click rate estimation model under an attention mechanism according to an embodiment of the present invention.
As shown in fig. 5, the training process of the click rate estimation model based on the feature characterization under the attention mechanism in this embodiment is as follows:
step U1, constructing a data set, dividing data characteristics in the data set into continuous data characteristics and discrete data characteristics, and then entering step U2;
step U2, constructing a feature embedding layer, carrying out vectorization processing on the continuous features and the discrete features, then carrying out stacking embedding processing to form stacking features, and then entering step U3;
step U3, constructing an explicit feature crossing network, performing explicit feature combination by inputting stacking features into the attention crossing network to form an explicit output vector, and then entering step U4;
step U4, constructing an implicit characteristic cross network, inputting the stacking characteristics into a multilayer perceptron to perform implicit characteristic combination to form an implicit output vector, and then entering step U5;
step U5, constructing an estimated probability output layer, combining the explicit output vector and the implicit output vector to form a high-order nonlinear combined feature, simultaneously transmitting the combined feature to a Sigmoid activation function to predict the click rate to obtain an estimated click rate, and then entering step U6;
and step U6, performing error back-propagation on the estimated click-through rate through the Logloss loss function until the estimated click-through rate output by the output layer converges, completing the parameter update of the click-through rate estimation model based on feature characterization under the attention mechanism, thereby completing the model training (a code sketch of this training flow is given below), and then entering the end state.
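For readers who prefer code, the whole training flow of steps U1-U6 can be sketched in Keras roughly as follows; the field counts, vocabulary sizes, number of cross layers and attention size are assumptions for illustration and not the exact configuration of the embodiment:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

num_dense, vocab_sizes, embed_dim, n_cross = 13, [1000, 500, 200], 8, 3

# step U1/U2: continuous inputs plus embedded discrete inputs, stacked into x0
dense_in = layers.Input(shape=(num_dense,), name="dense")
cat_ins = [layers.Input(shape=(1,), dtype="int32", name=f"cat_{i}") for i in range(len(vocab_sizes))]
embeds = [layers.Flatten()(layers.Embedding(v, embed_dim)(c)) for v, c in zip(vocab_sizes, cat_ins)]
x0 = layers.Concatenate()([dense_in] + embeds)

# step U3: attention cross network (explicit feature combination)
score_hidden, score_out = layers.Dense(16, activation="relu"), layers.Dense(1)  # shared W, b, h
cross_outputs, xl = [], x0
for _ in range(n_cross):
    s = layers.Dense(1)(xl)                               # x_l^T w_l (bias folded into the scalar term)
    xl = layers.Add()([layers.Lambda(lambda t: t[0] * t[1])([x0, s]), xl])
    cross_outputs.append(xl)
scores = layers.Concatenate()([score_out(score_hidden(t)) for t in cross_outputs])
a = layers.Softmax()(scores)                              # attention weights a_i
stack = layers.Lambda(lambda ts: tf.stack(ts, axis=1))(cross_outputs)
x_acn = layers.Lambda(lambda z: tf.reduce_sum(z[0] * tf.expand_dims(z[1], -1), axis=1))([stack, a])

# step U4: multilayer perceptron (implicit feature combination)
h = x0
for units in (400, 400):
    h = layers.Dense(units, activation="relu")(h)

# step U5: combine explicit and implicit vectors and predict with Sigmoid
y = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([x_acn, h]))

# step U6: Logloss (binary cross-entropy) with Adam and mini-batch updates
model = Model([dense_in] + cat_ins, y)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```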
The programming environment for implementation of the system in this embodiment is Pycharm, and the version of Python is 3.6.
FIG. 6 is a flowchart of the deployment of the click-through rate prediction model in the embodiment of the present invention.
As shown in fig. 6, the specific working steps after completing the training of the click rate estimation model based on the feature characterization under the attention mechanism of this embodiment are as follows:
step T1, collecting raw data of the real business scenario, i.e., obtaining the raw data from front-end event tracking, back-end log extraction and online information collection, splicing and aggregating them, and then proceeding to step T2;
step T2, preprocessing the original data collected in the step T1 to form sorted data, and then entering the step T3;
the pretreatment in this embodiment includes: and carrying out abnormal value processing, missing value processing and noise data processing on the collected original data.
Step T3, constructing the sorted data into a training data set, a testing data set and a verification data set of the click rate estimation model, determining the proportion distribution of the training data set, the testing data set and the verification data set according to the data volume and the service, and then entering step T4;
step T4, inputting each data set into a click rate estimation model to obtain an estimated click rate, and then entering step T5;
step T5, mining feature combinations that match user interest preferences based on the estimated click-through rate, without performing any manual feature engineering, and then proceeding to step T6;
step T6, extracting the corresponding optimal super parameter combination according to the performance of the off-line data set, and then entering step T7;
step T7, evaluating the prediction performance of the click-through rate estimation model with preset model metrics (Logloss and AUC are commonly used as evaluation metrics in click-through rate estimation scenarios), and then proceeding to step T8;
and step T8, carrying out online small flow test on the click estimation model by an algorithm engineer, verifying the online effect of the model, deploying the model online after the test, and then entering an ending state.
In this embodiment, the hyperparameter selection methods in step T6 are grid search, random search and Bayesian search.
After the model deployment is completed, a series of experiments are performed with the click-through rate estimation model 100 (Deep & Attention Cross Network, DACN) based on feature characterization under the attention mechanism of the invention. The programming environment used to implement the Attention Cross Network (ACN) in the click-through rate estimation model 100 is PyCharm, and the Python version is 3.6. The experiments run on a Core i7 CPU, 32 GB of memory and a Linux operating system. The data sets for the experiments come from the real click data of Criteo Lab and the movie rating data of MovieLens. Model evaluation uses two metrics, AUC and Logloss, which assess the performance of the model from different perspectives.
The experiments compare the proposed feature cross network DACN (Deep & Attention Cross Network), which combines explicit and implicit feature crossing, with LR (Logistic Regression), DNN, FM (Factorization Machines), Wide & Deep, DCN (Deep & Cross Network) and DeepFM. As mentioned above, these models are currently the mainstream, industry-validated click-through rate estimation models. Since DACN aims at extracting feature combinations through the model itself, as a control variable no manual feature engineering is performed on the original features.
DACN is implemented here on TensorFlow. Data normalization of the dense features uses a logarithmic transformation; category-type features are embedded into dense vectors of length 6 × (feature cardinality)^(1/4). The Adam optimizer with mini-batch stochastic gradient descent is used, the batch size is set to 512, and Batch Normalization is applied in the DNN network. For the comparison models, the parameter settings of FNN and PNN follow those in the PNN paper. The DNN module uses a Dropout of 0.5, the network structure is set to 400-400, the optimization algorithm is Adam-based mini-batch gradient descent, the activation function is uniformly ReLU, the embedding dimension of FM is set to 10, and the remaining settings are kept consistent with DACN.
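As a small worked example of the embedding-size rule above (the rounding is an assumption; the rule only fixes the vector length as six times the fourth root of the feature cardinality):

```python
def embedding_dim(cardinality: int) -> int:
    """Dense embedding length = 6 * cardinality^(1/4), rounded to the nearest integer."""
    return int(round(6 * cardinality ** 0.25))

print(embedding_dim(10_000))   # a field with 10,000 distinct values -> 60
print(embedding_dim(50))       # a small field with 50 distinct values -> 16
```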
FIG. 7 is a graph showing the comparison result of a single model in the embodiment of the present invention.
The effect of explicitly combined features under the attention mechanism on overall model prediction performance is validated first. Among the comparison models, FM explicitly models second-order feature interactions, DNN models implicit high-order feature interactions, the Cross Network models explicit high-order feature interactions, and ACN models explicit feature interactions with built-in feature screening. The performance of each single model on the two public data sets is shown in FIG. 7.
Experiments show that the ACN provided by the invention is always superior to other comparative models. As a conclusion, on the one hand, for practical datasets, high order interactions on sparse features are necessary, as evidenced by the clear superiority of DNN, Cross Network and ACN over FM on both datasets; on the other hand, the ACN is an optimal individual model, and the effectiveness of the ACN in modeling the interaction of the explicit high-order features is verified.
FIG. 8 is a diagram illustrating the comparison result of the integrated model in the embodiment of the present invention
DACN integrates ACN and DNN into a peer-to-peer network architecture. The ACN is used for explicit combined feature extraction and screening, the DNN is used for implicit combined feature extraction, and feature characterization is performed to the greatest extent through parallel connection of the ACN and the DNN. The performances of the DACN and the current mainstream click rate prediction model on the two public data sets are compared, and the experimental result is as shown in fig. 8.
It is readily apparent from FIG. 8 that LR is worse than all other models, indicating that factorization-based models are critical for modeling sparse categorical interaction features; Wide & Deep, DCN and DeepFM are clearly superior to DNN, which shows that the implicit feature extraction capability of DNN alone is relatively limited and its insufficient feature-combination capability is usually compensated by manual feature engineering. Secondly, the DACN metrics are significantly improved compared to DCN. The advantage of DACN has been demonstrated from a theoretical perspective: the added Attention network structure screens the combined features of each specified order, raises the weight of important combined features, and eliminates the influence of redundant features. The experimental results prove that this structure can effectively realize feature screening and greatly improve the performance of the overall model.
The DACN provided by the invention achieves the best performance on both public data sets, which means that explicit and implicit high-order features are combined, and the original feature characterization is more sufficient. Meanwhile, the experiment result also verifies that the ACN is used for carrying out the specified order explicit characteristic combination to greatly improve the final model performance, and the reasonability of the DACN provided by the invention is laterally verified.
FIG. 9 is a diagram illustrating the comparison result of the number of network parameters in the embodiment of the present invention
Considering the additional parameters introduced by the Attention network, ACN, CrossNet and DNN are compared on the Criteo data set in terms of the minimum number of parameters each model needs to reach the optimal Logloss threshold. Because the number of parameters in the embedding matrix is the same for every model, the embedding-layer parameters are omitted from the parameter counts. The experimental results are shown in FIG. 9.
The experimental results show that the storage efficiency of the ACN and the Cross Network provided by the invention is nearly one order of magnitude higher than that of DNN, and the main reason is that the common feature Cross structure realizes the completion of feature interaction of a specified order by linear space complexity. In addition, the parameters of the ACN and the Cross Network belong to the same order of magnitude, an Attention Network introduced by the ACN only comprises a hidden layer, the number of the required parameters can be approximately ignored, and the model click rate prediction accuracy is greatly improved. The ACN structure provided by the invention is proved to have great advantages in space complexity from the side.
Action and Effect of the Embodiment
According to the click rate estimation model based on the feature representation provided by the embodiment, due to the feature embedding layer, the feature embedding layer carries out vectorization processing on the continuous feature and the discrete feature, and the problem that the vector dimension is too large after the one-hot coding processing is solved. Meanwhile, the system also has an explicit characteristic cross network, the explicit characteristic cross network realizes the dynamic weighting of the combination items through the attention cross network, more efficiently utilizes the combination characteristics, and eliminates the influence of redundant characteristics on a click rate prediction model. The method further comprises an implicit characteristic cross network, and the implicit characteristic cross network completes the capture of highly nonlinear interaction characteristics by applying a multilayer perceptron, so that the problem that the characteristic expression capability of the model is limited by the parameter scale is solved. And finally, the prediction probability output layer is provided, and the prediction probability output layer performs click rate prediction based on the output of the explicit characteristic cross network and the implicit characteristic cross network through a Sigmoid activation function, so that the obtained prediction data is more accurate. Further, the estimated data can be used as a data fine-ranking link and applied to the fields of enterprise-level recommendation systems, search systems, online advertisement systems and the like.
In the embodiment, the discrete features are subjected to one-hot code conversion in the feature embedding layer, and the encoded discrete features are used as embedding vectors and are subjected to data standardization according to data distribution characteristics to form dense features; the two low-dimensional dense vectors can more effectively retain original semantic information.
In the embodiment, the crossing algorithm used in the cross layer makes explicit feature combination more efficient.
In the embodiment, the attention layer further enables the model to learn and combine the feature weights by enabling the contribution degrees of different parts to be different when the different parts are compressed together, so that automatic feature extraction is realized.
In the embodiment, the explicit feature cross network realizes dynamic weighting of the combination terms through the attention cross mechanism, uses combined features more efficiently, and eliminates the influence of redundant features on the click-through rate estimation model.
In the embodiment, the explicit characteristic cross network and the implicit characteristic cross network are connected in parallel, so that the characteristic characterization capability of the model is further enhanced, and the click rate estimation precision is improved.
In the embodiment, experiments compare the proposed model with the current mainstream, industry-validated click-through rate estimation models in terms of single-model performance, integrated-model performance and network parameter count, which verifies the rationality of the invention and shows that the proposed ACN structure has a great advantage in space complexity.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
Claims (9)
1. A click-through rate estimation model based on feature characterization under an attention mechanism, used for completing click-through rate estimation according to object features of an object to be estimated, wherein the object features are divided into continuous features and discrete features, the click-through rate estimation model being characterized by comprising:
the feature embedding layer is used for carrying out vectorization processing on the continuous features and the discrete features and then stacking and embedding the continuous features and the discrete features to form stacking features;
an explicit feature crossover network that forms an explicit output vector by inputting the stacked features into an attention crossover network for explicit feature combining;
the implicit characteristic cross network is used for inputting the stacking characteristics into a multilayer perceptron to carry out implicit characteristic combination to form an implicit output vector;
a pre-estimated probability output layer, combining the explicit output vector and the implicit output vector to form a high-order nonlinear combined feature, and simultaneously transmitting the combined feature to a Sigmoid activation function to predict the click rate to obtain the click rate;
wherein the attention crossing network comprises:
the cross layer processes the stacking features through a cross algorithm and generates a multi-dimensional vector; and
and the attention layer processes the multidimensional vector through a fully-connected neural network to generate an attention score, performs normalization processing on the attention score to generate a characteristic coefficient, and further generates the explicit output vector through an output calculation formula based on the characteristic coefficient.
2. The feature characterization-based click through rate prediction model according to claim 1, wherein:
wherein the vectorization processing is:
carrying out one-hot code conversion on the discrete features, and taking the encoded discrete features as embedded vectors;
carrying out data standardization according to data distribution characteristics on the continuous features to form dense features;
subjecting the embedding vector and dense features to the stack embedding process as stacked features,
the matrix calculation formula of the one-hot coding conversion is as follows:
x_embed,i = W_embed,i · x_i  #(1)
3. The feature characterization-based click through rate prediction model according to claim 1, wherein:
wherein the calculation formula of the cross algorithm is:

x_{l+1} = x_0 · x_l^T · w_l + b_l + x_l = f(x_l, w_l, b_l) + x_l  #(2)
4. The feature characterization-based click through rate prediction model according to claim 1, wherein:
wherein the computing logic of the normalization process in the attention layer is:
a'_i = h^T ReLU(W x_i + b)  #(3)
5. The feature characterization-based click through rate prediction model according to claim 1, wherein:
wherein the output calculation formula of the attention cross network is:

x_ACN = Σ_i a_i x_i

where a_i is the attention weight.
6. The feature characterization-based click through rate prediction model according to claim 1, wherein:
wherein, each layer of the multilayer perceptron comprises the following calculation logics:
H_{l+1} = f(W_l H_l + b_l)  #(7)
7. The feature characterization-based click through rate prediction model according to claim 1, wherein:
wherein the formula of the Sigmoid activation function is:

ŷ = σ(w_out^T [x_ACN; H_MLP] + b_out)

where x_ACN and H_MLP are the outputs of the explicit feature cross network and of the multilayer perceptron, respectively.
8. The feature characterization-based click through rate prediction model according to claim 1, wherein:
and the click rate estimation model based on the characteristic representation under the attention mechanism carries out error back transmission on the click rate through a Logloss loss function until the click rate output by the output layer is converged, and completes the parameter updating of the click rate estimation model based on the characteristic representation under the attention mechanism.
9. The feature characterization-based click through rate prediction model of claim 5, wherein:
wherein the formula of the Logloss loss function is:

L = -(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ] + λ Σ_l ‖W_l‖²

where W_l is the weight of the l-th layer, p_i is the output of the click-through rate estimation model, y_i is the label of the corresponding sample, N is the number of training samples, and λ is the L2 regularization coefficient; error back-propagation is performed through the Logloss loss function and the parameters are updated accordingly until convergence, completing the training of the final click-through rate model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629307.3A CN113887694A (en) | 2020-07-01 | 2020-07-01 | Click rate estimation model based on characteristic representation under attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629307.3A CN113887694A (en) | 2020-07-01 | 2020-07-01 | Click rate estimation model based on characteristic representation under attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113887694A true CN113887694A (en) | 2022-01-04 |
Family
ID=79012984
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010629307.3A Pending CN113887694A (en) | 2020-07-01 | 2020-07-01 | Click rate estimation model based on characteristic representation under attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113887694A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114529309A (en) * | 2022-02-09 | 2022-05-24 | 北京沃东天骏信息技术有限公司 | Information auditing method and device, electronic equipment and computer readable medium |
CN115271272A (en) * | 2022-09-29 | 2022-11-01 | 华东交通大学 | Click rate prediction method and system for multi-order feature optimization and mixed knowledge distillation |
CN115295153A (en) * | 2022-09-30 | 2022-11-04 | 北京智精灵科技有限公司 | Cognitive assessment method and cognitive task pushing method based on deep learning |
CN116611497A (en) * | 2023-07-20 | 2023-08-18 | 深圳须弥云图空间科技有限公司 | Click rate estimation model training method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018212711A1 (en) * | 2017-05-19 | 2018-11-22 | National University Of Singapore | Predictive analysis methods and systems |
CN109960759A (en) * | 2019-03-22 | 2019-07-02 | 中山大学 | Recommender system clicking rate prediction technique based on deep neural network |
CN111062775A (en) * | 2019-12-03 | 2020-04-24 | 中山大学 | Recommendation system recall method based on attention mechanism |
CN111325579A (en) * | 2020-02-25 | 2020-06-23 | 华南师范大学 | Advertisement click rate prediction method |
-
2020
- 2020-07-01 CN CN202010629307.3A patent/CN113887694A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018212711A1 (en) * | 2017-05-19 | 2018-11-22 | National University Of Singapore | Predictive analysis methods and systems |
CN109960759A (en) * | 2019-03-22 | 2019-07-02 | 中山大学 | Recommender system clicking rate prediction technique based on deep neural network |
CN111062775A (en) * | 2019-12-03 | 2020-04-24 | 中山大学 | Recommendation system recall method based on attention mechanism |
CN111325579A (en) * | 2020-02-25 | 2020-06-23 | 华南师范大学 | Advertisement click rate prediction method |
Non-Patent Citations (4)
Title |
---|
JUN XIAO等: "Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks", 《PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE》, 19 August 2017 (2017-08-19), pages 3119 - 3125 * |
QIANQIAN WANG 等: "A Hierarchical Attention Model for CTR Prediction Based on User Interest", 《IEEE SYSTEMS JOURNAL》, vol. 14, no. 3, 24 October 2019 (2019-10-24), pages 4015 - 4024 * |
WANG RUOXI 等: "Deep&CrossNetwork for Ad Click Predictions", 《ARXIV》, 17 August 2017 (2017-08-17), pages 1 - 7 * |
温瑶瑶: "注意力机制下基于深度学习的点击率预测方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 1, 31 January 2020 (2020-01-31) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114529309A (en) * | 2022-02-09 | 2022-05-24 | 北京沃东天骏信息技术有限公司 | Information auditing method and device, electronic equipment and computer readable medium |
CN115271272A (en) * | 2022-09-29 | 2022-11-01 | 华东交通大学 | Click rate prediction method and system for multi-order feature optimization and mixed knowledge distillation |
CN115271272B (en) * | 2022-09-29 | 2022-12-27 | 华东交通大学 | Click rate prediction method and system for multi-order feature optimization and mixed knowledge distillation |
CN115295153A (en) * | 2022-09-30 | 2022-11-04 | 北京智精灵科技有限公司 | Cognitive assessment method and cognitive task pushing method based on deep learning |
CN116611497A (en) * | 2023-07-20 | 2023-08-18 | 深圳须弥云图空间科技有限公司 | Click rate estimation model training method and device |
CN116611497B (en) * | 2023-07-20 | 2023-10-03 | 深圳须弥云图空间科技有限公司 | Click rate estimation model training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||