CN111563770A - Click rate estimation method based on feature differentiation learning - Google Patents

Click rate estimation method based on feature differentiation learning

Info

Publication number
CN111563770A
CN111563770A (application CN202010342981.3A)
Authority
CN
China
Prior art keywords
feature
vector
features
neural network
vectors
Prior art date
Legal status
Pending
Application number
CN202010342981.3A
Other languages
Chinese (zh)
Inventor
郑小林
杨煜溟
Current Assignee
Hangzhou Jztdata Technology Co ltd
Original Assignee
Hangzhou Jztdata Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Jztdata Technology Co ltd filed Critical Hangzhou Jztdata Technology Co ltd
Priority to CN202010342981.3A priority Critical patent/CN111563770A/en
Publication of CN111563770A publication Critical patent/CN111563770A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0277Online advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to recommendation click-rate estimation technology and aims to provide a click rate estimation method based on feature differentiation learning. The method comprises the following steps: first, constructing input vectors for the original features to obtain a low-dimensional feature-vector representation of each original feature; constructing a neural network with feature combination capability, obtaining combined feature vectors, and constructing the network output; then, applying the proposed differentiated activation constraint to control the similarity between feature vectors and improve the integrity of feature-vector expression; using the existing compression-excitation network to differentiate feature importance, improving the neural network's ability to discriminate among features; and finally, jointly training the neural network with feature combination capability and a deep neural network to obtain the final predicted value. The method can improve the click-rate estimation model's ability to judge the effectiveness of combined features; it can deeply analyze the original features, accurately characterize the combination relationships of feature vectors, and effectively predict the probability that recommended content is clicked by the user.

Description

Click rate estimation method based on feature differentiation learning
Technical Field
The invention relates to the field of recommended click rate estimation, in particular to a click rate estimation method based on feature differentiation learning.
Background
With the development of information technology and the internet, people have gradually moved from an era of information scarcity to an era of information overload. The complexity and heterogeneity of massive information make information acquisition difficult and time-consuming, posing great challenges to both information consumers and information producers. More and more internet applications have successfully introduced recommendation systems, which are widely used in e-commerce, movies and videos, music, social networks, location-based services, personalized advertising, and other fields.
The core task of a recommendation system is to present content matching a user's interests in a specific context. The click-through rate (CTR) describes the probability that displayed content is clicked by the user, and CTR estimation refers to predicting, through data-mining techniques applied to contextual data about the user and the content, the probability that content recommended to a user in a specific context will be clicked by that user. Whether recommended content is clicked reflects whether the currently displayed content matches the user's interests; click-rate estimation algorithms are therefore widely used in the content ranking stage of recommendation systems to generate recommendation lists matching users' interests and habits, thereby improving user satisfaction with the recommended content, increasing users' time spent in the application, or increasing the revenue from advertising within the application.
Research on recommendation systems is linked to many related fields such as user modeling, machine learning, and information retrieval; owing to its growing importance, it evolved into an independent research field in the 1990s. The recommendation problem is defined as estimating how a user would score unseen items, so that the item with the highest score estimate can be recommended to the user. In the recommendation process, recommendation accuracy, diversity, algorithmic efficiency, and other issues are the key focuses of recommendation-algorithm research.
The recommendation system can be regarded as a search ranking system, that is, given a query, a recommendation task finds relevant items from a database in a recall stage, then in a ranking stage, a recalled content subset is further ranked based on a target score estimated by a user click rate, and then content distribution is performed in combination with a strategy. In the recommendation sorting stage, the accurate estimation of the click rate of the user has important guiding effects on improving the value of traffic and increasing the advertising revenue, so that the prediction of the recommended click rate is a research direction with engineering and academic significance at the same time.
With the rapid development of the mobile internet, the feature dimensions and forms of content recommendation are increasingly large and diversified, and meanwhile, the structure of a recommendation algorithm model is also developed from a shallow layer to a deep layer, and the recommendation algorithm model is mainly divided into two types of click rate estimation methods, namely a traditional machine learning model and a deep learning model.
For user click-through-rate estimation, a method widely adopted in industry combines manual feature engineering with a linear logistic regression model; the linear model has the advantages of a simple structure, easy maintenance, the ability to handle discretized features, and support for distributed computation. However, linear models lack the ability to capture implicit features and require extensive manual feature engineering to achieve good prediction results. For example, an important task in feature engineering is cross-extraction on categorical features: the original categorical features are independent, and combining potentially associated features is more beneficial to the model's predictions. However, conventional cross-feature engineering creates several problems. First, obtaining high-quality manual features is costly: data scientists spend a great deal of time exploring potential patterns in product data to design meaningful cross features for a particular task. In addition, in internet-scale recommendation systems, the original dimensionality of the data is typically in the thousands, making it impractical to extract all combined features by hand. Moreover, manually extracted combinations cannot generate combined features that do not appear in the training set. Therefore, studying how to use the model itself for automated feature crossing and combination is a very meaningful task.
For the problem of feature combination on large-scale sparse discrete data, traditional models cannot do without manual feature engineering. Exploiting the feature-expression capability of deep neural networks to explore complex feature combinations improves click-rate prediction and has two main advantages: first, deep models have strong expressive power and can learn high-order nonlinear features; second, other feature types, such as images and speech, can be incorporated more easily, enabling end-to-end model prediction.
As described above, there are currently many research results on recommendation click-rate estimation, and the methods used are diverse; for example, the following documents disclose such technical solutions:
The Chinese invention patent "A click rate estimation method and system based on the Xgboost algorithm" (CN 201811312769.1). The technical scheme comprises: selecting a preset number of original features from the log data of an advertisement delivery platform; training an Xgboost model with these original features to obtain a model file; acquiring the current features of a predetermined number of advertisements in the platform's advertisement library; and computing, with the model file, an estimated click-through-rate value for each set of current features. The method thus obtains a model file based on the Xgboost algorithm, and the model file can rapidly process advertisement features to produce estimated click-rate values. In addition, the method is highly portable, i.e., it can be implemented on each platform, and has high fault tolerance compared with related techniques.
The Chinese invention patent "A click rate estimation method based on an FFM deep neural network" (CN 201910123419.9). The technical scheme comprises the following steps: 1) discretizing the data in the training set; 2) re-encoding the discretized training data; 3) training an FFM deep neural network on the re-encoded training data; 4) preprocessing the data to be predicted; 5) estimating the click-through rate of the preprocessed data with the trained neural network. The method exploits the strong expressive power and automatic feature combination of the FFM deep neural network model, so that the model can learn both low-order and high-order feature information while solving the problem of automatic feature crossing, making it better applicable in industrial and everyday settings.
The Chinese invention patent "A click rate estimation method based on decision trees and logistic regression" (CN 201711439302.9). The method comprises the following steps: acquiring feature data related to the delivered information; establishing a click-rate estimation model based on a cascade of a decision tree and a probabilistic sparse linear classifier; generating real-time training data through an online connector; and training the click-rate estimation model on the real-time data to keep the latest model for estimating the click rate. A model architecture based on the decision-tree/probabilistic-sparse-linear-classifier cascade is provided; it also comprises an online learning layer and discloses the online connector, a critical component of that layer which converts training data into real-time streaming data.
Although the technical solutions of the above three documents can estimate the recommended click-through rate, when applied to specific internet application scenarios they still have the following disadvantages:
Most click-rate estimation methods focus on combining the original categorical features but do not simultaneously consider the integrity of combined-feature expression and the importance of the combined features, whereas better prediction accuracy can be achieved through complete feature expression and effective feature utilization.
Feature crossing is a key problem in the field of click-rate estimation, and many related works have designed models with crossing network structures. These models usually compute feature-vector crosses with vector inner products or Hadamard products, but no explicit structure distinguishes the meanings of the many intermediate feature vectors in the network, which can limit the models' feature-expression capacity and cause overfitting.
In addition, in the click-through-rate estimation task, different features have different degrees of importance: for example, when predicting a person's income, occupational features obviously influence income more than hobby features do, and different feature combinations also differ in importance. Therefore, in a neural network with a feature-crossing structure, if every feature is crossed with the other features using the same weight, increasingly obvious information loss results as the number of network layers grows.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects in the prior art and provide a click rate estimation method based on feature differentiation learning.
In order to solve the technical problem, the solution of the invention is as follows:
the click rate estimation method based on feature differentiation learning comprises the following steps: firstly, constructing an input vector of original features to obtain low-dimensional feature vector representation of each original feature; constructing a neural network with feature combination capability, obtaining combined feature vectors and constructing and outputting the combined feature vectors; then, differential activation constraint is proposed to control the similarity between the feature vectors and improve the integrity of feature vector expression; the existing compression-excitation network is used for distinguishing the feature importance, so that the distinguishing capability of the neural network on the features is improved; and finally, performing combined training on the neural network with the feature combination capability and the deep neural network to obtain a final predicted value.
The method disclosed by the invention specifically comprises the following steps of:
(1) constructing input vectors of original features
Embedding and coding sparse features in a feature input layer of a click rate estimation model based on deep learning, and converting each input original data feature into a low-dimensional dense real numerical vector, namely an embedded vector of the features; splicing the embedded vectors of all the features to be used as a result of a feature input layer, and using the feature embedded vectors as basic units of the features;
(2) constructing neural networks with feature combination capability
Combining the features in vector form, wherein each basic unit is an embedding vector of a feature; combining, pairwise, the feature vectors output by the previous layer of the neural network with the original feature vectors, and taking a weighted average of the resulting combination vectors to obtain the output of each layer of the neural network;
combining the neural network with the original embedded vector once more every time one layer is added, wherein the number of the layers determines the times of feature combination, and the output of each hidden layer in the neural network structure is determined by the input of the previous hidden layer and the original feature; the structure of the feature vector is kept in each layer of the network, and all feature combinations are carried out according to the vector;
(3) using differential activation constraints to control similarity between feature vectors
Each layer of the neural network is used as a unit for differentiating the feature vectors, so that the feature vectors in each layer have difference as much as possible, and the cosine similarity is used for representing the difference between the feature vectors;
and (3) iteratively solving each orthogonal vector representation in a regularization constraint mode, calculating cosine similarity between every two vectors in each iteration process, and adding the cosine similarity as a regularization loss into a model for co-training: explicitly controlling the similarity degree between the feature vectors through the regular terms of the differentiated activation constraints, so that the similarity between the feature vectors is continuously reduced in the training of the neural network model;
(4) constructing outputs of neural networks
Splicing the feature vectors of all hidden layers of the neural network with feature combination capability constructed in step (2) to obtain a combined feature matrix as output; the combined feature matrix contains the combined features of all layers, and each of its elements is a feature vector;
(5) distinguishing the feature importance by utilizing a compression-excitation network;
for all combined features and original features, an attention mechanism based on a compression-excitation network is introduced, the weight of important features is increased, and the weight of unimportant features is reduced;
the output of the neural network with feature combination capability is all the combined features and the original feature vectors, which are used as the input of the compression-excitation network, and the weight vector corresponding to each feature is generated by the latter; directly connecting the weight-adjusted feature vector to an output unit, wherein the obtained neural network model is called a feature importance degree-based differential activation network;
(6) performing combined training on the differential activation network and the deep neural network based on the feature importance degree to construct a combined model
Connecting the output of the differentiated activation network based on the feature importance obtained in the step (5) to the existing deep neural network to construct a deep learning model; connecting the combined features with weights output by the differentiated activation network based on feature importance to a linear logistic regression model and a deep neural network model for combined training; connecting the outputs of the linear logistic regression model and the deep neural network model to an output unit to obtain a joint click rate pre-estimated value, namely a probability value estimated by the finally recommended click rate; the larger the value, the higher the probability that the recommended content is clicked by the user.
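As a rough illustration of the joint output described in step (6), the following sketch sums the logit of a linear (logistic-regression) part and the logit of a deep part and passes the result through a sigmoid to obtain the click probability. The shapes, the single ReLU hidden layer, and all parameter names are illustrative assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_predict(c, w_lin, b_lin, W1, b1, w2, b2):
    """c: flattened weighted feature vector -> click probability in (0, 1)."""
    linear_logit = c @ w_lin + b_lin        # shallow part: low-order combinations
    hidden = np.maximum(0.0, c @ W1 + b1)   # deep part: one ReLU hidden layer
    deep_logit = hidden @ w2 + b2           # (a real model would stack more layers)
    return sigmoid(linear_logit + deep_logit)

n_feat = 16
c = rng.normal(size=n_feat)                 # stand-in for the weighted features
p = joint_predict(c,
                  rng.normal(size=n_feat), 0.0,
                  rng.normal(size=(n_feat, 8)), np.zeros(8),
                  rng.normal(size=8), 0.0)
assert 0.0 < p < 1.0                        # a valid probability: the larger, the
                                            # more likely the content is clicked
```

In a trained model both parts would share the same weighted combined features and be optimized jointly, as the step describes.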
Compared with the prior art, the invention has the beneficial effects that:
1. in order to improve the differential expression capability of the feature vectors, the invention provides a differential activation constraint method aiming at the feature vectors, which can increase the difference among different feature vectors in a targeted manner, thereby activating more implicit modes in data and achieving the purpose of efficient feature coding.
2. The invention utilizes the existing compression-excitation network to automatically learn the weight of the combined feature, provides a differentiated activation network based on feature importance, and improves the judgment capability of a click rate estimation model on the effectiveness of the combined feature.
3. According to the method, the output of a differential activation network based on feature importance is simultaneously connected to a deep neural network and a linear logistic regression model, the deep part of the model enables the model to have the capability of simultaneously learning explicit and implicit high-order feature combinations, and meanwhile, the generalization of the whole model is improved; the shallow part of the model can learn the feature low-order combination for improving the generalization of the model, and the model does not need to be artificially combined with features.
4. The invention discloses an innovative calculation method for estimating recommended click rate, which can deeply analyze original characteristics, accurately depict the combination relation of characteristic vectors and effectively predict the probability of clicking recommended contents by users.
Drawings
Fig. 1 is an overall architecture of a feature importance-based differentiated activation network in the present invention.
Fig. 2 is a schematic diagram of the overall structure of the differentiated activation network in the present invention.
Fig. 3 is a schematic diagram of a compression-excitation network unit structure in the present invention.
Detailed Description
The click rate estimation method based on feature differentiation learning provided by the invention is based on the differentiation activation network of feature importance, provides a differentiation activation constraint method aiming at feature vectors, and can increase the differences among different feature vectors in a targeted manner, thereby activating more implicit modes in data and achieving the purpose of efficient feature coding. And the weight of the combined features is automatically learned by utilizing the existing compression-excitation network, a differentiated activation network based on feature importance is constructed, the judgment capability of a click rate estimation model on the effectiveness of the combined features is improved, and then a recommended click rate pre-estimated value is obtained through calculation.
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the click rate estimation method based on feature differentiation learning specifically comprises the following steps:
step (1): constructing a characteristic input vector;
the feature input layer of the click rate estimation model based on deep learning comprises a process of embedding and coding sparse features, and each input original data feature is converted into a low-dimensional dense real numerical vector, namely an embedded vector of the features. Stitching the embedded vectors of all features as the result E ═ E of the feature input layer1,e2,...,ef]Where f represents the number of features,
Figure BDA0002468600800000061
an embedding vector representing the ith feature, and d is the dimension of the embedding vector,
Figure BDA0002468600800000062
is a matrix symbol. The neural network model embeds features into vectors as basic units of features,only the calculation of the feature vector is needed in the subsequent part.
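As a minimal sketch of this input layer (with hypothetical vocabulary sizes and an assumed embedding dimension, not values from the patent), one sparse id per feature field is looked up in an embedding table and the $f$ resulting vectors are stacked into $E$:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_embedding_tables(vocab_sizes, d):
    """One embedding table per feature field (sizes are illustrative)."""
    return [rng.normal(scale=0.01, size=(v, d)) for v in vocab_sizes]

def embed(tables, feature_ids):
    """Look up one id per field; returns an (f, d) matrix E = [e_1, ..., e_f]."""
    return np.stack([t[i] for t, i in zip(tables, feature_ids)])

tables = build_embedding_tables([100, 50, 10], d=8)  # f = 3 feature fields
E = embed(tables, [42, 7, 3])                        # one sparse id per field
assert E.shape == (3, 8)                             # f rows, d columns
```

In practice the tables would be trained parameters (e.g. an embedding layer) rather than fixed random matrices.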
Step (2): constructing a neural network with feature combination capability;
Combining features in vector form, each basic unit being an embedding vector of a feature, the output of the $k$-th layer of the neural network with feature combination capability is represented as a matrix $X^k \in \mathbb{R}^{H_k \times d}$, where $H_k$ denotes the number of feature embedding vectors in the $k$-th layer, $d$ is the dimension of the embedding vectors, and $x_h^k \in \mathbb{R}^d$ denotes the $h$-th feature vector of the $k$-th layer. Set $H_0 = m$, where $m$ denotes the number of embedding vectors of the original features. The $h$-th feature vector of the $k$-th layer is computed as

$$x_h^k = \frac{1}{H_{k-1}\, m} \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W_{ij}^{k,h} \left( x_i^{k-1} \odot x_j^0 \right), \qquad 1 \le h \le H_k,$$

where $W^{k,h} \in \mathbb{R}^{H_{k-1} \times m}$ is the parameter matrix corresponding to the $h$-th feature vector of the $k$-th layer and $\odot$ denotes the element-wise (Hadamard) product; the $k$-th layer of the neural network with feature combination capability therefore has $H_{k-1} \cdot m \cdot H_k$ parameters. In other words, the feature vectors output by the previous layer are combined pairwise with the original feature vectors, and the resulting $H_{k-1} \times m$ combination vectors are weighted-averaged to obtain the $h$-th feature vector of the current layer.

Each additional layer of the neural network with feature combination capability combines the features with the original embedded vectors $X^0$ once more, so the number of layers controls the number of explicit feature combinations, and the output of each hidden layer is determined by the previous hidden layer and the original input. The structure of the feature vectors is maintained at every layer of the network, so all feature combinations are performed vector-wise.
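One layer of this combination network can be sketched as follows. The pairwise-Hadamard-product-then-weighted-average reading is inferred from the description above; the normalization, shapes, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def combine_layer(X_prev, X0, W):
    """X_prev: (H_prev, d), X0: (m, d), W: (H_k, H_prev, m) -> (H_k, d)."""
    # pairwise Hadamard products of previous-layer and original vectors: (H_prev, m, d)
    P = X_prev[:, None, :] * X0[None, :, :]
    # weighted average over all H_prev * m combination vectors per output vector
    return np.einsum('him,imd->hd', W, P) / (X_prev.shape[0] * X0.shape[0])

m, d, H1 = 4, 8, 6
X0 = rng.normal(size=(m, d))          # original feature embedding vectors
W1 = rng.normal(size=(H1, m, m))      # one (H_0 x m) weight matrix per output vector
X1 = combine_layer(X0, X0, W1)        # first hidden layer combines X0 with itself
assert X1.shape == (H1, d)            # vector structure is preserved layer to layer
```

Stacking further calls (always pairing the previous output with `X0`) raises the combination order by one per layer, as the text describes.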
And (3): utilizing differentiated activation constraints to control similarity between feature vectors;
each layer of the neural network with the feature combination capability is used as a unit for feature vector differentiation, so that the feature vectors in each layer have differences as much as possible, and cosine similarity is used for representing the differences among the feature vectors; the coding expression of the neural network with the feature combination capability should remove the redundancy of the representation among the feature vectors as much as possible, reduce the unpredictability of the system, and improve the representation capability of the combination features and the generalization capability of the model.
The feature differentiation has an important role in the feature expression capability, and in order to remove information redundancy possibly existing among feature vectors and improve the differential expression capability of the feature vectors, a Differentiated Activation Constraint (DAC) method for the feature vectors is adopted to explicitly control the similarity degree among the feature vectors.
Each orthogonal vector representation is solved iteratively by means of regularization constraints. In each iteration, the cosine similarity between every two feature vectors is computed and added as a regularization loss into the overall click-rate estimation model for joint training. The regularization term of the differentiated activation constraint explicitly controls the degree of similarity between feature vectors, so that their similarity decreases continuously during model training; this neural network structure is defined as a Differentiated Activation Network (DAN).
Define the depth of the differentiated activation network as $T$ and the number of feature vectors in the $k$-th layer as $H_k$, with $x_i^k$ denoting the $i$-th feature embedding vector of the $k$-th layer and $H_0 = m$ the number of original feature embedding vectors of the input. Taking each layer of the neural network with feature combination capability as a unit of feature-vector differentiation, the goal of the differentiated activation constraint is to make the feature vectors within each layer as different as possible, i.e., to minimize the similarity between the feature vectors in each layer:

$$\mathcal{L}_{dac}(\alpha) = \sum_{k=0}^{T} \sum_{i=1}^{H_k} \sum_{j=i+1}^{H_k} \cos\left( x_i^k, x_j^k \right),$$

where $\mathcal{L}_{dac}$ denotes the loss function of the differentiated activation constraint, $\alpha$ denotes the parameters of the neural network, and $\cos(x_i^k, x_j^k) = \dfrac{x_i^k \cdot x_j^k}{\lVert x_i^k \rVert \, \lVert x_j^k \rVert}$ is the cosine value between feature vectors $x_i^k$ and $x_j^k$. The cosine value expresses the angle between vectors, i.e., the difference in their directions, so cosine similarity is used to express the difference between feature vectors.
And (4): constructing the output of a neural network with the characteristic combination capability;
since the k-th layer has HkA different parameter matrix, so that the k-th layer output of the differentiated activation network is HkA number of different feature vectors. FIG. 2 shows the overall structure of the differentiated activation network, defining the depth of the differentiated activation network as T, and the number of all the combined features and the original features as
Figure BDA0002468600800000084
Feature vectors of all hidden layers
Figure BDA0002468600800000085
k∈[0,T]Splicing to obtain a combined feature matrix C with the dimension of n × d ═ x1,x2,…,xn]As the output of the differentiated activation network, each element dimension is a feature vector of d, and therefore, all the combined features from 0 th order to T th order are included in the combined feature matrix C.
Step (5): distinguishing the feature importance by using a compression-excitation network;
After the differentiated activation network, an attention mechanism based on a compression-excitation network (SENET) is introduced. The compression-excitation network is an existing technique used mainly to distinguish the weight of each feature in the neural network model: over all combined features and original features, it automatically increases the weight of important features and decreases the weight of unimportant ones.
The output of the differentiated activation network is the set of all combined features and original feature vectors, C = [x_1, x_2, …, x_n]. These feature vectors are taken as the input of the compression-excitation network, which generates a weight vector corresponding to the importance of each feature, A = [a_1, a_2, …, a_n], where a_i is the weight of the i-th feature. These weights are then applied to all features to obtain the weighted feature vectors C_se = [v_1, v_2, …, v_n] ∈ R^(n×d), where each v_i = x_i a_i, i ∈ [1, 2, …, n], is the i-th adjusted feature vector.
As shown in fig. 3, the compression-excitation network performs weight adjustment in parallel with the feature vector, and is composed of three parts, i.e., compression, excitation, and weight adjustment, which are described separately.
The compression (Squeeze) process converts each vector into a scalar by computing a statistic of each feature vector. Specifically, using max pooling or mean pooling, the input feature vectors C = [x_1, x_2, …, x_n] are compressed into a vector of statistics Z = [z_1, z_2, …, z_n], where the scalar z_i represents the global information of the i-th feature vector. Taking mean pooling as the example, z_i is computed as:

z_i = (1/d) Σ_{t=1..d} x_i[t]
After the compression process, all elements of each feature vector are averaged into a single value. Because the final weight is applied to the feature vector as a whole, the weight is thereby computed from the global information of that vector. Moreover, what is exploited is the correlation among feature vectors rather than among the elements inside a feature vector; global pooling masks the internal distribution of each vector, which makes the weight calculation more accurate.
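As a small illustration of the compression step, the mean-pooling statistic z_i can be computed per feature vector as below; this is a sketch of the formula above, not the patent's implementation:

```python
import numpy as np

def squeeze(C):
    """Mean pooling: compress each d-dimensional feature vector x_i
    into the scalar z_i = (1/d) * sum_t x_i[t]."""
    return C.mean(axis=1)

C = np.array([[1.0, 3.0],
              [2.0, 2.0]])
print(squeeze(C))   # [2. 2.]
```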
The Excitation process learns the importance of each feature from the statistics vector Z, using two fully connected neural network layers to learn the weights. The first fully connected layer performs dimension reduction: with the reduction ratio set to r, it compresses the n input statistics into n/r values to reduce computation, using σ_1 as the nonlinear activation function. The second fully connected layer restores the dimension to n and uses σ_2 as the nonlinear activation function. The weight of each feature vector is thus computed as:

A = F_ex(Z) = σ_2(W_2 σ_1(W_1 Z))
where W_1 ∈ R^((n/r)×n) and W_2 ∈ R^(n×(n/r)) are the parameters of the first and second fully connected layers respectively, A ∈ R^n is the weight vector of the feature vectors, and r is the reduction ratio. The performance under various values of the reduction ratio r is tried, finally arriving at a balance between overall performance and computation. The excitation process uses fully connected layers so that the true weights are trained from the correlations between the features: the compressed output of each batch of samples does not itself represent the weights by which the true features should be adjusted; the true weights are trained on all the data, and therefore a fully connected network is required for the training.
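A minimal sketch of the excitation step under common assumptions (σ_1 = ReLU, σ_2 = sigmoid, as in the original SENET literature; the weight matrices here are randomly initialised placeholders rather than trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def excitation(z, W1, W2):
    """A = sigma_2(W2 sigma_1(W1 z)): reduce n -> n/r, then restore to n."""
    return sigmoid(W2 @ np.maximum(W1 @ z, 0.0))  # sigma_1 = ReLU, sigma_2 = sigmoid

n, r = 4, 2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(n // r, n))   # first FC layer: dimension reduction by r
W2 = rng.normal(size=(n, n // r))   # second FC layer: restore dimension n
z = rng.normal(size=n)              # statistics vector from the squeeze step

A = excitation(z, W1, W2)
assert A.shape == (n,) and np.all((A > 0) & (A < 1))   # one weight per feature
```

With a sigmoid σ_2, each a_i lands in (0, 1), so the subsequent scaling v_i = a_i x_i can only attenuate features, never amplify them.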
Finally, the compression-excitation network performs weight adjustment on all feature vectors: the feature matrix C and the weight vector A are multiplied element-wise to obtain the adjusted feature vectors C_se = [v_1, v_2, …, v_n]. The calculation is:

C_se = F_scale(C, A) = [a_1 x_1, a_2 x_2, …, a_n x_n]
Through this mechanism, the importance of the feature vectors is learned dynamically after the compression-excitation network: for a particular task, the weights of important features are increased and the weights of task-irrelevant features are decreased. Finally, the weight-adjusted feature vectors C_se are connected directly to an output unit, yielding the Feature-importance-based Differentiated Activation Network (FiDAN), which is a shallow model without a deep neural network structure.
Step (6): jointly training the feature-importance-based differentiated activation network and the deep neural network to construct a joint model;
The output of the feature-importance-based differentiated activation network is connected to a conventional deep neural network to construct a deep model. The deep neural network is an existing technique composed of multiple fully connected layers and nonlinear activation functions, and can express implicit nonlinear combinations of high-order features. Let a = [y_1, y_2, …, y_2n] be the output of the feature-importance-based differentiated activation network, where each y_i ∈ R^d is a feature vector. Then a is fed into the deep neural network to learn high-order feature crossings; the forward propagation of the network is:

x^1 = σ(W^1 a + b^1)
x^k = σ(W^k x^(k-1) + b^k)

where k is the layer index of the neural network, σ is a nonlinear activation function, and x^k is the output of the k-th layer. The process by which the deep neural network learns high-order feature crossings is implicit.
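The forward propagation above can be sketched as follows, assuming σ = ReLU and random placeholder parameters; the layer sizes are arbitrary illustration values:

```python
import numpy as np

def dnn_forward(a, weights, biases):
    """x^1 = sigma(W^1 a + b^1); x^k = sigma(W^k x^(k-1) + b^k)."""
    x = a
    for W, b in zip(weights, biases):
        x = np.maximum(W @ x + b, 0.0)   # sigma = ReLU (an assumed choice)
    return x

rng = np.random.default_rng(1)
a = rng.normal(size=8)                   # flattened FiDAN output (hypothetical size)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(4, 16))]
biases = [np.zeros(16), np.zeros(4)]
out = dnn_forward(a, weights, biases)
assert out.shape == (4,)
```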
In order to give the model both generalization and memorization, the weighted combined features output by the feature-importance-based differentiated activation network FiDAN are connected simultaneously to a linear logistic regression model and to the deep neural network model for joint training. On the one hand the model can learn both low-order and high-order feature combinations; on the other hand it learns both implicit and explicit feature combinations. The outputs of the linear logistic regression model and of the deep neural network model are therefore connected to the output unit, and the click-rate estimate of the joint model is:

ŷ = σ(w^T [a, x^k] + b)

where ŷ is the click-rate estimate, σ denotes the sigmoid function, a is the output of the differentiated activation network FiDAN, x^k is the output of the deep neural network module, and w and b are the parameters of the output unit. The computed click-rate estimate ŷ is the probability value of the final recommendation click-rate estimation; the larger the value, the higher the probability that the recommended content will be clicked by the user.
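The joint output unit can be sketched as below, assuming the FiDAN output and the deep-network output are concatenated before the sigmoid; the inputs and parameters are illustrative placeholders, not trained values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_ctr(a, x_deep, w, b):
    """y_hat = sigmoid(w^T [a, x^k] + b), the joint click-rate estimate."""
    z = np.concatenate([a, x_deep])
    return sigmoid(w @ z + b)

a = np.array([0.5, -0.2])        # FiDAN (wide) part, hypothetical values
x_deep = np.array([1.0, 0.3])    # deep neural network part, hypothetical values
w = np.zeros(4)                  # with zero weights the estimate is sigmoid(b)
print(joint_ctr(a, x_deep, w, b=0.0))   # 0.5
```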
Finally, it should be noted that the above-mentioned list is only a specific embodiment of the present invention. It is obvious that the present invention is not limited to the above embodiments, but many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (8)

1. A click rate estimation method based on feature differentiation learning, characterized by comprising the following steps: first, constructing input vectors of the original features to obtain a low-dimensional feature vector representation of each original feature; constructing a neural network with feature combination capability, obtaining combined feature vectors and constructing its output; then, proposing a differentiated activation constraint to control the similarity between the feature vectors and improve the completeness of the feature vector expression; using an existing compression-excitation network to distinguish the feature importance, thereby improving the neural network's ability to discriminate between features; and finally, jointly training the neural network with feature combination capability and a deep neural network to obtain the final predicted value.
2. The method according to claim 1, characterized in that it comprises in particular the steps of:
(1) constructing input vectors of original features
Embedding and coding sparse features in a feature input layer of a click rate estimation model based on deep learning, and converting each input original data feature into a low-dimensional dense real numerical vector, namely an embedded vector of the features; splicing the embedded vectors of all the features to be used as a result of a feature input layer, and using the feature embedded vectors as basic units of the features;
(2) constructing neural networks with feature combination capability
Combining the features in a vector mode, wherein each basic unit is an embedded vector of the features; combining every two eigenvectors output by the upper layer of the neural network and the original eigenvectors, and performing weighted average on the obtained multiple combined vectors to obtain the output of each layer of the neural network;
combining the neural network with the original embedded vector once more every time one layer is added, wherein the number of the layers determines the times of feature combination, and the output of each hidden layer in the neural network structure is determined by the input of the previous hidden layer and the original feature; the structure of the feature vector is kept in each layer of the network, and all feature combinations are carried out according to the vector;
(3) using differential activation constraints to control similarity between feature vectors
Each layer of the neural network is used as a unit for differentiating the feature vectors, so that the feature vectors in each layer have difference as much as possible, and the cosine similarity is used for representing the difference between the feature vectors;
iterative solution is carried out on each orthogonal vector representation in a regularization constraint mode, the cosine similarity between every two vectors is calculated in each iterative process and is used as a regularization loss to be added into a model for common training; explicitly controlling the similarity degree between the feature vectors through the regular terms of the differentiated activation constraints, so that the similarity between the feature vectors is continuously reduced in the training of the neural network model;
(4) constructing outputs of neural networks
Splicing the feature vectors of all hidden layers of the neural network with the feature combination capability constructed in the step (2) to obtain a combined feature matrix as output; the combined feature matrix comprises combined features of any number of layers, and each element dimension is a feature vector;
(5) distinguishing the feature importance by utilizing a compression-excitation network;
for all combined features and original features, an attention mechanism based on a compression-excitation network is introduced, the weight of important features is increased, and the weight of unimportant features is reduced;
the output of the neural network with feature combination capability is all the combined features and the original feature vectors, which are used as the input of the compression-excitation network, and the weight vector corresponding to each feature is generated by the latter; directly connecting the weight-adjusted feature vector to an output unit, wherein the obtained neural network model is called a feature importance degree-based differential activation network;
(6) performing combined training on the differential activation network and the deep neural network based on the feature importance degree to construct a combined model
Connecting the output of the differentiated activation network based on the feature importance obtained in the step (5) to the existing deep neural network to construct a deep learning model; connecting the combined features with weights output by the differentiated activation network based on feature importance to a linear logistic regression model and a deep neural network model for combined training; connecting the outputs of the linear logistic regression model and the deep neural network model to an output unit to obtain a joint click rate pre-estimated value, namely a probability value estimated by the finally recommended click rate; the larger the value, the higher the probability that the recommended content is clicked by the user.
3. The method according to claim 2, wherein in step (1), the embedding vectors of all the features are spliced as the result of the feature input layer, E = [e_1, e_2, …, e_f], where f denotes the number of features, e_i ∈ R^d denotes the embedding vector of the i-th feature, d is the dimension of the embedding vectors, and R^d denotes the d-dimensional real vector space.
4. The method according to claim 2, wherein in step (2), the output of the k-th layer of the neural network with feature combination capability is represented as a matrix H^k = [h_1^k, h_2^k, …, h_(H_k)^k] ∈ R^(H_k × d), where H_k denotes the number of feature embedding vectors of the k-th layer, d is the dimension of the embedding vectors, and h_i^k denotes the i-th feature vector of the k-th layer; H_0 = m is set, the number of embedding vectors of the original features; the h-th feature vector of the k-th layer of the neural network is computed as:

h_h^k = Σ_{i=1..H_(k-1)} Σ_{j=1..m} W_(h,i,j)^k (h_i^(k-1) ∘ e_j)

where 1 ≤ h ≤ H_k, ∘ denotes the element-wise product, and W_h^k ∈ R^(H_(k-1) × m) denotes the parameter matrix corresponding to the h-th feature vector of the k-th layer of the neural network with feature combination capability; the number of parameters of the k-th layer of the neural network is H_(k-1) * m * H_k.
5. The method according to claim 2, wherein in step (3),

each layer of the neural network with feature combination capability is taken as the unit of feature vector differentiation, so that the feature vectors within each layer differ as much as possible, cosine similarity being used to represent the difference between the feature vectors;

each orthogonal vector representation is solved iteratively in the form of a regularization constraint: in each iteration the pairwise cosine similarity between the feature vectors is computed and added to the overall click rate estimation model as a regularization loss for joint training; the degree of similarity between the feature vectors is controlled explicitly through the regularization term of the differentiated activation constraint, so that the similarity between the feature vectors decreases continuously during model training, this neural network structure being defined as the differentiated activation network;

the depth of the differentiated activation network is defined as T, the number of feature vectors of the k-th layer is denoted H_k, the vector h_i^k denotes the i-th feature embedding vector of the k-th layer, and H_0 = m is set, with m the number of original input feature embedding vectors; the goal of the differentiated activation constraint is to make the feature vectors within each layer of the neural network as different as possible, i.e., to minimize the similarity between the feature vectors of each layer:

L_da(α) = Σ_{k=1..T} Σ_{1≤i<j≤H_k} cos(h_i^k, h_j^k)

where L_da(α) denotes the loss function of the differentiated activation constraint, α denotes the parameters of the neural network, and cos(h_i^k, h_j^k) denotes the cosine value between the feature vectors h_i^k and h_j^k; the cosine value expresses the angle between two vectors, i.e., the difference in their directions, and cosine similarity is used to represent the difference between the feature vectors.
6. The method of claim 2, wherein in step (4), since the k-th layer has H_k different parameter matrices, the output of the k-th layer of the neural network is H_k different feature vectors; the depth of the differentiated activation network is defined as T, and the number of all combined features and original features is n = Σ_{k=0..T} H_k; the feature vectors h_i^k, k ∈ [0, T], of all hidden layers are spliced to obtain a combined feature matrix C = [x_1, x_2, …, x_n] of dimension n × d as the output of the differentiated activation network, where each element x_i is a d-dimensional feature vector, and the combined feature matrix C contains all combined features from order 0 to order T.
7. The method according to claim 2, wherein in step (5),

for all combined features and original features, an attention mechanism based on a compression-excitation network is introduced; the compression-excitation network of the prior art is used to distinguish the weight of each feature in the neural network model, increasing the weight of important features and decreasing the weight of unimportant features;

the output of the differentiated activation network is the set of all combined features and original feature vectors, C = [x_1, x_2, …, x_n], which is taken as the input of the compression-excitation network; the latter generates a weight vector corresponding to the importance of each feature, A = [a_1, a_2, …, a_n], where a_i is the weight of the i-th feature; these weights are then applied to all features to obtain the weighted feature vectors C_se = [v_1, v_2, …, v_n] ∈ R^(n×d), where each v_i = x_i a_i, i ∈ [1, 2, …, n], is the i-th adjusted feature vector; the feature vectors after weight adjustment are connected directly to an output unit, yielding the feature-importance-based differentiated activation network;

the compression-excitation network performs weight adjustment in parallel with the feature vectors and consists of three parts: compression, excitation, and weight adjustment, wherein:

the compression process converts each vector into a scalar by computing a statistic of each feature vector; specifically, using max pooling or mean pooling, the input feature vectors C = [x_1, x_2, …, x_n] are compressed into a vector of statistics Z = [z_1, z_2, …, z_n], where the scalar z_i represents the global information of the i-th feature vector; taking mean pooling as the example, z_i is computed as:

z_i = (1/d) Σ_{t=1..d} x_i[t]

after the compression process, all elements of each feature vector are averaged into a single value;
the excitation process learns the importance of each feature from the statistics vector Z, using two fully connected neural network layers to learn the weights; the first fully connected layer performs dimension reduction: with the reduction ratio set to r, it compresses the n input statistics into n/r values to reduce computation, using σ_1 as the nonlinear activation function; the second fully connected layer restores the dimension to n and uses σ_2 as the nonlinear activation function; the weight of each feature vector is thus computed as:

A = F_ex(Z) = σ_2(W_2 σ_1(W_1 Z))

where W_1 ∈ R^((n/r)×n) and W_2 ∈ R^(n×(n/r)) are the parameters of the first and second fully connected layers respectively, A ∈ R^n is the weight vector of the feature vectors, and r is the reduction ratio; the performance under various values of the reduction ratio r is tried, finally arriving at a balance between overall performance and computation; the fully connected layers are used in the excitation process so that the true weights are trained from the correlations between the features: the compressed output of each batch of samples does not itself represent the weights by which the true features should be adjusted, and the true weights are trained on all the data, so a fully connected network is required for the training;
finally, the compression-excitation network performs weight adjustment on all the feature vectors: the feature matrix C and the weight vector A are multiplied element-wise to obtain the adjusted feature vectors C_se = [v_1, v_2, …, v_n], computed as:

C_se = F_scale(C, A) = [a_1 x_1, a_2 x_2, …, a_n x_n]

through the compression-excitation network, the importance of the feature vectors is learned dynamically, so that the weights of important features are increased and the weights of task-irrelevant features are decreased; finally, the weight-adjusted feature vectors C_se are connected directly to the output unit, yielding the feature-importance-based differentiated activation network, which is a shallow model without a deep neural network structure.
8. The method of claim 2, wherein in step (6), the click-rate estimate of the joint model is:

ŷ = σ(w^T [a, x^k] + b)

where ŷ is the click-rate estimate, σ denotes the sigmoid function, a is the output of the differentiated activation network FiDAN, x^k is the output of the deep neural network module, and w and b are the parameters of the output unit.
CN202010342981.3A 2020-04-27 2020-04-27 Click rate estimation method based on feature differentiation learning Pending CN111563770A (en)


Publications (1)

Publication Number Publication Date
CN111563770A true CN111563770A (en) 2020-08-21




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination