CN112199550B

CN112199550B - Short video click rate prediction method based on emotion capsule network

Info

Publication number: CN112199550B
Application number: CN202010937121.4A
Authority: CN
Inventors: 吴健; 顾盼; 韩玉强; 高维
Original assignee: Shandong Industrial Technology Research Institute of ZJU
Current assignee: Shandong Industrial Technology Research Institute of ZJU
Priority date: 2020-09-08
Filing date: 2020-09-08
Publication date: 2023-05-19
Anticipated expiration: 2040-09-08
Also published as: CN112199550A

Abstract

The invention belongs to the technical field of internet service, and particularly relates to a short video click rate prediction method based on an emotion capsule network. A short video click rate prediction method based on an emotion capsule network comprises the following steps: s1, dividing a user behavior sequence into a block sequence; s2, extracting module features from the user block sequence and the target short video by adopting a gate mechanism; s3, using an emotion capsule network to obtain emotion characterization of a user on a target short video; s4, predicting the click rate of the user on the target short video according to the emotion characteristics; s5, designing a loss function according to the model characteristics; s6, updating model parameters by adopting an Adam optimizer. The invention provides a short video click rate prediction method based on an emotion capsule network, which is used for judging emotion of a user to each module of a target short video by utilizing positive feedback and negative feedback data of the user in a short video platform and predicting the click rate of the user to the current short video in a fine granularity.

Description

Short video click rate prediction method based on emotion capsule network

Technical Field

The invention belongs to the technical field of internet service, and particularly relates to a short video click rate prediction method based on an emotion capsule network.

Background

Short video is a new type of video with shorter time. The shooting of short videos does not require the use of specialized equipment nor specialized skills. The user can shoot and upload the short video platform conveniently by the mobile phone, so that the short video quantity of the short video platform is increased very rapidly. The method and the system have the advantages that the demand for an effective short video recommendation system is urgent, and the effective short video recommendation system can improve user experience and user viscosity, so that huge commercial value is brought to a platform.

In recent years, many researchers have proposed personalized recommendation methods based on videos. These methods can be divided into three categories: collaborative filtering, content-based recommendation, and hybrid recommendation methods. But short videos have different characteristics than videos: the descriptive text is of lower quality, shorter duration and longer sequence of interactions by the user over a period of time. Thus, short video recommendations are a more challenging task and researchers have proposed some approaches. For example, wei et al use a graph convolution structure to fuse multimodal information of short videos, thereby better simulating user preferences; chen et al use a hierarchical attention mechanism to calculate the importance of both items and categories to get more accurate predictions.

Although these methods achieve good results, these structures only use the user's positive feedback information and ignore the negative feedback information. Negative feedback information refers to the behavior that the user looks at the cover of the short video, but does not click to watch. In recent research results, li et al combine positive and negative feedback data and use a graph-based recurrent neural network to model, ultimately resulting in user preferences. But they calculate the user's preference for the short video as a whole, in fact the video has many modules and the user has different emotions and preferences for the different modules of the video. And they directly weight sum the predicted results of positive and negative feedback without fine granularity discussing the specific effects of positive and negative feedback.

Disclosure of Invention

The invention aims to solve the technical problem of providing a short video click rate prediction method based on an emotion capsule network, which is used for judging emotion of a user to each module of a target short video by utilizing positive feedback and negative feedback data of the user in a short video platform and predicting the click rate of the user to the current short video in a fine granularity manner. For this purpose, the invention adopts the following technical scheme:

a short video click rate prediction method based on an emotion capsule network comprises the following steps:

s1, dividing a user behavior sequence into a block sequence;

s2, extracting module features from the user block sequence and the target short video by adopting a gate mechanism, wherein the module features are the user block sequence module features and the target short video module features respectively;

s3, obtaining emotion characterization of a user on a target short video by using an emotion capsule network, extracting emotion characteristics from low-level module characteristics through the capsule network, and analyzing emotion of the user on different modules of the short video to predict click rate, wherein the emotion is classified into positive, negative and neutral;

the positive emotion data is derived from a positive feedback sequence of a user, the negative emotion data is derived from a negative feedback sequence of the user, and the neutral emotion data is derived from a common part of the positive feedback sequence and the negative feedback sequence;

s4, predicting the click rate of the user on the target short video according to the emotion characteristics;

s5, designing a loss function according to the model characteristics;

s6, updating model parameters by adopting an Adam optimizer.

Wherein the short video is composed of finer granularity modules (e.g., video scenes, video topics, video emotions); the emotions are classified into positive (positive), negative (negative) and neutral (neutral), and are demonstrated by the first 3 letters of english for three emotions, respectively, in the following formulas.

Further, the positive emotion data is derived from the positive feedback sequence of the user, the negative emotion data is derived from the negative feedback sequence of the user, and the neutral emotion data is derived from the common part of the positive feedback sequence and the negative feedback sequence.

On the basis of adopting the technical scheme, the invention can also adopt the following further technical scheme:

the step S1 further includes:

s11, using a window with length w to sequence clicking behaviors of a user

Dividing into m blocks, each block characterizing a meterThe calculation method is as follows:

wherein ,

preference characteristics at the kth block for the user;

s12, processing negative feedback information of the user in the same mode to obtain negative feedback block representation

The calculation method of the step S2 is as follows:

wherein ,

representing the ith module feature of the kth block, W _i,1 and W_i，2 Is the transfer matrix of the i-th module,

is the bias vector of the ith block, σ is the sigmoid activation function, and Σ is the multiplication of element levels, +.>

Is a preference feature of the kth block in the click sequence, q _i Is a representation of the ith module and q _i Shared for all users. The number M of modules of the short video is a super parameter.

The step S2 further includes:

s21, after each module vector representation of each block is obtained, the same module information in all the blocks is aggregated by adopting an average pool:

where M is the number of blocks, the formula derives M module features from the positive feedback sequence

S22, obtaining M module characteristics from the negative feedback sequence by adopting the same method

Obtaining M module features from a target short video>

The method extracts positive emotion from the click sequence and negative emotion from the non-click sequence. Further, some module features often appear in both click and un-click sequences of the user, and the user holds neutral feelings about such modules. Therefore, the step S3 further includes:

s31, matching the module features extracted from the user sequence with the module features of the target short video one by one to form an activating unit:

wherein ,

is the i-th module feature of the target short video,/->

Is the ith module feature of the user positive feedback sequence, as is the multiplication of the element dimensions, g is the activation function;

s32, obtaining an activation unit of the negative feedback sequence by adopting the same method

S33, extracting emotion characteristics from an activation unit extracted by positive feedback by adopting an emotion capsule network:

wherein s.epsilon.pos, neu,

is the conversion matrix of the ith activation unit of the positive feedback sequence to the emotion capsule s; front emotion capsule v _pos By->

Is obtained by a weighted sum of (2); />

Is the connection coefficient, representing

Weight of->

The parameters are updated by adopting a dynamic routing algorithm;

g is a vector activation function (squash activation) commonly used in capsule networks, the formula is as follows:

wherein, the sum of the values represents the length of the vector;

s34, extracting emotion characteristics from an activation unit extracted by negative feedback by adopting an emotion capsule network:

negative emotion capsule v _neg Equal to:

wherein s is { neg, neu };

s35, extracting from positive feedback and negative feedback sequences to obtain a neutral emotion capsule:

further, the connection coefficient is improved

The updating method of (1) comprises the following steps:

s301, increasing a temperature coefficient, and improving a dynamic routing coefficient by increasing the temperature coefficient

The formula is as follows: />

Wherein p is ∈ { +, - }, and s is { pos, neg, neu };

is the connection coefficient of the input capsule i to the output capsule s and is initialized to 0; τ is the temperature coefficient; when τ is 0 ⁺ Output emotion capsules tend to focus on only one input capsule; and when tau-infinity, the effect of the input capsule on the output emotion capsule tends to be consistent.

S302, considering importance degrees of different modules of the short video, the length of the activation unit can be used for describing the importance degrees of the modules. Thus, according toDynamic routing coefficient correction for importance degree of different short video modules

The formula is as follows:

wherein ,

is the length of the activation unit, which can account for the importance of the module, p e { +, - }, and s e { pos, neg, neu }.

Said step S4 further comprises a given emotion capsule v _s Calculating the probability of clicking the target short video by the user:

wherein s.epsilon.pos, neg, neu,

and />

Is a transfer matrix->

Is the offset vector, b _s,2 Is an offset scalar; sigma is a sigmoid activation function, b _u Is the bias of the user dimension, vs is the length of the vector, representing the confidence of the emotion.

The step S5 includes:

s51, predicting value of click rate of target short video by user

Calculating predictive value +.>

And the true value y to update the model parameters using the error; a cross-entropy loss function (cross-entropy loss) is used to guide the update process of model parameters:

wherein y epsilon {0,1} is a true value representing whether the user clicked on the target short video; sigma is a sigmoid function;

s52, in order to ensure that the emotion capsule network can accurately capture emotion with fine granularity, an edge loss function L is increased _stm And an inconsistent loss function L _asp As a regularization term, the cross entropy loss function is combined with two regularization terms to obtain a complete loss function:

the edge loss function L _stm The calculation formula is as follows:

wherein ,

representing all of the dataset<User, short video>Pairing; when the true value y=1, v _s ＝v _pos The method comprises the steps of carrying out a first treatment on the surface of the Otherwise, v _s ＝v _neg ；/>

Representing the reverse emotion of emotion capsule s;

the disagreement regularization (disagreement regularization) leads M module vectors q to trend disagreement, different module characteristics are extracted from the short video as much as possible, and the disagreement regularization term L _asp The calculation formula is as follows:

where M is the number of module vectors q.

The beneficial technical effects of the invention are as follows:

(1) The invention provides a short video click rate prediction method based on an emotion capsule network, which utilizes a gate mechanism and the emotion capsule network to analyze different emotions of a user to different modules of a short video from positive feedback information and negative feedback information of the user so as to obtain more accurate prediction.

(2) The invention provides a novel emotion capsule network framework, which extracts positive, negative and neutral emotion of a user from different modules of a short video sequence of the user. And improvements are made to the framework and dynamic routing algorithms of the capsule network.

(3) A large number of experiments are carried out based on the two pieces of disclosed short video data, and the method has better effect than the latest method in the same data.

Drawings

FIG. 1 is a schematic flow chart of a short video click rate prediction method based on an emotion capsule network;

FIG. 2 is a model frame diagram of a short video click rate prediction method based on an emotion capsule network;

FIG. 3 is a diagram of the positive feedback and negative feedback information of a user of a short video click rate prediction method based on an emotion capsule network.

Detailed Description

In order to further understand the present invention, the short video click rate prediction method based on the emotion capsule network provided by the present invention is specifically described below with reference to the specific embodiments, but the present invention is not limited thereto, and the technical personnel in the field make insubstantial improvements and adjustments under the core guiding concept of the present invention, and still fall within the protection scope of the present invention.

The short video click rate prediction task is to build a model to predict the probability of a user clicking on a short video.

The historical sequence of the user is expressed as

Where p ε { +, - } represents click and no click behavior, x _j Representing the jth short video, l is the length of the sequence. The whole sequence can be further subdivided into click sequences +.>

And the click-free sequence->

I.e. positive feedback and negative feedback information. Thus, the short video click rate prediction problem can be expressed as: input user click sequence +.>

Click sequence->

Target short video x _new To predict user-to-target short video x _new Is a click rate of (a) is provided.

Therefore, the invention provides a short video click rate prediction method based on an emotion capsule network. According to the positive feedback and negative feedback information of the user on the short video, the emotion of the user on different modules of the short video is mined, and the click rate of the user on the target short video is predicted. Positive feedback here means that the user clicks on a short video; negative feedback means that the platform shows the cover of the short video, but the user does not click on the short video, indicating that the user is not interested in the short video. The method considers that the user has different emotions (sentiments) on different modules (aspect) of the short video, predicts the click rate of the user on the target short video, and should analyze the preference of the user on the different modules of the short video. As shown in fig. 3, there are three modules for a short video: video scene, video theme, video emotion. From the clicking and non-clicking actions of the user, we can find that the user likes the beauty related video theme and the positive video emotion, but dislikes the animal related theme and the negative video emotion. Further, the user maintains a neutral attitude to the video scene.

The method mainly consists of three parts, as shown in fig. 2. The first part is to extract the module feature (aspect) from the user positive feedback and negative feedback information by using a gate mechanism, and simultaneously extract the module feature from the target short video by using the same method. And the second part is to pair the two module features one by one to form an activation unit, input the activation unit into an emotion capsule network and forecast the emotion of a user to different modules of the short video. The third part is to predict the click rate of the short video based on the emotional characteristics of the user on the current short video.

As shown in fig. 1, according to one embodiment of the invention, the method comprises the steps of:

s100, dividing the user behavior sequence into block (block) sequences. Click behavior sequence for a user

Can be expressed as +.>

wherein />

Is the cover map feature vector of the short video, and d is the feature vector length. Because the short video duration is short, the behavior sequence of the user is longer. Therefore, the method uses a window with length w to divide the sequence X ⁺ The short video of user interactions in one block tends to be relatively similar, divided into m blocks.

At this time, each block feature characterization is calculated as follows:

wherein ,

is a preference feature of the user at the kth block. Although some of the information is lost by summing, the main information is preserved, which is what we need. The method adopts the same mode to process the negative feedback information of the user, and obtains the block representation +.>

S200, extracting module features from a user block sequence and a target short video, the short video being composed of finer granularity modules (e.g., video scenes, video topics, video emotions).

The method adopts the door mechanism extraction module characteristic, and the following formula is the ith module for extracting the kth block:

wherein ,W_i,1 and W_i,2 Is the transfer matrix of the i-th module,

is the bias vector for the i-th module. Sigma is the sigmoid activation function, +.. />

Is a preference feature of the kth block in the click sequence, q _i Is a representation of the ith module and q _i Shared for all users. The number M of modules of the short video is a super parameter, and the number M is set to be 5 through experimental verification in the method. After each module vector representation of each block is obtained, the patentThe same module information in all blocks is aggregated by using an average pool (average pool):

where m is the number of blocks. Finally we can obtain M from the positive feedback sequence

Module features. By the same way, M module features can be obtained from the negative feedback sequence>

M module features can be obtained from the target short video +.>

S300, using an emotion capsule network to obtain emotion characterization of a user on a target short video. For a target short video, the click rate is predicted by analyzing the emotion of a user to different modules of the short video. The method adopts a capsule network to extract emotion characteristics from low-level module characteristics. There are three kinds of emotion, namely: positive (positive), negative (negative), and neutral (neutral). The method extracts positive emotion from the click sequence and negative emotion from the non-click sequence. Further, some module features often appear in both click and un-click sequences of the user, and the user holds neutral feelings about such modules. Such emotions also play a role in prediction, especially when negative and positive emotions are not apparent.

Firstly, the module features extracted from the user sequence and the module features of the target short video are paired one by one to form an activation unit:

wherein ,

is the i-th module feature of the target short video,/->

Is the i-th module feature of the user positive feedback sequence, +.. g is a vector activation function (squarish) commonly used in capsule networks, the formula is as follows:

wherein, representing the length of the vector. By the same method, the activation unit of the negative feedback sequence can be obtained

And extracting emotion characteristics from the extracted active units of positive feedback and negative feedback by adopting a capsule network:

wherein s.epsilon.pos, neu,

is the transition matrix of the ith activation element of the positive feedback sequence to the emotion capsule s, from which positive and neutral emotion is obtained as can be seen from the formula. Front emotion capsule v _pos By->

Is obtained by a weighted sum of (a).

Also, g is the activation function (squarish) commonly used in capsule networks, and

is a connection coefficient representing->

Weight of->

The parameters are updated using a dynamic routing algorithm.

Likewise, negative emotion capsule v _neg Equal to:

where s.epsilon { neg, neu }.

And the neutral emotion capsule is extracted from positive feedback and negative feedback sequences:

therefore, positive emotion data is derived from the positive feedback sequence of the user, negative emotion data is derived from the negative feedback sequence of the user, and neutral emotion data is derived from a common part of the positive feedback sequence and the negative feedback sequence. This structure is also innovative to the present method, unlike the full connectivity of traditional capsule networks. To increase the differentiation between emotion capsules, we further improve

Is updated according to the update mode of the system. The modification has two points, the first point is to increase the temperature coefficient:

where p ε { +, - }, and s ε { pos, neg, neu }.

Is the connection coefficient of the input capsule i to the output capsule s and is initialized to 0.τ is the temperature coefficient, and experiments have shown that τ=0.8 works best in this data. When τ is 0 ⁺ Output emotion capsules tend to focus on only one input capsule; and when tau-infinity, the effect of the input capsule on the output emotion capsule tends to be consistent.

The second point is the importance of the modules for which the length of the activation unit can be stated, taking into account the importance of the different modules of the short video. We use the length of the activation unit to correct

wherein ,

is the length of the activation element, p e { +, - }, and s e { pos, neg, neu }.

S400, predicting the click rate of the user on the target short video according to the emotion characteristics. Given emotion capsule v _s The probability of clicking the target short video by the user is calculated as follows:

wherein s.epsilon.pos, neg, neu,

and />

Is a transfer matrix->

Is the offset vector, b _s,2 Is an offset scalar. Sigma is a sigmoid activation function, b _u Is the bias of the user dimension, ||v _s The i is the length of the vector and represents the confidence of the emotion. />

S500, designing a loss function according to the model characteristics. Predicted value of click rate of target short video by user

Calculating predictive value +.>

And the true value y, and then updating the model parameters using the error. We use a cross-entropy loss function (cross-entropy loss) to guide the update process of model parameters:

where y e {0,1} is a true value representing whether the user clicked on the target short video. Sigma is a sigmoid function.

Meanwhile, in order to ensure that the emotion capsule network can correctly capture emotion with fine granularity, an edge loss function (margin loss) is added as a regularization term:

wherein ,

representing all of the dataset<User, short video>For the pair, and according to the experimental results, we set the parameters e=0.8 and λ=0.5. Notably, when the true value y=1, v _s ＝v _pos The method comprises the steps of carrying out a first treatment on the surface of the Otherwise, v _s ＝v _neg 。/>

Representing the negative emotion of emotion capsule s.

Furthermore, the method also introduces inconsistent regularization (disagreement regularization) to lead M module vectors q to trend to be inconsistent, and different module characteristics are extracted from the short video as much as possible:

finally, combining the cross entropy loss function and two regular terms to obtain a complete loss function, wherein the complete loss function is as follows:

wherein, in our experiments, lambda _s =0.1 and λ _a =0.1. We updated model parameters using Adam optimizer.

The foregoing description of the embodiments is provided to facilitate the understanding and application of the invention to those skilled in the art. It will be apparent to those having ordinary skill in the art that various modifications to the above-described embodiments may be readily made and the generic principles described herein may be applied to other embodiments without the use of inventive faculty. Therefore, the present invention is not limited to the above-described embodiments, and those skilled in the art, based on the present disclosure, should make improvements and modifications within the scope of the present invention.

Claims

1. A short video click rate prediction method based on an emotion capsule network is characterized by comprising the following steps:

s1, dividing a user behavior sequence into a block sequence;

the step S1 further includes:

s11, using a window with length w to sequence clicking behaviors of a user

Dividing the block into m blocks, and calculating the characteristic characterization of each block as follows:

wherein ,

preference characteristics at the kth block for the user;

the calculation method of the step S2 is as follows:

wherein ,

representing the ith module feature of the kth block, W _i,1 and W_i,2 Is the transfer matrix of the ith module, +.>

Is a preference feature of the kth block in the click sequence, q _i Is a representation of the ith module and q _i Sharing for all users;

the step S2 further includes:

Obtaining M module features from a target short video>

S3, obtaining emotion characterization of a user on a target short video by using an emotion capsule network, extracting emotion characteristics from low-level module characteristics through the capsule network, and analyzing emotion of the user on different modules of the short video to predict click rate, wherein the emotion is divided into positive emotion, negative emotion and neutral emotion;

the step S3 further includes:

wherein ,

is the i-th module feature of the target short video,/->

Is the ith module feature of the user positive feedback sequence, as is the multiplication of the element dimensions, g is the activation function; />

wherein s.epsilon.pos, neu,

Is obtained by a weighted sum of (2); />

Is the connection coefficient, representing

Weight of->

The parameters are updated by adopting a dynamic routing algorithm;

the activation function g is a vector activation function commonly used in capsule networks:

wherein, the sum of the values represents the length of the vector;

negative emotion capsule v _neg Equal to:

wherein s is { neg, neu };

the step S3 also comprises improving the connection coefficient

The updating method of (2), the method includes:

The formula is as follows:

wherein p is ∈ { +, - }, and s is { pos, neg, neu };

is the connection coefficient of the input capsule i to the output capsule s and is initialized to 0; τ is the temperature coefficient;

s302, correcting dynamic routing coefficients according to importance degrees of different short video modules

The formula is as follows: />

wherein ,

is the length of the activation cell, p ε { +, - }, and s ε { pos, neg, neu };

said step S4 further comprises a given emotion capsule v _s The probability of clicking the target short video by the user is calculated as follows:

wherein s.epsilon.pos, neg, neu,

and />

Is a transfer matrix->

Is the offset vector, b _s,2 Is an offset scalar; sigma is a sigmoid activation function, b _u Is the bias of the user dimension, ||v _s The l is the length of the vector;

s5, designing a loss function according to the model characteristics;

the step S5 includes the steps of:

s51, predicting value of click rate of target short video by user

Calculating predictive value +.>

And the true value y to update the model parameters using the error; the cross entropy loss function is adopted to guide the updating process of the model parameters:

s52, increasing the edge loss function L _stm And an inconsistent loss function L _asp As a canonical term, the loss function is:

wherein ,λ_s and λ_a Respectively the loss function L _stm and L_asp Is a regular parameter of (2);

in the step S52, the edge loss function L _stm The calculation formula is as follows:

wherein ,

representing all of the dataset<User, short video>Epsilon and lambda are model parameters; when the true value y=1, v _s ＝v _pos The method comprises the steps of carrying out a first treatment on the surface of the Otherwise, v _s ＝v _neg ；/>

Representing the reverse emotion of emotion capsule s;

the inconsistent regularization term L _asp The calculation formula is as follows:

wherein M is the number of module vectors q;

s6, updating model parameters by adopting an Adam optimizer.