CN111246256B

CN111246256B - Video recommendation method based on multi-mode video content and multi-task learning

Info

Publication number: CN111246256B
Application number: CN202010108302.6A
Authority: CN
Inventors: 史景伦; 邓丽; 梁可弘; 傅钎栓; 林阳城
Original assignee: Guangzhou Menghui Robot Co ltd; South China University of Technology SCUT
Current assignee: Guangzhou Menghui Robot Co ltd; South China University of Technology SCUT
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2021-05-25
Anticipated expiration: 2040-02-21
Also published as: CN111246256A

Abstract

The invention discloses a video recommendation method based on multi-modal video content and multi-task learning, which comprises the following steps: extracting visual, audio and text characteristics of the short video through a pre-trained model; fusing the multi-modal characteristics of the video by adopting an attention mechanism method; learning characteristic representation of the social relationship of the user by adopting a deep walking method; providing a deep neural network model learning multi-domain feature representation based on an attention mechanism; embedding the features generated based on the steps into a shared layer serving as a multitask model, and generating prediction results through a multilayer perceptron. The method utilizes an attention mechanism to combine with the user characteristics to fuse the multi-modal characteristics of the video, so that the whole recommendation is richer and personalized; meanwhile, aiming at multi-domain features, the importance of interactive features in recommendation learning is considered, and a deep neural network model based on an attention mechanism is provided to enrich the learning of high-order features and provide more accurate personalized video recommendation for users.

Description

Video recommendation method based on multi-mode video content and multi-task learning

Technical Field

The invention relates to the technical field of network videos and recommendation systems, in particular to a video recommendation method based on multi-mode video content and multi-task learning.

Background

With the rapid popularization of intelligent mobile terminals and the development of multimedia technology, videos gradually become carriers of information dissemination, short videos rapidly rise in recent years, videos become a main mode of people's entertainment, and the interests of users are more widely expressed. The sudden increase of the number of short videos brings about a serious information overload problem, and how to find videos in which users are interested from massive data becomes a hot topic and a research object. A good recommendation system can help consumers to find interesting and even potentially interesting videos more quickly and conveniently and can also help content providers to improve profits and user stickiness, so that the recommendation system becomes an important standard for measuring various large video platforms in the last decade.

Two important challenges are faced by current short video recommendation techniques: (1) at present, most recommendation algorithms recommend based on user preferences and user behaviors, neglect the content of articles, and have a serious cold start problem, so that most videos are neglected. However, the metadata of the micro-video is uploaded by the user, which may be inaccurate for the video, and how to effectively utilize the multi-modal information of the video becomes a significant challenge for video recommendation. (2) The recommendation model of the single task cannot meet the current demand for multiple tasks, and whether the user watches the video or not needs to be predicted in video recommendation, and behaviors of the user such as rating, praise and forwarding need to be predicted. The effective multi-task model not only can reduce the model training cost, but also can improve the model prediction of all tasks.

Disclosure of Invention

The invention aims to solve the defects in the prior art, and provides a video recommendation method based on multi-modal video content and multi-task learning, which realizes more personalized recommendation by fusing the multi-modal video content. The method pays attention to the content of the video, the information relationship between the user and the short video becomes more complex due to the multi-mode content, meanwhile, the multi-mode information of the short video provides richer information for the whole recommendation system, and the cold start problem can be effectively avoided.

The purpose of the invention can be achieved by adopting the following technical scheme:

a video recommendation method based on multi-modal video content and multi-task learning comprises the following steps:

s1, analyzing video content by adopting a deep learning technology, respectively extracting static characteristics and dynamic characteristics of a video to form visual characteristics through an inclusion-V3 model and a 3-dimensional convolutional neural network, and extracting audio characteristics through a VGGish model; obtaining video text characteristics by counting the frequency of each word in the video title appearing in the video text word bank;

s2, learning the weight of each modal feature (including visual feature, audio feature and video text feature) of the video by adopting an attention mechanism, and finally weighting each modal feature to obtain a video feature representation;

s3, forming a user-video social network by taking the user and the video as nodes, learning vector representation of a vertex (namely the user) in the network by a deep walking method, and taking the vector representation as characteristic representation of the social relationship of the user;

s4, in multiple fields, learning effective feature combinations based on an attention mechanism, splicing and fusing the feature combinations with original features to be used as input of a deep neural network, and learning multi-domain feature representation;

s5, embedding the features generated based on the above steps as a shared part among tasks in multi-task learning, and generating prediction results by keeping output layers specific to the tasks.

Further, the step S1 includes:

s11, extracting static characteristics of each frame of video by using a pre-trained classic image processing model inclusion-V3 model for video frame extraction, and finally fusing the information of each frame through an average pooling layer to serve as the static characteristics of the video; extracting dynamic characteristics of the video by using a 3-dimensional convolutional neural network;

s12, extracting audio information, and extracting audio features of the video by using a pre-trained VGGish model;

s13, extracting video title information, and counting the frequency of each word appearing in the video text lexicon as video text characteristics;

and S14, performing dimensionality reduction on the video static feature, the video dynamic feature and the audio feature by adopting a Principal Component Analysis (PCA) method, and splicing the video static feature and the video dynamic feature into a visual feature.

Further, in step S2, learning the dependency relationship between the user and each modality of the video through the attention mechanism, assigning a corresponding weight to each modality, and performing weighted summation on each modality feature to obtain a final feature representation of the video, which includes the following specific steps:

s21, fusing user characteristics, learning the dependency relationship between the user and each mode of the video through an attention mechanism, namely learning the weight of the user for distributing visual characteristics, audio characteristics and video text characteristics, and calculating by the following formula:

where m e { v, a, t } represents a visual modality, an auditory modality, and a text modality, respectively,

the attention scores obtained for the respective modes are normalized to obtain a^mIndicating the user's preference (i.e., weight) for each modality,

then visual, auditory and textual features extracted from the video are represented, e_UIs a feature of the user that,

and

is a trainable parameter of the attention network, b denotes bias;

and S22, weighting the characteristics of each mode of the video and obtaining the final characteristic representation of the video.

Further, in step S3, learning potential feature representations of the user society by using a deep walking method specifically includes: the method comprises the steps that a user and a video are used as nodes to form a user-video network, connecting lines between the user nodes and the video nodes show that the user watches the video, a node sequence generated by random walk in the user-video network is compared with a sentence (Word sequence), and the feature representation of each Word, namely the social relation feature representation of the node (user), is learned through a classic Word2Vec model in natural language processing.

Further, in step S4, a deep neural network model based on the attention mechanism is proposed, and the original multi-domain features and the attention-based interaction features are embedded as the input of the deep neural network, so as to enrich the learning of the high-order features of the neural network, and the specific process is as follows:

s41, carrying out one-hot coding on discrete data in the multi-domain features, then reducing the dimension through an embedded matrix, splicing all continuous features, and converting the spliced continuous features into vectors with the same dimension through one layer of full connection;

s42, learning effective feature combinations by adopting an attention mechanism, learning interactive weights among different features by the attention mechanism, wherein the weight calculation formula is as follows:

e_inter＝[a_0,0e₀e₀,a_0,1e₀e₁,…a_i,je_ie_j]，

wherein e_iIs the ith feature, e_ie_jThe feature element level multiplication is carried out, the dimension is unchanged after the multiplication,

the attention score of the ith feature and the jth feature interaction is obtained, and the attention score is normalized to obtain a_i,jWeight representing feature interaction, e_interA cross feature formed by two-by-two interaction of multi-domain features, W, h is a trainable parameter of the attention network, and b represents an offset;

and S43, splicing and fusing the original features and the attention-based cross features thereof to be used as input, and outputting the result through a multilayer perceptron to be used as the final representation of the multi-domain features.

Further, in step S5, the multi-modal video features learned in step S2, the social features learned in step S3, and the multi-domain features learned in step S4 are merged together to be used as a part shared by each task in the multi-task learning, parameters not shared are trained by the multi-layer perceptron corresponding to each task, and finally the task prediction result is output through a sigmoid function.

Compared with the prior art, the invention has the following advantages and effects:

the method disclosed by the invention utilizes an attention mechanism to combine with the user characteristics to fuse the multi-modal characteristics of the video, so that the whole recommendation is richer and personalized; meanwhile, aiming at multi-domain features, in consideration of the importance of interactive features in recommendation learning, the invention provides a deep neural network model based on an attention mechanism, which enriches the learning of high-order features and provides more accurate personalized video recommendation for users; in the multi-task learning, the learned feature representation is shared by multiple tasks, and the tasks are learned together, so that the overall parameter scale is reduced, and the requirements on multi-task recommendation in the industrial and living fields are better met

Drawings

FIG. 1 is a flow chart of a disclosed video recommendation method based on multimodal video content and multitask learning;

FIG. 2 is a schematic diagram of the structure of video multi-modal feature extraction and attention mechanism fusion features introduced in the present invention;

FIG. 3 is a diagram illustrating the structure between a user and a video in the present invention;

FIG. 4 is a schematic diagram of a depth neural network prediction model based on an attention mechanism according to the present invention;

FIG. 5 is a schematic diagram of a video recommendation structure for multitask learning in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Examples

Fig. 1 is a flowchart of a video recommendation method based on multi-modal video content and multi-task learning, specifically including the following steps:

t1, video multi-modal feature extraction:

a. and extracting video frames, namely intercepting video frame pictures through an opencv video reading class cv2.video Capture, storing the video frame pictures under the path folder, wherein the frame number starts from 0, and intercepting each frame picture of the video without adopting frame skipping interception by considering the short and precise characteristics of the short video.

b. Video static feature extraction, namely, after the size of each frame of picture of a video is adjusted to [299,299], the picture is respectively input into a pre-trained inclusion-V3 network, as shown in fig. 1, the input is mapped into feature vectors with 2048 dimensions, the feature vectors are used as static original feature vectors of the video frames, and in order to keep the information of each frame of the video, each video frame is subjected to an average pooling layer to extract the static feature representation of the video.

c. Video dynamic feature extraction, wherein a 3D-CNN (3-dimensional convolutional neural network) is very suitable for space-time feature learning, each video randomly extracts continuous 16 frames to form input samples, the sizes of convolution kernels are 3 multiplied by 3 through five groups of convolution layers and pooling layers, the sizes of pooling kernels are 2 multiplied by 2, 4096-dimensional features are output through two layers of full connection, and 487-dimensional dynamic video features are output through one layer of full connection.

d. Audio feature extraction: and extracting audio information from the video, extracting a 128-dimensional feature vector through a pre-trained VGGish network model, and fusing the audio information of each frame through an average pooling layer to output the audio feature of the video.

e. Text feature extraction: extracting text information of the video, performing word segmentation by using a common Chinese word segmentation tool jieba, giving independent serial numbers to each word, counting the frequency of each word appearing in the video text lexicon, and finally performing feature standardization processing.

Where id is the number of the vocabulary set, n_ijThe number of times of the ith word in the jth video text is shown, and the denominator is the sum of the number of times of all the words in the video text.

f. Reducing the extracted static and dynamic characteristics of the video into 32-dimensional characteristics by PCA, and splicing the static and dynamic characteristics of the video and the dynamic characteristics of the video to form visual characteristics; and reducing the extracted audio features into 64-dimensional features by PCA.

T2, video multi-modal feature fusion:

the invention integrates the attention mechanism with the user characteristics to learn the dependency relationship between the user and each mode, and distributes different weights to each mode. The whole multi-modal feature extraction and fusion structure is shown in the attached figure 2, the extracted visual features, audio features and text features are subjected to feature fusion through an attention mechanism, and the feature fusion process is as follows:

a. learning the weights assigned by the user to visual, audio, video-text features, a^mRepresenting the learning of the preference (i.e. weight) of the user in each modality, the calculation flow is as follows:

wherein e_VRepresenting a video feature, e_UIs a user characteristic, m e { v, a, t } represents a visual modality, an auditory modality, and a text modality,

then visual, auditory and textual features extracted from the video are represented,

and

is the weight of the attention network and b represents the bias.

b. The features of the video are represented as a weighted sum of the features of the respective modalities:

experiments prove that the effect of fusing the video multi-modal characteristics is better than that of only adopting user characteristics, context characteristics and the like, and different modal characteristic weights are distributed through an attention mechanism on the basis, so that the whole recommendation model is more personalized.

T3, social feature learning:

the potential representation of the social relationship between each user and each video in the graph is learned through a deep walking method, for example, as shown in fig. 3, a user-video network is formed by taking the users and the videos as nodes, lines between the nodes represent that the users have viewed the videos, and a Word vector model Word2Vec in NLP natural language processing is used in network representation by means of the fact that a distribution rule of random walk in the network and a rule of sentence sequences in NLP natural language processing appearing in a corpus have similar power law distribution characteristics.

In a user-video network with u_iGenerating node sequences for the random walk of the root node, each sequence being a sentence (transmit), as follows:

wherein u is_iDenotes the ith user, v_iRepresenting the ith video.

Learning social characteristics of a user using the Word2Vec model in classical natural language processing is represented as: e.g. of the type_s＝Word2Vec([sentence1,sentence2…,sentencem],size＝64)。

T4, multi-domain feature learning:

the multi-domain characteristics comprise user side information such as user id, user gender, user age and the like; the invention provides an attention-based deep neural network estimation model for making full use of basic characteristics of item side information such as video id watched by a user, video author id, whether to click and the like, wherein the model is shown in figure 4, and the specific steps comprise:

a. characteristic embedding: the features are divided into two types, one is sparse features generated by using single-hot coding for the features of the type and the id type, and the other is numerical continuous features. Each sparse feature e_sparse,iConverting the feature vector into a feature vector e with dimension n by an embedding method_i(ii) a Each successive feature e_dense,iAfter being spliced, the vector e with the same dimensionality is converted into through one layer of full connection_dense. The feature embedding calculation process is as follows:

e_i＝W_ie_sparse,i，

e_dense＝FC([e_dense,0,e_dense,1,…])，

wherein the parameter W_iFor embedding the matrix, FC (-) represents the full-connected function, and the features are spliced into the original feature e after being embedded_origin。

b. Extracting initial features: except for embedding the original feature representation e_originBesides, the feature interaction information is also very important for click prediction, so that an attention mechanism is adopted to provide an effective feature interaction representation for the deep neural network bottom layer, and the specific calculation process is as follows:

e_inter＝[a_0,0e₀e₀,a_0,1e₀e₁,…a_i,je_ie_j]，

the attention score of the ith feature and the jth feature interaction is obtained, and the attention score is normalized to obtain a_i,jWeight representing feature interaction, e_interThe cross feature formed by the two-by-two crossing of the multi-domain features, W, h is a parameter of the attention network, and b represents the bias.

c. Fully connecting the neural networks: the original features and the cross features thereof are used as input layers of the fully-connected neural network, and the result output by the multilayer perceptron is expressed as multi-domain features.

T5, multitask sequence learning:

shown in FIG. 5The video recommendation structure diagram for multitask learning is based on the video feature e generated in the above step T1_VStep T2 generated social feature e_sAnd the multi-domain feature e generated in step T3_fThe shared layer of the multi-task model shares learned feature expressions of different tasks, and each task trains parameters which are not shared among the tasks through a multi-layer perceptron respectively, and the calculation process is as follows:

the calculation formula of the prediction probability is as follows:

wherein e_inputFor shared feature inputs in the multitask model, sigma is sigmoid function, and the function is defined as

H is the number of hidden layers corresponding to the kth task,

represents the hidden layer training parameters of the H-th layer,

represents the output of the H-th layer hidden layer,

representing the k-th task prediction result.

The calculated loss function is as follows:

model training adjusts the whole network parameters by using an Adam optimization algorithm through back propagation, wherein each time of the back propagation and the forward propagation is defined as an epoch, and iteration is carried out until the output prediction result does not obviously change or the specified iteration number is reached.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A video recommendation method based on multi-modal video content and multi-task learning is characterized by comprising the following steps:

s1, analyzing video content by adopting a deep learning technology, respectively extracting static characteristics and dynamic characteristics of a video to form visual characteristics through an inclusion-V3 model and a 3-dimensional convolutional neural network, extracting audio characteristics through a VGGish model, and obtaining video text characteristics by counting the frequency of each word in a video title appearing in a video text word stock;

s2, learning the weight of each modal feature of the video by adopting an attention mechanism, and finally weighting each modal feature to obtain a video feature representation, wherein each modal feature of the video comprises a visual feature, an audio feature and a video text feature;

s3, forming a user-video social network by taking the user and the video as nodes, learning vector representation of a vertex in the network through a deep walking method, and taking the vector representation as characteristic representation of the user social relationship, wherein the vertex of the user-video social network represents the user;

s4, learning effective feature combinations based on an attention mechanism, splicing and fusing the feature combinations with original features to serve as input of a deep neural network, and learning multi-domain feature representation;

wherein, the step S4 includes:

e_inter＝[a_0,0e₀e₀,a_0,1e₀e₁,…a_i,je_ie_j]，

s43, splicing and fusing the original features and the attention-based cross features thereof to be used as input, and taking the result output by the multilayer perceptron as the final representation of the multi-domain features;

2. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S1 comprises:

s11, extracting static characteristics of each frame of video by using a pre-trained classic image processing model increment-V3 model for video frame extraction, finally fusing the information of each frame through an average pooling layer to serve as the static characteristics of the video, and extracting dynamic characteristics of the video by using a 3-dimensional convolutional neural network;

and S14, reducing the dimensions of the video static feature, the video dynamic feature and the audio feature by adopting a PCA method, and simultaneously splicing the video static feature and the video dynamic feature to form the visual feature.

3. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S2 comprises:

the attention scores obtained for the respective modes are normalized to obtain a^mRepresenting the user's preference for each modality, which is equivalent to the user's weight for each modality，

and

is a trainable parameter of the attention network, b denotes bias;

4. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S3 comprises:

the user and the video are used as nodes to form a user-video network, connecting lines between the user nodes and the video nodes show that the user watches the video, a node sequence generated by random walk in the user-video network is compared as a sentence, and the Word feature representation, namely the social relation feature representation of the node, is learned through a Word2Vec model.

5. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S5 is performed as follows:

and (3) splicing and fusing the multi-modal video features learned in the step S2, the social features learned in the step S3 and the multi-domain features learned in the step S4 to form a part shared by all tasks in multi-task learning, training parameters which are not shared through multi-layer perceptrons corresponding to all tasks respectively, and finally outputting a task prediction result through a sigmoid function.