CN111246256B - Video recommendation method based on multi-mode video content and multi-task learning - Google Patents
- Publication number: CN111246256B
- Application number: CN202010108302.6A
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- user
- learning
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- H04N21/251—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- G06F16/783—Retrieval characterised by using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata automatically derived from the content, using audio features
- G06F16/7844—Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- H04N21/4666—Learning process for intelligent management characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
- H04N21/4668—Learning process for intelligent management for recommending content, e.g. movies
Abstract
The invention discloses a video recommendation method based on multi-modal video content and multi-task learning, comprising the following steps: extracting visual, audio and text features of a short video through pre-trained models; fusing the multi-modal features of the video with an attention mechanism; learning a feature representation of the user's social relationships through a deep walking method; learning multi-domain feature representations with a proposed attention-based deep neural network model; and embedding the features generated in the above steps as the shared layer of a multi-task model, with prediction results generated by multilayer perceptrons. The method fuses the multi-modal features of the video through an attention mechanism conditioned on user features, making the overall recommendation richer and more personalized. Meanwhile, for multi-domain features, considering the importance of interaction features in recommendation learning, an attention-based deep neural network model is proposed to enrich the learning of high-order features and provide users with more accurate, personalized video recommendations.
Description
Technical Field
The invention relates to the technical field of network video and recommendation systems, in particular to a video recommendation method based on multi-modal video content and multi-task learning.
Background
With the rapid popularization of intelligent mobile terminals and the development of multimedia technology, video has gradually become a carrier of information dissemination. Short videos have risen rapidly in recent years, becoming a primary mode of entertainment and expressing users' interests more broadly. The surge in the number of short videos brings a serious information-overload problem, and finding the videos a user is interested in among massive data has become a hot research topic. A good recommendation system can help consumers find interesting, and even potentially interesting, videos more quickly and conveniently, and can also help content providers improve profits and user stickiness; recommendation systems have therefore become an important yardstick for the major video platforms over the past decade.
Current short-video recommendation techniques face two important challenges: (1) Most current recommendation algorithms recommend based on user preferences and user behaviors, neglect item content, and suffer a serious cold-start problem, so that most videos are overlooked. Moreover, the metadata of a micro-video is uploaded by the user and may describe the video inaccurately, so effectively exploiting the multi-modal information of the video becomes a significant challenge for video recommendation. (2) A single-task recommendation model cannot meet the current demand for multiple tasks: video recommendation must predict not only whether a user will watch a video, but also behaviors such as rating, liking and forwarding. An effective multi-task model can both reduce model training cost and improve the predictions of all tasks.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a video recommendation method based on multi-modal video content and multi-task learning, which achieves more personalized recommendation by fusing multi-modal video content. The method attends to the content of the video: the multi-modal content makes the information relationship between users and short videos richer, the multi-modal information of short videos provides richer signals for the whole recommendation system, and the cold-start problem can be effectively alleviated.
The purpose of the invention can be achieved by adopting the following technical scheme:
a video recommendation method based on multi-modal video content and multi-task learning comprises the following steps:
S1, analyzing video content with deep learning techniques: extracting the static and dynamic features of a video through an Inception-V3 model and a 3-dimensional convolutional neural network respectively to form the visual features, and extracting audio features through a VGGish model; obtaining video text features by counting the frequency with which each word of the video title appears in the video text lexicon;
S2, learning the weight of each modality feature of the video (visual, audio and video-text features) with an attention mechanism, and finally weighting the modality features to obtain the video feature representation;
S3, forming a user-video social network with users and videos as nodes, learning vector representations of the vertices (i.e. the users) in the network by a deep walking method, and taking them as the feature representation of the users' social relationships;
S4, for the multi-domain features, learning effective feature combinations based on an attention mechanism, splicing and fusing them with the original features as the input of a deep neural network, and learning the multi-domain feature representation;
S5, embedding the features generated in the above steps as the part shared among tasks in multi-task learning, and generating prediction results through task-specific output layers.
Further, the step S1 includes:
s11, extracting static characteristics of each frame of video by using a pre-trained classic image processing model inclusion-V3 model for video frame extraction, and finally fusing the information of each frame through an average pooling layer to serve as the static characteristics of the video; extracting dynamic characteristics of the video by using a 3-dimensional convolutional neural network;
s12, extracting audio information, and extracting audio features of the video by using a pre-trained VGGish model;
s13, extracting video title information, and counting the frequency of each word appearing in the video text lexicon as video text characteristics;
and S14, reducing the dimensionality of the video static features, video dynamic features and audio features with Principal Component Analysis (PCA), and splicing the static and dynamic video features into the visual features.
Further, in step S2, learning the dependency relationship between the user and each modality of the video through the attention mechanism, assigning a corresponding weight to each modality, and performing weighted summation on each modality feature to obtain a final feature representation of the video, which includes the following specific steps:
s21, fusing user characteristics, learning the dependency relationship between the user and each mode of the video through an attention mechanism, namely learning the weight of the user for distributing visual characteristics, audio characteristics and video text characteristics, and calculating by the following formula:
where m e { v, a, t } represents a visual modality, an auditory modality, and a text modality, respectively,the attention scores obtained for the respective modes are normalized to obtain amIndicating the user's preference (i.e., weight) for each modality,then visual, auditory and textual features extracted from the video are represented, eUIs a feature of the user that,andis a trainable parameter of the attention network, b denotes bias;
and S22, weighting the features of each modality to obtain the final feature representation of the video: $e_V = \sum_{m \in \{v,a,t\}} a_m e_m$.
Further, in step S3, learning the potential feature representation of the user's social relations by a deep walking method specifically comprises: forming a user-video network with users and videos as nodes, where an edge between a user node and a video node indicates that the user has watched the video; treating a node sequence generated by a random walk in the user-video network as a sentence (word sequence); and learning the feature representation of each "word", i.e. the social-relation feature representation of each node (user), through the classic Word2Vec model from natural language processing.
Further, in step S4, a deep neural network model based on the attention mechanism is proposed, and the original multi-domain features and the attention-based interaction features are embedded as the input of the deep neural network, so as to enrich the learning of the high-order features of the neural network, and the specific process is as follows:
s41, carrying out one-hot coding on discrete data in the multi-domain features, then reducing the dimension through an embedded matrix, splicing all continuous features, and converting the spliced continuous features into vectors with the same dimension through one layer of full connection;
s42, learning effective feature combinations by adopting an attention mechanism, learning interactive weights among different features by the attention mechanism, wherein the weight calculation formula is as follows:
$\hat{a}_{i,j} = h^{T}\,\mathrm{ReLU}\big(W\,(e_i \odot e_j) + b\big)$,

$a_{i,j} = \dfrac{\exp(\hat{a}_{i,j})}{\sum_{(i',j')} \exp(\hat{a}_{i',j'})}$,

$e_{inter} = [\,a_{0,0}\, e_0 \odot e_0,\; a_{0,1}\, e_0 \odot e_1,\;\dots,\; a_{i,j}\, e_i \odot e_j\,]$,

where $e_i$ is the $i$-th feature, $e_i \odot e_j$ is element-wise feature multiplication (the dimensionality is unchanged after multiplication), $\hat{a}_{i,j}$ is the attention score of the interaction between the $i$-th and $j$-th features, normalized to obtain $a_{i,j}$, the weight of the feature interaction; $e_{inter}$ is the cross feature formed by pairwise interaction of the multi-domain features, $W$ and $h$ are trainable parameters of the attention network, and $b$ denotes the bias;
and S43, splicing and fusing the original features and the attention-based cross features thereof to be used as input, and outputting the result through a multilayer perceptron to be used as the final representation of the multi-domain features.
Further, in step S5, the multi-modal video features learned in step S2, the social features learned in step S3, and the multi-domain features learned in step S4 are merged together to be used as a part shared by each task in the multi-task learning, parameters not shared are trained by the multi-layer perceptron corresponding to each task, and finally the task prediction result is output through a sigmoid function.
Compared with the prior art, the invention has the following advantages and effects:
the method disclosed by the invention utilizes an attention mechanism to combine with the user characteristics to fuse the multi-modal characteristics of the video, so that the whole recommendation is richer and personalized; meanwhile, aiming at multi-domain features, in consideration of the importance of interactive features in recommendation learning, the invention provides a deep neural network model based on an attention mechanism, which enriches the learning of high-order features and provides more accurate personalized video recommendation for users; in the multi-task learning, the learned feature representation is shared by multiple tasks, and the tasks are learned together, so that the overall parameter scale is reduced, and the requirements on multi-task recommendation in the industrial and living fields are better met
Drawings
FIG. 1 is a flow chart of a disclosed video recommendation method based on multimodal video content and multitask learning;
FIG. 2 is a schematic diagram of the structure of video multi-modal feature extraction and attention mechanism fusion features introduced in the present invention;
FIG. 3 is a diagram illustrating the structure between a user and a video in the present invention;
FIG. 4 is a schematic diagram of the attention-based deep neural network prediction model according to the present invention;
FIG. 5 is a schematic diagram of a video recommendation structure for multitask learning in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
Fig. 1 is a flowchart of a video recommendation method based on multi-modal video content and multi-task learning, specifically including the following steps:
t1, video multi-modal feature extraction:
a. Video frame extraction: video frames are captured through OpenCV's video-reading class cv2.VideoCapture and saved under the path folder, with frame numbering starting from 0; considering that short videos are short and concise, every frame of the video is captured rather than sampling with frame skipping.
b. Video static feature extraction: after each video frame is resized to [299, 299], the frames are input into a pre-trained Inception-V3 network, as shown in fig. 1, which maps each input to a 2048-dimensional feature vector serving as the static original feature vector of the frame; to retain the information of every frame, the video frames then pass through an average pooling layer to extract the static feature representation of the video.
c. Video dynamic feature extraction: a 3D-CNN (3-dimensional convolutional neural network) is well suited to spatio-temporal feature learning. Each video contributes input samples of 16 randomly chosen consecutive frames, which pass through five groups of convolution and pooling layers with 3×3 convolution kernels and 2×2 pooling kernels, then through two fully connected layers outputting 4096-dimensional features, and finally through one fully connected layer outputting 487-dimensional dynamic video features.
d. Audio feature extraction: and extracting audio information from the video, extracting a 128-dimensional feature vector through a pre-trained VGGish network model, and fusing the audio information of each frame through an average pooling layer to output the audio feature of the video.
e. Text feature extraction: the text information of the video is extracted and segmented with the common Chinese word-segmentation tool jieba; each word is given an independent serial number, the frequency of each word's appearance in the video text lexicon is counted, and finally the features are standardized.
$tf_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}}$,

where $i$ (id) is the index of the word in the vocabulary set, $n_{i,j}$ is the number of times the $i$-th word appears in the $j$-th video text, and the denominator is the total count of all words in that video text.
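As a concrete sketch of this term-frequency text feature (a minimal illustration; the vocabulary and tokens here are hypothetical, and jieba segmentation is assumed to have already produced the tokens):

```python
from collections import Counter

def text_features(tokens, vocab):
    """Term-frequency text feature: the count n_{i,j} of each vocabulary
    word in the j-th video's text, normalized by the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

vocab = ["funny", "cat", "dance"]                  # hypothetical vocabulary
feats = text_features(["cat", "cat", "dance"], vocab)
# "cat" appears 2 of 3 times, "dance" 1 of 3, "funny" never
```

The resulting vector sums to 1 for any non-empty title, which is what the standardization step relies on.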
f. The extracted static and dynamic video features are each reduced to 32 dimensions by PCA and spliced to form the visual features; the extracted audio features are reduced to 64 dimensions by PCA.
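A minimal sketch of this reduction-and-splicing step, using an SVD-based PCA in place of a library implementation (the feature dimensions follow the embodiment; the data here is random stand-in data):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components
    (plain PCA via SVD of the centered data matrix)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
static = pca_reduce(rng.normal(size=(100, 2048)), 32)   # Inception-V3 static features -> 32-d
dynamic = pca_reduce(rng.normal(size=(100, 487)), 32)   # 3D-CNN dynamic features -> 32-d
audio = pca_reduce(rng.normal(size=(100, 128)), 64)     # VGGish audio features -> 64-d
visual = np.concatenate([static, dynamic], axis=1)      # spliced 64-d visual feature
```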
T2, video multi-modal feature fusion:
the invention integrates the attention mechanism with the user characteristics to learn the dependency relationship between the user and each mode, and distributes different weights to each mode. The whole multi-modal feature extraction and fusion structure is shown in the attached figure 2, the extracted visual features, audio features and text features are subjected to feature fusion through an attention mechanism, and the feature fusion process is as follows:
a. Learning the weights the user assigns to the visual, audio and video-text features, with $a_m$ representing the learned preference (i.e. weight) of the user for each modality; the calculation flow is as follows:

$\hat{a}_m = h^{T}\,\mathrm{ReLU}(W_1 e_m + W_2 e_U + b),\quad m \in \{v,a,t\}$,

$a_m = \dfrac{\exp(\hat{a}_m)}{\sum_{m' \in \{v,a,t\}} \exp(\hat{a}_{m'})}$,

where $e_V$ denotes the video feature and $e_U$ the user feature, $m \in \{v,a,t\}$ denotes the visual, auditory and text modalities, $\hat{a}_m$ is the attention score obtained for each modality, normalized to obtain $a_m$, the user's preference (i.e. weight) for each modality; $e_v$, $e_a$ and $e_t$ are the visual, auditory and textual features extracted from the video, $W_1$, $W_2$ and $h$ are the weights of the attention network, and $b$ denotes the bias.
b. The features of the video are represented as the weighted sum of the features of the respective modalities: $e_V = \sum_{m \in \{v,a,t\}} a_m e_m$.
Experiments show that fusing the multi-modal video features performs better than using only user features, context features, and the like; on this basis, assigning different modality weights through the attention mechanism makes the whole recommendation model more personalized.
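The modality-fusion step above can be sketched as follows (a minimal illustration with random parameters; the names W1, W2, h, b and the assumption that all modality features are first projected to a common dimension are ours, not the patent's):

```python
import numpy as np

def fuse_modalities(e_modal, e_user, W1, W2, h, b):
    """User-conditioned attention over modality features:
    score_m = h . relu(W1 e_m + W2 e_u + b), softmax -> a_m,
    fused video feature e_V = sum_m a_m * e_m."""
    scores = np.array([h @ np.maximum(W1 @ e_m + W2 @ e_user + b, 0.0)
                       for e_m in e_modal])
    a = np.exp(scores - scores.max())
    a /= a.sum()                                   # softmax weight per modality
    e_V = sum(w * e_m for w, e_m in zip(a, e_modal))
    return e_V, a

d = 8
rng = np.random.default_rng(1)
e_modal = [rng.normal(size=d) for _ in range(3)]   # visual, audio, text features
e_user = rng.normal(size=d)                        # user feature e_U
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h, b = rng.normal(size=d), rng.normal(size=d)
e_V, a = fuse_modalities(e_modal, e_user, W1, W2, h, b)
```

The weights `a` sum to 1, so `e_V` is a convex combination of the modality features, conditioned on the user.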
T3, social feature learning:
the potential representation of the social relationship between each user and each video in the graph is learned through a deep walking method, for example, as shown in fig. 3, a user-video network is formed by taking the users and the videos as nodes, lines between the nodes represent that the users have viewed the videos, and a Word vector model Word2Vec in NLP natural language processing is used in network representation by means of the fact that a distribution rule of random walk in the network and a rule of sentence sequences in NLP natural language processing appearing in a corpus have similar power law distribution characteristics.
In the user-video network, node sequences are generated by random walks with $u_i$ as the root node, each sequence being a sentence, for example:

$sentence = (u_i, v_j, u_k, v_l, \dots)$,

where $u_i$ denotes the $i$-th user and $v_i$ denotes the $i$-th video.

The social features of a user learned with the Word2Vec model from classic natural language processing are represented as: $e_s = \mathrm{Word2Vec}([sentence_1, sentence_2, \dots, sentence_m],\ size = 64)$.
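The walk-generation part of this step can be sketched as follows (a minimal illustration on a hypothetical user-video graph; feeding the walks to a Word2Vec implementation such as gensim's is left out):

```python
import random

def random_walks(adj, num_walks, walk_len, seed=0):
    """Generate DeepWalk-style node sequences on the user-video graph;
    each walk plays the role of a 'sentence' for Word2Vec training."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:                  # one walk per node per round
            walk, node = [start], start
            for _ in range(walk_len - 1):
                neighbors = adj[node]
                if not neighbors:
                    break
                node = rng.choice(neighbors)
                walk.append(node)
            walks.append(walk)
    return walks

# hypothetical bipartite graph: an edge means "user watched video"
adj = {"u1": ["v1", "v2"], "u2": ["v2"], "v1": ["u1"], "v2": ["u1", "u2"]}
walks = random_walks(adj, num_walks=2, walk_len=4)
```

Because the graph is bipartite, each walk alternates between user and video nodes, exactly the kind of sentence the patent describes.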
T4, multi-domain feature learning:
the multi-domain characteristics comprise user side information such as user id, user gender, user age and the like; the invention provides an attention-based deep neural network estimation model for making full use of basic characteristics of item side information such as video id watched by a user, video author id, whether to click and the like, wherein the model is shown in figure 4, and the specific steps comprise:
a. Feature embedding: the features fall into two types: sparse features produced by one-hot encoding categorical and id-type features, and numerical continuous features. Each sparse feature $e_{sparse,i}$ is converted by an embedding into a feature vector $e_i$ of dimension $n$; the continuous features $e_{dense,i}$ are spliced and then converted through one fully connected layer into a vector $e_{dense}$ of the same dimension. The feature embedding is computed as:

$e_i = W_i\, e_{sparse,i}$,

$e_{dense} = \mathrm{FC}([e_{dense,0}, e_{dense,1}, \dots])$,

where the parameter $W_i$ is the embedding matrix and $\mathrm{FC}(\cdot)$ denotes the fully connected function; after embedding, the features are spliced into the original feature $e_{origin}$.
b. Extracting initial features: besides the embedded original feature representation $e_{origin}$, feature interaction information is also very important for click prediction, so an attention mechanism is adopted to provide an effective feature interaction representation to the bottom layer of the deep neural network. The specific calculation is:

$\hat{a}_{i,j} = h^{T}\,\mathrm{ReLU}\big(W\,(e_i \odot e_j) + b\big)$,

$a_{i,j} = \dfrac{\exp(\hat{a}_{i,j})}{\sum_{(i',j')} \exp(\hat{a}_{i',j'})}$,

$e_{inter} = [\,a_{0,0}\, e_0 \odot e_0,\; a_{0,1}\, e_0 \odot e_1,\;\dots,\; a_{i,j}\, e_i \odot e_j\,]$,

where $e_i$ is the $i$-th feature, $e_i \odot e_j$ is element-wise feature multiplication (the dimensionality is unchanged after multiplication), $\hat{a}_{i,j}$ is the attention score of the interaction between the $i$-th and $j$-th features, normalized to obtain $a_{i,j}$, the weight of the feature interaction; $e_{inter}$ is the cross feature formed by pairwise crossing of the multi-domain features, $W$ and $h$ are parameters of the attention network, and $b$ denotes the bias.
c. Fully connected neural network: the original features and their cross features serve as the input layer of the fully connected neural network, and the output of the multilayer perceptron is the multi-domain feature representation.
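Steps a-c can be sketched together as follows (a minimal illustration with random embedded features; the pairing over i ≤ j and the parameter shapes are our assumptions):

```python
import numpy as np

def attention_cross_features(E, W, h, b):
    """Attention-weighted pairwise interactions:
    score_{i,j} = h . relu(W (e_i * e_j) + b), softmax over all pairs,
    e_inter = [a_{i,j} * (e_i * e_j)]; returns [e_origin; e_inter],
    the input layer of the fully connected network."""
    pairs = [E[i] * E[j] for i in range(len(E)) for j in range(i, len(E))]
    scores = np.array([h @ np.maximum(W @ p + b, 0.0) for p in pairs])
    a = np.exp(scores - scores.max())
    a /= a.sum()                                    # interaction weights a_{i,j}
    e_inter = np.concatenate([w * p for w, p in zip(a, pairs)])
    e_origin = np.concatenate(E)
    return np.concatenate([e_origin, e_inter])

d, n = 4, 3
rng = np.random.default_rng(2)
E = [rng.normal(size=d) for _ in range(n)]          # embedded multi-domain features
W, h, b = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
x = attention_cross_features(E, W, h, b)            # MLP input: 3*4 + 6*4 dims
```

With 3 features there are 6 unordered pairs, so the MLP input concatenates 12 original dimensions with 24 cross dimensions.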
T5, multitask sequence learning:
shown in FIG. 5The video recommendation structure diagram for multitask learning is based on the video feature e generated in the above step T1VStep T2 generated social feature esAnd the multi-domain feature e generated in step T3fThe shared layer of the multi-task model shares learned feature expressions of different tasks, and each task trains parameters which are not shared among the tasks through a multi-layer perceptron respectively, and the calculation process is as follows:
the calculation formula of the prediction probability is as follows:
wherein einputFor shared feature inputs in the multitask model, sigma is sigmoid function, and the function is defined asH is the number of hidden layers corresponding to the kth task,represents the hidden layer training parameters of the H-th layer,represents the output of the H-th layer hidden layer,representing the k-th task prediction result.
The loss function sums the binary cross-entropy losses of the tasks:

$L = \sum_{k} L_k,\quad L_k = -\dfrac{1}{N}\sum_{i=1}^{N}\big[\,y_i^k \log \hat{y}_i^k + (1 - y_i^k)\log(1 - \hat{y}_i^k)\,\big]$.
model training adjusts the whole network parameters by using an Adam optimization algorithm through back propagation, wherein each time of the back propagation and the forward propagation is defined as an epoch, and iteration is carried out until the output prediction result does not obviously change or the specified iteration number is reached.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (5)
1. A video recommendation method based on multi-modal video content and multi-task learning is characterized by comprising the following steps:
S1, analyzing video content with deep learning techniques: extracting the static and dynamic features of a video through an Inception-V3 model and a 3-dimensional convolutional neural network respectively to form the visual features, extracting audio features through a VGGish model, and obtaining video text features by counting the frequency with which each word of the video title appears in the video text lexicon;
s2, learning the weight of each modal feature of the video by adopting an attention mechanism, and finally weighting each modal feature to obtain a video feature representation, wherein each modal feature of the video comprises a visual feature, an audio feature and a video text feature;
s3, forming a user-video social network by taking the user and the video as nodes, learning vector representation of a vertex in the network through a deep walking method, and taking the vector representation as characteristic representation of the user social relationship, wherein the vertex of the user-video social network represents the user;
s4, learning effective feature combinations based on an attention mechanism, splicing and fusing the feature combinations with original features to serve as input of a deep neural network, and learning multi-domain feature representation;
wherein, the step S4 includes:
s41, carrying out one-hot coding on discrete data in the multi-domain features, then reducing the dimension through an embedded matrix, splicing all continuous features, and converting the spliced continuous features into vectors with the same dimension through one layer of full connection;
s42, learning effective feature combinations by adopting an attention mechanism, learning interactive weights among different features by the attention mechanism, wherein the weight calculation formula is as follows:
$\hat{a}_{i,j} = h^{T}\,\mathrm{ReLU}\big(W\,(e_i \odot e_j) + b\big)$,

$a_{i,j} = \dfrac{\exp(\hat{a}_{i,j})}{\sum_{(i',j')} \exp(\hat{a}_{i',j'})}$,

$e_{inter} = [\,a_{0,0}\, e_0 \odot e_0,\; a_{0,1}\, e_0 \odot e_1,\;\dots,\; a_{i,j}\, e_i \odot e_j\,]$,

where $e_i$ is the $i$-th feature, $e_i \odot e_j$ is element-wise feature multiplication (the dimensionality is unchanged after multiplication), $\hat{a}_{i,j}$ is the attention score of the interaction between the $i$-th and $j$-th features, normalized to obtain $a_{i,j}$, the weight of the feature interaction; $e_{inter}$ is the cross feature formed by pairwise interaction of the multi-domain features, $W$ and $h$ are trainable parameters of the attention network, and $b$ denotes the bias;
s43, splicing and fusing the original features and the attention-based cross features thereof to be used as input, and taking the result output by the multilayer perceptron as the final representation of the multi-domain features;
s5, embedding the features generated based on the above steps as a shared part among tasks in multi-task learning, and generating prediction results by keeping output layers specific to the tasks.
2. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S1 comprises:
s11, extracting frames from the video and extracting the static features of each frame with a pre-trained classic image-processing model, the Inception-V3 model, then fusing the per-frame information through an average pooling layer as the static features of the video, and extracting the dynamic features of the video with a 3-dimensional convolutional neural network;
s12, extracting audio information, and extracting audio features of the video by using a pre-trained VGGish model;
s13, extracting video title information, and counting the frequency of each word appearing in the video text lexicon as video text characteristics;
and S14, reducing the dimensions of the video static feature, the video dynamic feature and the audio feature by adopting a PCA method, and simultaneously splicing the video static feature and the video dynamic feature to form the visual feature.
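The PCA reduction and splicing of step S14 can be sketched with a numpy SVD. The input dimensionalities (2048/512/128, matching typical Inception-V3, C3D and VGGish output sizes), the sample count and the target dimension k are all assumptions for illustration:

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (samples x dims) onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # (samples, k)

rng = np.random.default_rng(1)
static  = rng.normal(size=(200, 2048))            # per-video pooled Inception-V3 features (assumed)
dynamic = rng.normal(size=(200, 512))             # 3D-CNN clip features (assumed)
audio   = rng.normal(size=(200, 128))             # VGGish embeddings (assumed)

k = 64
visual = np.concatenate([pca_reduce(static, k), pca_reduce(dynamic, k)], axis=1)
audio_feat = pca_reduce(audio, k)
print(visual.shape, audio_feat.shape)  # (200, 128) (200, 64)
```

The spliced `visual` matrix corresponds to the claim's fused visual feature; static and dynamic parts are reduced separately before concatenation.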
3. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S2 comprises:
s21, fusing the user features: the dependency between the user and each modality of the video is learned through an attention mechanism, namely the weights the user assigns to the visual features, the audio features and the video text features are learned, calculated by the following formulas:

\hat{a}_m = h^{\mathrm{T}}\,\mathrm{ReLU}\!\left(W\,[e_U;\, e_m] + b\right), \qquad a_m = \frac{\exp(\hat{a}_m)}{\sum_{m' \in \{v,a,t\}} \exp(\hat{a}_{m'})},

where m \in \{v, a, t\} denotes the visual, auditory and textual modality respectively, \hat{a}_m is the attention score obtained for each modality, which is normalized to obtain a_m, representing the user's preference for (i.e. weight on) each modality; e_m for m \in \{v, a, t\} denotes the visual, auditory and textual features extracted from the video, e_U is the user feature, W and h are trainable parameters of the attention network, and b denotes the bias;
and S22, weighting the characteristics of each mode of the video and obtaining the final characteristic representation of the video.
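The modality weighting of steps S21-S22 can be sketched in numpy as follows. The concatenation of user and modality features, the ReLU activation and all dimensions are assumptions for illustration, not the patented form:

```python
import numpy as np

def fuse_modalities(e_user, e_mods, W, h, b):
    """Weight each modality feature by the user's attention and sum.

    e_user: (du,) user feature; e_mods: dict m -> (d,) for m in {'v', 'a', 't'}
    (visual, auditory, textual). Score = h^T ReLU(W [e_user; e_m] + b),
    softmax-normalised across modalities. Returns fused feature and weights.
    """
    keys = sorted(e_mods)                          # deterministic order: 'a', 't', 'v'
    scores = np.array([h @ np.maximum(W @ np.concatenate([e_user, e_mods[m]]) + b, 0.0)
                       for m in keys])
    a = np.exp(scores - scores.max())
    a /= a.sum()                                   # user's preference weight per modality
    fused = sum(w * e_mods[m] for w, m in zip(a, keys))
    return fused, dict(zip(keys, a))

rng = np.random.default_rng(3)
du, d, k = 16, 32, 24                              # toy sizes, assumptions
e_user = rng.normal(size=du)
e_mods = {m: rng.normal(size=d) for m in ("v", "a", "t")}
fused, weights = fuse_modalities(e_user, e_mods,
                                 rng.normal(size=(k, du + d)), rng.normal(size=k), np.zeros(k))
print(fused.shape, round(sum(weights.values()), 6))  # (32,) 1.0
```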
4. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S3 comprises:
the user and the video are used as nodes to form a user-video network, connecting lines between the user nodes and the video nodes show that the user watches the video, a node sequence generated by random walk in the user-video network is compared as a sentence, and the Word feature representation, namely the social relation feature representation of the node, is learned through a Word2Vec model.
5. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S5 is performed as follows:
The multi-modal video features learned in step S2, the social features learned in step S3 and the multi-domain features learned in step S4 are spliced and fused to form the part shared by all tasks in the multi-task learning; the unshared parameters are trained through a multi-layer perceptron corresponding to each task, and finally the prediction result of each task is output through a sigmoid function.
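The shared-bottom arrangement of claim 5 can be sketched as a numpy forward pass. Layer sizes, task names and the ReLU hidden activations are illustrative assumptions, and training of the shared and unshared parameters is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_bottom_forward(x, shared_W, task_Ws):
    """Multi-task forward pass: shared layers, then task-specific towers.

    x: fused feature vector (multi-modal video + social + multi-domain).
    shared_W: weight matrices of the layers shared by all tasks.
    task_Ws: dict task -> weight matrices of its private MLP tower;
             each tower ends in one unit squashed by a sigmoid.
    """
    h = x
    for W in shared_W:                       # parameters shared across tasks
        h = np.maximum(W @ h, 0.0)
    preds = {}
    for task, Ws in task_Ws.items():         # task-specific (unshared) towers
        t = h
        for W in Ws[:-1]:
            t = np.maximum(W @ t, 0.0)
        preds[task] = sigmoid(float(Ws[-1] @ t))
    return preds

rng = np.random.default_rng(2)
x = rng.normal(size=256)                     # fused feature dimension (assumed)
shared = [rng.normal(size=(128, 256)) * 0.05, rng.normal(size=(64, 128)) * 0.05]
towers = {t: [rng.normal(size=(32, 64)) * 0.05, rng.normal(size=(1, 32)) * 0.05]
          for t in ("click", "like", "follow")}   # task names are hypothetical
out = shared_bottom_forward(x, shared, towers)
print(sorted(out))  # ['click', 'follow', 'like']
```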
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010108302.6A CN111246256B (en) | 2020-02-21 | 2020-02-21 | Video recommendation method based on multi-mode video content and multi-task learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111246256A CN111246256A (en) | 2020-06-05 |
CN111246256B true CN111246256B (en) | 2021-05-25 |
Family
ID=70869269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010108302.6A Expired - Fee Related CN111246256B (en) | 2020-02-21 | 2020-02-21 | Video recommendation method based on multi-mode video content and multi-task learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111246256B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11922287B2 (en) | 2020-07-15 | 2024-03-05 | Baidu USA, LLC | Video recommendation with multi-gate mixture of experts soft actor critic |
CN111862990B (en) * | 2020-07-21 | 2022-11-11 | 思必驰科技股份有限公司 | Speaker identity verification method and system |
CN111949884B (en) * | 2020-08-26 | 2022-06-21 | 桂林电子科技大学 | Multi-mode feature interaction-based depth fusion recommendation method |
CN112183391A (en) * | 2020-09-30 | 2021-01-05 | 中国科学院计算技术研究所 | First-view video behavior prediction system and method |
CN112183547A (en) * | 2020-10-19 | 2021-01-05 | 中国科学院计算技术研究所 | Multi-mode data-based multi-task learning method and system |
CN112163165A (en) * | 2020-10-21 | 2021-01-01 | 腾讯科技(深圳)有限公司 | Information recommendation method, device, equipment and computer readable storage medium |
CN114422859B (en) * | 2020-10-28 | 2024-01-30 | 贵州省广播电视信息网络股份有限公司 | Deep learning-based ordering recommendation system and method for cable television operators |
CN112328861B (en) * | 2020-11-24 | 2023-06-23 | 郑州航空工业管理学院 | News spreading method based on big data processing |
CN112307257B (en) * | 2020-11-25 | 2021-06-15 | 中国计量大学 | Short video click rate prediction method based on multi-information node graph network |
CN112948708B (en) * | 2021-03-05 | 2022-08-12 | 清华大学深圳国际研究生院 | Short video recommendation method |
CN112966644A (en) * | 2021-03-24 | 2021-06-15 | 中国科学院计算技术研究所 | Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof |
CN113095883B (en) * | 2021-04-21 | 2023-04-07 | 山东大学 | Video payment user prediction method and system based on deep cross attention network |
CN113312514B (en) * | 2021-07-30 | 2021-11-09 | 平安科技(深圳)有限公司 | Grouping method, device, equipment and medium combining Deepwalk and community discovery technology |
CN113704547B (en) * | 2021-08-26 | 2024-02-13 | 合肥工业大学 | Multimode tag recommendation method based on unidirectional supervision attention |
CN113794900B (en) * | 2021-08-31 | 2023-04-07 | 北京达佳互联信息技术有限公司 | Video processing method and device |
CN113821682B (en) * | 2021-09-27 | 2023-11-28 | 深圳市广联智通科技有限公司 | Multi-target video recommendation method, device and storage medium based on deep learning |
CN113807307B (en) * | 2021-09-28 | 2023-12-12 | 中国海洋大学 | Multi-mode joint learning method for video multi-behavior recognition |
CN114358364A (en) * | 2021-11-20 | 2022-04-15 | 重庆邮电大学 | Attention mechanism-based short video frequency click rate big data estimation method |
CN114969534A (en) * | 2022-06-04 | 2022-08-30 | 哈尔滨理工大学 | Mobile crowd sensing task recommendation method fusing multi-modal data features |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5512939A (en) * | 1994-04-06 | 1996-04-30 | At&T Corp. | Low bit rate audio-visual communication system having integrated perceptual speech and video coding |
JP2010206447A (en) * | 2009-03-03 | 2010-09-16 | Panasonic Corp | Viewing terminal device, server device and participation type program sharing system |
CN108108771A (en) * | 2018-01-03 | 2018-06-01 | 华南理工大学 | Image answering method based on multiple dimensioned deep learning |
CN110019812B (en) * | 2018-02-27 | 2021-08-20 | 中国科学院计算技术研究所 | User self-production content detection method and system |
CN108932304B (en) * | 2018-06-12 | 2019-06-18 | 山东大学 | Video moment localization method, system and storage medium based on cross-module state |
CN109344288B (en) * | 2018-09-19 | 2021-09-24 | 电子科技大学 | Video description combining method based on multi-modal feature combining multi-layer attention mechanism |
CN109874053B (en) * | 2019-02-21 | 2021-10-22 | 南京航空航天大学 | Short video recommendation method based on video content understanding and user dynamic interest |
CN110188343B (en) * | 2019-04-22 | 2023-01-31 | 浙江工业大学 | Multi-mode emotion recognition method based on fusion attention network |
CN110096617B (en) * | 2019-04-29 | 2021-08-10 | 北京百度网讯科技有限公司 | Video classification method and device, electronic equipment and computer-readable storage medium |
2020-02-21 — CN CN202010108302.6A patent/CN111246256B/en not_active Expired - Fee Related
Non-Patent Citations (1)
Title |
---|
Sentiment Analysis of Danmaku Comments Based on AT-LSTM; Zhuang Xuqiang, Liu Fang'ai; Digital Technology & Application; 2018-02-10; Vol. 36 (No. 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111246256A (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111246256B (en) | Video recommendation method based on multi-mode video content and multi-task learning | |
CN110737801B (en) | Content classification method, apparatus, computer device, and storage medium | |
CN111444709B (en) | Text classification method, device, storage medium and equipment | |
CN111708950B (en) | Content recommendation method and device and electronic equipment | |
CN109753566A (en) | The model training method of cross-cutting sentiment analysis based on convolutional neural networks | |
CN112163165A (en) | Information recommendation method, device, equipment and computer readable storage medium | |
US20220171760A1 (en) | Data processing method and apparatus, computer-readable storage medium, and electronic device | |
CN113627447B (en) | Label identification method, label identification device, computer equipment, storage medium and program product | |
CN113590849A (en) | Multimedia resource classification model training method and multimedia resource recommendation method | |
CN113723166A (en) | Content identification method and device, computer equipment and storage medium | |
CN112989212B (en) | Media content recommendation method, device and equipment and computer storage medium | |
CN111598183A (en) | Multi-feature fusion image description method | |
CN112131345A (en) | Text quality identification method, device, equipment and storage medium | |
CN114201516A (en) | User portrait construction method, information recommendation method and related device | |
CN113741759B (en) | Comment information display method and device, computer equipment and storage medium | |
CN115169472A (en) | Music matching method and device for multimedia data and computer equipment | |
CN116955599A (en) | Category determining method, related device, equipment and storage medium | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN116205700A (en) | Recommendation method and device for target product, computer equipment and storage medium | |
Lin et al. | Social media popularity prediction based on multi-modal self-attention mechanisms | |
CN115482021A (en) | Multimedia information recommendation method and device, electronic equipment and storage medium | |
CN111552881A (en) | Sequence recommendation method based on hierarchical variation attention | |
CN116628345B (en) | Content recommendation method and device, electronic equipment and storage medium | |
CN117556149B (en) | Resource pushing method, device, electronic equipment and storage medium | |
CN117521674B (en) | Method, device, computer equipment and storage medium for generating countermeasure information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210525 |