CN111246256B - Video recommendation method based on multi-mode video content and multi-task learning - Google Patents

Video recommendation method based on multi-mode video content and multi-task learning

Info

Publication number
CN111246256B
CN111246256B CN202010108302.6A CN202010108302A CN111246256B CN 111246256 B CN111246256 B CN 111246256B CN 202010108302 A CN202010108302 A CN 202010108302A CN 111246256 B CN111246256 B CN 111246256B
Authority
CN
China
Prior art keywords
video
feature
user
learning
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010108302.6A
Other languages
Chinese (zh)
Other versions
CN111246256A (en)
Inventor
史景伦
邓丽
梁可弘
傅钎栓
林阳城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Menghui Robot Co ltd
South China University of Technology SCUT
Original Assignee
Guangzhou Menghui Robot Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Menghui Robot Co ltd, South China University of Technology SCUT filed Critical Guangzhou Menghui Robot Co ltd
Priority to CN202010108302.6A priority Critical patent/CN111246256B/en
Publication of CN111246256A publication Critical patent/CN111246256A/en
Application granted granted Critical
Publication of CN111246256B publication Critical patent/CN111246256B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/251Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies

Abstract

The invention discloses a video recommendation method based on multi-modal video content and multi-task learning, which comprises the following steps: extracting visual, audio and text features of a short video through pre-trained models; fusing the multi-modal features of the video by an attention mechanism; learning a feature representation of the users' social relationships by the DeepWalk method; learning multi-domain feature representations with a proposed attention-based deep neural network model; and embedding the features generated in the preceding steps as the shared layer of a multi-task model, with prediction results generated through multi-layer perceptrons. The method uses an attention mechanism combined with user features to fuse the multi-modal features of the video, making the overall recommendation richer and more personalized; meanwhile, for the multi-domain features, considering the importance of interaction features in recommendation learning, an attention-based deep neural network model is proposed to enrich the learning of high-order features and provide users with more accurate personalized video recommendation.

Description

Video recommendation method based on multi-mode video content and multi-task learning
Technical Field
The invention relates to the technical field of network videos and recommendation systems, in particular to a video recommendation method based on multi-mode video content and multi-task learning.
Background
With the rapid spread of intelligent mobile terminals and the development of multimedia technology, video has gradually become a carrier of information dissemination. Short video has risen rapidly in recent years, and video has become a main mode of entertainment through which the interests of users are expressed more widely. The surge in the number of short videos brings a serious information overload problem, and how to find the videos a user is interested in from massive data has become a hot topic and research object. A good recommendation system can help consumers find interesting, and even potentially interesting, videos more quickly and conveniently, and can also help content providers improve revenue and user stickiness; the recommendation system has therefore become an important benchmark for the major video platforms over the last decade.
Current short video recommendation techniques face two important challenges. (1) Most existing recommendation algorithms recommend based on user preferences and user behavior while neglecting item content, which causes a serious cold-start problem in which most videos are overlooked; moreover, the metadata of a micro-video is uploaded by the user and may describe the video inaccurately, so how to effectively exploit the multi-modal information of the video becomes a significant challenge for video recommendation. (2) A single-task recommendation model cannot meet the current demand for multiple tasks: video recommendation must predict not only whether a user will watch a video, but also behaviors such as rating, liking and forwarding. An effective multi-task model can both reduce the model training cost and improve the model's predictions on all tasks.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a video recommendation method based on multi-modal video content and multi-task learning, which realizes more personalized recommendation by fusing multi-modal video content. The method pays attention to the content of the video: although the multi-modal content makes the information relationship between users and short videos more complex, the multi-modal information of the short video provides richer information for the whole recommendation system and can effectively alleviate the cold-start problem.
The purpose of the invention can be achieved by adopting the following technical scheme:
a video recommendation method based on multi-modal video content and multi-task learning comprises the following steps:
s1, analyzing video content by adopting a deep learning technology, respectively extracting static characteristics and dynamic characteristics of a video to form visual characteristics through an inclusion-V3 model and a 3-dimensional convolutional neural network, and extracting audio characteristics through a VGGish model; obtaining video text characteristics by counting the frequency of each word in the video title appearing in the video text word bank;
s2, learning the weight of each modal feature (including visual feature, audio feature and video text feature) of the video by adopting an attention mechanism, and finally weighting each modal feature to obtain a video feature representation;
s3, forming a user-video social network by taking the user and the video as nodes, learning vector representation of a vertex (namely the user) in the network by a deep walking method, and taking the vector representation as characteristic representation of the social relationship of the user;
s4, in multiple fields, learning effective feature combinations based on an attention mechanism, splicing and fusing the feature combinations with original features to be used as input of a deep neural network, and learning multi-domain feature representation;
s5, embedding the features generated based on the above steps as a shared part among tasks in multi-task learning, and generating prediction results by keeping output layers specific to the tasks.
Further, the step S1 includes:
s11, extracting static characteristics of each frame of video by using a pre-trained classic image processing model inclusion-V3 model for video frame extraction, and finally fusing the information of each frame through an average pooling layer to serve as the static characteristics of the video; extracting dynamic characteristics of the video by using a 3-dimensional convolutional neural network;
s12, extracting audio information, and extracting audio features of the video by using a pre-trained VGGish model;
s13, extracting video title information, and counting the frequency of each word appearing in the video text lexicon as video text characteristics;
and S14, performing dimensionality reduction on the video static feature, the video dynamic feature and the audio feature by adopting a Principal Component Analysis (PCA) method, and splicing the video static feature and the video dynamic feature into a visual feature.
Further, in step S2, the dependency relationship between the user and each modality of the video is learned through the attention mechanism, a corresponding weight is assigned to each modality, and the modality features are weighted and summed to obtain the final feature representation of the video. The specific steps are as follows:
S21, fusing user features: learning the dependency relationship between the user and each modality of the video through an attention mechanism, namely learning the weights the user assigns to the visual features, audio features and video text features, calculated by the following formulas:

\hat{a}_m = h^\top \tanh\big(W\,[e_m^V ; e_U] + b\big),

a_m = \frac{\exp(\hat{a}_m)}{\sum_{m' \in \{v,a,t\}} \exp(\hat{a}_{m'})},

where m ∈ {v, a, t} denotes the visual, auditory and text modalities respectively, \hat{a}_m is the attention score obtained for modality m, which is normalized to obtain a_m, the user's preference (i.e., weight) for that modality, e_m^V denotes the visual, auditory or textual features extracted from the video, e_U is the user feature, W and h are trainable parameters of the attention network, and b denotes the bias;
and S22, weighting the features of each modality of the video to obtain the final feature representation of the video.
Further, in step S3, the potential feature representation of the user's social relationships is learned by the DeepWalk method, specifically: users and videos are taken as nodes to form a user-video network, where a connecting line between a user node and a video node indicates that the user has watched the video; each node sequence generated by random walks in the user-video network is treated as a sentence (word sequence), and the feature representation of each word, namely the social-relationship feature representation of each node (user), is learned through the classic Word2Vec model from natural language processing.
Further, in step S4, a deep neural network model based on the attention mechanism is proposed: the original multi-domain features and the attention-based interaction features are embedded as the input of the deep neural network to enrich the learning of high-order features. The specific process is as follows:
s41, carrying out one-hot coding on discrete data in the multi-domain features, then reducing the dimension through an embedded matrix, splicing all continuous features, and converting the spliced continuous features into vectors with the same dimension through one layer of full connection;
s42, learning effective feature combinations by adopting an attention mechanism, learning interactive weights among different features by the attention mechanism, wherein the weight calculation formula is as follows:
Figure BDA0002389081480000041
Figure BDA0002389081480000042
einter=[a0,0e0e0,a0,1e0e1,…ai,jeiej],
wherein eiIs the ith feature, eiejThe feature element level multiplication is carried out, the dimension is unchanged after the multiplication,
Figure BDA0002389081480000043
the attention score of the ith feature and the jth feature interaction is obtained, and the attention score is normalized to obtain ai,jWeight representing feature interaction, einterA cross feature formed by two-by-two interaction of multi-domain features, W, h is a trainable parameter of the attention network, and b represents an offset;
and S43, splicing and fusing the original features and the attention-based cross features thereof to be used as input, and outputting the result through a multilayer perceptron to be used as the final representation of the multi-domain features.
Further, in step S5, the multi-modal video features learned in step S2, the social features learned in step S3 and the multi-domain features learned in step S4 are spliced together as the part shared by all tasks in multi-task learning, the unshared parameters are trained by the multi-layer perceptron corresponding to each task, and finally the prediction result of each task is output through a sigmoid function.
Compared with the prior art, the invention has the following advantages and effects:
the method disclosed by the invention utilizes an attention mechanism to combine with the user characteristics to fuse the multi-modal characteristics of the video, so that the whole recommendation is richer and personalized; meanwhile, aiming at multi-domain features, in consideration of the importance of interactive features in recommendation learning, the invention provides a deep neural network model based on an attention mechanism, which enriches the learning of high-order features and provides more accurate personalized video recommendation for users; in the multi-task learning, the learned feature representation is shared by multiple tasks, and the tasks are learned together, so that the overall parameter scale is reduced, and the requirements on multi-task recommendation in the industrial and living fields are better met
Drawings
FIG. 1 is a flow chart of a disclosed video recommendation method based on multimodal video content and multitask learning;
FIG. 2 is a schematic diagram of the structure of video multi-modal feature extraction and attention mechanism fusion features introduced in the present invention;
FIG. 3 is a diagram illustrating the structure between a user and a video in the present invention;
FIG. 4 is a schematic diagram of the attention-based deep neural network prediction model in the present invention;
FIG. 5 is a schematic diagram of a video recommendation structure for multitask learning in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
Fig. 1 is a flowchart of a video recommendation method based on multi-modal video content and multi-task learning, specifically including the following steps:
t1, video multi-modal feature extraction:
a. Video frame extraction: video frame images are captured through the OpenCV video-reading class cv2.VideoCapture and saved in a folder under the given path, with frame numbering starting from 0; considering that short videos are short and condensed, every frame of the video is captured instead of using frame-skipping capture.
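As a minimal illustrative sketch of this step (the file paths and the JPEG output format are assumptions, not taken from the patent), the frame capture could look as follows:

```python
import os
import cv2  # OpenCV, providing the cv2.VideoCapture class mentioned above

def extract_frames(video_path, out_dir):
    """Capture every frame of a short video (no frame skipping) and save it,
    numbering frames from 0, as described in step T1.a."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()          # returns (success_flag, BGR image)
        if not ok:                      # end of video reached
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx}.jpg"), frame)
        idx += 1
    cap.release()
    return idx                          # number of frames written

# Example with placeholder paths:
# n_frames = extract_frames("video_0001.mp4", "frames/video_0001")
```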
b. Video static feature extraction: after each video frame is resized to [299, 299], the frames are fed into the pre-trained Inception-V3 network, as shown in fig. 1, and each frame is mapped to a 2048-dimensional feature vector used as the static original feature vector of that frame; to retain the information of every frame, the frame features are passed through an average pooling layer to extract the static feature representation of the video.
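A non-authoritative sketch of this step is given below, using the pre-trained Inception-V3 from torchvision with its classification head removed so that each 299×299 frame maps to a 2048-dimensional vector, followed by average pooling over frames; the ImageNet normalization constants and the torchvision weights API are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Pre-trained Inception-V3; drop the classifier so each frame yields its
# 2048-dimensional pooled feature instead of class logits.
inception = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
inception.fc = nn.Identity()
inception.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),                      # frame size used in the text
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],         # ImageNet statistics (assumed)
                         [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def video_static_feature(frame_paths):
    """2048-d static video feature: per-frame Inception-V3 features, average-pooled."""
    frames = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths])
    per_frame = inception(frames)       # (num_frames, 2048)
    return per_frame.mean(dim=0)        # average pooling over frames -> (2048,)
```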
c. Video dynamic feature extraction: a 3D-CNN (3-dimensional convolutional neural network) is well suited to spatio-temporal feature learning. For each video, 16 consecutive frames are randomly sampled to form an input sample; five groups of convolution and pooling layers are applied, with convolution kernels of size 3×3×3 and pooling kernels of size 2×2×2, two fully connected layers output 4096-dimensional features, and one further fully connected layer outputs the 487-dimensional dynamic video features.
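The following is a C3D-style sketch consistent with the description (five convolution/pooling groups, 3×3×3 kernels, 2×2×2 pooling, two 4096-dimensional fully connected layers and a 487-dimensional output); the channel widths, the 112×112 input resolution and the first pooling layer keeping the temporal length are assumptions.

```python
import torch
import torch.nn as nn

class C3DLike(nn.Module):
    """Sketch of the 3-D CNN in step T1.c: five convolution/pooling groups with
    3x3x3 kernels and 2x2x2 pooling, two 4096-d fully connected layers and a
    487-d output used as the dynamic video feature."""
    def __init__(self, num_outputs=487):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512]      # channel widths are assumed
        blocks = []
        for i in range(5):
            blocks += [nn.Conv3d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       # first pool keeps the temporal length (assumption, as in C3D)
                       nn.MaxPool3d(kernel_size=(1, 2, 2) if i == 0 else 2)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_outputs),
        )

    def forward(self, clip):                      # clip: (batch, 3, 16, 112, 112)
        return self.classifier(self.features(clip))

# dynamic_feature = C3DLike()(torch.randn(1, 3, 16, 112, 112))  # -> (1, 487)
```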
d. Audio feature extraction: the audio is extracted from the video, 128-dimensional feature vectors are extracted through the pre-trained VGGish network model, and the audio information of each frame is fused through an average pooling layer to output the audio features of the video.
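Because the exact VGGish implementation is not specified, the sketch below only shows the log-mel preprocessing that VGGish expects and the average pooling over patches; vggish_model is a hypothetical placeholder for whatever pre-trained VGGish network is used, and the 16 kHz sampling rate and 0.96 s patch length are assumptions.

```python
import numpy as np
import librosa

def audio_feature(wav_path, vggish_model):
    """Sketch of step T1.d: split the audio into ~0.96 s log-mel patches (the
    input format VGGish expects), run each patch through a pre-trained VGGish
    model (vggish_model is a placeholder callable returning a 128-d embedding
    per patch), and average-pool the patch embeddings into one audio feature."""
    y, sr = librosa.load(wav_path, sr=16000)                  # 16 kHz mono audio
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=64)
    log_mel = np.log(mel + 1e-6).T                            # (frames, 64)
    patch_len = 96                                            # ~0.96 s of 10 ms frames
    patches = [log_mel[i:i + patch_len]
               for i in range(0, len(log_mel) - patch_len + 1, patch_len)]
    embeddings = np.stack([vggish_model(p) for p in patches]) # (num_patches, 128)
    return embeddings.mean(axis=0)                            # average pooling -> (128,)
```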
e. Text feature extraction: the text information of the video is extracted and segmented with the common Chinese word-segmentation tool jieba, each word is given an independent index, the frequency of each word appearing in the video text lexicon is counted, and finally the features are standardized:
tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}},

where i is the index of the word in the vocabulary, n_{i,j} is the number of times the i-th word appears in the j-th video text, and the denominator is the total number of occurrences of all words in that video text.
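A small sketch of this term-frequency feature under the assumption that the lexicon is built from all video titles in the training set:

```python
import numpy as np
from collections import Counter
import jieba  # the Chinese word-segmentation tool mentioned above

def build_vocab(titles):
    """Give every word appearing in the video titles an independent index."""
    vocab = {}
    for title in titles:
        for word in jieba.lcut(title):
            vocab.setdefault(word, len(vocab))
    return vocab

def text_feature(title, vocab):
    """Normalized term frequency: count of word i in this title divided by the
    total word count of the title (the tf formula above)."""
    counts = Counter(jieba.lcut(title))
    total = sum(counts.values())
    feat = np.zeros(len(vocab), dtype=np.float32)
    for word, n in counts.items():
        if word in vocab:
            feat[vocab[word]] = n / total
    return feat

# vocab = build_vocab(all_titles)              # all_titles is a hypothetical list
# x_text = text_feature(all_titles[0], vocab)
```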
f. The extracted static and dynamic video features are each reduced to 32 dimensions by PCA and spliced to form the visual features; the extracted audio features are reduced to 64 dimensions by PCA.
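A sketch of this step with scikit-learn PCA, using the dimensions stated above (32 for static and dynamic features, 64 for audio); fitting the PCA on the whole feature matrix is an assumption about how the projection is obtained.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_and_fuse(static_feats, dynamic_feats, audio_feats):
    """Step T1.f: PCA-reduce static/dynamic features to 32 dims each and audio
    features to 64 dims, then splice static and dynamic into a 64-d visual
    feature. Inputs are (num_videos, dim) matrices."""
    static_32 = PCA(n_components=32).fit_transform(static_feats)
    dynamic_32 = PCA(n_components=32).fit_transform(dynamic_feats)
    audio_64 = PCA(n_components=64).fit_transform(audio_feats)
    visual = np.concatenate([static_32, dynamic_32], axis=1)   # (num_videos, 64)
    return visual, audio_64
```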
T2, video multi-modal feature fusion:
the invention integrates the attention mechanism with the user characteristics to learn the dependency relationship between the user and each mode, and distributes different weights to each mode. The whole multi-modal feature extraction and fusion structure is shown in the attached figure 2, the extracted visual features, audio features and text features are subjected to feature fusion through an attention mechanism, and the feature fusion process is as follows:
a. The weights the user assigns to the visual, audio and video text features are learned; a_m represents the learned preference (i.e., weight) of the user for each modality, and the calculation is as follows:

\hat{a}_m = h^\top \tanh\big(W\,[e_m^V ; e_U] + b\big),

a_m = \frac{\exp(\hat{a}_m)}{\sum_{m' \in \{v,a,t\}} \exp(\hat{a}_{m'})},

where e^V denotes the video features and e_U the user features, m ∈ {v, a, t} denotes the visual, auditory and text modalities, \hat{a}_m is the attention score obtained for modality m, which is normalized to obtain a_m, the user's preference (i.e., weight) for that modality, e_m^V denotes the visual, auditory or textual features extracted from the video, W and h are the weights of the attention network, and b denotes the bias.
b. The features of the video are represented as a weighted sum of the features of the respective modalities:
e_V = \sum_{m \in \{v,a,t\}} a_m\, e_m^V.
experiments prove that the effect of fusing the video multi-modal characteristics is better than that of only adopting user characteristics, context characteristics and the like, and different modal characteristic weights are distributed through an attention mechanism on the basis, so that the whole recommendation model is more personalized.
T3, social feature learning:
the potential representation of the social relationship between each user and each video in the graph is learned through a deep walking method, for example, as shown in fig. 3, a user-video network is formed by taking the users and the videos as nodes, lines between the nodes represent that the users have viewed the videos, and a Word vector model Word2Vec in NLP natural language processing is used in network representation by means of the fact that a distribution rule of random walk in the network and a rule of sentence sequences in NLP natural language processing appearing in a corpus have similar power law distribution characteristics.
In the user-video network, node sequences are generated by random walks with u_i as the root node, and each sequence is treated as a sentence, for example:

sentence = (u_i, v_{j_1}, u_{i_1}, v_{j_2}, \dots),

where u_i denotes the i-th user and v_i denotes the i-th video.
The social features of the users are learned using the Word2Vec model from classic natural language processing: e_s = Word2Vec([sentence_1, sentence_2, …, sentence_m], size=64).
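A sketch of this step under stated assumptions (uniform random walks, assumed walk counts and lengths, and gensim's Word2Vec with vector_size=64, the gensim 4.x name for the size=64 parameter above):

```python
import random
from collections import defaultdict
from gensim.models import Word2Vec

def build_graph(watch_pairs):
    """watch_pairs: iterable of (user_id, video_id) meaning the user watched the video."""
    adj = defaultdict(list)
    for u, v in watch_pairs:
        adj[f"u_{u}"].append(f"v_{v}")
        adj[f"v_{v}"].append(f"u_{u}")
    return adj

def random_walks(adj, walks_per_node=10, walk_len=20, seed=0):
    """Generate node sequences ('sentences') by uniform random walks, as in step T3."""
    rng = random.Random(seed)
    walks = []
    for node in adj:
        for _ in range(walks_per_node):
            walk, cur = [node], node
            for _ in range(walk_len - 1):
                cur = rng.choice(adj[cur])
                walk.append(cur)
            walks.append(walk)
    return walks

# 64-d social embeddings; walk counts/lengths and Word2Vec hyper-parameters are assumptions.
# adj = build_graph([(1, 7), (1, 9), (2, 7)])
# model = Word2Vec(sentences=random_walks(adj), vector_size=64, window=5, min_count=1, sg=1)
# e_s_user1 = model.wv["u_1"]
```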
T4, multi-domain feature learning:
the multi-domain characteristics comprise user side information such as user id, user gender, user age and the like; the invention provides an attention-based deep neural network estimation model for making full use of basic characteristics of item side information such as video id watched by a user, video author id, whether to click and the like, wherein the model is shown in figure 4, and the specific steps comprise:
a. Feature embedding: the features are divided into two types, sparse features produced by one-hot coding of categorical and id-type features, and numerical continuous features. Each sparse feature e_{sparse,i} is converted into an n-dimensional feature vector e_i by the embedding method; the continuous features e_{dense,i} are spliced and then converted into a vector e_{dense} of the same dimension through one fully connected layer. The feature embedding is calculated as follows:

e_i = W_i\, e_{sparse,i},

e_{dense} = FC([e_{dense,0}, e_{dense,1}, \dots]),

e_{origin} = [e_0, e_1, \dots, e_{dense}],

where the parameter W_i is the embedding matrix, FC(·) denotes the fully connected layer, and the embedded features are spliced into the original feature e_{origin}.
b. Initial feature extraction: besides the embedded original feature representation e_{origin}, feature interaction information is also very important for click prediction, so an attention mechanism is adopted to provide an effective feature-interaction representation for the bottom layer of the deep neural network. The specific calculation process is as follows:
\hat{a}_{i,j} = h^\top \tanh\big(W\,(e_i \odot e_j) + b\big),

a_{i,j} = \frac{\exp(\hat{a}_{i,j})}{\sum_{(p,q)} \exp(\hat{a}_{p,q})},

e_{inter} = [\,a_{0,0}\, e_0 \odot e_0,\ a_{0,1}\, e_0 \odot e_1,\ \dots,\ a_{i,j}\, e_i \odot e_j\,],

where e_i is the i-th feature, e_i \odot e_j denotes element-wise multiplication of the features (the dimension is unchanged after multiplication), \hat{a}_{i,j} is the attention score of the interaction between the i-th and j-th features, which is normalized to obtain a_{i,j}, the weight of the feature interaction, e_{inter} is the cross feature formed by pairwise crossing of the multi-domain features, W and h are parameters of the attention network, and b denotes the bias.
c. Fully connected neural network: the original features and their cross features serve as the input layer of the fully connected neural network, and the output of the multi-layer perceptron is taken as the multi-domain feature representation.
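A PyTorch sketch of this attention-based interaction plus MLP is shown below; the attention dimension, hidden-layer sizes and the inclusion of the diagonal pairs e_i⊙e_i follow the formulas above, while the remaining hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionInteractionDNN(nn.Module):
    """Sketch of the attention-based deep neural network of step T4: embedded
    multi-domain features are crossed pairwise (element-wise products weighted
    by attention scores), concatenated with the original features and fed to
    a multi-layer perceptron."""
    def __init__(self, num_fields, emb_dim, attn_dim=32, hidden=(128, 64)):
        super().__init__()
        self.W = nn.Linear(emb_dim, attn_dim)          # W, b of the interaction attention
        self.h = nn.Linear(attn_dim, 1, bias=False)    # h of the interaction attention
        pairs = num_fields * (num_fields + 1) // 2     # number of (i, j) pairs with i <= j
        dims = [num_fields * emb_dim + pairs * emb_dim] + list(hidden)
        self.mlp = nn.Sequential(*[layer for i in range(len(hidden))
                                   for layer in (nn.Linear(dims[i], dims[i + 1]), nn.ReLU())])

    def forward(self, field_embs):                     # (batch, num_fields, emb_dim)
        b, f, d = field_embs.shape
        idx_i, idx_j = torch.triu_indices(f, f)        # all pairs with i <= j
        products = field_embs[:, idx_i] * field_embs[:, idx_j]   # e_i ⊙ e_j
        scores = self.h(torch.tanh(self.W(products)))            # attention scores
        weights = F.softmax(scores, dim=1)                       # a_{i,j}
        e_inter = (weights * products).flatten(1)                # weighted cross features
        e_origin = field_embs.flatten(1)                         # original embedded features
        return self.mlp(torch.cat([e_origin, e_inter], dim=1))   # multi-domain feature e_f

# e_f = AttentionInteractionDNN(num_fields=6, emb_dim=16)(torch.randn(4, 6, 16))
```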
T5, multi-task learning:
FIG. 5 shows the video recommendation structure for multi-task learning. The video feature e_V generated in step T2, the social feature e_s generated in step T3 and the multi-domain feature e_f generated in step T4 form the shared layer of the multi-task model, which shares the learned feature representation among the different tasks; each task then trains the parameters that are not shared through its own multi-layer perceptron. The calculation process is as follows:
e_{input} = [e_V, e_s, e_f],

a_k^{(h)} = \sigma\big(W_k^{(h)}\, a_k^{(h-1)} + b_k^{(h)}\big), \quad a_k^{(0)} = e_{input}, \quad h = 1, \dots, H.
the calculation formula of the prediction probability is as follows:
\hat{y}_k = \sigma\big(W_k^{out}\, a_k^{(H)} + b_k^{out}\big),
where e_{input} is the shared feature input of the multi-task model, σ is the sigmoid function, defined as σ(x) = 1/(1 + e^{-x}), H is the number of hidden layers corresponding to the k-th task, W_k^{(h)} and b_k^{(h)} are the training parameters of the h-th hidden layer, a_k^{(h)} is the output of the h-th hidden layer, and \hat{y}_k is the prediction result of the k-th task.
The calculated loss function is as follows:
L = -\sum_{k} \sum_{n} \Big[\, y_k^{(n)} \log \hat{y}_k^{(n)} + \big(1 - y_k^{(n)}\big) \log\big(1 - \hat{y}_k^{(n)}\big) \Big],

where y_k^{(n)} denotes the label of the k-th task for the n-th training sample.
model training adjusts the whole network parameters by using an Adam optimization algorithm through back propagation, wherein each time of the back propagation and the forward propagation is defined as an epoch, and iteration is carried out until the output prediction result does not obviously change or the specified iteration number is reached.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (5)

1. A video recommendation method based on multi-modal video content and multi-task learning is characterized by comprising the following steps:
s1, analyzing video content by adopting a deep learning technology, respectively extracting static characteristics and dynamic characteristics of a video to form visual characteristics through an inclusion-V3 model and a 3-dimensional convolutional neural network, extracting audio characteristics through a VGGish model, and obtaining video text characteristics by counting the frequency of each word in a video title appearing in a video text word stock;
s2, learning the weight of each modal feature of the video by adopting an attention mechanism, and finally weighting each modal feature to obtain a video feature representation, wherein each modal feature of the video comprises a visual feature, an audio feature and a video text feature;
s3, forming a user-video social network by taking the user and the video as nodes, learning vector representation of a vertex in the network through a deep walking method, and taking the vector representation as characteristic representation of the user social relationship, wherein the vertex of the user-video social network represents the user;
s4, learning effective feature combinations based on an attention mechanism, splicing and fusing the feature combinations with original features to serve as input of a deep neural network, and learning multi-domain feature representation;
wherein, the step S4 includes:
s41, carrying out one-hot coding on discrete data in the multi-domain features, then reducing the dimension through an embedded matrix, splicing all continuous features, and converting the spliced continuous features into vectors with the same dimension through one layer of full connection;
s42, learning effective feature combinations by adopting an attention mechanism, learning interactive weights among different features by the attention mechanism, wherein the weight calculation formula is as follows:
\hat{a}_{i,j} = h^\top \tanh\big(W\,(e_i \odot e_j) + b\big),

a_{i,j} = \frac{\exp(\hat{a}_{i,j})}{\sum_{(p,q)} \exp(\hat{a}_{p,q})},

e_{inter} = [\,a_{0,0}\, e_0 \odot e_0,\ a_{0,1}\, e_0 \odot e_1,\ \dots,\ a_{i,j}\, e_i \odot e_j\,],

wherein e_i is the i-th feature, e_i \odot e_j denotes element-wise multiplication of the features (the dimension is unchanged after multiplication), \hat{a}_{i,j} is the attention score of the interaction between the i-th and j-th features, which is normalized to obtain a_{i,j}, the weight of the feature interaction, e_{inter} is the cross feature formed by pairwise interaction of the multi-domain features, W and h are trainable parameters of the attention network, and b denotes the bias;
s43, splicing and fusing the original features and the attention-based cross features thereof to be used as input, and taking the result output by the multilayer perceptron as the final representation of the multi-domain features;
s5, embedding the features generated based on the above steps as a shared part among tasks in multi-task learning, and generating prediction results by keeping output layers specific to the tasks.
2. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S1 comprises:
s11, extracting static characteristics of each frame of video by using a pre-trained classic image processing model increment-V3 model for video frame extraction, finally fusing the information of each frame through an average pooling layer to serve as the static characteristics of the video, and extracting dynamic characteristics of the video by using a 3-dimensional convolutional neural network;
s12, extracting audio information, and extracting audio features of the video by using a pre-trained VGGish model;
s13, extracting video title information, and counting the frequency of each word appearing in the video text lexicon as video text characteristics;
and S14, reducing the dimensions of the video static feature, the video dynamic feature and the audio feature by adopting a PCA method, and simultaneously splicing the video static feature and the video dynamic feature to form the visual feature.
3. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S2 comprises:
s21, fusing user characteristics, learning the dependency relationship between the user and each mode of the video through an attention mechanism, namely learning the weight of the user for distributing visual characteristics, audio characteristics and video text characteristics, and calculating by the following formula:
\hat{a}_m = h^\top \tanh\big(W\,[e_m^V ; e_U] + b\big),

a_m = \frac{\exp(\hat{a}_m)}{\sum_{m' \in \{v,a,t\}} \exp(\hat{a}_{m'})},

wherein m ∈ {v, a, t} denotes the visual, auditory and text modalities respectively, \hat{a}_m is the attention score obtained for modality m, which is normalized to obtain a_m, the user's preference for each modality, which is equivalent to the user's weight for each modality, e_m^V denotes the visual, auditory or textual features extracted from the video, e_U is the user feature, W and h are trainable parameters of the attention network, and b denotes the bias;
and S22, weighting the characteristics of each mode of the video and obtaining the final characteristic representation of the video.
4. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S3 comprises:
the user and the video are used as nodes to form a user-video network, connecting lines between the user nodes and the video nodes show that the user watches the video, a node sequence generated by random walk in the user-video network is compared as a sentence, and the Word feature representation, namely the social relation feature representation of the node, is learned through a Word2Vec model.
5. The method for video recommendation based on multi-modal video content and multi-task learning according to claim 1, wherein said step S5 is performed as follows:
and (3) splicing and fusing the multi-modal video features learned in the step S2, the social features learned in the step S3 and the multi-domain features learned in the step S4 to form a part shared by all tasks in multi-task learning, training parameters which are not shared through multi-layer perceptrons corresponding to all tasks respectively, and finally outputting a task prediction result through a sigmoid function.
CN202010108302.6A 2020-02-21 2020-02-21 Video recommendation method based on multi-mode video content and multi-task learning Expired - Fee Related CN111246256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010108302.6A CN111246256B (en) 2020-02-21 2020-02-21 Video recommendation method based on multi-mode video content and multi-task learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010108302.6A CN111246256B (en) 2020-02-21 2020-02-21 Video recommendation method based on multi-mode video content and multi-task learning

Publications (2)

Publication Number Publication Date
CN111246256A CN111246256A (en) 2020-06-05
CN111246256B true CN111246256B (en) 2021-05-25

Family

ID=70869269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010108302.6A Expired - Fee Related CN111246256B (en) 2020-02-21 2020-02-21 Video recommendation method based on multi-mode video content and multi-task learning

Country Status (1)

Country Link
CN (1) CN111246256B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11922287B2 (en) 2020-07-15 2024-03-05 Baidu USA, LLC Video recommendation with multi-gate mixture of experts soft actor critic
CN111862990B (en) * 2020-07-21 2022-11-11 思必驰科技股份有限公司 Speaker identity verification method and system
CN111949884B (en) * 2020-08-26 2022-06-21 桂林电子科技大学 Multi-mode feature interaction-based depth fusion recommendation method
CN112183391A (en) * 2020-09-30 2021-01-05 中国科学院计算技术研究所 First-view video behavior prediction system and method
CN112183547A (en) * 2020-10-19 2021-01-05 中国科学院计算技术研究所 Multi-mode data-based multi-task learning method and system
CN112163165A (en) * 2020-10-21 2021-01-01 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and computer readable storage medium
CN114422859B (en) * 2020-10-28 2024-01-30 贵州省广播电视信息网络股份有限公司 Deep learning-based ordering recommendation system and method for cable television operators
CN112328861B (en) * 2020-11-24 2023-06-23 郑州航空工业管理学院 News spreading method based on big data processing
CN112307257B (en) * 2020-11-25 2021-06-15 中国计量大学 Short video click rate prediction method based on multi-information node graph network
CN112948708B (en) * 2021-03-05 2022-08-12 清华大学深圳国际研究生院 Short video recommendation method
CN112966644A (en) * 2021-03-24 2021-06-15 中国科学院计算技术研究所 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof
CN113095883B (en) * 2021-04-21 2023-04-07 山东大学 Video payment user prediction method and system based on deep cross attention network
CN113312514B (en) * 2021-07-30 2021-11-09 平安科技(深圳)有限公司 Grouping method, device, equipment and medium combining Deepwalk and community discovery technology
CN113704547B (en) * 2021-08-26 2024-02-13 合肥工业大学 Multimode tag recommendation method based on unidirectional supervision attention
CN113794900B (en) * 2021-08-31 2023-04-07 北京达佳互联信息技术有限公司 Video processing method and device
CN113821682B (en) * 2021-09-27 2023-11-28 深圳市广联智通科技有限公司 Multi-target video recommendation method, device and storage medium based on deep learning
CN113807307B (en) * 2021-09-28 2023-12-12 中国海洋大学 Multi-mode joint learning method for video multi-behavior recognition
CN114358364A (en) * 2021-11-20 2022-04-15 重庆邮电大学 Attention mechanism-based short video frequency click rate big data estimation method
CN114969534A (en) * 2022-06-04 2022-08-30 哈尔滨理工大学 Mobile crowd sensing task recommendation method fusing multi-modal data features

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5512939A (en) * 1994-04-06 1996-04-30 At&T Corp. Low bit rate audio-visual communication system having integrated perceptual speech and video coding
JP2010206447A (en) * 2009-03-03 2010-09-16 Panasonic Corp Viewing terminal device, server device and participation type program sharing system
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN110019812B (en) * 2018-02-27 2021-08-20 中国科学院计算技术研究所 User self-production content detection method and system
CN108932304B (en) * 2018-06-12 2019-06-18 山东大学 Video moment localization method, system and storage medium based on cross-module state
CN109344288B (en) * 2018-09-19 2021-09-24 电子科技大学 Video description combining method based on multi-modal feature combining multi-layer attention mechanism
CN109874053B (en) * 2019-02-21 2021-10-22 南京航空航天大学 Short video recommendation method based on video content understanding and user dynamic interest
CN110188343B (en) * 2019-04-22 2023-01-31 浙江工业大学 Multi-mode emotion recognition method based on fusion attention network
CN110096617B (en) * 2019-04-29 2021-08-10 北京百度网讯科技有限公司 Video classification method and device, electronic equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sentiment Analysis of Bullet-Screen Comments Based on AT-LSTM; 庄须强, 刘方爱; Digital Technology & Application (《数字技术与应用》); 2018-02-10; Vol. 36, No. 02; full text *

Also Published As

Publication number Publication date
CN111246256A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111246256B (en) Video recommendation method based on multi-mode video content and multi-task learning
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN111444709B (en) Text classification method, device, storage medium and equipment
CN111708950B (en) Content recommendation method and device and electronic equipment
CN109753566A (en) The model training method of cross-cutting sentiment analysis based on convolutional neural networks
CN112163165A (en) Information recommendation method, device, equipment and computer readable storage medium
US20220171760A1 (en) Data processing method and apparatus, computer-readable storage medium, and electronic device
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN113590849A (en) Multimedia resource classification model training method and multimedia resource recommendation method
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111598183A (en) Multi-feature fusion image description method
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN114201516A (en) User portrait construction method, information recommendation method and related device
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN116955599A (en) Category determining method, related device, equipment and storage medium
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN116205700A (en) Recommendation method and device for target product, computer equipment and storage medium
Lin et al. Social media popularity prediction based on multi-modal self-attention mechanisms
CN115482021A (en) Multimedia information recommendation method and device, electronic equipment and storage medium
CN111552881A (en) Sequence recommendation method based on hierarchical variation attention
CN116628345B (en) Content recommendation method and device, electronic equipment and storage medium
CN117556149B (en) Resource pushing method, device, electronic equipment and storage medium
CN117521674B (en) Method, device, computer equipment and storage medium for generating countermeasure information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210525