CN115329127A - Multimodal short video tag recommendation method fusing emotion information - Google Patents


Info

Publication number
CN115329127A
Authority
CN
China
Prior art keywords
label
short video
video
information
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210867181.2A
Other languages
Chinese (zh)
Inventor
Li Yuhua (李玉华)
Du Chang (杜畅)
Li Ruixuan (李瑞轩)
Gu Xiwu (辜希武)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-07-22
Filing date: 2022-07-22
Publication date: 2022-11-11
Application filed by Huazhong University of Science and Technology
Priority to CN202210867181.2A
Publication of CN115329127A
Legal status: Pending

Classifications

    • G06F16/735 — Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/65 — Information retrieval of audio data; clustering; classification
    • G06F16/75 — Information retrieval of video data; clustering; classification
    • G06F16/783 — Information retrieval of video data; retrieval characterised by metadata automatically derived from the content
    • G06F16/7834 — Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7844 — Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcripts of audio data
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/44 — Image or video recognition or understanding; local feature extraction by analysis of parts of the pattern (edges, contours, corners, strokes or intersections); connectivity analysis
    • G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects

Abstract

The invention discloses a multimodal short video tag recommendation method fusing emotion information, which belongs to the technical field of video processing and comprises the following steps: constructing a short video sample set; inputting the short video samples into an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder, which extracts features from the images, audio and text of each sample to obtain content features and emotion features and fuses them with an attention network to obtain a plurality of candidate video tags; training the initial multimodal tag recommendation model into a target multimodal tag recommendation model by taking the expected video tags as the target and the textual feature difference between the candidate video tags and the expected video tags as the loss; and inputting the current short video into the target multimodal tag recommendation model to generate the target video tags. By fusing image, audio and text features, the invention makes full use of the multimodal information associated with a video and effectively improves the quality of the generated video tags.

Description

Multimodal short video tag recommendation method fusing emotion information
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a multimodal short video tag recommendation method fusing emotion information.
Background
With the development of multimedia technology and portable mobile devices, and with the promotion of various short video platforms, short video is becoming a new medium through which the public acquires information and socializes, extending traditional text and image media. Short videos have a limited time span, can be shot conveniently and shared instantly, and therefore spread widely and in huge quantities. Recommendation systems were first applied mainly in e-commerce, recommending products to a user by analyzing the relationship between users and products; they then gradually took on the task of recommending content of interest by analyzing user-related information in social media and news platforms, such as historical tweets, comments and articles. The subjects of a recommendation system can be broadly divided into users and items, and recommendation tasks fall roughly into several modes: one matches a list of items to a user through item-to-item similarity; another recommends the same item to groups of users with similar characteristics, based on user-to-user similarity; a third matches lists of items with similar attributes by reasonably modeling the user's related information, and the basic idea of tag recommendation derives from this approach.
Tags are important in many different fields for marking specific information and for helping search engines locate key information sources. A tag can be a single word, a phrase without spaces, or even an arbitrary word combination prefixed with the symbol #. Tags organize and classify tweets with different content; articles with similar content and tweets shared by other users can be reached through a tag's hyperlink, so a tag service benefits users in search and lets them browse more content of interest. Automatic tagging of text and images has become an important research topic in recent years.
Existing short video tag recommendation models have difficulty accurately predicting texts and tags that carry emotion factors after multimodal information fusion. This is especially true for short videos, which contain richer visual content and are more likely to carry the user's emotion information; however, little research has introduced emotion information into short video tag recommendation, and when it is introduced it is usually only linearly superimposed on the content information, so prediction accuracy is low. Overall, the quality of short video tags generated by the prior art still needs to be improved.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a multimodal short video tag recommendation method fusing emotion information. It aims to make full use of the multimodal information associated with a short video by fusing image, audio and text features, and to effectively improve the quality of the generated video tags, thereby solving the technical problem that existing short video tags are of poor quality.
In order to achieve the above object, according to one aspect of the present invention, there is provided a multimodal short video tag recommendation method fusing emotion information, the method comprising:
S1: constructing a short video sample set, wherein the label of each short video sample comprises a plurality of corresponding platform tags, and the attributes of each short video sample comprise the corresponding image features, audio features and text features;
S2: inputting the short video samples into an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder, so that it extracts features from the images, audio and text of the short video samples to obtain content features and emotion features, and fuses them with an attention network to obtain a plurality of candidate video tags; and training the initial multimodal tag recommendation model into a target multimodal tag recommendation model by taking the expected video tags as the target and the textual feature difference between the candidate video tags and the expected video tags as the loss;
S3: inputting the current short video into the target multimodal tag recommendation model so that it generates the target video tags.
In one embodiment, the short video sample set comprises a training set, a validation set and a test set, and S2 comprises:
S21: inputting the training set into the initial multimodal tag recommendation model, which comprises: a content feature extraction module, an emotion feature extraction module, and a tag prediction module for fusion;
S22: extracting features of the image modality and the audio modality of the training set with pre-trained models in the content feature extraction module, extracting text features, and fusing the image, audio and text features with a multimodal Transformer model to obtain the corresponding content features;
S23: extracting features of the image modality and the audio modality of the training set with pre-trained models in the emotion feature extraction module, extracting text features, and fusing the image, audio and text features with a multi-head attention mechanism to obtain the corresponding emotion features;
S24: fusing the content features, emotion features and tag text features corresponding to the training set with the tag prediction module to obtain short video fusion features, and generating a plurality of candidate video tags from the short video fusion features; and calculating the textual feature error between each candidate video tag and the real video tag, so that the loss is reduced through continuous iterative training;
S25: validating and testing the initial multimodal tag recommendation model during training with the validation set and the test set respectively, and taking the validated and tested initial multimodal tag recommendation model as the target multimodal tag recommendation model.
In one embodiment, the content feature extraction module is based on a multimodal Transformer structure; the emotion feature extraction module is based on a cross-modal multi-head attention structure; and the tag prediction module is based on an attention network.
In one embodiment, the content feature extraction module comprises, connected in sequence: an encoder layer, a stacked-block layer and a fusion layer. The encoder layer encodes the information of the different modalities, the stacked-block layer produces modality representations with an attention mechanism, and the fusion layer fuses the cross-modal information to obtain the final content feature representation. The stacked-block layer uses N stacked blocks per modality to obtain the attention-based feature representations, and each stacked block comprises a multi-head attention mechanism, a cross-attention mechanism and two feedforward neural networks.
In one embodiment, the emotion feature extraction module performs inter-modal feature fusion on the image, audio and text features through a multimodal multi-head attention framework (MMFA) to obtain the emotion representation vector corresponding to each short video sample;
the MMFA comprises a multi-head self-attention mechanism and a multi-head co-attention mechanism.
In one embodiment, before S21, S2 further includes:
eliminating video samples that cannot be played normally through an integrity check, and filtering out video samples whose duration is below a duration threshold, whose textual information is below a word count threshold, and/or whose audio channel is missing.
In one embodiment, performing feature extraction on the images, audio and text of the short video samples to obtain content features and emotion features comprises (an illustrative sketch of the segmentation steps follows this embodiment):
dividing the audio data of a short video sample into audio segments at a preset time interval T, extracting features from each audio segment, and combining them into the audio features in time order;
extracting frames from the image data of the short video sample according to a preset number of video frames N, extracting features from each frame, and combining them into the image features in time order;
constructing a word bank from the tweet information and original tag information of the short video samples, representing the words in the word bank as vectors with a pre-trained language model, and extracting features from these vectors to obtain the textual features; and for concatenated words longer than a length threshold, segmenting them with a word segmentation tool, obtaining features through the pre-trained language model, and averaging them to obtain the context features.
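As referenced above, a minimal sketch of the audio segmentation and frame sampling is given below; the interval T, the frame count N and the choice of librosa and OpenCV are assumptions made for illustration and are not prescribed by the embodiment.

```python
# Illustrative sketch only: segment audio into T-second chunks and sample N frames.
# T, N and the use of librosa/OpenCV are assumptions, not part of the patent text.
import cv2
import librosa
import numpy as np

def split_audio(audio_path, T=1.0, sr=16000):
    """Load audio and cut it into consecutive T-second segments (time order preserved)."""
    wav, sr = librosa.load(audio_path, sr=sr, mono=True)
    seg_len = int(T * sr)
    return [wav[i:i + seg_len] for i in range(0, len(wav), seg_len)
            if len(wav[i:i + seg_len]) == seg_len]

def sample_frames(video_path, N=8):
    """Uniformly sample N frames from the video, keeping temporal order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), N).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```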
In one embodiment, constructing a word bank from the tweet information and original tag information of the short video samples comprises:
collecting all tweet information and original tag information of the short video samples, then aligning, segmenting and counting word frequencies in turn; and sorting the words from high to low word frequency and taking the words whose frequency is higher than N times to construct the word bank, where N is a preset proportional parameter.
In one embodiment, constructing the word bank from the tweet information and original tag information of the short video samples further comprises: filtering out non-English characters in the tweet information and original tag information; reducing English words sharing the same root to that root; and segmenting concatenated words longer than a length threshold into multiple independent words.
According to another aspect of the invention, an electronic device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method described above when executing the computer program.
In general, compared with the prior art, the above technical solution conceived by the present invention can achieve the following beneficial effects:
(1) By fusing image, audio and text features, the invention makes full use of the multimodal information associated with a video and effectively improves the quality of the generated video tags.
(2) The initial multimodal tag recommendation model provided by the invention performs feature fusion step by step, so that multimodal information is fused more effectively according to the influence weight of different information on the video tags, improving the quality of the finally generated video tags.
(3) The invention preprocesses the collected video data before feature extraction, which effectively avoids errors and redundancy in the data set, ensures the training effect of the model, and ensures that the video tags generated by the model are of higher quality.
(4) When performing multimodal feature fusion, the invention takes the emotion information in the video into account; through a multi-task learning method, the video's emotion information can be captured at the same time, which helps predict tags with emotion attributes and further improves the quality of the generated video tags.
Drawings
Fig. 1 is a logic diagram of the multimodal short video tag recommendation method fusing emotion information according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the content feature extraction module of the initial multimodal tag recommendation model according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a stacked block in the content feature extraction module of the initial multimodal tag recommendation model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the emotion feature extraction module of the initial multimodal tag recommendation model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in Fig. 1, the invention provides a multimodal short video tag recommendation method fusing emotion information, which comprises a model training phase and a tag prediction phase.
the model training phase comprises:
(S1.1) Collecting video data and the corresponding tag data. After comparing the video volume and tag quality of various short video platforms, a Python web crawler is used to collect data from the Vine platform. The specific collection method is as follows: the original Vine data set is processed by taking the intersection of user ID–video URL, user ID–user tweet and user ID–tweet tag, filtering out data that contain only video or only text, and joining the data through the uniqueness of the user ID, so that the final data set contains video URL–tweet pairs, where the tweet is text containing specific tags. Tags appearing fewer than 10 times in the whole data set are removed to reduce the randomness of tag–video pairs, and the data are cleaned using the preprocessing method described below. The binary media files at the video download addresses are fetched through a local proxy and a crawler program; 40049 usable short videos and 1935 distinct tags are finally obtained. About 86% of the short videos in the data set last around 6 s, each video contains at least 1 and at most 21 tags, each video contains 4.8 tags on average, and the tweets contain 9.73 words on average.
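A minimal sketch of the filtering logic described above is shown below; the record layout and field names are illustrative assumptions, and only the "drop tags seen fewer than 10 times" rule comes from the text.

```python
# Illustrative sketch of the data-set cleaning described above.
# Field names (user_id, video_url, tweet, tags) are assumed.
from collections import Counter

def clean_dataset(records, min_count=10):
    """records: list of dicts with user_id, video_url, tweet, tags (list of str)."""
    # Keep only records that have both a video and a tweet with tags.
    records = [r for r in records if r.get("video_url") and r.get("tweet") and r.get("tags")]
    # Count tag frequencies over the whole data set.
    freq = Counter(t for r in records for t in r["tags"])
    cleaned = []
    for r in records:
        tags = [t for t in r["tags"] if freq[t] >= min_count]
        if tags:
            cleaned.append({"video_url": r["video_url"], "tweet": r["tweet"], "tags": tags})
    return cleaned
```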
(S1.2) Preprocessing the collected videos and the corresponding tags. The preprocessing specifically comprises: eliminating videos that cannot be played normally through an integrity check, using the open source tool ffmpeg; and filtering out video data with a short duration, short tweet information and/or a missing audio channel. In this embodiment, step (S1.2) further comprises:
constructing the tag word bank as follows: after segmenting the text of all videos collected in the model training phase, the word frequency of each word is counted; the words are sorted from high to low frequency, and the words whose frequency is higher than N times are taken to construct the word bank, where N is a preset proportional parameter. When constructing the word bank, non-English characters in the tweet information and original tag information are filtered out; English words sharing the same root are reduced to that root; and longer concatenated words are segmented with a word segmentation tool into multiple single words.
Separating the video data into image data and audio data, and extracting features from the image data, the audio data and the corresponding user tweet data respectively; taking the platform tags corresponding to each short video as its training labels and the image, audio and tweet features of the short video as its attributes to form a sample; all samples form a data set, which is divided into a training set, a validation set and a test set.
(S2) The model training phase further comprises:
establishing an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder. The model performs content feature extraction and emotion feature extraction on the image features, audio features and tweet/comment context features with different types of pre-trained models, realizes multimodal feature fusion through the content feature extraction module and the emotion feature extraction module, and generates several video tags related to the video content and the user's recommendation information from the fused features. Based on a multi-task learning method, the content feature extraction module and the emotion feature fusion module learn jointly, i.e. the modality information is mined by different pre-trained models; compared with single-task learning, the emotion feature extraction module can effectively fuse the emotion information in the short video while the content features are extracted, and the feature vectors of the different tasks are fused through an attention network, which avoids noise and improves the performance of the initial multimodal tag recommendation model. The established model is trained, validated and tested with the training set, validation set and test set respectively, so as to obtain the target multimodal tag recommendation model.
Further, the initial multimodal tag recommendation model comprises a content feature extraction module, an emotion feature extraction module and a tag prediction module. The content feature extraction module extracts features of the short video data in the image and audio modalities through pre-trained models related to the video content information, and fuses the image, audio and text modalities through a multimodal Transformer model. The emotion feature extraction module extracts features of the short video data in the image and audio modalities through pre-trained models related to the video emotion information, and fuses the image, audio and text modalities through a multi-head attention mechanism. The tag prediction module fuses the video content fusion features output by the content feature extraction module, the emotion fusion features output by the emotion feature extraction module and the tag text features related to the video to obtain the short video fusion features, and generates several candidate video tags from them. The tag prediction module also calculates the error between the textual features of the generated video tags and those of the real video tags as the loss, so that the loss is reduced through continuous iterative training.
Based on this structure of the initial multimodal tag recommendation model, features can be fused step by step, so that multimodal information is fused more effectively according to the influence weight of different information on the video tags, and the quality of the finally generated tags is improved.
In step (S2), feature extraction for the different modalities in the content feature extraction module and the emotion feature extraction module is performed as follows. For visual features, the content feature extraction module uses the ffmpeg open-source streaming media tool for framing, adopts VGG16-LSTM to extract visual features from the video and capture the temporal continuity of the picture frames, and obtains 1024-dimensional visual feature vectors in a slow-fusion manner; the emotion feature extraction module adopts a 3D-CNN pre-trained model to obtain 512-dimensional visual feature representations. For audio features, the content feature extraction module adopts VGGish to extract feature vectors: the audio data are resampled at a fixed frequency, a spectrogram of the sampled data is obtained by Fourier transform, and 1024-dimensional acoustic spectrum data of the audio file are finally obtained through a filter bank; the emotion feature extraction module uses the Librosa audio processing library and feeds the result into a convolutional neural network to obtain 300-dimensional feature vectors. For text features, the content feature extraction module adopts a BERT pre-trained language model to extract features from the hashtag text, obtaining 768-dimensional embedded representations; the emotion feature extraction module adopts a GloVe model pre-trained on an English corpus and finally obtains 300-dimensional text vectors.
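The following sketch illustrates how the text branch of this pipeline might be assembled with the Hugging Face transformers BERT model, producing the 768-dimensional embeddings mentioned above; "bert-base-uncased" and mean pooling are assumptions, and the visual and audio extractors are left as placeholders since the text names only the model families (VGG16-LSTM, 3D-CNN, VGGish, Librosa), not their exact configurations.

```python
# Sketch of the text branch only; the 768-dim output matches the stated dimension.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def text_features(tweet: str) -> torch.Tensor:
    """Return a 768-dimensional embedding for a tweet / hashtag string."""
    inputs = tokenizer(tweet, return_tensors="pt", truncation=True, max_length=64)
    hidden = bert(**inputs).last_hidden_state        # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool over tokens

# Visual (VGG16-LSTM / 3D-CNN) and audio (VGGish / Librosa + CNN) extractors
# would be plugged in analogously, yielding 1024/512-dim and 1024/300-dim vectors.
```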
In addition, an automatic bullet-screen (danmaku) comment generation model is established based on an autoencoder (encoder–decoder) structure. It performs time-series analysis on the image features, audio features and bullet-screen comment context features respectively, realizes multimodal feature fusion together with the textual features of the bullet-screen comments, and generates a bullet-screen comment related to the video content and the bullet-screen context from the fused features. This model comprises: a content feature extraction module, an emotion feature extraction module and a tag prediction module. Content information fusion in the content feature extraction module comprises: the encoder layer encodes the information of the different modalities, the stacked-block layer produces modality representations with an attention mechanism, and the fusion layer fuses the cross-modal information to obtain the final video content feature representation.
As shown in Fig. 2, the video content representation mainly consists of three parts: the encoding layer encodes the information of the different modalities, the stacked-block layer produces modality representations with attention, and the fusion layer fuses the cross-modal information to obtain the final video content feature representation.
The stacked-block layer uses N stacked blocks per modality to obtain attention-based feature representations of the modality information. As shown in Fig. 3, each stacked block mainly consists of four parts: a multi-head attention mechanism (Multi-Head Attention), a cross-attention mechanism (Cross Attention) and two feed-forward neural networks (FFNs).
The Multi-Head Attention module is an important structure in the Transformer model. It adds more matrix parameters on top of the traditional self-attention mechanism, so that after the Q, K and V matrices are obtained from the input vectors, they are further multiplied by several matrices whose number equals the number of attention heads. The advantage is that the data flow is split into different subspaces, so the model can attend to different aspects of the information. Cross attention originally sits at the decoder side of the Transformer model: the corresponding K and V matrices are generated in the encoder module and combined with the Q matrix of the decoder module, fusing the information of the encoder and decoder. Although originally used for machine translation, cross attention can also fuse information between the different modalities of a short video, letting the Transformer combine information from different modalities in a more unified and natural way. The FFN concatenates the several subspace matrices produced by the multi-head attention into a unified matrix and multiplies it by a parameter matrix to obtain the final single-modality information fusion matrix.
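A minimal PyTorch sketch of such a stacked block follows, using nn.MultiheadAttention for both the self-attention and cross-attention steps; the dimensions, layer norms and residual pattern are illustrative assumptions rather than the patent's exact design.

```python
# Illustrative stacked block: multi-head self-attention, cross-attention to
# another modality, and two feed-forward networks. Hyperparameters are assumed.
import torch
import torch.nn as nn

class StackedBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ffn2 = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x, other):
        # x: (batch, len_x, d_model) features of this modality at layer t-1
        # other: (batch, len_o, d_model) features of another modality
        h = self.norms[0](x + self.self_attn(x, x, x, need_weights=False)[0])
        h = self.norms[1](h + self.ffn1(h))
        h = self.norms[2](h + self.cross_attn(h, other, other, need_weights=False)[0])
        return self.norms[3](h + self.ffn2(h))
```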
In the stacked block of the t-th layer, the input for each modality comes from the output of layer t−1 (the corresponding update formulas appear only as images in the original document).
The feature fusion method between the different modalities is as follows, where c denotes the text modality, v the visual modality and a the audio modality. Taking the computation for the visual modality v as an example: after the cross-attention mechanism, a fusion gate is constructed through a multi-layer perceptron and multiplied by the corresponding weight matrix, so that inter-modal feature fusion is performed with the text information and with the audio information respectively (the corresponding formulas appear only as images in the original). The gated result is then fed into the feedforward neural network to obtain the output of the t-th stacked block; the layer-t outputs of the audio modality and the text modality are obtained in the same way. The outputs of the stacked blocks are then converted into vectors of a specific length by a weighted pooling operation.
After the attention-based feature representations of the three modalities have been obtained in this way, feature fusion gates are created in the fusion layer through a multi-layer perceptron, and the information of the different modalities is multiplied by the corresponding weights and summed to obtain the fused vector of the stacked block:
g_v = MLP(F_c, F_v, F_a)    (10)
(the weighted-sum formula, equation (11), is rendered as an image in the original).
In step (S2), the emotion feature extraction module performs emotion information fusion as follows.
Compared with content attribute tags, emotion attribute tags are more easily predicted. After the feature representations are extracted by the emotion pre-trained models of the different modalities, the overall module architecture is as shown in Fig. 4: the visual, audio and text features undergo inter-modal feature fusion through a Multimodal Multi-Head Attention framework (MMFA) to obtain the emotion representation vector of the short video.
The multi-head attention mechanism lets the model obtain the final representation from different representation subspaces while taking the correlations between modalities into account. The MMFA comprises a multi-head self-attention mechanism and a multi-head co-attention (mutual-attention) mechanism. The former works like the MTT: the data flow is dispersed into different subspaces through additional parameter matrices, and the attention weight matrices come from the same modality, so it only reinforces the features of that modality. In the latter, the attention weight matrices come from the features of different modalities, which makes it convenient to fuse different modal information. Specifically, information is fused pairwise between modalities, e.g. vision–text, text–audio and audio–vision. In essence, starting from the Transformer used for pure image classification, the input linear layer, which originally contained only picture information, is changed into a linear layer containing the information of two modalities; the two modality inputs of the attention mechanism are then linearly superimposed to form different modality fusion features under the co-attention mechanism. Finally, the single-modality representations and the common-modality representations are combined and fed into a softmax layer to generate the emotion attribute features.
For the multi-head self-attention module, taking the visual modality V as an example, the Q, K and V matrices of the traditional multi-head attention mechanism are changed accordingly for single-modality information fusion, with all inputs replaced by visual modality information (the corresponding formula is rendered as an image in the original).
Assuming the number of linear transformation layers is m, the final single-modality fusion is expressed by formula (14) (image in the original).
The multi-head co-attention mechanism fuses the information of different modalities, essentially performing feature fusion pairwise; taking the visual modality v and the audio modality a as an example, the calculation is given by formula (15) (image in the original).
Assuming the number of linear transformation layers is m, the fusion of the final fused modality is expressed by formula (16) (image in the original).
The feature representation obtained by fusing modality information with the co-attention mechanism (image in the original) contains not only the pairwise information fusion between modalities but also the information of the modality itself.
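A hedged PyTorch sketch of this self-attention plus co-attention combination is given below; the head count, dimensions and the final concatenation-and-pooling step are illustrative assumptions, not the MMFA's exact design.

```python
# Illustrative multi-head self- plus co-attention over two modalities (e.g. visual v
# and audio a); dimensions and the final concatenation are assumed for illustration.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, d_model=300, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.co_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, v, a):
        # v, a: (batch, seq, d_model) visual and audio sequences
        v_self, _ = self.self_attn(v, v, v)     # reinforce the visual modality itself
        v_co, _ = self.co_attn(v, a, a)         # queries from v, keys/values from a
        fused = torch.cat([v_self, v_co], dim=-1)
        return self.out(fused).mean(dim=1)      # pooled emotion-oriented representation
```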
The tag prediction phase comprises the following steps:
For a given video–tag pair (x_i, y_i), suppose the content feature vector and the emotion feature vector have been obtained, and let h denote the hashtag embedding vector. Tags can be divided into two types according to their semantic information: tags with emotion attributes and tags with content attributes. This semantic distinction means that candidate tags pay different attention to the emotion features and the content features, so an attention mechanism is needed to fuse the two multimodal features; formulas (18) and (19) (rendered as images in the original) express the emotion attributes and the content attributes combined with the original tag information. To obtain the attention weights of the two, formulas (20) and (21) (images in the original) derive the weight distribution over the different features through an exponential operation. Feeding both into the attention network yields the fusion ratio of the two, and thus the final fused vector (its formula is an image in the original). To evaluate the relevance score of a given short video and tag, x_{i,j} is fed into a multi-layer perceptron, in which the representation vector of the short video and the tag embedding learn their correlation through a non-linear hidden layer; the hidden layer of the multi-layer perceptron is expressed by formula (22) (image in the original).
In this embodiment, training treats tag recommendation as a binary classification task: a video–tag pair that appears in the data set is a positive sample, otherwise the given video–tag pair is a negative sample. Negative sample tags for a video are selected by random sampling, and cross entropy is used as the loss function (the loss formula appears as an image in the original).
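A hedged sketch of this prediction and training step is shown below: the attention weights over the emotion and content representations, the MLP relevance score, and binary cross entropy with randomly sampled negative tags follow the description above, while all dimensions and the exact way the tag embedding enters the score are assumptions.

```python
# Illustrative tag scorer: attention-weighted fusion of content and emotion
# features, an MLP relevance score, and BCE loss with sampled negative tags.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagScorer(nn.Module):
    def __init__(self, d_feat=512, d_tag=300, d_hidden=256):
        super().__init__()
        self.att_c = nn.Linear(d_feat + d_tag, 1)   # attention score for content features
        self.att_e = nn.Linear(d_feat + d_tag, 1)   # attention score for emotion features
        self.mlp = nn.Sequential(
            nn.Linear(d_feat + d_tag, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, content, emotion, tag_emb):
        # content, emotion: (batch, d_feat); tag_emb: (batch, d_tag)
        s_c = self.att_c(torch.cat([content, tag_emb], dim=-1))
        s_e = self.att_e(torch.cat([emotion, tag_emb], dim=-1))
        w = torch.softmax(torch.cat([s_c, s_e], dim=-1), dim=-1)   # fusion ratio
        fused = w[:, :1] * content + w[:, 1:] * emotion
        return self.mlp(torch.cat([fused, tag_emb], dim=-1)).squeeze(-1)  # relevance logit

def bce_loss(scorer, content, emotion, pos_tags, neg_tags):
    """Positive pairs come from the data set; negatives are randomly sampled tags."""
    pos = scorer(content, emotion, pos_tags)
    neg = scorer(content, emotion, neg_tags)
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(torch.cat([pos, neg]), labels)
```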
According to another aspect of the invention, a multimodal short video tag recommendation system fusing emotion information comprises a computer-readable storage medium and a processor; the computer-readable storage medium stores an executable program, and the processor reads the executable program stored in the computer-readable storage medium and executes the above multimodal short video tag recommendation method fusing emotion information.
Few models exist for multimodal tag recommendation. Although methods have been proposed for recommendation tasks on text, images or blogs, they are not suitable for the short video domain, because those models are designed for their respective domains and the structure of short video differs from that of text and pictures. The invention solves these problems and recommends short video tags better.
It will be understood by those skilled in the art that the foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included within the scope of the present invention.

Claims (10)

1. A multimodal short video tag recommendation method fusing emotion information, characterized by comprising the following steps:
S1: constructing a short video sample set, wherein the label of each short video sample comprises a plurality of corresponding platform tags, and the attributes of each short video sample comprise the corresponding image features, audio features and text features;
S2: inputting the short video samples into an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder, so that it extracts features from the images, audio and text of the short video samples to obtain content features and emotion features, and fuses them with an attention network to obtain a plurality of candidate video tags; and training the initial multimodal tag recommendation model into a target multimodal tag recommendation model by taking the expected video tags as the target and the textual feature difference between the candidate video tags and the expected video tags as the loss;
S3: inputting the current short video into the target multimodal tag recommendation model so that it generates the target video tags.
2. The multimodal short video tag recommendation method fusing emotion information according to claim 1, wherein the short video sample set comprises a training set, a validation set and a test set, and S2 comprises:
S21: inputting the training set into the initial multimodal tag recommendation model, which comprises: a content feature extraction module, an emotion feature extraction module, and a tag prediction module for fusion;
S22: extracting features of the image modality and the audio modality of the training set with pre-trained models in the content feature extraction module, extracting text features, and fusing the image, audio and text features with a multimodal Transformer model to obtain the corresponding content features;
S23: extracting features of the image modality and the audio modality of the training set with pre-trained models in the emotion feature extraction module, extracting text features, and fusing the image, audio and text features with a multi-head attention mechanism to obtain the corresponding emotion features;
S24: fusing the content features, emotion features and tag text features corresponding to the training set with the tag prediction module to obtain short video fusion features, and generating a plurality of candidate video tags from the short video fusion features; and calculating the textual feature error between each candidate video tag and the real video tag, so that the loss is reduced through continuous iterative training;
S25: validating and testing the initial multimodal tag recommendation model during training with the validation set and the test set respectively, and taking the validated and tested initial multimodal tag recommendation model as the target multimodal tag recommendation model.
3. The multimodal short video tag recommendation method fusing emotion information according to claim 2, wherein the content feature extraction module is based on a multimodal Transformer structure; the emotion feature extraction module is based on a cross-modal multi-head attention structure; and the tag prediction module is based on an attention network.
4. The multimodal short video tag recommendation method fusing emotion information according to claim 3, wherein the content feature extraction module comprises, connected in sequence: an encoder layer, a stacked-block layer and a fusion layer; the encoder layer encodes the information of the different modalities, the stacked-block layer produces modality representations with an attention mechanism, and the fusion layer fuses the cross-modal information to obtain the final content feature representation; and the stacked-block layer uses a plurality of stacked blocks per modality to obtain the attention-based feature representations, each stacked block comprising a multi-head attention mechanism, a cross-attention mechanism and two feedforward neural networks.
5. The multimodal short video tag recommendation method fusing emotion information according to claim 3, wherein the emotion feature extraction module performs inter-modal feature fusion on the image, audio and text features through a multimodal multi-head attention framework (MMFA) to obtain the emotion representation vectors corresponding to the short video samples;
the MMFA comprises a multi-head self-attention mechanism and a multi-head co-attention mechanism.
6. The multimodal short video tag recommendation method fusing emotion information according to claim 2, wherein before S21, S2 further comprises:
eliminating video samples that cannot be played normally through an integrity check, and filtering out video samples whose duration is below a duration threshold, whose textual information is below a word count threshold, and/or whose audio channel is missing.
7. The multimodal short video tag recommendation method fusing emotion information according to claim 1, wherein performing feature extraction on the images, audio and text of the short video samples to obtain content features and emotion features comprises:
dividing the audio data of a short video sample into audio segments at a preset time interval T, extracting features from each audio segment, and combining them into the audio features in time order;
extracting frames from the image data of the short video sample according to a preset number of video frames N, extracting features from each frame, and combining them into the image features in time order;
constructing a word bank from the tweet information and original tag information of the short video samples, representing the words in the word bank as vectors with a pre-trained language model, and extracting features from these vectors to obtain the textual features; and for concatenated words longer than a length threshold, segmenting them with a word segmentation tool, obtaining features through the pre-trained language model, and averaging them to obtain the context features.
8. The multimodal short video tag recommendation method fusing emotion information according to claim 7, wherein constructing a word bank from the tweet information and original tag information of the short video samples comprises:
collecting all tweet information and tag information of the short video samples, then aligning, segmenting and counting word frequencies in turn; and sorting the words from high to low word frequency and taking the words whose frequency is higher than N times to construct the word bank, where N is a preset proportional parameter.
9. The multimodal short video tag recommendation method fusing emotion information according to claim 8, wherein constructing the word bank from the tweet information and original tag information of the short video samples further comprises: filtering out non-English characters in the tweet information and original tag information; reducing English words sharing the same root to that root; and segmenting concatenated words longer than a length threshold into multiple independent words.
10. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when executing the computer program.
CN202210867181.2A (priority and filing date 2022-07-22) — Multimodal short video tag recommendation method fusing emotion information — Pending — published as CN115329127A

Priority Applications (1)

Application Number: CN202210867181.2A · Priority/Filing date: 2022-07-22 · Publication: CN115329127A (en) · Title: Multimodal short video tag recommendation method fusing emotion information

Applications Claiming Priority (1)

Application Number: CN202210867181.2A · Priority/Filing date: 2022-07-22 · Publication: CN115329127A (en) · Title: Multimodal short video tag recommendation method fusing emotion information

Publications (1)

Publication Number: CN115329127A · Publication Date: 2022-11-11

Family

ID=83920167

Family Applications (1)

Application Number: CN202210867181.2A · Status: Pending · Publication: CN115329127A (en) · Title: Multimodal short video tag recommendation method fusing emotion information

Country Status (1)

Country Link
CN (1) CN115329127A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658935A (en) * 2022-12-06 2023-01-31 北京红棉小冰科技有限公司 Personalized comment generation method and device
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN115964560A (en) * 2022-12-07 2023-04-14 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN116661803A (en) * 2023-07-31 2023-08-29 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment
CN116661803B (en) * 2023-07-31 2023-11-17 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment
CN116744063A (en) * 2023-08-15 2023-09-12 四川中电启明星信息技术有限公司 Short video push system integrating social attribute information
CN116744063B (en) * 2023-08-15 2023-11-03 四川中电启明星信息技术有限公司 Short video push system integrating social attribute information
CN116758462A (en) * 2023-08-22 2023-09-15 江西师范大学 Emotion polarity analysis method and device, electronic equipment and storage medium
CN117151826A (en) * 2023-09-13 2023-12-01 广州数说故事信息科技有限公司 Multi-mode electronic commerce commodity alignment method and device, electronic equipment and storage medium
CN117290596A (en) * 2023-09-20 2023-12-26 北京约来健康科技有限公司 Recommendation label generation method, device, equipment and medium for multi-mode data model
CN117112834A (en) * 2023-10-24 2023-11-24 苏州元脑智能科技有限公司 Video recommendation method and device, storage medium and electronic device
CN117112834B (en) * 2023-10-24 2024-02-02 苏州元脑智能科技有限公司 Video recommendation method and device, storage medium and electronic device
CN117390291A (en) * 2023-12-12 2024-01-12 山东省人工智能研究院 User demand recommendation algorithm and system based on decoupling multi-mode model
CN117390291B (en) * 2023-12-12 2024-03-12 山东省人工智能研究院 User demand recommendation method and system based on decoupling multi-mode model
CN117708375A (en) * 2024-02-05 2024-03-15 北京搜狐新媒体信息技术有限公司 Video processing method and device and related products

Similar Documents

Publication Publication Date Title
CN115329127A (en) Multimodal short video tag recommendation method fusing emotion information
CN110390103B (en) Automatic short text summarization method and system based on double encoders
Meghawat et al. A multimodal approach to predict social media popularity
CN115034224A (en) News event detection method and system integrating representation of multiple text semantic structure diagrams
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN109670050A (en) A kind of entity relationship prediction technique and device
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN113392265A (en) Multimedia processing method, device and equipment
CN115408488A (en) Segmentation method and system for novel scene text
Glavan et al. InstaIndoor and multi-modal deep learning for indoor scene recognition
CN114611520A (en) Text abstract generating method
Luo et al. Multimodal reconstruct and align net for missing modality problem in sentiment analysis
Abdar et al. A review of deep learning for video captioning
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN113486174A (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN117036833A (en) Video classification method, apparatus, device and computer readable storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
CN116383517A (en) Dynamic propagation feature enhanced multi-modal rumor detection method and system
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
CN113919338A (en) Method and device for processing text data
Agarwal et al. Sentiment Analysis Dashboard for Socia Media comments using BERT
Namitha et al. Prediction of Movie Categories Using Randomized Sequences with Machine Learning
Cui et al. Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination