CN115329127A - Multimodal short video tag recommendation method fusing emotion information - Google Patents


Info

Publication number
CN115329127A
Authority
CN
China
Prior art keywords
label
short video
video
information
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210867181.2A
Other languages
Chinese (zh)
Inventor
Li Yuhua (李玉华)
Du Chang (杜畅)
Li Ruixuan (李瑞轩)
Gu Xiwu (辜希武)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-07-22
Filing date: 2022-07-22
Publication date: 2022-11-11
Application filed by Huazhong University of Science and Technology
Priority to CN202210867181.2A
Publication of CN115329127A
Legal status: Pending

Classifications

    • G06F16/735 — Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/65 — Information retrieval of audio data; clustering; classification
    • G06F16/75 — Information retrieval of video data; clustering; classification
    • G06F16/783 — Information retrieval of video data; retrieval characterised by metadata automatically derived from the content
    • G06F16/7834 — Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7844 — Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcripts of audio data
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/44 — Image or video recognition or understanding; local feature extraction by analysis of parts of the pattern (edges, contours, corners, strokes or intersections); connectivity analysis
    • G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects

Abstract

The invention discloses a multimodal short video tag recommendation method fusing emotion information, which belongs to the technical field of video processing and comprises the following steps: constructing a short video sample set; inputting the short video samples into an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder, which extracts features from the images, audio and text of each sample to obtain content features and emotion features and fuses them with an attention network to obtain a plurality of candidate video tags; training the initial multimodal tag recommendation model into a target multimodal tag recommendation model by taking the expected video tags as the target and the textual feature difference between the candidate video tags and the expected video tags as the loss; and inputting the current short video into the target multimodal tag recommendation model to generate the target video tags. By fusing image, audio and text features, the invention makes full use of the multimodal information associated with a video and effectively improves the quality of the generated video tags.

Description

Multimodal short video tag recommendation method fusing emotion information
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a multimodal short video tag recommendation method fusing emotion information.
Background
With the development of multimedia technology and portable mobile devices, and with the promotion of various short video platforms, short video is becoming a new medium through which the public acquires information and socializes, extending traditional text and image media. Short videos have a limited time span, can be shot conveniently and shared instantly, and therefore spread widely and in huge quantities. Recommendation systems were first applied mainly in e-commerce, recommending products to a user by analyzing the relationship between users and products; they then gradually took on the task of recommending content of interest by analyzing user-related information in social media and news platforms, such as historical tweets, comments and articles. The subjects of a recommendation system can be broadly divided into users and items, and recommendation tasks fall roughly into several modes: one matches a list of items to a user through item-to-item similarity; another recommends the same item to groups of users with similar characteristics, based on user-to-user similarity; a third matches lists of items with similar attributes by reasonably modeling the user's related information, and the basic idea of tag recommendation derives from this approach.
Tags are important in many different fields for marking specific information and for helping search engines locate key information sources. A tag can be a single word, a phrase without spaces, or even an arbitrary word combination prefixed with the symbol #. Tags organize and classify tweets with different content; articles with similar content and tweets shared by other users can be reached through a tag's hyperlink, so a tag service benefits users in search and lets them browse more content of interest. Automatic tagging of text and images has become an important research topic in recent years.
Existing short video tag recommendation models have difficulty accurately predicting texts and tags that carry emotion factors after multimodal information fusion. This is especially true for short videos, which contain richer visual content and are more likely to carry the user's emotion information; however, little research has introduced emotion information into short video tag recommendation, and when it is introduced it is usually only linearly superimposed on the content information, so prediction accuracy is low. Overall, the quality of short video tags generated by the prior art still needs to be improved.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a multimodal short video tag recommendation method fusing emotion information. It aims to make full use of the multimodal information associated with a short video by fusing image, audio and text features, and to effectively improve the quality of the generated video tags, thereby solving the technical problem that existing short video tags are of poor quality.
In order to achieve the above object, according to one aspect of the present invention, there is provided a multimodal short video tag recommendation method fusing emotion information, the method comprising:
S1: constructing a short video sample set, wherein the label of each short video sample comprises a plurality of corresponding platform tags, and the attributes of each short video sample comprise the corresponding image features, audio features and text features;
S2: inputting the short video samples into an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder, so that it extracts features from the images, audio and text of the short video samples to obtain content features and emotion features, and fuses them with an attention network to obtain a plurality of candidate video tags; and training the initial multimodal tag recommendation model into a target multimodal tag recommendation model by taking the expected video tags as the target and the textual feature difference between the candidate video tags and the expected video tags as the loss;
S3: inputting the current short video into the target multimodal tag recommendation model so that it generates the target video tags.
In one embodiment, the short video sample set comprises a training set, a validation set and a test set, and S2 comprises:
S21: inputting the training set into the initial multimodal tag recommendation model, which comprises: a content feature extraction module, an emotion feature extraction module, and a tag prediction module for fusion;
S22: extracting features of the image modality and the audio modality of the training set with pre-trained models in the content feature extraction module, extracting text features, and fusing the image, audio and text features with a multimodal Transformer model to obtain the corresponding content features;
S23: extracting features of the image modality and the audio modality of the training set with pre-trained models in the emotion feature extraction module, extracting text features, and fusing the image, audio and text features with a multi-head attention mechanism to obtain the corresponding emotion features;
S24: fusing the content features, emotion features and tag text features corresponding to the training set with the tag prediction module to obtain short video fusion features, and generating a plurality of candidate video tags from the short video fusion features; and calculating the textual feature error between each candidate video tag and the real video tag, so that the loss is reduced through continuous iterative training;
S25: validating and testing the initial multimodal tag recommendation model during training with the validation set and the test set respectively, and taking the validated and tested initial multimodal tag recommendation model as the target multimodal tag recommendation model.
In one embodiment, the content feature extraction module is based on a multimodal Transformer structure; the emotion feature extraction module is based on a cross-modal multi-head attention structure; and the tag prediction module is based on an attention network.
In one embodiment, the content feature extraction module comprises, connected in sequence: an encoder layer, a stacked-block layer and a fusion layer. The encoder layer encodes the information of the different modalities, the stacked-block layer produces modality representations with an attention mechanism, and the fusion layer fuses the cross-modal information to obtain the final content feature representation. The stacked-block layer uses N stacked blocks per modality to obtain the attention-based feature representations, and each stacked block comprises a multi-head attention mechanism, a cross-attention mechanism and two feedforward neural networks.
In one embodiment, the emotion feature extraction module performs inter-modal feature fusion on the image, audio and text features through a multimodal multi-head attention framework (MMFA) to obtain the emotion representation vector corresponding to each short video sample;
the MMFA comprises a multi-head self-attention mechanism and a multi-head co-attention mechanism.
In one embodiment, before S21, S2 further includes:
eliminating video samples that cannot be played normally through an integrity check, and filtering out video samples whose duration is below a duration threshold, whose textual information is below a word count threshold, and/or whose audio channel is missing.
In one embodiment, performing feature extraction on the images, audio and text of the short video samples to obtain content features and emotion features comprises (an illustrative sketch of the segmentation steps follows this embodiment):
dividing the audio data of a short video sample into audio segments at a preset time interval T, extracting features from each audio segment, and combining them into the audio features in time order;
extracting frames from the image data of the short video sample according to a preset number of video frames N, extracting features from each frame, and combining them into the image features in time order;
constructing a word bank from the tweet information and original tag information of the short video samples, representing the words in the word bank as vectors with a pre-trained language model, and extracting features from these vectors to obtain the textual features; and for concatenated words longer than a length threshold, segmenting them with a word segmentation tool, obtaining features through the pre-trained language model, and averaging them to obtain the context features.
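As referenced above, a minimal sketch of the audio segmentation and frame sampling is given below; the interval T, the frame count N and the choice of librosa and OpenCV are assumptions made for illustration and are not prescribed by the embodiment.

```python
# Illustrative sketch only: segment audio into T-second chunks and sample N frames.
# T, N and the use of librosa/OpenCV are assumptions, not part of the patent text.
import cv2
import librosa
import numpy as np

def split_audio(audio_path, T=1.0, sr=16000):
    """Load audio and cut it into consecutive T-second segments (time order preserved)."""
    wav, sr = librosa.load(audio_path, sr=sr, mono=True)
    seg_len = int(T * sr)
    return [wav[i:i + seg_len] for i in range(0, len(wav), seg_len)
            if len(wav[i:i + seg_len]) == seg_len]

def sample_frames(video_path, N=8):
    """Uniformly sample N frames from the video, keeping temporal order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), N).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```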
In one embodiment, constructing a word bank from the tweet information and original tag information of the short video samples comprises:
collecting all tweet information and original tag information of the short video samples, then aligning, segmenting and counting word frequencies in turn; and sorting the words from high to low word frequency and taking the words whose frequency is higher than N times to construct the word bank, where N is a preset proportional parameter.
In one embodiment, constructing the word bank from the tweet information and original tag information of the short video samples further comprises: filtering out non-English characters in the tweet information and original tag information; reducing English words sharing the same root to that root; and segmenting concatenated words longer than a length threshold into multiple independent words.
According to another aspect of the invention, an electronic device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method described above when executing the computer program.
In general, compared with the prior art, the above technical solution conceived by the present invention can achieve the following beneficial effects:
(1) By fusing image, audio and text features, the invention makes full use of the multimodal information associated with a video and effectively improves the quality of the generated video tags.
(2) The initial multimodal tag recommendation model provided by the invention performs feature fusion step by step, so that multimodal information is fused more effectively according to the influence weight of different information on the video tags, improving the quality of the finally generated video tags.
(3) The invention preprocesses the collected video data before feature extraction, which effectively avoids errors and redundancy in the data set, ensures the training effect of the model, and ensures that the video tags generated by the model are of higher quality.
(4) When performing multimodal feature fusion, the invention takes the emotion information in the video into account; through a multi-task learning method, the video's emotion information can be captured at the same time, which helps predict tags with emotion attributes and further improves the quality of the generated video tags.
Drawings
Fig. 1 is a logic diagram of the multimodal short video tag recommendation method fusing emotion information according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the content feature extraction module of the initial multimodal tag recommendation model according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a stacked block in the content feature extraction module of the initial multimodal tag recommendation model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the emotion feature extraction module of the initial multimodal tag recommendation model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in Fig. 1, the invention provides a multimodal short video tag recommendation method fusing emotion information, which comprises a model training phase and a tag prediction phase.
the model training phase comprises:
(S1.1) Collecting video data and the corresponding tag data. After comparing the video volume and tag quality of various short video platforms, a Python web crawler is used to collect data from the Vine platform. The specific collection method is as follows: the original Vine data set is processed by taking the intersection of user ID–video URL, user ID–user tweet and user ID–tweet tag, filtering out data that contain only video or only text, and joining the data through the uniqueness of the user ID, so that the final data set contains video URL–tweet pairs, where the tweet is text containing specific tags. Tags appearing fewer than 10 times in the whole data set are removed to reduce the randomness of tag–video pairs, and the data are cleaned using the preprocessing method described below. The binary media files at the video download addresses are fetched through a local proxy and a crawler program; 40049 usable short videos and 1935 distinct tags are finally obtained. About 86% of the short videos in the data set last around 6 s, each video contains at least 1 and at most 21 tags, each video contains 4.8 tags on average, and the tweets contain 9.73 words on average.
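A minimal sketch of the filtering logic described above is shown below; the record layout and field names are illustrative assumptions, and only the "drop tags seen fewer than 10 times" rule comes from the text.

```python
# Illustrative sketch of the data-set cleaning described above.
# Field names (user_id, video_url, tweet, tags) are assumed.
from collections import Counter

def clean_dataset(records, min_count=10):
    """records: list of dicts with user_id, video_url, tweet, tags (list of str)."""
    # Keep only records that have both a video and a tweet with tags.
    records = [r for r in records if r.get("video_url") and r.get("tweet") and r.get("tags")]
    # Count tag frequencies over the whole data set.
    freq = Counter(t for r in records for t in r["tags"])
    cleaned = []
    for r in records:
        tags = [t for t in r["tags"] if freq[t] >= min_count]
        if tags:
            cleaned.append({"video_url": r["video_url"], "tweet": r["tweet"], "tags": tags})
    return cleaned
```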
(S1.2) Preprocessing the collected videos and the corresponding tags. The preprocessing specifically comprises: eliminating videos that cannot be played normally through an integrity check, using the open source tool ffmpeg; and filtering out video data with a short duration, short tweet information and/or a missing audio channel. In this embodiment, step (S1.2) further comprises:
constructing the tag word bank as follows: after segmenting the text of all videos collected in the model training phase, the word frequency of each word is counted; the words are sorted from high to low frequency, and the words whose frequency is higher than N times are taken to construct the word bank, where N is a preset proportional parameter. When constructing the word bank, non-English characters in the tweet information and original tag information are filtered out; English words sharing the same root are reduced to that root; and longer concatenated words are segmented with a word segmentation tool into multiple single words.
Separating the video data into image data and audio data, and extracting features from the image data, the audio data and the corresponding user tweet data respectively; taking the platform tags corresponding to each short video as its training labels and the image, audio and tweet features of the short video as its attributes to form a sample; all samples form a data set, which is divided into a training set, a validation set and a test set.
(S2) The model training phase further comprises:
establishing an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder. The model performs content feature extraction and emotion feature extraction on the image features, audio features and tweet/comment context features with different types of pre-trained models, realizes multimodal feature fusion through the content feature extraction module and the emotion feature extraction module, and generates several video tags related to the video content and the user's recommendation information from the fused features. Based on a multi-task learning method, the content feature extraction module and the emotion feature fusion module learn jointly, i.e. the modality information is mined by different pre-trained models; compared with single-task learning, the emotion feature extraction module can effectively fuse the emotion information in the short video while the content features are extracted, and the feature vectors of the different tasks are fused through an attention network, which avoids noise and improves the performance of the initial multimodal tag recommendation model. The established model is trained, validated and tested with the training set, validation set and test set respectively, so as to obtain the target multimodal tag recommendation model.
Further, the initial multimodal tag recommendation model comprises a content feature extraction module, an emotion feature extraction module and a tag prediction module. The content feature extraction module extracts features of the short video data in the image and audio modalities through pre-trained models related to the video content information, and fuses the image, audio and text modalities through a multimodal Transformer model. The emotion feature extraction module extracts features of the short video data in the image and audio modalities through pre-trained models related to the video emotion information, and fuses the image, audio and text modalities through a multi-head attention mechanism. The tag prediction module fuses the video content fusion features output by the content feature extraction module, the emotion fusion features output by the emotion feature extraction module and the tag text features related to the video to obtain the short video fusion features, and generates several candidate video tags from them. The tag prediction module also calculates the error between the textual features of the generated video tags and those of the real video tags as the loss, so that the loss is reduced through continuous iterative training.
Based on this structure of the initial multimodal tag recommendation model, features can be fused step by step, so that multimodal information is fused more effectively according to the influence weight of different information on the video tags, and the quality of the finally generated tags is improved.
In step (S2), feature extraction for the different modalities in the content feature extraction module and the emotion feature extraction module is performed as follows. For visual features, the content feature extraction module uses the ffmpeg open-source streaming media tool for framing, adopts VGG16-LSTM to extract visual features from the video and capture the temporal continuity of the picture frames, and obtains 1024-dimensional visual feature vectors in a slow-fusion manner; the emotion feature extraction module adopts a 3D-CNN pre-trained model to obtain 512-dimensional visual feature representations. For audio features, the content feature extraction module adopts VGGish to extract feature vectors: the audio data are resampled at a fixed frequency, a spectrogram of the sampled data is obtained by Fourier transform, and 1024-dimensional acoustic spectrum data of the audio file are finally obtained through a filter bank; the emotion feature extraction module uses the Librosa audio processing library and feeds the result into a convolutional neural network to obtain 300-dimensional feature vectors. For text features, the content feature extraction module adopts a BERT pre-trained language model to extract features from the hashtag text, obtaining 768-dimensional embedded representations; the emotion feature extraction module adopts a GloVe model pre-trained on an English corpus and finally obtains 300-dimensional text vectors.
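The following sketch illustrates how the text branch of this pipeline might be assembled with the Hugging Face transformers BERT model, producing the 768-dimensional embeddings mentioned above; "bert-base-uncased" and mean pooling are assumptions, and the visual and audio extractors are left as placeholders since the text names only the model families (VGG16-LSTM, 3D-CNN, VGGish, Librosa), not their exact configurations.

```python
# Sketch of the text branch only; the 768-dim output matches the stated dimension.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def text_features(tweet: str) -> torch.Tensor:
    """Return a 768-dimensional embedding for a tweet / hashtag string."""
    inputs = tokenizer(tweet, return_tensors="pt", truncation=True, max_length=64)
    hidden = bert(**inputs).last_hidden_state        # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool over tokens

# Visual (VGG16-LSTM / 3D-CNN) and audio (VGGish / Librosa + CNN) extractors
# would be plugged in analogously, yielding 1024/512-dim and 1024/300-dim vectors.
```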
In addition, an automatic bullet-screen (danmaku) comment generation model is established based on an autoencoder (encoder–decoder) structure. It performs time-series analysis on the image features, audio features and bullet-screen comment context features respectively, realizes multimodal feature fusion together with the textual features of the bullet-screen comments, and generates a bullet-screen comment related to the video content and the bullet-screen context from the fused features. This model comprises: a content feature extraction module, an emotion feature extraction module and a tag prediction module. Content information fusion in the content feature extraction module comprises: the encoder layer encodes the information of the different modalities, the stacked-block layer produces modality representations with an attention mechanism, and the fusion layer fuses the cross-modal information to obtain the final video content feature representation.
As shown in Fig. 2, the video content representation mainly consists of three parts: the encoding layer encodes the information of the different modalities, the stacked-block layer produces modality representations with attention, and the fusion layer fuses the cross-modal information to obtain the final video content feature representation.
The stacked-block layer uses N stacked blocks per modality to obtain attention-based feature representations of the modality information. As shown in Fig. 3, each stacked block mainly consists of four parts: a multi-head attention mechanism (Multi-Head Attention), a cross-attention mechanism (Cross Attention) and two feed-forward neural networks (FFNs).
The Multi-Head Attention module is an important structure in the Transformer model. It adds more matrix parameters on top of the traditional self-attention mechanism, so that after the Q, K and V matrices are obtained from the input vectors, they are further multiplied by several matrices whose number equals the number of attention heads. The advantage is that the data flow is split into different subspaces, so the model can attend to different aspects of the information. Cross attention originally sits at the decoder side of the Transformer model: the corresponding K and V matrices are generated in the encoder module and combined with the Q matrix of the decoder module, fusing the information of the encoder and decoder. Although originally used for machine translation, cross attention can also fuse information between the different modalities of a short video, letting the Transformer combine information from different modalities in a more unified and natural way. The FFN concatenates the several subspace matrices produced by the multi-head attention into a unified matrix and multiplies it by a parameter matrix to obtain the final single-modality information fusion matrix.
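A minimal PyTorch sketch of such a stacked block follows, using nn.MultiheadAttention for both the self-attention and cross-attention steps; the dimensions, layer norms and residual pattern are illustrative assumptions rather than the patent's exact design.

```python
# Illustrative stacked block: multi-head self-attention, cross-attention to
# another modality, and two feed-forward networks. Hyperparameters are assumed.
import torch
import torch.nn as nn

class StackedBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ffn2 = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x, other):
        # x: (batch, len_x, d_model) features of this modality at layer t-1
        # other: (batch, len_o, d_model) features of another modality
        h = self.norms[0](x + self.self_attn(x, x, x, need_weights=False)[0])
        h = self.norms[1](h + self.ffn1(h))
        h = self.norms[2](h + self.cross_attn(h, other, other, need_weights=False)[0])
        return self.norms[3](h + self.ffn2(h))
```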
In the stacked block of the t-th layer, the input for each modality comes from the output of layer t−1 (the corresponding update formulas appear only as images in the original document).
The feature fusion method between the different modalities is as follows, where c denotes the text modality, v the visual modality and a the audio modality. Taking the computation for the visual modality v as an example: after the cross-attention mechanism, a fusion gate is constructed through a multi-layer perceptron and multiplied by the corresponding weight matrix, so that inter-modal feature fusion is performed with the text information and with the audio information respectively (the corresponding formulas appear only as images in the original). The gated result is then fed into the feedforward neural network to obtain the output of the t-th stacked block; the layer-t outputs of the audio modality and the text modality are obtained in the same way. The outputs of the stacked blocks are then converted into vectors of a specific length by a weighted pooling operation.
After the attention-based feature representations of the three modalities have been obtained in this way, feature fusion gates are created in the fusion layer through a multi-layer perceptron, and the information of the different modalities is multiplied by the corresponding weights and summed to obtain the fused vector of the stacked block:
g_v = MLP(F_c, F_v, F_a)    (10)
(the weighted-sum formula, equation (11), is rendered as an image in the original).
In step (S2), the emotion feature extraction module performs emotion information fusion as follows.
Compared with content attribute tags, emotion attribute tags are more easily predicted. After the feature representations are extracted by the emotion pre-trained models of the different modalities, the overall module architecture is as shown in Fig. 4: the visual, audio and text features undergo inter-modal feature fusion through a Multimodal Multi-Head Attention framework (MMFA) to obtain the emotion representation vector of the short video.
The multi-head attention mechanism lets the model obtain the final representation from different representation subspaces while taking the correlations between modalities into account. The MMFA comprises a multi-head self-attention mechanism and a multi-head co-attention (mutual-attention) mechanism. The former works like the MTT: the data flow is dispersed into different subspaces through additional parameter matrices, and the attention weight matrices come from the same modality, so it only reinforces the features of that modality. In the latter, the attention weight matrices come from the features of different modalities, which makes it convenient to fuse different modal information. Specifically, information is fused pairwise between modalities, e.g. vision–text, text–audio and audio–vision. In essence, starting from the Transformer used for pure image classification, the input linear layer, which originally contained only picture information, is changed into a linear layer containing the information of two modalities; the two modality inputs of the attention mechanism are then linearly superimposed to form different modality fusion features under the co-attention mechanism. Finally, the single-modality representations and the common-modality representations are combined and fed into a softmax layer to generate the emotion attribute features.
For the multi-head self-attention module, taking the visual modality V as an example, the Q, K and V matrices of the traditional multi-head attention mechanism are changed accordingly for single-modality information fusion, with all inputs replaced by visual modality information (the corresponding formula is rendered as an image in the original).
Assuming the number of linear transformation layers is m, the final single-modality fusion is expressed by formula (14) (image in the original).
The multi-head co-attention mechanism fuses the information of different modalities, essentially performing feature fusion pairwise; taking the visual modality v and the audio modality a as an example, the calculation is given by formula (15) (image in the original).
Assuming the number of linear transformation layers is m, the fusion of the final fused modality is expressed by formula (16) (image in the original).
The feature representation obtained by fusing modality information with the co-attention mechanism (image in the original) contains not only the pairwise information fusion between modalities but also the information of the modality itself.
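A hedged PyTorch sketch of this self-attention plus co-attention combination is given below; the head count, dimensions and the final concatenation-and-pooling step are illustrative assumptions, not the MMFA's exact design.

```python
# Illustrative multi-head self- plus co-attention over two modalities (e.g. visual v
# and audio a); dimensions and the final concatenation are assumed for illustration.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, d_model=300, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.co_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, v, a):
        # v, a: (batch, seq, d_model) visual and audio sequences
        v_self, _ = self.self_attn(v, v, v)     # reinforce the visual modality itself
        v_co, _ = self.co_attn(v, a, a)         # queries from v, keys/values from a
        fused = torch.cat([v_self, v_co], dim=-1)
        return self.out(fused).mean(dim=1)      # pooled emotion-oriented representation
```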
The tag prediction phase comprises the following steps:
For a given video–tag pair (x_i, y_i), suppose the content feature vector and the emotion feature vector have been obtained, and let h denote the hashtag embedding vector. Tags can be divided into two types according to their semantic information: tags with emotion attributes and tags with content attributes. This semantic distinction means that candidate tags pay different attention to the emotion features and the content features, so an attention mechanism is needed to fuse the two multimodal features; formulas (18) and (19) (rendered as images in the original) express the emotion attributes and the content attributes combined with the original tag information. To obtain the attention weights of the two, formulas (20) and (21) (images in the original) derive the weight distribution over the different features through an exponential operation. Feeding both into the attention network yields the fusion ratio of the two, and thus the final fused vector (its formula is an image in the original). To evaluate the relevance score of a given short video and tag, x_{i,j} is fed into a multi-layer perceptron, in which the representation vector of the short video and the tag embedding learn their correlation through a non-linear hidden layer; the hidden layer of the multi-layer perceptron is expressed by formula (22) (image in the original).
In this embodiment, training treats tag recommendation as a binary classification task: a video–tag pair that appears in the data set is a positive sample, otherwise the given video–tag pair is a negative sample. Negative sample tags for a video are selected by random sampling, and cross entropy is used as the loss function (the loss formula appears as an image in the original).
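A hedged sketch of this prediction and training step is shown below: the attention weights over the emotion and content representations, the MLP relevance score, and binary cross entropy with randomly sampled negative tags follow the description above, while all dimensions and the exact way the tag embedding enters the score are assumptions.

```python
# Illustrative tag scorer: attention-weighted fusion of content and emotion
# features, an MLP relevance score, and BCE loss with sampled negative tags.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagScorer(nn.Module):
    def __init__(self, d_feat=512, d_tag=300, d_hidden=256):
        super().__init__()
        self.att_c = nn.Linear(d_feat + d_tag, 1)   # attention score for content features
        self.att_e = nn.Linear(d_feat + d_tag, 1)   # attention score for emotion features
        self.mlp = nn.Sequential(
            nn.Linear(d_feat + d_tag, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, content, emotion, tag_emb):
        # content, emotion: (batch, d_feat); tag_emb: (batch, d_tag)
        s_c = self.att_c(torch.cat([content, tag_emb], dim=-1))
        s_e = self.att_e(torch.cat([emotion, tag_emb], dim=-1))
        w = torch.softmax(torch.cat([s_c, s_e], dim=-1), dim=-1)   # fusion ratio
        fused = w[:, :1] * content + w[:, 1:] * emotion
        return self.mlp(torch.cat([fused, tag_emb], dim=-1)).squeeze(-1)  # relevance logit

def bce_loss(scorer, content, emotion, pos_tags, neg_tags):
    """Positive pairs come from the data set; negatives are randomly sampled tags."""
    pos = scorer(content, emotion, pos_tags)
    neg = scorer(content, emotion, neg_tags)
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(torch.cat([pos, neg]), labels)
```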
According to another aspect of the invention, a multimodal short video tag recommendation system fusing emotion information comprises a computer-readable storage medium and a processor; the computer-readable storage medium stores an executable program, and the processor reads the executable program stored in the computer-readable storage medium and executes the above multimodal short video tag recommendation method fusing emotion information.
Few models exist for multimodal tag recommendation. Although methods have been proposed for recommendation tasks on text, images or blogs, they are not suitable for the short video domain, because those models are designed for their respective domains and the structure of short video differs from that of text and pictures. The invention solves these problems and recommends short video tags better.
It will be understood by those skilled in the art that the foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included within the scope of the present invention.

Claims (10)

1. A multimodal short video tag recommendation method fusing emotion information, characterized by comprising the following steps:
S1: constructing a short video sample set, wherein the label of each short video sample comprises a plurality of corresponding platform tags, and the attributes of each short video sample comprise the corresponding image features, audio features and text features;
S2: inputting the short video samples into an initial multimodal tag recommendation model based on a multi-head attention mechanism and an autoencoder, so that it extracts features from the images, audio and text of the short video samples to obtain content features and emotion features, and fuses them with an attention network to obtain a plurality of candidate video tags; and training the initial multimodal tag recommendation model into a target multimodal tag recommendation model by taking the expected video tags as the target and the textual feature difference between the candidate video tags and the expected video tags as the loss;
S3: inputting the current short video into the target multimodal tag recommendation model so that it generates the target video tags.
2. The multimodal short video tag recommendation method fusing emotion information according to claim 1, wherein the short video sample set comprises a training set, a validation set and a test set, and S2 comprises:
S21: inputting the training set into the initial multimodal tag recommendation model, which comprises: a content feature extraction module, an emotion feature extraction module, and a tag prediction module for fusion;
S22: extracting features of the image modality and the audio modality of the training set with pre-trained models in the content feature extraction module, extracting text features, and fusing the image, audio and text features with a multimodal Transformer model to obtain the corresponding content features;
S23: extracting features of the image modality and the audio modality of the training set with pre-trained models in the emotion feature extraction module, extracting text features, and fusing the image, audio and text features with a multi-head attention mechanism to obtain the corresponding emotion features;
S24: fusing the content features, emotion features and tag text features corresponding to the training set with the tag prediction module to obtain short video fusion features, and generating a plurality of candidate video tags from the short video fusion features; and calculating the textual feature error between each candidate video tag and the real video tag, so that the loss is reduced through continuous iterative training;
S25: validating and testing the initial multimodal tag recommendation model during training with the validation set and the test set respectively, and taking the validated and tested initial multimodal tag recommendation model as the target multimodal tag recommendation model.
3. The multimodal short video tag recommendation method fusing emotion information according to claim 2, wherein the content feature extraction module is based on a multimodal Transformer structure; the emotion feature extraction module is based on a cross-modal multi-head attention structure; and the tag prediction module is based on an attention network.
4. The multimodal short video tag recommendation method fusing emotion information according to claim 3, wherein the content feature extraction module comprises, connected in sequence: an encoder layer, a stacked-block layer and a fusion layer; the encoder layer encodes the information of the different modalities, the stacked-block layer produces modality representations with an attention mechanism, and the fusion layer fuses the cross-modal information to obtain the final content feature representation; and the stacked-block layer uses a plurality of stacked blocks per modality to obtain the attention-based feature representations, each stacked block comprising a multi-head attention mechanism, a cross-attention mechanism and two feedforward neural networks.
5. The multimodal short video tag recommendation method fusing emotion information according to claim 3, wherein the emotion feature extraction module performs inter-modal feature fusion on the image, audio and text features through a multimodal multi-head attention framework (MMFA) to obtain the emotion representation vectors corresponding to the short video samples;
the MMFA comprises a multi-head self-attention mechanism and a multi-head co-attention mechanism.
6. The multimodal short video tag recommendation method fusing emotion information according to claim 2, wherein before S21, S2 further comprises:
eliminating video samples that cannot be played normally through an integrity check, and filtering out video samples whose duration is below a duration threshold, whose textual information is below a word count threshold, and/or whose audio channel is missing.
7. The multimodal short video tag recommendation method fusing emotion information according to claim 1, wherein performing feature extraction on the images, audio and text of the short video samples to obtain content features and emotion features comprises:
dividing the audio data of a short video sample into audio segments at a preset time interval T, extracting features from each audio segment, and combining them into the audio features in time order;
extracting frames from the image data of the short video sample according to a preset number of video frames N, extracting features from each frame, and combining them into the image features in time order;
constructing a word bank from the tweet information and original tag information of the short video samples, representing the words in the word bank as vectors with a pre-trained language model, and extracting features from these vectors to obtain the textual features; and for concatenated words longer than a length threshold, segmenting them with a word segmentation tool, obtaining features through the pre-trained language model, and averaging them to obtain the context features.
8. The multimodal short video tag recommendation method fusing emotion information according to claim 7, wherein constructing a word bank from the tweet information and original tag information of the short video samples comprises:
collecting all tweet information and tag information of the short video samples, then aligning, segmenting and counting word frequencies in turn; and sorting the words from high to low word frequency and taking the words whose frequency is higher than N times to construct the word bank, where N is a preset proportional parameter.
9. The multimodal short video tag recommendation method fusing emotion information according to claim 8, wherein constructing the word bank from the tweet information and original tag information of the short video samples further comprises: filtering out non-English characters in the tweet information and original tag information; reducing English words sharing the same root to that root; and segmenting concatenated words longer than a length threshold into multiple independent words.
10. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when executing the computer program.
CN202210867181.2A (priority and filing date 2022-07-22) — Multimodal short video tag recommendation method fusing emotion information — Pending — published as CN115329127A

Priority Applications (1)

Application Number: CN202210867181.2A · Priority/Filing date: 2022-07-22 · Publication: CN115329127A (en) · Title: Multimodal short video tag recommendation method fusing emotion information

Applications Claiming Priority (1)

Application Number: CN202210867181.2A · Priority/Filing date: 2022-07-22 · Publication: CN115329127A (en) · Title: Multimodal short video tag recommendation method fusing emotion information

Publications (1)

Publication Number: CN115329127A · Publication Date: 2022-11-11

Family

ID=83920167

Family Applications (1)

Application Number: CN202210867181.2A · Status: Pending · Publication: CN115329127A (en) · Title: Multimodal short video tag recommendation method fusing emotion information

Country Status (1)

Country Link
CN (1) CN115329127A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658935A (en) * 2022-12-06 2023-01-31 北京红棉小冰科技有限公司 Personalized comment generation method and device
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN115964560A (en) * 2022-12-07 2023-04-14 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN116661803A (en) * 2023-07-31 2023-08-29 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment
CN116661803B (en) * 2023-07-31 2023-11-17 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment
CN116744063A (en) * 2023-08-15 2023-09-12 四川中电启明星信息技术有限公司 Short video push system integrating social attribute information
CN116744063B (en) * 2023-08-15 2023-11-03 四川中电启明星信息技术有限公司 Short video push system integrating social attribute information
CN116758462A (en) * 2023-08-22 2023-09-15 江西师范大学 Emotion polarity analysis method and device, electronic equipment and storage medium
CN117151826A (en) * 2023-09-13 2023-12-01 广州数说故事信息科技有限公司 Multi-mode electronic commerce commodity alignment method and device, electronic equipment and storage medium
CN117290596A (en) * 2023-09-20 2023-12-26 北京约来健康科技有限公司 Recommendation label generation method, device, equipment and medium for multi-mode data model
CN117112834A (en) * 2023-10-24 2023-11-24 苏州元脑智能科技有限公司 Video recommendation method and device, storage medium and electronic device
CN117112834B (en) * 2023-10-24 2024-02-02 苏州元脑智能科技有限公司 Video recommendation method and device, storage medium and electronic device
CN117390291A (en) * 2023-12-12 2024-01-12 山东省人工智能研究院 User demand recommendation algorithm and system based on decoupling multi-mode model
CN117390291B (en) * 2023-12-12 2024-03-12 山东省人工智能研究院 User demand recommendation method and system based on decoupling multi-mode model
CN117708375A (en) * 2024-02-05 2024-03-15 北京搜狐新媒体信息技术有限公司 Video processing method and device and related products

Similar Documents

Publication Publication Date Title
CN115329127A (en) Multimodal short video tag recommendation method fusing emotion information
CN110390103B (en) Automatic short text summarization method and system based on double encoders
Meghawat et al. A multimodal approach to predict social media popularity
CN115034224A (en) News event detection method and system integrating representation of multiple text semantic structure diagrams
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN109670050A (en) A kind of entity relationship prediction technique and device
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN113392265A (en) Multimedia processing method, device and equipment
CN115408488A (en) Segmentation method and system for novel scene text
Glavan et al. InstaIndoor and multi-modal deep learning for indoor scene recognition
CN114611520A (en) Text abstract generating method
Luo et al. Multimodal reconstruct and align net for missing modality problem in sentiment analysis
Abdar et al. A review of deep learning for video captioning
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN113486174A (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN117036833A (en) Video classification method, apparatus, device and computer readable storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
CN116383517A (en) Dynamic propagation feature enhanced multi-modal rumor detection method and system
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
CN113919338A (en) Method and device for processing text data
Agarwal et al. Sentiment Analysis Dashboard for Socia Media comments using BERT
Namitha et al. Prediction of Movie Categories Using Randomized Sequences with Machine Learning
Cui et al. Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination