CN110866184B - Short video data label recommendation method and device, computer equipment and storage medium - Google Patents

Short video data label recommendation method and device, computer equipment and storage medium

Info

Publication number
CN110866184B
CN110866184B (application CN201911093019.4A)
Authority
CN
China
Prior art keywords
data
feature
emotion
audio
content
Prior art date
Legal status
Active
Application number
CN201911093019.4A
Other languages
Chinese (zh)
Other versions
CN110866184A
Inventor
王小婵
杨超
蒋斌
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority claimed from CN201911093019.4A
Publication of CN110866184A
Application granted
Publication of CN110866184B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a short video data tag recommendation method and apparatus, a computer device and a storage medium. Multi-modal short video data is acquired, and the image data, audio data and text data it contains are extracted; an emotion feature matrix and a content feature matrix are extracted for each modality; the multi-modal fusion emotion feature vector corresponding to the emotion feature matrixes is obtained through a preset emotion common space, and the multi-modal fusion content feature vector corresponding to the content feature matrixes is obtained through a preset content common space; matching scores between preset tag semantics and the multi-modal fusion emotion feature vector and multi-modal fusion content feature vector are computed; and tags are recommended according to the matching scores. By fusing the emotion features and content features of the different modalities of the multi-modal short video data, matching the fused features against candidate tags and recommending tags according to the matching results, tags can be recommended effectively for short videos.

Description

Short video data label recommendation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for recommending short video data tags, a computer device, and a storage medium.
Background
The current era is the era of the internet: the number of internet users worldwide is reported to have reached 4 billion. Thanks to the popularization of mobile devices and the lower threshold for producing short videos, short videos are increasingly favored as a new way of recording and sharing life, and short video platforms and applications such as Vine, Snapchat, Douyin (TikTok) and Kuaishou have grown at an unprecedented pace in recent years. Compared with text and pictures, the information carried by a short video is more intuitive and vivid, so watching short videos has gradually become the first choice of leisure and entertainment for more and more people.
However, faced with the huge number of newly produced short videos, quickly and accurately finding the desired content is often very difficult. One feasible strategy to solve this problem is to add hashtags to short videos, so that users can quickly find the content they want on the platform through keywords. Existing automatic tag recommendation methods, however, are mainly designed for plain text or for text combined with images, and are not suitable for tag recommendation in the short video field.
Disclosure of Invention
In view of the above, it is necessary to provide a short video data tag recommendation method, apparatus, computer device and storage medium that can be applied to the short video field, so as to solve the problem that existing tag recommendation methods are not suitable for short video tag recommendation.
A method of short video data tag recommendation, the method comprising:
acquiring multi-modal short video data, and extracting image data, audio data and text data in the multi-modal short video data;
extracting emotion feature matrixes of the image data, the audio data and the text data respectively, and extracting content feature matrixes of the image data, the audio data and the text data respectively;
obtaining multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space, and obtaining multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space;
acquiring matching scores of preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
and recommending a label for the multi-modal short video data according to the matching score.
An apparatus for short video data tag recommendation, the apparatus comprising:
the modal data extraction module is used for acquiring multi-modal short video data and extracting image data, audio data and text data in the multi-modal short video data;
the feature extraction module is used for respectively extracting emotional feature matrixes of the image data, the audio data and the text data and respectively extracting content feature matrixes of the image data, the audio data and the text data;
the feature fusion module is used for acquiring multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space and acquiring multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space;
the feature matching module is used for acquiring matching scores of preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
and the label recommending module is used for recommending labels for the multi-modal short video data according to the matching scores.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring multi-modal short video data, and extracting image data, audio data and text data in the multi-modal short video data;
respectively extracting emotion feature matrixes of the image data, the audio data and the text data, and respectively extracting content feature matrixes of the image data, the audio data and the text data;
obtaining a multi-mode fusion emotion feature vector corresponding to each emotion feature matrix through a preset emotion common space, and obtaining a multi-mode fusion content feature vector corresponding to each content feature matrix through a preset content common space;
acquiring matching scores of preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
and recommending a label for the multi-modal short video data according to the matching score.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring multi-modal short video data, and extracting image data, audio data and text data in the multi-modal short video data;
respectively extracting emotion feature matrixes of the image data, the audio data and the text data, and respectively extracting content feature matrixes of the image data, the audio data and the text data;
obtaining a multi-mode fusion emotion feature vector corresponding to each emotion feature matrix through a preset emotion common space, and obtaining a multi-mode fusion content feature vector corresponding to each content feature matrix through a preset content common space;
acquiring matching scores of preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
and recommending a label for the multi-modal short video data according to the matching score.
According to the short video data tag recommendation method, the short video data tag recommendation device, the computer equipment and the storage medium, image data, audio data and text data in multi-mode short video data are extracted by acquiring the multi-mode short video data; respectively extracting emotional characteristic matrixes of the image data, the audio data and the text data, and respectively extracting content characteristic matrixes of the image data, the audio data and the text data; obtaining multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space, and obtaining multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space; acquiring matching scores of preset tag semantics, a multi-mode fusion emotion feature vector and a multi-mode fusion content feature vector; and recommending labels for the multi-modal short video data according to the matching scores. According to the method and the device, the emotional characteristics and the content characteristics of the multi-mode short video data in different modes are fused, then the matching result of the fused characteristics and the label is obtained, the corresponding label is recommended for the multi-mode short video data according to the matching result, and the corresponding label can be effectively recommended for the short video.
Drawings
FIG. 1 is a diagram of an exemplary application environment for a method for short video tag recommendation in one embodiment;
FIG. 2 is a functional diagram of a method for short video data tag recommendation in one embodiment;
FIG. 3 is a flowchart illustrating a method for short video data tag recommendation in one embodiment;
FIG. 4 is a schematic sub-flow chart illustrating step S100 of FIG. 3 according to an embodiment;
FIG. 5 is a schematic sub-flow chart illustrating step S500 in FIG. 3 according to an embodiment;
FIG. 6 is a schematic sub-flow chart illustrating step S700 in FIG. 3 according to an embodiment;
FIG. 7 is a diagram illustrating an overall flowchart of a method for short video data tag recommendation in one embodiment;
FIG. 8 is a block diagram of an apparatus for short video data tag recommendation in one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to illustrate the present application and are not intended to limit it.
The tag recommendation method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the tag recommendation server 104 through a network. The terminal 102 submits multi-modal short video data to the tag recommendation server 104. The tag recommendation server 104 acquires the multi-modal short video data and extracts the image data, audio data and text data it contains; extracts an emotion feature matrix and a content feature matrix for the image data, the audio data and the text data respectively; obtains the multi-modal fusion emotion feature vector corresponding to the emotion feature matrixes through a preset emotion common space, and the multi-modal fusion content feature vector corresponding to the content feature matrixes through a preset content common space; computes the matching scores between preset tag semantics and the multi-modal fusion emotion feature vector and multi-modal fusion content feature vector; recommends tags for the multi-modal short video data according to the matching scores; and feeds the tags back to the terminal 102. A schematic diagram of the short video data tag recommendation method is shown in fig. 2: for two short videos whose content is almost identical, some of the tags expressing emotion may have opposite meanings. With the scheme of the application, both content tags and emotion tags can be recommended from the modality data of a multi-modal short video, achieving a more accurate recommendation effect. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in fig. 3, a method for recommending a short video data tag is provided, which is described by taking the method as an example applied to the tag recommendation server side in fig. 1, and includes the following steps:
and S100, acquiring the multi-mode short video data, and extracting image data, audio data and text data in the multi-mode short video data.
The multi-modal short video data specifically refers to short video data that contains images, audio and text simultaneously. The tag recommendation server may first clean the multi-modal short video data according to its attributes. For example, for short video data that lacks any one of the image, audio or text modalities, the server may directly feed back a recommendation failure message to the terminal; for short video data whose duration is below a preset duration threshold, the features are too few and the deviation of tag recommendation would be large, so the server may likewise feed back a recommendation failure message without performing tag recommendation. The server may first receive the multi-modal short video data submitted by the terminal, and then separate the image data, audio data and text data from it.
As shown in fig. 4, in one embodiment, step S100 includes:
and S110, acquiring multi-mode short video data.
S130, separating the image modality data and the audio modality data of the multi-modality short video data.
And S150, extracting key frame data in the image modality data, and using the key frame data as image data.
S170, the audio mode data is divided into audio segments, and the audio segments are used as audio data.
And S190, taking text modal data corresponding to the multi-mode short video data as text data.
The text data corresponding to the multi-modal short video data is generally the text attached to the short video, and is not fused with the image data and the audio data. The modality extraction process may first separate the image modality data and the audio modality data from the fused short video data; in one embodiment, the server may separate the image modality and the audio modality of the short video using the FFmpeg tool. Key frame data is then extracted from the image modality data and used as the image data corresponding to the multi-modal short video data; in one embodiment, the server may extract pictures from the separated image modality stream as key frame data at a preset time interval, for example, for a 6 s short video, 12 pictures may be extracted from the image modality stream at an interval of 0.5 s. Similarly, the server may divide the audio modality data into audio segments and treat the audio segments as the audio data. The text modality data corresponding to the multi-modal short video data, namely the caption text actively added by the user when publishing the short video, can be used directly as the text data. In this way the data of each modality in the multi-modal data can be accurately extracted with the appropriate method.
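A minimal sketch of this extraction step, assuming FFmpeg is installed and using hypothetical file paths; the 0.5 s frame interval and fixed-length audio segments follow the description above:

```python
import subprocess
from pathlib import Path

def split_modalities(video_path: str, out_dir: str,
                     frame_interval: float = 0.5, segment_seconds: float = 1.0) -> None:
    """Separate a short video into key frames and audio segments with FFmpeg."""
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    (out / "audio").mkdir(parents=True, exist_ok=True)

    # Extract one key frame every `frame_interval` seconds (fps = 1 / interval).
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={1.0 / frame_interval}",
        str(out / "frames" / "frame_%03d.jpg"),
    ], check=True)

    # Strip the video stream and cut the audio track into fixed-length segments.
    subprocess.run([
        "ffmpeg", "-i", video_path, "-vn",
        "-f", "segment", "-segment_time", str(segment_seconds),
        str(out / "audio" / "segment_%03d.wav"),
    ], check=True)
```

The caption text attached to the video needs no decoding step and is used as the text data directly.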
S300, extracting the emotion characteristic matrixes of the image data, the audio data and the text data respectively, and extracting the content characteristic matrixes of the image data, the audio data and the text data respectively.
Specifically, the server may use a vector to represent the emotion feature and the content feature of a key frame; the vectors corresponding to all key frames in the image data then constitute the emotion feature matrix and the content feature matrix corresponding to the image data. Similarly, the emotion feature vector and content feature vector of each audio segment in the audio data form the emotion feature matrix and content feature matrix corresponding to the audio data. The content feature vector of each word in the text data forms the content feature matrix corresponding to the text data, and an emotion feature matrix corresponding to the text data is also obtained.
In one embodiment, step S300 specifically includes:
and extracting content feature vectors corresponding to all key frames in the image data through a preset ResNet-152 feature extractor, and constructing a content feature matrix corresponding to the image data according to the content feature vectors corresponding to all key frames.
For extracting the content features of the image data, a ResNet-152 network pre-trained on the ImageNet data set can be used as the feature extractor, yielding a 2048-dimensional content feature vector for each key frame; the content feature matrix corresponding to the image data is then constructed from the content feature vectors of all key frames. For example, for short video data containing 12 key frames, the content feature matrix corresponding to the image data is a 2048 × 12 matrix.
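A sketch of this image content feature extraction, assuming PyTorch/torchvision and that the 2048-dimensional globally pooled output of an ImageNet-pretrained ResNet-152 is used as the key-frame descriptor:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained ResNet-152 with the classification head removed,
# leaving the 2048-d global-average-pooled feature per image.
resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_content_matrix(key_frame_paths):
    """Return a 2048 x N matrix, one column per key frame."""
    feats = [resnet(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
             for p in key_frame_paths]
    return torch.stack(feats, dim=1)   # e.g. 2048 x 12 for a 6 s video
```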
And extracting the emotion feature vector corresponding to each key frame in the image data through a preset CNN feature extractor, and constructing an emotion feature matrix corresponding to the image data according to the emotion feature vector corresponding to each key frame.
For extracting the emotion features of the image data, a CNN pre-trained on the SentiBank data set is used as the feature extractor. For each key frame, the raw output is the probability of each of 2089 adjective-noun pairs (for example, "cute girl", "funny animal"). Considering that adjective-noun pairs sharing the same adjective express almost the same emotion, the probabilities of pairs with the same adjective are combined to reduce the feature dimension, so that each key frame finally yields a 231-dimensional emotion feature vector (the initial dimension is 2089, one probability per adjective-noun pair; after combining the probabilities of pairs with the same adjective, a 231-dimensional vector remains). The emotion feature matrix corresponding to the image data is then constructed from the emotion feature vectors of all key frames. For example, for a short video containing 12 key frames, the emotion feature matrix corresponding to the image data is a 231 × 12 matrix.
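A sketch of the dimensionality reduction described above, assuming `anp_probs` is the 2089-dimensional adjective-noun-pair probability vector produced by the SentiBank-style CNN and `anp_names` lists the pair names in the same order (both hypothetical inputs); combining the probabilities by summation is one possible reading of "combined":

```python
import numpy as np
from collections import OrderedDict

def merge_by_adjective(anp_probs: np.ndarray, anp_names: list[str]) -> np.ndarray:
    """Collapse adjective-noun-pair probabilities (e.g. 'cute girl', 'funny animal')
    into one score per adjective, giving the per-key-frame emotion feature."""
    merged: "OrderedDict[str, float]" = OrderedDict()
    for prob, name in zip(anp_probs, anp_names):
        adjective = name.split()[0]          # 'cute girl' -> 'cute'
        merged[adjective] = merged.get(adjective, 0.0) + float(prob)
    return np.array(list(merged.values()))   # 231 adjectives -> 231-d vector

# Stacking the per-frame vectors column-wise yields the 231 x 12 emotion matrix.
```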
Extracting content feature vectors corresponding to all audio segments in the audio data through a preset SoundNet CNN feature extractor, and constructing a content feature matrix corresponding to the audio data according to the content feature vectors corresponding to all the audio segments.
For extracting the content features of the audio data, a SoundNet CNN can be used as the feature extractor, yielding a 1024-dimensional content feature vector for each audio segment; the content feature matrix corresponding to the audio data is then constructed from the content feature vectors of all audio segments. For example, for a short video containing 6 audio segments, the content feature matrix corresponding to the audio data is a 1024 × 6 matrix.
Extracting each basic acoustic feature corresponding to each audio segment in the audio data through a preset library of Librosa tools, obtaining an emotional feature vector corresponding to each audio segment in the audio data according to the basic acoustic features, and constructing an emotional feature matrix corresponding to the audio data according to the emotional feature vector corresponding to each audio segment.
For extracting the emotion features of the audio data, basic acoustic features such as the zero-crossing rate, mel-frequency coefficients and amplitude of each audio segment can be extracted with the Python library Librosa, giving a 512-dimensional feature vector as the emotion feature of the audio segment; that is, the vector built from the basic acoustic features of an audio segment is used as its emotion feature vector. The emotion feature matrix corresponding to the audio data is then constructed from the emotion feature vectors of all audio segments. For example, for a short video containing 6 audio segments, the emotion feature matrix corresponding to the audio data is a 512 × 6 matrix.
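A sketch of the audio emotion feature extraction with Librosa; the exact feature set and the way frame-level statistics are concatenated to reach 512 dimensions are assumptions, since only the zero-crossing rate, mel-frequency coefficients and amplitude are named above:

```python
import librosa
import numpy as np

def audio_emotion_vector(segment_path: str, sr: int = 22050) -> np.ndarray:
    """Basic acoustic descriptors of one audio segment, flattened into one vector."""
    y, sr = librosa.load(segment_path, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)          # zero-crossing rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # mel-frequency coefficients
    rms = librosa.feature.rms(y=y)                       # amplitude / energy
    stats = [np.hstack([f.mean(axis=1), f.std(axis=1)]) for f in (zcr, mfcc, rms)]
    return np.hstack(stats)

# Stacking the per-segment vectors column-wise gives the emotion matrix of the audio data.
```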
Acquiring a glove word vector corresponding to each word in the text data, taking the glove word vector corresponding to each word as a content feature vector corresponding to each word, and constructing a content feature matrix corresponding to the text data according to the content feature vector corresponding to each word.
For extracting the content features of the text data, the glove word vector of each word in the text data is obtained, giving a 300-dimensional feature vector per word. The text of each short video also needs to be aligned to a fixed length: statistics show that the text of most short videos contains roughly 10 words, so texts longer than 10 words are truncated and texts shorter than 10 words are zero-padded. The glove word vector of each word is used as its content feature vector, and the content feature matrix corresponding to the text data is constructed from these content feature vectors. For one short video, the content feature matrix corresponding to the text data is therefore a 300 × 10 matrix.
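A sketch of the text content matrix construction; the glove file path and the simple whitespace tokenizer are assumptions:

```python
import numpy as np

def load_glove(path: str) -> dict:
    """Load 300-d glove vectors from a plain-text 'word v1 v2 ...' file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def text_content_matrix(caption: str, glove: dict, max_words: int = 10, dim: int = 300):
    """Truncate to 10 words / zero-pad shorter captions -> 300 x 10 matrix."""
    words = caption.lower().split()[:max_words]
    cols = [glove.get(w, np.zeros(dim, dtype=np.float32)) for w in words]
    cols += [np.zeros(dim, dtype=np.float32)] * (max_words - len(cols))
    return np.stack(cols, axis=1)
```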
And extracting an emotional characteristic matrix corresponding to the text data through a preset CoreNLP tool.
For extracting the emotion features of the text data, the emotion features of the text can be extracted with the CoreNLP tool. For the text of each short video this yields a 5-dimensional vector whose components are the probabilities of the text being very negative, negative, neutral, positive and very positive, and this vector is used as the emotion feature matrix corresponding to the text data. For one short video, the emotion feature matrix corresponding to the text data is therefore a 5 × 1 matrix. In this way the server can effectively extract the emotion features and content features of each separated modality with the appropriate tools.
And S500, obtaining the multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space, and obtaining the multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space.
The multi-modal fusion emotion feature vector is the feature vector obtained by fusing the emotion feature vectors of all modalities, and the multi-modal fusion content feature vector is the feature vector obtained by fusing the content feature vectors of all modalities. After the emotion features and content features of the three short-video modalities are extracted, the differences between the representations of the three modalities can be minimized in a common space, so that the features of different modalities fuse and complement each other's information before tag recommendation is performed.
As shown in fig. 5, in one embodiment, step S500 includes:
s510, obtaining emotion feature vectors corresponding to emotion feature matrixes of the image data, the audio data and the text data.
The emotion feature vector corresponding to the emotion feature matrix of the image data is obtained according to the average value of the emotion feature vectors corresponding to the key frames in the image data, and the emotion feature vector corresponding to the emotion feature matrix of the audio data is obtained according to the average value of the emotion feature vectors corresponding to the audio segments in the audio data.
For each piece of multi-modal short video data, the emotion features of the image data and the audio data are extracted from each key frame or audio segment, so these emotion features form a time series with one emotion feature vector per key frame or audio segment. The emotion feature time series of the image data and the audio data are first averaged to obtain the emotion feature vectors corresponding to their emotion feature matrixes, so that they can be learned by a multilayer perceptron: the emotion feature vector of the image data is the average of the emotion feature vectors of all key frames, and the emotion feature vector of the audio data is the average of the emotion feature vectors of all audio segments. In other words, for an emotion feature matrix whose columns are the per-frame (or per-segment) emotion feature vectors, the corresponding emotion feature vector is the column-wise average of that matrix. The emotion feature vectors obtained in this step are denoted e_i^v for the image data, e_i^a for the audio data and e_i^t for the text data of the i-th short video.
S530, the emotion feature vectors corresponding to the image data, the audio data and the text data are mapped to an emotion common space through the multilayer perceptron, the emotion feature vectors mapped to the emotion common space are aligned and adjusted through an alignment loss function, and the multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes of the image data, the audio data and the text data are obtained.
After the emotion feature vectors of the modality data are obtained, they are mapped into the preset emotion common space by their respective multilayer perceptrons (the three multilayer perceptrons operate in parallel). The emotion feature vectors of the image, audio and text modality data after passing through the multilayer perceptrons are expressed as

x_i^{v,s} = f_v(e_i^v),  x_i^{a,s} = f_a(e_i^a),  x_i^{t,s} = f_t(e_i^t)

where f_v(·), f_a(·) and f_t(·) are the mapping functions, learned by the multilayer perceptrons, that map the emotion feature vector of each modality from its original feature space into the emotion common space. In the emotion common space, an alignment loss function L_alignment then minimizes the feature differences between the three modalities of the short video, so that information from different modalities can be transferred to each other and adjusted jointly; the same alignment loss function is also used in the complementary learning of the modality data in the content common space described below. In the alignment loss, x_i^{m,k} denotes the feature of the m-th modality of the i-th short video in the k common space, k ∈ {s, c}, where s (sentiment) denotes the emotion common space and c (content) denotes the content common space. After the adjustment by the alignment loss, the emotion feature vectors of the image, audio and text modalities are concatenated to obtain the multi-modal fusion emotion feature vector, written as

x_i^s = x_i^{v,s} ⊕ x_i^{a,s} ⊕ x_i^{t,s}

where ⊕ denotes the vector concatenation operation.
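A minimal PyTorch sketch of the emotion common space, assuming single-hidden-layer perceptrons and a pairwise mean-squared-error form of the alignment loss (the description only states that the loss minimizes the differences between the three modality representations, so the exact form here is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionCommonSpace(nn.Module):
    """Map the image/audio/text emotion vectors into one common space and fuse them."""
    def __init__(self, dims=(231, 512, 5), common_dim=128):
        super().__init__()
        # three parallel perceptrons playing the role of f_v, f_a, f_t
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(d, common_dim), nn.ReLU(),
                          nn.Linear(common_dim, common_dim))
            for d in dims)

    def forward(self, e_v, e_a, e_t):
        x = [mlp(e) for mlp, e in zip(self.mlps, (e_v, e_a, e_t))]
        # pairwise alignment loss between the common-space representations (assumed MSE form)
        align = F.mse_loss(x[0], x[1]) + F.mse_loss(x[0], x[2]) + F.mse_loss(x[1], x[2])
        fused = torch.cat(x, dim=-1)   # x_i^s: multi-modal fusion emotion feature vector
        return fused, align
```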
And S550, inputting each content feature vector in the content feature matrix corresponding to the image data, the audio data and the text data into a preset bidirectional LSTM neural network, and acquiring a forward hidden state vector and a backward hidden state vector corresponding to each content feature vector.
The content features of each modality of a short video can be understood as a time series. Let u_{i,t}^m denote the content feature vector of the m-th modality of the i-th short video at time step t. The time-series content features of each modality are fed into parallel, pre-defined bidirectional LSTM neural networks to capture the temporal information of each modality. Each bidirectional LSTM takes the sequence (u_{i,1}^m, u_{i,2}^m, …) as input and outputs, at each time step, a forward hidden state vector h_{i,t}^{m,f} and a backward hidden state vector h_{i,t}^{m,b}; the hidden-layer feature vector of each modality after its bidirectional LSTM is the concatenation of the two:

h_{i,t}^m = h_{i,t}^{m,f} ⊕ h_{i,t}^{m,b}

The learning process of the whole bidirectional LSTM neural network can be expressed as

h_{i,t}^{m,f} = LSTM_f(u_{i,t}^m, h_{i,t−1}^{m,f})
h_{i,t}^{m,b} = LSTM_b(u_{i,t}^m, h_{i,t+1}^{m,b})
h_{i,t}^m = h_{i,t}^{m,f} ⊕ h_{i,t}^{m,b},  m ∈ {v, a, t}

where LSTM_f denotes the forward learning process of the bidirectional LSTM and LSTM_b its backward learning process; the two processes are carried out simultaneously in the bidirectional LSTM neural network.
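A sketch of one of the parallel bidirectional LSTMs (the hidden size is an assumption); PyTorch's `nn.LSTM` already returns the forward and backward hidden states concatenated at each time step:

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Bidirectional LSTM over the time-ordered content features of one modality."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, seq):        # seq: (batch, T, feat_dim), e.g. T = 12 key frames
        h, _ = self.lstm(seq)      # h[:, t] = [forward state ; backward state]
        return h                   # (batch, T, 2 * hidden)
```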
And S570, acquiring content feature weights corresponding to the content feature vectors according to the forward hidden state vectors and the backward hidden state vectors corresponding to the content feature vectors through a self-attention mechanism.
The hidden-layer time-series features h_{i,t}^m obtained for each modality are then passed through two linear transformations to obtain the weight distribution of the features at each time step. The purpose of this step is to assign, through a self-attention mechanism, different weights to the features produced by the hidden layer at different time steps: the more important a feature is, the larger the weight it receives, so that the more important information in each modality is captured while interference and redundant information are filtered out. Specifically, for the m-th modality of the i-th video, the weight of the content feature vector at time step t, denoted α_{i,t}^m, is computed as

α_{i,t}^m = softmax_t( w_2 · ReLU(W_1 h_{i,t}^m + b_1) + b_2 )

where ReLU and softmax are activation functions and the softmax is taken over the time steps t.
S590, according to the content feature matrixes corresponding to the image data, the audio data and the text data and the content feature weights corresponding to the content feature vectors, obtaining the content feature vectors corresponding to the image data, the audio data and the text data, aligning and adjusting the content feature vectors corresponding to the image data, the audio data and the text data through an alignment loss function, and obtaining the multi-mode fusion content feature vectors corresponding to the content feature matrixes of the image data, the audio data and the text data.
After the weights are obtained, the content feature vector corresponding to each content feature matrix can be computed as the weighted sum

x_i^{m,c} = Σ_t α_{i,t}^m · h_{i,t}^m

where x_i^{m,c} denotes the content feature vector of the m-th modality after it is mapped into the content common space, α_{i,t}^m the weight of the content feature vector at time step t, and h_{i,t}^m the hidden-layer content feature at time step t. For example, if a content feature matrix consists of the content feature vectors of three time steps and their weights are 1/2, 1/3 and 1/6, the content feature vector corresponding to that matrix is the sum of the three vectors weighted by 1/2, 1/3 and 1/6 respectively. The content feature vectors obtained after mapping into the common space are then adjusted through the alignment loss function, so that the feature differences between the three modalities of the short video are as small as possible in the content common space and information from different modalities can be transferred to each other and adjusted jointly. After the adjustment by the alignment loss, the content feature vectors of the image, audio and text modalities are directly concatenated to obtain the multi-modal fusion content feature vector, written as

x_i^c = x_i^{v,c} ⊕ x_i^{a,c} ⊕ x_i^{t,c}

The accuracy of tag recommendation can be effectively improved through this fusion of the multi-modal feature vectors.
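A sketch of the self-attention weighting and weighted pooling of steps S570 and S590, continuing the `ContentEncoder` above; the two-linear-layer attention scorer follows the description, but its layer sizes are assumptions, and the content-space alignment loss reuses the pairwise MSE form assumed for the emotion common space:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Two linear transforms + ReLU + softmax give per-time-step weights,
    then the hidden states are summed with those weights (x_i^{m,c})."""
    def __init__(self, hidden2: int = 256, attn: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(hidden2, attn), nn.ReLU(), nn.Linear(attn, 1))

    def forward(self, h):                        # h: (batch, T, hidden2)
        alpha = F.softmax(self.score(h), dim=1)  # weights over the T time steps
        return (alpha * h).sum(dim=1)            # (batch, hidden2)

def fuse_content(x_v, x_a, x_t):
    """Align (assumed pairwise MSE) and concatenate the three modality vectors."""
    align = F.mse_loss(x_v, x_a) + F.mse_loss(x_v, x_t) + F.mse_loss(x_a, x_t)
    return torch.cat([x_v, x_a, x_t], dim=-1), align   # x_i^c
```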
And S700, acquiring matching scores of the preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector.
The matching score reflects the degree of match between a tag's semantics and the multi-modal short video data. Once the multi-modal fusion emotion feature vector and the multi-modal fusion content feature vector of the three modalities have been obtained, they can be matched against each preset tag semantics to obtain the matching score between each tag semantics and the multi-modal short video data.
As shown in fig. 6, in one embodiment, step S700 includes:
and S720, splicing the label semantic feature vectors, the multi-mode fusion emotion feature vectors and the multi-mode fusion content feature vectors corresponding to the preset label semantics.
To obtain the matching score, the tag semantic feature vector corresponding to the tag semantics is first concatenated with the multi-modal fusion emotion feature vector and the multi-modal fusion content feature vector obtained above. Specifically, the fused feature of the i-th short video and the j-th tag is expressed as

x_{i,j} = x_i^s ⊕ x_i^c ⊕ h_j

where x_i^s denotes the multi-modal fusion emotion feature vector, x_i^c the multi-modal fusion content feature vector, and h_j the tag semantic feature vector of the j-th tag.
And S740, performing interaction of the multi-mode fusion emotion characteristics, the multi-mode fusion content characteristics and the label semantic characteristics through the multilayer perceptron, and acquiring interaction feature vectors.
The interaction between the features of the short video and the tag semantic features can then be carried out through a multilayer perceptron: after multiple nonlinear transformations, a feature vector o_l describing the interaction between the short video and the tag semantics is obtained. The process of the multiple nonlinear transformations is as follows:

o_1 = ReLU(W_1 x_{i,j} + b_1)
o_2 = ReLU(W_2 o_1 + b_2)
...
o_l = ReLU(W_l o_{l−1} + b_l)

where ReLU is the activation function.
And S760, acquiring a matching score of the multi-mode short video data and the label according to the interactive feature vector.
The matching score between the multi-modal short video data and the tag can then be obtained directly from the interaction feature vector: o_l is passed through a final linear transformation followed by the sigmoid activation function, giving the matching score ŷ_{i,j}, which is a matching probability in [0, 1]. The matching score makes the result of tag recommendation more intuitive.
In one embodiment, step S700 is preceded by: acquiring a training data set, and acquiring labels corresponding to the multi-modal short video data in the training data set; automatically segmenting words of labels in a phrase form which are not separated by spaces; acquiring a glove word vector of each word in the label obtained by word segmentation; and acquiring a word vector average value of each glove word vector, and taking the word vector average value as a preset label semantic corresponding to the label.
The training data set specifically refers to multi-modal short video data used to train the models involved in the tag recommendation method; corresponding tags are assigned to these short video data in advance. When the tag semantics are extracted, tags that take the form of a phrase without space separation are automatically segmented into words. The glove word vector of each word obtained by the segmentation is then retrieved, and the word vectors are averaged to give a 300-dimensional vector that serves as the word vector of the whole tag, i.e. the preset tag semantics of the present application. Specifically, when automatically segmenting tags written in English, words are matched from front to back, the most common words being matched preferentially; once a match succeeds, the matched word is cut off and matching continues on the remainder. In addition, when the application is used to process and recommend tags in other languages, the semantic information of the tags can be obtained directly with the corresponding natural-language-processing method, and tags in the corresponding language are then recommended through the final concatenation.
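A sketch of this tag-semantics step: a space-free hashtag such as "funnyanimals" is segmented by greedy front-to-back matching against a word list (assumed to be ordered so that more common words are tried first), and the glove vectors of the resulting words are averaged; the vocabulary source is an assumption:

```python
import numpy as np

def segment_hashtag(tag: str, vocabulary: list[str]) -> list[str]:
    """Greedy front-to-back segmentation, preferring more common (earlier) words."""
    words, rest = [], tag.lower()
    while rest:
        match = next((w for w in vocabulary if rest.startswith(w)), rest[0])
        words.append(match)
        rest = rest[len(match):]
    return words                                  # 'funnyanimals' -> ['funny', 'animals']

def tag_semantic_vector(tag: str, vocabulary: list[str], glove: dict, dim: int = 300):
    """Average the 300-d glove vectors of the segmented words (the tag semantics h_j)."""
    vecs = [glove[w] for w in segment_hashtag(tag, vocabulary) if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```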
And S900, recommending labels for the multi-modal short video data according to the matching scores. Specifically, S900 includes recommending a tag corresponding to the tag semantics for the multimodal short video data when the matching score is greater than or equal to the preset score threshold.
After the matching score is obtained, it can be compared with a preset score threshold; the larger the matching score, the higher the matching probability. Whether the current tag matches the current multi-modal short video can be judged from this comparison, and the tag corresponding to the tag semantics can then be recommended for the multi-modal short video data. The recommended tags generally include both content-class tags and emotion-class tags. The preset score threshold is typically 0.5. The overall flow of the short video data tag recommendation method of the present application is shown in fig. 7.
In particular, the process of multi-modal short video data recommendation of the present application may be implemented by a neural network model, where the neural network model includes various models used in the above process, and the models involved in the training process are mainly neural network models involved in the emotion common space learning process and the content common space learning process, that is, models used in the processes of step S500 to step S900 of the present application. The data set used for training the model can be obtained firstly, and then the whole data set is divided into three parts, namely a training set, a verification set and a test set according to the proportion of 80%, 10% and 10%. The training set is used for training the whole neural network model and is used for adjusting the hyper-parameters, and then the verification set is used for verifying the performance of the trained neural network model under the current training set and the corresponding hyper-parameters. The test set is used only to test the effect of the final model to determine the actual prediction capability and optimal parameters of the model. The cross entropy loss function used in the neural network model training process is specifically as follows:
L_prediction = − Σ_{(x_i, y_j) ∈ S} [ s_{i,j} · log ŷ_{i,j} + (1 − s_{i,j}) · log(1 − ŷ_{i,j}) ]

where ŷ_{i,j} is the prediction for the short video-tag pair (x_i, y_j) composed of the i-th short video and the j-th tag, s_{i,j} is the target score of (x_i, y_j), and S denotes the set of all positive and negative short video-tag pairs used for training. For a correctly paired short video-tag pair the target score s_{i,j} is 1, otherwise it is 0.
Iterative training is then performed with the Adam optimization algorithm; during training, the multi-modal feature fusion steps and the matching-score steps, i.e. steps S500 to S900 of the application, are repeated and the network parameters are updated by back-propagation until the overall network loss function L converges, where

L = L_alignment + L_prediction

Under the currently set hyper-parameters, the network is trained on the training set and its performance is evaluated on the validation set, after which the hyper-parameters are adjusted. This process is repeated until the model's error on the validation set is minimal; the hyper-parameters at that point can be regarded as the optimal hyper-parameters. Finally, the test set is used to evaluate the generalization ability of the final model, i.e. the recommendation effect of the model on unseen data.
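A sketch of this training procedure, with the cross-entropy (prediction) loss implemented as binary cross-entropy over short-video/tag pairs and added to the alignment losses returned by the fusion modules; `model` is a hypothetical module wrapping the components sketched above, and the batch layout is an assumption:

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs: int = 30, lr: float = 1e-3):
    """Adam optimisation of L = L_alignment + L_prediction over positive/negative pairs."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:                      # modality features, tag vector, target s_ij
            score, align_loss = model(batch)      # repeats steps S500-S900, returns y_hat and L_alignment
            pred_loss = F.binary_cross_entropy(score, batch["target"])   # L_prediction
            loss = align_loss + pred_loss         # L = L_alignment + L_prediction
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```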
According to the short video data tag recommendation method, multi-mode short video data are obtained, and image data, audio data and text data in the multi-mode short video data are extracted; respectively extracting emotional characteristic matrixes of the image data, the audio data and the text data, and respectively extracting content characteristic matrixes of the image data, the audio data and the text data; obtaining multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space, and obtaining multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space; acquiring matching scores of preset tag semantics, a multi-mode fusion emotion feature vector and a multi-mode fusion content feature vector; and recommending labels for the multi-modal short video data according to the matching scores. According to the method and the device, the emotional characteristics and the content characteristics of the multi-mode short video data in different modes are fused, then the matching result of the fused characteristics and the label is obtained, the corresponding label is recommended for the multi-mode short video data according to the matching result, and the corresponding label can be effectively recommended for the short video.
It should be understood that although the steps in the flowcharts of fig. 3-6 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in fig. 3-6 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a short video data tag recommendation apparatus, including:
the modal data extraction module 100 is configured to obtain multimodal short video data and extract image data, audio data, and text data in the multimodal short video data.
The feature extraction module 300 is configured to extract emotional feature matrices of the image data, the audio data, and the text data, and extract content feature matrices of the image data, the audio data, and the text data, respectively.
The feature fusion module 500 is configured to obtain a multi-modal fusion emotion feature vector corresponding to each emotion feature matrix through a preset emotion common space, and obtain a multi-modal fusion content feature vector corresponding to each content feature matrix through a preset content common space.
The feature matching module 700 is configured to obtain matching scores of the preset tag semantics and the multi-modal fusion emotion feature vector and the multi-modal fusion content feature vector.
And a tag recommending module 900, configured to recommend tags for the multimodal short video data according to the matching scores.
In one embodiment, the system further comprises a tag semantic acquisition module, configured to: acquiring a training data set, and acquiring labels corresponding to the multi-modal short video data in the training data set; automatically segmenting words of labels in a phrase form which are not separated by spaces; acquiring a glove word vector of each word in the label obtained by word segmentation; and acquiring a word vector average value of each glove word vector, and taking the word vector average value as a preset label semantic corresponding to the label.
In one embodiment, the modal data extraction module 100 is specifically configured to: acquiring multi-mode short video data; separating image modal data and audio modal data of the multi-modal short video data; extracting key frame data in the image modal data, and taking the key frame data as image data; dividing the audio modal data into audio segments, and taking the audio segments as audio data; and taking text modal data corresponding to the multi-modal short video data as text data.
In one embodiment, the feature extraction module 300 is specifically configured to: extracting content feature vectors corresponding to all key frames in the image data through a preset ResNet-152 feature extractor, and constructing a content feature matrix corresponding to the image data according to the content feature vectors corresponding to all key frames; extracting emotion feature vectors corresponding to all key frames in the image data through a preset CNN feature extractor, and constructing emotion feature matrixes corresponding to the image data according to the emotion feature vectors corresponding to all key frames; extracting content feature vectors corresponding to all audio segments in the audio data through a preset SoundNet CNN feature extractor, and constructing a content feature matrix corresponding to the audio data according to the content feature vectors corresponding to all the audio segments; extracting each basic acoustic feature corresponding to each audio segment in the audio data through a preset library of Librosa tools, obtaining an emotional feature vector corresponding to each audio segment in the audio data according to the basic acoustic features, and constructing an emotional feature matrix corresponding to the audio data according to the emotional feature vector corresponding to each audio segment; acquiring glove word vectors corresponding to words in text data, taking the glove word vectors corresponding to the words as content characteristic vectors corresponding to the words, and constructing a content characteristic matrix corresponding to the text data according to the content characteristic vectors corresponding to the words; and extracting an emotion characteristic matrix corresponding to the text data by a preset CoreNLP tool.
In one embodiment, the feature fusion module 500 is specifically configured to: acquiring emotion feature vectors corresponding to emotion feature matrixes of image data, audio data and text data, acquiring the emotion feature vectors corresponding to the emotion feature matrixes of the image data according to the average value of the emotion feature vectors corresponding to all key frames in the image data, and acquiring the emotion feature vectors corresponding to the emotion feature matrixes of the audio data according to the average value of the emotion feature vectors corresponding to all audio segments in the audio data; respectively mapping emotion feature vectors corresponding to the image data, the audio data and the text data to an emotion common space through a multilayer perceptron, aligning and adjusting the emotion feature vectors mapped to the emotion common space through an alignment loss function, and acquiring multi-mode fusion emotion feature vectors corresponding to emotion feature matrixes of the image data, the audio data and the text data; inputting each content feature vector in a content feature matrix corresponding to image data, audio data and text data into a preset bidirectional LSTM neural network to obtain a forward hidden state vector and a backward hidden state vector corresponding to each content feature vector; acquiring content feature weights corresponding to the content feature vectors according to the forward hidden state vectors and the backward hidden state vectors corresponding to the content feature vectors through a self-attention mechanism; according to content feature matrixes corresponding to the image data, the audio data and the text data and content feature weights corresponding to the content feature vectors, content feature vectors corresponding to the image data, the audio data and the text data are obtained, the content feature vectors corresponding to the image data, the audio data and the text data are aligned and adjusted through an alignment loss function, and multi-mode fusion content feature vectors corresponding to the content feature matrixes of the image data, the audio data and the text data are obtained.
In one embodiment, the feature matching module 700 is specifically configured to: splice the tag semantic feature vector corresponding to the preset tag semantics with the multi-modal fusion emotion feature vector and the multi-modal fusion content feature vector; perform the interaction of the multi-modal fusion emotion features, the multi-modal fusion content features and the tag semantic features through the multilayer perceptron to obtain the interaction feature vector; and obtain the matching score between the multi-modal short video data and the tag according to the interaction feature vector.
In one embodiment, the tag recommendation module 900 is specifically configured to: and recommending a label corresponding to the label semanteme for the multi-modal short video data when the matching score is greater than or equal to a preset score threshold value.
For specific limitations of the short video data tag recommendation apparatus, reference may be made to the above limitations of the short video data tag recommendation method, which is not described herein again. All or part of the modules in the short video data tag recommendation device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the nonvolatile storage medium. The network interface of the computer device is used to communicate with an external terminal over a network connection. The computer program, when executed by the processor, implements a short video data tag recommendation method.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures related to the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring multi-modal short video data, and extracting image data, audio data and text data in the multi-modal short video data;
respectively extracting emotion feature matrixes of the image data, the audio data and the text data, and respectively extracting content feature matrixes of the image data, the audio data and the text data;
obtaining multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space, and obtaining multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space;
acquiring matching scores of preset tag semantics, a multi-mode fusion emotion feature vector and a multi-mode fusion content feature vector;
and recommending labels for the multi-modal short video data according to the matching scores.
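For readability, the sketch below strings the five steps above together as one function. It is a minimal outline under assumed interfaces: the callables passed in are hypothetical stand-ins for the modality separation, feature extraction, fusion, and matching components described earlier, and the 0.5 threshold is only an example value.

```python
from typing import Any, Callable, Dict, List

def recommend_for_short_video(
    video_path: str,
    candidate_labels: Dict[str, Any],   # label -> label semantic vector
    separate: Callable,   # step 1: video -> (image_data, audio_data, text_data)
    extract: Callable,    # step 2: modalities -> (emotion_matrices, content_matrices)
    fuse: Callable,       # step 3: matrices -> (fused_emotion_vec, fused_content_vec)
    match: Callable,      # step 4: (label_vec, fused_emotion, fused_content) -> float
    threshold: float = 0.5,
) -> List[str]:
    """Run the five steps listed above and return the recommended labels (step 5)."""
    image_data, audio_data, text_data = separate(video_path)
    emotion_mats, content_mats = extract(image_data, audio_data, text_data)
    fused_emotion, fused_content = fuse(emotion_mats, content_mats)
    return [
        label
        for label, label_vec in candidate_labels.items()
        if match(label_vec, fused_emotion, fused_content) >= threshold
    ]
```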
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps:
acquiring multi-modal short video data, and extracting image data, audio data and text data in the multi-modal short video data;
respectively extracting an emotion feature matrix and a content feature matrix of the image data, the audio data and the text data;
obtaining multi-mode fusion emotion feature vectors corresponding to emotion feature matrixes of image data, audio data and text data through a preset emotion common space, and obtaining multi-mode fusion content feature vectors corresponding to content feature matrixes of the image data, the audio data and the text data through a preset content common space;
acquiring matching scores of preset label semantics, a multi-mode fusion emotion feature vector and a multi-mode fusion content feature vector;
and recommending labels for the multi-modal short video data according to the matching scores.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them that involves no contradiction should be regarded as falling within the scope of this specification.
The above examples express only several embodiments of the present application; although they are described in some detail, they are not to be construed as limiting the scope of the invention. A person skilled in the art may make several variations and modifications without departing from the concept of the present application, and such variations and modifications fall within its scope of protection. Therefore, the scope of protection of this patent application shall be subject to the appended claims.

Claims (10)

1. A short video data tag recommendation method includes:
acquiring multi-modal short video data, and extracting image data, audio data and text data in the multi-modal short video data;
respectively extracting emotion feature matrixes of the image data, the audio data and the text data, and respectively extracting content feature matrixes of the image data, the audio data and the text data;
obtaining a multi-mode fusion emotion feature vector corresponding to each emotion feature matrix through a preset emotion common space, and obtaining a multi-mode fusion content feature vector corresponding to each content feature matrix through a preset content common space;
acquiring matching scores of preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
recommending a label for the multi-modal short video data according to the matching score;
the acquiring the multi-modal short video data and extracting the image data, the audio data and the text data in the multi-modal short video data comprises:
acquiring multi-mode short video data;
separating image modal data and audio modal data of the multi-modal short video data;
extracting key frame data in the image modality data, and taking the key frame data as image data;
dividing the audio modal data into audio segments, and taking the audio segments as audio data;
taking text modal data corresponding to the multi-modal short video data as text data;
the extracting key frame data in the image modality data comprises:
extracting pictures from the separated image modal data as key frame data in a preset time span;
the extracting the emotion feature matrixes of the image data, the audio data and the text data respectively, and the extracting the content feature matrixes of the image data, the audio data and the text data respectively comprises:
extracting content feature vectors corresponding to all key frames in the image data through a preset ResNet-152 feature extractor, and constructing a content feature matrix corresponding to the image data according to the content feature vectors corresponding to all key frames;
extracting emotion feature vectors corresponding to all key frames in the image data through a preset CNN feature extractor, and constructing emotion feature matrixes corresponding to the image data according to the emotion feature vectors corresponding to all key frames;
extracting content feature vectors corresponding to all audio segments in the audio data through a preset SoundNet CNN feature extractor, and constructing a content feature matrix corresponding to the audio data according to the content feature vectors corresponding to all the audio segments;
extracting each basic acoustic feature corresponding to each audio segment in the audio data through a preset Librosa tool library, obtaining an emotion feature vector corresponding to each audio segment in the audio data according to the basic acoustic features, and constructing an emotion feature matrix corresponding to the audio data according to the emotion feature vector corresponding to each audio segment;
acquiring a glove word vector corresponding to each word in the text data, taking the glove word vector corresponding to each word as a content feature vector corresponding to each word, and constructing a content feature matrix corresponding to the text data according to the content feature vector corresponding to each word;
extracting an emotion feature matrix corresponding to the text data through a preset CoreNLP tool;
the obtaining of the multi-mode fusion emotion feature vectors corresponding to the emotion feature matrices through the preset emotion common space and the obtaining of the multi-mode fusion content feature vectors corresponding to the content feature matrices through the preset content common space include:
obtaining emotion feature vectors corresponding to the emotion feature matrixes of the image data according to the average value of the emotion feature vectors corresponding to the key frames in the image data, and obtaining emotion feature vectors corresponding to the emotion feature matrixes of the audio data according to the average value of the emotion feature vectors corresponding to the audio segments in the audio data;
mapping the emotion feature vectors corresponding to the image data, the audio data and the text data to an emotion common space through a multilayer perceptron, aligning and adjusting the emotion feature vectors mapped to the emotion common space through an alignment loss function, and acquiring multi-mode fusion emotion feature vectors corresponding to emotion feature matrixes of the image data, the audio data and the text data;
inputting each content feature vector in a content feature matrix corresponding to the image data, the audio data and the text data into a preset bidirectional LSTM neural network to obtain a forward hidden state vector and a backward hidden state vector corresponding to each content feature vector;
acquiring content feature weights corresponding to the content feature vectors according to the forward hidden state vectors and the backward hidden state vectors corresponding to the content feature vectors through a self-attention mechanism;
acquiring content feature vectors corresponding to the image data, the audio data and the text data according to content feature matrixes corresponding to the image data, the audio data and the text data and content feature weights corresponding to the content feature vectors, aligning and adjusting the content feature vectors corresponding to the image data, the audio data and the text data through an alignment loss function, and acquiring multi-mode fusion content feature vectors corresponding to the content feature matrixes of the image data, the audio data and the text data;
the obtaining of matching scores of the preset tag semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector comprises:
splicing a label semantic feature vector corresponding to preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
performing, through a multilayer perceptron, interaction among the multi-mode fusion emotion feature vector, the multi-mode fusion content feature vector and the label semantic feature vector to obtain an interaction feature vector;
and acquiring the matching score between the multi-modal short video data and the label according to the interaction feature vector.
2. The method of claim 1, wherein before obtaining the matching scores of the pre-set tag semantics and the multi-modal fused emotion feature vector and the multi-modal fused content feature vector, the method further comprises:
acquiring a training data set, and acquiring labels corresponding to multi-modal short video data in the training data set;
automatically segmenting into words the labels that are in phrase form and not separated by spaces;
acquiring a glove word vector of each word in the label obtained by word segmentation;
and acquiring a word vector average value of each glove word vector, and taking the word vector average value as a preset label semantic corresponding to the label.
3. The method of claim 1, wherein the separating image modality data and audio modality data of the multi-modal short video data comprises:
separating image modality data and audio modality data of the multi-modality short video data through an FFmpeg tool.
4. The method of claim 1, wherein the recommending tags for the multimodal short video data according to the matching score comprises:
and recommending a label corresponding to the label semantics for the multi-modal short video data when the matching score is greater than or equal to a preset score threshold value.
5. An apparatus for short video data tag recommendation, the apparatus comprising:
the modal data extraction module is used for acquiring multi-modal short video data and extracting image data, audio data and text data in the multi-modal short video data;
the feature extraction module is used for respectively extracting emotion feature matrixes of the image data, the audio data and the text data and respectively extracting content feature matrixes of the image data, the audio data and the text data;
the feature fusion module is used for acquiring multi-mode fusion emotion feature vectors corresponding to the emotion feature matrixes through a preset emotion common space and acquiring multi-mode fusion content feature vectors corresponding to the content feature matrixes through a preset content common space;
the feature matching module is used for acquiring matching scores of preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector;
the tag recommending module is used for recommending tags for the multi-modal short video data according to the matching scores;
the modal data extraction module is specifically configured to: acquiring multi-mode short video data; separating image modal data and audio modal data of the multi-modal short video data; extracting key frame data in the image modality data, and taking the key frame data as image data; dividing the audio modal data into audio segments, and taking the audio segments as audio data; taking text modal data corresponding to the multi-mode short video data as text data; the extracting key frame data in the image modality data comprises: extracting pictures from the separated image modal data as key frame data in a preset time span;
the feature extraction module is specifically configured to: extracting content feature vectors corresponding to all key frames in the image data through a preset ResNet-152 feature extractor, and constructing a content feature matrix corresponding to the image data according to the content feature vectors corresponding to all key frames; extracting emotion feature vectors corresponding to all key frames in the image data through a preset CNN feature extractor, and constructing emotion feature matrixes corresponding to the image data according to the emotion feature vectors corresponding to all key frames; extracting content feature vectors corresponding to all audio segments in the audio data through a preset SoundNet CNN feature extractor, and constructing a content feature matrix corresponding to the audio data according to the content feature vectors corresponding to all the audio segments; extracting each basic acoustic feature corresponding to each audio segment in the audio data through a preset Librosa tool library, obtaining an emotion feature vector corresponding to each audio segment in the audio data according to the basic acoustic features, and constructing an emotion feature matrix corresponding to the audio data according to the emotion feature vector corresponding to each audio segment; acquiring a glove word vector corresponding to each word in the text data, taking the glove word vector corresponding to each word as a content feature vector corresponding to each word, and constructing a content feature matrix corresponding to the text data according to the content feature vector corresponding to each word; extracting an emotion feature matrix corresponding to the text data through a preset CoreNLP tool;
the feature fusion module is specifically configured to: obtaining emotion feature vectors corresponding to the emotion feature matrixes of the image data according to the average value of the emotion feature vectors corresponding to the key frames in the image data, and obtaining emotion feature vectors corresponding to the emotion feature matrixes of the audio data according to the average value of the emotion feature vectors corresponding to the audio segments in the audio data; respectively mapping the emotion feature vectors corresponding to the image data, the audio data and the text data to an emotion common space through a multilayer perceptron, aligning and adjusting the emotion feature vectors mapped to the emotion common space through an alignment loss function, and acquiring multi-mode fusion emotion feature vectors corresponding to emotion feature matrixes of the image data, the audio data and the text data; inputting each content feature vector in a content feature matrix corresponding to the image data, the audio data and the text data into a preset bidirectional LSTM neural network to obtain a forward hidden state vector and a backward hidden state vector corresponding to each content feature vector; acquiring content feature weights corresponding to the content feature vectors according to the forward hidden state vectors and the backward hidden state vectors corresponding to the content feature vectors through a self-attention mechanism; acquiring content feature vectors corresponding to the image data, the audio data and the text data according to content feature matrixes corresponding to the image data, the audio data and the text data and content feature weights corresponding to the content feature vectors, aligning and adjusting the content feature vectors corresponding to the image data, the audio data and the text data through an alignment loss function, and acquiring multi-mode fusion content feature vectors corresponding to the content feature matrixes of the image data, the audio data and the text data;
the feature matching module is specifically configured to: splicing a label semantic feature vector corresponding to preset label semantics, the multi-mode fusion emotion feature vector and the multi-mode fusion content feature vector; performing, through a multilayer perceptron, interaction among the multi-mode fusion emotion feature vector, the multi-mode fusion content feature vector and the label semantic feature vector to obtain an interaction feature vector; and acquiring the matching score between the multi-modal short video data and the label according to the interaction feature vector.
6. The apparatus of claim 5, further comprising a tag semantics obtaining module configured to: acquiring a training data set, and acquiring labels corresponding to multi-modal short video data in the training data set; automatically segmenting into words the labels that are in phrase form and not separated by spaces; acquiring a glove word vector of each word in the label obtained by word segmentation; and acquiring a word vector average value of each glove word vector, and taking the word vector average value as a preset label semantic corresponding to the label.
7. The apparatus according to claim 5, wherein the modality data extraction module is specifically configured to: and separating image modal data and audio modal data of the multi-modal short video data through an FFmpeg tool.
8. The apparatus of claim 5, wherein the tag recommendation module is specifically configured to: recommending a label corresponding to the label semantics for the multi-modal short video data when the matching score is greater than or equal to a preset score threshold value.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of the method according to any of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN201911093019.4A 2019-11-11 2019-11-11 Short video data label recommendation method and device, computer equipment and storage medium Active CN110866184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911093019.4A CN110866184B (en) 2019-11-11 2019-11-11 Short video data label recommendation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911093019.4A CN110866184B (en) 2019-11-11 2019-11-11 Short video data label recommendation method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110866184A (en) 2020-03-06
CN110866184B (en) 2022-12-02

Family

ID=69654365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911093019.4A Active CN110866184B (en) 2019-11-11 2019-11-11 Short video data label recommendation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110866184B (en)

Also Published As

Publication number Publication date
CN110866184A (en) 2020-03-06

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant