CN114491258A - Keyword recommendation system and method based on multi-modal content - Google Patents

Keyword recommendation system and method based on multi-modal content

Info

Publication number
CN114491258A
CN114491258A (application number CN202210088492.9A)
Authority
CN
China
Prior art keywords
vector
label
user
module
posts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210088492.9A
Other languages
Chinese (zh)
Inventor
何智勇
冯皓楠
马良荔
牛敬华
刘耀勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval University of Engineering PLA
Original Assignee
Naval University of Engineering PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval University of Engineering PLA filed Critical Naval University of Engineering PLA
Priority to CN202210088492.9A
Publication of CN114491258A

Classifications

    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06Q50/01 Social networking

Abstract

The invention discloses a multi-modal content-based keyword recommendation system, which comprises a feature extraction module, a feature fusion module, a keyword generation module, a user personalized analysis module, a training module and a recommendation module.

Description

Keyword recommendation system and method based on multi-modal content
Technical Field
The invention relates to the technical field of keyword recommendation, in particular to a system and a method for keyword recommendation based on multi-modal content.
Background
Social media platforms are an important class of internet applications born in the big-data era, and they provide a distinctive topic-tag (hashtag) mechanism: a topic tag is a specific form of metadata, a string of characters prefixed with the symbol #. By attaching one or more representative topic tags to the content to be posted, a user expresses the central idea of the post; these tags generally reflect the user's view of, or emotion toward, the current topic. Because users are either reluctant to set topic tags or set them arbitrarily, topic tags are produced rapidly along with posts but are never organized or managed, which leads to information overload.
Disclosure of Invention
The aim of the invention is to provide a keyword recommendation system and method based on multi-modal content that improve the efficiency of information transmission and organization on the Internet and help to solve the problem of information overload.
To achieve this aim, the keyword recommendation system based on multi-modal content comprises a feature extraction module, a feature fusion module, a keyword generation module, a user personalized analysis module, a training module and a recommendation module;
the feature extraction module is used for extracting features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags, and for extracting features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures;
the feature fusion module is used for fusing the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures using a multi-head attention mechanism to obtain a fusion vector of the multi-modal content comprising text, tags and pictures;
the keyword generation module is used for applying a Seq2Seq framework to the fusion vector of the multi-modal content to generate new tags that do not exist in the data-set tag space, where the data-set tag space is the set of all tags in the given input post set;
the user personalized analysis module is used for randomly sampling L user history posts from the history post set of the user to be recommended, calculating the semantic similarity between the current post whose tags are to be recommended and the L user history posts with a vector-similarity method, applying a normalization function to these similarities to obtain the influence weight of each sampled history post on the current post, and performing a weighted summation using these influence weights to obtain the total influence vector of all sampled user history posts on the current post;
the training module is used for training the trainable parameters of the keyword generation module and the user personalized analysis module with the standard negative log-likelihood loss as the training function, yielding the trained keyword generation module and the trained user personalized analysis module;
and the recommendation module is used for concatenating the new tags obtained by the trained keyword generation module, which do not exist in the data-set tag space, with the total influence vector of all randomly sampled user history posts on the current post obtained by the trained user personalized analysis module, producing a spliced joint vector, and applying a normalization function to the joint vector to obtain the recommendation probability of each new tag that does not exist in the data-set tag space.
The invention has the beneficial effects that:
the method and the system can select proper theme keywords for the posts, can eliminate meaningless and noisy noise information on the social media platform, can provide required quick access information for the user, and are beneficial to improving the efficiency of platform information spreading and information organization. As users on social media platforms are reluctant to set or set the hashtag at will, the hashtag is caused to be generated quickly with the post but is not managed organized. Hundreds of topic tags are associated under one discussion topic, so that massive information redundancy is caused. The invention recommends tags to posts by utilizing the multi-modal post content published by the user on the social media platform and the personalized preference of the user. The method is characterized in that semantic association information among multi-modal contents is modeled by using a multi-head attention mechanism, the attention mechanism can well consider key contents of different modes by using a method of guiding one mode to another mode, and the modeling effect of the model is enhanced by stacking a plurality of attention mechanism layers. In addition, in consideration of personalized differences of different users, historical information of the users is selected to analyze personalized features of the users, and overall tag recommendation is performed by integrating the personalized features of the users and the multi-modal content of the posts of the tags to be recommended currently. The method has the advantages of accuracy and novelty.
Drawings
FIG. 1 is a schematic structural view of the present invention;
the system comprises a feature extraction module, a feature fusion module, a keyword generation module, a user personalized analysis module, a training module and a recommendation module, wherein the feature extraction module is 1, the feature fusion module is 2, the keyword generation module is 3, and the user personalized analysis module is 4.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
the system for recommending keywords based on multi-modal content as shown in fig. 1 comprises a feature extraction module 1, a feature fusion module 2, a keyword generation module 3, a user personalized analysis module 4, a training module 5 and a recommendation module 6;
the feature extraction module 1 is used for extracting features of texts and labels in an input post set given by a user to be recommended by using a bidirectional gating circulation unit to obtain semantic feature vector matrixes of the texts and the labels, and extracting features of pictures in the given input post set by using a VGG-16 neural network to obtain the semantic feature vector matrixes of the pictures; the bidirectional gating circulation unit is one of the circulation neural networks, can reduce the problem of gradient disappearance while keeping long-term sequence information of a text, and can greatly improve the training efficiency of the model, and the VGG-16 neural network uses a better convolution kernel, is simpler in structure compared with other neural networks, and has less parameter quantity.
The feature fusion module 2 is used for fusing the semantic feature vector matrixes of the text and the labels with the semantic feature vector matrix of the picture by using a multi-head attention-based mechanism to obtain a fusion vector of multi-modal content comprising the text, the labels and the picture; compared with the common attention mechanism, the multi-head attention mechanism can understand multiple semantic meanings contained in the text and can more comprehensively understand the text content.
The keyword generation module 3 is configured to generate a new tag that does not exist in a tag space of a data set by using a Seq2Seq framework for the fusion vector of the multimodal content, where the tag space of the data set is a set of all tags in a given input post set; the Seq2Seq framework is a model adopted when the length of an output sequence is uncertain, and the model can be adopted to generate labels without length limitation.
The user individuation analysis module 4 is used for randomly extracting L user historical posts from a historical post set of a user to be recommended, calculating semantic similarity between posts of a current label to be recommended and the L user historical posts by using a vector similarity calculation method, performing normalization function calculation on the semantic similarity to obtain influence weight of each randomly extracted user historical post on the posts of the current label to be recommended, and then performing weighted summation on the influence weight of each randomly extracted user historical post on the posts of the current label to be recommended to obtain a total influence vector of all randomly extracted user historical posts on the posts of the current label to be recommended; sampling the user history may allow understanding of the user's habits in using tags when faced with similar content.
The training module 5 is configured to train all trainable vector parameters and matrix parameters (all weight matrices and bias matrices) of the keyword generation module 3 and the user personalized analysis module 4 by using a standard negative log-likelihood loss function as a training function to obtain the trained keyword generation module 3 and the user personalized analysis module 4;
Detailed training methods are described in the references:
Zhang S, Yao Y, Xu F, et al. Hashtag Recommendation for Photo Sharing Services[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33: 5805-5812.
and:
Wang Y, Li J, Lyu M R, et al. Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings[J]. 2020.
The loss function is the criterion for measuring model quality; in machine learning the model is generally expected to attain a small loss. Minimizing the negative log-likelihood borrows the maximum-likelihood idea from statistics and is a commonly used loss function.
The recommendation module 6 is configured to concatenate the new tags obtained by the trained keyword generation module 3, which do not exist in the data-set tag space, with the total influence vector of all randomly sampled user history posts on the current post obtained by the trained user personalized analysis module 4, producing a spliced joint vector, and to apply a normalization function to the joint vector to obtain the recommendation probability of each new tag that does not exist in the data-set tag space. The result given by the recommendation module thus integrates the semantic information of the multi-modal content with the user's personalized information, making the recommendation not only accurate but also tailored to each user.
In the above technical solution, the feature extraction module 1 extracts the text and picture content with a bidirectional gated recurrent unit (bi-GRU) and a VGG-16 neural network respectively, obtaining a feature vector matrix for each modality.
The feature fusion module 2 fuses the feature information of the text modality and the picture modality with a Transformer structure based on multi-head attention; by letting one modality guide attention over another, the attention mechanism captures the key content of each modality, and the modeling effect of the model can be further enhanced by stacking several co-attention layers.
The keyword generation module 3 takes the fused feature matrix produced after modality fusion and decodes it with the Seq2Seq framework to generate keywords; to generate tags that do not exist in the data-set tag space, the idea of a copy mechanism can be adopted so that words are copied directly from the source input text.
The user personalized analysis module 4 takes the user's personalized feature information into account; given the personalized differences among users, the user's historical information is selected to analyze the user's features, and the user's habits and preferences are analyzed by quantifying the similarity between the multi-modal content features of the history posts and of the current post whose tags are to be recommended.
The training module 5 performs end-to-end training of the keyword generation module 3 and the user personalized analysis module 4, and model training is carried out as multi-label classification.
The recommendation module 6 performs personalized topic-keyword recommendation by integrating the user's personalized features with the multi-modal semantic information of the current post to be recommended; through the sequence-generation method, the multi-modal semantic information can yield tags that do not exist in the data-set tag space.
In the above technical solution, the specific method by which the feature extraction module 1 extracts features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags is as follows:
First, a lookup table is pre-trained on the text and tag content of the input post set; each word x_i of the text and tags in the input post set is then embedded into a high-dimensional vector, giving a discrete vector representation e(x_i) of each word; finally, the discrete representation e(x_i) of each word is encoded with the bidirectional gated recurrent unit:
h_i^→ = GRU(e(x_i), h_{i-1}^→)
h_i^← = GRU(e(x_i), h_{i+1}^←)
where h_i^→ is the forward hidden state of word x_i, h_i^← is the backward hidden state of word x_i, GRU denotes the gated recurrent unit function, h_{i-1}^→ is the (i-1)-th forward hidden state, and h_{i+1}^← is the (i+1)-th backward hidden state.
The forward hidden state h_i^→ and the backward hidden state h_i^← of word x_i are concatenated into a context-aware representation vector of word x_i:
h_i = [h_i^→; h_i^←]
All context-aware representation vectors of the input post set are stored in a text vector library to obtain the semantic feature vector matrix of the text and tags M_text = {h_1, ..., h_lx} ∈ R^{lx×d}, where d is the dimension of the hidden state, lx is the number of words in the text, R denotes the set of real numbers, and h_lx is the context-aware representation vector of the lx-th word.
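For illustration only, a minimal PyTorch-style sketch of this bi-GRU encoding step is given below; the class name, embedding size and hidden size are assumptions made for the example rather than values fixed by the description.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        # Bidirectional GRU encoder for post text and tags (names and sizes are illustrative).
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=150):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)  # pre-trained lookup table giving e(x_i)
            self.bigru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

        def forward(self, token_ids):
            # token_ids: (batch, lx) word indices of the post text and tags
            embedded = self.embedding(token_ids)       # (batch, lx, embed_dim)
            outputs, _ = self.bigru(embedded)          # (batch, lx, 2*hidden_dim)
            # each output row is [forward h_i ; backward h_i], i.e. the context-aware h_i
            return outputs                             # M_text with d = 2*hidden_dim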
In the above technical solution, the specific method by which the feature extraction module 1 extracts features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures is as follows: first, the input picture is resized to 224 × 224 and a 7 × 7 grid of convolutional feature maps is extracted from each picture; each feature-map cell is converted through a linear projection layer into a new picture vector v_i, each of which contains the visual information of a different region of the picture, giving the semantic feature vector matrix of the picture M_image = {v_1, ..., v_lv} ∈ R^{lv×d}, where d is the dimension of the hidden state, lv is the number of picture regions, and v_lv is the picture vector of the lv-th picture region.
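Similarly, a hedged sketch of the VGG-16 grid-feature extraction could look as follows; reusing torchvision's pretrained VGG-16 and the projection size d are assumptions of the example.

    import torch.nn as nn
    from torchvision import models

    class ImageEncoder(nn.Module):
        # VGG-16 7x7 grid features projected to the shared hidden size d (sizes assumed).
        def __init__(self, d=300):
            super().__init__()
            vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
            self.features = vgg.features       # conv stack: (batch, 512, 7, 7) for 224x224 input
            self.project = nn.Linear(512, d)   # linear projection of each of the 7x7 = 49 regions

        def forward(self, images):
            # images: (batch, 3, 224, 224), already resized and normalized
            fmap = self.features(images)               # (batch, 512, 7, 7)
            regions = fmap.flatten(2).transpose(1, 2)  # (batch, 49, 512), one vector per region
            return self.project(regions)               # M_image with lv = 49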
In the above technical solution, the specific method by which the feature fusion module 2 fuses the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures based on the multi-head attention mechanism is as follows:
In the multi-head attention mechanism, the text semantic feature vectors are first used as Query to guide attention over the pictures, and the picture features are then used as Query to guide attention over the text; the scaled dot-product attention A is executed for each co-attention operation pair {Query, Key, Value}:
head_h = A(Q W_h^Q, K W_h^K, V W_h^V),  A(Q, K, V) = softmax(Q K^T / √d_k) V
AM(Q, K, V) = [head_1; …; head_H] W^O
where Query (Q) denotes the query in the multi-head attention mechanism, Key (K) the key, Value (V) the value, A the attention operation, softmax the normalization function, d_k the dimension of the key, AM the concatenation of the outputs of all attention operations A, and W^O the overall weight matrix; W_h^Q, W_h^K and W_h^V are the weight matrices corresponding to Query, Key and Value, projecting them from the d-dimensional space into a lower d_H-dimensional space, where H is the number of heads used in the model. The outputs of all heads are concatenated into AM and passed to a feed-forward network with residual connections and layer normalization; multiple co-attention layers can be stacked to enhance the modeling capability of the model, and the outputs of all attention layers are then combined by a linear multi-modal fusion layer to obtain the context vector c_fuse ∈ R^d, which is the fusion vector of the multi-modal content comprising text, tags and pictures.
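A minimal sketch of one text/image co-attention layer in the same PyTorch style is shown below; using nn.MultiheadAttention in place of the hand-written heads, mean-pooling each modality, and the 2d-to-d fusion layer are simplifying assumptions of the example, not details fixed by the description.

    import math
    import torch
    import torch.nn as nn

    def scaled_dot_attention(Q, K, V):
        # A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        return torch.matmul(torch.softmax(scores, dim=-1), V)

    class CoAttentionFusion(nn.Module):
        # One text->image / image->text co-attention layer with residual + layer norm.
        def __init__(self, d=300, heads=6):
            super().__init__()
            self.txt2img = nn.MultiheadAttention(d, heads, batch_first=True)
            self.img2txt = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm = nn.LayerNorm(d)      # shared norm, kept single for brevity
            self.fuse = nn.Linear(2 * d, d)  # linear multi-modal fusion layer

        def forward(self, M_text, M_image):
            t, _ = self.txt2img(M_text, M_image, M_image)  # text queries attend over image regions
            v, _ = self.img2txt(M_image, M_text, M_text)   # image queries attend over words
            t = self.norm(t + M_text)                      # residual connection + layer normalization
            v = self.norm(v + M_image)
            # pool each modality and fuse into the context vector c_fuse in R^d
            return self.fuse(torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1))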
In the above technical solution, the keyword generation module 3 applies the Seq2Seq framework to the fusion vector of the multi-modal content; the specific method for generating new tags that do not exist in the data-set tag space is as follows: under the Seq2Seq framework a tag sequence y = <y_1, ..., y_ly> is generated, whose generation probability is defined as
P(y) = ∏_{t=1}^{ly} P(y_t | y_1, …, y_{t-1})
where P denotes conditional probability, t is the step variable taking values 1 to ly, y is the generated new tag sequence, y_t is the sequence content generated at step t, and ly is the length of the generated tag.
The tag generation process is modeled in the Seq2Seq framework with a unidirectional GRU decoder whose state is initialized with the final hidden state of the text encoder's GRU encoder; during generation the GRU decoder produces the hidden state s_t = GRU(s_{t-1}, u_t) ∈ R^d, where s_{t-1} is the hidden state preceding s_t and u_t is the encoder output at the t-th GRU step, which also serves as the input to the GRU decoder at step t.
The decoder input u_t, the GRU decoder hidden state s_t and the multi-modal fusion vector c_fuse are spliced by vector concatenation into the overall generated-tag sequence vector c_t, computed as:
c_t = [u_t; s_t; c_fuse]
The probability distribution P(y_t) of the sequence content y_t generated at each step t is predicted from c_t with the normalization function softmax: P(y_t) = softmax(MLP(c_t)),
where MLP is a multi-layer perceptron model; through the internal neural network computation of the multi-layer perceptron, the overall generated-tag sequence vector c_t yields new tags that do not exist in the data-set tag space.
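One decoding step of the GRU tag generator can be sketched as below, under the assumptions that u_t is the embedding of the previously generated tag token and that a small MLP produces the output distribution; the tag-vocabulary size is likewise assumed.

    import torch
    import torch.nn as nn

    class TagDecoderStep(nn.Module):
        # Single Seq2Seq decoding step: s_t = GRU(s_{t-1}, u_t), c_t = [u_t; s_t; c_fuse].
        def __init__(self, d=300, tag_vocab=5000):
            super().__init__()
            self.gru_cell = nn.GRUCell(d, d)
            self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.Tanh(), nn.Linear(d, tag_vocab))

        def forward(self, u_t, s_prev, c_fuse):
            # u_t: decoder input at step t; s_prev: s_{t-1}; c_fuse: fused multi-modal context
            s_t = self.gru_cell(u_t, s_prev)             # (batch, d)
            c_t = torch.cat([u_t, s_t, c_fuse], dim=-1)  # (batch, 3d)
            p_yt = torch.softmax(self.mlp(c_t), dim=-1)  # P(y_t) = softmax(MLP(c_t))
            return p_yt, s_t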
In the above technical solution, the specific method by which the user personalized analysis module 4 obtains the total influence vector of all randomly sampled user history posts on the current post whose tags are to be recommended is as follows:
First, an external storage unit is created to store the L randomly sampled user history posts, indexed by user id; the user's tagging habits are then learned from the user's historical posting records. Taking the context vector c_fuse as input, the semantic similarity between the current post and the history posts is measured, and the similarity between the semantic feature vectors of the multi-modal content of a user history post and of the current post is computed with the following formula:
r_i = tanh(c_fuse ⊙ c_fuse^i)
where c_fuse^i is the context vector of the i-th history post, ⊙ denotes element-wise multiplication, tanh is the function used to compute similarity, and r_i ∈ R^d is the relevance vector between the current post and the i-th history post. Connecting these vectors with a vector connection method yields the similarity matrix r = [r_1, r_2, …, r_L]. The influence weight of each history post on the current post is then computed with the softmax normalization function:
a_r = softmax(W_r r + b_r)
where W_r ∈ R^d is the attention weight matrix of the influence weights, b_r ∈ R is the attention bias of the influence weights, and a_r ∈ R^L is the weight vector containing the history-post weights. For the history posts in the storage unit, F_i denotes the tags in the i-th history post; each tag F_i is first embedded into the vector space f_i ∈ R^{d×M}, where M is the maximum tag length, d is the tag embedding dimension, f_i is a d×M matrix, R^d denotes the d-dimensional real vector space, R^L the L-dimensional one, and L is the size of the sampled user-history space. The total influence vector F of all selected history posts on the current post is computed by weighted summation:
F = Σ_{i=1}^{L} a_r^i f_i
The fusion vector of the multi-modal content c_fuse is then concatenated with the total influence vector F to obtain the personalized semantic vector q = [c_fuse; F].
The probability distribution over the tags in the personalized semantic vector q, which integrates the multi-modal content and the user's personalized preference content, is predicted with the normalization function softmax: p_q = softmax(MLP(q)), where p_q denotes this probability distribution and MLP is a multi-layer perceptron model.
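The history-weighting computation can be sketched as follows; representing each history post by its fused context vector and by a single pooled embedding of its tags (instead of the full d × M tag matrix) is a simplifying assumption of the example.

    import torch
    import torch.nn as nn

    class UserHistoryAttention(nn.Module):
        # Weights L sampled history posts against the current post and builds q = [c_fuse; F].
        def __init__(self, d=300):
            super().__init__()
            self.score = nn.Linear(d, 1)  # plays the role of W_r and b_r over relevance vectors

        def forward(self, c_fuse, history_c_fuse, history_tag_emb):
            # c_fuse: (batch, d); history_c_fuse, history_tag_emb: (batch, L, d)
            r = torch.tanh(c_fuse.unsqueeze(1) * history_c_fuse)       # r_i = tanh(c_fuse ⊙ c_fuse^i)
            a_r = torch.softmax(self.score(r).squeeze(-1), dim=-1)     # influence weights, (batch, L)
            F = torch.sum(a_r.unsqueeze(-1) * history_tag_emb, dim=1)  # weighted sum of tag embeddings
            q = torch.cat([c_fuse, F], dim=-1)                         # personalized semantic vector
            return q, a_r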
In the above technical solution, the specific method by which the training module 5 obtains the trained keyword generation module 3 and user personalized analysis module 4 is as follows:
The keyword generation module 3 and the user personalized analysis module 4 are trained with the standard negative log-likelihood loss as the training function:
L(θ) = − Σ_{n=1}^{N} [ Σ_{t=1}^{ly} log P(y_t^n) + γ · log p_q^n ]
where N is the number of posts in the training set of the data set, θ denotes the trainable parameters shared across the whole framework, and γ is the parameter balancing the two loss terms, namely the tag-generation term Σ_t log P(y_t^n) of module 3 and the personalized-recommendation term log p_q^n of module 4; n is a variable ranging from 1 to N, t denotes the t-th step of the GRU decoder, ly is the length of the generated tag sequence, P(y_t^n) is the probability with which module 3 generates a tag, and p_q^n is the probability with which the user personalization of module 4 influences the tag recommendation; L(θ) is the training objective over the trainable parameters of the whole framework. This loss function defines the training target of the keyword generation module 3 and the user personalized analysis module 4, and all trainable parameters in both modules are trained against it. During training, gradient clipping with a maximum gradient norm of 5 is applied and, whenever the validation loss of the model stops decreasing, the learning rate is decayed by a factor of 0.5; training of the keyword generation module 3 and the user personalized analysis module 4 is terminated by monitoring the change of the validation loss with an early-stopping strategy.
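A sketch of the joint negative log-likelihood is given below, assuming the generation head supplies per-step probabilities and the personalization head supplies one tag distribution per post; the exact way the two terms are combined and weighted by γ is an assumption consistent with, but not dictated by, the text above.

    import torch

    def joint_nll_loss(gen_probs, gen_targets, pq_probs, pq_targets, gamma=1.0, eps=1e-8):
        # gen_probs: (N, ly, vocab) step-wise P(y_t); gen_targets: (N, ly) gold tag tokens
        # pq_probs:  (N, num_tags) personalized distribution p_q; pq_targets: (N,) gold tag ids
        step_p = gen_probs.gather(-1, gen_targets.unsqueeze(-1)).squeeze(-1)  # P(y_t = gold)
        gen_loss = -torch.log(step_p + eps).sum(dim=1)                        # sum over t = 1..ly
        cls_p = pq_probs.gather(-1, pq_targets.unsqueeze(-1)).squeeze(-1)     # p_q(gold tag)
        cls_loss = -torch.log(cls_p + eps)
        return (gen_loss + gamma * cls_loss).mean()                           # average over the N posts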
In the above technical solution, the specific method by which the recommendation module 6 obtains the recommendation probability of each new tag that does not exist in the data-set tag space is as follows:
The trained keyword generation module 3 yields the generated new-tag vector c_t, and the trained user personalized analysis module 4 computes the total influence vector t_1 of all randomly sampled user history posts on the current post whose tags are to be recommended; the two vectors are combined by a concatenation operation into the joint vector q:
q = Concat(c_t, t_1)
where Concat denotes the vector concatenation operation. Once the joint vector is obtained, the problem can be treated as a multi-label classification problem, and the probability P of recommending each new tag that does not exist in the data-set tag space is computed with the softmax normalization function:
P = softmax(W_q q + b_q)
where W_q is the weight matrix for the vector q and b_q is the bias for the vector q. Finally, the top-K topic tags with the highest recommendation probability are recommended to the user.
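The joint-vector scoring step can be sketched as below; the dimensionality of c_t and t_1, the tag-vocabulary size and the top-K value are assumptions for the example.

    import torch
    import torch.nn as nn

    class Recommender(nn.Module):
        # Concatenates the generated-tag vector c_t and the history influence t_1, then scores tags.
        def __init__(self, d=300, num_tags=5000, top_k=5):
            super().__init__()
            self.out = nn.Linear(2 * d, num_tags)  # realises W_q q + b_q
            self.top_k = top_k

        def forward(self, c_t, t_1):
            q = torch.cat([c_t, t_1], dim=-1)         # q = Concat(c_t, t_1)
            P = torch.softmax(self.out(q), dim=-1)    # P = softmax(W_q q + b_q)
            return torch.topk(P, self.top_k, dim=-1)  # top-K topic tags with the highest probability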
A keyword recommendation method based on multi-modal content comprises the following steps:
Step 1: the feature extraction module 1 extracts features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags, and extracts features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures;
Step 2: the feature fusion module 2 fuses the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures using a multi-head attention mechanism to obtain a fusion vector of the multi-modal content comprising text, tags and pictures;
Step 3: the keyword generation module 3 applies a Seq2Seq framework to the fusion vector of the multi-modal content to generate new tags that do not exist in the data-set tag space, where the data-set tag space is the set of all tags in the given input post set;
Step 4: the user personalized analysis module 4 randomly samples L user history posts from the history post set of the user to be recommended, calculates the semantic similarity between the current post whose tags are to be recommended and the L user history posts with a vector-similarity method, applies a normalization function to these similarities to obtain the influence weight of each sampled history post on the current post, and then performs a weighted summation using these influence weights to obtain the total influence vector of all sampled user history posts on the current post;
Step 5: the training module 5 trains the trainable parameters of the keyword generation module 3 and the user personalized analysis module 4 with the standard negative log-likelihood loss as the training function to obtain the trained keyword generation module 3 and user personalized analysis module 4;
Step 6: the recommendation module 6 concatenates the new tags obtained by the trained keyword generation module 3, which do not exist in the data-set tag space, with the total influence vector of all randomly sampled user history posts on the current post obtained by the trained user personalized analysis module 4 to obtain a spliced joint vector, and applies a normalization function to the joint vector to obtain the recommendation probability of each new tag that does not exist in the data-set tag space.
Details not described in this specification are well known to those skilled in the art.

Claims (9)

1. A multi-modal content-based keyword recommendation system is characterized by comprising a feature extraction module (1), a feature fusion module (2), a keyword generation module (3), a user personalized analysis module (4), a training module (5) and a recommendation module (6);
the feature extraction module (1) is used for extracting features of texts and labels in an input post set given by a user to be recommended by using a bidirectional gated recurrent unit to obtain semantic feature vector matrixes of the texts and the labels, and extracting features of pictures in the given input post set by using a VGG-16 neural network to obtain the semantic feature vector matrixes of the pictures;
the feature fusion module (2) is used for fusing the semantic feature vector matrixes of the text and the labels with the semantic feature vector matrix of the picture by using a multi-head attention-based mechanism to obtain a fusion vector containing multi-mode content of the text, the labels and the picture;
the keyword generation module (3) is configured to generate a new tag that does not exist in a data set tag space by using a Seq2Seq framework for the fusion vector of the multimodal content, where the data set tag space is a set of all tags in a given input post set;
the user personalized analysis module (4) is used for randomly extracting L user historical posts from a historical post set of the user to be recommended, calculating the semantic similarity between the post of the current label to be recommended and the L user historical posts by using a vector similarity calculation method, applying a normalization function to the semantic similarities to obtain the influence weight of each randomly extracted user historical post on the post of the current label to be recommended, and then performing a weighted summation using these influence weights to obtain the total influence vector of all randomly extracted user historical posts on the post of the current label to be recommended;
the training module (5) is used for training all trainable vector parameters and matrix parameters of the keyword generation module (3) and the user personalized analysis module (4) by adopting a standard negative log-likelihood loss function as a training function to obtain the trained keyword generation module (3) and the user personalized analysis module (4);
and the recommending module (6) is used for vector splicing the new labels which are obtained by the trained keyword generating module (3) and do not exist in the data set label space with the total influence vectors of all randomly extracted user historical posts which are obtained by the trained user personalized analyzing module (4) on the posts of the current labels to be recommended to obtain spliced joint vectors, and the spliced joint vectors are subjected to normalization function to obtain the recommended probability of each new label which does not exist in the data set label space.
2. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the feature extraction module (1) extracts features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags is as follows:
first, a lookup table is pre-trained on the text and tag content of the input post set; each word x_i of the text and tags in the input post set is then embedded into a high-dimensional vector, giving a discrete vector representation e(x_i) of each word; finally, the discrete representation e(x_i) of each word is encoded with the bidirectional gated recurrent unit:
h_i^→ = GRU(e(x_i), h_{i-1}^→)
h_i^← = GRU(e(x_i), h_{i+1}^←)
where h_i^→ is the forward hidden state of word x_i, h_i^← is the backward hidden state of word x_i, GRU denotes the gated recurrent unit function, h_{i-1}^→ is the (i-1)-th forward hidden state, and h_{i+1}^← is the (i+1)-th backward hidden state;
the forward hidden state h_i^→ and the backward hidden state h_i^← of word x_i are concatenated into a context-aware representation vector of word x_i:
h_i = [h_i^→; h_i^←]
all context-aware representation vectors of the input post set are stored in a text vector library to obtain the semantic feature vector matrix of the text and tags M_text = {h_1, ..., h_lx} ∈ R^{lx×d}, where d is the dimension of the hidden state, lx is the number of words in the text, R denotes the set of real numbers, and h_lx is the context-aware representation vector of the lx-th word.
3. The multi-modal content-based keyword recommendation system of claim 1, wherein: the specific method by which the feature extraction module (1) extracts features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures is as follows: first, the input picture is resized to 224 × 224 and a 7 × 7 grid of convolutional feature maps is extracted from each picture; each feature-map cell is converted through a linear projection layer into a new picture vector v_i, each of which contains the visual information of a different region of the picture, giving the semantic feature vector matrix of the picture M_image = {v_1, ..., v_lv} ∈ R^{lv×d}, where d is the dimension of the hidden state, lv is the number of picture regions, and v_lv is the picture vector of the lv-th picture region.
4. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the feature fusion module (2) fuses the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures using a multi-head attention mechanism is as follows:
in the multi-head attention mechanism, the text semantic feature vectors are first used as Query to guide attention over the pictures, and the picture features are then used as Query to guide attention over the text; the scaled dot-product attention A is executed for each co-attention operation pair {Query, Key, Value}:
head_h = A(Q W_h^Q, K W_h^K, V W_h^V),  A(Q, K, V) = softmax(Q K^T / √d_k) V
AM(Q, K, V) = [head_1; …; head_H] W^O
where Query (Q) denotes the query in the multi-head attention mechanism, Key (K) the key, Value (V) the value, A the attention operation, softmax the normalization function, d_k the dimension of the key, AM the concatenation of the outputs of all attention operations A, and W^O the overall weight matrix; W_h^Q, W_h^K and W_h^V are the weight matrices corresponding to Query, Key and Value, projecting them from the d-dimensional space into a lower d_H-dimensional space, where H is the number of heads used in the model; the outputs of all heads are concatenated into AM and passed to a feed-forward network with residual connections and layer normalization; multiple co-attention layers can be stacked to enhance the modeling capability of the model, and the outputs of all attention layers are then combined by a linear multi-modal fusion layer to obtain the context vector c_fuse ∈ R^d, which is the fusion vector of the multi-modal content comprising text, tags and pictures.
5. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the keyword generation module (3) applies the Seq2Seq framework to the fusion vector of the multi-modal content to generate new tags that do not exist in the data-set tag space is as follows: under the Seq2Seq framework a tag sequence y = <y_1, ..., y_ly> is generated, whose generation probability is defined as
P(y) = ∏_{t=1}^{ly} P(y_t | y_1, …, y_{t-1})
where P denotes conditional probability, t is the step variable taking values 1 to ly, y is the generated new tag sequence, y_t is the sequence content generated at step t, and ly is the length of the generated tag;
the tag generation process is modeled in the Seq2Seq framework with a unidirectional GRU decoder whose state is initialized with the final hidden state of the text encoder's GRU encoder; during generation the GRU decoder produces the hidden state s_t = GRU(s_{t-1}, u_t) ∈ R^d, where s_{t-1} is the hidden state preceding s_t and u_t is the encoder output at the t-th GRU step, which also serves as the input to the GRU decoder at step t;
the decoder input u_t, the GRU decoder hidden state s_t and the multi-modal fusion vector c_fuse are spliced by vector concatenation into the overall generated-tag sequence vector c_t, computed as:
c_t = [u_t; s_t; c_fuse]
the probability distribution P(y_t) of the sequence content y_t generated at each step t is predicted from c_t with the normalization function softmax: P(y_t) = softmax(MLP(c_t)),
where MLP is a multi-layer perceptron model; through the internal neural network computation of the multi-layer perceptron, the overall generated-tag sequence vector c_t yields new tags that do not exist in the data-set tag space.
6. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the user personalized analysis module (4) obtains the total influence vector of all randomly sampled user history posts on the current post whose tags are to be recommended is as follows:
first, an external storage unit is created to store the L randomly sampled user history posts, indexed by user id; the user's tagging habits are then learned from the user's historical posting records; taking the context vector c_fuse as input, the semantic similarity between the current post and the history posts is measured, and the similarity between the semantic feature vectors of the multi-modal content of a user history post and of the current post is computed with the following formula:
r_i = tanh(c_fuse ⊙ c_fuse^i)
where c_fuse^i is the context vector of the i-th history post, ⊙ denotes element-wise multiplication, tanh is the function used to compute similarity, and r_i ∈ R^d is the relevance vector between the current post and the i-th history post; connecting these vectors with a vector connection method yields the similarity matrix r = [r_1, r_2, …, r_L]; the influence weight of each history post on the current post is then computed with the softmax normalization function:
a_r = softmax(W_r r + b_r)
where W_r ∈ R^d is the attention weight matrix of the influence weights, b_r ∈ R is the attention bias of the influence weights, and a_r ∈ R^L is the weight vector containing the history-post weights; for the history posts in the storage unit, F_i denotes the tags in the i-th history post; each tag F_i is first embedded into the vector space f_i ∈ R^{d×M}, where M is the maximum tag length, d is the tag embedding dimension, f_i is a d×M matrix, R^d denotes the d-dimensional real vector space, R^L the L-dimensional one, and L is the size of the sampled user-history space; the total influence vector F of all selected history posts on the current post is computed by weighted summation:
F = Σ_{i=1}^{L} a_r^i f_i
the fusion vector of the multi-modal content c_fuse is then concatenated with the total influence vector F to obtain the personalized semantic vector q = [c_fuse; F];
the probability distribution over the tags in the personalized semantic vector q, which integrates the multi-modal content and the user's personalized preference content, is predicted with the normalization function softmax: p_q = softmax(MLP(q)), where p_q denotes this probability distribution and MLP is a multi-layer perceptron model.
7. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the training module (5) obtains the trained keyword generation module (3) and user personalized analysis module (4) is as follows:
the keyword generation module (3) and the user personalized analysis module (4) are trained with the standard negative log-likelihood loss as the training function:
L(θ) = − Σ_{n=1}^{N} [ Σ_{t=1}^{ly} log P(y_t^n) + γ · log p_q^n ]
where N is the number of posts in the training set of the data set, θ denotes the trainable parameters shared across the whole framework, and γ is the parameter balancing the two loss terms, namely the tag-generation term Σ_t log P(y_t^n) of module (3) and the personalized-recommendation term log p_q^n of module (4); n is a variable ranging from 1 to N, t denotes the t-th step of the GRU decoder, ly is the length of the generated tag sequence, P(y_t^n) is the probability with which module (3) generates a tag, and p_q^n is the probability with which the user personalization of module (4) influences the tag recommendation; L(θ) is the training objective over the trainable parameters of the whole framework; this loss function defines the training target of the keyword generation module (3) and the user personalized analysis module (4), and all trainable parameters in both modules are trained against it; during training, gradient clipping with a maximum gradient norm of 5 is applied and, whenever the validation loss of the model stops decreasing, the learning rate is decayed by a factor of 0.5; training of the keyword generation module (3) and the user personalized analysis module (4) is terminated by monitoring the change of the validation loss with an early-stopping strategy.
8. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the recommendation module (6) obtains the recommendation probability of each new tag that does not exist in the data-set tag space is as follows:
the trained keyword generation module (3) yields the generated new-tag vector c_t, and the trained user personalized analysis module (4) computes the total influence vector t_1 of all randomly sampled user history posts on the current post whose tags are to be recommended; the two vectors are combined by a concatenation operation into the joint vector q:
q = Concat(c_t, t_1)
where Concat denotes the vector concatenation operation; once the joint vector is obtained, the problem can be treated as a multi-label classification problem, and the probability P of recommending each new tag that does not exist in the data-set tag space is computed with the softmax normalization function:
P = softmax(W_q q + b_q)
where W_q is the weight matrix for the vector q and b_q is the bias for the vector q.
9. A keyword recommendation method based on multi-modal content is characterized by comprising the following steps:
step 1: the method comprises the following steps that a feature extraction module (1) performs feature extraction on a text and a label in an input post set given by a user to be recommended by using a bidirectional gating circulation unit to obtain a semantic feature vector matrix of the text and the label, and performs feature extraction on a picture in the given input post set by using a VGG-16 neural network to obtain the semantic feature vector matrix of the picture;
and 2, step: the feature fusion module (2) fuses the semantic feature vector matrixes of the text and the labels and the semantic feature vector matrix of the picture by using a multi-head attention-based mechanism to obtain a fusion vector of multi-mode content comprising the text, the labels and the picture;
and step 3: a keyword generation module (3) adopts a Seq2Seq framework for the fusion vector of the multi-modal content to generate a new label which does not exist in a data set label space, wherein the data set label space is a set of all labels in a given input post set;
and 4, step 4: the user individuation analysis module (4) randomly extracts L user historical posts from a historical post set of a user to be recommended, calculates the semantic similarity between the posts of the current label to be recommended and the L user historical posts by using a vector similarity calculation method, performs normalization function calculation on the semantic similarity to obtain the influence weight of each randomly extracted user historical post on the posts of the current label to be recommended, and then performs weighted summation on the influence weight of each randomly extracted user historical post on the posts of the current label to be recommended to obtain the total influence vector of all randomly extracted user historical posts on the posts of the current label to be recommended;
and 5: the training module (5) trains trainable parameters of the keyword generation module (3) and the user personalized analysis module (4) by adopting a standard negative log-likelihood loss function as a training function to obtain the trained keyword generation module (3) and the user personalized analysis module (4);
step 6: and the recommending module (6) performs vector splicing on the new labels which are obtained by the trained keyword generating module (3) and do not exist in the data set label space and the total influence vector of all randomly extracted user historical posts obtained by the trained user personalized analysis module (4) on the posts of the current to-be-recommended labels to obtain spliced joint vectors, and obtains the recommended probability of each new label which does not exist in the data set label space by using a normalization function on the spliced joint vectors.
CN202210088492.9A 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content Pending CN114491258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210088492.9A CN114491258A (en) 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210088492.9A CN114491258A (en) 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content

Publications (1)

Publication Number Publication Date
CN114491258A true CN114491258A (en) 2022-05-13

Family

ID=81474209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210088492.9A Pending CN114491258A (en) 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content

Country Status (1)

Country Link
CN (1) CN114491258A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969534A (en) * 2022-06-04 2022-08-30 哈尔滨理工大学 Mobile crowd sensing task recommendation method fusing multi-modal data features
CN115203471A (en) * 2022-09-15 2022-10-18 山东宝盛鑫信息科技有限公司 Attention mechanism-based multimode fusion video recommendation method
CN115203471B (en) * 2022-09-15 2022-11-18 山东宝盛鑫信息科技有限公司 Attention mechanism-based multimode fusion video recommendation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination