CN114491258A - Keyword recommendation system and method based on multi-modal content - Google Patents

Keyword recommendation system and method based on multi-modal content

Info

Publication number
CN114491258A
CN114491258A (application number CN202210088492.9A)
Authority
CN
China
Prior art keywords
vector
label
user
module
posts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210088492.9A
Other languages
Chinese (zh)
Inventor
何智勇
冯皓楠
马良荔
牛敬华
刘耀勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval University of Engineering PLA
Original Assignee
Naval University of Engineering PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval University of Engineering PLA filed Critical Naval University of Engineering PLA
Priority to CN202210088492.9A
Publication of CN114491258A

Classifications

    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06Q50/01 Social networking

Abstract

The invention discloses a multi-modal content-based keyword recommendation system, which comprises a feature extraction module, a feature fusion module, a keyword generation module, a user personalized analysis module, a training module and a recommendation module.

Description

Keyword recommendation system and method based on multi-modal content
Technical Field
The invention relates to the technical field of keyword recommendation, in particular to a system and a method for keyword recommendation based on multi-modal content.
Background
Social media platforms are an important class of internet applications born in the big-data era, and they provide a distinctive topic-tag (hashtag) mechanism: a topic tag is a specific form of metadata, a string of characters prefixed with the symbol #. By attaching one or more representative topic tags to the content to be posted, a user expresses the central idea of the post; these tags generally reflect the user's view of, or emotion toward, the current topic. Because users are either reluctant to set topic tags or set them arbitrarily, topic tags are produced rapidly along with posts but are never organized or managed, which leads to information overload.
Disclosure of Invention
The aim of the invention is to provide a keyword recommendation system and method based on multi-modal content that improve the efficiency of information transmission and organization on the Internet and help to solve the problem of information overload.
To achieve this aim, the keyword recommendation system based on multi-modal content comprises a feature extraction module, a feature fusion module, a keyword generation module, a user personalized analysis module, a training module and a recommendation module;
the feature extraction module is used for extracting features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags, and for extracting features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures;
the feature fusion module is used for fusing the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures using a multi-head attention mechanism to obtain a fusion vector of the multi-modal content comprising text, tags and pictures;
the keyword generation module is used for applying a Seq2Seq framework to the fusion vector of the multi-modal content to generate new tags that do not exist in the data-set tag space, where the data-set tag space is the set of all tags in the given input post set;
the user personalized analysis module is used for randomly sampling L user history posts from the history post set of the user to be recommended, calculating the semantic similarity between the current post whose tags are to be recommended and the L user history posts with a vector-similarity method, applying a normalization function to these similarities to obtain the influence weight of each sampled history post on the current post, and performing a weighted summation using these influence weights to obtain the total influence vector of all sampled user history posts on the current post;
the training module is used for training the trainable parameters of the keyword generation module and the user personalized analysis module with the standard negative log-likelihood loss as the training function, yielding the trained keyword generation module and the trained user personalized analysis module;
and the recommendation module is used for concatenating the new tags obtained by the trained keyword generation module, which do not exist in the data-set tag space, with the total influence vector of all randomly sampled user history posts on the current post obtained by the trained user personalized analysis module, producing a spliced joint vector, and applying a normalization function to the joint vector to obtain the recommendation probability of each new tag that does not exist in the data-set tag space.
The invention has the beneficial effects that:
the method and the system can select proper theme keywords for the posts, can eliminate meaningless and noisy noise information on the social media platform, can provide required quick access information for the user, and are beneficial to improving the efficiency of platform information spreading and information organization. As users on social media platforms are reluctant to set or set the hashtag at will, the hashtag is caused to be generated quickly with the post but is not managed organized. Hundreds of topic tags are associated under one discussion topic, so that massive information redundancy is caused. The invention recommends tags to posts by utilizing the multi-modal post content published by the user on the social media platform and the personalized preference of the user. The method is characterized in that semantic association information among multi-modal contents is modeled by using a multi-head attention mechanism, the attention mechanism can well consider key contents of different modes by using a method of guiding one mode to another mode, and the modeling effect of the model is enhanced by stacking a plurality of attention mechanism layers. In addition, in consideration of personalized differences of different users, historical information of the users is selected to analyze personalized features of the users, and overall tag recommendation is performed by integrating the personalized features of the users and the multi-modal content of the posts of the tags to be recommended currently. The method has the advantages of accuracy and novelty.
Drawings
FIG. 1 is a schematic structural view of the present invention;
the system comprises a feature extraction module, a feature fusion module, a keyword generation module, a user personalized analysis module, a training module and a recommendation module, wherein the feature extraction module is 1, the feature fusion module is 2, the keyword generation module is 3, and the user personalized analysis module is 4.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
the system for recommending keywords based on multi-modal content as shown in fig. 1 comprises a feature extraction module 1, a feature fusion module 2, a keyword generation module 3, a user personalized analysis module 4, a training module 5 and a recommendation module 6;
the feature extraction module 1 is used for extracting features of texts and labels in an input post set given by a user to be recommended by using a bidirectional gating circulation unit to obtain semantic feature vector matrixes of the texts and the labels, and extracting features of pictures in the given input post set by using a VGG-16 neural network to obtain the semantic feature vector matrixes of the pictures; the bidirectional gating circulation unit is one of the circulation neural networks, can reduce the problem of gradient disappearance while keeping long-term sequence information of a text, and can greatly improve the training efficiency of the model, and the VGG-16 neural network uses a better convolution kernel, is simpler in structure compared with other neural networks, and has less parameter quantity.
The feature fusion module 2 is used for fusing the semantic feature vector matrixes of the text and the labels with the semantic feature vector matrix of the picture by using a multi-head attention-based mechanism to obtain a fusion vector of multi-modal content comprising the text, the labels and the picture; compared with the common attention mechanism, the multi-head attention mechanism can understand multiple semantic meanings contained in the text and can more comprehensively understand the text content.
The keyword generation module 3 is configured to generate a new tag that does not exist in a tag space of a data set by using a Seq2Seq framework for the fusion vector of the multimodal content, where the tag space of the data set is a set of all tags in a given input post set; the Seq2Seq framework is a model adopted when the length of an output sequence is uncertain, and the model can be adopted to generate labels without length limitation.
The user individuation analysis module 4 is used for randomly extracting L user historical posts from a historical post set of a user to be recommended, calculating semantic similarity between posts of a current label to be recommended and the L user historical posts by using a vector similarity calculation method, performing normalization function calculation on the semantic similarity to obtain influence weight of each randomly extracted user historical post on the posts of the current label to be recommended, and then performing weighted summation on the influence weight of each randomly extracted user historical post on the posts of the current label to be recommended to obtain a total influence vector of all randomly extracted user historical posts on the posts of the current label to be recommended; sampling the user history may allow understanding of the user's habits in using tags when faced with similar content.
The training module 5 is configured to train all trainable vector parameters and matrix parameters (all weight matrices and bias matrices) of the keyword generation module 3 and the user personalized analysis module 4 by using a standard negative log-likelihood loss function as a training function to obtain the trained keyword generation module 3 and the user personalized analysis module 4;
Detailed training methods are described in the references:
Zhang S, Yao Y, Xu F, et al. Hashtag Recommendation for Photo Sharing Services[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33: 5805-5812.
and:
Wang Y, Li J, Lyu M R, et al. Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings[J]. 2020.
The loss function is the criterion for measuring model quality; in machine learning the model is generally expected to attain a small loss. Minimizing the negative log-likelihood borrows the maximum-likelihood idea from statistics and is a commonly used loss function.
The recommendation module 6 is configured to concatenate the new tags obtained by the trained keyword generation module 3, which do not exist in the data-set tag space, with the total influence vector of all randomly sampled user history posts on the current post obtained by the trained user personalized analysis module 4, producing a spliced joint vector, and to apply a normalization function to the joint vector to obtain the recommendation probability of each new tag that does not exist in the data-set tag space. The result given by the recommendation module thus integrates the semantic information of the multi-modal content with the user's personalized information, making the recommendation not only accurate but also tailored to each user.
In the above technical solution, the feature extraction module 1 extracts the text and picture content with a bidirectional gated recurrent unit (bi-GRU) and a VGG-16 neural network respectively, obtaining a feature vector matrix for each modality.
The feature fusion module 2 fuses the feature information of the text modality and the picture modality with a Transformer structure based on multi-head attention; by letting one modality guide attention over another, the attention mechanism captures the key content of each modality, and the modeling effect of the model can be further enhanced by stacking several co-attention layers.
The keyword generation module 3 takes the fused feature matrix produced after modality fusion and decodes it with the Seq2Seq framework to generate keywords; to generate tags that do not exist in the data-set tag space, the idea of a copy mechanism can be adopted so that words are copied directly from the source input text.
The user personalized analysis module 4 takes the user's personalized feature information into account; given the personalized differences among users, the user's historical information is selected to analyze the user's features, and the user's habits and preferences are analyzed by quantifying the similarity between the multi-modal content features of the history posts and of the current post whose tags are to be recommended.
The training module 5 performs end-to-end training of the keyword generation module 3 and the user personalized analysis module 4, and model training is carried out as multi-label classification.
The recommendation module 6 performs personalized topic-keyword recommendation by integrating the user's personalized features with the multi-modal semantic information of the current post to be recommended; through the sequence-generation method, the multi-modal semantic information can yield tags that do not exist in the data-set tag space.
In the above technical solution, the specific method by which the feature extraction module 1 extracts features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags is as follows:
First, a lookup table is pre-trained on the text and tag content of the input post set; each word x_i of the text and tags in the input post set is then embedded into a high-dimensional vector, giving a discrete vector representation e(x_i) of each word; finally, the discrete representation e(x_i) of each word is encoded with the bidirectional gated recurrent unit:
h_i^→ = GRU(e(x_i), h_{i-1}^→)
h_i^← = GRU(e(x_i), h_{i+1}^←)
where h_i^→ is the forward hidden state of word x_i, h_i^← is the backward hidden state of word x_i, GRU denotes the gated recurrent unit function, h_{i-1}^→ is the (i-1)-th forward hidden state, and h_{i+1}^← is the (i+1)-th backward hidden state.
The forward hidden state h_i^→ and the backward hidden state h_i^← of word x_i are concatenated into a context-aware representation vector of word x_i:
h_i = [h_i^→; h_i^←]
All context-aware representation vectors of the input post set are stored in a text vector library to obtain the semantic feature vector matrix of the text and tags M_text = {h_1, ..., h_lx} ∈ R^{lx×d}, where d is the dimension of the hidden state, lx is the number of words in the text, R denotes the set of real numbers, and h_lx is the context-aware representation vector of the lx-th word.
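For illustration only, a minimal PyTorch-style sketch of this bi-GRU encoding step is given below; the class name, embedding size and hidden size are assumptions made for the example rather than values fixed by the description.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        # Bidirectional GRU encoder for post text and tags (names and sizes are illustrative).
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=150):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)  # pre-trained lookup table giving e(x_i)
            self.bigru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

        def forward(self, token_ids):
            # token_ids: (batch, lx) word indices of the post text and tags
            embedded = self.embedding(token_ids)       # (batch, lx, embed_dim)
            outputs, _ = self.bigru(embedded)          # (batch, lx, 2*hidden_dim)
            # each output row is [forward h_i ; backward h_i], i.e. the context-aware h_i
            return outputs                             # M_text with d = 2*hidden_dim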
In the above technical solution, the specific method by which the feature extraction module 1 extracts features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures is as follows: first, the input picture is resized to 224 × 224 and a 7 × 7 grid of convolutional feature maps is extracted from each picture; each feature-map cell is converted through a linear projection layer into a new picture vector v_i, each of which contains the visual information of a different region of the picture, giving the semantic feature vector matrix of the picture M_image = {v_1, ..., v_lv} ∈ R^{lv×d}, where d is the dimension of the hidden state, lv is the number of picture regions, and v_lv is the picture vector of the lv-th picture region.
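Similarly, a hedged sketch of the VGG-16 grid-feature extraction could look as follows; reusing torchvision's pretrained VGG-16 and the projection size d are assumptions of the example.

    import torch.nn as nn
    from torchvision import models

    class ImageEncoder(nn.Module):
        # VGG-16 7x7 grid features projected to the shared hidden size d (sizes assumed).
        def __init__(self, d=300):
            super().__init__()
            vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
            self.features = vgg.features       # conv stack: (batch, 512, 7, 7) for 224x224 input
            self.project = nn.Linear(512, d)   # linear projection of each of the 7x7 = 49 regions

        def forward(self, images):
            # images: (batch, 3, 224, 224), already resized and normalized
            fmap = self.features(images)               # (batch, 512, 7, 7)
            regions = fmap.flatten(2).transpose(1, 2)  # (batch, 49, 512), one vector per region
            return self.project(regions)               # M_image with lv = 49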
In the above technical solution, the specific method by which the feature fusion module 2 fuses the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures based on the multi-head attention mechanism is as follows:
In the multi-head attention mechanism, the text semantic feature vectors are first used as Query to guide attention over the pictures, and the picture features are then used as Query to guide attention over the text; the scaled dot-product attention A is executed for each co-attention operation pair {Query, Key, Value}:
head_h = A(Q W_h^Q, K W_h^K, V W_h^V),  A(Q, K, V) = softmax(Q K^T / √d_k) V
AM(Q, K, V) = [head_1; …; head_H] W^O
where Query (Q) denotes the query in the multi-head attention mechanism, Key (K) the key, Value (V) the value, A the attention operation, softmax the normalization function, d_k the dimension of the key, AM the concatenation of the outputs of all attention operations A, and W^O the overall weight matrix; W_h^Q, W_h^K and W_h^V are the weight matrices corresponding to Query, Key and Value, projecting them from the d-dimensional space into a lower d_H-dimensional space, where H is the number of heads used in the model. The outputs of all heads are concatenated into AM and passed to a feed-forward network with residual connections and layer normalization; multiple co-attention layers can be stacked to enhance the modeling capability of the model, and the outputs of all attention layers are then combined by a linear multi-modal fusion layer to obtain the context vector c_fuse ∈ R^d, which is the fusion vector of the multi-modal content comprising text, tags and pictures.
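A minimal sketch of one text/image co-attention layer in the same PyTorch style is shown below; using nn.MultiheadAttention in place of the hand-written heads, mean-pooling each modality, and the 2d-to-d fusion layer are simplifying assumptions of the example, not details fixed by the description.

    import math
    import torch
    import torch.nn as nn

    def scaled_dot_attention(Q, K, V):
        # A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        return torch.matmul(torch.softmax(scores, dim=-1), V)

    class CoAttentionFusion(nn.Module):
        # One text->image / image->text co-attention layer with residual + layer norm.
        def __init__(self, d=300, heads=6):
            super().__init__()
            self.txt2img = nn.MultiheadAttention(d, heads, batch_first=True)
            self.img2txt = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm = nn.LayerNorm(d)      # shared norm, kept single for brevity
            self.fuse = nn.Linear(2 * d, d)  # linear multi-modal fusion layer

        def forward(self, M_text, M_image):
            t, _ = self.txt2img(M_text, M_image, M_image)  # text queries attend over image regions
            v, _ = self.img2txt(M_image, M_text, M_text)   # image queries attend over words
            t = self.norm(t + M_text)                      # residual connection + layer normalization
            v = self.norm(v + M_image)
            # pool each modality and fuse into the context vector c_fuse in R^d
            return self.fuse(torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1))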
In the above technical solution, the keyword generation module 3 applies the Seq2Seq framework to the fusion vector of the multi-modal content; the specific method for generating new tags that do not exist in the data-set tag space is as follows: under the Seq2Seq framework a tag sequence y = <y_1, ..., y_ly> is generated, whose generation probability is defined as
P(y) = ∏_{t=1}^{ly} P(y_t | y_1, …, y_{t-1})
where P denotes conditional probability, t is the step variable taking values 1 to ly, y is the generated new tag sequence, y_t is the sequence content generated at step t, and ly is the length of the generated tag.
The tag generation process is modeled in the Seq2Seq framework with a unidirectional GRU decoder whose state is initialized with the final hidden state of the text encoder's GRU encoder; during generation the GRU decoder produces the hidden state s_t = GRU(s_{t-1}, u_t) ∈ R^d, where s_{t-1} is the hidden state preceding s_t and u_t is the encoder output at the t-th GRU step, which also serves as the input to the GRU decoder at step t.
The decoder input u_t, the GRU decoder hidden state s_t and the multi-modal fusion vector c_fuse are spliced by vector concatenation into the overall generated-tag sequence vector c_t, computed as:
c_t = [u_t; s_t; c_fuse]
The probability distribution P(y_t) of the sequence content y_t generated at each step t is predicted from c_t with the normalization function softmax: P(y_t) = softmax(MLP(c_t)),
where MLP is a multi-layer perceptron model; through the internal neural network computation of the multi-layer perceptron, the overall generated-tag sequence vector c_t yields new tags that do not exist in the data-set tag space.
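One decoding step of the GRU tag generator can be sketched as below, under the assumptions that u_t is the embedding of the previously generated tag token and that a small MLP produces the output distribution; the tag-vocabulary size is likewise assumed.

    import torch
    import torch.nn as nn

    class TagDecoderStep(nn.Module):
        # Single Seq2Seq decoding step: s_t = GRU(s_{t-1}, u_t), c_t = [u_t; s_t; c_fuse].
        def __init__(self, d=300, tag_vocab=5000):
            super().__init__()
            self.gru_cell = nn.GRUCell(d, d)
            self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.Tanh(), nn.Linear(d, tag_vocab))

        def forward(self, u_t, s_prev, c_fuse):
            # u_t: decoder input at step t; s_prev: s_{t-1}; c_fuse: fused multi-modal context
            s_t = self.gru_cell(u_t, s_prev)             # (batch, d)
            c_t = torch.cat([u_t, s_t, c_fuse], dim=-1)  # (batch, 3d)
            p_yt = torch.softmax(self.mlp(c_t), dim=-1)  # P(y_t) = softmax(MLP(c_t))
            return p_yt, s_t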
In the above technical solution, the specific method by which the user personalized analysis module 4 obtains the total influence vector of all randomly sampled user history posts on the current post whose tags are to be recommended is as follows:
First, an external storage unit is created to store the L randomly sampled user history posts, indexed by user id; the user's tagging habits are then learned from the user's historical posting records. Taking the context vector c_fuse as input, the semantic similarity between the current post and the history posts is measured, and the similarity between the semantic feature vectors of the multi-modal content of a user history post and of the current post is computed with the following formula:
r_i = tanh(c_fuse ⊙ c_fuse^i)
where c_fuse^i is the context vector of the i-th history post, ⊙ denotes element-wise multiplication, tanh is the function used to compute similarity, and r_i ∈ R^d is the relevance vector between the current post and the i-th history post. Connecting these vectors with a vector connection method yields the similarity matrix r = [r_1, r_2, …, r_L]. The influence weight of each history post on the current post is then computed with the softmax normalization function:
a_r = softmax(W_r r + b_r)
where W_r ∈ R^d is the attention weight matrix of the influence weights, b_r ∈ R is the attention bias of the influence weights, and a_r ∈ R^L is the weight vector containing the history-post weights. For the history posts in the storage unit, F_i denotes the tags in the i-th history post; each tag F_i is first embedded into the vector space f_i ∈ R^{d×M}, where M is the maximum tag length, d is the tag embedding dimension, f_i is a d×M matrix, R^d denotes the d-dimensional real vector space, R^L the L-dimensional one, and L is the size of the sampled user-history space. The total influence vector F of all selected history posts on the current post is computed by weighted summation:
F = Σ_{i=1}^{L} a_r^i f_i
The fusion vector of the multi-modal content c_fuse is then concatenated with the total influence vector F to obtain the personalized semantic vector q = [c_fuse; F].
The probability distribution over the tags in the personalized semantic vector q, which integrates the multi-modal content and the user's personalized preference content, is predicted with the normalization function softmax: p_q = softmax(MLP(q)), where p_q denotes this probability distribution and MLP is a multi-layer perceptron model.
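The history-weighting computation can be sketched as follows; representing each history post by its fused context vector and by a single pooled embedding of its tags (instead of the full d × M tag matrix) is a simplifying assumption of the example.

    import torch
    import torch.nn as nn

    class UserHistoryAttention(nn.Module):
        # Weights L sampled history posts against the current post and builds q = [c_fuse; F].
        def __init__(self, d=300):
            super().__init__()
            self.score = nn.Linear(d, 1)  # plays the role of W_r and b_r over relevance vectors

        def forward(self, c_fuse, history_c_fuse, history_tag_emb):
            # c_fuse: (batch, d); history_c_fuse, history_tag_emb: (batch, L, d)
            r = torch.tanh(c_fuse.unsqueeze(1) * history_c_fuse)       # r_i = tanh(c_fuse ⊙ c_fuse^i)
            a_r = torch.softmax(self.score(r).squeeze(-1), dim=-1)     # influence weights, (batch, L)
            F = torch.sum(a_r.unsqueeze(-1) * history_tag_emb, dim=1)  # weighted sum of tag embeddings
            q = torch.cat([c_fuse, F], dim=-1)                         # personalized semantic vector
            return q, a_r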
In the above technical solution, the specific method by which the training module 5 obtains the trained keyword generation module 3 and user personalized analysis module 4 is as follows:
The keyword generation module 3 and the user personalized analysis module 4 are trained with the standard negative log-likelihood loss as the training function:
L(θ) = − Σ_{n=1}^{N} [ Σ_{t=1}^{ly} log P(y_t^n) + γ · log p_q^n ]
where N is the number of posts in the training set of the data set, θ denotes the trainable parameters shared across the whole framework, and γ is the parameter balancing the two loss terms, namely the tag-generation term Σ_t log P(y_t^n) of module 3 and the personalized-recommendation term log p_q^n of module 4; n is a variable ranging from 1 to N, t denotes the t-th step of the GRU decoder, ly is the length of the generated tag sequence, P(y_t^n) is the probability with which module 3 generates a tag, and p_q^n is the probability with which the user personalization of module 4 influences the tag recommendation; L(θ) is the training objective over the trainable parameters of the whole framework. This loss function defines the training target of the keyword generation module 3 and the user personalized analysis module 4, and all trainable parameters in both modules are trained against it. During training, gradient clipping with a maximum gradient norm of 5 is applied and, whenever the validation loss of the model stops decreasing, the learning rate is decayed by a factor of 0.5; training of the keyword generation module 3 and the user personalized analysis module 4 is terminated by monitoring the change of the validation loss with an early-stopping strategy.
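A sketch of the joint negative log-likelihood is given below, assuming the generation head supplies per-step probabilities and the personalization head supplies one tag distribution per post; the exact way the two terms are combined and weighted by γ is an assumption consistent with, but not dictated by, the text above.

    import torch

    def joint_nll_loss(gen_probs, gen_targets, pq_probs, pq_targets, gamma=1.0, eps=1e-8):
        # gen_probs: (N, ly, vocab) step-wise P(y_t); gen_targets: (N, ly) gold tag tokens
        # pq_probs:  (N, num_tags) personalized distribution p_q; pq_targets: (N,) gold tag ids
        step_p = gen_probs.gather(-1, gen_targets.unsqueeze(-1)).squeeze(-1)  # P(y_t = gold)
        gen_loss = -torch.log(step_p + eps).sum(dim=1)                        # sum over t = 1..ly
        cls_p = pq_probs.gather(-1, pq_targets.unsqueeze(-1)).squeeze(-1)     # p_q(gold tag)
        cls_loss = -torch.log(cls_p + eps)
        return (gen_loss + gamma * cls_loss).mean()                           # average over the N posts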
In the above technical solution, the specific method by which the recommendation module 6 obtains the recommendation probability of each new tag that does not exist in the data-set tag space is as follows:
The trained keyword generation module 3 yields the generated new-tag vector c_t, and the trained user personalized analysis module 4 computes the total influence vector t_1 of all randomly sampled user history posts on the current post whose tags are to be recommended; the two vectors are combined by a concatenation operation into the joint vector q:
q = Concat(c_t, t_1)
where Concat denotes the vector concatenation operation. Once the joint vector is obtained, the problem can be treated as a multi-label classification problem, and the probability P of recommending each new tag that does not exist in the data-set tag space is computed with the softmax normalization function:
P = softmax(W_q q + b_q)
where W_q is the weight matrix for the vector q and b_q is the bias for the vector q. Finally, the top-K topic tags with the highest recommendation probability are recommended to the user.
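The joint-vector scoring step can be sketched as below; the dimensionality of c_t and t_1, the tag-vocabulary size and the top-K value are assumptions for the example.

    import torch
    import torch.nn as nn

    class Recommender(nn.Module):
        # Concatenates the generated-tag vector c_t and the history influence t_1, then scores tags.
        def __init__(self, d=300, num_tags=5000, top_k=5):
            super().__init__()
            self.out = nn.Linear(2 * d, num_tags)  # realises W_q q + b_q
            self.top_k = top_k

        def forward(self, c_t, t_1):
            q = torch.cat([c_t, t_1], dim=-1)         # q = Concat(c_t, t_1)
            P = torch.softmax(self.out(q), dim=-1)    # P = softmax(W_q q + b_q)
            return torch.topk(P, self.top_k, dim=-1)  # top-K topic tags with the highest probability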
A keyword recommendation method based on multi-modal content comprises the following steps:
Step 1: the feature extraction module 1 extracts features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags, and extracts features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures;
Step 2: the feature fusion module 2 fuses the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures using a multi-head attention mechanism to obtain a fusion vector of the multi-modal content comprising text, tags and pictures;
Step 3: the keyword generation module 3 applies a Seq2Seq framework to the fusion vector of the multi-modal content to generate new tags that do not exist in the data-set tag space, where the data-set tag space is the set of all tags in the given input post set;
Step 4: the user personalized analysis module 4 randomly samples L user history posts from the history post set of the user to be recommended, calculates the semantic similarity between the current post whose tags are to be recommended and the L user history posts with a vector-similarity method, applies a normalization function to these similarities to obtain the influence weight of each sampled history post on the current post, and then performs a weighted summation using these influence weights to obtain the total influence vector of all sampled user history posts on the current post;
Step 5: the training module 5 trains the trainable parameters of the keyword generation module 3 and the user personalized analysis module 4 with the standard negative log-likelihood loss as the training function to obtain the trained keyword generation module 3 and user personalized analysis module 4;
Step 6: the recommendation module 6 concatenates the new tags obtained by the trained keyword generation module 3, which do not exist in the data-set tag space, with the total influence vector of all randomly sampled user history posts on the current post obtained by the trained user personalized analysis module 4 to obtain a spliced joint vector, and applies a normalization function to the joint vector to obtain the recommendation probability of each new tag that does not exist in the data-set tag space.
Details not described in this specification are well known to those skilled in the art.

Claims (9)

1. A multi-modal content-based keyword recommendation system is characterized by comprising a feature extraction module (1), a feature fusion module (2), a keyword generation module (3), a user personalized analysis module (4), a training module (5) and a recommendation module (6);
the feature extraction module (1) is used for extracting features of texts and labels in an input post set given by a user to be recommended by using a bidirectional gated recurrent unit to obtain semantic feature vector matrixes of the texts and the labels, and extracting features of pictures in the given input post set by using a VGG-16 neural network to obtain the semantic feature vector matrixes of the pictures;
the feature fusion module (2) is used for fusing the semantic feature vector matrixes of the text and the labels with the semantic feature vector matrix of the picture by using a multi-head attention-based mechanism to obtain a fusion vector containing multi-mode content of the text, the labels and the picture;
the keyword generation module (3) is configured to generate a new tag that does not exist in a data set tag space by using a Seq2Seq framework for the fusion vector of the multimodal content, where the data set tag space is a set of all tags in a given input post set;
the user personalized analysis module (4) is used for randomly extracting L user historical posts from a historical post set of the user to be recommended, calculating the semantic similarity between the post of the current label to be recommended and the L user historical posts by using a vector similarity calculation method, applying a normalization function to the semantic similarities to obtain the influence weight of each randomly extracted user historical post on the post of the current label to be recommended, and then performing a weighted summation using these influence weights to obtain the total influence vector of all randomly extracted user historical posts on the post of the current label to be recommended;
the training module (5) is used for training all trainable vector parameters and matrix parameters of the keyword generation module (3) and the user personalized analysis module (4) by adopting a standard negative log-likelihood loss function as a training function to obtain the trained keyword generation module (3) and the user personalized analysis module (4);
and the recommending module (6) is used for vector splicing the new labels which are obtained by the trained keyword generating module (3) and do not exist in the data set label space with the total influence vectors of all randomly extracted user historical posts which are obtained by the trained user personalized analyzing module (4) on the posts of the current labels to be recommended to obtain spliced joint vectors, and the spliced joint vectors are subjected to normalization function to obtain the recommended probability of each new label which does not exist in the data set label space.
2. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the feature extraction module (1) extracts features from the text and tags in the input post set given by the user to be recommended with a bidirectional gated recurrent unit to obtain the semantic feature vector matrix of the text and tags is as follows:
first, a lookup table is pre-trained on the text and tag content of the input post set; each word x_i of the text and tags in the input post set is then embedded into a high-dimensional vector, giving a discrete vector representation e(x_i) of each word; finally, the discrete representation e(x_i) of each word is encoded with the bidirectional gated recurrent unit:
h_i^→ = GRU(e(x_i), h_{i-1}^→)
h_i^← = GRU(e(x_i), h_{i+1}^←)
where h_i^→ is the forward hidden state of word x_i, h_i^← is the backward hidden state of word x_i, GRU denotes the gated recurrent unit function, h_{i-1}^→ is the (i-1)-th forward hidden state, and h_{i+1}^← is the (i+1)-th backward hidden state;
the forward hidden state h_i^→ and the backward hidden state h_i^← of word x_i are concatenated into a context-aware representation vector of word x_i:
h_i = [h_i^→; h_i^←]
all context-aware representation vectors of the input post set are stored in a text vector library to obtain the semantic feature vector matrix of the text and tags M_text = {h_1, ..., h_lx} ∈ R^{lx×d}, where d is the dimension of the hidden state, lx is the number of words in the text, R denotes the set of real numbers, and h_lx is the context-aware representation vector of the lx-th word.
3. The multi-modal content-based keyword recommendation system of claim 1, wherein: the specific method by which the feature extraction module (1) extracts features from the pictures in the given input post set with a VGG-16 neural network to obtain the semantic feature vector matrix of the pictures is as follows: first, the input picture is resized to 224 × 224 and a 7 × 7 grid of convolutional feature maps is extracted from each picture; each feature-map cell is converted through a linear projection layer into a new picture vector v_i, each of which contains the visual information of a different region of the picture, giving the semantic feature vector matrix of the picture M_image = {v_1, ..., v_lv} ∈ R^{lv×d}, where d is the dimension of the hidden state, lv is the number of picture regions, and v_lv is the picture vector of the lv-th picture region.
4. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the feature fusion module (2) fuses the semantic feature vector matrix of the text and tags with the semantic feature vector matrix of the pictures using a multi-head attention mechanism is as follows:
in the multi-head attention mechanism, the text semantic feature vectors are first used as Query to guide attention over the pictures, and the picture features are then used as Query to guide attention over the text; the scaled dot-product attention A is executed for each co-attention operation pair {Query, Key, Value}:
head_h = A(Q W_h^Q, K W_h^K, V W_h^V),  A(Q, K, V) = softmax(Q K^T / √d_k) V
AM(Q, K, V) = [head_1; …; head_H] W^O
where Query (Q) denotes the query in the multi-head attention mechanism, Key (K) the key, Value (V) the value, A the attention operation, softmax the normalization function, d_k the dimension of the key, AM the concatenation of the outputs of all attention operations A, and W^O the overall weight matrix; W_h^Q, W_h^K and W_h^V are the weight matrices corresponding to Query, Key and Value, projecting them from the d-dimensional space into a lower d_H-dimensional space, where H is the number of heads used in the model; the outputs of all heads are concatenated into AM and passed to a feed-forward network with residual connections and layer normalization; multiple co-attention layers can be stacked to enhance the modeling capability of the model, and the outputs of all attention layers are then combined by a linear multi-modal fusion layer to obtain the context vector c_fuse ∈ R^d, which is the fusion vector of the multi-modal content comprising text, tags and pictures.
5. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the keyword generation module (3) applies the Seq2Seq framework to the fusion vector of the multi-modal content to generate new tags that do not exist in the data-set tag space is as follows: under the Seq2Seq framework a tag sequence y = <y_1, ..., y_ly> is generated, whose generation probability is defined as
P(y) = ∏_{t=1}^{ly} P(y_t | y_1, …, y_{t-1})
where P denotes conditional probability, t is the step variable taking values 1 to ly, y is the generated new tag sequence, y_t is the sequence content generated at step t, and ly is the length of the generated tag;
the tag generation process is modeled in the Seq2Seq framework with a unidirectional GRU decoder whose state is initialized with the final hidden state of the text encoder's GRU encoder; during generation the GRU decoder produces the hidden state s_t = GRU(s_{t-1}, u_t) ∈ R^d, where s_{t-1} is the hidden state preceding s_t and u_t is the encoder output at the t-th GRU step, which also serves as the input to the GRU decoder at step t;
the decoder input u_t, the GRU decoder hidden state s_t and the multi-modal fusion vector c_fuse are spliced by vector concatenation into the overall generated-tag sequence vector c_t, computed as:
c_t = [u_t; s_t; c_fuse]
the probability distribution P(y_t) of the sequence content y_t generated at each step t is predicted from c_t with the normalization function softmax: P(y_t) = softmax(MLP(c_t)),
where MLP is a multi-layer perceptron model; through the internal neural network computation of the multi-layer perceptron, the overall generated-tag sequence vector c_t yields new tags that do not exist in the data-set tag space.
6. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the user personalized analysis module (4) obtains the total influence vector of all randomly sampled user history posts on the current post whose tags are to be recommended is as follows:
first, an external storage unit is created to store the L randomly sampled user history posts, indexed by user id; the user's tagging habits are then learned from the user's historical posting records; taking the context vector c_fuse as input, the semantic similarity between the current post and the history posts is measured, and the similarity between the semantic feature vectors of the multi-modal content of a user history post and of the current post is computed with the following formula:
r_i = tanh(c_fuse ⊙ c_fuse^i)
where c_fuse^i is the context vector of the i-th history post, ⊙ denotes element-wise multiplication, tanh is the function used to compute similarity, and r_i ∈ R^d is the relevance vector between the current post and the i-th history post; connecting these vectors with a vector connection method yields the similarity matrix r = [r_1, r_2, …, r_L]; the influence weight of each history post on the current post is then computed with the softmax normalization function:
a_r = softmax(W_r r + b_r)
where W_r ∈ R^d is the attention weight matrix of the influence weights, b_r ∈ R is the attention bias of the influence weights, and a_r ∈ R^L is the weight vector containing the history-post weights; for the history posts in the storage unit, F_i denotes the tags in the i-th history post; each tag F_i is first embedded into the vector space f_i ∈ R^{d×M}, where M is the maximum tag length, d is the tag embedding dimension, f_i is a d×M matrix, R^d denotes the d-dimensional real vector space, R^L the L-dimensional one, and L is the size of the sampled user-history space; the total influence vector F of all selected history posts on the current post is computed by weighted summation:
F = Σ_{i=1}^{L} a_r^i f_i
the fusion vector of the multi-modal content c_fuse is then concatenated with the total influence vector F to obtain the personalized semantic vector q = [c_fuse; F];
the probability distribution over the tags in the personalized semantic vector q, which integrates the multi-modal content and the user's personalized preference content, is predicted with the normalization function softmax: p_q = softmax(MLP(q)), where p_q denotes this probability distribution and MLP is a multi-layer perceptron model.
7. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the training module (5) obtains the trained keyword generation module (3) and user personalized analysis module (4) is as follows:
the keyword generation module (3) and the user personalized analysis module (4) are trained with the standard negative log-likelihood loss as the training function:
L(θ) = − Σ_{n=1}^{N} [ Σ_{t=1}^{ly} log P(y_t^n) + γ · log p_q^n ]
where N is the number of posts in the training set of the data set, θ denotes the trainable parameters shared across the whole framework, and γ is the parameter balancing the two loss terms, namely the tag-generation term Σ_t log P(y_t^n) of module (3) and the personalized-recommendation term log p_q^n of module (4); n is a variable ranging from 1 to N, t denotes the t-th step of the GRU decoder, ly is the length of the generated tag sequence, P(y_t^n) is the probability with which module (3) generates a tag, and p_q^n is the probability with which the user personalization of module (4) influences the tag recommendation; L(θ) is the training objective over the trainable parameters of the whole framework; this loss function defines the training target of the keyword generation module (3) and the user personalized analysis module (4), and all trainable parameters in both modules are trained against it; during training, gradient clipping with a maximum gradient norm of 5 is applied and, whenever the validation loss of the model stops decreasing, the learning rate is decayed by a factor of 0.5; training of the keyword generation module (3) and the user personalized analysis module (4) is terminated by monitoring the change of the validation loss with an early-stopping strategy.
8. The multimodal content based keyword recommendation system of claim 1, wherein: the specific method by which the recommendation module (6) obtains the recommendation probability of each new tag that does not exist in the data-set tag space is as follows:
the trained keyword generation module (3) yields the generated new-tag vector c_t, and the trained user personalized analysis module (4) computes the total influence vector t_1 of all randomly sampled user history posts on the current post whose tags are to be recommended; the two vectors are combined by a concatenation operation into the joint vector q:
q = Concat(c_t, t_1)
where Concat denotes the vector concatenation operation; once the joint vector is obtained, the problem can be treated as a multi-label classification problem, and the probability P of recommending each new tag that does not exist in the data-set tag space is computed with the softmax normalization function:
P = softmax(W_q q + b_q)
where W_q is the weight matrix for the vector q and b_q is the bias for the vector q.
9. A keyword recommendation method based on multi-modal content is characterized by comprising the following steps:
step 1: the method comprises the following steps that a feature extraction module (1) performs feature extraction on a text and a label in an input post set given by a user to be recommended by using a bidirectional gating circulation unit to obtain a semantic feature vector matrix of the text and the label, and performs feature extraction on a picture in the given input post set by using a VGG-16 neural network to obtain the semantic feature vector matrix of the picture;
and 2, step: the feature fusion module (2) fuses the semantic feature vector matrixes of the text and the labels and the semantic feature vector matrix of the picture by using a multi-head attention-based mechanism to obtain a fusion vector of multi-mode content comprising the text, the labels and the picture;
and step 3: a keyword generation module (3) adopts a Seq2Seq framework for the fusion vector of the multi-modal content to generate a new label which does not exist in a data set label space, wherein the data set label space is a set of all labels in a given input post set;
and 4, step 4: the user individuation analysis module (4) randomly extracts L user historical posts from a historical post set of a user to be recommended, calculates the semantic similarity between the posts of the current label to be recommended and the L user historical posts by using a vector similarity calculation method, performs normalization function calculation on the semantic similarity to obtain the influence weight of each randomly extracted user historical post on the posts of the current label to be recommended, and then performs weighted summation on the influence weight of each randomly extracted user historical post on the posts of the current label to be recommended to obtain the total influence vector of all randomly extracted user historical posts on the posts of the current label to be recommended;
and 5: the training module (5) trains trainable parameters of the keyword generation module (3) and the user personalized analysis module (4) by adopting a standard negative log-likelihood loss function as a training function to obtain the trained keyword generation module (3) and the user personalized analysis module (4);
step 6: and the recommending module (6) performs vector splicing on the new labels which are obtained by the trained keyword generating module (3) and do not exist in the data set label space and the total influence vector of all randomly extracted user historical posts obtained by the trained user personalized analysis module (4) on the posts of the current to-be-recommended labels to obtain spliced joint vectors, and obtains the recommended probability of each new label which does not exist in the data set label space by using a normalization function on the spliced joint vectors.
CN202210088492.9A 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content Pending CN114491258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210088492.9A CN114491258A (en) 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210088492.9A CN114491258A (en) 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content

Publications (1)

Publication Number Publication Date
CN114491258A true CN114491258A (en) 2022-05-13

Family

ID=81474209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210088492.9A Pending CN114491258A (en) 2022-01-25 2022-01-25 Keyword recommendation system and method based on multi-modal content

Country Status (1)

Country Link
CN (1) CN114491258A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969534A (en) * 2022-06-04 2022-08-30 哈尔滨理工大学 Mobile crowd sensing task recommendation method fusing multi-modal data features
CN115203471A (en) * 2022-09-15 2022-10-18 山东宝盛鑫信息科技有限公司 Attention mechanism-based multimode fusion video recommendation method
CN115203471B (en) * 2022-09-15 2022-11-18 山东宝盛鑫信息科技有限公司 Attention mechanism-based multimode fusion video recommendation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination