CN115952307A - Recommendation method based on multimodal graph contrast learning, electronic device and storage medium - Google Patents
Recommendation method based on multimodal graph contrast learning, electronic device and storage medium Download PDFInfo
- Publication number
- CN115952307A CN115952307A CN202211742093.6A CN202211742093A CN115952307A CN 115952307 A CN115952307 A CN 115952307A CN 202211742093 A CN202211742093 A CN 202211742093A CN 115952307 A CN115952307 A CN 115952307A
- Authority
- CN
- China
- Prior art keywords
- user
- modal
- item
- layer
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Image Analysis (AREA)
- Editing Of Facsimile Originals (AREA)
Abstract
The invention discloses a recommendation method based on multimodal graph contrast learning, which comprises the following steps: 1. data acquisition and pretreatment; 2. a graph volume layer; 3. constructing a comparison learning layer; 4. constructing a loss function; 5. and training the graph comparison learning model. When the method is used for processing the recommendation task of the multi-modal data, the representation of the user and the article can be enhanced through the separated graph learning mode and the contrast learning, and the problem of multi-modal noise pollution is relieved.
Description
Technical Field
The invention relates to a multi-modal graph contrast learning-based multimedia recommendation method, electronic equipment and a storage medium, and belongs to the field of recommendation systems.
Background
Multimedia-based recommendations are a challenging task that requires not only learning collaboration signals from user-item interactions, but also capturing modality-specific user cues of interest from complex multimedia content. Despite significant advances in current solutions for multimedia-based recommendation algorithms, they are still limited by multi-modal noise pollution. In particular, a substantial portion of the multimedia content of the item is independent of user preferences such as background, overall layout, image brightness, word order in the title, and semantic-free words. In addition, most recent studies are performed by image learning. This means that as the message propagates into the user and item representations, the polluting effects will be further amplified.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal graph contrast learning-based multimedia recommendation method, electronic equipment and a storage medium, so that the problem of multi-modal noise pollution is relieved when a recommendation task of multi-modal data is processed, and the representation of a user and an article is enhanced through a separated graph learning mode and contrast learning, so that the recommendation accuracy and precision can be improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a multimedia recommendation method based on multimodal graph contrast learning, which is characterized by comprising the following steps of:
step 1, data acquisition and pretreatment;
step 1.1, building a project set of commodities, and marking as I = { I = { (I) } 1 ,i 2 ,…,in,…i |I| In, where in represents the nth item, | I | represents the total number of items;
constructing a user set, and recording as U = { U = 1 ,u 2 ,…,um,…,u |U| H, wherein um represents the mth user; | U | represents the total number of users;
constructing user item bipartite graphsWherein it is present>Represents the mth user u m And the nth item in, if so, then let @>Otherwise, make->
Respectively mapping the m < th > user um and the n < th > item in into user embeddingAnd item embedding->The m-th user um is respectively corresponding to embedding vectors in an image modality V and a text modality T and is->And &>
Step 1.2, depth feature extraction:
the image v corresponding to the nth commodity item in n Inputting the data into a pre-trained VGG16 model for processing to obtain image characteristicsThe image feature matrix ^ of the image modality V is thus constructed using equation (1)>dV is the dimension of the image feature:
corresponding text t of the nth commodity item in n Inputting the text into a pre-trained Sennce 2Vec model for processing to obtain text characteristicsThe text feature matrix ≥ of the text modality T is thus constructed using equation (2)>dT is the dimension of the text feature:
step 2, constructing a multimodal map contrast learning model, comprising the following steps: a graph volume layer, a comparison learning layer and a prediction layer;
step 2.1, processing the graph volume layer:
step 2.2.1, respectively obtaining the embedding of the mth user um and the nth item in the ith layer graph convolution layer by using the formulas (3) and (4):
in the formulae (3) and (4),and &>Respectively representing the neighbor sets of the mth user um and the nth item in,and &>Respectively representing the neighbor number of the mth user um and the neighbor number of the nth item in; />Is the embedding of the nth item in the l-1 th map convolutional layer, and makes ^ er/be when l =1>Is the embedding of the mth user um in the l-1 th map convolutional layer, and makes ÷ greater than or equal to 1>
Step 2.2.2, respectively inIn the image mode V or the text mode T, the mth user u is obtained through the formulas (5) and (6) respectively m And embedding of the nth item in into the first layer graph convolution layer in the multi-modal model
In the formulas (5) and (6), alpha is a hyper-parameter, and TR represents transposition; modal represents a multi-modal, and modal = V or T, is a weight transformation matrix for a multimodal modal, d moda l is the dimension of the multimodal modal feature, d is the embedding size; />Characteristic of the multimodal modal, representing the nth item, is @>Represents the embedding of the mth user um in the l-1 th level map convolutional layer under the multimodal modal, and-> Represents the mth user u m An embedded vector at level l-1 in the image modality V>Represents the mth user u m Embedding vectors in the l-1 level of the text modality T let ^ er when l =1> Indicating that the nth item in is embedded into the graph convolution layer of the l-1 layer under the multi-mode modal; and-> An image feature representing the nth item in->A text feature representing the nth item in, when l =1, causes ≧>
Step 2.2.3, obtaining the mth user u by using the formula (7) and the formula (8) m And embedding of the nth item in into the l +1 th graph convolution layer under the multi-modal model/>
Step 2.2.4, processing is carried out according to the process from step 2.2.2 to step 2.2.3, so that the characteristics of the mth user um are output by the L-th layerCharacteristic ^ s of mth user um under multimodal modal>Characteristic ^ of the nth item in>
Step 2.3, processing of the comparison learning layer:
In the formula (9), the reaction mixture is,representing the characteristics of the jth user uj in a multi-modal model at the L level, wherein tau is a hyper-parameter; step 2.3.2 construction of the term comparison loss function ^ by equation (10)>
In the formula (10), the compound represented by the formula (10),representing the characteristics of the kth user ik in a multimodal modal;
Step 2.4, processing the prediction layer:
In the formula (12), λ is a hyperparameter;
and 3, constructing a loss function of the multimodal map contrast learning model:
In the formulae (13) to (15),is training data, ix denotes the xth item, <' > is>Representing the neighbor set of the mth user um, wherein sigma is a sigmoid function;
step 4, training the multi-modal graph contrast learning model by utilizing a gradient descent method based on the training data O, and calculating a total loss functionWhen the training iteration times reach the set times or the loss error is smaller than the set threshold value, the training is stopped, so that an optimal multi-modal graph contrast learning model is obtained and is used for judging the image characteristic matrix of the image modality>Text feature matrix of text modality>User embedding &>Item embedding +>Dense vector representation->And &>And processing and outputting the score of each user for each item, thereby selecting top items and recommending each user.
The invention relates to an electronic device, comprising a memory and a processor, characterized in that the memory is used for storing programs for supporting the processor to execute the multimedia recommendation method, and the processor is configured to execute the programs stored in the memory.
The present invention is a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when being executed by a processor, performs the steps of the multimedia recommendation method.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention constructs an embedded graph volume network module which is specially used for spreading users and items, and is specially used for spreading the embedding of the users and the items, thereby lightening the problem of multi-mode noise pollution.
2. The present invention enhances the representation of users and items through separate graph learning modes and contrast learning to better capture the collaborative signal and multimodal preferences and attenuate the effects of multimodal noise.
Drawings
FIG. 1 is a schematic diagram of the recommendation method based on multi-modal graph contrast learning according to the present invention.
Detailed Description
In this embodiment, a recommendation method based on multimodal graph contrast learning is to construct a graph convolution module to capture a collaborative signal and multimodal user preferences, then eliminate noise pollution in modeling of the multimodal user preferences by adopting contrast learning, and finally, in order to ensure sufficient learning of a model, in this embodiment, an alternating training strategy is used to optimize the collaborative signal and the multimodal user preferences, specifically, as shown in fig. 1, the following steps are performed:
step 1, data acquisition and pretreatment;
step 1.1, building a project set of commodities, and marking as I = { I = { (I) } 1 ,i 2 ,…,in,…i |I| In, where in represents the nth item, | I | represents the total number of items;
constructing a user set, and recording the user set as U = { U = { (U) } 1 ,u 2 ,…,um,…,u |U| H, wherein um represents the mth user; | U | represents the total number of users;
constructing user item interaction graph using implicit feedback data in datasetWherein +>Indicates whether an interaction exists between the mth user um and the nth item in, and if so, causes &>Otherwise, make +>
Respectively mapping the m < th > user um and the n < th > item in into user embeddingAnd item embedding->The m-th user um is respectively corresponding to embedding vectors in an image modality V and a text modality T and is->And &>
Step 1.2, depth feature extraction:
the image v corresponding to the nth commodity item in n Inputting the data into a pre-trained VGG16 model for processing to obtain image characteristicsIn order to construct an image feature matrix { (1) } of the image modality V>dV is the dimension of an image feature:
Corresponding text t of the nth commodity item in n Inputting the text into a pre-trained Sennce 2Vec model for processing to obtain text characteristicsIn order to construct a text feature matrix { (2) } of the text modality T>dT is the dimension of the text feature:
step 2, constructing a multimodal map contrast learning model, comprising the following steps: a graph volume layer, a comparison learning layer and a prediction layer;
step 2.1, processing the graph volume layer:
step 2.2.1, in order to model clean high-order collaborative signals, the invention does not incorporate multimodal features into the user item interaction graphAnd executing the graph convolution operation. The embedding of the mth user um and the nth item in the l layer graph convolution layer is obtained by using the formula (3) and the formula (4), respectively:
in the formulae (3) and (4),and &>Respectively representing the neighbor sets of the mth user um and the nth item in,and &>Respectively representing the neighbor number of the mth user um and the neighbor number of the nth item in; />Is the embedding of the nth item in the convolution layer of the l-1 layer graph, and makes ^ greater or less than 1 when l =1> Is the embedding of the mth user um in the graph convolution layer of the l-1 layer, and makes ^ greater or less than 1 when l =1>
Step 2.2.2, respectively obtaining the mth user u through the formula (5) and the formula (6) under the image mode V or the text mode T m And embedding of the nth item in into the first layer graph convolution layer in the multi-modal modelTo incorporate multimodal information of historical interactions into the node representation:
in the formulas (5) and (6), alpha is a hyper-parameter, and TR represents transposition; modal represents a multi-modal, and modal = V or T, is a weight transformation matrix for a multimodal modal, d moda l is the dimension of the multimodal modal feature, d is the embedding size; />Characteristic of the multimodal modal, representing the nth item, is @>Represents the embedding of the mth user um in the l-1 th level map convolutional layer under the multimodal modal, and-> Represents the mth user u m An embedded vector at level l-1 in the image modality V>Represents the mth user u m Embedding vectors in the l-1 level of the text modality T let ^ er when l =1> Represents the embedding of the nth item in into the graph volume layer of the layer l-1 under the multi-mode modal; and-> An image feature representing the nth item in->A text feature representing the nth item in, when l =1, causes ≧>
Step 2.2.3, obtaining the mth user u by using the formula (7) and the formula (8) m And embedding of the nth item in into the l +1 th graph convolution layer under the multi-modal model
Step 2.2.4, processing according to the process from step 2.2.2 to step 2.2.3, thereby outputting the characteristics of the mth user um by the L-th layerCharacteristic @ of mth user um under multimodal modal>Characteristic ^ of the nth item in>
Step 2.3, processing of the comparison learning layer:
In the formula (9), the reaction mixture is,representing the characteristics of the jth user uj in a multi-modal model at the L level, wherein tau is a hyper-parameter; step 2.3.2 construction of the term comparison loss function ^ by the equation (10)>
In the formula (10), the reaction mixture is,representing the characteristics of the kth user ik in a multimodal modal;
step 2.3.3, in order to combine the node characteristics of users and projects in visual and text modes, constructing a contrast loss function by the formula (11)
Step 2.4, processing the prediction layer:
In the formula (12), λ is a hyperparameter;
and 3, in order to optimize the multi-modal graph contrast learning model, updating the representations of the user and the items by adopting a widely used Bayesian personalized ranking loss as a basic optimization target, wherein the Bayesian personalized ranking loss assumes that the user prefers the history interactive items rather than the untouched items. Constructing a loss function of the multi-modal graph learning model:
In the formulae (13) to (15),is training data, ix denotes the xth item, <' > is>Representing a neighbor set of the mth user um, wherein sigma is a sigmoid function;
step 4, training the multi-modal graph contrast learning model by utilizing a gradient descent method based on the training data O, and calculating a total loss functionWhen the training iteration times reach the set times or the loss error is smaller than the set threshold value, the training is stopped, so that an optimal multi-modal graph contrast learning model is obtained and is used for judging the image characteristic matrix of the image modality>Text feature matrix of text modality>User embedding->Item embedding->Dense vector representation +>And &>And processing and outputting the score of each user for each item, thereby selecting top items and recommending each user.
In this embodiment, an electronic device includes a memory for storing a program that supports a processor to execute the multimedia recommendation method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program, and the computer program is executed by a processor to perform the steps of the multimedia recommendation method.
Claims (3)
1. A multi-modal graph contrast learning-based multimedia recommendation method is characterized by comprising the following steps of:
step 1, data acquisition and pretreatment;
step 1.1, building a project set of commodities, and marking as I = { I = { (I) } 1 ,i 2 ,…,i n ,…i |I| In which i n Represents the nth item, | I | represents the total number of items;
constructing a user set, and recording the user set as U = { U = { (U) } 1 ,u 2 ,…,u m ,…,u |U| In which u m Represents the mth user; | U | represents the total number of users;
constructing a user item bipartite graphWherein it is present>Represents the mth user u m And the nth item i n Whether there is an interaction between them or not, if yes, make->Otherwise, make +>
The mth user u m And the nth item i n Mapping separately to user embeddingAnd item embedding->Mth user u m In image modalities V and text, respectivelyThe corresponding embedded vector in the present mode T is->And &>
Step 1.2, depth feature extraction:
the nth commodity item i n Corresponding image v n Inputting the data into a pre-trained VGG16 model for processing to obtain image characteristicsThe image feature matrix ^ of the image modality V is thus constructed using equation (1)>d V Is the dimension of the image feature:
the nth commodity item i n Corresponding text t n Inputting the text into a pre-trained Sennce 2Vec model for processing to obtain text characteristicsThe text feature matrix ≥ of the text modality T is thus constructed using equation (2)>d T Is the dimension of the text feature:
step 2, constructing a multimodal graph comparison learning model, which comprises the following steps: a graph volume layer, a comparison learning layer and a prediction layer;
step 2.1, processing the graph convolution layer:
step 2.2.1, respectively obtaining the mth user u by using the formula (3) and the formula (4) m And the nth item i n Embedding of the convolutional layer in the l-th layer:
in the formulae (3) and (4),and &>Respectively represent the m-th user u m And the nth item i n Is selected, based on the number of neighbor sets in the neighbor set, is greater than>Andrespectively represent the m-th user u m And the nth item i n The number of neighbors of (a); />Is the nth item i n Embedding in the convolution layer of the l-1 st layer, when l =1, make ^ greater or lesser than> Is the mth user u m Embedding in the convolution layer of the l-1 st layer, when l =1, make ^ greater or lesser than>
Step 2.2.2, respectively obtaining the mth user u through the formula (5) and the formula (6) under the image mode V or the text mode T m And the nth item i n Embedding of layer I graph convolution layer in multimodal modal/>
In the formulas (5) and (6), alpha is a hyper-parameter, and TR represents transposition; modal represents a multi-modal, and modal = V or T, is a weight transformation matrix for a multimodal modal, d modal Is the dimension of the multi-modal feature, d is the embedding size; />Characteristic of the multimodal modal, representing the nth item, is @>Represents the mth user u m Embedding of layer l-1 map convolutional layer under multimodal modal, and ^ h> Represents the mth user u m An embedded vector at level l-1 in the image modality V>Represents the mth user u m Embedding vectors in the l-1 level of the text modality T let ^ er when l =1> Represents the nth item i n Embedding a layer l-1 graph convolution layer under a multi-modal model; and-> Represents the nth item i n The characteristics of the image of (a) are,represents the nth item i n When l =1, let £ be £ £ 5>
Step 2.2.3, obtaining the mth user u by using the formula (7) and the formula (8) m And the nth item i n Embedding of layer 1 + graph convolution layer in multimodal modal
Step 2.2.4, processing according to the process from step 2.2.2 to step 2.2.3, thereby outputting the mth user u by the L layer m Is characterized byMth user u m Feature @ under a multimodal modal>The nth item i n Is characterized by>
Step 2.3, processing of the comparison learning layer:
In the formula (9), the reaction mixture is,represents the jth user u j Features under multimodal modal at level L, τ is a hyper-parameter;
In the formula (10), the reaction mixture is,represents the k-th user i k Features under a multimodal modal;
Step 2.4, processing the prediction layer:
In the formula (12), λ is a hyperparameter;
and 3, constructing a loss function of the multimodal map contrast learning model:
And 3.3, constructing a total loss function L by using the formula (15):
in the formulae (13) to (15),is training data, i represents the xth item,represents the mth user u m σ is a sigmoid function;
step 4, training the multi-modal graph contrast learning model by utilizing a gradient descent method based on the training data O, and calculating a total loss functionWhen the training iteration times reach the set times or the loss error is smaller than the set threshold value, the training is stopped, so that an optimal multi-modal graph contrast learning model is obtained and is used for judging the image characteristic matrix of the image modality>Text feature matrix of text modality>User embedding &>Item embedding +>Dense vector representation +>And &>And processing and outputting the score of each user for each item, thereby selecting top items and recommending each user.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that enables the processor to perform the multimedia recommendation method of claim 1, and the processor is configured to execute the program stored in the memory.
3. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the multimedia recommendation method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211742093.6A CN115952307A (en) | 2022-12-30 | 2022-12-30 | Recommendation method based on multimodal graph contrast learning, electronic device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211742093.6A CN115952307A (en) | 2022-12-30 | 2022-12-30 | Recommendation method based on multimodal graph contrast learning, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115952307A true CN115952307A (en) | 2023-04-11 |
Family
ID=87285822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211742093.6A Pending CN115952307A (en) | 2022-12-30 | 2022-12-30 | Recommendation method based on multimodal graph contrast learning, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115952307A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116932887A (en) * | 2023-06-07 | 2023-10-24 | 哈尔滨工业大学(威海) | Image recommendation system and method based on multi-modal image convolution |
CN117786234A (en) * | 2024-02-28 | 2024-03-29 | 云南师范大学 | Multimode resource recommendation method based on two-stage comparison learning |
-
2022
- 2022-12-30 CN CN202211742093.6A patent/CN115952307A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116932887A (en) * | 2023-06-07 | 2023-10-24 | 哈尔滨工业大学(威海) | Image recommendation system and method based on multi-modal image convolution |
CN117786234A (en) * | 2024-02-28 | 2024-03-29 | 云南师范大学 | Multimode resource recommendation method based on two-stage comparison learning |
CN117786234B (en) * | 2024-02-28 | 2024-04-26 | 云南师范大学 | Multimode resource recommendation method based on two-stage comparison learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11314806B2 (en) | Method for making music recommendations and related computing device, and medium thereof | |
US11593612B2 (en) | Intelligent image captioning | |
CN107836000B (en) | Improved artificial neural network method and electronic device for language modeling and prediction | |
CN108509573B (en) | Book recommendation method and system based on matrix decomposition collaborative filtering algorithm | |
US10489688B2 (en) | Personalized digital image aesthetics in a digital medium environment | |
CN106776673B (en) | Multimedia document summarization | |
CN107273438B (en) | Recommendation method, device, equipment and storage medium | |
CN111339415B (en) | Click rate prediction method and device based on multi-interactive attention network | |
CN111444320B (en) | Text retrieval method and device, computer equipment and storage medium | |
CN110362723B (en) | Topic feature representation method, device and storage medium | |
CN115952307A (en) | Recommendation method based on multimodal graph contrast learning, electronic device and storage medium | |
CN110046221A (en) | A kind of machine dialogue method, device, computer equipment and storage medium | |
KR20160144384A (en) | Context-sensitive search using a deep learning model | |
US20230316379A1 (en) | Deep learning based visual compatibility prediction for bundle recommendations | |
CN106708929B (en) | Video program searching method and device | |
CN111309878B (en) | Search type question-answering method, model training method, server and storage medium | |
CN114358203A (en) | Training method and device for image description sentence generation module and electronic equipment | |
KR20190075277A (en) | Method for searching content and electronic device thereof | |
CN111985548A (en) | Label-guided cross-modal deep hashing method | |
CN115455228A (en) | Multi-mode data mutual detection method, device, equipment and readable storage medium | |
CN106570196B (en) | Video program searching method and device | |
CN114298783A (en) | Commodity recommendation method and system based on matrix decomposition and fusion of user social information | |
CN112069404A (en) | Commodity information display method, device, equipment and storage medium | |
CN116186301A (en) | Multi-mode hierarchical graph-based multimedia recommendation method, electronic equipment and storage medium | |
CN112966513B (en) | Method and apparatus for entity linking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |