CN115952307A

CN115952307A - Recommendation method based on multimodal graph contrast learning, electronic device and storage medium

Info

Publication number: CN115952307A
Application number: CN202211742093.6A
Authority: CN
Inventors: 薛峰; 桑胜; 张研; 徐江凤; 叶向晖
Original assignee: Anhui Aisino Corp; Hefei University of Technology
Current assignee: Anhui Aisino Corp; Hefei University of Technology
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-04-11

Abstract

The invention discloses a recommendation method based on multimodal graph contrast learning, which comprises the following steps: 1. data acquisition and pretreatment; 2. a graph volume layer; 3. constructing a comparison learning layer; 4. constructing a loss function; 5. and training the graph comparison learning model. When the method is used for processing the recommendation task of the multi-modal data, the representation of the user and the article can be enhanced through the separated graph learning mode and the contrast learning, and the problem of multi-modal noise pollution is relieved.

Description

Recommendation method based on multimodal graph contrast learning, electronic device and storage medium

Technical Field

The invention relates to a multi-modal graph contrast learning-based multimedia recommendation method, electronic equipment and a storage medium, and belongs to the field of recommendation systems.

Background

Multimedia-based recommendations are a challenging task that requires not only learning collaboration signals from user-item interactions, but also capturing modality-specific user cues of interest from complex multimedia content. Despite significant advances in current solutions for multimedia-based recommendation algorithms, they are still limited by multi-modal noise pollution. In particular, a substantial portion of the multimedia content of the item is independent of user preferences such as background, overall layout, image brightness, word order in the title, and semantic-free words. In addition, most recent studies are performed by image learning. This means that as the message propagates into the user and item representations, the polluting effects will be further amplified.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a multi-modal graph contrast learning-based multimedia recommendation method, electronic equipment and a storage medium, so that the problem of multi-modal noise pollution is relieved when a recommendation task of multi-modal data is processed, and the representation of a user and an article is enhanced through a separated graph learning mode and contrast learning, so that the recommendation accuracy and precision can be improved.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention discloses a multimedia recommendation method based on multimodal graph contrast learning, which is characterized by comprising the following steps of:

step 1, data acquisition and pretreatment;

step 1.1, building a project set of commodities, and marking as I = { I = { (I) } ₁ ,i ₂ ,…,in,…i _|I| In, where in represents the nth item, | I | represents the total number of items;

constructing a user set, and recording as U = { U = ₁ ,u ₂ ,…,um,…,u _|U| H, wherein um represents the mth user; | U | represents the total number of users;

constructing user item bipartite graphs

Wherein it is present>

Represents the mth user u _m And the nth item in, if so, then let @>

Otherwise, make->

Respectively mapping the m < th > user um and the n < th > item in into user embedding

And item embedding->

The m-th user um is respectively corresponding to embedding vectors in an image modality V and a text modality T and is->

And &>

Step 1.2, depth feature extraction:

the image v corresponding to the nth commodity item in _n Inputting the data into a pre-trained VGG16 model for processing to obtain image characteristics

The image feature matrix ^ of the image modality V is thus constructed using equation (1)>

dV is the dimension of the image feature:

corresponding text t of the nth commodity item in _n Inputting the text into a pre-trained Sennce 2Vec model for processing to obtain text characteristics

The text feature matrix ≥ of the text modality T is thus constructed using equation (2)>

dT is the dimension of the text feature:

step 2, constructing a multimodal map contrast learning model, comprising the following steps: a graph volume layer, a comparison learning layer and a prediction layer;

step 2.1, processing the graph volume layer:

step 2.2.1, respectively obtaining the embedding of the mth user um and the nth item in the ith layer graph convolution layer by using the formulas (3) and (4):

in the formulae (3) and (4),

and &>

Respectively representing the neighbor sets of the mth user um and the nth item in,

and &>

Respectively representing the neighbor number of the mth user um and the neighbor number of the nth item in; />

Is the embedding of the nth item in the l-1 th map convolutional layer, and makes ^ er/be when l =1>

Is the embedding of the mth user um in the l-1 th map convolutional layer, and makes ÷ greater than or equal to 1>

Step 2.2.2, respectively inIn the image mode V or the text mode T, the mth user u is obtained through the formulas (5) and (6) respectively _m And embedding of the nth item in into the first layer graph convolution layer in the multi-modal model

In the formulas (5) and (6), alpha is a hyper-parameter, and TR represents transposition; modal represents a multi-modal, and modal = V or T,

is a weight transformation matrix for a multimodal modal, d _moda l is the dimension of the multimodal modal feature, d is the embedding size; />

Characteristic of the multimodal modal, representing the nth item, is @>

Represents the embedding of the mth user um in the l-1 th level map convolutional layer under the multimodal modal, and->

Represents the mth user u _m An embedded vector at level l-1 in the image modality V>

Represents the mth user u _m Embedding vectors in the l-1 level of the text modality T let ^ er when l =1>

Indicating that the nth item in is embedded into the graph convolution layer of the l-1 layer under the multi-mode modal; and->

An image feature representing the nth item in->

A text feature representing the nth item in, when l =1, causes ≧>

Step 2.2.3, obtaining the mth user u by using the formula (7) and the formula (8) _m And embedding of the nth item in into the l +1 th graph convolution layer under the multi-modal model

/>

Step 2.2.4, processing is carried out according to the process from step 2.2.2 to step 2.2.3, so that the characteristics of the mth user um are output by the L-th layer

Characteristic ^ s of mth user um under multimodal modal>

Characteristic ^ of the nth item in>

Step 2.3, processing of the comparison learning layer:

step 2.3.1, constructing a user contrast loss function through the formula (9)

In the formula (9), the reaction mixture is,

representing the characteristics of the jth user uj in a multi-modal model at the L level, wherein tau is a hyper-parameter; step 2.3.2 construction of the term comparison loss function ^ by equation (10)>

In the formula (10), the compound represented by the formula (10),

representing the characteristics of the kth user ik in a multimodal modal;

step 2.3.3, constructing a contrast loss function by the formula (11)

Step 2.4, processing the prediction layer:

calculating a preference score between the mth user um and the nth item in using equation (12)

In the formula (12), λ is a hyperparameter;

and 3, constructing a loss function of the multimodal map contrast learning model:

step 3.1, constructing a first loss function by using the formula (13)

Step 3.2, constructing a second loss function by using the formula (14)

Step 3.3, constructing a total loss function by using the formula (15)

In the formulae (13) to (15),

is training data, ix denotes the xth item, <' > is>

Representing the neighbor set of the mth user um, wherein sigma is a sigmoid function;

step 4, training the multi-modal graph contrast learning model by utilizing a gradient descent method based on the training data O, and calculating a total loss function

When the training iteration times reach the set times or the loss error is smaller than the set threshold value, the training is stopped, so that an optimal multi-modal graph contrast learning model is obtained and is used for judging the image characteristic matrix of the image modality>

Text feature matrix of text modality>

User embedding &>

Item embedding +>

Dense vector representation->

And &>

And processing and outputting the score of each user for each item, thereby selecting top items and recommending each user.

The invention relates to an electronic device, comprising a memory and a processor, characterized in that the memory is used for storing programs for supporting the processor to execute the multimedia recommendation method, and the processor is configured to execute the programs stored in the memory.

The present invention is a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when being executed by a processor, performs the steps of the multimedia recommendation method.

Compared with the prior art, the invention has the beneficial effects that:

1. the invention constructs an embedded graph volume network module which is specially used for spreading users and items, and is specially used for spreading the embedding of the users and the items, thereby lightening the problem of multi-mode noise pollution.

2. The present invention enhances the representation of users and items through separate graph learning modes and contrast learning to better capture the collaborative signal and multimodal preferences and attenuate the effects of multimodal noise.

Drawings

FIG. 1 is a schematic diagram of the recommendation method based on multi-modal graph contrast learning according to the present invention.

Detailed Description

In this embodiment, a recommendation method based on multimodal graph contrast learning is to construct a graph convolution module to capture a collaborative signal and multimodal user preferences, then eliminate noise pollution in modeling of the multimodal user preferences by adopting contrast learning, and finally, in order to ensure sufficient learning of a model, in this embodiment, an alternating training strategy is used to optimize the collaborative signal and the multimodal user preferences, specifically, as shown in fig. 1, the following steps are performed:

step 1, data acquisition and pretreatment;

constructing a user set, and recording the user set as U = { U = { (U) } ₁ ,u ₂ ,…,um,…,u _|U| H, wherein um represents the mth user; | U | represents the total number of users;

constructing user item interaction graph using implicit feedback data in dataset

Wherein +>

Indicates whether an interaction exists between the mth user um and the nth item in, and if so, causes &>

Otherwise, make +>

And item embedding->

And &>

Step 1.2, depth feature extraction:

In order to construct an image feature matrix { (1) } of the image modality V>

dV is the dimension of an image feature：

In order to construct a text feature matrix { (2) } of the text modality T>

dT is the dimension of the text feature:

step 2.1, processing the graph volume layer:

step 2.2.1, in order to model clean high-order collaborative signals, the invention does not incorporate multimodal features into the user item interaction graph

And executing the graph convolution operation. The embedding of the mth user um and the nth item in the l layer graph convolution layer is obtained by using the formula (3) and the formula (4), respectively:

in the formulae (3) and (4),

and &>

and &>

Is the embedding of the nth item in the convolution layer of the l-1 layer graph, and makes ^ greater or less than 1 when l =1>

Is the embedding of the mth user um in the graph convolution layer of the l-1 layer, and makes ^ greater or less than 1 when l =1>

Step 2.2.2, respectively obtaining the mth user u through the formula (5) and the formula (6) under the image mode V or the text mode T _m And embedding of the nth item in into the first layer graph convolution layer in the multi-modal model

To incorporate multimodal information of historical interactions into the node representation:

Characteristic of the multimodal modal, representing the nth item, is @>

Represents the embedding of the nth item in into the graph volume layer of the layer l-1 under the multi-mode modal; and->

An image feature representing the nth item in->

A text feature representing the nth item in, when l =1, causes ≧>

Step 2.2.4, processing according to the process from step 2.2.2 to step 2.2.3, thereby outputting the characteristics of the mth user um by the L-th layer

Characteristic @ of mth user um under multimodal modal>

Characteristic ^ of the nth item in>

Step 2.3, processing of the comparison learning layer:

step 2.3.1, constructing a user contrast loss function through the formula (9)

In the formula (9), the reaction mixture is,

representing the characteristics of the jth user uj in a multi-modal model at the L level, wherein tau is a hyper-parameter; step 2.3.2 construction of the term comparison loss function ^ by the equation (10)>

In the formula (10), the reaction mixture is,

representing the characteristics of the kth user ik in a multimodal modal;

step 2.3.3, in order to combine the node characteristics of users and projects in visual and text modes, constructing a contrast loss function by the formula (11)

Step 2.4, processing the prediction layer:

In the formula (12), λ is a hyperparameter;

and 3, in order to optimize the multi-modal graph contrast learning model, updating the representations of the user and the items by adopting a widely used Bayesian personalized ranking loss as a basic optimization target, wherein the Bayesian personalized ranking loss assumes that the user prefers the history interactive items rather than the untouched items. Constructing a loss function of the multi-modal graph learning model:

step 3.1, constructing a first loss function by using the formula (13)

Step 3.2, constructing a second loss function by using the formula (14)

Step 3.3, constructing a total loss function by using the formula (15)

In the formulae (13) to (15),

is training data, ix denotes the xth item, <' > is>

Representing a neighbor set of the mth user um, wherein sigma is a sigmoid function;

Text feature matrix of text modality>

User embedding->

Item embedding->

Dense vector representation +>

And &>

In this embodiment, an electronic device includes a memory for storing a program that supports a processor to execute the multimedia recommendation method, and a processor configured to execute the program stored in the memory.

In this embodiment, a computer-readable storage medium stores a computer program, and the computer program is executed by a processor to perform the steps of the multimedia recommendation method.

Claims

1. A multi-modal graph contrast learning-based multimedia recommendation method is characterized by comprising the following steps of:

step 1, data acquisition and pretreatment;

step 1.1, building a project set of commodities, and marking as I = { I = { (I) } ₁ ,i ₂ ,…,i _n ,…i _|I| In which i _n Represents the nth item, | I | represents the total number of items;

constructing a user set, and recording the user set as U = { U = { (U) } ₁ ,u ₂ ,…,u _m ,…,u _|U| In which u _m Represents the mth user; | U | represents the total number of users;

constructing a user item bipartite graph

Wherein it is present>

Represents the mth user u _m And the nth item i _n Whether there is an interaction between them or not, if yes, make->

Otherwise, make +>

The mth user u _m And the nth item i _n Mapping separately to user embedding

And item embedding->

Mth user u _m In image modalities V and text, respectivelyThe corresponding embedded vector in the present mode T is->

And &>

Step 1.2, depth feature extraction:

the nth commodity item i _n Corresponding image v _n Inputting the data into a pre-trained VGG16 model for processing to obtain image characteristics

d _V Is the dimension of the image feature:

the nth commodity item i _n Corresponding text t _n Inputting the text into a pre-trained Sennce 2Vec model for processing to obtain text characteristics

d _T Is the dimension of the text feature:

step 2, constructing a multimodal graph comparison learning model, which comprises the following steps: a graph volume layer, a comparison learning layer and a prediction layer;

step 2.1, processing the graph convolution layer:

step 2.2.1, respectively obtaining the mth user u by using the formula (3) and the formula (4) _m And the nth item i _n Embedding of the convolutional layer in the l-th layer:

in the formulae (3) and (4),

and &>

Respectively represent the m-th user u _m And the nth item i _n Is selected, based on the number of neighbor sets in the neighbor set, is greater than>

And

respectively represent the m-th user u _m And the nth item i _n The number of neighbors of (a); />

Is the nth item i _n Embedding in the convolution layer of the l-1 st layer, when l =1, make ^ greater or lesser than>

Is the mth user u _m Embedding in the convolution layer of the l-1 st layer, when l =1, make ^ greater or lesser than>

Step 2.2.2, respectively obtaining the mth user u through the formula (5) and the formula (6) under the image mode V or the text mode T _m And the nth item i _n Embedding of layer I graph convolution layer in multimodal modal

/>

is a weight transformation matrix for a multimodal modal, d _modal Is the dimension of the multi-modal feature, d is the embedding size; />

Characteristic of the multimodal modal, representing the nth item, is @>

Represents the mth user u _m Embedding of layer l-1 map convolutional layer under multimodal modal, and ^ h>

Represents the nth item i _n Embedding a layer l-1 graph convolution layer under a multi-modal model; and->

Represents the nth item i _n The characteristics of the image of (a) are,

represents the nth item i _n When l =1, let £ be £ £ 5>

Step 2.2.3, obtaining the mth user u by using the formula (7) and the formula (8) _m And the nth item i _n Embedding of layer 1 + graph convolution layer in multimodal modal

Step 2.2.4, processing according to the process from step 2.2.2 to step 2.2.3, thereby outputting the mth user u by the L layer _m Is characterized by

Mth user u _m Feature @ under a multimodal modal>

The nth item i _n Is characterized by>

Step 2.3, processing of the comparison learning layer:

step 2.3.1, constructing a user contrast loss function through the formula (9)

In the formula (9), the reaction mixture is,

represents the jth user u _j Features under multimodal modal at level L, τ is a hyper-parameter;

step 2.3.2, construction of item contrast loss function by the formula (10)

In the formula (10), the reaction mixture is,

represents the k-th user i _k Features under a multimodal modal;

step 2.3.3, constructing a contrast loss function by the formula (11)

/>

Step 2.4, processing the prediction layer:

the mth user u is calculated using equation (12) _m And the nth item u _n Preference score between

In the formula (12), λ is a hyperparameter;

step 3.1, constructing a first loss function by using the formula (13)

Step 3.2, constructing a second loss function by using the formula (14)

And 3.3, constructing a total loss function L by using the formula (15):

in the formulae (13) to (15),

is training data, i represents the xth item,

represents the mth user u _m σ is a sigmoid function;

Text feature matrix of text modality>

User embedding &>

Item embedding +>

Dense vector representation +>

And &>

2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that enables the processor to perform the multimedia recommendation method of claim 1, and the processor is configured to execute the program stored in the memory.

3. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the multimedia recommendation method of claim 1.