CN116932887A - Image recommendation system and method based on multi-modal image convolution - Google Patents

Image recommendation system and method based on multi-modal image convolution Download PDF

Info

Publication number
CN116932887A
CN116932887A (application number CN202310669701.3A)
Authority
CN
China
Prior art keywords
layer
representing
aggregation
image
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310669701.3A
Other languages
Chinese (zh)
Inventor
朱东杰
谭景元
丁卓
张立斌
鲁宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changjiang Shidai Communication Co ltd
Harbin Institute of Technology Weihai
Original Assignee
Changjiang Shidai Communication Co ltd
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changjiang Shidai Communication Co ltd, Harbin Institute of Technology Weihai filed Critical Changjiang Shidai Communication Co ltd
Priority to CN202310669701.3A priority Critical patent/CN116932887A/en
Publication of CN116932887A publication Critical patent/CN116932887A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image recommendation system and method based on multi-modal image convolution, belonging to the technical field of computers. By comprehensively using graph convolution to aggregate multi-modal features, recommendation of film and television works matching user preferences is realized on a graph-convolution architecture. First, information such as the evaluation records of the same user on film and television works and the related posters of the works is obtained through a crawler algorithm; the data set is preprocessed, the data are enhanced with the MixGen method, and the data set is expanded; the image data modality and the text data modality are expressed in vector form by linear transformation and similar methods; information of the different modalities is extracted, and vector representations of the text modality and the image data modality are obtained respectively; intra-layer and inter-layer node aggregation is carried out on the same modality by graph convolution, and the fine-grained intention of a user toward a film is extracted; inter-layer aggregation is used to relate fine-grained and coarse-grained user intentions, and super-node combinations are established for the processing of the different modalities; the features of all modalities obtained by aggregation are passed through an attention-mechanism layer, the interaction among the different modalities is enhanced, and finally a recommendation list of film and television works is obtained. The method solves the problems that existing multi-modal recommendation systems struggle to model user preference in a specific modality and that it is difficult for data of different modalities to interact.

Description

Image recommendation system and method based on multi-modal image convolution
Technical Field
The invention discloses an image recommendation system and method based on multi-modal image convolution, and belongs to the technical field of computers.
Background
A recommendation system is a technology widely used on the internet and in other fields: by analyzing a user's behavior and interests, it predicts items the user may like and thereby provides personalized recommendations. Accurately recommending film and television works that match users' interests has become an important task. However, the sheer number of film and television works brings complicated content and meta-information, which makes it difficult for a conventional recommendation system to capture user preferences well.
Existing multimodal recommendation systems rely primarily on user behavior data (e.g., viewing history and ratings) and content information (e.g., actors and genre) of film and television works. However, this information tends to be high-dimensional, sparse and heterogeneous, which makes it very difficult to build an effective user preference model. Furthermore, fusing multimodal information often involves a large amount of parameter tuning, which leaves conventional multimodal recommendation systems poor at modeling user preferences in a particular modality.
Disclosure of Invention
The invention solves the problem that existing multi-modal recommendation systems struggle to model user preference in a specific modality, and provides an image recommendation system and method based on multi-modal image convolution.
The invention discloses an image recommendation system and method based on multi-modal image convolution, which are realized by the following technical scheme:
step one, obtaining information such as evaluation records of the same user on the film and television works, related posters of the film and television works and the like through a crawler algorithm.
And step two, preprocessing a data set, enhancing the data by using a MixGen method, and expanding the data set.
And thirdly, extracting information of different modes, and representing the image data mode and the text data mode into a vector form by using methods such as linear transformation.
And fourthly, carrying out intra-layer and inter-layer node aggregation on the same mode by utilizing graph convolution, and extracting the fine granularity intention of a user on the film.
And fifthly, establishing super nodes for different modes, and establishing interlayer aggregation to establish a relationship between fine granularity and coarse granularity user intention.
And step six, the characteristics of all modes obtained through aggregation are enhanced through a self-attention mechanism layer, interaction among different modes is enhanced, and finally a film and television work recommendation list is obtained.
The invention has the most outstanding characteristics and remarkable beneficial effects that:
According to the image recommendation system and method based on multi-modal image convolution, information such as the evaluation records of the same user on film and television works and the related posters of the works is obtained through a crawler algorithm; the data set is preprocessed, the data are enhanced with the MixGen method, and the data set is expanded; the image data modality and the text data modality are expressed in vector form by linear transformation and similar methods; information of the different modalities is extracted, and vector representations of the text modality and the image data modality are obtained respectively; intra-layer and inter-layer node aggregation is carried out on the same modality by graph convolution, and the fine-grained intention of a user toward a film is extracted; inter-layer aggregation is used to relate fine-grained and coarse-grained user intentions, and super-node combinations are established for the processing of the different modalities; the features of all modalities obtained by aggregation are passed through an attention-mechanism layer, the interaction among the different modalities is enhanced, and finally a recommendation list of film and television works is obtained. The method solves the problems that existing multi-modal recommendation systems struggle to model user preference in a specific modality and that it is difficult for data of different modalities to interact.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of an overall model architecture of the multi-modal recommendation system of the present invention;
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
In order to better explain the present embodiment, the technical solutions in the present embodiment will be clearly and completely described below with reference to the drawings in the present embodiment.
The description of the present embodiment is given with reference to fig. 1; the image recommendation method based on multi-modal graph convolution provided in the present embodiment specifically includes the following steps:
step one, obtaining information such as evaluation records of the same user on the film and television works, related posters of the film and television works and the like through a crawler algorithm.
And secondly, in order to expand the data set when training the model, data enhancement is carried out on the multi-mode data. In the process of data enhancement, in order to preserve the features of images and texts as much as possible, a MixGen data enhancement method is used, whose expression is as follows:
I_k = γ · I_i + (1 − γ) · I_j

T_k = T_i ⊕ T_j

where ⊕ denotes a concat connection, γ denotes a parameter between 0 and 1 and takes the value 0.5, and I_k and T_k denote the image and text data produced by the augmentation. As the formulas show, this preserves the content of both the image and the text as far as possible.
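The augmentation above can be sketched as follows; this is a minimal illustration of the MixGen-style mixing described, with toy array shapes and captions that are purely illustrative:

```python
import numpy as np

def mixgen(image_i, image_j, text_i, text_j, gamma=0.5):
    """MixGen-style augmentation: linearly interpolate the two images
    and concatenate the two texts (gamma = 0.5, as in the description)."""
    image_k = gamma * image_i + (1.0 - gamma) * image_j
    text_k = text_i + " " + text_j  # concat connection
    return image_k, text_k

# Toy example: two 2x2 single-channel "posters" and two captions.
img_a = np.zeros((2, 2))
img_b = np.ones((2, 2))
img_mix, txt_mix = mixgen(img_a, img_b, "a dark poster", "a bright poster")
```

With γ = 0.5, the mixed image is the elementwise average of the two inputs, and the mixed text is the concatenation of the two captions.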
Step three, features of the image and the text are extracted. For the image, the input image is divided into blocks of equal size, giving a sequence of N blocks {x_1, x_2, …, x_N}, where x_i denotes a block of size p²·C, p denotes the side length of a square block and C denotes the number of channels of the block. A linear transformation is applied to the block sequence to obtain a feature-vector sequence {v_1, v_2, …, v_N}; a position code p_i is added to each feature vector, giving the input vector sequence {x′_1, x′_2, …, x′_N}, where each x′_i is a feature vector of dimension d. For the text, the input text is segmented with WordPiece to obtain the word sequence {y_1, y_2, …, y_N}, where y_i denotes a text token; a [CLS] tag is added at the beginning of the word sequence as the start symbol for the classification task, and a [SEP] tag is added at the end of the sentence to mark its end. Position-coding information q_i is added to each word of the sequence, and the sequence is mapped into a word-vector sequence {y′_1, y′_2, …, y′_N}, where each y′_i is a feature vector of dimension d.
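The block-splitting and linear-projection step for the image can be sketched as below; the projection matrix `W_proj` and the position codes `pos` are random placeholders standing in for learned parameters, and the shapes are illustrative:

```python
import numpy as np

def patchify(image, p):
    """Split an H x W x C image into non-overlapping p x p blocks and
    flatten each block to a vector of length p*p*C."""
    H, W, C = image.shape
    patches = []
    for r in range(0, H, p):
        for c in range(0, W, p):
            patches.append(image[r:r+p, c:c+p, :].reshape(-1))
    return np.stack(patches)          # shape (N, p*p*C)

def embed(patches, W_proj, pos):
    """Linearly transform each block, then add its position code."""
    return patches @ W_proj + pos     # shape (N, d)

rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))           # H = W = 4, C = 3
x = patchify(img, p=2)                # N = 4 blocks, each 2*2*3 = 12 values
W_proj = rng.random((12, 8))          # project to dimension d = 8
pos = rng.random((4, 8))              # one position code per block
x_prime = embed(x, W_proj, pos)       # the input vector sequence
```

The text side is analogous: a tokenized sequence is mapped through an embedding table and position codes q_i are added, yielding d-dimensional word vectors.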
Step four, intra-layer and inter-layer node aggregation is carried out on the same modality by graph convolution, and the fine-grained intention of a user toward a film is extracted. The processing of the different modal features is shown in fig. 2: the image features and the text features are respectively input into different graph convolutional neural networks, and a collaborative interaction graph G = {X, A} is constructed, where X denotes the extracted text and image features and A is the adjacency matrix. Each layer of nodes represents different features within the same modality, and whether interactions exist among the features is represented by the matrix A, which is defined as follows:
A_ij = 1 if v_j ∈ N(v_i), and A_ij = 0 otherwise,

where N(v_i) denotes the neighbor set of node v_i. Intra-layer aggregation is realized through an aggregation function, which learns the topological structure of each node's neighborhood and the distribution of node features within that neighborhood. The aggregation functions of the text and image modalities are expressed as:

H_t^(l+1) = σ( D̃_t^(−1/2) · Ã_t^(l) · D̃_t^(−1/2) · H_t^(l) · W_t^(l) )

H_v^(l+1) = σ( D̃_v^(−1/2) · Ã_v^(l) · D̃_v^(−1/2) · H_v^(l) · W_v^(l) )

where Ã_t^(l) denotes the adjacency matrix of the layer-l text features, D̃_t denotes its degree matrix, and H_t^(l+1) denotes the result of aggregating the layer-l text features; Ã_v^(l), D̃_v and H_v^(l+1) denote the corresponding adjacency matrix, degree matrix and aggregation result for the layer-l image features.
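A minimal sketch of one intra-layer aggregation step, assuming the standard normalized graph-convolution form with self-loops and ReLU as the nonlinearity σ; the toy interaction graph, features and weights are illustrative, not the patent's data:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One intra-layer aggregation step in standard graph-convolution
    form: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Tiny 3-node interaction graph with 2-dimensional node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)[:, :2]                          # 3 nodes, d = 2
W = np.eye(2)                                 # identity weights for clarity
H_next = gcn_layer(A, H, W)                   # aggregated layer-(l+1) features
```

The same layer would be applied separately to the text graph and the image graph, each with its own adjacency and weight matrices.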
Step five, super nodes are established for the different modalities, and inter-layer aggregation is established to relate fine-grained and coarse-grained user intentions. The established super nodes are represented as:

S_t^(l+1) = { s_t,1^(l+1), …, s_t,K^(l+1) },   S_v^(l+1) = { s_v,1^(l+1), …, s_v,K^(l+1) }

where S_t^(l+1) denotes the set of super nodes of the (l+1)-th graph-convolution layer in the aggregation of text features, s_t,k^(l+1) denotes the k-th super node of the (l+1)-th layer in the aggregation of text features, the dimension of each super node is d, and K^(l+1) denotes the number of super nodes at layer (l+1); S_v^(l+1) denotes the set of (l+1)-th-layer super nodes in the aggregation of image features, and s_v,k^(l+1) denotes its k-th super node, each also of dimension d. An inner product is then taken between each super node and the nodes aggregated by each graph-convolution layer:

r_i,k,t = ⟨ h_t,i^(l), s_t,k^(l+1) ⟩,   r_i,k,v = ⟨ h_v,i^(l), s_v,k^(l+1) ⟩

where h_t,i^(l) denotes the i-th node of the l-th graph-convolution layer in the aggregation of text features, r_i,k,t denotes the affinity score between the k-th super node and the i-th node in text processing, h_v,i^(l) denotes the i-th node of the l-th layer in the aggregation of image features, and r_i,k,v denotes the corresponding affinity score in image processing. The scores are then normalized to compute the weight assigned to each node:

w_i,k,t = exp(r_i,k,t) / Σ_k′ exp(r_i,k′,t),   w_i,k,v = exp(r_i,k,v) / Σ_k′ exp(r_i,k′,v)

where w_i,k,t denotes the weight with which the i-th node established for the text features is assigned to the k-th established super node, and w_i,k,v denotes the corresponding weight for the image features; these assignment weights construct the relationship between the lower-level graph and the higher-level graph. The high-level adjacency matrix is obtained from the low-level adjacency matrix as:

A_t^(l+1) = W_t^(l)ᵀ · A_t^(l) · W_t^(l)

where A_t denotes the adjacency matrix at layer 0 in aggregating the text features and W_t^(l) denotes the weight matrix.
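The super-node assignment and adjacency lifting can be sketched as below, a DiffPool-style coarsening that matches the affinity-score and softmax description above; all tensors are random placeholders and the shapes are assumptions for illustration:

```python
import numpy as np

def coarsen(H, S_nodes, A):
    """Assign each layer-l node to super nodes via a softmax over inner
    products, then lift the adjacency: A_next = W^T A W."""
    r = H @ S_nodes.T                        # (N, K) affinity scores
    e = np.exp(r - r.max(axis=1, keepdims=True))
    Wgt = e / e.sum(axis=1, keepdims=True)   # (N, K) assignment weights
    A_next = Wgt.T @ A @ Wgt                 # (K, K) high-level adjacency
    H_next = Wgt.T @ H                       # (K, d) super-node features
    return Wgt, A_next, H_next

rng = np.random.default_rng(1)
H = rng.random((5, 4))                       # N = 5 nodes, dimension d = 4
S = rng.random((2, 4))                       # K = 2 super nodes, same d
A = (rng.random((5, 5)) > 0.5).astype(float)
Wgt, A_next, H_next = coarsen(H, S, A)
```

Each row of `Wgt` sums to one, so every fine-grained node distributes its full weight across the coarse-grained super nodes.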
Step six, the features of all modalities obtained by aggregation are passed through a self-attention layer, the interaction among the different modalities is enhanced, and finally a recommendation list of film and television works is obtained. The nodes of each layer in each modality are obtained through continuous iteration, represented as:

{ h_t,i^(1), h_t,i^(2), …, h_t,i^(L) },   { h_v,i^(1), h_v,i^(2), …, h_v,i^(L) }

where h_t,i^(1) denotes the i-th node of the text features at the first layer of the graph convolution, and h_t,i^(l) denotes the i-th node of the text features at the l-th layer. The acquired nodes of each layer pass through a self-attention layer, whose expression is:

Attention(Q_s, K_s, V_s) = softmax( Q_s K_sᵀ / √d ) V_s

where K_s, Q_s and V_s denote the feature vectors of each node, and K_s = Q_s = V_s. The nodes output by the self-attention layer are then spliced:

h_i,t = h_i,t^(1) ∥ h_i,t^(2) ∥ … ∥ h_i,t^(L),   h_i,v = h_i,v^(1) ∥ h_i,v^(2) ∥ … ∥ h_i,v^(L)
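A minimal sketch of the self-attention layer with K = Q = V, assuming the usual scaled dot-product form implied by the description; the input node matrix is a random placeholder:

```python
import numpy as np

def self_attention(X):
    """Self-attention with K = Q = V = X:
    softmax(Q K^T / sqrt(d)) V."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                      # pairwise similarities
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)            # row-wise softmax
    return attn @ X                                    # re-weighted nodes

rng = np.random.default_rng(2)
nodes = rng.random((6, 8))          # 6 aggregated nodes, dimension d = 8
out = self_attention(nodes)
```

Because K, Q and V are all the node features themselves, each output node is a similarity-weighted mixture of all nodes, which is what lets the different modalities interact.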
the method comprises the steps of (1) connecting a symbol, splicing the characteristics finally obtained by aggregation of all modes, and constructing a vector which can be learned by different modes, wherein the expression is as follows:
H i =h i,t +h i,v
u i =u i,t +u i,v
Finally, the recommendations are personalized and ranked with Bayesian personalized ranking (BPR), for which triples of a user, an observed item and an unobserved item are constructed, in the form:

τ = { (U, H_p, H_q) ∣ A_i,p = 1, A_i,q = 0 }

where U is a learnable vector representing a user, H_p denotes an observed item and H_q denotes an unobserved item. To mitigate unnecessary overlap, a cross-entropy loss is introduced into the model loss function:

L_CE = − Σ_i [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]

For training on τ, the following loss function is used:

L_BPR = Σ_(U,H_p,H_q)∈τ − ln σ( Uᵀ H_p − Uᵀ H_q )

The loss function of the final model training is:

L = L_BPR + α · L_CE

where α is an adjustable parameter.
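The ranking objective can be sketched as below, assuming the standard BPR form −log σ(u·h_p − u·h_q) over (user, observed item, unobserved item) triples; the vectors are random placeholders:

```python
import numpy as np

def bpr_loss(U, H_pos, H_neg):
    """Bayesian personalized ranking loss over a batch of triples:
    -mean log sigmoid(u . h_p - u . h_q)."""
    diff = np.sum(U * H_pos, axis=1) - np.sum(U * H_neg, axis=1)
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-diff)))))

rng = np.random.default_rng(3)
U = rng.random((4, 8))         # 4 user vectors (learnable)
H_p = rng.random((4, 8))       # observed items
H_q = rng.random((4, 8))       # unobserved items
loss = bpr_loss(U, H_p, H_q)
```

The loss shrinks as the score of each observed item pulls ahead of its unobserved counterpart, which is exactly the pairwise ordering BPR optimizes.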
The present invention is capable of other and further embodiments and its several details are capable of modification and variation in light of the present invention, as will be apparent to those skilled in the art, without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. An image recommendation system and method based on multi-modal image convolution, characterized by comprising the following steps:
Step one, information such as the evaluation records of the same user on film and television works and the related posters of the works is obtained through a crawler algorithm.
Step two, the data set is preprocessed, the data are enhanced with the MixGen method, and the data set is expanded.
Step three, information of the different modalities is extracted, and vector representations of the text modality and the image data modality are obtained respectively.
Step four, intra-layer and inter-layer node aggregation is carried out on the same modality by graph convolution, and the fine-grained intention of a user toward a film is extracted.
Step five, inter-layer aggregation is used to relate fine-grained and coarse-grained user intentions, and super-node combinations are established for the processing of the different modalities.
Step six, the features of all modalities obtained by aggregation are passed through an attention-mechanism layer, the interaction among the different modalities is enhanced, and finally a recommendation list of film and television works is obtained.
2. The image recommendation system and method based on multi-modal image convolution according to claim 1, wherein in step three the image data modality and the text data modality are obtained and expressed in vector form by linear transformation and similar methods.
3. The method of claim 2, wherein step two uses the MixGen data enhancement method. In order to preserve the features of the images and texts as far as possible during data enhancement, MixGen is used, whose expression is as follows:

I_k = γ · I_i + (1 − γ) · I_j

T_k = T_i ⊕ T_j

where ⊕ denotes a concat connection, γ denotes a parameter between 0 and 1 and takes the value 0.5, and I_k and T_k denote the image and text data produced by the augmentation. As the formulas show, this preserves the content of both the image and the text as far as possible.
4. The method of claim 3, wherein in step four intra-layer and inter-layer node aggregation is performed on the same modality by graph convolution: the features of the images and the texts are respectively input into different graph convolutional neural networks, and a collaborative interaction graph G = {X, A} is constructed, where X denotes the extracted text and image features and A is the adjacency matrix. Each layer of nodes represents different features within the same modality, and whether interactions exist among the features is represented by the matrix A, defined as:

A_ij = 1 if v_j ∈ N(v_i), and A_ij = 0 otherwise,

where N(v_i) denotes the neighbor set of node v_i. Intra-layer aggregation is realized through an aggregation function, which learns the topological structure of each node's neighborhood and the distribution of node features within that neighborhood. The aggregation functions of the text and image modalities are expressed as:

H_t^(l+1) = σ( D̃_t^(−1/2) · Ã_t^(l) · D̃_t^(−1/2) · H_t^(l) · W_t^(l) )

H_v^(l+1) = σ( D̃_v^(−1/2) · Ã_v^(l) · D̃_v^(−1/2) · H_v^(l) · W_v^(l) )

where Ã_t^(l) denotes the adjacency matrix of the layer-l text features, D̃_t denotes its degree matrix, and H_t^(l+1) denotes the result of aggregating the layer-l text features; Ã_v^(l), D̃_v and H_v^(l+1) denote the corresponding adjacency matrix, degree matrix and aggregation result for the layer-l image features.
5. The method of claim 4, wherein in step five super nodes are established for the different modalities and inter-layer aggregation is established to relate fine-grained and coarse-grained user intentions, the established super nodes being represented as:

S_t^(l+1) = { s_t,1^(l+1), …, s_t,K^(l+1) },   S_v^(l+1) = { s_v,1^(l+1), …, s_v,K^(l+1) }

where S_t^(l+1) denotes the set of super nodes of the (l+1)-th graph-convolution layer in the aggregation of text features, s_t,k^(l+1) denotes the k-th super node of the (l+1)-th layer in the aggregation of text features, the dimension of each super node is d, and K^(l+1) denotes the number of super nodes at layer (l+1); S_v^(l+1) denotes the set of (l+1)-th-layer super nodes in the aggregation of image features, and s_v,k^(l+1) denotes its k-th super node, each also of dimension d. An inner product is then taken between each super node and the nodes aggregated by each graph-convolution layer:

r_i,k,t = ⟨ h_t,i^(l), s_t,k^(l+1) ⟩,   r_i,k,v = ⟨ h_v,i^(l), s_v,k^(l+1) ⟩

where h_t,i^(l) denotes the i-th node of the l-th graph-convolution layer in the aggregation of text features, r_i,k,t denotes the affinity score between the k-th super node and the i-th node in text processing, h_v,i^(l) denotes the i-th node of the l-th layer in the aggregation of image features, and r_i,k,v denotes the corresponding affinity score in image processing. The scores are then normalized to compute the weight assigned to each node:

w_i,k,t = exp(r_i,k,t) / Σ_k′ exp(r_i,k′,t),   w_i,k,v = exp(r_i,k,v) / Σ_k′ exp(r_i,k′,v)

where w_i,k,t denotes the weight with which the i-th node established for the text features is assigned to the k-th established super node, and w_i,k,v denotes the corresponding weight for the image features; these assignment weights construct the relationship between the lower-level graph and the higher-level graph. The high-level adjacency matrix is obtained from the low-level adjacency matrix as:

A_t^(l+1) = W_t^(l)ᵀ · A_t^(l) · W_t^(l)

where A_t denotes the adjacency matrix at layer 0 in aggregating the text features and W_t^(l) denotes the weight matrix.
6. The method of claim 5, wherein in step six the interaction between the different modalities is enhanced by a self-attention mechanism. The nodes of each layer in each modality are obtained through continuous iteration, represented as:

{ h_t,i^(1), h_t,i^(2), …, h_t,i^(L) },   { h_v,i^(1), h_v,i^(2), …, h_v,i^(L) }

where h_t,i^(1) denotes the i-th node of the text features at the first layer of the graph convolution, and h_t,i^(l) denotes the i-th node of the text features at the l-th layer. The acquired nodes of each layer pass through a self-attention layer, whose expression is:

Attention(Q, K, V) = softmax( Q Kᵀ / √d ) V

where K, Q and V denote the feature vectors of each node, and K = Q = V. The nodes output by the self-attention layer are then spliced:

h_i,t = h_i,t^(1) ∥ h_i,t^(2) ∥ … ∥ h_i,t^(L),   h_i,v = h_i,v^(1) ∥ h_i,v^(2) ∥ … ∥ h_i,v^(L)

where ∥ denotes the concatenation symbol. The features finally obtained by aggregation in each modality are spliced, and learnable vectors for the different modalities are constructed as:

H_i = h_i,t + h_i,v

u_i = u_i,t + u_i,v

Finally, the recommendations are personalized and ranked with Bayesian personalized ranking, for which triples of a user, observed items and unobserved items are constructed, in the form:

τ = { (U, H_p, H_q) ∣ A_i,p = 1, A_i,q = 0 }.
CN202310669701.3A 2023-06-07 2023-06-07 Image recommendation system and method based on multi-modal image convolution Pending CN116932887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310669701.3A CN116932887A (en) 2023-06-07 2023-06-07 Image recommendation system and method based on multi-modal image convolution


Publications (1)

Publication Number Publication Date
CN116932887A true CN116932887A (en) 2023-10-24

Family

ID=88374548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310669701.3A Pending CN116932887A (en) 2023-06-07 2023-06-07 Image recommendation system and method based on multi-modal image convolution

Country Status (1)

Country Link
CN (1) CN116932887A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382309A (en) * 2020-03-10 2020-07-07 深圳大学 Short video recommendation method based on graph model, intelligent terminal and storage medium
CN112948708A (en) * 2021-03-05 2021-06-11 清华大学深圳国际研究生院 Short video recommendation method
CN114676315A (en) * 2022-01-28 2022-06-28 齐鲁工业大学 Method and system for constructing attribute fusion interaction recommendation model based on enhanced graph convolution
US20220207587A1 (en) * 2020-12-30 2022-06-30 Beijing Wodong Tianjun Information Technology Co., Ltd. System and method for product recommendation based on multimodal fashion knowledge graph
WO2023024017A1 (en) * 2021-08-26 2023-03-02 Ebay Inc. Multi-modal hypergraph-based click prediction
CN115952307A (en) * 2022-12-30 2023-04-11 合肥工业大学 Recommendation method based on multimodal graph contrast learning, electronic device and storage medium
CN115964560A (en) * 2022-12-07 2023-04-14 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN116186301A (en) * 2022-12-30 2023-05-30 合肥工业大学 Multi-mode hierarchical graph-based multimedia recommendation method, electronic equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Han Tengyue et al.: "Multimodal sequential recommendation algorithm based on contrastive learning", Journal of Computer Applications, vol. 42, no. 6, pages 1683-1688 *

Similar Documents

Publication Publication Date Title
CN111382309B (en) Short video recommendation method based on graph model, intelligent terminal and storage medium
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN108920641B (en) Information fusion personalized recommendation method
Zhou et al. Predicting movie box-office revenues using deep neural networks
CN108537624B (en) Deep learning-based travel service recommendation method
Katarya et al. Capsmf: a novel product recommender system using deep learning based text analysis model
WO2021139415A1 (en) Data processing method and apparatus, computer readable storage medium, and electronic device
CN112287170B (en) Short video classification method and device based on multi-mode joint learning
CN111143705A (en) Recommendation method based on graph convolution network
US20220253722A1 (en) Recommendation system with adaptive thresholds for neighborhood selection
CN111949885A (en) Personalized recommendation method for scenic spots
CN115964560A (en) Information recommendation method and equipment based on multi-mode pre-training model
Song et al. Coarse-to-fine: A dual-view attention network for click-through rate prediction
Wang et al. An enhanced multi-modal recommendation based on alternate training with knowledge graph representation
Wang et al. Deep Meta-learning in Recommendation Systems: A Survey
Chakder et al. Graph network based approaches for multi-modal movie recommendation system
US20240037133A1 (en) Method and apparatus for recommending cold start object, computer device, and storage medium
Sangeetha et al. Predicting personalized recommendations using GNN
CN115269984A (en) Professional information recommendation method and system
CN115391555A (en) User-perceived knowledge map recommendation system and method
CN116932887A (en) Image recommendation system and method based on multi-modal image convolution
RahmatAbadi et al. Leveraging deep learning techniques on collaborative filtering recommender systems
CN115238191A (en) Object recommendation method and device
Lu Design of a music recommendation model on the basis of multilayer attention representation
Low et al. Recent developments in recommender systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination