CN112364245A - Top-K movie recommendation method based on heterogeneous information network embedding

Info

Publication number: CN112364245A
Application number: CN202011306020.3A
Authority: CN
Inventors: 汤颖; 陈懿
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2021-02-12
Anticipated expiration: 2040-11-20
Also published as: CN112364245B

Abstract

A Top-K movie recommendation method based on heterogeneous information network embedding comprises the following steps: step 1, preprocessing the data; step 2, learning an embedding of the heterogeneous information network; step 3, propagating information in the heterogeneous information network; step 4, aggregating node information and edge information; step 5, predicting scores; and step 6, Top-K evaluation. The invention improves the learning of the heterogeneous information network by explicitly adding the edges between nodes to the learning process, and applies the improved learning method to the recommendation task. It fully captures the relationships between the different types of nodes in movie data and obtains richer semantic information than a traditional homogeneous network. Compared with existing heterogeneous information network learning methods, it additionally captures edge information, which reduces data loss during learning and improves the utilization of the information in the heterogeneous information network.

Description

Top-K movie recommendation method based on heterogeneous information network embedding

Technical Field

The invention relates to a movie recommendation method.

Background

The rapid development of the internet has brought people abundant information and satisfied their demand for it. With the explosive growth in the amount of information, however, people find that they encounter more and more information in daily life while less and less of it is actually useful to them. This creates the problem of information overload: faced with massive amounts of information, users cannot quickly find what they need because of their limited knowledge and cognitive ability.

Initially, the main answers to information overload were classified directories and search engines, such as Yahoo and Google. However, as data volumes grew rapidly, these approaches could no longer meet people's needs, and recommendation systems emerged. A recommendation system infers a user's interests by analyzing the user's historical behavior and proactively pushes information of interest to the user.

Early research on recommendation algorithms focused mainly on collaborative filtering and achieved good results. Collaborative filtering falls into two major categories: neighborhood-based and model-based. Neighborhood-based collaborative filtering can be further divided into user-based and item-based collaborative filtering; model-based collaborative filtering mainly includes the SVM model, the Bayesian network model, the latent factor model and the like.

The methods above consider only homogeneous networks, which cannot model the complex real world well; heterogeneous information networks were introduced for this reason. A heterogeneous information network contains more than two different types of nodes and relations, so it can describe the complex relationships of the real world well and thereby improve recommendation accuracy. Current research on heterogeneous information networks in recommendation focuses mainly on learning node embeddings and follows two general directions: meta-path-based methods and methods that apply graph neural networks directly. Both vectorize the nodes to capture the structural information of the heterogeneous information network and then complete the recommendation task in combination with a classical recommendation algorithm. Most of these methods concentrate on processing the nodes and ignore the information carried by the edges between them. Because a heterogeneous information network has many node types, the edges between nodes are also of many types and carry a great deal of information, which these methods discard.

Disclosure of Invention

To overcome the above shortcomings of the prior art and to incorporate the rich edge information of the heterogeneous information network into the recommendation model, the invention provides a new recommendation method based on the heterogeneous information network.

The method first embeds the nodes and edges of the heterogeneous information network using the TransR method to obtain their vector representations, then aggregates the node vectors and edge vectors to obtain vector representations of users and items, and finally completes the Top-K recommendation task.

The Top-K movie recommendation method based on heterogeneous information network embedding comprises the following specific steps:

step 1, preprocessing data, specifically comprising:

1.1 cleaning the data; the original data are cleaned by filtering out invalid data from the original data set, the invalid data comprising data of users whose number of viewings is smaller than a preset value and data of movies whose number of ratings is smaller than the preset value; the training data and test data are subsequently built from the cleaned data;

1.2 constructing the heterogeneous information network data and constructing the training data and test data; the heterogeneous information network is constructed from the cleaned data, which are organized into triples that represent the heterogeneous information network, each triple having the following form:

(h,r,t) (1)

wherein h represents a head node, t represents a tail node, and r represents the relationship between the head node h and the tail node t, i.e. the edge between the head node h and the tail node t;
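As a non-limiting illustration of step 1, the cleaning and triple construction can be sketched in Python as follows. The function names, the relation labels "rates", "directs" and "acts_in", and the shape of the metadata dictionaries are hypothetical and are not taken from the specification. The filter is applied in a single pass, which is a simplification: removing records can push other users or movies below the preset value.

```python
from collections import Counter


def clean_ratings(ratings, min_count=20):
    """Keep (user, movie) records whose user and movie both occur at least min_count times."""
    user_counts = Counter(user for user, _ in ratings)
    movie_counts = Counter(movie for _, movie in ratings)
    return [
        (user, movie)
        for user, movie in ratings
        if user_counts[user] >= min_count and movie_counts[movie] >= min_count
    ]


def build_triples(ratings, movie_directors, movie_actors):
    """Turn cleaned records plus movie metadata into (h, r, t) triples.

    The relation labels are hypothetical names chosen for this sketch.
    """
    movies = {movie for _, movie in ratings}
    triples = [(user, "rates", movie) for user, movie in ratings]
    for movie, director in movie_directors.items():
        if movie in movies:  # skip metadata of movies removed by cleaning
            triples.append((director, "directs", movie))
    for movie, actors in movie_actors.items():
        if movie in movies:
            triples.extend((actor, "acts_in", movie) for actor in actors)
    return triples
```

With min_count=20 the filter corresponds to the preferred preset value; a smaller value is convenient for experimenting on toy data.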

step 2, learning an embedding of the heterogeneous information network, specifically comprising:

2.1 initializing the embedding; first, the vectors of the nodes and edges in the heterogeneous information network are initialized; a TransR model is adopted, in which the nodes and edges are initialized with vectors of the same dimension, E_h, E_t and E_r denoting the head node, the tail node and the edge respectively; the nodes are then mapped according to the type of relationship, i.e. for each relationship r there is a mapping matrix M_r that maps the nodes into the vector space of the relationship r, with the following formula:

wherein

respectively represent the vectors of nodes h and t after mapping into the space of r;
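As a non-limiting illustration of step 2.1, the sketch below initializes E_h, E_t, E_r and one mapping matrix M_r per relation, and maps a node into the space of r. It assumes the usual TransR convention that the mapping is the matrix-vector product of M_r with the node vector; the uniform initialization range, the seed and all function names are assumptions of this sketch.

```python
import random


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_vector(dim, rng):
    # Small uniform values; the range is an assumption of this sketch.
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def init_embeddings(triples, dim, seed=0):
    """Create same-dimension vectors for nodes and edges, plus one M_r per relation."""
    rng = random.Random(seed)
    node_emb, edge_emb, proj = {}, {}, {}
    for h, r, t in triples:
        for node in (h, t):
            if node not in node_emb:
                node_emb[node] = init_vector(dim, rng)
        if r not in edge_emb:
            edge_emb[r] = init_vector(dim, rng)
            proj[r] = [init_vector(dim, rng) for _ in range(dim)]  # dim x dim matrix
    return node_emb, edge_emb, proj


def project(node_vec, M_r):
    """Map a node vector into the space of relation r (assumed form: M_r times E)."""
    return mat_vec(M_r, node_vec)
```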

2.2 learning the heterogeneous information network; the vector representations of the nodes and edges having been obtained through initialization, the heterogeneous information network is learned through a score function:

wherein f(h, r, t) represents the score function; by means of this function, connected nodes are drawn close to each other while unconnected nodes are pushed apart; the loss function L₁ of the learning process is defined as:

wherein (h, r, t) ∈ G represents a positive sample in the heterogeneous information network,

is a negative sample, and G denotes the heterogeneous information network;
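As a non-limiting illustration of step 2.2, the sketch below uses the standard TransR score, the squared distance between M_r·E_h + E_r and M_r·E_t, together with a pairwise loss that is small when a positive triple scores lower (closer) than a negative one. Both forms are assumptions taken from the general TransR literature, not formulas confirmed by this text, and the function names are hypothetical.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transr_score(E_h, E_r, E_t, M_r):
    """Assumed score f(h, r, t) = ||M_r E_h + E_r - M_r E_t||^2 (lower means more plausible)."""
    h_r = mat_vec(M_r, E_h)
    t_r = mat_vec(M_r, E_t)
    return sum((a + b - c) ** 2 for a, b, c in zip(h_r, E_r, t_r))


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(pos_score, neg_score):
    """Assumed per-pair loss -ln(sigmoid(neg_score - pos_score))."""
    return softplus(pos_score - neg_score)
```

Under these assumptions, L₁ would be the sum of pairwise_loss over all pairs of a positive triple and a sampled negative triple.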

step 3, propagating information in the heterogeneous information network, specifically comprising:

3.1 calculating the attention scores between the nodes and the neighbors;

unlike the meta-path method, which uses path instances prepared in advance, the present invention calculates attention scores between a node and its neighbors directly from the connectivity of the node in the heterogeneous information network; for example, the attention score π(h, r, t) of a node h and its neighbor t is:

wherein tanh(·) is the activation function; the more closely a node is associated with a neighbor, the greater the attention score; since a node has multiple neighbors, there are multiple attention scores, so the obtained attention scores are normalized:

wherein the numerator exp(π(h, r, t)) represents the attention score of node h and one of its neighbors t, and the denominator sums exp(π(h, r', t')) over all neighbors t' of node h, i.e. it is the sum of the attention scores of all the neighbors of node h;
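As a non-limiting illustration of step 3.1, the sketch below computes an attention score and normalizes the scores of one node over its neighbors. The normalization follows the description above (exp of each score divided by the sum over all neighbors). The attention form itself, the inner product of M_r·E_t with tanh(M_r·E_h + E_r), is an assumption borrowed from knowledge-graph attention models; this text only confirms that tanh is the activation function.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_score(E_h, E_r, E_t, M_r):
    """Assumed form: pi(h, r, t) = (M_r E_t) . tanh(M_r E_h + E_r)."""
    h_r = mat_vec(M_r, E_h)
    t_r = mat_vec(M_r, E_t)
    return sum(t * math.tanh(h + r) for t, h, r in zip(t_r, h_r, E_r))


def normalize_scores(scores):
    """exp(score) divided by the sum of exp over all neighbors of the node."""
    shift = max(scores)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```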

3.2 propagating information among nodes, in which information is aggregated from the neighbor nodes to the current node; specifically, taking the head node h in the triple (h, r, t) as an example, its neighbor set is N_h = {(h, r, t) | (h, r, t) ∈ G}, and the vector of the neighbors of node h is then represented as:

wherein

represents the information propagated by the neighbor nodes of node h;
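As a non-limiting illustration of step 3.2, the sketch below forms the neighbor information of a node as the sum of its neighbors' vectors weighted by the normalized attention scores. The weighted-sum form is an assumption consistent with aggregating neighbor information by the calculated weights; the function name is hypothetical.

```python
def propagate(neighbor_vectors, weights):
    """Weighted sum of neighbor vectors; weights are the normalized attention scores."""
    if len(neighbor_vectors) != len(weights):
        raise ValueError("one weight is required per neighbor")
    dim = len(neighbor_vectors[0])
    out = [0.0] * dim
    for vec, w in zip(neighbor_vectors, weights):
        for k in range(dim):
            out[k] += w * vec[k]
    return out
```

For example, two neighbors with vectors [1.0, 0.0] and [0.0, 1.0] and weights 0.25 and 0.75 yield the neighbor information [0.25, 0.75].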

step 4, aggregating node information and edge information; the aggregation of the edges between node h and its neighbors

is expressed as:

this information is aggregated by the following function:

wherein LeakyReLU(·) is the activation function, E_h is the initial representation of node h,

is the representation of the edges,

is the information from the neighbors of node h; by aggregating the representations of the nodes and the edges, the information in the heterogeneous information network is fully mined;
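As a non-limiting illustration of step 4, the sketch below combines the initial node representation E_h, the aggregated edge representation and the neighbor information, and applies LeakyReLU. The way the three are combined here (element-wise sum followed by a weight matrix W) is an assumption of this sketch; the text only confirms the three inputs and the activation function. W, the slope value and the function names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leaky_relu(x, slope=0.01):
    """LeakyReLU: identity for positive inputs, a small slope for the rest."""
    return x if x > 0 else slope * x


def aggregate(E_h, E_edges, E_neighbors, W, slope=0.01):
    """Assumed aggregator: LeakyReLU(W (E_h + E_edges + E_neighbors))."""
    summed = [a + b + c for a, b, c in zip(E_h, E_edges, E_neighbors)]
    return [leaky_relu(v, slope) for v in mat_vec(W, summed)]
```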

step 5, predicting the score; through the above steps, the representation E_u of the user node and the representation E_i of the item node are obtained as follows:

the predicted score

is expressed as the inner product of the user node vector representation and the item node vector representation:

the loss function L₂ of the score prediction is as follows:

D＝{(u,i,j)|(u,i)∈R⁺,(u,j)∈R^-} (15)

wherein D is the data set, (u, i) ∈ R⁺ denotes a positive sample and (u, j) ∈ R^- denotes a negative sample; the total loss function L_total is:

L_total＝L₁+L₂ (16)
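As a non-limiting illustration of step 5, the sketch below predicts a score as the inner product of E_u and E_i and sums the two losses as in formula (16). The form of L₂ used here, a pairwise ranking loss over the triples (u, i, j) of D that rewards scoring the positive item i above the negative item j, is an assumption suggested by the definition of D; the function names are hypothetical.

```python
import math


def predict(E_u, E_i):
    """Predicted score: inner product of the user and item vectors."""
    return sum(a * b for a, b in zip(E_u, E_i))


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranking_loss(samples, user_emb, item_emb):
    """Assumed L2: sum over (u, i, j) in D of -ln(sigmoid(score(u, i) - score(u, j)))."""
    total = 0.0
    for u, i, j in samples:
        diff = predict(user_emb[u], item_emb[i]) - predict(user_emb[u], item_emb[j])
        total += softplus(-diff)
    return total


def total_loss(l1, l2):
    """L_total = L1 + L2, as in formula (16)."""
    return l1 + l2
```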

step 6, Top-K evaluation; the recommendation method is evaluated with two commonly used metrics, HR@K and NDCG@K, with the following formulas:

wherein K denotes the first K items in the recommendation result; GT denotes the test set data; rel_i denotes the relevance at the i-th position: typically rel_i is 1 if the item at the i-th position is in the test set and 0 otherwise; Z_K denotes the normalization coefficient.
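As a non-limiting illustration of step 6, the sketch below computes HR@K and NDCG@K for a single user with binary relevance. It assumes the common definitions: HR@K is the number of test-set items among the first K recommendations divided by the size of the test set, and NDCG@K is the discounted gain rel_i / log2(i + 1) summed over the first K positions and normalized by the gain of an ideal ranking, which plays the role of Z_K. These exact forms and the function names are assumptions of this sketch.

```python
import math


def hit_ratio_at_k(ranked, ground_truth, k):
    """Share of test-set items that appear among the first k recommendations."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in ground_truth)
    return hits / len(ground_truth)


def ndcg_at_k(ranked, ground_truth, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    # enumerate is 0-based, so position i (1-based) has discount log2(i + 1) = log2(pos + 2)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in ground_truth
    )
    ideal_hits = min(k, len(ground_truth))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging both values over all users gives the figures usually reported for a Top-K recommender.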

Preferably, the preset value described in step 1.1 is 20.

The invention integrates recent heterogeneous information network learning methods, fuses the relationships between nodes into the learning of the heterogeneous information network, and fully mines the information in the heterogeneous network. Its innovation is that it improves the learning of the heterogeneous information network by explicitly adding the edges between nodes to the learning process and applies the improved learning method to the recommendation task. The method fully captures the relationships between the different types of nodes in movie data and obtains richer semantic information than a traditional homogeneous network. Compared with existing heterogeneous information network learning methods, it additionally captures edge information, which reduces data loss during learning and improves the utilization of the information in the heterogeneous information network.

Drawings

FIG. 1 is a general flow diagram of the process of the present invention.

Detailed Description

The input data of the method comprise two parts: heterogeneous information graph data, i.e. triples, and rating data for training and testing. The output of the method is a list of the top K movies for each user.
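As a non-limiting illustration of the output described above, the sketch below selects the top K movies for one user by inner-product score. Excluding movies the user has already watched is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
import heapq


def top_k_movies(E_u, movie_emb, k, seen=()):
    """Return the ids of the k highest-scoring movies for one user, best first."""
    seen = set(seen)
    scored = (
        (sum(a * b for a, b in zip(E_u, vec)), movie)
        for movie, vec in movie_emb.items()
        if movie not in seen  # assumption: already-watched movies are not recommended
    )
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [movie for _, movie in best]
```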

As shown in fig. 1, the Top-K movie recommendation method based on heterogeneous information network embedding of the present invention includes the following steps:

step 1, preprocessing data, specifically:

1.1 cleaning the data; users with fewer than 20 viewings and movies that have been watched fewer than 20 times are removed from the movie data to complete the cleaning;

1.2 constructing the heterogeneous information network and the rating data set; the users, movies, directors, actors and genres are encoded, and the relationships between users and movies, between directors and movies, and between actors and movies are encoded; the triples and the rating data set are constructed, and the rating data set is randomly split into training data and test data, the training samples comprising positive samples and negative samples; the heterogeneous information network is constructed from the cleaned data, which are organized into triples that represent the heterogeneous information network, each triple having the following form:

(h,r,t) (1)

step 2, learning an embedding of the heterogeneous information network;

2.1 initializing the embedding; the constructed triple data are taken, in the form of an adjacency matrix, as the input of the heterogeneous information network embedding learning, and the nodes are initialized by the following formula,

the vector representations of nodes h and t after mapping into the space of r are, respectively:

2.2 learning the heterogeneous information network; the embedding of the heterogeneous information network is learned through a score function, where f(h, r, t) represents the score function:

by means of this function, connected nodes are drawn close to each other while unconnected nodes are pushed apart; the loss function of this learning process is defined as:

wherein (h, r, t) ∈ G represents a positive sample triple in the heterogeneous information network,

is a negative sample, and G denotes the heterogeneous information network;

step 3, propagating information in the heterogeneous information network; the information propagation between a node and its neighbors is computed; a node has multiple neighbors, and the importance of each neighbor to the node differs, so the weights between the node and its different neighbors are computed first, and information is then propagated between the node and its neighbors; specifically:

3.1 calculating the attention scores between the nodes and the neighbors;

the importance of different neighbors to a node varies; for this reason, the degree of importance, i.e. the weight between a node and its neighbor, is measured by π(h, r, t), where tanh(·) is the activation function:

after the weights of the node and all its neighbors are computed, these importance values are normalized:

wherein N_h = {(h, r, t) | (h, r, t) ∈ G} denotes the neighbors of node h, the numerator exp(π(h, r, t)) represents the attention score of node h and one of its neighbors t, and the denominator sums exp(π(h, r', t')) over all neighbors in N_h, i.e. it is the sum of the attention scores of all the neighbors of node h;

3.2 propagating information among nodes; the information propagated by the neighbors of a node is aggregated using the computed weights,

represents the information propagated from the neighbors:

step 4, aggregating node information and edge information;

the node information, the information propagated by the neighbors of the node, and the information of the edges between the node and its neighbors are aggregated; first, the edges between node h and its neighbors are aggregated, expressed as

then the three are aggregated, with LeakyReLU(·) as the activation function:

step 5, predicting the score; finally, through the above steps, the final vector representations of the user node and the movie node are obtained, denoted E_u and E_i respectively:

the predicted score is:

the loss function in the score prediction process is:

D＝{(u,i,j)|(u,i)∈R⁺,(u,j)∈R^-} (15)

wherein D is the data set, (u, i) ∈ R⁺ denotes a positive sample and (u, j) ∈ R^- denotes a negative sample;

the total loss function of the entire model is L_total:

L_total＝L₁+L₂ (16)

step 6, Top-K evaluation; after the whole learning process is completed, the results output by the model are evaluated; the output of the model is a list of the top K movie numbers for each user, and the recommendation results are evaluated by the two metrics HR@K and NDCG@K:

all steps of the entire recommendation are now complete.

The embodiments described in this specification merely illustrate implementations of the inventive concept. The scope of protection of the present invention should not be regarded as limited to the specific forms set forth in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive on the basis of the inventive concept.

Claims

1. A Top-K movie recommendation method based on heterogeneous information network embedding, comprising the following specific steps:

step 1, preprocessing data, specifically comprising:

(h,r,t) (1)

2.1 initializing the embedding; first, the vectors of the nodes and edges in the heterogeneous information network are initialized; a TransR model is adopted, in which the nodes and edges are initialized with vectors of the same dimension, E_h, E_t and E_r denoting the head node, the tail node and the edge respectively; the nodes are then mapped according to the type of relationship, i.e. for each relationship r there is a mapping matrix M_r that maps the nodes into the vector space of the relationship r, with the following formula:

wherein

respectively represent the vectors of nodes h and t after mapping into the space of r;

is a negative example, G denotes a heterogeneous information network;

3.1 calculating the attention scores between the nodes and the neighbors;

Represents the sum of the attention scores of all the neighbors of node h;

wherein

Information transmitted by neighbor nodes of the node h is represented;

Expressed as:

to aggregate this information, it is implemented by the following function:

is the representation of the edges,

scoring the predictions

the loss function L₂ of the score prediction is as follows:

D＝{(u,i,j)|(u,i)∈R⁺,(u,j)∈R^-} (15)

wherein D is the data set, (u, i) ∈ R⁺ denotes a positive sample and (u, j) ∈ R^- denotes a negative sample; the total loss function L_total is:

L_total＝L₁+L₂ (16)

2. The Top-K movie recommendation method based on heterogeneous information network embedding of claim 1, wherein the preset value described in step 1.1 is 20.