WO2012093046A2

WO2012093046A2 - Hybrid content recommendation system using matrices breakdowns

Info

Publication number: WO2012093046A2
Application number: PCT/EP2011/073921
Authority: WO
Inventors: Jean-Ronan Vigouroux; Louis Chevallier; Anne Lambert
Original assignee: Thomson Licensing
Priority date: 2011-01-05
Filing date: 2011-12-23
Publication date: 2012-07-12

Description

Hybrid content recommendation system using matrices breakdowns

FIELD OF THE INVENTION

The invention relates to a method for recommending contents, such as movies, by using matrix factorizations.

BACKGROUND

Content recommendation systems are used to predict, for each user, the rating that the user would have given to a content which has not been viewed yet. This has various applications:

- Help in the content selection

- Help in the exploration of a catalogue

- Possibility to anticipate viewing by downloading a content

Recommendation systems are commonly of the collaborative type, i.e. the recommendations for a user are developed from the ratings given by other users on items proposed for consumption.

The state of the art on the recommendation systems is presented in the following article:

Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Gediminas Adomavicius and Alexander Tuzhilin, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, June 2005.

The recommendation based on factorization is documented in:

Matrix Factorization Techniques for Recommender Systems, Yehuda Koren, Robert Bell and Chris Volinsky, IEEE Computer, Vol. 42, No. 8, 2009.

A more detailed presentation can be found in:

Factorization meets the Neighborhood: a Multifaceted Collaborative Filtering Model, Yehuda Koren, KDD'08, August 24-27, 2008, Las Vegas, Nevada, USA.

The advantage of collaborative recommender systems is that they provide precise and diversified recommendations which can be calculated at a reasonable cost, particularly on the basis of techniques for the factorization of the ratings matrix.

However, collaborative recommender systems cannot establish recommendations for new items which have not yet been rated by other users. Since they are based on transferring the ratings of other users to a given user, they cannot operate in the absence of ratings. This problem is commonly called the 'cold-start problem'.

Content-based recommender systems constitute another strategy, which enables this lack of initial ratings to be handled. In general, this involves establishing a rating function from metadata describing the content and not from the content itself. Different approaches have been proposed; in general they prove to be less precise than collaborative approaches. However, they enable recommendations to be calculated in the absence of ratings established by other users.

Hybrid recommender systems constitute an attempt to reconcile the advantages of the two approaches, on the one hand by enabling contents that have few or no ratings to be evaluated, and on the other hand by offering the precision and diversity typical of collaborative recommender systems.

Different approaches have been proposed to implement hybrid recommender systems. Surveys of the state of the art can be found in:

Incorporating Contextual Information In Recommender Systems Using a Multidimensional Approach, Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen and Alexander Tuzhilin, ACM Transactions on Information Systems, Vol. 23, No. 1, Jan. 2005, pp. 103-145.

A Survey of Collaborative Filtering Techniques, Xiaoyuan Su and Taghi M. Khoshgoftaar, Hindawi Publishing Corporation, Advances in Artificial Intelligence, Vol. 2009, Article ID 421425, 2009.

SUMMARY

An alternative approach to hybrid recommendation based on the decomposition of matrices for the rating and representation of documents in a vector space is proposed.

The invention consists in a method for determining a rating vector for a content to be rated, wherein each element of the rating vector is a predicted rating associated to a user of a set of users, the rating vector being determined in a system wherein the following data are provided:

- A set of contents and a set of metadata;

- A metadata matrix, wherein a row of the metadata matrix characterizes the relevance of the set of contents to a piece of metadata of the set of metadata, and wherein a column of the metadata matrix characterizes the relevance of a content of the set of contents to the set of metadata;

- A rating matrix, wherein a row of the rating matrix characterizes ratings given by a user of the set of users to the set of contents, and wherein a column of the rating matrix characterizes the rating given by the set of users to a content of the set of contents;

the method comprising the following steps:

- Factorizing the content metadata matrix in a product of a first column vector and a first row vector;

- Factorizing the rating matrix in a product of a second column vector and a second row vector;

- Ingesting a new content metadata column vector characterizing the metadata of the content to be rated, the new content metadata vector having the same height as the first row vector;

- Completing the second row vector using a similarity function representative of similarity between the new content metadata vector and the vectors of the first row vector, such that the second row vector is completed by a collaborative factors column vector having the same height as the second row vector and having the same width as the new content metadata vector;

- Concatenating the new content metadata vector and the collaborative factors column vector such as to get a last vector;

- Multiplying the second column vector with the last vector, such as to determine the rating vector; and

- Displaying at least one element of the rating vector.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows a rating matrix

Figure 2 shows a metadata matrix

Figures 3, 4 and 5 show improved rating matrices

Figures 6 to 10 show numerical examples

DETAILED DESCRIPTION

Figure 1 shows a rating matrix. A useful piece of data in content recommendation is the ratings matrix. With the users placed in rows and the contents in columns, the collected ratings are placed in a matrix R, such that r_ui is the rating given by the user u to the content i. If no rating has been given, the entry is said to be missing. In usual recommendation problems, 99% of the data are missing, and the recommendation algorithms are precisely intended to estimate the missing data.

It is known that the matrix R can be approximated by a product of two factors W and H :

R = WH

The W and H matrices contain data which are respectively characteristic of the users and of the contents. A column H_i contains values which indicate the characteristics of a content i, and a row W_u contains the sensitivities of a user u to the different characteristics.

Different algorithms are available to perform this factorization of an incomplete matrix. For example:

Large-scale collaborative filtering algorithms, Ma Chih-Chao, Master's thesis, Department of Computer Science and Information Engineering, College of Electrical Engineering & Computer Science, National Taiwan University, 2008.

A. Ilin and T. Raiko. Practical approaches to principal component analysis in the presence of missing values. Accepted to Journal of Machine Learning Research, 2010.
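By way of non-limiting illustration, the factorization of an incomplete matrix can be sketched as follows. This is a minimal sketch and not the algorithm of the cited references: it assumes a plain batch gradient descent with a small regularization term, and the names `factorize_incomplete`, `rmse`, `mask`, `lr` and `reg`, as well as the toy data, are hypothetical.

```python
import numpy as np

def factorize_incomplete(R, mask, rank, n_iter=3000, lr=0.005, reg=0.02, seed=0):
    """Approximate R by W @ H using only the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_contents = R.shape
    W = 0.1 * rng.standard_normal((n_users, rank))     # user factors
    H = 0.1 * rng.standard_normal((rank, n_contents))  # content factors
    for _ in range(n_iter):
        # Reconstruction error, set to zero where the rating is missing.
        E = np.where(mask, R - W @ H, 0.0)
        W_step = E @ H.T - reg * W
        H_step = W.T @ E - reg * H
        W += lr * W_step
        H += lr * H_step
    return W, H

def rmse(R, mask, W, H):
    """Root mean squared reconstruction error on the observed entries."""
    E = (R - W @ H)[mask]
    return float(np.sqrt(np.mean(E ** 2)))

# Hypothetical toy data: 6 users x 5 contents, ratings from 1 to 5.
rng = np.random.default_rng(1)
R = rng.integers(1, 6, size=(6, 5)).astype(float)
mask = rng.random((6, 5)) < 0.7          # True where a rating is known
W, H = factorize_incomplete(R, mask, rank=2)
print("RMSE on known ratings:", rmse(R, mask, W, H))
```

The missing entries are then estimated by reading the corresponding values of `W @ H`.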

Figure 2 shows a metadata matrix. Likewise, the metadata of a collection of contents can be represented in a metadata matrix M. This matrix contains in rows the different possible metadata, for example the list of the known actors, and in columns the different contents, for example the different movies. m_ui has a value which represents the relevance of the metadata u for the movie i. For example, m_ui = 1 if the actor u plays in the movie i, and m_ui = 0 otherwise. m_ui can also be a value indicating the importance of the actor u in the cast of i. The metadata can be the actors, genres, keywords relative to the movies, summary words, directors, budgets, and the like. Any dataset relative to the contents collection can be considered as potential metadata which can be incorporated into additional rows of the matrix M.

This method of processing the metadata is inspired by the representation known as the 'Vector Space Model' in text analysis, where the different texts of a collection are represented in a matrix, one axis of which is associated with the words of the collection vocabulary, and the other one with the different documents. The values of the matrix elements represent the presence or the absence of a word in a text, or the importance of a word in a text, for example measured by a value such as TF-IDF (term frequency-inverse document frequency).

The first step of the hybrid processing consists in approximately factorizing this metadata matrix (in general sparse, i.e. containing mostly zeros). This yields two matrices W_M and H_M such that M = W_M H_M. The factorization rank, i.e. the width of W_M or the height of H_M, determines the approximation quality. As above, the columns of H_M are data intrinsic to the movies, and the rows of W_M contain the relevance of particular metadata with regard to the possible characteristics of the movies.
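By way of non-limiting illustration, the effect of the factorization rank on the approximation M = W_M H_M can be sketched as follows. The description does not specify the factorization algorithm; a truncated singular value decomposition is assumed here, and the toy matrix `M` and the helper `factorize_metadata` are hypothetical.

```python
import numpy as np

# Hypothetical toy metadata matrix: rows = metadata (e.g. genres), columns = contents.
M = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

def factorize_metadata(M, rank):
    """Rank-limited factorization M ~ W_M @ H_M (assumption: truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    W_M = U[:, :rank] * s[:rank]   # metadata x rank
    H_M = Vt[:rank, :]             # rank x contents
    return W_M, H_M

errors = []
for rank in (1, 2, 3, 4):
    W_M, H_M = factorize_metadata(M, rank)
    errors.append(float(np.linalg.norm(M - W_M @ H_M)))
    print(f"rank {rank}: reconstruction error {errors[-1]:.4f}")
```

The reconstruction error does not increase as the rank grows, which is the sense in which the rank determines the approximation quality.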

It will now be explained how rating estimation is improved by the use of hybrid information.

Figure 3 shows an improved rating matrix. A first contribution consists in noticing that H_M constitutes additional information enabling the contents to be characterized, and that it is possible to use it to obtain a better rating evaluation. For this, the method that is proposed consists in factorizing the ratings matrix R by fixing a part of the right factor:

R = W_C H_C

where W_C and H_C are the compound factors using hybrid information based on the content and the collaborative data. The proposed method consists in fixing a part of the values in H_C as equal to the content-based factors H_M, and in discovering the other values by the same optimization algorithm as the one enabling the factors of R to be discovered.

Thus, the rating predictions r_ui will be made by row-by-column multiplications of a row of W_C with a column of H_C, which use at least as much information as the standard collaborative method W_u H_i.
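By way of non-limiting illustration, a factorization in which part of the right factor is fixed can be sketched as follows. This is a sketch under assumptions: the optimizer is a plain batch gradient descent, the fixed rows are placed at the top of H_C, and the names `factorize_with_fixed_rows`, `n_free`, `lr` and `reg` and the toy data are hypothetical.

```python
import numpy as np

def factorize_with_fixed_rows(R, mask, H_M, n_free, n_iter=3000, lr=0.005,
                              reg=0.02, seed=0):
    """R ~ W_C @ H_C, where the first rows of H_C are held equal to H_M."""
    rng = np.random.default_rng(seed)
    n_users, n_contents = R.shape
    n_fixed = H_M.shape[0]
    W_C = 0.1 * rng.standard_normal((n_users, n_fixed + n_free))
    H_free = 0.1 * rng.standard_normal((n_free, n_contents))
    for _ in range(n_iter):
        H_C = np.vstack([H_M, H_free])
        E = np.where(mask, R - W_C @ H_C, 0.0)        # error on known ratings only
        W_step = E @ H_C.T - reg * W_C
        # Only the free (collaborative) rows of H_C are updated.
        H_step = W_C[:, n_fixed:].T @ E - reg * H_free
        W_C += lr * W_step
        H_free += lr * H_step
    return W_C, np.vstack([H_M, H_free])

# Hypothetical toy data.
rng = np.random.default_rng(2)
R = rng.integers(1, 6, size=(6, 5)).astype(float)
mask = rng.random((6, 5)) < 0.7
H_M = rng.standard_normal((2, 5))                     # content-based factors
W_C, H_C = factorize_with_fixed_rows(R, mask, H_M, n_free=2)
err = (R - W_C @ H_C)[mask]
print("RMSE on known ratings:", float(np.sqrt(np.mean(err ** 2))))
```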

It will now be explained how the cold-start problem is solved.

Figure 4 shows an improved rating matrix. A second contribution consists in noticing that the 'Content Based Factors' can be obtained even for contents which have received no rating, which makes it possible to solve the cold-start problem mentioned above: the prediction of ratings for contents not rated yet. Several strategies are conceivable; some of them are presented below.

A first possible strategy: Use of neighbourhoods

This type of strategy consists in searching in the matrix H_M for the vector or vectors closest to a given vector A describing the metadata of a content, in forming the barycentre of the vectors of corresponding collaborative factors, and in using this vector to calculate the rating predictions.

In a first implementation, the ratings matrix is factorized in a standard manner to obtain the factors relative to the users W and to the contents H. The ingestion of a new document consists in calculating the vector A representing this document. This can be either the metadata describing the document, or the projection of this vector onto the space obtained from the factorization of the document collection metadata matrix, or a vector obtained from the complete re-factorization of the document collection metadata matrix. Then the k (k ≥ 1) vectors closest to A, for a certain similarity function, are searched. A commonly employed similarity function is the cosine of the two vectors. Let h_i be the i-th vector closest to A and s_i the similarity between A and h_i. Then

$$B = \frac{\sum_{i \le k} s_i h_i}{\sum_{i \le k} s_i}$$ (equation 1)

and the ratings for the new content are estimated from the user profiles W and B.

This algorithm version works in a satisfactory manner.
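By way of non-limiting illustration, the neighbourhood strategy can be sketched as follows. The sketch assumes, in line with the barycentre of collaborative factors described above, that the neighbours are found by cosine similarity between content description vectors and that the weighted average is taken over the corresponding columns of H. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def cosine_similarities(a, X):
    """Cosine between the vector a and every column of X."""
    norms = np.linalg.norm(X, axis=0) * np.linalg.norm(a)
    return (a @ X) / np.where(norms == 0, 1.0, norms)

def collaborative_factors_for_new_content(a, content_vectors, H, k):
    """Similarity-weighted barycentre of the collaborative factors (columns of H)
    of the k rated contents closest to a. Assumes the k similarities do not sum to zero."""
    s = cosine_similarities(a, content_vectors)
    nearest = np.argsort(-s)[:k]          # indices of the k most similar contents
    weights = s[nearest]
    return H[:, nearest] @ weights / weights.sum()

# Hypothetical toy data: 3 metadata x 4 rated contents, 2 collaborative factors.
content_vectors = np.array([[1.0, 0.0, 1.0, 0.0],
                            [0.0, 1.0, 1.0, 0.0],
                            [0.0, 1.0, 0.0, 1.0]])
H = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, -1.0, 0.0, 2.0]])
W = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # two user profiles
a = np.array([1.0, 1.0, 0.0])              # description of the new content

b = collaborative_factors_for_new_content(a, content_vectors, H, k=2)
predicted_ratings = W @ b                  # one predicted rating per user
print(b, predicted_ratings)
```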

Figure 5 shows a further improved rating matrix. In this second implementation, the matrix R has been factorized by incorporating the profiles based on the metadata into the contents profile, as indicated before. In this case, the profile vector completion algorithm for a new content is identical to that indicated above: from the new content profile A, a collaborative profile B is determined, and a ratings prediction is made from the profile for the new content.
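By way of non-limiting illustration, the prediction step of this second implementation can be sketched as follows: the content-based profile A and the collaborative profile B are stacked into a single profile, which is multiplied by the compound user factors W_C. The ordering (A first, then B) and all numerical values are assumptions made for the example.

```python
import numpy as np

def predict_with_compound_profile(W_C, a, b):
    """Predicted ratings for a new content: W_C times the stacked profile [a; b]."""
    profile = np.concatenate([a, b])
    if W_C.shape[1] != profile.shape[0]:
        raise ValueError("the width of W_C must equal len(a) + len(b)")
    return W_C @ profile

# Hypothetical values: 3 users, 2 content-based factors, 2 collaborative factors.
W_C = np.array([[1.0, 0.0, 2.0, 0.0],
                [0.0, 1.0, 0.0, 1.0],
                [0.5, 0.5, 1.0, -1.0]])
a = np.array([1.0, 2.0])     # content-based profile of the new content
b = np.array([0.5, -1.0])    # collaborative profile determined from a
ratings = predict_with_compound_profile(W_C, a, b)
print(ratings)
```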

Below are given two other collaborative factor calculation strategies for the new content. Both can be implemented using the collaborative profile of the user for the prediction or using a compound profile like above.

A second possible strategy: Multiple regression

This type of strategy consists in determining B from A (notation as above) by using a multiple regression from the independent variable 'Content Based Factors' to the dependent variable 'Collaborative Factors'.
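By way of non-limiting illustration, the multiple regression can be sketched as an ordinary least-squares fit with an intercept, trained on the contents that already have both kinds of factors. The choice of ordinary least squares, the function names and the synthetic data are assumptions.

```python
import numpy as np

def fit_regression(content_factors, collaborative_factors):
    """Least-squares map from content-based factors to collaborative factors.
    Both arguments hold one column per already-rated content."""
    n_contents = content_factors.shape[1]
    X = np.vstack([content_factors, np.ones(n_contents)]).T   # contents x (p + 1)
    Y = collaborative_factors.T                               # contents x q
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef                                               # (p + 1) x q

def predict_collaborative(coef, a):
    """Collaborative factors B predicted for a new content described by a."""
    return np.append(a, 1.0) @ coef

# Synthetic example in which the relation is exactly linear.
rng = np.random.default_rng(3)
H_M = rng.standard_normal((3, 12))           # content-based factors of 12 contents
T = rng.standard_normal((2, 3))
c = np.array([0.5, -1.0])
H = T @ H_M + c[:, None]                     # collaborative factors
coef = fit_regression(H_M, H)
a = rng.standard_normal(3)
b = predict_collaborative(coef, a)
print(b)
```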

A further possible strategy: Neural network

Another option consists in training a multilayer perceptron taking as input the 'Content Based Factors' and providing as output the 'Collaborative Factors'. A scale normalization can be necessary, the transition functions of neural networks typically having an output in [0;1] or [-1;1]. In this case, the necessary multiplicative coefficients are transferred to the 'Collaborative User Factors' or the 'Compound User Profile' according to the chosen embodiment. For a new content, the network is then used in prediction to obtain B from A.
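By way of non-limiting illustration, the scale normalization and the transfer of the multiplicative coefficients to the user factors can be sketched as follows. The network architecture (one hidden layer, tanh units, batch gradient descent on a squared error) is an assumption, as are all names and the synthetic data.

```python
import numpy as np

def train_mlp(X, Y, n_hidden=8, n_iter=2000, lr=0.05, seed=0):
    """X: contents x p content-based factors; Y: contents x q collaborative factors."""
    rng = np.random.default_rng(seed)
    scale = np.abs(Y).max(axis=0)            # one multiplicative coefficient per output
    scale[scale == 0] = 1.0
    T = Y / scale                            # targets now lie in [-1, 1]
    W1 = 0.5 * rng.standard_normal((X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = 0.5 * rng.standard_normal((n_hidden, Y.shape[1]))
    b2 = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        hidden = np.tanh(X @ W1 + b1)
        out = np.tanh(hidden @ W2 + b2)
        d_out = (out - T) * (1.0 - out ** 2) / len(X)       # backpropagated error
        d_hidden = (d_out @ W2.T) * (1.0 - hidden ** 2)
        W2 -= lr * hidden.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_hidden
        b1 -= lr * d_hidden.sum(axis=0)
    return W1, b1, W2, b2, scale

def network_output(params, a):
    """Raw network output in [-1, 1] for a new content described by a."""
    W1, b1, W2, b2, _ = params
    return np.tanh(np.tanh(a @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(1)
H_M = rng.standard_normal((3, 30))                       # content-based factors
H = 4.0 * np.tanh(rng.standard_normal((2, 3)) @ H_M)     # synthetic collaborative factors
params = train_mlp(H_M.T, H.T)
scale = params[4]

a = rng.standard_normal(3)
out = network_output(params, a)
W = rng.standard_normal((5, 2))                          # user factors
ratings = W @ (scale * out)                  # rescale the network output, or ...
ratings_transferred = (W * scale) @ out      # ... move the coefficients into W
```

Both expressions give the same ratings; the second one corresponds to transferring the coefficients to the user factors once, so that the raw network output can be used directly.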

A concrete case will now be described hereafter.

The ratings matrix R is based on the ratings made public by Netflix on the occasion of a competition organized by this company. This matrix of size 480 189 rows x 17 770 columns contains 99 072 112 ratings, that is to say about 1% of the entries are known. This matrix has been factorized into two matrices W and H of size 480 189 x 50 and 50 x 17 770 respectively.
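As a quick check of these orders of magnitude, the proportion of known entries and the number of values stored by the two factors can be computed from the figures above (a sketch; the variable names are hypothetical):

```python
n_users, n_contents, n_ratings, rank = 480_189, 17_770, 99_072_112, 50

density = n_ratings / (n_users * n_contents)          # share of known entries
factor_entries = n_users * rank + rank * n_contents   # values stored in W and H

print(f"known entries: {density:.2%}")
print(f"values stored in W and H: {factor_entries:,}")
```

The proportion of known entries comes out at a little over 1%, and the two factors together hold fewer values than there are known ratings.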

A selection of movies was made in the TMS (Tribune Media Services) film database. The selected movies are the TMS movies which were also in the Netflix data. This gave a first base of about 5,000 movies. The selected movies are known as 'hot'. For these movies, ratings were given by the users of Netflix. Therefore, a matrix R_T of ratings, of size 480 189 x 5,000, as well as the corresponding factors W_T and H_T are provided: W_T is equal to W, and H_T is obtained by the restriction of H to the columns corresponding to the selected TMS movies.

The matrix R_T was then enriched with 2,000 recent movies, also selected in the TMS base. These movies are known as 'cold'. To do this, the IMDb (Internet Movie Database) metadata for these different movies was collected, and a matrix containing the genres, keywords and actors for each of these movies was formed. There are 28 genres, 98,689 keywords and 1,947,647 actors in the matrix. This therefore gives a metadata matrix of width 7,000 and of height 2,046,364.

According to the first implementation above, the factors are then calculated for the cold movies. To do this, for each column of the metadata matrix corresponding to a cold movie, the closest columns corresponding to hot movies are searched, together with the similarity to these columns. The collaborative factor for the cold movie is the sum of the factors for these hot movies, weighted by the similarity, according to the formula of equation 1 above.

A numerical example will now be detailed. The following example is intended to illustrate what precedes without any limitation of the general description above. It is assumed that 20 users have rated 15 different movies, as described in Figure 6. Predictions will be made for the table described in Figure 6, that is to say the incomplete ratings matrix R, which is factorized into two factors W (Figure 7) and H (Figure 8), such that R = WH, obtained by minimizing the table reconstruction error (the minimum error that has been obtained is equal to RMSE = 0.29723).

The incomplete matrix R has been factorized here by using a gradient algorithm. The details for the implementation of this algorithm can in particular be found in:

Large-scale collaborative filtering algorithms, Ma Chih-Chao, Master's thesis, Department of Computer Science and Information Engineering, College of Electrical Engineering & Computer Science, National Taiwan University, 2008.

To make predictions for a new movie not rated yet, the metadata of the 15 movies and of the new movie are considered. It is assumed that there are five possible genres for each movie, and that the genres have been assigned according to the table described in Figure 9. The new movie therefore has the Romantic, Fiction and War genres. The similarities to the 15 already rated movies are calculated here by the vector cosine, as shown in Figure 10. The rated movies closest to the new movie are therefore Apocalypse Now, Titanic and The Pianist, if a neighbourhood of size equal to 3 is used. The similarity calculation by cosine, for example here between Titanic and 'New movie', is made as follows:

$$\text{Similarity}(\text{Titanic}, \text{New movie}) = \frac{1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 0^2 + 0^2 + 0^2 + 0^2} \, \sqrt{1^2 + 0^2 + 0^2 + 1^2 + 1^2}} = 0.57735\ldots$$
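The cosine similarity between Titanic and the new movie can be checked numerically with the two genre vectors that appear in the formula (a short sketch; the variable names are hypothetical):

```python
import numpy as np

titanic = np.array([1, 0, 0, 0, 0], dtype=float)    # genre vector of Titanic
new_movie = np.array([1, 0, 0, 1, 1], dtype=float)  # genre vector of the new movie

similarity = titanic @ new_movie / (np.linalg.norm(titanic) * np.linalg.norm(new_movie))
print(round(float(similarity), 5))
```

The value equals 1/√3, i.e. the 0.57735 quoted in the example.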

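The vector allowing rating predictions for the new movie will then be:

$$\frac{0.57735 \begin{bmatrix} 1.92682 \\ 6.544927 \end{bmatrix} + 0.57735 \begin{bmatrix} 0.332712 \\ 4.881872 \end{bmatrix} + 0.816497 \begin{bmatrix} 10.23758 \\ -1.56423 \end{bmatrix}}{0.57735 + 0.57735 + 0.816497} = \begin{bmatrix} 4.902346 \\ 2.698905 \end{bmatrix}$$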

The column vectors used for the calculation above come from matrix H, as shown in Figure 11.
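The weighted barycentre of equation 1 can be reproduced numerically for this example. In the sketch below, the three similarities and the three columns of H are those of the calculation above; the second component of the third column is taken as -1.56423, the value that is consistent with the stated result. The variable names are hypothetical.

```python
import numpy as np

similarities = np.array([0.57735, 0.57735, 0.816497])

# Columns of H for the three closest rated movies, one column per movie.
neighbour_factors = np.array([[1.92682, 0.332712, 10.23758],
                              [6.544927, 4.881872, -1.56423]])

# Equation 1: similarity-weighted average of the neighbours' factor vectors.
b = neighbour_factors @ similarities / similarities.sum()
print(b)
```

The result agrees with the vector [4.902346; 2.698905] to within rounding of the inputs.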

As a consequence, the rating forecasts of the users for the new movie, obtained by multiplying the factor W calculated previously by the column vector [4.902346; 2.698905], are the following (one value per user, in the row order of W):

3.174834, 27.83584, 1.788494, 3.579768, 2.664382,

30.86025, 10.62016, 1.840971, 19.49864, -20.046,

7.648723, 2.059543, 2.380132, 2.794481, 15.11235,

4.080328, 40.76849, 58.63928, -42.065, 52.10909

Among these forecasts, the highest value (here 58.63928) corresponds to the best predicted rating, and the lowest value (-42.065) corresponds to the least good predicted rating.

These ratings are advantageously rescaled to a 1 to 5 interval, for example, depending on the application.
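By way of non-limiting illustration, one possible way of bringing the forecasts into a 1 to 5 interval is a linear min-max mapping. The description does not specify the mapping, so this choice and the function name `rescale` are assumptions; the input values are the twenty forecasts listed above.

```python
import numpy as np

raw = np.array([3.174834, 27.83584, 1.788494, 3.579768, 2.664382,
                30.86025, 10.62016, 1.840971, 19.49864, -20.046,
                7.648723, 2.059543, 2.380132, 2.794481, 15.11235,
                4.080328, 40.76849, 58.63928, -42.065, 52.10909])

def rescale(x, low=1.0, high=5.0):
    """Map the smallest value to `low` and the largest to `high`, linearly."""
    return low + (x - x.min()) * (high - low) / (x.max() - x.min())

scaled = rescale(raw)
print(np.round(scaled, 2))
```

Such a mapping preserves the ordering of the forecasts, so the user with the highest raw value still receives the highest rating.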

Claims

Method for determining a rating vector for a content to be rated, wherein each element of the rating vector is a predicted rating associated to a user of a set of users, the rating vector being determined in a system wherein the following data are provided:

A set of contents and a set of metadata

Ingesting a new content metadata column vector characterizing the metadata of the content to be rated, the new content metadata vector having the same height as the first row vector;

Multiplying the second column vector with the last vector, such as to determine the rating vector ; and

Displaying at least one element of the rating vector

Method according to claim 1, wherein the new content metadata is inputted manually