KR101442122B1 - Apparatus and Method for Content Recommendation

Info

Publication number: KR101442122B1
Application number: KR1020120142841A
Authority: KR
Inventors: 장해; 김종권
Original assignee: 서울대학교산학협력단
Priority date: 2012-12-10
Filing date: 2012-12-10
Publication date: 2014-10-07
Also published as: KR20140084365A

Abstract

The present invention discloses a content recommendation apparatus and method. A content recommendation apparatus according to one aspect of the invention includes: a collection unit that collects, from online content rating data having a power-law distribution, rating data corresponding to a first type of content for which a user wants recommendations; a preprocessing unit that selects, from the collected rating data, the rating data of users who have rated the first type of content and whose number of ratings is at least a threshold number; a prediction unit that uses the selected rating data to predict ratings that have not yet been given and calculates overall rating data reflecting, for every recommendation-target content item of the first type, the actual and predicted ratings of all selected users; and a recommendation unit that averages the ratings per recommendation-target content item in the overall rating data and provides the user with a preset number of content items, starting from the one with the highest average rating.

Description

APPARATUS AND METHOD FOR CONTENT RECOMMENDATION

The present invention relates to content recommendation and, more specifically, to a content recommendation apparatus and method capable of recommending content using users' rating data.

Recently, online word of mouth in online social networks such as Epinions, Facebook, and IMDB has come to strongly influence people's purchases, movie-going, and similar decisions.

In addition, online movie ratings, which are part of film studios' marketing strategies, are drawing attention from people who want to watch movies that are likely to satisfy them.

For example, rating information from movie rating sites such as MovieLens and Rotten Tomatoes abroad and Daum and Naver in Korea influences viewers' choice of movies.

The present invention was conceived against the technical background described above, and its object is to provide a content recommendation apparatus and method that predict content ratings using rating data whose distribution follows a power law.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

A content recommendation apparatus according to one aspect of the present invention includes: a collection unit that collects, from online content rating data having a power-law distribution, rating data corresponding to a first type of content for which a user wants recommendations; a preprocessing unit that selects, from the collected rating data, the rating data of users who have rated the first type of content and whose number of ratings is at least a threshold number; a prediction unit that uses the selected rating data to predict ratings that have not yet been given and calculates overall rating data reflecting, for every recommendation-target content item of the first type, the actual and predicted ratings of all selected users; and a recommendation unit that averages the ratings per recommendation-target content item in the overall rating data and provides the user with a preset number of content items, starting from the one with the highest average rating.

A content recommendation method performed by an apparatus according to another aspect of the present invention includes: collecting, from online content rating data having a power-law distribution, rating data corresponding to a first type of content for which a user wants recommendations; selecting, from the collected rating data, the rating data of users who have rated the first type of content and whose number of ratings is at least a threshold number; using the selected rating data to predict ratings that have not yet been given and calculating overall rating data reflecting, for every recommendation-target content item of the first type, the actual and predicted ratings of all selected users; and averaging the ratings per recommendation-target content item in the overall rating data and providing the user with a preset number of content items, starting from the one with the highest average rating.

According to the present invention, the accuracy of rating prediction using rating data can be improved.

FIGS. 1 and 2 are graphs showing the in-degree and out-degree distributions of the Daum movie rating network.
FIG. 3 is a diagram explaining the matrix factorization method.
FIG. 4 is a block diagram showing a content recommendation apparatus according to an embodiment of the present invention.
FIGS. 5 and 6 are graphs comparing the accuracy of rating prediction according to an embodiment of the present invention with other rating prediction methods when the threshold number is set to 5 (FIG. 5) and to 10 (FIG. 6).

The advantages and features of the present invention, and the ways of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the invention. In this specification, the singular form also includes the plural unless specifically stated otherwise. As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Before describing the specific configuration of the present invention, the movie rating network is analyzed.

For this purpose, rating data of the Daum movie rating network from February 25, 2009 to May 14, 2012 were crawled. Here, crawling is a technique of collecting documents distributed across a large number of computers and including them in the index of a search target.

In the Daum movie rating data for this period (2009-02-25 to 2012-05-14), the total number of movies is 11,682, the number of users is 140,271, and the number of ratings that users gave to movies is 425,537.

FIGS. 1 and 2 are graphs showing the in-degree and out-degree distributions of the Daum movie rating network. Here, the in-degree is the number of users who rated a given movie, and the out-degree is the number of movies rated by a given user.

In FIGS. 1 and 2, the Y axis is the complementary cumulative distribution function (CCDF), which equals 1 minus the value of the cumulative distribution function (CDF) at x.

For example, in FIG. 1, y = 0.01 at x = 1000, which means that movies rated 1000 or more times account for a fraction of 0.01 of all movies.
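As a rough illustration of how such a curve could be computed, the following sketch builds an empirical CCDF from a list of degrees. It follows the "x or more" reading of the example above, so each value is the fraction of nodes whose degree is at least x (for discrete data this differs from 1 minus the CDF at x only by the mass at x itself). The function name and the sample degrees are hypothetical and are not taken from the Daum data.

```python
from collections import Counter

def empirical_ccdf(degrees):
    """Return sorted (x, fraction of nodes with degree >= x) pairs."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = []
    remaining = n  # number of nodes whose degree is >= the current x
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf

# Hypothetical in-degrees (number of ratings received) for ten movies.
in_degrees = [1, 1, 1, 1, 2, 2, 3, 5, 8, 1000]
ccdf = empirical_ccdf(in_degrees)
for x, fraction in ccdf:
    print(x, fraction)
```

Plotting these pairs on log-log axes is the usual way to check visually whether the tail is close to a straight line.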

In FIGS. 1 and 2, the fitting line (the dotted line) is a graph following a power law: the distribution follows $\Pr(X = x) = a x^{-b}$, where b is the absolute value of the slope.

FIGS. 1 and 2 show that the Daum movie rating network follows a power-law distribution, like the degree distribution of the number of neighbors related to a user in an online social network.

This shows that an online movie rating network has the characteristics of an online social network.

A method of recommending content based on an online content rating network is described below. Here, content includes movies, dramas, videos, novels, photo albums, and the like. For convenience of explanation, the following description covers a method of recommending movies based on a movie rating network.

<< Matrix factorization method >>

Movies can be recommended by matrix factorization, which applies the singular value decomposition (SVD) method to online movie rating data.

In the matrix factorization method, the predicted rating of movie i for user u can be calculated as in Equation 1 below.
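Equation 1 itself is not reproduced in this text. Assuming it takes the standard latent-factor form, with $\hat{r}_{ui}$ (a symbol introduced here for illustration) denoting the predicted rating of movie i by user u, it would read:

```latex
\hat{r}_{ui} = q_i^{\top} p_u = \sum_{k=1}^{f} q_{ik}\, p_{uk}
```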

In Equation 1, $p_u$ represents the features of the users and $q_i$ represents the features of the movies, and they are related as shown in FIG. 3. As shown in FIG. 3, $R_{m \times n}$ is the entire rating data, each user u is associated with a vector $p_u \in \mathbb{R}^f$, and each item (movie) i is associated with a vector $q_i \in \mathbb{R}^f$. Here, $d_i$ is a latent factor. Using matrix factorization in this way, a rating that has not been given (see the "?" in FIG. 3) can be predicted from the relationship between user u and item i.
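The idea of predicting a missing entry from user and item vectors can be sketched as follows. This is a minimal illustration only: it fits the vectors by stochastic gradient descent on the observed ratings, which is a swapped-in technique and not the SVD procedure named in the text, and the function names, hyperparameters, and sample ratings are all assumptions.

```python
import random

def predict(p, q, u, i):
    """Predicted rating: dot product of user vector p[u] and item vector q[i]."""
    return sum(pk * qk for pk, qk in zip(p[u], q[i]))

def train_mf(ratings, n_users, n_items, f=2, lr=0.01, reg=0.02, epochs=300, seed=0):
    """Fit p and q to observed (user, item, rating) triples by SGD (assumed method)."""
    rng = random.Random(seed)
    p = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for k in range(f):
                pu, qi = p[u][k], q[i][k]
                p[u][k] += lr * (err * qi - reg * pu)
                q[i][k] += lr * (err * pu - reg * qi)
    return p, q

def train_rmse(p, q, ratings):
    """Root mean square error over the observed ratings."""
    total = sum((r - predict(p, q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Hypothetical data: user 0 rated items 0, 1, 2; user 1 rated items 0 and 2 only.
observed = [(0, 0, 9), (0, 1, 7), (0, 2, 3), (1, 0, 8), (1, 2, 2)]
p0, q0 = train_mf(observed, n_users=2, n_items=3, epochs=0)  # untrained vectors
p, q = train_mf(observed, n_users=2, n_items=3)
missing = predict(p, q, 1, 1)  # the "?" entry: user 1, item 1
print(train_rmse(p0, q0, observed), train_rmse(p, q, observed), missing)
```

The value printed for the missing entry depends on the random initialization and the hyperparameters, so it should be read as a demonstration of the mechanism rather than a meaningful prediction.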

A more detailed matrix factorization method is disclosed in background art such as Reference 1 (S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science 41 (1990), pp. 391-407), so its description is omitted here.

Matrix factorization provides relatively good performance, but its performance can be improved further when it is combined with other methods.

<< Average bias method >>

The average bias method considers a bias $b_{ui}$ that reflects the characteristics of the user and the item, as in Equation 2 below.

In Equation 2, $\mu$ is the average rating over all movies, $b_i$ is the difference between the average rating received by movie i and $\mu$, and $b_u$ is the difference between the average rating given by user u and $\mu$.

For example, if the average rating over all movies is 5, the average rating of the movie "Titanic" is 8, and the average rating given by the user Cheolsu is 4, then $b_i$ is 3 and $b_u$ is -1. Therefore $b_{ui}$ is 7 (5 + 3 - 1).
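The worked example implies that Equation 2 has the additive form b_ui = mu + b_i + b_u, since 5 + 3 - 1 = 7. A minimal sketch under that reading follows; the function and parameter names are hypothetical.

```python
def average_bias(mu, item_mean, user_mean):
    """Baseline estimate b_ui = mu + b_i + b_u (additive form inferred from the example)."""
    b_i = item_mean - mu  # how far the movie's average rating is from the global average
    b_u = user_mean - mu  # how far the user's average rating is from the global average
    return mu + b_i + b_u

# Values from the example: global average 5, "Titanic" average 8, user average 4.
b_ui = average_bias(mu=5, item_mean=8, user_mean=4)
print(b_ui)
```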

Using matrix factorization together with the average bias method can further improve the performance of content recommendation, as disclosed in background art such as Reference 2 (Y. Koren, R. Bell, C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 2009).

<< Random bias method >>

The random bias method according to the present invention, as in Equation 3, reflects the rating distribution characteristics of both users and movies.

In Equation 3, $\mu$ is the average rating over all movies, $r_i$ is the difference between a random variable following the distribution of ratings received by movie i and $\mu$, and $r_u$ is the difference between a random variable following the distribution of ratings given by user u and $\mu$.

For example, if the average rating over all movies is 5 and the ratings received by Titanic are 5 in 50% of cases and 10 in 50% of cases, a random variable that yields 5 or 10 with probability 50% each is generated, and $\mu$ is subtracted from it to obtain $r_i$. $r_u$ can be obtained in the same way.
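The sampling step in this example can be sketched as below. Equation 3 is not reproduced in this text, so the additive combination mu + r_i + r_u used here is an assumption made by analogy with the average bias method; the function names and the user's rating history are likewise hypothetical.

```python
import random

def sample_deviation(ratings, mu, rng):
    """Draw one rating from the empirical distribution and subtract the global average."""
    return rng.choice(ratings) - mu

def random_bias(mu, item_ratings, user_ratings, rng):
    """Assumed form of Equation 3: mu + r_i + r_u, with r_i and r_u sampled."""
    r_i = sample_deviation(item_ratings, mu, rng)
    r_u = sample_deviation(user_ratings, mu, rng)
    return mu + r_i + r_u

rng = random.Random(0)
mu = 5
titanic_ratings = [5, 10]  # half of the ratings are 5, half are 10
user_ratings = [4]         # hypothetical user who always gives 4
r_i_samples = {sample_deviation(titanic_ratings, mu, rng) for _ in range(200)}
bias_samples = {random_bias(mu, titanic_ratings, user_ratings, rng) for _ in range(200)}
print(r_i_samples, bias_samples)
```

Because the bias is sampled, two movies with the same mean but different spreads produce different sets of values, which is the property the text attributes to the method.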

In this way, even users or movies with the same average score can yield different results depending on their rating distributions. User ratings often tend to split toward the two extremes, and in such cases a correction of the random-bias form can be effective.

According to the present invention, the performance of content recommendation can be improved by combining the matrix factorization method with the random bias method.

As described above, the full data set for the specific period in the online movie rating network of FIGS. 1 and 2 comprises 140,271 raters, 11,682 movies, and 425,537 ratings. However, when users who rated only one or two movies make up most of the users, the data set is sparse. Applying matrix factorization when many users have few ratings can cause the cold start problem.

To prevent this problem, the embodiment of the present invention recommends movies based on the rating data of users who have rated at least a threshold number of movies.

In addition, when a user gives a rating of 0, the user is strongly recommending against the movie; however, the 0 is treated as if the user had not rated the movie and is not reflected in the user-item matrix. In this case, large errors can occur in the rating prediction results.

To prevent this problem, the embodiment of the present invention adds 1 to every rating given by each user so that ratings fall within the range 1 to 11, and then uses the predicted rating minus 1 for movie recommendation.
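A small sketch of this shift is given below, assuming a dense user-item matrix in which 0 is reserved for "not rated". The helper names and sample ratings are hypothetical.

```python
def build_matrix(ratings, n_users, n_items):
    """Dense user-item matrix of shifted ratings; 0 marks 'not rated'."""
    matrix = [[0] * n_items for _ in range(n_users)]
    for (u, i), r in ratings.items():
        matrix[u][i] = r + 1  # shift the 0..10 scale to 1..11
    return matrix

def unshift(predicted):
    """Map a prediction on the 1..11 scale back to the original 0..10 scale."""
    return predicted - 1

# Hypothetical ratings: user 0 gave movie 1 an explicit 0; user 1 did not rate movie 1.
ratings = {(0, 0): 10, (0, 1): 0, (1, 0): 7}
matrix = build_matrix(ratings, n_users=2, n_items=2)
print(matrix)
```

After the shift, the explicit 0 is stored as 1 while the unrated cell stays 0, so the two cases are no longer confused.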

Although the above example describes the case where the content is a movie, ratings can be predicted through the same process for other content such as dramas and novels, and the content can then be recommended.

Hereinafter, a content recommendation apparatus according to an embodiment of the present invention is described with reference to FIG. 4. FIG. 4 is a block diagram showing a content recommendation apparatus according to an embodiment of the present invention.

As shown in FIG. 4, the content recommendation apparatus 10 according to an embodiment of the present invention includes an input unit 110, a collection unit 120, a preprocessing unit 130, a prediction unit 140, and a recommendation unit 150. The input unit 110 need not be provided inside the content recommendation apparatus 10 and may instead be provided outside it.

The input unit 110 is a user interface through which the user selects the type of content to be recommended. For example, the types of content are movies, dramas, novels, comics, and the like. The user can select, through the input unit 110, that movies are to be recommended.

The collection unit 120 collects content rating data of the type selected by the user from content rating data on an online network having a power-law distribution.

Here, the content rating data of the online network having a power-law distribution may be content rating data from sites that provide movie ratings, such as Daum, Naver, or movie information sites.

The preprocessing unit 130 selects, from the collected content rating data, the content rating data of users who have rated at least a threshold number of content items of the same type. The threshold number is preferably 6 or more.

For example, the preprocessing unit 130 selects, from the movie rating data, the content rating data of the selected users, that is, users who have rated at least the threshold number of movies.
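The selection step can be sketched as a simple count-and-filter over (user, item, rating) triples. The function name and the sample data are hypothetical, and a threshold of 2 is used only to keep the example short.

```python
from collections import Counter

def select_active_users(ratings, threshold):
    """Keep only the ratings of users who rated at least `threshold` items."""
    counts = Counter(user for user, _, _ in ratings)
    return [(u, i, r) for u, i, r in ratings if counts[u] >= threshold]

# Hypothetical data: user "a" rated three movies, user "b" rated only one.
ratings = [("a", "m1", 9), ("a", "m2", 4), ("a", "m3", 7), ("b", "m1", 10)]
selected = select_active_users(ratings, threshold=2)
print(selected)
```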

The preprocessing unit 130 adds 1 to the selected content rating data, changing the rating scale from 0-10 to 1-11. In this way, the embodiment of the present invention can distinguish a rating of 0 given by a user from content that has not been rated.

The prediction unit 140 predicts, by matrix factorization and the random bias method as in Equation 4 below, the content ratings that have not been given among the scale-adjusted content rating data, and calculates the final predicted rating by subtracting 1 from the predicted rating.

Here, a is the weight between matrix factorization and the random bias method (0 ≤ a ≤ 1); $R_{MF}$ is the result of applying matrix factorization to the content rating data and can be calculated by Equation 1; and $R_{bias}$ is the result of applying the random bias method to the content rating data and can be calculated by Equation 3. The weight is preferably set through comparison with the combination of matrix factorization and the average bias method. This is described later with reference to FIGS. 5 and 6.
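Equation 4 is not reproduced in this text. One plausible reading is a convex combination of the two results, sketched below. Which term the weight a multiplies is not stated, so the choice a * R_MF + (1 - a) * R_bias is purely an assumption, as are the function names and sample values.

```python
def blend(r_mf, r_bias, a):
    """Assumed form of Equation 4: a * R_MF + (1 - a) * R_bias, with 0 <= a <= 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("weight a must lie in [0, 1]")
    return a * r_mf + (1.0 - a) * r_bias

def final_prediction(r_mf, r_bias, a):
    """Blend on the shifted 1..11 scale, then subtract 1 to return to 0..10."""
    return blend(r_mf, r_bias, a) - 1.0

# Hypothetical shifted-scale predictions for one user-movie pair.
final = final_prediction(r_mf=8.0, r_bias=6.0, a=0.5)
print(final)
```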

For example, if the content rating data include the ratings that user 1 gave to movies A, B, and C and the ratings that user 2 gave to movies A and C, the prediction unit 140 uses the content rating data to predict the rating that user 2 is expected to give to B, and subtracts 1 from it. In this way, the prediction unit 140 can calculate overall rating data in which every selected user has an actual or predicted rating for every recommendation-target content item (for example, currently showing or on sale) of the selected type.

The recommendation unit 150 can use the overall rating data to recommend content to the user who requested a recommendation of the selected type of content.

For example, the recommendation unit 150 can average the ratings per movie in the overall rating data for a plurality of movies and recommend to the user the three movies with the highest average ratings.
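The averaging and ranking step can be sketched as follows, assuming the overall rating data is held as a mapping from each movie to the list of actual or predicted ratings from all selected users. The function name and the sample ratings are hypothetical.

```python
def recommend(overall, n=3):
    """Return the n items with the highest average rating, best first."""
    averages = {item: sum(rs) / len(rs) for item, rs in overall.items()}
    return sorted(averages, key=lambda item: averages[item], reverse=True)[:n]

# Hypothetical overall rating data for four movies and two selected users.
overall = {"A": [9, 8], "B": [7, 7], "C": [3, 2], "D": [10, 9]}
top3 = recommend(overall, n=3)
print(top3)
```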

Hereinafter, with reference to FIGS. 5 and 6, the accuracy of rating prediction as a function of the weight according to an embodiment of the present invention is examined using the root mean square error (RMSE) as the performance metric.

FIG. 5 is a graph comparing the accuracy of the rating prediction method according to an embodiment of the present invention (MF Random) with other rating prediction methods when the threshold number is set to 5, and FIG. 6 is the corresponding graph when the threshold number is set to 10. Here, the other rating prediction methods are matrix factorization (MF) and matrix factorization combined with the average bias (MF Average).

In FIGS. 5 and 6, 90% of the entire rating data is set as the training set (that is, the input data), and ratings are predicted from the training set for the 10% not included in it. The RMSE between the predicted ratings and the held-out 10% of the rating data is then calculated to check the accuracy of rating prediction.
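This evaluation protocol can be sketched as a random 90/10 split followed by an RMSE computation. The shuffling, the seed, and the helper names are assumptions; the text does not say how the 10% was chosen.

```python
import math
import random

def split_ratings(ratings, train_fraction=0.9, seed=0):
    """Shuffle the ratings and split them into training and held-out parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def rmse(predicted, actual):
    """Root mean square error between two equal-length rating sequences."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical data: 20 (user, item, rating) triples.
ratings = [(u, i, (u + i) % 11) for u in range(4) for i in range(5)]
train, held_out = split_ratings(ratings)
error = rmse([3, 5], [1, 5])  # toy predictions against toy ground truth
print(len(train), len(held_out), error)
```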

Here, for convenience of evaluation, the entire rating data used were data in which every target user had rated every recommendation-target content item. Naturally, the smaller the RMSE, the higher the recommendation accuracy.

In FIG. 5, when the weight a is 0.5, the RMSE of the MF Average method reaches its minimum (Min) of 2.385, which is about 10.2% lower than the RMSE of 2.5 of the matrix factorization method.

In addition, the RMSE of the MF Random method of the present invention reaches its minimum of 2.385 when a is 0.1, which is also about 4.8% more accurate than the conventional matrix factorization method.

However, comparing the minimum RMSE values shows that when the threshold number is 5, the MF Random method according to the embodiment of the present invention is less accurate than the average-bias combination (MF Average). As FIG. 6 shows, though, once the threshold number reaches a certain level, the rating prediction of the present invention becomes more accurate.

That is, in FIG. 6, the minimum RMSE is that of the MF Random method, 2.276 at weight a = 0.1, which is 0.4% and 4.1% more accurate than the minimum RMSE of the MF Average method, 2.287, and the minimum RMSE of the matrix factorization method, 2.372, respectively.
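Assuming these percentages are relative reductions in RMSE with respect to each baseline, the figures for FIG. 6 can be checked directly from the reported values; the results come out close to the stated 0.4% and 4.1%:

```latex
\frac{2.287 - 2.276}{2.287} \approx 0.0048 \quad (\approx 0.5\%), \qquad
\frac{2.372 - 2.276}{2.372} \approx 0.0405 \quad (\approx 4.0\%)
```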

This shows that the MF Random method according to the embodiment of the present invention performs better when the number of ratings per user is large. Therefore, a threshold number of 6 or more is preferably applied in the present invention.

Thus, the embodiment of the present invention can improve prediction accuracy when predicting ratings using rating data that follows a power-law distribution.

The configuration of the present invention has been described in detail above with reference to the accompanying drawings, but this is merely an example, and those of ordinary skill in the art to which the invention pertains can of course make various modifications and changes within the scope of its technical idea. Accordingly, the scope of protection of the present invention should not be limited to the embodiments described above but should be determined by the following claims.

Claims

A content recommendation apparatus comprising:
a collection unit that collects, from online content rating data having a power-law distribution, rating data corresponding to a first type of content for which a user wants recommendations;
a preprocessing unit that selects, from the collected rating data, the rating data of users who have rated the first type of content and whose number of ratings is at least a threshold number;
a prediction unit that uses the selected rating data to predict ratings that have not yet been given, and calculates overall rating data reflecting, for every recommendation-target content item of the first type, the actual and predicted ratings of all selected users; and
a recommendation unit that averages the ratings per recommendation-target content item in the calculated overall rating data and provides the user with a preset number of content items, starting from the one with the highest average rating among all the recommendation-target content items.

The content recommendation apparatus of claim 1, wherein
the preprocessing unit adjusts the scale of the collected rating data by adding 1 to all of the collected rating data, and
the prediction unit calculates a final predicted rating by subtracting 1 from the predicted rating and calculates the overall rating data by reflecting the calculated final predicted rating.

(Deleted)

2. The method of claim 1,
A content recommendation apparatus as claimed in any one of the preceding claims,

A content recommendation method performed by a content recommendation apparatus, the method comprising:
collecting, from online content rating data having a power-law distribution, rating data corresponding to a first type of content for which a user wants recommendations;
selecting, from the collected rating data, the rating data of users who have rated the first type of content and whose number of ratings is at least a threshold number;
predicting ratings that have not yet been given using the selected rating data, and calculating overall rating data reflecting, for every recommendation-target content item of the first type, the actual and predicted ratings of all selected users; and
averaging the ratings per recommendation-target content item in the calculated overall rating data and providing the user with a preset number of content items, starting from the one with the highest average rating among all the recommendation-target content items.

The content recommendation method of claim 5, wherein
the selecting of the rating data includes adjusting the scale of the collected rating data by adding 1 to all of the collected rating data, and
the calculating includes calculating a final predicted rating by subtracting 1 from the predicted rating and calculating the overall rating data by reflecting the calculated final predicted rating.

(Deleted)