KR101116026B1

KR101116026B1 - Collaborative filtering recommender system based on similarity measures using the origin moment of difference random variable

Info

Publication number: KR101116026B1
Application number: KR1020090131059A
Authority: KR
Inventors: 홍광석; 권형준
Original assignee: 성균관대학교산학협력단
Priority date: 2009-12-24
Filing date: 2009-12-24
Publication date: 2012-02-13
Also published as: KR20110074167A

Abstract

The present invention proposes a new similarity measure for improving the performance of memory-based collaborative filtering, which relies on the similarity between users or between items. The measure, RMS (Raw Moment-based Similarity), is effective under the cold-start condition that persists for some time after a new user or a new item appears in a collaborative filtering system, when common preferences are so few that similarity cannot be measured accurately. It can also be used generally in any field that needs to judge the similarity of two variables having linear characteristics.

Description

Collaborative filtering recommendation system based on a similarity measure using the origin moment of a difference random variable {COLLABORATIVE FILTERING RECOMMENDER SYSTEM BASED ON SIMILARITY MEASURES USING THE ORIGIN MOMENT OF DIFFERENCE RANDOM VARIABLE}

The present invention relates to a collaborative filtering recommendation system, and to a method of constructing a similarity table, based on a new similarity measure for improving the performance of memory-based collaborative filtering that uses the similarity of users or the similarity of items.

A recommendation system is a personalization technology that recommends content or information likely to interest a user. In the past, rule-based algorithms that simply matched recommendations to a user's stated tastes were prevalent. As information technology advanced, soft-computing techniques from artificial intelligence and data-mining techniques from the database field began to be used to capture user behavior patterns. Recommendation systems have since evolved to analyze a user's tendencies and recommend the information and content that the user is expected to prefer.

Collaborative filtering, one of the techniques for building a recommendation system, is known as the best-performing recommendation methodology currently available. It automatically predicts users' interests from taste information gathered from many users. Its fundamental assumption is that users' past tendencies will persist in the future: if user A preferred a particular style of content in the past, user A is assumed to prefer it in the future as well. For example, a collaborative filtering or recommendation system for music predicts a user's musical tastes from a partial list of that user's likes and dislikes. A characteristic of such a system is that it uses information collected from many users rather than from one specific user. This distinguishes it from approaches that simply assign each item an average rating based on the number of votes. In other words, it identifies customers with similar patterns of preference and interest from the preferences and interests they have expressed. It is used to cross-recommend products that customers with similar tastes have not yet purchased, or to recommend related products according to the tastes or lifestyle of a classified customer group.

In this way, collaborative filtering can predict a user's preference for an item (the object of recommendation, such as information or content) by using preference information obtained from many users. Assume a system in which a user preference takes one of eight levels, 1 through 8, and a user-item preference matrix such as Table 1 below. The ultimate goal of collaborative filtering is then to accurately predict the entries that have not yet been rated, that is, the entries marked '?' in the matrix.

Table 1. Example of a user-item preference matrix

A collaborative filtering recommendation system evaluates its performance by computing the error between the predicted preference and the preference actually given by the user. Memory-based collaborative filtering, one of the technical categories of collaborative filtering, is executed in the following three steps.

1) Similarity calculation step

Similarity calculation is the most important step in a memory-based recommendation system. The user-based approach computes the similarity between the target user and every other user, using the items to which both the target user and the other user have assigned preferences. Table 2 below shows the preference data used to compute the similarity between USER 3 and USER 4 for the purpose of predicting USER 3's preference for ITEM 4 (marked '?'). In Table 2, USER 3 and USER 4 can be represented as discrete variables X and Y, with X = {4, 2, 3} and Y = {8, 3, 4}, and the two variables are treated as having a linear relationship.

Table 2. Example data for calculating similarity between users in the user-based approach

The item-based approach computes the similarity between the item to be recommended and every other item, using the preferences of the users who have rated both items in the pair. Table 3 below shows the preference data used to compute the similarity between ITEM 4 and ITEM 5 for the purpose of predicting USER 3's preference for ITEM 4 (marked '?'). ITEM 4 and ITEM 5 can likewise be represented as discrete variables X and Y, with X = {5, 4, 5} and Y = {2, 3, 8}, and the two variables are treated as having a linear relationship.

Table 3. Example data for calculating similarity between items in the item-based approach

Measuring the similarity of the variables X and Y reveals the similarity between two users or between two items. Measures used for this purpose include cosine similarity, which is widely used in information retrieval, and two correlation-based measures: the Pearson product-moment correlation coefficient and the Spearman rank correlation coefficient.

Equation (1) below is the cosine similarity, equation (2) is the Spearman rank correlation coefficient, and equation (3) is the Pearson product-moment correlation coefficient.

(1)

(2)

(3)

Here, X and Y are variables having linear characteristics, and X_i and Y_i are the i-th values of X and Y, respectively.
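The equation images for (1) to (3) are not reproduced in this text. As a reading aid, the three measures can be written in their standard textbook forms, which are not necessarily the patent's exact typesetting. The symbols n (number of common preferences), \bar{X} and \bar{Y} (means), and R(\cdot) (rank within the variable) are notation assumed here. Equation (2) is written as Pearson's coefficient applied to ranks, which matches the later remark that its denominator vanishes when the standard deviation is 0.

```latex
\begin{align}
\cos(X,Y) &= \frac{\sum_{i=1}^{n} X_i Y_i}
                 {\sqrt{\sum_{i=1}^{n} X_i^{2}}\;\sqrt{\sum_{i=1}^{n} Y_i^{2}}} \tag{1}\\[4pt]
\rho_s(X,Y) &= \frac{\sum_{i=1}^{n}\bigl(R(X_i)-\bar{R}_X\bigr)\bigl(R(Y_i)-\bar{R}_Y\bigr)}
                   {\sqrt{\sum_{i=1}^{n}\bigl(R(X_i)-\bar{R}_X\bigr)^{2}}\;
                    \sqrt{\sum_{i=1}^{n}\bigl(R(Y_i)-\bar{R}_Y\bigr)^{2}}} \tag{2}\\[4pt]
r(X,Y) &= \frac{\sum_{i=1}^{n}\bigl(X_i-\bar{X}\bigr)\bigl(Y_i-\bar{Y}\bigr)}
              {\sqrt{\sum_{i=1}^{n}\bigl(X_i-\bar{X}\bigr)^{2}}\;
               \sqrt{\sum_{i=1}^{n}\bigl(Y_i-\bar{Y}\bigr)^{2}}} \tag{3}
\end{align}
```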

Whether the user-based or the item-based approach is used, similarity cannot be computed for a pair that has no common preferences, and such a pair cannot be used for score prediction. Depending on the approach, a similarity table of items or of users can be built in advance, which greatly reduces the time needed for real-time recommendation. FIGS. 1 and 2 show examples of the similarity table for a total of five users and for a total of five items. The similarity table can be updated automatically at intervals set by the system administrator.

2) Nearest-neighbor formation step

In this step, the nearest neighbors are formed according to the similarities computed in step 1). The nearest neighbors are the data that will be used for preference prediction. Possible methods include considering all data for which a similarity was computed, the Top-N method of selecting the N most similar, and considering only data above a given similarity threshold. The user-based approach selects the users to be used for score prediction, and the item-based approach selects the items to be used for score prediction.

Too few or too many nearest neighbors degrades the performance of the recommendation system. When the number of neighbors is set to 1 and then increased experimentally, performance peaks at some point and gradually declines afterwards. For this reason the Top-N method, which is fast to compute and performs well, is commonly used. It is also called the K-nearest neighbor (KNN) method.
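The selection strategies described in this step can be sketched as follows. This is a minimal illustration, not code from the patent: the function names and the similarity values are hypothetical.

```python
from typing import Dict, List, Tuple


def top_n_neighbors(similarities: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Top-N (KNN) selection: keep the n most similar candidates, best first."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


def threshold_neighbors(similarities: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Threshold selection: keep every candidate at or above the threshold."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, s) for name, s in ranked if s >= threshold]


# Hypothetical similarities between a target user and four other users.
sims = {"USER 1": 0.31, "USER 2": 0.87, "USER 4": 0.64, "USER 5": 0.12}
print(top_n_neighbors(sims, 2))
print(threshold_neighbors(sims, 0.5))
```

With these sample values both strategies keep USER 2 and USER 4. In general they differ: Top-N fixes the neighborhood size, while the threshold fixes the minimum similarity.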

3) Preference prediction step

Using the similarities between the target user and the nearest neighbors as weights, the system predicts the target user's preferences for items that the target user has not yet rated.

Equation (4) below is the prediction for the user-based approach, and equation (5) is the prediction for the item-based approach.

(4)

(5)

Here, u is the target user, i is the target item, r_u is the preference that target user u has given to items, w_{u,v} is the similarity between user u and user v, r_{v,i} is user v's preference for item i, r_v is the preference of user v, r_i is the preference that users have given to item i, w_{i,j} is the similarity between items i and j, r_{u,j} is user u's preference for item j, and r_j is the preference that users have given to item j.
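The equation images for (4) and (5) are not reproduced in this text. The sketch below therefore uses the weighted-deviation-from-mean form commonly used in memory-based collaborative filtering for the user-based case. It assumes that r_u and r_v denote mean preferences and that the weights are normalized by the sum of their absolute values. The data layout, the function names, and the sample values are hypothetical.

```python
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]  # user -> {item: preference}


def mean_preference(ratings: Ratings, user: str) -> float:
    """Mean of all preferences a user has given (assumed meaning of r_u, r_v)."""
    values = list(ratings[user].values())
    return sum(values) / len(values)


def predict_user_based(ratings: Ratings, sims: Dict[str, float],
                       u: str, i: str) -> Optional[float]:
    """Predict user u's preference for item i from neighbors who rated i.

    sims maps each neighbor v to the similarity w_{u,v}.
    Returns None when no neighbor has rated item i.
    """
    numerator = 0.0
    denominator = 0.0
    for v, w_uv in sims.items():
        if v == u or i not in ratings.get(v, {}):
            continue
        numerator += w_uv * (ratings[v][i] - mean_preference(ratings, v))
        denominator += abs(w_uv)
    if denominator == 0.0:
        return None
    return mean_preference(ratings, u) + numerator / denominator


# Hypothetical data: user "u" has not rated item "c".
ratings = {
    "u": {"a": 4, "b": 2},
    "v1": {"a": 5, "b": 3, "c": 4},
    "v2": {"a": 1, "b": 2, "c": 3},
}
sims = {"v1": 1.0, "v2": 0.5}
print(predict_user_based(ratings, sims, "u", "c"))
```

The item-based case has the same shape, with the roles of users and items exchanged: the sum runs over neighbor items j that user u has rated, weighted by w_{i,j}.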

A high predicted preference score for an item means that the target user is expected to prefer that item strongly. The process of user-based collaborative filtering is therefore completed by sorting the predicted scores in descending order and recommending the items to the user in that order.

The problems of these existing similarity measures are described next.

First, the cold-start condition refers to the degradation of prediction performance that occurs when user preference data are extremely scarce after a new user or a new item has been created in the database. It is one of the conditions that must be addressed to improve user preference prediction in collaborative filtering.

Existing measures based on cosine similarity and on correlation coefficients suffer degraded prediction performance on preference graphs such as those in FIGS. 3, 4, 5, and 6. These figures show preference patterns that occur frequently under the cold-start condition. Outside the cold-start condition (when the number of common preferences is large), such patterns are rare, so cosine similarity and the correlation-based methods can both guarantee a certain level of performance. However, because new users and new items will continually be added to the database, a similarity measure specialized for the cold-start condition is clearly needed.

In FIGS. 3, 4, 5, and 6, the X axis is the item, the Y axis is the preference, u1 is user 1, and u2 is user 2. That is, each graph plots the preferences that users u1 and u2 have given to four items. The discussion below is framed in terms of the user-based approach, but the same applies to the item-based approach.

Referring to FIG. 3, the preferences of the two users are parallel. In this case cosine similarity outputs 1, which indicates that the two users are completely identical. Yet two users can properly be called completely identical only when their preferences match exactly, as with u1 = {1, 1, 1, 1} and u2 = {1, 1, 1, 1}. Because cosine similarity outputs 1 whenever the two users' preference graphs are parallel, it is problematic as a similarity measure.

When the preferences of the two users are parallel as in FIG. 3, the two correlation-based measures, the Pearson product-moment correlation coefficient and the Spearman rank correlation coefficient, have a standard deviation of 0. The denominator in the equations given above then becomes 0, so similarity cannot be measured. A pair whose similarity cannot be measured cannot be included among the nearest neighbors (step 2 of memory-based collaborative filtering) and cannot be used for preference prediction, so its data become useless.
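The behavior described for FIG. 3 can be checked numerically. The sketch below assumes that the figure depicts two flat, parallel preference lines, since a standard deviation of 0 requires constant ratings. The values of u1 and u2 are hypothetical, and the functions are plain textbook implementations rather than code from the patent.

```python
import math
from typing import Optional, Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity of two equal-length preference vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either variance is 0 (undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # zero denominator: similarity cannot be measured
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical flat, parallel preferences: far apart, yet "identical" by cosine.
u1 = [1, 1, 1, 1]
u2 = [5, 5, 5, 5]
print(cosine(u1, u2))   # cosine similarity reports a perfect match
print(pearson(u1, u2))  # Pearson is undefined for constant ratings
```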

Referring to FIG. 4, when the preference graphs of the two users form a zigzag, the correlation-based methods output -1, which indicates that the two users have exactly opposite tendencies. Reporting opposite tendencies for two users who gave similar preferences to the same list of items is clearly an incorrect result.

For both the graph in FIG. 4(a) and the graph in FIG. 4(b), the correlation-based methods output -1, which is already a problem. In addition, cosine similarity outputs 0.8 for the left graph and 0.96 for the right graph. The difference between the two users' preferences is the same in both graphs, yet cosine similarity gives different results, which is not a fair outcome.
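The zigzag case can be reproduced with small vectors. The preference values below are hypothetical, not read from FIG. 4. They were chosen so that the two users always differ by exactly 1 and so that cosine similarity comes out at the 0.8 and 0.96 quoted above.

```python
import math
from typing import Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity of two equal-length preference vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (assumes both vectors have non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical zigzag preferences; |u1 - u2| is 1 for every item in both pairs.
low_u1, low_u2 = [1, 2, 1, 2], [2, 1, 2, 1]
high_u1, high_u2 = [3, 4, 3, 4], [4, 3, 4, 3]

for name, a, b in (("low", low_u1, low_u2), ("high", high_u1, high_u2)):
    print(name, round(pearson(a, b), 3), round(cosine(a, b), 3))
```

Both pairs have the same item-by-item difference, yet Pearson reports total disagreement for both and cosine similarity changes with the rating level.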

Referring to FIG. 5, one preference has been changed in a zigzag pattern similar to that of FIG. 4. The Pearson product-moment correlation coefficient outputs -0.904 for the graph in FIG. 5(a) and -0.816 for the graph in FIG. 5(b). Compared with FIG. 4, the opposing tendency has become stronger, yet the coefficient indicates that the users are becoming more similar, which is an incorrect result.

FIG. 6 shows the case in which the two users have only one common preference. In this case cosine similarity outputs 1 regardless of the preference values, and the correlation-based methods cannot be computed because the covariance and the standard deviation cannot be obtained. That is, with a single common preference the existing similarity measures give incorrect results and thereby degrade preference prediction performance.

The technical problem to be solved by the present invention is to provide a collaborative filtering recommendation system, and a method of constructing a similarity table, based on a similarity measure that uses the origin moment of a difference random variable. The measure improves preference prediction performance by remedying the problems that arise under the cold-start condition when the system predicts a user's preferences.

Under the cold-start condition, the number of common preferences available for computing the similarity of users or items is extremely small, and with the existing similarity measures the problems illustrated in FIGS. 3, 4, 5, and 6 occur frequently and degrade preference prediction performance.

In the present invention, the equation shown in FIG. 9 is used to obtain a multinomial random variable D composed of the differences between the two variables, and D is used with the equation shown in FIG. 10 to compute the similarity of the two variables, so that similarity is measured more accurately under the cold-start condition. In addition, adjusting the moment order r in the equation of FIG. 10 changes the similarity distribution as shown in FIG. 11, which makes it possible to differentiate similarity among conditions that show the same preference difference. This provides the flexibility to apply the measure easily to user-preference matrices with differing characteristics. The proposed measure also continues to operate appropriately when characteristics of the user-preference matrix change, such as the preference range, the number of users, or the number of items. The experimental results in FIGS. 12 and 13 confirm that the measure performs no worse than the existing similarity measures, both under the cold-start condition and in cases other than those of FIGS. 3, 4, 5, and 6.

Under the cold-start condition, the number of common preferences available for computing the similarity of users or items is extremely small, and with the existing similarity measures the problems illustrated in FIGS. 3, 4, 5, and 6 occur frequently and degrade preference prediction performance.

The present invention therefore obtains a multinomial random variable composed of the differences between the two variables and computes its moment, so that similarity is measured more accurately under the cold-start condition. Adjusting the moment order r makes it possible to differentiate similarity among conditions that show the same preference difference, and the similarity distribution can be tuned, which provides the flexibility to apply the measure easily to user-preference matrices with differing characteristics. The experimental results shown in the accompanying drawings verify that the measure is robust to the cold-start problem and performs no worse than the existing similarity measures outside the cold-start condition.

According to a first aspect of the present invention, a collaborative filtering recommendation system is provided that is based on a similarity measure using the origin moment of a difference random variable. The system measures similarity by obtaining a multinomial random variable composed of the differences between two variables, based on the differences in the common preferences of two users or two items, and computing a moment from it. The system comprises:

a preference data receiving means for receiving a preference for an item, entered by a user within an integer range defined by a system administrator;

a performance measuring means for measuring the performance of the system by computing, according to a predetermined procedure, the error between the preference predicted before the time of measurement and the item preference received from the user, for each preference received by the preference data receiving means;

a similarity measuring means for measuring user similarity or item similarity with a similarity measuring method, based on the entered preference, preferences entered by the current user in the past, and preference data entered in the past by users other than the current user;

a similar item searching means for searching for items similar to an item that the user has not yet rated, in an item similarity database in which the similarity of every pair of items has been computed and stored in advance;

a similar user searching means for searching a user similarity database for users who show the same tendencies as the user to whom a recommendation is to be made;

a preference predicting means for selecting either a user-based method, based on the users found by the user searching means, or an item-based method, based on the items found by the item searching means, and for predicting the preference for an item that the user has not yet rated by using one of equations (4) and (5) below;

(4)

(5)

(Here, u is the target user, i is the target item, r_u is the preference that target user u has given to items, w_{u,v} is the similarity between user u and user v, r_{v,i} is user v's preference for item i, r_v is the preference of user v, r_i is the preference that users have given to item i, w_{i,j} is the similarity between items i and j, r_{u,j} is user u's preference for item j, and r_j is the preference that users have given to item j.)

a recommended item list generating means for examining the predicted preferences for items that the user has not yet rated and generating a recommendation list by sorting the items in descending order of preference; and

an item recommending means for recommending to the user, based on the recommended item list generated by the recommended item list generating means, a predetermined number of items from the top of the list together with the expected preference for each.

(Deleted)

The received preference for an item is preferably stored in a shared user-item preference database in which the item preferences of all other users are stored.

The performance measurement is preferably computed as the mean of the errors each time a new preference is received from the user.
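One way to keep such a mean up to date as preferences arrive is an incremental update, sketched below. The source says only "the mean of the errors"; using the absolute error (mean absolute error) is an assumption here, and the class name and the sample values are hypothetical.

```python
class RunningMeanError:
    """Incrementally tracked mean absolute error between predicted and actual preferences."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, predicted: float, actual: float) -> float:
        """Fold one new (predicted, actual) pair into the mean and return it."""
        error = abs(predicted - actual)  # assumption: absolute error
        self.count += 1
        self.mean += (error - self.mean) / self.count
        return self.mean


tracker = RunningMeanError()
for predicted, actual in [(3.5, 4), (2.0, 2), (5.0, 3)]:
    print(tracker.update(predicted, actual))
```

The incremental form avoids storing every past error, which suits a measurement that is refreshed on each new preference.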

The system administrator is preferably able to limit the number of similar users by specifying a threshold.

According to a second aspect of the present invention, a method is provided for constructing, with RMS, a similarity table for the pairs of users whose similarity is to be measured, in order to improve the performance of collaborative filtering in a memory-based collaborative filtering recommendation system that uses the similarity of users or the similarity of items. The method comprises:

initializing both a user B variable and a user T variable to 0;

incrementing user B by 1;

setting the value of user T equal to the value of user B;

incrementing user T by 1;

extracting the item preferences rated by both user B and user T;

computing the difference random variable D;

computing the moments of order 1/2, 2, 2/3, and 3 of the difference random variable D;

storing the computed results in a user similarity table;

determining whether the value of user T equals the total number of users (N); and

determining whether the value of user B equals the total number of users (N).

In this method of constructing a user similarity table using RMS, it is preferable to obtain the similarity between all users who have preference data by using the preferences of the items that the users have rated in common.

In the step of extracting the item preferences rated by both user B and user T, when user B is 1 and user T is 2, the extracted item preferences are preferably {2, 3, 4} and {4, 5, 2}.
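The pairwise loop over users B and T can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the equations of FIGS. 9 and 10 are not reproduced in this text, so the sketch stores only the raw (origin) moments of the differences. It takes D as the element-wise difference of the co-rated preferences and uses |d|^r so that fractional orders are defined for negative differences. The data layout, the function names, and the preferences of user 3 are hypothetical. The preferences of users 1 and 2 are chosen so that the co-rated values are the {2, 3, 4} and {4, 5, 2} given above.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Ratings = Dict[int, Dict[str, float]]  # user id -> {item: preference}
ORDERS = (1 / 2, 2, 2 / 3, 3)          # moment orders listed in the text


def co_rated(ratings: Ratings, b: int, t: int) -> Tuple[List[float], List[float]]:
    """Preferences of users b and t on the items both have rated."""
    common = sorted(set(ratings[b]) & set(ratings[t]))
    return [ratings[b][i] for i in common], [ratings[t][i] for i in common]


def raw_moment(diffs: List[float], r: float) -> float:
    """r-th raw moment of the differences, using |d| (an assumption)."""
    return sum(abs(d) ** r for d in diffs) / len(diffs)


def build_user_table(ratings: Ratings) -> Dict[Tuple[int, int], Dict[float, float]]:
    """Visit each user pair (B, T) with B < T once and store its moments."""
    table: Dict[Tuple[int, int], Dict[float, float]] = {}
    for b, t in combinations(sorted(ratings), 2):
        x, y = co_rated(ratings, b, t)
        if not x:
            continue  # no common preferences: nothing can be computed
        d = [xi - yi for xi, yi in zip(x, y)]
        table[(b, t)] = {r: raw_moment(d, r) for r in ORDERS}
    return table


ratings = {
    1: {"i1": 2, "i2": 3, "i3": 4, "i4": 1},
    2: {"i1": 4, "i2": 5, "i3": 2},
    3: {"i4": 5},
}
print(co_rated(ratings, 1, 2))
print(build_user_table(ratings)[(1, 2)])
```

Iterating over combinations visits the same upper-triangular set of pairs that the B and T counters walk through, so each pair is computed only once.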

According to a third aspect of the present invention, a method is provided for constructing, with RMS, a similarity table for the pairs of items whose similarity is to be measured, in order to improve the performance of collaborative filtering in a memory-based collaborative filtering recommendation system that uses the similarity of users or the similarity of items. The method comprises:

initializing both an item B variable and an item T variable to 0;

incrementing item B by 1;

setting the value of item T equal to the value of item B;

incrementing item T by 1;

extracting the item preferences of the users who have rated both item B and item T;

storing the computed results in an item similarity table;

determining whether the value of item T equals the total number of items (N); and

determining whether the value of item B equals the total number of items (N).

In this method of constructing an item similarity table using RMS, the similarity between all users who have preference data is preferably computed from the preferences that a single user has given to both items.

In the step of extracting the item preferences of the users who rated both item B and item T, when item B is 1 and item T is 2, the extracted item preferences are preferably {1, 3} and {2, 2}.

According to the present invention, a multinomial random variable formed from the differences between two variables is obtained and used to compute the moment for the two variables, so that similarity is measured more accurately under cold-start conditions. In addition, adjusting the moment order (r) makes it possible to differentiate the similarity between cases that show the same total preference difference, and to control the similarity distribution. This gives the flexibility to apply the measure easily to user-preference matrices with different characteristics.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.

FIG. 7 shows the notation used in the equations that appear in the description and drawings of the present invention, and FIG. 8 shows a collaborative filtering recommendation system based on the similarity measure according to an embodiment of the present invention.

In FIG. 8, the preference data receiver (1) receives the preference for an item that the user enters within the integer range defined by the system administrator. The received item preference is stored in the public user-item preference database, which holds the item preferences of all other users.

The performance measurer (2) measures the performance of the system by computing, for each preference received by the preference data receiver (1), the error between the previously predicted preference and the item preference received from the user, according to the procedure below. The performance measure is computed as the average of these errors each time a new preference is received from a user.
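The averaging described above can be sketched as a running mean of absolute errors. This is a minimal illustration, not the patent's implementation: the class name `RunningMAE` and its method are hypothetical, and the use of the absolute error is an assumption taken from the MAE metric used in the experiments.

```python
class RunningMAE:
    """Keeps the mean absolute error over all (predicted, actual) pairs seen so far."""

    def __init__(self) -> None:
        self.count = 0
        self.total_abs_error = 0.0

    def update(self, predicted: float, actual: float) -> float:
        # Accumulate |predicted - actual| and return the updated mean.
        self.count += 1
        self.total_abs_error += abs(predicted - actual)
        return self.total_abs_error / self.count


meter = RunningMAE()
meter.update(predicted=3.5, actual=4)                # first error: 0.5
current_mae = meter.update(predicted=2.0, actual=1)  # second error: 1.0
print(current_mae)
```

Only the count and the error sum are stored, so the measure can be refreshed on every new preference without revisiting earlier ones.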

The RMS similarity measurement unit (3) retrieves from the public user-item preference database the newly entered preference, the preferences that the current user entered in the past, and the preference data entered in the past by users other than the current user, and measures user similarity (FIG. 14) and item similarity (FIG. 15) using RMS, the similarity measurement method proposed in the present invention. The measured item similarities and user similarities are stored separately.

The similar item search unit (4) searches the RMS item similarity database (5), in which the similarities between all item pairs have been computed and stored in advance, for items similar to the items the user has not rated (the items to be used for preference prediction). The system administrator can limit the number of similar users by specifying a threshold.

The similar user search unit (6) searches the RMS user similarity database (7) for users whose tendencies are similar to those of the user who is to receive recommendations (the users to be used for preference prediction). The system administrator can limit the number of similar users by specifying a threshold.

The preference prediction unit (8) selects either the user-based method or the item-based method and predicts the preference for items that the user has not yet rated.
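As a rough illustration of the user-based option, the sketch below uses the mean-centred weighted average that is common in memory-based collaborative filtering. This form is an assumption: equations (4) and (5) are not reproduced in this text and may differ in detail. The function name, the data layout, and the fallback to the target user's mean are all hypothetical.

```python
def predict_user_based(target, item, ratings, similarity, neighbours):
    """Mean-centred weighted average over the neighbours who rated `item` (assumed form)."""

    def mean(user):
        values = ratings[user].values()
        return sum(values) / len(values)

    numerator = 0.0
    denominator = 0.0
    for v in neighbours:
        if item not in ratings[v]:
            continue  # this neighbour cannot contribute to the prediction
        w = similarity[(target, v)]
        numerator += w * (ratings[v][item] - mean(v))
        denominator += abs(w)
    if denominator == 0.0:
        return mean(target)  # assumed fallback when no neighbour rated the item
    return mean(target) + numerator / denominator


# Hypothetical data: ratings[user][item] and precomputed similarity weights.
ratings = {"u": {1: 4, 2: 2}, "v": {1: 5, 2: 3, 3: 4}, "w": {1: 2, 3: 1}}
similarity = {("u", "v"): 0.8, ("u", "w"): 0.2}
prediction = predict_user_based("u", 3, ratings, similarity, ["v", "w"])
print(round(prediction, 3))
```

An item-based variant would have the same shape, with the roles of users and items exchanged and the weights taken from the item similarity table.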

The recommendation item list generation unit (9) generates a recommendation list by sorting the items that the user has not yet rated in descending order of predicted preference.

The item recommendation unit (10) recommends to the user the top n items in the recommendation list, together with the expected preference for each.
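The list generation and top-n selection can be sketched in a few lines. The function name `recommend_top_n` and the item identifiers are hypothetical and serve only to illustrate the sort-then-truncate step.

```python
def recommend_top_n(predicted, n):
    """Return the n (item, predicted preference) pairs with the highest predictions."""
    ranked = sorted(predicted.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical predicted preferences for items the user has not rated yet.
predicted = {"item_a": 3.2, "item_b": 4.7, "item_c": 4.1}
top_two = recommend_top_n(predicted, n=2)
print(top_two)
```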

The present invention is a new similarity measure for mitigating the degradation of preference prediction performance under cold-start conditions. As mentioned above, preference graphs such as those in FIGS. 3, 4, 5, and 6 occur when there are relatively many common preferences, and occur even more frequently when there are few. That is, they arise very often under cold-start conditions and are a major cause of degraded prediction performance in collaborative filtering recommendation systems. The present invention therefore proposes a similarity measure based on the moments of a random variable (Raw Moment-based Similarity: RMS).

FIG. 9 illustrates how to generate the multinomial random variable (D) from the differences between the common preferences of two users or two items. D is used to compute the RMS similarity proposed in the present invention. In statistics, the multinomial random variable (D) follows the multinomial random law given below. In the proposed method, the number of events that can occur in D is R + 1.

(6)

In a collaborative filtering recommendation system that accepts integer preferences from 1 to 5, when X = {5, 3, 2} and Y = {3, 2, 4}, the multinomial random variable (D) obtained from the equation shown in FIG. 9 is as follows.

D = {2, 1, 2}
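The example can be reproduced with element-wise absolute differences, which is how the description of FIG. 9 characterizes D. The function name `difference_variable` is hypothetical.

```python
def difference_variable(x, y):
    """Element-wise absolute differences of two equally long preference lists."""
    if len(x) != len(y):
        raise ValueError("x and y must hold the same number of common preferences")
    return [abs(a - b) for a, b in zip(x, y)]


X = [5, 3, 2]
Y = [3, 2, 4]
D = difference_variable(X, Y)

# With preferences from 1 to 5, R = 5 - 1 = 4, so a difference can take
# R + 1 = 5 distinct values: 0, 1, 2, 3, 4.
R = 5 - 1
possible_values = list(range(R + 1))
print(D, possible_values)
```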

FIG. 10 shows the equation for computing the RMS similarity proposed in the present invention. When the r-th moment of the multinomial random variable (D) generated by the equation shown in FIG. 9 is divided by the r-th power of the difference between the maximum and minimum preferences that can be entered in the preference database, the result is normalized to the range 0 to 1. A value of 1 means that the variables X and Y are completely different, and 0 means that they are identical.

To be usable as a weight in preference prediction, the value normalized to the range 0 to 1 must approach 1 as the two variables become more similar. The inverse of the result is therefore taken. In a collaborative filtering recommendation system that accepts integer preferences from 1 to 5 (R = 5 - 1 = 4), when X = {5, 3, 2} and Y = {3, 2, 4}, the multinomial random variable (D) is obtained as shown in FIG. 9 and applied to the equation shown in FIG. 10, so that RMS(X, Y) using the first moment (r = 1) is given by equation (7) below.

(7)

Likewise, RMS(X, Y) using the second moment (r = 2) is given by equation (8) below.

(8)

Likewise, RMS(X, Y) using the square-root moment is given by equation (9) below.

(9)
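Because the equation images are not reproduced here, the following is only a sketch of the calculation for r = 1, 2, and 1/2. It assumes that RMS(X, Y) = 1 - (mean of d_i^r) / R^r, that is, the "inverse" is read as the complement of the normalized moment. The function name and parameters are hypothetical, and the printed values hold under that assumption only.

```python
def rms_similarity(x, y, r, rating_min, rating_max):
    """Assumed form: 1 - (r-th raw moment of |x - y|) / (rating_max - rating_min) ** r."""
    if not x or len(x) != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    span = rating_max - rating_min                 # R in the text
    diffs = [abs(a - b) for a, b in zip(x, y)]     # the difference variable D
    moment = sum(d ** r for d in diffs) / len(diffs)
    return 1.0 - moment / span ** r


X = [5, 3, 2]
Y = [3, 2, 4]
for order in (1, 2, 0.5):
    print(order, round(rms_similarity(X, Y, order, 1, 5), 4))
# Under the stated assumption this prints 0.5833, 0.8125 and 0.3619.
```

Identical inputs give 1.0 and maximally different inputs give 0.0, which matches the requirement that the weight approach 1 as similarity increases.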

The proposed method compensates for the shortcomings of existing similarity measures that arise in situations such as those in FIGS. 3, 4, 5, and 6, and can determine the degree of similarity more intuitively and clearly under cold-start conditions. The basis for this is as shown in FIG. 10. FIG. 11 shows how the similarity behaves when the moment order (r) applied to d_i is set to the 1/2 power (square root), the 2nd power (square), the 2/3 power, and the 3rd power (cube).

FIG. 11 reveals two main features that distinguish the proposed similarity measure from existing ones. The first can be seen in the similarity changes for the three cases in the table of FIG. 11(a) in which the sum of differences is 16. Although the sum of the differences is 16 in all three cases, the figure shows how the cases can be differentiated by changing the moment order (r). Depending on how frequently large preference differences occur, the similarity changes as in the portion marked with a rectangle in the graph of FIG. 11(b).

The second feature is that the lower the moment order r is set, the lower the range in which the similarity distribution forms, and the higher r is set, the higher that range. This flexibility of the proposed algorithm can serve as a means of obtaining high performance from user-preference matrices with differing characteristics (minimum and maximum preference values, number of users, number of items, number of preference entries, and so on).
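Both features can be illustrated numerically. The three difference lists below are hypothetical (they are not the rows of FIG. 11(a)) and were chosen only so that each sums to 16. The similarity is again assumed to be 1 - (mean of d_i^r) / R^r with R = 4.

```python
def rms_from_differences(diffs, r, span):
    """Assumed form: 1 - (r-th raw moment of the differences) / span ** r."""
    moment = sum(d ** r for d in diffs) / len(diffs)
    return 1.0 - moment / span ** r


SPAN = 4  # R for preferences from 1 to 5
cases = {
    "few large": [4, 4, 4, 4, 0, 0, 0, 0],   # sum of differences = 16
    "mixed": [4, 2, 2, 2, 2, 2, 1, 1],       # sum of differences = 16
    "all moderate": [2] * 8,                 # sum of differences = 16
}
orders = (0.5, 1, 2, 3)
table = {
    name: {r: rms_from_differences(diffs, r, SPAN) for r in orders}
    for name, diffs in cases.items()
}
for name, row in table.items():
    print(name, [round(row[r], 3) for r in orders])
```

With r = 1 the three cases are indistinguishable, while r = 2 ranks the case with frequent large differences lowest. For the "all moderate" case the similarity rises as r increases, which corresponds to the shift of the distribution described above.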

FIGS. 12 and 13 show measurements of the MAE (Mean Absolute Error), a metric for evaluating preference prediction performance in collaborative filtering. The algorithms compared are COS (cosine similarity), PCC (Pearson product-moment correlation coefficient), and SRCC (Spearman rank correlation coefficient), and the data set is the MovieLens data set, which has been used in many previous studies. Following the procedure used in many existing research papers, 30,000 entries were randomly sampled from a total of 100,000 preference entries in [user-item-preference] format and used as test data (input data for the experiment), and the remaining 70,000 entries were used as training data (data for building the similarity tables).
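A random hold-out split of this kind can be sketched as follows. The function name, the fixed seed, and the ten synthetic triples are assumptions for illustration; the experiment itself drew 30,000 test entries from 100,000.

```python
import random


def split_ratings(triples, test_size, seed=0):
    """Randomly hold out `test_size` (user, item, preference) triples as test data."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_indices = set(rng.sample(range(len(triples)), test_size))
    test = [t for i, t in enumerate(triples) if i in test_indices]
    train = [t for i, t in enumerate(triples) if i not in test_indices]
    return train, test


# Ten synthetic triples stand in for the 100,000 entries of the data set.
triples = [(u, i, (u + i) % 5 + 1) for u in range(1, 6) for i in range(1, 3)]
train, test = split_ratings(triples, test_size=3, seed=42)
print(len(train), len(test))
```

The similarity tables would then be built from `train` only, and the MAE computed over the predictions for the triples in `test`.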

Referring to FIG. 12, even though the data mix cold-start conditions, in which the cases of FIGS. 3, 4, 5, and 6 occur frequently, with conditions that are not cold-start, the proposed similarity measure shows markedly better performance than the existing methods. This shows that the measure does not fall behind existing similarity measures, not only under cold-start conditions but also when computing the similarity of two variables that share many common preferences.

FIG. 13 shows the results of an experiment in which the number of common preferences was forcibly limited during the similarity calculation. The number of nearest neighbors that gave the best performance in FIG. 12 was selected, and cold-start conditions were created artificially. The results show intuitively how much more robust the proposed RMS is to the cold-start problem than the existing similarity measures. As FIG. 13 shows, the proposed method is more robust to the cold-start problem than the existing methods.

FIG. 14 is a flowchart for constructing the user similarity table (database) using the proposed RMS in the user-based approach to memory-based collaborative filtering. In this figure, B_USER and T_USER denote the two users whose similarity is to be measured, and N is the maximum number of users. The flowchart shows how to obtain the similarity between all users who have preference data. An example of the structure of the constructed user similarity table (database) is shown in Table 4 below.

Table 4

Columns: ITEM 1, ITEM 2, ITEM 3, ITEM 4
USER 1: 1, 2, 3, 4
USER 2: 4, 5, 2
USER 3: 1, 2, 3
USER 4: 3, 2, 3
USER 5: 5, 4, 2

Referring to FIG. 14, the procedure for constructing the user similarity table (database) using the proposed RMS is as follows.

First, in step S1, both the B_USER variable and the T_USER variable are initialized to 0, and in step S2, B_USER is incremented by 1.

Next, in step S3, the value of T_USER is set equal to the value of B_USER, and in step S4, T_USER is incremented by 1.

Then, in step S5, the item preferences rated by both B_USER and T_USER are extracted. In Table 4 above, when B_USER is 1 and T_USER is 2, the extracted item preferences are {2, 3, 4} and {4, 5, 2}.

Next, in step S6, the difference random variable D is computed, and in step S7, the moments of D of order 1/2, 2, 2/3, and 3 are computed.

Next, in step S8, the results are stored in the user similarity table (DB) (20) in the following form, for example:

{B_USER T_USER RMS(1/2) RMS(2) RMS(2/3) RMS(3)}

e.g. 1 2 value value value value

Next, in step S9, if the value of T_USER equals the total number of users (N), the process proceeds to the next step; otherwise, it returns to step S4.

Next, in step S10, if the value of B_USER equals the total number of users (N), the process proceeds to the next step; otherwise, it returns to step S3.

In this way, the user similarity table (database) using the proposed RMS can be constructed.
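The construction can be sketched as below under several assumptions. The pair enumeration uses `itertools.combinations` to visit each pair of users once, instead of the counter-and-branch form of steps S1 to S10. The similarity is assumed to be 1 - (mean of d_i^r) / R^r. The ratings of users 1 and 2 follow the example in step S5, and the function name and dictionary layout are hypothetical.

```python
from itertools import combinations

ORDERS = (1 / 2, 2, 2 / 3, 3)  # moment orders stored per pair, as in step S7


def build_user_similarity_table(ratings, rating_min, rating_max):
    """Map each (B_USER, T_USER) pair to its RMS values for the orders in ORDERS."""
    span = rating_max - rating_min
    table = {}
    for b_user, t_user in combinations(sorted(ratings), 2):
        # Step S5: keep only the items that both users rated.
        common = sorted(set(ratings[b_user]) & set(ratings[t_user]))
        if not common:
            continue  # assumed behaviour: pairs without common items are skipped
        # Step S6: the difference variable D.
        diffs = [abs(ratings[b_user][i] - ratings[t_user][i]) for i in common]
        # Steps S7 and S8: one similarity per moment order, stored per pair.
        table[(b_user, t_user)] = tuple(
            1.0 - (sum(d ** r for d in diffs) / len(diffs)) / span ** r
            for r in ORDERS
        )
    return table


# ratings[user][item]; user 2 has no rating for item 1.
ratings = {1: {1: 1, 2: 2, 3: 3, 4: 4}, 2: {2: 4, 3: 5, 4: 2}}
user_table = build_user_similarity_table(ratings, rating_min=1, rating_max=5)
print({pair: [round(v, 4) for v in values] for pair, values in user_table.items()})
```

For users 1 and 2 the common preferences are {2, 3, 4} and {4, 5, 2}, so D = {2, 2, 2}, and the stored row has the layout {B_USER T_USER RMS(1/2) RMS(2) RMS(2/3) RMS(3)}.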

FIG. 15 shows the procedure for constructing the item similarity table (database) using the proposed RMS. In this figure, B_ITEM and T_ITEM denote the two items whose similarity is to be measured, and N is the maximum number of items.

For the RMS calculation, the user-based approach uses the preferences for items that two users have both rated, whereas the item-based approach uses the preferences that a single user has given to both items.

An example of the structure of the constructed item similarity table (database) is shown in Table 5 below.

Table 5

Referring to FIG. 15, the procedure for constructing the item similarity table (database) using the proposed RMS is as follows.

First, in step S11, both the B_ITEM variable and the T_ITEM variable are initialized to 0, and in step S12, B_ITEM is incremented by 1.

Next, in step S13, the value of T_ITEM is set equal to the value of B_ITEM, and in step S14, T_ITEM is incremented by 1.

Then, in step S15, the item preferences of the users who rated both B_ITEM and T_ITEM are extracted. In Table 5 above, when B_ITEM is 1 and T_ITEM is 2, the extracted item preferences are {1, 3} and {2, 2}.

Next, in step S16, the difference random variable D is computed, and in step S17, the moments of D of order 1/2, 2, 2/3, and 3 are computed.

Next, in step S18, the results are stored in the item similarity table (DB) (30) in the following form, for example:

{B_ITEM T_ITEM RMS(1/2) RMS(2) RMS(2/3) RMS(3)}

e.g. 1 2 0.5 0.6 0.7 0.8

Next, in step S19, if the value of T_ITEM equals the total number of items (N), the process proceeds to the next step; otherwise, it returns to step S14.

Next, in step S20, if the value of B_ITEM equals the total number of items (N), the process proceeds to the next step; otherwise, it returns to step S13.

In this way, the item similarity table (database) using the proposed RMS can be constructed.
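The item-based extraction in step S15 differs from the user-based one only in which axis is scanned. The sketch below uses a hypothetical rating matrix chosen so that items 1 and 2 yield {1, 3} and {2, 2} as in the example. The column placement of the ratings for user 4 is an assumption, and the function name is hypothetical.

```python
def common_item_preferences(ratings, b_item, t_item):
    """Preferences for two items, restricted to the users who rated both."""
    raters = [u for u in sorted(ratings) if b_item in ratings[u] and t_item in ratings[u]]
    b_prefs = [ratings[u][b_item] for u in raters]
    t_prefs = [ratings[u][t_item] for u in raters]
    return b_prefs, t_prefs


# ratings[user][item]; hypothetical matrix for illustration only.
ratings = {
    1: {1: 1, 2: 2, 3: 3, 4: 4},
    2: {2: 4, 3: 5, 4: 2},        # did not rate item 1, so this user is skipped
    4: {1: 3, 2: 2, 3: 3},        # assumed column placement
}
b_prefs, t_prefs = common_item_preferences(ratings, b_item=1, t_item=2)
D = [abs(a - b) for a, b in zip(b_prefs, t_prefs)]  # step S16
print(b_prefs, t_prefs, D)
```

From here, steps S17 and S18 proceed as in the user-based case, with the result row keyed by (B_ITEM, T_ITEM).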

The similarity measure RMS proposed in the present invention can be applied not only to collaborative filtering recommendation systems but also to a variety of other fields. It is applicable to information retrieval, statistical analysis, and other areas where cosine similarity and correlation-coefficient-based similarity measures are currently used, and to any field that requires determining the similarity of two linearly related variables, such as comparing still images of the same size or comparing two signals received over the same period.

FIG. 1 shows a similarity table for reducing the execution time of user-based collaborative filtering.

FIG. 2 shows a similarity table for reducing the execution time of item-based collaborative filtering.

FIG. 3 shows a case in which the common preferences are parallel.

FIG. 4 shows case (1) in which the common preferences form a zigzag.

FIG. 5 shows case (2) in which the common preferences form a zigzag.

FIG. 6 shows a case in which there is only one common preference.

FIG. 7 illustrates the notation used in the present invention.

FIG. 8 shows a collaborative filtering recommendation system based on the similarity measure according to an embodiment of the present invention.

FIG. 9 illustrates how to obtain the multinomial random variable (D) generated from the absolute values of the differences between two linearly related variables.

FIG. 10 illustrates the proposed similarity measure using the moments of a random variable.

FIG. 11 shows an example of how the similarity distribution changes as the moment order of the proposed RMS is adjusted.

FIG. 12 shows the results of the mean absolute error measurement experiment as the number of nearest neighbors varies (full-rating experiment).

FIG. 13 shows the results of the mean absolute error measurement experiment as the number of common preferences varies (cold-start experiment).

FIG. 14 is a flowchart for constructing the user similarity database using RMS.

FIG. 15 is a flowchart for constructing the item similarity database using RMS.

Claims

A collaborative filtering recommendation system based on a similarity measure using the origin moment of a difference random variable, in which a multinomial random variable consisting of the differences between two variables is computed from the differences between the common preferences of two users or two items, the system comprising:

preference data receiving means for receiving a preference for an item that a user enters within an integer range defined by a system administrator;

performance measurement means for measuring the performance of the system by computing, according to a predetermined procedure, the error between the previously predicted preference and the item preference received from the user, for each preference received by the preference data receiving means;

similarity measuring means for measuring user similarity or item similarity with the similarity measurement method, based on the entered preference, the preferences entered by the current user in the past, and the preference data entered in the past by users other than the current user;

similar item search means for searching an item similarity database, in which the similarities between all item pairs have been computed and stored in advance, for items similar to the items that the user has not yet rated;

similar user search means for searching a user similarity database for users whose tendencies are similar to those of the user who is to receive recommendations;

preference prediction means for predicting, according to equations (4) and (5) below, the preference for items that the user has not yet rated, by selecting one of a user-based method, which relies on the users found by the similar user search means, and an item-based method, which relies on the items found by the similar item search means,

(4)

(5)

(where u is the user receiving the recommendation, i is the item to be recommended, r_u is the preference for items given by user u, w_{u,v} is the similarity between users u and v, r_{v,i} is the preference of user v for item i, r_v is the preference of user v, r_i is the preference for item i given by users, w_{i,j} is the similarity between items i and j, r_{u,j} is the preference of user u for item j, and r_j is the preference for item j given by users);

recommendation item list generating means for generating a recommendation list by examining the predicted preferences for the items that the user has not yet rated and sorting the items in descending order of predicted preference; and

item recommending means for recommending to the user a predetermined number of items, together with the expected preference for each, based on the recommendation list generated by the recommendation item list generating means.

delete

The collaborative filtering recommendation system based on a similarity measure using the origin moment of a difference random variable according to claim 1, wherein the received item preference is stored in a public user-item preference database in which the item preferences of all other users are stored.

The collaborative filtering recommendation system of claim 1, wherein the performance measure is calculated as an average of errors each time a new preference is received from a user.

The collaborative filtering recommendation system of claim 1, wherein the system administrator may limit the number of similar users by designating a specific threshold value.

delete