KR101450453B1

KR101450453B1 - Method and apparatus for recommending contents

Info

Publication number: KR101450453B1
Application number: KR1020130052928A
Authority: KR
Inventors: 심규석; 김영훈; 박윤재
Original assignee: 서울대학교산학협력단
Priority date: 2013-05-10
Filing date: 2013-05-10
Publication date: 2014-10-13

Abstract

The present invention relates to a method and an apparatus for recommending document content suitable for each user in a digital content providing system with a recommendation function. According to an embodiment, the method for recommending document content and the apparatus for executing the same are provided, wherein the method for recommending the document content comprises the steps of: inputting information about a user and a document as input data; and generating a preference database based on the input data. The input data includes a set of documents, a set of users, a set of documents created by a user, a set of documents recommended by a user, and a set of words that appear in the whole document.

Description

Content recommendation method and apparatus

The present invention relates to a content recommendation method and apparatus, and more particularly to a method and apparatus for recommending document content suited to each user in a digital content providing system that has a recommendation function.

The digital content market has grown more than any other market alongside the development of the Internet, and its expansion is now accelerating further with the growth of social networks. Typical examples of digital content include news, movies, and academic papers.

Recently, various kinds of content providers that supply digital content, including Internet portal companies, have launched services that recommend and provide customized content matching each user's preferences. One method for such content recommendation uses a probabilistic model.

A probabilistic model models the process by which a user writes content, or writes a summary or description of content. Here, the content a user writes is limited to text-based content, which is hereinafter referred to as a "document". A known example of probabilistic modeling of documents is the PLSI model presented in Thomas Hofmann's paper "Probabilistic Latent Semantic Indexing" (SIGIR 1999). The PLSI model automatically classifies documents based on a statistical latent class model. The technique is used in various fields such as information retrieval and filtering, natural language processing, and machine learning on documents.

However, when the PLSI model is applied to the generation process of document content, it can model only the process by which a user creates content; it cannot model the process by which a user votes on content written by others. Therefore, if a user has no history of writing content and has only recommended other people's content, there is no way to analyze that user's preferences and recommend other content.

Accordingly, there is a need for a method that calculates a user's preference for content and recommends appropriate content by modeling not only the content creation process but also the process by which a user votes on other people's content.

An embodiment of the present invention can provide a method and apparatus that calculate a user's preference for content and recommend content by modeling not only the process by which a user creates content but also the process by which the user votes on other people's content.

In addition, an embodiment of the present invention can provide a method and apparatus that calculate a user's preference and recommend content from the user's history of voting on other people's content, even if the user has never written content directly.

According to an embodiment of the present invention, there can be provided a document content recommendation method comprising: inputting information about documents and users as input data; and generating a preference database (DB) based on the input data, wherein the input data includes a set of documents, a set of users, a set of documents written by a user, a set of documents recommended by a user, and a set of words appearing in the entire set of documents.

In addition, according to an embodiment of the present invention, the document content recommendation method may further comprise: calculating a preference score for each of a plurality of candidate documents based on the posterior probability distribution of each preference; and recommending documents to the user based on the preference score calculated for each candidate document.

In addition, according to an embodiment of the present invention, there can be provided a computer-readable recording medium on which a program for executing the document content recommendation method on a computer is recorded.

An embodiment of the present invention has the advantage that a user's preference for content can be calculated and content can be recommended by modeling not only the process by which a user creates content but also the process by which the user votes on other people's content.

In addition, an embodiment of the present invention has the advantage that, even if a user has never written content directly, the user's preference can be calculated and content can be recommended from the user's history of voting on other people's content.

In addition, an embodiment of the present invention has the advantage that even newly written content can be recommended to other users by using the content itself and the preferences of the users who voted on it.

FIG. 1 is a flowchart schematically illustrating a method of recommending content to a user according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an exemplary method of the learning algorithm in the learning step (S10).
FIG. 3A shows an exemplary topic preference probability distribution of documents, used to explain the present invention.
FIG. 3B shows an exemplary topic preference probability distribution of users, used to explain the present invention.
FIG. 3C shows an exemplary probability distribution of a word being selected for a topic, used to explain the present invention.
FIG. 4 is a flowchart illustrating an exemplary method of the recommendation algorithm in the recommendation step (S20).
FIG. 5 is a table showing recommendation types according to the types of document and user.
FIG. 6 is a block diagram illustrating an exemplary network configuration for content recommendation according to an embodiment.

The above objects, other objects, features, and advantages of the present invention will be readily understood through the following preferred embodiments taken in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments described herein and may be embodied in other forms. Rather, the embodiments introduced here are provided so that the disclosure will be thorough and complete and so that the spirit of the invention is fully conveyed to those skilled in the art.

In this specification, when an element is described as being on another element, it may be formed directly on the other element, or a third element may be interposed between them. In the drawings, the thicknesses of elements are exaggerated for effective explanation of the technical content.

Where terms such as first and second are used in this specification to describe elements, the elements should not be limited by those terms. The terms are used only to distinguish one element from another. The embodiments described and illustrated here also include their complementary embodiments.

In this specification, singular forms include plural forms unless the wording specifically states otherwise. The terms "comprise" and/or "comprising" as used in the specification do not exclude the presence or addition of one or more elements other than those mentioned.

Hereinafter, the present invention is described in detail with reference to the drawings. In describing the specific embodiments below, various specific details are set forth to explain the invention more concretely and to aid understanding. However, a reader with enough knowledge of this field to understand the invention will recognize that it can be used without these specific details. Note also that, in some cases, matters that are commonly known and not closely related to the invention are omitted from the description to avoid confusion in explaining the invention.

The present invention relates to a method and apparatus for recommending content on topics a user is likely to prefer, in a digital content providing system that can identify individuals (for example, through user login) and in which a user can write content or express a preference for a particular topic with respect to content written by others (for example, by voting or recommending). Here, content means text-based content that includes text at least in part, and it is hereinafter also referred to as a "document" or "document content". For example, documents defined as a set of multiple words, such as news articles, product reviews, blog posts, and reviews of movies or works of art, correspond to content in this embodiment.

In the content recommendation method and apparatus according to an embodiment of the present invention, the topics each user prefers are analyzed based on the past record of content the user has written and of other people's content the user has voted on (or recommended). On that basis, content the user is likely to prefer can be recommended from among new content the user has not yet encountered. According to the present invention, even if a user has never written content and has only recommended other people's content, the user's preference can be calculated from that recommendation history and content the user prefers can be presented.

FIG. 1 is a flowchart schematically illustrating a method of recommending content to a user according to an embodiment of the present invention.

In one embodiment, the content recommendation method includes a learning step (S10) for generating a preference database (DB) and a recommendation step (S20) for recommending content to a user based on the generated preference DB.

In the illustrated embodiment, the learning step (S10) may include inputting information about documents and users into the learning algorithm as input data, and generating the preference database (DB) based on the input data.

The input data may include, for example, a set of documents, a set of users, a set of documents written by a user, a set of documents recommended by a user, and a set of words appearing in the entire set of documents.

The generated preference DB may include, for example, the topic preference of documents, the topic preference of users, and the word preference of topics. In a preferred embodiment, each preference is expressed as a probability distribution. That is, the topic preference of a document is expressed as the probability distribution (θ) of selecting a particular topic (z) in a particular document (d), the topic preference of a user is expressed as the probability distribution (γ) of a particular user (u) selecting a particular topic (z), and the word preference of a topic is expressed as the probability distribution (φ) of selecting a particular word (w) for a particular topic (z).

After the learning algorithm has generated the preference DB, which is a set of preferences each expressed as a probability distribution, the recommendation step (S20) proceeds as follows: when information on a particular user is input, the recommendation algorithm quantifies that user's preference for each of a given plurality of documents based on the preference DB and recommends the documents with high preference to the user.

A specific embodiment of the learning step (S10) is described below with reference to FIG. 2 and FIG. 3. FIG. 2 is a flowchart illustrating an exemplary method of the learning algorithm in the learning step (S10).

First, in step S110, the document creation probability, the document recommendation probability, and the word selection probability are calculated based on the input data and the prior (estimated) probability distributions of the preferences. The input data and the prior probability distributions used here are as follows.

Input data

Here, the input data is the data needed to generate the preference DB. Its values are known in advance, and it may include the following.

D = {d₁, d₂, …, d_n}: the set of given documents

U = {u₁, u₂, …, u_m}: the set of users

D(u_a): the set of documents written by user u_a

L(u_a): the set of documents read and recommended by user u_a

U(d_i): the set of users who recommended document d_i

W(d_i): the list of words in document d_i (bag-of-words)

N_i: the number of words in document d_i

W = {w₁, w₂, …, w_l}: the set of words appearing in the entire set of documents (D)
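The sets above are related to one another: U(d_i) is the inverse of L(u_a), N_i is the length of W(d_i), and W is the union of the words over all documents. The following minimal Python sketch shows how these derived sets might be computed from the primary data. The container layout (dictionaries keyed by string identifiers) and all names are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict

def recommenders_by_document(liked):
    """Invert L(u) (user -> recommended documents) into U(d) (document -> users)."""
    result = defaultdict(set)
    for user, docs in liked.items():
        for doc in docs:
            result[doc].add(user)
    return dict(result)

# Hypothetical toy data: L(u_a) for three users and W(d_i) for two documents.
liked = {"u1": {"d2", "d3"}, "u2": {"d3"}, "u3": set()}
words_in = {"d2": ["topic", "model", "topic"], "d3": ["news"]}

recommended_by = recommenders_by_document(liked)              # U(d_i)
word_counts = {doc: len(ws) for doc, ws in words_in.items()}  # N_i
vocabulary = set().union(*words_in.values())                  # W
```

Note that W(d_i) is kept as a list so that repeated words are counted in N_i, while W is a set of distinct words.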

Prior probability distributions of the preferences

The preference DB may include the topic preference of documents, the topic preference of users, and the word preference of topics, each defined as a probability distribution as follows.

θ(z|d): the probability of selecting topic (z) in document (d); this is the topic preference of document (d).

γ(z|u): the probability that user (u) selects topic (z); this is the topic preference of user (u).

φ(w|z): the probability of selecting word (w) for topic (z); this is the word preference of topic (z).

In this regard, FIGS. 3A to 3C show exemplary estimated (prior) probability distributions for the document topic preference (θ), the user topic preference (γ), and the topic word preference (φ), respectively.

FIG. 3A is an exemplary prior probability distribution (θ) of the document topic preference. It is a table showing, when there are N documents (d) classified into K topics (z), the probability that each document (d) corresponds to a particular topic (z). Referring to the table, the probability that document (d1) belongs to topic (z1) is 0.05 and the probability that it belongs to topic (z2) is 0.03. The probability that document (d1) belongs to each topic (z1 to z_K) is listed in this way, and the probabilities over all topics sum to 1. Likewise, for each of documents (d2) to (d_N), the probability of belonging to each topic is listed, and for each document the probabilities over all topics sum to 1.

FIG. 3B is an exemplary prior probability distribution (γ) of the user topic preference. It is a table showing, when there are M users (u) and the topics (z) are classified into K, the probability that each user (u) selects a particular topic (z). Referring to the table, every user (u) is assigned a probability of selecting each topic, and for each user the probabilities over all topics sum to 1.

FIG. 3C is an exemplary prior probability distribution (φ) of a word being selected for a topic, used to explain the present invention. It shows, when the topics (z) are classified into K and there are L words (w) in total, the probability that each topic (z) contains a particular word (w). Referring to the table, every topic (z) is assigned a probability of containing each word, and for each topic the probabilities over all words sum to 1.

Note that the preference probability distributions used in step S110 are arbitrarily chosen estimates, not values calculated from the actual input data. The probability distributions of the preference DB used in step S110 are estimated distributions; the subsequent steps (S120, S130), described below, then yield the posterior probability distributions of the preference DB that best fit the actual input data. That is, the present invention uses Bayes' theorem, which derives posterior probability information from prior probabilities and observed values: the actual probability distributions of the preferences in the preference DB are obtained from the prior probability distributions (θ, γ, φ) of each preference and the observed values (the input data).
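Since the prior distributions are arbitrary estimates whose rows each sum to 1, one simple way to produce them is to draw random positive weights and normalize each row. The following is a minimal sketch under that assumption; the patent does not state how the initial values are chosen, and the function names and list-of-lists layout are illustrative only.

```python
import random

def random_distribution(size, rng):
    """Return `size` positive numbers that sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def init_preferences(n_docs, n_users, n_topics, n_words, seed=0):
    """Build arbitrary prior tables: theta[d][z], gamma[u][z], phi[z][w]."""
    rng = random.Random(seed)
    theta = [random_distribution(n_topics, rng) for _ in range(n_docs)]
    gamma = [random_distribution(n_topics, rng) for _ in range(n_users)]
    phi = [random_distribution(n_words, rng) for _ in range(n_topics)]
    return theta, gamma, phi

theta, gamma, phi = init_preferences(n_docs=4, n_users=3, n_topics=2, n_words=5)
```

Each row of `theta` and `gamma` is a distribution over topics, and each row of `phi` is a distribution over words, matching the row sums described for FIGS. 3A to 3C.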

Referring again to FIG. 2, in step S110 the document creation probability, the word selection probability, and the document recommendation probability may each be calculated as follows, based on the input data and the prior probability distributions of the preferences.

Document creation probability

The document creation probability Pa(u,d) is the probability that an arbitrary user (u_a) writes document (d). It is proportional to the similarity between the probability distribution (θ) of the document topic preference and the probability distribution (γ) of the user topic preference. The document creation probability is calculated for every user (u), and in one embodiment Pa(u,d) may be defined by Equation 1 as follows.

Here, θ(z|d) is the probability distribution of the document topic preference and γ(z|u) is the probability distribution of the user topic preference. α is a constant indicating how varied the documents a user writes are relative to the user's own preferences; the larger this value, the more the user writes only documents similar to the user's own topic preferences. α may be determined according to the specific implementation. KL is the Kullback-Leibler divergence function, which quantifies the distance between the two probability distributions γ(z|u) and θ(z|d). A large value of this function means that the two probability distributions are far apart, so the document creation probability is defined to be inversely proportional to this function.

Although the present embodiment uses the Kullback-Leibler divergence function to quantify the distance between two probability distributions, other functions may of course be used to quantify the distance, depending on the embodiment.
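For reference, the Kullback-Leibler divergence between two discrete distributions p and q over the same topics is KL(p ‖ q) = Σ_z p(z) log(p(z)/q(z)). The minimal sketch below computes it for two topic distributions. Which of γ and θ plays the role of p is not recoverable from the text here, so the argument order in the example is an assumption; the sketch also assumes q is strictly positive wherever p is.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p[z] == 0 contribute nothing; q[z] must be > 0 where p[z] > 0.
    """
    return sum(pz * math.log(pz / qz) for pz, qz in zip(p, q) if pz > 0)

# Hypothetical topic preferences over two topics.
gamma_u = [0.5, 0.5]   # user topic preference
theta_d = [0.9, 0.1]   # document topic preference

distance = kl_divergence(gamma_u, theta_d)
same = kl_divergence(gamma_u, gamma_u)
```

The divergence is zero when the two distributions are identical and grows as they move apart. It is not symmetric, so swapping the arguments generally gives a different value.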

Word selection probability

The word selection probability is the probability that a document (d) written by an arbitrary user (u_a) contains an arbitrary word (w). Each document selects a topic (z) according to the probability distribution θ(z|d), and the selected topic (z) contains an arbitrary word (w) according to the probability distribution φ(w|z). The word selection probability is therefore proportional to the product of the probability distribution (θ) of the document topic preference and the probability distribution (φ) of the topic word preference. In one embodiment, the document creation probability is calculated for every document written by each user.
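The two-stage selection described above (first a topic from θ, then a word from φ) can be sketched as a sum over topics of θ(z|d)·φ(w|z). This is the standard PLSI-style mixture and is used here as an assumption: the text states only that the probability is proportional to the product of θ and φ. The table layout and names are illustrative.

```python
def word_probability(theta_d, phi, w):
    """P(w | d) as a topic mixture: sum over z of theta_d[z] * phi[z][w]."""
    return sum(theta_d[z] * phi[z][w] for z in range(len(theta_d)))

# Hypothetical values: one document, two topics, three words.
theta_d = [0.7, 0.3]          # theta(z | d)
phi = [[0.5, 0.5, 0.0],       # phi(w | z1)
       [0.1, 0.2, 0.7]]       # phi(w | z2)

probabilities = [word_probability(theta_d, phi, w) for w in range(3)]
```

Because every row of `phi` and `theta_d` itself sum to 1, the resulting word probabilities for the document also sum to 1.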

Document recommendation probability

The document recommendation probability Pv(u,d) is the probability that an arbitrary user (u_a) recommends another person's document. It is proportional to the similarity between the probability distribution (θ) of the document topic preference and the probability distribution (γ) of the user topic preference, and inversely proportional to the number of recommendations the document has received. In one embodiment, the document recommendation probability Pv(u,d) may be defined by Equation 2 as follows.

Here, Pv(u,d) is the document recommendation probability that user (u) recommends document (d), θ(z|d) is the probability distribution of the document topic preference, and γ(z|u) is the probability distribution of the user topic preference. β is a constant, and s is the number of recommendations for the document (d). The larger the value of β/s, the more the user recommends only documents similar to the user's own topic preferences. The number of recommendations for document (d) is placed in the denominator to account for the fact that the more recommendations a document has, that is, the more people it has been exposed to, the more likely people are to recommend it even if it differs from their own topic preferences. KL is the Kullback-Leibler divergence function, which quantifies the distance between the two probability distributions θ(z|d) and γ(z|u).

Referring to FIG. 2, once the document creation probability, the document recommendation probability, and the word selection probability have been calculated in step S110 as described above, step S120 calculates the probability that the data is generated (the likelihood) based on the calculated probabilities.

In one embodiment, given the prior probability distributions (θ, γ, φ) of each preference in the preference DB, the probability that the data is generated according to the above probability model, that is, the likelihood (L), may be defined by Equation 3 below.

Then, in step S130, the posterior probability distributions (θ, γ, φ) of each preference that maximize the likelihood are calculated. That is, the probability distributions θ(z|d), γ(z|u), and φ(w|z) of each preference in the preference DB are calculated so that the likelihood (L) is maximized.

This process can use the well-known method called the EM algorithm. The EM algorithm is described, for example, in the paper by A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm" (Journal of the Royal Statistical Society, 39:1-38, 1977). It repeatedly performs an E step and an M step and takes as the solution the probability distribution values at which the likelihood no longer increases and has converged.

In the present invention, the values to be calculated in the E step and the M step of the EM algorithm are given by Equation 4 below.

Because the EM algorithm is not guaranteed to always find the optimal value, it is preferable to run the EM algorithm several times and take as the best solution the probability distributions obtained in the run that yields the largest likelihood.
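The repeated-run strategy can be sketched as a small wrapper that keeps the result with the highest likelihood. The sketch assumes a hypothetical function `run_em(seed)` that performs one full EM run from a seed-dependent initialization and returns the fitted parameters together with the final (log-)likelihood. The EM updates themselves are not implemented here, and `toy_run_em` is only a stand-in with made-up values so that the wrapper can be executed.

```python
def best_of_restarts(run_em, seeds):
    """Run EM once per seed and keep the parameters with the largest likelihood."""
    best_params, best_likelihood = None, float("-inf")
    for seed in seeds:
        params, likelihood = run_em(seed)
        if likelihood > best_likelihood:
            best_params, best_likelihood = params, likelihood
    return best_params, best_likelihood

def toy_run_em(seed):
    """Stand-in for a real EM run; returns fixed, made-up log-likelihoods."""
    fake_log_likelihood = {0: -120.5, 1: -98.2, 2: -101.7}
    return {"seed": seed}, fake_log_likelihood[seed]

params, likelihood = best_of_restarts(toy_run_em, seeds=[0, 1, 2])
```

Different seeds start EM from different initial distributions, so each run may converge to a different local maximum; the wrapper simply retains the best of them.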

By performing steps S110 to S130 as described above, the posterior probability distributions θ(z|d), γ(z|u), and φ(w|z) of each preference in the preference DB are obtained. The preference DB with these posterior probability distributions can then be used in the recommendation step (S20) to select and recommend to the user documents on the most appropriate topics.

FIG. 4 is a flowchart illustrating an exemplary recommendation algorithm used in the recommendation step S20.

In the illustrated embodiment, the recommendation algorithm selects, from the candidate documents (d_1, d_2, …, d_i) considered for a given user, one or more documents that the user is most likely to prefer.

Referring to FIG. 4, steps S210 to S250 are executed to calculate a preference score, score(d_i), for each of the candidate documents (d_1, d_2, …, d_i). In step S230, the preference score of each candidate document for the given user is calculated based on the posterior probability distributions in the preference DB. The number of candidate documents is not particularly limited and may be determined according to the embodiment.

Then, in step S260, the candidate documents are sorted by their calculated preference scores, and in step S270 one or more candidate documents with high preference scores are presented to the user.
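The sorting and presentation steps S260 and S270 can be sketched as follows. The function name `recommend_top_k` and the cutoff `k` are illustrative assumptions; the text only says that one or more high-scoring candidates are presented.

```python
from typing import Dict, List


def recommend_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k candidate documents with the highest scores.

    `scores` maps a candidate document id to its preference score,
    as produced by the scoring loop (steps S210 to S250).
    """
    # Step S260: sort candidates by preference score, highest first.
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    # Step S270: present the top-ranked candidates to the user.
    return ranked[:k]
```

For example, `recommend_top_k({"d1": 0.2, "d2": 0.9, "d3": 0.5}, 2)` returns `["d2", "d3"]`.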

Meanwhile, in the preference score calculation step S230 of FIG. 4, the preference score of each candidate document is calculated using the posterior probability distributions in the preference DB. If information on both the given user u_a and the candidate document d_i exists in the preference DB, the score can be calculated directly. In some cases, however, a document must be recommended to a new user, or a new document must be recommended to an existing user. To handle such cases, one embodiment of the present invention classifies recommendations into four types, as shown in FIG. 5, according to whether the user's information and the candidate document's information exist in the preference DB, and provides a method of calculating the preference score for each type.

FIG. 5 is a table showing the recommendation types according to the type of document and user. Referring to FIG. 5, recommendations are classified into the following four types according to whether user information and document information are each present in the preference DB.

- Warm-start recommendation: both the target user and the document to be recommended exist in the preference DB generated in the learning step S10.

- Cold-start recommendation type 1: a document that exists in the preference DB is recommended to a new user who does not appear in the preference DB.

- Cold-start recommendation type 2: a new document that does not appear in the preference DB is recommended to a user who exists in the preference DB.

- Cold-start recommendation type 3: a new document that does not appear in the preference DB is recommended to a new user who does not appear in the preference DB.
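The four-way classification above reduces to two membership checks against the preference DB. A minimal sketch, assuming the sets of known user ids and known document ids are available (the function and parameter names are illustrative, not from the source):

```python
from typing import Set


def recommendation_type(
    user_id: str,
    doc_id: str,
    known_users: Set[str],
    known_docs: Set[str],
) -> str:
    """Classify a (user, document) pair into one of the four recommendation types."""
    user_known = user_id in known_users
    doc_known = doc_id in known_docs
    if user_known and doc_known:
        return "warm-start"
    if doc_known:
        # New user, known document.
        return "cold-start type 1"
    if user_known:
        # Known user, new document.
        return "cold-start type 2"
    # Both the user and the document are new.
    return "cold-start type 3"
```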

According to a preferred embodiment of the present invention, the preference score for each of these recommendation types is calculated as follows.

Warm-start recommendation

When both the target user u_a and the document d_i to be recommended exist in the preference DB generated in the learning step S10, the preference score of user u_a for document d_i can be defined as a function proportional to the similarity between the probability distribution θ of the document's topic preference and the probability distribution γ of the user's topic preference. In one embodiment, this preference score can be expressed by Equation 5 below.

Here, score(d_i) is the preference score of user u_a for candidate document d_i, γ(z|u_a) is the probability distribution of the topic preference of the target user u_a, θ(z|d_i) is the probability distribution of the topic preference of candidate document d_i, and KL is the Kullback-Leibler divergence function.

Alternatively, instead of Equation 5, any other function that quantifies the distance between the probability distribution θ of the document's topic preference and the probability distribution γ of the user's topic preference may of course be used.
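As an illustration of a KL-based score, the sketch below uses the negative Kullback-Leibler divergence from the user's distribution γ to the document's distribution θ, so that closer distributions score higher. This is an assumption made for the example: the exact form of Equation 5, including the direction of the divergence and any scaling, is not reproduced in this text, and the function names and the smoothing constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same topics.

    Terms with p_i == 0 contribute nothing; q_i is floored at `eps`
    (an assumption of this sketch) to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def preference_score(gamma_u: Sequence[float], theta_d: Sequence[float]) -> float:
    """Illustrative score: higher when the user and document distributions are closer.

    Assumed form (not Equation 5 itself): score = -KL(gamma_u || theta_d).
    """
    return -kl_divergence(gamma_u, theta_d)
```

With γ = [0.7, 0.2, 0.1], a document with θ = [0.6, 0.3, 0.1] scores higher than one with θ = [0.1, 0.2, 0.7], because its topic distribution is closer to the user's.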

Cold-start recommendation type 1

This is the case of recommending a document that exists in the preference DB to a new user who does not appear in the preference DB generated in the learning step S10. Because the topic preference probability distribution γ(z|u_a) of user u_a is unknown, γ(z|u_a) is calculated by Equation 6 below, using the set D(u_a) of documents created by user u_a and the set L(u_a) of documents recommended by user u_a.

Here, Z(u_a) is a normalization value that makes the preference distribution γ(z|u) sum to 1 over all topics. Once γ(z|u_a) has been calculated by Equation 6, the preference score can be obtained by substituting it into Equation 5.
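The idea of estimating a new user's topic distribution from known documents can be sketched as below. This is a stand-in, not Equation 6, whose exact form is not reproduced in this text: it simply sums the known topic distributions θ(z|d) of the documents in D(u_a) and L(u_a) and normalizes the result so that it sums to 1, with the normalizer playing the role of Z(u_a). All names are illustrative.

```python
from typing import Dict, Iterable, List, Sequence


def estimate_user_topics(
    authored: Iterable[str],
    recommended: Iterable[str],
    theta: Dict[str, Sequence[float]],
) -> List[float]:
    """Estimate gamma(z | u_a) for a new user from documents already in the preference DB.

    Assumed stand-in for Equation 6: an unweighted sum of theta(z | d) over the
    documents the user created, D(u_a), and recommended, L(u_a), then normalized.
    """
    doc_ids = list(authored) + list(recommended)
    if not doc_ids:
        raise ValueError("the user has neither authored nor recommended any document")
    n_topics = len(theta[doc_ids[0]])
    totals = [0.0] * n_topics
    for doc_id in doc_ids:
        for z, p in enumerate(theta[doc_id]):
            totals[z] += p
    norm = sum(totals)  # plays the role of the normalization value Z(u_a)
    return [t / norm for t in totals]
```

The resulting distribution can then be passed to whatever scoring function is used for the warm-start case.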

Cold-start recommendation type 2

This is the case of recommending a new document that does not appear in the preference DB to a user who exists in the preference DB generated in the learning step S10. Because the topic preference probability distribution θ(z|d_i) of document d_i is unknown, θ(z|d_i) is calculated by Equation 7 below, based on the probability distribution of the words of candidate document d_i, the probability distribution of the topic preference of the user who created d_i, and the probability distributions of the topic preferences of the users who recommended d_i.

Here, Z(d_i) is a normalization value that makes the preference distribution θ(z|d) sum to 1 over all topics, and u_j denotes the user who created document d_i. Once θ(z|d_i) has been calculated by Equation 7, the preference score can be obtained by substituting it into Equation 5.
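One way to combine the three information sources named above (the document's words, its author's topic distribution, and its recommenders' topic distributions) is sketched below. This is a stand-in, not Equation 7, whose exact form is not reproduced in this text: it multiplies the author's γ, each recommender's γ, and the word probabilities φ(w|z) per topic in log space, then normalizes, with the normalizer playing the role of Z(d_i). The names and the floor `eps` are assumptions of the example.

```python
import math
from typing import Dict, List, Sequence


def estimate_doc_topics(
    words: Sequence[str],
    author_gamma: Sequence[float],
    recommender_gammas: Sequence[Sequence[float]],
    phi: Sequence[Dict[str, float]],
    eps: float = 1e-12,
) -> List[float]:
    """Estimate theta(z | d_i) for a new document (illustrative stand-in for Equation 7).

    phi[z][w] is the word preference phi(w | z) of topic z. Probabilities are
    floored at `eps` so that unseen words do not produce log(0).
    """
    log_weights = []
    for z in range(len(author_gamma)):
        lw = math.log(max(author_gamma[z], eps))
        for gamma in recommender_gammas:
            lw += math.log(max(gamma[z], eps))
        for w in words:
            lw += math.log(max(phi[z].get(w, 0.0), eps))
        log_weights.append(lw)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    norm = sum(weights)  # plays the role of the normalization value Z(d_i)
    return [w / norm for w in weights]
```

Working in log space keeps the product from underflowing for documents with many words.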

Cold-start recommendation type 3

This is the case of recommending a new document that does not appear in the preference DB to a new user who does not appear in the preference DB generated in the learning step S10, so that both the topic preference probability distribution γ(z|u_a) of user u_a and the topic preference probability distribution θ(z|d_i) of document d_i are unknown. In this case, γ(z|u_a) and θ(z|d_i) are calculated by Equation 6 and Equation 7, respectively, and the preference score is then obtained by substituting the two calculated distributions into Equation 5.

With the method described above, a preference score can be calculated for each candidate document not only when the user information and document information are known but also when the user information and/or the document information does not exist in the preference DB.

FIG. 6 is a block diagram illustrating an exemplary network configuration including a content recommendation apparatus according to an embodiment.

Referring to FIG. 6, the content recommendation apparatus according to an embodiment may include a server 30, an input data DB 50, and a preference DB 60, and may be connected to communicate with a plurality of user terminals 10 through a network 20.

The user terminal 10 may be, for example, a portable mobile terminal such as a smartphone, a tablet PC, or a notebook computer, or a non-portable terminal such as a desktop computer.

The network 20 is any type of network that provides a data transmission and reception path between the user terminal 10 and the server 30, and may include one of a LAN, a WAN, the Internet, and/or a mobile communication network.

The server 30 may be a service server that provides content to the user terminal 10 and, in one embodiment, includes a content recommendation application 40 capable of executing the learning algorithm and/or the recommendation algorithm for content recommendation described above. To this end, the server 30 may comprise a processor, a memory, a storage unit, a communication unit, and the like, and the content recommendation application 40 may be stored in the storage unit and then loaded into the memory and executed under the control of the processor.

In the illustrated embodiment, the server 30 is communicably connected to the input data DB 50 and the preference DB 60. In an alternative embodiment, the server 30 may itself include at least one of the input data DB 50 and the preference DB 60. The input data DB 50 may store the input data needed to generate the preference DB, and the preference DB 60 may store the prior preference probability distributions used before the learning step S10 is performed and the posterior preference probability distributions generated after the learning step S10 is performed.

As described above, the content recommendation method and apparatus according to an embodiment of the present invention analyze a user's preferences using not only the content the user has posted to the digital content providing system but also the user's history of recommending content created by others, and can thereby recommend content likely to interest the user. Thus, even when the user has not created any content, the user's preferences can be analyzed from the history of content the user has recommended, enabling more accurate and suitable content recommendation.

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to those embodiments. Those of ordinary skill in the art to which the invention pertains will understand that various modifications and variations are possible from the above description. Therefore, the scope of the invention should not be limited to the described embodiments but should be determined by the claims below and their equivalents.

10: User terminal
20: Network
30: Server
40: Content recommendation application
50: Input data DB
60: Preference DB

Claims

1. A document content recommendation method comprising:
a server receiving information about documents and users as input data; and
the server generating a preference database (DB) based on the input data,
wherein the input data includes a set of documents, a set of users, a set of documents created by each user, a set of documents recommended by each user, and a set of words appearing in all of the documents, and
wherein the preference DB includes a topic preference of each document, a topic preference of each user, and a word preference of each topic.

2. (Deleted)

3. The method of claim 1,
wherein the topic preference of the document is represented by a probability distribution (θ) of the probability of selecting a topic (z) for the document (d), the topic preference of the user is represented by a probability distribution (γ) of the probability that the user (u) selects a topic (z), and the word preference of the topic is represented by a probability distribution (φ) of the probability of selecting a word (w) for the topic (z).

4. The method of claim 3, wherein generating the preference DB comprises:
(S110) the server calculating a document creation probability, a document recommendation probability, and a word selection probability based on the input data and the prior probability distribution of each preference;
(S120) the server calculating the probability that the data is generated (the likelihood) based on the calculated document creation probability, document recommendation probability, and word selection probability; and
(S130) the server calculating the posterior probability distribution of each preference that maximizes the likelihood.

5. The method of claim 4,
wherein the document creation probability is proportional to the similarity between the probability distribution (θ) of the topic preference of the document and the probability distribution (γ) of the topic preference of the user.

6. The method of claim 5,
wherein the document creation probability is defined by the following equation:

where Pa(u, d) is the probability that user u creates document d, θ(z|d) is the probability distribution of the topic preference of the document, γ(z|u) is the probability distribution of the topic preference of the user, α is a constant, and KL is the Kullback-Leibler divergence function.

7. The method of claim 4,
wherein the document recommendation probability is proportional to the similarity between the probability distribution (θ) of the topic preference of the document and the probability distribution (γ) of the topic preference of the user, and inversely proportional to the number of recommendations the document has received.

8. The method of claim 7,
wherein the document recommendation probability is defined by the following equation:

where θ(z|d) is the probability distribution of the topic preference of the document, γ(z|u) is the probability distribution of the topic preference of the user, β is a constant, s is the number of recommendations received by document d, and KL is the Kullback-Leibler divergence function.

9. The method of claim 4,
wherein the word selection probability is proportional to the product of the probability distribution (θ) of the topic preference of the document and the probability distribution (φ) of the word preference of the topic.

10. The method of claim 4, further comprising:
the server calculating a preference score for each of a plurality of candidate documents based on the posterior probability distribution of each preference; and
the server recommending a document to the user based on the preference score calculated for each candidate document.

11. The method of claim 10,
wherein the preference score is proportional to the similarity between the probability distribution (θ) of the topic preference of the document and the probability distribution (γ) of the topic preference of the user.

12. The method of claim 11,
wherein the preference score is defined by the following equation:

where score(d_i) is the preference score of user u_a for candidate document d_i, γ(z|u_a) is the probability distribution of the topic preference of the user u_a to whom the recommendation is made, θ(z|d_i) is the probability distribution of the topic preference of candidate document d_i, and KL is the Kullback-Leibler divergence function.

13. The method of claim 12,
wherein, when the probability distribution γ(z|u_a) of the topic preference of the user to whom the recommendation is made is unknown, the probability distribution γ(z|u_a) is calculated based on the set D(u_a) of documents created by the user u_a and the set L(u_a) of documents recommended by the user u_a.

14. The method of claim 12,
wherein, when the probability distribution θ(z|d_i) of the topic preference of the candidate document d_i is unknown, the probability distribution θ(z|d_i) is calculated based on the probability distribution of the words of the candidate document d_i, the probability distribution of the topic preference of the user who created the document d_i, and the probability distribution of the topic preference of the users who recommended the document d_i.

15. A computer-readable recording medium on which is recorded a program for causing a computer to execute the method of any one of claims 1 to 14.