KR102053858B1

KR102053858B1 - Method for calculating rating of content

Info

Publication number: KR102053858B1
Application number: KR1020180023985A
Authority: KR
Inventors: 김여립
Original assignee: Ulsan National Institute of Science and Technology (UNIST)
Priority date: 2018-02-27
Filing date: 2018-02-27
Publication date: 2019-12-09
Also published as: KR20190102905A

Abstract

A content rating calculation method according to one embodiment of the present invention includes: collecting documents that contain ratings of content; classifying the collected documents into a plurality of groups according to a preset classification condition; calculating, using a text mining algorithm, vector values for correlations between preset keywords and documents in some of the classified groups; generating, using a pattern recognition algorithm, a rating calculation model based on some of the calculated vector values; and calculating a rating of the content that is output as a result of applying documents in the remaining groups among the classified groups to the rating calculation model.

Description

METHOD FOR CALCULATING RATING OF CONTENT

The present invention relates to a method for calculating a content rating, and more particularly to a method that analyzes reviews morpheme by morpheme using TF-IDF (Term Frequency-Inverse Document Frequency) and calculates a rating of the content using a support vector machine.

A rating is a score assigned by evaluating the worth of something. With the growth of social network services, rating media content such as movies, music, and dramas has become commonplace.

Ratings given by internet users, critics, and others also affect the commercial success of media content. For example, it has become undeniable that the online ratings given by audiences who watched a film before its release affect its box-office performance after release.

The related art discloses a dynamic noise removal method and a content recommendation system for content recommendation, which improve data reliability by removing noise from user rating data on media content.

However, the related art only discloses filtering noise by analyzing the distribution of ratings, using the number and timing of users' ratings as weights. It does not disclose estimating a rating by analyzing reviews at all.

A review describes what a user felt about the media content and carries evaluative intent that a rating alone cannot convey.

Accordingly, there is a need for a technique that estimates a more accurate rating for media content by analyzing reviews, rather than merely aggregating ratings.

KR 10-2017-0079423 A

The present invention relates to a method that, based on documents collected for a content item of interest, analyzes the documents morpheme by morpheme using TF-IDF and calculates an average rating of the content using a support vector machine.

A content rating calculation method according to one embodiment of the present invention may include: collecting documents that contain ratings of content; classifying the collected documents into a plurality of groups according to a preset classification condition; calculating, using a text mining algorithm, vector values for correlations between preset keywords and documents in some of the classified groups; generating, using a pattern recognition algorithm, a rating calculation model based on some of the calculated vector values; and calculating a rating of the content that is output as a result of applying documents in the remaining groups among the classified groups to the rating calculation model.

After the step of calculating the rating, the method may further include repeating, for the collected documents, the steps from classifying into the groups through calculating the rating.

The repeating step may be repeated until a repetition count set in proportion to the number of collected documents is reached.

After the repeating step, the method may further include calculating the average of the ratings produced in the step of calculating the rating.

After the step of collecting the documents, the method may further include classifying the collected documents by category based on classification keywords preset for each category.

The steps from classifying into the groups through calculating the rating may be performed separately for each category, so that a rating is calculated for each category.

After the step of generating the rating calculation model, the accuracy of the model's classification performance may be calculated by inputting the remaining calculated vector values into the generated rating calculation model.

The step of calculating the vector values may include using a TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to calculate vector values for correlations between the preset keywords and the documents in first and second groups among the classified groups.

The step of generating the rating calculation model may include using a support vector machine to generate the rating calculation model from the vector values calculated for the first group.

The classifying step may include adjusting, among the documents in the first group, the number of documents rated at or above a given score based on the number of documents rated below that score.

The preset classification condition may include a condition that classifies the collected documents by their ratings of the content and then divides the classified documents according to a predetermined ratio into the plurality of groups.

According to the content rating calculation method of one embodiment of the present invention, a rating of the content is calculated from the collected documents using a support vector machine together with the TF-IDF algorithm, so the method can provide a more specific and more reliable rating than simply aggregating the ratings that users attached to their reviews.

In addition, apart from the overall rating of the media content, ratings can be calculated and provided for individual categories such as visual quality, story, or OST.

The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the invention and, together with the detailed description, explain its technical features.
FIG. 1 is a simplified flowchart of a content rating calculation method according to one embodiment of the present invention.
FIG. 2 is a simplified flowchart of a content rating calculation method according to another embodiment of the present invention.

In this specification, terms such as "first" and/or "second" are used only to distinguish one component from another. They are not intended to limit the components.

Components, features, and steps referred to with the expression "include" in this specification are stated to be present, and the expression is not intended to exclude one or more other components, features, steps, or their equivalents.

Unless a term in this specification is specifically stated to be singular, it includes the plural. That is, a component mentioned in this specification may imply the presence or addition of one or more other components.

Unless defined otherwise, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

That is, terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless expressly defined in this specification, are not to be interpreted in an idealized or overly formal sense.

Hereinafter, a content rating calculation method according to embodiments of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a simplified flowchart of a content rating calculation method according to one embodiment of the present invention.

Referring to FIG. 1, a content rating calculation method according to one embodiment of the present invention may include collecting documents (S101), classifying them into groups (S103), calculating vector values (S105), generating a rating calculation model (S107), calculating an accuracy and a rating (S109), repeating (S111), and calculating the averages of the calculated accuracies and ratings (S113).

The step of collecting documents (S101) collects documents that contain ratings of the content from portal sites or social network services, and may be performed using a crawling technique.

For example, documents containing ratings of movie A may be collected from a portal site by crawling. The collected documents contain ratings and user reviews of movie A, and documents posted on a plurality of portal sites may all be collected.

When documents containing ratings of a particular movie are collected from a plurality of portal sites, the rating range may differ from site to site. The step of collecting documents (S101) may therefore include scaling the rating ranges so that the rating ranges of the portal sites match.

For example, if site a's ratings range from 1 to 10 points while site b's ratings range from 1 to 100 points, site b's rating range may be scaled to site a's rating range so that both sites' ratings range from 1 to 10 points.

However, unlike the example above, if site a's ratings range from 1 to 10 points and site b's ratings range from 0 to 10 points, users can be assumed to rate by the same standard, so the step of scaling the rating range may be performed optionally.
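
As a minimal illustration of the scaling in the example above, the sketch below maps a score linearly from one site's range onto another's. The specification does not state the mapping, so the linear formula and the name `rescale_rating` are assumptions made here for illustration only.

```python
def rescale_rating(score, src_min, src_max, dst_min=1.0, dst_max=10.0):
    """Linearly map a score from [src_min, src_max] onto [dst_min, dst_max].

    Assumption: a linear mapping; the specification only says the range is scaled.
    """
    if src_max == src_min:
        raise ValueError("source range must have non-zero width")
    fraction = (score - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


# Site b rates on 1..100; bring its scores onto site a's 1..10 scale.
site_b_scores = [1, 50.5, 100]
rescaled = [rescale_rating(s, 1, 100) for s in site_b_scores]
```

Under this mapping the lowest and highest scores of site b land on 1 and 10, the end points of site a's range.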

The step of classifying into groups (S103) classifies the collected documents into a plurality of groups according to a preset classification condition.

The preset classification condition classifies the collected documents by their ratings of the content and then divides the classified documents according to a predetermined ratio into a plurality of groups.

Documents may be classified by rating as positive, negative, or neutral documents.

For example, documents rated below 3 points may be classified as negative documents, documents rated from 4 to 6 points as neutral documents, and documents rated from 7 to 10 points as positive documents.
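
The example thresholds can be written out as a short sketch. The function name `label_by_rating` is hypothetical. Note that the thresholds as stated leave a rating of exactly 3 points without a label, and the sketch reproduces that gap rather than guessing how it should be closed.

```python
def label_by_rating(rating):
    """Label a document by its rating using the example thresholds in the text."""
    if rating < 3:
        return "negative"
    if 4 <= rating <= 6:
        return "neutral"
    if 7 <= rating <= 10:
        return "positive"
    # The stated thresholds do not cover this rating (for example, exactly 3).
    return None


labels = [label_by_rating(r) for r in (2, 5, 9, 3)]
```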

The positive documents and negative documents are then randomly divided according to the predetermined ratio and classified into a first group and a second group, and the neutral documents are divided according to the predetermined ratio and classified into a third group.

For example, if the predetermined ratio is set to 7 to 3, 70% of the positive documents and 70% of the negative documents are classified into the first group, and the remaining 30% of each are classified into the second group. In addition, 30% of the neutral documents may be classified into the third group.

The documents making up the 70% or 30% portions of the positive, negative, and neutral documents may be selected at random when they are classified into the first through third groups.
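
The 7-to-3 split described above can be sketched as follows. The function name `split_groups`, its parameters, and the rounding of the group sizes are assumptions for illustration; the specification only gives the ratio and says the selection is random.

```python
import random


def split_groups(positive, negative, neutral,
                 train_fraction=0.7, neutral_fraction=0.3, rng=None):
    """Randomly split documents into the first, second, and third groups."""
    rng = rng or random.Random()

    def split(docs):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    pos_first, pos_second = split(positive)
    neg_first, neg_second = split(negative)
    # Only a fraction of the neutral documents goes into the third group.
    third = rng.sample(list(neutral), round(len(neutral) * neutral_fraction))
    return pos_first + neg_first, pos_second + neg_second, third


positive = [f"pos{i}" for i in range(10)]
negative = [f"neg{i}" for i in range(10)]
neutral = [f"neu{i}" for i in range(10)]
group1, group2, group3 = split_groups(positive, negative, neutral,
                                      rng=random.Random(0))
```

Passing a seeded `random.Random` makes a single split reproducible, while leaving `rng` unset gives a fresh random split on every call.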

The positive and negative documents are used to generate the rating calculation model, whereas the neutral documents are used to calculate the rating of the content through the generated model. If all of the neutral documents among the collected documents were used, the calculated rating would be likely to fall between 4 and 6 points, so only 30% of the neutral documents may be classified into the third group.

The step of classifying into groups (S103) may include, after the first through third groups are formed, adjusting the number of positive documents in the first group.

For example, if the collected documents contain a higher proportion of positive documents than negative documents, the average of the calculated ratings is very likely to be biased upward.

Accordingly, the number of positive documents in the first group may be adjusted based on the number of negative documents in the first group.

For example, if the first group contains 100 positive documents and 30 negative documents, and the number of positive documents in the first group is set to be at most three times the number of negative documents, 10 positive documents may be selected at random and removed from the first group.
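
The adjustment in this example can be sketched as below. The function name `cap_positive_documents` is hypothetical. Randomly keeping the allowed number of positive documents is used here as an equivalent of randomly removing the excess.

```python
import random


def cap_positive_documents(positive, negative, max_ratio=3, rng=None):
    """Keep at most max_ratio times as many positive as negative documents."""
    rng = rng or random.Random()
    limit = max_ratio * len(negative)
    if len(positive) <= limit:
        return list(positive)
    # Sampling `limit` documents to keep discards the rest at random.
    return rng.sample(list(positive), limit)


positive_docs = [f"pos{i}" for i in range(100)]
negative_docs = [f"neg{i}" for i in range(30)]
kept = cap_positive_documents(positive_docs, negative_docs,
                              rng=random.Random(0))
removed_count = len(positive_docs) - len(kept)
```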

The step of calculating vector values (S105) uses text mining to calculate vector values for correlations between preset keywords and the documents in some of the classified groups.

A TF-IDF (Term Frequency-Inverse Document Frequency) algorithm may be used as the text mining algorithm, but the algorithm is not limited to this. Any algorithm that can calculate vector values for correlations between documents and keywords may be used without limitation.

The preset keywords may be words that are frequently used in reviews of content. For example, keywords such as “진짜” (“really”), “재밌” (the stem of “fun”), and “난리” (“a big fuss”) may be set.

Because TF-IDF is the product of TF (term frequency) and IDF (inverse document frequency), a TF-IDF value can be calculated between each document in the first and second groups and each preset keyword.

Because a TF-IDF value is calculated for each pair of document and keyword, a vector of the TF-IDF values of the preset keywords can be calculated for each document.

For example, if, for a first document, the TF-IDF value of the keyword “진짜” is 0.3, the TF-IDF value of the keyword “재밌” is 0, and the TF-IDF value of the keyword “난리” is 0.7, the vector value for the correlation between the first document and the preset keywords is calculated as (0.3, 0, 0.7).

Using the TF-IDF algorithm, vector values for correlations with the preset keywords may be calculated for every document in the first and second groups.
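
The vector calculation can be illustrated with a short sketch. The specification does not fix a TF or IDF formula, so the sketch assumes one common variant: TF is the keyword count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the keyword. It also assumes the documents have already been split into morphemes by some external analyzer, which is not shown. The name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math


def tfidf_vectors(tokenized_docs, keywords):
    """Return one TF-IDF vector per document, with one entry per keyword."""
    n_docs = len(tokenized_docs)
    idf = {}
    for kw in keywords:
        doc_freq = sum(1 for doc in tokenized_docs if kw in doc)
        # A keyword that appears in no document contributes 0.
        idf[kw] = math.log(n_docs / doc_freq) if doc_freq else 0.0
    vectors = []
    for doc in tokenized_docs:
        length = len(doc) or 1
        vectors.append([doc.count(kw) / length * idf[kw] for kw in keywords])
    return vectors


docs = [
    ["진짜", "재밌", "영화"],
    ["난리", "진짜", "진짜"],
    ["영화", "별로"],
]
keywords = ["진짜", "재밌", "난리"]
vectors = tfidf_vectors(docs, keywords)
```

Each resulting vector has the same layout as the (0.3, 0, 0.7) example: one TF-IDF value per preset keyword, in keyword order.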

The step of generating a rating calculation model (S107) uses a pattern recognition algorithm to generate a rating calculation model based on some of the calculated vector values.

A support vector machine (SVM) algorithm may be used as the pattern recognition algorithm, but the pattern recognition algorithm is not limited to this.

The step of calculating vector values (S105) yields vector values for the documents in both the first group and the second group. Only the vector values of the documents in the first group are used to generate the rating calculation model, while the vector values of the documents in the second group are used to measure the accuracy of the generated model.

A support vector machine is a supervised learning model for pattern recognition and data analysis. Given a set of data in which each item belongs to one of two categories, it builds a non-probabilistic binary linear classification model that determines which category new data belongs to. With the kernel trick, it can also build a nonlinear classification model.

The step of generating a rating calculation model (S107) may use the support vector machine to generate a hyperplane, which serves as the rating calculation model that separates the vector values of the documents in the first group.

That is, a rating calculation model is generated that separates the vector values, relative to the preset keywords, of the positive documents in the first group from those of the negative documents in the first group.
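
To make the idea of a separating hyperplane concrete, the sketch below trains a linear classifier by plain subgradient descent on the regularized hinge loss. This is a simplified stand-in for an SVM solver, not the solver used in the embodiment, and the function names, hyperparameters, and two-dimensional toy vectors are all assumptions for illustration.

```python
def train_linear_svm(vectors, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b of a hyperplane w.x + b = 0.

    labels must be +1 (positive document) or -1 (negative document).
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights a little on every step.
            w = [wi * (1 - learning_rate * reg) for wi in w]
            if margin < 1:
                # The sample is inside the margin: move the hyperplane.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy first-group vectors: two positive and two negative documents.
train_vectors = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.0, 0.7]]
train_labels = [1, 1, -1, -1]
w, b = train_linear_svm(train_vectors, train_labels)
predictions = [predict(w, b, x) for x in train_vectors]
```

In practice a library SVM implementation would normally be used instead. The loop above is only meant to show what "generating a hyperplane from the first group's vector values" amounts to.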

The step of calculating an accuracy and a rating (S109) calculates the accuracy of the classification performance of the generated rating calculation model and uses the generated model to calculate a rating of the content.

First, the accuracy of the model's classification performance may be calculated by inputting the remaining calculated vector values, that is, the vector values of the documents in the second group, into the rating calculation model.

Like the first group, the second group contains positive documents and negative documents. When the vector values of the documents in the second group are input into the rating calculation model, the accuracy of the model's classification performance can therefore be calculated from how the documents in the second group are classified.

For example, if the rating calculation model receives the vector values of the documents in the second group and classifies those documents into positive documents and negative documents without error, the accuracy of the model's classification performance is calculated as 100%.
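
The accuracy described here is the share of second-group documents whose predicted label matches the label derived from their rating. A minimal sketch, with the hypothetical name `classification_accuracy`:

```python
def classification_accuracy(predicted_labels, true_labels):
    """Return the percentage of documents classified correctly."""
    if not true_labels or len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)


true_labels = ["positive", "positive", "negative", "negative"]
perfect = classification_accuracy(true_labels, true_labels)
one_error = classification_accuracy(
    ["positive", "negative", "negative", "negative"], true_labels)
```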

Next, a rating of the content may be calculated as the output of applying the documents in the remaining group among the classified groups to the rating calculation model.

That is, a rating of the content may be calculated as a result of applying to the rating calculation model the documents in the third group, for which vector values for correlations with the preset keywords have not been calculated.

The result of applying the documents to the rating calculation model is output as the probability that each document in the third group is a positive document, and the rating of the content may be calculated by averaging the probabilities output for the documents in the third group.

For example, if the average of the probabilities output by applying the documents in the third group to the rating calculation model is 80%, the rating of the content may be calculated as 80 points.
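
The averaging step can be sketched as below. How the model turns a document into a probability is not specified in the text, so the sketch starts from per-document probabilities that are assumed to be given. The name `content_rating` and the sample probabilities are hypothetical.

```python
def content_rating(positive_probabilities):
    """Average the per-document positive probabilities and express as points."""
    if not positive_probabilities:
        raise ValueError("at least one probability is required")
    mean = sum(positive_probabilities) / len(positive_probabilities)
    return 100.0 * mean


# Probabilities that three third-group documents are positive.
rating = content_rating([0.9, 0.7, 0.8])
```

With these inputs the average probability is 80%, which corresponds to the 80-point rating in the example above.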

The repeating step (S111) repeats, after the accuracy and the rating have been calculated, the steps from classifying the collected documents into groups (S103) through calculating the accuracy and the rating (S109).

Under the preset classification condition, the collected documents are classified by rating as positive, negative, and neutral documents and are then randomly classified into the first, second, and third groups according to the predetermined ratio, so the documents in the first through third groups may change with each repetition.

That is, the repeating step (S111) may change the documents in the first through third groups, generate a new rating calculation model, calculate the accuracy of the classification performance of the new model, and calculate a rating of the content by applying the documents in the changed third group to the new model.

The repeating step (S111) may be performed until a repetition count set in proportion to the number of collected documents is reached.

The more documents are collected, the less the documents in the first through third groups overlap from one repetition to the next, so the repeating step (S111) may be performed until a repetition count set in proportion to the number of collected documents is reached.

Alternatively, the repeating step (S111) may be performed until a preset repetition count is reached, regardless of the number of collected documents.

For example, if the content rating calculation method according to one embodiment of the present invention is set to be performed 100 times, the repeating step (S111) is repeated 99 times regardless of the number of collected documents, yielding 100 accuracy values for the classification performance of the rating calculation models and 100 ratings of the content.

The step of calculating the averages of the calculated accuracies and ratings (S113) calculates, once the preset repetition count has been reached, the average of the accuracy values and the average of the content ratings calculated so far.

For example, if 100 accuracy values and 100 content ratings have been calculated, their averages are calculated and output as the accuracy of the classification performance of the rating calculation model and as the rating of the content.
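
The repetition and averaging of steps S111 and S113 can be sketched as a simple loop. Here `run_once` is a hypothetical callable standing for one pass of steps S103 through S109 that returns an (accuracy, rating) pair. The proportionality constant in `repetitions_for` is likewise an assumption, since the text gives no value.

```python
import random


def repetitions_for(num_documents, docs_per_repetition=100, minimum=1):
    """Repetition count proportional to the number of collected documents."""
    return max(minimum, num_documents // docs_per_repetition)


def repeat_and_average(run_once, repetitions, seed=0):
    """Run the pipeline `repetitions` times and average accuracy and rating."""
    rng = random.Random(seed)
    results = [run_once(rng) for _ in range(repetitions)]
    accuracies = [accuracy for accuracy, _ in results]
    ratings = [rating for _, rating in results]
    return sum(accuracies) / len(accuracies), sum(ratings) / len(ratings)


def example_run(rng):
    """Stand-in for one pass of S103 to S109; returns fixed values."""
    return 90.0, 80.0


average_accuracy, average_rating = repeat_and_average(example_run, 100)
```

A real `run_once` would use the supplied random generator to redo the random grouping, so that each repetition trains and evaluates on a different split.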

FIG. 2 is a simplified flowchart of a content rating calculation method according to another embodiment of the present invention.

Referring to FIG. 2, a content rating calculation method according to another embodiment of the present invention may further include, after the step of collecting documents (S101), a step of classifying the collected documents by category (S102).

The step of classifying the collected documents by category (S102) classifies the collected documents based on classification keywords preset for each category.

For example, for the movie genre, categories may be set for acting, story, direction, OST (original soundtrack), and visual quality. Classification keywords such as 스토리 ("story"), story, 내용 ("content"), 개연성 ("plausibility"), and 소재 ("subject matter") may be set for the story category, and classification keywords such as OST, 노래 ("song"), 음악 ("music"), and 음향 ("sound") may be set for the OST category.

Some of the collected documents contain classification keywords preset for more than one category, so a document containing classification keywords of several categories may be included in every one of those categories.
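
The category classification of step S102 can be sketched with simple keyword matching. The keyword lists come from the example above. The substring test, the name `categorize`, and the sample reviews are assumptions for illustration.

```python
CATEGORY_KEYWORDS = {
    "story": ["스토리", "story", "내용", "개연성", "소재"],
    "ost": ["OST", "노래", "음악", "음향"],
}


def categorize(documents, category_keywords=CATEGORY_KEYWORDS):
    """Assign each document to every category whose keywords it contains."""
    by_category = {name: [] for name in category_keywords}
    for doc in documents:
        for name, keywords in category_keywords.items():
            if any(keyword in doc for keyword in keywords):
                by_category[name].append(doc)
    return by_category


reviews = [
    "스토리가 탄탄하고 음악도 좋다",  # mentions both story and music
    "노래가 좋다",                    # mentions only a song
    "배우가 멋지다",                  # matches no category keyword
]
categorized = categorize(reviews)
```

The first review lands in both categories, which matches the statement that a document may belong to every category whose keywords it contains.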

After the collected documents are classified by category, the steps from classifying into groups (S103) through calculating the averages of the calculated accuracies and ratings (S113) may be performed independently for each category.

For example, in the story category a rating calculation model for the story category is generated, so a rating for the story category is calculated; likewise, in the OST category a rating calculation model for the OST category is generated, so a rating for the OST category may be calculated.
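The per-category independence described above can be sketched as applying one pipeline function to each category's documents in turn. Everything named here is hypothetical: `rate_per_category` is an illustrative helper, and `placeholder_pipeline` is only a stand-in that averages the ratings attached to the documents. It does not implement steps S103 to S113 (grouping, TF-IDF vectors, support vector machine model, averaging).

```python
from statistics import mean

def rate_per_category(docs_by_category, rate_documents):
    """Apply the same rating pipeline to each category's documents separately."""
    ratings = {}
    for category, docs in docs_by_category.items():
        if docs:  # skip categories that received no documents
            ratings[category] = rate_documents(docs)
    return ratings

# Stand-in for steps S103 to S113 (NOT the TF-IDF/SVM pipeline of the text):
# it simply averages the ratings attached to the documents.
def placeholder_pipeline(docs):
    return mean(rating for _text, rating in docs)

# Each document is a (review text, rating) pair; the values are made up.
docs_by_category = {
    "story": [("스토리가 탄탄하다", 9), ("내용이 지루하다", 5)],
    "ost": [("음악이 정말 좋다", 8)],
    "director": [],
}
category_ratings = rate_per_category(docs_by_category, placeholder_pipeline)
```

Passing the pipeline in as a function keeps the categories isolated from one another: each call sees only its own category's documents, which mirrors the idea that each category gets its own model and its own rating.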

That is, the content rating calculation method can calculate a rating for the content from all of the collected documents, and can also calculate a rating for each category based on the preset classification keywords.

Therefore, when a user has to choose one of several contents of interest to purchase, book in advance, or view, the user can calculate and check the overall rating and the category-specific ratings of those contents, which has the advantage of allowing the contents of interest to be compared in detail.

Although the description herein has been presented in terms of several exemplary aspects, various modifications and changes may be made within the scope defined by the claims that follow, and the technical scope of protection of the present invention should be determined by the following claims.

Claims

A content rating calculation method comprising:
collecting documents that include a rating for content;
classifying the collected documents into a plurality of groups according to a predetermined classification condition;
calculating, using a text mining algorithm, vector values for correlations between predetermined keywords and the documents included in some of the classified groups;
generating, using a pattern recognition algorithm, a rating calculation model based on some of the calculated vector values; and
calculating a rating for the content, the rating being output as a result of applying the documents included in the remaining groups among the classified groups to the rating calculation model,
wherein the classifying comprises,
for the documents included in a first group among the classified groups,
adjusting the number of documents whose rating is above a certain score based on the number of documents whose rating is below the certain score.

The content rating calculation method of claim 1,
further comprising, after calculating the rating,
repeating the steps from classifying the collected documents into groups through calculating the rating.

The content rating calculation method of claim 2,
wherein the repeating
is performed until a number of repetitions, set in proportion to the number of collected documents, is reached.

The content rating calculation method of claim 2,
further comprising, after the repeating,
calculating an average of the ratings calculated in the step of calculating the rating.

The content rating calculation method of claim 1,
further comprising, after collecting the documents,
classifying the collected documents by category based on classification keywords preset for each category.

The content rating calculation method of claim 5,
wherein the steps from classifying into groups through calculating the rating
are performed for each of the classified categories, so that a rating is calculated for each of the categories.

The content rating calculation method of claim 1,
further comprising, after generating the rating calculation model,
calculating an accuracy of the classification performance of the generated rating calculation model by inputting the remainder of the calculated vector values into the generated rating calculation model.

The content rating calculation method of claim 1,
wherein calculating the vector values comprises
calculating, using a term frequency-inverse document frequency (TF-IDF) algorithm, vector values for correlations between predetermined keywords and the documents included in the first group and a second group among the classified groups.

The content rating calculation method of claim 8,
wherein generating the rating calculation model comprises
generating the rating calculation model from the vector values calculated for the first group, using a support vector machine.

(Deleted)

The content rating calculation method of claim 1,
wherein the predetermined classification condition
is a condition of sorting the collected documents based on their rating for the content,
and dividing the sorted documents by number according to a predetermined ratio to classify them into the plurality of groups.