KR101049648B1

KR101049648B1 - Blog Rank Method for Efficiently Searching Blogs Using the Blog Rank Algorithm

Info

Publication number: KR101049648B1
Application number: KR1020090014906A
Authority: KR
Inventors: 이지형; 윤광호; 심학준; 김정훈
Original assignee: 성균관대학교산학협력단
Priority date: 2009-02-23
Filing date: 2009-02-23
Publication date: 2011-07-14
Also published as: KR20100095883A

Abstract

The present invention provides a blog rank method for searching blogs using a blog rank algorithm. The method evaluates the reputation of the blogger who wrote a blog page, the trackback connectivity of the page, and the user responsiveness of the page, and evaluates the page on the basis of these three evaluations, wherein the evaluation of page e with respect to keyword t, ES(e, t), is defined as

ES(e, t) = [equation image missing from source]

Description

BLOG RANK METHOD FOR EFFECTIVELY SEARCHING THE BLOG USING BLOG RANK ALGORITHM

The present invention relates to a blog search method, and more particularly to a blog rank method for efficiently searching blogs using a blog rank algorithm.

Today, with the advent of the Web 2.0 environment, most web pages are created in the blogosphere, and existing web pages are also migrating to it. Efficient information retrieval for the blogosphere is therefore becoming more important by the day. Page ranking algorithms are the core of such retrieval technology.

Many algorithms for ranking web pages have been studied (e.g., S. Chakrabarti, 2003. Mining the Web, Morgan Kaufmann Publishers.). Among them, PageRank (S. Brin and L. Page, 1998. The Anatomy of a Large-scale Hypertextual Web Search Engine, In Proceedings of 7th International World Wide Web Conference.) and HITS (J. M. Kleinberg, 1999. Authoritative sources in a hyperlinked environment, Journal of the ACM, Vol. 46, No. 5.) have been the most successful, and their search effectiveness is widely recognized in academia and industry. What characterizes these algorithms is that they judge the quality of a page from its link connectivity. They can of course be applied directly to the blogosphere, but because they overlook the following structural features of blog pages, more efficient search results are hard to expect.

1) Unlike ordinary web pages, blog pages have structural attributes such as trackback connections, tags, and comments (C. Marlow. Audience, structure and authority in the weblog community. In International Communication Association Conference, New Orleans, LA, 2004. http://web.media.mit.edu/>>ameron/cv/pubs/04-01.pdf.). These attributes are created and modified by bloggers active in the blogosphere, so they reflect those bloggers' thoughts, interests, and reactions. They also encourage interaction among bloggers with similar thoughts and interests, through which more high-quality information is created and maintained (A. Java, P. Kolari, T. Finin, and T. Oates. Modeling the Spread of Influence on the Blogosphere. Technical report, University of Maryland, Baltimore County, March 2006.). In other words, information in the blogosphere develops gradually through these structural attributes and interactions, and both should be taken into account for more efficient search.

2) A blog site consists of, and is managed through, pages written by a single blogger, which is why a blog is often called a "personal publishing tool". The quality and topics of a blog's pages are therefore usually determined by the knowledge and interests of the blogger who runs it. Because of this, a useful blog page that has no connections yet can be rated as a good page on the basis of the blogger's past reputation.

In short, most web pages today are created in the blogosphere, and existing web pages are migrating to it. In this situation, the differences between blog pages and web pages, such as trackback connections, blogger reputation, tags, and user responsiveness, are too important to overlook. Simply applying a traditional web page ranking algorithm that ignores these differences to blog pages is not adequate for efficient search.

The present invention was made in view of the above. Its object is to provide a blog rank method for efficiently searching blogs using a blog rank algorithm that exploits the structural features of blogs to evaluate blogger reputation, trackback connectivity, and user responsiveness, and evaluates blog pages on that basis.

To achieve the above object, the present invention provides a blog rank method for searching blogs using a blog rank algorithm, in which the reputation of the blogger who wrote a blog page, the trackback connectivity of the page, and the user responsiveness of the page are evaluated, and the page is evaluated on the basis of these evaluations, characterized in that the evaluation of page e with respect to keyword t, ES(e, t), is defined as

ES(e, t) = [equation image missing from source]

where

b: the blogger who wrote page e, which contains tag t

K: the number of trackback connections of page e

b_i: the blogger who created the i-th trackback connection to page e

BS(t, b): the reputation score of blogger b for tag t

TBS(t, e, b_i): the score, with respect to tag t, of the i-th trackback connection to page e, created by blogger b_i

α^b_t: the weight of the comments, defined as [equation image missing from source]

N_tr(b, t): the number of trackback connections of all pages written by blogger b for tag t

N_cm(b, t): the number of comments on all pages written by blogger b for tag t

β: 0.001 (for the case where there are no trackback connections)

CS(e): the number of comments on page e.

According to the present invention, a comprehensive analysis of blog structure implicitly reflects not only physical connectivity but also the content evaluation expressed through user participation, which improves search efficiency.

Moreover, pages that have no links yet, early in their life, are evaluated more effectively.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

The blog rank method according to the present invention uses a "blog-rank" algorithm that evaluates pages in light of the characteristics of blogs. The algorithm rates a page higher if it has higher trackback connectivity and user responsiveness, and also if it was written by a blogger with a higher reputation. A page has higher trackback connectivity the more trackback connections it receives from bloggers with high reputations. To realize the blog-rank algorithm, the present invention evaluates the following three factors.

ㆍ The reputation of the blogger who wrote the page

ㆍ The trackback connectivity of the page

ㆍ The user responsiveness of the page

The blogger's reputation is evaluated by computing two scores from trackback connections, tags, and comments: the score of the pages the blogger has written, and the blogger's activity score in the blogosphere. The page score is computed from tags, trackback connections, and comments; the activity score is computed from the number of good pages the blogger has written. In the present invention, a good page is chiefly a page with high trackback connectivity.

Trackback connectivity is computed from the number of trackback connections and the reputation of the bloggers who created them.

User responsiveness is computed from the value assigned to comments relative to trackback connections.

The blog-rank algorithm according to the present invention has two important features.

The first is that pages are evaluated from a structural point of view; the algorithm does not look at the content of a page. Because the structural attributes of a blog reflect page quality, however, the content is evaluated implicitly. For example, a blogger who finds a useful and interesting page on some blog site will want to offer it to the users of his or her own site and will therefore create a trackback connection to it. Even a reader who does not share the page through a trackback connection may express an opinion on it through a comment. Trackback connections and comments thus indicate that someone found the content of the page useful and interesting, and a page that has more of these attributes, or better ones, is more likely to satisfy users. In other words, the content of blog pages is evaluated implicitly by the activity of bloggers, and the result of that evaluation is naturally embedded in the structure of the pages.

The second feature is that pages can be evaluated on the basis of a blogger's past activity. A page that has no inbound links at first, but was written by a blogger who has attracted much attention in the past, and is therefore likely to be useful, can be rated highly. This is an important property given that the blogosphere mainly deals with new topics.

There have been several studies on ranking blog pages. For example, Adar et al. proposed a ranking method called iRank [E. Adar, L. Zhang, L. Adamic, R. Lukose, (2004). "Implicit Structure and the Dynamics of Blogspace", Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, WWW 2004].

That method rates pages that contain the source of information highly, whereas the ranking method using the blog-rank algorithm according to the present invention rates popular pages highly.

Because the blogosphere mainly deals with current topics, the ranking method using the blog-rank algorithm, which rates popular pages higher on the basis of user interest and interaction, provides better information.

Fujimura et al. [K. Fujimura, T. Inoue, M. Sugisaki, (2005). "The EigenRumor Algorithm for Ranking Weblogs", 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, WWW 2005.] define the connectivity between agents and objects as an EigenRumor vector and propose evaluating pages by computing it. That method relies entirely on the provision and evaluation of information and does not consider trackback connectivity. In contrast, the connectivity evaluation of the blog-rank algorithm according to the present invention is based on trackback connectivity, so it implicitly includes an evaluation of page content and enables better information retrieval.

The connectivity analysis of the blog-rank algorithm according to the present invention is similar to that of PageRank and HITS. However, because the blog-rank algorithm evaluates and maintains a reputation score for each blogger, it can evaluate a page from the blogger's reputation regardless of when the page was written. This is nearly impossible with PageRank and HITS, which need a certain amount of time for a page to accumulate the links used for scoring.

In addition, a software search module applying the blog-rank algorithm according to the present invention was implemented, and its search performance was compared with that of a search method based on an existing algorithm. For this comparison, 195 blog sites and 62,906 pages were collected from the Tistory domain (http://www.tistory.com/). The search method applying the blog-rank algorithm retrieved information more relevant to user queries than the existing Tistory search system did. This shows that the structural features of blogs can improve the performance and usefulness of blog search.

Fig. 1 shows the structural features of the blogosphere; a blog page differs structurally from a conventional web page. The blogosphere contains a large number of blog sites. Each blog site consists of one blogger and the pages that blogger has written. Each page has attributes such as content, tags, trackback connections, comments, the ID of the blogger who wrote it, and its creation time. Bloggers read pages in the blogosphere and create trackback connections and comments, producing two kinds of interaction between blog sites. A trackback connection is a link between a blog page and another blog site; a comment is an interaction between a blog page and a user.

Users usually search the web with keywords, and a search engine returns information related to those keywords. Blog pages generally carry tags (C. Marlow. Audience, structure and authority in the weblog community. In International Communication Association Conference, New Orleans, LA, 2004. http://web.media.mit.edu/>>ameron/cv/pubs/04-01.pdf.), and bloggers usually choose tags carefully so that they represent the content of their pages well. The present invention therefore uses tags to assess the relevance between a user's query keyword and the content of a page: if keyword k is present on a page as a tag, the content of that page is assumed to be related to the query keyword.
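The tag-matching assumption can be sketched as a filter followed by a sort. This is only an illustration: the page record layout (a dict with "id" and "tags"), the function name search_by_tag, and the score callback are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, List

# Hypothetical page record, e.g. {"id": 7, "tags": ["olympics", "beijing"]}.
Page = Dict[str, object]


def search_by_tag(pages: List[Page], keyword: str,
                  score: Callable[[Page, str], float]) -> List[Page]:
    """Keep pages that carry the query keyword as a tag, best score first."""
    matching = [p for p in pages if keyword in p["tags"]]
    return sorted(matching, key=lambda p: score(p, keyword), reverse=True)
```

Any page scoring function can be plugged in as score; the document's ES(e, t) would play that role.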

Finally, the present invention judges what makes a good page according to the following assumptions.

ㆍ A page written by a better blogger is a better page

ㆍ A page with more trackbacks created by better bloggers is a better page

ㆍ A page with more comments is a better page

Taking these factors into account, the score of a page is evaluated as in Equation (1):

Score of a page with tag t = blogger's reputation score for keyword t + trackback score + comment score ------- (1)

Here, the blogger's reputation score for keyword t is evaluated from the pages the blogger has written that contain keyword t and from the blogger's past activity regarding keyword t.

The blog-rank algorithm used to evaluate each of these elements is described next.

1. Blog-rank algorithm

Following Equation (1), the blog-rank algorithm is defined as in Equation (2).

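The evaluation of page e with respect to keyword t, ES(e, t), is

ES(e, t) = [equation image missing from source] -------- (2)

where

b: the blogger who wrote page e, which contains tag t

K: the number of trackback connections of page e

b_i: the blogger who created the i-th trackback connection to page e

BS(t, b): the reputation score of blogger b for tag t

TBS(t, e, b_i): the score, with respect to tag t, of the i-th trackback connection to page e, created by blogger b_i

α^b_t: the weight of the comments, defined as [equation image missing from source]

N_tr(b, t): the number of trackback connections of all pages written by blogger b for tag t

N_cm(b, t): the number of comments on all pages written by blogger b for tag t

β: 0.001 (for the case where there are no trackback connections)

CS(e): the number of comments on page e.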

α^b_t is the weight of the comments. Because a comment is worth less than a trackback, α^b_t reduces the influence of comments. Equation (2) shows that the blog-rank algorithm evaluates a page on the basis of blog-structure features such as bloggers, trackback connections, comments, and tags, for efficient blog search. Each of these elements is described below.
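As a rough illustration of how the elements of Equation (2) fit together, the sketch below assumes the additive form stated in words in Equation (1): reputation score plus the sum of trackback scores plus weighted comment count. The exact formula is not reproduced in this text, so both the additive combination and the function name page_score are assumptions; the comment weight is passed in as a plain number because its defining formula is also missing.

```python
from typing import Sequence


def page_score(blogger_reputation: float,
               trackback_scores: Sequence[float],
               comment_count: int,
               comment_weight: float) -> float:
    """Assumed form of ES(e, t): BS(t, b) + sum_i TBS(t, e, b_i) + alpha * CS(e)."""
    return (blogger_reputation
            + sum(trackback_scores)
            + comment_weight * comment_count)
```

With a small comment weight, the trackback term dominates, which matches the statement that comments count for less than trackbacks.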

1.1 Blogger reputation score (BS)

The blogger reputation score represents a blogger's reputation for a particular tag. The method according to the present invention assumes that a blogger's reputation on a topic rests on his or her past activity and past output. If a blogger is active in the blogosphere on a topic (tag), and the output of that activity is good, the blogger's reputation will be high.

To this end, two scores are considered: the average score of all pages the blogger has written for the tag (BES), and the blogger's activity score in the blogosphere for the tag (BAS). Both scores are normalized so that neither dominates the reputation score. On this basis, the blogger reputation score is defined as follows.

BS(t, b) = sigmoid(BES(t, b)) + sigmoid(BAS(t, b))

where sigmoid(a) = 1 / (1 + e^(-a)).
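The normalization can be written directly from the definition of BS. In this minimal sketch the function names are illustrative, and the BES and BAS values are assumed to be computed elsewhere.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic function 1 / (1 + e^(-a)); maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def blogger_reputation(bes: float, bas: float) -> float:
    """BS(t, b) = sigmoid(BES(t, b)) + sigmoid(BAS(t, b))."""
    return sigmoid(bes) + sigmoid(bas)
```

Because each sigmoid term lies strictly between 0 and 1, BS always lies between 0 and 2, so neither the page score nor the activity score can dominate the other.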

1.1.1 Blogger's page score (BES)

The blogger's page score represents the quality of the pages the blogger has written for a particular tag. The present invention defines it as the average score of the pages the blogger wrote in the past, because the pages written in the blogosphere are a blogger's main output.

That is, BES(t, b) is the mean of the past page scores (PES) over E^b_t, the set of all pages written by blogger b for tag t.

In the present invention, a page the blogger wrote in the past is assumed to have been written when the blogger had no reputation. The blogger reputation score (BS) in the page score formula (2) is therefore set to its default value of 1, and the score of a past page (PES) is defined as follows.

[equation image missing from source]

where

K: the total number of trackback connections of page e_i

b_j: the blogger who created the j-th trackback connection to page e_i

TBS_j(t, e_i, b_j): the score, with respect to tag t, of the j-th trackback connection to page e_i, created by blogger b_j

α^b_t: the weight of the comments, defined as [equation image missing from source]

N_tr(b, t): the number of trackback connections of all pages written by blogger b for tag t

CS(e_i): the number of comments on page e_i.

Because the weights α and β are very small, CS(e_i) has little effect on the score of a past page. The blogger's page score is therefore strongly influenced by the trackback score and the blogger activity score. Since the blogger activity score is itself based on the number of trackback connections, the overall evaluation of the blog-rank algorithm is influenced most by trackback connections. The trackback score is described later.
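A minimal sketch of the past page score and its average follows. It assumes that PES has the same additive shape as the page score with BS fixed to its default of 1; the additive form, the function names, and the choice to return 0.0 for a blogger with no past pages are all assumptions, not details taken from the source.

```python
from typing import Sequence


def past_page_score(trackback_scores: Sequence[float],
                    comment_count: int,
                    comment_weight: float,
                    default_reputation: float = 1.0) -> float:
    """Assumed PES: page score with the blogger reputation fixed to 1."""
    return (default_reputation
            + sum(trackback_scores)
            + comment_weight * comment_count)


def blogger_page_score(past_scores: Sequence[float]) -> float:
    """BES(t, b): mean PES over the pages blogger b wrote for tag t."""
    if not past_scores:
        return 0.0  # assumption: no past pages means no page-based evidence
    return sum(past_scores) / len(past_scores)
```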

1.1.2 Blogger activity score (BAS)

The blogger activity score represents a blogger's activity in the blogosphere for a particular tag. It is computed from the number of all pages the blogger has written and from the trackback value per page over all the pages the blogger has written. To realize this idea, the blogger activity score (BAS) is defined as follows.

BAS(t, b) = [equation image missing from source] ----- (4)

where

N^t_en(b): the number of all pages written by blogger b for tag t

N^tr_en(b): the total number of pages written by blogger b that have trackback connections

N_en(b): the number of all pages written by blogger b.

In Equation (4), the first factor represents the quantity of the blogger's activity and the last factor represents its quality. The quality of the blogger's activity is derived from trackback connections.

1.2 Trackback score (TBS)

The trackback score represents the trackback connectivity of a page with a particular tag. As described above, trackback connectivity is evaluated from the number of trackback connections and the reputation of the bloggers who created them. The trackback score of page e_i with respect to tag t is therefore defined as follows:

TBS_j(t, e_i, b_j) = [equation image missing from source]

where

b_j: the blogger who created the j-th trackback connection

BS(t, b_j): the reputation score of blogger b_j, who created the j-th trackback connection.

The value 1 represents the quantitative component of the trackback, and BS(t, b_j) represents its qualitative component.
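The text names two components of a trackback score, the constant 1 and BS(t, b_j), but the formula combining them is not reproduced here. The sketch below assumes they are simply added; that choice, like the function names, is an assumption made for illustration.

```python
from typing import Iterable


def trackback_score(linker_reputation: float) -> float:
    """Assumed TBS_j: 1 (one more link) + BS(t, b_j) (quality of the linker)."""
    return 1.0 + linker_reputation


def trackback_connectivity(linker_reputations: Iterable[float]) -> float:
    """Sum of trackback scores over all trackback connections of a page."""
    return sum(trackback_score(r) for r in linker_reputations)
```

Under this assumption, every trackback raises the page's connectivity, and a trackback from a blogger with a higher reputation raises it more.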

Experimental setup

The ultimate goal of the blog-rank algorithm is to improve the relevance between the query keyword and the returned search results. To demonstrate this improvement, 195 blogs and 62,906 pages were collected from the Tistory domain, and a blog search method applying the blog-rank algorithm according to the present invention was implemented on them. The search results obtained with the blog-rank algorithm were then compared with those of an existing blog search.

For search results to be effective, pages more relevant to the query keyword should be ranked higher than less relevant ones. As noted above, bloggers choose tags carefully so that they represent the content of their pages well. The present invention therefore assumes that a page containing tags related to the query keyword is relevant to the query keyword: the more tags related to the query keyword a page contains, the more relevant it is.

Tags related to a query keyword are defined by co-occurrence frequency. If a page contains a tag identical to the query keyword, every tag on that page has a relevance of 1 to the query keyword. Whenever such tags co-occur again on another page, their relevance value increases by 1. This process is repeated, and the tags whose relevance value exceeds 10 are selected and defined as the tags related to the query keyword. In the present invention these tags are built from pages of the Tistory domain (http://www.tistory.com/), the Egloos domain (http://www.egloos.com/), and the BlogKorea domain (http://www.blogkorea.net/). Tistory is the domain that was collected for computing the blog-rank algorithm; Egloos and BlogKorea are not.
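One way to read the co-occurrence rule is as a count of how many pages carry both the query keyword and a given tag. The sketch below follows that reading; the function name, the representation of pages as lists of tags, and the exclusion of the query keyword from its own related set are assumptions.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def related_tags(pages: Iterable[Sequence[str]], query: str,
                 threshold: int = 10) -> Set[str]:
    """Tags that co-occur with `query` on more than `threshold` pages."""
    counts: Counter = Counter()
    for tags in pages:
        tag_set = set(tags)  # count each tag at most once per page
        if query in tag_set:
            counts.update(tag_set - {query})
    return {tag for tag, n in counts.items() if n > threshold}
```

For example, a tag that appears alongside the query keyword on 11 pages is kept, while one that appears on exactly 10 pages is not, since the relevance value must exceed 10.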

To evaluate the effectiveness of the search results, the number of tags related to the query keyword on each page was measured and applied to NDCG at K (Normalized Discounted Cumulative Gain), computed down to rank K. NDCG at K is one of the information-retrieval evaluation metrics that measure the ranking accuracy of search results [K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the ACM Conference on Research and Development on Information Retrieval (SIGIR), 2000]. For a given query q, the NDCG of the ranked search results is computed for each rank K by the following equation.

[equation image missing from source]

NDCG at K is computed as the sum of the gains from rank j = 1 to K of the search results. r(j) is a function giving the reward at rank j. In the experiments on the blog-rank method according to the present invention, r(j) is computed from RV(e_j, q) of the j-th page e_j: r(j) = log(RV(e_j, q) + 1), where RV(e_j, q) is the number of tags on page e_j that are related to the query keyword q. M_q is a normalization constant chosen so that the maximum value of NDCG is 1.
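The NDCG formula itself is not reproduced in this text, so the sketch below falls back on a commonly used form: gain 2^r(j) - 1, discount log2(1 + j), and normalization by the score of the ideal ordering of the same results (playing the role of M_q). Those three choices and the function names are assumptions; only r(j) = log(RV(e_j, q) + 1) is taken from the description.

```python
import math
from typing import List, Sequence


def reward(related_tag_count: int) -> float:
    """r(j) = log(RV(e_j, q) + 1), as given in the description."""
    return math.log(related_tag_count + 1)


def dcg_at_k(rewards: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over ranks 1..k (assumed gain and discount)."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(rewards[:k], start=1))


def ndcg_at_k(related_tag_counts: Sequence[int], k: int) -> float:
    """NDCG at K for results listed in ranked order, normalized to at most 1."""
    rewards: List[float] = [reward(c) for c in related_tag_counts]
    ideal = dcg_at_k(sorted(rewards, reverse=True), k)
    if ideal == 0:
        return 0.0  # no result carries any related tag
    return dcg_at_k(rewards, k) / ideal
```

A ranking that already lists pages in decreasing order of related-tag count scores 1; any other ordering of pages with differing counts scores less.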

To make the experiment realistic, query keywords drawn from a query keyword pool were used. The 20 experimental queries consist of 10 issue queries and 10 random queries. The issue queries are keywords in the query keyword pool that relate to events or people that had recently become social issues, for example Lee Myung-bak, mad cow disease, and the Olympics. The random queries were selected at random from the query keyword pool. The query keyword pool is built by the following procedure.

(1) Take a set containing the initial keywords as the query-set. The initial keywords are selected at random.

(2) Select one keyword at random from the query-set.

(3) Find keywords related to the selected keyword in the blog area and add them to the query-set. The related keywords are all the tags of every page that carries a tag identical to the selected keyword.

(4) Stop if the size of the query-set exceeds 100.

(5) Select one keyword that has not been selected before and return to step (3).

The completed query-set is used as the query keyword pool.
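Steps (1) to (5) can be sketched as a simple expansion loop. This is an illustrative reading of the procedure, not the code used in the experiment: the function name `build_query_pool`, the `pages` layout (one list of tags per page), the `rng` parameter, and the early exit when no unselected keyword remains are assumptions.

```python
import random


def build_query_pool(pages, seed_keywords, max_size=100, rng=None):
    """Grow a query-set by repeatedly adding tags that co-occur with a chosen keyword."""
    rng = rng or random.Random()
    pool = set(seed_keywords)              # step (1): initial query-set
    visited = set()
    while len(pool) <= max_size:           # step (4): stop once the size exceeds max_size
        candidates = sorted(pool - visited)
        if not candidates:
            break                          # assumption: nothing left to expand
        keyword = rng.choice(candidates)   # steps (2) and (5): pick an unselected keyword
        visited.add(keyword)
        for tags in pages:                 # step (3): add all tags of pages tagged with it
            if keyword in tags:
                pool.update(tags)
    return pool


pages = [["a", "b"], ["b", "c"], ["c", "d"]]
print(sorted(build_query_pool(pages, ["a"])))
```

In the toy input each expansion exposes exactly one new keyword, so the loop ends when every keyword has been selected once; on real blog data the size limit in step (4) would normally end it first.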

Experimental results

To evaluate the search effectiveness of the blog search method based on the blog-rank algorithm according to the present invention, the experiment was carried out as follows, using the experimental setup defined above.

(1) Select the experimental queries. The experimental queries consist of 10 issue queries and 10 random queries.

(2) Submit the experimental queries to the blog-rank search method according to the present invention and to the Tistory blog search method, and collect the search results.

(3) Compare and analyze the search results of the two methods using the NDCG at K evaluation metric and the related tags built from the three domains.

First, the experiment was carried out with the related tags built from the Tistory domain.

FIGS. 2A and 2B are graphs showing the search performance improvement of the present invention for issue queries and random queries, respectively, based on the related tags of the Tistory domain.

As shown in FIG. 2A, the performance improvement is consistent across the top 10 ranked pages. In FIG. 2A, each NDCG value is the average over the 10 issue queries. At rank 2 (K = 2), the NDCG value of the blog-rank according to the present invention is 0.804, while that of the Tistory rank is 0.615.

As shown in FIG. 2B, the random queries likewise show a consistent performance improvement across the top 10 ranked pages. The largest improvement occurs at rank 5, where the blog-rank according to the present invention scores 0.764 and the Tistory rank scores 0.645.

These results show that the blog-rank algorithm according to the present invention achieves better search performance than the existing Tistory rank algorithm. Moreover, because the improvement for issue queries is larger than the improvement for random queries, the results indicate that the blog-rank algorithm better reflects pages on recently topical issues.

The experiment was also carried out with the related tags built from the two domains other than the experimental domain of the present invention, namely the Egloos domain and the BlogKorea domain.

FIGS. 3A and 3B are graphs showing the search performance improvement of the present invention for issue queries and random queries, respectively, based on the related tags of the Egloos domain, and FIGS. 4A and 4B are graphs showing the same improvement based on the related tags of the BlogKorea domain.

The experiments using the Egloos and BlogKorea tags also show that the blog-rank algorithm according to the present invention performs better.

Consequently, the search system to which the blog-rank algorithm according to the present invention is applied was shown to outperform the existing Tistory search system.

Table 1 summarizes the performance improvement of the blog-rank algorithm according to the present invention. In Table 1, each value is the difference between the NDCG value of the method applying the blog-rank algorithm according to the present invention and that of the Tistory search method, and the value in parentheses is the performance improvement rate.

Table 1

Domain used to build the related tags | Tistory | Egloos | BlogKorea
Issue queries | 0.168 (24%) | 0.129 (18%) | 0.142 (20%)
Random queries | 0.085 (12%) | 0.057 (7%) | 0.090 (15%)

Table 1 thus shows the experimental results for the full set of test queries. According to these results, the blog rank method using the blog-rank algorithm according to the present invention achieves an average performance improvement of 16% over the existing search method across all queries (issue queries plus random queries). The average improvement for issue queries is 20%, roughly twice the 11% average improvement for random queries.
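The reported averages can be checked against the improvement rates in Table 1. The short calculation below assumes that each average is the unweighted mean of the corresponding table cells, which the text does not state explicitly; the variable names are illustrative only.

```python
# Improvement rates (%) from Table 1, keyed by the domain used to build the related tags.
improvement = {
    "issue": {"Tistory": 24, "Egloos": 18, "BlogKorea": 20},
    "random": {"Tistory": 12, "Egloos": 7, "BlogKorea": 15},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


issue_avg = mean(improvement["issue"].values())
random_avg = mean(improvement["random"].values())
overall_avg = mean(
    rate for rates in improvement.values() for rate in rates.values()
)

print(f"issue: {issue_avg:.1f}%, random: {random_avg:.1f}%, overall: {overall_avg:.1f}%")
```

Under that assumption the means come out at about 20.7%, 11.3%, and 16.0%, which matches the stated 20%, 11%, and 16% once the fractional part is dropped, and the issue-to-random ratio is about 1.8.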

As described above, unlike traditional web pages, blog pages have structural features such as trackback connections, tags, and comments. The blog rank algorithm according to the present invention takes these features into account and therefore achieves better blog search performance than existing web page rank algorithms.

Furthermore, the algorithm according to the present invention analyzes a blog page in terms of three main factors: the blogger's reputation, trackback connectivity, and user responsiveness. Of these, trackback connectivity and user responsiveness evaluate only structural elements of the blog, yet they carry an implicit evaluation of the content and thereby yield better search performance. This is an important property from the standpoint of computing cost, because analyzing the content of the data directly is computationally expensive. In addition, the method according to the present invention evaluates and maintains bloggers' reputations, so that a page with useful content that has no connections yet is not undervalued.

FIG. 1 is a diagram showing the structural features of the blog area;

FIGS. 2A and 2B are graphs showing the search performance improvement of the present invention for issue queries and random queries, respectively, based on the related tags of the Tistory domain;

FIGS. 3A and 3B are graphs showing the search performance improvement of the present invention for issue queries and random queries, respectively, based on the related tags of the Egloos domain; and

FIGS. 4A and 4B are graphs showing the search performance improvement of the present invention for issue queries and random queries, respectively, based on the related tags of the BlogKorea domain.

Claims

A blog rank method for efficiently searching blogs using a blog rank algorithm, the method comprising:

evaluating the reputation of the blogger who created a blog page, the trackback connectivity of the blog page, and the user responsiveness of the blog page, and evaluating the blog page on the basis of these evaluations,

wherein the evaluation ES(e, t) of page e with respect to keyword t is defined as

ES(e, t) =

where

b: the blogger who created page e with tag t

K: the number of trackback connections on page e

b_i: the blogger who created the i-th trackback connection on page e

BS(t, b): the reputation score of blogger b for tag t

TBS(t, e, b_i): the score, with respect to tag t, of the i-th trackback connection on page e, created by blogger b_i

and

N_tr(b, t): the number of trackback connections on all pages written by blogger b for tag t

N_cm(b, t): the number of comments on all pages written by blogger b for tag t

β: 0.001 (for the case of no trackback connection)

CS(e): the number of comments on page e.

The reputation score BS(t, b) of blogger b for tag t is

BS(t, b) = sigmoid(BES(t, b)) + sigmoid(BAS(t, b))

where sigmoid(a) = 1 / (1 + e^(-a)).
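As an illustration only, the reputation score above can be computed as in the following sketch. The function names are hypothetical, and the page score BES and activity score BAS are taken as given inputs because their formulas are not reproduced in this text.

```python
import math


def sigmoid(a):
    # sigmoid(a) = 1 / (1 + e^(-a)); maps any real score into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-a))


def blogger_score(bes, bas):
    # BS(t, b) = sigmoid(BES(t, b)) + sigmoid(BAS(t, b)), so BS lies in (0, 2).
    return sigmoid(bes) + sigmoid(bas)


print(blogger_score(0.0, 0.0))
```

When both input scores are 0, each sigmoid term is 0.5 and the reputation score is 1.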

3. The method of claim 2, wherein the blogger's page score BES is

where E^b_t: the set of all pages written by blogger b for tag t.

The method of claim 1,

wherein, in the formula ES(e, t) =

the blogger reputation score BS is set to its default value of 1 to produce a score PES for past pages, and the score PES for past pages is

where

K: the number of all trackback connections on page e

b_j: the blogger who created the j-th trackback connection on page e_i

TBS_j(t, e_i, b_j): the score, with respect to tag t, of the j-th trackback connection on page e_i, created by blogger b_j

and

N_tr(b, t): the number of trackback connections on all pages written by blogger b for tag t

N_cm(b, t): the number of comments on all pages written by blogger b for tag t

β: 0.001 (for the case of no trackback connection)

CS(e): the number of comments on page e.

The method of claim 2, wherein the blogger activity score BAS is

where

N^t_en(b): the number of all pages written by blogger b for tag t

N^tr_en(b): the number of all pages written by blogger b that have trackback connections

N_en(b): the number of all pages written by blogger b.

The method of claim 1, wherein the trackback score TBS is

TBS_i(t, e_i, b_j) =

where

b_j: the blogger who created the j-th trackback connection

BS(t, b_j): the reputation score of the blogger b_j who created the j-th trackback connection.