KR100992524B1

KR100992524B1 - A method and system for measuring web page similarity using social bookmarks

Info

Publication number: KR100992524B1
Application number: KR1020090018456A
Authority: KR
Inventors: 김명호; 강상욱; 이기용; 김현규
Original assignee: 한국과학기술원
Priority date: 2009-03-04
Filing date: 2009-03-04
Publication date: 2010-11-05
Also published as: KR20100099890A

Abstract

The present invention relates to a method for measuring the similarity between web pages using tags. The technical problem it addresses is to provide a method that, by calculating the probability that a tag belongs to each abstract class when it is used together with other tags, can estimate the sense in which the tag is used and can therefore measure the similarity between web pages more accurately.

To this end, the disclosed method treats web pages and their tags as co-occurrence data and comprises: an SMM construction step of obtaining an SMM from the given web pages, their tags, and abstract classes preset in the system; and a similarity measurement step of using the result values of the SMM construction step to calculate the probability that the tags of the web pages being compared are included in the same abstract class, and obtaining the similarity between the web pages on that basis.

Web page, similarity, tag, social bookmark

Description

A method and system for measuring similarity between web pages using tags {A METHOD AND SYSTEM FOR MEASURING WEB PAGE SIMILARITY USING SOCIAL BOOKMARKS}

The present invention relates to a method and system for measuring similarity between web pages.

Social bookmarking services have recently emerged on the web. A social bookmarking service lets web users freely tag web pages.

A tag is a word or short phrase that describes the subject or content of a web page. Multiple tags may be attached to one web page, and one tag may be used on multiple web pages.

At present, the site where web users tag most actively is Delicious (http://delicious.com).

SocialSimRank (SSR) is a method for measuring the similarity between tags. Although SSR was not proposed for directly measuring the similarity between web pages, it can measure page similarity on the basis of the similarity between tags.

As its result, SSR produces a two-dimensional matrix S_A of tag-to-tag similarity values. The size of S_A is N_A × N_A, where N_A is the number of tags.

Suppose pages P and Q have the tags [t₁, t₂, ..., t_n] and [u₁, u₂, ..., u_m], respectively. Under SSR, the similarity between the two pages is the sum of the similarities between every tag of P and every tag of Q, which can be expressed as Equation 1:

$$\mathrm{Sim}(P, Q) = \sum_{i=1}^{n} \sum_{j=1}^{m} S_A(t_i, u_j) \qquad (1)$$

Similarity can be computed very easily in this way, but the method has a problem. Tags are words that people use in everyday life, and they can have several meanings. A tag can therefore take on a completely different meaning depending on which other tags it is used with. SSR, however, does not take this possible change of meaning into account, so it computes the similarity between web pages without capturing the actual meaning of the tags.

For example, suppose we measure the similarity of two pages tagged [java, programming] and [java, travel]. In the former, 'java' refers to the Java programming language; in the latter, it refers to travel to the island of Java. With SSR, however, the final similarity is S_A(java, java) + S_A(java, travel) + S_A(programming, java) + S_A(programming, travel).

The pairs [java, java] and [programming, java] are likely to be judged very similar and so contribute high similarity values, whereas [programming, travel] are entirely unrelated tags whose similarity will be close to 0. When the sum above is computed, however, the result is not close to 0, and the two pages are likely to be judged similar.
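The 'java' example can be made concrete with a small sketch. The similarity values below are invented purely for illustration (the text gives no S_A entries for these tags), and `tag_similarity` and `ssr_page_similarity` are hypothetical helper names; only the rule of summing over all tag pairs comes from the description above.

```python
from itertools import product

# Hypothetical tag-to-tag similarities (illustrative values only, not from the source).
S_A = {
    ("java", "java"): 1.0,
    ("java", "travel"): 0.1,
    ("programming", "java"): 0.8,
    ("programming", "travel"): 0.0,
}


def tag_similarity(a, b):
    """Look up S_A symmetrically; unknown pairs default to 0."""
    return S_A.get((a, b), S_A.get((b, a), 0.0))


def ssr_page_similarity(tags_p, tags_q):
    """Sum S_A over every (tag of P, tag of Q) pair."""
    return sum(tag_similarity(t, u) for t, u in product(tags_p, tags_q))


score = ssr_page_similarity(["java", "programming"], ["java", "travel"])
print(score)
```

With these assumed values the sum is dominated by the two terms involving 'java', so the pages score as similar even though one is about programming and the other about travel.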

Because SSR cannot determine whether the 'java' above is related to 'programming' or to 'travel', the resulting page similarity is likely to be inaccurate. Existing tag-based methods for measuring web page similarity do not consider the relationships between tags at all, so for a polysemous tag it is difficult to determine which sense was intended, and the comparison result may therefore be incorrect.

As described above, existing tag-based methods for measuring web page similarity may measure the similarity incorrectly when a polysemous word, that is, a word with several meanings, is used as a tag.

A polysemous tag can take on different meanings depending on the tags it is used with. Ignoring this causes dissimilar web pages to be judged similar. An object of the present invention is to improve the efficiency of similarity judgment for such polysemous words by using tags.

To achieve the above object, the method for measuring similarity between web pages using tags according to the present invention treats web pages and their tags as co-occurrence data and may include: an SMM construction step of obtaining an SMM from the given web pages, their tags, and abstract classes preset in the system; and a similarity measurement step of using the result values of the SMM construction step to calculate the probability that the tags of the web pages being compared are included in the same abstract class, and obtaining the similarity between the web pages on that basis.

The SMM construction step may include: a collection step of collecting tagged web pages and their tags; an application step of applying the SMM to the information collected in the collection step; and a probability step of obtaining, after the application step, the probability that each abstract class appears, the conditional probability that a tag belongs to each abstract class, and the conditional probability that a web page belongs to each abstract class.

The similarity measurement step may include: a selection step of selecting the two web pages to be compared; an extraction step of extracting the tags of the web pages selected in the selection step; a probability calculation step of using the results of the SMM construction step to calculate the probability that the extracted tags are included in the same abstract class; and a similarity derivation step of obtaining the similarity by taking the root of that probability, with the number of tags as the degree of the root.

Further, to achieve the above object, the system for measuring similarity between web pages using tags according to the present invention may include: a web page information collection unit that collects tag information for the web pages whose similarity is to be obtained; an SMM construction unit that constructs an SMM from the information collected by the web page information collection unit; a web page selection unit that selects the web pages whose similarity is to be obtained; a tag extraction unit that extracts the tags of the web pages selected through the web page selection unit; and a probability calculation unit that combines the outputs of the SMM construction unit and the tag extraction unit to obtain the similarity between the two selected pages.

The probability calculation unit may obtain the probability that each web page and each tag belongs to a topic preset in the system, and then obtain the conditional probability that the tags of the web pages being compared belong to the same topic.

With the method and system according to the present invention, the probability that each web page and each tag belongs to a given topic is obtained, and the similarity between web pages is then measured through the conditional probability that the tags of the web pages being compared belong to the same topic. The result can be used, for example, to improve personalized search and to collect similar pages.

That is, by calculating the probability that a tag belongs to each abstract class when it is used together with other tags, the sense in which the tag is used can be estimated, and the similarity between web pages can therefore be measured more accurately.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that identical components or parts are given the same reference numerals throughout the drawings wherever possible. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted so as not to obscure the gist of the invention.

The present invention uses an existing technique, the Separable Mixture Model (SMM), to measure the similarity between web pages from their tag information. The SMM is a statistical model for co-occurrence data. Here, co-occurrence data means two kinds of data that occur together; a document and its keywords are one example.

Applying the SMM to a given set of co-occurrence data yields the probability that each data item belongs to each of K given abstract classes. Data items classified into the same class can be regarded as being about the same topic. If two data items are classified into different classes, they are likely to deal with different topics. A data item that concerns several topics may belong to several abstract classes.

Once an SMM has been constructed from the given data, it yields:

(1) the probability that each abstract class appears, and

(2) the conditional probability that each item of the two kinds of data in the co-occurrence data belongs to each abstract class.

The probability that each abstract class appears can be represented as a one-dimensional array of K elements, and the conditional probability that each data item belongs to each abstract class can be represented as a two-dimensional matrix.
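As a rough illustration of these shapes, the sketch below holds the class probabilities in a list of length K and the tag conditionals as one row per class. All numbers and names are invented for illustration, and the check that each row sums to 1 assumes the usual mixture-model convention that p(tag | class) is a distribution over tags; the text does not state this explicitly.

```python
import math

K = 2  # number of abstract classes (hypothetical example size)

# p_class[a] = p(C_a): one entry per abstract class.
p_class = [0.6, 0.4]

# p_tag_given_class[a][tag] = p(tag | C_a): one row per class, one column per tag.
p_tag_given_class = [
    {"java": 0.5, "programming": 0.5, "travel": 0.0},
    {"java": 0.3, "programming": 0.0, "travel": 0.7},
]


def is_distribution(values, tol=1e-9):
    """True if the values are non-negative and sum to 1 within tolerance."""
    values = list(values)
    return all(v >= 0 for v in values) and math.isclose(sum(values), 1.0, abs_tol=tol)


assert len(p_class) == K and is_distribution(p_class)
assert all(is_distribution(row.values()) for row in p_tag_given_class)
```

A matrix for web pages, p(page | class), would have the same layout with pages in place of tags.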

Suppose K abstract classes are given for classifying web pages and tags. Constructing an SMM from the given web pages and their tags yields the following values.

Probability that abstract class C_α appears: p(C_α)

Conditional probability that web page P_i belongs to the given abstract class C_α: p(P_i | C_α)

Conditional probability that tag t_j belongs to the given abstract class C_α: p(t_j | C_α)

The method according to the present invention uses these three kinds of probability values to measure the similarity between two web pages as follows.

From the probability values above, the probability that two or more tags belong to the same abstract class can be obtained. First, the probability that two tags t₁ and u₁ belong to the same class is given by Equation 2.

Extending Equation 2 to three tags t₁, t₂, and u₁ gives Equation 3.

Extending Equation 3 further, the probability that the tags t₁, t₂, …, t_n, u₁, u₂, …, u_m belong to the same class can be calculated as in Equation 4.
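The images for Equations 2 to 4 did not survive in this text. A plausible reconstruction is sketched below, assuming the usual mixture form in which tags are conditionally independent given the abstract class. The name p_same is introduced here only for illustration and is not taken from the source.

```latex
\begin{align}
p_{\text{same}}(t_1, u_1)
  &= \sum_{\alpha=1}^{K} p(C_\alpha)\, p(t_1 \mid C_\alpha)\, p(u_1 \mid C_\alpha) \tag{2} \\
p_{\text{same}}(t_1, t_2, u_1)
  &= \sum_{\alpha=1}^{K} p(C_\alpha)\, p(t_1 \mid C_\alpha)\, p(t_2 \mid C_\alpha)\, p(u_1 \mid C_\alpha) \tag{3} \\
p_{\text{same}}(t_1, \dots, t_n, u_1, \dots, u_m)
  &= \sum_{\alpha=1}^{K} p(C_\alpha) \prod_{i=1}^{n} p(t_i \mid C_\alpha) \prod_{j=1}^{m} p(u_j \mid C_\alpha) \tag{4}
\end{align}
```

Under this reading, a set of tags that never share a class with non-zero probability yields a same-class probability of 0.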

FIG. 1 is a flowchart of constructing an SMM from data that web users have previously tagged on web pages. First, the data that web users have tagged on web pages is retrieved (110). This data is then used to construct the SMM described above (120). The results of the constructed SMM, namely the probability that each abstract class appears, the conditional probability that each tag belongs to each abstract class, and the conditional probability that each web page belongs to each abstract class, are stored (130). These results are used later to calculate the similarity between web pages.
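The text does not say how the SMM parameters are estimated in step 120. Mixture models of this kind are commonly fitted with expectation-maximization (EM), so the following is a minimal sketch under that assumption rather than the patent's procedure. The function name `fit_smm`, its parameters, and the toy counts are all hypothetical.

```python
import random


def fit_smm(counts, pages, tags, K, iters=50, seed=0):
    """Fit p(C), p(page|C), p(tag|C) by EM on co-occurrence counts.

    counts maps (page, tag) -> number of times the tag was applied to the page.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    rng = random.Random(seed)

    def random_dist(items):
        weights = {x: rng.random() + 0.1 for x in items}
        total = sum(weights.values())
        return {x: w / total for x, w in weights.items()}

    p_class = [1.0 / K] * K
    p_page = [random_dist(pages) for _ in range(K)]
    p_tag = [random_dist(tags) for _ in range(K)]

    for _ in range(iters):
        new_class = [0.0] * K
        new_page = [dict.fromkeys(pages, 0.0) for _ in range(K)]
        new_tag = [dict.fromkeys(tags, 0.0) for _ in range(K)]
        for (page, tag), n in counts.items():
            # E-step: posterior over classes for this (page, tag) pair.
            joint = [p_class[a] * p_page[a][page] * p_tag[a][tag] for a in range(K)]
            z = sum(joint)
            if z == 0.0:
                continue
            for a in range(K):
                r = n * joint[a] / z
                new_class[a] += r
                new_page[a][page] += r
                new_tag[a][tag] += r
        # M-step: renormalise the expected counts.
        total = sum(new_class)
        for a in range(K):
            if new_class[a] > 0.0:
                p_page[a] = {x: v / new_class[a] for x, v in new_page[a].items()}
                p_tag[a] = {x: v / new_class[a] for x, v in new_tag[a].items()}
            p_class[a] = new_class[a] / total
    return p_class, p_page, p_tag


# Step 110: tagged data (toy values); step 120: fit; step 130: keep the results.
counts = {
    ("A", "java"): 3, ("A", "programming"): 2,
    ("B", "java"): 1, ("B", "travel"): 4,
}
p_class, p_page, p_tag = fit_smm(counts, ["A", "B"], ["java", "programming", "travel"], K=2)
```

The three returned structures correspond to the three kinds of values stored in step 130. EM depends on its random starting point, so a different `seed` may give a different fit.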

FIG. 2 is a flowchart of measuring the similarity between two web pages from the SMM results.

First, the two web pages to be used in the similarity measurement are selected; call them P₁ and Q₁ (210). The tags of P₁ and Q₁ are then extracted; let the tags of P₁ be (t₁, t₂, …, t_n) and the tags of Q₁ be (u₁, u₂, …, u_m) (220). Using the SMM results calculated earlier, the probability that (t₁, t₂, …, t_n, u₁, u₂, …, u_m) belong to the same class is obtained (230).

Finally, the similarity of the two web pages is calculated by taking the (n+m)-th root of the above probability. Letting S_σ denote the similarity between the two web pages, S_σ is defined by Equation 5.

The value given by Equation 5 is the similarity between the two web pages P₁ and Q₁ (240).
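Steps 220 to 240 can be sketched as follows. This assumes the same-class probability is a sum over classes of p(C) times the product of p(tag | C), which is an interpretation because the equation images are missing. It also counts each distinct tag once, following the worked example's remark that identical tags are excluded when counting n+m. The function names and the probability values are hypothetical.

```python
import math


def same_class_probability(tags, p_class, p_tag_given_class):
    """Sum over classes of p(C) times the product of p(tag | C) for every tag."""
    total = 0.0
    for p_c, p_tag in zip(p_class, p_tag_given_class):
        total += p_c * math.prod(p_tag.get(t, 0.0) for t in tags)
    return total


def page_similarity(tags_p, tags_q, p_class, p_tag_given_class):
    """Root of the same-class probability, with the number of distinct tags as the degree."""
    tags = list(dict.fromkeys(list(tags_p) + list(tags_q)))  # de-duplicate, keep order
    if not tags:
        raise ValueError("at least one tag is required")
    prob = same_class_probability(tags, p_class, p_tag_given_class)
    return prob ** (1.0 / len(tags))


# Hypothetical SMM results: class 0 is about programming, class 1 about travel.
p_class = [0.5, 0.5]
p_tag_given_class = [
    {"java": 0.5, "programming": 0.5},
    {"java": 0.5, "travel": 0.5},
]

unrelated = page_similarity(["java", "programming"], ["java", "travel"],
                            p_class, p_tag_given_class)
related = page_similarity(["java", "programming"], ["programming"],
                          p_class, p_tag_given_class)
print(unrelated, related)
```

With these assumed values no single class gives non-zero probability to 'programming' and 'travel' together, so the first pair scores 0 despite sharing 'java', while the second pair scores above 0.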

FIG. 3 is a block diagram showing the configuration of the web page similarity measurement system according to the present invention. That is, FIG. 3 shows an embodiment in which, given web pages and their tags, an SMM is constructed from this data according to the present invention and the result is used to obtain the similarity between web pages.

Reference numeral 310 in FIG. 3 shows the example web pages A, B, …, J and the tags for each page. Given two abstract classes C₁ and C₂, constructing an SMM from web pages A, B, …, J and their tags yields the probabilities p(C₁) and p(C₂) that each abstract class appears (320), and the conditional probabilities p(t_j | C₁) and p(t_j | C₂) that each tag belongs to each abstract class (330).

The values obtained above can be used to calculate the similarity between two web pages. For example, the similarity between web pages E and F can be obtained as follows. First, the probability that the tags of E and F belong to the same class is calculated as in Equation 6.

Taking the (n+m)-th root of this value (n+m = 5 when identical tags are excluded) gives the similarity between E and F, as in Equation 7.

Measuring the similarity between web pages A and B in the same way gives a value of 0.

The existing SSR method is now compared with the method of the present invention.

FIG. 4a shows the tag frequencies for each page in an embodiment of measuring the similarity between web pages according to the present invention. Computing the SSR similarity from the data of FIG. 4a gives the values in Table 1.

Table 1. Similarity between pages measured by SSR

| Page | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|
| A | 0.0521 | 0.179 | 0.199 | 0.0147 | 0.0840 | 0.186 | 0.159 | 0.133 | 0.211 |
| B | | 0.0631 | 0.0892 | 0.143 | 0.127 | 0.0549 | 0.0177 | 0.117 | 0.102 |
| C | | | 0.157 | 0.0150 | 0.0942 | 0.156 | 0.242 | 0.129 | 0.135 |
| D | | | | 0.0174 | 0.0921 | 0.138 | 0.164 | 0.122 | 0.219 |
| E | | | | | 0.130 | 0.002 | 0.002 | 0.101 | 0 |
| F | | | | | | 0.0723 | 0.0123 | 0.1573 | 0.141 |
| G | | | | | | | 0.117 | 0.1051 | 0.210 |
| H | | | | | | | | 0.0099 | 0.0250 |
| I | | | | | | | | | 0.211 |

The results of measuring the similarity between each pair of pages for the data of FIG. 4a with the method of the present invention are shown in Table 2.

Table 2. Similarity between pages measured by the method of the present invention

| Page | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 0.216 | 0.231 | 0 | 1.1×10^-26 | 0.216 | 0.216 | 1.4×10^-61 | 0.231 |
| B | | 0 | 0 | 0.176 | 0.150 | 0 | 0 | 0.176 | 0 |
| C | | | 0.216 | 0 | 1.1×10^-26 | 0.216 | 0.210 | 1.3×10^-61 | 0.216 |
| D | | | | 0 | 1.1×10^-26 | 0.216 | 0.216 | 1.4×10^-61 | 0.231 |
| E | | | | | 0.154 | 0 | 0 | 0.176 | 0 |
| F | | | | | | 1.0×10^-26 | 1.1×10^-26 | 0.148 | 5.1×10^-33 |
| G | | | | | | | 0.216 | 1.2×10^-61 | 0.216 |
| H | | | | | | | | 1.3×10^-61 | 0.216 |
| I | | | | | | | | | 1.1×10^-81 |

Tables 1 and 2 show the similarity between web pages as calculated by SSR and by the method of the present invention, respectively. Each row and column represents a page, and each value in the table is the similarity between the corresponding pages. As predicted before the experiment, the similarities between similar pages (C-J and D-J in Tables 1 and 2) were comparable: 0.135 and 0.219 for SSR, and 0.216 and 0.231 for the method of the present invention. For dissimilar pages, however, SSR sometimes judges the pages to be similar; the pairs A-B, A-E, A-F, A-I, B-J, C-F, C-I, D-F, D-I, and G-I in Tables 1 and 2 are examples.

Take pages B and J as an example: SSR measured their similarity as 0.102, whereas the present invention measured it as 0. B and J deal with different subjects, but both use 'java' as a tag. SSR judged the two pages to be similar because of the shared tag 'java', which is the problem with existing methods such as SSR described earlier.

The method of the present invention, by contrast, measures similarity by determining whether all of the tags belong to the same class, and it therefore obtained more accurate results than SSR.

The method and system for measuring similarity between web pages using tags according to the present invention have been described above with reference to the illustrative drawings. The present invention is not limited to the embodiments and drawings disclosed in this specification, and various modifications may of course be made by those skilled in the art within the scope of the technical idea of the invention.

FIG. 1 is a flowchart of constructing an SMM from data that web users have previously tagged on web pages.

FIG. 3 is a block diagram showing the configuration of the web page similarity measurement system according to the present invention.

FIG. 4a shows the tag frequencies for each page in an embodiment of measuring the similarity between web pages according to the present invention.

FIG. 4b shows the process of obtaining the similarity between pages E and F in an embodiment of measuring the similarity between web pages according to the present invention.

Claims

1. A method of measuring similarity between web pages using tags, comprising:

an SMM construction step of treating web pages and their tags as co-occurrence data and obtaining an SMM from given web pages, their tags, and abstract classes preset in the system; and

a similarity measurement step of using the result values of the SMM construction step to calculate the probability that the tags of the web pages to be compared are included in the same abstract class, and obtaining the similarity between the web pages on that basis.

2. The method according to claim 1, wherein the SMM construction step comprises:

a collection step of collecting tagged web pages and tags;

an application step of applying the SMM using the information collected in the collection step; and

a probability step of obtaining, after the application step, the probability that each abstract class appears, the conditional probability that a tag belongs to the abstract class, and the conditional probability that a web page belongs to the abstract class.

3. The method according to claim 1, wherein the similarity measurement step comprises:

a selection step of selecting two web pages to be compared;

an extraction step of extracting the tags of the web pages selected in the selection step;

a probability calculation step of calculating, for the tags extracted in the extraction step, the probability that they are included in the same abstract class by using the result of the SMM construction step; and

a similarity derivation step of obtaining the similarity, after the probability calculation step, by taking the root of the calculated probability with the number of tags as the degree of the root.

4. A recording medium storing a computer program for the method of measuring similarity between web pages using tags according to claim 1.

5. A system for measuring similarity between web pages using tags, comprising:

a web page information collection unit that collects tag information on the target web pages for which a similarity is to be obtained;

an SMM construction unit that constructs an SMM using the information collected by the web page information collection unit;

a web page selection unit that selects the web pages for which a similarity is to be obtained;

a tag extraction unit that extracts the tags of the web pages selected through the web page selection unit; and

a probability calculation unit that combines the SMM construction unit and the tag extraction unit to obtain the similarity between the two selected pages.

6. The system according to claim 5, wherein the probability calculation unit obtains the probability that each web page and each tag belongs to a topic preset in the system, and then obtains the conditional probability that the tags of the web pages to be compared belong to the same topic.