KR100621737B1 - Method for auto-classifying Web Sites

Info

Publication number: KR100621737B1
Application number: KR1019990063028A
Authority: KR
Inventors: 이종혁; 권오욱
Original assignee: 학교법인 포항공과대학교; 주식회사 케이티
Priority date: 1999-12-27
Filing date: 1999-12-27
Publication date: 2006-09-06
Also published as: KR20010060623A

Abstract

The present invention relates to a method for automatically classifying websites. Classifying a website by subject term accurately and effectively requires extending existing document categorization methods so that they can use the link information between web pages. If, on the other hand, all of the many web pages in a website are used, pages whose content differs from the site's topic can cause misclassification, and automatic classification takes excessive time. The present invention therefore classifies web pages by subject category and then classifies the whole website using the subject categories of those classified pages. As a result, a website can be automatically classified into more accurate categories (subject terms) in the same amount of time, and the performance of automatic website classification is improved.

Description

Method for auto-classifying Web Sites

FIG. 1 shows a website automatic classification system to which the present invention is applied.

FIG. 2 illustrates the conversion of the complex links among the web pages of a website into an easily handled tree structure according to the present invention.

FIG. 3 is an overall explanatory diagram of the k-nearest neighbor method, which is the web page categorization method according to the present invention.

The present invention relates to a method for automatically classifying websites. More specifically, it relates to a method that categorizes web pages using the k-nearest neighbor model widely used in conventional document categorization, and then uses the subject categories assigned to the web pages of a website, which are the output of the web page categorization step, to judge which categories are suitable as the website's subject category, thereby enabling accurate automatic classification by subject term.

Web pages generally carry not only the textual information found in ordinary documents but also text structural information and link information between documents. Apart from secondary information such as document structure and inter-document links, the textual information, that is, the sequence of characters carrying the actual content, is common to web pages and ordinary documents. Because of this common ground, document categorization methods can be applied to the classification (categorization) of web pages and websites.

Document categorization classifies documents by subject by assigning each document one or more subject terms from a subject-term (category) system defined in advance by experts. Well-known document categorization methods classify documents through learning and include the k-nearest neighbor method, the Bayesian probability method, and rule-based methods. Of these, the widely known k-nearest neighbor and Bayesian probability methods show the best performance. Both rest on the premise that the keywords in a document represent its content. The k-nearest neighbor method compares, keyword by keyword, the document to be assigned subject terms against documents that already have subject terms, and extracts the k existing documents closest to the new document. The subject terms of these k documents are then taken as the most appropriate subject terms for the new document. The Bayesian probability method uses documents with previously assigned subject terms as a learning set, extracts the words that represent the characteristics, or features, of each subject term, and selects as appropriate for a new document the subject terms whose features (keywords) occur most often in it.

In addition, web pages and websites on the Internet have grown rapidly in both quality and quantity, and the need to classify the information they carry has grown with them. For example, many web search systems provide directory information, that is, a classification of websites by subject term. Websites and web pages resemble conventional documents in that they contain information, but unlike the linear arrangement of ordinary documents, web pages are interconnected and form a complex structure.

Furthermore, information created for a single purpose usually consists of several to several hundred or more web pages. The web page that represents them is called the homepage. In most cases the homepage contains few of the keywords needed for classification and simply holds links to other web pages. Consequently, using only the information on the homepage is inadequate for classifying a website, which is a group of web pages organized for a single purpose.

In other words, the methodologies of the document categorization systems known so far are suited to classifying ordinary documents according to a subject-term system. Unlike ordinary documents, however, web pages contain links to one another, and these links form the larger unit of a website. To classify such websites by subject term accurately and effectively, a way of using this link information must be added to existing document categorization methods. If, on the other hand, all of the many web pages in a website are used, pages whose content differs from the site's topic can cause misclassification, and automatic classification takes excessive time.

The present invention was devised to solve these problems of the prior art, and its object is to provide a method for automatically classifying websites quickly and accurately.

To achieve this object, the present invention comprises: a first step of searching for the web pages in the server that are reachable through the link information of the homepage of the website to which a subject category is to be assigned; a second step of representing the retrieved web pages by their link information and converting them into a tree structure; a third step of pruning web pages using the converted tree structure; a fourth step of categorizing each web page in the website constituted by the pruning; and a fifth step of selecting the subject category of the given website using the subject categories set for each page.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows the overall configuration of a website categorization system that implements automatic classification. The system takes as input the URL (Uniform Resource Locator) of the homepage of the website to which subject categories are to be assigned. The homepage URL has a form such as "http://www.postech.ac.kr/main.html". The URL alone does not determine which web pages on that web server belong to the website. The website structuring module therefore finds the web pages within the same server that are reachable through the link information of the given homepage.
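As an illustration only, the search for reachable pages on the same server might be sketched as follows. The function name `reachable_same_server` and the caller-supplied `get_links` callback are hypothetical and do not appear in the disclosure; the sketch also assumes that "same server" means the same host part of the URL.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def reachable_same_server(home_url, get_links):
    """Collect the pages reachable from home_url that live on the same host.

    get_links(url) is a hypothetical caller-supplied function returning the
    href values found on that page; this sketch performs no network access.
    """
    host = urlparse(home_url).netloc
    seen = {home_url}
    queue = deque([home_url])
    while queue:
        url = queue.popleft()
        for href in get_links(url):
            target = urljoin(url, href)           # resolve relative links
            if urlparse(target).netloc != host:   # skip pages on other servers
                continue
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen
```

Passing the link extractor as a callback keeps the sketch independent of any particular HTML parser or HTTP client.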

If each web page found is regarded as a node and its links are drawn, the pages form a directed cyclic graph, which is a complicated structure to process. This graph is converted into a tree structure, which is easier to handle, with the homepage as the root of the tree. In the present invention the directed cyclic graph is easily converted into a tree by breadth-first traversal.
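A minimal sketch of the breadth-first conversion is given below. It assumes the link graph is already available as a dictionary mapping each page to the pages it links to; the names `bfs_tree`, `links`, `parent`, and `depth` are illustrative only.

```python
from collections import deque


def bfs_tree(links, root):
    """Turn a directed, possibly cyclic link graph into a tree rooted at root.

    Each page keeps only the first parent that reaches it in breadth-first
    order, so every page appears once and cycles disappear.
    Returns (parent, depth) dictionaries keyed by page.
    """
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:      # pages already placed are skipped
                parent[target] = page
                depth[target] = depth[page] + 1
                queue.append(target)
    return parent, depth
```

Because breadth-first order visits pages level by level, the depth recorded for each page is its shortest link distance from the homepage.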

Because the link information of each web page is highly important, both the number of times the page is referenced from other web pages and the number of links from the page to other web pages are recorded. FIG. 2 shows an example of this structural conversion. In the web page pruning step, the tree structure is used to cut away web pages that are not considered representative of the website, which reduces the running time of the categorization that follows. Pruning first requires a criterion for deciding which of the web pages, already organized as a tree, represent the whole website.
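The recording of link counts could look like the sketch below. Counting each distinct link once and ignoring self-links are assumptions of this example, not requirements stated in the text; `link_counts` is a hypothetical name.

```python
from collections import Counter


def link_counts(links):
    """Count outgoing and incoming links per page.

    links maps each page to the pages it links to. Duplicate links and
    self-links are ignored (an assumption of this sketch).
    """
    outgoing = Counter()
    incoming = Counter()
    for page, targets in links.items():
        for target in set(targets):       # count each distinct link once
            if target == page:
                continue
            outgoing[page] += 1
            incoming[target] += 1
    return outgoing, incoming
```

These counts are taken from the original graph rather than from the tree, so links dropped during tree conversion still contribute to a page's recorded link information.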

The present invention selects web pages as representatives of the website using the following criteria and removes the remaining web pages.

(1) Depth of the node in the tree: Because the root of the tree is the homepage, the shallower a web page is, the more likely it is to contain content similar to the homepage.

(2) The number of outgoing links: A web page that references many other pages in the same site can be assumed to lie at the center of the overall content. Therefore, the more links a web page has to other web pages, the more important its role within the whole site.

(3) The number of incoming links: A web page is usually referenced often from other web pages because its content is important to the whole site. Therefore, the more often a web page is referenced from other web pages, the more important it is.

The representativeness of each web page within the website, as described above, can be expressed by a formula called web page importance. (Equation 1) gives the importance of web page D_i in website S.

(Equation 1)

That is, the several web pages with the highest importance according to (Equation 1) are selected and treated as the representative web pages of the website.
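The exact form of (Equation 1) is not reproduced in this text, so the following sketch uses an assumed stand-in score that merely respects the three criteria: it grows with the outgoing and incoming link counts and shrinks with depth. The formula, the normalization to a sum of 1, and the names `page_importance` and `prune` are all assumptions made for illustration.

```python
def page_importance(depth, outgoing, incoming):
    """Assumed stand-in for the importance of each page (not Equation 1).

    Higher link counts raise the score, greater depth lowers it, and the
    scores are normalized so that they sum to 1 over the website.
    """
    raw = {
        page: (1 + outgoing.get(page, 0) + incoming.get(page, 0)) / (1 + d)
        for page, d in depth.items()
    }
    total = sum(raw.values())
    return {page: value / total for page, value in raw.items()}


def prune(importance, keep):
    """Return the `keep` most important pages; ties are broken by page name."""
    ranked = sorted(importance, key=lambda page: (-importance[page], page))
    return ranked[:keep]
```

Normalizing the scores lets them be reused later as per-page weights when the website as a whole is categorized.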

Once the website to be processed has been constituted by website structuring and web page pruning, subject categories are selected by the two-stage categorization system proposed in the present invention. Each stage is briefly described below.

First-stage model (web page categorization model): Each web page in the given website is categorized. Categorization at the web page level is similar to conventional document categorization; the setting is the same except that the document is a special kind called a web page. The present invention therefore uses the k-nearest neighbor method, widely used in document categorization, for web page categorization. Any existing document categorization model could be used at this stage; only the k-nearest neighbor method is described for the system of the present invention.

Second-stage model (website categorization model): The subject category of the given website is selected using the subject categories set for each web page by the first-stage categorization. Because web pages differ in importance within the website, the website categorization model lets the subject categories of more important web pages have greater influence on the selection of the website's subject category. In particular, because the homepage of a website plays a very important role, the homepage's subject category is made more likely to become the subject category of the whole website.

In the first stage, web page categorization, a web page with its link information set aside is nearly identical to an ordinary document. Any system developed for conventional document categorization can therefore serve as the categorization model in the first stage. The present invention uses the k-nearest neighbor method because it is easy to understand and widely used.

Document categorization based on the k-nearest neighbor method generally consists of two steps. The first step extracts the k learning documents with the highest document-document relevance to the document to be categorized. The second step uses the categories of the k extracted documents to compute how likely each category is for the new document, and assigns categories accordingly. The k-nearest-neighbor-based web page categorization model of the present invention likewise consists of two steps.

FIG. 3 shows the overall configuration of the web page categorization system based on k-nearest neighbor. Before the k learning documents are extracted by the document-document relevance calculation, which is the first step of k-nearest neighbor, the learning documents are converted into an internal document representation using the <category, word> pairs in the category feature database. The new document to be categorized is likewise converted into an internal document representation using the <word> entries in the feature candidate database.

The present invention uses the vector space model both for the internal representation of documents and for document categorization as a whole. The internal representations of the learning documents are generally stored in an inverted index file, like the index of an information retrieval system, so that they can be reused. The document-document relevance calculation, the first step of web page categorization, uses the cosine relevance (Equation 2) commonly used in the vector space model.

After the k learning documents most similar to the new document have been extracted in this way, the categories of these k learning documents are used to select the categories best suited to the new document. Selecting suitable categories for the new document is the second step of the proposed web page categorization.

(Equation 2)
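For reference, the standard cosine measure of the vector space model can be sketched as below. Representing a document as a dictionary of term weights is an assumption of this example, and the notation of the original (Equation 2) may differ.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine of the angle between two term-weight dictionaries.

    Returns 0.0 when either document has no terms, to avoid dividing by zero.
    """
    dot = sum(weight * doc_b.get(term, 0.0) for term, weight in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Documents that share no terms score 0, and documents with proportional term weights score 1, regardless of document length.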

The probability P(C_k|D_x) of assigning category C_k to a new web page D_x can be expressed as in (Equation 3). The result of (Equation 3) gives, for every subject term (category), the likelihood that it classifies the new web page. The subject terms with high values are selected as the categories of the new web page. In (Equation 3), D_j denotes the k learning documents extracted by document-document relevance in the first step.

(Equation 3)

In (Equation 3), P(C_k|D_j) = 1 if web page D_j in the learning set has category C_k as a subject category, and P(C_k|D_j) = 0 otherwise. Through the value P(C_k|D_j), a subject category assigned to many of the k extracted learning web pages receives a higher value of (Equation 3). A subject category belonging to web pages with high document-document relevance to the new web page also receives a higher value, making it more likely to be assigned to the new web page.
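The behavior described for (Equation 3) matches a similarity-weighted vote over the k nearest learning documents. The sketch below implements that common form as an assumption; whether the original equation also normalizes the sum is not recoverable from this text. The names `knn_category_scores` and `training` are hypothetical.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine relevance between two term-weight dictionaries."""
    dot = sum(w * doc_b.get(t, 0.0) for t, w in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category_scores(new_doc, training, k):
    """Score categories for new_doc from its k most similar learning documents.

    training is a list of (term_weights, categories) pairs. A category gains
    the similarity of every neighbor that carries it, which corresponds to
    P(C_k|D_j) = 1 for assigned categories and 0 otherwise.
    """
    scored = [(cosine(new_doc, vec), cats) for vec, cats in training]
    scored.sort(key=lambda item: item[0], reverse=True)
    scores = {}
    for similarity, cats in scored[:k]:
        for category in cats:
            scores[category] = scores.get(category, 0.0) + similarity
    return scores
```

A category therefore scores higher both when many neighbors carry it and when the neighbors carrying it are close to the new page.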

The second stage, website categorization, uses the website structuring and web page pruning steps described above to find the web pages to include in the website. The resulting website can be expressed as a set of web pages: a website S consisting of n web pages is expressed as the set of its web pages D_i, as follows.

S = {D_1, D_2, ..., D_n}

With a website defined in this way, the website categorization model becomes a conditional probability giving the likelihood that an arbitrary subject category C_k is assigned given website S. This conditional probability can be defined as in (Equation 4).

(Equation 4)

In (Equation 4), website S can be replaced by the web pages belonging to it without any assumption. (Equation 4) thus turns the website categorization problem into a problem over web pages. To solve (Equation 4), one could merge all the content of the web pages and treat it as one large web page: as in <Figure 2-13>, a large HTML document formed by simply concatenating the web pages could be regarded as the website. In the present invention, the web pages constituting a website are assumed to be mutually independent.

Because the web pages are independent, the conditional probability in the first term on the right-hand side of (Equation 4) can be expanded as in (Equation 5).

(Equation 5)

In (Equation 5), P(D_i) is the same in every case, so omitting it does not affect category assignment. The reason for dividing by P(C_k) n-1 times is that P(C_k) has already been multiplied in across the n calculations by (Equation 3), and it should be multiplied in only once. Applying (Equation 5) to (Equation 4), and taking into account that the web pages are independent in the last term on the right-hand side of (Equation 4), (Equation 4) can be rewritten as (Equation 6).

(Equation 6)

In (Equation 6), P(D_i|S) represents the importance of web page D_i within website S, and is the value already computed by (Equation 1).

Classifying the new website into the categories (subject terms) with the largest values of (Equation 6) accomplishes automatic classification of the whole website.
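Since (Equation 6) itself is not reproduced in this text, the sketch below is not a reconstruction of it. It shows one simple way to let important pages dominate the site-level decision, as the description requires: an importance-weighted sum of the per-page category scores. The prior correction by P(C_k) discussed above is deliberately omitted, and the function name and data layout are hypothetical.

```python
def site_category_scores(page_scores, importance):
    """Combine per-page category scores into site-level scores.

    page_scores maps page -> {category: P(C_k|D_i)}; importance maps
    page -> P(D_i|S). Returns (category, score) pairs, best first.
    This weighted sum is an assumed simplification, not Equation 6.
    """
    totals = {}
    for page, categories in page_scores.items():
        weight = importance.get(page, 0.0)
        for category, probability in categories.items():
            totals[category] = totals.get(category, 0.0) + weight * probability
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

With a heavily weighted homepage, the homepage's category can win even when most of the other pages individually favor a different category, which is the behavior the second-stage model is meant to provide.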

According to the present invention as described above, a website can be automatically classified into more accurate categories (subject terms) in the same amount of time, and the performance of automatic website classification is improved. Moreover, whereas existing methods address the classification of ordinary documents, the present invention describes a method for automatically classifying websites made up of many linked web pages.

The preferred embodiments of the present invention are disclosed for illustrative purposes. Those skilled in the art will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications and changes should be regarded as falling within the scope of the following claims.

Claims

A method for automatically classifying a website, comprising:

a first step of searching for web pages in a server that are reachable through the link information of the homepage of a website to which a subject category is to be assigned;

a second step of representing the retrieved web pages by their respective link information and converting them into a tree structure;

a third step of performing web page pruning, which selects predetermined web pages representing the website from among the web pages converted into the tree structure;

a fourth step of categorizing the predetermined web pages selected by the web page pruning; and

a fifth step of categorizing the website using the subject categories assigned to each web page by the categorization of the predetermined web pages.

The method of claim 1, wherein, in the second step, the link information records both the number of references from other web pages and the number of links from the web page to other web pages.

The method of claim 1, wherein the web page pruning selects web pages as representatives of the website, and removes the remaining web pages, by referring to the depth in the tree structure, the number of references from the web page to other web pages, and the number of references from other web pages.

The method of claim 3, wherein a representative of the website is a web page that has a small depth in the tree structure, a high number of references to other web pages, and a high number of references from other web pages.

The method of claim 1, wherein the categorization in the fourth step uses the k-nearest neighbor method.

The method of claim 1, wherein the fourth step comprises:

a first step of extracting the k learning documents most similar to a new document by calculating document-document relevance; and

a second step of selecting the most suitable categories for the new document using the categories of the k learning documents.

The method of claim 6, wherein, before the first step, the learning documents are converted into an internal document representation using the <category, word> pairs in a category feature database, and the new document to be assigned a category is likewise converted into an internal document representation using the <word> entries in a feature candidate database.

The method of claim 1, wherein, in the fifth step, the conditional probability that an arbitrary subject category C_k is assigned to the website is expressed by Equation 6:

(Equation 6)

where C_k is a web page category, D_i is a web page, and S is the website.