KR20110023308A

KR20110023308A - Method of searching web documents and system for performing the method

Info

Publication number: KR20110023308A
Application number: KR1020090081089A
Authority: KR
Inventors: 이병정; 장도; 김한준
Original assignee: 서울시립대학교 산학협력단
Priority date: 2009-08-31
Filing date: 2009-08-31
Publication date: 2011-03-08
Also published as: KR101132393B1

Abstract

PURPOSE: A web document search method and a system for performing the method are provided to deliver better web document search results by reflecting user preferences. CONSTITUTION: A transaction analyzer (100) collects and analyzes web pages corresponding to a search keyword supplied by a user. A resource manager (120) stores and manages information about the web document search process. A transaction processor analyzes user requests by combining folksonomy with a link-based ranking scheme, and classifies user behavior according to the analysis result.

Description

METHOD OF SEARCHING WEB DOCUMENTS AND SYSTEM FOR PERFORMING THE METHOD

The present invention relates to a web document search method and a system for performing the same, and more particularly to a collective-intelligence-based web document search method that combines folksonomy with a link-based ranking technique, and a system for performing the same.

In general, an information retrieval system is a system that collects and processes, in advance, the information or data that information consumers are expected to need, accumulates it in a database in an easily searchable form, and then quickly finds the information matching a request and provides it to the requester.

Conventionally, web portal sites have been built using such information retrieval systems. A web portal site uses the information retrieval system to collect and process information from web sites over the Internet and to provide the retrieved information to user terminals such as computers or wireless terminals.

Meanwhile, as new fields and terms multiply rapidly with advances in information and communication technology, applying a conventional information retrieval method requires the related retrieval system to be updated frequently to keep up with them. In particular, revising a document DB that stores information on a vast number of documents takes considerable time and cost, so it is difficult to do frequently.

On the other hand, as the information available on the Internet grows exponentially, Internet users must spend a great deal of time every day finding the web pages they need among an enormous number of Internet documents. Numerous web document search techniques are being developed to handle this volume of web pages, and web search engines are becoming more complex. In terms of search quality, however, about half of all retrieved web pages are reported to be irrelevant.

Although such search systems were developed to find appropriate web pages that reflect user preferences, several problems remain unsolved. The biggest is that, even though a search system can obtain numerous web pages on the Internet that reflect user preferences, collecting and analyzing them remains unsatisfactory because of the sheer volume of web pages.

The technical problem addressed by the present invention arises from this point. An object of the present invention is to provide a collective-intelligence-based web document search method that obtains better search results by reflecting both user behavior and user preferences in the search, so that the search adapts to the user among the large number of web pages on the Internet.

본 발명의 다른 목적은 상기한 웹 문서 검색 방법을 수행하기 위한 웹 문서 검색 시스템을 제공하는 것이다. Another object of the present invention is to provide a web document search system for performing the web document search method described above.

To achieve the above object of the present invention, a web document search method according to an embodiment includes: as a search keyword including tags of web pages is entered by a user, storing the search keyword and transmitting information about the user's tags to a transaction processing layer, by a storage server of a resource management layer; editing the tags, inferring relationships between the tags, and analyzing whether the user clicked on a web page, by a folksonomy manager of the transaction processing layer; classifying web pages based on the tags, by an indexer module of the transaction processing layer; sorting web pages belonging to the same category, by a ranking module of a transaction analysis layer, using a personalized link-based ranking algorithm that computes a PageRank score so as to reflect the recorded behavior of users; and displaying the ordered web document search results to the user, by a web search interface of the transaction analysis layer.

In an embodiment of the present invention, the folksonomy manager assigns a unique ID to every web page clicked by the user and then generates a tag tree, counts the number of users who used the same tag, and inserts the highest-frequency tag, computed by an iterative program, into the tag database as the final category of the web page.

In an embodiment of the present invention, the PageRank score is defined by Equation 2 below:

[Equation 2]

(Here, α is the damping factor, which is 0.85. V_i is the user customization value, defined by Equation 3 below. The user customization value is determined by counting the number of user clicks on all pages linked to page i and the number of user clicks on the web pages linked to all pages, and it reflects the degree of the user's preference for a web page.)

[Equation 3]

(Here, C_ji(u) is the number of times user u clicked from page j to page i, defined by Equation 4 below, and U is the set of all users.)

[Equation 4]

(In Equation 2, S_ji is the number of mutual outlinks from page j when page j is linked to page i, and is defined by Equation 5 below. n is the total number of pages.)

[Equation 5]
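The bodies of Equations 2 to 5 are not reproduced in this text, so the ranking step can only be illustrated under assumptions. The following Python sketch is one plausible reading of the definitions above, not the patent's actual equations: it takes α = 0.85 from the text, assumes S_ji is the usual PageRank link weight 1/outdegree(j) for a link from page j to page i, and assumes V_i is page i's share of all recorded clicks summed over the users in U. The function name `personalized_pagerank` and the data layout are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.85  # damping factor stated in the text


def personalized_pagerank(links, clicks, iterations=50):
    """Click-biased PageRank sketch (assumed form, see lead-in).

    links:  dict mapping page j -> list of pages i that j links to
    clicks: dict mapping (user, j, i) -> number of clicks from j to i
    Returns a dict page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Assumed V_i: share of all clicks (over all users) that land on page i.
    landed = defaultdict(float)
    for (_user, _j, i), count in clicks.items():
        landed[i] += count
    total = sum(landed.values())
    if total > 0:
        v = {p: landed[p] / total for p in pages}
    else:
        v = {p: 1.0 / n for p in pages}  # no click data: uniform preference

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - ALPHA) * v[p] for p in pages}
        for j in pages:
            targets = links.get(j, [])
            if targets:
                share = ALPHA * rank[j] / len(targets)  # assumed S_ji = 1/outdegree(j)
                for i in targets:
                    nxt[i] += share
            else:
                # Dangling page: spread its rank according to the preference vector.
                for p in pages:
                    nxt[p] += ALPHA * rank[j] * v[p]
        rank = nxt
    return rank
```

Under these assumptions, two pages with identical link structure receive different scores as soon as users click through to one of them more often, which is the preference-sensitive behavior the text attributes to V_i.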

To achieve the other object of the present invention, a web document search system according to an embodiment includes: a transaction analyzer that collects and analyzes web pages corresponding to a search keyword provided by a user and displays the search results to the user; a resource manager that stores and manages information about the web document search process; and a transaction processor that analyzes user requests by combining folksonomy with a link-based ranking technique, classifies user behavior and interests according to the analysis results, and provides them to the resource manager.

In an embodiment of the present invention, the transaction analyzer may include a web search interface that receives a search keyword from a user and provides web pages corresponding to the search keyword to the user, a web crawler that collects web pages, and a ranking module that sorts web pages belonging to the same category and provides them to the web search interface so that the web pages corresponding to the search keyword are provided to the user.

In an embodiment of the present invention, the resource manager may include a storage server that includes a folksonomy repository and a link information repository.

In an embodiment of the present invention, the transaction processor may include a folksonomy manager that edits tags corresponding to web pages, infers relationships between the tags, and analyzes user behavior, and an indexer module that classifies web pages into categories based on the tags.

In an embodiment of the present invention, the folksonomy manager may include a folksonomy editor that writes a tag entered by a user for a specific web page onto that web page, a user search analyzer that checks whether a user has clicked on a tagged web page, and a tag association extractor that, when a specific web page is clicked by a user, assigns an ID to that page and generates a tag tree between tags and pages.

In an embodiment of the present invention, the indexer module may include a language analyzer (syntactic analyzer) that analyzes the language written on a folksonomy-processed web page, a feature selector that selects features of the language analyzed by the language analyzer, a tag recognizer that recognizes the tags of the web page corresponding to the language selected by the feature selector, and a semantic classifier that classifies the meaning of the tags recognized by the tag recognizer and stores the result in the resource manager.

In an embodiment of the present invention, the PageRank score calculated by the ranking module may be defined by Equation 2 below.

[Equation 2]

[Equation 3]

[Equation 4]

(In Equation 2, S_ji is the number of mutual outlinks from page j when page j is linked to page i, and is defined by Equation 5 below. n is the total number of pages.)

[Equation 5]

According to this web document search method and the system for performing it, combining folksonomy with a link-based ranking technique on the basis of collective intelligence makes it possible to provide better web document search results that reflect both user behavior and user preferences, so that the search adapts to the user among the large number of web pages on the Internet.

Hereinafter, the present invention will be described in more detail with reference to the accompanying drawings. Because the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in the text. However, this is not intended to limit the present invention to the specific disclosed forms; it should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

In describing the drawings, like reference numerals are used for like elements. In the accompanying drawings, the dimensions of structures are shown enlarged relative to their actual size for clarity.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

The web document search system according to the present invention is based on collective intelligence. Collective intelligence is receiving increasing attention with the arrival of new communication technologies. It refers to the new insight produced when a group of people combine their actions, environments, and ideas.

A web document search system based on collective intelligence ranks web pages according to user preferences by combining folksonomy with links. Folksonomy is a new classification technique in which tags are created and managed collaboratively to describe and classify content. A tag here is a label attached to a web page to describe and classify its content. Users freely and subjectively attach keywords, called tags, that fit the meaning of each page. Anyone may choose any word as a tag, and a single page may carry several tags.

Web 2.0 refers to a platform in which no one owns the data and everyone can use it in any programming or Internet environment. Until now, web sites have only provided information and services one way, like TV or radio, and were therefore sometimes described as media. Data posted to or served by a web site could not be moved or reused. Once a Web 2.0 environment is in place, however, data can be moved freely. If the representative service of Web 1.0 is the web portal, Web 2.0 is a platform. Users cannot freely control the services of a web portal site, but on the Web 2.0 platform they can use the data as they wish. In Web 2.0, the practice of authoring web documents with tags is increasing.

FIG. 1 is a block diagram illustrating a web document search system according to the present invention.

Referring to FIG. 1, the structure of the web document search system according to the present invention is based on a bus line framework. A management layer and a resource layer exist on either side of the bus line framework (11). In this embodiment, the term "layer" denotes a logical division, not a hardware division. The bus line framework may be defined as a web page storage unit, the management layer as a management unit, and the resource layer as a resource storage unit. In the following, however, the term "layer" is used as is.

A large amount of web page data is stored in the data bus layer.

The management layer includes a folksonomy manager (12), a search engine (13), and a QA engine (14). The folksonomy manager (12) analyzes and classifies large amounts of data. That is, it extracts tags from the tagged web documents among the documents stored in the data bus (11), assigns the extracted tags as topics, and assigns fields using folksonomy. The search engine (13) and the QA engine (14) process user requests and return search results to the user.

The resource layer includes a folksonomy repository (15) and a link information repository (16), and stores the corresponding information of the components of the management layer.

FIG. 2 is a conceptual diagram illustrating an example in which users tag web pages.

Referring to FIG. 2, users were randomly divided into "Group A" and "Group B", each consisting of 10 people.

Three members of Group A used "portal" as the tag for page B, and the remaining seven used the tag "news" for the same page. In this embodiment, the frequency of a tag on a page is defined as the number of users who applied that tag to that page, and the classification of the page is determined from the frequencies of the tags assigned to it. That is, the tag with the highest frequency becomes the category of the page.

According to FIG. 2, because the frequency of the "news" tag is higher than that of the "portal" tag, the search system selects "news" as the category of page B. Likewise, the "news" and "music" tags become the categories of page A and page C, respectively.
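The frequency rule just described can be sketched in a few lines of Python. This is a minimal illustration, assuming each user contributes one tag per page; the function name `page_category` and the user IDs are hypothetical, and the text does not say how ties between equally frequent tags are broken.

```python
from collections import Counter


def page_category(tags_by_user):
    """Return the most frequent tag for one page.

    tags_by_user: dict mapping user id -> tag that user assigned to the page.
    The frequency of a tag is the number of users who used it on this page.
    """
    frequency = Counter(tags_by_user.values())
    tag, _count = frequency.most_common(1)[0]
    return tag


# Page B from the FIG. 2 example: 3 users chose "portal", 7 chose "news".
page_b = {f"user{k}": "portal" for k in range(3)}
page_b.update({f"user{k}": "news" for k in range(3, 10)})
print(page_category(page_b))  # news
```

An empty page (no tags at all) would raise an `IndexError` in this sketch; a real system would need a policy for untagged pages, which the text does not describe.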

Table 1 shows the result of the page classification for FIG. 2.

In this embodiment, it can be seen that page A and page B fall into the same category, "news".

The most important advantage of folksonomy is that users can search quickly and can easily classify related web pages. Folksonomy is known to provide a flat, non-hierarchical, shared vocabulary. Despite these useful properties, it has a critical problem when used for search: the order of the retrieved pages depends heavily on the popularity of the tags assigned to each page. As a result, folksonomy reflects users' behavior well but does not sufficiently reflect their preferences. High-quality search engines therefore need an additional technique that accommodates user preferences. Accordingly, the present invention uses one such technique, link-based ranking.

Meanwhile, users have so far been able to obtain pages matching their requests, but the pages were returned simply in the order in which they were crawled. When the set of pages is large, users must check each result against their query and pick out the pages they are actually interested in from a mass of unnecessary content. To solve this problem, the web search system needs to return ranked results. In general, the ranking techniques used in existing page recommendation systems fall into one of two categories: content-based ranking and link-based ranking.

Content-based ranking analyzes information related to the user's preferences and returns pages in order of their score for a given query. The score statistics are based on the content of all pages. Although most search engines still operate this way, results can be improved somewhat by also considering the information that other people provide about a page, namely who has linked to the page and what they said about it. This improved approach is called link-based ranking.

FIG. 3 is a block diagram illustrating a web document search system according to an embodiment of the present invention.

Referring to FIG. 3, the web document search system (100) according to an embodiment of the present invention consists of a transaction analysis layer, a resource management layer, and a transaction processing layer. In this embodiment, the term "layer" denotes a logical division, not a hardware division. The transaction analysis layer may be defined as a transaction analyzer, the resource management layer as a resource manager, and the transaction processing layer as a transaction processor. In the following, however, the term "layer" is used as is.

The transaction analysis layer includes a web search interface (112), a web crawler (114), and a ranking module (116). It collects and analyzes web pages corresponding to a search keyword provided by the user, and displays the search results to the user.

The web search interface (112) receives a search keyword from the user and provides web pages corresponding to the search keyword to the user.

The web crawler (114) collects web pages that exist on the web.

The ranking module (116) sorts web pages belonging to the same category and provides them to the web search interface, so that web pages corresponding to the search keyword are provided to the user.

The transaction analysis layer includes the search engine component and the QA engine component of the system structure described with reference to FIG. 1. That is, in the transaction analysis layer, information from the user is collected and analyzed, and the analysis result is transmitted to the resource management layer. The transaction analysis layer also displays the search results to the user.

The resource management layer includes a storage server (120), which stores and manages information about the search process.

The storage server (120) includes the folksonomy repository and the link information repository described with reference to FIG. 1.

The transaction processing layer includes a folksonomy manager (130) and an indexer module (140). It analyzes user requests by combining folksonomy with a link-based ranking technique, classifies user behavior and interests according to the analysis results, and provides them to the resource manager.

The folksonomy manager (130) includes a folksonomy editor (132), a tag association extractor (134), and a user search analyzer (136). It edits tags corresponding to web pages, infers relationships between the tags, and analyzes user behavior.

The folksonomy editor (132) writes a tag entered by a user for a specific web page onto that web page.

The tag association extractor (134), when a specific web page is clicked by a user, assigns an ID to that page and generates a tag tree between tags and pages.

The user search analyzer (136) checks whether a user has clicked on a tagged web page.
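The division of labor among the three sub-components can be sketched as follows. This is a hypothetical illustration only: the class name `FolksonomyManager`, its method names, and the use of URLs as page keys are assumptions, and because the text does not define the shape of the tag tree, the sketch assumes a simple two-level mapping from each tag to the IDs of the clicked pages that carry it.

```python
from collections import defaultdict
from itertools import count


class FolksonomyManager:
    """Hypothetical sketch of the three sub-components described above."""

    def __init__(self):
        self._next_id = count(1)
        self.page_ids = {}             # url -> ID, assigned on first click
        self.tags = defaultdict(dict)  # url -> {user: tag}
        self.clicked = set()           # (user, url) pairs that were clicked

    def write_tag(self, user, url, tag):
        """Folksonomy editor: attach a user's tag to a page."""
        self.tags[url][user] = tag

    def record_click(self, user, url):
        """Tag association extractor: give a clicked page an ID if it has none."""
        self.clicked.add((user, url))
        if url not in self.page_ids:
            self.page_ids[url] = next(self._next_id)
        return self.page_ids[url]

    def was_clicked(self, user, url):
        """User search analyzer: did this user click this tagged page?"""
        return url in self.tags and (user, url) in self.clicked

    def tag_tree(self):
        """Assumed tag tree: tag -> sorted IDs of clicked pages carrying it."""
        tree = defaultdict(set)
        for url, by_user in self.tags.items():
            if url in self.page_ids:
                for tag in by_user.values():
                    tree[tag].add(self.page_ids[url])
        return {tag: sorted(ids) for tag, ids in tree.items()}
```

Keeping the click log separate from the tag store mirrors the text's separation between the editor, which only writes tags, and the analyzer, which only checks clicks.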

The indexer module (140) includes a language analyzer (142), a feature selector (144), a tag recognizer (146), and a semantic classifier (148), and classifies web pages into categories based on the tags.

The language analyzer (142) analyzes the language written on a folksonomy-processed web page.

The feature selector (144) selects features of the language analyzed by the language analyzer.

The tag recognizer (146) recognizes the tags of the web page corresponding to the language selected by the feature selector.

The semantic classifier (148) classifies the meaning of the tags recognized by the tag recognizer and stores the result in the resource manager.
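The four stages form a simple pipeline in which each stage consumes the output of the previous one. The text does not say which algorithm each stage uses, so the Python sketch below substitutes deliberately simple stand-ins: word tokenization for language analysis, frequency-based selection for feature selection, intersection with the page's tags for tag recognition, and a lookup table for semantic classification. All function names, the stop-word list, and the `taxonomy` mapping are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative only


def analyze(text):
    """Language analyzer stand-in: lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_features(tokens, top_k=5):
    """Feature selector stand-in: the top_k most frequent non-stop-words."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_k)]


def recognize_tags(features, page_tags):
    """Tag recognizer stand-in: page tags that also appear among the features."""
    return [tag for tag in page_tags if tag in features]


def classify(tags, taxonomy):
    """Semantic classifier stand-in: map recognized tags to categories."""
    return sorted({taxonomy[tag] for tag in tags if tag in taxonomy})


def index_page(text, page_tags, taxonomy):
    """Run the four stages in order and return the page's categories."""
    tokens = analyze(text)
    features = select_features(tokens)
    tags = recognize_tags(features, page_tags)
    return classify(tags, taxonomy)
```

In the described system the classifier's result would be stored in the resource manager; the sketch simply returns it.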

The transaction processing layer includes the folksonomy management component. That is, user requests are analyzed in the transaction processing layer, and information about user behavior and interests is classified on the basis of the analysis results.

According to the present invention, introducing a personalized link-based ranking technique that considers the other web pages linked to a target web page improves search accuracy and raises the limit on how well the importance of a web page can be measured.

In addition, according to the present invention, the accuracy of the web document search system can be improved by introducing a personalized link-based ranking technique that does not depend on a specific data source such as bookmarks.

In addition, according to the present invention, because the web document search system is designed on the basis of collective intelligence that combines folksonomy with a personalized link-based ranking technique, users can quickly find related web pages with their behavior and preferences taken into account.

In addition, according to the present invention, related web pages that accommodate the user's preferences can be found among all the web pages on the Internet.

FIG. 4 is a conceptual diagram illustrating the overall process of web page ranking using a combination of folksonomy and a link-based ranking technique according to the present invention.

Referring to FIGS. 3 and 4, when a user enters a search keyword, such as a tag of web pages, the search keyword is stored in the storage server 120 of the resource management layer. Information about the user's tag is sent to the transaction processing layer.

The folksonomy manager 130 of the transaction processing layer is used to edit tags, infer relationships between tags, and analyze user behavior. The indexer module 140 of the transaction processing layer classifies web pages based on tags. Web pages classified under the same tags belong to the same category.

Next, the web pages belonging to the same category are sorted by the ranking module 116, which uses a link-based ranking algorithm.

Finally, the ordered web pages are displayed to users through the web search interface 112.

FIG. 5 is a flowchart illustrating a web document search method that uses the web page ranking technique combining folksonomy and link-based ranking according to the present invention.

Referring to FIGS. 3 and 5, the web crawler 114 collects web pages and stores the collected web pages (step S110).

Next, the indexer module 140 parses the pages (step S120) and performs tag-based semantic classification (step S130). That is, the indexer module 140 classifies web pages based on tags; for example, web pages classified under the same tags belong to the same category.

Next, it is checked whether the web crawler 114 has finished collecting web pages (step S140). If collection has finished, the process ends; otherwise, the process returns to step S110.

Meanwhile, it is checked whether the user has entered a search keyword, such as a tag of web pages (step S210). When a search keyword is entered in step S210, it is stored in the storage server 120 of the resource management layer. Information about the user's tag is sent to the transaction processing layer.

Next, the folksonomy manager 130 of the transaction processing layer edits the user's tags, infers the relationships between tags, and analyzes the tag relationships used to analyze user behavior (step S220).

Next, the ranking module 116 sorts the web pages belonging to the same category using the personalized link-based ranking algorithm (step S230).

Next, the web pages ordered by the personalized link-based ranking algorithm are displayed to users (step S240).

In FIG. 5, the sequence of steps S110 to S140 and the sequence of steps S210 to S240 are performed in parallel with each other.
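
As a rough illustration of the query-side steps S210 to S240, the sketch below looks up the pages whose category matches a keyword and sorts them by score. The function name `search`, the dictionary layout, and the assumption that all five pages fall in the "news" category are made up for this example; the scores are the five values the document reports for the "news" query, keyed by the page labels P1 to P5 used in the experiment.

```python
def search(keyword, category_of, score_of):
    """Return the pages whose category equals the keyword, highest score first."""
    hits = [page for page, category in category_of.items() if category == keyword]
    return sorted(hits, key=lambda page: score_of[page], reverse=True)

# Deliberately unordered input so that the sort is visible.
category_of = {"P3": "news", "P1": "news", "P5": "news", "P2": "news", "P4": "news"}
score_of = {"P1": 0.15218, "P2": 0.12945, "P3": 0.11017, "P4": 0.10100, "P5": 0.09598}

for page in search("news", category_of, score_of):
    print(page, score_of[page])
```

In the described system the category lookup would come from the indexer module and the scores from the ranking module; here both are plain dictionaries.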

As described above, this embodiment proposes a collective-intelligence web document search system that combines folksonomy with a personalized link-based page ranking scheme. That is, the web document search system according to the present invention is designed on the basis of collective intelligence combining folksonomy and a personalized link-based ranking technique. This approach lets users quickly find relevant web pages by taking each user's behavior and preferences into account.

The folksonomy algorithm and the link-based ranking algorithm are described below.

<Folksonomy Algorithm>

In the web document search system according to the present invention, the folksonomy algorithm is used to classify retrieved pages by analyzing tags and user behavior in the search system. An example of pseudo-code describing the folksonomy process according to the present invention is given below.

================================================================================

input: tag Ti, page Pj

make edge e(Ti, Pj) between Ti and Pj if Pj has Ti

for each page Pj
    fmax = 0    // highest tag frequency seen so far for Pj
    for each tag Tk of Pj
        delete edge e(Tk, Pj)
        if (f(Tk, Pj) > fmax)    // f(Tk, Pj): frequency of tag Tk assigned to Pj
            fmax = f(Tk, Pj)
            T = Tk
    end
    make edge e(T, Pj)
end

In this folksonomy process, when a user enters a tag in the user interface of the web document search system, the system records the tag and captures user behavior, for example clicks on web pages (or web documents).

Next, every web page the user clicked is given a unique ID, and a tag tree is built. In the tag tree, a tag (Ti) is a parent node and a page (Pj) is a child node.

Next, the web document search system counts the number of users who used the same tag.

Finally, the highest-frequency tag computed by the iterative procedure is inserted into the tag database as the final category of the web page. Web pages whose final category tags match are treated as belonging to the same category.
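
A minimal Python sketch of the process just described: it counts how often each tag was assigned to each page, keeps the most frequent tag as the page's category, and groups pages under their category tag. The function names and the (tag, page) event format are assumptions made for illustration; the tag counts for P1 are the ones the document reports for Table 2.

```python
from collections import Counter, defaultdict

def categorize(tag_events):
    """Map each page to its most frequently assigned tag.

    tag_events is an iterable of (tag, page) pairs, one pair per tagging.
    """
    counts = defaultdict(Counter)
    for tag, page in tag_events:
        counts[page][tag] += 1
    return {page: tags.most_common(1)[0][0] for page, tags in counts.items()}

def tag_tree(categories):
    """Group pages under their category tag (tag = parent, pages = children)."""
    tree = defaultdict(list)
    for page, tag in categories.items():
        tree[tag].append(page)
    return dict(tree)

events = [("news", "P1")] * 56 + [("city", "P1")] * 11 + [("trip", "P1")] * 3
categories = categorize(events)
print(categories)            # {'P1': 'news'}
print(tag_tree(categories))  # {'news': ['P1']}
```

Counting per page and then taking the maximum touches every (page, tag) assignment once, which is in line with the O(mn) bound the document states for m pages and n tags.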

In this folksonomy algorithm, if the total number of web pages is m and the total number of tags assigned to the web pages is n, the time complexity of the folksonomy algorithm is O(mn). Here, time complexity is the time required as a function of the amount of data to be processed (written N or n); it expresses proportional rather than absolute time and is written in the form O(f(n)).

<Personalized Link-based Ranking Algorithm>

The personalized link-based ranking algorithm according to the present invention is included in the ranking module 116 shown in FIG. 3. To produce high-quality search results, a link-based ranking algorithm generally assigns every page a score indicating how relevant the page is. The relevance of each page is computed from the number of other relevant pages that link to it and the number of links on those pages.

Let the relevance of a page be called its PageRank score. The PageRank algorithm was developed by the founders of Google, and variants of the concept are now used in very large search engines. In theory, PageRank is computed as the probability that someone clicking links at random arrives at a given page. If a page has inbound links from other popular pages, someone is more likely to reach it by chance. To capture this, PageRank uses a damping factor α, which represents the chance that the user keeps clicking links on each page. The value of α is determined by analyzing how users click through to the web page from other linked pages. In general, α is set to 0.85. The PageRank score (PR_i) of web page i is generally calculated by Equation 1 below.

Here, α is the damping factor, set to 0.85. S_ji is the number of mutual outlinks from page j when page j links to page i; if there is no link from j to i, S_ji is 0. n is the total number of pages.
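
Equation 1 is not reproduced in this text. For orientation, the sketch below implements the commonly published form of PageRank, PR_i = (1 - α)/n + α·Σ_j PR_j / outdegree(j), by repeated iteration. Treating this as the document's Equation 1 is an assumption, as are the function name `pagerank`, the three-page example graph, and the uniform handling of pages that have no outlinks.

```python
def pagerank(links, alpha=0.85, iterations=50):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - alpha) / n for p in pages}
        for j in pages:
            targets = links.get(j, [])
            if targets:
                # Page j splits its score evenly over its outlinks.
                share = alpha * scores[j] / len(targets)
                for i in targets:
                    new[i] += share
            else:
                # Assumption: a page without outlinks spreads its score uniformly.
                for i in pages:
                    new[i] += alpha * scores[j] / n
        scores = new
    return scores

example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in pagerank(example).items():
    print(page, round(score, 4))
```

In this example page C receives links from both A and B, so it ends up with a higher score than B, which is linked only from A.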

Meanwhile, to reflect the recorded behavior of individual users, the present invention modifies Equation 1, which calculates the PageRank score, into Equation 2 below and uses it in the link-based ranking algorithm. Clearly, the time complexity of the modified PageRank algorithm is O(n^2 + nq).

In Equation 2, PR_i is the personalized PageRank score. α is the damping factor, set to 0.85. V_i is the user-customized value for personalized search and is defined by Equation 3 below. The user-customized value reflects the user's degree of preference for a web page. It is determined by counting the number of user clicks from all pages linked to page i and the number of user clicks from the web pages linked to all pages.

Here, C_ji(u) is the number of times user u clicked from page j to page i and is defined by Equation 4 below, and U is the set of all users.

In Equation 2, S_ji is the number of mutual outlinks from page j when page j links to page i and is defined by Equation 5 below. If there is no link from j to i, S_ji is 0. n is the total number of pages.
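
The equations referenced in this passage are not reproduced in this text. As a hedged reconstruction only, the following form is consistent with the verbal definitions above and, with the click counts 126, 57, 60, and 45 reported for the experiment, reproduces the score reported for page P1. The notation, and in particular the normalization of S_ji over the clicks arriving at page i, is an inference rather than a quotation of Equations 2 and 5. No form is proposed for Equations 3 and 4 because the text does not pin them down.

```latex
PR_i = (1-\alpha)\,V_i + \alpha \sum_{j} S_{ji}\,PR_j,
\qquad
S_{ji} = \frac{\sum_{u \in U} C_{ji}(u)}{\sum_{k}\sum_{u \in U} C_{ki}(u)}
```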

To demonstrate the practical validity of the search system according to the present invention, a series of web page search experiments was carried out. These experiments are described below.

FIG. 6 shows the result of searching for pages with the query "news" in a system implemented as a prototype according to the present invention.

Referring to FIG. 6, in this embodiment the PageRank scores were calculated with the link-based ranking technique and the pages were listed in score order. Five pages matching the query were listed. The score of each web page is displayed at the end of the page title.

That is, the page <rihanna-Search results for rihanna-CNN.com> had a PageRank score of 0.15218, and <Commentaries: News & Videos about Commentaries> had 0.12945. The page <moos-Search results for moos - CNN.com> had 0.11017, <china-Search results for china-CNN.com> had 0.10100, and <Benazir Bhutto : News & Videos about Benazir Bhutt> had 0.09598.

On a search results screen such as the one in FIG. 6, users can identify and delete pages that are not relevant.

The column next to the information about each web page shows the link to that page.

In the search experiment, web pages were added to the web document search system; Table 2 shows the result.

Table 2 lists the web pages added to the system by the folksonomy algorithm.

Referring to Table 2, for example, web page P1 was matched with the tag <news> 56 times, the tag <city> 11 times, and the tag <trip> 3 times, and P1 was classified into the news category.

FIG. 7 shows the link structure and the PageRank calculation results.

Referring to FIG. 7, three additional pages, P6, P7, and P8, which are linked with the previous five pages (P1, P2, P3, P4, P5), were added to the calculation.

The three added pages P6, P7, and P8 link to page P5, and their PageRank scores were all set to 0.1. Using the scores of P6, P7, and P8 and the personalized PageRank algorithm, a personalized PageRank score of 0.09598 was obtained for page P5. The personalized PageRank scores of the other pages (P1, P2, P3, P4) are also shown in FIG. 7.

That is, the personalized PageRank algorithm gave scores of 0.15218, 0.12945, 0.11017, and 0.10100 for pages P1, P2, P3, and P4, respectively.

Tables 3 and 4 give the details of the PageRank calculation in this experiment.

Table 3 shows how the number of mutual outlinks (S_ji) is calculated.

Referring to Table 3, the first column lists the web pages that link to the pages in the search result list of FIG. 6. For example, pages P2, P3, P4, and P5 all link to page P1.

In this embodiment, calculating S_ji requires counting the number of times users click on each linking page. Here, S_ji is the number of mutual outlinks from page j when page j links to page i; if there is no link from j to i, S_ji is 0.

In Table 3, the total number of user clicks from the four pages (P2, P3, P4, P5) to page P1 is 288. The total numbers of user clicks on P2, P3, P4, and P5 are 126, 57, 60, and 45, respectively. Users did not click on P6, P7, or P8.

From these values, S21, S31, S41, and S51 can be calculated.

Table 4 shows the final PageRank scores (PR_i) calculated according to the present invention.

Referring to Table 4, the damping factor α is 0.85 for every page. For page P1, the user-customized value is 15/41 and the personalized PageRank score (PR_i) is 0.15218. For page P2, the user-customized value is 11/41 and the score is 0.12945; for P3, 7/41 and 0.11017; for P4, 5/41 and 0.10100; and for P5, 3/41 and 0.09598.
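
The reported scores for P1 and P5 can be checked numerically. The sketch below assumes that Equation 2 has the form PR_i = (1 - α)·V_i + α·Σ_j S_ji·PR_j and that S_j1 is page j's share of the 288 clicks arriving at P1, reading 126, 57, 60, and 45 as the clicks from P2, P3, P4, and P5 to P1. Both are inferences from the figures in this section, since the equations and Tables 3 and 4 are not reproduced in the text. For P5, whose in-linking pages P6 to P8 received no clicks, equal weights of 1/3 are assumed. The function name `personalized_score` is illustrative.

```python
ALPHA = 0.85  # damping factor stated in the text

def personalized_score(v_i, weighted_inlinks, alpha=ALPHA):
    """Assumed form of Equation 2: (1 - alpha) * V_i + alpha * sum(S_ji * PR_j)."""
    return (1 - alpha) * v_i + alpha * sum(s * pr for s, pr in weighted_inlinks)

# Reported scores of the pages that link to P1, with their click counts toward P1.
clicks_to_p1 = {"P2": 126, "P3": 57, "P4": 60, "P5": 45}
reported = {"P2": 0.12945, "P3": 0.11017, "P4": 0.10100, "P5": 0.09598}
total = sum(clicks_to_p1.values())  # 288

pr_p1 = personalized_score(
    15 / 41, [(clicks_to_p1[p] / total, reported[p]) for p in clicks_to_p1]
)
# P6, P7, and P8 each have score 0.1; the equal weights of 1/3 are an assumption.
pr_p5 = personalized_score(3 / 41, [(1 / 3, 0.1)] * 3)

print(round(pr_p1, 5), round(pr_p5, 5))  # 0.15218 0.09598
```

Under these assumptions both values match the document's figures to five decimal places. The scores of P2, P3, and P4 cannot be checked the same way because the click counts toward those pages are not given in the text.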

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

As described above, according to the present invention, implementing a search system based on collective intelligence makes high-quality search achievable. That is, combining folksonomy with a link-based ranking scheme yields a search system based on collective intelligence that reflects both user behavior and user preferences. In particular, search effectiveness can be improved by using folksonomy, a new classification method that emerged with Web 2.0 and aggregates users' preferences and ideas, together with a link-based ranking technique that supports personalization alongside user behavior.

FIG. 6 shows the result of searching for pages with the query "news" in the prototype system.

<Description of reference numerals for the main parts of the drawings>

112: web search interface 114: web crawler

116: ranking module 120: storage server

130: folksonomy manager 132: folksonomy editor

134: tag association extractor 136: user search analyzer

140: indexer module 142: language analyzer

144: feature selector 146: tag recognizer

148: semantic classifier

Claims

When a search keyword including a tag of web pages is input by a user, the storage server of a resource management layer stores the search keyword and transmits information about the user tag to a transaction processing layer;

a folksonomy manager of the transaction processing layer edits the tag and infers relationships between tags to analyze whether the user clicks on a web page;

an indexer module of the transaction processing layer classifies web pages based on the tag;

a ranking module of a transaction analysis layer sorts web pages belonging to the same category using a personalized link-based ranking algorithm that calculates a PageRank score reflecting the recorded behavior of users; and

a web search interface of the transaction analysis layer presents the ordered web document search results to the user.

The method of claim 1, wherein the folksonomy manager

gives every web page the user clicked a unique ID and then generates a tag tree,

counts the number of users who used the same tag, and

inserts the highest-frequency tag computed by the iterative procedure into the tag database as the final category of the web page.

The method of claim 2, wherein in the tag tree, the tag is defined as a parent node, and the web page corresponding to the tag is defined as a child node.

The method of claim 1, wherein the page rank score is defined by Equation 2 below:

[Equation 2]

(where α is a damping factor, which is 0.85; V_i is a user-customized value defined by Equation 3 below; the user-customized value is determined by counting the number of user clicks from all pages linked to page i and the number of user clicks from the web pages linked to all pages, and reflects the user's preference for the web page.)

[Equation 3]

(where C_ji(u) is the number of times user u clicked from page j to page i, defined by Equation 4 below, and U is the set of all users.)

[Equation 4]

(In Equation 2, S_ji is the number of mutual outlinks from page j when page j is linked to page i, and is defined by Equation 5 below. n is the total number of pages.)

[Equation 5]

a transaction analyzer configured to collect and analyze web pages corresponding to a search keyword provided by a user, and to display the search result to the user;

a resource manager configured to store and manage information about the web document search process; and

a transaction processing unit that combines folksonomy and a link-based ranking technique to analyze user requests, classifies the users' behaviors and interests according to the analysis results, and provides them to the resource manager.

The system of claim 5, wherein the transaction analysis unit comprises:

a web search interface that receives a search keyword from a user and provides web pages corresponding to the search keyword to the user;

a web crawler that collects web pages; and

a ranking module that sorts web pages belonging to the same category and provides them to the web search interface so that web pages corresponding to the search keyword are provided to the user.

The system of claim 5, wherein the resource management unit

comprises a storage server including a folksonomy repository and a link information repository.

The system of claim 5, wherein the transaction processing unit comprises:

a folksonomy manager that edits tags corresponding to web pages, infers relationships between the tags, and analyzes user behavior; and

an indexer module that classifies web pages into categories based on the tags.

The system of claim 8, wherein the folksonomy manager comprises:

a folksonomy editor that records, on a specific web page, a tag entered by a user for that web page;

a user search analyzer that checks whether a user clicks on a tagged web page; and

a tag association extractor that, when a specific web page is clicked by a user, assigns an ID to the page and generates a tag tree between tags and pages.

The system of claim 9, wherein the indexer module comprises:

a language analyzer that analyzes the language used in a web page that has undergone folksonomy processing;

a feature selector that selects features of the language analyzed by the language analyzer;

a tag recognizer that recognizes the tags of the web page corresponding to the language selected by the feature selector; and

a semantic classifier that classifies the meanings of the tags recognized by the tag recognizer and stores them in the resource management unit.

The system of claim 6, wherein the PageRank score calculated by the ranking module

is defined by Equation 2 below:

[Equation 2]

[Equation 3]

[Equation 4]

(In Equation 2, S_ji is the number of mutual outlinks from page j when page j is linked to page i, and is defined by Equation 5 below. n is the total number of pages.)

[Equation 5]