KR20000065588A

KR20000065588A - An Information Retrieval Method that Incorporates Different Types of Links

Info

Publication number: KR20000065588A
Application number: KR1019990012010A
Authority: KR
Inventors: 맹성현; 주원균
Original assignee: 맹성현; 주원균
Priority date: 1999-04-07
Filing date: 1999-04-07
Publication date: 2000-11-15
Also published as: KR100311355B1

Abstract

PURPOSE: An information retrieval method using link information classified by link type is provided to meet users' needs by improving retrieval reliability through selective application of the attributes of the links present in a document when calculating the similarity between the document and a query. CONSTITUTION: An initial search set is constructed by retrieving documents with a vector space model (301). An expanded set is constructed by adding external (additional) documents to the initial search set (302). The final search result is generated by re-evaluating the relevance of the documents in the expanded set using the link information that exists between the documents (303).

Description

An Information Retrieval Method that Incorporates Different Types of Links

The present invention relates to an information retrieval method for retrieving general document information, web information, and the like, and to a computer-readable recording medium on which a program for implementing the method is recorded. In particular, it relates to an information retrieval method that retrieves information using link information classified by link type, and to a computer-readable recording medium on which a program for implementing that method is recorded.

In general, hypertext and browsing let a user obtain information effectively from a structured information space, and the user may also come across unexpected items (documents) by chance.

However, search engines on the web do not combine browsing with search. They assume that documents are independent units and retrieve them using the term frequencies of a document or of the entire document collection as the means of determining how strongly a query and a document are related. As a result, they cannot meet users' high expectations.

In other words, web search currently treats each document as an independent unit and uses the term frequencies of the document or document collection to determine the degree of relationship between a query and a document. Documents on the web, however, are related to other documents through hyperlinks, so current search engines cannot deliver high-quality search results.

The present invention was devised to solve the above problems. Its object is to provide an information retrieval method that meets users' needs by reflecting the influence of link information in retrieval and thereby improving retrieval reliability, and a computer-readable recording medium on which a program for implementing the method is recorded.

More specifically, an object of the present invention is to provide an information retrieval method, and a computer-readable recording medium recording a program for implementing it, that improves retrieval reliability by selectively applying the attributes of the links present in a document when calculating the similarity between the document and a query.

A further object is to provide an information retrieval method, and a computer-readable recording medium recording a program for implementing it, that searches a document collection to construct an initial search set, uses link information to expand the initial search set so that it includes external (additional) documents and thereby determines an expanded set, and then uses the link information to re-evaluate the relevance of the documents in the expanded set and generate the final search result.

FIGS. 1A and 1B are explanatory diagrams of the directionality and directness of links according to the present invention.

FIG. 2 is an example scenario of the information retrieval method using link information according to the present invention.

FIG. 3 is a flowchart of an embodiment of the information retrieval method using link information according to the present invention.

FIG. 4 is a structural diagram of an embodiment of the link information database according to the present invention.

FIG. 5 is an example scenario of the initial search set construction process of FIG. 3.

FIG. 6 is a detailed flowchart of an embodiment of the initial search process of FIG. 3.

FIG. 7 is an example scenario of the expanded set determination process and the link information re-ranking process of FIG. 3.

FIG. 8 is a detailed flowchart of an embodiment of the expanded set determination process and the link information re-ranking process of FIG. 3.

* Explanation of reference numerals for the main parts of the drawings

201: vector space model    202: index information database

203: link-based search module    204: link information database

To achieve the above object, the present invention provides an information retrieval method using link information classified by link type, comprising: a first step of constructing an initial search set by searching the document collection to be searched; a second step of determining an expanded set by expanding the initial search set to include other documents, using the link information classified by link type; and a third step of generating a search result by re-evaluating the relevance of the documents in the expanded set using the link information.

The present invention also provides a computer-readable recording medium on which is recorded a program for implementing, in an information retrieval system equipped with a high-capacity processor: a first function of constructing an initial search set by searching the document collection to be searched; a second function of determining an expanded set by expanding the initial search set to include other documents, using the link information classified by link type; and a third function of generating a search result by re-evaluating the relevance of the documents in the expanded set using the link information.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

The description below starts from the premise that the meaning of a document does not lie solely in its sequential text but can be influenced by and derived from its relationships with other documents, and that these relationships are expressed through various link types. On that premise, it presents a way to use link information in the retrieval process and, based on the influence exerted by the different link types, examines how link information ultimately improves retrieval reliability.

FIGS. 1A and 1B show the link attributes used in the information retrieval method according to the present invention: the directionality of a link (FIG. 1A) and its directness (FIG. 1B).

Link-based retrieval first requires a definition of link attributes. The present invention focuses on three aspects of a link: its directionality, its directness, and whether its anchor is a query term. As shown in FIG. 1A, directionality distinguishes output links from input links. As shown in FIG. 1B, directness distinguishes direct links from indirect links. A direct link, as the name suggests, means that two documents are connected directly by a link; the output and input links mentioned above are all examples of direct links. An indirect link is a link deemed to exist between document A and document B when both have an output link to the same document C, or both have an input link from the same document C.

Applying a different equation for each link attribute, so that link information is applied selectively to document retrieval, maximizes retrieval effectiveness.

That is, the present invention centers on how to use various types of hyperlinks. It improves retrieval reliability by adjusting document similarity according to three distinctions: input versus output links, direct versus indirect links, and whether a link's anchor is a query term or a non-query term.

FIG. 2 is an example scenario of the information retrieval method using link information according to the present invention, and FIG. 3 is a flowchart of an embodiment of that method.

First, an initial search set is constructed by searching the document collection with the vector space model 201 (301). The details are as follows.

The initial search set construction process (301) is an ordinary document retrieval process that uses the vector space model 201 and the index information database 202. The weight w_ij of term j appearing in document i is calculated using Equation 1.

Here, ntf (normalized term frequency) is the term frequency normalized by the maximum term frequency in the document, nidf (normalized inverse document frequency) is the normalized inverse document frequency of the term, and n is the total number of documents in the collection. The retrieval status value (RSV) of document D_i for query Q is their cosine similarity, calculated using Equation 2.

Here, t is the number of terms in the query. When this process finishes, the documents containing at least one query term are presented, subject to a cutoff point.
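Equations 1 and 2 are not reproduced in this text, so the following Python sketch only illustrates the kind of computation the passage describes. The product form w_ij = ntf × nidf, the log-ratio form of nidf, and the function names `term_weights` and `cosine_rsv` are all assumptions made for illustration, not the patent's exact formulas.

```python
import math
from collections import Counter

def term_weights(doc_tokens, doc_freq, n_docs):
    """Return {term: weight} for one document.

    Assumed forms (not confirmed by the source):
      ntf  = tf / max tf in the document
      nidf = log(n / df) / log(n)
      w    = ntf * nidf
    """
    counts = Counter(doc_tokens)
    if not counts or n_docs < 2:
        return {}
    max_tf = max(counts.values())
    weights = {}
    for term, tf in counts.items():
        ntf = tf / max_tf
        nidf = math.log(n_docs / doc_freq[term]) / math.log(n_docs)
        weights[term] = ntf * nidf
    return weights

def cosine_rsv(query_weights, doc_weights):
    """Cosine similarity between a query vector and a document vector."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)
```

With this sketch, a document that shares no term with the query scores 0, which matches the statement that only documents containing at least one query term are returned.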

Next, the expanded set is determined by using the link information to expand the initial search set so that it includes additional documents, called external documents (302). The details are as follows.

The expanded set determination process (302) is a document retrieval process that uses the link-based search module 203 and the link information database 204. Output links are used to examine the connections between the documents in the initial search set and documents outside it (external, additional documents), and the documents connected to the initial search set by links are then added to it to form the expanded set. This process takes the form of a kind of feedback: the top-ranked documents in the initial search set are combined with the original query to form an expanded query, and additional documents are retrieved according to the connections between this expanded query and the initially retrieved documents.

Because expanded documents usually do not contain the query terms, their retrieval status value (RSV) cannot be calculated simply by applying Equation 2. To assign an RSV to an expanded document, it inherits the RSV of the source document that links to it, adjusted by the similarity between the two documents. This is expressed in Equation 3.

Here, 0 ≤ Sim(D_in, D_ex) ≤ 1. When an external document has two or more input links, Equation 3 is used to calculate an RSV for each link and the maximum value is chosen. Alternatively, if one assumes that an external document with more input links should have a higher RSV and satisfy the query better, the Dempster-Shafer combination rule can be used to reflect the contribution of all the input links.
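The expansion step can be sketched in Python as follows. Equation 3 is not reproduced in this text, so the product form RSV(D_ex) = RSV(D_in) × Sim(D_in, D_ex) is an assumption, as are the function name `expand_result_set`, the dictionary-based link table, and the caller-supplied `sim` function. The max-over-input-links rule follows the description above.

```python
def expand_result_set(initial, out_links, sim):
    """Expand an initial result set along output links.

    initial   : {doc_id: rsv} for documents found by the query
    out_links : {doc_id: [destination doc_id, ...]} (hypothetical layout)
    sim       : function (source_id, dest_id) -> similarity in [0, 1]

    An external document inherits RSV(source) * sim(source, dest)
    (assumed form of Equation 3); with several input links the
    maximum inherited value is kept.
    """
    expanded = dict(initial)
    for source, rsv in initial.items():
        for dest in out_links.get(source, []):
            if dest in initial:
                continue  # already retrieved by the query itself
            inherited = rsv * sim(source, dest)
            expanded[dest] = max(expanded.get(dest, 0.0), inherited)
    return expanded
```

Because Sim is bounded by 1, an external document never receives a higher RSV than the internal document it inherits from under this assumed form.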

Next, the final search result is generated by re-evaluating the relevance of the documents in the expanded set using the link information that exists between the documents (303). That is, the link information is blended in (re-ranking) to generate the search result. The details are as follows.

The link information blending (re-ranking) process (303) re-ranks all candidate documents using the link information that spans the entire set. A new (external) document may be inserted at various positions in the list of retrieved internal documents, and because the retrieval status value (RSV) is recalculated, the initial ranking of the internal documents may also change. The basic algorithm reflects the influence that documents exert on one another when they are connected by links. Because the link candidates from the first step are assumed to be largely relevant to the query, the more link candidates a document has, the more closely it is tied to the given query. Since the influence on the relationship between documents differs by link type, the influence of links is analyzed separately for direct and indirect links.

The effect of direct links is as follows.

A direct link can be further classified by its directionality (input or output) and by whether its anchor is a query term or a non-query term. Taking these four cases into account, Equation 4 is used to calculate the effect of direct links on a given document D.

The terms of Equation 4 represent the following: the first term is the effect of links going out from query terms in D; the second, links going out from non-query terms in D; the third, input links from query terms in other documents; and the fourth, input links from non-query terms in other documents. The terms are combined according to the Dempster-Shafer combination rule, which Equation 4 denotes with a combination operator.

In Equation 4, α_i is a parameter representing the strength or importance of each of the four link types; its exact values are determined experimentally. r, s, t, and w are the numbers of links of each of the four types that start from or point to document D. The summation symbol in Equation 4 does not denote an ordinary sum but repeated application of the combination operator, which guarantees that the value of each term never exceeds 1. The new retrieval status value (RSV) of document D is calculated using Equation 5.
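Since Equations 4 and 5 are not reproduced here, the Python sketch below only illustrates how a bounded, Dempster-Shafer-style combination over four link types might look. It assumes the combination a ⊕ b = a + b − ab, which is what Dempster's rule yields for two simple support functions focused on the same proposition, and it assumes each link contributes α × RSV of the linked document. The type labels, the `alpha` values, and the function names are hypothetical.

```python
from functools import reduce

def ds_combine(a, b):
    """Combine two support values in [0, 1]; the result stays in [0, 1]."""
    return a + b - a * b

def direct_link_effect(links, alpha):
    """Combine the contributions of all direct links of a document.

    links : list of (link_type, rsv_of_linked_doc)
    alpha : {link_type: weight in [0, 1]}, tuned experimentally
    Each link contributes alpha[type] * rsv (assumed form); contributions
    are folded together with ds_combine instead of ordinary addition.
    """
    contributions = [alpha[link_type] * rsv for link_type, rsv in links]
    return reduce(ds_combine, contributions, 0.0)

# Hypothetical weights for the four link types described in the text.
ALPHA = {
    "out_query": 0.5,     # output link anchored on a query term
    "out_nonquery": 0.25, # output link anchored on a non-query term
    "in_query": 0.5,      # input link anchored on a query term
    "in_nonquery": 0.25,  # input link anchored on a non-query term
}
```

Unlike an ordinary sum, this combination cannot exceed 1 however many links a document has, which is the property the description attributes to the combination rule.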

The effect of indirect links is as follows.

When two documents A and B both link to the same document, a relationship between them can be assumed. Likewise, when one document has separate links to two documents, a relationship between those two documents can be considered. To calculate the link strength between D_i and D_j that arises from sharing a destination, one must consider how many of the links leaving D_i and D_j have the same destination. The strength σ_{i,j} is calculated using Equation 6.

Here, | | denotes the number of links, L_i is the set of links going out from D_i, L_j is the set of links going out from D_j, and L_ij is the set of link pairs, one from D_i and one from D_j, that share the same destination. The strength of an indirect link created by a document that points to both D_i and D_k can be calculated in a similar way. Given documents D_j and D_k, the effect of indirect links on document D_i is calculated using Equation 7.
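Equation 6 is not reproduced in this text. The Python sketch below shows one plausible way to turn the counts |L_i|, |L_j|, and |L_ij| into a strength between 0 and 1; the Jaccard-style normalization and the function name `indirect_strength` are assumptions chosen for illustration, not the patent's formula.

```python
def indirect_strength(out_i, out_j):
    """Strength of the indirect link between two documents.

    out_i, out_j : sets of destination doc ids linked from D_i and D_j.
    Shared destinations play the role of L_ij. The normalization
    |shared| / |union| is an assumed stand-in for Equation 6.
    """
    union = out_i | out_j
    if not union:
        return 0.0
    shared = out_i & out_j
    return len(shared) / len(union)
```

The same function applies to the incoming case by passing the sets of documents that point to each of the two documents.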

Taking the effects of both direct and indirect links into account, the final retrieval status value (RSV) of document D is calculated using Equation 8.

Here, the parameter α₄ is determined experimentally, and the other parameters are as described for Equation 4.

FIG. 4 is a structural diagram of an embodiment of the link information database according to the present invention.

As shown in the figure, the link information (Link_Info) is a data structure that holds all link information about a document. The document identifier (DocID) is a field holding the document's unique number. The output link information (Out_Link_Info) pointer is a field that stores information about links going out from the document identified by DocID to other documents, and the input link information (In_Link_Info) pointer is a field that stores information about links coming in from other documents to the document identified by DocID.

Each output link information (Out_Link_Info) entry consists of a link name (Link_Name) field, which gives the link a meaning that distinguishes it from other links, and the document identifier (DocID) of the document the link points to (the destination document). Each input link information (In_Link_Info) entry consists of a link name (Link_Name) field, which likewise distinguishes the link from other links, and the document identifier (DocID) of the document from which the link originates.
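The record layout described above can be sketched with Python dataclasses. The field names mirror Link_Info, DocID, Out_Link_Info, In_Link_Info, and Link_Name from the description; the class names, the in-memory dictionary standing in for the database, and the helper `add_link` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LinkEntry:
    """One Out_Link_Info or In_Link_Info entry."""
    link_name: str  # Link_Name: label that distinguishes this link
    doc_id: int     # DocID of the destination (out) or origin (in) document

@dataclass
class LinkInfo:
    """All link information for one document (Link_Info)."""
    doc_id: int
    out_links: List[LinkEntry] = field(default_factory=list)
    in_links: List[LinkEntry] = field(default_factory=list)

def add_link(db: Dict[int, LinkInfo], source: int, dest: int, name: str) -> None:
    """Record one hyperlink on both the source and destination records."""
    db.setdefault(source, LinkInfo(source)).out_links.append(LinkEntry(name, dest))
    db.setdefault(dest, LinkInfo(dest)).in_links.append(LinkEntry(name, source))
```

Storing each link on both records lets the re-ranking step look up input and output links of a document directly, at the cost of writing every link twice.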

FIG. 5 is an example scenario of the initial search set construction process of FIG. 3, and FIG. 6 is a detailed flowchart of an embodiment of the initial search process of FIG. 3.

First, when the user enters a query, the vector search engine initializes the accumulators (501, 601). Vector search represents documents and queries as vectors and retrieves documents by calculating the similarity between document and query with an equation such as Equation 2. A data structure called an accumulator is used to implement these vectors in practice.

Once the accumulators are initialized, it is determined whether any query terms remain to be processed (602). If none remain, the K largest accumulator values (K is a natural number) are selected to generate the basic result set (506, 609). If a term remains, its inverted list (composed of an index term extracted from the documents, the identifiers of the documents containing that term, and the weights) is retrieved from the underlying storage structure that holds the index information, which is organized as a B+ tree (503, 603).

Next, it is determined whether an inverted list was found (604). If not, the process returns to determining whether query terms remain (602). If so, it is determined whether an accumulator already exists for each document identifier in the inverted list (605).

If the determination (605) finds that an accumulator exists for the document identifier, the information is read from that accumulator, the newly calculated query-document similarity is added to it, and the result is written back (504, 606). If no accumulator exists, one is added that holds the query-document similarity for that document identifier (505, 607).

When the query-document similarity calculation is complete, the accumulators are sorted in order of decreasing similarity (502, 608). Finally, the K largest accumulator values (K is a natural number) are selected to generate the basic result set (506, 609).
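The accumulator loop of FIG. 6 can be sketched in Python as follows. A plain dictionary stands in for the B+ tree index, and the posting weights are treated as already-computed per-term similarity contributions; both are simplifying assumptions, as is the function name `initial_search`.

```python
import heapq

def initial_search(query_terms, inverted_index, k):
    """Return the top-k (doc_id, score) pairs for a query.

    inverted_index : {term: [(doc_id, weight), ...]}, a dictionary used
                     here in place of the B+ tree described in the text.
    """
    accumulators = {}  # step 601: start with no accumulators
    for term in query_terms:  # step 602: any query term left?
        postings = inverted_index.get(term)  # step 603: fetch inverted list
        if not postings:  # step 604: no inverted list, try the next term
            continue
        for doc_id, weight in postings:
            # steps 605-607: update an existing accumulator or create one
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + weight
    # steps 608-609: order by decreasing score and keep the K largest
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```

For example, with the index `{"link": [(1, 0.5), (2, 0.25)], "search": [(1, 0.25), (3, 1.0)]}` and `k = 2`, the query `["link", "search"]` accumulates scores of 0.75, 0.25, and 1.0 for documents 1, 2, and 3, so documents 3 and 1 form the basic result set.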

FIG. 7 is an example scenario of the expanded set determination process and the link information re-ranking process of FIG. 3, and FIG. 8 is a detailed flowchart of an embodiment of those processes.

First, the basic result set generated in FIG. 6 is added to the link accumulator (701, 801). Next, similarity is calculated for the documents connected by links to documents in the basic result set (702, 802). The links used here are, among the link attributes presented in FIGS. 1A and 1B, those that go from inside the set to outside it. Because these documents usually do not contain the query terms, their similarity is recalculated using Equation 3. The documents with recalculated similarity are then added to a new result set (703, 803).

Next, to reflect the link effect in retrieval, the stored link information is applied to each document (704, 804). The link effect here covers both the effect of direct links and the effect of indirect links. Finally, the final similarity (RSV) of each document is calculated with the effects of direct and indirect links taken into account, and the result set is generated (705, 805).

The present invention described above is not limited to the foregoing embodiment and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

The present invention as described above improves retrieval reliability by using link information to combine browsing with search, something existing search engines have not provided.

The present invention also demonstrated the influence of different link types on retrieval reliability. In experiments on the Kyemongsa (계몽사) collection, which contains many hyperlinks, the method improved 11-point average precision by 44.8% over a baseline that used no link information. These experiments on the link-based retrieval method presented here supported the premise that links are very useful and contribute to improved retrieval reliability.

In addition, the experiments showed that when both input and output links are used, relevant documents that cannot be obtained by query matching alone can be included in the result list, and that re-ranking the documents already retrieved moves relevant documents toward the top of the result list.

Claims

An information retrieval method using link information classified by link type, the method comprising:

A first step of constructing an initial search set by searching a document collection to be searched;

A second step of determining an expanded set by expanding the initial search set to include other documents, using the link information classified by link type; and

A third step of generating a search result by re-evaluating the relevance of the documents in the expanded set using the link information.

The method of claim 1,

wherein the second step comprises:

A fourth step of adding the initial search set to a link accumulator;

A fifth step of calculating a similarity for documents connected by links to the documents in the initial search set; and

A sixth step of determining the expanded set by adding the documents whose similarity was calculated in the fifth step to the initial search set.

The method of claim 1,

wherein the second step comprises:

A fourth step of examining, by using output links, the connections between the documents in the initial search set and external (additional) documents outside the initial search set; and

A fifth step of determining the expanded set by adding to the initial search set the documents that the link information shows to be connected to it.

The method of claim 1,

wherein, in the second step,

the top-ranked documents in the initial search set are combined with the original query to form an expanded query, and the expanded set is determined by retrieving additional documents according to the connections between the expanded query and the initially retrieved documents and adding them to the initial search set.

The method according to any one of claims 1 to 4,

wherein the third step comprises:

A seventh step of applying the link information stored in a link information database to the documents in the expanded set in order to reflect the link effect in retrieval; and

An eighth step of calculating the similarity (RSV) of each document, taking into account the effects of direct and indirect links, and generating the link-based search result.

The method of claim 5,

wherein the link information database comprises:

A document identifier (DocID) field indicating a document's unique number;

An output link information (Out_Link_Info) pointer field that stores information about links going out from the document identified by the document identifier (DocID) to other documents; and

An input link information (In_Link_Info) pointer field that stores information about links coming in from other documents to the document identified by the document identifier (DocID).

The method of claim 6,

wherein each of the pointer fields comprises:

A link name (Link_Name) field that gives the link a meaning distinguishing it from other links; and

A document identifier (DocID) field indicating the document to which the link points.

The method according to any one of claims 1 to 4,

wherein, in the third step,

all candidate documents are re-ranked (blended) using the link information that spans the entire expanded set, to generate the final search result.

The method according to any one of claims 1 to 4,

wherein, in using the link information,

the link information is applied selectively when adjusting the similarity between a document and a query, according to whether a link is an input or output link, whether it is a direct or indirect link, and whether its anchor is a query term or a non-query term, so as to improve the reliability of information retrieval.

A computer-readable recording medium on which is recorded a program for implementing, in an information retrieval system equipped with a high-capacity processor:

A first function of constructing an initial search set by searching a document collection to be searched;

A second function of determining an expanded set by expanding the initial search set to include other documents, using link information classified by link type; and

A third function of generating a search result by re-evaluating the relevance of the documents in the expanded set using the link information.