KR20040039691A

KR20040039691A - Indexing method of information searching system

Info

Publication number: KR20040039691A
Application number: KR1020020067836A
Authority: KR
Inventors: 이상호; 박선영; 전혜정
Original assignee: 엘지전자 주식회사
Priority date: 2002-11-04
Filing date: 2002-11-04
Publication date: 2004-05-12

Abstract

PURPOSE: An indexing method for an information retrieval system is provided that improves system performance by avoiding the additional keyword extraction operation otherwise needed to recover a document's existing keywords when the document is deleted or modified. CONSTITUTION: When a new document is added, its keywords are extracted and, for each keyword, the posting list into which the document information is to be inserted is looked up (401). It is then determined whether such a posting list exists (402). If it does not, a new posting list is created in the keyword index (403), and the position information of the new posting list is added to the position list of posting lists (404). The document information containing the keyword is then added to the document index (405).

Description

Indexing method of information searching system

The present invention relates to an information retrieval system and, more particularly, to an indexing method for an information retrieval system that efficiently supports modification and deletion of documents by using a dual index structure built on an inverted index.

With advances in computer and communication technology, the amount of information held privately by individuals is shrinking, while the amount shared with a large, unspecified audience is growing. Information that used to be stored as books or microfilm is now digitized and stored on servers, where many people access it over the Internet. In this environment, information retrieval systems that quickly find what a user wants among large volumes of information are becoming increasingly important.

To find the information a user wants, an information retrieval system extracts keywords from documents and uses them to build an index. It then uses this index to retrieve the documents that match a user's query and returns the results.

The most widely used index structure in such systems is the inverted index, which, given a keyword, finds the documents that contain it. For each keyword, an inverted index maintains a posting list made up of information about the documents containing that keyword (for example, document identifiers).
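As a minimal sketch of the inverted index just described, the following Python builds keyword-to-posting-list mappings from a small document set. The `extract_keywords` helper is a hypothetical stand-in that simply splits on whitespace; the patent assumes a dictionary-based extractor, and all names here are illustrative only.

```python
from collections import defaultdict


def extract_keywords(text):
    """Stand-in keyword extractor (assumption): lowercase whitespace tokens."""
    return set(text.lower().split())


def build_inverted_index(documents):
    """Map each keyword to a posting list of the document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for keyword in extract_keywords(documents[doc_id]):
            index[keyword].append(doc_id)  # posting lists stay sorted by doc ID
    return dict(index)


documents = {1: "dual index structure", 2: "inverted index"}
inverted = build_inverted_index(documents)
print(inverted["index"])  # [1, 2]
```

Looking up a keyword returns its posting list directly, which is what makes keyword-to-document search cheap in this structure.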

In an information retrieval system, documents are inserted, deleted, and modified dynamically, so the inverted index must be structured to manage such changing documents efficiently. Otherwise the system comes under heavy load and cannot deliver up-to-date information to users quickly. Moreover, if document changes (insertions, deletions, modifications) are not reflected properly, the system cannot return accurate information in response to user requests.

Representative conventional inverted index structures are the 'simple inverted index structure' shown in FIG. 1 and the 'inverted index structure using sub-indexes' shown in FIG. 2.

The conventional 'simple inverted index structure' is the most widely used inverted index structure. An index is built over the keywords (here, a B⁺-tree index) so that the leaf node reached by searching for a keyword is that keyword's posting list. This structure is simple to implement, but search, modification, and deletion are costly: searching requires a sequential scan of a potentially long posting list, and deleting a document requires re-extracting the keywords from the document to be deleted and then using them to remove the document's information from the posting lists.

The 'inverted index structure using sub-indexes' attaches an independent index (a sub-index) to each posting list in order to improve the search speed of the simple inverted index structure. Even with this structure, however, deleting a document still requires re-extracting the keywords from the document to be deleted.

Extracting keywords from a document generally requires looking up lexical information in a dictionary for each word in the document, so keyword extraction places a heavy load on the information retrieval system. Reducing the number of keyword extraction operations is therefore an important factor in improving system performance.

However, the conventional inverted index structures described above (the simple inverted index structure and the inverted index structure using sub-indexes) must perform a keyword extraction operation every time a document is modified or deleted, in order to recover the keywords contained in the existing document. In addition, the previous version of the document must be retained so that its keywords can be extracted at modification or deletion time.
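The cost described above can be illustrated with a short sketch of deletion from a conventional keyword-to-posting-list mapping. The function and variable names are hypothetical, and `extract_keywords` is again a whitespace-splitting stand-in for a real extractor. Note that the caller must supply the old document text.

```python
def extract_keywords(text):
    """Stand-in for the costly dictionary-based extraction (assumption)."""
    return set(text.lower().split())


def delete_from_simple_index(index, doc_id, old_text):
    """Remove doc_id from a keyword -> posting-list dict.

    The old text is required: without it there is no way to know which
    posting lists mention the document, short of scanning all of them.
    """
    for keyword in extract_keywords(old_text):  # the extra extraction pass
        posting = index.get(keyword)
        if posting is None:
            continue
        if doc_id in posting:
            posting.remove(doc_id)
        if not posting:
            del index[keyword]  # drop posting lists that become empty


index = {"dual": [1], "index": [1, 2], "inverted": [2]}
delete_from_simple_index(index, 1, "dual index")
print(index)  # {'index': [2], 'inverted': [2]}
```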

A simple way around this problem is to have the information retrieval system store the keywords extracted from each document separately, but this creates a different problem: the amount of memory the system requires grows very large.

An object of the present invention is to provide an indexing method for an information retrieval system that improves the performance of the system by not performing an additional keyword extraction operation to recover existing keywords when a document is deleted or modified.

FIG. 1 is a diagram showing a conventional simple inverted index structure.

FIG. 2 is a diagram showing a conventional inverted index structure using sub-indexes.

FIG. 3 is a diagram showing the dual index structure generated by the indexing method of an information retrieval system according to the present invention.

FIG. 4 is a flowchart showing the indexing process when a document is added, under the indexing method of an information retrieval system according to the present invention.

FIG. 5 is a flowchart showing the indexing process when a document is deleted, under the indexing method of an information retrieval system according to the present invention.

To achieve the above object, the indexing method of an information retrieval system according to the present invention, in indexing documents so as to provide search information on a document requested by a client,

is characterized in that it generates a keyword index of inverted index structure, having posting lists that provide the location information of the documents containing a keyword so that documents containing that keyword can be retrieved, and generates a document index that provides the position information of the posting lists so that the positions of the relevant posting lists in the keyword index can be retrieved from a document identifier.

Also to achieve the above object, the indexing method of an information retrieval system according to the present invention, in indexing an added document so as to provide search information on a document requested by a client,

is characterized by comprising: extracting keywords from the document to be added, searching for the posting list into which document information containing an extracted keyword is to be inserted, and determining whether such a posting list exists; and, if the determination shows that the posting list does not exist, creating a new posting list in the keyword index, adding the position information of the created posting list to the position list of posting lists, and adding the identifier information of the document containing the keyword to the document index.

Here, the present invention is further characterized in that the identifier information of the document containing the keyword is added to the document index.

Also to achieve the above object, the indexing method of an information retrieval system according to the present invention, in indexing a deleted document so as to provide search information on a document requested by a client,

is characterized by comprising: retrieving, through the document index, the posting lists of the keywords contained in the document to be deleted, deleting the document's information from each posting list, and determining whether any document information remains in each posting list; and, if the determination shows that no document information remains in a posting list, deleting that posting list from the keyword index, deleting the corresponding entry from the position list of posting lists, and deleting the identifier information of the deleted document from the document index.

According to the present invention as described, no additional keyword extraction operation is performed to recover existing keywords when a document is deleted or modified, which improves the performance of the information retrieval system.

That is, the present invention concerns the inverted index used in an information retrieval system, and proposes a dual index built on an inverted index structure that requires no additional keyword extraction operation to recover existing keywords when a document is deleted or modified.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 3 is a diagram showing the dual index structure generated by the indexing method of an information retrieval system according to the present invention.

Referring to FIG. 3, the dual index generated by the indexing method according to the present invention consists of a 'keyword index', which retrieves document identifiers from a keyword, and a 'document index', which retrieves the positions of posting lists from a document identifier. The document index points to the positions of the posting lists indirectly, through the 'position list of posting lists'. This is because the position of a posting list can change when documents are added or deleted (in a B⁺-tree index, for example, node splits and merges change node positions).

Each posting list, in turn, points to the entry in the 'position list of posting lists' that records its own position, so that the entry can be updated if the posting list moves because keywords are added or deleted. As a result, the posting list position information retrieved through the document index always remains accurate.
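The indirection described above can be sketched as follows. This is an illustrative model, not the patent's implementation: the `PostingStore` class, its integer "slot" addresses, and the `move` operation (standing in for a B⁺-tree node split or merge) are all assumptions introduced for the example.

```python
class PostingList:
    def __init__(self, keyword, entry_id):
        self.keyword = keyword
        self.doc_ids = set()
        self.entry_id = entry_id  # back-pointer to this list's position-list entry


class PostingStore:
    """Toy storage in which posting lists live at movable slot addresses."""

    def __init__(self):
        self.slots = {}          # slot address -> PostingList
        self.position_list = {}  # entry id -> current slot address
        self._next_entry = 0

    def create(self, keyword, slot):
        entry_id = self._next_entry
        self._next_entry += 1
        self.slots[slot] = PostingList(keyword, entry_id)
        self.position_list[entry_id] = slot
        return entry_id

    def move(self, old_slot, new_slot):
        """Relocate a posting list, as a node split or merge might."""
        posting = self.slots.pop(old_slot)
        self.slots[new_slot] = posting
        # The back-pointer tells us which position-list entry to fix up.
        self.position_list[posting.entry_id] = new_slot

    def resolve(self, entry_id):
        return self.slots[self.position_list[entry_id]]


store = PostingStore()
entry = store.create("index", slot=10)
store.resolve(entry).doc_ids.add(7)
document_index = {7: [entry]}  # doc ID -> position-list entry IDs

store.move(10, 42)  # the posting list changes position
posting = store.resolve(document_index[7][0])
print(posting.keyword, store.position_list[entry])  # index 42
```

Because the document index stores entry IDs rather than slot addresses, a relocation touches only one position-list entry, no matter how many documents reference that posting list.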

That is, the indexing method according to the present invention performs indexing with a dual index structure: it generates a 'keyword index' of inverted index structure so that document identifiers can be retrieved from a keyword, and it generates a 'document index' so that, from a document identifier, the positions of the posting lists that provide the location information of that document can be retrieved.

When a keyword search is run against the dual index, the keyword index is used to find the document identifiers that satisfy the query, and those identifiers are used to deliver the documents to the user. When the keywords of an existing document must be removed from the keyword index because the document has been modified or deleted, the document index and the position list of posting lists are used to find the posting lists of the keywords to be removed, and the information about the document is deleted from those lists.

The index structure used for the document index and the keyword index in the dual index may be the commonly used B⁺-tree or any other existing index structure. To search for a document identifier within a posting list more quickly, a sub-index (an independent index attached to each posting list) may also be applied.
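The patent does not specify what form a sub-index takes. As one hedged illustration of the idea, the sketch below keeps each posting list sorted and uses binary search (Python's standard `bisect` module) in place of a sequential scan; the `SortedPostingList` class is a hypothetical name, and a real system might use a B⁺-tree or hash structure instead.

```python
import bisect


class SortedPostingList:
    """Posting list kept in sorted order so lookups take O(log n) comparisons."""

    def __init__(self):
        self._ids = []

    def _locate(self, doc_id):
        i = bisect.bisect_left(self._ids, doc_id)
        found = i < len(self._ids) and self._ids[i] == doc_id
        return i, found

    def add(self, doc_id):
        i, found = self._locate(doc_id)
        if not found:
            self._ids.insert(i, doc_id)

    def remove(self, doc_id):
        i, found = self._locate(doc_id)
        if found:
            del self._ids[i]
        return found

    def __contains__(self, doc_id):
        return self._locate(doc_id)[1]

    def __len__(self):
        return len(self._ids)


posting = SortedPostingList()
for doc_id in (30, 10, 20, 10):
    posting.add(doc_id)
print(20 in posting, len(posting))  # True 3
```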

FIGS. 4 and 5 show the algorithms for adding a document to and deleting a document from the dual index, respectively, under the indexing method according to the present invention.

The indexing process for adding a document is now described with reference to FIG. 4, which shows how keywords are extracted from the document to be added and then inserted into the dual index.

First, when a new document is added to the information retrieval system, the keywords of the document are extracted and, for each keyword, the posting list into which the document information containing that keyword is to be inserted is looked up (step 401). It is then determined whether such a posting list exists, that is, whether the keyword is already present (step 402).

If the determination in step 402 shows that the posting list does not exist (that is, the keyword is being inserted for the first time), a new posting list is created in the keyword index (step 403), and the position information of the created posting list is added to the position list of posting lists (step 404). The document information containing the keyword (for example, the document identifier) is then added to the document index (step 405). If the determination in step 402 shows that the posting list already exists, processing continues directly from step 405. Through this process, indexing is carried out by the method according to the present invention whenever a new document is added.
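Steps 401 to 405 can be sketched as follows. This is a simplified model under stated assumptions: plain dictionaries stand in for the B⁺-tree indexes, a Python object reference plays the role of a posting list's position, and `extract_keywords` is a whitespace-splitting placeholder. The class and attribute names are illustrative, not taken from the patent.

```python
def extract_keywords(text):
    """Stand-in keyword extractor (assumption): lowercase whitespace tokens."""
    return set(text.lower().split())


class DualIndex:
    def __init__(self):
        self.keyword_index = {}   # keyword -> posting list
        self.position_list = {}   # entry id -> posting list (its "position")
        self.document_index = {}  # doc id -> set of position-list entry ids
        self._next_entry = 0

    def add_document(self, doc_id, text):
        entries = self.document_index.setdefault(doc_id, set())
        for keyword in extract_keywords(text):          # step 401: extract, look up
            posting = self.keyword_index.get(keyword)
            if posting is None:                         # step 402: list missing?
                posting = {"keyword": keyword, "docs": set(),
                           "entry": self._next_entry}
                self._next_entry += 1
                self.keyword_index[keyword] = posting   # step 403: new posting list
                self.position_list[posting["entry"]] = posting  # step 404
            # Inserting the doc ID into the posting list is implied by step 401
            # rather than numbered in the flowchart (assumption).
            posting["docs"].add(doc_id)
            entries.add(posting["entry"])               # step 405: document index


idx = DualIndex()
idx.add_document(1, "dual index")
idx.add_document(2, "inverted index")
print(sorted(idx.keyword_index["index"]["docs"]))  # [1, 2]
print(len(idx.position_list))                      # 3
```

Note that the document index records only entry IDs, not the keywords themselves, which is the property the patent relies on to keep memory overhead low.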

The process of deleting an existing document under the indexing method according to the present invention is now described with reference to FIG. 5.

To delete a document from the dual index structure according to the present invention, the posting lists of the keywords contained in the document are first retrieved through the document index (step 501). The document's information is then deleted from each posting list (step 502), and it is determined whether any other document information remains in the posting list from which it was deleted (step 503).

If the determination in step 503 shows that no document information remains in the posting list, the posting list is deleted from the keyword index (step 504) and the corresponding entry is deleted from the position list of posting lists (step 505). The document's information is then deleted from the document index (step 506). If the determination in step 503 shows that document information still remains in the posting list, the document's information is deleted from the document index (step 506) and the indexing operation ends.
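Steps 501 to 506 can be sketched as a function over three dictionaries. The layout is an assumption made for illustration (each posting list is a dict carrying its keyword, document IDs, and position-list entry ID), and the state is built by hand so that the example makes visible that no keyword extraction takes place.

```python
def delete_document(keyword_index, position_list, document_index, doc_id):
    """Delete doc_id using only the document index; no text, no extraction."""
    for entry in document_index.get(doc_id, ()):   # step 501: find posting lists
        posting = position_list[entry]
        posting["docs"].discard(doc_id)            # step 502: drop the doc info
        if not posting["docs"]:                    # step 503: anything left?
            del keyword_index[posting["keyword"]]  # step 504: drop posting list
            del position_list[entry]               # step 505: drop position entry
    document_index.pop(doc_id, None)               # step 506: drop the document


# Hand-built state: doc 1 contains "dual" and "index", doc 2 contains "index".
p_dual = {"keyword": "dual", "docs": {1}, "entry": 0}
p_index = {"keyword": "index", "docs": {1, 2}, "entry": 1}
keyword_index = {"dual": p_dual, "index": p_index}
position_list = {0: p_dual, 1: p_index}
document_index = {1: [0, 1], 2: [1]}

delete_document(keyword_index, position_list, document_index, 1)
print(sorted(keyword_index))  # ['index']
print(document_index)         # {2: [1]}
```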

When an existing document must be modified, the deletion algorithm described with reference to FIG. 5 is performed first, followed by the addition algorithm described with reference to FIG. 4. The modified document is thereby indexed without a separate keyword extraction process for the existing version of the document.
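Modification as delete-then-add can be sketched with a compact, self-contained model. All names are hypothetical, dictionaries stand in for the real index structures, and the `extractions` counter is added purely to show that an update triggers extraction only for the new text, never for the old version.

```python
class DualIndex:
    def __init__(self):
        self.keyword_index = {}   # keyword -> posting list
        self.position_list = {}   # entry id -> posting list
        self.document_index = {}  # doc id -> set of entry ids
        self.extractions = 0      # how many extraction passes were run
        self._next_entry = 0

    def _extract(self, text):
        """Stand-in keyword extractor (assumption): whitespace tokens."""
        self.extractions += 1
        return set(text.lower().split())

    def add(self, doc_id, text):
        entries = self.document_index.setdefault(doc_id, set())
        for keyword in self._extract(text):
            posting = self.keyword_index.get(keyword)
            if posting is None:
                posting = {"keyword": keyword, "docs": set(),
                           "entry": self._next_entry}
                self._next_entry += 1
                self.keyword_index[keyword] = posting
                self.position_list[posting["entry"]] = posting
            posting["docs"].add(doc_id)
            entries.add(posting["entry"])

    def delete(self, doc_id):
        for entry in self.document_index.pop(doc_id, set()):
            posting = self.position_list[entry]
            posting["docs"].discard(doc_id)
            if not posting["docs"]:
                del self.keyword_index[posting["keyword"]]
                del self.position_list[entry]

    def update(self, doc_id, new_text):
        self.delete(doc_id)         # FIG. 5: needs no old text
        self.add(doc_id, new_text)  # FIG. 4: extracts from the new text only


idx = DualIndex()
idx.add(1, "dual index")
idx.add(2, "inverted index")
idx.update(1, "simple index")
print(sorted(idx.keyword_index))  # ['index', 'inverted', 'simple']
print(idx.extractions)            # 3
```

The counter reads 3 (two initial additions plus one for the updated text), whereas a conventional index would need a fourth pass over the old version of document 1.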

The indexing method according to the present invention, carried out as described above, has the following advantages.

First, it supports document deletion efficiently. Previously, deleting a document from an inverted index structure required first extracting the keywords in the document to be deleted, which imposed substantial overhead and degraded system performance. In the dual index structure according to the present invention, the posting lists of the keywords of the document to be deleted can be retrieved directly through the document index, so documents are deleted more efficiently than with the conventional approach. Moreover, the document index holds only the position information of the posting lists, not the keywords themselves, so the memory overhead is small.

Second, it supports document modification efficiently. In an inverted index structure, a document is modified by deleting the existing document and then adding the modified one. Because the dual index structure according to the present invention supports deletion efficiently, as described above, it also supports modification efficiently.

Third, it is independent of existing inverted index structures. Because the dual index structure according to the present invention is independent of other inverted index structures, it can be combined with them. For example, the dual index can be combined with the inverted index structure using sub-indexes to further improve search performance. Likewise, the keyword index and the document index within the dual index are not tied to any particular index structure, so the performance of the dual index can be improved by using index structures other than the B⁺-tree.

As described above, the indexing method of an information retrieval system according to the present invention does not perform an additional keyword extraction operation to recover existing keywords when a document is deleted or modified, and thereby improves the performance of the information retrieval system.

Claims

An indexing method of an information retrieval system for indexing documents so as to provide search information on a document requested by a client, the method comprising:

generating a keyword index of inverted index structure having posting lists that provide the location information of the documents containing a keyword, so that documents containing the keyword can be retrieved; and

generating a document index that provides the position information of the posting lists, so that the position of a posting list can be retrieved from a document identifier.

An indexing method of an information retrieval system for indexing an added document so as to provide search information on a document requested by a client, the method comprising:

extracting keywords from the added document, searching for the posting list into which document information containing an extracted keyword is to be inserted, and determining whether the searched posting list exists; and

if the determination shows that the searched posting list does not exist, creating a new posting list in a keyword index, adding the position information of the created posting list to a position list of posting lists, and adding the identifier information of the document containing the keyword to a document index.

The method of claim 2,

wherein, if the determination shows that the searched posting list exists, the identifier information of the document containing the keyword is added to the document index.

An indexing method of an information retrieval system for indexing a deleted document so as to provide search information on a document requested by a client, the method comprising:

retrieving, through a document index, the posting lists of the keywords contained in the document to be deleted, deleting the document's information from each posting list, and determining whether other document information exists in each posting list; and

if the determination shows that no other document information remains in a posting list, deleting that posting list from a keyword index, deleting the corresponding entry from a position list of posting lists, and deleting the identifier information of the deleted document from the document index.

The method of claim 4,

wherein, if the determination shows that other document information exists in the posting list, the identifier information of the document to be deleted is deleted from the document index.