KR100434718B1

KR100434718B1 - Method and system for indexing document

Info

Publication number: KR100434718B1
Application number: KR20010007571A
Authority: KR
Inventors: 전석진; 이상호
Original assignee: 전석진; 이상호
Priority date: 2001-02-15
Filing date: 2001-02-15
Publication date: 2004-06-07
Also published as: KR20020067162A

Abstract

The present invention relates to a document indexing system and method. In particular, it generates index information consisting of a specific keyword and a list of the URLs of documents containing that keyword, distributes the index information to a plurality of computers connected through a communication network, and has each of those computers perform indexing based on the index information it receives. By building a document list for each concept, where a concept is a combination of a specific keyword and a word located within a certain distance of it, the invention can produce highly discriminative search results. By distributing the indexing procedure across a plurality of networked computers, it also improves search speed and significantly reduces the system load of document indexing.

Description

Document indexing system and method {Method and system for indexing document}

The present invention relates to a document indexing system and method, and more particularly to a document indexing system and method that generate index information consisting of a specific keyword and a list of the URLs of documents containing that keyword, distribute the index information to a plurality of computers connected through a communication network, and have each of those computers perform indexing based on the index information it receives.

In recent years, as most documents have come to be written on computers and distributed and obtained over communication networks, techniques for finding documents effectively have become increasingly important. Moreover, with the spread of the Internet, it has become common for ordinary people as well as experts to connect to a network to provide or obtain information, and the amount of information accessible through the Internet is growing exponentially. As a result, search engines (for example, AltaVista, yahoo, infoseek ultra, dejanews, lycos, and empas) have established themselves as the most successful applications on the Internet, a vast information repository and information-acquisition infrastructure without precedent in history.

Such a search engine indexes the documents to be searched in advance and then, in response to a query entered from outside, searches for and returns the matching documents. The efficiency of the search engine therefore depends heavily on the indexing method, that is, on how the documents to be searched are indexed.

FIG. 1 shows an example of a conventional document indexing system, an auxiliary device of a search engine used to index large volumes of documents.

Referring to FIG. 1, the conventional document indexing system includes a document DB 10, a keyword DB 20, an indexing unit 30, and an index DB 40.

The document DB 10 stores and manages the documents to be indexed, and the keyword DB 20 stores and manages the keyword information that serves as the basis for indexing them. The indexing unit 30 analyzes the documents stored in the document DB 10 using the keywords stored in the keyword DB 20 and generates, for each keyword, a list of the documents containing it, that is, a keyword-specific document list. The index DB 40 stores and manages the keyword-specific document list. The resulting keyword-specific document list is shown in FIG. 2.
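The conventional keyword-specific document list described above is essentially an inverted index. The following is a minimal sketch of that idea, not the patent's implementation: the document identifiers, the sample texts, the whitespace tokenization, and the function name `build_keyword_index` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_keyword_index(documents, keywords):
    """Map each keyword to the list of document ids whose text contains it."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        words = set(text.split())  # assumption: whitespace tokenization
        for keyword in keywords:
            if keyword in words:
                index[keyword].append(doc_id)
    return dict(index)


# Hypothetical stand-ins for the document DB and the keyword DB.
documents = {
    "doc1": "A B C",
    "doc2": "B D",
    "doc5": "A C",
}
keyword_index = build_keyword_index(documents, ["A", "B"])
print(keyword_index)
```

Every document containing a keyword lands in that keyword's list, which is the behavior the next paragraph criticizes for producing too many results.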

Because such a keyword-specific document list provides a list of every document containing a given keyword, a document search in a system that offers a vast number of documents, such as the Internet, returns an excessive number of results, which makes it harder for the user to pick out the desired information. In addition, when many Internet users request searches at the same time, search time and response time grow, so search efficiency is low, and the ambiguity of the keyword itself prevents the user from accurately finding the desired document.

Accordingly, the present invention has been devised to solve the conventional problems described above. An object of the present invention is to provide a document indexing system and method that generate index information consisting of a specific keyword and a list of the URLs of documents containing that keyword, distribute the index information to a plurality of computers connected through a communication network, and have each of those computers perform concept-specific indexing based on its index information, so that faster and more accurate search results can be obtained.

FIG. 1 is an exemplary diagram of a conventional document indexing system;

FIG. 2 is an exemplary diagram of a keyword-specific document list generated as a result of conventional indexing;

FIG. 3 is a schematic block diagram of a document indexing system according to an embodiment of the present invention;

FIG. 4 is an exemplary diagram of a per-document concept list generated in a document indexing process according to an embodiment of the present invention;

FIG. 5 is an exemplary diagram of a concept-specific document list generated in a document indexing process according to an embodiment of the present invention;

FIG. 6 is an exemplary diagram of a database structure for managing concept-specific document lists according to an embodiment of the present invention;

FIGS. 7 to 9 are exemplary diagrams for explaining a process of extracting concepts from a document and counting the document score for each concept according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a procedure for processing document indexing according to an embodiment of the present invention;

FIG. 11 is a flowchart of the processing performed by a host computer according to an embodiment of the present invention;

FIG. 12 is a flowchart of the processing performed by a guest computer according to an embodiment of the present invention.

♣ Explanation of reference numerals for the main parts of the drawings ♣

100: host computer
110: document DB
120: keyword DB
130: keyword-specific indexing unit
140: keyword-specific index DB
150: index information generation unit
160, 210: I/F unit
170: concept-specific index management unit
180: concept-specific index DB
200: guest computer
220: document retrieval unit
230: concept extraction unit
240: concept-specific indexing unit

To achieve the above object, the document indexing system provided by the present invention includes: a document database unit that stores and manages the documents to be indexed; a keyword database unit that stores and manages keywords, which are the main information for indexing documents; a keyword-specific indexing unit that indexes the documents stored in the document database unit using the keywords stored in the keyword database unit and generates a keyword-specific document list; a keyword-specific index database unit that stores the keyword-specific document list generated by the keyword-specific indexing unit; an index information generation unit that, based on the keyword-specific document list and the contents of the document database unit, generates index information consisting of a keyword and a predetermined number of addresses (URLs) of documents containing that keyword; an interface unit that distributes the index information to a plurality of auxiliary indexing processes, controls those processes so that each retrieves the documents to be indexed over the communication network using the document addresses (URLs) in its index information and generates a concept-specific document list based on those documents and the keyword in the index information, and receives a concept-specific document list from each of the plurality of auxiliary indexing processes; a concept-specific index management unit that manages, in an integrated manner, the concept-specific document lists transmitted through the interface unit; and a concept-specific index database unit that stores the concept-specific document lists delivered through the concept-specific index management unit.

Meanwhile, to achieve the above object, the document indexing method provided by the present invention includes: a first process of performing keyword-specific indexing on the documents to be indexed to generate a keyword-specific document list and, based on the keyword-specific document list and the document information, generating index information consisting of a keyword and a predetermined number of addresses (URLs) of documents containing that keyword; a second process of distributing the index information to a plurality of auxiliary indexing processes; a third process of having each of the plurality of auxiliary indexing processes generate a concept-specific document list based on the index information distributed to it; and a fourth process of reorganizing the plurality of concept-specific document lists into a single integrated concept-specific document list and managing that integrated list as a whole.

Hereinafter, the document indexing system and method according to the present invention will be described in more detail with reference to the accompanying drawings.

FIG. 3 is a schematic block diagram of a document indexing system according to an embodiment of the present invention. Referring to FIG. 3, the document indexing system of the present invention includes a host computer 100 and guest computers 200.

The host computer 100 generates index information consisting of a specific keyword and a list of the URLs of documents containing that keyword, distributes the index information to a plurality of computers connected through a communication network, and receives, stores, and manages the concept-specific document lists returned by those computers.

To this end, the host computer 100 includes a document DB 110, a keyword DB 120, a keyword-specific indexing unit 130, a keyword-specific index DB 140, an index information generation unit 150, an I/F unit 160, a concept-specific index management unit 170, and a concept-specific index DB 180.

The document DB 110 stores and manages the documents to be indexed, and the keyword DB 120 stores and manages the keywords that serve as the main information for indexing documents. The keywords stored in the keyword DB 120 may be preset values or may be extracted from the documents stored in the document DB 110.

The keyword-specific indexing unit 130 indexes the documents stored in the document DB 110 using the keywords stored in the keyword DB 120, generates a keyword-specific document list, and stores it in the keyword-specific index DB 140. The keyword-specific document list generated here is the same as the conventional indexing result shown in FIG. 2.

Based on the keyword-specific document list and the contents of the document DB 110, the index information generation unit 150 generates index information consisting of a keyword and a predetermined number of addresses (URLs) of documents containing that keyword.
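One way to picture the index information is as small work units, each pairing one keyword with at most a fixed number of URLs. The sketch below is an illustration under assumptions: the text does not say how the "predetermined number" is chosen or how URLs are grouped, so the fixed-size chunking, the dictionary layout, the function name `make_index_info`, and the example.com URLs are all hypothetical.

```python
def make_index_info(keyword_urls, urls_per_unit):
    """Split each keyword's URL list into units of at most `urls_per_unit` URLs."""
    units = []
    for keyword, urls in keyword_urls.items():
        for start in range(0, len(urls), urls_per_unit):
            units.append({
                "keyword": keyword,
                "urls": urls[start:start + urls_per_unit],
            })
    return units


# Hypothetical keyword-specific document list, already resolved to URLs.
keyword_urls = {
    "A": [
        "http://example.com/doc1",
        "http://example.com/doc5",
        "http://example.com/doc7",
    ],
}
index_info = make_index_info(keyword_urls, urls_per_unit=2)
for unit in index_info:
    print(unit)
```

Each unit can then be handed to a different guest computer, so the first guest in this example would receive keyword 'A' with the URLs of two documents.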

The I/F unit 160 transmits the index information generated by the index information generation unit 150 to the guest computers 200, receives the concept-specific document lists transmitted from the guest computers 200, and passes them to the concept-specific index management unit 170. In doing so, the I/F unit 160 distributes the index information to the guest computers 200, which act as a plurality of auxiliary indexing processes, and controls them so that each guest computer 200 retrieves the documents to be indexed over the communication network using the document addresses (URLs) in its index information and then generates a concept-specific document list based on those documents and the keyword in the index information.

The concept-specific index management unit 170 manages, in an integrated manner, the concept-specific document lists transmitted from the plurality of guest computers 200 through the I/F unit 160. That is, it merges the concept-specific document lists received from the guest computers 200 into an integrated concept-specific document list and then sorts that list by the concept-specific document score, which is determined by the number of times the concept appears in each document.

The concept-specific index DB 180 stores and manages the concept-specific document lists sorted in this way by concept-specific document score.

Meanwhile, the guest computer 200 is connected to the host computer 100 through a communication network. It performs concept-specific indexing based on the index information received from the host computer 100 and returns the resulting concept-specific document list to the host computer 100. To this end, the guest computer 200 includes an I/F unit 210, a document retrieval unit 220, a concept extraction unit 230, and a concept-specific indexing unit 240.

The I/F unit 210 receives from the host computer 100 the index information, which consists of a keyword and a predetermined number of documents containing that keyword, and transmits to the host computer 100 the concept-specific document list generated within the guest computer 200.

The document retrieval unit 220 retrieves each document to be indexed over the communication network using the document address (URL) in the index information and converts the document into text format.

Within the document converted into text format by the document retrieval unit 220, the concept extraction unit 230 extracts concepts, each of which is a combination of the keyword in the index information and a word located within a certain distance of that keyword, and generates a per-document concept list.
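The concept extraction just described can be sketched as a sliding window around each occurrence of the keyword. This is a minimal illustration, not the patented method: it assumes whitespace tokenization, measures distance in whole words on both sides of the keyword, skips the keyword itself, and counts a word once for every keyword occurrence whose window covers it. The function name `extract_concepts` and the sample tokens are hypothetical.

```python
from collections import Counter


def extract_concepts(tokens, keyword, distance):
    """Count (keyword, word) pairs for words within `distance` words of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        lo = max(0, i - distance)
        hi = min(len(tokens), i + distance + 1)
        for j in range(lo, hi):
            # Skip the keyword occurrence itself and any repeat of the keyword.
            if j != i and tokens[j] != keyword:
                counts[(keyword, tokens[j])] += 1
    return counts


tokens = ["B", "A", "C", "D", "E", "F", "G", "A", "B"]
concepts = extract_concepts(tokens, keyword="A", distance=2)
print(concepts.most_common())
```

With a distance of 2, the word 'E' is outside both windows and produces no concept, while 'B' falls inside both windows and is counted twice.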

The concept-specific indexing unit 240 reorganizes the per-document concept list generated by the concept extraction unit 230 by concept to generate a concept-specific document list, and passes that list to the I/F unit 210.

FIG. 4 shows an example of the per-document concept list generated in the document indexing process described above. Specifically, it shows the case in which the host computer sends the guest computer the URLs of 'Document 1' and 'Document 5', taken from the documents containing keyword 'A' in the keyword-specific document list of FIG. 2, together with keyword 'A'.

In this example, the guest computer accesses the URLs sent by the host computer, retrieves the corresponding documents, and converts them into text documents. Within each document, it then extracts the words located within a certain distance of keyword 'A' and combines those words with keyword 'A' to form concepts. The per-document concept list shown in FIG. 4 is the result of this series of steps.

Referring to FIG. 4, the words extracted from all of the strings that were taken from 'Document 1' and transferred to the guest computer are 'B', 'C', 'D', and 'Z', and the concepts formed by combining them with keyword 'A' are 'AB', 'AC', 'AD', and 'AZ'.

The number of times each concept occurs in a document is set as that document's score for the concept, as shown in the 'number of occurrences' column of FIG. 4. For example, the score of 'Document 1' for concept 'AB' is '6', and the score of 'Document 1' for concept 'AC' is '4'. The score of each document for each concept is determined in this way.

FIG. 5 shows an example of the concept-specific document list obtained by reorganizing the per-document concept list above by concept. When the per-document concept list of FIG. 4 is reorganized by concept, the only document containing concept 'AB' is 'Document 1', whose score for 'AB' is '6'. The documents containing concept 'AC' are 'Document 5' and 'Document 1', whose scores for 'AC' are '5' and '4', respectively. As a rule, each concept-specific document list is sorted by these scores. For concept 'AC', accordingly, 'Document 5' and 'Document 1' are listed in descending order of score.
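The reorganization from a per-document concept list (FIG. 4) into a concept-specific document list (FIG. 5) can be sketched as a simple inversion followed by a descending sort. The sketch below uses only the scores stated in the text ('AB' and 'AC'); the counts for 'AD' and 'AZ' are not given, so they are left out. The dictionary layout and the function name `invert_concept_lists` are assumptions made for illustration.

```python
from collections import defaultdict


def invert_concept_lists(concepts_by_document):
    """Turn {document: {concept: score}} into {concept: [(document, score), ...]}."""
    by_concept = defaultdict(list)
    for doc_id, concept_scores in concepts_by_document.items():
        for concept, score in concept_scores.items():
            by_concept[concept].append((doc_id, score))
    # Sort each concept's documents in descending order of score.
    for entries in by_concept.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(by_concept)


# Scores taken from the description of FIG. 4 and FIG. 5.
concepts_by_document = {
    "Document 1": {"AB": 6, "AC": 4},
    "Document 5": {"AC": 5},
}
documents_by_concept = invert_concept_lists(concepts_by_document)
print(documents_by_concept)
```

For concept 'AC' this yields 'Document 5' (score 5) ahead of 'Document 1' (score 4), matching the ordering described for FIG. 5.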

FIG. 6 is an exemplary diagram of a database structure for managing, in a single integrated list, partial concept-specific document lists such as the one in FIG. 5. In other words, it illustrates a database structure with which the host computer merges and manages the concept-specific document lists transmitted from each of the plurality of guest computers.

Referring to FIG. 6, the documents containing each concept are registered in the database sorted by their scores for that concept. The left document information column (before update) shows the data before the document information is updated, and the right document information column (after update) shows the data after the document information has been updated with the concept-specific document list of FIG. 5.

For concept 'AB', the concept-specific document list of FIG. 5 raises the score of 'Document 1' by '6', from '15' to '21'. For concept 'AC', the concept-specific document list of FIG. 5 changes the scores of both 'Document 1' and 'Document 5', and as a result 'Document 1', 'Document 5', and 'Document 9' are re-sorted according to their new scores.
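The update of the host database described for FIG. 6 can be sketched as adding each incoming score to the stored score and re-sorting. Only the 'AB' figures (15 plus 6 giving 21) come from the text. The prior 'AC' scores below are hypothetical, since the text does not state them, and the function name `merge_concept_lists` and the list-of-tuples layout are likewise assumptions.

```python
def merge_concept_lists(database, update):
    """Add incoming scores to the stored ones and re-sort each concept's list."""
    for concept, entries in update.items():
        scores = dict(database.get(concept, []))
        for doc_id, score in entries:
            scores[doc_id] = scores.get(doc_id, 0) + score
        database[concept] = sorted(
            scores.items(), key=lambda item: item[1], reverse=True
        )
    return database


database = {
    "AB": [("Document 1", 15)],  # from the text
    # Hypothetical prior scores for 'AC'; the text does not give them.
    "AC": [("Document 9", 7), ("Document 1", 5), ("Document 5", 1)],
}
update = {  # the concept-specific document list of FIG. 5
    "AB": [("Document 1", 6)],
    "AC": [("Document 5", 5), ("Document 1", 4)],
}
merge_concept_lists(database, update)
print(database)
```

With these assumed starting values, 'Document 1' moves to the top of the 'AC' list after the update, which illustrates how a merge can change the ordering as well as the scores.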

FIGS. 7 to 9 are exemplary diagrams for explaining a process of extracting concepts from a document and counting the document score for each concept according to an embodiment of the present invention.

FIG. 7 shows an example of the concept extraction range used to extract concepts from a document. Specifically, it illustrates the case in which the keyword is 'information' and concepts are formed by combining the keyword with words located within 5 words of it.

Referring to FIG. 7, the keyword 'information' appears four times in the document shown. A concept extraction region is set around each occurrence ('information¹', 'information²', 'information³', and 'information⁴'). When each region is limited to within 5 words of the keyword, the document is divided into the four regions 'A', 'B', 'C', and 'D' shown in FIG. 7.

FIG. 8 shows an example of a database structure holding the list of concepts formed by combining every word extracted from the four strings with the keyword 'information', together with the number of times each concept occurs within its string.

FIG. 9 shows the concepts generated from the individual strings, and their occurrence counts, combined into a single list. As a rule, the concepts are sorted in descending order of occurrence count, and FIG. 9 shows them in that sorted state.

FIG. 10 is a schematic diagram of the procedure for processing document indexing according to an embodiment of the present invention. Referring to FIG. 10, the host computer 100 first generates index information from the basic document and keyword information (s10). Here the host computer 100 generates index information that includes a specific keyword and the URLs pointing to the locations of a predetermined number of documents containing that keyword. It then distributes the index information to the guest computers 200 (s20).

The guest computers 200 then generate concept-specific document lists based on the index information they receive, sort those lists (s30), and transmit them to the host computer 100 (s40).

The host computer 100 re-sorts the concept-specific document lists received from the guest computers 200 and manages them in an integrated manner (s50).
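The overall flow of steps s20 to s50 can be sketched end to end in a single program. This is only an illustration under several assumptions: worker threads from `concurrent.futures` stand in for networked guest computers, the in-memory `PAGES` dictionary stands in for documents fetched by URL, concepts are written as the keyword followed by the neighboring word (as in 'AB'), and the names `guest` and `host` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for documents reachable over the network, keyed by URL.
PAGES = {
    "http://example.com/1": "A B C A B",
    "http://example.com/5": "A C C A",
}


def guest(index_info, distance=1):
    """s30: build {concept: [(url, score), ...]} for one unit of index information."""
    keyword = index_info["keyword"]
    result = {}
    for url in index_info["urls"]:
        tokens = PAGES[url].split()
        counts = Counter()
        for i, token in enumerate(tokens):
            if token == keyword:
                for neighbor in tokens[max(0, i - distance):i + distance + 1]:
                    if neighbor != keyword:
                        counts[keyword + neighbor] += 1
        for concept, score in counts.items():
            result.setdefault(concept, []).append((url, score))
    return result


def host(index_infos):
    """s20, s40, s50: distribute units, collect partial lists, merge and sort."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        partial_lists = list(pool.map(guest, index_infos))
    merged = {}
    for partial in partial_lists:
        for concept, entries in partial.items():
            merged.setdefault(concept, []).extend(entries)
    for entries in merged.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return merged


index_infos = [
    {"keyword": "A", "urls": ["http://example.com/1"]},
    {"keyword": "A", "urls": ["http://example.com/5"]},
]
concept_index = host(index_infos)
print(concept_index)
```

Because each guest works only on the URLs in its own unit, the host's remaining work is limited to merging and sorting the returned lists.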

In other words, concept-specific document indexing according to the present invention is carried out in four broad processes: a first process in which the host computer performs keyword-specific indexing on the documents to be indexed to generate a keyword-specific document list and, based on the keyword-specific document list and the document information, generates index information consisting of a keyword and a predetermined number of addresses (URLs) of documents containing that keyword; a second process of distributing the index information to a plurality of auxiliary indexing processes; a third process in which each of the plurality of auxiliary indexing processes, that is, each guest computer, generates a concept-specific document list based on the index information distributed to it; and a fourth process in which the host computer reorganizes the plurality of concept-specific document lists into a single integrated concept-specific document list and manages that integrated list as a whole.

FIG. 11 is a flowchart of the processing performed by the host computer according to an embodiment of the present invention. The host computer performs the first, second, and fourth processes described above. This processing is described below with reference to FIG. 11.

First, the host computer performs keyword-specific indexing on the documents to be indexed to generate a keyword-specific document list (s110), and generates index information based on the keyword-specific document list and the document information (s120). That is, for each keyword it selects documents from that keyword's document list and generates index information consisting of the keyword and the URLs pointing to the locations of a predetermined number of documents containing it. It then distributes the index information to a plurality of guest computers (s130) and waits until it receives a concept-specific document list from each guest computer (s140).

When a concept-specific document list is received from a guest computer, the host computer merges it with the concept-specific document list that it has already generated and has been managing (s150), and then manages the integrated concept-specific document list as a whole (s160). In doing so, the host computer sorts the integrated concept-specific document list by the concept-specific document score, which is determined by the number of times the concept appears in each document.

FIG. 12 is a flowchart of the processing performed by the guest computer according to an embodiment of the present invention. The guest computer performs the third of the four processes of the present invention. This processing is described below with reference to FIG. 12. When the guest computer receives index information from the host computer (s205), it first looks up, over the communication network, the document to be indexed using a URL in the index information (s210) and, once the document has been found, converts it into text format (s215).

Next, the guest computer searches the document for the keyword included in the index information (s220) and extracts a concept, which is a combination of the keyword and a word located within a certain distance of it (s225). After repeating these steps (s220, s225) until the end of the document (s230), it reorganizes the document's concept list by concept to generate a concept-specific document list (s235). The word that forms a concept with the keyword is preferably selected from words located within 5 to 50 words of the keyword.
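Concept extraction (s220 to s230) can be sketched in Python as a sliding window around each keyword occurrence. The tokenization by `\w+`, the symmetric window (words both before and after the keyword), and the representation of a concept as a `(keyword, word)` tuple are assumptions; the description only states that the paired word lies within a certain distance, preferably 5 to 50 words.

```python
import re
from collections import Counter


def extract_concepts(text, keyword, window=5):
    """Count (keyword, nearby word) pairs over every keyword occurrence in text."""
    words = re.findall(r"\w+", text.lower())
    keyword = keyword.lower()
    counts = Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue  # s220: look for the next occurrence of the keyword
        start = max(0, i - window)
        neighbours = words[start:i] + words[i + 1:i + 1 + window]
        for other in neighbours:
            if other != keyword:
                counts[(keyword, other)] += 1  # s225: record one concept hit
    return counts


text = "Apple pie is a classic dessert. Many bakers prefer apple pie with cinnamon."
counts = extract_concepts(text, "apple", window=2)
```

With `window=2`, the pair `("apple", "pie")` is counted twice, while `"cinnamon"` falls outside the window and is not paired with the keyword. The resulting counts can serve directly as the per-document occurrence numbers used for scoring.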

It then checks whether there is a next URL to be indexed (s240); if a next document exists, it selects that document (s245) and repeats the series of steps (s210 to s240). If the check (s240) finds no more documents to index, the guest computer updates the concept-specific document lists (s250). At this time, it sets the score of each document for each concept according to the number of times the concept appears in that document, and reorganizes the concept-specific document list according to those scores. It then transmits the concept-specific document list to the host computer (s255).
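The guest-side loop over URLs and the final scoring (s240 to s250) could be arranged as in this Python sketch. The `fetch_text` callable stands in for retrieval and text conversion and is injected so that the example runs without a network; it, the record layout, the whitespace tokenization, and the tie-breaking by URL are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def concepts_in(text, keyword, window):
    """Count (keyword, nearby word) pairs in one document."""
    words = text.lower().split()
    found = Counter()
    for i, word in enumerate(words):
        if word == keyword:
            for other in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
                if other != keyword:
                    found[(keyword, other)] += 1
    return found


def index_on_guest(record, fetch_text, window=5):
    """Process every URL in one index-information record and rank documents per concept."""
    per_concept = defaultdict(dict)
    keyword = record["keyword"].lower()
    for url in record["urls"]:  # s240/s245: move on to the next URL
        text = fetch_text(url)  # stands in for s210/s215
        if text is None:
            continue  # document not found: skip it
        for concept, count in concepts_in(text, keyword, window).items():
            per_concept[concept][url] = count  # score = occurrence count
    # s250: reorder each concept's documents by descending score, then URL
    return {
        concept: sorted(docs.items(), key=lambda kv: (-kv[1], kv[0]))
        for concept, docs in per_concept.items()
    }


# Hypothetical in-memory "network": URL -> already converted text.
pages = {"u1": "apple pie and apple pie again", "u2": "apple pie"}
record = {"keyword": "apple", "urls": ["u1", "u2", "missing"]}
result = index_on_guest(record, pages.get, window=1)
```

The returned mapping is what the guest would send back to the host in s255.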

Here, the guest computers refer to a plurality of computers connected to the host computer through a communication network; however, when the whole series of processes is performed on a single computer, the term may also refer to a plurality of index assistance processes running within that computer.

The above description covers only one embodiment; the present invention is not limited to this embodiment and may be modified in various ways within the scope of the appended claims. For example, the shape and structure of each component specifically shown in the embodiment may be modified.

As described above, the document indexing system and method according to the present invention index a large volume of documents by concept, where a concept is a combination of a keyword and a word within a certain distance of that keyword. This efficiently reduces the load on the search engine at search time, which in turn improves search speed, and yields more accurate search results.

In addition, the document indexing process is distributed across a plurality of computers connected through a communication network, and the indexing results are then integrated and managed on a single host computer. Because the host computer sends the guest side only a keyword and the list of URLs of documents containing it, rather than the entire documents, the communication load between the host computer and the guest computers is minimized and the load on the document indexing system itself is significantly reduced.

Claims

A document indexing system that analyzes information and stores it for each index term, the system comprising:

a document database unit for storing and managing documents to be indexed;

a keyword database unit for storing and managing keywords, which are the key information for indexing the documents;

a keyword index unit for generating a document list for each keyword by indexing the documents stored in the document database unit with the keywords stored in the keyword database unit;

a keyword-specific index database unit for storing the document list for each keyword generated by the keyword index unit;

an index information generation unit for generating index information consisting of a keyword and a predetermined number of document addresses (URLs) of documents containing the keyword, based on the document list for each keyword and the contents of the document database unit;

an interface unit that distributes the index information to a plurality of index assistance processes, controls the index assistance processes to retrieve the documents to be indexed over the communication network using the document addresses (URLs) included in the index information and to generate concept-specific document lists based on those documents and the keyword included in the index information, and receives a concept-specific document list from each of the plurality of index assistance processes;

a concept-specific index management unit that integrates and manages the concept-specific document lists delivered through the interface unit; and

a concept-specific index database unit for storing the concept-specific document lists delivered through the concept-specific index management unit,

wherein the interface unit

controls the plurality of index assistance processes to convert the documents retrieved from the communication network into text format, extract from each document concepts that are combinations of the keyword included in the index information and words within a certain distance of it, and generate concept-specific document lists from the extracted concepts.

(Deleted)

The document indexing system according to claim 1, wherein the concept-specific index management unit

generates an integrated concept-specific document list by integrating the plurality of concept-specific document lists delivered through the interface unit, and then sorts the integrated list according to a per-concept document score determined by the number of times the concept appears in each document.

A document indexing method for analyzing information and storing it for each index term, the method comprising:

a first process of generating a document list for each keyword by indexing the documents to be indexed by keyword, and generating, based on the keyword-specific document lists and document information, index information consisting of a keyword and a predetermined number of document addresses (URLs) of documents containing the keyword;

a second process of distributing the index information to a plurality of index assistance processes;

a third process of generating, in the plurality of index assistance processes, concept-specific document lists based on the distributed index information; and

a fourth process of reorganizing the plurality of concept-specific document lists into a single integrated concept-specific document list and managing that integrated list as a whole,

wherein the plurality of index assistance processes

are index assistance processes included respectively in a plurality of computers connected to a communication network, or a plurality of index assistance processes included in a single computer.

The method of claim 4, wherein the third process comprises:

step 3-1, in which the plurality of index assistance processes retrieve the documents to be indexed over a communication network using the document addresses (URLs) transmitted in the first process;

step 3-2 of converting each retrieved document into text format;

step 3-3 of extracting, from the converted document, concepts that are combinations of the keyword included in the index information and words within a predetermined distance of it, and then generating a concept list for each document; and

step 3-4 of reorganizing the per-document concept lists by concept to generate a document list for each concept,

wherein step 3-3

extracts, as a concept, a combination of the keyword and a word located within 5 to 50 words of it.

(Deleted)

The method of claim 4, wherein the fourth process comprises:

step 4-1 of generating an integrated concept-specific document list by integrating the plurality of concept-specific document lists; and

step 4-2 of sorting the integrated concept-specific document list according to a per-concept document score determined by the number of times the concept appears in each document.

(Deleted)