KR20000054268A

KR20000054268A - Method and system for document classification and search using document auto-summary system

Info

Publication number: KR20000054268A
Application number: KR1020000029370A
Authority: KR
Inventors: 전상훈
Original assignee: 전상훈; 주식회사 엔아이비소프트
Priority date: 2000-05-30
Filing date: 2000-05-30
Publication date: 2000-09-05
Also published as: KR100356105B1

Abstract

PURPOSE: A system for classifying and searching documents is provided that takes a specific document as the search key and finds documents similar to it. CONSTITUTION: A document classification and search system (120) generates summary data for each search target document in advance, in preparation for search requests from users. The summary data is stored in a keyword information database (230) and a topic sentence information database (240), which are managed by a summary data management module (220). An automatic document summarizing module (210) works together with the summary data management module to generate the summary data for the search target documents. In addition, the text portion of each search target document is separated out, and the full text of the document is stored in a search target document text database (260).

Description

Method and system for document classification and search using document auto-summary system

The present invention relates to a document classification and retrieval method and a document classification and retrieval system, and in particular to a method and system that use a database built by automatic summarization to search for, classify, and retrieve documents similar to a specific document, taking the document itself as the query.

Existing document classification and retrieval systems are, in common, search systems based on search terms (keywords). When web documents are searched by keyword in this way, the user can look for information simply by entering a short search term, but the number of results is excessive and the user has to check the results one by one to find the documents actually needed, so a great deal of time must be spent reviewing search results. In other words, because the results are not based on the overall content of each document, the user cannot be sure that a retrieved document contains the desired information. This makes it hard to meet users' demand to find the information they need easily and accurately, and users are often dissatisfied with the time and effort required, even though saving both is the essential point of search.

One approach to this problem is to present, in the search results, the sentences that contain the search term. This lets the user grasp the content of a web document to some extent. However, because the summary is assembled only from the sentences containing the search term, the content lacks coherence, the overall content of the document cannot be understood from the combined sentences alone, and even if a sentence contains the search term, the document as a whole may not be what the user is looking for.

Therefore, a document classification and retrieval system that searches for documents similar to a particular document itself, classifies them by relevance, and presents the results would be very useful.

Meanwhile, various automatic summarization systems have been developed as a way of making the content of large volumes of documents easy to grasp. Put simply, an automatic document summarization system "reduces the content of a document to a given size": it is a document content compression system that omits the unimportant or trivial parts of a given document and coherently extracts and collects its core content.

In a typical automatic summarization system, summarization begins with a parsing step that reads the content of the document and divides it into the units of analysis used for summarization. The system treats a document as a set of paragraphs and a sentence, in turn, as a set of words; the word is the lowest-level element of the system and also serves as the keyword. The second step is to build the keyword information of the document: frequency information is collected for words, the lowest-level elements of the document, and the keyword information is built from it. Once the keyword information has been built, a weight is calculated for each sentence in order to select the topic sentences. The sentence weight is calculated in two stages: first, each sentence is scored according to the frequency with which keywords appear in it; then it is scored according to the length of the sentence and the length of the keywords it contains. Once the weight of each sentence has been calculated, sentences are selected in descending order of weight to produce a summary of the specified length.
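The steps described above (parsing, keyword frequency collection, two-stage sentence weighting, selection) can be sketched roughly as follows. This is only an illustrative sketch, not the summarizer of the invention: the function name `summarize`, the regular-expression tokenizer, the absence of stop-word handling, and the specific way the frequency score and the length score are combined are all assumptions made for the example.

```python
import re
from collections import Counter


def summarize(text, num_keywords=5, num_sentences=2):
    """Return (keywords, summary_sentences) for a plain-text document."""
    # Parsing step: split the text into sentences, then each sentence into words.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

    # Keyword information: the most frequent words and their frequencies.
    freq = Counter(word for words in tokenized for word in words)
    keywords = dict(freq.most_common(num_keywords))

    weights = []
    for words in tokenized:
        hits = [w for w in words if w in keywords]
        # Stage 1: score by the frequency of the keywords found in the sentence.
        freq_score = sum(keywords[w] for w in hits)
        # Stage 2 (assumed form): keyword characters relative to sentence length.
        total_chars = sum(len(w) for w in words)
        length_score = sum(len(w) for w in hits) / max(1, total_chars)
        weights.append(freq_score + length_score)

    # Select sentences in descending order of weight.
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return keywords, [sentences[i] for i in ranked[:num_sentences]]
```

For example, `summarize("Cats chase mice. Cats sleep. Dogs bark.", num_keywords=1, num_sentences=1)` keeps `cats` as the only keyword and picks the shorter of the two sentences that contain it, because the keyword takes up a larger share of that sentence.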

Therefore, if such an automatic summarization system is put to appropriate use, it can be useful for retrieving documents that contain the information a user wants.

The present invention has been made in view of these points, and an object of the present invention is to provide a document classification and retrieval system that can find documents similar to a specific document by using that document as the search key.

Another object of the present invention is to provide a document classification and retrieval system that can find documents with the desired content without the user having to check the content of each individual document in the search results.

Still another object of the present invention is to provide a document classification and retrieval system that allows the user to grasp the content of retrieved documents easily and quickly.

FIG. 1 is a schematic diagram showing the process of searching for documents using the document classification and retrieval system of the present invention.

FIG. 2 is a schematic block diagram showing an example of the internal configuration of the document classification and retrieval system of the present invention.

FIG. 3 shows the structure of the data stored in the keyword information database of the document classification and retrieval system of the present invention.

FIGS. 4A and 4B show, respectively, the structure of the data stored in the topic sentence information database of the document classification and retrieval system of the present invention and the data structure of the sentence information field in each record of the topic sentence information database.

FIG. 5 is a flowchart showing the document classification and retrieval process of the present invention.

FIG. 6 is a flowchart showing the detailed flow of the key document input process.

FIG. 7 is an example of the input screen for the key document to be searched.

FIGS. 8 and 9 are flowcharts showing the detailed flow of the per-keyword weight calculation process and the per-topic-sentence weight calculation process, respectively.

FIGS. 10 and 11 are examples of screen layouts for displaying search results to the user.

FIG. 12 shows an example of an actual document search performed with the document classification and retrieval system of the present invention.

To achieve these objects, the document classification and retrieval method of the present invention classifies and retrieves search target documents on the basis of their similarity to an input document, and comprises: a first step of inputting a search key document; a second step of generating keyword information for the search key document; a third step of assigning a per-keyword weight to the search target documents by using a keyword information database; a fourth step of assigning a per-topic-sentence weight to the search target documents by using a topic sentence information database; and a fifth step of classifying the search target documents in descending order of total weight, the total weight being the sum of the per-keyword weight and the per-topic-sentence weight. The keyword information database contains, for each keyword included in the search target documents, a record comprising a keyword field holding the keyword and one or more document identifier fields holding the document identifiers of the search target documents that contain the keyword. The topic sentence information database contains, for each search target document, a record comprising a document identifier field holding the document identifier of that document and one or more sentence information fields holding information on the sentences that make up the document. Each sentence information field comprises sentence number, sentence position, sentence length, and sentence weight subfields, which hold respectively the number of the sentence, its position within the document, its length, and a sentence weight determined by the proportion of keywords the sentence contains, together with one or more keyword identifier subfields holding the identifiers of the keywords contained in the sentence.
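The fifth step, ordering documents by the sum of the two weights, can be illustrated with a short sketch. The function name `rank_documents`, the dictionary inputs keyed by document identifier, and the tie-breaking by identifier are assumptions made for this example; the text only specifies that the total weight is the sum of the per-keyword and per-topic-sentence weights and that documents are ordered from high to low.

```python
def rank_documents(keyword_weights, sentence_weights):
    """Order documents by total weight (keyword weight + topic sentence weight).

    Both arguments map a document identifier to a numeric weight; a document
    missing from one mapping is treated as having weight 0 there (assumption).
    """
    doc_ids = keyword_weights.keys() | sentence_weights.keys()
    totals = {
        doc_id: keyword_weights.get(doc_id, 0) + sentence_weights.get(doc_id, 0)
        for doc_id in doc_ids
    }
    # Highest total first; ties are broken by document identifier for stability.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Usage: `rank_documents({"d1": 10, "d2": 40}, {"d1": 5, "d2": 1, "d3": 20})` lists `d2` first, then `d3`, then `d1`.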

Here, in the first step of inputting the search key document, the user may enter the content of the search key document directly, specify the file name of the search key document, or specify the Internet address at which the search key document is located.

The keyword information of the search key document includes topic words and topic word weights based on the frequency of occurrence of each topic word. The third step of assigning per-keyword weights may include a step 3-1 of using the keyword information database to assign a weight to search target documents that have keywords matching the topic words of the search key document, and a step 3-2 of assigning to the search target documents a weight according to the topic word weights of the key document. The fourth step of assigning per-topic-sentence weights may include a step 4-1 of extracting the topic sentences of each search target document by using the topic sentence information database; a step 4-2 of extracting, by using the topic sentence information database, the keywords of the extracted topic sentences of each search target document; a step 4-3 of assigning a weight to search target documents whose topic sentences contain keywords matching the topic words of the search key document; a step 4-4 of assigning to the search target documents a weight according to the topic word weights of the search key document; and a step 4-5 of calculating the share of topic words in the topic sentences of each search target document and assigning a weight accordingly.

The method may further include a sixth step of displaying the classified documents to the user. In that case, topic sentence information of the search key document is also generated in the second step, and in the sixth step the document information of the search key document is displayed together with the document information of the classified documents. The document information of the search key document includes its keyword information and topic sentence information, and the document information of each classified document includes the keyword information and topic sentence information extracted for it by using the keyword information database and the topic sentence information database. The classified documents may also be displayed in groups: documents that contain the same set of topic words of the search key document are grouped together, and within each group the documents are listed in descending order of total weight.

The document classification and retrieval system according to the present invention comprises: a keyword information database; a topic sentence information database; keyword information extraction means capable of extracting keyword information from an input document; and search means that receives the extracted keyword information as input and can search for the presence of character strings identical to the input keywords. The keyword information database contains, for each keyword included in the search target documents, a record comprising a keyword field holding the keyword and one or more document identifier fields holding the document identifiers of the search target documents that contain the keyword. The topic sentence information database contains, for each search target document, a record comprising a document identifier field holding the document identifier of that document and one or more sentence information fields holding information on the sentences that make up the document. Each sentence information field comprises sentence number, sentence position, sentence length, and sentence weight subfields, which hold respectively the number of the sentence, its position within the document, its length, and a sentence weight determined by the proportion of keywords the sentence contains, together with one or more keyword identifier subfields holding the identifiers of the keywords contained in the sentence. The search means is characterized in that it assigns a per-keyword weight to each search target document by using the keyword information database, assigns a per-topic-sentence weight to the search target documents by using the topic sentence information database, and can classify the search target documents in descending order of total weight, the total weight being the sum of the per-keyword weight and the per-topic-sentence weight.

The system may further include topic sentence information extraction means capable of extracting topic sentence information from the input document, display means capable of displaying the search results to the user, or a search target document text database that stores the content of the search target documents as text data.

Preferred embodiments of the present invention will now be described in detail with reference to the drawings.

First, in preparation for search requests, the document classification and retrieval system (120) generates document summary information for the search target documents (110) and keeps it internally as a database. When a client computer (140) issues a document search request over the Internet (130), the document classification and retrieval system (120) performs the search using the stored document summary information and returns the result to the client computer (140), again over the Internet (130).

Next, the internal configuration of the document classification and retrieval system is described with reference to FIG. 2. FIG. 2 is a schematic block diagram showing an example of the internal configuration of the document classification and retrieval system of the present invention.

The automatic document summarization module (210) takes a document as input and generates document summary information comprising keyword information and topic sentence information. When a user makes a document search request, the user enters the key document to be searched for into the document classification and retrieval system (120), and the automatic document summarization module (210) summarizes the entered key document to generate key document summary information. The generated key document summary information is passed to the document search module (250) for the search.

Meanwhile, in preparation for users' search requests, the document classification and retrieval system (120) of the present invention generates summary information for the documents to be searched and stores it in the keyword information database (230) and the topic sentence information database (240), which are managed by the summary information management module (220). The automatic document summarization module (210) also works together with the summary information management module (220) and generates the summary information for the search target documents at the stage where these databases are built. In addition, the text portion of each search target document is separated out and the full text of the document is stored in the search target document text database (260). As a result, when search results are later displayed to the user, they can be displayed from the content stored in the search target document text database (260) without accessing the search target documents again.

The document search module (250) receives the key document summary information from the automatic document summarization module (210) and, on the basis of that information, searches the keyword information database (230) and the topic sentence information database (240) for documents similar to the key document. This is the core function of the present invention.

The result of the document classification and retrieval performed by the document search module (250) is displayed to the user by the result display module (270), which can also display the content of the search target documents stored in the search target document text database (260).

The structure of the data stored in the keyword information database (230) and the topic sentence information database (240) is shown in FIGS. 3, 4A, and 4B.

As shown in FIG. 3, the keyword information database (230) is organized by the keywords that appear in the search target documents. That is, it is made up of records, one per keyword, each comprising a topic word field whose content is the keyword itself and one or more document ID fields holding the IDs of the documents that contain that keyword. Each field may, as needed, be made up of several subfields containing various kinds of document information.
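The record layout described here, one record per keyword with a list of document IDs, resembles what is commonly called an inverted index. A minimal sketch follows, assuming an in-memory dictionary stands in for the database; the name `build_keyword_index` and the input shape are hypothetical, and the optional subfields mentioned in the text are left out.

```python
from collections import defaultdict


def build_keyword_index(doc_keywords):
    """Build {keyword: [doc_id, ...]} from {doc_id: iterable of keywords}."""
    index = defaultdict(list)
    for doc_id, keywords in doc_keywords.items():
        # De-duplicate so each document ID appears once per keyword record.
        for keyword in sorted(set(keywords)):
            index[keyword].append(doc_id)
    return dict(index)
```

Usage: `build_keyword_index({"d1": ["search", "summary"], "d2": ["search"]})` yields one record for `search` listing both documents and one for `summary` listing only `d1`.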

In contrast, as shown in FIG. 4A, the topic sentence information database (240) is organized by search target document. Each record of the topic sentence information database consists of the document ID field of the corresponding search target document and one or more sentence information fields holding the sentence information of each sentence in the document. The sentence information field is in turn divided into several subfields; its detailed data structure is shown in FIG. 4B. Each sentence information field has sentence number, sentence position, sentence length, and sentence weight subfields, which hold respectively the number of the sentence, its position within the document, its length, and a sentence weight determined by the proportion of keywords the sentence contains, and it also includes one or more keyword ID subfields holding the IDs of the keywords contained in the sentence.
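The per-document record of FIG. 4A and the sentence information field of FIG. 4B could be modeled along these lines. This is a sketch under assumptions: the class and attribute names are invented for illustration, the units of the position and length subfields (here treated as plain integers) are not specified in the text, and `topic_sentences` is a hypothetical helper that simply returns the highest-weighted sentences.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SentenceInfo:
    number: int        # sentence number subfield
    position: int      # position of the sentence within the document
    length: int        # sentence length
    weight: float      # sentence weight based on the share of keywords
    keyword_ids: List[int] = field(default_factory=list)  # keyword ID subfields


@dataclass
class TopicSentenceRecord:
    doc_id: str
    sentences: List[SentenceInfo] = field(default_factory=list)

    def topic_sentences(self, count=1):
        """Return the `count` sentences with the highest weight."""
        return sorted(self.sentences, key=lambda s: s.weight, reverse=True)[:count]
```

Keeping the keyword IDs on each sentence means the keywords of a document's topic sentences can be read from this record alone, without going back to the document text.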

Next, the document search process carried out when a user issues a search request is described with reference to FIG. 5. FIG. 5 is a flowchart showing the document search process of the present invention.

First, the key document to be searched for is entered (S510). The distinguishing feature of the document classification and retrieval system of the present invention is that the user enters a key document rather than keywords, and the system then retrieves documents whose content is similar to that of the key document.

FIG. 6 is a flowchart showing the detailed flow of the key document input process. Referring to FIG. 6, there are three main ways to enter the document to be searched for. The first is to enter the content of the document directly (S610). In this case the document classification and retrieval system provides the user with a text input box, which can be implemented with the HTML Form tag and an <input type='textarea'> tag (in standard HTML, a multi-line text field is written as a <textarea> element).

The second way is to specify a document on the user's computer (S620). In this case the file on the user's computer is sent to the document classification and retrieval system through the web browser (S640). File transfer uses the HTML Form tag and the <input type='file'> tag. The document classification and retrieval system provides an input box in which the name of the file to be searched for can be entered, and may provide a browse function for locating files on the user's system. FIG. 7 shows an example of the key document input screen as it appears in the user's web browser.

The key document input screen may also let the user choose the proportion of keywords and topic sentences. These proportions can be preset in the document classification and retrieval system, but letting the user choose them allows the user to set the level of the document search as desired. Keywords and topic sentences can each be selected by count or size (5 words, 10 words, or 512 bytes, 1 kbyte, and so on) or by ratio (1%, 5%, 10%, and so on).
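Selecting keywords either by a fixed count or by a ratio might look like the sketch below. The function name `select_keywords` is hypothetical, and the interpretation of the ratio as a fraction of the distinct words in the document is an assumption; the text does not say what the percentage is taken of.

```python
import math
from collections import Counter


def select_keywords(freq, count=None, ratio=None):
    """Pick the most frequent words, by fixed count or by ratio (0 < ratio <= 1)."""
    if (count is None) == (ratio is None):
        raise ValueError("specify exactly one of count or ratio")
    if ratio is not None:
        # Assumed meaning: a fraction of the distinct words, at least one word.
        count = max(1, math.ceil(len(freq) * ratio))
    return [word for word, _ in Counter(freq).most_common(count)]
```

Usage: with frequencies `{"a": 5, "b": 3, "c": 1, "d": 1}`, both `count=2` and `ratio=0.5` select `a` and `b`.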

The third way is to specify a document that is not on the user's computer but on another computer reachable over the Internet; to do so, the user enters the Internet address of the document (S630). The document classification and retrieval system provides an address input box in the user's web browser, accesses the given Internet address, and reads the document specified by the user (S650).

After the content of the file has been read, the system checks whether it is text data that can be processed as plain text (S660). ASCII text files (.txt), which can be processed as plain text, proceed directly to the step that follows key document input (S510), namely key document summary information generation (S520). For all other files, such as files in binary formats and web documents (.html), the text is extracted by a separate text filter inside the document classification and retrieval system (S670).
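The check-then-filter flow (S660, S670) can be sketched for the two simplest cases, plain text and HTML, using Python's standard `html.parser` module. Deciding the file type from the file extension, the function name `to_plain_text`, and the handling of only `.txt` and `.html` files are assumptions of this sketch; the text does not describe how the filter is implemented, and this simple extractor would also keep the contents of script and style elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def to_plain_text(filename, content):
    """Return plain text for a .txt or .html document given as a string."""
    name = filename.lower()
    if name.endswith(".txt"):
        return content  # already plain text, no filtering needed
    if name.endswith((".html", ".htm")):
        extractor = _TextExtractor()
        extractor.feed(content)
        extractor.close()
        return " ".join(extractor.parts)
    raise ValueError("no text filter available for " + filename)
```

Word processor formats such as `.doc` or `.hwp` would need their own filters, which are outside the scope of this sketch.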

Because documents that are not plain text are handled by document filters inside the document classification and retrieval system in this way, the documents that can be searched for include ASCII text files (.txt), web documents (.html), word processor files (.doc, .hwp, .gul, and so on), and any other files that contain text.

Next, summary information is extracted from the entered key document (S520). This is done by the automatic document summarization module (210) of the document classification and retrieval system: frequency information is collected for words, the lowest-level elements of the document, to build the keyword information, and a weight is calculated for each sentence, yielding keyword information and topic sentence information. The keyword information of the key document generated in this way includes keyword weights based on frequency, and the topic sentence information of the key document includes sentence weights calculated from the keyword weights (weights computed from the frequency of occurrence of each keyword), the sentence length, and the number of words in the sentence.

The keyword information extracted from the key document is used in the subsequent search as the search key for finding similar documents, and the extracted topic sentence information is used to provide a summary of the key document when the search results are displayed. Accordingly, it is also possible, if desired, to extract only the keyword information for use in the search and, at the result display stage, to show the entire input document or only its title.

Next, a per-topic-word weight is calculated for each search target document using the topic word information of the key document (S530). The calculation compares the topic word information generated from the key document with the topic word information held by the document summary information management system and computes a weight for each search target document. It proceeds in two main stages, and FIG. 8 shows the detailed flow.

In the first stage, the topic word list of the key document is compared with each topic word in the topic word information database (S810), and a weight is given to every document that has a matching topic word, which yields a topic word weight between the key document and each document (S820). The more matching topic words a document contains, the higher its weight for the number of topic words. In addition, the share of the matching topic words among all topic words of the search target document is calculated as a percentage and multiplied by the weight given in the first stage. Because the total number of topic words differs from document to document, applying the topic word weight uniformly would discriminate poorly between documents, and this normalization addresses that problem.
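The first stage (S810, S820) can be sketched as below. The function name `stage1_weight` and the choice of one weight unit per matching topic word are assumptions made for illustration; the text only states that matching documents receive a weight and that this weight is multiplied by the percentage share of matching topic words.

```python
def stage1_weight(key_topic_words, doc_topic_words):
    """First-stage topic word weight of one search target document."""
    doc_words = set(doc_topic_words)
    if not doc_words:
        return 0.0
    # S810: compare the key document's topic words with the document's topic words.
    matches = set(key_topic_words) & doc_words
    # S820 (assumed unit weight): one point per matching topic word.
    match_weight = len(matches)
    # Share of matching topic words among all topic words of the document, in percent.
    share_percent = 100.0 * len(matches) / len(doc_words)
    return match_weight * share_percent


key = {"drinking", "arrest", "police"}
focused = stage1_weight(key, {"drinking", "arrest", "seoul", "warrant"})
diluted = stage1_weight(key, {"drinking", "a", "b", "c", "d", "e", "f", "g", "h", "i"})
```

The second document matches a key topic word too, but the match is a small share of its topic word list, so it scores far lower than the focused document.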

The second stage accumulates, for each retrieved document, the per-topic-word weights that the individual topic words carry in the key document (S830). It supplements the first-stage weight so that a document containing many high-frequency topic words is judged more similar than a document that merely shares a large vocabulary. To counter the loss of discrimination caused by differing topic word counts here as well, the sum of the weights of the topic words found in both the search key document and the search target document is expressed as a percentage of the sum of the weights of all topic words of the search target document.
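A minimal sketch of the second stage (S830) follows. The function name `stage2_weight` is hypothetical, and two points are assumptions: the percentage is computed with the search target document's own topic word weights, and the accumulated weight and the percentage are returned separately because the text does not say how they are combined.

```python
def stage2_weight(key_weights, doc_weights):
    """Return (accumulated_key_weight, matched_share_percent) for one document.

    key_weights: dict topic word -> weight in the key document.
    doc_weights: dict topic word -> weight in the search target document.
    """
    matched = set(key_weights) & set(doc_weights)
    # S830: accumulate the key document's weight for every matched topic word.
    accumulated = sum(key_weights[word] for word in matched)
    # Normalization: matched weight as a percentage of the document's total weight
    # (assumed to use the target document's weights).
    doc_total = sum(doc_weights.values())
    share = 100.0 * sum(doc_weights[word] for word in matched) / doc_total if doc_total else 0.0
    return accumulated, share


result = stage2_weight(
    {"drinking": 3, "arrest": 2, "police": 1},
    {"drinking": 4, "arrest": 1, "seoul": 5},
)
```

Here the matched topic words "drinking" and "arrest" contribute an accumulated key weight of 5 and account for half of the target document's total topic word weight.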

Two methods are available for calculating the topic word weights. The first uses the entire topic word list, and the second uses only the top-ranked (high-range) topic words.

The first method is slower, but it measures the degree of match over all topic words accurately, so it is used when the search requires an accurate degree of match. The second method uses only the high-frequency core topic words, so it can be applied when sharper discrimination on topic information is wanted, and it takes relatively less time.

The two methods can also be combined to improve accuracy: a primary topic word weight is calculated with the entire topic word list, and a secondary weight is then added using only the core topic words.
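The combined approach can be illustrated with the sketch below. The helper names, the cut-off `n_core`, and the simple addition of the primary and secondary weights are assumptions; the text states only that a secondary weight from the core topic words is applied on top of the primary weight from the full list.

```python
def top_topic_words(weights, n):
    """Keep the n highest-weighted topic words (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


def matched_weight(key_weights, doc_words):
    """Sum the key document's weights for topic words present in the document."""
    return sum(weight for word, weight in key_weights.items() if word in doc_words)


def combined_weight(key_weights, doc_words, n_core=3):
    # Primary weight: the entire topic word list of the key document.
    primary = matched_weight(key_weights, doc_words)
    # Secondary weight: only the core (highest-weighted) topic words.
    secondary = matched_weight(top_topic_words(key_weights, n_core), doc_words)
    return primary + secondary


key = {"drinking": 5, "arrest": 4, "refusal": 3, "police": 2, "jeonju": 1}
score = combined_weight(key, {"drinking", "police", "seoul"})
```

A match on a core topic word such as "drinking" is counted twice, while a match on a lower-ranked word such as "police" is counted once, which biases the ranking toward documents that share the key document's central topic.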

Once the per-topic-word weight has been calculated for each search target document, the documents are ordered from the highest weight down. Weights calculated from topic words alone, however, ignore the distribution of topic words and the way they combine, so the weight simply rises with the number of topic words that appear. In other words, the more topic words a document contains, the more similar it is judged to be. Overcoming this limitation requires examining the distribution of topic words and how they combine within sentences. The next step therefore calculates a per-topic-sentence weight (S540).

FIG. 9 is a flowchart showing the detailed flow of the per-topic-sentence weight calculation.

To calculate the topic sentence weight, topic sentences are first selected (S910): for each document, a fixed proportion of sentences is taken from the topic sentence information database (240) in descending order of weight.

Next, the topic words in the selected topic sentences are extracted from the topic sentence information database (S920) and compared with the topic words of the key document. Each time a topic word of the key document appears, a weight is given to the document concerned (S930).

Then the per-topic-word weights of the key document are accumulated for the document (S940). That is, a document containing the high-frequency topic words of the key document receives a higher weight.

Finally, the share of the key document's topic words in the topic sentences is calculated, that is, the proportion of each selected topic sentence occupied by the key document's topic words found in it (the length of the topic words relative to the total sentence length) (S950). Applying a length-based percentage in this way improves discrimination with respect to sentence length: documents differ in length, and so do the sentences within a document, so applying the topic word and topic sentence weights uniformly would discriminate poorly.

As a result, the more highly weighted key-document topic words a sentence contains, the higher its topic sentence weight.
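Steps S910 to S950 can be sketched as follows. This is an illustration under stated assumptions: the function name, the tuple layout of `doc_sentences`, the default selection ratio of 0.5, and the way the accumulated weight is multiplied by the length share are all hypothetical, since the text names the ingredients but not the exact formula.

```python
def topic_sentence_weight(key_weights, doc_sentences, ratio=0.5):
    """Topic sentence weight of one search target document.

    key_weights:   dict topic word -> weight in the key document.
    doc_sentences: list of (sentence_text, stored_sentence_weight, topic_words_in_sentence).
    """
    if not doc_sentences:
        return 0.0
    # S910: select a fixed proportion of sentences, highest stored weight first.
    ranked = sorted(doc_sentences, key=lambda sentence: sentence[1], reverse=True)
    keep = max(1, int(len(ranked) * ratio))

    total = 0.0
    for text, _, words in ranked[:keep]:
        # S920/S930: topic words of the sentence that also occur in the key document.
        hits = [word for word in words if word in key_weights]
        if not hits or not text:
            continue
        # S940: accumulate the key document's weights of the matched topic words.
        accumulated = sum(key_weights[word] for word in hits)
        # S950: length of the matched topic words relative to the sentence length.
        share = sum(len(word) for word in hits) / len(text)
        total += accumulated * share  # assumed combination
    return total


weight = topic_sentence_weight(
    {"drinking": 3, "arrest": 2},
    [
        ("drinking arrest", 5.0, ["drinking", "arrest"]),
        ("nice weather", 1.0, ["weather"]),
    ],
)
```

A short sentence made up almost entirely of key topic words scores close to its full accumulated weight, while a long sentence that mentions the same words in passing would be scaled down by the length share.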

Applying the topic word weight and the topic sentence weight gives the final weight of each search target document. The documents are classified by sorting them in descending order of this weight (S550) and are displayed to the user.
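The final classification of step S550 amounts to summing the two weights per document and sorting. The sketch below assumes each weight is available as a dictionary keyed by a document identifier; the function name and the alphabetical tie-break are illustrative choices.

```python
def rank_documents(topic_word_weights, topic_sentence_weights):
    """Return (doc_id, total_weight) pairs sorted from highest to lowest total weight."""
    doc_ids = set(topic_word_weights) | set(topic_sentence_weights)
    totals = {
        doc_id: topic_word_weights.get(doc_id, 0.0) + topic_sentence_weights.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Descending by total weight; ties are broken by document identifier.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_documents(
    {"a": 10.0, "b": 40.0, "c": 5.0},
    {"a": 35.0, "b": 2.0},
)
```

Document "a" overtakes "b" here even though its topic word weight is lower, because its topic sentence weight is much higher. This is the situation the topic sentence stage is meant to capture.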

FIGS. 10 and 11 show examples of screen layouts for displaying the results of this processing to the user.

The result screen is divided into a key document area and a result document area. The key document area shows the key document's title, content, topic words, and general document information. For the content, a summary of the document condensed to a given ratio can be presented using the topic sentence information from the summary information generated for the key document. The topic words are likewise taken from the topic word information in that summary information, and the topic words displayed can be limited to a given ratio or number.

The result document area can list the retrieved documents in descending order of relevance to the key document, or it can first group the documents by topic word association and then list the documents in each group in order of weight. Classifying documents by topic word association means examining which topic words of the key document each retrieved document contains, placing documents that contain the same topic words in one group, and showing the documents within each group from the highest weight down. When retrieved documents are displayed by topic word association, as shown in FIG. 11, the associated topic words are shown for each document group and the documents belonging to the group are sorted by weight.
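Grouping by topic word association can be sketched as below. The function name and the input layout are assumptions, and so is the interpretation that a group is identified by the exact set of key-document topic words a document shares.

```python
from collections import defaultdict


def group_by_shared_topic_words(key_topic_words, documents):
    """Group documents by the set of key-document topic words they contain.

    documents: dict doc_id -> (set of topic words, total weight).
    Returns a dict mapping frozenset(shared topic words) -> [(doc_id, weight), ...],
    with each group sorted from highest to lowest weight.
    """
    key_words = frozenset(key_topic_words)
    groups = defaultdict(list)
    for doc_id, (words, weight) in documents.items():
        shared = frozenset(words) & key_words
        if shared:  # documents sharing no topic word are left out
            groups[shared].append((doc_id, weight))
    for members in groups.values():
        members.sort(key=lambda member: (-member[1], member[0]))
    return dict(groups)


groups = group_by_shared_topic_words(
    {"drinking", "arrest", "refusal"},
    {
        "d1": ({"drinking", "arrest", "seoul"}, 30.0),
        "d2": ({"arrest", "drinking"}, 50.0),
        "d3": ({"drinking", "weather"}, 10.0),
        "d4": ({"sports"}, 99.0),
    },
)
```

Each group key doubles as the list of associated topic words to show as the group heading, and the members are already in display order.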

Result document information in the result document area is handled in much the same way as the key document information: the title, content, topic words, and general document information of each result document are displayed. Because the content shown is a summary built from the topic sentence information in the topic sentence information database, the user can readily grasp what a result document says without opening it.

An example of actually classifying and searching documents with the automatic document summarization method is now described. The example ran a document classification search on Yonhap News data collected from Unitel. The data has the following characteristics.

The average length of a Yonhap News document is about 500 to 1,000 bytes, and there are 18,359 documents in total, covering June 16, 1997 to September 30, 1999. One article dated June 17, 1997 was selected as the key document. The original text was left unedited as far as possible, so it may contain spacing and spelling errors.

FIG. 12 shows an example screen displaying the search results for this example.

The key document reports the arrest in Jeonju on June 17, 1997 of a driver who refused a breathalyzer test. Topic words extracted from it include {drinking, charge, refusal, checkpoint, police, arrest, crackdown, Road Traffic Act, registration, forty years old, northern, three-way intersection, Seosin-dong, affiliation, Yonhap, demand, violation, driver, Jeonju}. Searching for similar documents on the basis of these topic words and looking at two-word relationships yields related documents in the order {drinking-charge (36 documents), drinking-arrest (29 documents), drinking-measurement (17 documents)}. Extending this to three words, four words, and so on finally yields the single document matching {drinking-arrest-crackdown-charge-refusal-prosecution (1 document)}. That document reports that, on June 23, 1997 in Seoul, an arrest warrant for a person who had refused a breathalyzer test was dismissed. Its topic words include {drinking, Seoul, arrest, motorcycle, case, eastern, police officer, dismissal, Road Traffic Act, Yonhap, charge, refusal, prosecution, continuation, arrest warrant, Namdaemun, flight, registration}. Both documents concern breathalyzer tests, and in both the test was refused, but one person was arrested and the other had the warrant dismissed. The finally classified document is thus similar in content to the key document, and it is apparent at a glance that its topic is breathalyzer testing. The remaining documents of lower classification rank were classified by topic words such as {drinking-arrest-crackdown-charge-refusal}, and all concern arrests on charges of refusing a test during drunk-driving enforcement. Documents of progressively lower rank concern {drinking-measurement, drinking-crackdown} and the like, and their direct relevance to the key document gradually diminishes.
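The two-word relationships mentioned in this example can be counted with a short sketch like the one below. The function name and the toy documents are hypothetical and do not reproduce the Yonhap News data or its counts; the sketch only shows how documents sharing a given pair of key topic words might be tallied.

```python
from collections import Counter
from itertools import combinations


def count_shared_pairs(key_words, documents):
    """Count, for each pair of key topic words, the documents that contain both.

    documents: iterable of topic word sets, one per search target document.
    """
    key_set = set(key_words)
    counts = Counter()
    for words in documents:
        # Sort so that each pair has one canonical (alphabetical) form.
        shared = sorted(set(words) & key_set)
        for pair in combinations(shared, 2):
            counts[pair] += 1
    return counts


# Toy data for illustration only.
pairs = count_shared_pairs(
    ["drinking", "charge", "arrest"],
    [
        {"drinking", "charge"},
        {"drinking", "charge", "arrest"},
        {"drinking", "arrest"},
        {"sports"},
    ],
)
```

Replacing `combinations(shared, 2)` with a larger group size extends the same count to three-word and four-word relationships, which narrows the result set in the way the example describes.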

According to the present invention, a document itself can serve as the search key for retrieving documents of similar content, so the desired information can be found easily and quickly in a single search.

In addition, the search results are displayed as summary information related to the topic of each document, so the desired information can be found quickly without the inconvenience of re-checking each result.

Claims

A method for classifying and searching search target documents based on their similarity to an input document, the method comprising:

A first step of inputting a search key document;

A second step of generating topic word information of the search key document;

A third step of assigning a per-topic-word weight to each search target document by using a topic word information database that includes, for each topic word contained in the search target documents, a record comprising a topic word field whose content is the topic word and at least one document identifier field whose content is the document identifier of a search target document containing the topic word;

A fourth step of assigning a per-topic-sentence weight to each search target document by using a topic sentence information database that includes, for each search target document, a record comprising a document identifier field whose content is the document identifier of the search target document and at least one sentence information field whose content is information on a sentence constituting the search target document, wherein the sentence information field includes sentence number, sentence position, sentence length, and sentence weight subfields whose contents are, respectively, the sentence number, the position of the sentence in the document, the length of the sentence, and a sentence weight determined according to the proportion of topic words contained in the sentence, and at least one topic word identifier subfield whose content is the identifier of a topic word contained in the sentence; and

A fifth step of classifying the search target documents in descending order of the sum of the per-topic-word weight and the per-topic-sentence weight.

The method of claim 1,

The first step is a step of directly inputting the contents of the search key document.

The method of claim 1,

The first step is a step of specifying a file name of a search key document.

The method of claim 1,

The first step is a step of specifying an Internet address where a search key document exists.

The method of claim 1,

Wherein, in the second step, the topic word information of the search key document includes topic words and a topic word weight according to the frequency of occurrence of each topic word.

The method of claim 5,

Wherein the third step comprises:

A step 3-1 of using the topic word information database to give a weight to each search target document having a topic word that matches a topic word of the search key document; and

A step 3-2 of giving the search target document a weight according to the topic word weights of the search key document.

The method of claim 5,

Wherein the fourth step comprises:

A step 4-1 of extracting the topic sentences of each search target document using the topic sentence information database;

A step 4-2 of extracting the topic words of the extracted topic sentences of each search target document using the topic sentence information database;

A step 4-3 of giving a weight to each search target document having a topic sentence that includes a topic word matching a topic word of the search key document;

A step 4-4 of giving the search target document a weight according to the topic word weights of the search key document; and

A step 4-5 of calculating a weight for the topic words of the search key document and assigning the weight to the topic sentences of the search target document.

The method of claim 1,

Further comprising a sixth step of displaying the classified documents to a user.

The method of claim 8,

Wherein topic sentence information of the search key document is also generated in the second step.

The method of claim 9,

In the sixth step,

Document information of the search key document and document information of the classified document are displayed together,

The document information of the search key document includes the topic word information and the topic sentence information of the search key document,

The document information of the classified document includes the subject information and the subject sentence information of the classified document extracted using the subject information database and the subject sentence information database.

The method of claim 10,

Wherein, in the sixth step, classified documents that contain the same set of topic words of the search key document are placed in one group, and within each group the document information of the classified documents is displayed in descending order of total weight.

A document classification and search system comprising:

A topic word information database that includes, for each topic word contained in search target documents, a record comprising a topic word field whose content is the topic word and one or more document identifier fields whose content is the document identifier of a search target document containing the topic word;

A topic sentence information database that includes, for each search target document, a record comprising a document identifier field whose content is the document identifier of the search target document and at least one sentence information field whose content is information on a sentence constituting the search target document, wherein the sentence information field includes sentence number, sentence position, sentence length, and sentence weight subfields whose contents are, respectively, the sentence number, the position of the sentence in the document, the length of the sentence, and a sentence weight determined according to the proportion of topic words contained in the sentence, and one or more topic word identifier subfields whose content is the identifier of a topic word contained in the sentence;

Topic word information extracting means for extracting topic word information from an input document; and

Search means for receiving the extracted topic word information as input and searching for character strings identical to the input topic words,

Wherein the search means

Assigns a per-topic-word weight to each search target document by using the topic word information database,

Assigns a per-topic-sentence weight to each search target document by using the topic sentence information database, and

Classifies the search target documents in descending order of the total weight obtained by adding the per-topic-word weight and the per-topic-sentence weight.

The system of claim 12,

A document classification retrieval system further comprising subject sentence information extraction means for extracting subject sentence information from an input document.

The system of claim 12,

A document classification search system, further comprising display means for displaying a search result to a user.

The system of claim 12,

A document classification search system further comprising a search target document database that stores the contents of the search target document as text data.