KR100588739B1 - Method for preventing duplication of internet documents

Info

Publication number: KR100588739B1
Application number: KR1019990052029A
Authority: KR
Inventors: 김형근; 김학훈
Original assignee: 주식회사 케이티
Priority date: 1999-11-23
Filing date: 1999-11-23
Publication date: 2006-06-13
Also published as: KR20010047696A

Abstract

1. Technical field of the invention described in the claims

The present invention relates to a method for preventing duplication of documents in a document processing system, and to a computer-readable recording medium on which a program for realizing the method is recorded.

2. Technical problem to be solved by the invention

The present invention aims to provide a method for preventing duplicate processing of documents in a document processing system, based on the content of any document, including Internet documents, and a computer-readable recording medium on which a program for realizing the method is recorded.

3. Summary of the solution

The present invention is a method for preventing duplicate documents in a document processing system, comprising: a first step of extracting the body content of a document and calculating a hash value based on the extracted body content; a second step of comparing the calculated hash value with previously stored hash values for duplicate checking; and a third step of determining, according to the comparison result of the second step, whether the body content of the document is a duplicate, and removing the duplicate document.

4. Important uses of the invention

The present invention is used in document search engines and the like.

Document, duplication, MD5 (Message Digest 5), Internet, hash

Description

Method for preventing duplication of internet documents in document processing systems

FIG. 1 is an exemplary configuration diagram of a document collection system to which the present invention is applied.

FIG. 2 is a flowchart of an embodiment of the method for preventing duplication of documents according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

11: document collection server    12: document storage server

The present invention relates to a method for preventing duplicate processing of documents, including Internet documents, in a document processing system, and to a computer-readable recording medium on which a program for realizing the method is recorded.

In a typical Internet environment, a given document can easily be copied, and the address (URL: Uniform Resource Locator) that reaches it can be written in several ways. As a result, document processing systems that collect Internet documents in bulk have wasted storage space and computing resources processing duplicate documents, and removing the duplicates has consumed considerable manpower. A way is therefore needed to eliminate unnecessary duplicates so that such systems no longer waste storage space and computing resources on them and no longer spend manpower removing them.

As noted above, Internet documents are by nature easy to copy. Even when a document is not copied, it is very common for a single document to be reachable at several different addresses (URLs). Consequently, there is no way to refer to a single document uniquely. In particular, posts and documents on Internet bulletin boards can, depending on how the bulletin board program is built, be expressed by a theoretically unlimited number of different addresses (URLs). For example:

http://host.bbs.server/bbsread.cgi?id=10

http://host.bbs.server/bbsread.cgi?prev=1&id=10

http://host.bbs.server/bbsread.cgi?prev=1&id=10&visit=0

http://host.bbs.server/bbsread.cgi?prev=1&visit=0&next=13&id=10

As the example above shows, a single document can have a great many addresses (URLs), so it is fundamentally impossible to identify a document uniquely by its address alone.
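The point can be illustrated with a minimal Python sketch. It takes the four example addresses from the text and assumes, purely for illustration, that each one returns the same body text; the `body` string and the `fetched` mapping are hypothetical stand-ins for real fetch results.

```python
import hashlib

# The four example addresses from the text, all assumed to serve the same post.
urls = [
    "http://host.bbs.server/bbsread.cgi?id=10",
    "http://host.bbs.server/bbsread.cgi?prev=1&id=10",
    "http://host.bbs.server/bbsread.cgi?prev=1&id=10&visit=0",
    "http://host.bbs.server/bbsread.cgi?prev=1&visit=0&next=13&id=10",
]

# Hypothetical body text returned for every one of those addresses.
body = "Bulletin board post number 10"
fetched = {url: body for url in urls}

# Comparing addresses sees four different documents ...
distinct_urls = set(fetched)

# ... while comparing a digest of the body sees only one.
distinct_digests = {
    hashlib.md5(text.encode("utf-8")).hexdigest() for text in fetched.values()
}

print(len(distinct_urls), len(distinct_digests))
```

Keying on the address would store the post four times; keying on a digest of the content would store it once.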

Checking for duplicates by content rather than by address is also hard: documents are on average very long and their lengths vary, so checking every document for duplicates at high speed is extremely difficult.

Thus, when performing various kinds of processing on Internet documents, such as searching or converting them, each document needs to be processed exactly once without duplication, yet conventionally this could not be achieved using only the document's title or Internet address (URL). Duplicates must therefore be detected from the document content, but because content varies in length and is large in volume, using the entire content for the check means that duplicates cannot be detected by simple comparison alone.

The present invention was devised to solve the problems described above. Its object is to provide a method for preventing duplicate processing of documents in a document processing system based on the content of any document, including Internet documents, and a computer-readable recording medium on which a program for realizing the method is recorded.

To achieve this object, the present invention provides a method for preventing duplicate documents in a document processing system, comprising: a first step of extracting the body content of a document and calculating a hash value based on the extracted body content; a second step of comparing the calculated hash value with previously stored hash values for duplicate checking; and a third step of determining, according to the comparison result of the second step, whether the body content of the document is a duplicate, and removing the duplicate document.

The present invention also provides a computer-readable recording medium on which is recorded a program for causing a document processing system having a processor to realize: a first function of extracting the body content of a document and calculating a hash value based on the extracted body content; a second function of comparing the calculated hash value with previously stored hash values for duplicate checking; and a third function of determining, according to the comparison result of the second function, whether the body content of the document is a duplicate, and removing the duplicate document.

To prevent the same document from being processed more than once, the present invention converts the body of every document into a fixed-length numeric representation that summarizes it, and then checks the uniqueness of each document using only that numeric representation; this solves the document duplication problem. Because a hash function can compute this unique numeric representation directly from the document body alone, documents can be processed in parallel without a bottleneck at the point where serial numbers would otherwise be assigned, so documents can be processed with high efficiency.

The above objects, features, and advantages will become clearer from the following detailed description taken together with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is an exemplary configuration diagram of a document collection system to which the present invention is applied; in the drawing, "11" denotes a document collection server and "12" denotes a document storage server.

As shown in FIG. 1, the document collection system to which the present invention is applied comprises a document collection server (11), which collects documents over the Internet, summarizes the body of each collected document into a fixed-length numeric representation, and compares it with previously stored values to determine whether the document is a duplicate, and a document storage server (12), which stores the documents that are not duplicates according to the determination made by the document collection server (11).

The document collection server (11) collects documents, extracts the body content of each collected document, calculates a hash value from the extracted body content, and checks for duplication by comparing the newly calculated hash value with the previously stored hash values (that is, the hash values for duplicate checking). For fast checking, the hash values for duplicate checking are stored in a small database.

To prevent duplication of Internet documents, the document collection server (11) summarizes the body of each document as a short, fixed-length number and checks for duplication based on that number. Because each document is given a short number of fixed length, only the assigned numbers need to be compared, so duplication is easy to determine.

The number is not assigned as a serial number but calculated from the content of the document, which has the advantage that multiple processes can number different documents at the same time. This reduces the bottleneck in the document numbering procedure and can therefore yield very high efficiency.

In more detail, a hash value is calculated from the body of each document collected from the Internet using a suitable known hash function (for example, MD5 (Message Digest 5)). The calculated hash value is on average far shorter than the document body and has a fixed length, so it is easy to store and compare.
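A minimal Python sketch of this property, using the standard library's `hashlib.md5`. The helper name `body_digest`, the UTF-8 encoding, and the sample texts are assumptions made for illustration; the patent does not specify them.

```python
import hashlib


def body_digest(body: str) -> str:
    """Return the MD5 digest of a document body as 32 hex characters (128 bits)."""
    # Encoding the text as UTF-8 is an assumption; any fixed encoding would do.
    return hashlib.md5(body.encode("utf-8")).hexdigest()


short_digest = body_digest("hi")
long_digest = body_digest("a much longer document body " * 10_000)

# Both digests have the same length regardless of the input size.
print(len(short_digest), len(long_digest))
```

MD5 is the function the embodiment names. It is no longer considered collision-resistant against deliberately crafted inputs, so a present-day sketch might substitute `hashlib.sha256` without changing the rest of the logic.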

In this embodiment, converting each document into a numeric representation that is easy to store and compare makes it possible to determine at high speed whether a newly collected document is the same as a previously collected one.

In addition, because the hash value is calculated from the document content alone, it reduces the numbering bottleneck compared with assigning serial numbers through an external database, and is therefore better suited to a multi-process environment.
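The following Python sketch illustrates why content-derived numbers avoid a shared counter. It uses threads from `concurrent.futures` as a stand-in for the multiple processes the text describes; `content_id` and the sample bodies are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def content_id(body: str) -> str:
    """Derive an identifier from the content alone; no shared state is touched."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


# Hypothetical document bodies; the first and third are identical.
bodies = ["first post", "second post", "first post"]

# Each worker numbers its own document independently. A serial-number scheme
# would instead need every worker to consult one central counter or database.
with ThreadPoolExecutor(max_workers=3) as pool:
    ids = list(pool.map(content_id, bodies))

print(ids[0] == ids[2], ids[0] == ids[1])
```

Workers that never communicate still agree on the identifier of identical content, which is what lets the numbering step run in parallel.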

FIG. 2 is a flowchart of an embodiment of the method for preventing duplication of documents according to the present invention.

As shown in FIG. 2, in the method for preventing duplication of documents according to the present invention, the document collection server (11) first collects documents from the Internet (201) and extracts the body content of each collected document (202). Each document carries the address (URL) from which it was collected, but documents with different addresses may still be the same document.
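The patent does not say how the body content is extracted in step 202. As one possible reading, the Python sketch below assumes the collected documents are HTML and treats the visible text as the body; the class `BodyTextExtractor`, the function `extract_body`, and the whitespace normalization are all hypothetical choices.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse whitespace so layout-only differences do not change the text.
        return " ".join(" ".join(self._chunks).split())


def extract_body(html: str) -> str:
    """Return the visible text of an HTML document (step 202, as assumed here)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page_a = "<html><body><p>Hello  <b>world</b></p><script>x=1</script></body></html>"
page_b = "<html><head><style>p{}</style></head><body><div>Hello world</div></body></html>"
print(extract_body(page_a) == extract_body(page_b))
```

How aggressively the extraction normalizes a page decides which pages count as the same document, so this step matters as much as the choice of hash function.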

Next, the content of each document is taken as input and converted into a number of fixed length (hash value calculation) (203). Any of the widely known hashing techniques may be used for this conversion; this embodiment uses the MD5 hashing technique.

This fixed-length number (hash value) is a number unique to the document, produced by taking the document's content as input, so the same content yields the same number even when the addresses (URLs) differ.

Next, the fixed-length numbers assigned to the documents are kept in a separate store (a hash value table in a small database). When a new document arrives, its number is compared with the stored fixed-length numbers (204); if the same document already exists, the new document is discarded (205), and otherwise the new document is stored in a large database (206). When the document is not a duplicate, its fixed-length number (hash value) is also stored in the hash value table of the small database, to be used as a hash value for duplicate checking when other documents arrive.
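Steps 203 to 206 can be sketched in Python as follows. This is a minimal in-memory illustration, not the patented implementation: the class `DuplicateFilter`, its `submit` method, the set standing in for the small hash value table, and the dictionary standing in for the large database are all assumptions.

```python
import hashlib


class DuplicateFilter:
    """In-memory stand-in for the hash value table and the document database."""

    def __init__(self):
        self.seen_hashes = set()  # plays the role of the small hash value table
        self.documents = {}       # plays the role of the large document database

    def submit(self, url: str, body: str) -> bool:
        """Store the document unless its body was seen before; return True if stored."""
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()  # step 203
        if digest in self.seen_hashes:                           # step 204
            return False                                         # step 205: discard
        self.seen_hashes.add(digest)       # remember the hash for later checks
        self.documents[digest] = (url, body)                     # step 206: store
        return True


dedup = DuplicateFilter()
first = dedup.submit("http://host.bbs.server/bbsread.cgi?id=10", "post ten")
second = dedup.submit("http://host.bbs.server/bbsread.cgi?prev=1&id=10", "post ten")
third = dedup.submit("http://host.bbs.server/bbsread.cgi?id=11", "post eleven")
print(first, second, third, len(dedup.documents))
```

Keeping the hash set separate from the document store mirrors the text's split between a small table that is fast to check and a large database that is touched only for new documents.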

As described above, the present invention eliminates the waste of storage space and computing resources caused by duplicate documents in document processing systems that collect Internet documents in bulk, and avoids the manpower spent removing duplicates. It can be used wherever large numbers of Internet documents are processed, such as in the document collection robot programs used by Internet search engines.

The present invention described above is not limited to the foregoing embodiment and the accompanying drawings, since a person of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the invention.

As described above, the present invention prevents documents from being duplicated more than necessary in a wide range of document processing systems (preferably search engines). It thereby eliminates the waste of storage space and computing resources caused by duplicate documents, avoids the manpower spent removing duplicates, and improves the reliability of the document processing system.

Claims

A method for preventing duplicate documents in a document processing system, the method comprising:

a first step of extracting the body content of a document and calculating a hash value based on the extracted body content;

a second step of comparing the calculated hash value with previously stored hash values for duplicate checking; and

a third step of determining, according to the comparison result of the second step, whether the body content of the document is a duplicate, and removing the duplicate document.

The method of claim 1,

wherein the hash value is a fixed-length numeric representation that is on average far shorter than the document body and fixed in length, so that it is easy to store and compare and reduces the numbering bottleneck.

A computer-readable recording medium having recorded thereon a program for causing a document processing system having a processor to realize:

a first function of extracting the body content of a document and calculating a hash value based on the extracted body content;

a second function of comparing the calculated hash value with previously stored hash values for duplicate checking; and

a third function of determining, according to the comparison result of the second function, whether the body content of the document is a duplicate, and removing the duplicate document.