KR100834292B1 - Document processing method and system

Info

Publication number: KR100834292B1
Application number: KR1020060108786A
Authority: KR
Inventors: 심규철
Original assignee: 엔에이치엔(주)
Priority date: 2006-11-06
Filing date: 2006-11-06
Publication date: 2008-05-30
Also published as: KR20080040865A

Abstract

The present invention relates to a document processing method and system. The method includes distinguishing an original document and at least one copy document among a plurality of documents that include at least one common chunk, and calculating the scores of the original document and the copy document in different ways based on the common chunk. According to the present invention, the degree to which a document has been copied can be determined by calculating a document score and assigning it to the document.

Document processing, chunk, hash value, original document, copy document, index volume, document score, duplicate document

Description

Document processing method and system

FIG. 1 is a block diagram illustrating a document processing system according to an embodiment of the present invention.

FIG. 2 is a block diagram illustrating an example of the hash generation server shown in FIG. 1.

FIG. 3 is a schematic diagram illustrating an example of a document collection file, a hash collection file, and a score collection file used in a document processing system according to an embodiment of the present invention.

FIG. 4 is a schematic diagram illustrating an example of extracting chunks according to an embodiment of the present invention.

FIG. 5 is a block diagram illustrating an example of the score generation server shown in FIG. 1.

FIGS. 6 to 9 are schematic diagrams illustrating examples of calculating document scores according to an embodiment of the present invention.

<Description of Reference Numerals>

100: document processing system, 110: document database,

120: hash generation server, 122: document parsing module,

124: chunk extraction module, 126: hash calculation module,

128: hash collection generation module, 130: hash indexing server,

140: hash index volume, 150: score generation server,

152: hash parsing module, 154: search module,

156: sorting module, 158: score calculation module,

159: score collection generation module, 160: document indexing server,

170: document index volume, 180: score collection database,

190: database query server, 200: communication network,

300: user terminal, 400: search system

The present invention relates to a document processing method and system, and more particularly, to a document processing method and system that make it possible to obtain higher-quality search results in Internet-based information retrieval.

With the rapid spread of high-speed Internet access in recent years, the Internet has become an indispensable part of modern life. Most Internet users connect through a browser and then search for information through an Internet portal site. When a user enters a query in the portal's search box, the portal extracts various information corresponding to the query from a database and provides it to the user as search results. The extracted information is divided into categories such as dictionaries, knowledge information (for example, Naver's Knowledge iN (지식iN)), blogs, cafes, specialized materials, sites, books, news, web pages, and videos, and is delivered to and displayed for the user.

A growing number of users, rather than writing an original document on a given topic, copy documents written by other users to answer knowledge-information questions or to post on their own blogs or cafes. This is because documents used on the Internet are easy to copy. Users typically create such documents by duplicating an original document in its entirety, or by selectively copying only the parts they need, from newspaper articles, specialized materials, or other people's blogs and cafes. In some cases text is added to the copied document to supplement or elaborate on it, and in other cases part of the document is modified. As a result, many documents created this way are identical or substantially identical to the original document.

Among the documents displayed to the user as search results, the one most relevant to the search query is displayed first, and the display order is determined by relevance. However, if copied documents are not filtered out and are included in the search results alongside the original document, the quality of the search service degrades. The user may waste a lot of time opening duplicate documents, may never open a document that is actually highly relevant but has been pushed down in the display ranking, and may be inconvenienced by having to search again.

Accordingly, the technical problem to be solved by the present invention is to provide a document processing method and system that make it possible to obtain higher-quality search results in information retrieval.

A document processing method according to an embodiment of the present invention for achieving this technical object includes distinguishing an original document and at least one copy document among a plurality of documents that include at least one common chunk, and calculating the scores of the original document and the copy document in different ways based on the common chunk.

The score of the original document may be the sum of the ratios of the number of common chunks to the number of chunks in the original document.

The score (OS) of the original document may be calculated by the following equation:

$$OS = 1 + \sum_{i=1}^{n} \frac{CD_i}{OD}$$

where OD is the number of chunks in the original document, CD_i is the number of common chunks in the i-th copy document, and n is the number of copy documents.

The score of the copy document may be the ratio of the number of chunks in the copy document minus the number of common chunks to the number of chunks in the copy document.

The method may further include generating a score collection file that records a document identification code, a document score, and a copy list for each document, wherein the document identification code of the copy document is written in the copy list of the original document, and the document identification code of the original document is written in the copy list of the copy document.

The distinguishing step may include extracting chunks from a plurality of documents, querying a storage medium to extract duplicate documents that have the chunks in common, and comparing the document creation times of the duplicate documents to distinguish the original document from the copy documents.

The duplicate documents may have at least a predetermined number of the common chunks.

A computer-readable medium according to another aspect of the present invention records a program for causing a computer to execute any one of the methods described above.

A document processing system according to another aspect of the present invention includes a document distinguishing module that distinguishes an original document and at least one copy document among a plurality of documents that include at least one common chunk, and a score calculation module that calculates the scores of the original document and the copy document in different ways based on the common chunk.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them.

First, a document processing system and method according to an embodiment of the present invention will be described in detail with reference to FIGS. 1 to 5.

Referring to FIG. 1, a document processing system 100 according to an embodiment of the present invention includes a document database 110, a hash generation server 120, a hash indexing server 130, a hash index volume 140, a score generation server 150, a document indexing server 160, a document index volume 170, a score collection database 180, and a database query server 190, and is connected to a search system 400. The search system 400 is connected to a plurality of user terminals 300 through a communication network 200.

The document database 110 stores a wide variety of documents. The documents include documents posted on cafes and blogs, news, specialized materials, knowledge-information documents written by users, and the like. The document database 110 may be divided into categories such as cafe, blog, news, specialized material, and knowledge information, and may store each document under the corresponding category. Each document has a document identification code (global document ID) that uniquely identifies it, includes a document creation time, a document body, formatting, and so on, and may be written in a markup language such as Hypertext Markup Language (HTML) or Standard Generalized Markup Language (SGML).

Referring to FIG. 2, the hash generation server 120 includes a document parsing module 122, a chunk extraction module 124, a hash calculation module 126, and a hash collection generation module 128, and generates a hash collection file by referring to a document collection file from the document database 110.

A document collection file is a file made from at least one of the documents stored in the document database 110. Each document in the document collection file includes a document identification code (gdid), a document creation time (time), and a body (body), and has a fixed format as shown in FIG. 3(a). FIG. 3 is a schematic diagram illustrating an example of a document collection file, a hash collection file, and a score collection file used in a document processing system according to an embodiment of the present invention.

Here, 'ID1', which appears between >@gdid# and the newline control character (\n), is the document identification code. The size of the document identification code is preferably the same for every document so that the interface is uniform. For example, the size may be set to 20 bytes, and the document identification code may include a category code at its head so that the category of each document can be distinguished. The category code may consist of at least one byte; for example, it may be set to 0110 for news, 0120 for knowledge information, 0130 for cafes, and 0140 for blogs.

The document creation time includes time information in units of year, month, and day (yyyymmdd) and hours, minutes, and seconds (hhmmss), and is located between >@time# and the newline control character (\n) on the line following the document identification code. The body is located between >@body# and the newline control character (\n); for example, as shown in FIG. 3(a), the body includes a plurality of sentences CA, CB, CC, CD, CE, and CF. The body may thus include at least one sentence, but it need not take the form of sentences.

The document format shown in FIG. 3(a) is only one example, and the document collection file may have a different format.

The document processing system 100 according to an embodiment of the present invention may include a document collection generation module (not shown) that extracts documents from the document database 110, generates a document collection file, and sends it to the hash generation server 120.

The document parsing module 122 receives the document collection file from the document database 110 or the document collection generation module and parses each document in it so that the hash generation server 120 can extract chunks based on the body of each document.

The chunk extraction module 124 receives the parsed body from the document parsing module 122 and extracts chunks and valid chunks. A chunk is each of the pieces obtained when the body is divided into at least one piece, and is extracted with reference to a predetermined pivot character, which is a character that separates one chunk from the next. The pivot character may include at least one of sentence-ending punctuation, including the period (.), question mark (?), and exclamation mark (!); the newline control character (\n); the space character; the semicolon (;); and the colon (:). A pivot character may or may not be included in one of the chunks it separates. Pivot characters can be set arbitrarily as needed, and various characters other than those listed above can also be used as pivot characters.

When the pivot character is sentence-ending punctuation, one chunk is generated per sentence; when it is the newline control character, one chunk is generated per line; and when it is the space character, one chunk is generated per word.
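The pivot-based splitting described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `split_chunks`, the three pivot sets, and the choice to drop the pivot characters from the chunks are all assumptions (the text allows pivots to be kept or dropped).

```python
# Illustrative pivot sets (assumptions); the text lets pivots be chosen freely.
SENTENCE_PIVOTS = ".?!"
LINE_PIVOTS = "\n"
WORD_PIVOTS = " "

def split_chunks(body: str, pivots: str) -> list[str]:
    """Split body at every pivot character, dropping pivots and empty pieces."""
    chunks, current = [], []
    for ch in body:
        if ch in pivots:
            # A pivot closes the chunk collected so far.
            chunks.append("".join(current))
            current = []
        else:
            current.append(ch)
    chunks.append("".join(current))
    return [c.strip() for c in chunks if c.strip()]
```

Calling `split_chunks(body, SENTENCE_PIVOTS)` yields one chunk per sentence, while `WORD_PIVOTS` yields one chunk per word.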

When a period or newline control character is used as the pivot character, a non-duplicate document is unlikely to be judged a duplicate, copy detection is generally fast, and the approach is effective at detecting documents created by simple copy and paste without modification. When the space character is used as the pivot character, the approach is effective at detecting documents in which content copied from another document has been modified in many places at random.

To be eligible for chunk extraction, a document must have a body at least as large as the minimum body size; a document with a smaller body is excluded from chunk extraction. The minimum body size may be set to, for example, 128 bytes.

A valid chunk is a chunk that satisfies a predetermined condition. For example, a chunk whose size is at least the minimum chunk size may be treated as a valid chunk. The minimum chunk size may be set to, for example, 40 bytes, but is not limited thereto and may be set to another size as needed. Chunks smaller than the minimum chunk size may be ignored. Alternatively, a chunk smaller than the minimum chunk size may be merged with the chunk that follows it, and the merged chunk may be treated as a valid chunk if its size is at least the minimum chunk size.

The number of valid chunks that can be extracted from one document may be limited to a maximum chunk count, for example 50. Chunks beyond the maximum chunk count may be ignored.
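The size limits above (minimum body size, minimum chunk size, optional merging of short chunks, and the maximum chunk count) could be combined as in the following sketch. The constants are the example values from the text; the UTF-8 byte measure, the space used when merging, and the names `valid_chunks` and `merge_short` are assumptions made for illustration.

```python
MIN_BODY_BYTES = 128   # example minimum body size from the text
MIN_CHUNK_BYTES = 40   # example minimum chunk size from the text
MAX_CHUNKS = 50        # example maximum chunk count from the text

def size(text: str) -> int:
    # Size in bytes under UTF-8; the encoding is an assumption.
    return len(text.encode("utf-8"))

def valid_chunks(body: str, chunks: list[str], merge_short: bool = False) -> list[str]:
    """Keep chunks of at least MIN_CHUNK_BYTES, optionally merging short ones forward."""
    if size(body) < MIN_BODY_BYTES:
        return []  # body too small: excluded from chunk extraction
    valid, pending = [], ""
    for chunk in chunks:
        candidate = pending + " " + chunk if pending else chunk
        if size(candidate) >= MIN_CHUNK_BYTES:
            valid.append(candidate)
            pending = ""
        elif merge_short:
            pending = candidate  # carry the short chunk into the next one
    return valid[:MAX_CHUNKS]
```

With `merge_short=False` short chunks are simply ignored, which matches the first alternative in the text; `merge_short=True` follows the second.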

For example, specific characters and character strings such as '==', '^^', numbers, and stock phrases may be excluded from chunks. Character strings enclosed in brackets ([], {}, ()) may also be excluded as needed.
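One way to apply such exclusions before chunking is a pair of regular-expression substitutions, sketched below. The function name `strip_excluded` and the exact patterns are assumptions; nested brackets and a list of stock phrases are not handled in this sketch.

```python
import re

# Text enclosed in [], {} or () (non-nested), per the optional rule in the text.
BRACKETED = re.compile(r"\[[^\]]*\]|\{[^}]*\}|\([^)]*\)")
# Decorative strings and digit runs named as examples in the text.
NOISE = re.compile(r"==|\^\^|\d+")

def strip_excluded(text: str) -> str:
    """Remove bracketed text, '==', '^^' and digit runs, then normalize spaces."""
    text = BRACKETED.sub("", text)
    text = NOISE.sub("", text)
    return " ".join(text.split())
```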

For a document that contains no pivot characters, chunks may be extracted as follows. A first hash function is applied to each word or phrase in the document to calculate a hash value. Alternatively, the document may be divided into character strings of a fixed size and the first hash function applied to each string. A hash function, also called a digest function or message digest function, is a function that generates a fixed-length pseudorandom value from a given string.

The calculated hash value (M) is divided by a predetermined natural number (K) to obtain the remainder (N) (M % K = N, 0 ≤ N ≤ K-1). Chunks may then be extracted with reference to the words, phrases, or strings whose remainder (N) has a specific value (for example, 0). That is, such words, phrases, or strings can serve as chunk boundaries as if they were pivot characters. Apart from this way of extracting chunks, the same restrictions that apply to documents with pivot characters can be applied to documents without them. Chunks may also be extracted in this way regardless of whether pivot characters are present.
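The fixed-size-string variant of this hash-based boundary selection might look like the sketch below. MD5 stands in for the unspecified first hash function, and `width`, `k`, `target`, and the decision to keep the boundary piece at the end of its chunk are illustrative assumptions.

```python
import hashlib

def piece_hash(piece: str) -> int:
    # MD5 is a stand-in for the unspecified "first hash function" (assumption).
    return int.from_bytes(hashlib.md5(piece.encode("utf-8")).digest(), "big")

def hash_boundary_chunks(text: str, width: int = 8, k: int = 4, target: int = 0) -> list[str]:
    """Cut text into fixed-width pieces; a piece with M % k == target ends a chunk."""
    pieces = [text[i:i + width] for i in range(0, len(text), width)]
    chunks, current = [], []
    for piece in pieces:
        current.append(piece)
        if piece_hash(piece) % k == target:
            chunks.append("".join(current))  # the piece acts like a pivot
            current = []
    if current:
        chunks.append("".join(current))
    return chunks
```

Because boundaries depend only on the content of each piece, two documents that share a copied passage with the same piece alignment produce the same boundary pieces inside that passage.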

An example of chunk extraction will be described with reference to FIG. 4.

FIG. 4(a) shows a document from which chunks are to be extracted, and FIG. 4(b) shows the extracted chunks. A period was used as the pivot character, the minimum body size was set to 128 bytes, and the predetermined size (the minimum chunk size) was set to 40 bytes. In (b), the chunks in the first through third rows are valid chunks. The chunks from the fourth row to the last row are smaller than 40 bytes (indicated by the vertical dotted line) and are therefore not valid chunks. In the first line of the document, the brackets ([]) and the text enclosed in them were excluded from the chunk, but they may be included in the chunk as needed.

The hash calculation module 126 applies a second hash function to each valid chunk from the chunk extraction module 124 to calculate a hash value. The second hash function may be the same as or different from the first hash function described above. As the second hash function, the hash calculation module 126 may use, for example, a well-known hash function from the MD (message digest algorithm) family, the SHA (secure hash algorithm) family, or the RIPEMD (RACE Integrity Primitives Evaluation Message Digest) family, but is not limited to these and may use other types of hash functions.
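As a rough illustration of this step, Python's standard `hashlib` module provides MD5 and SHA-family digests. The function name `chunk_hashes`, the UTF-8 encoding, and the hexadecimal output format are assumptions, not details given in the text.

```python
import hashlib

def chunk_hashes(chunks: list[str], algorithm: str = "sha1") -> list[str]:
    """Return one hex digest per valid chunk, using the named hashlib algorithm."""
    return [hashlib.new(algorithm, c.encode("utf-8")).hexdigest() for c in chunks]
```

The number of digests returned equals the number of valid chunks, which is the value a hash count field would record.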

The hash collection generation module 128 receives the hash values generated by the hash calculation module 126 and generates a hash collection file corresponding to the document collection file. Each hash document in the hash collection file includes a document identification code, a document creation time, a hash count, and hash values, delimited by <gdid>, <time>, <count>, and <hash> respectively, and has a fixed format as shown in FIG. 3(b).

As an example, FIG. 3(b) shows the hash document corresponding to the document in FIG. 3(a). This hash document has the document identification code 'ID1' and the same document creation time as the document. If each sentence of the body in FIG. 3(a) is a chunk and CC and CF are not valid chunks, the hash values ha, hb, hd, and he in FIG. 3(b) are the hash values of the valid chunks CA, CB, CD, and CE, respectively. The hash count '4' is the number of hash values and equals the number of valid chunks.

The hash indexing server 130 receives the hash collection file from the hash generation server 120, performs indexing based on the hash values in the hash collection file, and updates the hash index stored in the hash index volume 140.

The hash index volume 140 is a storage medium that stores a hash index over all hash values. To make lookups easy, the hash values in the hash index are arranged in lexicographic order, and for each hash value the index lists the document identification code, document creation time, and so on of every hash document that contains that hash value.
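The hash index described above behaves like an inverted index from hash values to the documents that contain them. The sketch below keeps it in memory as a dictionary; the tuple layout `(gdid, time, hashes)` and the name `build_hash_index` are assumptions for illustration, and a real index volume would live on persistent storage.

```python
from collections import defaultdict

def build_hash_index(hash_docs):
    """Map each hash value to the (gdid, time) pairs of the hash documents containing it.

    hash_docs: iterable of (gdid, time, hashes) tuples (layout is an assumption).
    """
    postings = defaultdict(list)
    for gdid, time, hashes in hash_docs:
        for h in set(hashes):  # count each hash once per document
            postings[h].append((gdid, time))
    # Insert keys in lexicographic order; dicts preserve insertion order.
    return {h: sorted(postings[h]) for h in sorted(postings)}
```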

Referring to FIG. 5, the score generation server 150 includes a hash parsing module 152, a search module 154, a sorting module 156, a score calculation module 158, and a score collection generation module 159, and generates a score collection file using the hash collection file from the hash generation server 120 and the hash index in the hash index volume 140.

The hash parsing module 152 receives the hash collection file from the hash generation server 120 and parses each hash document in it so that the score generation server 150 can calculate document scores based on the hash values of each hash document.

The search module 154 receives a parsed hash document from the hash parsing module 152 and looks up each hash value in that hash document in the hash index volume 140. As a result of the lookup, documents that have the hash value are extracted from the hash index volume 140. If the number of hash values that the parsed hash document and an extracted document have in common is at least a predetermined threshold, the two are judged to be duplicates of each other. The predetermined threshold may be varied according to the number of valid chunks each document has.
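The lookup and threshold test can be sketched as counting, per candidate document, how many hash values it shares with the document being checked. Here the index is simplified to map each hash to a list of document IDs, and `find_duplicates` and the default `threshold` are hypothetical names and values.

```python
from collections import Counter

def find_duplicates(gdid, hashes, index, threshold=2):
    """Return {other_gdid: shared_hash_count} for documents sharing >= threshold hashes.

    index: dict mapping a hash value to a list of document IDs (simplified layout).
    """
    shared = Counter()
    for h in set(hashes):
        for other in index.get(h, []):
            if other != gdid:  # skip the document itself
                shared[other] += 1
    return {other: n for other, n in shared.items() if n >= threshold}
```

A threshold that scales with the document's valid chunk count, as the text permits, could be passed in by the caller instead of the fixed default.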

The search module 154 performs the lookup for every hash document in the hash collection file and passes information about documents with no duplicates and about documents judged to be duplicates to the sorting module 156.

The sorting module 156 compares the documents that the search module 154 judged to be duplicates by document creation time or document identification code, arranges them in order, and then distinguishes the original document from the copy documents. The document with the earliest document creation time is the original document, and the rest are copy documents. For duplicate documents with the same document creation time, the document in the news category may be judged to have been created first, followed in order by the knowledge information, cafe, and blog categories. In this case, therefore, the document with the smallest document identification code may be judged to be the original document and the remaining documents copy documents. This is because, as described above, the category code at the head of the identification code makes news documents sort first.
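The ordering rule above amounts to sorting by creation time and breaking ties by document identification code. The sketch assumes times are 'yyyymmddhhmmss' strings and IDs are equal-length strings beginning with the category code, so plain string comparison gives the intended order; `split_original_and_copies` is a hypothetical name.

```python
def split_original_and_copies(docs):
    """Return (original_gdid, [copy_gdids]) from a list of (gdid, time) pairs.

    The earliest creation time wins; equal times fall back to the smaller
    document ID, whose leading category code puts news (0110) first.
    """
    ordered = sorted(docs, key=lambda d: (d[1], d[0]))
    return ordered[0][0], [gdid for gdid, _ in ordered[1:]]
```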

A document with no duplicates is treated as an original document.

The score calculation module 158 calculates the document scores of the original document and the copy documents in different ways and assigns a document score to each document.

The document score (OS) of the original document is the sum of the proportions of the original document that were copied into each copy document, plus 1, and is calculated as in [Equation 1] below.

[Equation 1]
$$OS = 1 + \sum_{i=1}^{n} \frac{CD_i}{OD}$$

Here, OD is the number of valid chunks in the original document, CD_i is the number of valid chunks that the i-th copy document and the original document have in common, and n is the number of copy documents judged to be duplicates of the original document.

원본 문서의 문서 스코어(OS)는 원본 문서의 유효 청크가 얼마나 많이 복사 문서로 복사되었는가를 의미하며, 1 이상인 실수이다. 따라서 원본 문서의 문서 스코어(OS)가 크면 클수록 원본 문서의 유효 청크가 복사 문서에 더욱 많이 복사된 것으로 판단할 수 있다.The document score (OS) of the original document means how many valid chunks of the original document have been copied to the copy document, which is a real number that is one or more. Therefore, the larger the document score (OS) of the original document, the more effective chunks of the original document can be determined to have been copied to the copy document.

Of course, the document score of the original document may take a value different from that given by [Equation 1]; for example, it may be the value of [Equation 1] without the added 1.
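As a rough sketch of the original-document score, the function below sums the shared-chunk ratios and optionally adds 1. The function name, the `add_one` flag, and the choice to treat each document's hash values as a set (so repeated values count once) are assumptions for illustration only.

```python
def original_score(original_hashes, copies_hashes, add_one=True):
    """Score of an original document from the hash values of its valid chunks.

    original_hashes: hash values of the original document (OD = number of distinct values)
    copies_hashes:   one collection of hash values per copy document
    add_one:         True adds 1 to the sum; False gives the variant without the added 1
    """
    original = set(original_hashes)
    od = len(original)
    # CD_i / OD for each copy document i, summed over all copies.
    ratio_sum = sum(len(original & set(copy)) / od for copy in copies_hashes)
    return ratio_sum + 1 if add_one else ratio_sum


score = original_score(["a", "b", "c", "d"], [["a", "b"], ["a", "b", "c", "d", "e"]])
print(score)  # 2.5
```

With no copy documents the sum is empty, so the score is exactly 1, which is consistent with a document without duplicates being treated as an original.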

The document score (CS) of a copy document is the proportion of the copy document that remains after excluding the portion copied from the original document. It is calculated as in [Equation 2]:

$$CS = \frac{CT - CD}{CT} \qquad \text{[Equation 2]}$$

Here, CT is the number of valid chunks in the copy document, and CD is the number of valid chunks that the copy document and the original document have in common.

The document score (CS) of the copy document indicates how much content different from the original document has been added, and is a real number less than 1. The smaller the CS, the less original content the copy document can be judged to contain beyond the copied portion.
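The copy-document score can be sketched in a few lines. The function name is hypothetical, and hash values are again treated as sets, which is one of several possible ways to count them.

```python
def copy_score(copy_hashes, original_hashes):
    """Fraction of a copy document's valid chunks that do not appear in the original."""
    copy = set(copy_hashes)
    ct = len(copy)                         # CT: valid chunks in the copy document
    cd = len(copy & set(original_hashes))  # CD: valid chunks shared with the original
    return (ct - cd) / ct


print(copy_score(["a", "b", "c", "x", "y"], ["a", "b", "c"]))  # 0.4
print(copy_score(["a", "b"], ["a", "b", "c"]))                 # 0.0
```

A score of 0 means the copy adds nothing beyond the copied portion; the score approaches 1 as the share of added content grows.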

The score collection generation module 159 receives the document scores calculated by the score calculation module 158 and generates a score collection file corresponding to the hash collection file. Each score document in the score collection file contains a document identification code, a document score, a copy count, and a copy list, delimited by <gdid>, <cscore>, <ccount>, and <clist>, respectively, and has a fixed format as shown in (c) of FIG. 3.

For an original document, the copy count is the number of copy documents determined to be duplicates of it; for a copy document, the copy count is 1. For an original document, the copy list enumerates the document identification codes of the copy documents determined to be duplicates of it; for a copy document, the copy list shows the document identification code of the original document. Alternatively, the copy list of a copy document may show the document identification codes of other copy documents of the same original document.

The score document shown in (c) of FIG. 3 has the document identification code 'ID1' and a document score of 3.5. Because the document score is 1 or more, this document is identified as an original document. Valid chunks of this original document were copied into three copy documents, whose document identification codes are 'ID2', 'ID3', and 'ID4'.
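One way to picture a score document in memory is a small record whose fields mirror the <gdid>, <cscore>, <ccount>, and <clist> delimiters. The class below is a hypothetical sketch: the on-disk layout of FIG. 3 (c) is not reproduced here, and deriving the copy count from the length of the copy list is an assumption that holds for the default case described above.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreDocument:
    gdid: str                                  # document identification code
    cscore: float                              # document score
    clist: list = field(default_factory=list)  # copy list (document identification codes)

    @property
    def ccount(self):
        # Copy count: number of copies for an original, 1 for a copy
        # whose copy list holds only its original.
        return len(self.clist)

    @property
    def is_original(self):
        # Originals score 1 or more; copies score below 1.
        return self.cscore >= 1


doc = ScoreDocument("ID1", 3.5, ["ID2", "ID3", "ID4"])
print(doc.is_original, doc.ccount)  # True 3
```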

The score generation server 150 sends the score collection file generated in this way to the document indexing server 160. The score generation server 150 also uses the generated score collection file to update the document score information stored in the score collection database 180.

The score collection database 180 stores score information including document identification codes, document scores, and copy lists. Separately, it may also receive the score collection file itself from the score generation server 150 and store it.

The document indexing server 160 performs document indexing using the document collection file from the document database 110 or the document collection generation module and the score collection file from the score generation server 150, and updates the document index stored in the document index volume 170. If necessary, the document indexing server 160 may also request score collection information directly from the score collection database 180 and perform document indexing with the information it receives.

The document index volume 170 is a storage medium that stores the document index for all documents. The document index lists lookup keywords and, for each keyword, the document identification codes and document scores of the documents containing that keyword.

The document processing system 100 according to the embodiment of the present invention may perform document processing in batch mode. As time passes, new documents created by users accumulate in the document database 110. When a certain amount of new documents has accumulated in the document database 110, the system may process them in the manner described so far, for example by generating a document collection file for the newly accumulated documents and assigning a document score to each document. Alternatively, the system may process newly accumulated documents in the document database 110 at regular intervals.

Apart from this batch mode, the database query server 190 may look up or calculate the document score of a specific document at the request of the search system 400 or of an operator or administrator of the document processing system 100. If the document score of the specific document is already stored in the score collection database 180, the database query server 190 simply looks it up there and returns it. If it is not stored, the database query server 190 may calculate the document score for that document. In that case, the database query server 190 extracts valid chunks from the document, generates hash values, queries the hash index volume 140 to extract duplicate documents, determines the original document and the copy documents according to document generation time, calculates the document score, and then returns it.
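The lookup-then-compute behavior can be sketched as follows. The dictionary standing in for the score collection database and the `compute_score` callback standing in for the full chunk, hash, and duplicate pipeline are both hypothetical placeholders.

```python
def lookup_or_compute_score(doc_id, score_db, compute_score):
    """Return the stored score if present; otherwise compute it on demand.

    score_db:      mapping of document identification code -> document score
                   (stand-in for the score collection database)
    compute_score: callable that runs the full scoring pipeline for one document
                   (stand-in for chunk extraction, hashing, and duplicate lookup)
    """
    if doc_id in score_db:
        return score_db[doc_id]
    return compute_score(doc_id)


score_db = {"ID11": 3.5}
print(lookup_or_compute_score("ID11", score_db, lambda d: 1.0))  # 3.5 (stored)
print(lookup_or_compute_score("ID99", score_db, lambda d: 1.0))  # 1.0 (computed)
```

Whether a freshly computed score is written back to the database is not stated in the text, so the sketch only returns it.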

The score collection database 180 and the database query server 190 are optional components of the document processing system 100 according to the embodiment of the present invention.

The search system 400 can respond appropriately to user search queries based on the document score assigned to each document. That is, the search system 400 receives a query from the user terminal 300 and searches the document index volume 170 based on the content of the query. After finding the documents related to the query in the document index volume 170, it may determine the order in which they are displayed on the user terminal 300 based on their document scores. Documents with high document scores may be given priority and displayed first on the search result screen, and documents with a document score of less than 1 may be left out. Alternatively, documents determined to be duplicates may be grouped (clustered) for display. The task of ordering or grouping documents based on the document scores of the documents related to the query may also be performed within the document processing system 100 according to the embodiment of the present invention.
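A minimal sketch of score-based ordering is shown below, assuming the matching documents arrive as `(doc_id, score)` pairs. The function name and the `hide_copies` flag are illustrative; a real ranking would likely combine the document score with query relevance, which is outside this sketch.

```python
def rank_results(matches, hide_copies=True):
    """Order matching documents by document score, highest first.

    matches:     list of (doc_id, document_score) pairs found for a query
    hide_copies: when True, drop documents whose score is below 1 (copy documents)
    """
    visible = [m for m in matches if not hide_copies or m[1] >= 1]
    return sorted(visible, key=lambda m: m[1], reverse=True)


matches = [("ID12", 0.0), ("ID21", 2.5), ("ID11", 3.5), ("ID14", 0.25)]
print(rank_results(matches))
print(rank_results(matches, hide_copies=False))
```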

A method of calculating document scores according to an embodiment of the present invention will now be described in detail by way of example with reference to FIGS. 6 to 9.

In each figure, the hash documents of the hash collection file are shown above the inverted triangle, and the corresponding score documents of the score collection file are shown below it. To avoid repetition, some parts omit the hash documents and show only the score documents. For convenience of description, the hash collection files shown may include not only those generated in the current batch run but also those generated in earlier batch runs.

Referring first to FIG. 6, hash documents with document identification codes ID11, ID12, ID13, and ID14 are shown side by side. Hereinafter, the document with document identification code ID11 is referred to simply as the 'ID11 document' or 'document ID11', and other documents are referred to in the same way.

Document ID11 has hash values A1, B1, C1, D1, E1, and F1. Document ID12 has the same hash values as document ID11 (that is, document ID12 is substantially identical to document ID11). Document ID13 has hash values A1, B1, and C1. Document ID14 has the same hash values as document ID11 plus hash values G1 and H1. If the number of shared hash values required for documents to be judged duplicates is set to 2, the search module 154 can determine that these documents are duplicates of one another.
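The threshold test on shared hash values can be sketched with set intersection. The function name and the in-memory dictionary are assumptions for illustration; the actual system queries a hash index volume rather than comparing documents pairwise.

```python
from itertools import combinations


def are_duplicates(hashes_a, hashes_b, min_common=2):
    """True when two documents share at least `min_common` distinct hash values."""
    return len(set(hashes_a) & set(hashes_b)) >= min_common


# Hash values of the FIG. 6 documents as listed in the text.
fig6 = {
    "ID11": ["A1", "B1", "C1", "D1", "E1", "F1"],
    "ID12": ["A1", "B1", "C1", "D1", "E1", "F1"],
    "ID13": ["A1", "B1", "C1"],
    "ID14": ["A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1"],
}

all_pairs_duplicate = all(
    are_duplicates(fig6[a], fig6[b]) for a, b in combinations(fig6, 2)
)
print(all_pairs_duplicate)  # True
```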

Documents ID11, ID12, ID13, and ID14 have document generation times T11, T12, T13, and T14, respectively. Assuming that time T11 is the earliest of these generation times, document ID11 is the original document, and documents ID12, ID13, and ID14 are copy documents.

According to [Equation 1], the document score of original document ID11 is 3.5. Document ID11 has 6 hash values. Documents ID11 and ID12 share 6 hash values, so the ratio contributed by document ID12 is 1. Documents ID11 and ID13 share 3 hash values, so the ratio contributed by document ID13 is 0.5. Documents ID11 and ID14 share 6 hash values, so the ratio contributed by document ID14 is 1. The sum of these ratios plus 1, that is 3.5, is the document score of document ID11.

According to [Equation 2], copy documents ID12 and ID13 have no hash values other than those of original document ID11, so their document scores are 0. Copy document ID14 has hash values G1 and H1 in addition to those of original document ID11, so its document score is 2/8 = 0.25.

The copy list of original document ID11 lists copy documents ID12, ID13, and ID14, and its copy count is 3. The copy lists of copy documents ID12, ID13, and ID14 each show original document ID11, and their copy counts are 1.
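The FIG. 6 numbers can be checked with a short script. The helper functions are hypothetical sketches of the two score formulas (ratio sum plus 1 for the original, non-shared fraction for a copy), with hash values treated as sets.

```python
def original_score(original, copies):
    """1 plus the sum of shared-hash ratios over all copy documents."""
    od = len(set(original))
    return 1 + sum(len(set(original) & set(c)) / od for c in copies)


def copy_score(copy, original):
    """Fraction of the copy's hash values that are not in the original."""
    ct = len(set(copy))
    cd = len(set(copy) & set(original))
    return (ct - cd) / ct


id11 = ["A1", "B1", "C1", "D1", "E1", "F1"]
id12 = list(id11)
id13 = ["A1", "B1", "C1"]
id14 = id11 + ["G1", "H1"]

os_id11 = original_score(id11, [id12, id13, id14])
cs = {name: copy_score(h, id11) for name, h in [("ID12", id12), ("ID13", id13), ("ID14", id14)]}
print(os_id11)  # 3.5
print(cs)       # {'ID12': 0.0, 'ID13': 0.0, 'ID14': 0.25}
```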

Referring to FIG. 7, document ID21 was generated at time T21 and has hash values A2, B2, C2, D2, E2, and F2. Document ID22 was generated at time T22 and has hash values A2, B2, C2, D2, E2, F2, G2, and H2. Document ID23 was generated at time T23 and has hash values D2, E2, F2, G2, and H2. Document ID24 was generated at time T24 and has hash values G2 and H2.

Assuming that time T21 is the earliest of these generation times, document ID21 is the original document, and documents ID22 and ID23 are copy documents.

The document score of original document ID21 is 2.5, its copy list shows copy documents ID22 and ID23, and its copy count is 2. The document scores of copy documents ID22 and ID23 are 0.25 and 0.4, respectively; their copy lists show original document ID21, and their copy counts are 1.
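The FIG. 7 scores follow from the hash values listed above: document ID21 has 6 hash values, shares all 6 with ID22 and 3 (D2, E2, F2) with ID23, while ID22 and ID23 have 8 and 5 hash values in total. Written out as a worked calculation:

```latex
\begin{aligned}
OS_{\mathrm{ID21}} &= 1 + \frac{6}{6} + \frac{3}{6} = 2.5 \\
CS_{\mathrm{ID22}} &= \frac{8 - 6}{8} = 0.25 \\
CS_{\mathrm{ID23}} &= \frac{5 - 3}{5} = 0.4
\end{aligned}
```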

Meanwhile, hash values G2 and H2 are not included in original document ID21 but are included in documents ID22, ID23, and ID24. Assuming that time T22 is earlier than times T23 and T24, document ID22 acts as the original document with respect to hash values G2 and H2, and document ID24 becomes a copy document of document ID22, as indicated by the arrow in FIG. 7. Accordingly, the document score of document ID24 is 0 according to [Equation 2], its copy list shows document ID22, and its copy count is 1.

Referring to FIG. 8, documents ID32 and ID33 were included in a document collection file generated during a previous document processing run, and their document scores have already been calculated. If time T32 is earlier than time T33, document ID32 is the original document, and document ID33 is a copy document of document ID32 because it shares hash values A3, B3, and C3 with document ID32. The copy list of document ID33 therefore shows document ID32. The document scores of documents ID32 and ID33 are 1.5 and 0.4, respectively.

Suppose the document processing system 100 performs a batch document processing run at time T3. Documents ID31 and ID34 are included in the new document collection file. Because document ID31 has hash values G3 and H3 and document ID34 has hash values A3, B3, and C3, documents ID31 and ID34 are determined to be duplicates of documents ID32 and ID33. If time T31 is earlier than time T32 and time T32 is earlier than time T34, document ID31 was created before document ID32, so the original document of document ID33 changes from document ID32 to document ID31. In other words, when a copy document has copied content from several documents, its original document is the one among them that was created first. The original document of document ID34 is, of course, document ID32.

As a result, the document score of document ID33 changes to 0.6 and its copy list changes to document ID31. The document score of document ID32 changes to 2, and document ID34 is added to its copy list. The document score of document ID31 is 2, and its copy list shows document ID33. The document score of document ID34 is 0, and its copy list shows document ID32.
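The re-assignment of an original after a new batch can be sketched as "pick the earliest-created earlier document that shares enough hash values". Several inputs below are assumptions: the text does not list the full hash sets of ID32 and ID33, so D3, E3, F3 for ID32 and G3, H3 for ID33 are inferred from the stated scores; the integer times and the ordering of T33 before T34 are illustrative; and `pick_original` is a hypothetical helper.

```python
def pick_original(doc_id, hashes, created, min_common=2):
    """Earliest-created earlier document sharing >= min_common hash values, or None."""
    candidates = [
        other for other in hashes
        if other != doc_id
        and created[other] < created[doc_id]
        and len(hashes[doc_id] & hashes[other]) >= min_common
    ]
    return min(candidates, key=lambda d: created[d], default=None)


hashes = {
    "ID31": {"G3", "H3"},
    "ID32": {"A3", "B3", "C3", "D3", "E3", "F3"},  # D3-F3 inferred, not stated
    "ID33": {"A3", "B3", "C3", "G3", "H3"},        # G3, H3 inferred, not stated
    "ID34": {"A3", "B3", "C3"},
}
created = {"ID31": 1, "ID32": 2, "ID33": 3, "ID34": 4}  # illustrative ordering

originals = {doc: pick_original(doc, hashes, created) for doc in hashes}
print(originals)

# Copy score of ID33 against its new original ID31: (5 - 2) / 5
cs_id33 = (len(hashes["ID33"]) - len(hashes["ID33"] & hashes["ID31"])) / len(hashes["ID33"])
print(cs_id33)  # 0.6
```

Documents for which `pick_original` returns `None` have no earlier duplicate and are themselves originals, as ID31 and ID32 are here.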

A score collection file containing this document score information is generated at time T3', when the document processing run completes, and is as shown in the bottom row of FIG. 8.

Thus, even for a document whose document score has already been calculated, the document score and copy list may change when a new document with overlapping content is added.

Referring to FIG. 9, document ID41 has hash values A4 and B4, document ID42 has hash values A4, B4, and C4, and document ID43 has hash values A4, B4, A4, B4, A4, B4, and C4. Document ID43 was created by repeatedly copying the content corresponding to A4 and B4. If time T41 is earlier than times T42 and T43, document ID41 is the original document, and documents ID42 and ID43 are copy documents.

The document score of document ID41 is 3, and its copy list shows documents ID42 and ID43. The document score of document ID42 is 0.33, and its copy list shows document ID41. The document score of document ID43, however, depends on how the number of hash values is counted.

In case (a), document ID43 is taken to have 7 hash values, of which 2 (A4 and B4) overlap with document ID41 and 5 do not; the document score is then 5/7 = 0.71. In case (b), document ID43 is taken to have 7 hash values, of which 6 (all repeated occurrences of A4 and B4) overlap with document ID41 and 1 (C4) does not; the document score is then 1/7 = 0.14. Finally, in case (c), hash values repeated within the document are counted once, so document ID43 is taken to have 3 hash values, of which 2 overlap with document ID41 and 1 does not; the document score is then 1/3 = 0.33.

For documents that contain internally repeated content, the document processing system 100 may select whichever of these three cases is appropriate when calculating the document score.
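The three counting conventions can be compared side by side in a short sketch. The function name and the policy labels "a", "b", and "c" are illustrative, chosen to match the three cases described for document ID43.

```python
def copy_score_with_policy(copy_hashes, original_hashes, policy):
    """Copy-document score under one of three ways of counting repeated hash values."""
    original = set(original_hashes)
    if policy == "a":
        # Repeats count toward the total; each shared value counts once.
        total = len(copy_hashes)
        common = len(set(copy_hashes) & original)
    elif policy == "b":
        # Repeats count toward the total; every shared occurrence counts.
        total = len(copy_hashes)
        common = sum(1 for h in copy_hashes if h in original)
    elif policy == "c":
        # Repeats are collapsed before counting.
        unique = set(copy_hashes)
        total = len(unique)
        common = len(unique & original)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return (total - common) / total


id41 = ["A4", "B4"]
id43 = ["A4", "B4", "A4", "B4", "A4", "B4", "C4"]
for policy in ("a", "b", "c"):
    print(policy, round(copy_score_with_policy(id43, id41, policy), 2))
```

Policy (a) rewards repetition with a higher score, while (b) and (c) do not, which is one consideration when choosing among them.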

Each server included in the document processing system 100 according to the embodiment of the present invention may be a process that performs the corresponding function, or may be implemented as hardware that performs that function. In addition, a module described as part of a given server need not be included in that server; it may be included in another server or separated out independently.

The document processing system 100 according to the embodiment of the present invention may further include a management module for managing the document database 110, the hash index volume 140, the document index volume 170, and the score collection database 180.

Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The medium records a program or process for executing the document processing method described so far. The medium may include program instructions, data files, data structures, and the like, alone or in combination. Examples of such media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CDs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical line, a metal wire, or a waveguide, including a carrier wave that transmits signals specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like.

Although preferred embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

As described above, according to the present invention, whether a document is an original document or a copy document can be determined by extracting chunks from each document and comparing hash values based on the extracted chunks.

In addition, by calculating a document score from the number of duplicated hash values and assigning it to the document, it is possible to determine, for an original document, how much of its content has been copied into other documents and, for a copy document, how much content different from the original document has been added.

By creating a copy list for duplicate documents, the copy documents of an original document can be identified, and the original document of a copy document can be identified.

Furthermore, by using the document scores assigned to documents to rank search results or to decide whether to display them, higher-quality search results can be provided to users.

Claims

A document processing method comprising:

extracting chunks from a plurality of documents;

querying a storage medium to extract duplicate documents having a chunk in common;

comparing the document generation times of the duplicate documents to distinguish an original document from at least one copy document; and

calculating scores of the original document and the copy document based on the common chunks that the duplicate documents have in common.

The method of claim 1, wherein the score of the original document is a sum of ratios of the number of the common chunks to the number of chunks of the original document.

The method of claim 1, wherein the score (OS) of the original document is calculated by the following equation,

$$OS = 1 + \sum_{i=1}^{n} \frac{CD_i}{OD}$$

where OD is the number of chunks of the original document, CD_i is the number of common chunks of the i-th copy document, and n is the number of the copy documents.

The method of claim 1, wherein the score of the copy document is the ratio of the number of chunks of the copy document minus the number of the common chunks to the number of chunks of the copy document.

The method of claim 1, further comprising generating a score collection file recording a document identification code, a document score, and a copy list for each document,

wherein the document identification code of the copy document is listed in the copy list of the original document, and the document identification code of the original document is listed in the copy list of the copy document.

(Deleted)

The method of claim 1, wherein the duplicate documents have at least a predetermined number of the common chunks.

A computer-readable medium having recorded thereon a program for causing a computer to execute the method of any one of claims 1 to 5 and 7.

A document processing system comprising:

a chunk extraction module for extracting chunks from a plurality of documents;

a search module for querying a storage medium to extract duplicate documents having a chunk in common;

a sorting module for comparing the document generation times of the duplicate documents to distinguish an original document from at least one copy document; and

a score calculation module for calculating scores of the original document and the copy document based on the common chunks that the duplicate documents have in common.

The system of claim 9, wherein the score of the original document is a sum of ratios of the number of the common chunks to the number of chunks of the original document.

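The system of claim 9, wherein the score (OS) of the original document is calculated by the following equation,

$$OS = 1 + \sum_{i=1}^{n} \frac{CD_i}{OD}$$

where OD is the number of chunks of the original document, CD_i is the number of common chunks of the i-th copy document, and n is the number of the copy documents.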

The system of claim 9, further comprising a score collection generation module for generating a score collection file recording a document identification code, a document score, and a copy list for each document.

(Deleted)

The system of claim 9, wherein the duplicate documents have at least a predetermined number of the common chunks.