KR20110117440A

KR20110117440A - System and method for calculating similarity between documents

Info

Publication number: KR20110117440A
Application number: KR1020100036894A
Authority: KR
Inventors: 윤석호; 황원석; 김상욱
Original assignee: 엔에이치엔(주)
Priority date: 2010-04-21
Filing date: 2010-04-21
Publication date: 2011-10-27
Also published as: KR101099908B1

Abstract

A system and method for calculating similarity between documents are provided. The inter-document similarity calculation system extracts keywords from the title and abstract of a first document associated with a similar-document search request. The system may additionally extract keywords from at least one of a document referenced by the first document and a document that refers to the first document.

Description

System and method for calculating similarity between documents {SYSTEM AND METHOD FOR CALCULATING SIMILARITY BETWEEN DOCUMENTS}

The present invention relates to a system and method for calculating similarity between documents, and more particularly to a system and method for calculating similarity between papers through keyword comparison.

Interest in search services for academic information stored in databases (DBs) is growing. One representative academic information search service is the similar-paper search service, which finds and presents papers on topics similar to a paper the user is interested in.

Such a similar-paper search service requires a method of calculating the similarity between papers in the DB.

Existing similar-paper search services have worked by deriving keywords from the text of papers and comparing them; this is called the text-based similarity calculation method.

However, because crawling and parsing are difficult, the body of a paper is often not stored as text in the paper DB. Keywords extracted from the title and abstract of the paper are therefore used when calculating similarity. When the keywords extracted from the title and abstract are few in number, or only weakly related to the topic of the paper, it is difficult to calculate paper similarity.

The present invention provides a system and method that can extract keywords highly relevant to a paper's topic without extracting the entire body of the paper as text.

The present invention provides a system and method that can improve the quality of paper similarity calculation without changing the DB.

The present invention provides a system and method that can improve the accuracy of similarity determination between papers by increasing the number of keywords compared in paper similarity calculation.

According to one aspect of the present invention, there is provided a system for calculating similarity between documents, including: a keyword extraction unit that extracts keywords of a first document from the first document and additionally extracts keywords of the first document from at least one of a document that refers to the first document and a document referenced by the first document; and a similarity calculation unit that calculates the similarity between the first document and a second document using the extracted keywords.

The keyword extraction unit may extract keywords of the second document from the second document, and may extract keywords of the second document from at least one of a document that refers to the second document and a document referenced by the second document.

According to an embodiment of the present invention, the similarity calculation unit calculates, as the similarity between the first document and the second document, the similarity between a first vector calculated from the extracted keywords of the first document and a second vector calculated from the keywords of the second document.

The keyword extraction unit may also extract keywords of the first document from the title and abstract of the first document, and may additionally extract keywords of the first document from at least one of the title and abstract of a document that refers to the first document and the title and abstract of a document referenced by the first document.

According to an embodiment of the present invention, the keyword extraction unit additionally extracts keywords of the first document from at least one of a document that refers to a document referring to the first document, and a document, other than the first document, that is referenced by a document referring to the first document.

According to another embodiment of the present invention, the keyword extraction unit additionally extracts keywords of the first document from at least one of a document, other than the first document, that refers to a document referenced by the first document, and a document referenced by a document that the first document references.
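The two embodiments above reach one step further along the reference graph. A minimal Python sketch of how those four kinds of second-level documents might be collected is given here; the function name `two_hop_neighbours`, the `references` mapping, and the single-letter document identifiers are all hypothetical and are not part of the source.

```python
def two_hop_neighbours(doc, references):
    """Collect documents two reference steps away from `doc`.

    `references` maps a document id to the list of ids it references
    (hypothetical data structure, assumed for illustration only).
    """
    # Invert the reference lists: who refers to each document?
    cited_by = {}
    for source, refs in references.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(source)

    def cites(d):
        return set(references.get(d, []))

    def citers(d):
        return cited_by.get(d, set())

    result = set()
    for citing in citers(doc):
        result |= citers(citing)  # documents referring to a document that refers to doc
        result |= cites(citing)   # other documents referenced by a document that refers to doc
    for cited in cites(doc):
        result |= citers(cited)   # other documents referring to a document doc references
        result |= cites(cited)    # documents referenced by a document doc references
    result.discard(doc)           # "other than the first document"
    return result


references = {
    "A": ["B"],
    "C": ["A", "D"],
    "E": ["C"],
    "F": ["B"],
    "B": ["G"],
}
print(sorted(two_hop_neighbours("A", references)))  # ['D', 'E', 'F', 'G']
```

The sketch does not filter out direct neighbours that also happen to be reachable in two steps; whether to do so is left open, since the source does not say.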

According to another aspect of the present invention, there is provided a method of calculating similarity between documents, including: extracting keywords of a first document from the first document; additionally extracting keywords of the first document from at least one of a document that refers to the first document and a document referenced by the first document; and calculating the similarity between the first document and a second document using the extracted keywords.

According to an embodiment of the present invention, keywords highly relevant to a paper's topic can be extracted without extracting the entire body of the paper as text.

According to an embodiment of the present invention, the quality of paper similarity calculation can be improved without changing the DB.

According to an embodiment of the present invention, the accuracy of similarity determination between papers can be improved by increasing the number of keywords compared in paper similarity calculation.

FIG. 1 illustrates a system for calculating similarity between documents according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating the process of extracting keywords of a first document in the system for calculating similarity between documents according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram illustrating the process of extracting keywords of a first document and of a second document to be compared in the system for calculating similarity between documents according to an embodiment of the present invention.
FIG. 4 illustrates a method of calculating similarity between papers, according to an embodiment of the present invention, when the documents subject to similarity calculation are papers.
FIG. 5 illustrates a method of calculating similarity between papers, according to an embodiment of the present invention, when the documents subject to similarity calculation are papers.

Hereinafter, some embodiments of the present invention are described in detail with reference to the accompanying drawings. The present invention is not, however, limited by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 illustrates a system 100 for calculating similarity between documents according to an embodiment of the present invention.

The similarity calculation system 100 includes a keyword extraction unit 110 and a similarity calculation unit 120.

According to an embodiment of the present invention, the similarity calculation system 100 may calculate the similarity between a first document selected by a user and an arbitrary second document stored in the document database (DB) 101. This process may be performed as part of a similar-paper search service provided by an academic information providing system (not shown).

The similarity calculation system 100 may also calculate the similarity between a first document selected by the user and a second document that the user selects from among the documents stored in the document DB 101. This process may be part of the similar-paper search service, or it may be a paper-to-paper similarity calculation performed separately from that service.

The description below covers an embodiment in which the similarity calculation system 100 is applied to a similar-paper search service provided by an academic information providing system. The system 100 may, however, be provided for any other application that requires calculating similarity between documents.

According to an embodiment of the present invention, the first document and the second document are papers. The types of the first and second documents should not, however, be construed as limited to papers.

Accordingly, even where the description below refers without further comment to an exemplary embodiment in which the first document and/or the second document is a paper, this is only one embodiment of the present invention. To the extent that the spirit of the invention is unchanged, the first document and/or the second document may be understood as other kinds of documents, such as technical standards, patent documents, journal articles, and web documents.

As noted above, existing work on calculating similarity between documents includes text-based similarity calculation methods.

A text-based similarity calculation method represents a document as the set of keywords it contains, compares the keyword sets of two documents for matches, and calculates the similarity between the documents from how many keywords they have in common.
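The keyword-set comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name `shared_keywords` is hypothetical, and the two keyword sets are borrowed from the FIG. 3 example later in this description.

```python
def shared_keywords(keywords_a, keywords_b):
    """Return the keywords that appear in both documents' keyword sets."""
    return set(keywords_a) & set(keywords_b)


doc_a = {"Mining", "Tree", "Clustering", "Frequent"}
doc_b = {"Hash", "Graph", "Network", "Clustering", "Frequent"}

common = shared_keywords(doc_a, doc_b)
print(sorted(common))  # ['Clustering', 'Frequent']
print(len(common))     # 2
```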

According to an embodiment of the present invention, the keywords to be compared are extracted not only from the document being compared, but also from documents that it references and/or documents that refer to it.

For example, when a first document selected by the user is input to the keyword extraction unit 110, the keyword extraction unit 110 extracts keywords of the first document from the first document. The keyword extraction unit 110 further extracts keywords of the first document from documents referenced by the first document (hereinafter also called "referenced documents of the first document") and/or documents that refer to the first document (hereinafter also called "documents having the first document as a referenced document").

All of the keywords extracted in this way, from the first document, from its referenced documents, and/or from the documents having the first document as a referenced document, are determined to be the keywords of the first document.
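As a rough illustration of how the keyword set of the first document might be assembled, the following Python sketch takes the union of the document's own keywords with those of its neighbours. The function `combined_keywords`, the three dictionaries, and the document identifiers are hypothetical; the source does not prescribe any particular data structure.

```python
def combined_keywords(doc_id, own_keywords, references, cited_by):
    """Union a document's own keywords with those of its neighbours.

    own_keywords: doc id -> set of keywords taken from that document's text
    references:   doc id -> ids of documents it references
    cited_by:     doc id -> ids of documents that refer to it
    (all three mappings are assumptions made for this sketch)
    """
    result = set(own_keywords.get(doc_id, set()))
    neighbours = list(references.get(doc_id, [])) + list(cited_by.get(doc_id, []))
    for neighbour in neighbours:
        result |= own_keywords.get(neighbour, set())
    return result


own_keywords = {
    "first": {"Mining", "Tree"},
    "cited": {"Data", "Pattern"},
    "citing": {"Apriori"},
}
references = {"first": ["cited"]}
cited_by = {"first": ["citing"]}

print(sorted(combined_keywords("first", own_keywords, references, cited_by)))
# ['Apriori', 'Data', 'Mining', 'Pattern', 'Tree']
```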

Conventionally, when an academic information system or the like provided a similar-paper search service, the keywords used for the similarity search were selected from the text of the title and abstract of the paper being compared.

Because the body of a paper is difficult to crawl and parse, it is often not stored as text in the paper DB. Keywords extracted from the title and abstract alone, without the text of the body, are therefore relatively few, and often correlate poorly with the detailed topic of the paper.

Accordingly, when the keyword extraction unit 110 according to an embodiment of the present invention extracts the keywords of a document for use in a similar-paper search service or the like, it extracts keywords of the first document not only from the input first document but also from the text of the documents referenced by the first document and/or the documents that refer to the first document.

In this case, the extracted keywords of the first document are more numerous and include more keywords that correlate strongly with the detailed topic of the paper, so the accuracy of the similarity calculation can be improved.

Once the keywords of the first document have been extracted in this way, from the first document as well as from its referenced documents and/or the documents having the first document as a referenced document, the similarity calculation unit 120 uses them to calculate the similarity between the first document and the documents in the document DB 101.

There may be various embodiments of the method for calculating the similarity between the first document and an arbitrary second document in the document DB 101.

As noted above, text-based similarity calculation methods, which calculate the similarity between documents by matching keywords extracted from the text of each document, include the Boolean model, the vector model, and the probabilistic model.

According to an embodiment of the present invention, the similarity calculation unit 120 uses the vector model to calculate the similarity between the extracted keywords of the first document and the extracted keywords of the second document.

Other embodiments of the present invention may of course use the Boolean model, the probabilistic model, or the like; although the description below refers to the vector model, the present invention should not be construed as limited to that embodiment.

In the vector-model embodiment, the similarity calculation unit 120 represents the keywords extracted for the first document as a first vector, and represents the keywords extracted for the second document, which is to be compared with the first document, as a second vector.

According to an embodiment of the present invention, the similarity calculation unit 120 then calculates the similarity between the first vector and the second vector using the following equation.

[Equation 1]

$$\mathrm{Sim}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Here, vector A is the first vector, representing the keywords of the first document, and vector B is the second vector, representing the keywords of the second document.

The similarity Sim(A, B) is the result of calculating the similarity between the first vector A and the second vector B. In the vector model, this calculation is also called the cosine measure.

The value of Sim(A, B) grows as the number of keywords that match between the first document and the second document increases.
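A minimal Python sketch of the cosine measure follows. It assumes binary keyword vectors (a component is 1 if the keyword is present and 0 otherwise), which the source does not specify; a weighted scheme would be computed differently. Under that assumption the dot product is the number of shared keywords and each norm is the square root of the set size. The function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(keywords_a, keywords_b):
    """Cosine measure for two keyword sets treated as binary vectors."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0  # avoid dividing by a zero-length vector
    return len(a & b) / math.sqrt(len(a) * len(b))


query = {"x", "y", "z"}
one_match = {"x", "p", "q"}
two_matches = {"x", "y", "q"}

print(cosine_similarity(query, one_match) < cosine_similarity(query, two_matches))  # True
```

With the set sizes held fixed, as in this example, each additional matching keyword raises the value, which is the behaviour described above.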

The process described above, in which the similarity calculation unit 120 uses the vector model and Equation 1 to calculate the similarity between the first document and the second document, is only one embodiment of the present invention; the specific calculation model or formula may be changed without departing from the spirit of the invention.

Once the similarity calculation unit 120 has calculated the similarity to the first document for each of the documents in the DB 101, the similarity calculation system 100 may provide a predetermined number of documents, in descending order of similarity, as similar documents (similar papers).
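The ranking step can be sketched as follows in Python, again assuming binary keyword vectors. The names `cosine` and `most_similar`, the parameter `k` standing for the predetermined number of results, and the small in-memory `db` are hypothetical stand-ins for the document DB 101.

```python
import heapq
import math


def cosine(a, b):
    """Cosine measure for two keyword sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def most_similar(query_keywords, db_keywords, k):
    """Return the k (score, doc_id) pairs with the highest similarity."""
    scored = ((cosine(query_keywords, kws), doc_id)
              for doc_id, kws in db_keywords.items())
    return heapq.nlargest(k, scored)


query = {"Mining", "Clustering", "Frequent"}
db = {
    "d1": {"Clustering", "Frequent", "Graph"},
    "d2": {"Hash", "Network"},
    "d3": {"Mining", "Clustering", "Frequent"},
}

print([doc_id for _, doc_id in most_similar(query, db, 2)])  # ['d3', 'd1']
```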

The operations of the keyword extraction unit 110 and the similarity calculation unit 120 described above are explained in more detail below with reference to FIG. 2 and the subsequent figures.

FIG. 2 is a conceptual diagram illustrating the process of extracting keywords of a first document in the system for calculating similarity between documents according to an embodiment of the present invention.

The first document 200 may be, for example, a paper.

When the first document 200 is input to the similarity calculation system 100 of FIG. 1, the keyword extraction unit 110 uses the reference list 204 of the first document 200 to identify documents 210 and 220, the referenced documents that the first document 200 refers to.

The keyword extraction unit 110 also consults the reference lists of at least some of the documents in the document DB 101 to identify documents 230 and 240, which refer to the first document 200, that is, which have the first document as a referenced document.
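One way to find the documents that refer to a given document is to invert the reference lists once. The Python sketch below assumes the reference lists are available as a mapping from document id to referenced ids; the function `build_cited_by` is hypothetical, and the ids simply reuse the reference numerals of FIG. 2 for readability.

```python
from collections import defaultdict


def build_cited_by(references):
    """Invert reference lists: referenced doc id -> ids of documents referring to it."""
    cited_by = defaultdict(list)
    for doc_id, refs in references.items():
        for ref in refs:
            cited_by[ref].append(doc_id)
    return dict(cited_by)


references = {
    "200": ["210", "220"],  # reference list of the first document
    "230": ["200"],
    "240": ["200"],
}
cited_by = build_cited_by(references)

print(references["200"])  # ['210', '220']
print(cited_by["200"])    # ['230', '240']
```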

The keyword extraction unit 110 then extracts the keywords {"Mining", "Tree", "Clustering", "Frequent"} (251) from the parts of the first document 200 that contain text information, namely the title 201, the abstract 202, and the reference list 204, and excludes the body 203, which contains no text information.

This keyword extraction may be done by extracting every possible keyword from the text and then removing unnecessary elements such as grammatical particles, special symbols, and stopwords.
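A simplified Python sketch of this extract-then-filter approach is shown below for English text. The stopword list, the function name `extract_keywords`, and the sample title are all made up for illustration; handling Korean particles would need a morphological analyzer, which this sketch does not attempt.

```python
import re

# A tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def extract_keywords(text):
    """Take every alphabetic token, then drop stopwords and single letters."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())  # special symbols are skipped here
    return {t for t in tokens if t not in STOPWORDS and len(t) > 1}


title = "Mining Frequent Trees: A Clustering Approach"  # hypothetical title
print(sorted(extract_keywords(title)))
# ['approach', 'clustering', 'frequent', 'mining', 'trees']
```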

In the conventional method, keywords were extracted only from the first document 200, so the keywords {"Mining", "Tree", "Clustering", "Frequent"} (251) alone served as the keywords of the first document 200 for comparison with the keywords of other documents.

According to an embodiment of the present invention, however, the keywords {"Data", "Pattern"} (252) are also extracted from documents 210 and 220, the referenced documents that the first document 200 refers to.

In this case, the keywords are extracted from the title 211, abstract 212, and reference list 214 of document 210, and from the title 221, abstract 222, and reference list 224 of document 220, all of which contain text information.

The body portions 213 and 223, which contain no text information, may go unused. In some documents, however, part of the body does contain text information, in which case keywords can of course be extracted from the body as well (the same applies below).

In addition, according to an embodiment of the present invention, the keywords {"Association", "Apriori", "Candidate"} (253) are extracted from documents 230 and 240, which refer to the first document 200 as a referenced document.

Here too, keywords may be extracted from the titles 231 and 241, abstracts 232 and 242, and reference lists 234 and 244, which contain text information, while the body portions 233 and 243, which contain none, are excluded.

In this embodiment, therefore, the keywords 250 of the first document 200 extracted by the keyword extraction unit 110 are {"Data", "Pattern", "Mining", "Tree", "Clustering", "Frequent", "Association", "Apriori", "Candidate"}.

The keywords 250 extracted in this way are used to calculate the similarity between the first document and other documents.

FIG. 3 is a conceptual diagram illustrating the process of extracting keywords of the first document 200 and of the second document 300 to be compared in the system for calculating similarity between documents according to an embodiment of the present invention.

The keywords 250 of the first document are extracted from the first document 200, from documents 210 and 220 that the first document references, and from documents 230 and 240 that have the first document as a referenced document, as described above with reference to FIG. 2.

According to an embodiment of the present invention, the keywords 350 of an arbitrary second document 300 in the document DB 101 of FIG. 1 may be extracted by a process similar to that used for the keywords 250 of the first document 200.

The keyword extraction unit 110 extracts the keywords {"Hash", "Graph", "Network", "Clustering", "Frequent"} (351) from the second document 300. Consulting the reference list of the second document 300, it then extracts the keywords {"Candidate", "Minsup"} (352) from the referenced documents 310 and 320 that the second document 300 refers to.

The keyword extraction unit 110 also extracts the keywords {"Pattern", "Data"} (353) from documents 330 and 340, which have the second document 300 as a referenced document.

Through this process, the keywords {"Candidate", "Minsup", "Hash", "Graph", "Network", "Clustering", "Frequent", "Pattern", "Data"} (350) of the second document 300 are extracted.

When, following the conventional method, the keywords {"Mining", "Tree", "Clustering", "Frequent"} (251) extracted only from the first document 200 are compared with the keywords {"Hash", "Graph", "Network", "Clustering", "Frequent"} (351) extracted only from the second document 300, just two keywords match, {"Clustering", "Frequent"}, so the similarity is calculated as a value corresponding to two keyword matches.

When, however, the keywords {"Data", "Pattern", "Mining", "Tree", "Clustering", "Frequent", "Association", "Apriori", "Candidate"} (250) of the first document 200 extracted according to an embodiment of the present invention are compared with the keywords {"Candidate", "Minsup", "Hash", "Graph", "Network", "Clustering", "Frequent", "Pattern", "Data"} (350) of the second document 300, five keywords match: {"Data", "Pattern", "Clustering", "Frequent", "Candidate"}. According to an embodiment of the present invention, the similarity between the first document 200 and the second document 300 is therefore calculated as a value corresponding to five keyword matches.
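The comparison above can be checked with a short Python sketch that uses the keyword sets of FIG. 2 and FIG. 3. The cosine values assume binary keyword vectors, which is an assumption of this sketch rather than something the source states; the variable and function names are hypothetical.

```python
import math

# Keyword sets taken from the FIG. 2 / FIG. 3 example.
own_1 = {"Mining", "Tree", "Clustering", "Frequent"}           # 251
cited_1 = {"Data", "Pattern"}                                  # 252
citing_1 = {"Association", "Apriori", "Candidate"}             # 253
own_2 = {"Hash", "Graph", "Network", "Clustering", "Frequent"} # 351
cited_2 = {"Candidate", "Minsup"}                              # 352
citing_2 = {"Pattern", "Data"}                                 # 353

full_1 = own_1 | cited_1 | citing_1  # keywords 250
full_2 = own_2 | cited_2 | citing_2  # keywords 350


def cosine(a, b):
    """Cosine measure for two keyword sets treated as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b))


print(len(own_1 & own_2))               # 2
print(len(full_1 & full_2))             # 5
print(round(cosine(own_1, own_2), 3))   # 0.447
print(round(cosine(full_1, full_2), 3)) # 0.556
```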

Thus, according to an embodiment of the present invention, the similarity value between the first document 200 and the second document 300 calculated by the similarity calculation unit 120 may be higher than under the conventional method. In this embodiment, more keywords are compared and more keywords with specific content representing the detailed topic of the document can be extracted, so the accuracy of the similarity calculation between documents can be substantially improved.

도 4는 본 발명의 일실시예에 따라 유사도 계산의 대상 문서가 논문인 경우에, 논문 사이의 유사도 계산 방법을 도시한다.4 illustrates a method of calculating similarity between articles when a document to be calculated for similarity is a article according to an embodiment of the present invention.

단계(S410)에서 제1 논문으로부터 제1 논문의 키워드가 추출된다. 이를테면 도 2를 참조하여 상술한 바와 같이, 제1 문서(200)으로부터 키워드(251)를 추출하는 과정에 의해 제1 논문의 키워드가 추출될 수 있다.In operation S410, keywords of the first article are extracted from the first article. For example, as described above with reference to FIG. 2, a keyword of the first article may be extracted by a process of extracting the keyword 251 from the first document 200.

In this process, the keyword extraction for the first paper may be text-based keyword extraction performed by the keyword extraction unit 110 of FIG. 1.

Then, in step S420, keywords of the first paper are additionally extracted from the referenced papers that the first paper cites. For example, the keywords of the first paper may be additionally extracted by the process of extracting the keywords 252 from the referenced documents 210 and 220 that the first document 200 references in FIG. 2.

Further, in step S430, keywords of the first paper are additionally extracted from the papers that cite the first paper as a referenced paper. For example, the keywords of the first paper may be additionally extracted by the process of extracting the keywords 253 from the documents 230 and 240 that reference the first document 200 in FIG. 2.

All of the keywords extracted in this way are used, as the keywords of the first paper, in calculating the similarity between papers.
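Steps S410 to S430 can be sketched as follows. This is a minimal illustration, not the keyword extraction unit 110 itself: the `Paper` record, the in-memory `db` dictionary, and the naive word-length tokenizer in `extract_text_keywords` are all assumptions made for the example, and repeated keywords are kept so that a later step could count them.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record for a paper stored in the document DB."""
    paper_id: str
    text: str                                        # e.g. title plus summary
    references: list = field(default_factory=list)   # ids of papers this paper cites
    cited_by: list = field(default_factory=list)     # ids of papers that cite this paper


def extract_text_keywords(text):
    """Placeholder text-based extractor: lowercase words longer than 3 characters."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if len(w) > 3]


def collect_keywords(paper, db):
    """Gather keywords from the paper itself, its references, and its citing papers."""
    keywords = extract_text_keywords(paper.text)               # S410: the paper itself
    for ref_id in paper.references:                            # S420: referenced papers
        keywords += extract_text_keywords(db[ref_id].text)
    for citing_id in paper.cited_by:                           # S430: citing papers
        keywords += extract_text_keywords(db[citing_id].text)
    return keywords


db = {
    "p1": Paper("p1", "Frequent pattern mining", references=["p2"], cited_by=["p3"]),
    "p2": Paper("p2", "Apriori candidate generation"),
    "p3": Paper("p3", "Clustering with frequent patterns"),
}
first_paper_keywords = collect_keywords(db["p1"], db)
print(first_paper_keywords)
```

A real extractor would likely filter stop words and weight terms; the point of the sketch is only the order in which the three keyword sources are merged.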

In step S440, the similarity between the first paper and a second paper may be calculated by comparing the keywords extracted for the second paper with the extracted keywords of the first paper. This similarity calculation may be performed by the similarity calculation unit 120 of FIG. 1.

Meanwhile, the keywords extracted for the second paper may likewise have been extracted, as described above with reference to FIG. 3, from the second paper, the papers referenced by the second paper, and the papers that cite the second paper as a referenced paper.

The process of calculating the similarity between the first paper and the second paper in step S440 is described in more detail below with reference to FIG. 5.

FIG. 5 illustrates a method of calculating the similarity between papers, according to an embodiment of the present invention, when the documents subject to the similarity calculation are papers.

In step S510, the similarity calculation unit 120 calculates a first vector of the keywords of the first paper, using the keywords of the first paper extracted through steps S410 to S430 of FIG. 4. This vector calculation may be performed using the vector model described above with reference to FIG. 1 and Equation 1.

Then, in step S520, the similarity calculation unit 120 calculates a second vector of the keywords of the second paper, using the keywords of the second paper. The second vector may also be calculated using the vector model described above with reference to FIG. 1 and Equation 1.

Then, in step S530, the similarity calculation unit 120 may calculate the similarity between the first vector and the second vector using Equation 1 above. The similarity between the first vector and the second vector calculated in this way may be determined as the similarity between the first paper and the second paper.
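Steps S510 to S530 can be illustrated with a minimal sketch. Equation 1 is not reproduced in this part of the description, so the sketch assumes a plain term-count vector and cosine similarity, a common choice for vector models; the function names `keyword_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(keywords):
    """Build a sparse term-count vector from a list of keywords (assumed weighting)."""
    return Counter(keywords)


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Keyword lists from the example of documents 200 and 300.
first = ["Data", "Pattern", "Mining", "Tree", "Clustering",
         "Frequent", "Association", "Apriori", "Candidate"]
second = ["Candidate", "Minsup", "Hash", "Graph", "Network",
          "Clustering", "Frequent", "Pattern", "Data"]

first_vector = keyword_vector(first)      # S510
second_vector = keyword_vector(second)    # S520
similarity = cosine_similarity(first_vector, second_vector)  # S530
print(round(similarity, 3))  # 0.556
```

Under these assumptions each keyword has weight 1, so the five shared keywords give a dot product of 5 and each vector has norm 3, yielding 5/9.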

In this case, steps S520 to S530 may also be performed for each of the other papers in the DB in which the papers are stored, so that the similarity between each of those papers and the first paper is calculated.

The system 100 for calculating similarity between documents may then provide, as the similar-paper search result, a predetermined number of papers from the DB in descending order of their calculated similarity to the first paper.
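The ranking step can be sketched as a top-k selection over the DB. As before, cosine similarity over term counts is an assumption standing in for Equation 1, and `top_similar`, `keywords_by_paper`, and the paper ids are hypothetical names used only for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors (assumed metric)."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def top_similar(query_id, keywords_by_paper, k):
    """Return the k papers most similar to the query paper, highest score first."""
    query_vec = Counter(keywords_by_paper[query_id])
    scores = [
        (cosine(query_vec, Counter(kws)), pid)
        for pid, kws in keywords_by_paper.items()
        if pid != query_id  # do not rank the query paper against itself
    ]
    return [(pid, score) for score, pid in heapq.nlargest(k, scores)]


keywords_by_paper = {
    "p1": ["data", "pattern", "frequent"],
    "p2": ["data", "pattern", "hash"],
    "p3": ["graph", "network"],
    "p4": ["data", "frequent", "pattern"],
}
results = top_similar("p1", keywords_by_paper, k=2)
print([pid for pid, _ in results])  # ['p4', 'p2']
```

Scoring every paper in the DB is linear in the DB size per query; whether the described system precomputes vectors or uses an index is not stated in the description.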

Further details of the keyword extraction and similarity calculation are as described above with reference to FIGS. 1 to 3.

The method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware systems specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware systems described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to those embodiments, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims below and by their equivalents.

100: System for calculating similarity between documents
110: Keyword extraction unit
120: Similarity calculation unit
101: Document DB (database)

Claims

A system for calculating similarity between documents, comprising:
a keyword extraction unit that extracts keywords of a first document from the first document, and further extracts keywords of the first document from at least one of a document that references the first document and a document referenced by the first document; and
a similarity calculation unit that calculates, using the extracted keywords, a similarity between the first document and a second document that is the target of the similarity determination.

The system of claim 1,
wherein the keyword extraction unit
extracts keywords of the second document from the second document, and further extracts keywords of the second document from at least one of a document that references the second document and a document referenced by the second document.

The system of claim 1,
wherein the similarity calculation unit
calculates, as the similarity between the first document and the second document, a similarity between a first vector calculated using the extracted keywords of the first document and a second vector calculated using keywords of the second document.

The system of claim 1,
wherein the keyword extraction unit
extracts keywords of the first document from a title and a summary of the first document, and further extracts keywords of the first document from at least one of a title and a summary of a document that references the first document and a title and a summary of a document referenced by the first document.

The system of claim 1,
wherein the keyword extraction unit
further extracts keywords of the first document from at least one of a document that references a document referencing the first document, and a document, other than the first document, that is referenced by a document referencing the first document.

The system of claim 1,
wherein the keyword extraction unit
further extracts keywords of the first document from at least one of a document, other than the first document, that references a document referenced by the first document, and a document referenced by a document that the first document references.

A method for calculating similarity between documents, comprising:
extracting keywords of a first document from the first document;
further extracting keywords of the first document from at least one of a document that references the first document and a document referenced by the first document; and
calculating a similarity between the first document and a second document using the extracted keywords.

The method of claim 7, further comprising:
extracting keywords of the second document from the second document; and
further extracting keywords of the second document from at least one of a document that references the second document and a document referenced by the second document,
wherein the calculating of the similarity comprises calculating the similarity between the first document and the second document using the extracted keywords of the first document and the extracted keywords of the second document.

The method of claim 7,
wherein the calculating of the similarity comprises:
calculating a first vector using the extracted keywords of the first document;
calculating a second vector using the extracted keywords of the second document; and
calculating a similarity between the first vector and the second vector as the similarity between the first document and the second document.

The method of claim 7,
wherein the keywords of the first document are extracted using at least one of a title and a summary of the first document, a title and a summary of a document that references the first document, and a title and a summary of a document referenced by the first document.

A computer-readable recording medium in which a program for executing the method of any one of claims 7 to 10 is recorded.