KR102045574B1

KR102045574B1 - Apparatus and method for deducting keyword of technical document

Info

Publication number: KR102045574B1
Application number: KR1020180017372A
Authority: KR
Inventors: 박상성; 김종찬; 강지호
Original assignee: 고려대학교 산학협력단
Priority date: 2018-02-13
Filing date: 2018-02-13
Publication date: 2019-11-18
Also published as: KR20190097669A

Abstract

The present invention discloses an apparatus and method for deriving technical document keywords. According to an embodiment of the present invention, the apparatus may include: a similarity verification unit that verifies the similarity between academic documents and technical documents included in analysis data collected with the same search term; a keyword extraction unit that, when the verified similarity is higher than a reference value, extracts at least one academic common keyword that appears in common across a plurality of parts constituting the academic document; and a technical document keyword derivation unit that derives the extracted at least one academic common keyword as a technical document keyword.

Description

Apparatus and method for deriving technical document keywords {APPARATUS AND METHOD FOR DEDUCTING KEYWORD OF TECHNICAL DOCUMENT}

The present invention relates to a technique for deriving technical document keywords in order to search patent data efficiently, and more particularly to an apparatus and method that verify the similarity between academic documents and technical documents and then derive technical document keywords based on the keywords of the academic documents.

Governments and companies strive to gain a competitive advantage through the development of new technologies.

However, the losses incurred when new technology development fails are as large as the gains obtained when it succeeds.

Establishing a technology R&D strategy that sets the right development direction and minimizes the risk of failure is therefore an essential part of new technology development.

Patent analysis is used as a tool for establishing such a strategy. In particular, as quantitative patent analysis becomes more important for sustainable technology management, patent analysis methodologies based on text mining are being actively studied.

Word-based patent analysis using text mining is preceded by noise word removal and keyword extraction.

The extracted keywords therefore have a decisive effect on the subsequent analysis, and the results of that analysis cannot be trusted unless the extracted keywords are verified.

Nevertheless, most studies to date have performed patent analysis on the assumption that the selected keywords are suitable, without verifying them.

Keyword derivation criteria may also vary with the purpose of the analysis. Until now, the criterion for keyword selection has been how well the individual patents in a collected patent data set can be classified and clustered according to their technical characteristics.

A representative example is the method using TF-IDF (Term Frequency - Inverse Document Frequency) weights, which compares the words contained in the patents within a group.

However, patent keywords selected by such conventional methods can indicate how one patent differs from another, but they can hardly be said to represent the inherent characteristics of that patent.

In addition, because keywords are selected by comparing the patents within the analysis data, the existing selection criteria are not suitable for patent analysis in which the scope of analysis is not fixed.

That is, a keyword selection criterion and derivation method are needed that can represent the technical characteristics of a single patent without comparison with other patents.

Korean Patent No. 10-1600870, "Key Keyword Extraction Method Using Statistical Method"
Korean Patent Application Publication No. 10-2009-0033728, "Method and Apparatus for Providing Content Summary Information"
Korean Patent No. 10-1623860, "Method of Calculating Similarity for Document Elements"
Korean Patent No. 10-1505546, "Keyword Derivation Method Using Text Mining"
United States Patent Application Publication No. 2012/0330955, "DOCUMENT SIMILARITY CALCULATION DEVICE"
United States Patent Application Publication No. 2015/0234835, "KEYWORD ASSESSMENT"

An object of the present invention is to provide an apparatus and method for deriving technical document keywords.

Another object of the present invention is, based on verification of the similarity between academic documents and technical documents, to verify the performance of academic common keywords by comparing the author keywords of an academic document with the academic common keywords extracted from the parts of that document, and on that basis to derive the technical common keywords extracted from the parts of a technical document as technical document keywords.

Another object of the present invention is to improve the reliability of the derived technical document keywords by using text-mining-based statistical verification.

Another object of the present invention is to improve the validity of the derived technical document keywords by verifying the similarity between academic documents and technical documents using text-mining-based statistical verification.

According to an embodiment of the present invention, an apparatus for deriving technical document keywords may include: a similarity verification unit that verifies the similarity between academic documents and technical documents included in analysis data collected with the same search term; an academic keyword extraction unit that, when the verified similarity is higher than a reference value, extracts at least one academic common keyword that appears in common across a plurality of parts constituting the academic document; a keyword performance verification unit that verifies keyword derivation performance by comparing the extracted at least one academic common keyword with the author keywords of the academic document; a technical keyword extraction unit that extracts at least one technical common keyword that appears in common across a plurality of parts constituting the technical document; and a technical document keyword derivation unit that derives the extracted at least one technical common keyword as a technical document keyword.

According to an embodiment of the present invention, the apparatus for deriving technical document keywords may further include: a data classification unit that classifies the collected analysis data into an academic document comparison group, a technical document comparison group, and an academic and technical document comparison group; and a similarity calculation unit that calculates a first similarity between academic documents in the academic document comparison group, a second similarity between technical documents in the technical document comparison group, and a third similarity between academic documents and technical documents in the academic and technical document comparison group.

According to an embodiment of the present invention, the similarity verification unit may calculate the average of the calculated first, second, and third similarities and compare that average with each of the three similarities. When the difference between the average and each of the first, second, and third similarities falls within a reference range, the unit verifies the similarity as higher than the reference value; when the difference falls outside the reference range, it verifies the similarity as lower than the reference value.
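The mean-based check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `verify_similarity`, the parameter `reference_range`, and its default value of 0.05 are hypothetical, and "falls within the reference range" is read here as an absolute deviation from the mean that does not exceed a tolerance. The source does not give a numeric range.

```python
from statistics import mean


def verify_similarity(first, second, third, reference_range=0.05):
    """Return True when every similarity stays close to the mean of all three.

    reference_range is a hypothetical tolerance; the source does not state one.
    """
    similarities = [first, second, third]
    average = mean(similarities)
    # Each group similarity is compared with the average of the three.
    return all(abs(s - average) <= reference_range for s in similarities)


if __name__ == "__main__":
    print(verify_similarity(0.30, 0.32, 0.31))  # all three close to the mean
    print(verify_similarity(0.30, 0.32, 0.60))  # cross-group value deviates
```

Under this reading, a True result corresponds to the case where the similarity is verified as higher than the reference value and keyword extraction may proceed.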

According to an embodiment of the present invention, the similarity calculation unit may extract first abstracts of a plurality of academic documents from the academic document comparison group, perform text mining on the extracted first abstracts to structure them as a first document-term matrix, and calculate the first similarity using the first document-term matrix.

According to an embodiment of the present invention, the similarity calculation unit may extract second abstracts of a plurality of technical documents from the technical document comparison group, perform text mining on the extracted second abstracts to structure them as a second document-term matrix, and calculate the second similarity using the second document-term matrix.

According to an embodiment of the present invention, the similarity calculation unit may extract third abstracts of a plurality of academic documents and fourth abstracts of a plurality of technical documents from the academic and technical document comparison group, perform text mining on the extracted third and fourth abstracts to structure them as a third document-term matrix, and calculate the third similarity using the third document-term matrix.

According to an embodiment of the present invention, the similarity calculation unit may calculate the first similarity by applying the row values and column values of the first document-term matrix to a cosine distance equation, calculate the second similarity by applying the row values and column values of the second document-term matrix to the cosine distance equation, and calculate the third similarity by applying the row values and column values of the third document-term matrix to the cosine distance equation.
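A minimal sketch of a cosine-based group similarity follows. The source names a cosine distance equation but does not reproduce it here, so two assumptions are made: the code computes cosine similarity (cosine distance is commonly defined as one minus this value), and a single similarity per group is obtained by averaging over all pairs of rows. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(matrix):
    """Average cosine similarity over all pairs of rows (documents).

    Averaging over pairs is an assumption made for this sketch.
    """
    pairs = list(combinations(matrix, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

Each row of `matrix` stands for one document and each column for one term, matching the document-term matrix layout described in the text.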

According to an embodiment of the present invention, the keyword performance verification unit may count the keywords that match between the extracted at least one academic common keyword and the author keywords, and verify the keyword derivation performance by dividing that count by the number of extracted academic common keywords.
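The ratio described above is the share of extracted common keywords that also appear among the author keywords. A short sketch, assuming case-insensitive exact matching (the source does not say how a match is decided) and a hypothetical function name:

```python
def keyword_derivation_performance(common_keywords, author_keywords):
    """Matched keyword count divided by the number of extracted common keywords."""
    common = {k.lower() for k in common_keywords}
    author = {k.lower() for k in author_keywords}
    if not common:
        return 0.0  # nothing was extracted, so there is nothing to score
    return len(common & author) / len(common)


if __name__ == "__main__":
    extracted = ["graphene", "sensor", "method", "layer"]
    authored = ["Graphene", "Sensor", "electrode"]
    print(keyword_derivation_performance(extracted, authored))
```

In this example two of the four extracted keywords match the author keywords, so the ratio is 2 / 4.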

According to an embodiment of the present invention, the technical document keyword derivation unit may determine the ranks of the at least one academic common keyword based on how frequently each keyword appears.

According to an embodiment of the present invention, the technical document keyword derivation unit may, based on the determined ranks, exclude from the at least one academic common keyword any keyword whose rank is lower than a noise classification criterion.
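Ranking by frequency and discarding low-ranked keywords can be sketched as below. The names `rank_and_filter` and `noise_rank` are hypothetical, and the sketch assumes that the noise classification criterion is a rank cutoff and that a keyword exactly at the cutoff is kept. The source does not specify either point.

```python
from collections import Counter


def rank_and_filter(keyword_occurrences, noise_rank=3):
    """Rank keywords by occurrence count and keep the top noise_rank of them.

    keyword_occurrences is a flat list with one entry per appearance.
    """
    counts = Counter(keyword_occurrences)
    # most_common() orders keywords from most to least frequent.
    ranked = [keyword for keyword, _ in counts.most_common()]
    return ranked[:noise_rank]


if __name__ == "__main__":
    occurrences = ["a", "b", "a", "c", "a", "b", "d"]
    print(rank_and_filter(occurrences, noise_rank=2))
```

With the sample input, "a" appears three times and "b" twice, so those two keywords survive a cutoff of rank 2.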

According to an embodiment of the present invention, the plurality of parts constituting the academic document may include at least one of the abstract, introduction, and conclusion of the academic document, and the plurality of parts constituting the technical document may include at least one of the abstract, introduction, conclusion, claims, and title of the technical document.
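One way to read "keywords that appear in common across a plurality of parts" is as the intersection of the word sets of those parts. The sketch below follows that reading. The tokenizer, the small `STOPWORDS` set, and the function names are assumptions for illustration; the source does not describe its preprocessing at this point.

```python
import re

# A deliberately small, hypothetical stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic, non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def common_keywords(parts):
    """Words present in every part (e.g. abstract, introduction, conclusion)."""
    token_sets = [tokenize(part) for part in parts]
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


if __name__ == "__main__":
    parts = [
        "Graphene sensor for gas detection",
        "The graphene layer improves the sensor",
        "A sensor with graphene electrodes",
    ]
    print(sorted(common_keywords(parts)))
```

For a technical document the same function could be called with the abstract, claims, and title as the parts.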

According to an embodiment of the present invention, a method for deriving technical document keywords may include: verifying, by a similarity verification unit, the similarity between academic documents and technical documents included in analysis data collected with the same search term; extracting, by an academic keyword extraction unit, when the verified similarity is higher than a reference value, at least one academic common keyword that appears in common across a plurality of parts constituting the academic document; verifying, by a keyword performance verification unit, keyword derivation performance by comparing the extracted at least one academic common keyword with the author keywords of the academic document; extracting, by a technical keyword extraction unit, at least one technical common keyword that appears in common across a plurality of parts constituting the technical document; and deriving, by a technical document keyword derivation unit, the extracted at least one technical common keyword as a technical document keyword.

According to an embodiment of the present invention, verifying the similarity between the academic documents and the technical documents may further include: classifying, by a data classification unit, the collected analysis data into an academic document comparison group, a technical document comparison group, and an academic and technical document comparison group; calculating, by a similarity calculation unit, a first similarity between academic documents in the academic document comparison group; calculating, by the similarity calculation unit, a second similarity between technical documents in the technical document comparison group; and calculating, by the similarity calculation unit, a third similarity between academic documents and technical documents in the academic and technical document comparison group.

According to an embodiment of the present invention, verifying the similarity between the academic documents and the technical documents may include: calculating the average of the calculated first, second, and third similarities; comparing the calculated average with each of the first, second, and third similarities; verifying the similarity as higher than the reference value when the difference between the average and each of the first, second, and third similarities falls within a reference range; and verifying the similarity as lower than the reference value when the difference falls outside the reference range.

According to an embodiment of the present invention, calculating the first similarity may include extracting first abstracts of a plurality of academic documents from the academic document comparison group, performing text mining on the extracted first abstracts to structure a first document-term matrix, and calculating the first similarity using the first document-term matrix. Calculating the second similarity may include extracting second abstracts of a plurality of technical documents from the technical document comparison group, performing text mining on the extracted second abstracts to structure a second document-term matrix, and calculating the second similarity using the second document-term matrix. Calculating the third similarity may include extracting third abstracts of a plurality of academic documents and fourth abstracts of a plurality of technical documents from the academic and technical document comparison group, performing text mining on the extracted third and fourth abstracts to structure a third document-term matrix, and calculating the third similarity using the third document-term matrix.

Based on verification of the similarity between academic documents and technical documents, the present invention can verify the performance of academic common keywords by comparing the author keywords of an academic document with the academic common keywords extracted from the parts of that document, and on that basis can derive the technical common keywords extracted from the parts of a technical document as technical document keywords.

The present invention can improve the reliability of the derived technical document keywords by using text-mining-based statistical verification.

The present invention can improve the validity of the derived technical document keywords by verifying the similarity between academic documents and technical documents using text-mining-based statistical verification.

By deriving technical document keywords based on verification of the similarity between academic documents and technical documents, the present invention can provide a variety of patent analysis methodologies and research directions.

FIG. 1 is a diagram illustrating the components of an apparatus for deriving technical document keywords according to an embodiment of the present invention.
FIGS. 2 to 5 are flowcharts relating to a method of deriving technical document keywords according to an embodiment of the present invention.

Hereinafter, various embodiments of this document are described with reference to the accompanying drawings.

The embodiments and the terms used in them are not intended to limit the technology described in this document to specific forms, and should be understood to include various modifications, equivalents, and/or alternatives of those embodiments.

In describing the various embodiments below, a detailed description of a related known function or configuration is omitted when it is judged that it could unnecessarily obscure the gist of the invention.

The terms used below are defined in consideration of their functions in the various embodiments and may vary according to the intention or practice of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

In the description of the drawings, similar reference numerals may be used for similar components.

A singular expression may include the plural unless the context clearly indicates otherwise.

In this document, expressions such as "A or B" or "at least one of A and/or B" may include all possible combinations of the items listed together.

Expressions such as "first," "second," "firstly," or "secondly" may modify the corresponding components regardless of order or importance, and are used only to distinguish one component from another without limiting those components.

When a (e.g., first) component is said to be "(functionally or communicatively) connected" or "coupled" to another (e.g., second) component, the former may be connected to the latter directly or through yet another component (e.g., a third component).

In this specification, "configured to (or set to)" may be used interchangeably, depending on the situation and in hardware or software terms, with "suitable for," "having the capacity to," "adapted to," "made to," "capable of," or "designed to."

In some situations, the expression "a device configured to" may mean that the device "can" perform an operation together with other devices or components.

For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing those operations, or a general-purpose processor (e.g., a CPU or application processor) capable of performing those operations by executing one or more software programs stored in a memory device.

In addition, the term "or" means an inclusive or rather than an exclusive or.

That is, unless stated otherwise or clear from the context, the expression "x uses a or b" means any of the natural inclusive permutations.

FIG. 1 is a diagram illustrating the components of an apparatus for deriving technical document keywords according to an embodiment of the present invention.

Terms such as "... unit" and "...er" used below denote a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

The apparatus for deriving technical document keywords according to an embodiment of the present invention includes a keyword derivation apparatus.

Referring to FIG. 1, the keyword derivation apparatus 100 includes a similarity verification unit 110, a keyword extraction unit 120, and a technical document keyword derivation unit 140.

According to an embodiment of the present invention, the similarity verification unit 110 verifies the similarity between the academic documents and the technical documents included in analysis data collected with the same search term.

For example, academic documents include papers, and technical documents include patents and scientific books.

According to another embodiment of the present invention, the similarity verification unit 110 includes a data classification unit 112 and a similarity calculation unit 114.

As an example, the data classification unit 112 classifies the analysis data collected with the same search term into an academic document comparison group, a technical document comparison group, and an academic and technical document comparison group.

For example, the academic document comparison group contains academic documents included in the analysis data collected with the same search term.

The technical document comparison group contains technical documents included in the analysis data collected with the same search term.

The academic and technical document comparison group contains both academic documents and technical documents included in the analysis data collected with the same search term.

As an example, the data classification unit 112 may use the same search term to collect a total of 60 academic documents from an academic document database and a total of 60 technical documents from a technical document database.

The data classification unit 112 may then classify 30 academic documents into the academic document comparison group, 30 technical documents into the technical document comparison group, and 30 academic documents and 30 technical documents into the academic and technical document comparison group.
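The 60/60 example above can be sketched as a simple split. The source gives only the group sizes, so the choice of which documents go to which group is an assumption here: the first 30 of each collection go to the single-type groups and the next 30 of each go to the mixed group. The function name and dictionary keys are hypothetical.

```python
def classify_analysis_data(academic_docs, technical_docs, group_size=30):
    """Split collected documents into the three comparison groups."""
    needed = 2 * group_size
    if len(academic_docs) < needed or len(technical_docs) < needed:
        raise ValueError("each collection needs at least 2 * group_size documents")
    return {
        "academic": academic_docs[:group_size],
        "technical": technical_docs[:group_size],
        # The mixed group holds group_size documents of each type.
        "academic_and_technical": (
            academic_docs[group_size:needed] + technical_docs[group_size:needed]
        ),
    }


if __name__ == "__main__":
    papers = [f"paper-{i}" for i in range(60)]
    patents = [f"patent-{i}" for i in range(60)]
    groups = classify_analysis_data(papers, patents)
    print({name: len(docs) for name, docs in groups.items()})
```

Keeping the groups disjoint is a design choice of this sketch, not something the source requires.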

As an example, the similarity calculation unit 114 may calculate the intra-group document similarity for each of the academic document comparison group, the technical document comparison group, and the academic and technical document comparison group.

For example, the similarity calculation unit 114 calculates a first similarity between academic documents in the academic document comparison group, a second similarity between technical documents in the technical document comparison group, and a third similarity between academic documents and technical documents in the academic and technical document comparison group.

As an example, the similarity calculation unit 114 extracts the abstract, which is a component common to all documents, from each of the academic document comparison group, the technical document comparison group, and the academic and technical document comparison group, and uses text mining to structure the abstracts in the form of a document-term matrix.

According to an embodiment of the present invention, the similarity calculation unit 114 extracts first abstracts of a plurality of academic documents from the academic document comparison group.

Next, the similarity calculation unit 114 may perform text mining on the extracted first abstracts to structure them into a first document-term matrix, and calculate the first similarity using the first document-term matrix.

As an example, the similarity calculation unit 114 may extract second abstracts of a plurality of technical documents from the technical document comparison group, perform text mining on the extracted second abstracts to structure them into a second document-term matrix, and calculate the second similarity using the second document-term matrix.

In addition, the similarity calculation unit 114 extracts third abstracts of a plurality of academic documents and fourth abstracts of a plurality of technical documents from the academic and technical document comparison group.

Next, the similarity calculation unit 114 may perform text mining on the extracted third and fourth abstracts to structure them into a third document-term matrix, and calculate the third similarity using the third document-term matrix.

For example, the similarity calculation unit 114 may structure the abstracts into a document-term matrix as shown in Table 1 below.
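Because Table 1 is not reproduced in this text, the following sketch shows one plausible shape for such a matrix: one row per abstract, one column per term, and raw term counts as cell values. The tokenizer (lowercase alphanumeric runs), the absence of stop-word removal, and the function names are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_term_matrix(abstracts):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in abstract i."""
    counts = [Counter(tokenize(a)) for a in abstracts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Two made-up abstracts, used only to show the output layout.
vocabulary, matrix = document_term_matrix([
    "Keyword extraction from patent text",
    "Patent keyword analysis",
])
```

Here `vocabulary` holds the column labels in alphabetical order and each row of `matrix` is the count vector of one abstract.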

[Table 1]

Next, the similarity calculation unit 114 calculates the similarity by applying the document-term matrix to Equation 1 below.

[Equation 1]

In Equation 1, Cosine similarity may represent the similarity, A may represent a row value of the document-term matrix, and B may represent a column value of the document-term matrix.
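The body of Equation 1 is not reproduced in this text. Assuming it is the standard cosine similarity between two vectors A and B with n components (the index i and the length n are notation introduced here, not taken from the source), it would read:

```latex
\text{Cosine similarity}
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```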

For example, Equation 1 may include the cosine distance formula.

As an example, the similarity calculation unit 114 may calculate the first similarity by applying the row values and column values of the first document-term matrix to the cosine distance formula.

In addition, the similarity calculation unit 114 may calculate the second similarity by applying the row values and column values of the second document-term matrix to the cosine distance formula.

In addition, the similarity calculation unit 114 may calculate the third similarity by applying the row values and column values of the third document-term matrix to the cosine distance formula.
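A minimal sketch of this calculation follows. The text does not state how a single group-level similarity is obtained from the matrix, so the sketch makes an assumption: each row (one document's term counts) is treated as a vector, and the group similarity is the mean cosine similarity over all pairs of rows. The function names are illustrative.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(matrix):
    """Assumed aggregation: average cosine similarity over every pair of rows."""
    pairs = list(combinations(matrix, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

For the academic and technical document comparison group, a variant that pairs only academic rows with technical rows would be equally consistent with the description.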

As an example, the similarity calculation unit 114 may calculate the first, second, and third similarities as shown in Table 2 below.

[Table 2]

According to an embodiment of the present invention, the similarity verification unit 110 calculates the average of the first similarity, the second similarity, and the third similarity.

Next, the similarity verification unit 110 compares the calculated average with each of the first, second, and third similarities. If the difference between the average and each of the three similarities falls within a reference range, it verifies the similarity between the academic documents and the technical documents as being higher than a reference value.

In addition, if the difference between the average and any of the first, second, and third similarities falls outside the reference range, the similarity verification unit 110 may verify the similarity between the academic documents and the technical documents as being lower than the reference value.

That is, the similarity verification unit 110 compares the similarities calculated within each group and, when they do not differ greatly, verifies that the similarity between academic documents and technical documents is high.

For example, the reference range may correspond to the allowable range of the difference obtained by comparing similarities, and the reference value may correspond to the threshold for deciding whether a similarity is high or low.
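The check against the average can be sketched as below. Two assumptions are made: the "difference" is taken as an absolute difference, and the reference range is a single non-negative tolerance. The numeric values in the usage lines are hypothetical and are not taken from Table 2.

```python
def verify_similarity(first, second, third, reference_range):
    """Return True when every group similarity lies within reference_range of the average."""
    similarities = (first, second, third)
    average = sum(similarities) / len(similarities)
    return all(abs(s - average) <= reference_range for s in similarities)


# Hypothetical values: the first set is tightly clustered, the second is not.
close = verify_similarity(0.30, 0.32, 0.31, reference_range=0.05)
spread = verify_similarity(0.10, 0.50, 0.30, reference_range=0.05)
```

A `True` result corresponds to the case where the similarity is verified as higher than the reference value, and `False` to the case where it is verified as lower.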

In addition, the similarity verification unit 110 may perform analysis of variance (ANOVA) on the first, second, and third similarities to verify the differences among the group similarities.
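As an illustration of the ANOVA step, the sketch below computes the one-way ANOVA F statistic from scratch. It assumes that each group contributes a list of individual similarity values (for example, one per document pair); the text does not specify what the observations are. The function name is illustrative.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numeric observations."""
    k = len(groups)
    if k < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n = sum(len(g) for g in groups)
    if n <= k:
        raise ValueError("need more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation explained by group membership versus variation inside the groups.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")

    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A small F value suggests that the group means do not differ much relative to the spread inside each group. Converting F into a p-value needs the F distribution, which this sketch omits; a statistics package such as SciPy (`scipy.stats.f_oneway`) provides both.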

That is, the present invention can improve the reliability of the derived technical document keywords by using text-mining-based statistical verification.

That is, the present invention can improve the validity of the derived technical document keywords by verifying the similarity between academic documents and technical documents using text-mining-based statistical verification.

According to an embodiment of the present invention, the keyword extraction unit 120 may include an academic keyword extraction unit (not shown) and a technical keyword extraction unit (not shown). The following description covers the configuration in which the keyword extraction unit 120 extracts academic common keywords from academic documents and technical common keywords from technical documents.

According to an embodiment of the present invention, when the similarity verified by the similarity verification unit 110 is higher than the reference value, the keyword extraction unit 120 may extract at least one academic common keyword that appears in common in a plurality of parts constituting an academic document.

For example, the plurality of parts constituting an academic document may include at least one of the abstract, introduction, and conclusion of the academic document.

According to an embodiment of the present invention, the keyword extraction unit 120 may extract at least one technical common keyword that appears in common in a plurality of parts constituting a technical document.

For example, the plurality of parts constituting a technical document may include at least one of the abstract, introduction, conclusion, claims, and title of the technical document.
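Common-keyword extraction can be sketched as a set intersection over the parts of one document. The sketch assumes that "appears in common" means the term occurs in every supplied part, and it uses a simplistic tokenizer without stop-word removal, which a practical system would need. The part names and sample sentences are made up.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def common_keywords(parts):
    """Return the terms that occur in every part; `parts` maps part name to text."""
    token_sets = [set(tokenize(text)) for text in parts.values()]
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


# Hypothetical parts of one academic document.
paper = {
    "abstract": "keyword extraction for patent search",
    "introduction": "patent search relies on keyword quality",
    "conclusion": "keyword based patent search improved",
}
shared = common_keywords(paper)
```

The same function applies to a technical document by passing its abstract, claims, title, and so on as the parts.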

According to an embodiment of the present invention, the keyword performance verification unit 130 verifies the keyword derivation performance by comparing the extracted academic common keywords with the author keywords of the academic document.

For example, the author keywords may be words that the author of an academic document designates, when writing it, as representing the characteristics of the document.

As an example, the keyword performance verification unit 130 may count the keywords that match between the extracted academic common keywords and the author keywords, and verify the keyword derivation performance by dividing that count by the number of extracted academic common keywords.

For example, based on the verified similarity between academic documents and technical documents, the present invention can explore methods of deriving technical document keywords by using the author keywords of academic documents as the keyword derivation criterion.

For example, when the accuracy ratio, calculated by dividing the number of matching keywords by the number of extracted academic common keywords, is higher than a keyword performance verification reference value, the keyword performance verification unit 130 may verify the keyword derivation performance as a positive grade.

In addition, when the accuracy ratio is lower than the keyword performance verification reference value, the keyword performance verification unit 130 may verify the keyword derivation performance as a negative grade.
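The accuracy ratio described above is the fraction of extracted keywords that also appear among the author keywords. A minimal sketch follows; the function names and the sample keywords are illustrative, and treating a ratio exactly equal to the reference value as negative is an assumption, since the text only covers the higher and lower cases.

```python
def derivation_accuracy(common_keywords, author_keywords):
    """Matching keywords divided by the number of extracted common keywords."""
    extracted = set(common_keywords)
    if not extracted:
        raise ValueError("no extracted keywords to evaluate")
    matched = extracted & set(author_keywords)
    return len(matched) / len(extracted)


def grade(accuracy, reference_value):
    """Positive when accuracy exceeds the reference value; equality is assumed negative."""
    return "positive" if accuracy > reference_value else "negative"


# Hypothetical keyword sets: two of the four extracted keywords match.
accuracy = derivation_accuracy(
    {"keyword", "patent", "search", "method"},
    {"keyword", "patent", "text mining"},
)
```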

As an example, the keyword performance verification unit 130 may rank the academic common keywords based on their occurrence frequency.

According to an embodiment of the present invention, based on the determined ranks, the keyword performance verification unit 130 excludes from the academic common keywords those that rank below a noise classification criterion.

As an example, the keyword performance verification unit 130 may verify the keyword derivation performance by calculating the ratio of the number of keywords matching the valid keywords to the number of extracted keywords.

As an example, the keyword performance verification unit 130 may verify the keyword derivation performance by calculating the ratio of the number of extracted academic common keywords that match the valid keywords to the number of valid keywords.

According to an embodiment of the present invention, after the keyword derivation performance is verified, the technical document keyword derivation unit 140 may derive the extracted technical common keywords as technical document keywords.

Based on the verified similarity between academic documents and technical documents, the present invention can explore methods of deriving technical document keywords by using the author keywords of academic documents as the keyword derivation criterion.

In addition, after the ratio is calculated for keywords extracted by TF-IDF, the accuracy of the TF-IDF keywords can be compared with the accuracy of the technical document keywords derived by the technical document keyword derivation unit 140.

Instead of TF-IDF, the present invention can derive technical document keywords using the occurrence frequency of a word and whether the word appears in common across the parts of a document.
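For reference, a TF-IDF baseline of the kind used for comparison can be sketched as follows. TF-IDF has several variants and the text does not say which one is meant; this sketch assumes term frequency normalized by document length and an unsmoothed inverse document frequency of log(N / df). The function name and the sample tokens are illustrative.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document, using tf = count / len and idf = log(N / df)."""
    n_docs = len(tokenized_docs)
    document_frequency = Counter()
    for doc in tokenized_docs:
        document_frequency.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / document_frequency[term])
            for term, count in term_counts.items()
        })
    return scores


# Two tiny hypothetical documents, already tokenized.
scores = tf_idf([["patent", "keyword", "keyword"], ["patent", "search"]])
```

Under this variant a term that occurs in every document, such as "patent" here, scores zero, which is one reason a frequency-plus-commonality approach can select different keywords.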

According to another embodiment of the present invention, the keyword derivation apparatus 100 may further include a data collection unit (not shown).

As an example, the data collection unit (not shown) may collect academic documents from an academic document database using a search term.

For example, the data collection unit (not shown) may collect technical documents from a technical document database using the same search term.

That is, the data collection unit (not shown) may collect analysis data including academic documents and technical documents by using the same search term in the academic document database and the technical document database.

In addition, by deriving technical document keywords based on verification of the similarity of technical documents, the present invention makes it possible to use the keywords of academic documents as indicators for patent analysis.

FIG. 2 is a flowchart of a method of deriving technical document keywords according to an embodiment of the present invention.

Specifically, FIG. 2 illustrates a procedure in which the method derives, as technical document keywords, academic common keywords that appear in common in a plurality of parts constituting an academic document.

Referring to FIG. 2, in step 201 the method verifies the similarity between academic documents and technical documents.

That is, the method verifies the similarity between the academic documents and the technical documents included in the analysis data collected using the same search term.

As an example, the method classifies the plurality of academic documents and the plurality of technical documents included in the analysis data collected using the same search term into three groups.

Next, the method calculates the similarity in each group to verify the similarity between the academic documents and the technical documents.

In step 202, the method extracts academic common keywords from the academic documents.

That is, when the verified similarity is higher than the reference value, the method extracts at least one academic common keyword that appears in common in a plurality of parts constituting an academic document.

In step 203, the method verifies the derivation performance by comparing the academic common keywords with the author keywords.

That is, the method verifies the keyword derivation performance by calculating the match ratio between the academic common keywords and the author keywords.

In step 204, the method extracts technical common keywords from the technical documents.

That is, the method extracts at least one technical common keyword that appears in common in a plurality of parts constituting a technical document.

The method may then derive the extracted technical common keywords as technical document keywords.

FIG. 3 is a flowchart of a method of deriving technical document keywords according to an embodiment of the present invention.

FIG. 3 illustrates a procedure in which the method classifies the analysis data collected using the same search term into groups and then calculates the similarity for each group.

Referring to FIG. 3, in step 301 the method classifies the collected analysis data into groups.

That is, the method classifies the collected analysis data into the academic document comparison group, the technical document comparison group, and the academic and technical document comparison group.

In other words, the method classifies each of the academic documents and technical documents included in the analysis data into one of the academic document comparison group, the technical document comparison group, and the academic and technical document comparison group.

In step 302, the method calculates the first similarity in the academic document comparison group.

That is, the method extracts the abstracts from the plurality of academic documents classified into the academic document comparison group, performs text mining on the abstract of each document to structure them into a first document-term matrix, and then calculates the first similarity by applying the row values and column values of the first document-term matrix to Equation 1.

In step 303, the method calculates the second similarity in the technical document comparison group.

That is, the method extracts the abstracts from the plurality of technical documents classified into the technical document comparison group.

Next, the method performs text mining on the abstract of each document to structure them into a second document-term matrix, and then calculates the second similarity by applying the row values and column values of the second document-term matrix to Equation 1.

In step 304, the method calculates the third similarity in the academic and technical document comparison group.

That is, the method extracts the abstracts of the academic documents and the abstracts of the technical documents from the academic and technical document comparison group.

Next, the method may perform text mining on the extracted abstracts to structure them into a third document-term matrix, and calculate the third similarity using the third document-term matrix.

That is, the method may calculate the third similarity by applying the row values and column values of the third document-term matrix to Equation 1.

For example, academic documents and technical documents both contain an abstract. The method can therefore extract the abstract from the documents in each group.

FIG. 4 is a flowchart of a method of deriving technical document keywords according to an embodiment of the present invention.

FIG. 4 illustrates a procedure in which the method calculates the average of the similarities calculated in each group and uses the group similarities and their average to determine the similarity between the academic documents and technical documents included in the analysis data.

Referring to FIG. 4, in step 401 the method calculates the average of the first, second, and third similarities.

In step 402, the method compares the first similarity with the average and proceeds to step 403 if the difference between them falls within the reference range. If the difference falls outside the reference range, the method proceeds to step 406.

That is, if the difference between the first similarity and the average is within the reference range, the method proceeds to step 403 to compare the second similarity; if the difference is outside the reference range, the method determines in step 406 that the similarity between the academic documents and the technical documents is lower than the reference value.

In step 403, the method compares the second similarity with the average and proceeds to step 404 if the difference between them falls within the reference range. If the difference falls outside the reference range, the method proceeds to step 406.

That is, if the difference between the second similarity and the average is within the reference range, the method proceeds to step 404 to compare the third similarity; if the difference is outside the reference range, the method determines in step 406 that the similarity between the academic documents and the technical documents is lower than the reference value.

In step 404, the method compares the third similarity with the average and proceeds to step 405 if the difference between them falls within the reference range. If the difference falls outside the reference range, the method proceeds to step 406.

That is, if the difference between the third similarity and the average is within the reference range, the method determines in step 405 that the similarity between the academic documents and the technical documents is higher than the reference value, and ends the procedure.

In other words, when the first, second, and third similarities do not differ greatly from the average, the method may determine that the similarity between the academic documents and technical documents included in the analysis data is high.

The description above covers the procedure in which the method compares the similarity of each group with the average.

However, the method is not limited to this; it may determine the similarity between academic documents and technical documents by comparing the group similarities with one another, without the step of calculating the average.
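The variant without an average can be sketched as a pairwise comparison of the group similarities. As before, the absolute difference and a single tolerance value are assumptions, and the numbers in the usage lines are hypothetical.

```python
from itertools import combinations


def verify_pairwise(similarities, reference_range):
    """Return True when every pair of group similarities differs by at most reference_range."""
    return all(
        abs(a - b) <= reference_range
        for a, b in combinations(similarities, 2)
    )


# Hypothetical group similarities: the second set has one pair that is 0.10 apart.
consistent = verify_pairwise([0.30, 0.32, 0.31], reference_range=0.05)
inconsistent = verify_pairwise([0.30, 0.40, 0.33], reference_range=0.05)
```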

도 5는 본 발명의 일실시예에 따른 기술 문서 키워드를 도출하는 방법과 관련된 흐름도를 설명하는 도면이다.5 is a diagram illustrating a flowchart related to a method of deriving a technical document keyword according to an embodiment of the present invention.

도 5는 기술 문서 키워드를 도출하는 방법이 학술 공통 키워드의 개시 빈도에 기초하여 노이즈 키워드를 제외하는 절차를 예시한다.5 illustrates a procedure in which a method of deriving a technical document keyword excludes a noise keyword based on a frequency of start of the academic common keyword.

도 5를 참고하면, 단계(501)에서 기술 문서 키워드를 도출하는 방법은 학술 문서에서 적어도 하나 이상의 학술 공통 키워드를 추출한다.Referring to FIG. 5, in operation 501, a method of deriving a technical document keyword extracts at least one academic common keyword from an academic document.

즉, 기술 문서 키워드를 도출하는 방법은 학술 문서와 기술 문서 간에 유사도가 검증되는 학술 문서를 구성하는 복수의 부분에 동시에 개시되는 적어도 하나 이상의 학술 공통 키워드를 추출한다.That is, the method of deriving the technical document keyword extracts at least one academic common keyword that is simultaneously disclosed in a plurality of parts constituting the academic document whose similarity is verified between the academic document and the technical document.

단계(502)에서 기술 문서 키워드를 도출하는 방법은 각 학술 공통 키워드의 개시 빈도를 산출한다.The method of deriving the technical document keyword in step 502 calculates the starting frequency of each academic common keyword.

즉, 기술 문서 키워드를 도출하는 방법은 문서 내에서 학술 공통 키워드의 개시 빈도 또는 복수의 학술 문서 내에서 학술 공통 키워드의 개시 빈도를 텍스트 마이닝에 기반하여 산출할 수 있다.That is, the method of deriving the technical document keyword may calculate the starting frequency of the academic common keyword in the document or the starting frequency of the academic common keyword in the plurality of academic documents based on the text mining.

단계(503)에서 기술 문서 키워드를 도출하는 방법은 산출된 개시 빈도에 기초하여 적어도 하나 이상의 학술 공통 키워드의 순위를 결정한다.The method of deriving the technical document keywords in step 503 determines the ranking of the at least one academic common keyword based on the calculated starting frequency.

즉, 기술 문서 키워드를 도출하는 방법은 학술 문서 내에서 학술 공통 키워드의 개시 빈도를 산출하고, 산출된 개시 빈도에 따라 각 학술 공통 키워드의 순위를 결정한다.That is, the method of deriving the technical document keyword calculates the starting frequency of the academic common keywords in the academic document, and determines the ranking of each academic common keyword according to the calculated starting frequency.

여기서, 기술 문서 키워드를 도출하는 방법은 개시 빈도가 많은 순서대로 학술 공통 키워드의 순위를 결정한다.Here, the method for deriving the technical document keyword determines the ranking of the academic common keyword in the order of high frequency of disclosure.

단계(504)에서 기술 문서 키워드를 도출하는 방법은 학술 공통 키워드의 순위와 노이즈 분류 기준을 비교한다.The method of deriving the technical document keyword in step 504 compares the ranking of the academic common keyword with the noise classification criteria.

즉, 기술 문서 키워드를 도출하는 방법은 학술 공통 키워드의 순위가 노이즈 분류 기준보다 상대적으로 클 경우, 단계(505)로 진행하고, 반대의 경우 단계(506)로 진행한다.That is, the method of deriving the technical document keyword proceeds to step 505 when the rank of the academic common keyword is relatively larger than the noise classification criterion and vice versa.

In step 505, the method of deriving technical document keywords determines the academic common keyword to be a valid keyword and derives it as a technical document keyword.

That is, the method derives, as technical document keywords, the academic common keywords that were compared against the noise classification criterion and ranked above it.

In step 506, the method of deriving technical document keywords determines the academic common keyword to be a noise keyword and excludes it from derivation.

That is, the method removes, from the extracted academic common keywords, those that were compared against the noise classification criterion and did not rank above it.

For example, the noise classification criterion may correspond to the lower one-third position in the ranking of the academic common keywords.

That is, the method may determine only the academic common keywords in the upper two-thirds of the ranking to be valid keywords.
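
Steps 503 to 506 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the single-token keyword matching, the alphabetical tie-breaking, and the rounding down of the noise share are all assumptions made for the example.

```python
import re
from collections import Counter


def rank_common_keywords(documents, common_keywords):
    """Rank keywords by how often they appear across the documents (step 503).

    Assumption: keywords are single lowercase tokens; ties are broken
    alphabetically so that the result is deterministic.
    """
    counts = Counter()
    for text in documents:
        token_counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for keyword in common_keywords:
            counts[keyword] += token_counts[keyword]
    return sorted(common_keywords, key=lambda kw: (-counts[kw], kw))


def split_valid_and_noise(ranked_keywords, noise_num=1, noise_den=3):
    """Split a ranking into valid and noise keywords (steps 504 to 506).

    The lowest noise_num/noise_den share of the ranking is treated as noise.
    Assumption: the share is rounded down when it is not a whole number.
    """
    n_noise = len(ranked_keywords) * noise_num // noise_den
    cut = len(ranked_keywords) - n_noise
    return ranked_keywords[:cut], ranked_keywords[cut:]


# Hypothetical sample data for illustration only.
documents = [
    "sensor network sensor battery",
    "sensor network protocol",
    "battery sensor",
]
keywords = ["sensor", "network", "battery", "protocol", "routing", "latency"]

ranking = rank_common_keywords(documents, keywords)
valid, noise = split_valid_and_noise(ranking)
```

With this sample data, six keywords are ranked, the lowest two (one-third) are classified as noise, and the remaining four are kept as valid keywords.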

Methods according to the embodiments described in the claims or the specification of the present invention may be implemented in hardware, software, or a combination of hardware and software.

Such software may be stored on a computer-readable storage medium. The computer-readable storage medium stores at least one program (software module) that includes instructions which, when executed by at least one processor in an electronic device, cause the electronic device to carry out the method of the present invention.

Such software may be stored in the form of volatile or non-volatile storage such as read-only memory (ROM); in the form of memory such as random access memory (RAM), memory chips, devices, or integrated circuits; or on an optically or magnetically readable medium such as a compact disc ROM (CD-ROM), digital versatile discs (DVDs), a magnetic disk, or magnetic tape.

Storage devices and storage media are embodiments of machine-readable storage means suitable for storing a program or programs that include instructions which, when executed, implement the embodiments.

In the specific embodiments described above, the components included in the invention are expressed in the singular or the plural according to the specific embodiment presented.

However, the singular or plural expression was chosen to suit the situation presented, for convenience of description, and the embodiments described above are not limited to singular or plural components. A component expressed in the plural may consist of a single element, and a component expressed in the singular may consist of a plurality of elements.

Although specific embodiments have been described in the description of the invention, various modifications are of course possible without departing from the scope of the technical idea of the various embodiments.

Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and their equivalents.

100: keyword derivation apparatus 110: similarity verification unit
112: data classification unit 114: similarity calculation unit
130: keyword extraction unit 130: technical document keyword derivation unit

Claims

A device for deriving technical document keywords, comprising:
a similarity verification unit that verifies a similarity between an academic document and a technical document included in analysis data collected based on the same search term;
an academic keyword extraction unit that, when the verified similarity is higher than a reference value, extracts at least one academic common keyword that appears in common across a plurality of parts constituting the academic document;
a keyword performance verification unit that verifies keyword derivation performance by comparing the extracted at least one academic common keyword with author keywords of the academic document;
a technical keyword extraction unit that extracts at least one technical common keyword that appears in common across a plurality of parts constituting the technical document;
a technical document keyword derivation unit that derives the extracted at least one technical common keyword as a technical document keyword;
a data classification unit that classifies the collected analysis data into an academic document comparison group, a technical document comparison group, and an academic and technical document comparison group; and
a similarity calculation unit that calculates a first similarity between academic documents in the academic document comparison group, a second similarity between technical documents in the technical document comparison group, and a third similarity between academic documents and technical documents in the academic and technical document comparison group.

(Deleted)

The device of claim 1, wherein the similarity verification unit:
calculates an average value of the calculated first similarity, the calculated second similarity, and the calculated third similarity;
compares the average value with each of the calculated first similarity, the calculated second similarity, and the calculated third similarity;
verifies that the similarity is higher than the reference value when the difference between the average value and each of the first, second, and third similarities falls within a preset reference range; and
verifies that the similarity is lower than the reference value when the difference falls outside the reference range.
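
The verification logic in this claim can be illustrated with a short sketch. The function name and the reading of "within the reference range" as an absolute deviation no greater than a threshold are assumptions; the claim does not fix how the range is expressed.

```python
def verify_similarity(first, second, third, reference_range):
    """Return True when every similarity lies close enough to their average.

    Assumption: "within the preset reference range" is read as
    |similarity - average| <= reference_range for all three similarities.
    """
    similarities = (first, second, third)
    average = sum(similarities) / len(similarities)
    return all(abs(s - average) <= reference_range for s in similarities)


# Hypothetical values for illustration only.
consistent = verify_similarity(0.62, 0.58, 0.60, reference_range=0.05)
inconsistent = verify_similarity(0.90, 0.20, 0.40, reference_range=0.05)
```

In the first call all three similarities sit within 0.05 of their average, so the check passes; in the second call they are widely spread, so it fails.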

The device of claim 1, wherein the similarity calculation unit extracts a first summary of a plurality of academic documents from the academic document comparison group, performs text mining on the extracted first summary to form a standardized first document-word matrix, and calculates the first similarity using the standardized first document-word matrix.

The device of claim 4, wherein the similarity calculation unit extracts a second summary of a plurality of technical documents from the technical document comparison group, performs text mining on the extracted second summary to form a standardized second document-word matrix, and calculates the second similarity using the standardized second document-word matrix.

The device of claim 5, wherein the similarity calculation unit extracts a third summary of a plurality of academic documents and a fourth summary of a plurality of technical documents from the academic and technical document comparison group, performs text mining on the extracted third summary and the extracted fourth summary to form a standardized third document-word matrix, and calculates the third similarity using the standardized third document-word matrix.

The device of claim 6, wherein the similarity calculation unit:
calculates the first similarity by applying row values and column values of the first document-word matrix to a cosine distance equation;
calculates the second similarity by applying row values and column values of the second document-word matrix to the cosine distance equation; and
calculates the third similarity by applying row values and column values of the third document-word matrix to the cosine distance equation.
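
A document-word matrix and a cosine-based similarity of the kind these claims describe can be sketched as below. This is an illustration under stated assumptions: the tokenizer, the use of raw term counts (no separate standardization step), and the averaging of pairwise similarities into one group-level value are choices made for the example and are not specified in the claims.

```python
import math
import re
from collections import Counter
from itertools import combinations


def document_term_matrix(summaries):
    """Build a matrix with one row per summary and one column per word."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in summaries]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two rows; 0.0 if either row is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(matrix):
    """Average cosine similarity over all pairs of rows (an assumed aggregation)."""
    pairs = list(combinations(matrix, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

The same functions could be applied to each comparison group in turn: academic summaries for the first similarity, technical summaries for the second, and the combined set for the third.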

A device for deriving technical document keywords, comprising:
a similarity verification unit that verifies a similarity between an academic document and a technical document included in analysis data collected based on the same search term;
an academic keyword extraction unit that, when the verified similarity is higher than a reference value, extracts at least one academic common keyword that appears in common across a plurality of parts constituting the academic document;
a keyword performance verification unit that verifies keyword derivation performance by comparing the extracted at least one academic common keyword with author keywords of the academic document;
a technical keyword extraction unit that extracts at least one technical common keyword that appears in common across a plurality of parts constituting the technical document; and
a technical document keyword derivation unit that derives the extracted at least one technical common keyword as a technical document keyword,
wherein the keyword performance verification unit counts the keywords that match between the extracted at least one academic common keyword and the author keywords, and verifies the keyword derivation performance by dividing the number of matching keywords by the number of the extracted at least one academic common keyword.
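
The ratio described in this claim (matching keywords divided by extracted keywords) can be sketched as follows. The function name, the case-insensitive exact matching, and the return value of 0.0 for an empty extraction are assumptions made for the example.

```python
def keyword_derivation_performance(extracted_keywords, author_keywords):
    """Share of extracted keywords that also appear among the author keywords.

    Assumption: keywords match when equal after trimming and lowercasing.
    """
    extracted = {k.strip().lower() for k in extracted_keywords}
    authors = {k.strip().lower() for k in author_keywords}
    if not extracted:
        return 0.0
    return len(extracted & authors) / len(extracted)


# Hypothetical keywords for illustration only.
score = keyword_derivation_performance(
    ["sensor", "network", "battery", "protocol"],
    ["Sensor", "Protocol", "energy"],
)
```

Here two of the four extracted keywords match author keywords, so the score is 0.5.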

The device of claim 8, wherein the technical document keyword derivation unit determines a ranking of the at least one academic common keyword based on a frequency of appearance of the at least one academic common keyword.

The device of claim 9, wherein the technical document keyword derivation unit, based on the determined ranking, excludes from the at least one academic common keyword any academic common keyword whose rank is lower than a noise classification criterion.

A device for deriving technical document keywords, comprising:
a similarity verification unit that verifies a similarity between an academic document and a technical document included in analysis data collected based on the same search term;
an academic keyword extraction unit that, when the verified similarity is higher than a reference value, extracts at least one academic common keyword that appears in common across a plurality of parts constituting the academic document;
a keyword performance verification unit that verifies keyword derivation performance by comparing the extracted at least one academic common keyword with author keywords of the academic document;
a technical keyword extraction unit that extracts at least one technical common keyword that appears in common across a plurality of parts constituting the technical document; and
a technical document keyword derivation unit that derives the extracted at least one technical common keyword as a technical document keyword,
wherein the plurality of parts constituting the academic document include at least one of a summary, an introduction, and a conclusion of the academic document, and
the plurality of parts constituting the technical document include at least one of a summary, an introduction, a conclusion, a claim, and a title of the technical document.
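
Extracting keywords that appear in common across document parts can be sketched as a set intersection. This is one possible reading, not the claimed implementation: requiring a word to appear in every supplied part, the tokenizer, and the small stop-word list are all assumptions made for the example.

```python
import re

# Illustrative stop-word list (an assumption, not taken from the patent).
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "in", "to", "is", "for", "we", "this"}
)


def common_keywords(parts):
    """Return the words that occur in every part of a document.

    `parts` maps a part name (e.g. "summary") to its text.
    """
    token_sets = []
    for text in parts.values():
        tokens = set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS
        token_sets.append(tokens)
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


# Hypothetical academic document parts for illustration only.
paper = {
    "summary": "We propose a sensor network protocol",
    "introduction": "Sensor network deployments need a protocol for routing",
    "conclusion": "The protocol improves sensor network lifetime",
}
shared = common_keywords(paper)
```

For this sample, the words shared by the summary, introduction, and conclusion are "sensor", "network", and "protocol".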

A method of deriving technical document keywords, comprising:
verifying, by a similarity verification unit, a similarity between an academic document and a technical document included in analysis data collected based on the same search term;
extracting, by an academic keyword extraction unit, when the verified similarity is higher than a reference value, at least one academic common keyword that appears in common across a plurality of parts constituting the academic document;
verifying keyword derivation performance by comparing the extracted at least one academic common keyword with author keywords of the academic document;
extracting, by a technical keyword extraction unit, at least one technical common keyword that appears in common across a plurality of parts constituting the technical document; and
deriving, by a technical document keyword derivation unit, the extracted at least one technical common keyword as a technical document keyword,
wherein the verifying of the similarity between the academic document and the technical document comprises:
classifying, by a data classification unit, the collected analysis data into an academic document comparison group, a technical document comparison group, and an academic and technical document comparison group;
calculating, by a similarity calculation unit, a first similarity between academic documents in the academic document comparison group;
calculating, by the similarity calculation unit, a second similarity between technical documents in the technical document comparison group; and
calculating, by the similarity calculation unit, a third similarity between academic documents and technical documents in the academic and technical document comparison group.

(Deleted)

The method of claim 12, wherein the verifying of the similarity between the academic document and the technical document further comprises:
calculating an average value of the calculated first similarity, the calculated second similarity, and the calculated third similarity;
comparing the calculated average value with each of the calculated first similarity, the calculated second similarity, and the calculated third similarity;
verifying that the similarity is higher than the reference value when the difference between the calculated average value and each of the first, second, and third similarities falls within a preset reference range; and
verifying that the similarity is lower than the reference value when the difference falls outside the reference range.

The method of claim 12, wherein the calculating of the first similarity comprises:
extracting a first summary of a plurality of academic documents from the academic document comparison group;
performing text mining on the extracted first summary to form a standardized first document-word matrix; and
calculating the first similarity using the standardized first document-word matrix,
wherein the calculating of the second similarity comprises:
extracting a second summary of a plurality of technical documents from the technical document comparison group;
performing text mining on the extracted second summary to form a standardized second document-word matrix; and
calculating the second similarity using the standardized second document-word matrix, and
wherein the calculating of the third similarity comprises:
extracting a third summary of a plurality of academic documents and a fourth summary of a plurality of technical documents from the academic and technical document comparison group;
performing text mining on the extracted third summary and the extracted fourth summary to form a standardized third document-word matrix; and
calculating the third similarity using the standardized third document-word matrix.