KR101620841B1

KR101620841B1 - Patent Analysis Method using A Hierarchical Diagram of Technology based on Statistical Patent Analysis

Info

Publication number: KR101620841B1
Application number: KR1020140143056A
Authority: KR
Inventors: 박상성; 장동식; 이홍철; 김갑조; 이준혁; 이준석; 전성해; 강지호
Original assignee: 고려대학교 산학협력단
Priority date: 2014-10-22
Filing date: 2014-10-22
Publication date: 2016-05-23
Also published as: KR20160047112A

Abstract

The present invention relates to a method of analyzing a technology using patent documents. The method comprises: extracting words from the documents to be analyzed; generating a document-word matrix using the extracted words; generating a document-keyword matrix by selecting keywords from the generated document-word matrix; generating a regression model using the document-keyword matrix; selecting the keywords corresponding to those parameters of the generated regression model whose significance probability (p-value) is at or below a first threshold; and deriving relationships between the keywords using the significance probabilities between the selected keywords. A technology hierarchy is thereby derived through statistical analysis, and using the derived hierarchy enables objective and accurate patent analysis.

Description

TECHNICAL FIELD The present invention relates to a patent analysis method using a hierarchical diagram of technology model based on statistical analysis.

The present invention relates to a method for analyzing patents and, more specifically, to a method for deriving a technology hierarchy using statistical analysis and analyzing patents using the derived technology hierarchy.

Technology forecasting (TF) has recently emerged as an important issue in the management of technology (MOT). Most MOT R&D policies depend on the results of various technology forecasts. Traditional technology forecasting used qualitative approaches such as the Delphi method. Although the Delphi method has been a representative technique in technology forecasting, it is not well suited to the task because its results are subjective.

Because of this problem, most recent technology forecasting techniques have been shifting their focus to quantitative approaches such as patent analysis. Patents contain a wide range of information on developed technologies and are one of the objective forms of data used in technology forecasting.

Korean Laid-Open Patent "Document Feature Analysis Apparatus for Documents under Investigation" (10-2006-0095565)

The problem to be solved by the present invention is to provide a method of deriving a technology hierarchy using statistical analysis and analyzing patents using the derived technology hierarchy.

To solve the above problem, the present invention provides a method of analyzing a technology using patent documents, the method comprising: extracting words from the documents to be analyzed; generating a document-word matrix using the extracted words; generating a document-keyword matrix by selecting keywords from the generated document-word matrix; generating a regression model using the document-keyword matrix; selecting the keywords corresponding to those parameters of the generated regression model whose significance probability (p-value) is at or below a first threshold; and deriving relationships between the keywords using the significance probabilities between the selected keywords.

According to another embodiment of the present invention, deriving the relationships between the keywords may comprise: selecting, as first keywords, one or more of the selected keywords that correspond to parameters having a significance probability at or below a second threshold; and selecting, as keywords related to a first keyword, the keywords corresponding to parameters having a significance probability at or below a third threshold, through regression model analysis between the first keyword and the keywords other than the first keyword.

According to another embodiment of the present invention, deriving the relationships between the keywords may comprise dividing the keywords into hierarchical levels using the significance probabilities between the keywords.

According to another embodiment of the present invention, generating the document-keyword matrix may comprise: selecting, from the document-word matrix, words having an occurrence frequency at or above a fourth threshold; and removing, from the selected words, words that are unrelated to technology.

According to another embodiment of the present invention, the documents to be analyzed are patent documents, and extracting the words may comprise extracting words from one or more of the title of the invention, the abstract, the claims, and the detailed description of the invention of each patent document.

According to another embodiment of the present invention, extracting the words may be performed using sentence analysis (parsing) and corpus analysis.

According to the present invention, core keywords are no longer derived by conventional qualitative means such as expert opinion or the Delphi technique. Instead, a technology hierarchy is derived using statistical analysis, and patents are analyzed using the derived hierarchy. Objective patent analysis is thereby possible.

FIG. 1 is a flowchart of a patent analysis method according to an embodiment of the present invention.
FIGS. 2 and 3 are flowcharts of a patent analysis method according to another embodiment of the present invention.
FIG. 4 is a flowchart of a patent analysis method according to yet another embodiment of the present invention.
FIGS. 5 and 6 show technology hierarchies derived by the patent analysis method according to an embodiment of the present invention.

Before the details of the present invention are described, an outline of the solution to the problem addressed by the invention, or the core of its technical idea, is presented first for ease of understanding.

A method of analyzing a technology using patent documents according to an embodiment of the present invention comprises: extracting words from the documents to be analyzed; generating a document-word matrix using the extracted words; generating a document-keyword matrix by selecting keywords from the generated document-word matrix; generating a regression model using the document-keyword matrix; selecting the keywords corresponding to those parameters of the generated regression model whose significance probability (p-value) is at or below a first threshold; and deriving relationships between the keywords using the significance probabilities between the selected keywords.

Hereinafter, embodiments that a person of ordinary skill in the art to which the present invention pertains can readily practice are described in detail with reference to the accompanying drawings. These embodiments are intended to describe the invention more specifically, and it will be apparent to those of ordinary skill in the art that the scope of the invention is not limited by them.

The configuration of the invention that clarifies the solution to the problem is described in detail with reference to the accompanying drawings on the basis of preferred embodiments of the invention. Note in advance that, where necessary in describing one drawing, components of other drawings may be cited. In describing the operating principle of the preferred embodiments in detail, detailed descriptions of known functions or configurations related to the invention, and of other matters, are omitted where they are judged likely to obscure the gist of the invention unnecessarily.

FIG. 1 is a flowchart of a patent analysis method according to an embodiment of the present invention.

A method of analyzing a technology using patent documents according to an embodiment of the present invention is implemented through the following series of steps.

Step 110 extracts words from the documents to be analyzed.

More specifically, words are extracted from the documents to be analyzed using a text mining method or the like. Text mining is a mining technique that extracts patterns or relationships from unstructured natural-language text data in order to find valuable and meaningful information. It builds on natural language processing technology that can understand human language. The words may be extracted using sentence analysis (parsing) and corpus analysis.
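As a rough illustration of step 110, the sketch below splits an abstract into lowercase word tokens. The regular-expression tokenizer, the function name `extract_words`, and the sample sentence are assumptions made for illustration; the patent only states that parsing and corpus analysis are used, without naming a tool.

```python
import re


def extract_words(text):
    """Lowercase the text and return its alphabetic tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical abstract text, not taken from the patent data set.
abstract = "A telematics terminal transmits vehicle information to the service center."
words = extract_words(abstract)
print(words)
```

A production pipeline would more likely rely on a text mining package for stemming and corpus handling; the regex stands in for that step only.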

The documents may be documents retrieved for the technology or technical field to be analyzed.

The documents to be analyzed are text documents and may in particular be patent documents. When extracting words from a patent document, words may be extracted from one or more of the title of the invention, the abstract, the claims, and the detailed description, for accuracy and efficiency in extracting significant words. The title, abstract, and claims contain the core of the invention addressed by the patent document, and the details of the invention are contained in the detailed description, which is why words may be extracted from one or more of these parts. Of these parts, using the abstract of the patent is preferable.

Step 120 generates a document-word matrix using the extracted words.

More specifically, a document-word matrix (document-term matrix, DTM) is generated using the analyzed documents and the extracted words. The matrix is constructed using a corpus and the text repository of the text mining technique. The document-word matrix represents the relationship between documents and words, showing how often each word appears in each document. The rows and columns of the matrix correspond to words and documents. Each element is the occurrence frequency of a word in a document.
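A minimal sketch of step 120, assuming the documents have already been tokenized. The function name `build_dtm` and the sample tokens are hypothetical, and the orientation (one row per document, one column per word) is an assumption that follows the experimental section of this document.

```python
from collections import Counter


def build_dtm(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[d][w] counts word w in document d."""
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    counts = [Counter(doc) for doc in tokenized_docs]
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix


# Hypothetical tokenized abstracts.
docs = [
    ["telematics", "terminal", "vehicle", "terminal"],
    ["vehicle", "service", "center"],
]
vocab, dtm = build_dtm(docs)
print(vocab)  # ['center', 'service', 'telematics', 'terminal', 'vehicle']
print(dtm)    # [[0, 0, 1, 2, 1], [1, 1, 0, 0, 1]]
```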

Step 130 generates a document-keyword matrix by selecting keywords from the generated document-word matrix.

More specifically, keywords are selected from among the words to generate a document-keyword matrix. Selecting keywords from the words reduces the number of rows, which enables fast and accurate analysis.

In generating the document-keyword matrix, steps 310 to 320 of FIG. 3 may be performed. In step 310, words having an occurrence frequency at or above a fourth threshold may be selected from the document-word matrix. The fourth threshold may be preset or may be determined according to the number of rows of the document-word matrix. Because a word that occurs more frequently is more likely to be a keyword, words may be selected according to their occurrence frequency. In step 320, words unrelated to technology are removed from the selected words to generate the document-keyword matrix. Removing words that occur frequently but are unrelated to technology improves accuracy. For example, meaningless words such as "is" and "the" may be removed. The words to be removed may be preset, or specific words may be removed according to a user's setting input.
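Steps 310 and 320 can be sketched as a frequency filter followed by a stopword filter. The names `select_keywords`, `min_count`, and `stopwords`, as well as the toy matrix, are illustrative assumptions; the patent does not fix a value for the fourth threshold or a particular list of words to remove.

```python
def select_keywords(vocab, dtm, min_count, stopwords):
    """Keep words whose total frequency is >= min_count and that are not stopwords.

    dtm is assumed to have one row per document and one column per word.
    """
    totals = [sum(row[j] for row in dtm) for j in range(len(vocab))]
    keep = [
        j for j, word in enumerate(vocab)
        if totals[j] >= min_count and word not in stopwords
    ]
    keywords = [vocab[j] for j in keep]
    dkm = [[row[j] for j in keep] for row in dtm]  # document-keyword matrix
    return keywords, dkm


# Hypothetical document-word matrix with three documents.
vocab = ["is", "terminal", "the", "unit", "vehicle"]
dtm = [
    [2, 1, 3, 0, 1],
    [1, 2, 2, 1, 0],
    [1, 0, 1, 0, 2],
]
keywords, dkm = select_keywords(vocab, dtm, min_count=3, stopwords={"is", "the"})
print(keywords)  # ['terminal', 'vehicle']
print(dkm)       # [[1, 1], [2, 0], [0, 2]]
```

Here "unit" is dropped by the frequency threshold, while "is" and "the" are dropped as words unrelated to technology even though they are frequent.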

Step 140 generates a regression model using the document-keyword matrix.

More specifically, a regression model is generated using the document-keyword matrix. A regression model is a statistical technique that estimates the effect of one or more independent variables on a dependent variable, and it enables statistical analysis of the keywords. That is, by generating a regression model, statistically significant keywords can be found among the selected keywords. Keyword selection alone is not a statistically significant analysis, whereas using a regression model makes a statistically significant analysis possible. The regression model is generated from the document-keyword matrix and can be expressed as follows.

Z is the dependent variable and the X_i are the independent variables. The dependent variable is the affected technology, and the independent variables are the developed technologies. That is, the dependent variable is the technology to be analyzed, and the independent variables are technologies developed in relation to it, corresponding to the keywords derived through document analysis. ε is the error term. The regression parameter β_k represents the strength of the causal relationship between two technologies. The full regression model uses all the independent variables and sets up the hypotheses as follows.
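A form consistent with the definitions above is the standard multiple linear regression model with a per-coefficient hypothesis pair. The following is a sketch under that assumption, not a reproduction of the original formula images; the intercept β_0 and the index range are assumed.

```latex
Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon

H_0 : \beta_i = 0 \qquad \text{versus} \qquad H_1 : \beta_i \neq 0, \quad i = 1, \ldots, k
```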

A hypothesis test can be performed to find the technologies X that affect technology Z. The null hypothesis H₀ states that the i-th regression parameter is 0, meaning that the i-th technology does not affect technology Z. The alternative hypothesis H₁ states that the i-th regression parameter is not 0, meaning that technology Z depends on technology X_i. To determine whether technology X_i has a significant effect on Z, H₀ may be rejected through a statistical test. The test uses a t distribution with n-(k+1) degrees of freedom, where n is the data size and k is the number of variables. If H₀ is true, the test statistic can be calculated as follows.

$$t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$$

Here $\hat{\beta}_i$ is the estimate of $\beta_i$, and $SE(\hat{\beta}_i)$ is the standard error of $\hat{\beta}_i$. To reject H₀, the following condition must be satisfied.

$$|t| > t_{\alpha/2,\; n-(k+1)}$$

α is the significance level, and the value of $t_{\alpha/2,\; n-(k+1)}$ can be obtained from a t-distribution table.
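The test described above can be sketched with ordinary least squares in plain Python. This is an illustrative implementation under stated assumptions, not the inventors' code: the function names, the toy data, and the two-sided critical value 4.303 (the tabulated t value for α = 0.05 with 2 degrees of freedom) are all chosen for the example.

```python
import math


def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(m)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(p)]
        for i, row in enumerate(m)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(p):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def ols_t_statistics(X, z):
    """Fit z = b0 + b1*x1 + ... + bk*xk and return (beta, std_err, t_stat, dof)."""
    n = len(X)
    A = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept column
    p = len(A[0])                                       # p = k + 1 parameters
    ata = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(p)] for i in range(p)]
    atz = [sum(A[r][i] * z[r] for r in range(n)) for i in range(p)]
    inv = invert(ata)
    beta = [sum(inv[i][j] * atz[j] for j in range(p)) for i in range(p)]
    resid = [z[r] - sum(A[r][j] * beta[j] for j in range(p)) for r in range(n)]
    dof = n - p                                         # n - (k + 1)
    sigma2 = sum(e * e for e in resid) / dof
    std_err = [math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    t_stat = [b / s for b, s in zip(beta, std_err)]
    return beta, std_err, t_stat, dof


# Toy data: one keyword frequency column and a dependent frequency column.
X = [[0], [1], [2], [3]]
z = [1, 3, 5, 8]
beta, std_err, t_stat, dof = ols_t_statistics(X, z)

T_CRIT = 4.303  # two-sided critical value for alpha = 0.05 and 2 degrees of freedom
significant = [abs(t) > T_CRIT for t in t_stat[1:]]  # skip the intercept
print(beta, dof, significant)
```

With real data, a statistics package would normally report the p-values directly, which is what the later steps of the method threshold on.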

Step 150 selects the keywords corresponding to those parameters of the generated regression model whose significance probability (p-value) is at or below a first threshold.

More specifically, for fast and accurate analysis, the number of keywords can be reduced by selecting the significant ones. To this end, the keywords corresponding to parameters whose p-value is at or below the first threshold may be selected from among the parameters of the regression model generated in step 140. The p-value is the smallest significance level at which a parameter can be judged significant. If the p-value in a parameter's regression result is less than 0.05, the parameter can be judged significant. When the p-value of the test statistic is less than 0.05, technology X_i is judged significant. That is, to generate a reduced regression model, the keywords corresponding to parameters with p-values less than 0.05 are selected. Selecting core keywords in this way is efficient in terms of both accuracy and speed.

Step 160 derives relationships between the keywords using the significance probabilities between the selected keywords.

More specifically, the relationship between the selected significant keywords and the technology to be analyzed can be derived using the significance probabilities between the keywords. The relationships between the keywords can be represented by dividing the keywords into hierarchical levels using the p-values between them, and the levels can be drawn as a hierarchical diagram of technology (HDT). The relationship between an upper-level technology and a lower-level technology can be expressed as follows.

X₁ is the affected technology and W₁ is the developed technology. u₁₁ is the strength of the causal relationship between X₁ and W₁. The model seeks all the causal strengths needed to forecast technology Z. Technology W_m affects the development of technology Z according to the causal strength (u_mp · v_pn · r_n). Therefore, the causal strengths of all technologies other than Z, the target of the TF, can be calculated.
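The product (u_mp · v_pn · r_n) can be read as the strength of an indirect path through the hierarchy. The sketch below multiplies the strengths along one such path; the reading of the three factors as consecutive links from W_m up to Z, the function name, and the numeric values are all assumptions for illustration.

```python
import math


def indirect_strength(path_strengths):
    """Multiply the causal strengths along one path from a lower-level technology to Z."""
    return math.prod(path_strengths)


# Hypothetical strengths for the three links of one path ending at Z.
u_mp, v_pn, r_n = 0.5, 0.4, 0.2
effect = indirect_strength([u_mp, v_pn, r_n])
print(effect)
```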

To derive the relationships between the keywords, steps 210 to 220 of FIG. 2 may be performed.

Step 210 selects, as first keywords, one or more of the selected keywords that correspond to parameters having a significance probability at or below a second threshold.

From among the keywords corresponding to parameters with p-values at or below the first threshold, one or more keywords corresponding to parameters with p-values at or below a second threshold are selected as first keywords. The first keywords are the technologies most closely related to the technology to be analyzed, so the keywords meeting the second threshold are selected as first keywords belonging to the upper level. The second threshold may be preset or may be set according to user input. Alternatively, a preset number of keywords may be selected as first keywords in ascending order of p-value.

Step 220 selects, as keywords related to a first keyword, the keywords corresponding to parameters having a significance probability at or below a third threshold, through regression model analysis between the first keyword and the keywords other than the first keyword.

More specifically, this step selects the keywords of the next level after the first keywords have been selected. The depth of the hierarchy may be preset or may vary with the number of keywords or the desired degree of analysis. To select the next-level keywords related to a first keyword, regression model analysis is performed between the first keyword and the keywords other than the first keyword. Through this analysis, the keywords corresponding to parameters with p-values at or below the third threshold are selected as keywords related to the first keyword. This step corresponds to steps 140 to 150.

FIG. 4 is a flowchart of a patent analysis method according to yet another embodiment of the present invention.

Patent documents in the chosen TF field are retrieved, and the abstracts of the patent data are extracted to construct a document-word matrix (DTM). The DTM is then restructured into a document-keyword matrix: the top-ranked terms in the DTM are retrieved, meaningless words are removed, and keywords are selected from the remaining words to construct the document-keyword matrix. Hierarchical regression analysis is then performed. That is, a full regression model is constructed using all the keywords, a reduced regression model is constructed using the selected keywords, and the HDT is completed. The technology to be analyzed is then analyzed by interpreting the resulting HDT model.

FIGS. 5 and 6 show technology hierarchies derived by the patent analysis method according to an embodiment of the present invention.

The result of patent analysis according to an embodiment of the present invention is shown in FIG. 6. The result was derived using patent documents retrieved from KIPRIS and the USPTO (KIPRIS 2013; USPTO 2013). Telematics was chosen as the technical field for TF. The data set consisted of telematics patents filed in the United States, Europe, and China, 474 patent documents in total. The number of applications has increased markedly since the 2000s. To construct the DTM, the abstracts of the retrieved patent documents were used, with the rows and columns of the matrix representing documents and words, respectively. The elements of the matrix are the occurrence frequencies of the words in each document. Using the DTM, hierarchical regression was performed for the TF of telematics. The top 20 keywords, including telematics, were selected as variables for path analysis. The selected keywords were center, communication, control, device, diagnostic, information, message, mobile, network, receiving, request, service, signal, telematics, terminal, transmit, unit, user, and vehicle. From these, the following full regression was obtained.

In the regression model, the dependent variable is telematics, and all keywords other than telematics were used as independent variables. Table 1 shows the result of the full regression analysis.

| Independent variable | Beta | p-value |
|---|---|---|
| center | 0.126 | 0.001 |
| communication | -0.004 | 0.857 |
| connection | -0.009 | 0.642 |
| control | 0.037 | 0.072 |
| device | 0.286 | 0.001 |
| diagnostic | -0.044 | 0.024 |
| information | 0.022 | 0.346 |
| message | 0.024 | 0.236 |
| mobile | 0.003 | 0.893 |
| network | 0.060 | 0.003 |
| receiving | 0.086 | 0.001 |
| request | 0.063 | 0.004 |
| service | 0.169 | 0.001 |
| signal | 0.019 | 0.440 |
| terminal | 0.219 | 0.001 |
| transmit | -0.154 | 0.001 |
| unit | 0.321 | 0.001 |
| user | 0.067 | 0.001 |
| vehicle | 0.152 | 0.001 |

Beta is the regression parameter of the standardized variables, which allows the effects of all the independent variables to be compared. That is, beta is the change in “telematics” associated with a unit change in each independent variable. For example, a unit change in “center” increases the occurrence of “telematics” by 0.126. In addition, “unit” was found to have the largest effect on “telematics”. The significance of an independent variable for the dependent variable is determined by its p-value. At the 95% confidence level, a variable with a p-value less than 0.05 can be judged significant. Accordingly, “center”, “device”, “diagnostic”, “network”, “receiving”, “request”, “service”, “terminal”, “transmit”, “unit”, “user”, and “vehicle” were selected as keywords and used as the variables of the HDT model.
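The selection just described amounts to thresholding the p-values of Table 1. The sketch below reproduces it with the table's reported values; the function name `select_significant` and the choice of an inclusive comparison (following the "at or below" wording of step 150) are assumptions.

```python
# p-values of the full regression model as reported in Table 1.
FULL_MODEL_P = {
    "center": 0.001, "communication": 0.857, "connection": 0.642,
    "control": 0.072, "device": 0.001, "diagnostic": 0.024,
    "information": 0.346, "message": 0.236, "mobile": 0.893,
    "network": 0.003, "receiving": 0.001, "request": 0.004,
    "service": 0.001, "signal": 0.440, "terminal": 0.001,
    "transmit": 0.001, "unit": 0.001, "user": 0.001, "vehicle": 0.001,
}


def select_significant(p_values, threshold=0.05):
    """Return the keywords whose p-value is at or below the threshold."""
    return [keyword for keyword, p in p_values.items() if p <= threshold]


selected = select_significant(FULL_MODEL_P)
print(selected)
# ['center', 'device', 'diagnostic', 'network', 'receiving', 'request',
#  'service', 'terminal', 'transmit', 'unit', 'user', 'vehicle']
```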

Next, a reduced regression model was constructed as follows to generate the HDT for the TF of “telematics”.

| Independent variable | Beta | p-value |
|---|---|---|
| center | 0.124 | 0.001 |
| device | 0.285 | 0.001 |
| diagnostic | -0.045 | 0.024 |
| network | 0.061 | 0.019 |
| receiving | 0.101 | 0.002 |
| request | 0.063 | 0.001 |
| service | 0.171 | 0.003 |
| terminal | 0.220 | 0.001 |
| transmit | -0.134 | 0.001 |
| unit | 0.325 | 0.001 |
| user | 0.069 | 0.001 |
| vehicle | 0.165 | 0.001 |

Since the p-values of all the independent variables are at or below 0.05, all of them are significant. Because device, unit, and terminal have larger beta values than the others, they were selected as first keywords, and the first HDT model was constructed as shown in FIG. 5.
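In this worked example the first keywords are chosen by beta size rather than by a second p-value threshold. A minimal sketch of that ranking, using the beta values reported for the reduced model, is given below; the function name and the choice of three keywords are assumptions that mirror the text.

```python
# Beta values of the reduced regression model as reported above.
REDUCED_MODEL_BETA = {
    "center": 0.124, "device": 0.285, "diagnostic": -0.045, "network": 0.061,
    "receiving": 0.101, "request": 0.063, "service": 0.171, "terminal": 0.220,
    "transmit": -0.134, "unit": 0.325, "user": 0.069, "vehicle": 0.165,
}


def top_keywords_by_beta(betas, count):
    """Return the `count` keywords with the largest beta, in descending order."""
    ranked = sorted(betas, key=lambda keyword: betas[keyword], reverse=True)
    return ranked[:count]


first_keywords = top_keywords_by_beta(REDUCED_MODEL_BETA, 3)
print(first_keywords)  # ['unit', 'device', 'terminal']
```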

As a result, the variables “device”, “unit”, and “terminal” were found to affect “telematics” directly, while the other independent variables affected “telematics” indirectly via “device”, “unit”, and “terminal”. Therefore, to complete the HDT model for the “telematics” TF, three regression models were constructed with these three variables (i.e., device, unit, and terminal) as dependent variables. The results are as follows.

                       Dependent variable
Independent variable   device              unit                terminal
                       Beta      p-value   Beta      p-value   Beta      p-value
center                 -0.093    0.001     0.093     0.001     0.001     0.962
diagnostic             0.040     0.140     -0.093    0.001     0.015     0.579
network                0.123     0.001     0.019     0.464     0.068     0.011
receiving              0.060     0.048     0.070     0.020     0.131     0.001
request                -0.020    0.253     0.025     0.391     0.055     0.062
service                0.033     0.253     0.044     0.125     0.055     0.001
transmit               -0.096    0.001     0.096     0.001     0.100     0.001
user                   0.048     0.001     0.182     0.001     0.124     0.388
vehicle                0.016     0.001     0.079     0.005     -0.053    0.066

Here, for each of the three dependent variables, the statistically significant independent variables were selected on the basis of their p-values. Taking those with a p-value of 0.05 or less (95% confidence level), center, network, receiving, transmit, user, and vehicle were selected as the independent variables for device. The independent variables for unit and terminal were determined in the same way, as shown below.

Dependent variable   Selected independent variables
device               center, network, receiving, transmit, user, vehicle
unit                 center, diagnostic, receiving, transmit, user, vehicle
terminal             network, receiving, service, transmit
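The selection that yields the table above can be reproduced with a short Python sketch. The (beta, p-value) pairs are copied from the three regression results reported for device, unit, and terminal; the names `regressions` and `select_significant` are illustrative assumptions, not part of the patent.

```python
ALPHA = 0.05  # threshold stated in the text (95% confidence level)

# (beta, p-value) of each independent variable in the three regressions above.
regressions = {
    "device": {
        "center": (-0.093, 0.001), "diagnostic": (0.040, 0.140),
        "network": (0.123, 0.001), "receiving": (0.060, 0.048),
        "request": (-0.020, 0.253), "service": (0.033, 0.253),
        "transmit": (-0.096, 0.001), "user": (0.048, 0.001),
        "vehicle": (0.016, 0.001),
    },
    "unit": {
        "center": (0.093, 0.001), "diagnostic": (-0.093, 0.001),
        "network": (0.019, 0.464), "receiving": (0.070, 0.020),
        "request": (0.025, 0.391), "service": (0.044, 0.125),
        "transmit": (0.096, 0.001), "user": (0.182, 0.001),
        "vehicle": (0.079, 0.005),
    },
    "terminal": {
        "center": (0.001, 0.962), "diagnostic": (0.015, 0.579),
        "network": (0.068, 0.011), "receiving": (0.131, 0.001),
        "request": (0.055, 0.062), "service": (0.055, 0.001),
        "transmit": (0.100, 0.001), "user": (0.124, 0.388),
        "vehicle": (-0.053, 0.066),
    },
}


def select_significant(results, alpha=ALPHA):
    """Keep the independent variables whose p-value is at or below alpha."""
    return [name for name, (_, p) in results.items() if p <= alpha]


selected = {dep: select_significant(res) for dep, res in regressions.items()}
for dep, names in selected.items():
    print(dep, "<-", ", ".join(names))
```

Each dependent variable becomes a parent node in the hierarchy, and its selected independent variables become its children.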

From the results in Table 4, it can be seen that the center, network, receiving, transmit, user, and vehicle technologies influenced the development of the device technology, and that this development in turn influenced the development of the telematics technology. The same holds for the unit and terminal technologies. Using the results in Table 4, the complete hierarchical model for the telematics TF shown in FIG. 6 was constructed.

This reveals the technical relationships of the "telematics" TF. Each connection strength denotes the strength from the lower node to the upper node. For example, the "receiving" technology affects the "terminal" technology with a connection strength of 0.133, and this weight is significant because its p-value is less than 0.05. The "terminal" technology in turn affects the "telematics" technology with a connection strength of 0.422. To find the effect of the "receiving" technology on the "telematics" technology through "terminal", the two weights must be multiplied: the weight of "receiving" on "terminal" (0.133) and the weight of "terminal" on "telematics" (0.422). Therefore, the "receiving" technology indirectly affects the "telematics" technology with a connection strength of 0.056 (0.133 × 0.422). The remaining parts of the hierarchical diagram can be explained in the same way.
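The multiplication of connection strengths along a path can be sketched as below. The two weights are the ones quoted in the text; the `edges` dictionary and the `path_strength` helper are hypothetical names introduced only for illustration.

```python
from math import prod

# Connection strengths quoted in the text (lower node -> upper node).
edges = {
    ("receiving", "terminal"): 0.133,
    ("terminal", "telematics"): 0.422,
}


def path_strength(path, weights):
    """Multiply the weights of consecutive edges along a bottom-up path."""
    return prod(weights[(lower, upper)] for lower, upper in zip(path, path[1:]))


indirect = path_strength(["receiving", "terminal", "telematics"], edges)
print(round(indirect, 3))  # 0.056
```

The same helper applies to any other bottom-up path in the diagram once its edge weights are added to `edges`.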

In selecting the keywords in step 130, the keywords may be selected through the following series of steps, starting with step 410.

Step 410 is a step of generating a document-principal component analysis matrix from the generated document-word matrix through principal component analysis.

More specifically, selecting core keywords from the extracted words is very difficult, and because the selected keywords greatly affect the subsequent results, the number of words cannot be reduced hastily. To extract core keywords objectively, principal component analysis (PCA) can be used to resolve the sparsity of the matrix while preserving as much as possible of the information that could otherwise be lost. PCA is a so-called multivariate analysis that analyzes many variables; it extracts a few principal components from many variables. In other words, PCA serves dimension reduction, and a principal component is a component that principally explains many variables. PCA may be performed using a correlation matrix or, alternatively, a covariance matrix. In the document-principal component analysis matrix generated through PCA, as many principal components (PCs) are generated as there are words (terms), the original variables.

Here, the number of PCs can be reduced for efficient analysis. From the PCA result, a reduced document-principal component analysis matrix can be generated using only the principal components whose eigenvalues are equal to or greater than a threshold. That is, to perform the analysis with only the meaningful principal components, the number of columns of the matrix is reduced by keeping only those principal components. PCA yields an eigenvalue for each principal component and a principal component score for each keyword in each principal component. The document-principal component analysis matrix is then reduced using only the principal components whose calculated eigenvalue is equal to or greater than the threshold. The threshold may be 1; that is, dimension reduction is achieved by keeping only the principal components with an eigenvalue of 1 or more.
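A minimal sketch of this reduction, assuming NumPy and a correlation-matrix PCA, is given below. The function name `reduce_by_eigenvalue` and the toy document-word counts are hypothetical; the sketch also assumes that every word column varies across documents, since a constant column cannot be standardized.

```python
import numpy as np


def reduce_by_eigenvalue(doc_term, threshold=1.0):
    """Project documents onto principal components with eigenvalue >= threshold."""
    X = np.asarray(doc_term, dtype=float)
    # Standardize each word column so that PCA is based on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort to descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= threshold
    return Z @ eigvecs[:, keep], eigvals


# Toy document-word matrix (6 documents x 4 words); illustrative values only.
docs = np.array([
    [2, 1, 0, 0],
    [3, 2, 0, 1],
    [0, 0, 2, 3],
    [1, 0, 3, 2],
    [2, 2, 1, 0],
    [0, 1, 2, 2],
])
scores, eigvals = reduce_by_eigenvalue(docs)
print(eigvals.round(3), scores.shape)
```

Because the eigenvalues of a correlation matrix sum to the number of words, a threshold of 1 keeps only components that explain more variance than a single original word does.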

Step 420 is a step of generating a regression model using the document-principal component analysis matrix. A regression model is a statistical technique for estimating the effect of one or more independent variables on a dependent variable, and it enables statistical analysis of the principal components. That is, by generating a regression model, the statistically significant principal components among the selected ones can be found. Because principal component analysis by itself is not a test of statistical significance, using a regression model makes a statistically grounded analysis possible. A regression model is generated from the document-principal component analysis matrix. The regression model can be expressed as follows.
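As a hedged illustration only, a linear regression on principal component scores generically takes the form below. The symbols y (the dependent variable), PC_j (the score on the j-th retained principal component), k (the number of retained components), and ε (the error term) are assumptions chosen for this sketch and are not taken from the patent's own notation.

```latex
y = \beta_0 + \beta_1\,\mathrm{PC}_1 + \beta_2\,\mathrm{PC}_2 + \cdots + \beta_k\,\mathrm{PC}_k + \varepsilon
```

Each coefficient β_j has an associated significance probability (p-value), which is what the subsequent selection step relies on.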

Step 430 is a step of selecting the principal components corresponding to parameters that have a significance probability (p-value) equal to or less than a threshold among the parameters of the generated regression model. More specifically, the parameters of the regression model generated in step 420 are used to select the statistically significant principal components. In Equation 7, β₁ and β₂ each have a significance probability (p-value). A principal component corresponding to a parameter whose significance probability is equal to or less than the threshold can be selected as a significant principal component. The threshold may be 0.05, in which case the principal components corresponding to parameters with a significance probability of 0.05 or less are selected. When there are many words, it is difficult to assign importance to all of them arbitrarily and to verify whether each word actually has an effect. Selecting core keywords through principal component analysis and a regression model is therefore very efficient in terms of accuracy and speed.
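The selection in step 430 can be sketched with an ordinary least squares fit and two-sided t-tests on the coefficients. This is an assumption-laden illustration using NumPy and SciPy: the function name `significant_components`, the choice of OLS with t-tests, and the toy data are not taken from the patent.

```python
import numpy as np
from scipy import stats


def significant_components(pc_scores, y, alpha=0.05):
    """Fit y on the PC scores by OLS; return (indices with p <= alpha, p-values)."""
    pcs = np.asarray(pc_scores, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), pcs])       # prepend an intercept column
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n - k                                        # residual degrees of freedom
    sigma2 = resid @ resid / dof                       # residual variance estimate
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    p_all = 2 * stats.t.sf(np.abs(beta / std_err), dof)  # two-sided p-values
    p_pcs = p_all[1:]                                  # drop the intercept
    selected = [j for j, p in enumerate(p_pcs) if p <= alpha]
    return selected, p_pcs


# Toy data (hypothetical): y depends strongly on the first component.
pc1 = np.arange(1.0, 9.0)
pc2 = np.array([1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0])
noise = np.array([0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15])
y = 2.0 * pc1 + noise
selected, p_values = significant_components(np.column_stack([pc1, pc2]), y)
print(selected, p_values.round(4))
```

The returned indices identify the principal components whose words are candidates for core keywords in the next step.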

Step 440 is a step of selecting one or more words belonging to the selected principal components as core keywords. More specifically, one or more of the words included in a principal component are selected as core keywords. When a principal component includes a plurality of words, either the words whose principal component score is equal to or greater than a threshold, or a predetermined number of words in descending order of principal component score, may be selected as core keywords. A principal component may include a plurality of words, and the core keywords can be finally selected using each word's principal component loading (linear coefficient), that is, its principal component score. The principal component score may range from -1 to 1, and the threshold may be 0. Alternatively, core keywords may be selected in descending order of principal component score; a preset number of words may be selected, or the number of selected words may vary with the amount of words and documents. The selection may also be based on both the threshold and the number of words.
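The three selection options in step 440 (by threshold, by rank, or by both) can be sketched as follows. The function name `select_keywords` and the loading values are hypothetical and serve only to illustrate the rule.

```python
def select_keywords(loadings, threshold=None, top_n=None):
    """Pick core keywords of one principal component by score threshold and/or rank."""
    ranked = sorted(loadings.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(word, score) for word, score in ranked if score >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [word for word, _ in ranked]


# Hypothetical principal component scores (loadings) of four words, in [-1, 1].
loadings = {"vehicle": 0.62, "terminal": 0.48, "signal": -0.15, "network": 0.05}

by_threshold = select_keywords(loadings, threshold=0.0)
by_rank = select_keywords(loadings, top_n=2)
by_both = select_keywords(loadings, threshold=0.1, top_n=3)
print(by_threshold, by_rank, by_both)
```

Applying the threshold before the rank cutoff means that `top_n` acts as an upper bound on the number of keywords rather than a fixed count.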

Using principal component analysis (PCA) and a regression model, statistically significant parameters β are identified, and important keywords are selected through the selected parameters, enabling quantitative analysis of technology trends.

An embodiment of the present invention may include one or more processing units (processors) that extract words from documents, generate a document-word matrix, generate a document-keyword matrix, generate a regression model, and derive relationships between keywords, and one or more storage units (databases) that store the results calculated by the processing units and the removed words.

Embodiments of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

As described above, the present invention has been described with reference to specific matters such as particular components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention; the present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set out below but also everything equal or equivalent to those claims falls within the scope of the spirit of the present invention.

Claims

A method for analyzing a technology using a patent document, the method comprising:
extracting words from documents to be analyzed;
generating a document-word matrix using the extracted words;
generating a document-keyword matrix by selecting keywords from the generated document-word matrix;
generating a regression model using the document-keyword matrix;
selecting keywords corresponding to parameters having a significance probability (p-value) equal to or less than a first threshold among the parameters of the generated regression model; and
deriving a relationship between the keywords using the significance probability between the selected keywords,
wherein deriving the relationship between the keywords comprises:
selecting, as a first keyword, at least one keyword corresponding to a parameter having a significance probability less than a second threshold among the selected keywords; and
selecting, as a keyword related to the first keyword, a keyword corresponding to a parameter having a significance probability less than a third threshold through regression model analysis between the first keyword and the keywords other than the first keyword.

delete

The method according to claim 1,
wherein deriving the relationship between the keywords comprises
dividing the hierarchy by using the significance probability between the keywords.

The method according to claim 1,
wherein generating the document-keyword matrix comprises:
selecting words having an occurrence frequency equal to or higher than a fourth threshold from the document-word matrix; and
removing, from the selected words, words that are not relevant to the description.

The method according to claim 1,
wherein the documents to be analyzed are patent documents, and
extracting the words comprises
extracting words from at least one of the title, abstract, claims, or description of the invention of the patent document.

The method according to claim 1,
wherein extracting the words comprises
using parsing and a corpus.