KR102224684B1

KR102224684B1 - System and method for generating prediction of technology transfer base on machine learning

Info

Publication number: KR102224684B1
Application number: KR1020190078907A
Authority: KR
Inventors: 박상성; 이준석
Original assignee: 청주대학교 산학협력단
Priority date: 2019-07-01
Filing date: 2019-07-01
Publication date: 2021-03-09
Also published as: KR20210002968A

Abstract

A method for generating a machine learning-based technology transfer prediction model according to the present invention comprises: (a) collecting, by a data collection unit, the patent data required to generate a technology transfer prediction model; (b) classifying, by a data classification unit, the patent data into numerical data and text data; (c) preprocessing, by a data preprocessing module, the text data, which is the unstructured data among the patent data, so that it can be analyzed; and (d) receiving, by a technology transfer prediction model construction module, the numerical data and the preprocessed text data, performing through an ensemble model topic modeling that extracts technology topics from the text data, and performing ensemble modeling on the technology topics and the numerical data to generate a technology transfer prediction model. The method makes it possible to evaluate the technology transfer potential of a population objectively and easily from the collected patent data.

Description

Machine learning-based technology transfer prediction model generation system and generation method {SYSTEM AND METHOD FOR GENERATING PREDICTION OF TECHNOLOGY TRANSFER BASE ON MACHINE LEARNING}

The present invention relates to a machine learning-based technology transfer prediction system. More specifically, it relates to a system that collects patent data in a technical field, divides the collected patent data into numerical data and text data, derives the technology topics of the patent data under analysis by applying preprocessing and topic modeling to the text data, and predicts technology transfer by machine learning using the derived technology topic information and the numerical data.

Knowledge is now growing exponentially, and traditional research and development methods find it increasingly difficult to keep pace with this rate of change. Technology transfer, one of the methods of technology management, exists as an alternative.

Moreover, technology transfer is one of the key elements of technology commercialization and can contribute substantially to job creation and economic revitalization. Agencies affiliated with the Korean Intellectual Property Office, the Technology Guarantee Fund, and others run various technology commercialization support programs in line with international trends, but there is a limit to how far these can raise the technology transfer rate.

This is because objective indicators for confirming the quality of the many patents considered for technology commercialization are lacking, and no model suited to the current system exists.

Consequently, the selection of patents for technology commercialization relies mostly on expert opinion alone, which is a limitation.

Korean Registered Patent Publication No. 10-1806669 (2017.12.01)

To resolve the limitations described above, an object of the present invention is to provide a machine learning-based technology transfer prediction system that collects patent data in a technical field, divides the collected patent data into numerical data and text data, derives the technology topics of the patent data under analysis by applying preprocessing and topic modeling to the text data, and uses the derived technology topic information together with the numerical data.

To achieve the above object, a method for generating a machine learning-based technology transfer prediction model according to the present invention comprises: (a) collecting, by a data collection unit, the patent data required to generate a technology transfer prediction model; (b) classifying, by a data classification unit, the patent data into numerical data and text data; (c) preprocessing, by a data preprocessing module, the text data, which is the unstructured data among the patent data, so that it can be analyzed; and (d) receiving, by a technology transfer prediction model construction module, the numerical data and the preprocessed text data, performing through an ensemble model topic modeling that extracts technology topics from the text data, and performing ensemble modeling on the technology topics and the numerical data to generate a technology transfer prediction model.

In the method, step (c) further comprises extracting, by the data preprocessing module, the transferred patents as numerical data from among the valid patents that remain after unnecessary noise patents and duplicate patents have been deleted from the collected patent data.

In the method, step (c) comprises: (c-1) removing, by the data preprocessing module, stop words from the sentences that make up the text data; (c-2) selecting, by the data preprocessing module, important words from the text data using TF-IDF (term frequency-inverse document frequency); (c-3) unifying, by the data preprocessing module, the forms of the words that make up the text data through morphological analysis; and (c-4) removing, by the data preprocessing module, numbers, punctuation marks, and symbols that could distort the analysis.

In step (d) of the method, the technology transfer prediction model construction module extracts the technology topics within a document using the model [equation not reproduced].

In step (d) of the method, the technology transfer prediction model construction module generates the technology transfer prediction model using the ensemble model [equation not reproduced].

As another embodiment, a system for generating a machine learning-based technology transfer prediction model according to the present invention comprises: a data collection unit that collects the patent data required to generate a technology transfer prediction model; a data classification unit that classifies the patent data into numerical data and text data; a data preprocessing module that preprocesses the text data, which is the unstructured data among the patent data, so that it can be analyzed; and a technology transfer prediction model construction module that derives technology topics from the preprocessed unstructured data through technology topic modeling and generates a technology transfer prediction model by applying the technology topics, together with the numerical data previously extracted by the numerical data processing unit of the data preprocessing module, to an ensemble model.

The machine learning-based technology transfer prediction system according to the present invention makes it possible to evaluate the technology transfer potential of a population objectively and easily from the collected patent data.

FIG. 1 is a block diagram of the system for generating a machine learning-based technology transfer prediction model according to the present invention.
FIG. 2 is a flowchart of the method for generating a machine learning-based technology transfer prediction model according to the present invention.
FIG. 3 is a diagram illustrating the process by which the system finds technology topics.
FIG. 4 shows the graphical model of Latent Dirichlet Allocation (LDA) and the graphical model of the AdaBoost algorithm used in the present invention.
FIG. 5 shows the modeling process for the technology transfer prediction model of the present invention.
FIG. 6 is a graph of the results of analyzing the optimal number of topics.
FIG. 7 is a graph of the accuracy and specificity of the model according to the present invention and of other models.
FIG. 8 is a graph of the sensitivity of the model according to the present invention and of other models.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are described in detail with reference to the drawings. This is not intended to limit the present invention to a specific embodiment, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be called a second element, and similarly a second element may be called a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of them.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that element, but intervening elements may also be present. In contrast, when an element is said to be "directly connected" or "directly coupled" to another element, it should be understood that no intervening element is present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and do not exclude in advance the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in this application.

Throughout the specification and claims, when a part is said to include an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise.

The system and method for generating a machine learning-based technology transfer prediction model according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of the machine learning-based technology transfer prediction system according to the present invention.

As shown in FIG. 1, the system includes a data collection unit (100), a data classification unit (200), a data preprocessing module (300) that includes a numerical data processing unit (310), and a technology transfer prediction model construction module (400).

The data collection unit (100) connects to a server of WIPS (World Intellectual Property Service), a Korean service provider, or to the server of KIPRIS, a service provided by the Korean Intellectual Property Office, and collects patent data, such as patent specifications, required to generate the technology transfer prediction model.

The data classification unit (200) classifies the patent data collected by the data collection unit (100) into numerical data and text data.

The data preprocessing module (300) processes the text data, which is the unstructured data among the data classified by the data classification unit (200), into an analyzable form through a preprocessing procedure that structures the data by text mining.

The technology transfer prediction model construction module (400) takes the text data processed by the data preprocessing module (300) as input and performs topic modeling through an ensemble model to derive technology topics.

In addition, the technology transfer prediction model construction module (400) takes as input the technology topic information derived by this topic modeling and the numerical data generated by the numerical data processing unit (310), and performs ensemble modeling to generate the technology transfer prediction model.

As another embodiment, a method of generating a prediction model with the system configured as described above is now described.

First, as shown in FIG. 2, the data collection unit (100) collects patent data, such as patent specifications, from a Korean patent information service provider (S100).

More specifically, the patent data can be collected from WIPS (World Intellectual Property Service), a Korean service provider that supplies patent information, or from KIPRIS, a service provided by the Korean Intellectual Property Office.

The data classification unit (200) classifies the patent data collected by the data collection unit (100) into numerical data and text data (S200).
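As a minimal sketch of step S200, the following splits one patent record into numerical and text fields by value type. The record layout and field names (`patent_references`, `non_patent_references`, and so on) are hypothetical; the patent does not specify a data format.

```python
from numbers import Number


def split_patent_record(record):
    """Split one patent record into numerical fields and text fields."""
    numeric, text = {}, {}
    for field, value in record.items():
        # bool is a Number subclass in Python, so exclude it explicitly.
        if isinstance(value, Number) and not isinstance(value, bool):
            numeric[field] = value
        elif isinstance(value, str):
            text[field] = value
    return numeric, text


# Hypothetical record; the field names are illustrative only.
record = {
    "title": "Battery cooling plate",
    "abstract": "A plate that cools a battery cell.",
    "patent_references": 4,
    "non_patent_references": 1,
}
numeric, text = split_patent_record(record)
```

Here `numeric` would feed the numerical data processing unit (310) and `text` the text preprocessing of step S300.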

In addition, the data preprocessing module (300), which includes the numerical data processing unit (310), performs a preprocessing step that makes the text data, which is the unstructured data among the patent data, analyzable (S300).

The technology transfer prediction model construction module (400) takes the text data processed by the data preprocessing module (300) as input and performs topic modeling through an ensemble model to derive technology topics (S400). It then takes the derived technology topic information and the numerical data generated by the numerical data processing unit (310) as input and performs ensemble modeling to generate the technology transfer prediction model (S500).

Steps S300 to S500 are described in detail below.

Because the collected data contains irrelevant noise patents and duplicate patents, the data preprocessing module (300) preferably deletes them.
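The deletion of noise and duplicate patents could be sketched as follows. The `application_number` key and the `is_relevant` predicate are assumptions made for illustration; the patent does not say how noise patents are detected.

```python
def drop_noise_and_duplicates(patents, is_relevant):
    """Keep the first copy of each application number that passes is_relevant."""
    seen = set()
    valid = []
    for patent in patents:
        key = patent["application_number"]
        if key in seen or not is_relevant(patent):
            continue
        seen.add(key)
        valid.append(patent)
    return valid


# Hypothetical records: one duplicate and one off-topic (noise) patent.
patents = [
    {"application_number": "A-1", "title": "battery cooling plate"},
    {"application_number": "A-1", "title": "battery cooling plate"},
    {"application_number": "A-2", "title": "garden chair"},
    {"application_number": "A-3", "title": "battery separator film"},
]
valid = drop_noise_and_duplicates(patents, lambda p: "battery" in p["title"])
```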

To help searchers find patents easily, patent examiners classify each patent according to the content of its specification using the IPC (International Patent Classification), CPC (Cooperative Patent Classification), or FI (File Index).

However, classification codes do not contain detailed technical content, so they are of limited use for identifying the technology in the data.

Therefore, in the present invention, the technologies of the target domain are first classified through topic modeling.

FIG. 3 shows the process of finding technology topics in the present invention.

To identify the technical topics of the collected data, the text of each patent document, obtained by merging its title and abstract, was used as the corpus, that is, the set of linguistic texts.

Under a given grammatical structure, word classes are generally used according to their position, such as predicate or object.

Among them, stop words such as "the", "a", "by", "as", and "is" are needed to write ordinary sentences but are unnecessary in the present invention.

Such words carry no particular meaning in LDA and, when included, needlessly increase computational complexity, so they should be removed for efficient information processing.

The data preprocessing module (300) therefore removes stop words in the preprocessing step.

In addition, because not every word in a document is important for the subsequent analysis, the data preprocessing module (300) selects important words using TF-IDF (term frequency-inverse document frequency).
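A minimal sketch of TF-IDF word selection is shown below. It assumes one common weighting variant (term count divided by document length, times the natural log of N over document frequency) and a per-document `top_n` cutoff; the patent states neither the exact formula nor the selection threshold.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def select_important_words(docs, top_n):
    """Keep, per document, the top_n words ranked by tf-idf (ties by name)."""
    selected = []
    for doc, score in zip(docs, tfidf_scores(docs)):
        keep = set(sorted(score, key=lambda w: (-score[w], w))[:top_n])
        selected.append([w for w in doc if w in keep])
    return selected


# Toy corpus; "device" occurs in every document, so its idf is zero.
docs = [
    ["battery", "cell", "cooling", "device"],
    ["battery", "electrode", "coating", "device"],
    ["antenna", "signal", "device"],
]
scores = tfidf_scores(docs)
important = select_important_words(docs, top_n=2)
```

A word that appears in every document scores zero and is dropped first, which is the behavior the selection step relies on.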

The form of a word is determined by its position in the sentence. If word forms are not unified, the system cannot tell whether several words have the same meaning or different meanings, which can distort the data. The data preprocessing module (300) therefore unifies the forms of words that have the same meaning.

In the present invention, morphological analysis is used to unify word forms.

In addition, the data preprocessing module (300) removes numbers, punctuation marks, and symbols that could distort the analysis.
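The cleaning steps described above (stop-word removal, form unification, and removal of numbers, punctuation, and symbols) might be sketched as below for English text. The stop-word list is a small illustrative sample, and `crude_stem` is a hypothetical stand-in for real morphological analysis, which the patent does not specify.

```python
import re

# Small illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "by", "as", "is", "of", "and", "for", "to"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for morphological analysis."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop digits/punctuation/symbols, remove stop words, unify forms."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [crude_stem(w) for w in text.split() if w not in STOP_WORDS]


tokens = preprocess("The battery is cooled by 2 cooling plates (claim 1).")
```

With this sketch, "cooled" and "cooling" both collapse to the same form, so later counting treats them as one word.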

The data refined by the data preprocessing module (300) is used to fit the LDA (Latent Dirichlet Allocation) model.

To fit the LDA model, Gibbs sampling was adopted, and the number of topics K was increased from 2 to 10 to find the optimal K. The number of iterations was limited to 500, 1000, and 2000 so that the topics would be well expressed.

Note that K and the number of iterations are restricted in this way only in one embodiment of the present invention; as part of the process of deriving an optimal model, the number of iterations, K, and so on vary with the data.

The Dirichlet parameter α is set to a value estimated from the documents, and the parameter β is set to 0.1. Table 1 shows the optimal parameters for fitting the LDA model, found through iteration.

Table 1

Component                    Candidates
Inference algorithm          Gibbs sampling
Number of topics K           From 2 to 10
Gibbs sampling iterations    1000
Parameters α, β              α = T/50, β = 0.1
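To make the fitting procedure concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is a textbook formulation, not the patent's implementation; the default `alpha=0.1` and the toy corpus are assumptions chosen for illustration, while `beta=0.1` matches the value given above.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.1, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of tokens.

    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta)
                    / (topic_total[j] + n_vocab * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def topic_proportions(doc_topic_row, alpha=0.1):
    """Smoothed topic mixture of one document."""
    total = sum(doc_topic_row) + alpha * len(doc_topic_row)
    return [(c + alpha) / total for c in doc_topic_row]


# Toy corpus of preprocessed tokens (hypothetical).
docs = [
    ["battery", "cell", "battery"],
    ["antenna", "signal", "antenna"],
    ["battery", "cell"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2, n_iter=100)
```

Searching for the optimal K, as described above, would amount to calling `lda_gibbs` for each K from 2 to 10 and comparing the fits with a criterion of choice. The per-document output of `topic_proportions` is one plausible form for the technology topic feature passed to the ensemble model.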

Topic modeling is an unsupervised learning method: a text mining technique that can identify the topic or topics of documents within a larger collection. LDA (Latent Dirichlet Allocation), one of the most common topic modeling techniques, is a probabilistic model for representing a corpus based on a Bayesian model and is regarded as a probabilistic extension of latent semantic analysis (LSA).

As shown in FIG. 4, the basic idea of LDA is that each document has topics and that each topic can be defined as a distribution over words.

For example, given the m-th document $D_m$, the distribution of document $D_m$ over the latent topics $z_{m,n}$ is denoted by $\theta_m$. $\theta_m$ is a multinomial distribution that follows a Dirichlet distribution with hyperparameter $\alpha$.

The number of topics, indexed $k = 1, \ldots, K$, can be estimated statistically, or the experimenter can fix its value.

The distribution of words for each of the $K$ topics is likewise a multinomial distribution that follows a Dirichlet distribution. The probability of the word $w_{m,n}$ is determined by $p(w_{m,n} \mid z_{m,n}, \beta)$.

From Equation 1 above, the marginal distribution of a document can be derived by integrating over $\theta$ and summing over $z$, as shown in Equation 2.

Finally, the probability of the corpus can be expressed as the product of the marginal probabilities of the individual documents, as in Equation 3 below.
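For reference, the standard LDA formulation that matches this description is sketched below: the joint distribution of one document, its marginal, and the corpus probability. This assumes the patent follows the usual textbook form; the symbols $N$ (words in a document), $N_m$ (words in document $m$), and $M$ (documents in corpus $D$) are introduced here for illustration.

```latex
% Joint distribution of topic mixture, topic assignments, and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate over \theta, sum over z
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Probability of the corpus: product over the M documents
p(D \mid \alpha, \beta)
  = \prod_{m=1}^{M} \int p(\theta_m \mid \alpha)
    \left( \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(z_{m,n} \mid \theta_m)\,
           p(w_{m,n} \mid z_{m,n}, \beta) \right) d\theta_m
```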

The topic model described above is a useful way to identify hidden topics in a collection of documents. Because of this advantage, research that uses topic models to analyze technical documents has recently been active.

To implement the technology transfer prediction model, the AdaBoost algorithm, a representative ensemble model, is used.

An ensemble method is an algorithm that raises accuracy by combining several weak learners, typically through boosting or bagging.

The most important characteristic of the AdaBoost algorithm is that it generates weak learners sequentially, whereas the weak learners in bagging are generated in parallel.

The ensemble algorithm is described with reference to the graphical model of the AdaBoost algorithm shown in the latter part of FIG. 4.

The main idea is to adjust the weight of the weak classifier for each distribution and finally to combine these weak classifiers into one strong classifier.

In other words, it produces a hypothesis with high accuracy.

Suppose the training examples are given as $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is an input instance and $y_i$ represents its class or label.

The initial distribution $D_1(i)$ is defined as the uniform distribution $D_1(i) = 1/n$.

In each round, the distribution $D_t$ is updated according to the weighted error $\varepsilon_t$, which the weak hypothesis $h_t$ is chosen to minimize.

The weight of the weak hypothesis $h_t$ is determined from the error $\varepsilon_t$ according to Equation 4 above, and the distribution $D_{t+1}(i)$ is updated with this weight through Equation 5 below.
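For reference, the standard AdaBoost quantities that match this description are sketched below, assuming the patent follows the usual textbook form with labels $y_i \in \{-1, +1\}$. The equations are left unnumbered because the patent's own numbering cannot be recovered from this text; $T$ (number of rounds) and the indicator $\mathbb{1}[\cdot]$ are notation introduced here, and $h_t(x_i)$ corresponds to what the text writes as $h_t(i)$.

```latex
% Weighted error of the weak hypothesis in round t
\varepsilon_t = \sum_{i=1}^{n} D_t(i)\, \mathbb{1}\left[ h_t(x_i) \neq y_i \right]

% Weight of the weak hypothesis
\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}

% Update of the sample distribution, with normalization factor Z_t
D_{t+1}(i) = \frac{D_t(i)\, \exp\left( -\alpha_t\, y_i\, h_t(x_i) \right)}{Z_t},
\qquad
Z_t = \sum_{i=1}^{n} D_t(i)\, \exp\left( -\alpha_t\, y_i\, h_t(x_i) \right)

% Final strong classifier
H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t\, h_t(x) \right)
```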

As can be seen from Equation 5, $\alpha_t < 0$ if $\varepsilon_t > 0.5$, and $\alpha_t > 0$ if $\varepsilon_t < 0.5$.

That is, as the error $\varepsilon_t$ of the weak hypothesis $h_t$ decreases, the weight $\alpha_t$ increases. If the error $\varepsilon_t$ is greater than 0.5, the weak classifier performs worse than random guessing. The hypothesis of such a weak classifier is therefore not considered, and the algorithm moves on to the next round.

In Equation 6, if $h_t(i) = y_i$ then $D_{t+1}(i) = D_t(i)\, e^{-\alpha_t} / Z_t$, and if $h_t(i) \neq y_i$ then $D_{t+1}(i) = D_t(i)\, e^{\alpha_t} / Z_t$, where $Z_t$ is the normalization factor. It satisfies $\sum_{i=1}^{n} D_{t+1}(i) = 1$. As a result, the strong classifier is defined as in Equation 7 below.
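The boosting loop described above can be sketched in plain Python as follows. Decision stumps are assumed as the weak learners, which the patent does not specify, and the toy data are hypothetical. Because the stump search here is deterministic, the sketch stops when no stump beats random guessing rather than skipping to another round.

```python
import math


def best_stump(X, y, weights):
    """Return (error, feature, threshold, sign) of the lowest weighted-error stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] <= thr else -sign for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, row):
    """Predict +1 or -1 for one row with a stump from best_stump."""
    _, f, thr, sign = stump
    return sign if row[f] <= thr else -sign


def adaboost_fit(X, y, n_rounds=10):
    """Train AdaBoost; y must contain +1/-1 labels."""
    n = len(X)
    weights = [1.0 / n] * n  # initial uniform distribution D_1(i) = 1/n
    model = []  # list of (alpha_t, stump)
    for _ in range(n_rounds):
        stump = best_stump(X, y, weights)
        err = stump[0]
        if err >= 0.5:  # no better than random guessing: stop
            break
        err = max(err, 1e-10)  # avoid log(0) when a stump is perfect
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Misclassified samples gain weight, correct ones lose weight.
        weights = [
            w * math.exp(-alpha * t * stump_predict(stump, row))
            for w, row, t in zip(weights, X, y)
        ]
        z = sum(weights)  # normalization factor Z_t
        weights = [w / z for w in weights]
    return model


def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all weak hypotheses."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Toy one-feature data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

In the patent's setting, each row of `X` would combine the numerical variables with the technology topic information, and `y` would be the transfer label.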

FIG. 5 shows the proposed model. The numerical data refer to variables such as patent references and non-patent references. The technology topic represents the technical description of a patent, obtained using LDA. Among the variables in the data set, "transfer" indicates whether the patent has been transferred from the original holder to another holder.

The patent search database used to collect patent data in the present invention provides information on the applicant and the current assignee.

If the assignee has changed, the patent right is deemed to have been transferred. In the present invention, "transfer" is used as the output variable $y_i$.

Therefore, $y_i$ takes the value +1 if the patent has been transferred to another holder and -1 otherwise. The parameters used in the AdaBoost algorithm were estimated iteratively.
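The labeling rule above can be illustrated as follows. This is a sketch under assumptions: the function name `transfer_label` is hypothetical, and real assignee names would likely need more careful normalization (for example, handling corporate suffixes) than the case and whitespace folding shown here.

```python
def transfer_label(applicant: str, current_assignee: str) -> int:
    """Return +1 if the current assignee differs from the applicant, else -1."""
    def normalize(name: str) -> str:
        # Fold case and collapse whitespace so trivial differences are ignored.
        return " ".join(name.lower().split())
    return 1 if normalize(applicant) != normalize(current_assignee) else -1
```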

To verify the performance of the proposed model, accuracy, specificity, and sensitivity are considered as measures of classification performance. The measures are detailed in Table 2.

Accuracy represents the overall performance of the classifier. It considers true positives (TP) and true negatives (TN) together.

However, if the classifier learns noisy training data too closely, overfitting can result.

In such a situation, accuracy alone cannot measure the true performance of the classifier.

To overcome this problem, the present invention evaluates the performance of the proposed model using both sensitivity, which considers only true positives, and specificity, which considers only true negatives.
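The three measures can be computed from the confusion counts as sketched below. Table 2 is not reproduced here, so the usual definitions are assumed: accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). The function name `classification_measures` is hypothetical.

```python
def classification_measures(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)    # transferred, predicted transferred
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)  # not transferred, predicted not
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)

    def ratio(num, den):
        # Undefined when a class is absent from y_true; report NaN in that case.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

A classifier that labels almost every patent as "not transferred" can still score a high accuracy on data where transfers are a minority, which is why sensitivity and specificity are reported alongside it.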

For the experiment of the present invention, data on the technical field of inference and machine learning were collected according to the conditions described above.

It is desirable to identify trends for specific technologies, but the patent data contain no technical classification information. Therefore, in the present invention, technical classification was performed using LDA as described above.

The result of selecting the optimal number of topics is shown in FIG. 6.

The optimal number of topics may change depending on the data. To determine the optimal number of topics in the present invention, the methodology of Cao et al. (2009), "density-based method for adaptive LDA model selection", was applied.
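The following sketch shows one simplified reading of that selection criterion: for each candidate number of topics, compute the average pairwise cosine similarity between the topic-word distributions and prefer the candidate with the lowest value, on the assumption that well-separated topics indicate a better model. It omits the adaptive density-based search of the original method, and the names `average_topic_similarity` and `select_num_topics` are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_topic_similarity(topic_word):
    """Mean cosine similarity over all pairs of topics (needs at least 2 topics)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def select_num_topics(models):
    """models maps a candidate topic count K to its topic-word matrix
    (one row of word probabilities per topic); return the K whose topics
    are, on average, least similar to each other."""
    return min(models, key=lambda k: average_topic_similarity(models[k]))
```

Here each topic-word matrix would come from an LDA model fitted separately for each candidate K on the preprocessed patent text.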

Based on these results, the technology topics of the collected data were classified. The results are shown in Table 3.

Each technology topic was defined using the top 10 keywords it contains.

As a result of the technology classification, "natural language understanding" accounted for the largest share of the inference and machine learning field, at about 22.4%. Next, "expert system" technology accounted for 21.4%, followed by signal processing, image processing, and artificial neural network technologies.

Table 4 shows the number of technology transfers for each of the above technology fields.

The average transfer rate of the collected data is 29%. Examining the transfer rate for each technology class obtained with LDA shows that Topic 3 (natural language understanding) has a transfer rate of 38%, which is relatively higher than that of the other technology fields.
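A per-topic transfer rate of this kind can be tabulated as in the sketch below. The function name `transfer_rate_by_topic` and the record layout (a topic identifier paired with a +1 or -1 transfer label) are assumptions for illustration; the counts in Table 4 are not reproduced.

```python
from collections import defaultdict

def transfer_rate_by_topic(records):
    """records: iterable of (topic, label) pairs, with label +1 for a
    transferred patent and -1 otherwise. Returns {topic: transfer rate}."""
    counts = defaultdict(lambda: [0, 0])   # topic -> [transferred, total]
    for topic, label in records:
        if label == 1:
            counts[topic][0] += 1
        counts[topic][1] += 1
    return {topic: moved / total for topic, (moved, total) in counts.items()}
```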

To generate the technology transfer prediction model proposed in the present invention, the topic model results obtained above are merged with the quantitative indicators of the patents, and the AdaBoost algorithm is applied to the merged data.
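One possible way to merge the two kinds of input is sketched below. The source does not state how the topic is encoded, so this example assumes that each patent is assigned a single dominant topic (numbered from 0) and that it is one-hot encoded and appended to the numerical indicators; the function name `build_feature_vector` is hypothetical.

```python
def build_feature_vector(numeric, topic, num_topics):
    """Concatenate numerical indicators (e.g. counts of patent and
    non-patent references) with a one-hot encoding of the dominant topic."""
    if not 0 <= topic < num_topics:
        raise ValueError("topic index out of range")
    one_hot = [1.0 if k == topic else 0.0 for k in range(num_topics)]
    return list(numeric) + one_hot
```

The resulting vectors, paired with the +1 or -1 transfer labels, would then form the training set for the boosted classifier.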

To compare the performance of the proposed method, a performance comparison test was carried out using representative classification algorithms: K-NN (K-nearest neighbor classifier), SVM (Support Vector Machine), and Neural Network. Performance is measured using the accuracy, sensitivity, and specificity already described. In addition, models with and without technology topics are discussed. The experimental results are shown in Tables 5 and 6.

In FIGS. 7 and 9, NT denotes a model that does not include technology topics, and YT denotes a model that includes technology topics.

In the present invention, the model that includes technology topics was compared with the model that does not.

As a result, the technology transfer prediction model that includes the technical information of the documents shows better classification performance than the model that does not include it.

The sensitivity of the model without technical content is significantly lower than that of the model with technical information, because overfitting occurs in the former.

Therefore, technical information can be assumed to be an important factor in predicting technology transfer.

Next, models were generated using the proposed method and using other classifiers, namely K-nearest neighbor (K-NN), SVM (Support Vector Machine), and Neural Network.

K-NN has a simple structure but performs well, and for this reason it is used in many classification problems. Support vector machines and artificial neural networks are also well known for their excellent classification performance and are applied in various fields.

Comparing the model presented in the present invention with the other classifiers mentioned above shows that the accuracy of the models is broadly similar.

However, in terms of sensitivity and specificity, which reflect true positives and true negatives respectively, there was a significant difference between the proposed model and the other models.

In particular, the sensitivity and specificity of the other models compared in the present invention were lower than those of the proposed model. This is due to overfitting of those models.

This result shows that the proposed model outperforms the other models in technology transfer prediction.

Therefore, it can be concluded that the model proposed in the present invention, which is based on patent data, is suitable for technology transfer prediction.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments described in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope equivalent thereto should be construed as being included in the scope of the present invention.

100: data collection unit
200: data classification unit
300: data preprocessing module
310: numerical data processing unit
400: technology transfer prediction model configuration module

Claims

In the machine learning-based technology transfer prediction model generation method by the machine learning-based technology transfer prediction model generation system,
(a) collecting patent data necessary for the data collection unit to generate a technology transfer prediction model;
(b) classifying the patent data into numerical data and text data by a data classification unit;
(c) a pre-processing step of allowing the data pre-processing module to analyze text data, which is unstructured data among the patent data; And
(d) receiving, by the technology transfer prediction model configuration module, the numerical data and the preprocessed text data, performing topic modeling to extract a technology topic from the text data through an ensemble model, and performing ensemble modeling with the technology topic and the numerical data to generate a technology transfer prediction model,
wherein, in step (d), the technology transfer prediction model configuration module

A method of generating a technology transfer prediction model based on machine learning, characterized by generating a technology transfer prediction model using an ensemble model.
$H(x)$: the strong hypothesis, a collection of the generated $h_t$ (weak learners)
$T$: total number of $h_t$
$\alpha_t$: the weight of the weak classifier
$h_t$: the t-th weak learner

The method of claim 1,
wherein, in step (c),
the data preprocessing module extracts, as numerical data, transferred patents among the valid patents that remain after unnecessary noise patents and duplicate patents have been deleted from the collected patent data. A method of generating a technology transfer prediction model based on machine learning.

The method of claim 2,
Step (c) is
(c-1) removing stop words from sentences constituting the text data by the data preprocessing module;
(c-2) selecting, by the data preprocessing module, important words from the text data using the term frequency-inverse document frequency (TF-IDF) method;
(c-3) unifying, by the data preprocessing module, the shape of words constituting the text data through morpheme analysis; And
(c-4) removing, by the data preprocessing module, numbers, punctuation marks, and symbols that may distort the analysis process.

The method of claim 3,
In step (d) above
The technology transfer prediction model configuration module

A method of generating a technology transfer prediction model based on machine learning, characterized in that a technology topic is extracted from a document with a model.
$\alpha$: hyperparameter of the Dirichlet distribution that determines the distribution of documents in LDA
$\beta$: hyperparameter of the Dirichlet distribution that determines the distribution of words in LDA
$\theta_m$: document
$z_{m,n}$: hidden topic
$w_m$: distribution to be included in the latent topic $z_{m,n}$ (multinomial distribution)
$n$: order of words
$N_m$: the entire set of words contained in the m-th document
$w_{m,n}$: the n-th word out of M words

(Deleted)

A data collection unit that collects patent data necessary to generate a technology transfer prediction model;
A data classification unit for classifying the patent data into numerical data and text data;
A data pre-processing module for pre-processing text data, which is unstructured data, among the patent data; And
A technology transfer prediction model configuration module that receives the numerical data and the preprocessed text data, performs topic modeling to extract a technology topic from the text data through an ensemble model, and performs ensemble modeling with the technology topic and the numerical data to generate a technology transfer prediction model,
wherein the technology transfer prediction model configuration module

A machine learning-based technology transfer prediction model generation system, characterized in that it generates a technology transfer prediction model using an ensemble model.
$H(x)$: the strong hypothesis, a collection of the generated $h_t$ (weak learners)
$T$: total number of $h_t$
$\alpha_t$: the weight of the weak classifier
$h_t$: the t-th weak learner