KR20220115482A - Apparatus for evaluating latent value of patent based on deep learning and method thereof

Info

Publication number: KR20220115482A
Application number: KR1020210030790A
Authority: KR
Inventors: 손종진; 이종선; 김나미
Original assignee: 주식회사 페어랩스; 이종선; 김나미
Priority date: 2021-02-10
Filing date: 2021-03-09
Publication date: 2022-08-17
Also published as: KR102614912B1

Abstract

Disclosed are a deep learning-based patent potential value evaluation apparatus and a method therefor. The invention performs preprocessing on the patent data whose potential value is to be evaluated, performs word embedding on the preprocessed patent data, applies the word-embedded patent data to a topic model to perform topic inference, calculates the per-text topic proportion and the per-topic word probability from the topic embedding vector that results from the topic inference, and evaluates the potential value of the patent data by combining the calculated per-text topic proportion and per-topic word probability with technology trend information, technology prospect information, and the like. Accordingly, unstructured data and technology trends or prospects are considered together when a patent is valued. Because words are represented not as discrete sparse vectors such as one-hot vectors but as embedding vectors, which are continuous dense vectors, information is compressed and topics can be extracted stably from patent data without degrading topic model performance. The sensitivity of topic model performance to preprocessing tasks such as stop-word handling is reduced, and similarity information between words and topics can be obtained from the vector space.

Description

Apparatus for evaluating latent value of patent based on deep learning and method thereof

The present invention relates to a deep learning-based patent potential value evaluation apparatus and a method therefor, and more particularly to an apparatus and method that perform preprocessing on the patent data whose potential value is to be evaluated, perform word embedding on the preprocessed patent data, apply the word-embedded patent data to a topic model to perform topic inference, calculate the per-text topic proportion and the per-topic word probability from the topic embedding vector that results from the topic inference, and evaluate the potential value of the patent data by combining the calculated per-text topic proportion and per-topic word probability with technology trend information, technology prospect information, and the like.

Patents are valued for purposes such as supporting technology negotiations through technology valuation at the time of technology transfer, valuing the patents a company holds in order to support its capitalization or establish the value of its technology, and obtaining secured loans from financial institutions with the technology value of held patents as collateral.

Such patent valuation is performed using quantitative or qualitative data. With quantitative data, aspects of a patent such as its technical merit, strength of rights, usability, and marketability are measured with quantitative indicators such as the number of forward citations, family members, inventors, and claims, so there is a limit to estimating potential value, including the qualitative aspects of the patent, through simple arithmetic. With qualitative data, predefined keywords and keyword dictionaries are used, so reliable potential value results cannot be obtained for a new technology or prospect factor that falls outside the predefined definition of prospect.

Korean Patent Registration No. 10-1851136 [Title: Global Patent Valuation System and Method]

An object of the present invention is to provide a deep learning-based patent potential value evaluation apparatus and a method therefor that perform preprocessing on the patent data whose potential value is to be evaluated, perform word embedding on the preprocessed patent data, apply the word-embedded patent data to a topic model to perform topic inference, calculate the per-text topic proportion and the per-topic word probability from the topic embedding vector that results from the topic inference, and evaluate the potential value of the patent data by combining the calculated per-text topic proportion and per-topic word probability with technology trend information, technology prospect information, and the like.

A deep learning-based patent potential value evaluation apparatus according to an embodiment of the present invention may include: a control unit that performs preprocessing on patent data whose potential value is to be evaluated, performs word embedding on the preprocessed patent data, applies the word-embedded patent data to a pre-trained topic model to perform topic inference and thereby generates a topic embedding vector as the result of the topic inference, calculates the per-text topic proportion and the per-topic word probability based on the generated topic embedding vector, and evaluates the potential value of the patent data based on the calculated per-text topic proportion and per-topic word probability; and a display unit that displays the evaluation result for the patent data.

As an example related to the present invention, the control unit may perform at least one of a cleaning process, a sentence tokenization process, a tokenization process, and a word segmentation process on the patent data.

As an example related to the present invention, the control unit may generate the topic embedding vector by computing, over the entire word-embedded patent data, a probability distribution over a predefined number of topics.

A deep learning-based patent potential value evaluation method according to an embodiment of the present invention may include: performing, by a control unit, preprocessing on patent data whose potential value is to be evaluated; performing, by the control unit, word embedding on the preprocessed patent data; applying, by the control unit, the word-embedded patent data to a pre-trained topic model to perform topic inference and thereby generating a topic embedding vector as the result of the topic inference; calculating, by the control unit, the per-text topic proportion and the per-topic word probability based on the generated topic embedding vector; and evaluating, by the control unit, the potential value of the patent data based on the calculated per-text topic proportion and per-topic word probability.

As an example related to the present invention, the step of performing preprocessing on the patent data may include: performing, on the patent data, a cleaning process that includes at least one of removal of predefined stop words, removal of predefined special symbols, and removal of words used at or below a predefined frequency; and performing a byte pair encoding (BPE)-based tokenization process on the cleaned patent data.
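The cleaning and BPE tokenization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the stop-word list, the regular expression used for special-symbol removal, the frequency threshold, the end-of-word marker `</w>`, and the function names `clean`, `learn_bpe`, and `bpe_tokenize` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source only says it is "predefined".
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to"}


def clean(docs, max_rare_freq=1):
    """Remove special symbols, stop words, and words used at or below max_rare_freq times."""
    tokenized = []
    for doc in docs:
        # Assumed rule: everything except lowercase letters, digits, and whitespace is a special symbol.
        text = re.sub(r"[^a-z0-9\s]", " ", doc.lower())
        tokenized.append([w for w in text.split() if w not in STOPWORDS])
    freq = Counter(w for words in tokenized for w in words)
    return [[w for w in words if freq[w] > max_rare_freq] for words in tokenized]


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words (toy version)."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for symbols, count in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += count
        vocab = merged_vocab
    return merges


def bpe_tokenize(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

A typical use is to call `clean` on the raw texts, learn merges from the surviving words, and then tokenize each word with `bpe_tokenize`. Words not seen during merge learning still decompose into known subword units, which is the usual reason for choosing BPE over whole-word vocabularies.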

As an example related to the present invention, the step of calculating the per-text topic proportion and the per-topic word probability may include: calculating, for a predefined number of topics and per subset or per document included in the word-embedded patent data, the per-text topic proportion, which is the proportion of each topic over the time intervals; and calculating, for each of the predefined number of topics in the word-embedded patent data, the per-topic word probability, which is the probability distribution of the words that make up the topic.
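The two quantities named above can be illustrated with a small sketch. It assumes, purely for illustration, that each document already has a topic proportion vector, that each document carries a time label such as a filing year, and that per-topic word probabilities are obtained as a softmax over inner products between topic vectors and word vectors (as in embedded topic models). The source does not state the exact formulas, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def topic_word_probs(topic_vecs, word_vecs):
    """For each topic, return a probability distribution over the vocabulary.

    Assumption: probability is proportional to exp(topic vector . word vector).
    """
    words = list(word_vecs)
    return [
        dict(zip(words, softmax([dot(t, word_vecs[w]) for w in words])))
        for t in topic_vecs
    ]


def topic_share_by_period(doc_topics, doc_periods):
    """Average the per-document topic proportions within each time interval."""
    grouped = defaultdict(list)
    for theta, period in zip(doc_topics, doc_periods):
        grouped[period].append(theta)
    return {
        period: [sum(col) / len(rows) for col in zip(*rows)]
        for period, rows in sorted(grouped.items())
    }
```

Averaging within each interval yields one topic proportion vector per period, which gives a time series per topic that later trend analysis can consume.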

As an example related to the present invention, the step of evaluating the potential value of the patent data may include: calculating a latent trend index for each topic by applying a preset statistical analysis technique to the mean, moving average, and momentum of the calculated per-topic proportions over the time intervals; estimating a promising-technology-trend potential value through an operation between the calculated per-topic latent trend indices, used as weights, and the topic distribution latent in the patent data; and estimating the potential value of the patent data by measuring the similarity between topics extracted from preset benchmark patent data and topics associated with the patent data whose potential value is to be evaluated.
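A rough sketch of these three computations follows. The source names the ingredients (mean, moving average, momentum, weighting by the latent trend index, similarity to benchmark topics) but not the formulas, so everything specific here is an assumption: momentum is taken as the change across the most recent window, the latent trend index as the plain average of the three factors, the trend-weighted value as a weighted sum, and similarity as cosine similarity. All function names are hypothetical.

```python
import math


def trend_factors(series, window=3):
    """Return (mean, moving average, momentum) of one topic's share over time."""
    mean = sum(series) / len(series)
    recent = series[-window:]
    moving_avg = sum(recent) / len(recent)
    momentum = recent[-1] - recent[0]  # assumed definition: change across the window
    return mean, moving_avg, momentum


def latent_trend_index(series, window=3):
    """Assumed combination rule: equal-weight average of the three factors."""
    return sum(trend_factors(series, window)) / 3


def trend_value(patent_topic_dist, trend_indices):
    """Weight the patent's topic distribution by the per-topic latent trend indices."""
    return sum(p * w for p, w in zip(patent_topic_dist, trend_indices))


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
```

With these helpers, a benchmark-based estimate could compare a topic vector of the patent under evaluation with a topic vector extracted from the benchmark patents via `cosine`. How the per-topic similarities are aggregated into a single value is left open here because the source does not specify it.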

The present invention performs preprocessing on the patent data whose potential value is to be evaluated, performs word embedding on the preprocessed patent data, applies the word-embedded patent data to a topic model to perform topic inference, calculates the per-text topic proportion and the per-topic word probability from the topic embedding vector that results from the topic inference, and evaluates the potential value of the patent data by combining the calculated per-text topic proportion and per-topic word probability with technology trend information, technology prospect information, and the like. As a result, unstructured data and technology trends or prospects are considered together when a patent is valued. Because words are represented not as discrete sparse vectors such as one-hot vectors but as embedding vectors, which are continuous dense vectors, information is compressed and topics can be extracted stably from patent data without degrading topic model performance. In addition, the sensitivity of topic model performance to preprocessing tasks such as stop-word handling is reduced, and similarity information between words and topics can be obtained from the vector space.
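The contrast between one-hot vectors and dense embedding vectors can be seen in a toy comparison. The three-word vocabulary and the two-dimensional embedding values below are made up for illustration and are not taken from the source.

```python
import math


def one_hot(index, size):
    """Sparse representation: a single 1.0 at the word's vocabulary index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


vocab = ["battery", "cell", "antenna"]

# Hypothetical dense embeddings; real values would come from a trained embedding model.
dense = {
    "battery": [0.9, 0.1],
    "cell": [0.8, 0.3],
    "antenna": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so they carry no similarity signal.
sparse_similarity = cosine(one_hot(0, len(vocab)), one_hot(1, len(vocab)))

# Dense vectors give graded similarity: related words score higher than unrelated ones.
related = cosine(dense["battery"], dense["cell"])
unrelated = cosine(dense["battery"], dense["antenna"])
```

The dense vectors also need only two dimensions here, whereas the one-hot representation grows with the vocabulary size, which is the compression effect the paragraph above refers to.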

FIG. 1 is a block diagram showing the configuration of a deep learning-based patent potential value evaluation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the latent trend factor calculated for each time interval for one specific topic according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example visualization of the latent trend factor calculated for each time interval for one specific topic according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a network graph according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a deep learning-based patent potential value evaluation method according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of an estimated promising-technology-trend potential value according to an embodiment of the present invention.
FIG. 7 is a diagram showing an example of an estimated benchmark potential value according to an embodiment of the present invention.

It should be noted that the technical terms used herein are used only to describe particular embodiments and are not intended to limit the present invention. Unless specifically defined otherwise herein, technical terms should be interpreted as having the meaning generally understood by a person of ordinary skill in the art to which the present invention pertains, and should be interpreted neither too broadly nor too narrowly. If a technical term used herein is an incorrect term that fails to express the idea of the present invention accurately, it should be replaced with a technical term that a person skilled in the art can correctly understand. General terms used herein should be interpreted according to their dictionary definitions or the surrounding context, and should not be interpreted too narrowly.

Singular expressions used herein include the plural unless the context clearly indicates otherwise. Terms such as "consisting of" or "comprising" should not be construed as necessarily including all of the components or steps described herein; some of those components or steps may be absent, and additional components or steps may be included.

Terms including ordinal numbers such as first and second may be used herein to describe components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, a first component may be called a second component, and similarly a second component may be called a first component, without departing from the scope of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted.

In describing the present invention, detailed descriptions of related known technology are omitted where they could obscure the gist of the invention. It should also be noted that the accompanying drawings are provided only to make the idea of the present invention easy to understand and should not be construed as limiting that idea.

FIG. 1 is a block diagram showing the configuration of a deep learning-based patent potential value evaluation apparatus 100 according to an embodiment of the present invention.

As shown in FIG. 1, the deep learning-based patent potential value evaluation apparatus 100 comprises a communication unit 110, a storage unit 120, a display unit 130, an audio output unit 140, and a control unit 150. Not all of the components shown in FIG. 1 are essential; the apparatus 100 may be implemented with more components than those shown in FIG. 1, or with fewer.

The deep learning-based patent potential value evaluation apparatus 100 may be applied to a variety of terminals, such as a smartphone, a portable terminal, a mobile terminal, a foldable terminal, a personal digital assistant (PDA), a portable multimedia player (PMP) terminal, a telematics terminal, a navigation terminal, a personal computer, a notebook computer, a slate PC, a tablet PC, an ultrabook, a wearable device (for example, a smartwatch, smart glasses, or a head-mounted display (HMD)), a WiBro terminal, an Internet Protocol television (IPTV) terminal, a smart TV, a digital broadcasting terminal, an audio video navigation (AVN) terminal, an audio/video (A/V) system, a flexible terminal, and a digital signage device.

The communication unit 110 communicates, over a wired or wireless communication network, with any internal component or with at least one external terminal. The external terminal may include a server (not shown), another terminal (not shown), the patent office server of each country (not shown), and the like. Wireless Internet technologies include wireless LAN (WLAN), Digital Living Network Alliance (DLNA), Wireless Broadband (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), High Speed Downlink Packet Access (HSDPA), High Speed Uplink Packet Access (HSUPA), IEEE 802.16, Long Term Evolution (LTE), Long Term Evolution-Advanced (LTE-A), and Wireless Mobile Broadband Service (WMBS). The communication unit 110 transmits and receives data according to at least one wireless Internet technology, including Internet technologies not listed above. Short-range communication technologies may include Bluetooth, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wideband (UWB), ZigBee, Near Field Communication (NFC), Ultra Sound Communication (USC), Visible Light Communication (VLC), Wi-Fi, and Wi-Fi Direct. Wired communication technologies may include Power Line Communication (PLC), USB communication, Ethernet, serial communication, and optical or coaxial cable.

The communication unit 110 may also exchange information with any terminal through a Universal Serial Bus (USB).

The communication unit 110 also transmits and receives radio signals to and from a base station, the server, the other terminal, the patent office servers, and the like over a mobile communication network built according to technical standards or communication schemes for mobile communication (for example, Global System for Mobile communication (GSM), Code Division Multiple Access (CDMA), CDMA2000, Evolution-Data Optimized or Evolution-Data Only (EV-DO), Wideband CDMA (WCDMA), High Speed Downlink Packet Access (HSDPA), High Speed Uplink Packet Access (HSUPA), Long Term Evolution (LTE), and Long Term Evolution-Advanced (LTE-A)).

Under the control of the control unit 150, the communication unit 110 also connects to a domestic patent office server (not shown), a foreign patent office server (not shown), and other servers that manage patent data.

Under the control of the control unit 150, the communication unit 110 also collects a plurality of patent data (or patent text data) registered on the connected domestic patent office server, foreign patent office server, and the like, through an API (or open API), crawling, or a similar method. When an image file (for example, png, gif, or jpg) or a document file (for example, pdf, docx, or hwp) containing the text data is collected (or downloaded), the control unit 150 may extract the patent data containing the text data from the image file through an optical character reader (OCR) function, or extract the patent data containing the text data from the document file.

The storage unit 120 stores various user interfaces (UIs), graphical user interfaces (GUIs), and the like.

The storage unit 120 also stores the data and programs that the deep learning-based patent potential value evaluation apparatus 100 needs in order to operate.

That is, the storage unit 120 may store a number of application programs (applications) that run on the deep learning-based patent potential value evaluation apparatus 100, as well as data and instructions for the operation of the apparatus 100. At least some of these application programs may be downloaded from an external server through wireless communication. At least some of them may also be present on the apparatus 100 from the time of shipment to provide its basic functions. An application program may be stored in the storage unit 120, installed on the apparatus 100, and run by the control unit 150 to perform an operation (or function) of the apparatus 100.

The storage unit 120 may include at least one storage medium among flash memory, a hard disk, a multimedia card micro type memory, card-type memory (for example, SD or XD memory), magnetic memory, a magnetic disk, an optical disk, random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and programmable read-only memory (PROM). The deep learning-based patent potential value evaluation apparatus 100 may also operate web storage that performs the storage function of the storage unit 120 on the Internet, or may operate in connection with such web storage.

Under the control of the control unit 150, the storage unit 120 also stores the collected patent data registered on the patent office servers of each country.

Under the control of the control unit 150, the display unit 130 may display various content, such as menu screens, using the user interface and/or graphical user interface stored in the storage unit 120. The content displayed on the display unit 130 includes various text or image data (including various information data) and menu screens containing items such as icons, list menus, and combo boxes. The display unit 130 may be a touch screen.

The display unit 130 may include at least one of a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT LCD), an organic light-emitting diode (OLED) display, a flexible display, a 3D display, an e-ink display, and a light-emitting diode (LED) display.

Under the control of the control unit 150, the display unit 130 also displays the collected patent data registered on the patent office servers of each country.

The audio output unit 140 outputs audio information contained in a signal that has undergone predetermined signal processing by the control unit 150. Here, the audio output unit 140 may include a receiver, a speaker, a buzzer, and the like.

In addition, the audio output unit 140 outputs a guidance voice generated by the control unit 150.

In addition, under the control of the control unit 150, the audio output unit 140 outputs audio information (or sound effects) corresponding to the collected plurality of patent data registered in the patent office servers of each country, and the like.

The control unit (controller, or microcontroller unit (MCU)) 150 executes the overall control function of the deep learning-based patent potential value evaluation apparatus 100.

In addition, the control unit 150 executes the overall control function of the deep learning-based patent potential value evaluation apparatus 100 using the programs and data stored in the storage unit 120. The control unit 150 may include a RAM, a ROM, a CPU, a GPU, and a bus, and the RAM, ROM, CPU, GPU, and the like may be connected to one another through the bus. The CPU may access the storage unit 120 and boot using the O/S stored in the storage unit 120, and may perform various operations using the programs, content, data, and the like stored in the storage unit 120.

In addition, the control unit 150 connects, through the communication unit 110, to the domestic patent office server, the overseas patent office servers, and the like that manage patent data.

In addition, the control unit 150 collects a plurality of patent data (or patent text data / patent data for training) registered in the connected domestic patent office server, overseas patent office servers, and the like through an API method (or open API method), a crawling method, or the like. Here, when collecting (or downloading) an image file (e.g., png, gif, or jpg) or a document file (e.g., pdf, docx, or hwp) containing the text data, the control unit 150 may extract the patent data containing the text data from the image file through an optical character reader (OCR) function, or may extract the patent data containing the text data from the document file.

In this case, the control unit 150 may also collect patent data from various open-source patent datasets through a crawling method or the like.

In addition, the control unit 150 performs a preprocessing function on each of the collected patent data. In this case, the control unit 150 may additionally perform a segmentation function.

That is, the control unit 150 performs a normalization process on each of the collected patent data. The normalization process includes at least one of a cleaning (or de-noising) process, a sentence tokenization process, a tokenization process, and a subword segmentation process. Here, because morphemes that are not registered in an existing dictionary may go unrecognized in text data collected from abroad, the control unit 150 uses byte pair encoding (BPE) to match and output the most probable word, which prevents unknown tokens caused by spacing errors.
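The subword step can be illustrated with a minimal byte pair encoding sketch. The toy corpus, the number of merges, and the helper names `learn_bpe` and `apply_bpe` are assumptions made for illustration; the specification does not name a tokenizer implementation.

```python
from collections import Counter


def _merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[_merge(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return list(symbols)


merges = learn_bpe(["lower", "lowest", "low", "low"], num_merges=2)
print(apply_bpe("lowly", merges))
```

A word that was never seen during training is still decomposed into learned subwords and single characters, so no unknown token is needed.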

In addition, the control unit 150 may perform a cleaning process on the plurality of patent data, such as removing predefined stopwords, removing predefined special symbols, and removing words used no more than a predefined number of times, and may then perform the tokenization process on the cleaned patent data.
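The cleaning steps above can be sketched as follows. The stopword list, the symbol pattern, the whitespace tokenization, and the `rare_threshold` parameter are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical predefined stopword list.
STOPWORDS = {"the", "a", "of", "and", "is"}


def clean_documents(docs, rare_threshold=1):
    """Strip special symbols and stopwords, then drop rarely used words."""
    tokenized = []
    for doc in docs:
        # Replace anything that is not a word character or whitespace.
        text = re.sub(r"[^\w\s]", " ", doc.lower())
        tokenized.append([t for t in text.split() if t not in STOPWORDS])
    # Corpus-wide frequency of each remaining word.
    counts = Counter(t for tokens in tokenized for t in tokens)
    # Keep only words used more often than the threshold.
    return [[t for t in tokens if counts[t] > rare_threshold] for tokens in tokenized]


docs = ["The battery cell, and the anode!", "A battery of the cell (anode-free)."]
print(clean_documents(docs))
```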

In this case, the control unit 150 may train and manage a topic model for each language, or may translate the various languages into a preset reference language and train and manage a topic model on the plurality of patent data translated into the reference language.

Here, when the topic model is to be trained and managed on patent data translated into the reference language, the control unit 150 may translate the collected patent data (or the preprocessed patent data) into the preset reference language through the following process.

The control unit 150 applies a preset translation model to the preprocessed plurality of patent data (or the preprocessed and/or segmented plurality of patent data) to translate them into the preset reference language. Here, the translation model includes a sequence-to-sequence model, an attention model, and the like.

That is, the control unit 150 detects the language of each of the preprocessed patent data using a preset language detection function.

In addition, from among a plurality of translation network functions stored in advance in the storage unit 120, the control unit 150 loads (or calls), for each detected language, the specific translation network function whose hyperparameter values are optimized for that language.

In addition, the control unit 150 applies the loaded translation network function to each of the preprocessed patent data and translates each into the preset reference language. When the language of a preprocessed patent data item is already the reference language, the control unit 150 may skip the translation process for that item.
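The detect, load, and translate flow can be sketched as a simple dispatch. The script-based `detect_language` heuristic and the `translators` registry of stand-in callables are assumptions for illustration only; the specification does not say which detection function or translation networks are used.

```python
def detect_language(text):
    """Crude script-based language guess (an assumed stand-in detector)."""
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:  # Hangul syllables
            return "ko"
        if 0x3040 <= code <= 0x30FF:  # Hiragana and Katakana
            return "ja"
    return "en"


def translate_to_reference(text, translators, reference_lang="en"):
    """Route text to the translator registered for its detected language."""
    lang = detect_language(text)
    if lang == reference_lang:
        # Already in the reference language, so translation is skipped.
        return text
    return translators[lang](text)


# Hypothetical registry: each value stands in for a loaded translation network.
translators = {"ko": lambda t: "[ko->en] " + t}

print(translate_to_reference("patent data", translators))
print(translate_to_reference("특허 데이터", translators))
```

Keeping one callable per language mirrors the idea of loading a translation function tuned for each detected language.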

In addition, the control unit 150 maps the plurality of patent data translated into the reference language to the collected plurality of patent data in their original form and stores the mapping in the storage unit 120.

Accordingly, through this translation the control unit 150 does not need to build a separate text classification model for each language and can perform classification consistently with a single classification model.

In addition, the control unit 150 performs a word embedding function on the preprocessed plurality of patent data. The control unit 150 may do so through a static embedding method, a contextualized (or dynamic) word embedding method, or the like. Here, the static embedding method includes the CBOW (Continuous Bag of Words) model, the skip-gram model, the GloVe model, the fastText model, the Lda2Vec model, the Node2Vec model, the Characters Embeddings model, the CNN embeddings model, and the like. The contextualized word embedding method (or dynamic embedding method) includes deep learning-based language models that learn context information, such as Transformer, ELMo (Embeddings from Language Models), BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-training Transformer) models, CoVe (Contextualized Word-Embeddings), CVT (Cross-View Training), ULMFiT (Universal Language Model Fine-tuning), Transformer XL, XLNet (Generalized Autoregressive Pre-training), ERNIE (Enhanced Representation through Knowledge Integration), and FlairEmbeddings (Contextual String Embeddings for Sequence Labelling).

In addition, the control unit 150 divides the word-embedded patent data into a plurality of (or one or more) subsets (or inference data sets) according to a preset reference length (or reference number of words). In this case, the control unit 150 divides (or classifies/separates) the words included in the word-embedded patent data by preset time interval, by a preset number of words (e.g., units of 100 words), or by preset domain concept (or category/class).
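Two of the splitting criteria above, by word count and by time interval, can be sketched as follows. The function names and the `(period, tokens)` input shape are assumptions for illustration.

```python
from collections import defaultdict


def split_by_word_count(tokens, unit=100):
    """Cut a token sequence into consecutive subsets of at most `unit` words."""
    return [tokens[i:i + unit] for i in range(0, len(tokens), unit)]


def split_by_period(dated_docs):
    """Group documents by time interval; input is (period, tokens) pairs."""
    subsets = defaultdict(list)
    for period, tokens in dated_docs:
        subsets[period].append(tokens)
    return dict(subsets)


chunks = split_by_word_count(["w"] * 250, unit=100)
print([len(c) for c in chunks])
print(split_by_period([(2019, ["a"]), (2020, ["b"]), (2019, ["c"])]))
```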

In addition, the control unit 150 stores the result information of the word embedding function in the storage unit 120. Here, the result information of the word embedding function (or the result information for each of the plurality of subsets) includes tokens, a word dictionary, word frequencies, and the like. A token may be a syllable, a word, a compound word, a subword, or the like.

In addition, based on the result information of the word embedding function (or the result information for each of the plurality of subsets), the control unit 150 generates a topic model that includes a plurality of topic sub-models, each containing a network function pre-trained on one of a plurality of subsets (or training subsets). Here, the training subsets may contain mutually different training data grouped by a predetermined criterion, such as the domain of the training data (or patent data) or the time interval in which the data was generated. For example, the plurality of topic sub-models may include a first topic sub-model containing a first network function pre-trained on a first training data subset formed from training data generated in a preset first time interval, a second topic sub-model containing a second network function pre-trained on a second training data subset formed from training data generated in a second time interval different from the first, and so on. Likewise, the plurality of topic sub-models may include an eleventh topic sub-model containing an eleventh network function pre-trained on an eleventh training data subset formed from training data generated in a preset eleventh domain, a twelfth topic sub-model containing a twelfth network function pre-trained on a twelfth training data subset formed from training data generated in a twelfth domain different from the eleventh, and so on.

That is, the control unit 150 generates the topic model including the plurality of topic sub-models using pre-specified parameter values. Here, the parameters include model hyperparameters (e.g., number of topics, dimension of rho (the embedding matrix), dimension of embeddings (word embedding), dimension of the hidden space of theta, and activation function of theta) and optimization hyperparameters (e.g., word embedding method (including dim, epoch, and minCount), batch size, learning rate, epoch, optimizer, dropout rate, gradient clipping, and L2 regularization).
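The listed hyperparameters could be gathered into one configuration object, as in the sketch below. The class name, field names, and every default value are placeholders assumed for illustration; the specification lists the hyperparameters but gives no values.

```python
from dataclasses import dataclass


@dataclass
class TopicModelConfig:
    # Model hyperparameters (placeholder values, not from the specification).
    num_topics: int = 50
    rho_dim: int = 300            # dimension of rho, the embedding matrix
    embedding_dim: int = 300      # dimension of the word embeddings
    theta_hidden_dim: int = 800   # hidden space of theta
    theta_activation: str = "relu"
    # Optimization hyperparameters (placeholder values).
    batch_size: int = 1000
    learning_rate: float = 0.005
    epochs: int = 20
    optimizer: str = "adam"
    dropout_rate: float = 0.0
    gradient_clip: float = 0.0
    l2_regularization: float = 1.2e-6


config = TopicModelConfig(num_topics=20)
print(config)
```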

In addition, the control unit 150 trains the generated topic model using the embedded plurality of patent data.

In this way, the control unit 150 may generate a topic model using a plurality of patent data and pre-train the generated topic model using patent data in various languages.

In addition, the control unit 150 performs a preprocessing function on the patent data whose potential value is to be evaluated. In this case, the control unit 150 may additionally perform a segmentation function.

That is, the control unit 150 performs the preprocessing function (or normalization process) on the patent data in any of the following cases: when one patent data item (or one or more items) is selected by user selection (or user input/touch/control) from among the plurality of patent data stored in advance in the storage unit 120; when one patent data item is selected by user selection from among the plurality of patent data registered in a domestic patent office server (not shown) or an overseas patent office server (not shown) connected through the communication unit 110; or when patent data whose potential value is to be evaluated is received through user input. The normalization process includes at least one of a cleaning (or de-noising) process, a sentence tokenization process, a tokenization process, and a subword segmentation process.

In addition, the control unit 150 performs a byte pair encoding (BPE)-based tokenization process on the patent data. In this case, the control unit 150 may perform a cleaning process on the patent data, such as removing predefined stopwords, removing predefined special symbols, and removing words used no more than a predefined number of times, and may then perform the tokenization process on the cleaned patent data.

In addition, the control unit 150 performs a word embedding function on the preprocessed patent data. The control unit 150 may do so through a static embedding method, a contextualized word embedding method, or the like. Here, the static embedding method includes the CBOW model, the skip-gram model, the GloVe model, the fastText model, the Lda2Vec model, the Node2Vec model, the Characters Embeddings model, the CNN embeddings model, and the like. The contextualized word embedding method (or dynamic embedding method) includes deep learning-based language models that learn context information, such as Transformer, ELMo, BERT, GPT models, CoVe, CVT, ULMFiT, Transformer XL, XLNet, ERNIE, and FlairEmbeddings.

In addition, the control unit 150 stores the result information of the word embedding function in the storage unit 120. Here, the result information of the word embedding function (or the result information for each of the plurality of subsets) includes tokens, a word dictionary, word frequencies, and the like. A token may be a syllable, a word, a compound word, a subword, or the like.

In addition, the control unit 150 applies the word-embedded patent data (or each of the divided subsets) to a pre-trained (or preset) topic model to perform a topic inference function. Here, the word-embedded patent data includes a plurality of tokens, a word dictionary, word frequencies, and the like. The topic model may include a plurality of topic sub-models composed of pre-trained network functions, or may have been generated using pre-specified parameter values.

That is, the control unit 150 performs the topic inference function on the word-embedded patent data using the network function included in the topic model pre-trained on a plurality of patent data (or patent data for training), and generates (or constructs/classifies) a topic embedding vector (or a topic embedding vector related to the word-embedded patent data) as the result of the topic inference. Because this network function was pre-trained on the plurality of patent data before the patent data to be evaluated was received, the control unit 150 can reuse the knowledge of the pre-trained network function (or the per-variable weights it contains). When new input patent data is learned, training is complete once only the features of the new input patent data have been learned, which reduces the time and computation required for overall training.

In this way, the control unit 150 may compute (or generate) the topic embedding vector (or topic embedding matrix) by computing the probability distribution over a predefined number of topics across the entire word-embedded patent data (or inference data set).

Here, topic modeling is a structural analysis method for unstructured text data (or unstructured patent data). By providing the proportion of the topics contained in a document and the distribution of the words that make up each topic, it expresses the topics latent in large volumes of text in an interpretable data form.

In addition, based on the generated topic embedding vector, the control unit 150 calculates the topic proportion for each text and the probability value of the words for each topic.

That is, for each subset or document included in the word-embedded patent data (or the inference data set), the control unit 150 calculates the proportion of each of a predefined number of topics over the time intervals (or the probability distribution of topics / the topic proportion for each text).

In addition, for each of a predefined number of topics in the word-embedded patent data (or the inference data set), the control unit 150 calculates the probability distribution of the words that make up the topic (or the probability value of the words for each topic). Each word can be obtained as an embedding vector using the word embedding matrix.
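One common way to derive a per-topic word distribution from embeddings, in the style of an embedded topic model, is a softmax over the inner products between a topic embedding and each word embedding. The specification does not state this formula, so the sketch below is an assumption, and the two-dimensional vectors are made-up toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(topic_vec, word_vecs):
    """Probability of each word under a topic (assumed ETM-style formula)."""
    scores = [sum(t * w for t, w in zip(topic_vec, vec)) for vec in word_vecs]
    return softmax(scores)


words = ["battery", "anode", "antenna"]
word_vecs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]  # toy word embeddings
topic_vec = [1.0, 0.0]                            # toy topic embedding

probs = topic_word_distribution(topic_vec, word_vecs)
for word, p in zip(words, probs):
    print(word, round(p, 3))
```

Words whose embeddings lie closer to the topic embedding receive a higher probability, which is how similarity between words and topics falls out of the shared vector space.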

In addition, the control unit 150 evaluates the trend, potential value, and the like of the patent data by combining the calculated topic proportion for each text and the probability value of the words for each topic with technology trend information, technology prospect information, and the like.

That is, using [Equation 1] below, the control unit 150 calculates a latent trend index (or latent trend factor) for each topic by applying preset statistical analysis techniques to the average, moving average, momentum, and the like of the calculated per-topic proportions over the time intervals.

[Equation 1] (the equation and its symbols are images in the original publication and are not reproduced in this text)

Here, the symbols of [Equation 1] denote, in order, the trend factor (or latent trend index) of an arbitrary topic A, the proportion of topic A at time t, and the proportion of topic A at time t-1.
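The exact form of [Equation 1] cannot be recovered from this text. As a minimal sketch, assuming the trend factor is the relative change in a topic's proportion between time t-1 and time t, the calculation could look like the following. The relative-change formula, the function names, and the sample proportions are all assumptions.

```python
def trend_factors(proportions):
    """Relative change of a topic's proportion between consecutive periods.

    Assumed form: (p_t - p_{t-1}) / p_{t-1}; not taken from the specification.
    """
    factors = []
    for prev, curr in zip(proportions, proportions[1:]):
        factors.append((curr - prev) / prev if prev else 0.0)
    return factors


def moving_average(values, window):
    """Simple moving average over a fixed-size trailing window."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


topic_a = [0.10, 0.12, 0.09]  # made-up proportions of topic A per period
print(trend_factors(topic_a))
print(moving_average(topic_a, window=2))
```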

For example, as shown in FIG. 2, the control unit 150 calculates the latent trend factor of one specific topic for each time interval.

In addition, as shown in FIG. 3, the control unit 150 visualizes the latent trend factor calculated for that topic in each time interval and displays it on the display unit 130.

In addition, the control unit 150 estimates the latent value with respect to promising technology trends by using the calculated per-topic latent trend indices as weights in an operation with the topic distribution latent in the individual patent data (or the patent data whose potential value is to be evaluated).
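The specification does not define the operation that combines the trend weights with a patent's topic distribution. A minimal sketch, assuming a weighted sum (dot product), is shown below; the function name and the numbers are illustrative only.

```python
def trend_potential_value(doc_topic_dist, topic_trend_index):
    """Weighted sum of a patent's topic proportions by per-topic trend index.

    The weighted-sum form is an assumption, not the specification's formula.
    """
    return sum(p * w for p, w in zip(doc_topic_dist, topic_trend_index))


doc_topic_dist = [0.7, 0.2, 0.1]      # made-up topic proportions of one patent
topic_trend_index = [0.5, -0.1, 0.0]  # made-up latent trend index per topic

print(trend_potential_value(doc_topic_dist, topic_trend_index))
```

Under this assumption, a patent concentrated in topics with rising trend indices scores higher than one concentrated in flat or declining topics.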

In addition, the control unit 150 estimates the potential value of the patent data by measuring (or comparing) the similarity between the topics extracted from preset benchmark patent data and the topics related to the patent data whose potential value is to be evaluated. Here, the benchmark patent data may be patent data that falls within a preset top range (e.g., the top 10%) on prospect indicators including the number of forward citations, the number of inventors, the number of family members, the number of claims, and the like.

In addition, the control unit 150 outputs the evaluation result (or estimation result) for the patent data whose potential value is to be evaluated through the display unit 130 and/or the audio output unit 140.

In addition, the control unit 150 provides the evaluation result (or estimation result) for the patent data whose potential value is to be evaluated to another terminal, another server, or the like through the communication unit 110.

The embodiments of the present invention mainly describe evaluating the potential value of patent data, but the invention is not limited thereto. Using a topic model trained for a specific industry field, the control unit 150 may perform the preprocessing process, the word embedding process, the inference process using the topic model, and the like on text data related to that industry field, and thereby evaluate (or estimate) the potential value of that text data.

In addition, the control unit 150 may perform (or provide) a data similarity analysis function based on topic similarity, so that all data (including, for example, source data related to the patent data and preprocessed patent data) can be compared and analyzed with respect to the topics related to the patent whose potential value is to be evaluated.

That is, the control unit 150 calculates the topic embedding vector of an inference data subset through a matrix operation between the computed topic embedding matrix and the topic probability distribution of that subset, and then calculates the topic similarity between subsets as the cosine similarity between that topic embedding vector and the topic embedding vector of another inference data subset.
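The two steps above can be sketched in plain Python: a subset's topic embedding vector as its topic probability distribution multiplied by the topic embedding matrix, followed by cosine similarity between two such vectors. The probability-weighted sum is one reading of the unspecified "matrix operation", and the toy matrix and distributions are made-up values.

```python
import math


def subset_topic_vector(topic_probs, topic_embeddings):
    """Probability-weighted sum of topic embeddings (row vector times matrix)."""
    dim = len(topic_embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(topic_probs, topic_embeddings))
        for d in range(dim)
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # toy matrix: 2 topics x 2 dims

vec_a = subset_topic_vector([0.9, 0.1], topic_embeddings)
vec_b = subset_topic_vector([0.1, 0.9], topic_embeddings)
vec_c = subset_topic_vector([0.8, 0.2], topic_embeddings)

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```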

In addition, the control unit 150 extends the calculated topic similarity between subsets by computing the cosine similarity for every pair that can be formed among the inference data subsets, and generates a network graph, including an undirected network graph.
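The all-pairs step can be sketched as building a weighted undirected edge list, where each edge weight is the cosine similarity of one pair. The dictionary-based graph representation and the sample vectors are assumptions for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_graph(vectors):
    """Return {(node_a, node_b): weight} for every unordered pair of nodes."""
    edges = {}
    # Sorting the names gives each undirected edge one canonical key.
    for a, b in combinations(sorted(vectors), 2):
        edges[(a, b)] = cosine_similarity(vectors[a], vectors[b])
    return edges


# Made-up topic embedding vectors for three subsets (or patents).
vectors = {"P1": [1.0, 0.0], "P2": [1.0, 1.0], "P3": [0.0, 1.0]}
for edge, weight in build_similarity_graph(vectors).items():
    print(edge, round(weight, 4))
```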

In addition, as shown in FIG. 4, the control unit 150 displays the generated network graph on the display unit 130.

In this way, the control unit 150 can compare the cosine similarity between different patents.

In addition, the control unit 150 performs a topic network centrality calculation using the generated network graph. Here, the centrality measures may include degree centrality, betweenness centrality, closeness centrality, harmonic centrality, Katz centrality, PageRank, eigenvector centrality, and the like. In this case, the control unit 150 may organize the inference data by time interval, calculate the centrality of each topic in each time interval, and calculate the change in centrality over time to obtain the topic network centrality.
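Of the listed measures, degree centrality is the simplest to sketch. The example below computes it per time interval and then takes the difference between intervals. The similarity cut-off that decides whether two topics count as connected, the function names, and the edge weights are assumptions for illustration.

```python
def degree_centrality(nodes, edges, threshold=0.5):
    """Degree centrality, normalized by the number of other nodes.

    `edges` maps (a, b) to a similarity weight; a pair counts as connected
    when its weight reaches `threshold` (an assumed cut-off).
    """
    degree = {n: 0 for n in nodes}
    for (a, b), weight in edges.items():
        if weight >= threshold:
            degree[a] += 1
            degree[b] += 1
    denom = max(len(nodes) - 1, 1)
    return {n: d / denom for n, d in degree.items()}


def centrality_change(earlier, later):
    """Per-node difference in centrality between two time intervals."""
    return {n: later[n] - earlier.get(n, 0.0) for n in later}


nodes = ["T1", "T2", "T3"]
edges_period_1 = {("T1", "T2"): 0.9, ("T1", "T3"): 0.2, ("T2", "T3"): 0.1}
edges_period_2 = {("T1", "T2"): 0.9, ("T1", "T3"): 0.8, ("T2", "T3"): 0.1}

c1 = degree_centrality(nodes, edges_period_1)
c2 = degree_centrality(nodes, edges_period_2)
print(centrality_change(c1, c2))
```

A topic whose centrality rises between intervals is becoming connected to more of the other topics.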

In this way, a pre-processing function is performed on the patent data whose potential value is to be evaluated, a word embedding function is performed on the pre-processed patent data, and a topic inference function is performed by applying the word-embedded patent data to a topic model. The topic proportion for each text and the probability value of the words for each topic are then calculated from the topic embedding vector resulting from the topic inference function, and these values are combined with technology trend information, technology promise information, and the like to evaluate the potential value of the patent data.

Hereinafter, a deep learning-based patent potential value evaluation method according to the present invention will be described in detail with reference to FIGS. 1 to 7.

FIG. 5 is a flowchart illustrating a deep learning-based patent potential value evaluation method according to an embodiment of the present invention.

First, the control unit 150 performs a pre-processing function on the patent data whose potential value is to be evaluated. In this case, the control unit 150 may additionally perform a segmentation function.

That is, the control unit 150 performs the pre-processing function (or normalization process) on the patent data in any of the following cases: when one patent data item (or one or more patent data items) is selected, according to a user selection (or user input/touch/control), from among a plurality of patent data items stored in advance in the storage unit 120; when one patent data item is selected, according to a user selection, from among a plurality of patent data items registered in a domestic patent office server (not shown) or a foreign patent office server (not shown) connected through the communication unit 110; or when patent data whose potential value is to be evaluated is received according to a user input. Here, the normalization process includes at least one of a refinement (or noise removal) process, a sentence tokenization process, a tokenization process, and a word separation process.

As an example, the control unit 150 receives, according to a user input, first patent data whose potential value is to be evaluated, lists the approximately 100,000 words contained in the received first patent data, and performs a refinement process on the listed words, such as the predefined stopword removal function, the predefined special symbol removal function, and the function of removing words used at or below a predefined frequency.
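
For illustration only, the three refinement functions named above (stopword removal, special symbol removal, and removal of words used at or below a predefined frequency) may be sketched as follows. The stopword set, the regular expression, the lowercasing step, and the threshold `max_rare_count` are hypothetical choices rather than values from the disclosure.

```python
import re
from collections import Counter


def refine(words, stopwords, max_rare_count=1):
    """Strip special symbols, drop stopwords, then drop rare words."""
    cleaned = []
    for word in words:
        # Remove anything that is neither a word character nor whitespace.
        word = re.sub(r"[^\w\s]", "", word).lower()
        if word and word not in stopwords:
            cleaned.append(word)
    counts = Counter(cleaned)
    # Keep only words used more often than the predefined frequency.
    return [word for word in cleaned if counts[word] > max_rare_count]


# Hypothetical input words and stopword list.
raw_words = ["The", "battery", "cell,", "and", "the", "battery", "pack!", "cell"]
refined = refine(raw_words, stopwords={"the", "and"}, max_rare_count=1)
```

Here "pack" appears only once and is removed along with the stopwords, leaving the repeated words "battery" and "cell".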

In addition, the control unit 150 performs the BPE-based tokenization process on the plurality of words contained in the refined first patent data to classify about 30,000 words (or tokens) (S510).
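
As a non-limiting sketch of byte pair encoding (BPE), the following learns merge rules by repeatedly merging the most frequent adjacent symbol pair in a word-frequency table. This is the commonly described character-level form of BPE; the function name, the toy vocabulary, and the number of merges are assumptions, and a practical tokenizer would add details such as end-of-word markers.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn BPE merges from a dict mapping each word to its frequency."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + count
        vocab = merged_vocab
    return merges, vocab


# Hypothetical word frequencies.
merges, vocab = learn_bpe_merges({"hug": 5, "pug": 3, "hugs": 2}, num_merges=2)
```

In this toy example the pair ("u", "g") is merged first because it occurs 10 times, followed by ("h", "ug"), so "hugs" ends up segmented as "hug" + "s".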

Thereafter, the control unit 150 performs a word embedding function on the pre-processed patent data. In this case, the control unit 150 may perform the word embedding function through a static embedding method, a contextualized word embedding method, or the like. Here, the static embedding method includes a CBOW model, a skip-gram model, a GloVe model, a fastText model, an Lda2Vec model, a Node2Vec model, a Characters Embeddings model, a CNN embeddings model, and the like. The contextualized word embedding method (or dynamic embedding method) includes deep learning-based language models that learn context information, such as the Transformer, ELMo, BERT, GPT models, CoVe, CVT, ULMFiT, Transformer XL, XLNet, ERNIE, and FlairEmbeddings.

As an example, the control unit 150 performs the word embedding function on the pre-processed patent data using a CBOW model to classify about 300 words.

In addition, the control unit 150 divides the 300 classified words into a first subset containing the 1st to 100th words, a second subset containing the 101st to 200th words, and a third subset containing the 201st to 300th words (S520).
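
The division into subsets in step S520 may be sketched, by way of example, as fixed-size chunking of an ordered word list. The function name and the placeholder word labels are hypothetical.

```python
def split_into_subsets(items, subset_size):
    """Split an ordered list into consecutive chunks of subset_size items."""
    return [items[i:i + subset_size] for i in range(0, len(items), subset_size)]


# Placeholder labels standing in for the 300 classified words.
words = [f"word_{i}" for i in range(1, 301)]
subsets = split_into_subsets(words, 100)
```

With 300 words and a subset size of 100, this yields the first, second, and third subsets described above.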

Thereafter, the control unit 150 performs a topic inference function by applying the word-embedded patent data (or each of the divided subsets) to a pre-trained (or preset) topic model. Here, the word-embedded patent data includes a plurality of tokens, a word dictionary, word frequencies, and the like. The topic model may include a plurality of topic sub-models composed of pre-trained network functions, or may have been generated using pre-specified parameter values.

As an example, the control unit 150 applies the topic model (for example, one including a network function for topic inference) to the approximately 300 words resulting from the word embedding function as input values, and generates a topic embedding vector for those approximately 300 words (S530).

Thereafter, the control unit 150 calculates (or computes) the topic proportion for each text and the probability value of the words for each topic based on the generated topic embedding vector.

As an example, the control unit 150 calculates the topic proportion for each text and the probability value of the words for each topic, respectively, based on the topic embedding vector generated for the approximately 300 words resulting from the word embedding function (S540).
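
The passage above does not state the formulas used in step S540. Purely as an assumed illustration, one way to obtain both quantities from a shared vector space is to apply a softmax to inner products between word vectors and topic embedding vectors, in the style of embedding-based topic models. Every name and value below is hypothetical, and the use of a text's mean word vector for its topic proportion is a simplification rather than the disclosed method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def word_probs_per_topic(word_vectors, topic_vectors):
    """For each topic, a probability distribution over the vocabulary."""
    vocab = list(word_vectors)
    return [
        dict(zip(vocab, softmax([dot(word_vectors[w], topic) for w in vocab])))
        for topic in topic_vectors
    ]


def topic_proportions(text_words, word_vectors, topic_vectors):
    """Distribution over topics for one text, from its mean word vector."""
    vectors = [word_vectors[w] for w in text_words if w in word_vectors]
    if not vectors:
        # No known words: fall back to a uniform distribution.
        return [1.0 / len(topic_vectors)] * len(topic_vectors)
    dim = len(topic_vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return softmax([dot(mean, topic) for topic in topic_vectors])


# Hypothetical 2-dimensional word vectors and two topic embedding vectors.
word_vectors = {"battery": [1.0, 0.0], "cell": [0.9, 0.1], "antenna": [0.0, 1.0]}
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
beta = word_probs_per_topic(word_vectors, topic_vectors)
theta = topic_proportions(["battery", "cell"], word_vectors, topic_vectors)
```

Under these assumptions, a text about batteries receives a larger proportion for the first topic, and the first topic assigns more probability to "battery" than to "antenna".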

Thereafter, the control unit 150 combines the calculated topic proportion for each text and the probability value of the words for each topic with technology trend information, technology promise information, and the like to evaluate the trend, potential value, and the like of the patent data.

That is, using [Equation 1] above, the control unit 150 calculates a latent trend index for each topic (or latent trend factor for each topic) by applying a preset statistical analysis technique to the mean, moving average, momentum, and the like of the topic proportions over the calculated time intervals.
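
As a hedged sketch, the three statistics named above can be computed from a time series of one topic's proportions as follows. This does not reproduce [Equation 1], which lies outside this passage; the window length, the lag, the definition of momentum as a simple difference, and the sample series are all assumptions.

```python
def trend_statistics(proportions, window=2, lag=2):
    """Mean, trailing moving average, and momentum of a topic's proportions."""
    if not proportions:
        raise ValueError("proportions must not be empty")
    mean = sum(proportions) / len(proportions)
    recent = proportions[-window:]
    moving_average = sum(recent) / len(recent)
    # Momentum is taken here as the latest value minus the value `lag`
    # intervals earlier (clipped to the available history).
    lag = min(lag, len(proportions) - 1)
    momentum = proportions[-1] - proportions[-1 - lag]
    return {"mean": mean, "moving_average": moving_average, "momentum": momentum}


# Hypothetical proportions of one topic over four time intervals.
stats = trend_statistics([0.10, 0.15, 0.20, 0.30])
```

For this rising series, the moving average (0.25) exceeds the overall mean (0.1875) and the momentum is positive, which is the kind of signal a trend index could build on.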

In this way, the control unit 150 evaluates the trend, potential value, and the like of any patent data whose potential value is to be evaluated. When doing so, the control unit 150 may evaluate (or estimate/compute) the potential value of the patent data as a value (or score) within a preset range (for example, 0 to 100, or 0 to 100 points), or as a specific monetary amount.

As an example, as shown in FIG. 6, the control unit 150 calculates the latent trend factor 620 for each of topic A, topic B, and topic C based on the proportion 610 of each topic.

In addition, the control unit 150 uses the calculated latent trend index 620 for each topic as a weight and multiplies it by the proportion 610 of each topic to calculate (or estimate) a latent trend score for each topic (or the potential value of a promising technology trend for each topic) 630.
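
The multiplication described above may be illustrated with a short worked example. The numeric proportions and trend factors for topics A, B, and C are hypothetical and are not the values shown in FIG. 6.

```python
def latent_trend_scores(topic_proportions, trend_factors):
    """Multiply each topic's proportion by its latent trend factor."""
    return {
        topic: topic_proportions[topic] * trend_factors[topic]
        for topic in topic_proportions
    }


# Hypothetical topic proportions (610) and latent trend factors (620).
proportions = {"A": 0.5, "B": 0.3, "C": 0.2}
factors = {"A": 1.2, "B": 0.8, "C": 1.5}
scores = latent_trend_scores(proportions, factors)
top_topic = max(scores, key=scores.get)
```

With these assumed values, topic C overtakes topic B despite its smaller proportion, because its trend factor is higher.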

In addition, as shown in FIG. 7, the control unit 150 estimates the potential value of the first patent data by measuring (or comparing) the similarity between topics extracted from preset benchmark patent data and topics related to the first patent data whose potential value is to be evaluated (S550).

As described above, an embodiment of the present invention performs a pre-processing function on the patent data whose potential value is to be evaluated, performs a word embedding function on the pre-processed patent data, performs a topic inference function by applying the word-embedded patent data to a topic model, calculates the topic proportion for each text and the probability value of the words for each topic from the topic embedding vector resulting from the topic inference function, and evaluates the potential value of the patent data by combining these values with technology trend information, technology promise information, and the like. Accordingly, unstructured data and technology trends or promise are considered together when evaluating the value of a patent. Because words are represented not as discrete sparse vectors such as one-hot vectors but as embedding vectors, which are continuous dense vectors, information is compressed and topics can be extracted stably from patent data without degrading the performance of the topic model. In addition, the variation in topic model performance caused by pre-processing tasks such as stopword handling is reduced, and similarity information between words and topics can be obtained by using the vector space.

Those of ordinary skill in the art to which the present invention pertains will be able to modify and vary the foregoing without departing from the essential characteristics of the present invention. Therefore, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

100: deep learning-based patent potential value evaluation apparatus
110: communication unit 120: storage unit
130: display unit 140: audio output unit
150: control unit

Claims

A deep learning-based patent potential value evaluation apparatus comprising:
a control unit that performs a pre-processing function on patent data whose potential value is to be evaluated, performs a word embedding function on the pre-processed patent data, performs a topic inference function by applying the word-embedded patent data to a pre-trained topic model, generates a topic embedding vector as a result of performing the topic inference function, calculates a topic proportion for each text and a probability value of the words for each topic based on the generated topic embedding vector, and evaluates the potential value of the patent data based on the calculated topic proportion for each text and probability value of the words for each topic; and
a display unit that displays an evaluation result for the patent data.

The apparatus of claim 1,
wherein the control unit performs at least one of a refinement process, a sentence tokenization process, a tokenization process, and a word separation process on the patent data.

The apparatus of claim 1,
wherein the control unit generates the topic embedding vector by calculating a probability distribution over a predefined number of topics across the entire word-embedded patent data.

A deep learning-based patent potential value evaluation method comprising:
performing, by a control unit, a pre-processing function on patent data whose potential value is to be evaluated;
performing, by the control unit, a word embedding function on the pre-processed patent data;
generating, by the control unit, a topic embedding vector as a result of performing a topic inference function by applying the word-embedded patent data to a pre-trained topic model;
calculating, by the control unit, a topic proportion for each text and a probability value of the words for each topic based on the generated topic embedding vector; and
evaluating, by the control unit, the potential value of the patent data based on the calculated topic proportion for each text and probability value of the words for each topic.

The method of claim 4,
wherein performing the pre-processing function on the patent data comprises:
performing, on the patent data, a refinement process including at least one of a predefined stopword removal function, a predefined special symbol removal function, and a function of removing words used at or below a predefined frequency; and
performing a byte pair encoding (BPE)-based tokenization process on the patent data on which the refinement process has been performed.

The method of claim 4,
wherein calculating the topic proportion for each text and the probability value of the words for each topic comprises:
calculating, for a predefined number of topics and in units of subsets or documents included in the word-embedded patent data, the topic proportion for each text, which is the proportion of the topics according to a time interval; and
calculating, for each of the predefined number of topics in the word-embedded patent data, the probability value of the words for each topic, which is a probability distribution of the words constituting the topic.

The method of claim 4,
wherein evaluating the potential value of the patent data comprises:
calculating a latent trend index for each topic by applying a preset statistical analysis technique to the mean, moving average, and momentum of the calculated topic proportions over time intervals;
estimating the potential value of a promising technology trend through an operation with the topic distribution latent in the patent data, using the calculated latent trend index for each topic as a weight; and
estimating the potential value of the patent data by measuring the similarity between topics extracted from preset benchmark patent data and topics related to the patent data whose potential value is to be evaluated.