KR20180072167A

KR20180072167A - System for extracting similar patents and method thereof

Info

Publication number: KR20180072167A
Application number: KR1020160175455A
Authority: KR
Inventors: 경진영; 윤장혁; 최덕용
Original assignee: 특허법인 해담; 윤장혁; 최덕용
Priority date: 2016-12-21
Filing date: 2016-12-21
Publication date: 2018-06-29

Abstract

When a company commercializes a technology item, there may be barrier patents that obstruct commercialization, for example through patent infringement. When a barrier patent exists, the company may establish a response strategy such as an invalidation strategy, a design-around, or the development of non-infringement arguments. An object of the present invention is to propose a system for discovering prior patents that can invalidate barrier patents. Based on a patent database (10) that comprehensively includes bibliographic information and expertise information on patents of the United States Patent and Trademark Office, the system ultimately recommends prior patents with the potential to invalidate the barrier patents by means of a keyword group setting module for the barrier patents, an information entropy calculation module for determining how much of the keyword group content an arbitrary patent contains, and a similarity calculation module for calculating the semantic patent similarity between the barrier patents and an arbitrary patent through topic modeling.

Description

SYSTEM FOR EXTRACTING SIMILAR PATENTS AND METHOD THEREOF

The present invention relates to a similar patent extraction system and method for automatically extracting patents similar to a reference patent.

In general, a patent that creates an obstacle to commercialization, such as patent infringement, for a technology item to be commercialized is called a barrier patent. When a company commercializes a specific item without considering patents registered by other applicants, a patent infringement problem may arise when the hard-won product is sold on the market. That is, even if the company is itself a patentee, it can practice its own patent right only if it does not infringe the patents of other applicants. It is therefore necessary to identify which patents may be problematic for a product under development, establish a response strategy, and proceed with development accordingly. For example, when a barrier patent exists for a technology item under development, response strategies such as invalidation of the barrier patent, a design-around, or the development of non-infringement arguments can be established, and these require a high level of patent and technical expertise.

Among the response strategies for barrier patents mentioned above, the present invention aims to support the invalidation strategy. Even a patent that has already been granted may have grounds for invalidation in all or part of the invention recited in its claims, which means that a patent whose patentability should have been denied was granted because of deficient examination. Once invalidation becomes final, the patent right is treated as never having existed, so no infringement issue arises. The invalidation strategy is the response chosen when infringement is difficult to deny through an infringement analysis. Grounds for invalidation arise mainly from novelty and inventive step, and reviewing the grounds for invalidating a core patent recognized as a barrier patent involves collecting and analyzing the prior patent documents related to that barrier patent.

In practice, most of the prior patent document analysis for a barrier patent is carried out as qualitative work by experts. A technical expert reviews the technology described in the barrier patent, writes a patent search query, collects related patents from a patent search service, and then selects the technically relevant patents one by one. Because this work covers anywhere from several thousand to tens of thousands of related patents, it consumes a great deal of time and manpower. According to field experience, one person can only skim at most about 1,000 patent records per day (assuming one record is reviewed every 30 seconds), and when the details of a prior patent, including its claims, are reviewed, even 300 records per day is difficult. In addition, the limits of repetitive review work can lead to frequent errors of judgment.

Next, from a methodological standpoint, existing methods analyze keyword similarity or structural similarity within a collected patent set to identify similarity relationships between patents. With these methods it is difficult to analyze prior patent technologies that fall outside the collection scope, or that exist in potentially different fields, yet could invalidate the barrier patent. They are therefore of limited use as a methodology for discovering prior patent technologies against a barrier patent.

Consequently, similar patents have conventionally been filtered through simple, repetitive work in which a person reviews each patent document in the population one by one and picks out the similar patents, which consumes a great deal of time and manpower.

The following patent document relates to a method and system for searching and classifying patent documents, but does not provide a solution to the problems described above.

Korean Laid-open Patent Publication No. 10-2006-0047306

The present invention has been made to solve the above problems of the prior art, and provides a similar patent extraction system and method that extract patents similar to a reference patent by calculating information entropy from, for example, the presence and occurrence frequency of keywords, centered on the keywords of the reference patent.

The present invention also provides a similar patent extraction system and method that can extract similar patents from a textual standpoint by calculating text similarity through topic modeling analysis between the reference patent and the candidate patents extracted on the basis of keywords.

To achieve the above object, a similar patent extraction method according to an embodiment of the present invention may include: (a) selecting at least one reference patent of interest from among a plurality of patent documents and then setting patent information, including main keywords, for that reference patent; (b) calculating, using the keywords set above, information entropy for those keywords over a patent population or over all patents recorded in a patent database; and (c) selecting and providing at least one candidate patent similar to the reference patent on the basis of the calculated information entropy.

The extraction method may further include: performing topic analysis on each candidate patent, based on word frequency information for the selected candidate patents, to calculate a per-topic probability distribution, and calculating the text similarity with the reference patent using the per-topic probability distribution of each candidate patent; and extracting and providing similar patents having a text structure similar to that of the reference patent according to the calculated similarity.

The patent information may include at least one of the patent number of the reference patent, a search period, and an IPC classification code.

Also to achieve the above object, a similar patent extraction system according to an embodiment of the present invention may include: a reference patent setting unit that selects at least one reference patent of interest from among a plurality of patent documents and then sets patent information, including main keywords, for that reference patent; a frequency measurement unit that extracts the main keywords from each of a plurality of patent documents and measures their occurrence frequency within each document; an information amount calculation unit that calculates the information entropy of each document from the occurrence probability of the main keywords within the document, relative to the total number of word occurrences in the document as measured by the frequency measurement unit; and a candidate patent selection unit that selects at least one candidate patent similar to the reference patent on the basis of the calculated information entropy.

The extraction system may further include: a similarity calculation unit that performs topic analysis on each candidate patent, based on word frequency information for the selected candidate patents, to calculate a per-topic probability distribution, and that calculates the text similarity with the reference patent using the per-topic probability distribution of each candidate patent; and a similar patent extraction unit that extracts and provides similar patents having a text structure similar to that of the reference patent according to the calculated similarity.

According to an embodiment of the present invention, prior patent documents can be analyzed in a way that reflects how much of the technical viewpoint the user wants to analyze they contain. Existing patent search services are based on Boolean search and do not consider how much of that technical viewpoint a search result covers. In contrast, the present system sets groups of words of interest that capture the user's technical viewpoint, together with weights for them, and can thereby determine how closely an arbitrary patent document relates to the technical viewpoint the user wants to examine.

In addition, the present system can reduce dependence on search specialists for tasks such as writing search queries and screening prior patent documents, and its automated process can drastically shorten the time required to collect prior patent technologies and analyze their content.

FIG. 1 is a conceptual diagram for explaining a similar patent extraction system according to the present invention.
FIG. 2 is a block diagram for explaining a similar patent extraction system according to an embodiment of the present invention.
FIG. 3 is a diagram showing the detailed configuration of the similarity calculation unit shown in FIG. 2.
FIG. 4 is a flowchart for explaining a similar patent extraction method according to an embodiment of the present invention.
FIG. 5 is a UI screen showing a similar patent extraction system according to an embodiment of the present invention.
FIG. 6 is a diagram showing the reference patent information input and keyword recommendation screen of FIG. 4.
FIG. 7 is a screen showing the information amount calculation result of FIG. 4.
FIGS. 8A to 8C are diagrams for explaining the LDA applied to the text similarity calculation of FIG. 4.
FIG. 9 is a screen showing the text similarity calculation result of FIG. 4.
FIG. 10 is a final screen that extracts and displays patent documents similar to the reference patent according to an embodiment of the present invention.

Hereinafter, various embodiments of the present invention are described with reference to the accompanying drawings. The embodiments and the terms used in them are not intended to limit the technology described herein to particular embodiments, and should be understood to cover various modifications, equivalents, and/or alternatives of those embodiments. In the description of the drawings, like reference numerals may be used for like components. Singular expressions may include plural expressions unless the context clearly indicates otherwise.

FIG. 1 is a conceptual diagram for explaining a similar patent extraction system according to the present invention.

In general, in a patent or invention search, the search query is written loosely in order to obtain the widest possible population, and patent documents with diverse application technologies or heterogeneous features are all collected. As a result, the population contains noise, and a process of removing the noise or extracting the patents of interest inevitably follows.

The present invention proposes a method for selecting a reference patent of interest from a patent database or a population and efficiently extracting patents similar to it. Here, the reference patent may be a barrier patent that obstructs a company's commercialization, a patent of interest, or a noise patent. What matters is that patents similar to the preset reference patent are searched for and extracted automatically. The embodiments of the present invention are described using the case where the reference patent is a barrier patent.

The method of extracting similar patents uses Shannon entropy, together with keyword frequency and semantic similarity analysis against the reference patent, to cluster the patent documents with high similarity and extract them as a batch.

Assuming that a given barrier patent exists, the present invention aims to propose a system that automatically discovers, from a patent database, prior patents with the potential to neutralize it. Given a reference patent, the proposed system may consist of 1) a word-of-interest group setting module that extracts keywords for the barrier patent (hereinafter also called "words of interest") and sets the technical viewpoint the analyst wants to analyze, 2) an information entropy calculation module that analyzes how much of the information in the word-of-interest groups an arbitrary patent describes, and 3) a module that uses topic modeling to calculate the semantic patent similarity between the barrier patent and an arbitrary patent.

FIG. 2 is a block diagram for explaining a similar patent extraction system according to an embodiment of the present invention.

Referring to FIG. 2, a similar patent extraction system (1) according to an embodiment of the present invention consists of a patent database (10), a reference patent setting unit (110), an information processing unit (130), and a similar patent extraction unit (150). The information processing unit (130) may include a keyword recommendation unit (131), a frequency measurement unit (133), an information amount calculation unit (135), a candidate patent selection unit (137), and a similarity calculation unit (139).

The patent database (10) may be a database storing patent information for each country, or a patent list (Excel) collected with a keyword search query through a patent document search system. The patent document search system may be a search system such as Google Patents, Delphion, KIPRIS, WIPSON, WINTELIPS, or NDSL.

The reference patent setting unit (110) is configured to receive from the user, and set, one or more of the registration number of the reference patent, a search period that takes the filing date of the reference patent into account, core keywords, and the IPC classification code to which the reference patent belongs. There is at least one keyword, and the keywords are used to calculate the information entropy of patent documents, as described later.

When the registration number of the reference patent is entered, the keyword recommendation unit (131) automatically extracts keywords from that patent and provides them to the user. Typical methods for automatically extracting keywords from a document include using the lexical database WordNet (https://wordnet.princeton.edu/) and using a model trained with deep learning. The present system uses the deep-learning-based AlchemyAPI (http://www.alchemyapi.com/), which can extract compound terms. AlchemyAPI is a tool included in Watson, IBM's artificial-intelligence-based data analysis platform, that can analyze images, speech, text, and the like. For text in particular it supports various functions such as entity recognition, sentiment analysis, and abstracting, and the present system uses the keyword extraction function of AlchemyAPI to automatically extract the important keywords from the reference patent. This keyword recommendation is intended as a convenience for the user when setting keywords. To improve the accuracy of keyword selection, the user should review the content of the reference patent directly and select appropriate keywords.

The frequency measurement unit (133) is configured to extract the keywords set for the reference patent from each of the plurality of patent documents stored in the patent database (10), and to measure the occurrence frequency of each extracted keyword within each patent document.
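The frequency measurement described above can be sketched as follows. This is a minimal illustration rather than the implementation of the frequency measurement unit (133): the function names `tokenize` and `keyword_frequencies`, the lowercase alphanumeric tokenization, and the sliding-window matching of multi-word keywords are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_frequencies(text, keywords):
    """Count occurrences of each keyword (single- or multi-word) in a document.

    Returns (counts, total_words): counts maps each keyword to its number of
    occurrences, and total_words is the number of tokens in the document.
    """
    tokens = tokenize(text)
    counts = {}
    for kw in keywords:
        kw_tokens = tokenize(kw)
        n = len(kw_tokens)
        if n == 0:
            continue  # skip keywords that contain no word characters
        # Slide a window of the keyword's length over the document tokens.
        counts[kw] = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == kw_tokens
        )
    return counts, len(tokens)
```

Returning the total token count alongside the keyword counts keeps both quantities that the information amount calculation needs in one place.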

The information amount calculation unit (135) is configured to calculate the information entropy of each patent document by summing, over the preset keywords, terms based on each keyword's occurrence probability within the patent document, that is, its occurrence frequency relative to the measured total number of word occurrences in the document. A different weight may be set for each keyword and reflected in the information entropy calculation.

Information entropy is an important concept in information theory that measures the uncertainty of a situation. A situation with high uncertainty has a high information entropy value, and a situation with low uncertainty has a low one. For example, tossing a coin has lower uncertainty than rolling a die. In other words, a case with two possible outcomes has lower uncertainty, and a lower information entropy value, than a case with six. The amount of information of a system also changes with the probability of each event. With the number of events and all other conditions held equal, the case in which all events are equally likely is the hardest to predict and therefore has the highest information entropy value.
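The coin and die comparison can be checked numerically. The sketch below uses the standard Shannon entropy of a discrete distribution with a base-2 logarithm. The base and the 0.9/0.1 biased coin are assumptions chosen for illustration.

```python
import math


def shannon_entropy(probabilities, base=2):
    """Shannon entropy of a discrete distribution; zero-probability events add nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


coin = [1 / 2] * 2        # fair coin: two equally likely outcomes
die = [1 / 6] * 6         # fair die: six equally likely outcomes
biased_coin = [0.9, 0.1]  # same number of outcomes as the coin, unequal probabilities

print(shannon_entropy(coin))         # 1 bit
print(shannon_entropy(die))          # log2(6), about 2.585 bits
print(shannon_entropy(biased_coin))  # below 1 bit: easier to predict than a fair coin
```

The fair die has higher entropy than the fair coin, and the biased coin has lower entropy than the fair coin, matching the two observations in the paragraph above.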

For a discrete probability distribution, this information entropy can be measured with the Shannon entropy formula of Equation 1 below. The formula expresses the amount of information of a system as a number, which indicates the degree of diversity of the information in that system. That is, a patent document that evenly contains the information the user wants, namely the words of interest, is likely to be a document of interest, so the information entropy of a patent document is calculated by measuring the amount of information it carries on the words of interest.

[Equation 1]

$$H = -\sum_{i} p_i \log p_i$$

where i is a word of interest and p_i is the probability value of word of interest i.
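A per-document score in the spirit of Equation 1 can be sketched as follows, taking p_i as the count of keyword i divided by the total number of words in the document, as described for the information amount calculation unit (135). The function name `keyword_entropy`, the base-2 logarithm, and the way the optional per-keyword weights multiply each term are assumptions. The source states only that weights may be reflected in the calculation, not how.

```python
import math


def keyword_entropy(counts, total_words, weights=None):
    """Entropy-style score: -sum_i w_i * p_i * log2(p_i), with p_i = count_i / total_words.

    counts: mapping keyword -> occurrences in the document.
    weights: optional mapping keyword -> weight (assumed default 1.0).
    """
    if total_words <= 0:
        return 0.0
    weights = weights or {}
    score = 0.0
    for kw, count in counts.items():
        if count <= 0:
            continue  # an absent keyword contributes nothing to the sum
        p = count / total_words
        score -= weights.get(kw, 1.0) * p * math.log2(p)
    return score


# A document covering both keywords scores higher than one that
# concentrates the same number of hits on a single keyword.
even = keyword_entropy({"sensor": 2, "electrode": 2}, total_words=8)
lopsided = keyword_entropy({"sensor": 4, "electrode": 0}, total_words=8)
```

Because the p_i here are fractions of the whole document, they need not sum to 1, so the score is an entropy-like measure of keyword coverage rather than the entropy of a full distribution.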

The candidate patent selection unit (137) is configured to select and provide at least one candidate patent similar to the reference patent on the basis of the calculated information entropy. The selection criterion may be a preset threshold value or a number of top-ranked documents.
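Both selection criteria, a threshold value and a ranked cut-off, can be sketched in a few lines. The function name `select_candidates`, its parameters, and the patent identifiers in the usage example are hypothetical.

```python
def select_candidates(entropy_by_patent, threshold=None, top_n=None):
    """Return patent IDs ordered by descending entropy score.

    threshold: keep only patents whose score is at least this value (optional).
    top_n: keep only the top_n highest-scoring patents (optional).
    """
    ranked = sorted(entropy_by_patent.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(pid, score) for pid, score in ranked if score >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [pid for pid, _ in ranked]


scores = {"P1": 0.42, "P2": 1.10, "P3": 0.05, "P4": 0.77}
by_threshold = select_candidates(scores, threshold=0.4)  # ["P2", "P4", "P1"]
by_rank = select_candidates(scores, top_n=2)             # ["P2", "P4"]
```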

The similarity calculation unit (139) is configured to calculate a per-topic probability distribution through topic analysis of each patent document, based on the word frequency information for the plurality of patent documents generated by the frequency measurement unit (133), and to calculate the similarity with the selected reference patent using the per-topic probability distribution of each patent document.

As shown in FIG. 3, the similarity calculation unit (139) may include a topic probability distribution calculation unit (140), a similarity computation unit (142), and a similar patent clustering unit (144). The topic probability distribution calculation unit (140) is configured to apply a topic modeling algorithm to the word frequency information generated by the frequency measurement unit (133) and calculate the probability distribution with which each patent document belongs to each topic. The similarity computation unit (142) is configured to compare the topic probability distribution of the patent document selected as the reference patent with the topic probability distributions of the other patent documents to be compared, and to compute the similarity between the reference patent and each of those documents. The similar patent clustering unit (144) is configured to cluster and provide the patent documents similar to the reference patent on the basis of the similarity calculated by the similarity calculation unit (139). Here, the patent documents to be compared may be the candidate patents selected by the candidate patent selection unit (137).

As the topic modeling algorithm, any one of LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and PLSA (Probabilistic Latent Semantic Analysis) may be used. As the similarity measure, any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient may be used. These algorithms are well known, so a detailed description is omitted.
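Two of the measures named above can be sketched for topic probability distributions represented as plain lists. The example assumes the per-topic distributions have already been produced by a topic model such as LDA. The three-topic vectors and the candidate labels are made up for illustration. Note that the Hellinger distance is a distance (0 for identical distributions, 1 for distributions with no overlap), so a smaller value means a more similar document.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Hypothetical topic distributions over three topics.
reference = [0.7, 0.2, 0.1]
candidates = {"A": [0.6, 0.3, 0.1], "B": [0.1, 0.1, 0.8]}

# Rank candidates from most to least similar to the reference patent.
ranked = sorted(candidates, key=lambda k: hellinger_distance(reference, candidates[k]))
```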

The similar patent extraction unit (150) extracts and provides similar patents having a text structure similar to that of the reference patent according to the per-document similarity information calculated by the similar patent clustering unit (144). Here, the similar patent extraction unit (150) may extract the patents similar to the reference patent by comparing the calculated per-document similarity information with a preset threshold. For each similar patent, bibliographic items such as the patent number (registration number, application number, etc.), the title of the invention, and the applicant information may be extracted and provided on the client's screen (170).

As described above, the present invention obtains the information entropy of each patent document using the keywords of the reference patent and then selects candidate patents on the basis of that per-document information entropy. The text similarity between the selected candidate patents and the reference patent is then calculated, and the similar patents are finally extracted and provided according to the calculated similarity value and the information entropy. Similar patents may also be selected and extracted using the information entropy alone.

Here, the keywords are used to calculate the information entropy of each patent document, the similar patents are selected on the basis of the information entropy of each patent document, and the similarity is calculated by comparing the topic probability distribution vector of the reference patent with that of each other patent document.

In addition, the Shannon entropy algorithm may be used to calculate the information entropy of each patent document, the Latent Dirichlet Allocation, Latent Semantic Analysis, or Probabilistic Latent Semantic Analysis algorithm may be used to obtain the topic probability distribution of each patent document, and the Hellinger distance, cosine similarity, or Jaccard similarity coefficient may be used to measure the similarity between patent documents.

The overall operation of the document-of-interest extraction system configured as above is described in more detail with reference to FIGS. 4 to 10.

FIG. 4 is a flowchart showing a method of extracting similar patents, which are the documents of interest, according to an embodiment of the present invention.

First, with the similar patent extraction program installed on a client, the user runs the program and connects to the patent extraction system (1). As shown in FIG. 5, the user interface of the extraction program provides input fields for reference patent information such as the reference patent number (registration number or application number), search period, keywords, and IPC (S10, S20).

When the user enters the registration number of the reference patent and clicks the patent selection button at the upper right, the extraction system (1) retrieves the basic information of the reference patent from the patent database (10) and displays it in the reference patent information output window, as shown in FIG. 6 (S30). The information processing unit (130) may then send the document information of the reference patent (abstract, claims, etc.) to an external keyword extraction system (not shown), such as AlchemyAPI, to obtain recommended keywords and display them in the keyword recommendation window. Because keyword extraction systems are known technology, a detailed description is omitted.

The user enters keywords, either by referring to those recommended by the keyword extraction system or by choosing them freely, and similar patents stored in the patent database (10) are then searched on the basis of the reference patent's keywords (S40). In the reference patent information, the search period restricts the search to patents published before the filing date of the reference patent, and the IPC code restricts it to patents that carry that IPC. Setting the IPC code and search period in this way lets the user adjust the set of patents to be searched and, where needed, reduce the computation time of the extraction system (1).
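The restriction of the search population by search period and IPC code can be sketched as follows. This is a minimal illustration, not the system's implementation: the `PatentRecord` shape, its field names, and the prefix-based IPC match are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PatentRecord:
    """Hypothetical record shape; the field names are illustrative only."""
    number: str
    published: date
    ipc_codes: tuple


def restrict_population(records, filing_date, ipc_prefixes=()):
    """Keep records published before filing_date that carry a requested IPC.

    An empty ipc_prefixes means that no IPC restriction is applied.
    """
    selected = []
    for rec in records:
        if rec.published >= filing_date:
            continue  # not published before the reference patent's filing date
        if ipc_prefixes and not any(
            code.startswith(prefix)
            for code in rec.ipc_codes
            for prefix in ipc_prefixes
        ):
            continue  # carries none of the requested IPC codes
        selected.append(rec)
    return selected
```

Narrowing the population before any entropy calculation is what shortens the later steps, since every remaining document must be tokenised and scored.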

Next, the frequency measuring unit (133) of the extraction system (1) extracts the set keywords contained in the patent documents stored in the patent database (10), calculates how often each keyword occurs, and generates word frequency information for those documents. Here, the term frequency (tf) of a keyword indicates how often a given word appears in a patent document, and it may be calculated by adjusting the word's frequency according to the length of the document. For example, the frequency measuring unit (133) may use the raw number of occurrences as the term frequency (tf), or it may calculate it by Equation 2 below. In that case, the most frequent word in a patent document has the value 1, and less frequent words have values smaller than 1.

Here, t is an arbitrary word, d is an arbitrary patent document, w is any word in patent document d, and f(t,d) is the frequency of word t in patent document d.
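The image of Equation 2 is not reproduced in this text. One form consistent with the definitions above, and with the statement that the most frequent word takes the value 1, is tf(t,d) = f(t,d) / max over w of f(w,d). The sketch below assumes that form; the function name is hypothetical, and the original equation may differ (for example, by an additive smoothing term).

```python
from collections import Counter


def normalized_tf(tokens):
    """Return {word: f(word, d) / max_w f(w, d)} for one tokenised document d.

    Assumed form of Equation 2: each count is divided by the count of the
    most frequent word, so that word maps to 1.0.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: freq / peak for word, freq in counts.items()}
```

Dividing by the peak count rather than the document length keeps the values comparable between short and long specifications.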

The information amount calculating unit (135) can calculate the information entropy (Shannon entropy) of each patent on the basis of the set keywords. The keywords are those related to the reference patent; they may be set by the user, and a different weight may be set for each keyword. A weight here means a weight applied to the occurrence frequency of the corresponding keyword.

Specifically, the information amount calculating unit (135) can calculate the information entropy with respect to the keywords using the frequencies of the words in a patent document that belong to the keywords and of those that do not. The frequency information measured by the frequency measuring unit (133) may be used for each word.

Here, the information entropy can be calculated by Equation 3 below.

Here, n is the value of the required information amount of each patent document, k is the number of keyword categories, and h_i is the occurrence probability of keyword i, that is, the number of occurrences of words corresponding to keyword i divided by the total number of word occurrences in one patent document.
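The image of Equation 3 is not reproduced in this text. With the symbols defined above, the standard Shannon entropy takes the following form, which is assumed here to be what Equation 3 expresses. The base of the logarithm is not stated in the text, and whether the non-keyword word group counts as one of the k categories is likewise an assumption left open.

```latex
n = -\sum_{i=1}^{k} h_i \log h_i
```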

For example, when there are three keywords, Stereo, Lithography, and 3D, the frequencies of the words in a particular patent document that belong to the keywords and of those that do not are as shown in Table 1 below.

Here, h_i for Stereo in the particular patent document is the occurrence frequency of Stereo, 4, divided by the total number of word occurrences, 50. Applying the keywords (k1 to k3) of Table 2 to Equation 3 on this basis gives an information entropy of 1.08 for the patent document.

A matrix like Table 1 is produced for each patent document. In the present invention, all words other than those in the keyword list are treated as a single word, forming one word group (non-keywords), which maximizes the share of the information amount attributable to the core words (keywords).

Next, the candidate patent selecting unit (137) selects candidate patents similar to the reference patent on the basis of the calculated information entropy of each patent document and a preset reference value (information entropy) (S50). A patent document is selected as a candidate only when its information entropy is greater than the preset reference value. The selected candidate patents can be presented in order of information entropy, as in FIG. 7, with the patent number, title of the invention, and information entropy (component similarity value) displayed for each. The reference value may also be set as a rank rather than an entropy value. For example, when the reference value is set to 500, the patents ranked 1st to 500th by information entropy are selected as candidates.
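The two cut-off modes described for step S50 (an entropy reference value or a rank) can be sketched as follows. The function and parameter names are hypothetical, and the input is assumed to be a plain mapping from patent number to information entropy.

```python
def select_candidates(entropy_by_patent, min_entropy=None, top_n=None):
    """Rank patents by information entropy (descending) and apply the cut-offs.

    entropy_by_patent: dict mapping patent number -> information entropy.
    min_entropy: keep only patents whose entropy is strictly greater than this.
    top_n: keep only the top_n highest-ranked patents.
    """
    ranked = sorted(entropy_by_patent.items(), key=lambda item: item[1], reverse=True)
    if min_entropy is not None:
        ranked = [(num, ent) for num, ent in ranked if ent > min_entropy]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```

For instance, `select_candidates(scores, top_n=500)` would correspond to the rank-based reference value of 500 mentioned above.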

Where necessary, similar patents can be extracted using the information entropy alone.

For example, the information amount calculating unit (135) uses the keywords selected in the reference patent setting unit (110) to calculate, over all patents, the information entropy with respect to those keywords. The more evenly the keywords stored in the reference patent setting unit (110) occur in a patent, the higher the entropy; when occurrences are concentrated on only one or two of the keywords, the entropy is relatively low. From the viewpoint of this system, a patent that contains all of the keywords of technical interest to the analyst shows a high entropy, whereas a patent that contains only particular keywords shows a relatively low entropy. A patent with a high information entropy can therefore be regarded as highly relevant to the content of the barrier patent that the analyst wishes to invalidate.

Conversely, a patent with a low information entropy contains only part of the content that the analyst wishes to invalidate and is judged unlikely to suffice for invalidating the barrier patent. This module first merges the text information related to the content of each patent, such as the title, abstract, all claims, and detailed description. For the merged full text of each patent, it counts the word occurrences and, among them, the occurrences of the words of interest. Each patent can thus be represented as a vector of words of interest, and the information amount of each patent can be obtained from this patent-to-word-of-interest vector and Equation 3.
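The steps just described (merge the text fields, count word occurrences, count the words of interest, apply the entropy formula) can be sketched as below. This is an illustration under stated assumptions: tokenisation is a simple lowercase alphanumeric split, keywords are single words, all non-keyword words are pooled into one group that counts as a category, and the natural logarithm is used. None of these choices is confirmed by the text, and keyword weights are omitted.

```python
import math
import re
from collections import Counter


def keyword_entropy(fields, keywords):
    """Shannon entropy over keyword categories plus one pooled non-keyword group.

    fields: iterable of text fields of one patent (title, abstract, claims, ...).
    keywords: iterable of single-word keywords; matching is case-insensitive.
    """
    text = " ".join(fields).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    if not tokens:
        return 0.0
    keyset = {k.lower() for k in keywords}
    # Every token that is not a keyword falls into the single "<other>" group.
    counts = Counter(tok if tok in keyset else "<other>" for tok in tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Under these assumptions, a document whose occurrences are spread evenly over the categories scores higher than one dominated by a single keyword, which matches the behaviour described in the text.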

Equation 3 is a modified information amount formula that calculates the amount of information of interest. Using Equation 3, one can therefore calculate an index of how diversely the information of interest to the analyst is expressed, with unnecessary information excluded. After the information amounts are calculated, the candidate patent selecting unit (137) sorts them in descending order, selects a predetermined number of candidate patents, and shows them to the user. The selected candidate patents are passed without further processing to the next module, the similarity calculating unit (139), which calculates their similarity.

The similarity calculating unit (139) performs a topic analysis of each patent document on the basis of the word frequency information of the selected candidate patents to calculate a probability distribution over topics, and uses each document's topic probability distribution to calculate its text similarity with the reference patent (S60).

The similarity calculating unit (139) may include a topic probability distribution calculating unit (140), a similarity calculation unit (142), and a similar patent clustering unit (144).

To extract patents similar to the reference patent, the topic probability distribution calculating unit (140) analyzes the vector similarity between patent documents using a topic modeling algorithm, and can thereby take even latent relationships between patent documents into account. When several candidate patents have been selected by the candidate patent selecting unit (137), the similar patent clustering unit (144) may, for example, apply the topic modeling algorithm to the per-word weights of each candidate patent generated by the frequency measuring unit (133), as shown in FIG. 8, to calculate the probability distribution of each patent document over the topics. The topic modeling algorithm may be, for example, the Latent Dirichlet Allocation (LDA) algorithm.

The LDA algorithm is a known technique commonly used for classifying patent documents by topic, and it is described briefly here with reference to the Matlab code illustrated in FIG. 8. The LDA algorithm starts from the premise that a patent document is a bag of words, that each patent document has particular topics, and that topics are shared across patent documents. For example, suppose there are eight patent documents as in FIG. 8a, each made up of 16 words; the occurrence frequency of each word can then be shown in color. A deeper green indicates a higher occurrence frequency, and a deeper blue a lower one. For patent document 7 in FIG. 8a, only the word at matrix position (3,4) has a markedly high occurrence frequency.

FIG. 8b shows the distributions for the topics: there are eight topics (Topic 1 to Topic 8), and the figure indicates which words each topic contains. In other words, a topic is a distribution over words. For Topic 1, for example, the first to fourth words ((1,1) to (1,4)) have high occurrence frequencies. Applying LDA to the per-word weights of each patent document therefore yields a pattern similar to FIG. 8b, from which each topic is found.

FIG. 8c shows the distribution of topics for each patent document; red is the distribution used to generate the data, and blue is the one found by LDA. Ignoring the order of the topics on the x-axis, LDA can be seen to recover the topics of each patent document closely.

Besides LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis) or PLSA (Probabilistic Latent Semantic Analysis) may also be used as the topic modeling algorithm.

The number of topics (each corresponding to a 'technical field') can be preset in the extraction system (1); repeated tests showed that classifying into 8 to 10 topics is most appropriate. Accordingly, as in Table 3 below, the number of topics was first set to nine, and LDA was then applied to a large number of patent documents to classify them by topic.

As in Table 2 below, the LDA result can show the number of patent documents belonging to each topic and the main keywords that make up each topic, and the main keywords of a topic can be used to judge its character. Topic 1, for example, can be inferred to be a technology cluster on adhesive particulate bonding, in which small particles are bonded together.

In this way, the topic probability distribution calculating unit (140) extracts the keywords of each topic and calculates, for each patent document, the probability distribution over topics as in Table 3 below.

Next, the similarity calculation unit (142) can perform a similarity analysis between patent documents using their probability distributions over topics to calculate inter-document similarity; the similarity may be calculated by any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient.

For example, the similarity calculation unit (142) can calculate, by the Hellinger distance H(P,Q) of Equation 4 below, the similarity between the reference patent and the other patent documents under comparison, as selected by the candidate patent selecting unit (137).

Here, i is a topic, k is the number of topics, p_i is the topic probability distribution of the reference patent document, and q_i is the topic probability distribution of the patent document under comparison.
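The image of Equation 4 is not reproduced in this text. With the symbols defined above, the standard Hellinger distance between two discrete probability distributions is as follows; it is assumed here to be what Equation 4 expresses. The factor 1/√2 is what bounds the value to the range 0 to 1.

```latex
H(P,Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^{2}}
```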

The Hellinger distance H(P,Q) takes a value between 0 and 1; the smaller the value, the more similar the two patent documents, and the larger the value, the less similar they are. For easier intuitive understanding, the final similarity value S(P,Q) may therefore be obtained by subtracting the Hellinger distance H(P,Q) from 1, as in Equation 5 below, and the result used as the similarity value.
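The distance and its conversion to a similarity can be sketched as follows, using the standard definition of the Hellinger distance and S(P,Q) = 1 − H(P,Q) as described above. The function names are hypothetical, and the inputs are assumed to be topic probability vectors of equal length that each sum to 1.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)


def similarity(p, q):
    """Similarity S(P, Q) = 1 - H(P, Q); 1 means identical topic distributions."""
    return 1.0 - hellinger_distance(p, q)
```

Identical distributions give a distance of 0 and a similarity of 1, while distributions with no topic in common give a distance of 1 and a similarity of 0.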

Table 4 below shows an example of the inter-document similarity matrix calculated by Equations 4 and 5.

Next, the similar patent clustering unit (144) clusters and provides the patents similar to the reference patent on the basis of the similarity information calculated by the similarity calculation unit (142). When there are several reference patents, similar patents may be clustered for each reference patent.

The similarity calculation module, the last module of the system proposed in the present invention, calculates the structural text similarity to the barrier patent, on the basis of topic modeling, for the candidate patents judged in the preceding step, from their information amount, to have potential for invalidating the barrier patent. Topic modeling is an algorithm that identifies latent topics from the occurrence of words in a collection of documents and uses the relationships between topics and words to infer which topics each document contains and to what degree. Its final output is a probability distribution over topics for each document, and this module uses each patent's topic probability distribution to calculate the similarity of text structure between patents. A patent composed of text similar to the reference patent will therefore show a high similarity, and the more the words making up a patent differ from those of the barrier patent, the more the topics it covers shift, so it will show a low similarity.

Various measures such as cosine similarity, the Jaccard coefficient, and the Euclidean distance can be used to calculate similarity, but because a patent's topic distribution is a probability distribution, this system uses the Hellinger distance, which is used to compare two probability distributions. The final output of the similarity calculation module is therefore the Hellinger distance between each candidate patent and the barrier patent, given by Equation 4. Whereas a similarity closer to 1 means more similar and closer to 0 means more different, the Hellinger distance approaches 0 as the distributions become more similar and 1 as they diverge, so Equation 5 is used to convert the Hellinger distance into a value with the meaning of a similarity, for easier intuitive understanding.

Thereafter, the similar patent extracting unit (150) can filter the reference patent and its similar patents, as clustered by the similar patent clustering unit (144), against a preset threshold to remove or extract them. That is, the similar patent extracting unit (150) compares the similarity between the reference patent and each patent document under comparison with the preset threshold and selects and recommends documents of interest similar to the reference patent.

That is, as shown in FIG. 9, the extraction system (1) clusters and provides patent documents similar to the reference patent, together with information on the text similarity between the reference patent and those similar patents.

Next, when the user clicks the save-patent-list button at the upper right of FIG. 9, the information and similarity values of the similar patents extracted by the similar patent extracting unit (150) are stored in rank order as in FIG. 10, and the stored patent list can be displayed through the similar patent output unit (170). The stored patent list may be an Excel file or a text file, and the similarity values may be the information-entropy-based component similarity value and the text similarity value calculated by the similarity calculating unit (139).

Before the similarity analysis, the user can select in FIG. 5 the scope of the patent specification to be analyzed (abstract, claims, detailed description, etc.). As in the information amount analysis module, this allows the whole specification including the detailed description to be used when similar patents are searched on the basis of the entire specification, and only the text up to the full set of claims to be used when similar patents are searched claim by claim, so that the results fit the analyst's purpose. The wider the analysis scope, however, the longer the similarity analysis takes.

After the similarity analysis, the final ranking of the candidate patents can be determined by various criteria. However, the similarity calculated in this module is not a technical similarity but a structural similarity based on the occurrence frequencies of words in the text, so rather than judging the likelihood of similar technology from the similarity alone, it is preferable to rely mainly on the information amount and use the structural similarity as a reference. The final ranking is therefore made by information amount, and the similarity may be used as a reference when the analyst later reviews individual patents.
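The final ordering described here, ranked by information amount with the structural similarity carried along for reference only, can be sketched as follows. The dictionary keys and the function name are hypothetical.

```python
def rank_results(candidates):
    """Order candidates by information amount; similarity is kept as reference.

    candidates: iterable of dicts with the hypothetical keys
    'number', 'entropy' and 'similarity'.
    Returns a list of (rank, number, entropy, similarity) tuples.
    """
    ordered = sorted(candidates, key=lambda c: c["entropy"], reverse=True)
    return [
        (rank, c["number"], c["entropy"], c["similarity"])
        for rank, c in enumerate(ordered, start=1)
    ]
```

A candidate with a lower text similarity can therefore still rank first if its information amount is higher, which is the priority the paragraph above recommends.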

The present invention described above is limited not by the foregoing embodiments and the accompanying drawings but by the claims below, and a person of ordinary skill in the art to which the invention pertains will readily appreciate that its configuration may be variously changed and modified without departing from the technical idea of the invention.

1: patent extraction system 10: patent database
110: reference patent setting unit 130: information processing unit
131: keyword recommendation unit 133: frequency measuring unit
135: information amount calculating unit 137: candidate patent selecting unit
139: similarity calculating unit 140: topic probability distribution calculating unit
142: similarity calculation unit 144: similar patent clustering unit
150: similar patent extracting unit 170: similar patent output unit

Claims

(a) selecting at least one reference patent of interest from among a plurality of patent documents, and then setting patent information including main keywords for the reference patent;
(b) calculating, using the keywords set in step (a), the information entropy of each patent in the patent population or of all patents recorded in the patent database; and
(c) selecting and providing at least one candidate patent similar to the reference patent based on the information entropy calculated in step (b).

The method according to claim 1, further comprising:
(d) performing a topic analysis of each candidate patent based on the word frequency information of the selected candidate patents to calculate a topic probability distribution, and calculating a text similarity with the reference patent using the topic probability distribution of each candidate patent; and
(e) extracting and providing similar patents having a text structure similar to the reference patent according to the similarity calculated in step (d).

The method according to claim 1,
Wherein, in step (a), the patent information includes at least one of a patent number, a search period, and an IPC classification code of the reference patent.

The method of claim 1, wherein the information entropy (n)
is calculated by Shannon entropy according to the following equation.
[Equation]

Here, n is the value of the required information amount of each patent document, k is the number of keyword categories, and h_i is the occurrence probability of each keyword, that is, the occurrence frequency of words corresponding to keyword i relative to the total number of word occurrences in one patent document.

The method according to claim 1,
Wherein the reference patent is a barrier patent that is an obstacle to commercialization, or a patent of interest.

The method of claim 2,
Wherein step (d) comprises:
Calculating, by applying a topic modeling algorithm to the word frequency information, a probability distribution of each selected candidate patent over the topics;
Calculating a similarity between the reference patent and each candidate patent under comparison using the topic probability distribution of each candidate patent and the topic probability distribution of the reference patent; and
Clustering and providing patent documents similar to the reference patent based on the calculated similarity.

The method of claim 6,
Wherein the topic modeling algorithm is any one of Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (PLSA), and the similarity is calculated by any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient.

The method of claim 2,
Wherein the similarity in step (e) is calculated by applying the Hellinger distance H(P,Q) of the following equation to the probability distributions of the patent documents over the topics.
[Equation]

Here, k is a topic, t is the number of topics, p_k is the topic probability distribution of the reference patent, and q_k is the topic probability distribution of the patent document under comparison.

The method of claim 8,
Wherein, because the Hellinger distance H(P,Q) equals 0 when the probability distributions of the patent documents are identical, the similarity S(P,Q) is determined by the following equation.
[Equation]

A reference patent setting unit that selects at least one reference patent of interest from among a plurality of patent documents and then sets patent information including main keywords for the reference patent;
A frequency measuring unit that extracts the main keywords from each of a plurality of patent documents and measures their occurrence frequency in the document;
An information amount calculating unit that calculates the information entropy of each document from the occurrence probability of the main keywords relative to the total number of word occurrences in the document, as measured by the frequency measuring unit; and
A candidate patent selecting unit that selects at least one candidate patent similar to the reference patent based on the calculated information entropy.

The system of claim 10, further comprising:
A similarity calculating unit that performs a topic analysis of each candidate patent based on the word frequency information of the selected candidate patents to calculate a topic probability distribution, and calculates a text similarity with the reference patent using the topic probability distribution of each candidate patent; and
A similar patent extracting unit that extracts and provides similar patents having a text structure similar to the reference patent according to the calculated similarity.

The system of claim 10,
Wherein the information entropy (n)
is calculated by Shannon entropy according to the following equation.
[Equation]

The system of claim 11,
Wherein the similarity calculating unit comprises:
A topic probability distribution calculating unit that calculates, by applying a topic modeling algorithm to the generated word frequency information, a probability distribution of each patent document over the topics;
A similarity calculation unit that compares the topic probability distribution of the patent document selected as the reference patent in the reference patent setting unit with the topic probability distribution of each other document under comparison to calculate a similarity between the reference patent and the other patent document; and
A similar patent clustering unit that clusters and provides patent documents similar to the reference patent based on the similarity calculated by the similarity calculation unit.

The system of claim 11,
Wherein the similarity calculating unit
calculates the similarity by applying the Hellinger distance H(P,Q) of the following equation to the probability distributions of the patent documents over the topics.
[Equation]