KR20210039905A

KR20210039905A - Method, apparatus and program for generating for thesaurus of patent document

Info

Publication number: KR20210039905A
Application number: KR1020200014498A
Authority: KR
Inventors: 박상준; 김도언
Original assignee: (주)디앤아이파비스
Priority date: 2020-02-06
Filing date: 2020-02-06
Publication date: 2021-04-12

Abstract

A method for generating a thesaurus of patent documents is provided. According to one aspect of the present invention, the method comprises the steps of: obtaining, by a server, the entire set of patent documents that forms the basis of the thesaurus; performing, by the server, morphological analysis on each of the patent documents; converting the words included in each of the patent documents into word vectors based on the result of the morphological analysis; calculating, by the server, a degree of similarity between the word vectors; and grouping, by the server, the words corresponding to the word vectors into similar word groups based on the degree of similarity.

Description

[Method, apparatus and program for generating for thesaurus of patent document]

The present invention relates to a method, an apparatus, and a computer program for generating a synonym dictionary of patent documents.

A prior art search covers not only patents filed domestically but also patents filed abroad. In performing such a search, the words appearing in a patent document are expanded to related concepts, synonyms, similar words, and terms from the same technical field, and prior patents are searched using the expanded words.

In general, in a conventional prior art search the investigator personally studies the target patent or target invention and, relying on experience, expands the words used in it with synonyms and similar words before carrying out the search.

Accordingly, in a conventional prior art search, the search keywords that form the basis of the search are written and chosen according to the investigator's experience and intent, so they differ greatly from one investigator to another and uniform results cannot be obtained. In addition, when an investigator chooses a keyword erroneously or inappropriately, the prior art retrieved with that keyword is unrelated to the target invention or target patent.

Accordingly, there is a need for a technology that can generate a highly accurate synonym dictionary for the words of patent documents for use in prior art searches.

Unexamined Patent Publication No. 10-2010-0113423, 2010.10.21

An object of the present invention is to provide a method, an apparatus, and a computer program that can generate a synonym dictionary for the words used in patent documents.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method for generating a synonym dictionary of patent documents according to one aspect of the present invention for solving the above-described problem may include: obtaining, by a server, the entire set of patent documents that forms the basis of the synonym dictionary; performing, by the server, morphological analysis on each of the patent documents; converting the words included in each of the patent documents into word vectors based on the result of the morphological analysis; calculating, by the server, a degree of similarity between the word vectors; and grouping, by the server, the words corresponding to the word vectors into similar word groups based on the degree of similarity.

Preferably, the converting into word vectors may include converting the words into word vectors through Word2Vec training based on the patent documents.

Preferably, the calculating of the degree of similarity may include calculating the distance between any two of the word vectors and using the calculated distance as the degree of similarity.

Preferably, the grouping into similar word groups may include checking whether the degree of similarity between any two of the word vectors is less than a preset reference similarity and, if it is, grouping the two words corresponding to those two word vectors into the similar word group.

Preferably, the obtaining of the patent documents may include checking whether any of the obtained patent documents satisfies a noise document condition and removing each patent document that satisfies the noise document condition.

An apparatus for generating a synonym dictionary of patent documents according to one aspect of the present invention for solving the above-described problem may include: a memory that stores one or more instructions; and a processor that executes the one or more instructions stored in the memory.

Preferably, the processor may perform the method for generating a synonym dictionary of patent documents by executing the one or more instructions.

A computer program for generating a synonym dictionary of patent documents according to one aspect of the present invention for solving the above-described problem may be combined with a computer, which is hardware, and stored in a computer-readable recording medium so as to perform the method for generating a synonym dictionary of patent documents.

By calculating a detailed importance for a specific word in each of the entire set of patent documents, the patent classification information, and the searched patent documents, and generating a synonym dictionary of patent documents from them, the present invention can subdivide the importance score and calculate it accurately.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is an exemplary diagram for describing a system according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the preprocessing stage of a method for generating a synonym dictionary of patent documents according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating the intermediate stage of a method for generating a synonym dictionary of patent documents according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the post-processing stage of a method for generating a synonym dictionary of patent documents according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method for generating a synonym dictionary of patent documents according to another embodiment of the present invention.
FIG. 6 is a block diagram of an apparatus for generating a synonym dictionary of patent documents according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of its scope, and the present invention is defined only by the scope of the claims.

The terms used in the present specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements other than those mentioned. Throughout the specification, the same reference numerals refer to the same elements, and "and/or" includes each of the mentioned elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are of course not limited by these terms. The terms are used only to distinguish one element from another. Therefore, a first element mentioned below may also be a second element within the technical idea of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in the present specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

The term "unit" or "module" used in the specification refers to software or to a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside in an addressable storage medium or may be configured to run on one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as shown in the drawings. Spatially relative terms should be understood as covering different orientations of the components in use or operation in addition to the orientation shown in the drawings. For example, if a component shown in a drawing is turned over, a component described as "below" or "beneath" another component may be placed "above" that component. Accordingly, the exemplary term "below" may cover both the downward and the upward direction. Components may also be oriented in other directions, and the spatially relative terms may be interpreted according to the orientation.

In the present specification, a computer means any kind of hardware device that includes at least one processor and, depending on the embodiment, may be understood as also covering the software configuration that runs on that hardware device. For example, a computer may be understood as including, but not limited to, a smartphone, a tablet PC, a desktop, a laptop, and the user clients and applications running on each of these devices.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the subject of each step is not limited thereto, and at least some of the steps may be performed by different devices depending on the embodiment.

FIG. 1 is an exemplary diagram for describing a system according to an embodiment of the present invention.

A system for generating a synonym dictionary of patent documents according to the present invention includes a server 10 and an electronic device 20.

The server 10 is a component that generates a synonym dictionary of patent documents by obtaining the entire set of patent documents, converting the words included in each of them into word vectors, calculating the similarity between the word vectors, and grouping the words into similar word groups based on the similarity.

Specifically, the server 10 may obtain, from the electronic device 20, the entire set of patent documents that forms the basis of the synonym dictionary.

In this specification, the entire set of patent documents may be the whole of the documents describing technical content that applicants submit to national patent offices in order to obtain patent registration. However, the term is not limited thereto, and a patent document may be understood as a concept covering various documents that contain technical content, such as an employee invention disclosure prepared for a patent application or a research paper. According to an embodiment, the entire set of patent documents may be at least one of an invention disclosure for a patent application and a paper, and a similar patent document may be at least one of an invention disclosure for a patent application, a paper, and a patent application.

The electronic device 20 is a component that provides patent documents to the server 10. The electronic device 20 according to the present invention may be implemented as a smartphone, but this is only one embodiment; it may include at least one of a smartphone, a tablet personal computer (tablet PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a personal digital assistant (PDA), a portable multimedia player (PMP), or a wearable device.

FIG. 2 is a flowchart illustrating the preprocessing stage of a method for generating a synonym dictionary of patent documents according to an embodiment of the present invention.

Referring to FIG. 2, in step S110, the server 10 may obtain the entire set of patent documents that forms the basis of the synonym dictionary. Specifically, the server 10 may build this set by obtaining, in real time, patent documents as they are published over time. In doing so, the server 10 may receive the most recently published patent documents from an external server or the electronic device 20 through communication and store them in the memory 104.

Subsequently, in step S120, the server 10 may check whether any of the obtained patent documents satisfies a noise document condition. Specifically, the server 10 may remove the patent documents that satisfy the noise document condition from the set. Here, the noise document condition tests whether a patent document is a noise document. It is met when the document contains a blank area of at least a preset reference size, when the same character is repeated at least a preset number of times, or when the ratio between text and images in the document is equal to or greater than a preset reference ratio.

That is, the server 10 may check each of the patent documents for noise and remove the noise documents that satisfy the noise document condition.
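As a non-limiting illustration, the noise check of step S120 can be sketched in Python as follows. The `PatentDoc` container, the three threshold constants, and the reading of the text/image criterion as an image count per word are all assumptions made for this example; the specification states only that the thresholds are preset.

```python
import re
from dataclasses import dataclass


@dataclass
class PatentDoc:
    """Hypothetical container; the specification does not define a document schema."""
    doc_id: str
    text: str
    image_count: int = 0


# Illustrative thresholds (assumptions); the specification only calls them "preset".
MIN_BLANK_RUN = 50          # a whitespace run of this length or longer counts as a blank
MIN_CHAR_REPEAT = 20        # the same character repeated this many times or more
MIN_IMAGES_PER_WORD = 0.05  # assumed reading of the text/image ratio criterion

BLANK_RUN = re.compile(r"\s{%d,}" % MIN_BLANK_RUN)
CHAR_REPEAT = re.compile(r"(\S)\1{%d,}" % (MIN_CHAR_REPEAT - 1))


def is_noise_document(doc: PatentDoc) -> bool:
    """Return True if the document meets any of the three noise conditions."""
    if BLANK_RUN.search(doc.text):
        return True
    if CHAR_REPEAT.search(doc.text):
        return True
    word_count = len(doc.text.split())
    if word_count == 0:
        return True  # assumption: a document with no text is treated as noise
    return doc.image_count / word_count >= MIN_IMAGES_PER_WORD


def remove_noise_documents(docs):
    """Keep only the documents that do not meet the noise document condition."""
    return [doc for doc in docs if not is_noise_document(doc)]
```

In this sketch a document is dropped as soon as any one condition holds, which matches the disjunctive wording of the condition above.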

Thereafter, in step S130, the server 10 may perform morphological analysis on the patent documents and thereby extract the words they contain. Note that the type of morphological analysis method is not limited, as long as the server 10 can analyze the morphemes of the patent documents.

Through this, the server 10 may extract the words contained in the patent documents and convert them into word vectors.

FIG. 3 is a flowchart illustrating the intermediate stage of a method for generating a synonym dictionary of patent documents according to an embodiment of the present invention.

Referring to FIG. 3, in step S210, the server 10 may convert the words included in each of the patent documents into word vectors based on the result of the morphological analysis.

In this case, the server 10 may vectorize a word by projecting its multiple lexical features into a multidimensional real space, thereby converting it into a word vector. In an embodiment, the server 10 may convert words into word vectors using Word2Vec training.

In addition, the server 10 may represent words in a vector space of roughly 200 to 300 dimensions. For training, it may use the CBOW (Continuous Bag of Words) model, which predicts a target word from the direction of meaning formed by its surrounding words, and the Skip-gram model, which predicts the words that can appear around a given word.
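To illustrate the difference between the two training schemes, the following Python sketch builds the training examples each scheme would see for one tokenized sentence. It is not a Word2Vec implementation: the function names and the `window` parameter are assumptions for this example, and an actual embodiment would feed such examples to a neural model in order to learn the vectors.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """(context_words, target) pairs: the neighbours jointly predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            pairs.append((tuple(context), target))
    return pairs
```

Skip-gram yields one example per neighbouring word, whereas CBOW yields one example per position with the whole window as input.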

Here, the distance between two word vectors indicates the similarity between the corresponding words, and the direction of a word vector may indicate its meaning within the patent documents.

Thereafter, in step S220, the server 10 may calculate the similarity between word vectors. Specifically, the server 10 may calculate the cosine similarity between two word vectors, or normalize that cosine similarity, and use the result as the similarity between the two word vectors.

For example, the server 10 may use the cosine of the angle between two word vectors in the real space as the similarity between them. The server 10 may also normalize the cosine similarity to the range from 0 to 1 and use the normalized value as the similarity between the two word vectors.

That is, the server 10 may convert all words contained in the patent documents into word vectors, calculate the distance between every pair of converted word vectors as a cosine similarity, and use the calculated distance as the similarity.
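The calculation of step S220 can be sketched in Python as follows. The function names are assumptions for this example, and the linear rescaling `(cos + 1) / 2` is only one possible way to bring the value into the range from 0 to 1; the specification does not say which normalization is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def normalized_similarity(u, v):
    """Cosine similarity rescaled linearly from [-1, 1] to [0, 1] (assumed scheme)."""
    return (cosine_similarity(u, v) + 1.0) / 2.0
```

Under this rescaling, vectors pointing in the same direction score 1, orthogonal vectors score 0.5, and opposite vectors score 0.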

FIG. 4 is a flowchart illustrating the post-processing stage of a method for generating a synonym dictionary of patent documents according to an embodiment of the present invention.

Referring to FIG. 4, in step S310, the server 10 may check, based on the similarity, whether the word corresponding to a word vector belongs to a similar word group. Specifically, the server 10 checks whether the similarity between any two word vectors is less than a preset reference similarity and, if it is, determines that the words corresponding to the two word vectors belong to the same similar word group.

Thereafter, in step S320, when the words corresponding to the two word vectors are determined to belong to the same similar word group, the server 10 may add the two words to that group.

Meanwhile, even if the similarity between two word vectors is less than the preset reference similarity, the server 10 may decline to add the corresponding words to a similar word group when the similarity with respect to the word vectors of words already included in that group exceeds a preset maximum similarity.

In this way, the server 10 can prevent a similar word group from expanding to the point where the similarity between some pair of its words is reduced.
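Steps S310 and S320, together with the expansion guard, can be sketched in Python as follows. Several assumptions are made here and are not taken from the specification: the similarity value is a cosine distance (1 minus the cosine similarity), so that a smaller value means closer words and "less than the reference" means "similar enough"; the guard is read as requiring a candidate to stay within `max_distance` of every word already in the group; and groups are never merged. The function and parameter names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def group_similar_words(vectors, ref_distance, max_distance):
    """Group words whose pairwise distance is below ref_distance.

    vectors: dict mapping each word to its word vector.
    A word joins an existing group only if it is within max_distance of
    every member, which keeps the group from drifting as it grows.
    """
    groups = []    # list of sets of words
    group_of = {}  # word -> the set it belongs to

    def fits(word, group):
        return all(
            cosine_distance(vectors[word], vectors[member]) <= max_distance
            for member in group
        )

    # Visit the closest pairs first so that groups form around tight pairs.
    pairs = sorted(
        (cosine_distance(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    )
    for dist, a, b in pairs:
        if dist >= ref_distance:
            break  # pairs are sorted, so the rest are farther apart
        ga, gb = group_of.get(a), group_of.get(b)
        if ga is None and gb is None:
            group = {a, b}
            groups.append(group)
            group_of[a] = group_of[b] = group
        elif gb is None and fits(b, ga):
            ga.add(b)
            group_of[b] = ga
        elif ga is None and fits(a, gb):
            gb.add(a)
            group_of[a] = gb
    return groups
```

Without the `fits` check, a chain of pairwise-close words could pull distant words into one group; the check is what stops that chaining in this sketch.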

FIG. 5 is a flowchart illustrating a method for generating a synonym dictionary of patent documents according to another embodiment of the present invention.

Referring to FIG. 5, in step S410, which follows step S210 described above, the server 10 may select a reference word vector from among the word vectors.

Here, the reference word vector may be the word vector corresponding to the reference word of a similar word group described below. For example, if the reference word is "apple", the words that are similar words of "apple" may be included in the similar word group.

Thereafter, in step S420, the server 10 may calculate the similarity between the reference word vector and each of the other word vectors, using the same similarity calculation method as in step S220.

Subsequently, in step S430, the server 10 may check, based on the similarity between the reference word vector and another word vector, whether the word corresponding to the other word vector belongs to the similar word group built around the reference word corresponding to the reference word vector.

Specifically, if the similarity between the reference word vector and another word vector is less than the preset reference similarity, the server 10 may determine that the word corresponding to the other word vector belongs to the similar word group built around the reference word.

Thereafter, in step S440, when the word corresponding to the other word vector is determined to belong to the similar word group built around the reference word, the server 10 may add that word to the group.
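Steps S410 to S440 can be sketched in Python as follows. As an assumption for this example, the similarity value is again taken to be a cosine distance (1 minus the cosine similarity), so that a value below the reference means the word is close to the reference word. The function names and the dictionary-shaped result are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def group_around_reference(reference, vectors, ref_distance):
    """Collect every word whose vector lies within ref_distance of the reference word.

    vectors: dict mapping each word to its word vector.
    Returns {reference: [similar words in alphabetical order]}.
    """
    ref_vec = vectors[reference]  # S410: select the reference word vector
    members = [
        word
        for word, vec in vectors.items()
        # S420/S430: compare each other vector with the reference vector
        if word != reference and cosine_distance(ref_vec, vec) < ref_distance
    ]
    return {reference: sorted(members)}  # S440: the resulting similar word group
```

Because every candidate is compared only with the reference word, this variant needs one comparison per word instead of one per pair of words.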

FIG. 6 is a block diagram of an apparatus for generating a synonym dictionary of patent documents according to an embodiment.

The processor 102 may include one or more cores (not shown), a graphics processing unit (not shown), and/or a connection path (for example, a bus) for transmitting signals to and receiving signals from other components.

The processor 102 according to an embodiment performs the method for generating a synonym dictionary of patent documents described with reference to FIGS. 1 to 5 by executing one or more instructions stored in the memory 104.

For example, by executing one or more instructions stored in the memory, the processor 102 may obtain the entire set of patent documents that forms the basis of the synonym dictionary, perform morphological analysis on each of the patent documents, convert the words included in each of the patent documents into word vectors based on the result of the morphological analysis, calculate the similarity between the word vectors, and group the words corresponding to the word vectors into similar word groups based on the similarity.

Meanwhile, the processor 102 may further include a random access memory (RAM, not shown) and a read-only memory (ROM, not shown) that temporarily and/or permanently store signals (or data) processed inside the processor 102. In addition, the processor 102 may be implemented as a system on chip (SoC) including at least one of a graphics processing unit, a RAM, and a ROM.

The memory 104 may store programs (one or more instructions) for the processing and control performed by the processor 102. The programs stored in the memory 104 may be divided into a plurality of modules according to their functions.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, as a software module executed by hardware, or as a combination of the two. A software module may reside in a random access memory (RAM), a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

The components of the present invention may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented through software programming or software elements. Similarly, the embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with the various algorithms implemented as a combination of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms that run on one or more processors.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

10: server
102: processor
104: memory

Claims

A method of generating a thesaurus of patent documents, the method comprising:
obtaining, by a server, the entire set of patent documents that forms the basis of the thesaurus;
performing, by the server, morphological analysis on each of the patent documents;
converting the words included in each of the patent documents into word vectors based on the result of the morphological analysis;
calculating, by the server, a degree of similarity between the word vectors; and
grouping, by the server, the words corresponding to the word vectors into similar word groups based on the degree of similarity.