KR102085214B1

KR102085214B1 - Method and system for acquiring word set of patent document

Info

Publication number: KR102085214B1
Application number: KR1020190122092A
Authority: KR
Inventors: 박상준; 김도언
Original assignee: (주)디앤아이파비스
Priority date: 2019-10-02
Filing date: 2019-10-02
Publication date: 2020-03-04

Abstract

Disclosed is a method for obtaining a word set of a patent document. The method comprises the following steps. A server obtains a patent document. The server obtains a plurality of word sets by performing morphological analysis on the patent document. The server determines an error word set among the obtained plurality of word sets. The server corrects the error word set. The server obtains at least one compound noun phrase set based on a degree of association between the plurality of word sets.

Description

Method and system for acquiring word set of patent document {METHOD AND SYSTEM FOR ACQUIRING WORD SET OF PATENT DOCUMENT}

The present invention relates to a method and system for obtaining a word set of a patent document.

With the growth of Fourth Industrial Revolution industries, the value of intellectual property rights is rising. Accordingly, many people are working to protect their technology and to acquire rights to it, and interest in filing patent applications for technology is growing.

Meanwhile, before a patent application is filed, a prior art search is performed to determine whether the technology can be patented, and the search may be carried out by searching the various patent documents published in the past.

However, because the volume of patent documents is vast and time is limited, analyzing every patent document published in the past is virtually impossible. In practice, prior art searches are performed by methods such as entering search queries in order to obtain the best possible results within the time available.

Prior art searches performed by entering search queries and the like often depend on the skill of the person conducting the search, so stable search results cannot always be guaranteed.

Therefore, there is a growing need for a prior art search method that can guarantee stable results with a small investment of time.

Korean Registered Patent Publication No. 10-0481598, 2005.03.29

The problem to be solved by the present invention is to provide a method and system for obtaining a word set of a patent document.

The problems to be solved by the present invention are not limited to the problem mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method of obtaining a word set of a patent document according to one aspect of the present invention for solving the above problem comprises: obtaining, by a server, a patent document; obtaining, by the server, a plurality of word sets by performing morphological analysis on the patent document; determining, by the server, an error word set among the obtained plurality of word sets; correcting, by the server, the error word set; and obtaining, by the server, at least one compound noun phrase set based on a degree of association between the plurality of word sets.

In this case, the obtaining of the plurality of word sets may include: obtaining a sentence included in the patent document; dividing the sentence into syllable units; matching a part-of-speech tag to the divided syllables; and obtaining a morpheme included in the sentence based on the divided syllables.

In this case, the determining of the error word set may include: matching semantic information to each of the obtained plurality of word sets; obtaining an accuracy score for each of the plurality of word sets based on the matched semantic information and the patent document; and determining a word set whose accuracy score is equal to or less than a preset value to be an error word set.

In this case, the correcting of the error word set may include: obtaining a plurality of pieces of semantic information for the error word; obtaining a plurality of weights, one for each of the obtained pieces of semantic information; obtaining, among the plurality of weights, the weight having the highest degree of association with the patent document; and matching the semantic information corresponding to that weight to the error word set.

In this case, the obtaining of the at least one compound noun phrase set may include: obtaining a plurality of word sets for a sentence included in the patent document; obtaining a plurality of compound noun phrase set candidates formed by combining the plurality of word sets; obtaining the frequency with which a compound noun phrase set identical to each of the obtained candidates appears in the patent document; and determining a candidate whose frequency is equal to or greater than a preset frequency to be a compound noun phrase set. A compound noun phrase set candidate may include order information and spacing information for the word sets it contains.
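The candidate-and-frequency procedure in the preceding paragraph can be illustrated with a minimal Python sketch. It assumes that a candidate is an ordered pair of word sets plus a gap count standing in for the "spacing information", and that frequency is counted over the sentences of one document. The names `compound_candidates`, `select_compounds`, `max_gap`, and `min_freq` are hypothetical and do not come from the patent.

```python
from collections import Counter
from typing import List, Tuple

# (first word set, second word set, number of word sets between them)
Candidate = Tuple[str, str, int]


def compound_candidates(word_sets: List[str], max_gap: int = 0) -> List[Candidate]:
    """Build ordered candidates; the order and the gap are kept with each one."""
    out: List[Candidate] = []
    for i, first in enumerate(word_sets):
        last = min(i + 2 + max_gap, len(word_sets))
        for j in range(i + 1, last):
            out.append((first, word_sets[j], j - i - 1))
    return out


def select_compounds(sentences: List[List[str]], min_freq: int = 2,
                     max_gap: int = 0) -> List[Candidate]:
    """Keep candidates that occur at least min_freq times in the document."""
    counts: Counter = Counter()
    for word_sets in sentences:
        counts.update(compound_candidates(word_sets, max_gap))
    return [cand for cand, n in counts.items() if n >= min_freq]


# Word sets per sentence, as they might come out of morphological analysis.
sentences = [
    ["자연어", "처리", "모델"],
    ["자연어", "처리", "방법"],
    ["특허", "문서"],
]
compounds = select_compounds(sentences, min_freq=2)
```

With these toy sentences only the pair ("자연어", "처리") with gap 0 reaches the assumed threshold of two occurrences.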

In this case, the obtaining of the compound noun phrase set candidates may include: changing the positions of the plurality of word sets when the plurality of word sets satisfy preset conditions; and determining the repositioned word sets to be a compound noun phrase set candidate. The preset conditions may be that the plurality of word sets are adjacent, that a preset symbol is included between the adjacent word sets, that the word set located on the right among the adjacent word sets includes parentheses, and that the plurality of word sets are in different languages.

In this case, the obtaining of the word sets may include: analyzing a sentence included in the patent document; when the sentence has a preset structure, obtaining at least one word set from the sentence other than the word sets corresponding to the preset structure; obtaining semantic information for the at least one word set; and obtaining a word set corresponding to the semantic information.

In this case, the obtaining of the word sets may include: obtaining a plurality of pieces of identification item information and a template included in the patent document; obtaining a plurality of sentences excluding the obtained template; assigning a weight to each of the plurality of pieces of identification item information; determining, based on those weights, the priority of the identification items on which morphological analysis is to be performed; and obtaining, according to the determined priority, word sets for the plurality of sentences from which the template has been excluded.

In this case, the obtaining of the word sets may include: when the patent document includes an image, determining whether an identification item adjacent to the image exists; when the identification item is adjacent to the image, obtaining text included in the image; and obtaining a word set based on the obtained text.

Other specific details of the invention are included in the detailed description and drawings.

According to the various embodiments of the present invention described above, a word set for efficiently analyzing a patent document can be obtained.

The effects of the present invention are not limited to the effect mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a system diagram for implementing a method of obtaining a word set of a patent document according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of obtaining a word set of a patent document according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of obtaining a plurality of word sets from a patent document according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of determining an error word set according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of correcting an error word set according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of obtaining a compound noun phrase set according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method of obtaining a compound noun phrase set according to another embodiment of the present invention.
FIG. 8 is a flowchart illustrating a method of obtaining a word set according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a method of obtaining a word set while excluding the template of a patent document according to an embodiment of the present invention.
FIG. 10 is a flowchart illustrating a method of obtaining a word set when an image exists in a patent document according to an embodiment of the present invention.
FIG. 11 is a block diagram of an apparatus according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the invention pertains, and the invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the invention. In this specification, the singular also includes the plural unless the phrase specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those mentioned. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Therefore, a first component mentioned below may of course be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those skilled in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

The term "unit" or "module" as used herein refers to software or to a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside in an addressable storage medium or may be configured to execute one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as illustrated in the drawings. Spatially relative terms should be understood to encompass different orientations of the components in use or operation, in addition to the orientation shown in the drawings. For example, if a component shown in a drawing is turned over, a component described as "below" or "beneath" another component may be placed "above" that component. Thus, the exemplary term "below" can encompass both the below and above orientations. A component may also be oriented in other directions, and the spatially relative terms may be interpreted according to the orientation.

In this specification, a computer means any kind of hardware device including at least one processor and, depending on the embodiment, may be understood to also encompass the software configuration that operates on that hardware device. For example, a computer may be understood to include, without limitation, smartphones, tablet PCs, desktops, notebooks, and the user clients and applications running on each of these devices.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the subject of each step is not limited thereto, and depending on the embodiment, at least some of the steps may be performed by different devices.

FIG. 1 is a system diagram for implementing a method of obtaining a word set of a patent document according to an embodiment of the present invention.

The system for obtaining a word set of a patent document according to the present invention includes a server 10 and an electronic device 20.

The server 10 is a component for obtaining a patent document and obtaining word sets from the obtained patent document.

Specifically, the server 10 may receive a patent document from the electronic device 20 or obtain a patent document from an external server, and may obtain the word sets of the obtained patent document.

In this specification, a patent document may be a document describing technical content that an applicant submits to a national patent office in order to obtain a patent. However, the term is not limited thereto, and a patent document may be understood as a concept covering various documents containing technical content, such as invention disclosures prepared for a patent application and academic papers.

The electronic device 20 is a component for providing patent documents to the server 10. The electronic device 20 according to the present invention may be implemented as a smartphone, but this is only one embodiment; it may include at least one of a smartphone, a tablet PC (tablet personal computer), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a PDA (personal digital assistant), a PMP (portable multimedia player), or a wearable device.

Hereinafter, the present invention is described in detail through the various embodiments of FIGS. 2 to 10.

FIG. 2 is a flowchart illustrating a method of obtaining a word set of a patent document according to an embodiment of the present invention.

In step S110, the server 10 may obtain a patent document.

In one embodiment, the server 10 may obtain an invention disclosure from the electronic device 20 as the patent document. In another embodiment, the server 10 may receive a patent application publication or a registered patent publication from an external server and obtain it as the patent document.

In step S120, the server 10 may obtain word sets by performing morphological analysis on the patent document.

In one embodiment, the server 10 may perform morphological analysis of the patent document using the Mecab morphological analyzer. However, the analysis is not limited thereto, and various other morphological analyzers such as Okt, Komoran, Hannanum, and Kkma may of course be used as the case requires. Furthermore, the server 10 may use different morphological analyzers depending on the language of the patent document to be analyzed.

In step S130, the server 10 may determine an error word set among the obtained plurality of word sets.

In one embodiment, when an obtained word set does not satisfy a preset condition, the server 10 may determine the corresponding word to be an error word. This is described in detail later.

In step S140, the server 10 may correct the error word set.

In step S150, the server 10 may obtain at least one compound noun phrase set based on the degree of association between the plurality of word sets.
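Steps S120 to S150 form a simple pipeline, and a minimal Python sketch can make the data flow concrete. The function name `acquire_word_sets` and the four callables passed to it are hypothetical placeholders, not part of the patent; each stands in for whichever analyzer, error test, correction rule, or compound finder an implementation actually uses.

```python
from typing import Callable, List, Tuple

Compound = Tuple[str, ...]


def acquire_word_sets(
    document: str,
    analyze: Callable[[str], List[str]],
    is_error: Callable[[str, str], bool],
    correct: Callable[[str, str], str],
    find_compounds: Callable[[List[str]], List[Compound]],
) -> Tuple[List[str], List[Compound]]:
    """Run steps S120 to S150 on a document already obtained in S110."""
    word_sets = analyze(document)                       # S120: morphological analysis
    corrected = [
        correct(w, document) if is_error(w, document) else w   # S130 and S140
        for w in word_sets
    ]
    return corrected, find_compounds(corrected)         # S150: compound noun phrases


# Toy stand-ins, for illustration only.
doc = "머신 러닝 을이용 자연어 처리"
words, compounds = acquire_word_sets(
    doc,
    analyze=str.split,                                  # whitespace split, not a real analyzer
    is_error=lambda w, d: w.startswith("을"),           # flags the mis-parsed token
    correct=lambda w, d: w[1:],                         # strips the fused particle
    find_compounds=lambda ws: list(zip(ws, ws[1:])),    # adjacent pairs as candidates
)
```

Passing the steps in as callables keeps the sketch independent of any particular morphological analyzer or scoring model.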

FIG. 3 is a flowchart illustrating a method of obtaining a plurality of word sets from a patent document according to an embodiment of the present invention.

In step S210, the server 10 may obtain the sentences included in the patent document. Specifically, the server 10 may divide the text of the patent document into sentences and perform morphological analysis on each sentence.

In step S220, the server 10 may divide the sentence into syllable units; in step S230, the server 10 may match a part-of-speech tag to each divided syllable; and in step S240, the server 10 may obtain the morphemes included in the sentence based on the divided syllables.

In one embodiment, the server 10 may perform the syllable division and part-of-speech tag matching based on LSTM-CRFs and the word2vec algorithm, but the process is not limited thereto.
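Steps S220 to S240 can be sketched in Python as follows. The sketch assumes that each syllable receives a tag of the form `B-POS` (first syllable of a morpheme) or `I-POS` (continuation); this tagging scheme is an assumption made for illustration and is not stated in the patent. The tags here are written by hand where a trained model such as the LSTM-CRF mentioned above would supply them, and the function names are hypothetical.

```python
from typing import List, Tuple


def split_syllables(sentence: str) -> List[str]:
    """S220: one item per syllable block; whitespace is dropped."""
    return [ch for ch in sentence if not ch.isspace()]


def merge_morphemes(syllables: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """S240: join consecutive syllables into (morpheme, POS) pairs using S230 tags."""
    morphemes: List[Tuple[str, str]] = []
    for syllable, tag in zip(syllables, tags):
        prefix, pos = tag.split("-", 1)
        starts_new = prefix == "B" or not morphemes or morphemes[-1][1] != pos
        if starts_new:
            morphemes.append((syllable, pos))
        else:
            text, _ = morphemes[-1]
            morphemes[-1] = (text + syllable, pos)
    return morphemes


syllables = split_syllables("머신 러닝을 이용")
# S230: hand-written tags standing in for a trained tagger's output.
tags = ["B-NNG", "I-NNG", "B-NNG", "I-NNG", "B-JKO", "B-NNG", "I-NNG"]
morphemes = merge_morphemes(syllables, tags)
```

Here NNG (general noun) and JKO (object particle) follow the Sejong-style tag names commonly used by Korean analyzers.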

FIG. 4 is a flowchart illustrating a method of determining an error word set according to an embodiment of the present invention.

In step S310, the server 10 may match semantic information to each of the obtained plurality of word sets.

Here, semantic information may mean the intent of a word set. The server 10 may perform natural language processing using an artificial intelligence model to obtain the semantic information for a word set.

Specifically, the artificial intelligence model includes a natural language understanding unit. Based on the result of sentence analysis, the natural language understanding unit may identify entities and the intent of the word sets included in the sentence. Furthermore, it may interpret the sentence by analyzing its structure and principal components, and may perform sentence analysis using statistics, analysis, and the like.

In one embodiment, the server 10 may analyze a sentence containing the Korean word '사과' (sagwa) to obtain semantic information for it. For example, assume that the word set obtained through sentence analysis is '사과'. The semantic information for '사과' may indicate a kind of fruit (an apple) as a noun, but it may also indicate, as a verb, the act of admitting a fault to another person (an apology). Among the plurality of pieces of semantic information for '사과', the server 10 may match to '사과' the one judged to fit the sentence.

In step S320, the server 10 may obtain an accuracy score for each of the plurality of word sets based on the matched semantic information and the patent document.

In this case, step S310 obtains the semantic information for a word set from a single sentence, so the semantic information obtained in step S310 may be inaccurate. The server 10 may therefore obtain an accuracy score for the semantic information matched to the word set based on the entire patent document that contains the sentence.

That is, the server 10 may obtain the accuracy score for the semantic information of a word set obtained through sentence analysis based on the semantic information of the same word set found throughout the patent document.

For example, if the semantic information obtained for '사과' by analyzing one sentence refers to the fruit, but the semantic information for '사과' found throughout the patent document containing that sentence refers to an apology, the server 10 may assign a low accuracy score to the previously matched semantic information.

In step S330, the server 10 may determine a word set whose accuracy score is equal to or less than a preset value to be an error word set.

That is, when the accuracy score is equal to or less than the preset value, the server 10 may determine that the semantic information matched to the word set was matched incorrectly and may obtain the word set as an error word set.
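The scoring and thresholding of steps S320 and S330 can be illustrated with a minimal Python sketch. The patent does not give a formula for the accuracy score; the sketch assumes the score is the fraction of the word set's occurrences in the whole document whose sense agrees with the sense matched from the single sentence. The function names, the sense labels, and the threshold of 0.5 are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def accuracy_score(matched_sense: str, document_senses: List[str]) -> float:
    """S320: share of document-wide occurrences that agree with the matched sense."""
    if not document_senses:
        return 0.0
    return Counter(document_senses)[matched_sense] / len(document_senses)


def find_error_word_sets(matched: Dict[str, str],
                         document_senses: Dict[str, List[str]],
                         threshold: float = 0.5) -> List[str]:
    """S330: word sets whose score is at or below the preset value."""
    return [
        word for word, sense in matched.items()
        if accuracy_score(sense, document_senses.get(word, [])) <= threshold
    ]


# Sense matched from one sentence (S310) versus senses seen across the document.
matched = {"사과": "fruit", "처리": "processing"}
document_senses = {
    "사과": ["apology", "apology", "apology", "fruit"],
    "처리": ["processing", "processing"],
}
errors = find_error_word_sets(matched, document_senses)
```

In this toy input the sentence-level sense of '사과' agrees with only one of four occurrences, so it falls below the assumed threshold and is flagged.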

FIG. 5 is a flowchart illustrating a method of correcting an error word set according to an embodiment of the present invention.

In step S410, the server 10 may obtain a plurality of pieces of semantic information for the error word. Specifically, the server 10 may obtain the plurality of meanings that the error word can have from a plurality of patent documents.

In step S420, the server 10 may obtain a plurality of weights, one for each of the obtained pieces of semantic information.

In one embodiment, the server 10 may obtain the weights based on the patent document that contains the error word. Specifically, the server 10 may obtain at least one sentence containing the error word from the patent document and obtain the semantic information of each word in those sentences that is identical to the error word. Based on the semantic information of these identical words, the server 10 may obtain the weights for the plurality of pieces of semantic information.

In step S430, the server 10 may obtain, among the plurality of weights, the weight having the highest degree of association with the patent document.

In step S440, the server 10 may match the semantic information corresponding to the weight having the highest degree of association with the patent document to the error word set. That is, among the plurality of pieces of semantic information for the error word, the server 10 may determine the one with the largest weight to be the semantic information of the error word and thereby correct the error word.
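Steps S410 to S440 amount to weighting each candidate sense and keeping the best one. The Python sketch below assumes that the weight of a sense is its share among the senses observed for the same word elsewhere in the document; the patent does not specify how weights are computed, so this formula, the function names, and the sense labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def sense_weights(candidate_senses: List[str],
                  observed_senses: List[str]) -> Dict[str, float]:
    """S420: weight each candidate by how often it was observed in the document."""
    counts = Counter(observed_senses)
    total = sum(counts[s] for s in candidate_senses)
    return {s: (counts[s] / total if total else 0.0) for s in candidate_senses}


def correct_sense(candidate_senses: List[str], observed_senses: List[str]) -> str:
    """S430 and S440: pick the candidate with the largest weight."""
    weights = sense_weights(candidate_senses, observed_senses)
    return max(candidate_senses, key=lambda s: weights[s])


candidates = ["fruit", "apology"]                 # S410: possible senses of the error word
observed = ["apology", "fruit", "apology"]        # senses of the same word in the document
weights = sense_weights(candidates, observed)
corrected = correct_sense(candidates, observed)
```

When two candidates tie, `max` returns the one listed first, so the order of `candidate_senses` acts as a tie-breaker in this sketch.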

Meanwhile, according to various embodiments of the present invention, an error word set may of course be not only a word set to which incorrect semantic information has been matched but also a word set that was parsed incorrectly because of an error in the morphological analysis. In one embodiment, when the sentence from which word sets are obtained is "머신 러닝을 이용한 자연어 처리를 한다" ("performs natural language processing using machine learning"), the server 10 may obtain 머신 (machine), 러닝 (learning), 이용 (use), 자연어 (natural language), and 처리 (processing) as word sets.

이 경우, 형태소 분석의 오류로 인하여, 머신, 러닝, 을이용, 자연어, 처리와 같이 단어 세트를 획득하는 경우가 발생할 수 있다. 이 경우, 서버(10)는 '을이용'을 오류 단어 세트로 판단하고, 수정할 수 있다. In this case, due to an error in morphological analysis, a case of acquiring a word set such as a machine, a learning, a using, a natural language, and a processing may occur. In this case, the server 10 may determine 'use' as an error word set and correct it.

Specifically, after obtaining semantic information for each obtained word set, the server 10 may determine whether the word set is an error word based on the degree of association between the semantic information and the patent document. In one embodiment, when an obtained word set appears at least a preset number of times in the patent document that contains it, the server 10 may determine that the word set is free of error. In another embodiment, when a word set appears at least a preset number of times across all patent documents, the server 10 may regard it as a commonly used word set and determine that it is not an error word set. In another embodiment, when no semantic information can be found for a word set, the server 10 may determine it to be an error word set. In yet another embodiment, when the semantic information matched to a word set is inconsistent with the patent document containing the word set, the server 10 may determine it to be an error word set.
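The frequency and missing-meaning checks above are described as separate embodiments. The sketch below combines three of them in one function purely for illustration; the order of the checks, the threshold values, and the function name are assumptions.

```python
def is_error_word_set(doc_count, corpus_count, meaning,
                      min_doc_count=3, min_corpus_count=100):
    """Heuristic error check for one word set.

    doc_count:    occurrences in the patent document containing the word set
    corpus_count: occurrences across all patent documents
    meaning:      matched semantic information, or None if none was found
    Thresholds are hypothetical preset frequencies.
    """
    if doc_count >= min_doc_count:
        return False          # frequent in its own document: treated as error-free
    if corpus_count >= min_corpus_count:
        return False          # commonly used across the corpus: not an error
    return meaning is None    # rare and without semantic information: error
```

For instance, a mis-segmented token that occurs once and has no dictionary meaning is flagged, while a rare but known term is not.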

Specifically, the server 10 may obtain the word sets contained in the entire patent document and store each one matched with its semantic information. The server 10 may cluster the matched word sets and semantic information to obtain clusters of word sets with related semantic information. From among these clusters, the server 10 may obtain those containing no more than a preset number of word sets. Here, the preset number refers not to the number of occurrences of identical word sets in the patent document but to the number of word sets with distinct forms. For example, even if "을이용" is found many times in a patent document, every occurrence is the same word set, so it counts as one word set. The server 10 may determine the word sets in the clusters obtained in this step to be error word sets.
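A minimal sketch of the cluster-size test, assuming the clustering has already been done and each cluster is given as a list of word-set occurrences. The function name and the default limit of one distinct form are illustrative assumptions.

```python
def small_cluster_words(clusters, max_distinct=1):
    """Return word sets from clusters with at most `max_distinct` distinct forms.

    Repeated occurrences of the same form count once, as the text specifies.
    """
    errors = set()
    for members in clusters.values():
        distinct = set(members)
        if len(distinct) <= max_distinct:
            errors |= distinct
    return errors


clusters = {
    0: ["머신", "러닝", "자연어", "처리", "머신"],
    1: ["을이용", "을이용", "을이용"],   # many occurrences, one distinct form
}
flagged = small_cluster_words(clusters)
```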

However, the invention is not limited to this; the server 10 may also determine as error word sets the word sets in any cluster whose distance from the other clusters is at least a preset distance. Specifically, the server 10 may obtain the center point of each cluster and determine the distances between clusters from those center points. For example, suppose the obtained clusters are a first through a fourth word set cluster. The server 10 may obtain the distance between the first and second clusters, between the first and third clusters, and between the first and fourth clusters; when all three distances are at least the preset distance, it may determine the word sets of the first word set cluster to be error word sets.
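The center-point comparison can be sketched as below. It assumes each word set has already been embedded as a numeric vector and uses Euclidean distance; neither the embedding nor the distance metric is specified in the text, so both are assumptions, as are the names.

```python
import math


def centroid(points):
    """Mean of a list of equal-length numeric vectors."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def isolated_clusters(clusters, min_distance):
    """Names of clusters whose center is at least `min_distance` from every other center."""
    centers = {name: centroid(pts) for name, pts in clusters.items()}
    isolated = []
    for name, center in centers.items():
        distances = [math.dist(center, other)
                     for other_name, other in centers.items() if other_name != name]
        if distances and all(d >= min_distance for d in distances):
            isolated.append(name)
    return isolated


clusters = {
    "first": [(10.0, 10.0), (12.0, 10.0)],
    "second": [(0.0, 0.0), (1.0, 0.0)],
    "third": [(0.0, 1.0), (1.0, 1.0)],
    "fourth": [(1.0, 2.0), (2.0, 2.0)],
}
outliers = isolated_clusters(clusters, min_distance=5.0)
```

Here only the first cluster is far from all three others, so its word sets would be treated as error word sets.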

FIG. 6 is a flowchart illustrating a method of obtaining a compound noun phrase set according to an embodiment of the present invention.

Specifically, the word sets obtained through steps S110 to S150 are single words, but when single words combine into a compound noun phrase, the meaning can change. The server 10 therefore needs to obtain compound noun phrase sets from the single word sets, and may do so based on the plurality of word sets.

In step S510, the server 10 may obtain a plurality of word sets for a sentence included in the patent document.

For example, by analyzing the sentence "머신러닝을 이용한 자연어 처리를 한다" ("natural language processing is performed using machine learning"), the server 10 may obtain the word sets machine, learning, use, natural language, and processing.

In step S520, the server 10 may obtain a plurality of compound noun phrase set candidates formed by combining the plurality of word sets.

For example, from combinations of the obtained word sets, the server 10 may obtain compound noun phrase set candidates such as "machine learning", "learning use", "use natural language", and "natural language processing".

In step S530, the server 10 may obtain the frequency with which a compound noun phrase set identical to each obtained candidate appears in patent documents.

In step S540, the server 10 may determine the candidates whose obtained frequency is at least a preset frequency to be compound noun phrase sets.

For example, the server 10 may obtain information that "machine learning" appears 301 times in total, "learning use" once, "use natural language" 0 times, and "natural language processing" 58 times, and obtain as compound noun phrase sets the candidates whose frequency of appearance is at least the preset frequency.
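Steps S510 to S540 can be sketched as follows. The sketch only pairs adjacent word sets and takes the corpus counts as given; the function names and the threshold of 10 are assumptions, and the counts mirror the figures in the example above.

```python
def adjacent_candidates(words, max_len=2):
    """Join runs of 2..max_len adjacent word sets into candidate phrases (S520)."""
    candidates = []
    for size in range(2, max_len + 1):
        for start in range(len(words) - size + 1):
            candidates.append(" ".join(words[start:start + size]))
    return candidates


def select_compounds(counts, min_count):
    """Keep candidates whose corpus frequency reaches the preset frequency (S540)."""
    return [phrase for phrase, n in counts.items() if n >= min_count]


words = ["machine", "learning", "use", "natural language", "processing"]
candidates = adjacent_candidates(words)
# Frequencies as in the example; in practice they would come from a corpus search (S530).
counts = dict(zip(candidates, [301, 1, 0, 58]))
compounds = select_compounds(counts, min_count=10)
```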

Here, a compound noun phrase set candidate may include order information and spacing information for the word sets it contains.

In one embodiment, a compound noun phrase set is determined as a combination of two or more adjacent word sets, but in the present invention a combination of word sets that are separated from each other may also be obtained as a compound noun phrase set. For example, when obtaining compound noun phrase sets from a sentence containing "복수개의 장치 중 제1 장치, 복수개의 장치 중 제2 장치, 복수개의 장치 중 제3 장치" ("a first device among a plurality of devices, a second device among a plurality of devices, a third device among a plurality of devices"), the server 10 may obtain the three phrases as independent compound noun phrase sets, but may also obtain "복수개의 장치 중 (spaced word 1) 장치" as a single compound noun phrase set. In that case, the compound noun phrase set may include spacing information. In this example, the spacing information indicates that one word lies between the word "중" (among) and the word "장치" (device). According to various embodiments, more than one word may lie between two words.
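One way to picture a spaced compound noun phrase set is as a token pattern with gaps. The sketch below merges equal-length token sequences and records where the gaps fall; the representation (None for a gap, tuples for spacing information) is an assumption made for illustration.

```python
def merge_with_gap(phrases):
    """Merge equal-length token sequences; positions that differ become gaps (None)."""
    length = len(phrases[0])
    if any(len(p) != length for p in phrases):
        raise ValueError("phrases must have the same number of tokens")
    pattern = []
    for i in range(length):
        column = {p[i] for p in phrases}
        pattern.append(phrases[0][i] if len(column) == 1 else None)
    return pattern


def spacing_info(pattern):
    """Return (left word, right word, gap size) for each run of gaps in the pattern."""
    info, i = [], 0
    while i < len(pattern):
        if pattern[i] is None:
            j = i
            while j < len(pattern) and pattern[j] is None:
                j += 1
            left = pattern[i - 1] if i > 0 else None
            right = pattern[j] if j < len(pattern) else None
            info.append((left, right, j - i))
            i = j
        else:
            i += 1
    return info


phrases = [
    ["복수개의", "장치", "중", "제1", "장치"],
    ["복수개의", "장치", "중", "제2", "장치"],
    ["복수개의", "장치", "중", "제3", "장치"],
]
pattern = merge_with_gap(phrases)
gaps = spacing_info(pattern)
```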

In another embodiment, the server 10 may determine the maximum number of word sets that make up a compound noun phrase set and obtain compound noun phrase sets within that number. Specifically, when the sentence from which compound noun phrase sets are obtained contains n words, the number of compound noun phrase sets obtained when the order of the word sets is not changed, and the number obtained when the order is changed, are each given by a formula in n (the two formula images are not reproduced in this text). Thus, the longer the sentence, the more resources the server 10 must spend searching for compound noun phrase sets. Accordingly, the server 10 may determine the maximum number of word sets that make up a compound noun phrase set and obtain compound noun phrase sets within that number. To do so, the server 10 may obtain a plurality of patent documents in the same technical field as the patent document being processed, and use the largest number of word sets found in any compound noun phrase set in those documents as the maximum number of word sets that make up a compound noun phrase set.
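As a rough illustration of why the candidate count grows with sentence length and why a cap helps, the sketch below counts selections of two or more word sets out of n. It assumes that any selection (adjacent or spaced) is a candidate, counted as combinations when order is kept and as permutations when order may change; this counting model is an assumption and may differ from the patent's own formulas.

```python
from math import comb, perm


def candidate_count(n, max_len=None, reorder=False):
    """Number of candidate phrases of 2..max_len word sets drawn from n word sets.

    Assumed model: combinations if order is preserved, permutations otherwise.
    """
    limit = n if max_len is None else min(n, max_len)
    pick = perm if reorder else comb
    return sum(pick(n, k) for k in range(2, limit + 1))
```

Under this model a 10-word sentence already yields `candidate_count(10)` = 1013 order-preserving candidates, while capping phrases at three word sets reduces that to `candidate_count(10, max_len=3)` = 165.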

FIG. 7 is a flowchart illustrating a method of obtaining a compound noun phrase set according to another embodiment of the present invention.

In step S610, when a plurality of word sets satisfy a preset condition, the server 10 may change the positions of those word sets.

In general, when obtaining compound noun phrases from a sentence, there is no need to change the order of the word sets in the sentence. For example, from a sentence beginning "머신러닝을 이용한 ..." ("using machine learning ..."), the server 10 needs to obtain the compound noun phrase set "머신러닝" (machine learning) but not "러닝머신" (treadmill), which results from reversing the word sets. Obtaining "러닝머신" would in fact yield a compound noun phrase set with an entirely different meaning and could degrade the performance the invention aims for.

However, as described for step S610, when a preset condition is satisfied, the server 10 may change the order of the word sets in the sentence to obtain compound noun phrase set candidates. The preset conditions may be: that the word sets are adjacent; that a preset symbol appears between the adjacent word sets; that the right-hand word set of the adjacent word sets is enclosed in parentheses; and that the word sets are in different languages.

For example, when the sentence to be analyzed contains "머신 러닝(Machine Learning)", the server 10 may obtain 머신, 러닝, Machine, and Learning as word sets. The server 10 may further obtain "머신 러닝 Machine Learning" as a compound noun phrase set. In addition, the server 10 may determine that the preset condition is met because parentheses are present, and obtain "Machine Learning 머신 러닝" as a compound noun phrase set. That is, when the compound noun phrase (or word) "머신 러닝" appears together with its counterpart in another language ("Machine Learning"), the server 10 may obtain the two phrases, which are written in different languages but have the same meaning, as a single compound noun phrase, and may also obtain a single compound noun phrase with the order of the two reversed. This makes comparison more efficient when patent documents written in different languages are compared. A patent document written in Korean may use "머신 러닝(Machine Learning)", whereas one written in English may use "Machine Learning(머신 러닝)", so the server 10 may obtain both forms as compound noun phrase sets to improve search efficiency.

Meanwhile, the embodiment above described the preset-symbol condition in terms of parentheses, but it is not limited to them. According to various embodiments, the preset symbol may be a hyphen (-), a slash (/), a tilde (~), a quotation mark (' or "), or the like.

In step S620, the server 10 may determine the reordered word sets as a compound noun phrase set candidate.
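The parenthesis case of steps S610 and S620 can be sketched as below. The sketch assumes the input is just the phrase and its parenthesized counterpart, and it treats "different languages" as "exactly one side contains Hangul"; the regular expression and function names are illustrative assumptions.

```python
import re

# A phrase followed by a parenthesized phrase, e.g. "머신 러닝(Machine Learning)".
PAREN = re.compile(r"([^()]+?)\s*\(([^()]+)\)")


def has_hangul(text):
    """True if the text contains at least one Hangul syllable."""
    return any("가" <= ch <= "힣" for ch in text)


def bilingual_candidates(text):
    """Return both orderings of a phrase and its parenthesized other-language form."""
    candidates = []
    for match in PAREN.finditer(text):
        left, right = match.group(1).strip(), match.group(2).strip()
        if has_hangul(left) != has_hangul(right):   # the two sides differ in language
            candidates.append(f"{left} {right}")
            candidates.append(f"{right} {left}")    # reordered candidate (S620)
    return candidates


korean_style = bilingual_candidates("머신 러닝(Machine Learning)")
english_style = bilingual_candidates("Machine Learning(머신 러닝)")
```

Both document styles produce the same pair of candidates, which is what lets the two forms match in a search.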

FIG. 8 is a flowchart illustrating a method of obtaining a word set according to an embodiment of the present invention.

In step S710, the server 10 may analyze a sentence included in the patent document.

In step S720, when the analyzed sentence has a preset structure, the server 10 may obtain at least one word set from the sentence, excluding the word sets that belong to the preset structure itself.

In one embodiment, the preset structure may be a formulaic structure such as "~는 ~로 정의한다" ("~ is defined as ~"). Specifically, in patent documents, sentences of the form "~는 ~로 정의한다" ("~ is defined as ~"), "이하 ~는 ~라고 한다" ("hereinafter, ~ is referred to as ~"), "~는 ~로 이해될 수 있다" ("~ may be understood as ~"), or "~는 ~라고 서술한다" ("~ is described as ~") may be used to define terms that are not in general use in the technical field. The server 10 may therefore identify sentences that define such terms and obtain the word sets defined in them.

For example, when the sentence to be analyzed is "머신 러닝은 OOO로 서술한다(또는 정의한다)." ("machine learning is described (or defined) as OOO"), the server 10 may determine that the sentence has the preset structure and obtain the word sets 머신 (machine), 러닝 (learning), and OOO.

In step S730, the server 10 may obtain semantic information for the at least one word set.

Specifically, when the obtained word set OOO has semantic information commonly accepted in the technical field, the server 10 may match that semantic information to OOO. However, when OOO is a term newly defined by the user and its semantic information is unclear, the server 10 may match the semantic information for "machine learning" to OOO.

In step S740, the server 10 may obtain the word set corresponding to the semantic information.

However, the invention is not limited to this; when the degree of association between "machine learning" and OOO is at least a preset degree, the server 10 may replace the word set OOO with the word set for "machine learning".
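Steps S720 and S730 can be sketched with a regular expression over the Korean definitional patterns listed above. The pattern, the function names, and the small meaning dictionary are assumptions for illustration; a real system would work on morphological analysis output rather than raw strings.

```python
import re

# "<term>는 <definition>(으)로/라고 정의한다|서술한다|한다|이해될 수 있다"
DEFINITION = re.compile(
    r"^(?:이하\s+)?(?P<term>.+?)[은는]\s+(?P<definition>.+?)"
    r"(?:으로|로|라고)\s*(?:정의한다|서술한다|한다|이해될 수 있다)\.?$"
)


def parse_definition(sentence):
    """Return (term, definition) if the sentence has the preset structure, else None."""
    match = DEFINITION.match(sentence.strip())
    if match is None:
        return None
    return match.group("term"), match.group("definition")


def resolve_meaning(term, definition, meanings):
    """Use the defined word's own meaning if known; otherwise borrow the term's (S730)."""
    known = meanings.get(definition)
    return known if known is not None else meanings.get(term)


meanings = {"머신 러닝": "learning from data"}   # hypothetical semantic dictionary
parsed = parse_definition("머신 러닝은 OOO로 정의한다.")
meaning = resolve_meaning(parsed[0], parsed[1], meanings)
```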

FIG. 9 is a flowchart illustrating a method of obtaining word sets with the template of a patent document excluded, according to an embodiment of the present invention.

Specifically, a patent document is a formalized document and may contain paragraphs that have little to do with the core of the invention but state required elements in a formulaic way. Removing these formulaic paragraphs before obtaining word sets therefore reduces the amount of computation performed by the server 10.

In step S810, the server 10 may obtain information on a plurality of identification items (section headings) and a template included in the patent document.

In one embodiment, the template may be set to paragraphs located at a particular position. For example, the template may be the text running from the sentence on the line after the [발명을 실시하기 위한 구체적인 내용] (Detailed Description for Carrying Out the Invention) identification item of the patent document up to, but not including, the sentence under that identification item in which the term '도 1' (FIG. 1) first appears.

In another embodiment, the template may be the part that the patent document under analysis has in common with other patent documents written by the same drafter. That is, the server 10 may identify the drafter of the patent document, obtain other patent documents by that drafter, compare the obtained patent documents, and obtain as the template any paragraph or identification item that the comparison finds to be similar at or above a preset ratio. Because patent documents written by the same drafter are likely to have similar sentence structures, the server 10 may set the unit of similarity judgment to the paragraph or identification item rather than the sentence.

In another embodiment, the server 10 may obtain the text under preset identification items as the template. For example, the server 10 may obtain as the template the text under identification items such as [도면의 간단한 설명] (Brief Description of the Drawings) and [부호의 설명] (Description of Reference Numerals).

The embodiments above describe how to obtain a template for the usual identification items of a patent document for a domestic application, but the invention is not limited to this. That is, the server 10 may obtain templates for various identification items, such as those of the Korean-language form for PCT applications, those of the English-language form for PCT applications, and those of the forms used for applications in the United States, Japan, China, and Europe.

In step S820, the server 10 may obtain a plurality of sentences excluding the obtained template.
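The position-based template of the first embodiment can be sketched as follows: everything under the detailed-description heading before the first sentence that mentions '도 1' is treated as template and dropped. Sentence splitting is assumed to have been done already, and the function name and sample sentences are illustrative.

```python
def split_template(sentences, marker="도 1"):
    """Split a section into (template, remaining) at the first sentence containing `marker`.

    If the marker never appears, nothing is treated as template.
    Note: a plain substring test also matches '도 10'; a real system would
    need a stricter match.
    """
    for index, sentence in enumerate(sentences):
        if marker in sentence:
            return sentences[:index], sentences[index:]
    return [], sentences


section = [
    "본 발명의 이점 및 특징은 후술되는 실시예를 참조하면 명확해질 것이다.",
    "본 명세서에서 사용된 용어는 실시예들을 설명하기 위한 것이다.",
    "도 1은 본 발명의 일 실시예에 따른 시스템의 구성도이다.",
    "서버(10)는 특허문서를 획득할 수 있다.",
]
template, remaining = split_template(section)
```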

In step S830, the server 10 may assign a weight to each of the plurality of identification items.

In one embodiment, when the patent document has the same form and identification items as this specification, the server 10 may assign a weight to each of the following identification items: title of the invention, technical field, background art, problem to be solved, means for solving the problem, effects of the invention, brief description of the drawings, detailed description for carrying out the invention, description of reference numerals, claims, abstract, representative drawing, and drawings.

The weights may be assigned in various ways. In one embodiment, the server 10 may assign weights in descending order to: technical field, background art, claims, effects of the invention, detailed description for carrying out the invention, description of reference numerals, brief description of the drawings, abstract, problem to be solved, means for solving the problem, and representative drawing.

In another embodiment, the server 10 may obtain an importance score for each word set (or compound noun phrase set) included in the patent document, obtain from these an importance score for each identification item, and assign higher weights to identification items with higher importance scores. The importance score is an index of how important a word set (or compound noun phrase set) is in the patent document; the higher the importance score of a word set (or compound noun phrase set), the more likely it is to be a keyword of that patent document.

The importance score of an identification item may be the sum of the importance scores of the word sets (or compound noun phrase sets) included in the identification item, divided by the number of those word sets (or compound noun phrase sets). The word sets (or compound noun phrase sets) used to obtain this score may be all of those included in the identification item, but are not limited to this. For example, they may be all word sets (or compound noun phrase sets) in the identification item except those in the part determined to be a template.

In step S840, the server 10 may determine the priority of the identification items for morphological analysis based on the weight of each identification item.

In step S850, the server 10 may obtain word sets for the plurality of sentences, with the template excluded, according to the determined priority.

That is, the server 10 may reduce the amount of computation by first analyzing the identification items most likely to contain keywords and then analyzing the less important identification items based on what has already been analyzed. Furthermore, when different semantic information has been matched to the same word set, the server 10 may treat the semantic information obtained from the higher-priority identification item as the correct one.
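The score-based weighting and the resulting analysis order (steps S830 to S850) can be sketched as below. The per-item score is the mean of its word-set scores, as described above; the sample scores, the item names, and the function names are assumptions.

```python
def item_score(word_scores):
    """Importance score of an identification item: mean of its word-set scores."""
    return sum(word_scores) / len(word_scores) if word_scores else 0.0


def analysis_order(items):
    """Identification items sorted from highest to lowest importance score (S840)."""
    return sorted(items, key=lambda name: item_score(items[name]), reverse=True)


def resolve_conflict(meanings_by_item, order):
    """Pick the meaning found in the highest-priority item that has one."""
    for name in order:
        if name in meanings_by_item:
            return meanings_by_item[name]
    return None


items = {
    "abstract": [0.5, 0.5, 0.2],
    "claims": [0.9, 0.7],
    "description of reference numerals": [0.1],
}
order = analysis_order(items)
meaning = resolve_conflict({"abstract": "sense A", "claims": "sense B"}, order)
```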

FIG. 10 is a flowchart illustrating a method of obtaining a word set when a patent document contains an image, according to an embodiment of the present invention.

In step S910, when the patent document contains an image, the server 10 may determine whether an identification item adjacent to the image exists.

In one embodiment, when the patent document contains a [수학식] (Equation) identification item and an image appears below it, the server 10 may determine that the image is an image of an equation. Because an equation is likely to be core content of the patent document, the server 10 may obtain the equation text contained in the image.

In another embodiment, when the patent document contains a [화학식] (Chemical Formula) identification item and an image appears below it, the server 10 may determine that the image is an image of a chemical formula.

In step S920, when an identification item is adjacent to the image, the server 10 may analyze the image to obtain the text it contains. The image analysis may be performed by optical character recognition (OCR), but is not limited to it. For example, when the image is a chemical formula image, the server 10 may analyze the structure of the formula to obtain the chemical formula corresponding to the image.

In step S930, the server 10 may obtain word sets based on the obtained text.

Meanwhile, the server 10 may obtain an importance score for each word set by the method described below. In one embodiment, the server 10 may obtain from the patent document the word set whose importance score is to be calculated. The server 10 may calculate one or more of the following: a first detailed importance of the word set across all patent documents; a second detailed importance of the word set within the patent classification that corresponds to the technical field of the patent document containing the word set; and a third detailed importance for the retrieved patent documents, that is, those among all patent documents that contain the word set.

Specifically, the server 10 may calculate the first detailed importance based on a first appearance ratio and a second appearance ratio. The first appearance ratio is the number of occurrences, across all patent documents, of the word set whose importance score is sought, relative to the total number of word sets in all patent documents. The second appearance ratio is obtained by counting the sentences across all patent documents in which the word set appears, and is the total number of sentences in all patent documents relative to that count.
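The two appearance ratios can be computed as in the sketch below. It covers only the ratios as worded above (occurrences over total word sets, and total sentences over sentences containing the word set); how they are combined into the first detailed importance is not shown here and is not assumed. The corpus layout and the function name are illustrative.

```python
def appearance_ratios(documents, target):
    """Return (first ratio, second ratio) for `target`.

    documents: list of documents, each a list of sentences,
               each sentence a list of word sets (assumed layout).
    first  = occurrences of target / total number of word sets
    second = total number of sentences / sentences containing target
    """
    total_words = occurrences = total_sentences = hit_sentences = 0
    for sentences in documents:
        for words in sentences:
            total_sentences += 1
            total_words += len(words)
            count = words.count(target)
            occurrences += count
            if count:
                hit_sentences += 1
    if hit_sentences == 0:
        raise ValueError("target does not appear in the corpus")
    return occurrences / total_words, total_sentences / hit_sentences


documents = [
    [["machine", "learning", "model"], ["model", "training"]],
    [["image", "sensor"]],
]
first_ratio, second_ratio = appearance_ratios(documents, "model")
```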

구체적으로, 서버(10)는 하기의 수학식 1을 이용하여 제1 세부 중요도를 산출할 수 있다.Specifically, the server 10 may calculate the first detailed importance level using Equation 1 below.

Here, W1 is the first detailed importance, wpw is the number of occurrences of the word set in all patent documents, WPW is the total number of word sets in all patent documents, wps is the number of sentences in all patent documents in which the word set appears, WPS is the total number of sentences in all patent documents, and a1 is an adjustment constant for the second appearance ratio.
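As an illustration only, the two appearance ratios defined above can be computed and combined as in the following Python sketch. Equation 1 itself is not reproduced in this text, so the combination used here (the first appearance ratio multiplied by the logarithm of one plus a1 times the second appearance ratio) is an assumption in the style of TF-IDF weighting, and the function name and the default value of a1 are hypothetical.

```python
import math


def first_detailed_importance(wpw, WPW, wps, WPS, a1=1.0):
    """Sketch of W1 from the counts wpw, WPW, wps, WPS defined in the text.

    ASSUMPTION: the way the two ratios are combined is not taken from the
    source; a TF-IDF-like product is used purely for illustration.
    """
    if WPW <= 0 or WPS <= 0:
        raise ValueError("WPW and WPS must be positive")
    if wpw == 0 or wps == 0:
        # The word set never appears, so it carries no importance.
        return 0.0
    first_ratio = wpw / WPW    # occurrences relative to all word sets
    second_ratio = WPS / wps   # all sentences relative to sentences containing the word set
    return first_ratio * math.log1p(a1 * second_ratio)
```

Under this assumed form, a word set that occurs often but is confined to few sentences scores higher than one spread evenly over most sentences, and a1 controls how strongly the second appearance ratio contributes.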

In another embodiment, the server 10 may calculate the second detailed importance based on a third appearance ratio and a fourth appearance ratio. Here, the third appearance ratio is the number of occurrences, within the patent classification information, of the word set whose importance score is to be obtained, relative to the total number of word sets in the patent classification information. The fourth appearance ratio is the total number of sentences in all patent documents relative to the number of sentences in all patent documents in which the word set appears.

Here, the patent classification information is a code by which patents are classified according to technical field, and may be any one of the IPC (International Patent Classification), the CPC (Cooperative Patent Classification), and the F-Term.

Specifically, the server 10 may calculate the second detailed importance using Equation 2 below.

Here, W2 is the second detailed importance, ipcw is the number of occurrences of the word set in the patent classification information, IPCW is the total number of word sets in the patent classification information, ipcs is the number of sentences in all patent documents in which the word set appears, IPCS is the total number of sentences in all patent documents, and a2 is an adjustment constant for the fourth appearance ratio.
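The counts ipcw, IPCW, ipcs and IPCS can be gathered from a tokenized corpus as in the following Python sketch, which returns only the third and fourth appearance ratios as they are defined in the text. The `PatentDoc` structure, the function name, and the representation of each sentence as a list of word-set strings are assumptions made for illustration; how the ratios are combined in Equation 2 is not shown here.

```python
from dataclasses import dataclass


@dataclass
class PatentDoc:
    # Hypothetical container: one classification code per document and
    # sentences already split into word sets (plain strings here).
    ipc: str
    sentences: list[list[str]]


def appearance_ratios(docs, ipc, target):
    """Return (third_ratio, fourth_ratio) for `target` under classification `ipc`."""
    in_class = [d for d in docs if d.ipc == ipc]
    # ipcw / IPCW are counted only within the given patent classification.
    ipcw = sum(s.count(target) for d in in_class for s in d.sentences)
    IPCW = sum(len(s) for d in in_class for s in d.sentences)
    # ipcs / IPCS are counted over all patent documents, as in the text.
    ipcs = sum(1 for d in docs for s in d.sentences if target in s)
    IPCS = sum(len(d.sentences) for d in docs)
    third_ratio = ipcw / IPCW if IPCW else 0.0
    fourth_ratio = IPCS / ipcs if ipcs else 0.0
    return third_ratio, fourth_ratio


docs = [
    PatentDoc("G06F", [["word", "set", "server"], ["server", "document"]]),
    PatentDoc("H04L", [["network", "server"], ["packet"]]),
]
third, fourth = appearance_ratios(docs, "G06F", "server")
```

In this toy corpus, "server" accounts for two of the five word sets classified under "G06F" and appears in three of the four sentences overall.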

In another embodiment, the server 10 may search all patent documents for searched patent documents that contain the word set for which the importance score is to be calculated, calculate an influence value for each searched patent document based on its reference information, and obtain the third detailed importance based on the calculated influence values.

Specifically, the server 10 may calculate the influence value based on the number of items, among the applicant, inventor, and rights holder given in the reference information of the searched patent document, that are identical to those of other patent documents, and on the number of citations and the number of times cited, which are also reference information.

That is, the server 10 may calculate, as the influence value, the degree to which a searched patent document is related to other patent documents. For example, when a searched patent document is cited by several patent documents, the server 10 may determine that the searched patent document has influenced those patent documents and reflect this in the influence value of the searched patent document.

Furthermore, the server 10 may calculate the average of the influence values calculated for the respective searched patent documents, and use the calculated average as the third detailed importance.
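The averaging step can be sketched in Python as follows. The text names the inputs to the influence value (shared applicant, inventor, and rights-holder items, citations, and times cited) but not how they are combined, so the plain sum used here is an assumption, and `RefInfo`, `influence`, and `third_detailed_importance` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RefInfo:
    shared_party_items: int  # applicant/inventor/rights-holder items shared with other documents
    citations: int           # number of documents this document cites
    cited_by: int            # number of documents citing this document


def influence(ref):
    # ASSUMPTION: an unweighted sum; the source does not specify the combination.
    return ref.shared_party_items + ref.citations + ref.cited_by


def third_detailed_importance(refs):
    """Average influence value over the searched patent documents."""
    if not refs:
        return 0.0
    return float(mean(influence(r) for r in refs))
```

A weighted sum or a normalized score would fit the description equally well; only the final averaging over the searched patent documents is taken directly from the text.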

FIG. 11 is a block diagram of an apparatus according to an embodiment of the present invention.

The processor 102 may include one or more cores (not shown), a graphics processing unit (not shown), and/or a connection path (for example, a bus) for exchanging signals with other components.

The processor 102 according to an embodiment performs the method described with reference to FIGS. 2 to 10 by executing one or more instructions stored in the memory 104.

For example, by executing one or more instructions stored in the memory, the processor 102 may obtain new training data; test the obtained new training data using a trained model; extract, based on the test result, first training data for which the labeled information is obtained with an accuracy equal to or greater than a predetermined first reference value; delete the extracted first training data from the new training data; and retrain the trained model using the new training data from which the extracted training data has been deleted.

Meanwhile, the processor 102 may further include a RAM (Random Access Memory, not shown) and a ROM (Read-Only Memory, not shown) that temporarily and/or permanently store signals (or data) processed inside the processor 102. In addition, the processor 102 may be implemented as a system on chip (SoC) including at least one of the graphics processing unit, the RAM, and the ROM.

The memory 104 may store programs (one or more instructions) for the processing and control of the processor 102. The programs stored in the memory 104 may be divided into a plurality of modules according to function.

The steps of a method or algorithm described in connection with the embodiments of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the technical field to which the present invention pertains.

The components of the present invention may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented through software programming or software elements; similarly, the embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with the various algorithms realized as a combination of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be carried out in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

10: server
20: electronic device

Claims

A method for acquiring a word set of a patent document, the method comprising:
obtaining, by a server, a patent document;
obtaining, by the server, a plurality of word sets by performing morphological analysis on the patent document;
determining, by the server, an error word set among the obtained plurality of word sets;
correcting, by the server, the error word set; and
obtaining, by the server, at least one compound noun phrase set based on a degree of association between the plurality of word sets,
wherein the determining of the error word set comprises:
matching semantic information to each of the obtained plurality of word sets;
obtaining an accuracy score for each of the plurality of word sets based on the matched semantic information and the patent document; and
determining a word set whose accuracy score is equal to or less than a preset value as the error word set.

The method of claim 1,
wherein the obtaining of the plurality of word sets comprises:
obtaining a sentence included in the patent document;
dividing the sentence into syllable units;
matching a part-of-speech tag to the divided syllables; and
obtaining a morpheme included in the sentence based on the divided syllables.

(Deleted)

Claim 4 was abandoned upon payment of a set-up fee.

The method of claim 1,
wherein the correcting of the error word set comprises:
obtaining a plurality of pieces of semantic information for the error word;
obtaining a weight for each of the obtained pieces of semantic information;
identifying, among the plurality of weights, the weight having the highest association with the patent document; and
matching the semantic information corresponding to the weight having the highest association with the patent document to the error word set.

The method of claim 1,
wherein the obtaining of the at least one compound noun phrase set comprises:
obtaining a plurality of word sets for sentences included in the patent document;
obtaining a plurality of compound noun phrase set candidates by combining the plurality of word sets;
obtaining a frequency at which a compound noun phrase set identical to each of the obtained compound noun phrase set candidates is included in a patent document; and
determining a compound noun phrase set candidate whose frequency is equal to or greater than a preset frequency as a compound noun phrase set,
wherein each compound noun phrase set candidate includes order information and spacing information of the word sets included in the compound noun phrase set candidate.

Claim 6 has been abandoned upon payment of a set-up fee.

The method of claim 5,
wherein the obtaining of the compound noun phrase set candidates comprises:
changing the positions of the plurality of word sets when the plurality of word sets satisfy a preset condition; and
determining the plurality of word sets whose positions have been changed as a compound noun phrase set candidate,
wherein the preset condition includes a condition that the plurality of word sets are adjacent to each other, a condition that a predetermined code is included between the adjacent word sets, a condition that the word set positioned on the right among the adjacent word sets includes parentheses, and a condition that the plurality of word sets are in different languages.

Claim 7 was abandoned upon payment of a set-up fee.

The method of claim 1,
wherein the obtaining of the word sets comprises:
analyzing a sentence included in the patent document;
when the sentence has a preset structure, obtaining at least one word set, among the word sets included in the sentence, other than the word set corresponding to the preset structure;
obtaining semantic information of the at least one word set; and
obtaining a word set corresponding to the semantic information.

Claim 8 has been abandoned upon payment of a setup registration fee.

The method of claim 1,
wherein the obtaining of the word sets comprises:
obtaining a plurality of pieces of identification item information and a template included in the patent document;
obtaining a plurality of sentences excluding the obtained template;
assigning a weight to each of the plurality of pieces of identification item information;
determining a priority of the identification items to be morphologically analyzed based on the weights of the plurality of pieces of identification item information; and
obtaining, according to the determined priority, word sets for the plurality of sentences from which the template has been excluded.

Claim 9 was abandoned upon payment of a set-up fee.

The method of claim 1,
wherein the obtaining of the word sets comprises:
when an image is included in the patent document, determining whether an identification item adjacent to the image exists;
when the identification item is adjacent to the image, obtaining text included in the image; and
obtaining a word set based on the obtained text.

An apparatus comprising:
a memory storing one or more instructions; and
a processor executing the one or more instructions stored in the memory,
wherein the processor performs the method of claim 1 by executing the one or more instructions.