KR102085217B1

KR102085217B1 - Method, apparatus and system for determining similarity of patent documents

Info

Publication number: KR102085217B1
Application number: KR1020190127327A
Authority: KR
Inventors: 박상준; 김도언
Original assignee: (주)디앤아이파비스
Priority date: 2019-10-14
Filing date: 2019-10-14
Publication date: 2020-03-04

Abstract

Provided are a method, an apparatus, and a system for determining a similarity of patent documents. A control method of the apparatus for determining a similarity of patent documents comprises the following steps: obtaining, by a server, a target patent document; obtaining, by the server, at least one word on the basis of the target patent document; obtaining, by the server, an importance score of the obtained at least one word; obtaining, by the server, a similar patent document by inputting the target patent document to a first artificial intelligence model; obtaining, by the server, a plurality of sentences included in the target patent document and a plurality of sentences included in the similar patent document; obtaining, by the server, a first sentence among the plurality of sentences included in the target patent document and a second sentence among the plurality of sentences included in the similar patent document; obtaining, by the server, evaluation results with respect to the first sentence and the second sentence by inputting the first sentence and the second sentence to a second artificial intelligence model; and generating, by the server, a prior art research report of the target patent document on the basis of the evaluation results.

Description

Method, apparatus and system for determining similarity of patent documents

The present invention relates to a method, an apparatus, and a system for determining the similarity of patent documents.

With the development of the fourth industry, the value of intellectual property rights is rising. Accordingly, many people are striving to protect their technology and to acquire rights to it, and interest in filing patent applications for technology is growing.

Meanwhile, before a patent application is filed, prior art research is carried out to determine whether the technology can be patented; this research can be performed by searching the various patent documents published in the past.

The most important task in prior art research is determining whether there exists a similar patent document that could negate the inventive step of the target patent document.

However, because the volume of patent documents is vast and time is limited, analyzing every patent document published in the past is practically impossible. In practice, similar patent documents are obtained by methods such as entering search queries, in order to get the best possible result within the time available.

However, obtaining similar patent documents through search queries and the like often depends on the skill of the person performing the prior art research, so stable results are frequently not guaranteed.

Therefore, there is a growing need for a prior art research method that can guarantee stable results with a small investment of time.

Korean Laid-Open Patent Publication No. 10-2018-0110713, October 11, 2018

The problem to be solved by the present invention is to provide a method, an apparatus, and a system for determining the similarity of patent documents.

The problems to be solved by the present invention are not limited to the problem mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problem, a control method of a patent document similarity determination system according to one aspect of the present invention includes: obtaining, by a server, a target patent document; obtaining, by the server, at least one word based on the target patent document; obtaining, by the server, an importance score of the obtained at least one word; obtaining, by the server, a similar patent document by inputting the target patent document into a first artificial intelligence model; obtaining, by the server, a plurality of sentences included in the target patent document and a plurality of sentences included in the similar patent document; obtaining, by the server, a first sentence from among the plurality of sentences included in the target patent document and a second sentence from among the plurality of sentences included in the similar patent document; obtaining, by the server, an evaluation result for the first sentence and the second sentence by inputting the first sentence and the second sentence into a second artificial intelligence model; and generating, by the server, a prior art research report for the target patent document based on the evaluation result.

Here, obtaining the word includes: obtaining, by the server, a plurality of words by performing morphological analysis on the target patent document; determining, by the server, an error word among the obtained plurality of words; correcting, by the server, the error word; and obtaining, by the server, at least one compound noun phrase set based on the degree of association among the plurality of words.

Here, obtaining the importance score includes: obtaining, by the server, from the target patent document, a word for which the importance score is to be calculated; calculating, by the server, one or more of a first detailed importance of the word in all patent documents, a second detailed importance of the word in patent classification information corresponding to technical field information of the target patent document, and a third detailed importance of the searched patent documents, which are those of all the patent documents that contain the word; and calculating, by the server, the importance score of the word based on one or more of the first detailed importance, the second detailed importance, and the third detailed importance.

Here, the control method includes: obtaining, by the server, all patent documents that serve as the basis of a thesaurus; performing, by the server, morphological analysis on each of the patent documents; converting the words included in each of the patent documents into word vectors based on the results of the morphological analysis; calculating, by the server, the similarity between the word vectors; and grouping, by the server, the words corresponding to the word vectors into similar-word groups based on the similarity.

Here, determining the error word includes: matching semantic information to each of the obtained plurality of words; obtaining an accuracy score for each of the plurality of words based on the matched semantic information and the target patent document; and determining a word whose accuracy score is equal to or less than a preset value to be an error word. Correcting the error word includes: obtaining a plurality of pieces of semantic information for the error word; obtaining a plurality of weights, one for each of the obtained pieces of semantic information; obtaining, from among the plurality of weights, the weight having the highest degree of association with the target patent document; and matching the semantic information corresponding to that weight to the error word. Obtaining the at least one compound noun phrase set includes: obtaining a plurality of words for a sentence included in the target patent document; obtaining a plurality of compound noun phrase set candidates formed from combinations of the plurality of words; obtaining the frequency with which a compound noun phrase set identical to each candidate appears in the target patent document; and determining a candidate whose frequency is equal to or greater than a preset frequency to be a compound noun phrase set. Each compound noun phrase set candidate includes order information and spacing information for the words it contains.
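
The compound noun phrase step above can be illustrated with a minimal sketch. It assumes that each sentence has already been reduced to a list of nouns by morphological analysis, that a candidate is an ordered word pair together with the number of words separating them (standing in for the order information and spacing information), and that `min_frequency` and `max_gap` are hypothetical parameters. None of these names come from the specification.

```python
from collections import Counter
from itertools import combinations


def compound_noun_phrase_sets(sentences, min_frequency=2, max_gap=2):
    """Return candidates whose frequency in the document meets the threshold.

    sentences: list of token lists (assumed output of morphological analysis).
    A candidate is (first_word, second_word, gap): the word order is kept and
    gap is the number of tokens between the two words (spacing information).
    """
    counts = Counter()
    for tokens in sentences:
        # combinations() over positions yields i < j, so word order is preserved.
        for i, j in combinations(range(len(tokens)), 2):
            gap = j - i - 1
            if gap <= max_gap:
                counts[(tokens[i], tokens[j], gap)] += 1
    # Keep only candidates that occur at least min_frequency times.
    return {cand: n for cand, n in counts.items() if n >= min_frequency}


example_sentences = [
    ["image", "sensor", "module"],
    ["image", "sensor", "array"],
    ["sensor", "image"],
]
phrase_sets = compound_noun_phrase_sets(example_sentences, min_frequency=2)
```

In this toy input only the adjacent, ordered pair "image" then "sensor" occurs twice, so it is the only candidate kept; the reversed pair in the third sentence counts as a different candidate because order is part of the candidate.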

Here, calculating the detailed importance of the word includes: calculating the first detailed importance based on a first appearance ratio, which is the number of occurrences of the word in all patent documents relative to the total number of words in all patent documents, and a second appearance ratio, which is the number of sentences in all patent documents in which the word appears relative to the total number of sentences in all patent documents; calculating the second detailed importance based on a third appearance ratio, which is the number of occurrences of the word in the patent classification information relative to the total number of words in the patent classification information, and a fourth appearance ratio, which is the number of sentences in all patent documents in which the word appears relative to the total number of sentences in all patent documents; and calculating an influence value for each of the searched patent documents based on reference information of each searched patent document, and calculating the third detailed importance of the searched patent documents using the influence values. The first detailed importance is calculated using Equation 1 below, and the second detailed importance is calculated using Equation 2 below.

<Equation 1>

<Equation 2>

Here, W1 is the first detailed importance, wpw is the number of occurrences of the word in all patent documents, WPW is the total number of words in all patent documents, wps is the number of sentences in all patent documents in which the word appears, WPS is the total number of sentences in all patent documents, a1 is an adjustment constant for the second appearance ratio, W2 is the second detailed importance, ipcw is the number of occurrences of the word in the patent classification information, IPCW is the total number of words in the patent classification information, ipcs is the number of sentences in all patent documents in which the word appears, IPCS is the total number of sentences in all patent documents, and a2 is an adjustment constant for the fourth appearance ratio.
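
The two appearance ratios that feed each detailed importance can be sketched as follows. The bodies of Equation 1 and Equation 2 are not reproduced in this text, so the way the ratios are combined here (the word-level ratio plus the adjustment constant times the sentence-level ratio) is purely an assumption for illustration, and the function names are hypothetical.

```python
def appearance_ratio(count, total):
    """Ratio of an occurrence count to a total; 0.0 when the total is zero."""
    return count / total if total else 0.0


def detailed_importance(word_count, total_words, sentence_count, total_sentences, a):
    """Combine a word-level and a sentence-level appearance ratio.

    ASSUMPTION: a weighted sum is used only as a placeholder; the actual
    Equations 1 and 2 of the specification may combine the ratios differently.
    """
    word_ratio = appearance_ratio(word_count, total_words)              # wpw / WPW or ipcw / IPCW
    sentence_ratio = appearance_ratio(sentence_count, total_sentences)  # wps / WPS or ipcs / IPCS
    return word_ratio + a * sentence_ratio


# Illustrative values only: wpw=50, WPW=10000, wps=40, WPS=1000, a1=0.5
w1 = detailed_importance(50, 10_000, 40, 1_000, a=0.5)
# The same helper applied to classification-level counts: ipcw, IPCW, ipcs, IPCS, a2
w2 = detailed_importance(30, 2_000, 40, 1_000, a=0.5)
```

Under the assumed combination, a word that is proportionally more frequent inside the patent classification than in the whole corpus receives a larger W2 than W1, which is the kind of field-specific signal the second detailed importance is described as capturing.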

Here, obtaining the similar patent document includes: obtaining a plurality of words included in the target patent document; clustering the obtained words to obtain a plurality of target patent clusters; obtaining a center point for each of the target patent clusters and determining the position of the target patent document based on the center points and the number of words included in each target patent cluster; obtaining a plurality of words included in a patent document and clustering them to obtain a plurality of patent clusters; obtaining a center point for each of the patent clusters and determining the position of the patent document based on the center points and the number of words included in each patent cluster; and determining the patent document to be the similar patent document when the position of the target patent document and the position of the patent document are within a preset distance of each other. Obtaining the evaluation result for the first sentence and the second sentence includes: obtaining a similarity score and a dissimilarity score for the first sentence and the second sentence; determining that the first sentence and the second sentence are unrelated sentences when the dissimilarity score is equal to or greater than a preset score; obtaining, when the similarity score is equal to or greater than a preset score, at least one word that is included in the first sentence but not in the second sentence; obtaining an importance score for each such word and determining whether any of them has an importance score equal to or greater than a preset importance score; determining that the first sentence and the second sentence are matching sentences when no such word exists; and determining that the first sentence and the second sentence are non-matching sentences when such a word exists.
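
The sentence evaluation rules above amount to a small decision procedure, sketched below. The similarity and dissimilarity scores are assumed to have been produced already by the second artificial intelligence model and are passed in as plain numbers. The threshold parameters, the label strings, and the "undetermined" fallback for the case the text does not cover (neither score reaches its threshold) are assumptions made for this sketch.

```python
def evaluate_sentence_pair(first_words, second_words, similarity, dissimilarity,
                           importance, similarity_threshold=0.5,
                           dissimilarity_threshold=0.5, importance_threshold=0.7):
    """Classify a (first sentence, second sentence) pair.

    importance: dict mapping a word to its importance score (missing words
    are treated as 0.0, which is an assumption of this sketch).
    """
    if dissimilarity >= dissimilarity_threshold:
        return "unrelated"
    if similarity >= similarity_threshold:
        second = set(second_words)
        # Words of the first sentence that the second sentence lacks.
        missing = [w for w in first_words if w not in second]
        if any(importance.get(w, 0.0) >= importance_threshold for w in missing):
            return "non-matching"
        return "matching"
    return "undetermined"  # case not specified in the text


target = ["lidar", "sensor", "housing"]
prior = ["sensor", "housing"]
result_important = evaluate_sentence_pair(target, prior, 0.9, 0.1, {"lidar": 0.95})
result_trivial = evaluate_sentence_pair(target, prior, 0.9, 0.1, {"lidar": 0.2})
result_unrelated = evaluate_sentence_pair(target, prior, 0.3, 0.8, {"lidar": 0.95})
```

The point of the second check is that two sentences can be textually close and still differ in the one term that matters; a missing word only changes the verdict when its importance score reaches the preset threshold.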

Here, converting into word vectors includes converting the words into word vectors through Word2Vec training based on the patent documents. Calculating the similarity includes calculating the distance between any two of the word vectors and taking the calculated distance as the similarity. Grouping into similar-word groups includes checking whether the similarity between any two of the word vectors is less than a preset reference similarity and, if so, grouping the two words corresponding to those two word vectors into a similar-word group. Obtaining all patent documents includes checking whether any of the obtained patent documents satisfies a noise document condition and removing the patent documents that satisfy it.
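
The grouping rule above (two words are grouped when the distance between their word vectors is below the reference value) can be sketched with toy vectors. The sketch assumes the vectors have already been learned, for example by Word2Vec, uses Euclidean distance as the distance measure, and merges pairs transitively with a union-find structure. The distance metric, the transitive merging, and all names are assumptions; the text only states that the two words of a qualifying pair are grouped.

```python
import math
from itertools import combinations


def group_similar_words(word_vectors, reference_similarity):
    """Group words whose pairwise vector distance is below the reference value.

    word_vectors: dict mapping a word to a tuple of floats (assumed pre-trained).
    Returns a list of sets, one per group with at least two words.
    """
    parent = {w: w for w in word_vectors}

    def find(w):
        # Follow parent links to the group representative, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(word_vectors, 2):
        # The distance itself is used as the "similarity", as in the text.
        if math.dist(word_vectors[a], word_vectors[b]) < reference_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for w in word_vectors:
        groups.setdefault(find(w), set()).add(w)
    return [g for g in groups.values() if len(g) > 1]


toy_vectors = {
    "terminal": (0.0, 0.0),
    "device": (0.1, 0.0),
    "apparatus": (0.15, 0.05),
    "enzyme": (5.0, 5.0),
}
similar_word_groups = group_similar_words(toy_vectors, reference_similarity=0.2)
```

Because the distance is treated as the similarity value, a smaller value means closer words, which is why the comparison is "less than" the reference rather than "greater than".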

Other specific details of the present invention are included in the detailed description and the drawings.

According to the embodiments of the present invention described above, a user can be provided with a prior art research method that guarantees stable results with a small investment of time.

The effects of the present invention are not limited to the effect mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a system diagram according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of obtaining a prior art research report according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of obtaining words according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of calculating an importance score according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of generating a thesaurus according to an embodiment of the present invention.
FIGS. 6A to 6C are flowcharts illustrating in detail a word acquisition method according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating in detail a method of calculating an importance score according to an embodiment of the present invention.
FIGS. 8A and 8B are flowcharts illustrating a method of obtaining a similar patent document and a method of determining similar sentences according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a specific method of generating a thesaurus according to an embodiment of the present invention.
FIG. 10 is an exemplary diagram illustrating a process of deriving an evaluation result using the first artificial intelligence model and the second artificial intelligence model according to an embodiment of the present invention.
FIG. 11 is a block diagram of an apparatus according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those skilled in the art to which the present invention pertains are fully informed of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those mentioned. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical idea of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those skilled in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

The term "unit" or "module" as used in this specification means software or a hardware component such as an FPGA or an ASIC, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside in an addressable storage medium, or may be configured to run on one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as shown in the drawings. Spatially relative terms should be understood as encompassing different orientations of the components in use or operation, in addition to the orientation shown in the drawings. For example, if a component shown in a drawing is turned over, a component described as "below" or "beneath" another component may be placed "above" that other component. Thus, the exemplary term "below" can encompass both the downward and the upward direction. Components may also be oriented in other directions, and spatially relative terms may be interpreted according to the orientation.

In this specification, a computer means any kind of hardware device that includes at least one processor and, depending on the embodiment, may be understood to also encompass the software configuration operating on that hardware device. For example, a computer may be understood to include, without limitation, smartphones, tablet PCs, desktops, and notebooks, as well as the user clients and applications running on each of these devices.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the entity performing each step is not limited to this, and depending on the embodiment, at least some of the steps may be performed on different devices.

FIG. 1 is a system diagram according to an embodiment of the present invention.

The patent document similarity determination system according to the present invention includes a server 10 and an electronic device 20.

The server 10 is a component that obtains a target patent document, obtains a similar patent document based on the obtained target patent document, and determines the similarity between the sentences of the target patent document and the sentences of the similar patent document.

In one embodiment, the server 10 may receive the target patent document from the electronic device 20 or obtain the target patent document from an external server.

In another embodiment, the server 10 may obtain a plurality of words from the target patent document or the similar patent document and obtain importance scores for the obtained words. In addition, the server 10 may obtain a thesaurus from all patent documents and use it in determining the importance scores of the words.

In another embodiment, the server 10 may obtain a similar patent document for the target patent document, obtain similar sentences, and generate a prior art research report.

In this specification, "patent document" is a concept that includes the target patent document and the similar patent document, and may be a document describing technical content that an applicant submits to a national patent office in order to obtain a patent registration. However, it is not limited to this, and a patent document may be understood to include various documents containing technical content, such as invention disclosures prepared for a patent application and academic papers. According to one embodiment, the target patent document may be at least one of an invention disclosure for a patent application and a paper, and the similar patent document may be at least one of an invention disclosure for a patent application, a paper, and a patent application.

The electronic device 20 is a component for providing patent documents to the server 10. The electronic device 20 according to the present invention may be implemented as a smartphone, but this is only one embodiment; it may include at least one of a smartphone, a tablet personal computer (tablet PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a personal digital assistant (PDA), a portable multimedia player (PMP), or a wearable device.

FIG. 2 is a flowchart illustrating a method of obtaining a prior art research report according to an embodiment of the present invention.

단계 S110에서, 서버(10)는, 대상특허문서를 획득할 수 있다.In operation S110, the server 10 may acquire a target patent document.

단계 S120에서, 서버(10)는, 대상특허문서를 바탕으로 적어도 하나의 단어를 획득할 수 있다.In operation S120, the server 10 may obtain at least one word based on the target patent document.

단계 S130에서, 서버(10)는, 획득된 적어도 하나의 단어의 중요도 스코어를 획득할 수 있다.In operation S130, the server 10 may acquire an importance score of the obtained at least one word.

단계 S140에서, 서버(10)는, 대상특허문서를 제1 인공지능 모델에 입력하여 유사특허문서를 획득할 수 있다.In operation S140, the server 10 may obtain a similar patent document by inputting the target patent document into the first artificial intelligence model.

Specifically, the server 10 may obtain a plurality of words from the target patent document or the similar patent document and obtain an importance score for each obtained word. In addition, the server 10 may build a thesaurus from the entire set of patent documents and use it when determining the importance score of a word.

In an embodiment, the server 10 may analyze each of a plurality of patent documents stored in a first database based on a morphological analysis of each document, analyze the target patent document based on a morphological analysis of the target patent document, and then obtain a similar patent document that is highly related to the target patent document.

In step S150, the server 10 may obtain a plurality of sentences included in the target patent document and a plurality of sentences included in the similar patent document.

In step S160, the server 10 may obtain a first sentence from among the plurality of sentences included in the target patent document and a second sentence from among the plurality of sentences included in the similar patent document.

In step S170, the server 10 may input the first sentence and the second sentence into a second artificial intelligence model to obtain an evaluation result for the first sentence and the second sentence.

In step S180, the server 10 may generate a prior art research report on the target patent document based on the evaluation result. Specifically, the prior art research report on the target patent document may be generated based on the evaluation result for the first sentence and the second sentence. In an embodiment, when the evaluation result for the first sentence and the second sentence is obtained in the form of a score, the server 10 may generate the prior art research report based on each first sentence whose evaluation score is equal to or greater than a preset score and the second sentence corresponding to that first sentence.
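The score-threshold embodiment of step S180 can be illustrated with a minimal sketch. It assumes the evaluation result is a single floating-point score per sentence pair; the names `ScoredPair` and `select_pairs_for_report` and the threshold value 0.8 are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class ScoredPair:
    first_sentence: str   # sentence taken from the target patent document
    second_sentence: str  # sentence taken from the similar patent document
    score: float          # evaluation result (assumed to be a single score)


def select_pairs_for_report(pairs, min_score):
    """Keep only the pairs whose score is at or above the preset score."""
    return [p for p in pairs if p.score >= min_score]


# Hypothetical example data; 0.8 is an assumed preset score.
pairs = [
    ScoredPair("target sentence A", "similar sentence X", 0.91),
    ScoredPair("target sentence B", "similar sentence Y", 0.42),
]
selected = select_pairs_for_report(pairs, min_score=0.8)
```

Only the pairs that survive this filter would then be passed on to whatever component assembles the report.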

Meanwhile, the first artificial intelligence model may be a model based on a convolutional neural network (CNN) trained by inputting a plurality of patent documents as training data, and the second artificial intelligence model may be a Bi-LSTM-based model trained on a plurality of patent documents, two patent documents among the plurality of patent documents, and a prior art research report on those two patent documents. However, the models are not limited thereto, and various artificial intelligence models may of course be applied to the present invention. For example, models such as a deep neural network (DNN), a recurrent neural network (RNN), and a bidirectional recurrent deep neural network (BRDNN) may be used as the artificial intelligence model, but the models are not limited to these.

Here, a convolutional neural network (CNN) is a type of multilayer perceptron designed to use minimal preprocessing. A convolutional neural network consists of one or more convolutional layers with general artificial neural network layers placed on top of them, and additionally uses weights and pooling layers. Thanks to this structure, a convolutional neural network can make full use of input data having a two-dimensional structure. In addition, a convolutional neural network can be trained through standard backpropagation. A convolutional neural network tends to be easier to train than other feedforward artificial neural network techniques and has the advantage of using fewer parameters. A convolutional neural network extracts features from an input image by alternately performing convolution and subsampling on the input image.

A convolutional neural network includes several convolution layers, several subsampling layers (also called local pooling layers or max-pooling layers), and a fully connected layer. A convolution layer performs convolution on an input image. A subsampling layer locally extracts the maximum value from the input image and maps it to a two-dimensional image, enlarging the local region and performing subsampling.

A convolution layer requires information such as the kernel size, the number of kernels to be used (that is, the number of maps to be generated), and the weight table to be applied during the convolution operation. Take, as an example, the case where the input image is 32×32, the kernel is 5×5, and 20 kernels are used. In this case, when a 5×5 kernel is applied to a 32×32 input image, the kernel cannot be applied to the two outermost pixels on each of the top, bottom, left, and right sides of the input image. This is because, when the kernel is placed over the input image and convolution is performed, the resulting value, '-8', is assigned to the pixel, among the pixels of the input image covered by the kernel, that corresponds to the center element of the kernel. Therefore, performing convolution on a 32×32 input image with a 5×5 kernel generates a 28×28 map. Since a total of 20 kernels was assumed above, a total of 20 maps of size 28×28 are generated in the first convolution layer.
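The map-size arithmetic in the example above can be checked with a short sketch. It assumes a stride of 1 and no padding, which is what the 32×32 to 28×28 example implies; the function names are hypothetical.

```python
def conv_output_size(input_size: int, kernel_size: int) -> int:
    """Side length of the map produced by a stride-1 convolution without padding."""
    if kernel_size > input_size:
        raise ValueError("kernel must not be larger than the input")
    return input_size - kernel_size + 1


def border_width(kernel_size: int) -> int:
    """Number of edge pixels per side on which an odd-sized kernel cannot be centered."""
    return (kernel_size - 1) // 2


side = conv_output_size(32, 5)         # 32 - 5 + 1
lost_per_side = border_width(5)        # (5 - 1) // 2
num_kernels = 20
maps_shape = (num_kernels, side, side)  # one map per kernel
```

With these inputs, `side` is 28, `lost_per_side` is 2, and the first convolution layer yields 20 maps of 28×28, matching the figures given in the text.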

A subsampling layer requires information on the size of the kernel used for subsampling and on whether the maximum or the minimum value is to be selected from among the values in the kernel region.

In addition, a deep neural network (DNN) is an artificial neural network (ANN) having a plurality of hidden layers between an input layer and an output layer.

Here, the structure of a deep neural network may be built from perceptrons. A perceptron consists of several input values, one processor, and one output value. The processor multiplies each input value by its weight and then sums all of the weighted input values. The processor then feeds the sum into an activation function and outputs a single output value. If a particular value is desired as the output of the activation function, the weights applied to the input values can be modified and the output value recalculated with the modified weights. Each perceptron may use a different activation function. In addition, each perceptron takes the outputs passed from the previous layer as its inputs and then computes its output using the activation function. The computed output is passed on as input to the next layer. Through this process, a number of output values are finally obtained.
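The perceptron computation described above (weighted sum followed by an activation function, with each layer feeding the next) can be sketched as follows. This is an illustration only: the sigmoid activation, the weight values, and the function names are assumptions, and no bias term is included because the text does not mention one.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def perceptron(inputs, weights, activation=sigmoid):
    """Multiply each input by its weight, sum the products, apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


def layer(inputs, weight_rows, activation=sigmoid):
    """One layer: every perceptron receives the same inputs from the previous layer."""
    return [perceptron(inputs, row, activation) for row in weight_rows]


# Two inputs -> hidden layer of two perceptrons -> one output perceptron.
hidden = layer([1.0, 0.5], [[0.2, -0.4], [0.7, 0.1]])
output = layer(hidden, [[1.0, -1.0]])
```

Training would adjust the weight rows until the outputs approach the desired values; that step is not shown here.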

Returning to the description of deep learning techniques, a recurrent neural network (RNN) is a neural network in which the connections between the units of the artificial neural network form a directed cycle. Unlike a feedforward neural network, a recurrent neural network can use memory inside the network to process arbitrary inputs.

A deep belief network (DBN) is a generative graphical model used in machine learning; in deep learning, it refers to a deep neural network composed of multiple layers of latent variables. Its characteristic feature is that there are connections between layers but no connections between units within a layer.

Because a deep belief network is a generative model, it can be used for pre-training: after initial weights are learned through pre-training, the weights can be fine-tuned through backpropagation or another discriminative algorithm. This property is very useful when training data is scarce, because the less training data there is, the stronger the influence of the initial weights on the resulting model. Pre-trained initial weights are closer to the optimal weights than arbitrarily set initial weights, which improves both the performance and the speed of the fine-tuning stage.

The above description of artificial intelligence and its training methods is given for illustration, and the artificial intelligence and training methods used in the embodiments described below are not limited thereto. For example, any kind of artificial intelligence technique and training method that a person skilled in the art could apply to solve the same problem may be used to implement a system according to the disclosed embodiments.

FIG. 3 is a flowchart illustrating a method of obtaining words according to an embodiment of the present invention.

In step S210, the server 10 may obtain a plurality of words by performing morphological analysis on the target patent document.

In an embodiment, the server 10 may perform the morphological analysis of the target patent document using the Mecab morphological analyzer. However, the analysis is not limited thereto, and various other morphological analyzers such as Okt, Komoran, Hannanum, and Kkma may be used as the case requires. Furthermore, the server 10 may use different morphological analyzers depending on the language of the target patent document to be analyzed. Although the present invention describes morphological analysis of the target patent document, the server 10 may of course perform morphological analysis on various other patent documents as well.

In step S220, the server 10 may determine which of the obtained plurality of words are error words.

In an embodiment, when a word among the obtained plurality of words does not satisfy a preset condition, the server 10 may determine that word to be an error word.

In step S230, the server 10 may correct the error word.

In step S240, the server 10 may obtain at least one set of compound noun phrases based on the degree of association between the plurality of words.

FIG. 4 is a flowchart illustrating a method of calculating an importance score according to an embodiment of the present invention.

In step S310, the server 10 may obtain, from the target patent document, a word for which an importance score is to be calculated.

In step S320, the server 10 may calculate at least one of: a first detailed importance of the word in the entire set of patent documents; a second detailed importance of the word in the patent classification information corresponding to the technical field information of the target patent document; and a third detailed importance of the search patent documents, among the entire set of patent documents, that contain the word.

In step S330, the server 10 may calculate the importance score of the word based on at least one of the first detailed importance, the second detailed importance, and the third detailed importance.

Specifically, the server 10 may perform morphological analysis on the target patent document and extract only the nouns. Note that the type of morphological analysis is not limited, as long as the server 10 extracts nouns from the target patent document.

Subsequently, the server 10 may determine the technical field information of the target patent document based on the extracted nouns. Here, the server 10 may compare pre-stored word data for each item of technical field information with the extracted nouns and, as a result of the comparison, determine the technical field information that contains the largest number of the extracted nouns to be the technical field information of the target patent document.
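The selection of the technical field can be sketched as picking the field whose stored word data overlaps most with the extracted nouns. This is a minimal sketch under assumptions: the field names, the vocabularies, and the choice to count distinct nouns (rather than every occurrence) are illustrative and not specified in the text.

```python
def determine_technical_field(extracted_nouns, field_vocabularies):
    """Return the field whose word data contains the most extracted nouns.

    extracted_nouns: iterable of nouns from the target patent document.
    field_vocabularies: dict mapping a field name to its set of stored words.
    Ties are resolved in favor of the field listed first (an assumption).
    """
    if not field_vocabularies:
        raise ValueError("no technical field data available")
    nouns = set(extracted_nouns)
    return max(field_vocabularies, key=lambda f: len(nouns & field_vocabularies[f]))


# Hypothetical pre-stored word data per technical field.
field_vocabularies = {
    "IT": {"server", "network", "model"},
    "chemistry": {"compound", "catalyst"},
}
field = determine_technical_field(["server", "model", "catalyst"], field_vocabularies)
```

Here two of the nouns fall in the "IT" vocabulary and one in "chemistry", so `field` is "IT".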

Thereafter, the server 10 may exclude words designated as stopwords, among the words included in the target patent document, from the words for which an importance score is to be calculated. Specifically, the server 10 may read the stopword data for the technical field information corresponding to the target patent document and exclude, from the words extracted from the target patent document, any word contained in that stopword data.

Here, a stopword may mean a word that is used frequently in the relevant technical field but carries no technical meaning. For example, in the IT field, "computer" is used frequently but carries no technical meaning specific to IT technology, and may therefore be defined as a stopword.

Accordingly, the server 10 may obtain, as the words for which an importance score is to be calculated, those words extracted from the target patent document that are not stopwords.
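The field-specific stopword filtering described above can be sketched as a simple lookup and filter. The dictionary layout and the name `select_scoring_targets` are assumptions made for illustration; only the "computer" example comes from the text.

```python
def select_scoring_targets(extracted_words, technical_field, stopwords_by_field):
    """Drop words that are stopwords for the given technical field."""
    stopwords = stopwords_by_field.get(technical_field, set())
    return [w for w in extracted_words if w not in stopwords]


# Hypothetical per-field stopword data.
stopwords_by_field = {"IT": {"computer"}}
targets = select_scoring_targets(
    ["computer", "encoder", "computer", "tokenizer"], "IT", stopwords_by_field
)
```

The remaining words in `targets` are the ones that would proceed to importance scoring.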

FIG. 5 is a flowchart illustrating a method of generating a thesaurus according to an embodiment of the present invention.

In step S410, the server 10 may obtain the entire set of patent documents on which the thesaurus is based. Specifically, the server 10 may build up the entire set of patent documents by obtaining, in real time, patent documents as they are published over time. Here, the server 10 may receive the most recently published patent documents from an external server or the electronic device 20 through communication and store them in the memory 104.

In an embodiment, the server 10 may check whether any patent document in the obtained set satisfies a noise document condition. Specifically, the server 10 may remove patent documents that satisfy the noise document condition from the entire set of patent documents. Here, the noise document condition is used to check whether a patent document is a noise document, and is satisfied when a patent document contains a blank of at least a preset reference size, when the same character is repeated at least a preset number of times, or when the ratio of text to images in the patent document is equal to or greater than a preset reference ratio.

That is, the server 10 may check each document in the entire set of patent documents for noise and remove the noise documents that satisfy the noise document condition.
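Two of the noise document conditions can be sketched with regular expressions. This is a rough illustration under assumptions: the thresholds (50 and 20) and the function name are hypothetical, a "blank" is taken to mean a run of whitespace characters, and the text-to-image ratio condition is omitted because the text does not say how that ratio is measured.

```python
import re


def is_noise_document(text, min_blank_run=50, min_char_repeat=20):
    """Return True if the text has a long blank run or a long repeated character."""
    # A run of at least `min_blank_run` whitespace characters.
    blank_pattern = r"\s{" + str(min_blank_run) + r",}"
    # One non-space character followed by at least (min_char_repeat - 1) copies of itself.
    repeat_pattern = r"(\S)\1{" + str(min_char_repeat - 1) + r",}"
    has_long_blank = re.search(blank_pattern, text) is not None
    has_long_repeat = re.search(repeat_pattern, text) is not None
    return has_long_blank or has_long_repeat


documents = ["A method for comparing sentences.", "-" * 40, "x" + " " * 60 + "y"]
clean_documents = [d for d in documents if not is_noise_document(d)]
```

Only the first document survives the filter in this example; the other two trip the repeated-character and blank-run checks respectively.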

In step S420, the server 10 may perform morphological analysis on each document in the entire set of patent documents.

In step S430, the server 10 may convert the words contained in each patent document into word vectors based on the result of the morphological analysis.

Here, the server 10 may convert a word into a word vector by projecting a plurality of lexical characteristics of the word into a multidimensional real space. In an embodiment, the server 10 may convert words into word vectors using Word2vec training.

In addition, the server 10 may represent a word in a vector space of roughly 200 to 300 dimensions, and for training may use the CBOW (Continuous Bag of Words) model, which predicts a target word from the direction of meaning formed by its surrounding words, and the Skip-gram model, which predicts the words likely to appear around a given word.

Here, the distance between two word vectors may indicate the similarity between the words corresponding to the two word vectors, and the direction of a word vector may indicate its meaning within the patent document.

In step S440, the server 10 may calculate the similarity between word vectors.

Specifically, the server 10 may calculate the cosine similarity between two word vectors, or normalize the cosine similarity, and use the result as the similarity between the two word vectors.

For example, the server 10 may use the cosine of the angle between two word vectors in real space as the similarity between the two word vectors. In addition, the server 10 may normalize the cosine similarity between two word vectors so that it falls in the range from 0 to 1 and use the normalized value as the similarity between the two word vectors.
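The cosine similarity and its normalization to the range 0 to 1 can be sketched as follows. The text does not state which normalization is used; the mapping (cos + 1) / 2 below is one common choice and is an assumption, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length real-valued vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def normalized_similarity(u, v):
    """Map the cosine from [-1, 1] to [0, 1] (assumed normalization)."""
    return (cosine_similarity(u, v) + 1.0) / 2.0


same_direction = cosine_similarity([3.0, 4.0], [3.0, 4.0])
orthogonal = normalized_similarity([1.0, 0.0], [0.0, 1.0])
```

In practice the vectors would be the 200 to 300 dimensional word vectors described earlier rather than the two-dimensional toy vectors used here.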

That is, the server 10 may convert all words contained in the entire set of patent documents into word vectors, calculate the distance between every pair of converted word vectors as a cosine similarity, and use the calculated distance as the similarity.

Meanwhile, according to various embodiments of the present invention, the server 10 may perform a post-processing step of the method of generating a thesaurus of patent documents.

Specifically, the server 10 may check, based on the similarity, whether the word corresponding to a word vector belongs to a similar-word group. That is, the server 10 may check whether the similarity between any two word vectors is less than a preset reference similarity and, if it is, determine that the words corresponding to the two word vectors belong to the same similar-word group.

Thereafter, the server 10 may group the words corresponding to the word vectors into similar-word groups based on the similarity.

Specifically, when the words corresponding to two word vectors are determined to belong to the same similar-word group, the server 10 may add the two words to that similar-word group.

Meanwhile, even if the similarity between two word vectors is less than the preset reference similarity, the server 10 may decline to add the corresponding words to a similar-word group when the similarity with a word vector of a word already in that group exceeds a preset maximum similarity.

In this way, the server 10 can prevent a similar-word group from growing to the point where the similarity between some pair of its words is diminished.

FIGS. 6A to 6C are flowcharts illustrating the word acquisition method in detail according to an embodiment of the present invention.

Specifically, as shown in FIG. 6A, in step S505, the server 10 may match semantic information to each of the obtained plurality of words.

Here, semantic information may mean the intent of a word. The server 10 may perform natural language processing using an artificial intelligence model in order to obtain the semantic information of a word.

Specifically, the artificial intelligence model includes a natural language understanding unit. Based on the result of sentence analysis, the natural language understanding unit can identify entities and the intent of the words contained in a sentence. Furthermore, the natural language understanding unit can interpret a sentence by analyzing its structure and main constituents and can perform sentence analysis using statistics and similar techniques.

In an embodiment, the server 10 may analyze a sentence containing the Korean word "사과" (sagwa), which can mean either "apple" or "apology", to obtain semantic information about that word. For example, assume that the word obtained through sentence analysis is "사과". The semantic information for "사과" may then indicate a kind of fruit, as a noun, or it may indicate the act of acknowledging one's fault to another person, as a verb. From among the plurality of items of semantic information for "사과", the server 10 may match to the word the semantic information judged to fit the sentence.

In step S510, the server 10 may obtain an accuracy score for each of the plurality of words based on the matched semantic information and the target patent document.

In this case, the server 10 obtains the semantic information of a word from a single sentence, and semantic information obtained in this way may be inaccurate. Therefore, the server 10 may obtain an accuracy score for the semantic information matched to the word based on the whole patent document that contains the sentence.

That is, the server 10 may obtain the accuracy score for the semantic information of a word obtained through sentence analysis based on the semantic information of the same word as it appears throughout the patent document.

For example, when the semantic information obtained by analyzing a sentence containing "사과" indicates the fruit (apple), but the semantic information of "사과" as found throughout the patent document containing that sentence indicates an apology, the server 10 may assign a low accuracy score to the previously matched semantic information.

In step S515, the server 10 may determine a word whose accuracy score is equal to or less than a preset value to be an error word.

That is, when the accuracy score is equal to or less than the preset value, the server 10 may determine that the semantic information matched to the word was matched incorrectly and treat the word as an error word.

Meanwhile, as shown in FIG. 6B, in step S520, the server 10 may obtain a plurality of items of semantic information for the error word. Specifically, the server 10 may obtain the plurality of items of semantic information that the error word can have from a plurality of patent documents.

In step S525, the server 10 may obtain a weight for each of the obtained items of semantic information.

In an embodiment, the server 10 may obtain the weights based on the target patent document containing the error word. Specifically, the server 10 may obtain at least one sentence containing the error word from the entire set of patent documents and obtain semantic information for each word in those sentences that is identical to the error word. The server 10 may then obtain the weights for the plurality of items of semantic information based on the semantic information of the at least one word identical to the error word.

In step S530, the server 10 may obtain, from among the plurality of weights, the weight having the highest association with the target patent document.

In step S535, the server 10 may match to the error word the semantic information corresponding to the weight having the highest association with the target patent document. That is, the server 10 may determine the semantic information with the largest weight, among the plurality of items of semantic information for the error word, to be the semantic information of the error word, and thereby correct the error word.
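Steps S520 to S535 can be illustrated with a minimal sketch. The text does not define how the weights are computed; the sketch assumes, purely for illustration, that each sense of the error word is weighted by the share of occurrences of the same word that carry that sense. The function names and sense labels are hypothetical.

```python
from collections import Counter


def sense_weights(observed_senses):
    """Weight each sense by its share of the observed occurrences (assumed scheme)."""
    if not observed_senses:
        raise ValueError("no occurrences of the error word were found")
    counts = Counter(observed_senses)
    total = sum(counts.values())
    return {sense: count / total for sense, count in counts.items()}


def correct_error_word(observed_senses):
    """Return the sense with the largest weight, to be re-matched to the error word."""
    weights = sense_weights(observed_senses)
    return max(weights, key=weights.get)


# Senses assigned to the same word in other sentences (hypothetical labels).
observed = ["fruit", "apology", "apology"]
weights = sense_weights(observed)
corrected_sense = correct_error_word(observed)
```

In this example "apology" accounts for two of the three occurrences, so it receives the largest weight and replaces the incorrectly matched sense.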

Meanwhile, according to various embodiments of the present invention, an error word is not limited to a word matched with incorrect semantic information; it may of course also be a word that was parsed incorrectly because of an error in morphological analysis. In one embodiment, when the sentence from which words are obtained is "머신 러닝을 이용한 자연어 처리를 한다" ("natural language processing is performed using machine learning"), the server 10 may obtain 머신 (machine), 러닝 (learning), 이용 (use), 자연어 (natural language), and 처리 (processing) as words.

In this case, an error in morphological analysis may cause the words to be obtained as 머신, 러닝, 을이용, 자연어, and 처리, where the object particle '을' has been wrongly attached to '이용'. The server 10 may then determine '을이용' to be an error word and correct it.

Specifically, after obtaining semantic information for each obtained word, the server 10 may determine whether the word is an error word based on the degree of association between the semantic information and the target patent document. In one embodiment, when an obtained word appears at least a preset number of times in the target patent document that contains it, the server 10 may determine that the word is free of error. In another embodiment, when a word appears at least a preset number of times across all patent documents, the server 10 may regard it as a commonly used word and determine that it is not an error word. In another embodiment, when the server 10 cannot find semantic information for a word, it may determine the word to be an error word. In another embodiment, when the semantic information matched to a word is dissimilar to the target patent document containing the word, the server 10 may determine the word to be an error word.

Specifically, the server 10 may obtain the words contained in the entire target patent document, and match and store semantic information for each obtained word. The server 10 may cluster the matched words and semantic information to obtain a plurality of word clusters, each grouping words with related semantic information. Among the plurality of word clusters, the server 10 may obtain any word cluster containing no more than a preset number of words. Here, the preset number refers not to the number of occurrences of the same word in the target patent document but to the number of words having different forms. For example, even if the word '을이용' is found multiple times in the target patent document, '을이용' is the same word each time and is therefore counted as one word. The server 10 may determine the words contained in a cluster obtained in this step to be error words.

However, the present invention is not limited thereto, and the server 10 may of course determine, as error words, the words contained in a cluster whose distance from the other clusters is at least a preset distance. Specifically, the server 10 may obtain the center points of the plurality of clusters and determine the inter-cluster distances based on the obtained center points. For example, suppose the obtained word clusters are a first through a fourth word cluster. In this case, the server 10 may obtain the distance between the first and second word clusters, the distance between the first and third word clusters, and the distance between the first and fourth word clusters, and, when all three distances are at least the preset distance, determine the words in the first word cluster to be error words.
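The centroid-distance check can be sketched as follows. This is a minimal illustration only: it assumes each word has already been mapped to a numeric vector and assigned to a cluster, and it uses Euclidean distance between center points, neither of which is fixed by the specification. The names `centroid` and `isolated_clusters` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def isolated_clusters(clusters, min_distance):
    """Names of clusters whose center is at least min_distance away
    from the center of every other cluster."""
    centers = {name: centroid(vecs) for name, vecs in clusters.items()}
    flagged = []
    for name, center in centers.items():
        distances = [
            math.dist(center, other_center)
            for other, other_center in centers.items()
            if other != name
        ]
        if distances and all(d >= min_distance for d in distances):
            flagged.append(name)
    return flagged


clusters = {
    "c1": [[10.0, 10.0], [11.0, 11.0]],
    "c2": [[0.0, 0.0], [1.0, 0.0]],
    "c3": [[0.0, 1.0], [1.0, 1.0]],
    "c4": [[2.0, 0.0], [2.0, 1.0]],
}
error_clusters = isolated_clusters(clusters, min_distance=5.0)
```

In this toy data only the first cluster lies far from all three others, so its words would be the candidates for error words.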

Meanwhile, as shown in FIG. 6C, in step S540, the server 10 may obtain a plurality of words from a sentence contained in the target patent document.

Specifically, when single words combine to form a compound noun phrase, the meaning may change. The server 10 therefore needs to obtain compound noun phrase sets from single words, and may obtain a compound noun phrase set based on the plurality of words.

For example, by analyzing the sentence "머신러닝을 이용한 자연어 처리를 한다" ("natural language processing is performed using machine learning"), the server 10 may obtain the words 머신 (machine), 러닝 (learning), 이용 (use), 자연어 (natural language), and 처리 (processing).

In step S545, the server 10 may obtain a plurality of compound noun phrase set candidates formed by combining the plurality of words.

For example, from combinations of the obtained words, the server 10 may obtain compound noun phrase set candidates such as 머신러닝 (machine learning), 러닝이용 (learning use), 이용자언어 (user language), and 자연어처리 (natural language processing).

In step S550, the server 10 may obtain the frequency with which each of the obtained compound noun phrase set candidates appears in the target patent document.

In step S555, the server 10 may determine a compound noun phrase set candidate whose frequency is at least a preset frequency to be a compound noun phrase set.

For example, the server 10 may obtain information that 머신러닝 (machine learning) is found 301 times in total, 러닝이용 (learning use) once, 이용자언어 (user language) 0 times, and 자연어처리 (natural language processing) 58 times, and may obtain the candidates whose frequency of occurrence is at least the preset frequency as compound noun phrase sets.
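Steps S540 to S555 can be illustrated with the following sketch, which is limited to candidates made of two adjacent words and uses English tokens for readability. The function name `compound_phrases`, the tokenized input format, and the threshold value are assumptions made for the example.

```python
from collections import Counter


def compound_phrases(target_words, document_sentences, min_frequency):
    """Keep adjacent word pairs from target_words that occur at least
    min_frequency times in the tokenized document."""
    # S545: candidates are pairs of adjacent words, in sentence order.
    candidates = list(zip(target_words, target_words[1:]))
    # S550: count how often each adjacent pair occurs in the document.
    counts = Counter()
    for sentence in document_sentences:
        counts.update(zip(sentence, sentence[1:]))
    # S555: keep only candidates that reach the preset frequency.
    return {c: counts[c] for c in candidates if counts[c] >= min_frequency}


words = ["machine", "learning", "use", "language", "processing"]
document = [
    ["machine", "learning", "model"],
    ["machine", "learning", "use"],
    ["language", "processing", "task"],
    ["language", "processing"],
]
phrases = compound_phrases(words, document, min_frequency=2)
```

Here "machine learning" and "language processing" each occur twice and are kept, while "learning use" and "use language" fall below the threshold.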

Here, a compound noun phrase set candidate may include order information and gap information for the words it contains.

In one embodiment, a compound noun phrase set is determined as a combination of two or more adjacent words, but in the present invention a combination of words that are separated from one another may also be obtained as a compound noun phrase set. For example, when obtaining compound noun phrase sets from a sentence containing "복수개의 장치 중 제1 장치, 복수개의 장치 중 제2 장치, 복수개의 장치 중 제3 장치" ("a first device among a plurality of devices, a second device among the plurality of devices, a third device among the plurality of devices"), the server 10 may obtain each of the three phrases as an independent compound noun phrase set, but may of course also obtain "복수개의 장치 중 (gap word 1) 장치" as a single compound noun phrase set. In that case, the compound noun phrase set may include gap information. In the above example, the gap information may indicate that one word lies between the word "중" ("among") and the word "장치" ("device"). According to various embodiments, a plurality of words may of course lie between one word and the next.
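One way to picture a compound noun phrase set with gap information is as a token pattern in which some positions are left open. The sketch below is illustrative only: it represents a one-word gap with `None`, uses English tokens in Korean word order, and the name `find_gapped` is hypothetical.

```python
def find_gapped(tokens, pattern):
    """Return, for each match, the words that filled the gap positions.

    pattern: list of words, with None marking a one-word gap.
    """
    hits = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(p is None or p == w for p, w in zip(pattern, window)):
            hits.append([w for p, w in zip(pattern, window) if p is None])
    return hits


tokens = (
    "plurality devices among first device "
    "plurality devices among second device"
).split()
pattern = ["plurality", "devices", "among", None, "device"]
gap_words = find_gapped(tokens, pattern)
```

A gap of several words could be represented by repeating `None`, which corresponds to the case in which a plurality of words lie between two words of the set.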

In another embodiment, the server 10 may determine a maximum number of words that can make up a compound noun phrase set, and obtain compound noun phrase sets within that number. Specifically, when the sentence from which compound noun phrase sets are obtained contains n words, the number of compound noun phrase sets obtained when the order of the words is not changed, and the number obtained when the order is changed, are each given by an expression in n (both expressions are images in the original and are not reproduced in this text). Consequently, the longer the sentence, the more resources the server 10 must spend searching for compound noun phrase sets. The server 10 may therefore determine the maximum number of words making up a compound noun phrase set and obtain compound noun phrase sets within that number. In doing so, the server 10 may obtain a plurality of patent documents in the same technical field as the target patent document from which words are obtained, and set the largest number of words found in any compound noun phrase set in those patent documents as the maximum number of words making up a compound noun phrase set.
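The growth in the number of candidates can be illustrated numerically. The specification's own expressions are not available in this text, so the sketch below rests on a stated assumption: a candidate is any selection of at least two of the n words, counted as a combination when word order is preserved and as a permutation when it is not. The function name `candidate_counts` and the parameter `max_len` are hypothetical.

```python
from math import comb, perm


def candidate_counts(n, max_len=None):
    """Count candidates of 2..max_len words drawn from n words.

    Returns (order_preserved, order_changed) under the assumption
    described above; this is not the specification's formula.
    """
    upper = n if max_len is None else min(n, max_len)
    order_preserved = sum(comb(n, k) for k in range(2, upper + 1))
    order_changed = sum(perm(n, k) for k in range(2, upper + 1))
    return order_preserved, order_changed


uncapped = candidate_counts(5)
capped = candidate_counts(5, max_len=3)
```

Under this assumption, capping the phrase length at three words for a five-word sentence reduces the order-changed count from 320 to 80, which is the kind of saving the maximum word count is intended to provide.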

Meanwhile, the method of obtaining compound noun phrase sets according to the present invention is not limited to the method described above, and compound noun phrase sets may of course be obtained by various methods.

In one embodiment, when a plurality of words satisfy a preset condition, the server 10 may change the positions of the plurality of words.

In general, when obtaining compound noun phrases from a sentence, there is no need to change the order of the words in the sentence. For example, from a sentence such as "머신러닝을 이용한~~" ("using machine learning ..."), the server 10 needs to obtain the compound noun phrase set "머신러닝" ("machine learning"), but does not need to obtain the reordered compound noun phrase set "러닝머신" ("treadmill"). On the contrary, obtaining "러닝머신" would yield a compound noun phrase set with an entirely different meaning and could degrade the performance the present invention aims to achieve.

However, when a preset condition is satisfied, the server 10 may change the order of the words in the sentence to obtain compound noun phrase set candidates. Here, the preset condition may be a condition that the plurality of words are adjacent, a condition that a preset symbol is included between the adjacent words, a condition that the right-hand word of the adjacent words includes parentheses, and a condition that the plurality of words are in different languages.

For example, when the sentence to be analyzed contains "머신 러닝(Machine Learning)", the server 10 may obtain 머신, 러닝, Machine, and Learning as words. Furthermore, the server 10 may obtain "머신 러닝 Machine Learning" as a compound noun phrase set. In addition, the server 10 may determine that the parenthesis symbol of the preset condition is present, and obtain "Machine Learning 머신 러닝" as a compound noun phrase set. That is, when the compound noun phrase (or word) "머신 러닝" and its counterpart in another language ("Machine Learning") appear together, the server 10 may obtain the two compound noun phrases (or words), which are expressed in different languages but have the same meaning, as one compound noun phrase, and may also obtain them as one compound noun phrase with their order reversed. This can improve the efficiency of comparison when comparing patent documents written in different languages. That is, a patent document written in Korean may write "머신 러닝(Machine Learning)", whereas a patent document written in English may write "Machine Learning(머신 러닝)", so the server 10 may obtain both forms as compound noun phrase sets to improve search efficiency.
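The parenthesis condition can be sketched as follows. The example assumes the input is a single phrase of the form "X(Y)" and treats "different languages" as "exactly one side contains Hangul syllables", which is a simplification chosen for the illustration; the names `bilingual_variants`, `HANGUL`, and `PAREN` are hypothetical.

```python
import re

# Hangul syllable block; used as a rough test for Korean text.
HANGUL = re.compile(r"[\uac00-\ud7a3]")
# A phrase followed by a parenthesized phrase, e.g. "X(Y)" or "X (Y)".
PAREN = re.compile(r"\s*(?P<left>[^()]+?)\s*\((?P<right>[^()]+)\)\s*")


def bilingual_variants(phrase):
    """Return both orderings of 'X(Y)' when X and Y differ in language."""
    match = PAREN.fullmatch(phrase)
    if match is None:
        return []
    left = match.group("left").strip()
    right = match.group("right").strip()
    # Assumption: exactly one side must contain Hangul.
    if bool(HANGUL.search(left)) == bool(HANGUL.search(right)):
        return []
    return [f"{left} {right}", f"{right} {left}"]


variants = bilingual_variants("머신 러닝(Machine Learning)")
```

Both orderings are returned so that a Korean document writing "머신 러닝(Machine Learning)" and an English document writing "Machine Learning(머신 러닝)" yield the same pair of candidates.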

Meanwhile, although the above embodiment describes the case in which the preset symbol is a parenthesis, the present invention is not limited thereto. According to various embodiments, the preset symbol may of course be a symbol such as a hyphen (-), a slash (/), a tilde (~), or a quotation mark (' or ").

Thereafter, the server 10 may determine the reordered plurality of words to be a compound noun phrase set candidate.

Meanwhile, according to various embodiments of the present invention, when a sentence contained in the target patent document has a preset structure, the server 10 may obtain at least one word from the sentence, excluding the words that correspond to the preset structure itself.

For example, the preset structure may be a formulaic structure such as "~는 ~로 정의한다" ("~ is defined as ~"). Specifically, in the target patent document, sentences with structures such as "~는 ~로 정의한다" ("~ is defined as ~"), "이하 ~는 ~라고 한다" ("hereinafter, ~ is referred to as ~"), "~는 ~로 이해될 수 있다" ("~ may be understood as ~"), and "~는 ~라고 서술한다" ("~ is described as ~") may be used to define terms that are not commonly used in the technical field. Accordingly, the server 10 may identify a sentence that defines such a term and obtain the words defined in that sentence.

For example, when the sentence to be analyzed is "머신 러닝은 OOO로 서술한다(또는 정의한다)" ("machine learning is described (or defined) as OOO"), the server 10 may determine that the sentence has the preset structure and obtain the words 머신 (machine), 러닝 (learning), and OOO.

Thereafter, the server 10 may obtain semantic information for the at least one word.

Specifically, when the obtained word OOO has semantic information commonly accepted in the technical field, the server 10 may match that semantic information to OOO. However, when OOO is a term newly defined by the user and its semantic information is therefore unclear, the server 10 may match the semantic information corresponding to machine learning to OOO as its semantic information.

Thereafter, the server 10 may obtain a word corresponding to the semantic information.

However, the present invention is not limited thereto, and when the degree of association between machine learning and OOO is at least a preset degree of association, the server 10 may of course replace the word OOO with the word machine learning.
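The handling of definition sentences can be sketched with regular expressions. The patterns below are English renderings of the Korean structures and are assumptions made for the example, as are the names `extract_definition` and `resolve_sense` and the dictionary used to hold known semantic information.

```python
import re

# Hypothetical English counterparts of the preset sentence structures.
DEFINITION_PATTERNS = [
    re.compile(r"^[Hh]ereinafter,? (?P<a>.+?) is referred to as (?P<b>.+?)\.?$"),
    re.compile(r"^(?P<a>.+?) is (?:defined|described) as (?P<b>.+?)\.?$"),
    re.compile(r"^(?P<a>.+?) may be understood as (?P<b>.+?)\.?$"),
]


def extract_definition(sentence):
    """Return the two terms linked by a definition sentence, or None."""
    for pattern in DEFINITION_PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            return match.group("a"), match.group("b")
    return None


def resolve_sense(term, linked_term, known_senses):
    """Use the term's own sense if known, else borrow the linked term's."""
    return known_senses.get(term, known_senses.get(linked_term))


pair = extract_definition("Machine learning is defined as OOO.")
known = {"Machine learning": "learning from data"}
sense_of_ooo = resolve_sense(pair[1], pair[0], known)
```

Because "OOO" has no entry of its own in `known`, it inherits the semantic information of "Machine learning", mirroring the fallback described for newly defined terms.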

Meanwhile, according to various embodiments of the present invention, the server 10 may of course obtain words while excluding the template portion of a patent document.

Specifically, a patent document is a standardized document, and it may contain paragraphs that have little to do with the core content of the invention but describe required elements in a formulaic manner. Removing such formulaic paragraphs before obtaining the clue set therefore has the effect of reducing the computational load of the server 10.

Accordingly, the server 10 may obtain information on a plurality of identification items contained in the target patent document, together with the template.

In one embodiment, the template may be set to paragraphs located at a specific position. For example, the template may be the text that begins with the sentence on the line following the [Specific Details for Carrying Out the Invention] identification item of the target patent document and ends just before the first sentence under that identification item in which the word 'FIG. 1' ('도 1') appears.

In another embodiment, the template may mean the portions that the target patent document to be analyzed shares with other patent documents written by the same author. That is, the server 10 may identify the author of the target patent document, obtain other patent documents by that author, compare the obtained patent documents, and obtain as the template any paragraph or identification item judged from the comparison to be similar by at least a preset ratio. Because patent documents written by the same author are likely to have similar sentence structures, the server 10 may set the unit of the similarity judgment to the paragraph or identification item rather than the sentence.

In another embodiment, the server 10 may obtain the text corresponding to preset identification items as the template. For example, the server 10 may obtain the text under identification items such as [Brief Description of the Drawings] and [Description of Reference Numerals] as the template.

The above embodiments describe template acquisition for the usual identification items of a patent document for a domestic application, but the present invention is not limited thereto. That is, the server 10 may of course obtain templates for various identification items, including those of the Korean-language form for PCT applications, those of the English-language form for PCT applications, and those of the forms used for applications in the United States, Japan, China, and Europe.

Thereafter, the server 10 may obtain a plurality of sentences excluding the obtained template.
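The position-based template of the first embodiment can be sketched as follows, assuming the detailed-description section has already been split into sentences. The function name `strip_leading_template`, the English marker "FIG. 1", and the choice to keep everything when the marker is absent are assumptions made for the illustration.

```python
def strip_leading_template(sentences, marker="FIG. 1"):
    """Drop the sentences that precede the first one mentioning marker.

    If no sentence mentions the marker, nothing is treated as template
    (an assumption; the specification does not address this case).
    """
    for index, sentence in enumerate(sentences):
        if marker in sentence:
            return sentences[index:]
    return list(sentences)


section = [
    "Terms used herein are for describing embodiments only.",
    "Hereinafter, embodiments are described with reference to the drawings.",
    "FIG. 1 is a diagram of a system according to an embodiment.",
    "The server 10 obtains a target patent document.",
]
remaining = strip_leading_template(section)
```

A plain substring test also matches labels such as "FIG. 10", so a practical version would likely need a stricter match on the figure number.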

Thereafter, the server 10 may assign a weight to each of the plurality of identification items.

In one embodiment, when the target patent document has the same form and identification items as the present specification, the server 10 may assign a weight to each of the following identification items: title of the invention, technical field, background art, problem to be solved, means for solving the problem, effects of the invention, brief description of the drawings, specific details for carrying out the invention, description of reference numerals, claims, abstract, representative drawing, and drawings.

Here, the weights may be assigned in various ways. In one embodiment, the server 10 may assign weights in descending order to the technical field, background art, claims, effects of the invention, specific details for carrying out the invention, description of reference numerals, brief description of the drawings, abstract, problem to be solved, means for solving the problem, and representative drawing.

In another embodiment, the server 10 may obtain an importance score for each word (or compound noun phrase set) contained in the target patent document, obtain an importance score for each identification item based on those scores, and then assign higher weights to identification items with higher importance scores. The importance score is an indicator of how important a word (or compound noun phrase set) is within the patent document; the higher the importance score of a word (or compound noun phrase set), the more likely it is to be a keyword of that patent document.

Meanwhile, the importance score of an identification item may mean the sum of the importance scores of the words (or compound noun phrase sets) contained in the identification item, divided by the number of those words (or compound noun phrase sets). The words (or compound noun phrase sets) used to obtain the importance score of an identification item may be all of the words (or compound noun phrase sets) contained in the identification item, but are not limited thereto. For example, they may of course be the words (or compound noun phrase sets) contained in the identification item, excluding those in the portion determined to be a template.

Thereafter, the server 10 may determine, based on the weight of each identification item, the priority order of the identification items on which morphological analysis is to be performed.

Thereafter, the server 10 may obtain words from the plurality of sentences, excluding the template, according to the determined priority order.

That is, the server 10 may reduce the amount of computation by first analyzing the identification items most likely to contain keywords and then analyzing the relatively less important identification items based on the results. Furthermore, when different semantic information has been matched to the same word, the server 10 may treat the semantic information obtained from the higher-priority identification item as the correct semantic information.
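The score-based weighting and the resulting analysis order can be sketched as follows. The example assumes that per-word importance scores are already available and that template text has been removed from each identification item beforehand; the names `section_scores` and `analysis_order` and the sample values are hypothetical.

```python
def section_scores(sections, word_scores):
    """Mean importance score of the words in each identification item."""
    scores = {}
    for name, words in sections.items():
        if words:
            total = sum(word_scores.get(w, 0.0) for w in words)
            scores[name] = total / len(words)
        else:
            scores[name] = 0.0
    return scores


def analysis_order(scores):
    """Identification items sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


sections = {
    "abstract": ["method", "network"],
    "claims": ["neural", "network"],
}
word_scores = {"neural": 0.9, "network": 0.7, "method": 0.1}
scores = section_scores(sections, word_scores)
order = analysis_order(scores)
```

With these sample values the claims are analyzed before the abstract, so semantic information obtained from the claims would take precedence when the same word is matched differently in the two items.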

Meanwhile, according to various embodiments of the present invention, even when the target patent document contains an image, the server 10 may of course obtain words from the image.

Specifically, when the target patent document contains an image, the server 10 may determine whether there is an identification item adjacent to the image.

In one embodiment, when the target patent document contains an [Equation] identification item and an image is located below it, the server 10 may determine that the image is an image of an equation. Because an equation is likely to be core content of the target patent document, the server 10 may obtain the equation text contained in the image.

In another embodiment, when the target patent document contains a [Chemical Formula] identification item and an image is located below it, the server 10 may determine that the image is an image of a chemical formula.

Thereafter, when an identification item is adjacent to the image, the server 10 may analyze the image to obtain the text it contains. The image analysis may be performed by optical character recognition (OCR), but is not limited thereto. For example, when the image is an image of a chemical formula, the server 10 may of course analyze the structure of the formula to obtain the chemical formula corresponding to the image.

Thereafter, the server 10 may obtain words based on the obtained text.

FIG. 7 is a flowchart illustrating in detail a method of calculating an importance score according to an embodiment of the present invention.

In step S610, the server 10 may calculate a first detailed importance based on a first appearance ratio, which is the ratio of the number of occurrences of a word in all patent documents to the total number of words in all patent documents, and a second appearance ratio, which is the ratio of the number of sentences in which the word appears to the total number of sentences in all patent documents.

Specifically, the server 10 may count the total number of words in all patent documents and count the number of occurrences of the word in all patent documents.

The server 10 may then calculate the first appearance ratio as the ratio of the number of occurrences of the word to the total number of words in all patent documents.

Next, the server 10 may count the total number of sentences in all patent documents and count the number of those sentences in which the word appears.

The server 10 may calculate the second appearance ratio as the ratio of the number of sentences in which the word appears to the total number of sentences in all patent documents.

To this end, the server 10 may search all patent documents for the word and for the sentences containing it. The server 10 may also analyze the sentence components of all patent documents to count the total number of words and the total number of sentences.

Finally, the server 10 may calculate the first detailed importance based on the first appearance ratio and the second appearance ratio.

Here, the server 10 may calculate the first detailed importance using Equation 1 below.

<수학식 1> <Equation 1>

여기서, W1은 제1 세부 중요도이고, wpw은 전체 특허문서에서의 단어의 출현횟수이고, WPW은 전체 특허문서의 전체 단어수이고, wps은 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수이고, WPS은 전체 특허문서의 전체 문장수이고, a1은 제2 출현비율의 조절 상수이다.Here, W1 is the first detailed importance, wpw is the number of occurrences of the word in the entire patent document, WPW is the total number of words in the entire patent document, wps is the number of sentences in which the word appears among the sentences of the entire patent document, WPS is the total number of sentences in the entire patent document, and a1 is the adjustment constant of the second appearance ratio.

수학식 1을 살펴보면, 서버(10)는 전체 특허문서에서의 단어의 출현횟수가 많고 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수가 적을수록 제1 세부 중요도를 크게 산출할 수 있다.Referring to Equation 1, the server 10 may calculate a larger first detailed importance as the number of occurrences of the word in the entire patent document increases and as the number of sentences in which the word appears decreases.

즉, 서버(10)는 전체 특허문서에서 하나의 문장에 단어가 중복하여 사용될수록 제1 세부 중요도를 크게 산출할 수 있다.That is, the server 10 may calculate a larger first detailed importance the more often the word is used repeatedly within a single sentence of the entire patent document.

한편, 서버(10)는 제2 출현비율의 조절 상수를 증가시켜 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수가 적더라도 제2 출현비율을 증가시킬 수 있고, 제2 출현비율의 조절 상수를 감소시켜 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수가 많더라도 제2 출현비율을 감소시킬 수 있다.Meanwhile, the server 10 may increase the adjustment constant of the second appearance ratio so that the second appearance ratio increases even when the word appears in few sentences of the entire patent document, and may decrease the adjustment constant so that the second appearance ratio decreases even when the word appears in many sentences.

일 실시 예에서, 서버(10)는 대상특허문서에서의 단어의 제1 세부 중요도를 산출하는 한 세부 중요도 산출 방법의 종류는 제한되지 않음을 유의한다.Note that, in an embodiment, the method of calculating the detailed importance is not limited to any particular type, as long as the server 10 calculates the first detailed importance of the word in the target patent document.

예를 들어, 서버(10)는 텍스트 분석법 중 하나로 출현 빈도에 기초하여 중요도를 산출하는 TF-IDF(Term Frequency-Inverse Document Frequency) 분석법을 이용하여 제1 세부 중요도를 산출할 수 있다.For example, the server 10 may calculate the first detailed importance using TF-IDF (Term Frequency-Inverse Document Frequency) analysis, a text analysis method that calculates importance based on frequency of occurrence.
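
As an illustration of the TF-IDF alternative mentioned above, the following sketch uses one common textbook variant (raw term frequency times the natural log of the inverse document frequency). The description does not say which TF-IDF variant the server 10 uses, so the formula, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents, word):
    """Return one TF-IDF score per document for `word`.

    documents: list of token lists.
    tf  = count of the word in the document / number of tokens in the document
    idf = ln(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    doc_freq = sum(1 for tokens in documents if word in tokens)
    if doc_freq == 0:
        return [0.0] * n_docs
    idf = math.log(n_docs / doc_freq)
    scores = []
    for tokens in documents:
        tf = Counter(tokens)[word] / len(tokens) if tokens else 0.0
        scores.append(tf * idf)
    return scores

# Hypothetical tokenized documents.
docs = [
    ["sensor", "reads", "sensor", "data"],
    ["server", "stores", "data"],
    ["server", "sends", "data"],
]
scores = tf_idf(docs, "sensor")
```

A word that occurs in every document (here, "data") gets an inverse document frequency of zero, which matches the intuition that ubiquitous words carry little importance.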

단계 S620에서, 서버(10)는, 특허분류정보의 전체 단어수 대비 특허분류정보에서의 단어의 출현횟수의 제3 출현비율 및 전체 특허문서의 전체 문장수 대비 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수의 제4 출현비율에 기초하여 제2 세부 중요도를 산출할 수 있다.In step S620, the server 10 may calculate a second detailed importance based on a third appearance ratio, which is the number of occurrences of the word in the patent classification information relative to the total number of words in the patent classification information, and a fourth appearance ratio, which is the number of sentences in which the word appears relative to the total number of sentences in the entire patent document.

구체적으로, 서버(10)는 특허분류정보의 전체 단어수를 카운트하고, 특허분류정보에서의 단어의 출현횟수를 카운트할 수 있다.In detail, the server 10 may count the total number of words of the patent classification information and count the number of occurrences of words in the patent classification information.

여기서, 특허분류정보는 기술분야에 따라 특허를 분류할 수 있는 코드로서, IPC(International Patent Classification), CPC(Cooperative Patent Classification) 및 F-Term 중 어느 하나일 수 있다.Here, the patent classification information is a code for classifying patents by technical field, and may be any one of IPC (International Patent Classification), CPC (Cooperative Patent Classification), and F-Term.

이후, 서버(10)는 특허분류정보의 전체 단어수 대비 특허분류정보에서의 단어의 출현횟수를 제3 출현비율로 산출할 수 있다.Thereafter, the server 10 may calculate, as the third appearance ratio, the number of occurrences of the word in the patent classification information relative to the total number of words in the patent classification information.

이어서, 서버(10)는 특허분류정보의 전체 문장수를 카운트하고, 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수를 카운트할 수 있다.Subsequently, the server 10 may count the total number of sentences of the patent classification information, and count the number of appearance sentences in which words appear in the sentences of all patent documents.

서버(10)는 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수 대비 전체 특허문서의 전체 문장수를 제4 출현비율로 산출할 수 있다.The server 10 may calculate the total number of sentences of the entire patent document as the fourth appearance ratio from the number of appearance sentences in which words appear in the sentences of the entire patent document.

이를 위해, 서버(10)는 특허분류정보로부터 단어를 검색하고, 단어가 포함된 문장을 검색할 수 있다. 또한, 서버(10)는 특허분류정보의 문장 성분을 분석하여 전체 단어수와 전체 문장수를 카운트할 수 있다.To this end, the server 10 may search for words from the patent classification information and search for sentences containing the words. In addition, the server 10 may analyze the sentence components of the patent classification information to count the total number of words and the total number of sentences.

최종적으로, 서버(10)는 제3 출현비율과 제4 출현비율에 기초하여 제2 세부 중요도를 산출할 수 있다.Finally, the server 10 may calculate the second detailed importance based on the third appearance rate and the fourth appearance rate.

이때, 서버(10)는 하기의 수학식 2를 이용하여 제2 세부 중요도를 산출할 수 있다.At this time, the server 10 may calculate the second detailed importance level using Equation 2 below.

<수학식2> <Equation 2>

여기서, W₂은 제2 세부 중요도이고, ipcw은 특허분류정보에서의 단어의 출현횟수이고, IPCW은 특허분류정보의 전체 단어수이고, ipcs은 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수이고, IPCS은 전체 특허문서의 전체 문장수이고, a2은 제4 출현비율의 조절 상수이다.Here, W₂ is the second detailed importance, ipcw is the number of occurrences of the word in the patent classification information, IPCW is the total number of words in the patent classification information, ipcs is the number of sentences in which the word appears among the sentences of the entire patent document, IPCS is the total number of sentences in the entire patent document, and a2 is the adjustment constant of the fourth appearance ratio.

수학식 2를 살펴보면, 서버(10)는 특허분류정보에서의 단어의 출현횟수가 많고 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수가 적을수록 제2 세부 중요도를 크게 산출할 수 있다.Referring to Equation 2, the server 10 may calculate a larger second detailed importance as the number of occurrences of the word in the patent classification information increases and as the number of sentences of the entire patent document in which the word appears decreases.

즉, 서버(10)는 특허분류정보에서 하나의 문장에 단어가 중복하여 사용될수록 제2 세부 중요도를 크게 산출할 수 있다.That is, the server 10 may calculate a larger second detailed importance the more often the word is used repeatedly within a single sentence of the patent classification information.

한편, 서버(10)는 제4 출현비율의 조절 상수를 증가시켜 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수가 적더라도 제4 출현비율을 증가시킬 수 있고, 제4 출현비율의 조절 상수를 감소시켜 전체 특허문서의 문장 중에서 단어가 출현된 출현 문장수가 많더라도 제4 출현비율을 감소시킬 수 있다.Meanwhile, the server 10 may increase the adjustment constant of the fourth appearance ratio so that the fourth appearance ratio increases even when the word appears in few sentences of the entire patent document, and may decrease the adjustment constant so that the fourth appearance ratio decreases even when the word appears in many sentences.

일 실시 예에서, 서버(10)는 특허문서에서의 단어의 제2 세부 중요도를 산출하는 한 세부 중요도 산출 방법의 종류는 제한되지 않음을 유의한다.In an embodiment, as long as the server 10 calculates the second detailed importance of the word in the patent document, it is noted that the kind of the detailed importance calculation method is not limited.

예를 들어, 서버(10)는 텍스트 분석법 중 하나로 출현 빈도에 기초하여 중요도를 산출하는 TF-IDF(Term Frequency-Inverse Document Frequency) 분석법을 이용하여 제2 세부 중요도를 산출할 수 있다.For example, the server 10 may calculate the second detailed importance using TF-IDF (Term Frequency-Inverse Document Frequency) analysis, a text analysis method that calculates importance based on frequency of occurrence.

단계 S630에서, 서버(10)는, 검색특허문서 각각의 참조 정보에 기초하여 검색특허문서 각각의 영향력 값을 산출하고, 영향력 값을 이용하여 검색특허문서의 제3 세부 중요도를 산출할 수 있다.In operation S630, the server 10 may calculate an influence value of each of the search patent documents based on reference information of each of the search patent documents, and calculate a third detailed importance of the search patent document using the influence value.

구체적으로, 서버(10)는 전체 특허문서 중에서 중요도 스코어의 산출 대상이 되는 단어를 포함하는 검색특허문서를 검색할 수 있다.In detail, the server 10 may search for a search patent document including a word that is an object of calculation of an importance score among all patent documents.

서버(10)는 검색특허문서 각각의 참조 정보에 기초하여 검색특허문서 각각의 영향력 값을 산출할 수 있다.The server 10 may calculate an influence value of each search patent document based on reference information of each search patent document.

구체적으로, 서버(10)는 검색특허문서의 참조 정보인 출원인, 발명자, 권리자 중 하나 이상이 다른 특허문서와 동일한 항목의 개수, 참조 정보인 인용 횟수 및 피인용 횟수에 기초하여 영향력 값을 산출할 수 있다.Specifically, the server 10 may calculate the influence value based on the number of items, among the applicant, inventor, and right holder in the reference information of the search patent document, that are identical to those of other patent documents, and on the citation count and cited-by count in the reference information.

즉, 서버(10)는 검색특허문서가 다른 특허문서와 관련된 정도를 영향력 값으로 산출할 수 있다. 예를 들어, 서버(10)는 검색특허문서는 여러 특허문서로부터 인용될 때, 해당 특허문서에 검색특허문서가 영향력을 끼친 것으로 판단하여 검색특허문서의 영향력 값으로 산출할 수 있다. That is, the server 10 may calculate, as the influence value, the degree to which the search patent document is related to other patent documents. For example, when the search patent document is cited by several patent documents, the server 10 may determine that the search patent document has influenced those patent documents and reflect this in the influence value of the search patent document.

서버(10)는 검색특허문서 각각에 대해 산출된 영향력 값을 이용하여 검색특허문서의 제3 세부 중요도를 산출할 수 있다.The server 10 may calculate the third detailed importance of the search patent document by using the influence value calculated for each search patent document.

구체적으로, 서버(10)는 검색특허문서 각각에 대해 산출된 영향력 값의 평균을 산출하고, 산출된 평균을 제3 세부 중요도로 산출할 수 있다.In detail, the server 10 may calculate an average of the influence values calculated for each of the search patent documents, and calculate the calculated average as the third detailed importance.
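
The averaging step can be sketched as below. The description lists the inputs to the influence value (shared applicant, inventor, or right-holder items, citation count, and cited-by count) but not how they are combined, so the unweighted sum in `influence_value` is purely a placeholder assumption, as are the `ReferenceInfo` type and its field names.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ReferenceInfo:
    shared_parties: int  # applicant/inventor/right-holder items shared with other documents
    citations: int       # number of documents this document cites
    cited_by: int        # number of documents that cite this document

def influence_value(ref: ReferenceInfo) -> float:
    # Hypothetical combination: an unweighted sum. The source does not give the formula.
    return float(ref.shared_parties + ref.citations + ref.cited_by)

def third_detailed_importance(refs) -> float:
    """Average the influence values of the search patent documents containing the word."""
    return mean(influence_value(r) for r in refs) if refs else 0.0

# Two hypothetical search patent documents that contain the word being scored.
refs = [ReferenceInfo(1, 2, 3), ReferenceInfo(0, 4, 6)]
w3 = third_detailed_importance(refs)
```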

이때, 서버(10)는 제1 세부 중요도, 제2 세부 중요도 및 제3 세부 중요도 각각에 대응하여 설정된 최소 세부 중요도 값과 제1 세부 중요도, 제2 세부 중요도 및 제3 세부 중요도 각각을 대소 비교하고, 최소 세부 중요도 값 미만인 세부 중요도에 대해 재산출 과정을 수행할 수 있다.At this time, the server 10 may compare each of the first, second, and third detailed importance with the minimum detailed importance value set for it, and perform a recalculation process for any detailed importance that is below its minimum detailed importance value.

이후, 서버(10)는 제1 세부 중요도, 제2 세부 중요도 및 제3 세부 중요도가 최소 세부 중요도 값 이상이면 제1 세부 중요도, 제2 세부 중요도 및 제3 세부 중요도 중 복수를 합산하여 중요도 스코어로 산출할 수 있다.Thereafter, when the first, second, and third detailed importance are each greater than or equal to the minimum detailed importance value, the server 10 may sum two or more of the first, second, and third detailed importance to calculate the importance score.
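
The threshold check and summation can be sketched as follows. The description allows summing any plurality of the three detailed importance values; this sketch assumes all three are summed. The function name, the dictionary keys, and the return convention (a `None` score plus the list of components needing recalculation) are illustrative assumptions.

```python
def importance_score(details, minimums):
    """Combine detailed importance values into an importance score.

    details, minimums: dicts keyed by 'first', 'second', 'third'.
    Returns (score, below), where `below` lists the components that fall
    under their minimum and would need recalculation. The score is None
    whenever `below` is non-empty.
    """
    below = [name for name, value in details.items() if value < minimums[name]]
    if below:
        return None, below
    return sum(details.values()), below

minimums = {"first": 0.1, "second": 0.1, "third": 0.1}
ok_score, ok_below = importance_score(
    {"first": 0.4, "second": 0.3, "third": 0.5}, minimums
)
bad_score, bad_below = importance_score(
    {"first": 0.05, "second": 0.3, "third": 0.5}, minimums
)
```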

서버(10)는 산출된 중요도 스코어를 내부의 메모리 또는 프로세서로 출력하거나, 외부 서버로 송신할 수 있다.The server 10 may output the calculated importance score to an internal memory or a processor or transmit the calculated importance score to an external server.

도 8a 및 도 8b는 본 발명의 일 실시예에 따른 유사특허문서 획득 방법 및 유사 문장 판단 방법을 설명하기 위한 흐름도이다.FIGS. 8A and 8B are flowcharts illustrating a method for obtaining a similar patent document and a method for determining a similar sentence according to an embodiment of the present invention.

구체적으로, 도 8a에 도시된 바와 같이, 단계 S705에서, 서버(10)는, 대상특허문서에 포함된 복수개의 단어를 획득할 수 있다.In detail, as illustrated in FIG. 8A, in operation S705, the server 10 may acquire a plurality of words included in the target patent document.

단계 S710에서, 서버(10)는, 획득된 복수개의 단어를 클러스터링하여 복수개의 대상특허 클러스터를 획득할 수 있다.In operation S710, the server 10 may obtain a plurality of target patent clusters by clustering the obtained plurality of words.

이때, 본 발명에 따른 클러스터링은 다양한 클러스터링 기법이 적용될 수 있다. 일 실시예로, 서버(10)는 엘라스틱 서치(Elasticsearch) 클러스터링 기법을 이용하여 복수개의 대상특허 클러스터를 획득할 수 있다. 또 다른 실시예로, 서버(10)는 K-means 클러스터링, DBSCAN(Density-based spatial clustering of applications with noise) 클러스터링, Hierarchical 클러스터링, 혼합 가우시안 클러스터링 기법 중 적어도 하나의 클러스터링 기법을 이용하여 복수개의 대상특허 클러스터를 획득할 수 있다.In this case, various clustering techniques may be applied to the clustering according to the present invention. In an embodiment, the server 10 may obtain a plurality of target patent clusters by using an Elasticsearch clustering technique. In another embodiment, the server 10 may obtain a plurality of target patent clusters by using at least one of K-means clustering, DBSCAN (Density-based spatial clustering of applications with noise) clustering, hierarchical clustering, and Gaussian mixture clustering.

단계 S715에서, 서버(10)는, 복수개의 대상특허 클러스터 각각에 대한 복수개의 중점을 획득하고, 복수개의 대상특허 클러스터 각각에 대한 복수개의 중점 및 대상특허 클러스터에 포함된 단어의 수를 바탕으로 대상특허문서의 위치를 판단할 수 있다.In step S715, the server 10 may obtain a center point for each of the plurality of target patent clusters, and determine the position of the target patent document based on those center points and the number of words included in each target patent cluster.

일 실시예로, 대상특허 클러스터의 중점은, 해당 클러스터에 포함된 복수의 단어의 위치를 바탕으로 획득될 수 있으며, 대상특허문서의 위치는 복수의 대상특허 클러스터의 중점들에 대한 무게중심일 수 있다.In an embodiment, the center point of a target patent cluster may be obtained based on the positions of the plurality of words included in that cluster, and the position of the target patent document may be the center of gravity of the center points of the plurality of target patent clusters.

단계 S720에서, 서버(10)는, 특허문서에 포함된 복수개의 단어를 획득하고, 획득된 복수개의 단어를 클러스터링하여 복수의 특허 클러스터를 획득할 수 있다.In operation S720, the server 10 may obtain a plurality of words included in a patent document and cluster the obtained plurality of words to obtain a plurality of patent clusters.

단계 S725에서, 서버(10)는, 복수개의 특허 클러스터 각각에 대한 복수개의 중점을 획득하고, 복수개의 특허 클러스터 각각에 대한 복수개의 중점 및 특허 클러스터에 포함된 단어의 수를 바탕으로 특허문서의 위치를 판단할 수 있다.In step S725, the server 10 may obtain a center point for each of the plurality of patent clusters, and determine the position of the patent document based on those center points and the number of words included in each patent cluster.

일 실시예로, 특허 클러스터의 중점은, 해당 클러스터에 포함된 복수의 단어의 위치를 바탕으로 획득될 수 있으며, 특허문서의 위치는 복수의 특허 클러스터의 중점들에 대한 무게중심일 수 있다.In one embodiment, the center point of the patent cluster may be obtained based on the position of a plurality of words included in the cluster, and the position of the patent document may be the center of gravity of the center points of the plurality of patent clusters.

단계 S730에서, 서버(10)는, 대상특허문서의 위치 및 특허문서의 위치가 기 설정된 거리 이내인 경우, 특허문서를 유사특허문서로 결정할 수 있다.In operation S730, when the location of the target patent document and the location of the patent document are within a predetermined distance, the server 10 may determine the patent document as a similar patent document.

일 실시예로, 서버(10)는 대상특허문서의 위치로부터 기 설정된 거리 이내에 존재하는 복수개의 특허문서를 유사특허문서로 결정할 수 있으며, 대상특허문서와의 거리가 가까운 유사특허문서 순으로 결정된 복수의 유사특허문서를 정렬할 수 있다.In an embodiment, the server 10 may determine a plurality of patent documents located within a predetermined distance from the position of the target patent document as similar patent documents, and may sort the determined similar patent documents in ascending order of distance from the target patent document.
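
Steps S715 to S730 can be sketched as follows, assuming the words have already been embedded as vectors and clustered. Weighting each cluster center by its share of the words is one reading of "based on the center points and the number of words"; that weighting, the Euclidean distance, and all function names are assumptions.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def document_position(clusters):
    """Center of gravity of the cluster centers, weighted by cluster size.

    clusters: list of clusters, each a non-empty list of word vectors.
    """
    total = sum(len(c) for c in clusters)
    dims = len(clusters[0][0])
    position = [0.0] * dims
    for cluster in clusters:
        center = centroid(cluster)
        weight = len(cluster) / total
        for d in range(dims):
            position[d] += weight * center[d]
    return position

def similar_documents(target_pos, candidates, max_distance):
    """Names of candidates within max_distance of the target, nearest first.

    candidates: dict mapping a document name to its position vector.
    """
    scored = [(math.dist(target_pos, pos), name) for name, pos in candidates.items()]
    return [name for dist, name in sorted(scored) if dist <= max_distance]

# Hypothetical 2-D word vectors: one cluster of two words, one of a single word.
target = document_position([[[0.0, 0.0], [2.0, 0.0]], [[4.0, 4.0]]])
ranked = similar_documents(
    [0.0, 0.0], {"A": [3.0, 4.0], "B": [1.0, 0.0], "C": [10.0, 0.0]}, 5.0
)
```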

한편, 본 발명의 다양한 실시예에 따라, 서버(10)는 대상특허문서 및 유사특허문서로부터 복수의 문장을 획득하여 중요도 순으로 정렬할 수 있다.Meanwhile, according to various embodiments of the present disclosure, the server 10 may obtain a plurality of sentences from the target patent document and the similar patent document, and sort them in order of importance.

구체적으로, 서버(10)는, 상기 대상특허문서 및 상기 유사특허문서에 포함된 복수의 문장 각각에 대한 복수개의 단어를 획득할 수 있다.Specifically, the server 10 may obtain a plurality of words for each of a plurality of sentences included in the target patent document and the similar patent document.

이후, 서버(10)는, 상기 획득된 복수개의 단어 각각에 대한 복수개의 중요도 스코어를 획득할 수 있다. 중요도 스코어를 획득하는 방법은 후술한다.Thereafter, the server 10 may obtain a plurality of importance scores for each of the obtained plurality of words. The method of obtaining the importance score will be described later.

이후, 서버(10)는, 상기 복수개의 중요도 스코어를 바탕으로 상기 대상특허문서에 포함된 문장의 중요도 스코어를 결정할 수 있다.Thereafter, the server 10 may determine the importance score of the sentence included in the target patent document based on the plurality of importance scores.

일 실시예로, 서버(10)는 복수개의 단어와 대상특허문서와의 연관도를 획득하고, 획득된 연관도가 기 설정된 값 이상인 단어의 중요도 스코어의 총 합을 문장의 중요도 스코어로 결정할 수 있다. 이때, 연관도는 다양한 방법을 통해 획득될 수 있다. 예를 들어 단어와 대상특허문서간의 연관도는 대상특허문서에 대응되는 IPC 분야에 대응되는 복수의 특허문서를 바탕으로 획득될 수 있다. 구체적으로, 단어와 대상특허문서간의 연관도는, 해당 단어가 IPC 분야에 대응되는 복수의 특허문서에 출현한 횟수를 바탕으로 획득될 수 있다. 일 실시예로, 해당 단어가 IPC 분야에 대응되는 복수의 특허문서에 출현한 횟수가 높을수록, 단어와 대상특허문서간의 연관도가 높아질 수 있다.In an embodiment, the server 10 may obtain a degree of association between the plurality of words and the target patent document, and determine the sum of the importance scores of the words having the obtained degree of association greater than or equal to a preset value as the importance score of the sentence. . In this case, the degree of association may be obtained through various methods. For example, the degree of association between the word and the target patent document may be obtained based on a plurality of patent documents corresponding to the IPC field corresponding to the target patent document. Specifically, the degree of association between a word and a target patent document may be obtained based on the number of times the word appears in a plurality of patent documents corresponding to the IPC field. In one embodiment, the higher the number of times a word appears in a plurality of patent documents corresponding to the IPC field, the higher the degree of association between the word and the target patent document.
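
The sentence-level score described in this embodiment can be sketched as a filtered sum. The word importance scores and association values are taken as precomputed inputs; the function name and the example values are hypothetical.

```python
def sentence_importance(words, word_scores, association, min_association):
    """Sum the importance scores of words whose association with the target
    patent document is at least min_association.

    words: tokens of one sentence.
    word_scores: dict word -> importance score.
    association: dict word -> degree of association with the target document.
    """
    return sum(
        word_scores.get(w, 0.0)
        for w in words
        if association.get(w, 0.0) >= min_association
    )

# Hypothetical precomputed values.
word_scores = {"sensor": 2.0, "server": 1.5, "the": 0.25}
association = {"sensor": 0.9, "server": 0.6, "the": 0.1}
score = sentence_importance(["the", "sensor", "server"], word_scores, association, 0.5)
```

In this example "the" falls below the association threshold, so only "sensor" and "server" contribute to the sentence score.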

이후, 서버(10)는, 상기 복수의 문장 각각의 중요도 스코어를 바탕으로 상기 복수의 문장을 중요도 순으로 정렬할 수 있다. Thereafter, the server 10 may sort the plurality of sentences in order of importance based on the importance score of each of the plurality of sentences.

나아가, 서버(10)는 복수의 문장을 n개의 그룹으로 그루핑할 수 있다. 이때, 복수의 문장을 그루핑하는 방법은 다양할 수 있다. 일 실시예로, n개의 그룹은 동일한 개수의 문장을 포함하도록 그루핑될 수 있다. 또 다른 실시예로, n개의 그룹은, 기 설정된 중요도 스코어에 따라 그루핑 될 수 있다. 설명의 편의를 위해, 이하에서는 제k 그룹에 포함된 문장의 중요도 스코어가 제k+1그룹에 포함된 문장의 중요도 스코어보다 큰 경우를 가정한다.Furthermore, the server 10 may group the plurality of sentences into n groups. The sentences may be grouped in various ways. In one embodiment, the n groups may be formed so that each contains the same number of sentences. In another embodiment, the n groups may be formed according to predetermined importance scores. For convenience of explanation, it is assumed below that the importance scores of the sentences in the k-th group are greater than those of the sentences in the (k+1)-th group.

서버(10)는 중요도 스코어를 바탕으로 대상특허문서의 제1 문장 및 유사특허문서의 제2 문장을 획득할 수 있다. 구체적으로, 서버(10)는, 상기 대상특허문서에 포함된 복수의 문장을 중요도 스코어를 바탕으로 기 설정된 n개의 그룹으로 분류할 수 있다.The server 10 may obtain the first sentence of the target patent document and the second sentence of the similar patent document based on the importance score. In detail, the server 10 may classify the plurality of sentences included in the target patent document into n groups based on importance scores.

이후, 서버(10)는, 상기 유사특허문서에 포함된 복수의 문장을 중요도 스코어를 바탕으로 기 설정된 n개의 그룹으로 분류할 수 있다.Thereafter, the server 10 may classify the plurality of sentences included in the similar patent document into n groups based on the importance score.

이후, 서버(10)는, 상기 제1 문장이 상기 대상특허문서의 n개의 그룹 중 제1 그룹에 포함되는 경우, 상기 제2 문장을 상기 유사특허문서에 포함된 전체 문장으로 획득할 수 있다.Thereafter, when the first sentence is included in the first group of the n groups of the target patent document, the server 10 may obtain the second sentence as the entire sentence included in the similar patent document.

이후, 서버(10)는, 상기 제1 문장이 상기 대상특허문서의 n개의 그룹 중 제k 그룹에 포함되는 경우, 상기 제2 문장을 상기 유사특허문서의 n개의 그룹 중, 제1 그룹 내지 제 (n-k+1)그룹에 포함된 문장으로 획득할 수 있다. 이때, n은 1 이상의 자연수, k는 n이하의 자연수일 수 있다. Subsequently, when the first sentence is included in the k-th group among the n groups of the target patent document, the server 10 may obtain the second sentence from the sentences included in the first through (n-k+1)-th groups among the n groups of the similar patent document. Here, n may be a natural number of 1 or more, and k may be a natural number less than or equal to n.

예를 들어, 대상특허문서 및 유사특허문서가 3개의 그룹으로 분류된 경우를 가정할 수 있다. 이때, 제1 문장이 제1 그룹에 포함되는 경우, 서버(10)는 제1 문장과 유사특허문헌의 제1그룹 내지 제3그룹(즉, 유사특허문서 전체)에 포함된 모든 문장을 비교하여 평가 결과를 획득할 수 있다. 또한, 제1 문장이 제2 그룹에 포함되는 경우, 서버(10)는 제1 문장과 유사특허문서의 제1그룹 및 제2그룹에 포함된 모든 문장을 비교하여 평가 결과를 획득할 수 있다. 또한, 제1 문장이 제3 그룹에 포함되는 경우, 서버(10)는 제1 문장과 유사특허문서의 제1그룹에 포함된 모든 문장을 비교하여 평가 결과를 획득할 수 있다.For example, it may be assumed that the target patent document and the similar patent document are classified into three groups. In this case, when the first sentence is included in the first group, the server 10 compares the first sentence and all sentences included in the first group to the third group of the similar patent document (that is, the entire similar patent document). Evaluation results can be obtained. In addition, when the first sentence is included in the second group, the server 10 may compare the first sentence with all sentences included in the first group and the second group of the similar patent document to obtain an evaluation result. In addition, when the first sentence is included in the third group, the server 10 may compare the first sentence with all sentences included in the first group of the similar patent document to obtain an evaluation result.
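
The group-based selection in this example can be sketched as follows, assuming sentences are already sorted by descending importance and split into nearly equal groups (one of the grouping options described above). The function names are illustrative.

```python
def split_into_groups(sorted_sentences, n):
    """Split sentences (sorted by descending importance) into n nearly equal groups."""
    size, extra = divmod(len(sorted_sentences), n)
    groups, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        groups.append(sorted_sentences[start:end])
        start = end
    return groups

def comparison_candidates(k, similar_groups):
    """Sentences of the similar document to compare against a first sentence
    that belongs to group k (1-based) of the target document.

    Returns the sentences in groups 1 through (n - k + 1).
    """
    n = len(similar_groups)
    return [s for group in similar_groups[: n - k + 1] for s in group]

# Six hypothetical sentences of a similar patent document, three groups.
groups = split_into_groups(["a", "b", "c", "d", "e", "f"], 3)
for_group_1 = comparison_candidates(1, groups)
for_group_2 = comparison_candidates(2, groups)
for_group_3 = comparison_candidates(3, groups)
```

With three groups, a first sentence from group 1 is compared against every sentence, one from group 2 against groups 1 and 2, and one from group 3 against group 1 only, matching the example in the text.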

즉, 서버(10)는 문장의 중요도에 따라 중요한 문장이라고 판단되는 문장들에 대하여만 평가 결과를 획득함으로써, 불필요하거나 덜 중요한 문장에 대한 평가 결과 획득 과정을 방지하여 계산량을 줄일 수 있다. 구체적으로, 대상특허문서의 제1 그룹은 중요도가 높은 문장이므로 유사특허문서 전체와 비교하고, 대상특허문서의 제3 그룹은 중요도가 낮은 문장이므로 유사특허문서 중 중요도가 높은 제1 그룹의 문장과 비교함으로써, 서버(10)는 (덜 중요한) 대상특허문서의 제3 그룹과 (덜 중요한) 유사특허문서의 제3 그룹간의 비교는 생략할 수 있다.That is, the server 10 obtains evaluation results only for sentences judged to be important according to their importance, thereby avoiding the evaluation of unnecessary or less important sentences and reducing the amount of computation. Specifically, the first group of the target patent document contains sentences of high importance and is therefore compared with the entire similar patent document, whereas the third group of the target patent document contains sentences of low importance and is therefore compared only with the sentences of the first, high-importance group of the similar patent document. In this way, the server 10 can omit the comparison between the (less important) third group of the target patent document and the (less important) third group of the similar patent document.

한편, 본 발명의 다양한 실시예에 따라, 서버(10)는 대상특허문서 및 유사특허문서를 몇 개의 그룹으로 그루핑할 것인지 여부를 결정할 수 있다. 구체적으로, 서버(10)는 현재 서버(10)의 GPU 점유율 및 요구되는 정확도를 바탕으로 n의 값을 결정할 수 있다. GPU 점유율이 높거나 요구되는 정확도가 낮은 경우, 서버(10)는 n의 값을 증가시키고, GPU 점유율이 낮거나 요구되는 정확도가 높은 경우, 서버(10)는 n의 값을 감소시킬 수 있다. 예컨대 대상특허문서 및 유사특허문서가 6개의 문장으로 구성되고, n이 3인 경우 및 n이 6인 경우를 가정할 수 있다. n이 6인 경우, 서버(10)는 6+5+4+3+2+1=21개의 문장을 비교하고, n이 3인 경우 서버(10)는 6+6+4+4+2+2=24개의 문장을 비교하게 된다. 즉, n의 값이 클수록 대상특허문서와 유사특허문서에서 비교해야 되는 문장의 수가 적어지고, n의 값이 작을수록 대상특허문서와 유사특허문서에서 비교해야 되는 문장의 수가 많아지므로, 서버(10)는 현재 자신이 처리할 수 있는 계산량(GPU 점유율)이 적거나 요구되는 정확도가 낮은 경우, 서버(10)는 n의 값을 증가시킬 수 있다.Meanwhile, according to various embodiments of the present invention, the server 10 may determine how many groups the target patent document and the similar patent document are to be divided into. Specifically, the server 10 may determine the value of n based on its current GPU occupancy and the required accuracy. When the GPU occupancy is high or the required accuracy is low, the server 10 may increase the value of n; when the GPU occupancy is low or the required accuracy is high, the server 10 may decrease the value of n. For example, assume that the target patent document and the similar patent document each consist of six sentences, and consider the cases n = 3 and n = 6. When n is 6, the server 10 performs 6+5+4+3+2+1=21 sentence comparisons, and when n is 3, the server 10 performs 6+6+4+4+2+2=24 sentence comparisons. That is, the larger the value of n, the fewer sentences need to be compared between the target patent document and the similar patent document, and the smaller the value of n, the more sentences need to be compared. Therefore, when the amount of computation the server 10 can currently handle (in terms of GPU occupancy) is small or the required accuracy is low, the server 10 may increase the value of n.
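
The arithmetic in this example can be checked with a short sketch. It assumes both documents have the same number of sentences and that n divides that number evenly, so every group holds the same number of sentences; the function name is illustrative.

```python
def comparison_count(num_sentences, n):
    """Total sentence comparisons when both documents have num_sentences
    sentences split into n equal groups.

    A target sentence in group k is compared with every sentence in groups
    1..(n - k + 1) of the similar document. Assumes n divides num_sentences.
    """
    size = num_sentences // n
    total = 0
    for k in range(1, n + 1):
        per_sentence = size * (n - k + 1)  # candidates for one sentence in group k
        total += size * per_sentence       # every sentence in group k
    return total

count_n6 = comparison_count(6, 6)  # 6+5+4+3+2+1
count_n3 = comparison_count(6, 3)  # 6+6+4+4+2+2
count_n1 = comparison_count(6, 1)  # every sentence against every sentence
```

Under these assumptions the count falls from 36 with a single group to 24 with three groups and 21 with six, which is the trade-off between computation and coverage that the paragraph describes.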

한편, 상술한 실시예에서는 대상특허문서 및 유사특허문서가 동일한 n개의 개수로 그루핑 되는 경우에 대해 설명하였으나, 이에 한정되는 것은 아니다. 예컨대, 대상특허문서와 유사특허문서를 서로 다른 개수로 그루핑될 수 있음은 물론이다. Meanwhile, in the above-described embodiment, the case where the target patent document and the similar patent document are grouped by the same n number has been described, but is not limited thereto. For example, the target patent document and the similar patent document can be grouped in different numbers.

한편, 도 8b에 도시된 바와 같이, 단계 S735에서, 서버(10)는, 제1 문장 및 제2 문장의 유사도 점수 및 비유사도 점수를 각각 획득할 수 있다.Meanwhile, as shown in FIG. 8B, in step S735, the server 10 may obtain similarity scores and dissimilarity scores of the first sentence and the second sentence, respectively.

구체적으로, 유사도 점수는 제1 문장과 제2 문장이 관련된 문장인지 여부를 판단하기 위한 지표이며, 비유사도 점수는 제1 문장과 제2 문장이 관련성이 없는 문장인지 여부를 판단하기 위한 지표이다.In detail, the similarity score is an index for determining whether the first sentence and the second sentence are related sentences, and the dissimilarity score is an index for determining whether the first sentence and the second sentence are irrelevant sentences.

일 실시예로, 유사도 점수 및 비유사도 점수는, 제1 문장 및 제2 문장에 포함된 단어의 의미 정보를 획득하고, 획득된 의미 정보가 동일 주제에 대한 의미 정보인지 여부를 판단하여 획득될 수 있다. In one embodiment, the similarity score and the dissimilarity score may be obtained by obtaining semantic information of a word included in the first sentence and the second sentence, and determining whether the obtained semantic information is semantic information on the same subject. have.

또 다른 실시예로, 서버(10)는 제1 문장 및 제2 문장의 문장 구조를 판단하여 유사도 점수 및 비유사도 점수 계산에 활용할 수 있다. 예를 들어, 서버(10)는 제1 문장 및 제2 문장의 문장 구조가 유사한 경우, 유사도 점수를 높이고, 제1 문장 및 제2 문장의 문장 구조가 비유사한 경우, 비유사도 점수를 높일 수 있다.In another embodiment, the server 10 may determine the sentence structure of the first sentence and the second sentence and use it to calculate the similarity score and dissimilarity score. For example, the server 10 may increase the similarity score when the sentence structures of the first sentence and the second sentence are similar, and increase the dissimilarity score when the sentence structures of the first sentence and the second sentence are dissimilar. .

단계 S740에서, 서버(10)는, 비유사도 점수가 기 설정된 점수 이상인 경우, 제1 문장과 제2 문장은 관계없는 문장으로 판단할 수 있다.In operation S740, when the dissimilarity score is greater than or equal to a predetermined score, the server 10 may determine that the first sentence and the second sentence are irrelevant sentences.

단계 S745에서, 서버(10)는, 유사도 점수가 기 설정된 점수 이상인 경우, 제1 문장에 포함된 단어 중 제2 문장에 포함되지 않은 적어도 하나의 단어를 획득할 수 있다.In operation S745, when the similarity score is equal to or greater than a predetermined score, the server 10 may obtain at least one word not included in the second sentence among words included in the first sentence.

즉, 제1 문장과 제2 문장이 관련성이 있는 문장이라고 하더라도, 제1 문장과 제2 문장이 동일한 의미를 가지는 문장인 것은 아니다. 따라서, 서버(10)는 제1 문장과 제2 문장이 동일한 문장인지 또는 동일한 주제에 대한 문장이나 차이점이 존재하는 문장인지 여부를 판단할 수 있다.That is, even if the first sentence and the second sentence are related, they do not necessarily have the same meaning. Accordingly, the server 10 may determine whether the first sentence and the second sentence are the same sentence, or are sentences on the same subject that nevertheless differ.

일 실시예로, 서버(10)는 제1 문장과 제2 문장의 차이점을 판단하기 위하여 제1 문장에 포함된 단어 중 제2 문장에 포함되지 않은 적어도 하나의 단어를 판단할 수 있다.In an embodiment, the server 10 may determine at least one word not included in the second sentence among words included in the first sentence in order to determine a difference between the first sentence and the second sentence.

또 다른 실시예로, 서버(10)는 제1 문장 및 제2 문장의 문장 구조를 판단할 수 있다.In another embodiment, the server 10 may determine the sentence structure of the first sentence and the second sentence.

단계 S750에서, 서버(10)는, 제2 문장에 포함되지 않은 적어도 하나의 단어 각각에 대한 적어도 하나의 중요도 점수를 획득하고, 획득된 적어도 하나의 중요도 점수 중 기 설정된 중요도 점수 이상인 단어가 존재하는지 여부를 판단할 수 있다.In step S750, the server 10 may obtain at least one importance score, one for each of the at least one word not included in the second sentence, and determine whether any of those words has an importance score greater than or equal to a predetermined importance score.

단계 S755에서, 서버(10)는, 기 설정된 중요도 점수 이상인 단어가 존재하지 않는 경우, 제1 문장 및 제2 문장을 일치 문장으로 판단할 수 있다.In operation S755, when there is no word having a predetermined importance score or more, the server 10 may determine the first sentence and the second sentence as the matching sentence.

단계 S760에서, 서버(10)는, 기 설정된 중요도 점수 이상인 단어가 존재하면, 제1 문장 및 제2 문장을 불일치 문장으로 판단할 수 있다.In operation S760, when there is a word equal to or greater than a predetermined importance score, the server 10 may determine the first sentence and the second sentence as inconsistent sentences.

즉, 제1 문장의 적어도 하나의 단어가 제2 문장에 포함되어 있지 않은 경우라고 하더라도, 중요도 점수가 낮은 단어는 제1 문장과 제2 문장의 유사도와는 연관이 없는 경우가 존재할 수 있다. 따라서 서버(10)는, 제2 문장에 포함되지 않은 적어도 하나의 단어의 중요도 점수가 기 설정된 중요도 점수 이상인 경우의 단어를 획득하고, 획득된 단어를 바탕으로 제1 문장 및 제2 문장이 일치 문장인지 불일치 문장인지 여부를 판단할 수 있다.That is, even when at least one word of the first sentence is not included in the second sentence, a word having a low importance score may not be associated with the similarity between the first sentence and the second sentence. Therefore, the server 10 obtains a word when the importance score of at least one word not included in the second sentence is equal to or greater than a preset importance score, and the first sentence and the second sentence are matched sentences based on the acquired word. It can be determined whether or not the sentence is a mismatch.
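
The decision flow of steps S735 to S760 can be sketched as follows. The similarity and dissimilarity scores are taken as inputs because the description leaves their computation to the second artificial intelligence model. The threshold defaults, the returned labels, and the "undetermined" branch (for pairs that clear neither threshold, a case the description does not address) are assumptions.

```python
def evaluate_pair(first_words, second_words, similarity, dissimilarity,
                  word_importance, similarity_min=0.5, dissimilarity_min=0.5,
                  importance_min=0.5):
    """Classify a (first sentence, second sentence) pair.

    Returns 'unrelated', 'match', 'mismatch', or 'undetermined'.
    """
    # S740: a high dissimilarity score means the sentences are unrelated.
    if dissimilarity >= dissimilarity_min:
        return "unrelated"
    # Not covered by the description: neither threshold is reached.
    if similarity < similarity_min:
        return "undetermined"
    # S745: words of the first sentence that are missing from the second.
    missing = set(first_words) - set(second_words)
    # S750-S760: any important missing word makes the pair a mismatch.
    if any(word_importance.get(w, 0.0) >= importance_min for w in missing):
        return "mismatch"
    return "match"

# Hypothetical word importance scores.
importance = {"sensor": 0.9, "the": 0.1, "wireless": 0.8}
r_match = evaluate_pair(["the", "sensor"], ["sensor"], 0.9, 0.1, importance)
r_mismatch = evaluate_pair(["wireless", "sensor"], ["sensor"], 0.9, 0.1, importance)
r_unrelated = evaluate_pair(["sensor"], ["engine"], 0.2, 0.8, importance)
```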

이때, 서버(10)는 상술한 단계 S180에서, 제1 문장과 제2 문장이 관련성이 없는 문장인 경우에는, 제2 문장을 생성되는 선행기술조사보고서에 포함시키지 않을 수 있다. 서버(10)는 제1 문장과 동일한 제2 문장 또는 제1 문장과 유사한 제2 문장을 바탕으로 선행기술조사보고서를 생성할 수 있다.In this case, in step S180, the server 10 may not include the second sentence in the generated prior art research report when the first sentence and the second sentence are not related. The server 10 may generate a prior art research report based on a second sentence identical to the first sentence or a second sentence similar to the first sentence.

일 실시예에 따라, 서버(10)는 대상특허문서의 복수의 문장을 획득하여 중요도 순으로 정렬하고, 중요도가 높은 문장부터 유사특허문서에 유사한 문장이 있는지 여부를 판단할 수 있다. 예를 들어, 서버(10)는 대상특허문서를 중요도 순으로 정렬하여, 제1-1문장 내지 제1-n문장을 획득할 수 있다. According to an embodiment of the present disclosure, the server 10 may obtain a plurality of sentences of the target patent document and sort them in order of importance, and determine whether there are similar sentences in the similar patent document from the sentences of high importance. For example, the server 10 may sort the target patent document in order of importance, and obtain the first-first sentence to the first-n sentence.

서버(10)는 중요도가 가장 높은 제1-1 문장과 유사특허문서를 비교하여 제1-1 문장에 대한 유사특허문서의 제2 문장을 획득할 수 있다. 같은 방법으로, 서버(10)는 제1-2 문장과 유사특허문서를 비교하여 제1-2 문장에 대한 유사특허문서의 제2 문장을 획득하고, 제1-n 문장과 유사특허문서를 비교하여 제1-n 문장에 대한 유사특허문서의 제2 문장을 획득할 수 있다.The server 10 may compare the 1-1 sentence, which has the highest importance, with the similar patent document to obtain a second sentence of the similar patent document corresponding to the 1-1 sentence. In the same manner, the server 10 may compare the 1-2 sentence with the similar patent document to obtain a second sentence corresponding to the 1-2 sentence, and compare the 1-n sentence with the similar patent document to obtain a second sentence corresponding to the 1-n sentence.

The server 10 may select, from the obtained plurality of second sentences, only those related to the first sentence. According to various embodiments, the server 10 may display the second sentences for all sentences of the target patent document (that is, sentence 1-1 through sentence 1-n), but is not limited thereto; it may of course display only the second sentences for preset high-importance sentences (for example, the sentences in the top 30% after sorting by importance).
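The ordering and top-share selection described above can be sketched as follows. This section does not say how a sentence's importance is computed, so the example assumes each sentence already comes with a numeric importance value; the function name `top_sentences` and the `percent` parameter are hypothetical.

```python
def top_sentences(scored_sentences, percent=30):
    """Sort (sentence, importance) pairs by importance and keep the top share."""
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    # Integer ceiling of len * percent / 100, keeping at least one sentence.
    keep = max(1, (len(ranked) * percent + 99) // 100)
    return [sentence for sentence, _ in ranked[:keep]]


# Ten placeholder sentences with made-up importance values 0..9.
scored = [(f"sentence 1-{i}", i) for i in range(10)]
print(top_sentences(scored))  # the three highest-scoring sentences
```

Only the sentences returned here would then be compared against the similar patent document, which reduces the number of sentence pairs to evaluate.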

Meanwhile, according to various embodiments of the present invention, the server 10 may determine whether a specific sentence of the target patent document infringes a similar patent document, on the basis of the target patent document, the claims of at least one similar patent document for the target patent document, and the importance scores of the words included in the target patent document and in the claims of the at least one similar patent document.

FIG. 9 is a flowchart illustrating a specific method of generating a thesaurus according to an embodiment of the present invention.

In step S810, the server 10 may convert words into word vectors through Word2Vec training based on the patent documents.

In step S820, the server 10 may calculate the distance between any two of the word vectors and use the calculated distance as their similarity.

In step S830, the server 10 may check whether the similarity between any two of the word vectors is less than a preset reference similarity and, if so, group the two words corresponding to those two word vectors into a similar-word group. Because the similarity here is the distance calculated in step S820, a smaller value indicates that the two words are closer.

In step S840, the server 10 may check whether any of the obtained patent documents satisfies a noise document condition, and remove each patent document that satisfies the condition.
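Steps S820 and S830 can be sketched as follows. The sketch makes several assumptions: the word vectors are small hand-written stand-ins for the output of Word2Vec training (step S810 is not reproduced), the distance is Euclidean (the text does not name a specific distance), and the function name `group_similar_pairs` and the reference value are hypothetical.

```python
import math
from itertools import combinations


def group_similar_pairs(vectors, reference_similarity):
    """Return word pairs whose vector distance is below the reference value.

    The distance between two word vectors is used directly as their
    'similarity', so a smaller value means the words are closer.
    """
    pairs = []
    for (w1, v1), (w2, v2) in combinations(sorted(vectors.items()), 2):
        # Euclidean distance is an assumption made for this illustration.
        if math.dist(v1, v2) < reference_similarity:
            pairs.append((w1, w2))
    return pairs


# Toy 2-dimensional vectors standing in for trained Word2Vec vectors.
toy_vectors = {
    "apple": (1.0, 0.0),
    "pear": (1.0, 0.1),
    "engine": (0.0, 5.0),
}
print(group_similar_pairs(toy_vectors, reference_similarity=0.5))
```

With these toy values only "apple" and "pear" fall within the reference distance, so they form the single similar-word pair.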

Meanwhile, according to various embodiments of the present invention, the server 10 may select a reference word vector from among the word vectors.

Here, the reference word vector may be the word vector corresponding to the reference word of a similar-word group described below. For example, if the reference word is "apple", the words that are similar words of "apple" may be included in the similar-word group.

Thereafter, the server 10 may calculate the similarity between the reference word vector and another word vector, using the same similarity calculation method as described above.

Subsequently, on the basis of the similarity between the reference word vector and the other word vector, the server 10 may check whether the word corresponding to the other word vector belongs to the similarity group based on the reference word corresponding to the reference word vector.

Specifically, if the similarity between the reference word vector and the other word vector is less than the preset reference similarity, the server 10 may confirm that the word corresponding to the other word vector belongs to the similarity group based on the reference word corresponding to the reference word vector.

Thereafter, once the word corresponding to the other word vector is confirmed to belong to the similarity group based on the reference word, the server 10 may add that word to the similarity group.
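The reference-word variant described above can be sketched as follows. As before, the vectors are toy stand-ins for trained Word2Vec vectors, Euclidean distance is an assumed choice of distance, and the function name `similarity_group` is hypothetical.

```python
import math


def similarity_group(reference_word, vectors, reference_similarity):
    """Collect the words whose distance to the reference word's vector is
    below the preset reference value (distance is used as the similarity)."""
    reference_vector = vectors[reference_word]
    group = []
    for word, vector in vectors.items():
        if word == reference_word:
            continue  # the reference word itself anchors the group
        if math.dist(reference_vector, vector) < reference_similarity:
            group.append(word)
    return sorted(group)


toy_vectors = {
    "apple": (1.0, 0.0),
    "pear": (1.0, 0.1),
    "fruit": (0.9, 0.2),
    "engine": (0.0, 5.0),
}
print(similarity_group("apple", toy_vectors, reference_similarity=0.5))
```

Compared with checking every pair of word vectors, this variant only measures distances from one chosen reference word, so each group is anchored on that word.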

FIG. 10 is an exemplary diagram illustrating a process of deriving an evaluation result using the first artificial intelligence model and the second artificial intelligence model according to an embodiment of the present invention.

As shown in FIG. 10, the first artificial intelligence model may receive the target patent document as input and obtain similar patent documents for it. There may of course be a plurality of similar patent documents for the target patent document, in which case the server 10 may sort them by their relevance to the target patent document.

The first database is a component for storing all patent documents. Specifically, the first database may obtain the words of each patent document through morphological analysis, obtain a thesaurus for the obtained words, obtain an importance score for each obtained word, and store each importance score matched to its word.

The first artificial intelligence model may obtain, from among all the patent documents stored in the first database, the similar patent documents corresponding to the target patent document.

Meanwhile, once the target patent document and the similar patent document are obtained through the first artificial intelligence model, the server 10 may extract a first sentence from the target patent document and a second sentence from the similar patent document. The first sentence and the second sentence may be obtained through the various methods described above.

Furthermore, the second artificial intelligence model may receive the first sentence and the second sentence as input and obtain an evaluation result for them. Here, the evaluation result may indicate whether the first sentence and the second sentence are matching sentences, mismatching sentences, or irrelevant sentences, and may be obtained according to the various embodiments described above.
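The two-stage flow of FIG. 10 can be sketched as follows. This only shows how the pieces connect: the two models and the sentence splitter are passed in as plain callables, and the stand-ins used in the usage example are trivial placeholders, not the models of the invention. The function name `build_report` and the label strings are assumptions.

```python
def build_report(target_document, first_model, split_sentences, second_model):
    """Collect evaluated sentence pairs for a target document.

    first_model(document)   -> list of similar documents (assumed interface)
    split_sentences(doc)    -> list of sentences         (assumed interface)
    second_model(s1, s2)    -> 'match', 'mismatch' or 'irrelevant'
    """
    report = []
    for similar_document in first_model(target_document):
        for first in split_sentences(target_document):
            for second in split_sentences(similar_document):
                label = second_model(first, second)
                # Irrelevant pairs are left out of the report.
                if label != "irrelevant":
                    report.append((first, second, label))
    return report


# Trivial stand-ins so the sketch runs end to end.
def fake_first_model(document):
    return ["A uses lithium. B is red."]

def fake_split(document):
    return [s.strip() for s in document.split(".") if s.strip()]

def fake_second_model(first, second):
    return "match" if first == second else "irrelevant"

print(build_report("A uses lithium. C is blue.",
                   fake_first_model, fake_split, fake_second_model))
```

Passing the models in as callables keeps the sketch independent of how either model is actually trained or served.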

The second database may store patent documents and the prior art research reports matched to them. Here, a prior art research report may include information on the similar patent documents judged to be similar to the patent document, and on the specific sentences of those similar patent documents that correspond to specific sentences of the patent document.

FIG. 11 is a block diagram of an apparatus according to an embodiment of the present invention.

The processor 102 may include one or more cores (not shown), a graphics processing unit (not shown), and/or a connection path (for example, a bus) for exchanging signals with other components.

The processor 102 according to an embodiment performs the method described with reference to FIGS. 1 to 9 by executing one or more instructions stored in the memory 104.

For example, by executing one or more instructions stored in the memory, the processor 102 may obtain at least one word on the basis of the target patent document, obtain an importance score of the obtained at least one word, input the target patent document into the first artificial intelligence model to obtain a similar patent document, obtain a plurality of sentences included in the target patent document, obtain a plurality of sentences included in the similar patent document, obtain a first sentence from among the plurality of sentences included in the target patent document, obtain a second sentence from among the plurality of sentences included in the similar patent document, input the first sentence and the second sentence into the second artificial intelligence model to obtain an evaluation result for the first sentence and the second sentence, and generate a prior art research report on the target patent document on the basis of the evaluation result.

Meanwhile, the processor 102 may further include random access memory (RAM, not shown) and read-only memory (ROM, not shown) that temporarily and/or permanently store signals (or data) processed inside the processor 102. The processor 102 may also be implemented as a system on chip (SoC) including at least one of a graphics processing unit, RAM, and ROM.

The memory 104 may store programs (one or more instructions) for the processing and control of the processor 102. The programs stored in the memory 104 may be divided into a plurality of modules according to function.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in random access memory (RAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the technical field to which the present invention pertains. The components of the present invention may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented with software programming or software elements; similarly, embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, including various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms that run on one or more processors.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the technical field to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

10: server
20: user terminal

Claims

In the control method of the similarity determination system of a patent document,
Acquiring, by the server, a target patent document;
Acquiring, by the server, at least one word based on the target patent document;
Acquiring, by the server, an importance score of the obtained at least one word;
Obtaining, by the server, a similar patent document by inputting the target patent document into a first artificial intelligence model;
Acquiring, by the server, a plurality of sentences included in the target patent document, and obtaining a plurality of sentences included in the similar patent document;
Acquiring, by the server, a first sentence among a plurality of sentences included in the target patent document and a second sentence among a plurality of sentences included in the similar patent document;
Inputting, by the server, the first sentence and the second sentence into a second artificial intelligence model to obtain evaluation results of the first sentence and the second sentence; And
Generating, by the server, a prior art research report on the target patent document based on the evaluation result;
Acquiring the word,
Acquiring, by the server, a morpheme based on the target patent document to obtain a plurality of words;
Determining, by the server, an error word among the obtained plurality of words;
The server correcting the error word; And
And obtaining, by the server, at least one compound noun phrase set based on the degree of association between the plurality of words.

delete

Claim 3 has been abandoned upon payment of a setup registration fee.

The method of claim 1,
Acquiring the importance score,
Acquiring, by the server, a word that is an object of calculating an importance score from the target patent document;
Calculating, by the server, at least one detail importance among a first detail importance of the word in the entire patent document, a second detail importance of the word in the patent classification information corresponding to the technical field information of the target patent document, and a third detail importance of a search patent document in which the word is searched among the entire patent document; And
And calculating, by the server, the importance score of the word based on at least one of the first detail importance level, the second detail importance level, and the third detail importance level.

The method of claim 1,
The control method,
Acquiring, by the server, an entire patent document that is the basis of a thesaurus;
Analyzing, by the server, morphemes for each of the entire patent documents;
Converting a word included in each of the entire patent documents into a word vector based on the result of the morphological analysis;
Calculating, by the server, the similarity between the word vectors; And
And the server grouping a word corresponding to the word vector into a group of similar words based on the similarity.

The method of claim 1,
Determining the error word,
Matching semantic information with respect to each of the obtained plurality of words;
Obtaining an accuracy score for each of the plurality of words based on the matched semantic information and the target patent document; And
Determining a word having an accuracy score equal to or less than a preset value as an error word,
Correcting the error word,
Obtaining a plurality of semantic information on the error word;
Obtaining a plurality of weights for each of the obtained plurality of semantic informations;
Obtaining a weight having the highest association with the target patent document among the plurality of weights; And
Matching semantic information corresponding to the weight having the highest association with the target patent document to the error word;
Obtaining the at least one compound noun phrase set,
Obtaining a plurality of words for sentences included in the target patent document;
Obtaining a plurality of compound noun phrase set candidates obtained by combining the plurality of words;
Obtaining a frequency at which a compound noun phrase set identical to the obtained plurality of compound noun phrase set candidates is included in the target patent document; And
Determining a compound noun phrase set candidate having the frequency equal to or greater than a preset frequency as a compound noun phrase set; Including,
And the compound noun phrase set candidate comprises sequence information and separation information of words included in the compound noun phrase set candidate.

The method of claim 3,
The step of calculating the detailed importance of the word
Calculating the first detail importance based on a first appearance ratio, which is the number of occurrences of the word in the entire patent document relative to the total number of words in the entire patent document, and a second appearance ratio, which is the number of sentences in which the word appears among the sentences of the entire patent document relative to the total number of sentences of the entire patent document;
Calculating the second detail importance based on a third appearance ratio, which is the number of occurrences of the word in the patent classification information relative to the total number of words of the patent classification information, and a fourth appearance ratio, which is the number of sentences in which the word appears among the sentences of the patent document relative to the total number of sentences of the patent document; And
Calculating an influence value of each of the search patent documents based on reference information of each of the search patent documents, and calculating a third detailed importance of the search patent document using the influence value;
The calculating of the first detail importance level
The first detailed importance is calculated using Equation 1 below.
The calculating of the second detailed importance level
A method of controlling a patent document similarity determination system comprising calculating the second detailed importance level using the following equation.
<Equation 1>

Here, W₁ is the first detail importance, wpw is the number of occurrences of the word in the entire patent document, WPW is the total number of words in the entire patent document, wps is the number of sentences in which the word appears among the sentences of the entire patent document, WPS is the total number of sentences of the entire patent document, a1 is a control constant of the second appearance ratio, W₂ is the second detail importance, ipcw is the number of occurrences of the word in the patent classification information, IPCW is the total number of words in the patent classification information, ipcs is the number of sentences in which the word appears among the sentences of the entire patent document, IPCS is the total number of sentences of the entire patent document, and a2 is a control constant of the fourth appearance ratio.

Claim 7 was abandoned upon payment of a set-up fee.

The method of claim 1,
Acquiring the similar patent document,
Acquiring a plurality of words included in the target patent document;
Clustering the obtained plurality of words to obtain a plurality of target patent clusters;
Acquiring a plurality of midpoints for each of the plurality of target patent clusters, and determining a position of the target patent document based on the plurality of midpoints for each of the plurality of target patent clusters and the number of words included in the target patent cluster. Making;
Obtaining a plurality of words included in a patent document and clustering the obtained plurality of words to obtain a plurality of patent clusters;
Acquiring a plurality of midpoints for each of the plurality of patent clusters, and determining a position of the patent document based on the plurality of midpoints for each of the plurality of patent clusters and the number of words included in the patent cluster; And
And determining the patent document as the similar patent document when the position of the target patent document and the position of the patent document are within a preset distance.
Obtaining an evaluation result for the first sentence and the second sentence,
Obtaining similarity scores and dissimilarity scores of the first sentence and the second sentence, respectively;
Determining that the first sentence and the second sentence are irrelevant sentences when the dissimilarity score is equal to or greater than a predetermined score;
Obtaining at least one word not included in the second sentence among words included in the first sentence when the similarity score is equal to or greater than a predetermined score;
Obtaining an importance score for each of the at least one word not included in the second sentence, and determining whether any of the obtained importance scores is equal to or greater than a predetermined importance score;
Determining that the first sentence and the second sentence are matched sentences when a word having a predetermined importance score or more does not exist; And
And determining the first sentence and the second sentence as inconsistent sentences when a word having a predetermined importance score or more exists.

The method of claim 4, wherein
Converting to the word vector
Converting the word into a word vector through Word2Vec learning based on the patent document;
Computing the similarity
Calculating a distance between any two word vectors of the word vectors and using the calculated distance as the similarity;
Grouping into the group of similar words
Determine whether the similarity between any two word vectors of the word vectors is less than a preset reference similarity, and if the similarity between any two word vectors is less than a preset reference similarity, two words corresponding to the two word vectors Grouping into a group of similar words;
Acquiring the entire patent document
And determining whether any one of the acquired patent documents satisfies the noise document condition, and removing the patent document that satisfies the noise document condition.

Memory for storing one or more instructions; And
A processor for executing the one or more instructions stored in the memory;
The processor executes the one or more instructions,
An apparatus for carrying out the method of claim 1.

A computer program, coupled with a computer, which is hardware, stored on a recording medium readable by a computer to perform the method of claim 1.