KR101401225B1

KR101401225B1 - System for analyzing documents

Info

Publication number: KR101401225B1
Application number: KR1020110003291A
Authority: KR
Inventors: 차완규; 정미경; 안한준; 김정중; 최성호
Original assignee: 엘지전자 주식회사
Priority date: 2011-01-12
Filing date: 2011-01-12
Publication date: 2014-05-28
Also published as: KR20110010664A

Abstract

The document analysis system of this embodiment includes: a database in which patent documents are stored; a document classification module that reads indirect citation relationships between the patent documents and classifies the patent documents under analysis using the indirect citation relationships read; a document evaluation module that evaluates each patent document using the attribute information of the patent document and calculates an evaluation value as the result of the evaluation; and UI output means that provides the user with evaluation information on the patent documents under analysis based on the attribute information.

Description

System for analyzing documents

The present invention discloses a system that clusters and automatically classifies a plurality of patent documents using indirect citation relationships between documents, and that analyzes and evaluates the classified documents.

To obtain a patent, an applicant must prepare documents that satisfy prescribed requirements and submit them to the patent office. Patent application documents submitted to the patent office are published after a predetermined period has elapsed or once certain requirements are met; such documents are referred to as patent documents.

In general, a person who intends to file a patent application searches these patent documents to check whether prior art exists, and most patent document searches are performed by entering keywords.

With recent advances in technology, the number of patent applications has grown enormously, and so has the volume of patent documents. As a result, it is not easy to examine patent documents in order to avoid duplicate research, check for infringement, conduct a prior art search before filing, track other companies' technology development trends, or improve research and development.

In conventional search systems for searching or examining such patent documents, a poorly chosen keyword often returns a large amount of unnecessary information. In such cases, the time required for the investigation itself also becomes enormous.

Embodiments of the present invention aim to provide a document analysis system that classifies and clusters patent documents more efficiently by deriving reference or citation relationships among a plurality of patent documents, or by reading indirect citation relationships even where there is no direct citation relationship, and that presents the results of the classification and clustering to the user more effectively.

The attribute information includes internal characteristics derived from matters recorded in the patent document, and external characteristics derived by considering matters recorded in the cited patent documents that the patent document cites.

The attribute information further includes a citation index that uses information on the number of citations made or the number of citations received among the patent documents.

The document evaluation module classifies the evaluation results of the patent documents into preset own-company patents and other third-party patents, and the UI output means provides through a UI, as the evaluation information and based on the attribute information, first evaluation information on the own-company patents and second evaluation information on the third-party patents.

According to the proposed embodiment, patent documents can be classified more efficiently by deriving reference or citation relationships among a plurality of patent documents, or by reading indirect citation relationships even where there is no direct citation relationship.

In addition, the results of efficient document classification and clustering are provided to the user through various UIs, so that the user can easily analyze patent documents.

FIG. 1 is a diagram for explaining the document analysis system according to the present embodiment.
FIG. 2 is a diagram for explaining the operational flow of the document analysis system according to the present embodiment.
FIG. 3 is an example of an evaluation factor table of the document evaluation module according to the present embodiment.
FIG. 4 is an example showing the results of document search and evaluation according to the present embodiment.
FIG. 5 is an example of a UI in which document information is displayed according to the present embodiment.
FIG. 6 is a diagram showing the configuration of the document clustering means according to the present embodiment.
FIG. 7 is a diagram for explaining an indirect citation relationship according to the present embodiment.
FIG. 8 is a diagram for explaining how second-group documents are classified and clustered into categories of the first group according to the present embodiment.
FIG. 9 is an example showing attribute information of a category document or a second-group document according to the present embodiment.
FIG. 10 is an example showing feature vectors calculated from a category document or a second-group document according to the present embodiment.
FIGS. 11 to 15 show various types of UIs provided to the user as results of document classification and clustering according to the present embodiment.

Hereinafter, the present embodiment is described in detail with reference to the accompanying drawings. The scope of the inventive concept of the present embodiment can be determined from the matters disclosed herein, and that inventive concept includes modifications of the proposed embodiment such as the addition, deletion, or alteration of components.

In the following description, the word 'comprising' does not exclude the presence of elements or steps other than those listed.

FIG. 1 shows an example of the configuration of the document analysis system according to the present embodiment.

As in the example shown in FIG. 1, the system includes a database 130 in which patent documents are stored; a document evaluation module 140 that assigns evaluation values to the patent documents stored in the database 130, or to other patent documents accessible over a network, using preset evaluation factors (which the user can change); and a document classification module 150 that derives direct and indirect citation relationships among patent documents designated by the user or stored in the database, so that the patent documents are classified and clustered.

The document analysis system also classifies patent documents using the indirect citation relationships between them, and can cluster unclassified patent documents using the results of that classification (the representative documents among the classified patent documents). It can be implemented by a server device, a computer, or the like, and may further include an input/output module 110, a document search module 120, a document feature creation module 160, and a document feature DB 170.

As shown in FIG. 2, the document analysis system can perform a patent search (S101), analyze patent documents using the search results (S102), classify the patent documents under analysis (S103), and provide a UI presenting the results of document clustering based on the classification (S104). Each step is described in more detail below with reference to the components of the document analysis system.

First, the patent document search operation (S101) using the document analysis system is described.

The query receiving means 111 of the input/output module 110 receives a query that the user enters with a keyboard, mouse, or the like in order to search or analyze documents. The query entered by the user may be a keyword recorded in a patent document stored in the database 130 (or reachable over a network). In addition to text, the keyword may include numbers that form part of the patent document, such as an application number or a publication number.

The UI (User Interface) output means 112 of the input/output module 110 provides the user with information computed or extracted by the document search module 120, the document classification module 150, or the document evaluation module 140. Although it is described here as a device that provides the various UIs discussed below, it may of course be provided within another component of the evaluation system, depending on the embodiment.

The database 130 of the embodiment stores patent document data; it is a database configured to hold the document data of specifications relating to patent applications or patents in electronic form. The patent document data includes text data describing the content of the specification in character codes. Besides plain text, the document data may include descriptions in general-purpose markup languages such as SGML (Standard Generalized Markup Language), HTML (HyperText Markup Language), and XML (eXtensible Markup Language). Other formats, such as PDF (Portable Document Format), the document formats of general-purpose word processors, and RTF (Rich Text Format), are also possible as long as text data can be extracted.

The patent document database 130 may also be provided outside the patent document evaluation system. In that case, the patent document evaluation system connects to the database over a network and acquires the document data of the patent documents.

The document search module 120 searches the patent documents stored in the database 130 for the patent documents to be retrieved, based on the query entered by the user. The document feature creation module 160 and the document feature DB 170 may be used in this search.

The document feature creation module 160 can obtain text from the documents stored in the database 130 and provide the document feature DB 170 with index information on the frequency of each keyword. When the query receiving means 111 receives a query, the document search module 120 can search for documents containing the query using the index file of each document stored in the document feature DB 170.
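
The indexing and lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names build_index and search, the regular-expression tokenizer, and the sample documents are hypothetical and are not part of the disclosed system, which does not specify how keywords are extracted.

```python
import re
from collections import Counter

def build_index(documents):
    """Map each document id to a Counter of keyword frequencies."""
    index = {}
    for doc_id, text in documents.items():
        # Assumption: lowercase alphanumeric runs stand in for keyword extraction.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        index[doc_id] = Counter(tokens)
    return index

def search(index, query):
    """Return ids of documents containing every query term, most frequent first."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    hits = []
    for doc_id, counts in index.items():
        if terms and all(counts[t] > 0 for t in terms):
            hits.append((sum(counts[t] for t in terms), doc_id))
    # Sort by descending total frequency, then by id for a stable order.
    return [doc_id for _, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]

# Hypothetical sample data for illustration only.
docs = {
    "doc1": "Citation analysis of patent documents. Patent citation graphs.",
    "doc2": "A display panel with a citation-free backlight.",
    "doc3": "Battery electrode material.",
}
index = build_index(docs)
results = search(index, "patent citation")
```

Storing one frequency counter per document mirrors the per-document index file of the description; a production system would more likely use an inverted index keyed by keyword.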

The documents found by the document search module 120 can be presented to the user through the UI output means 112 in a UI such as the one shown in FIG. 4.

When a query is received through the query receiving means 111, or when a new document is stored in the database 130 by a web robot, the document feature creation module 160 creates index files for the documents concerned and can use them to determine a feature vector for each document. This is explained with reference to FIG. 9.

FIG. 9 shows the attribute information of each document. The attribute information shown in FIG. 9 can be written by the document feature creation module 160 in the form of an index file, and the resulting index file is stored in the document feature DB 170.

Using the index files stored in the document feature DB 170, the document feature creation module 160 can determine the feature vector of each document, and the feature vectors can also be stored in the document feature DB 170.

FIG. 9 shows, for each document, the frequency of occurrence of each keyword (A, B, C, D, M, I, K, O, P, Q, Z). For example, the first document contains keyword A 35 times, keyword B 19 times, keyword C 15 times, and keyword D 13 times (here A denotes a word such as a noun, proper noun, or compound noun, not the letter A).

The keyword frequency table for each document can be arranged, as shown in FIG. 9, in order from the keyword with the highest frequency to the keyword with the lowest.

For example, to indicate that document 1 contains keyword A at 4.5%, keyword B at 2.4%, keyword C at 1.9%, and keyword D at 1.7%, the index file for document 1 can be written to carry the meaning (A, B, C, D) → (4.5%, 2.4%, 1.9%, 1.7%).
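
As a rough sketch of how such a sorted frequency table might be produced, the snippet below converts the counts given for the first document (35, 19, 15, 13) into percentages. The function name frequency_table is hypothetical, and the total term count of 780 is an assumption chosen only because it reproduces the percentages quoted above; the description does not state the document length.

```python
def frequency_table(keyword_counts, total_terms):
    """Return (keyword, count, percent) rows sorted from most to least frequent."""
    rows = [
        (kw, n, round(100.0 * n / total_terms, 1))
        for kw, n in keyword_counts.items()
    ]
    # Highest count first; ties broken alphabetically.
    return sorted(rows, key=lambda r: (-r[1], r[0]))

# Counts for the first document as given in the description of FIG. 9.
counts = {"C": 15, "A": 35, "D": 13, "B": 19}
# Assumption: 780 total terms, picked to match the quoted percentages.
table = frequency_table(counts, total_terms=780)
keywords = [row[0] for row in table]
percents = [row[2] for row in table]
```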

Index files for the documents are created by such methods, and the feature vector of each document can then be extracted from the index files.

Specifically, the document feature creation module 160 creates a table based on the frequency of occurrence of each keyword in each document and uses it to create the feature vector of each document.

The feature vector determined by the document feature creation module 160 has the evaluation values of the keywords as its elements. For example, if the total number of keywords contained in the documents is n, the feature vector of each document is a vector in n-dimensional space and can be expressed as in equation (1).

Feature vector = (evaluation value w1 of keyword A, evaluation value w2 of keyword B, ..., evaluation value wn of word n) --- (1)

The evaluation values can be calculated, for example, with the tf·idf method described in Salton, G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley. Under the tf·idf method, among the elements of the n-dimensional feature vector for the first document, an element corresponding to a keyword contained in the first document receives a nonzero evaluation value, and an element corresponding to a keyword not contained in the first document (a word with frequency 0) receives an evaluation value of 0.
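
A minimal sketch of equation (1) with tf·idf weights is given below. It uses one common smoothed idf variant, ln((1 + N) / (1 + df)) + 1, which is an assumption rather than the exact formula of the cited text; the smoothing keeps the weight of a contained keyword nonzero even when every document contains it. The function name tfidf_vectors and the second document are hypothetical.

```python
import math

def tfidf_vectors(doc_counts):
    """Return (vocabulary, {doc_id: vector}) with one tf-idf weight per keyword."""
    vocab = sorted({kw for counts in doc_counts.values() for kw in counts})
    n_docs = len(doc_counts)
    # Document frequency: number of documents in which each keyword occurs.
    df = {kw: sum(1 for c in doc_counts.values() if c.get(kw, 0) > 0) for kw in vocab}
    vectors = {}
    for doc_id, counts in doc_counts.items():
        total = sum(counts.values())
        vec = []
        for kw in vocab:
            tf = counts.get(kw, 0) / total if total else 0.0
            # Assumption: smoothed idf, so contained keywords never get weight 0.
            idf = math.log((1 + n_docs) / (1 + df[kw])) + 1.0
            vec.append(tf * idf)
        vectors[doc_id] = vec
    return vocab, vectors

vocab, vectors = tfidf_vectors({
    "doc1": {"A": 35, "B": 19, "C": 15, "D": 13},  # counts from the description
    "doc2": {"A": 10, "M": 5},                     # hypothetical second document
})
```

Keywords absent from a document have tf = 0 and therefore weight 0, which matches the behavior described above.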

From this standpoint, the evaluation value of a keyword, as an element of the feature vector, can be the frequency rate of that keyword in each document. For example, the document search module 120 can cluster keyword A, keyword B, and keyword C from the first document as synonyms, and the clustered synonyms can be stored in a separate synonym DB.

That is, the document search module 120 clusters a given keyword A and keyword B, and the clustered keywords A and B are stored in the synonym DB.

When the extracted keywords include either keyword A or keyword B, the document search module 120 also searches for similar documents that contain the other keyword.

The search is therefore not limited to the extracted keywords; similar documents can be found based on the attributes of the patent documents.

When the query received through the query receiving means 111 contains keyword A, the similar-document search can cover documents containing keyword B and keyword C as well as keyword A.
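
The synonym-based expansion described above could look roughly like this. The sketch assumes the synonym DB is simply a list of keyword sets and that a document matches when it contains at least one expanded term; the names expand_query and find_similar and the sample data are hypothetical.

```python
def expand_query(terms, synonym_groups):
    """Add every member of any synonym group that shares a term with the query."""
    original = set(terms)
    expanded = set(original)
    for group in synonym_groups:
        if original & set(group):
            expanded |= set(group)
    return expanded

def find_similar(doc_keywords, terms, synonym_groups):
    """Return ids of documents containing at least one expanded query term."""
    expanded = expand_query(terms, synonym_groups)
    return sorted(
        doc_id for doc_id, kws in doc_keywords.items() if expanded & set(kws)
    )

# Hypothetical synonym DB and documents for illustration only.
groups = [{"A", "B", "C"}]
doc_keywords = {"d1": {"A"}, "d2": {"B", "Z"}, "d3": {"Q"}}
similar = find_similar(doc_keywords, ["A"], groups)
```

Here a query for keyword A also retrieves d2, which contains only the synonym B.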

Meanwhile, during patent search or document classification and clustering, the document analysis system assigns evaluation values to documents using preset evaluation factors, so that the importance or trend of each document can be understood.

The evaluation of patent documents by the document analysis system is described next.

The document evaluation module 140 assigns evaluation values, according to preset evaluation factors, to the patent documents stored in the database 130 or accessible over a network.

That is, the document evaluation module 140 evaluates the patent documents stored in the database 130, or reachable over a network, using the attribute information of each patent document, and passes the result to the UI output means 112 so that it can be shown to the user. The UI output means 112 can provide the user with the evaluation values of the retrieved patent documents together with the search result list, or it can present the evaluation values in a pop-up or OSD separate from the search result list.

The document evaluation module 140 creates an evaluation item table using the evaluation items set for the patent documents stored in the database 130 or reachable over a network. This evaluation can be performed each time a new patent document is stored in the database 130.

The evaluation by the document evaluation module 140 may instead be performed when the user issues a document search request and matching documents exist. The following description places no restriction on the order in which the evaluation is performed.

The document evaluation module 140 may include evaluation factor management means 141 that manages the characteristics of patent documents as evaluation factors; document evaluation means 142 that evaluates the patent documents stored in the database 130 using the evaluation factors; and DB document management means 143 that associates each patent document with its evaluation value, that is, the result of the evaluation by the document evaluation means 142.

The evaluation factor management means 141 manages items for the internal and external characteristics of the patent documents stored in the database 130, and the user can edit these characteristics.

The structure of the evaluation factors for the internal and external characteristics of a patent document, as managed by the evaluation factor management means 141, is shown in FIG. 3. FIG. 3 shows the structure of the evaluation factors of a patent document.

As shown in FIG. 3, a plurality of tables of patent attributes described by the evaluation factor management means 141 can be linked, one per country. Each table includes internal characteristics derived from matters recorded within the patent document and external characteristics that can be derived by considering the characteristics of the documents cited by the patent document.

The internal characteristics that can be derived from matters recorded in a patent document are keywords, or information about the patent document, that can be extracted by text mining the contents of the document.

For example, the maintenance period, calculated as the time from the registration date recorded in the patent document to the current date, can be derived from matters stated in the patent document itself and can therefore be an internal characteristic.

Other internal characteristics include the elapsed time calculated from the filing date stated in the patent document to the current date; the number of independent claims; the claim length, which can be determined from the number of keywords obtained by text mining a particular independent claim; and the number of dependent claims, which can be identified as dependent because they contain a specific phrase such as '제 1 항에 있어서' or 'according to claim 1'.
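
A rough sketch of how these internal characteristics might be derived from a single document follows. Several points are assumptions: the function names are hypothetical, the word count of the first independent claim stands in for the text-mined keyword count, and the reference date is fixed (here to this document's publication date) only so that the example is reproducible.

```python
import re
from datetime import date

# Phrases that mark a claim as dependent; both examples come from the description.
DEPENDENT_PATTERNS = (
    re.compile(r"according to claim \d+", re.IGNORECASE),
    re.compile(r"제\s*\d+\s*항에 있어서"),
)

def is_dependent(claim):
    """True if the claim text contains a dependent-claim phrase."""
    return any(p.search(claim) for p in DEPENDENT_PATTERNS)

def internal_features(filing_date, claims, today):
    """Derive internal characteristics from data inside one patent document."""
    independent = [c for c in claims if not is_dependent(c)]
    return {
        "elapsed_days": (today - filing_date).days,
        "independent_claims": len(independent),
        "dependent_claims": len(claims) - len(independent),
        # Assumption: word count of the first independent claim
        # approximates the text-mined keyword count.
        "claim_length": len(independent[0].split()) if independent else 0,
    }

# Hypothetical claim texts for illustration only.
claims = [
    "A document analysis system comprising a database and a document classification module.",
    "The system according to claim 1, wherein the attribute information includes a citation index.",
    "The system according to claim 1, further comprising UI output means.",
]
features = internal_features(date(2011, 1, 12), claims, today=date(2014, 5, 28))
```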

The number of inventors listed in the patent document can also be an internal characteristic.

However, the number of patents filed by A, who is recorded as an inventor in the first patent document, is an external characteristic, because finding it requires searching the other patent documents in which A is recorded as an inventor.

Likewise, when the patent document cites other patent documents, the number of patent documents cited, the citing/cited period, and the like are external characteristics of the patent document.
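
External characteristics, by contrast, require looking across the corpus. The sketch below counts, for one document, how many patents each of its inventors appears on and how many documents it cites or is cited by. The corpus layout (a dictionary with "inventors" and "cites" lists) and the name external_features are assumptions made for illustration.

```python
def external_features(doc_id, corpus):
    """Derive external characteristics that need other documents in the corpus."""
    doc = corpus[doc_id]
    # Number of patents in the corpus on which each inventor is recorded.
    inventor_counts = {
        inv: sum(1 for d in corpus.values() if inv in d["inventors"])
        for inv in doc["inventors"]
    }
    # Documents elsewhere in the corpus that cite this one.
    cited_by = [other for other, d in corpus.items() if doc_id in d["cites"]]
    return {
        "inventor_patent_counts": inventor_counts,
        "num_cited": len(doc["cites"]),
        "num_citing": len(cited_by),
    }

# Hypothetical corpus for illustration only.
corpus = {
    "P1": {"inventors": ["A", "B"], "cites": ["P2", "P3"]},
    "P2": {"inventors": ["A"], "cites": []},
    "P3": {"inventors": ["C"], "cites": ["P2"]},
}
p1 = external_features("P1", corpus)
p2 = external_features("P2", corpus)
```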

To calculate an evaluation value that scores a patent document, evaluation factors for the patent document must be defined, and a weighting value is calculated for each defined evaluation factor, which ultimately yields the evaluation value for the patent.

From this standpoint, the evaluation factor management means 141 creates evaluation factor items for each of the patent documents stored in the database 130 using an example table such as the one shown in FIG. 3. In FIG. 2 the internal and external characteristics are arranged in no particular order, but the evaluation values for internal characteristics obtainable from information extracted within the patent document may be kept as items separate from the evaluation values calculated from the relationship between the patent document and other patent documents (other patent documents in the search results, or other patent documents in the same technical field stored in the database).

Once the values of the characteristics read from each patent document have been recorded in a table such as the one shown in FIG. 3, the document evaluation means 142 calculates the evaluation value of the patent document.

For example, a predetermined weight can be assigned to each evaluation factor. In that case, the weights are applied to the values of the internal and external characteristics extracted from the patent document, and the sum of the resulting scores for the evaluation factors becomes the evaluation value of the patent document.
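
The weighted sum described above can be sketched as follows. The factor names and weight values are hypothetical; the description does not disclose the actual evaluation factors of FIG. 3 or their weights, and it says only that the weights are predetermined and editable.

```python
def evaluation_score(features, weights):
    """Sum of weight * value over the evaluation factors; missing factors count as 0."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

# Hypothetical factor values and weights for illustration only.
features = {
    "independent_claims": 3,
    "dependent_claims": 10,
    "num_inventors": 2,
    "forward_citations": 5,
}
weights = {
    "independent_claims": 2.0,
    "dependent_claims": 0.5,
    "num_inventors": 1.0,
    "forward_citations": 3.0,
}
score = evaluation_score(features, weights)  # 6.0 + 5.0 + 2.0 + 15.0
```

Keeping the weights in a separate mapping reflects the statement that the user can add, edit, or delete evaluation factors without changing how the score is computed.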

이렇게 연산된 특허문서에 대한 평가치들은 DB문서 관리수단(143)에 의하여 별도로 관리될 수 있으며, 특허문서 검색결과의 정보가 사용자에게 보여질 때 검색된 결과에 포함되는 특허문서마다 연산된 평가치가 함께 보여지도록 한다. Evaluation values of the patent document thus calculated can be separately managed by the DB document management means 143. When the information of the patent document search result is displayed to the user, evaluation values calculated for each patent document included in the retrieved result .

따라서, 상기 입출력 모듈(110)의 UI 출력수단(112)은 상기 평가팩터 관리수단(141)에 의해 관리되는 평가팩터의 항목 내지는 테이블을 사용자측에 제공하고, 사용자가 추가, 편집 및 삭제하는 평가팩터의 내용은 상기 평가팩터 관리수단(141)에 의해 저장관리된다. Therefore, the UI output means 112 of the input / output module 110 may provide the user with an item or table of evaluation factors managed by the evaluation factor management means 141, and may be provided with an evaluation factor Is stored and managed by the evaluation factor management means 141. [

상기와 같은 문서평가 모듈(140)에 의해 각각의 특허문서들은 평가치를 부여받을 수 있으며, 이렇게 부여된 평가치는 해당 특허문서가 검색의 결과로서 사용자측에 보여질 때 그 결과 리스트와 함께, 도 4와 같이, 보여질 수 있다. Each of the patent documents can be given an evaluation value by the document evaluation module 140 as described above. When the patent document is displayed on the user side as a result of the search, Likewise, it can be seen.

For reference, Fig. 4 illustrates a list of document search results provided to the user's computer or server. For example, when the document search module 120 retrieves seven patent documents from the database 130 for a query entered by the user, the bibliographic information of the retrieved patent documents (for example, patent number, status, filing date, grant date, title of the invention, and IPC) is displayed together with the evaluation value of each patent document.

In addition, the document evaluation means 142 provides the evaluation value of each patent document to the UI output means 112 so that the user can quickly distinguish the most valuable patents among the retrieved patent documents from the rest. Along with the evaluation value of each patent document, it may also calculate the average evaluation value of the retrieved patent documents and provide this average to the UI output means 112.

When the average evaluation value of the retrieved patent documents is shown as well, the user can easily judge the relative merit of each patent document in the search results. According to the present embodiment, the user can improve search efficiency by reviewing the patent documents with high evaluation values first.

The document evaluation means 142 can also calculate the average evaluation value for the technical field to which the patent documents in the search results belong, and the UI output means 112 can present that field average together with the evaluation value of each patent document in the search results.

In this case, whether the retrieved patent documents belong to a common technical field may be determined by the International Patent Classification (IPC) or by the F-term classification used by the Japan Patent Office. When patent documents classified into different technical fields must be output as search results, the average evaluation value for the technical field to which the majority of the patent documents in the search results belong may be provided.

Accordingly, by comparing the evaluation value given to each retrieved patent document with the average evaluation value of patent documents in the relevant technical field, the user can easily grasp how important the retrieved patent documents are.
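As a rough sketch of the comparison described above, the following picks the technical field that covers the majority of the search results and compares each result with that field's average evaluation value. The record layout (`id`, `ipc`, `score`) and the sample data are hypothetical, and using a bare classification code as the "technical field" is an assumption.

```python
from collections import Counter
from statistics import mean


def dominant_field(results):
    """Return the classification code shared by the most search results."""
    return Counter(doc["ipc"] for doc in results).most_common(1)[0][0]


def field_average(corpus, field):
    """Mean evaluation value of all stored documents in the given field."""
    return mean(doc["score"] for doc in corpus if doc["ipc"] == field)


# Hypothetical search results and stored corpus, for illustration only.
results = [
    {"id": "P1", "ipc": "G06F", "score": 70},
    {"id": "P2", "ipc": "G06F", "score": 50},
    {"id": "P3", "ipc": "H04L", "score": 90},
]
corpus = results + [{"id": "P4", "ipc": "G06F", "score": 30}]

field = dominant_field(results)
baseline = field_average(corpus, field)
above_average = [doc["id"] for doc in results if doc["score"] > baseline]
print(field, baseline, above_average)
```

When two fields are equally common, `most_common` returns whichever was encountered first; a real system would need an explicit tie-breaking rule, which the text does not specify.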

Meanwhile, a function that lets the user selectively download the list of search results may be provided, and when the list is downloaded, the evaluation value information produced by the document evaluation module 140 is also provided to the user's computer or server.

In addition, in the search result UI shown in Fig. 4, when the user clicks a particular evaluation value (Weighting Value) to check the details of the evaluation value given to a patent document, a separate UI may be provided that shows in detail the evaluation factors making up that evaluation value and the score given to the patent document for each evaluation factor.

In addition, in the UI shown in Fig. 4, which includes the list of search results, when the user selects a particular patent document, a separate window (UI) showing a summary of that patent document may be created. That is, a patent document analysis UI such as the one shown in Fig. 5 can be provided to the user, and this UI also presents the evaluation value information of the patent document.

For example, the title of the invention, the representative drawing, and the abstract of the selected patent document may be provided together with the evaluation factor items applied to that patent document and the score for each item. As described above, the average evaluation factor values of the retrieved patent documents, or of patent documents in the same technical field as the selected patent, may also be provided.

The user can operate his or her own server or computer to modify and edit the displayed evaluation factor items, and can also edit the assigned scores separately. To this end, the evaluation factor management means 141 and the DB document management means 143 of the document evaluation module 140 update the information of the patent document so that it corresponds to the evaluation factor items and scores changed by the user.

Next, after the evaluation has been performed on the documents stored in the database, or performed selectively on separate documents requested by the user, the document classification (S103) is carried out by the document classification module 150 of the present embodiment, and the results can be provided to the user by the UI output means 112 as various UI information such as that shown in Figs. 11 to 15.

It can be seen that the document search module 120, the document evaluation module 140, and the document classification module 150 of the present embodiment do not operate separately; rather, they operate together in combination according to a predetermined algorithm so that documents are searched, classified, and clustered more effectively.

The following describes the operation in which, when predetermined patent documents are retrieved by the document search module 120 and the document feature creation module 160 for a query entered by the user and the search results appear as a list such as that of Fig. 3, the patent documents in the search results are classified so that documents with a similar technical problem to be solved (problem of the prior art) or a similar solution (means for solving the problem) are grouped together.

That is, according to the present embodiment, documents can be classified by using the indirect citation relations between patent documents, and patent documents linked by such citation relations tend to share a technical problem or solution. It is therefore more advantageous to classify the patent documents returned by a document search (including a similarity search) for the user's query than to classify all of the patent documents stored in the database 130.

In this respect, the operation of the document classification module 150 will be described using, as an example, patent documents that fall within a predetermined similarity range as a result of a document search. Note that although the document evaluation module 140 also operates during the clustering that follows the classification of patent documents, the assigned evaluation value information can be provided, as in Fig. 4, at the document search stage that precedes document classification and clustering as well.

Meanwhile, the UI output means 112 may provide a tag (34, see Fig. 4) that guides the user in performing classification and clustering on some or all of the patent documents in the search result list.

When a key requesting such document classification and clustering is input, the document classification module 150 derives the indirect citation relations of the selected patents and classifies the documents using them. For example, when a first patent document is cited in a second patent document and the second patent document is cited in a third patent document, the first and third patent documents are in an indirect citation relation, so the document classification module 150 classifies the first and third patent documents, together with the second patent document, into the same category.
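The chained example above (first cited in second, second cited in third) can be sketched as a two-step walk over a citation map. The mapping `cited_in` and the document identifiers are hypothetical, and the sketch only covers this one chain pattern, not the full classification logic.

```python
def chain_categories(cited_in):
    """Group documents linked by a two-step citation chain.

    cited_in maps a document to the documents that cite it
    (a hypothetical input format chosen for this sketch).
    """
    categories = []
    for first, citers in cited_in.items():
        for second in citers:
            for third in cited_in.get(second, ()):
                # first and third never cite each other directly, but both
                # are linked through second, so all three share a category.
                categories.append({first, second, third})
    return categories


# P1 is cited in P2, and P2 is cited in P3.
cited_in = {"P1": ["P2"], "P2": ["P3"]}
categories = chain_categories(cited_in)
print(categories)
```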

Here, the citation relation according to the present embodiment, that is, the indirect citation relation, will be examined. A citation relation is formed between a cited patent document and a citing patent document when the citing document contains the reference number (patent application number, publication number, registration number, etc.) of another patent document that is mentioned in order to describe a problem of the prior art.

Moreover, cited documents need not be limited to the patent documents mentioned or described within a patent document; documents referred to as prior art or cited inventions in the examination, opposition, or invalidation trial of that patent document can also be said to be in a citation relation. Accordingly, a citation relation exists not only where the bibliographic information of another patent document is stated in the patent document itself, but also where another patent document is used indirectly during examination by an examiner or another third party.

To broaden these citation relations, the database 130 may include a citation and reference document storage unit that stores citation information for each patent document. In this case, in addition to a reading means for deriving citation relations from the matters described in the patent documents, a reading means may be provided for deriving citation relations from documents used in examination or post-registration procedures, based on data provided by patent offices and the like.

For example, if the publication number of patent document B is stated in patent document A, a direct citation relation between patent documents A and B can be read out. Likewise, if patent document C was presented by the examiner as a cited invention during the examination of patent document A, patent document C is also in a citation relation with patent document A.

In addition, the claims refer to a first group of patent documents and a second group of patent documents. The first group is the document group formed when the user searches for documents and then performs document classification using indirect citation relations on the patent documents in the search results. The second group refers to patent documents designated by the user or other patent documents stored in the database 130, and represents the set of patent documents that have not been classified by the document classification module 150 of the embodiment.

Accordingly, when the user requests document classification of the patent documents in the search results, one or more groups such as the first group described above can be created once the document classification module 150 has performed the classification. If the user then wishes to classify or cluster other patent documents (hereinafter, "second group documents"), the unclassified or unclustered documents of the second group can be classified and clustered into the categories of the first group by using the characteristics of the first group (a representative document or a representative vector).

To aid understanding, the documents of the first group have been defined as those classified using indirect citation relations, and the documents of the second group as those not yet classified or clustered. However, even if the documents of the second group have already been classified or clustered, they can simply be classified and clustered again according to the classification criteria of the first group, so the description need not be limited to these definitions.

In the detailed description of the present invention, the terms classification and clustering may be used interchangeably; note that it is sufficient to interpret them in connection with the operation of the document classification module 150, the document search module 120, and so on.

Meanwhile, in addition to reading such citation relations, the present embodiment can classify patent documents using indirect citation relations, which will be described with reference to the accompanying Figs. 6 to 8.

Fig. 6 shows an example of the document clustering means of the document classification module according to the present embodiment, Fig. 7 illustrates how the document classification module derives indirect citation relations, and Fig. 8 illustrates how the document classification module clusters similar documents into the classified groups.

First, with reference to Fig. 7, the way in which the document classification module 150 of the present embodiment derives indirect citation relations will be described.

The user can obtain information on the indirect citation relations derived by the document classification module 150 for the documents in the search results or for documents the user has designated directly. As shown in Fig. 7, the user can set a period (period A to period B) for the documents to be classified, in which case classification is performed only on those patent documents that fall within the set period.

That is, even when no direct citation relation (a citation relation formed by recording bibliographic information in the document, or one formed by reference by an examiner or the like) holds between the patent documents within the set period, patent documents that are associated through the patent documents they cite or are cited by can be classified into the same category as being in an indirect citation relation.

For example, suppose the period set by the user for document analysis and classification is period A to period B, and the patent documents within this period (Base Patent, Patent 5, Patent 6, Patent 7, Patent 8, Patent 9) do not directly cite one another. If a first patent document (Patent 1) outside the set period is cited in both the fifth patent document (Patent 5) and the Base Patent, then the fifth patent document (Patent 5) and the base patent document (Base Patent) are in an indirect citation relation with each other.

As another example, if the third patent document (Patent 3) directly cites the seventh patent document (Patent 7) and the base patent document (Base Patent), which fall within the period, an indirect citation relation is established between the third patent document (Patent 3) and the seventh patent document (Patent 7), and they are classified into the same category according to the present embodiment.

In this way, in the case of Fig. 7, the base patent document (Base Patent) forms an indirect citation relation with every one of the fifth to ninth patent documents (Patents 5 to 9), and can therefore serve as the representative document or base patent document.
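One possible reading of the Fig. 7 procedure is sketched below: two in-period documents that do not cite each other directly are treated as indirectly related when they share a common third document, whether they both cite it or are both cited by it. Treating citation links as undirected is a simplifying assumption of this sketch, and the citation pairs are a hypothetical reconstruction based on the examples in the text, not data from the figure.

```python
from itertools import combinations


def indirect_pairs(citations, in_period):
    """Find in-period document pairs linked only through a shared document.

    citations: set of (citing, cited) pairs (hypothetical input format).
    in_period: identifiers of documents inside the user-set period.
    """
    neighbours = {}
    for citing, cited in citations:
        # Assumption: links are treated as undirected for this sketch.
        neighbours.setdefault(citing, set()).add(cited)
        neighbours.setdefault(cited, set()).add(citing)

    pairs = set()
    for a, b in combinations(sorted(in_period), 2):
        direct = b in neighbours.get(a, set())
        shared = neighbours.get(a, set()) & neighbours.get(b, set())
        if not direct and shared:
            pairs.add((a, b))
    return pairs


citations = {
    ("Patent 5", "Patent 1"),     # Patent 1 is cited in Patent 5 ...
    ("Base Patent", "Patent 1"),  # ... and in the Base Patent
    ("Patent 3", "Patent 7"),     # Patent 3 cites Patent 7 ...
    ("Patent 3", "Base Patent"),  # ... and the Base Patent
}
in_period = {"Base Patent", "Patent 5", "Patent 7"}

pairs = indirect_pairs(citations, in_period)
print(sorted(pairs))
```

Under these assumptions the Base Patent is the only document linked to every other in-period document, which matches its role as the representative document in the text.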

The user can directly write a classification name for each category of patent documents classified in this way, so that its content is easy to grasp. For example, when the patent documents in a category share the problem to be solved (or the task in question) of 'noise reduction', entering 'noise reduction (e.g., task 1)' as the category name allows the folder to be managed under the name 'noise reduction'.

When the user has named the classified categories problem 1, 2, 3 and solution 1, 2, 3, and so on, an image may be displayed as a marker for the category corresponding to each problem and solution. In this case, the images displayed in the graph may be rendered in different colors or sizes according to the number of patent documents in each category, or according to how high the sum (or average) of the evaluation values of the patent documents in each category is.

When the user selects a particular category (solution 1, solution 2, solution 3, task 1, task 2, task 3) in the UI provided as the result of document classification or document clustering, information on the representative patent document (base patent document) described above, or on the patent document with the highest evaluation value given by the document evaluation module, is provided to the user.

Through this process, the user can perform document classification on the documents in the search results. Furthermore, after document classification using indirect citation relations has been performed, patent documents that are unclassified or classified under other indirect citation relations (these can be regarded as belonging to the second group) can be classified and clustered.

The document clustering process here can use the inter-document similarity determination of the document feature creation module 160, and the document classification module 150 classifies and clusters the patent documents of the second group around the already classified patent documents of the first group. The document clustering means 152 of the document classification module 150 determines the similarity between a patent document belonging to a first category of the first group (which may be the representative document of that category) and a patent document of the second group, and thereby decides into which category of the first group the second-group patent document should be classified.

The document clustering means 152 may include a representative vector calculation unit 1521 that calculates the representative vector needed for clustering, using either the representative document of a classified category or a plurality of documents belonging to that category.

The document clustering means 152 may also include a field-wise clustering unit 1522 for clustering similar documents by field (or by identification item) of the patent document.

The representative vector calculation unit 1521 uses the index files that the document feature creation module 160 creates, based on the frequency of occurrence of each keyword, from the representative document of an already formed category (the base patent document, or a patent document selected by its evaluation value) or from the documents belonging to the same category. For example, the representative vector calculation unit 1521 can extract the representative keywords with high frequency among the keywords appearing in each document, and can select the top several keywords from the index file of each document in descending order of frequency.

Through this selection applied to a keyword distribution such as that shown in Fig. 9, the feature vector of each document can be formed as shown in Fig. 10.

The representative vector calculation unit 1521 can calculate the percentage of each document accounted for by the keywords selected in descending order of frequency. For example, in Document 1 the percentage frequency of each keyword can be calculated as 4.5% for keyword A, 2.4% for keyword B, 1.9% for keyword E, and 1.7% for keyword D.

In this way, the percentage frequency of each keyword is calculated for the documents in the category or for the representative document (hereinafter, 'category documents').

Referring to Figs. 9 and 10, once this process has been performed on the category documents, the percentages of each keyword are summed over all category documents, and a predetermined number of keywords are selected as representative keywords in descending order of the summed percentage.

For example, if summing the percentage of each keyword of Fig. 9 over all ten category documents gives the highest totals in the order keyword B, keyword A, keyword E, keyword D, keyword O, keyword C, keyword K, then keyword B, keyword A, keyword E, and keyword D can be selected as the representative keywords for clustering the selected documents. The feature vector of each document is then computed with the selected representative keywords as the components of the representative vector; that is, the selected representative keywords are arranged in descending order of their probability distribution and taken as the components of the representative vector. The feature vector of each document is created with reference to the selected keywords B, A, E, and D, and this is done for the top four keywords in each document's index file. Although the description selects four representative keywords as the components of the representative vector and compares them with the four most frequent keywords of each document to create its feature vector, this is only an example and can be changed freely by the system administrator.
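The selection of representative keywords can be sketched as follows, under the assumption that each document's index file is available as a mapping from keyword to percentage frequency. The values for Document 1 follow the example in the text; the second document and the function names are hypothetical.

```python
from collections import Counter


def keyword_percentages(tokens):
    """Percentage of a document's tokens accounted for by each keyword."""
    counts = Counter(tokens)
    total = len(tokens)
    return {kw: 100.0 * n / total for kw, n in counts.items()}


def representative_keywords(doc_percentages, k=4):
    """Sum each keyword's percentage over all category documents and
    return the k keywords with the highest totals, highest first."""
    totals = Counter()
    for percentages in doc_percentages:
        totals.update(percentages)  # adds the values keyword by keyword
    return [kw for kw, _ in totals.most_common(k)]


category_docs = [
    {"A": 4.5, "B": 2.4, "E": 1.9, "D": 1.7},            # Document 1 (from the text)
    {"B": 5.0, "A": 1.0, "E": 0.9, "D": 0.5, "O": 0.3},  # hypothetical document
]
reps = representative_keywords(category_docs, k=4)
print(reps)
```

The parameter `k` corresponds to the number of representative keywords, which the text notes can be changed by the system administrator.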

When a selected representative keyword is contained in a document, the corresponding vector component can be set to '1', and when it is not contained, to '0'. Alternatively, instead of 1 and 0, the vector components may be values weighted for each keyword.

As shown in Fig. 9, the feature vector of each document is completed in this way, with '1' where a representative keyword is contained and '0' where it is not.

Through this process, the feature vector of Document 1 becomes (1,1,1,1) and that of Document 2 becomes (1,1,0,1). Although the components of each feature vector are written here as 1 or 0, each component may instead be given a different value according to the frequency of occurrence of the corresponding representative keyword.
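A minimal sketch of the binary feature vectors described above, assuming the representative keywords are B, A, E, and D in that order. The top-keyword sets of the two documents are hypothetical and chosen only so that the result matches the vectors given in the text.

```python
def feature_vector(doc_keywords, representative):
    """1 where the document contains the representative keyword, else 0."""
    return tuple(1 if kw in doc_keywords else 0 for kw in representative)


representative = ["B", "A", "E", "D"]

# Hypothetical top-four keyword sets taken from each document's index file.
document_1 = {"A", "B", "E", "D"}
document_2 = {"B", "A", "D", "K"}

v1 = feature_vector(document_1, representative)
v2 = feature_vector(document_2, representative)
print(v1, v2)
```

Replacing the constant 1 with a per-keyword weight or frequency would give the weighted variant that the text mentions as an alternative.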

When a plurality of category documents is used, a representative vector (or center vector) is selected from the feature vectors of these documents; here, the feature vector with the largest magnitude may be selected as the representative vector for clustering.

In this case, among the feature vectors shown in Fig. 9, the feature vector (1,1,1,1) of Document 1 can become the representative vector, and the unclassified patent documents of the second group can be clustered using the selected representative vector.
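Selecting the representative vector by magnitude can be sketched as below. Euclidean length is assumed as the measure of magnitude, since the text does not name a specific norm, and the function names are hypothetical.

```python
import math


def magnitude(vector):
    """Euclidean length of a vector (assumed measure of 'magnitude')."""
    return math.sqrt(sum(x * x for x in vector))


def representative_vector(feature_vectors):
    """Return the (name, vector) pair with the largest magnitude."""
    name = max(feature_vectors, key=lambda n: magnitude(feature_vectors[n]))
    return name, feature_vectors[name]


feature_vectors = {
    "Document 1": (1, 1, 1, 1),
    "Document 2": (1, 1, 0, 1),
}
rep_name, rep_vector = representative_vector(feature_vectors)
print(rep_name, rep_vector)
```

If several vectors share the largest magnitude, `max` returns the first one encountered; the text does not say how such ties should be resolved.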

By using the representative vector derived from the category documents, it can be checked whether the second group contains a patent document with a predetermined degree of similarity to a particular category. This similarity can be determined by also computing feature vectors or representative vectors, as described above, for the patent documents of the second group.

That is, the similarity between a category document belonging to a given category of the first group and an unclassified document of the second group can be calculated using the inner product of their feature vectors or representative vectors. For example, when the inner product of the representative vector of the category documents and the feature vector of a second-group patent document falls within a preset range, that document can be clustered together with the representative vector; that is, it can be classified and clustered into the category to which the representative vector belongs.

When the representative vector is denoted A and the feature vector of the document being compared is denoted B, the document clustering means 152 determines the similarity between the document corresponding to vector A and the document corresponding to vector B according to how far the value of the inner product of A and B divided by |A|² is from 1.

However, if the value of the inner product of the representative vector and the feature vector of a second-group document falls outside the reference value, the document is not clustered with the representative vector and is instead used as a document for another cluster.
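The similarity test described above can be sketched as follows: compute (A·B)/|A|² and compare its distance from 1 with a threshold. The text only speaks of a preset range or reference value, so the `tolerance` of 0.25 and the sample vectors other than (1,1,1,1) are assumptions chosen for illustration.

```python
def similarity_ratio(a, b):
    """Inner product of A and B divided by |A|^2 (A is the representative vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_sq = sum(x * x for x in a)
    if norm_sq == 0:
        raise ValueError("representative vector must be non-zero")
    return dot / norm_sq


def belongs_to_cluster(a, b, tolerance=0.25):
    """True when the ratio lies within `tolerance` of 1 (the threshold is an assumption)."""
    return abs(similarity_ratio(a, b) - 1.0) <= tolerance


representative = (1, 1, 1, 1)
close_doc = (1, 1, 1, 0)  # ratio 3/4
far_doc = (0, 0, 0, 1)    # ratio 1/4

print(belongs_to_cluster(representative, close_doc))
print(belongs_to_cluster(representative, far_doc))
```

A document that fails the test for one category would then be compared against the representative vectors of the other categories.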

According to this similarity calculation and determination between the representative vector of a category and the feature vectors of the second-group documents, as shown in FIG. 7, the 20th document (P20) of the second group can be clustered into category A of the first group, and the 21st document (P21) of the second group can be clustered into category B of the first group.

In addition to the embodiment described above, once document classification has been performed by the document classifying module 150, the module may select, as a result, a technology classification code (IPC or F-term) that represents each category. In this case, the document clustering means 152 classifies and clusters the second-group documents using the technology classification code in addition to the similarity determination described above.

For example, the document clustering means 152 can take the F-terms that occur most frequently in each of the categories produced by classifying documents with the indirect citation relation, and use them to determine the similarity with the F-terms of the second-group documents.

Because F-terms are assigned according to the problem to be solved or the means of solving it, using them together with the vector-based similarity determination should allow more effective document clustering.
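One possible reading of the F-term comparison is sketched below: collect the most frequent F-terms of a category, then measure how many of an unclassified document's F-terms fall in that set. The overlap measure, the cut-off `k`, and the codes `FT-A` to `FT-X` are placeholders invented for the example; the text does not specify how the F-term similarity is scored.

```python
from collections import Counter


def top_fterms(category_docs, k=2):
    """Return the k F-terms that appear in the most documents of a category."""
    counts = Counter(term for doc in category_docs for term in set(doc))
    return {term for term, _ in counts.most_common(k)}


def fterm_overlap(doc_fterms, category_top):
    """Fraction of a document's F-terms that are among the category's top F-terms."""
    doc_fterms = set(doc_fterms)
    if not doc_fterms:
        return 0.0
    return len(doc_fterms & category_top) / len(doc_fterms)


# Placeholder F-term codes, not real classification symbols.
category_a = [["FT-A", "FT-B"], ["FT-A", "FT-B", "FT-C"], ["FT-A", "FT-D"]]
top_a = top_fterms(category_a, k=2)
score = fterm_overlap(["FT-A", "FT-X"], top_a)
print(sorted(top_a), score)
```

Such a score could be combined with the vector-based similarity, for example by requiring both to pass their own thresholds.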

After the patent documents have been classified and then clustered using the classification results according to this embodiment, the document classifying module 150 and the UI output means 112 can provide the user with UIs presenting various information, such as those shown in FIGS. 11 to 15.

FIG. 11 is a first embodiment of a UI for information that can be obtained from the classification and clustering of documents.

After the document analysis system of this embodiment has classified the patent documents and clustered other patent documents using the classification result, a patent document analysis UI such as that of FIG. 11 can be provided to the user according to the period or the applicant (or registered owner) set by the user.

For example, if the user has set its own company to "LGE (including normalized representative names)" and a competitor to "Company A", the number of applications per country within the clustered result, the evaluation values of those applications, and the like can be shown in a table. In particular, the evaluation values assigned by the document evaluation module 140 may be included, and either the sum or the average of the evaluation values of the applications under each item can be shown.
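A table of this kind could be assembled roughly as follows. The record fields (`applicant`, `country`, `score`) and the sample scores are hypothetical; only the company labels and the idea of showing a count, a sum, and an average per item come from the text.

```python
from collections import defaultdict


def summarize(documents):
    """Group documents by (applicant, country) and report count, sum and mean of scores."""
    groups = defaultdict(list)
    for doc in documents:
        groups[(doc["applicant"], doc["country"])].append(doc["score"])
    return {
        key: {"count": len(s), "sum": sum(s), "mean": sum(s) / len(s)}
        for key, s in groups.items()
    }


# Hypothetical clustered documents with evaluation values.
clustered = [
    {"applicant": "LGE", "country": "KR", "score": 3},
    {"applicant": "LGE", "country": "KR", "score": 5},
    {"applicant": "Company A", "country": "US", "score": 4},
]

summary = summarize(clustered)
for key, row in summary.items():
    print(key, row)
```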

Along with this information, the citations-per-patent index (CPP), the technology influence index (CII), the technology strength index (TS), the influence measurement index (TII), the technology progress measurement index (TCT), and the technology independence index (TI) may be shown.

Here, the citations-per-patent index indicates the average number of times the held patents are cited. As an item for evaluating a company's degree of technological progress, it can be the number of times the relevant patent documents are cited divided by the total number of patents. The technology influence index expresses, for example, how a company's patents have been cited over the past five years, in order to evaluate the recent influence of the company's technology, and can be calculated as CII = (sum of (citation rate by year × number of patents by year) / total number of patents of the preceding years).
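The two formulas above can be sketched as below. The CII denominator is stated only loosely in the text, so this example assumes it is the total number of patents over the same years that appear in the numerator; the function names and the sample figures are likewise assumptions.

```python
def cpp(citations_received):
    """Citations-per-patent: total citations received divided by number of patents."""
    return sum(citations_received) / len(citations_received)


def cii(citation_rate_by_year, patents_by_year):
    """Sum of (yearly citation rate x yearly patent count) over the total patent count.

    Assumption: the denominator covers the same years as the numerator.
    """
    weighted = sum(citation_rate_by_year[y] * patents_by_year[y] for y in patents_by_year)
    total = sum(patents_by_year.values())
    return weighted / total


# Made-up figures for illustration only.
print(cpp([4, 0, 2]))
print(cii({2009: 1.0, 2010: 2.0}, {2009: 10, 2010: 30}))
```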

The technology strength index is an item for quantitatively evaluating a company's technological influence and can be calculated as (CII × number of patents). The influence measurement index indicates the share of the total citations in a given technical field that is accounted for by patents in the top 10% by citations. To evaluate each company's influence on a specific technical field, it can be calculated as (number of citations received by patents in the top 10% by citations / total number of citations received).
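A minimal sketch of these two indexes follows. How the top 10% is cut (rounding, ties) is not stated in the text; this example assumes the count is rounded up and that at least one patent is always included. The sample citation counts are invented.

```python
def technology_strength(cii_value, patent_count):
    """TS = CII x number of patents."""
    return cii_value * patent_count


def tii(citations, top_percent=10):
    """Share of all citations received by the most-cited `top_percent` of patents."""
    total = sum(citations)
    if total == 0:
        return 0.0
    # Integer ceiling of len * top_percent / 100, with a minimum of one patent (assumption).
    top_n = max(1, (len(citations) * top_percent + 99) // 100)
    top = sorted(citations, reverse=True)[:top_n]
    return sum(top) / total


# Ten hypothetical patents; the single most-cited one holds half of all citations.
field_citations = [50, 10, 5, 5, 5, 5, 5, 5, 5, 5]
print(tii(field_citations))
print(technology_strength(1.75, 40))
```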

The technology progress measurement index is an item for evaluating the speed of a company's technological progress. It represents the average of the median year differences to the cited patents and can be calculated as (sum of the year differences to the median cited patent / number of patents). The technology independence index is an item for evaluating the independence of a company's own technology. To capture the degree to which the company cites its own patents, it can be calculated as (number of self-citations / total number of citations).
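As a hedged illustration, the sketch below takes, for each patent, the list of year differences to the patents it cites, uses the median of each list, and averages those medians over the patents. The input layout and function names are assumptions, and every patent is assumed to cite at least one earlier patent.

```python
from statistics import median


def tct(year_gaps_per_patent):
    """Average, over patents, of the median year gap to each patent's cited patents."""
    medians = [median(gaps) for gaps in year_gaps_per_patent]
    return sum(medians) / len(medians)


def technology_independence(self_citations, total_citations):
    """TI = citations to the company's own patents / all citations."""
    return self_citations / total_citations


# Two hypothetical patents: medians are 4 and 2, so the index is 3.0.
print(tct([[2, 4, 6], [1, 3]]))
print(technology_independence(3, 12))
```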

These various indexes can be computed by the document classifying module 150 after the documents have been classified and clustered, and the results can be presented by the UI output means 112 as tables or graphs such as those of FIGS. 11 to 15.

FIG. 12 is a second embodiment of a UI for information that can be obtained from the classification and clustering of documents. In the second embodiment, the number of patent documents per applicant within the set period is shown in a table, and the applicants may be ones selected in advance by the user.

The average evaluation value of the patent documents for each period may be shown as W/F, and from the W/F item displayed together in this UI the user can identify points that may be inflection points in the technology's development. When the user selects a point in time at which the average evaluation value W/F is high, the document classifying module 150 and the UI output means 112 of the embodiment can provide, in a separate UI, information on the patent documents of that point in time, or the document with the highest evaluation value or the representative document at that point in time.
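The per-period average and the choice of a high-scoring period might be computed along these lines. The (year, score) pairs are invented, and treating the period with the highest average as the candidate inflection point is a simplifying assumption, since the text leaves that choice to the user.

```python
from collections import defaultdict


def average_score_by_year(docs):
    """Map each year to the mean evaluation value of its documents."""
    buckets = defaultdict(list)
    for year, score in docs:
        buckets[year].append(score)
    return {year: sum(s) / len(s) for year, s in buckets.items()}


def peak_year(averages):
    """Year with the highest average evaluation value."""
    return max(averages, key=averages.get)


# Hypothetical (filing year, evaluation value) pairs.
docs = [(2005, 2), (2005, 4), (2006, 9), (2007, 1)]
averages = average_score_by_year(docs)
print(averages, peak_year(averages))
```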

FIG. 13 is a third embodiment of a UI for information that can be obtained from the classification and clustering of documents. FIG. 13 shows a UI containing information such as the citations-per-patent index (CPP) and the technology influence index (CII) for the period and applicants set by the user, and the UI may further include a graph showing each applicant's citations-per-patent index over time.

That is, the UI shown in the lower part of FIG. 13 illustrates applicants such as Samsung Electronics and Sharp as having high citation indexes.

In addition, information on the evaluation of patent activity by technical field, the patent activity index, patent portfolio analysis (HHI), and patent diversification indicators may be further provided. The evaluation of patent activity by technical field quantitatively compares patent activity across fields within the selected period and can be performed by comparing the numbers of applications (or publications) per technical field.

The patent activity index indicates the share held in a specific technical field and can be calculated as {(number of patents in the specific field / total number of patents of the company) / (total number of patents of the company / total number of patents in all fields)}.
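The formula above can be written directly as a small function. The sketch follows the ratio exactly as stated in the text; the parameter names and the sample counts are assumptions.

```python
def patent_activity_index(field_patents, company_patents, all_patents):
    """(field / company total) divided by (company total / all-field total), as stated."""
    field_share = field_patents / company_patents
    company_share = company_patents / all_patents
    return field_share / company_share


# Hypothetical counts: 20 patents in the field, 100 for the company, 1000 overall.
activity = patent_activity_index(20, 100, 1000)
print(activity)
```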

The patent portfolio analysis index is an item for understanding how companies compete in the market. It is computed for each company's top IPC fields, and from it the technical fields that each company dominates and the fields in which it competes can be derived. For example, the number of applications per inventor is a relative measure of applications per inventor (total number of applications / number of company inventors), the number of claims per inventor is a relative measure of the claims obtained per inventor (total number of claims held / number of company inventors), and the average remaining term of valid patents is the average remaining term of the held patents (sum of the remaining terms of valid patents / total number of valid patents).

The joint application ratio is an item for evaluating how active joint research is and can be calculated as (number of joint applications / total number of patents).
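The per-inventor measures, the average remaining term, and the joint application ratio are all simple quotients. A sketch with assumed function names and invented inputs:

```python
def per_inventor(total, inventor_count):
    """Applications (or claims) per inventor: total / number of company inventors."""
    return total / inventor_count


def average_remaining_term(remaining_years):
    """Mean remaining term over all valid patents."""
    return sum(remaining_years) / len(remaining_years)


def joint_application_ratio(joint_applications, total_patents):
    """Joint applications divided by total number of patents."""
    return joint_applications / total_patents


# Hypothetical company: 120 applications, 40 inventors, three valid patents.
print(per_inventor(120, 40))
print(average_remaining_term([10, 6, 8]))
print(joint_application_ratio(5, 50))
```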

FIGS. 14 and 15 are fourth and fifth embodiments of a UI for information that can be obtained from the classification and clustering of documents.

FIGS. 14 and 15 show UIs having a graph of the citation counts per company within a specific period and a table of highly cited patent documents. When the highly cited patent documents are displayed, the evaluation values assigned by the document evaluation module 140 may be displayed together.

Also, when the user views the table sorted in descending order of citation count and selects the number of a specific patent document (application number, registration number, etc.), additional information on that patent document or its specification can be provided to the user.

The document classification or document clustering results provided by the document analysis system of this embodiment, as described above, can be stored and shared with other users depending on the system settings, which will be particularly useful for companies or teams that drive patent development.

Claims

1. A document analysis system comprising:
a database storing a plurality of patent documents;
a document classification unit that classifies the plurality of patent documents so that they belong to a first group; and
an output unit that outputs information on a plurality of patent documents belonging to a second company in comparison with information on a plurality of patent documents belonging to a first company, from among the plurality of patent documents belonging to the first group,
wherein the document classification unit classifies a plurality of patent documents belonging to a set period into the first group by using an indirect citation relation among them that appears through a patent document outside the set period.

2. The document analysis system according to claim 1,
wherein the first company is the user's own company and the second company is another company.

3. The document analysis system according to claim 1,
wherein the output unit outputs the number of applications per country as the information on the plurality of patent documents.

4. The document analysis system according to claim 1,
wherein the information on the plurality of patent documents includes an evaluation value indicating a frequency rate of keywords included in the patent document, and
wherein the output unit outputs a sum of the evaluation values or an average of the evaluation values as the information on the plurality of patent documents.

5. The document analysis system according to claim 1,
wherein the output unit outputs at least one of a citations-per-patent index, a technology influence index, a technology strength index, an influence measurement index, a technology progress measurement index, and a technology independence index as the information on the plurality of patent documents.

6. The document analysis system according to claim 1,
wherein the output unit outputs, for each of a plurality of technical categories, the information on the plurality of patent documents belonging to the second company in comparison with the information on the plurality of patent documents belonging to the first company.

7. (Deleted)

8. The document analysis system according to claim 1,
wherein the document classification unit classifies a plurality of patent documents belonging to a second group based on a degree of similarity between the plurality of patent documents belonging to the second group and the plurality of patent documents belonging to the first group.

9. The document analysis system according to claim 8,
wherein the plurality of patent documents belonging to the first group are classified into a plurality of categories, and
wherein the document classification unit classifies the plurality of patent documents belonging to the second group into the plurality of categories based on the degree of similarity between the plurality of patent documents belonging to the second group and the plurality of patent documents belonging to the first group.

10. The document analysis system according to claim 9,
wherein the plurality of categories include a plurality of representative patent documents representing the plurality of categories, and
wherein the document classification unit classifies the plurality of patent documents belonging to the second group into the plurality of categories based on a degree of similarity between the plurality of patent documents belonging to the second group and the plurality of representative patent documents.