KR20160098084A

KR20160098084A - System for filtering documents of interest and method thereof

Info

Publication number: KR20160098084A
Application number: KR1020160015567A
Authority: KR
Inventors: 윤장혁; 김무진; 최덕용; 경진영
Original assignee: 특허법인 해담; 윤장혁; 김무진; 최덕용
Priority date: 2015-02-09
Filing date: 2016-02-11
Publication date: 2016-08-18
Also published as: KR102468930B1

Abstract

The present invention relates to a system and a method for filtering documents of interest, which can filter noise or key documents from a plurality of documents by extracting words included in the documents such as patents or papers, selecting a reference seed document through information quantity analysis using the extracted words, and grouping the documents similar to the reference seed through semantic similarity relation analysis. The method for filtering documents of interest comprises the steps of: selecting, as the reference seed, at least one specific document which is an object of interest among the documents; analyzing the topic of each document based on word statistical information on the documents to calculate topic-specific probability distribution and calculating a distance to the reference seed by using the topic-specific probability distribution of each document; and selecting and filtering the documents of interest according to the calculated distance.

Description

The present invention relates to a system and method for filtering documents of interest that automatically extract noise or key data of interest from collected population data.

In general, technical documents (the population) such as patents or papers collected through keyword search contain, in addition to the valid documents in the target technical field to be analyzed, noise documents that must be removed. The valid documents in turn include core documents that are more closely related to the target technical field.

Efficient patent analysis requires collecting the valid patent documents, which in turn requires a filtering process that removes noise documents or extracts core documents.

For example, noise documents have conventionally been removed through simple repetitive work in which a person reviews every patent document in the population one by one, which consumes a great deal of time and labor. Extracting core documents likewise takes a long time, because the contents of each document are reviewed manually.

The following patent document relates to a document retrieval and classification method and system, but does not offer a solution to the problems described above.

Korean Patent Laid-Open Publication No. 10-2006-0047306

The present invention is intended to solve the problems of the prior art described above. It provides a system and method for filtering documents of interest that can filter noise or core documents from a plurality of documents by extracting the words contained in documents such as patents or papers, selecting a reference seed document through information-quantity analysis using those words, and clustering documents similar to the reference seed through semantic similarity analysis.

A method for filtering documents of interest according to an embodiment of the present invention may include: (a) selecting, as a reference seed, at least one specific document of interest from among a plurality of documents; (b) performing topic analysis of each document based on word statistics for the plurality of documents to calculate a per-topic probability distribution, and calculating a similarity (distance) to the reference seed using the per-topic probability distribution of each document; and (c) selecting and filtering documents of interest according to the calculated similarity.

The reference seed of step (a) is selected based on the information entropy calculated from the occurrence probability of preset words of interest within the plurality of documents, and may be selected by comparing the calculated information entropy with a preset range or condition of information entropy for seeds.

Step (b) may include: applying a topic modeling algorithm to the word statistics to calculate the probability distribution of each document over the topics; calculating the similarity between the reference seed and each document under comparison using the per-topic probability distribution of each document and the topic probability distribution of the document selected as the reference seed; and clustering and providing documents similar to the reference seed based on the calculated similarity.

A system for filtering documents of interest according to an embodiment of the present invention may include: a reference seed selection unit that selects, as a reference seed, at least one specific document of interest from among a plurality of documents; a seed clustering unit that calculates a per-topic probability distribution through topic analysis of each document based on word statistics for the plurality of documents, and calculates a similarity to the selected reference seed using the per-topic probability distribution of each document; and an interest document filtering unit that selects and filters documents of interest according to the calculated per-document similarity.

The filtering system may further include: a word extraction unit that extracts words from each of the collected documents; a statistical information acquisition unit that measures the frequency with which each extracted word occurs in a document and obtains word statistics using the word's occurrence frequency and inverse document frequency; and an information entropy calculation unit that calculates the information entropy of each document from the occurrence probability of preset words of interest relative to the total number of word occurrences measured in the document.

According to an embodiment of the present invention, the system for filtering documents of interest such as noise or core documents uses the Shannon entropy method to express, as a quantified value, the degree to which a document contains information about the technology group under analysis. The likelihood that a document is a document of interest, which previously could be judged only through qualitative analysis, is thereby quantified to support the expert's judgment.

This helps establish and maintain consistency in judging whether a document is of interest, a known weakness of the conventional approach in which experts perform simple repetitive work for long periods. By systematically compensating for steps in which mistakes or errors can occur, the quality of the work can be improved.

In addition, topic modeling selects and clusters documents that are semantically similar to the reference seed. This lets an expert extract a document of interest they are confident about and, at the same time, extract semantically similar documents of interest in a batch. The man-hours the expert must spend on filtering documents of interest can thereby be greatly reduced.

Furthermore, in the present invention the filtering process is repeated multiple times, starting from the selection of the reference seed, so the set of documents that still requires manual screening can be reduced substantially. Compared with the time cost of the conventional approach, this greatly reduces the time experts must invest and thereby increases the speed and efficiency of research and development.

FIG. 1 is a conceptual diagram illustrating the system for filtering documents of interest according to the present invention.
FIG. 2 is a block diagram illustrating the system for filtering documents of interest according to an embodiment of the present invention.
FIG. 3 is a diagram showing the seed clustering unit of FIG. 2.
FIG. 4 is a flowchart illustrating the method for filtering documents of interest according to an embodiment of the present invention.
FIG. 5 is a UI screen of the document filtering system according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the word extraction and occurrence frequency calculation process of FIG. 4.
FIG. 7 is a diagram showing the process of FIG. 6.
FIG. 8 is a flowchart illustrating the seed clustering process of FIG. 4.
FIGS. 9A to 9C are diagrams illustrating the LDA applied to the topic probability distribution calculation of FIG. 8.
FIG. 10 is a screen that extracts and displays documents similar to the reference seed according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings.

The embodiments of the present invention may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

In the drawings referred to herein, elements having substantially the same configuration and function are denoted by the same reference numerals, and the shapes and sizes of elements may be exaggerated for clarity.

The term 'unit' as used in this embodiment refers to software or a hardware component such as an FPGA (field-programmable gate array) or an ASIC, and a 'unit' performs certain roles.

However, a 'unit' is not limited to software or hardware. A 'unit' may be configured to reside in an addressable storage medium and may be configured to run on one or more processors.

Thus, as an example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The functions provided within the components and 'units' may be combined into a smaller number of components and 'units' or further separated into additional components and 'units'.

In addition, the components and 'units' may be implemented to run on one or more CPUs in a device or a secure multimedia card.

FIG. 1 is a conceptual diagram illustrating the system for filtering documents of interest according to the present invention.

In general, a patent or paper search starts from a loosely written search query so as to obtain the widest possible population, and therefore collects documents covering many different applications and heterogeneous features. As a result, the population contains noise, and a process of removing noise or extracting core documents becomes unavoidable.

The present invention therefore proposes an efficient method for extracting noise or core documents. The method is inspired by crystal growth, a technique used to obtain pure crystals, and involves repeated refinement. In crystal growth, a pure crystal of sufficient size is obtained by growing a small pure crystal, which is comparatively easy to obtain. This small crystal is the crystal seed: placed in a stable saturated solution or the like, it induces crystallization of solute of the same kind and grows into a crystal of sufficient size.

Likewise, the method proposed in the present invention uses Shannon information entropy and semantic similarity analysis. Some of the documents of interest recommended by the system are selected as reference seeds, and documents with high semantic similarity to those seeds are clustered, extracted in a batch, and refined repeatedly to perform the filtering.

FIG. 2 is a block diagram illustrating the system for filtering documents of interest according to an embodiment of the present invention.

Referring to FIG. 2, the system 1 for filtering documents of interest according to an embodiment of the present invention consists of a document storage unit 10, an interest word storage unit 50, and an information processing unit 100. The information processing unit 100 may include a word extraction unit 110, a statistical information acquisition unit 120, an information entropy calculation unit 130, a reference seed selection unit 140, a seed clustering unit 150, and an interest document filtering unit 160.

The document storage unit 10 may receive and store a plurality of documents collected by a document search system through a keyword search query. In the embodiment, a document may be a patent document or a paper, and the document search system may be a search system such as Google Patents, Delphion, KIPRIS, WIPSON, WINTELIPS, or NDSL.

The interest word storage unit 50 stores the words of interest, that is, the keywords of interest, entered by the user. The words of interest consist of at least one keyword and are used to calculate the information entropy of each document, as described later.

The word extraction unit 110 is configured to extract words from each of the collected documents. It extracts the sentences contained in a document through natural language processing, extracts the words that are adjectives or nouns through part-of-speech analysis of the extracted sentences, and removes any extracted words that appear in a preset stop word list, leaving the required words.

The statistical information acquisition unit 120 is configured to measure the frequency with which each extracted word occurs in a document and to obtain word statistics using the word's occurrence frequency and inverse document frequency.

The statistical information acquisition unit 120 may include a frequency measurement unit 121 and a statistical information generation unit 125. The frequency measurement unit 121 measures the in-document occurrence frequency of each word extracted by the word extraction unit 110. For each extracted word, the statistical information generation unit 125 calculates the inverse document frequency (IDF) by dividing the total number of documents by the number of documents that contain the word, and obtains the word statistics by multiplying the word's occurrence frequency by its inverse document frequency.
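The frequency and inverse-document-frequency computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `word_statistics` and the sample documents are hypothetical, and the IDF is taken literally from the text as the plain quotient of total documents over documents containing the word (many TF-IDF variants take the logarithm of this ratio instead).

```python
from collections import Counter

def word_statistics(docs):
    """docs: list of token lists. Returns one {word: tf * idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in docs for word in set(tokens))
    stats = []
    for tokens in docs:
        tf = Counter(tokens)  # raw occurrence count within this document
        # IDF as worded in the text: total documents / documents containing the word.
        stats.append({w: tf[w] * (n_docs / df[w]) for w in tf})
    return stats

# Hypothetical three-document population.
docs = [["battery", "cell", "battery"], ["cell", "display"], ["display", "panel"]]
stats = word_statistics(docs)
```

A word that is frequent in one document but rare across the population ("battery" here) receives the largest weight, which is the property the later topic analysis relies on.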

Meanwhile, the information entropy calculation unit 130 is configured to calculate the information entropy of each document by summing over the occurrence probabilities of the preset words of interest, taken relative to the total number of word occurrences measured in the document. A word of interest may be a core keyword or a noise keyword, and a different weight may be set for each keyword and reflected in the information entropy calculation.

Information entropy is a key concept in information theory that measures the uncertainty of a situation. A situation with high uncertainty has a high information entropy, and a situation with low uncertainty has a low information entropy. For example, tossing a coin is less uncertain than rolling a die: with two possible outcomes rather than six, it has lower uncertainty and a lower information entropy. The amount of information in a system also depends on the probability of each event. With the number of events and all other conditions held fixed and only the event probabilities allowed to vary, the case in which all events are equally likely is the hardest to predict and therefore has the highest information entropy.
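The coin and die comparison can be checked numerically. The sketch below assumes the standard Shannon entropy with a base-2 logarithm (the base is an assumption; it only changes the unit), and the biased die is a made-up distribution used to show that a non-uniform distribution over the same six outcomes has lower entropy.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = shannon_entropy([0.5, 0.5])                      # two equally likely outcomes
die = shannon_entropy([1 / 6] * 6)                      # six equally likely outcomes
biased_die = shannon_entropy([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # hypothetical loaded die
```

The fair coin has lower entropy than the fair die, and the fair die has the highest entropy among six-outcome distributions.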

Such information entropy can be measured for a discrete probability distribution using the Shannon entropy of Equation 1 below. The equation expresses the amount of information in a system as a number, which indicates how diverse the information in that system is. In other words, a document that evenly contains the information the user wants, namely the words of interest, is likely to be a document of interest, so the information entropy of a document is calculated by measuring the amount of information for the words of interest.

[Equation 1]

$$H = -\sum_{i} p_i \log p_i$$

where i is a word of interest and p_i is the probability value of word of interest i.
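A per-document version of this calculation might look like the sketch below. It assumes Equation 1 is the standard Shannon entropy, that p_i is the count of word of interest i divided by the total number of words in the document, and that the logarithm is base 2. The function name and sample tokens are hypothetical, and the optional per-keyword weights mentioned in the text are not modeled.

```python
import math
from collections import Counter

def document_entropy(tokens, interest_words):
    """Sum of -p_i * log2(p_i) over the words of interest, where p_i is the
    share of the document's tokens equal to word of interest i."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    entropy = 0.0
    for word in interest_words:
        p = counts[word] / total
        if p > 0:  # a word of interest that never occurs contributes nothing
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical document and words of interest.
tokens = ["battery", "cell", "anode", "battery"]
h = document_entropy(tokens, ["battery", "anode"])
```

A document that mentions none of the words of interest scores zero, so under these assumptions the score separates documents that touch the target topic from those that do not.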

The reference seed selection unit 140 selects, as a reference seed, at least one specific document of interest from among the plurality of documents, based on the information entropy of each document calculated by the information entropy calculation unit 130. One or more reference seeds may be selected, and the unit may compare each information entropy with a set reference value and recommend as reference seeds the documents that fall outside that value.
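One way to read this recommendation step is sketched below. It is an assumption-laden illustration: "falling outside the reference value" is interpreted here as exceeding a threshold, the `top_n` cap is added for convenience, and the function name and entropy values are hypothetical.

```python
def recommend_seeds(entropies, threshold, top_n=1):
    """entropies: {doc_id: information entropy}.
    Returns up to top_n document ids whose entropy exceeds the threshold,
    highest entropy first (ties broken by document id)."""
    candidates = [(h, doc_id) for doc_id, h in entropies.items() if h > threshold]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in candidates[:top_n]]

# Hypothetical per-document entropies.
entropies = {"A": 0.2, "B": 1.1, "C": 0.9}
seeds = recommend_seeds(entropies, threshold=0.8, top_n=2)
```

In the described workflow the recommended documents are candidates only; the user still confirms which of them actually serve as reference seeds.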

The seed clustering unit 150 is configured to calculate a per-topic probability distribution through topic analysis of each document, based on the word statistics for the plurality of documents generated by the statistical information generation unit 125, and to calculate the similarity to the selected reference seed using the per-topic probability distribution of each document.

As shown in FIG. 3, the seed clustering unit 150 may include a topic probability distribution calculation unit 151, a similarity calculation unit 153, and a similar document clustering unit 155. The topic probability distribution calculation unit 151 applies a topic modeling algorithm to the word statistics generated by the statistical information generation unit 125 to calculate the probability distribution of each document over the topics. The similarity calculation unit 153 compares the topic probability distribution of the document selected as the reference seed by the reference seed selection unit 140 with the topic probability distribution of each other document under comparison, and calculates the similarity between the seed document and that document. The similar document clustering unit 155 clusters and provides the documents similar to the reference seed based on the similarity calculated by the similarity calculation unit 153.

The topic modeling algorithm may be any one of LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and PLSA (Probabilistic Latent Semantic Analysis), and the similarity measure may be any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient. These algorithms are well known, so a detailed description is omitted.
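Two of the named measures can be sketched directly on topic probability vectors. The example assumes each document's topic distribution is already available as a list of probabilities summing to 1 (for instance from an LDA model); the vectors below are invented for illustration.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions
    (0 for identical distributions, 1 for distributions with disjoint support)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Hypothetical topic distributions over three topics.
seed = [0.7, 0.2, 0.1]
near = [0.6, 0.3, 0.1]
far = [0.1, 0.1, 0.8]
```

Note the opposite orientation of the two measures: a smaller Hellinger distance means more similar, while a larger cosine similarity means more similar, so any threshold comparison has to match the measure chosen.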

The interest document filtering unit 160 selects and filters documents of interest according to the per-document similarity information calculated by the seed clustering unit 150. It may compare the calculated similarity of each document with a preset threshold to recommend documents of interest similar to the reference seed, and may mark the documents of interest, including the reference seed, separately in the population document data stored in the document storage unit 10. If the documents of interest are noise, they are then removed from the population; if they are core patents, they are extracted from the population.
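The threshold comparison and the two outcomes (removing noise, extracting core documents) could be sketched as follows. The sketch assumes a distance-type measure where smaller values mean more similar; the function names, the `mode` strings, and the sample values are hypothetical.

```python
def recommend_similar(distances, threshold):
    """distances: {doc_id: distance to the reference seed}; smaller is more similar.
    Returns the ids within the threshold, sorted for stable output."""
    return sorted(doc_id for doc_id, dist in distances.items() if dist <= threshold)

def apply_filter(population, interest_docs, mode):
    """mode 'noise' removes the documents of interest from the population;
    mode 'core' keeps only the documents of interest."""
    selected = set(interest_docs)
    if mode == "noise":
        return [d for d in population if d not in selected]
    if mode == "core":
        return [d for d in population if d in selected]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical population with one seed and three other documents.
population = ["seed", "P1", "P2", "P3"]
distances = {"P1": 0.12, "P2": 0.55, "P3": 0.20}
interest = ["seed"] + recommend_similar(distances, threshold=0.25)
```

Running the same seed-then-filter cycle again on the reduced population corresponds to the repeated refinement the description compares to crystal growth.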

In this way, the present invention uses the words of interest to obtain the information entropy of each document and then selects reference seeds of interest based on that entropy. Documents similar to the reference seed are then found, clustered, and provided.

Here, the words of interest are used to calculate the information entropy of each document, the reference seed is selected based on that information entropy, the per-document TF-IDF values are used to calculate the topic probability distributions needed for similarity measurement, and the similarity is calculated by comparing the topic probability distribution vector of the reference seed with that of each other document.

The Shannon entropy algorithm may be used to calculate the information entropy of each document; the Latent Dirichlet Allocation, Latent Semantic Analysis, or Probabilistic Latent Semantic Analysis algorithm may be used to obtain the topic probability distribution of each document; and the Hellinger distance, cosine similarity, or Jaccard similarity coefficient may be used to measure the similarity between documents.

The overall operation of the interest document filtering system configured as above is described in more detail below with reference to FIGS. 4 to 9, taking the case where the documents of interest are patent documents as an example.

FIG. 4 is a flowchart illustrating the method for filtering documents of interest according to an embodiment of the present invention.

First, a keyword search query or patent number (application number, publication number, or registration number) entered by the user is used to retrieve the relevant patent documents from a patent database, and the retrieved patent documents may be loaded into the document filtering system 1 and stored in the document storage unit 10. The document filtering system 1 may then receive keywords for the words of interest and store them in the interest word storage unit 50 (S100). When retrieved by keyword, the patent documents are patent data in Excel format that include noise, core patents, and so on, and each document may include bibliographic information (application number, filing date, publication number, publication date, applicant, title, etc.) as well as the abstract, claims, and citation information.

For example, the document filtering system 1 may have a user interface as shown in FIG. 5. The user selects patent loading (①) to load the retrieved patent population into the system 1 and store it in the document storage unit 10, and enters keywords for the words of interest (②) to store them in the interest word storage unit 50. A word of interest is a keyword of interest and may be a core keyword or a noise keyword.

이어, 필터링 시스템(1)의 단어추출부(110)는 문서저장부(10)에 저장된 복수의 문서에 포함된 단어를 추출하고, 통계정보획득부(120)는 복수의 문서 각각이 포함하는 단어의 출현 빈도를 산출하여 복수의 문서에 대한 단어통계 정보를 생성할 수 있다(S200). The word extracting unit 110 of the filtering system 1 extracts words contained in a plurality of documents stored in the document storage unit 10 and the statistical information obtaining unit 120 obtains a word included in each of the plurality of documents The word statistics information for a plurality of documents can be generated (S200).

In an embodiment, as shown in FIG. 6, the word extraction unit (110) may extract the sentences contained in a document through natural language processing (S210), and may extract the words corresponding to adjectives and nouns through part-of-speech analysis of the extracted sentences (S220). The word extraction unit (110) may then remove, from the extracted words, any word contained in a preset stopword list (S230).
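Steps S210 to S230 can be sketched as follows. This is a minimal illustration only: `POS_LEXICON` and `STOPWORDS` are hypothetical stand-ins for a trained part-of-speech tagger and a curated stopword list, neither of which is specified in this description.

```python
import re

# Hypothetical stand-ins: a real system would use a trained part-of-speech
# tagger and a curated stopword list instead of these tiny tables.
POS_LEXICON = {
    "resin": "NOUN", "layer": "NOUN", "laser": "NOUN", "method": "NOUN",
    "liquid": "ADJ", "thin": "ADJ", "cures": "VERB", "is": "VERB",
    "the": "DET", "a": "DET", "by": "ADP",
}
STOPWORDS = {"method"}


def split_sentences(text):
    """S210: split a document into sentences on '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_words(text):
    """S220-S230: keep nouns and adjectives, then drop stopwords."""
    words = []
    for sentence in split_sentences(text):
        for token in re.findall(r"[a-z0-9]+", sentence.lower()):
            is_kept_pos = POS_LEXICON.get(token) in {"NOUN", "ADJ"}
            if is_kept_pos and token not in STOPWORDS:
                words.append(token)
    return words
```

The word list returned for each document is the input assumed by the frequency calculations that follow.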

Next, the statistical information acquisition unit (120) may calculate, for each document, the occurrence frequency of the extracted words (S240). It may then calculate, for each word, the inverse document frequency (idf), a statistic relating the number of documents that contain the word to the total number of documents (S250).

Here, the term frequency (tf) is a value indicating how often a particular word appears in a document, and it may be calculated by scaling the word's frequency according to the length of the document. For example, the frequency measurement unit (121) may use the raw occurrence count as the term frequency (tf), or may calculate it by Equation 2 below. In the latter case, the most frequent word in a document has a value of 1 and every other word has a value less than 1.

[Equation 2]

Here, t is an arbitrary word, d is an arbitrary document, w is any word in document d, and f(t,d) is the frequency of word t in document d.
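A minimal sketch of one length-adjusted term frequency is given below. The exact form of Equation 2 is not reproduced in this text, so dividing each count by the count of the most frequent word is an assumption; it is chosen only because it satisfies the stated property that the most frequent word scores 1 and all others score less. The function name is hypothetical.

```python
from collections import Counter


def term_frequencies(words):
    """Return max-normalized term frequencies for one document's word list."""
    counts = Counter(words)       # f(t, d): raw count of each word t
    if not counts:
        return {}
    peak = max(counts.values())   # count of the most frequent word w in d
    # Assumed normalization: f(t, d) divided by the largest f(w, d).
    return {t: f / peak for t, f in counts.items()}
```

For a document with the words resin (3 times), laser (2 times), and layer (once), this gives 1 for resin, 2/3 for laser, and 1/3 for layer.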

In addition, the statistical information generation unit (125) calculates the inverse document frequency (idf), which indicates how commonly a given word appears across the whole set of documents. The inverse document frequency may be calculated by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result.

For example, the inverse document frequency (idf) may be calculated by Equation 3 below.

[Equation 3]

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

Here, t is an arbitrary word, d is an arbitrary document, |D| is the total number of documents, and |{d∈D : t∈d}| is the number of documents containing the word t.
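The inverse document frequency described above (total number of documents divided by the number of documents containing the word, then a logarithm) can be sketched as follows. The function name and the default logarithm base are assumptions; the description does not fix the base.

```python
import math


def inverse_document_frequency(docs, base=10.0):
    """Return idf(t) = log(|D| / number of documents containing t).

    docs: list of word lists, one per document.
    base: logarithm base (an assumption; not fixed by the description).
    """
    total = len(docs)
    containing = {}
    for words in docs:
        for t in set(words):      # count each document at most once per word
            containing[t] = containing.get(t, 0) + 1
    return {t: math.log(total / n, base) for t, n in containing.items()}
```

A word that occurs in every document gets an idf of 0, and rarer words get larger values.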

Next, the statistical information generation unit (125) may multiply the term frequency (tf) by the inverse document frequency (idf) to calculate a TF-IDF (Term Frequency-Inverse Document Frequency) value, and may obtain this TF-IDF value as the word statistical information (S260).

For example, the TF-IDF value may be calculated by Equation 4 below. A constant is added to the inverse document frequency to keep it from becoming negative, which can happen depending on the base of the logarithm; when the base of the logarithm is greater than 1, the constant may be omitted.

[Equation 4]

The TF-IDF value increases as the word appears more frequently within a given document and as fewer documents in the whole collection contain the word.
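The sketch below combines the two quantities into a document-by-word TF-IDF matrix. It is an illustration under stated assumptions: term frequency is max-normalized, the natural logarithm is used, and `offset` is a hypothetical parameter standing for the constant that may be added to the inverse document frequency (defaulting to 0, which the description allows when the logarithm base is greater than 1).

```python
import math
from collections import Counter


def tfidf_matrix(docs, offset=0.0):
    """Return (vocabulary, rows): one row of TF-IDF weights per document."""
    vocab = sorted({t for words in docs for t in words})
    total = len(docs)
    # Number of documents that contain each word.
    containing = Counter(t for words in docs for t in set(words))
    idf = {t: math.log(total / containing[t]) for t in vocab}
    rows = []
    for words in docs:
        counts = Counter(words)
        peak = max(counts.values(), default=1)  # avoid dividing by zero
        rows.append([counts[t] / peak * (idf[t] + offset) for t in vocab])
    return vocab, rows
```

With `offset=0.0`, a word that occurs in every document contributes a weight of 0 everywhere, while a word concentrated in a few documents receives the largest weights there.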

The statistical information generation unit (125) may use the TF-IDF values of the words extracted from each document to put the plurality of documents stored in the document storage unit (10) into a structured form.

For example, when the document storage unit (10) holds 8 documents and a total of 7 words are extracted from them, the statistical information generation unit (125) may calculate the TF-IDF value of each extracted word and arrange the values in a structured form as shown in Table 1 below.

[Table 1]

FIG. 7 illustrates the operation of the word extraction unit (110) and the statistical information acquisition unit (120) described above.

Meanwhile, the information entropy calculation unit (130) may calculate the information entropy (Shannon entropy) with respect to the words of interest stored in the interest word storage unit (50) (S300). The words of interest are keywords related to noise or to core patents. They may be set by the user, and a different weight may be set for each word of interest. Here, the weight means that a weighting may be applied to the occurrence frequency of the corresponding word of interest.

Specifically, the information entropy calculation unit (130) may calculate the information entropy for the words of interest using the frequencies of the words in the document that are words of interest and of those that are not. The frequency of each word may be taken from the frequency information measured by the frequency measurement unit (121). In the user interface of FIG. 5, the 'Seed patent calculation (③)' button may serve as the command to calculate the information entropy.

Here, the information entropy may be calculated by Equation 5 below.

[Equation 5]

$$n = -\sum_{i=1}^{k} h_i \log h_i$$

Here, n is the required information amount (information entropy) of each document, k is the number of word-of-interest classes, and h_i is the occurrence probability of word of interest i, that is, the occurrence frequency of the words corresponding to word of interest i divided by the total number of word occurrences in the document.

For example, when there are three keywords of interest, Stereo, Lithography, and 3D, the frequencies in a particular document of the words that are words of interest and of the words that are not may be as shown in Table 2 below.

[Table 2]

Here, h_i for Stereo in the particular document is the occurrence frequency of Stereo, 4, divided by the total occurrence frequency of all words, 50. On this basis, applying the words of interest (k1 to k3) of Table 2 to Equation 5 gives an information entropy of 1.08 for the document.

The matrix of Table 2 is generated for each document individually. In the present invention, all words other than those on the word-of-interest list are treated as a single word, forming one word group (non-interest words), so as to maximize the share of the information amount attributable to the core words (words of interest).
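The per-document entropy can be sketched as below. Several points are assumptions rather than details taken from this description: the logarithm base (2 by default), whether the pooled non-interest group counts as a class in the sum (controlled by the hypothetical `include_other` flag), and the omission of per-word weights. The word counts in the usage note are illustrative and are not the values of Table 2.

```python
import math
from collections import Counter


def interest_entropy(words, interest_words, include_other=True, base=2):
    """Shannon entropy of one document over its word-of-interest classes.

    words: list of extracted words for one document.
    interest_words: set of words of interest.
    include_other: if True, all non-interest words are pooled into one class.
    """
    total = len(words)
    if total == 0:
        return 0.0
    counts = Counter(w for w in words if w in interest_words)
    class_counts = list(counts.values())
    if include_other:
        # Pool every non-interest word into a single extra class.
        class_counts.append(total - sum(counts.values()))
    # h_i: class count divided by the total number of word occurrences.
    probs = [c / total for c in class_counts if c > 0]
    return -sum(h * math.log(h, base) for h in probs)
```

For a 16-word document with 4 occurrences of "stereo", 4 of "lithography", and 8 other words, the class probabilities are 0.25, 0.25, and 0.5, giving an entropy of 1.5 bits with the pooled class and 1.0 without it.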

Next, the interest seed selection unit (140) compares the calculated information entropy of each document with a preset information entropy to select the reference seed documents (S400). When noise is to be extracted, a document is selected as a reference seed if its information entropy is smaller than the preset reference value; when core patents are to be extracted, a document is selected as a reference seed only if its information entropy is larger than the preset reference value.
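The selection rule of step S400 can be written as a short sketch. The function name, the `mode` strings, and the dictionary input are hypothetical choices made for illustration.

```python
def select_seeds(entropies, threshold, mode):
    """Return the ids of documents chosen as reference seeds.

    entropies: dict mapping document id -> information entropy.
    threshold: preset reference value.
    mode: "noise" keeps documents below the threshold,
          "core" keeps documents above it.
    """
    if mode == "noise":
        return [d for d, n in entropies.items() if n < threshold]
    if mode == "core":
        return [d for d, n in entropies.items() if n > threshold]
    raise ValueError("mode must be 'noise' or 'core'")
```

Because the comparison is applied to every document, more than one reference seed may be returned, which matches the case of a plurality of reference seeds.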

The seed clustering unit (150) may then extract and cluster at least one document similar to the reference seed selected by the interest seed selection unit (140) (S500). To extract the similar documents, the seed clustering unit (150) analyzes the vector similarity between documents using a topic modeling algorithm, which allows even latent relationships between documents to be taken into account. If the interest seed selection unit (140) has selected a plurality of reference seeds, the seed clustering unit (150) may extract at least one similar document for each reference seed document.

For example, as shown in FIG. 8, the seed clustering unit (150) may apply a topic modeling algorithm to the per-word weights of each document (the TF-IDF matrix) generated by the statistical information generation unit (125) to calculate the probability distribution with which each document belongs to each topic (S510). The topic modeling algorithm may be, for example, the Latent Dirichlet Allocation (LDA) algorithm.

The LDA algorithm is a known technique and a tool commonly used to classify documents by topic; it is described briefly here with reference to the MATLAB code of FIG. 9. LDA starts from the premise that a document is a bag of words, that each document has particular topics, and that topics are shared across documents. For example, suppose there are eight documents as in FIG. 9a, each composed of 16 words in total; the occurrence frequency of each word can then be shown in color. The deeper the green, the higher the occurrence frequency of the word; the deeper the blue, the lower. In document 7 of FIG. 9a, only the word at matrix position (3,4) has a markedly high occurrence frequency.

FIG. 9b shows the distributions for the topics: there are eight topics (Topic 1 to Topic 8), and the figure shows which words each topic contains. That is, a topic is a distribution over words. For example, in Topic 1 the first through fourth words ((1,1) to (1,4)) have a high occurrence frequency. Accordingly, applying LDA to the per-word weights of each document yields a pattern similar to FIG. 9b, from which each topic is found.

FIG. 9c shows the distribution of topics for each document; red is the distribution used to generate the data, and blue is the one found by LDA. That is, ignoring the order of the topics on the x-axis, LDA recovers closely matching topics for each document.

Besides Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) or Probabilistic Latent Semantic Analysis (PLSA) may also be used as the topic modeling algorithm.

The number of topics (a topic corresponds to a 'technical field') may be preset in the filtering system (1); over several tests, classifying into 8 to 10 topics was found to be most appropriate. Accordingly, as shown in Table 3 below, the number of topics was first set to 9, and LDA was then applied to a large number of documents to classify them by topic.

As in Table 3 below, the LDA result gives the number of patent documents belonging to each topic and the main keywords that make up each topic, and the main keywords of each topic can be used to judge the character of that topic. For example, Topic 1 can be inferred to be a technology cluster for bonding small particles (adhesive particulate bonding).

[Table 3]

In this way, the topic probability distribution calculation unit (151) may extract the keywords of each topic and calculate, for each patent document, the probability distribution of belonging to each topic, as shown in Table 4 below.

[Table 4]

Next, the similarity calculation unit (153) may perform a similarity analysis between documents using the probability distributions over topics to calculate the inter-document similarity. The similarity may be calculated by any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient (S520).

For example, the similarity calculation unit (153) may calculate the similarity between the reference seed selected by the interest seed selection unit (140) and each other document under comparison using the Hellinger distance H(P,Q) of Equation 6 below.

[Equation 6]

$$H(P,Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$

Here, i is a topic, k is the number of topics, p_i is the topic probability distribution of the reference seed document, and q_i is the topic probability distribution of the document under comparison.

The value H(P,Q) given by the Hellinger distance lies between 0 and 1; the smaller the value, the more similar the two documents, and the larger the value, the less similar. For ease of intuitive understanding, the final similarity value S(P,Q) may therefore be obtained by subtracting the Hellinger distance H(P,Q) from 1, as in Equation 7 below.

[Equation 7]

$$S(P,Q) = 1 - H(P,Q)$$

Table 5 below shows an example of the inter-document similarity matrix calculated by Equations 6 and 7.

[Table 5]
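The Hellinger distance and the derived similarity can be sketched as follows, using the standard normalized definition whose value lies between 0 and 1. The function names are hypothetical, and the inputs are assumed to be topic probability distributions of equal length.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / 2.0)   # the factor 1/2 bounds the result by 1


def similarity(p, q):
    """Similarity S = 1 - H, so that a larger value means more alike."""
    return 1.0 - hellinger(p, q)
```

Identical distributions give a distance of 0 and a similarity of 1, while distributions that share no topic give a distance of 1 and a similarity of 0.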

Next, the similar document clustering unit (155) clusters and provides the documents similar to the reference seed on the basis of the similarity information calculated by the similarity calculation unit (153) (S530). When there are a plurality of reference seeds, similar documents may be clustered for each reference seed.

Thereafter, the interest document filtering unit (160) may filter the reference seed and its similar documents clustered by the seed clustering unit (150) against a preset threshold to remove or extract them (S600). That is, the interest document filtering unit (160) may compare the similarity between the reference seed and each document under comparison with the preset threshold to select and recommend the documents of interest similar to the reference seed. The interest document filtering unit (160) may also mark the documents of interest, including the reference seed, separately within the population document data stored in the document storage unit (10). Subsequently, documents of interest that are noise are removed from the population, and documents of interest that are core patents are extracted from the population.
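A minimal sketch of step S600 follows. It assumes the comparison is made on the distance to the reference seed, with documents whose distance is below the threshold treated as documents of interest; the function name, the `mode` strings, and the default distance of 1.0 for documents without a computed value are hypothetical.

```python
def filter_by_seed(distances, threshold, population, mode):
    """Split a population using distances to one reference seed.

    distances: dict mapping document id -> distance to the seed.
    Returns (interest_docs, resulting_population).
    """
    # Documents closer to the seed than the threshold are of interest.
    interest = [d for d in population if distances.get(d, 1.0) < threshold]
    if mode == "noise":
        # Noise: remove the seed-like documents from the population.
        result = [d for d in population if d not in interest]
    elif mode == "core":
        # Core patents: extract only the seed-like documents.
        result = list(interest)
    else:
        raise ValueError("mode must be 'noise' or 'core'")
    return interest, result
```

Running the function once per reference seed, and repeating with newly selected seeds, corresponds to filtering the population in stages.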

That is, as shown in FIG. 10, the filtering system (1) clusters and provides the documents similar to a reference seed, presenting at least one reference seed (④), the documents similar to that seed (⑤), the information entropy value of each document (⑦), and its similarity (distance; ⑧) to the reference seed. FIG. 10 shows patent number 2002-018463, one of the reference seeds (④), together with the patents (⑤) similar to that reference patent. Here, only patents whose similarity (⑧) is lower than the preset threshold (⑥) are extracted as similar patents (⑤). Selecting a different reference seed re-extracts and displays the patents similar to that seed.

The steps from reference seed selection (S400) through similar document filtering (S600) may be repeated as needed to filter core patents or noise in stages.

The present invention described above is limited not by the foregoing embodiments and the accompanying drawings but by the claims set out below, and a person of ordinary skill in the art to which the invention pertains will readily appreciate that its configuration may be variously changed and modified without departing from the technical idea of the invention.

1: document filtering system 10: document storage unit
50: interest word storage unit 100: information processing unit
110: word extraction unit 120: statistical information acquisition unit
121: frequency measurement unit 125: statistical information generation unit
130: information entropy calculation unit 140: interest seed selection unit
150: seed clustering unit 151: topic probability distribution calculation unit
153: similarity calculation unit 155: similar document clustering unit
160: interest document filtering unit

Claims

A method of filtering documents of interest, comprising:
(a) selecting, from among a plurality of documents, at least one specific document of interest as a reference seed;
(b) analyzing the topic of each document based on word statistical information on the plurality of documents to calculate a topic-specific probability distribution, and calculating a similarity to the reference seed using the topic-specific probability distribution of each document; and
(c) selecting and filtering documents of interest according to the calculated similarity.

The method of claim 1, wherein the reference seed of step (a)
is selected based on information entropy calculated from the occurrence frequency probabilities, in each of the plurality of documents, of preset words of interest.

The method of claim 2,
wherein the reference seed is selected by comparing the calculated information entropy with a preset range or condition of information entropy for seeds.

The method of claim 2, wherein the information entropy (n)
is the Shannon entropy calculated by the following equation.
[Equation]

$$n = -\sum_{i=1}^{k} h_i \log h_i$$

Here, n is the required information amount of each document, k is the number of word-of-interest classes, and h_i is the occurrence probability of word of interest i, that is, the occurrence frequency of the words corresponding to word of interest i divided by the total number of word occurrences in the document.

The method of claim 1, wherein the word statistical information of step (b)
is generated by extracting words from each of the plurality of documents and measuring the occurrence frequency of the extracted words in each document.

The method of claim 1,
wherein the reference seed is a document of interest to be extracted and is either noise or core data.

The method of claim 1, wherein the word statistical information of step (b) is generated by:
extracting the words contained in the plurality of documents;
measuring, for each of the plurality of documents, the occurrence frequency of the extracted words in the document;
calculating, for each extracted word, an inverse document frequency by dividing the total number of documents by the number of documents containing the word; and
multiplying the occurrence frequency of the word by the inverse document frequency.

The method of claim 7, wherein extracting the words comprises:
extracting the sentences contained in a document through natural language processing of the document;
extracting the words corresponding to adjectives and nouns through part-of-speech analysis of the extracted sentences; and
removing, from the extracted words, any word contained in a preset stopword list.

The method of claim 1, wherein step (b) comprises:
applying a topic modeling algorithm to the word statistical information to calculate the probability distribution with which each document belongs to each topic;
calculating the similarity between the reference seed and each document under comparison using the topic probability distribution of each document and the topic probability distribution of the document selected as the reference seed; and
clustering and providing the documents similar to the reference seed based on the calculated similarity.

The method of claim 9,
wherein the topic modeling algorithm is any one of Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (PLSA), and the similarity is calculated by any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient.

The method of claim 1, wherein the similarity in step (b)
is calculated by applying the Hellinger distance H(P,Q) of the following equation to the probability distribution with which each document belongs to each topic.
[Equation]

$$H(P,Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{t} \left(\sqrt{p_k} - \sqrt{q_k}\right)^2}$$

Here, k is a topic, t is the number of topics, p_k is the topic probability distribution of the reference seed, and q_k is the topic probability distribution of the document under comparison.

The method of claim 11,
wherein, because a smaller value of H(P,Q) indicates more similar document probability distributions, the final similarity value S(P,Q) is determined by the following equation.
[Equation]

$$S(P,Q) = 1 - H(P,Q)$$

The method of claim 1,
wherein the words of interest comprise at least one of a core keyword and a noise keyword.

A system for filtering documents of interest, comprising:
an interest seed selection unit that selects, from among a plurality of documents, at least one specific document of interest as a reference seed;
a seed clustering unit that calculates a topic-specific probability distribution through topic analysis of each document based on word statistical information on the plurality of documents, and calculates a similarity to the selected reference seed using the topic-specific probability distribution of each document; and
an interest document filtering unit that selects and filters documents of interest according to the calculated similarity of each document.

The system of claim 14,
wherein the interest seed selection unit selects the reference seed based on information entropy calculated from the occurrence frequency probabilities, in each of the plurality of documents, of preset words of interest.

The system of claim 14, further comprising:
a word extraction unit that extracts the words of each document from a plurality of collected documents;
a statistical information acquisition unit that measures the occurrence frequency of the extracted words in each document and acquires the word statistical information using the occurrence frequency and the inverse document frequency of the words; and
an information entropy calculation unit that calculates the information entropy of each document from the probability of the occurrence frequency of preset words of interest in the document relative to the measured total number of word occurrences in the document.

The system of claim 16, wherein the statistical information acquisition unit comprises:
a frequency measurement unit that measures the occurrence frequency of the words extracted by the word extraction unit; and
a statistical information generation unit that calculates, for each extracted word, an inverse document frequency by dividing the total number of documents by the number of documents containing the word, and multiplies the occurrence frequency of the word by the inverse document frequency to generate the word statistical information.

The system of claim 14, wherein the seed clustering unit comprises:
a topic probability distribution calculation unit that applies a topic modeling algorithm to the generated word statistical information to calculate the probability distribution with which each document belongs to each topic;
a similarity calculation unit that compares the topic probability distribution of the document selected as the reference seed by the interest seed selection unit with the topic probability distribution of each other document under comparison to calculate the similarity between the reference seed document and that document; and
a similar document clustering unit that clusters and provides the documents similar to the reference seed based on the similarity calculated by the similarity calculation unit.

The system of claim 14,
wherein the words of interest comprise at least one of a core keyword and a noise keyword.