KR102468930B1

KR102468930B1 - System for filtering documents of interest and method thereof

Info

Publication number: KR102468930B1
Application number: KR1020160015567A
Authority: KR
Inventors: 윤장혁; 김무진; 최덕용; 경진영
Original assignee: 특허법인(유한) 해담; 윤장혁; 김무진; 최덕용
Priority date: 2015-02-09
Filing date: 2016-02-11
Publication date: 2022-11-23
Also published as: KR20160098084A

Abstract

The present invention provides a system and method for filtering documents of interest that extract the words contained in documents such as patents or papers, select a reference seed document through information-content analysis of those words, and cluster documents similar to the reference seed through semantic similarity analysis, so that noise documents or core documents can be filtered from a plurality of documents. The method may include: selecting at least one specific document of interest among the plurality of documents as a reference seed; performing topic analysis of each document based on word statistics for the plurality of documents to compute a per-topic probability distribution, and computing a similarity (distance) to the reference seed from each document's per-topic probability distribution; and selecting and filtering documents of interest according to the computed similarity.

Description

System for filtering documents of interest and its method {SYSTEM FOR FILTERING DOCUMENTS OF INTEREST AND METHOD THEREOF}

The present invention relates to a system and method for filtering documents of interest that automatically extract the noise or core data of interest from collected population data.

In general, a set of technical documents (the population), such as patents or papers, collected through a keyword search contains noise documents that must be removed in addition to the valid documents in the target technology field to be analyzed. The valid documents, in turn, include core documents that are more closely related to the target technology field.

Efficient patent analysis requires collecting the valid patent documents, which in turn requires a filtering process that removes noise documents or extracts core documents.

Conventionally, for example, noise documents are removed by having a person review each patent document in the population one by one and discard the noise, a simple repetitive task that consumes a great deal of time and labor. Extracting core documents likewise takes a long time, because the content of every document is reviewed manually.

The patent document listed below relates to a document search and classification method and system, but it does not offer a solution to the problems described above.

Korean Patent Application Publication No. 10-2006-0047306

The present invention is intended to solve the above problems of the prior art. It provides a system and method for filtering documents of interest that extract the words contained in documents such as patents or papers, select a reference seed document through information-content analysis of those words, and cluster documents similar to the reference seed through semantic similarity analysis, so that noise documents or core documents can be filtered from a plurality of documents.

A method for filtering documents of interest according to an embodiment of the present invention may include: (a) selecting at least one specific document of interest among a plurality of documents as a reference seed; (b) performing topic analysis of each document based on word statistics for the plurality of documents to compute a per-topic probability distribution, and computing a similarity (distance) to the reference seed from each document's per-topic probability distribution; and (c) selecting and filtering documents of interest according to the computed similarity.

The reference seed in step (a) is selected based on an information entropy computed from the probability with which preset words of interest occur within each of the plurality of documents. The reference seed may be chosen by comparing the computed information entropy against a preset range or condition of information entropy for seeds.

Step (b) may include: applying a topic modeling algorithm to the word statistics to compute the probability distribution with which each document belongs to each topic; computing the similarity between the reference seed and each document to be compared, using the per-topic probability distribution of each document and that of the document selected as the reference seed; and clustering and providing the documents similar to the reference seed based on the computed similarity.

A system for filtering documents of interest according to an embodiment of the present invention may include: a reference seed selection unit that selects at least one specific document of interest among a plurality of documents as a reference seed; a seed clustering unit that computes a per-topic probability distribution through topic analysis of each document based on word statistics for the plurality of documents, and computes the similarity to the selected reference seed from each document's per-topic probability distribution; and a document-of-interest filtering unit that selects and filters documents of interest according to the computed per-document similarity.

The filtering system may further include: a word extraction unit that extracts words from each of the collected documents; a statistical information acquisition unit that measures the occurrence frequency of each extracted word within a document and obtains word statistics from the occurrence frequency and the inverse document frequency of the word; and an information entropy calculation unit that calculates the information entropy of each document from the occurrence probability of the preset words of interest in the document, taken relative to the measured total number of word occurrences in the document.

According to an embodiment of the present invention, a system for filtering documents of interest such as noise or core documents uses the Shannon entropy method to measure the degree to which a document carries information about the technology group under analysis. The likelihood that a document is a document of interest, which previously could be judged only through qualitative analysis, is thereby expressed as a quantified value that supports the expert's judgment.

This helps impose and maintain consistency in judging whether a document is a document of interest, a judgment that tends to become inconsistent when experts perform simple repetitive work over long periods. By systematically compensating for a process that is prone to mistakes and errors, the quality of the work can be improved.

In addition, topic modeling serves to select and cluster documents that are semantically similar to the reference seed. This lets an expert extract a document of interest that the expert is confident about and, at the same time, extract in one batch the documents of interest that are semantically similar to it. The man-hours an expert must spend on filtering documents of interest can thereby be reduced drastically.

Furthermore, because the filtering process is repeated multiple times, starting each time from the selection of a reference seed, the set of documents that still requires manual screening can be narrowed sufficiently. Compared with the time cost of the conventional approach, this greatly reduces the time experts must invest, which increases the speed and efficiency of research and development.

FIG. 1 is a conceptual diagram illustrating a system for filtering documents of interest according to the present invention.
FIG. 2 is a block diagram illustrating a system for filtering documents of interest according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the seed clustering unit shown in FIG. 2.
FIG. 4 is a flowchart illustrating a method for filtering documents of interest according to an embodiment of the present invention.
FIG. 5 is a UI screen of a document filtering system according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the word extraction and occurrence frequency calculation process of FIG. 4.
FIG. 7 is a diagram illustrating the process of FIG. 6.
FIG. 8 is a flowchart illustrating the seed clustering process of FIG. 4.
FIGS. 9A to 9C are diagrams illustrating the LDA applied to the calculation of the topic probability distribution of FIG. 8.
FIG. 10 is a screen showing documents similar to a reference seed, extracted according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings.

The embodiments of the present invention may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided so that the present invention is explained more completely to a person of ordinary skill in the art.

In the drawings referred to in the present invention, components having substantially the same configuration and function are denoted by the same reference numerals, and the shapes and sizes of elements in the drawings may be exaggerated for clarity.

The term 'unit' as used in this embodiment refers to software or to a hardware component such as a field-programmable gate array (FPGA) or an ASIC, and a 'unit' performs certain roles.

However, a 'unit' is not limited to software or hardware. A 'unit' may be configured to reside in an addressable storage medium, or may be configured to run on one or more processors.

Thus, as an example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The functions provided within components and 'units' may be combined into a smaller number of components and 'units', or further separated into additional components and 'units'.

In addition, components and 'units' may be implemented to run on one or more CPUs in a device or a secure multimedia card.

FIG. 1 is a conceptual diagram illustrating a system for filtering documents of interest according to the present invention.

In a patent or paper search, the search query is generally written loosely in order to obtain as broad a population as possible, so documents covering various applied technologies or having heterogeneous characteristics are all collected. As a result, the population contains noise, and a process of removing that noise or extracting the core documents becomes unavoidable.

Accordingly, the present invention proposes an efficient method for extracting noise or core documents. The method is inspired by crystal growth, a technique used to obtain pure crystals, and it proceeds through repeated refinement. In crystal growth, a pure crystal of sufficient size is obtained by growing a small pure crystal, which is comparatively easy to obtain. The small crystal acts as a crystal seed: placed in a stable saturated solution or the like, the seed induces crystallization of solute of the same kind and thereby grows into a crystal of sufficient size.

Likewise, the method proposed in the present invention uses Shannon information entropy (Shannon entropy) and semantic similarity analysis. Some of the documents of interest recommended by the system are selected as reference seeds, and documents with high semantic similarity to those seeds are clustered, extracted in a batch, and repeatedly refined to perform the filtering.

FIG. 2 is a block diagram illustrating a system for filtering documents of interest according to an embodiment of the present invention.

Referring to FIG. 2, the system 1 for filtering documents of interest according to an embodiment of the present invention comprises a document storage unit 10, a word-of-interest storage unit 50, and an information processing unit 100. The information processing unit 100 may include a word extraction unit 110, a statistical information acquisition unit 120, an information entropy calculation unit 130, a reference seed selection unit 140, a seed clustering unit 150, and a document-of-interest filtering unit 160.

The document storage unit 10 may receive and store a plurality of documents collected by a document search system through a keyword search query. In an embodiment, the documents may be patent documents or papers, and the document search system may be a search system such as Google Patents, Delphion, KIPRIS, WIPSON, WINTELIPS, or NDSL.

The word-of-interest storage unit 50 stores the words of interest, that is, the keywords of interest, entered by the user. The words of interest consist of at least one keyword and are used to calculate the information entropy of a document, as described later.

The word extraction unit 110 is configured to extract words from each of the collected documents. It extracts the sentences contained in a document through natural language processing, extracts the words corresponding to adjectives and nouns through part-of-speech analysis of the extracted sentences, and removes from the extracted words any word contained in a preset stopword list, leaving only the required words.

The statistical information acquisition unit 120 is configured to measure the occurrence frequency of each extracted word within a document and to obtain word statistics from the occurrence frequency and the inverse document frequency of the word.

Here, the statistical information acquisition unit 120 may include a frequency measurement unit 121 and a statistical information generation unit 125. The frequency measurement unit 121 is configured to measure the occurrence frequency within a document of each word extracted by the word extraction unit 110. For each extracted word, the statistical information generation unit 125 calculates the inverse document frequency (IDF) by dividing the total number of documents by the number of documents that contain the word, and obtains the word statistics by multiplying the occurrence frequency of the word by its inverse document frequency.
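The weighting just described can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `tf_idf` and the toy documents are hypothetical, and the IDF is the plain ratio stated in the text (many TF-IDF variants additionally take a logarithm of that ratio, which the text does not mention).

```python
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dict per document.

    tf  = raw occurrence count of the word in the document
    idf = (total number of documents) / (number of documents containing the word)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({w: tf * (n_docs / doc_freq[w]) for w, tf in term_freq.items()})
    return weights

# Hypothetical toy population of already-extracted words.
docs = [
    ["battery", "electrode", "battery"],
    ["battery", "antenna"],
    ["antenna", "signal", "signal"],
]
weights = tf_idf(docs)
```

In this toy population, "battery" occurs twice in the first document and appears in two of the three documents, so its weight there is 2 × 3/2 = 3.0.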

Meanwhile, the information entropy calculation unit 130 is configured to calculate the information entropy of each document by summing over the occurrence probabilities of the preset words of interest in the document, where each probability is taken relative to the measured total number of word occurrences in the document. A word of interest may be a core keyword or a noise keyword, and a different weight may be set for each keyword and reflected in the entropy calculation.

Information entropy is a central concept of information theory that measures the uncertainty of a situation: the higher the uncertainty, the higher the information entropy, and the lower the uncertainty, the lower the information entropy. For example, tossing a coin has lower uncertainty than rolling a die; that is, a situation with two possible outcomes has lower uncertainty, and lower information entropy, than one with six. The amount of information in a system also changes with the probability of each event. With the number of events and all other conditions held fixed, predicting the outcome is hardest when every event is equally likely, so that case has the highest information entropy.
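As a worked illustration of both claims, take the standard Shannon entropy $H = -\sum_i p_i \log_2 p_i$. The base-2 logarithm is an assumption made here for concreteness; the text does not fix a base. A fair coin and a fair die give

$$
H_{\text{coin}} = -2 \cdot \tfrac{1}{2}\log_2 \tfrac{1}{2} = 1 \text{ bit}, \qquad
H_{\text{die}} = -6 \cdot \tfrac{1}{6}\log_2 \tfrac{1}{6} = \log_2 6 \approx 2.585 \text{ bits},
$$

so the six-outcome case has the higher entropy. With the number of outcomes fixed at two, a biased coin with probabilities $(0.9,\ 0.1)$ gives

$$
H_{\text{biased}} = -\left(0.9\log_2 0.9 + 0.1\log_2 0.1\right) \approx 0.469 \text{ bits} < 1 \text{ bit},
$$

which is lower than the equiprobable case, as stated.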

For a discrete probability distribution, this information entropy can be measured with the Shannon entropy formula of Equation 1 below. The formula expresses the amount of information in a system as a number, which indicates how diverse the information in the system is. In other words, a document that contains the information the user wants, that is, the words of interest, in an evenly distributed manner is likely to be a document of interest, so the information entropy of a document is calculated by measuring the amount of information about the words of interest.

[Equation 1]

$$H = -\sum_{i} p_i \log p_i$$

where i denotes a word of interest and p_i is the probability value of word of interest i.
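A minimal Python sketch of this per-document calculation follows. It assumes that Equation 1 is the standard Shannon form, that the logarithm is base 2, and that a keyword weight enters as a simple multiplier on its term; none of these details is specified in the text, and the function name `keyword_entropy` is hypothetical.

```python
import math
from collections import Counter

def keyword_entropy(doc_words, keywords, weights=None):
    """Score a document as -sum_i w_i * p_i * log2(p_i) over the words of interest.

    p_i is the count of keyword i divided by the total number of words in the
    document. The weight w_i defaults to 1.0 (multiplicative weighting is an
    assumption of this sketch).
    """
    total = len(doc_words)
    if total == 0:
        return 0.0
    counts = Counter(doc_words)
    weights = weights or {}
    score = 0.0
    for kw in keywords:
        p = counts[kw] / total
        if p > 0:  # a keyword that never occurs contributes nothing
            score -= weights.get(kw, 1.0) * p * math.log2(p)
    return score

# Hypothetical example: 4 extracted words, 2 words of interest.
doc = ["battery", "anode", "battery", "cell"]
h = keyword_entropy(doc, ["battery", "anode"])
```

Here "battery" has p = 0.5 and "anode" has p = 0.25, each contributing 0.5, so `h` is 1.0. A document containing none of the words of interest scores 0.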

The reference seed selection unit 140 selects at least one specific document of interest among the plurality of documents as a reference seed, based on the information entropy of each document calculated by the information entropy calculation unit 130. One or more reference seeds may be selected, and documents whose information entropy falls outside a set reference value may be recommended as reference seeds by comparing the entropy against that value.
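One way to read this recommendation step is as a simple range check. The sketch below assumes the reference value is given as a range `[low, high]` and that candidates are ordered by entropy; the function name, the document ids, and the numeric bounds are all hypothetical.

```python
def recommend_seeds(entropies, low, high):
    """Return ids of documents whose entropy lies outside [low, high],
    ordered from highest to lowest entropy."""
    outside = [doc_id for doc_id, h in entropies.items() if h < low or h > high]
    return sorted(outside, key=lambda doc_id: entropies[doc_id], reverse=True)

# Hypothetical per-document entropies.
entropies = {"doc_a": 1.42, "doc_b": 0.10, "doc_c": 0.65, "doc_d": 0.00}
seeds = recommend_seeds(entropies, low=0.05, high=1.0)
```

With these values, `doc_a` (above the range) and `doc_d` (below it) are recommended; per the text, an expert would then confirm which recommendations actually serve as reference seeds.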

The seed clustering unit 150 is configured to compute a per-topic probability distribution through topic analysis of each document, based on the word statistics for the plurality of documents generated by the statistical information generation unit 125, and to compute the similarity to the selected reference seed from each document's per-topic probability distribution.

Here, as shown in FIG. 3, the seed clustering unit 150 may include a topic probability distribution calculation unit 151, a similarity calculation unit 153, and a similar document clustering unit 155. The topic probability distribution calculation unit 151 is configured to apply a topic modeling algorithm to the word statistics generated by the statistical information generation unit 125 and thereby compute the probability distribution with which each document belongs to each topic. The similarity calculation unit 153 is configured to compare the topic probability distribution of the document selected as the reference seed by the reference seed selection unit 140 with the topic probability distribution of each other document to be compared, and thereby compute the similarity between the reference seed document and that document. The similar document clustering unit 155 is configured to cluster and provide the documents similar to the reference seed, based on the similarity computed by the similarity calculation unit 153.

The topic modeling algorithm may be any one of LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and PLSA (Probabilistic Latent Semantic Analysis), and the similarity measure may be any one of the Hellinger distance, cosine similarity, and the Jaccard similarity coefficient. Because these algorithms are well known, a detailed description of them is omitted.
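For reference, two of the named measures can be written directly from their standard definitions. The sketch below assumes each document is represented by a topic probability vector of equal length; the function names and sample vectors are illustrative only. Note that the Hellinger distance is 0 for identical distributions and 1 for distributions with no overlap, whereas cosine similarity runs the other way (1 for identical direction).

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Hypothetical topic distributions over three topics.
seed_topics = [0.7, 0.2, 0.1]
other_topics = [0.1, 0.2, 0.7]
distance = hellinger_distance(seed_topics, other_topics)
similarity = cosine_similarity(seed_topics, other_topics)
```

Because a distance and a similarity order documents in opposite directions, any threshold comparison has to be written for the measure actually chosen.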

The document-of-interest filtering unit 160 selects and filters the documents of interest according to the per-document similarity information computed by the seed clustering unit 150. The document-of-interest filtering unit 160 may compare the computed similarity of each document against a preset threshold to recommend documents of interest similar to the reference seed, and it may mark the documents of interest, including the reference seed, separately in the population document data stored in the document storage unit 10. Thereafter, if the documents of interest are noise, they are removed from the population; if they are core patents, they are extracted from the population.
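The threshold comparison and the two outcomes (remove noise, extract core documents) can be sketched as follows. This assumes a similarity score where larger means closer to the seed; the function name `filter_population`, the `mode` flag, and the sample scores are hypothetical.

```python
def filter_population(population, similarity, threshold, mode):
    """Flag documents whose similarity to the seed meets the threshold.

    mode="noise": return the population with the flagged documents removed.
    mode="core":  return only the flagged documents.
    """
    flagged = {doc_id for doc_id in population
               if similarity.get(doc_id, 0.0) >= threshold}
    if mode == "noise":
        return [doc_id for doc_id in population if doc_id not in flagged]
    if mode == "core":
        return [doc_id for doc_id in population if doc_id in flagged]
    raise ValueError("mode must be 'noise' or 'core'")

# Hypothetical similarities to a seed; the seed itself scores 1.0.
population = ["seed", "doc_b", "doc_c", "doc_d"]
similarity = {"seed": 1.0, "doc_b": 0.91, "doc_c": 0.35, "doc_d": 0.80}
remaining = filter_population(population, similarity, threshold=0.8, mode="noise")
core = filter_population(population, similarity, threshold=0.8, mode="core")
```

If a distance such as the Hellinger distance is used instead of a similarity, the comparison would flip to "at most the threshold".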

Thus, in the present invention, the information entropy of each document is obtained using the words of interest, and a reference seed of interest is then selected based on the per-document information entropy. Documents similar to the reference seed are subsequently found, clustered, and provided.

Here, the words of interest are used to calculate the information entropy of each document; the reference seed is selected based on the information entropy of each document; the per-document TF-IDF values are used to compute the topic probability distributions needed for similarity measurement; and the similarity is computed by measuring the distance between the topic probability distribution vector of the reference seed and that of each other document.

In addition, the Shannon entropy formula may be used to calculate the information entropy of each document; the Latent Dirichlet Allocation, Latent Semantic Analysis, or Probabilistic Latent Semantic Analysis algorithm may be used to obtain the topic probability distribution of each document; and the Hellinger distance, cosine similarity, or Jaccard similarity coefficient may be used to measure the similarity between documents.

The overall operation of the document-of-interest filtering system configured as above is now described in more detail with reference to FIGS. 4 to 9. The description below takes as an example the case in which the documents of interest are patent documents.

FIG. 4 is a flowchart illustrating a method for filtering documents of interest according to an embodiment of the present invention.

First, a keyword search query or a patent number (application number, publication number, or registration number) entered by the user is received, and the related patent documents are retrieved from a patent database. The retrieved patent documents may be loaded into the document filtering system 1 and stored in the document storage unit 10. The document filtering system 1 may then receive the keywords for the words of interest and store them in the word-of-interest storage unit 50 (S100). When retrieved by keyword search, the patent documents are patent data in Excel format that contains noise, core patents, and so on, and each document may include bibliographic information (application number, application date, publication number, publication date, applicant, title, etc.) as well as the abstract, claims, and citation information.

For example, the user interface of the document filtering system 1 may be configured as shown in FIG. 5. The user may select patent loading (①) to load the retrieved patent population into the system 1 and store it in the document storage unit 10, and may enter the keywords for the words of interest (②) to store them in the word-of-interest storage unit 50. A word of interest is a keyword of interest and may be a core keyword or a noise keyword.

Next, the word extraction unit (110) of the filtering system (1) extracts the words contained in the plurality of documents stored in the document storage unit (10), and the statistical information acquisition unit (120) may calculate the frequency of occurrence of the words contained in each of the plurality of documents to generate word statistical information for the plurality of documents (S200).

In an embodiment, as shown in FIG. 6, the word extraction unit (110) may extract the sentences contained in a document through natural language processing (S210), and may extract the words corresponding to adjectives and nouns through part-of-speech analysis of the extracted sentences (S220). The word extraction unit (110) may then remove, from the extracted words, those contained in a preset stopword list (S230).
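The extraction steps S210 to S230 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the names split_sentences, extract_words, TOY_POS, and STOPWORDS are hypothetical, and the small lookup table merely stands in for a real part-of-speech tagger, which the description does not specify.

```python
import re

# Hypothetical toy part-of-speech lexicon; a real system would use a trained tagger.
TOY_POS = {
    "laser": "NOUN", "resin": "NOUN", "layer": "NOUN", "method": "NOUN",
    "liquid": "ADJ", "thin": "ADJ", "cures": "VERB", "the": "DET", "a": "DET",
}
# Hypothetical preset stopword list.
STOPWORDS = {"method"}


def split_sentences(text):
    """S210: split a document into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_words(text):
    """S220-S230: keep adjectives and nouns, then drop stopwords."""
    words = []
    for sentence in split_sentences(text):
        for token in re.findall(r"[a-z0-9]+", sentence.lower()):
            if TOY_POS.get(token) in {"NOUN", "ADJ"} and token not in STOPWORDS:
                words.append(token)
    return words


words = extract_words("The laser cures a thin liquid resin layer. A method.")
```

Tokens missing from the toy lexicon are simply discarded here; a production tagger would assign every token a tag before the adjective and noun filter is applied.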

Next, the statistical information acquisition unit (120) may calculate, for each document, the frequency of occurrence of the extracted words (S240). The statistical information acquisition unit (120) may also calculate, for each word, the inverse document frequency (idf), a statistic relating the total number of documents to the number of documents containing that word (S250).

Here, the term frequency (tf) is a value indicating how often a specific word appears in a document, and it may be calculated by adjusting the frequency of the word according to the length of the document. For example, the frequency measurement unit (121) may use the raw number of occurrences as the term frequency (tf), or may calculate it by Equation 2 below. In that case, the word with the highest frequency of occurrence in the document has the value 1, and the other words have values less than 1.

[Equation 2]

$$\mathrm{tf}(t,d) = \frac{f(t,d)}{\max\{f(w,d) : w \in d\}}$$

where t: any word, d: any document, w: any word in document d, f(t,d): the frequency of word t in document d.
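The normalized term frequency can be illustrated with a short Python sketch. It assumes, consistent with the statement that the most frequent word takes the value 1, that the raw count is divided by the largest count in the same document; the function name term_frequency is hypothetical.

```python
from collections import Counter


def term_frequency(words):
    """Return {word: count / largest count in the document}."""
    counts = Counter(words)
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: count / peak for word, count in counts.items()}


tf = term_frequency(["resin", "resin", "laser", "resin", "laser", "layer"])
```

Dividing by the largest count rather than by the document length keeps the values of long and short documents on the same 0 to 1 scale.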

In addition, the statistical information generation unit (125) calculates the inverse document frequency (idf), which indicates how commonly a given word occurs across the plurality of documents. The inverse document frequency may be calculated by dividing the total number of documents by the number of documents containing the word and then taking the logarithm.

As an example, the inverse document frequency (idf) may be calculated by Equation 3 below.

[Equation 3]

$$\mathrm{idf}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$

where t: any word, d: any document, |D|: the total number of documents, |{d ∈ D : t ∈ d}|: the number of documents containing the word t.
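The inverse document frequency described above can be sketched as follows. The natural logarithm is an assumption, since the description leaves the base open, and the function name inverse_document_frequency is hypothetical.

```python
import math


def inverse_document_frequency(docs):
    """docs: list of word lists. Returns {word: log(|D| / document frequency)}."""
    n_docs = len(docs)
    doc_freq = {}
    for words in docs:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / n) for word, n in doc_freq.items()}


idf = inverse_document_frequency([["a", "b"], ["a"], ["a", "c"], ["a", "b"]])
```

A word that occurs in every document receives an idf of 0, so it contributes nothing once multiplied by its term frequency.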

Next, the statistical information generation unit (125) may multiply the term frequency (tf) by the inverse document frequency (idf) to calculate the TF-IDF (Term Frequency-Inverse Document Frequency) value, and may obtain the TF-IDF value as the word statistical information (S260).

As an example, the TF-IDF value may be calculated by Equation 4 below. A constant is added to the inverse document frequency because, depending on the base of the logarithm, the inverse document frequency may come out negative; when the base of the logarithm is greater than 1, the constant may be omitted.

[Equation 4]

The TF-IDF value increases as the word occurs more frequently within a specific document and as fewer documents in the whole collection contain the word.

The statistical information generation unit (125) may structure the plurality of documents stored in the document storage unit (10) using the TF-IDF values of the words extracted from each document.

For example, when eight documents are stored in the document storage unit (10) and a total of seven words are extracted from them, the statistical information generation unit (125) may calculate the TF-IDF value of each extracted word and arrange the results as shown in Table 1 below.

[Table 1]

The operation of the word extraction unit (110) and the statistical information acquisition unit (120) described above is illustrated in FIG. 7.
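The word-statistics steps S240 to S260 can be combined into a document-by-word matrix, as in the following Python sketch. It is illustrative only: the function name tfidf_matrix and the offset parameter are hypothetical, the natural logarithm is assumed, and the offset models the optional constant added to the inverse document frequency (0 by default, because the base is greater than 1).

```python
import math
from collections import Counter


def tfidf_matrix(docs, offset=0.0):
    """docs: list of word lists. Returns (vocabulary, rows of TF-IDF values)."""
    vocab = sorted({word for words in docs for word in words})
    n_docs = len(docs)
    # Number of documents that contain each word.
    doc_freq = Counter(word for words in docs for word in set(words))
    rows = []
    for words in docs:
        counts = Counter(words)
        peak = max(counts.values()) if counts else 1
        rows.append([
            (counts[word] / peak) * (math.log(n_docs / doc_freq[word]) + offset)
            for word in vocab
        ])
    return vocab, rows


vocab, rows = tfidf_matrix([["resin", "resin", "laser"], ["laser", "mirror"], ["laser"]])
```

Each row corresponds to one document and each column to one word of the vocabulary, which is the layout the description attributes to Table 1.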

Meanwhile, the information entropy calculation unit (130) may calculate the information entropy (Shannon entropy) with respect to the words of interest stored in the word of interest storage unit (50) (S300). A word of interest is a keyword related to noise or to core patents, may be set by the user, and may be given a different weight for each word of interest. Here, weighting means that a weight may be applied to the frequency of occurrence of the corresponding word of interest.

Specifically, the information entropy calculation unit (130) may calculate the information entropy for the words of interest using the frequencies of the words in a document that are words of interest and of those that are not. The frequency of each word may be taken from the frequency information measured by the frequency measurement unit (121). In the user interface of FIG. 5, the 'seed patent calculation' (③) button may serve as the command to calculate the information entropy.

Here, the information entropy may be calculated by Equation 5 below.

[Equation 5]

$$n = -\sum_{i=1}^{k} h_i \log h_i$$

where n: the required amount of information of each document, k: the number of word-of-interest categories, h_i: the occurrence probability of each word of interest, that is, the frequency of the words corresponding to word of interest i divided by the total number of word occurrences in one document.

For example, when there are three keywords of interest, Stereo, Lithography, and 3D, the frequencies of the words of interest and of the remaining words in a specific document may be as shown in Table 2 below.

[Table 2]

Here, h_i for Stereo in the specific document is the frequency of Stereo, 4, divided by the total frequency of all words, 50. Applying the words of interest (k1 to k3) of Table 2 to Equation 5 on this basis, the information entropy of the document may be 1.08.

The matrix of Table 2 is generated separately for each document. In the present invention, all words other than those in the word-of-interest list are treated as a single word and form one word group (non-interest words), which maximizes the contribution of the key words (words of interest) to the information content.
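The per-document entropy calculation can be sketched in Python as follows. This is an illustration under stated assumptions: the natural logarithm is used, the non-interest words are pooled into one extra group as described above, and every count except the Stereo frequency of 4 and the total of 50 is hypothetical because Table 2 is not reproduced here. The function name interest_entropy is also hypothetical.

```python
import math


def interest_entropy(word_counts, interest_words):
    """Shannon entropy over the interest words plus one pooled non-interest group."""
    total = sum(word_counts.values())
    groups = [word_counts.get(word, 0) for word in interest_words]
    groups.append(total - sum(groups))  # all remaining words act as a single word
    entropy = 0.0
    for count in groups:
        if count > 0:  # a group that never occurs contributes nothing
            h = count / total
            entropy -= h * math.log(h)
    return entropy


# Hypothetical counts for one document with 50 word occurrences in total.
counts = {"stereo": 4, "lithography": 6, "3d": 10, "resin": 20, "laser": 10}
entropy = interest_entropy(counts, ["stereo", "lithography", "3d"])
```

A document containing none of the words of interest collapses to a single group and therefore has an entropy of 0, which is what makes the value usable as a selection criterion.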

Next, the interest seed selection unit (140) selects the reference seed documents by comparing the calculated information entropy of each document with a preset information entropy (S400). When noise is to be extracted, a document is selected as a reference seed if its information entropy is smaller than the preset reference value; when core patents are to be extracted, a document is selected as a reference seed only if its information entropy is greater than the preset reference value.
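The selection rule of step S400 can be expressed as a short Python sketch. The function name select_seeds, the mode strings, and the document identifiers are hypothetical.

```python
def select_seeds(entropies, reference_value, mode):
    """entropies: {doc_id: information entropy}. mode: 'noise' or 'core'."""
    if mode == "noise":
        # Noise extraction: entropy below the preset reference value.
        return [doc for doc, e in entropies.items() if e < reference_value]
    if mode == "core":
        # Core patent extraction: entropy above the preset reference value.
        return [doc for doc, e in entropies.items() if e > reference_value]
    raise ValueError("mode must be 'noise' or 'core'")


seeds = select_seeds({"P1": 0.2, "P2": 1.1, "P3": 0.7}, 0.9, "core")
```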

The seed clustering unit (150) may then extract and cluster at least one document similar to the reference seed selected by the interest seed selection unit (140) (S500). To extract similar documents, the seed clustering unit (150) analyzes the vector similarity between documents using a topic modeling algorithm, so that even latent associations between documents can be taken into account. If the interest seed selection unit (140) selects a plurality of reference seeds, the seed clustering unit (150) may extract at least one similar document for each reference seed document.

As an example, as shown in FIG. 8, the seed clustering unit (150) may apply a topic modeling algorithm to the per-word weights of each document (the TF-IDF matrix) generated by the statistical information generation unit (125) to calculate the probability distribution with which each document belongs to each topic (S510). For example, the topic modeling algorithm may be the Latent Dirichlet Allocation (LDA) algorithm.

The LDA algorithm is a known technique commonly used to classify documents by topic, and is described briefly here with reference to the Matlab code of FIG. 9. Fundamentally, LDA starts from the premises that a document is a bag of words, that a document has a specific topic, and that topics are shared across documents. For example, assuming that there are eight documents as shown in FIG. 9A and that each document is made up of 16 words in total, the words can be displayed in color according to their frequency of occurrence. The deeper the green, the higher the frequency of occurrence of the word; the deeper the blue, the lower the frequency. In document 7 of FIG. 9A, it can be seen that only the word at matrix position (3,4) has a markedly high frequency of occurrence.

FIG. 9B shows the distribution over topics: there are eight topics (Topic1 to Topic8), and the figure shows which words each topic contains. In other words, a topic is a distribution over words. For example, in Topic 1 the first through fourth words ((1,1) to (1,4)) have a high frequency of occurrence. Accordingly, applying LDA to the per-word weights of each document yields a pattern similar to that of FIG. 9B, from which each topic is found.

FIG. 9C shows the distribution of topics for each document; red denotes the distribution used to generate the data, and blue denotes the distribution found by LDA. That is, ignoring the order of the topics along the x-axis, it can be seen that LDA recovers the topics of each document closely.
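To make the topic analysis concrete, the following Python sketch estimates per-document topic distributions with a small collapsed Gibbs sampler for LDA. It is not the Matlab code of FIG. 9 and not necessarily the procedure of the embodiment: the sampler operates on raw word tokens rather than on TF-IDF weights, and the function name lda_gibbs and the parameters n_iter, alpha, beta, and seed are hypothetical choices made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (document, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][pos]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta) / (topic_total[j] + n_vocab * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][pos] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions; each row sums to 1.
    return [
        [(row[j] + alpha) / (len(doc) + n_topics * alpha) for j in range(n_topics)]
        for row, doc in zip(doc_topic, docs)
    ]


docs = [["laser", "resin", "laser"], ["resin", "cure", "laser"],
        ["gear", "motor", "gear"], ["motor", "shaft", "gear"]]
theta = lda_gibbs(docs, n_topics=2, n_iter=50)
```

Each row of theta is the topic probability distribution of one document, which is the quantity compared against the reference seed in the similarity step. As noted for FIG. 9C, the order of the recovered topics is arbitrary.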

Besides LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis) or PLSA (Probabilistic Latent Semantic Analysis) may also be used as the topic modeling algorithm.

The number of topics (a topic corresponds to a 'technical field') may be preset in the filtering system (1), and repeated tests confirmed that classifying into 8 to 10 topics is most appropriate. Accordingly, as shown in Table 3 below, the number of topics was first set to nine, and LDA was then applied to a large number of documents to classify them by topic.

As shown in Table 3 below, the LDA result can indicate the number of patent documents belonging to each topic and the main keywords that make up each topic, and the main keywords corresponding to each topic can be used to judge the character of that topic. For example, Topic 1 can be inferred to be a technology cluster concerned with bonding small particles (adhesive particulate bonding).

[Table 3]

In this way, the topic probability distribution calculation unit (151) may extract keywords for each topic and calculate, for each patent document, the probability distribution of belonging to each topic, as shown in Table 4 below.

[Table 4]

Next, the similarity calculation unit (153) may perform an inter-document similarity analysis using the probability distribution of belonging to each topic to calculate the similarity between documents, and the similarity may be calculated by any one of the Hellinger distance, the cosine similarity, and the Jaccard similarity coefficient (S520).
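Of the measures listed above, the cosine similarity is the simplest to illustrate. The following Python sketch computes it for two topic probability distributions; the function name cosine_similarity is hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something the description specifies.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


score = cosine_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```

Unlike the Hellinger distance, a larger cosine value means a more similar pair, so a threshold applied to it must be read in the opposite direction.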

As an example, the similarity calculation unit (153) may calculate the similarity between the reference seed selected by the interest seed selection unit (140) and each document to be compared using the Hellinger distance H(P,Q) of Equation 6 below.

[Equation 6]

$$H(P,Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{k}\left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$

where i is the topic, k is the number of topics, p_i is the topic probability distribution of the reference seed document, and q_i is the topic probability distribution of the document to be compared.

The value H(P,Q) given by the Hellinger distance lies between 0 and 1; the smaller the value, the more similar the two documents, and the larger the value, the less similar they are. For ease of intuitive understanding, the final similarity value S(P,Q) may therefore be obtained by subtracting the Hellinger distance H(P,Q) from 1, as in Equation 7 below.

[Equation 7]

$$S(P,Q) = 1 - H(P,Q)$$
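The Hellinger distance and the derived similarity value can be sketched in Python as follows. The sketch assumes the standard form of the Hellinger distance with the 1/√2 normalization, which is consistent with the stated range of 0 to 1; the function names hellinger_distance and similarity are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


def similarity(p, q):
    """Similarity value: 1 minus the Hellinger distance."""
    return 1.0 - hellinger_distance(p, q)


seed_topics = [0.7, 0.2, 0.1]
other_topics = [0.1, 0.2, 0.7]
distance = hellinger_distance(seed_topics, other_topics)
```

Identical distributions give a distance of 0 and a similarity of 1, while distributions with no topic in common give a distance of 1 and a similarity of 0.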

Table 5 below shows an example of the inter-document similarity matrix calculated by Equations 6 and 7 above.

[Table 5]

Next, the similar document clustering unit (155) clusters and provides the documents similar to the reference seed on the basis of the similarity information calculated by the similarity calculation unit (153) (S530). When there are a plurality of reference seeds, similar documents may also be clustered for each reference seed.

Thereafter, the interest document filtering unit (160) may filter the reference seeds and their similar documents, clustered by the seed clustering unit (150), against a preset threshold to remove or extract them (S600). That is, the interest document filtering unit (160) may compare the similarity information between the reference seed and each document to be compared with the preset threshold to select and recommend documents of interest similar to the reference seed. The interest document filtering unit (160) may also mark the documents of interest, including the reference seed, separately within the population document data stored in the document storage unit (10). If the documents of interest are noise, they are then removed from the population; if they are core patents, they are extracted from the population.

That is, as shown in FIG. 10, the filtering system (1) clusters and provides the documents similar to a reference seed, presenting at least one reference seed (④), the documents similar to that reference seed (⑤), the information entropy value of each document (⑦), and the similarity to the reference seed (distance; ⑧). FIG. 10 shows patent number 2002-018463, one of the reference seeds (④), together with the patents similar to it (⑤). Here, the similar patents (⑤) are only those whose similarity value (distance; ⑧) is lower than the preset threshold (⑥). If a different reference seed is selected, the patents similar to that reference seed are re-extracted and displayed.
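The filtering step S600 can be sketched in Python as follows, assuming that the value compared with the threshold is the distance to the reference seed, so that a lower value means a more similar document. The function name filter_population, the mode strings, and the document identifiers are hypothetical.

```python
def filter_population(distances, threshold, mode):
    """distances: {doc_id: distance to the reference seed}. mode: 'noise' or 'core'."""
    if mode not in ("noise", "core"):
        raise ValueError("mode must be 'noise' or 'core'")
    # Documents closer to the seed than the threshold are the documents of interest.
    similar = [doc for doc, dist in distances.items() if dist < threshold]
    if mode == "core":
        return similar  # core patents are extracted from the population
    flagged = set(similar)
    return [doc for doc in distances if doc not in flagged]  # noise is removed


distances = {"seed": 0.0, "A": 0.1, "B": 0.6}
core_patents = filter_population(distances, 0.3, "core")
cleaned_population = filter_population(distances, 0.3, "noise")
```

The reference seed has a distance of 0 to itself, so it is always included among the documents of interest.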

The steps from reference seed selection (S400) through similar document filtering (S600) may be repeated as necessary to filter core patents or noise in stages.

The present invention described above is not limited by the foregoing embodiments or the accompanying drawings but by the claims that follow, and a person of ordinary skill in the art to which the present invention pertains will readily understand that its configuration may be variously changed and modified without departing from the technical spirit of the present invention.

1: document filtering system 10: document storage unit
50: word of interest storage unit 100: information processing unit
110: word extraction unit 120: statistical information acquisition unit
121: frequency measurement unit 125: statistical information generation unit
130: information entropy calculation unit 140: interest seed selection unit
150: seed clustering unit 151: topic probability distribution calculation unit
153: similarity calculation unit 155: similar document clustering unit
160: interest document filtering unit

Claims

A method of filtering documents of interest, comprising:
(a) selecting at least one specific document of interest among a plurality of documents as a reference seed;
(b) performing topic analysis of each document based on word statistical information of the plurality of documents to calculate a probability distribution by topic, and calculating a similarity (distance) to the reference seed using the probability distribution by topic of each document; and
(c) selecting and filtering documents of interest according to the calculated similarity,
wherein the reference seed in step (a) is selected based on information entropy calculated from the probability of occurrence of preset words of interest in the plurality of documents.

delete

The method of claim 1,
wherein the reference seed is selected by comparing the calculated information entropy with a preset range or condition of information entropy for seeds.

The method of claim 3, wherein the information entropy (n) is calculated by the Shannon entropy of the following formula:
[Equation]

$$n = -\sum_{i=1}^{k} h_i \log h_i$$

where n: the required amount of information of each document, k: the number of word-of-interest categories, h_i: the occurrence probability of each word of interest, that is, the frequency of the words corresponding to word of interest i divided by the total number of word occurrences in one document.

The method of claim 1, wherein the word statistical information of step (b) is generated by extracting words from the plurality of documents and measuring the frequency of occurrence of the extracted words in each document.

The method of claim 1,
wherein the reference seed is a document of interest to be extracted and is either noise or core data.

The method of claim 1, wherein the word statistical information of step (b) is obtained through the steps of:
extracting words included in the plurality of documents;
measuring, for each of the plurality of documents, the frequency of occurrence of the extracted words in the document;
calculating, for each extracted word, an inverse document frequency by dividing the total number of documents by the number of documents including the word; and
multiplying the frequency of occurrence of the word by the inverse document frequency.

The method of claim 7, wherein the step of extracting the words comprises:
extracting sentences included in the document through natural language processing of the document;
extracting words corresponding to adjectives and nouns through part-of-speech analysis of the extracted sentences; and
removing words included in a preset stopword list from the extracted words.

The method of claim 1, wherein step (b) comprises:
calculating the probability distribution with which each document belongs to each topic by applying a topic modeling algorithm to the word statistical information;
calculating a similarity between the reference seed and each document to be compared using the probability distribution by topic of each document and the topic probability distribution of the document selected as the reference seed; and
clustering and providing documents similar to the reference seed based on the calculated similarity.

The method of claim 9,
wherein the topic modeling algorithm is any one of LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and PLSA (Probabilistic Latent Semantic Analysis), and the similarity is calculated by any one of the Hellinger distance, the cosine similarity, and the Jaccard similarity coefficient.

The method of claim 1, wherein the similarity of step (b) is calculated by applying the probability distribution with which each document belongs to each topic to the Hellinger distance H(P,Q) of the following equation:
[Equation]

$$H(P,Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{k=1}^{t}\left(\sqrt{p_k} - \sqrt{q_k}\right)^2}$$

where k is the topic, t is the number of topics, p_k is the topic probability distribution of the reference seed, and q_k is the topic probability distribution of the document to be compared.

delete

The method of claim 1,
wherein the words of interest include at least one of a core keyword and a noise keyword.

A system for filtering documents of interest, comprising:
a reference seed selection unit that selects at least one specific document of interest among a plurality of documents as a reference seed;
a seed clustering unit that calculates a probability distribution by topic through topic analysis of each document based on word statistical information of the plurality of documents, and calculates a similarity to the selected reference seed using the probability distribution by topic of each document; and
an interest document filtering unit that selects and filters documents of interest according to the calculated similarity of each document,
wherein the reference seed selection unit selects the reference seed based on information entropy calculated from the occurrence frequency probability of preset words of interest in the plurality of documents.

delete

The system of claim 14, further comprising:
a word extraction unit that extracts words for each document from a plurality of collected documents;
a statistical information acquisition unit that measures the frequency of occurrence of the extracted words in each document and acquires word statistical information using the frequency of occurrence of the words and the inverse document frequency; and
an information entropy calculation unit that calculates the information entropy of each document from the probability of occurrence of the preset words of interest relative to the measured number of occurrences of all words in the document.

The system of claim 16, wherein the statistical information acquisition unit comprises:
a frequency measurement unit that measures the frequency of occurrence in each document of the words extracted by the word extraction unit; and
a statistical information generation unit that calculates, for each extracted word, an inverse document frequency by dividing the total number of documents by the number of documents including the word, and obtains the word statistical information by multiplying the frequency of occurrence of the word by the inverse document frequency.

delete