KR102028487B1

KR102028487B1 - Document topic modeling apparatus and method, storage media storing the same

Info

Publication number: KR102028487B1
Application number: KR1020180017564A
Authority: KR
Inventors: 김남규; 최호창; 현윤진; 윌리엄
Original assignee: 국민대학교산학협력단
Priority date: 2018-02-13
Filing date: 2018-02-13
Publication date: 2019-10-04
Also published as: KR20190097748A

Abstract

The present invention relates to an apparatus and method for topic modeling of documents. The apparatus includes: a local topic information generation unit that divides a global document set containing a plurality of documents into a plurality of local document sets and performs local topic modeling to generate local topic information; a global topic information generation unit that builds a reduced global document set from documents in each of the local document sets and performs global topic modeling to generate global topic information; and a document topic assignment unit that assigns, to each of the plurality of documents, topic transformation information generated through document mapping between the local and global topic information. The invention can thereby reasonably integrate the results of topic modeling performed on separate partitions.

Description

TOPIC MODELING APPARATUS AND METHOD, STORAGE MEDIA STORING THE SAME

The present invention relates to topic modeling of documents and, more particularly, to a document topic modeling apparatus and method capable of reasonably integrating the results of topic modeling performed on separate partitions.

Topic modeling extracts the main issues from a large volume of documents, identifies the documents that correspond to each issue, and presents them as clusters. Unlike general document clustering techniques, topic modeling produces results that reflect the semantic content of the documents, which makes it a very useful technique. Traditional topic modeling, however, relies on the distribution of key terms across the entire document collection, so identifying the topics of each document requires a batch analysis of the whole collection. As a result, topic modeling of a large collection takes a very long time, and a scalability problem can arise in which analysis time grows exponentially as the size of the collection increases.

Korean Patent Laid-Open Publication No. 10-2009-0013928 (2009.02.06) relates to a topic extraction apparatus and to a social network generation system and method using that apparatus. It presents a method for automatically extracting topics that represent a user's characteristics, presents ranking and management methods that improve the accuracy of the extracted topics, and provides an apparatus and method that use them to automatically form social relationships among multiple users.

Korean Patent Laid-Open Publication No. 10-2009-0013928 (2009.02.06)

An embodiment of the present invention aims to provide a document topic modeling apparatus and method that can reasonably integrate the results of topic modeling performed on separate partitions.

An embodiment of the present invention aims to provide a document topic modeling apparatus and method that can divide the entire document collection into sub-clusters and extract representative documents from each sub-cluster to form a reduced global document set.

An embodiment of the present invention aims to provide a document topic modeling apparatus and method that can derive the components of global topics from the local topics obtained in the sub-clusters, using the representative documents as intermediaries.

Among the embodiments, a document topic modeling apparatus includes: a local topic information generation unit that divides a global document set containing a plurality of documents into a plurality of local document sets and performs local topic modeling to generate local topic information; a global topic information generation unit that builds a reduced global document set from documents in each of the local document sets and performs global topic modeling to generate global topic information; and a document topic assignment unit that assigns, to each of the plurality of documents, topic transformation information generated through document mapping between the local and global topic information.

The local topic information generation unit may perform the local topic modeling to extract at least one main local topic for each of the local document sets and generate a local topic matrix between the local documents belonging to a local document set and the at least one main local topic.

For each of the local document sets, the global topic information generation unit may extract at least one local representative document in descending order of per-topic weight to generate the reduced global document set.

The global topic information generation unit may perform the global topic modeling to extract at least one main global topic for the reduced global document set and generate a global topic matrix between the local representative documents belonging to the reduced global document set and the at least one main global topic.

The document topic assignment unit may extract topic component information associated with the local representative documents from the local topic matrix generated by the local topic modeling, and generate at least one topic transformation matrix by multiplying the topic component information with the global topic matrix generated by the global topic modeling.

The document topic assignment unit may assign global topic weights to the plurality of documents by multiplying each local topic matrix with the at least one topic transformation matrix.
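The two multiplications described above can be written out with explicit shapes. The notation below is an illustrative assumption and does not appear in the patent: for local document set $i$, $L_i$ is the local topic matrix, $R_i$ is the topic component information (the rows of $L_i$ that belong to the local representative documents), and $G_i$ is the part of the global topic matrix for those same representative documents. The transpose and the per-set treatment are also assumptions, chosen so that the dimensions agree.

```latex
\begin{aligned}
L_i &\in \mathbb{R}^{n_i \times k_i}, \quad
R_i \in \mathbb{R}^{r_i \times k_i}, \quad
G_i \in \mathbb{R}^{r_i \times K}, \\
T_i &= R_i^{\top} G_i \in \mathbb{R}^{k_i \times K}, \\
W_i &= L_i \, T_i \in \mathbb{R}^{n_i \times K}.
\end{aligned}
```

Here $n_i$ is the number of documents in local set $i$, $k_i$ the number of local topics, $r_i$ the number of representative documents, and $K$ the number of global topics. Under these assumptions, $T_i$ plays the role of the topic transformation matrix and $W_i$ holds the global topic weights assigned to every document of local set $i$.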

Among the embodiments, a document topic modeling method includes: (a) dividing a global document set containing a plurality of documents into a plurality of local document sets and performing local topic modeling to generate local topic information; (b) building a reduced global document set from documents in each of the local document sets and performing global topic modeling to generate global topic information; and (c) assigning, to each of the plurality of documents, topic transformation information generated through document mapping between the local and global topic information.

Step (a) may be a step of performing the local topic modeling to extract at least one main local topic for each of the local document sets and generating a local topic matrix between the local documents belonging to a local document set and the at least one main local topic.

Step (b) may be a step of extracting, for each of the local document sets, at least one local representative document in descending order of per-topic weight to generate the reduced global document set.

Step (b) may be a step of performing the global topic modeling to extract at least one main global topic for the reduced global document set and generating a global topic matrix between the local representative documents belonging to the reduced global document set and the at least one main global topic.

Step (c) may be a step of extracting topic component information associated with the local representative documents from the local topic matrix generated by the local topic modeling, and generating at least one topic transformation matrix by multiplying the topic component information with the global topic matrix generated by the global topic modeling.

Step (c) may be a step of assigning global topic weights to the plurality of documents by multiplying each local topic matrix with the at least one topic transformation matrix.

Among the embodiments, a computer-executable recording medium includes: a process of dividing a global document set containing a plurality of documents into a plurality of local document sets and performing local topic modeling to generate local topic information; a process of building a reduced global document set from documents in each of the local document sets and performing global topic modeling to generate global topic information; and a process of assigning, to each of the plurality of documents, topic transformation information generated through document mapping between the local and global topic information.

The disclosed technology may have the following effects. However, this does not mean that a particular embodiment must include all of the following effects or only the following effects, so the scope of the disclosed technology should not be understood as limited by them.

The document topic modeling apparatus and method according to an embodiment of the present invention can divide the entire document collection into sub-clusters and extract representative documents from each sub-cluster to form a reduced global document set.

The document topic modeling apparatus and method according to an embodiment of the present invention can derive the components of global topics from the local topics obtained in the sub-clusters, using the representative documents as intermediaries.

FIG. 1 is a diagram illustrating a document topic modeling system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the topic modeling apparatus of FIG. 1.
FIG. 3 is a flowchart illustrating the topic modeling process performed by the topic modeling apparatus of FIG. 1.
FIG. 4 is a diagram showing an overall overview of a document topic modeling system according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating an embodiment of the local topic information generated by the local topic modeling performed in the local topic information generation unit of FIG. 2.
FIG. 6 is an exemplary diagram illustrating an embodiment of the global topic information generated by global topic modeling of the reduced global document set, performed in the global topic information generation unit of FIG. 2.
FIG. 7 is an exemplary diagram illustrating an embodiment of the process of generating topic transformation information through the document mapping performed in the document topic assignment unit of FIG. 2.
FIG. 8 is an exemplary diagram illustrating an embodiment of the process by which the document topic assignment unit of FIG. 2 assigns topic transformation information to each of the plurality of documents.

The description of the present invention is merely an embodiment for structural or functional explanation, and the scope of the invention should not be construed as limited by the embodiments described herein. That is, because the embodiments may be modified in various ways and may take many forms, the scope of the invention should be understood to include equivalents capable of realizing its technical idea. Likewise, the objects or effects presented here do not mean that a particular embodiment must include all of them or only them, so the scope of the invention should not be understood as limited by them.

Meanwhile, the terms used in this application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" to another component, it may be directly connected to that component, but it should be understood that other components may exist in between. By contrast, when a component is said to be "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing relationships between components, such as "between" and "immediately between", or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" specify that a stated feature, number, step, operation, component, part, or combination thereof is present, and should not be understood to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Identification codes for steps (for example, a, b, c) are used for convenience of description and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one stated. That is, the steps may be performed in the stated order, substantially simultaneously, or in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, which includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related art, and should not be interpreted as having an ideal or excessively formal meaning unless clearly defined in this application.

Topic modeling is a statistical model for discovering the abstract "topics" of a document set, and is one of the text mining techniques used to uncover the hidden semantic structure of a body of text. In text analysis research, topic modeling is used to structure documents and to extract main topics.

Topic modeling generally uses the bag-of-words concept, which treats each document as a set of the terms that appear in it. Because each document contains a large number of words, it can be represented by a suitable number of word clusters through dimension reduction, and the number of dimensions obtained in this process corresponds to the number of topics. Each document therefore has a document topic weight, which is its degree of correspondence to an individual topic. Whether a document contains a topic is generally judged using a document cutoff computed as the "mean + 1σ" of the document weights.

That is, a document whose document weight is at or above the document cutoff is interpreted as containing the topic. The two-dimensional matrix of document weights between each document and each topic obtained through this process is called the document/topic matrix. In the present invention, a local topic matrix or a global topic matrix may be used as an embodiment of the document/topic matrix.
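The cutoff rule can be illustrated with a small sketch. It assumes that the cutoff is computed separately for each topic column of the document/topic matrix and that σ is the population standard deviation; the text does not specify either point. The function name `topics_per_document` and the sample weights are hypothetical.

```python
from statistics import mean, pstdev


def topics_per_document(doc_topic, n_sigma=1.0):
    """Flag, for every document, the topics whose weight reaches the cutoff.

    doc_topic is a document/topic matrix (rows: documents, columns: topics).
    Assumption: one cutoff per topic, cutoff = mean + n_sigma * std of that column.
    """
    n_topics = len(doc_topic[0])
    cutoffs = []
    for t in range(n_topics):
        column = [row[t] for row in doc_topic]
        cutoffs.append(mean(column) + n_sigma * pstdev(column))
    # A document "contains" a topic when its weight is at or above the cutoff.
    membership = [[w >= c for w, c in zip(row, cutoffs)] for row in doc_topic]
    return cutoffs, membership


# Hypothetical weights for four documents and two topics.
doc_topic = [
    [0.90, 0.05],
    [0.10, 0.80],
    [0.05, 0.10],
    [0.15, 0.05],
]
cutoffs, membership = topics_per_document(doc_topic)
```

With these sample weights, only the first document reaches the cutoff for the first topic and only the second document reaches it for the second topic.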

FIG. 1 is a diagram illustrating a document topic modeling system according to an embodiment of the present invention.

Referring to FIG. 1, the document topic modeling system 100 may include a topic modeling apparatus 110 and a database 130.

The topic modeling apparatus 110 may be implemented as a server, that is, a computer or program that can perform topic modeling on a document set and extract its topics. The topic modeling apparatus 110 may use document set data stored in its internal memory, or may receive document set data from an independently implemented database 130 and perform topic modeling on it.

The topic modeling apparatus 110 may be implemented so as to include the database 130 or independently of it. When implemented independently, the topic modeling apparatus 110 may be connected to the database 130 by wire or wirelessly to exchange data.

The database 130 is a storage device that stores the various information needed to perform topic modeling of documents. The database 130 may store information about the document sets on which the topic modeling apparatus 110 will perform topic modeling, and may store the per-document topic information, local topic information, and global topic information produced during topic modeling. It is not limited to these and may store information collected or processed in various forms in connection with topic modeling of documents. Here, the local and global topic information may include the local and global topic matrices produced by local and global topic modeling of the local and global document sets, and each of these matrices may correspond to a weight matrix between documents and topics.

The database 130 may consist of one or more independent sub-databases, each storing information within a specific scope, or of an integrated database in which the sub-databases are combined into one. When it consists of independent sub-databases, the sub-databases may be connected wirelessly through Bluetooth, WiFi, or the like, and may exchange data with one another over a network. When it is configured as an integrated database, the database 130 may include a control unit that integrates the sub-databases and manages data exchange and control flow among them.

FIG. 2 is a block diagram illustrating the topic modeling apparatus of FIG. 1.

Referring to FIG. 2, the topic modeling apparatus 110 may include a local topic information generation unit 210, a global topic information generation unit 230, a document topic assignment unit 250, and a control unit 270.

The local topic information generation unit 210 may divide a global document set containing a plurality of documents into a plurality of local document sets and perform local topic modeling to generate local topic information. The number of local document sets may be determined by various criteria, such as the total number of documents or the institutions and sites that manage the documents. The subject matter and time period of the local document sets may be the same or may differ from one another.
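As a minimal sketch of the division step, the global document set can be grouped by any criterion that maps a document to a partition label. Grouping by managing site is one of the criteria mentioned above; the `site` field, the document records, and the function name `split_into_local_sets` are hypothetical.

```python
def split_into_local_sets(documents, key):
    """Group documents into local document sets by the label returned by key(doc)."""
    local_sets = {}
    for doc in documents:
        local_sets.setdefault(key(doc), []).append(doc)
    return local_sets


# Hypothetical global document set; each record names the site that manages it.
global_set = [
    {"id": "d1", "site": "A", "text": "battery charging patent"},
    {"id": "d2", "site": "B", "text": "topic model survey"},
    {"id": "d3", "site": "A", "text": "electric vehicle motor"},
    {"id": "d4", "site": "B", "text": "text mining method"},
]
local_sets = split_into_local_sets(global_set, key=lambda doc: doc["site"])
```

Each value of `local_sets` can then be passed to whatever topic modeling routine is used for the local step.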

In one embodiment, the local topic information generation unit 210 may perform local topic modeling to extract at least one main local topic for each of the local document sets and generate a local topic matrix between the local documents in that set and the main local topics. Local topic modeling means topic modeling performed on a local document set, and general topic modeling techniques can be applied. The local topic matrix may correspond to a two-dimensional matrix of the document weights between each local document in the local document set and each main local topic.

The global topic information generation unit 230 may build a reduced global document set from documents in each of the local document sets and perform global topic modeling to generate global topic information. More specifically, for each of the local document sets produced by the local topic information generation unit 210, the global topic information generation unit 230 may extract some of the documents in that set and merge the extracted documents into a single reduced global document set. The global topic information generation unit 230 may then perform global topic modeling on the reduced global document set; global topic modeling is ordinary topic modeling performed on a global document set.

In one embodiment, the global topic information generation unit 230 may build the reduced global document set by extracting, from each of the local document sets, at least one local representative document in descending order of per-topic weight. More specifically, the global topic information generation unit 230 may sort the documents in a local document set in descending order of weight for each topic and extract, as local representative documents, the documents that fall within a specific rank among the highest-weighted documents for each topic.
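The ranking-based extraction can be sketched as follows. The sketch assumes that the representatives of a local set are the union of the top `top_n` documents of every topic column; the function name `select_representatives`, the `top_n` parameter, and the sample weights are hypothetical.

```python
def select_representatives(local_matrix, top_n):
    """Return row indices of documents ranked within top_n for at least one topic.

    local_matrix is a local topic matrix (rows: documents, columns: local topics).
    """
    n_docs = len(local_matrix)
    n_topics = len(local_matrix[0])
    chosen = set()
    for t in range(n_topics):
        # Sort document indices by their weight for topic t, highest first.
        ranked = sorted(range(n_docs), key=lambda d: local_matrix[d][t], reverse=True)
        chosen.update(ranked[:top_n])
    return sorted(chosen)


# Hypothetical local topic matrix: five documents, two local topics.
local_matrix = [
    [0.7, 0.1],
    [0.2, 0.6],
    [0.6, 0.2],
    [0.2, 0.9],
    [0.3, 0.3],
]
reps = select_representatives(local_matrix, top_n=1)
```

With `top_n=1`, the call selects rows 0 and 3, the highest-weighted document of each topic. The representatives chosen from every local set would then be merged to form the reduced global document set.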

The topic modeling apparatus 110 may set in advance the rank threshold needed for extracting local representative documents so that the global topic information generation unit 230 can use it. In another embodiment, the topic modeling apparatus 110 may build the reduced global document set by extracting local representative documents from the documents in a local document set through simple random sampling.
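The random-sampling alternative could look like the sketch below, which draws document indices without replacement using the standard library. The function name `sample_representatives`, the sample size, and the seed are hypothetical choices for illustration.

```python
import random


def sample_representatives(n_docs, sample_size, seed=None):
    """Pick sample_size distinct document indices uniformly at random."""
    rng = random.Random(seed)  # a local generator keeps the sketch reproducible
    return sorted(rng.sample(range(n_docs), sample_size))


# Draw three representatives from a hypothetical local set of ten documents.
reps = sample_representatives(n_docs=10, sample_size=3, seed=0)
```

Unlike the ranking-based extraction, this variant needs no local topic weights to choose the representatives, but it gives no guarantee that every local topic is represented.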

In one embodiment, the global topic information generation unit 230 may perform global topic modeling to extract at least one main global topic for the reduced global document set and generate a global topic matrix between the local representative documents in the reduced set and the main global topics. The global topic matrix may correspond to a two-dimensional matrix of the document weights between each local representative document in the reduced global document set and each main global topic.

The document topic assignment unit 250 may assign, to each of the plurality of documents, at least one piece of topic transformation information generated through document mapping between the local and global topic information. Here, document mapping means linking the local topic information to the global topic information through the local representative documents, which are the documents involved in generating both. The topic transformation information is a two-dimensional matrix produced by transforming the local topic information through document mapping, and may correspond to a weight transformation matrix between the local and global topic information.

In one embodiment, the document topic assignment unit 250 may extract topic component information associated with the local representative documents from the local topic matrix generated through local topic modeling, and may generate at least one topic transformation matrix by multiplying the topic component information and the global topic matrix generated through global topic modeling. Here, the topic component information is a two-dimensional matrix obtained by extracting the portions of the local topic matrix that correspond to the local representative documents and merging them into one, and may correspond to a weight matrix between the main local topics and the local representative documents. The topic transformation matrix may be included in the topic transformation information and may correspond to a weight transformation matrix between the local topic information and the global topic information. In another embodiment, the document topic assignment unit 250 may generate the topic transformation matrix by calculating the similarity between the local and global topic information associated with the local representative documents.

In one embodiment, the document topic assignment unit 250 may assign global topic weights to the plurality of documents by multiplying each local topic matrix by the at least one topic transformation matrix. By assigning every local document a document weight for each main global topic, the document topic assignment unit 250 may generate a weight matrix between all local documents and the main global topics extracted through global topic modeling of the reduced global document set. As a result, the document topic assignment unit 250 can obtain a result similar to that of performing global topic modeling on all local documents.

The control unit 270 may control the overall operation of the topic modeling apparatus 110 and may manage the control flow or data flow among the local topic information generation unit 210, the global topic information generation unit 230, and the document topic assignment unit 250.

FIG. 3 is a flowchart illustrating the topic modeling process performed by the topic modeling apparatus of FIG. 1.

Referring to FIG. 3, the topic modeling apparatus 110 may, through the local topic information generation unit 210, divide a global document set including a plurality of documents to generate a plurality of local document sets, and may perform local topic modeling to generate local topic information (step S310).

The topic modeling apparatus 110 may, through the global topic information generation unit 230, generate a reduced global document set from documents in each of the plurality of local document sets and perform global topic modeling to generate global topic information (step S330).

The topic modeling apparatus 110 may, through the document topic assignment unit 250, assign at least one piece of topic transformation information, generated through document mapping between the local and global topic information, to each of the plurality of documents (step S350).

FIG. 4 is a diagram showing an overview of a document topic modeling system according to an embodiment of the present invention.

Referring to FIG. 4, the topic modeling apparatus 110 divides the global set, which is the set of documents collected for analysis, into local sets (e.g., L_A and L_B) (410), and then extracts the main topics of each local set (420). Next, it randomly selects some documents from each local set and designates them as local representative documents (delegates) (430), and merges them to create a reduced global set that can represent the characteristics of all documents (440). It then extracts global topics from the reduced global set (450) and compares them with the local topic information of the local representative documents (460), thereby deriving a rule for extracting global topic components from the local topics (470). Finally, it applies this rule to the local topic weights of each document to assign global topics to each document (480).

FIG. 5 is an exemplary diagram illustrating an embodiment of the local topic information generated through the local topic modeling performed by the local topic information generation unit of FIG. 2.

Referring to FIG. 5, the topic modeling apparatus 110 may, through the local topic information generation unit 210, perform local topic modeling on the local document sets generated by dividing the entire document collection. Although FIG. 5 illustrates the case where the entire collection is divided into two local document sets, the topic modeling apparatus 110 may operate in the same way for a plurality of local document sets.

The local topic information generation unit 210 may perform local topic modeling on each of the two local document sets 510 and 530 to generate, as local topic information, the main local topics 511 and 531 and the local topic matrices 513 and 533, respectively. FIG. 5 assumes that the two local document sets 510 and 530 cover different subjects: Local A 510 consists of politics-related documents and Local B 530 consists of life/culture-related documents. The local topic matrices 513 and 533 may correspond to weight matrices between the local documents belonging to each local document set 510 and 530 and the main local topics 511 and 531.

FIG. 6 is an exemplary diagram illustrating an embodiment of the global topic information generated through the global topic modeling of the reduced global document set performed by the global topic information generation unit of FIG. 2.

Referring to FIG. 6, the topic modeling apparatus 110 may, through the global topic information generation unit 230, merge the local representative documents extracted from each local document set into one to generate a reduced global document set (RGS). The global topic information generation unit 230 may perform global topic modeling on the reduced global document set to generate, as global topic information, the main global topics 610 and the global topic matrix 630.

In one embodiment, the global topic information generation unit 230 may extract the local representative documents from each local document set through simple random sampling. In another embodiment, the global topic information generation unit 230 may sort the local documents belonging to each local document set in order of per-topic weight and extract only a specific number of top-ranked local documents, according to the sorted order, as the local representative documents.
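The two delegate-selection strategies described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the function names `sample_delegates` and `top_weight_delegates`, the dictionary layout, and the toy weights are hypothetical, and taking the top documents separately for each topic is just one possible reading of the weight-ordered ranking.

```python
import random


def sample_delegates(doc_ids, k, seed=0):
    """Simple random sampling: pick k delegates from one local document set."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return sorted(rng.sample(doc_ids, k))


def top_weight_delegates(local_topic_matrix, per_topic):
    """For each local topic, keep the `per_topic` documents with the highest weight.

    `local_topic_matrix` maps a document id to its list of local topic weights
    (a hypothetical layout chosen for this sketch).
    """
    n_topics = len(next(iter(local_topic_matrix.values())))
    chosen = set()
    for t in range(n_topics):
        ranked = sorted(
            local_topic_matrix,
            key=lambda doc: local_topic_matrix[doc][t],
            reverse=True,
        )
        chosen.update(ranked[:per_topic])
    return sorted(chosen)


# Toy local topic matrix for one local document set (made-up weights).
local_a = {
    "Doc. 1": [0.30, 0.05],
    "Doc. 2": [0.10, 0.40],
    "Doc. 3": [0.25, 0.02],
    "Doc. 4": [0.01, 0.35],
}

random_delegates = sample_delegates(list(local_a), k=2)
ranked_delegates = top_weight_delegates(local_a, per_topic=1)
```

With these toy weights, the ranked variant keeps the single strongest document for each of the two local topics, while the random variant ignores the weights entirely.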

The global topic information generation unit 230 may extract Doc. 1, 3, and 5 and Doc. 10, 12, and 14 as the local representative documents of the local document sets Local A 510 and Local B 530 of FIG. 5, respectively. The global topic matrix 630 may represent the degree of correspondence between the local representative documents and the main global topics 610.

FIG. 7 is an exemplary diagram illustrating an embodiment of the process of generating topic transformation information through the document mapping performed by the document topic assignment unit of FIG. 2.

Referring to FIG. 7, the topic modeling apparatus 110 may, through the document topic assignment unit 250, generate topic transformation information through document mapping between the local topic information and the global topic information. The document topic assignment unit 250 may extract topic component information 710 associated with the local representative documents from the local topic matrix generated through local topic modeling, and may generate at least one topic transformation matrix 750 by multiplying the topic component information 710 and the global topic matrix 730 generated through global topic modeling. The topic component information 710 may correspond to a two-dimensional weight matrix between the main local topics and the local representative documents.
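The multiplication that yields a topic transformation matrix can be sketched as below. This is a hedged illustration rather than the exact computation of the invention: the helper `matmul`, the variable names, the matrix orientation (rows as delegates, transposed before multiplying), and all numeric values are assumptions chosen for a small example.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Topic component information: rows are delegates, columns are local topics.
# (Made-up weights; two delegates and two local topics.)
topic_components = [
    [0.2, 0.1],
    [0.4, 0.3],
]

# Global topic matrix: rows are the same delegates, columns are global topics.
global_topic_matrix = [
    [0.5, 0.1],
    [0.2, 0.6],
]

# (local topics x delegates) @ (delegates x global topics)
# -> (local topics x global topics) transformation matrix.
transformation = matmul(transpose(topic_components), global_topic_matrix)
```

Each entry of `transformation` accumulates, over the shared delegates, the product of a local topic weight and a global topic weight, so a local topic that co-occurs with a global topic across delegates receives a larger coefficient.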

In FIG. 7, "Topic j" (main local topic j) of "Local i" (local document set i) is denoted L_i_T_j, and "Topic k" (main global topic k) of the reduced global document set (RGS) is denoted RGS_T_k.

Because a local representative document participates not only in the local topic modeling of FIG. 5 but also in the global topic modeling of FIG. 6, it carries information on both the main local topics and the main global topics. For example, as a result of local topic modeling, Doc. 1 of Local A may have document weights of (0.013, 0.009, 0.048, 0.022, 0.021) for the five main local topics L_A_T₁ to L_A_T₅, and at the same time, as a result of global topic modeling, document weights of (0.312, 0.024, 0.003, 0.004, 0.050) for the five main global topics RGS_T₁ to RGS_T₅.

FIG. 7 shows how to derive the rule (Rule A) for predicting the values of RGS_T₁ to RGS_T₅ from L_A_T₁ to L_A_T₅, and the rule (Rule B) for predicting the values of RGS_T₁ to RGS_T₅ from L_B_T₁ to L_B_T₅. The document topic assignment unit 250 may extract only the portion 710 corresponding to the local representative documents from the local topic matrix produced by each local topic modeling, and may compare it with the global topic matrix 730 derived through global topic modeling of the reduced global document set (RGS).

The document topic assignment unit 250 may generate the topic transformation matrix 750 from the local document sets Local A and Local B. For example, (a) of FIG. 7 shows the process of deriving the weight of RGS_T₁ from L_A_T₁ to L_A_T₅. The value '0.021' marked with a dotted rectangle in the topic transformation matrix 750 indicates that an increase of 1 in the value of L_A_T₂ leads to an increase of 0.021 in the value of RGS_T₁. That is, RGS_T₁ is influenced more strongly by L_A_T₂ than by the other main local topics, which reflects the fact that, among the local representative documents Doc. 1, Doc. 3, and Doc. 5, a higher value of L_A_T₂ is accompanied by a higher value of RGS_T₁.

FIG. 8 is an exemplary diagram illustrating an embodiment of the process by which the document topic assignment unit of FIG. 2 assigns topic transformation information to each of the plurality of documents.

Referring to FIG. 8, the topic modeling apparatus 110 may, through the document topic assignment unit 250, assign the topic transformation information to each of the plurality of documents included in the global document set. Here, the topic transformation information may correspond to at least one topic transformation matrix 830 corresponding to the plurality of local topic matrices 810. In (a) of FIG. 8, the local topic matrix 810 may correspond to the local topic matrix 513 of FIG. 5, and the topic transformation matrix 830 may correspond to the topic transformation matrix 750 of FIG. 7.

Table 1 below shows the notation used.

[Table 1]

Using the notation of Table 1 above, the rule for deriving the global topic weights from the local topic weights of a local document Doc. i belonging to the local document set Local A is defined as follows (where N is the total number of local topics in Local A).
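The rule itself does not survive in this text, so the following is only a hedged reconstruction inferred from the worked example for Doc. 2. The symbol $r_A(j,k)$, standing for the Rule A coefficient that maps local topic $L\_A\_T_j$ to global topic $RGS\_T_k$, is a hypothetical name and may differ from the notation of Table 1; $d_i(\cdot)$ denotes the weight of Doc. i for the given topic.

```latex
d_i(\mathrm{RGS\_T}_k) \;=\; \sum_{j=1}^{N} r_A(j,k)\, d_i(\mathrm{L\_A\_T}_j)
```

Under this reading, the global topic weight of a document is a weighted sum of its N local topic weights, which is the same product-and-sum pattern as the multiplication of a local topic matrix by a topic transformation matrix.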

For example, the document weight of Doc. 2, which belongs to the local document set Local A, for the global topic RGS_T₁ is obtained as follows.

d₂(RGS_T₁) = (0.006 * 0.020) + (0.021 * 0.062) + (0.016 * 0.030) + (0.007 * 0.022) + (0.011 * 0.045) ≈ 0.003

That is, the document weight of Doc. 2 for the global topic RGS_T₁ is calculated as 0.003. In the same way, the document topic assignment unit 250 may assign global topic weights not only to the local representative documents but also to all local documents that were not selected as representatives, and may generate as a result a two-dimensional weight matrix 850 between each local document and the global topics.
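The calculation for Doc. 2 can be reproduced with a short sketch. The function name `assign_global_weights` is hypothetical, and the split of each product into a Rule A coefficient (first factor) and a Doc. 2 local topic weight (second factor) is an assumption, inferred from the 0.021 coefficient that the description of FIG. 7 attributes to L_A_T₂.

```python
def assign_global_weights(local_topic_matrix, transformation):
    """Multiply (documents x local topics) by (local topics x global topics)."""
    return [
        [sum(w * t for w, t in zip(doc_row, col)) for col in zip(*transformation)]
        for doc_row in local_topic_matrix
    ]


# Rule A coefficients feeding RGS_T1 (assumed to be the first factor of each product).
rule_a_rgs_t1 = [0.006, 0.021, 0.016, 0.007, 0.011]

# Local topic weights of Doc. 2 (assumed to be the second factor of each product).
doc2_local = [0.020, 0.062, 0.030, 0.022, 0.045]

# Transformation matrix restricted to a single global topic column (RGS_T1).
transformation = [[coefficient] for coefficient in rule_a_rgs_t1]

weights = assign_global_weights([doc2_local], transformation)
doc2_rgs_t1 = weights[0][0]

# Unrounded sum is about 0.002551, which rounds to the 0.003 stated in the text.
print(round(doc2_rgs_t1, 3))
```

Passing a full local topic matrix (one row per local document) and a full transformation matrix (one column per global topic) to the same function would yield the complete document-by-global-topic weight matrix in one step.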

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: document topic modeling system
110: topic modeling apparatus 130: database
210: local topic information generation unit 230: global topic information generation unit
250: document topic assignment unit 270: control unit
510: local document set Local A 511: main local topics
513: local topic matrix 530: local document set Local B
531: main local topics 533: local topic matrix
610: main global topics 630: global topic matrix
710: topic component information 730: global topic matrix
750: topic transformation matrix 810: local topic matrix
830: topic transformation matrix
850: global topic weight assignment matrix for local documents

Claims

A topic modeling apparatus for documents, comprising:
a local topic information generation unit that generates a plurality of local document sets by dividing a global document set including a plurality of documents, and generates local topic information by performing local topic modeling;
a global topic information generation unit that generates a reduced global document set from representative documents in each of the plurality of local document sets, and generates global topic information by performing global topic modeling; and
a document topic assignment unit that assigns topic transformation information, generated through document mapping between the local and global topic information, to each of the plurality of documents.

The topic modeling apparatus of claim 1, wherein the local topic information generation unit
extracts at least one main local topic for each of the plurality of local document sets by performing the local topic modeling, and generates a local topic matrix between the local documents belonging to the local document set and the at least one main local topic.

The topic modeling apparatus of claim 1, wherein the global topic information generation unit
generates the reduced global document set by extracting at least one local representative document, in order of per-topic weight, from each of the plurality of local document sets.

The topic modeling apparatus of claim 1, wherein the global topic information generation unit
extracts at least one main global topic for the reduced global document set by performing the global topic modeling, and generates a global topic matrix between the local representative documents belonging to the reduced global document set and the at least one main global topic.

The topic modeling apparatus of claim 1, wherein the document topic assignment unit
extracts topic component information associated with a local representative document from a local topic matrix generated through the local topic modeling, and generates at least one topic transformation matrix by multiplying the topic component information and a global topic matrix generated through the global topic modeling.

The topic modeling apparatus of claim 5, wherein the document topic assignment unit
assigns global topic weights to the plurality of documents by multiplying each of the local topic matrices by the at least one topic transformation matrix.

A topic modeling method performed in a topic modeling apparatus for documents, the method comprising:
(a) generating a plurality of local document sets by dividing a global document set including a plurality of documents, and performing local topic modeling to generate local topic information;
(b) generating a reduced global document set from representative documents in each of the plurality of local document sets, and performing global topic modeling to generate global topic information; and
(c) assigning topic transformation information, generated through document mapping between the local and global topic information, to each of the plurality of documents.

The method of claim 7, wherein step (a) comprises
extracting at least one main local topic for each of the plurality of local document sets by performing the local topic modeling, and generating a local topic matrix between the local documents belonging to the local document set and the at least one main local topic.

The method of claim 7, wherein step (b) comprises
generating the reduced global document set by extracting at least one local representative document, in order of per-topic weight, from each of the plurality of local document sets.

The method of claim 7, wherein step (b) comprises
extracting at least one main global topic for the reduced global document set by performing the global topic modeling, and generating a global topic matrix between the local representative documents belonging to the reduced global document set and the at least one main global topic.

The method of claim 7, wherein step (c) comprises
extracting topic component information associated with a local representative document from a local topic matrix generated through the local topic modeling, and generating at least one topic transformation matrix by multiplying the topic component information and a global topic matrix generated through the global topic modeling.

The method of claim 11, wherein step (c) comprises
assigning global topic weights to the plurality of documents by multiplying each of the local topic matrices by the at least one topic transformation matrix.

A computer-readable recording medium on which a topic modeling method performed by a topic modeling apparatus for documents is recorded, the method comprising:
dividing a global document set including a plurality of documents to generate a plurality of local document sets, and performing local topic modeling to generate local topic information;
generating a reduced global document set from representative documents in each of the plurality of local document sets, and performing global topic modeling to generate global topic information; and
assigning topic transformation information, generated through document mapping between the local and global topic information, to each of the plurality of documents.