KR20180111646A

KR20180111646A - Device and method for chronological big data curation system

Info

Publication number: KR20180111646A
Application number: KR1020180036540A
Authority: KR
Inventors: 한상용; 최승진; 서지완; 유가람
Original assignee: 중앙대학교 산학협력단
Priority date: 2017-03-31
Filing date: 2018-03-29
Publication date: 2018-10-11
Also published as: KR102025813B1

Abstract

The present invention relates to a curation device based on chronological information and a control method thereof. More specifically, the present invention relates to a curation device based on chronological information for providing event flow information, and a method thereof. The present invention organizes various types of information scattered across the internet by topic and association, presents the organized data to users in an easy and accessible way, and makes the organized data reusable, so that valuable information can be generated by maintaining, managing, and providing the information. The curation device comprises: a data collecting part; a graph modeling part; and a chronological analyzing part.

Description

Technical Field

The present invention relates to a chronological information-based curation apparatus for providing event flow information, and a control method thereof.

The present invention relates to a chronological information-based curation apparatus and method, and more particularly to a chronological information-based curation apparatus and control method for providing event flow information.

With the advent and development of web technologies, vast quantities of different kinds of data are being produced rapidly, and the amount of information is increasing considerably. In this big data era, consumers can obtain more data and information than ever before, but it is not easy to pick out valuable data from the mass of information and knowledge. Finding valuable information in large amounts of data is becoming increasingly important, and many countries and companies are investing a great deal of time and money in data acquisition and analysis.

Digital curation refers to the task of organizing the various kinds of information scattered across the Internet by topic and association, presenting the organized data to users in an easy and accessible way, and making the data reusable. By maintaining, managing, and providing information, digital curation can generate valuable information and improve the accessibility and reusability of information.

Because searching for, understanding, and identifying important topics in this big data era takes considerable effort and cost, one of the important issues is to build a useful and reliable digital curation apparatus that can satisfy users' needs.

Existing information systems generally provide search services based on queries.

Semantic search is a form of search that improves accuracy by understanding the searcher's intent and the contextual meaning of words, in order to achieve better performance and more accurate results. However, current search systems, including such semantic systems, consider only time-independent parameters, and therefore have great difficulty representing and ranking search data in a way that accounts for the time decay effect.

Some events unfold over a long period of time, and the importance of and interest in an event change over time. When various events and/or incidents become entangled with other people, it becomes difficult to understand the intrinsic and underlying meaning of the search data.

Accordingly, research is needed on digital curation apparatuses and systems that further take into account the time at which events occur.

An object of the present invention is to solve the above-mentioned problems and other problems. Another object is to provide a curation apparatus that can classify data more efficiently by further considering the time of occurrence between events.

The present invention provides a chronological information-based curation apparatus and method that present the relevant key information in chronological order, to help the user gain a comprehensive understanding of a specific event or incident being searched for in a big data environment.

The present invention provides a chronological information-based curation apparatus and method that collect event or incident data from a variety of sources, analyze the related information chronologically, model a particular event or piece of knowledge over time and present it through a visualization tool, reduce the repetitive searching needed to understand a particular event or piece of knowledge, and generate reusable information.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above or other objects, according to one aspect of the present invention, there is provided a curation apparatus including: a data collecting unit that collects a plurality of topic data from data sources on the web; a graph modeling unit that extracts associations among the collected plurality of topic data; and a chronological analysis unit that, based on the extracted associations, classifies the plurality of topic data into an internal node set directly associated with a predetermined topic and an external node set indirectly associated with the predetermined topic.

Here, the data sources may include social network service (SNS) posts and news articles.

The predetermined topic may be a topic corresponding to a search keyword entered by a user.

In addition, the association between two topic data may be quantified as the number of documents that each contain both topic data.

The chronological analysis unit may classify topic data that have an association with the predetermined topic into the internal node set, and may classify topic data that have no association with the predetermined topic but are related to it by generation time into the external node set.

The chronological analysis unit may sort the plurality of classified topics in order of generation time.

To achieve the above or other objects, according to another aspect of the present invention, there is provided a control method of a curation apparatus, including: collecting a plurality of topic data from data sources on the web; extracting associations among the collected plurality of topic data; and, based on the extracted associations, classifying the plurality of topic data into an internal node set directly associated with a predetermined topic and an external node set indirectly associated with the predetermined topic.

The effects of the curation apparatus and its control method according to the present invention are as follows.

According to at least one embodiment of the present invention, the various kinds of information scattered across the Internet can be organized by topic and association, the organized data can be presented in a form that is easy for users to view, and the organized data can be made reusable, so that valuable information can be generated by maintaining, managing, and providing the information.

Further scope of applicability of the present invention will become apparent from the following detailed description. However, since various changes and modifications within the spirit and scope of the present invention will be apparent to those skilled in the art, the detailed description and specific embodiments, such as the preferred embodiments of the present invention, should be understood as given by way of example only.

FIG. 1 is a block diagram of a curation apparatus according to an embodiment of the present invention.
FIG. 2 is a table showing the characteristics of SNS, blogs, and news articles as distinguished according to an embodiment of the present invention.
FIG. 3 shows a topic alignment structure that arranges main topics and subtopics according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating a time relation link (time relation graph) according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating a similarity link (similarity relation graph) according to an embodiment of the present invention.
FIGS. 6 and 7 show an example in which the reuse unit 400 stores data in a tree structure according to an embodiment of the present invention.
FIG. 8 is a flowchart of a control method of a curation apparatus according to an embodiment of the present invention.
FIG. 9 is a result graph for a search on Ahn Cheol-soo, a well-known Korean politician, performed by a curation apparatus according to an embodiment of the present invention.
FIG. 10 is a search result graph for the hacking incident at "Nonghyup", a Korean bank, produced by a curation apparatus according to an embodiment of the present invention.
FIG. 11 shows a result graph, different from the above experimental results, produced by a curation apparatus according to an embodiment of the present invention.

Hereinafter, the embodiments disclosed herein are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of figure number, and redundant descriptions of them are omitted. The suffixes "module" and "unit" for components used in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not by themselves have distinct meanings or roles. In describing the embodiments disclosed herein, detailed descriptions of related known technology are omitted when they are judged likely to obscure the gist of the embodiments. The accompanying drawings are intended only to aid understanding of the embodiments disclosed herein; the technical idea disclosed herein is not limited by the accompanying drawings, and should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinals, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In the present application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention relates to a curation apparatus and a control method thereof, where curation refers to a method of deciding how to present various data. Taking news articles as an example, when a specific keyword is entered into an existing search system such as 'Naver News', only identical articles from the same time (differing only slightly in title) are returned. As a result, the search result screen of such a system makes it rather difficult to see the variety of results related to the search keyword. To overcome this, the present invention aims to construct a plurality of topics (various keywords of articles over time) associated with a specific search keyword, appropriately partition the plurality of topics using several parameters, and provide the user with effective search results.

In particular, the present invention proposes considering not only direct relevance but also indirect relevance when constructing the plurality of topics. The indirect relevance is inferred by considering the topics whose occurrence time is closest to the occurrence time of the topic for the subject (keyword) entered at search time. Here, indirect relevance may mean a criterion that a person would not have managed to judge.

In an experimental example concerning a "forest fire in Pohang", searching with the keyword "forest fire" would generally also show, by association, every region where a forest fire occurred. In addition, according to an embodiment of the present invention, because indirect relevance through time is also considered, a search with the keyword "Pohang" may also show "forest fires" that occurred in other regions. This is because "Pohang" and "forest fire" are not directly related (in cases other than that incident), but they are topics that occurred at the same time, so an indirect relevance can be recognized according to the present invention. Because of this indirect relevance, when the search term "Pohang" is entered, the topic "forest fire" is constructed along with it (such a topic is hereinafter referred to as an external topic), and the "forest fire" topic allows other forest-fire-related search results to be shown to the user.

In this way, a set of directly related topics (Internal Node Group, hereinafter the internal node set) and a set of topics that are indirectly related through time (External Node Group, hereinafter the external node set) are considered simultaneously. In addition, a specific threshold determines how many search results are provided (using Maximum Cut), so that only results likely to interest the user are shown. An appropriate threshold at which the time difference is small (topics that occurred recently relative to the topic) and the association is maximal (topics similar to the topic) makes it possible to provide only the most meaningful results. That is, the internal node set retrieves topics by intuitive relevance, while the external node set consists of topics provided to uncover implicit areas.
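The split into internal and external node sets can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the `Topic` record, the `classify` function, and the fixed thresholds `min_assoc` and `max_gap` are hypothetical, and the Maximum Cut step mentioned above is replaced here by simple fixed cutoffs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Topic:
    keyword: str          # topic keyword, e.g. "Pohang"
    created_at: datetime  # generation time of the topic


def classify(query, topics, association, min_assoc=1, max_gap=timedelta(hours=24)):
    """Split topics into (internal, external) node sets relative to `query`.

    association: dict mapping frozenset({kw_a, kw_b}) to an association score
                 (assumed to come from co-occurrence counting).
    min_assoc, max_gap: hypothetical thresholds standing in for the
                 threshold selection described in the text.
    """
    internal, external = [], []
    for t in topics:
        if t.keyword == query.keyword:
            continue
        score = association.get(frozenset((query.keyword, t.keyword)), 0)
        gap = abs(t.created_at - query.created_at)
        if score >= min_assoc:
            internal.append(t)   # directly associated with the query topic
        elif gap <= max_gap:
            external.append(t)   # no association, but close in generation time
    by_time = lambda t: t.created_at
    # Both sets are returned in chronological order.
    return sorted(internal, key=by_time), sorted(external, key=by_time)
```

With a query topic "Pohang", a topic "forest fire" that has no recorded association but was generated two hours later would land in the external set, which mirrors the example given earlier.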

FIG. 1 is a block diagram of a curation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a chronological information-based curation apparatus according to an embodiment of the present invention may include a data collecting unit 100, a graph modeling unit 200, a chronological analysis unit 300, and a reuse unit 400.

The data collecting unit 100 collects a plurality of topic data from data sources on the web. Here, the data sources may include social network service (SNS) posts, news articles, and the like, and may include any source that can provide data through the WWW (World Wide Web) protocol or other communication protocols.

Topic data may mean keyword data about an event or subject that can attract people's interest or attention. Examples would be words such as 'presidential election' or 'fire'. The specific method of collecting such topic data is described in detail below.

The data collecting unit 100 does not simply collect topic data alone; it may also collect information about the time at which the topic data was generated (which may include both date and time; hereinafter, the generation date and time together are referred to as the generation time).

The generation time, topic data, and association data collected from each data source by the data collecting unit 100 go through a preprocessing step described later (the graph modeling unit stage), which improves accuracy in later stages and allows the data to be used efficiently.

In particular, an embodiment of the present invention proposes collecting data by dividing the social network services (SNS) among the data sources into at least the following two types according to their character.

For example, SNS may be divided into a first SNS, on which users can easily share their daily lives from a mobile device or PC, and a second SNS, which is provided for sharing detailed information. Examples of the first SNS include KakaoTalk, Twitter, Instagram, and Facebook; examples of the second SNS include blogs and Brunch.

The reason for distinguishing SNS by character in this way is that the kind of data collected may differ depending on that character.

FIG. 2 is a table showing the characteristics of SNS, blogs, and news articles as distinguished according to an embodiment of the present invention. In the table, "SNS" means the first SNS described above, and "blog" means the second SNS described above.

In the case of the first SNS, which is for everyday conversation, users write their opinions casually, so documents are very short and propagation power is very high, but reliability is somewhat lacking.

In the case of the second SNS, which is for sharing information, documents are also on the short side, but in contrast to the above, propagation power is somewhat lacking, and reliability may be at a medium level, higher than that of the first SNS.

In contrast to the two examples above, news articles are necessarily far more reliable than the SNS described above, and because they are written by professional reporters, their propagation power is also very high. In addition, because documents have at least a certain length, document length may also be greater than for the two SNS types.

In an embodiment of the present invention, for sources such as the first SNS, where propagation power is high but document length and reliability are rather low, it is proposed to collect only the topic data and the generation time of that topic data.

In addition, in an embodiment of the present invention, because the second SNS and news articles are data sources with somewhat higher reliability, it is proposed to collect associations between data from them.
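The source-dependent collection policy described above can be expressed as a small lookup. This is only a sketch: the `Source` enum, the field names, and the `fields_to_collect` function are hypothetical, and it assumes that the more reliable sources also yield topic data and generation time in addition to associations.

```python
from enum import Enum


class Source(Enum):
    FIRST_SNS = "first_sns"    # e.g. short everyday posts
    SECOND_SNS = "second_sns"  # e.g. blogs
    NEWS = "news"              # news articles


def fields_to_collect(source: Source) -> set:
    """Return the kinds of data gathered from a given source type."""
    # Every source is assumed to provide topic data and its generation time.
    fields = {"topic", "created_at"}
    # Associations are collected only from the more reliable sources.
    if source in (Source.SECOND_SNS, Source.NEWS):
        fields.add("associations")
    return fields
```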

The associations collected in this way are used below to classify the topic data.

The extraction (collection) of these associations is performed by the graph modeling unit 200.

The graph modeling unit 200 performs preprocessing to obtain meaningful data from the data collected by the data collecting unit 100. Preprocessing here covers all of the various steps needed to select only the important data required for analysis.

The graph modeling unit 200 extracts associations among the plurality of topic data. In the example above, the associations may be extracted (collected) from data sources whose reliability is accepted to some degree, such as the second SNS (blogs and the like) and news articles. The association data may be data that uses at least one of the similarity and the co-occurrence frequency between the plurality of topic data. That is, the graph modeling unit 200 uses at least one of similarity and co-occurrence frequency when determining the association between one topic data and another.

The co-occurrence frequency may mean the number of documents that each contain both topic data. That is, it is the number of cases in which the words representing the two topic data appear together in one document. For example, to determine the association between first topic data and second topic data, the number of single documents that contain both the first and second topic data may be taken as the association between them. In this case, the higher the association value, the more closely the two topic data can be considered related.
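Counting co-occurrence as defined above can be sketched in a few lines. The function name `cooccurrence_counts` and the use of plain substring matching to decide whether a document "contains" a topic are assumptions made for illustration; a real system would likely tokenize the text first.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, topics):
    """Count, for each pair of topics, the documents that contain both.

    documents: iterable of document strings.
    topics:    iterable of topic keywords.
    Returns a Counter keyed by frozenset({topic_a, topic_b}).
    """
    counts = Counter()
    for doc in documents:
        # Topics present in this document (substring match, an assumption).
        present = sorted(t for t in topics if t in doc)
        for a, b in combinations(present, 2):
            counts[frozenset((a, b))] += 1
    return counts
```

Because the result is a `Counter`, looking up a pair that never appears together simply yields 0, which corresponds to "no association".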

When a strong association exists between two topic data (that is, when the association value is high), they are divided into a main topic and a subtopic according to the order of their generation times. The relationships between topic data may be expressed as a graph structure that can be scaled with MapReduce for parallel processing.

As described in detail below, when the collected data are visualized as a graph, each topic data is represented as a node in a network, and each association is represented as a link connecting one node to another.

FIG. 3 shows a topic alignment structure that arranges main topics and subtopics according to an embodiment of the present invention. The illustrated topic alignment structure is a hierarchical structure of the topic data related to a predetermined topic (a specific event).

Each topic data may contain subtopic data according to generation time (event creation time), and may recursively have subtopics (sub-subtopic data beneath subtopic data). For each main topic, the subtopics are represented hierarchically using a directed graph structure.

As shown in FIG. 3, the graph consists of topic data as nodes and directed links that connect pairs of nodes. The direction indicates the order of generation times, and the value may indicate the difference in generation times. Values are not shown in FIG. 3; they are described below with reference to FIGS. 4 and 5.

In FIG. 3, node A (301-1) represents the main topic (for example, the topic for a search keyword entered by a user), and the other nodes (301-2 to 301-6) represent subtopics. This topic alignment structure is useful for understanding the hierarchical structure of topics. Analyzing the graph makes it possible to determine which events occurred as a sequence. In FIG. 3, node C (301-3) has nodes E and F (301-5, 301-6) for two sub-subtopics E and F, which shows that main topic A has an implicit relationship with E and F. From this it can be seen that subtopic C occurred after main topic A, and that topics E and F occurred following topic C.
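The hierarchy just described (A, then C, then E and F) can be sketched as a directed graph in which each strongly associated pair is oriented from the earlier topic to the later one. The functions `build_hierarchy` and `descendants`, and the timestamps used in any example, are hypothetical; only the A, C, E, F relationships come from the description of FIG. 3.

```python
def build_hierarchy(created_at, strong_pairs):
    """Orient each strongly associated pair from earlier to later topic.

    created_at:   dict mapping topic name to its generation time.
    strong_pairs: iterable of (topic_a, topic_b) with a strong association.
    Returns a dict mapping each topic to the list of its subtopics.
    """
    children = {name: [] for name in created_at}
    for a, b in strong_pairs:
        earlier, later = (a, b) if created_at[a] <= created_at[b] else (b, a)
        children[earlier].append(later)
    return children


def descendants(children, root):
    """Collect every topic reachable from root (subtopics, sub-subtopics, ...)."""
    seen, stack = [], list(children[root])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(children[node])
    return seen
```

Walking the descendants of the main topic is one way to surface the implicit relationship between A and the sub-subtopics E and F.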

Two types of links used to represent the association between two topic data graphically are described with reference to FIGS. 4 and 5.

FIG. 4 is an exemplary diagram illustrating a time relation link (time relation graph) according to an embodiment of the present invention. FIG. 5 is an exemplary diagram illustrating a similarity link (similarity relation graph) according to an embodiment of the present invention.

Referring to FIG. 4, the value of a link connecting two topic data is the difference between their generation times. That is, the difference between the generation time of the first topic data and that of the second topic data is expressed as a number. A time relation link is directed according to which generation time comes first.

FIG. 4 shows the time relation links among five topic data (401-1 to 401-5). It indicates that topic data B (401-2) was generated 25 hours after topic data A (401-1), and that topic data D (401-4) was generated 1 hour after topic data A (401-1).
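The time relation links of FIG. 4 can be illustrated with a minimal Python sketch. It assumes that every pair of topics is linked from the earlier to the later one and that the link value is the gap in hours; the function name `time_relation_links` and the timestamps are hypothetical and chosen only to reproduce the 25-hour and 1-hour gaps mentioned above.

```python
from datetime import datetime

def time_relation_links(generated):
    """Return directed links (earlier, later, hours_between) for every topic pair.

    `generated` maps a topic name to its generation time. Linking all pairs
    is an assumption made for illustration.
    """
    ordered = sorted(generated.items(), key=lambda item: item[1])
    links = []
    for i, (src, t_src) in enumerate(ordered):
        for dst, t_dst in ordered[i + 1:]:
            hours = (t_dst - t_src).total_seconds() / 3600
            links.append((src, dst, hours))
    return links

# Hypothetical timestamps: B is 25 hours after A, D is 1 hour after A.
generated = {
    "A": datetime(2013, 3, 1, 0, 0),
    "B": datetime(2013, 3, 2, 1, 0),
    "D": datetime(2013, 3, 1, 1, 0),
}
links = time_relation_links(generated)
for src, dst, hours in links:
    print(f"{src} -> {dst}: {hours:g} h")
```

Sorting by generation time first means each link always points from the earlier topic to the later one, which matches the directionality described for FIG. 4.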

FIG. 5 shows links that represent only similarity, independent of direction. Such a similarity link may have a value obtained by quantifying the co-occurrence frequency described above using TF-IDF (Term Frequency - Inverse Document Frequency).

After the time relation links are completed as in FIG. 4, the similarity between the topic data shown in FIG. 5 can be calculated. As explained above, two topic data that appear together in a document are regarded as similar. To quantify the association, the co-occurrence frequency in blogs and news is used to assign a numerical value to the association between each pair of topic data (keywords).
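As a rough illustration of co-occurrence counting, the Python sketch below counts the documents in which two topics appear together and scales the counts to the range 0 to 1. It deliberately uses a plain document count rather than TF-IDF weighting, and the function name `cooccurrence_similarity`, the tokenized documents, and the division by the largest count are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_similarity(documents, topics):
    """Count documents in which two topics co-occur, scaled to [0, 1].

    `documents` is a list of token sets. Dividing by the largest count is
    an illustrative normalization, not a scheme taken from the source.
    """
    counts = Counter()
    for tokens in documents:
        present = sorted(t for t in topics if t in tokens)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    if not counts:
        return {}
    largest = max(counts.values())
    return {pair: n / largest for pair, n in counts.items()}

# Hypothetical tokenized blog and news documents.
documents = [
    {"hacking", "bank"},
    {"hacking", "bank", "north"},
    {"north", "candy"},
]
similarity = cooccurrence_similarity(documents, ["hacking", "bank", "north", "candy"])
print(similarity)
```

Pairs are stored with their names in sorted order, so each undirected similarity link appears exactly once.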

Returning to FIG. 1, the chronological analyzer 300 is described.

Through time-ordered analysis, the chronological analyzer 300 classifies topic data into an internal node set (Internal Chronological Node Group), which is the set of events that occurred in chronological order for a specific subject (the main topic), and an external node set (External Chronological Node Group), which is the set of potential events that have no direct association with that subject (the main topic). Based on the associations between topic data extracted by the graph modeling unit 200, the topic data are divided into the internal node set and the external node set.

The internal node set is the set of topic data that are arranged in chronological order and related to the main topic. The external node set is the set of topic data that are indirectly related to the main topic.

The relationship between a parent node (upper node) and a child node (lower node) is determined by chronological order.

The external node set has a weak relationship with the main topic, but it indicates that similar events occurred at around the same time. That is, as shown in FIG. 3, the external topic (304-1) at the top of the external node set may have little or no association with the main topic (301-1), but it is related in that its occurrence time is close.

The number of nodes ultimately displayed in the topic alignment structure of FIG. 3 may be limited to an appropriate number according to the generation times and the similarity values of the links. In particular, in an embodiment of the present invention, the number may be limited based on a maximum cut method. That is, the number of nodes to be displayed can be controlled through a maximum cut threshold that is preset or set by user input.

Meanwhile, if another topic appears long after the first topic appeared, the relationship between the two topics may not be important. Alternatively, a highly related topic may reappear after a long time despite the high similarity between the two topics. To accommodate both cases, the nodes shown to the user are selected using a maximum cut threshold that gives a high weight to similarity and a minimal weight to time.
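The description does not give a formula for the maximum cut threshold. One possible reading, shown here only as a hypothetical Python sketch, is a weighted score in which similarity dominates and the time gap contributes little, with nodes kept when the score reaches the threshold. The function name `visible_nodes`, the weights 0.9 and 0.1, and the sample values are assumptions.

```python
def visible_nodes(candidates, threshold, w_sim=0.9, w_time=0.1):
    """Keep nodes whose weighted score reaches the threshold.

    Each candidate is (name, similarity, time_gap), both values in [0, 1].
    The weights are hypothetical: similarity is weighted heavily and the
    time gap only slightly, as one reading of the description.
    """
    kept = []
    for name, similarity, time_gap in candidates:
        score = w_sim * similarity + w_time * (1.0 - time_gap)
        if score >= threshold:
            kept.append(name)
    return kept

# Hypothetical candidates relative to a main topic.
candidates = [
    ("B", 0.9, 1.0),  # highly similar, reappears much later
    ("C", 0.2, 0.1),  # close in time, weakly similar
    ("D", 0.7, 0.2),
]
shown = visible_nodes(candidates, threshold=0.5)
print(shown)
```

With these weights, a highly similar topic that reappears much later is still displayed, while a weakly similar topic that is merely close in time is cut. Raising the threshold reduces the number of displayed nodes.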

The reuse unit 400 can visualize the graph built as described above and store the generated result. That is, it visually presents the result of the chronological analysis and stores the analyzed result in a tree structure for later use.

FIGS. 6 and 7 are diagrams showing an example in which the reuse unit 400 stores the result in a tree structure according to an embodiment of the present invention.

Referring to FIGS. 6 and 7, once the graph has been constructed as described above, the reuse unit 400 stores all of the graph's information in a standardized format. As a result, the availability and reusability of the data increase, which is useful for future maintenance.

The analysis result is stored as a tree structure so that it can be used in later chronological analysis; as an example, it is expressed as a tree using XML (eXtensible Markup Language), as in FIG. 7. That is, the graph data of FIG. 6 can be stored in the XML format of FIG. 7. The analysis result can be provided through an API or reused when a user enters a keyword for a major event as a query.
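Storing a topic tree as XML can be sketched in Python with the standard `xml.etree.ElementTree` module. The schema of FIG. 7 is not reproduced in the text, so the element name `topic`, the attributes `name` and `generated`, and the sample tree are hypothetical.

```python
import xml.etree.ElementTree as ET

def to_xml(node):
    """Convert a nested topic dict into an XML element, recursively.

    The element and attribute names are hypothetical; the actual schema
    of FIG. 7 is not given in the text.
    """
    elem = ET.Element("topic", name=node["name"], generated=node["generated"])
    for child in node.get("children", []):
        elem.append(to_xml(child))
    return elem

# Hypothetical tree: main topic A, subtopic C, sub-subtopics E and F.
tree = {
    "name": "A", "generated": "2013-03-01T00:00",
    "children": [{
        "name": "C", "generated": "2013-03-01T05:00",
        "children": [
            {"name": "E", "generated": "2013-03-02T00:00"},
            {"name": "F", "generated": "2013-03-03T00:00"},
        ],
    }],
}
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
restored = ET.fromstring(xml_text)  # reading the stored result back for reuse
print(xml_text)
```

Because the parent and child nesting of the XML mirrors the tree, a stored result can be parsed back and traversed without rebuilding the graph.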

FIG. 8 is a flowchart of a control method of the curation device according to an embodiment of the present invention.

As illustrated, in step S810 a plurality of topic data may be collected from data sources on the web. As one example, as described above, topic data and generation times are collected from a first SNS data source, and associations between the data are collected from a second SNS data source such as blogs and from news articles.

In step S830, the chronological analyzer 300 may classify the topic data into an internal node set and an external node set.

In step S840, the reuse unit 400 visualizes the result of the chronological analysis as a graph and stores the analysis result in a tree structure for reuse.

The following drawings and related description explain in detail the data analyzed in an actual experiment.

For the experiment, one month of data from Twitter, Naver Blog, and Naver News was used. A total of 1.5 million tweets were collected from Twitter, with a total size of about 30 GB. After removing meaningless words from the tweets, a total of 600 topic data were extracted, and the graph was constructed from them. In addition, 22,692 Naver Blog posts and 16,288 Naver News articles were collected. The blog data averaged 328 words per document and the news data averaged 254 words per document. The links of the graph were built from the Naver Blog and Naver News data, producing 123,894 links in the topic alignment structure. Although this experiment limited the data set to SNS, blogs, and news, it can easily be extended by adding preprocessing modules that collect data from more diverse sources.

The noun extractor 'Komoran' was used to extract topic data from the data sources. Generation times and keywords were used to extract the topic data. Related blog and news data were collected using the keywords extracted from the Twitter data source. The co-occurrence frequency of words in documents was extracted from the blog and news data, topic and TF-IDF pairs were calculated, and the value of each similarity link, which corresponds to the association, was normalized to between 0 and 1.

The time relation links were calculated from the posting times of the tweets containing each topic. The time relation link value is calculated from the difference between the document generation time and a normalized value (or average value), and is likewise normalized to between 0 and 1. The graph is constructed from the topics, the similarities, and the time relations (associations). The internal and external node sets are formed by the chronological analysis. When the similarity link representing an association has a value higher than a certain threshold, the nodes are included as a parent node and a child node. This set is called the internal node set (Internal Chronological Group). Nodes that do not exceed the threshold were classified into the external node set (External Chronological Group).
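The threshold-based split into internal and external node sets can be sketched in Python as follows. The sketch simplifies the description: it compares each topic's similarity to the main topic against a single threshold and orders both sets by generation time. The function name `classify_topics`, the threshold 0.5, and the sample values are hypothetical.

```python
def classify_topics(similarity_to_main, generated_hours, threshold):
    """Split topics into internal and external sets, ordered by generation time.

    Topics whose similarity to the main topic exceeds the threshold go to
    the internal set; all others go to the external set.
    """
    by_time = sorted(similarity_to_main, key=lambda topic: generated_hours[topic])
    internal = [t for t in by_time if similarity_to_main[t] > threshold]
    external = [t for t in by_time if similarity_to_main[t] <= threshold]
    return internal, external

# Hypothetical normalized similarities and generation offsets in hours.
similarity_to_main = {"B": 0.7, "C": 0.8, "X": 0.1}
generated_hours = {"B": 25, "C": 5, "X": 3}
internal, external = classify_topics(similarity_to_main, generated_hours, threshold=0.5)
print(internal, external)
```

Ordering the internal set by generation time gives the parent-before-child sequence from which the tree of FIG. 3 could be built.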

FIG. 9 is a result graph produced by the curation device according to an embodiment of the present invention for a search on Ahn Cheol-soo, a well-known Korean politician. "Ahn Cheol-soo" is the main topic, and "Kim Jong-hoon" appears as a subtopic. "Ahn Cheol-soo" was an opposition presidential candidate and Kim Jong-hoon was regarded as a ruling party candidate; "Ahn Cheol-soo" later made a "resignation" from his presidential candidacy to support the opposition candidate. This series of events is visualized as a graph divided into the internal node set and the external node set related to Ahn Cheol-soo.

FIG. 10 is a search result graph produced by the curation device according to an embodiment of the present invention for the hacking incident at "Nonghyup", a Korean bank. One of the main suspicions at the time was that the suspect behind the incident was "North Korea". People also thought that the hacking incident was related to the "Cheonan" warship. The external link to "North Korea" is "Kim Kwan-jin", South Korea's Minister of National Defense at the time of the incident. This series of events is visualized as a graph divided into the internal node set and the external node set related to "Nonghyup". Some keywords appear unrelated, but in most cases the results were related to the "Nonghyup hacking".

FIG. 11 shows a result graph produced by the curation device according to an embodiment of the present invention that differs from the experimental results above. "White Day" is a day on March 14 when men give candy to women. "White Day" has the subtopic "candy". The two keywords are semantically similar, but it is difficult to confirm a temporal relationship between them. The proposed 'Chronological Big data Curation' method works robustly in most cases across a variety of search keywords and helps users understand a series of topics chronologically. However, it showed limitations for general subjects and for short-lived, one-off events.

An evaluation was performed by comparing the chronological information-based curation method according to an embodiment of the present invention with an existing method. The evaluation criterion was the question "Can a search term give a comprehensive understanding of the topic?" The inventors compared the method with the Naver News portal, the most widely used news portal in Korea.

FIG. 12 shows Naver News results for the search keywords "Ahn Cheol-soo" and "hacking". Referring to FIG. 12(a), the result for "Ahn Cheol-soo", even when the results are sorted by the accuracy (relevance) search option, the first page shows only articles from the most recent date, March 31. An article from March 29, two days earlier, appeared 130th. As the figure shows, it is difficult to understand the overall flow of events from the news results alone. In contrast, FIG. 12(b), the result for the search keyword "hacking", makes it easy to understand the whole sequence of events, centered on the events.

For a more precise evaluation, a quantitative comparison was made between the conventional method and 'Chronological Big data Curation'. For this quantitative evaluation, the unique words appearing in the articles were measured.

FIG. 13 shows the result of comparing the chronological information-based curation method according to an embodiment of the present invention with Naver News search results.

The averages of the following items were calculated per search term: 1) unique word ratio: the proportion of proper nouns that can identify the topic of the news; 2) time distribution: the average time range of the generation times of the articles shown in the final result. K@N denotes a value calculated over the top N news articles returned by the search.
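The two measures are defined only loosely in the text. The Python sketch below shows one hypothetical way to compute them over the top N results: the unique word ratio as distinct nouns divided by all nouns, and the time distribution as the span in days between the earliest and latest article. The function names, the article fields `nouns` and `date`, and the sample data are assumptions.

```python
from datetime import date

def unique_word_ratio_at_n(articles, n):
    """Share of distinct nouns among all nouns in the top n articles."""
    words = [w for article in articles[:n] for w in article["nouns"]]
    return len(set(words)) / len(words) if words else 0.0

def time_span_days_at_n(articles, n):
    """Days between the earliest and latest article among the top n."""
    dates = [article["date"] for article in articles[:n]]
    return (max(dates) - min(dates)).days if dates else 0

# Hypothetical ranked search results with pre-extracted nouns.
articles = [
    {"date": date(2013, 3, 31), "nouns": ["a", "b"]},
    {"date": date(2013, 3, 31), "nouns": ["a", "c"]},
    {"date": date(2013, 3, 29), "nouns": ["d"]},
]
print(unique_word_ratio_at_n(articles, 2), time_span_days_at_n(articles, 3))
```

Under this reading, a result list that repeats the same story from a single day scores low on both measures, while a list that covers several events over time scores higher.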

Comparing the top 5 to 15 search results, the proportion of unique words is higher in every case for the curation method according to an embodiment of the present invention than for the existing Naver News search results. The time distribution of the search results is also higher for the curation method according to an embodiment of the present invention.

In other words, the results according to the present invention can be regarded as more comprehensive and accurate than those of Naver News search.

Embodiments of the chronological information-based curation device and the control method using it have been described above, but they are described as at least one embodiment, and the technical idea, configuration, and operation of the present invention are not limited thereby; the scope of the technical idea of the present invention is not limited by the drawings or by the description referring to the drawings. The concepts and embodiments presented here may also be used by those of ordinary skill in the art as a basis for modifying or designing other structures that serve the same purpose. Equivalent structures modified or changed by those of ordinary skill in the art remain bound by the technical scope of the invention described in the claims, and various changes, substitutions, and alterations are possible without departing from the spirit or scope of the invention described in the claims.

Claims

1. A curation device comprising:
a data collection unit that collects a plurality of topic data from a data source on the web;
a graph modeling unit that extracts associations among the collected plurality of topic data; and
a chronological analyzer that, based on the extracted associations, classifies the plurality of topic data into an internal node set directly related to a main topic and an external node set indirectly related to the main topic.

2. The curation device according to claim 1,
wherein the data source includes social network service (SNS) posts and news articles.

3. The curation device according to claim 1,
wherein the main topic is a topic for a search keyword entered by a user.

4. The curation device according to claim 1,
wherein the association between two topic data is the number of documents in which the two topic data appear together in a single document.

5. The curation device according to claim 4,
wherein the chronological analyzer classifies topic data that has an association with the main topic into the internal node set, and
classifies topic data that is not associated with the main topic but is related in generation time into the external node set.

6. The curation device according to claim 1,
wherein the chronological analyzer sorts the classified plurality of topics in order of generation time.

7. A control method of a curation device, comprising:
collecting a plurality of topic data from a data source on the web;
extracting associations among the collected plurality of topic data; and
classifying, based on the extracted associations, the plurality of topic data into an internal node set directly related to a predetermined topic and an external node set indirectly related to the predetermined topic.