KR101575683B1

KR101575683B1 - Method for analyzing trend based on context over time

Info

Publication number: KR101575683B1
Application number: KR1020140187026A
Authority: KR
Inventors: 이상근; 히즈불 알람; 류우종
Original assignee: 고려대학교 산학협력단; 포항공과대학교 산학협력단
Priority date: 2014-12-23
Filing date: 2014-12-23
Publication date: 2015-12-09

Abstract

Disclosed is a method for analyzing context-based trends over time from a document set by using a device capable of calculating probability distributions. The method includes the steps of: extracting a word distribution to be used to select words and a hashtag distribution to be used to select hashtags; extracting a topic distribution for each document included in the document set; performing statistical inference on the word distribution, the hashtag distribution, and the topic distribution; for each word of each document included in the document set, extracting a topic from the topic distribution and extracting a time from a beta distribution over time; and, for each word of each document included in the document set, extracting a word or a hashtag.

Description

METHOD FOR ANALYZING TREND BASED ON CONTEXT OVER TIME

Embodiments according to the concept of the present invention relate to an apparatus and method for context-based trend analysis over time, and more particularly to a trend analysis apparatus and method that analyze context-based trends over time from a set of documents written by users on social media.

As the number of users of social media such as Twitter, Facebook, and microblogs grows, users post opinions about their own interests, social issues, and the like on social media. Among these opinions, a subject that many users write about, that is, a subject in which many users share a common interest, is defined as a trend. Accordingly, much research is under way on analyzing trends using sets of documents written by users on social media.

The prior art described above takes temporal attributes into account when analyzing trends, and analyzes trends and their changes over time. For example, when a trend concerning the Sewol ferry disaster is extracted from social media, changes in the trend over time can be analyzed, such as when and how users' interest increases or decreases. However, this prior art is limited in that it cannot analyze the context of a trend, that is, which emotions, organizations, persons, and the like are associated with the trend.

The technical problem to be solved by the present invention is to provide a trend analysis apparatus and method that, in analyzing trends from a document set, extract the context associated with a trend and analyze changes in the trend over time in consideration of the extracted context.

A method for analyzing context-based trends according to an embodiment of the present invention is a method for analyzing context-based trends over time from a document set using a device capable of calculating probability distributions, the method including: extracting, for each topic, a lexical distribution to be used for selecting words and a hashtag distribution to be used for selecting hashtags; extracting a topic distribution for each document included in the document set; performing statistical inference on the lexical distribution, the hashtag distribution, and the topic distribution; for each word of each document included in the document set, extracting a topic from the topic distribution and extracting a time from a beta distribution over time; and, for each word of each document included in the document set, extracting a word or a hashtag.

The trend analysis apparatus and method according to embodiments of the present invention make it possible to monitor whether users' interest in a trend increases or decreases over time.

They also make it possible to analyze the context of a trend, such as the emotions, organizations, and persons associated with it.

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention may be more fully understood.
FIG. 1 is a functional block diagram of a trend analysis apparatus according to an embodiment of the present invention.
FIG. 2 shows the notation used in the trend analysis method using the trend analysis apparatus shown in FIG. 1.
FIG. 3 is a diagram for explaining a context-based trend analysis method over time using the trend analysis apparatus shown in FIG. 1.
FIG. 4 shows the concept of a context-based trend extraction method over time using the trend extraction apparatus shown in FIG. 1.
FIG. 5 is a flowchart for explaining a context-based trend extraction method over time using the trend analysis apparatus shown in FIG. 1.
FIG. 6 shows an example of trends analyzed using the trend analysis apparatus shown in FIG. 1.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed herein are given only for the purpose of describing those embodiments. Embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail herein. This is not intended, however, to limit the embodiments to the particular forms disclosed; they include all modifications, equivalents, and alternatives falling within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another; for example, without departing from the scope of rights according to the concept of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present. Other expressions describing the relationship between components, such as "between" versus "directly between" or "adjacent to" versus "directly adjacent to", should be interpreted in the same way.

The terminology used herein is for describing particular embodiments only and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described herein, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a functional block diagram of a trend analysis apparatus according to an embodiment of the present invention. Referring to FIG. 1, a trend analysis apparatus 10 capable of analyzing trends based on context over time includes a document collection unit 100, a document storage unit 200, a trend extraction unit 300, and a trend storage unit 400. The trend analysis apparatus 10 may also be referred to as a device capable of calculating probability distributions.

The document collection unit 100 may collect one or more documents, over a network such as the Internet, from a server that provides social media or a social service. The document set collected by the document collection unit 100 may be stored in the document storage unit 200. The document set stored in the document storage unit 200, or part of it, may have been built previously. The document set may include a plurality of tweets.

The trend extraction unit 300 may extract context-based trends over time from the document set stored in the document storage unit 200. Specifically, the trend extraction unit 300 extracts topics over time and contexts over time from the document set stored in the document storage unit 200. That is, the trend extraction unit 300 can extract context-based trends over time by considering topic, time, and context simultaneously.

The trend extraction unit 300 may also be referred to as a context-based trend-over-time extraction unit. Trends extracted by the trend extraction unit 300 are stored in the trend storage unit 400.

There has been research that uses topic models to analyze trends in which many social media users share a common interest. Going beyond the analysis of trends at a fixed point in time, research is also in progress that not only analyzes trends with the attribute of time taken into account but also monitors changes in trends over time.

However, the prior art described above focuses simply on analyzing trends and does not consider context, that is, what emotions or thoughts users have about an analyzed trend and which institutions, organizations, or persons are associated with it.

It would be useful to be able to consider the context of a trend while analyzing the trend and its changes over time. If the emotions or thoughts that users have about a trend, and the institutions, organizations, or persons associated with the trend, can also be analyzed, the trend can be analyzed from a variety of perspectives rather than only in terms of the trend itself. For example, trend analysis firms and marketing firms that focus on trends could devise varied marketing strategies for varied users if they also took the context of trends into account.

The method and apparatus for context-based trend analysis over time according to the present invention have the advantage that they can automatically analyze not only trends over time but also the context of those trends from one or more sets of documents written by users on social media.

Context can be defined in various ways depending on the field in which the concept is used. In the present invention, however, context may be defined as a cluster of hashtags that frequently appear together with a trend.

A tweet on Twitter contains various metadata such as hashtags, mentions, images, and links. Among these, hashtags are important metadata that play a variety of roles.

A tweet can express an opinion only in text limited to 140 characters. To overcome this limit, users employ hashtags, which carry a variety of meanings and useful information in compact form.

For example, among hashtags related to the Sewol ferry disaster, "#PrayForSouthKorea" may denote an online community for commemorating the victims of the disaster, "#SadStory" users' emotions about the disaster, "#RestInPeace" users' wishes for the victims of the disaster, and "#YellowRibbon" a campaign by users hoping for the rescue of those missing in the disaster. Thus, although a hashtag is expressed as a single word, it carries meaningful information such as users' emotions and wishes, or institutions, organizations, and persons, and thereby helps overcome the 140-character limit.

Hashtags also link semantically similar tweets. For example, when users with a common interest use a common hashtag, that hashtag implicitly indicates that two or more different tweets are semantically similar.

Accordingly, the context considered in the present invention may be defined as a cluster of hashtags that carry information about the emotions, institutions, organizations, persons, and the like associated with a trend.

A tweet includes information about the time at which it was written. The generative process of the present invention assumes that time is generated by the topic, but every word (or vocabulary item) and hashtag within a tweet is assigned the time at which the tweet was written.

Time is a continuous attribute. Therefore, to sample time as a continuous attribute, time is assumed to follow a beta distribution, a continuous probability distribution defined on the interval from 0 to 1.
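Because the beta distribution is defined only on the interval from 0 to 1, raw tweet timestamps would need to be rescaled before they can be modeled this way. The following Python sketch illustrates one possible approach: it maps timestamps linearly onto [0, 1] and then estimates beta parameters by the method of moments. The function names and the choice of the method of moments are assumptions made for illustration; the specification does not state how the beta parameters are obtained.

```python
from statistics import mean, pvariance


def normalize_times(timestamps):
    """Map raw timestamps linearly onto [0, 1] (hypothetical preprocessing step)."""
    lo, hi = min(timestamps), max(timestamps)
    span = hi - lo
    if span == 0:
        # All timestamps identical: place them in the middle of the interval.
        return [0.5 for _ in timestamps]
    return [(t - lo) / span for t in timestamps]


def fit_beta_moments(xs):
    """Method-of-moments estimate of Beta(a, b) from values in [0, 1].

    This estimator is an assumption for illustration, not the patent's method.
    """
    m = mean(xs)
    v = pvariance(xs)
    if v == 0 or not 0 < m < 1:
        raise ValueError("need non-degenerate values with mean strictly in (0, 1)")
    common = m * (1 - m) / v - 1
    if common <= 0:
        raise ValueError("variance too large for a beta distribution")
    return m * common, (1 - m) * common
```

In such a setup, each topic would receive its own pair of parameters, fitted from the normalized times of the tokens currently assigned to that topic.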

FIG. 2 shows the notation used in the trend analysis method using the trend analysis apparatus shown in FIG. 1.

Referring to FIGS. 1 and 2, each document is denoted by d, a topic by z, a word (or vocabulary item) by w, a hashtag by c, and time by t; a count or number of occurrences is generally denoted by n. The probability distributions over topics, words, hashtags, time, and so on are each denoted by a Greek letter, and a corresponding Greek letter is also assigned to the Dirichlet prior of each probability distribution.

In the description that follows, Dir() denotes generating a Dirichlet distribution based on the arguments in parentheses, and B() denotes generating a beta distribution based on the arguments in parentheses.

FIG. 3 is a diagram for explaining a context-based trend analysis method over time using the trend analysis apparatus shown in FIG. 1.

Referring to FIGS. 1 to 3, for each topic, a lexical distribution (or word distribution) to be used for selecting words and a hashtag distribution to be used for selecting hashtags are extracted. In the present invention, words, hashtags, and time are assumed to be generated by topics. Accordingly, the words and hashtags used to model a tweet are generated in two stages.

First, for each word of each document, a topic is selected from the topic distribution, and a word and a time are selected from the lexical distribution of that topic and from the beta distribution over time, respectively.

Next, for each hashtag of each document, a topic is selected from the topic distribution, and a hashtag and a time are selected from the hashtag distribution of that topic and from the beta distribution over time, respectively.
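The two-stage generative process described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `theta_prior`, `phi`, `psi`, and `beta_params` are hypothetical stand-ins for the Dirichlet prior on topics, the per-topic lexical distributions, the per-topic hashtag distributions, and the per-topic beta parameters, and only the Python standard library is used.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_tweet(n_words, n_hashtags, theta_prior, phi, psi,
                   beta_params, vocab, hashtags, rng):
    """Generate (kind, topic, item, time) tuples for one synthetic tweet.

    phi[z] and psi[z] are the word and hashtag probabilities of topic z;
    beta_params[z] is the (a, b) pair of the topic's beta distribution over time.
    All argument names are hypothetical.
    """
    theta = sample_dirichlet(theta_prior, rng)  # per-document topic distribution
    topic_ids = list(range(len(theta)))
    tokens = []
    stages = (("word", n_words, phi, vocab),
              ("hashtag", n_hashtags, psi, hashtags))
    for kind, count, dist, items in stages:
        for _ in range(count):
            z = rng.choices(topic_ids, weights=theta)[0]      # pick a topic
            item = rng.choices(items, weights=dist[z])[0]     # pick word/hashtag
            a, b = beta_params[z]
            t = rng.betavariate(a, b)                         # pick a time in [0, 1]
            tokens.append((kind, z, item, t))
    return tokens
```

Words are generated in the first stage and hashtags in the second, each token drawing its own topic and then its item and time from that topic's distributions.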

FIG. 4 shows the concept of a context-based trend extraction method over time using the trend extraction apparatus shown in FIG. 1.

Referring to FIGS. 1 to 4, FIG. 4 shows the relationships among topic (z), word (w), hashtag (c), and time (t). It also shows the order in which each probability distribution and variable is obtained.

On the right side of FIG. 4, the multinomial distribution over words, that is, the lexical distribution (word distribution), is derived from a Dirichlet prior and is ultimately used to sample words. On the left side of FIG. 4, the multinomial distribution over hashtags, that is, the hashtag distribution, is derived from a Dirichlet prior and is ultimately used to sample hashtags.

FIG. 5 is a flowchart for explaining a context-based trend extraction method over time using the trend analysis apparatus shown in FIG. 1. Each step shown in FIG. 5 may be performed by the trend analysis apparatus 10, specifically by the trend extraction unit 300.

Referring to FIGS. 1 to 5, in step S110, a lexical distribution for selecting words and a hashtag distribution for selecting hashtags are extracted. That is, probability distributions are constructed, based on Dirichlet priors, over the vocabulary and the hashtags that make up each topic.

In step S120, for each document, the multinomial distribution over topics, that is, the topic distribution, is extracted. The topic distribution may be extracted from the Dirichlet prior for the topic distribution. A step of collecting, over a network, at least one document included in the document set may additionally be performed before step S120 or before step S110, and this step may be performed by the document collection unit 100.

In step S130, statistical inference is performed on each extracted distribution, that is, the lexical distribution, the hashtag distribution, and the topic distribution. Gibbs sampling may be used for the statistical inference. Equations 1 and 2 below give the sampling distribution for hashtags and the sampling distribution for words, respectively.

Equation 1 gives the probability that the topic is z given a hashtag, a time, and a word, and Equation 2 gives the probability that the topic is z given a time and a word. z′ is the assignment vector of topics for all words and hashtags except the i-th word and hashtag of document d.
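The exact forms of Equations 1 and 2 are not reproduced in this text, so the following Python sketch is only a generic illustration of what one Gibbs sampling step of this kind might look like. It assumes a collapsed sampler with symmetric Dirichlet priors, in which the weight of topic z for a token combines a document-topic count, a smoothed topic-item count, and the topic's beta density at the token's time. The names `alpha`, `eta`, and all count structures are hypothetical, and the formula should not be read as the patent's equations.

```python
import math
import random


def beta_pdf(t, a, b, eps=1e-9):
    """Density of Beta(a, b) at t, with t clamped away from 0 and 1."""
    t = min(max(t, eps), 1.0 - eps)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def topic_weights(doc_topic_counts, topic_item_counts, topic_totals,
                  item, t, alpha, eta, n_items, beta_params):
    """Unnormalized weight of each topic for one token (assumed generic form).

    Counts must already exclude the token being resampled.
    """
    weights = []
    for z, n_dz in enumerate(doc_topic_counts):
        n_zi = topic_item_counts[z].get(item, 0)
        a, b = beta_params[z]
        w = ((n_dz + alpha)
             * (n_zi + eta) / (topic_totals[z] + n_items * eta)
             * beta_pdf(t, a, b))
        weights.append(w)
    return weights


def sample_topic(weights, rng):
    """Draw a topic index in proportion to the given weights."""
    return rng.choices(list(range(len(weights))), weights=weights)[0]
```

The same routine could be applied to hashtags by passing hashtag counts and the hashtag vocabulary size in place of the word counts.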

Equations 3 and 4 below are the formulas for obtaining the lexical distribution and the hashtag distribution for topic z.

Equation 3 yields the lexical distribution that makes up each topic, and Equation 4 yields the hashtag distribution that makes up each context.
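As a rough illustration of how a per-topic distribution can be read off from sampled assignments, the Python sketch below computes a smoothed relative frequency from counts. This is the standard Dirichlet-smoothed estimator used in many topic models and is an assumption here; it is not claimed to match Equations 3 and 4, and the names `counts`, `items`, and `prior` are hypothetical.

```python
def smoothed_distribution(counts, items, prior):
    """Estimate a distribution over `items` for one topic.

    counts: mapping from item to the number of times it is assigned to the topic.
    prior:  symmetric Dirichlet pseudo-count added to every item (assumed).
    """
    total = sum(counts.get(i, 0) for i in items) + prior * len(items)
    return {i: (counts.get(i, 0) + prior) / total for i in items}
```

Calling the function once with word counts and once with hashtag counts for the same topic gives a lexical distribution and a hashtag distribution, the latter corresponding to the context of the topic.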

In step S140, for each word of each document, a topic is extracted (or selected) from the topic distribution, and a time is extracted (or selected) from the beta distribution over time.

In step S150, a word is selected (or extracted) from the lexical distribution.

In step S160, it is determined whether the word is a hashtag. If the word is a hashtag, a hashtag is selected (or extracted) from the hashtag distribution in step S170.

In this way, context-based trend extraction over time is completed by selecting a topic, a time, and a word, or by selecting a topic, a time, a word, and a hashtag.
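The check in step S160 presupposes some way of telling hashtags apart from ordinary words. A minimal Python sketch of one such check follows; it assumes whitespace tokenization and treats any token that begins with "#" as a hashtag. Both the tokenization and the lowercasing are illustrative assumptions, and `split_tokens` is a hypothetical helper name.

```python
def split_tokens(text):
    """Split a tweet into lowercase words and hashtags (without the '#')."""
    words, hashtags = [], []
    for token in text.split():
        if token.startswith("#") and len(token) > 1:
            hashtags.append(token[1:].lower())
        else:
            words.append(token.lower())
    return words, hashtags
```

For instance, `split_tokens("Ferry sinks #SadStory")` separates the two ordinary words from the single hashtag, so that each can be routed to the lexical distribution or the hashtag distribution.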

FIG. 6 shows an example of trends analyzed using the trend analysis apparatus shown in FIG. 1.

For the analysis, tweet data related to the Sewol ferry disaster were collected from Twitter for about one month starting April 16, 2014, using the two keywords "ferry" and "prayforsouthkorea". FIG. 6 shows three examples of the context-based trends over time extracted by the trend analysis apparatus 10.

In FIG. 6, the first row for each trend is the name of the trend, the second row is a histogram showing how the trend changes over time, and the third row is the context of the trend and the set of words for the trend.
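A histogram of the kind described for the second row could be produced by binning the normalized times of the tokens assigned to a topic. The Python sketch below shows one way to do this; the `(topic, time)` input format, the equal-width bins, and the function name are assumptions for illustration only.

```python
from collections import Counter


def topic_histogram(assignments, topic, n_bins):
    """Count tokens assigned to `topic` in equal-width bins over [0, 1].

    assignments: iterable of (topic_id, normalized_time) pairs (assumed format).
    """
    counts = Counter()
    for z, t in assignments:
        if z == topic:
            # A time of exactly 1.0 falls into the last bin.
            idx = min(int(t * n_bins), n_bins - 1)
            counts[idx] += 1
    return [counts[i] for i in range(n_bins)]
```

Rising or falling bin counts from left to right would then indicate growing or waning interest in the trend.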

Looking at the "Brother saved sister" trend, the histogram shows how users' interest in the trend changes over time. The context shows that users were shocked by this trend and felt sad about it. It can also be inferred that this trend was broadcast as breaking news and is associated with the Blue House and President Obama.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical idea of the appended claims.

10: Trend analysis apparatus
100: Document collection unit
200: Document storage unit
300: Trend extraction unit
400: Trend storage unit

Claims

A method for analyzing context-based trends over time from a set of documents, using a device capable of calculating a probability distribution, the method comprising:
(a) for each topic, extracting a lexical distribution to be used for selecting a word and a hashtag distribution to be used for selecting a hashtag;
(b) for each document included in the document set, extracting a topic distribution;
(c) performing statistical inference on the lexical distribution, the hashtag distribution, and the topic distribution;
(d) for each word of each document included in the document set, extracting a topic from the topic distribution and extracting a time from a beta distribution over time; and
(e) for each word of each document included in the document set, extracting a word or a hashtag.

The method according to claim 1,
wherein the lexical distribution, the hashtag distribution, and the topic distribution are extracted from Dirichlet priors.

The method according to claim 1,
wherein the statistical inference uses a Gibbs sampling technique.

The method according to claim 1,
wherein step (e) comprises:
extracting a word from the lexical distribution; and
extracting a hashtag from the hashtag distribution if the extracted word is a hashtag.

The method according to claim 1,
wherein the set of documents comprises a plurality of tweets.

The method according to claim 1,
further comprising collecting, over a network, at least one document included in the set of documents prior to step (a).