KR102126911B1

KR102126911B1 - Key player detection method in social media using KeyplayerRank

Info

Publication number: KR102126911B1
Application number: KR1020180170598A
Authority: KR
Inventors: 김민선; 유기윤; 김지영
Original assignee: 서울대학교산학협력단
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2020-07-07

Abstract

The present invention relates to a method of detecting key players by topic on social media using KeyplayerRank. The method includes the steps of: (a) collecting, by a news data collection module, news articles related to a specific category, and preprocessing, by a news data preprocessing module, the collected news articles; (b) training, by an article text classification module, an SVM on training data and classifying text with the trained SVM model; (c) extracting, by a topic extraction module, topics and topic keywords by performing LDA on the data found to be related to the specific category; (d) collecting, by a Twitter data collection module, Twitter data, and preprocessing, by a Twitter data preprocessing module, those collected tweets that contain the topic keywords; (e) training, by a Twitter text classification module, an SVM on training data, and classifying text with the trained SVM model using the Twitter data containing the topic keywords as test data; (f) classifying, by a user classification module, the users in the data found to be related to the topic; (g) calculating, by an influence index and topic similarity calculation module, an influence index and a topic similarity; and (h) selecting, by a key player detection module, a key player for each topic using the influence index and the topic similarity. By considering the influence index and the topic similarity simultaneously, the most influential key player for each topic can be extracted.

Description

Key player detection method in social media using KeyplayerRank

The present invention relates to a method for detecting key players by topic on social media using KeyplayerRank, and more specifically to such a method that classifies users by topic based on the text they write on social media and extracts the most influential key player for each topic by considering the influence index and the topic similarity simultaneously.

As the number of social media users grows steadily every year, more and more people use social media to acquire material and information and to communicate with other users, and when acquiring such information they frequently trust and follow what the most influential users on social media say.

Accordingly, as social media has developed, attempts to identify social influence on social media have recently been made in various ways. Among social media platforms, research on finding influential users is most active for Twitter.

Similar to the present invention, there are prior studies that use Google's PageRank algorithm. Tunkelang (2009) applied TunkRank, which determines influence on Twitter by calculating the probability that followers read all tweets of a key player, including those the key player retweets. However, TunkRank is an algorithm that focuses on retweets without considering the content of tweets.

Weng et al. (2010) proposed TwitterRank. They pointed out that measuring influence simply by follower count using in-degree (the number of edges entering a node in a graph, used here to count how many users follow a given user), as existing social network analysis methods do when finding key players, does not accurately represent the concept of influence, and that PageRank improves on in-degree by considering the link structure of the whole network but fails to consider the interest shared between the users who influence one another on Twitter. TwitterRank was therefore applied to compensate for the shortcomings of in-degree and PageRank by considering both the link structure and the topic similarity between Twitter users. In contrast to TunkRank, the TwitterRank algorithm is a ranking method that focuses on topic similarity.

To find key players considering topic similarity alone, TwitterRank can be used; conversely, to find influential users who are not limited to a specific topic, such as celebrities or politicians, the TunkRank algorithm can be used.

Meanwhile, if topic similarity and influence are obtained separately and then added, the following contradiction can arise. For example, if an influential user does not talk about one topic but only briefly mentions several topics, that user has no expert knowledge of any of them and cannot be called a key player, yet prior studies identify this user as one. Such a key player is biased toward influence alone.

In other words, if key players are derived by simply combining the influence index and the topic similarity, the result can depend heavily on the influence of the topic itself. That is, when the topic for which key players are sought does not attract attention from many people, its key players have a lower influence index than the key players of other topics, so a key player for the desired topic cannot be derived because of the low influence index even though the topic similarity is high. Conversely, a user with a very high influence index may not be a key player for the topic of interest.

In addition, the prior art Korean Patent Application Publication No. 10-2014-0062635 (published 2014.05.26) discloses a method for searching for influential users on social media, comprising: receiving a keyword from a user; transmitting a search request message containing the received keyword and receiving a search response message corresponding to the transmitted search request message; upon receiving the search response message, extracting from it a search list in which influential users related to the keyword are listed according to a preset influence ranking; sequentially displaying, based on the extracted search list, the influential users related to the keyword according to the preset influence ranking; and storing the extracted search list. However, because this method finds influencers related to a keyword and then displays them according to a preset influence ranking, it is limited in extracting influencers who reflect both the similarity and the influence for each topic.

Korean Patent Application Publication No. 10-2014-0062635 (published 2014.05.26; title of invention: Apparatus, system and method for searching for influential users in social media)

The present invention has been devised to solve the above problems. Its object is to provide a method for detecting key players by topic on social media using KeyplayerRank that detects key players (influencers) who are not biased toward a single indicator by simultaneously considering the influence index, obtained by analyzing messages generated on social media, and the topic similarity, so that reliable information on a topic can be obtained through the most influential users for that topic and unnecessary information can be filtered out.

To achieve the above object, the present invention comprises: (a) a preprocessing step in which a news data collection module collects, by crawling, news articles related to a designated specific category from a news data DB, and a news data preprocessing module removes stopwords from the 'TEXT' field of the collected news articles and extracts nouns by morphological analysis; (b) a step in which an article text classification module selects the features to be used in the analysis for classifying news article text, performs text indexing that represents the features by feature weights, divides the article text from which nouns were extracted in step (a) into training data and test data at a fixed ratio, trains an SVM on the training data, and classifies text with the trained SVM model; (c) a step in which a topic extraction module performs LDA on the data found to be related to the specific category by the text classification with the SVM model in step (b) and extracts topics and keywords for each topic; (d) a preprocessing step in which a Twitter data collection module collects, from a Twitter data DB, Twitter data for the same period as the news article collection period, and a Twitter data preprocessing module removes stopwords from the 'text' field of only those collected tweets that contain the topic keywords extracted by the LDA and extracts nouns by morphological analysis; (e) a step in which a Twitter text classification module trains an SVM on training data formed by adding Twitter data to a certain number of article data, and classifies text with the trained SVM model using the Twitter data containing the topic keywords from step (d) as test data; (f) a step in which a user classification module classifies users in the data found to be related to the topic by the text classification with the SVM model in step (e); (g) a step in which an influence index and topic similarity calculation module calculates an influence index using the attribute information obtained by the user classification and calculates a topic similarity using the 'Jaccard Index'; and (h) a step in which a key player detection module selects, using the influence index and the topic similarity, the user with the highest KpRank value among the users as the key player for each topic.

In addition, in the present invention, step (b) comprises: (b1) a feature selection step in which the article texts from which nouns were extracted are labeled according to whether they are related to the category, and the words to be used for text classification are selected using document frequency; (b2) a step of generating an article text DTM in which TF-IDF is applied as the feature weight using the selected features; and (b3) a step of dividing the DTM into training data and test data at a fixed ratio, classifying the article text by applying the SVM model trained on the training data to the test data, and extracting the correctly classified texts by comparing the classification result with the actual labels.

In addition, in the present invention, step (c) generates a DTM from the texts that both the labeling and the SVM classification found to be related to the specific category, builds an LDA model from it, extracts keywords for each topic after setting the parameters required to run the LDA model ('num_topics', 'passes' and the number of keywords), and determines the topics from the extracted keywords.

In addition, in the present invention, step (e) extracts the correctly classified tweets by comparing the result of classifying Twitter text with the trained SVM model applied to the test data against the actual labels.

In addition, in the present invention, step (f) classifies Twitter users into retweeted users and retweeting users and loads their attribute information; if a retweeted user is not stored in the storage device, that user's attribute information is retrieved from the Twitter data DB using the Twitter REST API.
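As a non-authoritative illustration of step (f), the following Python sketch separates retweeted users from retweeting users and loads attribute information from a local store, falling back to a remote lookup only when a user is missing. The record layout (`user`, `retweeted_user`), the dictionary `store`, and the callable `fetch_profile` (a stand-in for a Twitter REST API request) are assumptions made for this example and do not come from the source.

```python
def classify_users(tweets):
    """Split user ids into retweeted users and retweeting users.

    Each tweet is assumed to be a dict with a 'user' key and, for retweets,
    a 'retweeted_user' key (hypothetical layout).
    """
    retweeted, retweeting = set(), set()
    for tweet in tweets:
        original_author = tweet.get("retweeted_user")
        if original_author is not None:
            retweeted.add(original_author)
            retweeting.add(tweet["user"])
    return retweeted, retweeting


def load_attributes(user_id, store, fetch_profile):
    """Return stored attributes, fetching and caching them when absent."""
    if user_id not in store:
        # fetch_profile is a hypothetical stand-in for a REST API request.
        store[user_id] = fetch_profile(user_id)
    return store[user_id]


tweets = [
    {"user": "u2", "retweeted_user": "u1"},
    {"user": "u3", "retweeted_user": "u1"},
    {"user": "u1"},
]
retweeted, retweeting = classify_users(tweets)

store = {"u2": {"Follwers_count": 10}}
calls = []


def fake_fetch(user_id):
    # Records each remote lookup so the caching behaviour can be observed.
    calls.append(user_id)
    return {"Follwers_count": 500}


profile = load_attributes("u1", store, fake_fetch)
```

Caching the fetched profile in `store` means a user who was retweeted many times triggers at most one remote lookup in this sketch.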

In addition, in step (g) of the present invention, the influence index is calculated from 'Follwers_count, RT_count, Favorite_count' among the users' attribute information by the following equation,

(where the terms are: the average retweet count of user i, obtained by summing the number of times each tweet written by user i within the topic was retweeted and dividing by the number of tweets user i wrote within the topic; the average favorite count of user i, obtained in the same way; the average retweet count and average favorite count over all users within the topic; the average follower count of user i, obtained like the two averages above by summing the follower count associated with each tweet written by user i within the topic and dividing by the number of tweets user i wrote within the topic; and the average follower count over all users), and the topic similarity is calculated by the following equation,

$J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

(where A and B are the sets of Twitter text, consisting only of nouns, written by each user).
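The per-user averages and the Jaccard index described above can be sketched in Python as follows. This is only an illustration: the function names, the per-tweet dictionary layout, and the convention of returning 0.0 when both noun sets are empty are assumptions, and the influence index equation itself is not implemented because it is not reproduced in this text.

```python
def average_count(tweets, field):
    """Average of one attribute over a user's tweets within a topic."""
    return sum(tweet[field] for tweet in tweets) / len(tweets)


def jaccard(nouns_a, nouns_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of nouns."""
    a, b = set(nouns_a), set(nouns_b)
    if not a and not b:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical tweets of one user inside one topic.
user_tweets = [{"RT_count": 4}, {"RT_count": 2}]
mean_retweets = average_count(user_tweets, "RT_count")

# Hypothetical noun bundles of two users.
similarity = jaccard(["ai", "robot", "tech"], ["ai", "robot", "car"])
```

Here the two users share two of four distinct nouns, so the similarity is 0.5, and the user's average retweet count is 3.0.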

In addition, in step (h) of the present invention, the KpRank value is calculated by the following equation,

(where d is the probability that a Twitter user visits a Twitter page that mentions the topic of interest within a specific topic).

As described above, the method of the present invention for detecting key players by topic on social media using KeyplayerRank, unlike simply obtaining the influence index and the topic similarity separately and adding them, simultaneously considers the influence index obtained by analyzing messages generated on social media and the topic similarity, and thereby detects key players (influencers) on social media who are not biased toward a single indicator. As a result, reliable information on a topic can be obtained through the most influential users for that topic, and unnecessary information can be filtered out.

Furthermore, the present invention can be used in marketing: when a company wishes to promote a product, it can detect the key players for a specific topic and promote the product efficiently through them.

FIG. 1 is an overall flowchart of the method for detecting key players by topic on social media using KeyplayerRank according to the present invention.
FIG. 2 shows part of an article text DTM to which TF-IDF is applied as the feature weight in the present invention.
FIG. 3 shows the process of classifying users by topic in the present invention.
FIG. 4 visualizes the key players for each topic on a map.
FIG. 5 is a block diagram of an embodiment of a system related to the method for detecting key players by topic on social media using KeyplayerRank according to the present invention.

Preferred embodiments of the present invention configured as above are described in detail below with reference to the accompanying drawings. Note that the accompanying drawings and the description referring to them are given as examples so that those of ordinary skill in the art can readily understand the present invention, and are not intended to limit its spirit or scope.

FIG. 5 is a block diagram of an embodiment of a system related to the topic-specific key player detection method according to the present invention. The topic-specific key player detection device (10) implements the method of classifying users by topic based on the text they write on social media and extracting the most influential key player for each topic by simultaneously considering the influence index and the topic similarity. It includes a news data collection module (11), a news data preprocessing module (12), an article text classification module (13), a topic extraction module (14), a Twitter data collection module (15), a Twitter data preprocessing module (16), a Twitter text classification module (17), a user classification module (18), an influence index and topic similarity calculation module (19), and a key player detection module (20). The device (10) is a server, desktop, laptop, portable terminal or the like, and includes software for performing topic-specific key player detection on social media using KeyplayerRank.

In addition, the data computed by, or input to and output from, the topic-specific key player detection device (10) is preferably stored in a storage device (30). The device (10) may itself include the storage device (30).

The method for detecting key players by topic on social media using KeyplayerRank according to the present invention, configured as above, is described below with reference to the flowchart of FIG. 1.

News data preprocessing (S10)

The first step is preprocessing of the news data (for example, Chosun Ilbo data). The present invention uses news articles to extract topics because the text that can be written on Twitter is limited to 140 characters, which makes tweets unsuitable for topic extraction. Because Twitter is a social medium that reacts immediately to news reports and quickly reflects public opinion, the topics extracted from news articles can be applied to Twitter as described below.

The news data collection module (11) collects news articles from the news data DB by crawling. A specific category must be designated at this point for the topic extraction described later; in one embodiment of the present invention the category is set to 'artificial intelligence'. The fields retrieved from news articles related to the category are 'TITLE, CATEGORY, URL, DATE, TEXT', organized as shown in Table 1 below. Note that the news articles related to the category 'artificial intelligence' include not only articles containing 'artificial intelligence' but also articles containing 'artificial', 'intelligence' and the like.

In the present invention, the news data preprocessing module (12) preprocesses only the 'TEXT' field of the collected news articles. Because article text consists of several sentences, it is first split into sentences, a process called tokenization. Tokenization is the task of dividing a document or sentence into tokens (meaningful strings, including morphemes and words) so that it is easier to analyze. After the sentences are separated, each sentence is tokenized once more into words for morphological analysis. The package used for this may be NLTK (Natural Language Toolkit), a Python package for natural language processing and document analysis developed for educational use.
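The two-level tokenization described above can be illustrated with a small Python sketch. It is a simplified regular-expression stand-in written for this example, not the NLTK tokenizers mentioned in the text; the function names and the splitting rules (sentence boundary after '.', '!' or '?', words separated by whitespace) are assumptions.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Return one list of whitespace-separated word tokens per sentence."""
    return [sentence.split() for sentence in split_sentences(text)]


tokens = tokenize("AI is growing fast. Robots help people!")
```

Punctuation stays attached to the preceding word in this sketch; a dedicated tokenizer would normally separate it.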

From the text tokenized into sentences and words, stopwords judged to be meaningless for text analysis (Chinese characters, digits, English letters and URLs) are removed, and only Korean nouns are used to extract topics from the remaining text. To extract only nouns, morphological analysis is performed, which assigns a part of speech (POS) to each word separated at the word-segment (eojeol) level. The morphological analyzer that can be used here is KOMORAN from the KoNLPy package, a Java-based Korean morphological analyzer that assigns one of 42 POS tags to each word; with it only nouns (common nouns (NNG) and proper nouns (NNP)) are extracted. The article text reduced to nouns is stored in the storage device (30), and in the next step an SVM is used to classify only the text related to the category.
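The stopword removal described above (Chinese characters, digits, English letters and URLs) can be sketched with regular expressions as follows. The patterns and the function name are assumptions made for illustration; in particular, the Chinese-character pattern covers only the basic CJK Unified Ideographs block, and the subsequent noun extraction with a morphological analyzer is not shown.

```python
import re

URL = re.compile(r"https?://\S+")
HANJA = re.compile(r"[\u4e00-\u9fff]")       # basic CJK Unified Ideographs only
DIGIT_OR_LATIN = re.compile(r"[0-9A-Za-z]")


def remove_stopword_characters(text):
    """Drop URLs, Chinese characters, digits and Latin letters from text."""
    text = URL.sub(" ", text)  # URLs first, before their letters are stripped
    text = HANJA.sub(" ", text)
    text = DIGIT_OR_LATIN.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace


cleaned = remove_stopword_characters(
    "AI 人工知能 인공지능 2018 https://example.com/x 기술"
)
```

Removing URLs before single characters matters: otherwise stripping the letters would leave fragments such as `://` behind.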

Article text classification through SVM (S20)

Next is the text classification process using an SVM (Support Vector Machine).

Here, an SVM (Support Vector Machine) is a classification model that finds the decision surface separating positive examples from negative examples; it is a machine learning algorithm proposed by Vladimir Vapnik to solve binary pattern recognition problems.

The SVM algorithm is suitable for text classification for three reasons. First, text consists of a very large number of features (words), so the dimensionality grows very large and a model overfitted to the data may result; an SVM has parameters that can control such high dimensionality, which makes it suitable for text classification. Second, regarding feature selection, an SVM can distinguish irrelevant features: it learns over many features and classifies documents taking all of the information into account. Finally, most text can be classified linearly, and an SVM by default searches for a linear classifier. A linear SVM finds the optimal linear decision boundary that separates the data, that is, the boundary that best separates the different classes.
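To make the idea of a linear decision boundary concrete, the following Python sketch trains a small linear SVM by batch subgradient descent on the L2-regularized hinge loss. It is an illustration under stated assumptions, not the implementation used in the source: the training routine, the hyperparameters (`lam`, `lr`, `epochs`) and the two-dimensional toy data are all chosen for this example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Class 1 on or above the decision boundary, otherwise -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data with the labels 1 and -1 used in the text.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

The regularization weight `lam` plays the role of the parameter that limits model complexity in high-dimensional feature spaces; real text data would use one dimension per selected word rather than two.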

In the present invention, documents are classified with a linear SVM so that topics are extracted from news article text that is relevant to the category. This SVM-based text classification can improve the performance of the topic modeling described later.

Here, the SVM process performed by the article text classification module (13) is as follows: the features to be used in the analysis for classifying news article text are selected, text indexing is performed to represent the features by feature weights, and the data is divided into training data and test data at a fixed ratio (for example, 3:1). The SVM is then trained on the training data, and text is classified with the trained SVM model.

More specifically, the article texts reduced to nouns in the preceding preprocessing are labeled manually as shown in Table 2 below (in the embodiment, 5,000 article texts). For example, the class of a text related to the category 'artificial intelligence' is given the value '1', and the class of a text with unrelated content is given the value '-1'.

The 5,000 article texts labeled in Table 2 are formed into a single corpus, and document frequency is used as the feature selection process to choose words that are useful for text classification. It is preferable to exclude words whose document frequency (described here as the number of times a given word is repeated in one article text) is 2 or less and to select the remaining words as features. For reference, the texts in Table 2 all consist of words selected as features.
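A minimal Python sketch of this feature selection is given below. It assumes the common corpus-level definition of document frequency, namely the number of texts in which a word occurs at least once, which may differ from the parenthetical description in the text; the function name and the toy corpus are also assumptions.

```python
from collections import Counter


def select_features(docs, max_dropped_df=2):
    """Keep words whose document frequency exceeds max_dropped_df.

    docs is a list of token lists; document frequency is counted as the
    number of documents containing the word at least once (assumption).
    """
    document_frequency = Counter()
    for doc in docs:
        document_frequency.update(set(doc))  # count each word once per doc
    return sorted(
        word for word, df in document_frequency.items() if df > max_dropped_df
    )


docs = [
    ["ai", "robot"],
    ["ai", "tech"],
    ["ai", "robot", "robot"],
    ["robot", "tech"],
]
features = select_features(docs)
```

In this toy corpus "tech" appears in only two documents and is therefore dropped, while "ai" and "robot" appear in three and are kept.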

In the next step, text indexing, Term Frequency - Inverse Document Frequency (TF-IDF) is applied as the feature weight. TF-IDF is a weight used in information retrieval and text mining: a statistic that indicates how important a word is within a particular document of a document collection.

FIG. 2 shows part of the DTM. As shown there, it is the article text DTM with TF-IDF applied as the feature weight (the Document Term Matrix (DTM) multiplied by the TF-IDF feature weight). Each row represents a text (the 5,000 article texts of Table 2), each column represents one of the previously selected features (the words listed in Table 2), and each element of the matrix is a TF-IDF-weighted value.
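As a rough illustration of a TF-IDF-weighted DTM, the sketch below builds one row per text and one column per feature. The weighting variant (raw term count multiplied by `log(N / df)`) is an assumption, since the exact formula is not stated here; `tfidf_dtm` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_dtm(docs, features):
    """Return a document-term matrix (list of rows) weighted by TF-IDF.

    Assumed weighting: tf(word, doc) * log(n_docs / df(word)).
    """
    n_docs = len(docs)
    feature_set = set(features)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & feature_set)
    matrix = []
    for tokens in docs:
        tf = Counter(tokens)
        row = [
            tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0
            for w in features
        ]
        matrix.append(row)
    return matrix


# Toy data (assumption): three noun-only texts and three selected features.
docs = [["바둑", "알파고", "바둑"], ["인공지능", "로봇"], ["바둑", "인공지능"]]
features = ["바둑", "알파고", "인공지능"]
dtm = tfidf_dtm(docs, features)
for row in dtm:
    print([round(v, 3) for v in row])
```

Each row of `dtm` corresponds to one text and each column to one feature, mirroring the layout described for FIG. 2.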

When this DTM is divided at a ratio of 3:1 into training data and test data for building the SVM model, there are 3,750 training items and 1,250 test items. The SVM model trained on the training data is applied to the 1,250 test items to classify the text, and precision and recall are computed to verify whether the classification is accurate.

Table 3 below is a confusion matrix comparing the result of classifying the 1,250 test items with the trained SVM model (experimental result, Predicted) against the result of manual labeling (Actual). Of the 1,250 items, 1,083 were classified correctly, giving an overall accuracy of about 86.64%. Precision is about 72.46%, indicating high predictive power, and recall, the proportion of actually related texts that were predicted correctly, is about 68.73%.

Here, precision is the proportion of texts in the experimental result that are actually related to the subject (Equation 1), and recall is the proportion of texts actually related to the subject that the experimental result classified as related (Equation 2). For reference, Equation 3 below expresses accuracy, and Table 4 below is a confusion matrix for explaining Equations 1 to 3.
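The three measures can be checked with a short sketch using the standard confusion-matrix definitions. The cell counts below are not taken from Table 3; they are back-calculated (an assumption) so that they agree with the reported totals: 1,250 test items, 1,083 correct, and the 200 items for which both the label and the prediction are class 1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)  # predicted-related texts that are truly related
    recall = tp / (tp + fn)     # truly related texts that were predicted related
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Back-calculated counts (assumption): 200 + 76 + 91 + 883 = 1,250 items,
# of which 200 + 883 = 1,083 are classified correctly.
precision, recall, accuracy = classification_metrics(tp=200, fp=76, fn=91, tn=883)
print(f"{precision:.2%} {recall:.2%} {accuracy:.2%}")  # 72.46% 68.73% 86.64%
```

With these counts the sketch reproduces the reported precision, recall, and overall accuracy.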

LDA topic modeling (S30)

Next, the 200 items for which both the labeling result above and the experimental result predicted by the SVM model were 'related' (class: 1) are stored in the storage device (30), and the subject extraction module (14) runs LDA (Latent Dirichlet Allocation) on these 200 items to extract subjects and the keywords for each subject. That is, in one embodiment of the present invention, LDA is performed on the 200 items, consisting only of nouns extracted from the article text, that the SVM found to be related to the category 'artificial intelligence'.

Here, LDA is a probabilistic model algorithm that extracts meaningful topics from a collection of structured text data. As a probabilistic model of which topics are present in each document, it is a topic modeling method that predicts which topics a document covers together by analyzing the distribution of word counts found in the document on the basis of a previously known per-topic word count distribution.

To build the LDA model, the 200 noun-only texts are first converted to a DTM that records how many times each word is mentioned in each text, and the LDA model is then generated. The parameters of the LDA model are num_topics and passes: 'num_topics' sets the number of latent topics, and 'passes' sets the number of times the model is trained repeatedly. The larger the number of passes, the more refined the model becomes.

In one embodiment of the present invention, LDA is run with 'num_topics' set to 10 and 'passes' set to 1,000. Table 5 below is the LDA result for the category 'artificial intelligence' after 1,000 training passes, and it is stored in the storage device (30). The numbers in Table 5 are the probabilities that a keyword is distributed in the topic with the given number, and the number of keywords per topic is set to 5.

In Table 5, topic 1 has the keywords 'Lee Sedol (이세돌), Go (바둑), AlphaGo (알파고), artificial intelligence (인공지능), match (대국)'. Inferring from these keywords, topic 1 can be described as 'the Go match between AlphaGo and Lee Sedol'. The Go match between AlphaGo and Lee Sedol from March 9 to 15, 2016 drew great attention as a contest between artificial intelligence and a human, and many related news articles were published during the collection period of this embodiment, July 22, 2015 to July 22, 2016, which presumably explains why a topic such as topic 1 emerged. In other words, once the keywords for each topic are extracted, the topic can be identified from those keywords.

Twitter data preprocessing (S40)

In the present invention, Twitter is used as the example SNS in one embodiment.

The Twitter data collection module (15) collects Twitter data from the Twitter data DB through a REST-based open API, covering the same period as the news data collection period (for example, July 22, 2015 to July 22, 2016). The Twitter data remaining after duplicates are removed from the collected data (embodiment: 3,001,589 items) is used.

Of the various attributes provided by the Twitter open API, the present invention uses only the 'text', 'user_id', 'user_s_name', 'followers_count', 'retweet_count', and 'favorite_count' fields, as shown in Table 6 below. 'text' is selected to analyze the content of the tweet written by the user, 'user_id' and 'user_s_name' are selected to identify the user, and 'followers_count', 'retweet_count', and 'favorite_count' are used to compute the influence index.

To analyze tweet text, the Twitter data preprocessing module (16) preprocesses only the 'text' field, and only for tweets that contain the per-topic keywords extracted through the LDA. As in the earlier preprocessing of article text, stop-word removal and morphological analysis are performed with the KoNLPy package. For tweets as well, only Korean nouns are extracted and stored in the storage device (30), for use in text classification through the SVM and in computing the subject similarity needed to obtain KeyplayerRank (hereinafter, KpRank).

Twitter text classification through SVM (S50)

The Twitter SVM process performed by the Twitter text classification module (17) proceeds after the tweets containing the per-topic keywords extracted through the LDA have been preprocessed. In one embodiment, the SVM is run on a total of 5,666 tweets extracted from the 3,001,589 Twitter data items using the keywords of topic 1 of the LDA result, 'Lee Sedol, Go, AlphaGo, artificial intelligence, match'. The SVM is also applied to Twitter in order to extract users who wrote text related to the topic, just as the article text was classified earlier. When texts matching the per-topic keywords are pulled from Twitter, texts unrelated to the topic are often pulled as well. For example, retrieving texts that contain the word '행사' (event) may also return a text such as '여행사진을 모으고 있어요.' (I am collecting travel photos.), because the characters '행사' appear inside '여행사진' (travel photos). The text is classified to reduce such errors.
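The false positive in the '행사' example can be reproduced in a few lines. This is only an illustration of why raw keyword matching over-selects; the noun-level check is shown for contrast and is not the remedy described here, which is SVM classification. The function names and the hard-coded noun list are hypothetical.

```python
def matches_substring(text, keyword):
    """Raw keyword match on the tweet text, as in keyword-based extraction."""
    return keyword in text


def matches_noun(nouns, keyword):
    """Match against whole extracted nouns instead of raw characters."""
    return keyword in nouns


tweet = "여행사진을 모으고 있어요."
nouns = ["여행", "사진"]  # hypothetical analyzer output, hard-coded for illustration

print(matches_substring(tweet, "행사"))  # True
print(matches_noun(nouns, "행사"))       # False
```

The substring test accepts the tweet even though none of its nouns is '행사', which is the kind of unrelated text the classification step is meant to filter out.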

Here, the Twitter SVM process follows the same procedure as the SVM on article text; only the training data and test data differ. In one embodiment, the training data used for Twitter consists of the 5,000 article items and 400 randomly drawn tweets. An SVM model trained only on article data is not suitable for DTM construction when applied to Twitter, so Twitter data is added to form the training data. Of the 5,666 tweets, the 67 items for which morphological analysis returned a null value are excluded, and the remaining 5,599 items are used as test data.

Table 7 below, like Table 3, is a confusion matrix comparing the result of classifying tweet text by applying the trained SVM model to the 5,599 test items (Predicted) against the result of manual labeling (Actual). Of the 5,599 items, 5,277 were classified correctly, giving a high accuracy of about 94.25%. The proportion of texts predicted as related that are actually related, however, is only about 48.78%, a low precision. On the other hand, the proportion of actually related texts that were predicted correctly is about 98.05%, a high recall.

The 301 items for which both the labeling result and the experimental result predicted by the SVM model were related to the topic (class: 1) are stored in the storage device (30), and per-topic user classification is performed on them.

User classification by subject (S60, S70)

Next is the per-topic user classification process using the LDA result.

In one embodiment, the process proceeds as follows for topic 1 of the LDA result in Table 5, 'the Go match between AlphaGo and Lee Sedol'.

As described above, extracting the tweets on the topic 'the Go match between AlphaGo and Lee Sedol' yielded 6,266 tweets out of the 3,001,589 Twitter data items, and removing duplicates from those 6,266 items left 5,666 tweets. From the 301 items (Table 7) obtained by running the SVM to select tweets related to the topic, the user classification module (18) extracts and classifies users.

Among the 301 extracted tweet texts, 201 were retweets of other users' content, about twice the 100 texts written by the users themselves. Because retweeting, which passes along other users' posts, is frequent on Twitter, retweeted texts carrying other users' content are relatively numerous. For a retweeted text, RT_count is the number of times the retweeted user's tweet was retweeted, so retweeted users are classified separately.

User classification on the 301 items first separates users who retweeted from users who did not. In one embodiment, of 172 users in total, 34 did not retweet and 127 retweeted. Next, the IDs of the retweeted users found in the text of the retweeting users were checked, and 11 retweeted users turned out not to exist in the collected Twitter DB. The reason is that the number of tweets that can be fetched is limited when collecting through the REST API, so some users were missed. Therefore, the screen name following 'RT @' is extracted from the text of each retweeting user and stored separately in the storage device (30). Because the follower counts of these users are not in the DB and are therefore unknown, the follower count of each such user is collected through the Twitter API, and attribute information is obtained for a total of 172 Twitter users, including the missing users.
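Pulling the screen name that follows 'RT @' can be sketched with a regular expression. The pattern assumes that a retweet text starts with `RT @name` and that screen names consist of ASCII letters, digits, and underscores; `retweeted_screen_name` and the sample texts are hypothetical.

```python
import re

# Assumed shape of a retweet text: "RT @screen_name: original text"
RT_PATTERN = re.compile(r"^RT @([A-Za-z0-9_]+)")


def retweeted_screen_name(text):
    """Return the retweeted user's screen name, or None for an original tweet."""
    match = RT_PATTERN.match(text)
    return match.group(1) if match else None


print(retweeted_screen_name("RT @example_user: 알파고와 이세돌의 대국"))  # example_user
print(retweeted_screen_name("오늘 바둑 대국을 봤다"))                      # None
```

A `None` result marks a tweet written by the user directly, while a screen name identifies the retweeted user whose attributes may need to be looked up.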

As described above, users whose tweets contain the keywords extracted through LDA topic modeling are extracted and classified. Users are classified per topic because most of the users who mention the keywords are retweeting users. In the attribute information of a retweeting user, retweet_count shows the retweet_count of the retweeted user, so such users must be classified separately.

The process by which the user classification module (18) classifies users is shown in detail in FIG. 3. As shown in FIG. 3, the users to be classified fall into four groups: ① original authors, ② retweeting users, ③ retweeted users in the DB, and ④ retweeted users not in the DB. First, it is determined whether a tweet is a retweet. If it is not, the attribute information of the original author's ID (①) is extracted. If it is a retweet, it is determined whether the retweeted user exists in the Twitter DB of the present invention (that is, the storage device (30)). If so, the attribute information of the retweeting user (②) is extracted. If the retweeted user is not in the DB (that is, the storage device (30)), the attribute information of that retweeted user (④) is retrieved from the Twitter data DB using the Twitter REST API.

When tweets are collected, the REST API limits the number of tweets that can be fetched to 350 per hour, so retweeted users are sometimes missed. A retweeted user, however, is someone whose tweets other users retweet and can therefore be regarded as a user with some degree of influence, so this data must not be excluded. Hence, whenever a user is missing from the Twitter DB, that user's attribute information is retrieved separately from the Twitter data DB using the REST API. Finally, for a retweeted user in the DB (③), the attribute information in the Twitter DB (that is, the storage device (30)) is extracted. The extracted attribute information and the classified users are stored in the storage device (30).
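The decision flow described for FIG. 3 can be sketched roughly as below. This is a simplified reading of the flow, not the module's actual implementation: the category labels, the `classify_tweet` function, the dictionary layout of a tweet, and the use of a plain set to stand in for the local DB are all assumptions, and no real API call is made for users missing from the DB.

```python
import re

RT_PATTERN = re.compile(r"^RT @([A-Za-z0-9_]+)")  # assumed retweet prefix


def classify_tweet(tweet, db_users):
    """Return (category, screen_name) pairs for the users involved in one tweet.

    tweet:    dict with 'user_s_name' and 'text' (field names from Table 6)
    db_users: set of screen names present in the local DB (stand-in)
    """
    match = RT_PATTERN.match(tweet["text"])
    if match is None:
        return [("original_author", tweet["user_s_name"])]        # group 1
    retweeted = match.group(1)
    result = [("retweeting_user", tweet["user_s_name"])]          # group 2
    if retweeted in db_users:
        result.append(("retweeted_in_db", retweeted))             # group 3
    else:
        # Group 4: attributes would have to be fetched separately.
        result.append(("retweeted_not_in_db", retweeted))
    return result


db_users = {"alice", "bob"}
tweets = [
    {"user_s_name": "alice", "text": "알파고 대국 후기"},
    {"user_s_name": "bob", "text": "RT @alice: 알파고 대국 후기"},
    {"user_s_name": "bob", "text": "RT @carol: 이세돌 인터뷰"},
]
for t in tweets:
    print(classify_tweet(t, db_users))
```

Keeping the "not in DB" case as its own category makes it easy to collect the screen names that need a separate attribute lookup.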

From the attribute information, the influence index and subject similarity calculation module (19) uses each user's 'Followers_count', 'RT_count', and 'Favorite_count' to compute the influence index, for which the present invention develops and applies Equation 4 below.

Here, the per-user favorite term in Equation 4 is the average number of favorites of user i, and the two topic-level terms are the average number of retweets and the average number of favorites over all users in the topic. Because retweet and favorite counts differ for each text a user writes, averages are taken. Dividing a user's retweet and favorite counts by the overall averages gives a weight for that user's texts: the more a user's texts exceed the overall average in retweets and favorites, the greater the user's influence in the topic.

In addition, the per-user follower term is obtained in the same way as the two preceding per-user terms: the follower counts attached to each tweet that user i wrote within the topic are summed and divided by the number of tweets user i wrote within the topic, giving the average follower count. The corresponding topic-level term is the average follower count of all users. The further user i's follower count exceeds the average over all users, the more other Twitter users can be said to trust and follow this user.

Consequently, the influence index indicates how much influence a user exerts. It is obtained by adding ratio values for three quantities: retweets, which indicate that other users are passing on and spreading the text a user wrote; favorites, which indicate that other users pressed 'like' and thus have a favorable view of the text; and followers, which give the number of other users following that user.
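Because Equation 4 itself is not reproduced in this text, the following is only a sketch of one plausible reading of the description: the influence index as the sum of three ratios, each dividing the user's per-tweet average by the topic-wide average. The function `influence_index`, the `topic_avg` dictionary, and the sample numbers are hypothetical.

```python
FIELDS = ("retweet_count", "favorite_count", "followers_count")


def mean(values):
    return sum(values) / len(values)


def influence_index(user_tweets, topic_avg):
    """Assumed form: sum over fields of (user average / topic-wide average).

    user_tweets: list of dicts, one per tweet the user wrote in the topic
    topic_avg:   dict of topic-wide averages for the same fields
    """
    score = 0.0
    for field in FIELDS:
        user_avg = mean([t[field] for t in user_tweets])
        if topic_avg[field] > 0:  # skip a ratio whose denominator is zero
            score += user_avg / topic_avg[field]
    return score


# Hypothetical numbers for one user who wrote two tweets in the topic.
user_tweets = [
    {"retweet_count": 10, "favorite_count": 4, "followers_count": 100},
    {"retweet_count": 30, "favorite_count": 6, "followers_count": 100},
]
topic_avg = {"retweet_count": 10, "favorite_count": 5, "followers_count": 200}
print(influence_index(user_tweets, topic_avg))  # 3.5
```

Under this reading, a ratio above 1 means the user outperforms the topic average on that quantity, and the three ratios contribute equally.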

Meanwhile, the influence index and subject similarity calculation module (19) computes subject similarity using the existing Jaccard Index, in order to obtain the similarity between the tweet texts written by users.

The Jaccard Index is a method of measuring the similarity between two sets and takes a value between 0 and 1. It is 0 when the sets have no element in common and 1 when they match completely. The Jaccard Index is given by Equations 5 and 6 below.

Here, A and B are the bundles of tweet text (consisting only of nouns) written by each user. Accordingly, in one embodiment of the present invention, the subject similarity between every pair of the 172 users can be obtained.

The present invention uses the Jaccard Index because it computes the similarity between texts that contain the keywords. The nouns that appear in common in two users' tweet texts are formed into a multiset, and the Jaccard Index then gives the subject similarity as the share of common nouns in the multiset of all nouns. Subject similarity shows how similar users are in topical terms; the more similar they are, the more they can be regarded as users interested in the same topic.
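A minimal sketch of the Jaccard Index on noun lists follows. The set version is the standard definition (size of the intersection divided by size of the union). The multiset version uses one common generalization (sum of minimum counts over sum of maximum counts); whether this matches Equations 5 and 6 exactly is an assumption, and the sample noun lists are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Standard Jaccard Index on sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multiset_jaccard(a, b):
    """Multiset variant (assumed): sum of min counts / sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # element-wise minimum of counts
    union = sum((ca | cb).values())  # element-wise maximum of counts
    return inter / union if union else 0.0


user_a = ["알파고", "바둑", "대국"]
user_b = ["알파고", "바둑", "인공지능"]
print(jaccard(user_a, user_b))  # 0.5

print(multiset_jaccard(["바둑", "바둑", "알파고"], ["바둑", "대국"]))  # 0.25
```

Both functions return 0 for users with no noun in common and 1 for identical inputs, matching the range described for the index.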

Key player detection (S80)

To detect key players per topic, the key player detection module (20) defines KpRank (a ranking of users' influence per topic), which applies both the influence index and the subject similarity, as given in Equation 7 below.

The influence index and subject similarity obtained from Equation 4 and Equation 5 are substituted into Equation 7. Here, d denotes the probability that a Twitter user within a specific topic visits a Twitter page that mentions a topic of interest, and the default value of 0.85 is used. The reason given is that, with a value other than 0.85, the KpRank computation oscillates without converging. For reference, as described above, the influence index is already normalized by the values for all users, which serve as weights.

In Equation 7, the first term measures the individual influence of Twitter user i, and the second term determines whether Twitter user i is topically influential. Since d is 0.85, the topical influence of Twitter user i is given a weight of 0.85 and the individual influence a weight of 0.15.

Here, the influence index is used as a weight that raises the probability that other users visit the Twitter page of p_i (Twitter user i), and the subject similarity raises the probability of visiting the Twitter page of a user who is interested in the same topic.

Accordingly, the KpRank of p_i (Twitter user i) is the sum of the KpRank values of the other users connected to p_i who mention the same topic, plus the probability, weighted by the influence of p_i, that other users stay on the Twitter page of p_i.

Consequently, KpRank is an algorithm for finding Twitter users who are relatively influential within a specific topic. Such users are sought because they hold expert information and knowledge on the topic and are its most influential users. The user with the highest rank under KpRank is called the key player and is judged to be the most influential user within the topic.

In one embodiment of the present invention, to obtain the KpRank value of each of the 172 Twitter users (p₁ to p₁₇₂) using Equation 7, the values KpRank(p₁) to KpRank(p₁₇₂) are all initialized to 1/172. This is because KpRank(p_i) can be regarded as the sum of the normalized values of all other KpRank(p_j), so the KpRank values of all users (p₁ to p₁₇₂) sum to 1. The influence index of each of the 172 users and the subject similarity between user p_i and every other user (p_j) are computed, and the initial KpRank value (1/172) of all users is entered into Equation 7. This yields KpRank values for the 172 users that differ from the initial value (1/172), and Equation 7 is evaluated repeatedly until the KpRank value of each of the 172 Twitter users converges, giving the final KpRank values of all users.
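The iteration described above can be sketched as a PageRank-style power iteration. Equation 7 is not reproduced in this text, so the update rule below is an assumption built from the description: a first term weighted by 1 - d that uses the normalized influence index, and a second term weighted by d that sums the other users' KpRank values, each shared out in proportion to subject similarity. The function `kprank`, the handling of users with no similar neighbours, and the three-user toy data are hypothetical.

```python
def kprank(influence, similarity, d=0.85, tol=1e-10, max_iter=1000):
    """Assumed update (not the patent's Equation 7 verbatim):

    rank_i = (1 - d) * influence_i / sum(influence)
             + d * sum_j (similarity[j][i] / out_j) * rank_j,   j != i
    where out_j is the total similarity of user j to all other users.
    """
    n = len(influence)
    total = sum(influence)
    base = [x / total for x in influence]  # normalized influence index
    out = [sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)]
    rank = [1.0 / n] * n                   # initial value 1/n for every user
    for _ in range(max_iter):
        new = []
        for i in range(n):
            topical = sum(
                similarity[j][i] / out[j] * rank[j]
                for j in range(n)
                if j != i and out[j] > 0
            )
            new.append((1 - d) * base[i] + d * topical)
        # Users with no similar neighbour: return their share via `base`
        # so that the ranks keep summing to 1 (an added convention).
        lost = d * sum(rank[j] for j in range(n) if out[j] == 0)
        new = [v + lost * base[i] for i, v in enumerate(new)]
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


# Toy data (assumption): three users instead of 172.
influence = [3.5, 1.0, 0.5]
similarity = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]
ranks = kprank(influence, similarity)
key_player = max(range(len(ranks)), key=ranks.__getitem__)
print([round(r, 4) for r in ranks], key_player)
```

In this toy case the second user ends up ranked highest despite a lower influence index, because with d = 0.85 the similarity-weighted term dominates the individual term.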

Accordingly, the present invention seeks to find key players on social media using KpRank, which applies both the subject similarity, acting as a weight on the relationships among people interested in a specific topic, and the influence index, which takes account of a user's influence and reach.

In addition, once the key players for each topic have finally been obtained, they can be visualized on a map so that the key players for the topic a user is looking for are shown by region and can be recognized easily. For example, to detect the key players for the topic 'tteokbokki', the highest-ranked key player can be found by clicking 'View Rank' for each region, as shown in FIG. 4.

Meanwhile, this series of steps may be carried out on a computer by a program in which the algorithm is coded directly in a programming language (such as Python), or partly by commercial programs. In addition, as can be understood from the description of the embodiments, the term 'text' in the present invention refers to text whose parts of speech are all nouns.

The methodology of the present invention is applicable not only to Twitter but also to other social media (in particular, Instagram), and can be used for corporate influencer marketing or viral marketing, as well as for policy decisions through big data analysis and public opinion formation. The present invention could thereby be developed into an invention that incorporates spatiotemporal elements and finds, in real time, the key players at the moment each topic becomes an issue.

10: Per-topic key player detection device 11: News data collection module
12: News data preprocessing module 13: Article text classification module
14: Subject extraction module 15: Twitter data collection module
16: Twitter data preprocessing module 17: Twitter text classification module
18: User classification module
19: Influence index and subject similarity calculation module
20: Key player detection module 30: Storage device

Claims

(a) a pre-processing step (S10) in which the news data collection module 11 collects, through crawling, news articles related to a specified category from the news data DB, and the news data pre-processing module 12 removes stop words from the 'TEXT' field of the collected news articles and performs morphological analysis to extract nouns;
(b) a step (S20) in which the article text classification module 13 selects the features to be used for classifying news article texts, indexes the texts represented by those features using feature weights, divides the noun-extracted article texts from step (a) into training data and test data at a fixed ratio, trains an SVM on the training data, and classifies the texts through the trained SVM model;
(c) a step (S30) in which the subject extraction module 14 extracts subjects and subject-specific keywords by performing LDA on the data found, through the text classification by the SVM model in step (b), to be related to the specific category;
(d) a pre-processing step (S40) in which the Twitter data collection module 15 collects, from the Twitter data DB, Twitter data from the same period as the news article collection period, and the Twitter data pre-processing module 16 extracts nouns by removing stop words from, and performing morphological analysis on, the 'text' field of the Twitter data that includes the subject-specific keywords extracted through the LDA;
(e) a step (S50) in which the Twitter text classification module 17 trains the SVM on training data in which Twitter data is added to a certain number of article data, and classifies the text through the trained SVM model, using the Twitter data including the subject-specific keywords from step (d) as test data;
(f) a step (S60) in which the user classification module 18 classifies users for the data found, through the text classification by the SVM model in step (e), to be related to the subject;
(g) a step (S70) in which the influence index and subject similarity calculation module 19 calculates an influence index using the attribute information obtained through the user classification, and calculates subject similarity using the 'Jaccard Index'; and
(h) a step (S80) in which the key player detection module 20 uses the influence index and subject similarity to select, as the key player for each subject, the user having the highest KpRank value among the users; a method of detecting key players by topic on social media, characterized by comprising steps (a) to (h) above.
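
The selection in step (h) can be illustrated with a minimal sketch. It assumes the KpRank values have already been computed and are held in a nested dictionary keyed by subject and then by user; the function name `select_key_players`, the data layout, and the sample scores are hypothetical and are not taken from the claims.

```python
from typing import Dict


def select_key_players(kp_rank: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each subject, the user with the highest KpRank value.

    kp_rank maps subject -> {user -> KpRank value}. The values are assumed
    to be precomputed; this sketch does not compute KpRank itself.
    """
    key_players = {}
    for subject, scores in kp_rank.items():
        if scores:  # skip subjects for which no user was scored
            key_players[subject] = max(scores, key=scores.get)
    return key_players


# Hypothetical scores, for illustration only.
kp_rank = {
    "topic_0": {"user_a": 0.12, "user_b": 0.31, "user_c": 0.07},
    "topic_1": {"user_a": 0.22, "user_c": 0.18},
}
print(select_key_players(kp_rank))
```

Ties are resolved here by dictionary order, which is a simplification; the claims do not state a tie-breaking rule.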

The method according to claim 1,
wherein the step (b) comprises:
(b1) a feature selection step of labeling the noun-extracted article texts according to whether they are related to the category, and selecting the words to be used for classifying the texts using document frequency;
(b2) a step of generating an article-text DTM that applies TF-IDF as the feature weight using the selected features; and
(b3) a step of dividing the DTM into training data and test data at a fixed ratio, applying the SVM model trained on the training data to the test data to classify the article texts, comparing the classification results with the actual labeled results, and extracting the correctly classified texts.
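
The feature selection and weighting of (b1) and (b2) can be sketched as follows. This is an illustration under stated assumptions: the document-frequency threshold `min_df`, the TF-IDF variant (raw term count multiplied by log(N/df)), and the sample noun lists are not specified in the claims and are chosen here only for the example. SVM training in (b3) is not shown.

```python
import math
from collections import Counter
from typing import List


def select_features(docs: List[List[str]], min_df: int = 2) -> List[str]:
    """Keep words that occur in at least min_df documents (document frequency)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    return sorted(term for term, count in df.items() if count >= min_df)


def tfidf_dtm(docs: List[List[str]], vocab: List[str]) -> List[List[float]]:
    """Build a document-term matrix weighted by tf * log(N / df)."""
    n_docs = len(docs)
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    dtm = []
    for doc in docs:
        tf = Counter(doc)
        row = [
            tf[term] * math.log(n_docs / df[term]) if df[term] else 0.0
            for term in vocab
        ]
        dtm.append(row)
    return dtm


# Hypothetical noun-extracted article texts.
docs = [
    ["flood", "rain", "damage"],
    ["rain", "forecast"],
    ["election", "vote"],
    ["flood", "damage", "relief"],
]
vocab = select_features(docs, min_df=2)
dtm = tfidf_dtm(docs, vocab)
print(vocab)
for row in dtm:
    print([round(weight, 3) for weight in row])
```

Each row of the resulting matrix corresponds to one article and each column to one selected word, which is the form a classifier such as an SVM would take as input.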

The method according to claim 1,
wherein in the step (c), text data is generated from the entries of the DTM for which both the labeled result and the result of classifying the article text through the SVM model are related to the specific category, an LDA model is built on that data, and keywords are extracted for each topic by setting the parameters needed to run the LDA model, namely 'num_topics', 'passes', and 'number of keywords'.

The method according to claim 1,
wherein in the step (e), the test data is applied to the trained SVM model, the results of classifying the Twitter text are compared with the actual labeled results, and the correctly classified texts are extracted.

The method according to claim 1,
wherein in the step (f), users are classified into users who retweeted and users who were retweeted, and the attribute information of the Twitter users is retrieved, and, if the retweet user is not stored in the storage device 30, the attribute information of that user is retrieved from the Twitter data DB using the Twitter REST API.
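
The lookup described in this claim, local storage first and a remote request only when the user is missing, can be sketched as below. The names `get_user_attributes` and `fake_fetch`, the attribute field names, and the choice to write the fetched record back into storage are assumptions made for illustration; the remote call is a stub and does not reproduce any actual Twitter REST API endpoint.

```python
from typing import Callable, Dict

UserAttrs = Dict[str, int]


def get_user_attributes(
    user_id: str,
    storage: Dict[str, UserAttrs],
    fetch_remote: Callable[[str], UserAttrs],
) -> UserAttrs:
    """Return a user's attributes, fetching remotely only if not stored."""
    if user_id in storage:
        return storage[user_id]
    attrs = fetch_remote(user_id)  # stands in for the remote API request
    storage[user_id] = attrs       # assumption: cache the result locally
    return attrs


remote_calls = []


def fake_fetch(user_id: str) -> UserAttrs:
    """Stub replacing the remote request; records which users were fetched."""
    remote_calls.append(user_id)
    return {"followers_count": 10, "rt_count": 2, "favorite_count": 5}


storage = {"user_a": {"followers_count": 100, "rt_count": 7, "favorite_count": 3}}
print(get_user_attributes("user_a", storage, fake_fetch))  # served from storage
print(get_user_attributes("user_b", storage, fake_fetch))  # fetched via the stub
print(remote_calls)
```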

The method according to claim 1,
wherein in the step (g),
the influence index is calculated by the following equation using 'Followers_count, RT_count, Favorite_count' among the attribute information of users,

[equation]

(here, [symbol] represents the average number of likes (favorites) of user i, [...], and [symbol] is the average number of followers of all users),
and the subject similarity is calculated by the following equation,

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

(here, A and B are the collections of Twitter text (consisting only of nouns) written by each user); a method for detecting key players by topic on social media using KeyplayerRank, characterized thereby.
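
The Jaccard Index named in step (g) of claim 1 is the size of the intersection of two sets divided by the size of their union. A minimal sketch, assuming each user's Twitter text has already been reduced to a list of nouns; the function name `jaccard_index`, the sample nouns, and the choice to return 0.0 for two empty inputs are assumptions for illustration.

```python
from typing import Iterable


def jaccard_index(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of nouns."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty texts are treated as dissimilar
    return len(set_a & set_b) / len(union)


# Hypothetical nouns extracted from the tweets of two users.
nouns_user_1 = ["flood", "rain", "damage"]
nouns_user_2 = ["rain", "damage", "relief", "aid"]
print(jaccard_index(nouns_user_1, nouns_user_2))  # 2 shared nouns out of 5 distinct
```

Because the inputs are converted to sets, repeated nouns within one user's text do not affect the result.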

The method according to claim 1,
wherein in the step (h),
the KpRank value is calculated by the following equation,

[equation]

(where d is the probability that a Twitter user visits a Twitter page mentioning a topic of interest within a specific topic).