KR102096066B1

KR102096066B1 - Social topics extraction system

Info

Publication number: KR102096066B1
Application number: KR1020180103025A
Authority: KR
Inventors: 조은숙; 민소연; 김봉길; 김세훈
Original assignee: 주식회사 메타소프트; 서일대학교 산학협력단
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2020-04-01
Also published as: KR20200025537A

Abstract

A social topic extraction system capable of extracting high-quality social topics is provided. The social topic extraction system according to the present invention includes: a model building subsystem that generates a topic extraction model based on social data collected from a social network through a network; a topic extraction subsystem that extracts social topics from the collected social data using the topic extraction model; a user interface that delivers a user's request made through the network to the topic extraction subsystem and provides the extracted social topics to the user; and a model feedback module that evaluates the accuracy of the extracted social topics, aggregates the user's feedback on the extracted social topics, and requests the model building subsystem to optimize or regenerate the topic extraction model.

Description

Social topics extraction system

The present invention relates to a social topic extraction system and, more particularly, to a social topic extraction system that can extract social topics from a large volume of social data on a social network.

The present invention was derived from research led by Metasoft Co., Ltd. as part of the 2017 Industry-Academia-Research Cooperative Technology Development Project (First Step, Leap) of the Korea Technology and Information Promotion Agency for SMEs.

[Research period: 2017. 09. 01 ~ 2018. 08. 31; research management agency: Seoil University Industry-Academic Cooperation Foundation; project title: Development of a Media-Context-Based Social Topic Extraction and Classification Framework Using Deep Learning Technology; project number: C0532698]

As social channels offering a variety of online services multiply and become more active, the amount of social data generated on these channels is growing exponentially and has become enormous. To make use of this social data, it must be analyzed to extract social topics, that is, the subjects of the social data.

Existing social topic extraction systems let users search for social content containing a keyword, based on a set of related keywords produced by applying a statistical analysis model. On a social network that hosts many social channels, new words are coined and trends shift continuously because of the network's openness and autonomy. Because existing systems build their statistical analysis model at a fixed point in time, meanings or keywords that emerge afterward are not taken into account.

In addition, social content is unstructured, natural-language data with no fixed format, so judging relevance simply by whether a keyword is present lowers the quality of social topic extraction.

To solve the problems described above, the technical object of the present invention is to provide a social topic extraction system capable of extracting high-quality social topics.

To achieve the above technical object, a social topic extraction system capable of extracting high-quality social topics is provided.

The social topic extraction system according to the present invention includes: a model building subsystem that generates a topic extraction model based on social data collected from a social network through a network; a topic extraction subsystem that extracts social topics from the collected social data using the topic extraction model; a user interface that delivers a user's request made through the network to the topic extraction subsystem and provides the extracted social topics to the user; and a model feedback module that evaluates the accuracy of the extracted social topics, aggregates the user's feedback on the extracted social topics, and requests the model building subsystem to optimize or regenerate the topic extraction model.

The social topic extraction system according to the present invention builds the topic extraction model by combining statistical analysis techniques with machine learning techniques and can regenerate and optimize the topic extraction model, thereby enabling high-quality social topic extraction and maintenance.

FIG. 1 is a schematic diagram showing the configuration of a social topic extraction system according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the configuration of the model building subsystem of the social topic extraction system according to an embodiment of the present invention.
FIG. 3 is a schematic diagram showing the configuration of the topic extraction subsystem of the social topic extraction system according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram illustrating the process of selecting input variables in the model building subsystem of the social topic extraction system according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram illustrating the process of building a topic extraction model in the model building subsystem of the social topic extraction system according to an embodiment of the present invention.
FIGS. 6 and 7 show examples of the results of providing social topics extracted by the social topic extraction system according to an embodiment of the present invention.

Hereinafter, social topic extraction systems according to embodiments of the present invention are described in detail with reference to the accompanying drawings. The present invention is not limited to the following embodiments, and a person of ordinary skill in the art may implement the present invention in various other forms without departing from its technical spirit. That is, specific structural and functional descriptions are given only to describe embodiments of the present invention; the embodiments may be implemented in various forms and should not be construed as limited to the embodiments described herein. The present invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, or combinations thereof, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries are to be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in the present application.

In this specification, social data means data generated and transmitted on the social channels serviced on a social network. For example, whereas the conventional web consists of static files, social data generated on a social network such as Twitter has a dynamic flow in which very small pieces of information are delivered around the world in real time. That is, social data may have a dynamic flow on a social network.

FIG. 1 is a schematic diagram showing the configuration of a social topic extraction system according to an embodiment of the present invention.

Referring to FIG. 1, the social topic extraction system 1 may include a model building subsystem 200 and a topic extraction subsystem 300. The social topic extraction system 1 may further include a social data DB 510 that collects, through a network 50, social data such as social channels 21, social content 22, and creation records 23 from a social network 20 and stores it.

The network 50 may include any means of exchanging data over a wired or wireless connection, such as a wired Internet service, a local area network (LAN), a wide area network (WAN), an intranet, a wireless Internet service, a mobile computing service, a wireless data communication service, a wireless Internet access service, a satellite communication service, a wireless LAN, or Bluetooth. When the network 50 is connected to a smartphone, a tablet, or the like, the network 50 may be a wireless data communication service such as 3G, 4G, or 5G, a wireless LAN such as Wi-Fi, Bluetooth, or the like.

The social channel 21 may be each service provided on the social network 20, the social content 22 may be data written within each social channel 21, and the creation record 23 may be additional information that the social channel 21 holds about the social content 22, such as the date and time of creation, the category, and the author.

The social data DB 510 may store the social data for a relatively long period or permanently, or may store it only temporarily. The social data DB 510 may be any space capable of storing data in any form, for example a NoSQL store, a relational database, or a file system. The social data DB 510 may be a single logically distinct storage device, a partition unit that logically divides one or more storage devices, or a part of a single physically distinct storage device or of a single logically distinct partition unit.

Based on the social data collected and stored in the social data DB 510, the social topic extraction system 1 may generate a topic extraction model in the model building subsystem 200 and store it in a topic extraction model DB 520. The topic extraction model DB 520 may be a data storage space separate from the social data DB 510, or the social data DB 510 and the topic extraction model DB 520 may be a single data storage space.

The model building subsystem 200 may include an input variable selection module 210, an extraction model building module 220, and a model optimization module 230.

The input variable selection module 210 may receive social data, analyze the semantic association and topic association between pieces of social content, and select input variables in descending order of influence. The input variable selection module 210 may select the input variables by comparing the possible combinations of features extracted from the social data in terms of computation speed and extraction accuracy.
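As a rough illustration of selecting input variables in descending order of influence, the sketch below ranks candidate features by a score and keeps the top few. The function name `rank_input_variables`, the feature names, and the idea of expressing influence as a single numeric score are assumptions made for illustration; the source does not specify how influence is computed.

```python
from typing import Dict, List


def rank_input_variables(influence: Dict[str, float], top_n: int) -> List[str]:
    """Return the `top_n` feature names with the highest influence score.

    `influence` maps a hypothetical feature name to a hypothetical score;
    how the score is derived is outside the scope of this sketch.
    """
    if top_n < 0:
        raise ValueError("top_n must be non-negative")
    # Sort by score, highest first; ties are broken by name for determinism.
    ranked = sorted(influence.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical candidate features and scores.
candidates = {"hashtag_overlap": 0.9, "posting_hour": 0.2, "named_entities": 0.5}
selected = rank_input_variables(candidates, top_n=2)
```

In this example `selected` holds the two highest-scoring candidates, in order.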

Based on the selected input variables, the extraction model building module 220 may measure the accuracy of the semantic association and topic association of the social data by arranging learning layers. It may then combine statistical analysis techniques, such as time-series analysis techniques including the moving average method and exponential smoothing, with machine learning techniques, such as Random Forest and Support Vector Machine, to build an ensemble model, that is, a topic extraction model optimized for extracting topics from social data on the basis of semantic association and topic association. The topic extraction model may be selected based on the characteristics of the social data to be analyzed and of the input variables.
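The two statistical techniques named above can be sketched in a few lines. The code below is a minimal illustration, not the patented model: it computes a moving average and a simple exponential smoothing level over a series (for instance, hypothetical daily mention counts of a keyword) and blends a statistical score with a machine learning score by a weighted average. The blending rule, the `weight` parameter, and all function names are assumptions; the source does not say how the techniques are combined.

```python
from typing import Sequence


def moving_average(series: Sequence[float], window: int) -> float:
    """Mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


def exponential_smoothing(series: Sequence[float], alpha: float) -> float:
    """Simple exponential smoothing; returns the final smoothed level."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        # New level = weighted mix of the new observation and the old level.
        level = alpha * value + (1.0 - alpha) * level
    return level


def ensemble_score(stat_score: float, ml_score: float, weight: float = 0.5) -> float:
    """Weighted blend of a statistical score and an ML score (assumed rule)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * stat_score + (1.0 - weight) * ml_score
```

Here `ml_score` stands in for the output of a classifier such as a Random Forest or Support Vector Machine, which this sketch does not implement.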

The model optimization module 230 may train the selected topic extraction model using a training data set selected from the social data, and may evaluate the extraction results using a test data set selected from the social data outside the training data set. The model optimization module 230 may consist of input variable optimization and parameter optimization. The selected topic extraction model may be optimized with respect to characteristics such as computation speed and accuracy. The optimization may target the input variables and the hyperparameters of the model.
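The separation of training and test data described above can be sketched as follows. This is an illustrative sketch only: the split ratio, the random seed, and the use of plain accuracy as the evaluation metric are assumptions, and the helper names `split_dataset` and `accuracy` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T], test_ratio: float = 0.2,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle `items` and split them into disjoint (train, test) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be in (0, 1)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of positions where the predicted topic equals the expected one."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)
```

Because the test items never appear in the training list, the accuracy measured on them reflects how the model handles data it has not seen.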

The evaluation results obtained by the model optimization module 230 may be fed back and reflected in the selection of input variables by the input variable selection module 210 and in the optimization of the topic extraction model selected by the extraction model building module 220, so that an optimized topic extraction model can be built. In some embodiments, the evaluation results from the model optimization module 230 may be used as feedback input when building a topic extraction model from a different combination.

Using the topic extraction model stored in the topic extraction model DB 520, the social topic extraction system 1 may extract social topics, in the topic extraction subsystem 300, from the social data collected and stored in the social data DB 510.

The topic extraction subsystem 300 may receive, at a user interface (UI) 100, a request made by a user 10 through the network 50, and may provide the social topics extracted in response to the request to the user 10 through the user interface 100.

The user interface 100 may provide an interface for accessing the social topic extraction system 1 through a terminal or the like used by the user 10. Through the user interface 100, the user 10 may send a query, such as a keyword, to the social topic extraction system 1 and may receive the extracted topics provided by the social topic extraction system 1.

The topic extraction subsystem 300 may include a semantic association extraction module 310, a topic association extraction module 320, and a topic prediction module 330.

The semantic association extraction module 310 and the topic association extraction module 320 may extract semantic association and topic association, respectively, from the meaning and topic of each piece of social data stored in the social data DB 510. Using the topic extraction model stored in the topic extraction model DB 520, the topic prediction module 330 may analyze the semantic association and topic association extracted by the semantic association extraction module 310 and the topic association extraction module 320 to extract social topics from the social data. The topic prediction module 330 may gather the pieces of social data stored in the social data DB 510 whose meanings and topics are broadly similar and extract them as a social topic. The extraction method can be explained with reference to the principal component analysis result graph (the right-hand graph) shown in FIG. 4: when the social data is re-plotted with the first principal component PC1 and the second principal component PC2 as coordinate axes, pieces of social data with similar PC1 and PC2 values fall into several groups. For example, the topic prediction module 330 may extract social topics by selecting a meaning and/or topic that represents each group.
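The grouping of social data in the PC1/PC2 plane can be illustrated with a small clustering sketch. The source does not name a grouping algorithm, so k-means is used here purely as an assumed stand-in; the function name `kmeans_2d`, the deterministic initialization from the first `k` points, and the sample coordinates are all hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (PC1, PC2) coordinates of one piece of social data


def kmeans_2d(points: Sequence[Point], k: int,
              iterations: int = 20) -> Tuple[List[int], List[Point]]:
    """Group 2-D points into `k` clusters; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centroids: List[Point] = list(points[:k])  # deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return labels, centroids


# Hypothetical PC1/PC2 coordinates forming two well-separated groups.
sample = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels, centroids = kmeans_2d(sample, k=2)
```

Each resulting cluster corresponds to one candidate group, from which a representative meaning or topic could then be chosen.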

The social topic extraction system 1 may further include a model feedback module 400. The model feedback module 400 may include a user feedback unit 410, which receives and analyzes the feedback that the user 10 gives through the user interface 100 on the social topics provided, and a prediction result evaluation unit 420, which evaluates the social topics extracted by the topic extraction subsystem 300.

The model feedback module 400 may request the model building subsystem 200 to regenerate the topic extraction model when either of two accuracies is lower than the accuracy predicted while building the topic extraction model stored in the topic extraction model DB 520: the accuracy based on the evaluation by the user 10, obtained by analyzing the feedback on the social topics received by the user feedback unit 410, or the accuracy of the social topics extracted by the topic extraction subsystem 300, as evaluated by the prediction result evaluation unit 420.

The model feedback module 400 may combine the feedback from the user 10 with the evaluation of the social topics extracted by the topic extraction subsystem 300 and request the model building subsystem 200 to regenerate the topic extraction model. For example, when the accuracy based on the evaluation by the user 10, obtained by analyzing the feedback on the social topics received by the user feedback unit 410, is lower than the accuracy of the social topics extracted by the topic extraction subsystem 300 as evaluated by the prediction result evaluation unit 420, the model feedback module 400 may request the model building subsystem 200 to regenerate the topic extraction model.
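The two regeneration conditions described for the model feedback module 400 can be written as simple predicates. This is a sketch under assumptions: the accuracies are taken to be numbers on a common scale (for example 0.0 to 1.0), and the function names are hypothetical. The source presents the conditions as separate examples, so they are kept as separate functions rather than merged into one rule.

```python
def below_predicted_accuracy(user_accuracy: float,
                             evaluated_accuracy: float,
                             predicted_accuracy: float) -> bool:
    """True if either measured accuracy falls below the accuracy that was
    predicted when the topic extraction model was built."""
    return (user_accuracy < predicted_accuracy
            or evaluated_accuracy < predicted_accuracy)


def user_below_evaluated(user_accuracy: float,
                         evaluated_accuracy: float) -> bool:
    """True if the user-derived accuracy is lower than the accuracy
    measured by the prediction result evaluation."""
    return user_accuracy < evaluated_accuracy
```

When either predicate returns `True`, a regeneration request would be sent to the model building subsystem 200.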

On receiving a request from the model feedback module 400 to regenerate the topic extraction model, the model building subsystem 200 may re-run the optimization of the topic extraction model through the model optimization module 230 or build a new topic extraction model through the extraction model building module 220.

In some embodiments, the social topic extraction system 1 may be configured so that the model building subsystem 200 and the topic extraction subsystem 300 each form a separate system.

For example, a model building system that includes the model building subsystem 200, the social data DB 510, and the topic extraction model DB 520 may be separate from a topic extraction system that includes the user interface 100, the topic extraction subsystem 300, the model feedback module 400, the social data DB 510, and the topic extraction model DB 520. In this case, the model feedback module 400 of the topic extraction system may provide feedback to the model building system. The topic extraction system and the model building system may each have their own social data DB 510 or may share a single social data DB 510. Likewise, they may each have their own topic extraction model DB 520, with the two linked to each other, or may share a single topic extraction model DB 520.

The social topic extraction system 1 according to the present invention builds the topic extraction model by combining statistical analysis techniques with machine learning techniques and can regenerate and optimize the topic extraction model, thereby enabling high-quality social topic extraction and maintenance.

FIG. 2 is a schematic diagram showing the configuration of the model building subsystem of the social topic extraction system according to an embodiment of the present invention.

Referring to FIG. 2, the model building subsystem 200 may generate a topic extraction model using the social data collected from the social network 20 through the network 50 and stored in the social data DB 510, and may store the model in the topic extraction model DB 520.

The model building subsystem 200 may include an input variable selection module 210, an extraction model building module 220, a model optimization module 230, and a natural language understanding module 240.

The natural language understanding module 240 may perform analysis processes such as semantic role labeling, morphological analysis, syntactic analysis, named entity analysis, intent classification, and domain analysis on the social data stored in the social data DB 510, and may provide the results to the input variable selection module 210.

In some embodiments, the natural language understanding module 240 may provide the results of natural language analysis of the social data to the input variable selection module 210 as structured data with a table structure or a tree structure. Social data may come in file formats that can hold many kinds of information, for example CSV, XLS, XLSX, HTML, WMS, PDF, XML, RDF, ZIP, Open API, DOC, DOCX, HWP, PPT, and PPTX, and may be structured in different ways even within a single file format. The structured data provided by the natural language understanding module 240 may be, for example, an RDB (Relational Database), CSV (Comma-Separated Values), XML (eXtensible Markup Language), or JSON (JavaScript Object Notation), but is not limited to these.
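To make the idea of structured output concrete, the sketch below serializes a small analysis result both as JSON (tree-like) and as CSV (table-like). The record fields `token`, `pos`, and `role`, their values, and the helper names are hypothetical; the source does not define a schema for the analysis results.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical analysis result: one record per token.
RECORDS: List[Record] = [
    {"token": "festival", "pos": "NOUN", "role": "ARG1"},
    {"token": "opens", "pos": "VERB", "role": "PRED"},
]


def to_json(records: List[Record]) -> str:
    """Serialize the records as a JSON array."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: List[Record]) -> str:
    """Serialize the records as CSV with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Either form gives the input variable selection step a predictable layout to read, regardless of the file format in which the social data originally arrived.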

The natural language understanding module 240 may include a semantic role labeling (SRL) unit 241, a morphological analysis unit 242, a syntactic analysis unit 243, a named entity analysis unit 244, an intent classification unit 245, and a domain analysis unit 246. The semantic role labeling unit 241 may assign semantic roles to the social data. The morphological analysis unit 242 may split the social data to which semantic roles have been assigned into morphemes. The syntactic analysis unit 243 and the named entity analysis unit 244 may perform syntactic analysis and named entity analysis on the social data split into morphemes. The intent classification unit 245 and the domain analysis unit 246 may perform intent classification and domain analysis on the social data to which semantic roles have been assigned.

In some embodiments, the natural language understanding module 240 may refer to language dictionary information 600 to assign shoulder numbers (homonym indices) to the morphemes separated by the morphological analysis unit 242. Specifically, the intent classification unit 245 may refer to the language dictionary information 600 and assign to each separated morpheme the shoulder number that corresponds to the intended meaning of a homonym.

The input variable selection module 210 may receive the social data that has undergone natural language analysis in the natural language understanding module 240, analyze the semantic association and topic association between pieces of social data, and select input variables in descending order of influence.

The input variable selection module 210 may include a feature extraction unit 212 that extracts features corresponding to input variable candidates from the social data, and a variable analysis unit 214 that analyzes the extracted features. The method of selecting input variables is described in detail with reference to FIG. 4.

For the social topic extraction results, the extraction model building module 220 may measure the accuracy of the semantic association and topic association through the arrangement of learning layers, and may build a topic extraction model that combines statistical analysis techniques with machine learning techniques and is optimized for extracting topics from social data.

The extraction model building module 220 may include a model building unit 222 that builds the topic extraction model and a model evaluation unit 224 that evaluates the built topic extraction model. The method of building the topic extraction model is described in detail with reference to FIG. 5.

The model optimization module 230 may train and evaluate the selected topic extraction model. The model optimization module 230 may include an input variable optimization unit 232 that optimizes the selected input variables, and a parameter optimization unit 234 that optimizes parameters.

For example, the parameter optimization unit 234 may optimize model parameters, which are the parameters a machine learning technique learns from the training data set, or hyperparameters, which are the tuning parameters of the machine learning technique. Model parameters may be, for example, the weight coefficient (slope) and bias (axis intercept) of a linear regression. Hyperparameters may be, for example, the regularization strength of the L2 penalty in the mean squared error cost function of a regression, or the maximum depth setting of a decision tree.
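The distinction between the two kinds of parameter can be made concrete with a small sketch. This is not the patent's implementation: the function name `fit_ridge`, the toy data, and the learning rate and epoch count are assumptions chosen for illustration. The slope `w` and intercept `b` are model parameters learned from the data, while `l2_strength` is a hyperparameter fixed before training.

```python
def fit_ridge(xs, ys, l2_strength, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on MSE plus an L2 penalty on w."""
    w, b = 0.0, 0.0  # model parameters, learned from the training data
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of mean squared error, plus the gradient of l2_strength * w**2.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n + 2 * l2_strength * w
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data lying exactly on y = 2x + 1 (illustrative values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_plain, b_plain = fit_ridge(xs, ys, l2_strength=0.0)  # no regularization
w_ridge, b_ridge = fit_ridge(xs, ys, l2_strength=1.0)  # slope is shrunk toward 0
print(w_plain, b_plain, w_ridge, b_ridge)
```

Changing `l2_strength` changes which `w` and `b` the training loop converges to, which is why a hyperparameter has to be tuned from outside the training loop, for example against a held-out test set.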

FIG. 3 is a schematic diagram showing the configuration of the topic extraction subsystem of a social topic extraction system according to an embodiment of the present invention.

Referring to FIG. 3, the topic extraction subsystem 300 may receive, via the user interface 100, a request made by the user 10 through the network 50; extract social topics, using the topic extraction model stored in the topic extraction model DB 520, from the social data collected from the social network 20 through the network 50 and stored in the social data DB 510; and provide the extracted social topics to the user 10 through the user interface 100. In some embodiments, the social topics extracted by the topic extraction subsystem 300 may be stored in the topic extraction result DB 530. The topic extraction result DB 530 may be a storage space separate from the social data DB 510 and the topic extraction model DB 520, or it may be a single storage space shared with at least one of the social data DB 510 and the topic extraction model DB 520.

The topic extraction subsystem 300 may include a semantic association extraction module 310, a topic association extraction module 320, a topic prediction module 330, and a natural language understanding module 340.

The natural language understanding module 340 may perform analysis steps such as semantic role labeling, morphological analysis, syntactic parsing, named entity analysis, intent classification, and domain analysis on the social data stored in the social data DB 510, and provide the analyzed data to the semantic association extraction module 310 and the topic association extraction module 320.

The natural language understanding module 340 may include a semantic role labeling unit 341, a morphological analysis unit 342, a syntactic parsing unit 343, a named entity analysis unit 344, an intent classification unit 345, and a domain analysis unit 346. Each of these is largely similar to the corresponding semantic role labeling (SRL) unit 241, morphological analysis unit 242, syntactic parsing unit 243, named entity analysis unit 244, intent classification unit 245, and domain analysis unit 246 of the natural language understanding module 240 described with reference to FIG. 2, so a detailed description is omitted.

In some embodiments, the social topic extraction system 1 shown in FIG. 1 may provide a single natural language understanding module shared by the model building subsystem 200 and the topic extraction subsystem 300.

The semantic association extraction module 310 includes a semantic association analysis unit 312 that receives social data that has undergone natural language analysis in the natural language understanding module 240 and analyzes its semantic association, and a semantic association dictionary building unit 314 that builds a semantic association dictionary based on the analyzed semantic association. The semantic association analysis unit 312 may analyze the semantic association of the social data by referring to the semantic association dictionary built by the semantic association dictionary building unit 314.

The topic association extraction module 320 includes a topic association analysis unit 322 that receives social data that has undergone natural language analysis in the natural language understanding module 240 and analyzes its topic association, and a topic association dictionary building unit 324 that builds a topic association dictionary based on the analyzed topic association. The topic association analysis unit 322 may analyze the topic association of the social data by referring to the topic association dictionary built by the topic association dictionary building unit 324.

The topic prediction module 330 includes a semantic association topic extraction unit 332 that extracts semantic association topics based on the semantic association extracted by the semantic association extraction module 310; a topic association topic extraction unit 334 that extracts topic association topics based on the topic association extracted by the topic association extraction module 320; and a social topic extraction unit 336 that extracts social topics based on the semantic association topics and topic association topics extracted by the semantic association topic extraction unit 332 and the topic association topic extraction unit 334, respectively.

FIG. 4 is a conceptual diagram illustrating the process of selecting input variables in the model building subsystem of a social topic extraction system according to an embodiment of the present invention.

Referring to FIG. 4, the social data may be high-dimensional data, as in the graph on the left. That is, the features corresponding to the input variable candidates, each of which forms one dimension, may have widely varying components. To extract and use the factors among these components that most strongly affect social topics, principal component analysis (PCA), a dimensionality reduction method, is used.

Principal component analysis can convert high-dimensional data such as the left graph into low-dimensional data such as the right graph, transforming the social data set onto new coordinate axes. PCA thus re-expresses each item of social data along mutually independent (orthogonal) coordinate axes while minimizing data loss, obtaining the greatest effect with the fewest dimensions. In this way, the principal components that represent the social data can be found while minimizing information loss, and the number of variables (dimensions) can be reduced, which decreases the overlap in the data caused by the variables.

For example, through principal component analysis, a first principal component PC1 and a second principal component PC2 may be extracted as the factors that most strongly affect social topics, converting the social data into two-dimensional data.
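The reduction to two principal components can be sketched in plain Python as follows. This is a minimal illustration, not the patent's procedure: it uses power iteration with deflation on the covariance matrix rather than the genetic algorithm approach, and the function names and the toy three-dimensional data are assumptions.

```python
import math


def mean_center(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def power_iteration(matrix, steps=500):
    """Return the dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def top_two_components(rows):
    """Return PC1, PC2 and the rows projected onto them (2-D coordinates)."""
    centered = mean_center(rows)
    cov = covariance(centered)
    lam1, pc1 = power_iteration(cov)
    d = len(cov)
    # Deflation: remove the PC1 direction so the next dominant vector is PC2.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    _, pc2 = power_iteration(deflated)
    projected = [[sum(a * b for a, b in zip(row, pc1)),
                  sum(a * b for a, b in zip(row, pc2))] for row in centered]
    return pc1, pc2, projected


# Toy 3-D data (illustrative): almost all variation lies along direction (1, 2, 0).
rows = [[-2.0, -4.0, 0.1], [-1.0, -2.0, -0.1], [0.0, 0.0, 0.1],
        [1.0, 2.0, -0.1], [2.0, 4.0, 0.0]]
pc1, pc2, projected = top_two_components(rows)
print(pc1, pc2, projected)
```

The sign of each component is arbitrary, so only the direction of `pc1` and `pc2` is meaningful. In practice a library routine based on singular value decomposition would usually replace the hand-written power iteration.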

In some embodiments, the principal component analysis may use a genetic algorithm to extract the most discriminative features as dimension-reduced principal components, for example the first principal component PC1 and the second principal component PC2. A genetic algorithm is a computational model based on the genetic and evolutionary mechanisms of the natural world: candidate solutions to the problem are expressed in a fixed data structure and scored (get_fitness), then gradually modified (_generate_parent and _mutate) to produce progressively better solutions (get_best). Each candidate solution is regarded as an organism, or individual, and a set of individuals is called a population.

For example, the mutation algorithm for the genetic algorithm may be configured as follows.

    import random


    class Chromosome:
        # Holds a candidate solution and its fitness score. The listing uses
        # this class without defining it; it is added here so the code runs.
        def __init__(self, genes, fitness):
            self.Genes = genes
            self.Fitness = fitness


    def _mutate(parent, geneSet, get_fitness):
        # Copy the parent's genes and replace one randomly chosen position.
        childGenes = parent.Genes[:]
        index = random.randrange(0, len(parent.Genes))
        newGene, alternate = random.sample(geneSet, 2)
        # Fall back to the alternate when newGene equals the current gene.
        childGenes[index] = alternate \
            if newGene == childGenes[index] \
            else newGene
        fitness = get_fitness(childGenes)
        return Chromosome(childGenes, fitness)


    def _generate_parent(length, geneSet, get_fitness):
        # Build a random initial candidate of the requested length.
        genes = []
        while len(genes) < length:
            sampleSize = min(length - len(genes), len(geneSet))
            genes.extend(random.sample(geneSet, sampleSize))
        fitness = get_fitness(genes)
        return Chromosome(genes, fitness)


    def get_best(get_fitness, targetLen, optimalFitness, geneSet, display):
        random.seed()
        bestParent = _generate_parent(targetLen, geneSet, get_fitness)
        display(bestParent)
        if bestParent.Fitness >= optimalFitness:
            return bestParent
        while True:
            child = _mutate(bestParent, geneSet, get_fitness)
            # Discard children that do not improve on the best parent.
            if bestParent.Fitness >= child.Fitness:
                continue
            display(child)
            if child.Fitness >= optimalFitness:
                return child
            bestParent = child


    # The functions above call get_fitness with one argument, so the caller
    # binds target first, e.g. lambda genes: get_fitness(genes, target).
    def get_fitness(genes, target):
        # Count the positions where the candidate matches the target.
        return sum(1 for expected, actual in zip(target, genes)
                   if expected == actual)
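To relate the mutate-and-keep-the-better loop to input variable selection, the following self-contained sketch applies the same idea to a 0/1 mask over candidate features. It is an illustration under stated assumptions, not the patent's method: the feature names, the integer usefulness scores in `FEATURE_SCORES`, and the `COST_PER_FEATURE` penalty are all hypothetical.

```python
import random

# Hypothetical usefulness score per candidate feature (illustrative integers).
FEATURE_SCORES = {
    "word_position": 9,
    "word_count": 8,
    "author": 3,
    "created_date": 2,
    "text_length": 1,
}
FEATURES = list(FEATURE_SCORES)
COST_PER_FEATURE = 5  # hypothetical penalty that discourages extra dimensions


def mask_fitness(mask):
    """Score a 0/1 mask: usefulness of the kept features minus a size penalty."""
    return sum(FEATURE_SCORES[name] - COST_PER_FEATURE
               for name, keep in zip(FEATURES, mask) if keep)


def mutate(mask):
    """Return a copy of the mask with one randomly chosen bit flipped."""
    child = mask[:]
    index = random.randrange(len(child))
    child[index] = 1 - child[index]
    return child


def select_features(optimal_fitness, max_steps=10_000):
    """Keep mutating the best mask, accepting only strict improvements."""
    best = [random.randint(0, 1) for _ in FEATURES]
    for _ in range(max_steps):
        if mask_fitness(best) >= optimal_fitness:
            break
        child = mutate(best)
        if mask_fitness(child) > mask_fitness(best):
            best = child
    return [name for name, keep in zip(FEATURES, best) if keep]


# The best reachable score is (9 - 5) + (8 - 5) = 7.
selected = select_features(optimal_fitness=7)
print(selected)
```

Because each feature contributes to the score independently here, single-bit mutations always find the best mask. A real feature-scoring function would have interactions between features, which is where a population and crossover become useful.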

In some embodiments, the first principal component PC1 and the second principal component PC2 may be meaning and subject, respectively. In this case, social data with similar first principal components PC1 may be judged to have semantic association, and social data with similar second principal components PC2 may be judged to have topic association. Social data consists broadly of an author, a creation date, and content; the content is written in natural language, which is made up of sentences and words. For the first principal component PC1, the sentences and words of each item of social data are vectorized by where and how often they appear in order to extract a center word list, and a multidimensional matrix is built against other social data using the author, creation date, center words, and so on. The vectorization for center word extraction uses Word2Vec to calculate distance values between words, and the similarity between words can be judged from these distances. Semantic association is judged by the degree of similarity, based on the center word ranking of each item of social data and the list of similar words for each center word. That is, if the similarity between the center word and similar word lists of social data A and social data B is greater than the similarity criterion, semantic association is judged to exist. The similarity criterion is a real number from 0 to 1, where values closer to 1 indicate higher similarity; it starts at an initial value of 0.5 and is gradually raised or lowered, according to the number of machine learning training iterations and the number of semantic association sets in the training results, to find the optimal value. For the second principal component PC2, the similarity between items of social data is vectorized at the level of the social data itself, based on the topics judged to have semantic association under PC1. Here, what a set of similar social data has in common is identified as the subject, and Doc2Vec is used to calculate distance values between items of social data. Accordingly, social topics can be extracted from social data whose first principal components PC1 and second principal components PC2 are largely similar, that is, social data having both semantic association and topic association.
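The threshold test on center-word similarity can be sketched as follows. This is an illustration only: the three-dimensional vectors in `WORD_VECTORS` are made-up stand-ins for Word2Vec output, and averaging each word's best cosine match is an assumed way of comparing two center word lists, since the text does not fix the exact formula.

```python
import math

# Hypothetical word vectors standing in for trained Word2Vec embeddings.
WORD_VECTORS = {
    "stock": [0.9, 0.1, 0.0],
    "fund": [0.8, 0.2, 0.1],
    "soccer": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0 to 1 for non-negative vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center_word_similarity(words_a, words_b):
    """Average, over the words of A, of the best cosine match found in B."""
    best_matches = [
        max(cosine_similarity(WORD_VECTORS[a], WORD_VECTORS[b]) for b in words_b)
        for a in words_a
    ]
    return sum(best_matches) / len(best_matches)


def has_semantic_association(words_a, words_b, criterion=0.5):
    """Apply the similarity criterion; 0.5 is the initial value given in the text."""
    return center_word_similarity(words_a, words_b) > criterion


print(has_semantic_association(["stock"], ["fund"]))
print(has_semantic_association(["stock"], ["soccer"]))
```

The sketch keeps the criterion fixed at its initial value. How the criterion is raised or lowered during training is not specified in enough detail to reproduce, so that step is left out.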

FIG. 5 is a conceptual diagram illustrating the process of building a topic extraction model in the model building subsystem of a social topic extraction system according to an embodiment of the present invention.

Referring to FIG. 5, the learning layer of the topic extraction model may consist of three levels: Level 1, Level 2, and Level 3.

At Level 1 (L1), statistical analysis techniques for time series analysis, such as the moving average method (M11), exponential smoothing (M12), autocorrelation (M13), and ARIMA (M14), may each be performed. The results produced by each statistical analysis technique go through an iterative analysis process, and weights may be applied to the results. The weights are continuously updated according to the social topic extraction results and the extraction accuracy, and the weights sum to 1. That is, if the n statistical analysis techniques applied at Level 1 (L1), the time series analysis, are defined as $TA_1, TA_2, TA_3, \ldots, TA_n$, their respective weights are $WS_1, WS_2, WS_3, \ldots, WS_n$, and the sum of the weights is

$$\sum_{i=1}^{n} WS_i = 1.$$

Accordingly, the result value of Level 1 (L1), the time series analysis, that is passed to Level 2 (L2) is

$$\sum_{i=1}^{n} WS_i \cdot TA_i.$$
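A weighted combination of Level 1 techniques can be sketched as follows, assuming the weights are applied as a weighted sum of each technique's forecast. Only two of the listed techniques are shown, and the function names, window size, smoothing factor, toy series, and raw weights are all assumptions for illustration.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing_forecast(series, alpha=0.5):
    """Simple exponential smoothing; the final level is the next-step forecast."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def level1_output(series, raw_weights):
    """Weighted sum of the technique forecasts, with weights normalized to 1."""
    forecasts = [
        moving_average_forecast(series),
        exponential_smoothing_forecast(series),
    ]
    weights = normalize(raw_weights)
    return sum(w * f for w, f in zip(weights, forecasts))


# Toy mention counts per period for one candidate topic (illustrative values).
series = [2.0, 4.0, 6.0, 8.0]
result = level1_output(series, raw_weights=[3.0, 1.0])
print(result)
```

With these inputs the moving average forecast is 6.0, the smoothed forecast is 6.25, and the normalized weights are 0.75 and 0.25.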

At Level 2 (L2), machine learning techniques such as random forest (M21), support vector regression (M22), and a neural network (M23) may each be performed for topic extraction. The results produced by each machine learning technique go through an iterative analysis process, and weights may be applied to the results. The weights are continuously updated according to the social topic extraction results and the extraction accuracy, and the weights sum to 1. That is, if the k machine learning techniques applied at Level 2 (L2) for topic extraction are defined as $MA_1, MA_2, MA_3, \ldots, MA_k$, their respective weights are $WM_1, WM_2, WM_3, \ldots, WM_k$, and the sum of the weights is

$$\sum_{j=1}^{k} WM_j = 1.$$

Level 3 (L3) builds an ensemble model (M31) that integrates the extraction results from Level 2 (L2), and the topic extraction result value integrated by the ensemble model (M31) of Level 3 (L3) is

$$\sum_{j=1}^{k} WM_j \cdot MA_j.$$
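The accuracy-driven weight update and the ensemble step can be sketched together. The patent states only that weights are updated according to extraction accuracy and sum to 1, so setting each weight proportional to measured accuracy is an assumption, as are the function names and all numeric values below.

```python
def weights_from_accuracy(accuracies):
    """Assumed update rule: weight each model in proportion to its accuracy."""
    total = sum(accuracies.values())
    return {name: accuracy / total for name, accuracy in accuracies.items()}


def ensemble_scores(model_scores, weights):
    """Combine per-model topic scores into one weighted score per topic."""
    combined = {}
    for name, scores in model_scores.items():
        for topic, score in scores.items():
            combined[topic] = combined.get(topic, 0.0) + weights[name] * score
    return combined


# Hypothetical measured accuracies of three Level 2 models.
accuracies = {"random_forest": 0.8, "support_vector_regression": 0.6, "neural_net": 0.6}

# Hypothetical scores each model assigns to two candidate topics.
model_scores = {
    "random_forest": {"stock price": 0.9, "tax": 0.2},
    "support_vector_regression": {"stock price": 0.5, "tax": 0.6},
    "neural_net": {"stock price": 0.7, "tax": 0.4},
}

weights = weights_from_accuracy(accuracies)
combined = ensemble_scores(model_scores, weights)
top_topic = max(combined, key=combined.get)
print(weights, combined, top_topic)
```

Re-running `weights_from_accuracy` after each evaluation round keeps the weights summing to 1 while shifting influence toward the models that are currently most accurate.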

FIGS. 6 and 7 show examples of the results of providing social topics extracted by a social topic extraction system according to an embodiment of the present invention.

Referring to FIGS. 1, 6, and 7 together, the social topic extraction system 1 may receive, via the user interface 100, a request made by the user 10 through the network 50, and provide the social topics extracted according to the request to the user 10 through the user interface 100.

For example, when the user 10 enters the request 'economy', social topics extracted in relation to the economy, such as 'enterprise', 'entrepreneur', 'fund', 'stock price', 'tax', and 'exchange rate', may be provided to the user 10 as shown in FIG. 6.

In some embodiments, when the user 10 selects 'enterprise' from the provided social topics, more detailed social topics extracted in relation to enterprises, such as 'Samsung', 'Hyundai', 'LG', 'SK', 'POSCO', 'Hanwha', 'GS', 'Kookmin Bank', 'Industrial Bank of Korea', 'top 100 companies', and 'CJ', may be provided to the user 10 as shown in FIG. 7.

When the user 10 selects a specific social topic, the social topic extraction system 1 may provide the user 10 with the social data corresponding to the selected social topic.

Embodiments of the present invention can be written as programs executable on a computer system. Such a program, read from a computer-readable recording medium on which it is stored, can be executed on a digital computer system.

Examples of computer-readable recording media include ROM, RAM, CD-ROM, DVD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the present invention pertains.

1: social topic extraction system, 10: user, 20: social network, 50: network, 100: user interface, 200: model building subsystem, 210: input variable selection module, 220: extraction model building module, 230: model optimization module, 300: topic extraction subsystem, 310: semantic association extraction module, 320: topic association extraction module, 330: topic prediction module, 510: social data DB, 520: topic extraction model DB, 530: topic extraction result DB, 600: language dictionary information

Claims

A social topic extraction system comprising:
a model building subsystem that generates a topic extraction model based on social data collected from a social network through a network;
a user interface that receives a user's request through the network and provides the user with a social topic extracted according to the user's request;
a topic extraction subsystem that receives the user's request from the user interface and uses the topic extraction model to extract, from the collected social data, the social topic according to the user's request; and
a model feedback module that performs an accuracy evaluation of the extracted social topic, aggregates the user's feedback on the extracted social topic, and requests the model building subsystem to optimize or regenerate the topic extraction model,
wherein the topic extraction subsystem comprises:
a semantic association extraction module that extracts a center word list by calculating distance values between the words of each item of the collected social data, and determines the semantic association of the collected social data by judging the similarity between the center word lists of the items of the collected social data;
a topic association extraction module that analyzes the topic association of the social data determined to have semantic association, by calculating distance values between the items of social data determined to have semantic association among the collected social data; and
a topic prediction module that extracts the social topic according to the user's request by analyzing the analyzed semantic association and the analyzed topic association.

The social topic extraction system according to claim 1,
wherein the model building subsystem comprises:
an input variable selection module that selects input variables by performing principal component analysis using a genetic algorithm on the collected social data, converting the collected social data, which is high-dimensional data, into low-dimensional data having meaning and subject as a first principal component and a second principal component;
an extraction model building module that builds the topic extraction model based on the selected input variables by measuring the accuracy of the semantic association, for which similarity is determined by calculating distance values between the words of each item of the collected social data, and of the topic association, which is analyzed by calculating distance values between the items of the collected social data; and
a model optimization module that trains the topic extraction model using a training data set selected from the collected social data, and evaluates it using a test data set selected from the collected social data other than the training data set.

delete

The social topic extraction system according to claim 2,
wherein the extraction model building module has a learning layer consisting of Level 1, Level 2, and Level 3,
at Level 1, a plurality of statistical analysis techniques are performed as time series analysis and weights are applied to the results,
Level 2 receives the results of Level 1, performs a plurality of machine learning techniques, and applies weights to the results, and
Level 3 builds an ensemble model by integrating the weighted results of Level 2.

The social topic extraction system according to claim 5,
wherein the model optimization module comprises:
an input variable optimization unit that optimizes the input variables; and
a parameter optimization unit that optimizes the parameters of the topic extraction model.

The social topic extraction system according to claim 6,
wherein the parameter optimization unit optimizes model parameters, which are the parameters learned by the plurality of machine learning techniques, or hyperparameters, which are the tuning parameters of the plurality of machine learning techniques.

delete

The social topic extraction system according to claim 1,
wherein the model feedback module requests the model building subsystem to optimize or regenerate the topic extraction model when the accuracy evaluation result obtained by analyzing the user's feedback on the extracted social topic is lower than the accuracy evaluation result for the extracted social topic.

The social topic extraction system according to claim 1,
wherein the model feedback module requests the model building subsystem to optimize or regenerate the topic extraction model when the accuracy evaluation result obtained by analyzing the user's feedback on the extracted social topic is lower than at least one of the accuracy predicted when the topic extraction model was built and the accuracy evaluation result for the extracted social topic in the topic extraction subsystem.