KR20140133185A

KR20140133185A - Method of predicting a stock price through an analysis of a social data and system applying the same

Info

Publication number: KR20140133185A
Application number: KR20130052919A
Authority: KR
Inventors: 김영대; 고경훈; 이동진
Original assignee: 주식회사 코스콤
Priority date: 2013-05-10
Filing date: 2013-05-10
Publication date: 2014-11-19

Abstract

Provided are a method of predicting stock prices by analyzing social data and a system using the same. The method includes: collecting a plurality of documents related to at least one individual stock from social media data and stock-market-related web data; performing morphological analysis on the documents; evaluating the sentiment of the documents as a whole by rating each keyword extracted from the analyzed morphemes as positive or negative, and analyzing the data of the documents as a whole; and predicting the stock price of the individual stock by reflecting the sentiment evaluation data for the documents as a whole.

Description

Method of predicting a stock price through an analysis of social data and system applying the same

The present invention relates to a stock price prediction method and a system using the same, and more particularly to a stock price prediction method and system based on the evaluation and analysis of sentiment in social data.

Because of the stock market's uniquely complex pricing mechanism, movements in stock prices often cannot be explained by changes in market fundamentals. Prices can be seen to move sharply even when fundamentals have not clearly changed, and in such cases the emergence of new news is often an important cause of the price movement. This is because news contains descriptions of various phenomena occurring in the real world, as well as information about what changes are likely to occur and unfold in politics, the economy, society, companies, and so on. News and stock prices are therefore closely related, and news allows market participants to predict the volatility of the stock market at least in part.

Meanwhile, in addition to the news provided by securities firms, media companies, and the like, the rapid development of mobile devices has recently led to explosive growth in information provided through social media data, such as Twitter, personal stock-market blogs, Facebook, and the social data services of various portal sites. Such data circulates among market participants in far greater volume than news, and is referred to as big data.

Because social media data is written from individuals' subjective viewpoints, it is less reliable than news. However, because social media data is available at big-data scale, it has become possible to derive from it, with a considerable degree of objectivity, how market participants are reacting to the stock market and in particular to individual stocks, and even to obtain a plausible outlook for individual stocks.

However, the fundamental factors that affect stock prices are extremely diverse and complex, and a cycle can arise in which these factors affect social media data, news, and stock prices, and social media data and the like in turn affect stock prices.

Ultimately, social media data can be both a factor that influences stock prices and a leading indicator that foreshadows their movement. However, countless news items appear and disappear every day, so it is nearly impossible to analyze them one by one to determine their effect on stock prices.

Moreover, many types of social media data and news are produced in real time, ranging from macro-level policy and outlook news to daily market conditions, earnings, and corporate news, and it is not easy to determine clearly whether their content is positive or negative for the market. In addition, social media data and news by their nature often present both positive and negative opinions on the stock market in a somewhat neutral tone, so identifying the underlying intent is not simple either, and there is a risk that the interpretation will vary with the subjectivity of each person analyzing the news.

For this reason, existing studies have also focused on specific, easily identifiable events and news and analyzed how stock prices responded to them, or have worked backward from large price movements to determine whether news that caused them existed. However, news and the like consist mostly of text with no fixed form or attributes, and countless news items are produced every day.

Accordingly, various methods have recently been attempted to extract meaningful information by analyzing big data such as personalized media data, including news.

The technical problem to be solved by the present invention is to provide a stock price prediction method based on the analysis of social data, and a system therefor, that predict the price of an individual stock more reliably by reflecting sentiment evaluation data for a large volume of data including social data and news.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

According to one aspect of the present invention for achieving the above technical problem, a method of predicting stock prices by analyzing social data includes: collecting a plurality of documents related to at least one individual stock from social media data and stock-market-related web data; performing morphological analysis on the plurality of documents; evaluating the sentiment of the plurality of documents as a whole by rating each keyword extracted from the analyzed morphemes as either positive or negative, thereby analyzing the data of the plurality of documents as a whole; and predicting the stock price of the individual stock by reflecting the evaluation data related to the sentiment of the plurality of documents as a whole.

Specific details of other embodiments are included in the detailed description and the drawings.

According to the present invention, by reflecting sentiment evaluation data for a large volume of data including social data and news, the market mood and information regarding an individual stock can be extracted more objectively and meaningfully from the diverse views of market participants, so that the price of the individual stock can be predicted more reliably.

In addition, according to the present invention, the past correlation between stock market indicator data and social data is analyzed for each individual stock to obtain the collection period of the social data that affects the stock and the delay period, that is, the time at which the sentiment evaluation data of the social data actually affects the stock. Using both in prediction allows the stock price to be predicted more accurately.

FIG. 1 is a configuration diagram of a stock price prediction system according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of the keyword database.
FIG. 3 is a configuration diagram of the document storage unit.
FIG. 4 is a configuration diagram of the data analysis unit.
FIG. 5 is a configuration diagram of the sentiment dictionary database.
FIG. 6 is a configuration diagram of the correlation analysis/determination unit.
FIG. 7 is a flowchart of a stock price prediction method according to another embodiment of the present invention.
FIG. 8 is a flowchart showing the process of determining the collection period and delay period of the evaluation data and selecting the evaluation data.
FIG. 9 shows, on the display unit, the selection of keywords and social media data and the resulting keyword status.
FIG. 10 shows, on the display unit, the collection status of main keywords and sub-keywords.
FIG. 11 shows, on the display unit, the collection status of a specific keyword.
FIG. 12 shows, on the display unit, the results of the sentiment-related evaluation data for social media data and news related to an individual stock, and the correlation between the evaluation data index and the price of the individual stock.
FIG. 13 shows, on the display unit, the result of predicting the price of an individual stock by reflecting the evaluation data.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings and the description below. However, the present invention is not limited to the embodiments described here and may be embodied in other forms. Rather, the embodiments introduced here are provided so that the disclosure will be thorough and complete and will fully convey the spirit of the invention to those skilled in the art. Like reference numerals denote like elements throughout the specification. The terminology used in this specification is for describing the embodiments and is not intended to limit the invention. In this specification, the singular form also includes the plural unless specifically stated otherwise. As used in the specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components, steps, operations, and/or elements in addition to those stated.

Hereinafter, a stock price prediction system based on the analysis of social media according to an embodiment of the present invention is described in detail with reference to FIGS. 1 to 6. FIG. 1 is a configuration diagram of the stock price prediction system according to an embodiment of the present invention. FIG. 2 is a configuration diagram of the keyword database, and FIG. 3 is a configuration diagram of the document storage unit. FIG. 4 is a configuration diagram of the data analysis unit, FIG. 5 is a configuration diagram of the sentiment dictionary database, and FIG. 6 is a configuration diagram of the correlation analysis/determination unit.

The stock price prediction system 100 includes: a document collection/extraction unit 110 that collects a large number of documents from social media data 10 and stock-market-related web data 20; a document storage unit 130 that stores the collected documents by individual company; a morphological analysis unit 140 that performs morphological analysis, for each company, on the expressions or sentences contained in the documents; and a data analysis unit 150 that evaluates the sentiment of the documents as a whole by rating each keyword extracted from the analyzed morphemes as either positive or negative, and analyzes the data of the documents as a whole. The stock price prediction system 100 may further include: a correlation analysis/determination unit 170 that generates, together with the sentiment-related evaluation data selected from the accumulated sentiment evaluation data under predetermined conditions, analysis data from the correlation between stock market indicator data and economic indicator data; a stock price prediction unit 180 that predicts the price of an individual stock based on the selected evaluation data and the analysis data; and a display unit 190 that displays the prediction results derived by the stock price prediction unit 180.

The document collection/extraction unit 110 collects a large number of documents related to at least one individual stock from the social media data 10 and the stock-market-related web data 20, and receives the stock market indicator data 30. Here, an individual stock is a company listed on the stock market, and the collected documents may take the form of at least one of HTML, PDF (Portable Document Format), images, and video.

The social media data 10 is media data entered through a desktop computer or mobile device connected to a network such as the Internet, and is data that can be shared with other users connected to the network. For example, the social media data 10 may be social media sites 12 operated by social media servers and blog sites 14 that are operated by various portal sites and the like and contain personalized content. The social media sites 12 are so-called SNSs, and may be Twitter, Facebook, or social media services offered by various portal sites.

The stock-market-related web data 20 is web data provided by media companies, terrestrial broadcasters, cable broadcasters, portal site news, financial companies, stock-market-related institutions, and the like, and is more specialized or authoritative than the social media data 10. The stock-market-related web data 20 may be stock market news sites 22 served by media companies, broadcasters, and portal site news; financial company portal sites 24 through which financial companies such as banks, securities firms, and insurers provide stock-market-related services; and stock market statistics sites 26 through which public or private stock-market-related institutions provide analysis information related to the stock market.

The stock market indicator data 30 is stock information for each individual stock listed on the stock market, and may include, for example, the opening price, high, low, closing price, quotes, execution status, trading volume, trading value, trading member, upper price limit, lower price limit, new high, and new low.

When collecting a large number of documents from the social media data 10 and the stock-market-related web data 20, the document collection/extraction unit 110 does not collect every document; instead, it refers to the keyword database 120 and collects documents related to at least one individual stock.

The keyword database 120 may include keyword groups categorized for each company corresponding to an individual stock. Specifically, as shown in FIG. 2, it may store a main keyword 122 related to the company name of the individual stock, together with sub-keywords including product/service-related keywords 124 concerning the products and services the company offers, person-related keywords 126 concerning the company's management and the like, and company-situation-related keywords 128 concerning words and contexts that may affect the individual stock. Sub-keywords are words, contexts, and the like specific to a company, and may be classified and categorized per company.

To give an example, the main keyword 122 may be the company name of an individual stock listed on the stock market, such as Samsung Electronics, LG Electronics, or KT. For Samsung Electronics, the product/service-related keywords 124 may be "Galaxy", "smartphone", "Hauzen", "tablet", "app market", and so on; the person-related keywords 126 may be the key executives of Samsung Electronics, the executives of companies that do business with Samsung Electronics, and so on; and the company-situation-related keywords 128 are words that may affect Samsung Electronics' stock price, and may include a variety of words such as "record high", "earnings", "strong", "Apple", "complaint", and "deterioration".
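The keyword groups of FIG. 2 could be represented roughly as follows. This is a minimal sketch, not the patent's implementation: the class name `CompanyKeywords`, its field names, and the `keyword_db` dictionary are assumptions made for illustration, and the person-related set is left empty because the source names no individuals.

```python
from dataclasses import dataclass, field


@dataclass
class CompanyKeywords:
    """Keyword group for one listed company (hypothetical layout)."""
    main: str                                                # main keyword 122
    product_service: set[str] = field(default_factory=set)   # keywords 124
    people: set[str] = field(default_factory=set)            # keywords 126
    situation: set[str] = field(default_factory=set)         # keywords 128

    def sub_keywords(self) -> set[str]:
        """All sub-keywords, regardless of category."""
        return self.product_service | self.people | self.situation


# Example entries taken from the Samsung Electronics illustration in the text.
keyword_db = {
    "Samsung Electronics": CompanyKeywords(
        main="Samsung Electronics",
        product_service={"Galaxy", "smartphone", "Hauzen", "tablet", "app market"},
        situation={"record high", "earnings", "strong", "Apple",
                   "complaint", "deterioration"},
    ),
}
```

Keeping one record per company mirrors the per-company categorization described above and makes it straightforward to look up every sub-keyword that belongs to a given stock.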

The document collection/extraction unit 110 extracts, from the collected documents, those whose expressions contain the main keyword 122, the product/service-related keywords 124, and the person-related keywords 126 described above, so that document data suitable for sentiment evaluation can be selected efficiently.
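The extraction step can be sketched as a simple keyword filter. The function name `select_documents` is hypothetical, and the sketch assumes that a document qualifies when it mentions at least one collection keyword and that matching is case-insensitive; the source does not spell out either rule.

```python
def select_documents(documents, collection_keywords):
    """Keep documents whose text mentions at least one collection keyword."""
    lowered = [keyword.lower() for keyword in collection_keywords]
    return [
        doc for doc in documents
        if any(keyword in doc.lower() for keyword in lowered)
    ]


docs = [
    "Samsung Electronics posts record earnings",
    "Weather forecast for Seoul this weekend",
    "New Galaxy tablet reviewed",
]
selected = select_documents(docs, {"Samsung Electronics", "Galaxy"})
```

Here the weather item is dropped, while the two documents that mention a main keyword or a product keyword are kept for sentiment evaluation.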

The document storage unit 130 may store the extracted documents in a form suitable for morphological analysis. For example, as shown in FIG. 3, the extracted documents may be stored in a distributed manner for each individual stock group 131 by format, that is, as html 132, pdf 133, image 134, video 135, and so on.
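One way to picture the layout of FIG. 3 is a two-level grouping, first by stock and then by format. The sketch below assumes that the format can be read from the file suffix; the suffix table and the function name `group_by_stock_and_format` are illustrative and do not come from the source.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to the format groups of FIG. 3.
FORMAT_BY_SUFFIX = {
    ".html": "html", ".htm": "html", ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".mp4": "video",
}


def group_by_stock_and_format(documents):
    """Group (stock_name, filename) pairs as storage[stock][format]."""
    storage = defaultdict(lambda: defaultdict(list))
    for stock, filename in documents:
        suffix = PurePosixPath(filename).suffix.lower()
        storage[stock][FORMAT_BY_SUFFIX.get(suffix, "other")].append(filename)
    return storage


storage = group_by_stock_and_format([
    ("Samsung Electronics", "earnings_report.pdf"),
    ("Samsung Electronics", "blog_post.html"),
    ("LG Electronics", "chart.PNG"),
])
```

Grouping by format up front lets each format be handed to its own preprocessing path, which matches the per-format parallel processing the morphological analysis unit is described as performing.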

As preprocessing for sentiment evaluation, the morphological analysis unit 140 analyzes the stored documents, in each format, into morphemes, the smallest meaningful units of language, and identifies the part of speech of each. In doing so, the morphological analysis unit 140 may apply processing suited to each format shown in FIG. 3 and perform morphological analysis on the formats in parallel.

In addition, the morphological analysis unit 140 may divide the sentences, contexts, and the like in the expressions contained in a document into word units (eojeol) and parse the keywords adjacent to keywords related to an individual stock. For example, when a person's blog contains both sentences about Samsung Electronics and sentences about LG Electronics, the morphological analysis unit 140 divides the blog text into word units, taking into account sentence structure, connective structure, syntax, and so on; then searches for keywords such as the names, products/services, and people of Samsung Electronics or LG Electronics; parses the words and phrases adjacent to them; and stores them classified as keywords for Samsung Electronics and for LG Electronics.
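The attribution of adjacent words to each company can be illustrated with a fixed-size token window. This is a deliberately simplified stand-in: it uses whitespace-style tokenization of English text instead of Korean morphological analysis, a hypothetical `alias_to_company` map, and a plain distance window rather than the sentence and connective structure the text describes.

```python
import re
from collections import defaultdict


def keywords_near_companies(text, alias_to_company, window=2):
    """Group tokens found within `window` positions of a company alias."""
    tokens = re.findall(r"\w+", text.lower())
    grouped = defaultdict(list)
    for i, token in enumerate(tokens):
        company = alias_to_company.get(token)
        if company is None:
            continue
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            # Skip the alias itself and any other company alias in the window.
            if j != i and tokens[j] not in alias_to_company:
                grouped[company].append(tokens[j])
    return dict(grouped)


aliases = {"samsung": "Samsung Electronics", "lg": "LG Electronics"}
text = "Samsung earnings rise sharply while LG sales fall"
result = keywords_near_companies(text, aliases, window=2)
```

In this example "earnings" and "rise" are attributed to Samsung Electronics and "sales" and "fall" to LG Electronics, but "sharply" and "while" also land in the LG group. That spillover is exactly why a real system would lean on sentence and connective structure, as the text describes, rather than on distance alone.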

Referring to FIG. 4, the data analysis unit 150 may include a data sentiment evaluation unit 152, which evaluates the sentiment of the documents as a whole by rating each keyword processed by the morphological analysis unit 140 as either positive or negative, and a keyword analysis unit 154, which performs statistical processing on the keywords processed by the morphological analysis unit 140.

The data sentiment evaluation unit 152 refers to the sentiment dictionary database 160, which stores for each keyword a rating of positive, neutral, or negative and a score associated with that rating, and rates each keyword extracted by the morphological analysis unit 140 as positive, neutral, or negative and scores it. The scoring algorithm may be a naive Bayes algorithm, a simple voter algorithm, KNN (k-nearest neighbors), or an SVM (support vector machine). Taking the simple voter algorithm as an example, the sentiment dictionary database 160 may store keywords in table form for each of the positive, neutral, and negative ratings, as shown in FIG. 5. Most keywords relevant to sentiment evaluation may be nouns and adjectives. For example, the positive table 162 contains keywords such as "rise", "record high", and "go up", and each keyword is assigned a score of "1". The negative table 166 contains keywords such as "recession", "go down", and "complaint", and each keyword is assigned a score of "-1". Keywords stored in the neutral table 164 are assigned a score of "0". The scores shown in FIG. 5 are illustrated as merely distinguishing positive from negative; alternatively, the scores associated with positive or negative ratings may differ from keyword to keyword, with each keyword weighted according to the strength of the sentiment that market participants feel toward it.
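The dictionary lookup described for the simple voter case can be sketched as below. The lexicon entries are English renderings of the FIG. 5 examples, the neutral entry "announce" is a hypothetical placeholder, and the names `SENTIMENT_LEXICON`, `rate_keywords`, and `sentiment_index` are assumptions; keywords missing from the dictionary are simply skipped here, which the source does not specify.

```python
# Hypothetical lexicon: +1 positive (table 162), 0 neutral (table 164),
# -1 negative (table 166). Weighted scores such as +2 would also work.
SENTIMENT_LEXICON = {
    "rise": 1, "record high": 1,
    "announce": 0,
    "recession": -1, "complaint": -1,
}
LABELS = {1: "positive", 0: "neutral", -1: "negative"}


def rate_keywords(keywords, lexicon=SENTIMENT_LEXICON):
    """Return (keyword, label, score) for every keyword found in the lexicon."""
    rated = []
    for keyword in keywords:
        if keyword not in lexicon:
            continue
        score = lexicon[keyword]
        sign = (score > 0) - (score < 0)  # maps weighted scores to a label
        rated.append((keyword, LABELS[sign], score))
    return rated


def sentiment_index(keywords, lexicon=SENTIMENT_LEXICON):
    """Sum of keyword scores, as in the simple voter scheme."""
    return sum(score for _, _, score in rate_keywords(keywords, lexicon))
```

For example, `sentiment_index(["rise", "record high", "recession"])` sums 1 + 1 - 1. Because the label is derived from the sign of the score, the same code also handles the weighted variant in which keywords carry different magnitudes.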

The data sentiment evaluation unit 152 may sum the scores assigned to the keywords rated positive, neutral, or negative using the sentiment dictionary database 160 to calculate sentiment-related evaluation data, such as a sentiment index, for the documents as a whole. Here, the data sentiment evaluation unit 152 performs sentiment evaluation on the keywords of all documents but does not go on to rate each document as positive, neutral, or negative. If sentiment were evaluated per document to capture each document's tone, two documents could receive the same negative rating even though one contains far more negatively rated keywords than the other. The ratio of positive to negative factors for an individual stock across all the documents extracted from the social media data 10 and the stock-market-related web data 20 could then be distorted. In this embodiment, therefore, sentiment evaluation is performed on the keywords obtained by morphological analysis of all the documents without grouping them by document, which prevents this distortion.
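A small worked example makes the distortion concrete. The three documents and their keyword scores below are invented for illustration; the comparison contrasts a per-document vote with the pooled keyword sum that this embodiment uses.

```python
# Keyword scores per document (invented data): doc_a is heavily negative,
# doc_b mildly negative, doc_c mildly positive.
doc_scores = {
    "doc_a": [-1, -1, -1, -1, -1],
    "doc_b": [-1],
    "doc_c": [1, 1],
}


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


# Per-document rating: each document collapses to a single vote.
per_document_index = sum(sign(sum(scores)) for scores in doc_scores.values())

# Pooled rating: every keyword counts, with no grouping by document.
pooled_index = sum(score for scores in doc_scores.values() for score in scores)
```

With per-document voting, `doc_a` and `doc_b` each contribute one negative vote and the result is only mildly negative, whereas the pooled sum keeps all six negative keywords and is far more negative. That difference is the information the per-document approach throws away.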

The keyword analysis unit 154 may perform statistical analysis on the keywords analyzed by the morphological analysis unit 140, such as the number of occurrences collected per period and correlation analysis between keywords, and provide the results to the display unit 190. The keyword analysis unit 154 may also identify, among the analyzed keywords, those not registered in the keyword database 120; the newly identified keywords are added to the keyword database 120, which can improve the accuracy of document collection by the document collection/extraction unit 110. An administrator may store in the sentiment dictionary database 160 those new keywords that should be reflected in sentiment evaluation.
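Two of the tasks attributed to the keyword analysis unit, counting keywords per period and spotting unregistered keywords, can be sketched with standard-library containers. The function names, the daily granularity, and the sample observations (including the keyword "recall") are assumptions for illustration.

```python
from collections import Counter, defaultdict


def counts_by_day(observations):
    """Count keyword occurrences per day from (day, keyword) pairs."""
    daily = defaultdict(Counter)
    for day, keyword in observations:
        daily[day][keyword] += 1
    return daily


def unregistered_keywords(analyzed, registered):
    """Keywords seen in analysis but absent from the keyword database."""
    return sorted(set(analyzed) - set(registered))


observations = [
    ("2013-05-09", "Galaxy"),
    ("2013-05-10", "Galaxy"),
    ("2013-05-10", "Galaxy"),
    ("2013-05-10", "recall"),
]
daily = counts_by_day(observations)
new_keywords = unregistered_keywords(
    (keyword for _, keyword in observations),
    registered={"Galaxy", "smartphone"},
)
```

The per-day counters could feed a collection-status display, and `new_keywords` lists the candidates that would be added to the keyword database or offered to an administrator for the sentiment dictionary.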

Meanwhile, the correlation analysis/determination unit 170 may generate, together with the sentiment-related evaluation data selected from the accumulated sentiment evaluation data under predetermined conditions, analysis data from the correlation between stock market indicator data and economic indicator data. Referring to FIG. 6, the correlation analysis/determination unit 170 may include an evaluation data storage unit 171, a first correlation table unit 172, an evaluation data collection period determination unit 173, an evaluation data selection unit 174, a delay period determination unit 175, an economic indicator database 176, and a second correlation table unit 177.

The evaluation data storage unit 171 may accumulate, day by day, sentiment-related evaluation data such as the sentiment index for each individual stock. This evaluation data is provided to the first correlation table unit 172, which analyzes its correlation with the externally supplied stock market indicator data 30 and records the resulting correlation between the past stock market indicator data 30 of each individual stock and the corresponding evaluation data.

The evaluation data collection period determination unit 173 determines, based on the past correlations stored in the first correlation table unit 172, the collection period of the evaluation data that affects the price of an individual stock. The evaluation data selection unit 174 may select, from the sentiment evaluation data accumulated in the evaluation data storage unit 171, the evaluation data that falls within the collection period and provide it to the stock price prediction unit 180.

The delay period determination unit 175 determines, based on the past correlations in the first correlation table unit 172, the delay period that elapses before the sentiment-related evaluation data is reflected in the price of an individual stock, and provides the delay period to the stock price prediction unit 180 when the price of the individual stock is predicted, so that the price after the delay period can be predicted.

Providing the collection period and the delay period to the stock price prediction unit 180 in this way makes it possible to use more relevant sentiment evaluation data and to specify more precisely the time to which the price prediction applies.
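The source states only that the collection period and the delay period are determined from past correlations; it does not give the procedure. One possible reading, sketched here purely as an assumption, is a grid search: sum the daily sentiment index over a candidate window, correlate it with the price change a candidate number of days later, and keep the window and lag with the strongest Pearson correlation. The function names and the sample series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_window_and_lag(sentiment, returns, windows, lags):
    """Return (window, lag, r) with the largest absolute correlation."""
    best = None
    for window in windows:
        for lag in lags:
            xs, ys = [], []
            for t in range(window - 1, len(sentiment) - lag):
                xs.append(sum(sentiment[t - window + 1 : t + 1]))  # collection period
                ys.append(returns[t + lag])                        # delay period
            if len(xs) < 3:
                continue
            r = pearson(xs, ys)
            if best is None or abs(r) > abs(best[2]):
                best = (window, lag, r)
    return best


# Synthetic series in which the price change simply echoes sentiment two days later.
sentiment = [1, -2, 3, 0, -1, 4, -3, 2, 5, -4, 1, 0, 2, -1, 3, -2]
returns = [0, 0] + sentiment[:-2]
window, lag, r = best_window_and_lag(sentiment, returns, windows=[1, 2, 3], lags=[0, 1, 2, 3])
```

On this synthetic data the search recovers a one-day window and a two-day lag, because the series was built that way. Real market data would give much weaker correlations and would call for out-of-sample checks before trusting any selected window or lag.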

In addition, the second correlation table unit 177 may provide the stock price prediction unit 180 with analysis data derived from the correlation between the stock market indicator data 30 and the economic indicator data, related to macroeconomic indices, accumulated in the economic indicator database 176. Here, the economic indicator data are indicators that affect essentially all individual stocks in common, for example interest rates, exchange rates, expected growth rates, price indices, and the balance of payments.

Referring again to FIG. 1, the stock price prediction unit 180 may predict the price of an individual stock based on the sentiment-related evaluation data selected by the correlation analysis/determination unit 170, the delay period, and the analysis data generated by the second correlation table unit 177. The prediction rests on a time-series analysis of the stock market indicator data 30 and the economic indicator data. The evaluation data obtained from the social media data 10 and from the news in the stock market related web data 20 may be combined as a term that corrects the predicted price produced by that time-series analysis. To further improve accuracy, a weight calculated from the correlations in the first correlation table unit 172 may be applied to the sentiment-related evaluation data, so that the weighted evaluation data is reflected in the prediction. The predicted price of the individual stock calculated by the stock price prediction unit 180, together with its statistics, is displayed on the display unit 190.
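By way of example only, the combination of a time-series forecast with a weighted sentiment correction term might look like the following Python sketch. The specification names neither the time-series model nor the weighting formula, so the moving-average baseline, the `scale` factor, and the function names below are hypothetical stand-ins.

```python
def baseline_forecast(closes, window=3):
    """Naive time-series forecast: the mean of the last `window` closing prices.

    This is only a placeholder for whatever time-series analysis is actually used.
    """
    recent = closes[-window:]
    return sum(recent) / len(recent)


def predict_close(closes, sentiment_index, correlation, scale=1.0, window=3):
    """Baseline forecast plus a sentiment correction term.

    The weight is assumed to be the past sentiment/price correlation multiplied
    by a price scale (currency units per sentiment point).
    """
    weight = correlation * scale
    return baseline_forecast(closes, window) + weight * sentiment_index
```

With closing prices of 100, 102, and 104, a sentiment index of 10, a correlation of 0.5, and a scale of 2.0, the sketch adds a correction of 10 to the baseline of 102.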

According to the present invention, reflecting sentiment-related evaluation data for a large volume of data, including social data and news, allows the market mood and information about an individual stock to be extracted more objectively and meaningfully from the varied views of market participants, so that the price of the individual stock can be predicted more reliably. In particular, a prediction based on sentiment evaluation of social media data together with news analysis is more accurate and reliable than a prediction based only on analysis of the news produced in the stock market related web data 20, because social media data is produced in far greater volume than news and the analysis is therefore statistically closer to the population.

Hereinafter, a stock price prediction method according to another embodiment of the present invention will be described in detail with reference to FIG. 1 and FIGS. 7 to 13.

FIG. 7 is a flowchart of a stock price prediction method according to another embodiment of the present invention, and FIG. 8 is a flowchart showing the determination of the collection period and the delay period of the evaluation data and the selection of the evaluation data.

FIG. 9 is a view showing, on the display unit, the selection of a keyword and of social media data and the resulting keyword status. FIG. 10 is a view showing, on the display unit, the collection status of a main keyword and a sub-keyword. FIG. 11 is a view showing, on the display unit, the collection status of a specific keyword.

FIG. 12 is a view showing, on the display unit, the results of the sentiment-related evaluation data for the social media data and news related to an individual stock, and the correlation between the index of the evaluation data and the price of the individual stock. FIG. 13 is a view showing, on the display unit, the result of predicting the price of an individual stock with the evaluation data reflected.

The document collection/extraction unit 110 collects, from the social media data 10 and the stock market related web data 20, a large number of documents related to at least one individual stock, in the form of at least one of HTML, PDF, image, and video, and receives the stock market indicator data 30 (S710).

Here, the social media data 10 is what is commonly called SNS and may come from social media sites 12, such as Twitter, Facebook, and the social media services offered by various portal sites, and from blog sites 14 that are operated by various portal sites and the like and contain personalized content. The stock market related web data 20 may come from stock market news sites 22 served by press companies, broadcasters, and portal site news; financial company portal sites 24 through which financial companies such as banks, securities firms, and insurers provide stock market related services; and stock market statistics sites 26 through which public or private stock market institutions provide analysis information related to the stock market.

Next, the document collection/extraction unit 110 collects documents related to at least one individual stock by referring to the keyword database 120, and the document storage unit 130 may store the extracted documents in a form suitable for morphological analysis (S720). From the expressions contained in the collected documents, the document collection/extraction unit 110 extracts the documents that contain the main keyword 122, the product/service-related keywords 124, and the personnel-related keywords 126 among the keywords stored in the keyword database 120 shown in FIG. 2, and can thereby efficiently select document data suitable for sentiment evaluation.
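For illustration, the keyword-based extraction of step S720 might be sketched in Python as below. The `KeywordEntry` structure and the rule that a document qualifies when it mentions any one registered keyword are assumptions of this sketch; the specification does not fix whether one or all keyword types must be present.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordEntry:
    """Hypothetical record of the keyword database for one individual stock."""
    main: str                                            # e.g. the company name
    product_service: list = field(default_factory=list)  # product/service keywords
    personnel: list = field(default_factory=list)        # personnel-related keywords

    def all_keywords(self):
        return [self.main, *self.product_service, *self.personnel]


def extract_documents(documents, entry):
    """Keep the documents whose text mentions at least one registered keyword."""
    keywords = [k.lower() for k in entry.all_keywords()]
    return [doc for doc in documents
            if any(k in doc.lower() for k in keywords)]
```

A plain substring match is used here for brevity; a production system would more likely match against the output of a morphological analyzer.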

In addition, as shown for example in FIG. 3, the document storage unit 130 may store the extracted documents of each individual stock group 131 in a distributed manner by format, that is, as HTML 132, PDF 133, image 134, video 135, and so on.
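The per-group, per-format storage described above could be modeled, purely as a sketch, with a nested mapping in Python. The file-extension table `FORMATS` and the `(stock_group, filename)` input shape are assumptions introduced here, not details taken from FIG. 3.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed mapping from file extension to the storage format of the document.
FORMATS = {".html": "html", ".htm": "html", ".pdf": "pdf",
           ".png": "image", ".jpg": "image", ".mp4": "video"}


def store_by_format(items):
    """Group (stock_group, filename) pairs as {stock_group: {format: [filenames]}}."""
    store = defaultdict(lambda: defaultdict(list))
    for group, filename in items:
        suffix = PurePosixPath(filename).suffix.lower()
        store[group][FORMATS.get(suffix, "other")].append(filename)
    return store
```

Keeping each format in its own bucket is what would allow the later morphological analysis to process the formats in parallel.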

Next, as preprocessing that puts the data into a form suitable for sentiment evaluation, the morphological analysis unit 140 analyzes the morphemes of the stored documents in their respective formats (S730). Here, the morphological analysis unit 140 may apply processing suited to each format shown in FIG. 3 and carry out the morphological analysis of the formats in parallel. The morphological analysis unit 140 may also segment the sentences, contexts, and the like in the expressions contained in a document into word (eojeol) units, and parse the keywords adjacent to the keywords related to the individual stock. A detailed description is omitted here because it was given above for the morphological analysis unit 140 of the stock price prediction system 100.
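As a rough sketch of parsing the keywords adjacent to a stock-related keyword, the following Python function collects the tokens within a fixed window around each occurrence of a target keyword. Whitespace splitting stands in for a real Korean morphological analyzer, and the `window` parameter is an assumption; the specification does not define how far "adjacent" extends.

```python
def adjacent_keywords(text, target, window=2):
    """Return the tokens within `window` positions of each occurrence of `target`.

    Tokens are obtained by whitespace splitting, a simplification of
    word-unit (eojeol) segmentation.
    """
    tokens = text.split()
    neighbours = []
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            neighbours.extend(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return neighbours
```

For the sentence "analysts expect Samsung profit to rise sharply" and the target "Samsung", the sketch returns the two tokens on either side of the target.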

Next, the data sentiment evaluation unit 152 of the data analysis unit 150 refers to the sentiment dictionary database 160 shown in FIG. 5 and evaluates each keyword processed by the morphological analysis unit 140 as either positive or negative, thereby evaluating the sentiment of the plurality of documents as a whole (S740).

More specifically, the data analysis unit 150 refers to the sentiment dictionary database 160, which stores for each keyword an evaluation of positive, neutral, or negative and a score linked to that evaluation, and rates each keyword extracted by the morphological analysis unit 140 as positive, neutral, or negative while also scoring it. The data sentiment evaluation unit 152 may then sum the scores assigned to the keywords classified as positive, neutral, or negative by the sentiment dictionary database 160 to calculate sentiment-related evaluation data, such as a sentiment index, for the plurality of documents as a whole. Examples of such evaluation data are the sentiment score 220 of the social media data 10 and the sentiment score 222 of the stock market related web data 20 shown in FIG. 12.
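The dictionary-based scoring described above can be sketched in Python as follows. The entries and score values in `SENTIMENT_DICT` are invented for illustration and are not taken from the sentiment dictionary database 160; only the lookup-and-sum procedure follows the description.

```python
# Hypothetical sentiment dictionary: keyword -> (evaluation, score).
SENTIMENT_DICT = {
    "surge": ("positive", 2),
    "profit": ("positive", 1),
    "flat": ("neutral", 0),
    "recall": ("negative", -2),
    "loss": ("negative", -1),
}


def score_documents(keyword_lists, lexicon=SENTIMENT_DICT):
    """Sum dictionary scores over the keywords of all documents.

    `keyword_lists` holds one list of extracted keywords per document.
    Returns (sentiment_index, counts per evaluation); keywords that are
    not in the dictionary are ignored.
    """
    index = 0
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for keywords in keyword_lists:
        for keyword in keywords:
            if keyword in lexicon:
                label, score = lexicon[keyword]
                counts[label] += 1
                index += score
    return index, counts
```

Running the sketch once on social media documents and once on news documents would yield two separate indices, comparable to the two sentiment scores shown in FIG. 12.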

Meanwhile, while the data sentiment evaluation unit 152 carries out the sentiment evaluation, the keyword analysis unit 154 may perform statistical analysis on the keywords analyzed by the morphological analysis unit 140, such as the number of items collected per period and correlation analysis between keywords, and provide the results to the display unit 190. The keyword analysis unit 154 also adds to the keyword database 120 any analyzed keyword not yet registered there, and an administrator may store in the sentiment dictionary database 160 those new keywords that are to be reflected in the sentiment evaluation.
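A minimal Python sketch of the per-period collection counts and of the registration of new keywords is given below. Modeling the keyword database as a plain `set` and the observations as `(date, keyword)` pairs is an assumption made for brevity.

```python
from collections import Counter


def daily_counts(observations):
    """Count occurrences per (date, keyword) pair.

    `observations` is an iterable of (date_string, keyword) tuples.
    """
    return Counter(observations)


def register_new_keywords(observed, keyword_db):
    """Add keywords missing from `keyword_db` (a set) and return them, sorted."""
    new_keywords = sorted(set(observed) - keyword_db)
    keyword_db.update(new_keywords)
    return new_keywords
```

The list returned by `register_new_keywords` is what an administrator could review before adding selected entries to the sentiment dictionary.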

Referring to FIGS. 9 to 11 in this regard, a user can select a company name 202 in the main keyword input field of the stock price prediction system 100 and visually check the keywords related to the individual stock, the morpheme statistics, and the predicted price of the individual stock. The user also selects the collected data 204, namely SNS as the social media data 10 and the stock market related web data 20. When the user selects the company name 202 and the type of collected data 204, the keyword database 120 may provide the display unit 190 with statistical analysis of the morphemes 206 analyzed by the morphological analysis unit 140, such as the number of items collected per period 208 and correlation analysis between keywords, for display. In addition, as shown in FIG. 10, the user can check through the keyword database 120 the daily collection count of "S Pen" 212, a product/service-related keyword that is a sub-keyword of the main keyword "Samsung Electronics" 210.

In addition, when a new keyword is obtained, the keyword database 120 updates its keywords (morphemes) as described above and, as shown in FIG. 11, can show the daily collection count of the new morpheme on the display unit 190 at the user's request.

Next, the correlation analysis/determination unit 170 may produce the sentiment evaluation data selected from the accumulated sentiment evaluation data under predetermined conditions, together with the analysis data derived from the correlation between the stock market indicator data and the economic indicator data (S750).

The selection of sentiment evaluation data under the predetermined conditions is described with reference to FIG. 8. The first correlation table unit 172 stores the past correlations between the stock market indicator data 30 and the sentiment-related evaluation data accumulated per individual stock and per day in the evaluation data storage unit 171. Based on the correlation analysis results of the first correlation table unit 172, the evaluation data collection period determination unit 173 determines the collection period of the evaluation data that affects the price of the individual stock (S752). The correlation analysis in the first correlation table unit 172 may use correlation data between the actual closing price 224 of the individual stock and the keyword index 226, as in the "stock price & keyword index" panel shown in FIG. 12.

Next, based on the past correlations in the first correlation table unit 172, the delay period determination unit 175 determines the delay period that elapses before the sentiment-related evaluation data is reflected in the price of the individual stock (S754).

The evaluation data selection unit 174 then selects, from the evaluation data accumulated in the evaluation data storage unit 171, the evaluation data that falls within the collection period (S756). The correlation analysis/determination unit 170 then provides the selected sentiment evaluation data and the delay period to the stock price prediction unit 180 (S758).
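The selection in step S756 might be sketched in Python as follows, assuming the accumulated evaluation data is held as a mapping from calendar date to daily sentiment index and that the collection period is expressed as a number of days ending on a reference date. Both the data layout and the function name are assumptions of this sketch.

```python
from datetime import date, timedelta


def select_evaluation_data(store, as_of, collection_days):
    """Return the (date, index) records of the last `collection_days` days,
    up to and including `as_of`, in chronological order.

    `store` maps datetime.date to the daily sentiment index.
    """
    start = as_of - timedelta(days=collection_days - 1)
    return [(day, value) for day, value in sorted(store.items())
            if start <= day <= as_of]
```

The selected records, together with the delay period from step S754, would then be handed to the prediction step.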

Referring again to FIG. 7, the stock price prediction unit 180 predicts the price of the individual stock based on the sentiment-related evaluation data selected by the correlation analysis/determination unit 170, the delay period, and the analysis data generated by the second correlation table unit 177 (S760). As shown in FIG. 13, the predicted price of the individual stock is the expected price after the delay period has elapsed from the last day of the past actual closing prices 228, and it may be displayed as a highest predicted closing price 232 and a lowest predicted closing price 234 within a predetermined error range around the predicted closing price 203.
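To illustrate how the displayed values relate to one another, the following Python sketch derives the target date and the highest and lowest predicted closing prices from a predicted closing price. The symmetric relative `error_rate` and the use of calendar days rather than trading days for the delay are assumptions; the specification only refers to a predetermined error range.

```python
from datetime import date, timedelta


def forecast_summary(last_close_date, delay_days, predicted_close, error_rate=0.03):
    """Return the forecast date and the predicted, highest, and lowest closes.

    The band is symmetric: predicted_close * (1 +/- error_rate).
    """
    margin = predicted_close * error_rate
    return {
        "date": last_close_date + timedelta(days=delay_days),
        "predicted": predicted_close,
        "highest": predicted_close + margin,
        "lowest": predicted_close - margin,
    }
```

A display component could plot the three values at the returned date next to the series of past actual closing prices.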

The components of the stock price prediction system 100 shown in FIG. 1, or the steps of the stock price prediction method shown in FIGS. 7 and 8, may be recorded on a computer-readable recording medium in the form of a program that implements their functions. Here, a computer-readable recording medium is a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and can be read by a computer. Examples of such media that are removable from the computer include flexible disks, magneto-optical disks, CD-ROMs, CD-R/Ws, DVDs, DATs, and memory cards. Examples of recording media fixed in the computer include hard disks and ROMs.

Although the present invention has been described in detail above through representative embodiments, those of ordinary skill in the art to which the invention pertains will understand that various modifications can be made to the embodiments described above without departing from the scope of the invention. The scope of the invention should therefore not be limited to the described embodiments, but should be determined by the claims below and by all changes or modifications derived from those claims and their equivalents.

100: stock price prediction system 110: document collection/extraction unit
120: keyword database 130: document storage unit
140: morphological analysis unit 150: data analysis unit
160: sentiment dictionary database 170: correlation analysis/determination unit
180: stock price prediction unit 190: display unit

Claims

1. A method of predicting a stock price through analysis of social data, the method comprising:
collecting a plurality of documents related to at least one individual stock from social media data and stock market related web data;
analyzing morphemes of the plurality of documents;
analyzing data of the plurality of documents as a whole by evaluating each keyword extracted from the analyzed morphemes as positive or negative, thereby evaluating the sentiment of the plurality of documents as a whole; and
predicting a stock price of the individual stock by reflecting the sentiment-related evaluation data of the plurality of documents as a whole.

2. The method according to claim 1,
wherein the social media data is at least one of a social media site and a personalized blog, and the stock market related web data is at least one of a portal news site, a press company news site, a financial company portal site, and a stock market related statistics site.

3. The method according to claim 1,
wherein analyzing the morphemes comprises segmenting the expressions contained in the documents into word units and parsing, within the expressions, the keywords adjacent to a keyword related to the individual stock.

4. The method according to claim 1,
wherein analyzing the data comprises:
rating and scoring each keyword extracted from the analyzed morphemes as positive or negative by referring to a sentiment dictionary database that stores, for each keyword, an evaluation of positive or negative and a score associated with the evaluation; and
obtaining the sentiment-related evaluation data of the plurality of documents as a whole by summing the scores assigned to the keywords to calculate a sentiment index of the plurality of documents as a whole.

5. The method of claim 4,
wherein the scores associated with the positive or negative evaluation are different scores that carry different weights for the respective keywords.

6. The method according to claim 1,
wherein collecting the plurality of documents further comprises referring to a keyword database that holds a keyword related to the name of the individual stock and keywords related to the products, services, and personnel of the individual stock, and extracting the documents that contain the name and the keywords related to the products, services, and personnel, and
wherein analyzing the morphemes of the plurality of documents comprises analyzing the morphemes of the extracted documents.

7. The method according to claim 1,
further comprising acquiring stock market indicator data including stock market related information of the individual stock at the same time as collecting the documents,
wherein predicting the stock price of the individual stock comprises predicting the stock price based on the stock market indicator data, the sentiment-related evaluation data, and economic indicators including a macroeconomic index, and a weight calculated based on a past correlation between the stock market indicator data and the evaluation data is applied to the evaluation data.

8. The method of claim 7,
further comprising, after analyzing the data of the plurality of documents as a whole:
accumulating the evaluation data over a predetermined period;
determining, based on the past correlation, a collection period of the evaluation data that affects the stock price of the individual stock;
selecting, from the accumulated evaluation data, the evaluation data corresponding to the collection period;
determining, based on the past correlation, a delay period that elapses before the evaluation data is reflected in the stock price of the individual stock; and
providing the selected evaluation data and the delay period for use in predicting the stock price of the individual stock.

9. The method according to claim 1,
wherein the documents include at least one of HTML, Portable Document Format (PDF), image, and video.

10. A system for predicting a stock price through analysis of social data, the system comprising:
a document collection/extraction unit that collects a plurality of documents related to at least one individual stock from social media data and stock market related web data;
a morphological analysis unit that analyzes morphemes of the plurality of documents;
a data analysis unit that analyzes data of the plurality of documents as a whole by evaluating each keyword extracted from the analyzed morphemes as positive or negative, thereby evaluating the sentiment of the plurality of documents as a whole; and
a stock price prediction unit that predicts a stock price of the individual stock by reflecting the sentiment-related evaluation data of the plurality of documents as a whole.