KR101511709B1

KR101511709B1 - Method of predicting a composite stock price related index through analysis of social data, and system applying the same

Info

Publication number: KR101511709B1
Application number: KR20130163584A
Authority: KR
Inventors: 김영대; 고경훈; 이동진
Original assignee: 주식회사 코스콤
Priority date: 2013-12-26
Filing date: 2013-12-26
Publication date: 2015-04-13

Abstract

The present invention relates to a method and a system for predicting a composite stock price related index that yield adaptive and reliable predictions even when the market fluctuates sharply because of favorable or unfavorable issues concerning constituent stocks that were not reflected in the prediction of the index. According to an embodiment of the present invention, the method includes the steps of: (a) designating at least one constituent stock for predicting the composite stock price related index; (b) predicting a stock price index for each constituent stock using big data consisting of structured data, including economic statistical data, and unstructured data, including social media data (SMD); (c) predicting the composite stock price related index based on the predicted stock price index of each constituent stock; (d) calculating the difference between the composite stock price related index predicted in step (c) and the actual composite stock price related index; (e) determining whether the difference exceeds a predetermined threshold value; and (f) replacing some or all of the constituent stocks with other stocks when the difference exceeds the predetermined threshold value. The composite stock price related index is thus predicted by analyzing social data.

Description

[TECHNICAL FIELD] The present invention relates to a method of predicting a composite stock price related index through analysis of social data, and to a system applying the same.

The present invention relates to a method of predicting a composite stock price related index through analysis of social data and to a composite stock price related index prediction system applying the method. More particularly, it relates to a prediction method and system that can adaptively produce reliable predictions even when the market fluctuates sharply because of positive or negative issues concerning constituent stocks that are not reflected in the prediction of the composite stock price related index.

Because of its uniquely complex pricing mechanism, the stock market often shows price movements that cannot be explained by changes in market fundamentals. Prices can fluctuate widely even though no clear change in fundamentals has occurred, and in such cases the appearance of new news is often an important cause. This is because news contains explanations of phenomena occurring in the real world and information about what changes will arise and unfold in politics, the economy, society, and business. News and stock prices are therefore closely related, and news allows market participants to predict, at least in part, the volatility of the stock market.

Meanwhile, in addition to news provided by securities firms and media companies, the rapid development of mobile devices has recently led to explosive growth in information provided through social media data, such as Twitter, personal blogs about the stock market, Facebook, and the social data services of various portal sites. Such data reaches market participants in far greater volume than news and is referred to as big data.

Because social media data is written from an individual's subjective viewpoint, it is less reliable than news. However, because it is available at big-data scale, the reactions of market participants to the stock market, and to individual stocks in particular, can be derived from it with a considerable degree of objectivity, and the outlook for individual stocks derived from it can also be reasonably valid.

However, when predicting a composite index that underlies futures and options, such as the KOSPI 200 index (hereinafter referred to as a composite stock price related index), the prediction process is, for efficiency and controllability, performed only on a predetermined number of constituent stocks with high market representativeness (stocks whose share of total market capitalization is at or above a certain ratio). For example, the KOSPI 200 index is predicted using only the 50 stocks with the highest market representativeness. Consequently, if the market fluctuates sharply because of a positive or negative issue concerning a constituent stock not reflected in the prediction, the predicted composite stock price related index no longer matches the actual index.

For example, as shown by the black graph in FIG. 2 (the actual composite stock price related index), in an actual situation (for example, between December 3 and December 4) the index moved sharply downward because of a negative issue concerning a specific stock not reflected in the calculation of the index. As shown by the red graph (the predicted composite stock price related index), this was not reflected in the prediction at all, so the reliability of the index prediction can drop significantly.

The present invention has been made in view of the above problems, and its object is to provide a method and a system that can adaptively produce reliable predictions even when the market fluctuates sharply because of positive or negative issues concerning constituent stocks not reflected in the prediction of the composite stock price related index.

According to one aspect of the present invention for achieving the above technical object, there is provided a method of predicting a composite stock price related index through analysis of social data. The method may include the steps of: (a) designating one or more constituent stocks for predicting the composite stock price related index; (b) predicting a stock price index for each of the one or more constituent stocks using big data composed of structured data, including economic statistical data, and unstructured data, including social media data (SMD); (c) predicting the composite stock price related index based on the stock price index predicted for each of the one or more constituent stocks; (d) calculating the error between the composite stock price related index predicted in step (c) and the actual composite stock price related index; (e) determining whether the error is outside a predetermined threshold; and (f) if step (e) determines that the error is outside the predetermined threshold, changing some or all of the one or more constituent stocks to other constituent stocks.
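Purely as an illustration, one pass through steps (a) to (f) might be sketched in Python as follows. The function name `run_cycle` and the three callables `predict_stock`, `aggregate`, and `replace` are hypothetical placeholders, since the specification does not fix a concrete prediction model or aggregation rule. Treating "outside the threshold" as an absolute error larger than the threshold is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple


def run_cycle(
    constituents: List[str],
    predict_stock: Callable[[str], float],
    aggregate: Callable[[Dict[str, float]], float],
    actual_index: float,
    threshold: float,
    replace: Callable[[List[str]], List[str]],
) -> Tuple[float, float, List[str]]:
    """One pass through steps (b)-(f); step (a) is the `constituents` argument."""
    # (b) predict a price index for each designated constituent stock
    per_stock = {code: predict_stock(code) for code in constituents}
    # (c) predict the composite index from the per-stock predictions
    predicted = aggregate(per_stock)
    # (d) error between the predicted and the actual composite index
    error = predicted - actual_index
    # (e)/(f) change constituents only if the error is outside the threshold
    # (assumption: "outside" means absolute error greater than the threshold)
    if abs(error) > threshold:
        constituents = replace(constituents)
    return predicted, error, constituents
```

A caller would pass its own prediction model and replacement policy; the sketch only fixes the order of the steps.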

According to a preferred embodiment, changing some or all of the one or more constituent stocks to other constituent stocks in step (f) may consist of replacing a constituent stock, among the one or more constituent stocks, for which little related social media data is generated with a stock, among the individual stocks other than the one or more constituent stocks, for which a large amount of related social media data is generated.

According to a preferred embodiment, the related social media data may be data collected from at least one of a social media site and a personalized blog.

According to a preferred embodiment, the related social media data may include at least one of HTML, PDF (Portable Document Format), images, and video.

According to a preferred embodiment, the action in step (f) of changing some or all of the one or more constituent stocks to other constituent stocks may be taken only when both of the following are satisfied: a first condition that step (e) has determined the error to be outside the predetermined threshold, and a second condition that the first condition has persisted for at least a certain period.
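The two-condition trigger can be illustrated with a minimal Python sketch. It assumes that errors are sampled at regular intervals, that the "certain period" is expressed as a number of consecutive samples (`min_periods`), and that "outside the threshold" means an absolute error greater than `threshold`. The function name and all three conventions are illustrative assumptions, not taken from the specification.

```python
from typing import Sequence


def should_replace(errors: Sequence[float], threshold: float, min_periods: int) -> bool:
    """Return True if the most recent `min_periods` errors are all outside the threshold."""
    if min_periods <= 0 or len(errors) < min_periods:
        return False
    recent = errors[-min_periods:]
    # First condition: the error is outside the threshold.
    # Second condition: that has held for every one of the recent samples.
    return all(abs(e) > threshold for e in recent)
```

Requiring persistence in this way keeps a single noisy sample from triggering a change of constituent stocks.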

According to a preferred embodiment, step (b) may include the steps of: (b1) performing morphological analysis on the social media data; (b2) analyzing the social media data as a whole by rating each keyword extracted from the analyzed morphemes as either positive or negative; and (b3) predicting the stock price index for each of the one or more constituent stocks by reflecting the sentiment-rated social media data.
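Steps (b1) and (b2) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: whitespace tokenization stands in for real morphological analysis, the lexicon maps each keyword to +1 (positive) or -1 (negative), and the name `score_documents` is hypothetical. Step (b3), which feeds the result into a price model, is outside the sketch.

```python
from typing import Dict, Iterable


def score_documents(docs: Iterable[str], lexicon: Dict[str, int]) -> int:
    """Sum keyword polarities (+1 positive, -1 negative) over all documents."""
    total = 0
    for doc in docs:
        # Stand-in for (b1): whitespace tokenization, not morphological analysis.
        tokens = doc.lower().split()
        # (b2): rate each keyword found in the lexicon as positive or negative.
        total += sum(lexicon.get(tok, 0) for tok in tokens)
    return total
```

For Korean text, the tokenization line would be replaced by a proper morphological analyzer; the scoring loop would stay the same.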

According to another aspect of the present invention for achieving the above technical object, there is provided a composite stock price related index prediction system that predicts a composite stock price using big data composed of structured data, including economic statistical data, and unstructured data, including social media data. The system includes a constituent stock determination module that determines one or more constituent stocks for predicting the composite stock price related index. The constituent stock determination module includes: an SMD score generation unit that quantifies the amount of social media data generated for each individual stock, including the one or more constituent stocks, as an SMD (Social Media Data) score and cumulatively stores it; a prediction anomaly determination unit that determines whether the error between the predicted composite stock price and the actual composite stock price is outside a predetermined threshold; and a constituent stock changing unit that refers to the SMD scores generated and cumulatively stored by the SMD score generation unit and changes one or more constituent stocks used for predicting the composite stock price related index to stocks other than the one or more constituent stocks, wherein the SMD score of the stock after the change is greater than the SMD score of the stock before the change.

FIG. 1 is a block diagram of a composite stock price related index prediction system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a situation in which a large error has occurred between the actual and the predicted composite stock price related index.
FIGS. 3A and 3B are diagrams of the SMD score database applied to the composite stock price related index prediction system of FIG. 1.
FIG. 4 is a flowchart illustrating a method of predicting a composite stock price related index according to an embodiment of the present invention.
FIG. 5 is a diagram of the keyword database applied to the composite stock price related index prediction system of FIG. 1.
FIG. 6 is a diagram of the document storage unit applied to the composite stock price related index prediction system of FIG. 1.
FIG. 7 is a diagram of the data analysis unit applied to the composite stock price related index prediction system of FIG. 1.
FIG. 8 is a diagram of the sentiment dictionary database applied to the composite stock price related index prediction system of FIG. 1.
FIG. 9 is a diagram of the correlation analysis/determination unit applied to the composite stock price related index prediction system of FIG. 1.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In assigning reference numerals to the elements of each drawing, the same elements are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements, not that it excludes them, unless specifically stated otherwise. Terms such as "... unit", "... device", and "module" used in the specification denote a unit that processes at least one function or operation, and may be implemented in hardware, software, or a combination of hardware and software.

To aid understanding of the present invention, the terms used in this specification are explained below.

As used herein, "composite stock price related index" refers to a stock-price-related composite index that underlies futures and options, such as the KOSPI 200 index.

As used herein, "sentiment dictionary" refers to a dictionary that assigns positive and negative scores to words expressing sentiment. It serves, for example, as the basic tool for sentiment analysis that judges the effect of social media data on a given stock.

Also, as used herein, "SMD (Social Media Data) score" refers to a numerical value of the amount of social media data generated for each individual stock over a given period. It serves, for example, as a tool for determining whether a particular stock has become a major issue during that period.

[Constituent stock changing function of the composite stock price related index prediction system (1)]

FIG. 1 is a block diagram of a composite stock price related index prediction system (1) based on analysis of social data according to an embodiment of the present invention.

Referring to FIG. 1, the composite stock price related index prediction system (1) according to the present invention is a system that can predict individual stock prices and/or a composite stock price related index (for example, the KOSPI 200 index) using big data composed of structured data, including economic statistical data, and unstructured data, including social media data (SMD).

In particular, the composite stock price related index prediction system (1) according to the present invention can adaptively reflect a stock in the prediction when the social media data for that stock, one of the few stocks not reflected in the prediction of the composite stock price related index, changes abruptly over a given period.

To this end, the composite stock price related index prediction system (1) includes a constituent stock determination module (200) that automatically reflects stocks generating a large amount of social media data (that is, stocks currently at issue) in the prediction of the composite stock price related index.

In one example, the constituent stock determination module (200) may include an SMD score generation unit (210) that quantifies the amount of social media data generated for each constituent stock as an SMD (Social Media Data) score and cumulatively stores it in the database (300); a prediction anomaly determination unit (220) that determines whether the error between the predicted and the actual composite stock price related index is outside a predetermined threshold; and a constituent stock changing unit (230) that refers to the SMD scores generated and cumulatively stored by the SMD score generation unit (210) and changes one or more constituent stocks used for predicting the composite stock price related index.

In one embodiment, the constituent stock changing unit (230) may operate when notified by the prediction anomaly determination unit (220) that the error between the predicted and the actual composite stock price related index is outside the predetermined threshold.

In another embodiment, the constituent stock changing unit (230) may operate when notified by the prediction anomaly determination unit (220) that the error between the predicted and the actual composite stock price related index is outside the predetermined threshold and that this situation has persisted for at least a predetermined threshold period.

According to a preferred embodiment, the constituent stock changing unit (230) may replace a constituent stock, among the one or more constituent stocks reflected in the prediction of the composite stock price related index, for which little related social media data is generated with a stock, among those not reflected in the prediction, for which a large amount of related social media data is generated.

According to a preferred embodiment, the related social media data on which the constituent stock changing unit (230) bases the change of constituent stocks for predicting the composite stock price related index may, as described further below, be data collected by the document collection/extraction unit (110) from at least one of the social media sites (12) and the personalized blogs (14).

According to a preferred embodiment, the related social media data on which the constituent stock changing unit (230) bases the change of constituent stocks for predicting the composite stock price related index may, as described further below, take the form of at least one of HTML, PDF (Portable Document Format), images, and video.

According to a preferred embodiment, the composite stock price related index prediction system (1) may include an SMD score database (300) that stores data holding the per-stock SMD scores of the predetermined number of constituent stocks reflected in the prediction (for example, 50 constituent stocks), as shown for example in FIG. 3A, and data holding the per-stock SMD scores of the predetermined number of constituent stocks not reflected in the prediction (for example, 150 constituent stocks), as shown for example in FIG. 3B. Here, the "SMD (Social Media Data) score" is a numerical value of the amount of social media data generated for each individual stock over a given period; it may be generated, for example, by the SMD score generation unit (210) counting the number of social media data items that the document collection/extraction unit (110) extracts for each stock.
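The counting approach mentioned above can be sketched in a few lines of Python. The sketch assumes each collected document has already been tagged with a stock code and filtered to the period of interest; the `(stock_code, text)` pair layout and the name `smd_scores` are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def smd_scores(documents: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count collected social media documents per stock code.

    Each document is a (stock_code, text) pair; only the code is used here.
    """
    return dict(Counter(code for code, _text in documents))
```

Stocks with no documents in the period simply do not appear in the result, so a caller that needs a score for every stock would fill in zero for the missing codes.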

The characteristic operation of the composite stock price related index prediction system (1) according to the present invention is now described with reference to FIGS. 1 to 4.

The composite stock price related index prediction system (1) basically designates one or more constituent stocks for predicting the composite stock price related index (S101), predicts a stock price index for each of those constituent stocks (S102), and predicts the composite stock price related index by reflecting the stock price indexes predicted for those stocks (S103).
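The specification does not state how the per-stock predictions are combined in S103. As one possible illustration only, the sketch below assumes a weighted average in which each stock's weight (for example, its market capitalization) is supplied by the caller. The function name and the weighting rule are assumptions, not the patented method.

```python
from typing import Dict


def composite_from_constituents(
    predicted: Dict[str, float], weights: Dict[str, float]
) -> float:
    """Weighted average of per-stock predicted indexes (assumed aggregation rule)."""
    total_weight = sum(weights[code] for code in predicted)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(predicted[code] * weights[code] for code in predicted) / total_weight
```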

While the composite stock price related index prediction system (1) is in operation, the overall prediction may be thrown off at some point by the influence of a few stocks. That is, as the black graph in FIG. 2 (the actual composite stock price related index) shows, the overall index may move sharply downward because of a negative issue concerning a constituent stock not reflected in the calculation of the index (for example, constituent stock 54 in FIG. 3B), while, as the red graph (the predicted composite stock price related index) shows, this factor is not reflected in the prediction at all. In that case a large error arises between the predicted and the actual composite stock price related index.

At this point, the prediction anomaly determination unit (220) determines whether the error is outside the predetermined threshold (S104) and, if it is (S105), notifies the constituent stock changing unit (230) of the result in real time.

The constituent stock changing unit (230) then replaces the constituent stock with the lowest SMD score in the list of stocks reflected in the prediction, stored in the SMD score database (300) (for example, constituent stock 3 in the data of FIG. 3A), with the constituent stock with the highest SMD score in the list of stocks not reflected in the prediction (for example, constituent stock 54 in the data of FIG. 3B) (S106).
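The replacement in S106 can be sketched as follows. The dictionaries map a stock code to its SMD score. The function name is hypothetical, and the guard that skips the swap when the incoming score is not larger follows the condition in the system aspect that the changed stock's SMD score be greater than that of the stock it replaces.

```python
from typing import Dict, Optional, Tuple


def swap_constituent(
    tracked: Dict[str, int], untracked: Dict[str, int]
) -> Tuple[Dict[str, int], Dict[str, int], Optional[Tuple[str, str]]]:
    """Swap the lowest-SMD tracked stock with the highest-SMD untracked stock.

    Returns the new tracked and untracked maps and the (removed, added) pair,
    or None in place of the pair if no swap was made.
    """
    if not tracked or not untracked:
        raise ValueError("both lists must be non-empty")
    out_code = min(tracked, key=lambda c: tracked[c])
    in_code = max(untracked, key=lambda c: untracked[c])
    # Only swap when the incoming stock has the strictly larger SMD score.
    if untracked[in_code] <= tracked[out_code]:
        return dict(tracked), dict(untracked), None
    new_tracked = {c: s for c, s in tracked.items() if c != out_code}
    new_untracked = {c: s for c, s in untracked.items() if c != in_code}
    new_tracked[in_code] = untracked[in_code]
    new_untracked[out_code] = tracked[out_code]
    return new_tracked, new_untracked, (out_code, in_code)
```

With made-up scores in which stock "3" has the lowest tracked score and stock "54" the highest untracked score, the call returns `("3", "54")` as the swapped pair, mirroring the example of FIGS. 3A and 3B.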

Through the characteristic operations of the constituent stock determination module (200) described above, the composite stock price related index prediction system (1) according to the present invention can adaptively reflect the social media data pattern of a specific point in time (or period) in the prediction of the composite stock price related index, and can thereby provide reliable predictions that reflect market conditions.

In the example of FIG. 2, after December 4 the specific stock affecting the actual composite stock price related index (and therefore generating a large volume of related social media data; for example, constituent stock 54 in FIG. 3B) was reflected in the prediction through the operation of the constituent stock changing unit (230) of the system (1). As a result, the actual and the predicted composite stock price related index can be seen to follow very similar patterns.

[Function of the composite stock price related index prediction system (1) for predicting individual stock prices and/or the composite stock price related index]

The operation by which the composite stock price related index prediction system (1) predicts individual stock prices and/or the composite stock price related index, using big data composed of structured data, including economic statistical data, and unstructured data, including social media data, is described in detail below, function by function.

Referring to FIG. 1, in addition to the constituent stock determination module (200), the composite stock price related index prediction system (1) further includes a document collection/extraction unit (110) that collects a large volume of documents from social media data (10) and stock market related web data (20); a document storage unit (130) that stores the collected documents by company; a morphological analysis unit (140) that performs morphological analysis on the expressions or sentences contained in the documents for each company; and a data analysis unit (150) that analyzes the data of the documents as a whole by rating each keyword extracted from the analyzed morphemes as positive or negative and thereby evaluating the sentiment of the documents as a whole. The system (1) may also include a correlation analysis/determination unit (170) that generates analysis data from the correlation between stock market indicator data and economic indicator data, together with sentiment evaluation data selected from the accumulated sentiment evaluation data according to a predetermined condition; a stock price prediction unit (180) that predicts the prices of individual stocks, and the composite stock price reflecting them, based on the selected evaluation data and the analysis data; and a display unit (190) that displays the prediction results from the stock price prediction unit (180).

The document collection/extraction unit (110) collects a large volume of documents related to at least one individual stock from the social media data (10) and the stock market related web data (20), and receives the stock market indicator data (30). Here, an individual stock is a company listed on the stock market, and the collected documents may take the form of at least one of HTML, PDF (Portable Document Format), images, and video.

The social media data (10) is media data entered through a desktop computer or mobile device connected to a network such as the Internet, and can be shared with other users connected to the network. For example, the social media data (10) may come from social media sites (12) operated by social media servers and from blog sites (14) that are operated by various portal sites and contain personalized content. The social media sites (12) are so-called SNSs and may be Twitter, Facebook, or social media services offered by various portal sites.

The stock market related web data 20 is web data provided by news organizations, terrestrial broadcasters, cable broadcasters, portal news services, financial companies, stock market related institutions, and the like, and is more specialized or authoritative than the social media data 10. It may be drawn from stock market news sites 22 served by news organizations, broadcasters, and portal news services; financial company portal sites 24 through which financial companies such as banks, securities firms, and insurers provide stock market related services; and stock market statistics sites 26 through which public or private stock market related institutions provide analytical information about the stock market.

The stock market indicator data 30 is stock information for each individual stock listed on the stock market and may include, for example, the opening price, high, low, closing price, quoted price, trade execution status, trading volume, trading value, trading member, upper price limit, lower price limit, new high, and new low.

When collecting a large volume of documents from the social media data 10 and the stock market related web data 20, the document collection/extraction unit 110 does not collect every document. Instead, it refers to the keyword database 120 and collects only documents related to at least one individual stock.

The keyword database 120 may contain keyword groups categorized by the company corresponding to each individual stock. Specifically, as shown in FIG. 5, it may store a main keyword 122 associated with the company name of the individual stock, along with sub-keywords that include product/service keywords 124 concerning the products and services the company offers, people keywords 126 concerning the company's management and the like, and company situation keywords 128 concerning words and contexts that can affect the individual stock. Sub-keywords are words, contexts, and so on that are specific to a given company, and may be classified and categorized company by company.

To illustrate, the main keyword 122 may be the name of a company listed on the stock market, such as Samsung Electronics, LG Electronics, or KT. For Samsung Electronics, the product/service keywords 124 may be "Galaxy", "smartphone", "Hauzen", "tablet", "app market", and the like; the people keywords 126 may be Samsung Electronics' senior executives, executives of companies that do business with Samsung Electronics, and the like; and the company situation keywords 128 are words that can affect Samsung Electronics' stock price and may include a wide range of words such as "record high", "earnings", "strong", "Apple", "complaint", and "deterioration".

From the expressions contained in the collected documents, the document collection/extraction unit 110 extracts the documents that contain the main keyword 122, the product/service keywords 124, and the people keywords 126 described above, and can thereby efficiently select document data suitable for sentiment evaluation.
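The keyword-driven extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class `CompanyKeywords`, the function `extract_documents`, the sample documents, and the choice to treat a document as relevant when it contains any one extraction term are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CompanyKeywords:
    """Hypothetical record mirroring the keyword groups 122 to 128."""
    main: set                                         # main keywords (122)
    product_service: set = field(default_factory=set)  # keywords (124)
    people: set = field(default_factory=set)            # keywords (126)
    situation: set = field(default_factory=set)         # keywords (128)

    def extraction_terms(self):
        # Only groups 122, 124 and 126 are used to select documents.
        return self.main | self.product_service | self.people


def extract_documents(documents, keyword_db):
    """Map each company to the documents mentioning one of its terms."""
    selected = {company: [] for company in keyword_db}
    for doc in documents:
        for company, keywords in keyword_db.items():
            # Assumption: a single matching term is enough to keep the document.
            if any(term in doc for term in keywords.extraction_terms()):
                selected[company].append(doc)
    return selected


keyword_db = {
    "Samsung Electronics": CompanyKeywords(
        main={"Samsung Electronics"},
        product_service={"Galaxy", "Hauzen"},
        situation={"record high", "earnings"},
    ),
    "LG Electronics": CompanyKeywords(main={"LG Electronics"}),
}
documents = [
    "The new Galaxy sold out in a day",
    "LG Electronics posts strong earnings",
    "Weather will be sunny tomorrow",
]
selected = extract_documents(documents, keyword_db)
```

Note that the second document is not attributed to Samsung Electronics even though it contains "earnings", because company situation keywords 128 are not among the groups used for extraction.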

The document storage unit 130 may store the extracted documents in a form suitable for morphological analysis. For example, as shown in FIG. 6, the documents extracted for each individual stock group 131 may be stored in a distributed manner by format, that is, as HTML 132, PDF 133, image 134, video 135, and so on.

As preprocessing that puts the data into a form suitable for sentiment evaluation, the morphological analysis unit 140 analyzes the stored documents, in each format, into morphemes (the smallest meaningful units of language) and identifies each part of speech. In doing so, the morphological analysis unit 140 may apply processing appropriate to each format shown in FIG. 6 and run the morphological analysis on the formats in parallel.

In addition, the morphological analysis unit 140 may segment the sentences, contexts, and so on in the expressions contained in each document into word units and parse the keywords adjacent to keywords associated with an individual stock. For example, when a person's blog contains sentences about Samsung Electronics alongside sentences about LG Electronics, the morphological analysis unit 140 segments the blog text into word units, taking sentence structure, conjunction structure, syntax, and so on into account. It then searches for keywords such as the names, products/services, and people of Samsung Electronics or LG Electronics, parses the words and phrases adjacent to them, and stores them classified as keywords for Samsung Electronics and for LG Electronics respectively.
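The attribution of neighboring words to each company can be illustrated with a simple window around each company keyword. This is only a sketch under strong assumptions: pre-split English tokens stand in for the output of a real morphological analyzer, and the function name `adjacent_words` and the fixed window size are hypothetical. The source says the analysis also considers sentence and conjunction structure, which this sketch does not attempt.

```python
def adjacent_words(tokens, company_terms, window=2):
    """Collect the words within `window` positions of each company keyword."""
    result = {company: [] for company in company_terms}
    for i, token in enumerate(tokens):
        for company, terms in company_terms.items():
            if token in terms:
                start = max(0, i - window)
                stop = min(len(tokens), i + window + 1)
                # Keep the neighbors, skipping the company keyword itself.
                result[company].extend(
                    tokens[j] for j in range(start, stop) if j != i
                )
    return result


# Hypothetical blog sentence mentioning two companies.
tokens = ["Galaxy", "sales", "rise", "but", "analysts",
          "say", "LG", "shares", "fall"]
company_terms = {
    "Samsung Electronics": {"Galaxy"},
    "LG Electronics": {"LG"},
}
neighbors = adjacent_words(tokens, company_terms, window=2)
```

With this window, "rise" is attributed only to Samsung Electronics and "fall" only to LG Electronics, which is the separation the paragraph above aims for.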

Referring to FIG. 7, the data analysis unit 150 may include a data sentiment evaluation unit 152, which evaluates the sentiment of the entire set of documents by rating each keyword processed by the morphological analysis unit 140 as either positive or negative, and a keyword analysis unit 154, which performs statistical processing on the keywords processed by the morphological analysis unit 140.

The data sentiment evaluation unit 152 refers to a sentiment dictionary database 160, which stores, for each keyword from the morphological analysis unit 140, a positive, neutral, or negative rating and a score linked to that rating, and rates each extracted keyword as positive, neutral, or negative while also scoring it. The scoring algorithm may be the Naive Bayes algorithm, the Simple voter algorithm, KNN (k-nearest neighbors), or SVM (support vector machine). Taking the Simple voter algorithm as an example, the sentiment dictionary database 160 may store keywords in table form, grouped as positive, neutral, or negative, as shown in FIG. 8. Most of the keywords relevant to this sentiment evaluation may be nouns and adjectives. For example, the positive table 162 contains keywords such as "rise", "record high", and "climb", each assigned a score of "1". The negative table 166 contains keywords such as "recession", "fall", and "complaint", each assigned a score of "-1". Keywords stored in the neutral table 164 are assigned a score of "0". The scores shown in FIG. 8 serve only to distinguish positive from negative. Alternatively, the scores linked to positive or negative ratings may differ from keyword to keyword, with each keyword weighted according to how strongly market participants feel about it.
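A dictionary lookup of the kind described for the Simple voter example might look like the sketch below. The positive and negative entries follow the keywords named in the text; the neutral entry "announce", the dictionary name `SENTIMENT_DICTIONARY`, and the function `score_keyword` are hypothetical, since the source gives no neutral examples and no data layout beyond the three tables.

```python
# Stand-in for sentiment dictionary database 160 (tables 162, 164, 166).
SENTIMENT_DICTIONARY = {
    "rise": 1, "record high": 1, "climb": 1,        # positive table (162)
    "announce": 0,                                  # neutral table (164), hypothetical entry
    "recession": -1, "fall": -1, "complaint": -1,   # negative table (166)
}


def score_keyword(keyword, dictionary=SENTIMENT_DICTIONARY):
    """Return (rating, score) for a keyword, or (None, None) if unlisted."""
    score = dictionary.get(keyword)
    if score is None:
        return None, None
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score


rating, score = score_keyword("record high")
```

The weighted variant mentioned in the text would only change the stored values, for example giving a strongly felt keyword a score of 2 instead of 1; the lookup itself stays the same.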

The data sentiment evaluation unit 152 may sum the scores assigned to the keywords rated positive, neutral, or negative by means of the sentiment dictionary database 160 to produce sentiment-related evaluation data, such as a sentiment index, for the entire set of documents. Here, the data sentiment evaluation unit 152 evaluates the sentiment of the keywords of all the documents and does not go on to rate each document as positive, neutral, or negative. If sentiment were evaluated document by document to capture each document's overall tone, one document could contain far more negatively rated keywords than another and yet both would receive the same negative score. The ratio of positive to negative factors for an individual stock across all the documents extracted from the social media data 10 and the stock market related web data 20 could then be distorted. In this embodiment, therefore, sentiment evaluation is performed on the morphologically analyzed keywords from the entire set of documents without grouping them by document, which prevents such distortion.
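The difference between pooling keyword scores and voting per document can be made concrete with a small example. The function names and the three sample documents are assumptions for illustration; the scores follow the +1/-1 convention described for FIG. 8.

```python
SCORES = {"rise": 1, "record high": 1,
          "recession": -1, "fall": -1, "complaint": -1}


def pooled_sentiment_index(docs_keywords):
    """Sum keyword scores over all documents without grouping by document."""
    return sum(SCORES.get(k, 0) for keywords in docs_keywords for k in keywords)


def per_document_index(docs_keywords):
    """Rate each document as +1, 0 or -1 first, then sum the ratings."""
    total = 0
    for keywords in docs_keywords:
        s = sum(SCORES.get(k, 0) for k in keywords)
        total += (s > 0) - (s < 0)  # sign of the document's score
    return total


docs = [
    ["fall", "complaint", "recession", "fall", "complaint", "rise"],  # strongly negative
    ["fall"],                                                         # mildly negative
    ["rise", "record high"],                                          # positive
]
pooled = pooled_sentiment_index(docs)   # -4 + -1 + 2 = -3
per_doc = per_document_index(docs)      # -1 + -1 + 1 = -1
```

Per-document voting gives the strongly negative first document the same weight as the mildly negative second one, so the result understates the negative sentiment. Pooling, as in the embodiment, preserves the actual balance of positive and negative keywords.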

The keyword analysis unit 154 may perform statistical analysis on the keywords analyzed by the morphological analysis unit 140, such as the number of occurrences collected per period and correlation analysis between keywords, and provide the results to the display unit 190. The keyword analysis unit 154 also picks out, from the analyzed keywords, those not registered in the keyword database 120. The newly identified keywords are added to the keyword database 120, which can improve the accuracy of the document collection performed by the document collection/extraction unit 110. An administrator may store in the sentiment dictionary database 160 those new keywords that should be reflected in the sentiment evaluation.
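Two of the tasks attributed to the keyword analysis unit 154, counting keywords per period and flagging unregistered keywords, can be sketched as below. The monthly period, the function names, and the sample observations are assumptions; the source does not specify the period length or the data format.

```python
from collections import Counter


def count_by_period(observations):
    """Count keyword occurrences per month.

    `observations` is an iterable of (date, keyword) pairs with the date
    given as a 'YYYY-MM-DD' string; the month is its first seven characters.
    """
    return Counter((day[:7], keyword) for day, keyword in observations)


def find_unregistered(observations, registered):
    """Return keywords absent from the keyword database, in first-seen order."""
    new_keywords = []
    for _, keyword in observations:
        if keyword not in registered and keyword not in new_keywords:
            new_keywords.append(keyword)
    return new_keywords


# Hypothetical observations from the morphological analysis step.
observations = [
    ("2013-11-29", "Galaxy"),
    ("2013-12-02", "Galaxy"),
    ("2013-12-03", "foldable"),
    ("2013-12-05", "Galaxy"),
]
counts = count_by_period(observations)
candidates = find_unregistered(observations, registered={"Galaxy"})
```

The `candidates` list corresponds to the newly identified keywords that would be added to the keyword database 120 and, at an administrator's discretion, to the sentiment dictionary database 160.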

Meanwhile, the correlation analysis/determination unit 170 may generate analysis data from the correlation between the stock market indicator data and the economic indicator data, together with sentiment-related evaluation data selected under a predetermined condition from the accumulated sentiment evaluation data. Referring to FIG. 9, the correlation analysis/determination unit 170 may include an evaluation data storage unit 171, a first correlation table unit 172, an evaluation data collection period determination unit 173, an evaluation data selection unit 174, a delay period determination unit 175, an economic indicator database 176, and a second correlation table unit 177.

The evaluation data storage unit 171 may accumulate sentiment-related evaluation data, such as the sentiment index, for each individual stock on a daily basis. This evaluation data is provided to the first correlation table unit 172, where its correlation with the externally supplied stock market indicator data 30 is analyzed, and the resulting correlation between an individual stock's past stock market indicator data 30 and the corresponding evaluation data is recorded in the first correlation table unit 172.

The evaluation data collection period determination unit 173 determines, based on the past correlations stored in the first correlation table unit 172, the collection period of the evaluation data that affects the price of an individual stock. The evaluation data selection unit 174 selects, from the sentiment evaluation data accumulated in the evaluation data storage unit 171, the evaluation data that falls within the collection period and provides it to the stock price prediction unit 180.

The delay period determination unit 175 determines, based on the past correlations in the first correlation table unit 172, the delay period that elapses before sentiment-related evaluation data is reflected in the price of an individual stock. It provides this delay period to the stock price prediction unit 180 when an individual stock's price is predicted, so that the price after the delay period can be predicted.

Providing the collection period and the delay period to the stock price prediction unit 180 in this way makes it possible to use more relevant sentiment evaluation data and to specify more accurately the point in time to which a price prediction applies.
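The source does not state how the delay period is derived from the past correlations. One plausible approach, shown here purely as an assumption, is to shift the price series against the sentiment series and pick the lag with the strongest Pearson correlation. The function names and the sample series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_delay(sentiment, returns, max_lag):
    """Find the lag (in days) at which sentiment[t] best tracks returns[t + lag].

    Both series must have the same length. Returns (lag, correlation).
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        xs = sentiment[:len(sentiment) - lag]
        ys = returns[lag:]
        r = pearson(xs, ys)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Hypothetical daily series: returns follow sentiment two days later.
sentiment = [1, -2, 3, 0, -1, 2, -3, 1]
returns = [0, 0, 0.5, -1, 1.5, 0, -0.5, 1]
lag, corr = best_delay(sentiment, returns, max_lag=3)
```

In this constructed example the strongest correlation appears at a lag of two days, which would then serve as the delay period passed to the stock price prediction unit 180. A collection period could be chosen along similar lines by comparing correlations for different aggregation windows.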

The second correlation table unit 177 may provide the stock price prediction unit 180 with analysis data derived from the correlation between the stock market indicator data 30 and the economic indicator data, relating to macroeconomic indices, accumulated in the economic indicator database 176. Here, the economic indicator data are indicators that fundamentally affect all individual stocks in common, such as interest rates, exchange rates, expected growth rates, price indices, and the balance of payments.

Referring again to FIG. 1, the stock price prediction unit 180 may predict the price of each individual stock, and the composite stock price reflecting those prices, based on the sentiment-related evaluation data selected by the correlation analysis/determination unit 170, the delay period, and the analysis data generated by the second correlation table unit 177. The price prediction is built on a time series analysis of the stock market indicator data 30 and the economic indicator data, and the evaluation data analyzed from the social media data 10 and from the news in the stock market related web data 20 may be incorporated as a term that corrects the predicted price produced by the time series analysis. To further improve prediction accuracy, a weight calculated from the correlations in the first correlation table unit 172 may be applied to the sentiment-related evaluation data, so that the weighted evaluation data is reflected in the price prediction. The predicted prices of individual stocks calculated by the stock price prediction unit 180, and their statistics, are displayed on the display unit 190.
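The idea of a time series forecast corrected by a weighted sentiment term can be sketched as follows. The source names neither the time series model nor the form of the correction, so the moving-average baseline, the additive correction, the function names, and all numbers below are assumptions chosen only to make the structure visible.

```python
def baseline_forecast(prices, window=3):
    """Stand-in time series forecast: mean of the last `window` closing prices."""
    recent = prices[-window:]
    return sum(recent) / len(recent)


def corrected_forecast(prices, sentiment_index, weight, window=3):
    """Add the weighted sentiment index to the baseline as a correction term."""
    return baseline_forecast(prices, window) + weight * sentiment_index


closing_prices = [100.0, 102.0, 104.0]   # hypothetical closing prices
forecast = corrected_forecast(closing_prices, sentiment_index=-3, weight=0.5)
# Baseline 102.0 plus correction 0.5 * -3 gives 100.5.
```

In the described system the weight would come from the correlations in the first correlation table unit 172, and the forecast would apply to the date one delay period ahead.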

According to the composite stock price related index prediction system 1, reflecting sentiment-related evaluation data for a large volume of data, including social data and news, makes it possible to extract the market mood and information about individual stocks more objectively and meaningfully from the diverse views of market participants, so that the prices of individual stocks and the composite stock price reflecting them can be predicted more reliably. In particular, price prediction based on sentiment evaluation of social media data together with news analysis is more accurate and reliable than prediction based solely on analysis of the news produced in the stock market related web data 20, because social media data is produced in far greater volume than news and the analysis is therefore statistically closer to the population.

Each component of the composite stock price related index prediction system 1 shown in FIG. 1, or each step of the stock price prediction method shown in FIG. 4, may be recorded on a computer-readable recording medium in the form of a program that implements its function. Here, a computer-readable recording medium means a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and that can be read by a computer. Examples of such media that are removable from a computer include flexible disks, magneto-optical disks, CD-ROM, CD-R/W, DVD, DAT, and memory cards. Recording media fixed in a computer include hard disks and ROM.

Furthermore, although all the components of the embodiments of the present invention have been described above as being combined into one or operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the objectives of the present invention, one or more of the components may be selectively combined and operated. In addition, although each of the components may be implemented as a separate piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

1: stock price prediction system 110: document collection/extraction unit
120: keyword database 130: document storage unit
140: morphological analysis unit 150: data analysis unit
160: sentiment dictionary database 170: correlation analysis/determination unit
180: stock price prediction unit 190: display unit
200: constituent stock determination module 210: SMD score generation unit
220: prediction anomaly determination unit 230: constituent stock change unit
300: SMD score database

Claims

A method, performed automatically by a computer, for predicting a composite stock price related index through analysis of social data, the method comprising:
(a) designating one or more constituent stocks for predicting the composite stock price related index;
(b) predicting a stock price index for each of the one or more constituent stocks using big data composed of structured data, including economic statistical data, and unstructured data, including social media data (SMD);
(c) predicting the composite stock price related index based on the stock price index predicted for each of the one or more constituent stocks;
(d) calculating the error between the composite stock price related index predicted in step (c) and the actual composite stock price related index;
(e) determining whether the error exceeds a predetermined threshold; and
(f) changing some or all of the one or more constituent stocks to other constituent stocks if step (e) determines that the error exceeds the predetermined threshold.

The method according to claim 1,
wherein, in step (f), changing some or all of the one or more constituent stocks to other constituent stocks comprises replacing the constituent stock for which the least related social media data is generated, among the one or more constituent stocks, with an individual stock for which more related social media data is generated than for that constituent stock.

The method of claim 2,
wherein the related social media data is data collected from at least one of a social media site and a personalized blog.

The method of claim 3,
wherein the related social media data takes the form of at least one of HTML, PDF (Portable Document Format), image, and video.

The method of claim 4,
wherein, in step (f), changing some or all of the one or more constituent stocks to other constituent stocks is performed when a first condition, that step (e) determines that the error exceeds the predetermined threshold, and a second condition, that the first condition persists for a predetermined period or longer, are both satisfied.

The method of claim 2,
wherein step (b) comprises:
(b1) analyzing the morphemes of the social media data;
(b2) analyzing the social media data as a whole by rating each keyword extracted from the analyzed morphemes as either positive or negative; and
(b3) predicting a stock price index for each of the one or more constituent stocks by reflecting the sentiment-evaluated social media data.

A system for predicting a composite stock price related index using big data composed of structured data, including economic statistical data, and unstructured data, including social media data, the system comprising
a constituent stock determination module that determines one or more constituent stocks for predicting the composite stock price related index,
wherein the constituent stock determination module includes:
an SMD score generation unit that generates and accumulates, as a Social Media Data (SMD) score, the amount of social media data generated for each individual stock, including the one or more constituent stocks;
a prediction anomaly determination unit that determines whether the error between the predicted composite stock price and the actual composite stock price exceeds a predetermined threshold; and
a constituent stock change unit that refers to the SMD scores generated and accumulated by the SMD score generation unit and changes at least one of the constituent stocks used for predicting the composite stock price related index to a stock other than the one or more constituent stocks, wherein the SMD score of the stock after the change is larger than the SMD score of the stock before the change.
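
The error check and constituent replacement recited in the claims can be illustrated with a short sketch. It is not the claimed implementation: the function names, the use of daily errors, the choice to swap exactly one stock, and the choice of the highest-scoring outside stock as the replacement are assumptions. The claims require only that the replacement have a larger SMD score than the stock it replaces.

```python
def should_replace(errors, threshold, min_days):
    """True if the last `min_days` errors all exceed the threshold.

    Covers both conditions: the error exceeds the threshold, and that
    state has persisted for the predetermined period.
    """
    if len(errors) < min_days:
        return False
    return all(abs(e) > threshold for e in errors[-min_days:])


def replace_lowest_smd(constituents, smd_scores):
    """Swap the constituent with the lowest SMD score for a better outsider."""
    weakest = min(constituents, key=lambda s: smd_scores[s])
    outsiders = [s for s in smd_scores if s not in constituents]
    if not outsiders:
        return list(constituents)
    strongest = max(outsiders, key=lambda s: smd_scores[s])
    # Only replace when the outsider generates more social media data.
    if smd_scores[strongest] <= smd_scores[weakest]:
        return list(constituents)
    return [strongest if s == weakest else s for s in constituents]


# Hypothetical SMD scores and prediction errors.
smd_scores = {"A": 120, "B": 15, "C": 80, "D": 300, "E": 40}
constituents = ["A", "B", "C"]
errors = [0.5, 2.1, 2.4, 3.0]

if should_replace(errors, threshold=2.0, min_days=3):
    constituents = replace_lowest_smd(constituents, smd_scores)
```

Here the error has exceeded the threshold for three consecutive days, so stock "B", which has the lowest SMD score among the constituents, is replaced by "D", the outside stock with the most social media activity.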