KR20210081699A

KR20210081699A - Apparatus for discriminator stock sell and buying signal by using natural language feature

Info

Publication number: KR20210081699A
Application number: KR1020190173917A
Authority: KR
Inventors: 박수진; 김재훈; 박재윤; 황수현; 안성모
Original assignee: 서강대학교산학협력단
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2021-07-02

Abstract

The present invention relates to a stock sell and buy signal discriminator. The discriminator builds a model that reflects natural language data and uses a variety of methodologies, including meta-labeling, to identify stock sell and buy signals that are robust to market conditions. The invention can therefore respond to stock price fluctuations in the real market.

Description

Apparatus for discriminator stock sell and buying signal by using natural language feature

The present invention relates to a stock price prediction method and, more particularly, to a stock sell and buy signal discriminator that combines natural language features using machine learning.

Previous studies tried to predict stock prices by building models with machine learning methodologies known for strong predictive performance. A study on predicting the rise and fall of the KOSPI 200 index applied an XGBoost model, which is strong in classification performance and speed. Whereas an LSTM neural network model reached 80% accuracy, the XGBoost model recorded 86.67%. Meanwhile, a study on deep learning optimization for stock price prediction and a study on stock price direction prediction and investment portfolio optimization using artificial intelligence built stock price prediction models with RNN-LSTM and CNN, which are representative deep learning methods. The former performed hyperparameter tuning centered on the closing price and trading time features and built a stock price prediction model from the result, recording an accuracy of 99.52%, about 9.6 percentage points above the previous training accuracy of 89.84%. The latter built a model that optimizes the investment ratio of selected stocks, using per-stock price information processed into auxiliary indicators and the headlines of economic news articles. It used a genetic algorithm to compare precision, recall, accuracy, and same-day prediction probability, and compared the profitability of the selected stocks against the benchmark indices KOSPI and KOSPI 200. The accuracy for the top 15 stocks ranged from 61.52% to 68.20%.

These existing studies share a decisive limitation: overfitting. Unlike the picture and video data commonly used in machine learning, financial data exhibits serial correlation, in which past events keep exerting influence for some period of time. Applying a machine learning algorithm that performs well in other fields to financial data as-is therefore causes the model to overfit. Predictive power on a specific data set may be very high, but accuracy becomes very low as soon as market conditions change even slightly.

Korean Patent Publication No. 10-2001-0008679
Korean Patent Publication No. 10-2001-0089791

An object of the present invention, which addresses the problems described above, is to provide a stock sell and buy signal discriminator that overcomes overfitting by using various methodologies and that can respond to stock price fluctuations in the real market by using bagging with sequential bootstrapping.

To achieve this technical object, the stock sell and buy signal discriminator according to a feature of the present invention provides a model that can respond to stock price fluctuations in the real market by using bagging with sequential bootstrapping.

The significance of the present invention is that it attempts to build a model reflecting natural language data in predicting financial market conditions, and that it produces a model robust to market conditions by using various methodologies including meta-labeling. Overall the results matched expectations, and the 74% classification accuracy for investment success is encouraging. The results on real data also showed that the Primary Model should change with the current state of the stock market, and therefore that one should hold several investment strategies and use different strategies and features depending on the market.

FIG. 1 is a system configuration diagram of the stock sell and buy signal discriminator combining natural language features according to the present invention, and FIG. 2 is a schematic diagram conceptually illustrating each component.
FIG. 3 is a graph showing, by way of example, the buy and sell sections and the investment signals of the Primary Model.
FIGS. 4 and 5 show, for the results of applying Model 1 and Model 2 to the test data set (July to September 2019), the Classification Report of the sklearn module, the ROC curve, and three measures of each model's feature importance: MDI (Mean Decrease Impurity), MDA (Mean Decrease Accuracy), and SFI (Single Feature Importance). Specifically, FIG. 4 shows tables and graphs of the results for the model using bagging, meta-labeling, and Purged 4-fold, and FIG. 5 shows tables and graphs of the results for the model using sequential bootstrapping bagging, meta-labeling, and Purged 4-fold.
FIG. 6 is a graph showing the results of testing the discriminator according to the present invention.

We set out to build a financial machine learning model that resolves the problems raised in earlier studies applying machine learning to financial markets, including overfitting, and that finds new features in natural language. This 'buy and sell signal discriminator combining natural language features' works as follows. 1) All trade data (tick data) of the Korean futures market (KOSPI 200) was collected through a brokerage API, stored daily, and organized into a database for use as training data. 2) From this high-frequency data, volume bars were constructed in units of traded volume rather than in time units of days, minutes, or seconds, so that the data has statistical properties suitable for machine learning. 3) On this data, a moving-average strategy (Trend Following Strategy) served as the first investment strategy (Primary Model), and the signals derived with a CUSUM filter were labeled by meta-labeling. 4) As machine learning features, in addition to technical indicators and financial statement items, natural language features (the degree of positivity or negativity of posts on the Naver stock bulletin board, and real-time view counts) were obtained through sentiment analysis with BERT and real-time crawling of the board, with the aim of improving predictive performance. 5) As the machine learning algorithm, a bagging model with sequential bootstrapping, which is known to suit financial data and to prevent overfitting, was used to improve performance while staying robust to changes in the market environment. 6) Based on the model trained through the above process (Second Model), Samsung Electronics stock trade data is crawled at 1-minute intervals and posts on the Samsung Electronics Naver stock board at 5-minute intervals in real time; whenever a buy or sell signal occurs, the features are fed into the trained model and the result is broadcast over the web.
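The volume-bar construction in step 2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the tick layout `(timestamp, price, volume)`, the `VolumeBar` type, and the `build_volume_bars` name are assumptions, and for simplicity a tick is never split across two bars.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class VolumeBar:
    open: float
    high: float
    low: float
    close: float
    volume: int
    end_time: str  # timestamp of the last tick in the bar


def build_volume_bars(ticks: Iterable[Tuple[str, float, int]],
                      bar_volume: int) -> List[VolumeBar]:
    """Close a bar each time the accumulated traded volume reaches bar_volume.

    Assumption: ticks arrive in time order as (timestamp, price, volume).
    The trailing, incomplete bar is dropped.
    """
    bars: List[VolumeBar] = []
    prices: List[float] = []
    volume = 0
    for timestamp, price, size in ticks:
        prices.append(price)
        volume += size
        if volume >= bar_volume:
            bars.append(VolumeBar(prices[0], max(prices), min(prices),
                                  prices[-1], volume, timestamp))
            prices, volume = [], 0
    return bars
```

Because each bar closes on traded volume rather than on the clock, busy periods produce many bars and quiet periods few, which is the property the description relies on.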

FIG. 1 is a system configuration diagram of the stock sell and buy signal discriminator combining natural language features according to the present invention, and FIG. 2 is a schematic diagram conceptually illustrating each component.

The present invention uses several indicators extracted from the Naver stock bulletin board together with a bagging model, which is good at reducing overfitting, so as to respond to a changing market. To obtain the features needed for machine learning, crawling of the date, title, content, author, view count, like count, and dislike count of posts on the Naver stock discussion board was automated at 5-minute intervals, and these data were also saved as daily CSV files to build a database. In addition, to create a training data set for fine-tuning the BERT model, posts and their contents were crawled from the boards of the top 20 KOSPI stocks and the top 20 KOSDAQ stocks by market capitalization. Four people read each post and labeled it positive (1) or negative (0), completing the 'Natural Language Learning Data Set on the Positivity and Negativity of Stock Boards' (2019).

On the Naver stock board of Samsung Electronics, a representative Korean stock, the number of posts created during each bar's time span, based on posting time, was summed and used as a feature. For view counts, the crawl data collected every 5 minutes was preprocessed, and the increase in views over each 5-minute interval was used as a feature. For the sentiment feature, the BERT model was fine-tuned on the 'Natural Language Learning Data Set on the Positivity and Negativity of Stock Boards' (2019) to create the 'Stock Price Positive/Negative Degree Discriminator' (2019). This discriminator scored the positivity or negativity of each post on the Samsung Electronics Naver stock board, and the scores of the posts created during each bar's time span were averaged and used as a feature.

A strategy of buying when the moving average over 30 volume bars is above the moving average over 60 volume bars, and selling in the opposite case, determined buy or sell for each volume bar. A CUSUM filter was then used to extract only the investment times (timestamps) that involve a specified amount of volatility. For each such time the investment direction (buy = 1, sell = -1) was derived, and meta-labeling through the triple-barrier method was applied to these signals.
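The combination of a moving-average crossover and a CUSUM filter described above might look like the following sketch. The description does not say which CUSUM variant is used; the symmetric filter on bar-to-bar price changes shown here is an assumption, as are the function names and the `threshold` parameter. A tie between the two averages is treated as a sell.

```python
from typing import List, Optional, Tuple


def moving_average(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing simple moving average; None until `window` values are available."""
    out: List[Optional[float]] = [None] * len(values)
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        if i >= window - 1:
            out[i] = running / window
    return out


def crossover_side(closes: List[float], fast: int, slow: int) -> List[Optional[int]]:
    """+1 (buy) when the fast average is above the slow one, otherwise -1 (sell)."""
    fast_ma = moving_average(closes, fast)
    slow_ma = moving_average(closes, slow)
    return [None if f is None or s is None else (1 if f > s else -1)
            for f, s in zip(fast_ma, slow_ma)]


def cusum_events(closes: List[float], threshold: float) -> List[int]:
    """Symmetric CUSUM filter (assumed variant): flag bar i when the cumulative
    upward or downward move since the last reset exceeds `threshold`."""
    events: List[int] = []
    s_pos = s_neg = 0.0
    for i in range(1, len(closes)):
        diff = closes[i] - closes[i - 1]
        s_pos = max(0.0, s_pos + diff)
        s_neg = min(0.0, s_neg + diff)
        if s_pos > threshold:
            s_pos = 0.0
            events.append(i)
        elif s_neg < -threshold:
            s_neg = 0.0
            events.append(i)
    return events


def primary_signals(closes: List[float], fast: int, slow: int,
                    threshold: float) -> List[Tuple[int, int]]:
    """(bar index, side) for every CUSUM event where both averages are defined."""
    sides = crossover_side(closes, fast, slow)
    return [(i, sides[i]) for i in cusum_events(closes, threshold)
            if sides[i] is not None]
```

With the windows from the paragraph above, a call would be `primary_signals(closes, 30, 60, threshold)`, where `threshold` has to be chosen for the instrument; the description does not give a value.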

The construction of the training data sets used in the system according to the present invention is described below.

First, regarding the financial data set: applying this methodology requires data on every trade as it occurs. We asked brokerages for this trade data (tick data), but none held more than one month of it, and even that had to be purchased at a high price. We therefore decided to collect all trade data (tick data) of the futures market that can represent the Korean stock market (KOSPI 200 futures) ourselves through a brokerage API, build a database from it, and use it as training data. The Kiwoom Securities API was used; collection was automated, and the collected data was saved as daily CSV files to build the tick database. This study used one year of KOSPI 200 futures tick data, collected from October 1, 2018 to September 30, 2019.

Next, regarding the Naver stock board data set: Korea's Naver stock discussion boards are a feature distinctive to the Korean stock market, where many market participants obtain stock-related information and exchange opinions. Does Big Data Matter?: Predicting Stock Returns using Online Stock Message Boards (Kim Jae-Hoon, 2019) showed that using the number of posts on the Naver stock discussion boards as a predictor can improve stock price prediction. In other words, some of the information on the boards, or people's reactions there, has explanatory power for stock prices.

To obtain the features needed for machine learning, crawling of the date, title, content, author, view count, like count, and dislike count of posts on the Naver stock discussion board was automated at 5-minute intervals, and these data were also saved as daily CSV files to build a database. In addition, to create a training data set for fine-tuning the BERT model, posts and their contents were crawled from the boards of the top 20 KOSPI stocks and the top 20 KOSDAQ stocks by market capitalization. Five people read each post and labeled it positive (1) or negative (0), completing the 'Natural Language Learning Data Set on the Positivity and Negativity of Stock Boards' (2019).

The generation of features related to the natural language of the Naver stock board is described below. To use the impact of changes on the market, and people's reactions to them, as variables, three quantifiable features were generated from the Naver stock board: the number of posts, the number of views, and the degree of positivity or negativity of post content. The number of posts and the number of views can be read as indicators of people's real-time interest, and the sentiment of posts was taken as a proxy for people's psychological reaction to the stock.

On the Naver stock board of Samsung Electronics, a representative Korean stock, the number of posts created during each bar's time span, based on posting time, was summed and used as a feature. For view counts, the crawl data collected every 5 minutes was preprocessed, and the increase in views over each 5-minute interval was used as a feature. For the sentiment feature, the BERT model was fine-tuned on the 'Natural Language Learning Data Set on the Positivity and Negativity of Stock Boards' (2019) to create the 'Stock Price Positive/Negative Degree Discriminator' (2019). This discriminator scored the positivity or negativity of each post on the Samsung Electronics Naver stock board, and the scores of the posts created during each bar's time span were averaged and used as the 'sentiment' feature.
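Aggregating the three board features per bar could be sketched as below. The input layout is an assumption made for illustration: times are plain numbers (for example seconds), each post already carries a sentiment score from the fine-tuned classifier, and view counts arrive as cumulative snapshots from each 5-minute crawl. The function name `bar_text_features` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def bar_text_features(bar_start: float, bar_end: float,
                      posts: List[Tuple[float, float]],
                      view_snapshots: List[Tuple[float, int]]
                      ) -> Dict[str, Optional[float]]:
    """Board features for one bar covering [bar_start, bar_end).

    posts: (posting time, sentiment score), score assumed in [0, 1].
    view_snapshots: (crawl time, cumulative view count), sorted by time.
    """
    scores = [score for time, score in posts if bar_start <= time < bar_end]
    post_count = len(scores)
    # Mean sentiment is undefined for a bar with no posts.
    sentiment = sum(scores) / post_count if scores else None

    # Sum the increases between consecutive crawls that end inside the bar.
    view_increase = 0
    for (_, before), (time, after) in zip(view_snapshots, view_snapshots[1:]):
        if bar_start <= time < bar_end:
            view_increase += max(0, after - before)

    return {"post_count": post_count,
            "view_increase": view_increase,
            "sentiment": sentiment}
```

How a bar with no posts should be encoded (here `None`) is a design choice the description leaves open.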

Next, the process of building the financial machine learning model that combines natural language features is described. Using the collected volume bars, we adopted a moving-average strategy (Trend Following Strategy) as the Primary Model: buy when the moving average over 500 trading-value bars is above the moving average over 4,000 trading-value bars, and sell in the opposite case. This determined buy (sections marked in black) or sell (unmarked sections) for each trading-value bar. A CUSUM filter was then used to extract only the investment times (timestamps) that involve a specified amount of volatility. For each such time the investment direction (buy = 1, sell = -1) was derived, and meta-labeling through the triple-barrier method was applied to these signals.
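Meta-labeling with the triple-barrier method can be illustrated with the sketch below. The barrier widths, the holding limit, and the rule that a trade which reaches the time barrier without touching the profit barrier is labeled 0 are assumptions; the description does not specify them.

```python
from typing import List


def meta_label(prices: List[float], event_idx: int, side: int,
               profit_take: float, stop_loss: float, max_holding: int) -> int:
    """Label one Primary Model signal with the triple-barrier method.

    Returns 1 if a trade opened at `event_idx` in direction `side`
    (+1 buy, -1 sell) reaches the profit-taking barrier first, and 0 if
    it hits the stop-loss barrier or the time barrier first (assumption).
    """
    entry = prices[event_idx]
    last = min(event_idx + max_holding, len(prices) - 1)
    for i in range(event_idx + 1, last + 1):
        # Return measured in the direction of the primary signal.
        ret = side * (prices[i] / entry - 1.0)
        if ret >= profit_take:
            return 1
        if ret <= -stop_loss:
            return 0
    return 0
```

The resulting 0/1 label says whether acting on the Primary Model's signal would have paid off, which is what the secondary classifier is then trained to predict.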

FIG. 3 is a graph showing, by way of example, the buy and sell sections and the investment signals of the Primary Model. Referring to FIG. 3, several technical indicators were added to build a feature matrix for each investment signal. The stock-related technical indicators of returns, autocorrelation, and volatility were computed over 1, 5, 10, 30, and 50 bars, and RSI and FracDiff were also used as variables carrying additional information about the market. As described above, the natural language feature was added as the 'sentiment' variable by averaging the sentiment scores over the period of the corresponding volume bar.
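A few of the technical features named above can be computed as in the following sketch. It covers only trailing return, trailing volatility, and RSI; autocorrelation and FracDiff are omitted. The RSI here uses simple sums of gains and losses over the window rather than Wilder smoothing, which is an assumption, as are the function names.

```python
import statistics
from typing import List


def bar_returns(closes: List[float]) -> List[float]:
    """Simple bar-to-bar returns."""
    return [b / a - 1.0 for a, b in zip(closes, closes[1:])]


def trailing_return(closes: List[float], n: int) -> float:
    """Return over the last n bars, measured at the latest bar."""
    return closes[-1] / closes[-1 - n] - 1.0


def trailing_volatility(closes: List[float], n: int) -> float:
    """Population standard deviation of the last n bar returns."""
    return statistics.pstdev(bar_returns(closes)[-n:])


def rsi(closes: List[float], n: int) -> float:
    """RSI over the last n price changes, simple-sum variant (assumption).

    Equals 100 * gains / (gains + losses); 50 when prices did not move.
    """
    changes = [b - a for a, b in zip(closes, closes[1:])][-n:]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if gains + losses == 0:
        return 50.0
    return 100.0 * gains / (gains + losses)
```

Calling these with n set to 5, 10, 30, and 50 at each signal bar would give one row of the feature matrix, to which the 'sentiment' value is appended.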

The feature matrix and labels generated this way were split so that October 2018 through June 2019 served as training data and July 2019 through September 2019 as the test data set. Two models were then trained on the training data: first, a bagging model using Purged K-fold CV, and second, a bagging model using sequential bootstrapping. To find suitable parameters, a grid search was run with max depth set to 2, 3, 5, and 8 and the number of estimators set to 10, 20, 50, and 100. The best parameters from this search determined Model 1 (general bagging, max_depth: 5, 100 estimators) and Model 2 (sequential bootstrapping bagging, max_depth: 3, 10 estimators), which were then trained.
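Sequential bootstrapping, which distinguishes Model 2 from ordinary bagging, can be sketched as follows. The idea is to draw training samples one at a time with probability proportional to how little each sample's label period overlaps the samples already drawn. The representation of a label as an inclusive `(start_bar, end_bar)` span and the function names are assumptions for illustration.

```python
import random
from typing import List, Tuple


def average_uniqueness(span: Tuple[int, int], counts: List[int]) -> float:
    """Mean of 1 / (c_t + 1) over the bars in `span`, where c_t is the number
    of already drawn labels that cover bar t."""
    start, end = span
    return sum(1.0 / (counts[t] + 1) for t in range(start, end + 1)) / (end - start + 1)


def sequential_bootstrap(spans: List[Tuple[int, int]], n_draws: int,
                         rng: random.Random) -> List[int]:
    """Draw sample indices one by one, favouring labels that overlap little
    with what has been drawn so far."""
    n_bars = max(end for _, end in spans) + 1
    counts = [0] * n_bars          # how many drawn labels cover each bar
    drawn: List[int] = []
    for _ in range(n_draws):
        weights = [average_uniqueness(span, counts) for span in spans]
        index = rng.choices(range(len(spans)), weights=weights, k=1)[0]
        drawn.append(index)
        start, end = spans[index]
        for t in range(start, end + 1):
            counts[t] += 1
    return drawn
```

Each base estimator of the bagging ensemble would be fit on one such draw, for example `sequential_bootstrap(spans, len(spans), random.Random(0))`, instead of a uniform bootstrap sample.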

The results of the discriminator according to the present invention are described below. FIGS. 4 and 5 show, for the results of applying Model 1 and Model 2 to the test data set (July to September 2019), the Classification Report of the sklearn module, the ROC curve, and three measures of each model's feature importance: MDI (Mean Decrease Impurity), MDA (Mean Decrease Accuracy), and SFI (Single Feature Importance). Specifically, FIG. 4 shows tables and graphs of the results for the model using bagging, meta-labeling, and Purged 4-fold, and FIG. 5 shows tables and graphs of the results for the model using sequential bootstrapping bagging, meta-labeling, and Purged 4-fold.
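The Purged 4-fold scheme mentioned for both models can be illustrated with the sketch below. It splits the samples into contiguous test blocks and removes from the training set every sample whose label period overlaps the test block. The `(start_bar, end_bar)` span representation is an assumption, no embargo period is applied, and the number of samples is assumed to be at least `n_splits`.

```python
from typing import Iterator, List, Tuple


def purged_kfold(spans: List[Tuple[int, int]],
                 n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, test indices) for contiguous test folds.

    spans: inclusive (start_bar, end_bar) label period per sample, sorted by
    start. Training samples overlapping the test period are purged.
    """
    n = len(spans)
    sizes = [n // n_splits + (1 if i < n % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        test_lo = min(spans[i][0] for i in test)
        test_hi = max(spans[i][1] for i in test)
        train = [i for i in range(n)
                 if i not in test
                 and (spans[i][1] < test_lo or spans[i][0] > test_hi)]
        yield train, test
        start += size
```

With `n_splits=4` this gives the four folds named in the figure descriptions; purging keeps information from overlapping label periods from leaking between training and validation data.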

Using FIGS. 4 and 5, we first use the recall rate to gauge whether the Trend Following Strategy adopted as the Primary Model suited the test period. Second, we compare the two classifiers' accuracy in predicting investment success and investment failure, to see whether each classifier is better suited to risk management or to profit seeking. Finally, we examine feature importance from several angles to check which features played an important role in classification and how much influence the natural language features had.

Recall can serve as an indicator of the Primary Model, which gives the direction and timing of stock investment. The present invention applied the Trend Following Strategy as the Primary Model. Model 1 showed a recall rate of 0.54 for investment success, and Model 2 showed 0.50. A low recall rate can be interpreted as the Primary Model having misjudged the direction and timing of investment in the first place. The Trend Following Strategy is generally known to suit periods of a sustained uptrend or downtrend. The test period, July to September 2019, was a time when Korean stocks moved sideways within a trading range, so a low recall rate in this period matches expectations. This confirmed that, beyond a good machine learning algorithm, the Primary Model, that is, the investment strategy itself, can be a very important factor in raising classifier performance.

Looking at precision, Model 1 showed 77% for investment success and 63% for investment failure, and Model 2 showed 74% for investment success and 63% for investment failure. Because Model 2, the classifier using sequential bootstrapping, is designed to overfit less than the classifier using general bagging, its lower precision is consistent with the intent of the study. Precision was also higher overall for investment success than for investment failure, which suggests the classifier is better used in profit-seeking models than in risk management.
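For reference, the per-class precision and recall figures discussed above are defined as in the following sketch. The labels in the usage example are made up for illustration and are not the patent's test data; 1 stands for investment success and 0 for investment failure.

```python
from typing import List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int],
                     positive: int = 1) -> Tuple[float, float]:
    """Precision and recall for the class `positive`.

    precision = TP / (TP + FP): how often a predicted `positive` is right.
    recall    = TP / (TP + FN): how many actual `positive` cases are found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
success = precision_recall(y_true, y_pred, positive=1)
failure = precision_recall(y_true, y_pred, positive=0)
```

Computing both metrics for each class separately is what the Classification Report in FIGS. 4 and 5 tabulates.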

FIG. 6 is a graph showing the results of testing the discriminator according to the present invention.

The significance of the present invention is that it attempts to build a model reflecting natural language data in predicting financial market conditions, and that it produces a model robust to market conditions by using various methodologies including meta-labeling. Overall the results matched expectations, and the 74% classification accuracy for investment success is encouraging. The results on real data also showed that the Primary Model should change with the current state of the stock market, and therefore that one should hold several investment strategies and use different strategies and features depending on the market, which was a newly gained insight.

The limited amount of data, however, was a shortcoming. The collection period was short, about one year, and the data covered only a single instrument, KOSPI 200 futures. If the above methodology could be applied to historical high-frequency data for all stocks through a partnership with Koscom, simulations by industry and by period would help derive a variety of investment strategies and the features that matter in each situation. We also found that, to strengthen the influence of natural language data, which is important but could not be confirmed in this study, combinations other than positive/negative sentiment need to be tried. Going forward, it will be important to address these problems, collect trade data by industry group, and build a variety of Primary Models for different market conditions through continued simulation. If the important features for each market condition are then identified and assembled into a feature matrix, and the features appropriate to the current market are computed from real-time data, the performance of a buy and sell discriminator that is unaffected by market conditions can be improved.

The present invention has been described above with reference to its preferred embodiment, but this is only an example and does not limit the invention. A person of ordinary skill in the art to which the invention pertains will understand that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. Differences related to such modifications and applications should be construed as falling within the scope of the invention defined in the appended claims.

Claims

A stock sell and buy signal discriminator combining natural language features, characterized in that it builds a model that reflects natural language data and uses various methodologies, including meta-labeling, to discriminate stock sell and buy signals that are robust to market conditions.