CN106022522A

CN106022522A - Method and system for predicting stocks based on big data published by internet

Info

Publication number: CN106022522A
Application number: CN201610338598.4A
Authority: CN
Inventors: 马健; 俞扬
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-05-20
Filing date: 2016-05-20
Publication date: 2016-10-12

Abstract

The invention discloses a method and system for predicting stocks based on big data published on the internet. The method comprises the following steps: crawling stock-related information before a trading day; extracting features from the crawled data, constructing a training dataset, and training a prediction model with Group Lasso, where the model is evaluated by the return achieved over a period of time under an operating mode in which, at each day's open, the stocks bought on the previous trading day are sold and the stocks recommended for the current trading day are bought; and then constructing a new test set from the data crawled on the trading day and applying the trained prediction model to obtain the final recommended stocks. The method and system provide a new, useful, and reliable source of information for quantitative stock selection or stock prediction; combined with traditional information, the added data reflects the market more fully. On this basis, the stock prediction model obtained with machine learning techniques can better capture the market's internal operating mechanism and can effectively improve investor returns.

Description

A method and system for predicting stocks based on big data published on the internet

Technical field

The present invention relates to a big-data stock prediction method, and in particular to a big-data stock prediction method and system based on information published on the internet, such as investor operations, analyst predictions, investor comments, news, announcements, historical stock prices, capital flows, and fundamentals.

Background art

Before the 1970s, equity investment was a matter of qualitative analysis; it made no use of data and was instead a subjective art. With the spread of computers, many people began to study the rules that drive changes in stock prices, traditional fundamental research methods were replaced by models, the concepts of the price-to-earnings ratio and HSBC were born, and quantitative investment rose from there.

The move from subjective judgment to quantitative investment was a shift from art to science. Before the 1970s, a fundamental researcher could follow only 20 to 50 stocks, so coverage was very limited. Quantitative models made it possible to cover all stocks, which was a great leap. In addition, as computer processing power developed, the consumption of information also changed dramatically. In the past, looking at three indicators was enough; now more and more indicators are consulted, and the predictions made are increasingly accurate.

With the arrival of the 21st century, quantitative investment ran into a new bottleneck: homogeneous competition. The quantitative models of different institutions have increasingly converged, so that investment results rise and fall together. "Can larger data be used to find patterns before the reported data is seen?" This is the problem that big-data strategy entrepreneurs are trying to solve.

The investment model designed by Robert Shiller, winner of the 2013 Nobel Prize in Economics, is still widely praised in the industry. His model refers mainly to three variables: the planned cash flow of the investment project, the estimated cost of the company's capital, and the stock market's reaction to the investment (market sentiment). He holds that the market itself carries subjective judgment, that investor sentiment affects investment behavior, and that investment behavior directly affects asset prices. Computers analyze news, research reports, social information, search behavior, and the like, and extract useful information through natural language processing. With intelligent analysis by machine learning, where quantitative investment in the past could cover only dozens of strategies, big-data investment can cover thousands.

It follows that traditional stock prediction analyzes and predicts stocks solely on the basis of information such as historical price trends, capital flows, and each stock's market capitalization and price-to-earnings ratio. Today the internet deeply affects many traditional industries. Compared with decades ago, before the internet was invented or before it became widespread, the internet now holds a great deal of stock-related data beyond the traditional stock data, including publicly disclosed actual operations by investors, analyst predictions, investor comments, news, and announcements. To some extent this information reflects the current stock market and can also reveal expected reactions to the future market. The present invention attempts to combine these new, useful data with traditional data and to build a big-data stock prediction model using techniques such as natural language processing and machine learning.

Summary of the invention:

Objective of the invention: in view of the problems in the prior art, the present invention proposes a big-data quantitative stock selection method and system based on the operating behavior of investors and analysts disclosed on the internet, to serve as an investment reference for individual investors, fund companies, and the like.

Technical solution: the present invention proposes a method for predicting stocks based on big data published on the internet, comprising the following steps:

1) Crawl stock-related information before the trading day;

The specific crawling method is: first crawl a number of proxy IPs, then use the Scrapy framework to crawl data from the relevant websites, convert the data into JSON format, and store it in a MongoDB database;

The information crawled includes, from websites such as Snowball (Xueqiu), Gold Compass, Guba, Phoenix Finance, and Sina Finance, investors' stock operations, analyst predictions, investor comments, news, and announcements concerning each stock, as well as each stock's historical price data, market capitalization, return on net assets, return on assets, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and return and volatility over the most recent month.

2) Extract features from the data crawled in step 1, construct a training dataset, and train the prediction model with Group Lasso;

Construction of the training dataset: the training dataset consists of the data for the 5 trading days of the week preceding the current trading day. For each of these 5 trading days, every stock is represented by features and a class label. The features are a vector obtained by processing the relevant information; the label indicates whether the stock's price rises on the next trading day, 1 if it rises and 0 otherwise. This yields the initial training matrix. Because the data contains redundancy, this step first filters out data that carries too little information. The specific filtering criterion is: discard samples for which investors performed fewer than 10 operations on the stock that day in the crawled data.

The vector characterizing a stock is extracted as follows. For investor operation data, investors are divided into 10 groups according to their return over the previous month. For each group, and for each of the time windows covering the previous 1, 3, 7, 15, and 30 days, features such as the group's number of buys, number of sells, position size, and change in position for the stock, and the group's average return in that window, are extracted;

For analyst prediction data, features such as the number of analyst buy calls and sell calls on the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days are extracted;

For investor comment data, features such as the number of comments on the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days, and the mean and variance of the sentiment values of those comments, are extracted;

For news data, features such as the number of news items about the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days, and the mean and variance of the sentiment values of those news items, are extracted;

For announcement data, features such as the number of announcements about the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days, and the total number of occurrences in each announcement of words from the announcement keyword library, are extracted;

For historical stock price data, features such as the ratios of the stock's opening, closing, highest, and lowest prices in each of the time windows covering the previous 1, 3, 7, 15, and 30 days to the price 30 days earlier, and the slopes of the 3-day, 7-day, 10-day, 15-day, and 30-day lines, are extracted;

For capital flow data, features such as the ratio of main-force capital inflow to outflow for the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days are extracted;

For other information, features such as the stock's current market capitalization, return on net assets, return on assets, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and return and volatility over the most recent month are extracted;

For text data such as investor comments, news, and announcements, natural language processing is first used to segment the text into words on the basis of two dictionaries: a financial sentiment dictionary and an announcement keyword library. The sentiment value of each investor comment, news item, and so on is then calculated from the financial sentiment words that appear in the text, as is the number of times the corresponding keywords occur in each announcement. The financial sentiment dictionary lists stock sentiment keywords together with the sentiment score of each keyword; the announcement keyword library lists keywords related to announcements. Both dictionaries were obtained by manual crowdsourced annotation.
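The dictionary-based text features can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the texts are assumed to be already segmented into words, a text's sentiment value is assumed to be the sum of the dictionary scores of its words, the variance is taken as the population variance, and the dictionary entries and function names are hypothetical.

```python
from statistics import mean, pvariance


def sentiment_value(tokens, sentiment_dict):
    """Sum the dictionary scores of the sentiment words found in one text (assumed rule)."""
    return sum(sentiment_dict.get(token, 0.0) for token in tokens)


def text_features(tokenized_texts, sentiment_dict):
    """Count, mean, and variance of sentiment values over the texts of one time window."""
    values = [sentiment_value(tokens, sentiment_dict) for tokens in tokenized_texts]
    if not values:
        return {"count": 0, "mean": 0.0, "variance": 0.0}
    return {"count": len(values), "mean": mean(values), "variance": pvariance(values)}


def keyword_hits(tokens, keywords):
    """Total number of occurrences of announcement keywords in one announcement."""
    return sum(1 for token in tokens if token in keywords)


# Hypothetical dictionary entries and already-segmented texts.
sentiment_dict = {"上涨": 1.0, "利好": 0.5, "下跌": -1.0}
comments = [["上涨", "利好"], ["下跌"]]
stats = text_features(comments, sentiment_dict)

announcement = ["公司", "重组", "公告", "重组"]
hits = keyword_hits(announcement, {"重组", "增发"})
```

In practice the segmentation step would come from a word-segmentation tool configured with the two dictionaries; that step is left out here.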

In feature extraction for investor operation data, investors are divided into 10 groups according to their return over the previous month, and each of these groups corresponds to one group (Group) of features. The features within a group are strongly correlated, while the correlation between features of different groups is weak. During model training, the features in the same group should therefore be considered as a whole. The Group Lasso algorithm from machine learning takes this into account well, which is why Group Lasso is chosen.
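The grouping of investors and the per-window counts can be sketched as follows. This is an illustrative sketch only: the text does not say how the 10 groups are cut, so equal-sized rank buckets are assumed, a window of n days is assumed to cover operations made 1 to n days before the current day, and all names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

WINDOWS = (1, 3, 7, 15, 30)  # look-back windows in days


def assign_groups(last_month_return, n_groups=10):
    """Rank investors by last month's return and split them into equal-sized buckets.

    Group 0 holds the lowest returns (assumed convention).
    """
    ranked = sorted(last_month_return, key=lambda inv: (last_month_return[inv], inv))
    n = len(ranked)
    return {inv: min(i * n_groups // n, n_groups - 1) for i, inv in enumerate(ranked)}


def group_window_counts(operations, groups, today):
    """Count buys and sells of one stock per (group, window, side).

    operations: iterable of (investor, day, side), side being "buy" or "sell".
    """
    counts = defaultdict(int)
    for investor, day, side in operations:
        age = (today - day).days
        if investor not in groups or age < 1:
            continue  # unknown investor, or not strictly before today
        for window in WINDOWS:
            if age <= window:
                counts[(groups[investor], window, side)] += 1
    return dict(counts)


today = date(2016, 5, 20)
returns = {"u%02d" % i: i * 0.01 for i in range(20)}  # made-up returns
groups = assign_groups(returns)
operations = [
    ("u19", today - timedelta(days=1), "buy"),
    ("u19", today - timedelta(days=5), "sell"),
    ("u00", today - timedelta(days=40), "buy"),  # older than every window
]
counts = group_window_counts(operations, groups, today)
```

Each (group, window, side) count becomes one coordinate of the feature vector, and all coordinates that share a group index form one block for Group Lasso.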

The Group Lasso algorithm is expressed as follows:

$$\hat{\beta}_{\lambda} = \underset{\beta}{\arg\min} \left( \| Y - X\beta \|_{2}^{2} + \lambda \sum_{g=1}^{G} \| \beta_{I_{g}} \|_{2} \right)$$

where $\hat{\beta}_{\lambda}$ is the model training result, $X$ is the training sample matrix, $Y$ is the label vector of the samples, $I_{g}$ denotes the feature indices belonging to the $g$-th group, with $g = 1, \ldots, G$, and $\beta_{I_{g}}$ denotes the trained weights corresponding to the feature indices of the $g$-th group.
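One way to minimize this objective is proximal gradient descent with block soft-thresholding. The sketch below is a small pure-Python illustration of that technique, chosen here for clarity; the text does not state which solver is used. The names `fit_group_lasso`, `lam` (standing for the parameter λ), and the toy data are assumptions.

```python
import math


def objective(X, y, beta, groups, lam):
    """Squared-error loss plus lam times the sum of per-group Euclidean norms."""
    residual = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
    loss = sum(r * r for r in residual)
    penalty = sum(math.sqrt(sum(beta[j] ** 2 for j in g)) for g in groups)
    return loss + lam * penalty


def group_soft_threshold(v, threshold):
    """Shrink the whole block v toward zero; the block vanishes if its norm is small."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= threshold:
        return [0.0] * len(v)
    scale = 1.0 - threshold / norm
    return [scale * x for x in v]


def fit_group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent; groups is a list of index lists I_g."""
    p = len(X[0])
    beta = [0.0] * p
    # Conservative step size: 1 / (2 * squared Frobenius norm of X).
    step = 1.0 / (2.0 * sum(x * x for row in X for x in row))
    for _ in range(n_iter):
        residual = [sum(x * b for x, b in zip(row, beta)) - yi for row, yi in zip(X, y)]
        grad = [2.0 * sum(row[j] * r for row, r in zip(X, residual)) for j in range(p)]
        moved = [b - step * g for b, g in zip(beta, grad)]
        for g in groups:
            shrunk = group_soft_threshold([moved[j] for j in g], step * lam)
            for j, value in zip(g, shrunk):
                moved[j] = value
        beta = moved
    return beta


# Toy data: identity design, two groups of two features each.
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
y = [3.0, 4.0, 0.3, 0.4]
groups = [[0, 1], [2, 3]]
beta = fit_group_lasso(X, y, groups, lam=2.0)
```

On this toy problem the weakly supported second group is set to zero as a whole while the first group is only shrunk, which is the group-level selection behavior the penalty is meant to give.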

During model training, cross-validation is used. In each round, the stocks in the test set are sorted by predicted probability in descending order and those with the highest probability are chosen. The model's parameters are then tuned according to the total return over two weeks under an operating mode in which, at each day's open, the stocks bought on the previous trading day are sold and the stocks recommended for the current trading day are bought.
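The return-based evaluation can be sketched as follows. This is a hedged illustration: equal weighting of the picked stocks, compounding of daily returns, and the absence of trading costs are assumptions that the text does not state, and `picks_for` is a hypothetical callback that returns a model's daily picks for a given parameter value.

```python
def portfolio_return(open_buy, open_sell, picks):
    """Equal-weight return of buying picks at one open and selling at the next open."""
    returns = [open_sell[code] / open_buy[code] - 1.0 for code in picks]
    return sum(returns) / len(returns)


def total_return(opens_by_day, picks_by_day):
    """Compound the open-to-open returns over consecutive trading days.

    opens_by_day: list of {stock code: opening price}, one dict per trading day.
    picks_by_day: picks for every day except the last, which is only used to sell.
    """
    wealth = 1.0
    for t, picks in enumerate(picks_by_day):
        if picks:
            wealth *= 1.0 + portfolio_return(opens_by_day[t], opens_by_day[t + 1], picks)
    return wealth - 1.0


def best_parameter(candidates, picks_for, opens_by_day):
    """Choose the candidate parameter whose picks give the highest total return."""
    return max(candidates, key=lambda lam: total_return(opens_by_day, picks_for(lam)))


# Made-up opening prices for three trading days.
opens = [{"A": 10.0, "B": 20.0}, {"A": 11.0, "B": 20.0}, {"A": 11.0, "B": 22.0}]
# Hypothetical picks produced by models trained with two parameter values.
picks_table = {0.1: [["A"], ["B"]], 1.0: [["B"], ["A"]]}
chosen = best_parameter([0.1, 1.0], picks_table.__getitem__, opens)
```

Here the picks for the first parameter value gain 10% on each of the two days, while the picks for the second gain nothing, so the first value is chosen.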

3) Construct a new test set from the data crawled on the current trading day, and use the prediction model trained in step 2 to make predictions and obtain the final recommended stocks.

The present invention also proposes a system for predicting stocks based on big data published on the internet, comprising a data crawling and storage module, a prediction model training module, and a stock prediction module. The data crawling and storage module crawls and stores stock-related information. The prediction model training module constructs a training dataset from the data crawled before the trading day and trains the prediction model with Group Lasso. The stock prediction module constructs a new test set from the data crawled on the current trading day and uses the trained prediction model to predict the final recommended stocks.

The system for predicting stocks based on big data published on the internet further comprises a display module, which presents the stock prediction results to the client.

Beneficial effects: the present invention provides a new, useful, and reliable source of information for quantitative stock selection or stock prediction. Compared with traditional data such as historical stock prices and capital flows, data such as investor operations, analyst predictions, news, announcements, and research reports are novel data sources. To some extent this information reflects the current stock market and can also reveal expected reactions to the future market. Because these data include a large amount of text, crawling and analyzing them in real time is more difficult than for traditional stock data. The present invention uses techniques such as a Scrapy-based crawler and natural language processing to crawl and process these types of data in real time, and their combination with traditional data such as historical stock prices and capital flows reflects the market more fully. Some of the features extracted by the present invention are obtained by dividing investors into multiple groups according to their return over the previous month. The features within a group are strongly correlated, while the correlation between features of different groups is weak, so during model training the features in the same group should be considered as a whole. The Group Lasso algorithm from machine learning takes this into account well, and the resulting stock prediction model is better able to capture the market's inherent operating mechanism, which substantially improves the returns delivered to investors.

Brief description of the drawings

Fig. 1 is the overall architecture diagram of the stock prediction system of the present invention;

Fig. 2 is the architecture diagram of the data crawling and storage module of the present invention;

Fig. 3 is the architecture diagram of the prediction model training module of the present invention;

Fig. 4 is the architecture diagram of the stock prediction module of the present invention.

Detailed description of the invention

The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the present invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.

Fig. 1 shows the overall architecture of the stock prediction system of the present invention, which comprises four modules: the data crawling and storage module, the stock prediction model training module, the stock prediction module, and the display module. The present invention is implemented in Python, and the database used is MongoDB.

The data crawling and storage module is shown in Fig. 2. The crawler uses the Scrapy framework. Scrapy is a fast, high-level web crawling system developed in Python, used mainly to visit relevant websites automatically and extract structured data from their pages. Scrapy uses the efficient Twisted asynchronous networking library to handle network communication. The overall Scrapy architecture is shown in Fig. 3.

In the crawler, to get around the anti-crawling measures of websites such as Snowball, a number of proxy IPs are crawled first. The Scrapy framework is then used to crawl data from websites such as Snowball (Xueqiu), Gold Compass, Guba, Phoenix Finance, Sina Finance, and Juchao Information (Cninfo), and the data is converted to JSON format and stored in the MongoDB database. From Snowball, the operation data of some investors, investor comments, news, announcements, and similar data can be crawled; from Gold Compass, analyst predictions and similar data; from Guba, investor comments and similar data; from Phoenix Finance and Sina Finance, news together with each stock's historical prices, capital flows, fundamentals, and similar data; and from Juchao Information, announcements and similar data.
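The proxy rotation and JSON conversion can be sketched with the standard library alone. This is not the Scrapy and MongoDB pipeline itself: `ProxyPool`, `to_document`, the field names, and the addresses are hypothetical, and the actual fetching and database writes are left out.

```python
import itertools
import json
from datetime import date


class ProxyPool:
    """Round-robin rotation over previously crawled proxy addresses."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        return next(self._cycle)


def to_document(site, stock_code, kind, payload, crawl_date):
    """Convert one crawled record into a JSON string ready for a document store."""
    doc = {
        "site": site,
        "stock_code": stock_code,
        "kind": kind,  # e.g. "operation", "comment", "news", "announcement"
        "date": crawl_date.isoformat(),
        "payload": payload,
    }
    # ensure_ascii=False keeps Chinese text readable in the stored JSON.
    return json.dumps(doc, ensure_ascii=False, sort_keys=True)


pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080"])  # placeholder addresses
assigned = [pool.next_proxy() for _ in range(3)]
record = to_document("xueqiu", "600000", "comment", {"text": "看涨"}, date(2016, 5, 19))
```

In a Scrapy project, the proxy would typically be attached to each request and the JSON document written out by an item pipeline; the sketch only shows the data handling around those steps.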

The stock prediction model training module is shown in Fig. 4. A training dataset for machine learning is constructed first; it consists of the data for the 5 trading days of the week preceding the current trading day. For each of these 5 trading days, each of the 2,780 A-share stocks is represented by features and a class label. The features are a vector of about 700 dimensions; the label indicates whether the stock's price rises on the next trading day, 1 if it rises and 0 otherwise. This gives a matrix of about (5 × 2780) × 701, which is the initial training set.

Table 1: Composition of the feature vector of about 700 dimensions

Because the data crawled for some stocks on some days is sparse, describing them with the original 700-dimensional vector may distort them. The stock prediction model training module therefore filters out data that carries too little information. The specific filtering criterion can be adjusted according to the evaluation criterion; at present, the invention discards samples for which investors performed fewer than 10 operations on the stock that day in the crawled data. This yields the filtered training set.
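The labeling and filtering rules can be sketched as follows. This is a minimal illustration with made-up data: the dictionary layout, the two-dimensional feature vectors, and the function name are assumptions, and a real row would carry about 700 features plus the label.

```python
def build_training_rows(features, next_day_change, operation_counts, min_operations=10):
    """Build rows of [features..., label], dropping low-information samples.

    All three inputs are keyed by (trading day, stock code).
    """
    rows = []
    for key in sorted(features):
        if operation_counts.get(key, 0) < min_operations:
            continue  # fewer than 10 investor operations that day: filtered out
        if key not in next_day_change:
            continue  # no next-day price yet, so no label
        label = 1 if next_day_change[key] > 0 else 0  # 1 if the price rises, else 0
        rows.append(list(features[key]) + [label])
    return rows


features = {
    ("2016-05-16", "600000"): [0.2, 1.0],
    ("2016-05-16", "000001"): [0.5, 3.0],
    ("2016-05-17", "600000"): [0.1, 2.0],
}
next_day_change = {
    ("2016-05-16", "600000"): 0.013,
    ("2016-05-16", "000001"): -0.004,
    ("2016-05-17", "600000"): 0.0,
}
operation_counts = {
    ("2016-05-16", "600000"): 25,
    ("2016-05-16", "000001"): 4,
    ("2016-05-17", "600000"): 12,
}
rows = build_training_rows(features, next_day_change, operation_counts)
```

The sample with only 4 operations is dropped, and an unchanged price is labeled 0 because the label is 1 only when the price rises.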

Model training is then carried out with the Group Lasso algorithm from machine learning, with statistics of the same type forming one group. Unlike traditional machine learning problems, the criterion for judging the model here is not accuracy, F1, or the like, but the return over the period obtained under an operating mode in which the model recommends 8 stocks each day and, at each day's open, the stocks bought on the previous trading day are sold and the stocks recommended for the current trading day are bought. The model's parameters are tuned accordingly. The Group Lasso algorithm is expressed as follows:

$$\hat{\beta}_{\lambda} = \underset{\beta}{\arg\min} \left( \| Y - X\beta \|_{2}^{2} + \lambda \sum_{g=1}^{G} \| \beta_{I_{g}} \|_{2} \right)$$

This yields the big-data stock prediction model. About 10 hours before the market opens on each trading day, the present invention trains the model for that day.

The prediction module of the big-data stock prediction model is shown in Fig. 4. Features are extracted from the data crawled that day to obtain the test dataset, giving 2,780 samples for the 2,780 A-share stocks. Samples with little information are then removed using the same filtering method as for the training data, giving the filtered test set. Finally, the trained big-data stock prediction model is applied to the filtered test set, and the 8 stocks with the highest output probability are selected as the recommended stocks for the next trading day.
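The final selection step can be sketched as follows. The function name, the stock codes, and the probabilities are made up for illustration; only the rule of filtering low-information samples and then keeping the 8 highest predicted probabilities comes from the text.

```python
import heapq


def recommend(probabilities, operation_counts, k=8, min_operations=10):
    """Return the k eligible stock codes with the highest predicted probability."""
    eligible = {
        code: p
        for code, p in probabilities.items()
        if operation_counts.get(code, 0) >= min_operations  # same filter as in training
    }
    # Ties on probability are broken by stock code so the result is deterministic.
    return heapq.nlargest(k, eligible, key=lambda code: (eligible[code], code))


probabilities = {"S%d" % i: i / 10 for i in range(10)}  # made-up model outputs
operation_counts = {code: 12 for code in probabilities}
operation_counts["S9"] = 3  # too few operations, so it is filtered out
picks = recommend(probabilities, operation_counts)
```

The stock with the highest probability is excluded here because it fails the information filter, which shows why the filter has to run before the ranking.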

Claims

1. A method for predicting stocks based on big data published on the internet, comprising the following steps:

1) crawling stock-related data before the trading day;

2) extracting features from the data crawled in step 1, constructing a training set, and training a big-data stock prediction model with the Group Lasso algorithm;

3) constructing a new test set from the data crawled on the current trading day, and using the prediction model trained in step 2 to make predictions and obtain the final recommended stocks.

2. The method for predicting stocks based on big data published on the internet according to claim 1, wherein the method of extracting stock information in step 1 is: first crawling a number of proxy IPs, then using the Scrapy framework to crawl data from the relevant websites, converting the data into JSON format, and storing it in a MongoDB database.

3. The method for predicting stocks based on big data published on the internet according to claim 1, wherein the information crawled in step 1 includes, from websites such as Snowball (Xueqiu), Gold Compass, Guba, Phoenix Finance, and Sina Finance, investors' stock operations, analyst predictions, investor comments, news, announcements, and research reports concerning each stock, as well as each stock's historical price data, market capitalization, return on net assets, return on assets, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and return and volatility over the most recent month.

4. The method for predicting stocks based on big data published on the internet according to claim 1, wherein step 2 filters out data that carries too little information, the specific filtering criterion being: discarding samples for which investors performed fewer than 10 operations on the stock that day in the crawled data.

5. The method for predicting stocks based on big data published on the internet according to claim 1, wherein the training dataset constructed in step 2 consists of the data for the 5 trading days of the week preceding the current trading day; for each of these 5 trading days, every stock is represented by features and a class label, the features being a vector obtained by processing the relevant information and the label indicating whether the stock's price rises on the next trading day, 1 if it rises and 0 otherwise, which yields the initial training matrix.

6. The method for predicting stocks based on big data published on the internet according to claim 5, wherein the vector characterizing a stock is extracted as follows:

For investor operation data, investors are divided into 10 groups according to their return over the previous month; for each group, and for each of the time windows covering the previous 1, 3, 7, 15, and 30 days, features such as the group's number of buys, number of sells, position size, and change in position for the stock, and the group's average return in that window, are extracted;

For analyst prediction data, features such as the number of analyst buy calls and sell calls on the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days are extracted;

For investor comment data, features such as the number of comments on the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days, and the mean and variance of the sentiment values of those comments, are extracted;

For news data, features such as the number of news items about the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days, and the mean and variance of the sentiment values of those news items, are extracted;

For announcement data, features such as the number of announcements about the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days, and the total number of occurrences in each announcement of words from the announcement keyword library, are extracted;

For historical stock price data, features such as the ratios of the stock's opening, closing, highest, and lowest prices in each of the time windows covering the previous 1, 3, 7, 15, and 30 days to the price 30 days earlier, and the slopes of the 3-day, 7-day, 10-day, 15-day, and 30-day lines, are extracted;

For capital flow data, features such as the ratio of main-force capital inflow to outflow for the stock in each of the time windows covering the previous 1, 3, 7, 15, and 30 days are extracted;

For other information, features such as the stock's current market capitalization, return on net assets, return on assets, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and return and volatility over the most recent month are extracted;

For text data such as investor comments, news, and announcements, natural language processing is first used to segment the text into words on the basis of two dictionaries, a financial sentiment dictionary and an announcement keyword library; the sentiment value of each investor comment, news item, and so on is then calculated from the financial sentiment words that appear in the text, as is the number of times the corresponding keywords occur in each announcement; the financial sentiment dictionary lists stock sentiment keywords together with the sentiment score of each keyword, the announcement keyword library lists keywords related to announcements, and both dictionaries are obtained by manual crowdsourced annotation.

7. The method for predicting stocks based on big data published on the internet according to claim 6, wherein, because investors are divided into 10 groups according to their return over the previous month in feature extraction for investor operation data, each of these groups corresponds to one group of features; the features within a group are strongly correlated, while the correlation between features of different groups is weak; so that the features in the same group are considered as a whole, the Group Lasso algorithm from machine learning is used on this basis to train the prediction model, the Group Lasso algorithm being expressed as follows:

$$\hat{\beta}_{\lambda} = \underset{\beta}{\arg\min} \left( \| Y - X\beta \|_{2}^{2} + \lambda \sum_{g=1}^{G} \| \beta_{I_{g}} \|_{2} \right)$$

where $\hat{\beta}_{\lambda}$ is the model training result, $X$ is the training sample matrix, $Y$ is the label vector of the samples, $I_{g}$ denotes the feature indices belonging to the $g$-th group, with $g = 1, \ldots, G$, and $\beta_{I_{g}}$ denotes the trained weights corresponding to the feature indices of the $g$-th group;

During model training, cross-validation is used; in each round, the stocks in the test set are sorted by predicted probability in descending order and those with the highest probability are chosen, and the model's parameters are then tuned according to the total return over two weeks under an operating mode in which, at each day's open, the stocks bought on the previous trading day are sold and the stocks recommended for the current trading day are bought.

8. A system for predicting stocks based on big data published on the internet, comprising a data crawling and storage module, a prediction model training module, and a stock prediction module; wherein the data crawling and storage module crawls and stores stock-related information; the prediction model training module constructs a training dataset from the data crawled before the trading day and trains the prediction model with Group Lasso; and the stock prediction module constructs a new test set from the data crawled on the current trading day and uses the trained prediction model to predict the final recommended stocks.

9. The system for predicting stocks based on big data published on the internet according to claim 8, further comprising a display module for presenting the stock prediction results to the client.