A stock news quantization method and system based on artificial intelligence
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a stock news quantization method and system based on artificial intelligence.
Background technology
Stock price prediction refers to predicting the rise-and-fall trend or price movement of a stock over a future period, using the stock's historical price information and related market information. In recent years, deep learning methods have made much progress in the field of natural language processing, and they have gradually been applied to stock prediction as well.
TH Nguyen et al. predict stock prices using a topic model. In [Topic modeling based sentiment analysis on social media for stock market prediction], they propose a topic model that fuses sentiment and topics, and apply it to the analysis of stock-related news. After obtaining a topic distribution vector for each news item, they add this topic vector to the features used for stock prediction and obtain good prediction results. However, this approach ignores the features specific to the financial domain itself.
Besides news related to stocks, content from mass media and social media has also been used for stock prediction. Johan Bollen et al., in [Twitter mood predicts the stock market], predict the rise and fall of the stock market from Twitter content. They analyze the daily public mood on Twitter using tools such as OpinionFinder, then add these affective features to the forecast model to predict market movements. However, this method can only predict the overall state of the stock market; it is not suitable for predicting individual stocks.
News related to a stock is generally more relevant to the development of the stock itself, and it often contains polarity terms such as "favourable". Therefore Zeya Zhang et al., in the related work [Stock prediction: a method based on extraction of news features and recurrent neural networks], use the distribution of favourable-polarity segments in the news as a feature, and feed it together with historical price information into a recurrent neural network. However, news text contains rich information, and considering only favourable polarity is insufficient.
Summary of the invention
The technical problem to be solved by the present invention is to provide a stock news quantization method and system based on artificial intelligence, so as to solve the problem that the news reference factors of existing stock prediction are one-sided.
To achieve this goal, the technical solution adopted by the present invention is as follows:
A stock news quantization method based on artificial intelligence, comprising the steps of:
obtaining the stock news sequence of trading days within a preset time;
dividing the stock news sequence into word sequences according to a preset length;
judging whether the stock news is news of the current trading day; if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
if the stock news is not news of the current trading day, obtaining the document vector feature of the news using fastText.
Further, the step of obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe specifically includes:
using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
using GloVe to obtain a second word vector feature based on global information;
concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
Further, the step of obtaining the linear relationship between lexical items using Word2Vec specifically includes:
setting the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtaining the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
building a Skip-Gram model, and obtaining the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Further, the step of obtaining the second word vector feature based on global information using GloVe specifically includes:
constructing on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Further, the step of obtaining the document vector feature of the news using fastText specifically includes:
labeling each stock news item with the rise or fall of the next day's stock price as the label;
performing supervised classification training using fastText;
computing the document vector feature of each stock news item using the trained model.
A stock news quantization system based on artificial intelligence, including:
an acquisition module, for obtaining the stock news sequence of trading days within a preset time;
a division module, for dividing the stock news sequence into word sequences according to a preset length;
a word vector module, for judging whether the stock news is news of the current trading day and, if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
a document vector module, for obtaining the document vector feature of the news using fastText if the stock news is not news of the current trading day.
Further, the word vector module specifically includes:
a first model unit, for using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
a second model unit, for obtaining a second word vector feature based on global information using GloVe;
a concatenation unit, for concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
Further, the first model unit specifically:
sets the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtains the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
builds a Skip-Gram model and obtains the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Further, the second model unit specifically:
constructs on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Further, the document vector module specifically includes:
a labeling unit, for labeling each stock news item with the rise or fall of the next day's stock price as the label;
a classification unit, for performing supervised classification training using fastText;
a computing unit, for computing the document vector feature of each stock news item using the trained model.
Compared with the prior art, the present invention has the following advantages:
the present invention extracts news features through three different vector representation learning methods, so that news as a reference factor is more comprehensive and the prediction accuracy is higher.
Brief description of the drawings
Fig. 1 is a flowchart of the stock news quantization method based on artificial intelligence provided in Embodiment 1;
Fig. 2 is a structural diagram of the stock news quantization system based on artificial intelligence provided in Embodiment 2.
Detailed description of the embodiments
The technical solution of the present invention is further described below through specific embodiments with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
Embodiment one
This embodiment provides a stock news quantization method based on artificial intelligence, as shown in Fig. 1, including the steps:
S11: obtaining the stock news sequence of trading days within a preset time;
S12: dividing the stock news sequence into word sequences according to a preset length;
S13: judging whether the stock news is news of the current trading day; if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
S14: if the stock news is not news of the current trading day, obtaining the document vector feature of the news using fastText.
Text-based word vector representation and document representation methods have become popular in recent years. By training on a large corpus, a feature matrix is learned for each word or article so that similar words lie closer together in the vector space, which allows text to be quantized effectively.
In applications such as predicting the rise and fall of the stock market or of stock prices, text is one of the most important reference factors, and good quantized features of the text are often needed for reference.
This embodiment trains text vectors on a large amount of stock news, and further processes the quantized historical news with a recurrent neural network, so as to extract historical news features related to the stock as a news feature representation with temporal order.
This embodiment uses the news or agency reports related to the stock on each trading day as text data, and then extracts news features using three different vector learning methods: Word2Vec, GloVe and fastText.
Because the news of the current day usually has a large impact on the stock, word-level quantized representations are used for the headline and content of the current day's news, while historical news is quantized and labeled at the level of text fragments.
In this embodiment, step S11 obtains the stock news sequence of trading days within a preset time.
Specifically, a stock may generate some related news or agency reports D_t on each trading day, and these are usually closely related to the development of the stock itself. The news sequence of the trading days over N consecutive days is recorded as:
D = (D_1, D_2, ..., D_N).
In this embodiment, step S12 divides the stock news sequence into word sequences according to a preset length.
Specifically, with the word sequence length of each news item denoted l_t, the news D_t can be expressed as the sequence:
D_t = (w_1, w_2, ..., w_{l_t}).
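As a minimal sketch of step S12, the division into word sequences could look as follows. Whitespace tokenization, the example texts, and the 4-token preset length are illustrative assumptions; the embodiment does not specify a segmenter.

```python
# Hypothetical sketch of step S12: dividing each news item into a word
# sequence of at most `preset_length` lexical items.

def to_word_sequence(news_text, preset_length):
    """Return at most `preset_length` whitespace-separated tokens."""
    tokens = news_text.split()
    return tokens[:preset_length]

day_news = [
    "shares rally after earnings beat analyst forecasts",
    "regulator opens inquiry into the company",
]
sequences = [to_word_sequence(n, 4) for n in day_news]
```

Each resulting sequence is then the input to the word-level (same-day) or document-level (historical) quantization described below.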
In this embodiment, step S13 judges whether the stock news is news of the current trading day and, if so, obtains the word vector feature of each lexical item of the news using Word2Vec and GloVe.
The step in S13 of obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe specifically includes:
using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
using GloVe to obtain a second word vector feature based on global information;
concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
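The merging of the first and second word vector features described above amounts to simple vector concatenation; a sketch with illustrative 3-dimensional vectors (the values are not trained embeddings):

```python
# Sketch of the concatenation unit: the Word2Vec feature and the GloVe
# feature of the same lexical item are joined into one longer vector.

def concat_features(w2v_vec, glove_vec):
    return list(w2v_vec) + list(glove_vec)

w2v_feature = [0.1, -0.2, 0.3]    # first word vector feature (Word2Vec)
glove_feature = [0.4, 0.0, -0.1]  # second word vector feature (GloVe)
merged = concat_features(w2v_feature, glove_feature)
```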
In step S13, the step of obtaining the linear relationship between lexical items using Word2Vec specifically includes:
setting the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtaining the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
building a Skip-Gram model, and obtaining the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Specifically, Word2Vec is an unsupervised learning method for word vector representation that uses a neural network to find representations of words in a vector space; such methods assume that similar words co-occur in text with higher probability. This embodiment mainly uses its Skip-Gram model, which predicts the context words of the current word and learns the word vectors by maximizing the conditional probability.
Because the Skip-Gram model predicts all context words from the current word, it makes more efficient use of the training corpus, and experiments show that this word vector representation captures the linear relationships between words well; its performance on many tasks is slightly better than that of statistics-based word vector models such as SVD.
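The two Skip-Gram quantities defined above, the context set Context(w_i) and the softmax conditional probability P(o | w_i), can be sketched directly in plain Python. The toy sentence and vectors are illustrative, not trained values.

```python
import math

def context(words, i, k):
    """Context(w_i): lexical items within distance k of position i."""
    return words[max(0, i - k):i] + words[i + 1:i + k + 1]

def cond_prob(o, center, U, V):
    """Softmax P(o | w_i) = exp(u_o . v_wi) / sum_w exp(u_w . v_wi)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(u, V[center])) for w, u in U.items()}
    return scores[o] / sum(scores.values())

sentence = ["stock", "price", "rises", "after", "news"]
U = {"stock": [0.2, 0.1], "price": [0.4, -0.3], "rises": [0.0, 0.5]}  # outer vectors
V = {"price": [0.3, 0.2]}                                            # inner vector of the centre word
ctx = context(sentence, 2, 1)          # context of "rises" with k = 1
p = cond_prob("stock", "price", U, V)
```

Training maximizes this probability over all (centre, context) pairs, which is exactly what the loss function J(θ) above sums.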
In step S13, the step of obtaining the second word vector feature based on global information using GloVe specifically includes:
constructing on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Specifically, although prediction-based word vector models such as Word2Vec capture the linear relationships between words well, such models do not make full use of statistical information, and their training time depends on the corpus size. Statistics-based word vector models have the opposite strengths and weaknesses: they make full use of global statistical information to obtain latent semantic vectors.
GloVe takes the best of both: by constructing conditions similar to those of the Skip-Gram model on a co-occurrence matrix, it obtains a word vector model based on global information, and it is likewise an unsupervised representation method.
Compared with the Skip-Gram model, GloVe makes full use of the global statistics of the corpus while still capturing the linear relationships between words well.
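As an illustration of the GloVe construction described above, the co-occurrence matrix X can be counted from the corpus and the weighted least-squares loss J evaluated on it. The weighting function f follows the standard GloVe form; the tiny corpus and the vectors w, w̃ and biases b, b̃ are illustrative, untrained values.

```python
import math
from collections import Counter

def cooccurrence(sentences, k):
    """X_ij: how often word j occurs within distance k of word i."""
    X = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - k):i] + sent[i + 1:i + k + 1]:
                X[(w, c)] += 1
    return X

def glove_loss(X, w, w_tilde, b, b_tilde, x_max=100, alpha=0.75):
    """J = sum f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    f = lambda x: (x / x_max) ** alpha if x < x_max else 1.0
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    return sum(
        f(x) * (dot(w[i], w_tilde[j]) + b[i] + b_tilde[j] - math.log(x)) ** 2
        for (i, j), x in X.items()
    )

corpus = [["stock", "price", "rises"], ["stock", "price", "falls"]]
X = cooccurrence(corpus, k=1)
vocab = {"stock", "price", "rises", "falls"}
w = {v: [0.1, 0.1] for v in vocab}
w_t = {v: [0.1, 0.1] for v in vocab}
b = {v: 0.0 for v in vocab}
b_t = {v: 0.0 for v in vocab}
loss = glove_loss(X, w, w_t, b, b_t)
```

Minimizing this loss over w, w̃, b, b̃ (e.g. by gradient descent) yields the global-information word vectors.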
In this embodiment, if the stock news is not news of the current trading day, step S14 obtains the document vector feature of the news using fastText.
Step S14 specifically includes the steps of:
labeling each stock news item with the rise or fall of the next day's stock price as the label;
performing supervised classification training using fastText;
computing the document vector feature of each stock news item using the trained model.
Specifically, because word vector representations focus more on the local information of each word, fastText document vector features are used here, which better capture the global information of a document.
fastText is a supervised learning method mainly used for classification tasks, and the task of this embodiment is the binary classification of a stock's rise and fall. Therefore, in this embodiment each news item is first labeled, with the rise or fall of the next day's stock price as the label, and then supervised classification training is performed with fastText. Finally, the trained model is used to compute a document vector representation feature for each news item.
The input layer of fastText uses the word features and N-gram features of a document; the neural network of the middle layer then applies a linear transformation to the feature vector, and the nonlinear activation function of the output layer finally maps it to the label probabilities.
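The labeling step described above, taking the rise or fall of the next day's stock price as the label, can be sketched as follows. The alignment of exactly one closing price per news day is an assumption, as are the example texts and prices; the fastText training itself is omitted.

```python
# Hypothetical sketch of the supervised labels for fastText: each news item
# of day t is labeled 1 if the closing price of day t+1 is higher than that
# of day t, and 0 otherwise. The last day has no next-day price, so it is
# not labeled.

def label_news(news_by_day, close_prices):
    labeled = []
    for t in range(len(news_by_day) - 1):
        label = 1 if close_prices[t + 1] > close_prices[t] else 0
        labeled.extend((doc, label) for doc in news_by_day[t])
    return labeled

news_by_day = [["earnings beat"], ["ceo resigns", "guidance cut"], ["buyback announced"]]
close_prices = [10.0, 10.5, 9.8]
samples = label_news(news_by_day, close_prices)
```

The resulting (document, label) pairs are then what a supervised classifier such as fastText would be trained on.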
Because news has a certain temporal order, a recurrent neural network can better capture the information in the news. A bidirectional recurrent network consists of a forward recurrent neural network and a backward recurrent neural network; compared with a unidirectional recurrent network, it has the advantage of also capturing information in the reverse direction of the sequence, which is particularly effective for text feature processing. Therefore, this embodiment processes the news features with a bidirectional recurrent neural network, and the outputs of the multiple recurrent neural networks are merged to obtain the required news text quantization features.
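The bidirectional recurrent processing described above can be sketched with a minimal scalar RNN: one pass runs forward over the sequence, one runs backward, and the two hidden states of each time step are merged. The weights and the input sequence are illustrative, not trained values.

```python
import math

def rnn_pass(seq, w_in=0.5, w_rec=0.3):
    """Minimal scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_features(seq):
    """Merge forward and backward hidden states at each time step."""
    fwd = rnn_pass(seq)
    bwd = rnn_pass(seq[::-1])[::-1]  # backward pass, re-aligned to time order
    return [[f, b] for f, b in zip(fwd, bwd)]

features = bidirectional_features([0.2, -0.1, 0.4])
```

In practice the hidden states would be vectors rather than scalars, but the merging of the two directions per time step is the same.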
This embodiment extracts stock news features through three different learning methods, so that the consideration of news as a stock reference factor is more comprehensive and the accuracy of stock prediction is higher.
Embodiment two
This embodiment provides a stock news quantization system based on artificial intelligence, as shown in Fig. 2, including:
an acquisition module 21, for obtaining the stock news sequence of trading days within a preset time;
a division module 22, for dividing the stock news sequence into word sequences according to a preset length;
a word vector module 23, for judging whether the stock news is news of the current trading day and, if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
a document vector module 24, for obtaining the document vector feature of the news using fastText if the stock news is not news of the current trading day.
Text-based word vector representation and document representation methods have become popular in recent years. By training on a large corpus, a feature matrix is learned for each word or article so that similar words lie closer together in the vector space, which allows text to be quantized effectively.
In applications such as predicting the rise and fall of the stock market or of stock prices, text is one of the most important reference factors, and good quantized features of the text are often needed for reference.
This embodiment trains text vectors on a large amount of stock news, and further processes the quantized historical news with a recurrent neural network, so as to extract historical news features related to the stock as a news feature representation with temporal order.
This embodiment uses the news or agency reports related to the stock on each trading day as text data, and then extracts news features using three different vector learning methods: Word2Vec, GloVe and fastText.
Because the news of the current day usually has a large impact on the stock, word-level quantized representations are used for the headline and content of the current day's news, while historical news is quantized and labeled at the level of text fragments.
In this embodiment, the acquisition module 21 is used to obtain the stock news sequence of trading days within a preset time.
Specifically, a stock may generate some related news or agency reports D_t on each trading day, and these are usually closely related to the development of the stock itself. The news sequence of the trading days over N consecutive days is recorded as:
D = (D_1, D_2, ..., D_N).
In this embodiment, the division module 22 is used to divide the stock news sequence into word sequences according to a preset length.
Specifically, with the word sequence length of each news item denoted l_t, the news D_t can be expressed as the sequence:
D_t = (w_1, w_2, ..., w_{l_t}).
In this embodiment, the word vector module 23 is used to judge whether the stock news is news of the current trading day and, if so, to obtain the word vector feature of each lexical item of the news using Word2Vec and GloVe.
The word vector module 23 specifically includes:
a first model unit, for using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
a second model unit, for obtaining a second word vector feature based on global information using GloVe;
a concatenation unit, for concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
The first model unit specifically:
sets the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtains the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
builds a Skip-Gram model and obtains the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Specifically, Word2Vec is an unsupervised learning method for word vector representation that uses a neural network to find representations of words in a vector space; such methods assume that similar words co-occur in text with higher probability. This embodiment mainly uses its Skip-Gram model, which predicts the context words of the current word and learns the word vectors by maximizing the conditional probability.
Because the Skip-Gram model predicts all context words from the current word, it makes more efficient use of the training corpus, and experiments show that this word vector representation captures the linear relationships between words well; its performance on many tasks is slightly better than that of statistics-based word vector models such as SVD.
The second model unit specifically:
constructs on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Specifically, although prediction-based word vector models such as Word2Vec capture the linear relationships between words well, such models do not make full use of statistical information, and their training time depends on the corpus size. Statistics-based word vector models have the opposite strengths and weaknesses: they make full use of global statistical information to obtain latent semantic vectors.
GloVe takes the best of both: by constructing conditions similar to those of the Skip-Gram model on a co-occurrence matrix, it obtains a word vector model based on global information, and it is likewise an unsupervised representation method.
Compared with the Skip-Gram model, GloVe makes full use of the global statistics of the corpus while still capturing the linear relationships between words well.
In this embodiment, the document vector module 24 is used to obtain the document vector feature of the news using fastText if the stock news is not news of the current trading day.
The document vector module 24 specifically includes:
a labeling unit, for labeling each stock news item with the rise or fall of the next day's stock price as the label;
a classification unit, for performing supervised classification training using fastText;
a computing unit, for computing the document vector feature of each stock news item using the trained model.
Specifically, because word vector representations focus more on the local information of each word, fastText document vector features are used here, which better capture the global information of a document.
fastText is a supervised learning method mainly used for classification tasks, and the task of this embodiment is the binary classification of a stock's rise and fall. Therefore, in this embodiment each news item is first labeled, with the rise or fall of the next day's stock price as the label, and then supervised classification training is performed with fastText. Finally, the trained model is used to compute a document vector representation feature for each news item.
The input layer of fastText uses the word features and N-gram features of a document; the neural network of the middle layer then applies a linear transformation to the feature vector, and the nonlinear activation function of the output layer finally maps it to the label probabilities.
Because news has a certain temporal order, a recurrent neural network can better capture the information in the news. A bidirectional recurrent network consists of a forward recurrent neural network and a backward recurrent neural network; compared with a unidirectional recurrent network, it has the advantage of also capturing information in the reverse direction of the sequence, which is particularly effective for text feature processing. Therefore, this embodiment processes the news features with a bidirectional recurrent neural network, and the outputs of the multiple recurrent neural networks are merged to obtain the required news text quantization features.
This embodiment extracts stock news features through three different learning methods, so that the consideration of news as a stock reference factor is more comprehensive and the accuracy of stock prediction is higher.
The specific embodiments described herein are merely illustrative of the spirit of the present invention. Those skilled in the art to which the present invention belongs may make various modifications or supplements to the described specific embodiments, or substitute them in a similar manner, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.