A stock news quantization method and system based on artificial intelligence
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a stock news quantization method and system based on artificial intelligence.
Background technology
Stock price prediction refers to predicting the rise-and-fall trend or price movement of a stock over a future period, using the stock's historical price information and related market information. In recent years, deep learning methods have made much progress in the field of natural language processing, and they have gradually been applied to stock prediction as well.
TH Nguyen et al. predict stock prices using a topic model. In [Topic modeling based sentiment analysis on social media for stock market prediction], they propose a topic model that fuses sentiment and topics, and apply it to the analysis of stock-related news. After obtaining a topic distribution vector for each news item, they add this topic vector to the features used for stock prediction and obtain good prediction results. However, this approach ignores the features specific to the financial domain itself.
Besides news related to stocks, content from mass media and social media has also been used for stock prediction. Johan Bollen et al., in [Twitter mood predicts the stock market], predict the rise and fall of the stock market from Twitter content. They analyze the daily public mood on Twitter using tools such as OpinionFinder, then add these affective features to the forecast model to predict market movements. However, this method can only predict the overall state of the stock market; it is not suitable for predicting individual stocks.
News related to a stock is generally more relevant to the development of the stock itself, and it often contains polarity terms such as "favourable". Therefore Zeya Zhang et al., in the related work [Stock prediction: a method based on extraction of news features and recurrent neural networks], use the distribution of favourable-polarity segments in the news as a feature, and feed it together with historical price information into a recurrent neural network. However, news text contains rich information, and considering only favourable polarity is insufficient.
Summary of the invention
The technical problem to be solved by the present invention is to provide a stock news quantization method and system based on artificial intelligence, so as to solve the problem that the news reference factors of existing stock prediction are one-sided.
To achieve this goal, the technical solution adopted by the present invention is as follows:
A stock news quantization method based on artificial intelligence, comprising the steps of:
obtaining the stock news sequence of trading days within a preset time;
dividing the stock news sequence into word sequences according to a preset length;
judging whether the stock news is news of the current trading day; if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
if the stock news is not news of the current trading day, obtaining the document vector feature of the news using fastText.
Further, the step of obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe specifically includes:
using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
using GloVe to obtain a second word vector feature based on global information;
concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
Further, the step of obtaining the linear relationship between lexical items using Word2Vec specifically includes:
setting the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtaining the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
building a Skip-Gram model, and obtaining the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Further, the step of obtaining the second word vector feature based on global information using GloVe specifically includes:
constructing on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Further, the step of obtaining the document vector feature of the news using fastText specifically includes:
labeling each stock news item with the rise or fall of the next day's stock price as the label;
performing supervised classification training using fastText;
computing the document vector feature of each stock news item using the trained model.
A stock news quantization system based on artificial intelligence, including:
an acquisition module, for obtaining the stock news sequence of trading days within a preset time;
a division module, for dividing the stock news sequence into word sequences according to a preset length;
a word vector module, for judging whether the stock news is news of the current trading day and, if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
a document vector module, for obtaining the document vector feature of the news using fastText if the stock news is not news of the current trading day.
Further, the word vector module specifically includes:
a first model unit, for using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
a second model unit, for obtaining a second word vector feature based on global information using GloVe;
a concatenation unit, for concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
Further, the first model unit specifically:
sets the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtains the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
builds a Skip-Gram model and obtains the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Further, the second model unit specifically:
constructs on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Further, the document vector module specifically includes:
a labeling unit, for labeling each stock news item with the rise or fall of the next day's stock price as the label;
a classification unit, for performing supervised classification training using fastText;
a computing unit, for computing the document vector feature of each stock news item using the trained model.
Compared with the prior art, the present invention has the following advantages:
the present invention extracts news features through three different vector representation learning methods, so that news as a reference factor is more comprehensive and the prediction accuracy is higher.
Brief description of the drawings
Fig. 1 is a flowchart of the stock news quantization method based on artificial intelligence provided in Embodiment 1;
Fig. 2 is a structural diagram of the stock news quantization system based on artificial intelligence provided in Embodiment 2.
Detailed description of the embodiments
The technical solution of the present invention is further described below through specific embodiments with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
Embodiment one
This embodiment provides a stock news quantization method based on artificial intelligence, as shown in Fig. 1, including the steps:
S11: obtaining the stock news sequence of trading days within a preset time;
S12: dividing the stock news sequence into word sequences according to a preset length;
S13: judging whether the stock news is news of the current trading day; if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
S14: if the stock news is not news of the current trading day, obtaining the document vector feature of the news using fastText.
Text-based word vector representation and document representation methods have become popular in recent years. By training on a large corpus, a feature matrix is learned for each word or article so that similar words lie closer together in the vector space, which allows text to be quantized effectively.
In applications such as predicting the rise and fall of the stock market or of stock prices, text is one of the most important reference factors, and good quantized features of the text are often needed for reference.
This embodiment trains text vectors on a large amount of stock news, and further processes the quantized historical news with a recurrent neural network, so as to extract historical news features related to the stock as a news feature representation with temporal order.
This embodiment uses the news or agency reports related to the stock on each trading day as text data, and then extracts news features using three different vector learning methods: Word2Vec, GloVe and fastText.
Because the news of the current day usually has a large impact on the stock, word-level quantized representations are used for the headline and content of the current day's news, while historical news is quantized and labeled at the level of text fragments.
In this embodiment, step S11 obtains the stock news sequence of trading days within a preset time.
Specifically, a stock may generate some related news or agency reports D_t on each trading day, and these are usually closely related to the development of the stock itself. The news sequence of the trading days over N consecutive days is recorded as:
D = (D_1, D_2, ..., D_N).
In this embodiment, step S12 divides the stock news sequence into word sequences according to a preset length.
Specifically, with the word sequence length of each news item denoted l_t, the news D_t can be expressed as the sequence:
D_t = (w_1, w_2, ..., w_{l_t}).
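As a minimal sketch of step S12, the division into word sequences could look as follows. Whitespace tokenization, the example texts, and the 4-token preset length are illustrative assumptions; the embodiment does not specify a segmenter.

```python
# Hypothetical sketch of step S12: dividing each news item into a word
# sequence of at most `preset_length` lexical items.

def to_word_sequence(news_text, preset_length):
    """Return at most `preset_length` whitespace-separated tokens."""
    tokens = news_text.split()
    return tokens[:preset_length]

day_news = [
    "shares rally after earnings beat analyst forecasts",
    "regulator opens inquiry into the company",
]
sequences = [to_word_sequence(n, 4) for n in day_news]
```

Each resulting sequence is then the input to the word-level (same-day) or document-level (historical) quantization described below.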
In this embodiment, step S13 judges whether the stock news is news of the current trading day and, if so, obtains the word vector feature of each lexical item of the news using Word2Vec and GloVe.
The step in S13 of obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe specifically includes:
using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
using GloVe to obtain a second word vector feature based on global information;
concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
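The merging of the first and second word vector features described above amounts to simple vector concatenation; a sketch with illustrative 3-dimensional vectors (the values are not trained embeddings):

```python
# Sketch of the concatenation unit: the Word2Vec feature and the GloVe
# feature of the same lexical item are joined into one longer vector.

def concat_features(w2v_vec, glove_vec):
    return list(w2v_vec) + list(glove_vec)

w2v_feature = [0.1, -0.2, 0.3]    # first word vector feature (Word2Vec)
glove_feature = [0.4, 0.0, -0.1]  # second word vector feature (GloVe)
merged = concat_features(w2v_feature, glove_feature)
```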
In step S13, the step of obtaining the linear relationship between lexical items using Word2Vec specifically includes:
setting the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtaining the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
building a Skip-Gram model, and obtaining the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Specifically, Word2Vec is an unsupervised learning method for word vector representation that uses a neural network to find representations of words in a vector space; such methods assume that similar words co-occur in text with higher probability. This embodiment mainly uses its Skip-Gram model, which predicts the context words of the current word and learns the word vectors by maximizing the conditional probability.
Because the Skip-Gram model predicts all context words from the current word, it makes more efficient use of the training corpus, and experiments show that this word vector representation captures the linear relationships between words well; its performance on many tasks is slightly better than that of statistics-based word vector models such as SVD.
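The two Skip-Gram quantities defined above, the context set Context(w_i) and the softmax conditional probability P(o | w_i), can be sketched directly in plain Python. The toy sentence and vectors are illustrative, not trained values.

```python
import math

def context(words, i, k):
    """Context(w_i): lexical items within distance k of position i."""
    return words[max(0, i - k):i] + words[i + 1:i + k + 1]

def cond_prob(o, center, U, V):
    """Softmax P(o | w_i) = exp(u_o . v_wi) / sum_w exp(u_w . v_wi)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(u, V[center])) for w, u in U.items()}
    return scores[o] / sum(scores.values())

sentence = ["stock", "price", "rises", "after", "news"]
U = {"stock": [0.2, 0.1], "price": [0.4, -0.3], "rises": [0.0, 0.5]}  # outer vectors
V = {"price": [0.3, 0.2]}                                            # inner vector of the centre word
ctx = context(sentence, 2, 1)          # context of "rises" with k = 1
p = cond_prob("stock", "price", U, V)
```

Training maximizes this probability over all (centre, context) pairs, which is exactly what the loss function J(θ) above sums.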
In step S13, the step of obtaining the second word vector feature based on global information using GloVe specifically includes:
constructing on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Specifically, although prediction-based word vector models such as Word2Vec capture the linear relationships between words well, such models do not make full use of statistical information, and their training time depends on the corpus size. Statistics-based word vector models have the opposite strengths and weaknesses: they make full use of global statistical information to obtain latent semantic vectors.
GloVe takes the best of both: by constructing conditions similar to those of the Skip-Gram model on a co-occurrence matrix, it obtains a word vector model based on global information, and it is likewise an unsupervised representation method.
Compared with the Skip-Gram model, GloVe makes full use of the global statistics of the corpus while still capturing the linear relationships between words well.
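As an illustration of the GloVe construction described above, the co-occurrence matrix X can be counted from the corpus and the weighted least-squares loss J evaluated on it. The weighting function f follows the standard GloVe form; the tiny corpus and the vectors w, w̃ and biases b, b̃ are illustrative, untrained values.

```python
import math
from collections import Counter

def cooccurrence(sentences, k):
    """X_ij: how often word j occurs within distance k of word i."""
    X = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - k):i] + sent[i + 1:i + k + 1]:
                X[(w, c)] += 1
    return X

def glove_loss(X, w, w_tilde, b, b_tilde, x_max=100, alpha=0.75):
    """J = sum f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    f = lambda x: (x / x_max) ** alpha if x < x_max else 1.0
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    return sum(
        f(x) * (dot(w[i], w_tilde[j]) + b[i] + b_tilde[j] - math.log(x)) ** 2
        for (i, j), x in X.items()
    )

corpus = [["stock", "price", "rises"], ["stock", "price", "falls"]]
X = cooccurrence(corpus, k=1)
vocab = {"stock", "price", "rises", "falls"}
w = {v: [0.1, 0.1] for v in vocab}
w_t = {v: [0.1, 0.1] for v in vocab}
b = {v: 0.0 for v in vocab}
b_t = {v: 0.0 for v in vocab}
loss = glove_loss(X, w, w_t, b, b_t)
```

Minimizing this loss over w, w̃, b, b̃ (e.g. by gradient descent) yields the global-information word vectors.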
In this embodiment, if the stock news is not news of the current trading day, step S14 obtains the document vector feature of the news using fastText.
Step S14 specifically includes the steps of:
labeling each stock news item with the rise or fall of the next day's stock price as the label;
performing supervised classification training using fastText;
computing the document vector feature of each stock news item using the trained model.
Specifically, because word vector representations focus more on the local information of each word, fastText document vector features are used here, which better capture the global information of a document.
fastText is a supervised learning method mainly used for classification tasks, and the task of this embodiment is the binary classification of a stock's rise and fall. Therefore, in this embodiment each news item is first labeled, with the rise or fall of the next day's stock price as the label, and then supervised classification training is performed with fastText. Finally, the trained model is used to compute a document vector representation feature for each news item.
The input layer of fastText uses the word features and N-gram features of a document; the neural network of the middle layer then applies a linear transformation to the feature vector, and the nonlinear activation function of the output layer finally maps it to the label probabilities.
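The labeling step described above, taking the rise or fall of the next day's stock price as the label, can be sketched as follows. The alignment of exactly one closing price per news day is an assumption, as are the example texts and prices; the fastText training itself is omitted.

```python
# Hypothetical sketch of the supervised labels for fastText: each news item
# of day t is labeled 1 if the closing price of day t+1 is higher than that
# of day t, and 0 otherwise. The last day has no next-day price, so it is
# not labeled.

def label_news(news_by_day, close_prices):
    labeled = []
    for t in range(len(news_by_day) - 1):
        label = 1 if close_prices[t + 1] > close_prices[t] else 0
        labeled.extend((doc, label) for doc in news_by_day[t])
    return labeled

news_by_day = [["earnings beat"], ["ceo resigns", "guidance cut"], ["buyback announced"]]
close_prices = [10.0, 10.5, 9.8]
samples = label_news(news_by_day, close_prices)
```

The resulting (document, label) pairs are then what a supervised classifier such as fastText would be trained on.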
Because news has a certain temporal order, a recurrent neural network can better capture the information in the news. A bidirectional recurrent network consists of a forward recurrent neural network and a backward recurrent neural network; compared with a unidirectional recurrent network, it has the advantage of also capturing information in the reverse direction of the sequence, which is particularly effective for text feature processing. Therefore, this embodiment processes the news features with a bidirectional recurrent neural network, and the outputs of the multiple recurrent neural networks are merged to obtain the required news text quantization features.
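The bidirectional recurrent processing described above can be sketched with a minimal scalar RNN: one pass runs forward over the sequence, one runs backward, and the two hidden states of each time step are merged. The weights and the input sequence are illustrative, not trained values.

```python
import math

def rnn_pass(seq, w_in=0.5, w_rec=0.3):
    """Minimal scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_features(seq):
    """Merge forward and backward hidden states at each time step."""
    fwd = rnn_pass(seq)
    bwd = rnn_pass(seq[::-1])[::-1]  # backward pass, re-aligned to time order
    return [[f, b] for f, b in zip(fwd, bwd)]

features = bidirectional_features([0.2, -0.1, 0.4])
```

In practice the hidden states would be vectors rather than scalars, but the merging of the two directions per time step is the same.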
This embodiment extracts stock news features through three different learning methods, so that the consideration of news as a stock reference factor is more comprehensive and the accuracy of stock prediction is higher.
Embodiment two
This embodiment provides a stock news quantization system based on artificial intelligence, as shown in Fig. 2, including:
an acquisition module 21, for obtaining the stock news sequence of trading days within a preset time;
a division module 22, for dividing the stock news sequence into word sequences according to a preset length;
a word vector module 23, for judging whether the stock news is news of the current trading day and, if so, obtaining the word vector feature of each lexical item of the news using Word2Vec and GloVe;
a document vector module 24, for obtaining the document vector feature of the news using fastText if the stock news is not news of the current trading day.
Text-based word vector representation and document representation methods have become popular in recent years. By training on a large corpus, a feature matrix is learned for each word or article so that similar words lie closer together in the vector space, which allows text to be quantized effectively.
In applications such as predicting the rise and fall of the stock market or of stock prices, text is one of the most important reference factors, and good quantized features of the text are often needed for reference.
This embodiment trains text vectors on a large amount of stock news, and further processes the quantized historical news with a recurrent neural network, so as to extract historical news features related to the stock as a news feature representation with temporal order.
This embodiment uses the news or agency reports related to the stock on each trading day as text data, and then extracts news features using three different vector learning methods: Word2Vec, GloVe and fastText.
Because the news of the current day usually has a large impact on the stock, word-level quantized representations are used for the headline and content of the current day's news, while historical news is quantized and labeled at the level of text fragments.
In this embodiment, the acquisition module 21 is used to obtain the stock news sequence of trading days within a preset time.
Specifically, a stock may generate some related news or agency reports D_t on each trading day, and these are usually closely related to the development of the stock itself. The news sequence of the trading days over N consecutive days is recorded as:
D = (D_1, D_2, ..., D_N).
In this embodiment, the division module 22 is used to divide the stock news sequence into word sequences according to a preset length.
Specifically, with the word sequence length of each news item denoted l_t, the news D_t can be expressed as the sequence:
D_t = (w_1, w_2, ..., w_{l_t}).
In this embodiment, the word vector module 23 is used to judge whether the stock news is news of the current trading day and, if so, to obtain the word vector feature of each lexical item of the news using Word2Vec and GloVe.
The word vector module 23 specifically includes:
a first model unit, for using Word2Vec to predict context lexical items and learning a first word vector feature by maximizing conditional probability;
a second model unit, for obtaining a second word vector feature based on global information using GloVe;
a concatenation unit, for concatenating the first word vector feature and the second word vector feature to obtain the word vector feature of each lexical item.
The first model unit specifically:
sets the context of lexical item w_i to be the set Context(w_i) of lexical items in the current sentence whose distance from w_i is less than k:
Context(w_i) = {w_{i-k}, w_{i-k+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}};
where i denotes the position of the lexical item in the sentence;
obtains the conditional probability that the target prediction word o appears in the context of the lexical item w_i as:
P(o | w_i) = exp(u_o · v_{w_i}) / Σ_{w=1..V} exp(u_w · v_{w_i});
where u_o is the outer vector of the target prediction word o, and v_{w_i} is the inner vector of the lexical item w_i;
builds a Skip-Gram model and obtains the loss function of the Skip-Gram model from the conditional probability:
J(θ) = -(1/T) Σ_{t=1..T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t);
where T is the total number of lexical items in the current sentence, j is the distance from the current lexical item, and m is the maximum distance.
Specifically, Word2Vec is an unsupervised learning method for word vector representation that uses a neural network to find representations of words in a vector space; such methods assume that similar words co-occur in text with higher probability. This embodiment mainly uses its Skip-Gram model, which predicts the context words of the current word and learns the word vectors by maximizing the conditional probability.
Because the Skip-Gram model predicts all context words from the current word, it makes more efficient use of the training corpus, and experiments show that this word vector representation captures the linear relationships between words well; its performance on many tasks is slightly better than that of statistics-based word vector models such as SVD.
The second model unit specifically:
constructs on a co-occurrence matrix a model that matches the conditions of the Skip-Gram model;
the loss function of the model is:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)².
Specifically, although prediction-based word vector models such as Word2Vec capture the linear relationships between words well, such models do not make full use of statistical information, and their training time depends on the corpus size. Statistics-based word vector models have the opposite strengths and weaknesses: they make full use of global statistical information to obtain latent semantic vectors.
GloVe takes the best of both: by constructing conditions similar to those of the Skip-Gram model on a co-occurrence matrix, it obtains a word vector model based on global information, and it is likewise an unsupervised representation method.
Compared with the Skip-Gram model, GloVe makes full use of the global statistics of the corpus while still capturing the linear relationships between words well.
In this embodiment, the document vector module 24 is used to obtain the document vector feature of the news using fastText if the stock news is not news of the current trading day.
The document vector module 24 specifically includes:
a labeling unit, for labeling each stock news item with the rise or fall of the next day's stock price as the label;
a classification unit, for performing supervised classification training using fastText;
a computing unit, for computing the document vector feature of each stock news item using the trained model.
Specifically, because word vector representations focus more on the local information of each word, fastText document vector features are used here, which better capture the global information of a document.
fastText is a supervised learning method mainly used for classification tasks, and the task of this embodiment is the binary classification of a stock's rise and fall. Therefore, in this embodiment each news item is first labeled, with the rise or fall of the next day's stock price as the label, and then supervised classification training is performed with fastText. Finally, the trained model is used to compute a document vector representation feature for each news item.
The input layer of fastText uses the word features and N-gram features of a document; the neural network of the middle layer then applies a linear transformation to the feature vector, and the nonlinear activation function of the output layer finally maps it to the label probabilities.
Because news has a certain temporal order, a recurrent neural network can better capture the information in the news. A bidirectional recurrent network consists of a forward recurrent neural network and a backward recurrent neural network; compared with a unidirectional recurrent network, it has the advantage of also capturing information in the reverse direction of the sequence, which is particularly effective for text feature processing. Therefore, this embodiment processes the news features with a bidirectional recurrent neural network, and the outputs of the multiple recurrent neural networks are merged to obtain the required news text quantization features.
This embodiment extracts stock news features through three different learning methods, so that the consideration of news as a stock reference factor is more comprehensive and the accuracy of stock prediction is higher.
The specific embodiments described herein are merely illustrative of the spirit of the present invention. Those skilled in the art to which the present invention belongs may make various modifications or supplements to the described specific embodiments, or substitute them in a similar manner, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.