CN108959266A

CN108959266A - A stock price prediction method and device based on a Stemming stem dictionary

Info

Publication number: CN108959266A
Application number: CN201810778565.0A
Authority: CN
Inventors: 饶东宁; 黄思宏
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2018-12-07

Abstract

The invention discloses a stock price prediction method and device based on a Stemming stem dictionary. The preprocessed comment data set is annotated with sentiment polarity, corresponding prediction models are built with a libsvm classifier and a liblinear classifier, and sentiment analysis classification and stock market trend prediction are carried out. This addresses the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.

Description

A stock price prediction method and device based on a Stemming stem dictionary

Technical field

The present invention relates to the field of data mining, and in particular to a stock price prediction method and device based on a Stemming stem dictionary.

Background technique

After more than 30 years of rapid development, the stock market has come to occupy a leading position in China's modern financial system, and a growing number of stock investment enthusiasts has emerged. Because stock prices are affected by political, economic, technological and other factors, they fluctuate widely. To maximize investment returns, stock investors urgently want a method that predicts stock price changes more accurately: one that comprehensively analyzes the variables affecting stock prices, predicts their future trend, and so better guides investment. Such applications fall within the scope of data mining.

Financial short-text analysis is an important task for predicting trends in the financial stock market, and it has many potential uses in fields such as knowledge mining and sentiment analysis. Understanding a short text is easy for humans, because humans can reason and draw on accumulated experience when making judgments. Automated understanding of financial short texts, however, differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.

Summary of the invention

The stock price prediction method based on a Stemming stem dictionary provided by the present invention solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.

The stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises:

obtaining a stock comment data set;

removing, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and removing vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;

replacing, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning;

annotating the preprocessed comment data set with sentiment polarity;

building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and predicting the stock market trend.

Optionally, replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:

replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose length exceeds a second preset number of characters into short words of corresponding meaning.

Optionally, the libsvm classifier is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))

where σ² is the variance of the Gaussian kernel function, x_i and x_j denote, respectively, the sample feature vector obtained after model training and the feature vector of a test data sample, and ||x_i - x_j||² is the squared Euclidean distance between the two samples.

Optionally, the liblinear classifier is defined as:

K(x_i, x_j) = x_i × x_j.

Optionally, before removing from the stock comment data set the entries whose text length exceeds the first preset number of characters and removing vocabulary unrelated to stock market finance to obtain the preprocessed comment data set, the method further comprises:

performing Chinese word segmentation on the stock comment data set with an n-gram algorithm.

The stock price prediction device based on a Stemming stem dictionary provided by the present invention comprises:

a first obtaining module, configured to obtain a stock comment data set;

a first removal module, configured to remove, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and to remove vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;

a first replacement module, configured to replace, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning;

a first annotation module, configured to annotate the preprocessed comment data set with sentiment polarity;

a first prediction module, configured to build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and predict the stock market trend.

Optionally, the libsvm classifier is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)).

Optionally, the liblinear classifier is defined as:

K(x_i, x_j) = x_i × x_j.

As can be seen from the above technical solutions, the present invention has the following advantages:

The stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises: obtaining a stock comment data set; removing, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and removing vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning; annotating the preprocessed comment data set with sentiment polarity; and building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and predicting the stock market trend. A Stemming stem dictionary method is introduced into the processing of the financial short-text corpus to reduce data ambiguity and improve the segmentation accuracy of the short-text corpus. On the basis of the sentiment polarity annotation of the preprocessed comment data set, corresponding prediction models are built with the libsvm classifier and the liblinear classifier, and sentiment analysis classification and stock market trend prediction are carried out. This solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of one embodiment of the stock price prediction method based on a Stemming stem dictionary provided in an embodiment of the present invention;

Fig. 2 is a flow diagram of another embodiment of the stock price prediction method based on a Stemming stem dictionary provided in an embodiment of the present invention;

Fig. 3 is a structural diagram of one embodiment of the stock price prediction device based on a Stemming stem dictionary provided in an embodiment of the present invention.

Detailed description of the embodiments

To make the purpose, features and advantages of the present invention more obvious and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Fig. 1 is a flowchart of one embodiment of the stock price prediction method based on a Stemming stem dictionary provided by the present invention. As shown in Fig. 1, the embodiment can be applied to a server, and the stock price prediction method based on a Stemming stem dictionary provided in this embodiment may include:

S100: obtaining a stock comment data set;

Optionally, Chinese comment data posted by stock investors on the domestic Guba forum can be collected over a certain period of time.

S101: removing, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and removing vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;
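As a rough illustration of step S101, the following Python sketch filters a data set by text length and by vocabulary. The threshold `MAX_CHARS`, the word list `FINANCE_VOCAB`, the use of already segmented token lists as input, and the choice to drop comments left empty after filtering are all assumptions made for this example; the patent does not specify them.

```python
# Hypothetical settings: the patent gives neither the threshold nor the vocabulary.
MAX_CHARS = 30
FINANCE_VOCAB = {"上涨", "下跌", "利好", "利空"}  # rise, fall, good news, bad news


def preprocess(comments, max_chars=MAX_CHARS, vocab=FINANCE_VOCAB):
    """Filter a list of comments, each given as a list of tokens."""
    kept = []
    for tokens in comments:
        # Drop entries whose total text length exceeds the preset limit.
        if len("".join(tokens)) > max_chars:
            continue
        # Keep only tokens that belong to the stock-market vocabulary.
        filtered = [t for t in tokens if t in vocab]
        # Assumption: a comment with no finance tokens left is discarded.
        if filtered:
            kept.append(filtered)
    return kept
```

A call such as `preprocess(comments, max_chars=10)` returns only the short comments, each reduced to its finance-related tokens.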

S102: replacing, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning;
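A minimal sketch of the replacement in step S102, using Python's `re` module, is given here for orientation. The dictionary `STEM_DICT` is hypothetical: its single entry is an assumed Chinese rendering of the kind of mapping the description mentions (a stock-market term for a three-candle bullish pattern replaced by the word for "rise"), and a real dictionary would be compiled by hand.

```python
import re

# Hypothetical dictionary: coined stock-market word -> common word.
STEM_DICT = {"红三兵": "上涨"}


def replace_coined_words(text, stem_dict=STEM_DICT):
    """Replace every coined word found in text with its common word."""
    if not stem_dict:
        return text
    # Longest keys first, so a longer coined word wins over its own prefix.
    keys = sorted(stem_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: stem_dict[m.group(0)], text)
```

Building one alternation pattern keeps the replacement to a single pass over the text, which matters when the dictionary and the corpus are both large.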

It should be noted that the processing of the short-text corpus has two main aspects: the first is the introduction of the Stemming stem dictionary, and the second is Chinese word segmentation. The Stemming stem dictionary can be introduced with a method based on regular expressions. Stock comment data contains a large number of domain-specific coined words. These words carry special meanings in the stock market, so a program cannot interpret them by their literal meaning during analysis. A batch of coined words common in the stock market is therefore compiled, and when the program reads the data it first converts each such coined word into a common word with the same specific meaning before reading and storing it. This reduces the ambiguity of the data and can improve segmentation accuracy. The gain in segmentation accuracy arises mainly because, when the 2-gram segmentation method is used, the common words in the Stemming stem dictionary can be expressed as two-character words. For example, the stock-market term "three red soldiers" can be replaced with "rise".

For Chinese word segmentation, an n-gram algorithm is used for segmentation and matching. In a statistical language model, a sentence S is assumed to be representable as a sequence S = ω₁ω₂…ω_n, and the language model computes the probability of the sentence S: p(S) = p(ω₁ω₂…ω_n) = Π_{i=1..n} p(ω_i | ω₁ω₂…ω_{i-1}). Computing this probability directly is too expensive. The solution is to map every history ω₁ω₂…ω_{i-1} to an equivalence class S(ω₁ω₂…ω_{i-1}) according to some rule, where the number of equivalence classes is far smaller than the number of distinct histories; that is, it is assumed that p(ω_i | ω₁ω₂…ω_{i-1}) = p(ω_i | S(ω₁ω₂…ω_{i-1})).

In an N-gram model, two histories are mapped to the same equivalence class when their most recent N-1 words (or characters) are identical. With N = 2 this is the 2-gram model, which is also called a first-order Markov chain. N cannot be too large, otherwise the computation is still too expensive. By maximum likelihood estimation, the parameters of the language model are p(ω_i | ω_{i-N+1}…ω_{i-1}) = C(ω_{i-N+1}…ω_i) / C(ω_{i-N+1}…ω_{i-1}), where C(ω₁ω₂…ω_i) denotes the number of times ω₁ω₂…ω_i appears in the training data.
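To make the maximum likelihood estimate concrete for the 2-gram case, the sketch below counts adjacent token pairs and divides by the count of the preceding token. The function names and the toy corpus in the usage note are illustrative only. As a simplification, the denominator uses the plain unigram count of the preceding token, and no smoothing is applied, so unseen pairs get probability 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Count single tokens and adjacent token pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate p(word | prev) as C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]
```

For a corpus in which one token is followed by a given word in two of its three occurrences, `bigram_prob` returns 2/3 for that pair.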

In addition, regarding text matching: the present invention can use a method based on regular expressions to match and replace ambiguous words. During regular expression matching, if a subexpression matches actual content rather than a position, and that content is saved in the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved in the match result, it is said to be zero-width. Character consumption is mutually exclusive, whereas zero-width matching is not: a character can be matched by only one subexpression at a time, but a position can be matched by several zero-width subexpressions simultaneously. A regular expression is matched from left to right. Normally an expression takes control and starts matching at some position in the string, and each subexpression starts trying to match at the position where the previous subexpression finished matching successfully. For example, in (expression one)(expression two), expression two is tried only after expression one has matched, and it starts at the position where the match of expression one ended. If expression one is zero-width, then after expression one matches, expression two starts at the same position at which expression one matched, because the start and end positions of a zero-width match are identical.
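The difference between consuming and zero-width subexpressions can be observed directly with Python's `re` module. The sample string and patterns below are chosen only for illustration (the string reads "stock price rises"); they are not taken from the patent.

```python
import re


def span_demo(text="股价上涨"):
    """Return match spans showing consuming versus zero-width behavior."""
    # Consuming groups: the second group starts where the first one ended.
    consuming = re.match(r"(股价)(上涨)", text)
    # A lookahead is zero-width: it consumes nothing, so the group after it
    # starts at the same position at which the lookahead matched.
    zero_width = re.match(r"(?=股价)(股价上涨)", text)
    # Several zero-width assertions can all hold at one and the same position.
    stacked = re.match(r"(?=股价)(?=.{4})(?!下跌)", text)
    return {
        "consuming": (consuming.span(1), consuming.span(2)),
        "zero_width": zero_width.span(1),
        "stacked": stacked.span(),
    }
```

In the returned dictionary, the two consuming groups cover disjoint character ranges, while the stacked assertions yield an empty match whose start and end positions coincide.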

It should be noted that the dictionary is built as follows:

The feature lexicon is built by combining Chi-square dimensionality reduction with the tf-idf method. Feature words are first obtained by Chi-square dimensionality reduction. The idea of the tf-idf method is that the more often a feature occurs within one document and the less often it occurs across different documents, the stronger its ability to discriminate between classes. The Chi-square function is formalized as:

χ² = Σ (A - T)² / T

where A is the actual value of the annotated data and T is the program's guessed value for the annotated data. The computed χ² expresses the degree of difference between the actual and guessed values, and the magnitude of χ² is positively correlated with the degree of relatedness. Here, words with a high degree of relatedness are extracted as feature words; tf-idf then assigns weights to the feature words, and they are saved in the feature lexicon, providing the input for model training in the next step. Chi-square dimensionality reduction and tf-idf are combined to build the feature lexicon because, although Chi-square dimensionality reduction is efficient at feature selection, it has a weakness with low-frequency words. Combining the two methods lets each compensate for the other's shortcomings.
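The two scores used for the feature lexicon can be sketched as follows. The Chi-square function assumes the usual form, the sum of (observed - expected)² / expected over a 2×2 term/class contingency table, and the tf-idf function uses one common variant with a smoothed document frequency; the patent does not state which variants it uses, so both are assumptions, as are the function and parameter names.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 table of document counts.

    n11: in class, contains term      n10: not in class, contains term
    n01: in class, lacks term         n00: not in class, lacks term
    """
    observed = [[n11, n10], [n01, n00]]
    total = n11 + n10 + n01 + n00
    rows = [n11 + n10, n01 + n00]
    cols = [n11 + n01, n10 + n00]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if term and class were independent.
            expected = rows[i] * cols[j] / total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def tf_idf(term, doc, docs):
    """tf-idf weight of term in doc, where docs is the whole corpus."""
    if not doc:
        return 0.0
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    # Assumed variant: add 1 to the document frequency to avoid division by 0.
    return tf * math.log(len(docs) / (1 + df))
```

A term that appears in every document of one class and in none of the other gets the largest Chi-square score for that table size, while a term spread evenly across classes scores 0.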

S103: annotating the preprocessed comment data set with sentiment polarity;

It should be noted that the polarity labels may include words such as "good" and "not good" that express a stock investor's sentiment;

S104: building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and predicting the stock market trend.

Libsvm can be based on the RBF (Gaussian radial basis function) kernel. The Gaussian radial basis function (RBF) is a kernel function with strong locality. It can map a sample into a higher-dimensional space, and is formalized as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))

where the parameter σ² is the variance of the Gaussian kernel function. σ controls the radial range of influence of the function: if σ is too small, over-fitting easily occurs; if σ is too large, under-fitting may occur. x_i and x_j denote the feature vectors of two samples: one is the sample feature vector obtained after model training, and the other is the feature vector of a test data sample. ||x_i - x_j||² is the squared Euclidean distance between the two samples. The smaller this distance, the closer the test data sample is to the data of that class, and the greater the probability that it is predicted as that class; conversely, the probability decreases.
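The RBF kernel can be written in a few lines of plain Python. This sketch assumes the common form K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)), which matches the symbols defined in the text. LIBSVM itself parameterizes the RBF kernel as exp(-gamma·||u - v||²), so the helper `gamma_from_sigma` (a name introduced here) converts between the two conventions.

```python
import math


def rbf_kernel(x_i, x_j, sigma):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def gamma_from_sigma(sigma):
    """Equivalent LIBSVM-style gamma for a given sigma: 1 / (2 * sigma^2)."""
    return 1.0 / (2 * sigma ** 2)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; a smaller `sigma` makes that decay faster, which is the locality effect described above.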

Liblinear can be based on a linear kernel function, which is mainly used for linearly separable data. The dimensionality of the feature space is the same as that of the input space. Its advantages are few parameters and high speed, and it works well on linearly separable data. It is formalized as:

K(x_i, x_j) = x_i × x_j;

In the linearly separable case, x_i and x_j denote different sample vectors. Their inner product is computed, and the region in which the resulting value falls determines the class to which the sample belongs.
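For comparison, the linear kernel is just an inner product. The `predict` function below illustrates how the sign of a linear score assigns a class; the weight vector `w` and bias `b` are hypothetical values that a trained linear model would supply, and the +1/-1 labels are a convention chosen for this sketch.

```python
def linear_kernel(x_i, x_j):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))


def predict(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if linear_kernel(w, x) + b >= 0 else -1
```

With `w = [1, -1]` and `b = 0`, for instance, a sample is labeled +1 when its first feature is at least as large as its second.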

The stock price prediction method based on a Stemming stem dictionary provided in this embodiment of the present invention comprises: obtaining a stock comment data set; removing, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and removing vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning; annotating the preprocessed comment data set with sentiment polarity; and building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and predicting the stock market trend. A Stemming stem dictionary method is introduced into the processing of the financial short-text corpus to reduce data ambiguity and improve the segmentation accuracy of the short-text corpus. On the basis of the sentiment polarity annotation of the preprocessed comment data set, corresponding prediction models are built with the libsvm classifier and the liblinear classifier, and sentiment analysis classification and stock market trend prediction are carried out. This solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.

The above is a detailed description of one embodiment of the stock price prediction method based on a Stemming stem dictionary. Another embodiment of the method is described in detail below.

Referring to Fig. 2, another embodiment of the stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises:

S200: obtaining a stock comment data set;

S201: performing Chinese word segmentation on the stock comment data set with an n-gram algorithm;

S202: removing, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and removing vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;

S203: replacing, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose length exceeds a second preset number of characters into short words of corresponding meaning;

S204: annotating the preprocessed comment data set with sentiment polarity;

S205: building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and predicting the stock market trend;

The libsvm classifier is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)).

The liblinear classifier is defined as:

K(x_i, x_j) = x_i × x_j.
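Two steps specific to this embodiment, the n-gram segmentation of S201 and the shortening of long common words in S203, can be sketched as follows. The patent does not detail either procedure, so this is only one plausible reading: segmentation is shown as overlapping character n-grams, and shortening as a lookup in a hypothetical table `SHORT_FORMS` with an assumed limit of two characters.

```python
# Hypothetical table: long common word -> short word with the same meaning.
SHORT_FORMS = {"大幅度上涨": "上涨"}


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (2-grams by default)."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shorten(tokens, max_len=2, short_forms=SHORT_FORMS):
    """Replace tokens longer than max_len by their short form, if one is known."""
    return [short_forms.get(t, t) if len(t) > max_len else t for t in tokens]
```

Keeping the shortened words at two characters lines them up with the units produced by 2-gram segmentation, which is the motivation the description gives for expressing dictionary words in two-character form.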

Referring to Fig. 3, which shows a structural diagram of the stock price prediction device based on a Stemming stem dictionary provided in an embodiment of the present invention, the device comprises:

a first obtaining module 301, configured to obtain a stock comment data set;

a first removal module 302, configured to remove, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and to remove vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;

a first replacement module 303, configured to replace, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning;

a first annotation module 304, configured to annotate the preprocessed comment data set with sentiment polarity;

a first prediction module 305, configured to build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and predict the stock market trend.

Optionally, replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:

replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose length exceeds a second preset number of characters into short words of corresponding meaning.

Optionally, the libsvm classifier is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)).

Optionally, the liblinear classifier is defined as:

K(x_i, x_j) = x_i × x_j.

Optionally, before removing from the stock comment data set the entries whose text length exceeds the first preset number of characters and removing vocabulary unrelated to stock market finance to obtain the preprocessed comment data set, the following is further included:

performing Chinese word segmentation on the stock comment data set with an n-gram algorithm.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are implemented in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A stock price prediction method based on a Stemming stem dictionary, characterized by comprising:

obtaining a stock comment data set;

removing, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and removing vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;

replacing, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning;

building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and predicting the stock market trend.

2. The stock price prediction method based on a Stemming stem dictionary according to claim 1, characterized in that replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:

replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose length exceeds a second preset number of characters into short words of corresponding meaning.

3. The stock price prediction method based on a Stemming stem dictionary according to claim 2, characterized in that the libsvm classifier is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))

where σ² is the variance of the Gaussian kernel function, x_i and x_j denote, respectively, the sample feature vector obtained after model training and the feature vector of a test data sample, and ||x_i - x_j||² is the squared Euclidean distance between the two samples.

4. The stock price prediction method based on a Stemming stem dictionary according to claim 3, characterized in that the liblinear classifier is defined as:

K(x_i, x_j) = x_i × x_j.

5. The stock price prediction method based on a Stemming stem dictionary according to claim 4, characterized in that, before removing from the stock comment data set the entries whose text length exceeds the first preset number of characters and removing vocabulary unrelated to stock market finance to obtain the preprocessed comment data set, the method further comprises:

performing Chinese word segmentation on the stock comment data set with an n-gram algorithm.

6. A stock price prediction device based on a Stemming stem dictionary, characterized by comprising:

a first obtaining module, configured to obtain a stock comment data set;

a first removal module, configured to remove, from the stock comment data set, entries whose text length exceeds a first preset number of characters, and to remove vocabulary unrelated to stock market finance, to obtain a preprocessed comment data set;

a first replacement module, configured to replace, by means of the Stemming stem dictionary, coined words in the preprocessed comment data set with common words of corresponding meaning;

a first prediction module, configured to build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and predict the stock market trend.

7. The stock price prediction device based on a Stemming stem dictionary according to claim 6, characterized in that replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:

replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose length exceeds a second preset number of characters into short words of corresponding meaning.

8. The stock price prediction device based on a Stemming stem dictionary according to claim 7, characterized in that the libsvm classifier is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)).

9. The stock price prediction device based on a Stemming stem dictionary according to claim 8, characterized in that the liblinear classifier is defined as:

K(x_i, x_j) = x_i × x_j.

10. The stock price prediction device based on a Stemming stem dictionary according to claim 9, characterized in that, before removing from the stock comment data set the entries whose text length exceeds the first preset number of characters and removing vocabulary unrelated to stock market finance to obtain the preprocessed comment data set, the following is further included:

performing Chinese word segmentation on the stock comment data set with an n-gram algorithm.