CN108959266A - Stock price prediction method and device based on a Stemming stem dictionary - Google Patents

Info

Publication number
CN108959266A
Authority
CN
China
Prior art keywords
stock
comment data
stemming
word
forecasting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810778565.0A
Other languages
Chinese (zh)
Inventor
饶东宁
黄思宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201810778565.0A
Publication of CN108959266A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Abstract

The invention discloses a stock price prediction method and device based on a Stemming stem dictionary. Sentiment polarity is annotated on a preprocessed comment data set, and corresponding prediction models are constructed using a libsvm classifier and a liblinear classifier to perform sentiment analysis classification and stock market trend prediction. This solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure, and their very short length does not give a computer enough information for inference.

Description

Stock price prediction method and device based on a Stemming stem dictionary
Technical field
The present invention relates to the field of data mining, and in particular to a stock price prediction method and device based on a Stemming stem dictionary.
Background art
After more than 30 years of rapid development, the stock market occupies a leading position in China's modern financial system, and a growing number of stock investment enthusiasts have gradually emerged. Because stock prices are influenced by politics, the economy, technology and other factors, they fluctuate greatly. To maximize investment returns, stock investors urgently need a method that can predict stock price changes more accurately. By comprehensively analyzing the variables that influence stock price changes, the future trend of a stock price can be predicted and investment can be better guided. Such applications fall within the scope of data mining.
Financial short text analysis is a vital task for predicting the trend of the financial stock market, and has many potential uses in fields such as knowledge mining and sentiment analysis. Understanding short texts is very simple for humans, because humans can think and can draw on accumulated experience to make judgments. Automated understanding of financial short texts, however, differs from that of long texts: short texts usually lack a complete syntactic structure, and their very short length does not give a computer enough information for inference.
Summary of the invention
The present invention provides a stock price prediction method based on a Stemming stem dictionary, which solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure, and their very short length does not give a computer enough information for inference.
The stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises:
obtaining a stock comment data set;
removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics;
annotating the preprocessed comment data set with sentiment polarity;
constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction.
Optionally, the replacing, by means of the Stemming stem dictionary, of the coined words in the preprocessed comment data set with common words of corresponding semantics comprises:
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics, and converting common words whose number of characters is greater than a second preset number of characters into short words of corresponding semantics.
Optionally, the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
wherein $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ respectively denote the sample feature vector obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples.
Optionally, the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$
Optionally, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters and removing the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the method further comprises:
performing Chinese word segmentation on the stock comment data set by an n-gram algorithm.
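A minimal Python sketch of the simplest reading of n-gram segmentation, in which a comment is split into overlapping runs of n characters. The function name `char_ngrams` and the default n = 2 are assumptions for illustration; the text does not specify how candidate n-grams are scored or selected.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (hypothetical helper)."""
    # A string shorter than n yields no n-grams at all.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# With n = 2, a four-character string yields three overlapping 2-grams.
print(char_ngrams("abcd"))  # ['ab', 'bc', 'cd']
```

With n = 2 every candidate token has exactly two characters, which is why two-character common words fit this scheme well.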
The stock price prediction device based on a Stemming stem dictionary provided by the present invention comprises:
a first obtaining module, configured to obtain a stock comment data set;
a first removing module, configured to remove from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and to remove from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
a first replacing module, configured to replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics;
a first annotating module, configured to annotate the preprocessed comment data set with sentiment polarity;
a first prediction module, configured to construct corresponding prediction models using a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and perform stock market trend prediction.
Optionally, the replacing, by means of the Stemming stem dictionary, of the coined words in the preprocessed comment data set with common words of corresponding semantics comprises:
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics, and converting common words whose number of characters is greater than a second preset number of characters into short words of corresponding semantics.
Optionally, the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
wherein $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ respectively denote the sample feature vector obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples.
Optionally, the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$
Optionally, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters and removing the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the following is further performed:
performing Chinese word segmentation on the stock comment data set by an n-gram algorithm.
As can be seen from the above technical solutions, the present invention has the following advantages:
The stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises: obtaining a stock comment data set; removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics; annotating the preprocessed comment data set with sentiment polarity; and constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction. A Stemming stem dictionary method is introduced into the processing of the financial short-text corpus to reduce data ambiguity and improve the word segmentation accuracy of the short-text corpus. Sentiment polarity is annotated on the preprocessed comment data set, and corresponding prediction models are constructed using the libsvm classifier and the liblinear classifier to perform sentiment analysis classification and stock market trend prediction. This solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure, and their very short length does not give a computer enough information for inference.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of one embodiment of the stock price prediction method based on a Stemming stem dictionary provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of another embodiment of the stock price prediction method based on a Stemming stem dictionary provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of one embodiment of the stock price prediction device based on a Stemming stem dictionary provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, features and advantages of the present invention more obvious and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flow chart of one embodiment of the stock price prediction method based on a Stemming stem dictionary provided by the present invention. As shown in Fig. 1, the embodiment of the present invention can be applied to a server, and the stock price prediction method based on a Stemming stem dictionary provided by the embodiment may comprise:
S100: obtaining a stock comment data set;
Optionally, Chinese comment data posted by stock investors on domestic Guba stock forums within a certain period may be collected.
S101: removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
S102: replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics;
It should be noted that the data processing of the short-text corpus mainly involves two aspects: the first is the introduction of the Stemming stem dictionary, and the second is Chinese word segmentation. The Stemming stem dictionary can be introduced using a method based on regular expressions. Stock comment data contains a large number of proprietary coined words. These words have special meanings in the stock market, and a program cannot interpret them by their literal meaning during analysis. A batch of coined words common in the stock market is therefore compiled, and when the program reads the data, these special coined words are first converted into common words carrying the same specific meaning before being read and saved. This reduces the ambiguity of the data and can improve the accuracy of word segmentation. The improvement in segmentation accuracy is mainly reflected as follows: when the 2-gram segmentation method is used, the common words in the Stemming stem dictionary can be expressed in the form of two characters. For example, the stock-market term "red three soldiers" can be replaced with "rise". For Chinese word segmentation, an n-gram algorithm is used for segmentation and matching. In a statistical language model, a sentence S is assumed to be expressible as a sequence $S = \omega_1 \omega_2 \cdots \omega_n$, and the language model computes the probability of the sentence S:
$p(S) = \prod_{i=1}^{n} p(\omega_i \mid \omega_1 \omega_2 \cdots \omega_{i-1})$
Computing this probability directly is too expensive. The solution is to map every history $\omega_1 \omega_2 \cdots \omega_{i-1}$ to an equivalence class $S(\omega_1 \omega_2 \cdots \omega_{i-1})$ according to some rule, where the number of equivalence classes is far smaller than the number of distinct histories; that is, it is assumed that $p(\omega_i \mid \omega_1 \omega_2 \cdots \omega_{i-1}) = p(\omega_i \mid S(\omega_1 \omega_2 \cdots \omega_{i-1}))$.
In the N-gram model, two histories are mapped to the same equivalence class when their most recent N-1 characters (or words) are identical. With N = 2 the model is the 2-gram model, which is also called a first-order Markov chain. The value of N cannot be too large, otherwise the computation is still too expensive. According to maximum likelihood estimation, the parameters of the language model are:
$p(\omega_i \mid \omega_{i-N+1} \cdots \omega_{i-1}) = \frac{C(\omega_{i-N+1} \cdots \omega_{i-1} \omega_i)}{C(\omega_{i-N+1} \cdots \omega_{i-1})}$
where $C(\cdot)$ denotes the number of times the given sequence appears in the training data.
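The maximum likelihood estimate for the 2-gram case can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `bigram_mle`, the start marker `<s>` and the toy corpus are assumptions introduced here.

```python
from collections import Counter


def bigram_mle(sentences):
    """Return prob(cur, prev) = C(prev cur) / C(prev), estimated by counting."""
    history = Counter()  # C(prev): how often a token occurs as a history
    pair = Counter()     # C(prev cur): how often the pair occurs
    for tokens in sentences:
        padded = ["<s>"] + list(tokens)  # "<s>" is an assumed start marker
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1
            pair[(prev, cur)] += 1

    def prob(cur, prev):
        # An unseen history gets probability 0 in this unsmoothed sketch.
        if history[prev] == 0:
            return 0.0
        return pair[(prev, cur)] / history[prev]

    return prob


# Toy corpus of already-segmented comments (hypothetical data).
corpus = [["price", "rises"], ["price", "falls"], ["price", "rises"]]
p = bigram_mle(corpus)
print(p("rises", "price"))  # 2 of the 3 "price" histories are followed by "rises"
```

No smoothing is applied, so unseen pairs receive probability zero; a practical model would normally add some form of smoothing.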
In addition, regarding text matching: the present invention may use a method based on regular expressions to match and replace ambiguous words. During regular expression matching, if a subexpression matches character content rather than a position, and that content is saved in the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved in the match result, it is said to be zero-width. Character consumption is mutually exclusive, whereas zero-width matching is not: a character can be matched by only one subexpression at a time, while a position can be matched by several zero-width subexpressions simultaneously. A regular expression is matched from left to right. Normally, an expression takes control and starts matching from some position in the string, and a subexpression starts its attempt from the position where the preceding subexpression finished successfully. For example, in (expression one)(expression two), expression two is tried only after expression one has matched, and it starts from the position where the match of expression one ended. If expression one is zero-width, then after it matches, expression two starts from the same position at which expression one started; that is, the start and end positions of a zero-width match are the same.
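The dictionary replacement and the zero-width matching described above can be sketched with Python's `re` module. This is a minimal sketch under stated assumptions: only the entry "red three soldiers" to "rise" comes from the text, while the second entry, the names `STEM_DICT` and `normalize`, and the use of word-boundary lookarounds are illustrative choices.

```python
import re

# Stemming stem dictionary: coined stock-market term -> common word.
STEM_DICT = {
    "red three soldiers": "rise",   # example given in the text
    "dead cat bounce": "rebound",   # hypothetical entry for illustration
}

# Longest keys first, so a longer term wins over any shorter overlapping one.
_keys = sorted(STEM_DICT, key=len, reverse=True)

# (?<!\w) and (?!\w) are zero-width assertions: they test a position
# and consume no characters, so they never appear in the match result.
PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)"
)


def normalize(comment):
    """Replace every coined term in the comment with its common word."""
    return PATTERN.sub(lambda m: STEM_DICT[m.group(0)], comment)


print(normalize("today shows red three soldiers again"))
# today shows rise again
```

Because the lookarounds are zero-width, the characters next to a term are checked but left in place, so only the term itself is replaced.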
It should be noted, regarding the construction of the lexicon:
The feature lexicon is constructed by combining Chi-square dimensionality reduction with the tf-idf method. Feature words are first obtained through Chi-square dimensionality reduction. The idea of the tf-idf method is that the more often a feature occurs within one document and the fewer documents it occurs in, the stronger its classification ability. The Chi-square function is formalized as:
$\chi^2 = \sum \frac{(A - T)^2}{T}$
where A is the actual value of the annotated data and T is the program's guess value for the annotated data. The computed $\chi^2$ expresses the degree of difference between the actual value and the guess value, and its magnitude is positively correlated with the degree of correlation. Words with a high degree of correlation are extracted as feature words, weights are then assigned to the feature words by tf-idf, and the feature words are saved in the feature lexicon, providing the conditions for model training in the next step. Chi-square dimensionality reduction is combined with tf-idf to construct the feature lexicon because, although Chi-square dimensionality reduction is highly efficient in feature selection, it has a low-frequency-word defect. Combining Chi-square dimensionality reduction with the tf-idf method therefore exploits the strengths of each and offsets their weaknesses.
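A minimal Python sketch of Chi-square scoring followed by tf-idf weighting. It assumes the common 2x2 term/class contingency formulation, in which the observed count plays the role of A and the count expected under independence plays the role of T; the function names, the toy documents and the labels are hypothetical.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, positive):
    """Sum of (A - T)^2 / T over the 2x2 table of term presence vs. class."""
    n = len(docs)
    observed = Counter()  # A: observed count for each table cell
    for tokens, label in zip(docs, labels):
        observed[(term in tokens, label == positive)] += 1
    score = 0.0
    for has_term in (True, False):
        for is_pos in (True, False):
            row = observed[(has_term, True)] + observed[(has_term, False)]
            col = observed[(True, is_pos)] + observed[(False, is_pos)]
            expected = row * col / n  # T: count expected under independence
            if expected > 0:
                score += (observed[(has_term, is_pos)] - expected) ** 2 / expected
    return score


def tf_idf(docs, term, doc_index):
    """Term frequency in one document times log inverse document frequency."""
    tf = docs[doc_index].count(term) / len(docs[doc_index])
    df = sum(1 for tokens in docs if term in tokens)
    return tf * math.log(len(docs) / df) if df else 0.0


# Toy annotated corpus (hypothetical data).
docs = [["rise", "buy"], ["rise", "hold"], ["fall", "sell"], ["fall", "hold"]]
labels = ["good", "good", "bad", "bad"]
print(chi_square(docs, labels, "rise", "good"))  # 4.0: strongly class-related
print(chi_square(docs, labels, "hold", "good"))  # 0.0: independent of the class
```

Terms with a high Chi-square score would be kept as feature words, and `tf_idf` would then supply the weight stored alongside each feature word.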
S103: annotating the preprocessed comment data set with sentiment polarity;
It should be noted that the polarity labels may include words that express a stock investor's sentiment, such as "good" and "not good";
S104: constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction.
libsvm may be based on the RBF (Gaussian radial basis function) kernel. The Gaussian radial basis function (RBF) is a kernel function with strong locality that can map a sample into a higher-dimensional space. It is formalized as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
where the parameter $\sigma^2$ is the variance of the Gaussian kernel function. σ controls the radial range of influence of the function: if σ is too small, over-fitting tends to occur; if σ is too large, under-fitting may occur. $x_i$ and $x_j$ denote the feature vectors of two samples, one being the sample feature vector obtained after model training and the other the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples. The smaller this distance, the closer the test data sample is to the data of that class, and the more likely it is to be predicted as that class; conversely, the larger the distance, the less likely.
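The Gaussian kernel can be evaluated directly, as in the following minimal Python sketch. The function name `rbf_kernel` and the parameter name `sigma2` are assumptions; LIBSVM itself parameterizes this kernel as exp(-gamma * ||u - v||^2), so gamma corresponds to 1 / (2σ²).

```python
import math


def rbf_kernel(x_i, x_j, sigma2):
    """Gaussian RBF kernel exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))  # squared Euclidean distance
    return math.exp(-sq_dist / (2.0 * sigma2))


# Identical samples give the maximum kernel value of 1.0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma2=1.0))  # 1.0
```

The kernel value shrinks toward zero as the two samples move apart, and a smaller `sigma2` makes it shrink faster, which matches the locality described above.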
liblinear may be based on the linear kernel function and is mainly used for linearly separable data. The feature space has the same dimension as the input space. Its advantages are few parameters and high speed, and it performs well on linearly separable data. It is formalized as:
$K(x_i, x_j) = x_i \cdot x_j$
In the linearly separable case, $x_i$ and $x_j$ denote different sample vectors; their inner product is computed, and the sample is assigned to the class whose region the resulting value falls in.
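A minimal Python sketch of the linear kernel and of assigning a class from the region in which the resulting value falls. The weight vector, the bias and the label names are hypothetical; in practice they would come from a trained linear model.

```python
def linear_kernel(x_i, x_j):
    """Inner product of two sample vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))


def predict(weights, bias, x):
    """Assign a class by the side of the hyperplane on which x falls."""
    score = linear_kernel(weights, x) + bias
    return "positive" if score >= 0 else "negative"


print(linear_kernel([1, 2, 3], [4, 5, 6]))       # 32
print(predict([1.0, -1.0], 0.0, [2.0, 1.0]))     # positive
```

Because no mapping to a higher-dimensional space is involved, the cost of one prediction is a single inner product in the input space.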
The stock price prediction method based on a Stemming stem dictionary provided by the embodiment of the present invention comprises: obtaining a stock comment data set; removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics; annotating the preprocessed comment data set with sentiment polarity; and constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction. A Stemming stem dictionary method is introduced into the processing of the financial short-text corpus to reduce data ambiguity and improve the word segmentation accuracy of the short-text corpus. Sentiment polarity is annotated on the preprocessed comment data set, and corresponding prediction models are constructed using the libsvm classifier and the liblinear classifier to perform sentiment analysis classification and stock market trend prediction. This solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure, and their very short length does not give a computer enough information for inference.
The above is a detailed description of one embodiment of the stock price prediction method based on a Stemming stem dictionary. Another embodiment of the stock price prediction method based on a Stemming stem dictionary is described in detail below.
Referring to Fig. 2, another embodiment of the stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises:
S200: obtaining a stock comment data set;
S201: performing Chinese word segmentation on the stock comment data set by an n-gram algorithm;
S202: removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
S203: replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics, and converting common words whose number of characters is greater than a second preset number of characters into short words of corresponding semantics;
S204: annotating the preprocessed comment data set with sentiment polarity;
S205: constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction;
The libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
wherein $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ respectively denote the sample feature vector obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples.
The liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$
The stock price prediction method based on a Stemming stem dictionary provided by the embodiment of the present invention comprises: obtaining a stock comment data set; removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics; annotating the preprocessed comment data set with sentiment polarity; and constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction. A Stemming stem dictionary method is introduced into the processing of the financial short-text corpus to reduce data ambiguity and improve the word segmentation accuracy of the short-text corpus. Sentiment polarity is annotated on the preprocessed comment data set, and corresponding prediction models are constructed using the libsvm classifier and the liblinear classifier to perform sentiment analysis classification and stock market trend prediction. This solves the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure, and their very short length does not give a computer enough information for inference.
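The removal step of this embodiment, applied to comments that have already been segmented, can be sketched in Python as follows. The function name `preprocess`, the vocabulary set, the threshold value and the choice to drop comments left with no tokens are all assumptions made for illustration.

```python
def preprocess(comments, finance_vocab, max_chars):
    """Drop over-long comments, then keep only stock-market finance tokens."""
    kept = []
    for tokens in comments:
        # Corpus length above the first preset number of characters: discard.
        if sum(len(t) for t in tokens) > max_chars:
            continue
        filtered = [t for t in tokens if t in finance_vocab]
        if filtered:  # assumption: a comment with no finance tokens is dropped
            kept.append(filtered)
    return kept


# Hypothetical segmented comments and vocabulary.
comments = [["stock", "rise", "lol"], ["x" * 50], ["weather"]]
vocab = {"stock", "rise"}
print(preprocess(comments, vocab, max_chars=20))  # [['stock', 'rise']]
```

The output of this step is what the Stemming stem dictionary replacement and the sentiment polarity annotation would then operate on.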
Referring to Fig. 3, Fig. 3 shows a structural schematic diagram of the stock price prediction device based on a Stemming stem dictionary provided by an embodiment of the present invention, comprising:
a first obtaining module 301, configured to obtain a stock comment data set;
a first removing module 302, configured to remove from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and to remove from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
a first replacing module 303, configured to replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics;
a first annotating module 304, configured to annotate the preprocessed comment data set with sentiment polarity;
a first prediction module 305, configured to construct corresponding prediction models using a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and perform stock market trend prediction.
Optionally, the replacing, by means of the Stemming stem dictionary, of the coined words in the preprocessed comment data set with common words of corresponding semantics comprises:
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics, and converting common words whose number of characters is greater than a second preset number of characters into short words of corresponding semantics.
Optionally, the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
wherein $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ respectively denote the sample feature vector obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples.
Optionally, the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$
Optionally, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters and removing the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the following is further performed:
performing Chinese word segmentation on the stock comment data set by an n-gram algorithm.
A person skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of both. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The foregoing description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A stock price prediction method based on a Stemming stem dictionary, characterized by comprising:
obtaining a stock comment data set;
removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics;
annotating the preprocessed comment data set with sentiment polarity;
constructing corresponding prediction models using a libsvm classifier and a liblinear classifier, performing sentiment analysis classification, and performing stock market trend prediction.
2. The stock price prediction method based on a Stemming stem dictionary according to claim 1, characterized in that the replacing, by means of the Stemming stem dictionary, of the coined words in the preprocessed comment data set with common words of corresponding semantics comprises:
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding semantics, and converting common words whose number of characters is greater than a second preset number of characters into short words of corresponding semantics.
3. The stock price prediction method based on a Stemming stem dictionary according to claim 2, characterized in that the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
where $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ denote, respectively, the sample feature vector obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ is the squared Euclidean distance between the two samples.
4. The stock price prediction method based on a Stemming stem dictionary according to claim 3, characterized in that the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$
5. The stock price prediction method based on a Stemming stem dictionary according to claim 4, characterized in that, before removing from the stock comment data set the data whose corpus length exceeds the first preset number of characters and removing from the stock comment data set the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the method further comprises:
performing Chinese word segmentation on the stock comment data set by means of an n-gram algorithm.
6. A stock price prediction device based on a Stemming stem dictionary, characterized by comprising:
a first obtaining module, configured to obtain a stock comment data set;
a first removal module, configured to remove from the stock comment data set the data whose corpus length exceeds a first preset number of characters, and to remove from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
a first replacement module, configured to replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning;
a first annotation module, configured to annotate the preprocessed comment data set with sentiment polarity;
a first prediction module, configured to construct corresponding prediction models using a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and forecast stock market development.
7. The stock price prediction device based on a Stemming stem dictionary according to claim 6, characterized in that replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:
replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose length exceeds a second preset number of characters into short words of corresponding meaning.
8. The stock price prediction device based on a Stemming stem dictionary according to claim 7, characterized in that the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
where $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ denote, respectively, the sample feature vector obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ is the squared Euclidean distance between the two samples.
9. The stock price prediction device based on a Stemming stem dictionary according to claim 8, characterized in that the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$
10. The stock price prediction device based on a Stemming stem dictionary according to claim 9, characterized in that, before removing from the stock comment data set the data whose corpus length exceeds the first preset number of characters and removing from the stock comment data set the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the following is further included:
performing Chinese word segmentation on the stock comment data set by means of an n-gram algorithm.
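The two kernel functions named in the claims can be written out in a few lines of Python as a rough illustration. This sketch assumes the common Gaussian kernel normalization by $2\sigma^2$ (under which LIBSVM's own RBF parameter would be gamma = 1/(2σ²)) and treats the linear kernel as the inner product of the two feature vectors. The function names, the sample vectors, and the variance value are hypothetical and are not part of the patent or of the LIBSVM/LIBLINEAR APIs.

```python
import math


def gaussian_kernel(x_i, x_j, sigma2):
    """Gaussian (RBF) kernel: exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    # Squared Euclidean distance between the two feature vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma2))


def linear_kernel(x_i, x_j):
    """Linear kernel: inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))


# Hypothetical feature vectors for a training sample and a test sample.
x_train = [1.0, 0.0, 2.0]
x_test = [1.0, 2.0, 0.0]

k_rbf = gaussian_kernel(x_train, x_test, sigma2=4.0)
k_lin = linear_kernel(x_train, x_test)
```

The Gaussian kernel returns 1 for identical vectors and decays toward 0 as they move apart, whereas the linear kernel grows with the overlap of the two vectors, which is why the two classifiers can behave differently on the same sentiment features.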

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810778565.0A CN108959266A (en) 2018-07-16 2018-07-16 A kind of Forecasting of Stock Prices method and device based on Stemming stem dictionary

Publications (1)

Publication Number Publication Date
CN108959266A true CN108959266A (en) 2018-12-07

Family

ID=64481356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810778565.0A Pending CN108959266A (en) 2018-07-16 2018-07-16 A kind of Forecasting of Stock Prices method and device based on Stemming stem dictionary

Country Status (1)

Country Link
CN (1) CN108959266A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106227802A (en) * 2016-07-20 2016-12-14 广东工业大学 A kind of based on Chinese natural language process and the multiple source Forecasting of Stock Prices method of multi-core classifier
US20170018033A1 (en) * 2015-07-15 2017-01-19 Foundation Of Soongsil University Industry Cooperation Stock fluctuatiion prediction method and server
CN106919673A (en) * 2017-02-21 2017-07-04 浙江工商大学 Text mood analysis system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
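Rao Dongning et al.: "Optimization of a THUCTC-based sentiment analysis model for financial corpora", Journal of Guangdong University of Technology *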

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2018-12-07