CN108959266A - Stock price prediction method and device based on a Stemming stem dictionary - Google Patents
- Publication number
- CN108959266A (application number CN201810778565.0A)
- Authority
- CN
- China
- Prior art keywords
- stock
- comment data
- stemming
- word
- forecasting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/04—Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
Abstract
The invention discloses a stock price prediction method and device based on a Stemming stem dictionary. Sentiment polarity is annotated on the preprocessed comment data set, corresponding prediction models are built with a libsvm classifier and a liblinear classifier, and sentiment classification and stock market trend prediction are performed. This addresses the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a stock price prediction method and device based on a Stemming stem dictionary.
Background
After more than 30 years of rapid development, the stock market occupies a leading position in China's modern financial system, and a growing number of stock investment enthusiasts have emerged. Because stock prices are affected by politics, the economy, technology, and other factors, they fluctuate considerably. To maximize investment returns, stock investors urgently want a method that predicts stock price changes more accurately: by comprehensively analyzing the variables that influence stock prices, the future price trend can be predicted and investment better guided. Such applications fall within the scope of data mining.
Financial short-text analysis is an important task for predicting trends in the financial stock market, and it has many potential uses in fields such as knowledge mining and sentiment analysis. Humans find short texts easy to understand because they can draw on accumulated knowledge and reasoning to make judgments. Automated understanding of financial short texts, however, differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.
Summary of the invention
The stock price prediction method based on a Stemming stem dictionary provided by the present invention addresses the technical problem that automated understanding of financial short texts differs from that of long texts: short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.
A stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises:
Obtaining a stock comment data set;
Removing from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and removing vocabulary not related to stock market finance from the stock comment data set, to obtain a preprocessed comment data set;
Replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning;
Annotating the preprocessed comment data set with sentiment polarity;
Building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment classification, and performing stock market trend prediction.
Optionally, replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:
Replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose character count is greater than a second preset number of characters into short words of corresponding meaning.
Optionally, the libsvm classifier is defined by the kernel function:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
where σ² is the variance of the Gaussian kernel function, x_i and x_j denote the sample feature vector obtained after model training and the feature vector of the test data sample, respectively, and ||x_i - x_j||² denotes the squared Euclidean distance between the two samples.
Optionally, the liblinear classifier is defined by the kernel function:
K(x_i, x_j) = x_i · x_j.
Optionally, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and removing vocabulary not related to stock market finance from the stock comment data set, to obtain the preprocessed comment data set, the method further comprises:
Performing Chinese word segmentation on the stock comment data set with the n-gram algorithm.
A stock price prediction device based on a Stemming stem dictionary provided by the present invention comprises:
A first obtaining module, configured to obtain a stock comment data set;
A first removal module, configured to remove from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and to remove vocabulary not related to stock market finance from the stock comment data set, to obtain a preprocessed comment data set;
A first replacement module, configured to replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning;
A first annotation module, configured to annotate the preprocessed comment data set with sentiment polarity;
A first prediction module, configured to build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment classification, and perform stock market trend prediction.
Optionally, replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:
Replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose character count is greater than a second preset number of characters into short words of corresponding meaning.
Optionally, the libsvm classifier is defined by the kernel function:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
where σ² is the variance of the Gaussian kernel function, x_i and x_j denote the sample feature vector obtained after model training and the feature vector of the test data sample, respectively, and ||x_i - x_j||² denotes the squared Euclidean distance between the two samples.
Optionally, the liblinear classifier is defined by the kernel function:
K(x_i, x_j) = x_i · x_j.
Optionally, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and removing vocabulary not related to stock market finance from the stock comment data set, to obtain the preprocessed comment data set, the following is further performed:
Performing Chinese word segmentation on the stock comment data set with the n-gram algorithm.
As can be seen from the above technical solutions, the present invention has the following advantages:
The stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises: obtaining a stock comment data set; removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and removing vocabulary not related to stock market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning; annotating the preprocessed comment data set with sentiment polarity; and building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment classification, and performing stock market trend prediction. Introducing the Stemming stem dictionary method into the processing of the financial short-text corpus reduces data ambiguity and improves the segmentation accuracy of the short-text corpus. By annotating sentiment polarity on the preprocessed comment data set and building prediction models with the libsvm and liblinear classifiers for sentiment classification and stock market trend prediction, the method addresses the technical problem that automated understanding of financial short texts differs from that of long texts, since short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of one embodiment of a stock price prediction method based on a Stemming stem dictionary provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another embodiment of a stock price prediction method based on a Stemming stem dictionary provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of one embodiment of a stock price prediction device based on a Stemming stem dictionary provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, features, and advantages of the present invention more apparent and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flowchart of one embodiment of a stock price prediction method based on a Stemming stem dictionary provided by the present invention. As shown in Fig. 1, the embodiment can be applied to a server, and the stock price prediction method based on a Stemming stem dictionary provided by the embodiment may comprise:
S100: obtain a stock comment data set;
Optionally, the Chinese comment data posted by stock investors on a domestic Guba (stock discussion) forum within a certain period of time can be collected.
S101: remove from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and remove vocabulary not related to stock market finance from the stock comment data set, to obtain a preprocessed comment data set;
S102: replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning;
It should be noted that the processing of the short-text corpus has two main aspects: the introduction of the Stemming stem dictionary, and Chinese word segmentation.
The Stemming stem dictionary can be introduced with a method based on regular expressions. Stock comment data contains a large number of specialized coined words that carry a special meaning in the stock market, so a program cannot interpret them by their literal meaning during analysis. A batch of coined words common in the stock market is therefore compiled, and when the program reads the data it first converts these coined words into common words that carry the same specific meaning before reading and saving them. This reduces the ambiguity of the data and can improve segmentation accuracy. The gain in segmentation accuracy comes mainly from the fact that, when the 2-gram segmentation method is used, the common words in the Stemming stem dictionary can be expressed as two-character words. For example, the stock market term "three red soldiers" can be replaced with "rise".
For Chinese word segmentation, the n-gram algorithm is used for segmentation and matching. In a statistical language model, a sentence S is assumed to be representable as a sequence S = ω_1 ω_2 … ω_n, and the language model computes the probability of the sentence S:
p(S) = p(ω_1) p(ω_2 | ω_1) … p(ω_n | ω_1 ω_2 … ω_{n-1}) = ∏_{i=1}^{n} p(ω_i | ω_1 ω_2 … ω_{i-1})
Computing this probability directly is too expensive. The solution is to map every history ω_1 ω_2 … ω_{i-1} to an equivalence class S(ω_1 ω_2 … ω_{i-1}) according to some rule, where the number of equivalence classes is far smaller than the number of distinct histories; that is, it is assumed that p(ω_i | ω_1 ω_2 … ω_{i-1}) = p(ω_i | S(ω_1 ω_2 … ω_{i-1})).
N-gram model: when the most recent N-1 characters (or words) of two histories are identical, the two histories are mapped to the same equivalence class, and the resulting model is called an N-gram model. A 2-gram model is also called a first-order Markov chain. N cannot be too large, otherwise the computation is still too expensive. By maximum likelihood estimation, the parameters of the language model are:
p(ω_i | ω_{i-N+1} … ω_{i-1}) = C(ω_{i-N+1} … ω_i) / C(ω_{i-N+1} … ω_{i-1})
where C(ω_1 ω_2 … ω_i) denotes the number of times ω_1 ω_2 … ω_i occurs in the training data.
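The maximum likelihood estimate for the 2-gram case can be sketched as follows. This is a minimal illustration, not the patent's implementation: it works at the character level, the tiny corpus is invented for the example, and the function names are hypothetical. The sentence probability ignores the start-of-sentence term p(ω_1).

```python
from collections import Counter

def bigrams(chars):
    """Return the adjacent 2-character pairs of a character sequence."""
    return list(zip(chars, chars[1:]))

def train_bigram_model(corpus):
    """Count single characters and adjacent character pairs over all sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        chars = list(sentence)
        unigram_counts.update(chars)
        bigram_counts.update(bigrams(chars))
    return unigram_counts, bigram_counts

def bigram_probability(prev, cur, unigram_counts, bigram_counts):
    """Maximum likelihood estimate p(cur | prev) = C(prev cur) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

def sentence_probability(sentence, unigram_counts, bigram_counts):
    """Product of the bigram probabilities (start-of-sentence term omitted)."""
    prob = 1.0
    for prev, cur in bigrams(list(sentence)):
        prob *= bigram_probability(prev, cur, unigram_counts, bigram_counts)
    return prob

# Invented toy corpus: "stock price rises" twice, "stock price falls" once.
corpus = ["股价上涨", "股价下跌", "股价上涨"]
uni, bi = train_bigram_model(corpus)
p_up = bigram_probability("价", "上", uni, bi)
```

A real system would add smoothing for unseen pairs, which this sketch leaves out.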
In addition, regarding text matching: the present invention can use a method based on regular expressions to match and replace ambiguous words. During regular expression matching, if a subexpression matches character content rather than a position, and that content is saved in the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved in the match result, the subexpression is said to be zero-width. Character consumption is mutually exclusive, whereas zero-width matching is not: a character can be matched by only one subexpression at a time, but a position can be matched by several zero-width subexpressions simultaneously. A regular expression is matched from left to right. Normally, when one expression gains control, it tries to match from some position in the string, and a subexpression begins its attempt at the position where the previous subexpression's successful match ended. For example, in (expression one)(expression two), expression two is attempted only after expression one has matched, and expression two starts from the position where expression one's match ended. If expression one is zero-width, then after expression one has matched, expression two starts from the same position where expression one started, because a zero-width match begins and ends at the same position.
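The regular-expression replacement of coined words, and the difference between consuming and zero-width subexpressions, can be sketched with Python's standard `re` module. The dictionary contents are assumptions: "红三兵" is taken here as the Chinese original of the "three red soldiers" example, and the second entry is invented for illustration.

```python
import re

# Hypothetical stem dictionary: stock-market coined word -> common word.
STEM_DICTIONARY = {
    "红三兵": "上涨",  # assumed original of the "three red soldiers" -> "rise" example
    "割肉": "卖出",    # invented entry, for illustration only
}

# One alternation over all coined words; longer entries first so that a
# longer coined word is preferred over any shorter entry it starts with.
_PATTERN = re.compile(
    "|".join(re.escape(word) for word in sorted(STEM_DICTIONARY, key=len, reverse=True))
)

def replace_coined_words(text):
    """Replace every coined word in the text with its common-word equivalent."""
    return _PATTERN.sub(lambda match: STEM_DICTIONARY[match.group(0)], text)

replaced = replace_coined_words("今天出现红三兵")

# A consuming subexpression matches characters and advances the position.
consuming = re.search(r"上涨", "明天上涨")
# A zero-width lookahead matches only a position: start and end coincide.
zero_width = re.search(r"(?=上涨)", "明天上涨")
```

Here `consuming.span()` covers the two matched characters, while `zero_width.span()` has equal start and end, which is the zero-width behavior the paragraph describes.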
It should be noted, regarding the construction of the dictionary:
The feature dictionary is built by combining Chi-square dimensionality reduction with the tf-idf method. Feature words are first obtained through Chi-square dimensionality reduction. The idea of the tf-idf method is that the more often a feature occurs within one document and the fewer documents it occurs in, the stronger its ability to discriminate between classes. The Chi-square function is formalized as:
χ² = Σ (A - T)² / T
where A is the actual value of the annotated data and T is the program's guess value for the annotated data. The computed χ² expresses the degree of difference between the actual value and the guess value, and its magnitude is positively correlated with the degree of association. Words with a high degree of association are extracted here as feature words; tf-idf then assigns weights to the feature words, which are saved in the feature dictionary to prepare for model training in the next step. Chi-square dimensionality reduction is combined with tf-idf to build the feature dictionary because, although Chi-square dimensionality reduction is highly efficient for feature selection, it has a low-frequency-word defect. Combining the two methods exploits the strengths of each and offsets its weaknesses.
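A minimal sketch of this two-stage construction follows, assuming that A corresponds to the observed count in each cell of a term/class contingency table and T to the count expected if term and class were independent. The documents, labels, and function names are invented for illustration, and the tf-idf variant shown (relative term frequency times log inverse document frequency) is one common choice, not necessarily the one used in the patent.

```python
import math
from collections import Counter

def chi_square(term, docs, labels, target):
    """Chi-square statistic sum((A - T)^2 / T) over the 2x2 term/class table."""
    n = len(docs)
    observed = Counter()
    for tokens, label in zip(docs, labels):
        observed[(term in tokens, label == target)] += 1
    score = 0.0
    for has_term in (True, False):
        for in_class in (True, False):
            row_total = sum(observed[(has_term, c)] for c in (True, False))
            col_total = sum(observed[(t, in_class)] for t in (True, False))
            expected = row_total * col_total / n  # T under independence
            if expected > 0:
                score += (observed[(has_term, in_class)] - expected) ** 2 / expected
    return score

def tf_idf(term, tokens, docs):
    """Weight = relative frequency in this document * log(N / document frequency)."""
    document_frequency = sum(1 for d in docs if term in d)
    if document_frequency == 0 or not tokens:
        return 0.0
    term_frequency = tokens.count(term) / len(tokens)
    return term_frequency * math.log(len(docs) / document_frequency)

# Invented, already segmented comments with sentiment labels.
docs = [["股价", "上涨"], ["大盘", "上涨"], ["股价", "下跌"], ["大盘", "下跌"]]
labels = ["pos", "pos", "neg", "neg"]

score_rise = chi_square("上涨", docs, labels, "pos")   # strongly tied to "pos"
score_price = chi_square("股价", docs, labels, "pos")  # independent of the class
weight_rise = tf_idf("上涨", docs[0], docs)
```

In this toy data the word for "rise" separates the classes perfectly and scores high, while the word for "stock price" appears equally in both classes and scores zero, so only the former would be kept as a feature word and then weighted.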
S103: annotate the preprocessed comment data set with sentiment polarity;
It should be noted that the polarity may include words such as "good" and "not good" that can express the sentiment of stock investors;
S104: build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment classification, and perform stock market trend prediction.
Libsvm can be based on the RBF (Gaussian radial basis function) kernel. The Gaussian radial basis function is a strongly local kernel function that can map a sample into a higher-dimensional space. It is formalized as:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
where the parameter σ² is the variance of the Gaussian kernel function. σ controls the radial range of influence of the function: if σ is too small, overfitting is likely; if σ is too large, underfitting may occur. x_i and x_j denote the feature vectors of two samples, one being the sample feature vector obtained after model training and the other the feature vector of the test data sample, and ||x_i - x_j||² denotes the squared Euclidean distance between the two samples. The smaller this distance, the closer the test data sample is to data of that class, and the greater the probability that it is predicted as that class; conversely, the probability decreases.
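The Gaussian kernel described above can be computed directly, as in the following sketch. The function names and sample vectors are illustrative assumptions. Note that LIBSVM itself parameterizes its RBF kernel with gamma, as exp(-gamma * ||u - v||²), so the σ used here corresponds to gamma = 1 / (2σ²).

```python
import math

def squared_euclidean(x, y):
    """Squared Euclidean distance ||x - y||^2 between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def rbf_kernel(x_i, x_j, sigma=1.0):
    """Gaussian RBF kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    return math.exp(-squared_euclidean(x_i, x_j) / (2.0 * sigma ** 2))

def sigma_to_gamma(sigma):
    """Convert sigma to the gamma parameter of the form exp(-gamma * ||u - v||^2)."""
    return 1.0 / (2.0 * sigma ** 2)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])                # identical samples
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)     # distance^2 = 2
wide = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=10.0)    # larger sigma, slower decay
```

The kernel value is 1 for identical samples and decays toward 0 as the distance grows; a larger σ slows the decay, which matches the overfitting and underfitting remark in the text.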
Liblinear can be based on the linear kernel function, which is mainly used for linearly separable data. The feature space has the same dimension as the input space. Its advantages are that it has few parameters and is fast, and it works well on linearly separable data. It is formalized as:
K(x_i, x_j) = x_i · x_j;
In the linearly separable case, x_i and x_j represent different sample vectors; their inner product is computed, and the region in which the resulting value falls determines the class to which the sample belongs.
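As a small illustration of the linear case, the sketch below computes the inner product and classifies a sample by the sign of a linear decision function w · x + b. The weight vector and bias are hypothetical placeholders for values a trained linear classifier would provide; they are not taken from the patent.

```python
def linear_kernel(x_i, x_j):
    """Linear kernel: the inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))

def predict(weights, bias, x):
    """Return +1 or -1 according to the side of the hyperplane w . x + b = 0."""
    return 1 if linear_kernel(weights, x) + bias >= 0 else -1

# Hypothetical trained parameters for a two-feature sentiment model.
weights = [1.0, -1.0]
bias = 0.0

inner = linear_kernel([1, 2, 3], [4, 5, 6])
positive = predict(weights, bias, [2.0, 1.0])   # 2 - 1 = 1, non-negative side
negative = predict(weights, bias, [1.0, 3.0])   # 1 - 3 = -2, negative side
```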
The stock price prediction method based on a Stemming stem dictionary provided by this embodiment of the present invention comprises: obtaining a stock comment data set; removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and removing vocabulary not related to stock market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning; annotating the preprocessed comment data set with sentiment polarity; and building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment classification, and performing stock market trend prediction. Introducing the Stemming stem dictionary method into the processing of the financial short-text corpus reduces data ambiguity and improves the segmentation accuracy of the short-text corpus. By annotating sentiment polarity on the preprocessed comment data set and building prediction models with the libsvm and liblinear classifiers for sentiment classification and stock market trend prediction, the method addresses the technical problem that automated understanding of financial short texts differs from that of long texts, since short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.
The above is a detailed description of one embodiment of the stock price prediction method based on a Stemming stem dictionary; another embodiment of the method is described in detail below.
Referring to Fig. 2, another embodiment of a stock price prediction method based on a Stemming stem dictionary provided by the present invention comprises:
S200: obtain a stock comment data set;
S201: perform Chinese word segmentation on the stock comment data set with the n-gram algorithm;
S202: remove from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and remove vocabulary not related to stock market finance from the stock comment data set, to obtain a preprocessed comment data set;
S203: replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and convert common words whose character count is greater than the second preset number of characters into short words of corresponding meaning;
S204: annotate the preprocessed comment data set with sentiment polarity;
S205: build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment classification, and perform stock market trend prediction;
The libsvm classifier is defined by the kernel function:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
where σ² is the variance of the Gaussian kernel function, x_i and x_j denote the sample feature vector obtained after model training and the feature vector of the test data sample, respectively, and ||x_i - x_j||² denotes the squared Euclidean distance between the two samples.
The liblinear classifier is defined by the kernel function:
K(x_i, x_j) = x_i · x_j.
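The preprocessing steps S201 to S203 of this embodiment can be sketched as a small pipeline. Everything concrete here is an assumption made for illustration: the two thresholds stand in for the first and second preset numbers of characters, the vocabulary and dictionaries are invented, and `toy_segment` is a simple forward-maximum-matching stand-in, not the n-gram segmenter of the patent.

```python
MAX_COMMENT_CHARS = 20  # assumed "first preset number of characters"
MAX_WORD_CHARS = 2      # assumed "second preset number of characters"

FINANCE_VOCAB = {"股价", "上涨", "下跌", "红三兵"}  # hypothetical finance vocabulary
STEM_DICTIONARY = {"红三兵": "持续上涨"}            # hypothetical coined word -> common word
SHORT_FORMS = {"持续上涨": "上涨"}                  # hypothetical long word -> short word

def toy_segment(text, vocab=FINANCE_VOCAB, max_len=3):
    """Toy segmenter: longest vocabulary match first, else a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def preprocess(comments, segment=toy_segment):
    """Apply S201 (segment), S202 (filter), and S203 (replace, shorten) in order."""
    result = []
    for comment in comments:
        if len(comment) > MAX_COMMENT_CHARS:      # S202: drop overlong comments
            continue
        words = segment(comment)                   # S201: word segmentation
        words = [w for w in words if w in FINANCE_VOCAB]     # S202: keep finance words
        words = [STEM_DICTIONARY.get(w, w) for w in words]   # S203: coined -> common
        words = [SHORT_FORMS.get(w, w) if len(w) > MAX_WORD_CHARS else w
                 for w in words]                   # S203: long common -> short
        if words:
            result.append(words)
    return result

processed = preprocess(["今天股价红三兵", "长" * 30])
```

The output of such a pipeline would then be annotated with sentiment polarity (S204) and turned into feature vectors for the two classifiers (S205).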
The stock price prediction method based on a Stemming stem dictionary provided by this embodiment of the present invention comprises: obtaining a stock comment data set; removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and removing vocabulary not related to stock market finance, to obtain a preprocessed comment data set; replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning; annotating the preprocessed comment data set with sentiment polarity; and building corresponding prediction models with a libsvm classifier and a liblinear classifier, performing sentiment classification, and performing stock market trend prediction. Introducing the Stemming stem dictionary method into the processing of the financial short-text corpus reduces data ambiguity and improves the segmentation accuracy of the short-text corpus. By annotating sentiment polarity on the preprocessed comment data set and building prediction models with the libsvm and liblinear classifiers for sentiment classification and stock market trend prediction, the method addresses the technical problem that automated understanding of financial short texts differs from that of long texts, since short texts usually lack a complete syntactic structure and are too short to give a computer enough information for inference.
Referring to Fig. 3, Fig. 3 shows a schematic structural diagram of a stock price prediction device based on a Stemming stem dictionary provided by an embodiment of the present invention, comprising:
A first obtaining module 301, configured to obtain a stock comment data set;
A first removal module 302, configured to remove from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and to remove vocabulary not related to stock market finance from the stock comment data set, to obtain a preprocessed comment data set;
A first replacement module 303, configured to replace, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning;
A first annotation module 304, configured to annotate the preprocessed comment data set with sentiment polarity;
A first prediction module 305, configured to build corresponding prediction models with a libsvm classifier and a liblinear classifier, perform sentiment classification, and perform stock market trend prediction.
Optionally, replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning comprises:
Replacing, by means of the Stemming stem dictionary, the coined words in the preprocessed comment data set with common words of corresponding meaning, and converting common words whose character count is greater than the second preset number of characters into short words of corresponding meaning.
Optionally, the libsvm classifier is defined by the kernel function:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
where σ² is the variance of the Gaussian kernel function, x_i and x_j denote the sample feature vector obtained after model training and the feature vector of the test data sample, respectively, and ||x_i - x_j||² denotes the squared Euclidean distance between the two samples.
Optionally, the liblinear classifier is defined by the kernel function:
K(x_i, x_j) = x_i · x_j.
Optionally, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters, and removing vocabulary not related to stock market finance from the stock comment data set, to obtain the preprocessed comment data set, the following is further performed:
Performing Chinese word segmentation on the stock comment data set with the n-gram algorithm.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. a kind of Forecasting of Stock Prices method based on Stemming stem dictionary characterized by comprising
Obtain stock comment data collection;
Rejecting the stock comment data concentrates corpus length to be greater than the data of the first preset characters, and rejects the stock comment
The vocabulary of the stock market data set Zhong Fei finance obtains pretreatment comment data collection;
The artificial word that comment data is concentrated, which will be pre-processed, by Stemming stem dictionary replaces with corresponding semantic generic word;
The mark of feeling polarities is carried out to the pretreatment comment data collection;
Corresponding prediction model is constructed using libsvm classifier and liblinear classifier, sentiment analysis classification is carried out and goes forward side by side
Row stock market development.
2. The stock price forecasting method based on a Stemming stem dictionary according to claim 1, characterized in that replacing, by means of the Stemming stem dictionary, the artificial words in the preprocessed comment data set with generic words of corresponding semantics comprises:
replacing, by means of the Stemming stem dictionary, the artificial words in the preprocessed comment data set with generic words of corresponding semantics, and converting generic words whose character count is greater than a second preset number of characters into short words of corresponding semantics.
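The two-stage replacement in claim 2 can be illustrated with a pair of lookup tables. The dictionary contents, the threshold `MAX_WORD_CHARS`, and the function name `normalize` below are hypothetical; the patent does not list the entries of its Stemming stem dictionary.

```python
# Hypothetical dictionary entries for illustration only.
STEM_DICT = {"mooning": "appreciation", "tanking": "fall"}  # artificial word -> generic word
SHORT_FORMS = {"appreciation": "rise"}                       # long generic word -> short word
MAX_WORD_CHARS = 8  # stands in for the "second preset number of characters"


def normalize(tokens, stem_dict=STEM_DICT, short_forms=SHORT_FORMS,
              max_word_chars=MAX_WORD_CHARS):
    """Map artificial words to generic words, then shorten over-long generic words."""
    out = []
    for token in tokens:
        word = stem_dict.get(token, token)          # stage 1: artificial -> generic
        if len(word) > max_word_chars:
            word = short_forms.get(word, word)      # stage 2: long generic -> short
        out.append(word)
    return out


print(normalize(["stock", "mooning", "tanking"]))
```

Words missing from either table pass through unchanged, so the step never discards tokens; it only reduces the number of distinct surface forms the classifier has to learn.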
3. The stock price forecasting method based on a Stemming stem dictionary according to claim 2, characterized in that the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
where $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ respectively denote the feature vector of a sample obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples.
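A Gaussian kernel with variance σ² over the squared Euclidean distance can be computed as in the sketch below. The form exp(−‖x_i − x_j‖² / (2σ²)) is the standard Gaussian kernel and is assumed here because the claim names only the variance and the squared distance; the function name `gaussian_kernel` is illustrative.

```python
import math


def gaussian_kernel(x_i, x_j, sigma2):
    """Gaussian (RBF) kernel; `sigma2` is the variance of the kernel function."""
    # Squared Euclidean distance between the two feature vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma2))


print(gaussian_kernel([1.0, 2.0], [1.0, 2.0], 1.0))  # identical vectors give 1.0
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], 1.0))  # squared distance 2 gives exp(-1)
```

The kernel value decays from 1 toward 0 as the two samples move apart, and a larger variance makes that decay slower.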
4. The stock price forecasting method based on a Stemming stem dictionary according to claim 3, characterized in that the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$.
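Reading the product in claim 4 as the inner product of the two feature vectors, which is the usual meaning of a linear kernel, the computation is a single sum. The function name `linear_kernel` is illustrative and this reading is an assumption.

```python
def linear_kernel(x_i, x_j):
    """Linear kernel: inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))


print(linear_kernel([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Unlike the Gaussian kernel, the linear kernel has no bandwidth parameter, which is one reason linear classifiers are often preferred for high-dimensional sparse text features.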
5. The stock price forecasting method based on a Stemming stem dictionary according to claim 4, characterized in that, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters and removing the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the method further comprises:
performing Chinese word segmentation on the stock comment data set by an n-gram algorithm.
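The claim does not say which n-gram procedure is used. One simple reading, sketched below, splits each comment into overlapping character n-grams; a segmenter driven by an n-gram language model would be another valid reading. The function name `char_ngrams` and the default `n=2` are assumptions.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Text shorter than n is returned as a single unit (or nothing if empty).
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


print(char_ngrams("股票上涨"))
```

For the four-character input above, bigram splitting yields three overlapping units, which can then be filtered against a finance vocabulary in the later rejection step.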
6. A stock price forecasting device based on a Stemming stem dictionary, characterized by comprising:
a first obtaining module, configured to obtain a stock comment data set;
a first removal module, configured to remove from the stock comment data set the data whose corpus length is greater than a first preset number of characters, and to remove from the stock comment data set the vocabulary unrelated to stock-market finance, to obtain a preprocessed comment data set;
a first replacement module, configured to replace, by means of the Stemming stem dictionary, the artificial words in the preprocessed comment data set with generic words of corresponding semantics;
a first labeling module, configured to label the sentiment polarity of the preprocessed comment data set;
a first prediction module, configured to construct corresponding prediction models using a libsvm classifier and a liblinear classifier, perform sentiment analysis classification, and predict stock market development.
7. The stock price forecasting device based on a Stemming stem dictionary according to claim 6, characterized in that replacing, by means of the Stemming stem dictionary, the artificial words in the preprocessed comment data set with generic words of corresponding semantics comprises:
replacing, by means of the Stemming stem dictionary, the artificial words in the preprocessed comment data set with generic words of corresponding semantics, and converting generic words whose character count is greater than a second preset number of characters into short words of corresponding semantics.
8. The stock price forecasting device based on a Stemming stem dictionary according to claim 7, characterized in that the libsvm classifier is defined as:
$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
where $\sigma^2$ is the variance of the Gaussian kernel function, $x_i$ and $x_j$ respectively denote the feature vector of a sample obtained after model training and the feature vector of a test data sample, and $\|x_i - x_j\|^2$ denotes the squared Euclidean distance between the two samples.
9. The stock price forecasting device based on a Stemming stem dictionary according to claim 8, characterized in that the liblinear classifier is defined as:
$K(x_i, x_j) = x_i \cdot x_j$.
10. The stock price forecasting device based on a Stemming stem dictionary according to claim 9, characterized in that, before removing from the stock comment data set the data whose corpus length is greater than the first preset number of characters and removing the vocabulary unrelated to stock-market finance to obtain the preprocessed comment data set, the device is further configured for:
performing Chinese word segmentation on the stock comment data set by an n-gram algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810778565.0A CN108959266A (en) | 2018-07-16 | 2018-07-16 | A kind of Forecasting of Stock Prices method and device based on Stemming stem dictionary |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810778565.0A CN108959266A (en) | 2018-07-16 | 2018-07-16 | A kind of Forecasting of Stock Prices method and device based on Stemming stem dictionary |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108959266A true CN108959266A (en) | 2018-12-07 |
Family ID: 64481356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810778565.0A Pending CN108959266A (en) | 2018-07-16 | 2018-07-16 | A kind of Forecasting of Stock Prices method and device based on Stemming stem dictionary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108959266A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106227802A (en) * | 2016-07-20 | 2016-12-14 | 广东工业大学 | A kind of based on Chinese natural language process and the multiple source Forecasting of Stock Prices method of multi-core classifier |
US20170018033A1 (en) * | 2015-07-15 | 2017-01-19 | Foundation Of Soongsil University Industry Cooperation | Stock fluctuatiion prediction method and server |
CN106919673A (en) * | 2017-02-21 | 2017-07-04 | 浙江工商大学 | Text mood analysis system based on deep learning |
- 2018-07-16 CN CN201810778565.0A patent/CN108959266A/en active Pending
Non-Patent Citations (1)
Title |
---|
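Rao Dongning et al.: "Optimization of a THUCTC-based sentiment analysis model for financial corpora", Journal of Guangdong University of Technology * |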
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230016365A1 (en) | Method and apparatus for training text classification model | |
Wu et al. | Neural Chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation | |
Yasunaga et al. | Robust multilingual part-of-speech tagging via adversarial training | |
Tang et al. | Aspect level sentiment classification with deep memory network | |
TWI682302B (en) | Risk address identification method, device and electronic equipment | |
CN110110335B (en) | Named entity identification method based on stack model | |
CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
Zhang et al. | Neural networks incorporating dictionaries for Chinese word segmentation | |
CN112084327A (en) | Classification of sparsely labeled text documents while preserving semantics | |
CN108268447A (en) | A kind of mask method of Tibetan language name entity | |
Quan et al. | Weighted high-order hidden Markov models for compound emotions recognition in text | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
CN111339260A (en) | BERT and QA thought-based fine-grained emotion analysis method | |
CN112287100A (en) | Text recognition method, spelling error correction method and voice recognition method | |
CN112464669A (en) | Stock entity word disambiguation method, computer device and storage medium | |
CN110750646A (en) | Attribute description extracting method for hotel comment text | |
CN111930936A (en) | Method and system for excavating platform message text | |
CN114398943B (en) | Sample enhancement method and device thereof | |
Romero et al. | Modern vs diplomatic transcripts for historical handwritten text recognition | |
US20190095525A1 (en) | Extraction of expression for natural language processing | |
Zhang et al. | Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents | |
Che et al. | Deep learning in lexical analysis and parsing | |
CN112765353B (en) | Scientific research text-based biomedical subject classification method and device | |
CN108959266A (en) | A kind of Forecasting of Stock Prices method and device based on Stemming stem dictionary | |
Xie et al. | Automatic chinese spelling checking and correction based on character-based pre-trained contextual representations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181207 |