CN102023967A - Text emotion classifying method in stock field - Google Patents


Publication number
CN102023967A
Authority
CN
China
Prior art keywords
stock
speech
text
emotion
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105432677A
Other languages
Chinese (zh)
Inventor
张勇
高旸
周莉
邢春晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN2010105432677A
Publication of CN102023967A

Landscapes

  • Machine Translation (AREA)

Abstract

A text sentiment classification method for the stock domain, belonging to the technical field of stock sentiment-orientation analysis. Using publicly available media information, including stock news, the method applies an improved appraisal group to select features from an expanded set of stock sentiment words, weights the sentiment words in Chinese stock texts with normalized absolute term-frequency weights, and finally applies a Bayes, K-NN, or SVM text sentiment classification algorithm to analyze the orientation of stock news. The invention is simple, feasible, and computationally convenient.

Description

A text sentiment classification method for the stock domain
Technical field
The invention belongs to the field of text sentiment classification in natural language processing, and specifically relates to a text sentiment classification method for the stock domain.
Background
With economic growth and rising living standards, investing by buying stocks has gradually become a trend in society, and how to buy stocks accurately is a question of great concern to investors. Meanwhile, with the rapid development of network technology, the Internet, being real-time, rich in content, and wide in coverage, is gradually replacing traditional news media as the main channel through which people obtain information. More and more stock news appears online, including macroeconomic news, news on individual stocks, industry overviews, and news about listed companies.
The Efficient Markets Hypothesis (EMH), also known as efficient market theory, originated in the paper "The Behavior of Stock-Market Prices", published in the Journal of Business in 1965 by Professor Eugene Fama of the University of Chicago, and was developed further in his 1970 paper "Efficient Capital Markets: A Review of Theory and Empirical Work" in the Journal of Finance. The hypothesis assumes that all public information is reflected in market prices: if relevant information is reflected in security prices fully and without distortion, the market is efficient. Since security prices fully reflect all available information, the available relevant information becomes the effective determinant of price.
According to the class of available information, the theory distinguishes three forms of efficient capital market: weak-form, semi-strong-form, and strong-form efficiency. Judging from conditions in China, most domestic scholars hold that the Chinese stock market is weak-form efficient.
In a weak-form efficient market, information takes some time after release to be reflected in share prices; that is, after information is released, a stock needs a period of time to adjust to a suitable price. The influence of stock news on the stock market therefore cannot be ignored: the volume of news and the orientation of its content can, to a large extent, sway investors' buying behavior. For example, when the State Council announced on April 24th, 2010 that the stamp duty rate would be cut from 3‰ to 1‰, the Shanghai stock index surged 304 points and more than a thousand individual stocks hit their daily upper limit. Likewise, during the "Two Sessions" at the beginning of 2010, the government work report proposed developing a "low-carbon economy", after which the "new energy sector" drew interest and gradually strengthened. Studying the orientation of stock market news to assist investors in making investment decisions therefore has practical significance.
Classification based on sentiment identifies whether a text is positive or negative; research of this kind is called sentiment classification. Text sentiment classification is a special text classification problem: the sentiment orientation of a text is judged by mining and analyzing the subjective information in it, such as positions, opinions, views, and moods. It is a good way to judge orientation and has been applied successfully in personalized recommendation, personalized opinion retrieval, user interest mining, information filtering, mail filtering, and public opinion analysis.
Some enterprises at home and abroad currently provide financial information services, for example Great Wisdom Information and Wei Saite Information in China and Reuters abroad. However, the services these companies offer are generally expensive and beyond the means of ordinary investors. It is therefore worth using information that is easy to obtain from financial websites, such as news: after processing by text sentiment classification, a positive or negative indication is given for each news item, which can help investors make investment decisions more quickly.
Summary of the invention
The object of the present invention is to provide a text sentiment classification method for the stock domain that gives a suggested sentiment-orientation class for stock news.
The invention is characterized in that the text sentiment classification is a sentiment-based classification used to identify whether a Chinese text in the stock domain is positive or negative, and the classification method is carried out in a computer in the following steps:
Step (1): initialize the computer and set up the following software tools:
an Add-delta data smoothing algorithm module;
the Chinese lexical analysis module ICTCLAS, used for Chinese word segmentation of stock news;
an appraisal module used for text feature selection;
the Weka module used for classification experiments, which includes classification algorithms such as Bayes and K-NN;
and define the new words used when segmenting Chinese text in the stock domain:
abbreviations, including but not limited to: "PetroChina", "state throw" and "middle gold";
proper nouns, including but not limited to: "joint-stock limited company" and "securities investment fund";
derived words, including but not limited to: "dark-horse stock", "leading the rise" and "bearish news";
compound words, including but not limited to: "surging and then falling back" and "dividend distribution";
Step (2): input to the computer, as the original corpus (that is, the Chinese text), the headlines of the selected securities news together with related stock information, including common securities terms and the stock names used for sentiment classification;
Step (3): Chinese text segmentation: cut the Chinese character sequences of the Chinese text of step (2) into individual words with independent meaning, as follows:
Step (3.1): build a stock-domain segmentation dictionary using the n-gram statistical language model employed for new word discovery, as follows:
Step (3.1.1): build the n-gram model.
A character string sequence (n-gram) is written $W = w_1 w_2 \ldots w_n$, where $w_i$ denotes one character and n, an integer from 2 to 6, is the number of characters in the string.
The probability $P_{MLE}(w_n \mid w_1 w_2 \ldots w_{n-1})$ that the string W occurs in the Chinese text is then calculated by the formula below, where MLE indicates that the parameter is estimated by maximum likelihood estimation; this is called the n-gram language model.
If a character string has length L, cutting it into strings of length n yields L-n+1 strings, and the frequency of occurrence of identical strings among them is counted, where
$$P_{MLE}(w_n \mid w_1 w_2 \ldots w_{n-1}) = \frac{C(w_1 w_2 \ldots w_n)}{C(w_1 w_2 \ldots w_{n-1})},$$
$C(w_1 w_2 \ldots w_n)$ is the number of times the string $w_1 w_2 \ldots w_n$ occurs in the original corpus, and $C(w_1 w_2 \ldots w_{n-1})$ is the number of times the string formed by its first n-1 characters $w_1 w_2 \ldots w_{n-1}$ occurs in the original corpus.
Step (3.1.2): smooth the strings obtained in step (3.1.1) with the improved Add-delta data smoothing algorithm,
$$P_{Add\text{-}delta}(w_1 w_2 \ldots w_n) = \frac{C(w_1 w_2 \ldots w_n) + \Delta}{C(w_1 w_2 \ldots w_{n-1}) + \Delta \cdot N},$$
where Δ = 0.5 and N is the number of all n-gram strings in the original corpus.
Step (3.1.3): filter out the useless substrings of common words:
when the difference between the frequency of a common word, taken as the parent string, and that of one of its substrings is less than 0.0001, and the difference between the lengths of the common word and the substring is less than 3, the substring is filtered out.
Steps (3.1.1) to (3.1.3) yield the stock-domain segmentation dictionary.
Step (3.2): segment the stock news by combining the stock-domain segmentation dictionary obtained in steps (3.1.1) to (3.1.3) with the ICTCLAS Chinese lexical analysis module, which is based on a multilayer Markov model;
Step (4): remove the stop words from the stock news segmentation result obtained in step (3.2); a stop word is a token that occurs more frequently than common words but carries no practical meaning:
Step (4.1): build the stop word list for stock news text and input it to the computer. The list contains prepositions, articles, auxiliary words, conjunctions, and punctuation marks, as well as words commonly used as prompts in stock news, including but not limited to "news flash", "sharp point", "deep bid" and "market";
Step (4.2): use the stop word list to remove stop words from the stock news segmentation result obtained in step (3);
Step (5): on the basis of the preprocessing of steps (3) and (4), represent the stock Chinese text with a vector space model, as follows:
Step (5.1): using the appraisal module, which is based on Appraisal Theory, extract from the stock Chinese text the adjective phrases and the adjectives, verbs, and modifiers that carry sentiment, collectively called sentiment words;
Step (5.2): set up an appraisal group for the stock sentiment words, comprising: positive words, which describe favorable developments including but not limited to rising stock prices and good listed-company performance; negative words, which describe unfavorable developments including but not limited to falling stock prices and poor listed-company performance; degree words, which describe the degree of positivity or negativity; negation words, which are placed before a positive or negative word to reverse its meaning; and uncertainty words, which determine the credibility of a positive or negative word. These five types of stock sentiment appraisal words form a feature word set, which is input to the computer;
Step (5.3): using the feature word set obtained in step (5.2), perform text sentiment analysis on the stock sentiment words extracted in step (5.1) and mark the type to which each belongs;
Step (5.4): weight the stock sentiment words of step (5.3) using the normalized absolute term-frequency weight.
The normalized absolute term-frequency weight for the j-th text takes values in the interval [0,1]:
$$weight_{normal}(t_k, d_j) = \frac{weight(t_k, d_j)}{\sqrt{\sum_{j=1}^{|T|} \left( weight(t_k, d_j) \right)^2}},$$
where, in $t_k$, t denotes a stock sentiment word after appraisal against the feature word set and k is the index assigned to it after the stock sentiment words in the stock Chinese texts are sorted;
in $d_j$, d denotes a stock Chinese text and j is its index; $|T|$ is the number of all stock Chinese texts, so j = 1, 2, ..., $|T|$;
$weight(t_k, d_j)$ is the term-frequency weight, before normalization, of the k-th stock sentiment word in the j-th stock Chinese text, taking values in the interval [0,1]; $weight_{normal}(t_k, d_j)$ is the absolute term-frequency weight of that sentiment word after normalization, taking values in the interval [0,1];
Step (6): text sentiment classification.
Use any classification algorithm in the Weka module to classify the sentiment of a stock Chinese text: positive texts are assigned to the positive hot-sector category and negative texts to the negative hot-sector category.
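Steps (5.4) and (6) can be illustrated with a small sketch. It is not the Weka implementation named above: it is a pure-Python stand-in that assumes each text is already a vector of sentiment-word weights, scales each vector to unit length (the usual reading of cosine normalization), and votes among the k nearest labelled texts. The names `normalize`, `knn_classify`, and `TRAIN`, and all the example vectors, are hypothetical.

```python
import math
from collections import Counter


def normalize(vector):
    """Scale a weight vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(w * w for w in vector))
    return [w / norm for w in vector] if norm else list(vector)


def knn_classify(query, labelled, k=3):
    """Return the majority label among the k labelled vectors most similar to query.

    Similarity is the dot product of the unit-length vectors (cosine similarity).
    """
    q = normalize(query)
    ranked = sorted(
        labelled,
        key=lambda item: sum(a * b for a, b in zip(q, normalize(item[0]))),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature order: ["rise", "fall", "sharply"]
TRAIN = [
    ([2, 0, 1], "positive"),
    ([1, 0, 0], "positive"),
    ([0, 2, 1], "negative"),
    ([0, 1, 0], "negative"),
]

print(knn_classify([1, 0, 1], TRAIN))
print(knn_classify([0, 3, 0], TRAIN))
```

Normalizing both the query and the stored vectors keeps long and short headlines comparable, which is the stated motivation for the normalized weight.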
The advantages of the invention are:
1. The original corpus comes from the Internet and is real-time.
2. It is implemented purely in software, at low cost.
Description of drawings
Fig. 1 is a flowchart of the text sentiment classification method for the stock domain;
Fig. 2 shows the stock appraisal group;
Fig. 3 is a flowchart of the program implementation.
Embodiment
The present invention proposes a text sentiment classification method for the stock domain. The method is carried out in a computer in the following three steps; the overall flow is shown in Fig. 1:
Step (1): text preprocessing.
Text preprocessing consists of two processes, Chinese text segmentation and stop word removal:
One, Chinese text segmentation:
Chinese text segmentation cuts a Chinese character sequence into individual words with independent meaning and is the basis of Chinese natural language processing. It is carried out in two steps:
First step: build a stock-domain segmentation dictionary based on the n-gram statistical language model:
In Chinese word segmentation there are two concepts, new words and unknown words, although the two are sometimes not distinguished. New or unknown words fall roughly into four kinds: 1) abbreviations, such as "PetroChina", "state throw" and "middle gold"; 2) proper nouns, such as "joint-stock limited company" and "securities investment fund"; 3) derived words, such as "dark-horse stock", "leading the rise" and "bearish news"; 4) compound words, such as "surging and then falling back" and "dividend distribution".
Two approaches to new word discovery are common at present. In rule-based methods, experts summarize the formation rules or characteristics of new words, candidate new words are guessed and given a confidence score, and the candidates are then evaluated further. In statistics-based methods, statistical strategies and measures of association are used to find the strings most likely to be words; this approach suits the discovery of short new words.
Because no suitable segmentation dictionary is currently available, one must be built for the stock domain. Moreover, the invention mainly processes stock news headlines, in which words mostly appear in short forms, so a statistics-based new word discovery method, such as the n-gram language model, can be used.
From a statistical point of view, a sentence in natural language can be made up of any character string, but the probabilities P(s) with which different strings occur differ greatly. For example, take $s_1$ = "more than half of bankers expect monetary policy to stay unchanged next quarter" and $s_2$ = "more than half of policy expect bankers monetary to stay unchanged next quarter". The two contain exactly the same characters, but the former is clearly more likely to occur as a sentence, that is, $P(s_1) > P(s_2)$.
For a given natural language, P(s) is usually unknown. For a language L that follows some unknown probability distribution P, the process of estimating P from a given language sample is called language modeling. If $W = w_1 w_2 \ldots w_n$ denotes a character string sequence (n-gram) in the text, where $w_i$ denotes one character, the task of language modeling is to give the probability P(W) that the string W occurs in the text. By the product rule of probability, P(W) can be expanded into:
$$P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})$$
This formula is very complex, and the amount of computation is considerable even for fairly small n. To simplify the model and the computation, overly long histories are usually ignored and only the history formed by the preceding n-1 characters is considered; that is, the probability of any word is assumed to depend only on the n-1 words before it. Such a language model is called an n-gram language model, which is also an (n-1)-order Markov chain.
The conditional probabilities can be estimated by maximum likelihood estimation (MLE):
$$P_{MLE}(w_n \mid w_1 w_2 \ldots w_{n-1}) = \frac{C(w_1 w_2 \ldots w_n)}{C(w_1 w_2 \ldots w_{n-1})},$$
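As a minimal sketch of the maximum likelihood estimate, the ratio of counts can be computed directly from character n-gram frequencies. The function names `ngram_counts` and `p_mle` and the toy corpus are hypothetical and chosen only for illustration; the sketch assumes n is at least 2 and returns 0.0 when the history was never seen.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count every character string of length n across a list of texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def p_mle(ngram, texts):
    """P_MLE(w_n | w_1..w_{n-1}) = C(w_1..w_n) / C(w_1..w_{n-1})."""
    n = len(ngram)
    full = ngram_counts(texts, n)[ngram]
    history = ngram_counts(texts, n - 1)[ngram[:-1]]
    return full / history if history else 0.0


corpus = ["abab", "abc"]
print(p_mle("ab", corpus))  # "a" occurs 3 times and is always followed by "b"
print(p_mle("bc", corpus))  # "b" occurs 3 times, followed by "c" once
```

Any string whose history never occurs gets probability 0 here, which is the data sparseness problem that motivates smoothing.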
Building the segmentation dictionary with the n-gram statistical language model proceeds in the following three steps:
1) Build the n-gram model
New word discovery needs a large corpus as its basis, but no Chinese text sentiment classification corpus for the stock domain exists yet. Stock-related news on Sina Finance can therefore be chosen as the original corpus. Stock news puts great weight on timeliness, and very often the headline alone summarizes the content of the whole news item and reveals its sentiment orientation. To improve processing efficiency, only headlines are used. Specifically, the stock news headlines published online by Sina Finance in 2009 are collected, covering individual stock news, industry overviews, sector news, company news, and so on, 233282 headlines in total; these form the original corpus, that is, the Chinese text input to the computer.
A character-level n-gram model is then built on the original corpus: each stock news headline is cut from beginning to end into character strings, and the frequency of occurrence of identical strings is counted. The length n of a string is the number of Chinese characters in it (an English word or a number counts as one Chinese character). In theory, when n is large, more context is provided and the context is more discriminating, but the amount of computation is larger and parameter estimation is less reliable; when n is small, less context is provided and the context is less discriminating, but the amount of computation is smaller and parameter estimation is more reliable. The size of n therefore has to be chosen sensibly in practice. In addition, if a character string has length L, cutting it into strings of length n yields L-n+1 strings.
In practice n can take the integers from 2 to 6. For example, with n = 6 an 11-character headline meaning "European stock markets close lower, all sectors broadly down" is cut into 11-6+1 = 6 overlapping 6-character strings, the first covering characters 1 to 6 ("European stock markets close lower") and the last covering characters 6 to 11.
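The cutting step can be sketched as follows. The headline string is an assumed back-translation of the example above, used only for illustration, and the sketch treats every Unicode character as one unit, so it does not implement the rule that an English word or a number counts as a single character. `cut` and `count_all` are hypothetical names.

```python
from collections import Counter


def cut(text, n):
    """Return the overlapping character strings of length n, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_all(headlines, n_values=range(2, 7)):
    """Count identical strings over all headlines for every n from 2 to 6."""
    counts = Counter()
    for headline in headlines:
        for n in n_values:
            counts.update(cut(headline, n))
    return counts


# Assumed back-translation of the example headline (11 characters).
headline = "欧洲股市收低各板块普跌"
pieces = cut(headline, 6)
assert len(pieces) == len(headline) - 6 + 1  # L - n + 1 strings
print(pieces)
```

A headline shorter than n simply contributes no strings of that length, since the range in `cut` is then empty.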
Once the character-level n-gram model has been built on the original corpus and the string frequencies counted, using MLE alone runs into the data sparseness problem, so a data smoothing technique is also needed.
2) Data smoothing
Data smoothing builds on maximum likelihood estimation. Common smoothing methods include Add-one smoothing, Add-delta smoothing (with delta = 0.5), held-out estimation, and deleted estimation.
Add-one smoothing sets the count of every n-gram to its actual number of occurrences in the training corpus plus 1, so that n-grams that never occurred are treated as having occurred once: $C(\text{n-gram})_{new} = C(\text{n-gram})_{old} + 1$. The parameter estimate under Add-one smoothing is
$$P_{Add\text{-}one}(w_1 w_2 \ldots w_n) = \frac{C(w_1 w_2 \ldots w_n) + 1}{C(w_1 w_2 \ldots w_{n-1}) + N},$$
where N is the number of all n-grams in the training corpus.
If a large number of n-grams do not appear in the training corpus, then after Add-one smoothing these unseen n-grams take up a large share of the whole probability distribution, which is not very reasonable. One improvement is to add not 1 but a number Δ smaller than 1 to each count:
$$P_{Add\text{-}delta}(w_1 w_2 \ldots w_n) = \frac{C(w_1 w_2 \ldots w_n) + \Delta}{C(w_1 w_2 \ldots w_{n-1}) + \Delta \cdot N},$$
where 0 < Δ < 1. This is Add-delta smoothing, and practice shows that it generally works better than Add-one.
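The two smoothing formulas differ only in the added constant, which a short sketch makes concrete. The function names and the counts in the example are hypothetical; Add-one is written as the special case Δ = 1.

```python
def p_add_delta(count_full, count_history, total_ngrams, delta=0.5):
    """(C(w_1..w_n) + delta) / (C(w_1..w_{n-1}) + delta * N)."""
    return (count_full + delta) / (count_history + delta * total_ngrams)


def p_add_one(count_full, count_history, total_ngrams):
    """Add-one smoothing: the same formula with delta = 1."""
    return p_add_delta(count_full, count_history, total_ngrams, delta=1.0)


# Hypothetical counts: the n-gram was seen 8 times, its history 10 times,
# and the corpus holds N = 1000 n-grams.
print(p_add_one(8, 10, 1000))
print(p_add_delta(8, 10, 1000))
# An unseen n-gram still receives a small non-zero estimate.
print(p_add_delta(0, 10, 1000))
```

With these counts the Add-delta estimate for the seen n-gram is larger than the Add-one estimate, because less weight is shifted toward the N possible n-grams.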
The basic idea of held-out estimation is to split the whole corpus into training data and held-out data: the training data gives the initial frequency estimate and the held-out data is used to improve it. Specifically, for each n-gram $w_1 w_2 \ldots w_n$, its frequencies in the training data and in the held-out data are computed, namely $C_{tr}(w_1 w_2 \ldots w_n)$ and $C_{ho}(w_1 w_2 \ldots w_n)$. Let T be the number of all n-grams in the held-out data, and let r be the frequency of an n-gram in the training data, $r = C_{tr}(w_1 w_2 \ldots w_n)$. Let $N_r$ be the number of distinct n-grams that occur r times in the training data, and let $T_r$ be the total frequency in the held-out data of all n-grams that occur r times in the training data:
$$T_r = \sum_{\{w_1 w_2 \ldots w_n \,:\, C_{tr}(w_1 w_2 \ldots w_n) = r\}} C_{ho}(w_1 w_2 \ldots w_n)$$
The parameter estimate under held-out estimation is therefore
$$P_{ho}(w_1 w_2 \ldots w_n) = \frac{T_r}{T} \times \frac{1}{N_r},$$
Deleted estimation splits the training corpus into two parts, uses each part in turn as training data and as held-out data, swaps the roles after the computation, and finally combines the two as a weighted average:
$$P_{del}(w_1 w_2 \ldots w_n) = \frac{T_r^{01} + T_r^{10}}{N \left( N_r^0 + N_r^1 \right)},$$
where $T_r^{ij}$ is $T_r$ computed with part i as training data and part j as held-out data, $N_r^i$ is the number of distinct n-grams that occur r times in part i, and N is the total number of n-grams in the training data and the held-out data together.
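Both estimators can be sketched from two count tables. This is an illustrative sketch under stated assumptions: the inputs are `Counter` objects mapping n-grams to frequencies, the estimate is looked up by the training count r rather than by a specific n-gram, and unseen n-grams (r = 0) are not handled because they do not appear in the tables. All names and counts are hypothetical.

```python
from collections import Counter


def held_out_stats(train, held):
    """Return (N_r, T_r): for each training count r, the number of distinct
    n-grams seen r times in train, and their total frequency in held."""
    n_r, t_r = Counter(), Counter()
    for gram, r in train.items():
        n_r[r] += 1
        t_r[r] += held.get(gram, 0)
    return n_r, t_r


def p_held_out(r, train, held):
    """P_ho = T_r / (T * N_r) for an n-gram seen r times in train."""
    n_r, t_r = held_out_stats(train, held)
    total_held = sum(held.values())  # T
    if n_r[r] == 0 or total_held == 0:
        return 0.0
    return t_r[r] / (total_held * n_r[r])


def p_deleted(r, part0, part1):
    """P_del = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1))."""
    n_r0, t_r01 = held_out_stats(part0, part1)
    n_r1, t_r10 = held_out_stats(part1, part0)
    total = sum(part0.values()) + sum(part1.values())  # N
    denom = total * (n_r0[r] + n_r1[r])
    return (t_r01[r] + t_r10[r]) / denom if denom else 0.0


part0 = Counter({"ab": 2, "bc": 1, "cd": 1})
part1 = Counter({"ab": 1, "bc": 2, "de": 1})
print(p_held_out(1, part0, part1))
print(p_deleted(1, part0, part1))
```

Deleted estimation reuses `held_out_stats` twice with the roles of the two parts swapped, which mirrors the description above.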
3) "Substring" filtering
In practice there are many common words whose "substrings" essentially occur only inside those words and rarely occur as words on their own. For example, "joint-stock limited company" is a "parent string", and the same string with its last character removed is one of its "substrings". Although such "substrings" usually cannot stand alone as words, they occur with almost the same frequency as their "parent string". In a typical statistical language model the probabilities (parameter estimates) of such a "substring" and its "parent string" are very close, yet the "substring" is usually useless and becomes a distractor, so it has to be filtered out.
In the statistical language model, the probability of a useless "substring" usually differs very little from that of its "parent string". The basic idea of filtering is therefore: for each character string, consider all of its "parent strings"; if the difference between the probability of the string and that of a "parent string" is below a given value, and the difference between their lengths is below a given value, the string is filtered out. The present invention uses exactly this method, with a probability difference threshold of 0.0001 and a length difference threshold of 3.
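The filtering rule can be sketched as a pairwise check. The sketch assumes the dictionary candidates are given as a mapping from string to estimated probability; the Chinese strings are an assumed back-translation of the "joint-stock limited company" example, and the probabilities are invented for illustration. `filter_substrings` is a hypothetical name.

```python
def filter_substrings(probs, max_prob_diff=0.0001, max_len_diff=3):
    """Drop every string that has a longer "parent string" containing it
    whose probability and length are both within the given differences."""
    kept = {}
    for s, p in probs.items():
        shadowed = any(
            s != parent
            and s in parent
            and abs(p_parent - p) < max_prob_diff
            and len(parent) - len(s) < max_len_diff
            for parent, p_parent in probs.items()
        )
        if not shadowed:
            kept[s] = p
    return kept


# Invented probabilities for illustration only.
candidates = {
    "股份有限公司": 0.00120,  # parent string
    "股份有限公": 0.00121,    # substring with almost the same probability
    "股份": 0.00450,          # substring that is common in its own right
}
print(list(filter_substrings(candidates)))
```

The check compares every pair of candidates, which is quadratic in the number of strings; for a corpus of the size described above an index of candidate parents by length would likely be needed.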
These three steps produce a dictionary from the original corpus. Combined with related stock information, including common securities terms and stock names, this finally yields a segmentation dictionary for the stock domain.
Second step: segment with the ICTCLAS system in combination with the stock-domain segmentation dictionary:
ICTCLAS is a Chinese lexical analysis system developed by Zhang Huaping, Liu Qun, and colleagues at the Institute of Computing Technology, Chinese Academy of Sciences, based on a multilayer hidden Markov model. Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, and it also supports user dictionaries.
The ICTCLAS system can be used together with the stock-domain segmentation dictionary built above to segment the stock news.
Two, stop word removal:
Stop words are words that occur relatively often but carry little practical meaning and contribute almost nothing to text processing. Removing them is important for efficient text processing, and can be done by building a stop word list.
The construction of the stop word list depends not only on the language used but also on the specific application. The stop word list for stock news text contains two kinds of entries: the first kind is prepositions, articles, auxiliary words, conjunctions, and punctuation marks; the second kind is the prompt words that precede stock news headlines, such as "news flash", "sharp point", "deep bid" and "market".
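Stop word removal over a segmented headline amounts to a set lookup, as the sketch below shows. The stop word entries and the tokens are hypothetical English glosses standing in for the Chinese entries; the real list is domain-specific as described above.

```python
# Hypothetical stop word list: function words, punctuation, and prompt words.
STOP_WORDS = {"of", "the", "and", ":", ",", "news flash", "market"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stop_words]


tokens = ["news flash", ":", "bank", "shares", "rise"]
print(remove_stop_words(tokens))
```

Storing the list as a set keeps each lookup constant-time, which matters when every token of several hundred thousand headlines is checked.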
Step (2): text representation.
To let a computer "understand" text, the text can be represented with a vector space model. The basic approach of the vector space model is to represent a text as a vector over a set of orthogonal terms, where each distinct term is an independent dimension of the feature space. Text representation consists of text feature selection and text feature weighting:
One, text feature selection:
For text sentiment analysis, Casey Whitelaw et al. introduced Appraisal Theory: phrases made up of an adjective and its modifiers are extracted from the text as feature words and used for semantic orientation analysis. Such an adjective phrase is called an appraisal group (AG). Experiments show that using appraisal groups as the feature word set improves the accuracy of sentiment classification. Following Martin's Appraisal Theory, Whitelaw et al. give an appraisal four attributes: Attitude, Orientation, Graduation, and Polarity.
The present invention can use a similar appraisal group method for its sentiment classification problem. The difference is that not only adjective phrases are extracted but also adjectives, verbs, and modifiers that carry sentiment; these are collectively called sentiment words. Stock sentiment words are provisionally divided into five types: positive words, negative words, degree words, negation words, and uncertainty words. Positive words describe things such as rising stock prices and good listed-company performance; negative words describe things such as falling stock prices and poor listed-company performance; degree words describe the degree of positivity or negativity; negation words are placed before a positive or negative word to express the opposite meaning; uncertainty words determine the credibility of positive and negative words. The structure is shown in Fig. 2.
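The five word types can be sketched as a small lexicon lookup that tags each sentiment word with its type. The lexicon entries are hypothetical English glosses, and the `polarity` helper is an assumption added for illustration: it only shows how a directly preceding negation word could reverse a positive or negative word, and is not a scoring rule stated in the source.

```python
from enum import Enum


class WordType(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    DEGREE = "degree"
    NEGATION = "negation"
    UNCERTAIN = "uncertain"


# Hypothetical lexicon entries for illustration.
LEXICON = {
    "rise": WordType.POSITIVE,
    "fall": WordType.NEGATIVE,
    "sharply": WordType.DEGREE,
    "not": WordType.NEGATION,
    "may": WordType.UNCERTAIN,
}


def tag_sentiment_words(tokens, lexicon=LEXICON):
    """Return (token, type) pairs for the tokens found in the lexicon."""
    return [(t, lexicon[t]) for t in tokens if t in lexicon]


def polarity(tokens, lexicon=LEXICON):
    """Sum +1 / -1 per positive / negative word; a negation word directly
    before such a word flips its sign (illustrative assumption)."""
    score, flip = 0, False
    for t in tokens:
        kind = lexicon.get(t)
        if kind is WordType.NEGATION:
            flip = True
            continue
        if kind in (WordType.POSITIVE, WordType.NEGATIVE):
            value = 1 if kind is WordType.POSITIVE else -1
            score += -value if flip else value
        flip = False
    return score


print(tag_sentiment_words(["shares", "rise", "sharply"]))
print(polarity(["shares", "not", "rise"]))
```

Tokens outside the lexicon are simply skipped, so the tagged list doubles as the selected feature set for the weighting step.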
Two, text feature weighting:
Yet the different characteristic that process is selected is different to the differentiation dynamics of text.Therefore in the process of text being carried out the formalization processing, also need these features are done further weighted.The purpose of weighting is to improve the weight of the strong feature of differentiation dynamics, and weakened region divides the weight of the weak feature of dynamics.The weighting function that the present invention adopts has boolean's weight, absolute word frequency weight, TF-IDF weight and normalized weight etc.
1) boolean's weight
Boolean's weight is the simplest a kind of weighting function, as the term suggests its value is a Boolean: if the feature speech did not occur, its weight is 0; As long as the feature speech occurred, its weight promptly thinks 1.Be formulated as
weight ( t k , d j ) = 0 , # ( t k , d j ) = 0 1 , # ( t k , d j ) > 0 ,
Wherein t represents through the feature speech, and k represents that d represents one piece of document through the sequence number of giving after feature speech described in a plurality of documents is sorted greatly, and j represents the sequence number of the document in document sets, weight (t k, d j) representation feature speech t kAt document d jIn weight, # (t k, d j) representation feature speech t kAt document d jThe middle number of times that occurs.
2) Absolute term-frequency weight
The Boolean weight is very simple: it tells feature words apart only by 0 and 1 and cannot express their relative importance. In text classification, a word that occurs many times is usually taken to contribute more to classification than one that occurs rarely, so feature words with different occurrence counts should have different weights. The absolute term-frequency weight uses the number of times the feature word occurs in the document directly as its weight, so that more occurrences mean greater importance:
$$\mathrm{weight}(t_k, d_j) = \#(t_k, d_j)$$
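The Boolean and absolute term-frequency weights can be compared side by side in a few lines. The sketch below assumes documents are already segmented into token lists; the function names and the sample vocabulary are hypothetical.

```python
from collections import Counter

def absolute_tf(tokens, vocabulary):
    """Absolute term-frequency weight: #(t_k, d_j) for each feature word."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

def boolean_weight(tokens, vocabulary):
    """Boolean weight: 1 if the feature word occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in absolute_tf(tokens, vocabulary)]

vocabulary = ["rise", "fall", "strong"]      # hypothetical feature words
doc = ["rise", "strong", "rise", "market"]   # hypothetical segmented document

print(absolute_tf(doc, vocabulary))     # [2, 0, 1]
print(boolean_weight(doc, vocabulary))  # [1, 0, 1]
```

The word "rise" occurs twice, so the two schemes differ only in its entry: the absolute weight keeps the count, while the Boolean weight collapses it to 1.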
3) TF-IDF weight
TF-IDF is a method commonly used in information retrieval and can also serve as a text feature weighting function. It computes the weight of a word over the whole text collection from the word's term frequency and from the number of documents in which it occurs:
$$\mathrm{weight}(t_k, d_j) = \mathrm{TF}(t_k, d_j) \times \mathrm{IDF}(t_k, d_j) = \#(t_k, d_j) \cdot \log\frac{|D|}{\#D(t_k)}$$
where $\#D(t_k)$ is the document frequency of feature word $t_k$, i.e. the number of documents in the set of all documents in which $t_k$ occurs, and $|D|$ is the number of documents in that set. $\mathrm{TF}(t_k, d_j) = \#(t_k, d_j)$ is the number of times $t_k$ occurs in document $d_j$, and $\log\frac{|D|}{\#D(t_k)}$ is the inverse document frequency. The formula has this form because of two assumptions: the more often a feature word occurs in a document, the better it represents that document's content; and the more documents a feature word occurs in, the less discriminating it is.
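A minimal sketch of the TF-IDF weight as written in the formula above. The natural logarithm is an assumption (the text does not state the base), and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs, vocabulary):
    """Return one vector per document with entries #(t_k, d_j) * log(|D| / #D(t_k))."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each feature word.
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([
            counts[t] * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary
        ])
    return vectors

docs = [["rise", "rise", "strong"], ["fall", "weak"], ["rise", "fall"]]
vocabulary = ["rise", "fall", "strong"]
vectors = tf_idf(docs, vocabulary)
print(vectors[0])
```

In the first document, "strong" occurs once but only in that document, so its weight log(3) exceeds what a single occurrence of "rise" (found in two of three documents) would receive.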
4) Normalized term frequency
In practice the lengths of texts may differ greatly. With the three preceding methods, the feature values of long and short texts would then be distributed very differently, which hampers computation. The weights can therefore be mapped into the interval [0, 1] so that the vectors representing the documents have the same length; applying cosine normalization gives the final result
$$\mathrm{weight}_{normal}(t_k, d_j) = \frac{\mathrm{weight}(t_k, d_j)}{\sqrt{\sum_{j=1}^{|T|} \left(\mathrm{weight}(t_k, d_j)\right)^2}}$$
where $\mathrm{weight}_{normal}(t_k, d_j)$ is the absolute term frequency after normalization, taking values in the interval [0, 1], and $|T|$ is the number of elements in the set of all documents.
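Cosine normalization in its usual form divides each document's weight vector by its Euclidean length, so that every document vector ends up with length 1. The sketch below assumes that reading, with the sum of squares taken over the feature words of one document; `cosine_normalize` is a hypothetical helper name.

```python
import math

def cosine_normalize(vector):
    """Scale one document's weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector))
    if norm == 0:
        return [0.0 for _ in vector]  # document contains no feature words
    return [w / norm for w in vector]

print(cosine_normalize([3, 0, 4]))  # [0.6, 0.0, 0.8]
```

Because the weights are non-negative counts, every normalized entry falls in [0, 1], and a long and a short document with the same word proportions map to the same vector.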
Step (3): text sentiment classification.
In the present invention, sentiment classification of stock news can be treated as a two-class problem: a text classified as positive indicates a positive hot sector, and a text classified as negative indicates a negative hot sector. Stock news directly influences investors' purchases. Reflected in actual trading, it can be assumed that sectors which are mentioned often, and always positively, are those with larger gains, while sectors which are mentioned often, but always negatively, are those with larger declines.
There is still little research on Chinese sentiment classification. It can be said to be at an exploratory stage, and for the stock field in particular it remains to be explored which machine learning methods suit Chinese stock text. The present invention selects three classification methods, Bayes, SVM, and KNN, to experiment with sentiment classification in the stock field.
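To make the Bayes option concrete, here is a from-scratch multinomial Naive Bayes classifier over segmented texts. It is a stand-in sketch, not the Weka implementation the patent relies on; the class name, the add-one (Laplace) smoothing, and the English training tokens are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n / total_docs)  # log prior of the class
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][t] + 1)
                                      / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_docs = [["rise", "strong"], ["rally", "rise"],
              ["fall", "weak"], ["plunge", "fall"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["rise", "rally"]))
print(model.predict(["fall"]))
```

With real data, the token lists would be the sentiment words that survive segmentation, stop-word removal, and feature selection.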

Claims (1)

1. A text sentiment classification method for the stock field, characterized in that the text sentiment classification is a sentiment-based classification used to identify whether Chinese text in the stock field is positive or negative, the classification method being carried out in a computer in the following steps:
Step (1): initialize the computer and set up the following software tools:
an Add-delta data smoothing algorithm module;
the Chinese lexical analysis module ICTCLAS, used for Chinese word segmentation of stock news;
an appraisal module used for text feature selection;
a Weka module used for classification experiments, which includes classification algorithms such as Bayes and K-NN,
Define the new words used in segmenting stock-field Chinese text:
abbreviations, including but not limited to: PetroChina, Guotou, and Zhongjin;
proper nouns, including but not limited to: joint-stock company and securities investment fund;
derived words, including but not limited to: dark-horse stock, leading gainer, and bearish news;
compound words, including but not limited to: surge and pull back, and dividend distribution;
Step (2): take the headlines of the selected securities news, together with related stock information containing common securities terms and the stock names used for sentiment classification, as the original corpus, i.e. the Chinese text, and input it into the computer;
Step (3): Chinese word segmentation: cut the Chinese character sequences in the Chinese text of step (2) into individual words with independent meaning, as follows:
Step (3.1): build the stock-field word segmentation dictionary with the n-gram statistical language model used for new-word discovery, as follows:
Step (3.1.1): build the n-gram model,
Let a character string be written as the n-gram $W = w_1 w_2 \ldots w_n$, where $w_i$ denotes a character and n, an integer from 2 to 6, is the number of characters in the string,
The probability $P_{MLE}(w_n \mid w_1 w_2 \ldots w_{n-1})$ of the character string W occurring in the Chinese text is then computed by the formula below, where MLE indicates that the parameters are estimated by maximum likelihood estimation; this is called the n-gram language model,
If the length of a character string is L, cutting it into strings of length n yields L-n+1 strings, and the frequency of occurrence of identical strings among them is counted, where
$$P_{MLE}(w_n \mid w_1 w_2 \ldots w_{n-1}) = \frac{C(w_1 w_2 \ldots w_n)}{C(w_1 w_2 \ldots w_{n-1})},$$
$C(w_1 w_2 \ldots w_n)$ is the number of times the character string $w_1 w_2 \ldots w_n$ occurs in the original corpus, and $C(w_1 w_2 \ldots w_{n-1})$ is the number of times the string formed by the first n-1 characters $w_1 w_2 \ldots w_{n-1}$ of $w_1 w_2 \ldots w_n$ occurs in the original corpus,
Step (3.1.2): smooth the character strings obtained in step (3.1.1) with the improved Add-delta data smoothing algorithm,
$$P_{Add\text{-}delta}(w_1 w_2 \ldots w_n) = \frac{C(w_1 w_2 \ldots w_n) + \Delta}{C(w_1 w_2 \ldots w_{n-1}) + \Delta \cdot N},$$
where $\Delta = 0.5$ and N is the number of all character-string n-grams in the original corpus,
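Steps (3.1.1) and (3.1.2) can be sketched as follows. The code assumes that N is the total number of n-gram occurrences of the given length in the corpus (the claim could also be read as the number of distinct n-grams), and it uses a short Latin-letter string as a hypothetical stand-in for a Chinese character sequence; all function names are illustrative.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count all substrings of length n; a text of length L yields L - n + 1 of them."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def p_mle(text, gram):
    """C(w_1..w_n) / C(w_1..w_{n-1}), for an n-gram with n >= 2."""
    c_full = ngram_counts(text, len(gram))[gram]
    c_prefix = ngram_counts(text, len(gram) - 1)[gram[:-1]]
    return c_full / c_prefix if c_prefix else 0.0

def p_add_delta(text, gram, delta=0.5):
    """(C(w_1..w_n) + delta) / (C(w_1..w_{n-1}) + delta * N)."""
    full = ngram_counts(text, len(gram))
    c_prefix = ngram_counts(text, len(gram) - 1)[gram[:-1]]
    n_total = sum(full.values())  # assumption: N counts every n-gram occurrence
    return (full[gram] + delta) / (c_prefix + delta * n_total)

corpus = "abcabcabd"  # hypothetical stand-in for a character sequence
print(sum(ngram_counts(corpus, 2).values()))  # 8, i.e. L - n + 1 = 9 - 2 + 1
print(p_mle(corpus, "abc"), p_mle(corpus, "abx"))
print(p_add_delta(corpus, "abc"), p_add_delta(corpus, "abx"))
```

The unseen string "abx" gets probability 0 under maximum likelihood but a small positive value after smoothing, which is the purpose of the Add-delta step.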
Step (3.1.3): filter out useless character substrings of common words,
When the difference between the frequency of a common word, taken as the parent string, and that of one of its character substrings is less than 0.0001, and the difference in length between the common word and that substring is less than 3, the substring is filtered out,
Steps (3.1.1) to (3.1.3) yield the word segmentation dictionary for the stock field,
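The substring filter of step (3.1.3) can be sketched as below, assuming each candidate string comes with a relative frequency. The function name, the dictionary layout, and the sample frequencies are hypothetical; the two thresholds are the ones stated in the claim.

```python
def filter_substrings(freq, max_freq_diff=0.0001, max_len_diff=3):
    """Drop candidates that are substrings of a longer candidate with nearly equal frequency."""
    kept = {}
    for sub, f_sub in freq.items():
        redundant = any(
            sub != parent and sub in parent
            and len(parent) - len(sub) < max_len_diff
            and abs(f_parent - f_sub) < max_freq_diff
            for parent, f_parent in freq.items()
        )
        if not redundant:
            kept[sub] = f_sub
    return kept

# Hypothetical candidate strings with relative frequencies.
candidates = {"abcd": 0.0020, "abc": 0.00205, "ab": 0.0100}
print(sorted(filter_substrings(candidates)))  # ['ab', 'abcd']
```

Here "abc" almost always appears as part of "abcd", so it is dropped, while "ab" is much more frequent than either parent and is kept as a word in its own right.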
Step (3.2): segment the stock news by combining the stock-field word segmentation dictionary obtained in steps (3.1.1) to (3.1.3) with the ICTCLAS Chinese lexical analysis module, which is based on a multi-layer Markov model;
Step (4): remove the stop words from the stock news segmentation result obtained in step (3.2), a stop word being a token that occurs more frequently than common words and carries no substantive meaning,
Step (4.1): build a stop-word list for stock news text and input it into the computer; the stop-word list contains prepositions, articles, auxiliary words, conjunctions, and punctuation marks, together with words commonly used as prompts in stock news, including at least but not limited to "news flash", "sharp point", "deep bid", and "market",
Step (4.2): use the stop-word list to remove stop words from the stock news segmentation result obtained in step (3);
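Stop-word removal as described in step (4) amounts to filtering the token list against a set. The entries of `STOP_WORDS` below are hypothetical English stand-ins, not the patent's actual stop-word list.

```python
# Hypothetical stop-word list: function words, punctuation, and prompt words.
STOP_WORDS = {"of", "the", "and", ",", ".", ":", "news flash", "market"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

segmented = ["news flash", ":", "bank", "shares", "rise", "sharply"]
print(remove_stop_words(segmented))  # ['bank', 'shares', 'rise', 'sharply']
```

Using a set keeps each membership check constant-time, which matters when the list is applied to every token of a large news corpus.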
Step (5): on the basis of the preprocessing in steps (3) and (4), represent the stock Chinese text with a vector space model, as follows:
Step (5.1): using the appraisal module, which is based on Appraisal Theory, extract from the stock Chinese text the adjective phrases and the adjectives, verbs, and modifiers that carry sentiment, collectively called sentiment words,
Step (5.2): set up an appraisal group of stock sentiment words, comprising: positive words, used to describe positive assessments including but not limited to rising stock prices and good performance of a listed company; negative words, used to describe negative assessments including but not limited to falling stock prices and poor performance of a listed company; degree words, which describe the degree of positivity or negativity; negation words, which are placed before a positive or negative word to reverse its meaning; and uncertainty words, which determine the confidence of the positive or negative words; these five types of stock sentiment appraisal words form a feature word set, which is input into the computer,
Step (5.3): using the feature word set obtained in step (5.2), perform text sentiment analysis on the stock sentiment words extracted in step (5.1) and label the type to which each belongs,
Step (5.4): weight the stock sentiment words of step (5.3) as features, using the normalized absolute term-frequency weight:
The absolute term-frequency weight of the j-th text after normalization takes values in the interval [0, 1]:
$$\mathrm{weight}_{normal}(t_k, d_j) = \frac{\mathrm{weight}(t_k, d_j)}{\sqrt{\sum_{j=1}^{|T|} \left(\mathrm{weight}(t_k, d_j)\right)^2}},$$
where, in $t_k$, t denotes a stock sentiment word after appraisal against the feature word set, and k is the sequence number assigned after the stock sentiment words in the stock Chinese texts have been sorted,
in $d_j$, d denotes a stock Chinese text and j is its sequence number; $|T|$ is the number of all stock Chinese texts, so j = 1, 2, ..., $|T|$,
$\mathrm{weight}(t_k, d_j)$ is the term-frequency weight, before normalization, of the k-th stock sentiment word in the j-th stock Chinese text, taking values in the interval [0, 1]; $\mathrm{weight}_{normal}(t_k, d_j)$ is the absolute term-frequency weight of that sentiment word after normalization, taking values in the interval [0, 1];
Step (6): text sentiment classification
Use any classification algorithm in the Weka module to classify the sentiment of a stock Chinese text: a positive text belongs to a positive hot sector, and a negative text belongs to a negative hot sector.
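As an illustration of the K-NN option for step (6), the sketch below classifies a weighted feature vector by majority vote among its most similar training vectors. It is a plain-Python stand-in rather than Weka's implementation; cosine similarity as the distance measure, k = 3, and the sample vectors are assumptions.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, train_vectors, train_labels, k=3):
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical term-frequency vectors over the features ["rise", "fall", "strong"].
train_vectors = [[2, 0, 1], [1, 0, 1], [1, 0, 0], [0, 2, 0], [0, 1, 0], [1, 2, 0]]
train_labels = ["positive"] * 3 + ["negative"] * 3

print(knn_classify([1, 0, 2], train_vectors, train_labels))
print(knn_classify([0, 3, 0], train_vectors, train_labels))
```

Because cosine similarity ignores vector length, it pairs naturally with the normalized weights of step (5.4).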
CN2010105432677A 2010-11-11 2010-11-11 Text emotion classifying method in stock field Pending CN102023967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105432677A CN102023967A (en) 2010-11-11 2010-11-11 Text emotion classifying method in stock field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105432677A CN102023967A (en) 2010-11-11 2010-11-11 Text emotion classifying method in stock field

Publications (1)

Publication Number Publication Date
CN102023967A true CN102023967A (en) 2011-04-20

Family

ID=43865277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105432677A Pending CN102023967A (en) 2010-11-11 2010-11-11 Text emotion classifying method in stock field

Country Status (1)

Country Link
CN (1) CN102023967A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110420