CN102023967A - Text emotion classifying method in stock field

Info

Publication number: CN102023967A
Application number: CN2010105432677A
Authority: CN
Inventors: 张勇; 高旸; 周莉; 邢春晓
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2010-11-11
Filing date: 2010-11-11
Publication date: 2011-04-20

Abstract

A text sentiment classification method for the stock domain, belonging to the technical field of stock sentiment-orientation analysis. Using publicly available news information, including stock news, the method applies an improved appraisal group to perform feature selection on an expanded set of stock sentiment words, weights the sentiment words in stock-related Chinese text with normalized absolute term-frequency weights, and finally uses a Bayes, K-NN, or SVM text sentiment classification algorithm to analyse the orientation of stock news. The invention has the advantages of being simple, feasible, and computationally convenient.

Description

A text sentiment classification method for the stock domain

Technical field

The invention belongs to the field of text sentiment classification in natural language processing, and specifically relates to a text sentiment classification method for the stock domain.

Background art

With economic development and rising living standards, investing by buying stocks has gradually become a trend in today's society, and how to buy stocks well has become a major concern for investors. Meanwhile, with the rapid development of network technology, the Internet, by virtue of its real-time nature, richness, and broad coverage, has gradually replaced traditional news media as the main channel through which people obtain information. More and more stock news appears online, including macroeconomic news, news on individual stocks, industry overviews, listed-company news, and so on.

The Efficient Market Hypothesis (EMH), also called efficient market theory, originated in the paper "The Behavior of Stock-Market Prices", published in The Journal of Business in 1965 by Professor Eugene Fama of the University of Chicago, and was then developed and formally proposed by Fama in his 1970 paper "Efficient Capital Markets: A Review of Theory and Empirical Work" in The Journal of Finance. The hypothesis assumes that all publicly available information is reflected in market prices: if relevant information is reflected in security prices fully and without distortion, the market is efficient. Since security prices fully reflect all available information, the available relevant information becomes the effective determinant of prices.

Depending on the class of information available, efficient market theory distinguishes three forms of efficient capital market: the weak-form, semi-strong-form, and strong-form efficient market. As for China, most domestic scholars hold that the Chinese stock market is weak-form efficient.

In a weak-form efficient market, it takes some time after information is released for it to be reflected in the share price; that is, after the release, the stock reaches an appropriate price only after a period of adjustment. The influence of stock news on the stock market therefore cannot be ignored: the volume of news and the orientation of its content can, to a large extent, sway investors' buying behaviour. For example, as soon as the news came out on April 24, 2010 that the State Council would cut the stamp duty rate from 3 ‰ to 1 ‰, the Shanghai Stock Exchange index surged 304 points and more than a thousand individual stocks hit the daily limit up. Likewise, during the "Two Sessions" in early 2010, the government work report proposed developing a "low-carbon economy", after which the "new energy sector" attracted attention and gradually strengthened. Studying the orientation of stock market news in order to assist investors in making investment decisions is therefore of practical significance.

Sentiment-based classification identifies whether a text is positive or negative; research of this kind is called sentiment classification. Text sentiment classification is a special text classification problem: the sentiment orientation of a text is judged by mining and analysing subjective information in it, such as stance, viewpoint, opinion, and mood. Text sentiment classification is a good method for judging orientation and has been applied successfully in personalized recommendation, opinion retrieval, user-interest mining, information filtering, mail filtering, public-opinion analysis, and similar areas.

At present there are a number of enterprises in China and abroad that provide financial information services, for example Great Wisdom and Wei Saite in China and Reuters abroad. However, the services these companies provide are generally expensive and beyond the means of ordinary investors. It is therefore worth using information that is easy to obtain from financial websites, such as news: after processing by text sentiment classification, a positive or negative indication can be given for each news item, helping investors make investment decisions more quickly.

Summary of the invention

The object of the present invention is to provide a text sentiment classification method for the stock domain, which gives suggestions on the sentiment orientation of stock news.

The invention is characterized in that the text sentiment classification is a sentiment-based classification used to identify whether a Chinese text in the stock domain is positive or negative, and the classification method is carried out in a computer in the following steps, in order:

Step (1): initialize the computer and set up the following software tools:

An Add-delta data smoothing algorithm module;

The Chinese lexical analysis module ICTCLAS, used for Chinese word segmentation of stock news;

An appraisal module used for text feature selection;

The Weka module used for classification experiments, which includes classification algorithms such as Bayes and K-NN,

and define the new words used in segmenting Chinese text in the stock domain:

Abbreviations, including but not limited to: "PetroChina", "state investment", and "Zhongjin";

Proper nouns, including but not limited to: "incorporated company" and "security investment fund";

Derived words, including but not limited to: "dark-horse stock", "leading the rise", and "bearish news";

Compound words, including but not limited to: "rally and pull back" and "dividend distribution";

Step (2): take the headlines of the selected securities news, together with related stock information comprising common securities terms and the stock names used for sentiment classification, as the original corpus, i.e. the Chinese text, and input it to the computer;

Step (3): Chinese text segmentation. The Chinese character sequence in the Chinese text of step (2) is cut into individual words with independent meaning, as follows:

Step (3.1): build a segmentation dictionary for the stock domain using the n-gram statistical language model employed for new-word discovery, as follows:

Step (3.1.1): set up the n-gram model,

Let a character string sequence (n-gram) be denoted W = w_1 w_2 ... w_n, where w_i is a character and n, an integer from 2 to 6, is the number of characters in the string,

The probability P_MLE(w_n | w_1 w_2 ... w_{n-1}) that the character string sequence W occurs in the Chinese text is then computed by the formula below, where MLE indicates that the parameters are estimated by maximum likelihood; this is called the n-gram language model,

If a character string has length L, cutting it into n-grams yields L-n+1 strings, and the frequency of occurrence of identical strings among them is counted, where

P_{MLE} (w_{n} | w_{1} w_{2} . . . w_{n - 1}) = \frac{C (w_{1} w_{2} . . . w_{n})}{C (w_{1} w_{2} . . . w_{n - 1})},

C(w_1 w_2 ... w_n) denotes the number of times the character string w_1 w_2 ... w_n occurs in the original corpus, and C(w_1 w_2 ... w_{n-1}) denotes the number of times the string formed by its first n-1 characters, w_1 w_2 ... w_{n-1}, occurs in the original corpus,

Step (3.1.2): apply the improved Add-delta data smoothing algorithm to the character strings obtained in step (3.1.1),

P_{Add - delta} (w_{1} w_{2} . . . w_{n}) = \frac{C (w_{1} w_{2} . . . w_{n}) + Δ}{C (w_{1} w_{2} . . . w_{n - 1}) + Δ \cdot N},

where Δ = 0.5 and N is the number of all character-string n-grams in the original corpus,

Step (3.1.3): filter out useless character substrings of common words,

When the difference between the probability of a common word, taken as the parent string, and that of one of its character substrings is less than 0.0001, and the difference between the length of the common word and that of the substring is less than 3, the substring is filtered out,

Steps (3.1.1) to (3.1.3) yield the segmentation dictionary for the stock domain,

Step (3.2): segment the stock news by combining the stock-domain segmentation dictionary obtained in steps (3.1.1) to (3.1.3) with the ICTCLAS Chinese lexical analysis module, which is based on a multi-layer hidden Markov model;

Step (4): remove the stop words from the stock news segmentation result obtained in step (3.2), a stop word being a token that occurs more frequently than common words and has no practical meaning,

Step (4.1): build the stop-word list for stock news text and input it to the computer. The stop-word list comprises prepositions, articles, auxiliary words, conjunctions, and punctuation marks, and in addition the words commonly used as prompts in stock news, including but not limited to "news flash", "focus", "broad market", and "market",

Step (4.2): use the stop-word list to remove the stop words from the stock news segmentation result obtained in step (3);

Step (5): on the basis of the preprocessing in steps (3) and (4), represent the stock Chinese text with a vector space model, as follows:

Step (5.1): use the appraisal module, which is based on Appraisal Theory, to extract from the stock Chinese text adjective phrases and sentiment-bearing adjectives, verbs, and modifiers, collectively called sentiment words,

Step (5.2): define an appraisal group for the stock sentiment words, comprising: positive words, used for positive analysis, including but not limited to rising stock prices and good performance of listed companies; negative words, used for negative analysis, including but not limited to falling stock prices and poor performance of listed companies; degree words, which describe the degree of positivity or negativity; negation words, which are placed before a positive or negative word to reverse its meaning; and uncertainty words, which determine the credibility of the positive or negative word. These five types of stock sentiment appraisal words constitute a feature word set, which is input to the computer,

Step (5.3): use the feature word set obtained in step (5.2) to perform text sentiment analysis on the stock sentiment words extracted in step (5.1), and label each with the type it belongs to,

Step (5.4): weight the stock sentiment words of step (5.3) using normalized absolute term-frequency weights:

The absolute term-frequency weight of the j-th text after normalization, which takes values in the interval [0,1], is:

{weight}_{normal} (t_{k}, d_{j}) = \frac{weight (t_{k}, d_{j})}{\sqrt{Σ_{j = 1}^{| T |} {(weight (t_{k}, d_{j}))}^{2}}},

where in t_k, t denotes a stock sentiment word after appraisal against the feature word set, and k is the index assigned to it after the stock sentiment words across the stock Chinese texts have been sorted,

in d_j, d denotes a stock Chinese text and j is the index of that text; |T| denotes the number of all stock Chinese texts, so j = 1, 2, ..., |T|,

weight(t_k, d_j) denotes the term-frequency weight, before normalization, of the k-th stock sentiment word in the j-th stock Chinese text, taking values in the interval [0,1]; weight_normal(t_k, d_j) denotes the absolute term-frequency weight of this sentiment word after normalization, taking values in the interval [0,1];

Step (6): text sentiment classification

Any classification algorithm in the Weka module is used to classify the sentiment of a stock Chinese text: a positive result is assigned to the positive hot sector, and a negative result to the negative hot sector.

The advantages of the invention are:

1. The original corpus comes from the Internet and is real-time.

2. It is implemented purely in software, at low cost.

Description of drawings

Fig. 1 is a flowchart of the text sentiment classification method for the stock domain;

Fig. 2 shows the stock appraisal group;

Fig. 3 is a flowchart of the program implementation.

Embodiment

The present invention proposes a text sentiment classification method for the stock domain. The method is carried out in a computer in the following three steps, in order; the overall flow is shown in Fig. 1:

Step (1): text preprocessing.

Text preprocessing is divided into two processes, Chinese text segmentation and stop-word removal, wherein:

One, Chinese text segmentation:

Chinese text segmentation means cutting a Chinese character sequence into individual words with independent meaning, and is the basis of Chinese natural language processing. It is carried out in two steps:

First step: build a segmentation dictionary for the stock domain based on the n-gram statistical language model:

In the field of Chinese word segmentation there are two concepts, new words and unknown words, although the two are sometimes not distinguished. New words or unknown words can be roughly divided into four kinds: 1) abbreviations, such as "PetroChina", "state investment", and "Zhongjin"; 2) proper nouns, such as "incorporated company" and "security investment fund"; 3) derived words, such as "dark-horse stock", "leading the rise", and "bearish news"; 4) compound words, such as "rally and pull back" and "dividend distribution".

At present there are two usual approaches to new-word discovery. In rule-based methods, experts summarize the formation rules or characteristics of new words, candidate new words are guessed and given a confidence score, and these are then evaluated further. Statistics-based methods use statistical strategies and association measures to find the strings most likely to be words; this approach is suited to finding short new words.

Because no suitable segmentation dictionary is currently available, a segmentation dictionary for the stock domain has to be built. Moreover, the objects processed by the present invention are mainly stock news headlines, in which words mostly appear in brief form, so a statistics-based new-word discovery method, for example the n-gram language model, can be adopted.

From a statistical point of view, a sentence in natural language can be composed of any character string, but the probabilities P(s) with which different strings occur differ greatly. For example, let s_1 = "more than half of bankers expect monetary policy to remain unchanged next quarter" and let s_2 be the same sentence with its words scrambled. The two contain exactly the same characters, but the former is obviously more likely to occur as a sentence, i.e. P(s_1) > P(s_2).

For a given natural language, P(s) is usually unknown. For a language L that follows some unknown probability distribution P, the process of estimating P from a given language sample is called language modelling. If W = w_1 w_2 ... w_n denotes a character string sequence (n-gram) in the text, where w_i is a character, then the task of language modelling is to give the probability P(W) that the string sequence W occurs in the text. Using the product rule of probability, P(W) can be expanded as:

P (W) = P (w_{1}) P (w_{2} | w_{1}) P (w_{3} | w_{1} w_{2}) . . . P (w_{n} | w_{1} w_{2} . . . w_{n - 1}),

This formula is very complicated, and even for a fairly small n the amount of computation is considerable. To simplify the model and make computation convenient, overly long histories are usually not considered; only a history of n-1 characters is considered, i.e. the probability of any character is assumed to depend only on the n-1 characters before it. Such a language model is called the n-gram language model, also known as an (n-1)-order Markov chain.
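As a worked illustration of the truncated-history assumption, the approximation can be written out for a longer sequence. The symbol m (the length of the whole sequence, as opposed to the n-gram order n) is introduced here only for illustration and does not appear in the original text; conditioning characters with index below 1 are simply dropped.

```latex
P(w_{1} w_{2} \ldots w_{m}) \approx \prod_{i=1}^{m} P(w_{i} \mid w_{i-n+1} \ldots w_{i-1})
```

For n = 2 this reduces to P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_m | w_{m-1}), so each factor needs only bigram and unigram counts.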

The conditional probabilities can be estimated by maximum likelihood estimation (MLE):

P_{MLE} (w_{n} | w_{1} w_{2} . . . w_{n - 1}) = \frac{C (w_{1} w_{2} . . . w_{n})}{C (w_{1} w_{2} . . . w_{n - 1})},
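The maximum likelihood estimate above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the invention; the function names and the toy corpus are assumptions, and n is assumed to be at least 2.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def mle_probability(ngram, corpus):
    """P_MLE(w_n | w_1..w_{n-1}) = C(w_1..w_n) / C(w_1..w_{n-1})."""
    n = len(ngram)
    full = Counter(g for line in corpus for g in char_ngrams(line, n))
    prefix = Counter(g for line in corpus for g in char_ngrams(line, n - 1))
    denom = prefix[ngram[:-1]]
    # An unseen prefix gives a zero count; return 0.0 instead of dividing by zero.
    return full[ngram] / denom if denom else 0.0


corpus = ["abab", "abc"]  # toy stand-in for headline strings
p_bc = mle_probability("bc", corpus)  # C("bc") = 1, C("b") = 3
```

Any n-gram that never occurs in the corpus gets probability zero under this estimate, which is the data sparsity problem that smoothing is meant to address.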

The process of building the segmentation dictionary based on the n-gram statistical language model consists of the following three steps:

1) Set up the n-gram model

New-word discovery needs a large corpus as its basis, but no Chinese text sentiment classification corpus for the stock domain exists at present. The stock-related news on Sina Finance can therefore be chosen as the original corpus. Stock news places great emphasis on timeliness, and in many cases the headline alone summarizes the content of the whole news item and reveals its sentiment orientation. To improve processing efficiency, it is therefore sufficient to use headlines only. Specifically, the stock news headlines published on Sina Finance throughout 2009 were collected, covering individual-stock news, industry overviews, sector news, company news, and so on, 233,282 headlines in total; these serve as the original corpus, i.e. the Chinese text, and are input to the computer.

A character-level n-gram model is then set up on the original corpus: each stock news headline is cut from beginning to end into character strings, and the frequency of occurrence of identical strings is counted. The length n of a character string is the number of Chinese characters in it (an English word or a number counts as one Chinese character). In theory, when n is larger, more context is provided and the context is more discriminative, but the amount of computation is larger and the parameter estimates are less reliable; when n is smaller, less context is provided and the context is less discriminative, but the amount of computation is smaller and the parameter estimates are more reliable. The size of n therefore needs to be chosen sensibly in practice. In addition, if a character string has length L, cutting it into n-grams yields L-n+1 strings.

In practice n can be taken as an integer from 2 to 6. For example, a headline meaning "European stock markets close lower, all sectors fall broadly", which cuts into six strings when n = 6 and is therefore 11 characters long, is divided into six overlapping 6-character strings: the first covers characters 1 to 6, the second characters 2 to 7, and so on up to the last, which covers characters 6 to 11.

A character-level n-gram model is set up on the original corpus and the frequencies of the character strings are counted. If MLE is used directly, there is a data sparsity problem, so a data smoothing technique also needs to be adopted.
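The cutting and counting step can be sketched as follows. This is only an illustration: the function names are assumptions, and the 11-character placeholder string stands in for a real Chinese headline.

```python
from collections import Counter


def char_ngrams(text, n):
    """Cut text into its len(text) - n + 1 overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_ngrams(headlines, n_values=range(2, 7)):
    """Count every character n-gram for n = 2..6 over all headlines."""
    counts = Counter()
    for headline in headlines:
        for n in n_values:
            counts.update(char_ngrams(headline, n))
    return counts


headline = "ABCDEFGHIJK"  # placeholder for an 11-character headline
pieces = char_ngrams(headline, 6)  # L - n + 1 = 11 - 6 + 1 = 6 strings
```

A headline shorter than n simply contributes no n-grams of that length, since the range in `char_ngrams` is then empty.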

2) Data smoothing

Data smoothing techniques build on maximum likelihood estimation. Commonly used smoothing methods include Add-one smoothing, Add-delta smoothing (with Δ = 0.5), held-out estimation, and deleted estimation.

The Add-one smoothing method stipulates that the count of any n-gram is its actual number of occurrences in the training corpus plus 1; in other words, n-grams that never occurred are treated as having occurred once in the training corpus, i.e. C(n-gram)_new = C(n-gram)_old + 1. The parameter estimate under Add-one smoothing is

P_{Add - one} (w_{1} w_{2} . . . w_{n}) = \frac{C (w_{1} w_{2} . . . w_{n}) + 1}{C (w_{1} w_{2} . . . w_{n - 1}) + N},

where N denotes the number of all n-grams in the training corpus.

If a large number of n-grams do not appear in the training corpus, then after Add-one smoothing these unseen n-grams take up a large share of the whole probability distribution, which is not reasonable. An improvement is to add to the occurrence count not 1 but a number Δ smaller than 1, i.e.

P_{Add - delta} (w_{1} w_{2} . . . w_{n}) = \frac{C (w_{1} w_{2} . . . w_{n}) + Δ}{C (w_{1} w_{2} . . . w_{n - 1}) + Δ \cdot N},

where 0 < Δ < 1. This is the Add-delta smoothing method; practice has shown that it generally performs better than Add-one.
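The two smoothing formulas differ only in the constant added, which a small sketch makes explicit. The function name and argument names are assumptions chosen for illustration; the arithmetic follows the formulas as written above, with N passed in as the total number of n-grams.

```python
def add_delta(count_ngram, count_prefix, total_ngrams, delta=0.5):
    """Add-delta estimate: (C(w_1..w_n) + delta) / (C(w_1..w_{n-1}) + delta * N).

    With delta = 1 this is the Add-one estimate; the text uses delta = 0.5.
    """
    return (count_ngram + delta) / (count_prefix + delta * total_ngrams)


# An n-gram never seen in a corpus of N = 10 n-grams still gets non-zero mass.
unseen = add_delta(0, 0, 10)             # 0.5 / 5
add_one = add_delta(3, 4, 10, delta=1)   # 4 / 14
```

Lowering delta shifts less probability mass to unseen n-grams, which is the motivation given in the text for preferring Add-delta to Add-one.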

The basic idea of held-out estimation is to divide the whole corpus into two parts, a training corpus and a held-out corpus, where the training corpus is used for the initial frequency estimate and the held-out corpus is used to improve it. Specifically, for each n-gram w_1 w_2 ... w_n, the frequencies with which it occurs in the training corpus and in the held-out corpus are computed, i.e. C_tr(w_1 w_2 ... w_n) and C_ho(w_1 w_2 ... w_n). Let T be the number of all n-grams in the held-out corpus, and let r denote the frequency with which an n-gram occurs in the training corpus, i.e. r = C_tr(w_1 w_2 ... w_n). Let N_r denote the number of distinct n-grams that occur r times in the training corpus, and let T_r denote the sum of the frequencies in the held-out corpus of all n-grams that occur r times in the training corpus, i.e.

T_{r} = Σ_{\{w_{1} . . . w_{n} : C_{tr} (w_{1} . . . w_{n}) = r\}} C_{ho} (w_{1} w_{2} . . . w_{n}),

The parameter estimate under held-out estimation is therefore

P_{ho} (w_{1} w_{2} . . . w_{n}) = \frac{T_{r}}{T} \times \frac{1}{N_{r}},
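The held-out estimate can be sketched as below. This is a simplified illustration under stated assumptions: the function name and toy data are hypothetical, the inputs are lists of already extracted n-grams, and only n-grams seen at least once in the training part are scored (handling r = 0 would require enumerating all possible n-grams, which is omitted here).

```python
from collections import Counter, defaultdict


def held_out_probs(train_ngrams, heldout_ngrams):
    """P_ho(g) = T_r / (T * N_r), where r is the training count of g."""
    c_tr = Counter(train_ngrams)
    c_ho = Counter(heldout_ngrams)
    total_heldout = len(heldout_ngrams)       # T
    n_r = Counter(c_tr.values())              # N_r: distinct n-grams with count r
    t_r = defaultdict(int)                    # T_r: held-out mass of that group
    for gram, r in c_tr.items():
        t_r[r] += c_ho[gram]
    return {g: t_r[r] / (total_heldout * n_r[r]) for g, r in c_tr.items()}


train = ["ab", "ab", "bc", "cd"]
heldout = ["ab", "bc", "bc", "de"]
probs = held_out_probs(train, heldout)
```

In the toy data, "bc" and "cd" both occur once in training, so they share the held-out mass of that group equally even though "cd" never appears in the held-out part.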

Deleted estimation divides the training corpus into two parts, uses one part as the training corpus and the other as the held-out corpus, then swaps their roles and repeats the computation, and finally takes the weighted mean of the two results, i.e.

P_{del} (w_{1} w_{2} . . . w_{n}) = \frac{{T_{r}}^{01} + {T_{r}}^{10}}{N (N_{r}^{0} + N_{r}^{1})},

where T_r^{ij} is computed with part i as the training corpus and part j as the held-out corpus, N_r^{i} denotes the number of distinct n-grams that occur r times in training part i, and N denotes the total number of n-grams in the training and held-out corpora together.

3) " substring " filters

Have a lot of everyday words in actual applications, some " substrings " of forming these speech can only occur in these speech substantially, are difficult to occur as a speech separately.For example " incorporated company " is one " father's string ", and " the limited public affairs of share " then are its one " substrings ".Even some " substrings " can not become speech separately usually, the frequency of their appearance is but basic identical with its " father's string ".In general statistical language model, the probability (result of parameter estimation) of such " substring " and " father's string " is very approaching, but " substring " is otiose often, thereby becomes distracter, need filter them.

By setting up statistical language model, those otiose " substrings " are often very little with the difference of the probability of its " father's string ".So the basic thought of filtering useless " substring " is, all " father's strings " for a character string, if the difference of the probability of this character string and its " father's string " is worth less than certain, and the difference of the length of this character string and its " father's string " can be filtered this character string when being worth less than certain.The present invention adopts this method to filter just, and wherein the probability difference gets 0.0001, and length difference gets 3.
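The filtering rule can be sketched as follows, assuming the candidate strings and their estimated probabilities are held in a dictionary. The function name, the toy strings, and the quadratic pairwise scan are illustrative assumptions; a real corpus of this size would need an indexed lookup of parent strings.

```python
def filter_substrings(probs, max_prob_diff=0.0001, max_len_diff=3):
    """Drop a string if some parent string containing it is both nearly as
    probable (difference < max_prob_diff) and only slightly longer
    (length difference < max_len_diff)."""
    kept = {}
    for s, p in probs.items():
        redundant = any(
            s != parent
            and s in parent
            and len(parent) - len(s) < max_len_diff
            and abs(probs[parent] - p) < max_prob_diff
            for parent in probs
        )
        if not redundant:
            kept[s] = p
    return kept


candidates = {"abcdef": 0.0010, "abcde": 0.00101, "cd": 0.0010, "xyz": 0.002}
dictionary = filter_substrings(candidates)
```

Here "abcde" is removed because it is one character shorter than "abcdef" and almost equally probable, while "cd" survives because its parents are at least three characters longer.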

Through above-mentioned three steps, can set up a dictionary by original language material.In conjunction with the relevant information of some stocks, comprise security Essential Terms and stock name etc. simultaneously, finally can make up the dictionary for word segmentation in a stock field.

In second step,, utilize the ICTCLAS system to carry out participle in conjunction with stock field dictionary for word segmentation:

The ICTCLAS system is by people such as the Zhang Huaping of Inst. of Computing Techn. Academia Sinica and Liu Qun, based on the multilayer Hidden Markov Model (HMM), and the Chinese lexical analytic system of exploitation.The major function of this system comprises Chinese word segmentation, part-of-speech tagging, and named entity recognition, user-oriented dictionary is supported in neologisms identification simultaneously.

Can utilize the ICTCLAS system, and, stock news be carried out participle in conjunction with the stock field dictionary for word segmentation that makes up.
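To illustrate how a domain dictionary influences segmentation, the sketch below uses forward maximum matching. This is explicitly a stand-in technique, not ICTCLAS and not its API: ICTCLAS uses a multi-layer hidden Markov model, whereas this toy segmenter only shows that a dictionary entry keeps a domain term together as one token. All names and data are assumptions.

```python
def forward_max_match(text, dictionary, max_len=6):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary entry, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


domain_dictionary = {"ab", "abc", "de"}  # toy stand-in for the stock dictionary
tokens = forward_max_match("abcdex", domain_dictionary)
```

Without the entry "abc" in the dictionary, the same input would be split as "ab", "c", "de", "x", which mirrors how a missing domain term gets broken apart.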

Two, stop-word removal:

Stop words are words that occur fairly frequently but carry little practical meaning and contribute almost nothing to text processing. Removing stop words is important for improving the efficiency of text processing, and can be done by building a stop-word list.

The construction of a stop-word list depends not only on the language used but also on the specific application. The stop-word list for stock news text mainly contains two kinds of entries: the first kind is prepositions, articles, auxiliary words, conjunctions, and punctuation marks; the second kind is the prompt words that precede stock news headlines, such as "news flash", "focus", "broad market", and "market".
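Stop-word removal over a segmented headline amounts to a set lookup per token. The sketch below uses English placeholder tokens; the list contents and names are assumptions, not the actual stop-word list of the invention.

```python
# Hypothetical stop-word list: function words, punctuation, and headline prompts.
STOP_WORDS = {"of", "the", ",", ":", "news flash", "market"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


segmented = ["news flash", ":", "bank", "shares", "rise"]
cleaned = remove_stop_words(segmented)
```

Using a set keeps each lookup at constant time, which matters when several hundred thousand headlines are processed.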

Step (2): text representation.

To let the computer "understand" text, the text can be represented with a vector space model. The basic approach of the vector space model is to represent a text by a vector over a set of orthogonal terms, where each distinct term acts as an independent dimension of the feature space. Text representation is divided into text feature selection and text feature weighting, wherein:

One, text feature selection:

For the problem of text sentiment analysis, Casey Whitelaw et al. introduced Appraisal Theory: phrases consisting of an adjective and its modifiers are extracted from the text as feature words for semantic orientation analysis. Such an adjective phrase is called an appraisal group (AG). Experiments show that using "appraisal groups" as the feature word set improves the accuracy of sentiment classification. Following Martin's appraisal theory, Casey Whitelaw et al. gave an appraisal four attributes: attitude, orientation, graduation, and polarity.

For its sentiment classification problem, the present invention can likewise use a method similar to appraisal groups. The difference is that what needs to be extracted is not only adjective phrases but also sentiment-bearing adjectives, verbs, and modifiers; these words are collectively called sentiment words. The stock sentiment words can be provisionally divided into five types: positive words, negative words, degree words, negation words, and uncertainty words. Positive words describe rising stock prices, good performance of listed companies, and the like; negative words describe falling stock prices, poor performance of listed companies, and the like; degree words describe the degree of positivity or negativity; negation words, placed before a positive or negative word, express the opposite meaning; and uncertainty words determine the credibility of positive and negative words. The concrete structure is shown in Fig. 2.
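The five-way typing of sentiment words can be sketched as a lookup against a small lexicon. The English entries below are hypothetical placeholders chosen for illustration; they are not the appraisal group of Fig. 2.

```python
# Hypothetical appraisal group with the five types named in the text.
APPRAISAL_GROUP = {
    "positive": {"rise", "profit"},
    "negative": {"fall", "loss"},
    "degree": {"sharply", "slightly"},
    "negation": {"not"},
    "uncertain": {"may", "rumored"},
}


def tag_sentiment_words(tokens, group=APPRAISAL_GROUP):
    """Return (word, type) pairs for the tokens found in the appraisal group."""
    index = {word: label for label, words in group.items() for word in words}
    return [(t, index[t]) for t in tokens if t in index]


tagged = tag_sentiment_words(["shares", "may", "fall", "sharply"])
```

Tokens outside the lexicon, such as "shares" here, are dropped, so the tagged list doubles as the selected feature set for the headline.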

Two, text feature weighting:

The selected features differ in how strongly they discriminate between texts. In formalizing the text, these features therefore need to be further weighted. The purpose of weighting is to raise the weight of strongly discriminating features and lower the weight of weakly discriminating ones. The weighting functions considered in the present invention include the Boolean weight, the absolute term-frequency weight, the TF-IDF weight, and the normalized weight.

1) Boolean weight

The Boolean weight is the simplest weighting function. As the name suggests, its value is Boolean: if the feature word does not occur, its weight is 0; as soon as the feature word occurs, its weight is 1. As a formula,

weight (t_{k}, d_{j}) = \begin{cases} 0, & \# (t_{k}, d_{j}) = 0 \\ 1, & \# (t_{k}, d_{j}) > 0 \end{cases},

where t denotes a feature word, k is the index assigned to it after the feature words across the documents have been sorted, d denotes a document, j is the index of the document in the document set, weight(t_k, d_j) is the weight of feature word t_k in document d_j, and #(t_k, d_j) is the number of times feature word t_k occurs in document d_j.

2) Absolute term-frequency weight

The Boolean weight is very simple: it distinguishes feature words only by 0 and 1 and cannot express differences in importance between them. In text classification, a word that occurs many times is usually considered to contribute more to classification than a word that occurs few times, so feature words with different occurrence counts should have different weights. The absolute term-frequency weight uses the frequency with which a feature word occurs in the document directly as its weight, so that the more often a word occurs, the more important it is. As a formula,

weight (t_{k}, d_{j}) = \# (t_{k}, d_{j}),
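The Boolean and absolute term-frequency weights can be sketched together, since the first is just a thresholded version of the second. The function names and the token list are illustrative assumptions.

```python
def absolute_tf(term, doc_tokens):
    """#(t_k, d_j): the number of times the term occurs in the document."""
    return doc_tokens.count(term)


def boolean_weight(term, doc_tokens):
    """1 if the term occurs at least once in the document, otherwise 0."""
    return 1 if absolute_tf(term, doc_tokens) > 0 else 0


doc = ["rise", "rise", "bank"]
weights = {t: (boolean_weight(t, doc), absolute_tf(t, doc))
           for t in ["rise", "bank", "fall"]}
```

For the toy document, "rise" and "bank" have the same Boolean weight but different absolute weights, which is exactly the distinction the text draws.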

3) TF-IDF weight

TF-IDF is a method commonly used in information retrieval and can also serve as a text feature weighting function. It computes the weight of a word in the whole text set from the word's term frequency and the number of documents in which it occurs, as follows:

weight (t_{k}, d_{j}) = TF (t_{k}, d_{j}) \times IDF (t_{k}, d_{j}) = \# (t_{k}, d_{j}) \cdot \log \frac{| D |}{\# D (t_{k})},

where #D(t_k) is the document frequency of feature word t_k, i.e. the number of documents in the whole set in which t_k occurs, |D| is the total number of documents, TF(t_k, d_j) = #(t_k, d_j) is the number of times t_k occurs in document d_j, and

IDF (t_{k}, d_{j}) = \log \frac{| D |}{\# D (t_{k})}

is the inverse document frequency. The formula takes this form because of two assumptions: the more often a feature word occurs in a document, the better it represents the content of that document; and the more documents a feature word occurs in, the smaller its discriminative power.
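A minimal sketch of the TF-IDF weight follows. The base of the logarithm is not stated in the text, so the natural logarithm is an assumption here, as are the function name and the toy documents.

```python
import math


def tf_idf(term, doc_tokens, all_docs):
    """#(t_k, d_j) * log(|D| / #D(t_k)), using the natural logarithm."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)
    if df == 0:
        return 0.0  # a term that occurs nowhere carries no weight
    return tf * math.log(len(all_docs) / df)


docs = [["rise", "rise", "bank"], ["fall", "bank"], ["bank"]]
w_rise = tf_idf("rise", docs[0], docs)  # 2 * log(3 / 1)
w_bank = tf_idf("bank", docs[0], docs)  # 1 * log(3 / 3) = 0
```

The word "bank" occurs in every document, so its weight drops to zero, matching the second assumption behind the formula.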

4) Normalized term frequency

In practical applications the lengths of texts may differ greatly. With the three methods above, the feature-value distributions of long and short texts would differ widely, which is inconvenient for computation. The weights can therefore be mapped into the interval [0,1], so that the vectors representing the documents have the same length; cosine normalization is used for this, and the final result is given by the formula

{weight}_{normal} (t_{k}, d_{j}) = \frac{weight (t_{k}, d_{j})}{\sqrt{Σ_{j = 1}^{| T |} {(weight (t_{k}, d_{j}))}^{2}}},

where weight_normal(t_k, d_j) denotes the absolute term frequency after normalization, taking values in the interval [0,1], and |T| denotes the number of elements in the set of all documents.
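The normalization can be sketched directly from the formula as written, in which the sum in the denominator runs over the documents j = 1..|T| for a fixed feature word t_k. The function name and the sample weights are assumptions for illustration.

```python
import math


def normalize_over_documents(weights):
    """weights[j] holds weight(t_k, d_j) for one word t_k across all documents.

    Each entry is divided by the square root of the sum of squares,
    so non-negative inputs end up in the interval [0, 1].
    """
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


raw = [3, 4, 0]  # absolute term frequencies of one word in three headlines
normalized = normalize_over_documents(raw)
```

A word that occurs in no document has a zero norm, so the sketch returns zeros in that case rather than dividing by zero.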

Step (3): text sentiment classification.

In the present invention, the sentiment classification of stock news can be regarded as a two-class problem: a sector whose news is classified as positive is a positive hot sector, and a sector whose news is classified as negative is a negative hot sector. Because stock news directly influences investors' purchases, this is reflected in actual trading: sectors that are mentioned often and always positively can be regarded as those with larger gains, while sectors that are mentioned often but always negatively are those with larger declines.

Research on Chinese sentiment classification is still scarce and can be said to be at an exploratory stage. For the stock domain in particular, it remains to be explored which machine learning methods suit Chinese stock text. The present invention selects three classification methods, Bayes, SVM, and KNN, for experiments on sentiment classification in the stock domain.
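As an illustration of the two-class setting, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing in plain Python. It is a stand-in for illustration only: the invention runs its experiments in Weka, and the function names, labels, and toy training data here are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unknown tokens are skipped."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:
                score += math.log((word_counts[label][t] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_docs = [["rise", "profit"], ["rise", "strong"],
              ["fall", "loss"], ["fall", "weak"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
```

A headline reduced to the sentiment word "rise" is then assigned to the positive class, and one containing "loss" and "fall" to the negative class.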

Claims

1. A text sentiment classification method for the stock domain, characterized in that the text sentiment classification is a sentiment-based classification used to identify whether a Chinese text in the stock domain is positive or negative, and the classification method is carried out in a computer in the following steps, in order: