CN104951548A

CN104951548A - Method and system for calculating negative public opinion index

Info

Publication number: CN104951548A
Application number: CN201510355005.0A
Authority: CN
Inventors: 李雪梅; 刘大伟; 刘玮; 王海洋; 隋雪青; 程学旗; 戴鹏飞
Original assignee: Yantai Zhong Ke Network Technical Institute
Current assignee: Yantai Zhong Ke Network Technical Institute
Priority date: 2015-06-24
Filing date: 2015-06-24
Publication date: 2015-09-30
Anticipated expiration: 2035-06-24
Also published as: CN104951548B

Abstract

The invention relates to a method and system for calculating the negative public opinion index. The method includes the steps that S1, positive and negative classification based on an emotional thesaurus and SVM classification based on a Model are conducted on a text to be classified, and accordingly a first classification result and a second classification result are obtained respectively; S2, if the value of the first classification result and the value of the second classification result are both negative, it is believed that the text to be classified is negative, and the step S3 is executed continuously; if the value of the first classification result and the value of the second classification result are not both negative, it is believed that the text to be classified is not negative, and calculation is ended; S3, the text to be classified is matched with a user annotation negative dictionary and a training set key dictionary, so that a first negative index and a second negative index are obtained respectively; S4, the first classification result, the first negative index and the second negative index are subjected to linear combination, and thus the negative public opinion index of the text to be classified is obtained. By the adoption of the method and system, the accuracy rate of the calculated negative public opinion index is high, the calculated corpus range is wide and real-time calculating can be conducted.

Description

Method and system for calculating a negative public opinion index

Technical field

The present invention relates to a method and system for calculating a negative public opinion index.

Background technology

According to statistics from the China Internet Network Information Center (CNNIC), by the end of December 2014 China had 649 million Internet users and an Internet penetration rate of 47.9%, up 2.1 percentage points from the end of 2013. In 2014, Chinese Internet users spent an average of 26.1 hours online per week, 1.1 hours more than at the end of 2013. This shows that the focus of Internet development is shifting from breadth to depth, and that network applications are profoundly changing users' lives.

As the Internet has developed, it has become an instrument of public opinion, with the function of both reflecting and guiding it. Because online public opinion information is free and unordered, the guiding role of public opinion index calculation is becoming increasingly important. Grasping public opinion dynamics in a timely manner and guiding public opinion clearly are important measures for maintaining social stability. Public opinion index calculation is therefore the basis for research on technologies such as public opinion monitoring, situation deduction, and prediction, and research on calculating a negative public opinion index has important theoretical and practical significance.

The theoretical basis for calculating a negative public opinion index is mainly the binary classification problem (negative versus non-negative) in text classification (the present invention mainly processes Chinese text). It chiefly involves word segmentation, feature selection, text representation, text classification algorithms, and evaluation metrics.

At present, the Public Opinion Research Institute of Renmin University of China cooperates with Baidu and, based on Baidu's massive search data, has proposed 14 public opinion indexes such as the Chinese cold warmth index. They hold that search volume reflects the degree of public concern about the event represented by a keyword. Each year Baidu collects the 1000 hot search words with the highest annual search volume and the fastest rise in attention, which can roughly sketch what Chinese Internet users pay attention to in Chinese society. This technique gives a comprehensive, holistic interpretation of hot search words, but the cycle needed to grasp and understand society at the macro level is too long: judging public opinion on a single event takes at least several days, several weeks, or even several months.

To avoid manually annotating a training set for supervised public opinion index calculation, some scholars have proposed calculating the index by unsupervised clustering. First, unsupervised clustering is less effective than supervised text classification, and the number of clusters is not known in advance. Second, whether its efficiency on very large data sets is acceptable still needs to be verified.

Describing the feature space with the vector space model (VSM) after feature selection has a defect, namely the sparseness of the feature space. Moreover, using the VSM alone to describe unprocessed words does not reflect the semantic relations between words well.

Wang Haoyu and Su Xinning of Nanjing University proposed a character labeling model based on conditional random fields (Conditional Random Fields, CRFs). The model labels the titles of news items or forum posts, and discovers public opinion hot spots by combining statistics on the number of occurrences of person names with the background of those names. However, the model requires a closed corpus for hot spot discovery, which is a limitation.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and system for calculating a negative public opinion index.

The technical scheme by which the present invention solves the above technical problem is as follows: a method for calculating a negative public opinion index, comprising the following steps:

Step S1: perform positive/negative classification based on a sentiment dictionary and SVM classification based on a Model on the text to be classified, obtaining classification result 1 and classification result 2 respectively;

Step S2: if classification result 1 and classification result 2 are both negative, the text to be classified is considered negative and step S3 is performed; if classification result 1 and classification result 2 are not both negative, the text to be classified is considered non-negative and the calculation ends;

Step S3: match the text to be classified against the user-annotated negative dictionary and the training-set keyword dictionary, obtaining negative index 1 and negative index 2 respectively;

Step S4: linearly combine classification result 1, negative index 1, and negative index 2 to obtain the negative public opinion index of the text to be classified.

On the basis of the above technical scheme, the present invention can also be improved as follows.

Further, before step S1 is performed, the method also includes filtering out the text to be classified when it is a spam page.

Further, the user-annotated negative dictionary in step S3 is a dictionary formed by the user manually annotating multiple corpus documents as positive or negative;

The training-set keyword dictionary is formed as follows: ICTCLAS word segmentation and VSM text representation are applied to the negative corpus documents in the training set, followed by TF-IDF feature selection; the p keywords with the largest TF-IDF weights are extracted from the result, where p is an integer greater than or equal to 1; the keywords extracted from the negative corpus documents are deduplicated and their weights normalized to form the training-set keyword dictionary. The training set is the set of manually annotated positive and negative corpus documents.

Further, in step S1, the sentiment-dictionary-based positive/negative classification of the text to be classified is performed by an unsupervised classifier based on a general sentiment dictionary;

The Model is formed by the following steps:

TF-IDF feature selection is performed on the training set, and the result is represented by the matrix $A_{m \times n}$, where m is the total number of keywords in the training set (an integer greater than or equal to 1) and n is the total number of corpus documents in the training set (an integer greater than or equal to 1);

Singular value decomposition is applied to $A_{m \times n}$: $\mathrm{SVD}(A_{m \times n}) = U_{m \times m} \Sigma_{m \times n} V^{T}_{n \times n}$, where $U_{m \times m}$ is an m × m square matrix, $\Sigma_{m \times n}$ is an m × n diagonal matrix, and $V^{T}_{n \times n}$ is an n × n square matrix;

The matrix $V^{T}_{k \times n}$ corresponding to the k largest singular values of $A_{m \times n}$ is taken to represent $A_{m \times n}$, where k is an integer greater than or equal to 1;

SVM training is performed on $V^{T}_{k \times n}$ to obtain the Model.

Further, after step S4 the method also includes updating the training set, using as new negative and non-negative corpus documents, respectively, the texts whose negative public opinion index exceeds a predetermined threshold and the texts for which classification result 1 and classification result 2 are non-negative.

Another technical scheme by which the present invention solves the above technical problem is as follows: a system for calculating a negative public opinion index, comprising a sentiment-dictionary-based positive/negative classification module, a Model-based SVM classification module, a positive/negative judgment module, a user-annotated negative dictionary matching module, a training-set keyword dictionary matching module, and a linear combination module;

The sentiment-dictionary-based positive/negative classification module performs positive/negative classification based on a sentiment dictionary on the text to be classified and obtains classification result 1;

The Model-based SVM classification module performs SVM classification based on the Model on the text to be classified and obtains classification result 2;

The positive/negative judgment module judges whether classification result 1 and classification result 2 are positive or negative;

The user-annotated negative dictionary matching module matches the text to be classified against the user-annotated negative dictionary to obtain negative index 1;

The training-set keyword dictionary matching module matches the text to be classified against the training-set keyword dictionary to obtain negative index 2;

The linear combination module linearly combines classification result 1, negative index 1, and negative index 2 to obtain the negative public opinion index of the text to be classified.

Further, the system for calculating a negative public opinion index also comprises a text filtering module, which filters out the text to be classified when it is a spam page.

Further, the user-annotated negative dictionary is a dictionary formed by the user manually annotating multiple corpus documents as positive or negative;

The training-set keyword dictionary is formed as follows: ICTCLAS word segmentation and VSM text representation are applied to the negative corpus documents of the training set, followed by TF-IDF feature selection; the p keywords with the largest TF-IDF weights are extracted from the result, where p is an integer greater than or equal to 1; the keywords extracted from the negative corpus documents are deduplicated and their weights normalized to form the training-set keyword dictionary. The training set is the set of manually annotated positive and negative corpus documents.

Further, the sentiment-dictionary-based positive/negative classification module comprises an unsupervised classifier based on a general sentiment dictionary, which classifies the text to be classified;

The formation of the Model in the Model-based SVM classification module comprises the following steps:

SVM training is performed on the matrix $V^{T}_{k \times n}$ to obtain the Model.

Further, the system for calculating a negative public opinion index also comprises a manually annotated training set update module, which updates the training set using as new negative and non-negative corpus documents, respectively, the texts whose negative public opinion index exceeds a predetermined threshold and the texts for which classification result 1 and classification result 2 are non-negative.

The beneficial effects of the present invention are as follows: the cycle for calculating the negative public opinion index is short, so texts to be classified can be processed in real time; the invention uses supervised SVM text classification, which performs well, is suitable for very large data sets, and is computationally efficient; the invention uses SVD, which both reduces the dimensionality of the sparse matrix and reflects the semantic relations between words well; and the invention can process both closed and open corpora.

Brief description of the drawings

Fig. 1 is an overall flow chart of the method for calculating a negative public opinion index according to the present invention;

Fig. 2 is a detailed flow chart of the method for calculating a negative public opinion index according to the present invention;

Fig. 3 is a structural diagram of the system for calculating a negative public opinion index according to the present invention.

Detailed description

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

As shown in Figure 1, a method for calculating a negative public opinion index comprises the following steps:

Step S1: judge whether the text to be classified is a spam page.

Spam pages are web pages that obtain an undeservedly high ranking in search engines through improper means. In this system, following the relevant literature, pages whose content features such as article title length, web page URL length, and article content length meet certain conditions are treated as spam pages. If the text is a spam page it is filtered out; otherwise step S2 is performed.
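The content-feature filter described above can be sketched as follows. This is only a minimal illustration: the patent names the features (title length, URL length, content length) but gives no conditions, so the `Page` type and all three thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    content: str


# Hypothetical thresholds: the patent names the features but not the values.
MAX_TITLE_LEN = 60
MAX_URL_LEN = 200
MIN_CONTENT_LEN = 50


def is_spam_page(page: Page) -> bool:
    """Flag a page as spam if any content feature violates its threshold."""
    return (
        len(page.title) > MAX_TITLE_LEN
        or len(page.url) > MAX_URL_LEN
        or len(page.content) < MIN_CONTENT_LEN
    )
```

A page flagged by `is_spam_page` would be dropped before classification; any other page proceeds to step S2.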

Step S2: perform positive/negative classification based on a sentiment dictionary and SVM classification based on a Model on the text to be classified, obtaining classification result 1 (class_res1) and classification result 2 (class_res2) respectively.

In the sentiment-dictionary-based positive/negative classification, the text to be classified is processed by an unsupervised sentiment classifier module based on a general sentiment dictionary, which yields the classification result class_res1. This module is a sentiment classification component provided by Tan Songbo of the Institute of Computing Technology, Chinese Academy of Sciences. Its theoretical basis is the published paper "Combining Learn-based and Lexicon-based Techniques for Sentiment Detection without Using Labeled Examples". The module avoids the drawbacks of relying on sentiment word matching alone, which depends heavily on expert knowledge and whose classification quality is directly determined by dictionary quality. It also avoids the drawback of relying on supervised learning alone, which depends heavily on a labelled training set. The algorithm is as follows:

a) Use the sentiment dictionary to find, among all texts to be classified, the subset carrying the most sentiment information (that is, the texts that appear most likely to be positive or negative) as the training set.

b) Train a classifier on these texts.

c) Classify the whole data set with that classifier.

The training set is determined with the help of the sentiment dictionary: for each document, the numbers of positive and negative sentiment words and their respective proportions are calculated. The N texts with the highest proportion of positive (negative) sentiment words are taken as positive (negative) training texts, forming a training set of size 2N.
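The selection of the 2N pseudo-labelled training texts can be sketched as below. The patent does not define the "proportion" precisely; this sketch assumes it is the share of a polarity among all sentiment words found in a document, and the function names are hypothetical.

```python
def sentiment_ratios(tokens, pos_words, neg_words):
    """Return (positive share, negative share) among the sentiment words found."""
    pos = sum(1 for t in tokens if t in pos_words)
    neg = sum(1 for t in tokens if t in neg_words)
    total = pos + neg
    if total == 0:
        return 0.0, 0.0  # no sentiment words: the document is never selected first
    return pos / total, neg / total


def select_pseudo_training_set(docs, pos_words, neg_words, n):
    """docs is a list of token lists; returns (positive ids, negative ids), n each."""
    scored = [(i,) + sentiment_ratios(d, pos_words, neg_words)
              for i, d in enumerate(docs)]
    top_pos = sorted(scored, key=lambda s: s[1], reverse=True)[:n]
    top_neg = sorted(scored, key=lambda s: s[2], reverse=True)[:n]
    return [s[0] for s in top_pos], [s[0] for s in top_neg]
```

The two returned id lists would then be used to train the classifier of step b), which is applied to the whole data set in step c).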

Its advantage is that no manually labelled training set is needed. Sentiment annotation was carried out on 100,000 common Chinese words, from which the 7857 words with the most obvious sentiment tendency were selected: 3133 positive sentiment words and 4724 negative sentiment words. Words whose tendency cannot be determined are not included; for example, "drop" may be negative for a property developer but positive for a home buyer. The classification step is as follows: given a text to be classified, its tendency value is obtained by matching against the general sentiment dictionary. The tendency value is a floating-point number in the interval [0, 1], where [0, 0.5) represents negative, 0.5 represents neutral, and (0.5, 1] represents positive. The closer the value is to 0, the more negative the text; the closer to 1, the more positive.
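The interval convention for the tendency value can be written as a small helper. The function name and the string labels are hypothetical; only the interval boundaries come from the text.

```python
def polarity_label(tendency: float) -> str:
    """Map a tendency value in [0, 1] to a polarity label."""
    if not 0.0 <= tendency <= 1.0:
        raise ValueError("tendency value must lie in [0, 1]")
    if tendency < 0.5:
        return "negative"  # [0, 0.5)
    if tendency > 0.5:
        return "positive"  # (0.5, 1]
    return "neutral"       # exactly 0.5
```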

As shown in Figure 2, the Model used in the Model-based SVM classification is formed as follows: each manually annotated corpus document is first segmented with ICTCLAS and represented in the VSM, and TF-IDF feature selection is then performed. After word segmentation, text representation, and feature selection, the whole training set is expressed as a matrix A. SVD dimensionality reduction is then applied to A, and SVM training on the reduced matrix yields the classification model (Model).

In the Model-based SVM classification, each text to be classified undergoes the same word segmentation, text representation, and feature selection as the training samples. The processed text is projected into the latent semantic space and classified according to the Model to obtain class_res2.

Word segmentation: Chinese word segmentation cuts a sequence of Chinese characters into meaningful words. It belongs to the field of natural language processing, and existing segmentation algorithms fall into three broad classes: methods based on string matching, methods based on understanding, and methods based on statistics. The segmentation algorithm used in the present invention is ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences.

Feature selection: representing a text directly after word segmentation makes the feature vector dimension too high, which leads to the "curse of dimensionality", a complex model, long training time, and poor generalization. A screening strategy can be used to select the terms that contribute most to classification to represent the text. Common feature selection methods include document frequency, TF-IDF, information gain, expected cross entropy, and mutual information. The present invention adopts the TF-IDF strategy.

Text representation: before text classification, documents must be represented in a form that a computer can process. The main representations are the Boolean model, the vector space model (Vector Space Model, VSM), and the probabilistic retrieval model. The present invention uses the vector space model, one of the most common and effective methods.

Text classification algorithm: text classification methods are usually divided into two classes, statistical methods and rule-based methods. Statistical methods mainly include the centroid method, the Rocchio algorithm, the nearest neighbor algorithm, the Winnow algorithm, naive Bayes, support vector machines, neural networks, and least squares fitting. Rule-based methods mainly include decision trees and rough sets. The present invention adopts the support vector machine text classification algorithm.

Evaluation metrics: evaluation metrics are a way of assessing classifier quality. The main metrics are recall, precision, and the F1 value. Recall measures how completely the classifier retrieves the relevant texts, precision measures how many of the retrieved texts are correct, and the F1 value combines recall and precision.
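For reference, the three metrics can be computed as in the sketch below, which uses the standard definitions with F1 as the harmonic mean of precision and recall. The function name and the choice of "negative" as the target class label are illustrative.

```python
def precision_recall_f1(y_true, y_pred, target="negative"):
    """Return (precision, recall, F1) for the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```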

The detailed process is as follows. Latent semantic analysis applies singular value decomposition (Singular Value Decomposition, SVD) to the term-document matrix, approximately mapping the original matrix onto a k-dimensional latent semantic space. The singular vectors after the mapping reflect the dependence between terms and documents to the greatest possible extent. SVD is a mathematical operation with a clear physical meaning: it expresses a complex matrix as the product of a few smaller, simpler matrices, and these smaller matrices describe the important features of the original matrix.

The latent semantic space has a useful property: text features that frequently co-occur are mapped to the same dimension, while terms that do not co-occur are mapped to different dimensions. This makes the latent semantic space smaller than the original space and achieves dimensionality reduction. After the mapping, documents that originally shared few or no terms may also show greater similarity because of the co-occurrence relations among their terms, which achieves denoising. The process is based on latent semantics, that is, the deeper meaning shared between words, so the result is that document vectors with similar meaning have higher similarity.

For the term-document matrix A, SVD is applied:

$$\mathrm{SVD}(A) = U \Sigma V^{T}$$

If the training set D contains m terms in total and n documents, then:

$$A_{m \times n} = U_{m \times m} \Sigma_{m \times n} V^{T}_{n \times n}$$

A is the m × n matrix to be decomposed. U is an m × m square matrix whose vectors are orthogonal ($U^{T} U = I$, where I is the identity matrix); the vectors in U are called left singular vectors. Σ is an m × n matrix whose off-diagonal elements are all 0; the elements on the diagonal are called singular values. $V^{T}$ is an n × n square matrix whose vectors are also orthogonal ($V^{T} V = I$); the vectors in V are called right singular vectors.

In the diagonal matrix Σ the singular values are arranged in descending order and decay very quickly. In many cases the sum of the largest 10%, or even 1%, of the singular values accounts for more than 99% of the sum of all singular values. The matrix can therefore be approximated using the r largest singular values:

$$A_{m \times n} \approx U_{m \times r} \Sigma_{r \times r} V^{T}_{r \times n}, \quad r \ll m, n$$

In practice, r can be tuned empirically. $V^{T}_{r \times n}$ is used to represent the original term-document matrix $A_{m \times n}$. After SVD dimensionality reduction, the term dimension drops from the original m (typically tens of thousands) to r (typically several hundred), which greatly reduces storage and computation while losing almost none of the original information.
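Because the columns of U are orthonormal, the reduced document representation can be written in closed form, which also suggests one way a new text could be projected into the same space. The identity below follows from the decomposition above, provided the r retained singular values are non-zero. The projection of a new text is the standard LSA "folding-in" formula and is offered only as an assumption, since the patent does not spell out its projection step. The symbols $d$ (the m-dimensional TF-IDF vector of a new text) and $\hat{d}$ (its r-dimensional projection) are introduced here for illustration.

```latex
U_{m \times r}^{T} A_{m \times n} = \Sigma_{r \times r} V^{T}_{r \times n}
\;\Longrightarrow\;
V^{T}_{r \times n} = \Sigma_{r \times r}^{-1} U_{m \times r}^{T} A_{m \times n},
\qquad
\hat{d}_{r \times 1} = \Sigma_{r \times r}^{-1} U_{m \times r}^{T} d_{m \times 1}
```

Under this reading, each column of $V^{T}_{r \times n}$ is the projection of the corresponding training document, so a new text projected with the same formula lies in the space on which the SVM was trained.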

The present invention uses the open-source svdcmp.c file with local modifications, so that the singular values output in the diagonal matrix Σ are sorted in descending order and the matrices U and $V^{T}$ are reordered to match.

SVM is a machine learning method developed on the basis of statistical learning theory (Statistical Learning Theory, SLT). Based on the structural risk minimization principle, it seeks the best compromise between model complexity and learning ability from limited sample information, in order to obtain the best generalization ability, and learns a classification decision function. Its basic idea is to construct a hyperplane as the decision surface such that the margin between the positive and negative classes is maximized.

SVM is in essence a constrained quadratic programming (Quadratic Programming, QP) problem:

$$\min \frac{1}{2} \| w \|^{2} \quad \text{s.t.} \quad y_{i} (w^{T} x_{i} + b) \geq 1, \; i = 1, \dots, n$$

This is a convex problem, so a globally optimal solution can be obtained.

The present invention performs classification with the source code provided by the open-source LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html).

Step S3: if class_res1 and class_res2 are both negative, the text to be classified is considered negative and step S4 is performed; if class_res1 and class_res2 are not both negative, the text to be classified is considered non-negative and the calculation ends.

Step S4: match the text to be classified against user_neg and train_neg, obtaining negative index 1 (user_neg_index) and negative index 2 (train_neg_index) respectively.
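The patent does not state how a dictionary match is turned into an index value. Purely as an assumption, the sketch below scores a text by the share of the dictionary's total weight carried by the dictionary words that occur in it, which keeps the result in [0, 1] as required later. The function name and the dictionary layout (word mapped to weight) are hypothetical.

```python
def dictionary_match_index(tokens, weighted_dict):
    """Assumed scoring: matched dictionary weight divided by total dictionary weight."""
    total = sum(weighted_dict.values())
    if total <= 0:
        return 0.0
    present = set(tokens)  # each dictionary word counts once, however often it occurs
    matched = sum(w for word, w in weighted_dict.items() if word in present)
    return matched / total
```

The same function could be applied to user_neg and to train_neg to obtain user_neg_index and train_neg_index; other scoring rules, for example frequency-weighted ones, would be equally compatible with the text.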

user_neg denotes the user-annotated negative dictionary, which is formed by the user manually annotating multiple corpus documents as positive or negative. It consists of negative words that users accumulate and annotate themselves according to their own areas of concern. Different users have different needs. If the user is a national news office, for example, the negative words of interest include removal, corruption, and petitioning; other users focus on negative public opinion in the financial field, and others on negative public opinion about current news. Given the areas of concern, the negative words, and the corresponding weights supplied by the user, the words can be used directly, or they can first be expanded by finding related co-occurrence words through a search engine or by finding similar words through HowNet.

Co-occurrence words are words that frequently appear together across a large number of documents. The set of co-occurrence words of a word describes, to some extent, the semantic environment of that word, and the strength of association between co-occurrence words also reflects, to some extent, the strength of association between the meanings they represent.

Many co-occurrence word extraction algorithms exist, such as finding co-occurrence words in documents by association clustering and distance clustering, finding co-occurrence words by latent semantic indexing (LSI), and extracting co-occurrence words by combining statistics based on the lexical attraction and repulsion model.

For simplicity, this system extracts co-occurrence words using our organization's existing network data platform. The platform collects a massive amount of web page information in real time, together with several keywords for each web page. The co-occurrence word extraction algorithm is as follows:

a) Read the latest N records from the network data platform, where N is greater than 100,000.

b) Build an inverted index over all keywords of these N records.

c) Look up each negative word supplied by the user in the index and find the web page keywords that accompany each matching keyword; these are called co-occurrence words. The m words with the highest co-occurrence counts are taken as the expansion of that negative word.
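Steps a) to c) can be sketched as follows, assuming each record is simply the list of keywords collected for one web page. The record format and function names are hypothetical, and reading records from the data platform is left out.

```python
from collections import Counter, defaultdict


def build_inverted_index(records):
    """Map each keyword to the set of record ids whose keyword list contains it."""
    index = defaultdict(set)
    for rid, keywords in enumerate(records):
        for kw in keywords:
            index[kw].add(rid)
    return index


def co_occurring_words(word, records, index, m):
    """Return the m keywords that most often share a record with `word`."""
    counts = Counter()
    for rid in index.get(word, ()):
        for kw in set(records[rid]):  # count each keyword once per record
            if kw != word:
                counts[kw] += 1
    return [kw for kw, _ in counts.most_common(m)]
```

Each negative word supplied by the user would be passed to `co_occurring_words`, and the returned words added to the dictionary as its expansion.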

Experimental result: the co-occurrence words of "construction" include project, pilot, reform, and guarantee; the co-occurrence words of "mitigation" include disaster prevention, disaster, earthquake protection, and investigation.

HowNet is a knowledge base whose basic content is the relations between concepts and between the attributes of concepts. In this system keywords are mapped to sememes, which solves the problem of synonym substitution to some extent, so that documents on the same subject that use different synonyms and near-synonyms cluster together better.

From the Download page of the HowNet homepage (http://www.keenage.com/html/e_index.html), download "Word Similarity Computing Based on HowNet"; the glossary.dat file inside it is the HowNet knowledge base. Each negative word supplied by the user is looked up in glossary.dat, and the near-synonyms of the matching keywords are taken as the expansion of that negative word.

Experimental result: the near-synonym of "construction" is build; the near-synonyms of "mitigation" are reduction and misfortune.

train_neg denotes the training-set keyword dictionary. It is introduced because LSA-based SVM text classification considers only the semantic relations between words and does not make effective use of the key vocabulary in the training set. The dictionary is formed as follows: ICTCLAS word segmentation and VSM text representation are applied to the negative corpus documents in the training set, followed by TF-IDF feature selection; the p keywords with the largest TF-IDF weights are extracted from the result, where p is an integer greater than or equal to 1; the keywords extracted from the negative corpus documents are deduplicated and their weights normalized to form the training-set keyword dictionary. The training set is the set of manually annotated positive and negative corpus documents.
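The construction of the keyword dictionary can be sketched as below. Several details are assumptions rather than the patent's method: the documents are taken as already segmented token lists (ICTCLAS is not called), the top p keywords are taken per document, the IDF uses the common `log(n / df) + 1` variant, duplicates keep their largest weight, and weights are normalized by the maximum.

```python
import math
from collections import Counter


def build_keyword_dictionary(negative_docs, p):
    """negative_docs is a list of token lists; returns {keyword: weight in (0, 1]}."""
    n = len(negative_docs)
    df = Counter()
    for doc in negative_docs:
        df.update(set(doc))  # document frequency of each term

    best = {}
    for doc in negative_docs:
        if not doc:
            continue
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * (math.log(n / df[t]) + 1.0)
                  for t, c in tf.items()}
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:p]
        for term, score in top:
            # Deduplicate across documents by keeping the largest weight.
            best[term] = max(best.get(term, 0.0), score)

    if not best:
        return {}
    top_weight = max(best.values())
    return {term: w / top_weight for term, w in best.items()}
```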

Step S5: linearly combine class_res1, user_neg_index, and train_neg_index to obtain the negative public opinion index of the text to be classified.

The linear combination is determined according to the properties of each variable and the weight of each index:

negIndex = -120 * class_res1 + 60 + 30 * user_neg_index + 10 * train_neg_index

Here class_res1 ∈ [0, 0.5), and a smaller value is more negative; user_neg_index ∈ [0, 1], and a larger value is more negative; train_neg_index ∈ [0, 1], and a larger value is more negative. The resulting negIndex ∈ [0, 100], and a larger value is more negative. Based on testing, negIndex >= 80 can be taken as a high value.
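The gate on the two classifiers and the linear combination can be put together in a few lines. The coefficients are those of the formula above; representing the SVM result as a boolean flag and returning `None` for non-negative texts are assumptions made for this sketch.

```python
from typing import Optional


def negative_opinion_index(class_res1: float,
                           class_res2_is_negative: bool,
                           user_neg_index: float,
                           train_neg_index: float) -> Optional[float]:
    """Return negIndex, or None when the two classifiers are not both negative."""
    # class_res1 is negative only on [0, 0.5); the SVM verdict is passed as a flag.
    if not (class_res1 < 0.5 and class_res2_is_negative):
        return None
    return (-120 * class_res1 + 60
            + 30 * user_neg_index
            + 10 * train_neg_index)
```

With class_res1 = 0 and both dictionary indexes equal to 1 the function returns the maximum of 100. The classifier term contributes at most 60 points, the user dictionary at most 30, and the training-set dictionary at most 10.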

Texts with a high negative public opinion index, and texts for which class_res1 and class_res2 are non-negative, are used as new negative and non-negative corpus documents, respectively, to update the manually annotated training set.

As shown in Figure 3, a computing system for negative public sentiment index, comprises the just negative sort module based on sentiment dictionary, the svm classifier module based on Model model, just negative judge module, the negative dictionary matching module of user annotation, the crucial dictionary matching module of training set and linear combiner module.

The sentiment-dictionary-based positive/negative classification module performs sentiment-dictionary-based positive/negative classification on the text to be classified and obtains classification result 1. This module comprises an unsupervised classifier based on a general sentiment dictionary, which classifies the text to be classified.

The Model-based SVM classification module performs Model-based SVM classification on the text to be classified and obtains classification result 2. The Model is formed by the following steps. TF-IDF feature selection is performed on the training set, and the result is represented by a matrix $A_{m \times n}$, where m is the total number of keywords in the training set (an integer greater than or equal to 1) and n is the total number of documents in the training set (an integer greater than or equal to 1). Singular value decomposition is applied to $A_{m \times n}$: $SVD(A_{m \times n}) = U_{m \times m} \Sigma_{m \times n} V^T_{n \times n}$, where $U_{m \times m}$ is an m × m square matrix, $\Sigma_{m \times n}$ is an m × n diagonal matrix, and $V^T_{n \times n}$ is an n × n square matrix. The matrix $V^T_{k \times n}$ corresponding to the first k singular values of $A_{m \times n}$ is used to represent $A_{m \times n}$, where k is an integer greater than or equal to 1. The value of k is obtained by using half of the manually annotated training set for training and half for testing, and comparing the test metrics. Specifically, k is 1%-10% of the dimension of the diagonal matrix; for example, if the diagonal matrix has 100 dimensions, k takes a value between 1 and 10. Each candidate value is then tested on the training set, and the value that gives the best classification performance is taken as the empirical value of k. SVM training is performed on $V^T_{k \times n}$ to obtain the Model.
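For reference, the decomposition and its rank-k truncation can be written out as below. The first two lines restate the text. The third line is the standard latent semantic analysis "fold-in" projection of a new TF-IDF vector $d$ into the k-dimensional space; the text does not state how a text to be classified is projected, so that line is an assumption, and it requires the first k singular values to be nonzero.

```latex
\begin{aligned}
A_{m \times n} &= U_{m \times m}\,\Sigma_{m \times n}\,V^{T}_{n \times n},
  \qquad \Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots),
  \quad \sigma_1 \ge \sigma_2 \ge \dots \ge 0 \\
A_{m \times n} &\approx U_{m \times k}\,\Sigma_{k \times k}\,V^{T}_{k \times n},
  \qquad 1 \le k \le \min(m, n) \\
\hat{d}_{k \times 1} &= \Sigma_{k \times k}^{-1}\,U_{m \times k}^{T}\,d_{m \times 1}
\end{aligned}
```

Each column of $V^{T}_{k \times n}$ is then the k-dimensional representation of one training document, which is the input on which the SVM is trained.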

The positive/negative judgment module judges whether classification result 1 and classification result 2 are positive or negative.

The user-annotated negative dictionary matching module matches the text to be classified against the user-annotated negative dictionary and obtains negative index 1. The user-annotated negative dictionary is a dictionary formed by the user manually annotating multiple documents as positive or negative; it consists of negative vocabulary that users accumulate and annotate themselves according to their own areas of concern. For example, if the user is a national news office, the negative vocabulary it pays attention to includes removal, corruption, and petitioning higher authorities.

The training-set key dictionary matching module matches the text to be classified against the training-set key dictionary and obtains negative index 2. The training-set key dictionary is formed as follows: the negative corpus in the training set is segmented with ICTCLAS and represented in the vector space model (VSM); TF-IDF feature selection is then performed, and the p keywords with the largest TF-IDF weights are extracted from the result, where p is an integer greater than or equal to 1. The keywords extracted from the negative corpus in the training set are deduplicated and their weights normalized to form the training-set key dictionary. The training set is the collection of manually annotated positive and negative corpus.
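The two matching modules each map a text to a negative index in [0, 1]. This section does not specify the scoring rule, so the sketch below uses an assumed one purely for illustration: the weighted share of the text's tokens that appear in the dictionary, with every entry of an unweighted dictionary counting as weight 1. The function name `match_index` is hypothetical.

```python
def match_index(tokens, dictionary):
    """Weighted share of tokens found in the dictionary, clipped to [0, 1].

    tokens: list of words from the text to be classified (already segmented).
    dictionary: a set of words (unweighted, e.g. user-annotated) or a
        {word: weight} mapping with weights in [0, 1] (e.g. key dictionary).
    """
    if not tokens:
        return 0.0
    if isinstance(dictionary, dict):
        weights = dictionary
    else:
        weights = {word: 1.0 for word in dictionary}
    matched = sum(weights.get(t, 0.0) for t in tokens)
    return min(1.0, matched / len(tokens))
```

With this rule, a four-token text containing two words from an unweighted user dictionary scores 0.5. Other rules, such as counting distinct matches only, would fit the description equally well.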

The system for calculating the negative public opinion index further comprises a filtering module for the text to be classified, which filters out the text to be classified when it is a spam page. The system further comprises a manually annotated training set update module, which updates the training set with texts to be classified whose negative public opinion index exceeds a predetermined threshold, and with texts to be classified for which classification result 1 and classification result 2 are both non-negative, as new negative and non-negative corpus, respectively.
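The behaviour of the training set update module can be sketched as follows. The record type `ScoredText`, the function name `update_training_set`, and the default threshold of 80 are assumptions for illustration; the text only speaks of a predetermined threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredText:
    text: str
    res1_negative: bool  # classification result 1 is negative
    res2_negative: bool  # classification result 2 is negative
    neg_index: Optional[float] = None  # set only when both results are negative


def update_training_set(scored: List[ScoredText],
                        negatives: List[str],
                        non_negatives: List[str],
                        threshold: float = 80.0) -> None:
    """Append qualifying texts to the annotated corpus lists in place."""
    for item in scored:
        if item.neg_index is not None and item.neg_index > threshold:
            # Index exceeds the threshold: new negative corpus.
            negatives.append(item.text)
        elif not item.res1_negative and not item.res2_negative:
            # Both classifiers agree on non-negative: new non-negative corpus.
            non_negatives.append(item.text)
        # Texts on which the classifiers disagree, or with a low index, are skipped.
```

Only texts on which the evidence is strong in either direction are fed back, which limits the amount of label noise added to the training set.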

The present invention studies the calculation of the negative public opinion index and adopts the principle from ensemble machine learning that multiple weak classifiers can be integrated into one strong classifier. If an unsupervised sentiment classifier is used alone to calculate the public opinion index, the result is strongly affected by the general sentiment dictionary; if SVD-based SVM classification is used alone, the result is strongly affected by the manually annotated training set. To make the calculated index value more accurate, the present invention combines the two methods: a text is considered negative only when both methods label it as negative; otherwise it is considered non-negative. Because the result of SVM classification only distinguishes negative from non-negative, the determination of the index value considers not only the result value of the unsupervised sentiment classifier but also the matching results of the user-annotated negative dictionary and the training-set key dictionary.
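The agreement rule described above amounts to a logical AND of the two classifiers' votes. In the minimal sketch below, the function name `is_negative` is hypothetical, and the reading that a dictionary-based score in [0, 0.5) means "negative" is an assumption drawn from the stated range of class_res1.

```python
def is_negative(class_res1: float, svm_is_negative: bool) -> bool:
    """Return True only when both classifiers judge the text negative."""
    # Dictionary-based vote: negative when the score falls in [0, 0.5)
    # (assumed cutoff).
    dict_is_negative = 0.0 <= class_res1 < 0.5
    # SVM vote: already a binary negative / non-negative label.
    return dict_is_negative and svm_is_negative
```

Requiring agreement trades some recall for precision: a text flagged by only one classifier is treated as non-negative and no index is calculated for it.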

The present invention uses SVM classification based on SVD dimensionality reduction together with sentiment-dictionary-based classification to calculate the negative public opinion index in real time. The method is a newly proposed negative public opinion index calculation method for body content on the Internet such as news, blogs and forums. Technically, the present invention merges four components: an unsupervised sentiment classifier based on a general sentiment dictionary, support vector machine text classification with dimensionality reduction based on latent semantic analysis, the user-annotated negative dictionary, and the key dictionary of the negative training set. Among the many classification algorithms, the SVM classifier is selected because its classification performance is very good and it has advantages that other machine learning techniques can hardly match. SVD dimensionality reduction not only removes noise and shortens classification time but also takes the semantic relations between feature words into account. The sentiment-dictionary-based classification module of Tan Songbo is also drawn upon in determining the negative corpus, and the determination of the public opinion index relies on the user-annotated negative dictionary and the key dictionary of the training set. Each article whose public opinion index is to be calculated is analyzed in real time according to existing knowledge, and the training set and the user-annotated negative vocabulary can be updated regularly according to existing knowledge.

The present invention proposes a real-time negative public opinion index calculation method that combines SVM classification based on SVD dimensionality reduction with sentiment-dictionary-based classification. Following the idea of ensemble machine learning, the method combines two classification algorithms so that the accuracy of the identified negative corpus is higher than with either algorithm alone. In specific experiments, SVD dimensionality reduction can reduce the dimension of the feature vectors from tens of thousands to several hundred. The number of singular values retained is determined from empirical values obtained in many experiments, with reference to the theoretical guidance for SVD dimensionality reduction: in most cases, the sum of the top 10%, or even the top 1%, of the singular values accounts for more than 99% of the sum of all singular values. Using SVD dimensionality reduction greatly reduces the memory footprint of the SVM classification algorithm when storing feature vectors and also takes the semantic relations between features into account. Sentiment-dictionary-based classification serves as a check on the SVM classification algorithm: a text is considered negative corpus only when both judge it negative. In addition, the user-annotated negative dictionary can be used directly, or used after being expanded with co-occurring words or with additional similar and related vocabulary obtained from HowNet. Keywords from the negative training set are used mainly because some non-sentiment words also carry negative connotations, such as the place, persons and name of an event occurring in a certain time period.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for calculating a negative public opinion index, characterized by comprising the following steps:

2. The method for calculating a negative public opinion index according to claim 1, characterized in that, before step S1 is executed, the method further comprises filtering out the text to be classified when the text to be classified is a spam page.

3. The method for calculating a negative public opinion index according to claim 1, characterized in that the user-annotated negative dictionary in step S3 is a dictionary formed by the user manually annotating multiple documents as positive or negative;

4. The method for calculating a negative public opinion index according to claim 3, characterized in that, in step S1, the sentiment-dictionary-based positive/negative classification of the text to be classified is classification by an unsupervised classifier based on a general sentiment dictionary;

The formation of the Model comprises the following steps:

SVM training is performed on the matrix $V^T_{k \times n}$ to obtain the Model.

5. The method for calculating a negative public opinion index according to claim 3, characterized in that, after step S4, the method further comprises updating the training set with texts to be classified whose negative public opinion index exceeds a predetermined threshold, and with texts to be classified for which the classification result 1 and the classification result 2 are both non-negative, as new negative and non-negative corpus, respectively.

6. A system for calculating a negative public opinion index, characterized by comprising a sentiment-dictionary-based positive/negative classification module, a Model-based SVM classification module, a positive/negative judgment module, a user-annotated negative dictionary matching module, a training-set key dictionary matching module, and a linear combination module;

7. The system for calculating a negative public opinion index according to claim 6, characterized by further comprising a filtering module for the text to be classified, which filters out the text to be classified when the text to be classified is a spam page.

8. The system for calculating a negative public opinion index according to claim 6, characterized in that the user-annotated negative dictionary is a dictionary formed by the user manually annotating multiple documents as positive or negative;

9. The system for calculating a negative public opinion index according to claim 8, characterized in that the sentiment-dictionary-based positive/negative classification module comprises an unsupervised classifier based on a general sentiment dictionary, which classifies the text to be classified;

SVM training is performed on the matrix $V^T_{k \times n}$ to obtain the Model.

10. The system for calculating a negative public opinion index according to claim 8, characterized by further comprising a manually annotated training set update module, which updates the training set with texts to be classified whose negative public opinion index exceeds a predetermined threshold, and with texts to be classified for which classification result 1 and classification result 2 are both non-negative, as new negative and non-negative corpus, respectively.